# What are we doing with vectors?

Vectors have multiple uses in Machine Learning. Apart from giving ML models an equivalent numeric representation of text, word vectors also let us measure how similar given documents are. These vectors can be compared with the same techniques used in vector algebra, such as Euclidean distance and cosine similarity.


**Why Cosine Similarity?**

A commonly used approach to matching similar documents is to count the number of words the documents have in common. But as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
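A minimal sketch of that problem, using made-up sentences and a hypothetical helper `common_word_count` (neither is part of the original notebook). The two texts are about different topics, yet padding both with everyday filler words raises the overlap count:

```python
def common_word_count(text_a, text_b):
    """Number of distinct lowercase words that appear in both texts."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))

# Two short texts about unrelated topics (illustrative only)
short_a = "the cat sat on the mat"
short_b = "the stock market fell on monday"

# Longer versions of the same texts, padded with common filler words
long_a = short_a + " and it was a warm day in the city for everyone"
long_b = short_b + " and it was a bad day in the city for investors"

short_overlap = common_word_count(short_a, short_b)
long_overlap = common_word_count(long_a, long_b)

print(short_overlap, long_overlap)
```

The overlap grows with length even though the topics are still unrelated, which is why a size-independent measure is useful.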

Cosine similarity is a metric used to determine how similar the documents are irrespective of their size.

Given two vectors in a multi-dimensional space, cosine similarity is the cosine of the angle between them. The Euclidean distance between the vectors might be large, but if the angle between them is small, the cosine is close to 1, which indicates high similarity.

Mathematical formula: $\cos\theta = \dfrac{u \cdot v}{\|u\|\,\|v\|}$, where $u$ and $v$ are two vectors and $\theta$ is the angle between them.
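As a rough illustration of the formula, the sketch below computes it directly with NumPy. The helper name `cosine_similarity_manual` and the count vectors are assumptions made up for this example. `long_doc` is ten times `short_doc`, so it mimics a document that repeats the same words ten times:

```python
import numpy as np

def cosine_similarity_manual(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word-count vectors: same direction, very different lengths
short_doc = np.array([1, 2, 0, 1])
long_doc = 10 * short_doc

euclidean = float(np.linalg.norm(long_doc - short_doc))
cosine = cosine_similarity_manual(short_doc, long_doc)

print('Euclidean distance: {:.3f}'.format(euclidean))
print('Cosine similarity:  {:.3f}'.format(cosine))
```

The Euclidean distance is far from zero because the vectors differ in length, while the cosine similarity is 1 (up to floating-point rounding) because they point in the same direction.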

Read more - https://www.machinelearningplus.com/nlp/cosine-similarity/

In [1]:
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"

doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"

doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"

documents = [doc_trump, doc_election, doc_putin]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents) # converts to a matrix of word vectors

doc_term_matrix = sparse_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['doc_trump', 'doc_election', 'doc_putin'])
df

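| | career | claimed | earlier | election | elections | friend | friends | interference | lost | minister | ... | putin | republican | russia | says | served | support | trump | vladimir | winning | witchhunt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doc_trump | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
| doc_election | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 2 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| doc_putin | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |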


In [55]:
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(df, df))

[[1.         0.51639778 0.36893239]
 [0.51639778 1.         0.45360921]
 [0.36893239 0.45360921 1.        ]]


# Document Similarity

Using cosine similarity, we can measure how similar two documents are.

In [97]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')

def cosine_sim(text1, text2):
    # Fit TF-IDF on just these two texts; the result has one row per text
    tfidf = vectorizer.fit_transform([text1, text2])
    print('Vocab: {}'.format(vectorizer.vocabulary_))
    vec1, vec2 = tfidf.toarray()
    dot_prod = np.dot(vec1, vec2)
    normed = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    # Cosine similarity: dot product divided by the product of the norms
    return dot_prod / normed

In [98]:
cosine_sim('a little bird', 'a little bird')

Vocab: {'little': 1, 'bird': 0}


1.0

In [99]:
cosine_sim('a little bird', 'a little bird chirps')

Vocab: {'little': 2, 'bird': 0, 'chirps': 1}


0.7092972666062739

In [100]:
cosine_sim('a little bird', 'a big dog barks')

Vocab: {'little': 4, 'bird': 2, 'big': 1, 'dog': 3, 'barks': 0}


0.0

# Applications of Document Similarity

1. Information Retrieval Systems - Similarity between query and document
2. Plagiarism Detection
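
The first application can be sketched roughly as follows: vectorize a small corpus, project the query into the same vocabulary, and rank the documents by cosine similarity to the query. The corpus and query below are made up for illustration, and this is only a toy ranking, not a full retrieval system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query (illustrative only)
corpus = [
    "a little bird chirps in the tree",
    "a big dog barks at the gate",
    "the little bird builds a nest",
]
query = "little bird"

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(corpus)  # learn the vocabulary from the documents
query_vector = vectorizer.transform([query])    # reuse that vocabulary for the query

# One similarity score per document, then indices sorted from best to worst match
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]

for idx in ranking:
    print('{:.3f}  {}'.format(scores[idx], corpus[idx]))
```

The document about the dog shares no terms with the query, so it scores 0 and lands at the bottom of the ranking. Note that the query is passed through `transform`, not `fit_transform`, so that it is expressed in the vocabulary learned from the corpus.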