<a href="https://colab.research.google.com/github/sridevibonthu/NLPCourseR19/blob/main/bowtfidf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finding Similarity between two sentences
In this notebook, the cosine similarity between two sentences is computed using two methods, with the following steps:

### Method 1
- Convert each sentence into a vector of numbers using bag of words, which counts how often each vocabulary word occurs in the sentence.
- With the sentences in vector form, compute the cosine similarity between the two vectors directly.
- Then compute the dot product.
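
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. As a worked example, take the two bag-of-words count vectors for the example corpus used later in this notebook: they share two words, the first sentence has 7 distinct words and the second has 5.

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{2}{\sqrt{7}\,\sqrt{5}} \approx 0.338
$$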

**Note**: Bag of words does not account for semantic similarity or word order. For example, the two sentences "Marry is quicker than John" and "John is quicker than Marry" are treated as the same sentence under bag of words, which is a limitation compared with richer representations such as word embeddings.
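
The limitation in the note can be checked with a minimal sketch. It uses `CountVectorizer` with default settings; the variable names `sentences` and `vectors` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Marry is quicker than John", "John is quicker than Marry"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray()

# Vocabulary learned from both sentences, in column order
print(sorted(vectorizer.vocabulary_))
print(vectors)

# Only word counts are kept, so both rows are the same vector
print((vectors[0] == vectors[1]).all())
```

Since the two rows are equal, any similarity measure computed on them reports the sentences as identical.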

### Method 2
- Convert each sentence into a vector of numbers using bag of words, which counts how often each vocabulary word occurs in the sentence.
- Apply TF-IDF to the bag-of-words result to assign a weight to each term.
- Then compute the cosine similarity and the dot product.
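
To make the weighting step concrete, the sketch below reproduces TF-IDF by hand from the bag-of-words counts and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed IDF and L2 row normalization); the names `counts`, `idf`, and `manual_tfidf` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["Marry and John are very good friends", "Jack and Martin are enemies"]

# Step 1: bag-of-words counts (term frequencies)
counts = CountVectorizer().fit_transform(corpus).toarray()

# Step 2: smoothed inverse document frequency, as in scikit-learn's defaults
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of sentences containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then scale each row to unit length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with TfidfVectorizer applied directly to the raw sentences
sklearn_tfidf = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Words that occur in both sentences ("and", "are") get the lowest IDF, so they contribute less to the similarity than under plain counts.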

In [1]:
import sklearn
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
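class TextSimilarity():
    """Helpers for vectorizing sentences and comparing the resulting vectors."""

    def applyBagofWords(self, data):
        # Count how often each vocabulary word occurs in each sentence
        vectorizer = CountVectorizer()
        return vectorizer.fit_transform(data).toarray()

    def applyTfIDF(self, data):
        # Weight each term by TF-IDF; rows are L2-normalized by default
        vectorizer = TfidfVectorizer()
        return vectorizer.fit_transform(data).toarray()

    def computeCosineSimilarity(self, vectors):
        # Pairwise cosine similarity between all rows of `vectors`
        return cosine_similarity(vectors)

    def computeDotProduct(self, v1, v2):
        # Dot product of two 1-D vectors
        return np.dot(v1, v2)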


In [7]:
textSim = TextSimilarity()
data_corpus = ["Marry and John are very good friends", "Jack and Martin are enemies"]
print ("=== Bag of words vector=== \n")
BoW_vectors = textSim.applyBagofWords(data_corpus)
print(BoW_vectors)
print("\n === Cosine Similarity ===")
cosine_sim = textSim.computeCosineSimilarity(BoW_vectors)
print(cosine_sim)
print("\n === Dot Product === ")
# Dot product of the two bag-of-words vectors (not of the similarity matrix rows)
print(textSim.computeDotProduct(BoW_vectors[0], BoW_vectors[1]))

=== Bag of words vector=== 

[[1 1 0 1 1 0 1 1 0 1]
 [1 1 1 0 0 1 0 0 1 0]]

 === Cosine Similarity ===
[[1.        0.3380617]
 [0.3380617 1.       ]]

 === Dot Product === 
2


In [8]:

tf_idf_vectors = textSim.applyTfIDF(data_corpus)
print (" \n ==== TF IDF ====== \n")
print (tf_idf_vectors)
print ( " \n ==== cosine similarity ==== \n ")
cosine_sim = textSim.computeCosineSimilarity(tf_idf_vectors)
print (cosine_sim)
print (" \n ==== dot product === \n ")
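# Dot product of the two TF-IDF vectors (not of the similarity matrix rows).
# TfidfVectorizer L2-normalizes rows by default, so this matches the cosine similarity.
print (round(textSim.computeDotProduct(tf_idf_vectors[0], tf_idf_vectors[1]), 8))

 
 ==== TF IDF ====== 

[[0.29017021 0.29017021 0.         0.4078241  0.4078241  0.
  0.4078241  0.4078241  0.         0.4078241 ]
 [0.35520009 0.35520009 0.49922133 0.         0.         0.49922133
  0.         0.         0.49922133 0.        ]]
 
 ==== cosine similarity ==== 
 
[[1.         0.20613697]
 [0.20613697 1.        ]]
 
 ==== dot product === 
 
0.20613697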
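
The link between the dot product and cosine similarity can be sketched with NumPy alone. The example below uses the two bag-of-words count vectors printed earlier; the names `a`, `b`, `cosine`, and `unit_dot` are illustrative. It shows that once both vectors are scaled to unit length, their dot product equals their cosine similarity, which is why the two measures coincide for L2-normalized TF-IDF vectors.

```python
import numpy as np

# Bag-of-words count vectors for the two example sentences
a = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=float)
b = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

# Cosine similarity from its definition
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after scaling each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot = np.dot(a_unit, b_unit)

print(cosine, unit_dot)
```

For the raw count vectors, by contrast, the dot product is simply the number of shared words weighted by their counts, so it grows with sentence length while cosine similarity does not.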
