# This Notebook provides the code and instructions to calculate the similarity between 2 texts

 Importing libraries

In [1]:
import pandas as pd
import re
import math
import numpy

The Documents that need to be processed should be added here

In [2]:
documents = ["The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you.",
             "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you.",
            "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way."]

Importing stopwords data from a custom text file

In [3]:
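# Read one stop word per line; the with-block closes the file when done.
# strip() removes the trailing newline without truncating a last line that has none.
with open("stopwords.txt", "r") as f:
    stopwords = [line.strip() for line in f]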

The process is as follows:

1) We first convert all the words in each document to lower case and split them on whitespace.

2) We strip punctuation with a regex so that only word characters remain. (The `Stemmer` function below does this; strictly speaking it is normalization and tokenization rather than stemming.)

3) We can then lemmatize the words, i.e., reduce each to its core word (e.g. doing --> do, helpful --> help). Currently this is not implemented, as it needs a model to be trained.

4) Then we filter the word lists from the previous steps using the custom stopwords list (a, an, the, you, etc.) to keep only the meaningful words.

5) Then we create a corpus of all the unique words from all the documents. (This will be our master corpus.)

6) For each word in the master corpus, we count how many times it occurs in each document.

7) Dividing each count by the number of words in the document gives us the term frequency (TF), stored in a dictionary.

8) We then calculate the inverse document frequency (IDF) for each word in the master corpus.

9) This is done by counting how many documents a particular word appears in, dividing the number of documents by that count, and taking the logarithm.

10) Multiplying the two values gives us the TF-IDF value of each word in each document.

11) We can then use these TF-IDF vectors to calculate the cosine similarity between 2 documents.

12) Cosine similarity is calculated as (dot product of the 2 vectors) / (product of the moduli of the vectors).

13) This gives us a score (a numerical value between 0 and 1) for how similar 2 texts are.

14) The more similar the queried texts are, the nearer the score is to 1, and vice versa.
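In symbols, steps 6 to 12 amount to the following. The notation ($f_{t,d}$ for the count of word $t$ in document $d$, $N$ for the number of documents, $\mathrm{df}(t)$ for the number of documents containing $t$) is introduced here for illustration and does not appear in the notebook. The IDF is written in the common textbook form that step 9 describes; the `findIdf` cell further down evaluates a slightly different expression.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t)
$$

$$
\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
$$

Here $\mathbf{a}$ and $\mathbf{b}$ are the TF-IDF vectors of the two documents, with one component per word in the master corpus.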

In [4]:
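# Normalize and tokenize: strip punctuation, lowercase, split on whitespace.
# (Named Stemmer, but no stemming is performed.)
def Stemmer(doc):
    stemmed_res = re.sub(r'[^\w\s]', '', doc).lower().split()
    return stemmed_res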
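A quick check of what `Stemmer` returns can make the preprocessing concrete. The sketch below repeats the function so that it runs on its own; the input sentence is made up for illustration.

```python
import re

def Stemmer(doc):
    # Same body as the notebook cell: drop punctuation, lowercase, split on whitespace.
    return re.sub(r'[^\w\s]', '', doc).lower().split()

tokens = Stemmer("You'll get points, based on the cost!")
print(tokens)
# ['youll', 'get', 'points', 'based', 'on', 'the', 'cost']
```

Note that the apostrophe is removed rather than split on, so a contraction such as "You'll" becomes the single token "youll".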

Function for Filtering out the stop words

In [92]:
def filterSentences(docs):
    filter_sentences=[]
    for document in docs:
        filtered_doc=[w for w in Stemmer(document) if w not in stopwords]
        filter_sentences.append(filtered_doc)
    return filter_sentences
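As a small illustration of the filtering step, the sketch below uses a hypothetical hand-written stop-word set instead of `stopwords.txt`, and hypothetical helper names (`tokenize`, `filter_docs`) so that it does not clash with the notebook's functions. Using a set rather than a list is a design choice made here, since membership tests on a set do not scan the whole collection.

```python
import re

# Hypothetical stop words for illustration; the notebook loads its own list from stopwords.txt.
stop_set = {"the", "you", "to", "for", "is"}

def tokenize(doc):
    # Same normalization as the notebook's Stemmer function.
    return re.sub(r'[^\w\s]', '', doc).lower().split()

def filter_docs(docs):
    # Keep only the tokens that are not stop words, one list per document.
    return [[w for w in tokenize(d) if w not in stop_set] for d in docs]

filtered = filter_docs(["Just scan the receipt.", "The points are for you!"])
print(filtered)
# [['just', 'scan', 'receipt'], ['points', 'are']]
```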


Functions for counting how many times each word of the master corpus occurs in a document

In [6]:
def numofWords(inp,distinct_words):
    Words = dict.fromkeys(distinct_words, 0)
    for word in inp:
        Words[word] += 1
    return Words

In [7]:
def numofWordsList(dist_word_list,distinct_words):
    word_num_list=[]
    for word_list in dist_word_list:
        word_num_list.append(numofWords(word_list,distinct_words))
    return word_num_list
        

Function for Computing the Term Frequency

In [8]:
def findTf(wrd_cnt,total_words ):
    tfdict = {}
    N = len(total_words)
    for word, count in wrd_cnt.items():
        tfdict[word] = count / float(N)
    return tfdict
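To see the counting and term-frequency steps on concrete numbers, here is a small worked example. The two functions are repeated from the cells above so the block runs on its own; the token list and vocabulary are made up.

```python
def numofWords(inp, distinct_words):
    # Start every vocabulary word at 0, then count the occurrences in this document.
    Words = dict.fromkeys(distinct_words, 0)
    for word in inp:
        Words[word] += 1
    return Words

def findTf(wrd_cnt, total_words):
    # Term frequency = count of the word / number of tokens in the document.
    N = len(total_words)
    return {word: count / float(N) for word, count in wrd_cnt.items()}

tokens = ["points", "receipt", "points", "scan"]  # made-up filtered document
vocab = {"points", "receipt", "scan", "offer"}    # made-up master corpus

counts = numofWords(tokens, vocab)
tf = findTf(counts, tokens)
print(counts["points"], tf["points"])  # 2 0.5
print(counts["offer"], tf["offer"])    # 0 0.0
```

Words that belong to the master corpus but not to the document, like "offer" here, get a term frequency of 0, which is what lets every document be represented as a vector of the same length.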

Function for Computing the Inverse document frequency

In [9]:
def findIdf(sentences_list,dist_words):
    dict_total={}
    n=len(sentences_list)
    for i in dist_words:
        for each in sentences_list:
            if i in each:
                if i in dict_total:
                    dict_total[i]=dict_total[i]+1
                else:
                    dict_total[i]=1 
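    for i in dict_total:
        # Note: n+1/df parses as n + (1/df), not (n+1)/df or n/df.
        # The outputs recorded in this notebook were produced with this form.
        dict_total[i]=math.log(n+1/dict_total[i])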
    return dict_total
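Because division binds more tightly than addition in Python, the expression `math.log(n+1/dict_total[i])` in `findIdf` evaluates `log(n + 1/df)`. The sketch below compares that with the textbook `log(n/df)` and with one possible smoothed variant; the values of `n` and `df` are made up, and the smoothed form is an assumption shown only for comparison, not something the notebook uses.

```python
import math

n, df = 3, 3  # made-up values: 3 documents, and the word appears in all 3

as_written = math.log(n + 1 / df)   # what the cell computes: log(n + (1/df))
textbook = math.log(n / df)         # log(N / df), the form described in step 9
smoothed = math.log((n + 1) / df)   # a possible smoothed variant (assumption)

print(as_written, textbook, smoothed)
```

With the textbook form, a word that appears in every document gets weight 0. In a two-document comparison that would zero out every shared word, so the dot product, and with it the similarity score, would be 0. The form as written keeps all weights positive, which is why the pairwise scores in this notebook are non-zero.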

In [10]:
def findTFIDF(words, idf_val,distinct_words):
    tfidf = {}
    for word in distinct_words:
        tfidf[word] = words[word] * idf_val[word]
    return tfidf

Support functions to find the dot product and modulus of vectors

In [11]:
def dotProduct(l1,l2):
    sum_final=0
    for i in range(len(l1)):
        sum_final=sum_final+(l1[i]*l2[i])
    return sum_final

In [12]:
def modulus(vec):
    sum_of_sq=0
    for i in vec:
        sum_of_sq=sum_of_sq+(i*i)
    return numpy.sqrt(sum_of_sq)
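The two helpers combine into the cosine similarity of step 12. A minimal standalone sketch on toy vectors follows; the function name `cosine` is hypothetical, `math.sqrt` is used in place of `numpy.sqrt` to keep the block dependency-free, and the zero-vector guard is an added assumption that the notebook does not include.

```python
import math

def cosine(v1, v2):
    # Dot product of the two vectors.
    dot = sum(a * b for a, b in zip(v1, v2))
    # Modulus (Euclidean length) of each vector.
    mod1 = math.sqrt(sum(a * a for a in v1))
    mod2 = math.sqrt(sum(b * b for b in v2))
    # Assumption: define the similarity as 0.0 when either vector is all zeros.
    if mod1 == 0 or mod2 == 0:
        return 0.0
    return dot / (mod1 * mod2)

same = cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])        # identical direction
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # no shared component
partial = cosine([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])     # one shared component
print(same, orthogonal, partial)
```

The three results are approximately 1, exactly 0.0, and approximately 0.5, matching the intuition that more overlap between the vectors gives a score closer to 1.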

This function finds the similarity between 2 documents. Document ids are list indices, so they start from 0.

In [13]:
def findSimilarity(doc1_id,doc2_id,documents):
    #Filtering out the stop words
    filtered_document_list=filterSentences(documents)
    #Finding the distinct set of words
    distinct_words= set().union(*filtered_document_list)
    #Finding the number of words
    word_count=numofWordsList(filtered_document_list,distinct_words)
    #Computing the Term Frequency
    tfs=[]
    for i in range(len(word_count)):
        tfs.append(findTf(word_count[i],filtered_document_list[i]))
    #Computing the Inverse document frequency
    idf_val=findIdf(filtered_document_list,distinct_words)
    doc1=list(findTFIDF(tfs[doc1_id],idf_val,distinct_words).values())
    doc2=list(findTFIDF(tfs[doc2_id],idf_val,distinct_words).values())
    #finding dot product
    dot_product=dotProduct(doc1,doc2)
    #finding product of the moduli
    deno=modulus(doc1)*modulus(doc2)
    #finding cosine similarity
    cosine_similarity=dot_product/deno
    return cosine_similarity

The following function is used to find the similarities between all the documents (gives 2d array) 

In [14]:
def findDocSimilarity(documents):
    final_arr=[]
    n=len(documents)
    for i in range(n):
        s1=[]
        for j in range(n):
            val=round(findSimilarity(i,j,documents),3)
            s1.append(val)
        final_arr.append(s1)
    return final_arr

In this 2d array we can check the similarity value of 2 sentences by referring to the matrix rows and columns accordingly.

In [15]:
final_output=findDocSimilarity(documents)
final_output

[[1.0, 0.753, 0.274], [0.753, 1.0, 0.238], [0.274, 0.238, 1.0]]

To get the similarity score of the first and second sentences, we check row 1, column 2 (indices [0][1])

In [16]:
final_output[0][1]

0.753

To get the similarity score of the second and third sentences, we check row 2, column 3 (indices [1][2])

In [17]:
final_output[1][2]

0.238
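Since the matrix is indexed from 0 while the text counts sentences from 1, a tiny lookup helper can avoid off-by-one mistakes. The sketch below copies the matrix from the output above; the helper name `score` and the `pairs`/`best` variables are hypothetical.

```python
# Matrix copied from the output of findDocSimilarity(documents) above.
final_output = [[1.0, 0.753, 0.274],
                [0.753, 1.0, 0.238],
                [0.274, 0.238, 1.0]]

def score(matrix, first, second):
    # Look up a score using 1-based sentence numbers.
    return matrix[first - 1][second - 1]

# List each unordered pair once (the matrix is symmetric) and pick the most similar.
pairs = [(i + 1, j + 1, final_output[i][j])
         for i in range(len(final_output))
         for j in range(i + 1, len(final_output))]
best = max(pairs, key=lambda p: p[2])

print(score(final_output, 1, 2))  # 0.753
print(score(final_output, 2, 3))  # 0.238
print(best)                       # (1, 2, 0.753)
```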

This function can be used to find the similarity between 2 sentences directly, without building the full matrix

In [18]:
def findSentenceSimilarity(doc1,doc2):
    documents=[doc1,doc2]
    filtered_document_list=filterSentences(documents)
    distinct_words= set().union(*filtered_document_list)
    word_count=numofWordsList(filtered_document_list,distinct_words)
    tfs=[]
    for i in range(len(word_count)):
        tfs.append(findTf(word_count[i],filtered_document_list[i]))
    idf_val=findIdf(filtered_document_list,distinct_words)
    doc1=list(findTFIDF(tfs[0],idf_val,distinct_words).values())
    doc2=list(findTFIDF(tfs[1],idf_val,distinct_words).values())
    dot_product=dotProduct(doc1,doc2)
    deno=modulus(doc1)*modulus(doc2)
    cosine_similarity=dot_product/deno
    return cosine_similarity

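The following is the score for the first and second sentences. It differs from the 0.753 in the matrix above because the IDF values here are computed from these two documents only, not from all three.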

In [19]:
findSentenceSimilarity("The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you.",
             "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you.")

0.7259076475745524

The following is the score for the first and third sentences

In [20]:
findSentenceSimilarity("The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you.",
            "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way.")

0.24058207580417887

# End of approach one

# Approach 2 (extra) using library models (Ignore this if necessary)

The following uses the word2vec embedding model from gensim. We could also write a custom word2vec implementation, but here we use the gensim one.

Training the model using the input data

In [109]:
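import gensim
filtered_document_list=filterSentences(documents)
# size= and iter= are the gensim 3.x parameter names; gensim 4.x renamed them to vector_size= and epochs=.
model = gensim.models.Word2Vec(filtered_document_list, size=150, window=10, min_count=1, workers=10, iter=10)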

Finding the similarity using the trained model

documents[0] is sentence 1

documents[1] is sentence 2

documents[2] is sentence 3

In [110]:
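# Note: this redefines findSimilarity from Approach 1.
def findSimilarity(doc1,doc2):
    # Tokenize each document and drop stop words, as in Approach 1
    s1=[w for w in Stemmer(doc1) if w not in stopwords]
    s2=[w for w in Stemmer(doc2) if w not in stopwords]
    # Cosine similarity between the mean word vectors of the two token lists
    similarity = model.wv.n_similarity(s1, s2)
    return similarity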
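`n_similarity` compares two word lists by averaging the word vectors in each list and taking the cosine similarity of the two averages. The sketch below illustrates that idea with made-up 2-dimensional vectors in place of a trained model; `toy_vectors` and all helper names are hypothetical, and this is not gensim's own code.

```python
import math

# Made-up embeddings for illustration; a trained model would supply these.
toy_vectors = {"points": [1.0, 0.0], "receipt": [0.0, 1.0], "scan": [1.0, 1.0]}

def mean_vector(words):
    # Component-wise average of the vectors of all words in the list.
    vecs = [toy_vectors[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_n_similarity(ws1, ws2):
    # Cosine similarity between the mean vectors of the two word lists.
    return cosine(mean_vector(ws1), mean_vector(ws2))

aligned = toy_n_similarity(["points", "receipt"], ["scan"])
unrelated = toy_n_similarity(["points"], ["receipt"])
print(aligned, unrelated)
```

Here the average of "points" and "receipt" points in the same direction as "scan", so the first score is approximately 1, while the second pair shares no direction and scores 0.0.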

Similarity between document 1 and document 2

In [111]:
findSimilarity(documents[0],documents[1])

0.84225047

Similarity between document 1 and document 3

In [112]:
findSimilarity(documents[0],documents[2])

0.3812753