# Text similarity

Text similarity measures how 'close' two pieces of text are, both in surface form (**lexical similarity**) and in meaning (**semantic similarity**). For instance, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat food” if we look only at the words?

![./images/text_similarity.png](./images/text_similarity.png)
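
One simple way to put a number on the lexical similarity of the two phrases above is the Jaccard index over their word sets. The sketch below is only an illustration; the function name `jaccard_similarity` and the whitespace tokenization are assumptions, not something this notebook uses later.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase words that the two texts have in common."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


score = jaccard_similarity("the cat ate the mouse", "the mouse ate the cat food")
print(score)  # 4 shared words out of 5 distinct words -> 0.8
```

The score is high even though the two phrases mean different things, which is why a purely lexical measure is not enough and a semantic representation is worth trying.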

### Quora Question Pairs Dataset
The dataset contains over 400,000 lines of potential duplicate question pairs. Each line contains the IDs of both questions in the pair, the full text of each question, and a binary value that indicates whether the pair is truly a duplicate.

The dataset can be downloaded from [Quora Question Pairs Dataset](https://www.kaggle.com/quora/question-pairs-dataset).

In [0]:
import json
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
%matplotlib inline
import spacy
sp = spacy.load('en_core_web_sm')

In [0]:
project_path = 'TextSimilarity/'

In [4]:
data = pd.read_csv(project_path+"questions.csv",nrows=1000)
data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
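
To make the row layout concrete, here is a small standard-library sketch that parses rows with the same columns as the preview above and turns `is_duplicate` into a boolean. The two sample rows and the helper name `read_pairs` are made up for illustration; they are not taken from the dataset.

```python
import csv
import io

# Hypothetical rows that follow the column layout of the preview above.
SAMPLE = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,How do I learn Python?,What is the best way to learn Python?,1
1,3,4,Which fish would survive in salt water?,How can I speed up my internet?,0
"""


def read_pairs(handle):
    """Return one dict per question pair with typed ids and a boolean label."""
    pairs = []
    for row in csv.DictReader(handle):
        pairs.append({
            "qid1": int(row["qid1"]),
            "qid2": int(row["qid2"]),
            "question1": row["question1"],
            "question2": row["question2"],
            "is_duplicate": row["is_duplicate"] == "1",
        })
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
duplicates = [pair for pair in pairs if pair["is_duplicate"]]
print(len(pairs), len(duplicates))
```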


In [0]:
# prepare table for removing punctuation
table = str.maketrans('', '', string.punctuation)
def clean_question(text):
    doc = sp(text)
    # lemmatize each spaCy token (spaCy v2 maps pronouns to '-PRON-', which ends up as 'pron')
    text = [token.lemma_ for token in doc]
    # convert to lower case
    text = [word.lower() for word in text]
    # remove punctuation from each token
    text = [w.translate(table) for w in text]
    # drop empty strings and single-character tokens
    text = [word for word in text if len(word)>1]
    # remove tokens with numbers in them
    text = [word for word in text if word.isalpha()]
    # store as string
    return ' '.join(text)
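
The filtering steps after lemmatization can be tried without spaCy. The sketch below applies the same lower-casing, punctuation stripping, and length and alphabetic filters to a hand-written lemma list. The list `assumed_lemmas` is an assumption about how spaCy might split and lemmatize the Kohinoor question, and `filter_tokens` is a hypothetical helper name.

```python
import string

# Same translation table as in the cell above: deletes every punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def filter_tokens(lemmas):
    """Lower-case, strip punctuation, and keep alphabetic tokens longer than one character."""
    lowered = [lemma.lower() for lemma in lemmas]
    stripped = [word.translate(PUNCT_TABLE) for word in lowered]
    kept = [word for word in stripped if len(word) > 1 and word.isalpha()]
    return ' '.join(kept)


# Assumed lemmas for "What is the story of Kohinoor (Koh-i-Noor) Diamond?"
assumed_lemmas = ['What', 'be', 'the', 'story', 'of', 'Kohinoor',
                  '(', 'Koh', '-', 'i', '-', 'Noor', ')', 'Diamond', '?']
cleaned = filter_tokens(assumed_lemmas)
print(cleaned)  # what be the story of kohinoor koh noor diamond
```

Punctuation tokens become empty strings and are dropped together with the single-letter token `i`.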

In [0]:
data["question1"] = data["question1"].apply(clean_question)
data["question2"] = data["question2"].apply(clean_question)

In [0]:
# create a list that contains the ids of all questions (all of qid1 first, then all of qid2)
sentence_ids = np.concatenate((data["qid1"].values, data["qid2"].values), axis=0)
sentence_ids = ['q'+str(qid) for qid in sentence_ids]
# create a list that stores the cleaned text of all questions, in the same order
sentences = np.concatenate((data["question1"].values, data["question2"].values), axis=0)
# create a temporary dataframe that maps each id to its question text
temp_df = pd.DataFrame({"qid":sentence_ids,"questions":sentences})
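
Because the ids are concatenated column by column, the position of a question in `sentence_ids` is not its numeric id. A tiny sketch with plain lists, using the `qid1` and `qid2` values from the five preview rows, shows the resulting order. The `position` dictionary is a hypothetical helper for looking up an index by tag.

```python
# qid1 and qid2 values of the five rows shown in the preview
qid1 = [1, 3, 5, 7, 9]
qid2 = [2, 4, 6, 8, 10]

# all first questions come before all second questions
sentence_ids = ['q' + str(qid) for qid in qid1 + qid2]

# hypothetical lookup from tag to list position
position = {tag: index for index, tag in enumerate(sentence_ids)}

print(sentence_ids)
print(sentence_ids[1])   # q3
print(position['q2'])    # 5
```

So position 1 holds `q3`, while `q2` only appears after all the `qid1` entries.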

In [0]:
# split every cleaned question into a list of words (note: this replaces the dataframe held in `data`)
data = [sentence.split() for sentence in sentences]

In [0]:
from gensim.models.doc2vec import TaggedDocument
import gensim

In [0]:
# class LabeledLineSentence(object):
#     def __init__(self, doc_list, labels_list):
#         self.labels_list = labels_list
#         self.doc_list = doc_list
#     def __iter__(self):
#         for idx, doc in enumerate(self.doc_list):
#               yield gensim.models.doc2vec.LabeledSentence(doc,[self.labels_list[idx]])

In [0]:
# tag each tokenized question with its id so that Doc2Vec learns one vector per question
documents = [TaggedDocument(doc, [tag]) for tag, doc in zip(sentence_ids, data)]

In [12]:
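# vector_size and epochs are the current names of the deprecated size and iter
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.025, workers=4)
model.build_vocab(documents)
# training of the model
# NOTE: alpha starts at 0.025 and drops by 0.002 after every pass,
# so it is already negative from the 14th of the 1000 passes onward
for epoch in range(1000):
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
# save the trained model
model.save('doc2vec_1000.model')
print('model saved')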

model saved
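
The manual learning-rate schedule in the training loop is worth checking with plain arithmetic. The sketch below only replays the `alpha` updates (start 0.025, minus 0.002 per pass, 1000 passes) and does not touch gensim. The function name `alpha_schedule` is hypothetical.

```python
def alpha_schedule(start=0.025, step=0.002, passes=1000):
    """Learning rate in effect during each pass of the training loop."""
    alphas = []
    alpha = start
    for _ in range(passes):
        alphas.append(alpha)   # value used by this pass
        alpha -= step          # decrement applied after the pass
    return alphas


alphas = alpha_schedule()
first_negative = next(i for i, a in enumerate(alphas) if a < 0)
positive_passes = sum(1 for a in alphas if a > 0)
print(first_negative)    # 13 (zero-based), i.e. the 14th pass
print(positive_passes)   # 13
print(alphas[-1])
```

Only the first 13 passes run with a positive learning rate. A commonly recommended alternative is a single `train()` call with the desired number of epochs, leaving the decay from `alpha` to `min_alpha` to the library.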



In [13]:
#loading the model
d2v_model = gensim.models.doc2vec.Doc2Vec.load('doc2vec_1000.model')


In [14]:
# start testing
# print the vector of the question at index 1 of the tagged documents (tag 'q3')
docvec = d2v_model.docvecs[1]
print(docvec)

[ 4.2197876e+00  4.8312753e-02  3.0237710e+00  5.4978876e+00
  5.9577316e-01 -3.6213448e+00  1.5603197e+00 -3.5575607e+00
  2.8135810e+00  1.4412477e+00  5.8004332e+00 -1.1093522e+00
 -1.6479286e+00 -5.9140813e-01 -2.8680074e+00  5.2601206e-01
  9.4883829e-01  1.9628677e+00  2.1369214e+00  1.0583837e+00
 -2.2310190e+00  1.6844392e+00  1.4119526e+00 -4.6400557e+00
 -2.4797268e+00 -1.0261333e+00 -3.4789276e-01  2.5389395e+00
 -2.5144379e+00  2.0523310e+00 -1.0553601e+00 -4.0524144e+00
 -2.6139336e+00 -1.7444221e+00 -1.4102560e-01  1.0715770e+00
 -6.3769919e-01  3.6523359e+00  2.3656788e+00 -1.6876444e+00
 -4.3787060e+00 -1.9206141e+00 -4.0401635e+00 -4.3629346e+00
  2.0699105e+00  1.9567695e+00  1.6165894e+00  4.4256034e+00
 -3.2929420e+00 -6.6132431e+00  5.0168598e-01  8.3833343e-01
  2.1015339e+00 -2.0918705e+00  3.2042403e+00 -3.0005069e+00
  2.1545315e-01  1.3807440e+00 -7.2223020e-01 -2.9131355e+00
  1.7369833e+00  8.7035596e-03  4.0707278e+00 -2.2442057e+00
  1.1946799e+00 -3.93903

In [15]:
#printing the vector of the question using its id
docvec = d2v_model.docvecs['q3']
print(docvec)
len(docvec)

[ 4.2197876e+00  4.8312753e-02  3.0237710e+00  5.4978876e+00
  5.9577316e-01 -3.6213448e+00  1.5603197e+00 -3.5575607e+00
  2.8135810e+00  1.4412477e+00  5.8004332e+00 -1.1093522e+00
 -1.6479286e+00 -5.9140813e-01 -2.8680074e+00  5.2601206e-01
  9.4883829e-01  1.9628677e+00  2.1369214e+00  1.0583837e+00
 -2.2310190e+00  1.6844392e+00  1.4119526e+00 -4.6400557e+00
 -2.4797268e+00 -1.0261333e+00 -3.4789276e-01  2.5389395e+00
 -2.5144379e+00  2.0523310e+00 -1.0553601e+00 -4.0524144e+00
 -2.6139336e+00 -1.7444221e+00 -1.4102560e-01  1.0715770e+00
 -6.3769919e-01  3.6523359e+00  2.3656788e+00 -1.6876444e+00
 -4.3787060e+00 -1.9206141e+00 -4.0401635e+00 -4.3629346e+00
  2.0699105e+00  1.9567695e+00  1.6165894e+00  4.4256034e+00
 -3.2929420e+00 -6.6132431e+00  5.0168598e-01  8.3833343e-01
  2.1015339e+00 -2.0918705e+00  3.2042403e+00 -3.0005069e+00
  2.1545315e-01  1.3807440e+00 -7.2223020e-01 -2.9131355e+00
  1.7369833e+00  8.7035596e-03  4.0707278e+00 -2.2442057e+00
  1.1946799e+00 -3.93903

300

In [16]:
# get the most similar questions, with similarity scores, by question index
similar_doc = d2v_model.docvecs.most_similar(14)
print(similar_doc)

[('q740', 0.9731907844543457), ('q2', 0.9551281332969666), ('q141', 0.9514784812927246), ('q707', 0.9480893611907959), ('q220', 0.941907525062561), ('q784', 0.9396740198135376), ('q757', 0.9388040900230408), ('q1917', 0.937085747718811), ('q168', 0.935954749584198), ('q608', 0.9354759454727173)]
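
The scores returned by `most_similar` are cosine similarities between document vectors. The standard-library sketch below shows the same kind of ranking on toy data. The three two-dimensional vectors and the local `most_similar` function are made up for illustration and are not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query_tag, vectors, topn=10):
    """Rank every other tag by cosine similarity to the query vector."""
    query = vectors[query_tag]
    scored = [(tag, cosine_similarity(query, vec))
              for tag, vec in vectors.items() if tag != query_tag]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# toy vectors, not taken from the trained model
toy_vectors = {'q1': [1.0, 0.0], 'q2': [0.9, 0.1], 'q3': [0.0, 1.0]}
ranking = most_similar('q1', toy_vectors, topn=2)
print([tag for tag, _ in ranking])  # ['q2', 'q3']
```

A score close to 1 means the two vectors point in almost the same direction, and a score near 0 means they are almost orthogonal.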



In [17]:
# get the most similar questions, with similarity scores, by question tag
sims = d2v_model.docvecs.most_similar('q3')
print(sims)

[('q789', 0.9842917919158936), ('q1615', 0.9836912155151367), ('q1447', 0.9775302410125732), ('q1776', 0.9750415086746216), ('q10', 0.974833607673645), ('q1803', 0.973342776298523), ('q520', 0.972236692905426), ('q1273', 0.9711782336235046), ('q177', 0.9708677530288696), ('q1165', 0.9704925417900085)]



In [18]:
# look up the text of the query question and of each retrieved neighbour
actual_question = temp_df["questions"][temp_df["qid"] == 'q3'].values
print("actual question:\n",actual_question,"\n\n")
for qid, score in sims:
    print("similar question:\n",temp_df["questions"][temp_df["qid"] == qid].values)

actual question:
 ['what be the story of kohinoor koh noor diamond'] 


similar question:
 ['what be the good self help book pron have ever read']
similar question:
 ['what area of game programming be most mathematically involve and suit to math major']
similar question:
 ['what be the good horror novel in']
similar question:
 ['why be the big bang theory tv series so popular why be pron so popular with mainstream viewer']
similar question:
 ['which fish would survive in salt water']
similar question:
 ['prove that snr of power snr of voltage sequare']
similar question:
 ['can pron cancel tatkal waiting list ticket']
similar question:
 ['what be the good website to learn programming for begineer']
similar question:
 ['which be the good gaming laptop under inr']
similar question:
 ['why india do not have friendly relation with pron be neighbouring country']


In [19]:
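# infer a vector for new text that is not present in the corpus
text='What are the differences between a love marriage and an arranged marriage?'
# NOTE: infer_vector expects a list of word tokens; [clean_question(text)] passes the whole
# cleaned sentence as one token, whereas clean_question(text).split() passes one token per word
docvec_test = d2v_model.infer_vector([clean_question(text)],steps=20, alpha=0.025)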
print(docvec_test)

[ 8.13260558e-04  6.52711373e-04 -1.27835106e-03 -3.36558442e-04
  6.69761677e-04  1.41433149e-03 -2.44597439e-04 -1.60851443e-04
  6.15414581e-04  3.97924683e-04 -2.59113673e-04  3.47420224e-04
 -9.37240780e-04  9.11506242e-04 -8.29532743e-04 -8.00306880e-05
  1.37216260e-03 -9.78128868e-04  3.11138370e-04  4.24123456e-04
 -1.02365715e-03  2.33983257e-04 -1.02776522e-03 -1.28362456e-03
 -1.14987907e-03 -7.06099439e-04  1.07160580e-04 -8.99251900e-04
 -3.46198576e-05  2.52552301e-04 -9.02860658e-04  5.00085473e-04
  3.59014695e-04 -1.14721261e-04  6.62693492e-05 -1.58018293e-03
 -1.16872671e-03 -5.27093944e-04  3.54430842e-04  3.09811876e-04
 -5.87266913e-05  7.31529552e-04  9.53248120e-04  4.22610583e-06
  1.05116889e-03 -2.88103562e-04  1.13250641e-03 -8.85433459e-04
  2.39585963e-04  1.34278904e-03 -1.60948059e-03 -7.69341830e-04
  3.06174537e-04  6.01353706e-04 -1.43164699e-03  9.12399628e-05
  8.30846897e-04  1.36261445e-03 -6.05819572e-04  1.54437590e-03
 -1.45474484e-03 -3.06065
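
`infer_vector` takes a list of word tokens, so it matters whether the cleaned question is wrapped in a list or split into words. The sketch below compares the two forms against a toy vocabulary. The `cleaned` string is an assumed output of `clean_question`, and `vocabulary` and `known_tokens` are hypothetical names. Tokens the model has never seen should contribute nothing to the inferred vector.

```python
def known_tokens(tokens, vocabulary):
    """Tokens that the (toy) vocabulary actually contains."""
    return [token for token in tokens if token in vocabulary]


# toy vocabulary standing in for the words seen during training
vocabulary = {'what', 'be', 'the', 'difference', 'between',
              'love', 'marriage', 'and', 'an', 'arrange'}

# assumed result of clean_question for the test sentence
cleaned = 'what be the difference between love marriage and an arrange marriage'

wrapped = [cleaned]        # one token: the whole sentence
split = cleaned.split()    # one token per word

print(len(known_tokens(wrapped, vocabulary)))  # 0
print(len(known_tokens(split, vocabulary)))    # 11
```

With the wrapped form no token matches the vocabulary, which fits the small vector components and low similarity scores shown for this query.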

In [20]:
# get similar questions with similarity scores
d2v_model.docvecs.most_similar(positive=[docvec_test])


[('q155', 0.14983892440795898),
 ('q1988', 0.09502933919429779),
 ('q445', 0.09265796840190887),
 ('q1455', 0.08877728879451752),
 ('q1903', 0.08162131160497665),
 ('q1815', 0.0799749568104744),
 ('q153', 0.0777415782213211),
 ('q382', 0.06997652351856232),
 ('q916', 0.06949349492788315),
 ('q767', 0.06935332715511322)]