### Case Study: Document Similarity 

How do I find documents similar to a particular document?

We will use our favorite NLP library: Gensim

In [1]:
import gensim

In [2]:
# we'll use the same dataset that we used in the Word2Vec lecture
# first, we'll get the list of txt files under the bbc_sport folder.
# we can use the glob library for that
import glob
file_names=glob.glob("bbc_sport/*.txt")
file_names

['bbc_sport/289.txt',
 'bbc_sport/504.txt',
 'bbc_sport/262.txt',
 'bbc_sport/276.txt',
 'bbc_sport/510.txt',
 'bbc_sport/060.txt',
 'bbc_sport/074.txt',
 'bbc_sport/048.txt',
 'bbc_sport/114.txt',
 'bbc_sport/100.txt',
 'bbc_sport/128.txt',
 'bbc_sport/470.txt',
 'bbc_sport/316.txt',
 'bbc_sport/302.txt',
 'bbc_sport/464.txt',
 'bbc_sport/458.txt',
 'bbc_sport/459.txt',
 'bbc_sport/303.txt',
 'bbc_sport/465.txt',
 'bbc_sport/471.txt',
 'bbc_sport/317.txt',
 'bbc_sport/129.txt',
 'bbc_sport/101.txt',
 'bbc_sport/115.txt',
 'bbc_sport/049.txt',
 'bbc_sport/075.txt',
 'bbc_sport/061.txt',
 'bbc_sport/277.txt',
 'bbc_sport/511.txt',
 'bbc_sport/505.txt',
 'bbc_sport/263.txt',
 'bbc_sport/288.txt',
 'bbc_sport/275.txt',
 'bbc_sport/261.txt',
 'bbc_sport/507.txt',
 'bbc_sport/249.txt',
 'bbc_sport/088.txt',
 'bbc_sport/077.txt',
 'bbc_sport/063.txt',
 'bbc_sport/103.txt',
 'bbc_sport/117.txt',
 'bbc_sport/498.txt',
 'bbc_sport/467.txt',
 'bbc_sport/301.txt',
 'bbc_sport/315.txt',
 'bbc_spor

In [3]:
# now we'll read all the txt files into a single list of texts
raw_documents=[]
for file in file_names:
    # try/except keeps the loop running if one of the txt files cannot be read;
    # unreadable files are skipped, so raw_documents may be shorter than file_names
    try:
        with open(file, "r", encoding="utf-8") as f:
            raw_documents.append(f.read())
    except (OSError, UnicodeDecodeError):
        pass

In [4]:
print("Number of documents:",len(raw_documents))

Number of documents: 510
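
Because unreadable files are skipped silently, the positions in raw_documents no longer line up with file_names once any file fails. A minimal sketch of one way to keep the two in step is shown here; the helper name `load_documents` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_documents(folder):
    """Read every .txt file in `folder`; return (names, texts) kept in step."""
    names, texts = [], []
    for path in sorted(Path(folder).glob("*.txt")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as err:
            # Report the skipped file instead of hiding the failure.
            print(f"skipped {path.name}: {err}")
            continue
        names.append(path.name)
        texts.append(text)
    return names, texts
```

With this helper, `names, raw_documents = load_documents("bbc_sport")` would give a file name for every position in raw_documents. Note that sorting the paths changes the document order, so indices such as 8 or 220 used later in this notebook would point to different files.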


In [5]:
# then we'll clean and tokenize the text data with the methods we learned before

clean_texts=[]
for text in raw_documents:
    clean_texts.append(gensim.utils.simple_preprocess(text))
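
To get a feel for what this step does to the text, here is a rough pure-Python approximation. It is only a sketch: `approx_simple_preprocess` is a hypothetical name, and it assumes the behavior we rely on here (lowercasing, keeping runs of letters, dropping tokens shorter than 2 or longer than 15 characters). Gensim's own function may differ in details such as accent handling.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    # Lowercase, keep runs of letters only (digits and punctuation split tokens),
    # then drop very short and very long tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_simple_preprocess(
    "Robben had an MRI scan; I need 6 weeks, said Roma's physio."
))
```

Single letters such as "I", numbers such as "6", and the "s" left over from "Roma's" do not survive, which matches what we see later in the tokenized query document.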

In [6]:
# next, we will create a dictionary from a list of documents. A dictionary maps every word to a number.

dictionary = gensim.corpora.Dictionary(clean_texts)

print("Number of words in dictionary:",len(dictionary))

# let's see the first 100 words from the dictionary
for i in range(100):
    print(i, dictionary[i])

Number of words in dictionary: 10164
0 able
1 about
2 absolutely
3 african
4 after
5 ahead
6 all
7 am
8 an
9 and
10 andy
11 are
12 at
13 awesome
14 back
15 ball
16 bbc
17 be
18 been
19 before
20 blasts
21 bodies
22 both
23 but
24 called
25 came
26 chance
27 charge
28 charlie
29 clive
30 coach
31 consult
32 cope
33 corry
34 cost
35 could
36 couple
37 crashed
38 credit
39 cross
40 cueto
41 decisions
42 declined
43 defended
44 denied
45 did
46 didn
47 disappointed
48 dominated
49 don
50 done
51 doubt
52 dublin
53 dying
54 effort
55 england
56 every
57 everything
58 famous
59 field
60 first
61 fly
62 for
63 forwards
64 four
65 fourth
66 from
67 fuming
68 game
69 games
70 gather
71 given
72 go
73 gone
74 good
75 got
76 had
77 half
78 has
79 have
80 he
81 his
82 hodgson
83 hoisted
84 how
85 hurt
86 in
87 insisted
88 ireland
89 irish
90 is
91 it
92 jonathan
93 josh
94 kaplan
95 kick
96 know
97 legal
98 lewis
99 lewsey


Now we will create a corpus. A corpus is a list of bags of words. A bag-of-words representation for a document just lists the number of times each word occurs in the document.
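
The idea can be sketched without Gensim. In this toy version the helper names `build_vocab` and `to_bow` are hypothetical, and ids are assigned in order of first appearance, which is an assumption made for illustration and not how Gensim's Dictionary numbers its words.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["robben", "broken", "foot", "foot"], ["chelsea", "robben"]]
vocab = build_vocab(docs)
print(to_bow(docs[0], vocab))                          # foot is counted twice
print(to_bow(["robben", "robben", "unknown"], vocab))  # "unknown" is dropped
```

Each pair is (word id, count), the same shape as the output of doc2bow in the next cell.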

In [7]:
corpus = [dictionary.doc2bow(text) for text in clean_texts]
print(corpus[:10]) # print first 10 bags of words

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 2), (8, 1), (9, 11), (10, 1), (11, 3), (12, 2), (13, 2), (14, 1), (15, 1), (16, 1), (17, 2), (18, 3), (19, 1), (20, 1), (21, 1), (22, 2), (23, 3), (24, 1), (25, 1), (26, 2), (27, 1), (28, 2), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 3), (36, 1), (37, 1), (38, 1), (39, 1), (40, 4), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 3), (56, 2), (57, 2), (58, 1), (59, 1), (60, 2), (61, 1), (62, 3), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 4), (69, 1), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 3), (76, 1), (77, 2), (78, 1), (79, 8), (80, 3), (81, 3), (82, 2), (83, 1), (84, 1), (85, 1), (86, 6), (87, 1), (88, 2), (89, 1), (90, 4), (91, 4), (92, 1), (93, 3), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 3), (100, 1), (101, 1), (102, 2), (103, 2), (104, 1), (105, 1), (106, 1), (107, 2), (108, 1), (109, 1), (110, 1

In [8]:
# Now we create a tf-idf model from the corpus.
# The num_nnz value that we'll see in the output is the number of non-zero entries
# in the bag-of-words corpus, i.e. the total count of (document, distinct word) pairs.

tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)

TfidfModel(num_docs=510, num_nnz=89294)


Now we will create a similarity index in tf-idf space.
tf-idf stands for term frequency-inverse document frequency. Term frequency is how often a word occurs in a document; inverse document frequency scales that value by how rare the word is across the corpus, so words that appear in almost every document get weights close to zero.
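
A small hand-rolled version makes the weighting concrete. This is a sketch under an assumption: it uses raw count times log2(N / document frequency), followed by scaling each vector to unit length, which is our understanding of Gensim's default settings. The names `fit_idf`, `tfidf`, and `toy_corpus` are hypothetical.

```python
import math

def fit_idf(bow_corpus):
    """Compute idf = log2(N / df) for every token id seen in the corpus."""
    n_docs = len(bow_corpus)
    df = {}
    for bow in bow_corpus:
        for token_id, _count in bow:
            df[token_id] = df.get(token_id, 0) + 1
    return {token_id: math.log2(n_docs / count) for token_id, count in df.items()}

def tfidf(bow, idf):
    """Weight a bag of words by idf, drop zero weights, scale to unit length."""
    weights = [(tid, count * idf[tid]) for tid, count in bow if idf.get(tid, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _tid, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

# Token 0 occurs in all four documents, so its idf is log2(4/4) = 0.
toy_corpus = [
    [(0, 1), (1, 2), (2, 1)],
    [(0, 1), (2, 1)],
    [(0, 2), (3, 1)],
    [(0, 1), (1, 1)],
]
idf = fit_idf(toy_corpus)
print(tfidf(toy_corpus[0], idf))
```

Token 0 disappears from the result because it appears everywhere, and token 1 ends up with twice the weight of token 2 because it occurs twice in the document while both have the same idf.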

According to the official Gensim documentation, Gensim contains three classes for similarity indexing:

gensim.similarities.MatrixSimilarity

gensim.similarities.SparseMatrixSimilarity

gensim.similarities.Similarity (we'll use this one)

In [16]:
similarity_object = gensim.similarities.Similarity('bbc_sport/', tf_idf[corpus], num_features=len(dictionary))
# 'bbc_sport/' is the output prefix: the path prefix under which the index shard files are stored
print(similarity_object)
print(type(similarity_object))

Similarity index with 510 documents in 0 shards (stored under bbc_sport/)
<class 'gensim.similarities.docsim.Similarity'>
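
The score this index returns for each document is a cosine similarity between the query vector and that document's tf-idf vector. The following sketch computes the same quantity for sparse vectors in plain Python; the function name `cosine` and the toy vectors are made up for illustration.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b[tid] for tid, weight in a.items() if tid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

toy_index = [
    [(0, 0.6), (1, 0.8)],  # shares token 1 with the query
    [(1, 1.0)],            # identical to the query
    [(2, 1.0)],            # no tokens in common with the query
]
query = [(1, 1.0)]
scores = [cosine(query, doc) for doc in toy_index]
print(scores)
```

A document compared with itself scores 1.0 up to floating-point rounding, and documents that share no words with the query score 0.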


In [28]:
import joblib
model_file = 'sim_model'
joblib.dump(similarity_object, model_file)

['sim_model']

In [27]:
!ls -l

total 3856
-rw-r--r--    1 vkocaman  staff   73032 Sep  8 14:30 Document_similarity.ipynb
-rw-r--r--@   1 vkocaman  staff   60195 Jun 24 00:02 NLP Curriculum.docx
-rw-r--r--@   1 vkocaman  staff  152779 May 19 08:56 NR_architecture.png
-rw-r--r--    1 vkocaman  staff   12423 Jul 15 23:24 Topic Modelling.ipynb
-rw-r--r--    1 vkocaman  staff  494653 Jul 15 22:46 Udemy_NLP_edited_at_June24th.ipynb
drwxr-xr-x@ 514 vkocaman  staff   16448 Jul 22 20:17 bbc_sport
-rw-r--r--@   1 vkocaman  staff   51564 May 13 17:26 ch-2 Python NLP Packages.docx
-rw-r--r--@   1 vkocaman  staff  633762 May 19 11:25 chapter 2 and 3.html
-rw-r--r--@   1 vkocaman  staff  120287 May 13 16:44 chapter-1 NLP Foundations.docx
-rw-r--r--@   1 vkocaman  staff   33596 Jun 23 22:54 ngrams.png
-rw-r--r--@   1 vkocaman  staff    2050 May 18 01:24 sample_text.txt
-rw-r--r--    1 vkocaman  staff  203653 Jul 22 19:46 word2vec.ipynb
-rw-r--r--@   1 vkocaman  staff  109625 Jul 22 16:19 word2vec.png
-rw-r

In [18]:
import pickle
model_file = 'sim_model'
# pickle is an alternative to joblib; the with block closes the file after writing
with open(model_file, 'wb') as f:
    pickle.dump(similarity_object, f)

In [29]:
import joblib
similarity_object = joblib.load(model_file)

Now create a query document and convert it to tf-idf.

The query document is the one for which we want to find similar documents.

We're going to use raw_documents[8] as our query document and check whether the model returns that same document as the most similar one.


In [10]:
text=raw_documents[8]

In [11]:
text

'Robben sidelined with broken foot\n\nChelsea winger Arjen Robben has broken two metatarsal bones in his foot and will be out for at least six weeks.\n\nRobben had an MRI scan on the injury, sustained during the Premiership win at Blackburn, on Monday. "Six weeks is the average time to heal this injury and then I need a few more weeks to be completely fit again," he told Dutch newspaper Algemeen Dagblad. "I had a feeling it was serious but because of the swelling it was impossible to make a final diagnosis." The 21-year-old missed the first three months of the season with a similar injury after a challenge with Roma\'s Olivier Dacourt. And he added: "It felt different then last summer when I had the same injury on my other foot. "Then I could walk already after three days but I stayed sidelined for a long period. I hope that it will now take me six to eight weeks." Chelsea physio Mike Banks was hopeful that Robben could return at some point in March. "The fractures are tiny and he coul

In [12]:
# prepare the text
query_doc = gensim.utils.simple_preprocess(text)
print(query_doc)

['robben', 'sidelined', 'with', 'broken', 'foot', 'chelsea', 'winger', 'arjen', 'robben', 'has', 'broken', 'two', 'metatarsal', 'bones', 'in', 'his', 'foot', 'and', 'will', 'be', 'out', 'for', 'at', 'least', 'six', 'weeks', 'robben', 'had', 'an', 'mri', 'scan', 'on', 'the', 'injury', 'sustained', 'during', 'the', 'premiership', 'win', 'at', 'blackburn', 'on', 'monday', 'six', 'weeks', 'is', 'the', 'average', 'time', 'to', 'heal', 'this', 'injury', 'and', 'then', 'need', 'few', 'more', 'weeks', 'to', 'be', 'completely', 'fit', 'again', 'he', 'told', 'dutch', 'newspaper', 'algemeen', 'dagblad', 'had', 'feeling', 'it', 'was', 'serious', 'but', 'because', 'of', 'the', 'swelling', 'it', 'was', 'impossible', 'to', 'make', 'final', 'diagnosis', 'the', 'year', 'old', 'missed', 'the', 'first', 'three', 'months', 'of', 'the', 'season', 'with', 'similar', 'injury', 'after', 'challenge', 'with', 'roma', 'olivier', 'dacourt', 'and', 'he', 'added', 'it', 'felt', 'different', 'then', 'last', 'summer'

In [13]:
# printing out the bag of words
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)


[(4, 3), (8, 1), (9, 6), (11, 1), (12, 3), (17, 3), (18, 1), (23, 3), (35, 3), (58, 1), (60, 1), (62, 4), (64, 1), (72, 1), (76, 3), (78, 4), (80, 6), (81, 2), (86, 4), (90, 4), (91, 5), (95, 1), (114, 3), (117, 1), (118, 4), (122, 6), (123, 1), (124, 2), (137, 1), (142, 1), (143, 1), (145, 3), (155, 3), (156, 18), (158, 3), (162, 2), (164, 2), (167, 5), (168, 2), (174, 2), (176, 1), (184, 5), (190, 2), (193, 2), (194, 1), (197, 4), (200, 1), (210, 1), (216, 2), (221, 2), (222, 1), (225, 1), (226, 1), (232, 1), (236, 1), (249, 1), (263, 2), (284, 1), (286, 1), (289, 2), (302, 1), (307, 1), (311, 1), (321, 1), (331, 1), (337, 1), (342, 1), (344, 1), (359, 1), (405, 1), (408, 1), (419, 1), (425, 1), (450, 1), (484, 1), (486, 1), (488, 1), (517, 1), (526, 2), (540, 1), (544, 1), (558, 1), (591, 1), (595, 1), (644, 1), (648, 1), (682, 1), (721, 1), (771, 1), (796, 1), (823, 1), (824, 1), (825, 1), (826, 1), (827, 1), (828, 2), (829, 1), (830, 1), (831, 2), (832, 1), (833, 1), (834, 1), (83

In [14]:
# getting the tfidf vector
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

[(4, 0.02263075027829594), (8, 0.009420774159789867), (9, 0.001632501190511257), (11, 0.01206382182138575), (12, 0.008330816019344692), (17, 0.016567430565034364), (18, 0.009774394648130666), (23, 0.007621970448415649), (35, 0.05641501191245958), (58, 0.06602230631084857), (60, 0.012616766574606766), (62, 0.00576897937120855), (64, 0.02504392869532371), (72, 0.023522471169963872), (76, 0.02914485293908593), (78, 0.02321531177230382), (80, 0.02560261799098236), (81, 0.010582962212341947), (86, 0.0010883341270075046), (90, 0.018289863300235173), (91, 0.021768990088126633), (95, 0.04216567075964854), (114, 0.05857286285797554), (117, 0.018211533956688244), (118, 0.0010883341270075046), (122, 0.012923538764816056), (123, 0.013115610919723698), (124, 0.020028488347842967), (137, 0.004095012315388415), (142, 0.019841999872251836), (143, 0.024757106640784576), (145, 0.0635249093796952), (155, 0.01358552824715018), (158, 0.06148674570000845), (162, 0.020637669734009267), (164, 0.03546429143735

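Now we'll show an array of document similarities to our query. Each entry is the cosine similarity between the query and one document in the corpus. The entry at index 8 is about 1.0 because the query is that document itself; most of the other scores are close to zero.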

In [30]:
similarity_scores=list(similarity_object[query_doc_tf_idf])
similarity_scores
# here we see the similarity score of our query document against each of the 510 documents in our corpus

[0.032887433,
 0.016977508,
 0.018539036,
 0.019894533,
 0.017899081,
 0.022580015,
 0.010269554,
 0.013470748,
 1.0000002,
 0.027639873,
 0.022147087,
 0.038350023,
 0.021625647,
 0.016395621,
 0.03301203,
 0.05297974,
 0.015539058,
 0.04383433,
 0.03391991,
 0.020918177,
 0.01542627,
 0.0159646,
 0.02570059,
 0.037620053,
 0.015216005,
 0.023835687,
 0.022008173,
 0.016603677,
 0.052490875,
 0.015451965,
 0.012892699,
 0.02920356,
 0.060031164,
 0.01526795,
 0.018899424,
 0.026807508,
 0.0152140455,
 0.021564871,
 0.013839945,
 0.049074553,
 0.02592384,
 0.018810576,
 0.020572634,
 0.020307606,
 0.07500416,
 0.029557766,
 0.026355803,
 0.026690753,
 0.016304808,
 0.020372866,
 0.03595853,
 0.07396906,
 0.032260608,
 0.038677063,
 0.057104725,
 0.023761138,
 0.017442422,
 0.019532368,
 0.047460984,
 0.036169436,
 0.015970271,
 0.04445741,
 0.01716566,
 0.020482495,
 0.023376094,
 0.032064646,
 0.023642734,
 0.030902818,
 0.027590303,
 0.020692986,
 0.04769097,
 0.04777052,
 0.03368994

In [16]:
# now let's see which document is the most similar to our query document.
# the score for the most similar one should be around 1.0
# because the query is itself one of the documents in the indexed corpus

# first we'll find the max score in this list

max_score=max(similarity_scores)

print (max_score)
# then we'll find the index of the highest score

similarity_scores.index(max_score)
# as you can see, document 8 is the most similar one

1.0000002


8

In [54]:
print (raw_documents[8])

Robben sidelined with broken foot

Chelsea winger Arjen Robben has broken two metatarsal bones in his foot and will be out for at least six weeks.

Robben had an MRI scan on the injury, sustained during the Premiership win at Blackburn, on Monday. "Six weeks is the average time to heal this injury and then I need a few more weeks to be completely fit again," he told Dutch newspaper Algemeen Dagblad. "I had a feeling it was serious but because of the swelling it was impossible to make a final diagnosis." The 21-year-old missed the first three months of the season with a similar injury after a challenge with Roma's Olivier Dacourt. And he added: "It felt different then last summer when I had the same injury on my other foot. "Then I could walk already after three days but I stayed sidelined for a long period. I hope that it will now take me six to eight weeks." Chelsea physio Mike Banks was hopeful that Robben could return at some point in March. "The fractures are tiny and he could be p

In [17]:
# and let's see the second highest score and the corresponding document

sorted_scores=sorted(similarity_scores, reverse=True)

print (sorted_scores[0]) # the highest one (document 8)

print (sorted_scores[1]) # the second highest one

1.0000002
0.33337453


In [18]:
# let's find which document has the 0.33 score

similarity_scores.index(sorted_scores[1])

220
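
Looking up a score with index() works here, but it returns only the first match, so two documents with the same score would be reported as the same index. A short sketch that ranks (index, score) pairs directly avoids this; the helper name `top_k` and the sample scores are hypothetical.

```python
def top_k(scores, k=3):
    """Return the k highest (document_index, score) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

sample_scores = [0.03, 0.33, 1.0, 0.33, 0.05]
print(top_k(sample_scores))  # documents 1 and 3 tie and are both listed
```

In the notebook, `top_k(similarity_scores, k=5)` would list the query document first, followed by its nearest neighbours with their scores.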

In [19]:
# looks like document 220 is the second most similar one, but note that the score is quite low (0.33)
# however, when compared with document 8, we see that both documents are about Arjen Robben's injury
# so our model did a pretty good job

print (raw_documents[220])

Robben plays down European return

Injured Chelsea winger Arjen Robben has insisted that he only has a 10% chance of making a return against Barcelona in the Champions League.

The 21-year-old has been sidelined since breaking a foot against Blackburn last month. Chelsea face Barcelona at home on 8 March having lost 2-1 in the first leg. And Robben told the Daily Star: "It is not impossible that I will play against Barcelona but it is just a very, very small chance - about 10%."

Robben has been an inspirational player for Chelsea this season following a switch from PSV Einhoven last summer. He added: "My recovery is going better than we expected a few weeks ago but I think the Barcelona game will come too soon. "I won't take any risks and come back too soon."

