**Step 0. Import packages** 

In [2]:
# prerequisite packages
import numpy
import scipy
import gensim

**Step 1. Read the data in** 

In [12]:
# read in the data; each CSV row becomes a list of fields
import csv
with open('homework.csv', newline='') as f:
    reader = csv.reader(f)
    raw_docs = list(reader)

* [How to unnest a nested list?](https://stackoverflow.com/questions/11860476/how-to-unnest-a-nested-list)

In [37]:
# flatten the list of rows into a flat list of documents
raw_docs = sum(raw_docs, [])
# take a look at the first two records
raw_docs[:2]

['An apparatus and a method for diagnosis are provided. The apparatus for diagnosis lesion include: a model generation unit configured to categorize learning data into one or more categories and to generate one or more categorized diagnostic models based on the categorized learning data, a model selection unit configured to select one or more diagnostic model for diagnosing a lesion from the categorized diagnostic models, and a diagnosis unit configured to diagnose the lesion based on image data of the lesion and the selected one or more diagnostic model.',
 'Embodiments are disclosed to provide the prediction of viewable events. Predicting viewable events will allow users to know what events will likely be viewable in a particular venue, such as a restaurant, bar, or private home. Information about venues and events is populated in a database by a plurality of venues or users. Users wishing to view a particular event can search for a venue that has a high probability of showing that e

In [38]:
# Number of documents
len(raw_docs)

25
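
`sum(raw_docs, [])` works for 25 records, but it copies the accumulated list at every step. For larger files, `itertools.chain.from_iterable` flattens one level of nesting in a single pass. A minimal sketch, using a small made-up `rows` list as a stand-in for the CSV contents (the assumption that each row holds a single field is ours):

```python
from itertools import chain

# Hypothetical stand-in for list(csv.reader(f)): one single-field row per document.
rows = [["first abstract"], ["second abstract"], ["third abstract"]]

# Flatten one level of nesting without building intermediate lists.
flat_docs = list(chain.from_iterable(rows))
print(flat_docs)
```

Unlike the `sum` version, this cell can be re-run safely only if `rows` is kept separate from the flattened result, which is why the sketch writes to a new name.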

**Step 2. Tokenization** 

In [47]:
# tokenization
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)] 
            for text in raw_docs]

**Step 3. Create a dictionary**

note: a dictionary maps every unique token (words and punctuation marks alike) to an integer ID.

In [53]:
# create a dictionary
dictionary = gensim.corpora.Dictionary(gen_docs)
# Number of words in the dictionary
len(dictionary)

563

In [56]:
# show the first ten entries in the dictionary
for i in range(10):
    print(i, dictionary[i])

0 ,
1 .
2 :
3 a
4 an
5 and
6 apparatus
7 are
8 based
9 categories
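
The first entries in the dictionary are punctuation marks, because `word_tokenize` emits them as separate tokens. If that is not wanted, one option is to filter the tokens before building the dictionary. A sketch, assuming it is acceptable to keep only tokens that contain at least one letter or digit; `keep_token` is a hypothetical helper and the token list is made up:

```python
def keep_token(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Made-up tokenized document, similar in shape to one entry of gen_docs.
tokens = ["an", "apparatus", "and", "a", "method", ",", "for", "diagnosis", "."]

# Drop tokens that consist of punctuation only.
filtered = [t for t in tokens if keep_token(t)]
print(filtered)
```

Applied to the notebook, the same test could go inside the list comprehension in Step 2. Whether to also drop very common words such as "a" and "and" is a separate choice that this sketch does not make.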


**Step 4. Create a corpus**

note: a corpus is a list of bags of words. Each bag of words is a list of (token ID, count) pairs for one document.

In [75]:
# Create a list of bags of words
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(type(corpus))
print('\n')
print(corpus[1])

<class 'list'>


[(0, 3), (1, 4), (3, 7), (5, 1), (7, 1), (18, 1), (31, 3), (34, 2), (39, 1), (40, 3), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 4), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 2), (75, 3), (76, 2), (77, 2), (78, 1), (79, 3), (80, 1), (81, 2), (82, 1)]


In [77]:
# look up the token that corresponds to a given ID
print(dictionary[5])

and
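
To make the dictionary and bag-of-words ideas concrete, here is a pure-Python sketch of what the two structures look like conceptually. It is not gensim's implementation, the documents are made up, and the IDs it assigns will not necessarily match the ones gensim would assign; `token2id` and `to_bow` are names chosen for this illustration.

```python
from collections import Counter

# Two made-up tokenized documents.
docs = [["a", "cat", "sat"], ["a", "cat", "saw", "a", "dog"]]

# Dictionary: map each distinct token to an integer ID.
# Sorting the vocabulary is an arbitrary choice made for reproducibility.
vocabulary = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocabulary)}

def to_bow(tokens):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Corpus: one bag of words per document.
bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Tokens that are missing from the dictionary are skipped in `to_bow`, which mirrors how an unseen word in a query contributes nothing to its bag of words.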


**Step 5. Create a tf-idf model**

note: Learn more about [tf-idf](http://www.tfidf.com/)

Other weighting schemes or models could be tried here in place of tf-idf.

In [81]:
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
# num_nnz is the number of non-zero entries in the corpus,
# i.e. the total number of (document, token) pairs, not the number of tokens

TfidfModel(num_docs=25, num_nnz=1246)
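
The weighting itself can be sketched in pure Python. To our knowledge, gensim's default is count × log2(N / df), followed by scaling each document vector to unit length; treat that formula as an assumption here and check the gensim documentation before relying on it. The corpus below is made up, and `tfidf` is a hypothetical helper, not gensim's code.

```python
import math

# Made-up corpus in bag-of-words form: lists of (token_id, count) pairs.
toy_corpus = [[(0, 1), (1, 1)], [(0, 2), (2, 1)], [(0, 1), (1, 1), (3, 1)]]

num_docs = len(toy_corpus)

# Document frequency: in how many documents each token ID occurs.
doc_freq = {}
for bow in toy_corpus:
    for token_id, _ in bow:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(bow):
    """Weight counts by log2(N / df), drop zero weights, then L2-normalize."""
    weights = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    weights = [(tid, w) for tid, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

for bow in toy_corpus:
    print(tfidf(bow))
```

Token 0 occurs in every toy document, so its weight is zero and it drops out. The same effect explains why tokens that appear in almost every document receive very small weights.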


**Step 6. Similarity measures / Similarity Matrix**

In [97]:
# build an in-memory cosine-similarity index over the tf-idf vectors
index = gensim.similarities.MatrixSimilarity(tf_idf[corpus])
index

<gensim.similarities.docsim.MatrixSimilarity at 0x1bf58ba4a8>

**Step 7. Similarity interface**

In [96]:
query_doc = [w.lower() for w in word_tokenize(raw_docs[5])]
query_doc_bow = dictionary.doc2bow(query_doc)
query_doc_bow_tf_idf = tf_idf[query_doc_bow]
# show the first five tf-idf weights of the query document
print(query_doc_bow_tf_idf[:5])

[(0, 0.0035008157911770176), (4, 0.042894504746844565), (18, 0.0053671438020382375), (20, 0.07694178858551294), (27, 0.02434395030301869)]


In [99]:
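# query the index: one similarity score per document in the corpus
sims = index[query_doc_bow_tf_idf]
# sort (document index, score) pairs by score, highest first
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims_sorted)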

[(5, 1.0), (24, 0.13637403), (11, 0.07579642), (7, 0.058889322), (12, 0.048985645), (8, 0.04739786), (10, 0.041341662), (6, 0.03414415), (17, 0.033516787), (1, 0.031384718), (0, 0.03130289), (15, 0.02824454), (18, 0.025812946), (9, 0.024417594), (19, 0.023246385), (14, 0.020976612), (3, 0.02051758), (21, 0.015337167), (13, 0.013489934), (22, 0.012365733), (16, 0.0087839635), (2, 0.0067955498), (23, 0.006663728), (4, 0.0064896205), (20, 0.003776135)]
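
The scores returned by `MatrixSimilarity` are cosine similarities, which is why the query document scores 1.0 against itself. A minimal sketch of the same computation over sparse vectors in (token ID, weight) form; `cosine` is a hypothetical helper and the vectors are made up:

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (token_id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[tid] for tid, w in a.items() if tid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up tf-idf vectors: a query and three indexed documents.
query = [(0, 0.6), (2, 0.8)]
others = [[(0, 0.6), (2, 0.8)], [(1, 1.0)], [(2, 1.0)]]

# Score every document against the query and sort, highest score first.
scores = sorted(
    enumerate(cosine(query, vec) for vec in others),
    key=lambda item: item[1],
    reverse=True,
)
print(scores)
```

Documents that share no tokens with the query score 0, and the ranking depends only on the direction of the vectors, not on their length.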


In [102]:
# compare the query document (5) with its closest match (24)
print(raw_docs[5])
print('\n')
print(raw_docs[24])

A candidate message chatbot system and method. The chatbot includes an interactive dialog interface for engaging in a chat session with a user. The user can enter one or more characters as an input message during the chat session. The chatbot can match the one or more characters with a plurality of candidate messages in a knowledge database, the plurality of candidate input messages being part of input/output knowledge entry or known to generate at least a quality response.


Method, system, and computer program product to analyze a plurality of candidate answers identified as responsive to a question presented to a deep question answering system, by computing a first feature score for a first feature of an item of evidence, of a plurality of items of evidence, the first feature score being based on at least one attribute of the first feature, the item of evidence relating to a first candidate answer, of the plurality of candidate answers, and computing a merged feature score for the f
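
Document 24 was picked by reading the sorted list by eye. The same selection can be done in code by skipping the query's own index and keeping the top `k` entries. A small sketch using the first few pairs from the sorted output above; `query_index`, `k`, and `top_matches` are names introduced for this illustration:

```python
# First few (doc_index, score) pairs copied from the sorted output above.
sims_sorted = [(5, 1.0), (24, 0.13637403), (11, 0.07579642), (7, 0.058889322)]
query_index = 5

# Skip the query itself, then keep the k best matches.
k = 2
top_matches = [(i, s) for i, s in sims_sorted if i != query_index][:k]
print(top_matches)
```

With these values, documents 24 and 11 are returned, in that order.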

### References

* [Tutorials on gensim](https://radimrehurek.com/gensim/tutorial.html)

* [How do I compare document similarity using Python?](https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python)