[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/T11_CCS_AG_Genism_NLP_Coding_Challenge__2.ipynb)

### Objective 2: Comparing documents or words

A common task in NLP is to determine the similarity between documents or words. To make that comparison possible, we represent each document or word as a vector.

A vector is a sequence of numbers, so two vectors can be compared by measuring the difference between their numbers. In this section, we will convert documents and words to vectors using Gensim, a free Python library.

More details on Gensim are available here:

https://radimrehurek.com/gensim/intro.html
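
To make the idea of comparing vectors concrete, here is a minimal sketch that does not depend on Gensim. It uses cosine similarity, one common way to compare two vectors. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration only.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)

# Toy word-count vectors over a hypothetical three-word vocabulary
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same proportions as doc_a
doc_c = [0, 0, 3]  # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))  # very close to 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

Vectors that point in the same direction score close to 1, and vectors with no overlapping non-zero positions score 0.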

In [0]:
# https://radimrehurek.com/gensim/install.html
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/33/33/df6cb7acdcec5677ed130f4800f67509d24dbec74a03c329fcbf6b0864f0/gensim-3.4.0-cp36-cp36m-manylinux1_x86_64.whl (22.6MB)
    100% |████████████████████████████████| 22.6MB 1.9MB/s 
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/4b/69/c92661a333f733510628f28b8282698b62cdead37291c8491f3271677c02/smart_open-1.5.7.tar.gz
Requirement not upgraded as not directly required: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.14.3)
Requirement not upgraded as not directly required: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (0.19.1)
Requirement not upgraded as not directly required: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/bd/b7/a88a67002b1185ed9a8e8a6ef15266728c2

**Determine document similarity using TF-IDF**


To determine document similarity, we will use the Gensim library to compute TF-IDF: Term Frequency (TF) combined with Inverse Document Frequency (IDF). Before walking through the tutorial, let's quickly review the concepts behind TF-IDF.

When comparing documents, we should account for how many times a word appears in a document. This is the "Term Frequency" portion of TF-IDF. However, "rare" words should carry more weight in the comparison than "common" words that occur in most documents. The "IDF" portion measures how rare a word is across the documents. It is computed with a log function, so a word that appears in every document gets an IDF of 0.

**IDF** = log(total # of documents / # of documents containing the term)

Here is an example:

We have 2 files:

**File 1**: Cheetahs are amazing to watch

**File 2**: Spotting a Cheetah in the Jungle is difficult

After tokenization and elimination of the stopwords, we will be left with:

**File 1**: Cheetahs amazing watch

**File 2**: Spotting Cheetah Jungle difficult

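The “**Term Frequency**” is the relative strength of the word within a document, i.e. for Cheetahs in File 1 it is 1/3 (one occurrence out of the three remaining words).

The “**Inverse Document Frequency**” is the relative strength of the word across the documents or files, i.e. for Cheetah(s) it is log(2/2) = 0, because the word appears in both files. This treats "Cheetahs" and "Cheetah" as the same term, which requires a step such as stemming or lemmatization.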
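
The hand calculation above can be sketched in a few lines of plain Python. The helper names (`normalize`, `term_frequency`, `inverse_document_frequency`) are hypothetical, and `normalize` is only a crude stand-in for real stemming, used here so that "Cheetahs" and "Cheetah" count as one term.

```python
import math

def normalize(token):
    # Crude stand-in for stemming (an assumption of this sketch):
    # lowercase the token and drop a single trailing "s"
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

# The two files after tokenization and stopword removal
file_1 = [normalize(t) for t in ["Cheetahs", "amazing", "watch"]]
file_2 = [normalize(t) for t in ["Spotting", "Cheetah", "Jungle", "difficult"]]
files = [file_1, file_2]

def term_frequency(term, tokens):
    # Share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    # Assumes the term occurs in at least one document
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing)

tf_cheetah = term_frequency("cheetah", file_1)              # 1/3
idf_cheetah = inverse_document_frequency("cheetah", files)  # log(2/2) = 0.0
idf_jungle = inverse_document_frequency("jungle", files)    # log(2/1), greater than 0

print(tf_cheetah, idf_cheetah, idf_jungle)
```

Because "cheetah" occurs in both files, its IDF is 0 and it contributes nothing to telling the files apart, while "jungle" occurs in only one file and keeps a positive weight.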

We will use the Gensim library to do the computation for us.

In [0]:
# Installing NLTK
# Reference: http://www.nltk.org/install.html
!pip install -U nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 5.2MB/s 
Requirement not upgraded as not directly required: six in /usr/local/lib/python3.6/dist-packages (from nltk) (1.11.0)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /content/.cache/pip/wheels/d1/ab/40/3bceea46922767e42986aef7606a600538ca80de6062dc266c
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.3


In [0]:
# Import the NLTK package
import nltk

# Get all the data associated with NLTK – could take a while to download all the data
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /c

[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package nombank.1.0 to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    | Downloading package nps_chat to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/nps_chat.zip.
[nltk_data]    | Downloading package omw to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/omw.zip.
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/opinion_lexicon.zip.
[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping corpora/paradigms.zip.
[nltk_data]    | Downloading package pil to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/pil.zip.
[nltk_data]    | Downloading package pl196x to /content/nltk_data...
[nltk_data]    |

[nltk_data]    |   Unzipping corpora/wordnet_ic.zip.
[nltk_data]    | Downloading package words to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/words.zip.
[nltk_data]    | Downloading package ycoe to /content/nltk_data...
[nltk_data]    |   Unzipping corpora/ycoe.zip.
[nltk_data]    | Downloading package rslp to /content/nltk_data...
[nltk_data]    |   Unzipping stemmers/rslp.zip.
[nltk_data]    | Downloading package maxent_treebank_pos_tagger to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data]    | Downloading package universal_tagset to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping taggers/universal_tagset.zip.
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data]    | Downloading package punkt to /content/nltk_data...
[nltk_data]    |   Unzipping token

True

In [0]:
# Import word tokenizer
from nltk.tokenize import word_tokenize

# Raw documents
raw_documents = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
                 'My name is Thomson Comer, commander of the Machine Learning program at Lambda school.',
                 'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
                 'Machine Learning is one of my favorite subjects.',
                 'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
                 'When does the Machine Learning program kick-off at Lambda school?',
                 'The batter hit the ball out off AT&T park into the pacific ocean.',
                 'The pitcher threw the ball into the dug-out.']

"""
# A function that tokenizes the text
def convert_to_tokens(text):
    tokens = word_tokenize(text)
    # etc...
    return tokens

# Create a Gensim document that contains a list of tokens
gensim_doc = [convert_to_tokens(text) for text in raw_documents]
print(gensim_doc)
"""
gensim_doc = [word_tokenize(text) for text in raw_documents]
print(gensim_doc)

[['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the', 'owner', "'s", 'room', 'to', 'check', 'if', 'the', 'owner', 'was', 'in', 'the', 'room', '.'], ['My', 'name', 'is', 'Thomson', 'Comer', ',', 'commander', 'of', 'the', 'Machine', 'Learning', 'program', 'at', 'Lambda', 'school', '.'], ['I', 'am', 'creating', 'the', 'curriculum', 'for', 'the', 'Machine', 'Learning', 'program', 'and', 'will', 'be', 'teaching', 'the', 'full-time', 'Machine', 'Learning', 'program', '.'], ['Machine', 'Learning', 'is', 'one', 'of', 'my', 'favorite', 'subjects', '.'], ['I', 'am', 'excited', 'about', 'taking', 'the', 'Machine', 'Learning', 'class', 'at', 'the', 'Lambda', 'school', 'starting', 'in', 'April', '.'], ['When', 'does', 'the', 'Machine', 'Learning', 'program', 'kick-off', 'at', 'Lambda', 'school', '?'], ['The', 'batter', 'hit', 'the', 'ball', 'out', 'off', 'AT', '&', 'T', 'park', 'into', 'the', 'pacific', 'ocean', '.'], ['The', 'pitcher', 'threw', 'the', 'ball', 'into', 'the', 'dug-ou

In [0]:
import gensim

# Use the Gensim document to create a dictionary - a dictionary maps every word to a number
dictionary = gensim.corpora.Dictionary(gensim_doc)
# Examine the length of the dictionary
num_of_words = len(dictionary)
print("# of words in dictionary: {}".format(num_of_words))
for index,word in dictionary.items():
    print(index,word)

# Output the string/word associated with the index
print(dictionary[5])

# Output the index associated with the string/word
print(dictionary.token2id['dog'])

# of words in dictionary: 69
0 's
1 .
2 The
3 and
4 check
5 dog
6 entered
7 if
8 in
9 owner
10 ran
11 room
12 steps
13 the
14 to
15 up
16 was
17 ,
18 Comer
19 Lambda
20 Learning
21 Machine
22 My
23 Thomson
24 at
25 commander
26 is
27 name
28 of
29 program
30 school
31 I
32 am
33 be
34 creating
35 curriculum
36 for
37 full-time
38 teaching
39 will
40 favorite
41 my
42 one
43 subjects
44 April
45 about
46 class
47 excited
48 starting
49 taking
50 ?
51 When
52 does
53 kick-off
54 &
55 AT
56 T
57 ball
58 batter
59 hit
60 into
61 ocean
62 off
63 out
64 pacific
65 park
66 dug-out
67 pitcher
68 threw
dog
5
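
As a rough illustration of what the dictionary holds, the sketch below builds a token-to-id mapping in plain Python. It is not Gensim's implementation. The function name `build_token2id` is hypothetical, and the rule it uses (new tokens get the next free id, in sorted order within each document) is an assumption that happens to be consistent with the ids printed above.

```python
def build_token2id(tokenized_docs):
    """Assign the next free integer id to each previously unseen token."""
    token2id = {}
    for tokens in tokenized_docs:
        # Within a document, unseen tokens receive ids in sorted order
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# The first and fourth documents, tokenized as in the earlier output
docs = [
    ['The', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the',
     'owner', "'s", 'room', 'to', 'check', 'if', 'the', 'owner', 'was',
     'in', 'the', 'room', '.'],
    ['Machine', 'Learning', 'is', 'one', 'of', 'my', 'favorite', 'subjects', '.'],
]

token2id = build_token2id(docs)
id2token = {index: token for token, index in token2id.items()}

print(token2id['dog'])  # 5
print(id2token[5])      # dog
```

The two lookups mirror `dictionary.token2id['dog']` and `dictionary[5]` in the cell above. Ids for tokens from the second list differ from the real dictionary because this sketch skips the documents in between.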


In [0]:
# Illustrating the concept behind bag of words
# Create bag of words to showcase term frequency
# Compare the words below to the words in the dictionary
# Words not in the dictionary are ignored
bag_of_words = dictionary.doc2bow(['Machine','Learning','is','a','lot','of','fun','.', 'Lambda', 'School', 'rocks', '.'])
print(bag_of_words)
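
To show what a bag of words amounts to, here is a minimal plain-Python sketch that counts tokens and drops any token missing from the dictionary. The function name `to_bag_of_words` is hypothetical. The `token2id` entries are the subset of the dictionary printed earlier that matters for this example.

```python
from collections import Counter

# Subset of the dictionary printed earlier (token -> id)
token2id = {'.': 1, 'Lambda': 19, 'Learning': 20, 'Machine': 21,
            'is': 26, 'of': 28, 'school': 30}

def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the dictionary are ignored."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

tokens = ['Machine', 'Learning', 'is', 'a', 'lot', 'of', 'fun', '.',
          'Lambda', 'School', 'rocks', '.']
bow = to_bag_of_words(tokens, token2id)
print(bow)  # [(1, 2), (19, 1), (20, 1), (21, 1), (26, 1), (28, 1)]
```

The lookup is case-sensitive: 'School' is dropped because only the lowercase 'school' is in the dictionary, and '.' is counted twice.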

In [0]:
# Convert the list of tokens from the gensim_document (created above) into bag of words
# The bag of words highlights the term frequency
# each element in the bag of words is the index of the word in the dictionary and the # of times it occurs
bag_of_words_corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gensim_doc]
print(bag_of_words_corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 2), (12, 1), (13, 4), (14, 1), (15, 1), (16, 1)], [(1, 1), (13, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)], [(1, 1), (3, 1), (13, 3), (20, 2), (21, 2), (29, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(1, 1), (20, 1), (21, 1), (26, 1), (28, 1), (40, 1), (41, 1), (42, 1), (43, 1)], [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)], [(13, 1), (19, 1), (20, 1), (21, 1), (24, 1), (29, 1), (30, 1), (50, 1), (51, 1), (52, 1), (53, 1)], [(1, 1), (2, 1), (13, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)], [(1, 1), (2, 1), (13, 2), (57, 1), (60, 1), (66, 1), (67, 1), (68, 1)]]


In [0]:
# Use Gensim to create a TF-IDF model
gensim_tfidf = gensim.models.TfidfModel(bag_of_words_corpus)

# Review the 5th document (index 4):
#   1. its tokens
#   2. its bag of words, i.e. the term frequencies
#   3. the TF-IDF weight of each term in the bag of words
print(gensim_doc[4])
print(bag_of_words_corpus[4])
print(gensim_tfidf[bag_of_words_corpus][4])

['I', 'am', 'excited', 'about', 'taking', 'the', 'Machine', 'Learning', 'class', 'at', 'the', 'Lambda', 'school', 'starting', 'in', 'April', '.']
[(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)]
[(1, 0.02253010613488428), (8, 0.2339027435896511), (13, 0.04506021226976856), (19, 0.16549057668178024), (20, 0.07930143947845378), (21, 0.07930143947845378), (24, 0.16549057668178024), (30, 0.16549057668178024), (31, 0.2339027435896511), (32, 0.2339027435896511), (44, 0.35085411538447664), (45, 0.35085411538447664), (46, 0.35085411538447664), (47, 0.35085411538447664), (48, 0.35085411538447664), (49, 0.35085411538447664)]
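
The weights above can be reproduced by hand. The sketch below assumes the weighting that `TfidfModel` appears to apply with its default settings: raw term count times log2(N / document frequency), followed by scaling each document vector to unit length. The function name `tfidf_weights` is hypothetical, and the corpus is copied from the bag-of-words output above.

```python
import math
from collections import Counter

# Bag-of-words corpus copied from the earlier output
bag_of_words_corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 2), (12, 1), (13, 4), (14, 1), (15, 1), (16, 1)],
    [(1, 1), (13, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)],
    [(1, 1), (3, 1), (13, 3), (20, 2), (21, 2), (29, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)],
    [(1, 1), (20, 1), (21, 1), (26, 1), (28, 1), (40, 1), (41, 1), (42, 1), (43, 1)],
    [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)],
    [(13, 1), (19, 1), (20, 1), (21, 1), (24, 1), (29, 1), (30, 1), (50, 1), (51, 1), (52, 1), (53, 1)],
    [(1, 1), (2, 1), (13, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)],
    [(1, 1), (2, 1), (13, 2), (57, 1), (60, 1), (66, 1), (67, 1), (68, 1)],
]

def tfidf_weights(bow, corpus):
    """Unit-length TF-IDF vector for one bag of words (every term must occur in the corpus)."""
    num_docs = len(corpus)
    # Each term id appears at most once per document, so this counts documents per term
    doc_freq = Counter(term_id for doc in corpus for term_id, _ in doc)
    raw = [(term_id, count * math.log2(num_docs / doc_freq[term_id]))
           for term_id, count in bow]
    norm = math.sqrt(sum(weight * weight for _, weight in raw))
    if norm == 0:
        return []
    return [(term_id, weight / norm) for term_id, weight in raw if weight > 0]

weights = dict(tfidf_weights(bag_of_words_corpus[4], bag_of_words_corpus))
print(round(weights[1], 4))   # 0.0225  ('.' occurs in 7 of the 8 documents)
print(round(weights[44], 4))  # 0.3509  ('April' occurs only in this document)
```

Terms that occur in a single document, such as 'April', receive the largest weights, while near-universal tokens such as '.' and 'the' are pushed toward zero.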


In [0]:
# Generate the TF-IDF weights for another bag of words
another_bag_of_words = dictionary.doc2bow(['Machine','Learning','rocks',';','Lambda','School','is','fun','.'])
# Print the term frequencies (tokens that are not in the dictionary are dropped)
print(another_bag_of_words)
# Print the TF-IDF weight of each remaining term
print(gensim_tfidf[another_bag_of_words])

[(1, 1), (19, 1), (20, 1), (21, 1), (26, 1)]
[(1, 0.07302714185524495), (19, 0.5364068747254807), (20, 0.2570408428370198), (21, 0.2570408428370198), (26, 0.758152169110506)]


In [0]:
# The Similarity class builds an index for a given set of documents
# Once the index is in place, queries like “Tell me how similar is this query document to each document in the index?” can be performed
# Reference: https://radimrehurek.com/gensim/similarities/docsim.html

sim_object = gensim.similarities.Similarity('/tmp/',
                                            gensim_tfidf[bag_of_words_corpus],
                                            num_features=len(dictionary))

# Query to compare: compute its bag of words (term frequencies) and its TF-IDF weights
query_to_compare = "Machine Learning at Lambda school is awesome".split()
print(query_to_compare)
query_to_compare_bow = dictionary.doc2bow(query_to_compare)
print(query_to_compare_bow)
query_to_compare_tfidf = gensim_tfidf[query_to_compare_bow]
print(query_to_compare_tfidf)


['Machine', 'Learning', 'at', 'Lambda', 'school', 'is', 'awesome']
[(19, 1), (20, 1), (21, 1), (24, 1), (26, 1), (30, 1)]
[(19, 0.4280813359945935), (20, 0.20513232136176582), (21, 0.20513232136176582), (24, 0.4280813359945935), (26, 0.6050459245253357), (30, 0.4280813359945935)]


In [0]:
# Determine document similarity
sim_object[query_to_compare_tfidf]

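# We have 8 documents in all (i.e. raw_documents), so the result holds 8 scores
# The 2nd document is the most similar (0.39); the 3rd is the least similar of the matching documents (0.06)
# The 1st, 7th and 8th documents share no tokens with the query, so they score 0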

array([0.        , 0.39228612, 0.05962985, 0.22196668, 0.24506485,
       0.31248817, 0.        , 0.        ], dtype=float32)
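
One of these scores can be checked by hand. The sketch below assumes that the similarity is the cosine between the two TF-IDF vectors. Both vectors are already unit length, so this reduces to a dot product over the term ids they share. The function name `sparse_dot` is hypothetical, and the numbers are copied from the outputs above.

```python
# TF-IDF vector of the query, copied from the earlier output
query_tfidf = [(19, 0.4280813359945935), (20, 0.20513232136176582),
               (21, 0.20513232136176582), (24, 0.4280813359945935),
               (26, 0.6050459245253357), (30, 0.4280813359945935)]

# TF-IDF vector of the 5th document (index 4), copied from the earlier output
doc_5_tfidf = [(1, 0.02253010613488428), (8, 0.2339027435896511),
               (13, 0.04506021226976856), (19, 0.16549057668178024),
               (20, 0.07930143947845378), (21, 0.07930143947845378),
               (24, 0.16549057668178024), (30, 0.16549057668178024),
               (31, 0.2339027435896511), (32, 0.2339027435896511),
               (44, 0.35085411538447664), (45, 0.35085411538447664),
               (46, 0.35085411538447664), (47, 0.35085411538447664),
               (48, 0.35085411538447664), (49, 0.35085411538447664)]

def sparse_dot(vec_a, vec_b):
    """Dot product of two sparse vectors given as (term_id, weight) pairs."""
    weights_b = dict(vec_b)
    return sum(weight * weights_b.get(term_id, 0.0) for term_id, weight in vec_a)

score = sparse_dot(query_tfidf, doc_5_tfidf)
print(round(score, 4))  # 0.2451
```

This agrees with the fifth entry of the array above (0.24506485). Only the shared terms 'Lambda', 'Learning', 'Machine', 'at' and 'school' contribute to the score.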