# Gensim 

# Overview:  
- Install Gensim
- Create tf-idf vectors from text
- Create a model of our vectors
- Load vectors using the Gensim API
- Load vectors using Gensim word2vec (KeyedVectors)

Documentation: https://tedboy.github.io/nlps/api_gensim.html    

# Gensim
Gensim is an open-source Python library for unsupervised topic modeling and natural language processing, built on modern statistical machine learning.    
Home page: https://radimrehurek.com/gensim/    
Gensim is designed to be memory efficient: it relies heavily on Python's built-in generators and iterators to stream data, so a corpus does not have to fit in memory all at once.    
Typical usage: extracting the underlying topics from large volumes of text.  

**What is term frequency–inverse document frequency (TF-IDF)?**

TF-IDF transforms text into a meaningful numeric representation and is used to extract features in NLP applications.    
The score multiplies how often a word appears in a document by the inverse document frequency of that word across a set of documents.    
Formula:  
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)      
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)      
TF-IDF(t) = TF(t) * IDF(t)      
Note that Gensim's `TfidfModel` uses a variant by default: the raw term count for TF, a base-2 logarithm for IDF, and document vectors scaled to unit length.      
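
As a rough illustration of the formula above, the sketch below computes TF, IDF, and their product for a toy corpus using only the standard library. The corpus and the helper names `tf`, `idf`, and `tf_idf_score` are made up for this example and are not part of Gensim.

```python
import math

# Toy corpus (hypothetical): each document is already a list of tokens.
docs = [
    ["i", "love", "tacos"],
    ["i", "love", "pizza"],
    ["she", "ran"],
]

def tf(term, doc):
    """Term frequency: occurrences of term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency with a natural logarithm.

    Assumes the term appears in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf_score(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

for term in docs[0]:
    print(term, round(tf_idf_score(term, docs[0], docs), 4))
```

In this toy corpus "tacos" appears in a single document, so it scores higher than "i", which appears in two.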

**What are TF-IDF vectors?**    
A TF-IDF vector holds one TF-IDF score per term in a document. Each score represents the relative importance of that term in the document, weighed against the entire corpus.   

To install Gensim:

Run in your terminal (recommended):

`$ pip install --upgrade gensim`

or, alternatively for conda environments:

`$ conda install -c conda-forge gensim`

Reference:
https://radimrehurek.com/gensim/install.html

In [1]:
!pip install --upgrade gensim



In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import gensim, logging

Define logging level. Only display warnings.

In [4]:
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.WARNING)
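
To see what a WARNING threshold does with a format string of the form `%(asctime)s: %(levelname)s: %(message)s`, the sketch below writes log records to an in-memory buffer instead of the console. The logger name `demo` and the messages are arbitrary choices for this illustration.

```python
import io
import logging

# Send log records to an in-memory buffer so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter("%(asctime)s: %(levelname)s: %(message)s")
)

# "demo" is an arbitrary logger name chosen for this sketch.
demo_logger = logging.getLogger("demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.WARNING)
demo_logger.propagate = False  # keep the demo output away from the root logger

demo_logger.info("hidden: below the WARNING threshold")
demo_logger.warning("shown: at the WARNING threshold")

print(buffer.getvalue())
```

Only the warning reaches the buffer; the info message is filtered out by the level setting.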

Steps:  
1.  Create a document
2.  Split the sentence into words (word_tokenize or Python .split())
3.  Create a corpora dictionary.  Map every word to a number.  
4.  Create a bag-of-words corpus.  Words in documents are replaced with ids from dictionary.     
5.  Create tf-idf model from corpus.  
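
Steps 2 to 4 can be sketched without any library to show what the dictionary and the bag-of-words corpus contain. This is an illustration under simplifying assumptions: tokens come from a plain whitespace split, and ids are assigned in order of first appearance, which is not necessarily the order Gensim uses. The names `token2id` and `to_bow` are chosen for this sketch.

```python
from collections import Counter

raw_documents = ["I love tacos", "I love love pizza"]

# Step 2: split each sentence into tokens (whitespace split, no NLTK).
token_lists = [text.split() for text in raw_documents]

# Step 3: map every distinct token to an integer id, in order of first appearance.
token2id = {}
for tokens in token_lists:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Step 4: replace each document with sorted (token_id, count) pairs.
def to_bow(tokens):
    """Count known tokens by id; tokens missing from the dictionary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

corpus = [to_bow(tokens) for tokens in token_lists]
print(token2id)
print(corpus)
```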

In [5]:
# 1.  Create a small document
raw_documents = ['I love tacos.',
             'She ran with the chicken.',
             'I don’t choose to take a nap. The nap chooses me.',
            'That man is nice as pie with ice cream.',
            'This pizza is an affront to nature.']

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
# 2.  Tokenize using NLTK word_tokenize
def get_tokens(text):
    tokens = word_tokenize(text)
    return tokens

In [8]:
# A Gensim document (gen_docs) is a list of tokens for each sentence.
# We could optionally make all of the tokens lower case.
gen_docs = [get_tokens(text) for text in raw_documents]
print(gen_docs)

[['I', 'love', 'tacos', '.'], ['She', 'ran', 'with', 'the', 'chicken', '.'], ['I', 'don', '’', 't', 'choose', 'to', 'take', 'a', 'nap', '.', 'The', 'nap', 'chooses', 'me', '.'], ['That', 'man', 'is', 'nice', 'as', 'pie', 'with', 'ice', 'cream', '.'], ['This', 'pizza', 'is', 'an', 'affront', 'to', 'nature', '.']]


In [9]:
# 3.  Create dictionary from the list of the document.
# The dictionary maps every word to a number
dictionary = gensim.corpora.Dictionary(gen_docs)
num_words = len(dictionary)
print("Num words in dictionary: {}".format(num_words))
for idx,word in dictionary.items():
    print(idx,word)

Num words in dictionary: 33
0 .
1 I
2 love
3 tacos
4 She
5 chicken
6 ran
7 the
8 with
9 The
10 a
11 choose
12 chooses
13 don
14 me
15 nap
16 t
17 take
18 to
19 ’
20 That
21 as
22 cream
23 ice
24 is
25 man
26 nice
27 pie
28 This
29 affront
30 an
31 nature
32 pizza


In [10]:
# To convert from token id to string we have 2 ways:
print(dictionary[22])
print(dictionary.id2token[22])

cream
cream


In [11]:
# Convert string to token id
print(dictionary.token2id['cream'])

22


In [12]:
# 4.  Create bag of words
# A bag of words holds the term frequencies (the tf part of tf-idf) as (token_id, count) pairs
# Called a "bag of words" because order is lost
# Note that "!" is not listed because it is not in the dictionary
bow_doc = dictionary.doc2bow(['I','love','love','love','tacos','!'])
print(bow_doc)

[(1, 1), (2, 3), (3, 1)]


In [13]:
# 4.  Create bag of words from gen_docs
# A corpus is a list of bags of words
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
print(gen_docs)

[[(0, 1), (1, 1), (2, 1), (3, 1)], [(0, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(0, 2), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1)], [(0, 1), (8, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1)], [(0, 1), (18, 1), (24, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]
[['I', 'love', 'tacos', '.'], ['She', 'ran', 'with', 'the', 'chicken', '.'], ['I', 'don', '’', 't', 'choose', 'to', 'take', 'a', 'nap', '.', 'The', 'nap', 'chooses', 'me', '.'], ['That', 'man', 'is', 'nice', 'as', 'pie', 'with', 'ice', 'cream', '.'], ['This', 'pizza', 'is', 'an', 'affront', 'to', 'nature', '.']]


In [14]:
# 5.  Create tf-idf model from corpus
# num_nnz is the number of non-zero entries in the corpus: the total count of (token_id, count) pairs over all documents
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)

TfidfModel<num_docs=5, num_nnz=41>
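
As a sanity check, the two numbers in this summary can be recomputed from the bag-of-words corpus printed earlier. The sketch below copies that corpus as a literal and counts its entries; the variable names are chosen for this example.

```python
# Bag-of-words corpus copied from the output printed earlier in this notebook.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
    [(0, 2), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1),
     (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1)],
    [(0, 1), (8, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1),
     (25, 1), (26, 1), (27, 1)],
    [(0, 1), (18, 1), (24, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)],
]

num_docs = len(corpus)
# One entry per distinct token in each document.
num_nnz = sum(len(bow) for bow in corpus)
# Total token occurrences, counting repeats within a document.
num_tokens = sum(count for bow in corpus for _, count in bow)

print(num_docs, num_nnz, num_tokens)
```

The entry count is smaller than the token count because "." and "nap" each occur twice in the third document.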


In [15]:
# Show document in text form, bag of words, and tf-idf
# Token ids: 0 is '.', 1 is 'I', 2 is 'love', 3 is 'tacos'
# Value for 'I' is lower because it occurs in two documents.
# Value for '.' is 0 because it occurs in all documents and log_2(1) = 0, so it is dropped.
# Vectors are normalized to unit length (the squares of the values sum to 1)
print(gen_docs[0]) # our document (one sentence)
print(corpus[0])   # the document as a bag of words
print(tf_idf[corpus][0])  # the document as a tf-idf vector

['I', 'love', 'tacos', '.']
[(0, 1), (1, 1), (2, 1), (3, 1)]
[(1, 0.37344696513776354), (2, 0.6559486886294514), (3, 0.6559486886294514)]


In [16]:
# Show bag of words and tf-idf for new document
bow = dictionary.doc2bow(['I','love','pizza','.'])
print(bow)
print(tf_idf[bow])

[(0, 1), (1, 1), (2, 1), (32, 1)]
[(1, 0.37344696513776354), (2, 0.6559486886294514), (32, 0.6559486886294514)]


In [17]:
# This is just a confirmation: build the tf-idf vector for the first document manually.
# idf_1: idf of a term that occurs in one document (like "tacos" and "love")
# idf_2: idf of a term that occurs in two documents (like "I")
from math import log
num_docs = tf_idf.num_docs
idf_1 = log(num_docs/1,2)
idf_2 = log(num_docs/2,2)
# only show nonzero values, and use numpy array
import numpy as np
v = np.array([idf_1,idf_1,idf_2])
print(v)
# normalize so the length is 1
norm_v = np.linalg.norm(v)
print(norm_v)
# Show normalized vector
print(v/norm_v)

[2.32192809 2.32192809 1.32192809]
3.539801413032522
[0.65594869 0.65594869 0.37344697]
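
The manual check above hard-codes the idf values. A slightly more general sketch derives them from the tokenized documents using only the standard library. It assumes the weighting the check relies on: raw count times log base 2 of (number of documents / document frequency), followed by scaling to unit length. The function name `tfidf_unit_vector` is made up for this example, and it assumes at least one term has a non-zero weight.

```python
import math

# Tokenized documents copied from the output printed earlier in this notebook.
gen_docs = [
    ['I', 'love', 'tacos', '.'],
    ['She', 'ran', 'with', 'the', 'chicken', '.'],
    ['I', 'don', '’', 't', 'choose', 'to', 'take', 'a', 'nap', '.',
     'The', 'nap', 'chooses', 'me', '.'],
    ['That', 'man', 'is', 'nice', 'as', 'pie', 'with', 'ice', 'cream', '.'],
    ['This', 'pizza', 'is', 'an', 'affront', 'to', 'nature', '.'],
]

def tfidf_unit_vector(doc, docs):
    """Raw count * log2(N / df), then scaled to unit Euclidean length."""
    weights = {}
    for term in set(doc):
        df = sum(1 for d in docs if term in d)
        weight = doc.count(term) * math.log2(len(docs) / df)
        if weight > 0:  # drop terms found in every document
            weights[term] = weight
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf_unit_vector(gen_docs[0], gen_docs)
for term in sorted(vector):
    print(term, round(vector[term], 4))
```

For the first document this gives about 0.3734 for "I" and 0.6559 for "love" and "tacos", in line with the values printed by the model.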


In [18]:
# Create similarity measure object in tf-idf space
# First arg is the output prefix for the index files kept in temporary external storage
# https://radimrehurek.com/gensim/similarities/docsim.html
sims = gensim.similarities.Similarity('tfidf/',tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)

Similarity<5 documents in 0 shards stored under tfidf/>


In [19]:
# Create query document and convert to tf-idf
# doc shares two words with each of first two docs in corpus
query_doc = "chicken with tacos love".split()
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

['chicken', 'with', 'tacos', 'love']
[(2, 1), (3, 1), (5, 1), (8, 1)]
[(2, 0.5484803253891997), (3, 0.5484803253891997), (5, 0.5484803253891997), (8, 0.31226270667960454)]


In [20]:
# You might have to create tfidf directory in the directory where this notebook resides.
# Uncomment the next line and run, if the following line gives you an error.
# !mkdir tfidf

In [21]:
!mkdir tfidf

mkdir: cannot create directory ‘tfidf’: File exists


In [22]:
# Show array of document similarities to query
# Documents 0 and 1 each share two words with the query.
# Document 1 scores lower: it has more terms, and its match "with" occurs in two documents, so it carries less weight.
# The fourth document (index 3) shares only one word ("with") with the query.
sims[query_doc_tf_idf]

array([0.7195499 , 0.34925455, 0.        , 0.06428327, 0.        ],
      dtype=float32)
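
The first score can be reproduced by hand as a cosine similarity between two sparse tf-idf vectors. The sketch below uses only the standard library and copies the query vector and the first document's vector from the outputs above; the helper name `cosine` is chosen for this example.

```python
import math

def cosine(sparse_a, sparse_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(sparse_a), dict(sparse_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Values copied from the outputs printed earlier in this notebook.
query = [(2, 0.5484803253891997), (3, 0.5484803253891997),
         (5, 0.5484803253891997), (8, 0.31226270667960454)]
doc_0 = [(1, 0.37344696513776354), (2, 0.6559486886294514),
         (3, 0.6559486886294514)]

print(round(cosine(query, doc_0), 4))
```

Only ids 2 ("love") and 3 ("tacos") appear in both vectors, so they alone contribute to the dot product.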

# Load vectors using gensim

Example 1: Using gensim api

In [23]:
#Load vectors using the gensim api
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")  # load pre-trained word-vectors from gensim-data

Find properties of data (similar, doesn't match, analogy)

In [24]:
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.7699
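
The idea behind this analogy query can be sketched with toy vectors: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The 2-d vectors below are invented for illustration (real GloVe vectors here have 100 dimensions), and `toy_most_similar` is a simplified stand-in, not Gensim's implementation.

```python
import math

# Hypothetical 2-d "embeddings": (royalty, gender).
toy_vectors = {
    "king":   (0.9, 0.8),
    "queen":  (0.9, -0.8),
    "man":    (0.1, 0.8),
    "woman":  (0.1, -0.8),
    "prince": (0.7, 0.8),
    "apple":  (-0.5, 0.0),
}

def unit(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def toy_most_similar(positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, unit(toy_vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(toy_vectors[word]))]
    target = unit(target)
    scores = {
        word: sum(t * x for t, x in zip(target, unit(vec)))
        for word, vec in toy_vectors.items()
        if word not in positive and word not in negative  # skip the inputs
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(toy_most_similar(positive=["woman", "king"], negative=["man"])[0][0])
```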


In [25]:
print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
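
One way to picture this query: the odd word out is the one whose vector is least aligned with the average of the group. The sketch below applies that idea to invented 2-d vectors. It is a simplified illustration under that assumption, and `toy_doesnt_match` is a made-up name rather than Gensim's code.

```python
import math

# Hypothetical 2-d vectors: three meals cluster together, "cereal" points elsewhere.
toy_vectors = {
    "breakfast": (1.0, 0.1),
    "lunch":     (1.0, 0.0),
    "dinner":    (1.0, -0.1),
    "cereal":    (0.2, 1.0),
}

def unit(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def toy_doesnt_match(words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {word: unit(toy_vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    mean = unit(tuple(sum(u[d] for u in units.values()) / len(units)
                      for d in range(dims)))
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

print(toy_doesnt_match("breakfast cereal dinner lunch".split()))
```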


In [26]:
similarity = word_vectors.similarity('woman', 'man')
similarity> .80

True

In [27]:
similarity> .90

False

In [28]:
result = word_vectors.similar_by_word("cat")
print("{}: {:.4f}".format(*result[0]))

dog: 0.8798


In [29]:
similarity = word_vectors.n_similarity(['france', 'russia'], ['paris'])
print(similarity)

0.64570683
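
A set-to-set similarity like this one can be understood as the cosine similarity between the mean vectors of the two word lists. The sketch below shows that computation on invented 2-d vectors; the numbers and the name `toy_n_similarity` are assumptions for illustration, so the result will not match the GloVe score above.

```python
import math

# Hypothetical 2-d vectors for illustration only.
toy_vectors = {
    "france": (1.0, 0.2),
    "russia": (0.2, 1.0),
    "paris":  (0.9, 0.5),
}

def mean_vector(words):
    """Component-wise average of the vectors for the given words."""
    vecs = [toy_vectors[w] for w in words]
    return tuple(sum(column) / len(vecs) for column in zip(*vecs))

def cosine(a, b):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def toy_n_similarity(words_1, words_2):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(words_1), mean_vector(words_2))

print(round(toy_n_similarity(["france", "russia"], ["paris"]), 4))
```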


Example 2: Using Gensim word2vec  

This example uses: vector_sm.txt

In [30]:
!dir vector_sm.txt

vector_sm.txt


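**The first lines of vector.txt**  
71291 300  
</s> 0.001334 0.001473 -0.001277 ...  
santamaria -0.328541 0.143057 0.200979 0.176212 -0.043703 0.132309 -0.022670 0.268999 -0.098336      
The first line is a header giving the number of vectors (71291) and the vector size (300). Every following line holds a token and then its vector components, separated by spaces.  
Once loaded, each vector is available as a 1D numpy array of dtype float32.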
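
The text layout shown above (a line with the number of vectors and the vector size, here 71291 and 300, followed by one token per line) is simple enough to parse by hand. The sketch below reads a made-up four-dimensional sample from memory; `read_word2vec_text` is a hypothetical helper for illustration, and in practice `KeyedVectors.load_word2vec_format` does this job.

```python
import io

# Made-up sample in the same layout: a header line, then one token per line.
sample = io.StringIO(
    "3 4\n"
    "</s> 0.001 0.002 -0.001 0.000\n"
    "fire 0.250 -0.420 -1.628 -1.038\n"
    "house -1.662 0.604 -3.122 1.535\n"
)

def read_word2vec_text(handle):
    """Parse the text layout into a {token: [floats]} dictionary."""
    vocab_size, vector_size = (int(n) for n in handle.readline().split())
    vectors = {}
    for line in handle:
        token, *values = line.rstrip().split(" ")
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {token!r}")
        vectors[token] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError("header count does not match the number of rows")
    return vectors

vectors = read_word2vec_text(sample)
print(len(vectors), vectors["fire"])
```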

In [31]:
from gensim.models import Word2Vec

In [32]:
# KeyedVectors is a mapping between entities and vectors. Each entity is identified by its string id,
# so it maps a str to a 1D numpy array.
# Use your own vector file below. You can also load a binary file by setting binary=True.
# Modify the next line to load your vectors (vector_sm.txt is a very small subset of the data).
model = gensim.models.KeyedVectors.load_word2vec_format('vector_sm.txt',binary=False)
#model = gensim.models.KeyedVectors.load_word2vec_format('vector.txt',binary=False)

In [33]:
print(model)

KeyedVectors<vector_size=300, 71291 keys>


Examine the vocabulary (in Gensim 4, `index_to_key` is a list of words and `key_to_index` maps each word to its row)

In [34]:
#model.index_to_key

Examine the shape of the matrix

In [35]:
model.vectors.shape

(71291, 300)

Retrieve the vectors of individual words

In [36]:
model['fire'].shape

(300,)

In [37]:
model['house'][:10]

array([-1.662234,  0.604078, -3.122537,  1.535989,  1.123752,  0.031915,
       -1.284593,  0.07871 , -0.021927,  1.184942], dtype=float32)

In [38]:
model['fire'][:10]

array([ 0.25062 , -0.420875, -1.628221, -1.038453,  0.557791,  0.478556,
        1.470237, -0.198495, -1.811532,  0.945193], dtype=float32)

Return a list of (word, similarity) tuples:    
each similar word in the vocabulary is paired with its cosine similarity to the query, sorted from most to least similar.      

In [39]:
most_similar=model.most_similar(positive=['emperor', 'woman'], negative=['man'])
print(most_similar)

[('empress', 0.49833983182907104), ('emperors', 0.4949486255645752), ('theodora', 0.4320295453071594), ('fushimi', 0.42215168476104736), ('valentinian', 0.39994004368782043), ('empresses', 0.3943521976470947), ('suiko', 0.3869629502296448), ('pope', 0.3867180049419403), ('throne', 0.3856024742126465), ('saimei', 0.3818151354789734)]


In [40]:
most_similar2 = model.most_similar('queen', topn=5)
print(most_similar2)

[('elizabeth', 0.5374570488929749), ('king', 0.49853524565696716), ('boleyn', 0.45492681860923767), ('gracen', 0.45339256525039673), ('infanta', 0.4517417848110199)]


In [41]:
distance = model.distance("media", "media")
print("{:.2f}".format(distance))

0.00


In [42]:
analogy = model.most_similar(positive=['woman','king'],negative=['man'])
print(analogy)

[('queen', 0.5036637187004089), ('marries', 0.4303651452064514), ('betrothed', 0.4289798438549042), ('consort', 0.41862085461616516), ('daughter', 0.41833433508872986), ('anjou', 0.4096809923648834), ('infanta', 0.40592390298843384), ('heiress', 0.40402916073799133), ('montferrat', 0.4031989574432373), ('isabella', 0.4021216630935669)]


In [43]:
import numpy as np
import gensim
# Get the interactive Tools for Matplotlib
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [44]:
def display_pca_scatterplot(model, words=None, sample=0):
    # Pick the words to plot: an explicit list, a random sample, or the whole vocabulary.
    # Gensim 4 exposes the vocabulary as model.index_to_key (model.vocab was removed).
    if words is None:
        if sample > 0:
            words = np.random.choice(model.index_to_key, sample)
        else:
            words = list(model.index_to_key)
    word_vectors = np.array([model[w] for w in words])
    # Project the vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
    # Connect consecutive pairs of words with a line
    for i in range(0, len(twodim), 2):
        plt.plot(twodim[:,0][i:i+2], twodim[:,1][i:i+2], 'bo-')

In [45]:
#display_pca_scatterplot
#display_pca_scatterplot(model,['king', 'man', 'woman'])
display_pca_scatterplot(model,['king', 'man', 'woman','queen','boy','prince'])

[Interactive figure: PCA scatter plot of the six selected words]