# Natural Language Processing - Gensim Python Library for Text Analysis

# Introduction - Gensim is short for "Generate Similar". It is a free, open-source Python library for NLP, used for word embeddings, topic modeling, and text similarity. It was developed to generate word and document vectors, and it can also extract topics from text documents. It is scalable, robust, fast, and platform-independent, with efficient multicore implementations.

In [2]:
# installing Gensim 

!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp39-cp39-win_amd64.whl (24.0 MB)
Collecting Cython==0.29.23
  Downloading Cython-0.29.23-cp39-cp39-win_amd64.whl (1.7 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, Cython, gensim
  Attempting uninstall: Cython
    Found existing installation: Cython 0.29.24
    Uninstalling Cython-0.29.24:
      Successfully uninstalled Cython-0.29.24
Successfully installed Cython-0.29.23 gensim-4.1.2 smart-open-5.2.1


In [4]:
# create gensim dictionary -
# first we load the text data file.

# open the text file and read its contents;
# the with-block closes the file once reading is done
with open('hamlet.txt', encoding='utf-8') as file:
    text = file.read()

# Now, we tokenize and preprocess the data using the string method split() and the simple_preprocess() function from gensim.utils.

In [5]:
# Tokenize data: strip punctuation and lowercase the text
from gensim.utils import simple_preprocess

# preprocess the file to get a list of tokens
token_list = []

for sentence in text.split('.') :
    token_list.append(simple_preprocess(sentence, deacc = True))
    
print(token_list[:2])

[['the', 'tragedy', 'of', 'hamlet', 'prince', 'of', 'denmark', 'by', 'william', 'shakespeare', 'dramatis', 'personae', 'claudius', 'king', 'of', 'denmark'], ['marcellus', 'officer']]
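To make the preprocessing step less opaque, here is a rough pure-Python approximation of what `simple_preprocess(sentence, deacc=True)` does: lowercase, strip accents, keep alphabetic tokens, and drop tokens that are too short or too long. The helper name `rough_preprocess` is hypothetical, and the length limits of 2 and 15 are taken from gensim's documented defaults. It is a sketch for intuition, not gensim's actual implementation.

```python
import re
import unicodedata


def rough_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess(text, deacc=True)."""
    # Lowercase, then strip accent marks by dropping combining characters.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep runs of letters only; punctuation and digits act as separators.
    tokens = re.findall(r"[^\W\d_]+", no_accents)
    # Drop very short and very long tokens.
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


sample = "Claudius, King of Denmark. Marcellus, Officer."
sentences = [rough_preprocess(s) for s in sample.split(".")]
print(sentences)
# [['claudius', 'king', 'of', 'denmark'], ['marcellus', 'officer'], []]
```

Note the trailing empty list: splitting on `'.'` leaves an empty string after the final period, so the token list can contain empty "sentences".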


# After tokenization and preprocessing, we create a gensim Dictionary object from the tokenized text.




In [8]:
# import gensim corpora
from gensim import corpora

#storing the extracted tokens into the dictionary
my_dictionary = corpora.Dictionary(token_list)

#print the dictionary
print(my_dictionary)

Dictionary(4593 unique tokens: ['by', 'claudius', 'denmark', 'dramatis', 'hamlet']...)
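Conceptually, the dictionary is a mapping from each unique token to an integer id. The sketch below builds such a mapping in plain Python. The helper `build_token_ids` and the toy documents are hypothetical, and the choice to hand out ids in alphabetical order within each document is an assumption inferred from the ordering visible in the output above.

```python
def build_token_ids(tokenized_docs):
    """Assign an integer id to each unique token (hypothetical helper)."""
    token2id = {}
    for doc in tokenized_docs:
        # New tokens from one document get ids in alphabetical order
        # (an assumption based on the ordering in the printed dictionary).
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


docs = [["king", "of", "denmark", "of"], ["marcellus", "officer", "of"]]
token2id = build_token_ids(docs)
print(token2id)
# {'denmark': 0, 'king': 1, 'of': 2, 'marcellus': 3, 'officer': 4}
```

A token that already has an id (here `of`) keeps it when it shows up again in a later document.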


# Now, we will see how to save and load the dictionary object.

In [9]:
# save your dictionary to disk 
my_dictionary.save('dictionary.dict')

#load back
load_dict = corpora.Dictionary.load('dictionary.dict')
print(load_dict)

Dictionary(4593 unique tokens: ['by', 'claudius', 'denmark', 'dramatis', 'hamlet']...)


# Bag of Words
The bag-of-words (BoW) model is the simplest way of extracting features from text. BoW converts text into a matrix of word occurrence counts per document. The model records which words occur in a document and how often, while ignoring grammar and word order.
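As a minimal sketch of the idea in plain Python, the snippet below turns tokenized documents into lists of (token id, count) pairs. The helper `to_bow`, the toy vocabulary, and the toy documents are all hypothetical and only meant to illustrate the representation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    # Tokens missing from the vocabulary are ignored.
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())


# Hypothetical toy vocabulary and documents, not taken from hamlet.txt.
token2id = {"denmark": 0, "king": 1, "of": 2, "prince": 3}
docs = [["prince", "of", "denmark", "king", "of", "denmark"], ["king", "ghost"]]

bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(bow_corpus)
# [[(0, 2), (1, 1), (2, 2), (3, 1)], [(1, 1)]]
```

Word order is lost: the first document is reduced to "denmark twice, king once, of twice, prince once".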

# Let’s create a bag of words for each tokenized sentence using the doc2bow() function. The result is a list of (token id, frequency) pairs for each sentence.

In [10]:
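# the dictionary already contains every token, so allow_update is not needed;
# enabling it would update the dictionary's document statistics a second time
BoW_corpus = [my_dictionary.doc2bow(sent) for sent in token_list]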
print(BoW_corpus[:2])

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 3), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(13, 1), (14, 1)]]


# We can see the index and frequency of each token. To replace each index with its token, you can try the following script:

In [11]:
# word weight in bag of words corpus

word_weight = []

for doc in BoW_corpus:
    # token_id avoids shadowing Python's built-in id()
    for token_id, freq in doc:
        word_weight.append([my_dictionary[token_id], freq])
        
print(word_weight[:10])

[['by', 1], ['claudius', 1], ['denmark', 2], ['dramatis', 1], ['hamlet', 1], ['king', 1], ['of', 3], ['personae', 1], ['prince', 1], ['shakespeare', 1]]


# TF-IDF
In Term Frequency (TF), you simply count how many times each word occurs in a document. The main issue with raw term frequency is that it gives more weight to longer documents. Term frequency is basically the output of the BoW model.
IDF (Inverse Document Frequency) measures how much information a given word provides across the corpus. It is the logarithmically scaled ratio of the total number of documents to the number of documents that contain the word.
TF-IDF (Term Frequency-Inverse Document Frequency) normalizes the document-term matrix. It is the product of TF and IDF. A word with a high TF-IDF score in a document occurs frequently in that document and is rare or absent in the other documents.
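The definitions above can be sketched in plain Python as follows. This uses the textbook weight `tf * log2(N / df)` followed by cosine (L2) normalization. The function `tf_idf` and the toy corpus are hypothetical, and gensim's weighting schemes may differ in detail (for example in the exact IDF formula), so the numbers are not expected to match gensim's output.

```python
import math


def tf_idf(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each token id appears.
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        # Cosine (L2) normalization so that long documents do not dominate.
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(tid, w / norm) for tid, w in weights]
        weighted_corpus.append(weights)
    return weighted_corpus


# Hypothetical toy corpus of (token_id, count) pairs.
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3), (3, 1)]]
for doc in tf_idf(toy_corpus):
    print([(tid, round(w, 3)) for tid, w in doc])
```

In the first toy document, token 0 occurs in only one document while token 1 occurs in two, so token 0 ends up with the larger weight. A token that appears in every document gets an IDF of `log2(1) = 0` and therefore a weight of zero.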

In [15]:
from gensim import models
import numpy as np

# create TF-IDF model
# smartirs='ntc': raw term frequency (n), idf weighting (t), cosine normalization (c)
tfIdf = models.TfidfModel(BoW_corpus, smartirs='ntc')

# TF-IDF word weight
weight_tfidf = []
for doc in tfIdf[BoW_corpus]:
    for token_id, tf_idf in doc:
        # look up the token for this id and round its weight to 3 decimals
        weight_tfidf.append([my_dictionary[token_id], np.around(tf_idf, decimals=3)])

print(weight_tfidf[:10])

[['by', 0.146], ['claudius', 0.31], ['denmark', 0.407], ['dramatis', 0.339], ['hamlet', 0.142], ['king', 0.117], ['of', 0.241], ['personae', 0.339], ['prince', 0.272], ['shakespeare', 0.339]]
