### ***Get familiar with `Gensim`!***
---

##### **The motivation for this notebook is to work with the Gensim library and understand how to use it effectively for topic modelling and other tasks.**

#### **Notebook Contents**

1. [Import NLP Packages](#Import_NLP_packages)

#### **Import_NLP_packages**

In [1]:
# Basic libraries
import os
import re

# Data wrangling and visualization libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
## Easy, fast, interactive 3D visualizations
import babyplots

# Scientific python package
# import scipy

# Glob is used for importing the filenames along with their paths
from glob import glob

## NLP modules
# most popular in the Python domain, with more lexical resources
import nltk 

# includes Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), 
# Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning
import gensim 

# developed by Stanford and built on Java; supports several languages apart from English,
# including Arabic, Chinese, German, French, and Spanish.
# import corenlp 

# equipped with pre-trained statistical models and word vectors; 
# and can support tokenization for over 49 languages
# import spacy 

# pronounced as pineapple; can be used for basic NLP tasks like n-grams etc.
# PyNLPl is also an extensive library for working with FoLiA XML (Format for Linguistic Annotation).
# import pynlpl 

## Spell Correction
### Pyspellchecker, Also pattern & textblob 
from spellchecker import SpellChecker

# Stopwords
from nltk.corpus import stopwords

# Tokenizer
from nltk.tokenize import sent_tokenize

# PlaintextCorpusReader is used for reading the data of plain-text files
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Stemmers
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# NLTK Wordnet Lemmatizer
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
# TextBlob Lemmatizer
import textblob
from textblob import TextBlob, Word

## CountVectoriser
from sklearn.feature_extraction.text import CountVectorizer

## Tfidf
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

## Word2Vec
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

#### **Q1.**
#### **How to create dictionary from a list of sentences?**

In [100]:
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim import models
import gensim.downloader as api # Work with Gensim Downloader API
from smart_open import open as s_open

In [3]:
# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

documents_2 = ["One source says the report will likely conclude that", 
                "the operation was carried out without clearance and", 
                "transparency and that those involved will be held", 
                "responsible. One of the sources acknowledged that the", 
                "report is still being prepared and cautioned that", 
                "things could change."]

In [4]:
# Tokenize(split) the sentences into words
texts = [[text for text in doc.split()] for doc in documents]
print(len(texts),texts)

# Create dictionary
dictionary = corpora.Dictionary(texts)
print('\n',dictionary)

4 [['The', 'Saudis', 'are', 'preparing', 'a', 'report', 'that', 'will', 'acknowledge', 'that'], ['Saudi', 'journalist', 'Jamal', "Khashoggi's", 'death', 'was', 'the', 'result', 'of', 'an'], ['interrogation', 'that', 'went', 'wrong,', 'one', 'that', 'was', 'intended', 'to', 'lead'], ['to', 'his', 'abduction', 'from', 'Turkey,', 'according', 'to', 'two', 'sources.']]

 Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


In [10]:
# Number of processed docs
dictionary.num_docs

4

In [15]:
# Dictionary tokens and their respective IDs
dictionary.token2id

{'Saudis': 0,
 'The': 1,
 'a': 2,
 'acknowledge': 3,
 'are': 4,
 'preparing': 5,
 'report': 6,
 'that': 7,
 'will': 8,
 'Jamal': 9,
 "Khashoggi's": 10,
 'Saudi': 11,
 'an': 12,
 'death': 13,
 'journalist': 14,
 'of': 15,
 'result': 16,
 'the': 17,
 'was': 18,
 'intended': 19,
 'interrogation': 20,
 'lead': 21,
 'one': 22,
 'to': 23,
 'went': 24,
 'wrong,': 25,
 'Turkey,': 26,
 'abduction': 27,
 'according': 28,
 'from': 29,
 'his': 30,
 'sources.': 31,
 'two': 32}

In [14]:
# Document frequencies
dictionary.dfs

{1: 1,
 0: 1,
 4: 1,
 5: 1,
 2: 1,
 6: 1,
 7: 2,
 8: 1,
 3: 1,
 11: 1,
 14: 1,
 9: 1,
 10: 1,
 13: 1,
 18: 2,
 17: 1,
 16: 1,
 15: 1,
 12: 1,
 20: 1,
 24: 1,
 25: 1,
 22: 1,
 19: 1,
 23: 2,
 21: 1,
 30: 1,
 27: 1,
 29: 1,
 26: 1,
 28: 1,
 32: 1,
 31: 1}

In [25]:
# Count the unique words manually to cross-check the number of tokens in corpora.Dictionary
words = []
all_words = []
for idx,sent in enumerate(texts):
    print("Total words in doc-{} are {}".format(idx+1,len(sent)))
    for word in sent:
        all_words.append(word)
        if word not in words:
            words.append(word)
            
print("\nTotal Unique words :", len(words))
print("\nTotal Processed words :", len(all_words))

Total words in doc-1 are 10
Total words in doc-2 are 10
Total words in doc-3 are 10
Total words in doc-4 are 9

Total Unique words : 33

Total Processed words : 39


In [23]:
# Number of processed words -- matched with above results
dictionary.num_pos

39

In [26]:
# Sum of the unique words in each document (number of non-zero entries in the BoW matrix)
dictionary.num_nnz

36
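
The three counters above can be cross-checked by hand. The sketch below recomputes them in plain Python for the same four sentences. The names `num_pos`, `num_nnz` and `doc_freq` are chosen here only to mirror the gensim attributes; this is an illustration, not gensim's implementation.

```python
from collections import Counter

documents = ["The Saudis are preparing a report that will acknowledge that",
             "Saudi journalist Jamal Khashoggi's death was the result of an",
             "interrogation that went wrong, one that was intended to lead",
             "to his abduction from Turkey, according to two sources."]
texts = [doc.split() for doc in documents]

# Total number of processed tokens (compare with dictionary.num_pos)
num_pos = sum(len(doc) for doc in texts)

# Sum of the unique tokens of each document (compare with dictionary.num_nnz)
num_nnz = sum(len(set(doc)) for doc in texts)

# Number of documents each token appears in (compare with dictionary.dfs,
# which is keyed by token id rather than by the token itself)
doc_freq = Counter(token for doc in texts for token in set(doc))

print(len(texts), num_pos, num_nnz, len(doc_freq))
print({token: df for token, df in doc_freq.items() if df > 1})
```

`num_nnz` is smaller than `num_pos` because a token that is repeated within one document is counted only once for that document.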

#### **Q2.**
#### **Can we add the words/tokens of new documents to an existing dictionary?**

In [27]:
documents_2 = ["The intersection graph of paths in trees",
               "Graph minors IV Widths of trees and well quasi ordering",
               "Graph minors A survey"]

texts_2 = [[text for text in doc.split()] for doc in documents_2]

dictionary.add_documents(texts_2)

In [28]:
print(dictionary)

Dictionary(48 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)


In [29]:
dictionary.token2id

{'Saudis': 0,
 'The': 1,
 'a': 2,
 'acknowledge': 3,
 'are': 4,
 'preparing': 5,
 'report': 6,
 'that': 7,
 'will': 8,
 'Jamal': 9,
 "Khashoggi's": 10,
 'Saudi': 11,
 'an': 12,
 'death': 13,
 'journalist': 14,
 'of': 15,
 'result': 16,
 'the': 17,
 'was': 18,
 'intended': 19,
 'interrogation': 20,
 'lead': 21,
 'one': 22,
 'to': 23,
 'went': 24,
 'wrong,': 25,
 'Turkey,': 26,
 'abduction': 27,
 'according': 28,
 'from': 29,
 'his': 30,
 'sources.': 31,
 'two': 32,
 'graph': 33,
 'in': 34,
 'intersection': 35,
 'paths': 36,
 'trees': 37,
 'Graph': 38,
 'IV': 39,
 'Widths': 40,
 'and': 41,
 'minors': 42,
 'ordering': 43,
 'quasi': 44,
 'well': 45,
 'A': 46,
 'survey': 47}

#### **Q3.**
#### **How to create a dictionary of tokens/words directly from a single text file?**

In [33]:
# Creating a gensim dictionary from a single text file.
## Here, I'm using the smart_open package, which lets us read a file one line at a time without keeping the whole file in memory.
## This is memory efficient when working with huge files.

## min_len and max_len keep only the words/tokens whose length falls within that range.
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True, min_len=5, max_len=8)\
                                for line in s_open('sample.txt', encoding='utf-8'))

# Token to Id mapping
dictionary.token2id

{'chinaa': 0,
 'chinese': 1,
 'force': 2,
 'peoplea': 3,
 'recently': 4,
 'rocket': 5,
 'think': 6,
 'daily': 7,
 'private': 8,
 'reported': 9,
 'saturday': 10,
 'ceremony': 11,
 'experts': 12,
 'founding': 13,
 'letters': 14,
 'other': 15,
 'received': 16,
 'science': 17,
 'zhang': 18,
 'defense': 19,
 'national': 20,
 'panel': 21,
 'report': 22,
 'fellow': 23,
 'honored': 24,
 'œrocket': 25,
 'conduct': 26,
 'design': 27,
 'fields': 28,
 'members': 29,
 'overall': 30,
 'research': 31,
 'serve': 32,
 'which': 33,
 'missile': 34,
 'missiles': 35,
 'network': 36,
 'system': 37,
 'years': 38,
 'enjoy': 39,
 'firms': 40,
 'owned': 41,
 'state': 42,
 'their': 43,
 'china': 44,
 'civilian': 45,
 'marks': 46,
 'military': 47,
 'better': 48,
 'combat': 49,
 'could': 50,
 'forcea': 51}
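
The streaming idea can be sketched with the standard library alone. In the example below, `stream_tokens` is a hypothetical helper and the regular expression is only a rough stand-in for `simple_preprocess`; the temporary file and its two lines are made up for illustration. Ids are assigned in order of first appearance, which is a simplification and differs from the id order gensim produces above.

```python
import os
import re
import tempfile

def stream_tokens(path, min_len=5, max_len=8):
    """Yield one token list per line, reading the file lazily."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # only one line is held in memory at a time
            words = re.findall(r"[a-z]+", line.lower())
            yield [w for w in words if min_len <= len(w) <= max_len]

# Write a tiny throwaway file so the sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Rocket science is hard\nPrivate rocket firms\n")
    sample_path = tmp.name

# Build the token -> id mapping incrementally, one line at a time
token2id = {}
for tokens in stream_tokens(sample_path):
    for token in tokens:
        token2id.setdefault(token, len(token2id))

os.remove(sample_path)
print(token2id)
```

Because `stream_tokens` is a generator, the file size does not affect memory use; only the growing `token2id` mapping is kept.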

#### **Q4.**
#### **How to create a dictionary of tokens/words directly from more than one text file?**

In [45]:
# Dictionary from more than one file
class ReadTxtFiles(object):
    """
    Description: This class reads multiple text files line by line and yields the tokens of each line for building the dictionary.
    """
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Here, traversing the docs one by one and creating the tokens
        for fname in os.listdir(self.dirname):
            for sentence in s_open(os.path.join(self.dirname, fname), encoding='latin'):
                doc_tokens = simple_preprocess(sentence,deacc=True,min_len=7,max_len=8)
                yield doc_tokens

path_to_text_directory = "E:\\STUDY\\PROJECTS\\AAIC_Practice\\MODULES\\Datasets\\NLP_datasets\\"

print("Files/Docs in the given location",os.listdir(path_to_text_directory))

Files/Docs in the given location ['doc1.txt', 'doc2.txt', 'file_manipulations.txt', 'x_doc3.txt']


In [46]:
dictionary = corpora.Dictionary(ReadTxtFiles(path_to_text_directory))

In [47]:
dictionary

<gensim.corpora.dictionary.Dictionary at 0x134942d0dd8>

In [48]:
print(dictionary)

Dictionary(99 unique tokens: ['greatest', 'mountain', 'projects', 'science', 'working']...)


In [49]:
dictionary.token2id

{'greatest': 0,
 'mountain': 1,
 'projects': 2,
 'science': 3,
 'working': 4,
 'learning': 5,
 'machine': 6,
 'dynamic': 7,
 'service': 8,
 'stylish': 9,
 'dominoz': 10,
 'academy': 11,
 'account': 12,
 'airflow': 13,
 'american': 14,
 'articles': 15,
 'authors': 16,
 'bankhead': 17,
 'building': 18,
 'burnout': 19,
 'chairman': 20,
 'changed': 21,
 'children': 22,
 'chinese': 23,
 'closely': 24,
 'computer': 25,
 'continue': 26,
 'control': 27,
 'corridor': 28,
 'dramatic': 29,
 'droplet': 30,
 'droplets': 31,
 'edition': 32,
 'enforced': 33,
 'experts': 34,
 'explores': 35,
 'findings': 36,
 'floating': 37,
 'forming': 38,
 'further': 39,
 'general': 40,
 'hundreds': 41,
 'identify': 42,
 'improves': 43,
 'infected': 44,
 'insights': 45,
 'involved': 46,
 'involves': 47,
 'journal': 48,
 'laughing': 49,
 'laughter': 50,
 'lingers': 51,
 'medicine': 52,
 'occurred': 53,
 'pandemic': 54,
 'patient': 55,
 'patients': 56,
 'pattern': 57,
 'patterns': 58,
 'persons': 59,
 'physical': 60,


#### **Q5.**
#### **How to create a BOW or CORPUS from a list of two sentences?**

In [52]:
# CORPUS from List having 2 sentences
my_docs = ["Who let the dogs out? out?",
           "Who? Who? Who? Who?"]

# Tokenize the docs
tokenized_list = [simple_preprocess(doc,deacc=True) for doc in my_docs]
print(tokenized_list)

# Create the Dictionary
mydict = corpora.Dictionary(tokenized_list)
print('\n',mydict)
print(mydict.token2id)

[['who', 'let', 'the', 'dogs', 'out', 'out'], ['who', 'who', 'who', 'who']]

 Dictionary(5 unique tokens: ['dogs', 'let', 'out', 'the', 'who'])
{'dogs': 0, 'let': 1, 'out': 2, 'the': 3, 'who': 4}


In [54]:
# Creating the BOW or CORPUS from the above created dictionary
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]
print(mycorpus)

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1)], [(4, 4)]]


#### **The output above gives the number of occurrences of each word in each sentence. For example, (2, 2) means that 'out' (id 2) occurs twice in the first sentence.**

In [56]:
# Here, just replaced the token-ids with their actual names
word_counts = [[(mydict[id], count) for id, count in line] for line in mycorpus]
print(word_counts)

[[('dogs', 1), ('let', 1), ('out', 2), ('the', 1), ('who', 1)], [('who', 4)]]
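
To make the bag-of-words format concrete, here is a minimal plain-Python sketch that reproduces the corpus above. The regular-expression tokenizer is an assumed stand-in for `simple_preprocess`, and assigning new ids in alphabetical order within each document is an assumption chosen because it matches the `token2id` output shown above.

```python
import re
from collections import Counter

my_docs = ["Who let the dogs out? out?",
           "Who? Who? Who? Who?"]

# Stand-in tokenizer (assumption): lowercase alphabetic tokens only
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in my_docs]

token2id = {}
bow_corpus = []
for tokens in tokenized:
    counts = Counter(tokens)
    # New tokens get the next free id, visited in alphabetical order
    for token in sorted(counts):
        token2id.setdefault(token, len(token2id))
    # A bag of words is a sorted list of (token_id, count) pairs
    bow_corpus.append(sorted((token2id[t], c) for t, c in counts.items()))

print(token2id)
print(bow_corpus)
```

Word order is discarded: the second document is fully described by the single pair `(4, 4)`.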


#### **Q6.**
#### **How to build the BOW or CORPUS from single file using streaming?**

In [71]:
# Build the CORPUS from files using streaming
stop_words = stopwords.words('english')

class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        for line in s_open(self.filepath, encoding='latin'):
            print('Raw Text --',line)
            # tokenize
            tokenized_list = simple_preprocess(line, deacc=True,min_len=6,max_len=7)
            print('Tokens --',tokenized_list)

            # create bag of words; allow_update=True also adds new tokens to
            # self.dictionary, which is the same object as mydict here, so no
            # separate merge step is needed
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)
            print('BOW --',bow,'\n')

            # lazy return the BoW
            yield bow

# Create the Dictionary
mydict = corpora.Dictionary()

# Create the Corpus
bow_corpus = BoWCorpus('sample.txt', dictionary=mydict) # memory friendly

# Iterate once over the corpus; __iter__ prints the raw text, tokens and BoW for each line.
for line in bow_corpus:
    pass

Raw Text -- The think tank of ChinaÃ¢â¬â¢s PeopleÃ¢â¬â¢s Liberation Army Rocket Force recently recruited 13 Chinese technicians

Tokens -- ['chinaa', 'peoplea', 'rocket', 'chinese']
BOW -- [(0, 1), (1, 1), (2, 1), (3, 1)] 

Raw Text --  from private companies, PLA Daily reported on Saturday.

Tokens -- ['private']
BOW -- [(4, 1)] 

Raw Text -- Zhang Hao and 12 other science and technology experts received letters of appointment at the founding ceremony of

Tokens -- ['science', 'experts', 'letters']
BOW -- [(5, 1), (6, 1), (7, 1)] 

Raw Text -- the PLA Rocket Force national defense science and technology experts panel, according to a report published by the

Tokens -- ['rocket', 'defense', 'science', 'experts', 'report']
BOW -- [(3, 1), (5, 1), (7, 1), (8, 1), (9, 1)] 

Raw Text -- PLA Daily on Saturday.

Tokens -- []
BOW -- [] 

Raw Text -- Honored as Ã¢â¬Årocket force science and technology experts,Ã¢â¬Â Zhang and his fellow experts from private companies will

Tokens -- ['ho

#### **The output above shows the tokens and BoW created for each line of raw text.**

In [67]:
print(bow_corpus.dictionary)

Dictionary(23 unique tokens: ['chinaa', 'chinese', 'peoplea', 'rocket', 'private']...)


In [68]:
bow_corpus.filepath

'sample.txt'

In [72]:
mydict

<gensim.corpora.dictionary.Dictionary at 0x134943017b8>

In [73]:
mydict.token2id

{'chinaa': 0,
 'chinese': 1,
 'peoplea': 2,
 'rocket': 3,
 'private': 4,
 'experts': 5,
 'letters': 6,
 'science': 7,
 'defense': 8,
 'report': 9,
 'fellow': 10,
 'honored': 11,
 'conduct': 12,
 'design': 13,
 'fields': 14,
 'members': 15,
 'overall': 16,
 'missile': 17,
 'network': 18,
 'system': 19,
 'better': 20,
 'combat': 21,
 'forcea': 22}
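
`BoWCorpus` is written as a class with `__iter__` rather than as a plain generator. A likely reason is that a streamed corpus is often read more than once, for example printed first and saved or used for training later. The sketch below, with hypothetical names and made-up lines, shows the difference between the two approaches.

```python
def line_tokens_once(lines):
    """Generator function: the returned generator can be consumed only once."""
    for line in lines:
        yield line.lower().split()

class LineTokens:
    """Iterable class: every call to __iter__ starts a fresh pass over the data."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()

lines = ["Rocket Force experts", "Science and technology"]

one_shot = line_tokens_once(lines)
first_pass = list(one_shot)
second_pass = list(one_shot)      # empty: the generator is exhausted

restartable = LineTokens(lines)
pass_a = list(restartable)
pass_b = list(restartable)        # same content again

print(len(first_pass), len(second_pass), len(pass_a), len(pass_b))
```

The second pass over the plain generator silently yields nothing, which is easy to miss; the class-based iterable avoids that while still reading lazily.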

#### **Q7.**
#### **How to save the gensim dictionary and CORPUS?**

In [75]:
# Save the Dict and Corpus
mydict.save('mydict.dict')  # save dict to disk
corpora.MmCorpus.serialize('bow_corpus.mm', bow_corpus)  # save corpus to disk

# We have saved the dictionary and corpus objects. Let’s load them back.

# Load them back
loaded_dict = corpora.Dictionary.load('mydict.dict')

corpus = corpora.MmCorpus('bow_corpus.mm')
for line in corpus:
    line

Raw Text -- The think tank of ChinaÃ¢â¬â¢s PeopleÃ¢â¬â¢s Liberation Army Rocket Force recently recruited 13 Chinese technicians

Tokens -- ['chinaa', 'peoplea', 'rocket', 'chinese']
BOW -- [(0, 1), (1, 1), (2, 1), (3, 1)] 

Raw Text --  from private companies, PLA Daily reported on Saturday.

Tokens -- ['private']
BOW -- [(4, 1)] 

Raw Text -- Zhang Hao and 12 other science and technology experts received letters of appointment at the founding ceremony of

Tokens -- ['science', 'experts', 'letters']
BOW -- [(5, 1), (6, 1), (7, 1)] 

Raw Text -- the PLA Rocket Force national defense science and technology experts panel, according to a report published by the

Tokens -- ['rocket', 'defense', 'science', 'experts', 'report']
BOW -- [(3, 1), (5, 1), (7, 1), (8, 1), (9, 1)] 

Raw Text -- PLA Daily on Saturday.

Tokens -- []
BOW -- [] 

Raw Text -- Honored as Ã¢â¬Årocket force science and technology experts,Ã¢â¬Â Zhang and his fellow experts from private companies will

Tokens -- ['ho

#### **Q8.**
#### **How to create a TF-IDF model with Gensim?**

In [85]:
# Create TF-IDF model
documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the word counts in the corpus
print("Doc wise BOW-Corpus\n")
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Doc wise BOW-Corpus

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]


In [79]:
# Inverse doc freq
tfidf.idfs

{0: 2.0,
 1: 1.0,
 2: 2.0,
 3: 1.0,
 4: 0.41503749927884376,
 5: 2.0,
 6: 2.0,
 7: 2.0,
 8: 2.0}

In [80]:
# Doc freq
tfidf.dfs

{0: 1, 1: 2, 2: 1, 3: 2, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1}

In [87]:
mydict.token2id

{'first': 0,
 'is': 1,
 'line': 2,
 'the': 3,
 'this': 4,
 'second': 5,
 'sentence': 6,
 'document': 7,
 'third': 8}

In [83]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
 [(4, 1), (7, 1), (8, 1)]]

In [97]:
tfidf[corpus][0]

[(0, 0.6270776935874319),
 (1, 0.31353884679371596),
 (2, 0.6270776935874319),
 (3, 0.31353884679371596),
 (4, 0.1301303789000364)]

In [99]:
# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

[['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
[['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
[['this', 0.15], ['document', 0.7], ['third', 0.7]]


#### **The output above shows the TF-IDF weight of each word in each document of the corpus.**
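
The weights can be reproduced by hand, which helps in reading `smartirs='ntc'`: in SMART notation, `n` is the raw term frequency, `t` is an idf weight, and `c` is cosine (unit length) normalization. The idf formula below, log2((N + 1) / df), is an assumption inferred from the `tfidf.idfs` values printed above, not taken from the library source; it may differ between gensim versions.

```python
import math
from collections import Counter

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]
tokenized = [doc.lower().split() for doc in documents]
num_docs = len(tokenized)

# Document frequency of each token
doc_freq = Counter(token for doc in tokenized for token in set(doc))

# Assumed idf formula, chosen because it reproduces tfidf.idfs above
idf = {token: math.log2((num_docs + 1) / df) for token, df in doc_freq.items()}

def tfidf_weights(tokens):
    """Raw term frequency times idf, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = {token: count * idf[token] for token, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {token: w / norm for token, w in raw.items()}

for doc in tokenized:
    print({token: round(w, 2) for token, w in tfidf_weights(doc).items()})
```

A word such as "this", which occurs in every document, gets the lowest idf and therefore the lowest weight, while words unique to one document dominate it.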

#### **Q9.**
#### **How can we use the downloader API of gensim?**

In [101]:
# Get information about the model or dataset
api.info('glove-wiki-gigaword-50')

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

In [102]:
# Download
w2v_model = api.load("glove-wiki-gigaword-50") # this will take some time
w2v_model.most_similar('blue')

[('red', 0.8901657462120056),
 ('black', 0.8648406863212585),
 ('pink', 0.845291793346405),
 ('green', 0.8346815705299377),
 ('yellow', 0.8320707082748413),
 ('purple', 0.8293111324310303),
 ('white', 0.8225342035293579),
 ('orange', 0.8114303350448608),
 ('bright', 0.799933910369873),
 ('colored', 0.787665605545044)]

#### **Q10.**
#### **How to create n-grams such as bigrams or trigrams using gensim?**

In [103]:
# Build n-grams
dataset = api.load("text8") # this will take some time

In [105]:
dataset = [wd for wd in dataset]

dct = corpora.Dictionary(dataset)
corpus = [dct.doc2bow(line) for line in dataset]

# Build the bigram models
bigram = models.phrases.Phrases(dataset, min_count=10, threshold=20)

In [124]:
# Construct bigram
print(bigram[dataset[0]][0:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution']


In [125]:
# Build the trigram models
trigram = gensim.models.phrases.Phrases(bigram[dataset], threshold=50)

In [126]:
# Construct trigram
print(trigram[bigram[dataset[0]]][0:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution']
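
To give a feel for what `min_count` and `threshold` control, here is a simplified sketch of collocation scoring on a made-up toy corpus. The scoring function is an assumption in the spirit of the word2vec phrase-detection heuristic (pair count minus `min_count`, divided by the product of the word counts, scaled by vocabulary size). It is not gensim's code, and the greedy merge is a simplification.

```python
from collections import Counter

sentences = [["new", "york", "is", "big"],
             ["i", "love", "new", "york"],
             ["new", "york", "city"],
             ["the", "city", "is", "big"]]

unigrams = Counter(word for sent in sentences for word in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)

def phrase_score(a, b, min_count):
    """Assumed scoring: pairs that co-occur often relative to their parts score high."""
    return (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * vocab_size

def join_phrases(sentence, min_count=2, threshold=0.5):
    """Greedily merge adjacent known words whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(sentence):
        if (i + 1 < len(sentence)
                and phrase_score(sentence[i], sentence[i + 1], min_count) > threshold):
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

print(join_phrases(["i", "love", "new", "york"]))
```

A higher threshold merges fewer pairs. In the output above, the first 20 tokens are identical after the trigram step, which suggests that no additional merge passed `threshold=50` in that span.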


#### **Q11.**
#### **How to create a Word2Vec (W2V) model using gensim?**

- ML/DL algorithms cannot process raw text directly, so we need a numerical representation that these algorithms can work with. Simple machine learning applications use CountVectorizer and TF-IDF, which do not preserve the relationships between words.

- **Word2Vec**: A method for generating word embeddings, which map the words of a vocabulary into a vector space of a given dimension. We can perform mathematical operations on these vectors, and those operations reflect the relationships between the words.

#### Example: queen - woman + man ≈ king

- Pre-built word embedding models such as **word2vec, GloVe, fastText** etc. can be downloaded using the Gensim downloader API. A pre-trained model may not contain embeddings for certain words in your documents, in which case you can train your own model.
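
The vector arithmetic behind such analogies can be sketched with hand-made vectors. The 2-D vectors below are hypothetical and chosen so that the arithmetic works out exactly; real embeddings are learned, have many more dimensions, and only satisfy such relations approximately.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":   (1.0,  1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0,  1.0),
    "woman":  (0.0, -1.0),
    "prince": (0.8,  0.9),
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["queen", "man"], negative=["woman"]))
```

With a trained gensim model, the `most_similar` method used later in this notebook accepts `positive` and `negative` word lists for the same kind of query.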

In [127]:
from multiprocessing import cpu_count
from gensim.models.word2vec import Word2Vec

In [129]:
# load the text8 dataset
api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [130]:
# We will split the data into two parts
data_1 = dataset[:1200]   # this is used to train the model
data_2 = dataset[1200:]   # this part will be used to update the model
  
# Training the Word2Vec model
w2v_model = Word2Vec(data_1, min_count = 10, workers = cpu_count())
  
# word vector for the word "time"
w2v_model.wv['time']

[ 7.7323055e-01 -7.8865415e-01 -1.2822171e+00 -9.8733413e-01
 -1.9855139e+00 -3.9086100e-01  7.1375251e-01 -9.3874671e-02
  6.3222629e-01 -1.7258669e+00 -1.2362329e+00  3.6021248e-03
  8.4651822e-01  9.1972429e-01 -4.5277666e-02 -1.2424864e+00
 -3.3635161e+00 -4.4902116e-01  1.2782342e+00  2.0816245e+00
 -4.1816279e-01 -2.9819381e+00 -8.4179085e-01  1.9626539e+00
 -1.4738626e+00 -1.3242916e+00  1.8861794e+00 -1.1898819e+00
  1.3238851e+00 -6.4920151e-01  3.8012328e+00  2.6445508e+00
 -1.5182890e-01  4.6848631e-01  3.1005518e+00  7.2152913e-01
  2.2590755e-01  4.5798478e-01 -2.2721294e-01  3.7635162e-02
  1.3822956e+00 -1.0529523e-01  1.4199548e+00  1.4831527e+00
 -2.3953424e+00  5.7593191e-01  2.0077407e+00  2.0307861e-02
 -5.0059938e-01 -2.0213137e+00 -1.7889544e+00 -1.5134095e+00
  9.6358620e-02 -1.2459183e+00  1.7569329e+00 -3.9924572e+00
 -1.1926625e-01 -2.8321953e+00  7.7366227e-01  2.1342990e+00
 -8.5490537e-01 -9.1503859e-01  6.0646707e-01 -1.5810367e+00
  1.9046652e-01 -1.75831



In [141]:
# Similar words to time
print(w2v_model.wv.most_similar('time'))

# Saving the model
w2v_model.save('Word2Vec_model')

# Loading the model
Word2Vec.load('Word2Vec_model')

[('moment', 0.5944941639900208), ('glance', 0.5476539731025696), ('forefront', 0.5458520650863647), ('period', 0.5256564021110535), ('expense', 0.5117818713188171), ('hinted', 0.5058044195175171), ('stake', 0.4763037860393524), ('point', 0.4707433581352234), ('behest', 0.46174609661102295), ('outset', 0.45605048537254333)]


<gensim.models.word2vec.Word2Vec at 0x1355a22f128>

In [137]:
w2v_model.epochs

5

##### **Update the W2V model**

In [139]:
# update the existing model vocabulary with the words of the new sentences
w2v_model.build_vocab(data_2, update = True)
  
# train word vectors
w2v_model.train(data_2, total_examples = w2v_model.corpus_count, epochs = w2v_model.epochs)
  
print(w2v_model.wv['time'])

[ 4.6802515e-01  5.6493774e-02 -1.5912900e+00 -1.2236629e-02
 -1.2509328e+00 -1.1412683e+00  1.5030310e+00 -1.4950004e-01
  6.2586015e-01 -3.7274805e-01 -9.9825066e-01 -3.4770525e-01
  2.1805830e+00  5.9991825e-01 -7.8322554e-01 -8.4842241e-01
 -2.6591659e+00  3.7505874e-01  3.7908617e-01  3.7979779e+00
 -3.5899135e-01 -2.8704739e+00 -1.2666919e+00  9.2186731e-01
 -2.4869249e+00 -8.4918106e-01  1.8594224e+00 -7.1344894e-01
  1.0425552e+00 -1.4413654e+00  2.2741675e+00  1.9501375e+00
 -1.2049654e+00  8.6784877e-02  2.7047594e+00  1.8577206e+00
  2.6372290e-01  4.8745653e-01 -6.5028089e-01 -3.0038711e-01
  1.5509744e+00  4.7653878e-01  1.5887014e+00  2.2717149e+00
 -3.7671083e-01 -9.2957735e-01  2.5331056e+00  6.5390236e-04
 -4.9770218e-01 -9.4314033e-01 -1.4131360e+00 -1.6895908e+00
  1.9554275e-01 -6.6305637e-01  2.6162763e+00 -2.2937651e+00
  5.8671814e-01 -1.5097600e+00  5.1442695e-01  1.7494600e+00
 -2.8709758e-02 -1.2273374e+00  1.0656810e+00 -1.2380966e+00
 -6.6773258e-02 -6.68761

In [140]:
w2v_model.wv.most_similar(['time'])

[('moment', 0.5944941639900208),
 ('glance', 0.5476539731025696),
 ('forefront', 0.5458520650863647),
 ('period', 0.5256564021110535),
 ('expense', 0.5117818713188171),
 ('hinted', 0.5058044195175171),
 ('stake', 0.4763037860393524),
 ('point', 0.4707433581352234),
 ('behest', 0.46174609661102295),
 ('outset', 0.45605048537254333)]

#### **Q12.**
#### **How to create Doc2Vec model using gensim?**

- In contrast to the Word2Vec model, the Doc2Vec model gives a vector representation of an entire document or group of words. With the help of this model, we can find relationships among different documents.

- For example, if we train the model on literature that includes "Through the Looking Glass", we can say that:

#### `Through The Looking Glass = Alice In WonderLand`

In [144]:
from gensim.models import doc2vec

# To train the model we need a list of tagged documents
def tagged_document(list_of_ListOfWords):
    for x, ListOfWords in enumerate(list_of_ListOfWords):
        yield doc2vec.TaggedDocument(ListOfWords, [x])
    
# training data
data_train = list(tagged_document(dataset))
    
# print the first tagged training document
print(data_train[:1])

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

In [145]:
# Initialize the model
d2v_model = doc2vec.Doc2Vec(vector_size = 40, min_count = 10, epochs = 30)
  
# build the vocabulary
d2v_model.build_vocab(data_train)
  
# Train Doc2Vec model
d2v_model.train(data_train, total_examples = d2v_model.corpus_count, epochs = d2v_model.epochs)
  
# Infer the vector of a new document from its tokens
Analyze = d2v_model.infer_vector(['violent', 'means', 'to', 'destroy'])
print(Analyze)

[-0.17972104 -0.0843773  -0.2301176   0.1513759  -0.16097273  0.13224725
 -0.20869927  0.29783106 -0.3220737   0.0046985   0.06120256  0.04329256
  0.38654107  0.12910809  0.01477905 -0.07714464 -0.3160261   0.16582759
 -0.02708023 -0.05503297 -0.08408926 -0.2797598   0.17048396 -0.03487602
 -0.07939611  0.52941495 -0.15906492 -0.29947588 -0.08971202 -0.14121538
  0.08322426  0.26421347  0.07738372 -0.32780713  0.13353965 -0.279358
 -0.0877514  -0.17635392 -0.03658332  0.27801502]


#### **Q13.**
#### **How to compute similarity matrices using gensim?**

- Cosine Similarity: It is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them.

- Soft Cosine Similarity: It is similar to cosine similarity, except that cosine similarity treats the features of the vector space model (VSM) as independent, whereas soft cosine similarity also takes the similarity between features into account.

We need a word embedding model to compute soft cosine similarities. Here we are using the pre-trained word2vec model.
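
The difference between the two measures can be seen on a toy example in plain Python. The term-similarity matrix `S` below is hypothetical and hard-coded; in practice its entries would be derived from a word embedding model. This is a sketch of the definitions, not of gensim's similarity classes.

```python
import math

vocab = ["play", "game", "weather"]

# Hypothetical term-similarity matrix: 1.0 on the diagonal,
# 0.5 between the related words "play" and "game"
S = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

def cosine(a, b):
    """Plain cosine similarity: features are treated as independent."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def soft_inner(a, b):
    """Inner product weighted by the similarity of every pair of terms."""
    return sum(S[i][j] * a[i] * b[j] for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b):
    """Soft cosine similarity: related features contribute to the score."""
    return soft_inner(a, b) / (math.sqrt(soft_inner(a, a)) * math.sqrt(soft_inner(b, b)))

doc_a = [1, 0, 0]   # bag of words for "play"
doc_b = [0, 1, 0]   # bag of words for "game"

print(cosine(doc_a, doc_b), soft_cosine(doc_a, doc_b))
```

The two documents share no word, so their cosine similarity is zero, yet the soft cosine is positive because "play" and "game" are marked as related. If `S` were the identity matrix, both measures would coincide.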

TBD

#### **Reference Links**

Gensim Documentation
- https://radimrehurek.com/gensim/index.html

Gensim Tutorials
- https://www.machinelearningplus.com/nlp/gensim-tutorial/#2whatisadictionaryandcorpus
- https://www.geeksforgeeks.org/nlp-gensim-tutorial-complete-guide-for-beginners/
- https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
- http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.YJo_i6HhUl2

SMART IR notation and the different TF-IDF formulas
- https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System

Information about Gensim Downloader Pre-trained models
- https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json

RaRe Technologies: Gensim, smart_open, and a faster package for sparse SVD
- https://github.com/RaRe-Technologies
- https://github.com/RaRe-Technologies/gensim
- https://github.com/RaRe-Technologies/smart_open
- https://github.com/RaRe-Technologies/sparsesvd

How do we use generators and iterators in Python for lazy evaluation?
- https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

Topic Modelling
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
- https://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html

Skipgram, Subword, CBOW
- https://kavita-ganesan.com/comparison-between-cbow-skipgram-subword/#.YJqJIaHhUl0