# Gensim

Author: **Tasnima Sadekova**

**Gensim** is billed as a Natural Language Processing package that does ‘Topic Modeling for Humans’, but in practice it is much more than that. It is a leading package for processing text, working with *word vector models* and building *topic models*.

Its facilities for building and evaluating topic models are unusually broad, and it also offers many convenient utilities for text processing.

Another significant advantage of gensim is that it lets you process large text files without loading the entire file into memory.

In [1]:
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

## Dictionary and Corpus

**Dictionary** is an object that maps each word to a unique id.

The dictionary object is typically used to create a ‘bag of words’ Corpus. It is this Dictionary and the bag-of-words (Corpus) that are used as inputs to topic modeling and other models that Gensim specializes in.

*gensim.utils.simple_preprocess* converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
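To make the description above concrete, here is a rough pure-Python approximation of that behaviour. The function name `approx_simple_preprocess`, the regular expression, and the length limits of 2 and 15 are assumptions made for illustration; this is not Gensim's implementation and edge cases may differ.

```python
import re

# Runs of word characters that are not digits (an approximation only;
# the tokenizer used by Gensim may treat edge cases differently).
TOKEN_RE = re.compile(r"(?:(?!\d)\w)+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = approx_simple_preprocess("If you use a car frequently, the first step to cutting")
print(tokens)
```

Note that the one-letter token "a" and the trailing comma are dropped, which matches the difference between the two dictionaries built in the cells of this section.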

In [2]:
# from a list of sentences

documents = ["If you use a car frequently, the first step to cutting",
             "down your emissions may well be to simply", 
             "fully consider the", 
             "alternatives available to you."
             ]

# Tokenize (split) the sentences into words
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

print(dictionary)
print(dictionary.token2id)

Dictionary(23 unique tokens: ['If', 'a', 'car', 'cutting', 'first']...)
{'If': 0, 'a': 1, 'car': 2, 'cutting': 3, 'first': 4, 'frequently,': 5, 'step': 6, 'the': 7, 'to': 8, 'use': 9, 'you': 10, 'be': 11, 'down': 12, 'emissions': 13, 'may': 14, 'simply': 15, 'well': 16, 'your': 17, 'consider': 18, 'fully': 19, 'alternatives': 20, 'available': 21, 'you.': 22}


With simple_preprocess

In [3]:
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in documents)
print(dictionary)
print(dictionary.token2id)

Dictionary(21 unique tokens: ['car', 'cutting', 'first', 'frequently', 'if']...)
{'car': 0, 'cutting': 1, 'first': 2, 'frequently': 3, 'if': 4, 'step': 5, 'the': 6, 'to': 7, 'use': 8, 'you': 9, 'be': 10, 'down': 11, 'emissions': 12, 'may': 13, 'simply': 14, 'well': 15, 'your': 16, 'consider': 17, 'fully': 18, 'alternatives': 19, 'available': 20}


In [6]:
# from a text file, read line by line
with open('sample.txt', encoding='utf-8') as f:
    dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True) for line in f)

**Corpus** is an object that contains the word ids and their frequencies in each document.

- Bag of words

In [4]:
my_docs = ["Who let the dogs out?",
           "Who? Who? Who? Who?"]

# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in my_docs]

# Create the Corpus
dictionary = corpora.Dictionary()

# allow_update=True adds new words to the dictionary
bow_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in tokenized_list]
print(bow_corpus)

print("Dictionary: ", dictionary.token2id)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(4, 4)]]
Dictionary:  {'dogs': 0, 'let': 1, 'out': 2, 'the': 3, 'who': 4}


The (0, 1) in the first list means that the word with id 0 appears once in the first document.
Likewise, the (4, 4) in the second list means that the word with id 4 appears four times in the second document.
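The (id, count) pairs can be reproduced with a few lines of plain Python. The sketch below is only an illustration of the idea: `doc2bow_sketch` is a hypothetical helper, and assigning ids to new tokens in alphabetical order is an assumption chosen because it reproduces the ids printed above.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    # Unseen tokens receive the next free ids, in alphabetical order
    # (an assumption that matches the output shown above).
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


token2id = {}
docs = [["who", "let", "the", "dogs", "out"], ["who", "who", "who", "who"]]
bow = [doc2bow_sketch(doc, token2id) for doc in docs]
print(bow)
print(token2id)
```

Each document becomes a sparse list: words that do not occur in a document simply have no entry.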

- TfIdf

In [5]:
from gensim import models
import numpy as np

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

# Show the Word Weights in Corpus
for doc in corpus:
    print([[mydict[id], freq] for id, freq in doc])
print()

# Create the TF-IDF model
tfidf = models.TfidfModel(corpus, smartirs='ntc')

# Show the TF-IDF weights
for doc in tfidf[corpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

[['first', 1], ['is', 1], ['line', 1], ['the', 1], ['this', 1]]
[['is', 1], ['the', 1], ['this', 1], ['second', 1], ['sentence', 1]]
[['this', 1], ['document', 1], ['third', 1]]

[['first', 0.63], ['is', 0.31], ['line', 0.63], ['the', 0.31], ['this', 0.13]]
[['is', 0.31], ['the', 0.31], ['this', 0.13], ['second', 0.63], ['sentence', 0.63]]
[['this', 0.15], ['document', 0.7], ['third', 0.7]]
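The weights above can be checked by hand. The following sketch recomputes them in plain Python, assuming a raw term frequency, an inverse document frequency of log((N + 1) / df), and cosine (unit-length) normalization. The idf formula is an assumption chosen because it reproduces the printed numbers; it is not taken from Gensim's source, and other Gensim versions may weight terms differently.

```python
import math
from collections import Counter

docs = [["this", "is", "the", "first", "line"],
        ["this", "is", "the", "second", "sentence"],
        ["this", "third", "document"]]

n_docs = len(docs)
# Document frequency: the number of documents each word occurs in.
df = Counter(word for doc in docs for word in set(doc))


def tfidf_sketch(doc):
    """Return {word: weight} with unit-length (cosine) normalization."""
    tf = Counter(doc)
    # Assumed idf: log((N + 1) / df); see the note above.
    weights = {w: tf[w] * math.log((n_docs + 1) / df[w]) for w in tf}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: round(v / norm, 2) for w, v in weights.items()}


for doc in docs:
    print(tfidf_sketch(doc))
```

Words that occur in every document ("this") get the smallest weight, while words unique to one document ("first", "line") get the largest.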


Save and load

In [10]:
# Save the Dict and Corpus
dictionary.save('mydict.dict')  # save dict to disk
corpora.MmCorpus.serialize('bow_corpus.mm', bow_corpus)  # save corpus to disk

In [11]:
# Load them back
loaded_dict = corpora.Dictionary.load('mydict.dict')

corpus = corpora.MmCorpus('bow_corpus.mm')
for line in corpus:
    print(line)

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)]
[(4, 4.0)]


## Datasets

Gensim provides an inbuilt API to download popular text datasets and word embedding models.

A comprehensive list of available datasets and models is maintained [here](https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json).

Using the API to download a dataset is as simple as calling the ```api.load()``` function with the right data or model name.

In [6]:
import gensim.downloader as api

Short description

In [7]:
api.info('text8')
api.info('glove-wiki-gigaword-50')

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

Dataset

In [8]:
dataset = api.load("text8")
data = [d for d in dataset]
print(data[0])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'so

Pretrained model

In [21]:
w2v_model = api.load("glove-wiki-gigaword-50")
w2v_model.most_similar('blue')

INFO - 09:47:56: loading projection weights from C:\Users\v00524754/gensim-data\glove-wiki-gigaword-50\glove-wiki-gigaword-50.gz
INFO - 09:50:52: loaded (400000, 50) matrix from C:\Users\v00524754/gensim-data\glove-wiki-gigaword-50\glove-wiki-gigaword-50.gz
INFO - 09:50:53: precomputing L2-norms of word weight vectors


[('red', 0.8901656866073608),
 ('black', 0.8648407459259033),
 ('pink', 0.8452916741371155),
 ('green', 0.8346816301345825),
 ('yellow', 0.8320708274841309),
 ('purple', 0.829311192035675),
 ('white', 0.8225342035293579),
 ('orange', 0.8114303350448608),
 ('bright', 0.799933910369873),
 ('colored', 0.787665605545044)]

## Word2Vec

A word embedding model provides a numerical vector for a given word. Using Gensim’s downloader API, you can download pre-built word embedding models such as word2vec, fastText, GloVe and ConceptNet. These are built on large corpora of commonly occurring text data such as Wikipedia, Google News, etc.

The training algorithms in the Gensim package were actually ported from the original Word2Vec implementation by Google and extended with additional functionality.

This module implements the word2vec family of algorithms, using *highly optimized* C routines, data streaming and Pythonic interfaces.

**Parameters:**

- ```sentences``` (iterable of iterables, optional) – The sentences iterable can simply be a *list of lists of tokens*, but for larger corpora, consider an *iterable* that streams the sentences directly from disk/network.
- ```corpus_file``` (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of ```sentences``` to get a performance boost. *Only one of the sentences or corpus_file arguments needs to be passed.*
- ```size``` = 100 (int, optional) – Dimensionality of the word vectors (renamed ```vector_size``` in Gensim 4).
- ```window``` = 5 (int, optional) – Maximum distance between the current and predicted word within a sentence.
- ```min_count``` = 5 (int, optional) – Ignores all words with total frequency lower than this.
- ```workers``` = 3 (int, optional) – Use this many worker threads to train the model (faster training on multicore machines).
- ```sg``` = 0 ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
- ```hs``` = 0 ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
- ```negative``` = 5 (int, optional) – If > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.
- ```max_vocab_size``` = None (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, the infrequent ones are pruned.
- ```iter``` = 5 (int, optional) – Number of iterations (epochs) over the corpus (renamed ```epochs``` in Gensim 4).
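Two of these parameters, ```min_count``` and ```window```, are easy to picture with a toy example. The helpers `prune_by_min_count` and `context_pairs` below are hypothetical and deliberately simplified (real training also downsamples frequent words and varies the effective window), so treat this as a sketch of the idea rather than of Gensim's training loop.

```python
from collections import Counter


def prune_by_min_count(sentences, min_count):
    """Drop every word whose total frequency is lower than min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]


def context_pairs(sentence, window):
    """List (current word, context word) pairs at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
kept = prune_by_min_count(sentences, min_count=2)
pairs = context_pairs(["a", "b", "c"], window=1)
print(kept)   # "cat" and "dog" occur once each, so they are removed
print(pairs)
```

A larger ```window``` produces more pairs per word, and a larger ```min_count``` shrinks the vocabulary, which is what the "effective_min_count" lines in the training log report.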

In [10]:
import logging 
# Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [11]:
dataset = api.load("text8")
data = [d for d in dataset]

# Train Word2Vec model
model = Word2Vec(data)

INFO - 09:35:39: collecting all words and their counts
INFO - 09:35:39: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 09:36:01: collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
INFO - 09:36:01: Loading a fresh vocabulary
INFO - 09:36:03: effective_min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
INFO - 09:36:03: effective_min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
INFO - 09:36:05: deleting the raw counts dictionary of 253854 items
INFO - 09:36:05: sample=0.001 downsamples 38 most-common words
INFO - 09:36:05: downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
INFO - 09:36:06: estimated required memory for 71290 words and 100 dimensions: 92677000 bytes
INFO - 09:36:06: resetting layer weights
INFO - 09:37:59: training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 09:3

INFO - 09:39:14: EPOCH 2 - PROGRESS: at 33.22% examples, 200764 words/s, in_qsize 4, out_qsize 1
INFO - 09:39:15: EPOCH 2 - PROGRESS: at 35.16% examples, 202352 words/s, in_qsize 4, out_qsize 1
INFO - 09:39:16: EPOCH 2 - PROGRESS: at 37.57% examples, 206389 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:17: EPOCH 2 - PROGRESS: at 39.39% examples, 207184 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:18: EPOCH 2 - PROGRESS: at 41.45% examples, 209155 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:19: EPOCH 2 - PROGRESS: at 42.68% examples, 206984 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:20: EPOCH 2 - PROGRESS: at 44.50% examples, 207736 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:21: EPOCH 2 - PROGRESS: at 46.03% examples, 207106 words/s, in_qsize 5, out_qsize 0
INFO - 09:39:22: EPOCH 2 - PROGRESS: at 47.62% examples, 206000 words/s, in_qsize 6, out_qsize 1
INFO - 09:39:23: EPOCH 2 - PROGRESS: at 49.32% examples, 206166 words/s, in_qsize 6, out_qsize 1
INFO - 09:39:24: EPOCH 2 - PRO

INFO - 09:40:38: EPOCH 3 - PROGRESS: at 80.25% examples, 227411 words/s, in_qsize 5, out_qsize 0
INFO - 09:40:39: EPOCH 3 - PROGRESS: at 82.07% examples, 227255 words/s, in_qsize 6, out_qsize 0
INFO - 09:40:40: EPOCH 3 - PROGRESS: at 84.19% examples, 227852 words/s, in_qsize 6, out_qsize 0
INFO - 09:40:41: EPOCH 3 - PROGRESS: at 86.54% examples, 229186 words/s, in_qsize 4, out_qsize 1
INFO - 09:40:42: EPOCH 3 - PROGRESS: at 88.95% examples, 230578 words/s, in_qsize 5, out_qsize 0
INFO - 09:40:43: EPOCH 3 - PROGRESS: at 91.01% examples, 231073 words/s, in_qsize 5, out_qsize 0
INFO - 09:40:44: EPOCH 3 - PROGRESS: at 93.36% examples, 232098 words/s, in_qsize 4, out_qsize 1
INFO - 09:40:45: EPOCH 3 - PROGRESS: at 95.06% examples, 231607 words/s, in_qsize 6, out_qsize 0
INFO - 09:40:46: EPOCH 3 - PROGRESS: at 97.35% examples, 232529 words/s, in_qsize 5, out_qsize 0
INFO - 09:40:47: EPOCH 3 - PROGRESS: at 99.29% examples, 232659 words/s, in_qsize 4, out_qsize 1
INFO - 09:40:47: worker thread

INFO - 09:41:58: EPOCH 5 - PROGRESS: at 38.33% examples, 224293 words/s, in_qsize 5, out_qsize 0
INFO - 09:41:59: EPOCH 5 - PROGRESS: at 40.51% examples, 226337 words/s, in_qsize 6, out_qsize 0
INFO - 09:42:00: EPOCH 5 - PROGRESS: at 42.15% examples, 225510 words/s, in_qsize 6, out_qsize 0
INFO - 09:42:01: EPOCH 5 - PROGRESS: at 44.15% examples, 226273 words/s, in_qsize 5, out_qsize 0
INFO - 09:42:02: EPOCH 5 - PROGRESS: at 46.21% examples, 227168 words/s, in_qsize 6, out_qsize 0
INFO - 09:42:03: EPOCH 5 - PROGRESS: at 48.21% examples, 227455 words/s, in_qsize 6, out_qsize 0
INFO - 09:42:04: EPOCH 5 - PROGRESS: at 49.79% examples, 225956 words/s, in_qsize 5, out_qsize 0
INFO - 09:42:05: EPOCH 5 - PROGRESS: at 51.32% examples, 224604 words/s, in_qsize 5, out_qsize 0
INFO - 09:42:06: EPOCH 5 - PROGRESS: at 53.03% examples, 223962 words/s, in_qsize 5, out_qsize 0
INFO - 09:42:07: EPOCH 5 - PROGRESS: at 54.61% examples, 223203 words/s, in_qsize 5, out_qsize 0
INFO - 09:42:08: EPOCH 5 - PRO

Calling ```Word2Vec()``` without ```sentences``` or ```corpus_file``` only initializes the model; the vocabulary must then be built and the model trained explicitly.

In [16]:
model = Word2Vec()
model.build_vocab(data)
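# train() needs the number of examples and the number of epochs
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)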

**Save and load model**

In [26]:
model.save('w2v_newmodel')
model = Word2Vec.load('w2v_newmodel')

INFO - 21:38:12: saving Word2Vec object under w2v_newmodel, separately None
INFO - 21:38:12: not storing attribute vectors_norm
INFO - 21:38:12: not storing attribute cum_table
INFO - 21:38:14: saved w2v_newmodel
INFO - 21:38:14: loading Word2Vec object from w2v_newmodel
INFO - 21:38:17: loading wv recursively from w2v_newmodel.wv.* with mmap=None
INFO - 21:38:17: setting ignored attribute vectors_norm to None
INFO - 21:38:17: loading vocabulary recursively from w2v_newmodel.vocabulary.* with mmap=None
INFO - 21:38:17: loading trainables recursively from w2v_newmodel.trainables.* with mmap=None
INFO - 21:38:17: setting ignored attribute cum_table to None
INFO - 21:38:17: loaded w2v_newmodel


You can **continue training**

In [25]:
model.train([["hello", "world"]], total_examples=1, epochs=1)

INFO - 21:38:12: training model with 3 workers on 71290 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 21:38:12: worker thread finished; awaiting finish of 2 more threads
INFO - 21:38:12: worker thread finished; awaiting finish of 1 more threads
INFO - 21:38:12: worker thread finished; awaiting finish of 0 more threads
INFO - 21:38:12: EPOCH - 1 : training on 2 raw words (2 effective words) took 0.0s, 294 effective words/s
INFO - 21:38:12: training on a 2 raw words (2 effective words) took 0.0s, 103 effective words/s


(2, 2)

**Word2vec input**


1) Parameter ```sentences```

Gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words.

   1.1 *List of lists of tokens*

In [34]:
input1 = [['first', 'sentence'], ['second', 'sentence']]
model1 = Word2Vec(input1, min_count=1)

INFO - 21:44:41: collecting all words and their counts
INFO - 21:44:41: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 21:44:41: collected 3 word types from a corpus of 4 raw words and 2 sentences
INFO - 21:44:41: Loading a fresh vocabulary
INFO - 21:44:41: effective_min_count=1 retains 3 unique words (100% of original 3, drops 0)
INFO - 21:44:41: effective_min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
INFO - 21:44:41: deleting the raw counts dictionary of 3 items
INFO - 21:44:41: sample=0.001 downsamples 3 most-common words
INFO - 21:44:41: downsampling leaves estimated 0 word corpus (5.7% of prior 4)
INFO - 21:44:41: estimated required memory for 3 words and 100 dimensions: 3900 bytes
INFO - 21:44:41: resetting layer weights
INFO - 21:44:41: training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 21:44:41: worker thread finished; awaiting finish of 2 more threads
INFO - 21:44:41

The tokens can be in any language, e.g. Russian:

In [None]:
input2 = [['первое', 'предложение'], ['второе', 'предложение']]
model2 = Word2Vec(input2, min_count=1)

1.2 Gensim only requires that the input yields sentences sequentially when iterated over, so there is *no need to keep everything in RAM*.

In [None]:
import os
import gensim

class MySentences:
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Stream one tokenized line at a time from every file in the directory
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.split()

input2 = MySentences('/some/directory')  # a memory-friendly iterable
model = gensim.models.Word2Vec(input2)

See [BrownCorpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.BrownCorpus), [Text8Corpus](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus) or [LineSentence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) in word2vec module for such examples.

BrownCorpus and Text8Corpus were implemented specifically for the Brown corpus and the Text8 dataset. The Text8 corpus, for example, consists of a single line of cleaned and concatenated Wikipedia articles.

LineSentence iterates over a file that contains sentences: one line = one sentence. Words must already be preprocessed and separated by whitespace.
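Training passes over the data more than once (a vocabulary scan followed by each epoch), so the input has to be an iterable that can be restarted, not a one-shot generator. The class below is a minimal sketch of a one-line-per-sentence reader with that property; `LineSentenceSketch` and the temporary two-line file are made up for the demonstration and are not Gensim's `LineSentence`.

```python
import os
import tempfile


class LineSentenceSketch:
    """Yield one whitespace-split sentence per line; can be iterated repeatedly."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so nothing is kept in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Hypothetical two-line corpus written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first sentence\nsecond sentence\n")

sentences = LineSentenceSketch(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)  # a plain generator would be exhausted here
os.remove(tmp.name)
print(first_pass)
print(first_pass == second_pass)
```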

In [14]:
import gensim
from gensim.test.utils import datapath
from gensim.models.word2vec import LineSentence

input3 = LineSentence(datapath('lee_background.cor'))
model = gensim.models.Word2Vec(input3)

INFO - 09:43:35: collecting all words and their counts
INFO - 09:43:35: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 09:43:35: collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO - 09:43:35: Loading a fresh vocabulary
INFO - 09:43:35: effective_min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO - 09:43:35: effective_min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO - 09:43:35: deleting the raw counts dictionary of 10781 items
INFO - 09:43:35: sample=0.001 downsamples 45 most-common words
INFO - 09:43:35: downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO - 09:43:35: estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO - 09:43:35: resetting layer weights
INFO - 09:43:38: training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
INFO - 09:43:38: worker thread finish

2) Parameter ```corpus_file```: path to a corpus file in LineSentence format

If the corpus is already stored in this format, its path can be passed as ```corpus_file``` instead of building the ```LineSentence``` iterable used in the previous cell.
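If the corpus exists only as token lists in memory, it can be written out in that one-sentence-per-line format first. The helper `write_line_sentence_file` and the file name `corpus.txt` are hypothetical; the sketch only covers writing the file, and the Gensim call is left as a comment.

```python
import os
import tempfile


def write_line_sentence_file(sentences, path):
    """Write one sentence per line with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")


sentences = [["first", "sentence"], ["second", "sentence"]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_line_sentence_file(sentences, path)

with open(path, encoding="utf-8") as handle:
    content = handle.read()
print(content)

# The path could then be used for training, e.g.
# Word2Vec(corpus_file=path, min_count=1)
```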

## Exploring the model

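- Extract a word vector. For a trained ```Word2Vec``` model the vectors live in ```model.wv```; the pretrained ```w2v_model``` loaded above is already a set of keyed vectors, so it is indexed directly:

In [22]:
w2v_model['topic']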


array([-0.04278 ,  0.79271 , -0.70087 , -0.023487,  0.24581 ,  0.24413 ,
       -0.10713 , -0.51894 , -0.17387 , -0.15821 , -0.90392 , -0.33753 ,
       -0.13262 ,  0.67051 ,  0.93457 ,  0.046388, -0.26368 , -0.23655 ,
        0.80884 ,  0.1048  ,  0.43985 , -0.068909,  0.83773 ,  1.0383  ,
        1.0378  , -1.0609  , -0.015392,  0.39162 , -0.1175  ,  0.40644 ,
        1.9836  , -0.41525 , -0.050877, -1.2321  , -0.69079 , -0.35601 ,
       -0.75549 ,  0.94668 , -0.84225 ,  0.095134, -0.092177, -0.13833 ,
       -0.30647 ,  0.89785 ,  0.071583,  0.31083 ,  0.88585 ,  1.1397  ,
       -0.19138 ,  0.27392 ], dtype=float32)

- Similarity

In [23]:
w2v_model.most_similar(positive=['topic'])


[('topics', 0.8801343441009521),
 ('discussion', 0.8731067180633545),
 ('debates', 0.7824684381484985),
 ('discussing', 0.7815945744514465),
 ('questions', 0.7759715914726257),
 ('debate', 0.7666820883750916),
 ('context', 0.7591571807861328),
 ('subject', 0.7455185651779175),
 ('question', 0.733422040939331),
 ('focusing', 0.7311854362487793)]

In [24]:
model = w2v_model

In [25]:
print('Similarity between (walk, walking): ', model.similarity('walk', 'walking'))
print('Similarity between (duck, ducks): ', model.similarity('duck', 'ducks'))
print('Similarity between (banana, pear): ', model.similarity('banana', 'pear'))
print()
print('Similarity between (banana, sky): ', model.similarity('banana', 'sky'))
print('Similarity between (walk, lie): ', model.similarity('walk', 'lie'))
print('Similarity between (dark, slow): ', model.similarity('dark', 'slow'))

Similarity between (walk, walking):  0.90261763
Similarity between (duck, ducks):  0.6012516
Similarity between (banana, pear):  0.64588785

Similarity between (banana, sky):  0.08188744
Similarity between (walk, lie):  0.5613729
Similarity between (dark, slow):  0.38198376
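The similarity scores above are cosine similarities between word vectors. As a minimal sketch of the formula, with made-up two-dimensional vectors instead of real embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction)  # close to 1: the vectors point the same way
print(orthogonal)      # 0: the vectors are unrelated
```

Because only the angle matters, the length of a vector (which tends to grow with word frequency) does not affect the score.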



- Analogy

In [26]:
model.most_similar(positive=['lower', 'tall'], negative=['low'])


[('marble', 0.7085803747177124),
 ('stands', 0.7059853672981262),
 ('wooden', 0.7055877447128296),
 ('conical', 0.6954946517944336),
 ('shaped', 0.6953530311584473),
 ('feet', 0.6921709775924683),
 ('atop', 0.6903060674667358),
 ('spire', 0.6899060010910034),
 ('upright', 0.6865338683128357),
 ('arches', 0.683355450630188)]

In [27]:
model.most_similar(positive=['mother', 'man'], negative=['woman'])


[('father', 0.9208111763000488),
 ('friend', 0.9095532894134521),
 ('son', 0.8972886800765991),
 ('brother', 0.8844174742698669),
 ('uncle', 0.8572986721992493),
 ('husband', 0.8507049679756165),
 ('wife', 0.8354371786117554),
 ('himself', 0.8331096172332764),
 ('daughter', 0.8320738077163696),
 ('him', 0.8309528827667236)]
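An analogy query boils down to vector arithmetic: add the positive vectors, subtract the negative ones, and look for the nearest remaining word. The sketch below uses hypothetical two-dimensional vectors and a simplified ranking (Gensim normalizes and averages the vectors, which this toy version skips), so it illustrates the idea only.

```python
import math

# Hypothetical 2-d vectors: first axis "gender", second axis "parenthood".
vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "mother": [-1.0, 1.0],
    "father": [1.0, 1.0],
    "child": [0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the answer.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:topn]


result = analogy(positive=["mother", "man"], negative=["woman"])
print(result)
```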

- Matching

In [28]:
print(model.doesnt_match(['car', 'airplane', 'bed']), " doesn't match to [car, airplane]")
print(model.doesnt_match(['red', 'blue', 'roof']), " doesn't match to [red, blue]")

bed  doesn't match to [car, airplane]
roof  doesn't match to [red, blue]
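One way to find the odd word is to average the unit-length vectors of the group and pick the word least similar to that average. The sketch below follows this idea with made-up two-dimensional vectors; `odd_one_out` is a hypothetical helper, and the details of Gensim's own method may differ.

```python
import math

# Hypothetical 2-d vectors: two vehicles and one piece of furniture.
vectors = {"car": [1.0, 0.1], "airplane": [0.9, 0.2], "bed": [0.0, 1.0]}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


odd = odd_one_out(["car", "airplane", "bed"])
print(odd)
```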



## Compare with other pretrained embeddings

In [29]:
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')


To determine which model performs better, use each model's ```evaluate_word_analogies()``` method.

It computes the performance of the model on an analogy test set. The accuracy is reported (printed to the log and returned as a score) for each section separately, plus one aggregate summary at the end.

Input:
- ```analogies``` (str) – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See [this file](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/datasets/questions-words.txt) as example.

Output:
- ```score``` (float) – The overall evaluation score on the entire evaluation set

- ```sections``` (list of dict of {str : str or list of tuple of (str, str, str, str)}) – Results broken down by each section of the evaluation set. Each dict contains the name of the section under the key ‘section’, and lists of correctly and incorrectly predicted 4-tuples of words under the keys ‘correct’ and ‘incorrect’.

In [38]:
# Print the first five lines of the analogy file
with open('questions-words.txt', 'r') as f:
    for i in range(5):
        print(f.readline())

: capital-common-countries

Athens Greece Baghdad Iraq

Athens Greece Bangkok Thailand

Athens Greece Beijing China

Athens Greece Berlin Germany
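The file layout shown above (a ": SECTION NAME" line followed by 4-tuples of words) is simple to parse. The following sketch groups the tuples by section; `parse_analogies` is a hypothetical helper written for illustration, and the sample lines are the ones printed above.

```python
def parse_analogies(lines):
    """Group 4-word analogy lines under the most recent ': SECTION' header."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(tuple(line.split()))
    return sections


sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Bangkok Thailand",
]
parsed = parse_analogies(sample)
print(parsed)
```

Each tuple reads "a is to b as c is to d": the model is given the first three words and is scored on whether it predicts the fourth.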



In [42]:
# Word2Vec accuracy
print(word2vec_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

# fastText accuracy
print(fasttext_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

# GloVe accuracy
print(glove_model300.evaluate_word_analogies(analogies='questions-words.txt')[0])

0.7401448525607863
0.8827876424099353
0.7195422354510931


## Doc2Vec

Unlike Word2Vec, a Doc2Vec model provides a vectorised representation of a group of words taken collectively as a single unit. It is not a simple average of the word vectors of the words in the sentence.

The training data for ```Doc2Vec``` should be a list of ```TaggedDocument``` objects. To create one, pass a list of words and a list of tags (here a single unique integer) to ```models.doc2vec.TaggedDocument()```.

In [30]:
# Prepare the dataset
def create_tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield TaggedDocument(list_of_words, [i])

dataset = api.load("text8")
data = [d for d in dataset]

train_data = list(create_tagged_document(data))
print(train_data[:1])

[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'

In [31]:
# Train the model
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(train_data)
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

INFO - 09:55:12: collecting all words and their counts
INFO - 09:55:12: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO - 09:55:35: collected 253854 word types and 1701 unique tags from a corpus of 1701 examples and 17005207 words
INFO - 09:55:35: Loading a fresh vocabulary
INFO - 09:55:39: effective_min_count=1 retains 253854 unique words (100% of original 253854, drops 0)
INFO - 09:55:39: effective_min_count=1 leaves 17005207 word corpus (100% of original 17005207, drops 0)
INFO - 09:55:44: deleting the raw counts dictionary of 253854 items
INFO - 09:55:44: sample=0.001 downsamples 36 most-common words
INFO - 09:55:44: downsampling leaves estimated 12819131 word corpus (75.4% of prior 17005207)
INFO - 09:55:47: estimated required memory for 253854 words and 50 dimensions: 228808800 bytes
INFO - 09:55:47: resetting layer weights
INFO - 10:01:37: training model with 3 workers on 253854 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 windo

INFO - 10:02:52: EPOCH 2 - PROGRESS: at 11.88% examples, 210481 words/s, in_qsize 6, out_qsize 0
INFO - 10:02:53: EPOCH 2 - PROGRESS: at 13.52% examples, 209096 words/s, in_qsize 3, out_qsize 2
INFO - 10:02:54: EPOCH 2 - PROGRESS: at 15.11% examples, 207823 words/s, in_qsize 5, out_qsize 0
INFO - 10:02:55: EPOCH 2 - PROGRESS: at 16.64% examples, 206563 words/s, in_qsize 5, out_qsize 0
INFO - 10:02:56: EPOCH 2 - PROGRESS: at 18.11% examples, 204596 words/s, in_qsize 5, out_qsize 0
INFO - 10:02:57: EPOCH 2 - PROGRESS: at 19.52% examples, 201395 words/s, in_qsize 5, out_qsize 0
INFO - 10:02:58: EPOCH 2 - PROGRESS: at 20.99% examples, 200404 words/s, in_qsize 5, out_qsize 0
INFO - 10:02:59: EPOCH 2 - PROGRESS: at 22.75% examples, 202097 words/s, in_qsize 6, out_qsize 0
INFO - 10:03:00: EPOCH 2 - PROGRESS: at 23.99% examples, 199371 words/s, in_qsize 5, out_qsize 0
INFO - 10:03:01: EPOCH 2 - PROGRESS: at 25.22% examples, 196348 words/s, in_qsize 6, out_qsize 0
INFO - 10:03:02: EPOCH 2 - PRO

INFO - 10:04:17: EPOCH 3 - PROGRESS: at 37.10% examples, 220107 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:18: EPOCH 3 - PROGRESS: at 38.27% examples, 216978 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:19: EPOCH 3 - PROGRESS: at 39.92% examples, 216387 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:20: EPOCH 3 - PROGRESS: at 41.68% examples, 216573 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:21: EPOCH 3 - PROGRESS: at 43.74% examples, 218251 words/s, in_qsize 5, out_qsize 0
INFO - 10:04:22: EPOCH 3 - PROGRESS: at 45.68% examples, 219285 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:23: EPOCH 3 - PROGRESS: at 47.21% examples, 218224 words/s, in_qsize 5, out_qsize 1
INFO - 10:04:24: EPOCH 3 - PROGRESS: at 48.85% examples, 217732 words/s, in_qsize 4, out_qsize 1
INFO - 10:04:25: EPOCH 3 - PROGRESS: at 50.91% examples, 219082 words/s, in_qsize 4, out_qsize 1
INFO - 10:04:26: EPOCH 3 - PROGRESS: at 52.85% examples, 219992 words/s, in_qsize 6, out_qsize 0
INFO - 10:04:27: EPOCH 3 - PRO

INFO - 10:05:40: EPOCH 4 - PROGRESS: at 75.01% examples, 203476 words/s, in_qsize 6, out_qsize 1
INFO - 10:05:41: EPOCH 4 - PROGRESS: at 76.95% examples, 204171 words/s, in_qsize 5, out_qsize 0
INFO - 10:05:42: EPOCH 4 - PROGRESS: at 78.89% examples, 204922 words/s, in_qsize 5, out_qsize 0
INFO - 10:05:43: EPOCH 4 - PROGRESS: at 80.36% examples, 204545 words/s, in_qsize 6, out_qsize 0
INFO - 10:05:44: EPOCH 4 - PROGRESS: at 82.01% examples, 204616 words/s, in_qsize 6, out_qsize 0
INFO - 10:05:45: EPOCH 4 - PROGRESS: at 83.95% examples, 205240 words/s, in_qsize 4, out_qsize 1
INFO - 10:05:46: EPOCH 4 - PROGRESS: at 85.89% examples, 206029 words/s, in_qsize 5, out_qsize 0
INFO - 10:05:47: EPOCH 4 - PROGRESS: at 87.71% examples, 206378 words/s, in_qsize 5, out_qsize 0
INFO - 10:05:48: EPOCH 4 - PROGRESS: at 89.48% examples, 206444 words/s, in_qsize 6, out_qsize 0
INFO - 10:05:49: EPOCH 4 - PROGRESS: at 90.77% examples, 205295 words/s, in_qsize 5, out_qsize 1
INFO - 10:05:50: EPOCH 4 - PRO

INFO - 10:07:04: EPOCH 5 - PROGRESS: at 98.41% examples, 188082 words/s, in_qsize 5, out_qsize 0
INFO - 10:07:05: EPOCH 5 - PROGRESS: at 99.47% examples, 187162 words/s, in_qsize 5, out_qsize 0
INFO - 10:07:05: worker thread finished; awaiting finish of 2 more threads
INFO - 10:07:05: worker thread finished; awaiting finish of 1 more threads
INFO - 10:07:05: worker thread finished; awaiting finish of 0 more threads
INFO - 10:07:05: EPOCH - 5 : training on 17005207 raw words (12821110 effective words) took 68.5s, 187161 effective words/s
INFO - 10:07:06: EPOCH 6 - PROGRESS: at 1.12% examples, 143993 words/s, in_qsize 5, out_qsize 0
INFO - 10:07:07: EPOCH 6 - PROGRESS: at 2.41% examples, 153165 words/s, in_qsize 5, out_qsize 0
INFO - 10:07:08: EPOCH 6 - PROGRESS: at 4.17% examples, 176476 words/s, in_qsize 6, out_qsize 0
INFO - 10:07:09: EPOCH 6 - PROGRESS: at 5.88% examples, 185509 words/s, in_qsize 5, out_qsize 0
INFO - 10:07:10: EPOCH 6 - PROGRESS: at 7.47% examples, 187572 words/s, i

INFO - 10:08:25: EPOCH 7 - PROGRESS: at 24.51% examples, 190446 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:26: EPOCH 7 - PROGRESS: at 26.04% examples, 190748 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:27: EPOCH 7 - PROGRESS: at 27.57% examples, 190809 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:28: EPOCH 7 - PROGRESS: at 28.87% examples, 189173 words/s, in_qsize 4, out_qsize 1
INFO - 10:08:29: EPOCH 7 - PROGRESS: at 30.57% examples, 190219 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:30: EPOCH 7 - PROGRESS: at 31.75% examples, 188433 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:31: EPOCH 7 - PROGRESS: at 33.22% examples, 188301 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:32: EPOCH 7 - PROGRESS: at 34.51% examples, 186933 words/s, in_qsize 6, out_qsize 0
INFO - 10:08:34: EPOCH 7 - PROGRESS: at 35.68% examples, 185052 words/s, in_qsize 6, out_qsize 0
INFO - 10:08:35: EPOCH 7 - PROGRESS: at 37.45% examples, 186591 words/s, in_qsize 5, out_qsize 0
INFO - 10:08:36: EPOCH 7 - PRO

INFO - 10:09:50: EPOCH 8 - PROGRESS: at 51.15% examples, 192273 words/s, in_qsize 6, out_qsize 0
INFO - 10:09:51: EPOCH 8 - PROGRESS: at 52.50% examples, 191653 words/s, in_qsize 6, out_qsize 0
INFO - 10:09:52: EPOCH 8 - PROGRESS: at 54.32% examples, 192975 words/s, in_qsize 6, out_qsize 0
INFO - 10:09:53: EPOCH 8 - PROGRESS: at 55.91% examples, 193050 words/s, in_qsize 3, out_qsize 2
INFO - 10:09:54: EPOCH 8 - PROGRESS: at 57.91% examples, 194546 words/s, in_qsize 5, out_qsize 0
INFO - 10:09:55: EPOCH 8 - PROGRESS: at 59.49% examples, 194668 words/s, in_qsize 6, out_qsize 0
INFO - 10:09:56: EPOCH 8 - PROGRESS: at 60.91% examples, 194353 words/s, in_qsize 5, out_qsize 0
INFO - 10:09:57: EPOCH 8 - PROGRESS: at 62.55% examples, 194694 words/s, in_qsize 5, out_qsize 0
INFO - 10:09:58: EPOCH 8 - PROGRESS: at 63.90% examples, 194001 words/s, in_qsize 6, out_qsize 0
INFO - 10:09:59: EPOCH 8 - PROGRESS: at 65.43% examples, 193942 words/s, in_qsize 5, out_qsize 0
INFO - 10:10:00: EPOCH 8 - PRO

INFO - 10:11:14: EPOCH 9 - PROGRESS: at 97.53% examples, 229267 words/s, in_qsize 6, out_qsize 0
INFO - 10:11:15: EPOCH 9 - PROGRESS: at 99.18% examples, 228721 words/s, in_qsize 5, out_qsize 0
INFO - 10:11:15: worker thread finished; awaiting finish of 2 more threads
INFO - 10:11:15: worker thread finished; awaiting finish of 1 more threads
INFO - 10:11:15: worker thread finished; awaiting finish of 0 more threads
INFO - 10:11:15: EPOCH - 9 : training on 17005207 raw words (12820096 effective words) took 56.0s, 228853 effective words/s
INFO - 10:11:16: EPOCH 10 - PROGRESS: at 1.47% examples, 187342 words/s, in_qsize 5, out_qsize 0
INFO - 10:11:17: EPOCH 10 - PROGRESS: at 3.53% examples, 225311 words/s, in_qsize 5, out_qsize 0
INFO - 10:11:18: EPOCH 10 - PROGRESS: at 5.47% examples, 232128 words/s, in_qsize 5, out_qsize 0
INFO - 10:11:19: EPOCH 10 - PROGRESS: at 7.35% examples, 232639 words/s, in_qsize 5, out_qsize 0
INFO - 10:11:20: EPOCH 10 - PROGRESS: at 9.11% examples, 230845 words

INFO - 10:12:34: EPOCH 11 - PROGRESS: at 34.80% examples, 228198 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:35: EPOCH 11 - PROGRESS: at 36.68% examples, 228789 words/s, in_qsize 4, out_qsize 0
INFO - 10:12:36: EPOCH 11 - PROGRESS: at 38.57% examples, 229227 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:37: EPOCH 11 - PROGRESS: at 40.39% examples, 228848 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:38: EPOCH 11 - PROGRESS: at 42.33% examples, 229573 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:39: EPOCH 11 - PROGRESS: at 44.33% examples, 230429 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:40: EPOCH 11 - PROGRESS: at 46.27% examples, 230511 words/s, in_qsize 3, out_qsize 2
INFO - 10:12:41: EPOCH 11 - PROGRESS: at 48.15% examples, 230955 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:42: EPOCH 11 - PROGRESS: at 49.68% examples, 229671 words/s, in_qsize 5, out_qsize 0
INFO - 10:12:43: EPOCH 11 - PROGRESS: at 51.56% examples, 229941 words/s, in_qsize 5, out_qsize 1
INFO - 10:12:45: EPO

INFO - 10:13:58: EPOCH 12 - PROGRESS: at 71.84% examples, 192889 words/s, in_qsize 5, out_qsize 1
INFO - 10:14:00: EPOCH 12 - PROGRESS: at 73.54% examples, 193113 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:01: EPOCH 12 - PROGRESS: at 74.90% examples, 192621 words/s, in_qsize 6, out_qsize 1
INFO - 10:14:02: EPOCH 12 - PROGRESS: at 76.37% examples, 192289 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:03: EPOCH 12 - PROGRESS: at 77.95% examples, 192372 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:04: EPOCH 12 - PROGRESS: at 79.84% examples, 193198 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:05: EPOCH 12 - PROGRESS: at 81.36% examples, 193134 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:06: EPOCH 12 - PROGRESS: at 83.48% examples, 194454 words/s, in_qsize 5, out_qsize 1
INFO - 10:14:07: EPOCH 12 - PROGRESS: at 85.24% examples, 194971 words/s, in_qsize 5, out_qsize 0
INFO - 10:14:08: EPOCH 12 - PROGRESS: at 86.89% examples, 195118 words/s, in_qsize 6, out_qsize 0
INFO - 10:14:09: EPO

INFO - 10:15:23: EPOCH 13 - PROGRESS: at 96.94% examples, 187484 words/s, in_qsize 4, out_qsize 1
INFO - 10:15:24: EPOCH 13 - PROGRESS: at 98.71% examples, 187936 words/s, in_qsize 5, out_qsize 0
INFO - 10:15:24: worker thread finished; awaiting finish of 2 more threads
INFO - 10:15:24: worker thread finished; awaiting finish of 1 more threads
INFO - 10:15:24: worker thread finished; awaiting finish of 0 more threads
INFO - 10:15:24: EPOCH - 13 : training on 17005207 raw words (12821511 effective words) took 68.0s, 188581 effective words/s
INFO - 10:15:25: EPOCH 14 - PROGRESS: at 1.41% examples, 179165 words/s, in_qsize 5, out_qsize 0
INFO - 10:15:26: EPOCH 14 - PROGRESS: at 3.12% examples, 191488 words/s, in_qsize 6, out_qsize 1
INFO - 10:15:27: EPOCH 14 - PROGRESS: at 4.64% examples, 191913 words/s, in_qsize 5, out_qsize 0
INFO - 10:15:29: EPOCH 14 - PROGRESS: at 6.17% examples, 189092 words/s, in_qsize 4, out_qsize 1
INFO - 10:15:30: EPOCH 14 - PROGRESS: at 8.11% examples, 199934 wo

INFO - 10:16:43: EPOCH 15 - PROGRESS: at 26.93% examples, 209705 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:44: EPOCH 15 - PROGRESS: at 28.57% examples, 209414 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:45: EPOCH 15 - PROGRESS: at 30.34% examples, 209894 words/s, in_qsize 6, out_qsize 0
INFO - 10:16:47: EPOCH 15 - PROGRESS: at 31.69% examples, 207942 words/s, in_qsize 6, out_qsize 0
INFO - 10:16:48: EPOCH 15 - PROGRESS: at 33.39% examples, 208154 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:49: EPOCH 15 - PROGRESS: at 35.33% examples, 209793 words/s, in_qsize 4, out_qsize 1
INFO - 10:16:50: EPOCH 15 - PROGRESS: at 36.98% examples, 209203 words/s, in_qsize 4, out_qsize 1
INFO - 10:16:51: EPOCH 15 - PROGRESS: at 38.80% examples, 209762 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:52: EPOCH 15 - PROGRESS: at 40.33% examples, 208713 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:53: EPOCH 15 - PROGRESS: at 42.03% examples, 208868 words/s, in_qsize 5, out_qsize 0
INFO - 10:16:54: EPO

INFO - 10:18:07: EPOCH 16 - PROGRESS: at 65.55% examples, 222200 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:08: EPOCH 16 - PROGRESS: at 67.49% examples, 222730 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:09: EPOCH 16 - PROGRESS: at 69.66% examples, 223999 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:10: EPOCH 16 - PROGRESS: at 71.78% examples, 224998 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:11: EPOCH 16 - PROGRESS: at 73.90% examples, 225986 words/s, in_qsize 6, out_qsize 0
INFO - 10:18:12: EPOCH 16 - PROGRESS: at 75.66% examples, 225628 words/s, in_qsize 4, out_qsize 1
INFO - 10:18:13: EPOCH 16 - PROGRESS: at 77.48% examples, 225444 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:14: EPOCH 16 - PROGRESS: at 79.72% examples, 226697 words/s, in_qsize 6, out_qsize 0
INFO - 10:18:15: EPOCH 16 - PROGRESS: at 81.48% examples, 226488 words/s, in_qsize 6, out_qsize 0
INFO - 10:18:16: EPOCH 16 - PROGRESS: at 83.42% examples, 226943 words/s, in_qsize 5, out_qsize 0
INFO - 10:18:17: EPO

INFO - 10:19:27: EPOCH 18 - PROGRESS: at 17.93% examples, 223517 words/s, in_qsize 4, out_qsize 1
INFO - 10:19:28: EPOCH 18 - PROGRESS: at 19.69% examples, 222699 words/s, in_qsize 6, out_qsize 0
INFO - 10:19:29: EPOCH 18 - PROGRESS: at 21.22% examples, 220472 words/s, in_qsize 4, out_qsize 1
INFO - 10:19:30: EPOCH 18 - PROGRESS: at 22.75% examples, 218164 words/s, in_qsize 5, out_qsize 0
INFO - 10:19:31: EPOCH 18 - PROGRESS: at 24.57% examples, 219191 words/s, in_qsize 6, out_qsize 0
INFO - 10:19:32: EPOCH 18 - PROGRESS: at 25.93% examples, 216155 words/s, in_qsize 6, out_qsize 0
INFO - 10:19:33: EPOCH 18 - PROGRESS: at 27.87% examples, 218217 words/s, in_qsize 6, out_qsize 0
INFO - 10:19:34: EPOCH 18 - PROGRESS: at 29.45% examples, 217461 words/s, in_qsize 5, out_qsize 0
INFO - 10:19:35: EPOCH 18 - PROGRESS: at 31.04% examples, 216561 words/s, in_qsize 5, out_qsize 0
INFO - 10:19:36: EPOCH 18 - PROGRESS: at 32.75% examples, 216544 words/s, in_qsize 6, out_qsize 0
INFO - 10:19:37: EPO

INFO - 10:20:51: EPOCH 19 - PROGRESS: at 57.55% examples, 218771 words/s, in_qsize 5, out_qsize 0
INFO - 10:20:52: EPOCH 19 - PROGRESS: at 59.32% examples, 218936 words/s, in_qsize 5, out_qsize 0
INFO - 10:20:53: EPOCH 19 - PROGRESS: at 61.26% examples, 219789 words/s, in_qsize 3, out_qsize 0
INFO - 10:20:54: EPOCH 19 - PROGRESS: at 63.20% examples, 220316 words/s, in_qsize 5, out_qsize 0
INFO - 10:20:55: EPOCH 19 - PROGRESS: at 64.96% examples, 220417 words/s, in_qsize 6, out_qsize 1
INFO - 10:20:56: EPOCH 19 - PROGRESS: at 66.96% examples, 221291 words/s, in_qsize 4, out_qsize 1
INFO - 10:20:57: EPOCH 19 - PROGRESS: at 68.84% examples, 221805 words/s, in_qsize 4, out_qsize 1
INFO - 10:20:58: EPOCH 19 - PROGRESS: at 70.78% examples, 222390 words/s, in_qsize 5, out_qsize 0
INFO - 10:20:59: EPOCH 19 - PROGRESS: at 72.90% examples, 223386 words/s, in_qsize 5, out_qsize 0
INFO - 10:21:00: EPOCH 19 - PROGRESS: at 75.01% examples, 224243 words/s, in_qsize 5, out_qsize 0
INFO - 10:21:01: EPO

INFO - 10:22:10: EPOCH 21 - PROGRESS: at 16.40% examples, 255663 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:11: EPOCH 21 - PROGRESS: at 18.28% examples, 253378 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:12: EPOCH 21 - PROGRESS: at 20.46% examples, 254088 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:13: EPOCH 21 - PROGRESS: at 22.63% examples, 255030 words/s, in_qsize 6, out_qsize 2
INFO - 10:22:14: EPOCH 21 - PROGRESS: at 24.93% examples, 258017 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:16: EPOCH 21 - PROGRESS: at 26.93% examples, 256755 words/s, in_qsize 3, out_qsize 2
INFO - 10:22:17: EPOCH 21 - PROGRESS: at 28.75% examples, 253556 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:18: EPOCH 21 - PROGRESS: at 30.57% examples, 252496 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:19: EPOCH 21 - PROGRESS: at 32.80% examples, 254533 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:20: EPOCH 21 - PROGRESS: at 34.92% examples, 255544 words/s, in_qsize 5, out_qsize 0
INFO - 10:22:21: EPO

INFO - 10:23:33: EPOCH 22 - PROGRESS: at 81.25% examples, 254764 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:34: EPOCH 22 - PROGRESS: at 82.95% examples, 253864 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:35: EPOCH 22 - PROGRESS: at 85.30% examples, 254809 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:36: EPOCH 22 - PROGRESS: at 87.30% examples, 254708 words/s, in_qsize 6, out_qsize 0
INFO - 10:23:37: EPOCH 22 - PROGRESS: at 89.42% examples, 255079 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:39: EPOCH 22 - PROGRESS: at 91.65% examples, 255421 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:40: EPOCH 22 - PROGRESS: at 93.59% examples, 255214 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:41: EPOCH 22 - PROGRESS: at 95.65% examples, 255317 words/s, in_qsize 5, out_qsize 0
INFO - 10:23:42: EPOCH 22 - PROGRESS: at 97.82% examples, 255577 words/s, in_qsize 5, out_qsize 1
INFO - 10:23:43: worker thread finished; awaiting finish of 2 more threads
INFO - 10:23:43: worker thread finished; aw

INFO - 10:24:52: EPOCH 24 - PROGRESS: at 37.92% examples, 250663 words/s, in_qsize 5, out_qsize 0
INFO - 10:24:53: EPOCH 24 - PROGRESS: at 39.74% examples, 249698 words/s, in_qsize 6, out_qsize 0
INFO - 10:24:54: EPOCH 24 - PROGRESS: at 41.74% examples, 250044 words/s, in_qsize 5, out_qsize 0
INFO - 10:24:55: EPOCH 24 - PROGRESS: at 43.97% examples, 251455 words/s, in_qsize 4, out_qsize 1
INFO - 10:24:56: EPOCH 24 - PROGRESS: at 46.21% examples, 253026 words/s, in_qsize 5, out_qsize 0
INFO - 10:24:57: EPOCH 24 - PROGRESS: at 48.03% examples, 252186 words/s, in_qsize 5, out_qsize 0
INFO - 10:24:58: EPOCH 24 - PROGRESS: at 49.97% examples, 251894 words/s, in_qsize 5, out_qsize 0
INFO - 10:24:59: EPOCH 24 - PROGRESS: at 51.91% examples, 251713 words/s, in_qsize 5, out_qsize 0
INFO - 10:25:00: EPOCH 24 - PROGRESS: at 53.97% examples, 251767 words/s, in_qsize 5, out_qsize 0
INFO - 10:25:01: EPOCH 24 - PROGRESS: at 56.14% examples, 252745 words/s, in_qsize 5, out_qsize 0
INFO - 10:25:02: EPO

INFO - 10:26:15: worker thread finished; awaiting finish of 2 more threads
INFO - 10:26:15: worker thread finished; awaiting finish of 1 more threads
INFO - 10:26:15: worker thread finished; awaiting finish of 0 more threads
INFO - 10:26:15: EPOCH - 25 : training on 17005207 raw words (12818993 effective words) took 51.9s, 247172 effective words/s
INFO - 10:26:16: EPOCH 26 - PROGRESS: at 2.06% examples, 258012 words/s, in_qsize 5, out_qsize 0
INFO - 10:26:17: EPOCH 26 - PROGRESS: at 4.12% examples, 257989 words/s, in_qsize 5, out_qsize 0
INFO - 10:26:19: EPOCH 26 - PROGRESS: at 6.23% examples, 255260 words/s, in_qsize 4, out_qsize 1
INFO - 10:26:20: EPOCH 26 - PROGRESS: at 8.47% examples, 259157 words/s, in_qsize 5, out_qsize 0
INFO - 10:26:21: EPOCH 26 - PROGRESS: at 10.88% examples, 265980 words/s, in_qsize 5, out_qsize 0
INFO - 10:26:22: EPOCH 26 - PROGRESS: at 12.87% examples, 263000 words/s, in_qsize 6, out_qsize 0
INFO - 10:26:23: EPOCH 26 - PROGRESS: at 14.87% examples, 261307 w

INFO - 10:27:36: EPOCH 27 - PROGRESS: at 61.96% examples, 250090 words/s, in_qsize 5, out_qsize 0
INFO - 10:27:37: EPOCH 27 - PROGRESS: at 64.02% examples, 250191 words/s, in_qsize 4, out_qsize 1
INFO - 10:27:38: EPOCH 27 - PROGRESS: at 66.02% examples, 250252 words/s, in_qsize 5, out_qsize 0
INFO - 10:27:39: EPOCH 27 - PROGRESS: at 67.55% examples, 248680 words/s, in_qsize 4, out_qsize 1
INFO - 10:27:40: EPOCH 27 - PROGRESS: at 68.96% examples, 246793 words/s, in_qsize 6, out_qsize 2
INFO - 10:27:41: EPOCH 27 - PROGRESS: at 70.72% examples, 246129 words/s, in_qsize 5, out_qsize 0
INFO - 10:27:43: EPOCH 27 - PROGRESS: at 72.72% examples, 246059 words/s, in_qsize 6, out_qsize 1
INFO - 10:27:44: EPOCH 27 - PROGRESS: at 74.78% examples, 246234 words/s, in_qsize 5, out_qsize 0
INFO - 10:27:45: EPOCH 27 - PROGRESS: at 76.72% examples, 245806 words/s, in_qsize 4, out_qsize 1
INFO - 10:27:46: EPOCH 27 - PROGRESS: at 78.78% examples, 246026 words/s, in_qsize 5, out_qsize 0
INFO - 10:27:47: EPO

INFO - 10:28:57: EPOCH 29 - PROGRESS: at 18.64% examples, 250060 words/s, in_qsize 6, out_qsize 0
INFO - 10:28:58: EPOCH 29 - PROGRESS: at 20.69% examples, 250885 words/s, in_qsize 5, out_qsize 0
INFO - 10:28:59: EPOCH 29 - PROGRESS: at 22.93% examples, 252170 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:00: EPOCH 29 - PROGRESS: at 25.04% examples, 253217 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:01: EPOCH 29 - PROGRESS: at 27.16% examples, 254604 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:02: EPOCH 29 - PROGRESS: at 28.92% examples, 252854 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:03: EPOCH 29 - PROGRESS: at 30.81% examples, 251624 words/s, in_qsize 6, out_qsize 0
INFO - 10:29:04: EPOCH 29 - PROGRESS: at 32.80% examples, 251672 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:05: EPOCH 29 - PROGRESS: at 34.51% examples, 249594 words/s, in_qsize 6, out_qsize 1
INFO - 10:29:06: EPOCH 29 - PROGRESS: at 36.86% examples, 251945 words/s, in_qsize 5, out_qsize 0
INFO - 10:29:07: EPO

INFO - 10:30:21: EPOCH 30 - PROGRESS: at 82.95% examples, 252577 words/s, in_qsize 6, out_qsize 0
INFO - 10:30:22: EPOCH 30 - PROGRESS: at 85.07% examples, 252997 words/s, in_qsize 5, out_qsize 0
INFO - 10:30:23: EPOCH 30 - PROGRESS: at 87.07% examples, 253072 words/s, in_qsize 6, out_qsize 0
INFO - 10:30:24: EPOCH 30 - PROGRESS: at 89.30% examples, 253514 words/s, in_qsize 4, out_qsize 1
INFO - 10:30:25: EPOCH 30 - PROGRESS: at 91.71% examples, 254402 words/s, in_qsize 4, out_qsize 1
INFO - 10:30:26: EPOCH 30 - PROGRESS: at 93.89% examples, 254843 words/s, in_qsize 6, out_qsize 0
INFO - 10:30:27: EPOCH 30 - PROGRESS: at 95.71% examples, 254420 words/s, in_qsize 6, out_qsize 0
INFO - 10:30:28: EPOCH 30 - PROGRESS: at 97.88% examples, 254696 words/s, in_qsize 6, out_qsize 0
INFO - 10:30:29: worker thread finished; awaiting finish of 2 more threads
INFO - 10:30:29: worker thread finished; awaiting finish of 1 more threads
INFO - 10:30:29: worker thread finished; awaiting finish of 0 more

INFO - 10:31:40: EPOCH 32 - PROGRESS: at 54.32% examples, 275168 words/s, in_qsize 6, out_qsize 0
INFO - 10:31:41: EPOCH 32 - PROGRESS: at 57.08% examples, 277594 words/s, in_qsize 6, out_qsize 0
INFO - 10:31:43: EPOCH 32 - PROGRESS: at 59.14% examples, 276969 words/s, in_qsize 5, out_qsize 0
INFO - 10:31:44: EPOCH 32 - PROGRESS: at 61.26% examples, 276789 words/s, in_qsize 5, out_qsize 0
INFO - 10:31:45: EPOCH 32 - PROGRESS: at 63.14% examples, 275380 words/s, in_qsize 5, out_qsize 0
INFO - 10:31:46: EPOCH 32 - PROGRESS: at 65.67% examples, 276731 words/s, in_qsize 6, out_qsize 1
INFO - 10:31:47: EPOCH 32 - PROGRESS: at 68.08% examples, 277309 words/s, in_qsize 4, out_qsize 1
INFO - 10:31:48: EPOCH 32 - PROGRESS: at 70.49% examples, 278112 words/s, in_qsize 6, out_qsize 0
INFO - 10:31:49: EPOCH 32 - PROGRESS: at 72.55% examples, 277418 words/s, in_qsize 5, out_qsize 0
INFO - 10:31:50: EPOCH 32 - PROGRESS: at 74.72% examples, 277303 words/s, in_qsize 5, out_qsize 0
INFO - 10:31:51: EPO

INFO - 10:33:01: EPOCH 34 - PROGRESS: at 35.16% examples, 294922 words/s, in_qsize 6, out_qsize 0
INFO - 10:33:02: EPOCH 34 - PROGRESS: at 37.21% examples, 292007 words/s, in_qsize 4, out_qsize 1
INFO - 10:33:03: EPOCH 34 - PROGRESS: at 39.98% examples, 295197 words/s, in_qsize 5, out_qsize 0
INFO - 10:33:04: EPOCH 34 - PROGRESS: at 42.21% examples, 294421 words/s, in_qsize 5, out_qsize 0
INFO - 10:33:05: EPOCH 34 - PROGRESS: at 44.56% examples, 294668 words/s, in_qsize 5, out_qsize 0
INFO - 10:33:06: EPOCH 34 - PROGRESS: at 47.09% examples, 295670 words/s, in_qsize 6, out_qsize 0
INFO - 10:33:07: EPOCH 34 - PROGRESS: at 49.38% examples, 295448 words/s, in_qsize 5, out_qsize 0
INFO - 10:33:08: EPOCH 34 - PROGRESS: at 51.62% examples, 295024 words/s, in_qsize 5, out_qsize 0
INFO - 10:33:09: EPOCH 34 - PROGRESS: at 54.03% examples, 295381 words/s, in_qsize 6, out_qsize 2
INFO - 10:33:10: EPOCH 34 - PROGRESS: at 56.44% examples, 295665 words/s, in_qsize 6, out_qsize 1
INFO - 10:33:11: EPO

INFO - 10:34:21: EPOCH 36 - PROGRESS: at 13.40% examples, 280818 words/s, in_qsize 6, out_qsize 0
INFO - 10:34:23: EPOCH 36 - PROGRESS: at 15.70% examples, 281397 words/s, in_qsize 6, out_qsize 0
INFO - 10:34:24: EPOCH 36 - PROGRESS: at 18.11% examples, 283722 words/s, in_qsize 5, out_qsize 0
INFO - 10:34:25: EPOCH 36 - PROGRESS: at 20.28% examples, 282033 words/s, in_qsize 6, out_qsize 0
INFO - 10:34:26: EPOCH 36 - PROGRESS: at 22.28% examples, 279441 words/s, in_qsize 5, out_qsize 0
INFO - 10:34:27: EPOCH 36 - PROGRESS: at 24.34% examples, 278304 words/s, in_qsize 6, out_qsize 0
INFO - 10:34:28: EPOCH 36 - PROGRESS: at 26.69% examples, 278585 words/s, in_qsize 5, out_qsize 0
INFO - 10:34:29: EPOCH 36 - PROGRESS: at 28.92% examples, 279024 words/s, in_qsize 5, out_qsize 0
INFO - 10:34:30: EPOCH 36 - PROGRESS: at 30.98% examples, 278111 words/s, in_qsize 5, out_qsize 1
INFO - 10:34:31: EPOCH 36 - PROGRESS: at 33.16% examples, 277419 words/s, in_qsize 5, out_qsize 0
INFO - 10:34:32: EPO

INFO - 10:35:45: EPOCH 37 - PROGRESS: at 92.06% examples, 274815 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:46: EPOCH 37 - PROGRESS: at 94.42% examples, 275258 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:47: EPOCH 37 - PROGRESS: at 96.36% examples, 274638 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:48: EPOCH 37 - PROGRESS: at 98.41% examples, 274114 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:49: worker thread finished; awaiting finish of 2 more threads
INFO - 10:35:49: worker thread finished; awaiting finish of 1 more threads
INFO - 10:35:49: worker thread finished; awaiting finish of 0 more threads
INFO - 10:35:49: EPOCH - 37 : training on 17005207 raw words (12820774 effective words) took 46.8s, 273842 effective words/s
INFO - 10:35:50: EPOCH 38 - PROGRESS: at 1.94% examples, 244730 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:51: EPOCH 38 - PROGRESS: at 4.17% examples, 263612 words/s, in_qsize 5, out_qsize 0
INFO - 10:35:52: EPOCH 38 - PROGRESS: at 6.35% examples, 266016 

INFO - 10:37:05: EPOCH 39 - PROGRESS: at 54.32% examples, 264259 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:06: EPOCH 39 - PROGRESS: at 56.38% examples, 263985 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:07: EPOCH 39 - PROGRESS: at 58.38% examples, 263169 words/s, in_qsize 5, out_qsize 2
INFO - 10:37:08: EPOCH 39 - PROGRESS: at 60.49% examples, 263098 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:09: EPOCH 39 - PROGRESS: at 62.61% examples, 263220 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:10: EPOCH 39 - PROGRESS: at 64.61% examples, 262438 words/s, in_qsize 6, out_qsize 1
INFO - 10:37:11: EPOCH 39 - PROGRESS: at 66.49% examples, 261739 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:12: EPOCH 39 - PROGRESS: at 68.67% examples, 262109 words/s, in_qsize 4, out_qsize 1
INFO - 10:37:13: EPOCH 39 - PROGRESS: at 70.72% examples, 261916 words/s, in_qsize 6, out_qsize 0
INFO - 10:37:14: EPOCH 39 - PROGRESS: at 72.90% examples, 262264 words/s, in_qsize 5, out_qsize 0
INFO - 10:37:15: EPO

In [51]:
# Infer a vector for a new document; infer_vector expects a list of tokens
print(model.infer_vector('gensim is really awesome'.split()))

[ 0.18035454  0.41629905  0.25469506 -0.22324803 -0.09268823 -0.05151142
 -0.08480603 -0.00570442  0.08062097  0.00310741  0.17478175 -0.00154936
 -0.09330877 -0.02339078  0.0199794   0.04549846 -0.04683316 -0.03828195
 -0.3936776  -0.2056224  -0.04153012 -0.2085073   0.3692062   0.12547413
 -0.03771724  0.09459786  0.34567693 -0.34003636 -0.14349678  0.0229786
  0.1584827   0.02953391  0.13116093  0.02679738  0.02890064  0.06687283
  0.2509888   0.15626836 -0.03766628  0.24216698  0.04326655  0.18548243
  0.25629497 -0.10298692 -0.3509259  -0.06340942 -0.22913803 -0.14003526
  0.04637286 -0.14653628]


Used materials:

- https://www.machinelearningplus.com/nlp/gensim-tutorial/
- https://radimrehurek.com/gensim/models/word2vec.html
- https://radimrehurek.com/gensim/models/keyedvectors.html
- https://rare-technologies.com/word2vec-tutorial/