# Big Data Content Analytics - AUEB

## Introduction to Word Embeddings using Gensim

* Lab Assistant: George Perakis
* Email: gperakis[at]aueb.gr

Information about [Word Embeddings](https://en.wikipedia.org/wiki/Word_embedding)


<img src='https://www.tensorflow.org/images/linear-relationships.png'>


### Importing Modules

In [2]:
import os
import zipfile

import gensim
import logging
import re

# import from nltk the functions that split a text into sentences and tokens
from nltk.tokenize import sent_tokenize, word_tokenize
from six.moves import urllib

In [3]:
# function that downloads a text
def maybe_download(url, filename, expected_bytes):
    """
    Download a file if not present, and make sure it's the right size.

    :param url: str
    :param filename: str
    :param expected_bytes: int
    :return: str, the local path of the downloaded and verified file
    """
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)

    statinfo = os.stat(filename)

    if statinfo.st_size == expected_bytes:
        print(f'Found and verified {filename}')
    else:
        # show the actual size to help diagnose a partial or wrong download
        print(statinfo.st_size)
        raise Exception(f'Failed to verify {filename}. Can you get to it with a browser?')
    
    return filename

In [4]:
class MySentences(object):
    """
    This is a class which given a text iterates through all the sentences 
    and yields lists of sentence tokens
    """

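    def __init__(self,
                 the_text):
        """
        :param the_text: str, the raw text to be split into sentences
        """

        self.my_text = the_text

    def __iter__(self):
        """
        :return: generator that yields one list of lowercased tokens per sentence
        """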
        for sentence in sent_tokenize(self.my_text):
            
            yield word_tokenize(sentence.lower())
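
Gensim reads the corpus once to build the vocabulary and then once more per training epoch, so the object passed to it has to be restartable. A plain generator is exhausted after the first pass, which is presumably why `MySentences` implements `__iter__`. The sketch below illustrates the difference. It uses a simple whitespace split in place of the NLTK tokenizers (an assumption made to keep it dependency-free), and `one_shot_sentences` and `ReiterableSentences` are hypothetical names.

```python
def one_shot_sentences(text):
    """Generator function: the returned generator can be consumed only once."""
    for line in text.splitlines():
        yield line.lower().split()


class ReiterableSentences:
    """Iterable: every call to iter() restarts from the beginning."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        for line in self.text.splitlines():
            yield line.lower().split()


toy_text = "The cat sat\nThe dog barked"

gen = one_shot_sentences(toy_text)
first_pass_gen = list(gen)
second_pass_gen = list(gen)   # empty: the generator is already exhausted

corpus = ReiterableSentences(toy_text)
first_pass = list(corpus)
second_pass = list(corpus)    # same content again

print(second_pass_gen)
print(second_pass)
```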

In [5]:
url = 'http://mattmahoney.net/dc/'

filename = maybe_download(url=url, filename='text8.zip', expected_bytes=31344016)

with zipfile.ZipFile(filename) as f:
    my_text = f.read(f.namelist()[0])

Found and verified text8.zip


In [6]:
my_text[:1000]

# the starting 'b' shows us that it is bytes encoded

b' anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institu

In [7]:
print(len(my_text))

100000000


In [8]:
# decode the bytes into a string; str() would return the "b'...'" representation instead

my_text = my_text[:5_000_000].decode('utf-8')  # keep only the first 5 million characters

sentences = MySentences(my_text)  # a memory-friendly iterator
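
Calling `str()` on a `bytes` object returns its printable representation, including the `b'...'` wrapper, whereas `.decode()` returns the text itself. A small illustration on a short made-up byte string:

```python
raw = b' anarchism originated'

as_repr = str(raw)             # printable representation, keeps the b'...' wrapper
as_text = raw.decode('utf-8')  # the actual text

print(as_repr)
print(as_text)
```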

In [9]:
sentences

<__main__.MySentences at 0x27521458d08>

In [10]:
# for s in sentences:
#     print(s)
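
The text8 corpus is lowercase and contains no punctuation, so `sent_tokenize` has no sentence boundaries to find and returns the whole text as a single sentence. The training log further down reports "1 sentences" accordingly. The Gensim documentation notes that texts longer than 10,000 words are truncated by the optimized training code, which is consistent with the "10000 effective words" per epoch in that log. One possible workaround, sketched here under the assumption that fixed-size windows are an acceptable substitute for real sentences, is to cut the token stream into chunks. `chunk_tokens` and the chunk sizes are hypothetical choices, not part of the original lab.

```python
def chunk_tokens(tokens, chunk_size=1000):
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


toy_tokens = 'anarchism originated as a term of abuse first used against'.split()

# a tiny chunk size so the effect is visible on ten tokens
chunks = list(chunk_tokens(toy_tokens, chunk_size=4))
print(chunks)
```

A list of such chunks is restartable and could be passed to `gensim.models.Word2Vec` in place of `sentences`, so that every token contributes to training.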

In [11]:
# Reference: the official Gensim Word2Vec tutorial
# https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

In [12]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
gensim.models.Word2Vec()

In [15]:
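model = gensim.models.Word2Vec(sentences,
                               vector_size=300, # Dimensionality of the word vectors.
                               workers=4, # Number of worker threads used for training.
                               min_count=5, # Ignores all words with total frequency lower than this.
                               sg=1, # Training algorithm: 1 for skip-gram; otherwise CBOW.
                               window=5, # Maximum distance between the current and the predicted word.
                               epochs=200, # Number of passes over the corpus.
                               negative=15) # Number of noise words drawn per positive example (negative sampling).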

2021-05-17 22:13:51,127 : INFO : collecting all words and their counts
2021-05-17 22:13:55,071 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-05-17 22:13:55,224 : INFO : collected 47044 word types from a corpus of 847127 raw words and 1 sentences
2021-05-17 22:13:55,226 : INFO : Creating a fresh vocabulary
2021-05-17 22:13:55,289 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 12515 unique words (26.60275486778335%% of original 47044, drops 34529)', 'datetime': '2021-05-17T22:13:55.288263', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'prepare_vocab'}
2021-05-17 22:13:55,289 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 791834 word corpus (93.47287950921172%% of original 847127, drops 55293)', 'datetime': '2021-05-17T22:13:55.289259', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c

2021-05-17 22:14:37,983 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:14:38,274 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:14:38,275 : INFO : EPOCH - 10 : training on 847127 raw words (10000 effective words) took 3.6s, 2761 effective words/s
2021-05-17 22:14:42,290 : INFO : EPOCH 11 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:14:42,292 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:14:42,293 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:14:42,294 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:14:42,574 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:14:42,575 : INFO : EPOCH - 11 : training on 847127 raw words (10000 effective words) took 3.7s, 2724 effective words/s
2021-05-17 22:14:46,402 : INFO : EPOCH 12 - PROGRESS: at 0.00% examples, 0 words

2021-05-17 22:15:36,656 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:15:36,657 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:15:36,658 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:15:36,933 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:15:36,933 : INFO : EPOCH - 24 : training on 847127 raw words (10000 effective words) took 3.5s, 2836 effective words/s
2021-05-17 22:15:40,770 : INFO : EPOCH 25 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:15:40,772 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:15:40,774 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:15:40,775 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:15:41,051 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:15:41,052 : INFO 

2021-05-17 22:16:31,394 : INFO : EPOCH - 37 : training on 847127 raw words (10000 effective words) took 3.7s, 2714 effective words/s
2021-05-17 22:16:35,363 : INFO : EPOCH 38 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:16:35,365 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:16:35,366 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:16:35,367 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:16:35,634 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:16:35,636 : INFO : EPOCH - 38 : training on 847127 raw words (10000 effective words) took 3.7s, 2738 effective words/s
2021-05-17 22:16:39,551 : INFO : EPOCH 39 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:16:39,553 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:16:39,554 : INFO : worker thread finished; awaiti

2021-05-17 22:17:29,742 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:17:29,743 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:17:30,026 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:17:30,027 : INFO : EPOCH - 51 : training on 847127 raw words (10000 effective words) took 3.6s, 2816 effective words/s
2021-05-17 22:17:34,076 : INFO : EPOCH 52 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:17:34,078 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:17:34,079 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:17:34,080 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:17:34,344 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:17:34,345 : INFO : EPOCH - 52 : training on 847127 raw words (10000 effective words) took 3.1s, 3214 effecti

2021-05-17 22:18:28,075 : INFO : EPOCH 65 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:18:28,077 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:18:28,078 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:18:28,079 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:18:28,340 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:18:28,341 : INFO : EPOCH - 65 : training on 847127 raw words (10000 effective words) took 3.6s, 2783 effective words/s
2021-05-17 22:18:32,250 : INFO : EPOCH 66 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:18:32,251 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:18:32,251 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:18:32,252 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:

2021-05-17 22:19:21,749 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:19:21,750 : INFO : EPOCH - 78 : training on 847127 raw words (10000 effective words) took 3.5s, 2868 effective words/s
2021-05-17 22:19:25,522 : INFO : EPOCH 79 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:19:25,523 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:19:25,525 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:19:25,526 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:19:25,774 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:19:25,775 : INFO : EPOCH - 79 : training on 847127 raw words (10000 effective words) took 3.5s, 2891 effective words/s
2021-05-17 22:19:29,845 : INFO : EPOCH 80 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:19:29,847 : INFO : worker thread finished; awaiti

2021-05-17 22:20:19,148 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:20:19,148 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:20:19,149 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:20:19,438 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:20:19,439 : INFO : EPOCH - 92 : training on 847127 raw words (10000 effective words) took 3.8s, 2658 effective words/s
2021-05-17 22:20:23,338 : INFO : EPOCH 93 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:20:23,341 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:20:23,342 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:20:23,343 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:20:23,587 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:20:23,589 : INFO 

2021-05-17 22:21:13,452 : INFO : EPOCH - 105 : training on 847127 raw words (10000 effective words) took 3.6s, 2814 effective words/s
2021-05-17 22:21:17,356 : INFO : EPOCH 106 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:21:17,358 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:21:17,359 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:21:17,360 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:21:17,609 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:21:17,610 : INFO : EPOCH - 106 : training on 847127 raw words (10000 effective words) took 3.6s, 2816 effective words/s
2021-05-17 22:21:21,521 : INFO : EPOCH 107 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:21:21,524 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:21:21,525 : INFO : worker thread finished; aw

2021-05-17 22:22:15,997 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:22:15,998 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:22:16,311 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:22:16,313 : INFO : EPOCH - 119 : training on 847127 raw words (10000 effective words) took 3.9s, 2542 effective words/s
2021-05-17 22:22:21,930 : INFO : EPOCH 120 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:22:21,934 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:22:21,935 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:22:21,937 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:22:22,240 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:22:22,241 : INFO : EPOCH - 120 : training on 847127 raw words (10000 effective words) took 4.8s, 2077 effe

2021-05-17 22:23:19,134 : INFO : EPOCH 133 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 2, out_qsize 2
2021-05-17 22:23:19,135 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:23:19,136 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:23:19,136 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:23:19,378 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:23:19,379 : INFO : EPOCH - 133 : training on 847127 raw words (10000 effective words) took 3.6s, 2775 effective words/s
2021-05-17 22:23:23,352 : INFO : EPOCH 134 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:23:23,354 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:23:23,355 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:23:23,355 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 

2021-05-17 22:24:12,680 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:24:12,680 : INFO : EPOCH - 146 : training on 847127 raw words (10000 effective words) took 3.5s, 2835 effective words/s
2021-05-17 22:24:16,527 : INFO : EPOCH 147 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:24:16,528 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:24:16,529 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:24:16,530 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:24:16,773 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:24:16,774 : INFO : EPOCH - 147 : training on 847127 raw words (10000 effective words) took 3.5s, 2846 effective words/s
2021-05-17 22:24:20,575 : INFO : EPOCH 148 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 2, out_qsize 2
2021-05-17 22:24:20,577 : INFO : worker thread finished; aw

2021-05-17 22:25:09,654 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:25:09,655 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:25:09,656 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:25:09,891 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:25:09,892 : INFO : EPOCH - 160 : training on 847127 raw words (10000 effective words) took 3.5s, 2868 effective words/s
2021-05-17 22:25:13,876 : INFO : EPOCH 161 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:25:13,879 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:25:13,879 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:25:13,880 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:25:14,140 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:25:14,142 : INF

2021-05-17 22:26:06,798 : INFO : EPOCH - 173 : training on 847127 raw words (10000 effective words) took 4.1s, 2414 effective words/s
2021-05-17 22:26:10,734 : INFO : EPOCH 174 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:26:10,736 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:26:10,736 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:26:10,738 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:26:10,966 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:26:10,967 : INFO : EPOCH - 174 : training on 847127 raw words (10000 effective words) took 3.6s, 2797 effective words/s
2021-05-17 22:26:15,154 : INFO : EPOCH 175 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 2, out_qsize 2
2021-05-17 22:26:15,157 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:26:15,158 : INFO : worker thread finished; aw

2021-05-17 22:27:08,645 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:27:08,646 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:27:08,991 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:27:08,993 : INFO : EPOCH - 187 : training on 847127 raw words (10000 effective words) took 4.0s, 2502 effective words/s
2021-05-17 22:27:13,543 : INFO : EPOCH 188 - PROGRESS: at 0.00% examples, 0 words/s, in_qsize 1, out_qsize 3
2021-05-17 22:27:13,545 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-17 22:27:13,546 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-17 22:27:13,546 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-17 22:27:13,818 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-17 22:27:13,819 : INFO : EPOCH - 188 : training on 847127 raw words (10000 effective words) took 4.1s, 2459 effe

2021-05-17 22:28:08,013 : INFO : Word2Vec lifecycle event {'msg': 'training on 169425400 raw words (2000000 effective words) took 852.5s, 2346 effective words/s', 'datetime': '2021-05-17T22:28:08.013793', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}
2021-05-17 22:28:08,014 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec(vocab=12515, vector_size=300, alpha=0.025)', 'datetime': '2021-05-17T22:28:08.014790', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
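
With `sg=1` the model is trained to predict the surrounding words from each center word, and `window` bounds how far apart the two may be. The sketch below enumerates the (center, context) pairs for a fixed window. It is a simplification: Gensim's actual training also applies details such as randomly reduced windows and subsampling of frequent words, which are left out here. `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs that are at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(['anarchism', 'originated', 'as', 'a', 'term'], window=1)
print(pairs)
```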


In [16]:
model_name = 'my_w2v_model'

In [17]:
model.save(model_name)

2021-05-17 22:28:08,045 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'my_w2v_model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-05-17T22:28:08.045706', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'saving'}
2021-05-17 22:28:08,047 : INFO : not storing attribute cum_table
2021-05-17 22:28:08,076 : INFO : saved my_w2v_model


In [18]:
# Load pre-trained Word2Vec model.
model = gensim.models.Word2Vec.load(model_name)

2021-05-17 22:28:08,087 : INFO : loading Word2Vec object from my_w2v_model
2021-05-17 22:28:08,133 : INFO : loading wv recursively from my_w2v_model.wv.* with mmap=None
2021-05-17 22:28:08,135 : INFO : setting ignored attribute cum_table to None
2021-05-17 22:28:08,293 : INFO : Word2Vec lifecycle event {'fname': 'my_w2v_model', 'datetime': '2021-05-17T22:28:08.293079', 'gensim': '4.0.1', 'python': '3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'loaded'}


In [19]:
# model.wv.most_similar

# for i in range(50):
#     seed_word = model.wv.index_to_key[i]
#     most_similar = model.wv.most_similar(positive=[seed_word], topn=3)
#     print(seed_word, [w[0] for w in most_similar])

test_words = ['god',
              'the',
              'man',
              'one',
              'while',
              'film']

for seed_word in test_words:
    
    most_similar = model.wv.most_similar(positive=[seed_word],
                                         topn=3)
    
    print('Word: "{}" | Similar Words: {}'.format(seed_word,
                                                  [w[0] for w in most_similar]))
    print()

Word: "god" | Similar Words: ['region', 'earthly', 'kingdom']

Word: "the" | Similar Words: ['of', 'and', 'in']

Word: "man" | Similar Words: ['rain', 'film', 'hoffman']

Word: "one" | Similar Words: ['nine', 'six', 'followed']

Word: "while" | Similar Words: ['waiting', 'methodology', 'problem']

Word: "film" | Similar Words: ['hoffman', 'kim', 'documentary']



In [20]:
# show which word does not match

model.wv.doesnt_match(('dog', 'one', 'two', 'three'))

'dog'
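
Conceptually, `doesnt_match` picks the word whose vector is least similar to the average of the group. The following is a simplified sketch of that idea on made-up 2-D vectors, not Gensim's exact implementation; `toy_vectors`, `unit` and `odd_one_out` are hypothetical names.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the key whose unit vector has the smallest dot product with the group mean."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# made-up vectors: the three number words point in a similar direction
toy_vectors = {
    'dog': [0.1, 1.0],
    'one': [1.0, 0.1],
    'two': [0.9, 0.2],
    'three': [1.0, 0.0],
}
result = odd_one_out(toy_vectors)
print(result)
```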

In [21]:
print(model.wv.similarity('man', 'man'))

1.0


In [22]:
print(model.wv.similarity('woman', 'man'))

-0.08078769


In [23]:
print(model.wv.similarity('woman', 'queen'))

0.012887699


In [24]:
print(model.wv.similarity('woman', 'dog'))

0.04141955
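
`similarity` returns the cosine similarity of the two word vectors, which is why comparing a word with itself gives 1.0. A minimal worked sketch on toy vectors (the numbers are made up for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite directions

print(same, orthogonal, opposite)
```

Values close to 0, like the ones printed above for `woman` against `man`, `queen` and `dog`, indicate that the model has learned almost no relationship between those words.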


In [25]:
model.wv.similar_by_word('documentary', 10)

[('daniel', 0.971073567867279),
 ('kim', 0.9185670614242554),
 ('film', 0.8943195939064026),
 ('hoffman', 0.8881309032440186),
 ('brain', 0.8802371025085449),
 ('subject', 0.8350253105163574),
 ('bright', 0.7888103723526001),
 ('character', 0.7736836075782776),
 ('rain', 0.7672221064567566),
 ('relatives', 0.7610706686973572)]
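
`similar_by_word` and `most_similar` rank the other vocabulary words by cosine similarity to the query word and return the top `n` together with their scores. The sketch below reproduces that ranking on a made-up four-word vocabulary; `toy_vectors` and `top_n_similar` are hypothetical names, and a real model would do this with vectorized NumPy operations over the full vocabulary.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_n_similar(word, vectors, n=2):
    """Return the n (word, score) pairs most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = ((other, cosine(query, vec)) for other, vec in vectors.items() if other != word)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


toy_vectors = {
    'film': [1.0, 0.2],
    'documentary': [0.9, 0.3],
    'rain': [0.1, 1.0],
    'night': [-0.5, 0.8],
}
neighbours = top_n_similar('documentary', toy_vectors, n=2)
print(neighbours)
```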

In [26]:
model.wv.get_vector('night')

array([-2.9594509e-03, -2.0009088e-03,  8.5113524e-04, -2.4761979e-03,
        4.2655627e-04, -4.5240880e-04, -5.7195820e-04, -2.4489322e-04,
       -1.3350558e-03, -1.3379772e-03, -1.5226364e-04,  2.8664311e-03,
        1.5097983e-03,  1.1024126e-03,  2.5248926e-04, -1.4902433e-04,
       -3.0235926e-04,  7.4696861e-04,  2.2665032e-03,  2.7668477e-05,
        1.0555109e-04,  1.8463723e-03,  1.2221996e-03, -3.0089950e-03,
       -3.1848550e-03, -2.0721301e-03,  7.1663619e-04,  1.4244484e-03,
       -3.0999652e-03,  3.4510216e-04, -1.9893774e-03, -3.3267539e-03,
        1.3148189e-03,  1.7756788e-03, -3.0444202e-03, -2.1506445e-03,
       -2.5049043e-03, -7.3171692e-04,  3.6889553e-04, -8.1696193e-04,
        3.1042933e-03,  1.5080484e-03, -7.4103515e-04, -2.4928919e-03,
        3.2507284e-03,  1.7063983e-03,  1.2215407e-03,  2.9503074e-03,
       -2.8960181e-03, -1.8323390e-03,  9.4424089e-04, -1.7132505e-03,
        7.0699217e-04, -3.0525383e-03,  1.1759575e-03,  2.0365857e-03,
      

Further reading: the [Pathmind wiki page on word2vec](https://pathmind.com/wiki/word2vec#:~:text=Word2vec%20is%20a%20two%2Dlayer,deep%20neural%20networks%20can%20understand.) and [Chris McCormick's tutorial on the skip-gram model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/).