### Generate (i.e., train) your own Embeddings Matrix


#### Background

Word Embeddings contain Word Vectors: Distributed Representations of tokens in a low-dimensional, dense form (in contrast to the high-dimensional, sparse 1-hot form). The Word Embeddings Matrix has shape N × K (the N most frequent tokens by K embedding-space dimensions). It is simply a lookup table from each token to its embedding vector, for use in NLP machine learning, both unsupervised and supervised.
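
To make the lookup-table view concrete, here is a minimal sketch in plain Python. The four-token vocabulary and the 3-dimensional vectors are invented for illustration and do not come from a trained model. The sketch shows that multiplying a 1-hot vector by the Embeddings Matrix gives the same result as reading one row of the table.

```python
# Toy vocabulary (N = 4 tokens) and Embeddings Matrix (N x K, with K = 3).
# All values are invented for illustration only.
vocab = ["king", "queen", "man", "woman"]
token_to_index = {token: i for i, token in enumerate(vocab)}

embeddings = [
    [0.9, 0.1, 0.4],   # king
    [0.8, 0.2, 0.9],   # queen
    [0.1, 0.7, 0.3],   # man
    [0.2, 0.6, 0.8],   # woman
]


def one_hot(token):
    """Sparse, high-dimensional form: length N with a single 1."""
    vec = [0.0] * len(vocab)
    vec[token_to_index[token]] = 1.0
    return vec


def embed_by_lookup(token):
    """Dense, low-dimensional form: read row i of the table."""
    return embeddings[token_to_index[token]]


def embed_by_matmul(token):
    """Same vector via (1 x N) 1-hot vector times (N x K) matrix."""
    hot = one_hot(token)
    k = len(embeddings[0])
    return [sum(hot[i] * embeddings[i][j] for i in range(len(vocab)))
            for j in range(k)]


print(one_hot("queen"))           # length-4 sparse vector
print(embed_by_lookup("queen"))   # length-3 dense vector
print(embed_by_matmul("queen"))   # same values as the lookup
```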

NOTE: the individual dimensions of this low-dimensional space are opaque, i.e., not directly interpretable (compare the eigenspace dimensions in SVD). Even so, word embeddings provide a means of measuring syntactic and semantic relationships between tokens.

Word Embeddings are generated through Unsupervised Pre-training, e.g., Word2Vec, GloVe, or this custom Embeddings Matrix.

Further below are a few typical use cases of distance-based arithmetic between tokens, i.e., between their Word Vectors. Gensim's Word2Vec similarity queries (`similarity`, `most_similar`) rank words by cosine similarity. The results below may look vague; this is because of the small corpus and the scaled-down training.
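
The two measures mentioned here can be sketched in a few lines of plain Python. This is an illustrative sketch, not Gensim's code; the vectors `u` and `v` are made up. It shows why cosine similarity is a natural choice for word vectors: it compares direction only and ignores vector length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the end points of u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up vectors: v points in the same direction as u but is twice as long.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

print(cosine_similarity(u, v))    # close to 1.0: same direction
print(euclidean_distance(u, v))   # non-zero: the lengths differ
```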


#### Gensim Word2Vec model training

Gensim’s Word2Vec expects a sequence of corpus sentences as its input, each sentence being a list of tokens. Its first pass collects the words and their frequencies to build an internal dictionary tree structure. It then makes `iter` (epoch) further passes over the sentences to train the neural network model.
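
The first (vocabulary) pass can be pictured with the following sketch. It is a simplification under stated assumptions, not Gensim's implementation: `build_vocab` and `toy_sentences` are hypothetical names, and the only behaviour modelled is counting tokens and dropping those rarer than `min_count`.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Count every token, then keep only tokens seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {tok: n for tok, n in counts.items() if n >= min_count}
    # Most frequent token gets index 0; ties are broken alphabetically.
    ordered = sorted(kept, key=lambda tok: (-kept[tok], tok))
    return {tok: i for i, tok in enumerate(ordered)}, kept


# Hypothetical two-sentence corpus, already tokenized.
toy_sentences = [
    ["to", "be", "or", "not", "to", "be"],
    ["to", "thine", "own", "self", "be", "true"],
]

index, counts = build_vocab(toy_sentences, min_count=2)
print(index)    # {'be': 0, 'to': 1}
print(counts)   # only 'to' and 'be' occur at least twice
```

Tokens that fall below `min_count` never receive a row in the Embeddings Matrix, which is why the vocabulary size reported after training is smaller than the number of word types seen in the log.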


##### Food for thought: 

Why not just use the Positional Index from the corpus vocab/dict?
What is the difference between using a Positional Index and 1-hot?
What is the difference between the above token-to-number options and Word Embeddings?
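
One possible way to explore these questions is to compare the distances each encoding implies. The sketch below is only a starting point: the three-token vocabulary and the 2-dimensional `toy_embedding` values are invented, with "cat" and "dog" deliberately placed close together.

```python
import math

vocab = ["cat", "dog", "car"]
index = {tok: i for i, tok in enumerate(vocab)}   # positional index


def one_hot(tok):
    return [1.0 if i == index[tok] else 0.0 for i in range(len(vocab))]


# Invented 2-d embeddings (assumption): "cat" and "dog" are close on purpose.
toy_embedding = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


for a, b in [("cat", "dog"), ("cat", "car"), ("dog", "car")]:
    print(
        a, b,
        abs(index[a] - index[b]),                                  # depends only on vocab order
        round(euclidean(one_hot(a), one_hot(b)), 3),               # identical for every pair
        round(euclidean(toy_embedding[a], toy_embedding[b]), 3),   # can vary per pair
    )
```

The positional index imposes an ordering that is an accident of how the vocabulary was built, 1-hot makes every pair of distinct tokens equally far apart, and only the dense vectors leave room for graded similarity.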

In [1]:
import warnings
warnings.filterwarnings('ignore')

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
import multiprocessing
from gensim.models.word2vec import Word2Vec

In [2]:
# Shakespeare.txt from Gutenberg open source http://norvig.com/ngrams/

# Gensim's Word2Vec expects sentences to be fed sequentially, hence this iterator class over the corpus sentences
class GetSentencesFromDir(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # 'with' closes each file once all of its lines have been read
            with open(os.path.join(self.dirname, fname)) as corpus_file:
                for line in corpus_file:
                    # Naive whitespace tokenization: each line is treated as one sentence
                    yield line.split()

# Feed the appropriate path to the folder containing the text corpus file(s)
sentences = GetSentencesFromDir('./data/text_corpus_shakespeare') # a memory-friendly iterator
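
Why a class with `__iter__` rather than a plain generator? Word2Vec reads the corpus more than once (one vocabulary pass, then one pass per epoch), and a generator is exhausted after its first pass. The sketch below illustrates the difference on a temporary directory; `SentencesFromDir`, `sentences_generator` and the file `toy.txt` are hypothetical names used only for this demonstration.

```python
import os
import tempfile


class SentencesFromDir:
    """Re-iterable: every call to __iter__ re-opens the files."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                for line in f:
                    yield line.split()


def sentences_generator(dirname):
    """One-shot generator: nothing is left after a single pass."""
    yield from SentencesFromDir(dirname)


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny two-line corpus file (hypothetical content).
    with open(os.path.join(tmp, "toy.txt"), "w", encoding="utf-8") as f:
        f.write("to be or not to be\nthat is the question\n")

    iterable = SentencesFromDir(tmp)
    first_pass = list(iterable)
    second_pass = list(iterable)    # the same sentences again

    gen = sentences_generator(tmp)
    gen_first = list(gen)
    gen_second = list(gen)          # empty: the generator is used up

print(len(first_pass), len(second_pass))   # 2 2
print(len(gen_first), len(gen_second))     # 2 0
```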

In [3]:
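# Parameter names as in Gensim 3.x; Gensim 4.x renamed 'size' to 'vector_size' and 'iter' to 'epochs'
params = {'size': 20, 'window': 5, 'min_count': 2, 'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3, 'iter': 2}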

model = Word2Vec(sentences, **params)

2018-11-22 10:58:21,060 : INFO : collecting all words and their counts
2018-11-22 10:58:21,152 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-11-22 10:58:21,262 : INFO : PROGRESS: at sentence #10000, processed 82874 words, keeping 8218 word types
2018-11-22 10:58:21,312 : INFO : PROGRESS: at sentence #20000, processed 159884 words, keeping 12353 word types
2018-11-22 10:58:21,375 : INFO : PROGRESS: at sentence #30000, processed 240235 words, keeping 15238 word types
2018-11-22 10:58:21,433 : INFO : PROGRESS: at sentence #40000, processed 319260 words, keeping 18133 word types
2018-11-22 10:58:21,485 : INFO : PROGRESS: at sentence #50000, processed 394354 words, keeping 20518 word types
2018-11-22 10:58:21,553 : INFO : PROGRESS: at sentence #60000, processed 475506 words, keeping 22757 word types
2018-11-22 10:58:21,611 : INFO : PROGRESS: at sentence #70000, processed 553869 words, keeping 25222 word types
2018-11-22 10:58:21,668 : INFO : PROGRESS: at se

In [4]:
# Vocabulary size after min_count filtering (Gensim 4.x: len(model.wv.key_to_index))
len(model.wv.vocab)

17786

In [5]:
# Query through model.wv; calling most_similar directly on the model is deprecated
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

2018-11-22 10:58:25,724 : INFO : precomputing L2-norms of word weight vectors


[('majesty', 0.9783057570457458)]
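
What does a `positive`/`negative` query compute? Roughly, the unit-normalized input vectors are combined (positive ones added, negative ones subtracted) and the remaining vocabulary is ranked by cosine similarity to the result. The sketch below is a simplified illustration under that assumption, not Gensim's code; the function `most_similar` here is a stand-in written for this example, and the 3-dimensional `toy` vectors are invented.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    # Summing instead of averaging does not change the cosine ranking.
    excluded = set(positive) | set(negative)   # input words are never returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors (assumption): axes loosely mean [royal, male, female].
toy = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "castle": [1.0, 0.0, 0.0],
    "apple":  [0.0, 0.1, 0.1],
}

result = most_similar(toy, positive=["woman", "king"], negative=["man"], topn=1)
print(result)   # 'queen' ranks first on this hand-made data
```

On a small corpus with only two training passes, the learned vectors are far noisier than these hand-made ones, which is consistent with the model returning 'majesty' rather than 'queen'.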

In [6]:
model.wv.doesnt_match("breakfast cereal dinner lunch".split())



'breakfast'
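
The odd-one-out query can be pictured as follows: average the unit-normalized vectors of the given words, then return the word whose vector is least similar (by cosine) to that average. This is a simplified sketch under that assumption, not Gensim's code; `doesnt_match` below is a stand-in written for this example and the 2-dimensional `toy` vectors are invented.

```python
import math


def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)


# Invented vectors (assumption): the three meals point one way, "cereal" another.
toy = {
    "breakfast": [1.0, 0.1],
    "lunch":     [0.9, 0.2],
    "dinner":    [0.95, 0.15],
    "cereal":    [0.1, 1.0],
}

print(doesnt_match(toy, ["breakfast", "cereal", "dinner", "lunch"]))   # cereal
```

With these hand-made vectors the answer is 'cereal'; the trained model's answer depends entirely on what it learned from the small corpus.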

In [7]:
model.wv.similarity('woman', 'man')

0.9032757

In [8]:
model.wv.most_similar("man")

[('thing', 0.9198711514472961),
 ('urine', 0.9072387218475342),
 ('woman', 0.9032757878303528),
 ('matter', 0.8991479873657227),
 ('ass', 0.8979557156562805),
 ('gentleman', 0.8971355557441711),
 ('there', 0.8926539421081543),
 ('fool', 0.8861606121063232),
 ('indeed', 0.8844612240791321),
 ('better', 0.8821287751197815)]

In [9]:
model.wv.most_similar("queen")

[('spirit', 0.9883180260658264),
 ('dog', 0.982003927230835),
 ('conscience', 0.9820006489753723),
 ('servant', 0.9816766381263733),
 ('soul', 0.9808666706085205),
 ("here's", 0.980603814125061),
 ('kinsman', 0.9803838729858398),
 ('hat', 0.9798363447189331),
 ('name', 0.9795224070549011),
 ('mistress', 0.9792896509170532)]