<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#How-to-train-embeddings?" data-toc-modified-id="How-to-train-embeddings?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>How to train embeddings?</a></span></li><li><span><a href="#Training-$\texttt{word2vec}$-embeddings-with-Gensim" data-toc-modified-id="Training-$\texttt{word2vec}$-embeddings-with-Gensim-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training $\texttt{word2vec}$ embeddings with Gensim</a></span><ul class="toc-item"><li><span><a href="#Training-data" data-toc-modified-id="Training-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Training data</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#NLP-pipeline" data-toc-modified-id="NLP-pipeline-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>NLP pipeline</a></span></li><li><span><a href="#Get-bigrammed-sentences" data-toc-modified-id="Get-bigrammed-sentences-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Get bigrammed-sentences</a></span></li><li><span><a href="#Implement-the-$\texttt{word2vec}$-with-Gensim" data-toc-modified-id="Implement-the-$\texttt{word2vec}$-with-Gensim-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Implement the $\texttt{word2vec}$ with Gensim</a></span><ul class="toc-item"><li><span><a href="#Setup-the-params" data-toc-modified-id="Setup-the-params-2.5.1"><span class="toc-item-num">2.5.1&nbsp;&nbsp;</span>Setup the params</a></span></li><li><span><a href="#Build-the-vocabulary" data-toc-modified-id="Build-the-vocabulary-2.5.2"><span class="toc-item-num">2.5.2&nbsp;&nbsp;</span>Build the vocabulary</a></span></li><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-2.5.3"><span class="toc-item-num">2.5.3&nbsp;&nbsp;</span>Train the model</a></span></li></ul></li></ul></li><li><span><a href="#Explore-the-model" 
data-toc-modified-id="Explore-the-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Explore the model</a></span><ul class="toc-item"><li><span><a href="#Do-vectors-make-any-sense?" data-toc-modified-id="Do-vectors-make-any-sense?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Do vectors make any sense?</a></span></li><li><span><a href="#Closer-characters?" data-toc-modified-id="Closer-characters?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Closer characters?</a></span></li></ul></li><li><span><a href="#Statistical-post-processing-of-word-vectors" data-toc-modified-id="Statistical-post-processing-of-word-vectors-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Statistical post-processing of word vectors</a></span></li></ul></div>

# How to train embeddings?

Here are some Pythonic options (many others exist):

+ TensorFlow
+ PyTorch
+ Gensim

<span class="mark">Note:</span>

+ training embeddings falls beyond the remit of spaCy!
+ $\texttt{word2vec}$, fastText, and GloVe are algorithms, not pieces of software

# Training $\texttt{word2vec}$ embeddings with Gensim

## Training data

Let's play a little bit with [funny data...](https://www.kaggle.com/shilpibhattacharyya/the-big-bang-theory-dataset/data?select=big_bang_theory_dataset.csv)

<img src="images/_13.jpg" width="50%">

In [20]:
# read data

# --+ load libraries
import os
import pandas as pd

# --+ set path
in_f = os.path.join('data', 'big_bang_theory_dataset.csv')

# --+ create a df
df = pd.read_csv(in_f) 

In [21]:
# preview
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 |
| Location | The apartment | The apartment | The room in the basement |
| Scene | | | Sheldon enters, takes out a box, takes a bean... |
| Text | Again I’m right here. | Fine. The record shall so reflect. Now getting... | One two three four five six seven eight… Drat.... |
| Speaker | Leonard | Sheldon | Sheldon |
| Season | 3 | 5 | 6 |


In [22]:
# some cleaning

# --+ rename cols
old_cols = df.columns
new_cols = ['id', 'location', 'scene', 'text', 'speaker', 'season']
df.rename(dict(zip(old_cols, new_cols)), axis=1, inplace=True)

# --+ drop redundant columns
df.drop(['location', 'scene'], axis=1, inplace=True)

# --+ remove NaNs
df.dropna(inplace=True)

# --+ ... and check (fine, it's a small-ish dataset, but let's give it a try)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37688 entries, 0 to 38917
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       37688 non-null  int64 
 1   text     37688 non-null  object
 2   speaker  37688 non-null  object
 3   season   37688 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.4+ MB


## Setup

In [84]:
# utilities
from time import time
from collections import defaultdict
import re
import logging
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
                    datefmt= '%H:%M:%S', level=logging.INFO)
# nlp pipeline
import spacy

## NLP pipeline

In [43]:
# docs as list

# --+ some cleaning
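def cleaning(_string):
    '''
    Replace every run of characters other than ASCII letters and straight
    apostrophes with a single space, then lowercase the result.

    : argument: string
    : return  : string
    '''
    # purge non alpha characters
    alpha = re.sub("[^A-Za-z']+", ' ', str(_string))
    return alpha.lower()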

                   
# --+ get clean text
docs = [cleaning(item) for item in df.text.values]
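
To see what the regular expression in `cleaning` keeps and what it discards, the sketch below reruns the same rule on two strings. The first comes from the data preview; the second is invented for contrast. The typographic apostrophe `’` is not in the allowed character set, so contractions written with it are split in two, which may explain fragments such as `didn` turning up in the vocabulary.

```python
import re


def cleaning(_string):
    # same rule as in the notebook: anything that is not an ASCII letter
    # or a straight apostrophe becomes a single space, then lowercase
    alpha = re.sub("[^A-Za-z']+", ' ', str(_string))
    return alpha.lower()


# a line from the dataset preview (curly apostrophe)
curly = cleaning("Again I’m right here.")
# an invented line with a straight apostrophe, for contrast
straight = cleaning("I didn't say that!")

print(repr(curly))
print(repr(straight))
```

If splitting contractions is not wanted, one option is to map `’` to `'` before calling `cleaning`.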

In [59]:
# load pipeline
nlp = spacy.load('en_core_web_lg', disable=['ner', 'parser', 'tagger'])

In [67]:
# tokenized text
docs_tokens = []

for doc in docs:
    # keep lemmas of in-vocabulary tokens that are not stop words,
    # whitespace, or punctuation, and that are longer than one character
    tmp_tokens = [token.lemma_ for token in nlp(doc)
                  if not token.is_stop
                  and not token.is_space
                  and not token.is_punct
                  and not token.is_oov
                  and len(token.lemma_) > 1]
    docs_tokens.append(tmp_tokens)

In [69]:
# --+ let's store the tokenized text
df['tkn_text'] = docs_tokens

## Get bigrammed-sentences

In [70]:
# load some gensim 
from gensim.models.phrases import Phrases, Phraser

In [71]:
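# NOTE: `docs` holds raw strings, so Phrases iterates over characters here
# (hence the 646 "word types" in the log below); the next cell builds the
# bigram model from the tokenized documents instead
phrases = Phrases(docs, min_count=30, progress_per=10000)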

INFO - 09:37:08: collecting all words and their counts
INFO - 09:37:08: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 09:37:09: PROGRESS: at sentence #10000, processed 596681 words and 601 word types
INFO - 09:37:09: PROGRESS: at sentence #20000, processed 1194496 words and 619 word types
INFO - 09:37:10: PROGRESS: at sentence #30000, processed 1784787 words and 634 word types
INFO - 09:37:10: collected 646 word types from a corpus of 2240451 words (unigram + bigrams) and 37688 sentences
INFO - 09:37:10: using 646 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


In [73]:
# --+ connector words that may sit inside a phrase without affecting its score
# (Gensim 3.x API; Gensim 4 renamed `common_terms` to `connector_words`)
common_terms = ['of', 'with', 'without', 'and', 'or', 'the', 'a',
                'not', 'be', 'to', 'this', 'who', 'in']

# --+ find phrases as bigrams
bigram = Phrases(docs_tokens,
                 min_count=50,
                 # max_vocab_size=50000,
                 common_terms=common_terms)

# --+ manipulate docs
docs_phrased = [bigram[line] for line in docs_tokens]

INFO - 09:40:36: collecting all words and their counts
INFO - 09:40:36: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 09:40:36: PROGRESS: at sentence #10000, processed 51031 words and 38711 word types
INFO - 09:40:36: PROGRESS: at sentence #20000, processed 102110 words and 66533 word types
INFO - 09:40:36: PROGRESS: at sentence #30000, processed 152571 words and 89328 word types
INFO - 09:40:36: collected 104537 word types from a corpus of 191315 words (unigram + bigrams) and 37688 sentences
INFO - 09:40:36: using 104537 counts as vocab in Phrases<0 vocab, min_count=50, threshold=10.0, max_vocab_size=40000000>
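
The log reports `min_count=50` and `threshold=10.0`. Assuming Gensim's default scorer, a candidate bigram becomes a phrase when the score computed below exceeds the threshold. This is only a sketch: `bigram_score` is a hypothetical helper, not a Gensim function, and all the counts are made up.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the bigram 'a b' under the assumed default formula."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# made-up counts, for illustration only
score_frequent_pair = bigram_score(count_a=120, count_b=90, count_ab=80,
                                   vocab_size=10000, min_count=50)
score_loose_pair = bigram_score(count_a=2000, count_b=1500, count_ab=60,
                                vocab_size=10000, min_count=50)

threshold = 10.0
print(score_frequent_pair > threshold)  # words that mostly occur together
print(score_loose_pair > threshold)     # common words that rarely co-occur
```

The score rewards pairs that co-occur often relative to how frequent each word is on its own, so raising `min_count` or `threshold` yields fewer phrases.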


## Implement the $\texttt{word2vec}$ with Gensim

In [77]:
# let's try to speed things up a little bit
import multiprocessing
cores = multiprocessing.cpu_count()
# load gensim implementation of the word2vec
from gensim.models import Word2Vec

### Setup the params

In [78]:
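'''
Fixing the params requires significant knowledge about the corpus of text at
hand and the cultural and societal context for the corpus
'''
w2v_model = Word2Vec(min_count=20,      # ignore words occurring fewer than 20 times
                     window=2,          # max distance between target and context word
                     size=300,          # vector dimensionality (`vector_size` in Gensim 4)
                     sample=6e-5,       # threshold for downsampling frequent words
                     alpha=0.03,        # initial learning rate
                     min_alpha=0.0007,  # learning rate at the end of training
                     negative=20,       # noise words drawn per positive example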
                     workers=cores-1)
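
With `window=2`, each target word is paired with at most two words on either side. A minimal sketch of how such (context, target) pairs could be enumerated follows; `context_pairs` is a hypothetical helper and the sentence is invented. Gensim additionally shrinks the window at random during training, which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a fixed window size."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


# invented, already-tokenized sentence
sentence = ['sheldon', 'knock', 'penny', 'door', 'time']
pairs = context_pairs(sentence, window=2)

for context, target in pairs:
    print(target.ljust(8), context)
```

A small window such as this one tends to capture words that are interchangeable in context, while larger windows lean towards topical relatedness.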

### Build the vocabulary

In [86]:
t = time()

w2v_model.build_vocab(docs_phrased, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 09:48:42: collecting all words and their counts
INFO - 09:48:42: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 09:48:42: PROGRESS: at sentence #10000, processed 50797 words, keeping 7053 word types
INFO - 09:48:42: PROGRESS: at sentence #20000, processed 101661 words, keeping 9684 word types
INFO - 09:48:42: PROGRESS: at sentence #30000, processed 151920 words, keeping 11405 word types
INFO - 09:48:42: collected 12352 word types from a corpus of 190504 raw words and 37688 sentences
INFO - 09:48:42: Loading a fresh vocabulary
INFO - 09:48:42: effective_min_count=20 retains 1317 unique words (10% of original 12352, drops 11035)
INFO - 09:48:42: effective_min_count=20 leaves 151104 word corpus (79% of original 190504, drops 39400)
INFO - 09:48:42: deleting the raw counts dictionary of 12352 items
INFO - 09:48:42: sample=6e-05 downsamples 1134 most-common words
INFO - 09:48:42: downsampling leaves estimated 45895 word corpus (30.4% of prior 151104)
INFO - 

Time to build vocab: 0.01 mins
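
The log line `sample=6e-05 downsamples 1134 most-common words` refers to frequent-word subsampling. The sketch below assumes the keep-probability formula of the original word2vec C code, which Gensim appears to follow. `keep_probability` is a hypothetical helper, and the count of 3000 is made up; only the corpus size, `sample`, and the minimum count of 20 come from the notebook.

```python
from math import sqrt


def keep_probability(word_count, corpus_size, sample):
    """Probability of keeping one occurrence of a word during training."""
    threshold_count = sample * corpus_size
    prob = (sqrt(word_count / threshold_count) + 1) * (threshold_count / word_count)
    return min(prob, 1.0)


corpus_size = 151104  # retained corpus size reported in the log
sample = 6e-5

p_frequent = keep_probability(3000, corpus_size, sample)  # made-up count
p_rare = keep_probability(20, corpus_size, sample)        # a word at min_count

print(round(p_frequent, 3), p_rare)
```

Under this assumption, very frequent words lose most of their occurrences while words near `min_count` are always kept, in line with the log's drop from 151104 to about 45895 effective words.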


### Train the model

In [88]:
t = time()

w2v_model.train(docs_phrased, total_examples=w2v_model.corpus_count,
                epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

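# L2-normalise the vectors in place to save memory; the model cannot be
# trained further afterwards (Gensim 3.x API, deprecated in Gensim 4)
w2v_model.init_sims(replace=True)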

INFO - 09:50:16: training model with 7 workers on 1317 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
INFO - 09:50:16: worker thread finished; awaiting finish of 6 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 5 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 4 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 3 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 2 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 1 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 0 more threads
INFO - 09:50:16: EPOCH - 1 : training on 190504 raw words (45684 effective words) took 0.1s, 413674 effective words/s
INFO - 09:50:16: worker thread finished; awaiting finish of 6 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 5 more threads
INFO - 09:50:16: worker thread finished; awaiting finish of 4 more thread

INFO - 09:50:18: worker thread finished; awaiting finish of 2 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 1 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 0 more threads
INFO - 09:50:18: EPOCH - 13 : training on 190504 raw words (45985 effective words) took 0.1s, 381923 effective words/s
INFO - 09:50:18: worker thread finished; awaiting finish of 6 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 5 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 4 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 3 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 2 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 1 more threads
INFO - 09:50:18: worker thread finished; awaiting finish of 0 more threads
INFO - 09:50:18: EPOCH - 14 : training on 190504 raw words (46106 effective words) took 0.1s, 414890 effective words/s
INFO - 09:50

INFO - 09:50:19: worker thread finished; awaiting finish of 4 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 3 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 2 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 1 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 0 more threads
INFO - 09:50:19: EPOCH - 26 : training on 190504 raw words (46147 effective words) took 0.1s, 406650 effective words/s
INFO - 09:50:19: worker thread finished; awaiting finish of 6 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 5 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 4 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 3 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 2 more threads
INFO - 09:50:19: worker thread finished; awaiting finish of 1 more threads
INFO - 09:50:19: worker thread finished; awaiting finish

Time to train the model: 0.06 mins
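
With `alpha=0.03` and `min_alpha=0.0007`, the learning rate is expected to fall linearly from the first value to the second over the 30 epochs. Here is a sketch of that schedule at epoch granularity (Gensim updates it more finely, as training progresses within an epoch); `learning_rate` is a hypothetical helper.

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    """Linearly interpolated learning rate; progress runs from 0.0 to 1.0."""
    rate = alpha - (alpha - min_alpha) * progress
    return max(min_alpha, rate)


epochs = 30
schedule = [learning_rate(epoch / epochs) for epoch in range(epochs + 1)]

print(schedule[0], schedule[epochs // 2], schedule[-1])
```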


# Explore the model

## Do vectors make any sense?

**Which words/meanings does the lexical item 'sheldon' trigger in your mind?**

<img src="images/_14.jpg" width="50%">

In [133]:
# Let's give it a try...
import numpy as np

items = ['sheldon', 'sheldon_cooper', 'dr_sheldon_cooper']

for item in items:
    try:
        positives = w2v_model.wv.most_similar(positive=[item])
        print("""
        
        Lexical item is `{}'
        ======================================================
         term                 similarity
        ------------------------------------------------------
        """.format(item))
        for term, similarity in positives:
            print('\t', term.ljust(15), '\t', np.round(similarity, 3), flush=True)
    except KeyError:
        print("""
        
        Lexical item is `{}'
        ======================================================
        
        
        !!! ...too bad, item not in dictionary !!!
        
        """.format(item))



        
        Lexical item is `sheldon'
        ======================================================
         term                 similarity
        ------------------------------------------------------
        
	 raj             	 0.854
	 sorry           	 0.81
	 sweetie         	 0.796
	 alright         	 0.786
	 fine            	 0.782
	 oh_god          	 0.774
	 sarcasm         	 0.773
	 okay            	 0.773
	 didn            	 0.772
	 want            	 0.77

        
        Lexical item is `sheldon_cooper'
        ======================================================
         term                 similarity
        ------------------------------------------------------
        
	 dr              	 0.909
	 welcome         	 0.905
	 hofstadter      	 0.854
	 present         	 0.833
	 neighbour       	 0.796
	 flag            	 0.768
	 cooper          	 0.754
	 hello           	 0.749
	 fun             	 0.744
	 honour          	 0.728

        
        Lexical item is `dr_sheldon_cooper'
        ======================================================
        
        
        !!! ...too bad, item not in dictionary !!!
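
`most_similar` ranks the vocabulary by cosine similarity to the query vector. Below is a pure-Python sketch of that ranking on made-up three-dimensional vectors (the trained model has 300 dimensions); `toy_vectors` and `most_similar_toy` are hypothetical names, not part of Gensim.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def most_similar_toy(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up vectors, for illustration only
toy_vectors = {
    'sheldon': [1.0, 0.2, 0.0],
    'raj':     [0.9, 0.3, 0.1],
    'sorry':   [0.5, 0.5, 0.5],
    'flag':    [0.0, 0.1, 1.0],
}

ranking = most_similar_toy('sheldon', toy_vectors)
for term, similarity in ranking:
    print(term.ljust(15), round(similarity, 3))
```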
        
        


## Closer characters?

In [135]:
characters = ['sheldon', 'penny', 'leonard', 'howard', 'raj']

**??**

<img src="images/_15.jpg" width="50%">

*OR*

<img src="images/_16.jpg" width="50%">

In [155]:
print("""        
Inter-lexical item similarity
======================================================
  pair                             similarity
------------------------------------------------------
""")

for c_i in characters:
    for c_j in characters:
        # skip self-pairs; each pair is printed in both orders
        if c_i != c_j:
            sim_cicj = w2v_model.wv.similarity(c_i, c_j)
            print('',
                  '{} - {}'.format(c_i, c_j).ljust(30),
                  '{}'.format(sim_cicj))
        

        
Inter-lexical item similarity
======================================================
  pair                             similarity
------------------------------------------------------

 sheldon - penny                0.7068729996681213
 sheldon - leonard              0.7267926335334778
 sheldon - howard               0.6681523323059082
 sheldon - raj                  0.8539205193519592
 penny - sheldon                0.7068729996681213
 penny - leonard                0.8276159763336182
 penny - howard                 0.7030239701271057
 penny - raj                    0.7678197622299194
 leonard - sheldon              0.7267926335334778
 leonard - penny                0.8276159763336182
 leonard - howard               0.7147091031074524
 leonard - raj                  0.7717246413230896
 howard - sheldon               0.6681523323059082
 howard - penny                 0.7030239701271057
 howard - leonard               0.7147091031074524
 howard - raj                   0.7719296216964722
 raj - sheldon                  0.853920519
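
Cosine similarity is symmetric, which is why `sheldon - raj` and `raj - sheldon` carry the same value in the table. The following sketch, on made-up two-dimensional vectors, lists each unordered pair once with `itertools.combinations`; `toy` and `cosine` are illustrative names, not part of the model.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# made-up vectors, for illustration only
toy = {'sheldon': [1.0, 0.0], 'raj': [0.8, 0.6], 'penny': [0.0, 1.0]}

# one entry per unordered pair instead of one per ordered pair
pair_sims = {(a, b): cosine(toy[a], toy[b]) for a, b in combinations(toy, 2)}

for (a, b), sim in pair_sims.items():
    print('{} - {}'.format(a, b).ljust(30), round(sim, 3))
```

For the five characters this would print 10 rows rather than 20.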

# Statistical post-processing of word vectors

In [165]:
# save the model
out_f = os.path.join('data', 'big_bang_theory.model')
w2v_model.save(out_f)

INFO - 11:22:14: saving Word2Vec object under data/big_bang_theory.model, separately None
INFO - 11:22:14: not storing attribute vectors_norm
INFO - 11:22:14: not storing attribute cum_table
INFO - 11:22:15: saved data/big_bang_theory.model


In [166]:
# load the model back
in_f = out_f
model = Word2Vec.load(in_f)

INFO - 11:22:57: loading Word2Vec object from data/big_bang_theory.model
INFO - 11:22:57: loading wv recursively from data/big_bang_theory.model.wv.* with mmap=None
INFO - 11:22:57: setting ignored attribute vectors_norm to None
INFO - 11:22:57: loading vocabulary recursively from data/big_bang_theory.model.vocabulary.* with mmap=None
INFO - 11:22:57: loading trainables recursively from data/big_bang_theory.model.trainables.* with mmap=None
INFO - 11:22:57: setting ignored attribute cum_table to None
INFO - 11:22:57: loaded data/big_bang_theory.model


In [167]:
model.wv['sheldon']

array([-2.56122034e-02,  5.70277534e-02,  4.97568101e-02,  8.43853131e-02,
       -1.51177630e-01,  9.56035405e-03,  2.75666080e-02,  2.51432359e-02,
       -2.45556533e-02, -1.44383898e-02, -7.56913200e-02, -4.07761298e-02,
        5.18479533e-02, -2.74627507e-02, -1.07280761e-01, -9.56522375e-02,
       -1.11931354e-01,  8.44372902e-03, -5.73443733e-02, -3.99773158e-02,
        9.35376659e-02, -1.08790612e-02,  6.88532069e-02, -5.39100245e-02,
        8.92639756e-02, -4.24373187e-02, -5.31451702e-02,  3.04649607e-03,
       -1.79845113e-02, -1.28390081e-02,  1.00214235e-01, -1.17637916e-03,
       -1.27293199e-01, -2.69187130e-02,  4.98978756e-02,  1.22400016e-01,
        2.84108194e-03, -1.30958766e-01, -6.02252483e-02, -2.43704882e-03,
       -6.80608675e-02, -1.39539884e-02,  8.00053179e-02,  1.03007294e-01,
       -2.79188193e-02, -7.66695067e-02,  3.27624157e-02,  6.69604540e-02,
        4.46786545e-03, -7.33269705e-03, -5.39108254e-02, -6.13568397e-03,
        8.85235816e-02, -

In [202]:
# let's store the word vectors associated with target items
character_vectors = []

for c in characters:
    to_append = model.wv[c]
    character_vectors.append(to_append)

<span class="mark">Students have time to run dimensionality reduction analyses</span>
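
One possible starting point for the exercise is a principal component analysis computed with NumPy's singular value decomposition. The sketch uses random stand-in vectors so that it runs on its own; in the notebook, `np.array(character_vectors)` would take their place. `pca_2d` is a hypothetical helper, and PCA is only one of several reasonable choices.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal axes, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# random stand-in for the five 300-dimensional character vectors
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(5, 300))

coords = pca_2d(stand_in)

characters = ['sheldon', 'penny', 'leonard', 'howard', 'raj']
for name, (x, y) in zip(characters, coords):
    print(name.ljust(10), round(float(x), 3), round(float(y), 3))
```

With only five points, the two-dimensional picture should be read with caution: it shows relative positions among these characters, not the structure of the whole vocabulary.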