# Lesson 6

## Word2Vec

The dataset we'll be using: simpsons_dataset.csv

https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons


We'll use only two columns:

- raw_character_text: the character who speaks (can be useful when monitoring the preprocessing steps)
- spoken_words: the raw text from the line of dialogue

To install spaCy, copy and run one of these commands:

!pip install spacy

conda install -c conda-forge spacy

In [7]:
import re                             # For preprocessing
import pandas as pd                   # For data handling
from time import time                 # To time our operations
from collections import defaultdict   # For word frequency

import spacy                          # For preprocessing


In [8]:
# Set up logging to monitor gensim

import logging                        
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", 
                    datefmt= '%H:%M:%S', level=logging.INFO)

### Preprocessing

In [9]:
# import csv file

df = pd.read_csv('simpsons_dataset.csv')
df.shape

(158314, 2)

In [10]:
df.head()

| | raw_character_text | spoken_words |
|---|---|---|
| 0 | Miss Hoover | No, actually, it was a little of both. Sometim... |
| 1 | Lisa Simpson | Where's Mr. Bergstrom? |
| 2 | Miss Hoover | I don't know. Although I'd sure like to talk t... |
| 3 | Lisa Simpson | That life is worth living. |
| 4 | Edna Krabappel-Flanders | The polls will be open from now until the end ... |


In [11]:
df.isnull().sum()

raw_character_text    17814
spoken_words          26459
dtype: int64

In [12]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

raw_character_text    0
spoken_words          0
dtype: int64

In [13]:
len(df)

131853

Lemmatizing and removing the stopwords and non-alphabetic characters for each line of dialogue

In [14]:
import sys
!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [15]:
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser']) # disabling Named Entity Recognition and the dependency parser for speed

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return ' '.join(txt)

In [16]:
# Removes non-alphabetic characters

brief_cleaning = (re.sub("[^A-Za-z']+", ' ', 
                         str(row)).lower() for row in df['spoken_words'])
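
To see what the regular expression above does to a single line, here is a minimal sketch using only the standard library. The helper name `brief_clean` and the sample sentence are illustrative and not part of the original notebook.

```python
import re

def brief_clean(text):
    # Replace every run of characters that is neither a letter nor an
    # apostrophe with a single space, then lowercase the result.
    return re.sub("[^A-Za-z']+", " ", str(text)).lower()

sample = "Where's Mr. Bergstrom?"
cleaned = brief_clean(sample)
print(repr(cleaned))
```

Note that punctuation at the end of a line leaves a trailing space, which is harmless here because spaCy tokenizes the text afterwards.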

In [17]:
# Take advantage of spaCy's .pipe() method to speed up the cleaning process

t = time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 2.02 mins


In [18]:
help(nlp.pipe)

Help on method pipe in module spacy.language:

pipe(texts: Iterable[str], *, as_tuples: bool = False, batch_size: Union[int, NoneType] = None, disable: Iterable[str] = [], component_cfg: Union[Dict[str, Dict[str, Any]], NoneType] = None, n_process: int = 1) method of spacy.lang.en.English instance
    Process texts as a stream, and yield `Doc` objects in order.
    
    texts (Iterable[str]): A sequence of texts to process.
    as_tuples (bool): If set to True, inputs should be a sequence of
        (text, context) tuples. Output will then be a sequence of
        (doc, context) tuples. Defaults to False.
    batch_size (Optional[int]): The number of texts to buffer.
    disable (List[str]): Names of the pipeline components to disable.
    component_cfg (Dict[str, Dict]): An optional dictionary with extra keyword
        arguments for specific components.
    n_process (int): Number of processors to process texts. If -1, set `multiprocessing.cpu_count()`.
    YIELDS (Doc): Documents in

In [21]:
# Put the results in a DataFrame to remove missing values and duplicates (always remove duplicates)

df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

(85956, 1)
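
The missing values come from `cleaning()`, which returns `None` for lines with two words or fewer. As a rough standard-library sketch of what `dropna().drop_duplicates()` does to a single column, assuming the default behaviour of keeping the first occurrence (the toy lines and the helper name are made up for illustration):

```python
def drop_missing_and_duplicates(lines):
    # Skip None entries, then rely on dict.fromkeys to keep only the first
    # occurrence of each line while preserving the original order.
    return list(dict.fromkeys(line for line in lines if line is not None))

toy_lines = ["homer eat donut", None, "bart skip school", "homer eat donut"]
unique_lines = drop_missing_and_duplicates(toy_lines)
print(unique_lines)
```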

### Bigrams

In [26]:
from gensim.models.phrases import Phrases, Phraser

In [27]:
# Phrases() takes a list of lists of words as input

sent = [row.split() for row in df_clean['clean']]


In [28]:
# Creates the relevant phrases from the list of sentences

phrases = Phrases(sent, min_count=30, progress_per=10000)

INFO - 19:39:45: collecting all words and their counts
INFO - 19:39:45: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 19:39:45: PROGRESS: at sentence #10000, processed 63557 words and 52733 word types
INFO - 19:39:46: PROGRESS: at sentence #20000, processed 130938 words and 99702 word types
INFO - 19:39:46: PROGRESS: at sentence #30000, processed 192959 words and 138314 word types
INFO - 19:39:46: PROGRESS: at sentence #40000, processed 249828 words and 172378 word types
INFO - 19:39:46: PROGRESS: at sentence #50000, processed 311267 words and 208202 word types
INFO - 19:39:46: PROGRESS: at sentence #60000, processed 373573 words and 243255 word types
INFO - 19:39:46: PROGRESS: at sentence #70000, processed 436422 words and 278194 word types
INFO - 19:39:46: PROGRESS: at sentence #80000, processed 497885 words and 311308 word types
INFO - 19:39:46: collected 330094 token types (unigram + bigrams) from a corpus of 537096 words and 85956 sentences
INFO - 19:39:46: m

In [29]:
bigram = Phraser(phrases)

INFO - 19:39:47: exporting phrases from Phrases<330094 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>
INFO - 19:39:48: FrozenPhrases lifecycle event {'msg': 'exported FrozenPhrases<124 phrases, min_count=30, threshold=10.0> from Phrases<330094 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000> in 0.84s', 'datetime': '2021-06-24T19:39:48.099830', 'gensim': '4.0.1', 'python': '3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
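
The log above reports `min_count=30` and `threshold=10.0`. To give an intuition for how these two values select bigrams, here is a small sketch of the default scoring rule as I understand it from the gensim documentation: a pair is kept when its score exceeds the threshold. The counts below are invented, and only the vocabulary size is taken from the log.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Score used to decide whether "a b" should be merged into "a_b":
    # pairs that co-occur often relative to their individual counts score high.
    return (count_ab - min_count) / count_a / count_b * vocab_size

VOCAB_SIZE = 330094   # from the Phrases log above
MIN_COUNT = 30
THRESHOLD = 10.0

# Hypothetical counts: two words that almost always appear together.
tight_pair = bigram_score(100, 80, 70, VOCAB_SIZE, MIN_COUNT)
# Hypothetical counts: two very common words that rarely appear together.
loose_pair = bigram_score(5000, 4000, 50, VOCAB_SIZE, MIN_COUNT)

print(tight_pair > THRESHOLD, loose_pair > THRESHOLD)
```

Raising `min_count` or `threshold` makes the detector stricter, which is consistent with only 124 phrases being exported here.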


In [30]:
# Transform the corpus based on the bigrams detected

sentences = bigram[sent]


**Most Frequent Words**

Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

In [23]:
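word_freq = defaultdict(int)
# Use names other than `sent` for the loop variables, so the list of
# sentences built earlier is not overwritten
for sentence in sentences:
    for word in sentence:
        word_freq[word] += 1
len(word_freq)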

29493

In [24]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['oh', 'like', 'know', 'get', 'hey', 'think', 'right', 'look', 'want', 'come']
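
The same sanity check can be written with `collections.Counter`, which bundles the counting and the sorting. This is only an alternative sketch on a toy corpus; the variable names are illustrative.

```python
from collections import Counter

# Toy stand-in for the bigram-transformed corpus (a list of token lists).
toy_sentences = [["oh", "homer"], ["oh", "marge", "homer"], ["oh"]]

# Count every token across all sentences in one pass.
counts = Counter(word for sentence in toy_sentences for word in sentence)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = [word for word, _ in counts.most_common(2)]
print(len(counts), top_two)
```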

### Gensim Word2Vec Implementation

We separate the training into three distinct steps for clarity and monitoring.

- Word2Vec():
In this first step, we set up the parameters of the model one by one.
We purposefully do not supply the sentences parameter, and therefore leave the model uninitialized.

- .build_vocab():
This step builds the vocabulary from a sequence of sentences and thus initializes the model.
With the logging, we can follow the progress and, even more importantly, the effect of min_count and sample on the word corpus. These two parameters, and sample in particular, have a great influence on the performance of a model. Displaying both makes their influence easier to manage accurately.

- .train():
Finally, this step trains the model.
The logging here is mainly useful for monitoring, making sure that no thread finishes instantaneously.

In [35]:
import multiprocessing

from gensim.models import Word2Vec

In [36]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
print(cores)

8


The parameters:
- min_count = int - Ignores all words with total absolute frequency lower than this - (2, 100)
- window = int - The maximum distance between the current and predicted word within a sentence, e.g. window words on the left and window words on the right of our target - (2, 10)
- vector_size = int - Dimensionality of the feature vectors (named size before gensim 4.0) - (50, 300)
- sample = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
- alpha = float - The initial learning rate - (0.01, 0.05)
- min_alpha = float - Learning rate will linearly drop to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
- negative = int - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)
- workers = int - Use this many worker threads to train the model (faster training on multicore machines)
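
To make the effect of `sample` more concrete, here is a sketch of the keep probability used for subsampling frequent words, following the formula from the original word2vec C code, which gensim also uses to my knowledge. The corpus size and word counts below are round numbers chosen for illustration.

```python
import math

def keep_probability(word_count, total_words, sample):
    # Words whose count is at or below sample * total_words are always kept.
    # More frequent words are kept with a probability that shrinks
    # as their count grows.
    threshold = sample * total_words
    p = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(p, 1.0)

TOTAL_WORDS = 500_000   # hypothetical corpus size
SAMPLE = 6e-5           # the value used in this lesson

for count in (20, 300, 3000):
    print(count, round(keep_probability(count, TOTAL_WORDS, SAMPLE), 3))
```

With a small `sample`, a word that appears a few thousand times is dropped from most of its occurrences during training, which is why this parameter has such a strong influence.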

In [40]:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=100,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

INFO - 19:42:39: Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=100, alpha=0.03)', 'datetime': '2021-06-24T19:42:39.747541', 'gensim': '4.0.1', 'python': '3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
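
The `window` value passed to `Word2Vec` above controls which neighbouring words count as context for a target word. The sketch below enumerates (target, context) pairs for a fixed window. It is a simplification: as far as I know, gensim also randomly shrinks the effective window for each target by default, which this toy function does not do.

```python
def context_pairs(tokens, window):
    # For each target position, pair the target with every token that lies
    # at most `window` positions to its left or right.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = context_pairs(["bart", "skip", "school", "today"], window=1)
print(pairs)
```

This also shows why very short sentences contribute little: a two-word sentence yields only two pairs.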


In [38]:
help(Word2Vec)

Help on class Word2Vec in module gensim.models.word2vec:

class Word2Vec(gensim.utils.SaveLoad)
 |  Word2Vec(sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None)
 |  
 |  Serialize/deserialize objects from disk, by equipping them with the `save()` / `load()` methods.
 |  
 |  --------
 |  This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes
 |  such as lambda functions etc.
 |  
 |  Method resolution order:
 |      Word2Vec
 |      gensim.utils.SaveLoad
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5

**Building the Vocabulary Table**

Word2Vec requires us to build the vocabulary table (simply digesting all the words, filtering out the unique words, and doing some basic counts on them).

In [41]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 19:42:59: collecting all words and their counts
INFO - 19:42:59: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 19:43:00: PROGRESS: at sentence #10000, processed 61705 words, keeping 9474 word types
INFO - 19:43:00: PROGRESS: at sentence #20000, processed 127321 words, keeping 14329 word types
INFO - 19:43:00: PROGRESS: at sentence #30000, processed 187814 words, keeping 17358 word types
INFO - 19:43:00: PROGRESS: at sentence #40000, processed 243317 words, keeping 20021 word types
INFO - 19:43:00: PROGRESS: at sentence #50000, processed 303178 words, keeping 22426 word types
INFO - 19:43:00: PROGRESS: at sentence #60000, processed 363915 words, keeping 24662 word types
INFO - 19:43:00: PROGRESS: at sentence #70000, processed 425375 words, keeping 26806 word types
INFO - 19:43:00: PROGRESS: at sentence #80000, processed 485511 words, keeping 28619 word types
INFO - 19:43:00: collected 29493 word types from a corpus of 523625 raw words and 85956 sentence

Time to build vocab: 0.02 mins
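
The log reports 29493 collected word types, while the training log further down mentions a vocabulary of 3316: the difference is the effect of `min_count=20`. A minimal sketch of that filtering step on a toy corpus follows; the helper name `build_toy_vocab` is hypothetical and this is not gensim's implementation.

```python
from collections import Counter

def build_toy_vocab(sentences, min_count):
    # Count every word, then keep only those seen at least min_count times.
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [["oh", "homer"], ["oh", "marge", "homer"], ["oh", "bongo"]]
vocab = build_toy_vocab(toy_sentences, min_count=2)
print(vocab)
```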


**Training of the model**


Parameters of the training:

- total_examples = int - Count of sentences;
- epochs = int - Number of iterations (epochs) over the corpus - [10, 20, 30]
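
The number of epochs also determines how quickly the learning rate decays, since it drops linearly from `alpha` to `min_alpha` over the whole training run. A small sketch of that schedule, using the values chosen in this lesson as defaults and a hypothetical `progress` argument between 0 and 1:

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    # progress is the fraction of training already completed (0.0 to 1.0).
    # The rate starts at alpha and decreases linearly down to min_alpha.
    return alpha - (alpha - min_alpha) * progress

EPOCHS = 30
for epoch in (0, 15, 30):
    print(epoch, round(learning_rate(epoch / EPOCHS), 5))
```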

In [42]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 19:43:06: Word2Vec lifecycle event {'msg': 'training model with 7 workers on 3316 vocabulary and 100 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2', 'datetime': '2021-06-24T19:43:06.236482', 'gensim': '4.0.1', 'python': '3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}
INFO - 19:43:07: EPOCH 1 - PROGRESS: at 80.38% examples, 159924 words/s, in_qsize 11, out_qsize 0
INFO - 19:43:07: worker thread finished; awaiting finish of 6 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 5 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 4 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 3 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 2 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 1 more threads
INFO - 19:43:07: worker thread finished; awaiting finish of 0 more threads
INFO - 

INFO - 19:43:18: worker thread finished; awaiting finish of 0 more threads
INFO - 19:43:18: EPOCH - 11 : training on 523625 raw words (199949 effective words) took 1.0s, 196737 effective words/s
INFO - 19:43:19: EPOCH 12 - PROGRESS: at 57.54% examples, 107715 words/s, in_qsize 10, out_qsize 4
INFO - 19:43:19: worker thread finished; awaiting finish of 6 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 5 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 4 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 3 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 2 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 1 more threads
INFO - 19:43:19: worker thread finished; awaiting finish of 0 more threads
INFO - 19:43:19: EPOCH - 12 : training on 523625 raw words (199636 effective words) took 1.1s, 173851 effective words/s
INFO - 19:43:20: EPOCH 13 - PROGRESS: at 59.45% examples, 106

INFO - 19:43:30: worker thread finished; awaiting finish of 1 more threads
INFO - 19:43:30: worker thread finished; awaiting finish of 0 more threads
INFO - 19:43:30: EPOCH - 23 : training on 523625 raw words (199619 effective words) took 1.1s, 190061 effective words/s
INFO - 19:43:31: EPOCH 24 - PROGRESS: at 57.54% examples, 114021 words/s, in_qsize 12, out_qsize 1
INFO - 19:43:31: worker thread finished; awaiting finish of 6 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 5 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 4 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 3 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 2 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 1 more threads
INFO - 19:43:31: worker thread finished; awaiting finish of 0 more threads
INFO - 19:43:31: EPOCH - 24 : training on 523625 raw words (199635 effective words) took 1.1s, 182368 effe

Time to train the model: 0.53 mins


In [33]:
# DEPRECATED since gensim 4.0: init_sims() is no longer necessary and only raises a deprecation warning.
# In gensim 3.x it was called once training was finished, to make the model much more memory-efficient.

w2v_model.init_sims(replace=True)


### Exploring the model

Most similar to:

Here, we will ask our model to find the words most similar to some of the most iconic characters of the Simpsons!

In [43]:
w2v_model.wv.most_similar(positive=["homer"])

[('marge', 0.6626412868499756),
 ('hammock', 0.6492546200752258),
 ('gosh', 0.649033784866333),
 ('suspicious', 0.6488982439041138),
 ('crummy', 0.6417440176010132),
 ('depressed', 0.6409281492233276),
 ('glamorous', 0.6363852620124817),
 ('terrific', 0.6356622576713562),
 ('bongo', 0.6240394115447998),
 ('awww', 0.6176309585571289)]

In [35]:
w2v_model.wv.most_similar(positive=["homer_simpson"])

[('recent', 0.6733304858207703),
 ('sir', 0.6718449592590332),
 ('easily', 0.6549437046051025),
 ('congratulation', 0.6546491384506226),
 ('sherman', 0.6523118019104004),
 ('select', 0.6451156139373779),
 ('pleased', 0.6447362899780273),
 ('hutz', 0.6387559175491333),
 ('montgomery_burn', 0.6385891437530518),
 ('kennedy', 0.6298847198486328)]

In [36]:
w2v_model.wv.most_similar(positive=["marge"])

[('snuggle', 0.6966519355773926),
 ('sweetheart', 0.6907212734222412),
 ('eliza', 0.6902742385864258),
 ('awww', 0.6883502006530762),
 ('depressed', 0.6810144782066345),
 ('crummy', 0.6776387691497803),
 ('rude', 0.6718450784683228),
 ('homer', 0.6714166402816772),
 ('brunch', 0.6688135862350464),
 ('arrange', 0.6632348299026489)]

In [37]:
w2v_model.wv.most_similar(positive=["bart"])

[('lisa', 0.7857683300971985),
 ('homework', 0.7636374235153198),
 ('convince', 0.7172344923019409),
 ('assignment', 0.7169288992881775),
 ('mom_dad', 0.7032962441444397),
 ('surprised', 0.6915916204452515),
 ('janey', 0.6913002729415894),
 ('behave', 0.6891374588012695),
 ('ralphie', 0.6828575730323792),
 ('impress', 0.6814612746238708)]

Similarities:
    
Here, we will see how similar two words are to each other:

In [44]:
w2v_model.wv.similarity('maggie', 'baby')

0.65207624

In [45]:
w2v_model.wv.similarity('bart', 'nelson')

0.5326438
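
The score returned by `similarity()` is the cosine similarity between the two word vectors. Here is a plain-Python sketch of that measure on made-up two-dimensional vectors; the real vectors in this lesson have 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths:
    # 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only.
maggie = [0.9, 0.3]
baby = [0.8, 0.5]
print(round(cosine_similarity(maggie, baby), 3))
```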

Odd-One-Out:
    
Here, we ask our model to give us the word that does not belong to the list!


In [42]:
# What if we compared the friendship between Nelson, Bart, and Milhouse?

w2v_model.wv.doesnt_match(["nelson", "bart", "milhouse"])

'nelson'

In [43]:
# Last but not least, how is the relationship between Homer and his two sisters-in-law?

w2v_model.wv.doesnt_match(['homer', 'patty', 'selma'])

'homer'
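
A rough sketch of the idea behind `doesnt_match()`, as I understand it: normalize each vector, average them, and return the word whose vector is least similar to that average. The two-dimensional vectors below are invented so that the result mirrors the output above; they are not taken from the trained model.

```python
import math

def unit(v):
    # Scale a vector to length 1 so that only its direction matters.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    # Component-wise mean of the unit vectors.
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # The odd one out has the smallest dot product with the mean.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical vectors, for illustration only.
toy_vectors = {"patty": [1.0, 0.1], "selma": [0.9, 0.2], "homer": [0.1, 1.0]}
result = odd_one_out(toy_vectors)
print(result)
```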

Analogy difference:

Which word is to woman as homer is to marge?

In [44]:
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)

[('man', 0.5862100124359131),
 ('rude', 0.5613473057746887),
 ('adopt', 0.5586809515953064)]

In [45]:
# Which word is to woman as bart is to man?

w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=3)

[('lisa', 0.6911569833755493),
 ('arrange', 0.6468756794929504),
 ('encourage', 0.6392290592193604)]
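
These analogy queries amount to vector arithmetic: add the unit vectors of the positive words, subtract those of the negative words, and look for the remaining word closest to the result. The sketch below reproduces that idea on invented two-dimensional vectors, where one axis loosely stands for age and the other for gender. It is a toy illustration, not gensim's implementation.

```python
import math

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # Words used in the query itself are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Hypothetical vectors: [adult vs. child, male vs. female].
toy_vectors = {
    "man": [1.0, 1.0],
    "woman": [1.0, -1.0],
    "bart": [-1.0, 1.0],
    "lisa": [-1.0, -1.0],
    "homer": [1.0, 0.8],
}
answer = analogy(toy_vectors, positive=["woman", "bart"], negative=["man"])
print(answer)
```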

In [None]:
# the end