# **Word2Vec** with **Gensim**

In this Jupyter notebook, we demonstrate how to use the **Gensim** library to train a **Word2Vec model**. **Word2Vec** is a popular algorithm in **Natural Language Processing (NLP)** that trains a shallow **neural network** to learn word associations from a large corpus of text, representing each word as a dense vector. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

**Gensim** is a robust open-source toolkit for **vector space modeling** and **topic modeling**, implemented in Python. It is designed to handle large text collections efficiently and provides ready-to-use implementations of common algorithms.

In this demo, we will walk through the steps of training a **Word2Vec model** with **Gensim**, including **data preprocessing**, **model training**, **parameter tuning**, and finally, how to use the trained model for various **NLP tasks**. 

Let's get started!

#### Note

This notebook is inspired by and partially based on the excellent tutorial found at the following link: [Gensim Word2Vec Tutorial](https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial). 

We have adapted and expanded upon the original tutorial to fit the specific needs and context of this notebook. We highly recommend checking out the original tutorial for a more in-depth look at using Gensim for Word2Vec.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import spacy
import numpy as np
import re
from tqdm import tqdm
import multiprocessing
from gensim.models import Word2Vec
from time import time
from gensim.models.phrases import Phrases, Phraser

In [2]:
# Make sure that the spaCy English model is installed
if not spacy.util.is_package("en_core_web_sm"):
    spacy.cli.download("en_core_web_sm")
    
# Load the English model
nlp = spacy.load("en_core_web_sm", disable=["ner"]) # disable named entity recognition for speed

### Read and preprocess the data

In [3]:
# Read the data
df = pd.read_csv('./Data/simpsons_dataset.csv')
df.shape

(158314, 2)

In [4]:
df.head()

Unnamed: 0,raw_character_text,spoken_words
0,Miss Hoover,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Where's Mr. Bergstrom?
2,Miss Hoover,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,That life is worth living.
4,Edna Krabappel-Flanders,The polls will be open from now until the end ...


In [5]:
df.isnull().sum()

raw_character_text    17814
spoken_words          26459
dtype: int64

In [6]:
# Remove rows with missing values
df = df.dropna().reset_index(drop=True)

In [7]:
# Check if there are any missing values left
df.isnull().sum()

raw_character_text    0
spoken_words          0
dtype: int64

In [8]:
# Check the shape again
df.shape

(131853, 2)

In [9]:
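# Use the spaCy pipeline to tokenize and clean the text
def spacy_text_clean(text):
    """
    Tokenize the text with spaCy and keep only lowercased alphabetic
    tokens and punctuation tokens.

    Tokens that are neither purely alphabetic nor punctuation (numbers,
    contraction pieces such as "n't" or "'s") are dropped.
    
    Args:
        text (str): The text to be cleaned
        
    Returns:
        list[str]: The cleaned tokens
    """
    
    # Run the spaCy pipeline on the text
    doc = nlp(text)
    
    # Keep alphabetic tokens (lowercased) and punctuation (unchanged)
    tokens = []
    for token in doc:
        if token.is_alpha:
            tokens.append(token.lower_)
        elif token.is_punct:
            tokens.append(token.text)
    
    return tokens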
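
The filter above explains why a line such as "I don't know." ends up as `['i', 'do', 'know', '.']` in the cleaned output. The following is a minimal sketch of the same keep/drop rule in plain Python, without spaCy. The token splits are written by hand and are an assumption about how spaCy tokenizes these lines, and `is_punct_like` and `filter_tokens` are hypothetical helpers that only approximate `token.is_punct` and the function above.

```python
import unicodedata


def is_punct_like(text):
    """Rough stand-in for spaCy's Token.is_punct (assumption):
    true when every character is in a Unicode punctuation category."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("P") for ch in text
    )


def filter_tokens(raw_tokens):
    """Apply the same keep/drop rule as spacy_text_clean to pre-split tokens."""
    kept = []
    for tok in raw_tokens:
        if tok.isalpha():
            kept.append(tok.lower())   # alphabetic tokens are lowercased
        elif is_punct_like(tok):
            kept.append(tok)           # punctuation is kept unchanged
        # anything else ("n't", "'s", "Mr.", numbers) is dropped
    return kept


# Hand-written, assumed spaCy-style splits of two lines from the dataset
print(filter_tokens(["I", "do", "n't", "know", "."]))
# → ['i', 'do', 'know', '.']
print(filter_tokens(["Where", "'s", "Mr.", "Bergstrom", "?"]))
# → ['where', 'bergstrom', '?']
```

Dropping `n't` flips the apparent meaning of negated sentences, which is worth keeping in mind when interpreting the trained vectors.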

In [11]:
# Initialize the clean token list
cleaned_sentences = []

# Clean every line of dialogue, with a progress bar
for text in tqdm(df['spoken_words']):
    cleaned_sentences.append(spacy_text_clean(text))

100%|██████████| 131853/131853 [05:01<00:00, 437.10it/s]


In [12]:
# Remove empty lists
cleaned_sentences = [x for x in cleaned_sentences if x != []]

In [13]:
# First 10 sentences
cleaned_sentences[:10]

[['no',
  ',',
  'actually',
  ',',
  'it',
  'was',
  'a',
  'little',
  'of',
  'both',
  '.',
  'sometimes',
  'when',
  'a',
  'disease',
  'is',
  'in',
  'all',
  'the',
  'magazines',
  'and',
  'all',
  'the',
  'news',
  'shows',
  ',',
  'it',
  'only',
  'natural',
  'that',
  'you',
  'think',
  'you',
  'have',
  'it',
  '.'],
 ['where', 'bergstrom', '?'],
 ['i',
  'do',
  'know',
  '.',
  'although',
  'i',
  'sure',
  'like',
  'to',
  'talk',
  'to',
  'him',
  '.',
  'he',
  'did',
  'touch',
  'my',
  'lesson',
  'plan',
  '.',
  'what',
  'did',
  'he',
  'teach',
  'you',
  '?'],
 ['that', 'life', 'is', 'worth', 'living', '.'],
 ['the',
  'polls',
  'will',
  'be',
  'open',
  'from',
  'now',
  'until',
  'the',
  'end',
  'of',
  'recess',
  '.',
  'now',
  ',',
  'just',
  'in',
  'case',
  'any',
  'of',
  'you',
  'have',
  'decided',
  'to',
  'put',
  'any',
  'thought',
  'into',
  'this',
  ',',
  'we',
  'have',
  'our',
  'final',
  'statements',
  '.',
 

### Bigrams and Phrases

In the context of Natural Language Processing (NLP), a bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. For example, given the sentence "I love to play football", the bigrams would be: "I love", "love to", "to play", "play football".
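
As a small illustration of the definition above, word bigrams can be produced by pairing each token with its successor. This is a plain-Python sketch; `bigrams` is a hypothetical helper, not part of Gensim.

```python
def bigrams(tokens):
    """Return all pairs of adjacent tokens, in order."""
    return list(zip(tokens, tokens[1:]))


tokens = "I love to play football".split()
print(bigrams(tokens))
# → [('I', 'love'), ('love', 'to'), ('to', 'play'), ('play', 'football')]
```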

The `Phrases` model in Gensim is a simple and efficient way to create bigrams. It scans over the provided text data to find common phrases - that is, bigrams that appear more frequently together than you would expect by chance. 

For example, in a corpus of text about football, the words "penalty" and "kick" might frequently appear together in that order, and thus would be recognized as a common phrase and represented as a single token, "penalty_kick".

Using bigrams (or larger n-grams like trigrams, etc.) can help capture important context and improve the performance of many NLP tasks, such as language modeling, machine translation, and information retrieval.
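
To make "more frequently together than you would expect by chance" concrete, here is a rough sketch of a collocation score in the style of Mikolov et al. (2013), which Gensim's default scorer is based on. It is an approximation under stated assumptions: `score_bigrams` is a hypothetical helper, the vocabulary size is taken to be the number of distinct single tokens, and Gensim's own bookkeeping differs in its details.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair:
    (count(a, b) - min_count) / (count(a) * count(b)) * vocab_size
    """
    unigram_counts = Counter(tok for sent in sentences for tok in sent)
    bigram_counts = Counter(
        pair for sent in sentences for pair in zip(sent, sent[1:])
    )
    vocab_size = len(unigram_counts)  # assumption: distinct unigrams only

    scores = {}
    for (a, b), n_ab in bigram_counts.items():
        if n_ab < min_count:
            continue  # pairs rarer than min_count are never considered
        scores[(a, b)] = (
            (n_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        )
    return scores


# Toy corpus, invented for illustration
toy_corpus = [
    ["penalty", "kick", "saved"],
    ["penalty", "kick", "missed"],
    ["the", "penalty", "kick"],
    ["the", "ball", "missed"],
]
scores = score_bigrams(toy_corpus, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

A pair is promoted to a phrase when its score exceeds a threshold, so pairs made of words that are individually common but rarely adjacent score low.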

In [14]:
phrases = Phrases(cleaned_sentences, min_count=30, progress_per=10000)

In [15]:
bigram = Phraser(phrases)

In [16]:
sentences = bigram[cleaned_sentences]
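
Conceptually, applying the bigram model to a sentence joins each detected pair into a single token. The sketch below imitates that step in plain Python; `merge_bigrams` and the `detected` set are hypothetical, and the real transformation is performed by Gensim.

```python
def merge_bigrams(tokens, phrases, delimiter="_"):
    """Join adjacent tokens that form a known phrase into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # skip both halves of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Assumed set of detected phrases, for illustration only
detected = {("penalty", "kick")}
print(merge_bigrams(["he", "missed", "the", "penalty", "kick", "."], detected))
# → ['he', 'missed', 'the', 'penalty_kick', '.']
```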

### Configure the Word2Vec model

In [17]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

Here's what each parameter passed to the `Word2Vec` constructor does:

- `min_count`: This parameter ignores all words with total frequency lower than this. In this case, any word that does not occur at least 20 times across all documents is ignored.

- `window`: The maximum distance between the current and predicted word within a sentence. In this case, only words that are within a distance of 2 words from the target word are considered in the context.

- `vector_size`: The dimensionality of the word vectors. Here, each word is represented as a 300-dimensional vector.

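- `sample`: The threshold that controls how aggressively higher-frequency words are randomly downsampled. Here it is set to 6e-5: words whose relative frequency in the corpus rises above roughly this level have some of their occurrences randomly dropped during training, and the more frequent the word, the more often it is dropped.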

- `alpha`: The initial learning rate. Here, the initial learning rate is set to 0.03.

- `min_alpha`: Learning rate will linearly drop to `min_alpha` as training progresses. Here, the learning rate drops to 0.0007.

- `negative`: If > 0, negative sampling is used, and the value specifies how many "noise words" are drawn (usually between 5 and 20). Here, 20 noise words are drawn.

- `workers`: The number of worker threads used to train the model (more threads mean faster training on multicore machines). Here, it's set to one less than the total number of cores to leave one core free for other processes.
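
Two of the parameters above are easier to picture with numbers. The sketch below shows how a linearly decaying learning rate moves from `alpha` to `min_alpha`, and how the `sample` threshold can translate into a per-occurrence keep probability. Both functions are hypothetical helpers; the keep-probability formula follows the one used in the original word2vec code, and it should be read as an approximation of what Gensim does internally rather than its exact implementation.

```python
import math


def linear_alpha(progress, alpha=0.03, min_alpha=0.0007):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)
    return alpha - (alpha - min_alpha) * progress


def keep_probability(word_fraction, sample=6e-5):
    """Probability of keeping one occurrence of a word, where
    `word_fraction` is the word's share of all tokens in the corpus."""
    if sample <= 0 or word_fraction <= 0:
        return 1.0
    p = (math.sqrt(word_fraction / sample) + 1) * (sample / word_fraction)
    return min(p, 1.0)


for progress in (0.0, 0.5, 1.0):
    print(f"progress={progress:.1f}  alpha={linear_alpha(progress):.5f}")

# A rare word versus a very common word (fractions invented for illustration)
for fraction in (1e-5, 1e-3, 1e-2):
    print(f"fraction={fraction:g}  keep={keep_probability(fraction):.3f}")
```

Under this formula a word that makes up 1% of the corpus keeps fewer than one in ten of its occurrences, while rare words are never dropped.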

In [19]:
w2v_model = Word2Vec(
    min_count=20,
    window=2,
    vector_size=300,
    sample=6e-5,
    alpha=0.03,
    min_alpha=0.0007,
    negative=20,
    workers=cores - 1,
)

### Building the Vocabulary

Before we can train our Word2Vec model, we need to build the vocabulary. The vocabulary in this context refers to the set of unique tokens in our corpus that occur at least `min_count` times. Each word in the vocabulary gets its own vector in the Word2Vec model, so building the vocabulary determines which words the model can represent.

The `build_vocab` method in Gensim's Word2Vec expects an iterable of sentences as its input, where each sentence is a list of tokens. In the simplest case this is a list of lists; here it is the bigram-transformed corpus `sentences`.

In [20]:
t = time()

# Now we can build the vocabulary
w2v_model.build_vocab(sentences)


print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to build vocab: 0.02 mins


### Train the Word2Vec model

In [21]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 0.81 mins


### Exploring the Model

Once the Word2Vec model is trained, we can explore the model in various ways. Here are a few common methods:

- **Similarity Queries**: We can use the `most_similar` method to find the words most similar to a given word. For example, `w2v_model.wv.most_similar("football")` would return the words most similar to "football" according to the model.

- **Odd-One-Out**: We can use the `doesnt_match` method to find the word that doesn't match the others in a list. For example, `w2v_model.wv.doesnt_match(["football", "basketball", "apple"])` would likely return "apple".

- **Analogy Difference**: We can perform vector arithmetic with the word vectors to find interesting semantic relationships. For example, `w2v_model.wv.most_similar(positive=["king", "woman"], negative=["man"])` might return "queen", completing the analogy "man is to king as woman is to ___".

- **Word Vector**: We can directly access the vector of a word through the `wv` attribute. For example, `w2v_model.wv["football"]` would return the 300-dimensional vector representation of "football".

Remember, the quality and usefulness of these operations will depend heavily on the quality of the trained model, which in turn depends on factors like the size and quality of the training data, the choice of model parameters, and the amount of training.
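
Similarity queries rank words by the cosine similarity between their vectors. The sketch below reproduces that idea on three hand-made 3-dimensional vectors. The vectors and the helper names `cosine` and `rank_similar` are invented for illustration and have nothing to do with the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors, topn=3):
    """Return the `topn` other words with the highest cosine similarity."""
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors (assumption, not model output)
toy_vectors = {
    "football": [0.9, 0.1, 0.0],
    "basketball": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
for other, score in rank_similar("football", toy_vectors):
    print(f"{other}: {score:.3f}")
```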

In [22]:
w2v_model.wv.most_similar("football")

[('basketball', 0.613662600517273),
 ('pro', 0.5756626129150391),
 ('fantasy', 0.5238478779792786),
 ('league', 0.5225974917411804),
 ('groin', 0.5112829208374023),
 ('hockey', 0.5071981549263),
 ('stadium', 0.5038345456123352),
 ('wrestling', 0.4712387025356293),
 ('player', 0.46910229325294495),
 ('team', 0.4416936933994293)]

In [23]:
w2v_model.wv.doesnt_match(["football", "basketball", "apple"])

'apple'
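
One way to picture the odd-one-out query is: normalize each word vector, average them, and return the word whose vector is least similar to that average. The following sketch implements this idea on invented toy vectors. `odd_one_out` is a hypothetical helper that approximates the behavior described above, not Gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of all the words."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(vectors[words[0]])
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    # Dot product of unit vectors equals cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hand-made toy vectors (assumption, not model output)
toy_vectors = {
    "football": [0.9, 0.1, 0.0],
    "basketball": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
print(odd_one_out(["football", "basketball", "apple"], toy_vectors))
# → apple
```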

In [24]:
w2v_model.wv.most_similar(positive=["king", "woman"], negative=["man"])

[('princess', 0.38972997665405273),
 ('queen', 0.3561096489429474),
 ('prom', 0.3003641664981842),
 ('tale', 0.278092622756958),
 ('sex', 0.27013933658599854),
 ('wisdom', 0.2689722776412964),
 ('di', 0.26756322383880615),
 ('mr', 0.2590598165988922),
 ('rumors', 0.2560378313064575),
 ('prince', 0.25016871094703674)]
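
The analogy query combines vectors arithmetically (add the `positive` words, subtract the `negative` ones) and then looks for the nearest remaining word. Below is a minimal sketch on hand-picked 2-dimensional vectors whose axes loosely stand for "royalty" and "gender". The vectors are chosen so that "queen" wins; as the output above shows, the model trained on this dataset ranks "princess" first. `analogy` is a hypothetical helper.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative),
    excluding the query words themselves."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)

    excluded = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-picked toy vectors: [royalty, gender] (assumption, not model output)
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.9, 1.0],
    "apple": [-1.0, 0.1],
}
best_word, best_score = analogy(["king", "woman"], ["man"], toy_vectors)[0]
print(best_word, round(best_score, 3))
```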

In [25]:
w2v_model.wv["football"]

array([-1.80672660e-01,  9.09302682e-02,  3.23254138e-01, -1.56710431e-01,
       -8.45838934e-02,  2.15427235e-01,  8.66155922e-01,  1.97948162e-02,
       -3.43266577e-01, -1.33397549e-01,  2.48938918e-01, -1.92735791e-01,
       -4.58437903e-03, -4.63131547e-01, -5.21827117e-02, -2.12247800e-02,
       -1.20234919e+00, -1.17120659e+00, -2.66587343e-02,  2.62439460e-01,
        6.53793588e-02, -1.04228206e-01, -5.68211257e-01,  6.86273456e-01,
       -4.89263505e-01, -9.93103981e-01,  8.27904344e-02,  7.15437829e-01,
       -6.70806854e-04,  2.60650516e-01, -3.48504968e-02,  4.38413888e-01,
       -3.63642991e-01,  5.02590425e-02,  2.87028760e-01, -1.70110762e-01,
       -4.80886772e-02, -8.03541124e-01,  7.53154337e-01, -7.72180855e-01,
        3.89634997e-01, -6.07918985e-02, -8.02099466e-01, -2.39952832e-01,
       -1.91227496e-01, -2.95303762e-01,  5.77879190e-01,  4.63241667e-01,
       -5.74528575e-02,  4.91886377e-01, -4.68041241e-01, -3.79574537e-01,
       -6.18054904e-02,  