#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science

# Notebook 9: Training Word Embeddings using `gensim`

In this notebook, we'll train word embeddings with `gensim`'s implementation of the popular word2vec algorithm.

In [None]:
import os
import pandas as pd
import time

from collections import Counter
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.word2vec import Word2Vec

## Setting up the Data

You can get the data from Canvas (<tt>Files -> Data</tt>).

In [None]:
f = "rjobs_2020_preprocessed.json"

In [None]:
df = pd.read_json(f)

In [None]:
df.head()

In [None]:
df.shape

Since we are training a word embedding model, we will only use the text. Any analyses we do will be based on relationships among words (well, their embeddings), so we won't use document-level metadata like the date or number of replies.

The line of code below does three things:
1. `.apply(str.split)` splits the preprocessed text on whitespace
2. `.tolist()` converts the column to a list
3. Finally, we assign the result--a list of lists--to the variable <tt>text</tt>

In [None]:
text = df["preprocessed"].apply(str.split).tolist()

In [None]:
print(text[-1])

The model will learn representations of words (as vectors) based on how the words are used. Because rare words appear in only a few contexts, the model has less to learn from and their representations are less reliable. Below we can see how the vocabulary size changes when we exclude words that appear fewer than 5 or fewer than 25 times.

In [None]:
# Number of distinct words that appear at least 5 times
sum(count >= 5 for count in Counter(word for post in text for word in post).values())

In [None]:
# Number of distinct words that appear at least 25 times
sum(count >= 25 for count in Counter(word for post in text for word in post).values())

In [None]:
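# Count every word once, then build both candidate vocabularies as sets (fast membership tests)
all_words = [word for post in text for word in post]
word_counts = Counter(all_words)
min_of_five = {word for word, count in word_counts.items() if count >= 5}
min_of_twentyfive = {word for word, count in word_counts.items() if count >= 25}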

If we exclude words that do not appear at least 25 times, we lose words like "cheesecake" that actually seem quite relevant to the example below. Whether or not we keep a word like "cheesecake" matters for three reasons. First, if we exclude it, we do not get a word vector for it, and we cannot use it in any analyses. Second, representations for words like "factory" will be affected because the two words directly co-occur. Finally, if we remove words, we also bring the remaining words closer together: the filtered words are dropped before the context window is applied, so words that used to be separated by a rare word become neighbors.

In [None]:
samp = df.loc[5239].preprocessed

samp_min_five = " ".join([word for word in samp.split() if word in min_of_five])
samp_min_twentyfive = " ".join([word for word in samp.split() if word in min_of_twentyfive])

print(samp_min_five, "\n")
print(samp_min_twentyfive)
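
The third point above, that removing words brings the remaining words closer together, can be illustrated with a small sketch. The post, the vocabulary, and the helper `context_pairs` below are hypothetical and are not part of the course data or of `gensim`; the sketch assumes that rare words are dropped before the context window is applied.

```python
def context_pairs(tokens, window):
    """Return the set of (center, context) pairs within `window` positions."""
    pairs = set()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.add((center, tokens[j]))
    return pairs

# Hypothetical toy post and vocabulary (made up for illustration)
toy_post = ["work", "night", "cheesecake", "factory", "shift"]
toy_vocab = {"work", "night", "factory", "shift"}  # pretend "cheesecake" is too rare

before = context_pairs(toy_post, window=1)
after = context_pairs([w for w in toy_post if w in toy_vocab], window=1)

print(("night", "factory") in before)  # False: "cheesecake" sits between them
print(("night", "factory") in after)   # True: they are now adjacent
```

With a window of 1, "night" and "factory" only become a training pair once "cheesecake" has been filtered out.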

#### We can speed things up by changing the number of workers!

In [None]:
os.cpu_count()
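
One possible way to choose the number of workers is to leave one core free for the rest of the system while never dropping below one worker. The helper `pick_workers` below is a hypothetical convenience function, not something `gensim` provides; it also guards against `os.cpu_count()` returning `None`, which can happen when the count cannot be determined.

```python
import os

def pick_workers(reserve=1):
    """Hypothetical helper: leave `reserve` cores free, but always use at least one."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, available - reserve)

n_workers = pick_workers()
print(n_workers)
```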

The class below is adapted from https://stackoverflow.com/a/58515344. It allows us to print the loss when we train the model for more than one epoch.

In [None]:
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch.
    from https://stackoverflow.com/a/58515344
    """

    def __init__(self):
        self.epoch = 0
        self.loss_to_be_subed = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print(f"Loss after epoch {self.epoch}: {loss_now:,}")
        self.epoch += 1
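
The subtraction in `on_epoch_end` is there because the value returned by `get_latest_training_loss()` is a running total across epochs rather than the loss for the latest epoch alone. The sketch below shows the same differencing on made-up numbers; `per_epoch_losses` is a hypothetical helper and the totals are invented for illustration.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running totals into per-epoch losses by differencing."""
    previous = 0.0
    result = []
    for total in cumulative_losses:
        result.append(total - previous)
        previous = total
    return result

# Made-up running totals, for illustration only
running_totals = [1000.0, 1800.0, 2500.0]
print(per_epoch_losses(running_totals))  # [1000.0, 800.0, 700.0]
```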

## Training a Basic Model

You can see the details of `gensim`'s implementation of word2vec [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec). The <tt>sg</tt> argument lets us use the skip-gram algorithm, and <tt>negative</tt> lets us specify the number of negative samples.
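
To give a rough sense of what a negative sample is: for each observed (word, context) pair, the model also draws a few "noise" words to contrast against. In the original word2vec formulation these are drawn from word frequencies raised to the 0.75 power (in `gensim` this exponent is the `ns_exponent` argument, which defaults to 0.75). The sketch below only illustrates how that exponent reshapes the sampling probabilities; the counts and the helper `noise_distribution` are hypothetical, and this is not `gensim`'s actual implementation.

```python
from collections import Counter

def noise_distribution(counts, exponent=0.75):
    """Turn raw counts into sampling probabilities proportional to count ** exponent."""
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

# Made-up counts, for illustration only
toy_counts = Counter({"job": 81, "factory": 16, "cheesecake": 1})
n_tokens = sum(toy_counts.values())

raw = {word: count / n_tokens for word, count in toy_counts.items()}
smoothed = noise_distribution(toy_counts)

for word in toy_counts:
    print(f"{word}: raw={raw[word]:.3f}, smoothed={smoothed[word]:.3f}")
```

Relative to raw frequencies, the exponent makes rare words somewhat more likely, and very frequent words somewhat less likely, to be drawn as negatives.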

In [None]:
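# Skip-gram (sg = 1) with 5 negative samples; all other settings keep gensim's defaults
basic_model = Word2Vec(text, window = 5, sg = 1, negative = 5, workers = os.cpu_count()-1, min_count = 5)
# Keep only the trained word vectors (the model's KeyedVectors object)
basic_model = basic_model.wv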

In [None]:
basic_model.most_similar("employment")

In [None]:
basic_model.most_similar("job")
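
`most_similar` ranks words by the cosine similarity between their vectors. As a minimal sketch of that idea, the example below computes cosine similarity by hand on three made-up, three-dimensional vectors; `cosine_similarity`, `most_similar_sketch`, and the toy vectors are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, for illustration only
toy_vectors = {
    "job": [1.0, 2.0, 0.0],
    "employment": [2.0, 4.0, 0.0],
    "cheesecake": [0.0, 0.0, 3.0],
}

ranked = most_similar_sketch("job", toy_vectors)
print(ranked)
```

Here "employment" points in the same direction as "job", so it ranks first even though its vector is twice as long; "cheesecake" is orthogonal to "job" and scores zero.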

## Training a Better Model

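Now let's compare that model to a better model. We're going to train this model for 100 epochs (`gensim`'s default, which the basic model used, is 5), meaning the model will have many more chances to update the word embeddings and improve them. We also use larger vectors (300 dimensions rather than the default of 100) and a slightly wider context window (7 rather than 5).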

In [None]:
start_time = time.time()

model = Word2Vec(text, vector_size = 300, window = 7, sg = 1, negative = 5, workers = os.cpu_count()-1, min_count = 5, 
                 epochs = 100, callbacks=[callback()], compute_loss = True)

minutes = (time.time() - start_time)/60
print(f"Training completed in {minutes:.1f} minutes.")

In [None]:
model.wv.most_similar("employment")

In [None]:
model.wv.most_similar("job")