# Word2Vec

**Definition:**

Word2Vec is an NLP model built on the assumption that two words with similar neighbors are likely to be similar in meaning or closely related. This demo shows how to use the Gensim implementation of Word2Vec.
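
To make the "similar neighbors" assumption concrete, here is a minimal sketch that collects the neighbors of each word in a toy corpus and compares them with Jaccard overlap. The corpus and the helper names (`neighbor_sets`, `jaccard`) are made up for illustration; Word2Vec itself learns dense vectors rather than comparing neighbor sets directly.

```python
from collections import defaultdict

# Toy corpus (invented for illustration, not taken from the IMDB data)
corpus = [
    "the movie was great".split(),
    "the film was great".split(),
    "the movie was boring".split(),
    "the film was boring".split(),
    "i ate an apple".split(),
]

def neighbor_sets(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    neighbors = defaultdict(set)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            neighbors[word].update(sentence[lo:i] + sentence[i + 1:hi])
    return neighbors

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

neighbors = neighbor_sets(corpus)
print(jaccard(neighbors["movie"], neighbors["film"]))   # same neighbors in this corpus
print(jaccard(neighbors["movie"], neighbors["apple"]))  # no shared neighbors
```

In this toy corpus, "movie" and "film" appear in identical contexts, so their overlap is maximal, while "movie" and "apple" share no neighbors.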

Tutorial: https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.ZAeKny1h2M4

The goal is to replicate that process to generate word embeddings for the IMDB Movie Reviews dataset, which contains the text of 50k reviews.

## 1. Text Preprocessing

Imports and logging

In [1]:
# Import Gensim and set up logging so that training progress is printed
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Dataset**

IMDB Movie Reviews dataset with text from 50k reviews.

In [5]:
# Load the dataset into a pandas dataframe
import pandas as pd
input_file = 'IMDB Dataset.csv'
df = pd.read_csv(input_file)
# Extract the review texts and sentiment labels
reviews = df['review'].tolist()
labels = df['sentiment'].tolist()

In [3]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [4]:
df.shape

(50000, 2)

In [6]:
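# Tokenize and preprocess the reviews
import csv

def read_input(input_file):
    """Yield one list of tokens per review, reading the CSV row by row."""
    # DictReader consumes the header row, so the column names are never tokenized
    with open(input_file, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # simple_preprocess lowercases the text and splits it into word tokens
            yield gensim.utils.simple_preprocess(row['review'])

In [8]:
# Materialize the generator into a list so the corpus can be iterated more than once
documents = list(read_input(input_file))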

**Training the Word2Vec model**

In this step, we pass a list of lists to the Word2Vec model, where each inner list contains the tokens from one user review. Word2Vec uses all of these tokens to build a vocabulary, that is, the set of unique words.

The main idea is to train a simple neural network with a single hidden layer. We do not use the network itself after training; instead, we keep the learned weights of the hidden layer, which are the word vectors.
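
The statement that the hidden-layer weights are the word vectors can be sketched as follows: multiplying a one-hot input by the weight matrix simply selects one row of that matrix. The vocabulary, the matrix values, and the helper names here are hypothetical; a trained model would learn the values.

```python
# Hypothetical 4-word vocabulary and a 4x3 hidden-layer weight matrix
# (values are invented; training would learn them)
vocab = {"good": 0, "great": 1, "bad": 2, "movie": 3}
W = [
    [0.9, 0.1, 0.3],    # row for "good"
    [0.8, 0.2, 0.3],    # row for "great"
    [-0.7, 0.1, 0.2],   # row for "bad"
    [0.0, 0.9, -0.4],   # row for "movie"
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def hidden_activation(x, weights):
    """Compute the row vector x times the matrix `weights` (linear, no bias)."""
    n_cols = len(weights[0])
    return [sum(x[r] * weights[r][c] for r in range(len(weights)))
            for c in range(n_cols)]

x = one_hot(vocab["great"], len(vocab))
h = hidden_activation(x, W)

# The hidden activation for "great" is exactly its row of W
print(h == W[vocab["great"]])
```

Because of this, looking up a word's vector after training is just an index into the weight matrix, and no forward pass through the network is needed.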

**Parameters Setup**

- documents: The input corpus, a list of lists of words. Each sublist contains the words of a single document in the corpus.
- vector_size: The dimensionality of the word vectors produced by the Word2Vec model.
- window: The maximum distance between the target word and its context words. In other words, it determines the size of the "window" of words that the model considers when learning the word embeddings.
- min_count: The minimum frequency threshold for a word to be included in the vocabulary. Words that occur fewer times than this threshold are discarded.
- sg: Selects the training algorithm. Setting sg=1 selects skip-gram, and setting sg=0 selects continuous bag-of-words (CBOW).
    - **Skip-gram (sg=1)**: This algorithm aims to **predict the context words given a target word**. Specifically, it tries to maximize the probability of observing the context words given the target word. The target word is the input to the model, and the output is a probability distribution over the vocabulary for each context position.
    - **Continuous Bag-of-Words (CBOW, sg=0)**: This algorithm is the opposite of skip-gram. It aims to **predict the target word given a context of surrounding words**. Specifically, it tries to maximize the probability of observing the target word given the context words. The context words are the input to the model, and the output is a probability distribution over the vocabulary for the target word.
    - In other words, **the skip-gram model predicts the context words given a target word, while the CBOW model predicts the target word given a context of surrounding words. According to the guidance published with the original Word2Vec tool, skip-gram works well with small amounts of training data and represents rare words well, while CBOW is several times faster to train and slightly more accurate for frequent words**.
- workers: The number of worker threads used to train the model. Setting workers to a value greater than 1 processes the training data in parallel, which can reduce the overall training time on multi-core machines.
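
The difference between the two algorithms, and the role of window, can be sketched by listing the training examples each one would see for a single sentence. This is a simplified illustration with hypothetical helper names; it always uses the full window, whereas real implementations may also subsample frequent words or shrink the window at random.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: each example is (input=target word, output=one context word)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: each example is (input=list of context words, output=target word)."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, target))
    return examples

tokens = ["this", "movie", "was", "great"]

for pair in skipgram_pairs(tokens, window=1):
    print("skip-gram:", pair)

for example in cbow_examples(tokens, window=1):
    print("CBOW:", example)
```

With window=1, skip-gram produces one example per (target, neighbor) pair, while CBOW produces one example per target position with all of its neighbors combined as input.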

**Default Setup**: By default, vector_size is set to 100, window to 5, min_count to 5, and sg to 0 (CBOW).
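
To illustrate what the min_count threshold does to the vocabulary, here is a minimal sketch that counts tokens and keeps only those that occur at least min_count times. The toy documents and the `build_vocab` helper are hypothetical and are not Gensim's implementation.

```python
from collections import Counter

def build_vocab(documents, min_count):
    """Count every token, then keep the word types seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    kept = {word for word, count in counts.items() if count >= min_count}
    return counts, kept

# Toy tokenized documents (invented for illustration)
toy_docs = [
    ["good", "movie"],
    ["good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

counts, kept = build_vocab(toy_docs, min_count=2)
print(sorted(kept))
print(f"retains {len(kept)} of {len(counts)} word types")
```

Rare words are dropped because they appear in too few contexts for a useful vector to be learned, and removing them also shrinks the model.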

In [9]:
# Build vocabulary and train model (sg=1 selects skip-gram)
model = gensim.models.Word2Vec(documents, vector_size=100, window=5, min_count=5, sg=1, workers=10)

2023-04-01 15:02:32,502 : INFO : collecting all words and their counts
2023-04-01 15:02:32,509 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-04-01 15:02:33,250 : INFO : PROGRESS: at sentence #10000, processed 2235115 words, keeping 51730 word types
2023-04-01 15:02:34,170 : INFO : PROGRESS: at sentence #20000, processed 4471355 words, keeping 68515 word types
2023-04-01 15:02:34,822 : INFO : PROGRESS: at sentence #30000, processed 6703795 words, keeping 80657 word types
2023-04-01 15:02:35,753 : INFO : PROGRESS: at sentence #40000, processed 8930547 words, keeping 90755 word types
2023-04-01 15:02:36,579 : INFO : collected 99476 word types from a corpus of 11176467 raw words and 50000 sentences
2023-04-01 15:02:36,580 : INFO : Creating a fresh vocabulary
2023-04-01 15:02:36,832 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 39191 unique words (39.40% of original 99476, drops 60285)', 'datetime': '2023-04-01T15:02:36.793827', '

2023-04-01 15:03:26,268 : INFO : EPOCH 1 - PROGRESS: at 52.81% examples, 266542 words/s, in_qsize 20, out_qsize 1
2023-04-01 15:03:27,297 : INFO : EPOCH 1 - PROGRESS: at 56.14% examples, 266974 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:28,352 : INFO : EPOCH 1 - PROGRESS: at 59.24% examples, 265449 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:29,380 : INFO : EPOCH 1 - PROGRESS: at 62.61% examples, 266312 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:30,397 : INFO : EPOCH 1 - PROGRESS: at 65.64% examples, 265745 words/s, in_qsize 16, out_qsize 3
2023-04-01 15:03:31,401 : INFO : EPOCH 1 - PROGRESS: at 68.88% examples, 265761 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:32,460 : INFO : EPOCH 1 - PROGRESS: at 72.31% examples, 266073 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:33,462 : INFO : EPOCH 1 - PROGRESS: at 75.57% examples, 266460 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:03:34,497 : INFO : EPOCH 1 - PROGRESS: at 78.60% examples, 265545 words/s,

2023-04-01 15:04:41,136 : INFO : EPOCH 3 - PROGRESS: at 64.20% examples, 243980 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:04:42,157 : INFO : EPOCH 3 - PROGRESS: at 66.48% examples, 241244 words/s, in_qsize 20, out_qsize 1
2023-04-01 15:04:43,195 : INFO : EPOCH 3 - PROGRESS: at 68.97% examples, 239483 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:04:44,196 : INFO : EPOCH 3 - PROGRESS: at 71.85% examples, 239634 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:04:45,313 : INFO : EPOCH 3 - PROGRESS: at 74.34% examples, 237387 words/s, in_qsize 20, out_qsize 3
2023-04-01 15:04:46,321 : INFO : EPOCH 3 - PROGRESS: at 77.27% examples, 237617 words/s, in_qsize 20, out_qsize 1
2023-04-01 15:04:47,446 : INFO : EPOCH 3 - PROGRESS: at 79.71% examples, 235291 words/s, in_qsize 18, out_qsize 1
2023-04-01 15:04:48,473 : INFO : EPOCH 3 - PROGRESS: at 82.23% examples, 234336 words/s, in_qsize 19, out_qsize 0
2023-04-01 15:04:49,491 : INFO : EPOCH 3 - PROGRESS: at 84.77% examples, 233577 words/s,

2023-04-01 15:05:54,607 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=39191, vector_size=100, alpha=0.025>', 'datetime': '2023-04-01T15:05:54.607167', 'gensim': '4.3.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}


In [11]:
# Create a zero-filled numpy array with one row per review and one column per embedding dimension
import numpy as np
reviews_embeddings = np.zeros((len(documents), model.wv.vector_size))

In [12]:
# Loop through each review
for i, review in enumerate(documents):
    words_embeddings = []
    # loop through each word in the review
    for word in review:
        # if the word is in the model vocabulary, append its embedding to words_embeddings
        if word in model.wv.key_to_index:
            words_embeddings.append(model.wv[word])
    # calculate the mean embedding for the review and store it in reviews_embeddings
    if words_embeddings:
        mean_embedding = np.mean(words_embeddings, axis=0)
        reviews_embeddings[i] = mean_embedding
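
The averaging step above can be sketched as a small standalone function, which makes the two edge cases explicit: out-of-vocabulary words are skipped, and a review with no known words keeps an all-zero vector. This is a dependency-free illustration; the `mean_embedding` helper and the toy vectors are hypothetical, and a plain dict stands in for model.wv.

```python
def mean_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Hypothetical 2-dimensional vectors standing in for model.wv
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

# "zzz" is not in the vocabulary, so only "good" and "movie" are averaged
print(mean_embedding(["good", "movie", "zzz"], toy_vectors, dim=2))

# No known words at all: the result stays a zero vector
print(mean_embedding(["zzz"], toy_vectors, dim=2))
```

Averaging discards word order, so it is a simple baseline for turning word vectors into a fixed-length review representation.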

In [15]:
# Use reviews_embeddings as the input to logistic regression to classify each review.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(reviews_embeddings, labels, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

train_acc = lr.score(X_train, y_train)
test_acc = lr.score(X_test, y_test)

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)

Train accuracy: 0.869975
Test accuracy: 0.8696


Now I will vary vector_size over [25, 50, 100, 150] and plot vector_size against train and test accuracy.

In [None]:
# Initialize empty lists to store accuracies
train_accs = []
test_accs = []

# Define vector sizes to test
vector_sizes = [25, 50, 100, 150]

for vector_size in vector_sizes:
    # Build vocabulary and train model
    model = gensim.models.Word2Vec(documents,vector_size=vector_size,window=5,min_count=5,sg = 1,workers=10)
    
    # Create empty numpy array for reviews embeddings
    reviews_embeddings = np.zeros((len(documents), vector_size))
    
    # Loop through each review
    for i, review in enumerate(documents):
        words_embeddings = []
        # loop through each word in the review
        for word in review:
            # if the word is in the model vocabulary, append its embedding to words_embeddings
            if word in model.wv.key_to_index:
                words_embeddings.append(model.wv[word])
        # calculate the mean embedding for the review and store it in reviews_embeddings
        if words_embeddings:
            mean_embedding = np.mean(words_embeddings, axis=0)
            reviews_embeddings[i] = mean_embedding
    
    # Use reviews_embeddings as the input to logistic regression to classify each review.
    X_train, X_test, y_train, y_test = train_test_split(reviews_embeddings, labels, test_size=0.2, random_state=42)

    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)

    train_acc = lr.score(X_train, y_train)
    test_acc = lr.score(X_test, y_test)
    
    # Append accuracies to lists
    train_accs.append(train_acc)
    test_accs.append(test_acc)

In [None]:
# Plot vector size vs. train and test accuracy
import matplotlib.pyplot as plt

plt.plot(vector_sizes, train_accs, label='Train Accuracy')
plt.plot(vector_sizes, test_accs, label='Test Accuracy')
plt.xlabel('Vector Size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Now I will vary window over [2, 3, 5, 10].