# Deep Learning Tutorial

## Reading the Data

Import the `pandas` package, then use the `read_csv` function to read the labeled training data.

In [1]:
import pandas as pd

# Read the labeled training data
train = pd.read_csv("../data/labeled_train_data.tsv", header=0, delimiter="\t", quoting=3)

print(train.shape)
print(train.columns.values)
print(train["review"][0])

(25000, 3)
['id' 'sentiment' 'review']
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actu

## Data Cleaning and Text Preprocessing

Import the Beautiful Soup library and use it to remove HTML markup from the review text.

In [2]:
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single review
example_review = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw text and the text without tags or markup
print("Raw review text:\n", train["review"][0])
print("Modified review text:\n", example_review.get_text())

Raw review text:
 "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit w

For simplicity, remove all punctuation and numbers. However, in sentiment analysis problems, it is important to remember that punctuation and numbers often carry sentiment and should usually be treated as words.

To remove punctuation and numbers, use regular expressions.

In [3]:
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]", " ", example_review.get_text())

print("Text after removing punctuation and numbers:\n", letters_only)

Text after removing punctuation and numbers:
  With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film b
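As noted above, punctuation often carries sentiment. The following is a minimal sketch of an alternative tokenizer that keeps runs of `!` and `?` and a few simple emoticons as tokens instead of discarding them. The pattern and the function name `tokenize_keep_punctuation` are assumptions made for illustration; they are not used elsewhere in this tutorial.

```python
import re

# Hypothetical pattern: words (with an optional internal apostrophe),
# runs of ! or ?, and a few simple emoticons such as :) or ;-(
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?|[!?]+|[:;]-?[()]")


def tokenize_keep_punctuation(text: str) -> list:
    """Lowercase the text and return words, !/? runs, and simple emoticons as tokens."""
    return TOKEN_PATTERN.findall(text.lower())


sample_tokens = tokenize_keep_punctuation("Loved it!!! Didn't expect that :)")
print(sample_tokens)
```

With this approach, `!!!` and `:)` survive as separate tokens that a downstream model could treat as features.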

Convert reviews to lowercase and split them into individual words ("tokenization").

In [4]:
# Convert to lower-case and split into words
lower_case = letters_only.lower()
tokens = lower_case.split()

print("Lower-case tokens:\n", tokens)

Lower-case tokens:
 ['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'i

Decide how to deal with frequently occurring words that don't carry much meaning ("stop words"), such as "a", "and", "is", and "the".

Import a stop word list from the Python Natural Language Toolkit and remove stop words from the review text.

In [5]:
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
print("Stop words in NLTK:", stopwords.words("english"))

# Remove stop words from the review text (build the set once so each lookup is fast)
stop_words = set(stopwords.words("english"))
tokens = [token for token in tokens if token not in stop_words]
print("Tokens after removing stop words:\n", tokens)

Stop words in NLTK: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 's

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rohanmistry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Create a reusable function to clean review text.

In [6]:
stops = set(stopwords.words("english"))

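def clean_review_text(review: str) -> str:
    """
    Cleans the review text by removing HTML tags, markup, punctuation, numbers, and stop words.

    Parameters:
        review (str): The raw review text to be cleaned.
    
    Returns: 
        str: The cleaned tokens joined into a single space-separated string.
    """

    # Strip HTML tags and markup (name the parser explicitly to avoid a parser-guessing warning)
    text = BeautifulSoup(review, "html.parser").get_text()
    # Replace everything except letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # Convert to lowercase and split into tokens
    tokens = letters_only.lower().split()
    # Drop stop words, using the precomputed set for fast lookups
    tokens = [token for token in tokens if token not in stops]

    return " ".join(tokens)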

In [7]:
clean_review = clean_review_text(train["review"][0])
print("Cleaned review text:\n", clean_review)

Cleaned review text:
 stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually 

Loop through and clean the entire training set.

In [8]:
# Get the number of reviews in the training set
num_reviews = train["review"].size

# Initialize an empty list to hold the cleaned reviews
cleaned_train_reviews = []

# Iterate through each review in the training set
for i in range(num_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Cleaning review {i + 1} of {num_reviews}")
    
    # Clean the review and append it to the list
    cleaned_review = clean_review_text(train["review"][i])
    cleaned_train_reviews.append(cleaned_review)

Cleaning review 1000 of 25000
Cleaning review 2000 of 25000
Cleaning review 3000 of 25000
Cleaning review 4000 of 25000
Cleaning review 5000 of 25000
Cleaning review 6000 of 25000
Cleaning review 7000 of 25000
Cleaning review 8000 of 25000
Cleaning review 9000 of 25000
Cleaning review 10000 of 25000
Cleaning review 11000 of 25000
Cleaning review 12000 of 25000
Cleaning review 13000 of 25000
Cleaning review 14000 of 25000
Cleaning review 15000 of 25000
Cleaning review 16000 of 25000
Cleaning review 17000 of 25000
Cleaning review 18000 of 25000
Cleaning review 19000 of 25000
Cleaning review 20000 of 25000
Cleaning review 21000 of 25000
Cleaning review 22000 of 25000
Cleaning review 23000 of 25000
Cleaning review 24000 of 25000
Cleaning review 25000 of 25000


Use a bag-of-words approach to convert the training reviews to a numeric representation for machine learning. Build a vocabulary from all reviews, then create feature vectors with the count of each word in each review.

To limit the size of the feature vectors, use the 5000 most frequent words. Use the `feature_extraction` module from `scikit-learn` to create bag-of-words features.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with parameters to limit the vocabulary size
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)

# Fit the vectorizer on the cleaned training reviews, learn the vocabulary, and transform the reviews into feature vectors
train_data_features = vectorizer.fit_transform(cleaned_train_reviews)

# Convert the resulting sparse matrix to a dense format
train_data_features = train_data_features.toarray()

print("Shape of the training data features:", train_data_features.shape)

Shape of the training data features: (25000, 5000)


In [10]:
vocabulary = vectorizer.get_feature_names_out()
print(vocabulary)

['abandoned' 'abc' 'abilities' ... 'zombie' 'zombies' 'zone']
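To make the bag-of-words representation concrete, here is a minimal pure-Python sketch on a made-up three-review corpus. It is not how `CountVectorizer` is implemented (tie-breaking and tokenization may differ); the helper names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter

# Toy corpus of already-cleaned reviews (made up for illustration)
toy_reviews = ["good film good cast", "bad film", "good plot bad cast"]


def build_vocabulary(reviews, max_features):
    """Return the max_features most frequent words, sorted alphabetically."""
    counts = Counter(word for review in reviews for word in review.split())
    # Sort by descending count, then alphabetically, so ties break deterministically
    most_frequent = sorted(counts, key=lambda word: (-counts[word], word))[:max_features]
    return sorted(most_frequent)


def to_count_vector(review, vocabulary):
    """Count how often each vocabulary word appears in the review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]


toy_vocabulary = build_vocabulary(toy_reviews, max_features=4)
toy_count_vectors = [to_count_vector(review, toy_vocabulary) for review in toy_reviews]

print(toy_vocabulary)
for vector in toy_count_vectors:
    print(vector)
```

The word "plot" is the least frequent word, so it falls outside the four-word vocabulary and is ignored, just as words outside the 5000 most frequent are ignored above.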


Use a Random Forest classifier for supervised learning.

In [11]:
from sklearn.ensemble import RandomForestClassifier

print("Training the random forest classifier...")

# Initialize the Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators=100)

# Fit the forest to the training data features using the bag of words as features and the sentiment labels as the response variable
forest = forest.fit(train_data_features, train["sentiment"])

Training the random forest classifier...


Run the trained Random Forest classifier on the test set and create a submission file.

In [12]:
# Read the test data
test = pd.read_csv("../data/test_data.tsv", header=0, delimiter="\t", quoting=3)

# Verify the test data shape
print("Test data shape:", test.shape)

Test data shape: (25000, 2)


In [13]:
# Clean and parse test reviews
num_test_reviews = len(test["review"])
cleaned_test_reviews = []

print("Cleaning test reviews...")
for i in range(num_test_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Cleaning test review {i + 1} of {num_test_reviews}")
    
    cleaned_review = clean_review_text(test["review"][i])
    cleaned_test_reviews.append(cleaned_review)

# Transform the cleaned test reviews into feature vectors
test_data_features = vectorizer.transform(cleaned_test_reviews)
test_data_features = test_data_features.toarray()

# Use the trained Random Forest classifier to predict sentiment for the test set
result = forest.predict(test_data_features)

# Copy results to a DataFrame for submission
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})

# Write the DataFrame to a CSV file for submission
output.to_csv("../data/tutorial_submission.csv", index=False, quoting=3)

Cleaning test reviews...
Cleaning test review 1000 of 25000
Cleaning test review 2000 of 25000
Cleaning test review 3000 of 25000
Cleaning test review 4000 of 25000
Cleaning test review 5000 of 25000
Cleaning test review 6000 of 25000
Cleaning test review 7000 of 25000
Cleaning test review 8000 of 25000
Cleaning test review 9000 of 25000
Cleaning test review 10000 of 25000
Cleaning test review 11000 of 25000
Cleaning test review 12000 of 25000
Cleaning test review 13000 of 25000
Cleaning test review 14000 of 25000
Cleaning test review 15000 of 25000
Cleaning test review 16000 of 25000
Cleaning test review 17000 of 25000
Cleaning test review 18000 of 25000
Cleaning test review 19000 of 25000
Cleaning test review 20000 of 25000
Cleaning test review 21000 of 25000
Cleaning test review 22000 of 25000
Cleaning test review 23000 of 25000
Cleaning test review 24000 of 25000
Cleaning test review 25000 of 25000


## Using Word2vec

Word2vec is a neural network implementation that learns distributed representations for words. It does not need labels to create meaningful representations; if the network is given enough training data, it produces word vectors with intriguing characteristics. Words with similar meanings appear in clusters, and the clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math.
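The idea of reproducing analogies with vector math can be sketched with hand-made vectors. The two-dimensional vectors below are invented for illustration (real Word2vec vectors are learned and have hundreds of dimensions), and `solve_analogy` is a hypothetical helper, not a gensim function.

```python
import math

# Hand-made 2-D vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender"
analogy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vectors[b] - vectors[a] + vectors[c], excluding a, b, and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


# "man" is to "king" as "woman" is to ...?
analogy_answer = solve_analogy("man", "king", "woman", analogy_vectors)
print(analogy_answer)
```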

Because Word2vec does not need labels, the unlabeled training data can now be used in addition to the labeled training data.

In [14]:
unlabeled_train = pd.read_csv("../data/unlabeled_train_data.tsv", header=0, delimiter="\t", quoting=3)

To train Word2vec, it is better not to remove stop words because the algorithm relies on the broader context of the sentence to produce high-quality word vectors.

In [15]:
def review_to_tokens(review: str, remove_stopwords=False) -> list:
    """
    Converts a review into a list of tokens, optionally removing stop words.

    Parameters:
        review (str): The raw review text to be tokenized.
        remove_stopwords (bool): Whether to remove stop words from the tokens.

    Returns:
        list: A list of tokens from the review.
    """

    # Parse the review using BeautifulSoup and remove HTML tags and markup
    text = BeautifulSoup(review, "html.parser").get_text()

    # Remove punctuation and numbers, convert to lower-case, and split into tokens
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()

    # Optionally remove stop words
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stops]

    return tokens

Word2vec expects single sentences, each one as a list of tokens, so each review paragraph must first be split into sentences. Use NLTK's punkt tokenizer for sentence splitting.

In [22]:
from nltk.tokenize import sent_tokenize

nltk.download("punkt_tab")

# Split the review into sentences and tokenize each sentence
def review_to_sentences(review: str, remove_stopwords=False) -> list:
    """
    Splits a review into sentences and tokenizes each sentence.

    Parameters:
        review (str): The raw review text to be split and tokenized.
        remove_stopwords (bool): Whether to remove stop words from the tokens.

    Returns:
        list: A list of lists, where each inner list contains tokens from a sentence.
    """

    # Split the review into sentences
    raw_sentences = sent_tokenize(review.strip())

    # Tokenize each non-empty sentence and collect the token lists
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_tokens(raw_sentence, remove_stopwords))

    return sentences

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/rohanmistry/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Apply the `review_to_sentences` function to prepare data for input to Word2vec.

In [23]:
sentences = []

print("Processing sentences from training data...")
for review in train["review"]:
    sentences += list(review_to_sentences(review))

for review in unlabeled_train["review"]:
    sentences += list(review_to_sentences(review))

Processing sentences from training data...





In [24]:
print("Number of sentences processed:", len(sentences))

Number of sentences processed: 796172


## Training and Saving the Model

There are many parameter choices that affect the runtime and quality of the final model produced, such as architecture, training algorithm, downsampling of frequent words, word vector dimensionality, context/window size, worker threads, and minimum word count.

In [25]:
import logging
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)

# Set values for Word2vec parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count to consider a word in the vocabulary
num_workers = 4       # Number of worker threads to train the model
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
from gensim.models import Word2Vec
print("Training Word2Vec model...")
model = Word2Vec(sentences, workers=num_workers, vector_size=num_features, min_count=min_word_count, window=context, sample=downsampling)
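# Unit-normalize the word vectors in place (init_sims is deprecated in gensim 4.x and emits a warning)
model.init_sims(replace=True)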

# Save the model to a file
model_file = "../models/word2vec_model"
model.save(model_file)
print(f"Word2Vec model saved to {model_file}")

2025-08-05 14:28:03,696 : INFO : collecting all words and their counts
2025-08-05 14:28:03,697 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2025-08-05 14:28:03,742 : INFO : PROGRESS: at sentence #10000, processed 225664 words, keeping 17775 word types
2025-08-05 14:28:03,768 : INFO : PROGRESS: at sentence #20000, processed 451738 words, keeping 24945 word types
2025-08-05 14:28:03,789 : INFO : PROGRESS: at sentence #30000, processed 670858 words, keeping 30027 word types
2025-08-05 14:28:03,811 : INFO : PROGRESS: at sentence #40000, processed 896840 words, keeping 34335 word types
2025-08-05 14:28:03,834 : INFO : PROGRESS: at sentence #50000, processed 1116081 words, keeping 37751 word types
2025-08-05 14:28:03,855 : INFO : PROGRESS: at sentence #60000, processed 1337543 words, keeping 40711 word types
2025-08-05 14:28:03,876 : INFO : PROGRESS: at sentence #70000, processed 1560306 words, keeping 43311 word types


Training Word2Vec model...


2025-08-05 14:28:03,903 : INFO : PROGRESS: at sentence #80000, processed 1779515 words, keeping 45707 word types
2025-08-05 14:28:03,926 : INFO : PROGRESS: at sentence #90000, processed 2003713 words, keeping 48121 word types
2025-08-05 14:28:03,943 : INFO : PROGRESS: at sentence #100000, processed 2225464 words, keeping 50190 word types
2025-08-05 14:28:03,963 : INFO : PROGRESS: at sentence #110000, processed 2444322 words, keeping 52058 word types
2025-08-05 14:28:03,982 : INFO : PROGRESS: at sentence #120000, processed 2666487 words, keeping 54098 word types
2025-08-05 14:28:04,007 : INFO : PROGRESS: at sentence #130000, processed 2892314 words, keeping 55837 word types
2025-08-05 14:28:04,026 : INFO : PROGRESS: at sentence #140000, processed 3104795 words, keeping 57324 word types
2025-08-05 14:28:04,043 : INFO : PROGRESS: at sentence #150000, processed 3330431 words, keeping 59045 word types
2025-08-05 14:28:04,065 : INFO : PROGRESS: at sentence #160000, processed 3552465 words, k

Word2Vec model saved to ../models/word2vec_model
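To illustrate what the minimum word count setting does, the sketch below filters a toy corpus so that only words occurring at least `min_count` times remain in the vocabulary. The helper `filter_vocabulary` and the corpus are hypothetical; gensim performs its own vocabulary building internally.

```python
from collections import Counter


def filter_vocabulary(sentences, min_count):
    """Return the set of words that occur at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus in the same format as the Word2vec input: a list of token lists
toy_corpus = [
    ["the", "film", "was", "good"],
    ["the", "cast", "was", "good"],
    ["the", "film", "dragged"],
]

toy_vocab = filter_vocabulary(toy_corpus, min_count=2)
print(sorted(toy_vocab))
```

Rare words such as "cast" and "dragged" are dropped, which is why a higher minimum word count yields a smaller vocabulary.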


## Exploring the Model Results

In [27]:
print("Which doesn't match between 'man', 'woman', 'child', and 'kitchen'? ", model.wv.doesnt_match("man woman child kitchen".split()))
print("Which doesn't match between 'france', 'england', 'germany', and 'berlin'? ", model.wv.doesnt_match("france england germany berlin".split()))
print("Which doesn't match between 'paris', 'berlin', 'london', and 'austria'? ", model.wv.doesnt_match("paris berlin london austria".split()))
print("Most similar to 'main': ", model.wv.most_similar("main"))
print("Most similar to 'queen': ", model.wv.most_similar("queen"))
print("Most similar to 'awful': ", model.wv.most_similar("awful"))

Which doesn't match between 'man', 'woman', 'child', and 'kitchen'?  kitchen
Which doesn't match between 'france', 'england', 'germany', and 'berlin'?  berlin
Which doesn't match between 'paris', 'berlin', 'london', and 'austria'?  paris
Most similar to 'main':  [('central', 0.6256492733955383), ('secondary', 0.5544471740722656), ('primary', 0.551468014717102), ('development', 0.49399852752685547), ('peripheral', 0.49228551983833313), ('undeveloped', 0.4839439392089844), ('titular', 0.4774162471294403), ('major', 0.4698925018310547), ('unlikeable', 0.4575895667076111), ('biggest', 0.4439127445220947)]
Most similar to 'queen':  [('princess', 0.6462267637252808), ('victoria', 0.6021219491958618), ('bride', 0.5906339883804321), ('goddess', 0.5817415714263916), ('maid', 0.5766915678977966), ('belle', 0.5599071383476257), ('prince', 0.5592232346534729), ('showgirl', 0.558584451675415), ('fatale', 0.558175802230835), ('maria', 0.5564477443695068)]
Most similar to 'awful':  [('terrible', 0.769
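One way to think about an odd-one-out query is to find the word whose vector points least in the direction of the group's average. The sketch below shows that general idea on invented two-dimensional vectors; it is an illustration under that assumption, not gensim's actual implementation, and `odd_one_out` is a hypothetical helper.

```python
import math

# Invented 2-D vectors for illustration only
odd_vectors = {
    "man": [0.9, 0.1],
    "woman": [0.8, 0.2],
    "child": [0.7, 0.3],
    "kitchen": [0.1, 0.9],
}


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors):
    """Return the word whose unit vector has the smallest dot product with the group mean."""
    unit_vectors = {word: unit(vectors[word]) for word in words}
    mean = [sum(dim) / len(words) for dim in zip(*unit_vectors.values())]
    return min(words, key=lambda word: sum(a * b for a, b in zip(unit_vectors[word], mean)))


odd_word = odd_one_out(["man", "woman", "child", "kitchen"], odd_vectors)
print(odd_word)
```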

## Numeric Representation of Words

The trained Word2vec model consists of a feature vector for each word in the vocabulary, stored in a `numpy` array called "vectors".

In [30]:
type(model.wv.vectors)
print("Shape of the word vectors (syn0):", model.wv.vectors.shape)

Shape of the word vectors (syn0): (16490, 300)


The number of rows is the number of words in the model's vocabulary, and the number of columns corresponds to the size of the feature vector. Setting the minimum word count to 40 gives a total vocabulary of 16,490 words with 300 features each. Individual word vectors (each a 300-element `numpy` array) can be accessed by indexing `model.wv` with the word.

In [34]:
print(model.wv["flower"])

[-1.01115685e-02  1.29889371e-02 -2.08567455e-02  8.65000933e-02
  2.41670031e-02 -4.74770330e-02  1.91409495e-02  9.70198661e-02
  1.05710821e-02 -5.11368178e-02 -2.05972921e-02  4.30359952e-02
 -2.48122811e-02  8.83926451e-02 -4.75177281e-02 -2.69599017e-02
  1.27959223e-02 -1.27007991e-01  6.86568115e-03 -3.40595134e-02
 -1.24566734e-03 -2.57888157e-02  3.92788351e-02  9.20327082e-02
  7.03837350e-02 -1.54855298e-02 -1.91287249e-02  2.30791830e-02
  6.86938036e-03  4.37342236e-03  4.70337681e-02 -3.90385799e-02
 -1.34434234e-02 -3.38446535e-02  1.04363887e-02  6.28910437e-02
  5.69956750e-02 -1.17500491e-01 -4.45946008e-02 -3.11273336e-02
 -2.11478807e-02  7.73067819e-03  9.06712934e-02 -1.16943354e-02
 -1.15378015e-01  1.76322833e-03 -4.87123895e-03 -9.07074846e-03
 -2.58970149e-02  1.85833052e-02  6.70485795e-02  3.26483399e-02
 -3.44926640e-02 -5.82870608e-03 -2.64347177e-02  4.23103645e-02
 -1.08057745e-02 -3.82160163e-03 -1.67368837e-02 -9.27755088e-02
 -7.69823045e-02  3.73006

## From Words to Paragraphs

### Vector Averaging

One challenge is that reviews vary in length. The individual word vectors must be transformed into a feature set that is the same length for every review. Since each word is a vector in 300-dimensional space, vector operations can be used to combine the words in each review. One method is to simply average the word vectors in a given review.

In [40]:
import numpy as np

def make_feature_vector(words: list, model: Word2Vec, num_features: int) -> np.ndarray:
    """
    Calculates the average feature vector for a list of words.

    Parameters:
        words (list): A list of words to be averaged.
        model (Word2Vec): The trained Word2Vec model.
        num_features (int): The dimensionality of the feature vectors.

    Returns:
        np.ndarray: A numpy array representing the average feature vector for the words.
    """

    # Initialize an empty feature vector of zeros
    feature_vector = np.zeros((num_features,), dtype="float32")
    num_words = 0

    # index_to_key is a list of the words in the model's vocabulary; convert it to a set for fast lookups
    index2word_set = set(model.wv.index_to_key)

    # Iterate through each word in the review and add its feature vector to the total if it is in the model's vocabulary
    for word in words:
        if word in index2word_set: 
            num_words = num_words + 1.
            feature_vector = np.add(feature_vector, model.wv[word])

    # Divide the result by the number of words to get the average
    if num_words > 0:
        feature_vector = np.divide(feature_vector, num_words)

    return feature_vector


def get_average_feature_vectors(reviews, model, num_features):
    """
    Calculates the average feature vector for each review.

    Parameters:
        reviews (list): A list of reviews, where each review is a list of words.
        model (Word2Vec): The trained Word2Vec model.
        num_features (int): The dimensionality of the feature vectors.
        
    Returns:
        np.ndarray: A 2D numpy array where each row corresponds to the average feature vector of a review.
    """
    
    # Initialize a counter
    counter = 0

    # Initialize an empty array to hold the feature vectors
    review_feature_vectors = np.zeros((len(reviews), num_features), dtype="float32")

    # Iterate through the reviews
    for review in reviews:
        # Print a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        # Make average feature vector for the review
        review_feature_vectors[counter] = make_feature_vector(review, model, num_features)
        counter = counter + 1

    return review_feature_vectors
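The key property of averaging is that reviews of any length map to a vector of the same size. The pure-Python sketch below demonstrates this with invented three-dimensional vectors; `average_vectors` is a hypothetical stand-in that mirrors the logic of `make_feature_vector` without requiring a trained model.

```python
def average_vectors(words, vectors, num_features):
    """Average the vectors of the in-vocabulary words; return zeros if none are known."""
    total = [0.0] * num_features
    num_words = 0
    for word in words:
        if word in vectors:  # skip out-of-vocabulary words
            num_words += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if num_words == 0:
        return total
    return [t / num_words for t in total]


# Invented 3-D vectors for illustration only
averaging_vectors = {"good": [1.0, 0.0, 2.0], "film": [3.0, 2.0, 0.0]}

short_review = average_vectors(["good", "film"], averaging_vectors, 3)
long_review = average_vectors(["good", "film", "good", "film", "zzz"], averaging_vectors, 3)

print(short_review)
print(long_review)
```

Both reviews produce a three-element vector even though one has two tokens and the other has five. Note that averaging discards word order.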

In [41]:
# Calculate the average feature vectors for the training reviews
print("Calculating average feature vectors for training reviews...")
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_tokens(review, remove_stopwords=True))

train_vectors = get_average_feature_vectors(clean_train_reviews, model, num_features)

# Calculate the average feature vectors for the test reviews
print("Calculating average feature vectors for test reviews...")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_tokens(review, remove_stopwords=True))
    
test_vectors = get_average_feature_vectors(clean_test_reviews, model, num_features)

Calculating average feature vectors for training reviews...
Review 0 of 25000
Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Calculating average feature vectors for test reviews...
Review 0 of 25000
Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 1

Use the average paragraph vectors to train a random forest.

In [42]:
# Fit a Random Forest classifier to the training data
print("Fitting the Random Forest classifier to the labeled training data...")
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_vectors, train["sentiment"])

# Test the model on the test data
print("Predicting sentiment for the test data...")
result = forest.predict(test_vectors)

# Copy results to a DataFrame for submission
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})

# Write the DataFrame to a CSV file for submission
output.to_csv("../data/word2vec_average_vectors.csv", index=False, quoting=3)

Fitting the Random Forest classifier to the labeled training data...
Predicting sentiment for the test data...


### Clustering

Another approach is to exploit the similarity of words within a cluster. Grouping vectors in this way is known as vector quantization. To accomplish this, find the centers of the word clusters by using a clustering algorithm such as K-means.

In K-means, the one parameter to set is `k`, the number of clusters.

In [44]:
from sklearn.cluster import KMeans
import time

start = time.time()

# Set k to be 1/5th of the vocabulary size
num_clusters = model.wv.vectors.shape[0] // 5

# Initialize a k-means clustering model and use it to extract the centroids
kmeans_clustering = KMeans(n_clusters=num_clusters)
idx = kmeans_clustering.fit_predict(model.wv.vectors)

# Get end time and print how long process took
end = time.time()
elapsed = end - start
print(f"Time taken to cluster the words into {num_clusters} clusters: {elapsed:.2f} seconds")

# Create a dictionary that maps each vocabulary word to its cluster index
word_centroid_map = dict(zip(model.wv.index_to_key, idx))

# Print the first 10 clusters
for cluster in range(10):
    # Print the cluster number
    print(f"\nCluster {cluster}:")

    # Get the words in the cluster
    words = [word for word, idx in word_centroid_map.items() if idx == cluster]

    # Print the words in the cluster
    print(" ".join(words))

Time taken to cluster the words into 3298 clusters: 38.93 seconds

Cluster 0:
marjorie

Cluster 1:
es heaps gigli lingo glitter

Cluster 2:
underwhelming incongruous recognisable ridicules patchy

Cluster 3:
head mouth skin arms teeth legs chair neck arm leg throat chest fingers bite skull blast penis limbs toe crotch eyeballs slit

Cluster 4:
repressed controlling fierce fearful persistent unspoken childlike liberated promiscuous deviant virtuous cautious carnal

Cluster 5:
cynicism craziness discomfort naivety

Cluster 6:
homes towns villages artifacts

Cluster 7:
ringo starr untouchables yoko zappa

Cluster 8:
commando cobra

Cluster 9:
clumsy stilted plotting limp snappy overwrought sparse clunky disconnected fisted haphazard leaden jumpy pointlessly uninvolving listless sketchy unintelligible audible stagy laboured
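For intuition about what K-means does, here is a minimal pure-Python sketch of the basic assign-and-update loop (Lloyd's algorithm) on four made-up points with fixed starting centroids. This is a simplification: scikit-learn's `KMeans` uses a smarter initialization and other refinements, and the function names here are hypothetical.

```python
def assign_clusters(points, centroids):
    """Return, for each point, the index of the nearest centroid (squared Euclidean distance)."""
    def squared_distance(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid where it was
            continue
        new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
    return new_centroids


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
toy_labels, toy_centroids = k_means(toy_points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(toy_labels)
print(toy_centroids)
```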


Each word now has a centroid assignment, so a function can be defined to convert reviews into bags-of-centroids. This works similarly to bag-of-words but uses semantically related clusters instead of individual words.

In [45]:
def create_bag_of_centroids(words, word_centroid_map) -> np.ndarray:
    """
    Converts a list of words into a bag of centroids.

    Parameters:
        words (list): A list of words to be converted into centroids.
        word_centroid_map (dict): A dictionary mapping words to their corresponding cluster indices.
    
    Returns:
        np.ndarray: A numpy array representing the bag of centroids for the words.
    """

    # Cluster indices start at 0, so the number of clusters is the highest index plus one
    num_centroids = max(word_centroid_map.values()) + 1

    # Initialize a bag of centroids with zeros
    bag_of_centroids = np.zeros(num_centroids, dtype="float32")

    # Iterate through each word in the input list
    for word in words:
        # Check if the word is in the word_centroid_map
        if word in word_centroid_map:
            # Increment the corresponding centroid count
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1

    return bag_of_centroids
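
As a small worked example of the same counting idea, the sketch below builds a bag of centroids for a toy review. The helper `bag_of_centroids_bincount`, the toy map, and the toy review are hypothetical names and data introduced here for illustration. It uses `np.bincount` in place of the explicit loop and takes the number of clusters as an argument; it is an alternative formulation, not the function defined above.

```python
import numpy as np

def bag_of_centroids_bincount(words, word_centroid_map, num_centroids):
    """Count how many words of a review fall into each cluster (illustrative helper)."""
    # Look up the cluster index of every known word; unknown words are skipped
    indices = [word_centroid_map[w] for w in words if w in word_centroid_map]
    # An explicit integer dtype keeps bincount happy even when no word is known
    indices = np.asarray(indices, dtype=np.intp)
    # minlength pads the result so every cluster gets a slot, even with zero hits
    return np.bincount(indices, minlength=num_centroids).astype("float32")

# Hypothetical word-to-cluster map with three clusters (0, 1, 2)
toy_map = {"good": 0, "great": 0, "bad": 1, "plot": 2}
toy_review = ["great", "plot", "good", "unknownword", "plot"]

toy_bag = bag_of_centroids_bincount(toy_review, toy_map, num_centroids=3)
print(toy_bag)  # [2. 0. 2.]
```

Cluster 0 is hit twice ("great", "good"), cluster 1 never, and cluster 2 twice ("plot" appears two times); the word missing from the map is ignored.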

Create bags of centroids for the training and test sets, then train a random forest and extract the results.

In [46]:
# Pre-allocate a zero-filled array to hold the bags of centroids for the training reviews
train_centroids = np.zeros((train["review"].size, num_clusters), dtype="float32")

# Transform training reviews into bags of centroids
print("Creating bags of centroids for training reviews...")
for i, review in enumerate(clean_train_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Processing review {i + 1} of {len(clean_train_reviews)}")
    
    # Create a bag of centroids for the review
    train_centroids[i] = create_bag_of_centroids(review, word_centroid_map)

# Pre-allocate a zero-filled array to hold the bags of centroids for the test reviews
test_centroids = np.zeros((test["review"].size, num_clusters), dtype="float32")

# Transform test reviews into bags of centroids
print("Creating bags of centroids for test reviews...")
for i, review in enumerate(clean_test_reviews):
    if (i + 1) % 1000 == 0:
        print(f"Processing review {i + 1} of {len(clean_test_reviews)}")
    
    # Create a bag of centroids for the review
    test_centroids[i] = create_bag_of_centroids(review, word_centroid_map)

Creating bags of centroids for training reviews...
Processing review 1000 of 25000
Processing review 2000 of 25000
Processing review 3000 of 25000
Processing review 4000 of 25000
Processing review 5000 of 25000
Processing review 6000 of 25000
Processing review 7000 of 25000
Processing review 8000 of 25000
Processing review 9000 of 25000
Processing review 10000 of 25000
Processing review 11000 of 25000
Processing review 12000 of 25000
Processing review 13000 of 25000
Processing review 14000 of 25000
Processing review 15000 of 25000
Processing review 16000 of 25000
Processing review 17000 of 25000
Processing review 18000 of 25000
Processing review 19000 of 25000
Processing review 20000 of 25000
Processing review 21000 of 25000
Processing review 22000 of 25000
Processing review 23000 of 25000
Processing review 24000 of 25000
Processing review 25000 of 25000
Creating bags of centroids for test reviews...
Processing review 1000 of 25000
Processing review 2000 of 25000
Processing review 3000

In [48]:
# Fit a Random Forest classifier to the training data using the bags of centroids
forest = RandomForestClassifier(n_estimators=100)
print("Fitting the Random Forest classifier to the training data using bags of centroids...")
forest = forest.fit(train_centroids, train["sentiment"])

# Predict sentiment for the test data using the trained Random Forest classifier
print("Predicting sentiment for the test data using bags of centroids...")
result = forest.predict(test_centroids)

# Copy results to a DataFrame for submission
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})

# Write the DataFrame to a CSV file for submission
output.to_csv("../data/bag_of_centroids_submission.csv", index=False, quoting=3)

Fitting the Random Forest classifier to the training data using bags of centroids...
Predicting sentiment for the test data using bags of centroids...
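
The test predictions above go straight into a submission file, so the notebook does not report an accuracy figure for them. One possible way to get a local estimate is to hold out part of the labeled training data. The sketch below shows the pattern on synthetic data: `X` and `y` are random stand-ins for `train_centroids` and `train["sentiment"]`, and the 80/20 split and fixed seeds are assumptions chosen for illustration, so the printed number says nothing about the real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 "reviews" with 20 cluster counts each
X = rng.poisson(1.0, size=(200, 20)).astype("float32")
# Synthetic label that depends on the first two cluster counts
y = (X[:, 0] > X[:, 1]).astype(int)

# Hold out 20% of the labeled data for validation, keeping class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same classifier settings as the cell above, plus a fixed seed
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_fit, y_fit)

# Score the held-out portion, which the model has not seen during fitting
val_pred = clf.predict(X_val)
val_accuracy = accuracy_score(y_val, val_pred)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

With the real data, the same split would be applied to `train_centroids` and `train["sentiment"]` before fitting.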
