# Part 0: Data Preparation

In [2]:
!pip install datasets
!pip install nltk



In [3]:
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

# Part 1: Preparing Word Embeddings

In [5]:
import nltk
import tensorflow as tf
from tensorflow.keras import layers

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zhixu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Question 1. Word Embedding

### (a) What is the size of the vocabulary formed from your training data?

In [8]:
# Tokenize each training sentence once (lower-cased), then flatten into one token list
tokenized_sentences = [nltk.tokenize.word_tokenize(sentence['text'].lower()) for sentence in train_dataset]
tokens = [token for sentence in tokenized_sentences for token in sentence]

print('Number of tokens: ' + str(len(tokens)))

Number of tokens: 183968


In [9]:
# List of all distinct words found in the tokens list
vocab = list(set(tokens))
vocab_size = len(vocab)
print ('Number of token types (Vocabulary Size): '+ str(vocab_size))

Number of token types (Vocabulary Size): 18029


The vocabulary formed from the training data contains 18029 distinct token types.

### (b) We use OOV (out-of-vocabulary) to refer to words that appear in the training data but not in the Word2vec (or GloVe) dictionary. How many OOV words exist in your training data?

In [12]:
from gensim.models import Word2Vec

# Set parameters for the Word2Vec model
embedding_dim = 100  # Adjust this dimension as needed
window_size = 5  # Context window size
min_count = 2  # Ignores words with total frequency lower than this

# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_sentences, vector_size=embedding_dim, window=window_size, min_count=min_count, workers=4)

# Vocabulary size after training (vocabulary size of word2vec)
vocab_size = len(word2vec_model.wv)
print("Vocabulary size:", vocab_size)

Vocabulary size: 8841


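Number of OOV words = 18029 - 8841 = 9188. Because the Word2Vec model is trained on the same tokenized sentences with `min_count = 2`, these are exactly the token types that occur only once in the training data.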
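The subtraction above can be cross-checked with a set difference. The sketch below uses a small made-up corpus and a plain frequency filter as a stand-in for Word2Vec's `min_count`; the names `toy_sentences`, `kept`, and `oov_words` are illustrative and do not appear elsewhere in this notebook.

```python
from collections import Counter

# Toy corpus standing in for tokenized_sentences (hypothetical data)
toy_sentences = [
    ["a", "good", "movie"],
    ["a", "bad", "movie"],
    ["a", "good", "plot"],
]
min_count = 2  # same threshold as the Word2Vec model above

# Count how often each token type occurs across the corpus
freq = Counter(token for sentence in toy_sentences for token in sentence)

full_vocab = set(freq)                                  # every distinct token
kept = {w for w, c in freq.items() if c >= min_count}   # what min_count keeps
oov_words = full_vocab - kept                           # dropped, hence OOV

print(sorted(oov_words))
print(len(full_vocab) - len(kept) == len(oov_words))
```

In the notebook itself, the equivalent check would be `set(tokens) - set(word2vec_model.wv.key_to_index)`, whose size should match the count obtained by subtraction.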

In [14]:
import numpy as np

# Initialize the embedding matrix
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Map words in the Word2Vec vocabulary to the embedding matrix
word_to_idx = {word: idx for idx, word in enumerate(word2vec_model.wv.index_to_key)}
for word, idx in word_to_idx.items():
    embedding_matrix[idx] = word2vec_model.wv[word]

print("Embedding matrix shape:", embedding_matrix.shape)

Embedding matrix shape: (8841, 100)


### (c) The existence of OOV words is one of the well-known limitations of Word2vec (or GloVe). Without using any transformer-based language models (e.g., BERT, GPT, T5), what do you think is the best strategy to mitigate this limitation? Implement your solution in your source code. Show the corresponding code snippet.


#### Method 1: Cosine Similarity

Find replacement words for OOV based on cosine similarity. The steps are:
1. Estimating a vector for the OOV word (for example, by averaging the embeddings of its sub-units that are in the vocabulary, or by using a neighboring word’s embedding).
2. Selecting replacement candidates based on similarity in the vector space.
3. Calculating cosine similarity between the OOV word and replacement candidates.
4. Choosing a candidate with a high cosine similarity score, ideally close to 1, as the replacement.
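Step 3 relies on cosine similarity, which compares the direction of two vectors and ignores their length. A minimal pure-Python sketch follows; the helper name `cosine_similarity` and the toy vectors are illustrative only, and gensim performs the equivalent computation internally.

```python
import math

def cosine_similarity(u, v):
    """Return cos(theta) between two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector has no direction
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the score ignores vector length, summing or averaging the sub-unit embeddings in step 1 produces the same ranking of candidates.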

In [18]:
# Define a function to find replacement words based on cosine similarity
def find_replacement_word(oov_word, model, topn=5):
    if oov_word in model.wv:
        return oov_word  # No replacement needed if word exists in vocab

    # Iterating over a string yields single characters, so the "subwords" here are
    # the characters of the OOV word that happen to be tokens in the vocabulary
    char_vectors = [model.wv[char] for char in oov_word if char in model.wv]
    if not char_vectors:
        return None  # Nothing to average, so no vector can be estimated

    # Estimate the OOV vector as the mean of those character embeddings
    oov_vector = np.mean(char_vectors, axis=0)

    # similar_by_vector ranks vocabulary words by cosine similarity to the vector
    similar_words = model.wv.similar_by_vector(oov_vector, topn=topn)

    # Return the word with highest similarity
    return similar_words[0][0] if similar_words else None

In [19]:
# Example usage
oov_word = "algorithm"
replacement = find_replacement_word(oov_word, word2vec_model)
print("Replacement word for OOV:", replacement)

Replacement word for OOV: any


Advantages: Quick way to handle OOV words by finding a close semantic replacement within the existing vocabulary.   
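Drawback: The replacement can be unrelated or even opposite in meaning. In the example above, "algorithm" is replaced by "any", because an average of character-level embeddings carries little semantic information and cosine similarity alone does not capture nuanced semantic relationships.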

#### Method 2: Subword Embeddings

This approach involves breaking down words into smaller units, such as character n-grams, so that even if the entire word isn’t in the vocabulary, its smaller components might be. This can help represent OOV words by capturing their subword structure, which often contains useful semantic information.   
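As a rough illustration of what "character n-grams" means, the sketch below splits a word into n-grams after wrapping it in `<` and `>` boundary markers, which is the convention FastText uses. The function `char_ngrams` is a hypothetical helper written for this example, not a gensim API; its defaults of 3 and 6 mirror gensim's `min_n` and `max_n` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word` with lengths min_n..max_n."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Two words that share many of these pieces, such as a stem with different suffixes, end up sharing part of their representation.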

With `gensim`, you can implement a similar concept using the **FastText** model, which is capable of learning embeddings for subword units. This model will generate embeddings for any word, even if it wasn’t in the original training data.

**Step 1: Train a FastText Model on Your Dataset**   
FastText extends Word2vec with character n-grams, which allows it to produce embeddings for OOV words by combining the embeddings of their n-grams.

In [24]:
from gensim.models import FastText

# Set parameters for the FastText model
embedding_dim = 100  # Adjust this dimension as needed
window_size = 5      # Context window size
min_count = 2        # Ignores words with total frequency lower than this

# Train the FastText model
fasttext_model = FastText(sentences=tokenized_sentences, vector_size=embedding_dim, window=window_size, min_count=min_count, workers=4)

# Vocabulary size after training
vocab_size = len(fasttext_model.wv)
print("Vocabulary size with FastText:", vocab_size)

Vocabulary size with FastText: 8841


**Step 2: Building the Embedding Matrix for Known Words**   
For words in the training data vocabulary, use the learned embeddings. If you encounter an OOV word, FastText will automatically create an embedding based on the subword n-grams.

In [26]:
# Initialize the embedding matrix for known vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Map words in the FastText vocabulary to the embedding matrix
word_to_idx = {word: idx for idx, word in enumerate(fasttext_model.wv.index_to_key)}
for word, idx in word_to_idx.items():
    embedding_matrix[idx] = fasttext_model.wv[word]

print("Embedding matrix shape with FastText:", embedding_matrix.shape)

Embedding matrix shape with FastText: (8841, 100)


**Step 3: Handle OOV Words in Validation and Test Data**   
Using FastText, OOV words should already be covered by the subword embeddings. If a word does not exist in the vocabulary during validation or testing, FastText will still generate an embedding based on the character n-grams it contains, so you can simply use the model as-is.
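To make the idea of building an embedding from character n-grams concrete, the sketch below averages the vectors of an unseen word's n-grams. The `toy_ngram_vectors` table and both helper functions are made up for illustration. Real FastText hashes n-grams into a fixed number of buckets and learns those vectors during training, so this is only a simplified picture.

```python
def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > as boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

def oov_vector(word, ngram_vectors, n=3):
    """Average the vectors of the word's n-grams that exist in the table."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n) if g in ngram_vectors]
    if not known:
        return None  # no n-gram is known, so no vector can be built
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 2-dimensional n-gram vectors (not learned from any data)
toy_ngram_vectors = {
    "<fi": [1.0, 0.0],
    "lm>": [3.0, 2.0],
}

print(oov_vector("film", toy_ngram_vectors))  # [2.0, 1.0]
```

With the trained model, none of this needs to be written by hand: indexing `fasttext_model.wv` with a word performs the lookup.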

In [28]:
# Example of getting embeddings for words, including OOV
word = "algorithm"  # Replace with any OOV word
# Membership in the trained vocabulary can be checked through key_to_index
is_oov = word not in fasttext_model.wv.key_to_index
# The lookup is identical either way: for an OOV word FastText builds the vector from its character n-grams
embedding = fasttext_model.wv[word]
print(f"Embedding for '{word}':", embedding)

Embedding for 'algorithm': [-0.0744204   0.08616591 -0.12863164 -0.08968816  0.21379943  0.13077094
  0.01214228  0.20079766  0.10782887 -0.21856885  0.11229397  0.00104024
 -0.06299437  0.25077713 -0.05319802 -0.08080995  0.03621447 -0.11046183
 -0.22584887 -0.21995072 -0.30807883  0.04930369 -0.17191687 -0.11598082
 -0.14564371 -0.21329267 -0.14739008 -0.08986089  0.00861504 -0.00105365
 -0.09338482  0.01998294  0.23218516 -0.06302945  0.02602169  0.10912537
  0.02437454  0.10636683 -0.11935042 -0.19517405  0.16995311 -0.06841488
  0.04334502 -0.15479755 -0.14447983 -0.15353383 -0.08981054  0.00958545
  0.16006233  0.06367456  0.12022658 -0.09258064 -0.129359   -0.13445319
 -0.06514408 -0.05390274 -0.11533449 -0.01530871 -0.09887797 -0.02505171
 -0.16128217 -0.10399985  0.02462547  0.23249193 -0.0537083   0.2611112
 -0.01397066 -0.00611461  0.03118794  0.09490933 -0.15058059  0.14254603
  0.14536972 -0.24163775  0.10825963 -0.06576266  0.1510735  -0.00261503
 -0.01918464  0.06899466 

- By using FastText, you avoid having to manually handle OOV words because the model can generate embeddings for any word using character n-grams. This enables you to retain useful information from OOV words, as their morphology often reflects their meaning.
- This approach is effective without needing transformer-based models, as it leverages the internal structure of words.