# 0. Dataset preparation

## 0.1 Import dataset from part 0

In [1]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset["train"]

In [2]:
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

## 0.2 Process text data

In [3]:
from utils.text import tokenize

train_dataset = tokenize(train_dataset)

[nltk_data] Downloading package punkt to /home/yuri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /home/yuri/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/yuri/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# 1. Preparing Word Embeddings

- using `Word2Vec`

Load a pretrained word2vec model, trained on the Google News dataset of about 100 billion words.

In [4]:
import gensim.downloader
import os

w2v_model_path = "models/word2vec-google-news-300"

# Download the pretrained embedding model if it has not been saved locally yet.
if not os.path.exists(w2v_model_path):
    # Takes around 7 minutes.
    w2v_model = gensim.downloader.load("word2vec-google-news-300")
    w2v_model.save(w2v_model_path)

    # Alternatively, download the pretrained model (around 1.5 GB) from
    # https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
    # and load it with:
    # w2v_model = gensim.models.KeyedVectors.load_word2vec_format(r'model\GoogleNews-vectors-negative300.bin\GoogleNews-vectors-negative300.bin', binary=True)

w2v_model = gensim.models.KeyedVectors.load(w2v_model_path)

In [5]:
len(w2v_model.index_to_key)

3000000

## (a) Size of the vocabulary in the training data

In [17]:
vocab_train = {word for sentence in train_dataset["tokens"] for word in sentence}

len(vocab_train)

16331

## (b) Number of OOV (out-of-vocabulary) words
- (words that appear in the training data but not in the Word2Vec dictionary)

In [19]:
oov_words = [word for word in vocab_train if word not in w2v_model]

print(len(oov_words))
print(oov_words[:20])

1445
['kosminsky', 'desplechin', 'reeses', 'alcatraz', 'weissman', 'esteticamente', 'premissa', 'bugsy', 'manhunter', 'khouri', 'mctiernan', 'francamente', 'kubrick', 'glamour', 'pianista', 'shamu', 'vittorio', 'weigel', 'clements', 'mulan']
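
The counts above give a type-level OOV rate of 1445 / 16331, or roughly 8.8% of distinct words. Because rare words dominate the OOV list, the share of running tokens that are OOV is usually lower. The sketch below shows how both rates could be computed; the sentences and the `known_words` set are hypothetical toy stand-ins for `train_dataset["tokens"]` and the Word2Vec vocabulary.

```python
from collections import Counter

# Hypothetical toy data standing in for train_dataset["tokens"] and the w2v vocabulary.
sentences = [
    ["a", "kubrick", "film"],
    ["a", "great", "film"],
    ["kubrick", "directs"],
]
known_words = {"a", "film", "great", "directs"}

# Count how often each distinct word occurs in the corpus.
token_counts = Counter(tok for sent in sentences for tok in sent)
oov_types = sorted(w for w in token_counts if w not in known_words)

# Type-level rate: share of distinct vocabulary entries that are OOV.
type_rate = len(oov_types) / len(token_counts)
# Token-level rate: share of running tokens that are OOV.
token_rate = sum(token_counts[w] for w in oov_types) / sum(token_counts.values())

print(oov_types, type_rate, token_rate)
```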


## (c) Handling OOV words

To handle OOV words, we have considered various options:
- **Global Average:** mean of all embeddings in the w2v model
- **Contextual Average:** mean of embeddings of the surrounding words in the training data
- **Similar Words:** generate an embedding by finding the closest match in w2v vocab

We decided that the **contextual average** is the best approach. The meaning of a word usually depends on its context, so averaging the embeddings of the surrounding words should approximate an OOV word's meaning better than the alternatives, which makes this approach more versatile.

The similar words approach, on the other hand, might not work well for completely novel words such as proper nouns. Since the OOV words in our training data include many names (for example 'kosminsky' and 'kubrick'), this approach is less suitable for our use case.

The global average is the weakest option, as it assigns the same embedding to every OOV word, so the model cannot distinguish between them.

**Selected approach: Contextual Average**

> We have implemented the contextual average approach; it can be found in [utils/text.py](utils/text.py).

In [31]:
# Example usage
from utils.text import get_context_average_embedding

sentence_tokens = ["this", "is", "an", "example", "with", "oovword"]
oov_token = "oovword"
embedding = get_context_average_embedding(sentence_tokens, oov_token, w2v_model)

print("OOV embedding size:", embedding.size)

OOV embedding size: 300
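
For readers without access to `utils/text.py`, here is a minimal sketch of what a contextual average could look like. It is not the project's implementation: the function name `context_average`, the `window` and `dim` parameters, the zero-vector fallback, and the plain dict used in place of the Word2Vec model are all assumptions made for illustration.

```python
import numpy as np

def context_average(tokens, target, vectors, dim, window=2):
    """Average the vectors of known words within `window` positions of `target`.

    `vectors` is any mapping that supports `in` and `[]` (a plain dict here,
    as a hypothetical stand-in for the Word2Vec model).
    """
    neighbours = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the target position itself and any neighbour that is also OOV.
            if j != i and tokens[j] in vectors:
                neighbours.append(vectors[tokens[j]])
    if not neighbours:
        # Assumed fallback: no usable context, so return a zero vector.
        return np.zeros(dim)
    return np.mean(neighbours, axis=0)

# Toy 2-dimensional vectors (hypothetical values).
vectors = {
    "this": np.array([1.0, 0.0]),
    "is": np.array([0.0, 1.0]),
    "example": np.array([1.0, 1.0]),
    "with": np.array([3.0, 1.0]),
}
tokens = ["this", "is", "an", "example", "with", "oovword"]
vec = context_average(tokens, "oovword", vectors, dim=2)
print(vec)
```

With `window=2`, only "example" and "with" fall inside the context of "oovword", so the result is the mean of those two vectors.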


## Prepare the word embedding matrix

In [35]:
import numpy as np

embedding_dim = w2v_model.vector_size
vocab_size = len(vocab_train)

# Initialize embedding matrix with zeros.
embedding_matrix = np.zeros((vocab_size, embedding_dim))

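# Word-to-index dictionary for easy lookup. Iteration order of a set of strings
# can differ between Python runs, so this mapping must be saved with the matrix.
index_from_word = {word: i for i, word in enumerate(vocab_train)}

# Populate the embedding matrix. Rows for OOV words stay as zero vectors here.
for word, i in index_from_word.items():
    if word in w2v_model.key_to_index:
        embedding_matrix[i] = w2v_model[word]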
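
The loop above leaves OOV rows at zero. Since the matrix holds one row per word while a contextual average is computed per occurrence, one possible way to combine the two is to pool the context vectors from every sentence in which an OOV word appears. The sketch below illustrates that idea on toy data; the variable names, the whole-sentence context, and the pooling rule are assumptions, not the notebook's actual method.

```python
import numpy as np

# Hypothetical toy stand-ins for w2v_model, train_dataset["tokens"], and index_from_word.
vectors = {
    "good": np.array([1.0, 0.0]),
    "film": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}
sentences = [["good", "mulan", "film"], ["mulan", "bad"], ["good", "film"]]
index_from_word = {"bad": 0, "film": 1, "good": 2, "mulan": 3}
dim = 2

# Same first pass as above: copy pretrained vectors for in-vocabulary words.
embedding_matrix = np.zeros((len(index_from_word), dim))
for word, i in index_from_word.items():
    if word in vectors:
        embedding_matrix[i] = vectors[word]

# Second pass: for each OOV word, pool the vectors of the known words
# that share a sentence with it, across the whole corpus.
sums, counts = {}, {}
for sent in sentences:
    known = [vectors[t] for t in sent if t in vectors]
    if not known:
        continue
    for tok in set(sent):
        if tok not in vectors and tok in index_from_word:
            sums[tok] = sums.get(tok, np.zeros(dim)) + np.sum(known, axis=0)
            counts[tok] = counts.get(tok, 0) + len(known)

# Replace the zero rows with the pooled context average.
for tok, total in sums.items():
    embedding_matrix[index_from_word[tok]] = total / counts[tok]

print(embedding_matrix[index_from_word["mulan"]])
```

Here "mulan" co-occurs with "good", "film", and "bad", so its row becomes the mean of those three vectors.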

**Export the embedding matrix and the word-to-index mapping**

In [18]:
from utils.file import save_to_local_file
from pathlib import Path

embedding_path = Path("models/embedding_matrix.pckl")
vocab_to_index_path = Path("models/embedding_matrix_train_dataset_vocab_to_index.pckl")

save_to_local_file(embedding_path, embedding_matrix)
# Save the word-to-index mapping built alongside the embedding matrix.
save_to_local_file(vocab_to_index_path, index_from_word)

Saving object to local...
Object saved to local!
Saving object to local...
Object saved to local!
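
The helper `save_to_local_file` is not shown in this notebook. Judging only by the `.pckl` extension, it may be a thin wrapper around `pickle`; that is an assumption. Under that assumption, a saved object could be written and read back roughly as follows (the temporary directory and file name are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy mapping standing in for the real word-to-index dictionary.
index_from_word = {"film": 0, "good": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab_to_index.pckl"
    # Write the object in binary mode, then read it back.
    with path.open("wb") as f:
        pickle.dump(index_from_word, f)
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == index_from_word)
```

Loading the matrix and the mapping together keeps row indices aligned with the words they were built for.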
