# Part 2

This notebook answers all the questions in Part 2.

In [1]:
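import os

import torch  # used explicitly below for torch.optim.Adam
from datasets import load_dataset
from dotenv import load_dotenv

# The wildcard imports supply load_word2vec, prepare_data, create_dataloader,
# RNNModel, train and validate.
from utils.rnn_model import *
from utils.rnn_utils import *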

load_dotenv()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\qkm20\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\qkm20\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\qkm20\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Dataset Preparation 

First, we define the relevant hyperparameters and load the dataset.

We then load the word embeddings and process the data accordingly.

In [2]:
BATCH_SIZE = 32

In [3]:
dataset = load_dataset("rotten_tomatoes")
trn_dataset = dataset["train"]
val_dataset = dataset["validation"]
tst_dataset = dataset["test"]

### Word Embedding

We load the pre-trained 300-dimensional Google News Word2Vec model and build a word-to-index mapping for use in data processing later.

In [4]:
word2vec_model = load_word2vec()

In [5]:
word_index = {
    word: i for i, word in enumerate(
        word2vec_model.index_to_key
    )
}
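
To illustrate how such a word index is used, the sketch below maps tokens to integer ids with a toy vocabulary standing in for `word2vec_model.index_to_key`. The helper name `tokens_to_ids` and the choice of an extra id for out-of-vocabulary tokens are assumptions for illustration; the notebook's actual handling lives inside `prepare_data`, which is not shown here.

```python
# Toy vocabulary standing in for word2vec_model.index_to_key (assumption).
index_to_key = ["the", "movie", "was", "great"]

# Same construction as in the notebook: word -> row index in the embedding matrix.
word_index = {word: i for i, word in enumerate(index_to_key)}

# Hypothetical id reserved for words that are not in the vocabulary.
unk_id = len(index_to_key)


def tokens_to_ids(tokens, word_index, unk_id):
    """Map each token to its id, falling back to unk_id for unknown words."""
    return [word_index.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(["the", "movie", "was", "superb"], word_index, unk_id)
print(ids)  # [0, 1, 2, 4]
```

Because the ids are row positions, they line up with the rows of `word2vec_model.vectors` that is later passed to the model as `embedding_matrix`.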

### Dataloaders

With the dataset and the word index loaded, we can build the dataloaders for batch training. We first tokenize the sentences and pad them so that they all have the same length.

In [6]:
trn_sentences, trn_labels = prepare_data(
    trn_dataset["text"],
    trn_dataset["label"],
    word_index=word_index
)
val_sentences, val_labels = prepare_data(
    val_dataset["text"],
    val_dataset["label"],
    word_index=word_index
)
tst_sentences, tst_labels = prepare_data(
    tst_dataset["text"],
    tst_dataset["label"],
    word_index=word_index
)
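
The tokenize-and-pad step can be sketched as follows. This is a minimal illustration and not the implementation of `prepare_data`: whitespace tokenization stands in for whatever tokenizer the helper uses, and `pad_id` and `unk_id` are hypothetical ids chosen for the example.

```python
def tokenize(text):
    # Whitespace tokenization as a stand-in (assumption); the real helper may differ.
    return text.lower().split()


def pad_sequences(sequences, pad_id):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


word_index = {"a": 0, "good": 1, "film": 2, "bad": 3}
pad_id = len(word_index)       # hypothetical padding id
unk_id = len(word_index) + 1   # hypothetical unknown-word id

texts = ["A good film", "Bad"]
encoded = [[word_index.get(tok, unk_id) for tok in tokenize(t)] for t in texts]
padded = pad_sequences(encoded, pad_id)
print(padded)  # [[0, 1, 2], [3, 4, 4]]
```

Padding to a common length is what allows a batch of sentences to be stacked into a single rectangular tensor.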

Once the data is processed, we wrap it in dataloaders for batch training.

In [7]:
trn_dataloader = create_dataloader(
    trn_sentences,
    trn_labels,
    BATCH_SIZE,
    shuffle=True)
val_dataloader = create_dataloader(
    val_sentences,
    val_labels,
    BATCH_SIZE,
    shuffle=False)
tst_dataloader = create_dataloader(
    tst_sentences,
    tst_labels,
    BATCH_SIZE,
    shuffle=False)
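
For readers without access to `utils`, a dataloader helper along these lines would behave similarly. This is a sketch under assumptions: `make_dataloader` is a hypothetical stand-in for `create_dataloader`, and the label dtype is a guess that depends on the loss function used in `train`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_dataloader(sentences, labels, batch_size, shuffle):
    """Hypothetical stand-in for create_dataloader: wraps padded ids and labels."""
    dataset = TensorDataset(
        torch.tensor(sentences, dtype=torch.long),   # token ids for the embedding layer
        torch.tensor(labels, dtype=torch.float),     # dtype is an assumption
    )
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# Four already-padded sentences of length 3 with binary labels (toy data).
sentences = [[0, 1, 2], [3, 4, 4], [2, 1, 4], [0, 3, 4]]
labels = [1, 0, 1, 0]
loader = make_dataloader(sentences, labels, batch_size=2, shuffle=False)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(batch_shapes)  # [((2, 3), (2,)), ((2, 3), (2,))]
```

Shuffling only the training loader, as the notebook does, keeps the validation and test order fixed between evaluations.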

## Model

Once again, we define the relevant hyperparameters.

In [8]:
LR = 0.0001
MODEL_TYPE = "rnn"

### Default RNN

We initialise the default RNN, which applies no extra processing to derive the final sentence representation.

In [9]:
model = RNNModel(
    embedding_dim=300,
    hidden_size=256,
    embedding_matrix=word2vec_model.vectors,
    rnn_type=MODEL_TYPE,
    bidirectional=False,
    num_layers=1,
)

Now that all the data is loaded and processed into Dataloaders, we can start training!

In [10]:
train(
    model=model,
    trn_dataloader=trn_dataloader,
    val_dataloader=val_dataloader,
    version="1",
    model_type=MODEL_TYPE,
    model_save_path=os.getenv("MODEL_SAVE_PATH", "modelfiles/"),
    optimizer=torch.optim.Adam(model.parameters(), lr=LR),
    epochs=100,
    early_stopping_patience=10,
    train_mode=None
)

Epoch   1/100, Loss: 0.6527, Accuracy: 0.7214
Model saved.
Epoch   2/100, Loss: 0.5389, Accuracy: 0.7608
Model saved.
Epoch   3/100, Loss: 0.5156, Accuracy: 0.7486
Epoch   4/100, Loss: 0.5026, Accuracy: 0.7561
Epoch   5/100, Loss: 0.4955, Accuracy: 0.7477
Epoch   6/100, Loss: 0.4916, Accuracy: 0.7467
Epoch   7/100, Loss: 0.4894, Accuracy: 0.7439
Epoch   8/100, Loss: 0.4778, Accuracy: 0.7636
Model saved.
Epoch   9/100, Loss: 0.4680, Accuracy: 0.7486
Epoch  10/100, Loss: 0.4575, Accuracy: 0.7552
Epoch  11/100, Loss: 0.4453, Accuracy: 0.7477
Epoch  12/100, Loss: 0.4325, Accuracy: 0.7364
Epoch  13/100, Loss: 0.4207, Accuracy: 0.7280
Epoch  14/100, Loss: 0.4049, Accuracy: 0.7205
Epoch  15/100, Loss: 0.3997, Accuracy: 0.7129
Epoch  16/100, Loss: 0.3844, Accuracy: 0.7261
Epoch  17/100, Loss: 0.3727, Accuracy: 0.7186
Epoch  18/100, Loss: 0.3666, Accuracy: 0.7054
Early stopping triggered after 18 epochs.
Training ended, loading best model...
Model loaded.
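
The stopping behaviour in the log above is consistent with a patience counter that resets whenever validation accuracy strictly improves. The sketch below replays the reported accuracies through such a counter; `run_early_stopping` is a hypothetical helper, and the exact rule inside `train` is an assumption inferred from the output.

```python
def run_early_stopping(val_accuracies, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch accuracies."""
    best_acc = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # New best: this is where a checkpoint would be saved.
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch


# Validation accuracies reported for the default RNN run above.
accs = [0.7214, 0.7608, 0.7486, 0.7561, 0.7477, 0.7467, 0.7439, 0.7636, 0.7486,
        0.7552, 0.7477, 0.7364, 0.7280, 0.7205, 0.7129, 0.7261, 0.7186, 0.7054]
stop_epoch, best_epoch = run_early_stopping(accs, patience=10)
print(stop_epoch, best_epoch)  # 18 8
```

The best epoch is 8 (accuracy 0.7636), and ten further epochs without improvement end training at epoch 18, which matches the log.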


We run validation again to confirm that the best checkpoint was loaded; the accuracy should match the best epoch above (0.7636 at epoch 8).

In [11]:
val_accuracy = validate(model, val_dataloader)

Accuracy: 0.7636


Finally, we evaluate the model on the test set to obtain its test accuracy.

In [12]:
tst_accuracy = validate(model, tst_dataloader)

Accuracy: 0.7364


### Last State RNN

This RNN uses the hidden vector from the last time step as the sentence representation. This approach assumes that the last hidden state captures the overall meaning of the sentence.

In [13]:
model = RNNModel(
    embedding_dim=300,
    hidden_size=256,
    embedding_matrix=word2vec_model.vectors,
    rnn_type=MODEL_TYPE,
    bidirectional=False,
    num_layers=1,
)
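
As a rough illustration of last-state selection, the sketch below runs a plain `torch.nn.RNN` on random inputs and picks the final hidden vector. It does not reproduce `RNNModel`; the tensor shapes and the `lengths` tensor are assumptions. With right-padded batches the final position can be a padding token, so the second half shows one way to pick the last non-padded step instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=300, hidden_size=256, batch_first=True)

embedded = torch.randn(4, 12, 300)       # (batch, seq_len, embedding_dim)
with torch.no_grad():
    outputs, h_n = rnn(embedded)         # outputs: (4, 12, 256), h_n: (1, 4, 256)

# Hidden vector at the final time step of every sequence.
last_state = outputs[:, -1, :]           # (4, 256)

# For a single-layer, unidirectional RNN this equals the returned final state.
same = torch.allclose(last_state, h_n[-1])

# Variant for padded batches: gather the step at index length - 1 per sequence.
lengths = torch.tensor([12, 9, 5, 3])    # true lengths, assumed known
idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, outputs.size(2))
last_valid = outputs.gather(1, idx).squeeze(1)   # (4, 256)
```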

Now that all the data is loaded and processed into Dataloaders, we can start training!

In [14]:
train(
    model=model,
    trn_dataloader=trn_dataloader,
    val_dataloader=val_dataloader,
    version="1",
    model_type=MODEL_TYPE,
    model_save_path=os.getenv("MODEL_SAVE_PATH", "modelfiles/"),
    optimizer=torch.optim.Adam(model.parameters(), lr=LR),
    epochs=100,
    early_stopping_patience=10,
    train_mode="last_state"
)

Epoch   1/100, Loss: 0.6648, Accuracy: 0.7392
Model saved.
Epoch   2/100, Loss: 0.5354, Accuracy: 0.7486
Model saved.
Epoch   3/100, Loss: 0.5125, Accuracy: 0.7533
Model saved.
Epoch   4/100, Loss: 0.5042, Accuracy: 0.7495
Epoch   5/100, Loss: 0.4986, Accuracy: 0.7552
Model saved.
Epoch   6/100, Loss: 0.4895, Accuracy: 0.7514
Epoch   7/100, Loss: 0.4850, Accuracy: 0.7505
Epoch   8/100, Loss: 0.4801, Accuracy: 0.7542
Epoch   9/100, Loss: 0.4712, Accuracy: 0.7505
Epoch  10/100, Loss: 0.4624, Accuracy: 0.7439
Epoch  11/100, Loss: 0.4498, Accuracy: 0.7439
Epoch  12/100, Loss: 0.4383, Accuracy: 0.7383
Epoch  13/100, Loss: 0.4258, Accuracy: 0.7317
Epoch  14/100, Loss: 0.4160, Accuracy: 0.7270
Epoch  15/100, Loss: 0.4035, Accuracy: 0.7336
Early stopping triggered after 15 epochs.
Training ended, loading best model...
Model loaded.


We run validation again to confirm that the best checkpoint was loaded; the accuracy should match the best epoch above (0.7552 at epoch 5).

In [15]:
val_accuracy = validate(model, val_dataloader)

Accuracy: 0.7552


Finally, we evaluate the model on the test set to obtain its test accuracy.

In [16]:
tst_accuracy = validate(model, tst_dataloader)

Accuracy: 0.7373


### Mean Pooling RNN

This RNN uses the average of all hidden vectors as the sentence representation. This captures information across the whole sentence by averaging the contributions of all time steps.

In [17]:
model = RNNModel(
    embedding_dim=300,
    hidden_size=256,
    embedding_matrix=word2vec_model.vectors,
    rnn_type=MODEL_TYPE,
    bidirectional=False,
    num_layers=1,
)
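
Mean pooling over hidden states can be sketched as below. The example is illustrative only: the random `outputs` tensor stands in for the RNN output, and masking out padding positions via a `lengths` tensor is an assumption about how one might implement it, not a description of what `train_mode="mean_pool"` does internally.

```python
import torch

torch.manual_seed(0)
outputs = torch.randn(3, 5, 256)        # (batch, seq_len, hidden_size)
lengths = torch.tensor([5, 3, 1])       # true (unpadded) lengths, assumed known

# mask[b, t] is 1.0 for real tokens and 0.0 for padding positions.
mask = (torch.arange(outputs.size(1))[None, :] < lengths[:, None]).float()

# Sum only the real time steps, then divide by the number of real tokens.
summed = (outputs * mask.unsqueeze(-1)).sum(dim=1)
mean_pooled = summed / lengths.unsqueeze(-1).float()    # (3, 256)
```

Without the mask, padding positions would be averaged in and would dilute the representation of short sentences.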

Now that all the data is loaded and processed into Dataloaders, we can start training!

In [18]:
train(
    model=model,
    trn_dataloader=trn_dataloader,
    val_dataloader=val_dataloader,
    version="1",
    model_type=MODEL_TYPE,
    model_save_path=os.getenv("MODEL_SAVE_PATH", "modelfiles/"),
    optimizer=torch.optim.Adam(model.parameters(), lr=LR),
    epochs=100,
    early_stopping_patience=10,
    train_mode="mean_pool"
)

Epoch   1/100, Loss: 0.6644, Accuracy: 0.7064
Model saved.
Epoch   2/100, Loss: 0.5367, Accuracy: 0.7111
Model saved.
Epoch   3/100, Loss: 0.5131, Accuracy: 0.7505
Model saved.
Epoch   4/100, Loss: 0.5049, Accuracy: 0.7542
Model saved.
Epoch   5/100, Loss: 0.4977, Accuracy: 0.7411
Epoch   6/100, Loss: 0.4920, Accuracy: 0.7533
Epoch   7/100, Loss: 0.4860, Accuracy: 0.7533
Epoch   8/100, Loss: 0.4812, Accuracy: 0.7552
Model saved.
Epoch   9/100, Loss: 0.4692, Accuracy: 0.7533
Epoch  10/100, Loss: 0.4637, Accuracy: 0.7261
Epoch  11/100, Loss: 0.4526, Accuracy: 0.7477
Epoch  12/100, Loss: 0.4387, Accuracy: 0.7411
Epoch  13/100, Loss: 0.4278, Accuracy: 0.7364
Epoch  14/100, Loss: 0.4188, Accuracy: 0.7326
Epoch  15/100, Loss: 0.4084, Accuracy: 0.7345
Epoch  16/100, Loss: 0.3991, Accuracy: 0.7139
Epoch  17/100, Loss: 0.3915, Accuracy: 0.7289
Epoch  18/100, Loss: 0.3753, Accuracy: 0.7073
Early stopping triggered after 18 epochs.
Training ended, loading best model...
Model loaded.


We run validation again to confirm that the best checkpoint was loaded; the accuracy should match the best epoch above (0.7552 at epoch 8).

In [19]:
val_accuracy = validate(model, val_dataloader)

Accuracy: 0.7552


Finally, we evaluate the model on the test set to obtain its test accuracy.

In [20]:
tst_accuracy = validate(model, tst_dataloader)

Accuracy: 0.7402


### Max Pooling RNN

This RNN takes the maximum over all hidden vectors along each dimension. Each dimension of the sentence representation therefore comes from the time step where that feature is strongest, so different dimensions may be drawn from different words.

In [21]:
model = RNNModel(
    embedding_dim=300,
    hidden_size=256,
    embedding_matrix=word2vec_model.vectors,
    rnn_type=MODEL_TYPE,
    bidirectional=False,
    num_layers=1,
)
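
A per-dimension max over time steps can be sketched as follows. As with the other pooling examples, the random `outputs` tensor and the `lengths`-based mask are assumptions for illustration and do not reproduce the internals of `train_mode="max_pool"`.

```python
import torch

torch.manual_seed(0)
outputs = torch.randn(3, 5, 256)        # (batch, seq_len, hidden_size)
lengths = torch.tensor([5, 3, 1])       # true (unpadded) lengths, assumed known

# True for real tokens, False for padding positions.
mask = torch.arange(outputs.size(1))[None, :] < lengths[:, None]   # (3, 5)

# Padding positions are set to -inf so they can never win the max.
masked = outputs.masked_fill(~mask.unsqueeze(-1), float("-inf"))
max_pooled, argmax_steps = masked.max(dim=1)     # both (3, 256)

# Number of distinct time steps that contribute to the first sentence's vector.
print(argmax_steps[0].unique().numel())
```

The `argmax_steps` tensor records which time step supplied each dimension, which makes it easy to see that the pooled vector mixes features from several positions.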

Now that all the data is loaded and processed into Dataloaders, we can start training!

In [22]:
train(
    model=model,
    trn_dataloader=trn_dataloader,
    val_dataloader=val_dataloader,
    version="1",
    model_type=MODEL_TYPE,
    model_save_path=os.getenv("MODEL_SAVE_PATH", "modelfiles/"),
    optimizer=torch.optim.Adam(model.parameters(), lr=LR),
    epochs=100,
    early_stopping_patience=10,
    train_mode="max_pool"
)

Epoch   1/100, Loss: 0.6549, Accuracy: 0.7402
Model saved.
Epoch   2/100, Loss: 0.5338, Accuracy: 0.7477
Model saved.
Epoch   3/100, Loss: 0.5127, Accuracy: 0.7561
Model saved.
Epoch   4/100, Loss: 0.5037, Accuracy: 0.7439
Epoch   5/100, Loss: 0.4981, Accuracy: 0.7561
Epoch   6/100, Loss: 0.4927, Accuracy: 0.7505
Epoch   7/100, Loss: 0.4880, Accuracy: 0.7523
Epoch   8/100, Loss: 0.4827, Accuracy: 0.7636
Model saved.
Epoch   9/100, Loss: 0.4767, Accuracy: 0.7664
Model saved.
Epoch  10/100, Loss: 0.4693, Accuracy: 0.7439
Epoch  11/100, Loss: 0.4608, Accuracy: 0.7495
Epoch  12/100, Loss: 0.4546, Accuracy: 0.7495
Epoch  13/100, Loss: 0.4397, Accuracy: 0.7308
Epoch  14/100, Loss: 0.4280, Accuracy: 0.7345
Epoch  15/100, Loss: 0.4196, Accuracy: 0.7326
Epoch  16/100, Loss: 0.4094, Accuracy: 0.7036
Epoch  17/100, Loss: 0.3970, Accuracy: 0.7345
Epoch  18/100, Loss: 0.3856, Accuracy: 0.7289
Epoch  19/100, Loss: 0.3728, Accuracy: 0.7120
Early stopping triggered after 19 epochs.
Training ended, loading best model...
Model loaded.

We run validation again to confirm that the best checkpoint was loaded; the accuracy should match the best epoch above (0.7664 at epoch 9).

In [23]:
val_accuracy = validate(model, val_dataloader)

Accuracy: 0.7664


Finally, we evaluate the model on the test set to obtain its test accuracy.

In [24]:
tst_accuracy = validate(model, tst_dataloader)

Accuracy: 0.7430
