# Self study 7

In this self study we will experiment with text classification using word embeddings. Specifically, we will use the prelearned embeddings that were also used during the lecture and which can be downloaded from https://code.google.com/archive/p/word2vec/. There are two aspects to the self study: 1. to play around with word embeddings and 2. to get some experience using these embeddings for text modeling tasks such as text classification. The latter also entails working with relevant models, which in this self study take the form of neural network models.

In [1]:
import gensim.downloader as api
import gensim
import certifi
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import numpy as np
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer



## Getting the embedding model

We again use the prelearned embeddings from the lecture. Note that loading these embeddings can be time-consuming, so a bit of patience is advised.

In [2]:
model = gensim.models.KeyedVectors.load_word2vec_format('/Users/sebastiantruong/Downloads/GoogleNews-vectors-negative300.bin.gz', binary = True)

Extract a few example words from the embedding model

In [3]:
list(model.key_to_index.keys())[:10]

['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said']

We can also get the embedding associated with a particular word (here only the first 10 entries of the 300-dimensional vector)

In [4]:
model['IMDB'][:10]

array([ 0.29492188, -0.32617188, -0.39453125, -0.08691406, -0.03833008,
        0.07568359, -0.07958984, -0.06640625,  0.39257812, -0.1953125 ],
      dtype=float32)
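
To play around with the embeddings, a natural first step is to compare word vectors by cosine similarity. The block below is a minimal sketch using made-up 3-dimensional toy vectors (the names `cosine_similarity` and `toy_vectors` are illustrative and not part of the self study); the same function can be applied to vectors such as `model['IMDB']`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" (made-up values, not taken from the model)
toy_vectors = {
    "film": np.array([1.0, 2.0, 0.0]),
    "movie": np.array([2.0, 4.0, 0.0]),
    "banana": np.array([0.0, 0.0, 3.0]),
}

# Parallel vectors give a similarity of 1, orthogonal vectors give 0
sim_film_movie = cosine_similarity(toy_vectors["film"], toy_vectors["movie"])
sim_film_banana = cosine_similarity(toy_vectors["film"], toy_vectors["banana"])
print(sim_film_movie, sim_film_banana)
```

With the loaded embedding model, gensim's `KeyedVectors` also offers `model.similarity(w1, w2)` and `model.most_similar(word)` for the same kind of exploration.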

## Loading and processing the IMDB data

We use a data set of movie reviews from the Internet Movie Database (IMDB).

In [5]:
imdb = load_dataset("imdb")
reviews_train = imdb['train']["text"]
reviews_test = imdb['test']["text"]

Example review and label ...

In [6]:
review_idx = 42
print("Review: " + imdb['train']['text'][review_idx])
print("Label: " + str(imdb['train']['label'][review_idx]))

Label: 0


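We construct a dictionary for the reviews in reviews_train, but only include terms whose document frequency (the fraction of reviews that contain the term) lies between a minimum cutoff (min_df) and a maximum cutoff (max_df). The setup originally used a minimum of 0.0005 and a maximum of 0.5; the cell below uses a minimum of 0.001 instead. More specialized types of text preprocessing could also be performed, but that is less important for this self study and is therefore left out.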

In [7]:
#dictionary=CountVectorizer(min_df=0.0005, max_df=0.5).fit(reviews_train)
dictionary=CountVectorizer(min_df=0.001, max_df=0.5).fit(reviews_train)
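
To make the effect of the two cutoffs concrete, here is a minimal pure-Python sketch of document-frequency filtering on a toy corpus. It is an illustration only, not scikit-learn's implementation: the helper `filter_by_document_frequency` and the toy reviews are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose document frequency (fraction of docs) is in [min_df, max_df]."""
    n_docs = len(docs)
    doc_counts = {}
    for doc in docs:
        # Use a set so each term is counted at most once per document
        for term in set(doc.lower().split()):
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return sorted(t for t, c in doc_counts.items() if min_df <= c / n_docs <= max_df)

toy_docs = [
    "the plot was great",
    "the acting was poor",
    "the plot was thin",
    "the ending surprised me",
]

# "the" appears in every document (too frequent); most other words appear once (too rare)
kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75)
print(kept)
```

Raising `min_df` removes rare terms and shrinks the dictionary, while lowering `max_df` removes very common terms.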


Based on the dictionary we can get matrix/array representations of the data sets containing the counts of the individual words in the reviews. Thus, for each data set we have a matrix of size (#reviews x #words_in_dictionary).

In [8]:
# Sparse matrix representation
reviews_train_tf=dictionary.transform(reviews_train)
reviews_test_tf=dictionary.transform(reviews_test)

# reviews_train_tf is a SciPy sparse matrix; todense() gives a numpy matrix, which we convert to an array. The same applies to the test data
reviews_train_tf = np.squeeze(np.asarray(reviews_train_tf.todense()))
reviews_test_tf = np.squeeze(np.asarray(reviews_test_tf.todense()))
print(f"Shape of reviews_train:{reviews_train_tf.shape}")

Shape of reviews_train:(25000, 10430)


Next we construct embedding matrices for reviews_train and reviews_test. An embedding matrix has size #words_in_dictionary x #embedding_size. The matrix is constructed by iterating over the words in the dictionary we constructed above. If a word does not have an embedding (i.e., it is not contained in the embedding model), it is effectively given an embedding consisting of zeros only. 

In [9]:
train_emb_matrix = np.zeros((reviews_train_tf.shape[1], 300))
words_not_found = []
for k, i in dictionary.vocabulary_.items():
    try: 
        train_emb_matrix[i,:] = model[k]
    except KeyError:
        words_not_found.append(k)
print(f"Number of failed look up attempts for the dictionary: {len(words_not_found)}")
print(f"Examples include: {words_not_found[:10]}")
        
test_emb_matrix = np.zeros((reviews_test_tf.shape[1], 300))
for k, i in dictionary.vocabulary_.items():
    try: 
        test_emb_matrix[i,:] = model[k]
    except KeyError:
        pass  # word has no embedding; its row stays all zeros

Number of failed look up attempts for the dictionary: 502
Examples include: ['1967', '40', 'bergman', '17', 'godard', '10', 'lucille', 'ritter', 'bogdanovich', 'audrey']


Next we represent each movie review by averaging over the embeddings of the words in the review. This is achieved by first finding the normalized word frequencies and afterwards combining them with the embedding matrix. 

In [10]:
reviews_train_tf_norm =  reviews_train_tf / np.sum(reviews_train_tf, axis=1, keepdims=True)
reviews_test_tf_norm =  reviews_test_tf / np.sum(reviews_test_tf, axis=1, keepdims=True)
review_train_word2vec = reviews_train_tf_norm @ train_emb_matrix
review_test_word2vec = reviews_test_tf_norm @ test_emb_matrix
print(f"The resulting matrix (review_train_word2vec) shape: {review_train_word2vec.shape}") 

The resulting matrix (review_train_word2vec) shape: (25000, 300)
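
The averaging step can be checked by hand on a toy example. The sketch below uses made-up counts and 2-dimensional embeddings (all `toy_*` names are illustrative). It also shows two details that are easy to miss: dictionary words without an embedding still count in the normalization, which pulls the average towards zero, and a review containing no dictionary words would cause a division by zero. The `np.maximum` guard is an assumption added here and is not part of the cell above.

```python
import numpy as np

# Toy setup (made-up values): 3 dictionary words, 2-dimensional embeddings.
# The third word has no embedding, so its row is all zeros.
toy_emb_matrix = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.0, 0.0]])

# Word counts for two reviews (rows) over the 3 dictionary words (columns).
toy_counts = np.array([[2, 2, 0],
                       [1, 1, 2]])

# Normalize each row so the counts sum to 1, then average the embeddings.
row_sums = toy_counts.sum(axis=1, keepdims=True)
toy_norm = toy_counts / np.maximum(row_sums, 1)  # guard against empty reviews
toy_review_vectors = toy_norm @ toy_emb_matrix
print(toy_review_vectors)
```

The second review uses the same two embedded words as the first, but half of its tokens have no embedding, so its vector is half as long.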


## Task

Below is a small example of how to set up a simple neural network using PyTorch to experiment with classification of the movie reviews using their embedding representations. In this task you should:
 * Experiment with different neural network architectures and learning settings to investigate the effects on the classification accuracy. To support this analysis, revisit the data processing setup above and define a validation data set that can be used during model learning. 
 * Try changing the cutoff frequencies when constructing the IMDB dictionary. How does changing these parameters affect the accuracy results of your model?
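
One possible way to define the validation data set is to shuffle the training rows and hold out a fraction of them. The following is a minimal sketch under stated assumptions: the helper `train_val_split`, the 20% fraction and the seed are illustrative choices, and the toy arrays stand in for review_train_word2vec and the training labels.

```python
import numpy as np

def train_val_split(features, labels, val_fraction=0.2, seed=0):
    """Shuffle the row indices and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    n_val = int(len(features) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return features[train_idx], labels[train_idx], features[val_idx], labels[val_idx]

# Toy stand-ins for review_train_word2vec and the training labels
toy_features = np.arange(20, dtype=float).reshape(10, 2)
toy_labels = np.array([0, 1] * 5)

X_tr, y_tr, X_val, y_val = train_val_split(toy_features, toy_labels)
print(X_tr.shape, X_val.shape)
```

For the real data, the labels would first need to be converted with `np.array(imdb['train']['label'])`. Shuffling before splitting matters if the training set happens to be ordered by label. Model selection can then use the validation part, keeping the test set for the final evaluation only.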

In the example code below, we work with the torch.nn module provided by PyTorch. A short introduction to this module and how to define neural networks in PyTorch can be found at

 https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py
 https://pytorch.org/tutorials/beginner/nn_tutorial.html

In [21]:
### Experiment with different NN architectures and learning settings and how they affect the classification accuracy
# Tried two alternative layer setups (test1 and test2 below); both gave a similar accuracy of approx. 86%
# Changing the Adam learning rate: lr=0.01 gave 83% accuracy, lr=0.0001 gave 86.1% accuracy

###Try changing the cutoff frequencies
# The chosen standard values work well. Increasing min_df seems to reduce
# the accuracy by a few percentage points, as more words are filtered out of the dictionary.
#tested: (min_df=0.0005, max_df=0.5) Test Error: Accuracy: 86.1%, Avg loss: 0.326928 
#tested: (min_df=0.0005, max_df=0.7) Test Error: Accuracy: 85.6%, Avg loss: 0.336738
#tested: (min_df=0.0005, max_df=0.9) Test Error: Accuracy: 85.8%, Avg loss: 0.335870
#tested: (min_df=0.001 , max_df=0.5) Test Error: Accuracy: 85.6%, Avg loss: 0.337421 
#tested: (min_df=0.02  , max_df=0.5) Test Error: Accuracy: 83.5%, Avg loss: 0.374771 
#tested: (min_df=0.02  , max_df=0.9) Test Error: Accuracy: 83.6%, Avg loss: 0.377356 

# Changing the batch size had only a small effect (accuracy between 85.5% and 86.1%)
# batch size 8  did:  Test Error: Accuracy: 86.0%, Avg loss: 0.329796
# batch size 32 did:  Test Error: Accuracy: 85.7%, Avg loss: 0.332952 
# batch size 64 did:  Test Error: Accuracy: 86.1%, Avg loss: 0.326928 
# batch size 128 did: Test Error: Accuracy: 85.5%, Avg loss: 0.340475




### Simple neural network implementation

In [12]:
import torch
from torch.utils.data import DataLoader
from torch import nn
from tqdm.auto import tqdm
from torch.optim import Adam
import evaluate


In [13]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cpu device


Define a dataloader for managing the data set. Here we will be working with minibatches, whose size is set by batch_size (8 in the cell below). The minibatches form the basis for a stochastic gradient descent implementation for learning the model parameters.

In [36]:
train = torch.utils.data.TensorDataset(torch.tensor(review_train_word2vec, dtype=torch.float), torch.tensor(imdb['train']['label']))
test = torch.utils.data.TensorDataset(torch.tensor(review_test_word2vec, dtype=torch.float), torch.tensor(imdb['test']['label']))
batch_size = 8
train_loader = torch.utils.data.DataLoader(train, 
                                           batch_size=batch_size,
                                           shuffle=True,
                                           num_workers=0)
test_loader = torch.utils.data.DataLoader(test, 
                                           batch_size=1,
                                           shuffle=False,
                                           drop_last=False,
                                           num_workers=0)

Define the neural network model

In [37]:
#Standard
#Test Error: 
#Accuracy: 86.1%, Avg loss: 0.326928 
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(300, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):        
        logits = self.linear_relu_stack(x)
        return logits


#test1
#Test Error: 
#Accuracy: 86.0%, Avg loss: 0.328519
# class NeuralNetwork(nn.Module):
#     def __init__(self):
#         super().__init__()
#         self.flatten = nn.Flatten()
#         self.linear_relu_stack = nn.Sequential(
#             nn.Linear(300, 512),
#             nn.ReLU(),
#             nn.Linear(512, 512),
#             nn.ReLU(),
#             nn.Linear(512, 2),
#         )

#     def forward(self, x):
#         x = self.flatten(x)
#         logits = self.linear_relu_stack(x)
#         return logits

#test2
#Test Error: 
#Accuracy: 86.2%, Avg loss: 0.324331     
# class NeuralNetwork(nn.Module):
# 	def __init__(self):
# 			super().__init__()
# 			self.linear_relu_stack = nn.Sequential(
# 					nn.Linear(300, 128),
# 					nn.ReLU(),
# 					nn.Linear(128, 128),
# 					nn.Linear(128, 300),
# 					nn.Sigmoid(),
# 					nn.Linear(300, 2),
# 			)

# 	def forward(self, x):        
# 			logits = self.linear_relu_stack(x)
# 			return logits

In [38]:
model = NeuralNetwork().to(device)
print(model)

NeuralNetwork(
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=300, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
)


Set up methods for the training and testing

In [39]:
# Adapted from https://pytorch.org/tutorials/beginner/basics/optimization_tutorial.html

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    train_loss = 0
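    for batch, (X, y) in enumerate(dataloader):
        # Move the batch to the same device as the model (needed when running on cuda)
        X, y = X.to(device), y.to(device)
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)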
        train_loss += loss.item()
        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    return train_loss / len(dataloader)

def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
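        for X, y in dataloader:
            # Move the batch to the same device as the model (needed when running on cuda)
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()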

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    return test_loss, correct


Lastly, let's take the model out for a spin and see how it performs.

In [40]:
loss_fn = nn.CrossEntropyLoss()
epochs = 10
optimizer = torch.optim.Adam(model.parameters(),lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5**(2/epochs))

test_loop(test_loader, model, loss_fn)

pbar = tqdm(range(epochs))
for epoch in pbar:
    train_loss = train_loop(train_loader, model, loss_fn, optimizer)
    test_loss, accuracy = test_loop(test_loader, model, loss_fn)
    scheduler.step()
    pbar.set_description(f"Train_loss: {train_loss:.2f}, Test_loss: {test_loss:.2f}, accuracy: {100*accuracy:0.2f}")

test_loop(test_loader, model, loss_fn)
print("Done!")

Test Error: 
 Accuracy: 50.0%, Avg loss: 0.693273 



Train_loss: 0.41, Test_loss: 0.35, accuracy: 85.02:  10%|█         | 1/10 [00:06<00:55,  6.15s/it]

Test Error: 
 Accuracy: 85.0%, Avg loss: 0.350543 



Train_loss: 0.35, Test_loss: 0.35, accuracy: 84.70:  20%|██        | 2/10 [00:12<00:49,  6.19s/it]

Test Error: 
 Accuracy: 84.7%, Avg loss: 0.352131 



Train_loss: 0.34, Test_loss: 0.34, accuracy: 85.38:  30%|███       | 3/10 [00:18<00:44,  6.30s/it]

Test Error: 
 Accuracy: 85.4%, Avg loss: 0.344042 



Train_loss: 0.34, Test_loss: 0.34, accuracy: 85.47:  40%|████      | 4/10 [00:24<00:37,  6.25s/it]

Test Error: 
 Accuracy: 85.5%, Avg loss: 0.341767 



Train_loss: 0.34, Test_loss: 0.34, accuracy: 85.50:  50%|█████     | 5/10 [00:31<00:31,  6.21s/it]

Test Error: 
 Accuracy: 85.5%, Avg loss: 0.338778 



Train_loss: 0.33, Test_loss: 0.34, accuracy: 85.38:  60%|██████    | 6/10 [00:37<00:24,  6.20s/it]

Test Error: 
 Accuracy: 85.4%, Avg loss: 0.339280 



Train_loss: 0.33, Test_loss: 0.33, accuracy: 85.81:  70%|███████   | 7/10 [00:43<00:18,  6.24s/it]

Test Error: 
 Accuracy: 85.8%, Avg loss: 0.333113 



Train_loss: 0.33, Test_loss: 0.34, accuracy: 85.34:  80%|████████  | 8/10 [00:50<00:12,  6.40s/it]

Test Error: 
 Accuracy: 85.3%, Avg loss: 0.338140 



Train_loss: 0.33, Test_loss: 0.33, accuracy: 85.85:  90%|█████████ | 9/10 [00:56<00:06,  6.39s/it]

Test Error: 
 Accuracy: 85.9%, Avg loss: 0.331768 



Train_loss: 0.32, Test_loss: 0.33, accuracy: 85.97: 100%|██████████| 10/10 [01:03<00:00,  6.35s/it]

Test Error: 
 Accuracy: 86.0%, Avg loss: 0.329796 






Test Error: 
 Accuracy: 86.0%, Avg loss: 0.329796 

Done!
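
The training cell above uses an exponential learning rate schedule with gamma=0.5**(2/epochs). As a worked example of what that choice implies, the sketch below recomputes the learning rate per epoch in plain Python, assuming the rate is multiplied by gamma at every call to scheduler.step(). The variable names `initial_lr` and `lr_per_epoch` are illustrative.

```python
epochs = 10
initial_lr = 1e-3
gamma = 0.5 ** (2 / epochs)  # same expression as in the training cell

# The learning rate after n scheduler steps is initial_lr * gamma**n
lr_per_epoch = [initial_lr * gamma ** n for n in range(epochs + 1)]
for n, lr in enumerate(lr_per_epoch):
    print(f"after {n:2d} scheduler steps: lr = {lr:.6f}")
```

With this gamma the learning rate is halved every epochs/2 epochs, so it ends at a quarter of its initial value after the full run.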
