## Import libraries
Here, we import the libraries required to develop our neural network model. Besides familiar libraries such as sklearn and nltk, we also import PyTorch, a deep learning library used for applications such as computer vision and natural language processing.

In [58]:
import os
import io
import sys
import torch
from torch.autograd import Variable
from sklearn.metrics import f1_score, classification_report, roc_curve, auc
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Upload data files and load data from csv to pandas dataframe
Here, you upload the training and test files that we provide from your local machine to Google Colab.

In [59]:
# Upload data files. Note that Colab may take about 3 minutes to finish uploading them.
from google.colab import files
uploaded = files.upload()

Saving assignment5_processed_test.csv to assignment5_processed_test (1).csv
Saving assignment5_processed_train.csv to assignment5_processed_train (1).csv


In [60]:
from sklearn import preprocessing

# Import data into pandas DataFrames
train_df = pd.read_csv(io.BytesIO(uploaded['assignment5_processed_train.csv']))
test_df = pd.read_csv(io.BytesIO(uploaded['assignment5_processed_test.csv']))

# Get only Text and Label columns for the task
train_df = train_df[["ProcessedTweet","Sentiment"]]
test_df = test_df[["ProcessedTweet","Sentiment"]]

# Change name of the columns for convenience
train_df.columns = ["TEXT","LABEL"]
test_df.columns = ["TEXT","LABEL"]

# convert labels to numeric values
le = preprocessing.LabelEncoder()
le.fit(["Positive","Negative","Neutral"])
print ("List of labels: ", list(le.classes_))
train_df.LABEL = le.transform(train_df.LABEL)
test_df.LABEL = le.transform(test_df.LABEL)

# Print the size of each set
print ("Training set: ", len(train_df))
print ("Test set: ", len(test_df))

# Display the first 5 rows in each set for double-checking
display(train_df.head(5))
display(test_df.head(5))

List of labels:  ['Negative', 'Neutral', 'Positive']
Training set:  41142
Test set:  3796


| | TEXT | LABEL |
|---|---|---|
| 0 | http co ifz9fan2pa http co xx6ghgfzcc http co ... | 1 |
| 1 | advic talk neighbour famili exchang phone numb... | 2 |
| 2 | coronaviru australia woolworth give elderli di... | 2 |
| 3 | food stock one empti pleas panic enough food e... | 2 |
| 4 | readi go supermarket outbreak paranoid food st... | 0 |


| | TEXT | LABEL |
|---|---|---|
| 0 | trend new yorker encount empti supermarket she... | 0 |
| 1 | find hand sanit fred meyer turn 114 97 2 pack ... | 2 |
| 2 | find protect love one | 2 |
| 3 | buy hit citi anxiou shopper stock food amp med... | 0 |
| 4 | australia updat gate one week everyon buy babi... | 1 |


## Check if GPU is available to run the neural network

In [61]:
# Use cuda if present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cuda


## Function 1:

In [62]:
from nltk.tokenize import word_tokenize
from collections import defaultdict

# define function
def tokenize(texts):
    max_len = 0
    tokenized_texts = []
    word2idx = {}

    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        tokenized_texts.append(tokenized_sent)

        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

# Run the function
all_text = train_df.TEXT.to_list() + test_df.TEXT.to_list()
tokenized_texts, word2idx, max_len = tokenize(all_text)


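The input is the TEXT column of the training set and the test set combined.
The output is the list of tokenized texts, a dictionary (`word2idx`) that maps each distinct token to an integer index, and the length in tokens of the longest text.
The function tokenizes every text in both datasets and builds the vocabulary as it goes; indices 0 and 1 are reserved for `<pad>` and `<unk>`.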
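As a minimal sketch of the same idea, the snippet below builds a word-to-index mapping for two made-up sentences. It uses `str.split` in place of `word_tokenize` so that it runs without NLTK data; the name `build_vocab` and the sample sentences are assumptions for illustration only.

```python
def build_vocab(texts):
    """Tokenize texts on whitespace and give each new token the next free index."""
    word2idx = {'<pad>': 0, '<unk>': 1}  # reserved entries, as in tokenize()
    tokenized_texts = []
    max_len = 0
    for sent in texts:
        tokens = sent.split()  # stand-in for nltk's word_tokenize
        tokenized_texts.append(tokens)
        for token in tokens:
            if token not in word2idx:
                word2idx[token] = len(word2idx)
        max_len = max(max_len, len(tokens))
    return tokenized_texts, word2idx, max_len

sample_texts = ["food stock empti", "panic buy food"]  # made-up examples
sample_tokens, sample_vocab, sample_max_len = build_vocab(sample_texts)
print(sample_vocab)
print(sample_max_len)
```

A token that appears more than once, such as `food` here, keeps the index it received the first time it was seen.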




In [63]:
print('Vocabulary Size: ', len(word2idx))
print('boredom:', word2idx['boredom'])
print('panic:', word2idx['panic'])
print('glad:', word2idx['glad'])
print('safe:', word2idx['safe'])

Vocabulary Size:  51361
boredom: 2466
panic: 49
glad: 2514
safe: 56


## Function 2:

In [64]:
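def encode(tokenized_texts, word2idx, max_len):
    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad to max_len. Note that += extends the list in place, so the
        # entries of tokenized_texts are padded as a side effect.
        tokenized_sent += ['<pad>'] * (max_len - len(tokenized_sent))

        # Map tokens to IDs; unseen tokens fall back to <unk> instead of None
        input_id = [word2idx.get(token, word2idx['<unk>']) for token in tokenized_sent]
        input_ids.append(input_id)

    # 2D array of shape (number of texts, max_len)
    return np.array(input_ids)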

# Run the function
input_ids = encode(tokenized_texts, word2idx, max_len)

The inputs are the tokenized texts, the `word2idx` dictionary built by Function 1, and `max_len` (the length of the longest text in tokens), also produced by Function 1.
The output is a 2D NumPy array of token IDs with one row per text and `max_len` columns.

In [65]:
len(tokenized_texts[0])

48

In [66]:
print(input_ids[0])
print(len(input_ids[0]))

[2 3 4 2 3 5 2 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
48


Each row of `input_ids` is a 1D array of length `max_len` (48 here). The leading values are the IDs of the text's tokens in the `word2idx` dictionary; the trailing 0s are `<pad>` IDs, added because this text is shorter than the longest tokenized text.
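The padding and lookup steps can be sketched on a toy vocabulary as follows. This is an illustrative variant, not the notebook's `encode`: the name `pad_and_encode`, the vocabulary, and the sample tokens are assumptions, and this version builds a new padded list instead of extending the input in place.

```python
import numpy as np

def pad_and_encode(tokenized_texts, word2idx, max_len):
    """Return a (num_texts, max_len) array of token IDs, padded with the <pad> ID."""
    rows = []
    for tokens in tokenized_texts:
        # Build a new padded list so the caller's token lists stay unchanged
        padded = tokens + ['<pad>'] * (max_len - len(tokens))
        rows.append([word2idx.get(tok, word2idx['<unk>']) for tok in padded])
    return np.array(rows)

toy_vocab = {'<pad>': 0, '<unk>': 1, 'http': 2, 'co': 3}   # hypothetical vocabulary
toy_texts = [['http', 'co'], ['co', 'http', 'co', 'newword']]
toy_ids = pad_and_encode(toy_texts, toy_vocab, max_len=4)
print(toy_ids.shape)
print(toy_ids)
```

The short text is filled with 0s up to `max_len`, and the out-of-vocabulary token `newword` is mapped to the `<unk>` ID 1.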

## Download pre-trained word embeddings
In this step, we download the fastText pre-trained word embeddings.

In [67]:
%%time
URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
FILE = "fastText"

if os.path.isdir(FILE):
    print("fastText exists.")
else:
    !wget -P $FILE $URL
    !unzip $FILE/crawl-300d-2M.vec.zip -d $FILE

fastText exists.
CPU times: user 917 µs, sys: 51 µs, total: 968 µs
Wall time: 832 µs


## Function 3:

In [68]:
from tqdm.notebook import tqdm

def load_pretrained_vectors(word2idx, fname):
    with open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the number of vectors and their dimension
        n, d = map(int, fin.readline().split())

        # Random initialization for words missing from the file; zeros for <pad>
        embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
        embeddings[word2idx['<pad>']] = np.zeros((d,))

        # Overwrite the rows of words that have a pre-trained vector
        count = 0
        for line in tqdm(fin):
            tokens = line.rstrip().split(' ')
            word = tokens[0]
            if word in word2idx:
                count += 1
                embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    return embeddings

# Run the function
embeddings = load_pretrained_vectors(word2idx, "fastText/crawl-300d-2M.vec")
embeddings = torch.tensor(embeddings)




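The inputs are the `word2idx` dictionary and the filename of the pre-trained word-vector file (here the downloaded fastText vectors). The output is an embedding matrix with one row per vocabulary word. The function copies the pre-trained vector of every vocabulary word that appears in the file into that word's row; words not found in the file keep a small random vector, and `<pad>` is all zeros.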
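The same lookup can be sketched on a tiny in-memory file. The three-dimensional vectors, the vocabulary, and the variable names below are made up for illustration; only the `.vec` layout (a header line with the count and dimension, then one word and its values per line) mirrors what the function above reads.

```python
import io
import numpy as np

# Tiny stand-in for a .vec file: header "n d", then "word v1 ... vd" per line
fake_vec = io.StringIO("2 3\npanic 0.1 0.2 0.3\nsafe 0.4 0.5 0.6\n")
toy_word2idx = {'<pad>': 0, '<unk>': 1, 'panic': 2, 'glad': 3}  # hypothetical vocabulary

n, d = map(int, fake_vec.readline().split())
rng = np.random.default_rng(0)
toy_embeddings = rng.uniform(-0.25, 0.25, (len(toy_word2idx), d))
toy_embeddings[toy_word2idx['<pad>']] = 0.0

found = 0
for line in fake_vec:
    parts = line.rstrip().split(' ')
    if parts[0] in toy_word2idx:
        # Replace the random row with the pre-trained vector
        toy_embeddings[toy_word2idx[parts[0]]] = np.array(parts[1:], dtype=np.float32)
        found += 1

print(toy_embeddings.shape)
print(found)
```

Here only `panic` is found in the file, so `glad` and `<unk>` keep their random vectors and `safe` is ignored because it is not in the vocabulary.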

In [69]:
embeddings[9]

tensor([ 0.2667, -0.3146, -0.5417, -0.5040, -0.2145,  0.4093,  0.2662, -0.3960,
        -0.0460, -0.3961,  0.0053,  0.4392,  0.3351,  0.0847, -0.6161, -0.3722,
        -0.2911, -0.0162,  0.3547, -0.1321, -0.0685, -0.0682, -0.1115, -0.2473,
         0.1107,  0.1364,  0.5421,  0.7690,  0.1951,  0.1173, -0.0965, -0.1923,
        -0.0522, -0.2383, -0.4833, -0.5174, -0.1734,  0.3488, -0.1417, -0.0074,
        -0.1720, -0.2842,  0.4342,  0.4815, -0.0183, -0.3316,  0.0451, -0.0840,
        -0.0279, -0.2275,  0.2233, -0.1869,  0.2873,  0.0264,  0.0581,  0.0107,
        -0.1087, -0.0400, -0.0673, -0.0021,  0.0355,  0.0558, -0.3831, -0.2310,
        -0.2670, -0.3886, -0.0738,  0.2776,  0.0391, -0.3716, -0.0654, -0.1227,
         0.0228,  0.1081, -0.3040, -0.0550, -0.2696, -0.1574, -0.0096,  0.2272,
         0.0289,  0.1883,  0.2891,  0.4875,  0.1178,  0.2956,  0.3154,  0.0304,
         0.0946, -0.5979,  0.4883,  0.2340, -0.3449, -0.1650,  0.0602, -0.3023,
         0.2113,  0.3234,  0.1046, -0.09

`embeddings[9]` is a 1D tensor of 300 floats. It is the vector for the entry with index 9 in `word2idx`, that is, the 10th entry of the vocabulary when `<pad>` and `<unk>` are counted.

In [70]:
embeddings.shape

torch.Size([51361, 300])

## Function 4:

In [71]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,SequentialSampler)

def data_loader(train_inputs, test_inputs, train_labels, test_labels,
                batch_size=50):

    # Convert every input to a torch tensor
    train_inputs, test_inputs, train_labels, test_labels = tuple(torch.tensor(data) for data in [train_inputs, test_inputs, train_labels, test_labels])

    # Training batches are drawn in random order
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Test batches keep the original order
    test_data = TensorDataset(test_inputs, test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

    return train_dataloader, test_dataloader

# Run the function
# The first len(train_df) rows of input_ids are the training texts; the rest are the test texts
n_train = len(train_df)
train_inputs = input_ids[:n_train]
test_inputs = input_ids[n_train:]

train_labels = train_df.LABEL.tolist()
test_labels = test_df.LABEL.tolist()

train_dataloader, test_dataloader = data_loader(train_inputs, test_inputs, train_labels, test_labels, batch_size=50)
  

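The inputs are the encoded training and test texts and their labels.
The outputs are two PyTorch `DataLoader` objects, one for training and one for testing, that yield batches of (token IDs, labels).
The function converts the inputs to tensors, pairs each text with its label in a `TensorDataset`, and splits the data into batches: training batches are sampled in random order, test batches in sequential order.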
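To illustrate what batching with a random versus a sequential sampler means, here is a plain-Python sketch. It is not PyTorch's `DataLoader`; the helper `make_batches`, the seed, and the toy rows are assumptions made for the example.

```python
import random

def make_batches(inputs, labels, batch_size, shuffle, seed=0):
    """Yield (inputs, labels) batches; shuffle=True mimics a random sampler."""
    order = list(range(len(inputs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [inputs[i] for i in idx], [labels[i] for i in idx]

toy_inputs = [[2, 3, 0], [4, 5, 6], [7, 0, 0], [8, 9, 0], [2, 2, 2]]  # made-up ID rows
toy_labels = [1, 2, 2, 0, 1]

test_batches = list(make_batches(toy_inputs, toy_labels, batch_size=2, shuffle=False))
train_batches = list(make_batches(toy_inputs, toy_labels, batch_size=2, shuffle=True))
print(len(test_batches))   # five rows with batch_size=2 give batches of size 2, 2, 1
print(test_batches[0])
```

Each text stays paired with its own label in both modes; shuffling only changes the order in which the pairs are visited.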


## Function 5 - CNN Model
In this section, we define a vanilla CNN model.

In [77]:
import torch
import torch.nn as nn
import torch.nn.functional as F

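class CNN_classifier(nn.Module):
    # learning_rate is accepted so the calls below work, but it is not used here;
    # the learning rate that matters is the one passed to the optimizer.
    def __init__(self,vocab_size=None,embed_dim=300,filter_sizes=2,num_filters=100,num_classes=3,dropout=0.5, learning_rate = 0.25):

        super(CNN_classifier, self).__init__()

        # Layer 1: embedding layer
        self.embed_dim = embed_dim
        self.embedding = nn.Embedding(num_embeddings=vocab_size,embedding_dim=self.embed_dim,padding_idx=0, max_norm=5.0)
            
        # Layer 2: 1D convolution over the sequence of word vectors
        self.conv1d_list = nn.ModuleList([nn.Conv1d(in_channels=self.embed_dim,out_channels=num_filters,kernel_size=filter_sizes)])
        
        # Layer 3: fully connected layer mapping pooled features to class scores
        self.fc = nn.Linear(np.sum(num_filters), num_classes)

        # Layer 4: dropout
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        # (batch, max_len) -> (batch, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()
        # Conv1d expects (batch, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)
        # Convolution + ReLU: (batch, num_filters, max_len - filter_sizes + 1)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]
        # Max pooling over the whole sequence: (batch, num_filters, 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2]) for x_conv in x_conv_list]
        # Concatenate the pooled features: (batch, num_filters)
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],dim=1)
        # Dropout, then the fully connected layer gives one score per class
        logits = self.fc(self.dropout(x_fc))

        return logits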

Layer 1 is the embedding layer, which maps each word ID to a 300-dimensional vector.
Layer 2 is the convolutional layer. It slides each filter over windows of consecutive words to extract features, producing a 1D feature map per filter.
Layer 3 is the fully connected layer, which maps the pooled features to one score per class; its weights are learned during training.
Layer 4 is the dropout layer, which reduces overfitting. With a dropout of 0.5, each input to the fully connected layer is set to zero with probability 0.5 during training.
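The convolution and pooling steps can be sketched in NumPy for a single text. The sizes and random values below are toy assumptions, far smaller than the notebook's, and the explicit loops are written for readability and are not how `nn.Conv1d` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, num_filters, kernel_size = 6, 4, 3, 2   # toy sizes, not the notebook's

x = rng.normal(size=(seq_len, embed_dim))                   # one embedded text
weights = rng.normal(size=(num_filters, kernel_size, embed_dim))
bias = np.zeros(num_filters)

# Slide each filter over every window of kernel_size consecutive tokens
out_len = seq_len - kernel_size + 1
conv = np.empty((num_filters, out_len))
for f in range(num_filters):
    for t in range(out_len):
        conv[f, t] = np.sum(weights[f] * x[t:t + kernel_size]) + bias[f]

relu = np.maximum(conv, 0.0)   # ReLU keeps only positive responses
pooled = relu.max(axis=1)      # max pooling over the sequence: one value per filter

print(conv.shape)
print(pooled.shape)
```

Whatever the text length, pooling leaves exactly one value per filter, which is why the fully connected layer can have a fixed input size.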

## Function 6:

In [111]:
import random
from sklearn.metrics import classification_report

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def train(model, optimizer, train_dataloader, test_dataloader=None, epochs=10):
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Test Loss':^10} | {'Test F1':^9}")
    print("-"*50)

    best_epoch = 0
    best_f1 = 0
    best_report = None

    for epoch_i in range(epochs):
        total_loss = 0

        model.train()

        for step, batch in enumerate(train_dataloader):
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)
            model.zero_grad()
            logits = model(b_input_ids)
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()

        avg_train_loss = total_loss / len(train_dataloader)

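        if test_dataloader is not None:
            test_loss, test_f1_score_mean = evaluate(model, test_dataloader)
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {test_loss:^10.6f} | {test_f1_score_mean:^9.2f}")
            
            if test_f1_score_mean > best_f1:
                best_f1 = test_f1_score_mean
                best_epoch = epoch_i + 1  # 1-based, to match the printed table
                # Build the report from predictions on the whole test set
                # (the model is still in eval mode after evaluate()).
                y_true, y_pred = [], []
                with torch.no_grad():
                    for t_input_ids, t_labels in test_dataloader:
                        t_logits = model(t_input_ids.to(device))
                        y_pred.extend(torch.argmax(t_logits, dim=1).tolist())
                        y_true.extend(t_labels.tolist())
                best_report = classification_report(y_true, y_pred)
    print('Best epoch the model performs: ', best_epoch)
    print('Best performing model classification report: \n', best_report)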

def evaluate(model, val_dataloader):
    model.eval()

    val_f1_score = []
    val_loss = []

    for batch in val_dataloader:
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            logits = model(b_input_ids)

        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        preds = torch.argmax(logits, dim=1).flatten()

        f1_score_item = f1_score(b_labels.cpu().numpy(),preds.cpu().numpy(), average="weighted")
        val_f1_score.append(f1_score_item)

    val_loss = np.mean(val_loss)
    val_f1_score_mean = np.mean(val_f1_score)

    return val_loss, val_f1_score_mean

For the function `train`, the inputs are the model, the optimizer (the algorithm that updates the model's weights, here Adam), the training dataloader (texts and labels), an optional test dataloader (texts and labels), and the number of epochs, i.e. the number of times the whole training set passes through the network. The function returns nothing; it prints the training loss, test loss, and test F1 score for each epoch, followed by the best epoch and a classification report. It trains the model on the training set.

For the function `evaluate`, the inputs are the model and a dataloader (texts and labels). It returns the mean loss and the mean weighted F1 score over the batches, which measure how well the trained model performs on that dataset.
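To show what a weighted F1 score measures, the sketch below computes it by hand: the F1 of each class, weighted by that class's share of the true labels. The function `weighted_f1` and the toy labels are assumptions for illustration; the notebook itself relies on sklearn's `f1_score` with `average="weighted"`.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn) / len(y_true)   # weight = support of class c
    return total

toy_true = [0, 0, 1, 1, 2, 2]   # made-up labels
toy_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions
print(round(weighted_f1(toy_true, toy_pred), 4))
```

Note that averaging this score over batches, as `evaluate` does, is not guaranteed to equal the score computed once over the whole test set.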

## Function 7 - Train and evaluate model

In [118]:
import torch.optim as optim

# Define hyperparameters
vocab_size=len(word2idx)
embed_dim=300

#filter_sizes=1
filter_sizes=2
#num_filters=100
num_filters=200

num_classes=3
#dropout = 0.1
dropout = 0.2

#learning_rate = 0.01
learning_rate = 0.0003

cnn_model = CNN_classifier(vocab_size=vocab_size,
                    embed_dim=embed_dim,
                    num_classes= num_classes,
                    filter_sizes = filter_sizes,
                    num_filters = num_filters,
                    dropout = dropout,
                    learning_rate = learning_rate)

optimizer = optim.Adam(cnn_model.parameters(),lr=learning_rate)
    
cnn_model.to(device)

train(cnn_model, optimizer, train_dataloader, test_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  | Test Loss  |  Test F1 
--------------------------------------------------
   1    |   0.829973   |  0.674513  |   0.72   
   2    |   0.551050   |  0.546181  |   0.80   
   3    |   0.417927   |  0.518702  |   0.81   
   4    |   0.339732   |  0.517281  |   0.82   
   5    |   0.271966   |  0.533123  |   0.81   
   6    |   0.209138   |  0.563183  |   0.80   
   7    |   0.160186   |  0.614263  |   0.79   
   8    |   0.118180   |  0.663535  |   0.78   
   9    |   0.087417   |  0.727516  |   0.77   
  10    |   0.065455   |  0.787022  |   0.75   
  11    |   0.050397   |  0.850490  |   0.74   
  12    |   0.038644   |  0.912691  |   0.74   
  13    |   0.031185   |  0.978975  |   0.73   
  14    |   0.024999   |  1.029256  |   0.73   
  15    |   0.021218   |  1.083438  |   0.73   
  16    |   0.017074   |  1.131279  |   0.72   
  17    |   0.013802   |  1.181611  |   0.72   
  18    |   0.012680   |  1.246254  |   0.71   
  19    |   0.0103

`vocab_size` is the size of the vocabulary (`word2idx`). `embed_dim` is the dimension of the embedding vector for each word in the vocabulary. `filter_sizes` is the width of each convolutional filter, i.e. how many neighboring words it sees at once. `num_filters` is the number of filters (output channels) in the convolutional layer. `dropout` is the probability that a unit is dropped during training. `learning_rate` controls how much the weights are updated at each optimization step.
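To make these hyperparameters concrete, the arithmetic below counts the trainable parameters each layer would have with the values used in this cell. It assumes the usual layouts: one vector per word for the embedding, and a weight tensor plus one bias per output for the convolutional and fully connected layers.

```python
vocab_size, embed_dim = 51361, 300                 # from embeddings.shape above
num_filters, kernel_size, num_classes = 200, 2, 3  # values set in this cell

embedding_params = vocab_size * embed_dim
conv_params = embed_dim * num_filters * kernel_size + num_filters   # weights + biases
fc_params = num_filters * num_classes + num_classes                 # weights + biases

print(embedding_params, conv_params, fc_params)
```

Under these assumptions the embedding layer holds by far the most parameters, which is one reason a model like this can fit the training set much more closely than the test set.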

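The best test F1 score (0.82) is reached at epoch 4, which is also where the test loss is lowest. From epoch 5 onward the training loss keeps falling while the test loss rises and the test F1 score drops, which indicates overfitting. In general, the larger the test loss, the worse the F1 score: by epoch 18 the test loss has climbed to 1.246254 and the F1 score has fallen to 0.71.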
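One simple way to read a training log like the table above is to pick the epoch with the highest test F1 or the lowest test loss. The sketch below does this for the first six rows of that table; the tuple layout and variable names are assumptions for illustration.

```python
# (epoch, train_loss, test_loss, test_f1) copied from the first six rows of the table above
history = [
    (1, 0.829973, 0.674513, 0.72),
    (2, 0.551050, 0.546181, 0.80),
    (3, 0.417927, 0.518702, 0.81),
    (4, 0.339732, 0.517281, 0.82),
    (5, 0.271966, 0.533123, 0.81),
    (6, 0.209138, 0.563183, 0.80),
]

best_by_f1 = max(history, key=lambda row: row[3])
best_by_loss = min(history, key=lambda row: row[2])
print("Best epoch by test F1:", best_by_f1[0])
print("Best epoch by test loss:", best_by_loss[0])
```

Stopping training at that epoch, or saving the model weights there, would avoid the later epochs in which the training loss keeps falling while the test metrics get worse.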

## Function 8:


In [123]:
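def predict(text, model=None, max_len=62):
    # Resolve the model at call time; a default of cnn_model.to("cpu") is evaluated
    # only once, when the function is defined.
    if model is None:
        model = cnn_model
    model = model.to("cpu")
    model.eval()  # make sure dropout is disabled for inference

    # Tokenize, pad to max_len, and map tokens to IDs (<unk> for unseen tokens)
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Add a batch dimension: (1, max_len)
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    with torch.no_grad():
        logits = model(input_id)

    # Convert the class scores to probabilities
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[0] * 100:.5f}% Negative;  {probs[1] * 100:.5f}% Neutral;  {probs[2] * 100:.5f}% Positive.")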

predict("covid 19 is suck. I am fed up of staying at home.")
predict("I feel much better now since the vaccine has been produced.")
predict("Covid 19 is dangerous. I feel unsafe when going out these days.")
predict("It is good that the govenrment starts acting.")

This review is 100.00000% Negative;  0.00000% Neutral;  0.00000% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.
This review is 25.00377% Negative;  24.76023% Neutral;  50.23600% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.


Original:

This review is 0.00126% Negative;  0.00000% Neutral;  99.99873% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.
This review is 29.60551% Negative;  0.00074% Neutral;  70.39374% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.

After tuning:

This review is 99.95264% Negative;  0.04655% Neutral;  0.00081% Positive.
This review is 0.00000% Negative;  0.00001% Neutral;  99.99998% Positive.
This review is 0.59901% Negative;  73.15240% Neutral;  26.24858% Positive.
This review is 0.00001% Negative;  0.00000% Neutral;  99.99998% Positive.

Yes, I saw an improvement in performance. A likely reason is the filter size: originally it was 1, so each filter looked at a single word and could not use information from neighboring words. After hyperparameter tuning the filter size is 2, so the model can pick up two-word combinations. This is why the first sentence is now predicted as negative: "fed up" is a strongly negative word combination that a filter of size 1 cannot detect.

In [122]:
# My own tuning
# Define hyperparameters
vocab_size=len(word2idx)
embed_dim=300

#filter_sizes=1
filter_sizes=2
#num_filters=100
num_filters=100

num_classes=3
#dropout = 0.1
dropout = 0.3

#learning_rate = 0.01
learning_rate = 0.01

cnn_model = CNN_classifier(vocab_size=vocab_size,
                    embed_dim=embed_dim,
                    num_classes= num_classes,
                    filter_sizes = filter_sizes,
                    num_filters = num_filters,
                    dropout = dropout,
                    learning_rate = learning_rate)

optimizer = optim.Adam(cnn_model.parameters(),lr=learning_rate)
    
cnn_model.to(device)

train(cnn_model, optimizer, train_dataloader, test_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  | Test Loss  |  Test F1 
--------------------------------------------------
   1    |   0.711043   |  0.593878  |   0.79   
   2    |   0.558129   |  0.726199  |   0.74   
   3    |   0.368194   |  0.915208  |   0.69   
   4    |   0.300668   |  1.150014  |   0.63   
   5    |   0.287210   |  1.196869  |   0.64   
   6    |   0.273318   |  1.205637  |   0.63   
   7    |   0.280296   |  1.363516  |   0.63   
   8    |   0.283924   |  1.831694  |   0.58   
   9    |   0.267789   |  1.529301  |   0.63   
  10    |   0.268250   |  1.671096  |   0.57   
  11    |   0.258975   |  1.690187  |   0.61   
  12    |   0.249657   |  1.860120  |   0.60   
  13    |   0.248953   |  2.275559  |   0.58   
  14    |   0.260522   |  2.056852  |   0.64   
  15    |   0.250883   |  1.519942  |   0.63   
  16    |   0.234276   |  1.914752  |   0.60   
  17    |   0.231214   |  1.638235  |   0.65   
  18    |   0.234306   |  2.241933  |   0.59   
  19    |   0.2361

My own tuning:

This review is 100.00000% Negative;  0.00000% Neutral;  0.00000% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.
This review is 25.00377% Negative;  24.76023% Neutral;  50.23600% Positive.
This review is 0.00000% Negative;  0.00000% Neutral;  100.00000% Positive.

I chose filter_sizes = 2 because a filter size of 1 performed badly on the first sentence, and I chose num_filters = 100 because the maximum text length is 62, which is less than 100. I set dropout = 0.3 so that the model can drop some uninformative signals. I set the learning rate to 0.01 so that the model learns gradually and avoids the premature convergence that a large learning rate can cause.
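The effect of the learning rate can be illustrated on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent; this is a deliberately simplified stand-in, since the notebook trains with Adam, and the function name and learning-rate values are chosen only for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # the gradient of w**2 is 2*w
    return w

for lr in (0.01, 0.1, 1.1):      # illustrative values only
    print(lr, abs(gradient_descent(lr)))
```

In this toy setting a very small learning rate moves slowly toward the minimum at 0, a moderate one gets close within the same number of steps, and one that is too large overshoots further on every step and diverges.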

Task 9:

The inputs are a text and the model used for prediction. The output is the predicted probability of each sentiment (negative, neutral, positive) for that text. The results are mostly as expected, although some predictions are not very accurate. In most cases the model has fairly good precision and recall.
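The last step of the prediction, turning the model's raw scores into percentages, can be sketched in plain Python. The logits below are hypothetical values, not output from the trained model, and the label order is the one printed by the LabelEncoder earlier in the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Neutral", "Positive"]     # order produced by the LabelEncoder
toy_logits = [2.0, 0.5, -1.0]                    # hypothetical model output
toy_probs = softmax(toy_logits)
print({label: round(p, 3) for label, p in zip(labels, toy_probs)})
```

The predicted sentiment is the label with the highest probability; a large gap between logits yields the near-100% outputs seen above.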

## Exporting your results to PDF
1. Download your notebook with _File -> Download .ipynb_
1. Rename with your name like in other assignments, for example lastname_firstname_assignment5.ipynb
1. Submit the notebook file on Moodle