<a href="https://colab.research.google.com/github/zen030/tech_review/blob/master/tech_review_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using BERT
###### zainalh2@illinois.edu

## 1. Introduction
This document illustrates how to use a pre-trained BERT model to classify the sentiment of Twitter tweets.

This document is intended for readers who:
1. Have no experience with deep NLP models such as BERT.
2. Have a basic understanding of Data Mining and Machine Learning.
3. Are curious about how to use and fine-tune a BERT model for a simple training task.
4. Have basic knowledge of Python 3, PyTorch, and Jupyter Notebook.
5. And, of course, look forward to having fun with Python code!

## 2. What is BERT and Text Sentiment Analysis?
### 2.1. What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a Natural Language Processing (NLP) pre-training model developed by Google (https://en.wikipedia.org/wiki/BERT_(language_model)).

Useful Internet sources for understanding the basic concepts of BERT:
1. BERT for dummies: https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03
2. The Illustrated BERT: http://jalammar.github.io/illustrated-bert/
3. The original BERT paper: https://arxiv.org/abs/1810.04805

### 2.2. What is Text Sentiment Analysis?

Text Sentiment Analysis is a branch of Text Mining that systematically identifies, extracts, quantifies, and studies subjective information in unstructured text data (https://en.wikipedia.org/wiki/Sentiment_analysis).

Imagine we have a “smart” module that takes text data as input. The module, implemented in software, outputs the sentiment found in the text, as illustrated by the diagram below:

![Sentiment_Analysis_Illustration.png](attachment:Sentiment_Analysis_Illustration.png)

In the illustration above, we pass in the text “I don’t like the apple. It’s rotten!”. As humans, we can easily tell that the opinion holder expresses a negative sentiment about an apple, and the reason is obvious to us: the apple is rotten. For a computer to understand the sentiment, however, a complex computational model is required. This document uses BERT as the model behind the “smart” module that detects the sentiment found in text data.

# 3. Case study
The main reference for this case study is the Coursera course https://www.coursera.org/learn/sentiment-analysis-bert/home/welcome. Some minor changes were made to illustrate the concepts and algorithms in code.

## 3.1. Dataset
### 3.1.1. Dataset Source citation
Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2

### 3.1.2. Text structure in the dataset
The dataset has 3085 lines (tweets). Each record consists of 3 columns:
- Column-1: The unique ID of the line/tweet
- Column-2: The tweet message
- Column-3: The sentiment label of the message (sad, happy, etc.)

Sample data from the dataset:

611537640857411584,"@britishmuseum @SenderosP The Rosetta Stone ;)",happy
- Column-1: 611537640857411584
- Column-2: "@britishmuseum @SenderosP The Rosetta Stone ;)"
- Column-3: happy
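
A record like the sample above can be parsed with Python's standard `csv` module, which handles the quoted tweet text. This is only a minimal sketch; the variable names are illustrative and are not taken from the notebook.

```python
import csv
import io

# One raw line copied from the sample record above.
raw_line = '611537640857411584,"@britishmuseum @SenderosP The Rosetta Stone ;)",happy\n'

# csv.reader understands the double quotes around the tweet text,
# so a comma inside a tweet would not break the three columns.
tweet_id, text, category = next(csv.reader(io.StringIO(raw_line)))

print(tweet_id)   # Column-1: the unique ID
print(text)       # Column-2: the tweet message
print(category)   # Column-3: the sentiment label
```

The notebook itself loads the whole file with `pd.read_csv`, which applies the same quoting rules.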

## 3.2. The goal of Case Study
The dataset consists of labelled Twitter tweets. Each label expresses the sentiment of the tweet. In the following code, we split the dataset into training and validation datasets. The training dataset is used to fine-tune the BERT model, and the validation dataset is then used to evaluate the accuracy of the trained model.

# 4. Software
The final code is implemented as a Jupyter Notebook. The source code is available at https://github.com/zen030/tech_review/blob/master/tech_review_case.ipynb

In [2]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [3]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## 4.1. Analyze the dataset using pandas DataFrame

In [4]:
import pandas as pd  # https://pandas.pydata.org/

In [5]:
# Load the dataset into dataframes
# dataframe columns: [id, text, category]
df = pd.read_csv('sample_data/smile-annotations-final.csv', names=['id', 'text', 'category'])
df.set_index('id', inplace=True)

In [6]:
# Let's take a look at what the data looks like
df.head()

id,text,category
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [7]:
# Let's take a look at each label and its count
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [8]:
# Filter out multi-label rows (the ones containing the | character) and the "nocode" label
# A raw string avoids Python's invalid-escape warning for the regex \|
df = df[~df.category.str.contains(r'\|')]
df = df[df.category != 'nocode']

In [9]:
# Let's review the label counts after removing the unwanted tweets/labels
# This removal is purely to keep our training task simple
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
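
The same filtering can be checked by hand on the label counts printed earlier. The sketch below uses plain Python instead of pandas, drops the multi-label and `nocode` entries, and shows how imbalanced the remaining classes are. The dictionary is copied from the first `value_counts()` output; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output before filtering.
raw_counts = {
    "nocode": 1572, "happy": 1137, "not-relevant": 214, "angry": 57,
    "surprise": 35, "sad": 32, "happy|surprise": 11, "happy|sad": 9,
    "disgust|angry": 7, "disgust": 6, "sad|disgust": 2, "sad|angry": 2,
    "sad|disgust|angry": 1,
}

# Keep single-label categories only, and drop "nocode".
kept = {label: n for label, n in raw_counts.items()
        if "|" not in label and label != "nocode"}

# Print each remaining class with its count and its share of the total.
total = sum(kept.values())
for label, n in kept.items():
    print(f"{label:<13}{n:>5}  {n / total:6.1%}")
print("total", total)
```

The "happy" class alone makes up roughly three quarters of the remaining tweets, which is worth keeping in mind when reading the evaluation results.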

In [10]:
# Assign an integer value to each text label. The new column is called "label"
# The dataframe structure is now [id, text, category, label]
possible_labels = df.category.unique()
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
df['label'] = df.category.replace(label_dict)    

In [11]:
# Let's review the new (integer) label
df.head()

id,text,category,label
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0
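
The label encoding above can also be written as a dictionary comprehension. The sketch below uses a short, made-up list of categories rather than the real dataframe, and also builds the inverse mapping that is needed later to turn predicted integers back into names.

```python
# Illustrative categories in order of appearance (not the full dataset).
categories = ["happy", "happy", "not-relevant", "angry", "happy", "disgust"]

# dict.fromkeys keeps the first occurrence of each value, in order,
# which mirrors the order returned by pandas' unique().
unique_labels = list(dict.fromkeys(categories))
label_dict = {label: index for index, label in enumerate(unique_labels)}

# Inverse mapping: integer label -> category name.
label_dict_inverse = {index: label for label, index in label_dict.items()}

encoded = [label_dict[c] for c in categories]
print(label_dict)
print(encoded)
```

Because the integers depend on the order in which categories first appear, the mapping should be stored together with any saved model.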


## 4.2. Split the dataset into training dataset and validation dataset
The dataset is split as follows:
- 85% as Training dataset
- 15% as Validation dataset

In [12]:
# We will use the sklearn library to split the dataset into training and validation datasets
from sklearn.model_selection import train_test_split # https://scikit-learn.org/

In [13]:
# Here we split the dataset as follows:
# 15% as the validation dataset
# 85% as the training dataset
# random_state is set to 42. Popular integer random seeds are 0 and 42.
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

In [14]:
# Add a new column called "data_type". Its possible values are:
# 'train' for the training dataset
# 'val' for the validation dataset
df['data_type'] = ['not_set']*df.shape[0]
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

# print dataframe to see the result
df

id,text,category,label,data_type
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0,train
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0,train
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0,train
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0,train
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0,train
...,...,...,...,...
611258135270060033,@_TheWhitechapel @Campaignforwool @SlowTextile...,not-relevant,1,train
612214539468279808,“@britishmuseum: Thanks for ranking us #1 in @...,happy,0,train
613678555935973376,MT @AliHaggett: Looking forward to our public ...,happy,0,train
615246897670922240,@MrStuchbery @britishmuseum Mesmerising.,happy,0,train


In [15]:
# Print the dataframe grouped by category, label and data_type
# We should see an 85% vs 15% distribution for each category/label
df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,text
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5
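
To make the idea of a stratified split concrete, here is a minimal pure-Python sketch that splits each class separately so that every class keeps roughly the same 85/15 ratio. It is not how scikit-learn implements `stratify`; the helper name `stratified_split` and the use of simple rounding are assumptions made for illustration.

```python
import random

def stratified_split(labels, val_fraction=0.15, seed=42):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # simple rounding (assumption)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val

# Class sizes of the filtered dataset.
class_sizes = {"happy": 1137, "not-relevant": 214, "angry": 57,
               "surprise": 35, "sad": 32, "disgust": 6}
labels = [name for name, n in class_sizes.items() for _ in range(n)]

train_idx, val_idx = stratified_split(labels)
val_counts = {name: sum(labels[i] == name for i in val_idx) for name in class_sizes}
print(len(train_idx), len(val_idx))
print(val_counts)
```

For these class sizes, simple rounding yields the same per-class validation counts as the grouped output above for the classes it lists. Without stratification, a class as small as "disgust" (6 tweets) could easily end up with no validation samples at all.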


## 4.3. Tokenizing and Encoding
Tokenization in BERT is another interesting topic to explore. BERT uses the WordPiece tokenization strategy.

Internet sources to explore this topic further:
- Original paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
- An article that explains BERT token embeddings: https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
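
The core of WordPiece at tokenization time is a greedy longest-match-first search: starting at the beginning of a word, take the longest prefix found in the vocabulary, then continue with the remainder, marking continuation pieces with `##`. The sketch below illustrates this on a tiny, hypothetical vocabulary. The real `bert-base-uncased` vocabulary is far larger, and the function name is illustrative.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unplayable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Splitting rare words into known pieces is what lets a fixed-size vocabulary cover informal tweet text reasonably well.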

In [16]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.2
  Downloading https://files.pythonhosted.org/packages/7c/a5/78be1a55b2ac8d6a956f0a211d372726e2b1dd2666bb537fea9b03abd62c/tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

In [17]:
from transformers import BertTokenizer # https://huggingface.co/transformers/model_doc/bert.html
from torch.utils.data import TensorDataset # https://pytorch.org/
import torch

In [18]:
# we create our tokenizer here
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [19]:
# The tokenizer reads the tweet text from the dataframe and encodes it into
# the integer representation expected by the pre-trained BERT model.
# truncation=True explicitly cuts tweets that are longer than max_length tokens.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    max_length=64,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True,     
    max_length=64,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


In [20]:
# Let's take a look at our text data in an integer representation!
# Don't worry, the BERT model understands these numbers the way we understand the text data
encoded_data_train

{'input_ids': tensor([[  101, 16092,  3897,  ...,     0,     0,     0],
        [  101,  1030, 27034,  ...,     0,     0,     0],
        [  101,  1030, 10682,  ...,     0,     0,     0],
        ...,
        [  101, 11047,  1030,  ...,     0,     0,     0],
        [  101,  1030,  3680,  ...,     0,     0,     0],
        [  101,  1030,  2120,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
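
The trailing zeros in `input_ids` are padding, and the `attention_mask` marks which positions hold real tokens (1) and which hold padding (0). The following sketch reproduces that idea for a single sequence in plain Python. The helper `pad_and_mask` is hypothetical and much simpler than the real tokenizer. In particular, the real tokenizer keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)              # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased;
# the two ids in between are illustrative.
example_ids = [101, 2023, 2003, 102]
input_ids, attention_mask = pad_and_mask(example_ids, max_length=8)

print(input_ids)
print(attention_mask)
```

Padding every tweet to the same length (64 in the notebook) is what allows the encoded tweets to be stacked into one rectangular tensor.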

In [21]:
# Create the Tensor dataset using the encoded data created in the previous step
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [22]:
# Total dataset
len(dataset_train)+len(dataset_val)

1481

In [23]:
# Number of labelled rows in the dataframe. It matches the total above,
# so every tweet is covered by the two TensorDatasets after the split.
df.label.count()

# Next, we feed the TensorDatasets to our BERT model

1481

## 4.4. Setting Pre-Trained BERT Model

The original BERT paper presented two model sizes:
- BERT BASE: 12 Encoder Layers
- BERT LARGE: 24 Encoder Layers

This case study uses the base-uncased model (uncased means that all characters in the text data are treated as lower-case characters).

In [24]:
# Here, we will use the pre-trained BERT model class BertForSequenceClassification
# For the complete list of models available on Hugging Face: https://huggingface.co/models
from transformers import BertForSequenceClassification

In [25]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element
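
The shapes in the printed module make it possible to estimate parameter counts by hand. As a small worked example, the sketch below adds up the parameters of the `BertEmbeddings` block only, using the sizes shown in the printout. It assumes the LayerNorm has one weight and one bias vector, which is what `elementwise_affine=True` indicates.

```python
# Shapes read from the BertEmbeddings printout above.
vocab_size, max_positions, token_types, hidden = 30522, 512, 2, 768

word = vocab_size * hidden          # word_embeddings
position = max_positions * hidden   # position_embeddings
token_type = token_types * hidden   # token_type_embeddings
layer_norm = 2 * hidden             # LayerNorm weight and bias

embedding_params = word + position + token_type + layer_norm
print(f"{embedding_params:,}")
```

The word embedding table dominates this block, since it stores one 768-dimensional vector for every vocabulary entry.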

## 4.5. Creating Data Loaders

In [26]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [27]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)
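
Conceptually, a data loader just cuts the dataset into index batches, in random order for training and in sequential order for validation. The sketch below mimics this in plain Python. The helper `iter_batches` is hypothetical, and the dataset sizes are derived rather than read from the notebook: 223 validation tweets is 15% of 1481 rounded up, which leaves 1258 for training.

```python
import math
import random

def iter_batches(n_items, batch_size, shuffle=False, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)  # rough analogue of RandomSampler
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]

batch_size = 32
n_train, n_val = 1258, 223  # derived sizes, see the paragraph above

train_batches = list(iter_batches(n_train, batch_size, shuffle=True))
val_batches = list(iter_batches(n_val, batch_size))

print(len(train_batches), len(val_batches))
print(len(train_batches[-1]), len(val_batches[-1]))  # the last batch is smaller
```

The number of training batches per epoch matters later, because the learning-rate scheduler needs the total number of optimizer steps.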

## 4.6. Setting Up Optimiser and Scheduler

In [28]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [29]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)

In [30]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
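
A linear schedule with warmup scales the base learning rate by a factor that rises linearly from 0 to 1 during the warmup steps and then falls linearly back to 0 at the last training step. With `num_warmup_steps=0`, as used here, the learning rate simply decays from `1e-5` to 0. The sketch below shows the formula in plain Python. The function name is illustrative, and the figure of 40 batches per epoch is an assumption based on 1258 training tweets and a batch size of 32.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: factor grows linearly from 0 to 1.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay phase: factor shrinks linearly to 0 at the final step.
        factor = max(0.0, (num_training_steps - step)
                     / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

total_steps = 40 * 10  # assumed 40 batches per epoch x 10 epochs
for step in (0, 100, 200, 400):
    print(step, linear_schedule_lr(step, 1e-5, 0, total_steps))
```

Because the scheduler is stepped once per batch, the decay is smooth across epochs rather than changing only at epoch boundaries.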

## 4.7. Defining our Performance Metrics

The per-class accuracy approach is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [31]:
import numpy as np

In [32]:
from sklearn.metrics import f1_score

In [33]:
def f1_score_func(preds, labels):
    """Weighted F1 score; preds are per-class logits, labels are true class ids."""
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [34]:
def accuracy_per_class(preds, labels):
    """Print, for each true class, how many samples were predicted correctly."""
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        # Restrict predictions and labels to the samples whose true class is `label`
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
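
The weighted F1 score computes an F1 value for each class and averages them, weighting each class by its number of true samples. The sketch below spells this out in plain Python on a tiny made-up example. It is meant to illustrate what `average='weighted'` does, not to replace scikit-learn's implementation.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no true positives.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true samples of this class
        total += f1 * support
    return total / len(y_true)

# Made-up labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(weighted_f1(y_true, y_pred))
```

Because large classes carry more weight, a high weighted F1 can coexist with very poor results on rare classes, which is why the per-class accuracy is reported as well.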

## 4.8. Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [35]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [36]:
def evaluate(dataloader_val):
    """Run the model on dataloader_val; return (average loss, logits, true labels)."""

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
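
`evaluate` returns raw logits, one score per class for each tweet. The predicted class is the index of the largest logit, and a softmax turns the logits into probabilities if a confidence value is wanted. The sketch below shows this for a single tweet in plain Python. The logits are made-up numbers, and the class order follows the integer labels used in this notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Index -> category name, in the order of the integer labels.
label_names = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]
logits = [2.1, 0.3, -0.5, -1.2, -0.9, -0.4]  # made-up scores for one tweet

probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(label_names[predicted], round(probs[predicted], 3))
```

Softmax does not change which class is largest, so the metric functions can apply `np.argmax` to the logits directly.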

In [37]:
from tqdm.notebook import tqdm # https://github.com/tqdm/tqdm

In [38]:
for epoch in tqdm(range(epochs)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # loss is already averaged over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 0
Training loss: 1.114614723622799
Validation loss: 0.7650820016860962
F1 Score (Weighted): 0.6656119824269878


Epoch 1
Training loss: 0.6983324564993382
Validation loss: 0.641399119581495
F1 Score (Weighted): 0.7599839515880076


Epoch 2
Training loss: 0.5476117491722107
Validation loss: 0.5739478724343436
F1 Score (Weighted): 0.7774018823873743


Epoch 3
Training loss: 0.44787438958883286
Validation loss: 0.52233196582113
F1 Score (Weighted): 0.8311979500320308


Epoch 4
Training loss: 0.38811107762157915
Validation loss: 0.5514254314558846
F1 Score (Weighted): 0.8119891535727453


Epoch 5
Training loss: 0.3241060607135296
Validation loss: 0.4778749815055302
F1 Score (Weighted): 0.8334828101644246


Epoch 6
Training loss: 0.2865843648090959
Validation loss: 0.46311125585011076
F1 Score (Weighted): 0.844822292660356


Epoch 7
Training loss: 0.25671841371804477
Validation loss: 0.4787481980664389
F1 Score (Weighted): 0.8380746206012943


Epoch 8
Training loss: 0.24421168845146896
Validation loss: 0.4674115266118731
F1 Score (Weighted): 0.8516421113320178


Epoch 9
Training loss: 0.231101331487298
Validation loss: 0.468402807201658
F1 Score (Weighted): 0.8516421113320178
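
Since a checkpoint is saved after every epoch, the log above can be used to decide which one to keep. The sketch below picks the epoch with the lowest validation loss and the epoch with the highest weighted F1, using the logged values rounded to four decimals. Selecting by validation loss is a common heuristic and is not something the notebook itself does.

```python
# (validation loss, weighted F1) per epoch, rounded from the log above.
history = [
    (0.7651, 0.6656), (0.6414, 0.7600), (0.5739, 0.7774), (0.5223, 0.8312),
    (0.5514, 0.8120), (0.4779, 0.8335), (0.4631, 0.8448), (0.4787, 0.8381),
    (0.4674, 0.8516), (0.4684, 0.8516),
]

best_by_loss = min(range(len(history)), key=lambda e: history[e][0])
best_by_f1 = max(range(len(history)), key=lambda e: history[e][1])  # first of any ties

print(f"lowest validation loss: epoch {best_by_loss}")
print(f"highest weighted F1:    epoch {best_by_f1}")
# File name follows the pattern used by torch.save in the training loop.
print(f"finetuned_BERT_epoch_{best_by_loss}.model")
```

In this run the training loss keeps falling through all ten epochs, while the validation loss stops improving noticeably after epoch 6, which suggests that the later epochs add little.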



In [39]:
# Reload each saved checkpoint and report the per-class accuracy on the validation set
for epoch in range(epochs):
    tqdm.write(f'EPOCH: {epoch}')
    model.load_state_dict(torch.load(f'finetuned_BERT_epoch_{epoch}.model', map_location=torch.device('cpu')))
    _, predictions, true_vals = evaluate(dataloader_validation)
    accuracy_per_class(predictions, true_vals)
    tqdm.write('########################################################################')

EPOCH: 0
Class: happy
Accuracy: 171/171

Class: not-relevant
Accuracy: 0/32

Class: angry
Accuracy: 0/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5

########################################################################
EPOCH: 1
Class: happy
Accuracy: 170/171

Class: not-relevant
Accuracy: 9/32

Class: angry
Accuracy: 2/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5

########################################################################
EPOCH: 2
Class: happy
Accuracy: 169/171

Class: not-relevant
Accuracy: 11/32

Class: angry
Accuracy: 3/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5

########################################################################
EPOCH: 3
Class: happy
Accuracy: 169/171

Class: not-relevant
Accuracy: 18/32

Class: angry
Accuracy: 5/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5

###########
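
The per-class figures can be condensed into two numbers that tell quite different stories: the overall accuracy, which is dominated by the large "happy" class, and the macro-averaged recall, which gives every class the same weight. The sketch below computes both for EPOCH 3, using the counts printed above. The variable names are illustrative.

```python
# (correct, total) per class at EPOCH 3, copied from the output above.
epoch3 = {"happy": (169, 171), "not-relevant": (18, 32), "angry": (5, 9),
          "disgust": (0, 1), "sad": (0, 5), "surprise": (0, 5)}

correct = sum(c for c, _ in epoch3.values())
total = sum(n for _, n in epoch3.values())
overall_accuracy = correct / total

# Recall of a class = correctly predicted samples / true samples of that class.
per_class_recall = {name: c / n for name, (c, n) in epoch3.items()}
macro_recall = sum(per_class_recall.values()) / len(per_class_recall)

print(f"overall accuracy: {overall_accuracy:.3f}")
print(f"macro recall:     {macro_recall:.3f}")
```

The gap between the two values reflects the three smallest classes, for which no validation tweet is classified correctly at this epoch.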

# 5. Evaluation Result

Machine used to train the model:
- System Model: DELL Precision 7520
- Operating System: Microsoft Windows 10 Enterprise
- Processor: Intel(R) Xeon(R) CPU E3-1545M v5 @ 2.90GHz, 2901 MHz, 4 Core(s), 8 Logical Processor(s)
- Installed Physical Memory (RAM): 32 GB

Execution time to train the model: 11 hours 15 minutes (4 EPOCHS)


# 6. Summary
The model classifies “Happy” and “Non-Relevant” tweets better than the other sentiments, most probably because far more training data is available for these two classes.
From EPOCH 3 onward, the accuracy for “Happy” drops slightly (from 171/171 to 169/171).
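
| EPOCH | Happy | Non-Relevant | Angry | Disgust | Sad | Surprise |
|---|---|---|---|---|---|---|
| EPOCH 1 | 171/171 | 0/32 | 0/9 | 0/1 | 0/5 | 0/5 |
| EPOCH 2 | 171/171 | 9/32 | 0/9 | 0/1 | 0/5 | 0/5 |
| EPOCH 3 | 169/171 | 11/32 | 0/9 | 0/1 | 0/5 | 0/5 |
| EPOCH 4 | 169/171 | 11/32 | 0/9 | 0/1 | 0/5 | 0/5 |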