# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

## Task 1: Introduction

<img src="Images/BERT_diagrams.pdf" width="1000">

## Task 2: Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [1]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [2]:
df = pd.read_csv('Data/smile-annotations-final.csv', names=['id','text','category'])
df.set_index('id', inplace=True)
df.head()

id,text,category
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [3]:
df.category.unique()

array(['nocode', 'happy', 'not-relevant', 'angry', 'disgust|angry',
       'disgust', 'happy|surprise', 'sad', 'surprise', 'happy|sad',
       'sad|disgust', 'sad|angry', 'sad|disgust|angry'], dtype=object)

In [4]:
df.text.iloc[0] #first tweet

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [5]:
df.category.value_counts() 
# 'nocode' means the tweet carries no clear emotion.
# Tweets annotated with more than one emotion (e.g. 'happy|sad') are dropped below,
# so only single-label tweets are kept.
# The remaining categories are: happy, not-relevant, angry, surprise, sad, disgust.

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64

In [6]:
df.shape

(3085, 2)

In [3]:
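# Remove tweets with multiple categories (containing '|') and tweets labelled 'nocode'.
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
df = df[~df['category'].str.contains('|', regex=False) & (df['category'] != 'nocode')].copy()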

In [8]:
df.shape

(1481, 2)

In [4]:
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
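
The counts above are heavily skewed towards `happy`. One common way to quantify the skew is an inverse-frequency weight per class. The sketch below is only an illustration using the counts printed above; this notebook does not pass class weights to the model, and the `total / (n_classes * count)` heuristic is an assumption chosen for the example.

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({"happy": 1137, "not-relevant": 214, "angry": 57,
                  "surprise": 35, "sad": 32, "disgust": 6})

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = total / (n_classes * count),
# so rare classes receive proportionally larger weights.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    share = counts[label] / total
    print(f"{label:>12}: share={share:.3f}  weight={weight:.2f}")
```

Weights like these could, for example, be supplied to a weighted loss, but that would be a change to the training setup shown here.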

In [10]:
# The classes are imbalanced: 'happy' has 1137 tweets, while 'disgust' has only 6.

In [5]:
# Build a dictionary mapping each category name to an integer label
possible_labels = df.category.unique()

In [6]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [12]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}
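
The same mapping can be written as a dictionary comprehension, and inverting it recovers the category name from a predicted integer label. This is a minimal sketch; the label order is copied from the output above, and `label_dict_inverse` mirrors the helper built later inside `accuracy_per_class`.

```python
# Labels in the order they first appear (copied from label_dict above).
possible_labels = ["happy", "not-relevant", "angry", "disgust", "sad", "surprise"]

# Category name -> integer label.
label_dict = {label: index for index, label in enumerate(possible_labels)}

# Integer label -> category name, useful when reading model predictions.
label_dict_inverse = {index: label for label, index in label_dict.items()}

print(label_dict["sad"], label_dict_inverse[5])
```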

In [7]:
df['label']= df.category.map(label_dict)

In [14]:
df.label.value_counts()

0    1137
1     214
2      57
5      35
4      32
3       6
Name: label, dtype: int64

In [15]:
df.index.values

array([614484565059596288, 614746522043973632, 614877582664835073, ...,
       613678555935973376, 615246897670922240, 613016084371914753])

## Task 3: Training/Validation Split

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X = df.index.values
y = df.label.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=1, stratify=y)

In [10]:
df['data_type'] = ['not_set']*df.shape[0]
df.head()

id,text,category,label,data_type
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0,not_set
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0,not_set
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0,not_set
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0,not_set
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0,not_set


In [11]:
df.loc[X_train, 'data_type']= 'train'
df.loc[X_val, 'data_type']= 'val'

In [12]:
df.groupby(['category','label','data_type']).count()

category,label,data_type,text
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5
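
The per-class counts above follow from `stratify=y`, which keeps each label's share roughly equal in the training and validation sets. The idea can be sketched in plain Python as below. This is a simplified illustration, not scikit-learn's implementation; the function name `stratified_split` and the toy data are hypothetical, and the rounding of per-class sizes may differ from what `train_test_split` does.

```python
import random
from collections import defaultdict

def stratified_split(ids, labels, val_fraction=0.15, seed=1):
    """Split ids into train/val so each label keeps roughly the same share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in zip(ids, labels):
        by_label[label].append(item_id)

    train_ids, val_ids = [], []
    for members in by_label.values():
        rng.shuffle(members)                      # shuffle within each class
        n_val = round(len(members) * val_fraction)
        val_ids.extend(members[:n_val])           # first n_val go to validation
        train_ids.extend(members[n_val:])         # the rest go to training
    return train_ids, val_ids

# Toy data: 100 items of class 0 (ids 0..99) and 20 items of class 1 (ids 100..119).
ids = list(range(120))
labels = [0] * 100 + [1] * 20
train_ids, val_ids = stratified_split(ids, labels)
print(len(train_ids), len(val_ids))
```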


## Task 4: Loading Tokenizer and Encoding our Data

In [13]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [14]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)

In [15]:
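# Note: older transformers releases spelled these arguments
# `return_attention_masks=True` and `pad_to_max_length=True`.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # pad every sequence to max_length
    truncation=True,        # cut sequences longer than max_length
    max_length=256,
    return_tensors='pt'
)
encoded_data_valid = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)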

In [26]:
encoded_data_valid

{'input_ids': tensor([[ 101, 1030, 8223,  ...,    0,    0,    0],
         [ 101, 1030, 2120,  ...,    0,    0,    0],
         [ 101, 9107, 1996,  ...,    0,    0,    0],
         ...,
         [ 101, 2551, 2006,  ...,    0,    0,    0],
         [ 101, 1037, 2621,  ...,    0,    0,    0],
         [ 101, 2204, 2000,  ...,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])}
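
In the output above, every row of `input_ids` starts with 101 (the `[CLS]` token of `bert-base-uncased`) and ends in zeros (the `[PAD]` token), while `attention_mask` is 1 for real tokens and 0 for padding. The toy sketch below reproduces only that padding and masking step on already-tokenized ids; the helper `pad_and_mask` is hypothetical, and the real tokenizer additionally performs WordPiece tokenization and vocabulary lookup.

```python
PAD_ID = 0      # [PAD] in the bert-base-uncased vocabulary
CLS_ID = 101    # [CLS]
SEP_ID = 102    # [SEP]

def pad_and_mask(token_ids, max_length):
    """Wrap ids in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    ids = [CLS_ID] + token_ids[: max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)               # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

# Three arbitrary token ids, padded to a length of 8.
input_ids, attention_mask = pad_and_mask([1030, 8223, 2120], max_length=8)
print(input_ids)
print(attention_mask)
```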

In [16]:
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_valid['input_ids']
attention_masks_val = encoded_data_valid['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

In [17]:
dataset_train = TensorDataset(input_ids_train,
                             attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val,
                             attention_masks_val, labels_val)

In [30]:
len(dataset_train)

1258

In [31]:
len(dataset_val)

223

## Task 5: Setting up BERT Pretrained Model

In [18]:
from transformers import BertForSequenceClassification

In [19]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = len(label_dict),
    output_attentions = False,
    output_hidden_states= False
)

## Task 6: Creating Data Loaders

In [20]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [21]:
batch_size = 4  # 32 was used for the GPU-trained run

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation order does not affect the metrics, so sample sequentially.
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=32
)

## Task 7: Setting Up Optimizer and Scheduler

In [22]:
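# AdamW was deprecated and later removed from transformers; PyTorch's implementation is used here.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [23]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,
    eps=1e-8,
    weight_decay=0.0  # matches the default of the former transformers.AdamW
)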

In [24]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train) * epochs
)
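
With `num_warmup_steps=0`, this scheduler decays the learning rate linearly from its initial value to zero over all training steps. The sketch below shows the multiplier it applies at a given step, written in plain Python as an illustration; the function name `linear_schedule_multiplier` is hypothetical, and 315 steps per epoch assumes 1258 training examples with `batch_size = 4`.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
steps_per_epoch = 315               # ceil(1258 / 4)
total_steps = steps_per_epoch * 10  # epochs = 10

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(step, lr)
```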

## Task 8: Defining our Performance Metrics

The accuracy metric below follows the approach of the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [25]:
import numpy as np

In [26]:
from sklearn.metrics import f1_score

In [27]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1)
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
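
With `average='weighted'`, the F1 score is computed per class and then averaged with each class weighted by its number of true examples, which matters here because of the class imbalance. A plain-Python sketch of that calculation follows; it is an illustration of the definition, not scikit-learn's code, and the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n_true / total) * f1   # weight by class support
    return score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```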

In [28]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class:{label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

## Task 9: Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [29]:
import random

seed_val = 1
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [35]:
# Select the GPU if one is available, otherwise the CPU, and move the model there
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cpu


In [38]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
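
The `predictions` returned by `evaluate` are raw logits, one row per example and one column per class. Taking the argmax of a row gives the predicted label, and a softmax turns the row into probabilities without changing which index is largest. A small sketch in plain Python, with hypothetical logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 0.5, -1.0, 0.0, -0.5, 0.1]        # hypothetical logits for the 6 classes
probs = softmax(row)

# Argmax over the logits; softmax preserves the ordering, so probs gives the same index.
predicted = max(range(len(row)), key=row.__getitem__)
print(predicted, round(probs[predicted], 3))
```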


In [None]:
for epoch in tqdm(range(1, epochs+1)):
    # Put the model in training mode
    model.train()
    
    loss_train_total = 0
    # The progress bar shows how many batches of this epoch are done and how many remain
    progress_bar = tqdm(dataloader_train, desc='Epoch{:1d}'.format(epoch),
                        leave=False, disable=False)
    # Within each epoch, run backpropagation batch by batch
    for batch in progress_bar:
        # PyTorch accumulates gradients across backward() calls by default,
        # so reset them to zero at the start of every batch
        model.zero_grad()
        
        # Each batch is a tuple of three tensors (input ids, attention masks, labels);
        # move every tensor to the device the model is on
        batch = tuple(b.to(device) for b in batch)
        # Inputs to the BERT model
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        # Run the model; when labels are passed, it returns the loss and the logits
        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        # Backpropagation
        loss.backward()
        
        # Clip the total gradient norm to 1.0 so that updates cannot become exceptionally large
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # Update the progress bar with the loss of the current batch
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
        
    # Save the model weights (all layers) after each epoch
    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')
    # Report which epoch just finished
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    # Average training loss for this epoch
    tqdm.write(f'Training loss : {loss_train_avg}')
    
    # Track the validation loss to detect overfitting, which shows up as
    # the training loss going down while the validation loss goes up
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    
    # Weighted F1 score on the validation set
    val_f1 = f1_score_func(predictions, true_vals)
    # Print the validation loss and the weighted F1 score
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 score (weighted):{val_f1}')
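
The `clip_grad_norm_` call in the loop rescales all gradients together whenever their joint L2 norm exceeds the given maximum (1.0 here), and leaves them unchanged otherwise. The sketch below illustrates that rule on plain Python lists. It is a simplified stand-in, not PyTorch's implementation; the function name `clip_by_global_norm` and the example values are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm exceeds max_norm; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]   # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.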


## Task 10: Loading and Evaluating our Model

In [32]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

In [33]:
model.to(device)
pass

In [36]:
model.load_state_dict(
    torch.load('Models/finetuned_bert_epoch_1_gpu_trained.model',
    map_location=torch.device('cpu')))
# If loading succeeded, every weight the model expects was found in the checkpoint and no extra keys were present.


<All keys matched successfully>
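
The message above means that the parameter names stored in the checkpoint and the names the model expects are identical. Conceptually this is a comparison of two key sets, sketched below in plain Python. It is a simplified illustration rather than PyTorch's code; `compare_keys` and the parameter names are hypothetical.

```python
def compare_keys(expected, loaded):
    """Return (missing, unexpected) key lists for a checkpoint."""
    missing = sorted(set(expected) - set(loaded))      # expected by the model, absent in the file
    unexpected = sorted(set(loaded) - set(expected))   # present in the file, unknown to the model
    return missing, unexpected

# Hypothetical parameter names, for illustration only.
model_keys = ["encoder.weight", "encoder.bias", "classifier.weight", "classifier.bias"]
checkpoint = {name: None for name in model_keys}

print(compare_keys(model_keys, checkpoint))            # both lists empty: all keys match
print(compare_keys(model_keys, {"encoder.weight": None, "extra": None}))
```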

In [39]:
# Get the predictions (logits) and the true labels for the validation set
_, predictions, true_vals = evaluate(dataloader_val)





In [42]:
accuracy_per_class(predictions, true_vals)

Class:happy
Accuracy:170/171

Class:not-relevant
Accuracy:32/32

Class:angry
Accuracy:9/9

Class:disgust
Accuracy:1/1

Class:sad
Accuracy:5/5

Class:surprise
Accuracy:5/5
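
The per-class counts above can be combined into an overall accuracy and a macro (unweighted per-class) accuracy. The short calculation below uses only the numbers printed above. Note that several classes have five or fewer validation examples, so their per-class figures are very coarse.

```python
# (correct, total) per class, copied from the output above.
per_class = {"happy": (170, 171), "not-relevant": (32, 32), "angry": (9, 9),
             "disgust": (1, 1), "sad": (5, 5), "surprise": (5, 5)}

correct = sum(c for c, _ in per_class.values())
total = sum(n for _, n in per_class.values())

overall_accuracy = correct / total                                       # every example counts equally
macro_accuracy = sum(c / n for c, n in per_class.values()) / len(per_class)  # every class counts equally

print(f"{correct}/{total} correct, overall={overall_accuracy:.4f}, macro={macro_accuracy:.4f}")
```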



In [None]:
# The model was trained on a Google Colab GPU instance (k8) with batch_size = 32 and epochs = 10