# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

## Introduction

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

##  Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [3]:
import torch
import pandas as pd
from tqdm.notebook import trange, tqdm

In [4]:
# tqdm is a fast, extensible progress bar for Python and the CLI

In [5]:

for i in trange(10):
    print(i)

  0%|          | 0/10 [00:00<?, ?it/s]

0
1
2
3
4
5
6
7
8
9


In [2]:
torch.cuda.is_available()

True

In [3]:
df = pd.read_csv('Data/smile-annotations-final.csv',
                 names=['id', 'text', 'category'])


In [4]:
df.set_index('id', inplace=True)

In [5]:
df.text.iloc[0]

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [6]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [7]:
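# Drop tweets annotated with more than one emotion (labels joined by '|')
df = df[~df.category.str.contains(r'\|')]

In [8]:
# Drop tweets labelled 'nocode'
df = df[df.category != 'nocode']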
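
The backslash in the first filter matters: `str.contains` treats its pattern as a regular expression by default, and an unescaped `|` means alternation. A small sketch on a made-up frame (the rows below are illustrative, not taken from the dataset) shows the escaped pattern alongside the equivalent `regex=False` form.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({'category': ['happy', 'happy|sad', 'nocode', 'sad']})

# Escaped regex and literal match select the same multi-label rows.
multi_regex = toy.category.str.contains(r'\|')
multi_plain = toy.category.str.contains('|', regex=False)
print(multi_regex.equals(multi_plain))  # True

# Keep single-label rows, then drop the 'nocode' rows.
single = toy[~multi_regex]
single = single[single.category != 'nocode']
print(single.category.tolist())  # ['happy', 'sad']
```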

In [9]:
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

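The classes are heavily imbalanced: `happy` accounts for 1,137 of the 1,481 remaining tweets, while `disgust` has only 6.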
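
To put numbers on the imbalance, the short sketch below recomputes each class's share from the counts printed above. The `counts` dictionary is copied by hand from that output for illustration; in the notebook the same figures could be obtained with `df.category.value_counts(normalize=True)`.

```python
# Class counts copied from the value_counts() output above (illustrative).
counts = {
    'happy': 1137, 'not-relevant': 214, 'angry': 57,
    'surprise': 35, 'sad': 32, 'disgust': 6,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f'{label:<13}{share:6.1%}')
```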

In [10]:
possible_labels = df.category.unique()

In [11]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label]= index

In [12]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}

In [13]:
df['label'] = df.category.replace(label_dict)
df.head()

| id | text | category | label |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 |


In [14]:
df['text'].iloc[0]

'Dorian Gray with Rainbow Scarf #LoveWins (from @britishmuseum http://t.co/Q4XSwL0esu) http://t.co/h0evbTBWRq'

## Training/Validation Split

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
df.index.values

array([614484565059596288, 614746522043973632, 614877582664835073, ...,
       613678555935973376, 615246897670922240, 613016084371914753],
      dtype=int64)

In [17]:
df.label.values

array([0, 0, 0, ..., 0, 0, 1], dtype=int64)

In [18]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=17,
    stratify=df.label.values
)
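
The `stratify` argument is what keeps the rare classes represented in both splits. As a small sanity check on made-up labels (80 samples of one class and 20 of another, not the SMILE data), the sketch below shows that a stratified 85/15 split preserves the 80/20 ratio in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 sample IDs with an 80/20 class imbalance.
ids = np.arange(100)
labels = np.array([0] * 80 + [1] * 20)

ids_train, ids_val, y_tr, y_va = train_test_split(
    ids,
    labels,
    test_size=0.15,
    random_state=17,
    stratify=labels,
)

# Per-class counts in each split.
print(np.bincount(y_tr).tolist())  # [68, 17]
print(np.bincount(y_va).tolist())  # [12, 3]
```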

In [19]:
X_train

array([614767094345936896, 610755488372948992, 610609791073931266, ...,
       613744184495894529, 610873494910443520, 610741907426267136],
      dtype=int64)

In [20]:
y_train

array([0, 0, 0, ..., 0, 0, 2], dtype=int64)

In [21]:
df['data_type']= ['no_set']*df.shape[0]

In [23]:
df.loc[X_train, 'data_type']='train'
df.loc[X_val, 'data_type']='val'

In [24]:
df.groupby(['category','label','data_type']).count()

| category | label | data_type | text (count) |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |


##  Loading Tokenizer and Encoding our Data

In [25]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [26]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)

In [27]:
df.data_type=='train'

id
614484565059596288    True
614746522043973632    True
614877582664835073    True
611932373039644672    True
611570404268883969    True
                      ... 
611258135270060033    True
612214539468279808    True
613678555935973376    True
615246897670922240    True
613016084371914753    True
Name: data_type, Length: 1481, dtype: bool

In [28]:
df[df.data_type=='train']

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | train |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | train |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | train |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | train |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | train |
| ... | ... | ... | ... | ... |
| 611258135270060033 | @_TheWhitechapel @Campaignforwool @SlowTextile... | not-relevant | 1 | train |
| 612214539468279808 | “@britishmuseum: Thanks for ranking us #1 in @... | happy | 0 | train |
| 613678555935973376 | MT @AliHaggett: Looking forward to our public ... | happy | 0 | train |
| 615246897670922240 | @MrStuchbery @britishmuseum Mesmerising. | happy | 0 | train |


In [29]:
df[df.data_type=='train'].text.values

array(['Dorian Gray with Rainbow Scarf #LoveWins (from @britishmuseum http://t.co/Q4XSwL0esu) http://t.co/h0evbTBWRq',
       '@SelectShowcase @Tate_StIves ... Replace with your wish which the artist uses in next installation! It was entralling!',
       '@Sofabsports thank you for following me back. Great to hear from a diverse &amp; interesting panel #DefeatingDepression @RAMMuseum',
       ...,
       'MT @AliHaggett: Looking forward to our public engagement event #DefeatDepression @RAMMuseum @UofE_Research tonight http://t.co/0JHoIWCFfI',
       '@MrStuchbery @britishmuseum Mesmerising.',
       '@NationalGallery The 2nd GENOCIDE against #Biafrans as promised by #Buhari has begun,3days of unreported aerial Bombardment in #Biafraland'],
      dtype=object)

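We encode the texts with [tokenizer.batch_encode_plus](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.batch_encode_plus), which tokenizes every tweet, adds the special `[CLS]` and `[SEP]` tokens, pads the batch to a common length, and returns the attention masks.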

In [30]:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    # pad_to_max_length is deprecated in favour of the padding argument
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

In [31]:
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    # pad_to_max_length is deprecated in favour of the padding argument
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors='pt'
)
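
As a rough illustration of what `padding=True` and `return_attention_mask=True` produce, the sketch below pads a few made-up token-ID sequences to the length of the longest one and builds the matching masks. The IDs and the helper `pad_batch` are hypothetical and this is not the tokenizer's actual implementation; it only assumes a padding ID of 0, which is the `[PAD]` ID in `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up IDs; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 2023, 2003, 102], [101, 7592, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 2023, 2003, 102], [101, 7592, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```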

For the training set:

In [32]:
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values) 

For the validation set:

In [33]:
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values) 

Next, we wrap the input IDs, attention masks, and labels in a `TensorDataset` for both the training and the validation set.

In [34]:
dataset_train = TensorDataset(
    input_ids_train,
    attention_masks_train,
    labels_train
)

In [35]:
dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val,
                            labels_val
)

In [36]:
len(dataset_train)

1258

In [37]:
len(dataset_val)

223

## Setting up BERT Pretrained Model

In [38]:
from transformers import BertForSequenceClassification

In [39]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', 
     num_labels=len(label_dict),
     output_attentions=False,
     output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Creating Data Loaders

In [40]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [6]:
#In Google Colab -- GPU Instance (k80)
#batch_size =32
#epoch =10

In [41]:
batch_size = 4 #32
dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)


dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size 
)

## Setting Up Optimizer and Scheduler

In [42]:
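# transformers.AdamW is deprecated; torch.optim.AdamW is used instead
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [43]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,           # the BERT paper recommends 2e-5 to 5e-5
    eps=1e-8,
    weight_decay=0.0   # matches the default of the former transformers.AdamW
)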

In [44]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=len(dataloader_train)*epochs
)
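
With `num_warmup_steps=0`, the scheduler simply decays the learning rate linearly from its initial value to zero over all training steps. The sketch below reproduces that multiplier in plain Python to show the shape of the schedule; `linear_schedule_factor` is a hypothetical helper written for illustration, and the step count assumes 315 batches per epoch (1,258 training examples at a batch size of 4).

```python
import math


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
steps_per_epoch = math.ceil(1258 / 4)  # batches per epoch
total_steps = steps_per_epoch * 10     # 10 epochs

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, total_steps)
    print(f'step {step:>4}: lr = {lr:.2e}')
```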

## Defining our Performance Metrics

The approach to computing these metrics is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [45]:
import numpy as np

In [46]:
from sklearn.metrics import f1_score

In [47]:
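# Example: scores for one sample over the six classes
# preds = [0.9, 0.05, 0.05, 0, 0, 0]
# After argmax the predicted class is 0, i.e. one-hot [1, 0, 0, 0, 0, 0]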

In [48]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis =1 ).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
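
Because the classes are so imbalanced, `average='weighted'` matters here: it computes an F1 score per class and averages them weighted by each class's number of true samples. The NumPy sketch below spells that out on a small made-up example; `weighted_f1` is a hypothetical helper for illustration, not scikit-learn's implementation.

```python
import numpy as np


def weighted_f1(labels, preds):
    """Support-weighted mean of per-class F1 scores."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    total = 0.0
    for cls in np.unique(labels):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its number of true samples (its support).
        support = np.sum(labels == cls)
        total += support * f1
    return total / len(labels)


# Made-up labels: per-class F1 is 2/3, 0.8 and 0.0 with supports 3, 2 and 1.
labels = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 0]
score = weighted_f1(labels, preds)
print(round(score, 3))  # 0.6
```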

In [49]:
def accuracy_per_class(preds, labels):
    label_dict_inverse={v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis =1 ).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_pred = preds_flat[labels_flat== label]
        y_true = labels_flat[labels_flat== label]
        print(f'Class:{label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_pred[y_pred==label])}/{len(y_true)}\n')

##  Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [50]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [51]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


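During evaluation we only need forward passes, so two things change compared with training. First, `model.eval()` switches layers such as dropout to inference mode. Second, wrapping the forward pass in a `torch.no_grad()` block disables gradient tracking, which saves memory. The usual approach is to wrap the validation data in a `Dataset` and `DataLoader` and collect the predictions batch by batch.

In generic PyTorch code, assuming `valX` and `valY` are tensors holding the complete validation inputs and targets and `criterion` is the loss function, evaluation on the validation set looks like this:
```python
model.eval()              # inference mode (disables dropout)
with torch.no_grad():     # no gradient tracking
    y_pred = model(valX)
    val_loss = criterion(y_pred, valY)
```

and on the test set, with `test` and `testY` defined analogously:
```python
model.eval()              # inference mode (disables dropout)
with torch.no_grad():     # no gradient tracking
    y_pred = model(test)
    test_loss = criterion(y_pred, testY)
```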
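
To see both effects in isolation, here is a minimal sketch on a toy network. The tiny `nn.Sequential` model and the random inputs are stand-ins chosen for illustration and have nothing to do with BERT; the point is only that, in eval mode and under `torch.no_grad()`, dropout is inactive and the output carries no gradient history.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(3, 4)

# Training mode: dropout is active and outputs track gradients.
toy_model.train()
out_train = toy_model(x)
print(out_train.requires_grad)    # True

# Evaluation: dropout is disabled and no graph is built.
toy_model.eval()
with torch.no_grad():
    out_a = toy_model(x)
    out_b = toy_model(x)
print(out_a.requires_grad)        # False
print(torch.equal(out_a, out_b))  # True: repeated passes give identical outputs
```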

In [52]:
def evaluate(dataloader_val):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    for batch in tqdm(dataloader_val):
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


In [53]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch),
                        leave=False,
                        disable=False)
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs ={
            'input_ids'    :batch[0],
            'attention_mask':batch[1],
            'labels'        :batch[2]
        }
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
    
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        # The loss is already averaged over the batch; len(batch) is the tuple length (3), not the batch size
        progress_bar.set_postfix(
            {'training_loss': '{:.3f}'.format(loss.item())})
        
    #torch.save(model.state_dict(),f'Models/BERT_ft_epoch{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')

# Save only the final model; uncomment the torch.save line inside the loop to keep one file per epoch
torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')

Epoch 1
Training loss: 0.7975574296973055
Validation loss: 0.6059155837699238
F1 Score (weighted): 0.762975916339145

Epoch 2
Training loss: 0.4435813750036889
Validation loss: 0.5948948868151221
F1 Score (weighted): 0.8381931883679935

Epoch 3
Training loss: 0.2983576275445225
Validation loss: 0.5538139296175879
F1 Score (weighted): 0.8431542233760785

Epoch 4
Training loss: 0.18270550284845133
Validation loss: 0.5259785884367635
F1 Score (weighted): 0.8662434969638529

Epoch 5
Training loss: 0.11218177344158499
Validation loss: 0.6620622687145702
F1 Score (weighted): 0.8535941751228626

Epoch 6
Training loss: 0.05948421741458809
Validation loss: 0.6958074946513599
F1 Score (weighted): 0.8637082897172584

Epoch 7
Training loss: 0.04381906234674037
Validation loss: 0.7028247105024222
F1 Score (weighted): 0.8623919844487555

Epoch 8
Training loss: 0.028735312574546753
Validation loss: 0.7281462288156035
F1 Score (weighted): 0.8645974679453454

Epoch 9
Training loss: 0.02450285246531065
Validation loss: 0.7339509774070133
F1 Score (weighted): 0.8674038975615305

Epoch 10
Training loss: 0.021128401538405183
Validation loss: 0.7412530884700702
F1 Score (weighted): 0.8648362667790782


When saving a general checkpoint, to be used for either inference or resuming training, you must save more than just the model’s state_dict. It is important to also save the optimizer’s state_dict, as this contains buffers and parameters that are updated as the model trains. Other items that you may want to save are the epoch you left off on, the latest recorded training loss, external torch.nn.Embedding layers, etc. As a result, such a checkpoint is often 2~3 times larger than the model alone.

To save multiple components, organize them in a dictionary and use torch.save() to serialize the dictionary. A common PyTorch convention is to save these checkpoints using the .tar file extension.
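
A minimal sketch of such a checkpoint is shown below. It uses a toy `nn.Linear` model and a temporary file so that it runs on its own; the dictionary keys (`epoch`, `model_state_dict`, `optimizer_state_dict`, `loss`) follow a common convention but are otherwise arbitrary, and the stored epoch and loss values are placeholders.

```python
import os
import tempfile

import torch
from torch import nn

toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-5)

# Bundle everything needed to resume training into one dictionary.
checkpoint = {
    'epoch': 10,   # placeholder value
    'model_state_dict': toy_model.state_dict(),
    'optimizer_state_dict': toy_optimizer.state_dict(),
    'loss': 0.02,  # placeholder value
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.tar')
    torch.save(checkpoint, path)

    # Restore into freshly constructed objects.
    restored_model = nn.Linear(4, 2)
    restored_optimizer = torch.optim.AdamW(restored_model.parameters(), lr=1e-5)
    loaded = torch.load(path, map_location='cpu')
    restored_model.load_state_dict(loaded['model_state_dict'])
    restored_optimizer.load_state_dict(loaded['optimizer_state_dict'])
    start_epoch = loaded['epoch'] + 1

print(start_epoch)  # 11
```

When resuming training, call `restored_model.train()` before continuing; for inference, call `restored_model.eval()` instead.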

##  Loading and Evaluating our Model

In [54]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_dict),
    output_attentions=False,
    output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Both messages are expected. We are loading the `bert-base-uncased` checkpoint, which was trained using an architecture similar to `BertForPreTraining`, into a `BertForSequenceClassification` model.

This means that:

- The layers that `BertForPreTraining` has but `BertForSequenceClassification` does not have are discarded.
- The layers that `BertForSequenceClassification` has but `BertForPreTraining` does not have are randomly initialized.

In other words, a freshly loaded `BertForSequenceClassification` model will not perform well until it has been fine-tuned 🙂.

The discarded weights listed in the warning (`cls.predictions.*` and `cls.seq_relationship.*`) belong to the pretraining heads, which play no part in computing the classification loss, so there is no need to worry about the warning. In this section the fine-tuned weights are loaded from disk a few cells below, which overwrites the random initialization.

In [55]:
len(label_dict)

6

In PyTorch, the learnable parameters (i.e. weights and biases) of a `torch.nn.Module` model are contained in the model's parameters (accessed with `model.parameters()`). A `state_dict` is simply a Python dictionary object that maps each layer to its parameter tensor.

In [56]:
# Print model's state_dict
#print("Model's state_dict:")
#for param_tensor in model.state_dict():
#    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

In [57]:
# Print optimizer's state_dict
#print("Optimizer's state_dict:")
#for var_name in optimizer.state_dict():
#    print(var_name, "\t", optimizer.state_dict()[var_name])

In [58]:
# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [59]:
model.to(device)
pass
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

In [60]:
PATH='./Models/BERT_ft_epoch10.model'

In [61]:
model.load_state_dict(torch.load(PATH, map_location=device))

<All keys matched successfully>

When loading a model on a GPU that was trained and saved on GPU, simply convert the initialized model to a CUDA optimized model using model.to(torch.device('cuda')). Also, be sure to use the .to(torch.device('cuda')) function on all model inputs to prepare the data for the model. Note that calling my_tensor.to(device) returns a new copy of my_tensor on GPU. It does NOT overwrite my_tensor. Therefore, remember to manually overwrite tensors: my_tensor = my_tensor.to(torch.device('cuda')).
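
The reassignment pattern can be sketched as follows. The snippet falls back to the CPU when no GPU is available so that it runs anywhere, and the tensors are made-up stand-ins for a batch from the data loader. One caveat worth knowing: if a tensor already lives on the target device, `.to(device)` returns the tensor itself rather than a copy, so the reassignment is harmless either way.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-ins for (input_ids, attention_mask, labels) from a DataLoader.
batch = (
    torch.tensor([[101, 7592, 102]]),
    torch.tensor([[1, 1, 1]]),
    torch.tensor([0]),
)

# .to(device) does not modify the tensor in place, so keep the returned value.
batch = tuple(t.to(device) for t in batch)
print(all(t.device.type == device.type for t in batch))  # True
```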

In [62]:
_, predictions, true_vals = evaluate(dataloader_val)

  0%|          | 0/56 [00:00<?, ?it/s]

In [63]:
accuracy_per_class(predictions, true_vals)

Class:happy
Accuracy:161/171

Class:not-relevant
Accuracy:20/32

Class:angry
Accuracy:8/9

Class:disgust
Accuracy:0/1

Class:sad
Accuracy:2/5

Class:surprise
Accuracy:2/5

