<a href="https://colab.research.google.com/github/skanderbenmansour/nlp_study_group/blob/master/tyler/bert_course/project_to_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

In [7]:
! pip install transformers --quiet


In [8]:
import torch
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup
import numpy as np
from sklearn.metrics import f1_score
import random
from torch.nn.utils import clip_grad_norm_
import os

## Download data and setup directory

In [3]:
#from google.colab import drive
#drive.mount('/content/drive')

In [9]:
project_dir = '/content/drive/My Drive/bert_course'

In [11]:
os.makedirs(f'{project_dir}/Models',exist_ok=True)
os.makedirs(f'{project_dir}/Data',exist_ok=True)

In [14]:
! wget https://ndownloader.figshare.com/files/4988956 -O '/content/drive/My Drive/bert_course/Data/smile-annotations-final.csv'

--2020-08-28 19:45:39--  https://ndownloader.figshare.com/files/4988956
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 34.246.93.132, 34.242.50.74, 18.202.7.12, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|34.246.93.132|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv [following]
--2020-08-28 19:45:41--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.20.124
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.20.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 429669 (420K) [binary/octet-stream]
Saving to: ‘/content/drive/My Drive/bert_course/Data/smile-annotations-final.csv’


2020-08-28 19:45:43 (334 KB/s) - ‘/content/drive/My Drive/bert_course/Data/smile-annot

## Task 1: Introduction

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Task 2: Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [16]:
names = 'id text category'.split()
data_path = f'{project_dir}/Data/smile-annotations-final.csv'
df = pd.read_csv(data_path,names=names)

In [17]:
df.set_index('id',inplace=True)

In [18]:
df.text.sample().values[0]

'.@NationalGallery #AskTheGallery perhaps the questions may be better answered here(?)  information@ng-london.org.uk +44 (0)20 7747 2885'

In [19]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [20]:
df[df.category=='sad'].sample(5).text.values

array(['@britishmuseum @thehistoryguy shame about the idiots who post rude comments throughout!',
       '@JHGHendriks @britishmuseum @aroberts_andrew estimates range from 15000 to 20000 horses killed/severly wounded',
       'After 29 yrs #Cézanne #painting to leave UK unless funds raised @FitzMuseum_UK via @an_artnews http://t.co/VjDNl3XU7C http://t.co/0frSDQ5LHN',
       'Found this yesterday @BritishMuseum - desperately needs more info on the exhibit though :( Pillar of Emperor Ashoka http://t.co/0XtD8ly3nX',
       '@britishmuseum Wish you could extend the exhibition, 😭, I will only be in London in August...'],
      dtype=object)

In [21]:
keep_cat = 'happy not-relevant angry surprise sad disgust'.split()
df = df[df.category.isin(keep_cat)].copy()

In [22]:
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
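
The remaining classes are heavily imbalanced: `happy` accounts for most of the single-label tweets. As a rough illustration that is not part of the original notebook, the sketch below computes each class's share and a hypothetical inverse-frequency weight from the counts printed above. The formula `total / (n_classes * count)` is an assumption borrowed from common practice; the notebook does not apply any class weighting during training.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "happy": 1137,
    "not-relevant": 214,
    "angry": 57,
    "surprise": 35,
    "sad": 32,
    "disgust": 6,
}

total = sum(counts.values())
n_classes = len(counts)

# Share of the data per class, and a hypothetical "balanced" weight:
# total / (n_classes * count). Rare classes get larger weights.
shares = {cat: n / total for cat, n in counts.items()}
weights = {cat: total / (n_classes * n) for cat, n in counts.items()}

for cat in counts:
    print(f"{cat:>12}: share={shares[cat]:.3f}  weight={weights[cat]:.2f}")
```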

In [23]:
cat2idx = {cat:idx for idx,cat in enumerate(list(df.category.unique()))}

In [24]:
cat2idx

{'angry': 2,
 'disgust': 3,
 'happy': 0,
 'not-relevant': 1,
 'sad': 4,
 'surprise': 5}
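
The mapping above follows the order in which categories first appear in the DataFrame, so it can change if the rows are reordered. A minimal sketch of a reproducible alternative, assuming a sorted order is acceptable (the `build_label_map` helper is hypothetical, and the indices it produces differ from the ones used in the rest of this notebook):

```python
def build_label_map(categories):
    """Map each distinct category to an integer index in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}

# Toy list of categories; in the notebook this would be df.category.
cats = ["happy", "not-relevant", "angry", "happy", "disgust", "sad", "surprise"]
cat2idx_sorted = build_label_map(cats)

# Inverse mapping, useful for turning predictions back into names.
idx2cat_sorted = {idx: cat for cat, idx in cat2idx_sorted.items()}
print(cat2idx_sorted)
```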

In [25]:
df['label'] = df.category.map(cat2idx).copy()

In [26]:
df.sample(5)

id,text,category,label
612589523348926464,@britishmuseum the longest day then new beginn...,happy,0
611849934061588480,@MuseeLouvre @museiincomune @leCMN @GoldUnveil...,not-relevant,1
610544217971118080,@PlatformLondon @NationalGallery lets find out...,angry,2
612583484733980673,"@_TheWhitechapel comme au @PalaisdeTokyo, les ...",happy,0
611464875576066048,@cheltcollege L6 History of Art Dept on way to...,happy,0


In [32]:
cleaned_path = f'{project_dir}/Data/cleaned.csv'
df.to_csv(cleaned_path)

## Task 3: Training/Validation Split

In [33]:
df = pd.read_csv(cleaned_path)

In [34]:
x_train,x_val,y_train,y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size = .15,
    random_state = 17,
    stratify=df.label.values)

In [35]:
df['data_type'] = 'not_set'

In [36]:
x_train

array([ 598,  372,  740, ...,   79, 1237,   40])

In [37]:
df.loc[x_train,'data_type'] = 'train'
df.loc[x_val,'data_type'] = 'val'

In [38]:
df.groupby(['category','label','data_type']).count()

category,label,data_type,id,text
angry,2,train,48,48
angry,2,val,9,9
disgust,3,train,5,5
disgust,3,val,1,1
happy,0,train,966,966
happy,0,val,171,171
not-relevant,1,train,182,182
not-relevant,1,val,32,32
sad,4,train,27,27
sad,4,val,5,5
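
Stratifying keeps each class's share roughly equal in both splits, which matters here because some classes have only a handful of examples. The sketch below is a simplified, pure-Python illustration of the idea, not scikit-learn's actual algorithm. `stratified_split` is a hypothetical helper and its rounding rule is an assumption, so per-class counts may differ slightly from `train_test_split` on other data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=17):
    """Return (train_idx, val_idx) with roughly test_size of each class in val."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Assumed rounding rule: at least one validation example per class.
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels with the same class counts as the cleaned dataset in Task 2.
labels = [0] * 1137 + [1] * 214 + [2] * 57 + [3] * 6 + [4] * 32 + [5] * 35
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```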


## Task 4: Loading Tokenizer and Encoding our Data

In [39]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)


In [40]:
encoded_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # truncate explicitly to max_length
    max_length=256,
    return_tensors='pt')

encoded_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # truncate explicitly to max_length
    max_length=256,
    return_tensors='pt')


In [41]:
input_ids_train = encoded_train['input_ids']
attn_mask_train = encoded_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_val['input_ids']
attn_mask_val = encoded_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

In [42]:
data_train = TensorDataset(input_ids_train,attn_mask_train,labels_train)

data_val = TensorDataset(input_ids_val,attn_mask_val,labels_val)

In [43]:
len(data_train),len(data_val)

(1258, 223)
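
Each encoded example is padded to a fixed length, and the attention mask marks which positions hold real tokens. A toy sketch of that bookkeeping follows. The token ids and the `pad_and_mask` helper are illustrative assumptions, not the tokenizer's real output or API; in particular, a real tokenizer keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

ids, mask = pad_and_mask([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```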

## Task 5: Setting up BERT Pretrained Model

In [44]:
num_label = len(cat2idx)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                      num_labels=num_label,
                                      output_attentions=False,
                                      output_hidden_states=False)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [45]:
model.device

device(type='cpu')

## Task 6: Creating Data Loaders

In [46]:
train_batch_size = 32 # normally use 32
val_batch_size = 32 # normally use 32

In [47]:
sampler = RandomSampler(data_train)
train_loader = DataLoader(data_train,sampler=sampler,batch_size=train_batch_size)

sampler = SequentialSampler(data_val)  # keep validation order fixed; shuffling is not needed for evaluation
val_loader = DataLoader(data_val,sampler=sampler,batch_size=val_batch_size)

## Task 7: Setting Up Optimizer and Scheduler

In [48]:
lr = 1e-5 ## the BERT paper suggests 2e-5 to 5e-5 for fine-tuning; 1e-5 is slightly below that range
eps = 1e-8
optimizer = AdamW(model.parameters(),lr=lr,eps=eps)

In [49]:
epochs = 10
num_warmup_steps = 0
num_training_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,num_warmup_steps=num_warmup_steps,
                                            num_training_steps=num_training_steps)

In [50]:
num_training_steps

400
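
With 1258 training examples and a batch size of 32, each epoch has 40 batches, which gives the 400 total steps shown above. The sketch below reproduces that arithmetic and the shape of a linear warmup-then-decay schedule in plain Python. `linear_lr` is a hypothetical helper written from the usual definition of such a schedule, not a call into `transformers`.

```python
import math

def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(1258 / 32)   # batches per epoch
total_steps = steps_per_epoch * 10       # 10 epochs
print(steps_per_epoch, total_steps)

# With zero warmup steps, the rate starts at base_lr and decays linearly.
for step in (0, 100, 200, 400):
    print(step, linear_lr(step, base_lr=1e-5, num_warmup_steps=0,
                          num_training_steps=total_steps))
```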

## Task 8: Defining our Performance Metrics

The per-class accuracy approach is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [51]:
idx2cat = {idx:cat for cat,idx in cat2idx.items()}

In [62]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [68]:
def accuracy_per_class(preds, labels, idx2cat):
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {idx2cat[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
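
For reference, a weighted F1 score averages the per-class F1 scores, weighting each by the number of true examples in that class. A small pure-Python sketch of that definition follows. The `weighted_f1` helper and the toy labels are illustrative; the notebook itself relies on scikit-learn's `f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy example: one minority-class example is misclassified as the majority class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```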

## Task 9: Creating our Training Loop

The approach is adapted from an older version of HuggingFace's `run_glue.py` script, available [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [54]:
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [55]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'    

In [57]:
if device == 'cuda':  # get_device_name() raises an error on CPU-only machines
    device_name = torch.cuda.get_device_name()
    print(device_name)

Tesla T4


In [58]:
_ = model.to(device)

In [73]:
def evaluate(model,dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


In [64]:
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0

    progress_bar = tqdm(train_loader,desc=f'Epoch {epoch}',leave=False,disable=False)
    for idx,batch in enumerate(progress_bar):
        model.zero_grad()
        input_ids,attn_mask,labels = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': input_ids,'attention_mask':attn_mask,'labels':labels}

        outputs = model(**inputs)
        loss = outputs[0]  # the first output is the loss when labels are passed

        # accumulate a Python float so the computation graph is not kept alive
        loss_train_total += loss.item()
        loss.backward()

        # clip the gradient norm to 1 to limit exploding gradients
        clip_grad_norm_(model.parameters(),1)

        optimizer.step()
        scheduler.step()

        # the loss is already averaged over the batch, so report it directly
        progress_bar.set_postfix(training_loss=f'{loss.item():.3f}')

    # save a checkpoint after every epoch
    save_path = f'{project_dir}/Models/bert_{epoch}.model'
    torch.save(model.state_dict(),save_path)

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total / len(train_loader)
    tqdm.write(f'Training loss: {loss_train_avg:.5f}')

    val_loss,preds,true_vals = evaluate(model,val_loader)
    val_f1 = f1_score_func(preds,true_vals)

    tqdm.write(f'Val loss: {val_loss:.5f}')
    tqdm.write(f'Val f1: {val_f1:.5f}')


Epoch 1
Training loss: 1.00323
Val loss: 0.7580523405756269
Val f1: 0.6953185953656175

Epoch 2
Training loss: 0.66350
Val loss: 0.6324100877557483
Val f1: 0.7289416177237934

Epoch 3
Training loss: 0.56125
Val loss: 0.6282747430460793
Val f1: 0.7531245980763923

Epoch 4
Training loss: 0.47592
Val loss: 0.6157398649624416
Val f1: 0.7826679820157194

Epoch 5
Training loss: 0.42737
Val loss: 0.5902184929166522
Val f1: 0.7862480072500582

Epoch 6
Training loss: 0.36720
Val loss: 0.5792331269809178
Val f1: 0.7838516881735182

Epoch 7
Training loss: 0.34803
Val loss: 0.5793816851718085
Val f1: 0.780976441609297

Epoch 8
Training loss: 0.31574
Val loss: 0.5663876788956779
Val f1: 0.7790288504974604

Epoch 9
Training loss: 0.29758
Val loss: 0.5823760628700256
Val f1: 0.7858849957352656

Epoch 10
Training loss: 0.29819
Val loss: 0.5948270601885659
Val f1: 0.7844035442564187


In [None]:
## with Tesla T4, 1 epoch takes about 58 sec ==> 10 min total
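
Since a checkpoint is saved after every epoch, the logged validation metrics can guide which one to load in Task 10. A small sketch using the validation loss and F1 values from the training log above, rounded to four decimals. Choosing by F1 rather than by loss is an assumption about what matters for this task.

```python
# (epoch, val_loss, val_f1) copied from the training log, rounded to 4 decimals.
history = [
    (1, 0.7581, 0.6953), (2, 0.6324, 0.7289), (3, 0.6283, 0.7531),
    (4, 0.6157, 0.7827), (5, 0.5902, 0.7862), (6, 0.5792, 0.7839),
    (7, 0.5794, 0.7810), (8, 0.5664, 0.7790), (9, 0.5824, 0.7859),
    (10, 0.5948, 0.7844),
]

best_by_f1 = max(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])

print(f"Best F1:   epoch {best_by_f1[0]} (f1={best_by_f1[2]})")
print(f"Best loss: epoch {best_by_loss[0]} (loss={best_by_loss[1]})")
```

The two criteria pick different epochs here, and the gaps between the later epochs are small, so either choice is defensible on a validation set of this size.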

## Task 10: Loading and Evaluating our Model

In [72]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(cat2idx),
                                                      output_attentions=False,
                                                      output_hidden_states=False)


In [75]:
_ = model.to(device)

In [85]:
path = f'{project_dir}/Models/bert_1.model'

# map the weights onto whichever device was selected earlier (GPU or CPU)
state_dict = torch.load(path,map_location=torch.device(device))
model.load_state_dict(state_dict)

<All keys matched successfully>

In [86]:
_,preds,true_vals = evaluate(model,val_loader)


In [87]:
accuracy_per_class(preds,true_vals,idx2cat)

Class: happy
Accuracy: 169/171

Class: not-relevant
Accuracy: 1/32

Class: angry
Accuracy: 0/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5
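
Per-class accuracy shows how many examples of each class were recovered, but not which classes they were mistaken for. A minimal confusion-count sketch in plain Python follows. The `confusion_counts` helper is hypothetical and the labels are made up; they are not the notebook's predictions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

# Toy example: a model that predicts "happy" for almost everything.
y_true = ["happy", "happy", "angry", "sad", "not-relevant", "not-relevant"]
y_pred = ["happy", "happy", "happy", "happy", "happy", "not-relevant"]

counts_by_pair = confusion_counts(y_true, y_pred)
for (true_label, pred_label), n in sorted(counts_by_pair.items()):
    print(f"true={true_label:<13} pred={pred_label:<13} n={n}")
```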

