# **Sentiment Analysis with Deep Learning using BERT**


## **What is BERT?**

BERT is a large transformer-based language model that can be fine-tuned for a variety of downstream tasks, such as the three-class sentiment classification in this notebook.

For more information, see the original paper (https://arxiv.org/abs/1810.04805) and the HuggingFace documentation (https://huggingface.co/transformers/model_doc/bert.html).

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


## 1: Exploratory Data Analysis and Preprocessing

In [None]:
import torch
from tqdm.notebook import tqdm

In [None]:
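df = pd.read_csv('corpus.csv',
                names=['text', 'category'])
# Drop the first row, which presumably holds the file's own header
# (it is read as data because `names` is supplied).
df = df[1:]
# Use a 1-based integer id as the index.
df.insert(0, 'id', range(1, 1 + len(df)))
df.set_index('id', inplace=True)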

In [None]:
df.head()

id,text,category
1,AdilNisarButt pakistan ka ghra tauq he Pakista...,negative
2,Madarchod mulle ye mathura me Nahi dikha tha j...,negative
3,narendramodi Manya Pradhan Mantri mahoday Shri...,positive
4,Atheist _ Krishna Jcb full trend me chal rahi aa,positive
5,AbhisharSharma _ RavishKumarBlog Loksabha me j...,positive


In [None]:
df.category.value_counts()

neutral     5638
positive    5034
negative    4459
Name: category, dtype: int64

In [None]:
possible_labels = df.category.unique()

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
label_dict

{'negative': 0, 'neutral': 2, 'positive': 1}

In [None]:
df.category = df['category'].map(label_dict)

In [None]:
df.head(10)

id,text,category
1,AdilNisarButt pakistan ka ghra tauq he Pakista...,0
2,Madarchod mulle ye mathura me Nahi dikha tha j...,0
3,narendramodi Manya Pradhan Mantri mahoday Shri...,1
4,Atheist _ Krishna Jcb full trend me chal rahi aa,1
5,AbhisharSharma _ RavishKumarBlog Loksabha me j...,1
6,noirnaveed AngelAhana6 cricketworldcup Bhosdik...,0
7,Love u Bhaijan ... Father + son .. Bharat IAmB...,1
8,manojgajjar111 Tumhara pass abh deemagh hai na...,0
9,Mahlogo _ nolo Weni ankere o gae this weekend,1
10,Aurangzeb _ AIMIM SachinS40805591 Lage raho mu...,0


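The classes are mildly imbalanced: neutral (5,638) is the largest, followed by positive (5,034) and negative (4,459). The train/validation split below is therefore stratified by category.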
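
One common way to quantify an imbalance like this is to compute inverse-frequency class weights from the counts. The sketch below is not part of the original notebook: `class_weights` is a hypothetical helper, and the weighting rule (`n_samples / (n_classes * class_count)`) is an assumed heuristic, not something the training code in this notebook uses.

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 5638, "positive": 5034, "negative": 4459}

def class_weights(counts):
    """Hypothetical helper: inverse-frequency weights, n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

weights = class_weights(counts)
for label, w in weights.items():
    # Rarer classes receive larger weights.
    print(f"{label}: {w:.3f}")
```

Weights close to 1.0 for every class would indicate that the imbalance is small enough to train without reweighting, which is what this notebook does.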

## 2: Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.category.values, 
                                                  test_size=0.15, 
                                                  random_state=42,
                                                  stratify=df.category.values)

In [None]:
df['data_type'] = 'not_set'

In [None]:
df.head()

id,text,category,data_type
1,AdilNisarButt pakistan ka ghra tauq he Pakista...,0,not_set
2,Madarchod mulle ye mathura me Nahi dikha tha j...,0,not_set
3,narendramodi Manya Pradhan Mantri mahoday Shri...,1,not_set
4,Atheist _ Krishna Jcb full trend me chal rahi aa,1,not_set
5,AbhisharSharma _ RavishKumarBlog Loksabha me j...,1,not_set


In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'data_type']).count()

category,data_type,text
0,train,3789
0,val,668
1,train,4276
1,val,755
2,train,4789
2,val,845
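
The counts above can be used to check that the stratified split kept roughly the requested 15% of each class for validation. This is a small arithmetic sketch, not part of the original notebook; `val_fraction` is a hypothetical helper name.

```python
# (train, val) row counts per encoded label, copied from the groupby output above.
split_counts = {0: (3789, 668), 1: (4276, 755), 2: (4789, 845)}

def val_fraction(train, val):
    """Share of a class's rows that landed in the validation set."""
    return val / (train + val)

fractions = {label: val_fraction(*tv) for label, tv in split_counts.items()}
for label, frac in fractions.items():
    print(f"label {label}: {frac:.4f} of rows in validation")
```

Each fraction should sit very close to the `test_size=0.15` passed to `train_test_split`.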


## 3: Loading Tokenizer and Encoding our Data

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
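# Note: 'cased' checkpoints are normally paired with do_lower_case=False;
# with do_lower_case=True the input text is lowercased before tokenization.
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-multilingual-cased',
    do_lower_case=True
)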

In [None]:
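# dropna() returns a new DataFrame; the result is only displayed here,
# so df itself still contains any rows with missing values.
df.dropna()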

id,text,category,data_type
1,AdilNisarButt pakistan ka ghra tauq he Pakista...,0,train
2,Madarchod mulle ye mathura me Nahi dikha tha j...,0,train
3,narendramodi Manya Pradhan Mantri mahoday Shri...,1,train
4,Atheist _ Krishna Jcb full trend me chal rahi aa,1,train
5,AbhisharSharma _ RavishKumarBlog Loksabha me j...,1,val
...,...,...,...
15127,rohitsharmawpg asadowaisi narendramodi What a ...,0,train
15128,Prof _ Hariom JKgrievance Who is BIJLI mantri ...,0,train
15129,amjedmbt bandisanjay _ bjp cpkarimnagar Telang...,0,train
15130,Sunju _ Mishra To phir bjp ke leader vikas ke ...,0,train


In [None]:
df['text'] = df['text'].astype('str') 
df[df.data_type=='train'].text.values

array(['AdilNisarButt pakistan ka ghra tauq he Pakistan Israel ko tasleem nahein kerta Isko Palestine kehta he - OCCUPIED PALESTINE',
       'Madarchod mulle ye mathura me Nahi dikha tha jab mullo ne Hindu ko iss liye mara ki vo lasse ki paise mag liye the  ',
       'narendramodi Manya Pradhan Mantri mahoday Shriman Narendra Modi ji Pradhanmantri banne par Hardik Badhai tahe Dil  ',
       ...,
       'amjedmbt bandisanjay _ bjp cpkarimnagar TelanganaCMO KTRTRS KTRoffice TelanganaDGP Musalman ke naam pe kalank  ',
       'Sunju _ Mishra To phir bjp ke leader vikas ke bare me kyon ni batate unhe to sirf bhart mata ki jai kahte pirte hai  ',
       'kunalkamra88 Swamy39 ISS ko BJP4India wale doglepan I alawa kuch bhi nahi karenge .'],
      dtype=object)

In [None]:
# padding='max_length' replaces the deprecated pad_to_max_length=True;
# truncation=True makes the truncation to max_length explicit.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].category.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].category.values)


In [None]:
len(input_ids_train), len(attention_masks_train), len(labels_train)

(12861, 12861, 12861)

In [None]:
dataset_train = TensorDataset(input_ids_train,
                              attention_masks_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val,
                            labels_val)

In [None]:
len(dataset_train)

12861

In [None]:
dataset_val.tensors

(tensor([[  101, 11357, 49311,  ...,     0,     0,     0],
         [  101, 43731, 10376,  ...,     0,     0,     0],
         [  101, 16642, 11359,  ...,     0,     0,     0],
         ...,
         [  101, 10846, 29389,  ...,     0,     0,     0],
         [  101, 10173, 68094,  ...,     0,     0,     0],
         [  101, 12796, 10874,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 tensor([1, 0, 0,  ..., 0, 0, 0]))
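
In the tensors above, every row of input ids starts with 101 and ends in a run of zeros, and the attention mask is 1 over real tokens and 0 over padding. The sketch below illustrates that relationship in plain Python. It is a simplification, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, 102 is assumed to be the `[SEP]` id, and unlike the real tokenizer this version does not preserve a final `[SEP]` when truncating.

```python
PAD_ID = 0  # padding id seen in the tensors above

def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper: truncate or right-pad to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# 101, 11357, 49311 come from the first row above; 102 is an assumed [SEP] id.
ids, mask = pad_and_mask([101, 11357, 49311, 102], max_length=8)
print(ids)
print(mask)
```

The model uses the mask to ignore padded positions, so the zeros in the input ids never contribute to attention.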

## 4: Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained(
                                      'bert-base-multilingual-cased', 
                                      num_labels = len(label_dict),
                                      output_attentions = False,
                                      output_hidden_states = False
                                     )

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

## 5: Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
batch_size = 4

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

# Validation does not need shuffling, so iterate in a fixed order.
dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=32
)

## 6: Setting Up Optimizer and Scheduler

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [None]:
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5,
    eps = 1e-8
)

In [None]:
epochs = 5

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps = len(dataloader_train)*epochs
)
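
To make the schedule concrete, the following plain-Python sketch mirrors the usual linear warmup then linear decay rule: the learning-rate multiplier rises from 0 to 1 over the warmup steps, then falls linearly to 0 at the last training step. `linear_schedule_factor` is a hypothetical name, and this is an illustration of the rule under those assumptions, not the library's code.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
batches_per_epoch = math.ceil(12861 / 4)   # training rows / batch size used in this notebook
total_steps = batches_per_epoch * 5        # 5 epochs
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

With `num_warmup_steps=0`, as configured above, training starts at the full learning rate and decays immediately.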

## 7: Defining our Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average = 'weighted')
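
For intuition about what `average='weighted'` computes, here is a dependency-free sketch: the F1 score is calculated per class and then averaged with each class weighted by its number of true examples. `weighted_f1` is a hypothetical helper and the toy labels are made up for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1   # weight each class by its share of true labels
    return score

# Toy example (illustrative values, not notebook data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(weighted_f1(y_true, y_pred))
```

Because larger classes count for more, a weighted F1 can hide weak performance on a small class, which is one reason the notebook also reports per-class accuracy.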

In [None]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy:{len(y_preds[y_preds==label])}/{len(y_true)}\n')

## 8: Creating our Training Loop

In [None]:
import random

seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    model.train()
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc='Epoch {:1d}'.format(epoch), 
                        leave=False, 
                        disable=False)
    
    for batch in progress_bar:
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total +=loss.item()
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    #torch.save(model.state_dict(), f'Models/BERT_ft_Epoch{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')

Epoch 1
Training loss: 0.927720907690995
Validation loss: 0.870682697061082
F1 Score (weighted): 0.6080996079830271

Epoch 2
Training loss: 0.7966384190292242
Validation loss: 0.9015377818698614
F1 Score (weighted): 0.6226862678366226

Epoch 3
Training loss: 0.705949972374411
Validation loss: 0.9544898028105078
F1 Score (weighted): 0.6460688046387567

Epoch 4
Training loss: 0.6450592718406845
Validation loss: 1.3222483522455457
F1 Score (weighted): 0.639378470331976

Epoch 5
Training loss: 0.6096685188514035
Validation loss: 1.4898977942869698
F1 Score (weighted): 0.6308120996825038
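
One reading of this log is that the model begins to overfit early: training loss keeps falling while validation loss rises after the first epoch. The sketch below, which is not part of the original notebook, picks the best epoch from the logged numbers under two different criteria; the `history` dictionary is a hypothetical structure filled by hand from the output above.

```python
# (validation loss, weighted F1) per epoch, copied from the log above.
history = {
    1: (0.870682697061082, 0.6080996079830271),
    2: (0.9015377818698614, 0.6226862678366226),
    3: (0.9544898028105078, 0.6460688046387567),
    4: (1.3222483522455457, 0.639378470331976),
    5: (1.4898977942869698, 0.6308120996825038),
}

best_by_loss = min(history, key=lambda e: history[e][0])  # lowest validation loss
best_by_f1 = max(history, key=lambda e: history[e][1])    # highest weighted F1
print(f"Lowest validation loss: epoch {best_by_loss}")
print(f"Highest weighted F1: epoch {best_by_f1}")
```

Since the two criteria disagree, saving a checkpoint per epoch (as the commented-out `torch.save` line in the loop suggests) would allow evaluating an earlier epoch instead of the final one.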


## 9: Evaluating our Model

In [None]:
accuracy_per_class(predictions, true_vals)


Class: negative
Accuracy:437/669

Class: positive
Accuracy:530/755

Class: neutral
Accuracy:467/846



In [43]:
print("Negative", (437/669) * 100)
print("Positive", (530/755) * 100)
print("Neutral", (467/846) * 100)

Negative 65.32137518684604
Positive 70.19867549668875
Neutral 55.20094562647754
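
The same percentages, plus an overall accuracy, can be derived from the counts in one pass instead of typing each ratio by hand. This is a small sketch using the numbers printed above; the `results` dictionary is a hypothetical structure, not something the notebook builds.

```python
# (correct, total) per class, copied from the accuracy_per_class output above.
results = {"negative": (437, 669), "positive": (530, 755), "neutral": (467, 846)}

for name, (correct, total) in results.items():
    print(f"{name}: {100 * correct / total:.2f}%")

overall_correct = sum(c for c, _ in results.values())
overall_total = sum(t for _, t in results.values())
overall_accuracy = overall_correct / overall_total
print(f"overall: {100 * overall_accuracy:.2f}%")
```

The neutral class is the weakest of the three, even though it has the most examples.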
