<a href="https://colab.research.google.com/github/mohd-faizy/06P_Sentiment_Analysis_With_Deep_Learning_Using_BERT/blob/master/01_Sentiment_Analysis_with_Deep_Learning_using_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<strong><h1 align = center><font size = 6>Sentiment Analysis with Deep Learning using BERT</font></h1></strong>

# __1. Introduction__

## __What is BERT ?__

__Bidirectional Encoder Representations from Transformers__ 

- __BERT__ improves on __RNNs__ in one key respect: an RNN has to process the words of a sentence one after another, whereas __BERT__, being built on the Transformer, processes all tokens of a sequence in parallel, which makes processing and training parallelizable.
- BERT is a large-scale Transformer-based language model that can be fine-tuned for a variety of tasks.




> We will be using the __Hugging Face Transformer library__ that provides a __high-level API__ to state-of-the-art transformer-based models such as __BERT, GPT2, ALBERT, RoBERTa, and many more__. The Hugging Face team also happens to maintain another highly efficient and super fast library for text tokenization called Tokenizers.

    - Bidirectional: BERT is naturally bidirectional.
    - Generalizable: A pre-trained BERT model can be fine-tuned easily for downstream NLP tasks.
    - High Performance: Fine-tuned BERT models beat previous state-of-the-art results on many NLP tasks.
    - Universal: Trained on Wikipedia + BookCorpus. No special dataset needed.

__Extension of Architecture:__

 - __RoBERTa__
 - __DistilBERT__
 - __ALBERT__

__Other Languages:__

 - __CamemBERT(French)__
 - __AraBERT(Arabic)__
 - __mBERT(Multilingual)__ 

Google Research has __open-sourced__ an implementation of __BERT__ and released the following pre-trained models:


---



- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters



---


- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads , 110M parameters
- BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters



---


- BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

### __Embedding__

In BERT, the embedding is the summation of three types of embeddings:

![Embeddings](https://mengxinji.github.io/Blog/images/bert/embedding.jpg)


> __Token Embeddings__ are the word-piece vectors. The first token of every input is the special __[CLS]__ token, whose final representation can be used for classification tasks.

> __Segment Embeddings__ distinguish between two sentences, since pre-training is not just language modeling but also a classification task that takes two sentences as input.

> __Position Embeddings__ differ from the original Transformer: instead of fixed sinusoidal encodings, __BERT__ learns a position embedding for each position of the __input sequence__, and this __position-specific information__ can flow through the model to the __key__ and __query vectors__.

### __Model Architecture__

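The layout below comes from the source linked under it, which fine-tunes pre-trained BERT for binary sentiment analysis on the Stanford Sentiment Treebank, hence `out_features=2`. In this notebook the classifier head has one output per emotion label instead.

- BertEmbeddings: Input embedding layer
- BertEncoder: The 12 BERT attention layers
- Classifier: A linear classification head with out_features=2, one output for each of the 2 labels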




```
- BertModel
    - embeddings: BertEmbeddings
        - word_embeddings: Embedding(28996, 768)
        - position_embeddings: Embedding(512, 768)
        - token_type_embeddings: Embedding(2, 768)
        - LayerNorm: FusedLayerNorm(torch.Size([768]))
        - dropout: Dropout = 0.1
    - encoder: BertEncoder
        - BertLayer
            - attention: BertAttention
                - self: BertSelfAttention
                    - query: Linear(in_features=768, out_features=768, bias=True)
                    - key: Linear(in_features=768, out_features=768, bias=True)
                    - value: Linear(in_features=768, out_features=768, bias=True)
                    - dropout: Dropout = 0.1
                - output: BertSelfOutput
                    - dense: Linear(in_features=768, out_features=768, bias=True)
                    - LayerNorm: FusedLayerNorm(torch.Size([768]))
                    - dropout: Dropout = 0.1
            - intermediate: BertIntermediate
                - dense: Linear(in_features=768, out_features=3072, bias=True)
            - output: BertOutput
                - dense: Linear(in_features=3072, out_features=768, bias=True)
                - LayerNorm: FusedLayerNorm(torch.Size([768]))
                - dropout: Dropout = 0.1
    - pooler: BertPooler
        - dense: Linear(in_features=768, out_features=768, bias=True)
        - activation: Tanh()
    - dropout: Dropout = 0.1
    - classifier: Linear(in_features=768, out_features=2, bias=True)
```


[Source: `mengxinji.github.io`](https://mengxinji.github.io/Blog/2019-03-27/pre-trained-bert/)


### __Transformer model__

The Transformer model was proposed in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). The paper presents a new way of handling sequence transduction problems (such as machine translation) without complex recurrent or convolutional structures. It simply uses a stack of attention mechanisms to capture the latent structure of the input sentences, plus a special embedding (the positional embedding) to encode each token's position. The whole model architecture looks like this:




![Transformer](https://nextjournal.com/data/QmNQFSULXLPYnGhHSCxmeGk8oHjfdWnybmZGFztfS26fgZ?filename=2019-05-26%2023-43-43%20%E7%9A%84%E8%9E%A2%E5%B9%95%E6%93%B7%E5%9C%96.png&content-type=image/png)

#### __Multi-Head Attention__

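Instead of applying a single __attention function__ to the full-width vectors, they linearly project the __queries, keys, and values__ into several lower-dimensional __subspaces__ (the heads), perform __scaled dot-product attention__ in each head in parallel, and then concatenate the head outputs and project them once more.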

![Multi-Head Attention](https://nextjournal.com/data/QmbkuwYT2AmCiNaWu9ucwRyK5adX86VWRSo4exqkJBpVvy?filename=2019-05-26%2023-52-01%20%E7%9A%84%E8%9E%A2%E5%B9%95%E6%93%B7%E5%9C%96.png&content-type=image/png)

__Formula__:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

$$
\text{where } \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)
$$
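
The formulas can be sketched in a few lines of NumPy. This is a minimal illustration rather than BERT's actual implementation: the sizes (`seq_len`, `d_model`, `h`) and the random projection matrices are assumptions made up for the example, and masking, biases, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v are lists with one projection per head; W_o is the output projection."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy sizes (assumptions): 5 tokens, model width 16, 4 heads of width 4
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h

X = rng.normal(size=(seq_len, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, X, X, W_q, W_k, W_v, W_o)   # self-attention: Q = K = V = X
print(out.shape)  # (5, 16)
```

Each head works on a `d_k`-wide projection, so the concatenated heads have the same width as the input again.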

# __2. Exploratory Data Analysis and Preprocessing__

__We will use the SMILE Twitter DATASET__.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [1]:
! pip install torch torchvision

Collecting torchvision
  Downloading torchvision-0.11.2-cp39-cp39-win_amd64.whl (985 kB)
Installing collected packages: torchvision
Successfully installed torchvision-0.11.2


In [2]:
! pip install tqdm



[Python: Progress Bar with tqdm](https://youtu.be/qVHM3ly-Amg)

> `tqdm`: one of the more comprehensive __progress bar__ packages for Python. It is handy whenever a script should keep the user informed about how far a long-running loop has progressed.


In [3]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [4]:
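# Colab only: upload the dataset CSV. Skip this cell when running locally,
# where `google.colab` is not installed and the import raises ModuleNotFoundError.
from google.colab import files
uploaded = files.upload()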

In [5]:
ls

 Volume in drive C has no label.
 Volume Serial Number is F042-BB1A

 Directory of C:\Users\ym\Desktop\06P_Sentiment-Analysis-With-Deep-Learning-Using-BERT-master\06P_Sentiment-Analysis-With-Deep-Learning-Using-BERT-master

2022-01-26  05:02 PM    <DIR>          .
2022-01-26  05:02 PM    <DIR>          ..
2022-01-26  05:02 PM    <DIR>          .ipynb_checkpoints
2021-03-30  01:34 AM         5,669,230 [AV]Transformers In NLP.pdf
2021-03-30  01:34 AM         7,378,556 [AV]What is BERT.pdf
2021-03-30  01:34 AM         2,670,152 [Medium]BERT_ Pre-Training of Transformers.pdf
2021-03-30  01:34 AM           775,166 [Paper]BERT_2019.pdf
2021-03-30  01:34 AM         2,201,700 [Paper]Transformer_2017.pdf
2021-03-30  01:34 AM           504,559 [TDS]BERT Explained A Complete Guide with Theory andTutorial.pdf
2021-03-30  01:34 AM         2,572,365 [TDS]BERT Explained.pdf
2021-03-30  01:34 AM           514,590 [Wiki]BERT_NLP.pdf
2021-03-30  01:34 AM           361,295 01_Sentiment_Analysis_with_Deep_Learning_using_BERT.ipynb
2021

In [9]:
# Raw string (r'...') so that backslashes in the Windows path are not treated as
# escape sequences; '\U' in a normal string raises a unicodeescape SyntaxError.
df = pd.read_csv(r'C:\Users\ym\Desktop\KOSPInews-en.csv', names=['id', 'text', 'category'], encoding='euc-kr')
df.set_index('id', inplace=True)

> Pandas `set_index()` sets one or more existing columns (or a list, Series, or Index of the same length) as the index of a DataFrame. The index can also be chosen when the DataFrame is created, but when a frame is assembled from __two or more data frames__ it is often convenient to change the index afterwards with this method.


$Syntax$
```
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```
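
As a small illustration of the call, the sketch below uses a made-up three-row frame (the column names mirror the ones in this notebook, the rows are invented). It shows that, without `inplace=True`, `set_index` returns a new frame and that the index column no longer appears among the data columns.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "text": ["great show", "so boring", "what a view"],
    "category": ["happy", "sad", "happy"],
})

indexed = toy.set_index("id")          # returns a new frame; `toy` is unchanged
print(list(indexed.columns))           # ['text', 'category']
print(indexed.loc[102, "category"])    # sad
```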



In [None]:
df.text.iloc[1]

'Dorian Gray with Rainbow Scarf #LoveWins (from @britishmuseum http://t.co/Q4XSwL0esu) http://t.co/h0evbTBWRq'

In [None]:
df.head(10)

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy |
| 614456889863208960 | @britishmuseum say wot, mate? | nocode |
| 614016385442807809 | Two workshops on evaluating audience engagemen... | nocode |
| 610916556751642624 | A Forest Road, by Thomas Gainsborough 1750 Oil... | nocode |
| 614499696015503361 | Lucky @FitzMuseum_UK! Good luck @MirandaStearn... | happy |


$\color{red}{\textbf{NOTE:}}$ `id` is shown in bold because we set it as the __index__, so it is no longer a data column of the dataframe.

In [None]:
df.category.value_counts() # counts how many times each unique value occurs in the column

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

- We ignore the _nocode_ tweets, as they do not contain any emotion.
- We also ignore tweets labelled with multiple emotions, as handling them would make our __BERT__ setup more complicated.
- Essentially, we want $\rightarrow$ _one tweet to have one result._


In [None]:
# Removing the tweets with multiple categories (labels joined by '|')
df = df[~df.category.str.contains(r'\|')]
                    # .str     -> apply string methods element-wise to the column
                    # contains -> True if the string contains '|', else False
                    # r'\|'    -> raw string; '|' is escaped because the pattern is a regex

In [None]:
df = df[df.category != 'nocode']

In [None]:
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

> This shows that we have a __class imbalance__ here, and we need to take this into account.






Building a _dictionary_ that converts each emotion into its corresponding number.

_for example:_

```
happy           1
not-relevant    2
angry           3
surprise        4
sad             5
disgust         6
```



In [None]:
possible_labels = df.category.unique() # an array containing every distinct label

In [None]:
label_dict = {} # start with an empty dict, then loop over the possible labels
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
label_dict

{'angry': 2,
 'disgust': 3,
 'happy': 0,
 'not-relevant': 1,
 'sad': 4,
 'surprise': 5}

_Looping over the iterable and returning the index._

> `enumerate()` in Python: 
When iterating, we often also need to keep a count of the iterations. Python provides the built-in function `enumerate()` for this.

> `enumerate()` adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in `for` loops or converted into a list of tuples with `list()`.

$Syntax$

```
enumerate(iterable, start=0)

Parameters:
Iterable: any object that supports iteration
Start: the index value from which the counter is 
              to be started, by default it is 0 
```
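
A short sketch of both parameters, using a hand-written list of three labels as a stand-in for `possible_labels` (the list and the name `label_to_id` are assumptions for the example):

```python
labels = ["happy", "not-relevant", "angry"]

# Dict comprehension that builds the same kind of mapping as the label_dict loop
label_to_id = {label: index for index, label in enumerate(labels)}
print(label_to_id)  # {'happy': 0, 'not-relevant': 1, 'angry': 2}

# `start` shifts the counter, e.g. to number the classes from 1
print(list(enumerate(labels, start=1)))
# [(1, 'happy'), (2, 'not-relevant'), (3, 'angry')]
```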



In [None]:
df['label'] = df.category.replace(label_dict)

In [None]:
df.head(10)

| id | text | category | label |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 |
| 614499696015503361 | Lucky @FitzMuseum_UK! Good luck @MirandaStearn... | happy | 0 |
| 613601881441570816 | Yr 9 art students are off to the @britishmuseu... | happy | 0 |
| 613696526297210880 | @RAMMuseum Please vote for us as @sainsbury #s... | not-relevant | 1 |
| 610746718641102848 | #AskTheGallery Have you got plans to privatise... | not-relevant | 1 |
| 612648200588038144 | @BarbyWT @britishmuseum so beautiful | happy | 0 |


# __3. Training/Validation Split__

[__train_test_split__ Vs __StratifiedShuffleSplit__](https://medium.com/@411.codebrain/train-test-split-vs-stratifiedshufflesplit-374c3dbdcc36)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_val, y_train, y_val =  train_test_split(df.index.values,
                                                   df.label.values,
                                                   test_size=0.15,
                                                   random_state=17,
                                                   stratify=df.label.values
)

- The first argument to `train_test_split` is the _index values_, which uniquely identify each sample.
- `df.label.values` are the labels; indices and labels are shuffled and split together, so each index stays paired with its label.
- `test_size` is kept at `15%` so as to leave more data for training.
- `random_state` ensures that the splits you generate are __reproducible__. Scikit-learn uses random permutations to generate the splits, and the random state you provide is used as the __seed__ of the random number generator, so the same permutation is produced on every run.

> When `random_state` is not set, the training data changes on every run, and so may the accuracy. When `random_state` is set to a _constant integer_, the split is the same on every run, which makes debugging easier.

- `stratify` ensures that the training and validation sets each contain (approximately) the same percentage of every class.
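
The effect of `stratify` is easiest to see on a tiny imbalanced example. The sketch below uses invented labels (80 samples of class 0, 20 of class 1) with the same `test_size` and `random_state` as above; the validation part keeps the 80/20 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made-up data): 80 of class 0 and 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
ids = np.arange(len(labels))

ids_train, ids_val, y_train, y_val = train_test_split(
    ids, labels, test_size=0.15, random_state=17, stratify=labels
)

print(len(y_val))          # 15
print(np.bincount(y_val))  # [12  3]
```

Without `stratify`, a rare class could by chance end up entirely in one of the two parts, which matters for a class such as `disgust` that has only six tweets.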



In [None]:
# Creating a new column in our dataframe --> 'data_type'
# data_type is initially 'not_set' for all the samples
df['data_type'] = ['not_set']*df.shape[0]

In [None]:
df.head()

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | not_set |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | not_set |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | not_set |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | not_set |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | not_set |


In [None]:
df.loc[x_train, 'data_type'] = 'train'
df.loc[x_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count()

| category | label | data_type | text |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |


Pandas `dataframe.groupby()` function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

# __4. Loading Tokenizer and Encoding our Data__

__BERT-Base, Uncased__ uses a vocabulary of __30,522__ tokens. __Tokenization__ splits the input text into a list of tokens that are available in the vocabulary. To deal with words that are not in the vocabulary, BERT uses __WordPiece__ tokenization, a subword technique similar to __BPE__ that breaks such words into smaller pieces that are in the vocabulary.

In [None]:
! pip install transformers

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

### __Tokenizer__

__Tokenizer__ takes the raw text as input and splits it into _tokens_. Each token is then mapped to an integer id that represents a particular word or word piece.

> __Tokenizer__ converts the text into numerical data.

`TensorDataset`: sets up the data for PyTorch by wrapping tensors into a dataset. Each sample is retrieved by indexing the tensors along the first dimension.

> __BERT__ was trained using WordPiece __tokenization__, which means that a word can be broken down into more than one __sub-word__. For example, tokenizing the sentence _“Hi, my name is Dima”_ with `tokenizer.tokenize('Hi my name is Dima')` gives `['hi', 'my', 'name', 'is', 'dim', '##a']`.
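
To make the sub-word idea concrete, here is a simplified sketch of greedy longest-match-first splitting, the general strategy WordPiece tokenizers use at inference time. It is not the Hugging Face implementation: the function `wordpiece` and the six-entry `toy_vocab` are assumptions for the example, and the input is assumed to be lower-cased and already split on whitespace.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter prefix
        if piece is None:                      # nothing matches: the word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"hi", "my", "name", "is", "dim", "##a"}   # hypothetical mini-vocabulary
tokens = [p for w in "hi my name is dima".split() for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['hi', 'my', 'name', 'is', 'dim', '##a']
```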

In [None]:
# The tokenizer comes from the pre-trained BERT checkpoint
# 'bert-base-uncased' is the checkpoint trained on lower-cased text
# `do_lower_case=True` converts the input to lower case before tokenizing

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)





### __Encoding__


Convert all the Tweets into the encoded form.



In [None]:
# Encoding the training data
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate anything longer than max_length
    max_length=256, 
    return_tensors='pt'
)

# Encoding the validation data
encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)

# Splitting the data for BERT training.
# What does BERT need for training?
#  --> input ids
#  --> attention masks
#  --> labels

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


- `batch_encode_plus` converts multiple strings into tokens in the form we need. It is run separately for the training and the validation data.

- `df[df.data_type=='train'].text.values`: selects all the training rows and takes their text values.

- `add_special_tokens`: adds BERT's special tokens, `[CLS]` at the start and `[SEP]` at the end of each sequence. This is how __BERT__ knows where a sentence __ends__ and where a __new__ one begins.

- `return_attention_mask`: BERT works on _fixed-size input_. If one sentence has $5$ words and another has $50$, both still have to be brought to the same __dimensionality__, so every sequence is padded to `max_length`, set here to a generous $256$. The `attention_mask` tells the model where the actual tokens are (__ones__) and where the padding is (__zeros__).

- `max_length=256`: a single tweet should not need more than 256 tokens.

- `return_tensors='pt'`: the format in which the tensors are returned; `pt` stands for __PyTorch__.
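
The relationship between padding and the attention mask can be sketched without the tokenizer. The helper `pad_and_mask` below is hypothetical, written only for this illustration; unlike the real tokenizer it truncates naively and does not re-append `[SEP]` after truncation.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                  # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 = padding

# 101 and 102 are the ids of [CLS] and [SEP] in bert-base-uncased, 0 is [PAD];
# the two ids in between are placeholders for ordinary word pieces.
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```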

### __We have to convert the input to the feature that is understood by BERT__

    - input_ids: list of numerical ids for the tokenized text
    - input_mask: will be set to 1 for real tokens and 0 for the padding tokens
    - segment_ids: all zeros in our case, since every input is a single sentence (not passed explicitly in this notebook)
    - label_ids: the integer class id of each text (see `label_dict`)



```python
input_ids_dataset = encoded_data_dataset['input_ids']
attention_masks_dataset = encoded_data_dataset['attention_mask']
labels_dataset = torch.tensor(df[df.data_type=='dataset'].label.values)
```

- `encoded_data_dataset` is a dictionary; from it we pull out the `input_ids`, which represent each token as a number.

- Similarly, we pull out the `attention_mask` as a PyTorch tensor.

- Next we pull out the labels, because they are the numerical targets we need.


In [None]:
# Creating two different dataset
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
len(dataset_train)

1258

In [None]:
len(dataset_val)

223

# __5. Setting up BERT Pretrained Model__

In [None]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)






Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

- Each tweet is treated as its own sequence, so one sequence will be classified into one of six classes.

- We are using the `bert-base` version of __BERT__, as it is the smaller and computationally cheaper model.

- `num_labels=len(label_dict)` is the number of output labels the final __BERT__ classification layer has to be able to predict.

- `output_attentions=False`, as we don't want any unnecessary outputs from the model.

- We also don't need `output_hidden_states`, which would return the hidden states of every layer, i.e. the representations computed before the final prediction.


# __6. Creating Data Loaders__

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

> __Dataloader__ Combines a `dataset` and a `sampler`, and provides single or multi-process __iterators__ over the dataset.

Large datasets are _indispensable_ in the world of __machine learning__ and __deep learning__ these days. However, loading a large dataset into memory all at once is often impractical.

It can exhaust memory and slow programs down. PyTorch offers a solution that __parallelizes__ the data loading process and supports automatic batching as well: the `DataLoader` class in the `torch.utils.data` package.

<img src='https://cdn.journaldev.com/wp-content/uploads/2020/02/PyTorch-Data-Loader.png' width='400' height='450'>

$\Rightarrow$ [How does data loader work PyTorch?](https://youtu.be/zN49HdDxHi8)

$\Rightarrow$ [PyTorch-dataloader](https://www.journaldev.com/36576/pytorch-dataloader)

- `RandomSampler`, `SequentialSampler`: these decide how the data is sampled for each batch. We use `RandomSampler` for training: it randomizes the order in which the model sees the data, which prevents the model from learning order-based patterns during training.

The `SequentialSampler`, in contrast, returns the samples in the order in which they are stored in the dataset passed to it. It takes the dataset itself, not a set of indices.


In [None]:
batch_size = 32

# We need two different dataloaders
dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

# Validation does not need shuffling, so the samples are read in order
dataloader_validation = DataLoader(dataset_val, 
                              sampler=SequentialSampler(dataset_val),
                              batch_size=batch_size)

#  __7. Setting Up Optimiser and Scheduler__

In [None]:
# Note: `transformers.AdamW` is deprecated in newer releases in favour of `torch.optim.AdamW`
from transformers import AdamW, get_linear_schedule_with_warmup

__AdamW__

> - Compute __weight decay__ before applying __gradient step__.
> - Multiply the weight decay by the learning rate.

![AdamW](https://user-images.githubusercontent.com/50560933/57822546-95aec680-77c6-11e9-8b99-45490e8ee4c0.png)

The original Adam algorithm was proposed in 'Adam: A Method for Stochastic Optimization'. The AdamW variant was proposed in 'Decoupled Weight Decay Regularization'.



---


`get_linear_schedule_with_warmup`: during the warm-up steps the __learning rate__ is kept low, rising linearly from 0 to the value set in the optimizer, in order to reduce the risk of the model __deviating__ on __sudden exposure to a new data set__. After the warm-up, the learning rate decreases linearly to 0 over the remaining training steps.

> _In this notebook the number of warm-up steps is set to 0._

Early in training you take bigger steps, because you are probably not near the minimum. As you approach the minimum, you take smaller steps to converge to it.

Also, note that the number of training steps is __number of batches * number of epochs__, not just the number of epochs. So `num_training_steps = N_EPOCHS+1` is not correct, unless your batch size equals the training set size.
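
The shape of the schedule can be written as a plain function that returns the factor by which the base learning rate is multiplied at a given step. This is a sketch of the linear warm-up and decay rule, not a call into the library; the name `linear_schedule_factor` is an assumption, and the figures (40 batches per epoch for 1258 training samples at batch size 32, 10 epochs, base rate 1e-5) are taken from this notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at `step`."""
    if step < num_warmup_steps:
        # Warm-up: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Afterwards: fall linearly from 1 to 0 at the last training step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

batches_per_epoch, epochs = 40, 10        # 40 = ceil(1258 / 32) training batches
total_steps = batches_per_epoch * epochs  # number of batches * number of epochs
base_lr = 1e-5

for step in (0, 100, 200, 400):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With zero warm-up steps the factor starts at 1 on the first step, reaches one half at step 200, and hits 0 at step 400, the end of the tenth epoch.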


__Source__:[Optimizer and scheduler for BERT fine-tuning](https://stackoverflow.com/questions/60120043/optimizer-and-scheduler-for-bert-fine-tuning)

In [None]:
# The BERT paper recommends fine-tuning learning rates between 2e-5 and 5e-5;
# this notebook uses a slightly smaller value.
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)

In [None]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

# __8. Defining our Performance Metrics__

The accuracy metric follows the approach of the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [None]:
import numpy as np

In [None]:
from sklearn.metrics import f1_score

There are a total of six labels to classify  
  
    - preds-probability = [0.9, 0.05, 0.05, 0, 0, 0]
    - preds-binary-labels = [1, 0, 0, 0, 0, 0] --> These are the flat values that we want

__Flatten in the context of Keras__

> __Flattening__ breaks the spatial structure of the data and transforms a three-dimensional $(W-(s-1), H - (s-1), N)$ tensor into a one-dimensional tensor (a vector) of size $(W-(s-1)) \times (H - (s-1)) \times N$.

![Flatten in Keras](https://i.stack.imgur.com/lmrin.png)

> Flatten makes explicit how you serialize a __multidimensional tensor__ (typically the input one). This allows the __mapping__ between the (flattened) input tensor and the first hidden layer. If the first hidden layer is "dense", each element of the (serialized) input tensor will be connected with each element of the hidden array. If you do not use Flatten, the way the input tensor is mapped onto the first hidden layer would be ambiguous.

In [None]:
def f1_score_func(preds, labels):

    # Take the index of the highest score along axis=1 (the predicted class)
    # and flatten the result into a 1-D array
    preds_flat = np.argmax(preds, axis=1).flatten()

    # Flattening the labels
    labels_flat = labels.flatten()

    # Returning the F1 score as defined by sklearn
    return f1_score(labels_flat, preds_flat, average='weighted')

[__sklearn.metrics.f1_score__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
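
To see what `average='weighted'` means, the sketch below computes the same quantity by hand with NumPy on four invented predictions: the F1 score of each class is weighted by the number of true samples of that class. The logits, labels, and the helper `weighted_f1` are assumptions made up for the example.

```python
import numpy as np

# Made-up logits for 4 samples and 3 classes
logits = np.array([[2.0, 0.1, -1.0],   # highest score: class 0
                   [0.2, 1.5,  0.3],   # class 1
                   [1.1, 0.9,  0.2],   # class 0 (wrong, the true label is 1)
                   [0.0, 0.2,  3.0]])  # class 2
labels = np.array([0, 1, 1, 2])

preds = np.argmax(logits, axis=1)
print(preds)  # [0 1 0 2]

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * np.sum(y_true == c)
    return total / len(y_true)

print(round(weighted_f1(labels, preds), 2))  # 0.75
```

Here classes 0 and 1 each score 2/3 and class 2 scores 1, so the weighted mean is (1·2/3 + 2·2/3 + 1·1) / 4 = 0.75.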

In [None]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    # Iterating over all the unique labels
    # labels_flat holds the true labels
    for label in np.unique(labels_flat):
        # Select the predictions whose true label is the label we care about,
        # e.g. for the label 'happy' we take all predictions for truly happy tweets
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

- `label_dict_inverse`: before we had [ __Happy__$\rightarrow$0 ], now we have [ 0$\rightarrow$__Happy__ ]. We have created a _new inverse dictionary_ where, instead of [ __Key__$\rightarrow$__Value__ ], we have [ __Value__$\rightarrow$__Key__ ].






# __9. Create a training loop to control PyTorch finetuning of BERT using CPU or GPU acceleration__

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

- A seed value selects one particular stream from the set of possible pseudorandom number streams. When you fix the seed, the generator produces the same sequence of pseudorandom numbers every time you run the program.

- Seeding sets the initial state of a random number generator, so that it generates the same random numbers on multiple executions of the code (for a specific seed value). Here Python's `random`, NumPy, and PyTorch (CPU and CUDA) are all seeded, because each library keeps its own generator.
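
A minimal demonstration with Python's built-in generator (the same idea applies to the NumPy and PyTorch generators seeded above): re-seeding with the same value replays the same sequence.

```python
import random

random.seed(17)
first = [random.random() for _ in range(3)]

random.seed(17)              # reset the generator to the same starting state
second = [random.random() for _ in range(3)]

print(first == second)       # True
```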

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

Tesla T4 with CUDA capability sm_75 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the Tesla T4 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()          # Put the model in training mode
    
    loss_train_total = 0   # Running sum of the training loss for this epoch

    # Set up a progress bar to monitor training progress
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad() # Reset the gradients accumulated from the previous batch
        
        # The dataloader yields 3 tensors per batch, so each batch is a tuple of 3 items
        batch = tuple(b.to(device) for b in batch)
        
        # INPUTS
        # Pulling out the inputs in the form of dictionary
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        # OUTPUTS
        outputs = model(**inputs) # '**' unpacks the dictionary straight into keyword arguments
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()           # backpropagation

        # Gradient clipping: rescale the gradients so that their total norm is at most 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
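        # The returned loss is already averaged over the batch; len(batch) is the
        # tuple length (3), not the batch size, so no further division is needed.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})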
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 1.018302983045578
Validation loss: 0.8175058705466134
F1 Score (Weighted): 0.6656119824269878

Epoch 2
Training loss: 0.7323148302733898
Validation loss: 0.6933749105249133
F1 Score (Weighted): 0.7060557969984234

Epoch 3
Training loss: 0.5789222911000251
Validation loss: 0.5904590444905418
F1 Score (Weighted): 0.7621849484251927

Epoch 4
Training loss: 0.44865028411149976
Validation loss: 0.5404394567012787
F1 Score (Weighted): 0.8005509963040223

Epoch 5
Training loss: 0.36056033074855803
Validation loss: 0.502601968390601
F1 Score (Weighted): 0.8280173634620925

Epoch 6
Training loss: 0.31028423868119714
Validation loss: 0.5212855168751308
F1 Score (Weighted): 0.8607421929884219

Epoch 7
Training loss: 0.26597671397030354
Validation loss: 0.5215627274342945
F1 Score (Weighted): 0.8524800449241667

Epoch 8
Training loss: 0.2486192662268877
Validation loss: 0.5076191020863396
F1 Score (Weighted): 0.8504424877108723

Epoch 9
Training loss: 0.22325176503509284
Validation loss: 0.5181363544293812
F1 Score (Weighted): 0.8504424877108723

Epoch 10
Training loss: 0.2102439481765032
Validation loss: 0.5167937661920275
F1 Score (Weighted): 0.8504424877108723



> __Gradient clipping__ is a technique to prevent __exploding gradients__ in very deep networks, most commonly recurrent neural networks. `clip_grad_norm_` computes the norm of all gradients taken together and, if that norm is greater than the threshold, scales the gradients down so that the total norm no longer exceeds it.
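
The rescaling behind norm-based clipping can be sketched in plain Python. This is an illustration of the idea, not PyTorch's implementation; the function name `clip_by_global_norm` and the small `eps` term are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm."""
    # Norm over every gradient value, as if all vectors were concatenated.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Shrink only when the norm exceeds the threshold; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]            # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

small = [[0.1, 0.2]]                         # norm below the threshold
unchanged, _ = clip_by_global_norm(small)
print(unchanged == small)                    # gradients under the threshold are left as-is
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.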

# __10. Load the finetuned BERT model and evaluate its performance__

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
model.load_state_dict(torch.load('/content/finetuned_BERT_epoch_10.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)





In [None]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 166/171

Class: not-relevant
Accuracy: 18/32

Class: angry
Accuracy: 8/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 2/5



For comparison, the per-class accuracy for `finetuned_BERT_epoch_4.model`:

```
Class: happy
Accuracy: 168/171

Class: not-relevant
Accuracy: 16/32

Class: angry
Accuracy: 0/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5
```
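
To compare the two checkpoints with a single number, the per-class counts above can be summed into an overall accuracy. This is a small sketch; the counts are copied from the two outputs above, and `overall_accuracy` is a hypothetical helper, not part of the notebook.

```python
# (correct, total) per class, copied from the outputs above.
epoch_10 = {'happy': (166, 171), 'not-relevant': (18, 32), 'angry': (8, 9),
            'disgust': (0, 1), 'sad': (0, 5), 'surprise': (2, 5)}
epoch_4 = {'happy': (168, 171), 'not-relevant': (16, 32), 'angry': (0, 9),
           'disgust': (0, 1), 'sad': (0, 5), 'surprise': (0, 5)}

def overall_accuracy(per_class):
    """Sum the per-class counts and return (correct, total, accuracy)."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return correct, total, correct / total

for name, counts in [('epoch 10', epoch_10), ('epoch 4', epoch_4)]:
    correct, total, acc = overall_accuracy(counts)
    print(f'{name}: {correct}/{total} = {acc:.3f}')
```

The `happy` class accounts for 171 of the 223 validation samples, so overall accuracy is dominated by it; the per-class view shows that the rare classes are still classified poorly.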



# __11. Other Resources__



1. Paper: [Transformer](https://arxiv.org/abs/1706.03762)

2. Paper: [BERT](https://arxiv.org/abs/1810.04805)

3. [Transformer Neural Networks - EXPLAINED!](https://youtu.be/TQQlZhbC5ps)

4. [BERT Neural Network - EXPLAINED!](https://youtu.be/xI0HHN5XKDo)

5. [HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

6. [Hugging Face Write with Transformers](https://transformer.huggingface.co/)

7. [LSTM is dead. Long Live Transformers!](https://youtu.be/S27pHKBEp30)

8. [Hugging Face Releases New NLP ‘Tokenizers’ Library](https://www.analyticsvidhya.com/blog/2020/06/hugging-face-tokenizers-nlp-library/)

9. [Transfer Learning for NLP: Fine-Tuning BERT for Text Classification](https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/)

10. [Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/)

11. [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)

12. [How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=demystifying-bert-groundbreaking-nlp-framework)

13. [BERT: Pre-Training of Transformers for Language Understanding](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/)

14. [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/)

15. [PyTorch_TDS](https://towardsdatascience.com/@theairbend3r)
