<a href="https://colab.research.google.com/github/sahibpreetsingh12/Kaggle-Notebooks/blob/master/DistillBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Turning the Beast on

This is Part 1 of a two-part notebook series. In the next notebook we will train DistilBERT on our corpus, a process called <font color='red'>fine-tuning</font>.

In [0]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [0]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


## Installing Hugging Face

In [0]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)

## Importing Libraries

In [0]:
import numpy as np
import pandas as pd
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

## 1. Loading Dataset
For this task we will use the SST-2 dataset from [here](https://github.com/clairett/pytorch-sentiment-classification/).


In [0]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


## Some Visualisation 

In [0]:
df.columns

Int64Index([0, 1], dtype='int64')

Column 0 holds the movie reviews.

Column 1 holds their labels, where 0 marks a negative review and 1 marks a positive review.

In [0]:
df[1].value_counts()

1    3610
0    3310
Name: 1, dtype: int64

## The counts above show that this is a roughly <font color='green'>balanced dataset</font>
## This means we can use <font color='red'>**accuracy**</font> as the metric.
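
As a quick sanity check on the balance claim, the class shares can be computed directly from the counts printed by `value_counts()`. The sketch below also derives the accuracy that a trivial "always predict the majority class" baseline would reach, which gives a floor to compare any classifier against. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 3610, 0: 3310}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a baseline that always predicts the most frequent class.
majority_baseline = max(shares.values())

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(majority_baseline, 3))
```

Because the two shares are close to one half each, accuracy is not dominated by one class, which is why it is a reasonable metric here.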

# Tokenization & Input Formatting for <font color='blue'>BERT</font>


## 2.1 DistilBERT Tokenizer
To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.
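
To make the two steps concrete, here is a minimal sketch with a tiny hand-made vocabulary. It is not the real WordPiece algorithm and not the real DistilBERT vocabulary: `TOY_VOCAB`, `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical names, and only a handful of ids are included for illustration. It only shows the idea of splitting text into tokens and then looking each token up.

```python
# Toy vocabulary: a hand-picked subset, NOT the full DistilBERT vocabulary.
TOY_VOCAB = {"[UNK]": 100, "a": 1037, ",": 1010, "funny": 6057, "and": 1998}


def toy_tokenize(text):
    """Lower-case and split on whitespace (a simple stand-in for WordPiece)."""
    return text.lower().split()


def toy_convert_tokens_to_ids(tokens):
    """Look each token up in the vocabulary, falling back to [UNK]."""
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


tokens = toy_tokenize("A funny , funny and unseen")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

A word that is missing from the toy vocabulary falls back to the `[UNK]` id; the real tokenizer would instead try to break it into known sub-word pieces first.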

In [0]:
import transformers

In [0]:
from transformers import DistilBertTokenizer

# Load the DistilBERT tokenizer.
print('Loading DistillBERT tokenizer...')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)  # convert every input to lower case

Loading DistillBERT tokenizer...


# Let's See What we have got after <font color='orange'>Tokenisation</font>

In [0]:
# Print the original sentence.
print(' Original: ', df[0][0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(df[0][0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(df[0][0])))

 Original:  a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Tokenized:  ['a', 'stirring', ',', 'funny', 'and', 'finally', 'transporting', 're', 'imagining', 'of', 'beauty', 'and', 'the', 'beast', 'and', '1930s', 'horror', 'films']
Token IDs:  [1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152]



*   For experimentation in this notebook we will use <font color='orange'>DistilBERT</font> directly (i.e. we will not train it on our data).
*   In the <font color='orange'>second part</font> we will train DistilBERT on our data.



## Loading Our <font color='pink'>Bad Boy </font>


In [0]:
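# Pair the model class with the name of its pretrained checkpoint.
model_class, pretrained_weights = (transformers.DistilBertModel, 'distilbert-base-uncased')

In [0]:
# Download (or load from cache) the pretrained weights and build the model.
model = model_class.from_pretrained(pretrained_weights)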

Above we used `convert_tokens_to_ids` just to map tokens to ids. We will now use the `tokenizer.encode` function, which tokenizes the text and adds the special tokens in one call; padding and the attention mask are built in the steps that follow. The input requirements for <font color='orange'>BERT</font> and other Transformer models are:


1.   All input sentences to a Transformer model must be of the same length.
2.   Add special tokens to the start and end of each sentence.
3.   **Pad** & **truncate** all sentences to a single constant length, because ***BERT*** is a pretrained model with a fixed maximum input size of <font color='orange'>512</font> tokens.
4.   Explicitly differentiate real tokens from padding tokens with the "attention mask".

Attention masks deserve a closer look. Suppose one of our input sentences has 8 tokens after tokenisation, so it has <font color='orange'>8 real tokens</font>, and suppose we have set the maximum length to 10 tokens. We then pad the sentence on the right with 2 zeros. To differentiate between real tokens and padding tokens we add an <font color='orange'>attention mask</font>: 1 for each real token and 0 for each padding token.
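
The 8-token example can be sketched in a few lines. The token ids below are made up for illustration, and `PAD_ID`, `MAX_LEN`, `padded_ids` and `mask` are illustrative names rather than anything defined by the library.

```python
import numpy as np

PAD_ID = 0    # padding id used in this example
MAX_LEN = 10  # illustrative maximum length from the paragraph above

# Eight made-up token ids standing in for a tokenized sentence.
token_ids = [101, 1037, 6057, 1998, 5469, 3152, 1012, 102]

# Right-pad with PAD_ID up to MAX_LEN.
padded_ids = token_ids + [PAD_ID] * (MAX_LEN - len(token_ids))

# 1 marks a real token, 0 marks padding.
mask = np.where(np.array(padded_ids) != PAD_ID, 1, 0)

print(padded_ids)
print(mask.tolist())
```

The mask has the same length as the padded sentence, so the model can ignore the last two positions.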


```
With BERT we pad on the right of the sentence (it is a pretrained model that uses absolute position embeddings)
```




## Padding Tokens

Padding is done with a special `[PAD]` token, which is at index 0 in the BERT vocabulary. The below illustration demonstrates padding out to a "MAX_LEN" of 8 tokens.

![alt text](http://www.mccormickml.com/assets/BERT/padding_and_mask.png)

## Special Tokens

`[SEP]`

At the end of every sentence, we need to append the special [SEP] token.

This token is a separator used in two-sentence tasks, where BERT is given two separate sentences and asked to determine something (e.g., can the answer to the question in sentence A be found in sentence B?).


We have to use this token even for single sentences.

`[CLS]`

For classification tasks, we must prepend the special `[CLS]` token to the beginning of every sentence.

This token has special significance. BERT (the base model) consists of 12 Transformer layers; DistilBERT, which we use here, has 6. Each Transformer layer takes in a list of token embeddings and produces the same number of embeddings on the output (but with the feature values changed, of course!).

![alt text](http://mccormickml.com/assets/BERT/CLS_token_500x606.png)


On the output of the final (12th) transformer, only the first embedding (corresponding to the `[CLS]` token) is used by the classifier.

`The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.`

Also, because BERT is trained to only use this `[CLS]` token for classification, we know that the model has been motivated to encode everything it needs for the classification step into that single 768-value embedding vector.
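
In array terms, "use only the `[CLS]` embedding" means taking position 0 along the sequence axis of the model output. The sketch below uses random numbers as a stand-in for real hidden states, and the shapes (2 sentences, 5 tokens each) are chosen only for illustration; the hidden size of 768 is the one mentioned above.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 5, 768

# Random numbers standing in for the model's final-layer output,
# shaped (sentences, tokens per sentence, values per token).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 of every sequence is the [CLS] token.
cls_embeddings = hidden_states[:, 0, :]

print(cls_embeddings.shape)  # (2, 768)
```

Each sentence is thereby reduced to one 768-value vector, which is what a downstream classifier receives as its features.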

## Tokenisation

In [0]:
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

![alt text](https://jalammar.github.io/images/distilBERT/sst2-text-to-tokenized-ids-bert-example.png)
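
What `add_special_tokens=True` contributes can be pictured as wrapping the plain token ids with the `[CLS]` and `[SEP]` ids. The sketch below is a hand-written illustration and not the tokenizer's actual implementation; `wrap_with_special_tokens` is a hypothetical helper, and the ids 101 and 102 are the ones visible at the start and end of the padded output in this notebook.

```python
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]

# Token ids for the first review, copied from the output shown earlier.
token_ids = [1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603,
             1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152]


def wrap_with_special_tokens(ids):
    """Hypothetical helper: prepend [CLS] and append [SEP]."""
    return [CLS_ID] + list(ids) + [SEP_ID]


encoded = wrap_with_special_tokens(token_ids)
print(len(token_ids), len(encoded))
print(encoded[0], encoded[-1])
```

The 18 word tokens become 20 ids once the two special tokens are added.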


Currently we have a pandas Series of lists, but we have to convert it to a tensor before passing it to `DistilBERT`.

# Padding


In [0]:
# Padding
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [0]:
print(padded.shape)


(6920, 67)


## See what we got after Padding

In [0]:
# Print the original sentence.
print(' Original: ', df[0][0])
print('\n')
# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(df[0][0]))
print('\n')
# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(df[0][0])))
print('\n')
# padded sentence
print(padded[0])

 Original:  a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films


Tokenized:  ['a', 'stirring', ',', 'funny', 'and', 'finally', 'transporting', 're', 'imagining', 'of', 'beauty', 'and', 'the', 'beast', 'and', '1930s', 'horror', 'films']


Token IDs:  [1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152]


[  101  1037 18385  1010  6057  1998  2633 18276  2128 16603  1997  5053
  1998  1996  6841  1998  5687  5469  3152   102     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0]


# Masking / Attention-Masking

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(6920, 67)

# Our Model
![alt text](https://camo.githubusercontent.com/7c092d2fd20a0cd922bdd15a862e31155f6adcb7/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d64697374696c626572742d7475746f7269616c2d73656e74656e63652d656d62656464696e672e706e67)

In [0]:
df.shape

(6920, 2)

In [0]:
input_ids = torch.tensor(padded)   # converting into torch tensors
attention_mask = torch.tensor(attention_mask) # converting into torch tensors

`The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.`

So this can take some time.
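
If a single forward pass over all 6,920 sentences is too slow or does not fit in memory, one common option is to feed the model in mini-batches and stack the results. The sketch below shows only the batching logic with NumPy: `fake_model` is a hypothetical stand-in that returns a small vector per row, and `run_in_batches` and the batch size of 256 are likewise assumptions, not part of this notebook.

```python
import numpy as np


def fake_model(input_ids, attention_mask):
    """Hypothetical stand-in for a model call: one 4-value vector per row."""
    real_token_counts = attention_mask.sum(axis=1, keepdims=True)
    return np.repeat(real_token_counts, 4, axis=1).astype(float)


def run_in_batches(model_fn, input_ids, attention_mask, batch_size=256):
    """Call model_fn on consecutive row slices and stack the outputs."""
    outputs = []
    for start in range(0, len(input_ids), batch_size):
        stop = start + batch_size
        outputs.append(model_fn(input_ids[start:stop], attention_mask[start:stop]))
    return np.concatenate(outputs, axis=0)


# Dummy inputs with the same shape as the padded matrix in this notebook.
dummy_ids = np.ones((6920, 67), dtype=int)
dummy_mask = np.ones((6920, 67), dtype=int)

batched_out = run_in_batches(fake_model, dummy_ids, dummy_mask)
print(batched_out.shape)
```

The stacked result has one row per sentence, exactly as if the stand-in had been called once on the full input.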


In [0]:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [0]:
features = last_hidden_states[0][:,0,:].numpy()

In [0]:
labels = df[1] # getting labels

# Splitting Data for training and Testing

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [0]:
lr_clf = LogisticRegression(C=5.2, max_iter=2000)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=5.2, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
lr_clf.score(test_features, test_labels)

0.830635838150289

For reference, the highest accuracy score for this dataset is currently 96.8. DistilBERT can be trained to improve its score on this task. This process, called fine-tuning, updates the model’s weights so that it performs better on this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of 90.7. The full-size BERT model achieves 94.9.