<a href="https://colab.research.google.com/github/sadat1971/Deep_Learning_NLP/blob/main/BERT_sentence_embedding_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT practice: step-by-step analysis

A helpful reference: http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

## Step 1: Load the dependencies

Install the `transformers` library if you are using Colab, then load the libraries.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=0079729fb1306

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Step 2: Get the dataset and load it into a DataFrame

In [11]:
# reading the dataset
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [12]:
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [13]:
df.shape

(6920, 2)

In [14]:
df[0][4]

"jonathan parker 's bartleby should have been the be all end all of the modern office anomie films"

### Substep 2.1: If the dataset is too large, you can shrink it to fit in memory.

We use the first 2,000 rows in this case.

In [15]:
# following the tutorial, just gonna use the first 2000
batch_1 = df[:2000]

In [16]:
batch_1.shape

(2000, 2)

In [17]:
batch_1[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

## Step 3 (important): Load the pretrained BERT model

Load the following three things:

**model_class** --> the base model that takes our inputs and produces the sentence embeddings.

**tokenizer_class** --> needed for tokenizing the sentences.

**pretrained_weights** --> the name of the pretrained checkpoint to load. There are several options to choose from; we use 'distilbert-base-uncased'.



In [19]:
# Loading the pretrained BERT model
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

In [36]:
model = model_class.from_pretrained(pretrained_weights)

## Step 4: Tokenize the sentences

In [20]:
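# Tokenize each sentence: split it into tokens and map them to vocabulary IDs.
# add_special_tokens=True adds [CLS] (ID 101) at the start and [SEP] (ID 102) at the end.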
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [21]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
1995    [101, 2205, 20857, 1998, 11865, 16643, 2135, 5...
1996    [101, 2009, 2515, 1050, 1005, 1056, 2147, 2004...
1997    [101, 2023, 2028, 8704, 2005, 1996, 11848, 199...
1998    [101, 1999, 1996, 2171, 1997, 2019, 9382, 1898...
1999    [101, 1996, 3185, 2003, 25757, 2011, 1037, 244...
Name: 0, Length: 2000, dtype: object
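
To make the output above concrete, here is a minimal sketch of the kind of list that `tokenizer.encode(..., add_special_tokens=True)` returns. It is not the real tokenizer: `toy_vocab` and `toy_encode` are hypothetical names, the vocabulary holds only six entries whose IDs are copied from row 0 of the output above (assuming each of those words maps to a single token), and the sentence is split on whitespace instead of with WordPiece. The IDs 100, 101 and 102 are the `[UNK]`, `[CLS]` and `[SEP]` tokens of the `bert-base-uncased` vocabulary.

```python
# Toy stand-in for tokenizer.encode; NOT the real WordPiece algorithm.
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100  # special-token IDs in bert-base-uncased

# Hypothetical mini-vocabulary; the IDs are copied from row 0 of the output above.
toy_vocab = {"a": 1037, "stirring": 18385, ",": 1010,
             "funny": 6057, "and": 1998, "finally": 2633}

def toy_encode(sentence, add_special_tokens=True):
    """Map whitespace-separated words to IDs, optionally wrapping them in [CLS] ... [SEP]."""
    ids = [toy_vocab.get(word, UNK_ID) for word in sentence.split()]
    if add_special_tokens:
        ids = [CLS_ID] + ids + [SEP_ID]
    return ids

toy_ids = toy_encode("a stirring , funny and finally")
print(toy_ids)  # [101, 1037, 18385, 1010, 6057, 1998, 2633, 102]
```

This is why every row in the real output starts with 101: the `[CLS]` token is always placed first.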

## Step 5: Padding and masking


The problem with tokenizing is that, across the whole dataset, it produces token lists of different lengths. We want them all to be the same size, so the simplest fix is to pad them with zeros up to the length of the longest sentence.

But padding alone may mislead the model. For example, say the padded length is 10. For the sentence *This is great*, we end up with *This is great nothing nothing nothing nothing nothing nothing nothing*, which is weird. So we also build a mask, as explained below.

Let's say that after padding we have the array A = [9, 8, 23, 8, 0, 0, 0]

Now we create a second array to feed into the model, telling it which positions hold real tokens and which are padding to be ignored. It has the same size as A: positions with valid values get 1 and padded positions get 0. So the array should be:

masking_A = [1, 1, 1, 1, 0, 0, 0] 
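
The same idea can be tried on a toy batch. This is a minimal sketch, assuming the padding ID is 0 (as it is for the BERT vocabularies used here) and that 0 never appears as a real token; `pad_and_mask` and `toy_batch` are hypothetical names used only for illustration.

```python
import numpy as np

def pad_and_mask(token_lists, pad_id=0):
    """Right-pad lists of token IDs to a common length and build the 0/1 mask."""
    max_len = max(len(ids) for ids in token_lists)
    padded = np.array([ids + [pad_id] * (max_len - len(ids)) for ids in token_lists])
    mask = (padded != pad_id).astype(int)  # 1 = real token, 0 = padding
    return padded, mask

# Three "sentences" of different lengths; the first one matches array A above.
toy_batch = [[9, 8, 23, 8], [5, 7], [4, 4, 4, 4, 4, 4, 4]]
toy_padded, toy_mask = pad_and_mask(toy_batch)

print(toy_padded[0])  # [ 9  8 23  8  0  0  0]
print(toy_mask[0])    # [1 1 1 1 0 0 0]
```

Summing the mask along each row gives back the original sentence lengths, which is a quick sanity check.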

In [26]:
# Padding: the token lists differ in length, so pad them all to the same length
def create_padding(tokenized):  # tokenized is a pandas Series of token-ID lists
    # Length of the longest tokenized sentence
    max_len = max(len(ids) for ids in tokenized.values)
    # Right-pad every list with zeros (0 is the [PAD] token ID) up to max_len
    padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])
    return padded

In [28]:
padded = create_padding(tokenized)

In [29]:
padded.shape

(2000, 59)

In [51]:
# To tell the model to ignore the padded zeros, we need to give it a mask
attention_mask = np.where(padded != 0, 1, 0)  # np.where(condition, value where True, value where False)
attention_mask.shape

(2000, 59)

In [32]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

## Step 6: Run the model and find the last hidden layer

**Produce the input**: The input must be a tensor, so convert the 2-D array of padded token IDs to a tensor using torch. Convert the attention mask to a tensor as well.

Then, feed both the input and the attention mask to the model.

**Important:** Wrap the forward pass in <u>torch.no_grad()</u>. We only need the model's outputs, not gradients, so disabling gradient tracking saves memory and time.
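
The effect of `torch.no_grad()` can be checked on a small example. This is a minimal sketch, assuming a tiny `torch.nn.Embedding` layer (the hypothetical `toy_model`) as a stand-in for the real model so that it runs without downloading any weights.

```python
import numpy as np
import torch

# Hypothetical stand-in for the BERT model, so the example runs offline.
toy_model = torch.nn.Embedding(num_embeddings=30, embedding_dim=4)

toy_padded = np.array([[9, 8, 23, 8, 0, 0, 0]])
toy_input_ids = torch.tensor(toy_padded)    # NumPy integer array -> integer tensor

out_with_grad = toy_model(toy_input_ids)    # autograd records this forward pass
with torch.no_grad():
    out_no_grad = toy_model(toy_input_ids)  # nothing is recorded here

print(out_with_grad.requires_grad)  # True
print(out_no_grad.requires_grad)    # False
print(out_no_grad.shape)            # torch.Size([1, 7, 4])
```

The values of the two outputs are identical; only the gradient bookkeeping differs.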

In [34]:
import time

In [37]:
start_time = time.time()
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

time.time()-start_time

177.34562492370605

## Step 7: Extract the last hidden states

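Now we want to represent each sentence as a single embedding vector, so we expect a tensor of shape (2000, 768). The model returns an output object whose first element, `last_hidden_states[0]`, holds one 768-dimensional vector for each of the 59 token positions, giving a tensor of shape (2000, 59, 768). The vector at the first position (index 0) belongs to the `[CLS]` token added by the tokenizer, and it is the one we use as the sentence embedding.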
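
The `[:, 0, :]` indexing used to pick out the first position can be tried on a small array first. This is a minimal sketch, assuming a hypothetical NumPy array `toy_hidden` with 3 sentences, 5 token positions and 4 features in place of the real (2000, 59, 768) tensor; the same slicing syntax works on torch tensors.

```python
import numpy as np

# Hypothetical stand-in for the model output: 3 sentences, 5 positions, 4 features.
toy_hidden = np.arange(3 * 5 * 4).reshape(3, 5, 4)

# Keep every sentence, only position 0, and every feature.
toy_features = toy_hidden[:, 0, :]

print(toy_features.shape)  # (3, 4)
print(toy_features[1])     # [20 21 22 23]
```

The position axis disappears, leaving one vector per sentence.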

In [45]:
last_hidden_states[0].shape

torch.Size([2000, 59, 768])

In [49]:
features = last_hidden_states[0][:,0,:]

In [50]:
features.shape

torch.Size([2000, 768])