<a href="https://colab.research.google.com/github/sarthakkaushik/Diploma-Program-in-ML-and-AI/blob/main/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install Hugging Face Transformers: https://huggingface.co/transformers/installation.html

# The library bundles many Transformer-based models: https://github.com/huggingface/transformers#model-architectures
!pip install transformers

# This week: we use the Hugging Face BERT implementations.
# Next sessions: build an encoder-decoder seq2seq Transformer from scratch using TF/Keras.

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)

In [None]:
# Reference: https://medium.com/tensorflow/using-tensorflow-2-for-state-of-the-art-natural-language-processing-102445cda54a
# Ref: https://huggingface.co/transformers/notebooks.html

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

2.3.0


## Tokenization

In [None]:
# Tokenization: map words to ids
# Refer: https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb#scrollTo=LgktNYt7ADPS

# simple example
s = "very long corpus..."
words = s.split(" ")  # Split over space
vocabulary = dict(enumerate(set(words)))  # Maps each id to its word (id -> word); set order is arbitrary

print(vocabulary)

# Problem: related forms such as cat (id 1123) and cats (id 1346) get unrelated ids

{0: 'corpus...', 1: 'very', 2: 'long'}
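
The problem noted in the cell above can be made concrete with a word-to-id map. This is a minimal sketch: the toy corpus, the `[UNK]` fallback entry, and the helper name `encode_words` are illustrative assumptions and do not come from the notebook.

```python
# Toy corpus; the sentence and the "[UNK]" fallback are illustrative assumptions.
corpus = "the cat sat on the mat"

# Build a word -> id map. Sorting makes the ids reproducible across runs
# (iterating over a plain set has no guaranteed order).
word_to_id = {"[UNK]": 0}
for word in sorted(set(corpus.split(" "))):
    word_to_id[word] = len(word_to_id)

def encode_words(text):
    """Map each space-separated word to its id, or to [UNK] if unseen."""
    return [word_to_id.get(w, word_to_id["[UNK]"]) for w in text.split(" ")]

print(word_to_id)
print(encode_words("the cat sat"))
print(encode_words("the cats sat"))  # "cats" was never seen, so it maps to [UNK]
```

With a whole-word vocabulary, "cats" shares nothing with "cat": it is either a separate entry or an unknown word. Sub-tokenization addresses this.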


### Sub-tokenization

- Why? Whole-word vocabularies treat related forms (fast vs. faster, cat vs. cats) as unrelated tokens.
- Example: cats --> [cat, ##s]
- Image: https://nlp.fast.ai/images/multifit_vocabularies.png

<img src="https://nlp.fast.ai/images/multifit_vocabularies.png" alt="Vocabularies figure (nlp.fast.ai)" height="75%" width="75%">
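
The split `cats --> [cat, ##s]` can be reproduced with a greedy longest-match-first procedure, which is the general scheme BERT-style WordPiece tokenizers follow. The sketch below is a simplification under stated assumptions: the four-entry `toy_vocab` and the function name `wordpiece_tokenize` are hypothetical, and a real tokenizer also handles casing, punctuation, and a much larger learned vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to mirror the examples above.
toy_vocab = {"cat", "fast", "##s", "##er"}

for w in ["cat", "cats", "fast", "faster", "dog"]:
    print(w, wordpiece_tokenize(w, toy_vocab))
```

Here "cats" and "cat" share the piece `cat`, so the model can reuse what it learned about one form when it sees the other.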


### Tokenization in huggingface

In [None]:
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased") 


In [None]:
# See the BERT architecture covered in the previous videos of this course.

# Tokenizer docs: https://huggingface.co/transformers/main_classes/tokenizer.html
print(bert_tokenizer.cls_token)

[CLS]


In [None]:
enc = bert_tokenizer.encode("Hi, I am James bond !")
print(enc)

print(bert_tokenizer.decode(enc))

[101, 8790, 117, 146, 1821, 1600, 7069, 106, 102]
[CLS] Hi, I am James bond! [SEP]


In [None]:
print(bert_tokenizer.decode([117]))
print(bert_tokenizer.decode([106]))

,
!


In [None]:
enc = bert_tokenizer.encode("I see many cats and dogs")
print(enc)

print(bert_tokenizer.decode(enc))

[101, 146, 1267, 1242, 11771, 1105, 6363, 102]
[CLS] I see many cats and dogs [SEP]
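
The encodings above always start with id 101 and end with id 102, which decode to `[CLS]` and `[SEP]`. The sketch below mimics that wrapping with a toy vocabulary. The ids are copied from the printed output; pairing each word with the id in the same position is an inference from that output, and `toy_encode` / `toy_decode` are hypothetical helpers, not part of the library.

```python
# Token -> id pairs read off the printed encoding above (bert-base-cased).
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "I": 146, "see": 1267,
             "many": 1242, "cats": 11771, "and": 1105, "dogs": 6363}
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_encode(text):
    """Wrap the words in [CLS] ... [SEP] and look up each id."""
    tokens = ["[CLS]"] + text.split(" ") + ["[SEP]"]
    return [toy_vocab[t] for t in tokens]

def toy_decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = toy_encode("I see many cats and dogs")
print(ids)
print(toy_decode(ids))
```

Unlike this sketch, the real tokenizer also splits punctuation and falls back to sub-tokens for words missing from its vocabulary.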


## BERT Models
- DistilBERT
- RoBERTa
- https://miro.medium.com/max/2000/1*IFVX74cEe8U5D1GveL1uZA.png
<img src="https://miro.medium.com/max/2000/1*IFVX74cEe8U5D1GveL1uZA.png" alt="Figure" height="75%" width="75%">

- https://miro.medium.com/max/1400/1*bSUO_Qib4te1xQmBlQjWaw.png
<img src="https://miro.medium.com/max/1400/1*bSUO_Qib4te1xQmBlQjWaw.png" alt="Figure" height="75%" width="75%">

- General Language Understanding Evaluation (GLUE)  : https://gluebenchmark.com/


In [None]:
import tensorflow as tf

# Refer: https://huggingface.co/transformers/model_doc/distilbert.html#

from transformers import DistilBertTokenizer, TFDistilBertModel

distil_bert = 'distilbert-base-uncased' # Name of the pretrained models

#DistilBERT 
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert)
model = TFDistilBertModel.from_pretrained(distil_bert)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_transform', 'activation_13', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


### Extract features using BERT

In [None]:
# Obtain the 768-dim vector corresponding to [CLS], which serves as a sentence vector

e = tokenizer.encode("Hello, my dog is cute")
print(e)

input = tf.constant(e)[None, :]  # Add a batch dimension (batch size 1); note the name shadows Python's built-in input
print(input)  # shape: [1, 8]
print(type(input))

output = model(input)

print(type(output))
print(len(output))
print(output)  # output[0] has shape [1, 8, 768]


[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
tf.Tensor([[  101  7592  1010  2026  3899  2003 10140   102]], shape=(1, 8), dtype=int32)
<class 'tensorflow.python.framework.ops.EagerTensor'>
<class 'tuple'>
1
(<tf.Tensor: shape=(1, 8, 768), dtype=float32, numpy=
array([[[-1.8296401e-01, -7.4054271e-02,  5.0267667e-02, ...,
         -1.1260690e-01,  4.4493100e-01,  4.0941307e-01],
        [ 7.0589967e-04,  1.4825365e-01,  3.4328270e-01, ...,
         -8.6039528e-02,  6.9474751e-01,  4.3353081e-02],
        [-5.0720602e-01,  5.3085494e-01,  3.7162632e-01, ...,
         -5.6287450e-01,  1.3755678e-01,  2.8475279e-01],
        ...,
        [-4.2251340e-01,  5.7314664e-02,  2.4338306e-01, ...,
         -1.5222676e-01,  2.4462426e-01,  6.4154869e-01],
        [-4.9384493e-01, -1.8895482e-01,  1.2640803e-01, ...,
          6.3240677e-02,  3.6912847e-01, -5.8252141e-02],
        [ 8.3268642e-01,  2.4948184e-01, -4.5439535e-01, ...,
          1.1997543e-01, -3.9257327e-01, -2.7785364e-01]]], d

In [None]:
# Vector corresponding to [CLS]
print((output[0])[0, 0, :])  # shape: (768,)

tf.Tensor(
[-1.82964012e-01 -7.40542710e-02  5.02676666e-02 -3.49530607e-01
 -7.28534013e-02 -2.63872504e-01  2.39293277e-01  4.79842067e-01
 -2.14802593e-01 -1.89516276e-01  8.99827629e-02 -1.29189104e-01
 -1.11275986e-01  3.16634566e-01 -8.25904459e-02  9.26223695e-02
 -2.09082887e-02  4.74876106e-01  1.28833517e-01  3.18710878e-03
 -1.53505564e-01 -3.57001781e-01  9.89293680e-04 -3.92748415e-03
  1.38444286e-02 -5.49408533e-02  8.45261663e-02  1.36564478e-01
  2.18252212e-01 -1.96798772e-01  2.47996300e-02  1.75569296e-01
 -3.97217683e-02 -1.10776976e-01  5.48524447e-02  6.07529581e-02
  1.71999224e-02 -1.07415311e-01 -8.76945704e-02  2.12041944e-01
 -4.05893549e-02 -3.17957923e-02  1.37657166e-01 -1.39004529e-01
 -4.68857959e-03 -3.97633344e-01 -2.60034633e+00 -1.08741574e-01
  4.86704111e-02 -3.61387730e-01  3.71814460e-01 -7.61094838e-02
  3.23910564e-02  2.31666416e-01  2.63016045e-01  3.18299681e-01
 -3.87970746e-01  2.98111200e-01 -4.93028834e-02 -3.59303094e-02
  1.58540457e-
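
The indexing `output[0][0, 0, :]` reads as "first sentence in the batch, first token, all features". The sketch below shows the same selection on a small nested list so the three axes are easy to see. The numbers and the sizes (batch 2, sequence length 3, hidden size 2) are made up for illustration.

```python
# Toy stand-in for output[0] with shape [batch=2, seq_len=3, hidden=2].
# The numbers are made up; the tensor printed above has shape [1, 8, 768].
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # sentence 0: [CLS], word, [SEP]
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],   # sentence 1: [CLS], word, [SEP]
]

batch, seq_len, hidden = (len(last_hidden_state),
                          len(last_hidden_state[0]),
                          len(last_hidden_state[0][0]))

# Same selection as output[0][0, 0, :]: sentence 0, token 0, every feature.
cls_first = last_hidden_state[0][0]

# Same selection as output[0][:, 0, :]: the [CLS] vector of every sentence.
cls_all = [sentence[0] for sentence in last_hidden_state]

print((batch, seq_len, hidden))
print(cls_first)
print(cls_all)
```

Because `[CLS]` is always the first token, position 0 along the sequence axis gives one fixed-size vector per sentence regardless of sentence length.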

In [None]:
# How about the hidden-layer outputs?

# https://huggingface.co/transformers/model_doc/distilbert.html#distilbertconfig
from transformers import DistilBertConfig

config = DistilBertConfig.from_pretrained(distil_bert, output_hidden_states=True)


e = tokenizer.encode("Hello, my dog is cute")
input = tf.constant(e)[None, :]  # Batch size 1
model = TFDistilBertModel.from_pretrained(distil_bert, config=config)
print(model.config)  # Every model carries its config



DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_hidden_states": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 30522
}
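
The config values above are enough for a rough parameter count. The arithmetic below is a sketch under an assumed standard Transformer-encoder layout (word and position embeddings with a LayerNorm, then per layer four attention projections, a two-layer feed-forward block, and two LayerNorms). That layout is an assumption; it is not printed by the config itself.

```python
# Values copied from the DistilBertConfig printed above.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_heads, n_layers = 768, 3072, 12, 6

head_dim = dim // n_heads  # width of each attention head

def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # one scale and one shift vector

# Assumed layout: word + position embeddings, followed by a LayerNorm.
embedding_params = vocab_size * dim + max_positions * dim + layer_norm

# Assumed layout per layer: Q, K, V and output projections, a two-layer
# feed-forward block, and two LayerNorms.
attention_params = 4 * linear_params(dim, dim)
ffn_params = linear_params(dim, hidden_dim) + linear_params(hidden_dim, dim)
per_layer_params = attention_params + ffn_params + 2 * layer_norm

total_params = embedding_params + n_layers * per_layer_params
print(head_dim, per_layer_params, total_params)
```

Under these assumptions the total comes to about 66 million, in line with the size usually quoted for DistilBERT, and the embedding table alone accounts for roughly a third of it.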



In [None]:
output = model(input)
print(len(output))

2


In [None]:
print(output[0])

tf.Tensor(
[[[-1.8296401e-01 -7.4054271e-02  5.0267667e-02 ... -1.1260690e-01
    4.4493100e-01  4.0941307e-01]
  [ 7.0589967e-04  1.4825365e-01  3.4328270e-01 ... -8.6039528e-02
    6.9474751e-01  4.3353081e-02]
  [-5.0720602e-01  5.3085494e-01  3.7162632e-01 ... -5.6287450e-01
    1.3755678e-01  2.8475279e-01]
  ...
  [-4.2251340e-01  5.7314664e-02  2.4338306e-01 ... -1.5222676e-01
    2.4462426e-01  6.4154869e-01]
  [-4.9384493e-01 -1.8895482e-01  1.2640803e-01 ...  6.3240677e-02
    3.6912847e-01 -5.8252141e-02]
  [ 8.3268642e-01  2.4948184e-01 -4.5439535e-01 ...  1.1997543e-01
   -3.9257327e-01 -2.7785364e-01]]], shape=(1, 8, 768), dtype=float32)


In [None]:
output[0].shape

TensorShape([1, 8, 768])

In [None]:
output[1][0].shape

TensorShape([1, 8, 768])

In [None]:
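print(type(output[1]))
print(len(output[1]))  # 7: the embedding output plus one entry per layer (n_layers = 6)
print(output[1][6])  # Last layer's output, same values as output[0]; shape: (1, 8, 768)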

<class 'tuple'>
7
tf.Tensor(
[[[-1.8296401e-01 -7.4054271e-02  5.0267667e-02 ... -1.1260690e-01
    4.4493100e-01  4.0941307e-01]
  [ 7.0589967e-04  1.4825365e-01  3.4328270e-01 ... -8.6039528e-02
    6.9474751e-01  4.3353081e-02]
  [-5.0720602e-01  5.3085494e-01  3.7162632e-01 ... -5.6287450e-01
    1.3755678e-01  2.8475279e-01]
  ...
  [-4.2251340e-01  5.7314664e-02  2.4338306e-01 ... -1.5222676e-01
    2.4462426e-01  6.4154869e-01]
  [-4.9384493e-01 -1.8895482e-01  1.2640803e-01 ...  6.3240677e-02
    3.6912847e-01 -5.8252141e-02]
  [ 8.3268642e-01  2.4948184e-01 -4.5439535e-01 ...  1.1997543e-01
   -3.9257327e-01 -2.7785364e-01]]], shape=(1, 8, 768), dtype=float32)
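
The length of 7 follows from how hidden states are collected: the embedding output is recorded first, then the output of each of the 6 layers. The sketch below mimics that bookkeeping with stand-in layers. `run_encoder` and the additive toy layers are hypothetical and only illustrate the counting, not the actual DistilBERT computation.

```python
def run_encoder(embedding_output, layers):
    """Apply each layer in turn and record every intermediate state."""
    hidden_states = [embedding_output]   # index 0: embedding output, before any layer
    state = embedding_output
    for layer in layers:
        state = layer(state)
        hidden_states.append(state)      # index i: output of layer i
    return state, tuple(hidden_states)

n_layers = 6  # matches "n_layers": 6 in the config above

# Stand-in layers: layer k simply adds k to its input.
toy_layers = [lambda x, k=k: x + k for k in range(1, n_layers + 1)]

last_state, all_states = run_encoder(0, toy_layers)
print(len(all_states))
print(all_states[-1] == last_state)
```

This is why `output[1][6]` prints the same values as `output[0]`: the last recorded hidden state is the final layer's output.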


**The same steps apply to any Transformer/BERT-like model.**

### Fine-tuning for various tasks

- Refer: https://arxiv.org/pdf/1810.04805.pdf

- Covered in the next video