# Lesson 4 Notebook: BERT Endeavors

**Description:** After some setup for our standard IMDB movie classification task, we will explore BERT (obtained from the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index)) and apply it to text classification in one particular way: using the pooler output.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup) 
  * 2. [Data Acquisition](#dataAcquisition)  
  * 3. [BERT Basics](#bertBasics)
    * 3.1 [Tokenization](#tokenization)
    * 3.2 [Model Structure & Output](#modelOutput)
    * 3.3 [Context-based Embeddings with BERT](#contextualEmbeddings)
  * 4. [Text Classification with BERT (using the Pooler Output)](#BERTClassification)
    * 4.1 [Class Exercise](#classExercise)

  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-spring-main/blob/master/materials/lesson_notebooks/lesson_4_BERT.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

This notebook requires the tensorflow-datasets library and a few other prerequisites, which you must download and install locally.

In [1]:
!pip install tensorflow-datasets --quiet

pydot is also helpful, along with **graphviz**.

In [2]:
!pip install pydot --quiet

For BERT and other Transformer models we generally use Hugging Face's implementations:

In [3]:
!pip install transformers --quiet


Ready to do the imports.

In [4]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds



import sklearn as sk
import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re

For the Transformer library we need to import the **tokenizer** and the TensorFlow **model**:

In [5]:
from transformers import BertTokenizer, TFBertModel

We then continue for now as we have before.

Below is a helper function to plot histories.

In [6]:
# 4-window plot of loss and accuracy for two models, for comparison

def make_plot(axs,
              model_history1, 
              model_history2, 
              model_1_name='model 1',
              model_2_name='model 2',
              ):
    box = dict(facecolor='yellow', pad=5, alpha=0.2)

    for i, metric in enumerate(['loss', 'accuracy']):
        y_lim_lower1 = np.min(model_history1.history[metric])
        y_lim_lower2 = np.min(model_history2.history[metric])
        y_lim_lower = min(y_lim_lower1, y_lim_lower2) * 0.9

        y_lim_upper1 = np.max(model_history1.history[metric])
        y_lim_upper2 = np.max(model_history2.history[metric])
        y_lim_upper = max(y_lim_upper1, y_lim_upper2) * 1.1

        for j, model_history in enumerate([model_history1, model_history2]):
            model_name = [model_1_name, model_2_name][j]
            ax1 = axs[i, j]
            ax1.plot(model_history.history[metric])
            ax1.plot(model_history.history['val_%s' % metric])
            ax1.set_title('%s - %s' % (metric, model_name))
            ax1.set_ylabel(metric, bbox=box)
            ax1.set_ylim(y_lim_lower, y_lim_upper)

A small function calculating the cosine similarity may also come in handy:

In [7]:
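def cosine_similarities(vecs):
    # Print the pairwise cosine-similarity matrix, one row per vector
    for v_1 in vecs:
        similarities = ''
        for v_2 in vecs:
            cos = np.dot(v_1, v_2) / np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2))
            # Format to two decimals; slicing str(cos) breaks on negative or very small values
            similarities += '\t%.2f' % cos
        print(similarities)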

[Return to Top](#returnToTop)  
<a id = 'dataAcquisition'></a>

## 2. Data Acquisition

We will use the IMDB dataset delivered as part of the tensorflow-datasets library, and split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [None]:
train_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:80%]', 'test[80%:]'),
    as_supervised=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/w266project/train.csv')
print('Number of reviews:', df.shape[0])
#print('Unique rating values:', np.sort(df_init.rating.unique()))
df.head()

Number of reviews: 38000


| Unnamed: 0.1 | Unnamed: 0 | comment_text | toxic |
|---|---|---|---|
| 0 | 0 | people all over the country are so angry about... | 1 |
| 1 | 1 | barry soetoro has acted with stupidity and mal... | 1 |
| 2 | 2 | yeah hockeytown or is that you sarah palin moo... | 1 |
| 3 | 3 | with his words trump described acts of felonio... | 1 |
| 4 | 4 | walker is a one term gov he knows this he is t... | 1 |


In [15]:
from sklearn.model_selection import train_test_split
train_examples, test_examples, train_labels, test_labels = train_test_split(
    df.comment_text,
    df.toxic,
    test_size=0.20,
    random_state=1,
    shuffle=True
)

In [20]:
train_examples, train_labels = tf.convert_to_tensor(train_examples), tf.convert_to_tensor(train_labels)
test_examples, test_labels = tf.convert_to_tensor(test_examples), tf.convert_to_tensor(test_labels)

In [21]:
train_examples[:4]

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'you are nuts trumpy cannot run the government like a private business he has to deal with congress they hold the money he thinks he can be a dictator that he can do anything and everything as long as he says to do it sorry but this is not the way our government works so how can he do things by himself to reduce the corporate tax rates congress has to approve to impose tariffs on foreign goods congress has to approve and he cannot tell corporations to bring back jobs that they sent overseas ',
       b'it s great to see a campaign not beholden to big corporations seems like we should be applauding that i agree with ria 5948 we re last in corporate taxes and changing that is bound to rustle feathers with big corps like comcast if anything that makes this campaign targetting comcast ring even more true surprised to see ww upset that someone is siding with schools over corporate profits ',
       b'cherry picking history is the act of a

In [22]:
train_labels[:4]

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([1, 0, 1, 0])>

[Return to Top](#returnToTop)  
<a id = 'bertBasics'></a>
## 3. BERT Basics

We now need to settle on the pre-trained BERT model we want to use. We will leverage **'bert-base-cased'**.

We need to create the corresponding model and tokenizer:

In [8]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = TFBertModel.from_pretrained('bert-base-cased')


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


[Return to Top](#returnToTop)  
<a id = 'tokenization'></a>

### 3.1 Tokenization

Tokenization with BERT is interesting. To minimize the number of unknown words, BERT (like most pre-trained transformer models) uses a **subword** model for tokenization. We will see what that means in a second.

Let's start with something simple:

Ok, that is as expected. What about:

or

Ouch! Many more complex terms are not in BERT's vocabulary and are split up.
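To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting, the general strategy behind WordPiece-style tokenizers. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are invented for this illustration; they are not BERT's real tokenizer or vocabulary. With the notebook's own tokenizer, the real splits can be inspected with `bert_tokenizer.tokenize(text)`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example (not BERT's real vocabulary)
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "the"}

print(wordpiece_tokenize("the", toy_vocab))           # ['the']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Note how a word that is missing from the vocabulary as a whole still gets a representation as long as its pieces are known, which is why unknown tokens become rare.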

**Question:** in what type of NLP problems can this lead to complications?

Next, how do we generate the BERT input with its tokenizer? Fortunately, by now Huggingface's tokenizer implementation makes this rather straightforward:

To make sure we do this correctly, though, we may want to request TensorFlow tensors as output (rather than PyTorch ones) and add some padding:

What do we notice? Look at shapes and values. Does everything make sense?

**Question**: What is the input_id of the CLS token? What is the input_id of the SEP token?
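The effect of padding on the tokenizer output can be sketched in plain Python. This is a simplified illustration, not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the token IDs are invented (1 and 2 merely stand in for the special tokens), and `pad_id=0` is an assumption of this sketch.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad each ID sequence to max_length and build the attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        # Simplified truncation: the real tokenizer keeps the closing [SEP] when it truncates
        ids = ids[:max_length]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token IDs, invented for illustration: 1 and 2 stand in for [CLS] and [SEP]
batch = [[1, 11, 12, 2], [1, 21, 22, 23, 24, 2]]
encoded = pad_batch(batch, max_length=6)

print(encoded["input_ids"])       # [[1, 11, 12, 2, 0, 0], [1, 21, 22, 23, 24, 2]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor, and the attention mask records which positions are real.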

[Return to Top](#returnToTop)  
<a id = 'modelOutput'></a>

### 3.2 Model Structure & Output

Now that we have familiarized ourselves with the tokenization, we can turn to the model and its output. How does that work? Simple!

Let's look at the first 3 of these:

What are those three?  What can you infer from the shape?

Let's look at two more examples.  

Let's analyze this a bit:

What does that mean? Are the dimensions correct? Why are there 2 outputs? Let's discuss in class. You can (and should!) also read the documentation at https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertModel. This is **really** critical.
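As a shape-only sketch of the two outputs, we can use random NumPy arrays as stand-ins. No BERT model is run here; the batch size and sequence length are arbitrary choices for this example, and 768 is the hidden size of BERT-base.

```python
import numpy as np

batch_size, seq_len, hidden_size = 3, 8, 768   # 768 is the hidden size of BERT-base

rng = np.random.default_rng(0)
# Stand-in for last_hidden_state: one vector per token position
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))
# Stand-in for pooler_output: one vector per input sequence
pooler_output = rng.normal(size=(batch_size, hidden_size))

cls_vectors = last_hidden_state[:, 0, :]     # position 0 holds the [CLS] token
one_token_vector = last_hidden_state[1, 4]   # sequence 1, token position 4

print(last_hidden_state.shape)  # (3, 8, 768)
print(pooler_output.shape)      # (3, 768)
print(cls_vectors.shape)        # (3, 768)
print(one_token_vector.shape)   # (768,)
```

The first output is useful when we need a vector per token; the second is a single summary vector per sequence.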

[Return to Top](#returnToTop)  
<a id = 'contextualEmbeddings'></a>

### 3.3 Context-based Embeddings with BERT

Let's look at the word "bank" in a few contexts:

Next, we will get the outputs and extract the word vectors for bank in each of these sentences:

Where are those numbers coming from?
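One way to see where such position numbers come from is to look at the token sequence itself. The sentence and the tiny `hidden_states` list below are hypothetical stand-ins, not actual BERT output; the point is only that the special tokens shift every word's position by one.

```python
# Hypothetical tokenized sentence, including the special tokens BERT adds
tokens = ["[CLS]", "I", "sat", "on", "the", "river", "bank", ".", "[SEP]"]

# [CLS] occupies position 0, so the first real word sits at position 1
bank_position = tokens.index("bank")
print(bank_position)  # 6

# Stand-in for one sequence's hidden states: one (tiny) vector per token position
hidden_states = [[float(i)] * 4 for i in range(len(tokens))]
bank_vector = hidden_states[bank_position]
print(bank_vector)  # [6.0, 6.0, 6.0, 6.0]
```

If the tokenizer splits a word into several subword pieces, the positions of all following tokens shift accordingly, so it is worth checking the token list for each sentence.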

Finally, we obtain the cosine similarities between the 4 vectors (from left to right and top to bottom we iterate through our vectors and report the cosine similarity):

Does this look right?
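To sanity-check a table of pairwise cosine similarities, it can help to compute the full matrix on vectors where the answer is obvious. The sketch below uses a hypothetical helper, `cosine_similarity_matrix`, and toy 2-dimensional vectors rather than BERT embeddings.

```python
import numpy as np


def cosine_similarity_matrix(vecs):
    """Return the matrix of pairwise cosine similarities between the rows of vecs."""
    vecs = np.asarray(vecs, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # scale rows to length 1
    return unit @ unit.T


# Toy vectors: the first two point the same way, the third is orthogonal to them
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
sims = cosine_similarity_matrix(toy_vectors)

print(round(float(sims[0, 1]), 2))  # 1.0  (same direction)
print(round(float(sims[0, 2]), 2))  # 0.0  (orthogonal)
print(round(float(sims[0, 3]), 2))  # 0.71 (45 degrees apart)
```

Two properties should always hold: the diagonal is 1 (each vector compared with itself) and the matrix is symmetric.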

[Return to Top](#returnToTop)  
<a id = 'BERTClassification'></a>

## 4. Text Classification with BERT (using the Pooler Output)

The BERT model returns two values that can be exploited for classification purposes. The first is the last_hidden_state, which is the sequence of hidden states at the output of the last layer of the model. The second is the pooler output, which is the last-layer hidden state of the [CLS] token passed through an additional linear layer followed by a tanh. This pooler output can be used for classification purposes.
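The relationship between the two outputs can be sketched in NumPy. The weights below are random stand-ins for the pooler's learned parameters, the sizes are tiny for readability, and the classification head is reduced to a single sigmoid unit (an assumption of this sketch; the model defined in this notebook also uses a hidden layer and dropout).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size = 2, 5, 8   # tiny sizes for illustration (BERT-base uses 768)

# Stand-in for the last-layer hidden states of a batch
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Randomly initialized stand-ins for the pooler's learned weights
W_pool = rng.normal(size=(hidden_size, hidden_size))
b_pool = np.zeros(hidden_size)

cls_vectors = last_hidden_state[:, 0, :]                # hidden state of [CLS]
pooler_output = np.tanh(cls_vectors @ W_pool + b_pool)  # linear layer followed by tanh

# Simplified binary classification head: one sigmoid unit on top of the pooler output
w_cls = rng.normal(size=(hidden_size, 1))
probs = 1.0 / (1.0 + np.exp(-(pooler_output @ w_cls)))

print(pooler_output.shape)  # (2, 8)
print(probs.shape)          # (2, 1)
```

Because of the tanh, every entry of the pooler output lies between -1 and 1, and the sigmoid turns the head's score into a probability for the positive class.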

Let us create the data. More will be discussed in class. (We can limit the training and test data sizes for expedience in class.)

In [27]:
# BERT Tokenization of training and test data

num_train_examples = 20000      # set number of train examples - 1500 for realtime demo
num_test_examples = 5000        # set number of test examples - 500 for realtime demo

MAX_SEQUENCE_LENGTH = 128                 # set max_length of the input sequence

# Decode the byte strings and keep only the requested number of examples
all_train_examples = [x.decode('utf-8') for x in train_examples.numpy()[:num_train_examples]]
all_test_examples = [x.decode('utf-8') for x in test_examples.numpy()[:num_test_examples]]

x_train = bert_tokenizer(all_train_examples,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_train = train_labels[:num_train_examples]   # keep labels aligned with the examples

x_test = bert_tokenizer(all_test_examples,
              max_length=MAX_SEQUENCE_LENGTH,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_test = test_labels[:num_test_examples]      # keep labels aligned with the examples

Now we define the model...

In [28]:
def create_bert_classification_model(bert_model,
                                     num_train_layers=0,
                                     hidden_size = 200, 
                                     dropout=0.3,
                                     learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the Pooler Output for classification purposes
    """
    if num_train_layers == 0:
        # Freeze all layers of pre-trained BERT model
        bert_model.trainable = False

    elif num_train_layers == 12: 
        # Train all layers of the BERT model
        bert_model.trainable = True

    else:
        # Restrict training to the num_train_layers outer transformer layers
        retrain_layers = []

        for retrain_layer_number in range(num_train_layers):

            layer_code = '_' + str(11 - retrain_layer_number)
            retrain_layers.append(layer_code)
          
        
        print('retrain layers: ', retrain_layers)

        for w in bert_model.weights:
            if not any([x in w.name for x in retrain_layers]):
                #print('freezing: ', w)
                w._trainable = False

    input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64, name='input_ids_layer')
    token_type_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64, name='token_type_ids_layer')
    attention_mask = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64, name='attention_mask_layer')

    bert_inputs = {'input_ids': input_ids,
                   'token_type_ids': token_type_ids,
                   'attention_mask': attention_mask}      

    bert_out = bert_model(bert_inputs)

    pooler_token = bert_out[1]
    #cls_token = bert_out[0][:, 0, :]

    hidden = tf.keras.layers.Dense(hidden_size, activation='relu', name='hidden_layer')(pooler_token)


    hidden = tf.keras.layers.Dropout(dropout)(hidden)  


    classification = tf.keras.layers.Dense(1, activation='sigmoid',name='classification_layer')(hidden)
    
    classification_model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=[classification])
    
    classification_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=False), 
                                 metrics='accuracy')
    
    return classification_model
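The partial-retraining branch above selects weights by checking whether a code such as `_11` occurs in the weight name. The sketch below replays that selection on a few hypothetical weight names (modeled on the `layer_._N` pattern that the substring test relies on; the exact names in a loaded model are an assumption here) and compares it with a stricter variant that parses the layer number. Both helper functions are invented for this illustration.

```python
import re

# Hypothetical weight names, modeled on the pattern the substring test relies on
weight_names = [
    "tf_bert_model/bert/embeddings/word_embeddings/weight:0",
    "tf_bert_model/bert/encoder/layer_._1/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._10/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._11/output/dense/kernel:0",
    "tf_bert_model/bert/pooler/dense/kernel:0",
]


def select_by_substring(names, num_train_layers):
    """Mirror the substring test: keep names containing '_11', '_10', ..."""
    codes = ["_" + str(11 - i) for i in range(num_train_layers)]
    return [n for n in names if any(code in n for code in codes)]


def select_by_layer_number(names, num_train_layers):
    """Stricter variant: parse the layer number and compare it numerically."""
    first_trainable = 12 - num_train_layers
    selected = []
    for n in names:
        match = re.search(r"layer_\._(\d+)/", n)
        if match and int(match.group(1)) >= first_trainable:
            selected.append(n)
    return selected


print(select_by_substring(weight_names, 2))     # the layer 10 and layer 11 weights
print(select_by_layer_number(weight_names, 2))  # the same two weights
```

On these names both variants agree. The numeric variant may be the safer choice if other parts of a weight name could happen to contain the same characters as a layer code. Note that in either case the embeddings and the pooler are not selected, so they stay frozen.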

In [29]:
bert_classification_model = create_bert_classification_model(bert_model, num_train_layers=0)

In [30]:
#confirm all layers are frozen
bert_classification_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 attention_mask_layer (InputLay  [(None, 128)]       0           []                               
 er)                                                                                              
                                                                                                  
 input_ids_layer (InputLayer)   [(None, 128)]        0           []                               
                                                                                                  
 token_type_ids_layer (InputLay  [(None, 128)]       0           []                               
 er)                                                                                              
                                                                                              

In [32]:
bert_classification_model_history = bert_classification_model.fit(
    [x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
    y_train,
    validation_data=([x_test.input_ids, x_test.token_type_ids, x_test.attention_mask], y_test),
    batch_size=32,
    epochs=2
)  

Epoch 1/2
Epoch 2/2


And that is one way to do text classification with BERT. There are several other ways (see Assignment 2).