# DistilBERT Exercise

In this exercise, you will use the DistilBERT model for a binary classification task. The dataset is the reviews dataset that you have been using for the homework assignments. You will use the <b>sentiment</b> field as the label for this exercise.

You will follow several sections of the blog post at https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379 to guide you through this exercise. 

Read <i><b>Section 1.0: Introduction</b></i> in the blog post.

You can skip <i>Section 2.0: The Data</i> since you will be using the movie reviews dataset for the exercise. 

IMPORTANT: You may not understand all of the details of the blog post, but try to work through the code as quickly as you can. You can go back and read the blog post again while you wait for training to complete in sections 1.3.3 (<i>Training Classification Layer Weights</i>) and 1.3.4 (<i>Fine-tuning DistilBERT and Training</i>).

## Setup

You can install transformers and tensorflow (if you don't have them installed) on the command line or by running the cells below.

In [1]:
#uncomment the line below and run the cell just one time. Comment it back after that. 
#!pip install transformers

In [2]:
#uncomment the line below and run the cell just one time. Comment it back after that. 
#!pip install tensorflow

In [1]:
# You need some imports, so run this cell. 
# You will do some more imports later. 
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf

## Data Preparation - 5 points

In this section, you need to read the data from the file <i>reviews.txt</i> and create lists for the reviews and the labels. The function <i>read_corpus</i> below is a solution from your homework assignments, but it now has two additional parameters: <i>max_length</i> and <i>max_reviews</i>. Both parameters are integers.

You need to modify the function below for the following:
1. Store 1 for the 'pos' label and 0 for the 'neg' label in the <i>all_labels</i> list
2. Ignore all lines containing more than <i>max_length</i> tokens. You can use len(tokens) to check the number of tokens in a line.
3. Only collect a total of <i>max_reviews</i> lines. You can use the Python <i>break</i> keyword to exit a loop early.

For instance, if you call the function as <i>read_corpus("reviews.txt", 1, 150, 2000)</i>, it will ignore all lines having more than 150 tokens and it will collect a total of 2000 reviews. 

In [2]:
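def read_corpus(filename, category_position, max_length, max_reviews):
    """Read up to max_reviews reviews of at most max_length tokens each.

    Returns a list of review strings and a parallel list of labels
    (0 for 'neg', 1 otherwise).
    """
    all_data = []
    all_labels = []

    with open(filename) as f:
        for line in f:
            # Stop once enough reviews have been collected
            if len(all_data) >= max_reviews:
                break
            tokens = line.strip().split()
            # Skip lines with more than max_length tokens (metadata columns included)
            if len(tokens) > max_length:
                continue
            # Store 0 for the 'neg' label and 1 for the 'pos' label
            all_labels.append(0 if tokens[category_position] == 'neg' else 1)
            # The review text starts at the fourth token
            all_data.append(" ".join(tokens[3:]))
    return all_data, all_labels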
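
As a quick sanity check of the three rules above, the sketch below applies the same filtering to a few in-memory lines instead of a file. The helper name `filter_reviews` and the toy lines are hypothetical. The layout (label in column 1, review text starting at column 3) is assumed from the call `read_corpus("reviews.txt", 1, 150, 2000)` and the `tokens[3:]` slice.

```python
def filter_reviews(lines, category_position, max_length, max_reviews):
    """Apply the read_corpus filtering rules to an iterable of lines (hypothetical helper)."""
    data, labels = [], []
    for line in lines:
        if len(data) >= max_reviews:      # rule 3: stop after max_reviews reviews
            break
        tokens = line.strip().split()
        if len(tokens) > max_length:      # rule 2: skip lines that are too long
            continue
        # rule 1: 0 for 'neg', 1 for 'pos'
        labels.append(0 if tokens[category_position] == 'neg' else 1)
        data.append(" ".join(tokens[3:]))
    return data, labels


# Toy lines in the assumed format: three metadata tokens, then the review text
toy_lines = [
    "books pos 1.txt a wonderful read",
    "dvd neg 2.txt this one is far too long to keep under the limit",
    "music neg 3.txt flat and dull",
    "camera pos 4.txt sharp photos",
]

toy_data, toy_labels = filter_reviews(toy_lines, 1, max_length=7, max_reviews=2)
print(toy_data)    # ['a wonderful read', 'flat and dull']
print(toy_labels)  # [1, 0]
```

The second line is skipped because it has more than 7 tokens, and the fourth is never read because two reviews have already been collected. As in `read_corpus`, the length check counts the three leading metadata tokens as well.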

In [3]:
# replace the path with the path to your data file and run the cell
all_data, all_labels = read_corpus('../data/HW4_data/all_reviews.txt', 1, 150,2000)

In [4]:
type(all_data)

list

In [5]:
# extract an 80-20 split for train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(all_data, all_labels, test_size=0.2)
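
The split above is shuffled differently on every run. If you want a reproducible split that also preserves the ratio of positive to negative reviews, one possible variant is sketched below on toy data. The `random_state=42` value and the use of `stratify` are assumptions added for illustration and are not required by the exercise.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for all_data and all_labels (10 negative, 10 positive)
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # 80-20 split
    random_state=42,      # fixed seed so the split is the same on every run
    stratify=toy_labels,  # keep the class ratio equal in both sets
)

print(len(X_tr), len(X_va))  # 16 4
print(sum(y_va))             # 2 positive reviews out of 4 in the validation set
```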

In [6]:
type(X_train)

list

In [7]:
X_train[0:3]

["i use this product all the time . it was becoming hard to find in department stores , but i found it available and cheaper online . i 'm very happy .",
 'do not buy the case canon sells for this camera . it takes 20 seconds to take the camera out or put it in',
 "great tripod . we 're using it in a television studio w / sony vx-2000 cameras . they work great"]

In [8]:
print(len(X_train))

1600


In [9]:
print(len(y_train))

1600


In [10]:
print(len(X_valid))

400


In [11]:
print(len(y_valid))

400


In [12]:
# test that your extraction of 2000 reviews was correct. 
# If it throws an error, then you made a mistake. 
assert(len(X_train)==1600)
assert(len(X_valid)==400)

In [13]:
# check a few reviews and their labels, make sure the label is 1 or 0 (not 'pos'/'neg')
print(X_train[10],y_train[10])

stay away from all the movies from wesley snipes ' and steven seagal's . they are brain-dead movies with brain-dead screenplays , bad directors , bad everythings . this movie , like other viewers ' said , drove me nuts . i have to wear straight jacket to refrain myself everytime when i watched these two guys ' movies , otherwise i might have crashed my tv set or trashed everything around me in order to vant my frustration . amazon.com should also stay away from selling this crappy movie . god help us . 0


In [14]:
# check a few reviews and their labels, make sure the label is 1 or 0 (not 'pos'/'neg')
print(X_valid[10],y_valid[10])

thought the cards would be like broderbund 's american gretings creatacard platinum 8 software but it was not and i did not like what i got . 0


In [15]:
# check a few reviews and their labels, make sure the label is 1 or 0 (not 'pos'/'neg')
print(X_valid[2],y_valid[2])

after total knee replacement , this wedge sure helped to keep my knee in one place and elevated , while preventing me from moving too much while i slept or was resting . highly recommended 1


## Transfer Learning with Hugging Face - 12 points

Read Section 3.0 in the blog post https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379 as a guide for this section. 

You can use the code given in the blog post, but you will need to make some minor modifications. 

### Tokenizing the text

Use the code from Section 3.1 on the blog post for this section. You will need to make the following modifications:

1. <i>X_train</i> and <i>X_valid</i> are already lists, so you don't need to convert them to lists using <i>X_train.tolist()</i> and <i>X_valid.tolist()</i>. Just use <i>X_train</i> and <i>X_valid</i>.

2. You will not be using <i>X_test</i>, so comment out lines using <i>X_test</i>. 

In [16]:
from transformers import DistilBertTokenizerFast

# Instantiate DistilBERT tokenizer...we use the Fast version to optimize runtime
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [18]:
# Define the maximum number of tokens per sequence (DistilBERT accepts at most 512)
MAX_LENGTH = 128


# Define function to encode text data in batches
def batch_encode(tokenizer, texts, batch_size=256, max_length=MAX_LENGTH):
    """
    A function that encodes a batch of texts and returns the texts'
    corresponding encodings and attention masks that are ready to be fed 
    into a pre-trained transformer model.
    
    Input:
        - tokenizer:   Tokenizer object from the PreTrainedTokenizer Class
        - texts:       List of strings where each string represents a text
        - batch_size:  Integer controlling number of texts in a batch
        - max_length:  Integer controlling max number of tokens to keep in a given text
    Output:
        - input_ids:       sequence of texts encoded as a tf.Tensor object
        - attention_mask:  the texts' attention mask encoded as a tf.Tensor object
    """
    
    input_ids = []
    attention_mask = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer.batch_encode_plus(batch,
                                             max_length=max_length,
                                             padding='max_length', # pad every batch to max_length so all rows share one shape
                                             truncation=True,
                                             return_attention_mask=True,
                                             return_token_type_ids=False
                                             )
        input_ids.extend(inputs['input_ids'])
        attention_mask.extend(inputs['attention_mask'])
    
    print([len(i) for i in input_ids])
    
    return tf.convert_to_tensor(input_ids), tf.convert_to_tensor(attention_mask)
    
    
# Encode X_train
X_train_ids, X_train_attention = batch_encode(tokenizer, X_train)

# Encode X_valid
X_valid_ids, X_valid_attention = batch_encode(tokenizer, X_valid)

# Encode X_test
#X_test_ids, X_test_attention = batch_encode(tokenizer, X_test)

[128, 128, 128, 128, 128, 128, 128, 128, ...
(output truncated: every length printed is 128)
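
To make the two outputs of `batch_encode` more concrete, the sketch below mimics truncation, padding, and the attention mask on toy token-ID lists in plain Python. The helper `pad_and_mask` and the IDs are hypothetical and are not part of the transformers library. The real tokenizer also converts text to IDs and adds special tokens such as [CLS], which this simplified version leaves out.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate or pad each ID list to max_length and build a matching attention mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]            # truncation: drop anything past max_length
        n_pad = max_length - len(seq)     # padding needed to reach max_length
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


toy_sequences = [[7, 8, 9], [5, 6, 7, 8, 9, 10]]
ids, mask = pad_and_mask(toy_sequences, max_length=4)
print(ids)   # [[7, 8, 9, 0], [5, 6, 7, 8]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Every row ends up with the same length, which is what allows the lists to be converted into a single rectangular tensor.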

### Defining the Model Architecture

Read and use the code from Section 3.2 on the blog post for initializing the base model and adding a classification head. You will need to make the following modification:

1. Instead of <i>loss=focal_loss()</i>, use <i>loss='binary_crossentropy'</i> (Otherwise, you will need to write code to define focal_loss()).

You may get a message regarding some layers from the model checkpoint not being used when initializing TFDistilBertModel. That is okay. 
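
For intuition about the loss swap above, the sketch below compares binary cross-entropy with a commonly used form of focal loss for a single prediction. Both functions are written by hand for illustration. The `gamma=2.0` default is an assumption and not necessarily the value used in the blog post, and the class-weighting term that focal loss often includes is omitted.

```python
import math


def binary_crossentropy(y_true, p):
    """Cross-entropy for one example with label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def focal_loss(y_true, p, gamma=2.0):
    """Focal loss without class weighting: cross-entropy scaled by (1 - p_t) ** gamma."""
    p_t = p if y_true == 1 else 1 - p   # probability assigned to the true class
    return -((1 - p_t) ** gamma) * math.log(p_t)


# A confidently correct prediction: focal loss shrinks its contribution
bce_easy = binary_crossentropy(1, 0.9)
focal_easy = focal_loss(1, 0.9)
print(bce_easy > focal_easy)  # True

# With gamma = 0 the scaling factor is 1, so the two losses coincide
print(abs(focal_loss(1, 0.9, gamma=0.0) - bce_easy) < 1e-12)  # True
```

Focal loss mainly helps with imbalanced classes, so plain binary cross-entropy is a reasonable simplification when the two labels are roughly balanced.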

In [17]:
from transformers import TFDistilBertModel, DistilBertConfig

DISTILBERT_DROPOUT = 0.2
DISTILBERT_ATT_DROPOUT = 0.2
 
# Configure DistilBERT's initialization
config = DistilBertConfig(dropout=DISTILBERT_DROPOUT, 
                          attention_dropout=DISTILBERT_ATT_DROPOUT, 
                          output_hidden_states=True)
                          
# The bare, pre-trained DistilBERT transformer model outputting raw hidden-states 
# and without any specific head on top.
distilBERT = TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)

# Make DistilBERT layers untrainable
for layer in distilBERT.layers:
    layer.trainable = False

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_projector', 'vocab_layer_norm', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [18]:
MAX_LENGTH = 128
LAYER_DROPOUT = 0.2
LEARNING_RATE = 5e-5
RANDOM_STATE = 42

def build_model(transformer, max_length=MAX_LENGTH):
    """
    Template for building a model off of the BERT or DistilBERT architecture
    for a binary classification task.
    
    Input:
      - transformer:  a base Hugging Face transformer model object (BERT or DistilBERT)
                      with no added classification head attached.
      - max_length:   integer controlling the maximum number of encoded tokens 
                      in a given sequence.
    
    Output:
      - model:        a compiled tf.keras.Model with added classification layers 
                      on top of the base pre-trained model architecture.
    """
    
    # Define weight initializer with a random seed to ensure reproducibility
    weight_initializer = tf.keras.initializers.GlorotNormal(seed=RANDOM_STATE) 
    
    # Define input layers
    input_ids_layer = tf.keras.layers.Input(shape=(max_length,), 
                                            name='input_ids', 
                                            dtype='int32')
    input_attention_layer = tf.keras.layers.Input(shape=(max_length,), 
                                                  name='input_attention', 
                                                  dtype='int32')
    
    # DistilBERT outputs a tuple where the first element at index 0
    # represents the hidden-state at the output of the model's last layer.
    # It is a tf.Tensor of shape (batch_size, sequence_length, hidden_size=768).
    last_hidden_state = transformer([input_ids_layer, input_attention_layer])[0]
    
    # We only care about DistilBERT's output for the [CLS] token, 
    # which is located at index 0 of every encoded sequence.  
    # Slicing out the [CLS] tokens gives us 2D data.
    cls_token = last_hidden_state[:, 0, :]
    
    ##                                                 ##
    ## Define additional dropout and dense layers here ##
    ##                                                 ##
    
    # Define a single node that makes up the output layer (for binary classification)
    output = tf.keras.layers.Dense(1, 
                                   activation='sigmoid',
                                   kernel_initializer=weight_initializer,  
                                   kernel_constraint=None,
                                   bias_initializer='zeros'
                                   )(cls_token)
    
    # Define the model
    model = tf.keras.Model([input_ids_layer, input_attention_layer], output)
    
    # Compile the model
    model.compile(tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE), # `lr` is deprecated; use learning_rate
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    
    return model


In [19]:
# After you initialize the base model and add a classification head,
# invoke the build_model function to make an instance of your model
model = build_model(distilBERT)
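
The shapes inside `build_model` can be checked without loading DistilBERT. The sketch below uses random NumPy arrays as stand-ins for the transformer output and for the weights of the single-node output layer, so the values are meaningless and only the shapes matter. The batch size of 2 and sequence length of 5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for DistilBERT's last hidden state: (batch_size, sequence_length, hidden_size)
batch_size, seq_len, hidden_size = 2, 5, 768
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep only position 0 of every sequence, which is where the [CLS] token sits
cls_token = last_hidden_state[:, 0, :]
print(cls_token.shape)  # (2, 768)

# Stand-in for Dense(1, activation='sigmoid'): one weight per hidden unit plus a bias
weights = rng.normal(scale=0.01, size=(hidden_size, 1))
bias = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(cls_token @ weights + bias)))
print(probs.shape)  # (2, 1)
```

Each review is reduced to one 768-dimensional vector and then to one probability between 0 and 1, which is what `binary_crossentropy` compares against the 0/1 labels.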

### Training Classification Layer Weights

Read and use the code from Section 3.3 on the blog post for this section. You can ignore the part at the end of Section 3.3 where the author discusses adding 2 dense layers using grid search. You will need to make the following modifications:

1. Since <i>y_train</i> and <i>y_valid</i> are lists, use <i>np.asarray(y_train)</i> and <i>np.asarray(y_valid)</i> to convert them to NumPy arrays instead of calling <i>to_numpy()</i>.
2. Use <i>verbose=1</i> to show training progress for each epoch.
3. Comment out the lines <i>NUM_STEPS = len(X_train.index)</i> and <i>steps_per_epoch = NUM_STEPS</i>. <i>X_train</i> is a list here, so <i>len(X_train.index)</i> would fail, and when <i>steps_per_epoch</i> is omitted, Keras derives the number of steps per epoch from the number of training examples and <i>batch_size</i>.
4. You can change EPOCHS to 4 if you have a slower machine and would like training to complete earlier.
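
As a rough check on what Keras does when `steps_per_epoch` is left out, the arithmetic below uses the sizes from this exercise (1600 training reviews and a batch size of 64). It is a plain calculation, not a call into Keras.

```python
import math

n_train = 1600     # number of training reviews after the 80-20 split
batch_size = 64    # BATCH_SIZE used in this exercise

# One epoch is one pass over the training data, so the number of steps
# (batches) per epoch is the dataset size divided by the batch size, rounded up
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 25

# Setting steps_per_epoch to the number of examples instead would ask for
# this many examples per epoch, far more than the 1600 available
examples_requested = n_train * batch_size
print(examples_requested)  # 102400
```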

IMPORTANT: This part will take a while to complete.

In [20]:
EPOCHS = 4
BATCH_SIZE = 64
#NUM_STEPS = len(X_train.index) // BATCH_SIZE

# Train the model
train_history1 = model.fit(
    x = [X_train_ids, X_train_attention],
    y = np.asarray(y_train),
    epochs = EPOCHS,
    batch_size = BATCH_SIZE,
#     steps_per_epoch = NUM_STEPS,
    validation_data = ([X_valid_ids, X_valid_attention], np.asarray(y_valid)),
    verbose=1
)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Fine-tuning DistilBERT and Training

Read and use the code from Section 3.4 on the blog post for this section. 

You will need to make the same modifications to the code as you did in sections 1.3.2 (<i>Defining the Model Architecture</i>) and 1.3.3 (<i>Training Classification Layer Weights</i>).

IMPORTANT: This part will take a while to complete.

In [21]:
FT_EPOCHS = 4
BATCH_SIZE = 64
#NUM_STEPS = len(X_train.index)

# Unfreeze distilBERT layers and make available for training
for layer in distilBERT.layers:
    layer.trainable = True
    
# Recompile model after unfreezing
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), #Changed learning rate
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
train_history2 = model.fit(
    x = [X_train_ids, X_train_attention],
    y = np.asarray(y_train),
    epochs = FT_EPOCHS,
    batch_size = BATCH_SIZE,
#     steps_per_epoch = NUM_STEPS,
    validation_data = ([X_valid_ids, X_valid_attention], np.asarray(y_valid)),
    verbose=1
)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
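
The object returned by `model.fit` keeps the per-epoch metrics in its `history` attribute, a dictionary of lists. The sketch below shows one way to find the epoch with the best validation accuracy. The dictionary is a toy stand-in with made-up numbers, not the results of this run. With `metrics=['accuracy']`, the real `train_history2.history` should have the same four keys.

```python
# Toy stand-in for train_history2.history; the numbers are made up for illustration
toy_history = {
    'loss':         [0.60, 0.40, 0.30, 0.20],
    'accuracy':     [0.70, 0.82, 0.88, 0.92],
    'val_loss':     [0.50, 0.35, 0.38, 0.42],
    'val_accuracy': [0.78, 0.88, 0.87, 0.86],
}

val_acc = toy_history['val_accuracy']

# Index of the highest validation accuracy, shifted by 1 because epochs are 1-based
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1
print(f"best epoch: {best_epoch}, val_accuracy: {val_acc[best_epoch - 1]:.2f}")
# best epoch: 2, val_accuracy: 0.88
```

A training loss that keeps falling while the validation accuracy stalls or drops, as in the toy numbers, is a common sign of overfitting.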


## Observations - 3 points

Write a few sentences describing your experience in this exercise. Use the cell below and set the type to Markdown. 

If you ran into issues or the training did not complete, check the solution that will be posted by tomorrow.

This exercise reveals the ever-changing nature of TensorFlow.<br>
For example, even though the article was written less than a year ago, some of its code is already deprecated, such as the lr parameter, which has been replaced by learning_rate.<br>
Now let us discuss the actual model training:<br>
With the DistilBERT layers frozen and a learning rate of 5e-5, the validation accuracy did not go past 0.505.<br>
When the layers were unfrozen and the learning rate was lowered to 2e-5, the validation accuracy was as high as 0.9075 in the second epoch.<br>
In both training runs, the loss decreases, but at the end of epoch 4, the fine-tuned model has a loss much closer to 0 (0.1496) than the original model, whose final loss was 0.7001.
