# MMAI 894 - Exercise 3
## Transfer learning with DistilBert
The goal of this exercise is to build a text classifier using the pretrained DistilBert published by HuggingFace. You will be doing this using the GLUE/CoLA dataset (https://nyu-mll.github.io/CoLA/).

Submission instructions:

- You cannot edit this notebook directly. Save a copy to your drive, and make sure to identify yourself in the title using your name and student number.
- Do not insert new cells before the final one (titled "Further exploration").
- Verify that your notebook can _restart and run all_.
- Unlike previous assignments, please **submit all three formats: .py, .ipynb, and html** (see https://torbjornzetterlund.com/how-to-save-a-google-colab-notebook-as-html/)
 - The notebook and html submissions should show the completion of your best performing run
 - Submission files should be named: `studentID_lastname_firstname_ex3.py (or .html, .ipynb)`
- The mark will be assessed on the implementation of the functions marked with `#TODO`
- **Do not change anything outside the functions** unless in the further exploration section
- As you are encouraged to explore the network configuration, 20% of the mark is based on final accuracy.
- Note: You do not have to answer the questions in this notebook as part of your submission. They are meant to guide you.

- You should not need to use any additional libraries other than the ones listed below. You may want to import additional modules from those libraries, however.

In [None]:
# This cell installs and imports DistilBert, along with tensorflow_datasets,
# which we will use to load the dataset (https://www.tensorflow.org/datasets/catalog/overview)

!pip install -q transformers tfds-nightly


import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
import nltk
import regex as re
import tensorflow_datasets as tfds
import numpy as np

try: # in Colab the first import attempt can fail :)
  from transformers import DistilBertTokenizer, TFDistilBertModel
except Exception as err: # so we catch the error and import it again
  from transformers import DistilBertTokenizer, TFDistilBertModel


from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.initializers import GlorotNormal


dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')



# Data Preparation

In [None]:
def load_data(save_dir="./"):
  dataset = tfds.load('glue/cola', shuffle_files=True)
  train = tfds.as_dataframe(dataset["train"])
  val = tfds.as_dataframe(dataset["validation"])
  test = tfds.as_dataframe(dataset["test"])
  return train, val, test

def prepare_raw_data(df):
  raw_data = df.loc[:, ["idx", "sentence", "label"]]
  raw_data["label"] = raw_data["label"].astype('category')
  return raw_data

train, val, test = load_data()
train = prepare_raw_data(train)
val = prepare_raw_data(val)
test = prepare_raw_data(test)

Downloading and preparing dataset 368.14 KiB (download: 368.14 KiB, generated: 965.49 KiB, total: 1.30 MiB) to ~/tensorflow_datasets/glue/cola/2.0.0...
Generated splits: train (8551 examples), validation (1043 examples), test (1063 examples)
Dataset glue downloaded and prepared to ~/tensorflow_datasets/glue/cola/2.0.0. Subsequent calls will reuse this data.


Before using this data, we need to clean and QA it. Unlike MNIST, this is a text dataset, and we should be more careful. For example:
- Are there any duplicate entries? 
- What is the range of lengths for the sentences? Should we impose a minimum sentence length?
- Are there "non-sentence" entries? For example, hashtags or other features we should remove? (luckily, this dataset is quite clean, but that might not always be the case!)

NOTE! The sentences are stored as byte strings (`bytes`). To do text manipulations, you might need to decode them using `s.decode("utf-8")`

You may notice that the test set has no labels. This is because GLUE is a benchmark dataset, and the test set only gets scored on submissions.
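
The QA questions above can be explored with a few lines of pandas. The following is a minimal sketch on a small made-up frame (`toy` and its rows are hypothetical stand-ins, not CoLA data); it assumes the same layout as the real frames, with byte-string sentences in a `sentence` column.

```python
import pandas as pd

# Hypothetical stand-in for a CoLA frame: byte-string sentences, integer labels.
toy = pd.DataFrame({
    "idx": [0, 1, 2, 3],
    "sentence": [b"The cat sat.", b"The cat sat.", b"Sat cat the.", b"Go."],
    "label": [1, 1, 0, 1],
})

# Decode bytes to str before doing any text manipulation.
decoded = toy["sentence"].str.decode("utf-8")

# Count rows whose sentence repeats an earlier row.
n_duplicates = int(decoded.duplicated().sum())

# Sentence lengths measured in whitespace-separated words.
word_counts = decoded.str.split().str.len()
length_range = (int(word_counts.min()), int(word_counts.max()))

print("duplicates:", n_duplicates)
print("shortest/longest sentence (words):", length_range)
```

Running the same checks on `train` and `val` gives a quick picture of whether a minimum-length filter or de-duplication is worth applying.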

In [None]:

def clean_data(df):
    # TODO: What data cleaning/filtering should you consider?
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    # Drop repeated sentences, keeping the first occurrence (modifies df in place)
    df.drop_duplicates(subset='sentence', inplace=True)

    return df

train = clean_data(train)
val = clean_data(val)
test = clean_data(test)

print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8532 entries, 0 to 8550
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   idx       8532 non-null   int32   
 1   sentence  8532 non-null   object  
 2   label     8532 non-null   category
dtypes: category(1), int32(1), object(1)
memory usage: 175.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1062
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   idx       1060 non-null   int32   
 1   sentence  1060 non-null   object  
 2   label     1060 non-null   category
dtypes: category(1), int32(1), object(1)
memory usage: 21.9+ KB
None


Next, we need to prepare the text for DistilBert. Instead of ingesting raw text, the model takes token IDs, which it uses to look up its internal embeddings. Additionally, since the input is fixed size (due to our use of batches), shorter sentences are padded, and we need to tell the model which positions hold real tokens (i.e. are part of the sentence) and which are padding. This is the job of the attention mask.

Luckily, `dbert_tokenizer` takes care of all that for us:
- Preprocessing: https://huggingface.co/transformers/preprocessing.html
- Summary of tokenizers (DistilBert uses WordPiece): https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
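
To make the role of the attention mask concrete, here is a toy sketch of padding a sequence of token IDs to a fixed length. It does not call the real tokenizer: `pad_with_mask` is a hypothetical helper, and the token IDs and `pad_id=0` are made-up values chosen only for illustration.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Made-up IDs standing in for one tokenized sentence.
example_ids = [11, 7, 8, 12]
input_ids, attention_mask = pad_with_mask(example_ids, max_length=6)
print(input_ids)
print(attention_mask)
```

The real tokenizer returns the same two pieces of information under the keys `input_ids` and `attention_mask`.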

In [None]:
def extract_text_and_y(df):
  text = [x.decode('utf-8') for x in  df.sentence.values]
  # for multiclass problems, you can use sklearn.preprocessing.OneHotEncoder, but we only have two classes, so we'll use a single sigmoid output
  y = np.array([x for x in df.label.values])

  return text, y

def encode_text(text):
    # TODO: encode text using dbert_tokenizer
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    encoded = dbert_tokenizer(
              text = text,
              add_special_tokens = True,
              max_length = 128,
              padding = 'max_length',
              truncation= True,
              return_tensors = 'tf'
              )


    input_ids = encoded['input_ids']
    attention_mask = encoded['attention_mask']

    
    return input_ids, attention_mask

# the following prepares the input for running in DistilBert
train_text, train_y = extract_text_and_y(clean_data(train))
val_text, val_y = extract_text_and_y(clean_data(val))
test_text, test_y = extract_text_and_y(clean_data(test))

train_input, train_mask = encode_text(train_text)
val_input, val_mask = encode_text(val_text)
test_input, test_mask = encode_text(test_text)

train_model_inputs_and_masks = {
    'inputs' : train_input,
    'masks' : train_mask
}

val_model_inputs_and_masks = {
    'inputs' : val_input,
    'masks' : val_mask
}

test_model_inputs_and_masks = {
    'inputs' : test_input,
    'masks' : test_mask
}

# Modelling

## Build and Train Model

Resources:
- BERT paper https://arxiv.org/pdf/1810.04805.pdf
- DistilBert paper: https://arxiv.org/abs/1910.01108
- DistilBert Tensorflow Documentation: https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel
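
Hint 1 in the cell below relies on the classification token sitting at the first position of the output sequence. As a shape-only sketch, with random numbers standing in for DistilBert's hidden states (the array here is an assumption for illustration, not model output), selecting that position looks like this:

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 128, 768

# Random stand-in for last_hidden_state: one 768-dim vector per token position.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every example, position 0 only, all hidden features.
cls_vectors = last_hidden_state[:, 0, :]
print(cls_vectors.shape)  # (2, 768)
```

Each row of `cls_vectors` is a single fixed-size vector per sentence, which is what a dense classification head expects as input.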

In [None]:
def build_model(base_model, trainable=False, params={}):
    # TODO: build the model, with the option to freeze the parameters in distilBERT
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint 1: the cls token (token for classification in bert / distilBert) corresponds to the first element in the sequence in DistilBert. Take a look at Figure 2 in BERT paper.
    # Hint 2: this guide may be helpful for parameter freezing: https://keras.io/guides/transfer_learning/
    # Hint 3: double check that your number of parameters make sense
    # Hint 4: carefully consider your final layer activation and loss function
    max_seq_len = 128
    inputs = tf.keras.layers.Input(shape=(max_seq_len,), dtype='int64')
    masks = tf.keras.layers.Input(shape=(max_seq_len,), dtype='int64') 

    # Refer to https://keras.io/api/layers/core_layers/input/
  
    base_model.trainable = trainable
    weight_initialiazer = GlorotNormal(seed=12)
    dbert_output = base_model(inputs, attention_mask=masks)

    # dbert_last_hidden_state holds the output encoding for each input token.
    # Each encoding is a vector with 768 values. The first token fed into the model is [CLS],
    # whose encoding can be used to build a sentence classification network
    dbert_last_hidden_state = dbert_output.last_hidden_state

    token = dbert_last_hidden_state[:,0,:]

    # Any additional layers should go here
    # use the 'params' as a dictionary for hyper parameter to facilitate experimentation

    
    dense1 = Dense(params["dense1"], activation='relu')(token)
    dropout1 = Dropout(params["dropout1"])(dense1)
    dense2 = Dense(params["dense2"], activation='relu')(dropout1)
    dropout2 = Dropout(params["dropout2"])(dense2)


    outputs = Dense(1, activation='sigmoid',
                    kernel_initializer=weight_initialiazer,
                    kernel_regularizer=regularizers.l2(0.01),
                    name='output')(dropout2)
    
    model = tf.keras.Model(inputs=[inputs, masks], outputs = outputs)

    
    model.summary()  # summary() prints the table itself and returns None
    return model

dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
params={
    "dense1" : 128 ,
    "dense2" : 64 ,
    "dropout1" : 0.4,
    "dropout2" : 0.4
    
}
model = build_model(dbert_model,params=params)





Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_projector', 'vocab_layer_norm', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 128)]        0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 128)]        0           []                               
                                                                                                  
 tf_distil_bert_model (TFDistil  TFBaseModelOutput(l  66362880   ['input_1[0][0]',                
 BertModel)                     ast_hidden_state=(N               'input_2[0][0]']                
                                one, 128, 768),                                                   
                                 hidden_states=None                                           
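
Hint 3 asks whether the parameter counts make sense. A rough arithmetic check, assuming the head configured above (768 → 128 → 64 → 1, with dropout layers contributing no parameters) and taking the 66,362,880 figure from the summary row for DistilBert; `dense_param_count` is a hypothetical helper:

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


hidden_size = 768          # width of the [CLS] vector
base_params = 66_362_880   # DistilBert count reported in the summary

head_params = (
    dense_param_count(hidden_size, 128)  # dense1
    + dense_param_count(128, 64)         # dense2
    + dense_param_count(64, 1)           # sigmoid output
)

print("head parameters:", head_params)
print("total parameters:", base_params + head_params)
```

With `trainable=False`, only the head parameters should be reported as trainable; if the trainable count is in the tens of millions, the base model was not frozen.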

In [None]:
def compile_model(model):
    # TODO: compile the model, include relevant auc metrics when training
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    
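    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        # label_smoothing=0.5 moves the 0/1 targets to 0.25/0.75 before the loss is computed
        loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=0.5, axis=-1),
        # track AUC alongside accuracy, as the TODO asks
        metrics=['accuracy', tf.keras.metrics.AUC(name='auc')],
    )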
    return model

model = compile_model(model)
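
The loss above uses `label_smoothing=0.5`. Keras documents binary label smoothing as squeezing the targets toward 0.5; the NumPy sketch below reproduces that formula outside of Keras to show its effect. The helper names are hypothetical and the loss is a plain re-implementation for illustration, not the Keras code.

```python
import numpy as np


def smooth_binary_labels(y_true, label_smoothing):
    """Squeeze 0/1 targets toward 0.5 by the given smoothing amount."""
    y_true = np.asarray(y_true, dtype=float)
    return y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between targets and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


targets = smooth_binary_labels([0, 1], label_smoothing=0.5)
print(targets)  # hard labels 0 and 1 become 0.25 and 0.75

# With a smoothed target of 0.75, a confident prediction is penalised
# more than a prediction that matches the smoothed target.
loss_matching = binary_crossentropy([0.75], [0.75])
loss_confident = binary_crossentropy([0.75], [0.99])
print(loss_matching, loss_confident)
```

A smoothing value this large discourages confident predictions, so it is worth comparing against smaller values (or none) when tuning for accuracy.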


In [None]:
def train_model(model, model_inputs_and_masks_train, model_inputs_and_masks_val,
    y_train, y_val, batch_size, num_epochs):
    # TODO: train the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    history = model.fit(
        x = [model_inputs_and_masks_train['inputs'], model_inputs_and_masks_train['masks']],
        y = y_train ,
        epochs = num_epochs,
        batch_size = batch_size,
        validation_data = ([model_inputs_and_masks_val['inputs'],
                            model_inputs_and_masks_val['masks']],
                           y_val),
        verbose = 'auto'
    )
    return model, history

model, history = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size=128, num_epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
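
The compile step's TODO mentions AUC. As a reminder of what that metric measures, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. `pairwise_auc` is a hypothetical helper and the labels and scores are made up; for real evaluation a library implementation is the better choice.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    diffs = pos[:, None] - neg[None, :]
    return float(np.mean((diffs > 0) + 0.5 * (diffs == 0)))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

Unlike accuracy, this value does not depend on a 0.5 decision threshold, which makes it a useful companion metric on an imbalanced dataset.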
