<a href="https://colab.research.google.com/github/nyp-sit/iti107/blob/main/session-6/bert-embedding_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using BERT as a Feature Extractor

Besides fine-tuning BERT for a downstream task such as text classification, we can use a pretrained BERT model as a feature extractor, in much the same way that a pretrained CNN such as ResNet is used as a feature extractor for downstream tasks such as image classification and object detection.

In this lab, we will use a pretrained DistilBERT model to extract features (embeddings) from text and then use those embeddings to train a text classifier. You can contrast this with the other lab, where we train DistilBERT end to end for classification, and compare the performance of the two approaches.

At the end of this session, you will be able to:
- prepare data and use the model-specific tokenizer to format it for use by the model
- extract text embeddings from the BERT model
- use the extracted features for text classification


## Install Hugging Face Transformers library
If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.

In [None]:
!pip install transformers

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Download the datasets.
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_df = pd.read_csv(train_data_url)
test_df = pd.read_csv(test_data_url)

The train set has 40000 samples. We will use a small subset (e.g. 2000 samples) for extracting features and training our classifier. Similarly, we will use a smaller test set for evaluating it. We use the DataFrame's `sample()` method to randomly select a subset of samples.

In [None]:
TRAIN_SIZE = 2000
TEST_SIZE = 200

train_df = train_df.sample(n=TRAIN_SIZE, random_state=128)
test_df = test_df.sample(n=TEST_SIZE, random_state=128)

In [None]:
train_df['sentiment'] = train_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
test_df['sentiment'] = test_df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

In [None]:
train_texts = train_df['review'].to_list()
train_labels = train_df['sentiment'].to_list()
test_texts = test_df['review'].to_list()
test_labels = test_df['sentiment'].to_list()

In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

## Tokenization

We will now load the DistilBERT tokenizer for the pretrained model "distilbert-base-uncased". This is the same as in the other lab exercise.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

The pretrained DistilBERT [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer) expects a string or a list of strings, which is why we converted the DataFrame columns into lists above.

Here we tokenize the text, pad each sequence to the length of the longest sequence in the batch, and truncate any sequence that exceeds the maximum length allowed by the model (512 tokens for BERT and DistilBERT).

In [None]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
test_encodings = tokenizer(test_texts, padding=True, truncation=True)
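
To make the effect of `padding=True` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation: the helper `pad_and_truncate`, the token IDs, and the `max_length` of 5 are all made up for illustration, and the sketch ignores details such as keeping the closing `[SEP]` token when truncating.

```python
# Toy illustration only; the real tokenizer also handles special tokens.
PAD_ID = 0  # assumed padding id (DistilBERT's [PAD] token id is 0)


def pad_and_truncate(batch, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized reviews of different lengths
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 3319, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The short sequence is padded with zeros and the long one is cut to 5 tokens. The attention mask records which positions hold real tokens, and we rely on it later when averaging hidden states.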

We will create a TensorFlow dataset and use its efficient batching later to obtain the embeddings.

In [None]:
batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(batch_size)

## Feature Extraction using (Distil)BERT

Here we load the pretrained model 'distilbert-base-uncased' and use it to extract features (i.e. embeddings) from the text. We specify `output_hidden_states=True` so that we get the output of every layer, not just the last one.

In [None]:
from transformers import TFAutoModel

model = TFAutoModel.from_pretrained("distilbert-base-uncased", output_hidden_states=True)

The model will produce two outputs. The first output, `output[0]`, has shape `(batch_size, sequence_length, 768)` and is the output of the last hidden layer. The second output, `output[1]`, is a tuple of 7 tensors of the same shape: the output of the embedding layer followed by the output of each of the 6 transformer layers. With a batch size of 16 and sequences padded to 512 tokens, each tensor has shape `(16, 512, 768)`, where 768 is the hidden size.

In [None]:
def extract_features(dataset):

    embeddings = []
    labels = []

    for encoding, label in dataset:
        output = model(encoding)
        hidden_states = output[1]
        # take the output of the second-to-last transformer layer as our embeddings
        hs = hidden_states[-2]
        # attention mask: 1 for real tokens, 0 for padding; shape (batch, seq_len)
        attention_masks = encoding['attention_mask']
        # add a trailing axis so the mask has shape (batch, seq_len, 1) and
        # broadcasts across the hidden dimension when multiplied with hs
        masks = tf.cast(tf.expand_dims(attention_masks, axis=-1), dtype=tf.float32)
        # multiplying by the mask zeroes the hidden states at padding positions,
        # so they contribute nothing to the sum
        results = tf.multiply(hs, masks)
        summation = tf.reduce_sum(results, axis=1)
        # divide by the number of real tokens in each sequence, taken from the mask
        # (counting non-zero activations would miscount values that are exactly 0)
        denominator = tf.reduce_sum(masks, axis=1)
        sentence_embedding = tf.divide(summation, denominator)
        embeddings.append(sentence_embedding)
        labels.append(label)

    embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)

    return embeddings, labels

In [None]:
X_train, y_train = extract_features(train_dataset)
X_val, y_val = extract_features(val_dataset)
X_test, y_test = extract_features(test_dataset)
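
The pooling step inside `extract_features` is a masked mean over the token axis. The following NumPy sketch shows the same idea on a tiny hand-made example so the arithmetic can be checked by hand. The function name `masked_mean` and the toy arrays are illustrative assumptions, not part of the lab code, and the real hidden size is 768 rather than 2.

```python
import numpy as np


def masked_mean(hidden, attention_mask):
    """Average token vectors over the sequence axis, ignoring padding."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over hidden dims
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)  # (batch, hidden)
    counts = mask.sum(axis=1)             # (batch, 1): real tokens per sequence
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],      # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, attention_mask)
print(pooled)
```

In the first sequence the padded position (with the large value 100) is ignored, so the result is the mean of the first two token vectors only. Taking the token count from the mask also keeps the second sequence well defined even though one of its hidden dimensions is exactly zero at every position.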

## Train a classifier using the extracted features (embeddings)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [None]:
print(f'train score : {clf.score(X_train, y_train)}')
print(f'validation score : {clf.score(X_val, y_val)}')
print(f'test score : {clf.score(X_test, y_test)}')

We should get validation and test accuracy scores of around 86% to 87%, which is quite good considering we are training with a subset of only 2000 samples! The exact numbers will vary from run to run because the train/validation split is random.
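
`classification_report` is imported above but not used. As a hedged sketch of how it could complement the accuracy scores, the example below calls it on small hand-written label lists. The lists `y_true` and `y_pred` and the class names are made up for illustration and are not results from this lab.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = negative, 1 = positive
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Human-readable table of per-class precision, recall and F1
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# The same numbers as a nested dict, convenient for programmatic checks
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"], output_dict=True
)
print(report["accuracy"])
```

In this notebook, the equivalent call would be `classification_report(y_test, clf.predict(X_test), target_names=["negative", "positive"])`.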

**Exercise**

1. Modify the code to use the hidden states from a different transformer layer as features, or take the average of the hidden states from several layers as features.
2. Modify the code to use the BERT model and see whether it performs better than DistilBERT. For the BERT model, the outputs of the different layers are in `output[2]`.
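
As one possible starting point for the first exercise, the sketch below averages the hidden states of several layers. It uses random NumPy arrays as a stand-in for the model's tuple of hidden states, so it runs without loading the model. The helper `average_layers`, the tiny shapes, and the choice of the last three layers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden_size = 2, 4, 8  # tiny stand-in sizes, not the real ones

# Stand-in for the hidden states: 1 embedding output + 6 layer outputs
hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden_size)) for _ in range(7)
)


def average_layers(hidden_states, layer_indices):
    """Element-wise mean of the selected layers' hidden states."""
    selected = np.stack([hidden_states[i] for i in layer_indices], axis=0)
    return selected.mean(axis=0)


# Average the last three layers instead of using a single layer
hs = average_layers(hidden_states, layer_indices=[-1, -2, -3])
print(hs.shape)  # (2, 4, 8)
```

The result has the same shape as a single layer's output, so it could take the place of `hs` in `extract_features` before the masked mean is applied. With TensorFlow tensors, `tf.stack` and `tf.reduce_mean` would play the roles of `np.stack` and `mean`.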