# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>
## Fine-tuning BERT for the Product Review Problem - Classify Product Reviews as Positive or Not

Let's fine-tune the BERT model to classify our product reviews. We will install a new library __transformers__ and get a pre-trained BERT model from it. We are following [this tutorial](https://huggingface.co/transformers/custom_datasets.html) from the HuggingFace framework.

We are using a lighter, distilled version of the original BERT implementation called __"DistilBert"__. You can check out [their paper](https://arxiv.org/pdf/1910.01108.pdf) for more details. __This demo takes a long time to complete (even for 1 epoch) with our current instance. It is intended to get you familiar with this tool.__

In [1]:
!pip install -r ../../requirements.txt
!pip install -q transformers



In [2]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast

Let's read the dataset.

In [3]:
df = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TRAINING.csv")

Let's print the dataset information.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          56000 non-null  int64  
 1   reviewText  55990 non-null  object 
 2   summary     55988 non-null  object 
 3   verified    56000 non-null  bool   
 4   time        56000 non-null  int64  
 5   log_votes   56000 non-null  float64
 6   isPositive  56000 non-null  float64
dtypes: bool(1), float64(2), int64(2), object(2)
memory usage: 2.6+ MB


We drop the rows where the review text is missing.

In [5]:
df.dropna(subset=["reviewText"], inplace=True)

Fine-tuning BERT is computationally expensive, so in this demo we only use the first 1,000 data points.

In [6]:
df = df.head(1000).copy()  # copy so that later column assignments don't act on a view of the original frame

We cast the label column to int64, since the model expects integer class labels.

In [7]:
df["isPositive"] = df["isPositive"].astype("int64")

Let's keep 10% of the data for validation.

In [8]:
# Hold out 10% of the data as a validation set; stratify keeps the class ratio the same in both splits.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["reviewText"].tolist(),
    df["isPositive"].tolist(),
    test_size=0.10,
    shuffle=True,
    random_state=324,
    stratify = df["isPositive"].tolist(),
)
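To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the helper `stratified_split` and the toy labels are hypothetical and only illustrate the idea that each class contributes the same fraction of its examples to the validation set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) with roughly the same label ratio in both."""
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label, indices in sorted(by_label.items()):
        rng.shuffle(indices)
        # Each class gives the same fraction of its examples to validation.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Hypothetical toy labels: 60 positive and 40 negative reviews.
toy_labels = [1] * 60 + [0] * 40
train_idx, val_idx = stratified_split(toy_labels, test_size=0.10, seed=324)

val_counts = Counter(toy_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx))
print(dict(val_counts))
```

With these toy labels, the validation set keeps the 60/40 ratio of the full data, which is what we want when the classes are imbalanced.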

Let's load the tokenizer that matches the pre-trained DistilBERT model.

In [9]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(val_texts,
                          truncation=True,
                          padding=True)
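The `truncation=True` and `padding=True` options can be pictured with a small toy function. This is only a sketch of the behaviour, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, and the tiny `max_length=4` is chosen for illustration (the real tokenizer falls back to the model's maximum length when none is given). The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and padding ids that appear in the outputs further down.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate sequences to max_length, then pad them all to the longest one."""
    truncated = []
    for ids in batch_ids:
        if len(ids) > max_length:
            # Keep the final special token ([SEP]) when cutting the sequence.
            ids = ids[: max_length - 1] + [ids[-1]]
        truncated.append(ids)

    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [
    [101, 1045, 2031, 2109, 102],  # [CLS] i have used [SEP]
    [101, 6581, 102],              # [CLS] excellent [SEP]
]
enc = pad_and_truncate(toy_batch, max_length=4)
print(enc["input_ids"])
print(enc["attention_mask"])
```

The first sequence is cut down to four ids and the second is padded with a zero, so both rows have the same length and can be stacked into one tensor.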

We wrap the encodings and labels in a PyTorch `Dataset` so that the `Trainer` can read them one example at a time.

In [10]:
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    
train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
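The key step in `__getitem__` is turning column-wise encodings (one list per field, covering all examples) into a single per-example dictionary. The sketch below shows the same indexing logic without PyTorch, using plain lists instead of tensors; `ToyReviewDataset` and the toy encodings are hypothetical stand-ins.

```python
class ToyReviewDataset:
    """Torch-free imitation of ReviewDataset, for illustration only."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row idx out of every field, then attach the label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


toy_encodings = {
    "input_ids": [[101, 6581, 102], [101, 1045, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyReviewDataset(toy_encodings, labels=[1, 0])
first_item = toy_dataset[0]
print(len(toy_dataset))
print(first_item)
```

Each item carries the keys `input_ids`, `attention_mask` and `labels`, which are the argument names the model's forward pass accepts.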

Let's load the model. This may print some warning messages. They are expected here: the classification head is newly initialized and will be trained during fine-tuning.

In [11]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                            num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Let's start the fine-tuning process. This code takes __a long time__ to complete and is intended for educational purposes; a larger instance would make it run faster. You can reduce __num_train_epochs__ to shorten your run.

In [None]:
# A simple function to calculate accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
      'accuracy': acc
    }

training_args = TrainingArguments(
    output_dir="results",  # output directory
    num_train_epochs=20,  # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,  # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir="logs",  # directory for storing logs
    logging_steps=100,
    evaluation_strategy="steps",  # evaluate on the validation set every logging_steps steps
    load_best_model_at_end=True
)

# Freeze the encoder weights; only parameters whose name contains "classifier"
# (the pre_classifier and classifier layers) stay trainable
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

trainer = Trainer(
    model=model,  # the Transformers model
    args=training_args,  # training arguments
    train_dataset=train_dataset,  # passing training dataset
    eval_dataset=val_dataset,  # passing evaluation dataset
    compute_metrics=compute_metrics  # the callback that computes metrics of interest
)

trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Runtime,Samples Per Second
100,0.6764,0.66229,0.61,56.6487,1.765
200,0.6439,0.626768,0.61,55.9924,1.786
300,0.5994,0.576111,0.67,57.2194,1.748
400,0.5327,0.509893,0.77,57.3881,1.743
500,0.4708,0.451325,0.82,56.8971,1.758
600,0.4193,0.416415,0.83,57.07,1.752
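To relate the step numbers in this table to the training arguments, the sketch below works out how many optimizer steps the run contains and what the learning rate would look like at a few of them. It rests on several assumptions that are not stated in this notebook: a single device, no gradient accumulation, the default learning rate of 5e-5, and the default linear schedule with warmup. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, not a call into the library.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


n_train = 900                                # 1,000 rows minus the 10% validation split
steps_per_epoch = math.ceil(n_train / 16)    # per_device_train_batch_size=16
total_steps = steps_per_epoch * 20           # num_train_epochs=20
base_lr = 5e-5                               # assumption: default learning rate

print(steps_per_epoch, total_steps)
for step in (100, 500, 600, total_steps):
    print(step, linear_warmup_lr(step, base_lr, 500, total_steps))
```

Under these assumptions, the first five rows of the table fall inside the warmup phase, so the learning rate is still ramping up while those losses are logged.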


Let's use the trained model to make predictions on the validation dataset and compute metrics. 

In [32]:
val_predictions = trainer.predict(val_dataset)

We get validation predictions using the argmax function. It returns 0 or 1 for each prediction.

In [33]:
preds = val_predictions.predictions.argmax(-1)

print(confusion_matrix(val_predictions.label_ids, preds))
print(classification_report(val_predictions.label_ids, preds))
print("Accuracy (validation):", accuracy_score(val_predictions.label_ids, preds))

[[32  7]
 [ 7 54]]
              precision    recall  f1-score   support

           0       0.82      0.82      0.82        39
           1       0.89      0.89      0.89        61

    accuracy                           0.86       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.86      0.86      0.86       100

Accuracy (validation): 0.86
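The numbers in the classification report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The helper `per_class_metrics` below is a hypothetical name used for this worked example.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[32, 7],
      [7, 54]]


def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    true_positives = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)  # column sum
    actually_cls = sum(cm[cls])                     # row sum
    precision = true_positives / predicted_as_cls
    recall = true_positives / actually_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total

for cls in (0, 1):
    precision, recall, f1 = per_class_metrics(cm, cls)
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2))
print("accuracy:", accuracy)
```

Because the two off-diagonal counts are equal (7 and 7), precision and recall coincide for each class in this particular run.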


### Looking at what's going on

The fine-tuned BERT correctly classifies the sentiment of 86 of the 100 records in the validation set. Let's print some of the data to see what is happening to it.

In [34]:
k = 0
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

512
[101, 1045, 4149, 2023, 2138, 6881, 3769, 1011, 2039, 21628, 2015, 2020, 4760, 2039, 2006, 2026, 12191, 1012, 2023, 10770, 3036, 4031, 2134, 1005, 1056, 2131, 9436, 1997, 2068, 1010, 2061, 1045, 2001, 9364, 1010, 2021, 2009, 2052, 3796, 1996, 4180, 1012, 2061, 2009, 4066, 1997, 2499, 1010, 2021, 1045, 4299, 2009, 2071, 4550, 2039, 2026, 3274, 2062, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [35]:
k = 24
print(len(val_dataset.encodings["input_ids"][k]))
print(val_dataset.encodings["input_ids"][k])
print(val_texts[k])
print(val_labels[k])

512
[101, 1045, 2031, 2109, 2119, 22432, 7959, 2063, 1998, 10770, 1998, 2044, 2383, 2109, 2023, 4007, 2005, 2058, 1037, 2095, 2085, 1045, 2079, 2025, 2933, 2000, 2689, 1012, 2009, 2515, 2025, 4030, 2091, 2026, 3274, 1012, 1045, 2224, 2009, 2006, 2026, 7473, 1998, 14960, 1012, 2009, 2038, 7420, 2033, 1997, 4022, 4795, 4773, 4573, 2008, 1996, 2060, 2048, 2106, 2025, 1998, 2009, 2003, 16286, 21125, 1012, 6581, 4007, 2005, 4274, 3036, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Let's observe in more detail how sentences are tokenized.

In [36]:
st = val_texts[24]
print(st)
tok = tokenizer(st, truncation=True, padding=True)
print(tok)

I have used both McAfee and Norton and after having used this software for over a year now I do not plan to change. It does not slow down my computer. I use it on my PC and notebook. It has warned me of potential dangerous web sites that the other two did not and it is reasonably priced. Excellent software for internet security.
{'input_ids': [101, 1045, 2031, 2109, 2119, 22432, 7959, 2063, 1998, 10770, 1998, 2044, 2383, 2109, 2023, 4007, 2005, 2058, 1037, 2095, 2085, 1045, 2079, 2025, 2933, 2000, 2689, 1012, 2009, 2515, 2025, 4030, 2091, 2026, 3274, 1012, 1045, 2224, 2009, 2006, 2026, 7473, 1998, 14960, 1012, 2009, 2038, 7420, 2033, 1997, 4022, 4795, 4773, 4573, 2008, 1996, 2060, 2048, 2106, 2025, 1998, 2009, 2003, 16286, 21125, 1012, 6581, 4007, 2005, 4274, 3036, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [37]:
# The token-to-id mapping is stored in tokenizer.vocab; vocab_size is its number of entries
tokenizer.vocab_size

30522

In [38]:
# The methods convert_ids_to_tokens and convert_tokens_to_ids let us see how sentences are tokenized
print(tokenizer.convert_ids_to_tokens(tok['input_ids']))

['[CLS]', 'i', 'have', 'used', 'both', 'mca', '##fe', '##e', 'and', 'norton', 'and', 'after', 'having', 'used', 'this', 'software', 'for', 'over', 'a', 'year', 'now', 'i', 'do', 'not', 'plan', 'to', 'change', '.', 'it', 'does', 'not', 'slow', 'down', 'my', 'computer', '.', 'i', 'use', 'it', 'on', 'my', 'pc', 'and', 'notebook', '.', 'it', 'has', 'warned', 'me', 'of', 'potential', 'dangerous', 'web', 'sites', 'that', 'the', 'other', 'two', 'did', 'not', 'and', 'it', 'is', 'reasonably', 'priced', '.', 'excellent', 'software', 'for', 'internet', 'security', '.', '[SEP]']
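The split of "McAfee" into `mca`, `##fe`, `##e` comes from WordPiece tokenization, which greedily matches the longest vocabulary entry from the left and marks word-internal pieces with `##`. Here is a minimal sketch of that greedy matching on one lowercased word. The function `wordpiece` and the five-entry `toy_vocab` are hypothetical; the real tokenizer also lowercases, splits on whitespace and punctuation, and uses the full 30,522-entry vocabulary.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one is far larger.
toy_vocab = {"norton", "software", "mca", "##fe", "##e"}

print(wordpiece("mcafee", toy_vocab))
print(wordpiece("norton", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Words that are in the vocabulary stay whole, rarer words are broken into known pieces, and only words with no matching pieces fall back to `[UNK]`.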


## Getting predictions on the test data and saving results
* Read the test data
* Pass the data into your pipeline and make predictions

In [39]:
# Read the test data (it doesn't have the isPositive label; that is what we are trying to predict :D )
df_test = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TEST.csv")
df_test.head()

Unnamed: 0,ID,reviewText,summary,verified,time,log_votes
0,33276,I've been using greeting card software for wel...,Absolutely awful.,False,1300233600,0.0
1,20859,"This version worked well for me, have upgraded...",Good for virtual machine on a mac,True,1448755200,0.0
2,63500,Great!,Five Stars,True,1456963200,0.0
3,4950,I can assure you that any five star review was...,SCAM,False,1400803200,2.197225
4,26509,Overall the product really seems the same but ...,Has potential but many glitches and really the...,False,1419206400,0.0


In [40]:
df_test.isna().sum()

ID            0
reviewText    1
summary       2
verified      0
time          0
log_votes     0
dtype: int64

In [41]:
df_test["reviewText"] = df_test["reviewText"].fillna(value='')

In [42]:
test_texts = df_test["reviewText"].tolist()

In [43]:
test_encodings = tokenizer(test_texts,
                          truncation=True,
                          padding=True)

In [44]:
# The test set has no labels, so we pass dummy zeros; we only read the predictions afterwards
test_dataset = ReviewDataset(test_encodings, [0]*len(test_texts))

In [None]:
test_predictions = trainer.predict(test_dataset)

In [None]:
test_preds = test_predictions.predictions.argmax(-1)

In [None]:
k = 0
print(len(test_dataset.encodings["input_ids"][k]))
print(test_dataset.encodings["input_ids"][k])
print(test_texts[k])
# Inspect the predicted label and compare it by eye with the review text printed above
print(test_preds[k])

In [None]:
import pandas as pd

result_df = pd.DataFrame()
result_df["ID"] = df_test["ID"]
result_df["isPositive"] = test_preds

result_df.to_csv("result_day3_bert.csv", encoding='utf-8', index=False)

These commands delete the saved model checkpoints and the training logs.

In [None]:
! rm -rf results

In [None]:
! rm -rf logs