# Sentiment analysis - fine-tuning BERT

In this notebook we'll walk through fine-tuning a pretrained BERT model (here the smaller DistilBERT variant, `distilbert-base-uncased`) to detect the sentiment of a piece of text. Our goal is to classify the polarity of [IMDB](https://www.imdb.com/) movie reviews as positive or negative, using this [dataset](https://huggingface.co/datasets/imdb). The same techniques apply to text classification more generally.


<div>
<img src="https://raw.githubusercontent.com/valira-ai/NLP-tutorial-DSC22/main/figures/classification.png" width="700"/>
</div>

First things first, let's make sure we have a GPU instance in this Colab session:

* `Edit -> Notebook settings -> Hardware accelerator` must be set to `GPU`

* if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command:



In [None]:
!nvidia-smi

In [2]:
%%capture
!pip install transformers  # Huggingface library for transformer models
!pip install datasets  # Huggingface dataset library
!pip install --upgrade --no-cache-dir gdown  # downloading from Google Drive

In [3]:
import numpy as np

from datasets import DatasetDict, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, EarlyStoppingCallback, Trainer, TrainingArguments

## Dataset

Let's download the dataset of IMDB reviews:

In [None]:
raw_dataset = load_dataset("imdb")

In [5]:
split_train = raw_dataset["train"].train_test_split(test_size=0.2)
dataset = DatasetDict({
    "train": split_train["train"],
    "val": split_train["test"],
    "test": raw_dataset["test"]
})
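
To make the split concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split. It is an illustration only, not how `datasets` implements `train_test_split`; the helper name `split_indices` and the fixed seed are assumptions made for this example. With the 25,000 IMDB training reviews, `test_size=0.2` leaves 20,000 examples for training and 5,000 for validation.

```python
import random


def split_indices(n_examples, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle indices and cut off a validation slice."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed -> reproducible split
    n_val = round(n_examples * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(25_000, test_size=0.2)
print(len(train_idx), len(val_idx))  # 20000 5000
```

`train_test_split` also accepts a `seed` argument if you want the same split on every run.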

In [None]:
dataset["train"].features

In [None]:
dataset["train"][2]

Tokenizing our data - preparing model inputs:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [9]:
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [None]:
tok_dataset = dataset.map(tokenize, batched=True)
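
`padding="max_length"` and `truncation=True` make every example the same length, so that examples can be stacked into batches. The sketch below mimics that behaviour on whitespace-split tokens. It is a toy stand-in, not the WordPiece tokenizer DistilBERT really uses: `toy_encode`, the `[PAD]` string and `max_length=6` are assumptions chosen for readability, and the real tokenizer returns integer `input_ids` padded to the model's maximum length (512 tokens for `distilbert-base-uncased`).

```python
def toy_encode(text, max_length=6, pad_token="[PAD]"):
    """Toy stand-in for a tokenizer call with padding and truncation."""
    tokens = text.lower().split()[:max_length]   # truncation: drop the overflow
    attention_mask = [1] * len(tokens)           # 1 marks a real token
    n_pad = max_length - len(tokens)
    tokens += [pad_token] * n_pad                # padding: fill up to max_length
    attention_mask += [0] * n_pad                # 0 marks padding the model ignores
    return {"tokens": tokens, "attention_mask": attention_mask}


short = toy_encode("Great movie")
long = toy_encode("One of the worst films I have ever seen")
print(short["tokens"], short["attention_mask"])
print(long["tokens"], long["attention_mask"])
```

The short review is padded with four `[PAD]` entries, while the long one is cut after its sixth token.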

## Training (don't run during tutorial)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=dataset["train"].features["label"].num_classes)

In [None]:
# optional if you want to save your models to Google Drive
from google.colab import drive
drive.mount("/content/drive/")

In [None]:
args = TrainingArguments(
    output_dir="/content/drive/MyDrive/ds_conference/BERT-sentiment/",
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=250,
    save_total_limit=1,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.0001,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)

In [None]:
def compute_metrics(eval_pred):
    pred = np.argmax(eval_pred.predictions, axis=-1)
    accuracy = np.mean(pred == eval_pred.label_ids)
    
    return {"accuracy": accuracy}
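
As a quick sanity check, `compute_metrics` can be called on hand-made logits. In this sketch a `SimpleNamespace` stands in for the prediction object the `Trainer` passes in, and the numbers are invented for illustration.

```python
from types import SimpleNamespace

import numpy as np


def compute_metrics(eval_pred):
    pred = np.argmax(eval_pred.predictions, axis=-1)  # highest logit per row
    accuracy = np.mean(pred == eval_pred.label_ids)
    return {"accuracy": accuracy}


# Made-up logits for four reviews; columns are (negative, positive).
fake_eval = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]),
    label_ids=np.array([0, 1, 1, 1]),
)
print(compute_metrics(fake_eval))  # accuracy is 0.75: only the last row is wrong
```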

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tok_dataset["train"],
    eval_dataset=tok_dataset["val"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
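
With `early_stopping_patience=3` and `metric_for_best_model="eval_loss"`, training stops once the validation loss has failed to improve for three evaluations in a row. The function below is a simplified sketch of that rule, not the actual `EarlyStoppingCallback` code; `stopping_evaluation` and the loss values are made up for illustration.

```python
def stopping_evaluation(eval_losses, patience=3):
    """Return the index of the evaluation that triggers the stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:      # improvement: remember it and reset the counter
            best = loss
            bad_evals = 0
        else:                # no improvement on this evaluation
            bad_evals += 1
        if bad_evals >= patience:
            return i
    return None


losses = [0.40, 0.31, 0.29, 0.30, 0.33, 0.35, 0.36]
print(stopping_evaluation(losses))  # 5
```

In this example the best loss is reached at the third evaluation (index 2), and the stop is triggered three evaluations later. Because `load_best_model_at_end=True`, the best saved checkpoint is reloaded once training ends.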

In [None]:
trainer.train()

## Evaluation

In [None]:
!mkdir -p /content/bert-imdb  # -p: no error if the directory already exists
!gdown -O /content/bert-imdb/config.json "https://drive.google.com/uc?id=1-XtrUCTwyBnG79LpOYe6nfFeREg9hvfm"
!gdown -O /content/bert-imdb/pytorch_model.bin "https://drive.google.com/uc?id=1-UnnVyANUEBULAhSBkZ9_fKKCXugV751"

In [15]:
model = AutoModelForSequenceClassification.from_pretrained("bert-imdb")

In [16]:
def compute_metrics(eval_pred):
    pred = np.argmax(eval_pred.predictions, axis=-1)
    accuracy = np.mean(pred == eval_pred.label_ids)
    
    return {"accuracy": accuracy}

In [17]:
args = TrainingArguments("test", per_device_eval_batch_size=64)

trainer = Trainer(
    model,
    args,
    eval_dataset=tok_dataset["test"].shuffle(seed=42).select(range(2000)),  # for tutorial purposes, we subsample the test data
    compute_metrics=compute_metrics
)

In [None]:
trainer.evaluate()

## A bit more testing:)

In [19]:
from transformers import AutoConfig, pipeline

In [None]:
# just adds id2label to model config
config = AutoConfig.from_pretrained("bert-imdb")
config.id2label = {0: "Negative", 1: "Positive"}  # class ids are integers
config.save_pretrained("bert-imdb")

In [None]:
classifier = pipeline("sentiment-analysis", model="bert-imdb", tokenizer="distilbert-base-uncased", device=0)

In [None]:
classifier("This movie sucks!")

In [None]:
classifier("This movie is great!")

In [None]:
classifier("I don't think this movie is good.")
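
Each call returns a label together with a score. For a two-class model like this one, that score is the softmax probability of the predicted class. A minimal sketch with made-up logits follows; `to_prediction` is a hypothetical helper, and the `id2label` mapping mirrors the one saved to the config above.

```python
import math

id2label = {0: "Negative", 1: "Positive"}


def to_prediction(logits):
    """Turn raw logits into a pipeline-style {label, score} dict."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


result = to_prediction([2.0, -1.0])
print(result)  # label "Negative" with a score of roughly 0.95
```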