## BERT IMDB Classifier

In this notebook we fine-tune DistilBERT for sentiment classification on the IMDB dataset.

Training on the full train set of 25000 records takes 20 minutes per epoch, so I limited training to 200 steps.

Everything was run on an RTX 3070M.


In [1]:
# check cuda
import torch
torch.cuda.is_available()

True

In [2]:
from datasets import load_dataset
imdb = load_dataset("imdb")

In [3]:
imdb['test'][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [4]:
# tokenize data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
def preprocess_function(examples):
    # tokenize and truncate to DistilBERT's max length
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)
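
Note that the tokenizer call above truncates but does not pad, so the tokenized reviews still have different lengths. Padding is left to batch time, where each batch only needs to be as long as its longest sequence. Here is a minimal sketch of that idea in plain Python. The token ids are made up, `pad_batch` is a hypothetical helper (not a `transformers` function), and `pad_id=0` is an assumption for illustration.

```python
# Illustrative only: made-up token ids, hypothetical helper.
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102]])
print(batch["input_ids"])       # both rows now have length 6
print(batch["attention_mask"])  # the short row ends in three zeros
```

A batch of short reviews stays short this way, instead of every review being padded to the model's maximum length.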

In [5]:

import evaluate
from transformers import DataCollatorWithPadding
import numpy as np

# dynamic padding: pad each batch only to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# set metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
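
To make `compute_metrics` concrete, here is a small sketch of what it does to the model output: take the argmax over the two logits per review, then compare against the labels. The logits and labels below are made up, and accuracy is computed with plain NumPy rather than `evaluate`, so this runs without downloading anything.

```python
import numpy as np

# Made-up logits for 4 reviews: column 0 = NEGATIVE, column 1 = POSITIVE
logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 2.2]])
labels = np.array([0, 1, 1, 1])

predictions = np.argmax(logits, axis=1)  # index of the larger logit per row
manual_accuracy = float((predictions == labels).mean())

print(predictions.tolist())  # [0, 1, 0, 1]
print(manual_accuracy)       # 0.75
```

The third review is the miss: its logits slightly favor NEGATIVE while the label is POSITIVE.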

In [6]:
# load models
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# this dataset contains 25000 train and 25000 test records
# let's limit the number of training steps to keep the run short:
max_steps = 200
training_args = TrainingArguments(
    output_dir="bert-sentiment-classifer",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    dataloader_num_workers=4,  # Helps with data loading speed
    dataloader_pin_memory=True,  # Helps with CUDA transfer
    #gradient_accumulation_steps=2, # simulate larger batches
    #fp16=True,  # mixed precision training
    max_steps = max_steps,
    #num_train_epochs=2,
    weight_decay=0.01,
    eval_steps=0.50,  # fraction of max_steps: evaluate every 100 steps (at steps 100 and 200)
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    
    #push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100  | No log        | 0.287755        | 0.88512  |
| 200  | No log        | 0.244903        | 0.90324  |


TrainOutput(global_step=200, training_loss=0.3460448455810547, metrics={'train_runtime': 516.9449, 'train_samples_per_second': 12.38, 'train_steps_per_second': 0.387, 'total_flos': 847791351398400.0, 'train_loss': 0.3460448455810547, 'epoch': 0.2557544757033248})
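
The numbers in this output can be sanity-checked with a bit of arithmetic. The sketch below assumes a single GPU and no gradient accumulation (it is commented out in the config), and assumes a fractional `eval_steps` is converted to an absolute step count by rounding up, which is my understanding of the Trainer's behavior rather than something shown in this notebook.

```python
import math

train_records = 25_000    # IMDB train split size
batch_size = 32           # per_device_train_batch_size, single GPU assumed
max_steps = 200
eval_ratio = 0.50         # eval_steps given as a fraction of max_steps
train_runtime = 516.9449  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_records / batch_size)  # last batch is partial
epoch_fraction = max_steps / steps_per_epoch
eval_every = math.ceil(max_steps * eval_ratio)            # assumed rounding rule
samples_per_second = max_steps * batch_size / train_runtime

print(steps_per_epoch)               # 782
print(round(epoch_fraction, 4))      # 0.2558
print(eval_every)                    # 100
print(round(samples_per_second, 2))  # 12.38
```

So 200 steps cover roughly a quarter of one epoch (6400 of the 25000 training reviews), which lines up with the `epoch` and `train_samples_per_second` values reported by the Trainer.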

In [8]:
model.device

device(type='cuda', index=0)