This notebook follows the Hugging Face "Fine-tuning with custom datasets" example (sequence classification with IMDb reviews): https://huggingface.co/transformers/custom_datasets.html?highlight=sequence#seq-imdb

The data can be downloaded directly from the Stanford website and unpacked with the following two commands:
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz

In [1]:
# Installations - only need to run the first time
#!pip install torch
#!pip install transformers

In [2]:
# Imports
from pathlib import Path
from sklearn.model_selection import train_test_split

import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments

In [3]:
# Load data
def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
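    # Each split directory holds one folder per class: pos/ and neg/
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            # Read explicitly as UTF-8 so the result does not depend on the platform default
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)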

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
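
The loader above expects one sub-folder per class (pos and neg) inside each split directory. As a minimal sketch of that layout, the block below builds a tiny stand-in tree in a temporary directory and reads it back. The helper build_toy_split and the three toy reviews are made up for illustration; read_imdb_split is repeated so the block runs on its own.

```python
import tempfile
from collections import Counter
from pathlib import Path


def read_imdb_split(split_dir):
    # Same logic as the loader above, repeated so this block is self-contained.
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir / label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels


def build_toy_split(root):
    # Hypothetical helper: writes two positive and one negative review
    # using the same pos/ and neg/ folder layout as aclImdb.
    reviews = {"pos": ["Loved it.", "A great film."], "neg": ["Dull and too long."]}
    for label_dir, items in reviews.items():
        folder = Path(root) / label_dir
        folder.mkdir(parents=True)
        for i, text in enumerate(items):
            (folder / f"{i}_0.txt").write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_toy_split(tmp)
    toy_texts, toy_labels = read_imdb_split(tmp)

# Count how many examples carry each label (1 = positive, 0 = negative).
toy_counts = Counter(toy_labels)
print(len(toy_texts), dict(toy_counts))
```

Running the same Counter over train_labels and test_labels is a quick way to confirm that both classes were picked up before training.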

In [4]:
# Create dev set from portion of train set
train_texts, dev_texts, train_labels, dev_labels = train_test_split(train_texts, train_labels, test_size=.2)

In [5]:
# Specify tokenizer and apply to each dataset
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
dev_encodings = tokenizer(dev_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
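
With truncation=True the tokenizer cuts each review down to the model's maximum input length, and with padding=True it pads every sequence to the longest one in the call. Each split above is tokenized in a single call, so that "longest one" is taken over the whole split. The block below is a simplified illustration of the idea in plain Python, not the tokenizer's actual implementation: pad_and_truncate is a hypothetical helper, the token ids are made up, and the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad all to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max((len(seq) for seq in truncated), default=0)
    # Pad with pad_id on the right so every row has the same length.
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # The mask is 1 for real tokens and 0 for padding positions.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three "reviews" of different lengths.
toy_batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]],
    max_length=5,
)
for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

The returned dictionary has the same two keys, input_ids and attention_mask, that the encodings above expose to the dataset class in the next cell.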

In [6]:
# Turn encodings into datasets for easy batching
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
dev_dataset = IMDbDataset(dev_encodings, dev_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [7]:
# Fine-tune with Trainer
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=dev_dataset             # evaluation dataset
)

trainer.train()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'classifier

Step    Training Loss
10      0.695
20      0.6977
30      0.6883
40      0.6885
50      0.6871
60      0.6809
70      0.6825
80      0.6621
90      0.63
100     0.563


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3000
Configuration saved in ./results/checkpoint-3

TrainOutput(global_step=3750, training_loss=0.17801461814939976, metrics={'train_runtime': 75043.2433, 'train_samples_per_second': 0.8, 'train_steps_per_second': 0.05, 'total_flos': 7948043919360000.0, 'train_loss': 0.17801461814939976, 'epoch': 3.0})
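
The numbers in TrainOutput can be cross-checked against the settings above. The sketch below assumes a single device and no gradient accumulation (neither is set explicitly in TrainingArguments), and takes the dev-set size from the evaluation log (Num examples = 5000). The function total_optimizer_steps is a hypothetical helper written for this check.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (a final partial batch still counts) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


full_train_size = 25_000   # number of reviews in aclImdb/train
dev_size = 5_000           # 20% held out by train_test_split(test_size=.2)
train_size = full_train_size - dev_size

steps = total_optimizer_steps(train_size, batch_size=16, num_epochs=3)

train_runtime = 75043.2433                          # seconds, from TrainOutput
samples_per_second = train_size * 3 / train_runtime
hours = train_runtime / 3600

print(steps, round(samples_per_second, 1), round(hours, 1))
```

Under these assumptions the step count matches global_step=3750 and the throughput rounds to the logged 0.8 samples per second, which corresponds to roughly 21 hours of training.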

In [8]:
# Evaluate on the dev set
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 5000
  Batch size = 64


{'eval_loss': 0.3744041323661804,
 'eval_runtime': 1839.567,
 'eval_samples_per_second': 2.718,
 'eval_steps_per_second': 0.043,
 'epoch': 3.0}
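
The evaluation output above contains only the loss and timing figures because the Trainer was created without a compute_metrics function. A minimal sketch of an accuracy metric is shown below; it assumes the Trainer passes the logits and labels as a pair of NumPy arrays, and the toy logits and labels are made up for the demonstration.

```python
import numpy as np


def compute_metrics(eval_pred):
    # eval_pred unpacks into model logits and the true label ids.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with made-up logits for four examples and two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-1.0, 3.0]])
toy_label_ids = np.array([0, 1, 1, 1])
toy_metrics = compute_metrics((toy_logits, toy_label_ids))
print(toy_metrics)
```

Passing compute_metrics=compute_metrics when constructing the Trainer should add an eval_accuracy entry to the dictionary returned by trainer.evaluate().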

In [11]:
# Save the fine-tuned model and the tokenizer for later reuse
trainer.save_model("original_base_model")
tokenizer.save_pretrained("original_tokenizer")

Saving model checkpoint to original_base_model
Configuration saved in original_base_model/config.json
Model weights saved in original_base_model/pytorch_model.bin
tokenizer config file saved in original_tokenizer/tokenizer_config.json
Special tokens file saved in original_tokenizer/special_tokens_map.json


('original_tokenizer/tokenizer_config.json',
 'original_tokenizer/special_tokens_map.json',
 'original_tokenizer/vocab.txt',
 'original_tokenizer/added_tokens.json',
 'original_tokenizer/tokenizer.json')