<a href="https://colab.research.google.com/github/y-bai/llm/blob/main/notebooks/distilbert_ft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip -q install datasets evaluate transformers[sentencepiece] peft

In [6]:
from datasets import load_dataset, DatasetDict, Dataset

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    EvalPrediction
)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

import evaluate
import torch
import numpy as np


# Base pretrained model

We use a relatively small model, DistilBERT, for sequence classification.

A full list of models that support sequence classification is available [here](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification).

In [7]:
model_checkpoint = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(model_checkpoint)

The result is a `DistilBertConfig`, as the output below shows. For the source of the closely related `BertConfig`, see [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/configuration_bert.py#L29).


In [10]:
config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.2",
  "vocab_size": 30522
}

Now we load the **DistilBERT** model for sequence classification. For the analogous BERT implementation, see [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L1650).


**NOTE**

`id2label` and `label2id` are parameters in `PretrainedConfig` for fine-tuning. For details, see [here](https://huggingface.co/docs/transformers/en/main_classes/configuration).


In [15]:
config.num_labels = 2 # we have 2 classes
config.id2label = {0: "Negative", 1: "Positive"}
config.label2id = {"Negative": 0, "Positive": 1}
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    config=config)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
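The two label mappings must be exact inverses of each other. As a minimal sketch (the `labels` list and the way the dictionaries are built are illustrative assumptions, not part of the notebook), both can be derived from one ordered list of class names so they cannot drift apart:

```python
# Hypothetical helper code: derive both mappings from one ordered list of class names.
labels = ["Negative", "Positive"]

# Class index -> human-readable name
id2label = {idx: name for idx, name in enumerate(labels)}
# Human-readable name -> class index (the inverse mapping)
label2id = {name: idx for idx, name in id2label.items()}

# Sanity check: mapping an id to a label and back returns the same id.
assert all(label2id[id2label[i]] == i for i in id2label)

print(id2label)
print(label2id)
```

The resulting dictionaries are the same as the ones assigned to `config.id2label` and `config.label2id` above.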


# Load dataset

In [13]:
dataset = load_dataset("shawhin/imdb-truncated")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
})

In [14]:
dataset['train'][:3]

{'label': [1, 1, 0],
 'text': ['. . . or type on a computer keyboard, they\'d probably give this eponymous film a rating of "10." After all, no elephants are shown being killed during the movie; it is not even implied that any are hurt. To the contrary, the master of ELEPHANT WALK, John Wiley (Peter Finch), complains that he cannot shoot any of the pachyderms--no matter how menacing--without a permit from the government (and his tone suggests such permits are not within the realm of probability). Furthermore, the elements conspire--in the form of an unusual drought and a human cholera epidemic--to leave the Wiley plantation house vulnerable to total destruction by the Elephant People (as the natives dub them) to close the story. If you happen to see the current release EARTH, you\'ll detect the Elephant People are faring less well today.',
  "During 1933 this film had many cuts taken from it because it was very over the top for the story content and the fact that Lily Powers,(Barbara S

# Tokenize dataset

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

In [19]:
special_tokens = (
    tokenizer.bos_token,
    tokenizer.eos_token,
    tokenizer.pad_token,
    tokenizer.unk_token,
    tokenizer.mask_token,
    tokenizer.sep_token,
    tokenizer.cls_token
)
special_token_ids = (
    tokenizer.bos_token_id,
    tokenizer.eos_token_id,
    tokenizer.pad_token_id,
    tokenizer.unk_token_id,
    tokenizer.mask_token_id,
    tokenizer.sep_token_id,
    tokenizer.cls_token_id)

dict(zip(special_tokens, special_token_ids))

{None: None,
 '[PAD]': 0,
 '[UNK]': 100,
 '[MASK]': 103,
 '[SEP]': 102,
 '[CLS]': 101}

In [17]:
# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["text"]

    # tokenize and truncate text; truncating from the left keeps the end of long reviews
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# tokenize training and validation datasets
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})
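The effect of `truncation_side = "left"` can be illustrated without the tokenizer. The sketch below is a simplified stand-in, not the library's implementation: the function name `truncate_and_wrap` and the token ids in `content` are made up, and it assumes a single sequence wrapped in `[CLS]` (101) and `[SEP]` (102).

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids reported by the tokenizer above

def truncate_and_wrap(token_ids, max_length, side="left"):
    """Trim content tokens to fit max_length, then add [CLS] and [SEP]."""
    budget = max_length - 2  # reserve room for the two special tokens
    if budget <= 0:
        raise ValueError("max_length must leave room for at least one content token")
    if len(token_ids) > budget:
        if side == "left":
            token_ids = token_ids[-budget:]  # drop tokens from the start
        else:
            token_ids = token_ids[:budget]   # drop tokens from the end
    return [CLS_ID] + list(token_ids) + [SEP_ID]

content = list(range(1000, 1010))  # ten made-up token ids
left = truncate_and_wrap(content, max_length=6, side="left")
right = truncate_and_wrap(content, max_length=6, side="right")
print(left)   # keeps the last four content tokens
print(right)  # keeps the first four content tokens
```

For movie reviews, keeping the end of the text is a plausible choice because reviewers often state their verdict last, but that is a judgment call rather than a requirement.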

In [20]:
token_item = tokenized_dataset['train'][0]


In [21]:
len(token_item['input_ids'])

183

In [22]:
token_item['input_ids'][:10], token_item['attention_mask'][:10]


([101, 1012, 1012, 1012, 2030, 2828, 2006, 1037, 3274, 9019],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [23]:
token_item['input_ids'][-10:], token_item['attention_mask'][-10:]

([10777, 2111, 2024, 2521, 2075, 2625, 2092, 2651, 1012, 102],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [24]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [29]:
tokenized_dataset['train']

Dataset({
    features: ['label', 'text', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [31]:
tokenized_dataset_trn = tokenized_dataset['train'].remove_columns(['text'])
data_collator(tokenized_dataset_trn[:3])

{'input_ids': tensor([[  101,  1012,  1012,  ...,     0,     0,     0],
        [  101,  2076,  4537,  ...,     0,     0,     0],
        [  101,  2292,  2033,  ...,  5487, 23872,   102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([1, 1, 0])}
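The collator output above shows shorter sequences filled with `0` (the `[PAD]` id) and a matching `attention_mask`. A minimal pure-Python sketch of that dynamic padding idea follows; `pad_batch` is a hypothetical helper written for illustration, not the `DataCollatorWithPadding` implementation, and it assumes padding on the right.

```python
PAD_ID = 0  # id of [PAD] in the tokenizer above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2009, 102], [101, 2009, 2001, 2204, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is recomputed per batch, a batch of short reviews is not padded all the way to 512 tokens.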

# Evaluation metrics

In [32]:
# import accuracy evaluation metric
accuracy = evaluate.load("accuracy")

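# define an evaluation function to pass into trainer later
def compute_metrics(p: EvalPrediction):
    predictions, labels = p
    # pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)

    # note: accuracy.compute() itself returns {"accuracy": value}, so this result is nested
    return {"accuracy": accuracy.compute(predictions=predictions,
                                          references=labels)}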

# Zero-shot performance

In [33]:
# define list of examples
text_list = ["It was good.", "Not a fan, don't recommed.",
"Better than the first one.", "This is not worth watching even once.",
"This one is a pass."]

In [34]:
tokenizer.encode(text_list[0], return_tensors="pt")

tensor([[ 101, 2009, 2001, 2204, 1012,  102]])

In [35]:
print("Untrained model predictions:")
print("----------------------------")
for text in text_list:
    # tokenize text
    inputs = tokenizer.encode(text, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(text + " - " + config.id2label[predictions.tolist()])


Untrained model predictions:
----------------------------
It was good. - Positive
Not a fan, don't recommed. - Positive
Better than the first one. - Positive
This is not worth watching even once. - Positive
This one is a pass. - Positive
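Every example is labelled Positive, which is consistent with a classification head that was newly initialized and has not been trained yet. The sketch below shows how two nearly equal logits still produce a hard label under `argmax`. The logit values are made up for illustration and were not produced by the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "Negative", 1: "Positive"}

logits = [0.10, 0.14]  # hypothetical near-tied logits, not real model output
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes

print(id2label[predicted], probs)
```

With logits this close, the probabilities sit near 0.5 each, so the predicted label carries little information even though `argmax` always returns one.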


# Fine-tuning with LoRA

We first configure the `LoRA`. For details, see [here](https://huggingface.co/docs/peft/en/package_reference/lora) and [here](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/config.py).

In [36]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [37]:

peft_config = LoraConfig(
    task_type="SEQ_CLS", # sequence classification
    r=4, # rank of the low-rank update matrices
    lora_alpha=32, # scaling factor: the LoRA update is multiplied by lora_alpha / r
    lora_dropout=0.01, # probability of dropout
    target_modules = ['q_lin']) # we apply LoRA to the query projection only
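For reference, the standard LoRA formulation that these settings parameterize can be written as follows. The symbols $W_0$, $A$, $B$, $\alpha$, $d$ and $k$ are notation introduced here for illustration; they do not appear in the notebook.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}
```

Here $W_0$ is the frozen pretrained weight of `q_lin` (so $d = k = 768$), and only $A$ and $B$ are trained. With `r=4` and `lora_alpha=32`, the scaling factor $\alpha / r$ equals 8.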

In [38]:
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9307
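The reported counts can be reproduced by hand from the layer shapes printed earlier. The breakdown below is a sketch under one assumption that is not stated in the notebook: that with `task_type="SEQ_CLS"` PEFT trains a copy of the classification head (`pre_classifier` and `classifier`) while also keeping the frozen original, so the head is counted twice in the total.

```python
# Shapes taken from the config and model printouts above
vocab, max_pos, hidden, ffn = 30522, 512, 768, 3072
n_layers, rank, num_labels = 6, 4, 2

# LoRA adds A (rank x hidden) and B (hidden x rank) to q_lin in every layer
lora_params = n_layers * (rank * hidden + hidden * rank)

# Classification head: pre_classifier (hidden -> hidden) and classifier (hidden -> num_labels)
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

# Frozen backbone: embeddings (word, position, LayerNorm) plus six transformer blocks
embeddings = vocab * hidden + max_pos * hidden + 2 * hidden
attention = 4 * (hidden * hidden + hidden)                 # q, k, v and output projections
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
block = attention + feed_forward + 2 * (2 * hidden)        # plus two LayerNorms
backbone = embeddings + n_layers * block

trainable = lora_params + head_params
# Assumption: the head appears twice (frozen original + trainable copy)
total = backbone + head_params + lora_params + head_params

print(f"trainable params: {trainable:,} || all params: {total:,} "
      f"|| trainable%: {100 * trainable / total:.4f}")
```

Of the trainable parameters, only 36,864 belong to the LoRA matrices; the rest come from the classification head.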


In [39]:
# hyperparameters
lr = 1e-3 # size of optimization step
batch_size = 4 # number of examples processed per optimization step
num_epochs = 10 # number of times model runs through training data

# define training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
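These settings determine how many optimizer steps the run takes. The arithmetic below is a sketch that assumes a single device and no gradient accumulation. It also assumes that `TrainingArguments` logs the training loss every 500 steps by default, which is not set explicitly in the notebook.

```python
import math

num_examples = 1000    # size of the training split
batch_size = 4
num_epochs = 10
logging_steps = 500    # assumed default; not set in the notebook

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epochs whose final step coincides with a logging step
epochs_with_new_log = [
    epoch for epoch in range(1, num_epochs + 1)
    if (epoch * steps_per_epoch) % logging_steps == 0
]

print(steps_per_epoch, total_steps, epochs_with_new_log)
```

The total matches the `global_step=2500` reported after training. Under the logging assumption, the first training loss is recorded at the end of epoch 2, which would explain the "No log" entry for epoch 1.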



In [41]:
# create trainer object
trainer = Trainer(
    model=model, # our peft model
    args=training_args, # hyperparameters
    train_dataset=tokenized_dataset["train"], # training data
    eval_dataset=tokenized_dataset["validation"], # validation data
    tokenizer=tokenizer, # define tokenizer
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics, # evaluates model using compute_metrics() function from before
)

# train model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.469513 | {'accuracy': 0.884} |
| 2 | 0.331100 | 0.71776 | {'accuracy': 0.855} |
| 3 | 0.331100 | 0.673376 | {'accuracy': 0.877} |
| 4 | 0.183900 | 0.858175 | {'accuracy': 0.882} |
| 5 | 0.183900 | 1.022558 | {'accuracy': 0.886} |
| 6 | 0.050000 | 0.988477 | {'accuracy': 0.878} |
| 7 | 0.050000 | 1.067949 | {'accuracy': 0.881} |
| 8 | 0.009000 | 1.087965 | {'accuracy': 0.883} |
| 9 | 0.009000 | 1.149433 | {'accuracy': 0.881} |
| 10 | 0.009000 | 1.155663 | {'accuracy': 0.882} |


Trainer is attempting to log a value of "{'accuracy': 0.884}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.

TrainOutput(global_step=2500, training_loss=0.11659659614562988, metrics={'train_runtime': 267.5582, 'train_samples_per_second': 37.375, 'train_steps_per_second': 9.344, 'total_flos': 1112883852759936.0, 'train_loss': 0.11659659614562988, 'epoch': 10.0})
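The `writer.add_scalar()` warning in the training log arises because `compute_metrics` wraps the result of `accuracy.compute(...)`, which is already a dictionary, inside another dictionary, so the `accuracy` key holds a dict rather than a number. One possible fix is to return a flat dictionary. The sketch below computes accuracy directly with NumPy instead of the `evaluate` metric so that it runs on its own; the function name `compute_metrics_flat` and the sample logits are illustrative assumptions.

```python
import numpy as np

def compute_metrics_flat(eval_pred):
    """Return a flat {"accuracy": float} dict from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # class with the highest logit per example
    accuracy = float(np.mean(predictions == np.asarray(labels)))
    return {"accuracy": accuracy}

# Made-up logits and labels for four examples
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]])
labels = np.array([0, 1, 1, 1])

metrics = compute_metrics_flat((logits, labels))
print(metrics)
```

In the notebook's own function, returning `accuracy.compute(predictions=predictions, references=labels)` without the outer dictionary should have the same effect, since that call already produces `{'accuracy': value}` as the table above shows.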

In [42]:
model.device

device(type='cuda', index=0)

In [43]:
model.to('cpu')  # or 'mps' for Mac

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.01, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=76

In [44]:
model.device

device(type='cpu')

In [47]:
print("Trained model predictions:")
print("--------------------------")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt").to("cpu") # keep inputs on the same device as the model ('cpu' here; use 'mps' on a Mac if the model was moved there)

    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(text + " - " + config.id2label[predictions.tolist()[0]])

Trained model predictions:
--------------------------
It was good. - Positive
Not a fan, don't recommed. - Negative
Better than the first one. - Positive
This is not worth watching even once. - Negative
This one is a pass. - Negative
