# Finetuning a DistilBERT Classifier in Lightning

In [1]:
!pip install -r requirements.txt

Collecting lightning (from -r requirements.txt (line 3))
  Downloading lightning-2.5.5-py3-none-any.whl.metadata (39 kB)
Collecting watermark (from -r requirements.txt (line 4))
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting scikit-learn==1.5.2 (from -r requirements.txt (line 5))
  Downloading scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting lightning-utilities<2.0,>=0.10.0 (from lightning->-r requirements.txt (line 3))
  Downloading lightning_utilities-0.15.2-py3-none-any.whl.metadata (5.7 kB)
Collecting torchmetrics<3.0,>0.7.0 (from lightning->-r requirements.txt (line 3))
  Downloading torchmetrics-1.8.2-py3-none-any.whl.metadata (22 kB)
Collecting pytorch-lightning (from lightning->-r requirements.txt (line 3))
  Downloading pytorch_lightning-2.5.5-py3-none-any.whl.metadata (20 kB)
Collecting jedi>=0.16 (from ipython>=6.0->watermark->-r requirements.txt (line 4))
  Downloading jedi-0.19.2-py2.

In [2]:
import torch
print("Torch version:", torch.__version__)
print("Torch file location:", torch.__file__)

# Check whether this PyTorch build includes CUDA support
print("CUDA compiled version:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "Not available")

Torch version: 2.8.0+cu126
Torch file location: /usr/local/lib/python3.12/dist-packages/torch/__init__.py
CUDA compiled version: 12.6
cuDNN version: 91002


![](figures/finetuning-ii.png)

# 1 Loading the dataset into DataFrames

In [3]:
import os.path as op

from datasets import load_dataset

import lightning as L
from lightning.pytorch.loggers import CSVLogger
from lightning.pytorch.callbacks import ModelCheckpoint

import numpy as np
import pandas as pd
import torch

from sklearn.feature_extraction.text import CountVectorizer

from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset

In [4]:
download_dataset()

df = load_dataset_into_to_dataframe()
partition_dataset(df)

100% | 80.23 MB | 4.87 MB/s | 16.47 sec elapsed

100%|██████████| 50000/50000 [00:43<00:00, 1155.72it/s]


Class distribution:


In [5]:
df_train = pd.read_csv("train.csv")
df_val = pd.read_csv("val.csv")
df_test = pd.read_csv("test.csv")

In [6]:
df_train['text'][0]

'When we started watching this series on cable, I had no idea how addictive it would be. Even when you hate a character, you hold back because they are so beautifully developed, you can almost understand why they react to frustration, fear, greed or temptation the way they do. It\'s almost as if the viewer is experiencing one of Christopher\'s learning curves.<br /><br />I can\'t understand why Adriana would put up with Christopher\'s abuse of her, verbally, physically and emotionally, but I just have to read the newspaper to see how many women can and do tolerate such behavior. Carmella has a dream house, endless supply of expensive things, but I\'m sure she would give it up for a loving and faithful husband - or maybe not. That\'s why I watch.<br /><br />It doesn\'t matter how many times you watch an episode, you can find something you missed the first five times. We even watch episodes out of sequence (watch season 1 on late night with commercials but all the language, A&E with lang

# 2 Tokenization and Numericalization

**Load the dataset via `load_dataset`**

In [7]:
imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "val.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 10000
    })
})


**Tokenize the dataset**

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", cache_dir="./models")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Tokenizer input max length: 512
Tokenizer vocabulary size: 30522


In [9]:
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

In [10]:
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)

Map:   0%|          | 0/35000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
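
Because `map` is called with `batched=True, batch_size=None`, each split is passed to `tokenize_text` as a single batch, so `padding=True` pads every review to the longest sequence in that split (capped at 512 tokens by `truncation=True`). The following is a toy sketch of that pad-and-truncate behavior in plain Python. The helper `pad_and_truncate` and the token ids are made up for illustration; the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Toy stand-in for tokenizer(..., truncation=True, padding=True)."""
    # Truncate every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Pad to the longest sequence in this batch, not to max_length.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three reviews of different lengths.
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102], [101, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Padding a whole split to one length is simple but can waste memory when a few reviews are very long; padding per mini-batch is a common alternative.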

In [11]:
del imdb_dataset

In [12]:
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# 3 Set Up DataLoaders

In [14]:
from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows

In [15]:
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)

# 4 Initializing DistilBERT

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, cache_dir="./models")

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [18]:
from peft import LoraConfig, get_peft_model, TaskType
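# LoRA configuration
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling numerator: the update is scaled by lora_alpha / r
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"]  # LoRA on the query/value projections; pre_classifier and classifier are trained as well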
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925
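
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes that the two classification heads (`pre_classifier`, a 768 x 768 linear layer, and `classifier`, a 768 x 2 linear layer) are trained in full next to the LoRA matrices; under that assumption the arithmetic matches the reported 739,586.

```python
hidden = 768            # DistilBERT hidden size (see the printed architecture)
num_layers = 6          # "(0-5): 6 x TransformerBlock"
num_labels = 2
r, lora_alpha = 8, 16
targets_per_layer = 2   # q_lin and v_lin

# Each adapted Linear gets lora_A (r x in) and lora_B (out x r), both without bias.
lora_per_module = r * hidden + hidden * r
lora_params = num_layers * targets_per_layer * lora_per_module

# Assumption: both heads are trained in full (weight + bias).
pre_classifier_params = hidden * hidden + hidden
classifier_params = hidden * num_labels + num_labels

trainable = lora_params + pre_classifier_params + classifier_params
percent = f"{100 * trainable / 67_694_596:.4f}"

print(f"LoRA params:      {lora_params:,}")
print(f"head params:      {pre_classifier_params + classifier_params:,}")
print(f"trainable params: {trainable:,}")
print(f"trainable%:       {percent}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

Most of the trainable parameters come from `pre_classifier`; the LoRA matrices themselves account for 147,456 of them.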


# 5 Finetuning

**Wrap in LightningModule for Training**

In [19]:
import lightning as L
import torch
import torchmetrics


class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("val_loss", outputs["loss"], prog_bar=True)

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


lightning_model = LightningModel(model)
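
In `validation_step` and `test_step`, the metric object is updated once per batch and the logged value is computed over everything seen during the epoch. The sketch below is a conceptual stand-in written in plain Python, not the `torchmetrics` implementation; `RunningAccuracy` and `argmax` are hypothetical helpers. It shows why accumulating counts differs from averaging per-batch accuracies when the last batch is smaller.

```python
class RunningAccuracy:
    """Accumulates correct/total counts across batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        self.correct += sum(int(p == t) for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        return self.correct / self.total


def argmax(row):
    # Index of the largest logit, mirroring torch.argmax(logits, 1) for one row.
    return max(range(len(row)), key=row.__getitem__)


# Two toy batches of (logits, labels); the second batch is smaller.
batches = [
    ([[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5]], [0, 1, 0]),  # 2 of 3 correct
    ([[1.2, 0.1]], [0]),                                   # 1 of 1 correct
]

metric = RunningAccuracy()
batch_accs = []
for logits, labels in batches:
    preds = [argmax(row) for row in logits]
    metric.update(preds, labels)
    batch_accs.append(sum(p == t for p, t in zip(preds, labels)) / len(labels))

epoch_acc = metric.compute()                        # 3 correct out of 4
mean_of_batches = sum(batch_accs) / len(batch_accs)  # overweights the small batch
print(epoch_acc, mean_of_batches)
```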

In [20]:
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger


callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="my-model")

In [21]:
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="gpu",
    precision="16-mixed",
    devices=1,
    logger=logger,
    log_every_n_steps=10,
)

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)

INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

Sanity Checking: |          | 0/? [00:00<?, ?it/s]



Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_epochs=3` reached.


In [22]:
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")

INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt
/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:484: Your `test_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.


Testing: |          | 0/? [00:00<?, ?it/s]

[{'accuracy': 0.9346857070922852}]
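
The checkpoint restored above is named `epoch=2-step=8751`. That step count follows from the training set size, the batch size, and the number of epochs. The short check below assumes the `DataLoader` default `drop_last=False` and no gradient accumulation, both of which match the cells above.

```python
import math

num_train = 35_000   # rows in the train split
batch_size = 12      # from train_loader
max_epochs = 3       # from L.Trainer

# With drop_last=False the final, smaller batch still counts as a step.
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * max_epochs
last_batch_size = num_train - (steps_per_epoch - 1) * batch_size

print("steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
print("size of the last batch in each epoch:", last_batch_size)
```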

In [23]:
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")

INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt


Testing: |          | 0/? [00:00<?, ?it/s]

[{'accuracy': 0.921999990940094}]

In [24]:
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")

INFO: Restoring states from the checkpoint path at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at logs/my-model/version_0/checkpoints/epoch=2-step=8751.ckpt


Testing: |          | 0/? [00:00<?, ?it/s]

[{'accuracy': 0.9161999821662903}]

In [25]:
model = lightning_model.model
# local path to save the adapter
local_adapter_path = "./my-new-awesome-lora-adapter"

# save lora adapter weights and bias
model.save_pretrained(local_adapter_path)

# save tokenizer
tokenizer.save_pretrained(local_adapter_path)


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

('./my-new-awesome-lora-adapter/tokenizer_config.json',
 './my-new-awesome-lora-adapter/special_tokens_map.json',
 './my-new-awesome-lora-adapter/vocab.txt',
 './my-new-awesome-lora-adapter/added_tokens.json',
 './my-new-awesome-lora-adapter/tokenizer.json')

In [26]:
# Install huggingface_hub to publish the adapter to the Hugging Face Hub
!pip install huggingface_hub




In [27]:
from huggingface_hub import notebook_login
notebook_login()


In [28]:
# upload the adapter and the tokenizer to the Hugging Face Hub
hub_repo_id_adapter = "Qndhm/distilled-bert-imdb-lora-adapter"

model.push_to_hub(hub_repo_id_adapter)
tokenizer.push_to_hub(hub_repo_id_adapter)

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:  19%|#8        |  555kB / 2.96MB            

README.md: 0.00B [00:00, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Qndhm/distilled-bert-imdb-lora-adapter/commit/4d0cdfa26df4f9c5f48243bc46800636edf96d76', commit_message='Upload tokenizer', commit_description='', oid='4d0cdfa26df4f9c5f48243bc46800636edf96d76', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Qndhm/distilled-bert-imdb-lora-adapter', endpoint='https://huggingface.co', repo_type='model', repo_id='Qndhm/distilled-bert-imdb-lora-adapter'), pr_revision=None, pr_num=None)

In [29]:
from safetensors.torch import load_file
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig
from huggingface_hub import hf_hub_download
import os

# names of the base model and of the adapter repo on the Hub
base_model_name = "distilbert-base-uncased"
hf_repo_id = "Qndhm/distilled-bert-imdb-lora-adapter"

# --- load base model and adapter ---
print(f"loading base model: {base_model_name}")
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained(hf_repo_id)


# --- recreate the LoRA config used for training (it must match the adapter on the Hub) ---
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],
    modules_to_save=["pre_classifier", "classifier"],  # fully trained heads; listed so their keys show up in the comparison below
)

# attach a freshly initialized LoRA adapter (a no-op until the trained weights are loaded)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()


# load the adapter weights from the Hub
print(f"\n downloading weights from hub: {hf_repo_id}")
weights_path = hf_hub_download(repo_id=hf_repo_id, filename="adapter_model.safetensors")

adapter_weights = load_file(weights_path)


# compare the adapter's keys with the trainable keys of the PEFT-wrapped model
print("Hub keys of the adapter:", list(adapter_weights.keys()))

# print the trainable parameter names of the PEFT-wrapped model
model_trainable_keys = [k for k, v in model.named_parameters() if v.requires_grad]
print("trainable model keys:", model_trainable_keys)
new_state_dict = {}
for k, v in adapter_weights.items():
    # rename the saved keys so they match the PEFT-wrapped model's parameter names
    new_key = k.replace(".weight", ".default.weight")
    if "classifier" in new_key:
        if new_key.endswith(".bias"):
            # ...classifier.bias -> ...classifier.modules_to_save.default.bias
            new_key = new_key.replace(".bias", ".modules_to_save.default.bias")
        elif new_key.endswith(".weight"):
            # ...classifier.default.weight -> ...classifier.modules_to_save.default.weight
            new_key = new_key.replace(".default.weight", ".modules_to_save.default.weight")
    new_state_dict[new_key] = v



print("New keys:", list(new_state_dict.keys()))

print("\n Load weights with new keys")
model.load_state_dict(new_state_dict, strict=False)

# sanity check on a clearly negative review
model.eval()  # disable dropout (including LoRA dropout) for inference
text_neg = "I do not like this movie, it was bad!"
inputs_neg = tokenizer(text_neg, return_tensors="pt")
with torch.no_grad():
    outputs_neg = model(**inputs_neg)
predicted_class_id_neg = outputs_neg.logits.argmax().item()
print(f"negative: '{text_neg}' --> prediction: {predicted_class_id_neg}")



loading base model: distilbert-base-uncased


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925

 downloading weights from hub: Qndhm/distilled-bert-imdb-lora-adapter


adapter_model.safetensors:   0%|          | 0.00/2.96M [00:00<?, ?B/s]

Hub keys of the adapter: ['base_model.model.classifier.bias', 'base_model.model.classifier.weight', 'base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_A.weight', 'base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_B.weight', 'base_model.model.distilbert.transformer.layer.0.attention.v_lin.lora_A.weight', 'base_model.model.distilbert.transformer.layer.0.attention.v_lin.lora_B.weight', 'base_model.model.distilbert.transformer.layer.1.attention.q_lin.lora_A.weight', 'base_model.model.distilbert.transformer.layer.1.attention.q_lin.lora_B.weight', 'base_model.model.distilbert.transformer.layer.1.attention.v_lin.lora_A.weight', 'base_model.model.distilbert.transformer.layer.1.attention.v_lin.lora_B.weight', 'base_model.model.distilbert.transformer.layer.2.attention.q_lin.lora_A.weight', 'base_model.model.distilbert.transformer.layer.2.attention.q_lin.lora_B.weight', 'base_model.model.distilbert.transformer.layer.2.attention.v_lin.lora_A.weight', 'base_mod
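
The key renaming in the cell above can be isolated into a small pure function and checked on a few of the key names printed in the output. This is a sketch that mirrors the notebook's own string rules; `remap_adapter_key` is a hypothetical helper, the `pre_classifier` key is assumed to follow the same pattern as the `classifier` keys, and the `modules_to_save.default` layout is taken from the notebook's remapping code and may differ between PEFT versions.

```python
def remap_adapter_key(key: str, adapter_name: str = "default") -> str:
    """Map a key from adapter_model.safetensors to the in-memory PEFT model key."""
    prefix, param = key.rsplit(".", 1)  # e.g. ("...lora_A", "weight")
    if "classifier" in key:
        # Fully trained heads (classifier, pre_classifier) are stored under
        # modules_to_save.<adapter_name> in the wrapped model.
        return f"{prefix}.modules_to_save.{adapter_name}.{param}"
    if param == "weight":
        # LoRA matrices gain the adapter name before ".weight".
        return f"{prefix}.{adapter_name}.{param}"
    return key


saved_keys = [
    "base_model.model.classifier.bias",
    "base_model.model.classifier.weight",
    "base_model.model.pre_classifier.weight",  # assumed to exist in the file
    "base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_A.weight",
]
remapped = [remap_adapter_key(k) for k in saved_keys]
for old, new in zip(saved_keys, remapped):
    print(old, "->", new)
```

Since `load_state_dict(..., strict=False)` silently skips keys that do not match, it is worth inspecting its return value for missing or unexpected keys after loading. PEFT also provides `PeftModel.from_pretrained(base_model, repo_id)`, which is usually the simpler way to load an adapter and avoids manual renaming.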

In [30]:
trainer.test(LightningModel(model), dataloaders=test_loader)  # matches the test accuracy of the finetuned model above

INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: |          | 0/? [00:00<?, ?it/s]

[{'accuracy': 0.9161999821662903}]