# FineTuning w/ LoRA
- Inspired by Shaw Talebi (https://www.youtube.com/watch?v=eC6Hd1hFvos)

In [1]:
! pip install accelerate evaluate peft bitsandbytes git+https://github.com/huggingface/transformers trl py7zr auto-gptq optimum


Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-9nd63r_4
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-9nd63r_4
  Resolved https://github.com/huggingface/transformers to commit 2209b7afa04b3a6366350065f541e9248d6663c2
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting peft
  Downloading peft-0.9.0-py

In [2]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [3]:
# from huggingface (datasets, transformers, peft, evaluate)
from datasets import load_dataset, DatasetDict, Dataset
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer)

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np


In [4]:
# base model: distilbert-base-uncased
model_checkpoint = 'distilbert-base-uncased'
# model_checkpoint = 'roberta-base' # you can alternatively use roberta-base but this model is bigger thus training will take longer

# define label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative":0, "Positive":1}

# generate classification model from model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# how dataset was generated

# load imdb data
imdb_dataset = load_dataset("imdb")

# define subsample size
N = 1000
# generate indexes for random subsample
rand_idx = np.random.randint(24999, size=N)

# extract train and test data
x_train = imdb_dataset['train'][rand_idx]['text']
y_train = imdb_dataset['train'][rand_idx]['label']

x_test = imdb_dataset['test'][rand_idx]['text']
y_test = imdb_dataset['test'][rand_idx]['label']

# create new dataset
dataset = DatasetDict({'train':Dataset.from_dict({'label':y_train,'text':x_train}),
                             'validation':Dataset.from_dict({'label':y_test,'text':x_test})})

Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

### Dataset & Model architecture

In [6]:
print(dataset['train']['text'][:1])
print(np.array(dataset['train']['label']).sum()/len(dataset['train']['label']))
print(model)
print(model.num_parameters)

["Without doubt the best of the novels of John Le Carre, exquisitely transformed into a classic film. Performances by Peter Egan (Magnus Pym, The Perfect Spy), Rudiger Weigang (Axel, real name Alexander Hampel, Magnus' Czech Intelligence controller), Ray McAnally (Magnus' con-man father) and Alan Howard (Jack Brotherhood, Magnus' mentor, believer and British controller), together with the rest of the characters, are so perfect and natural, the person responsible for casting them should have been given an award. Even the small parts, such as Major Membury, are performed to perfection. It says a lot for the power of the performances, and the strength of the characters in the novel that, despite the duplicity of Magnus, one cannot help but feel closer to Magnus and Axel than to Jack Brotherhood and the slimy Grant Lederer of U.S. Intelligence. I have read the book at least a dozen times, and watched the movie almost as many times, and continue to be mesmerized by both. If I had one book t

### Preprocessing

- Text → Numerical form (token)
- use AutoTokenizer.from_pretrained (grab the tokenizer for the particular base model)
- Dataset that we pass need to be the same length. We can achieve this by
  - 1. Truncating long sequences
  - 2. Padding short sequences to a predetermined fixed length

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space= True)

# add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# print(len(dataset['train']['text'])) = 1000
def tokenize_function(examples):
    # examples: dataset[train or test]
    text = examples["text"] # text: 1000 rows of text

    # truncate when length exceeds 512
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})


In [8]:
txt = ["It was good",
       "Not a fan",
       "Better than the first one!",
       "Perfectly disgusting",
       "It's not worth waching",
       "worth taking",
       "not bad",
       "so-so"]

print("Untrained model predictions:")
print("----------------------------")
for t in txt:
    # tokenize text
    inputs = tokenizer.encode(t, return_tensors="pt")
    # compute logits
    logits = model(inputs).logits
    # convert logits to label
    predictions = torch.argmax(logits)

    print(t + " - " + id2label[predictions.tolist()])

Untrained model predictions:
----------------------------
It was good - Positive
Not a fan - Positive
Better than the first one! - Positive
Perfectly disgusting - Positive
It's not worth waching - Positive
worth taking - Positive
not bad - Positive
so-so - Positive


### Data Collator
- Data collator will dynamically pad examples in a given batch to be as long as the longest sequence in that batch. For example, if we have four examples in our batch the longest sequence has 500 but the other have shorter ones, it'll dynamically pad the shorter sequences to match the longer one. It is helpful because if you pad your sequences dynamically with collator, it's a lot more computationally efficient than padding all your examples.  

In [10]:
accuracy = evaluate.load("accuracy")
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)

    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

peft_config = LoraConfig(task_type="SEQ_CLS",
                        r=4,
                        lora_alpha=32,
                        lora_dropout=0.01,
                        target_modules = ['q_lin'])

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# hyperparameters
lr = 1e-3
batch_size = 4
num_epochs = 10

# define training arguments
training_args = TrainingArguments(
    output_dir= model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator, # this will dynamically pad examples in each batch to be equal length
    compute_metrics=compute_metrics,
)

# train model
trainer.train()

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

trainable params: 628,994 || all params: 67,584,004 || trainable%: 0.9306847223789819


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.373767,{'accuracy': 0.883}
2,0.389900,0.516539,{'accuracy': 0.889}
3,0.389900,0.643072,{'accuracy': 0.892}
4,0.153100,0.780468,{'accuracy': 0.89}
5,0.153100,1.01123,{'accuracy': 0.895}
6,0.045400,1.136632,{'accuracy': 0.892}
7,0.045400,1.160063,{'accuracy': 0.877}
8,0.008700,1.174173,{'accuracy': 0.891}
9,0.008700,1.166896,{'accuracy': 0.892}
10,0.002800,1.14662,{'accuracy': 0.889}


Trainer is attempting to log a value of "{'accuracy': 0.883}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.889}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.892}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.89}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'accuracy': 0.895}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This in

TrainOutput(global_step=2500, training_loss=0.11998436985015869, metrics={'train_runtime': 456.0924, 'train_samples_per_second': 21.925, 'train_steps_per_second': 5.481, 'total_flos': 1149906761366016.0, 'train_loss': 0.11998436985015869, 'epoch': 10.0})

In [12]:
model.to('cpu')

print("Trained model predictions:")
print("--------------------------")
for t in txt:
    inputs = tokenizer.encode(t, return_tensors="pt").to("cpu")
    logits = model(inputs).logits
    predictions = torch.max(logits,1).indices

    print(t + " - " + id2label[predictions.tolist()[0]])

Trained model predictions:
--------------------------
It was good - Positive
Not a fan - Negative
Better than the first one! - Negative
Perfectly disgusting - Negative
It's not worth waching - Negative
worth taking - Positive
not bad - Negative
so-so - Negative
