# Fine-tune FLAN-T5 using `bitsandbytes`, `peft` & `transformers` 🤗

In this notebook we will see how to use `peft`, `transformers` & `bitsandbytes` to fine-tune `flan-t5-large` in Google Colab.

We will fine-tune the model on the [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset, which consists of sentence-label pairs for classifying financial sentences as `positive`, `neutral` or `negative`.

Note that you could use the same notebook to fine-tune `flan-t5-xl` as well, but you would need to shard the model first to avoid CPU RAM issues on Google Colab; see [these weights](https://huggingface.co/ybelkada/flan-t5-xl-sharded-bf16).

## Install requirements

In [1]:
!pip install -q bitsandbytes datasets accelerate
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main

## Import model and tokenizer

In [2]:
# Select CUDA device index
import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/flan-t5-large"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_name)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


## Prepare model for training

Some pre-processing needs to be done before training such an int8 model using `peft`, so let's import a utility function, `prepare_model_for_kbit_training`, that will:
- Cast all the non-`int8` modules to full precision (`fp32`) for stability
- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training

In [3]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

## Load your `PeftModel` 

Here we will use LoRA (Low-Rank Adaptation) to train our model.

In [4]:
from peft import LoraConfig, get_peft_model, TaskType


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q", "v"], lora_dropout=0.05, bias="none", task_type="SEQ_2_SEQ_LM"
)


model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 787868672 || trainable%: 0.5989059049678777


As you can see, we are only training about 0.6% of the model's parameters! This greatly reduces the memory needed for gradients and optimizer states, which lets us fine-tune the model without running into memory issues.
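
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the published `flan-t5-large` configuration (`d_model=1024`, 16 attention heads of size 64, 24 encoder and 24 decoder blocks); those values are not stated in this notebook, and `lora_params` is a hypothetical helper name. Each LoRA-wrapped linear layer adds two matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed from the public flan-t5-large config, not from this notebook.
d_model = 1024
inner_dim = 16 * 64  # num_heads * d_kv
num_encoder_blocks = 24
num_decoder_blocks = 24

# One self-attention per encoder block; self- plus cross-attention per decoder block.
attention_layers = num_encoder_blocks + 2 * num_decoder_blocks
adapted_modules = attention_layers * 2  # target_modules=["q", "v"]

trainable = adapted_modules * lora_params(d_model, inner_dim, r=16)
all_params = 787868672  # total reported by print_trainable_parameters above

print(trainable, 100 * trainable / all_params)
```

Under these assumptions the count comes out to the same 4,718,592 parameters reported above, which suggests that only the `q` and `v` projections carry trainable weights.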

## Load and process data

This section was written for the [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset and its `sentences_allagree` split, which according to the dataset card contains the sentences with 100% annotator agreement. The cell below, however, loads the `Daivik1911/Fact-Updates` dataset from the Hub and holds out 20% of its `train` split for validation.

In [46]:
# loading dataset
dataset = load_dataset("Daivik1911/Fact-Updates")
dataset = dataset["train"].train_test_split(test_size=0.2)
dataset["validation"] = dataset["test"]
del dataset["test"]

classes = sorted(set(dataset["train"]["label"]))  # sorted so the index-to-label mapping is deterministic
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'text_label'],
        num_rows: 5865
    })
    validation: Dataset({
        features: ['text', 'label', 'text_label'],
        num_rows: 1467
    })
})
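
The row counts above are consistent with a 20% hold-out in which the validation share is rounded up. A small sketch of that arithmetic, assuming the full `train` split has 5865 + 1467 = 7332 rows (`split_sizes` is a hypothetical helper, not a `datasets` function):

```python
import math


def split_sizes(n_rows, test_size):
    """Return (train_rows, test_rows) with the test share rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


train_rows, val_rows = split_sizes(7332, 0.2)
print(train_rows, val_rows)
```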

Let's also pre-process the input data. The labels need to be tokenized, and the label tokens equal to `pad_token_id` need to be set to `-100` so that the `CrossEntropy` loss used by the model ignores them.
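
To make the `-100` convention concrete, here is a minimal pure-Python sketch. The token ids and log-probabilities are made up for illustration, and `mask_padding` and `mean_nll` are hypothetical helpers rather than library functions; the only facts relied on are that T5 uses `0` as its pad id and that PyTorch's cross-entropy loss skips targets equal to `-100` by default.

```python
IGNORE_INDEX = -100  # target value that PyTorch's cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace every padding id with IGNORE_INDEX."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def mean_nll(token_log_probs, labels):
    """Average negative log-likelihood over the positions that are not masked."""
    kept = [-lp for lp, t in zip(token_log_probs, labels) if t != IGNORE_INDEX]
    return sum(kept) / len(kept)


label_ids = [209, 1, 0]  # hypothetical ids: one label token, EOS, one pad
masked = mask_padding(label_ids, pad_token_id=0)
print(masked)

# Hypothetical log-probabilities assigned to each target position.
loss = mean_nll([-0.2, -0.1, -5.0], masked)
print(loss)
```

The padded position has a very poor log-probability (`-5.0`), but it does not affect the averaged loss because its label is masked.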

In [62]:
# data preprocessing
text_column = "text"
label_column = "label"
max_length = 256


def preprocess_function(examples):
    # Cast to str first: the tokenizer expects text, and the column values may not be strings.
    inputs = [str(value) for value in examples[text_column]]
    targets = [str(value) for value in examples[label_column]]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    # Replace padding token ids with -100 so that the loss ignores them.
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs


processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"]
val_dataset = processed_datasets["validation"]

In [64]:
train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 5865
})

## Train our model! 

Let's now train our model by running the cells below.
Note that for T5, since some layers are kept in `float32` for stability, there is no need to call autocast on the trainer.

In [65]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    "temp",
    evaluation_strategy="epoch",  # newer transformers releases name this argument `eval_strategy`
    learning_rate=1e-3,
    gradient_accumulation_steps=1,
    auto_find_batch_size=True,
    num_train_epochs=1,
    save_steps=100,
    save_total_limit=8,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [66]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.2122 | 0.117565 |




TrainOutput(global_step=734, training_loss=0.18956410982303462, metrics={'train_runtime': 956.7521, 'train_samples_per_second': 6.13, 'train_steps_per_second': 0.767, 'total_flos': 6801240112496640.0, 'train_loss': 0.18956410982303462, 'epoch': 1.0})
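
The `global_step` and throughput figures above can be cross-checked with a little arithmetic. This sketch assumes a single GPU and a per-device batch size of 8 (the `TrainingArguments` default, assuming `auto_find_batch_size` did not have to lower it); neither value is printed in the output, so treat both as assumptions.

```python
import math

train_rows = 5865  # rows in the processed training set
per_device_batch_size = 8  # assumed: the default, not reduced by auto_find_batch_size
gradient_accumulation_steps = 1
num_train_epochs = 1
train_runtime = 956.7521  # seconds, from the TrainOutput above

batches_per_epoch = math.ceil(train_rows / per_device_batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs
samples_per_second = train_rows * num_train_epochs / train_runtime

print(total_steps, round(samples_per_second, 2))
```

Both numbers agree with the reported `global_step=734` and `train_samples_per_second` of 6.13, which is consistent with the assumed batch size.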

## Qualitatively test our model

Let's run a quick qualitative evaluation of the model on a sample sentence that corresponds to a positive label. Run generation the same way you would with a plain `transformers` model:

In [None]:
model.eval()
input_text = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
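# Move the inputs to the model's device so generation does not rely on automatic device placement.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)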

print("input sentence: ", input_text)
print(" output prediction: ", tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

Generate config GenerationConfig {
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.27.0.dev0",
  "use_cache": false
}



input sentence:  In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .
 output prediction:  ['positive']


## Share your adapters on 🤗 Hub

Once you have trained your adapter, you can easily share it on the Hub using the `push_to_hub` method. Note that only the adapter weights and config will be pushed.

In [67]:
from huggingface_hub import notebook_login

notebook_login()

In [68]:
model.push_to_hub("Daivik1911/Flan-T5-Fact_Updates", use_auth_token=True)



adapter_model.safetensors:   0%|          | 0.00/18.9M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Daivik1911/Flan-T5-Fact_Updates/commit/fa6efb126298af7773d1fc0f491bee87600d984d', commit_message='Upload model', commit_description='', oid='fa6efb126298af7773d1fc0f491bee87600d984d', pr_url=None, pr_revision=None, pr_num=None)
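
The 18.9M upload reported above matches what the adapter weights alone would occupy. A rough check, assuming the LoRA weights are stored as 4-byte `float32` values and ignoring the small safetensors header:

```python
trainable_params = 4718592  # LoRA parameters reported by print_trainable_parameters
bytes_per_param = 4  # assumed: float32 storage

adapter_bytes = trainable_params * bytes_per_param
adapter_megabytes = adapter_bytes / 1e6  # decimal megabytes

print(adapter_bytes, round(adapter_megabytes, 1))
```

The base model's roughly 788 million parameters are not part of the upload, which is why the adapter repository stays this small.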

## Load your adapter from the Hub

You can load the model together with the adapter in a few lines of code! The snippet below loads the adapter from the Hub and runs the example evaluation.

In [69]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "Daivik1911/Flan-T5-Fact_Updates"
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

adapter_config.json:   0%|          | 0.00/639 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/18.9M [00:00<?, ?B/s]

In [79]:
device = torch.device("cuda:0")

model = model.to(device)

datasetTest = load_dataset("Daivik1911/Fact-Updates-test")

datasetTest = datasetTest["train"]

In [82]:
datasetTest["label"][1]

0

In [83]:
model.eval()
gold = []  # reference labels
actual = []  # model predictions
for example in datasetTest:
    input_text = str(example["text"])
    output_text = str(example["label"])
    inputs = tokenizer(input_text, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_new_tokens=10)

    print("input sentence: ", input_text)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    # Any text containing "0" counts as class 0; everything else counts as class 1.
    actual.append(0 if "0" in response else 1)
    gold.append(0 if "0" in output_text else 1)


print(actual)
print(gold)


Token indices sequence length is longer than the specified maximum sequence length for this model (1097 > 512). Running this sequence through the model will result in indexing errors


input sentence:  <sod>Nullarbor National Park and Nullarbor Regional Reserve protect the world's largest semi-arid cave landscape, which is associated with many Aboriginal cultural sites. {{Citation needed|date=July 2008}} Wildlife inhabiting in the park includes the Southern Hairy-nosed [[Wombat]]<eod>
<sod>The Nullarbor Regional Reserve and the adjoining [[Nullarbor National Park]] protect the world's largest semi-arid cave landscape, which is associated with many Aboriginal cultural sites. {{Citation needed|date=July 2008}} Wildlife inhabiting in the regional reserve includes the Southern Hairy-nosed [[Wombat]]<eod>
input sentence:  <sod>{{hatnote|"Concentration camp" redirects here. For specific contexts see [[Nazi concentration camps]] (World War II) and [[British concentration camps]] (Boer War).}}<eod>
<sod>{{hatnote|"Concentration camp" redirects here. For specific contexts see [[Nazi concentration camps]] ([[World War II]]) and [[British concentration camps]] ([[Boer War]]).}}

input sentence:  <sod>On January 24, 1984, Apple Computer, Inc. introduced the original [[Macintosh 128K]]. Its [[System 1|early system software]] was partially based on the [[Lisa OS]], previously released by Apple for the [[Apple Lisa|Lisa]] computer in 1983; as part of an agreement allowing [[Xerox]] to buy shares in Apple at a favorable price, it also used concepts from the [[Xerox PARC]] [[Xerox Alto|Alto]], which former Apple CEO [[Steve Jobs]] and several other Macintosh team members had previewed. The operating system that was integral to early Macintosh computers is named '''System Software''' or simply "System", referred to by its major revision starting with [[System 6]] and [[System 7]]. Apple rebranded version 7.6 as '''Mac OS''' alongside its [[Macintosh clone]] program in 1996,<ref name="versionhistory">{{cite web|title=Macintosh: System Software Version History|publisher=[[Apple Inc.]]|date=August 7, 2001|url=https://support.apple.com/kb/TA31885|accessdate=September 25,

In [84]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
print(f1_score(gold, actual, average='macro'))

print(f1_score(gold, actual, average='micro'))

print(f1_score(gold, actual, average='weighted'))

accuracy = accuracy_score(gold, actual)
print(f"Accuracy: {accuracy:.4f}")

0.44875478927203066
0.8140747176368376
0.7306398567296137
Accuracy: 0.8141
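
The gap between macro F1 and the other scores above is worth a closer look. For single-label classification, micro F1 equals accuracy, while macro F1 averages the per-class scores without weighting by class size, so a rarely predicted class pulls it down. One pattern that produces such a gap is a classifier that mostly predicts the majority class. The pure-Python sketch below illustrates this with hypothetical toy labels (not this notebook's data); `class_counts` and `f1` are illustrative helpers.

```python
def class_counts(gold, pred, cls):
    """True positives, false positives and false negatives for one class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 80% of the gold labels are 1 and the model always predicts 1.
toy_gold = [1] * 8 + [0] * 2
toy_pred = [1] * 10
classes = sorted(set(toy_gold))

counts = {c: class_counts(toy_gold, toy_pred, c) for c in classes}
per_class = {c: f1(*counts[c]) for c in classes}

macro = sum(per_class.values()) / len(classes)
weighted = sum(per_class[c] * toy_gold.count(c) for c in classes) / len(toy_gold)
micro = f1(*(sum(col) for col in zip(*counts.values())))  # pool the counts over classes
accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)

print(macro, micro, weighted, accuracy)
```

In this toy case the micro score and accuracy are both 0.8, while the macro score drops to about 0.44 because class 0 is never predicted. Checking the per-class scores for the real predictions would show whether the same thing is happening here.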


In [77]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Fri Mar 22 01:43:50 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 5000     Off  | 00000000:C1:00.0 Off |                  Off |
| 33%   30C    P8     8W / 230W |   5454MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces