# Low-rank fine-tuning of FLAN-T5 on general tasks: sentiment analysis and summarization


Fully fine-tuning a model, where all of its parameters are updated, is expensive for very large models. For perspective, a T4 GPU with 16GB of memory can barely fine-tune a 1B-parameter model, and large models like GPT-3 have 175B parameters or more. In this notebook we therefore try more efficient ways of adapting a model.
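
To put the memory claim in perspective, here is a rough back-of-the-envelope estimate. It assumes fp32 weights and gradients plus an Adam-style optimizer with two fp32 states per parameter (about 16 bytes per parameter in total) and ignores activations. The function name and the byte counts are illustrative assumptions, not figures from the sources above.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=16):
    """Rough memory (in GB) for weights + gradients + Adam states.

    Assumption: fp32 weights (4 B) + fp32 gradients (4 B) + two fp32
    Adam moments (8 B) = 16 bytes per parameter. Activations are ignored.
    """
    return n_params * bytes_per_param / 1e9


for name, n_params in [("1B model", 1e9), ("3B model", 3e9), ("175B model", 175e9)]:
    print(f"{name}: ~{full_finetune_memory_gb(n_params):,.0f} GB")
```

Under these assumptions a 1B-parameter model already needs roughly the full 16GB of a T4 before any activations are stored.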

In this notebook we will see how to use `peft`, `transformers` and `bitsandbytes` to fine-tune `flan-t5-xl`. The `peft` package lets us use low-rank adaptation (LoRA), which trains less than 1% of the original model parameters. The [paper](https://arxiv.org/abs/2106.09685) shows that this is comparable to full fine-tuning.

More information about the implementation of `peft` can be found [here](https://github.com/huggingface/peft). This approach allows us to fine tune a 3B parameter model on a single T4 GPU.
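
The idea behind low-rank adaptation can be sketched in a few lines of plain Python. This is a toy illustration, not the `peft` implementation: the frozen weight `W` is left untouched and only two small matrices `A` and `B` are trained, with the update scaled by `alpha / r` as described in the LoRA paper. All names and dimensions below are made up for the example.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, zero init
x = [1.0] * d_in

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(f"full: {full_params} params, LoRA: {lora_params} params")
```

The saving grows with the layer size: the full matrix scales with `d_in * d_out`, while the adapter scales only with `r * (d_in + d_out)`.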

First, we will fine-tune the model on the [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset, which consists of text-label pairs for classifying finance-related sentences as `positive`, `neutral` or `negative`.

Second, we will fine-tune the model on a summarization task using the [`samsum`](https://huggingface.co/datasets/samsum/viewer/samsum/train?row=0) dataset.

Inspired by sources [1](https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_flan_t5_large_bnb_peft.ipynb) and [2](https://www.philschmid.de/fine-tune-flan-t5-peft).

# Install requirements

In [2]:
## Imports
import os
import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from peft import prepare_model_for_int8_training
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForSeq2Seq
from datasets import concatenate_datasets
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
import pandas as pd
import numpy as np

# Select CUDA device index
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Utility functions
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/app-root/lib64/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


# 1. Sentiment analysis task

## Import model and tokenizer

In [4]:
model_name = "ybelkada/flan-t5-xl-sharded-bf16"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading (…)lve/main/config.json: 100%|██████████| 1.53k/1.53k [00:00<00:00, 892kB/s]
Downloading (…)model.bin.index.json: 100%|██████████| 50.8k/50.8k [00:00<00:00, 24.6MB/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Loading checkpoint shards: 100%|██████████| 3/3 [00:07<00:00,  2.54s/it]
Downloading (…)neration_config.json: 100%|██████████| 147/147 [00:00<00:00, 80.9kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 2.50k/2.50k [00:00<00:00, 1.35MB/s]
Downloading spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 11.5MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.42M/2.42M [00:00<00:00, 112MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<00:00, 1.67MB/s]


## Prepare model for training
Some pre-processing is needed before training such an int8 model with `peft`, so we use the utility function `prepare_model_for_int8_training`, which will:
- Cast all the non-`int8` modules to full precision (`fp32`) for stability
- Add a `forward_hook` to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training

In [6]:
## Prepare model for training
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
print_trainable_parameters(model)

trainable params: 9437184 || all params: 2859194368 || trainable%: 0.33006444422319176


As you can see, here we are only training about 0.33% of the parameters of the model! This is a huge memory gain that will enable us to fine-tune the model without any memory issue.
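
The reported number of trainable parameters can be reproduced by hand. The sketch below assumes the published `flan-t5-xl` architecture (hidden size 2048, 32 heads of size 64, 24 encoder and 24 decoder layers); these values are not printed anywhere in this notebook, so treat them as assumptions. Each LoRA adapter on a linear layer adds `r * (in_features + out_features)` weights.

```python
# Assumed flan-t5-xl architecture values (from its published config, not from this notebook).
d_model = 2048          # hidden size, input size of the q and v projections
inner_dim = 32 * 64     # num_heads * d_kv, output size of the q and v projections
encoder_layers = 24     # one self-attention block each
decoder_layers = 24     # one self-attention and one cross-attention block each
r = 16                  # LoRA rank used above

attention_blocks = encoder_layers + 2 * decoder_layers
adapted_linears = attention_blocks * 2            # "q" and "v" in every block
params_per_adapter = r * (d_model + inner_dim)    # A is r x d_model, B is inner_dim x r

trainable = adapted_linears * params_per_adapter
all_params = 2859194368                           # total reported by print_trainable_parameters
print(f"trainable params: {trainable} ({100 * trainable / all_params:.2f}%)")
```

Under these assumptions the count matches the 9437184 trainable parameters printed above.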

## Load and process data

Here we will use the [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset to fine-tune our model for sentiment classification of financial sentences. We will load the `sentences_allagree` configuration, which according to the dataset card contains the sentences with 100% annotator agreement.

In [7]:
# loading dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
del dataset["test"]

classes = dataset["train"].features["label"].names
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)

Downloading builder script: 100%|██████████| 6.04k/6.04k [00:00<00:00, 5.23MB/s]
Downloading metadata: 100%|██████████| 13.7k/13.7k [00:00<00:00, 10.1MB/s]
Downloading readme: 100%|██████████| 8.86k/8.86k [00:00<00:00, 6.64MB/s]


Downloading and preparing dataset financial_phrasebank/sentences_allagree to /opt/app-root/src/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141...


Downloading data: 100%|██████████| 682k/682k [00:00<00:00, 76.7MB/s]
                                                                       

Dataset financial_phrasebank downloaded and prepared to /opt/app-root/src/.cache/huggingface/datasets/financial_phrasebank/sentences_allagree/1.0.0/550bde12e6c30e2674da973a55f57edde5181d53f5a5a34c1531c53f93b7e141. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 301.62it/s]
                                                    

Let's also pre-process the input data. The labels need special handling: the tokens corresponding to `pad_token_id` need to be set to `-100` so that the `CrossEntropy` loss associated with the model correctly ignores them.

In [8]:
# data preprocessing
text_column = "sentence"
label_column = "text_label"
max_length = 128


def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt")
    labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
    labels = labels["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs


processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]
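
To illustrate why padding positions in the labels are replaced by `-100`, here is a small stand-alone sketch in plain Python. It is not the loss used by `transformers`; it only mimics the "ignore index" behaviour on hypothetical token ids and probabilities. The pad id of 0 is an assumption that holds for T5 tokenizers.

```python
import math

IGNORE_INDEX = -100
PAD_TOKEN_ID = 0  # assumption: T5 tokenizers use 0 as the pad token id


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids by the ignore index so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


label_ids = [1465, 1, 0]            # hypothetical ids: one word, end-of-sequence, one pad
masked = mask_padding(label_ids)    # the pad position becomes -100
probs_of_target = [0.9, 0.8, 0.01]  # hypothetical probabilities the model gives each target token

print(masked)
print(mean_nll(probs_of_target, masked))
```

Without the masking, the low probability at the padded position would dominate the average even though it carries no information about the label.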

                                                                                          

## Train our model! 

Let's now train our model by running the cells below.
Note that for T5, since some layers are kept in `float32` for stability purposes, there is no need to call autocast on the trainer.

In [9]:
training_args = TrainingArguments(
    "temp",
    evaluation_strategy="epoch",
    learning_rate=1e-3,
    gradient_accumulation_steps=1,
    auto_find_batch_size=True,
    num_train_epochs=1,
    save_steps=100,
    save_total_limit=8,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

In [10]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,No log,0.016685




TrainOutput(global_step=255, training_loss=0.29072899911917893, metrics={'train_runtime': 603.365, 'train_samples_per_second': 3.376, 'train_steps_per_second': 0.423, 'total_flos': 4370030543241216.0, 'train_loss': 0.29072899911917893, 'epoch': 1.0})
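
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and that `auto_find_batch_size` did not have to shrink the default per-device batch size of 8; neither is stated explicitly in the output above.

```python
import math

train_runtime = 603.365          # seconds, from TrainOutput
samples_per_second = 3.376       # from TrainOutput
global_step = 255                # from TrainOutput

n_train = round(train_runtime * samples_per_second)   # approximate number of training sentences
batch_size = 8                   # assumption: the default per-device batch size was kept
steps = math.ceil(n_train / batch_size)

print(n_train, steps)
```

Under these assumptions one epoch takes as many optimizer steps as the reported `global_step`. The "No log" entry for the training loss is presumably because the default `logging_steps` of 500 is larger than the number of steps in this run.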

## Qualitatively test our model

Let's do a quick qualitative evaluation of the model using two variants of a sample sentence, one describing a decrease in net interest income and one describing an increase. Run generation just as you would with any `transformers` model:

In [16]:
model.eval()
input_text_1 = "In January-September 2009 , the Group 's net interest income decreased to EUR 12.4 mn from EUR 74.3 mn in January-September 2008 ."
input_text_2 = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
inputs_1 = tokenizer(input_text_1, return_tensors="pt")
inputs_2 = tokenizer(input_text_2, return_tensors="pt")

outputs_1 = model.generate(input_ids=inputs_1["input_ids"], max_new_tokens=10)
outputs_2 = model.generate(input_ids=inputs_2["input_ids"], max_new_tokens=10)

print("input sentence: ", input_text_1)
print(" output prediction: ", tokenizer.batch_decode(outputs_1.detach().cpu().numpy(), skip_special_tokens=True))

print("input sentence: ", input_text_2)
print(" output prediction: ", tokenizer.batch_decode(outputs_2.detach().cpu().numpy(), skip_special_tokens=True))

input sentence:  In January-September 2009 , the Group 's net interest income decreased to EUR 12.4 mn from EUR 74.3 mn in January-September 2008 .
 output prediction:  ['negative']
input sentence:  In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 .
 output prediction:  ['positive']
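
A qualitative check on two sentences can be complemented by a simple accuracy score over the validation split. The helper below is a hypothetical sketch: it only compares decoded strings with reference labels, and the example predictions are invented for illustration rather than produced by the model.

```python
def accuracy(predictions, references):
    """Fraction of predictions equal to the reference label, ignoring case and whitespace."""
    def normalise(text):
        return text.strip().lower()

    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical decoded generations and gold labels, for illustration only.
predictions = ["negative", "positive", " Neutral", "positive"]
references = ["negative", "positive", "neutral", "neutral"]
print(accuracy(predictions, references))
```

In practice `predictions` would come from `tokenizer.batch_decode` on the generated ids and `references` from the `text_label` column.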


# 2. Summarization task

In [3]:
model_name = "ybelkada/flan-t5-xl-sharded-bf16"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
# Load dataset from the hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
dataset

Found cached dataset samsum (/opt/app-root/src/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 14732
Test dataset size: 819


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [5]:
# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
conc_datasets = concatenate_datasets([dataset["train"], dataset["test"]])

tokenized_inputs = conc_datasets.map(lambda x: tokenizer(x["dialogue"],
                                                         truncation=True),
                                     batched=True,
                                     remove_columns=["dialogue", "summary"])

input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]
# take the 85th percentile of the input lengths as the maximum source length
max_source_length = int(np.percentile(input_lengths, 85))

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = conc_datasets.map(lambda x: tokenizer(x["summary"],
                                                          truncation=True),
                                      batched=True,
                                      remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take the 90th percentile of the target lengths as the maximum target length
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")

Map:   0%|          | 0/15551 [00:00<?, ? examples/s]

Map:   0%|          | 0/15551 [00:00<?, ? examples/s]

Max target length: 50
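
The reason for capping at a percentile instead of the maximum can be seen on a toy example. The sketch below uses made-up token lengths with a single long outlier and two hypothetical helper functions; it compares how many sequences get truncated with how many pad tokens are needed for a given cap.

```python
def truncated_fraction(lengths, cap):
    """Share of sequences longer than `cap`, i.e. the ones that would be truncated."""
    return sum(length > cap for length in lengths) / len(lengths)


def padding_tokens(lengths, cap):
    """Total pad tokens needed when every sequence is padded (or cut) to `cap`."""
    return sum(max(cap - length, 0) for length in lengths)


# Hypothetical token lengths with one long outlier.
lengths = [12, 18, 20, 25, 30, 31, 40, 44, 52, 400]

for cap in (max(lengths), 52):
    print(cap, truncated_fraction(lengths, cap), padding_tokens(lengths, cap))
```

Padding to the maximum wastes thousands of pad tokens on this toy data, while a lower cap truncates only the outlier and needs far less padding.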


In [6]:
def preprocess_function(sample, padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")

Loading cached processed dataset at /opt/app-root/src/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-bd742c963bc41ca9.arrow


Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Loading cached processed dataset at /opt/app-root/src/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-66ae263e56bdc634.arrow


Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/14732 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

In [7]:
# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)
# prepare int-8 model for training
model = prepare_model_for_int8_training(model)

# add LoRA adaptor
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 9437184 || all params: 2859194368 || trainable%: 0.33006444422319176


In [8]:
# we want to ignore tokenizer pad token in the loss
label_pad_token_id = -100
# Data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

output_dir = "lora-flan-t5-xl"

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
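
What the data collator does to the labels can be sketched in plain Python. This is a simplified illustration of the padding behaviour, not the `DataCollatorForSeq2Seq` implementation: label sequences in a batch are padded on the right with `label_pad_token_id` up to the longest sequence, rounded up to a multiple of `pad_to_multiple_of`. The function name and the example ids are hypothetical.

```python
import math


def pad_labels(batch_labels, label_pad_token_id=-100, pad_to_multiple_of=8):
    """Pad every label sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(labels) for labels in batch_labels)
    target = math.ceil(longest / pad_to_multiple_of) * pad_to_multiple_of
    return [labels + [label_pad_token_id] * (target - len(labels)) for labels in batch_labels]


# Hypothetical label ids; lengths 3 and 10 are both padded to 16.
batch = [[71, 5, 1], list(range(2, 12))]
padded = pad_labels(batch)
print([len(labels) for labels in padded])
```

Since the labels in this notebook were already padded to the maximum target length of 50 during preprocessing, a collator configured this way would round them up to 56.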

In [None]:
# train model
trainer.train()

In [None]:
trainer.save_model('Summarizer')

# Conclusion
The notebook shows how we can use PEFT to fine-tune the FLAN-T5 model on downstream tasks such as sentiment analysis and summarization. The code can be adapted to other datasets and tasks, such as documentation search.