# Prefix tuning for conditional generation

Prefix tuning is an additive method where only a sequence of continuous task-specific vectors, the *prefix*, is attached to the beginning of the input. Only the prefix parameters are optimized; they are added to the hidden states in every layer of the model, and the tokens of the input sequence can attend to the prefix as *virtual tokens*. As a result, prefix tuning stores 1000x fewer parameters than a fully finetuned model, which means you can use one large language model for many tasks.
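
To make the idea of virtual tokens concrete, here is a toy sketch in plain Python. It assumes the prefix enters attention as extra key/value vectors placed in front of the keys and values of the real tokens; the numbers, the `attend` helper, and the single-head setup are illustrative assumptions, not the library's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs

# Activations of the frozen model for two real input tokens (toy numbers)
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# Trainable prefix: the key/value pair of one virtual token (toy numbers)
prefix_keys = [[0.5, 0.5]]
prefix_values = [[10.0, 10.0]]

plain = attend(queries, keys, values)
with_prefix = attend(queries, prefix_keys + keys, prefix_values + values)

print("without prefix:", plain)
print("with prefix:   ", with_prefix)
```

The output still has one vector per real token, but each vector is shifted by the prefix. Training only has to adjust `prefix_keys` and `prefix_values`, while everything that produced `queries`, `keys`, and `values` stays frozen.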

<Tip>

💡 Read [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/abs/2101.00190) to learn more about prefix tuning. 

</Tip>

This guide will show you how to apply prefix tuning to train a [`t5-large`](https://huggingface.co/t5-large) model on the `sentences_allagree` subset of the [financial_phrasebank](https://huggingface.co/datasets/financial_phrasebank) dataset.

Before you begin, make sure you have all the necessary libraries installed:

```bash
!pip install -q peft transformers datasets
```

## Setup

Start by defining the model and tokenizer, the text and label columns, and some hyperparameters so that training is quicker to set up later. Set the environment variable `TOKENIZERS_PARALLELISM` to `false` to turn off the parallelism of the fast Rust-based tokenizer, which processes data in parallel by default, so you can use multiprocessing in Python.

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, PrefixTuningConfig, TaskType
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch
import os

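# Turn off tokenizer parallelism so it does not conflict with Python multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Restrict the run to a single GPU; the index is specific to the machine used here
os.environ["CUDA_VISIBLE_DEVICES"] = "3"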

device = "cuda"
model_name_or_path = "t5-large"
tokenizer_name_or_path = "t5-large"

max_length = 128
lr = 1e-2
num_epochs = 5
batch_size = 8


## Load dataset

For this guide, you'll train on the `sentences_allagree` subset of the [`financial_phrasebank`](https://huggingface.co/datasets/financial_phrasebank) dataset. This dataset contains financial news categorized by sentiment.

Use the 🤗 [Datasets](https://huggingface.co/docs/datasets/index) [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) function to create a training and validation split. Then convert the integer `label` value to the more readable `text_label` with the [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function:

In [108]:
from datasets import load_dataset

# This run loads the dataset from a local path (a Kaggle input directory).
# To load it from the Hub instead, use:
# dataset = load_dataset("financial_phrasebank", "sentences_allagree")
dataset = load_dataset('/kaggle/input/loader', 'sentences_allagree')

# Hold out 10% of the examples and rename the "test" split to "validation"
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset["validation"] = dataset["test"]
del dataset["test"]

text_column = "sentence"
label_column = "text_label"
classes = dataset["train"].features["label"].names

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})

In [111]:
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["label"]]},
    batched=True,
    num_proc=1,
)

dataset["train"][0]

{'sentence': 'The remainder of its revenues will come from technology agreements with other firms , InterDigital said .',
 'label': 1,
 'text_label': 'neutral'}
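
With `batched=True`, the mapped function receives a dictionary of columns (lists) rather than a single row. The sketch below mimics that on a plain dictionary. The `add_text_label` helper and the toy sentences are hypothetical, and the order of the class names is an assumption; only the pairing of label `1` with `neutral` is confirmed by the example above.

```python
# Class names in label-id order. Index 1 -> "neutral" matches the example above;
# the positions of the other two names are an assumption.
classes = ["negative", "neutral", "positive"]

def add_text_label(batch):
    """Mimic the batched lambda: take a dict of columns, return the new column."""
    return {"text_label": [classes[label] for label in batch["label"]]}

# A toy batch in the column-oriented layout that a batched map passes in
batch = {
    "sentence": ["(toy sentence A)", "(toy sentence B)", "(toy sentence C)"],
    "label": [1, 0, 2],
}

new_columns = add_text_label(batch)
print(new_columns)
```

The returned dictionary only contains the new column, which is merged into the existing ones, so `sentence` and `label` are kept alongside `text_label`.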

## Preprocess dataset

Initialize a tokenizer:

In [112]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

Create a `preprocess_function` that pads and truncates the `model_inputs` and `labels`, then use the [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to apply it to the dataset. You can remove the unprocessed columns since the model doesn't need them anymore:

In [115]:
import numpy as np

def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[label_column]
    # Pad or truncate every input sentence to exactly `max_length` tokens
    model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True)
    # The targets are short class names, so they are padded or truncated to two tokens
    labels = tokenizer(targets, max_length=2, padding="max_length", truncation=True)
    labels = np.array(labels["input_ids"])
    # Replace padding token ids with -100 so the loss ignores those positions
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

print(processed_datasets['train'][0])
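
The label masking step is easy to overlook, so here is a minimal sketch of it on plain lists. It assumes T5's padding token id of `0` and relies on `-100` being the default `ignore_index` of PyTorch's cross-entropy loss; the token ids in `padded_labels` are made up for illustration.

```python
PAD_TOKEN_ID = 0      # padding token id of the T5 tokenizer
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids):
    """Replace every padding id with IGNORE_INDEX, leaving real tokens untouched."""
    return [
        [IGNORE_INDEX if token == PAD_TOKEN_ID else token for token in row]
        for row in label_ids
    ]

# Made-up ids: the first label fills both positions, the second one is padded
padded_labels = [[1465, 1], [7163, 0]]

masked_labels = mask_padding(padded_labels)
print(masked_labels)
```

Without this step, the model would also be trained to predict padding tokens, which adds nothing to the task.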

Create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) from the `train` and `eval` datasets. Set `pin_memory=True` to speed up the data transfer to the GPU during training if the samples in your dataset are on a CPU.

In [116]:
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["validation"]

train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)
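
As a sanity check on the split and batch sizes, the number of batches per epoch can be estimated by hand. The sketch below starts from the 2264 rows reported when the dataset was loaded and assumes the held-out split is rounded up to a whole number of rows; that rounding rule is an assumption, although it agrees with the 255 training steps and 29 evaluation steps per epoch shown in the training output.

```python
import math

num_rows = 2264    # rows reported for the loaded dataset
test_size = 0.1    # fraction passed to train_test_split
batch_size = 8

# Assumption: the held-out split is rounded up, and the rest is used for training
num_eval = math.ceil(num_rows * test_size)
num_train = num_rows - num_eval

# A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches
train_steps = math.ceil(num_train / batch_size)
eval_steps = math.ceil(num_eval / batch_size)

print(num_train, num_eval, train_steps, eval_steps)
```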

## Train model

Now you can set up your model and make sure it is ready for training. Specify the task in [PrefixTuningConfig](https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PrefixTuningConfig), create the base `t5-large` model from [AutoModelForSeq2SeqLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM), and then wrap the model and configuration in a [PeftModel](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel). Feel free to print the [PeftModel](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel)'s parameters and compare them with the full set of model parameters to see how much more efficient it is!

In [117]:
peft_config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, num_virtual_tokens=20)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
#"trainable params: 983040 || all params: 738651136 || trainable%: 0.13308583065659835"
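
The reported parameter counts can be checked with a few lines of arithmetic. The decomposition below (24 layers, one key and one value vector per layer, hidden size 1024) is an assumption about how the prefix is laid out for `t5-large`; it is used here because it reproduces the 983040 trainable parameters exactly.

```python
num_virtual_tokens = 20
trainable_params = 983040     # from the print_trainable_parameters() output above
all_params = 738651136        # from the same output

# Assumed layout: each virtual token stores a key and a value vector per layer
num_layers, key_and_value, hidden_size = 24, 2, 1024
params_per_virtual_token = num_layers * key_and_value * hidden_size
assert num_virtual_tokens * params_per_virtual_token == trainable_params

trainable_pct = 100 * trainable_params / all_params
reduction = all_params / trainable_params
print(f"{trainable_pct:.4f}% of all parameters are trainable")
print(f"the full parameter set is about {reduction:.0f}x larger than the prefix")
```

In this configuration the saving is in the hundreds rather than the thousands, and it grows or shrinks with `num_virtual_tokens`.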

In [136]:
model

PeftModelForSeq2SeqLM(
  (base_model): T5ForConditionalGeneration(
    (shared): Embedding(32128, 1024)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 1024)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=1024, out_features=1024, bias=False)
                (k): Linear(in_features=1024, out_features=1024, bias=False)
                (v): Linear(in_features=1024, out_features=1024, bias=False)
                (o): Linear(in_features=1024, out_features=1024, bias=False)
                (relative_attention_bias): Embedding(32, 16)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseActDense(
                (wi): Linear(in_features=1024, out_features=4096, bias=False)
             

Set up the optimizer and learning rate scheduler:

In [118]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)
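
To see what the scheduler does to the learning rate, the sketch below reimplements a linear warmup and decay rule in plain Python. It is meant to mirror the behavior of `get_linear_schedule_with_warmup` as I understand it, not to replace it; `linear_schedule_lr` is a hypothetical helper, and the 255 steps per epoch are taken from the training output of this run.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

lr = 1e-2
steps_per_epoch = 255   # len(train_dataloader) in this run
num_epochs = 5
total_steps = steps_per_epoch * num_epochs

for step in (0, total_steps // 2, total_steps, total_steps + 100):
    print(step, linear_schedule_lr(step, lr, 0, total_steps))
```

Any optimizer step taken after `num_training_steps` uses a learning rate of zero, so the training loop should run for exactly `num_epochs` epochs to match the schedule.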

Because the training loop pushes checkpoints to the Hub, log in to your Hugging Face account first and choose a repository name for the adapter:

In [19]:
from huggingface_hub import notebook_login

notebook_login()

In [48]:
peft_model_id = "shahzebnaveed/t5-large_PREFIX_TUNING_SEQ2SEQ"

Move the model to the GPU, and then write a training loop to begin!

In [121]:
model = model.to(device)

for epoch in range(num_epochs):
    # Training pass: only the prefix parameters receive gradient updates
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Evaluation pass: accumulate the loss and decode the greedy predictions
    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    # Perplexity is the exponential of the mean loss per batch
    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")

    # Push the checkpoints from the last two epochs to the Hub
    if epoch in [3, 4]:
        model.push_to_hub(peft_model_id, use_auth_token=True)

100%|██████████| 255/255 [00:42<00:00,  5.99it/s]
100%|██████████| 29/29 [00:04<00:00,  7.15it/s]
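
The loop reports perplexity next to the loss. As a small illustration of how the two relate, the sketch below computes perplexity as the exponential of the mean per-batch loss, the same formula the loop applies with `torch.exp`. The `perplexity` helper and the loss values are made up for illustration.

```python
import math

def perplexity(batch_losses):
    """Exponential of the mean per-batch cross-entropy loss."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

toy_losses = [0.9, 0.4, 0.2]   # made-up per-batch losses
print(perplexity(toy_losses))
```

A perplexity of 1.0 corresponds to a loss of zero, so values close to 1.0 mean the model assigns almost all probability to the correct label tokens.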


Let's see how well the model performs on the validation set:

In [122]:
correct = 0
total = 0
for pred, true in zip(eval_preds, dataset["validation"]["text_label"]):
    if pred.strip() == true.strip():
        correct += 1
    total += 1
accuracy = correct / total * 100
accuracy

96.91629955947137
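
The accuracy above is an exact-match score on the decoded strings. The sketch below shows the same comparison on toy data. The `exact_match_accuracy` helper and the toy lists are hypothetical, and the final line assumes a validation split of 227 rows, in which case the reported score corresponds to 220 correct predictions.

```python
def exact_match_accuracy(preds, targets):
    """Percentage of predictions equal to their target after stripping whitespace."""
    correct = sum(p.strip() == t.strip() for p, t in zip(preds, targets))
    return 100 * correct / len(targets)

toy_preds = ["positive", " neutral", "negative", "neutral"]
toy_targets = ["positive", "neutral", "neutral", "neutral"]
print(exact_match_accuracy(toy_preds, toy_targets))

# Assumption: 227 validation rows, of which 220 were predicted correctly
print(100 * 220 / 227)
```

Stripping whitespace matters because decoded predictions can carry leading or trailing spaces that would otherwise count as mismatches.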

97% accuracy in just a few minutes; pretty good!

Upload the model to a specific model repository on the Hub with the [push_to_hub](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.push_to_hub) function:

In [124]:
model.push_to_hub(peft_model_id, use_auth_token=True)

CommitInfo(commit_url='https://huggingface.co/shahzebnaveed/t5-large_PREFIX_TUNING_SEQ2SEQ/commit/3ea257e53ecac100b48562e9ca80c8d9f9bbe66c', commit_message='Upload model', commit_description='', oid='3ea257e53ecac100b48562e9ca80c8d9f9bbe66c', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Once the model has been uploaded to the Hub, anyone can easily use it for inference. Load the configuration and model:

In [125]:
from peft import PeftModel, PeftConfig

peft_model_id = "shahzebnaveed/t5-large_PREFIX_TUNING_SEQ2SEQ"

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)

Get and tokenize some text about financial news:

In [133]:
dataset['validation'][4]['sentence']

'Operating loss was EUR 179mn , compared to a loss of EUR 188mn in the second quarter of 2009 .'

In [135]:
inputs = tokenizer(
    dataset['validation'][4]['sentence'],
    return_tensors="pt",
)

model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # Generate at most two new tokens and decode them into the predicted class name
    outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=2)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

['negative']