<a href="https://colab.research.google.com/github/shah-zeb-naveed/large-language-models/blob/main/peft/bloomz_prompt_tuning_clm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt tuning for causal language modeling

Prompting guides language model behavior by adding task-specific text to the input. Prompt tuning is an additive method that trains and updates only a small set of newly added prompt tokens, while the weights of the pretrained model stay frozen. This way, a single frozen pretrained model can serve many downstream tasks, each with its own small set of prompt parameters, instead of a separately fine-tuned copy of the full model. As models grow larger, prompt tuning becomes more efficient relative to full fine-tuning, and its results improve as the number of model parameters scales.

<Tip>

💡 Read [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) to learn more about prompt tuning.

</Tip>
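
To make the idea concrete, here is a minimal, framework-free sketch of what prompt tuning adds at the input layer. The tiny embedding table, the vectors, and the `build_model_input` helper are all hypothetical. A real implementation works on tensors and learns `soft_prompt` by gradient descent while the embedding table stays frozen.

```python
# Frozen embedding table of a hypothetical tiny model: token id -> vector.
FROZEN_EMBEDDINGS = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.4]}

# Trainable soft prompt: one vector per virtual token (2 virtual tokens here).
# In prompt tuning these vectors are the only parameters that get updated.
soft_prompt = [[0.5, 0.5], [0.6, 0.6]]


def build_model_input(token_ids, attention_mask):
    """Prepend the soft prompt to the embedded tokens and extend the mask."""
    embedded = [FROZEN_EMBEDDINGS[t] for t in token_ids]
    inputs_embeds = soft_prompt + embedded
    # Virtual tokens are always attended to, so their mask entries are 1.
    full_mask = [1] * len(soft_prompt) + attention_mask
    return inputs_embeds, full_mask


inputs_embeds, full_mask = build_model_input([1, 2], [1, 1])
print(len(inputs_embeds), full_mask)  # 4 [1, 1, 1, 1]
```

The sequence the model sees is longer than the tokenized text by the number of virtual tokens, which is worth keeping in mind when choosing a maximum input length.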

This guide will show you how to apply prompt tuning to train a [`bloomz-560m`](https://huggingface.co/bigscience/bloomz-560m) model on the `twitter_complaints` subset of the [RAFT](https://huggingface.co/datasets/ought/raft) dataset.

Before you begin, make sure you have all the necessary libraries installed:

```bash
!pip install -q peft transformers datasets
```


## Setup

Start by defining the model and tokenizer, the dataset and the dataset columns to train on, some training hyperparameters, and the [PromptTuningConfig](https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig). The [PromptTuningConfig](https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig) contains information about the task type, the text to initialize the prompt embedding, the number of virtual tokens, and the tokenizer to use:

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
import torch
from datasets import load_dataset
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "cuda"
model_name_or_path = "bigscience/bloomz-560m"
tokenizer_name_or_path = "bigscience/bloomz-560m"

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path=model_name_or_path,
)

dataset_name = "twitter_complaints"
checkpoint_name = f"{dataset_name}_{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}_v1.pt".replace(
    "/", "_"
)
text_column = "Tweet text"
label_column = "text_label"
max_length = 64
lr = 3e-2
num_epochs = 10
batch_size = 8

In [13]:
TaskType.CAUSAL_LM

<TaskType.CAUSAL_LM: 'CAUSAL_LM'>

In [14]:
PromptTuningInit.TEXT

<PromptTuningInit.TEXT: 'TEXT'>

In [15]:
peft_config.task_type

<TaskType.CAUSAL_LM: 'CAUSAL_LM'>
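
The initialization text and `num_virtual_tokens` do not have to match in length. The sketch below shows one way to reconcile them: truncate the tokenized text when it is too long and repeat it when it is too short. This is an assumption about how text initialization is typically handled, not a copy of the library's code, and `fit_init_tokens` and the token ids are made up for illustration.

```python
import math


def fit_init_tokens(init_token_ids, num_virtual_tokens):
    """Truncate or repeat the init token ids so exactly num_virtual_tokens remain."""
    n = len(init_token_ids)
    if n == 0:
        raise ValueError("the init text must produce at least one token")
    if n < num_virtual_tokens:
        # Repeat the sequence often enough to cover all virtual tokens.
        init_token_ids = init_token_ids * math.ceil(num_virtual_tokens / n)
    return init_token_ids[:num_virtual_tokens]


long_text_ids = list(range(100, 111))  # 11 made-up token ids
short_text_ids = [7, 8, 9]             # 3 made-up token ids

print(fit_init_tokens(long_text_ids, 8))   # [100, 101, 102, 103, 104, 105, 106, 107]
print(fit_init_tokens(short_text_ids, 8))  # [7, 8, 9, 7, 8, 9, 7, 8]
```

Under this assumption, an init text that tokenizes to more than 8 tokens would contribute only its first 8 tokens to the starting prompt embedding.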

## Load dataset

For this guide, you'll load the `twitter_complaints` subset of the [RAFT](https://huggingface.co/datasets/ought/raft) dataset. This subset contains tweets that are labeled either `complaint` or `no complaint`:

In [3]:
dataset = load_dataset("ought/raft", dataset_name)
dataset["train"][0]


{'Tweet text': '@HMRCcustomers No this is my first job', 'ID': 0, 'Label': 2}

To make the `Label` column more readable, replace the `Label` value with the corresponding label text and store them in a `text_label` column. You can use the [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to apply this change over the entire dataset in one step:

In [4]:
classes = [k.replace("_", " ") for k in dataset["train"].features["Label"].names]
dataset = dataset.map(
    lambda x: {"text_label": [classes[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
dataset["train"][0]
#{"Tweet text": "@HMRCcustomers No this is my first job", "ID": 0, "Label": 2, "text_label": "no complaint"}


{'Tweet text': '@HMRCcustomers No this is my first job',
 'ID': 0,
 'Label': 2,
 'text_label': 'no complaint'}
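
With `batched=True`, the mapped function receives a dictionary of lists rather than a single example, which is why the lambda above iterates over `x["Label"]`. A small stand-alone sketch of that contract follows. The label names and the `add_text_label` helper are assumptions for illustration; the real names come from `dataset["train"].features["Label"].names`.

```python
# Assumed label names, used only for this illustration.
label_names = ["Unlabeled", "complaint", "no_complaint"]
classes = [name.replace("_", " ") for name in label_names]


def add_text_label(batch):
    """Batched map function: takes lists of values, returns lists of equal length."""
    return {"text_label": [classes[label] for label in batch["Label"]]}


# A batch is a dict that maps each column name to a list of values.
batch = {"Tweet text": ["tweet a", "tweet b"], "Label": [2, 1]}
new_columns = add_text_label(batch)
print(new_columns)  # {'text_label': ['no complaint', 'complaint']}
```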

## Preprocess dataset

Next, set up a tokenizer, configure the padding token to use for padding sequences, and determine the maximum length of the tokenized labels:

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
target_max_length = max([len(tokenizer(class_label)["input_ids"]) for class_label in classes])
print(target_max_length)


3


Create a `preprocess_function` to:

1. Tokenize the input text and labels.
2. For each example in a batch, append the tokenizer's `pad_token_id` to the labels.
3. Concatenate the input text and labels into the `model_inputs`.
4. Create a separate attention mask for `labels` and `model_inputs`.
5. Loop through each example in the batch again to pad the input ids, labels, and attention mask to the `max_length` and convert them to PyTorch tensors.

In [6]:
def preprocess_function(examples):
    batch_size = len(examples[text_column])
    inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
    targets = [str(x) for x in examples[label_column]]
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
        # print(i, sample_input_ids, label_input_ids)
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # print(model_inputs)
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
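
To see what the function produces, here is a toy walk-through of the same masking and left-padding for a single example, using plain lists. The token ids, `PAD_ID`, and the `build_example` helper are made up for illustration. The value `-100` is the label index that PyTorch's cross-entropy loss ignores by default, so the prompt and padding positions do not contribute to the loss.

```python
PAD_ID = 3           # hypothetical pad token id
IGNORE_INDEX = -100  # positions with this label are skipped by the loss
MAX_LENGTH = 8


def build_example(prompt_ids, label_ids, max_length=MAX_LENGTH):
    """Concatenate prompt and label, mask the prompt in the labels, then left-pad."""
    label_ids = label_ids + [PAD_ID]  # the pad token doubles as an end marker
    input_ids = prompt_ids + label_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + label_ids
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)  # a negative value yields no padding
    input_ids = ([PAD_ID] * pad + input_ids)[:max_length]
    attention_mask = ([0] * pad + attention_mask)[:max_length]
    labels = ([IGNORE_INDEX] * pad + labels)[:max_length]
    return input_ids, attention_mask, labels


input_ids, attention_mask, labels = build_example([11, 12, 13], [21, 22])
print(input_ids)       # [3, 3, 11, 12, 13, 21, 22, 3]
print(attention_mask)  # [0, 0, 1, 1, 1, 1, 1, 1]
print(labels)          # [-100, -100, -100, -100, -100, 21, 22, 3]
```

Note that truncating from the right with `[:max_length]` would cut off the label tokens of an overly long example, so `max_length` should comfortably exceed the longest prompt plus label.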

Use the [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function to apply the `preprocess_function` to the entire dataset. You can remove the unprocessed columns since the model won't need them:

In [7]:
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)


Create a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) from the `train` and `eval` datasets. Set `pin_memory=True` to speed up the data transfer to the GPU during training if the samples in your dataset are on a CPU.

In [8]:
train_dataset = processed_datasets["train"]
eval_dataset = processed_datasets["test"]


train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
)
eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)
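
As a quick sanity check, you can work out how many batches each loader yields per epoch. The sketch below assumes the 50 training and 3,399 test examples of this subset and the batch size of 8 defined earlier; `num_batches` is a hypothetical helper, not part of PyTorch.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return num_examples // batch_size
    # By default the last, smaller batch is kept.
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(50, 8)
eval_batches = num_batches(3399, 8)
print(train_batches, eval_batches)  # 7 425
```

These counts should match the totals shown by the progress bars during training and evaluation.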

## Train

You're almost ready to set up your model and start training!

Initialize a base model from [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM), and pass it and `peft_config` to the `get_peft_model()` function to create a [PeftModel](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel). You can print the new [PeftModel](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel)'s trainable parameters to see how much more efficient it is than training the full parameters of the original model!

In [9]:
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
# print_trainable_parameters() prints the summary itself and returns None,
# so it should not be wrapped in print().
model.print_trainable_parameters()
#"trainable params: 8192 || all params: 559222784 || trainable%: 0.0014648902430985358"

trainable params: 8,192 || all params: 559,222,784 || trainable%: 0.0014648902430985358
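
The reported numbers can be reproduced by hand. The sketch below assumes the only trainable weights are the virtual-token embeddings, one vector of the model's hidden size (1024 for this model) per virtual token; the variable names are chosen for illustration.

```python
num_virtual_tokens = 8
token_dim = 1024          # hidden size of bloomz-560m
all_params = 559_222_784  # total reported for the wrapped model

# One embedding vector per virtual token.
trainable_params = num_virtual_tokens * token_dim
trainable_pct = 100 * trainable_params / all_params
frozen_params = all_params - trainable_params

print(f"trainable params: {trainable_params:,}")  # trainable params: 8,192
print(f"frozen params: {frozen_params:,}")        # frozen params: 559,214,592
print(f"trainable%: {trainable_pct:.5f}")         # trainable%: 0.00146
```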


In [27]:
print(model)

PeftModelForCausalLM(
  (base_model): BloomForCausalLM(
    (transformer): BloomModel(
      (word_embeddings): Embedding(250880, 1024)
      (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (h): ModuleList(
        (0-23): 24 x BloomBlock(
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (self_attention): BloomAttention(
            (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (attention_dropout): Dropout(p=0.0, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): BloomMLP(
            (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu_impl): BloomGelu()
            (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
          )
        )
      

Set up an optimizer and learning rate scheduler:

In [10]:
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=(len(train_dataloader) * num_epochs),
)
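
To get a feel for the schedule, here is a plain-Python sketch of a linear warmup followed by a linear decay to zero, which is the shape this scheduler is designed to produce. `linear_lr` is a hypothetical helper, and the 70 total steps assume 7 batches per epoch over 10 epochs.

```python
def linear_lr(step, base_lr=3e-2, num_warmup_steps=0, num_training_steps=70):
    """Learning rate at a given optimizer step for a linear warmup/decay schedule."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


for step in (0, 35, 70):
    print(step, linear_lr(step))
```

With no warmup, the learning rate starts at its full value, reaches half of it at the midpoint, and hits zero at the final step.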

Move the model to the GPU, then write a training loop to start training!

In [26]:
model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.detach().float()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    eval_loss = 0
    eval_preds = []
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        eval_loss += loss.detach().float()
        eval_preds.extend(
            tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(), skip_special_tokens=True)
        )

    eval_epoch_loss = eval_loss / len(eval_dataloader)
    eval_ppl = torch.exp(eval_epoch_loss)
    train_epoch_loss = total_loss / len(train_dataloader)
    train_ppl = torch.exp(train_epoch_loss)
    print(f"{epoch=}: {train_ppl=} {train_epoch_loss=} {eval_ppl=} {eval_epoch_loss=}")

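    # Upload the adapter after every epoch. This requires being logged in to the Hub
    # and having `peft_model_id` defined beforehand (see the "Share model" section).
    print('Pushing model...')
    model.push_to_hub(peft_model_id, use_auth_token=True)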


100%|██████████| 7/7 [00:02<00:00,  2.75it/s]
100%|██████████| 425/425 [01:29<00:00,  4.78it/s]


epoch=0: train_ppl=tensor(3.4897, device='cuda:0') train_epoch_loss=tensor(1.2498, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.63it/s]
100%|██████████| 425/425 [01:35<00:00,  4.46it/s]


epoch=1: train_ppl=tensor(3.6137, device='cuda:0') train_epoch_loss=tensor(1.2847, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.53it/s]
100%|██████████| 425/425 [01:37<00:00,  4.36it/s]


epoch=2: train_ppl=tensor(3.3805, device='cuda:0') train_epoch_loss=tensor(1.2180, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.52it/s]
100%|██████████| 425/425 [01:37<00:00,  4.35it/s]


epoch=3: train_ppl=tensor(3.5788, device='cuda:0') train_epoch_loss=tensor(1.2750, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.49it/s]
100%|██████████| 425/425 [01:37<00:00,  4.36it/s]


epoch=4: train_ppl=tensor(3.3052, device='cuda:0') train_epoch_loss=tensor(1.1955, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.50it/s]
100%|██████████| 425/425 [01:37<00:00,  4.34it/s]


epoch=5: train_ppl=tensor(3.6756, device='cuda:0') train_epoch_loss=tensor(1.3017, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.53it/s]
100%|██████████| 425/425 [01:37<00:00,  4.35it/s]


epoch=6: train_ppl=tensor(3.3299, device='cuda:0') train_epoch_loss=tensor(1.2029, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.53it/s]
100%|██████████| 425/425 [01:37<00:00,  4.35it/s]


epoch=7: train_ppl=tensor(3.3208, device='cuda:0') train_epoch_loss=tensor(1.2002, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.52it/s]
100%|██████████| 425/425 [01:37<00:00,  4.36it/s]


epoch=8: train_ppl=tensor(3.3076, device='cuda:0') train_epoch_loss=tensor(1.1962, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...


100%|██████████| 7/7 [00:02<00:00,  2.48it/s]
100%|██████████| 425/425 [01:37<00:00,  4.35it/s]


epoch=9: train_ppl=tensor(3.3006, device='cuda:0') train_epoch_loss=tensor(1.1941, device='cuda:0') eval_ppl=tensor(838496.0625, device='cuda:0') eval_epoch_loss=tensor(13.6394, device='cuda:0')
Pushing model...
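
The perplexity values in the log are simply the exponential of the mean loss, as computed by `torch.exp` in the loop. A short check with the standard library, using the rounded losses printed for epoch 0, is sketched below; `perplexity` is a helper name chosen for illustration.

```python
import math


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)


train_ppl = perplexity(1.2498)
eval_ppl = perplexity(13.6394)
print(round(train_ppl, 2))  # 3.49
print(f"{eval_ppl:.3e}")    # 8.385e+05
```

The small differences from the logged values come from the losses being rounded to four decimals in the printout.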


## Share model

You can store and share your model on the Hub if you'd like. Log in to your Hugging Face account and enter your token when prompted:

In [16]:
from huggingface_hub import notebook_login

notebook_login()


Use the [push_to_hub](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.push_to_hub) function to upload your model to a model repository on the Hub:

In [17]:
peft_model_id = "shahzebnaveed/bloomz-560m_prompt_tuning_clm"
model.push_to_hub(peft_model_id, use_auth_token=True)

CommitInfo(commit_url='https://huggingface.co/shahzebnaveed/bloomz-560m_prompt_tuning_clm/commit/9a311ffc016738c229f193e3ae92b8dc0b01787d', commit_message='Upload model', commit_description='', oid='9a311ffc016738c229f193e3ae92b8dc0b01787d', pr_url=None, pr_revision=None, pr_num=None)

Once the model is uploaded, you'll see that the adapter file is only about 33 kB! 🤏

## Inference

Let's try the model on a sample input for inference. If you look at the repository you uploaded the model to, you'll see an `adapter_config.json` file. Load this file into [PeftConfig](https://huggingface.co/docs/peft/main/en/package_reference/config#peft.PeftConfig) to specify the `peft_type` and `task_type`. Then load the base model named in the configuration and pass it, together with the adapter repository ID, to [from_pretrained()](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel.from_pretrained) to create the [PeftModel](https://huggingface.co/docs/peft/main/en/package_reference/peft_model#peft.PeftModel):

In [28]:
from peft import PeftModel, PeftConfig

peft_model_id = "shahzebnaveed/bloomz-560m_prompt_tuning_clm"
config = PeftConfig.from_pretrained(peft_model_id)
config

PromptTuningConfig(peft_type=<PeftType.PROMPT_TUNING: 'PROMPT_TUNING'>, auto_mapping=None, base_model_name_or_path='bigscience/bloomz-560m', revision=None, task_type='CAUSAL_LM', inference_mode=True, num_virtual_tokens=8, token_dim=1024, num_transformer_submodules=1, num_attention_heads=16, num_layers=24, prompt_tuning_init='TEXT', prompt_tuning_init_text='Classify if the tweet is a complaint or not:', tokenizer_name_or_path='bigscience/bloomz-560m', tokenizer_kwargs=None)

In [29]:
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)

Grab a tweet and tokenize it:

In [30]:
inputs = tokenizer(
    f'{text_column} : {"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?"} Label : ',
    return_tensors="pt",
)

Put the model on a GPU and *generate* the predicted label:

In [31]:
model.to(device)

with torch.no_grad():
    inputs = {k: v.to(device) for k, v in inputs.items()}
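    # Generation stops at token id 3, which for this tokenizer is assumed to be the
    # pad token that preprocessing appended to every label as an end marker.
    outputs = model.generate(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
    )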
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
# [
#     "Tweet text : @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label : complaint"
# ]

['Tweet text : @nationalgridus I have no water and the bill is current and paid. Can you do something about this? Label : no complaint']
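
Because the decoded string contains the prompt as well as the prediction, you will usually want to strip the prompt before using the label. A minimal sketch, assuming the `Label : ` separator from the prompt template above; `extract_label` is a hypothetical helper, not part of any library.

```python
def extract_label(generated_text, separator="Label : "):
    """Return the text that follows the last occurrence of the separator."""
    _, found, tail = generated_text.rpartition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not found in generated text")
    return tail.strip()


decoded = (
    "Tweet text : @nationalgridus I have no water and the bill is current and paid. "
    "Can you do something about this? Label : no complaint"
)
print(extract_label(decoded))  # no complaint
```

Using the last occurrence of the separator keeps the helper robust if a tweet itself happens to contain the same text.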
