Modified by: [@tayyib-ul-hassan](https://www.linkedin.com/in/tayyib-ul-hassan-4b24541b2/)

Thanks to [Sam Witteveen](https://www.youtube.com/@samwitteveenai) for the base notebook!

# Using 🤗 PEFT & bitsandbytes to fine-tune a LoRA checkpoint




In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
!pip install transformers torch torchvision



In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
!nvidia-smi -L

### Set up the model

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map='auto',  # let Accelerate place the weights on the available GPU(s), spilling to CPU if needed
    force_download=True  # re-download even if the files are cached; safe to remove on reruns
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
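A quick back-of-the-envelope check on download size and memory. The base parameter count below is derived from the `print_trainable_parameters` output later in this notebook (560,787,456 total minus 1,572,864 LoRA parameters), and the byte sizes are the standard ones for fp32, fp16/bf16 and int8. The 1.12 GB `model.safetensors` download is consistent with fp16 storage; if the weights are upcast to fp32 on load (the historical `from_pretrained` default when no `torch_dtype` is given), they occupy roughly twice that. Note that this cell passes neither `load_in_8bit` nor a `torch_dtype`, so the `bitsandbytes` import is not actually used here; the int8 line is only for comparison.

```python
# Base-model parameter count, derived from this notebook's own output:
# all params (560,787,456) minus the LoRA params (1,572,864).
BASE_PARAMS = 560_787_456 - 1_572_864

# Standard storage size per parameter for each dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

def weight_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights only, in decimal gigabytes (1 GB = 1e9 bytes).

    Activations, gradients and optimizer state come on top of this.
    """
    return n_params * bytes_per_param / 1e9

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>9}: {weight_size_gb(BASE_PARAMS, nbytes):.2f} GB")
```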


In [None]:
print(model)

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 1024)
    (word_embeddings_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=250880, bias=False)
)
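The `target_modules=["query_key_value"]` choice made later is read off this printout. A small helper that collects the distinct names of `nn.Linear` submodules makes that step less manual. `linear_module_names` is a hypothetical helper written for illustration (not a PEFT API), and here it is exercised on a toy module that mimics BLOOM's naming rather than on the real model. Applied to `model`, it should report `query_key_value`, `dense`, `dense_h_to_4h`, `dense_4h_to_h` and `lm_head`.

```python
import torch.nn as nn

def linear_module_names(model: nn.Module) -> set[str]:
    """Return the last path component of every nn.Linear submodule.

    Hypothetical helper: these short names are the kind of strings
    LoraConfig(target_modules=...) matches against.
    """
    return {
        name.split(".")[-1]
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    }

class ToyBlock(nn.Module):
    """Toy stand-in mimicking the layer names of one BLOOM block (tiny sizes)."""

    def __init__(self, d: int = 8):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(d)  # not Linear, so not reported
        self.self_attention = nn.Module()
        self.self_attention.query_key_value = nn.Linear(d, 3 * d)  # fused Q, K, V
        self.self_attention.dense = nn.Linear(d, d)
        self.mlp = nn.Module()
        self.mlp.dense_h_to_4h = nn.Linear(d, 4 * d)
        self.mlp.dense_4h_to_h = nn.Linear(4 * d, d)

print(sorted(linear_module_names(ToyBlock())))
```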

### Freezing the original weights


In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()  # let gradients flow through checkpointed blocks even though the embeddings are frozen

class CastOutputToFloat(nn.Sequential):
  '''
  Runs the wrapped module(s) and casts the result to float32.
  '''
  def forward(self, x):
    return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)

- `model.lm_head = CastOutputToFloat(model.lm_head)` replaces the model's `lm_head` attribute with a `CastOutputToFloat` instance that wraps the original head.

- In BLOOM, `model.lm_head` is the linear layer that projects the 1024-dimensional hidden states onto the 250,880-token vocabulary, producing the logits.

- Because the wrapper casts the head's output, the logits are always float32, which keeps the softmax and cross-entropy loss numerically stable even when the rest of the model runs in lower precision.
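To see what the freezing loop and `CastOutputToFloat` do to parameter and output dtypes, here is a toy run on a tiny bfloat16 model. The model is a hypothetical stand-in (not BLOOM), and bfloat16 is used instead of fp16 only because bfloat16 matmuls are broadly supported on CPU. In this notebook the model is loaded without a `torch_dtype` or 8-bit quantization, so its weights are most likely already fp32 and both steps are close to no-ops; they matter when the base model is loaded in half precision or 8-bit.

```python
import torch
import torch.nn as nn

class CastOutputToFloat(nn.Sequential):
    """Run the wrapped module(s), then cast the result to float32."""

    def forward(self, x):
        return super().forward(x).to(torch.float32)

# Hypothetical toy model in bfloat16: a LayerNorm (1-D params) and a bias-free
# "head" (2-D weight), loosely mirroring BLOOM's ln_f and lm_head.
toy = nn.Sequential(nn.LayerNorm(4), nn.Linear(4, 8, bias=False)).to(torch.bfloat16)

# Same loop as the notebook: freeze everything, upcast 1-D params to fp32.
for param in toy.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)

# Wrap only the head, as the notebook does with model.lm_head.
head = CastOutputToFloat(toy[1])
x = torch.randn(2, 4, dtype=torch.bfloat16)
out = head(x)  # bf16 matmul inside, float32 result outside

print([(name, p.dtype, p.requires_grad) for name, p in toy.named_parameters()])
print(out.dtype, tuple(out.shape))
```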

### Setting up the LoRA adapters

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

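config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["query_key_value"],  # fused attention projection, name taken from print(model)
    lora_dropout=0.05,
    bias="none",  # keep all bias terms frozen
    task_type="CAUSAL_LM"  # decoder-only LM; use "SEQ_2_SEQ_LM" for encoder-decoder models
)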

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 1572864 || all params: 560787456 || trainable%: 0.2804741766549072


No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
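The trainable-parameter count can be checked by hand. For a LoRA adapter of rank r on a linear layer with input size d_in and output size d_out, PEFT adds a matrix A of shape (r, d_in) and a matrix B of shape (d_out, r), i.e. r·(d_in + d_out) parameters, and the adapted layer computes W x + (alpha / r)·B A x. With r = 16 on the 24 `query_key_value` layers (1024 → 3072) that gives 24 × 16 × (1024 + 3072) = 1,572,864, matching the output above; stored in fp32 that is about 6.29 MB. The NumPy code below is a minimal sketch of this math, not PEFT's implementation (it ignores dropout and the layer bias). It also checks two properties: B starts at zero (PEFT's default initialization), so the adapted layer initially equals the frozen one, and the update can be folded into W.

```python
import numpy as np

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

def lora_forward(x, W, A, B, alpha: float, r: int):
    """Frozen weight W plus the scaled low-rank update (alpha / r) * B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# BLOOM-560m numbers from print(model) and the LoraConfig above.
n_layers, d_in, d_out, r, alpha = 24, 1024, 3072, 16, 32
total = n_layers * lora_param_count(d_in, d_out, r)
print(f"LoRA params: {total:,} (~{total * 4 / 1e6:.2f} MB in fp32)")

# Toy-sized check of the forward pass (small dims so it runs instantly).
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))         # frozen base weight: d_out=6, d_in=4
A = rng.standard_normal((2, 4)) * 0.01  # rank r=2, random init
B = np.zeros((6, 2))                    # zero init
x = rng.standard_normal(4)

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha=4, r=2), W @ x)

# Once B is non-zero (after training), merging folds the update into one matrix.
B = rng.standard_normal((6, 2))
W_merged = W + (4 / 2) * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, alpha=4, r=2), W_merged @ x)
```

The toy uses alpha / r = 4 / 2 = 2, the same scaling as the notebook's 32 / 16. For LoRA adapters, PEFT's `merge_and_unload()` performs this kind of fold to return a plain transformers model.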


## Data

### Load the data

In [None]:
import transformers
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")


In [None]:
print(data)
print(data["train"])  # the "train" split: a Dataset whose rows are returned as dicts

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 2508
    })
})
Dataset({
    features: ['quote', 'author', 'tags'],
    num_rows: 2508
})


### Add a new column to the dataset

In [None]:
def merge_columns(example):
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

train_dataset = data['train']
train_dataset = train_dataset.map(merge_columns)
train_dataset["prediction"][:5]

["“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']",
 "“I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.” ->: ['best', 'life', 'love', 'mistakes', 'out-of-control', 'truth', 'worst']",
 "“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.” ->: ['human-nature', 'humor', 'infinity', 'philosophy', 'science', 'stupidity', 'universe']",
 "“So many books, so little time.” ->: ['books', 'humor']",
 "“A room without books is like a body without a soul.” ->: ['books', 'simile', 'soul']"]
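Each training text is therefore the curly-quoted quote, the separator ` ->: `, and the Python `str()` of the tag list. A prompt at inference time only matches what the adapter saw if it follows the same pattern and stops right after the separator, so the model continues with tags. The helpers below are hypothetical (not part of the original notebook) and simply reproduce that format.

```python
SEPARATOR = " ->: "

def format_training_text(quote: str, tags: list[str]) -> str:
    """Same string merge_columns builds: quote + separator + str(list of tags)."""
    return quote + SEPARATOR + str(tags)

def build_prompt(text: str) -> str:
    """Wrap raw text in curly quotes like the dataset, then add the separator."""
    return "\u201c" + text + "\u201d" + SEPARATOR

print(format_training_text("\u201cSo many books, so little time.\u201d", ["books", "humor"]))
print(repr(build_prompt("I like strawberries.")))
```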

In [None]:
train_dataset[0]

{'quote': '“Be yourself; everyone else is already taken.”',
 'author': 'Oscar Wilde',
 'tags': ['be-yourself',
  'gilbert-perreira',
  'honesty',
  'inspirational',
  'misattributed-oscar-wilde',
  'quote-investigator'],
 'prediction': "“Be yourself; everyone else is already taken.” ->: ['be-yourself', 'gilbert-perreira', 'honesty', 'inspirational', 'misattributed-oscar-wilde', 'quote-investigator']"}

### Tokenize the dataset

In [None]:
train_dataset = train_dataset.map(lambda samples: tokenizer(samples['prediction']), batched=True)

In [None]:
train_dataset

Dataset({
    features: ['quote', 'author', 'tags', 'prediction', 'input_ids', 'attention_mask'],
    num_rows: 2508
})
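`DataCollatorForLanguageModeling(tokenizer, mlm=False)`, used in the training cell, pads each batch and copies `input_ids` into `labels`, setting label positions that hold the pad token id to -100 so the loss skips them; the one-position shift that turns this into next-token prediction happens inside the model's loss. The pure-Python sketch below mimics that behaviour under simplifying assumptions: right padding and a pad id of 3 chosen for illustration (BLOOM's tokenizer may pad on the left). With `per_device_train_batch_size=1`, as in this notebook, no padding is needed and `labels` is simply a copy of `input_ids`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def collate_for_causal_lm(batch: list[list[int]], pad_id: int) -> dict:
    """Simplified, right-padding sketch of causal-LM collation."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are the inputs themselves; the model shifts them internally
        # so that position t is trained to predict token t + 1.
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

print(collate_for_causal_lm([[5, 6, 7], [8, 9]], pad_id=3))
```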

## Training

In [None]:

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,  # small to fit limited GPU memory; increase if resources allow
        gradient_accumulation_steps=10,  # effective batch = 1 x 10 = 10 examples per optimizer step
        warmup_steps=1,  # linear warmup starts the LR at 0; keep this well below max_steps
        max_steps=1,  # a single step for a quick demo; increase for real training
        learning_rate=2e-4,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing; this silences the warning. Re-enable for inference!

try:
  trainer.train()
except KeyboardInterrupt:
  print("Keyboard interrupt: training stopped early.")

max_steps is given, it will override any value given in num_train_epochs


| Step | Training Loss |
|------|---------------|
| 1    | 3.9635        |
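How much training does this cell actually do? Each optimizer step sees `per_device_train_batch_size × gradient_accumulation_steps × number_of_devices` examples (10 here, assuming one GPU), so `max_steps=1` touches 10 of the 2,508 quotes. The learning-rate schedule matters too. The sketch below re-implements, in plain Python, the linear warmup-then-decay formula used by `transformers.get_linear_schedule_with_warmup` (the Trainer's default `"linear"` schedule); it is a model of that formula, not the library code. Under it, with `warmup_steps=1` and `max_steps=1` the only update is taken at a learning rate of 0, so the adapter weights would stay at their initialization. The logged loss of 3.9635 is computed before that update in any case.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """LR at 0-based optimizer step `step`: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

def lr_schedule(base_lr: float, warmup_steps: int, total_steps: int) -> list[float]:
    """Learning rate used at each of the `total_steps` optimizer steps."""
    return [linear_warmup_lr(s, base_lr, warmup_steps, total_steps) for s in range(total_steps)]

batch_size, grad_accum, n_devices = 1, 10, 1  # n_devices=1 is an assumption
examples_per_step = batch_size * grad_accum * n_devices
print("examples per optimizer step:", examples_per_step)

print("this notebook:", lr_schedule(2e-4, warmup_steps=1, total_steps=1))
print("longer run:   ", lr_schedule(2e-4, warmup_steps=2, total_steps=6))
```

Raising `max_steps` well above `warmup_steps` gives the adapter non-zero learning rates to work with.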


## Share adapters on the 🤗 Hub

In [None]:
model.push_to_hub("tayyibsupercool/bloom-560m-lora",
                  commit_message="basic training",
                  private=True)  # uses the token saved by notebook_login(); uploads only the adapter weights


CommitInfo(commit_url='https://huggingface.co/tayyibsupercool/bloom-560m-lora/commit/bedc7f697e7cbf23228123c84ea328068232040c', commit_message='basic training', commit_description='', oid='bedc7f697e7cbf23228123c84ea328068232040c', pr_url=None, pr_revision=None, pr_num=None)

## Load adapters from the Hub

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub repo ID of the trained adapter
peft_model_id = "tayyibsupercool/bloom-560m-lora"
# Load the adapter configuration (it records which base model was used)
config = PeftConfig.from_pretrained(peft_model_id)
# Load the base model named in that configuration
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, device_map='auto')
# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attach the LoRA adapter weights to the base model
model = PeftModel.from_pretrained(model, peft_model_id)


## Inference

In [None]:
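batch = tokenizer("I like strawberries” ->: ", return_tensors='pt').to(model.device)  # put the inputs on the model's device

with torch.autocast("cuda", enabled=torch.cuda.is_available()):  # mixed precision on GPU only
  output_tokens = model.generate(**batch, max_new_tokens=50)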

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))



 I like strawberries” ->:  I like strawberries. I like strawberries. I like strawberries. I like strawberries. I like strawberries. I like strawberries. I like strawberries. I like strawberries. I like
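The output loops on the same phrase, a typical failure of greedy decoding, which is presumably what `generate` uses here since no sampling options are set. One standard mitigation is `generate`'s `no_repeat_ngram_size` argument, for example `model.generate(**batch, max_new_tokens=50, no_repeat_ngram_size=3)`, which forbids any n-gram of that size from appearing twice. The pure-Python function below sketches the core rule on token ids; it illustrates the idea and is not the transformers implementation.

```python
def banned_next_tokens(tokens: list[int], n: int) -> set[int]:
    """Token ids that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n - 1:
        return set()
    if n == 1:
        return set(tokens)  # any token already used would repeat a 1-gram
    prefix = tuple(tokens[-(n - 1):])  # the last n-1 tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        # Wherever this prefix occurred before, the token that followed it is banned.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

# "I like strawberries. I like" -> with n=3, "strawberries" may not follow "I like" again.
print(banned_next_tokens([10, 11, 12, 13, 10, 11], n=3))
```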
