## Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we will fine-tune a large language model using the `peft` library, with `bitsandbytes` to load the model in 8-bit.
The fine-tuning relies on Low-Rank Adaptation (LoRA): instead of updating the entire model, you train only a small set of adapter weights that are added to the frozen model.
After fine-tuning you can also share your adapters on the 🤗 Hub and load them back very easily. Let's get started!
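
Before installing anything, here is a minimal NumPy sketch of the low-rank idea behind LoRA. It is only an illustration under toy assumptions: the sizes, the initialisation scale, and the helper name `lora_forward` are made up for this example and are not taken from `peft`. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only, not those of a real model
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init so the update starts at 0


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ (W + (alpha / r) * B @ A).T without building the full update matrix."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / r) * update


x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, alpha, r)

full_params = W.size           # parameters a full fine-tune would update
lora_params = A.size + B.size  # parameters the adapter updates
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the low-rank correction.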

### Install requirements

First, run the cell below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

### Model loading

Let's load the `opt-6.7b` model. Its weights in half precision (float16) take about 13GB on the Hub; loaded in 8-bit, they require around 7GB of memory instead.
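
The figures above follow from simple arithmetic on the parameter count. The sketch below assumes a nominal 6.7 billion parameters (taken from the model name) and counts only the weights, ignoring activations, optimizer state, and any quantization overhead; the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: nominal size implied by the name "opt-6.7b"
N_PARAMS = 6_700_000_000

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Halving the number of bits per weight halves the footprint, which is why the 8-bit model needs roughly half the memory of the float16 checkpoint.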

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

### Post-processing on the model

Next, we need to apply some post-processing to the 8-bit model to enable training. We freeze all layers and cast the small 1-D parameters (such as layer norm) to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

### Apply LoRA

Here comes the magic of `peft`! Let's wrap the model in a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) via the `get_peft_model` utility function from `peft`.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 8388608 || all params: 6666862592 || trainable%: 0.12582542214183376
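
The trainable parameter count in the output above can be reproduced by hand. The sketch below assumes that `opt-6.7b` has 32 decoder layers with a hidden size of 4096, and that `q_proj` and `v_proj` are square projections of that size; these architecture figures are assumptions not stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture of opt-6.7b (not stated in this notebook)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

R = 16                 # `r` from the LoraConfig above
TARGETS_PER_LAYER = 2  # "q_proj" and "v_proj"

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
all_params = 6_666_862_592  # total reported by print_trainable_parameters

print(f"trainable params: {trainable} || trainable%: {100 * trainable / all_params}")
```

Under these assumptions the result matches the 8388608 trainable parameters reported above, about 0.126% of the model.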


### Training

In [None]:
import transformers
import pandas as pd

df = pd.read_csv("/home/superheroes_nlp_dataset.csv")
# Drop rows without text. `dropna` must be called; assigning the bare method
# would overwrite the column with the method object itself.
df = df.dropna(subset=['history_text']).reset_index(drop=True)
# Tokenize each text into a dict-like object with input_ids and attention_mask
train_dataset = [tokenizer(text) for text in df['history_text'].astype(str)]

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 1 | 2.578 |
| 2 | 2.5754 |
| 3 | 2.5787 |
| 4 | 2.5668 |
| 5 | 2.5689 |
| 6 | 2.5554 |
| 7 | 2.5461 |
| 8 | 2.5362 |
| 9 | 2.5191 |
| 10 | 2.5134 |


TrainOutput(global_step=200, training_loss=0.43972798192175105, metrics={'train_runtime': 5152.2551, 'train_samples_per_second': 0.621, 'train_steps_per_second': 0.039, 'total_flos': 3.390315189829632e+16, 'train_loss': 0.43972798192175105, 'epoch': 2.2})
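
The throughput numbers in `TrainOutput` can be cross-checked against the training arguments. The sketch below assumes a single GPU (consistent with `CUDA_VISIBLE_DEVICES="0"` earlier in the notebook) and simply multiplies out the batch settings; the variable names are chosen for this example.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200
num_devices = 1  # assumption: training ran on a single GPU

# Samples contributing to each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Samples processed over the whole run
samples_seen = effective_batch_size * max_steps

train_runtime = 5152.2551  # seconds, from the TrainOutput above
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen: {samples_seen}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

With these settings each optimizer step covers 16 samples, and the computed rates agree with the reported `train_samples_per_second` and `train_steps_per_second`.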

### Share adapters on the 🤗 Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
model.push_to_hub("ybelkada/opt-6.7b-lora", use_auth_token=True)

### Load adapters from the Hub

You can also directly load adapters from the Hub using the commands below:

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "ybelkada/opt-6.7b-lora"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)

### Inference

You can then use the trained model, or the model you loaded from the 🤗 Hub, for inference as you normally would with `transformers`.

In [None]:
batch = tokenizer("He began life as a", return_tensors='pt').to(model.device)
# Greedy decoding; `max_new_tokens` alone bounds the length of the continuation
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=200)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Keep only the text before the first '<' to drop any leftover boundary markers
clean_text = generated_text.split('<')[0].strip()
print('\n\n', clean_text)




 He began life as a slave on the Indian sub-continent, and was later turned into one of the many prisoners of war that were held at the Indian
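
The inference cells clean up the decoded text by cutting it at the first `<` character. A small standalone sketch of that step is shown below; the helper name `truncate_at_marker` and the sample string are hypothetical and only illustrate the behaviour.

```python
def truncate_at_marker(text: str, marker: str = "<") -> str:
    """Return the text before the first occurrence of `marker`, stripped of whitespace."""
    return text.split(marker)[0].strip()


# Hypothetical decoded string containing a leftover boundary token
sample = "  He began life as a slave </s> unrelated trailing text"
print(truncate_at_marker(sample))

# Text without the marker is returned unchanged apart from stripping
print(truncate_at_marker("No marker here "))
```

Note that this also truncates legitimate text containing `<`, so it is a blunt clean-up suited to short demos rather than a general post-processing step.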


In [None]:
batch = tokenizer("Before he turned to", return_tensors='pt').to(model.device)
# Greedy decoding; `max_new_tokens` alone bounds the length of the continuation
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=200)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Keep only the text before the first '<' to drop any leftover boundary markers
clean_text = generated_text.split('<')[0].strip()
print('\n\n', clean_text)




 Before he turned to crime, Richard "Rick" Jones was one of the many prisoners of Indian Hil...


In [None]:
batch = tokenizer("He is a sidekick of", return_tensors='pt').to(model.device)
# Greedy decoding; `max_new_tokens` alone bounds the length of the continuation
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=500)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Keep only the text before the first '<' to drop any leftover boundary markers
clean_text = generated_text.split('<')[0].strip()
print('\n\n', clean_text)




 He is a sidekick of the late Dr. Albert Wily, and was created by the late Dr. Richard "Rick" Jones. He is one of the more passive members of the...


In [None]:
batch = tokenizer("As the top agent of ", return_tensors='pt').to(model.device)
# Greedy decoding; `max_new_tokens` alone bounds the length of the continuation
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=500)

generated_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Keep only the text before the first '<' to drop any leftover boundary markers
clean_text = generated_text.split('<')[0].strip()
print('\n\n', clean_text)




 As the top agent of ~~


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Save the model
torch.save(model.state_dict(), '/content/drive/My Drive/model.pt')

Mounted at /content/drive


As you can see, after fine-tuning for only a few steps the model continues most prompts in the style of a character biography, although some prompts (such as the last one) still produce no useful continuation.