## PEFT Finetuning Quick Start Notebook

This notebook shows how to train a Meta Llama 3 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA finetuning.

**_Note:_** To run this notebook on a machine with less than 24GB VRAM (e.g. T4 with 16GB), the context length of the training dataset needs to be reduced.
The notebook picks the value based on the VRAM available during execution.
If you still run into OOM issues, try lowering `train_config.context_length` further.
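
The VRAM-based choice described in the note can be sketched as a small helper. `pick_context_length` is a hypothetical name introduced only for illustration; the 16e9-byte threshold and the 1024/2048 values mirror the `train_config.context_length` line in Step 1. The memory size is passed in as an argument here so the snippet runs without a GPU, and the sample memory figures are illustrative.

```python
def pick_context_length(total_memory_bytes: float) -> int:
    """Return a context length for the given amount of GPU memory.

    Hypothetical helper: cards reporting less than 16e9 bytes (e.g. a T4)
    get 1024 tokens, larger cards (e.g. an A10 with 24GB) get 2048.
    """
    return 1024 if total_memory_bytes < 16e9 else 2048


# On a real machine the argument would come from
# torch.cuda.get_device_properties(0).total_memory.
for name, mem in [("T4-like card", 15.8e9), ("A10-like card", 24e9)]:
    print(name, pick_context_length(mem))
```

If training still runs out of memory, the returned value is only a starting point and can be halved again.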

### Step 0: Install pre-requirements and convert checkpoint

We need to have llama-recipes and its dependencies installed for this notebook. Additionally, we need to log in to the Hugging Face Hub (e.g. with `huggingface-cli login`) and make sure that the account is able to access the Meta Llama weights.

In [1]:
# uncomment if running from Colab T4
# ! pip install llama-recipes ipywidgets

# import huggingface_hub
# huggingface_hub.login()

### Step 0: Run Llama Locally

The `pipeline` API is the highest level of abstraction: it bundles the tokenizer and the model.

In [6]:
from transformers import pipeline
import torch

In [32]:
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device="cuda:5",
    max_new_tokens=512
    )

In [33]:
prompt = [
    {"role": "system", "content": "you are a helpful AI assistant, answer like a pirate"},
    {"role": "user", "content":"What are logarithms"}
]
# pipe(prompt[0]["generated_text"][-1])
print(pipe(prompt)[0]["generated_text"][-1]["content"])

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Ye be wantin' to know about logarithms, eh? Alright then, settle yerself down with a pint o' grog and listen close, for I be tellin' ye the tale o' logarithms.

Logarithms be a way o' dealin' with big numbers, matey. Ye see, when ye have a big number, like a treasure chest overflowin' with gold doubloons, ye need a way to figure out how many smaller chests ye need to fill to reach that amount. That be where logarithms come in.

A logarithm be the power to which a base number be raised to get a certain value. For example, if ye be wantin' to find the power to which 10 be raised to get 100, ye be needin' a logarithm. And that be 2, matey! 10 squared be 100.

So, ye see, logarithms be like a treasure map, helpin' ye navigate the high seas o' big numbers. They be used in all sorts o' things, like calculators, computers, and even navigation.

Now, I know what ye be thinkin', "What be the different types o' logarithms, Captain?" Well, matey, there be three main types:

1. Natural logarithm (

Another way of loading the model is through the Auto classes (here `AutoModelForCausalLM`, for causal language modelling).

This time the tokenizer must be created separately.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:5")

In [3]:
print(type(model))

<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>


This is still a Hugging Face abstraction on top of the underlying torch model.

In [17]:
prompt = [
    {"role": "system", "content": "you are a helpful AI assistant, explain like you are talking to a 5 year old, answer shortly do not over complicate"},
    {"role": "user", "content":"What are logarithms?"}
]

prompt = tokenizer.apply_chat_template(prompt, tokenize=False)
print(prompt)

model_input = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").to("cuda:5") # return_tensors="pt" returns PyTorch tensors
# https://huggingface.co/docs/transformers/en/chat_templating#can-i-use-chat-templates-in-training
# add_special_tokens must be turned off because apply_chat_template already inserted the special tokens;
# this is not needed if both steps are done in one pass, i.e. apply_chat_template(..., tokenize=True)

print(model_input)

model.eval()
with torch.inference_mode():
    generation = model.generate(**model_input, max_new_tokens=1000)
    print(tokenizer.decode(generation[0], skip_special_tokens=False))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Oct 2024

you are a helpful AI assistant, explain like you are talking to a 5 year old, answer shortly do not over complicate<|eot_id|><|start_header_id|>user<|end_header_id|>

What are logarithms?<|eot_id|>
{'input_ids': tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   2437,   5020,    220,   2366,     19,    271,   9514,    527,
            264,  11190,  15592,  18328,     11,  10552,   1093,    499,    527,
           7556,    311,    264,    220,     20,   1060,   2362,     11,   4320,
          20193,    656,    539,    927,   1391,  49895, 128009, 128006,    882,
         128007,    271,   3923,    527,  91036,   1026,     30, 128009]],
       device='cuda:5'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
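
To make the templated prompt printed above easier to read, here is a hand-rolled approximation of the layout it shows. `format_llama3_chat` is a hypothetical helper, not the tokenizer's real template: it leaves out the "Cutting Knowledge Date" and "Today Date" lines that the real template injected in the output above. It also shows what opening an assistant turn looks like, which is what the `add_generation_prompt` argument of `apply_chat_template` is meant for.

```python
def format_llama3_chat(messages, add_generation_prompt=False):
    """Hand-rolled approximation of the Llama 3 chat layout (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of starting a new turn itself.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "answer shortly"},
    {"role": "user", "content": "What are logarithms?"},
]
prompt_text = format_llama3_chat(messages, add_generation_prompt=True)
print(prompt_text)
```

In practice the tokenizer's own template should be preferred, since it is the format the model was trained on.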

### Step 1: Load the model for fine tuning

We start by loading a pre-trained base model for fine-tuning. Set up the training configuration and load the model and tokenizer.

In [19]:
from transformers import LlamaForCausalLM, AutoTokenizer
from llama_recipes.configs import train_config as TRAIN_CONFIG

train_config = TRAIN_CONFIG()
train_config.model_name = "meta-llama/Meta-Llama-3.1-8B"
train_config.num_epochs = 1
train_config.run_validation = False
train_config.gradient_accumulation_steps = 4
train_config.batch_size_training = 1
train_config.lr = 3e-4
train_config.use_fast_kernels = True
train_config.use_fp16 = True
train_config.context_length = 1024 if torch.cuda.get_device_properties(0).total_memory < 16e9 else 2048 # T4 16GB or A10 24GB
train_config.batching_strategy = "packing"
train_config.output_dir = "meta-llama-samsum"
train_config.use_peft = True

# os.environ["CUDA_VISIBLE_DEVICES"] = "7"

from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = LlamaForCausalLM.from_pretrained(
            train_config.model_name,
            device_map="cuda:5",
            quantization_config=config,
            use_cache=False,
            attn_implementation="sdpa" if train_config.use_fast_kernels else None,
            torch_dtype=torch.float16,
        )

tokenizer = AutoTokenizer.from_pretrained(train_config.model_name)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [30]:
print(type(model))
print(model)

<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear8bitLt(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear8bitLt(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear8bitLt(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm

### Step 2: Check base model

Run the base model on an example input:

In [20]:
eval_prompt = """
Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to(model.device)  # keep inputs on the same device as the model

model.eval()
with torch.inference_mode():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
- A: I want to get a puppy for my son.
- B: That will make him so happy.
- A: Yeah, we've discussed it many times. I think he's ready now.
- B: That's good. Raising a dog is a

We can see that the base model only repeats the conversation.

### Step 3: Load the preprocessed dataset

We load and preprocess the samsum dataset, which consists of curated pairs of dialogs and their summaries:

In [31]:
from llama_recipes.configs.datasets import samsum_dataset
from llama_recipes.utils.dataset_utils import get_dataloader

samsum_dataset.trust_remote_code = True

train_dataloader = get_dataloader(tokenizer, samsum_dataset, train_config)
eval_dataloader = get_dataloader(tokenizer, samsum_dataset, train_config, "val")



Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Preprocessing dataset: 100%|██████████| 14732/14732 [00:04<00:00, 3330.92it/s]


Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [32]:
print(samsum_dataset)

<class 'llama_recipes.configs.datasets.samsum_dataset'>
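
Step 1 sets `train_config.batching_strategy = "packing"`. The general idea behind packing is to concatenate tokenized samples into one stream and cut it into chunks of `context_length` tokens, so no compute is spent on padding. The sketch below illustrates that idea with made-up token ids. `pack_sequences` is a hypothetical helper and is not claimed to match the llama-recipes implementation.

```python
from itertools import chain


def pack_sequences(tokenized_samples, chunk_size):
    """Concatenate token lists and cut the stream into equal-sized chunks.

    Simplified sketch: the incomplete tail chunk is dropped.
    """
    stream = list(chain.from_iterable(tokenized_samples))
    usable = len(stream) - len(stream) % chunk_size
    return [stream[i:i + chunk_size] for i in range(0, usable, chunk_size)]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]  # made-up token ids
packed = pack_sequences(samples, chunk_size=4)
print(packed)
```

With packing, the number of training batches depends on the total token count and the context length rather than on the number of dialogs.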


### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):

In [33]:
from peft import get_peft_model, prepare_model_for_kbit_training, LoraConfig
from dataclasses import asdict
from llama_recipes.configs import lora_config as LORA_CONFIG

lora_config = LORA_CONFIG()
lora_config.r = 8
lora_config.lora_alpha = 32
lora_config.lora_dropout = 0.01  # set the attribute on the config (a bare annotated name would be ignored)

peft_config = LoraConfig(**asdict(lora_config))

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
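
To get a feel for how small the adapter is, the number of trainable LoRA parameters can be estimated by hand. The sketch below assumes that only `q_proj` and `v_proj` are adapted (the target modules are not shown in this notebook, so this is an assumption) and reuses the layer shapes from the model printout in Step 1. On a real run, `model.print_trainable_parameters()` reports the actual figure.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)


R, ALPHA, NUM_LAYERS = 8, 32, 32  # r and lora_alpha as set above; 32 decoder layers
# Assumption: only q_proj and v_proj are adapted.
TARGETS = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

trainable = NUM_LAYERS * sum(lora_param_count(i, o, R) for i, o in TARGETS.values())
scaling = ALPHA / R  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumptions the adapter is a tiny fraction of the roughly 8 billion base weights, which is why only the adapter needs gradients and optimizer state.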

### Step 5: Fine tune the model

Here, we fine tune the model for a single epoch.

In [34]:
import torch.optim as optim
from llama_recipes.utils.train_utils import train
from torch.optim.lr_scheduler import StepLR

model.train()

optimizer = optim.AdamW(
            model.parameters(),
            lr=train_config.lr,
            weight_decay=train_config.weight_decay,
        )
scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)

# Start the training process
results = train(
    model,
    train_dataloader,
    eval_dataloader,
    tokenizer,
    optimizer,
    scheduler,
    train_config.gradient_accumulation_steps,
    train_config,
    None,
    None,
    None,
    wandb_run=None,
)



Starting epoch 0/1
train_config.max_train_step: 0


Training Epoch: 1:   0%|          | 0/319 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Training Epoch: 1/1, step 303/1279 completed (loss: 0.37481799721717834):  24%|██▍       | 76/319 [23:30<1:16:24, 18.87s/it]

KeyboardInterrupt: 
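
Step 1 sets `batch_size_training = 1` and `gradient_accumulation_steps = 4`, so gradients from four micro-batches are summed before each optimizer update. The toy sketch below (a one-parameter model with plain SGD, not the llama-recipes training loop) shows why this matches a single update on a batch of four. All function names and data here are made up for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def step_full_batch(w, xs, ys, lr):
    """One SGD update using the whole batch at once."""
    return w - lr * grad_mse(w, xs, ys)


def step_accumulated(w, xs, ys, lr, accum_steps):
    """One SGD update built from micro-batches of size 1."""
    accumulated = 0.0
    for x, y in zip(xs, ys):
        # Scale each micro-batch gradient so the sum equals the batch mean.
        accumulated += grad_mse(w, [x], [y]) / accum_steps
    return w - lr * accumulated


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # toy data for y = 2x
w_full = step_full_batch(0.0, xs, ys, lr=0.01)
w_accum = step_accumulated(0.0, xs, ys, lr=0.01, accum_steps=4)
print(w_full, w_accum)
```

The trade-off is time for memory: each update takes four forward and backward passes, but only one sample's activations are held at once.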

### Step 6: Save the model checkpoint

In [7]:
model.save_pretrained(train_config.output_dir)

### Step 7: Check the fine-tuned model

Try the fine-tuned model on the same example again to see the learning progress:

In [8]:
model.eval()
with torch.inference_mode():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Summarize this dialog:
A: Hi Tom, are you busy tomorrow’s afternoon?
B: I’m pretty sure I am. What’s up?
A: Can you go with me to the animal shelter?.
B: What do you want to do?
A: I want to get a puppy for my son.
B: That will make him so happy.
A: Yeah, we’ve discussed it many times. I think he’s ready now.
B: That’s good. Raising a dog is a tough issue. Like having a baby ;-) 
A: I'll get him one of those little dogs.
B: One that won't grow up too big;-)
A: And eat too much;-))
B: Do you know which one he would like?
A: Oh, yes, I took him there last Monday. He showed me one that he really liked.
B: I bet you had to drag him away.
A: He wanted to take it home right away ;-).
B: I wonder what he'll name it.
A: He said he’d name it after his dead hamster – Lemmy  - he's  a great Motorhead fan :-)))
---
Summary:
A wants to get a puppy for his son. A took him to the animal shelter last Monday and he showed A one he really liked. A wants to get him one of those little dogs. A and B agre