<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai-days/blob/main/fine_tuning_llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter-Efficient Fine-Tuning Llama 3.1



## Fine-Tuning in Google Colab

This fine-tuning code can be run on a Google Colab instance with a GPU runtime. To switch your runtime to GPU, select "Runtime" -> "Change Runtime Type" -> GPU.

### Load Libraries

We will first need to install some packages:

In [1]:
!pip install accelerate peft bitsandbytes transformers trl

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting trl
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)

### Model Setup and Training Objective

For this example, we will be fine-tuning Llama on the English Quotes Dataset. We want our model to output a quote and its author given the start of a quote.

First, we can access Llama-3.1-8B (or any Llama model) through Hugging Face. We will use a non-gated copy of Llama 3.1 8B provided by Nous Research. The gated "official" version of this model is available at "meta-llama/Llama-3.1-8B". In practice, the two repositories contain the same model weights.

**NOTE**: The cells below need a GPU connection, which you can get by selecting "Runtime" -> "Change Runtime Type" -> GPU.

If you are not connected to a GPU, the error will say something along the lines of "you must have accelerate and bitsandbytes installed."

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "NousResearch/Meta-Llama-3.1-8B"

# Load the weights in 4-bit NF4 and run computations in float16
compute_dtype = torch.float16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map={"": 0} places the whole model on GPU 0
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config, device_map={"": 0})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
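
To see why `load_in_4bit=True` matters on a Colab GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the parameter count is an assumed round figure of 8 billion, `weight_memory_gb` is a hypothetical helper, and the estimate ignores quantization metadata, any layers left unquantized, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per weight

print(f"float16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:   ~{nf4_gb:.0f} GB")
```

Under these assumptions, 4-bit loading cuts the weight footprint to about a quarter of the float16 size, which leaves room on the GPU for the LoRA adapters and training state.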



Let's see how Llama does on this task without any fine-tuning. We will give it the start of a quote.

In [3]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quote: Imagination is the highest kite one can fly.
Quote: You can’t use up creativity. The more you use, the more you have.
Quote: The more you are in a state of gratitude, the more goodness you will have access to, and the more


As we can see above, the model finishes the quote but then keeps generating additional quotes. What we want is the quote followed by its author.
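
To make that target concrete, a small helper can check whether generated text contains a quote followed by an author. This is a minimal sketch that assumes the `Quote: ...` / `Author: ...` layout used for the training text later in this notebook; `parse_quote` is a hypothetical helper and "Some Author" is a placeholder, not real model output.

```python
import re


def parse_quote(generated: str):
    """Return (quote, author) from the first 'Quote:' line that is directly
    followed by an 'Author:' line, or None if no such pair exists."""
    match = re.search(r"Quote:\s*(.+?)\s*\nAuthor:\s*(.+)", generated)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


# Shape of the base model's output above: one quote followed by another quote
base_style = "Quote: Imagination is the highest kite one can fly.\nQuote: Another quote."
# Shape we are aiming for (placeholder author, for illustration only)
target_style = "Quote: Imagination is the highest kite one can fly.\nAuthor: Some Author"

print(parse_quote(base_style))
print(parse_quote(target_style))
```

The first call finds no quote/author pair, while the second returns the quote and the placeholder author as a tuple.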

### Data and Training Functions

Next, we will set up our training configuration for [LoRA](https://www.run.ai/guides/generative-ai/lora-fine-tuning), a parameter-efficient training method.

In [5]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [6]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
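
To get a feel for how small this LoRA configuration is, we can count the trainable parameters it adds. LoRA attaches two low-rank matrices to each targeted linear layer, which adds `r * (in_features + out_features)` parameters per layer. The sketch below assumes the layer shapes from the public Llama 3.1 8B config (hidden size 4096, MLP intermediate size 14336, key/value projection width 1024, 32 decoder layers); these numbers and the helper name `lora_params` are assumptions for illustration and are not read from the loaded model.

```python
# Assumed Llama 3.1 8B shapes (not read from the loaded model)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32
RANK = 8  # matches r=8 in lora_config

# (in_features, out_features) for each module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total:          {total:,}")
print(f"Share of ~8e9 base parameters:     {total / 8e9:.2%}")
```

Under these assumptions, the adapters add roughly 21 million trainable parameters, a fraction of a percent of the base model.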

We can load our dataset through the Hugging Face `datasets` library. The train split contains 2,508 quotes.

In [7]:
from datasets import load_dataset

# Load the English Quotes dataset and tokenize the quote text
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)


Now we can define our Supervised Fine-Tuning (SFT) trainer below and start the training!

In [8]:
import transformers
from trl import SFTTrainer

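def formatting_func(example):
    # Format as "Quote: ...\nAuthor: ..."; handles a single example or a batch of examples
    if isinstance(example["quote"], str):
        return f"Quote: {example['quote']}\nAuthor: {example['author']}"
    return [f"Quote: {q}\nAuthor: {a}" for q, a in zip(example["quote"], example["author"])]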

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        report_to="none",  # turn off experiment-tracking integrations such as W&B
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



| Step | Training Loss |
|------|---------------|
| 1 | 2.4238 |
| 2 | 1.1494 |
| 3 | 1.7305 |
| 4 | 1.0357 |
| 5 | 1.2253 |
| 6 | 0.8187 |
| 7 | 1.9164 |
| 8 | 0.8341 |
| 9 | 2.0395 |
| 10 | 1.4176 |


TrainOutput(global_step=10, training_loss=1.4590887188911439, metrics={'train_runtime': 15.1058, 'train_samples_per_second': 2.648, 'train_steps_per_second': 0.662, 'total_flos': 62088643584000.0, 'train_loss': 1.4590887188911439})
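
It helps to put this run in perspective. With the settings above, each optimizer step accumulates gradients over 4 examples, so 10 steps cover only a small part of the 2,508-example train split. The arithmetic below is a quick sketch; it assumes a single GPU, and the variable names simply mirror the training arguments.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_devices = 1            # assumption: one Colab GPU
num_train_examples = 2508  # size of the train split

# Examples that contribute to one optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples seen over the whole run
examples_seen = effective_batch * max_steps
fraction = examples_seen / num_train_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen:        {examples_seen}")
print(f"Share of train split: {fraction:.1%}")
```

The 40 examples match the reported throughput (about 2.648 samples per second over about 15.1 seconds). For a real project, you would likely raise `max_steps` or train for one or more full epochs.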

### Evaluation

Let's see how our model does after fine-tuning. We will run the same example we did at the beginning. Remember that we want our model to give us the rest of the quote and its author.

In [9]:
text = "Quote: Imagination is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Quote: Imagination is the only weapon in the war against reality. - Louis Armstrong
Author Topic: The 2020


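The model now completes the quote and attributes it to an author, which is much closer to what we want, although the output does not follow the `Author:` format exactly and some stray text still follows. We can now save our model as a .pt file to access later.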

In [10]:
#torch.save(model.state_dict(), "llama-3.1-8B-peft-quotes.pt")

## Conclusion

You have just successfully fine-tuned Llama 3.1 on our English Quotes dataset. Feel free to adapt this process for SFT on your own project.