# Llama3 Fine Tuning With Alpaca Dataset

In this notebook, you will fine-tune the Llama 3 (8B parameters) model using a customized variation of the Alpaca dataset. The Alpaca dataset, originally designed to teach **instruction-following** capabilities, will be modified to suit our specific objectives, enabling the model to generate more relevant and accurate outputs for the chosen task.

This notebook provides a step-by-step guide to preprocessing the dataset, configuring the fine-tuning pipeline, and evaluating the model's performance post-training.


![Llama3](https://hub-apac-1.lobeobjects.space/blog/assets/98885a84481f8c3a76635b750bbff33c.webp)

[Llama3 Fine Tuning With Alpaca Dataset](#scrollTo=HOS9lneSmTN9)

>[What is Llama 3?](#scrollTo=VxGQFTt_nS86)

>[Supervised Fine-Tuning for LLM Training](#scrollTo=9cQrA-G61O8j)

>>[1) Full Fine-Tuning:](#scrollTo=QrB_EM1OEli3)

>>[2) Low-Rank Adaptation (LoRA):](#scrollTo=324hEme2E8wb)

>>[3) Quantized Low-Rank Adaptation (QLoRA):](#scrollTo=t_roZH_0d55l)

[1) Import Libraries and Set up GPU](#scrollTo=VjQWl_BZoOZS)

[2) Initialize The Llama 3 Model and Load The Tokenizer](#scrollTo=SJyy8v1vkvlz)

[3) Exploratory Data Analysis and Preprocessing](#scrollTo=tkg_J5qSpBB4)

>>[Alpaca Dataset](#scrollTo=rO1xOSAzqhx4)

>>[Chat Template](#scrollTo=u73Vivw8JSvc)

[4) Training Model](#scrollTo=U2Mb5opfkRuv)

[5) Inference From Fine-Tuned Model](#scrollTo=RzswT6foyB0F)


## What is Llama 3?

Llama 3 is a generation of the Llama language model family released by Meta. It was published in 8B and 70B parameter sizes, each as both pre-trained and instruction-tuned variants. Both the 8B and 70B models use Grouped Query Attention (GQA), in which groups of query heads share a single key/value head.
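To make the GQA idea concrete, here is a minimal sketch of how query heads can be mapped onto shared key/value heads. The head counts (32 query heads, 8 key/value heads) are assumed from the published Llama 3 8B configuration, and the function name is purely illustrative; it is not part of any library.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per shared K/V head
    return q_head // group_size


# Assumed values, taken from the published Llama 3 8B configuration.
NUM_Q_HEADS = 32
NUM_KV_HEADS = 8

mapping = [kv_head_for_query_head(h, NUM_Q_HEADS, NUM_KV_HEADS) for h in range(NUM_Q_HEADS)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]

# With fewer K/V heads, the key/value cache shrinks by this factor
# compared with standard multi-head attention.
kv_cache_reduction = NUM_Q_HEADS // NUM_KV_HEADS
print(kv_cache_reduction)  # 4
```

Under these assumptions, every four consecutive query heads share one key/value head, which is where the memory saving during generation comes from.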



Meta continued the development of the Llama language model family and released Llama 3.1 and Llama 3.2. Llama 3.1 is available in 8B, 70B and 405B parameter sizes and can handle longer texts thanks to a context window of up to 128,000 tokens. Llama 3.2, on the other hand, includes lightweight text-only models with 1B and 3B parameters, as well as vision models with 11B and 90B parameters. This version stands out for its ability to process text and visual data together.

For more detailed information: https://ai.meta.com/blog/meta-llama-3/

We will use the Llama 3 8B version in this notebook, but other versions can also be used.


## Supervised Fine-Tuning for LLM Training

Supervised Fine-Tuning (SFT) is a method to enhance and customize pre-trained large language models (LLMs). It involves **retraining base models using a smaller dataset of instructions and responses**, transforming a basic text prediction model into one that can follow specific instructions and answer questions effectively.


![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c51db77-8d97-45a9-bd2c-d71e930ff0b8_2292x1234.png)


#### **1) Full Fine-Tuning**:

Full fine-tuning is a method where **all the parameters of a pre-trained model are re-trained** on a new dataset. In full fine-tuning, the entire model adapts to the new task by adjusting every parameter, allowing the model to learn the specifics of the new data comprehensively.
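A rough back-of-the-envelope calculation shows why updating every parameter is expensive. The sketch below assumes bf16 weights and gradients (2 bytes each) and an Adam-style optimizer that keeps two fp32 states per parameter (8 bytes); these byte counts and the helper name are illustrative assumptions, and activations are ignored entirely.

```python
def full_finetune_memory_gb(
    num_params: float,
    weight_bytes: int = 2,     # assumed: bf16 weights
    grad_bytes: int = 2,       # assumed: bf16 gradients
    optimizer_bytes: int = 8,  # assumed: two fp32 Adam states per parameter
) -> float:
    """Estimate memory for weights, gradients and optimizer states, in GB."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


# Roughly 8 billion parameters, as in the 8B model used in this notebook.
estimate = full_finetune_memory_gb(8e9)
print(f"{estimate:.0f} GB before activations")  # 96 GB before activations
```

Even under these simplified assumptions the estimate far exceeds the memory of a single consumer GPU, which motivates the parameter-efficient methods described next.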


#### **2) Low-Rank Adaptation (LoRA)**:

Low-Rank Adaptation (LoRA) is a **parameter-efficient fine-tuning technique** used in deep learning **to adapt large pre-trained models to new tasks without retraining all the model’s parameters**. LoRA works by injecting additional, **low-rank weight matrices** into specific layers of the model, typically in attention layers or fully connected layers, while keeping the original model parameters frozen. This approach significantly reduces the number of trainable parameters and memory requirements, making it faster and more resource-efficient compared to full fine-tuning.


![Lora vs Full-ft](https://weeklyreport.ai/_astro/Diagram-2.J9V7jjP8_Z1vTWE9.webp)

![Lora vs Full-ft](https://miro.medium.com/v2/resize:fit:2000/format:webp/0*D74YMwWTzyEfaRdj.png)
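The low-rank update can be sketched in a few lines of plain Python. In this illustration the frozen weight `W` is left untouched, and the layer output becomes `W x + (alpha / r) * B (A x)`, where `A` and `B` are the small trainable matrices. The helper names, the tiny matrices, and the 4096-wide layer with rank 16 are all assumptions chosen for illustration; in practice the `peft` library handles this internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, r):
    """Return (parameters in the full matrix, parameters in the LoRA pair)."""
    return d_in * d_out, r * (d_in + d_out)


# Tiny example: d_in=3, d_out=2, r=1. B starts at zero, so the output
# initially equals the frozen layer's output.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.1, 0.2, 0.3]]  # shape (r, d_in)
B = [[0.0], [0.0]]     # shape (d_out, r)
x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x, alpha=2, r=1))  # [7.0, 2.0]

# Hypothetical 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, lora / full)  # 16777216 131072 0.0078125
```

For this hypothetical layer, the LoRA pair holds under 1% of the parameters of the full matrix, which is the source of the savings described above.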

#### **3) Quantized Low-Rank Adaptation (QLoRA)**:

Quantized Low-Rank Adaptation (QLoRA) is a technique that **combines quantization and low-rank adaptation** to make the fine-tuning of large language models (LLMs) more efficient, primarily in terms of memory.





*   QLoRA first quantizes the pre-trained base model, typically to **4-bit precision**. Lowering the precision of the model's weights significantly reduces memory usage. Because only the base model parameters are quantized, and they are kept frozen during adaptation, QLoRA minimizes memory usage while preserving most of the model’s expressive capacity.
*   On top of the quantized model, QLoRA applies **low-rank adaptation layers to specific parts of the model**. LoRA inserts small, low-rank matrices into the model that can be fine-tuned to learn task-specific information without modifying the quantized base model parameters. This allows QLoRA to retain the original model's capabilities while adapting to new data efficiently.

*   Unlike quantization-aware training, QLoRA never updates the quantized weights themselves. During training they are dequantized on the fly to a higher-precision compute dtype, and gradients flow through them into the low-rank layers, which therefore learn to work effectively on top of the lower-precision base model.
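The quantization step can be illustrated with a simplified sketch. Note the assumption: this uses plain symmetric absmax rounding to signed 4-bit integers per block, which is not the NF4 data type that `bitsandbytes` provides; it is only meant to show how a block of weights is stored as small integer codes plus one scale, and how it is dequantized for computation. All function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float weights for use in the forward pass."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params, bits):
    """Memory needed to store the weights alone, ignoring per-block scales."""
    return num_params * bits / 8 / 1e9


block = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, restored, max_error)

# Storage for roughly 8 billion weights at 16 bits versus 4 bits.
print(weight_memory_gb(8e9, 16), weight_memory_gb(8e9, 4))  # 16.0 4.0
```

The rounding error per weight is bounded by half the block's scale, and under these assumptions the weight storage drops by a factor of four relative to 16-bit precision.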






![](https://cdn.prod.website-files.com/640f56f76d313bbe39631bfd/65358e7d97cd72e210afa0bd_lora-qlora.png)


-----





## 1) Import Libraries and Set up GPU

In [None]:
! pip install datasets -q
! pip install git+https://github.com/huggingface/transformers -q
! pip install trl peft accelerate bitsandbytes session-info -q
! pip install -U wandb -q

In [None]:
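import torch  # Needed here because this cell runs before the main import cell.

# GPUs with compute capability >= 8 (Ampere or newer) support FlashAttention 2 and bfloat16.
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install -qqq flash-attn
    attn_implementation = "flash_attention_2"
    torch_dtype = torch.bfloat16
else:
    attn_implementation = "eager"
    torch_dtype = torch.float16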

In [None]:
import os
import torch

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, TextStreamer
from peft import LoraConfig, get_peft_model
from tqdm import tqdm
import wandb
import time

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Check if CUDA is available and set the GPU device.

!nvidia-smi

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

## 2) Initialize The Llama 3 Model and Load The Tokenizer

Load the Llama-3-8B model in quantized form using the quantization config.

https://huggingface.co/meta-llama/Meta-Llama-3-8B

In [None]:
# Initialize the Llama3-8B model.

# YOUR CODE STARTS HERE

model_id = None

# YOUR CODE ENDS HERE

In [None]:
# Initialize the quantization configs. https://huggingface.co/docs/transformers/main/quantization/bitsandbytes

# YOUR CODE STARTS HERE

bnb_config = BitsAndBytesConfig(
    load_in_4bit=None,
    bnb_4bit_use_double_quant=None,
    bnb_4bit_quant_type=None,
    bnb_4bit_compute_dtype=None
)

# YOUR CODE ENDS HERE

In [None]:
# Load the base model with Hugging Face's AutoModelForCausalLM class. https://huggingface.co/transformers/v3.5.1/model_doc/auto.html#automodelforcausallm

# YOUR CODE STARTS HERE

base_model = AutoModelForCausalLM.from_pretrained(
    None,
    quantization_config=None,
    use_cache=None,
    token = None,
    attn_implementation=None
)


base_model.to(device) # Move the base model to the GPU. Check the base model's components.

# YOUR CODE ENDS HERE

In [None]:
# Set the LoRA configs

# YOUR CODE STARTS HERE

peft_config = LoraConfig(
        lora_alpha=None,
        lora_dropout=None,
        r=None,
        bias=None,
        task_type=None,
        target_modules=None
)

# YOUR CODE ENDS HERE

In [None]:
# Wrap the base model with PEFT, then inspect the ratio of trainable parameters to the model's total number of parameters.

# YOUR CODE STARTS HERE

base_model = get_peft_model(None, None)
base_model.print_trainable_parameters()

# YOUR CODE ENDS HERE

In [None]:
# Check the new base model's components.

base_model

In [None]:
# Create the tokenizer object with the Transformers AutoTokenizer class.

# YOUR CODE STARTS HERE

tokenizer = AutoTokenizer.from_pretrained(None,
                                          token=None)

# YOUR CODE ENDS HERE

#tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Check the special tokens of tokenizer.

tokenizer.special_tokens_map

## 3) Exploratory Data Analysis and Preprocessing

### Alpaca Dataset

The Alpaca dataset was developed at Stanford University. It consists of records that pair an instruction and an optional input with an output response, enabling the model to learn how to respond when given a specific command. It contains roughly 52,000 examples, all in English.

For more details: https://github.com/tatsu-lab/stanford_alpaca#data-release

The alpaca-cleaned dataset used here is a variant of the original Alpaca dataset in which known data quality issues have been corrected.
The data fields are as follows:

*   **instruction**: describes the task the model should perform. Each of the 52K instructions is unique.
*   **input**: optional context or input for the task.
*   **output**: the answer to the instruction as generated by text-davinci-003.
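As a small illustration of these fields, the sketch below builds two hypothetical records (invented for this example, not taken from the dataset) and counts how many carry a non-empty `input`. The same kind of check can be useful during exploratory analysis, since the `input` field is optional.

```python
# Hypothetical records that follow the instruction/input/output schema.
records = [
    {
        "instruction": "Translate the sentence to French.",
        "input": "Good morning.",
        "output": "Bonjour.",
    },
    {
        "instruction": "Name three primary colors.",
        "input": "",  # the input field is optional and may be empty
        "output": "Red, blue, and yellow.",
    },
]


def summarize(rows):
    """Count records with and without an input field."""
    with_input = sum(1 for row in rows if row["input"].strip())
    return {
        "total": len(rows),
        "with_input": with_input,
        "without_input": len(rows) - with_input,
    }


summary = summarize(records)
print(summary)  # {'total': 2, 'with_input': 1, 'without_input': 1}
```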

Dataset: https://huggingface.co/datasets/yahma/alpaca-cleaned

In [None]:
# Import the alpaca-cleaned dataset with Hugging Face's load_dataset function (https://huggingface.co/docs/datasets/loading). Then examine the dataset.

# YOUR CODE STARTS HERE

dataset = load_dataset(None,
                       split = None)

# YOUR CODE ENDS HERE
dataset

In [None]:
# Create a smaller training dataset with random selections from the dataset for faster training times.

# YOUR CODE STARTS HERE

sample_ratio = None
small_dataset = dataset.shuffle(seed=None).select(range(int(len(dataset) * sample_ratio)))
small_dataset

# YOUR CODE ENDS HERE

In [None]:
# Examine examples from the dataset.

small_dataset[0]

### Chat Template

A chat template is a structure used to organize user input and model responses of a language model in a specific format. These templates ensure that the model produces accurate and consistent responses in accordance with the data format used during training of the model.

You will use the Alpaca chat template to format the data you give to the model as input during fine-tuning. https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release
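The Stanford Alpaca repository describes two prompt variants: one for records with an input and one for records without. The sketch below shows one way to choose between them; the helper name, the sample record, and the `<eos>` placeholder are assumptions for illustration (in practice the end-of-sequence string would come from the tokenizer). The exercise in this notebook uses the single with-input template, so this is an optional refinement rather than the required solution.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def build_training_text(record, eos_token):
    """Format one record and append the end-of-sequence marker."""
    template = PROMPT_WITH_INPUT if record["input"].strip() else PROMPT_NO_INPUT
    # str.format ignores the unused "input" key for the no-input template.
    return template.format(**record) + eos_token


# Hypothetical record and placeholder EOS string.
example = {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."}
text = build_training_text(example, eos_token="<eos>")
print(text)
```

Appending the end-of-sequence marker matters because it teaches the model where a response ends; without it, generation may continue past the answer.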



In [None]:
# Create the chat template correctly to format the input data. After correctly mapping the fields of the input data to the instruction, input and response sections, create a feature called "text" that combines these values appropriately.

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# YOUR CODE STARTS HERE

EOS_TOKEN = None

def formatting_prompts_func(examples):
    instructions = None
    inputs       = None
    outputs      = None
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# YOUR CODE ENDS HERE


In [None]:
# Apply chat template to all data.

# YOUR CODE STARTS HERE

dataset_prepared = small_dataset.map(None,
                                     batched = None,)

# YOUR CODE ENDS HERE

In [None]:
# Check out examples from the dataset.

dataset_prepared['text'][0]

## 4) Training Model

In [None]:
# Create the output directory to save the fine-tuned model.

# YOUR CODE STARTS HERE

output_dir = None
os.makedirs(output_dir, exist_ok=True)

# YOUR CODE ENDS HERE

In [None]:
# Create training arguments with the help of sft config. https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig

# YOUR CODE STARTS HERE

training_args = SFTConfig(
    output_dir = None,
    per_device_train_batch_size = None,
    gradient_accumulation_steps = None,
    fp16 = None,
    learning_rate = None,
    logging_steps = None,
    num_train_epochs = None,
    max_seq_length = None,
    warmup_ratio = None,
    save_strategy = None,
    save_steps = None,
    load_best_model_at_end = None,
)

# YOUR CODE ENDS HERE

In [None]:
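# Create a trainer object with SFTTrainer. https://huggingface.co/docs/trl/sft_trainer#trl.SFTTrainer
# Note: in recent TRL releases, dataset_text_field and max_seq_length are set on SFTConfig and the tokenizer is passed as processing_class; adjust the arguments below to match your installed version.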

# YOUR CODE STARTS HERE

trainer = SFTTrainer(
    model=None,
    train_dataset=None,
    dataset_text_field = None,
    peft_config=None,
    max_seq_length=None,
    tokenizer=None,
    args=None,
)

# YOUR CODE ENDS HERE

In [None]:
# Start training.

trainer.train()

In [None]:
wandb.finish()

In [None]:
# Save the fine tuned model.

# YOUR CODE STARTS HERE

output_dir = os.path.join(None, "final_checkpoint")
trainer.save_model(None)

# YOUR CODE ENDS HERE

## 5) Inference From Fine-Tuned Model

In [None]:
# Load the fine-tuned model and tokenizer. Move the fine-tuned model to the GPU.

# YOUR CODE STARTS HERE

ft_model = AutoModelForCausalLM.from_pretrained(None)
ft_model.to(device)
tokenizer = AutoTokenizer.from_pretrained(None)

# YOUR CODE ENDS HERE

In [None]:
# Prepare the input data for the fine-tuned model with the tokenizer.

# YOUR CODE STARTS HERE

inputs = tokenizer(
    [
        alpaca_prompt.format(
            None,  # instruction
            None,  # input
            ""  # output
        )
    ],
    return_tensors="pt"
)

# YOUR CODE ENDS HERE

inputs = inputs.to(ft_model.device) # Move the inputs to the same device as the model.

In [None]:
# With TextStreamer you can see the output in real time.

text_streamer = TextStreamer(tokenizer)

_ = ft_model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

In [None]:
import session_info
session_info.show()