# Fine-Tuning with Llama 3
# By: Vatsal Vinay Parikh

## Filtering Datasets for Evaluation  

You are building a training and evaluation pipeline for your company's healthcare chatbot, which is used by hospitals to onboard new patients.  

Your task is to create a pipeline to load the **MedQuad-MedicalQnADataset** to evaluate an LLM on its ability to answer medical questions. You need to load the dataset into the `ds` variable and only include the first **500 samples** of the **train** split of the dataset stored in `dataset_name` as your evaluation set.  

### Instructions  
1. **Import** necessary functions and classes from the `datasets` library.  
2. **Load** the dataset into the `ds` variable using `load_dataset()`.  
3. **Filter** `ds` to include the first **500 samples** of the **train** split.  
4. **Ensure** that the filtered dataset remains in the correct format (`Dataset`).  



In [5]:
# Import necessary functions from the datasets library
from datasets import load_dataset, Dataset

# Load the training split of the dataset
ds = load_dataset("keivalya/MedQuad-MedicalQnADataset", split = "train")
filtered_ds = ds.select(range(500))

# Display the first sample to verify correctness
print(filtered_ds[0])

{'qtype': 'susceptibility', 'Question': 'Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?', 'Answer': 'LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.  Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.'}


## Creating Training Samples  

As part of a **customer service chatbot**, your team is building a pipeline to **preprocess a dataset** that will eventually be used to **fine-tune a language model**. The goal is to predict the **intent** of a customer's question and correctly route requests to the appropriate team for processing.  

### **Task**  
You are given a dataset where:  
- The **column `instruction`** contains the **customer’s question**.  
- The **column `intent`** represents the **user's intent**.  

Your task is to preprocess the dataset by merging both columns into a **single formatted prompt string** for each example.

### **Instructions**  
1. **Define a function** to format each row into the required structure.  
2. **Apply** this function to all rows in the dataset.  
3. **Store the transformed data** in a new column called **`intent_example`**.  
4. **Extract and print** the first processed example.  

In [9]:
# Define a function to merge instruction and intent into a formatted string
def create_intent_example(row):
    """
    Formats a row by merging 'instruction' and 'intent' into a single string.
    Stores the result in a new column 'intent_example'.
    """
    row['intent_example'] = f"Query: {row['instruction']}\nIntent: {row['intent']}"
    return row

# Apply the function to all rows in the dataset
processed_dataset = dataset.map(create_intent_example)

# Print the formatted example from the first row
print(processed_dataset[0]['intent_example'])

Query: question about cancelling order {{Order Number}}
Intent: cancel_order
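
The same formatting logic can be exercised on plain dictionaries before running it over the full dataset. The sketch below is a minimal, library-free check; the sample rows and the helper name `format_intent_example` are hypothetical and not part of the original exercise.

```python
# Hypothetical sample rows standing in for dataset records (illustrative only).
rows = [
    {"instruction": "question about cancelling order {{Order Number}}", "intent": "cancel_order"},
    {"instruction": "where is my refund?", "intent": "track_refund"},
]

def format_intent_example(row):
    """Return a copy of `row` with an added 'intent_example' prompt string."""
    formatted = f"Query: {row['instruction']}\nIntent: {row['intent']}"
    # Copy instead of mutating so the input rows stay unchanged.
    return {**row, "intent_example": formatted}

processed_rows = [format_intent_example(row) for row in rows]
print(processed_rows[0]["intent_example"])
```

Returning a new dictionary rather than mutating the input keeps the original rows intact, which makes the function easier to test in isolation.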


### Saving preprocessed datasets
As part of your customer service chatbot project, you have now prepared a dataset for fine-tuning a Llama model. The next step is to save the dataset so that you can reload it later without repeating the preprocessing steps. This lets your team reuse the dataset across multiple experiments and iterations.



In [10]:
from datasets import load_from_disk

# Save the dataset to disk
ds.save_to_disk("preprocessed_dataset")

# Load the dataset from disk
ds_preprocessed = load_from_disk("preprocessed_dataset")

# Print the first element of the loaded dataset
print(ds_preprocessed[0])

{'instruction': 'What is the status of my order?', 'response': 'Your order has been shipped and is expected to arrive tomorrow.'}


## Defining custom recipes

You're fine-tuning a pre-trained Llama model for a customer who requires specific configurations. You plan to use TorchTune for fine-tuning, so you need to prepare a Python dictionary that stores the requirements for the custom recipe you'll use to run the fine-tuning job.

### **Instructions**

1. Specify the customer requirements in your dictionary: first, add the ```torchtune.models.llama3_2.llama3_2_1b``` model.  
2. Add a **batch size of 8** and a **GPU device**.




In [11]:
config_dict = {
    # Define the model
    "model": {
        "_component_": "torchtune.models.llama3_2.llama3_2_1b"
    },
    # Define the batch size
    "batch_size": 8,
    # Define the device type
    "device": "cuda",
    # Define the number of training epochs
    "epochs": 15,
    # Define the optimizer with its learning rate
    "optimizer": {
        "_component_": "bitsandbytes.optim.PagedAdamW8bit",
        "lr": 3e-05
    },
    # Define the dataset component
    "dataset": {
        "_component_": "custom_dataset"
    },
    # Define the directory to save fine-tuning results
    "output_dir": "/tmp/finetune_results"
}
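
Each `_component_` value is a dotted import path. As a rough illustration of how such a path can be turned into a Python object, the sketch below resolves a dotted string with the standard library. The helper `resolve_component` is hypothetical, and a standard-library path stands in for the torchtune one so the snippet runs without torchtune installed; torchtune's own instantiation logic may differ.

```python
import importlib

def resolve_component(dotted_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# Stand-in config entry (assumption); a real recipe would point at a torchtune builder.
example_entry = {"_component_": "collections.OrderedDict"}
component = resolve_component(example_entry["_component_"])
print(component.__name__)
```

Keeping components as strings means the whole configuration stays serializable, which is what makes it possible to store a recipe in a plain text file.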

## Saving custom recipes

The customer has now asked you for a modification in the requirements. This time, they'd like to increase the number of parameters and use the Llama 3.2 model with 3B parameters. You make this modification to your dictionary, and then save it as a YAML file.

The ```yaml``` library has been pre-imported.

### Instructions
- Specify the new model requirement, the ```torchtune.models.llama3_2.llama3_2_3b``` model, in your dictionary.
- Save the requirements as a YAML file named ```custom_recipe.yaml```.

In [13]:
import yaml

config_dict = {
    # Update the model
    "model":
    {
        "_component_": "torchtune.models.llama3_2.llama3_2_3b"
    },
    "batch_size": 8,
    "device": "cuda",
    "optimizer": {"_component_": "bitsandbytes.optim.PagedAdamW8bit", "lr": 3e-05},
    "dataset": {"_component_": "custom_dataset"},
    "output_dir": "/tmp/finetune_results"
}

# Save the updated configuration to a new YAML file
with open("custom_recipe.yaml", "w") as yaml_file:
    yaml.dump(config_dict, yaml_file)

## Setting up Llama training arguments

You are tasked with fine-tuning the Llama model used in a customer service chatbot on customer service data purpose-built for question answering. To get the best performance out of these models, your team will fine-tune a Llama model for this task using the ```bitext``` dataset.

You want to do a test run of the training loop to check that the training script works. Start by setting a small learning rate and limiting the training to a handful of steps in your training arguments.

### Instructions
- Import and instantiate the helper class to store your training arguments.
- Set the training argument for learning rate to a value of ```2e-3```.

In [30]:
# Load helper class for the training arguments from the correct library
from transformers import TrainingArguments

# Define training arguments with specified parameters
training_arguments = TrainingArguments(
    # Set learning rate
    learning_rate=2e-3,
    # Set warmup ratio for learning rate scheduling
    warmup_ratio=0.03,
    # Define number of training epochs
    num_train_epochs=3,
    # Set output directory for saving checkpoints and logs
    output_dir='/tmp',
    # Define batch size per device
    per_device_train_batch_size=1,
    # Set gradient accumulation steps for effective batch size
    gradient_accumulation_steps=1,
    # Set checkpoint saving frequency
    save_steps=10,
    # Define logging frequency
    logging_steps=100,
    # Use a constant learning rate scheduler
    lr_scheduler_type='constant',
    # Disable reporting to external services (e.g., WandB)
    report_to='none'
)
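
The arguments above determine how many optimizer updates a run performs. The sketch below approximates that count for a single device. `estimate_optimizer_steps` is a hypothetical helper, the dataset size of 1,000 examples is an assumption, and the rounding is a simplification of what `Trainer` does internally. It also illustrates the `max_steps` argument of `TrainingArguments`, which caps a run when set to a positive value and is one way to keep a test run to a handful of steps.

```python
import math

def estimate_optimizer_steps(num_examples, per_device_batch_size,
                             gradient_accumulation_steps, num_train_epochs,
                             max_steps=-1):
    """Approximate the number of optimizer updates for a single-device run."""
    if max_steps > 0:
        # A positive max_steps caps the run regardless of the epoch count.
        return max_steps
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_train_epochs

# Assumed dataset size of 1,000 examples (hypothetical), with the settings above.
full_run = estimate_optimizer_steps(1000, 1, 1, 3)
smoke_test = estimate_optimizer_steps(1000, 1, 1, 3, max_steps=20)
print(full_run, smoke_test)
```

With `save_steps=10`, a run of this length would also write a checkpoint every ten updates, so capping the step count keeps both runtime and disk usage small during a test run.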

## Fine-tuning Llama for customer service QA

You work at a company that builds customer service chatbots. Your team uses the Llama models in your customer service bot, and you want to improve the model by fine-tuning on a question-answering dataset related to customer service. To ensure the best performance out of these models, your team will fine-tune a Llama model for this task using the ```bitext``` dataset.

The training script is almost complete. The only thing missing is the final step, where you bring together the model, tokenizer, training dataset, and training arguments and start training.

### Instructions
- Import the class that lets you conduct supervised fine-tuning from its library.
- Instantiate the class used for supervised fine-tuning by passing the ```model```, ```tokenizer```, ```dataset```, and ```training_arguments```.
- Run the instance method to start fine-tuning your model.

In [2]:
from huggingface_hub import login

login()

In [31]:
# Import required libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments

# Define model checkpoint
model_checkpoint = "Maykeye/TinyLLama-v0"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Ensure the tokenizer pads properly
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("alcamilo2/Bitext-customer-support-1-column", split = "train")
dataset = dataset.map(lambda x: {"text": x['text']})

# Instantiate the fine-tuning trainer
trainer = SFTTrainer(
    model=model,  # Attach model
    tokenizer=tokenizer,  # Attach tokenizer
    train_dataset=dataset,  # Assign dataset
    args=training_arguments,  # Apply training arguments
)

# Start fine-tuning
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 4.7521 |
| 200  | 3.55   |
| 300  | 3.178  |
| 400  | 2.7743 |
| 500  | 2.821  |
| 600  | 2.459  |
| 700  | 2.6641 |
| 800  | 2.322  |
| 900  | 2.3663 |
| 1000 | 2.2843 |


TrainOutput(global_step=3000, training_loss=2.1476695353190105, metrics={'train_runtime': 852.6825, 'train_samples_per_second': 3.518, 'train_steps_per_second': 3.518, 'total_flos': 7062481176192.0, 'train_loss': 2.1476695353190105})

## Evaluate generated text using ROUGE

You are given 10 samples from a question-answering dataset (```SoftAge-AI/sft-conversational_dataset```).

You have used ```TinyLlama-1.1B``` to generate answers to these samples, and your task is to evaluate the quality of the generated results against the ground truth.

The answers generated by this model are provided in ```test_answers``` and the ground truth in ```reference_answers```. Use the ROUGE evaluation metrics to evaluate the quality of the model's generation.

### Instructions

- Import evaluation class and metric (ROUGE metric).
- Instantiate the evaluation class and load the ROUGE metric.
- Run the evaluator instance with the given ```reference_answers``` and ```test_answers``` to compute the ROUGE scores.
- Store in ```final_score``` the score from the results that measures the overlap of word pairs between the reference and generated answers.

In [51]:
import torch

dataset = load_dataset("SoftAge-AI/sft-conversational_dataset", split = "train")

# Load the TinyLlama model and tokenizer
model_name = "Maykeye/TinyLLama-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Extract input questions and reference answers using the correct column names
test_samples = dataset["Query"][:10]  # Taking 10 questions
reference_answers = dataset["Answer"][:10]  # Corresponding ground truth responses


# Generate answers using TinyLlama model
test_answers = []
for question in test_samples:
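    # Move the inputs to the model's device (needed when device_map places the model on a GPU)
    inputs = tokenizer(question, return_tensors="pt").to(model.device)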
    with torch.no_grad():
        output = model.generate(**inputs, max_length=100)
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    test_answers.append(answer)

In [57]:
# Print sample results
print(f"Query: {test_samples[0]}")
print(f"Generated Answer: {test_answers[0]}")
print(f"Reference Answer: {reference_answers[0]}\n")

Query: When is the best time to visit Alaska?
Generated Answer: When is the best time to visit Alaska?
The family went to the park and saw a big, strong man. He was very curious and wanted to go.
The man said, "Let's go and see the people and the kids."
The man said, "Yes, we can!"
The man said, "I'm so excited to see the people. We can go to the park and see the birds. We can see the birds and the birds."

Reference Answer: The best time to visit Alaska is during the summer between May and September. The temperature is warm and pleasant, there are 16–24 hours of daylight and the scenery is breathtaking.

Here is a more detailed breakdown of the best time to visit Alaska for different activities:
1. The Northern Lights: The northern lights are visible from August 20 through April 20. The best time for a winter aurora vacation is February and March [1].
2. Wildlife viewing: June to August is the best time to see whales, bears, and other wildlife [1].
3. Hiking and camping: May to mid-Oc
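
The generated answer above begins by repeating the question, because a decoder-only model's `generate()` output contains the prompt tokens followed by the newly generated ones. One common remedy is to slice off the prompt before decoding. The sketch below shows the idea on plain Python lists; the token ids and the helper `strip_prompt` are hypothetical stand-ins, not output from the actual tokenizer.

```python
# Hypothetical token ids standing in for tokenizer and generate() outputs.
prompt_ids = [101, 2043, 2003, 1996, 2190]          # tokens of the question
output_ids = prompt_ids + [3000, 3001, 3002, 3003]  # prompt followed by new tokens

def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

new_token_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_token_ids)
```

With real tensors, the equivalent slice would be `output[0][inputs["input_ids"].shape[1]:]` before calling `tokenizer.decode`. Scoring only the new tokens avoids crediting the model for words it merely copied from the question.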

In [56]:
import evaluate

# Load the ROUGE metric for evaluation
rouge_evaluator = evaluate.load('rouge')

# Validate input format to ensure they are lists of strings
if not isinstance(test_answers, list) or not isinstance(reference_answers, list):
    raise TypeError("Both test_answers and reference_answers must be lists of strings.")

if len(test_answers) != len(reference_answers):
    raise ValueError("test_answers and reference_answers must have the same number of elements.")

# Compute ROUGE scores
results = rouge_evaluator.compute(predictions=test_answers, references=reference_answers)

# Ensure results contain expected keys before extracting the score
if "rouge1" not in results:
    raise KeyError("Expected 'rouge1' key missing in ROUGE results.")

# Extract the ROUGE-1 score (measuring unigram overlap).
# Note: the overlap of word pairs (bigrams) is reported under results['rouge2'].
final_score = results['rouge1']

# Display the final ROUGE-1 score
print(f"ROUGE-1 Score: {final_score:.4f}")

ROUGE-1 Score: 0.1815
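
The ROUGE results dictionary contains several scores, including `rouge1` (single-word overlap) and `rouge2` (word-pair overlap). To make the difference concrete, the sketch below computes a simplified ROUGE-N F1 score by hand. It assumes lowercase whitespace tokenization and a single reference, so its numbers will not match the `evaluate` implementation exactly; the function names and example sentences are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(prediction, reference, n):
    """Simplified ROUGE-N F1 using lowercase whitespace tokenization."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum((pred_counts & ref_counts).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

prediction = "the cat lay on the mat"
reference = "the cat sat on the mat"
rouge1 = rouge_n_f1(prediction, reference, 1)  # single-word overlap
rouge2 = rouge_n_f1(prediction, reference, 2)  # word-pair overlap
print(f"ROUGE-1: {rouge1:.4f}, ROUGE-2: {rouge2:.4f}")
```

Because a single changed word breaks two adjacent word pairs, the word-pair score drops faster than the single-word score, which is why it is the stricter of the two.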


## Using LoRA adapters

You work at a startup that provides customer service chatbots that automatically resolve simple questions that customers may have.

You have been tasked with fine-tuning the ```Maykeye/TinyLLama-v0``` language model using the pre-loaded dataset. This model will be used in a chatbot that your team provides. The training script is almost complete, but you want to integrate LoRA into your fine-tuning, as it is more efficient and would let your team's training pipeline complete more quickly during deployments.

The relevant model, tokenizer, dataset, and training arguments have been pre-loaded for you in `model`, `tokenizer`, `dataset`, and `training_arguments`.

### Instructions

- Import the LoRA configuration from the associated library.
- Instantiate LoRA configuration parameters with the defaults given to `lora_config`.
- Integrate the LoRA parameters into SFTTrainer.

In [64]:
# Import LoRA configuration class
from peft import LoraConfig

# Instantiate LoRA configuration with values
lora_config = LoraConfig(
    r=12,
    lora_alpha=8,
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
    bias="none",
    target_modules=['q_proj', 'v_proj']
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_arguments,
    # Pass the lora_config to trainer
    peft_config=lora_config,
)
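
LoRA is more efficient because it freezes each targeted weight matrix and trains two small low-rank factors instead. The sketch below compares the parameter counts for one projection matrix. The hidden size of 2048 is a hypothetical value chosen for illustration and is not the actual dimension of the model used here; `lora_trainable_params` is likewise a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Hypothetical square projection size, chosen only for illustration.
hidden_size = 2048
full_params = hidden_size * hidden_size
lora_params = lora_trainable_params(hidden_size, hidden_size, rank=12)

print(f"Full matrix: {full_params:,} parameters")
print(f"LoRA (r=12): {lora_params:,} parameters")
print(f"Fraction trained: {lora_params / full_params:.4%}")
```

The trainable count grows linearly with the rank, so lowering `r` shrinks the adapter further at the cost of a less expressive update.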

## LoRA Fine-Tuning Llama for Customer Service

You have been tasked with fine-tuning a language model to answer customer service questions. The Llama models are quite good at question answering and should work well for this customer service task. Unfortunately, you don't have the compute capacity for regular fine-tuning, so you must use LoRA fine-tuning with the bitext dataset.

You want to be able to train **Maykeye/TinyLLama-v0**. The training script is already almost complete, and you are already provided with the training code, with the exception of the LoRA configuration parameters.

The relevant **model, tokenizer, dataset, and training arguments** have been pre-loaded for you in `model`, `tokenizer`, `dataset`, and `training_arguments`.

### Instructions
- Add the argument to set your **LoRA adapters to rank 2**.
- Set the **scaling factor** so that it is **double your rank**.
- Set the **task type** used with **Llama-style models** in your **LoRA configuration**.
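
In the standard LoRA formulation, the low-rank update is multiplied by `lora_alpha / r` before being added to the frozen weight, so setting the scaling factor to double the rank gives an effective multiplier of 2. The sketch below works through that arithmetic. `lora_scaling` is a hypothetical helper, and it assumes the default scaling rule rather than variants such as rank-stabilized LoRA.

```python
def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return lora_alpha / r

rank = 2
lora_alpha = 2 * rank  # "double your rank"
print(lora_scaling(lora_alpha, rank))    # scaling for r=2, lora_alpha=4
print(lora_scaling(lora_alpha=8, r=12))  # scaling for r=12, lora_alpha=8
```

Tying `lora_alpha` to the rank keeps the magnitude of the update roughly comparable when you experiment with different ranks.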


In [74]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    # Set rank parameter
    r=2,
    # Set scaling factor
    lora_alpha=4,
    # Set the type of task
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
    bias="none",
    target_modules=['q_proj', 'v_proj']
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_arguments,
    peft_config=peft_config,
)

trainer.train()

{'loss': 8.7582, 'grad_norm': 1.3358652591705322, 'learning_rate': 0.002, 'epoch': 0.05}
{'loss': 8.9783, 'grad_norm': 1.445554256439209, 'learning_rate': 0.002, 'epoch': 0.1}
{'loss': 8.5031, 'grad_norm': 1.3505754470825195, 'learning_rate': 0.002, 'epoch': 0.15}
{'loss': 7.8045, 'grad_norm': 1.3089845180511475, 'learning_rate': 0.002, 'epoch': 0.2}
{'loss': 6.4535, 'grad_norm': 2.0818891525268555, 'learning_rate': 0.002, 'epoch': 0.25}
{'loss': 8.094, 'grad_norm': 1.2905353307724, 'learning_rate': 0.002, 'epoch': 0.3}
{'loss': 6.7027, 'grad_norm': 2.470402240753174, 'learning_rate': 0.002, 'epoch': 0.35}
{'loss': 7.8301, 'grad_norm': 1.5499604940414429, 'learning_rate': 0.002, 'epoch': 0.4}
{'loss': 8.0945, 'grad_norm': 1.5618335008621216, 'learning_rate': 0.002, 'epoch': 0.45}
{'loss': 8.1392, 'grad_norm': 1.3591623306274414, 'learning_rate': 0.002, 'epoch': 0.5}
{'loss': 7.8518, 'grad_norm': 1.6783764362335205, 'learning_rate': 0.002, 'epoch': 0.55}
{'loss': 8.514, 'grad_norm': 1.7

## Loading 8-Bit Models

Your company has been using a **Llama model** for their customer service chatbot for a while now. You've been tasked with figuring out how to **reduce the model's GPU memory usage** without significantly affecting performance. This will allow the team to switch to a **cheaper compute cluster** and save the company a lot of money.

You decide to test if you can **load your model with 8-bit quantization** and maintain a reasonable performance.

You are given the model in `model_name`.  
`AutoModelForCausalLM` and `AutoTokenizer` are **already imported** for you.

### Instructions

- Import the configuration class to enable **loading of models with quantization**.
- Instantiate the **quantization configuration class**.
- Configure the **quantization parameters** to load the model in **8-bit**.
- Pass **quantization configuration** to `AutoModelForCausalLM` to load the quantized model.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Maykeye/TinyLLama-v0"

# Import quantization configuration class
from transformers import BitsAndBytesConfig

# Instantiate quantization configuration
bnb_config = BitsAndBytesConfig(
    # Set 8-bit loading
    load_in_8bit=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Set quantization parameters to load quantized model
    quantization_config=bnb_config,
    # device_map belongs here, not in BitsAndBytesConfig; "auto" places the weights on the available device(s)
    device_map="auto",
    low_cpu_mem_usage=True
)
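
To build intuition for what 8-bit loading does to the weights, the sketch below applies a basic absmax quantization to a short list of floats and then restores them. This is a deliberately simplified illustration with hypothetical weight values and helper names; the scheme used by bitsandbytes is more involved (for example, it scales vector-wise and treats outlier features separately).

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = 127 / absmax if absmax > 0 else 1.0
    quantized = [round(w * scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q / scale for q in quantized]

weights = [0.12, -0.53, 0.25, 0.91]  # hypothetical weight values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"max error: {max_error:.5f}")

# A single large outlier stretches the scale and flattens the small values to zero.
q_outlier, _ = quantize_absmax_int8([0.01, 0.02, 50.0])
print(q_outlier)
```

The second call shows why outliers matter: one large value consumes the whole integer range, so the small weights around it lose all of their information.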

## Speeding Up Inference in Quantized Models

Your company has been using a **Llama model** for their customer service chatbot for a while now with **quantization**. One of the biggest customer complaints you receive is that the bot **answers questions very slowly** and sometimes **produces weird answers**.

You suspect this might have to do with **quantizing to 4-bit without normalizing**.  
During your investigation, you also suspect that the **speed trade-off** comes from inference computations, which are using **32-bit floats**.

You want to **adjust the quantization configurations** to improve the **inference speed** of your model.  

The following imports have already been loaded:  `AutoModelForCausalLM`,`AutoTokenizer`, `BitsAndBytesConfig`  


### Instructions
- Set the quantization type to **normalized 4-bit** to **reduce outliers**, thus producing **fewer nonsensical answers**.
- Set the **compute type** to `bfloat16` to **speed up inference computation**.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name="Maykeye/TinyLLama-v0"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Set quantization type to normalized 4-bit
    bnb_4bit_quant_type="nf4",
    # Set compute data type to be bfloat16
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True
)
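
The memory savings from quantization follow directly from the number of bits stored per weight. The sketch below estimates the weight-only footprint for several precisions. The 1B parameter count is a hypothetical round number, and the estimate ignores quantization constants, activations, and the key-value cache, so real usage will be somewhat higher.

```python
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 1_000_000_000  # hypothetical 1B-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

Note that the storage precision and the compute precision are separate settings: the weights can be stored in 4 bits while the matrix multiplications run in `bfloat16`.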