### LLaMA Supervised Fine-Tuning

This notebook takes GPT-4o's answers to questions from the Kabatubare medical dataset and fine-tunes a LLaMA model on those answers.

The goal is to test whether supervised fine-tuning lets LLaMA distill GPT-4o's knowledge and improve its performance on open-ended healthcare question answering.

In [1]:
import os

In [3]:
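# os.environ["CUDA_VISIBLE_DEVICES"] = "7"
# Pick the best available device: CUDA (NVIDIA GPU), then MPS (Apple Silicon), then CPU
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")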

print(f"PyTorch version: {torch.__version__}")
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")
print(f"Device: {device}")

PyTorch version: 2.6.0
MPS available: True
MPS built: True
Device: mps


In [4]:
import pandas as pd
import json
from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from datasets import Dataset
from trl import SFTTrainer

ModuleNotFoundError: No module named 'unsloth'

#### Reading the Question and Answer Pairs from Training Dataset Phase 2

In [19]:
ques_list = []
gpt_resp_list = []

# Read only the training split; each line of the file is one JSON record
with open('phase2_data_kabatubare/train_kabatubare.jsonl', 'r', encoding='utf-8') as file:
    for line in file:
        json_object = json.loads(line)
        ques_list.append(json_object['question'])
        gpt_resp_list.append(json_object['gpt_response_base'])  # GPT-4o response used as the fine-tuning target
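
The loop above assumes every line is valid JSON and carries both keys. A slightly more defensive variant could look like the sketch below. The helper name `parse_qa_lines` is hypothetical (it is not part of this notebook), and it takes any iterable of lines so that it can be tried without the dataset file.

```python
import json

def parse_qa_lines(lines, question_key="question", answer_key="gpt_response_base"):
    """Parse JSONL lines into (questions, answers, skipped_count).

    Blank lines are ignored; malformed or incomplete records are counted as skipped.
    """
    questions, answers, skipped = [], [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        question, answer = record.get(question_key), record.get(answer_key)
        if not question or not answer:
            skipped += 1  # missing or empty field
            continue
        questions.append(question)
        answers.append(answer)
    return questions, answers, skipped

sample = [
    '{"question": "q1", "gpt_response_base": "a1"}',
    "not json",
    "",
    '{"question": "q2"}',
]
print(parse_qa_lines(sample))
```

With a real file, the same function can be fed an open file handle, for example `parse_qa_lines(open(path, encoding="utf-8"))`.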

In [20]:
gpt_inf_data = pd.DataFrame({'question': ques_list, 'gpt_response_base': gpt_resp_list})
gpt_inf_data

| | question | gpt_response_base |
|---|---|---|
| 0 | i have a small dull ache in my left testicle a... | It's not uncommon for individuals to experienc... |
| 1 | i've heard conflicting opinions. 7 weeks of pr... | During pregnancy, there are several dietary re... |
| 2 | my friend slept over that had fever blisters. ... | I understand that you're feeling very anxious ... |
| 3 | what are some common food triggers for migraines? | Migraines can be triggered by a variety of fac... |
| 4 | why does grey hair itch so much? . why does my... | Itching in gray hair can be attributed to seve... |
| ... | ... | ... |
| 18744 | how to make money online? . a brand new approa... | I'm here to assist with health-related inquiri... |
| 18745 | what can i eat with the stomach flu? . i can't... | When you're dealing with stomach flu (viral ga... |
| 18746 | what exams and tests help doctors to evaluate ... | To evaluate or test individuals for ringworm o... |
| 18747 | could i be pregnant? | Whether you could be pregnant depends on sever... |


Create the Hugging Face `Dataset` from the pandas DataFrame

In [21]:
dataset = Dataset.from_pandas(gpt_inf_data)
dataset = dataset.train_test_split(test_size=0.1)  # hold out 10% of the training data as a validation split
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'gpt_response_base'],
        num_rows: 16874
    })
    test: Dataset({
        features: ['question', 'gpt_response_base'],
        num_rows: 1875
    })
})
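
The split sizes above can be checked by hand. The sketch below assumes the test size is rounded up to a whole row and the remainder goes to the training split; the row counts come from the `DatasetDict` output.

```python
import math

n_rows = 16874 + 1875   # total rows, taken from the DatasetDict output
test_fraction = 0.1     # the test_size argument

# Assumption: the test split is rounded up, and the rest is used for training.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_rows, n_train, n_test)
```

Because `train_test_split` shuffles by default, passing a fixed `seed` would be needed to get the same validation rows on every run.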

### Fine-Tuning Code

#### Loading the model and tokenizer

In [6]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = False, # 4-bit quantization (lowest memory use) is disabled here
    load_in_8bit = True, # 8-bit quantization: a bit more accurate than 4-bit, uses about 2x its memory
    full_finetuning = False, # False: train LoRA adapters instead of updating all weights
    dtype=None, # None auto-detects the dtype; torch.bfloat16 or torch.float16 can also be set explicitly
    device_map="auto"
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.2.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.413 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


#### Setting up the PEFT settings for the model

https://huggingface.co/blog/damjan-k/rslora\
https://medium.com/@fartypantsham/what-rank-r-and-alpha-to-use-in-lora-in-llm-1b4f025fd133

In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, #max_full_rank=64 by default in FastLanguageModel
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    # Standard LoRA scales the adapter update by lora_alpha / r, so lora_alpha = 2 * r would double it, which can be unnecessary.
    # With use_rslora = True (below) the scale is lora_alpha / sqrt(r) instead, i.e. 64 / 8 = 8 here.
    lora_alpha = 64,
    lora_dropout = 0.1,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    use_rslora = True,
    loftq_config = None,
)

Unsloth: Making `model.base_model.model.model` require gradients
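
To make the effect of `r`, `lora_alpha`, and `use_rslora` concrete, the sketch below works out the adapter scaling factor and a rough trainable-parameter count. The projection shapes and layer count are assumptions about Llama-3.2-3B that are not stated in this notebook, so the parameter count should be read as an estimate.

```python
import math

r, lora_alpha = 64, 64  # values from the get_peft_model call

standard_scale = lora_alpha / r            # classic LoRA scaling
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

# Assumed (in_features, out_features) of each targeted projection in Llama-3.2-3B.
shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
num_layers = 28  # assumed number of transformer blocks

# Each adapted projection adds A (r x in_features) and B (out_features x r).
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * num_layers

print(f"standard scale: {standard_scale}, rsLoRA scale: {rslora_scale}")
print(f"estimated LoRA parameters: {trainable:,}")
```

Under these assumptions the rsLoRA setting multiplies the adapter update by 8 rather than 1, which is worth keeping in mind when choosing the learning rate.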


#### Forming the chat template

In [8]:
# Define a function to apply the chat template
def format_chat_template(example):
        
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to provide information and guidance on various health-related topics."},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['gpt_response_base']}
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False)

    return {"text": prompt}
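
For intuition about what `apply_chat_template` produces, the sketch below hand-builds a simplified version of the Llama 3 chat layout. `render_llama3_chat` is a hypothetical helper written for illustration only; the real template lives in the tokenizer and also injects extra system text (such as the knowledge-cutoff and date lines), which this sketch omits.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified approximation of the Llama 3 chat layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model to start answering.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

demo = render_llama3_chat(
    [
        {"role": "system", "content": "You are a medical knowledge assistant."},
        {"role": "user", "content": "can osteoarthritis be prevented?"},
    ],
    add_generation_prompt=True,
)
print(demo)
```

For training, the assistant turn is included as a full message and `add_generation_prompt` is left off; for inference, only the system and user turns are passed and the generation prompt is appended.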

In [9]:
dataset_formatted = dataset.map(format_chat_template)

Map: 100%|██████████| 16874/16874 [00:02<00:00, 5761.89 examples/s]
Map: 100%|██████████| 1875/1875 [00:00<00:00, 5723.95 examples/s]


In [10]:
print(dataset_formatted['train']['text'][0])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Apr 2025

You are a medical knowledge assistant trained to provide information and guidance on various health-related topics.<|eot_id|><|start_header_id|>user<|end_header_id|>

can osteoarthritis be prevented?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Osteoarthritis (OA) is a degenerative joint disease that occurs when the cartilage that cushions the ends of bones wears down over time. While it cannot be entirely prevented, there are several strategies that may help reduce the risk of developing osteoarthritis or slow its progression. Here are some preventive measures:

1. **Maintain a Healthy Weight:** Excess weight puts additional stress on joints, particularly those that bear weight, such as the knees and hips. Maintaining a healthy weight can reduce the risk of OA.

2. **Stay Active:** Regular physical activity strengthens the muscles around the joints, im

#### Initializing the TRL SFTTrainer and related Arguments

In [11]:
# full_model_path = "./llama32-sft-full-kabatubare" #use for full finetuning
peft_model_path = "./llama32-sft-peft-kabatubare" #use for LoRA based fine-tuning

training_args = TrainingArguments(
        output_dir=peft_model_path,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        # gradient_accumulation_steps=4,
        eval_strategy="steps",
        eval_steps=50,
        logging_strategy="steps",
        logging_steps=50,
        save_strategy="steps",
        save_steps=1000,
        warmup_steps = 5,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        seed = 42,
        report_to = "none",
    )

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset=dataset_formatted["train"],
    eval_dataset=dataset_formatted["test"],
    dataset_text_field = "text",
    max_seq_length = 2048,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer), #only use when using train_on_responses_only()
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args)
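
The training arguments above imply a concrete schedule. The sketch below derives it from the values in this notebook, assuming a single GPU and no gradient accumulation. The `linear_lr` function is a hand-written approximation of a linear schedule with warmup, not the library's own code.

```python
import math

train_rows, eval_rows = 16874, 1875     # split sizes from the DatasetDict output
batch_size, epochs = 8, 3
eval_steps, save_steps, warmup_steps = 50, 1000, 5
peak_lr = 2e-4

# Assumption: one GPU, gradient_accumulation_steps left at 1.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
num_evals = total_steps // eval_steps
eval_batches = math.ceil(eval_rows / batch_size)

def linear_lr(step):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(f"{steps_per_epoch} steps/epoch, {total_steps} total steps")
print(f"{num_evals} evaluations of {eval_batches} batches each")
print(f"lr at steps 0, 5, {total_steps // 2}: "
      f"{linear_lr(0)}, {linear_lr(5)}, {linear_lr(total_steps // 2):.2e}")
```

Evaluating every 50 steps means the validation set is scored well over a hundred times during the run, so raising `eval_steps` is one way to shorten training. With `load_best_model_at_end=True`, `save_steps` also has to stay a multiple of `eval_steps`, which 1000 and 50 satisfy.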

Unsloth: Tokenizing ["text"] (num_proc=2): 100%|██████████| 16874/16874 [00:11<00:00, 1522.06 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=2): 100%|██████████| 1875/1875 [00:02<00:00, 862.30 examples/s] 
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [12]:
trainer.train_dataset

Dataset({
    features: ['question', 'gpt_response_base', 'text', 'input_ids', 'attention_mask'],
    num_rows: 16874
})

In [13]:
print(tokenizer.decode(trainer.train_dataset['input_ids'][0]))

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 02 Apr 2025

You are a medical knowledge assistant trained to provide information and guidance on various health-related topics.<|eot_id|><|start_header_id|>user<|end_header_id|>

can osteoarthritis be prevented?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Osteoarthritis (OA) is a degenerative joint disease that occurs when the cartilage that cushions the ends of bones wears down over time. While it cannot be entirely prevented, there are several strategies that may help reduce the risk of developing osteoarthritis or slow its progression. Here are some preventive measures:

1. **Maintain a Healthy Weight:** Excess weight puts additional stress on joints, particularly those that bear weight, such as the knees and hips. Maintaining a healthy weight can reduce the risk of OA.

2. **Stay Active:** Regular physical activity strengthens the muscles arou
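
The decoded sample above begins with two `<|begin_of_text|>` tokens. A likely cause is that the chat template already writes the token into the text and the tokenizer then prepends another one. A minimal way to detect and collapse such a run at the token-id level is sketched below. `collapse_leading_bos` is a hypothetical helper, and the id 128000 is an assumption about the Llama 3 vocabulary; in practice it would be read from `tokenizer.bos_token_id`.

```python
def collapse_leading_bos(input_ids, bos_token_id):
    """Return input_ids with any run of leading BOS tokens reduced to a single one."""
    start = 0
    while (
        start + 1 < len(input_ids)
        and input_ids[start] == bos_token_id
        and input_ids[start + 1] == bos_token_id
    ):
        start += 1
    return input_ids[start:]

BOS = 128000  # assumed id of <|begin_of_text|>
print(collapse_leading_bos([BOS, BOS, 11, 22], BOS))
```

If the duplicate is removed for training, the inference prompt should be tokenized the same way so that both stages see identical inputs.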

#### Computing the loss only on the response part

In [14]:
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=64): 100%|██████████| 16874/16874 [00:03<00:00, 4395.88 examples/s]
Map (num_proc=64): 100%|██████████| 1875/1875 [00:02<00:00, 700.16 examples/s]
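
Conceptually, response-only training copies `input_ids` into `labels` and replaces everything outside the assistant turns with -100. The sketch below illustrates that idea on plain lists of token ids. It is not Unsloth's implementation, and `find_sublist` and `mask_to_responses` are hypothetical helpers.

```python
IGNORE_INDEX = -100

def find_sublist(seq, pattern, start=0):
    """Index of the first occurrence of pattern in seq at or after start, or -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_to_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        marker = find_sublist(input_ids, response_ids, pos)
        if marker == -1:
            break
        begin = marker + len(response_ids)               # first token of the answer
        nxt = find_sublist(input_ids, instruction_ids, begin)
        end = len(input_ids) if nxt == -1 else nxt       # answer ends at the next user turn
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids: [9, 1] stands in for the user header, [9, 2] for the assistant header.
toy_ids = [0, 9, 1, 5, 6, 9, 2, 7, 8, 9, 1, 3, 9, 2, 4]
print(mask_to_responses(toy_ids, [9, 1], [9, 2]))
```

In the real call, the two markers are the tokenized `instruction_part` and `response_part` strings passed to `train_on_responses_only`.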


In [15]:
trainer.train_dataset

Dataset({
    features: ['question', 'gpt_response_base', 'text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 16874
})

In [16]:
# The labels keep only the assistant response tokens. The system and user prompt tokens are set to -100 so they are ignored in the loss calculation
trainer.train_dataset['labels'][0]

[-100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 46,
 5455,
 78,
 277,
 40485,
 320,
 42439,
 8,
 374,
 264,
 5367,
 75989,
 10496,
 8624,
 430,
 13980,
 994,
 279,
 7558,
 88076,
 430,
 68241,
 279,
 10548,
 315,
 25896,
 38400,
 1523,
 927,
 892,
 13,
 6104,
 433,
 4250,
 387,
 11622,
 32098,
 11,
 1070,
 527,
 3892,
 15174,
 430,
 1253,
 1520,
 8108,
 279,
 5326,
 315,
 11469,
 52368,
 78,
 277,
 40485,
 477,
 6435,
 1202,
 33824,
 13,
 5810,
 527,
 1063,
 71123,
 11193,
 1473,
 16,
 13,
 3146,
 67834,
 467,
 264,
 44454,
 16923,
 68063,
 1398,
 1140,
 4785,
 9711,
 5217,
 8631,
 389,
 35358,
 11,
 8104,

#### Train the model

In [None]:
trainer_stats = trainer.train()

#### Saving the model and tokenizer

Save only the LoRA adapters, without merging them into the base model.

In [None]:
peft_model_path = "./llama32-sft-peft-kabatubare" # directory for the LoRA adapters

# Save the LoRA adapter weights and the tokenizer to the same directory
model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)
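
After saving, it can be useful to confirm that the adapter directory holds what a later reload needs. The sketch below checks a directory listing against a short list of file names. The list reflects what PEFT and the tokenizer usually write and should be treated as an assumption, and `missing_adapter_files` is a hypothetical helper.

```python
# Assumed typical outputs of model.save_pretrained and tokenizer.save_pretrained for a LoRA adapter.
EXPECTED_FILES = ("adapter_config.json", "adapter_model.safetensors", "tokenizer_config.json")

def missing_adapter_files(present_names, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from a directory listing."""
    present = set(present_names)
    return [name for name in expected if name not in present]

print(missing_adapter_files(["adapter_config.json", "tokenizer_config.json"]))
```

With a real run this could be called as `missing_adapter_files(os.listdir(peft_model_path))`; an empty list means all of the expected files were found.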

### Inference

In [None]:
# full_model_path = "./llama32-sft-full-kabatubare"
peft_model_path = "./llama32-sft-peft-kabatubare" #use for LoRA based fine-tuning

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = peft_model_path,
    max_seq_length = 2048,
    load_in_4bit = False, # 4-bit quantization (lowest memory use) is disabled here
    load_in_8bit = False, # 8-bit quantization is also disabled for inference
    full_finetuning = False, # False: load the base model with the saved LoRA adapters
    dtype=None, # None auto-detects the dtype; torch.bfloat16 or torch.float16 can also be set explicitly
    device_map="auto"
)

In [None]:
dataset['test']

In [None]:
FastLanguageModel.for_inference(model)

# for idx in range(1,50):

idx = 0

print(dataset['test']['question'][idx])

messages = [{"role": "system", "content": "You are a medical knowledge assistant trained to provide information and guidance on various health-related topics."},
            {"role": "user", "content": dataset['test']['question'][idx]}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048, num_return_sequences=1)

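# Decode only the newly generated tokens; splitting the full text on "assistant" breaks if the answer contains that word
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(text)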

print('---------------------------------------------------')