## Data Processing Kabatubare Medical

This notebook loads the Kabatubare medical data (questions, answers, context) together with the GPT-4o answers from Phase 1. After the data is combined, the dataset is shuffled and split into train/test sets that are used for fine-tuning and evaluation.

Reading the Phase 1 datasets containing the questions together with the GPT and LLaMA responses

In [1]:
import json
import pandas as pd

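def json_to_df(file_path):
    """Read a JSONL file of (sample id, question, response) records into a DataFrame.

    The three fields are taken by position in each record, and the model name used
    in the response column comes from the second '_'-separated token of the file name.
    """
    ques_list = []
    response_list = []
    sample_id_list = []
    # e.g. 'dataset3_gpt_qa_pairs.jsonl' -> 'gpt'
    llm_model = file_path.split('/')[-1].split('_')[1]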

    with open(file_path, 'rb') as file:
        for line in file:
            json_object = json.loads(line)
            column_name = list(json_object.keys())
            sample_id_list.append(json_object[column_name[0]])
            ques_list.append(json_object[column_name[1]])
            response_list.append(json_object[column_name[2]])
    return pd.DataFrame({'question': ques_list, f'{llm_model}_response_base': response_list}, index=sample_id_list)
gpt_df = json_to_df('/Users/wcchang/Documents/llm_distillation/dataset3_gpt_qa_pairs.jsonl')
gpt_df

Unnamed: 0,question,gpt_response_base
0,A microbiology student is studying the differe...,D
1,A 53-year-old woman comes to see her primary c...,A
2,Accurate and rapid identification of individua...,C
3,A 40-year-old man is referred to an optometris...,D
4,A 3-week-old infant presents to the emergency ...,A
...,...,...
19154,You are asked to see a formerly premature male...,A
19155,Which is NOT a sign of neonatal hypoglycemia?\...,C
19156,A 29-year-old woman presents for an annual flu...,A
19157,A 39-year-old woman comes to the physician for...,C


In [2]:
llama_df = json_to_df('/Users/wcchang/Documents/llm_distillation/matched_llama_dataset.jsonl')
llama_df

Unnamed: 0,question,llama_response_base
0,A microbiology student is studying the differe...,"A, D"
1,A 53-year-old woman comes to see her primary c...,A
2,Accurate and rapid identification of individua...,C
3,A 40-year-old man is referred to an optometris...,D
4,A 3-week-old infant presents to the emergency ...,C
...,...,...
19375,You are asked to see a formerly premature male...,A
19376,Which is NOT a sign of neonatal hypoglycemia?\...,C
19377,A 29-year-old woman presents for an annual flu...,C
19378,A 39-year-old woman comes to the physician for...,B


### Combining the GPT and LLaMA responses from Phase 1

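Combining the two dataframes into a single dataframe.

Since both dataframes contain the same number of questions, the GPT responses can be copied into `llama_df` as a new column to form the final dataframe. The assignment is positional, so it also relies on both files listing the questions in the same order.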

In [3]:
llama_df['gpt_response_base'] = gpt_df['gpt_response_base'].tolist()
combined_df = llama_df
combined_df

Unnamed: 0,question,llama_response_base,gpt_response_base
0,A microbiology student is studying the differe...,"A, D",D
1,A 53-year-old woman comes to see her primary c...,A,A
2,Accurate and rapid identification of individua...,C,C
3,A 40-year-old man is referred to an optometris...,D,D
4,A 3-week-old infant presents to the emergency ...,C,A
...,...,...,...
19375,You are asked to see a formerly premature male...,A,A
19376,Which is NOT a sign of neonatal hypoglycemia?\...,C,C
19377,A 29-year-old woman presents for an annual flu...,C,A
19378,A 39-year-old woman comes to the physician for...,B,C
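
The positional assignment above is only safe when both files list the same questions in the same order. A minimal sketch of a more defensive combination, assuming the question text can serve as a join key. The toy frames and the helper name `combine_on_question` are hypothetical and not part of the notebook:

```python
import pandas as pd

def combine_on_question(llama_df: pd.DataFrame, gpt_df: pd.DataFrame) -> pd.DataFrame:
    """Join the two response tables on the question text instead of row position."""
    return llama_df.merge(
        gpt_df[["question", "gpt_response_base"]],
        on="question",
        how="inner",
        validate="one_to_one",  # raises MergeError if a question is duplicated on either side
    )

# Toy data standing in for the real frames; note the different row order.
toy_llama = pd.DataFrame(
    {"question": ["q1", "q2", "q3"], "llama_response_base": ["A", "C", "B"]}
)
toy_gpt = pd.DataFrame(
    {"question": ["q3", "q1", "q2"], "gpt_response_base": ["D", "A", "C"]}
)

toy_combined = combine_on_question(toy_llama, toy_gpt)
print(toy_combined)
```

`merge` returns a fresh integer index, so the sample ids would have to be kept as a regular column beforehand if they are still needed. If the real data contains repeated questions, the `validate` check fails and a different key (for example the sample id) would be required.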


Splitting into training and test datasets

In [4]:
import pandas as pd
import json
import numpy as np
from sklearn.model_selection import train_test_split


# Set random seed for reproducibility
np.random.seed(42)

# Split the dataset into train and test sets
train_df, test_df = train_test_split(combined_df, test_size=0.2, random_state=42, shuffle=True)

# Convert DataFrames to lists of dictionaries for JSONL format i.e. [{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
train_data = train_df.to_dict('records')
test_data = test_df.to_dict('records')

# Save train data to a jsonl file
with open('train_dataset.jsonl', 'w') as f:
    for item in train_data:
        json.dump(item, f)
        f.write('\n')

# Save test data to a jsonl file
with open('test_dataset.jsonl', 'w') as f:
    for item in test_data:
        json.dump(item, f)
        f.write('\n')
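
As a sanity check on the 80/20 split, the expected sizes can be worked out by hand. This is a small sketch that assumes the combined frame has 19,158 rows (a count read off the `gpt_df` index above, not stated explicitly in the notebook) and that the test share is rounded up, which is my understanding of how scikit-learn treats a float `test_size`. The helper name `split_sizes` is hypothetical:

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) when the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(19158, 0.2)
print(f"train rows: {n_train}, test rows: {n_test}")
```

Under these assumptions the training file holds 15,326 rows, which agrees with the 0 to 15325 index of the training dataframe displayed in the next cell.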

In [5]:
ques_list = []
gpt_response_list = []
# now the combined dataframe has the following columns: question, llama_response_base, gpt_response_base

with open('train_dataset.jsonl', 'rb') as file: #only reading the training dataset
    for line in file:
        json_object = json.loads(line)
        ques_list.append(json_object['question'])
        gpt_response_list.append(json_object['gpt_response_base']) #getting GPT Responses from the dataset

gpt_inf_dataset = pd.DataFrame({'question': ques_list, 'gpt_response_base': gpt_response_list})
gpt_inf_dataset

Unnamed: 0,question,gpt_response_base
0,A 26-year-old woman presents to a physician fo...,A
1,A 24-year-old woman presents with blisters and...,A
2,A 47-year-old man presents to his ophthalmolog...,D
3,A 58-year-old man presents with a sudden-onset...,C
4,A 14-year-old boy comes to the physician becau...,A
...,...,...
15322,A 70-year-old male patient comes into your off...,A
15323,A 21-year-old man presents to the office for a...,B
15324,A 43-year-old man presents to the emergency de...,A
15325,A 70-year-old woman comes to the physician for...,A


In [24]:
from datasets import Dataset

dataset = Dataset.from_pandas(gpt_inf_dataset)
# Divide the training dataset further into train and validation sets
dataset = dataset.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'gpt_response_base'],
        num_rows: 13794
    })
    test: Dataset({
        features: ['question', 'gpt_response_base'],
        num_rows: 1533
    })
})

In [25]:
print("Original format example:")
print(dataset['train'][0])

Original format example:
{'question': 'A 40-year-old man presents to the emergency department with altered mental status. He has a history of cirrhosis of the liver secondary to alcoholism. He started becoming more confused a few days ago and it has been getting gradually worse. His temperature is 98.8°F (37.1°C), blood pressure is 134/90 mmHg, pulse is 83/min, respirations are 15/min, and oxygen saturation is 98% on room air. Physical exam reveals a distended abdomen that is non-tender. Neurological exam is notable for a confused patient and asterixis. Laboratory values are ordered as seen below.\n\nSerum:\nNa+: 139 mEq/L\nCl-: 100 mEq/L\nK+: 3.3 mEq/L\nHCO3-: 22 mEq/L\nBUN: 20 mg/dL\nGlucose: 59 mg/dL\nCreatinine: 1.1 mg/dL\nCa2+: 10.2 mg/dL\n\nWhich of the following is the best next treatment for this patient?\nOptions:\nA: Dextrose\nB: Lactulose\nC: Potassium\nD: Rifaximin\n', 'gpt_response_base': 'B'}


### Fine Tuning

Loading the model and tokenizer

In [36]:
import pandas as pd
import json
import torch
# from unsloth import FastLanguageModel, is_bfloat16_supported, train_on_responses_only
from transformers import (
    Trainer,
    TrainingArguments, 
    DataCollatorForSeq2Seq,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
# from trl import SFTTrainer
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    prepare_model_for_kbit_training
)  

In [10]:
# First verify MPS availability
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
print(f"Using device: {device}")

# Load base model and tokenizer
model_name = "meta-llama/Llama-3.2-3B-Instruct"  # or your preferred model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16
)

MPS available: True
MPS built: True
Using device: mps


In [11]:
model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,), eps=1e-05)
    (rotary_emb

In [12]:
# Configure LoRA
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,  # rank
    lora_alpha=64,  # scaling factor
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    bias="none",
)
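
To get a feel for how many trainable weights this configuration adds, the adapter size can be computed from the layer shapes in the model printout above. A LoRA adapter on a linear layer adds two matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below is only arithmetic on those printed shapes; the names `TARGET_SHAPES` and `lora_params_per_module` are hypothetical:

```python
# (in_features, out_features) copied from the model printout above
TARGET_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # decoder layers in the printout
RANK = 64        # r in the LoraConfig above

def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """Weights added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params_per_module(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_lora = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
```

This gives roughly 97.3M adapter weights. Added to a base model of about 3,212.75M parameters (my own count from the printed shapes, assuming tied input and output embeddings), it is consistent with the 3310.01M parameters that the notebook prints further down.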

In [13]:
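# Prepare model for training
# Note: this helper is designed for quantized (k-bit) models; the model here is loaded in plain float16
model = prepare_model_for_kbit_training(model)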
model = get_peft_model(model, peft_config)

In [14]:
# Format chat template (reusing your existing function)
def format_chat_template(example):
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the unique correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['gpt_response_base']}
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": prompt}

Create the HuggingFace Dataset from the pandas DataFrame and apply the chat template

In [26]:
# Load and format your dataset
dataset = Dataset.from_pandas(gpt_inf_dataset)
dataset = dataset.train_test_split(test_size=0.1)
dataset_formatted = dataset.map(format_chat_template)
# Let's check the formatted output
print("\nFormatted example:")
print(dataset_formatted['train'][0]['text'])


Formatted example:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 04 Apr 2025

You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the unique correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation.<|eot_id|><|start_header_id|>user<|end_header_id|>

A 62-year-old man presents to his primary care provider complaining of abdominal pain. He reports a 6-month history of gradually progressive epigastric pain and 10 pounds of unexpected weight loss. He has also noticed that his skin feels itchier than usual. His past medical history is notable for gout, hypertension, and diabetes mellitus. He takes allopurinol, enalapril, and glyburide. He has a 10-pack-year smoking history and a distant history of cocaine abuse. His temperature is 100.1°F (37.8°C), blood pressure is 135/85 mmHg, pulse is 100/min, and respirations are 20/min. On exam, he has notable h

In [16]:
# import sys
# import torch
# import accelerate
# import transformers

# print(f"Python version: {sys.version}")
# print(f"PyTorch version: {torch.__version__}")
# print(f"Accelerate version: {accelerate.__version__}")
# print(f"Transformers version: {transformers.__version__}")
# print(f"MPS available: {torch.backends.mps.is_available()}")

Python version: 3.9.21 (main, Dec 11 2024, 10:21:40) 
[Clang 14.0.6 ]
PyTorch version: 2.6.0
Accelerate version: 1.6.0
Transformers version: 4.50.3
MPS available: True


In [28]:
def format_training_data(example):
    # Format the prompt
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['gpt_response_base']}
    ]
    
    # Apply the chat template
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    
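    # Tokenize the text (the chat template already added <|begin_of_text|>,
    # so the tokenizer must not prepend a second one)
    tokenized = tokenizer(text, truncation=True, max_length=2048, add_special_tokens=False)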
    
    # Create labels (same as input_ids for causal language modeling)
    tokenized["labels"] = tokenized["input_ids"].copy()
    
    return tokenized

# Apply the formatting
dataset_formatted = dataset.map(
    format_training_data,
    remove_columns=dataset.column_names["train"]  # Remove original columns
)

# Verify the format
print("Dataset features:", dataset_formatted["train"].features)

Dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


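The error "expected sequence of length 204 at dim 1 (got 197)" happens because the tokenized examples have different lengths (the questions in the medical dataset vary in size), so they cannot be stacked into a single batch tensor without padding.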
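
To illustrate why padding resolves this, here is a minimal sketch of batch-level padding on plain Python lists. It is not the collator used in this notebook; the helper name `pad_batch` and the token ids are made up for the example:

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Two toy sequences of unequal length (hypothetical token ids).
ids, mask = pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)
print(mask)
```

Once every row has the same length, the batch can be converted to a rectangular tensor. Padding to the longest sequence in each batch usually wastes less compute than padding everything to a fixed maximum, at the cost of variable batch shapes.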

In [31]:
# Let's check a few examples from your dataset
def check_lengths(example):
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['gpt_response_base']}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    tokens = tokenizer(text)
    # Return a dictionary instead of just the length
    return {"length": len(tokens['input_ids'])}

# Check lengths of first few examples
lengths = dataset['train'].select(range(5)).map(check_lengths)
print("\nToken lengths of first 5 examples:")
for i, item in enumerate(lengths):
    print(f"Example {i}: {item['length']} tokens")

# Also print an example of the actual tokenized text to understand the content
example = dataset['train'][0]
messages = [
    {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
    {"role": "user", "content": example['question']},
    {"role": "assistant", "content": example['gpt_response_base']}
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
tokens = tokenizer(text)
print("\nExample tokenized text length:", len(tokens['input_ids']))
print("First few tokens:", tokens['input_ids'][:10])


Token lengths of first 5 examples:
Example 0: 302 tokens
Example 1: 161 tokens
Example 2: 342 tokens
Example 3: 177 tokens
Example 4: 272 tokens

Example tokenized text length: 302
First few tokens: [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696]
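
The first two ids printed above are both 128000, the id of `<|begin_of_text|>`: the chat template already writes that token into the text, and the tokenizer call then prepends it again. Passing `add_special_tokens=False` to the tokenizer call is the usual way to avoid this. The sketch below only illustrates the effect on the printed ids, using a hypothetical helper `drop_duplicate_bos`:

```python
def drop_duplicate_bos(token_ids, bos_id):
    """Collapse a repeated leading BOS id into a single one."""
    ids = list(token_ids)
    while len(ids) >= 2 and ids[0] == bos_id and ids[1] == bos_id:
        ids.pop(0)
    return ids

# Ids copied from the output above.
first_tokens = [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696]
deduplicated = drop_duplicate_bos(first_tokens, bos_id=128000)
print(deduplicated)
```

With the duplicate removed, each example is one token shorter than the lengths reported above.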


In [33]:
# Set pad token to be the same as eos token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

def format_training_data(example):
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['gpt_response_base']}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    
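    # Tokenize with padding; add_special_tokens=False because the chat template
    # already inserted <|begin_of_text|>
    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=512,
        padding='max_length',
        add_special_tokens=False,
        return_tensors=None
    )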
    
    tokenized["labels"] = tokenized["input_ids"].copy()
    
    return tokenized

# Apply the formatting
dataset_formatted = dataset.map(
    format_training_data,
    remove_columns=dataset.column_names["train"],
    desc="Formatting dataset"
)

# Set the format to PyTorch tensors
dataset_formatted.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Verify the lengths
def check_formatted_lengths(example):
    return {"length": len(example['input_ids'])}

formatted_lengths = dataset_formatted['train'].select(range(5)).map(check_formatted_lengths)
print("\nToken lengths after formatting:")
for i, item in enumerate(formatted_lengths):
    print(f"Example {i}: {item['length']} tokens")


Token lengths after formatting:
Example 0: 512 tokens
Example 1: 512 tokens
Example 2: 512 tokens
Example 3: 512 tokens
Example 4: 512 tokens
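
Copying `input_ids` into `labels` makes the loss cover the whole sequence, including the system prompt and the question. A common alternative is to set every label outside the assistant answer to -100, the value that PyTorch's cross-entropy loss ignores by default. The following is a sketch of that idea on plain lists, not what the notebook does; `mask_labels`, `prompt_len` and the token ids are hypothetical:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids to labels, ignoring prompt tokens and padded positions."""
    labels = []
    for pos, (tok, keep) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_len or keep == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok)
    return labels

# Toy example: 3 prompt tokens, 2 answer tokens, 2 padding tokens.
toy_ids = [11, 12, 13, 21, 22, 0, 0]
toy_mask = [1, 1, 1, 1, 1, 0, 0]
toy_labels = mask_labels(toy_ids, toy_mask, prompt_len=3)
print(toy_labels)
```

Masking by attention mask rather than by pad token id also keeps the end-of-sequence label intact when the pad token and the EOS token share an id, as they do after the assignment at the top of the cell above.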


In [38]:
import torch

# Check MPS availability and current device
print(f"MPS available: {torch.backends.mps.is_available()}")
print(f"MPS built: {torch.backends.mps.is_built()}")
print(f"Current device for model: {next(model.parameters()).device}")

# Print model size and batch info
print(f"\nModel size: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
print(f"Training batch size: {training_args.per_device_train_batch_size}")
print(f"Number of training examples: {len(dataset_formatted['train'])}")

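# Check if model is actually on MPS
# Note: the device prints as 'mps:0', so this string comparison is always true and
# the branch below runs even when the model is already on MPS (the move is then a no-op)
if str(next(model.parameters()).device) != 'mps':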
    # Move model to MPS
    model = model.to('mps')
    print("Model moved to MPS")
print(f"Model device after check: {next(model.parameters()).device}")

MPS available: True
MPS built: True
Current device for model: mps:0

Model size: 3310.01M parameters
Training batch size: 4
Number of training examples: 13794
Model moved to MPS
Model device after check: mps:0


In [40]:
training_args = TrainingArguments(
    # Basic Training Settings
    output_dir="./llama2-medical-qa-peft",
    num_train_epochs=3,
    learning_rate=2e-4,
    
    # Batch Size and Memory Settings
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    
    # Optimization Settings
    optim="adamw_torch",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    
    # Saving and Evaluation - Make these match!
    save_strategy="steps",              # Changed to match eval_strategy
    eval_strategy="steps",              # named `evaluation_strategy` in older transformers releases
    eval_steps=100,
    save_steps=100,                     # Should be same as eval_steps
    save_total_limit=3,
    
    # Logging
    logging_strategy="steps",
    logging_steps=50,
    logging_first_step=True,
    
    # Hardware Settings
    use_mps_device=True,
    fp16=False,
    bf16=False,
    
    # Additional Training Features
    gradient_checkpointing=True,
    group_by_length=True,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    
    # Miscellaneous
    remove_unused_columns=True,
    disable_tqdm=False,
    report_to="none",
)
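
The batch settings above translate into an effective batch size and a number of optimizer steps. The sketch below is a rough estimate for a single device, assuming the 13,794 training examples reported elsewhere in this notebook; the exact rounding inside `Trainer` may differ by a step, and the helper name `training_schedule` is hypothetical:

```python
import math

def training_schedule(n_examples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Rough optimizer-step counts for a single-device run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

schedule = training_schedule(
    13794, per_device_batch=8, grad_accum=4, epochs=3, warmup_ratio=0.1
)
print(schedule)
```

Under these assumptions each optimizer step sees 32 examples, an epoch is about 432 steps, and the full run is about 1,296 steps with roughly 130 warmup steps. With `eval_steps=100` that corresponds to about 12 evaluation and checkpoint points over the run.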

In [41]:
from transformers import Trainer, TrainingArguments

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama2-medical-qa-peft",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=False,  # Disable fp16 for MPS
    bf16=False,  # Disable bf16 for MPS
    optim="adamw_torch",
    save_strategy="epoch",
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    logging_steps=100,
    use_mps_device=True,
    report_to="none"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_formatted["train"],
    eval_dataset=dataset_formatted["test"],
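    # Note: with mlm=False this collator rebuilds labels from input_ids and sets
    # pad-token positions to -100; since pad_token = eos_token here, EOS labels are masked too
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False  # We're doing causal language modeling, not masked
    )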
)

# Start training
trainer.train()

# Save the final model
trainer.save_model("./finetuned-medical-qa-model")

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

#### Setting up the parameter-efficient fine-tuning (PEFT) settings for the model

https://huggingface.co/blog/damjan-k/rslora\
https://medium.com/@fartypantsham/what-rank-r-and-alpha-to-use-in-lora-in-llm-1b4f025fd133
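
The first link above discusses rank-stabilized LoRA (rsLoRA). Standard LoRA multiplies the low-rank update by `alpha / r`, while rsLoRA uses `alpha / sqrt(r)`. A small sketch of what that means for the values used in this notebook (`r=64`, `lora_alpha=64`); the function name `lora_scaling` is hypothetical:

```python
import math

def lora_scaling(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank

standard = lora_scaling(64, 64)
stabilized = lora_scaling(64, 64, rank_stabilized=True)
print(f"standard LoRA scaling: {standard}")
print(f"rsLoRA scaling:        {stabilized}")
```

With the current settings the standard scaling is 1.0, whereas the rank-stabilized variant would scale the update by 8.0. In PEFT this variant is selected with `use_rslora=True` in `LoraConfig`; if it were switched on, `lora_alpha` would probably need to be revisited rather than reused as is.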

### Inference

In [None]:
def run_inference(model, tokenizer, question, max_length=512):
    # Format the input with chat template
    messages = [
        {"role": "system", "content": "You are a medical knowledge assistant trained to answer multiple choice questions. Please only provide the correct option's letter (e.g. A, B, C, etc) with no additional text or punctuation."},
        {"role": "user", "content": question}
    ]
    
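    # Apply chat template; add_generation_prompt=True appends the assistant header
    # so the model starts generating its answer directly
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    # Tokenize (the template already contains <|begin_of_text|>)
    inputs = tokenizer(
        prompt, 
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
        add_special_tokens=False
    ).to(model.device)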
    
    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,  # We only need a single letter as response
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    
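    # Decode only the newly generated tokens, not the prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    return response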

# Load the finetuned model and tokenizer
model_path = "./finetuned-medical-qa-model"  # path passed to trainer.save_model above, or wherever you saved your model
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_name)  # reuse the base model's tokenizer

# Make sure model is on MPS
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
model = model.to(device)

# Test on a few examples
print("Testing inference on examples:\n")
for idx in range(5):
    question = dataset['test']['question'][idx]
    true_answer = dataset['test']['gpt_response_base'][idx]
    
    print(f"Question {idx+1}:")
    print(question)
    print(f"\nTrue answer: {true_answer}")
    
    response = run_inference(model, tokenizer, question)
    print(f"Model response: {response}")
    print("\n" + "="*80 + "\n")

# Calculate accuracy on test set
from tqdm import tqdm  # progress bar used in the evaluation loop

def evaluate_model(model, tokenizer, test_dataset, num_examples=None):
    correct = 0
    total = 0
    
    examples = range(len(test_dataset)) if num_examples is None else range(num_examples)
    
    for idx in tqdm(examples, desc="Evaluating"):
        question = test_dataset['question'][idx]
        true_answer = test_dataset['gpt_response_base'][idx]
        
        response = run_inference(model, tokenizer, question)
        # Extract just the letter answer from response
        predicted_answer = response.strip()[-1] if response.strip() else ""
        
        if predicted_answer == true_answer:
            correct += 1
        total += 1
    
    accuracy = correct / total
    print(f"\nAccuracy on {total} test examples: {accuracy:.2%}")
    return accuracy

# Run evaluation on first 100 examples
evaluate_model(model, tokenizer, dataset['test'], num_examples=100)
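
Taking the last character of the response is fragile when the model adds extra text or whitespace. A slightly more robust option is to search the generated text for the first standalone option letter. This is a sketch under the assumption that options are labelled with single capital letters A to E; the helper name `extract_option_letter` is hypothetical:

```python
import re

def extract_option_letter(generated_text, valid_letters="ABCDE"):
    """Return the first standalone option letter in the text, or '' if none is found."""
    match = re.search(rf"\b([{valid_letters}])\b", generated_text.strip())
    return match.group(1) if match else ""

print(extract_option_letter("B"))
print(extract_option_letter("assistant\n\nC"))
print(extract_option_letter("no letter here"))
```

The pattern is case-sensitive, so lowercase words are ignored, but a capitalised article such as "A" at the start of a sentence would still be picked up. It therefore works best on short generations like the ones requested by the system prompt.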