## Fine Tuning LLM (Phi3/Phi2) on Custom Dataset

This notebook demonstrates how to fine-tune a pre-trained LLM (Phi3/Phi2) model on a custom dataset using the Hugging Face Transformers library.

Plan of Attack:
- Data Analysis
- Model Loading
- Parameter Efficient Fine-Tuning (PEFT)
  - QLORA (8-bit) [4-bit QLORA is covered in next section]
- Model Training
- Model Save and Load

## LLM Fine-Tuning
- Language Modelling
- Supervised Fine Tuning (SFT)
- Preference Fine Tuning


In [1]:
!pip install virtualenv
!virtualenv finetune

Collecting virtualenv
  Downloading virtualenv-20.30.0-py3-none-any.whl.metadata (4.5 kB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.9-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading virtualenv-20.30.0-py3-none-any.whl (4.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m42.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading distlib-0.3.9-py2.py3-none-any.whl (468 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m469.0/469.0 kB[0m [31m35.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: distlib, virtualenv
Successfully installed distlib-0.3.9 virtualenv-20.30.0
created virtual environment CPython3.11.12.final.0-64 in 928ms
  creator CPython3Posix(dest=/content/finetune, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==2

In [2]:
!source /content/finetune/bin/activate

In [3]:
!pip install uv

Collecting uv
  Downloading uv-0.6.14-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.6.14-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.9/16.9 MB[0m [31m53.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: uv
Successfully installed uv-0.6.14


In [None]:
!uv pip install accelerate bitsandbytes trl peft transformers datasets huggingface_hub[hf_xet]

### 2. Load Dataset

In [6]:
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict

# Login using e.g. `huggingface-cli login` to access this dataset
df = pd.read_json("hf://datasets/UCSC-VLAA/MedReason/ours_quality_33000.jsonl", lines=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
df=df.head(200)

In [8]:
df.head()

Unnamed: 0,dataset_name,id_in_dataset,question,answer,reasoning,options
0,medmcqa,7131,Urogenital Diaphragm is made up of the followi...,Colle's fascia. Explanation: Colle's fascia do...,Finding reasoning paths:\n1. Urogenital diaphr...,Answer Choices:\nA. Deep transverse Perineus\n...
1,medmcqa,7133,Child with Type I Diabetes. What is the advise...,After 5 years. Explanation: Screening for diab...,**Finding reasoning paths:**\n\n1. Type 1 Diab...,Answer Choices:\nA. After 5 years\nB. After 2 ...
2,medmcqa,7134,Most sensitive test for H pylori is-,Biopsy urease test. Explanation: <P>Davidson&;...,**Finding reasoning paths:**\n\n1. Consider th...,Answer Choices:\nA. Fecal antigen test\nB. Bio...
3,medmcqa,7137,Ligation of the common hepatic aery will compr...,Right gastric and right gastroepiploic aery. E...,**Finding reasoning paths:**\n\n1. Common hepa...,Answer Choices:\nA. Right and Left gastric aer...
4,medmcqa,7138,Typhoid investigation of choice in 1st week,Blood culture. Explanation: (A) Blood culture ...,Finding reasoning paths:\n\n1. Consider the pa...,Answer Choices:\nA. Blood culture\nB. Widal te...


In [10]:
df.tail()

Unnamed: 0,dataset_name,id_in_dataset,question,answer,reasoning,options
195,medmcqa,7464,Spleen is derived from -,Dorsal mesogastrium. Explanation: Ans. is 'b' ...,**Finding reasoning paths:**\n\n1. The spleen ...,Answer Choices:\nA. Ventral mesogastrium\nB. D...
196,medmcqa,7467,Best way to confirm that no stones are left ba...,"Cholangiogram. Explanation: Ans. is 'c' i.e., ...",Finding reasoning paths:\n\n1. Consider the an...,Answer Choices:\nA. Choledochoscope\nB. Palpat...
197,medmcqa,7470,Which of the following is a precancerous condi...,Chronic gastric atrophy. Explanation: Premalig...,Finding reasoning paths:\n\n1. Autoimmune pern...,Answer Choices:\nA. Peptic ulcer\nB. Chronic g...
198,medmcqa,7471,Following are the normal features in temporoma...,Pain while opening the mouth,Finding reasoning paths:\n1. Temporomandibular...,Answer Choices:\nA. Joint sound\nB. Pain while...
199,medmcqa,7472,Confirmatory test for syphilis -,FTA-ABS,**Finding reasoning paths:**\n\n1. Understand ...,Answer Choices:\nA. VDRL\nB. FTA-ABS\nC. RPQ\n...


In [16]:
dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=0)
dataset = dataset.train_test_split(test_size=0.1)

In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['dataset_name', 'id_in_dataset', 'question', 'answer', 'reasoning', 'options'],
        num_rows: 180
    })
    test: Dataset({
        features: ['dataset_name', 'id_in_dataset', 'question', 'answer', 'reasoning', 'options'],
        num_rows: 20
    })
})

In [18]:
dataset['test'][10]

{'dataset_name': 'medmcqa',
 'id_in_dataset': 7346,
 'question': 'Magnesium sulphate potentiates the hypotensive action of -',
 'answer': 'Nifedipine. Explanation: ans-B',
 'reasoning': "Finding reasoning paths:\n1. Magnesium sulfate -> Nifedipine -> Decreased systolic blood pressure\n2. Magnesium sulfate -> Calcium channel blockers (e.g., Amlodipine, Nisoldipine, Diltiazem) -> Decreased systolic blood pressure\n\nReasoning Process:\n1. **Understanding Magnesium Sulfate's Role**: Magnesium sulfate is known to have vasodilatory effects, which can lead to a decrease in blood pressure. It can also interact with other medications to enhance their hypotensive effects.\n\n2. **Exploring Nifedipine**: Nifedipine is a calcium channel blocker that is commonly used to treat hypertension. It works by relaxing the blood vessels, which lowers blood pressure. Magnesium sulfate can potentiate the effects of nifedipine by further relaxing the vascular smooth muscle, leading to a more pronounced decrea

## Load Base Model and Prepare Formatting

Lets load phi2 model and tokenize text data with formatting

In [22]:
df.head()

Unnamed: 0,dataset_name,id_in_dataset,question,answer,reasoning,options
0,medmcqa,7131,Urogenital Diaphragm is made up of the followi...,Colle's fascia. Explanation: Colle's fascia do...,Finding reasoning paths:\n1. Urogenital diaphr...,Answer Choices:\nA. Deep transverse Perineus\n...
1,medmcqa,7133,Child with Type I Diabetes. What is the advise...,After 5 years. Explanation: Screening for diab...,**Finding reasoning paths:**\n\n1. Type 1 Diab...,Answer Choices:\nA. After 5 years\nB. After 2 ...
2,medmcqa,7134,Most sensitive test for H pylori is-,Biopsy urease test. Explanation: <P>Davidson&;...,**Finding reasoning paths:**\n\n1. Consider th...,Answer Choices:\nA. Fecal antigen test\nB. Bio...
3,medmcqa,7137,Ligation of the common hepatic aery will compr...,Right gastric and right gastroepiploic aery. E...,**Finding reasoning paths:**\n\n1. Common hepa...,Answer Choices:\nA. Right and Left gastric aer...
4,medmcqa,7138,Typhoid investigation of choice in 1st week,Blood culture. Explanation: (A) Blood culture ...,Finding reasoning paths:\n\n1. Consider the pa...,Answer Choices:\nA. Blood culture\nB. Widal te...


In [24]:
def formatting_func(example):
    text = f"""
            Given the question and option generate a answer and reasoning.
            ### question: {example['question']}
            ### option: {example['options']}

            ### answer with reasoning: {example['answer']} and {example['reasoning']}

            """
    return text

In [25]:
print(formatting_func(dataset['train'][100]))


            Given the question and option generate a answer and reasoning.
            ### question: True about Barret's esophagus -a) Premalignantb) Predispose to sq. cell Cac) Can be diagnosed by seeing under endoscoped) Biopsy is necessary to diagnosee) Stricture may be present in high esophagus
            ### option: Answer Choices:
A. bce
B. bde
C. abcd
D. acde

            ### answer with reasoning: acde. Explanation: Diagnosis of Barrett's esophagus
- The diagnosis of Barrett's esophagus is suspected on endoscopy when there is difficulty in visualizing the squamo-columnar junction at its normal location and by the appearance of pink, more luxuriant columnar mucosa in the lower esophagus instead of gray-pink squamous mucosa.
-  the diagnosis is confirmed by biopsy.

Strictures in Barrett's esophagus occur at the squamo-columnar junction and move high up as the squamo-columnar junction moves up with progressive injury. and Finding reasoning paths:
1. Squamous epithelium is repla

## Load Base Model and Tokenize

In [26]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"

model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True,
                                             torch_dtype=torch.float16, load_in_8bit=True)


config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [27]:
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_size='left',
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)

tokenizer.pad_token = tokenizer.eos_token



tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

In [28]:
max_length = 400

def tokenize(prompt):
  result = tokenizer(
      formatting_func(prompt),
      truncation = True,
      max_length=max_length,
      padding = "max_length"
  )

  result['labels'] = result['input_ids'].copy()

  return result

In [29]:
print(tokenize(dataset['train'][0]))

{'input_ids': [50256, 198, 50276, 15056, 262, 1808, 290, 3038, 7716, 257, 3280, 290, 14607, 13, 198, 50276, 21017, 1808, 25, 317, 5827, 10969, 284, 262, 6253, 351, 2566, 489, 24464, 618, 2045, 3371, 262, 826, 13, 1550, 12452, 11, 339, 318, 5906, 284, 1445, 465, 826, 4151, 1568, 453, 1613, 262, 3095, 1370, 13, 8995, 284, 543, 286, 777, 25377, 561, 4439, 428, 8668, 10470, 30, 198, 50276, 21017, 3038, 25, 23998, 10031, 1063, 25, 198, 32, 13, 2275, 646, 1087, 16384, 198, 33, 13, 8498, 354, 3238, 16384, 198, 34, 13, 13123, 291, 16384, 198, 35, 13, 440, 3129, 296, 20965, 16384, 628, 50276, 21017, 3280, 351, 14607, 25, 2275, 646, 1087, 16384, 13, 50125, 341, 25, 21234, 21662, 341, 318, 286, 22753, 39630, 286, 530, 4151, 357, 3506, 4151, 8, 288, 14, 83, 2465, 286, 13889, 16384, 13, 406, 10534, 13621, 385, 318, 1582, 47557, 290, 27063, 14607, 13532, 25, 198, 16, 13, 6031, 489, 24464, 618, 2045, 284, 262, 826, 5644, 281, 2071, 351, 262, 12749, 393, 25377, 12755, 4151, 3356, 13, 198, 17, 13, 383,

In [30]:
dataset = dataset.map(tokenize)

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

## How Does the Base Model Do Out of the Box?

In [31]:
eval_prompt = """
Given the question and option generate a answer and reasoning.
### question : Urogenital Diaphragm is made up of the following, except:
### option : Answer Choices:
                A. Deep transverse Perineus
                B. Perinial membrane
                C. Colle's fascia
                D. Sphincter Urethrae

### answer with reasoning:
"""

In [32]:
# tokenize -> generate -> decode

model_input = tokenizer(
      eval_prompt,
      truncation = True,
      max_length=max_length,
      padding = "max_length",
      return_tensors='pt'
  ).to("cuda")


In [33]:
model.eval()

with torch.no_grad():
  output = model.generate(**model_input, max_new_tokens=256,
                                           repetition_penalty=1.15)
  result = tokenizer.decode(output[0], skip_special_tokens=True)

  print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Given the question and option generate a answer and reasoning.
### question : Urogenital Diaphragm is made up of the following, except:
### option : Answer Choices:
                  A. Deep transverse Perineus
                  B. Perinial membrane
                  C. Colle's fascia
                  D. Sphincter Urethrae

### answer with reasoning:



Possible rewrite:

# Question
Urogenital diaphragm is a muscular structure that separates the urinary bladder from the vagina in females. It consists of several layers of muscles and connective tissues. Which one of these structures is not part of the urogenital diaphragm?

- A. Deep transverse perineum (a thick band of muscle at the bottom of the pelvis)
- B. Perinial membrane (a thin layer of tissue that covers the opening of the urethra)
- C. Colle's fascia (a sheet of fibrous tissue that supports the pelvic organs)
- D. Sphincter urethrae (a ring of smooth muscle that controls the flow of urine out of the body)

Choose the correct

## LORA Config
- Lets configure 8-bit QLORA Config

In [34]:
from peft import LoraConfig, get_peft_model

target_modules = ["Wqkv", "fc1", "fc2"]

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules = target_modules,
    bias = "none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)



In [35]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [36]:
print_trainable_parameters(model)

trainable params: 26214400 || all params: 2805898240 || trainable%: 0.9342605382581515


## Model Training

In [37]:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=1)

model = accelerator.prepare_model(model)

In [38]:
# Trainer, Training Arguments, DataCollator

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datetime import datetime

project = "phi2-finetune"
run_name = 'train-dir'
output_dir = "./" + run_name

args=TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        max_steps=500,
        learning_rate=2.5e-5, # Want a small lr for finetuning
        optim="paged_adamw_8bit",
        logging_steps=25,              # When to start reporting loss
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save the model checkpoint every logging step
        save_steps=25,                # Save checkpoints every 50 steps
        eval_strategy="steps", # Evaluate the model every logging step
        eval_steps=25,               # Evaluate and save checkpoints every 50 steps
        do_eval=True,                # Perform evaluation at the end of training
    )

trainer = Trainer(
    model=model,
    args = args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmyemail-subrata[0m ([33mmyemail-subrata-ey[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss
25,1.9989,1.828079
50,1.9117,1.77502
75,1.7677,1.706166
100,1.773,1.640676
125,1.6985,1.578358
150,1.7706,1.529205
175,1.5466,1.489707
200,1.5834,1.461608
225,1.5986,1.442981
250,1.4552,1.425368


TrainOutput(global_step=500, training_loss=1.6000532913208008, metrics={'train_runtime': 573.9111, 'train_samples_per_second': 1.742, 'train_steps_per_second': 0.871, 'total_flos': 6419582976000000.0, 'train_loss': 1.6000532913208008, 'epoch': 5.555555555555555})

## Lets Try Tained Model | Load PEFT Model
By default, the PEFT library will only save the QLoRA adapters, so we need to first load the base model from the Huggingface Hub:

Process-> Load base model,  then merge the PEFT Model

In [39]:
import torch
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
    trust_remote_code=True,
    use_fast=False
)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [40]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, '/content/train-dir/checkpoint-500')

In [44]:
eval_prompt = """
Given the question and option generate a answer and reasoning.
### question : Urogenital Diaphragm is made up of the following, except:
### option : Answer Choices:
                A. Deep transverse Perineus
                B. Perinial membrane
                C. Colle's fascia
                D. Sphincter Urethrae

### answer with reasoning:
"""

# Move model_input to the GPU
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
  output = ft_model.generate(**model_input, max_new_tokens=400,
                                           repetition_penalty=1.15)
  result = eval_tokenizer.decode(output[0], skip_special_tokens=True)

  print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Given the question and option generate a answer and reasoning.
### question : Urogenital Diaphragm is made up of the following, except:
### option : Answer Choices:
                  A. Deep transverse Perineus
                  B. Perinial membrane
                  C. Colle's fascia
                  D. Sphincter Urethrae

### answer with reasoning:

The correct option is D. Sphincter Urethrae. Explanation: The urogenital diaphragm (UD) is an anatomical structure that separates the urinary bladder from the vagina in females. It consists of two layers: the external sphincter muscle, which surrounds the urethra and controls urination; and the internal sphincter muscle, which forms part of the pelvic floor muscles and supports the uterus and rectum. The other options are parts of the female reproductive system or connective tissue.



Possible additional problems and solutions:

Problem 1: Which of the following organs is not covered by the urogenital diaphragm?

Answer Choices:

- A. 

In [43]:
!zip -r phi2_qlora_adapter.zip /content/train-dir/checkpoint-500


  adding: content/train-dir/checkpoint-500/ (stored 0%)
  adding: content/train-dir/checkpoint-500/tokenizer_config.json (deflated 93%)
  adding: content/train-dir/checkpoint-500/added_tokens.json (deflated 84%)
  adding: content/train-dir/checkpoint-500/rng_state.pth (deflated 25%)
  adding: content/train-dir/checkpoint-500/special_tokens_map.json (deflated 75%)
  adding: content/train-dir/checkpoint-500/merges.txt (deflated 53%)
  adding: content/train-dir/checkpoint-500/adapter_config.json (deflated 53%)
  adding: content/train-dir/checkpoint-500/vocab.json (deflated 68%)
  adding: content/train-dir/checkpoint-500/training_args.bin (deflated 52%)
  adding: content/train-dir/checkpoint-500/README.md (deflated 66%)
  adding: content/train-dir/checkpoint-500/adapter_model.safetensors (deflated 7%)
  adding: content/train-dir/checkpoint-500/scheduler.pt (deflated 56%)
  adding: content/train-dir/checkpoint-500/optimizer.pt (deflated 11%)
  adding: content/train-dir/checkpoint-500/traine