#### **Fine-tuning Llama3 model on ChatDoctor dataset using PEFT, LORA and SFTTrainer**

The target of this project is to fine-tune Llama3 to generate physician-like responses to patient's queries. Llama3 will first be fine-tuned on the ChatDoctor dataset, which contains several thousand patient-physician interactions.

We would like Llama3 responses to be comparable in the quality of medical advice.

#### Imports and initializations

In [1]:
import json
import re
from pprint import pprint
from src import process_text, utils

import pandas as pd
import torch
from datasets import Dataset, load_dataset, DatasetDict
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
import logging

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
MODEL_L3 = "meta-llama/Meta-Llama-3-8B-Instruct"

logging.basicConfig(level=logging.INFO)

#### **Load Dataset**

We load 3% of the 'lavita/ChatDoctor-HealthCareMagic-100k' dataset, then split it into train/val and test splits. 

*Note*: Loading the entire dataset and using it to fine-tune the Llama3 model can improve the fine-tuned model but model+data might not fit into our GPU RAM.

In [30]:
dataset = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split='train[:3%]')
dev_test = dataset.train_test_split(test_size=0.2)
train_valid = dev_test['train'].train_test_split(test_size=0.2)

dataset = DatasetDict({
    "train":train_valid['train'],
    "validation":train_valid['test'],
    "test":dev_test['test'],
})


#### **Example**
Let us look at an example patient-physician interaction

In [41]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)

example = dataset['train'][0]
for key in example.keys():
    print(f"\n{key}")
    lines = text_splitter.split_text(example[key])
    for line in lines:
        print(line)


instruction
If you are a doctor, please answer the medical questions based on the patient's description.

input
yes my son n law was puching my daughter in stomach to abort baby my daughter had been haing severe
pain for 2 weeks through her body took her to 2 hospitals they refused to treat her due to her
being pregnant so he calls said she need to go to doctor i got her to a hospital they took her in
was not for sure what was going on and after she passed they said endocarditis which after she was
buried when we find he been punching her.the baby did not have fluid around it,my daughter had
infection over her entire body was coughing up blood chest pains rapid heart rate of 228do u think
him punching her could cause the endocarditis?

output
Hi and pleased to answer you. Throughout pregnancy, the fetus bathes in a translucent pouch that
gradually fills with amniotic fluid until the 35th week of amenorrhea where the amount of fluid is
at its maximum with about 980 ml. Then the amnioti

#### Convert data points to alpaca format

In [4]:
def to_alpaca(data_point, deploy=False):
    COMMAND = "You are a doctor. Answer the following query by a patient."

    #a_instruction = data_point['instruction']
    a_input = data_point['input']
    a_response = data_point['output']

    if deploy:
        training_prompt = f"""
            ### Instruction:{COMMAND}
            ### Input:{a_input}
            ### Response:
            """.strip()
        example = {
            "question":a_input,
            "answer": a_response, 
            "text": training_prompt
            }
    else:
        training_prompt = f"""
            ### Instruction:{COMMAND}
            ### Input:{a_input}
            ### Response:{a_response}
            """.strip()
        example = {
            "question":a_input,
            "answer": a_response, 
            "text": training_prompt
            }

    return example

for key in ['train', 'validation', 'test']:
    dataset[key] = dataset[key].shuffle(seed=42).map(to_alpaca)
    #.remove_columns(['input', 'output'])


Map:   0%|          | 0/2153 [00:00<?, ? examples/s]

Map:   0%|          | 0/539 [00:00<?, ? examples/s]

Map:   0%|          | 0/673 [00:00<?, ? examples/s]

#####  **Model Creation** 
Create and initialize the model, tokenizer with PEFT, LORA and BitsAndBytes

In [6]:
# Model and tokenizer
llm3, tokenizer = utils.create_model_and_tokenizer(MODEL_L3)
lora_r = 16
lora_alpha = 32
lora_dropout = 0.1
lora_target_modules = ["q_proj", "up_proj", "o_proj", "k_proj", "down_proj", "gate_proj","v_proj"]
 
# LORA and training arguments
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none",
    task_type="CAUSAL_LM",
)

INFO:root:Creating model meta-llama/Meta-Llama-3-8B-Instruct..


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

#### **Model Training**

In [7]:
OUTPUT_DIR = "experiments/text_classification"

training_arguments = TrainingArguments(
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="steps",
    save_steps=200,
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)

# Trainer
trainer = SFTTrainer(
    model=llm3,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/2153 [00:00<?, ? examples/s]

Map:   0%|          | 0/539 [00:00<?, ? examples/s]

In [None]:
from peft import AutoPeftModelForCausalLM

trainer.train()
trainer.save_model()

# Official method: Saves adapted and base separately
base_model_name = MODEL_L3
adapter_model_name = OUTPUT_DIR

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)


# Method 2: Saves merged model (which is huge)
'''model = PeftModel.from_pretrained(model, OUTPUT_DIR)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")'''

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
216,2.3042,2.333785
432,2.1572,2.2723
648,2.0763,2.255844
864,1.9946,2.242996


: 

#### **Deployment**
Deploy the fine-tuned model on test set and check output

In [8]:
# Load the base model and fine-tuned model

OUTPUT_DIR = "experiments/text_classification/checkpoint-1000"
base_model_name = MODEL_L3
adapter_model_name = OUTPUT_DIR

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    #bnb_4bit_compute_dtype="bf16",
)

model = AutoModelForCausalLM.from_pretrained(base_model_name, 
    device_map="auto",
    use_safetensors=True,
    quantization_config=bnb_config,
    trust_remote_code=True)

#model = PeftModel.from_pretrained(model, adapter_model_name)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [10]:
model = PeftModel.from_pretrained(model, adapter_model_name)


INFO:peft.tuners.tuners_utils:Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!


In [25]:
examples = []
llm_answers = []

# used for inference
def summarize(model, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0001)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)

# run model on some examples from test set
for data_point in dataset["test"].select(range(5)):
    examples.append(to_alpaca(data_point, deploy=True))

test_df = pd.DataFrame(examples)

for idx in range(5):
    example = test_df.iloc[idx]
    llm_answers.append(summarize(model, example.text)) # have to make sure what is returned is answer, not entire output
#print(answer)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#### **Llama3 Response vs Physician's Response**

Let us evaluate the performance of Llama3 fine-tuned model against physician's response.

In [29]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
q_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)

for i in range(5):

    example = test_df.iloc[i]
    print("\nPatient's Question:")
    for q in q_splitter.split_text(example.question):
        print(q)

    print("\nAI response:")
    for line in text_splitter.split_text(llm_answers[i]):
        print(line)

    print("\nPhysician's response:")
    for line in text_splitter.split_text(example.answer):
        print(line)

Patient's Question:
I am a male 79 years old, have brief chills the may sweat have had for about 2 months also white
blood cell count to high had bone marrow test and ok so far dr has no idea what is wrong when I
chill have no fever as I check that and blood pressure. thanks

AI response:
Thank you for sharing your symptoms with me. It's good to know that your blood work and bone marrow
test came back normal.
Based on your symptoms of brief chills and sweating, I would like to investigate further. I would
like to perform a thorough physical examination, including checking your vital signs, such as your
blood pressure, pulse, and respiratory rate.
Additionally, I would like to order some additional tests to help us better understand what's going
on. These tests may include a complete blood count, a urinalysis, and possibly some imaging
studies, such as a chest X-ray or a computed tomography (CT) scan.
Please let me know if you have any questions or concerns about these tests or any othe