# Fine-Tuning an LLM: Medical Test Performance (Before and After)

This notebook is a starter kit for fine-tuning. Fine-tuning has many useful applications, from changing the tone of an LLM to incorporating a specific vernacular or body of language, or having the LLM "specialize" in a certain area. Understanding it is therefore important for fully leveraging LLMs.

As an example, we will try to get the LLM to pass a medical entrance exam using the MedMCQA dataset. Its authors describe it as follows:

_MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions._

_MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity._

_Each sample contains a question, correct answer(s), and other options which require a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study._

_MedMCQA provides an open-source dataset for the Natural Language Processing community. It is expected that this dataset would facilitate future research toward achieving better QA systems. The dataset contains questions about the following topics:_

- The dataset can be found here: https://huggingface.co/datasets/medmcqa
- Papers with Code link: https://paperswithcode.com/dataset/medmcqa
- Dataset on Github for Download: https://github.com/MedMCQA/MedMCQA

We will use GPT-3.5 to keep things simple for now (no custom LLMs). OpenAI makes this process relatively straightforward.

The harder part, however, is assessing performance. OpenAI does provide some training metrics, such as token accuracy, but they aren't perfect: token accuracy doesn't tell us whether the LLM actually answered the question correctly.

The process we're going to use is a bit convoluted, and not perfect, but it should work relatively well. After all, this is a multiple-choice exam, so it should be easy for an LLM to see if another LLM made the right choice.

What we are going to do is the following: 

- Split the data into training and testing sets
- Fine-tune the LLM on the training set, which gives us two endpoints: one fine-tuned "expert" and one standard model
- Run the standard LLM on the test set and get the responses
- Since the fine-tuned LLM is an "expert" (or should be), we will use it as a "Grader" to grade answers as correct (True) or incorrect (False) using OpenAI functions to properly format the output as well. Remember, it's a multiple choice test, so this shouldn't be hard for the LLM to determine. 
- We will then grade the LLM and give it a score 
- Finally, we will run the fine-tuned LLM on the test-set
- We will have the LLM again grade itself using OpenAI functions to structure responses
- We will grade the fine-tuned LLM and give it a score
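
The grading steps above rely on function calling to force a structured True/False verdict. Below is a minimal sketch of what such a request could look like. The function name `record_grade`, its `is_correct` field, the grader prompt wording, and both helper functions are hypothetical and are not taken from this notebook. The sketch only builds the request arguments and parses a tool-call payload, so it runs without calling the API.

```python
import json

# Hypothetical tool schema: the grader must return a single boolean verdict.
GRADE_TOOL = {
    "type": "function",
    "function": {
        "name": "record_grade",  # assumed name, not from the notebook
        "description": "Record whether the student's answer matches the answer key.",
        "parameters": {
            "type": "object",
            "properties": {
                "is_correct": {
                    "type": "boolean",
                    "description": "True if the student chose the correct option(s).",
                }
            },
            "required": ["is_correct"],
        },
    },
}


def build_grader_request(deployment, question, answer_key, student_answer):
    """Build keyword arguments for client.chat.completions.create()."""
    user_content = (
        f"Question:\n{question}\n\n"
        f"Answer key:\n{answer_key}\n\n"
        f"Student answer:\n{student_answer}"
    )
    return {
        "model": deployment,
        "messages": [
            {"role": "system", "content": "You grade multiple-choice exam answers."},
            {"role": "user", "content": user_content},
        ],
        "tools": [GRADE_TOOL],
        # Force the model to call the grading function instead of replying in prose.
        "tool_choice": {"type": "function", "function": {"name": "record_grade"}},
    }


def parse_grade(arguments_json):
    """Parse the JSON string carried in a tool call's function arguments."""
    return bool(json.loads(arguments_json)["is_correct"])
```

Assuming the configured API version supports the `tools` parameter, the request would be sent with `client.chat.completions.create(**build_grader_request(...))`, and the verdict read from `response.choices[0].message.tool_calls[0].function.arguments` via `parse_grade`.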

Of course this is not perfect! We're having an LLM grade itself which is inherently flawed, but it is a multiple choice test, so it shouldn't be too hard for the LLM to determine if the LLM chose a, b, c, or d correctly. 

We'll give each LLM a "letter grade" based on its accuracy, like a B+ or an A for example.
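
As a rough illustration of the letter-grade idea, the sketch below turns a list of True/False verdicts into an accuracy and then into a grade. The cutoffs are an assumption (a common US-style scale), and the helper names `score` and `letter_grade` are hypothetical; the notebook itself does not specify them at this point.

```python
# Assumed cutoffs, highest first; adjust to whatever scale you prefer.
GRADE_CUTOFFS = [
    (0.97, "A+"), (0.93, "A"), (0.90, "A-"),
    (0.87, "B+"), (0.83, "B"), (0.80, "B-"),
    (0.77, "C+"), (0.73, "C"), (0.70, "C-"),
    (0.60, "D"),
]


def score(verdicts):
    """Return the fraction of True verdicts in a non-empty list of booleans."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(verdicts) / len(verdicts)


def letter_grade(accuracy):
    """Map an accuracy in [0, 1] to a letter grade using GRADE_CUTOFFS."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    for cutoff, grade in GRADE_CUTOFFS:
        if accuracy >= cutoff:
            return grade
    return "F"


# Example: three correct answers out of four.
example_accuracy = score([True, False, True, True])
print(example_accuracy, letter_grade(example_accuracy))
```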

Let's see how the LLM boosts its grade by "studying"!

In [94]:
# Import dependencies
import pandas as pd
import numpy as np 
import yaml
import json
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.model_selection import train_test_split
from openai import AzureOpenAI

In [95]:
# Load the configuration from config.yml
with open('config.yml', 'r') as config_file:
    config = yaml.safe_load(config_file)

# Extract the 'openai' section from the loaded configuration
openai_config = config.get('openai', {})

#Set up the OpenAI client 
client = AzureOpenAI(
  azure_endpoint = openai_config.get('api_base'), 
  api_key= openai_config.get('api_key'),  
  api_version= openai_config.get('api_version')
)

# Let's test to make sure everything is working
response = client.chat.completions.create(
    model="gpt-35-turbo", # model = "deployment_name".
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Do you enjoy fine-tuning LLMs? "},
    ]
)

# Print a sample response
print(response.choices[0].message.content)

As an AI language model, I don't have emotions or feelings. However, I am designed to perform text-based tasks such as fine-tuning LLMs, and I am always happy to assist with any tasks assigned to me.
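
The setup cell reads `api_base`, `api_key`, and `api_version` with `.get`, which quietly returns `None` when a key is absent. A small check like the one below could surface a missing key before the client is created. This is only a sketch: `require_keys` is a hypothetical helper, and the key names are simply the ones the cell above reads.

```python
REQUIRED_KEYS = ("api_base", "api_key", "api_version")


def require_keys(section, required=REQUIRED_KEYS):
    """Return the config section unchanged, or raise if a key is missing or empty."""
    missing = [key for key in required if not section.get(key)]
    if missing:
        raise KeyError(
            "config.yml 'openai' section is missing: " + ", ".join(missing)
        )
    return section
```

It could be applied as `openai_config = require_keys(config.get('openai', {}))` right after loading the file.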


In [96]:

# Initialize an empty list to store the JSON objects
data_list = []

# Read the JSON Lines file
with open('data/raw/train.json', 'r', encoding='utf-8') as file:
    for line in file:
        # Parse each line as a separate JSON object
        json_object = json.loads(line)
        # Append the JSON object to the list
        data_list.append(json_object)


In [97]:
df = pd.DataFrame(data_list) #Store as dataframe

In [98]:
df.head(10) #Let's explore dataframe

Unnamed: 0,question,exp,cop,opa,opb,opc,opd,subject_name,topic_name,id,choice_type
0,Chronic urethral obstruction due to benign pri...,Chronic urethral obstruction because of urinar...,3,Hyperplasia,Hyperophy,Atrophy,Dyplasia,Anatomy,Urinary tract,e9ad821a-c438-4965-9f77-760819dfa155,single
1,Which vitamin is supplied from only animal sou...,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...,3,Vitamin C,Vitamin B7,Vitamin B12,Vitamin D,Biochemistry,Vitamins and Minerals,e3d3c4e1-4fb2-45e7-9f88-247cc8f373b3,single
2,All of the following are surgical options for ...,"Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba...",4,Adjustable gastric banding,Biliopancreatic diversion,Duodenal Switch,Roux en Y Duodenal By pass,Surgery,Surgical Treatment Obesity,5c38bea6-787a-44a9-b2df-88f4218ab914,multi
3,Following endaerectomy on the right common car...,The central aery of the retina is a branch of ...,1,Central aery of the retina,Infraorbital aery,Lacrimal aery,Nasociliary aretry,Ophthalmology,,cdeedb04-fbe9-432c-937c-d53ac24475de,multi
4,Growth hormone has its effect on growth through?,"Ans. is 'b' i.e., IGI-1GH has two major functi...",2,Directly,IG1-1,Thyroxine,Intranuclear receptors,Physiology,,dc6794a3-b108-47c5-8b1b-3b4931577249,single
5,Scrub typhus is transmitted by: September 2004,Ans. C i.e. Mite,3,Louse,Tick,Mite,Milk,Social & Preventive Medicine,,5ab84ea8-12d1-47d4-ab22-668ebf01e64c,single
6,Abnormal vascular patterns seen with colposcop...,"Abnormal vascular pattern include punctation, ...",3,Punctation,Mosaicism,Satellite lesions,Atypical vessels,Gynaecology & Obstetrics,,a83de6e4-9427-4480-b404-d96621ebb640,multi
7,Per rectum examination is not a useful test fo...,PILONIDAL SINUS/DISEASE (Jeep Bottom; Driver's...,3,Anal fissure,Hemorrhoid,Pilonidal sinus,Rectal ulcer,Surgery,Urology,f3bf8583-231b-4b7a-828c-179b0f9ccdd9,single
8,Characteristics of Remifentanyl – a) Metabolis...,Remifentanil is the shortest acting opioid due...,3,ab,bc,abc,bcd,Anaesthesia,,73515f05-e947-4801-8077-3abdeca95c84,single
9,Hypomimia is ?,Ans. C. Deficit of expression by gestureHypomi...,3,Decreased ability to copy,Decreased execution,Deficit of expression by gesture,Deficit of fluent speech,Psychiatry,,53f79833-21b0-4336-8ef4-404c687ec807,single


In [99]:
print(df.shape) #  How many rows and columns? 

(182822, 11)


In [100]:
system_promt = '''You are a fine-tuned LLM taking a medical entrance exam with the goal of achieving as high of a score as possible. 
Please answer the questions as accurately as possible based on your medical knowledge. 
Answer the questions following the instructions specified.'''

multi_choice_prompt = '''This question is a multiple choice question. Therefore, select every option that is true and do not select any false answers. 
If the options are A, B, C, and D, select all options out of A, B, C, and D that are true. 
An example would be selecting A, C, and D, A and B, or just A. 
Here is the question: '''

single_choice_prompt = '''This question has a single-choice answer. Therefore, select only one answer that is true and none else. 
For example, if the options are A, B, C, and D, you may select only one of these (A, for example). 
An example would be selecting just C.
Here is the question: '''

In [101]:
df['system'] = system_promt
df['opa'] = 'Option A: ' + df['opa']
df['opb'] = 'Option B: ' + df['opb']
df['opc'] = 'Option C: ' + df['opc']
df['opd'] = 'Option D: ' + df['opd']

In [102]:
df_single = df[df['choice_type'] == 'single'].copy()
df_multi = df[df['choice_type'] == 'multi'].copy()

In [103]:
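# Build the user prompt from each subset's own columns: instructions, question, then the four options
df_single['user'] = single_choice_prompt + df_single['question'] + '\n' + df_single['opa'] + '\n' + df_single['opb'] + '\n' + df_single['opc'] + '\n' + df_single['opd']
df_multi['user'] = multi_choice_prompt + df_multi['question'] + '\n' + df_multi['opa'] + '\n' + df_multi['opb'] + '\n' + df_multi['opc'] + '\n' + df_multi['opd']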
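
To make the resulting prompt layout concrete, here is a plain-Python sketch of the same concatenation for a single row, using the scrub typhus question from the preview above. `build_user_prompt` is a hypothetical helper, and the prefix is a shortened stand-in for `single_choice_prompt`.

```python
# Shortened stand-in for single_choice_prompt (only its last line).
PROMPT_PREFIX = "Here is the question: "


def build_user_prompt(prefix, question, options):
    """Join prefix, question, and labelled options with newlines."""
    labels = ["A", "B", "C", "D"]
    option_lines = [
        f"Option {label}: {text}" for label, text in zip(labels, options)
    ]
    return prefix + question + "\n" + "\n".join(option_lines)


prompt = build_user_prompt(
    PROMPT_PREFIX,
    "Scrub typhus is transmitted by: September 2004",
    ["Louse", "Tick", "Mite", "Milk"],
)
print(prompt)
```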

In [104]:
df_single.to_csv('single_test.csv')
df_multi.to_csv('multi_test.csv')

In [105]:
df_final = pd.concat([df_single, df_multi], axis=0, ignore_index=True)
df_final.isnull().sum()

question            0
exp             21953
cop                 0
opa                 0
opb                 0
opc                 0
opd                 0
subject_name        0
topic_name      95613
id                  0
choice_type         0
system              0
user                0
dtype: int64

In [106]:
df_final = df_final.dropna(subset=['exp'])
df_final.isnull().sum()

question            0
exp                 0
cop                 0
opa                 0
opb                 0
opc                 0
opd                 0
subject_name        0
topic_name      73792
id                  0
choice_type         0
system              0
user                0
dtype: int64

In [107]:
# Keep only the chat fields and rename the explanation column to 'assistant'
df_final = df_final[['system', 'user', 'exp']].rename(columns={'exp': 'assistant'})

In [108]:
df_final.head(5)

Unnamed: 0,system,user,assistant
0,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Chronic urethral obstruction because of urinar...
1,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...
2,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,"Ans. is 'b' i.e., IGI-1GH has two major functi..."
3,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Ans. C i.e. Mite
4,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,PILONIDAL SINUS/DISEASE (Jeep Bottom; Driver's...


In [109]:
# First, split into training (80%) and temp (20%)
df_train, df_temp = train_test_split(df_final, test_size=0.2, random_state=42)

# Then, split temp in half into validation and test sets (each 10% of the original)
df_validation, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)
# df_train now has 80% of the data, df_validation has 10%, and df_test has 10%
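
The proportions of this two-stage split can be checked with a little arithmetic. The sketch below uses the row count left after dropping missing explanations (182,822 rows minus 21,953 nulls, per the outputs above). The printed row counts are approximate, since the exact rounding is left to `train_test_split`.

```python
n_rows = 182822 - 21953  # rows remaining after dropna(subset=['exp'])

temp_fraction = 0.2          # test_size of the first split
test_share_of_temp = 0.5     # test_size of the second split

test_fraction = temp_fraction * test_share_of_temp
fractions = {
    "train": 1 - temp_fraction,
    "validation": temp_fraction - test_fraction,
    "test": test_fraction,
}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%} (about {round(n_rows * fraction):,} rows)")
```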


In [110]:
# Define a dictionary of dataframes
dfs = {
    'train': df_train,
    'test': df_test,
    'validation': df_validation
}

# Loop through the dictionary and write each dataframe to a JSONL file
for set_name, df_set in dfs.items():
    file_path = f'data/processed/fine_tune_{set_name}.jsonl'
    with open(file_path, 'w') as outfile:
        for index, row in df_set.iterrows():
            entry = {
                "messages": [
                    {"role": "system", "content": row['system']},
                    {"role": "user", "content": row['user']},
                    {"role": "assistant", "content": row['assistant']}
                ]
            }
            json_line = json.dumps(entry) + "\n"  # Convert the dictionary to a JSON string and add a newline
            outfile.write(json_line)
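
Before uploading the files for fine-tuning, it can be worth reading them back to confirm that every line parses and has the expected system/user/assistant structure. The sketch below is one way to do that. `validate_jsonl` is a hypothetical helper, and the demo writes a throwaway file rather than touching `data/processed/`.

```python
import json
import os
import tempfile

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with open(path, "r", encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            record = json.loads(line)  # raises on malformed JSON
            messages = record.get("messages", [])
            roles = [message.get("role") for message in messages]
            if roles != EXPECTED_ROLES:
                raise ValueError(f"line {line_number}: unexpected roles {roles}")
            for message in messages:
                content = message.get("content")
                if not isinstance(content, str) or not content:
                    raise ValueError(f"line {line_number}: empty or non-string content")
            count += 1
    return count


# Demo on a temporary file with one made-up record.
sample = {
    "messages": [
        {"role": "system", "content": "You are taking an exam."},
        {"role": "user", "content": "Question text"},
        {"role": "assistant", "content": "Explanation text"},
    ]
}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        outfile.write(json.dumps(sample) + "\n")
    valid_count = validate_jsonl(demo_path)
print(valid_count)
```

Running `validate_jsonl` on each `fine_tune_*.jsonl` file would catch rows where the explanation was empty or not a string before they reach the fine-tuning job.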
