# Fine-Tuning an LLM: Medical Test Performance (Before and After)

This notebook is a starter kit for fine-tuning. Fine-tuning has many useful applications: changing the tone of an LLM, teaching it a specific vernacular or body of language, or having it "specialize" in a certain area. Understanding it is therefore important for fully leveraging LLMs.

As an example, we're going to try to get the LLM to pass a medical entrance exam using the MedMCQA dataset. See the description:

_MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions._

_MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity._

_Each sample contains a question, correct answer(s), and other options which require a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study._

_MedMCQA provides an open-source dataset for the Natural Language Processing community. It is expected that this dataset would facilitate future research toward achieving better QA systems. The dataset contains questions about the following topics:_

- The dataset can be found here: https://huggingface.co/datasets/medmcqa
- Papers with Code link: https://paperswithcode.com/dataset/medmcqa
- Dataset on Github for Download: https://github.com/MedMCQA/MedMCQA

We are going to use GPT-3.5 for this purpose to keep things simple for now (no custom LLMs). OpenAI makes this process relatively straightforward.

The harder part, however, is assessing performance. OpenAI does provide some metrics, such as token accuracy, but they aren't perfect: they don't tell us whether the LLM really answered the question correctly.

The process we're going to use is a bit convoluted and not perfect, but it should work relatively well. After all, this is a multiple-choice exam, so it should be easy for an LLM to see if another LLM made the right choice.

What we are going to do is the following: 

- Split the data into training and testing sets
- Fine-tune the LLM on the training set, which gives us two endpoints: a fine-tuned "expert" and the standard model
- Run the standard LLM on the test set and collect the responses
- Since the fine-tuned LLM is an "expert" (or should be), use it as a "Grader" to mark answers as correct (True) or incorrect (False), using OpenAI functions to properly format the output. Remember, it's a multiple-choice test, so this shouldn't be hard for the LLM to determine.
- Grade the standard LLM and give it a score
- Finally, run the fine-tuned LLM on the test set
- Have the LLM grade itself, again using OpenAI functions to structure the responses
- Grade the fine-tuned LLM and give it a score

Of course this is not perfect! We're having an LLM grade itself, which is inherently flawed, but it is a multiple-choice test, so it shouldn't be too hard for the LLM to determine whether the LLM chose a, b, c, or d correctly.

We'll give each LLM a "letter grade" based on its accuracy, like a B+ or an A for example.

Let's see how the LLM boosts its grade by "studying"!
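To make the "letter grade" idea concrete, here is a minimal sketch of how a list of True/False grades could be turned into an accuracy and then a letter. The cutoffs in `GRADE_CUTOFFS` are a hypothetical, US-style scale chosen only for illustration, and the helper names are not part of the notebook.

```python
def accuracy(graded):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    return sum(graded) / len(graded) if graded else 0.0


# Hypothetical cutoffs (assumption): minimum accuracy needed for each letter.
GRADE_CUTOFFS = [
    (0.97, "A+"), (0.93, "A"), (0.90, "A-"),
    (0.87, "B+"), (0.83, "B"), (0.80, "B-"),
    (0.77, "C+"), (0.73, "C"), (0.70, "C-"),
    (0.60, "D"),
]


def letter_grade(acc):
    """Return the first letter whose cutoff the accuracy meets, else F."""
    for cutoff, letter in GRADE_CUTOFFS:
        if acc >= cutoff:
            return letter
    return "F"


graded = [True, True, False, True]  # made-up grader outputs
print(accuracy(graded), letter_grade(accuracy(graded)))
```

Any monotone mapping would do here; the point is only that the grader's booleans reduce to a single number before a letter is assigned.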

### Import dependencies

In [2]:
# Standard library imports
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

from requests.exceptions import RequestException, Timeout


# Related third party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import yaml
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# OpenAI client library
import openai
from openai import OpenAI



### Read in configs and test OpenAI 

Here we're just testing to make sure we can call out to OpenAI and our configs are working. 

In [80]:
# Load the configuration from config.yml
with open('config.yml', 'r') as config_file:
    config = yaml.safe_load(config_file)

# Extract the 'openai' section from the loaded configuration
openai_config = config.get('openai_personal', {})
client = OpenAI(api_key=openai_config.get('api_key'))

#Call out to OpenAI 
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

#Print response 
print(completion.choices[0].message)


ChatCompletionMessage(content="In a realm of algorithms, mysterious and vast,\nLies a notion that echoes from the distant past.\nIt's a concept called recursion, powerful and wise,\nUnraveling complexities before our very eyes.\n\nLike a mirrored reflection, a loop within a loop,\nRecursion dives deep, to a code's very root.\nWith elegance and grace, it dances and twirls,\nA concept that challenges, yet it unfurls.\n\nIn the heart of a function, it finds its repose,\nCalling itself, as the story it unfolds.\nLike a fractal expanding, never to cease,\nRecursion wanders, embracing inner peace.\n\nFor each call that's made, a new world is born,\nNesting and nesting, yet never torn.\nUnraveling problems, layer by layer,\nRecursion unveils, a solution so fair.\n\nDown and down it delves, levels unfound,\nTackling complexity, turning it around.\nInfinite iterations, a mind-bending song,\nRecursion's journey, forever lifelong.\n\nWith every recursive step, a problem finds its end,\nBreaking d
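The cell above expects `config.yml` to have a top-level `openai_personal` section containing an `api_key` entry. As a hedged sketch, a small helper could fall back to an environment variable when that entry is missing and fail loudly otherwise. The helper name `resolve_api_key` and the fallback behavior are assumptions for illustration, not something the notebook does.

```python
import os


def resolve_api_key(config, section="openai_personal",
                    env_var="OPENAI_API_KEY", environ=None):
    """Return the API key from the parsed config, else from the environment.

    `config` is the dict produced by yaml.safe_load; `environ` can be
    overridden with a plain dict (useful for testing).
    """
    environ = os.environ if environ is None else environ
    key = (config.get(section) or {}).get("api_key") or environ.get(env_var)
    if not key:
        raise ValueError(
            f"No API key found in config section '{section}' or ${env_var}"
        )
    return key


# Example with an in-memory config shaped like the one the notebook reads.
print(resolve_api_key({"openai_personal": {"api_key": "sk-example"}}))
```

With this in place the client could be created as `OpenAI(api_key=resolve_api_key(config))`.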

### Read in the training data

Here we are reading in the training data, which is what we really care about - it has the answers in it. We're going to split this into a training and validation file and basically "test" on the validation file. So let's do that. 

To do this we'll read in the file, save it as a dataframe for easier manipulation, and explore a bit. 

In [4]:
# Initialize an empty list to store the JSON objects
data_list = []

# Read the JSON Lines file
with open('data/raw/train.json', 'r', encoding='utf-8') as file:
    for line in file:
        # Parse each line as a separate JSON object
        json_object = json.loads(line)
        # Append the JSON object to the list
        data_list.append(json_object)


In [5]:
df = pd.DataFrame(data_list) #Store as dataframe

In [6]:
df.head(10) #Let's explore dataframe

Unnamed: 0,question,exp,cop,opa,opb,opc,opd,subject_name,topic_name,id,choice_type
0,Chronic urethral obstruction due to benign pri...,Chronic urethral obstruction because of urinar...,3,Hyperplasia,Hyperophy,Atrophy,Dyplasia,Anatomy,Urinary tract,e9ad821a-c438-4965-9f77-760819dfa155,single
1,Which vitamin is supplied from only animal sou...,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...,3,Vitamin C,Vitamin B7,Vitamin B12,Vitamin D,Biochemistry,Vitamins and Minerals,e3d3c4e1-4fb2-45e7-9f88-247cc8f373b3,single
2,All of the following are surgical options for ...,"Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba...",4,Adjustable gastric banding,Biliopancreatic diversion,Duodenal Switch,Roux en Y Duodenal By pass,Surgery,Surgical Treatment Obesity,5c38bea6-787a-44a9-b2df-88f4218ab914,multi
3,Following endaerectomy on the right common car...,The central aery of the retina is a branch of ...,1,Central aery of the retina,Infraorbital aery,Lacrimal aery,Nasociliary aretry,Ophthalmology,,cdeedb04-fbe9-432c-937c-d53ac24475de,multi
4,Growth hormone has its effect on growth through?,"Ans. is 'b' i.e., IGI-1GH has two major functi...",2,Directly,IG1-1,Thyroxine,Intranuclear receptors,Physiology,,dc6794a3-b108-47c5-8b1b-3b4931577249,single
5,Scrub typhus is transmitted by: September 2004,Ans. C i.e. Mite,3,Louse,Tick,Mite,Milk,Social & Preventive Medicine,,5ab84ea8-12d1-47d4-ab22-668ebf01e64c,single
6,Abnormal vascular patterns seen with colposcop...,"Abnormal vascular pattern include punctation, ...",3,Punctation,Mosaicism,Satellite lesions,Atypical vessels,Gynaecology & Obstetrics,,a83de6e4-9427-4480-b404-d96621ebb640,multi
7,Per rectum examination is not a useful test fo...,PILONIDAL SINUS/DISEASE (Jeep Bottom; Driver's...,3,Anal fissure,Hemorrhoid,Pilonidal sinus,Rectal ulcer,Surgery,Urology,f3bf8583-231b-4b7a-828c-179b0f9ccdd9,single
8,Characteristics of Remifentanyl – a) Metabolis...,Remifentanil is the shortest acting opioid due...,3,ab,bc,abc,bcd,Anaesthesia,,73515f05-e947-4801-8077-3abdeca95c84,single
9,Hypomimia is ?,Ans. C. Deficit of expression by gestureHypomi...,3,Decreased ability to copy,Decreased execution,Deficit of expression by gesture,Deficit of fluent speech,Psychiatry,,53f79833-21b0-4336-8ef4-404c687ec807,single


In [7]:
print(df.shape) #  How many rows and columns? 

(182822, 11)


### Set up prompts

Here, we're setting up our system prompts, and prompts to pre-pend the single and multi-choice answers. This is all to use later and get the data in a format for fine-tuning. We'll add the system prompts to our fine-tuning datasets and API calls, and pre-pend our questions with our descriptions for single and multi-choice questions accordingly. Let's go ahead and set up a dataframe that does this. 

In [8]:
system_promt = '''You are a fine-tuned LLM taking a medical entrance exam with the goal of achieving as high of a score as possible. 
Please answer the questions as accurately as possible based on your medical knowledge. 
Answer the questions following the instructions specified.'''

multi_choice_prompt = '''This question is a multiple choice question. Therefore, select every option that is true and do not select any false answers. 
If the options are A, B, C, and D, select all options out of A, B, C, and D that are true. 
An example would be selecting A, C, and D, A and B, or just A. 
Here is the question: '''

single_choice_prompt = '''This question has a single-choice answer. Therefore, select only one answer that is true and none else. 
For example, if the options are A, B, C, and D, you may select only one of these (A, for example). 
An example would be selecting just C.
Here is the question: '''

### Add clarity to option columns

Add descriptions of each option to make it clear to the LLM what its options are. 

In [9]:
df['system'] = system_promt
df['opa'] = 'Option A: ' + df['opa']
df['opb'] = 'Option B: ' + df['opb']
df['opc'] = 'Option C: ' + df['opc']
df['opd'] = 'Option D: ' + df['opd']

### Split into single and multi-choice, engineer accordingly

Here we do the following: 

- Split the dataframe into single and multiple choice
- Add the related prompts ahead of the question to give the LLM some context
- Add every option with its new "Option: " component in it
- This creates a complete "question" for the LLM with appropriate context

In [10]:
df_single = df[df['choice_type'] == 'single'].copy()
df_multi = df[df['choice_type'] == 'multi'].copy()

In [11]:
df_single['user'] = single_choice_prompt + df['question'] + '\n' + df['opa'] + '\n' + df['opb'] + '\n' + df['opc'] + '\n' + df['opd']
df_multi['user'] = multi_choice_prompt + df['question'] + '\n' + df['opa'] + '\n' + df['opb'] + '\n' + df['opc'] + '\n' + df['opd']
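To see what a single assembled "user" message looks like, the string concatenation above can be sketched as a plain function. This is only an illustration: `build_user_prompt` is a hypothetical helper, the prefix is shortened to its last line, and the options are passed without the "Option X: " text that the notebook adds to the columns beforehand.

```python
def build_user_prompt(prefix, question, options):
    """Join the instruction prefix, the question, and labeled options A-D."""
    lines = [prefix + question]
    for letter, text in zip("ABCD", options):
        lines.append(f"Option {letter}: {text}")
    return "\n".join(lines)


# Shortened stand-in for single_choice_prompt (assumption: only its last line).
prefix = "Here is the question: "

prompt = build_user_prompt(
    prefix,
    "Hypomimia is ?",
    [
        "Decreased ability to copy",
        "Decreased execution",
        "Deficit of expression by gesture",
        "Deficit of fluent speech",
    ],
)
print(prompt)
```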

### Check for nulls

Here we check for nulls and find that the "exp" column (the explanation, which we use as the answer text) is blank for some questions. Since these rows have no answer to train on, we drop them.

In [12]:
df_final = pd.concat([df_single, df_multi], axis=0, ignore_index=True)
df_final.isnull().sum()

question            0
exp             21953
cop                 0
opa                 0
opb                 0
opc                 0
opd                 0
subject_name        0
topic_name      95613
id                  0
choice_type         0
system              0
user                0
dtype: int64

In [13]:
df_final = df_final.dropna(subset=['exp'])
df_final.isnull().sum()

question            0
exp                 0
cop                 0
opa                 0
opb                 0
opc                 0
opd                 0
subject_name        0
topic_name      73792
id                  0
choice_type         0
system              0
user                0
dtype: int64

### Rename "exp" (which is the answer column) to assistant for fine-tuning purposes later

Here we rename the exp column to assistant - this is so the fine-tuning job knows that this is the answer. This is just the format needed for fine tuning. 

In [14]:
# Select and rename in one step to avoid modifying a slice in place
df_final = df_final[['system', 'user', 'exp']].rename(columns={'exp': 'assistant'})

In [15]:
df_final.head(5)

Unnamed: 0,system,user,assistant
0,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Chronic urethral obstruction because of urinar...
1,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...
2,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,"Ans. is 'b' i.e., IGI-1GH has two major functi..."
3,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Ans. C i.e. Mite
4,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,PILONIDAL SINUS/DISEASE (Jeep Bottom; Driver's...


### Train validation test split

Here we split into training, validation, and test sets, all drawn from the original training file since it is the one that has answers.

In [16]:
# First, split into training (80%) and temp (20%)
df_train, df_temp = train_test_split(df_final, test_size=0.2, random_state=42)

# Then, split the temp set in half: validation and test each get 50% of temp (10% of the original)
df_validation, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)
# df_train now has 80% of the data, df_validation has 10%, and df_test has 10%
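As a quick arithmetic check of the split sizes, the sketch below reproduces the expected row counts. It assumes that, for a fractional `test_size`, `train_test_split` rounds the test count up and gives the remainder to the training side. The starting count is the 182,822 rows loaded earlier minus the 21,953 rows dropped for a missing "exp". `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed behavior)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 182822 - 21953                          # rows left after dropna
n_train, n_temp = split_sizes(n_total, 0.2)       # 80% / 20%
n_validation, n_test = split_sizes(n_temp, 0.5)   # temp split in half

print(n_train, n_validation, n_test)
```

The validation count this produces matches the total of 16087 shown by the progress bar further down.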


In [None]:
df_validation = df_validation.reset_index(drop=True)

### Let's run our first "test" on a non-fine-tuned model

Now, let's run our first test. We're going to ask the LLM each question, one by one, record the answer (and the rest of the data fields) and save to a JSON object. This is the LLM taking the "test" without fine-tuning. We'll grade this later. 

A couple things to note here: 

- We're using multi-threading to make it faster
- We retry each request in case it fails, hits the rate limit, or times out
- We save responses as JSON and convert to CSV later for a "completed exam"
- We'll "grade" this exam later

Once it's done, let's take a look at the finished exam to make sure it looks normal.
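The retry logic in this notebook waits a fixed delay between attempts. A common alternative is exponential backoff with jitter, sketched below as a generic wrapper. This is a swapped-in technique rather than what the notebook does, and `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def call_with_backoff(fn, retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on any exception with exponentially growing waits.

    The wait before retry k is base_delay * 2**k (capped at max_delay) plus
    up to 50% random jitter. The last failure is re-raised.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Example: a stand-in function that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(call_with_backoff(flaky_request, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting; in real use the default `time.sleep` applies.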

In [85]:
# Function to get the GPT answer
def get_gpt_answer(client, index, system, user, assistant_content, retries=5, delay=20, timeout=10):
    for _ in range(retries):
        try:
            # Chat completions live under client.chat.completions in the openai>=1.0 client
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": system},
                    {"role": "user", "content": user}
                ],
                timeout=timeout
            )
            answer = completion.choices[0].message.content
            return {"index": index, "system": system, "user": user, "assistant": assistant_content, "gpt_answer": answer}
        except Exception:
            # Rate limits, timeouts, and other transient errors all get the same fixed wait before retrying
            time.sleep(delay)
    # All retries failed: record the question with no answer
    return {"index": index, "system": system, "user": user, "assistant": assistant_content, "gpt_answer": None}



# Function to save the result to file and update progress
def save_result_and_update_progress(processed_data_path, progress_path, result, index):
    if result is not None:
        with open(processed_data_path, 'a') as f:
            f.write(json.dumps(result) + "\n")  # Append each result as a new line
        with open(progress_path, 'a') as f:
            f.write(json.dumps(index) + "\n")  # Append the index as a new line

# Ensure directory exists
os.makedirs('data/processed', exist_ok=True)

progress_path = 'data/processed/gpt_validation_progress.json'
processed_data_path = 'data/processed/gpt_validation_answers.json'

# Load progress if it exists
processed_indices = []
if os.path.isfile(progress_path):
    with open(progress_path, 'r') as f:
        processed_indices = [int(line.strip()) for line in f if line.strip().isdigit()]

# Prepare unprocessed DataFrame
unprocessed_df = df_validation.loc[~df_validation.index.isin(processed_indices)]

# Setup ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=8)  # Adjust as necessary

# Submit tasks to the executor for unprocessed entries
futures_to_index = {executor.submit(get_gpt_answer, client, index, row['system'], row['user'], row['assistant']): index
                    for index, row in unprocessed_df.iterrows()}

# Initialize the progress bar and print the starting index
initial_progress = len(processed_indices)
starting_index = max(processed_indices + [-1]) + 1 if processed_indices else 0
print(f"Starting from index {starting_index}")

# Progress tracking with tqdm
with tqdm(total=len(df_validation), initial=initial_progress) as pbar:
    for future in as_completed(futures_to_index):
        result = future.result()
        index = futures_to_index[future]
        if result is not None:
            save_result_and_update_progress(processed_data_path, progress_path, result, index)
            pbar.update(1)

# Shutdown the executor
executor.shutdown()
print("Processing complete.")





Starting from index 0


  0%|          | 0/16087 [00:09<?, ?it/s]


KeyboardInterrupt: 

In [145]:
# Load the results from the JSON Lines cache (one JSON object per line) and convert to DataFrame
with open('data/processed/gpt_validation_answers.json') as f:
    cached_results = [json.loads(line) for line in f if line.strip()]

# Convert the list of dictionaries to a DataFrame
df_results = pd.DataFrame.from_records(cached_results)

# After all tasks are completed, check your DataFrame
df_results[:10]

Unnamed: 0,index,system,user,assistant,gpt_answer
0,54266,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Brucella spp are gram negative sho bacilli whi...,The likely organism grown in culture based on ...
1,118614,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,"In kwashiorkor, the hair is straight and hypop...",The correct answer is Option B: Kwashiorkor.
2,74128,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Ans: D (Streptococcus viridans) Ref: Harrison'...,The correct answer is D: Streptococcus viridans.
3,67907,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,"Ans. is 'b' i.e., Vesicoureteric reflux o Acco...",The most common underlying anomaly in a child ...
4,24456,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,.,Option B: Mycosis fungoides
5,27231,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,"Ans. is 4b' i.e., Mumps o Unilateral sensorine...",The correct answer is Option B: Mumps.
6,156740,You are a fine-tuned LLM taking a medical entr...,This question is a multiple choice question. T...,Answer is D (Bronchogenic carcinoma) : CE leve...,The correct answer is: Option D: Bronchogenic ...
7,149863,You are a fine-tuned LLM taking a medical entr...,This question is a multiple choice question. T...,Ans. D. Adduction of thumbThe ulnar nerve can ...,"Options A, B, and C are true. Select options A..."
8,11275,You are a fine-tuned LLM taking a medical entr...,This question has a single-choice answer. Ther...,Human parasite may complete their life cycle i...,The correct answer is Option A: One host.
9,128537,You are a fine-tuned LLM taking a medical entr...,This question is a multiple choice question. T...,Exhumation is the digging out of an already bu...,"In all of the following conditions, exhumation..."
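Many of the answers above state the chosen option explicitly ("Option B", "D: ..."). As a cheap sanity check alongside the LLM grader, a regular expression could pull out the letter and compare it with the `cop` column, which in the raw data shown earlier is a 1-based index of the correct option. This is a rough sketch, not the notebook's grading method: the helper names are hypothetical, `cop` would have to be kept in the dataframe (it is dropped earlier), and multi-select or free-text answers simply come back as `None`.

```python
import re

# "Option B" / "option b" anywhere in the answer.
_OPTION_RE = re.compile(r"\boption\s+([A-D])\b", re.IGNORECASE)
# Fallback: a standalone capital letter followed by ':', '.' or ')'.
_LETTER_RE = re.compile(r"\b([A-D])\s*[:.)]")


def extract_choice(answer):
    """Return the single letter the answer names, or None if it is unclear."""
    if not answer:
        return None
    match = _OPTION_RE.search(answer) or _LETTER_RE.search(answer)
    return match.group(1).upper() if match else None


def is_correct(answer, cop):
    """Compare the extracted letter with cop (assumed 1-based: 1 -> A)."""
    return extract_choice(answer) == "ABCD"[cop - 1]


print(extract_choice("The correct answer is Option B: Kwashiorkor."))
print(extract_choice("Options A, B, and C are true."))
```

Rows where `extract_choice` returns `None` would still need the LLM grader or a manual look.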


### Write to JSONL files for fine-tuning

Now that we have training, testing, and validation splits, we'll write each one to a JSONL file in the chat format that a fine-tuning job expects. Let's go ahead and do this.

In [19]:
# Define a dictionary of dataframes
dfs = {
    'train': df_train,
    'test': df_test,
    'validation': df_validation
}

# Loop through the dictionary and write each dataframe to a JSONL file
for set_name, df_set in dfs.items():
    file_path = f'data/processed/fine_tune_{set_name}.jsonl'
    with open(file_path, 'w') as outfile:
        for index, row in df_set.iterrows():
            entry = {
                "messages": [
                    {"role": "system", "content": row['system']},
                    {"role": "user", "content": row['user']},
                    {"role": "assistant", "content": row['assistant']}
                ]
            }
            json_line = json.dumps(entry) + "\n"  # Convert the dictionary to a JSON string and add a newline
            outfile.write(json_line)
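Before uploading, it can be worth checking that every line of a JSONL file has the system/user/assistant structure written above. The sketch below is one way to do that under simple assumptions (exactly three messages per record, in that order, each with non-empty string content); `find_bad_lines` is a hypothetical helper and is not an official OpenAI validator.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def find_bad_lines(lines):
    """Return (line_number, reason) pairs for records that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            problems.append((number, f"unreadable: {exc!r}"))
            continue
        roles = [m.get("role") for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles {roles}"))
        elif not all(isinstance(m.get("content"), str) and m["content"].strip()
                     for m in messages):
            problems.append((number, "empty or non-string content"))
    return problems


# Example with two in-memory records; a real check would pass an open file.
sample = [
    json.dumps({"messages": [
        {"role": "system", "content": "s"},
        {"role": "user", "content": "u"},
        {"role": "assistant", "content": "a"},
    ]}),
    "this is not JSON",
]
print(find_bad_lines(sample))
```

For the files written above, this could be run as `find_bad_lines(open(file_path))` on each split.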


In [140]:
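# Upload the training file; the context manager closes the handle afterwards
with open("data/processed/fine_tune_train.jsonl", "rb") as train_file:
    file_obj_train = client.files.create(
      file=train_file,
      purpose="fine-tune"
    )

file_obj_train_id = file_obj_train.id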

In [141]:
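# Upload the validation file; the context manager closes the handle afterwards
with open("data/processed/fine_tune_validation.jsonl", "rb") as validation_file:
    file_obj_validation = client.files.create(
      file=validation_file,
      purpose="fine-tune"
    )

file_obj_validation_id = file_obj_validation.id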

In [143]:
fine_tune_obj = client.fine_tuning.jobs.create(
  training_file=file_obj_train_id,
  validation_file=file_obj_validation_id,
  model="gpt-3.5-turbo"
)

fine_tune_id = fine_tune_obj.id

In [144]:
# Retrieve the state of a fine-tune
client.fine_tuning.jobs.retrieve(fine_tune_id)

FineTuningJob(id='ftjob-lqKTZiyD7Fe6qi0atre4ytwF', created_at=1699438500, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-bZlpCYnpzq95DFTylECno992', result_files=[], status='validating_files', trained_tokens=None, training_file='file-rR6i4TM0MboqumN7Fl3VgI4i', validation_file='file-V93AaEurq8SVxZcTCEKhST0o')
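The job above is still in `validating_files`, so the fine-tuned model name is not available yet. A minimal polling sketch is shown below. It takes the retrieve function as an argument so it can be tried without network access, and it assumes the terminal statuses are `succeeded`, `failed`, and `cancelled`. `wait_for_job` and its parameters are hypothetical names.

```python
import time
from types import SimpleNamespace

# Assumed terminal states of a fine-tuning job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=1000,
                 sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Offline demonstration with a stand-in for the API call.
_states = iter(["validating_files", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_states))

print(wait_for_job(fake_retrieve, "ftjob-example", sleep=lambda s: None).status)
```

In the notebook this would be called as `wait_for_job(client.fine_tuning.jobs.retrieve, fine_tune_id)`, after which `fine_tuned_model` on the returned job should hold the model name to use for the "expert" endpoint.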

In [77]:
exam_custom_function = [
    {
        'name': 'grade_exam',
        'description': 'Decide whether or not the question was answered correctly - return True if correct and False if incorrect.',
        'parameters': {
            'type': 'object',
            'properties': {
                'graded': {
                    'type': 'boolean',
                    'description': 'Whether the response was correct or not - return True if correct and False if incorrect'
                }
            },
            # Mark the field as required so the model always returns it
            'required': ['graded']
        }
    }
]

In [78]:
# Path to the test-set JSONL file written earlier
file_path = 'data/processed/fine_tune_test.jsonl'

# Open the file and read the first line
with open(file_path, 'r') as file:
    first_line = file.readline()

# Parse the first JSON object from the first line
first_json_object = json.loads(first_line)

# The whole first record (system, user, and assistant messages)
first_message = first_json_object

print(first_message)


{'messages': [{'role': 'system', 'content': 'You are a fine-tuned LLM taking a medical entrance exam with the goal of achieving as high of a score as possible. \nPlease answer the questions as accurately as possible based on your medical knowledge. \nAnswer the questions following the instructions specified.'}, {'role': 'user', 'content': 'This question is a multiple choice question. Therefore, select every option that is true and do not select any false answers. \nIf the options are A, B, C, and D, select all options out of A, B, C, and D that are true. \nAn example would be selecting A, C, and D, A and B, or just A. \nHere is the question: After a brawl, a young male presented with inability to extend his distal interphalangeal joint. An X-ray was taken and was shown to be normal. What should he the next step in managing the patient?\nOption A: Splint\nOption B: Surgery\nOption C: Wax bath\nOption D: Ignore'}, {'role': 'assistant', 'content': "Ans. a. Splint (Ref. Apley's 8/e p339)Ma

In [79]:
# Path to the validation-set JSONL file written earlier
file_path = 'data/processed/fine_tune_validation.jsonl'

# Open the file and read the first line
with open(file_path, 'r') as file:
    first_line = file.readline()

# Parse the first JSON object from the first line
first_json_object = json.loads(first_line)

# Get the 'messages' from the first JSON object
messages = first_json_object['messages']

# Initialize variables for system and user messages
system_message = ""
user_message = ""
assistant_message = ""

# Iterate through the messages and separate them based on 'role'
for message in messages:
    if message['role'] == 'system':
        system_message = message['content']
    elif message['role'] == 'user':
        user_message = message['content']
    elif message['role'] == 'assistant': 
        assistant_message = message['content']


# Print the separated messages
print("System Message:")
print(system_message)
print("\nUser Message:")
print(user_message)
print("\nAssistant Message:")
print(assistant_message)


System Message:
You are a fine-tuned LLM taking a medical entrance exam with the goal of achieving as high of a score as possible. 
Please answer the questions as accurately as possible based on your medical knowledge. 
Answer the questions following the instructions specified.

User Message:
This question has a single-choice answer. Therefore, select only one answer that is true and none else. 
For example, if the options are A, B, C, and D, you may select only one of these (A, for example). 
An example would be selecting just C.
Here is the question: A 41-year-old patient is diagnosed with infective endocarditis. Which of the following has good prognosis?
Option A: Prosthetic valve endocarditis
Option B: IV drug abuse
Option C: Staphylococcus aureus
Option D: Streptococcus viridans

Assistant Message:
Ans: D (Streptococcus viridans) Ref: Harrison's 18th edn, pg: 1062Explanation:"Overall survival rates for patients with NVE caused by viridans streptococci, HACEK organisms, or enteroco

In [98]:

response = client.chat.completions.create(
    model = 'gpt-3.5-turbo',
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message + '\n' + assistant_message}
    ], 
    functions = exam_custom_function,  # Reuse the grade_exam schema defined earlier
    function_call = {'name': 'grade_exam'}  # Force the call so message.function_call is never None
)

# The string representation of the JSON
arguments_json = response.choices[0].message.function_call.arguments

# Parse the JSON string into a Python dictionary
arguments_dict = json.loads(arguments_json)

# Extract the 'graded' value
graded = arguments_dict["graded"]

print(graded)  # This should print: True

True
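When this grading call is run over thousands of questions, the returned arguments will occasionally be missing or malformed. A small defensive parser, sketched below, returns `None` in those cases so they can be counted separately instead of crashing the loop. `parse_graded` is a hypothetical helper, not part of the OpenAI client.

```python
import json


def parse_graded(arguments_json):
    """Return the boolean 'graded' value, or None if it cannot be read."""
    try:
        value = json.loads(arguments_json).get("graded")
    except (TypeError, ValueError, AttributeError):
        # None input, invalid JSON, or JSON that is not an object.
        return None
    return value if isinstance(value, bool) else None


print(parse_graded('{"graded": true}'))
print(parse_graded("not valid json"))
```

It would be used as `parse_graded(response.choices[0].message.function_call.arguments)`.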
