## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/542514/fine-tuning-openai-vision-models-for-visual-question-answering

For my other articles for Daniweb.com, please see this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Importing and Installing Required Libraries

In [17]:
!pip install openai



In [112]:
from openai import OpenAI
import pandas as pd
import json
import os
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score

## Importing and Preprocessing the Dataset

In [39]:
# Data download link:
# https://www.kaggle.com/datasets/bhavikardeshna/visual-question-answering-computer-vision-nlp

dataset = pd.read_csv(r"D:\Datasets\dataset\data_train.csv")
dataset.head()

Unnamed: 0,question,answer,image_id
0,what is the object on the shelves,cup,image100
1,how man chairs are there,6,image888
2,what is hanged to the right side of the bed,curtain,image1174
3,how many picture are on the wall,2,image942
4,what is the object on the floor behind the rack,room_divider,image1220


In [177]:
# Use a raw string so that \d is not treated as an invalid string escape
dataset['image_num'] = dataset['image_id'].str.extract(r'(\d+)').astype(int)
filtered_data = dataset[dataset['image_num'] < 495]

filtered_data.head()

Unnamed: 0,question,answer,image_id,image_num
0,what is the object on the shelves,cup,image100,100
10,what is above the cupboards,books,image110,110
15,what is to the left of the sofa,lamp,image488,488
16,what is around the table,chair,image467,467
17,what is to the right of cot,drawer,image79,79


In [70]:

# Base URL for the images
base_url = "https://raw.githubusercontent.com/usmanmalik57/daniweb-articles/refs/heads/main/vqa_images/"

# Shuffle the dataset
shuffled_dataset = shuffle(filtered_data)

# Split the dataset: first 300 for training, next 100 for testing
training_data = shuffled_dataset[:300]
test_data = shuffled_dataset[300:400]

# Create the JSONL structure for training data and save each entry on a single line
training_output_file = r'D:\Datasets\dataset\training_data.jsonl'

with open(training_output_file, 'w') as f:
    for index, row in training_data.iterrows():
        # Update image URL
        image_url = f"{base_url}image{row['image_num']}.png"
        entry = {
            "messages": [
                {"role": "system", "content": "You are an assistant that answers questions related to images."},
                {"role": "user", "content": row['question']},
                {"role": "user", "content": [
                    {"type": "image_url", "image_url": {"url": image_url}}
                ]},
                {"role": "assistant", "content": row['answer']}
            ]
        }
        # Write each entry as a single line in the JSONL file
        f.write(json.dumps(entry) + '\n')

print(f"Training JSONL file saved as {training_output_file}")


Training JSONL file saved as D:\Datasets\dataset\training_data.jsonl
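
Before uploading, it can help to sanity-check the JSONL file locally. The sketch below is only an assumption about what a useful check looks like, not OpenAI's official validator. The hypothetical helper `find_invalid_lines` confirms that every line parses as JSON, contains a non-empty `messages` list, and ends with an assistant message.

```python
import json

def find_invalid_lines(lines):
    """Return (line_number, reason) tuples for lines that fail basic checks.

    Hypothetical helper for illustration; it does not replicate the
    server-side validation performed when a fine-tuning job starts.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        # Each line must be a standalone JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue

        # Each entry needs a non-empty list under the "messages" key
        messages = entry.get("messages") if isinstance(entry, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue

        # The final message should be the assistant's target answer
        last = messages[-1]
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems
```

Passing the open training file (any iterable of lines works) should return an empty list when every entry has the expected shape.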


## Vision Fine-tuning OpenAI GPT-4o

In [None]:
client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)

# Upload the training file; the context manager closes the file handle afterwards
with open(training_output_file, "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

print(training_file)

In [47]:
fine_tuning_job_gpt4o = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-2024-08-06"
)

In [None]:
# List up to 10 events from a fine-tuning job
events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id=fine_tuning_job_gpt4o.id,
    limit=10
)
print(events)

In [64]:
ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o.id).fine_tuned_model
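
Note that `fine_tuned_model` stays `None` until the job has succeeded, so the cell above only yields a usable model ID once training is complete. A minimal polling sketch follows. The helper name `wait_for_fine_tuning_job` and its `poll_seconds` and `max_polls` parameters are hypothetical; the retrieval function is passed in as an argument so the helper does not depend on a particular client object.

```python
import time

# Terminal job statuses reported by the fine-tuning API
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_fine_tuning_job(retrieve_job, job_id, poll_seconds=60, max_polls=120):
    """Poll until the job reaches a terminal status, then return the job object.

    `retrieve_job` is any callable that takes a job ID and returns an object
    with a `status` attribute, for example `client.fine_tuning.jobs.retrieve`.
    """
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        # Wait before asking again to avoid hammering the API
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

In this notebook, it could be called as `wait_for_fine_tuning_job(client.fine_tuning.jobs.retrieve, fine_tuning_job_gpt4o.id)`, after which the returned job's `status` should be checked before reading `fine_tuned_model`.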

## Evaluating Fine-Tuned Vision Model

In [129]:
# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning
test_data = test_data.copy()
test_data['full_image_path'] = test_data['image_num'].apply(lambda x: f"{base_url}image{x}.png")
test_data.head(10)


Unnamed: 0,question,answer,image_id,image_num,full_image_path
963,how many draws are there in the unit where the...,2,image62,62,https://raw.githubusercontent.com/usmanmalik57...
616,what is to the left of the laptop,printer,image393,393,https://raw.githubusercontent.com/usmanmalik57...
7249,what is in front of the shelf below the map,globe,image309,309,https://raw.githubusercontent.com/usmanmalik57...
6017,what is the object close to the wall above the...,blackboard,image124,124,https://raw.githubusercontent.com/usmanmalik57...
3786,what is the colour of pillar,brown,image228,228,https://raw.githubusercontent.com/usmanmalik57...
7894,how many computers are on the table,2,image381,381,https://raw.githubusercontent.com/usmanmalik57...
4140,what is near the computer,mouse,image112,112,https://raw.githubusercontent.com/usmanmalik57...
5719,what is hanging diagonally behind the water di...,telephone,image21,21,https://raw.githubusercontent.com/usmanmalik57...
4768,what is to the right of tv,shelves,image63,63,https://raw.githubusercontent.com/usmanmalik57...
7918,how many steps are,3,image96,96,https://raw.githubusercontent.com/usmanmalik57...


In [107]:
def get_single_prediction(query, image_path, model_id, system_role):

    try:
        # Make the API call to get the response from the model
        response = client.chat.completions.create(
            model=model_id,
            temperature=0,
            messages=[
                {"role": "system", "content": system_role},
                {"role": "user", "content": [
                    {"type": "text", "text": query},
                    {"type": "image_url", "image_url": {"url": image_path}}
                ]}
            ]
        )
        
        
        # Extract the prediction from the API response
        prediction = response.choices[0].message.content.strip().lower()
        return prediction
    except Exception as e:
        print(f"Error making prediction: {e}")
        return None  # In case of failure


In [110]:
image_path = test_data["full_image_path"].iloc[1]
query = test_data["question"].iloc[1]
system_role = "You are an assistant that answers questions related to images."
model_id = "gpt-4o-2024-08-06"
response = get_single_prediction(query, image_path, model_id, system_role)

print(f"Image path: {image_path}")
print(f"User Query: {query}")
print(f"Model Response: {response}")

Image path: https://raw.githubusercontent.com/usmanmalik57/daniweb-articles/refs/heads/main/vqa_images/image393.png
User Query: what is to the left of the laptop
Model Response: to the left of the laptop, there is a printer.


In [111]:
image_path = test_data["full_image_path"].iloc[1]
query = test_data["question"].iloc[1]
system_role = "You are an assistant that answers questions related to images."
model_id = ft_model_id
response = get_single_prediction(query, image_path, model_id, system_role)

print(f"Image path: {image_path}")
print(f"User Query: {query}")
print(f"Model Response: {response}")

Image path: https://raw.githubusercontent.com/usmanmalik57/daniweb-articles/refs/heads/main/vqa_images/image393.png
User Query: what is to the left of the laptop
Model Response: printer


In [135]:

def make_predictions(dataframe, model_id, system_role):
    actual_answers = []
    predicted_answers = []
    
    # Iterate through each row, tracking a 1-based record number
    for record_number, (_, row) in enumerate(dataframe.iterrows(), start=1):
        image_path = row['full_image_path']
        query = row['question']
        actual_answer = row['answer'].lower()
        
        # Get the predicted answer from the API
        predicted_answer = get_single_prediction(query, image_path, model_id, system_role)
        
        # get_single_prediction returns None when the API call fails
        if predicted_answer is None:
            print(f"Skipping record #{record_number} due to prediction error.")
            continue
        
        # Store actual and predicted answers for the accuracy calculation
        actual_answers.append(actual_answer)
        predicted_answers.append(predicted_answer)
        
        # Print the status indicating the record number processed and the response
        print(f"Record #{record_number} processed. Response: {predicted_answer}")
    
    # Calculate accuracy using sklearn's accuracy_score
    accuracy = accuracy_score(actual_answers, predicted_answers) * 100
    print(f"Accuracy: {accuracy:.2f}%")
    
    return accuracy, predicted_answers
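
`accuracy_score` counts only exact string matches, so a prediction such as `foosball table` would be scored as wrong against the dataset label `foosball_table`. One possible mitigation is to normalize both strings before comparing them. The helpers `normalize_answer` and `exact_match_accuracy` below are hypothetical and not part of the original notebook; they handle only case, underscores, punctuation, and whitespace, not synonyms or spelling variants.

```python
import re

def normalize_answer(text):
    """Lowercase, turn underscores into spaces, drop punctuation, collapse spaces."""
    text = str(text).lower().replace("_", " ")
    # Remove everything that is not a letter, digit, or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def exact_match_accuracy(actual, predicted):
    """Return the percentage of pairs that are equal after normalization."""
    if not actual:
        return 0.0
    matches = sum(
        normalize_answer(a) == normalize_answer(p)
        for a, p in zip(actual, predicted)
    )
    return 100.0 * matches / len(actual)
```

This keeps the metric cheap and deterministic; answers that differ in wording but not in meaning still need a semantic comparison.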


### Results Using Default GPT-4o Model

In [136]:
model_id = "gpt-4o-2024-08-06"
system_role = """
You are an assistant that answers questions related to images.
Return your response in a single word without period at the end.
For digits you should return digit number and not word.
"""
gpt_4o_predictions = make_predictions(test_data, model_id, system_role)

Record #1 processed. Response: 2
Record #2 processed. Response: printer
Record #3 processed. Response: globe
Record #4 processed. Response: speaker
Record #5 processed. Response: brown
Record #6 processed. Response: 2
Record #7 processed. Response: papers
Record #8 processed. Response: cables
Record #9 processed. Response: wall
Record #10 processed. Response: 3
Record #11 processed. Response: chair
Record #12 processed. Response: blender
Record #13 processed. Response: rug
Record #14 processed. Response: basket
Record #15 processed. Response: door
Record #16 processed. Response: pillows
Record #17 processed. Response: brown
Record #18 processed. Response: dishwasher
Record #19 processed. Response: painting
Record #20 processed. Response: foosball
Record #21 processed. Response: kettle
Record #22 processed. Response: cabinet
Record #23 processed. Response: nothing
Record #24 processed. Response: coat
Record #25 processed. Response: napkins
Record #26 processed. Response: phone
Record #2

### Results Using Fine-tuned GPT-4o Model

In [137]:
model_id = ft_model_id
system_role = "You are an assistant that answers questions related to images."
gpt_4o_fine_tuned_predictions = make_predictions(test_data, model_id, system_role)

Record #1 processed. Response: 2
Record #2 processed. Response: printer
Record #3 processed. Response: globe
Record #4 processed. Response: chalkboard
Record #5 processed. Response: brown
Record #6 processed. Response: 2
Record #7 processed. Response: mouse
Record #8 processed. Response: wire
Record #9 processed. Response: wall
Record #10 processed. Response: 2
Record #11 processed. Response: chair
Record #12 processed. Response: blender
Record #13 processed. Response: carpet
Record #14 processed. Response: basket
Record #15 processed. Response: door
Record #16 processed. Response: pillow
Record #17 processed. Response: brown
Record #18 processed. Response: dishwasher
Record #19 processed. Response: picture
Record #20 processed. Response: foosball_table
Record #21 processed. Response: printer
Record #22 processed. Response: whiteboard
Record #23 processed. Response: clothes
Record #24 processed. Response: jacket
Record #25 processed. Response: sink
Record #26 processed. Response: telep

## Comparing Default vs Fine-Tuned GPT-4o Model

In [145]:
comparison_df = pd.DataFrame({
    'Actual Answers': test_data['answer'],
    'Default GPT-4o': gpt_4o_predictions[1],
    'Fine-tuned GPT-4o': gpt_4o_fine_tuned_predictions[1]
})

# Display the new DataFrame
comparison_df.head(20)

Unnamed: 0,Actual Answers,Default GPT-4o,Fine-tuned GPT-4o
963,2,2,2
616,printer,printer,printer
7249,globe,globe,globe
6017,blackboard,speaker,chalkboard
3786,brown,brown,brown
7894,2,2,2
4140,mouse,papers,mouse
5719,telephone,cables,wire
4768,shelves,wall,wall
7918,3,3,2


In [172]:
def compare_answer(answer, prediction):
    
    content = f"""
    Compare the actual answer and the prediction and check whether they have the same meaning.
    They don't have to be an exact match, but the meaning must be similar.
    Actual answer: {answer}.
    Prediction: {prediction}.
    Return True if they have the same meaning, else return False. Do not return anything else.
    
    """
    response = client.chat.completions.create(
        model= "gpt-4o-2024-08-06",
        temperature=0,
        max_tokens=10,
        messages=[
            {"role": "user", "content": content}
        ]
    )
    
    response = response.choices[0].message.content.strip().lower() == 'true'
    print(f"{answer} -> {prediction} -> {response}")
    return response

In [173]:
def count_matching_answers(answers, predictions):
    count = 0
    # Iterate through both lists together using zip
    for answer, prediction in zip(answers, predictions):
        # Call the compare_answer function and increment count if True
        if compare_answer(answer, prediction):
            count += 1
    return count
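
A raw count of matches is harder to compare across runs than a percentage. The sketch below returns both. The name `semantic_accuracy` is hypothetical, and the comparison function is passed in as a parameter so that the helper can be tried without making API calls.

```python
def semantic_accuracy(answers, predictions, compare):
    """Return (matching_count, accuracy_percent) for paired answers and predictions.

    `compare` is any callable that takes (answer, prediction) and returns
    True or False, for example the notebook's `compare_answer` function.
    """
    pairs = list(zip(answers, predictions))
    if not pairs:
        return 0, 0.0
    count = sum(1 for answer, prediction in pairs if compare(answer, prediction))
    return count, 100.0 * count / len(pairs)
```

In this notebook, `semantic_accuracy(test_data['answer'], gpt_4o_predictions[1], compare_answer)` would issue one API request per pair, just as `count_matching_answers` does.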

In [174]:
matching_count = count_matching_answers(test_data['answer'], gpt_4o_predictions[1])
print(f"Number of matching answers: {matching_count}")

2 -> 2 -> True
printer -> printer -> True
globe -> globe -> True
blackboard -> speaker -> False
brown -> brown -> True
2 -> 2 -> True
mouse -> papers -> False
telephone -> cables -> False
shelves -> wall -> False
3 -> 3 -> True
chair -> chair -> True
electric_mixer -> blender -> False
cap_stand -> rug -> False
ironing_board -> basket -> False
door -> door -> True
blanket, pillow -> pillows -> False
brown -> brown -> True
glass_set -> dishwasher -> False
picture -> painting -> False
foosball_table -> foosball -> True
printer -> kettle -> False
shelves -> cabinet -> False
pillow -> nothing -> False
door -> coat -> False
bottle_of_liquid -> napkins -> False
telephone -> phone -> True
bottle_of_liquid -> bottle -> False
aluminium_foil -> pans -> False
brown -> brown -> True
white -> white -> True
1 -> dresser -> False
book -> books -> True
dvd_player -> table -> False
white, blue, pink -> gray, purple, white, pink, green -> False
blue -> blue -> True
table -> chair -> False
blender -> blen

In [175]:
matching_count = count_matching_answers(test_data['answer'], gpt_4o_fine_tuned_predictions[1])
print(f"Number of matching answers: {matching_count}")

2 -> 2 -> True
printer -> printer -> True
globe -> globe -> True
blackboard -> chalkboard -> True
brown -> brown -> True
2 -> 2 -> True
mouse -> mouse -> True
telephone -> wire -> False
shelves -> wall -> False
3 -> 2 -> False
chair -> chair -> True
electric_mixer -> blender -> False
cap_stand -> carpet -> False
ironing_board -> basket -> False
door -> door -> True
blanket, pillow -> pillow -> False
brown -> brown -> True
glass_set -> dishwasher -> False
picture -> picture -> False
foosball_table -> foosball_table -> True
printer -> printer -> True
shelves -> whiteboard -> False
pillow -> clothes -> False
door -> jacket -> False
bottle_of_liquid -> sink -> False
telephone -> telephone -> True
bottle_of_liquid -> bottle -> False
aluminium_foil -> aluminum_foil -> True
brown -> brown -> True
white -> white -> True
1 -> dresser -> False
book -> book -> True
dvd_player -> chair -> False
white, blue, pink -> white, purple -> False
blue -> blue -> True
table -> chair -> False
blender -> blen