## Detailed Article Explanation

The detailed code explanation for this article is available at the following link:

https://www.daniweb.com/programming/computer-science/tutorials/542648/fine-tuning-openai-gpt-4o-for-multi-label-text-classification

My other articles on Daniweb.com are listed at this link:

https://www.daniweb.com/members/1235222/usmanmalik57

## Importing and Installing Required Libraries

In [1]:
!pip install openai



In [17]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import combinations
from collections import Counter
from sklearn.metrics import hamming_loss, accuracy_score
import json
import os
from openai import OpenAI

## Importing and Preprocessing the Dataset

In [4]:
## dataset download link
## https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset?select=train.csv

dataset = pd.read_csv(r"D:\Datasets\Multilabel Research Paper Classification\train.csv")
print(f"Dataset Shape: {dataset.shape}")
dataset.head()

Dataset Shape: (20972, 9)


Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


In [5]:
subjects = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]
filtered_dataset = dataset[(dataset[subjects] == 1).sum(axis=1) >= 2]
print(f"Filtered Dataset Shape: {filtered_dataset.shape}")
filtered_dataset.head()

Filtered Dataset Shape: (5044, 9)


Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0
21,22,Many-Body Localization: Stability and Instability,Rare regions with weak disorder (Griffiths r...,0,1,1,0,0,0
28,29,Minimax Estimation of the $L_1$ Distance,We consider the problem of estimating the $L...,0,0,1,1,0,0
29,30,Density large deviations for multidimensional ...,We investigate the density large deviation f...,0,1,1,0,0,0
30,31,mixup: Beyond Empirical Risk Minimization,"Large deep neural networks are powerful, but...",1,0,0,1,0,0
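To see which label pairs dominate the filtered set, the co-occurring subjects can be counted with `combinations` and `Counter` (both imported above). The following is a minimal sketch on a few hand-written rows rather than the real data. The helper name `count_label_pairs` is hypothetical; on the actual DataFrame the rows could be obtained with `filtered_dataset[subjects].to_dict("records")`.

```python
from collections import Counter
from itertools import combinations

SUBJECTS = ["Computer Science", "Physics", "Mathematics", "Statistics",
            "Quantitative Biology", "Quantitative Finance"]

def count_label_pairs(rows, subjects=SUBJECTS):
    """Count how often each pair of subjects is active in the same row.

    `rows` is an iterable of dicts mapping subject name -> 0/1
    (hypothetical helper, not part of the original notebook).
    """
    pair_counts = Counter()
    for row in rows:
        active = [s for s in subjects if row.get(s, 0) == 1]
        # Every unordered pair of active labels counts once per row
        pair_counts.update(combinations(active, 2))
    return pair_counts

# Toy rows for illustration only (not taken from the dataset)
toy_rows = [
    {"Computer Science": 1, "Statistics": 1},
    {"Physics": 1, "Mathematics": 1},
    {"Computer Science": 1, "Statistics": 1, "Mathematics": 1},
]

pair_counts = count_label_pairs(toy_rows)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

A heavily skewed pair distribution would suggest that the first 100 rows used for training may not cover all label combinations equally.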


In [44]:
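# First 100 records for training
train_dataset = filtered_dataset.iloc[:100]
# Randomly select 100 records for testing. Note: sampling from the full filtered set
# means some test rows may also appear in the training set; sampling from
# filtered_dataset.iloc[100:] instead would keep the two sets disjoint.
test_dataset = filtered_dataset.sample(n=100, random_state=42)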

# Display the shapes of the resulting datasets
print(f"Training Dataset Shape: {train_dataset.shape}")
print(f"Testing Dataset Shape: {test_dataset.shape}")

Training Dataset Shape: (100, 9)
Testing Dataset Shape: (100, 9)


## Creating a Training File for OpenAI Fine-Tuning

In [14]:

# Initialize list to hold JSON-like strings
json_lines = []

# Template for system role content
system_content = (
    "You are an expert in various scientific domains.\n"
    "Given the following research paper title and abstract, classify the research paper into at least two or more of the following categories:\n"
    "- Computer Science\n"
    "- Physics\n"
    "- Mathematics\n"
    "- Statistics\n"
    "- Quantitative Biology\n"
    "- Quantitative Finance\n\n"
    "Return only a comma-separated list of the categories (e.g., [Computer Science,Physics] or [Computer Science,Physics,Mathematics]).\n"
    "Use the exact case sensitivity and spelling of the categories provided above."
)

# Loop through each row in the DataFrame
for _, row in train_dataset.iterrows():
    # Identify the categories with a value of 1, then reverse their order
    categories = [
        subject for subject in subjects
        if row[subject] == 1
    ][::-1]  # e.g. ["Computer Science", "Statistics"] becomes ["Statistics", "Computer Science"]
    
    # Create JSON structure for each row
    json_record = {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": f"Title: {row['TITLE']}\nAbstract: {row['ABSTRACT']}"},
            {"role": "assistant", "content": f"[{','.join(categories)}]"}
        ]
    }
    # Convert to JSON string and add to list
    json_lines.append(json.dumps(json_record))

# Join all JSON strings with newline separators for the final output
final_output = "\n".join(json_lines)



In [15]:
# Save the records to 'train.json', one JSON object per line (JSON Lines format)

training_file_path = r"D:\Datasets\Multilabel Research Paper Classification\train.json"

with open(training_file_path, 'w', encoding='utf-8') as file:
    file.write(final_output)

print("Data successfully saved to 'train.json'")

Data successfully saved to 'train.json'
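Before uploading, it can help to confirm that every line of the file parses as JSON and carries the expected message roles. The sketch below is a minimal local check under the assumption that each record follows this notebook's system/user/assistant layout; it is not OpenAI's own validation logic, and the name `validate_chat_jsonl` is hypothetical.

```python
import json

# Role order used by the records built in this notebook (an assumption of this check)
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_chat_jsonl(text):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((line_no, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((line_no, "missing 'messages' list"))
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != EXPECTED_ROLES:
            problems.append((line_no, f"unexpected roles: {roles}"))
    return problems

# Toy input for illustration: one well-formed record and two broken lines
good = json.dumps({"messages": [
    {"role": "system", "content": "You are an expert."},
    {"role": "user", "content": "Title: T\nAbstract: A"},
    {"role": "assistant", "content": "[Statistics,Computer Science]"},
]})
bad_roles = json.dumps({"messages": [{"role": "user", "content": "hi"}]})
sample = "\n".join([good, bad_roles, "not json"])

problems = validate_chat_jsonl(sample)
for line_no, problem in problems:
    print(line_no, problem)
```

In the notebook, the same function could be called as `validate_chat_jsonl(final_output)` before the file is written.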


In [None]:
client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)


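# Upload the training file; the context manager closes the file handle afterwards
with open(training_file_path, "rb") as training_fh:
    training_file = client.files.create(
        file=training_fh,
        purpose="fine-tune"
    )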

print(training_file.id)

## Fine-Tuning the GPT-4o Model

In [21]:
fine_tuning_job_gpt4o = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-2024-08-06"
)

In [None]:
# List up to 10 events from a fine-tuning job
print(client.fine_tuning.jobs.list_events(fine_tuning_job_id = fine_tuning_job_gpt4o.id,
                                    limit=10))

In [None]:
# fine_tuned_model is None until the job has finished successfully
ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o.id).fine_tuned_model
print(ft_model_id)
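Because the job runs asynchronously, the model name is only available once the job status is `succeeded`. One way to handle this is to poll the job until it reaches a terminal status. The following is a rough sketch, not part of the original notebook: the function name `wait_for_fine_tuned_model`, the polling interval, and the poll limit are all assumptions, while `client.fine_tuning.jobs.retrieve` and the `status` / `fine_tuned_model` fields are the same ones used above.

```python
import time

# Terminal job statuses reported by the fine-tuning API
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_fine_tuned_model(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status.

    Returns the fine-tuned model name on success and raises otherwise.
    `poll_seconds` and `max_polls` are illustrative defaults, not recommendations.
    """
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in TERMINAL_STATUSES:
            raise RuntimeError(f"Fine-tuning job {job_id} ended with status '{job.status}'")
        sleep(poll_seconds)
    raise TimeoutError(f"Fine-tuning job {job_id} did not finish after {max_polls} polls")

# Possible usage in this notebook (requires a real client and job):
# ft_model_id = wait_for_fine_tuned_model(client, fine_tuning_job_gpt4o.id)
```

Passing `sleep` as a parameter keeps the function easy to exercise with a stand-in client, without waiting between polls.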

## Predicting Research Paper Category with Fine-tuned GPT-4o

In [45]:

def find_research_category(client, model, dataset):
    """Classify every paper in `dataset` with `model` and return the raw responses."""

    outputs = []

    for i, (_, row) in enumerate(dataset.iterrows(), start=1):
        title = row['TITLE']
        abstract = row['ABSTRACT']

        # The whole prompt is sent as a single user message, whereas the
        # training file used separate system and user messages.
        content = """You are an expert in various scientific domains.
                     Given the following research paper title and abstract, classify the research paper into at least two or more of the following categories:
                    - Computer Science
                    - Physics
                    - Mathematics
                    - Statistics
                    - Quantitative Biology
                    - Quantitative Finance

                    Return only a comma-separated list of the categories (e.g., [Computer Science,Physics] or [Computer Science,Physics,Mathematics]).
                    Use the exact case sensitivity and spelling of the categories provided above.

                    text: Title: {}\nAbstract: {}""".format(title, abstract)

        # temperature=0 keeps the output as deterministic as possible
        research_category = client.chat.completions.create(
                                model=model,
                                temperature=0,
                                max_tokens=100,
                                messages=[
                                      {"role": "user", "content": content}
                                  ]
                              ).choices[0].message.content

        outputs.append(research_category)
        print(i, research_category)

    return outputs

In [46]:

def parse_outputs_to_dataframe(outputs):

    subjects = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]
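    # Remove square brackets, split on commas, and trim whitespace around each subject
    # so that a response such as "[Statistics, Computer Science]" is still matched
    parsed_data = [[s.strip() for s in item.strip().strip('[]').split(',')] for item in outputs]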

    # Create an empty DataFrame with columns for each subject, initializing with 0s
    df = pd.DataFrame(0, index=range(len(parsed_data)), columns=subjects)

    # Populate the DataFrame with 1s based on the presence of each subject in each row
    for i, subjects_list in enumerate(parsed_data):
        for subject in subjects_list:
            if subject in subjects:
                df.loc[i, subject] = 1

    return df

In [47]:
model = ft_model_id
outputs = find_research_category(client, 
                                 model, 
                                 test_dataset)

1 [Mathematics,Physics]
2 [Statistics,Mathematics]
3 [Statistics,Computer Science]
4 [Statistics,Mathematics]
5 [Statistics,Computer Science]
6 [Statistics,Mathematics]
7 [Mathematics,Computer Science]
8 [Statistics,Computer Science]
9 [Statistics,Computer Science]
10 [Mathematics,Computer Science]
11 [Statistics,Computer Science]
12 [Statistics,Computer Science]
13 [Statistics,Physics,Computer Science]
14 [Statistics,Computer Science]
15 [Statistics,Computer Science]
16 [Statistics,Mathematics]
17 [Statistics,Mathematics]
18 [Statistics,Computer Science]
19 [Statistics,Computer Science]
20 [Mathematics,Computer Science]
21 [Statistics,Mathematics]
22 [Mathematics,Computer Science]
23 [Statistics,Mathematics]
24 [Statistics,Computer Science]
25 [Mathematics,Computer Science]
26 [Statistics,Computer Science]
27 [Statistics,Computer Science]
28 [Quantitative Biology,Computer Science]
29 [Mathematics,Physics]
30 [Statistics,Computer Science]
31 [Statistics,Computer Science]
32 [Statistics

In [48]:
predictions = parse_outputs_to_dataframe(outputs)
targets = test_dataset[subjects]

# Calculate Hamming Loss
hamming = hamming_loss(targets, predictions)
print(f"Hamming Loss: {hamming}")

# Calculate Subset Accuracy (Exact Match Ratio)
subset_accuracy = accuracy_score(targets, predictions)
print(f"Subset Accuracy: {subset_accuracy}")


Hamming Loss: 0.09333333333333334
Subset Accuracy: 0.69
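Hamming loss is the fraction of individual label decisions that are wrong, and subset accuracy is the fraction of rows where all six labels match. With 100 test papers and 6 labels there are 600 label decisions, so a Hamming loss of 0.0933 corresponds to 56 wrong decisions, while a subset accuracy of 0.69 means 69 papers were classified without any mistake. The sketch below recomputes both metrics in plain Python on a made-up toy example; the function names are hypothetical, and the notebook itself relies on scikit-learn's `hamming_loss` and `accuracy_score`.

```python
def hamming_loss_manual(targets, predictions):
    """Fraction of label slots where prediction and target differ."""
    total = sum(len(row) for row in targets)
    wrong = sum(
        t != p
        for t_row, p_row in zip(targets, predictions)
        for t, p in zip(t_row, p_row)
    )
    return wrong / total

def subset_accuracy_manual(targets, predictions):
    """Fraction of rows where every label matches exactly."""
    exact = sum(t_row == p_row for t_row, p_row in zip(targets, predictions))
    return exact / len(targets)

# Toy values for illustration only (not taken from the test set)
toy_targets = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
]
toy_predictions = [
    [1, 0, 0, 1, 0, 0],  # exact match
    [0, 1, 0, 0, 0, 0],  # one label missed
    [1, 0, 0, 1, 0, 0],  # two label slots wrong
]

print(hamming_loss_manual(toy_targets, toy_predictions))     # 3 wrong out of 18 slots
print(subset_accuracy_manual(toy_targets, toy_predictions))  # 1 exact match out of 3 rows
```

The two metrics answer different questions: a prediction with one wrong label out of six counts as a complete miss for subset accuracy but only adds 1/6 to that row's share of the Hamming loss.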
