# 📘 Fine-Tuning with OpenAI GPT

<u><b>Full Procedure</b></u>
1. Prepare the Dataset
- Load a CSV file containing labeled bug reports, their descriptions, and precomputed embeddings

2. Prepare Data to Fine-Tune <code>gpt-4o-mini</code>
- Use <code>train_test_split</code> to divide the data into training and validation sets
- Transform the data into the JSONL format required by OpenAI's fine-tuning API
- Upload the formatted JSONL files to OpenAI's servers

3. Fine-Tune the Model 
- Start the fine-tuning job using <code>gpt-4o-mini</code>

4. Model Evaluation
- Classify held-out samples with both the fine-tuned model and the base <code>gpt-4o-mini</code> without fine-tuning
- Compare predictions against true labels using standard classification metrics

### Imports and Read Data

In [1]:
import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

from openai import OpenAI

client = OpenAI()

In [2]:
# Load the CSV file (comma-delimited, which is the pandas default)
data_path = "bug_reports_mozilla_firefox_resolved_fixed_comments_embeddings.csv"

data = pd.read_csv(data_path)

In [3]:
data.head(2)

| Unnamed: 0 | Bug ID | Type | Summary | Product | Component | Status | Resolution | Priority | Severity | Description | Concat | N_tokens | Embeddings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1955715 | enhancement | Update addonsInfo asrouter targeting to allow ... | Firefox | Messaging System | RESOLVED | FIXED | P1 | -- | Currently, the addonsInfo targeting returns an... | summary update addonsinfo asrouter target allo... | 80 | [-0.015150155872106552, 0.003520532278344035, ... |
| 1 | 1953155 | task | Enable expand on hover and remove coming soon ... | Firefox | Sidebar | RESOLVED | FIXED | P1 | -- | When expand on hover is enabled, the message s... | summary enable expand on hover remove coming s... | 55 | [-0.01597077213227749, 0.009659321047365665, 0... |


In [4]:
# Drop all P3 examples
data = data[data["Priority"] != "P3"].reset_index(drop = True)

In [5]:
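# Keep only the columns needed for fine-tuning
# (.copy() avoids a SettingWithCopyWarning when the labels are remapped later)
data_sub = data[["Concat", "Priority"]].copy()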

In [7]:
# Preview the new subset of the data
data_sub.head()

| Unnamed: 0 | Concat | Priority |
|---|---|---|
| 0 | summary update addonsinfo asrouter target allo... | P1 |
| 1 | summary enable expand on hover remove coming s... | P1 |
| 2 | summary add support picker style tile aboutwel... | P1 |
| 3 | summary what new notification window toast not... | P1 |
| 4 | summary add new callout create tab group actio... | P1 |


In [None]:
# Map the Priority labels (P1/P2) to human-readable text
label_mapping = {"P1": "priority", "P2": "non-priority"}
data_sub["Priority"] = data_sub["Priority"].map(label_mapping)
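
Before splitting, it can help to confirm that every row received one of the two expected labels and to see how balanced the classes are, since `.map()` leaves any value missing from `label_mapping` as `NaN`. The following is a minimal sketch, assuming the mapped labels are available as a plain sequence (for example `data_sub["Priority"].tolist()`); the helper name `label_summary` is hypothetical and not part of the original notebook.

```python
from collections import Counter

def label_summary(labels, expected=("priority", "non-priority")):
    """Return per-label counts and the number of values outside `expected`."""
    counts = Counter(labels)
    unexpected = sum(n for label, n in counts.items() if label not in expected)
    return dict(counts), unexpected

# Toy labels standing in for data_sub["Priority"].tolist()
counts, unexpected = label_summary(["priority", "priority", "non-priority", None])
print(counts)
print(f"Unexpected labels: {unexpected}")
```

A non-zero `unexpected` count would suggest that some rows carry a priority value other than P1 or P2.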

In [9]:
# We can see that the class labels have been changed to 'priority' and 'non-priority'
data_sub.head()

| Unnamed: 0 | Concat | Priority |
|---|---|---|
| 0 | summary update addonsinfo asrouter target allo... | priority |
| 1 | summary enable expand on hover remove coming s... | priority |
| 2 | summary add support picker style tile aboutwel... | priority |
| 3 | summary what new notification window toast not... | priority |
| 4 | summary add new callout create tab group actio... | priority |


### Prepare and Upload Data to OpenAI API

In [10]:
# Split the data into training and validation sets (80% train, 20% validation)
train_data, validation_data = train_test_split(data_sub, test_size = 0.2, random_state = 12345)
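
The split above is purely random, so the class proportions in the training and validation sets can drift from those of the full dataset. If that matters, `train_test_split` accepts a `stratify` argument that keeps the proportions aligned. The sketch below uses toy lists with illustrative names; in this notebook the equivalent change would presumably be passing `stratify = data_sub["Priority"]`, which would also change which rows land in each set.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 6 "priority" and 4 "non-priority" examples (illustrative only)
texts = [f"report {i}" for i in range(10)]
labels = ["priority"] * 6 + ["non-priority"] * 4

train_x, valid_x, train_y, valid_y = train_test_split(
    texts, labels, test_size=0.5, random_state=12345, stratify=labels
)

# Both halves keep the 60/40 class ratio of the full toy dataset
print(Counter(train_y), Counter(valid_y))
```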

In [13]:
# Function to save DataFrame rows to a JSONL file using the chat message/role structure
def save_to_jsonl(data, output_file_path):
    jsonl_data = []
    for index, row in data.iterrows():
        jsonl_data.append({
            "messages": [
                {"role": "system", "content": "Given a bug report from Bugzilla, classify whether it is 'priority' or 'non-priority'."},
                {"role": "user", "content": row['Concat']},
                {"role": "assistant", "content": f"\"{row['Priority']}\""}
            ]
        })

    # Save to JSONL format
    with open(output_file_path, 'w') as f:
        for item in jsonl_data:
            f.write(json.dumps(item) + '\n')

In [14]:
# Save the training and validation sets to separate JSONL files
train_output_file_path = 'data_for_finetuning_prepared_train.jsonl' 
validation_output_file_path = 'data_for_finetuning_prepared_valid.jsonl'

save_to_jsonl(train_data, train_output_file_path)
save_to_jsonl(validation_data, validation_output_file_path)
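
Since a malformed line can cause the upload or the job's file validation to fail, a quick local check of the JSONL files may save a round trip. The sketch below assumes the three-message layout written by `save_to_jsonl` (system, user, assistant); the helper name `validate_chat_lines` is hypothetical, and the checks are illustrative rather than a full copy of OpenAI's own validation.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]  # order written by save_to_jsonl

def validate_chat_lines(lines):
    """Return a list of problems found in an iterable of JSONL lines."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append(f"line {line_no}: missing 'messages' list")
            continue
        roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
        if roles != EXPECTED_ROLES:
            problems.append(f"line {line_no}: unexpected roles {roles}")
        for m in messages:
            content = m.get("content") if isinstance(m, dict) else None
            if not isinstance(content, str) or not content.strip():
                problems.append(f"line {line_no}: empty or non-string content")
                break
    return problems

good = json.dumps({"messages": [
    {"role": "system", "content": "Classify the bug report."},
    {"role": "user", "content": "summary example report"},
    {"role": "assistant", "content": "\"priority\""},
]})
bad = json.dumps({"messages": [{"role": "user", "content": ""}]})
print(validate_chat_lines([good + "\n", bad + "\n", "not json\n"]))
```

To check a file on disk, an open file handle can be passed directly, for example `with open(train_output_file_path) as f: validate_chat_lines(f)`.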

In [15]:
# Report where the files were saved
print(f"Training dataset saved to: {train_output_file_path}")
print(f"Validation dataset saved to: {validation_output_file_path}")

Training dataset saved to: data_for_finetuning_prepared_train.jsonl
Validation dataset saved to: data_for_finetuning_prepared_valid.jsonl


In [None]:
# Upload the datasets to the OpenAI API
# (the context managers make sure the file handles are closed afterwards)
with open(train_output_file_path, "rb") as f:
    train_file = client.files.create(file = f, purpose = "fine-tune")

with open(validation_output_file_path, "rb") as f:
    valid_file = client.files.create(file = f, purpose = "fine-tune")

print(f"Training file Info: {train_file}")
print(f"Validation file Info: {valid_file}")

### Starting the Fine-Tuning Job

In [None]:
model = client.fine_tuning.jobs.create(
  training_file = train_file.id, 
  validation_file = valid_file.id,
  model = "gpt-4o-mini-2024-07-18", 
  hyperparameters = {
    "n_epochs": 5,
    "batch_size": 5
  }
)
job_id = model.id
status = model.status

print(f'Fine-tuning model with jobID: {job_id}.')
print(f"Training Response: {model}")
print(f"Training Status: {status}")

In [None]:
# Retrieve the current state of the fine-tuning job
client.fine_tuning.jobs.retrieve(job_id)
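
A fine-tuning job runs asynchronously, so a single `retrieve` call only shows a snapshot of its state. One way to wait for completion is to poll until the job reports a terminal status. The sketch below assumes the terminal statuses are `succeeded`, `failed`, and `cancelled`; the helper `wait_for_job` is hypothetical, and the `retrieve` callable is injected so the loop can be exercised with a stand-in instead of the real client.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal states

def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) until the returned job has a terminal status.

    `retrieve` is any callable returning an object with a `.status` attribute,
    e.g. client.fine_tuning.jobs.retrieve.
    """
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")

# Stand-in for the API: reports queued, then running, then succeeded
_fake_states = iter(["queued", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_fake_states))

job = wait_for_job(fake_retrieve, "ftjob-demo", sleep=lambda seconds: None)
print(job.status)
```

With the real client, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`.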

In [None]:
# Retrieve the fine-tuned model name from the job started above
# (fine_tuned_model stays None until the job has succeeded)
result = client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model = result.fine_tuned_model
print(fine_tuned_model)

In [3]:
# Check the response given from the fine-tuned model for a given bug report
completion = client.chat.completions.create(
  model = fine_tuned_model,
  messages=[
    {"role": "system", "content": "Given a bug report from Bugzilla, classify whether it is 'priority' or 'non-priority'."},
    {"role": "user", "content": "migrate preference experimental nimbus"}
  ]
)
print(completion.choices[0].message.content)

"priority"


### Model Evaluation

In [25]:
# Send each bug report to the chat completions endpoint and collect the predicted label
def predict(test, model):
    
    y_pred = []
    categories = ["non-priority", "priority"]

    for index, row in test.iterrows():
        response = client.chat.completions.create(
            model = model,
            messages = [
                {
                    "role": "system",
                    "content": "Given a bug report from Bugzilla, classify whether it is 'priority' or 'non-priority'.",
                },
                {"role": "user", "content": row["Concat"]},
            ],
        )

        answer = response.choices[0].message.content

        # Determine the predicted category. "non-priority" must be checked first,
        # because "priority" is a substring of it.
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("None")
            
    return y_pred
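
The matching step in `predict` depends on the order of `categories`: because "priority" is a substring of "non-priority", checking "priority" first would label every "non-priority" answer as "priority". The training targets are also wrapped in literal double quotes, so the fine-tuned model tends to answer with quotes included. A small standalone parser makes both points explicit; this is a sketch, and `parse_label` is a hypothetical helper rather than code from the original notebook.

```python
def parse_label(answer, categories=("non-priority", "priority")):
    """Map a raw model reply to one of `categories`, or "None" if nothing matches."""
    # Remove surrounding whitespace and quote characters, then normalize case
    cleaned = answer.strip().strip("\"'").lower()
    if cleaned in categories:
        return cleaned
    # Fall back to substring search, longest label first, so that
    # "non-priority" wins over its own substring "priority"
    for category in sorted(categories, key=len, reverse=True):
        if category in cleaned:
            return category
    return "None"

print(parse_label('"priority"'))
print(parse_label("This looks like Non-Priority."))
print(parse_label("unclear"))
```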

In [26]:
# Function to evaluate the model's performance
def evaluate(y_true, y_pred):
    labels = ["non-priority", "priority"]
    mapping = {label: idx for idx, label in enumerate(labels)}

    def map_func(x):
        # Unknown values (e.g. the "None" fallback returned by predict) map to -1
        return mapping.get(x, -1)

    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)

    # Calculate accuracy

    accuracy = accuracy_score(y_true = y_true_mapped, y_pred = y_pred_mapped)
    print(f"Accuracy: {accuracy:.3f}")

    # Generate accuracy report for each class

    unique_labels = set(y_true_mapped)  # Get unique labels

    for label in unique_labels:
        label_indices = [
            i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label
        ]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f"Accuracy for label {labels[label]}: {label_accuracy:.3f}")
        
    # Generate classification report

    class_report = classification_report(
        y_true = y_true_mapped,
        y_pred = y_pred_mapped,
        target_names = labels,
        labels = list(range(len(labels))),
    )
    print("\nClassification Report:")
    print(class_report)

    # Generate confusion matrix

    conf_matrix = confusion_matrix(
        y_true = y_true_mapped, y_pred = y_pred_mapped, labels = list(range(len(labels)))
    )
    print("\nConfusion Matrix:")
    print(conf_matrix)

In [27]:
# Evaluate model without fine-tuning
y_pred = predict(validation_data, "gpt-4o-mini-2024-07-18")
y_true = validation_data["Priority"]

In [28]:
evaluate(y_true, y_pred)

Accuracy: 0.429
Accuracy for label non-priority: 0.687
Accuracy for label priority: 0.288

Classification Report:
              precision    recall  f1-score   support

non-priority       0.34      0.69      0.46       601
    priority       0.63      0.29      0.40      1106

    accuracy                           0.43      1707
   macro avg       0.49      0.49      0.43      1707
weighted avg       0.53      0.43      0.42      1707


Confusion Matrix:
[[413 188]
 [787 319]]
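
The "Accuracy for label" lines printed by `evaluate` are computed only over the samples whose true label is that class, which makes them the same quantity as the per-class recall in the classification report. As a worked check, the sketch below recomputes them from the baseline confusion matrix above (rows are true labels, columns are predictions); the helper name is illustrative.

```python
def per_class_recall(conf_matrix):
    """Recall per class: the diagonal entry divided by the sum of its row."""
    return [row[i] / sum(row) for i, row in enumerate(conf_matrix)]

# Baseline confusion matrix; rows: true non-priority, true priority
baseline = [[413, 188], [787, 319]]

recalls = per_class_recall(baseline)
total = sum(sum(row) for row in baseline)
accuracy = sum(row[i] for i, row in enumerate(baseline)) / total

print([round(r, 3) for r in recalls])  # [0.687, 0.288]
print(round(accuracy, 3))              # 0.429
```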


In [29]:
# Evaluate fine-tuned model
y_pred = predict(validation_data, fine_tuned_model)

In [30]:
evaluate(y_true, y_pred)

Accuracy: 0.752
Accuracy for label non-priority: 0.534
Accuracy for label priority: 0.871

Classification Report:
              precision    recall  f1-score   support

non-priority       0.69      0.53      0.60       601
    priority       0.77      0.87      0.82      1106

   micro avg       0.75      0.75      0.75      1707
   macro avg       0.73      0.70      0.71      1707
weighted avg       0.75      0.75      0.74      1707


Confusion Matrix:
[[321 280]
 [142 963]]
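
The entries of the fine-tuned confusion matrix sum to 1706, one fewer than the support of 1707, and the report shows a `micro avg` row where the baseline report shows `accuracy`. Both observations are consistent with one prediction falling outside the two labels (the "None" fallback in `predict`), although the notebook output does not confirm this directly. A sketch for checking it, with `count_unparsed` as a hypothetical helper:

```python
def count_unparsed(y_pred, valid_labels=("non-priority", "priority")):
    """Count predictions that are not one of the expected labels."""
    return sum(1 for label in y_pred if label not in valid_labels)

# Arithmetic check against the reported numbers
fine_tuned_matrix = [[321, 280], [142, 963]]
support = 601 + 1106
in_matrix = sum(sum(row) for row in fine_tuned_matrix)
print(support - in_matrix)  # 1

# Toy predictions standing in for the list returned by predict(...)
print(count_unparsed(["priority", "None", "non-priority"]))  # 1
```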
