# How to submit OpenAI GPT-3.5-turbo Fine-tuning Jobs

In this notebook, I walk through how to submit a GPT-3.5-turbo fine-tuning job to OpenAI. Note that as of September 1, 2023, fine-tuning is only available for GPT-3.5-turbo; support for GPT-4 is expected in the fall of the same year.

This notebook covers the following steps:
- Data Preparation
- Error Checking
- Training File Upload
- Job Submission
- Status Tracking
- Calling a Fine-tuned Model

References:
- https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates
- https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

### 1. Data Preparation

You have to submit a .jsonl training file for the fine-tuning job, with one training sample per line. Each sample should include the following:
- a system message: a pre-prompt that tells GPT what it is supposed to mimic.
- a user message: the main prompt, including the instructions for the task.
- an assistant message: what you want GPT to respond with.

Each line of the file should be a flattened (single-line) version of a sample like this:

In [None]:
{
  "messages": [
    { "role": "system", "content": "You are an assistant that occasionally misspells words" },
    { "role": "user", "content": "Tell me a story." },
    { "role": "assistant", "content": "One day a student went to schoool." }
  ]
}
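
As a rough illustration of what "flattened" means here, the sketch below builds the same sample in Python and serializes it to a single line. The helper name `make_example` is hypothetical and not part of any OpenAI library.

```python
import json

def make_example(system, user, assistant):
    """Build one chat-format training sample as a plain dict (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }

example = make_example(
    "You are an assistant that occasionally misspells words",
    "Tell me a story.",
    "One day a student went to schoool.",
)

# json.dumps without an indent argument emits the whole object on one line,
# which is the form each line of a .jsonl file takes
jsonl_line = json.dumps(example)
print(jsonl_line)
```

Writing one such line per sample, each followed by a newline, gives a file in the expected layout.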

Alternatively, if you have a .json file of training samples formatted as above, you can simply convert it to a .jsonl file:

In [1]:
import json

# Read the original JSON file
with open('training_file.json', 'r') as f:
    data = json.load(f)

# Write each object as a separate JSON object in a JSONL file
with open('training_file.jsonl', 'w') as f:
    for obj in data:
        json_line = json.dumps(obj)
        f.write(json_line + '\n')
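
Before moving on, it can help to confirm that every line of the converted file parses on its own. The following is a minimal sketch under that assumption; `find_bad_jsonl_lines` is a hypothetical helper name, and the sample lines are made up for illustration.

```python
import json

def find_bad_jsonl_lines(lines):
    """Return 1-based numbers of lines that are not a standalone JSON object."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        # Each training sample is expected to be a JSON object, not a list or scalar
        if not isinstance(obj, dict):
            bad.append(number)
    return bad

sample_lines = [
    '{"messages": []}',
    '{"messages": [',   # a pretty-printed object spilled across lines
    '[1, 2, 3]',        # valid JSON, but not an object
]
bad_lines = find_bad_jsonl_lines(sample_lines)
print(bad_lines)  # prints [2, 3]
```

The same function accepts an open file handle, since iterating over a text file yields one line at a time.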

### 2. Error Checking

This step uses the validation code from the OpenAI fine-tuning documentation to make sure your .jsonl file is in the correct format.

In [1]:
# We start by importing the required packages

import json
import os
import tiktoken
import numpy as np
from collections import defaultdict

In [2]:
# Next, we specify the data path and open the JSONL file

data_path = "training_file.jsonl"

# Load dataset
with open(data_path) as f:
    dataset = [json.loads(line) for line in f]

In [3]:
# We can inspect the data quickly by checking the number of examples and the first item

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 305
First example:
{'role': 'system', 'content': 'You are an assistant that generate answerable user-like questions from a passage.'}
{'role': 'user', 'content': 'Given a passage from the user manual of a specific machinery, imagine you are operating said machinery, generate a set of questions that a user operating it might ask.\n\nPassage:\nTo disable and enable Cruise Grade Braking for the current ignition cycle, press and hold the Tow/Haul button for five seconds. A Driver Information Center (DIC) message displays.\n\nEach question should be user-like.\nEach question should be closely related to the passage.\nEach question must be answerable by the information in the passage.\nThe set of questions should be sufficient to completely cover the information in the passage.\nQuestions must not be questions about the passage location.\nEach question should not be a combination of two or more questions.\nEach question should focus on practical application, problem-solving, or

Now that we have a sense of the data, we need to go through all the examples and check that the formatting is correct and matches the Chat Completions message structure.

In [4]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found
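
To see what these checks flag, the sketch below applies a trimmed-down subset of them to a single deliberately malformed example. The function `check_example` and the bad example are hypothetical, written only for illustration; the error keys are the ones used in the cell above.

```python
def check_example(ex):
    """Return the format-error keys raised by one example (subset of the checks above)."""
    if not isinstance(ex, dict):
        return ["data_type"]
    messages = ex.get("messages")
    if not messages:
        return ["missing_messages_list"]
    errors = []
    for message in messages:
        if "role" not in message or "content" not in message:
            errors.append("message_missing_key")
        if message.get("role") not in ("system", "user", "assistant"):
            errors.append("unrecognized_role")
    if not any(m.get("role") == "assistant" for m in messages):
        errors.append("example_missing_assistant_message")
    return errors

# Made-up example: an unknown role, a message without content, and no assistant reply
bad_example = {"messages": [{"role": "bot", "content": "Hi"}, {"role": "user"}]}
print(check_example(bad_example))
```

Returning the keys per example, rather than only counting them, makes it easier to locate the offending samples in a large file.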


Beyond the structure of the message, we also need to ensure that the length does not exceed the 4096 token limit.

In [5]:
# Token counting functions
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
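    # Note: the quantiles computed here are the 10th and 90th percentiles, despite the "p5 / p95" label
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")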
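
One way to act on the 4096-token limit mentioned above is to set over-length examples aside before uploading, so they can be shortened by hand instead of being truncated. This is only a sketch: `split_by_token_limit` is a hypothetical helper, and the word-based counter is a stand-in so the example runs without tiktoken. In this notebook, `num_tokens_from_messages` would be passed as `count_tokens` instead.

```python
def split_by_token_limit(dataset, count_tokens, limit=4096):
    """Split examples into (kept, too_long) using the supplied token counter."""
    kept, too_long = [], []
    for ex in dataset:
        if count_tokens(ex["messages"]) > limit:
            too_long.append(ex)
        else:
            kept.append(ex)
    return kept, too_long

def word_count(messages):
    """Stand-in counter for illustration only: counts words, not real tokens."""
    return sum(len(m["content"].split()) for m in messages)

# Toy data with a tiny limit so that one example falls on each side
toy_dataset = [
    {"messages": [{"role": "user", "content": "short prompt"}]},
    {"messages": [{"role": "user", "content": "word " * 10}]},
]
kept, too_long = split_by_token_limit(toy_dataset, word_count, limit=5)
print(len(kept), len(too_long))  # prints 1 1
```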

In [6]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
TARGET_EPOCHS = 3
MIN_EPOCHS = 1
MAX_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print("See pricing page to estimate total costs")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 212, 700
mean / median: 319.11803278688524, 308.0
p5 / p95: 248.4, 395.20000000000005

#### Distribution of num_assistant_tokens_per_example:
min / max: 26, 437
mean / median: 68.85573770491803, 57.0
p5 / p95: 39.0, 105.80000000000007

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~97331 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~291993 tokens
See pricing page to estimate total costs
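
The epoch and billing arithmetic above can be checked in isolation. The sketch below wraps the same rule in a function, using the constants from the cell above as defaults; the name `estimate_default_epochs` is hypothetical, and the rule is only the estimate used in this notebook, not a guarantee of what the service will choose.

```python
def estimate_default_epochs(
    n_train_examples,
    target_epochs=3,
    min_target_examples=100,
    max_target_examples=25000,
    min_epochs=1,
    max_epochs=25,
):
    """Reproduce the notebook's default-epoch estimate (expects n_train_examples > 0)."""
    n_epochs = target_epochs
    if n_train_examples * target_epochs < min_target_examples:
        # Very small datasets: train for more epochs, capped at max_epochs
        n_epochs = min(max_epochs, min_target_examples // n_train_examples)
    elif n_train_examples * target_epochs > max_target_examples:
        # Very large datasets: train for fewer epochs, but at least min_epochs
        n_epochs = max(min_epochs, max_target_examples // n_train_examples)
    return n_epochs

n_epochs = estimate_default_epochs(305)   # dataset size reported above
billed_tokens = n_epochs * 97331          # token count reported above
print(n_epochs, billed_tokens)            # prints 3 291993
```

With only 10 examples the same rule gives 10 epochs, and with 10,000 examples it gives 2.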


### 3. Training File Upload

Once we have checked that the training file is in the correct format, we can start uploading the training file to OpenAI.

Make sure you have an up-to-date version of the openai package. (I couldn't find the minimum version required for fine-tuning online, only that the package needs to be updated.)

In [7]:
import openai

openai.api_key = os.environ.get("OPENAI_API_KEY")

In [None]:
# Uploading the training file

training_file_name = 'training_file.jsonl'

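# Open the file in a context manager so the handle is closed after the upload
with open(training_file_name, 'rb') as training_file:
    training_response = openai.File.create(file=training_file, purpose="fine-tune")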

training_file_id = training_response["id"]

print("Training file id:", training_file_id)

### 4. Job Submission

Now that the training file is uploaded, we can submit a fine-tuning job. Note that each organization can only have one fine-tuning job running at a time.

In [None]:
# suffix_name is a custom label that becomes part of the fine-tuned model's name
suffix_name = "iNAGO-QGen-example"

# fine-tuning job submission
response = openai.FineTuningJob.create(
    training_file = training_file_id,
    model = "gpt-3.5-turbo",
    suffix = suffix_name
)

job_id = response["id"]

print(response)

### 5. Status Tracking

You can track the fine-tuning process with the code below. Note that a fine-tuning job usually starts 10-15 minutes after submission.

In [None]:
response = openai.FineTuningJob.list_events(id=job_id, limit=50)

events = response["data"]
events.reverse()

for event in events:
    print(event['message'])
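
Since a job can take a while to start and finish, a polling loop is one option for waiting on it. The sketch below is an assumption-laden illustration: `wait_for_job` is a hypothetical helper, the status strings are assumed values, and the status source is simulated so the code runs without calling the API. With the pre-1.0 openai package used in this notebook, `fetch_status` could plausibly be `lambda: openai.FineTuningJob.retrieve(job_id)["status"]`.

```python
import time

# Assumed terminal status values; adjust if the API reports different ones
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status

# Simulated status sequence standing in for real API responses
fake_statuses = iter(["validating_files", "running", "succeeded"])
final_status = wait_for_job(
    lambda: next(fake_statuses),
    poll_seconds=0,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(final_status)  # prints succeeded
```

The `max_polls` cap keeps the loop from running forever if the job never reaches a terminal state.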

### 6. Calling a Fine-tuned Model

To call a fine-tuned model, pass its model ID as the `model` argument of the chat completion call.
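
Because the model ID is read from an environment variable in the cells that follow, a small guard can catch a missing or mistyped value before any request is sent. This is a sketch only: `get_fine_tuned_model_id` is a hypothetical helper, the demo ID is made up, and the `ft:` prefix check assumes fine-tuned model names start that way.

```python
import os

def get_fine_tuned_model_id(env=os.environ, var_name="MODEL_ID"):
    """Read the model ID from the environment and fail early if it looks wrong."""
    model_id = env.get(var_name)
    if not model_id:
        raise RuntimeError(f"{var_name} is not set")
    # Assumption: fine-tuned model names begin with "ft:"
    if not model_id.startswith("ft:"):
        raise ValueError(f"{var_name} does not look like a fine-tuned model ID: {model_id!r}")
    return model_id

# Made-up ID used only to demonstrate the check
demo_env = {"MODEL_ID": "ft:gpt-3.5-turbo-0613:my-org:example-suffix:abc123"}
print(get_fine_tuned_model_id(demo_env))
```

Passing a plain dict as `env` keeps the check easy to try without touching real environment variables.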

In [8]:
# calling a fine-tuned model

test_messages = []
test_messages.append({"role": "system", "content": "You are an assistant that generate answerable user-like questions from a passage."})
prompt = '''Generate around ten questions from the following passage:

Before you can use a phone to access Model 3, follow these steps to authenticate it:
Download the Tesla mobile app to your phone. Log into the Tesla mobile app using your Tesla Account user name and password. NOTE: You must remain logged in to your Tesla Account to use your phone to access Model 3.
Ensure that your phone's Bluetooth settings are turned on. You must have your phone's Bluetooth setting turned on AND you must also ensure that Bluetooth is turned on within your phone's global settings for the Tesla mobile app.
For example, on your phone, navigate to Settings, choose the Tesla mobile app, and ensure the Bluetooth setting is enabled. NOTE: Model 3 communicates with your phone using Bluetooth.
To authenticate your phone or use it as a key, the phone must be powered on and Bluetooth must be enabled. Keep in mind that your phone must have enough battery power to run Bluetooth and that many phones disable Bluetooth when the battery is low.
Ensure that Allow Mobile Access (Controls > Safety & Security > Allow Mobile Access) is enabled. In the Tesla mobile app, touch PHONE KEY then touch START to search for your Model 3. When your Model 3 is detected, the mobile app asks you to tap your key card.
Tap the key card against the Model 3 card reader on the door pillar or center console (see Key Card on page 9).'''
test_messages.append({"role": "user", "content": prompt})

model_id = os.environ.get("MODEL_ID")

response = openai.ChatCompletion.create(
    model = model_id, messages = test_messages, temperature = 1, n = 1
)
print(response["choices"][0]['message']['content'])

What does the phone need to be powered on to authenticate it?
What does the phone need to enable it to use as a key?
What happens when my Model 3 is detected by the mobile app?
When do I need to tap the key card against the Model 3 card reader on the door pillar?
When do I need to tap the key card against the Model 3 card reader on the centre console?
What must I do to use the phone to access Model 3?
What must I do before I can use a phone to access Model 3?
How do I authenticate the phone to use as a key?
What should I do to use the phone as a key?
How to ensure Model 3 and my phone stay connected?
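
If the questions are needed as a list for later processing, the response text can be split on line breaks. The sketch below assumes the model returns one question per line, as in the output above; `parse_questions` is a hypothetical helper, and the sample text reuses three of the questions shown.

```python
def parse_questions(text):
    """Split a response into non-empty, whitespace-trimmed lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Sample text with a blank line in the middle to show that empty lines are dropped
sample_output = """What must I do to use the phone to access Model 3?
How do I authenticate the phone to use as a key?

What should I do to use the phone as a key?
"""
questions = parse_questions(sample_output)
print(len(questions))  # prints 3
```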


In [None]:
# calling a fine-tuned model

import os
import openai

test_messages = []
test_messages.append({"role": "system", "content": "You are an assistant that generate answerable user-like questions from a passage."})
prompt = '''Your prompt...'''
test_messages.append({"role": "user", "content": prompt})

# Read the fine-tuned model ID (not the API key) from the environment
model_id = os.environ.get("MODEL_ID")

response = openai.ChatCompletion.create(
    model = model_id, messages = test_messages, temperature = 1, n = 1
)
print(response["choices"][0]['message']['content'])