# Step 1: Defining the Task
For this workshop we will:
* Train GPT-3.5 to automatically produce fitness logs, nutrition logs, and response messages wrapped in XML tags (`<fitness_log>`, `<nutrition_log>`, `<message>`).
* Replace complex prompt instructions with simpler prompts sent to a specialized model.
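
To make the target format concrete, here is a minimal sketch of how a downstream consumer might parse the tagged output. The helper `parse_tagged_response` is a hypothetical name, not part of the workshop code, and it assumes each tag appears at most once and is never nested.

```python
import re

# Hypothetical helper (not part of the workshop code): pull the contents of
# each expected tag out of a model response.
EXPECTED_TAGS = ("fitness_log", "nutrition_log", "message")

def parse_tagged_response(text):
    """Return {tag: content} for each expected tag found in `text`."""
    parsed = {}
    for tag in EXPECTED_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match:
            parsed[tag] = match.group(1).strip()
    return parsed

sample = ("<fitness_log>ran half a mile</fitness_log>"
          "<nutrition_log>eggs</nutrition_log><message>Nice!</message>")
print(parse_tagged_response(sample))
```

A fixed, tag-delimited format like this is what makes the output easy to consume programmatically, which is the motivation for training the model to produce it reliably.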

# Step 2: Prepare fine-tuning data set

In [2]:
# imports
from openai import OpenAI
import json
import tiktoken
from collections import defaultdict
import numpy as np

client = OpenAI()

### Aggregate a data set from a larger model with more complex instructions

We gave a larger model (`gpt-4o`) the following detailed system prompt:

> You are an expert personal trainer and fitness coach for athletes. Start by asking for a daily update, and you will be told what the user ate and trained that day. Respond with: 1. in <fitness_log> tags log the training (do not include dates) 2. in <nutrition_log> tags log the name of the food (do not include dates) 3. in <message> respond with "Nice!" Do not add any new line characters.

* We then simulated a few conversations and captured the responses in a list.
* We collected a total of 20 samples for our task.
* The Fine-Tuning API requires a minimum of 10 samples.
* OpenAI reports a clear performance gain once the training set goes beyond 50 samples.
* 50 to 100 examples are typical.
* The exact number of samples required depends on your specific use case.

In [3]:
system_prompt = '''You are an expert personal trainer and fitness coach for athletes. Start by asking for a daily update, and you will be told what the user ate and trained that day. Respond with: 
1. in <fitness_log> tags log the training (do not include dates)
2. in <nutrition_log> tags log the name of the food (do not include dates)
3. in <message> respond with "Nice!"

Do not add any new line characters.'''

user_prompts = [{"role": "user", "content": "I ate eggs and ran half a mile"}, 
                 {"role": "user", "content": "I ate pizza and did 10 pushups"}, 
                 {"role": "user", "content": "I ate a salad and swam 5 laps"}, 
                 {"role": "user", "content": "I ate a burger and did 15 squats"},
                 {"role": "user", "content": "I ate cookies and did 20 crunches"}, 
                 {"role": "user", "content": "I ate a steak and did 30 burpees"}, 
                 {"role": "user", "content": "I ate a pie and did 5 pull-ups"}, 
                 {"role": "user", "content": "I ate cake and did 1 hour of yoga"},
                 {"role": "user", "content": "I ate soup and took a long walk"},
                 {"role": "user", "content": "I ate candy and did pilates"},
                 {"role": "user", "content": "I ate pasta and biked for 1 hour"},
                 {"role": "user", "content": "I ate sushi and hiked for 1 hour"},
                 {"role": "user", "content": "I ate cheese and did jumping jacks"},
                 {"role": "user", "content": "I ate shrimp and did Tae Kwon Do"},
                 {"role": "user", "content": "I ate cereal and did 20 sprints"},
                 {"role": "user", "content": "I ate chips and rowed for 1 hour"},
                 {"role": "user", "content": "I ate nuts and walked for 1 hour"},
                 {"role": "user", "content": "I ate tacos and ran 3 miles"},
                 {"role": "user", "content": "I ate a banana and did 50 pushups"},
                 {"role": "user", "content": "I ate an orange and swam 10 laps"},
                ]

In [None]:
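# Short system prompt stored in the training data; the long instructions in
# system_prompt are only used to generate the responses with the larger model
system_role = {"role": "system", "content": "You are a personal trainer and fitness coach."}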
conversations = []

# loop through user prompts and generate responses
for user_prompt in user_prompts:
  completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": system_prompt}, user_prompt],
    temperature=1,
    max_tokens=1024,
    seed=42
  )

  assistant_role = {"role": "assistant", "content": completion.choices[0].message.content}

  # build conversations collection
  conversation = [system_role, user_prompt, assistant_role]
  conversations.append(conversation)
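
Because the responses are sampled at `temperature=1`, the larger model may occasionally drift from the requested format. Below is a minimal sketch of a filter that keeps only well-formed conversations before they are written to disk. The names `is_well_formed` and `FORMAT_PATTERN` are hypothetical, and the pattern assumes the exact tag order requested in the system prompt.

```python
import re

# Assumed strict format: the three tags in order, no newlines, nothing else.
FORMAT_PATTERN = re.compile(
    r"<fitness_log>[^<\n]+</fitness_log>"
    r"<nutrition_log>[^<\n]+</nutrition_log>"
    r"<message>Nice!</message>"
)

def is_well_formed(conversation):
    """True if the last message is an assistant reply in the strict format."""
    last = conversation[-1]
    return (last["role"] == "assistant"
            and FORMAT_PATTERN.fullmatch(last["content"]) is not None)

# Tiny stand-in for the `conversations` list built above.
example_conversations = [
    [{"role": "user", "content": "I ate eggs and ran half a mile"},
     {"role": "assistant", "content": "<fitness_log>ran half a mile</fitness_log><nutrition_log>eggs</nutrition_log><message>Nice!</message>"}],
    [{"role": "user", "content": "I ate pizza and did 10 pushups"},
     {"role": "assistant", "content": "Great job!\n<message>Nice!</message>"}],
]

clean = [c for c in example_conversations if is_well_formed(c)]
print(f"kept {len(clean)} of {len(example_conversations)} conversations")
```

Off-format examples in the training set would teach the fine-tuned model the same drift, so it is usually cheaper to drop them here than to debug them later.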
  


# Step 3: Convert Data to .jsonl format

In [4]:
# view raw conversations
conversations

[[{'role': 'system',
   'content': 'You are a personal trainer and fitness coach.'},
  {'role': 'user', 'content': 'I ate eggs and ran half a mile'},
  {'role': 'assistant',
   'content': '<fitness_log>ran half a mile</fitness_log><nutrition_log>eggs</nutrition_log><message>Nice!</message>'}],
 [{'role': 'system',
   'content': 'You are a personal trainer and fitness coach.'},
  {'role': 'user', 'content': 'I ate pizza and did 10 pushups'},
  {'role': 'assistant',
   'content': '<fitness_log>10 pushups</fitness_log><nutrition_log>pizza</nutrition_log><message>Nice!</message>'}],
 [{'role': 'system',
   'content': 'You are a personal trainer and fitness coach.'},
  {'role': 'user', 'content': 'I ate a salad and swam 5 laps'},
  {'role': 'assistant',
   'content': '<fitness_log>swam 5 laps</fitness_log><nutrition_log>salad</nutrition_log><message>Nice!</message>'}],
 [{'role': 'system',
   'content': 'You are a personal trainer and fitness coach.'},
  {'role': 'user', 'content': 'I ate a

In [5]:
jsonl_file_name = 'data.jsonl'

# Write each conversation to the JSON Lines file line by line
with open(jsonl_file_name, 'w') as file:
    for conversation in conversations:
        json.dump({"messages": conversation}, file)
        file.write('\n')  # Add a newline to separate lines

print(f'The JSONL file has been written to {jsonl_file_name}')

The JSONL file has been written to data.jsonl
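
Each line of a JSONL file is an independent JSON object, which is why the loop above writes one `json.dump` followed by a newline. Here is a minimal sketch of reading such a file back to confirm that it round-trips. It writes to a temporary file rather than `data.jsonl`, and `read_jsonl` is a hypothetical helper name.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = [
    {"messages": [
        {"role": "system", "content": "You are a personal trainer and fitness coach."},
        {"role": "user", "content": "I ate eggs and ran half a mile"},
        {"role": "assistant", "content": "<fitness_log>ran half a mile</fitness_log><nutrition_log>eggs</nutrition_log><message>Nice!</message>"},
    ]},
]

# Write to a temporary file so the sketch does not touch data.jsonl.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    tmp_path = tmp.name

loaded = read_jsonl(tmp_path)
os.remove(tmp_path)
print(loaded == records)
```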


# Step 4: Verify Data and Evaluate Cost
Here we perform the following tasks:
* Check that the data format is correct
* Check that every example is within the 16,385-token limit
* Compute summary statistics of the data (mean, median, min, max, and lower/upper percentiles)
* Count the total number of tokens
* Estimate the cost of fine-tuning

In [6]:
## UTILITY FUNCTION FROM AIUG DEVELOPER DAY

# verify data utility
def verify_data(data_path):

    # Load dataset
    with open(data_path) as f:
        dataset = [json.loads(line) for line in f]

    # We can inspect the data quickly by checking the number of examples and the first item

    # Initial dataset stats
    print("Num examples:", len(dataset))
    print("First example:")
    for message in dataset[0]["messages"]:
        print(message)

    # Now that we have a sense of the data, we need to go through all the different examples and check to make sure the formatting is correct and matches the Chat completions message structure

    # Format error checks
    format_errors = defaultdict(int)

    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue

        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue

        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1

            if any(k not in ("role", "content", "name") for k in message):
                format_errors["message_unrecognized_key"] += 1

            if message.get("role", None) not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1

            content = message.get("content", None)
            if not content or not isinstance(content, str):
                format_errors["missing_content"] += 1

        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1

    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found")

    # Beyond the structure of the message, we also need to ensure that the length does not exceed the 16385 token limit used in the check below.

    # Token counting functions
    encoding = tiktoken.get_encoding("cl100k_base")

    # not exact!
    # simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
    def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
        num_tokens = 0
        for message in messages:
            num_tokens += tokens_per_message
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":
                    num_tokens += tokens_per_name
        num_tokens += 3
        return num_tokens

    def num_assistant_tokens_from_messages(messages):
        num_tokens = 0
        for message in messages:
            if message["role"] == "assistant":
                num_tokens += len(encoding.encode(message["content"]))
        return num_tokens

    def print_distribution(values, name):
        print(f"\n#### Distribution of {name}:")
        print(f"min / max: {min(values)}, {max(values)}")
        print(f"mean / median: {np.mean(values)}, {np.median(values)}")
        # Note: the label says p5 / p95, but quantiles 0.1 and 0.9 are the 10th and 90th percentiles
        print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

    # Last, we can look at the results of the different formatting operations before proceeding with creating a fine-tuning job:

    # Warnings and tokens counts
    n_missing_system = 0
    n_missing_user = 0
    n_messages = []
    convo_lens = []
    assistant_message_lens = []

    for ex in dataset:
        messages = ex["messages"]
        if not any(message["role"] == "system" for message in messages):
            n_missing_system += 1
        if not any(message["role"] == "user" for message in messages):
            n_missing_user += 1
        n_messages.append(len(messages))
        convo_lens.append(num_tokens_from_messages(messages))
        assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

    print("Num examples missing system message:", n_missing_system)
    print("Num examples missing user message:", n_missing_user)
    print_distribution(n_messages, "num_messages_per_example")
    print_distribution(convo_lens, "num_total_tokens_per_example")
    print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
    n_too_long = sum(l > 16385 for l in convo_lens)
    print(f"\n{n_too_long} examples may be over the 16385 token limit, they will be truncated during fine-tuning")

    # Pricing and default n_epochs estimate
    MAX_TOKENS_PER_EXAMPLE = 16385

    MIN_TARGET_EXAMPLES = 10
    MAX_TARGET_EXAMPLES = 25000
    TARGET_EPOCHS = 3
    MIN_EPOCHS = 1
    MAX_EPOCHS = 25

    n_epochs = TARGET_EPOCHS
    n_train_examples = len(dataset)
    if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
        n_epochs = min(MAX_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
    elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
        n_epochs = max(MIN_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

    n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
    print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
    print(f"By default, you'll train for {n_epochs} epochs on this dataset")
    print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
    print(f"Based on current rate the cost of training is ~ $ {round(n_epochs * n_billing_tokens_in_dataset * 0.008 * 0.001,6)}")

In [7]:
# Verify the data file
verify_data('data.jsonl')

Num examples: 20
First example:
{'role': 'system', 'content': 'You are a personal trainer and fitness coach.'}
{'role': 'user', 'content': 'I ate eggs and ran half a mile'}
{'role': 'assistant', 'content': '<fitness_log>ran half a mile</fitness_log><nutrition_log>eggs</nutrition_log><message>Nice!</message>'}
No errors found




Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 55, 63
mean / median: 59.45, 59.0
p5 / p95: 57.8, 62.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 24, 29
mean / median: 26.2, 26.0
p5 / p95: 24.9, 28.0

0 examples may be over the 16385 token limit, they will be truncated during fine-tuning
Dataset has ~1189 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~3567 tokens
Based on current rate the cost of training is ~ $ 0.028536
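
The estimate in the last line is plain arithmetic: billable tokens times epochs times the per-token rate. Here is a minimal sketch reproducing it, assuming the rate of $0.008 per 1,000 training tokens that is hard-coded in `verify_data` (check current pricing before relying on it). The name `estimate_training_cost` is hypothetical.

```python
def estimate_training_cost(tokens_in_dataset, n_epochs, usd_per_1k_tokens=0.008):
    """Return (billed tokens, estimated cost in USD) as tokens x epochs x rate."""
    billed_tokens = tokens_in_dataset * n_epochs
    return billed_tokens, billed_tokens * usd_per_1k_tokens / 1000

# Values taken from the verify_data output above.
billed, cost = estimate_training_cost(1189, 3)
print(f"{billed} tokens, ~${cost:.6f}")
```

With 1189 dataset tokens and 3 epochs this gives 3567 billed tokens, matching the figure printed by the utility.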


# Step 5: Upload Data to OpenAI Servers

In [8]:

def upload_data_to_openai(file_path):
    # Use a context manager so the file handle is closed once the upload finishes
    with open(file_path, "rb") as f:
        upload_response = client.files.create(
            file=f,
            purpose="fine-tune"
        )
    
    file_id = upload_response.id
    print(f"File upload complete. ID: {file_id}")

    return file_id

In [9]:
file_id = upload_data_to_openai('data.jsonl')

File upload complete. ID: file-zHaVEsMNEqxYX1oYKDpEUoze


# Step 6 & 7: Train and monitor a new fine-tuned model
* We will receive an email notification when the training job is complete.
* We can monitor progress on the [OpenAI platform under the fine-tuning tab](https://platform.openai.com/finetune).
* We can also monitor progress using a [Weights and Biases](https://wandb.ai/site) integration.

In [10]:
def start_finetuning(file_id, n_epochs):
  job = client.fine_tuning.jobs.create(
    training_file=file_id, 
    model="gpt-3.5-turbo-0125",
    hyperparameters={
          "n_epochs":n_epochs
          }
  )

  job_id = job.id
  print(f'Fine-tuning job successfully started: {job_id}')

  return job_id

In [11]:
job_id = start_finetuning(file_id, 3)

Fine-tuning job successfully started: ftjob-Y7Ts51nqb7vOKCDVDPUyJOG3


In [14]:
# check status of fine-tuning job
client.fine_tuning.jobs.retrieve(job_id)

FineTuningJob(id='ftjob-Y7Ts51nqb7vOKCDVDPUyJOG3', created_at=1721097136, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:personal::9lSMdXeW', finished_at=1721097474, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-qXIGb538HTtKmSiCKxVtZbGD', result_files=['file-GzzziUysRJRmsoNblsMRsE22'], seed=1209432246, status='succeeded', trained_tokens=3447, training_file='file-zHaVEsMNEqxYX1oYKDpEUoze', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
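
Instead of re-running the status cell by hand, the job can be polled until it reaches a terminal state. In this minimal sketch, `wait_for_job` is a hypothetical helper that takes the retrieval function as an argument (in the notebook that would be `client.fine_tuning.jobs.retrieve`). The set of terminal statuses is an assumption, and a stand-in replaces the API so the sketch runs offline.

```python
import time
from types import SimpleNamespace

# Assumed terminal states for a fine-tuning job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120):
    """Call `retrieve(job_id)` until the job finishes or max_polls is reached."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

# Stand-in for the API: reports "running" twice, then "succeeded".
_fake_statuses = iter(["running", "running", "succeeded"])

def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_fake_statuses))

finished = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0)
print(finished.status)
```

In the notebook, the equivalent call would be `wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`. The returned job's `fine_tuned_model` field then holds the model ID to use for testing.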

# Step 8: Evaluating the fine-tuned model
* Once the training job is complete, we can test our fine-tuned model using [OpenAI's Playground GUI](https://platform.openai.com/playground?mode=chat).
* We can also use the fine-tuned model via the OpenAI API.

In [5]:
# utility for testing your fine-tuned model
def test_model(model_id, system_role, prompt):
  response = client.chat.completions.create(
    model=model_id,
    messages=[
      {"role": "system", "content": f"{system_role}"},
      {"role": "user", "content": f"{prompt}"}
    ]
  )

  return_message = response.choices[0].message.content

  return return_message


In [6]:
# model_id is the fine_tuned_model value reported by your own job;
# it is also listed at https://platform.openai.com/finetune/
model_id = "ft:gpt-3.5-turbo-0125:personal::9liQwOAm"

# This is the same role used for training
system_role = "You are a personal trainer and fitness coach."

# this is a test question
prompt = "I ate pizza and hiked for 3 miles?"

# Test system Response
response = test_model(
    model_id,
    system_role,
    prompt
    )

print(response)

<fitness_log>hiked for 3 miles</fitness_log><nutrition_log>pizza</nutrition_log><message>Nice!</message>
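
A single prompt is a smoke test rather than an evaluation. Here is a minimal sketch of an exact-match check over a few held-out prompts. The helper `exact_match_accuracy` is hypothetical, the held-out cases are invented for illustration, and a stub stands in for the model so the sketch runs offline. In the notebook, `ask` could wrap `test_model` instead.

```python
def exact_match_accuracy(ask, cases):
    """Fraction of (prompt, expected) pairs for which ask(prompt) equals expected."""
    hits = sum(1 for prompt, expected in cases if ask(prompt).strip() == expected)
    return hits / len(cases)

# Illustrative held-out cases (invented for this sketch, not real model output).
held_out = [
    ("I ate pizza and hiked for 3 miles",
     "<fitness_log>hiked for 3 miles</fitness_log><nutrition_log>pizza</nutrition_log><message>Nice!</message>"),
    ("I ate rice and did 40 lunges",
     "<fitness_log>40 lunges</fitness_log><nutrition_log>rice</nutrition_log><message>Nice!</message>"),
]

# Stub model: answers the first case correctly and the second one off-format.
def stub_ask(prompt):
    if "pizza" in prompt:
        return held_out[0][1]
    return "Nice work!"

print(exact_match_accuracy(stub_ask, held_out))
```

Exact match is a deliberately strict metric. It suits this task because the goal is a fixed output format, but a looser comparison may be more appropriate when several phrasings of the log entries are acceptable.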


# Final Exercise 🚀

* Fine-tune GPT-3.5 Turbo using OpenAI's Fine-Tuning API
  * You can adjust the default behavior of myfitnessagent, add pre or post scripts, or change the formatting of the outputs.
  * Generate your training dataset and save it in a JSONL format.
  * Verify the data file, upload it, and start the fine-tuning job.
  * Test the fine-tuned model to see if it works.