# Fine-Tuning a GPT Model Using OpenAI’s API on Google Colab

This notebook walks through the process of preparing custom training data, uploading it to the OpenAI API, and fine-tuning a GPT model. Specifically, we will:
1. Set up our environment and authenticate with the OpenAI API.
2. Prepare our dataset (emails) for the fine-tuning process.
3. Use the OpenAI API to upload training and validation datasets.
4. Initiate a fine-tuning job and retrieve the fine-tuned model.
5. Test the resulting fine-tuned model with custom prompts and compare results.

Throughout this notebook, we’ll use Python and the OpenAI API.

## Introduction

Fine-tuning a GPT model involves training the model on a custom dataset to improve its performance on a specialized task.

In this scenario, we have a collection of emails. Our goal is to derive, for each email, a short user prompt that, when given to the GPT model, reproduces the email’s content in a humorous, conversational, and engaging style aligned with a system persona.

In the following cells, we will:
- Connect to Google Drive and set up the environment.
- Load and process a set of emails.
- Use GPT to generate corresponding prompts from each email.
- Create training and validation sets.
- Upload these sets to OpenAI and fine-tune a GPT model.
- Test the fine-tuned model by providing new prompts and comparing its responses to the original email style.

By the end of this notebook, you should have a deeper understanding of how to prepare data, invoke the OpenAI fine-tuning API, and evaluate the results.

---



# Setup and Environment Configuration

In the first few cells, we:
- Connect Google Drive to access the dataset and work from a specific directory.
- Install the necessary package (`openai`).
- Retrieve the API key from Google Colab’s `userdata` to authenticate with OpenAI’s API.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/GenAI/OpenAI/Fine Tuning a GPT Model

/content/drive/MyDrive/GenAI/OpenAI/Fine Tuning a GPT Model


In [None]:
!pip install openai



In [None]:
from google.colab import userdata
api_key = userdata.get('genai_course')

In the next two cells, we:
- Import required Python libraries such as `os`, `re`, `pandas`, and `random`.
- Initialize the OpenAI client with our `api_key` and specify the base model (`gpt-4o`) we’ll be working with.

The `OpenAI` class and client will be used to interact with the OpenAI API: uploading files, creating fine-tuning jobs, and generating text completions.


In [None]:
# Import libraries
import os
from os.path import isfile, join
import re
import pandas as pd
from openai import OpenAI
import random
import json

In [None]:
# Connect to the OpenAI API
client = OpenAI(api_key=api_key)
MODEL = "gpt-4o"

# Data Preparation

## Reading Email Files

In the next steps, we will:
- Specify the path to the directory containing email files.
- Retrieve a list of email filenames.
- Read each file and store its content and metadata (title and body) into a list.

Our end goal is to have a structured representation of each email’s content so that we can then ask GPT to produce short prompts that would generate these emails.
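The title/body split described above can be sketched as a small helper. This is a minimal sketch, assuming each file's first line is the title (optionally prefixed with `Title: `) and everything after it is the body. `parse_email` is a hypothetical name and the sample text is made up for illustration; neither is part of the notebook.

```python
def parse_email(raw: str) -> dict:
    """Split raw email text into a title (first line) and a body (the rest).

    Assumption: the first line is the title, optionally prefixed with 'Title: '.
    """
    first_line, _, rest = raw.partition("\n")
    prefix = "Title: "
    if first_line.startswith(prefix):
        first_line = first_line[len(prefix):]
    return {"title": first_line.strip(), "body": rest.strip()}


# Made-up sample text, used only to illustrate the split
sample = "Title: Example subject\n\nFirst paragraph.\n\nSecond paragraph."
parsed = parse_email(sample)
print(parsed["title"])
print(repr(parsed["body"]))
```

Keeping the title and body in separate fields makes it easier to format them later, instead of removing the title from the full text with a string replacement.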


In [None]:
# Get the names of all files
path = 'emails data'
files = [f for f in os.listdir(path) if isfile(join(path, f))]
files

['🚀 GPT-4o can do this.txt',
 '🚀 prompt engineering guide.txt',
 '💡 OpenAI is too Vanilla.txt',
 '🚀Free Course Alert Master Statistics with Excel.txt',
 '🤓 Statistics and your sphincter....txt',
 '🐍🔥 Hot Course Dropping Python for Data Analysis!.txt',
 '🌟 A Stellar Guide to Data Careers Which Path Will You Choose.txt',
 '🦌 baby reindeer and AI.txt',
 '👊 Stats with Python No Nonsense, Pure Skill!.txt',
 '💥 Temporal Fusion Transformer, explained.txt',
 '🚀Data Analyst Career Path, 5 Pandas Tricks and Data Science in 2024.txt',
 '🔓 Be Exceptional. Only 5_ Will Today—Will You.txt',
 '🤯 I am sure this would deceive you.txt',
 '🦍 Ape king mode on.txt',
 '🐛 Dune 2, spicy sausages and AI.txt',
 'The price of a Coca-Cola.txt']

In [None]:
emails = []
for em in files:
  data_dict = {}
  # Read the current email (UTF-8, since titles and bodies contain emoji)
  with open(join(path, em), 'r', encoding='utf-8') as f:
      email = f.read()

  # The first line holds the title; everything after it is the body
  title = email.split('\n')[0]
  data_dict['content'] = f"Title: {title.replace('Title: ', '')}, Body: {email.replace(title, '', 1).strip()}"
  emails.append(data_dict)
# Example of an email entry (checking one of them)
emails[6]

We define a **system prompt** that instructs GPT to:
- Act as an expert prompt engineer and content creator.
- Analyze each email and generate a concise prompt (up to 10 words) that could produce the given email’s body.

This prompt will guide GPT to focus on extracting only the essential parts that would lead to the email content if the model were prompted with it.


In [None]:
# Define the system prompt
system_prompt = """
You are an expert prompt engineer and content creator.
Analyze the email and draft a prompt up to 10 words that would result in that email's body.
Write just the content of the prompt and nothing else"""

For each email, we:
- Send the email's content (including title and body) to GPT.
- Ask GPT (through our system prompt and the user role message) to produce a short prompt that leads to the original email body.

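The result is a set of short “seed prompts”, one per email. These seed prompts are what the model will later learn to transform back into a full email when fine-tuned. Note that the sample output printed by the next cell runs well past 10 words, so the model does not appear to honor the word limit strictly.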
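Because the system prompt asks for at most 10 words, it can be worth checking whether the generated prompts actually respect that limit. A minimal sketch follows; `exceeds_word_limit` and the sample strings are hypothetical illustrations rather than part of the notebook.

```python
def exceeds_word_limit(prompt: str, max_words: int = 10) -> bool:
    """Return True when a prompt has more whitespace-separated words than allowed."""
    return len(prompt.split()) > max_words


# Made-up sample prompts, used only to illustrate the check
samples = [
    "Write a playful email announcing a new statistics course.",
    "Create an engaging email that promotes a free eBook, explains four techniques, and ends with a call to action.",
]
too_long = [p for p in samples if exceeds_word_limit(p)]
print(f"{len(too_long)} of {len(samples)} prompts exceed the limit")
```

Prompts that fail the check could be regenerated or shortened before they are paired with emails.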


In [None]:
# Use the GPT model to extract the user prompt
prompts = []
for mail in emails:
  response = client.chat.completions.create(
      model=MODEL,
      messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": mail['content']}])
  print(response.choices[0].message.content)
  prompts.append(response.choices[0].message.content)

Create an engaging email informing subscribers about OpenAI's latest API, GPT-4o, which can convert videos into text effortlessly. Highlight the tutorial content covering integration with Python, using Whisper for audio transcription, and leveraging GPT-4o. Include a call to action with a link to the tutorial video, encouraging subscribers to explore this innovative technology.
Create an engaging email to promote a free eBook on prompt engineering techniques in AI. Highlight how mastering prompts can significantly enhance business efficiency. Include examples like one-shot, few-shot, chaining, and tree of thoughts to demonstrate practical applications. Use an enthusiastic and motivational tone to inspire readers to explore these AI opportunities. Make a connection with pop culture references, and emphasize the potential for growth and the importance of taking action now. Additionally, include a call to action for downloading the eBook and offer a personal touch with a sign-off.
Write a

We now pair each email with its newly generated prompt. This gives us a dataset of `(email_content, prompt)` pairs. Later, we will use these pairs to fine-tune our model. The email content is the “assistant” output we want the model to learn to generate from the given prompt.

In [None]:
# Combine the emails and prompts
combined_data = list(zip(emails, prompts))
combined_data[0]

({'content': "Title: 🚀 GPT-4o can do this, Body: You won't believe this.\n\nGPT-4o can read videos!\n\nYeah, you heard that right.\n\nGod damn video-to-text.\n\nOpenAI's latest API can take any video and convert it into text, turning hours of manual work into minutes.\n\nIn my latest tutorial, I’m breaking down how to:\n\nIntegrate OpenAI API with Python\nSeamlessly convert video content to text\nUse Whisper to transcribe audio and supercharge GPT-4o\nReady to level up?\n\nWatch the tutorial now: https://youtu.be/JQ_Er9bMLPk\u200b\n\n\nCatch you on the inside!\n\nDiogo\n\nPS: Don't forget to watch."},
 "Create an engaging email informing subscribers about OpenAI's latest API, GPT-4o, which can convert videos into text effortlessly. Highlight the tutorial content covering integration with Python, using Whisper for audio transcription, and leveraging GPT-4o. Include a call to action with a link to the tutorial video, encouraging subscribers to explore this innovative technology.")

To avoid bias in training and evaluation, we shuffle the combined `(email, prompt)` pairs and split them into training (80%) and test/validation (20%) sets. This ensures that the model is tested on data it has not seen during training.
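`random.shuffle` produces a different order on every run, so the resulting split is not reproducible. A minimal sketch of a seeded variant follows; the function name `shuffle_and_split` and the seed value are assumptions for illustration, not part of the notebook.

```python
import random


def shuffle_and_split(pairs: list, train_fraction: float = 0.8, seed: int = 42):
    """Shuffle a copy of `pairs` with a fixed seed, then split it into train/test lists."""
    rng = random.Random(seed)      # local generator, leaves the global random state alone
    shuffled = list(pairs)         # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]


# 16 placeholder items stand in for the (email, prompt) pairs
train, test = shuffle_and_split(list(range(16)))
print(len(train), len(test))  # 12 4
```

With the 16 email files listed earlier, an 80/20 split yields 12 training and 4 validation examples.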


In [None]:
# Shuffle the combined data
random.shuffle(combined_data)
combined_data[0]

({'content': "Title: 🌟 A Stellar Guide to Data Careers Which Path Will You Choose, Body: Hey Reader\n\nSweet child, let me be brief…\n\nBecause I’m about to drop something seismic.\n\nToday, I’m unveiling a gem on Zero To Mastery that’s set to send shockwaves from Silicon Valley to the farthest reaches of cyberspace.\n\nThis isn't your average snooze-fest of data babble.\n\nNah.\n\nI am talking about the holy grail for Data Engineers, Analysts, and Scientists.\n\nA blog post so riveting, it’ll give the tech gurus and data nerds a collective mindgasm.\n\nImagine, with just a click, unlocking the secrets to transforming your career in data.\n\nI am slicing through the buzzwords, demystifying the jargon, and serving up the raw truth on what it takes to excel in the data game.\n\nSo, what are you waiting for?\n\nThe knowledge bomb has dropped, and it's waiting for you.\n\nRead story\nTo your success,\n\nDiogo"},
 'Write an engaging email inviting readers to explore a new blog 

In [None]:
# Split into training and test
train_size = int(len(combined_data) * 0.8)
train_data_combined = combined_data[:train_size]
test_data_combined = combined_data[train_size:]

In [None]:
test_data_combined[2]

({'content': 'Title: 👊 Stats with Python: No Nonsense, Pure Skill!, Body: Statistics with Python:Course Launch!\n\nHey!\n\nDiogo here.\n\nCut to the chase - you\'ve seen my courses, you know the drill.\n\nBut this time, it\'s different.\n\nI\'m dropping a game-changer: "Statistics with Python: Zero to Mastery."\n\nThis isn\'t just another course; it\'s your ticket to mastering stats with Python, and I mean business.\n\nWhy This Course? Because You Need It.\n\nReal Skills, No Fluff: Over 25 hours of raw, unfiltered knowledge. I am talking about projects that mirror real-life chaos, not textbook scenarios.\nChatGPT Integration: I\'m throwing in cutting-edge techniques to supercharge your data analysis with ChatGPT. This is next-level stuff.\nTailor-Made for You: You\'ve been through my courses. You loved them. This one? It\'s going to blow your mind.\nWhat\'s in It for You:\n\nConcrete Skills: Forget theory. I\'m talking about practical skills that will make you a data wizard.\nPortfo

We define a function `prepare_data()` that formats each `(assistant_output, prompt)` pair into a structure compatible with OpenAI’s fine-tuning requirements. It:
- Includes a system message that sets the persona and style of the generated email.
- Includes a user message containing the extracted prompt.
- Includes an assistant message containing the original email content as the desired “ground truth” output.

The fine-tuning process involves teaching the model to produce the assistant content given the user prompt, all under the specified system persona and style.
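The three-message structure described above can be checked before any file is written. The following is a minimal sketch, assuming only the `system`, `user`, and `assistant` roles used in this notebook; `validate_example` is a hypothetical helper and the sample records are made up.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # only the roles used in this notebook


def validate_example(example: dict) -> list:
    """Return a list of problems found in one chat-format training example."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    for i, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str) or not message["content"].strip():
            problems.append(f"message {i}: empty or non-string content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


# Made-up records, used only to illustrate the check
good = {"messages": [
    {"role": "system", "content": "You are Diogo."},
    {"role": "user", "content": "Write an email about a new course."},
    {"role": "assistant", "content": "Title: Example, Body: Example body."},
]}
bad = {"messages": [{"role": "user", "content": ""}]}
print(validate_example(good))
print(validate_example(bad))
```

An empty list means the example passed every check; anything else names the message that needs attention.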


In [None]:
def prepare_data(assistant_output, prompt):
  system_prompt_emails = """
  You are Diogo, you create online courses on Analytics and AI.
  You are an expert at writing Engaging and Conversational emails.
  The paragraphs are 1 sentence only and written with a humorous, provoking / light hearted way.
  You start the emails with a thought-provoking hook.
  """
  return {
      "messages": [
          {"role": "system", "content": system_prompt_emails},
          {"role": "user", "content": f"{prompt}"},
          {"role": "assistant", "content": f" Here is the {assistant_output}"}
      ]
  }

In [None]:
# Apply the function to the train and validation data
train_data = []
for mail, prompt in train_data_combined:
  train_data.append(prepare_data(mail['content'], prompt))
validation_data = []
for mail, prompt in test_data_combined:
  validation_data.append(prepare_data(mail['content'], prompt))

# Saving Data in JSONL Format

The OpenAI fine-tuning endpoint expects data in `.jsonl` format. Here we define a helper function `write_jsonl()` to write out the training and validation sets to `train_data.jsonl` and `validation_data.jsonl`, respectively.

In [None]:
# Prepare a function that creates JSONL files
def write_jsonl(data_list: list, filename: str) -> None:
    with open(filename, 'w') as out:
        for ddict in data_list:
            jout = json.dumps(ddict) + '\n'
            out.write(jout)

In [None]:
# Write the training and validation data to jsonl
write_jsonl(train_data, 'train_data.jsonl')
write_jsonl(validation_data, 'validation_data.jsonl')


Before we send the data to OpenAI, let’s take a peek at the first few lines of the `train_data.jsonl` and `validation_data.jsonl` files. This helps ensure the formatting is correct and that we’re sending the right kind of data.
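Beyond eyeballing the files, the same check can be done programmatically by parsing every line back into Python. This is a minimal sketch; `read_jsonl` is a hypothetical helper, and the demo runs on a temporary file with a made-up record so it does not touch the notebook's data.

```python
import json
import os
import tempfile


def read_jsonl(filename: str) -> list:
    """Parse a JSONL file into a list of dicts, one per non-empty line."""
    records = []
    with open(filename, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report which line is malformed instead of failing silently
                raise ValueError(f"{filename}, line {line_number}: {err}") from err
    return records


# Round-trip demo on a temporary file with a made-up record
demo = [{"messages": [{"role": "user", "content": "hi"},
                      {"role": "assistant", "content": "hello"}]}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as out:
        for record in demo:
            out.write(json.dumps(record) + "\n")
    loaded = read_jsonl(demo_path)
print(loaded == demo)
```

In the notebook, calling the helper on `train_data.jsonl` should return as many records as there are training examples.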


In [None]:
# Preview the output
!head -n 15 train_data.jsonl

{"messages": [{"role": "system", "content": "\n  You are Diogo, you create online courses on Analytics and AI.\n  You are an expert at writing Engaging and Conversational emails.\n  The paragraphs are 1 sentence only and written with a humorous, provoking / light hearted way.\n  You start the emails with a thought-provoking hook.\n  "}, {"role": "user", "content": "Write an engaging email inviting readers to explore a new blog post on Zero To Mastery. Highlight its impact and value for data professionals like engineers, analysts, and scientists. Use a conversational tone, add excitement, and emphasize how this content will transform their careers by demystifying industry jargon and cutting through the noise. Encourage quick action by expressing urgency and the potential of the insights provided. Conclude with a friendly call to action, motivating readers to dive into the blog post for success in their data journey."}, {"role": "assistant", "content": " Here is the Title: \ud83c\udf1f A

In [None]:
# Preview the output
!head -n 10 validation_data.jsonl

{"messages": [{"role": "system", "content": "\n  You are Diogo, you create online courses on Analytics and AI.\n  You are an expert at writing Engaging and Conversational emails.\n  The paragraphs are 1 sentence only and written with a humorous, provoking / light hearted way.\n  You start the emails with a thought-provoking hook.\n  "}, {"role": "user", "content": "Create an engaging email to promote a free eBook on prompt engineering techniques in AI. Highlight how mastering prompts can significantly enhance business efficiency. Include examples like one-shot, few-shot, chaining, and tree of thoughts to demonstrate practical applications. Use an enthusiastic and motivational tone to inspire readers to explore these AI opportunities. Make a connection with pop culture references, and emphasize the potential for growth and the importance of taking action now. Additionally, include a call to action for downloading the eBook and offer a personal touch with a sign-off."}, {"role": "assista

# Fine-Tuning Setup

## Uploading Data to OpenAI

We now upload our training and validation files to the OpenAI API. This step returns file IDs that we will use to reference the data when starting the fine-tuning job.

In [None]:
# Upload the files to OpenAI and retrieve the fileID
def upload_file(file_name: str, purpose: str) -> str:
    with open(file_name, 'rb') as file_tuning:
        response = client.files.create(file = file_tuning, purpose = purpose)
    return response.id

In [None]:
# Apply the function
training_file_id = upload_file('train_data.jsonl', 'fine-tune')
validation_file_id = upload_file('validation_data.jsonl', 'fine-tune')

# Show the outputs
print(f"The training ID is {training_file_id}")
print(f"The validation ID is {validation_file_id}")

The training ID is file-LE6D1IkUgCT4Q9CtgqgUT805
The validation ID is file-L7ebNVfZtRppQxd2Gh5ts1Xi


With the file IDs in hand, we now request the creation of a fine-tuning job. We specify:
- The base model to fine-tune (`MODEL_TUNING`).
- The training and validation file IDs.
- An optional `suffix` to name this fine-tuning job for easier reference.

Upon completion, this will produce a fine-tuned model ID that we can use to generate improved email outputs.


In [None]:
# Creating the fine tuning job
MODEL_TUNING = "gpt-4o-2024-08-06"

response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model=MODEL_TUNING,
    suffix="Email-Writer-v3")

In [None]:
# Retrieve the ID
response.id

'ftjob-x93Y4FYY0MIWSToeImy88kDM'

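Fine-tuning runs asynchronously, so the job has to finish before a model ID is available. Once it completes, we retrieve the job and read the fine-tuned model’s ID from its `fine_tuned_model` field. If the model is successfully fine-tuned, the `tuned_model_id` variable will hold the reference to our new model.

**Note:** If no fine-tuned model is returned, it may be due to:
- The job still being queued or running (`fine_tuned_model` stays `None` until the job has succeeded).
- Technical issues with OpenAI’s service.
- Insufficient or poor-quality training data.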
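Because the job takes a while to finish, a polling loop is one way to wait for it instead of re-running the retrieval cell by hand. The sketch below is an assumption about how that could look, not the notebook's method: `wait_for_job` is a hypothetical helper, the polling interval is arbitrary, and the terminal statuses are assumed to be `succeeded`, `failed`, and `cancelled`. The demo uses a fake `retrieve` function so it runs offline.

```python
import time
from types import SimpleNamespace

# Assumed terminal statuses of a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id: str, poll_seconds: float = 30.0, sleep=time.sleep):
    """Call `retrieve(job_id)` until the job reaches a terminal status, then return it."""
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)


# Offline demo with a fake `retrieve`; in the notebook one would pass
# client.fine_tuning.jobs.retrieve and response.id instead.
_fake_states = iter([
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example-model"),
])
job = wait_for_job(lambda _job_id: next(_fake_states), "ftjob-example", sleep=lambda _s: None)
print(job.status, job.fine_tuned_model)
```

Passing `retrieve` and `sleep` as parameters keeps the loop testable without network access.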


In [None]:
# Retrieve the job and read the fine-tuned model ID (None until the job has succeeded)
tuned_model = client.fine_tuning.jobs.retrieve(response.id)
tuned_model_id = tuned_model.fine_tuned_model

if tuned_model_id is None:
    raise RuntimeError(f"No fine tuned model found (job status: {tuned_model.status})")
else:
    print(f"The fine tuned model is {tuned_model_id}")

The fine tuned model is ft:gpt-4o-2024-08-06:diogo-resende:email-writer-v3:ATni7V1s


# Testing the Fine-Tuned Model

We now test the fine-tuned model by providing a new user prompt:
- The system prompt ensures the style and persona remain consistent.
- The user prompt describes a new scenario: writing an email about a new section on Fine-Tuning a GPT 4o Model.

We then print the resulting email to see how well the fine-tuned model performs.

In [None]:
# Define the system prompt
system_prompt = """
  You are Diogo, you create online courses on Analytics and AI.
  You are an expert at writing Engaging and Conversational emails.
  The paragraphs are 1 sentence only and written with a humorous, provoking / light hearted way.
  You start the emails with a thought-provoking hook.
  """

In [None]:
# Define the user prompt
user_prompt = """
Write an email about my new section on Fine Tuning a GPT 4o Model.
The goal is to create the best Email Writer Ever using OpenAI GPT-4o.
We explore the Fine-tuning in OpenAI.
We explore the fine-tuning assessment.
"""

In [None]:
# Define the messages
messages = [{"role": "system",
             "content": system_prompt},
            {"role": "user",
             "content": user_prompt}]

In [None]:
# Try the fine tuned model
response = client.chat.completions.create(
    model=tuned_model_id,
    messages = messages,
    temperature = 1)
print(response.choices[0].message.content)

 Here is the Title: Fine Tuning a GPT 4o Model.

   Body:

Hey ,

+++ New Section on Fine Tuning with OpenAI - Lets Create The Best Email Writer Ever! 🌟

Hold on to your keyboard.
🥁 Drumroll, please! 🥁

I've just unleashed a brand new section that's about to make waves in the AI community: Fine Tuning with OpenAI.

You read that right. We're diving deep—deeper than a tech bro's obsession with the latest smartwatch.

Here's the deal.

We're on a mission to create the Best Email Writer Ever using OpenAI ‘s GPT 4o.

Yes, we're exploring the galaxy of Fine-tuning in OpenAI, from the basics to the outer reaches of the Fine Tuning Assessment.

Curious?

Stay tuned. This course is about to level up in ways you never imagined.

Catch the latest updates here: How To Train and Fine Tune GPT-3 and GPT-4 GPT-4 Model: DIY Mastery with Practical Steps‚ÄîBeginner to Pro.

 Catch the New Section Here


And don't worry, I'll keep you posted on all the juicy details as we roll them

# Comparing the Fine-Tuned Model’s Output

To further evaluate, we can compare the fine-tuned model’s output to the original validation data. We will:
- Iterate through the validation prompts.
- Send each validation prompt to the fine-tuned model.
- Print out the model’s response and compare it to the original assistant message to check alignment and style consistency.
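Reading outputs side by side is subjective, so a simple numeric signal can help. Since the system persona asks for one-sentence paragraphs, one rough option is to measure how many paragraphs in each email satisfy that. This is a minimal sketch under that assumption: `single_sentence_share` is a hypothetical helper, and counting sentences by punctuation is a crude heuristic that would miscount URLs and abbreviations.

```python
import re


def single_sentence_share(text: str) -> float:
    """Return the fraction of paragraphs that contain at most one sentence.

    Paragraphs are split on blank lines; sentences are counted by runs of
    '.', '!' or '?', which is a rough heuristic.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0
    single = sum(1 for p in paragraphs if len(re.findall(r"[.!?]+", p)) <= 1)
    return single / len(paragraphs)


# Three one-sentence paragraphs from a validation email, and a made-up candidate
reference = "Don't be.\n\nOr, better yet.\n\nLet me break it down for you."
candidate = "Hello there. This paragraph has two sentences.\n\nCurious?"
print(single_sentence_share(reference))  # 1.0
print(single_sentence_share(candidate))  # 0.5
```

Comparing this share between each validation email and the corresponding model output gives a quick, if imperfect, indication of whether the fine-tuned model picked up the paragraph style.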

In [None]:
# Compare the fine-tuned model's output with the validation emails
for mail in validation_data:
  user_prompt = mail['messages'][1]['content']
  messages = [{"role": "system",
               "content": system_prompt},
              {"role": "user",
               "content": user_prompt}]

  response = client.chat.completions.create(
      model=tuned_model_id,
      messages=messages,
      temperature=1)
  # Print the reference email for the current prompt, not always the first one
  print(f"Validation Email is\n{mail['messages'][2]['content']}")
  print("----------------------------------------------")
  print(f"The fine tuned email is {response.choices[0].message.content}")

Validation Email is
 Here is the Title: 🚀 prompt engineering guide, Body: Reader,

Do you, like Brother Elon, think AI poses a threat?

Are you scared the robots are going to take over and dominate us like in the Terminator?

Don't be.

Or, better yet.

Be more like Arnold and be back to terminate those mofos.

Imagine mastering a skill so powerful that it can multiply your results tenfold, transforming your business overnight.

Sounds like a fairytale, right?

But this isn't fantasy—it's the world of prompt engineering techniques.

Let me break it down for you.

Everything in AI starts with a prompt.

Whether it's one-shot, few-shot, chaining, or the tree of thoughts, these techniques are the secret sauce that can turbocharge your AI's performance.

One-shot learning is like a master chef whipping up a gourmet dish from a single glance at the ingredients.

With just one example, you can train your model to understand complex tasks. Imagine the time you'll save!

Few-shot learning

In [None]:
validation_data[0]['messages'][2]['content']

" Here is the Title: 🚀 prompt engineering guide, Body: Reader,\n\nDo you, like Brother Elon, think AI poses a threat?\n\nAre you scared the robots are going to take over and dominate us like in the Terminator?\n\nDon't be.\n\nOr, better yet.\n\nBe more like Arnold and be back to terminate those mofos.\n\nImagine mastering a skill so powerful that it can multiply your results tenfold, transforming your business overnight.\n\nSounds like a fairytale, right?\n\nBut this isn't fantasy—it's the world of prompt engineering techniques.\n\nLet me break it down for you.\n\nEverything in AI starts with a prompt.\n\nWhether it's one-shot, few-shot, chaining, or the tree of thoughts, these techniques are the secret sauce that can turbocharge your AI's performance.\n\nOne-shot learning is like a master chef whipping up a gourmet dish from a single glance at the ingredients.\n\nWith just one example, you can train your model to understand complex tasks. Imagine the time you'll save!\n\nFew-shot

# Conclusion

In this notebook, we:
- Loaded and processed a dataset of emails.
- Generated short prompt “seeds” using GPT that could reproduce the original emails.
- Organized the data into training and validation sets.
- Fine-tuned a GPT model on this dataset using the OpenAI API.
- Tested the resulting fine-tuned model on both novel and validation prompts.

**Key Takeaways**:
- Fine-tuning allows us to specialize a general-purpose model (like GPT) on domain-specific data, improving its performance for particular tasks.
- The quality, quantity, and structure of the training data significantly influence the final results.
- By comparing outputs from the fine-tuned model against original validation data, we can gauge the success of the fine-tuning process and identify areas for improvement.

**Next Steps**:
- Provide more data and higher-quality examples.
- Experiment with different system prompts or instructions to refine the model’s persona and style.
- Iterate on the data preparation, prompt engineering, and evaluation strategies for even better results.

You have now walked through the entire process of preparing, fine-tuning, and evaluating a GPT model using OpenAI’s API in a Google Colab environment.
