# Fine-tune OpenAI models - text classification

This notebook provides a step-by-step guide to our new `gpt-3.5-turbo` fine-tuning. We'll perform text classification using the [AG News dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset), a common benchmark for text classification tasks.

We will go through the following steps:

1. **Setup:** Loading our dataset, mapping class indexes to news categories, and shuffling the rows.
2. **Data preparation:** Preparing your data for fine-tuning by creating training and validation examples, and uploading them to the `Files` endpoint.
3. **Fine-tuning:** Creating your fine-tuned model.
4. **Inference:** Using your fine-tuned model for inference on new inputs.

By the end of this notebook you should be able to train, evaluate and deploy a fine-tuned `gpt-3.5-turbo` model.

For more information on fine-tuning, you can refer to our [documentation guide](https://platform.openai.com/docs/guides/fine-tuning), [API reference](https://platform.openai.com/docs/api-reference/fine-tuning) or [blog post](https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates).

## Setup

In [1]:
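# this notebook uses the pre-1.0 interface of the openai python package
# (openai.File, openai.FineTuningJob, openai.ChatCompletion), so install the latest 0.x release
!pip install --upgrade "openai<1.0"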


In [2]:
import json
import openai
import os
import pandas as pd
from pprint import pprint

# Read the API key from the environment; never hard-code a secret in a notebook
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


Fine-tuning works best when focused on a particular domain. It's important to make sure your dataset is both focused enough for the model to learn and general enough that unseen examples won't be missed. With this in mind, we have extracted a subset from the AG News dataset.

In [104]:
# Read in the dataset we'll use for this task.
# This is the AG News dataset, which was downloaded from https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset
AGnews_df = pd.read_csv("AGNews.csv")

AGnews_df.head()


Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In the above dataset, the index-to-class mapping is as follows: 1-World, 2-Sports, 3-Business, 4-Sci/Tech. The code below stores the last category as `Sci_Tech`.

In [105]:
# Create index to class mapping dictionary
index_class_mapping_dict =  {1:'World', 2:'Sports', 3:'Business', 4:'Sci_Tech'}

In [106]:
# Convert indexes to news categories
AGnews_df['News category'] =  AGnews_df['Class Index'].apply(lambda x: index_class_mapping_dict[x])
del AGnews_df['Class Index']

In [107]:
AGnews_df.head()


Unnamed: 0,Title,Description,News category
0,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Business
1,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Business
2,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Business
3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Business
4,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...",Business


In [108]:
# Shuffle the rows so that all the news categories appear in the training and test data
AGnews_df = AGnews_df.sample(frac=1)
AGnews_df = AGnews_df.reset_index(drop=True)
AGnews_df.loc[:100]['News category'].value_counts()


News category
Business    31
Sci_Tech    26
Sports      24
World       20
Name: count, dtype: int64

## Data preparation

We'll begin by preparing our data. When fine-tuning with the `ChatCompletion` format, each training example is a simple list of `messages`. For example, an entry could look like:

```
[{'role': 'system',
  'content': 'You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided.'},

 {'role': 'user',
  'content': 'Title: Wall St. Bears Claw Back Into the Black (Reuters)\n\nDescription: Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.\n\nNews category: '},

 {'role': 'assistant',
  'content': 'Business'}]
```

During the training process this conversation will be split, with the final entry being the `completion` that the model will produce, and the remainder of the `messages` acting as the prompt. Consider this when building your training examples - if your model will act on multi-turn conversations, then please provide representative examples so it doesn't perform poorly when the conversation starts to expand.
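The split described above can be illustrated with a small helper. This is only a sketch of the idea, not the actual training code: `split_example` is a hypothetical name, and the shortened system message is used purely for illustration.

```python
def split_example(example):
    """Split one training example into (prompt_messages, completion_text).

    Hypothetical helper: mirrors the description above, where the final
    message is the completion and everything before it is the prompt.
    """
    messages = example["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("the final message must come from the assistant")
    return messages[:-1], messages[-1]["content"]


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful news classification assistant."},
        {"role": "user", "content": "Title: Wall St. Bears Claw Back Into the Black (Reuters)\n\nNews category: "},
        {"role": "assistant", "content": "Business"},
    ]
}

prompt_messages, completion = split_example(example)
print(len(prompt_messages), "prompt messages; completion:", completion)
```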

Please note that currently there is a 4096 token limit for each training example. Anything longer than this will be truncated at 4096 tokens.
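To spot examples that might exceed this limit before uploading, a rough length check can help. The sketch below assumes a rule of thumb of about four characters per token for English text, which is only an approximation; a real tokenizer such as `tiktoken` would give exact counts. The helper names are hypothetical.

```python
MAX_TOKENS_PER_EXAMPLE = 4096  # limit stated above
CHARS_PER_TOKEN = 4            # assumption: rough heuristic, not an exact tokenizer


def estimate_tokens(example):
    """Roughly estimate the token count of one example from its character count."""
    total_chars = sum(len(message["content"]) for message in example["messages"])
    return total_chars // CHARS_PER_TOKEN + 1


def find_long_examples(examples, limit=MAX_TOKENS_PER_EXAMPLE):
    """Return the positions of examples whose estimated length exceeds the limit."""
    return [i for i, example in enumerate(examples) if estimate_tokens(example) > limit]


short_example = {"messages": [{"role": "user", "content": "x" * 40}]}
long_example = {"messages": [{"role": "user", "content": "x" * 20000}]}

flagged = find_long_examples([short_example, long_example])
print("Examples that may be truncated:", flagged)
```

AG News entries are short, so nothing in this dataset should come close to the limit; the check matters more for long documents or multi-turn conversations.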


In [109]:
training_data = []

system_message = "You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided."

def create_user_message(row):
    return f"""Title: {row['Title']}\n\nDescription: {row['Description']}\n\nNews category: """

def prepare_example_conversation(row):
    messages = []
    messages.append({"role": "system", "content": system_message})

    user_message = create_user_message(row)
    messages.append({"role": "user", "content": user_message})

    messages.append({"role": "assistant", "content": row["News category"]})

    return {"messages": messages}

pprint(prepare_example_conversation(AGnews_df.iloc[0]))

{'messages': [{'content': 'You are a helpful news classification assistant. '
                          'You are to extract the text category from each of '
                          'the news texts provided.',
               'role': 'system'},
              {'content': 'Title: Japan leader appoints new party chiefs ahead '
                          'of cabinet reshuffle\n'
                          '\n'
                          'Description: TOKYO, (AFP) - Japanese Prime Minister '
                          'Junichiro Koizumi has appointed new party '
                          'executives to his ruling Liberal Democratic Party '
                          '(LDP), hours ahead of an expected cabinet reshuffle '
                          'aimed at pushing a reform package.\n'
                          '\n'
                          'News category: ',
               'role': 'user'},
              {'content': 'World', 'role': 'assistant'}]}


Let's now do this for a subset of the dataset to use as our training data. You can begin with as few as 30-50 carefully curated examples. Performance generally continues to improve as you increase the size of the training set, but your jobs will also take longer.

In [110]:
# use rows 0 through 100 of the dataset for training (101 rows, because .loc slicing includes the end label)
training_df = AGnews_df.loc[0:100]

# apply the prepare_example_conversation function to each row of the training_df
training_data = training_df.apply(prepare_example_conversation, axis=1).tolist()

for example in training_data[:5]:
    print(example)

{'messages': [{'role': 'system', 'content': 'You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided.'}, {'role': 'user', 'content': 'Title: Japan leader appoints new party chiefs ahead of cabinet reshuffle\n\nDescription: TOKYO, (AFP) - Japanese Prime Minister Junichiro Koizumi has appointed new party executives to his ruling Liberal Democratic Party (LDP), hours ahead of an expected cabinet reshuffle aimed at pushing a reform package.\n\nNews category: '}, {'role': 'assistant', 'content': 'World'}]}
{'messages': [{'role': 'system', 'content': 'You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided.'}, {'role': 'user', 'content': 'Title: TechBrief: Judge clears Google on search ads\n\nDescription: Google won a major legal victory Wednesday when a US judge said the search engine could continue to sell ads triggered by searches that use trademarked compa

In [111]:
training_data[0:2]

[{'messages': [{'role': 'system',
    'content': 'You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided.'},
   {'role': 'user',
    'content': 'Title: Japan leader appoints new party chiefs ahead of cabinet reshuffle\n\nDescription: TOKYO, (AFP) - Japanese Prime Minister Junichiro Koizumi has appointed new party executives to his ruling Liberal Democratic Party (LDP), hours ahead of an expected cabinet reshuffle aimed at pushing a reform package.\n\nNews category: '},
   {'role': 'assistant', 'content': 'World'}]},
 {'messages': [{'role': 'system',
    'content': 'You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided.'},
   {'role': 'user',
    'content': 'Title: TechBrief: Judge clears Google on search ads\n\nDescription: Google won a major legal victory Wednesday when a US judge said the search engine could continue to sell ads triggered by searche

In addition to training data, we can also **optionally** provide validation data, which will be used to make sure that the model does not overfit your training set.

In [112]:
validation_df = AGnews_df.loc[101:200]
validation_data = validation_df.apply(prepare_example_conversation, axis=1).tolist()
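Validation loss is only informative if the validation examples do not also appear in the training set. The slices above do not overlap by index, but a dataset can still contain duplicate articles. A minimal sketch of a leakage check follows; `user_texts` and `find_overlap` are hypothetical helpers, and the toy data stands in for `training_data` and `validation_data`.

```python
def user_texts(examples):
    """Collect the user message contents from a list of example conversations."""
    return {
        message["content"]
        for example in examples
        for message in example["messages"]
        if message["role"] == "user"
    }


def find_overlap(train_examples, validation_examples):
    """Return user prompts that occur in both the training and validation sets."""
    return user_texts(train_examples) & user_texts(validation_examples)


def make_example(text, label):
    return {"messages": [{"role": "user", "content": text},
                         {"role": "assistant", "content": label}]}


toy_training = [make_example("article A", "World"), make_example("article B", "Sports")]
toy_validation = [make_example("article B", "Sports"), make_example("article C", "Business")]

duplicates = find_overlap(toy_training, toy_validation)
print("Prompts present in both sets:", duplicates)
```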

We then need to save our data as `.jsonl` files, with each line being one training example conversation.


In [115]:
def write_jsonl(data_list: list, filename: str) -> None:
    with open(filename, "w") as out:
        for ddict in data_list:
            jout = json.dumps(ddict) + "\n"
            out.write(jout)

In [116]:
training_file_name = "AGnews_training.jsonl"
write_jsonl(training_data, training_file_name)

validation_file_name = "AGnews_validation.jsonl"
write_jsonl(validation_data, validation_file_name)
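Before uploading, it can be worth reading a file back and confirming that every line parses and has the expected shape. The following is a minimal sketch, not an official validator: `validate_jsonl` is a hypothetical helper, and it checks only that each line is valid JSON with a non-empty `messages` list ending in an assistant message.

```python
import json
import os
import tempfile


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((line_number, "missing 'messages' list"))
            elif messages[-1].get("role") != "assistant":
                problems.append((line_number, "last message is not from the assistant"))
    return problems


# Demo on a temporary file with one good and one incomplete example
good = {"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "World"}]}
bad = {"messages": [{"role": "user", "content": "hi"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write(json.dumps(good) + "\n")
        out.write(json.dumps(bad) + "\n")
    demo_problems = validate_jsonl(demo_path)

print(demo_problems)
```

In the notebook, the same function could be called on `training_file_name` and `validation_file_name`.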

This is what the first 5 lines of our training `.jsonl` file look like:

In [117]:
# print the first 5 lines of the training file
!head -n 5 AGnews_training.jsonl

{"messages": [{"role": "system", "content": "You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided."}, {"role": "user", "content": "Title: Japan leader appoints new party chiefs ahead of cabinet reshuffle\n\nDescription: TOKYO, (AFP) - Japanese Prime Minister Junichiro Koizumi has appointed new party executives to his ruling Liberal Democratic Party (LDP), hours ahead of an expected cabinet reshuffle aimed at pushing a reform package.\n\nNews category: "}, {"role": "assistant", "content": "World"}]}
{"messages": [{"role": "system", "content": "You are a helpful news classification assistant. You are to extract the text category from each of the news texts provided."}, {"role": "user", "content": "Title: TechBrief: Judge clears Google on search ads\n\nDescription: Google won a major legal victory Wednesday when a US judge said the search engine could continue to sell ads triggered by searches that use trademarked compa

### Upload files

You can now upload the files to our `Files` endpoint so they can be used for fine-tuning.


In [118]:
openai.api_key = OPENAI_API_KEY

# Open each file in a context manager so the handle is closed after the upload
with open(training_file_name, "rb") as training_file:
    training_response = openai.File.create(file=training_file, purpose="fine-tune")
training_file_id = training_response["id"]

with open(validation_file_name, "rb") as validation_file:
    validation_response = openai.File.create(file=validation_file, purpose="fine-tune")
validation_file_id = validation_response["id"]

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

Training file ID: file-q8InsZm2pteeZVT90L9P2YP6
Validation file ID: file-0wna3HLYo6Ulo4YKtviqgnIf


## Fine-tuning

Now we can create our fine-tuning job with the generated files and an optional suffix to identify the model. The response will contain an `id` which you can use to retrieve updates on the job.

Note: The files have to first be processed by our system, so you might get a `File not ready` error. In that case, simply retry a few minutes later.


In [128]:
response = openai.FineTuningJob.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-3.5-turbo",
    suffix="recipe-text-class",
)


job_id = response["id"]

print("Job ID:", response["id"])
print("Status:", response["status"])

InvalidRequestError: File 'file-0wna3HLYo6Ulo4YKtviqgnIf' is still being processed and is not ready to be used for fine-tuning. Please try again later.
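Rather than rerunning the cell by hand when the files are still being processed, the call can be wrapped in a simple retry loop. This is a generic sketch under stated assumptions: `retry` is a hypothetical helper, the delay and attempt count are arbitrary, and the demo uses a stand-in function instead of a real API call.

```python
import time


def retry(operation, attempts=5, delay_seconds=60, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempts run out.

    In practice you would catch only the specific "file not ready" error
    rather than every exception.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


# Demo with a stand-in operation that fails twice before succeeding
call_count = 0


def flaky_create_job():
    global call_count
    call_count += 1
    if call_count < 3:
        raise RuntimeError("file is still being processed")
    return {"id": "job-demo", "status": "created"}


demo_job = retry(flaky_create_job, sleep=lambda seconds: None)
print(demo_job["id"], "after", call_count, "attempts")
```

In the notebook, `operation` could be a lambda that wraps the `openai.FineTuningJob.create(...)` call shown above.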

### Check job status

You can retrieve the job to check its status; this corresponds to a `GET` request to the `https://api.openai.com/v1/fine_tuning/jobs/{job_id}` endpoint. In this instance you'll want to check that the job created in the previous step ends up as `status: succeeded`.

Once it is completed, you can use the `result_files` to sample the results from the validation set (if you uploaded one), and use the ID from the `fine_tuned_model` field to invoke your trained model.


In [59]:
response = openai.FineTuningJob.retrieve(job_id)

print("Job ID:", response["id"])
print("Status:", response["status"])
print("Trained Tokens:", response["trained_tokens"])


Job ID: ftjob-4SnO3nafyAb1ZeOtmPDNS346
Status: running
Trained Tokens: None
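Checking the status can also be automated with a small polling loop. The sketch below is an assumption-laden illustration: `wait_for_job` is a hypothetical helper, the set of terminal statuses is assumed to be `succeeded`, `failed` and `cancelled`, and the demo feeds in canned statuses instead of calling the API.

```python
import time

# Assumption: these are the statuses after which a job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the polling budget is spent."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


# Demo with canned statuses in place of real API responses
canned_statuses = iter(["running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(canned_statuses), sleep=lambda seconds: None)
print("Final status:", final_status)
```

In the notebook, `get_status` could be `lambda: openai.FineTuningJob.retrieve(job_id)["status"]`.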


We can track the progress of the fine-tune with the events endpoint. You can rerun the cell below a few times until the fine-tune is ready.


In [60]:
response = openai.FineTuningJob.list_events(id=job_id, limit=50)

events = response["data"]
events.reverse()

for event in events:
    print(event["message"])

Created fine-tune: ftjob-4SnO3nafyAb1ZeOtmPDNS346
Fine tuning job started
Step 10/303: training loss=1.14
Step 20/303: training loss=1.47
Step 30/303: training loss=1.08
Step 40/303: training loss=0.00
Step 50/303: training loss=0.00
Step 60/303: training loss=0.00
Step 70/303: training loss=0.00
Step 80/303: training loss=0.00
Step 90/303: training loss=0.00
Step 100/303: training loss=0.00
Step 110/303: training loss=0.00
Step 120/303: training loss=0.02
Step 130/303: training loss=0.00
Step 140/303: training loss=0.00
Step 150/303: training loss=0.00
Step 160/303: training loss=0.00
Step 170/303: training loss=0.00
Step 180/303: training loss=0.00
Step 190/303: training loss=0.00
Step 200/303: training loss=0.00
Step 210/303: training loss=0.00
Step 220/303: training loss=0.00
Step 230/303: training loss=0.00
Step 240/303: training loss=0.00
Step 250/303: training loss=0.00
Step 260/303: training loss=0.00
Step 270/303: training loss=0.00
Step 280/303: training loss=0.00
Step 290/30
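The step messages above follow a regular pattern, so the loss values can be extracted for plotting or for a quick sanity check. This is a minimal sketch that assumes the message format stays exactly as printed above; `parse_losses` is a hypothetical helper.

```python
import re

# Matches messages such as "Step 10/303: training loss=1.14"
STEP_PATTERN = re.compile(r"Step (\d+)/(\d+): training loss=([0-9.]+)")


def parse_losses(messages):
    """Return (step, loss) pairs for every message that matches the step pattern."""
    points = []
    for message in messages:
        match = STEP_PATTERN.match(message)
        if match:
            points.append((int(match.group(1)), float(match.group(3))))
    return points


sample_messages = [
    "Created fine-tune: ftjob-4SnO3nafyAb1ZeOtmPDNS346",
    "Fine tuning job started",
    "Step 10/303: training loss=1.14",
    "Step 20/303: training loss=1.47",
    "Step 30/303: training loss=1.08",
]

losses = parse_losses(sample_messages)
print(losses)
```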

Once the job has finished, we can get the fine-tuned model ID from the job:


In [61]:
response = openai.FineTuningJob.retrieve(job_id)
fine_tuned_model_id = response["fine_tuned_model"]

print("Fine-tuned model ID:", fine_tuned_model_id)

Fine-tuned model ID: ft:gpt-3.5-turbo-0613:personal:recipe-text-class:7tRcw203


## Inference

The last step is to use your fine-tuned model for inference. As with classic fine-tuning, you simply call `ChatCompletion` with your new fine-tuned model name as the `model` parameter.


In [63]:
test_df = AGnews_df.loc[201:300]
test_row = test_df.iloc[0]
test_messages = []
test_messages.append({"role": "system", "content": system_message})
user_message = create_user_message(test_row)
test_messages.append({"role": "user", "content": user_message})

pprint(test_messages)

[{'content': 'You are a helpful news classification assistant. You are to '
             'extract the text category from each of the news texts provided.',
  'role': 'system'},
 {'content': 'Title: Fake goods tempting young adults\n'
             '\n'
             'Description: Young people are increasingly happy to buy pirated '
             'goods or illegal download content from the net, a survey shows.\n'
             '\n'
             'News category: ',
  'role': 'user'}]


In [64]:
response = openai.ChatCompletion.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

Business


In [129]:
test_row = test_df.iloc[3]
test_messages = []
test_messages.append({"role": "system", "content": system_message})
user_message = create_user_message(test_row)
test_messages.append({"role": "user", "content": user_message})

pprint(test_messages)

response = openai.ChatCompletion.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

[{'content': 'You are a helpful news classification assistant. You are to '
             'extract the text category from each of the news texts provided.',
  'role': 'system'},
 {'content': 'Title: AOL to Sell Cheap PCs to Minorities and Seniors\n'
             '\n'
             'Description:  NEW YORK (Reuters) - America Online on Thursday '
             'said it  plans to sell a low-priced PC targeting low-income and '
             'minority  households who agree to sign up for a year of dialup '
             'Internet  service.\n'
             '\n'
             'News category: ',
  'role': 'user'}]
Business
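Single predictions say little about overall quality, so a natural next step is to score the model over the whole held-out set. The sketch below is a hedged outline rather than part of the original run: `normalize_label` and `accuracy` are hypothetical helpers, and a stub classifier stands in for the call to the fine-tuned model.

```python
LABELS = ("World", "Sports", "Business", "Sci_Tech")


def normalize_label(raw_output):
    """Strip whitespace and return the label, or None if it is not a known category."""
    cleaned = raw_output.strip()
    return cleaned if cleaned in LABELS else None


def accuracy(rows, classify):
    """Fraction of (text, expected_label) rows for which `classify(text)` is correct."""
    if not rows:
        return 0.0
    correct = sum(1 for text, expected in rows if normalize_label(classify(text)) == expected)
    return correct / len(rows)


# Demo with a stub in place of the fine-tuned model
stub_outputs = {"a": " World\n", "b": "Sports", "c": "Business", "d": "Technology"}
demo_rows = [("a", "World"), ("b", "Sports"), ("c", "Business"), ("d", "Sci_Tech")]

demo_accuracy = accuracy(demo_rows, lambda text: stub_outputs[text])
print("Accuracy:", demo_accuracy)
```

In the notebook, `classify` could wrap the `openai.ChatCompletion.create` call above, and `rows` could be built from `test_df` using `create_user_message` and the `News category` column.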


## Conclusion

Congratulations, you are now ready to fine-tune your own models using the `ChatCompletion` format! We look forward to seeing what you build.
