<h2>OpenAI - Fine-tuning a GPT-3 Model (ada) on a custom dataset</h2>

Here, we have a dataset of customer support tickets, each with an email-style description, a ticket type, and a priority. We want to predict a ticket label by analysing the email content; the code below uses the ticket priority (`low` or `critical`) as that label and stores it in the `ticket_type` column.

The `customer_support_tickets.csv` file was downloaded from [Kaggle](https://www.kaggle.com/datasets/suraj520/customer-support-ticket-dataset).

In [17]:
import pandas as pd
import tiktoken
import random
import json

<h4>Explore the data</h4>

In [30]:
df = pd.read_csv('customer_support_tickets.csv')

df.dropna(subset=['Ticket Description', 'Ticket Type', 'Ticket Priority'], inplace=True)

df = df[['Customer Name', 'Customer Email', 'Ticket Type', 'Ticket Subject', 'Ticket Description', 'Product Purchased', 'Date of Purchase', 'Ticket Priority']]
df.columns = df.columns.str.replace(' ', '_').str.lower()
# df['ticket_type'] = df['ticket_type'].str.replace(' ', '_').str.lower()
# use the ticket priority as the label instead of the original ticket type
df['ticket_type'] = df['ticket_priority'].str.replace(' ', '_').str.lower()
df = df[df['ticket_type'].isin(['low', 'critical'])]

# create a new column holding a numerical code (as a string) for each category
df['ticket_type_category'] = df['ticket_type'].astype('category').cat.codes.astype(str)
# print all unique category and code mapping
print(dict(enumerate(df['ticket_type'].astype('category').cat.categories)))

# df.groupby('ticket_type').count()
df.head()

{0: 'critical', 1: 'low'}


Unnamed: 0,customer_name,customer_email,ticket_type,ticket_subject,ticket_description,product_purchased,date_of_purchase,ticket_priority,ticket_type_category
0,Marisa Obrien,carrollallison@example.com,critical,Product setup,I'm having an issue with the {product_purchase...,GoPro Hero,2021-03-22,Critical,0
1,Jessica Rios,clarkeashley@example.com,critical,Peripheral compatibility,I'm having an issue with the {product_purchase...,LG Smart TV,2021-05-22,Critical,0
2,Christopher Robbins,gonzalestracy@example.com,low,Network problem,I'm facing a problem with my {product_purchase...,Dell XPS,2020-07-14,Low,1
3,Christina Dillon,bradleyolson@example.org,low,Account access,I'm having an issue with the {product_purchase...,Microsoft Office,2020-11-13,Low,1
4,Alexander Carroll,bradleymark@example.com,low,Data loss,I'm having an issue with the {product_purchase...,Autodesk AutoCAD,2020-02-04,Low,1
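
The `{0: 'critical', 1: 'low'}` mapping printed above follows from the category labels being numbered in sorted order. A dependency-free sketch of the same idea is shown here; `encode_labels` is a hypothetical helper for illustration and is not part of the notebook.

```python
def encode_labels(labels):
    """Map each distinct label to a string code, numbering labels in sorted order."""
    categories = sorted(set(labels))
    mapping = {label: str(code) for code, label in enumerate(categories)}
    return [mapping[label] for label in labels], mapping


codes, mapping = encode_labels(["low", "critical", "low"])
print(mapping)  # {'critical': '0', 'low': '1'}
print(codes)    # ['1', '0', '1']
```

Keeping the codes as strings matches the notebook, where the code is later written out as the completion text.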


<h4>Prepare the data</h4>

In [31]:
# remove '###' from the descriptions so it cannot clash with a prompt separator later on
df['ticket_description'] = df['ticket_description'].str.replace("###", '', regex=False)
# df['ticket_description'] = df.apply(lambda row: row['ticket_description'].replace("{product_purchased}", row['product_purchased']), axis=1)

df = df.reset_index(drop=True)

df['prompt'] =  "Subject:" + df['ticket_subject'] + "\nFrom:" + df['customer_name'] + "<" + df['customer_email'] + ">\nDate:" + df['date_of_purchase'] + "\nContent:" + df['ticket_description'] #+ "\n\n###\n\n"
df['completion'] = df['ticket_type_category']#.apply(lambda x: ' ' + str(x))

df = df.sample(frac=1).reset_index(drop=True)

# draw 625 random samples for each class and shuffle the data
df = df.groupby('completion').apply(lambda x: x.sample(n=625, random_state=42)).sample(frac=1, random_state=42).reset_index(drop=True)

df.shape

(1250, 11)
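
The balancing step above draws the same number of rows from each class and then shuffles them, which is why the result has 2 × 625 = 1250 rows. A minimal pure-Python sketch of that idea follows; `balanced_sample` and the toy rows are assumptions made for illustration, and the notebook itself does this with pandas.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, n_per_class, seed=42):
    """Draw n_per_class rows for every label, then shuffle the combined sample."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label, group in sorted(by_label.items()):
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))
    rng.shuffle(sample)
    return sample


# Toy data: 10 rows of class "0" and 4 rows of class "1"
rows = [{"prompt": f"ticket {i}", "completion": "0" if i < 10 else "1"} for i in range(14)]
subset = balanced_sample(rows, "completion", n_per_class=3)
print(len(subset))  # 6
```

The explicit size check makes the requirement visible: every class needs at least `n_per_class` rows for sampling without replacement to work.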

Ensure that the prompt plus the completion does not exceed 2048 tokens, including the separator.

In [33]:
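# the legacy `ada` and `davinci` base models use the r50k_base encoding; cl100k_base belongs to newer models
encoding = tiktoken.get_encoding("r50k_base")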

In [37]:
# Tokenize the prompt and completion
def calculate_token_length(row):
    return len(encoding.encode(row['prompt'])) + len(encoding.encode(row['completion']))

df['token_length'] = df.apply(calculate_token_length, axis=1)

df = df[df['token_length'] <= 2048]

df = df[['prompt', 'completion']]
df.to_json('dataset.jsonl', orient='records', lines=True)
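
Before handing the file to the preparation tool, it can help to confirm that every line is a standalone JSON object with exactly a `prompt` and a `completion`. The check below is a small sketch using only the standard library; `validate_jsonl` and the sample lines are hypothetical and not part of the notebook.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) pairs for lines that are not valid prompt/completion records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict) or set(record) != {"prompt", "completion"}:
            problems.append((number, "expected exactly the keys 'prompt' and 'completion'"))
        elif not all(isinstance(record[k], str) and record[k] for k in ("prompt", "completion")):
            problems.append((number, "prompt and completion must be non-empty strings"))
    return problems


sample_lines = [
    '{"prompt": "Subject:Data loss\\nContent:...", "completion": "1"}',
    '{"prompt": "Subject:Network problem"}',
]
print(validate_jsonl(sample_lines))
```

To check the real file, the same function can be given an open file handle, for example `validate_jsonl(open('dataset.jsonl'))`.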

<h4>Prepare the custom dataset using the OpenAI tool</h4>
The CLI command: <code>openai tools fine_tunes.prepare_data -f [TRAINING_FILE_NAME]</code>

This command walks you through validating your data, makes suggestions, and reformats the file.

We additionally specify `-q`, which auto-accepts all suggestions.

In [38]:
!openai tools fine_tunes.prepare_data -f dataset.jsonl  -q

Analyzing...

- Your file contains 1250 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- All prompts start with prefix `Subject:`
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/gu
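
The suggestions in the output above (a separator at the end of each prompt, a leading space on each completion, and a held-out set) can be approximated by hand. The sketch below is not the tool's implementation: `prepare_records` is a hypothetical helper, the 20% validation share is an assumption, and the separator `\n\n###\n\n` is taken from the inference cells later in this notebook.

```python
import random

SEPARATOR = "\n\n###\n\n"  # assumed separator; it must not occur inside any prompt


def prepare_records(records, valid_fraction=0.2, seed=0):
    """Append the separator to prompts, prefix completions with a space, split train/valid."""
    prepared = [
        {"prompt": r["prompt"] + SEPARATOR, "completion": " " + r["completion"].lstrip()}
        for r in records
    ]
    random.Random(seed).shuffle(prepared)
    n_valid = int(len(prepared) * valid_fraction)
    return prepared[n_valid:], prepared[:n_valid]


records = [{"prompt": f"Subject:ticket {i}", "completion": str(i % 2)} for i in range(10)]
train, valid = prepare_records(records)
print(len(train), len(valid))  # 8 2
print(repr(train[0]["completion"]))
```

Whatever separator is used for training has to be appended to prompts at inference time as well, exactly once.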

<h4>Fine-Tune the model</h4>

For multiclass classification, the tool suggests adding `--compute_classification_metrics --classification_n_classes <N>`, where `<N>` is the number of classes. For binary classification, as in this notebook (`critical` vs. `low`), you need to add `--classification_positive_class <label>`.
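
As a rough illustration of how these flags depend on the number of classes, the sketch below assembles them in Python. `classification_flags` is a hypothetical helper; it mirrors the command used in the next cell, which passes both the class count and the positive class for the binary case.

```python
def classification_flags(n_classes, positive_class=None):
    """Build the metric-related flags for `openai api fine_tunes.create`."""
    if n_classes < 2:
        raise ValueError("classification needs at least two classes")
    flags = ["--compute_classification_metrics", "--classification_n_classes", str(n_classes)]
    if n_classes == 2:
        if positive_class is None:
            raise ValueError("binary classification needs a positive class label")
        flags += ["--classification_positive_class", positive_class]
    return flags


print(classification_flags(2, positive_class=" 0"))
print(classification_flags(5))
```

The positive class is written as `" 0"`, with the leading space, because that is how the completions appear in the prepared file.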

In [47]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

openai.api_key = os.getenv("OPENAI_API_KEY")

# !openai api fine_tunes.create -t "dataset_prepared_train.jsonl" -v "dataset_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 3 -m ada
!openai api fine_tunes.create -t "dataset_prepared_train.jsonl" -v "dataset_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 2 --classification_positive_class " 0" -m ada --n_epochs 7

Found potentially duplicated files with name 'dataset_prepared_train.jsonl', purpose 'fine-tune' and size 430137 bytes
file-dKpIuthuNRKVQGBwH0D5v3Kn
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: ^C



After you've started a fine-tune job, it may take some time to complete. If the event stream is interrupted for any reason, you can resume it by running:

In [60]:
!openai api fine_tunes.follow -i ft-WmyiGqKm3uop96kRlMfKJG8z

[2023-06-17 22:22:45] Created fine-tune: ft-WmyiGqKm3uop96kRlMfKJG8z
[2023-06-17 22:24:57] Fine-tune costs $6.38
[2023-06-17 22:24:58] Fine-tune enqueued. Queue number: 0
[2023-06-17 22:24:59] Fine-tune started
[2023-06-17 22:35:46] Completed epoch 1/2
[2023-06-17 22:45:04] Uploaded model: davinci:ft-personal-2023-06-17-17-15-03
[2023-06-17 22:45:05] Uploaded result file: file-RsKkJTKvZJltASH15ji21SWM
[2023-06-17 22:45:05] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-personal-2023-06-17-17-15-03 -p <YOUR_PROMPT>


<h5>Results and expected model performance</h5>

In [62]:
!openai api fine_tunes.results -i ft-WmyiGqKm3uop96kRlMfKJG8z > result.csv



In [63]:
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)

Unnamed: 0,step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy,validation_loss,validation_sequence_accuracy,validation_token_accuracy,classification/accuracy,classification/precision,classification/recall,classification/auroc,classification/auprc,classification/f1.0
1000,1001,229794,2002,0.013592,0.5,0.5,0.017348,0.0,0.0,0.524,0.0,0.0,0.484091,0.468847,0.0


In [None]:
test = pd.read_json('dataset_prepared_valid.jsonl', lines=True)
test.head()

The accuracy reaches 52.4%, which is barely above chance for two balanced classes; precision, recall, and F1 for the positive class are all 0.0. This accuracy is the same as that of the scikit-learn models (see [here](https://github.com/sankalptambe/openai-model-ada-fine-tuned/blob/main/Comparing%20ChatGPT%20with%20sklearn%20models.ipynb)).

The Ada model expects `~500` samples per category for good results. The 52.4% is due to the poor quality of the dataset, which is used here only to compare model performance.
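
One pattern that yields accuracy near 50% together with precision, recall, and F1 of 0.0 is a model that always answers with the non-positive class. The sketch below shows this with made-up label counts; it is one possible explanation of the metrics, not a diagnosis confirmed by the results file, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(actual, predicted, positive):
    """Compute accuracy, precision, recall and F1 with respect to one positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}


# Made-up validation labels: 119 positives (" 0") and 131 negatives (" 1")
actual = [" 0"] * 119 + [" 1"] * 131
predicted = [" 1"] * len(actual)  # a model that always answers " 1"
metrics = binary_metrics(actual, predicted, positive=" 0")
print(metrics)
```

With these counts the accuracy is 131 / 250 = 0.524 while every positive-class metric is zero, so accuracy alone says little about whether the model has learned anything.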

In [None]:
results[results['classification/accuracy'].notnull()]['classification/accuracy'].plot()

<h4>Using the model</h4>

We can now call the model to get the predictions.

In [None]:
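import openai

ft_model = 'ada:ft-personal-2023-06-17-10-47-01'
separator = '\n\n###\n\n'
prompt = test['prompt'][1]
# the prepared file may already end each prompt with the separator; append it only if missing
if not prompt.endswith(separator):
    prompt += separator
res = openai.Completion.create(model=ft_model, prompt=prompt, max_tokens=1, temperature=0)
print('Actual value: ' + str(test['completion'][1]))
res['choices'][0]['text']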

To get the log probabilities, we can specify the `logprobs` parameter on the completion request.

In [None]:
prompt = test['prompt'][0]
prompt += '' if prompt.endswith('\n\n###\n\n') else '\n\n###\n\n'  # avoid doubling the separator
res = openai.Completion.create(model=ft_model, prompt=prompt, max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]

By requesting `logprobs`, we can see the prediction (log) probability for each class.
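
Log probabilities are easier to read once converted back to probabilities with `exp`. The sketch below does that for a dictionary in the shape of `top_logprobs[0]`; the numbers are hypothetical, not output from the fine-tuned model, and `logprobs_to_probs` is an illustrative helper.

```python
import math


def logprobs_to_probs(top_logprobs):
    """Convert a token -> log-probability mapping to probabilities renormalised over the listed tokens."""
    probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


# Hypothetical values in the shape returned by top_logprobs[0]
example = {" 0": -0.105, " 1": -2.303}
probs = logprobs_to_probs(example)
best = max(probs, key=probs.get)
print(best, probs)
```

Renormalising over the returned tokens is a design choice: with `logprobs=2` only the two most likely tokens are listed, so their raw probabilities need not sum to exactly 1.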