## Setup

To complete this guide, you will need to install the following packages:

- openai
- pandas
- requests

You will also need:

- An OpenAI account (https://platform.openai.com/)
- An OpenAI API key

In [1]:
!pip install openai pandas requests --quiet


In [2]:
import json
import os

from openai import OpenAI
import pandas as pd

In [3]:
# Uncomment the line below and set the OPENAI_API_KEY environment variable to your account's key!
# os.environ['OPENAI_API_KEY'] = 'XXX'

client = OpenAI()

## Problem Definition: Insurance Support Ticket Classifier

*Note: The problem definition, data, and labels used in this example were synthetically generated using an LLM.*

In the insurance industry, customer support plays a crucial role in ensuring client satisfaction and retention. Insurance companies receive a high volume of support tickets daily, covering a wide range of topics such as billing, policy administration, claims assistance, and more. Manually categorizing these tickets can be time-consuming and inefficient, leading to longer response times and potentially impacting customer experience.

### Task
In last week's exercise, we performed prompt engineering to increase the accuracy of our predictions on the test.tsv dataset to 63.24%.

This week, we will fine-tune our first model to push the accuracy even higher!

#### Labeled Data

The data can be found in the week-2 `data` folder.

We will use the following datasets:
- `./data/train.tsv`
- `./data/test.tsv`

In [4]:
training_examples = pd.read_csv('data/train.tsv', sep='\t')
test_examples = pd.read_csv('data/test.tsv', sep='\t')

# In order to not leak information about the test labels into our prompts, the list of possible categories will be defined 
# based on the training labels.
categories = sorted(training_examples['label'].unique().tolist())
categories_str = '\n'.join(categories)

training_tickets = training_examples['text'].tolist()
training_labels = training_examples['label'].tolist()

test_tickets = test_examples['text'].tolist()
test_labels = test_examples['label'].tolist()

### Dataset Curation

We must first transform our dataset into the format expected by OpenAI and then upload it. Each training example must follow the same message schema used by the Chat Completions API.

See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
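For reference, each line of the training file is one JSON object holding a `messages` list of role/content pairs. The sketch below is a minimal, unofficial sanity check for that shape. The helper name `validate_record`, the set of allowed roles, and the sample ticket and label are assumptions made up for illustration. It does not replace the validation OpenAI runs when the file is uploaded.

```python
import json

# Assumption: this guide only uses plain text messages with these roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: content is not a string")

    # Each example should end with the assistant turn the model learns to produce.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


# Hypothetical record in the same shape this notebook produces.
sample_line = json.dumps({"messages": [
    {"role": "user", "content": "<ticket>My premium was charged twice.</ticket>"},
    {"role": "assistant", "content": "<category>Billing</category>"},
]})
print(validate_record(sample_line))  # prints []
```

Running a check like this over every line before uploading can catch malformed records early, while they are still cheap to fix.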

In [5]:
def create_prompt(ticket):
    return f"""classify a customer support ticket into one of the following categories:
<categories>
{categories_str}
</categories>

Here is the customer support ticket:    
<ticket>{ticket}</ticket>

Respond using this format:
<category>The category label you chose goes here</category>"""    

In [6]:
# Converts the training examples to the format expected by OpenAI.
def training_examples_to_json(examples):
    json_objs = list()
    for idx, example in examples.iterrows():  
        user_msg = create_prompt(example['text'])
        asst_msg = f"<category>{example['label']}</category>"
        msg = {"messages": [
            {"role": "user", "content": user_msg}, 
            {"role": "assistant", "content": asst_msg}
        ]}
        json_objs.append(msg)
    
    return json_objs

training_json = training_examples_to_json(training_examples)

In [7]:
# Writes the data to a file and then uploads it to OpenAI
dataset_file_name = 'ticket-classification_training_data.jsonl'

with open(dataset_file_name, 'w') as f:
    for obj in training_json:
        json.dump(obj, f)
        f.write('\n')

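# Open the file in a context manager so the handle is closed after the upload
with open(dataset_file_name, 'rb') as f:
    uploaded_file = client.files.create(file=f, purpose='fine-tune')

uploaded_file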

FileObject(id='file-fuCR9aHq80i2e8XLTvVkvmsQ', bytes=52416, created_at=1724943702, filename='ticket-classification_training_data.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

### Fine-Tuning

We will now fine-tune a model using the OpenAI API. OpenAI supports creating fine-tuning jobs either through the fine-tuning UI or programmatically. The number of epochs, learning rate multiplier, and batch size can all be tuned manually for your use case. In this exercise, we will use the default parameters.

See https://platform.openai.com/docs/guides/fine-tuning/create-a-fine-tuned-model for more details
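This exercise sticks to the defaults, but if you later want to override them, one option is to pass a `hyperparameters` dictionary when creating the job. The sketch below only builds the keyword arguments, so it runs without calling the API. The helper name `build_job_request`, the placeholder file ID `file-XXX`, and the value `n_epochs=3` are illustrative assumptions, not recommendations. Newer SDK versions may prefer a different argument layout, so check the linked documentation before relying on this.

```python
def build_job_request(training_file_id, model, n_epochs=None,
                      batch_size=None, learning_rate_multiplier=None):
    """Builds keyword arguments for client.fine_tuning.jobs.create.

    Any hyperparameter left as None is omitted so the API default applies.
    """
    overrides = {
        "n_epochs": n_epochs,
        "batch_size": batch_size,
        "learning_rate_multiplier": learning_rate_multiplier,
    }
    hyperparameters = {k: v for k, v in overrides.items() if v is not None}

    request = {"training_file": training_file_id, "model": model}
    if hyperparameters:
        request["hyperparameters"] = hyperparameters
    return request


# 'file-XXX' is a placeholder; use the ID returned by your own upload.
request = build_job_request("file-XXX", "gpt-4o-mini-2024-07-18", n_epochs=3)
print(request)

# To submit the job (requires a valid API key and file ID):
# client.fine_tuning.jobs.create(**request)
```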

In [8]:
# Creates a training job with the default hyperparameters
client.fine_tuning.jobs.create(
  training_file='file-fuCR9aHq80i2e8XLTvVkvmsQ', # the file ID that was returned when the training file was uploaded to the OpenAI API.
  model='gpt-4o-mini-2024-07-18'
)

FineTuningJob(id='ftjob-rRFzvpFyRis67jPmQOpo7ijA', created_at=1724943716, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-4o-mini-2024-07-18', object='fine_tuning.job', organization_id='org-bFtmYNQfekSDNC4ezHI3dmrI', result_files=[], seed=1489947951, status='validating_files', trained_tokens=None, training_file='file-fuCR9aHq80i2e8XLTvVkvmsQ', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)
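The job above starts in the `validating_files` state and takes a while to finish. Instead of watching the dashboard, you could poll the job from the notebook. The sketch below is one possible way to do that. The helper `wait_for_job`, the polling interval, and the poll limit are assumptions chosen for illustration. The function takes the retrieval call as an argument so it can be tried out without network access.

```python
import time

# Statuses after which a fine-tuning job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=240):
    """Calls retrieve(job_id) repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Intended use with the client from this notebook (requires a valid API key):
# job = wait_for_job(client.fine_tuning.jobs.retrieve, 'ftjob-rRFzvpFyRis67jPmQOpo7ijA')
# print(job.status, job.fine_tuned_model)
```

Once the status is `succeeded`, the job's `fine_tuned_model` field holds the model ID to use for evaluation.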

### Evaluate Results

We will now deploy our models and evaluate the results. We will calculate the accuracy of two different models:

- The base gpt-4o-mini model without any fine-tuning.
- Our fine-tuned model.

In the example below, you'll see that fine-tuning improved accuracy on our test set from 69.12% to 94.12%!

See https://platform.openai.com/docs/guides/fine-tuning/use-a-fine-tuned-model for more details

In [9]:
# Uses an LLM to predict class labels for a list of support tickets
def classify_tickets(tickets, model):
    responses = list()

    for ticket in tickets:
        user_prompt = create_prompt(ticket)
    
        response = client.chat.completions.create(
            model=model,
            messages=[{ "role": "user", "content": user_prompt}],
            temperature=0, # setting temperature to 0 for this use case, so that responses are as deterministic as possible
            stop=["</category>"],
            max_tokens=2048,
        )

        response = response.choices[0].message.content.split("<category>")[-1].strip()
        responses.append(response)

    return responses


# Calculates the percent of predictions we classified correctly
def evaluate_accuracy(predicted, actual):
    num_correct = sum(p == a for p, a in zip(predicted, actual))
    return round(100 * num_correct / len(actual), 2)
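The parsing step inside `classify_tickets` can be checked in isolation, without calling the API. The sketch below mirrors that logic in a hypothetical helper, `extract_category`, and additionally strips a closing tag in case the stop sequence did not fire (an extra safeguard that is not in the original cell). The sample strings and labels are made up for illustration.

```python
def extract_category(completion_text):
    """Keeps the text after the last <category> tag, as classify_tickets does."""
    text = completion_text.split("<category>")[-1]
    # The stop sequence normally removes the closing tag; drop it if still present.
    text = text.split("</category>")[0]
    return text.strip()


print(extract_category("<category>Billing"))              # prints Billing
print(extract_category("<category> Claims </category>"))  # prints Claims
print(extract_category("Billing"))                        # prints Billing
```

Because the accuracy metric is an exact string match, any stray whitespace or leftover tag in the parsed label would count as a wrong prediction.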

In [10]:
# Determine how the base model without any fine-tuning performs
model_id = 'gpt-4o-mini'

training_responses = classify_tickets(
    tickets=training_tickets, 
    model=model_id
)
accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"Training Set Accuracy: {accuracy}%")

test_responses = classify_tickets(
    tickets=test_tickets, 
    model=model_id
)

accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"Test Set Accuracy: {accuracy}%")

Training Set Accuracy: 70.59%
Test Set Accuracy: 69.12%


In [12]:
# Determine how the fine-tuned model (trained with the default hyperparameters) performs
model_id = 'ft:gpt-4o-mini-2024-07-18:brainiac-labs::A1b3dY1n' # REPLACE THIS WITH THE OUTPUT MODEL ID IN THE OPENAI FINE-TUNING DASHBOARD

training_responses = classify_tickets(
    tickets=training_tickets, 
    model=model_id
)
accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"Training Set Accuracy: {accuracy}%")

test_responses = classify_tickets(
    tickets=test_tickets, 
    model=model_id
)

accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"Test Set Accuracy: {accuracy}%")

Training Set Accuracy: 100.0%
Test Set Accuracy: 94.12%
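A single accuracy number does not show which categories the model still gets wrong. One way to dig deeper is to break accuracy down by true label. The sketch below is a minimal example using only the standard library. The helper `accuracy_by_category` and the toy labels are assumptions for illustration and are not results from this dataset.

```python
from collections import Counter


def accuracy_by_category(predicted, actual):
    """Returns the percent of correct predictions for each true label."""
    totals = Counter(actual)
    correct = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {
        label: round(100 * correct[label] / totals[label], 2)
        for label in sorted(totals)
    }


# Toy data, not taken from the guide's dataset.
toy_actual = ["Billing", "Billing", "Claims", "Claims"]
toy_predicted = ["Billing", "Claims", "Claims", "Claims"]
print(accuracy_by_category(toy_predicted, toy_actual))
# prints {'Billing': 50.0, 'Claims': 100.0}
```

In the notebook, the same helper could be called as `accuracy_by_category(test_responses, test_labels)` to see where the remaining test errors are concentrated.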
