<a href="https://colab.research.google.com/github/zmarkofsky/DSML_capstone/blob/main/colabs/Tune_Empathy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Empathy Classification Model Fine Tuning

The goal of this notebook is to fine-tune the OpenAI GPT-3.5 model so that it can correctly identify whether a piece of text is empathetic or not.

## Inputs and Setup

In [1]:
!pip install --upgrade openai tiktoken



In [2]:
!pip install git+https://github.com/wandb/wandb.git@e688ecc9a816e12aef82878e2ab12befe678a3e6

Collecting git+https://github.com/wandb/wandb.git@e688ecc9a816e12aef82878e2ab12befe678a3e6
  Cloning https://github.com/wandb/wandb.git (to revision e688ecc9a816e12aef82878e2ab12befe678a3e6) to /tmp/pip-req-build-q0os99ci
  Running command git clone --filter=blob:none --quiet https://github.com/wandb/wandb.git /tmp/pip-req-build-q0os99ci
  Running command git rev-parse -q --verify 'sha^e688ecc9a816e12aef82878e2ab12befe678a3e6'
  Running command git fetch -q https://github.com/wandb/wandb.git e688ecc9a816e12aef82878e2ab12befe678a3e6
  Running command git checkout -q e688ecc9a816e12aef82878e2ab12befe678a3e6
  Resolved https://github.com/wandb/wandb.git to commit e688ecc9a816e12aef82878e2ab12befe678a3e6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
import json
import openai
import os
import pandas as pd
from pprint import pprint
import tiktoken
from sklearn.model_selection import train_test_split
from google.colab import userdata
import numpy as np
from collections import defaultdict
import wandb
from wandb.integration.openai.fine_tuning import WandbLogger


client = openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
encoding = tiktoken.get_encoding("cl100k_base")

WANDB_PROJECT = "OpenAI-Empathy-Fine-Tune"

In [4]:
#Estimated token counter
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
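    # The values printed here are the 10th and 90th percentiles (quantiles 0.1 and 0.9), despite the "p5 / p95" label.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")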

## Data Import and Prep

In [5]:
df_empathy = pd.read_csv("https://raw.githubusercontent.com/UrologyUnbound/SIOP_ML_2024_Discord/main/data/train/empathy_train.csv")
df_empathy.empathy = df_empathy.empathy.astype(str)
df_empathy.head()

Unnamed: 0,_id,text,empathy
0,116,"Hi Jonathan, I hope this message finds you wel...",1
1,54,"Jonathan, I hope you are well - I am very exci...",1
2,1,"Hi Jonathan, Good to hear you are enjoying the...",1
3,130,"Jonathan, First I want to thank you for your h...",0
4,114,"Hey Jonathan! I've been in touch with Terry, I...",1


In [6]:
df_empathy.empathy = df_empathy.empathy.map({"1":"Empathetic","0":"Non-empathetic"})

In [7]:
# idxmax() returns an index label, so look the row up with .loc rather than .iloc
df_empathy.loc[df_empathy.text.str.len().idxmax(), "text"]

'Hello Jonathan, I hope you are doing well. As I am only in office today and you are on a travel, I am contacting you via Mail.I became a feedback from people related to your reports in the Beta project that I want to discuss with you. In general, the team is happy about your skills in contribution and espacially your way to identify improvements they may would have not seen. At first, your technical reports contain too much commentary that are not neccessary and make the report too long. Your ideas and thoughts about the technical solution are important, but you should only note them down separatly to discuss those with the project team, not to write them in the reports.Second is that your technical writing level is not on the level it is required to be. to presennt this to customer or our CEO. This requires that the colleagues are correcting your work  what leads due to the high pressure to delays, which is embarassingfor the colleagues who commited end dates.It might be that you wil

In [8]:
df_empathy.dtypes

_id         int64
text       object
empathy    object
dtype: object

In [9]:
x_train,x_test,y_train,y_test = train_test_split(df_empathy["text"],df_empathy["empathy"],random_state=42,test_size = 0.2, shuffle=True, stratify=df_empathy["empathy"])
train_data = pd.concat([x_train , y_train], axis = 1)
# Held-out split: built from x_test / y_test so it does not overlap the training rows
test_data = pd.concat([x_test, y_test], axis=1)
train_data.head()


Unnamed: 0,text,empathy
3,"Jonathan, First I want to thank you for your h...",Non-empathetic
26,"Hi Jonathan, Thank you for your message. I am ...",Empathetic
22,"Hi Jonathan, I just happened to know that you ...",Empathetic
8,"Hi Jonathan, I have been hearing about some of...",Empathetic
27,"Hi Jonathan, I am glad to hear that you are en...",Non-empathetic
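The validation file is only informative if its rows are disjoint from the training rows, so a quick sanity check on the two index sets is worth running right after the split. The following is a minimal sketch; `check_split` is a hypothetical helper that is not part of the original notebook, and the toy indices stand in for `train_data.index` and `test_data.index`.

```python
def check_split(train_index, test_index):
    """Summarise two index collections and report any rows they share."""
    train_set, test_set = set(train_index), set(test_index)
    overlap = sorted(train_set & test_set)
    return {"n_train": len(train_set), "n_test": len(test_set), "overlap": overlap}


# Toy indices standing in for train_data.index and test_data.index.
report = check_split([3, 26, 22, 8, 27], [1, 5, 9])
print(report)
```

In the notebook this would be called as `check_split(train_data.index, test_data.index)`; a non-empty `overlap` list means the validation metrics will be inflated by rows the model has already seen.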


In [10]:
training_data = []

system_message = """
Your task is to classify the provided text as either "Empathetic" or "Non-empathetic".
Empathetic Responses involve understanding, supportiveness, and active engagement. Understanding is demonstrated by showing comprehension
of the individual's feelings and perspective. Supportiveness entails offering genuine support, guidance, or constructive feedback while
respecting the individual's contributions and feelings. Active engagement is displayed through asking questions or suggesting actions that
actively engage with the individual's situation. Non-Empathetic Responses lack empathetic qualities. They may lack personalization, offering
generic advice or feedback without addressing the individual's specific feelings or situation. Dismissiveness occurs when the individual's
feelings, concerns, or contributions are downplayed or ignored. Superficiality refers to appearing empathetic on the surface but lacking depth
in understanding or supporting the individual's actual needs.
"""

def create_user_message(row):
    return f"""Message to Classify: {row['text']}"""

def prepare_example_conversation(row):
    messages = []
    messages.append({"role": "system", "content": system_message})

    user_message = create_user_message(row)
    messages.append({"role": "user", "content": user_message})

    messages.append({"role": "assistant", "content": row["empathy"]})

    return {"messages": messages}

pprint(prepare_example_conversation(train_data.iloc[0]))

{'messages': [{'content': '\n'
                          'Your task is to classify the provided text as '
                          'either "Empathetic" or "Non-empathetic".\n'
                          'Empathetic Responses involve understanding, '
                          'supportiveness, and active engagement. '
                          'Understanding is demonstrated by showing '
                          'comprehension\n'
                          "of the individual's feelings and perspective. "
                          'Supportiveness entails offering genuine support, '
                          'guidance, or constructive feedback while\n'
                          "respecting the individual's contributions and "
                          'feelings. Active engagement is displayed through '
                          'asking questions or suggesting actions that\n'
                          "actively engage with the individual's situation. "
                          'Non-Empathet

In [11]:
training_json = train_data.apply(prepare_example_conversation, axis=1).tolist()
test_json = test_data.apply(prepare_example_conversation, axis=1).tolist()


for example in training_json[:5]:
    print(example)

{'messages': [{'role': 'system', 'content': '\nYour task is to classify the provided text as either "Empathetic" or "Non-empathetic".\nEmpathetic Responses involve understanding, supportiveness, and active engagement. Understanding is demonstrated by showing comprehension\nof the individual\'s feelings and perspective. Supportiveness entails offering genuine support, guidance, or constructive feedback while\nrespecting the individual\'s contributions and feelings. Active engagement is displayed through asking questions or suggesting actions that\nactively engage with the individual\'s situation. Non-Empathetic Responses lack empathetic qualities. They may lack personalization, offering\ngeneric advice or feedback without addressing the individual\'s specific feelings or situation. Dismissiveness occurs when the individual\'s\nfeelings, concerns, or contributions are downplayed or ignored. Superficiality refers to appearing empathetic on the surface but lacking depth\nin understanding o

In [12]:
def write_jsonl(data_list: list, filename: str) -> None:
    with open(filename, "w") as out:
        for ddict in data_list:
            jout = json.dumps(ddict) + "\n"
            out.write(jout)

In [13]:
training_file_name = "tmp_empathy_finetune_training.jsonl"
write_jsonl(training_json, training_file_name)

testing_file_name = "tmp_empathy_finetune_testing.jsonl"
write_jsonl(test_json, testing_file_name)
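The files written above hold one JSON object per line. A small reader makes it easy to confirm that they round-trip before uploading them. This is a sketch only: `read_jsonl` is a hypothetical counterpart to `write_jsonl`, and the toy example and temporary path are illustrative.

```python
import json
import os
import tempfile


def read_jsonl(filename):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(filename, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip a toy example through a temporary file.
example = {"messages": [
    {"role": "system", "content": "Classify the text."},
    {"role": "user", "content": "Message to Classify: Hi Jonathan"},
    {"role": "assistant", "content": "Empathetic"},
]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.jsonl")
    with open(path, "w", encoding="utf-8") as out:
        out.write(json.dumps(example) + "\n")
    loaded = read_jsonl(path)

print(loaded == [example])
```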

In [14]:
!head -n 5 tmp_empathy_finetune_training.jsonl

{"messages": [{"role": "system", "content": "\nYour task is to classify the provided text as either \"Empathetic\" or \"Non-empathetic\".\nEmpathetic Responses involve understanding, supportiveness, and active engagement. Understanding is demonstrated by showing comprehension\nof the individual's feelings and perspective. Supportiveness entails offering genuine support, guidance, or constructive feedback while\nrespecting the individual's contributions and feelings. Active engagement is displayed through asking questions or suggesting actions that\nactively engage with the individual's situation. Non-Empathetic Responses lack empathetic qualities. They may lack personalization, offering\ngeneric advice or feedback without addressing the individual's specific feelings or situation. Dismissiveness occurs when the individual's\nfeelings, concerns, or contributions are downplayed or ignored. Superficiality refers to appearing empathetic on the surface but lacking depth\nin understanding or

### Pre-Tuning Checks

In [15]:
# Format error checks - Testing (validation) set
with open("/content/tmp_empathy_finetune_testing.jsonl", 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [16]:
# Format error checks - Training set
with open("/content/tmp_empathy_finetune_training.jsonl", 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found
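The two cells above apply identical checks to different files. One way to avoid the duplication is to wrap the checks in a function that takes an already-loaded dataset. The sketch below assumes the same rules as the cells above; `find_format_errors` and the two toy examples are hypothetical names introduced for illustration.

```python
from collections import defaultdict

ALLOWED_KEYS = ("role", "content", "name", "function_call", "weight")
ALLOWED_ROLES = ("system", "user", "assistant", "function")


def find_format_errors(dataset):
    """Return {error_name: count} using the same checks as the cells above."""
    errors = defaultdict(int)
    for ex in dataset:
        if not isinstance(ex, dict):
            errors["data_type"] += 1
            continue
        messages = ex.get("messages")
        if not messages:
            errors["missing_messages_list"] += 1
            continue
        for message in messages:
            if "role" not in message or "content" not in message:
                errors["message_missing_key"] += 1
            if any(k not in ALLOWED_KEYS for k in message):
                errors["message_unrecognized_key"] += 1
            if message.get("role") not in ALLOWED_ROLES:
                errors["unrecognized_role"] += 1
            content = message.get("content")
            function_call = message.get("function_call")
            if (not content and not function_call) or not isinstance(content, str):
                errors["missing_content"] += 1
        if not any(m.get("role") == "assistant" for m in messages):
            errors["example_missing_assistant_message"] += 1
    return dict(errors)


# Toy data: one well-formed example and one with empty content and no assistant turn.
good = {"messages": [
    {"role": "user", "content": "Message to Classify: hello"},
    {"role": "assistant", "content": "Empathetic"},
]}
bad = {"messages": [{"role": "user", "content": ""}]}
print(find_format_errors([good]))
print(find_format_errors([good, bad]))
```

An empty dict corresponds to the "No errors found" message printed by the cells above.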


In [17]:
# Warnings and token counts - Training set (these counts feed the cost estimate in the next cell)
with open("/content/tmp_empathy_finetune_training.jsonl", 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 324, 625
mean / median: 435.75, 436.0
p5 / p95: 352.7, 503.6

#### Distribution of num_assistant_tokens_per_example:
min / max: 3, 4
mean / median: 3.4166666666666665, 3.0
p5 / p95: 3.0, 4.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


In [18]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
print(f"Estimated training cost ~${((n_epochs * n_billing_tokens_in_dataset)/1000)*.0080}")

Dataset has ~10458 tokens that will be charged for during training
By default, you'll train for 4 epochs on this dataset
By default, you'll be charged for ~41832 tokens
Estimated training cost ~$0.334656
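The numbers printed above can be reproduced by hand. The reported 10458 tokens at a mean of 435.75 tokens per example imply 24 examples; 24 x 3 = 72 is below the 100-example target, so the default becomes 100 // 24 = 4 epochs. The sketch below repeats that arithmetic as two small functions; the function names are hypothetical, and the $0.0080 per 1K tokens rate is the one hard-coded in the cell above, not a current price.

```python
def default_epochs(n_examples, target_epochs=3, min_target=100, max_target=25000,
                   min_epochs=1, max_epochs=25):
    """Same default-epoch heuristic as the cell above."""
    n_epochs = target_epochs
    if n_examples * target_epochs < min_target:
        n_epochs = min(max_epochs, min_target // n_examples)
    elif n_examples * target_epochs > max_target:
        n_epochs = max(min_epochs, max_target // n_examples)
    return n_epochs


def estimated_cost(n_tokens, n_epochs, usd_per_1k_tokens=0.0080):
    """Billed tokens (tokens x epochs) converted to dollars at the assumed rate."""
    return n_epochs * n_tokens / 1000 * usd_per_1k_tokens


epochs = default_epochs(24)            # 24 examples, derived from the printed statistics
cost = estimated_cost(10458, epochs)   # 10458 tokens, as printed above
print(epochs, round(cost, 6))
```

The fine-tuning cell later in the notebook sets `n_epochs` to 3 explicitly; under the same assumed rate, `estimated_cost(10458, 3)` comes to roughly $0.25.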


In [19]:
with open(training_file_name, "rb") as training_fd:
    training_response = client.files.create(
        file=training_fd, purpose="fine-tune"
    )

training_file_id = training_response.id

with open(testing_file_name, "rb") as validation_fd:
    validation_response = client.files.create(
        file=validation_fd, purpose="fine-tune"
    )
validation_file_id = validation_response.id

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)

Training file ID: file-aLnklu8um8QdfgnyZq9R8nx9
Validation file ID: file-b7FhDPgmIKejbVYxWsj7fHwy


## Fine Tuning

In [22]:
# Run this cell only to create a new fine-tuning job; re-running it starts (and bills for) a new job.

# Uncomment the code below to launch a new fine-tuning job
# response = client.fine_tuning.jobs.create(
#     training_file=training_file_id,
#     validation_file=validation_file_id,
#     model="gpt-3.5-turbo",
#     hyperparameters = {"n_epochs":3, "batch_size":"auto", "learning_rate_multiplier":2},
#     suffix="empathy_tuned_v2"
# )

# job_id = response.id
job_id = "ftjob-p8I8rBrqF8bN47qhiyOWzD9h"

# print("Job ID:", response.id)
# print("Status:", response.status)

In [23]:
WandbLogger.sync(fine_tune_job_id=job_id, project=WANDB_PROJECT, openai_client=client)

[34m[1mwandb[0m: Retrieving fine-tune job...


[34m[1mwandb[0m: Waiting for the OpenAI fine-tuning job to finish training...
[34m[1mwandb[0m: To avoid blocking, you can call `WandbLogger.sync` with `wait_for_job_success=False` after OpenAI training completes.
[34m[1mwandb[0m: Fine-tuning finished, logging metrics, model metadata, and run metadata to Weights & Biases
[34m[1mwandb[0m: Logging training/validation files...


metric,history
train_accuracy,█▄▄▁▄█▄████▄▄█▄█▄████▄▄█▄▄▄▄▄▄▄█▄▄███▄█▄
train_loss,▁▆▂▃▂▁▃▁▁▁▁▆▆▁▆▁▆▁▁▁▁▃▃▁▇▅█▄▃▄█▁▇▇▁▁▁▇▁▇
valid_loss,▂▂▂▂▁▁▂▁▆▁▆▇▁▁▁▁▁▆▁▁▅█▁▄▂▁▁▁▁▇█▁▁▄▁▁▃▂▁▁
valid_mean_token_accuracy,█▁▁▁██▁█▂█▂▂█████▂██▂▂█▂▂████▂▂██▂██▂▂██

metric,value
fine_tuned_model,ft:gpt-3.5-turbo-012...
status,succeeded
train_accuracy,0.83333
train_loss,1.66267
valid_loss,1e-05
valid_mean_token_accuracy,1.0


'🎉 wandb sync completed successfully'

In [None]:
#Check Job Status
response = client.fine_tuning.jobs.retrieve(job_id)

print("Job ID:", response.id)
print("Status:", response.status)
print("Trained Tokens:", response.trained_tokens)
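Instead of re-running the status cell by hand, the check can be wrapped in a polling loop that stops once the job reaches a terminal status. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the poll interval and poll limit are arbitrary, and `fake_retrieve` is a stand-in so the example runs without an API call.

```python
import time
from types import SimpleNamespace

# Statuses after which a fine-tuning job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Stand-in for client.fine_tuning.jobs.retrieve so the sketch runs offline.
_statuses = iter(["validating_files", "running", "succeeded"])


def fake_retrieve(job_id):
    return SimpleNamespace(id=job_id, status=next(_statuses))


final_job = wait_for_job(fake_retrieve, "ftjob-example", poll_seconds=0, sleep=lambda s: None)
print(final_job.status)
```

In the notebook the call would be `wait_for_job(client.fine_tuning.jobs.retrieve, job_id)`, after which `final_job.status` should be checked before reading `fine_tuned_model`.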


In [None]:
# Track fine-tuning events
response = client.fine_tuning.jobs.list_events(job_id)

events = response.data
events.reverse()

for event in events:
    print(event.message)

In [24]:
# When the job is done, run this cell to fetch the fine-tuned model ID
response = client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model_id = response.fine_tuned_model

if fine_tuned_model_id is None:
    raise RuntimeError("Fine-tuned model ID not found. Your job has likely not been completed yet.")

print("Fine-tuned model ID:", fine_tuned_model_id)

Fine-tuned model ID: ft:gpt-3.5-turbo-0125:personal:empathy-tuned-v2:99k2Fsmn


## Fine-Tuned Model Testing

In [25]:
df_empathy_dev = pd.read_csv("https://raw.githubusercontent.com/UrologyUnbound/SIOP_ML_2024_Discord/main/data/dev/empathy_val_public.csv")

In [26]:
test_row = df_empathy_dev.iloc[2]
test_messages = []
test_messages.append({"role": "system", "content": system_message})
user_message = create_user_message(test_row)
test_messages.append({"role": "user", "content": user_message})

pprint(test_messages)

[{'content': '\n'
             'Your task is to classify the provided text as either '
             '"Empathetic" or "Non-empathetic".\n'
             'Empathetic Responses involve understanding, supportiveness, and '
             'active engagement. Understanding is demonstrated by showing '
             'comprehension\n'
             "of the individual's feelings and perspective. Supportiveness "
             'entails offering genuine support, guidance, or constructive '
             'feedback while\n'
             "respecting the individual's contributions and feelings. Active "
             'engagement is displayed through asking questions or suggesting '
             'actions that\n'
             "actively engage with the individual's situation. Non-Empathetic "
             'Responses lack empathetic qualities. They may lack '
             'personalization, offering\n'
             "generic advice or feedback without addressing the individual's "
             'specific feelings o

In [27]:
response = client.chat.completions.create(
    model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=500
)
print(response.choices[0].message.content)

Empathetic
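The model answers with the text labels used in training, while the source data encodes empathy as 1 and 0. To score predictions against labelled rows, the text has to be mapped back. The sketch below inverts the mapping applied during data prep; `decode_label` and `accuracy` are hypothetical helpers, and treating an unrecognised output as `None` (counted as wrong) is an assumption, not something the notebook specifies.

```python
# Inverse of the {"1": "Empathetic", "0": "Non-empathetic"} mapping used during data prep.
LABEL_TO_INT = {"empathetic": 1, "non-empathetic": 0}


def decode_label(model_output):
    """Map the model's text label back to 1/0; return None if it is not recognised."""
    return LABEL_TO_INT.get(model_output.strip().lower())


def accuracy(predictions, targets):
    """Fraction of matching positions; an unrecognised prediction (None) counts as wrong."""
    if not targets:
        raise ValueError("targets must not be empty")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy model outputs, including stray whitespace and one unexpected answer.
outputs = ["Empathetic", " non-empathetic\n", "Unsure"]
predictions = [decode_label(o) for o in outputs]
print(predictions)
print(accuracy(predictions, [1, 0, 1]))
```

Each entry in `outputs` would come from `response.choices[0].message.content` for one row of a labelled evaluation set.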
