In this notebook we follow the fine-tuning framework laid out by OpenAI: [link](https://platform.openai.com/docs/guides/fine-tuning/introduction)

We will synthetically create our dataset by having an LLM bootstrap from a few hand-written samples. We will then evaluate the validity of this dataset using an LLM.

For that evaluation we will repurpose the Q&A on Retrieved Data eval from Arize Phoenix.


In [57]:
import os
import dotenv

import pandas as pd
import json
from jinja2 import Template
from sklearn.metrics import classification_report
from pycm import ConfusionMatrix
import matplotlib.pyplot as plt

from openai import OpenAI
import phoenix.evals.default_templates as templates
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

import nest_asyncio

dotenv.load_dotenv()
nest_asyncio.apply()

In [2]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Let's try to predict normally with gpt-3.5-turbo

In [3]:
positive_normal = '''I recently purchased the Galaxy Explorer drone and am absolutely thrilled with its performance. The drone's battery life is impressive, allowing for extended flight times, and the camera quality is outstanding, capturing crisp and clear images from great heights. The intuitive controls made it easy for me as a beginner to navigate, and I've been able to capture some truly breathtaking footage. Highly recommend to anyone looking for a reliable and high-quality drone.'''
negative_normal = '''Had dinner at The Green Terrace last night and was deeply disappointed. Despite the cozy ambiance, the service was sluggish, and our orders took forever to arrive. When the food finally came, it was lukewarm at best. The pasta was overcooked, and the salad lacked freshness. It's a shame because I had high expectations based on the reviews. Sadly, I won't be returning or recommending this place to friends.'''
positive_tough = '''Lost my job, but at least I won't have to endure that dreadful commute anymore'''
negative_tough = '''Your presentation was surprisingly good; I expected much less'''

SYSTEM_PROMPT = '''You are given a social media review. Classify it as positive or negative. '''

original_dataset = [[positive_normal, 'positive'],
               [negative_normal, 'negative'], 
               [positive_tough, 'positive'],
               [negative_tough, 'negative']]

In [4]:
def classify_sentiment(social_media_post, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(model=model, 
                                              temperature = 0.1,
                                              messages=[{"role": "system", "content": SYSTEM_PROMPT}, 
                                                        {"role": "user", "content": social_media_post}])
    return response.choices[0].message.content

In [5]:
df = pd.DataFrame(original_dataset, columns=['text', 'label'])
df['predicted_label'] = df['text'].apply(classify_sentiment).apply(lambda x: x.strip().lower())
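
The `strip().lower()` normalisation above only matches when the model replies with exactly one word. A slightly more forgiving variant is sketched below; it also tolerates trailing punctuation or a short prefix such as "Sentiment:". The helper name `normalize_label` and the `'unknown'` fallback are assumptions made for illustration and are not part of the original notebook.

```python
import re

VALID_LABELS = ("positive", "negative")

def normalize_label(raw_reply):
    """Map a free-form model reply onto 'positive', 'negative', or 'unknown'."""
    words = re.findall(r"[a-z]+", raw_reply.lower())
    found = {word for word in words if word in VALID_LABELS}
    # Accept the reply only when exactly one of the two labels is mentioned
    return found.pop() if len(found) == 1 else "unknown"

print(normalize_label(" Positive.\n"))
print(normalize_label("Sentiment: Negative"))
print(normalize_label("positive or negative"))
```

In the notebook this could replace the lambda, for example `df['text'].apply(classify_sentiment).apply(normalize_label)`. Rows that come back as `'unknown'` are then counted as incorrect rather than silently mismatching.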

In [6]:
df['correct'] = df['label'] == df['predicted_label']

In [7]:
original_incorrect = df[~df['correct']]

## accuracy for base gpt-3.5 model

In [8]:
display(df)

| | text | label | predicted_label | correct |
|---|---|---|---|---|
| 0 | I recently purchased the Galaxy Explorer drone... | positive | positive | True |
| 1 | Had dinner at The Green Terrace last night and... | negative | negative | True |
| 2 | Lost my job, but at least I won't have to endu... | positive | positive | True |
| 3 | Your presentation was surprisingly good; I exp... | negative | positive | False |


In [9]:
df['correct'].mean()

0.75

# build a dataset for finetuning

In [10]:
# Format the examples
formatted_examples = ",\n    ".join([f"'set-{i+1}': (\"{text}\", '{sentiment}')" for i, (text, sentiment) in enumerate(original_dataset)])
formatted_examples = '{'+formatted_examples+'}'

# Step 1: Read the template file
with open('dataset_generator_template.txt', 'r') as file:
    file_content = file.read()

# Step 2: Render the template with the examples
template = Template(file_content)
dataset_gen_system_prompt = template.render(examples=formatted_examples)

print(dataset_gen_system_prompt)


You are responsible for creating a dataset of social media posts and their associated sentiments(postiive or negative).

Read the instructions below carefully and follow each step in order to accurately generate your response.

Please generate 20 sets of social media posts. Each set should not share similar context.

Please output the sets in the following format:
{'set-1': [<POST>, sentiment],
    'set-2': [<POST>, sentiment], ...,
    'set-20': [<POST>, sentiment]}

Here are some examples:
{'set-1': ("I recently purchased the Galaxy Explorer drone and am absolutely thrilled with its performance. The drone's battery life is impressive, allowing for extended flight times, and the camera quality is outstanding, capturing crisp and clear images from great heights. The intuitive controls made it easy for me as a beginner to navigate, and I've been able to capture some truly breathtaking footage. Highly recommend to anyone looking for a reliable and high-quality drone.", 'positive'),
    '

In [12]:
print(dataset_response.choices[0].message.content)

{"set-1":["Just tried the new caramel macchiato from JavaBeans Cafe, and it's a game changer! Perfectly balanced and not too sweet. #CoffeeLover","positive"],"set-2":["Can't believe I wasted two hours on 'The Endless Night'. Worst movie ever with a plot that makes zero sense and terrible acting.","negative"],"set-3":["Finally joined the gym and had my first workout today. Feeling motivated and ready to get in shape!","positive"],"set-4":["My phone's latest update has made it slower than ever. Really regretting this update.","negative"],"set-5":["The book club's selection this month was a real page-turner. Couldn't put it down until I finished it!","positive"],"set-6":["Tried the new sushi place downtown and was not impressed. The fish didn't taste fresh and the rolls were poorly made.","negative"],"set-7":["Landed my dream job today! So grateful for this opportunity and excited to start.","positive"],"set-8":["Our vacation was ruined by the constant rain. Didn't get to do half the thin

In [15]:
def generate_synthetic_dataset():
    dataset_response = client.chat.completions.create(
        model='gpt-4-turbo-preview',
        temperature=0.1,
        messages=[{"role": "system", "content": dataset_gen_system_prompt},
                  {"role": "user", "content": "JSON Response:"}],
    ).choices[0].message.content
    # The model sometimes wraps its answer in a ```json ... ``` fence.
    # Strip only the fence itself, so posts that mention "json" or contain
    # backticks are left untouched (str.removeprefix needs Python 3.9+).
    cleaned = dataset_response.strip()
    if cleaned.startswith('```'):
        cleaned = cleaned.removeprefix('```json').removeprefix('```')
        cleaned = cleaned.removesuffix('```')
    parsed_response = json.loads(cleaned.strip())

    return parsed_response
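
Because the generator's output is free-form text, it can help to check the parsed structure before building a DataFrame from it. The sketch below assumes the `{'set-N': [post, sentiment]}` shape requested in the prompt; `validate_synthetic_dataset` is a hypothetical helper name, and the sample data is made up for the demonstration.

```python
def validate_synthetic_dataset(parsed, allowed=("positive", "negative")):
    """Return a list of problems; an empty list means the data looks well-formed."""
    problems = []
    for key, value in parsed.items():
        if not isinstance(value, (list, tuple)) or len(value) != 2:
            problems.append(f"{key}: expected [post, sentiment], got {value!r}")
            continue
        post, sentiment = value
        if not isinstance(post, str) or not post.strip():
            problems.append(f"{key}: post is empty or not a string")
        if not isinstance(sentiment, str) or sentiment.strip().lower() not in allowed:
            problems.append(f"{key}: unexpected sentiment {sentiment!r}")
    return problems

# Made-up sample with two malformed entries
sample = {
    "set-1": ["Loved the new cafe downtown!", "positive"],
    "set-2": ["Worst update ever.", "neutral"],
    "set-3": ["Missing its label"],
}
sample_problems = validate_synthetic_dataset(sample)
for problem in sample_problems:
    print(problem)
```

Calling `validate_synthetic_dataset(parsed_response)` right after generation would surface malformed sets early, before they reach the judge or the fine-tuning files.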

In [16]:
parsed_response = generate_synthetic_dataset()

{'set-1': ["Just finished reading 'The Light in the Forest' and it was a journey full of emotions. The storytelling was captivating and the characters felt so real. Definitely a must-read for anyone who loves historical fiction.", 'positive'], 'set-2': ['Tried the new vegan burger at Sunshine Café and it was a letdown. The patty was dry and lacked flavor, and the bun was stale. Was really hoping for a better experience.', 'negative'], 'set-3': ["Finally got to visit the Grand Canyon and it was absolutely breathtaking. Pictures don't do it justice. The vastness and beauty of it all is something everyone should experience.", 'positive'], 'set-4': ["My phone's latest update has made it slower than ever. Apps take forever to open and it keeps freezing. Really regretting this update.", 'negative'], 'set-5': ["The concert last night was incredible! The band's performance was electrifying and the crowd's energy was through the roof. Best live show I've been to in years.", 'positive'], 'set-6'

In [50]:
# Convert the dictionary to a DataFrame
df = pd.DataFrame.from_dict(parsed_response, orient='index', columns=['reference', 'output'])
# Add two deliberately mislabeled examples to see how the judge LLM handles them
df.loc['set-21'] = [positive_normal, 'negative']
df.loc['set-22'] = [negative_normal, 'positive']
df.reset_index(drop=True, inplace=True)
df['input'] = SYSTEM_PROMPT


## evaluate dataset for fine-tuning

We threw two mislabeled examples into the DataFrame above to make sure our eval is working well.

In [31]:
print(templates.QA_PROMPT_TEMPLATE)


You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.



### As the output below shows, the judge LLM correctly flags the two mislabeled training examples

In [51]:
model = OpenAIModel(
    model="gpt-4",
    temperature=0.0,
)
rails = list(templates.QA_PROMPT_RAILS_MAP.values())
Q_and_A_classifications = llm_classify(
    dataframe=df,
    template=templates.QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=10,
)["label"].tolist()

In [64]:
Q_and_A_classifications

['correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'correct',
 'incorrect',
 'incorrect']
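
The judge labels can also be used to drop rejected examples before fine-tuning. The sketch below assumes the labels come back in the same order as the rows that were judged; `keep_judged_correct` is a hypothetical helper, and the three examples are shortened stand-ins rather than real notebook output.

```python
def keep_judged_correct(examples, judge_labels):
    """Keep only the examples whose judge label is 'correct'."""
    if len(examples) != len(judge_labels):
        raise ValueError("examples and judge_labels must have the same length")
    return [example for example, label in zip(examples, judge_labels) if label == "correct"]

examples = [
    ("Landed my dream job today!", "positive"),
    ("The sushi was not fresh.", "negative"),
    ("Absolutely thrilled with this drone.", "negative"),  # deliberately mislabeled
]
judge_labels = ["correct", "correct", "incorrect"]

clean_examples = keep_judged_correct(examples, judge_labels)
print(len(clean_examples))
```

In this notebook the two injected rows never reach the fine-tuning files, because `finetune_df` is rebuilt from `parsed_response`. A filter like this matters when the judge rejects one of the generated examples.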

### create dataset for finetuning

In [68]:
finetune_df = pd.DataFrame.from_dict(parsed_response, orient='index', columns=['post', 'sentiment'])

In [69]:
def create_finetune_files(finetune_df):
    '''create both training and validation jsonl files for fine-tuning the model'''
    # Prepare to write to a .jsonl file
    formatted_examples = []
    for index, row in finetune_df.iterrows():
        # Structure the data as needed for fine-tuning
        formatted_example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.post},
                {"role": "assistant", "content": row.sentiment}
            ]
        }
        formatted_examples.append(formatted_example)
    # split into train (80%) and validation (20%) sets
    split_index = int(len(formatted_examples) * 0.8)
    train = formatted_examples[:split_index]
    validation = formatted_examples[split_index:]
    # write each to a jsonl file
    with open('fine_tune_train_data.jsonl', 'w') as file:
        for example in train:
            file.write(json.dumps(example) + '\n')
    
    with open('fine_tune_validation_data.jsonl', 'w') as file:
        for example in validation:
            file.write(json.dumps(example) + '\n')

In [70]:
create_finetune_files(finetune_df)
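
Before uploading, it can be worth checking that every line of the JSONL files has the shape written by `create_finetune_files`. The sketch below checks only the system/user/assistant layout used in this notebook; the chat fine-tuning format allows other layouts, so treat `EXPECTED_ROLES` and the helper name `check_chat_jsonl_lines` as assumptions specific to this example.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def check_chat_jsonl_lines(lines):
    """Return a list of problems found in an iterable of JSONL lines."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {line_number}: invalid JSON ({error.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append(f"line {line_number}: missing 'messages' list")
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != EXPECTED_ROLES:
            problems.append(f"line {line_number}: roles are {roles}, expected {EXPECTED_ROLES}")
        for message in messages:
            if not isinstance(message, dict) or not isinstance(message.get("content"), str):
                problems.append(f"line {line_number}: every message needs string 'content'")
                break
    return problems

# In-memory demonstration: one valid line, one with missing roles, one that is not JSON
good_line = json.dumps({"messages": [
    {"role": "system", "content": "Classify the post."},
    {"role": "user", "content": "Great day!"},
    {"role": "assistant", "content": "positive"},
]})
bad_line = json.dumps({"messages": [{"role": "user", "content": "Great day!"}]})
demo_problems = check_chat_jsonl_lines([good_line, bad_line, "not json"])
print(demo_problems)
```

Since a file object iterates over its lines, the same function can be pointed at the generated files, for example `check_chat_jsonl_lines(open('fine_tune_train_data.jsonl'))`.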

# Finetune the model

In [None]:
# Upload the training file
with open('fine_tune_train_data.jsonl', "rb") as training_file:
    training_file_response = client.files.create(
        file=training_file,
        purpose='fine-tune'
    )
training_file_id = training_file_response.id

In [None]:
# Upload the validation file
with open('fine_tune_validation_data.jsonl', "rb") as validation_file:
    validation_file_response = client.files.create(
        file=validation_file,
        purpose='fine-tune'
    )
validation_file_id = validation_file_response.id

In [None]:
fine_tuning_info = client.fine_tuning.jobs.create(
  training_file=training_file_id, 
  model="gpt-3.5-turbo", 
  suffix="sentiment-classifier",
  validation_file=validation_file_id
)

In [None]:
fine_tuning_info
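
The fine-tuning job runs asynchronously, so the object returned at creation time will not yet carry a finished model. A minimal polling loop is sketched below. `wait_for_job` is a hypothetical helper that takes the retrieval function as an argument, and the set of terminal statuses is an assumption to check against the current OpenAI documentation. A fake retriever stands in for the API so the sketch runs offline.

```python
import time
from types import SimpleNamespace

# Assumed terminal statuses for a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_job, job_id, poll_seconds=30, max_polls=120):
    """Call retrieve_job(job_id) repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Offline stand-in for the API: two 'running' replies, then success
_fake_replies = iter([
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="running", fine_tuned_model=None),
    SimpleNamespace(status="succeeded", fine_tuned_model="ft:example-model"),
])
finished = wait_for_job(lambda job_id: next(_fake_replies), "ftjob-example", poll_seconds=0)
print(finished.status, finished.fine_tuned_model)
```

Against the real API the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, fine_tuning_info.id)`, and the returned job's `fine_tuned_model` attribute holds the model name to use for inference.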

In [None]:
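# The fine-tuned model name is assigned by OpenAI once the job succeeds
# (it is None until then), so read it from the job instead of hard-coding it
ft_model = client.fine_tuning.jobs.retrieve(fine_tuning_info.id).fine_tuned_model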

# evaluate the new model

In [None]:
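df = pd.DataFrame(original_dataset, columns=['text', 'label'])
# Pass the fine-tuned model explicitly; otherwise classify_sentiment falls back to base gpt-3.5-turbo
df['predicted_label'] = df['text'].apply(lambda text: classify_sentiment(text, model=ft_model)).apply(lambda x: x.strip().lower())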
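
To compare the fine-tuned model with the base model, the same accuracy calculation used earlier can be repeated on the new predictions. The sketch below is a plain-Python version of that calculation; `accuracy` is a hypothetical helper, and the only predictions shown are the base-model ones from the table earlier in the notebook, since no fine-tuned results are recorded here.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that exactly match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(true == predicted for true, predicted in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

true_labels = ["positive", "negative", "positive", "negative"]
# Base gpt-3.5-turbo predictions, copied from the results table earlier in the notebook
base_predictions = ["positive", "negative", "positive", "positive"]

base_accuracy = accuracy(true_labels, base_predictions)
print(base_accuracy)  # 0.75, matching df['correct'].mean() for the base model
```

For the fine-tuned model, `accuracy(df['label'].tolist(), df['predicted_label'].tolist())` gives the comparable number. With only four examples each prediction moves the score by 25 points, so a larger held-out set would be needed for a meaningful comparison.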