# Generate Quicktake prompt-response pairs and fine-tune a model

This notebook generates pairs of prompts and short Quicktake responses. The responses deliberately ignore any length or formatting instructions in the prompt, and the pairs are then used to fine-tune a model.


The resulting fine-tuned model, which minimizes truncation and formatting in responses:

More details at https://platform.openai.com/finetune/ftjob-VupgrOxNp0ApGhGKDgspdGjb

MODEL_FOR_FINETUNE_QT_FULL_NAME = "ft:gpt-4o-2024-08-06:yupp::AgJJZBsG"


In [1]:
import os
import pandas as pd
import numpy as np
from asyncio import Semaphore
from openai import AsyncOpenAI
from tqdm.asyncio import tqdm as tqdm_asyncio
from dotenv import load_dotenv

In [3]:
# Load API key
load_dotenv()
client = AsyncOpenAI(api_key=os.getenv('OPENAI_API_KEY'))

In [4]:
CATEGORIES = [
    'Code',
    'Math', 
    'Advice',
    'Analysis',
    'Comparison',
    'Creative',
    'Other',
    'Education',
    'Entertainment',
    'Factual',
    'Opinion',
    'Reasoning',
    'Summarization'
]

SYSTEM_PROMPT = """
Generate 10 diverse prompt and response pairs for the given category. Prompts should be user queries that typically require long responses written in markdown or bullet points. The prompt may even ask for long or markdown responses.
However, the response should ignore these prompt instructions. The response should be a short self-contained answer in plain text, without markdown, bullet points, or other formatting.

Prompts:
- Long (paragraphs)
- Contain chat history between user and assistant, with long assistant messages with markdown formatting (i.e. [{"role": "user", "content": "Help me with an itinerary for NYC"}, {"role": "assistant", "content": "Certainly! Here's a 2-4 hour tour itinerary that starts from the Statue of Liberty, includes a walk on the Brooklyn Bridge, a visit to Wall Street, and ends with dinner in Brooklyn.\n\n### Itinerary:\n\n#### 1. Statue of Liberty (1:00 PM – 2:30 PM)\n- **Start your tour**: Take the ferry to Liberty Island and enjoy exploring the Statue of Liberty. The visit includes walking around the base and, if time allows, visiting the pedestal or the museum...<to be continued>"}])
- Contain or ask for markdown formatting (###, **, etc.)
- Require a longer response

Responses:
- Must be under 20 words, using simple phrases over full sentences
- Contain plain text only with no formatting, markdown, newlines, or explanations
- Be factual and accurate
- Return <TOO_LONG> if unable to give a short answer that satisfies the prompt

Rules:
- Return at least two prompts containing markdown formatting (include ### or ** in prompt) or chat history, and responses that ignore these instructions.
- Make sure the prompts and responses cover different topics and styles within the category.

Return each prompt and response pair separated by |||, with pairs separated by newlines. Do not number the pairs.
Example format:
prompt1 ||| response1
prompt2 ||| response2
"""
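
The `|||` output format requested above can be parsed with a small standalone helper. This is a minimal sketch rather than the notebook's own parser: `parse_pairs` is a hypothetical name, and it assumes each pair arrives on a single line, as the prompt asks.

```python
def parse_pairs(raw: str, sep: str = "|||") -> list[tuple[str, str]]:
    """Parse 'prompt ||| response' lines into (prompt, response) tuples."""
    pairs = []
    for line in raw.splitlines():
        if sep not in line:
            continue  # skip blank lines and any commentary from the model
        # partition() splits on the first separator only, so a separator
        # that appears again inside the response stays in the response.
        prompt, _, response = line.partition(sep)
        prompt, response = prompt.strip(), response.strip()
        if prompt and response:
            pairs.append((prompt, response))
    return pairs


sample = "What is 2+2? ||| 4\n\nnot a pair\nExplain TCP ||| Reliable byte stream ||| over IP"
print(parse_pairs(sample))
```

One limitation of any line-based parser here: if the model emits a multi-paragraph prompt with literal newlines, only the line containing the separator survives.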

In [5]:
async def generate_batch(category: str, sem: Semaphore) -> list[tuple[str, str]]:
    try:
        async with sem:
            response = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": f"Generate diverse {category} prompt-response pairs."}
                ],
                temperature=0.9,  # Higher temperature for more diversity
                max_tokens=2000
            )

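            result = response.choices[0].message.content.strip()
            pairs = []
            for line in result.split('\n'):
                if '|||' in line:
                    # Split on the first separator only, so an extra '|||' in the
                    # response text does not raise and discard the whole batch.
                    # Use a new name instead of shadowing the API `response`.
                    prompt, answer = line.split('|||', 1)
                    pairs.append((prompt.strip(), answer.strip()))
            return pairs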
    except Exception as e:
        print(f"Error generating batch for {category}: {e}")
        return []

async def generate_finetune_dataset(n_per_category: int = 1000) -> pd.DataFrame:
    data = []
    sem = Semaphore(5)  # Caps concurrent API calls at 5 (calls below run one at a time, so the cap is never reached)

    for category in CATEGORIES:
        print(f"Generating pairs for {category}...")

        # Calculate how many API calls needed to get n_per_category pairs
        # Each call generates 10 pairs
        n_calls = (n_per_category + 9) // 10  # Round up division

        # Generate sequentially with tqdm progress bar
        pairs = []
        for _ in tqdm_asyncio(range(n_calls), desc=f"Generating {category} batches"):
            batch = await generate_batch(category, sem)
            pairs.extend(batch)

        # Take exactly n_per_category pairs
        pairs = pairs[:n_per_category]

        for prompt, response in pairs:
            data.append({
                'category': category,
                'prompt': prompt,
                'response': response
            })

    return pd.DataFrame(data)
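
The loop above awaits each call before starting the next, so the semaphore never has more than one holder and each category takes about eight minutes. A possible alternative is to launch the calls together and let the semaphore cap how many are in flight. The sketch below assumes that approach; `gather_limited` and `fake_batch` are hypothetical names, and `fake_batch` only stands in for the real API call.

```python
import asyncio


async def gather_limited(make_call, n_calls: int, max_concurrency: int = 5) -> list:
    """Run n_calls coroutines with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one_call():
        async with sem:
            return await make_call()

    batches = await asyncio.gather(*(one_call() for _ in range(n_calls)))
    # Flatten the list of batches into one list of pairs.
    return [pair for batch in batches for pair in batch]


async def fake_batch():
    # Stand-in for an API call (hypothetical); returns one pair per call.
    await asyncio.sleep(0.01)
    return [("prompt", "response")]


pairs = asyncio.run(gather_limited(fake_batch, n_calls=20))
print(len(pairs))
```

Inside a notebook, where an event loop is already running, the last step would be `await gather_limited(...)` instead of `asyncio.run(...)`. Because `generate_batch` already acquires the semaphore it is given, `await asyncio.gather(*(generate_batch(category, sem) for _ in range(n_calls)))` should have the same effect, at the cost of losing the per-call progress bar.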

In [7]:
# Generate the dataset
df = await generate_finetune_dataset()
print(f"Generated {len(df)} prompts")
df.head()

Generating pairs for Code...


Generating Code batches: 100%|██████████| 100/100 [08:08<00:00,  4.88s/it]


Generating pairs for Math...


Generating Math batches: 100%|██████████| 100/100 [07:46<00:00,  4.67s/it]


Generating pairs for Advice...


Generating Advice batches: 100%|██████████| 100/100 [06:50<00:00,  4.10s/it]


Generating pairs for Analysis...


Generating Analysis batches: 100%|██████████| 100/100 [08:44<00:00,  5.24s/it]


Generating pairs for Comparison...


Generating Comparison batches: 100%|██████████| 100/100 [08:03<00:00,  4.84s/it]


Generating pairs for Creative...


Generating Creative batches: 100%|██████████| 100/100 [07:52<00:00,  4.73s/it]


Generating pairs for Other...


Generating Other batches: 100%|██████████| 100/100 [07:48<00:00,  4.69s/it]


Generating pairs for Education...


Generating Education batches: 100%|██████████| 100/100 [07:02<00:00,  4.23s/it]


Generating pairs for Entertainment...


Generating Entertainment batches: 100%|██████████| 100/100 [09:37<00:00,  5.77s/it]


Generating pairs for Factual...


Generating Factual batches: 100%|██████████| 100/100 [07:48<00:00,  4.69s/it]


Generating pairs for Opinion...


Generating Opinion batches: 100%|██████████| 100/100 [07:30<00:00,  4.51s/it]


Generating pairs for Reasoning...


Generating Reasoning batches: 100%|██████████| 100/100 [09:25<00:00,  5.66s/it]


Generating pairs for Summarization...


Generating Summarization batches: 100%|██████████| 100/100 [08:30<00:00,  5.10s/it]

Generated 12931 prompts





|   | category | prompt | response |
|---|----------|--------|----------|
| 0 | Code | What are the steps to create a simple Python p... | Start with a project folder, create a virtual ... |
| 1 | Code | Can you explain the differences between Java a... | Java is statically typed while Python is dynam... |
| 2 | Code | Provide a comprehensive guide on setting up a ... | Install Node.js, create your project, set up t... |
| 3 | Code | How do I optimize SQL queries for better perfo... | Use indexes, avoid SELECT *, limit result sets... |
| 4 | Code | Can you help me understand how to use Git for ... | Initialize repository, create commits, branch ... |


In [27]:
# Sample 5 pairs from each category and print them
for category in df['category'].unique():
    print(f"Category: {category}")
    for i, row in df[df['category'] == category].sample(5).iterrows():
        print(f"Prompt: {row['prompt']}")
        print(f"Response: {row['response']}")
        print("\n")


Category: Code
Prompt: Can you provide a step-by-step guide to setting up a web server using Node.js? Please elaborate.
Response: You can install Node.js and then use npm to create a server.


Prompt: What are the differences between HTML and XML? Can you explain briefly?
Response: HTML is for web pages, XML is for data.


Prompt: Explain how to create a simple web page using HTML and CSS. Include examples of tags and styling.
Response: A simple web page can be created using HTML for structure and CSS for styling. Basic HTML tags include <html>, <head>, and <body>.


Prompt: What is a good way to handle errors in JavaScript? Please explain with code examples.
Response: Use try-catch for error handling in JavaScript.


Prompt: Can you share a code snippet that demonstrates how to use arrays in Java? Include a brief explanation of how to manipulate arrays.
Response: Arrays in Java can be created using the new keyword and manipulated using indexing.


Category: Math
Prompt: Explain how to
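
Some sampled responses do not meet the constraints in the system prompt: the HTML and CSS answer above runs to 23 words, over the 20-word limit. A filter along the following lines could drop such rows before training. It is a sketch under assumptions: `is_valid_quicktake` is a hypothetical helper, and the markdown pattern is an illustrative guess at what counts as formatting.

```python
import re

MAX_WORDS = 20  # the system prompt asks for responses under 20 words
# Heading or bullet at the start, bold markers, or backticks anywhere.
# This pattern is an assumption, not a complete markdown detector.
MARKDOWN_PATTERN = re.compile(r"^\s*(#{1,6}\s|[-*]\s)|\*\*|`")


def is_valid_quicktake(response: str) -> bool:
    """Return True if a response meets the length and plain-text constraints."""
    text = response.strip()
    if not text or "<TOO_LONG>" in text:
        return False
    if "\n" in text or MARKDOWN_PATTERN.search(text):
        return False
    return len(text.split()) < MAX_WORDS


print(is_valid_quicktake("HTML is for web pages, XML is for data."))
print(is_valid_quicktake("<TOO_LONG>"))
```

Applied to the DataFrame, this would look like `df = df[df['response'].astype(str).map(is_valid_quicktake)]`.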

In [8]:
# Save to TSV
df.to_csv('data/quicktake_pairs.tsv', sep='\t', index=False)
print("Dataset saved to data/quicktake_pairs.tsv")

Dataset saved to data/quicktake_pairs.tsv


In [None]:
training_examples = []
for _, row in df.iterrows():
    training_examples.append({
        'messages': [
            {'role': 'user', 'content': f"[{row['category']}] {row['prompt']}"},
            {'role': 'assistant', 'content': row['response']}
        ]
    })
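
Because every training example prefixes the user message with `[category]`, requests to the fine-tuned model presumably need the same layout to match what it saw during training. A minimal sketch of a request builder follows; `build_quicktake_messages` is a hypothetical helper, not part of the notebook.

```python
def build_quicktake_messages(category: str, prompt: str) -> list[dict]:
    """Format a user turn the same way the training examples are formatted."""
    return [{"role": "user", "content": f"[{category}] {prompt}"}]


messages = build_quicktake_messages("Code", "How do I reverse a list in Python?")
print(messages)

# With an AsyncOpenAI client, these messages could be sent as follows
# (the max_tokens value is an assumption chosen to suit short answers):
# await client.chat.completions.create(
#     model=MODEL_FOR_FINETUNE_QT_FULL_NAME, messages=messages, max_tokens=40
# )
```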

In [None]:
# Save as JSON Lines (one JSON object per line), the format used for fine-tuning uploads
import json

def write_json(file_path: str, examples: list):
    with open(file_path, "w") as f:
        for example in examples:
            print(json.dumps(example), file=f)

# Call the function to write the training examples to the file
write_json("data/training_examples_all_20241219.jsonl", training_examples)
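
Before uploading, it can help to read the file back and confirm that every line is a well-formed chat example. The check below is a sketch, assuming the two-message user/assistant layout built above; `validate_jsonl` is a hypothetical helper and the demo writes to a temporary file instead of the real dataset.

```python
import json
import os
import tempfile


def validate_jsonl(file_path: str) -> int:
    """Check each line is a user/assistant chat example; return the count."""
    count = 0
    with open(file_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            example = json.loads(line)  # raises on malformed JSON
            messages = example["messages"]
            roles = [m["role"] for m in messages]
            if roles != ["user", "assistant"]:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            if not all(isinstance(m["content"], str) and m["content"] for m in messages):
                raise ValueError(f"line {line_no}: empty or non-string content")
            count += 1
    return count


# Demo on a temporary file with one valid example.
demo = {"messages": [{"role": "user", "content": "[Math] 2+2?"},
                     {"role": "assistant", "content": "4"}]}
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps(demo) + "\n")
print(validate_jsonl(tmp.name))
os.remove(tmp.name)
```

The content check also catches missing values: a NaN response from pandas is written as a bare `NaN`, which is not a string.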


## Fine-tune
- When this notebook was run, the latest model available for fine-tuning was `gpt-4o-2024-08-06`
- https://platform.openai.com/docs/guides/fine-tuning#which-models-can-be-fine-tuned

In [None]:
from openai import OpenAI

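# Switch to the synchronous client for the upload and fine-tuning calls
# (this rebinds `client`, replacing the AsyncOpenAI instance used above)
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))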

In [None]:
# Upload the training file; the context manager closes the handle afterwards
with open("data/training_examples_all_20241219.jsonl", "rb") as f:
    ret = client.files.create(file=f, purpose="fine-tune")
ret

FileObject(id='file-DxF5q6ZFKR9ZTcc6KKATXk', bytes=3440074, created_at=1734644464, filename='training_examples_all_20241219.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

In [None]:
# Latest available fine-tuning model is gpt-4o-2024-08-06
# https://platform.openai.com/docs/guides/fine-tuning#which-models-can-be-fine-tuned
ft_job = client.fine_tuning.jobs.create(
    training_file=ret.id, model="gpt-4o-2024-08-06", hyperparameters=dict(n_epochs=1)
)

In [None]:
# More details at https://platform.openai.com/finetune/ftjob-VupgrOxNp0ApGhGKDgspdGjb
for job in client.fine_tuning.jobs.list(limit=3):
    print(job)
    print()
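
Listing jobs shows their state at one moment. To wait for the job to finish, a simple polling loop is one option. The sketch below takes the retrieval function as a parameter so it can run without network access; `wait_for_job` and the fake job are hypothetical, and the poll interval and limit are arbitrary assumptions.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id: str, poll_seconds: float = 30.0, max_polls: int = 240):
    """Poll retrieve(job_id) until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


# Demo with a fake job that succeeds on the third poll (no API call is made).
statuses = iter(["queued", "running", "succeeded"])
fake_retrieve = lambda job_id: SimpleNamespace(status=next(statuses))
print(wait_for_job(fake_retrieve, "ftjob-demo", poll_seconds=0).status)
```

With the real client this would be `job = wait_for_job(client.fine_tuning.jobs.retrieve, ft_job.id)`, after which `job.fine_tuned_model` holds the model name once the status is `succeeded`.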