Step 1: Initialize the parameters to create the dataset

**Temperature:**
Choose the sampling temperature to use when generating data. Values between 0 and 1 are typical (the OpenAI API accepts values up to 2). Lower values suit precise tasks, such as writing code, whereas higher values suit creative tasks, such as writing stories.
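To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax, the textbook way a sampling temperature reshapes a probability distribution. It is an illustration only, not a description of OpenAI's internals; the function name `softmax_with_temperature` and the example scores are assumptions made for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures most of the probability mass moves to the top-scoring candidate, which is why low values give more repeatable output.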


**Number of Samples:**

Choose how many examples you want to generate. The more you generate, (a) the longer generation takes and (b) the more it costs. In general, though, more examples lead to a higher-quality model.
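The cost trade-off can be estimated with simple arithmetic before generating anything. The sketch below is a rough back-of-the-envelope helper; the function name, the token counts, and especially the per-1,000-token prices are placeholders invented for illustration, not real OpenAI prices. Check the current pricing page and substitute your own numbers.

```python
def estimate_generation_cost(n_examples, avg_prompt_tokens, avg_completion_tokens,
                             usd_per_1k_prompt, usd_per_1k_completion):
    """Rough cost estimate: tokens per request times price, summed over all requests."""
    prompt_cost = n_examples * avg_prompt_tokens / 1000 * usd_per_1k_prompt
    completion_cost = n_examples * avg_completion_tokens / 1000 * usd_per_1k_completion
    return prompt_cost + completion_cost

# Placeholder inputs (assumptions, not measured values or real prices)
cost = estimate_generation_cost(
    n_examples=100,
    avg_prompt_tokens=1000,      # grows as previous examples are added to the context
    avg_completion_tokens=500,
    usd_per_1k_prompt=0.01,
    usd_per_1k_completion=0.02,
)
print(f"Estimated cost: ${cost:.2f}")
```

Because earlier examples are sent back as context, the average prompt size per request is usually larger than the system message alone.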

In [None]:
prompt = """A model that takes in a material of a railway component and its defect description, and outputs: 1. Detailed visual properties of the defect, suitable for creating a realistic texture in a 3D model
, 2. Recommendations for textures to simulate the defect accurately in a 3D modeling environment, 3. Necessary changes in parameters or attributes in the 3D model
to reflect the defect accurately."""
temperature = .5
number_of_examples = 100 #better to generate 100 samples

Step 2: Install OpenAI

In [None]:
!pip install openai==0.28



In [None]:

import openai

openai.api_key = "YOUR KEY" #Replace with your API Key

Step 3: Generate Examples using OpenAI

You can also download a sample dataset from this link: https://drive.google.com/file/d/1yQx0mW6C-YDwIWBqC2f3_8nbrGwKwcyS/view?usp=sharing

In [None]:
import os
import openai
import random
from tenacity import retry, stop_after_attempt, wait_exponential

openai.api_key = "YOUR KEY" #Replace with your API Key; never publish a real key in a notebook

N_RETRIES = 3  # stop_after_attempt counts total attempts, so 1 would mean no retries

@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 8:
            prev_examples = random.sample(prev_examples, 8)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=100,  # raise this if responses are cut off before the closing delimiter
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
...
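The generation cell relies on tenacity's `@retry` decorator to survive transient API errors. Note that `stop_after_attempt(n)` counts total attempts, so a value of 1 means the call is never retried. The sketch below shows the same retry-with-exponential-backoff idea using only the standard library. It is not tenacity's implementation; `call_with_backoff`, its parameters, and the `flaky` function are hypothetical names used for illustration.

```python
import time

def call_with_backoff(fn, max_attempts=3, base_delay=1.0, max_delay=70.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**(attempt-1) seconds (capped) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)

# Hypothetical flaky call that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, max_attempts=3, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in real use the default `time.sleep` applies the waits.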


This code cell defines a function, generate_system_message, that creates the system message the fine-tuned model will use at inference time. The function calls openai.ChatCompletion.create with GPT-4: a system message instructs GPT-4 to write a short, concise system prompt, and the high-level model description (prompt) is passed as the user message. The function returns the generated system message, and the cell prints it. Re-run the cell if you want a different result.

Key Points:
Function Purpose: To generate a system message for model inference.
Model Used: GPT-4, leveraging OpenAI's Chat Completion.
Input: User-provided prompt.
Output: Generated system message based on the user's prompt.
Usage: Helpful for dynamically creating system prompts for specific model inference tasks.

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

The system message is: `Given the material of a railway component and its defect description, provide detailed visual properties of the defect, recommend textures for accurately simulating the defect in a 3D modeling environment, and suggest necessary changes in parameters or attributes in the 3D model to accurately reflect the defect.`. Feel free to re-run this cell if you want a better result.
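The instruction sent to GPT-4 asks it not to wrap the system prompt in quotes, but a model may occasionally do so anyway. As a defensive measure, a small post-processing step can strip wrapping quotes or backticks before the message is written into the training file. `clean_system_message` is a hypothetical helper, not part of the original notebook.

```python
def clean_system_message(text):
    """Strip whitespace and any matching wrapping quotes or backticks."""
    text = text.strip()
    wrappers = ('"', "'", "`")
    changed = True
    while changed and len(text) >= 2:
        changed = False
        for ch in wrappers:
            if text.startswith(ch) and text.endswith(ch):
                text = text[1:-1].strip()  # peel one layer and check again
                changed = True
                break
    return text

print(clean_system_message('"`Given X, you will Y.`"'))
print(clean_system_message("Given X, you will Y."))
```

In the notebook this would be applied as `system_message = clean_system_message(system_message)` before building the training examples.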


Now let's put our examples into a dataframe and turn them into a final pair of datasets.

In [None]:
import json
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
for example in prev_examples:
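    try:
        split_example = example.split('-----------')
        # Extract both fields before appending so the two lists stay aligned
        prompt_text = split_example[1].strip()
        response_text = split_example[3].strip()
    except IndexError:
        # Skip examples that do not follow the expected delimiter format
        continue
    prompts.append(prompt_text)
    responses.append(response_text)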

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples.')

# Initialize list to store training examples
training_examples = []

# Create training examples in the format required for GPT-3.5 fine-tuning
for index, row in df.iterrows():
    training_example = {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": row['prompt']},
            {"role": "assistant", "content": row['response']}
        ]
    }
    training_examples.append(training_example)

# Save training examples to a .jsonl file
with open('training_examples.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

There are 98 successfully-generated examples.
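Before uploading, it can help to sanity-check the .jsonl file locally. The sketch below verifies that each line is valid JSON with system, user, and assistant messages in that order, and that there are enough examples. `validate_training_lines` is a hypothetical helper, and the minimum of 10 examples is an assumption based on OpenAI's fine-tuning guide at the time of writing; adjust it if the requirement differs.

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_training_lines(lines, min_examples=10):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {line_no}: invalid JSON ({err.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append(f"line {line_no}: missing 'messages' list")
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != REQUIRED_ROLES:
            problems.append(f"line {line_no}: unexpected roles {roles}")
        elif any(not str(m.get("content", "")).strip() for m in messages):
            problems.append(f"line {line_no}: empty content")
        count += 1
    if count < min_examples:
        problems.append(f"only {count} examples; at least {min_examples} expected")
    return problems

# Hypothetical record in the same shape the notebook writes
good = json.dumps({"messages": [
    {"role": "system", "content": "Given a material and defect, describe it."},
    {"role": "user", "content": "Steel rail with surface corrosion."},
    {"role": "assistant", "content": "Reddish-brown pitting along the rail head."},
]})
print(validate_training_lines([good] * 10))
print(validate_training_lines([good, "{not json"], min_examples=1))
```

To check the real file, pass it in directly: `validate_training_lines(open('training_examples.jsonl'))`.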


# Upload the file to OpenAI

In [None]:
# Open the file in a context manager so the handle is closed after the upload
with open("/content/training_examples.jsonl", "rb") as f:
    file_id = openai.File.create(file=f, purpose='fine-tune').id

# Train the model! You may need to wait a few minutes before running the next cell to allow for the file to process on OpenAI's servers.

In [None]:
job = openai.FineTuningJob.create(training_file=file_id, model="gpt-3.5-turbo")

job_id = job.id

# Now, just wait until the fine-tuning run is done, and you'll have a ready-to-use model!

Run the cell below every 20 minutes or so. Eventually, you'll see a message like "New fine-tuned model created: ft:gpt-3.5-turbo-0613:xxxxxxxxxxxx"

Once you see that message, you can go to the OpenAI Playground (or keep going to the next cells and use the API) to try the model!
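Instead of re-running the cell by hand, the wait can be automated with a small polling loop. The sketch below is one possible approach, not part of the original notebook: `wait_for_job` is a hypothetical helper that takes any function returning the job's status string, and the terminal statuses listed are the ones the fine-tuning API is assumed to report (`succeeded`, `failed`, `cancelled`).

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed terminal states

def wait_for_job(fetch_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal status or the budget runs out."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish within the polling budget")

# Offline demonstration with a scripted sequence of statuses
statuses = iter(["queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)

# With the notebook's client (openai==0.28), the call might look like:
# final_status = wait_for_job(lambda: openai.FineTuningJob.retrieve(job_id).status)
```

Taking the status fetcher as a parameter keeps the loop testable without network access.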

In [None]:
openai.FineTuningJob.list_events(id=job_id, limit=5)

<OpenAIObject list at 0x7b2046a58d10> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ith1WsjbWELhbCUmMIYG1dYj",
      "created_at": 1711101454,
      "level": "info",
      "message": "The job has successfully completed",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-xww6wu00pZWJrmY4Y0dlbgXv",
      "created_at": 1711101451,
      "level": "info",
      "message": "New fine-tuned model created: ft:gpt-3.5-turbo-0125:personal::95VwRTeM",
      "data": {},
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-Bo67QfOgxUcPTEchs30RKqFw",
      "created_at": 1711101430,
      "level": "info",
      "message": "Step 141/150: training loss=0.31",
      "data": {
        "step": 141,
        "train_loss": 0.30675598978996277,
        "total_steps": 150,
        "train_mean_token_accuracy": 0.8888888955116272
      },
  

# Once your model is trained, run the next cell to grab the fine-tuned model name.

In [None]:
model_name_pre_object = openai.FineTuningJob.retrieve(job_id)
model_name = model_name_pre_object.fine_tuned_model
print(model_name)

ft:gpt-3.5-turbo-0125:personal::95VwRTeM


# Let's try it out!

In [None]:
# Sample one prompt so the printed prompt matches what the model receives
sample_prompt = df['prompt'].sample().values[0]

response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": sample_prompt,
      }
    ],
)
print(sample_prompt)
response.choices[0].message['content']

Walking in a forest during autumn.


'The clinking of glasses and cutlery, the chatter of customers, the sizzle from the kitchen, the occasional laughter, and the muffled music playing in the background.'

In [None]:
# Let's test the model on a custom prompt
customized_prompt = 'defect texture' # @param {type:"string"}


response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": customized_prompt,
      }
    ],
)
response.choices[0].message['content']

'The sharp, rhythmic sound of a knife cutting through the vegetables, the occasional crunch as the knife hits the chopping board, and the soft thud as the chopped vegetables fall into a bowl.'

Reference: https://github.com/mshumer/gpt-llm-trainer