# First Steps in Fine-Tuning with OpenAI

#### Updates

* 20231205 modified to use try4 data and retune an existing fine-tuned model
* 20231125 modified to use try3 data
* 20231119 updated to provide a training and test jsonl file
* 20231116 updated to use finetuningYYYYMMDD.jsonl
* 20231109 updated to use new 1.x api
* 20231109 uses the BAS dataset

In [1]:
# Import the os package
import os


# Imports via openai docs
from pathlib import Path
from openai import OpenAI


# import the dotenv package
from dotenv import load_dotenv

import pprint

# From the IPython.display package, import display and Markdown
from IPython.display import display, Markdown



In [2]:
# Get the current working directory
cwd = os.getcwd()
# Construct the .env file path
env_path = os.path.join(cwd, '.env')

# Load the .env file
load_dotenv(dotenv_path=env_path)

True

In [3]:
# Read the API key from the OPENAI_API_KEY environment variable
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
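Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, which is easy to misread if the `.env` file was simply not found. A minimal sketch of a friendlier check follows. The helper name `require_env` is made up for illustration and is not part of `openai` or `dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for this notebook; not part of any library.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file exists and was loaded."
        )
    return value


# Usage in this notebook would look like:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

The helper never prints the key itself, so it is safe to leave in a notebook that gets committed with its outputs.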


# Specify the key and init the client

In [4]:
# Pass the key explicitly when constructing the client
client = OpenAI(api_key=OPENAI_API_KEY)

# Determine the OpenAI API version

In [5]:
# Suggested by ChatGPT when asked how to query the API version.
# It does not work with the 1.x client, presumably because the model
# was trained before the API changed.

#import openai
#openai.api_key=OPENAI_API_KEY
# To get the API version, you would typically make an API call
# and the version would be included in the response headers.
# For example, you could list the available engines and check the headers:
#response = openai.Engine.list()

# The API version would be in the response headers if available
#api_version = response.headers.get('OpenAI-Api-Version')

#print(api_version)
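The commented-out approach targets the pre-1.x interface. What can be checked locally is the version of the installed `openai` Python package, which is enough to tell whether the 1.x client interface used in this notebook is available. Note that this is the client library version, not a server-side API version. A minimal sketch using only the standard library; both helper names are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "openai"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_1x_or_later(version_string) -> bool:
    """True if the major version number is at least 1 (the 1.x client interface)."""
    if not version_string:
        return False
    major = version_string.split(".")[0]
    return major.isdigit() and int(major) >= 1


print("openai package version:", installed_version())
```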

# Sanity check
Verify that the API key and the network allow use of the OpenAI API.

In [6]:
# Define the system message
system_msg = 'You are a helpful assistant who understands data science.'

# Define the user message
user_msg = 'Create a small dataset of data about people. The format of the dataset should be a data frame with 5 rows and 3 columns. The columns should be called "name", "height_cm", and "eye_color". The "name" column should contain randomly chosen first names. The "height_cm" column should contain randomly chosen heights, given in centimeters. The "eye_color" column should contain randomly chosen eye colors, taken from a choice of "brown", "blue", and "green". Provide Python code to generate the dataset, then provide the output in the format of a markdown table.'



# Create a dataset using GPT
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
)

In [7]:
response.choices[0].finish_reason
#response["choices"] 

'stop'

In [8]:
response.choices[0].message.content

'Sure, here\'s a Python code to generate the dataset and the resulting output in the format of a markdown table:\n\n```python\nimport pandas as pd\nimport random\n\n# Generate random data\ndata = {\n    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],\n    "height_cm": [random.randint(150, 190) for _ in range(5)],\n    "eye_color": [random.choice(["brown", "blue", "green"]) for _ in range(5)]\n}\n\n# Create a DataFrame\ndf = pd.DataFrame(data)\n\n# Output the DataFrame\nprint(df)\n```\n\nThis code will produce a DataFrame with 5 rows and 3 columns. Here\'s the output in the format of a markdown table:\n\n|    | name    | height_cm | eye_color |\n|---:|:--------|----------:|:----------|\n|  0 | Alice   |       169 | blue      |\n|  1 | Bob     |       158 | brown     |\n|  2 | Charlie |       182 | green     |\n|  3 | David   |       176 | green     |\n|  4 | Eve     |       167 | blue      |'
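The reply comes back as one markdown string, which is hard to read in its escaped form. One way to get at the generated code is to pull the fenced blocks out with a regular expression. This is a rough sketch that assumes the model uses triple-backtick fences as it did above; `extract_code_blocks` is a hypothetical helper.

```python
import re

# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_code_blocks(markdown_text: str):
    """Return a list of (language, code) tuples, one per fenced block in the text."""
    pattern = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)
    return [(lang, code.strip()) for lang, code in pattern.findall(markdown_text)]


# Small stand-in for response.choices[0].message.content
sample = "Intro\n\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\n\nDone."
for lang, code in extract_code_blocks(sample):
    print(lang, "->", code)
```

For simply reading the reply, `display(Markdown(response.choices[0].message.content))` with the names imported in the first cell renders it in the notebook.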

# Upload a file for model tuning

### Setup dirs

In [9]:
import pathlib
dirpath = os.getcwd()
print("current directory is : " + dirpath)
# Use pathlib to take the parent of the current directory as the git repo root
# (assumes the notebook runs from <repo>/notebooks)
root_path = pathlib.PurePath(dirpath).parents[0]
data_path = root_path / 'data'
train_path = data_path / 'try4' / 'train'
test_path = data_path / 'try4' / 'test'
print("root directory is: ", root_path)
print("data directory is: ",  data_path)
print("train directory is: ",  train_path)
print("test directory is: ", test_path)
# Create equivalent dir names in the environment
# Data
DATA_DIR_NAME = data_path.as_posix()
print("DATA_DIR_NAME: ", DATA_DIR_NAME)
os.environ['DATA_DIR_NAME'] = DATA_DIR_NAME

current directory is : /workspaces/BALSA/notebooks
root directory is:  /workspaces/BALSA
data directory is:  /workspaces/BALSA/data
train directory is:  /workspaces/BALSA/data/try4/train
test directory is:  /workspaces/BALSA/data/try4/test
DATA_DIR_NAME:  /workspaces/BALSA/data


### Specify the JSONL file for model tuning

In [10]:
# This can be varied to point to different files.
TRAIN_FILE_NAME = 'train20231206.jsonl'
TEST_FILE_NAME = 'test20231206.jsonl'
print("TRAIN_FILE_NAME: ", TRAIN_FILE_NAME)
print("TEST_FILE_NAME: ", TEST_FILE_NAME)

TRAIN_FILE_NAME:  train20231206.jsonl
TEST_FILE_NAME:  test20231206.jsonl


In [11]:
# Build the fully qualified path names of the training and test files
TRAIN_FQPN = train_path / TRAIN_FILE_NAME
TEST_FQPN = test_path / TEST_FILE_NAME
print(TRAIN_FQPN)
print(TEST_FQPN)
TRAIN_FQPN

/workspaces/BALSA/data/try4/train/train20231206.jsonl
/workspaces/BALSA/data/try4/test/test20231206.jsonl


PurePosixPath('/workspaces/BALSA/data/try4/train/train20231206.jsonl')
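Before uploading, it can save a failed job to check the file locally. The sketch below assumes the file uses the chat fine-tuning layout, one JSON object per line with a `messages` list of `role`/`content` entries, and that only the `system`, `user` and `assistant` roles appear. Both are assumptions about the try4 data, and `check_chat_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumption: plain chat examples only (no function or tool messages).
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_chat_jsonl(path):
    """Return a list of human-readable problems found in a chat-format JSONL file."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                problems.append(f"line {line_no}: blank line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append(f"line {line_no}: missing or empty 'messages' list")
                continue
            for msg in messages:
                if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                    problems.append(f"line {line_no}: unexpected role in {msg!r}")
                elif not isinstance(msg.get("content"), str):
                    problems.append(f"line {line_no}: message content is not a string")
    return problems


# Usage in this notebook would look like:
# print(check_chat_jsonl(TRAIN_FQPN) or "no problems found")
```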

In [12]:
response = client.files.create(
    file=Path(TRAIN_FQPN),
    purpose="fine-tune",
)

print(response)

FileObject(id='file-50yEx8Cu8vcz9YF8uFTcfnXL', bytes=609017, created_at=1701822498, filename='train20231206.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


In [13]:
print(response.id)
train_file_id = response.id

file-50yEx8Cu8vcz9YF8uFTcfnXL


In [14]:

response = client.files.create(
    file=Path(TEST_FQPN),
    purpose="fine-tune",
)

print(response)

FileObject(id='file-OkfiU5inHSZRnBqsAyWIUZag', bytes=102865, created_at=1701822501, filename='test20231206.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)


In [15]:
print(response.id)
test_file_id = response.id

file-OkfiU5inHSZRnBqsAyWIUZag


# Fine-tune a model

In [16]:
# Base model recommended for fine-tuning; use this to start a fresh fine-tune
#model="gpt-3.5-turbo-1106"
# Retune our last fine-tuned model
model="ft:gpt-3.5-turbo-1106:personal::8SXoaFOx"

response = client.fine_tuning.jobs.create(
  training_file=train_file_id,
  validation_file=test_file_id, 
  model=model
)
print(response)

FineTuningJob(id='ftjob-LLnCULY592u3rMgXXN1dyRzH', created_at=1701822562, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='ft:gpt-3.5-turbo-1106:personal::8SXoaFOx', object='fine_tuning.job', organization_id='org-kHUq2JzdiW8FIDxqE01bYdot', result_files=[], status='validating_files', trained_tokens=None, training_file='file-50yEx8Cu8vcz9YF8uFTcfnXL', validation_file='file-OkfiU5inHSZRnBqsAyWIUZag')
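The job above is still in `validating_files`, and `fine_tuned_model` stays `None` until training finishes. A minimal polling sketch built on `client.fine_tuning.jobs.retrieve`, the same call shown commented out further down. The helper name, the poll interval and the poll limit are arbitrary choices for illustration.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status or polls run out.

    Returns the last job object retrieved, whatever its status.
    """
    job = client.fine_tuning.jobs.retrieve(job_id)
    for _ in range(max_polls):
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job_id)
    return job


# Usage in this notebook would look like:
# job = wait_for_job(client, response.id)
# print(job.status, job.fine_tuned_model)
```

The `sleep` parameter exists only so the loop can be exercised without real waiting.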


# Learning the fine-tuning jobs API

In [17]:
# List 10 fine-tuning jobs
#pprint.pprint(client.fine_tuning.jobs.list(limit=10))
result = client.fine_tuning.jobs.list(limit=10)
for a_job in result.data:
    # print the jobs raw
    #pprint.pprint(a_job)
    # print just the file for a job
    print(a_job.training_file)
    # simple test to see if our tune job is in top ten based upon fileid
    if train_file_id == a_job.training_file:
        print("yes")




# Retrieve the state of a fine-tune
#client.fine_tuning.jobs.retrieve("ftjob-abc123")

# Cancel a job
#client.fine_tuning.jobs.cancel("ftjob-abc123")

# List up to 10 events from a fine-tuning job
#client.fine_tuning.jobs.list_events(id="ftjob-abc123", limit=10)

# Delete a fine-tuned model (must be an owner of the org the model was created in)
#client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")

file-50yEx8Cu8vcz9YF8uFTcfnXL
yes
file-xgyECSI6Og4nQEwpifo7eZsF
file-6V1iarnGT8Ng5YEMUAc0FaAN
file-ho5MM6kcSaLtAwa6o36fbxDz
file-Fwux98ZJRrpbK4kN7JQSafeB
file-mOWBskmEo89j5l8yRFPFxnqe
file-HQqakeKHnHi4YFtdZJwGIEM6
file-S22pfJMZv7asuZNoMlrElq6T
file-RltDbgHjpQ9qANthDCvqJQkO
file-6n4dELlk1gyh7brCV8iulYMW
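The loop above only prints "yes" when the training file matches. A small variation that returns the matching jobs with their status and resulting model name is sketched here. It relies only on the `id`, `status`, `training_file` and `fine_tuned_model` fields visible in the `FineTuningJob` output earlier; the function name is made up.

```python
def jobs_for_training_file(jobs, file_id):
    """Return (job id, status, fine-tuned model name) for jobs trained on file_id."""
    return [
        (job.id, job.status, job.fine_tuned_model)
        for job in jobs
        if job.training_file == file_id
    ]


# Usage in this notebook would look like:
# for job_id, status, model_name in jobs_for_training_file(result.data, train_file_id):
#     print(job_id, status, model_name)
```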


# Let's try to use it

In [42]:
#our_mode = "normal"
our_mode = "bas"

# dependency analytics disable for syntax
# default model
default_model="gpt-3.5-turbo-0613"
# our tuned model
# Chelsea model #1
#our_model="ft:gpt-3.5-turbo-0613:personal::8DvbJsff"
# Chelsea model #2
#our_model="ft:gpt-3.5-turbo-0613:personal::8IV7laj9"
# bas model #2
#tuned_model="ft:gpt-3.5-turbo-0613:personal::8IV7laj9"
# model we trained using 20231116 data
#tuned_model="ft:gpt-3.5-turbo-0613:personal::8LXzZa1D"
# model trained using 20231119 data
tuned_model="ft:gpt-3.5-turbo-0613:personal::8MRBWlFr"



if our_mode == "normal":
    print("normal")
    # stock model
    our_model=default_model
    # Define the system message
    system_msg = 'you are a helpful assistant who understands IBM BAL (IBM Basic Assembler Language).'
    # Define the user message
    user_msg = 'Provide an example of how to add two numbers in IBM BAL assembly.'
else:
    print("tuned")
    # tuned model
    our_model=tuned_model
    # Define the system message
    system_msg = 'you are a helpful assistant who understands IBM BAL (IBM Basic Assembler Language).'
    # Define the user message
    #user_msg = 'Provide an example of how to add two numbers in IBM BAL assembly.'
    #user_msg = 'Provide an example of how to subtract two numbers in IBM BAL assembly.'
    user_msg = 'Provide an example of how to subtract two numbers in IBM BAL assembly.  Use markdown to denote actual code section.'






# Send the prompt to the selected model
response = client.chat.completions.create(
    model=our_model,
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg}
    ]
)

tuned


In [43]:
print("finish_reason: ", response.choices[0].finish_reason)
print("content: ", response.choices[0].message.content)

finish_reason:  stop
content:  
Here's an example of how to subtract two numbers in IBM BAL assembly:

```assembly
          SR   R1,R2 
```

This code will subtract the contents of register R2 from the contents of register R1, and store the result in R1.
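To see what the tuning actually changed, it helps to send the same prompt to the stock model and the tuned model and read the replies side by side, rather than flipping `our_mode` and rerunning. A minimal sketch that wraps the same `client.chat.completions.create` call used above; `ask` and `compare_models` are hypothetical helper names.

```python
def ask(client, model, system_msg, user_msg):
    """Send one system + user exchange to a model and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content


def compare_models(client, models, system_msg, user_msg):
    """Return a dict mapping each model name to its reply for the same prompt."""
    return {model: ask(client, model, system_msg, user_msg) for model in models}


# Usage in this notebook would look like:
# replies = compare_models(client, [default_model, tuned_model], system_msg, user_msg)
# for name, text in replies.items():
#     print(name, "\n", text, "\n")
```

A single reply per model is only anecdotal, since sampling varies from call to call.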
