# Fine-Tuning a Model - ChatBot Example

In this project, we'll explore how to fine-tune a GPT model, such as a text-babbage model, with our own data set. Note that this may not be needed for more advanced text-davinci models or future GPT-4 models, but let's explore the process of creating our own custom fine-tuning data set, formatting it for OpenAI, and then training and calling our own custom model.

### Library Imports

In [1]:
import os
import json
import pandas as pd
import tiktoken
import openai

### Load the Q&A Data

In [2]:
data_frame = pd.read_csv("/Volumes/Data/Datasets/genai_datasets/python_qa.csv")
data_frame.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,ParentId,Answer
0,11060,912.0,2008-08-14T13:59:21Z,,18,How should I unit test a code-generator?,This is a difficult and open-ended question I ...,11060,I started writing up a summary of my experienc...
1,17250,394.0,2008-08-20T00:16:40Z,,24,Create an encrypted ZIP file in Python,I'm creating an ZIP file with ZipFile in Pytho...,17250,I created a simple library to create a passwor...
2,31340,242853.0,2008-08-27T23:44:47Z,,71,"How do threads work in Python, and what are co...",I've been trying to wrap my head around how th...,31340,"Yes, because of the Global Interpreter Lock (G..."
3,34020,3561.0,2008-08-29T05:43:16Z,,17,Are Python threads buggy?,A reliable coder friend told me that Python's ...,34020,Python threads are good for concurrent I/O pro...
4,34570,577.0,2008-08-29T16:10:41Z,2011-11-08T16:11:43Z,13,What is the best quick-read Python book out th...,I am taking a class that requires Python. We w...,34570,"I loved Dive Into Python, especially if you're..."


### Formatting for Fine Tuning

A fine-tuning data set consists of prompts paired with their expected completions. This makes fine-tuning a good fit for dialogue-style tasks, such as question answering or customer support.

The format should look like the following (a list of dictionaries, later written to the training file one dictionary per line):

    [{"prompt": "some prompt string","completion":"the best completed text option given the prompt"},]
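
As a minimal sketch of this format, the helper below builds one record and serializes it as a single line of JSON. The separator and stop sequence follow the conventions suggested in OpenAI's legacy fine-tuning guide (a fixed separator at the end of each prompt, a leading space and a fixed stop sequence in each completion). The names `make_record`, `PROMPT_SEPARATOR`, and `STOP_SEQUENCE` are hypothetical and not part of this notebook, which writes the raw question and answer text instead.

```python
import json

# Hypothetical constants, not used elsewhere in this notebook.
PROMPT_SEPARATOR = "\n\n###\n\n"  # marks where the prompt ends
STOP_SEQUENCE = " END"            # marks where the completion ends


def make_record(question, answer):
    """Build one prompt/completion dictionary for fine-tuning."""
    return {
        "prompt": question.strip() + PROMPT_SEPARATOR,
        # The leading space helps the completion tokenize cleanly.
        "completion": " " + answer.strip() + STOP_SEQUENCE,
    }


record = make_record("How do I reverse a list?", "Use my_list[::-1].")
line = json.dumps(record)  # one JSON object per line of the training file
print(line)
```

If a separator and stop sequence are used during training, the same separator would also need to end the prompt at inference time, with the stop sequence passed as `stop`.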
    

Convert the information from the CSV to the fine-tuning format:

In [3]:
questions, answers = data_frame["Body"], data_frame["Answer"]  # Unpack the question and answer columns

In [4]:
questions.head()

0    This is a difficult and open-ended question I ...
1    I'm creating an ZIP file with ZipFile in Pytho...
2    I've been trying to wrap my head around how th...
3    A reliable coder friend told me that Python's ...
4    I am taking a class that requires Python. We w...
Name: Body, dtype: object

In [5]:
answers.head()

0    I started writing up a summary of my experienc...
1    I created a simple library to create a passwor...
2    Yes, because of the Global Interpreter Lock (G...
3    Python threads are good for concurrent I/O pro...
4    I loved Dive Into Python, especially if you're...
Name: Answer, dtype: object

In [6]:
# Build the list of dictionaries in the prompt/completion format
qa_openai_format = [{"prompt": q, "completion": a} for q, a in zip(questions, answers)]
qa_openai_format[5]

{'prompt': "I am starting to use Python (specifically because of Django) and I would like to remove the burden for exhaustive testing by performing some static analysis.  What tools/parameters/etc. exist to detect issues at compile time that would otherwise show up during runtime? (type errors are probably the most obvious case of this, but undefined variables are another big one that could be avoided with an in-depth analysis of the AST.)\n\nObviously testing is important, and I don't imply that tests can be obviated entirely; however, there are many runtime errors in python that are not possible in other languages that perform stricter run-time checking -- I'm hoping that there are tools to bring at least some of these capabilities to python as well.\n",

In [7]:
# Check the number of prompt/completion pairs
len(qa_openai_format)

4429
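
Rows with a missing question or answer would show up in this list as non-string values (pandas represents missing text as `NaN`, a float), which the tokenizer cannot encode and which `json.dumps` writes as the non-standard token `NaN`. A minimal sketch of a filter, assuming such rows may exist in the data set; `is_valid_record` is a hypothetical helper and the sample records are made up for illustration:

```python
def is_valid_record(record):
    """Return True if both fields are non-empty strings."""
    return all(
        isinstance(record.get(key), str) and bool(record[key].strip())
        for key in ("prompt", "completion")
    )


# Made-up records standing in for entries of qa_openai_format.
sample_records = [
    {"prompt": "What is a list comprehension?", "completion": "A compact loop."},
    {"prompt": "Unanswered question", "completion": float("nan")},  # missing answer
    {"prompt": "   ", "completion": "Answer without a question"},   # blank prompt
]

clean_records = [r for r in sample_records if is_valid_record(r)]
print(len(clean_records))
```

Applied to the real data, the same comprehension could be run over `qa_openai_format` before any records are written to disk.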

### Price Estimation

If you want to know how many tokens your text contains (to estimate your costs), OpenAI provides a library called "tiktoken", which counts tokens so that you can derive a cost estimate from them.

Splitting text strings into tokens is useful because models like GPT-3 see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.

**tiktoken** provides several encodings for OpenAI models, including:

* "gpt2" for most GPT-3 models
* "p50k_base" for code models and text-davinci models such as "text-davinci-003"
* "cl100k_base" for text-embedding-ada-002

Make sure to check the pricing page on the OpenAI website for full information. For now, we'll cut down the data size so we don't spend too much money during training.

In [8]:
def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [9]:
# To minimize cost, use only the first 500 entries and write them to a JSON Lines file
dataset_size = 500

In [10]:
# Write the first dataset_size records, one JSON object per line
with open("training_data.json", 'w') as f:
    for entry in qa_openai_format[:dataset_size]:
        f.write(json.dumps(entry))
        f.write("\n")
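
Before uploading, it can be worth reading the file back to confirm that every line parses and carries exactly the two expected keys. The sketch below is one way to do that; `validate_jsonl` is a hypothetical helper, and the demo writes its own throwaway file so that it runs independently of the notebook:

```python
import json
import os
import tempfile


def validate_jsonl(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            # json.loads raises JSONDecodeError (a ValueError) on invalid JSON.
            record = json.loads(line)
            if set(record) != {"prompt", "completion"}:
                raise ValueError(f"line {line_number}: unexpected keys {sorted(record)}")
            count += 1
    return count


# Demo on a throwaway file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_training_data.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": "Q1", "completion": "A1"}) + "\n")
    f.write(json.dumps({"prompt": "Q2", "completion": "A2"}) + "\n")

record_count = validate_jsonl(demo_path)
print(record_count)
```

For the file written above, `validate_jsonl("training_data.json")` should return the same value as `dataset_size`.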

In [11]:
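# Count the tokens in every prompt and completion
# (p50k_base gives an approximate count; the exact figure depends on the model's encoding)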
token_counter = 0
for element in qa_openai_format[:dataset_size]:
    for key, value in element.items():
        token_counter += num_tokens_from_string(value, 'p50k_base')

In [12]:
print(f"There are {token_counter} tokens")
print("Fine tuning using babbage costs $0.0004 per 1000 tokens")
print(f"Estimated price: ${(4*token_counter / 1000) * 0.0004}")  # 4 is the number of training epochs

There are 184352 tokens
Fine tuning using babbage costs $0.0004 per 1000 tokens
Estimated price: $0.29496320000000004
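
The estimate above is the total token count multiplied by the number of epochs and by the price per 1,000 tokens. As a sketch, the same arithmetic can be wrapped in a small function. `estimate_training_cost` is a hypothetical name, and the price and epoch count are the ones used in this notebook, which may not match current OpenAI pricing:

```python
def estimate_training_cost(token_count, price_per_1k_tokens, epochs=4):
    """Estimate fine-tuning cost in dollars: tokens x epochs x unit price."""
    billed_tokens = token_count * epochs
    return billed_tokens / 1000 * price_per_1k_tokens


# Token count and babbage price taken from the cells above.
cost = estimate_training_cost(184352, 0.0004)
print(f"Estimated price: ${cost:.2f}")  # prints "Estimated price: $0.29"
```

Rounding to two decimals avoids the floating-point tail seen in the raw output.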


### Command Line for Fine-Tuning

Note, you can find the full official guide here:

https://platform.openai.com/docs/guides/fine-tuning

OpenAI recommends running fine-tuning from the terminal with its command-line tool, which is installed along with the Python package:

    pip install --upgrade openai


Now you can head over to the terminal to fine tune the model using the following command:

    openai api fine_tunes.create -t training_data.json -m babbage

In [13]:
openai.api_key = os.getenv("OPENAI_API_KEY")

#### Create a file for fine-tuning in OpenAI from training data

In [14]:
# Upload the training file to OpenAI for fine-tuning
openai.File.create(
    file=open("training_data.json", "rb"),
    purpose="fine-tune"
)

<File file id=file-fRmeYJjtGx2eOHbZGcZBJuRH at 0x11d925610> JSON: {
  "object": "file",
  "id": "file-fRmeYJjtGx2eOHbZGcZBJuRH",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 706183,
  "created_at": 1697093667,
  "status": "uploaded",
  "status_details": null
}

#### Create the fine-tuned model

In [15]:
# Start the fine-tuning job
openai.FineTuningJob.create(training_file="file-fRmeYJjtGx2eOHbZGcZBJuRH", model="babbage-002")

InvalidRequestError: Fine-tuning jobs cannot be created on an Explore plan. You can upgrade to a paid plan on your billing page: https://platform.openai.com/account/billing/overview
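
Once a job can be created (on a paid plan), it runs asynchronously, so the usual next step is to wait until it finishes. The sketch below polls a status function until the job reaches a terminal state. `wait_for_job` and `fetch_status` are hypothetical names; with the pre-1.0 `openai` package used in this notebook, `fetch_status` is assumed to wrap something like `openai.FineTuningJob.retrieve(job_id)["status"]`. A stub stands in for the API here so the example runs without credentials:

```python
import time

# Status values assumed to mark a finished job in the fine-tuning API.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30.0, max_polls=120):
    """Call fetch_status() until it returns a terminal state, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish in time")


# Stub that mimics a job moving through its states.
fake_states = iter(["validating_files", "queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final_status)
```

Passing the status lookup in as a callable keeps the waiting logic separate from the API client, which also makes it easy to test.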

### Now we make the OpenAI call with the fine-tuned model returned from the command prompt

The code for this step will be completed once the billing issue above is resolved. The cell below has not been run yet.

https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset


https://norahsakal.com/blog/fine-tune-gpt3-model



In [None]:
response = openai.Completion.create(
    model="babbage:ft-EsQwSCdusLOAWeteVTmJuMrJ",
    prompt="What are good python books?",
    max_tokens=128,
    temperature=0.7,
    top_p=1.0
)
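
The legacy Completions endpoint returns its generated text under `choices[0].text`. A minimal sketch of pulling the answer out of the response, using a hand-written dictionary in place of a real API response (the sample text is made up, and `first_completion_text` is a hypothetical helper):

```python
def first_completion_text(response):
    """Return the text of the first choice, stripped of surrounding whitespace."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"].strip()


# Hand-written stand-in with the same shape as a Completions response.
fake_response = {
    "choices": [
        {"text": " Try an introductory Python book.", "index": 0, "finish_reason": "stop"}
    ]
}
answer = first_completion_text(fake_response)
print(answer)
```

With a real call, `first_completion_text(response)` would be applied to the object returned by `openai.Completion.create`.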
