Sources:
https://platform.openai.com/docs/guides/fine-tuning

In [1]:
import yaml

In [4]:
with open('Cat_0_Data/data/Cat_0_commonexample.yaml') as f:
    data = yaml.safe_load(f)

In [8]:
def parse_examples(examples: str) -> list[str]:
    """Convert the examples string from the YAML file into a list of utterance texts.

    Args:
        examples (str): Examples from the YAML data file.
                        It is a single string containing all utterances,
                        separated by '\n- '.

    Returns:
        list[str]: List of utterance texts.
    """
    # Remove only the leading "- " marker of the first utterance, then split
    # the remaining text on the '\n- ' marker that starts every later utterance.
    texts = [text.strip() for text in examples.strip().removeprefix("- ").split("\n- ")]
    # Splitting an empty examples string ('') leaves an empty entry in the list,
    # so drop blank entries before returning.
    return [text for text in texts if text]
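
As a quick illustration of the input layout described in the docstring, the sketch below splits a small hand-written examples string. The sample utterances and the helper name `split_examples` are hypothetical and only restate the splitting logic so that the snippet runs on its own; `str.removeprefix` assumes Python 3.9 or later.

```python
# Hypothetical examples string in the layout the docstring describes:
# every utterance starts with "- " and utterances are separated by newlines.
sample_examples = (
    "- reset my password\n"
    "- my laptop will not boot\n"
    "- how do I request a new mouse\n"
)


def split_examples(examples: str) -> list[str]:
    """Stand-alone restatement of the splitting logic, for illustration only."""
    parts = examples.strip().removeprefix("- ").split("\n- ")
    # Strip surrounding whitespace and drop blank entries.
    return [part.strip() for part in parts if part.strip()]


utterances = split_examples(sample_examples)
print(utterances)
print(split_examples(""))  # an empty examples string yields an empty list
```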

A prompt separator string must be appended to the end of every prompt, both in the fine-tuning data and in the requests sent to the fine-tuned model. Without it, the model has no signal for where the prompt ends, and the completions will most likely be arbitrary text rather than the desired intent. The separator must NOT appear anywhere in the utterance itself.

In [137]:
prompt_separator = "\n\nIntent:\n\n"
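
One way to enforce the last requirement is to scan the utterances for the separator before building any prompts. This is a minimal sketch; the sample utterances and the name `offending` are hypothetical.

```python
prompt_separator = "\n\nIntent:\n\n"  # same value as in the cell above

# Hypothetical utterances standing in for the parsed YAML examples.
utterances = ["reset my password", "my laptop will not boot"]

# Collect every utterance that already contains the separator text.
offending = [text for text in utterances if prompt_separator in text]
if offending:
    raise ValueError(f"Separator found inside utterances: {offending}")

# The same separator is appended for training data and for inference requests.
prompt = f"{utterances[0]}{prompt_separator}"
print(repr(prompt))
```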

Completions should start with a whitespace character, since OpenAI's tokenizer encodes most words together with a preceding space.

For classification tasks, OpenAI recommends choosing classes that each map to a single token and setting max_tokens=1. However, some classes, such as "Enquiry" or "Incident", map to multiple tokens, so we append a stop sequence to the end of each completion instead.

In [None]:
completion_stop_sequence = " END"
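
The sketch below shows how a completion is assembled from the leading space, the intent, and the stop sequence, and how cutting the generated text at the stop sequence recovers the label. The API applies the `stop` parameter on the server side; the helper `truncate_at_stop` only emulates that behaviour locally, and both helper names are hypothetical.

```python
completion_stop_sequence = " END"  # same value as in the cell above


def build_completion(intent: str) -> str:
    """Leading space + intent label + stop sequence."""
    return f" {intent}{completion_stop_sequence}"


def truncate_at_stop(text: str, stop: str) -> str:
    """Local emulation of a stop sequence: keep the text before its first occurrence."""
    return text.split(stop, 1)[0]


# Simulated raw model output that keeps generating after the first label.
raw_output = build_completion("Incident") + " Enquiry END"
label = truncate_at_stop(raw_output, completion_stop_sequence).strip()
print(repr(build_completion("Incident")))
print(label)
```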

In [100]:
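json_list = []

for intent_dict in data['nlu']:
    intent = intent_dict['intent']
    examples = parse_examples(intent_dict['examples'])
    # Use at most the first 40 examples of each intent
    for example in examples[:40]:
        # Prompt: utterance followed by the separator.
        # Completion: leading space, intent label, then the stop sequence.
        json_list.append({'prompt': f'{example}{prompt_separator}',
                          'completion': f' {intent}{completion_stop_sequence}'})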
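
Before writing the records to disk, it can help to check that each one follows the conventions described above. The following is a rough sketch on two hand-written records; the records and the helper `check_record` are hypothetical.

```python
from collections import Counter

prompt_separator = "\n\nIntent:\n\n"
completion_stop_sequence = " END"

# Hypothetical records in the same shape as the entries of json_list.
sample_records = [
    {"prompt": f"reset my password{prompt_separator}", "completion": " Enquiry END"},
    {"prompt": f"my laptop will not boot{prompt_separator}", "completion": " Incident END"},
]


def check_record(record: dict) -> list[str]:
    """Return a list of formatting problems found in one record."""
    problems = []
    if not record["prompt"].endswith(prompt_separator):
        problems.append("prompt does not end with the separator")
    if record["prompt"].count(prompt_separator) > 1:
        problems.append("separator also appears inside the utterance")
    if not record["completion"].startswith(" "):
        problems.append("completion does not start with a space")
    if not record["completion"].endswith(completion_stop_sequence):
        problems.append("completion does not end with the stop sequence")
    return problems


# Map record index -> problems, keeping only records that have any.
issues = {i: p for i, r in enumerate(sample_records) if (p := check_record(r))}

# Count records per intent by stripping the leading space and the stop sequence.
per_intent = Counter(
    r["completion"][1:-len(completion_stop_sequence)] for r in sample_records
)
print(issues)
print(per_intent)
```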

In [101]:
len(json_list)

120

In [102]:
import jsonlines

In [103]:
with jsonlines.open('data_new.jsonl', mode='w') as writer:
    # Write each data item as a separate line
    for item in json_list:
        writer.write(item)
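
To confirm that a file like this holds one JSON object per line, it can be read back and compared with the records in memory. The sketch below uses only the standard `json` module rather than `jsonlines`, writes to a temporary file instead of `data_new.jsonl`, and uses hypothetical sample records.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the entries of json_list.
records = [
    {"prompt": "reset my password\n\nIntent:\n\n", "completion": " Enquiry END"},
    {"prompt": "my laptop will not boot\n\nIntent:\n\n", "completion": " Incident END"},
]

# Write one JSON object per line to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back, parsing each non-empty line as its own JSON document.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == records)
```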

In [29]:
import dotenv
dotenv.load_dotenv(dotenv.find_dotenv())

True

In [104]:
!openai api fine_tunes.create -t data_new.jsonl -m ada

Upload progress: 100%|████████████████████| 17.1k/17.1k [00:00<00:00, 20.6Mit/s]
Uploaded file from data_new.jsonl: file-EVRMo8oO1Az4m6TM3zBfXLRs
Created fine-tune: ft-uUWLxtby9xSUbzFaEXxBC6FZ
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-28 15:09:16] Created fine-tune: ft-uUWLxtby9xSUbzFaEXxBC6FZ

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-uUWLxtby9xSUbzFaEXxBC6FZ



In [123]:
!openai api fine_tunes.follow -i ft-uUWLxtby9xSUbzFaEXxBC6FZ

[2023-06-28 15:09:16] Created fine-tune: ft-uUWLxtby9xSUbzFaEXxBC6FZ
[2023-06-28 15:50:00] Fine-tune costs $0.01
[2023-06-28 15:50:01] Fine-tune enqueued. Queue number: 5
[2023-06-28 15:50:12] Fine-tune is in the queue. Queue number: 4
[2023-06-28 15:51:09] Fine-tune is in the queue. Queue number: 3
[2023-06-28 15:52:08] Fine-tune is in the queue. Queue number: 2
[2023-06-28 15:52:38] Fine-tune is in the queue. Queue number: 1
[2023-06-28 15:52:52] Fine-tune is in the queue. Queue number: 0
[2023-06-28 15:52:54] Fine-tune started
[2023-06-28 15:53:26] Completed epoch 1/4
[2023-06-28 15:53:44] Completed epoch 2/4
[2023-06-28 15:54:02] Completed epoch 3/4
[2023-06-28 15:54:20] Completed epoch 4/4
[2023-06-28 15:54:36] Uploaded model: ada:ft-personal-2023-06-28-10-24-36
[2023-06-28 15:55:17] Uploaded result file: file-gbizTYbB1ufAYFGhc2dJjF82
[2023-06-28 15:55:17] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft

In [149]:
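import openai

# The prompt must end with the same separator that was used in the training data.
response = openai.Completion.create(
    model="ada:ft-personal-2023-06-28-10-24-36",
    prompt=f"Help! I got logged out of my outlook account due to issues in MFA{prompt_separator}",
    # Stop generating as soon as the model emits the stop sequence.
    stop=[completion_stop_sequence]
)
# Strip the leading whitespace that every completion starts with.
print(response['choices'][0]['text'].strip())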

Incident
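
Since the model generates free text, the returned label is not guaranteed to be one of the trained intents. A small post-processing step can guard against that. This is only a sketch: `fake_response` is a hand-written dictionary that mimics the fields indexed in the cell above, and `extract_intent` and `known_intents` are hypothetical names.

```python
from typing import Optional

# Hypothetical set of intents the model was trained on.
known_intents = {"Enquiry", "Incident"}


def extract_intent(response: dict, intents: set) -> Optional[str]:
    """Return the predicted intent, or None if the text is not a known label."""
    text = response["choices"][0]["text"].strip()
    return text if text in intents else None


# Hand-written stand-in for an API response; no request is made here.
fake_response = {"choices": [{"text": " Incident"}]}
unexpected_response = {"choices": [{"text": " something else"}]}

print(extract_intent(fake_response, known_intents))
print(extract_intent(unexpected_response, known_intents))
```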
