Sources:
https://platform.openai.com/docs/guides/fine-tuning

In [33]:
import yaml
import re
import json

In [40]:
with open('data/nlu_converted.yml') as f:
    data = yaml.safe_load(f)

In [41]:
data

{'version': '2.0',
 'nlu': [{'intent': 'DLEmail',
   'examples': '- Hello. I need help setting up a [distribution list](Software_Attribute)\n- Hello. I need help setting up a [Distribution List](Software_Attribute)\n- Hello. I need help setting up a [Distribution list](Software_Attribute)\n- Hello. I need help setting up a [DL](Software_Attribute)\n- Hello. I need help setting up a [dl](Software_Attribute)\n- Hello. I need help setting up a [Group Email](Software_Attribute)\n- Hello. I need help setting up a [Email List](Software_Attribute)\n- Hello. I need help setting up a [group email](Software_Attribute)\n- Hello. I need help setting up a [email list](Software_Attribute)\n- Hello. I need help setting up a [mail list](Software_Attribute)\n- Hello. I need help setting up a [Mail List](Software_Attribute)\n- Hi, I would like to remove an [owner](Software_Attribute) from our [distribution list](Software_Attribute) Can you help pls\n- Hi, I would like to remove an [owner](Software_Attri

In [6]:
def parse_examples(examples: str) -> list[str]:
    """Converts examples from YAML file to list of utterance texts

    Args:
        examples (string): Examples from YAML data file.
                           It is a single string containing all utterances
                           separated by '\n- '

    Returns:
        List[str]: list of utterance texts
    """

    texts = [text.strip() for text in examples.lstrip("- ").split("\n- ")]
    # Splitting by '\n- ' leaves an empty string in the list
    # when the value of examples is '' (or an item is blank), so drop those.
    return [text for text in texts if text]

In [25]:
def parse_entities_from_text(text: str) -> tuple[str, list]:
    """Lists labelled entities from texts and returns list of entities
    along with original text.

    Args:
        text (str): Text with labelled entities

    Returns:
        Tuple[str, List[MappedEntitiesDict]]: Tuple containing
            original text and list of mapped entities
    """

    # The following regex pattern detects whether utterance
    # is labelled with entities
    # Group 1: (\[.+?\]) [entity value]
    # Group 2: (\(.+?\)) (entity name)
    # Example: Book flight from [pune](city_name) to [mumbai](city_name)
    entity_regex_pattern_1 = r"(\[.+?\])(\(.+?\))"

    # The following regex pattern detects whether utterance
    # is labelled with entities (may or may not contain roles)
    # Group 1: (\[.+?\]) [entity value]
    # Group 2: ({\s*\"entity\":\s*\".+?\"\s*})
    #          {"entity": "entity name", "role": "role name (optional)"}
    # Example: Book flight from [pune]{"entity": "city_name", "role": "source"}
    #          to [mumbai]{"entity": "city_name", "role": "destination"}
    entity_regex_pattern_2 = r"(\[.+?\])({\s*\"entity\":\s*\".+?\"\s*})"

    # utterance after removing labelled entities (if any)
    text_without_entities = text

    mapped_entities = []

    for entity_regex_pattern in [entity_regex_pattern_1, entity_regex_pattern_2]:
        for _ in range(len(re.findall(entity_regex_pattern, text))):

            match = re.search(entity_regex_pattern, text_without_entities)

            entity_value = match.group(1)
            entity_value = entity_value[1:-1]  # removing square brackets []

            if entity_regex_pattern == entity_regex_pattern_1:
                entity_name = match.group(2)
                entity_name = entity_name[1:-1]  # removing brackets ()
                role_name = None  # not provided

            elif entity_regex_pattern == entity_regex_pattern_2:
                entity_dict = match.group(2)
                entity_dict = json.loads(entity_dict)
                entity_name = entity_dict["entity"]
                role_name = entity_dict.get("role")

            # removing the labelled entity and role from the text
            text_without_entities = (
                text_without_entities[: match.start()]
                + entity_value
                + text_without_entities[match.end() :]
            )

            mapped_entities.append(
                {
                    "name": entity_name,
                    "value": entity_value,
                    "role_name": role_name,
                }
            )

    return text_without_entities, mapped_entities
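
To make the transformation concrete, here is a minimal sketch of the same idea for the [value](entity_name) syntax only. It is a simplified variant rather than the function above: it uses re.sub with a callback and ignores the JSON-style labels with roles. The name strip_paren_entities is hypothetical.

```python
import re

# Simplified pattern: [entity value](entity_name) only, no role support.
ENTITY_PATTERN = re.compile(r"\[(.+?)\]\((.+?)\)")


def strip_paren_entities(text):
    """Return (text without markup, list of labelled entities)."""
    entities = []

    def _replace(match):
        # Record the entity, then keep only its value in the text.
        entities.append(
            {"name": match.group(2), "value": match.group(1), "role_name": None}
        )
        return match.group(1)

    plain = ENTITY_PATTERN.sub(_replace, text)
    return plain, entities


plain, entities = strip_paren_entities(
    "Book flight from [pune](city_name) to [mumbai](city_name)"
)
print(plain)     # Book flight from pune to mumbai
print(entities)
```

The callback form avoids re-searching the string after every replacement, at the cost of handling only one of the two label syntaxes.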

A prompt separator string must be appended to the end of every prompt, both in the fine-tuning data and in requests sent to the fine-tuned model. The separator tells the model where the utterance ends and the completion begins. Without it, the completions are likely to be arbitrary text rather than the desired intent and entities. The separator must NOT appear anywhere in the utterance itself.

In [29]:
prompt_separator = "\n\nNLU_RESULTS:\n\n"
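
The requirement that the separator never appears in an utterance can be checked before building the training file. The following is a minimal sketch under that assumption. The helper name find_separator_conflicts and the sample utterances are hypothetical and not part of the original data.

```python
prompt_separator = "\n\nNLU_RESULTS:\n\n"


def find_separator_conflicts(utterances, separator=prompt_separator):
    """Return the utterances that contain the separator's visible marker."""
    # Compare on the stripped marker so that differing surrounding
    # whitespace does not hide a conflict.
    marker = separator.strip()
    return [utterance for utterance in utterances if marker in utterance]


# Hypothetical utterances, for illustration only.
sample_utterances = [
    "I need new laptop battery",
    "What does NLU_RESULTS: mean in the logs?",
]
conflicts = find_separator_conflicts(sample_utterances)
print(conflicts)
```

An empty result means none of the utterances collide with the separator.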

Each completion should start with a whitespace character, since OpenAI's tokenizer encodes most words together with a preceding space.

For classification tasks, OpenAI recommends choosing classes that each map to a single token and setting max_tokens=1. However, some classes such as "Enquiry" or "Incident" map to multiple tokens, and the completion here is a JSON object rather than a bare class name. So we end each completion with a stop sequence instead.

In [30]:
completion_stop_sequence = " END"

In [51]:
json_list = []

for intent_dict in data['nlu']:
    intent = intent_dict['intent']
    examples = parse_examples(intent_dict['examples'])
    # Use at most 40 examples per intent
    for example in examples[:40]:
        # Strip the entity markup and collect the labelled entities
        text, entities = parse_entities_from_text(example)
        prompt = f'{text}{prompt_separator}'
        # Completion: leading space + JSON payload + stop sequence
        payload = json.dumps({'intent': intent, 'entities': entities})
        completion = f' {payload}{completion_stop_sequence}'
        json_list.append({'prompt': prompt, 'completion': completion})

In [52]:
json_list[0]

{'prompt': 'Hello. I need help setting up a distribution list\n\nNLU_RESULTS:\n\n',
 'completion': ' {"intent": "DLEmail", "entities": [{"name": "Software_Attribute", "value": "distribution list", "role_name": null}]} END'}
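
The formatting rules described above (separator at the end of the prompt, leading space and stop sequence on the completion) can be verified per record before uploading. This is a minimal sketch. The helper name check_record is hypothetical, and the sample record is the one printed above.

```python
import json

prompt_separator = "\n\nNLU_RESULTS:\n\n"
completion_stop_sequence = " END"


def check_record(record):
    """Return a list of formatting problems found in one training record."""
    problems = []
    prompt, completion = record["prompt"], record["completion"]
    if not prompt.endswith(prompt_separator):
        problems.append("prompt does not end with the separator")
    if prompt.count(prompt_separator) > 1:
        problems.append("separator also appears inside the utterance")
    if not completion.startswith(" "):
        problems.append("completion does not start with a space")
    if not completion.endswith(completion_stop_sequence):
        problems.append("completion does not end with the stop sequence")
    else:
        # The text before the stop sequence should be the JSON payload.
        body = completion[: -len(completion_stop_sequence)]
        try:
            json.loads(body)
        except json.JSONDecodeError:
            problems.append("completion body is not valid JSON")
    return problems


record = {
    "prompt": "Hello. I need help setting up a distribution list\n\nNLU_RESULTS:\n\n",
    "completion": ' {"intent": "DLEmail", "entities": [{"name": "Software_Attribute", '
                  '"value": "distribution list", "role_name": null}]} END',
}
print(check_record(record))  # []
```

Running the check over every item in json_list and stopping on the first non-empty result is one way to catch formatting mistakes before paying for a fine-tune.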

In [53]:
len(json_list)

291

In [54]:
import jsonlines

In [57]:
with jsonlines.open('intent_entity.jsonl', mode='w') as writer:
    # Write each data item as a separate line
    for item in json_list:
        writer.write(item)

In [58]:
import dotenv
dotenv.load_dotenv(dotenv.find_dotenv())

True

In [60]:
!openai api fine_tunes.create -t intent_entity.jsonl -m ada

Upload progress: 100%|████████████████████| 85.1k/85.1k [00:00<00:00, 37.0Mit/s]
Uploaded file from intent_entity.jsonl: file-d8LXQqks694NOHLlb8cuU2VR
Created fine-tune: ft-h6nHAGr4G6u4JtDkswE412vP
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-28 17:56:49] Created fine-tune: ft-h6nHAGr4G6u4JtDkswE412vP

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-h6nHAGr4G6u4JtDkswE412vP



In [77]:
!openai api fine_tunes.follow -i ft-h6nHAGr4G6u4JtDkswE412vP

[2023-06-28 17:56:49] Created fine-tune: ft-h6nHAGr4G6u4JtDkswE412vP
[2023-06-28 18:45:50] Fine-tune costs $0.03
[2023-06-28 18:45:50] Fine-tune enqueued. Queue number: 6
[2023-06-28 18:48:35] Fine-tune is in the queue. Queue number: 5
[2023-06-28 18:49:50] Fine-tune is in the queue. Queue number: 4
[2023-06-28 18:49:52] Fine-tune is in the queue. Queue number: 3
[2023-06-28 18:49:57] Fine-tune is in the queue. Queue number: 2
[2023-06-28 18:50:08] Fine-tune is in the queue. Queue number: 1
[2023-06-28 18:51:12] Fine-tune is in the queue. Queue number: 0
[2023-06-28 18:51:14] Fine-tune started
[2023-06-28 18:52:12] Completed epoch 1/4
[2023-06-28 18:52:57] Completed epoch 2/4
[2023-06-28 18:53:42] Completed epoch 3/4
[2023-06-28 18:54:26] Completed epoch 4/4
[2023-06-28 18:54:47] Uploaded model: ada:ft-personal-2023-06-28-13-24-46
[2023-06-28 18:55:08] Uploaded result file: file-iAhYpNQzwMcTosPS8kQBUD5G
[2023-06-28 18:55:08] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try ou

In [109]:
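import openai

# Legacy Completions endpoint (openai-python < 1.0).
# The prompt must end with the same separator used in the training data,
# and the stop sequence cuts generation at the end of the JSON payload.
response = openai.Completion.create(
    model="ada:ft-personal-2023-06-28-13-24-46",
    prompt=f"I need new laptop battery{prompt_separator}",
    max_tokens=1000,
    stop=[completion_stop_sequence]
)
print(response['choices'][0]['text'].strip())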

{"intent": "Hardware&Peripheral", "entities": [{"name": "Hardware_Peripheral", "value": "new laptop battery", "role_name": null}]}
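
Because the model produces the JSON as free text, it is not guaranteed to be well-formed for every request. The following is a minimal sketch of defensive parsing. The helper name parse_nlu_result is hypothetical, and returning None on failure is an assumed convention, not something the original notebook does.

```python
import json


def parse_nlu_result(raw_text):
    """Parse the completion text into (intent, entities), or None on failure."""
    try:
        result = json.loads(raw_text.strip())
    except json.JSONDecodeError:
        # The model emitted something that is not valid JSON.
        return None
    if not isinstance(result, dict) or "intent" not in result:
        return None
    return result["intent"], result.get("entities", [])


# The completion text shown above, with the leading space the model emits.
raw = (
    ' {"intent": "Hardware&Peripheral", "entities": [{"name": "Hardware_Peripheral", '
    '"value": "new laptop battery", "role_name": null}]}'
)
parsed = parse_nlu_result(raw)
print(parsed)
```

A caller can then fall back to a default intent or retry when the result is None.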
