# Text Classification - Finetune GPT-4o-mini for NER


---

## $\color{blue}{Sections:}$

* Preamble
1. Admin
2. Data
3. Prompt
4. JSONL
5. Check Datasets
6. Create OpenAI Finetuned Model

## $\color{blue}{Preamble:}$

This notebook formats an annotated NER dataset as JSONL, uploads it to OpenAI, and then launches a fine-tuning job for GPT-4o-mini.

## $\color{blue}{Admin}$
* Install relevant Libraries
* Import relevant Libraries

In [None]:
%%capture
!pip install tiktoken openai cohere

In [None]:
pip install dill

Collecting dill
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Downloading dill-0.3.9-py3-none-any.whl (119 kB)
Installing collected packages: dill
Successfully installed dill-0.3.9


In [None]:
import openai
import re
import pandas as pd
import requests
import json
from google.colab import drive
from google.colab import userdata
from collections import defaultdict
import os
import dill

## $\color{blue}{Data}$

* Connect to Drive
* Load the train, dev, and example data into DataFrames

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive'

Mounted at /content/drive
/content/drive/MyDrive


In [None]:
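# pandas is already imported above as pd.
# Note: file names are built by plain string concatenation, with no separator added.
path = 'class/datasets/ner_annotated'
df_train = pd.read_pickle(path + 'train')
df_dev = pd.read_pickle(path + 'dev')
df_example = pd.read_pickle(path + 'example')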

In [None]:
df_dev.columns

Index(['id', 'content', 'annotated_content'], dtype='object')

## $\color{blue}{JSONL}$

----

The fine-tuning API requires the training data to be uploaded as a JSONL file, with one JSON object per line.
Each object holds a list of messages: a system message (which defines the LLM's role), a user message (the input prompt), and an assistant message (the expected output).
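
To make the format concrete, here is a minimal sketch of a single training record and how it becomes one line of a JSONL file. The message texts are shortened placeholders made up for illustration, not rows from the dataset.

```python
import json

# One training example in the chat fine-tuning format: a single JSON object
# with a "messages" list, written onto one line of the .jsonl file.
example = {
    "messages": [
        {"role": "system", "content": "You are an excellent linguist."},
        {"role": "user", "content": "Input: We got to the Borgo Pass just after sunrise.\nOutput:"},
        {"role": "assistant", "content": "We got to the @@Borgo Pass##Location just after sunrise."},
    ]
}

# json.dumps escapes the newline inside the prompt, so the record stays on one line.
jsonl_line = json.dumps(example, ensure_ascii=False)

# Reading the line back recovers the same structure.
restored = json.loads(jsonl_line)
roles = [m["role"] for m in restored["messages"]]
print(roles)
```

Because every record is serialized independently, the file can be read back line by line, which is what the checking functions later in this notebook rely on.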

In [None]:
prompt = """The task is to label the Location and Person entities in the given ###Text section, Following the format in the ###Examples section.
The output should be identicle to the input with the exception of the Person and Location tags if required.

###Examples
Input: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.
Output: “Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.

Input: sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,
Output: sibly there were several others. He personally, being of a sceptical bias, believed and didn’t make the smallest bones about saying so either that man or men in the plural were always hanging around on the waiting list about a lady,

Input: Now to the historical, for as Madam Mina write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the Borgo Pass just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.
Output: Now to the historical, for as @@Madam Mina##Person write not in her stenography, I must, in my cumbrous old fashion, that so each day of us may not go unrecorded. We got to the @@Borgo Pass##Location  just after sunrise yesterday morning. When I saw the signs of the dawn I got ready for the hypnotism.

**DON'T LABEL PRONOUNS AS PERSON**

###Text
Input: {}
Output:"""


system_message = """You are an excellent linguist."""
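
The prompt marks entities as `@@<text>##<Label>`. As a rough sketch of how such output could be parsed or sanity-checked, the snippet below extracts the tagged spans and strips the tags again. The names `TAG_PATTERN`, `extract_entities`, and `strip_tags` are hypothetical helpers, not part of the original notebook, and the pattern assumes only the `Person` and `Location` labels are used.

```python
import re

# Matches spans written as @@<entity text>##<Label>, with Label being Person or Location.
TAG_PATTERN = re.compile(r"@@(.+?)##(Person|Location)")

def extract_entities(annotated):
    """Return (entity_text, label) pairs found in an annotated string."""
    return TAG_PATTERN.findall(annotated)

def strip_tags(annotated):
    """Remove the @@ and ##Label markers, keeping only the entity text."""
    return TAG_PATTERN.sub(r"\1", annotated)

annotated = "asked @@Mr Fogarty##Person dubiously near the @@Borgo Pass##Location."
print(extract_entities(annotated))
print(strip_tags(annotated))
```

Comparing `strip_tags(annotated_content)` with `content` is one possible way to spot annotation slips, although whitespace around tags may differ slightly in some examples, so an exact match is not guaranteed.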


In [None]:
def format_data(df):
  """Build one chat-format training example per row of df."""
  dataset = []
  # Iterate by position so this also works when the index is not 0..n-1.
  for i in range(df.shape[0]):
    row = df.iloc[i]
    point = {"messages" : [{"role": "system" , "content" : system_message}]}
    point["messages"].append({"role": "user", "content": prompt.format(row['content'])})
    point["messages"].append({"role": "assistant", "content": row['annotated_content']})
    dataset.append(point)
  return dataset

def save_to_jsonl(dataset, file_path):
  """
  Convert dataset into jsonl.

  Parameters
  ----------
  dataset : list
      List of dicts containing datapoint information.
  file_path : str
      File path to save to.

  Returns
  -------
  None
  """
  with open(file_path,"w") as file:
    for data in dataset:
      json_line = json.dumps(data)
      file.write(json_line + '\n')

##### $\color{red}{To-File}$


In [None]:
train_dataset = format_data(df_train)
save_to_jsonl(train_dataset, "class/datasets/train_openai_ner_ft.jsonl")

In [None]:
train_dataset[0]

{'messages': [{'role': 'system', 'content': 'You are an excellent linguist.'},
  {'role': 'user',
   'content': "The task is to label the Location and Person entities in the given ###Text section, Following the format in the ###Examples section.\nThe output should be identicle to the input with the exception of the Person and Location tags if required.\n\n###Examples\nInput: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.\nOutput: “Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.\n\nInput: sibly there were several others. He personally, being of a sceptical bias, believed and di

## $\color{blue}{Check - Datasets}$

In [None]:
# Get example
def message_check(file_path, ind):
  """
  Check message from jsonl file.

  Parameters
  ----------
  file_path : str
      Path to jsonl file.
  ind : int
      Index of the example to print.

  Returns
  -------
  None
  """
  # Load the dataset
  with open(file_path, 'r', encoding='utf-8') as f:
      dataset = [json.loads(line) for line in f]

  # Initial dataset stats
  print("Num examples:", len(dataset))
  print("First example:")
  for message in dataset[ind]["messages"]:
      print(message)

In [None]:
# Format error checks
def check_errors(file_path):
  """
  Check if there are any errors in file that will cause OpenAI training process to fail.

  Parameters
  ----------
  file_path : str
      Path to the jsonl file.

  Returns
  -------
  None
  """
  with open(file_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

  format_errors = defaultdict(int)

  for ex in dataset:
      if not isinstance(ex, dict):
          format_errors["data_type"] += 1
          continue

      messages = ex.get("messages", None)
      if not messages:
          format_errors["missing_messages_list"] += 1
          continue

      for message in messages:
          if "role" not in message or "content" not in message:
              format_errors["message_missing_key"] += 1

          if any(k not in ("role", "content", "name", "function_call") for k in message):
              format_errors["message_unrecognized_key"] += 1

          if message.get("role", None) not in ("system", "user", "assistant", "function"):
              format_errors["unrecognized_role"] += 1

          content = message.get("content", None)
          function_call = message.get("function_call", None)

          if (not content and not function_call) or not isinstance(content, str):
              format_errors["missing_content"] += 1

      if not any(message.get("role", None) == "assistant" for message in messages):
          format_errors["example_missing_assistant_message"] += 1

  if format_errors:
      print("Found errors:")
      for k, v in format_errors.items():
          print(f"{k}: {v}")
  else:
      print("No errors found")
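
To illustrate what these checks flag, here is a small sketch that applies a subset of the same rules to two in-memory records, one well formed and one deliberately broken. `example_problems` is a hypothetical helper written for this illustration and the records are made up; it is not a replacement for `check_errors`.

```python
def example_problems(ex):
    """Return the names of format problems found in one parsed JSONL record."""
    if not isinstance(ex, dict):
        return ["data_type"]
    messages = ex.get("messages")
    if not messages:
        return ["missing_messages_list"]
    problems = []
    for message in messages:
        # Every message needs both a role and some content.
        if "role" not in message or "content" not in message:
            problems.append("message_missing_key")
        if message.get("role") not in ("system", "user", "assistant"):
            problems.append("unrecognized_role")
    # Fine-tuning needs a target, so at least one assistant message is required.
    if not any(m.get("role") == "assistant" for m in messages):
        problems.append("example_missing_assistant_message")
    return problems

good = {"messages": [{"role": "user", "content": "Input: ..."},
                     {"role": "assistant", "content": "..."}]}
bad = {"messages": [{"role": "user", "content": "Input: ..."},
                    {"role": "bot"}]}

print(example_problems(good))
print(example_problems(bad))
```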

In [None]:
message_check("class/datasets/train_openai_ner_ft.jsonl",10)

Num examples: 97
First example:
{'role': 'system', 'content': 'You are an excellent linguist.'}
{'role': 'user', 'content': "The task is to label the Location and Person entities in the given ###Text section, Following the format in the ###Examples section.\nThe output should be identicle to the input with the exception of the Person and Location tags if required.\n\n###Examples\nInput: “Is it John of Tuam?”   “Are you sure of that now?” asked Mr Fogarty dubiously. “I thought it was some Italian or American.”   “John of Tuam,” repeated Mr Cunningham, “was the man.”   He drank and the other gentlemen followed his lead.\nOutput: “Is it @@John of Tuam##Person ?”   “Are you sure of that now?” asked @@Mr Fogarty##Person dubiously. “I thought it was some Italian or American.”   “@@John of Tuam##Person,” repeated @@Mr Cunningham##Person, “was the man.”   He drank and the other gentlemen followed his lead.\n\nInput: sibly there were several others. He personally, being of a sceptical bias, bel

In [None]:
check_errors("class/datasets/train_openai_book_ft.jsonl")

No errors found


In [None]:
check_errors("class/datasets/dev_openai_book_ft.jsonl")

No errors found


## $\color{blue}{Create-OpenAi-Finetuned-Model}$

##### $\color{red}{Load-File}$

In [None]:
endpoint = "https://api.openai.com/v1/files" # endpoint for files

key = userdata.get('OPENAI_API_KEY')

headers = {'Authorization': f"Bearer {key}"}

def upload_file(file_path, endpoint, headers):
  """
  Upload a file to the OpenAI file system.

  Parameters
  ----------
  file_path : str
      Path to the jsonl file.
  endpoint : str
      Use 'https://api.openai.com/v1/files'.
  headers : dict
      Use {'Authorization': f"Bearer {key}"}.

  Returns
  -------
  response : dict
      Parsed JSON response from OpenAI confirming details of the upload.
  """
  with open(file_path,'rb') as f:
    response = requests.post(endpoint, headers=headers, files={'file': f}, data={'purpose': 'fine-tune'})
  return response.json()
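
As an alternative to calling the REST endpoint with `requests`, the same upload can probably be done through the `openai` package that was installed earlier. The sketch below assumes version 1 or later of that package, where a client object exposes `files.create`; `upload_with_client` is a hypothetical helper name, and the client is passed in so the function itself does not depend on how it was constructed.

```python
def upload_with_client(client, file_path):
    """Upload a JSONL file for fine-tuning through an OpenAI client object."""
    with open(file_path, "rb") as f:
        # purpose="fine-tune" mirrors the 'purpose' form field used in upload_file.
        return client.files.create(file=f, purpose="fine-tune")

# Usage sketch (assumes the `openai` package, v1 or later, and the `key` variable above):
# from openai import OpenAI
# client = OpenAI(api_key=key)
# uploaded = upload_with_client(client, "class/datasets/train_openai_ner_ft.jsonl")
# print(uploaded.id)
```

The returned object carries the same file ID that the REST response exposes under `'id'`, which is what the fine-tuning job needs as its training file.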

In [None]:
train_file_response = upload_file("class/datasets/train_openai_ner_ft.jsonl", endpoint, headers)

In [None]:
train_file_response

{'object': 'file',
 'id': 'file-1PW8zfdZxtagXNURr8kKQb',
 'purpose': 'fine-tune',
 'filename': 'train_openai_ner_ft.jsonl',
 'bytes': 262876,
 'created_at': 1733495172,
 'status': 'processed',
 'status_details': None}

##### $\color{red}{Create-Models}$

In [None]:
URL = "https://api.openai.com/v1/fine_tuning/jobs" # endpoint


headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {key}"
}

In [None]:
payload = {
  "training_file": train_file_response['id'],
  "model": "gpt-4o-mini-2024-07-18"
}
finetune_response = requests.post(URL, json=payload, headers=headers)
finetune_meta = json.loads(finetune_response.content)
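
The payload above sets only the two required fields. As a sketch of how it might be extended, the helper below also accepts an optional validation file ID and a model-name suffix, both of which are optional fields of the fine-tuning jobs request. `build_finetune_payload` is a hypothetical helper, and the IDs in the example are placeholders rather than real uploads.

```python
def build_finetune_payload(training_file_id, model, validation_file_id=None, suffix=None):
    """Assemble the JSON body for a fine-tuning job request."""
    payload = {"training_file": training_file_id, "model": model}
    # Optional fields are only included when a value is given.
    if validation_file_id is not None:
        payload["validation_file"] = validation_file_id
    if suffix is not None:
        payload["suffix"] = suffix
    return payload

# Placeholder IDs for illustration only.
minimal = build_finetune_payload("file-TRAIN", "gpt-4o-mini-2024-07-18")
extended = build_finetune_payload(
    "file-TRAIN", "gpt-4o-mini-2024-07-18",
    validation_file_id="file-DEV", suffix="ner",
)
print(minimal)
print(extended)
```

Passing a validation file would require uploading the dev set in the same way as the training set first.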

In [None]:
finetune_meta

{'object': 'fine_tuning.job',
 'id': 'ftjob-Kn1UUueyuQsTGNm36sw1P4i7',
 'model': 'gpt-4o-mini-2024-07-18',
 'created_at': 1733495217,
 'finished_at': None,
 'fine_tuned_model': None,
 'organization_id': 'org-4bBdSgsciB8iKzeJ61GgVdXt',
 'result_files': [],
 'status': 'validating_files',
 'validation_file': None,
 'training_file': 'file-1PW8zfdZxtagXNURr8kKQb',
 'hyperparameters': {'n_epochs': 'auto',
  'batch_size': 'auto',
  'learning_rate_multiplier': 'auto'},
 'trained_tokens': None,
 'error': {},
 'user_provided_suffix': None,
 'seed': 230197262,
 'estimated_finish': None,
 'integrations': []}
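
The response above shows the job in the `validating_files` state with `fine_tuned_model` still `None`, so the model name only becomes available once the job finishes. One way to wait for that is to poll the job until it reaches a terminal status. The sketch below is an assumption-laden illustration: `wait_for_job` and its parameters are hypothetical, and the job lookup is passed in as a callable so the loop does not depend on a particular HTTP library.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_job, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_job() until the job reaches a terminal status or polls run out."""
    job = get_job()
    for _ in range(max_polls - 1):
        if job.get("status") in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
        job = get_job()
    return job

# Usage sketch with the REST endpoint used above (needs `requests`, `URL`, `headers`):
# job_id = finetune_meta["id"]
# final = wait_for_job(lambda: requests.get(f"{URL}/{job_id}", headers=headers).json())
# print(final["status"], final.get("fine_tuned_model"))
```

If the job succeeds, the `fine_tuned_model` field of the final response holds the model name to use in later completion requests.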