<a href="https://colab.research.google.com/github/sms-astanley/octoai/blob/main/AEWF_L%26L_June_26th_LLM_Quality_Optimization_Bootcamp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Quality Optimization Bootcamp

### Author: Thierry Moreau - Co-founder, Head of DevRel @ OctoAI


In this notebook you'll learn how to fine-tune an open source LLM (Llama3-8B) from scratch to perform a specialized task - PII redaction via function calling.

**We'll show that by taking a small and efficient LLM like Llama3-8B and fine-tuning it, you can achieve significant quality improvements over a state-of-the-art model like GPT-4-Turbo, while also achieving significant cost savings (on the order of 100x).**

This notebook is divided into 4 parts:

1. Fine-tuning dataset collection
2. Kick off fine-tuning on OpenPipe
3. Deploy your fine-tune on OctoAI
4. Evaluate your fine-tune against GPT-4-Turbo

![llm deployment cycle](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/llm_deployment_cycle.png)


## Pre-requisites

### OctoAI

We'll use OctoAI to deploy our fine-tune. OctoAI offers efficient, customizable and reliable GenAI inference endpoints. You can sign up for an account on http://octoai.cloud/, and get access to the latest and greatest open source LLMs.

Create an API token by following [this guide](https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token) and save it somewhere safe: we'll need it later in this notebook.

By signing up you will automatically get $10 in credits, enough to generate 66.7M Llama3-8B tokens.

### OpenPipe

We'll use OpenPipe to fine tune our Llama3-8B model. OpenPipe lets developers build fine tuning datasets, fire off fine-tune jobs across a variety of base models, and run comprehensive quality evaluations.

Get started by signing up for an account: https://openpipe.ai/.

### OpenAI (optional)

We'll use OpenAI for quality comparisons against our fine-tuned LLM. More specifically, we'll compare our fine-tune against GPT-4-Turbo.

You can create an OpenAI account at the following URL: https://platform.openai.com. Once you've created an account, create a new API key on [this page](https://platform.openai.com/api-keys).

### Python Packages

Last but not least, run the cell below to install the necessary pip packages.

In [None]:
# Ignore the dependency resolver error
! pip install -q openai datasets

In [None]:
import os
from getpass import getpass

# Enter your OctoAI Token
OCTOAI_TOKEN = getpass()
os.environ["OCTOAI_TOKEN"] = OCTOAI_TOKEN

In [None]:
# Enter your OpenAI Token
OPENAI_API_KEY = getpass()
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# 1. Building a Fine-Tuning Dataset

A fine-tuned model is only as good as the dataset it's been trained on. Therefore the dataset collection step is critical in order to get a high quality fine tune.

There are two approaches to building a fine-tuning dataset using OpenPipe's SDK:
* Record a sufficient number of requests and responses from an LLM (e.g. GPT-4)
* Upload a pre-built dataset directly in JSONL format

In this notebook we'll opt for the latter approach by taking an already labeled dataset from HuggingFace and turning it into a synthetic LLM request log.

## PII Masking Dataset

The dataset we're using in this notebook is the Personally Identifiable Information (PII) masking 200k dataset from [AI4Privacy](https://www.ai4privacy.com/), available via this [link on HuggingFace](https://huggingface.co/datasets/ai4privacy/pii-masking-200k).

This dataset has about 200k synthetic text samples that each contain one or more PII entries across 54 PII classes.

An example of PII redaction looks as follows.

Input text:
```
Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools.
```

Redacted text:
```
Dear [FIRSTNAME], as per our records, your license [VEHICLEVIN] is still registered in our records for access to the educational tools.
```

Privacy mask:
```
[ { "value": "Omer", "start": 5, "end": 9, "label": "FIRSTNAME" }, { "value": "78B5R2MVFAHJ48500", "start": 44, "end": 61, "label": "VEHICLEVIN" } ]
```
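
The `start` and `end` fields are character offsets into the input text, so the redacted text can be rebuilt from the privacy mask alone. Here is a minimal sketch, assuming the offsets are end-exclusive as in the example above. `apply_privacy_mask` is an illustrative helper written for this notebook, not part of the dataset tooling.

```python
def apply_privacy_mask(text, privacy_mask):
    """Replace each masked span with its [LABEL], using character offsets."""
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    for entry in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        label = "[{}]".format(entry["label"])
        redacted = redacted[:entry["start"]] + label + redacted[entry["end"]:]
    return redacted

source_text = "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools."
privacy_mask = [
    {"value": "Omer", "start": 5, "end": 9, "label": "FIRSTNAME"},
    {"value": "78B5R2MVFAHJ48500", "start": 44, "end": 61, "label": "VEHICLEVIN"},
]

print(apply_privacy_mask(source_text, privacy_mask))
```

Processing the spans in reverse order avoids recomputing offsets after each replacement changes the length of the text.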

## Function calling for PII redaction

Instead of training an LLM to do the PII redaction directly on the input text, we'll use the LLM's function calling ability to call a function that will perform the redaction on the original text.

Using a function to perform the redaction gives us flexibility to implement different redaction approaches after the LLM has been fine tuned.
* We can redact by replacing the PII with the PII class it belongs to, e.g. `Omer` becomes `[FIRSTNAME]`.
* We can redact by replacing the PII with masked information, e.g. `Omer` becomes `XXXXXX`.
* We can redact by replacing the PII with a fake PII by mapping each unique original PII to a corresponding fake PII substitute from a database, e.g. `Omer` becomes `Kendall`.
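
The three approaches above can be sketched as a single `redact()` function. This is only one possible implementation: the `mode` and `fake_values` parameters are assumptions made for this illustration, and the fake-PII "database" is a plain dictionary. The mask mode writes one `X` per original character, which is a choice made here; a fixed-length mask would also hide the length of the original value.

```python
def redact(text, fields_to_redact, mode="label", fake_values=None):
    """Apply the LLM-selected redactions to the original text.

    mode="label": replace with the PII class, e.g. [FIRSTNAME]
    mode="mask":  replace with one X per original character
    mode="fake":  replace with a substitute looked up in fake_values
    """
    fake_values = fake_values or {}
    for field in fields_to_redact:
        original = field["string"]
        if not original:
            continue  # an empty string would match everywhere
        label = "[{}]".format(field["pii_type"])
        if mode == "label":
            replacement = label
        elif mode == "mask":
            replacement = "X" * len(original)
        elif mode == "fake":
            # Fall back to the class label when no substitute is known
            replacement = fake_values.get(original, label)
        else:
            raise ValueError("Unknown redaction mode: {}".format(mode))
        text = text.replace(original, replacement)
    return text

fields = [{"string": "Omer", "pii_type": "FIRSTNAME"}]
sample = "Dear Omer, welcome back."
print(redact(sample, fields, mode="label"))
print(redact(sample, fields, mode="mask"))
print(redact(sample, fields, mode="fake", fake_values={"Omer": "Kendall"}))
```

Because the function relies on exact string replacement, it only works if the LLM returns strings that match the original text character for character.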

First we specify a system prompt that indicates to the LLM what categories it needs to redact:
```python
system_prompt = """
You are an expert model trained to redact potentially sensitive information from documents. You have been given a document to redact. Your goal is to accurately redact the sensitive information from the document. Sensitive information can be in one of the following categories:

- ACCOUNTNAME: name of an account
...
- ZIPCODE: zipcode indicating location or address
            
You are a function calling AI and should return the specific string that needs to be redacted, along with the category of sensitive information that it belongs to. If there is no sensitive information in the document, return no strings.
"""
```

Second, we'll need to provide the function call specification of the function that performs the actual redaction:
```python
tool_choice = {
  "type": "function",
  "function": {"name": "redact"}
}

tools = [
  {
    "function": {
      "name": "redact",
      "parameters": {
        "type": "object",
        "properties": {
          "fields_to_redact": {
            "type": "array",
            "items": {
              "type": "object",
              "required": [
                "string",
                "pii_type"
              ],
              "properties": {
                "string": {
                  "type": "string",
                  "description": "The exact matching string to redact. Include any whitespace or punctuation. Must be an exact string match!"
                },
                "pii_type": {
                  "enum": [
                    "ACCOUNTNAME",
                    ...
                    "ZIPCODE"
                  ],
                  "type": "string"
                }
              }
            }
          }
        }
      }
    },
    "type": "function"
  }
]
```

With the above system prompt and the function call definition, we'll invoke the LLM by passing in the text that needs to be redacted:

```python

import requests

user_prompt = "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools."

response = requests.post("https://text.octoai.run/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OCTOAI_TOKEN}"
    },
    json={
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}],
        "model": "openpipe-llama-3-8b-32k",
        "max_tokens": 512,
        "presence_penalty": 0,
        "temperature": 0,
        "top_p": 0.9,
        "peft": lora_asset_name, # you'll know what to assign this to below
        "tool_choice": tool_choice,
        "tools": tools
    }
)

# The arguments of the first tool call hold the fields to redact
print(response.json()["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"])
```
```

When the model is fine-tuned correctly, this should return the following chat completions response containing the arguments to pass into the `redact()` function call:
```json
{
  "fields_to_redact":
  [
    {
      "string": "Omer",
      "pii_type": "FIRSTNAME"
    },
    {
      "string": "78B5R2MVFAHJ48500",
      "pii_type": "VEHICLEVIN"
    }
  ]
}
```
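
In the chat completions response the `arguments` field arrives as a JSON string, so it has to be parsed before it can be passed to `redact()`. The sketch below shows one defensive way to do that. `parse_redactions` is a hypothetical helper, and dropping entries that do not appear verbatim in the source text is a design choice made here, not something the API does for you.

```python
import json

def parse_redactions(arguments, source_text):
    """Parse tool-call arguments and keep only entries found verbatim in the text."""
    try:
        parsed = json.loads(arguments)
    except (TypeError, json.JSONDecodeError):
        return []  # the model returned something that is not valid JSON
    if not isinstance(parsed, dict):
        return []
    fields = parsed.get("fields_to_redact", [])
    if not isinstance(fields, list):
        return []
    return [
        field for field in fields
        if isinstance(field, dict)
        and isinstance(field.get("string"), str)
        and field["string"]
        and field["string"] in source_text
    ]

text = "Dear Omer, your license 78B5R2MVFAHJ48500 is still registered."
raw_arguments = json.dumps({
    "fields_to_redact": [
        {"string": "Omer", "pii_type": "FIRSTNAME"},
        {"string": "Kendall", "pii_type": "FIRSTNAME"},  # not in the text, gets dropped
    ]
})
print(parse_redactions(raw_arguments, text))
```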

In [None]:
# Define the system prompt with all of the PII categories
# The nice thing about this system prompt is that it can be very easily extended
# to fit your unique use case.
system_prompt = """
You are an expert model trained to redact potentially sensitive information from documents. You have been given a document to redact. Your goal is to accurately redact the sensitive information from the document. Sensitive information can be in one of the following categories:

- ACCOUNTNAME: name of an account
- ACCOUNTNUMBER: number of an account
- AGE: a person's age
- AMOUNT: information indicating a certain monetary amount
- BIC: a business identifier code
- BITCOINADDRESS: bitcoin address, generally stored in a cryptocurrency wallet
- BUILDINGNUMBER: number of a building in a physical address
- CITY: name of a city indicating location or address
- COMPANYNAME: name of a company
- COUNTY: name of a county indicating location or address
- CREDITCARDCVV: credit card CVV
- CREDITCARDISSUER: credit card issuer
- CREDITCARDNUMBER: credit card number
- CURRENCY: currency of a balance or transaction
- CURRENCYCODE: the code of a currency (e.g. USD)
- CURRENCYNAME: name of a currency (e.g. US dollar)
- CURRENCYSYMBOL: symbol of a currency (e.g. $)
- DATE: a specific calendar date
- DOB: a specific calendar date representing birth
- EMAIL: an email ID
- ETHEREUMADDRESS: ethereum address, generally stored in a cryptocurrency wallet
- EYECOLOR: eye color, used to identify a person
- FIRSTNAME: first name of a person
- GENDER: a gender identifier
- HEIGHT: height of a person
- IBAN: international banking account number
- IP: IP address
- IPV4: IP v4 address
- IPV6: IP v6 address
- JOBAREA: job area, specialization or category
- JOBTITLE: job title
- LASTNAME: last name of a person
- LITECOINADDRESS: litecoin address, generally stored in a cryptocurrency wallet
- MAC: MAC address
- MASKEDNUMBER: masked number
- MIDDLENAME: middle name of a person
- NEARBYGPSCOORDINATE: nearby GPS coordinates
- ORDINALDIRECTION: ordinal direction (north, south, northeast, etc.)
- PASSWORD: a secure string used for authentication
- PHONEIMEI: the IMEI of a phone
- PHONENUMBER: a telephone number
- PIN: a personal identification number (PIN)
- PREFIX: prefix used to identify a person (Mr., Mrs., Dr. etc.)
- SECONDARYADDRESS: a secondary physical address
- SEX: a sex identifier (male/female)
- SSN: a social security number
- STATE: name of a state indicating location or address
- STREET: name of a street indicating location or address
- TIME: time of the day
- URL: URL of a website
- USERAGENT: user agent to identify the application, operating system, vendor etc.
- USERNAME: user name to identify user
- VEHICLEVIN: vehicle identification number or license number
- VEHICLEVRM: vehicle registration mark
- ZIPCODE: zipcode indicating location or address

You should return the specific string that needs to be redacted, along with the category of sensitive information that it belongs to. If there is no sensitive information in the document, return no strings.
"""

In [None]:
# Define the tools for the LLM to invoke

tool_choice = {
  "type": "function",
  "function": {"name": "redact"}
}

tools = [
  {
    "function": {
      "name": "redact",
      "parameters": {
        "type": "object",
        "properties": {
          "fields_to_redact": {
            "type": "array",
            "items": {
              "type": "object",
              "required": [
                "string",
                "pii_type"
              ],
              "properties": {
                "string": {
                  "type": "string",
                  "description": "The exact matching string to redact. Include any whitespace or punctuation. Must be an exact string match!"
                },
                "pii_type": {
                  "enum": [
                    "ACCOUNTNAME",
                    "ACCOUNTNUMBER",
                    "AGE",
                    "AMOUNT",
                    "BIC",
                    "BITCOINADDRESS",
                    "BUILDINGNUMBER",
                    "CITY",
                    "COMPANYNAME",
                    "COUNTY",
                    "CREDITCARDCVV",
                    "CREDITCARDISSUER",
                    "CREDITCARDNUMBER",
                    "CURRENCY",
                    "CURRENCYCODE",
                    "CURRENCYNAME",
                    "CURRENCYSYMBOL",
                    "DATE",
                    "DOB",
                    "EMAIL",
                    "ETHEREUMADDRESS",
                    "EYECOLOR",
                    "FIRSTNAME",
                    "GENDER",
                    "HEIGHT",
                    "IBAN",
                    "IP",
                    "IPV4",
                    "IPV6",
                    "JOBAREA",
                    "JOBTITLE",
                    "JOBTYPE",
                    "LASTNAME",
                    "LITECOINADDRESS",
                    "MAC",
                    "MASKEDNUMBER",
                    "MIDDLENAME",
                    "NEARBYGPSCOORDINATE",
                    "ORDINALDIRECTION",
                    "PASSWORD",
                    "PHONEIMEI",
                    "PHONENUMBER",
                    "PIN",
                    "PREFIX",
                    "SECONDARYADDRESS",
                    "SEX",
                    "SSN",
                    "STATE",
                    "STREET",
                    "TIME",
                    "URL",
                    "USERAGENT",
                    "USERNAME",
                    "VEHICLEVIN",
                    "VEHICLEVRM",
                    "ZIPCODE"
                  ],
                  "type": "string"
                }
              }
            }
          }
        }
      }
    },
    "type": "function"
  }
]

## Preparing the dataset

In the next cells we'll load the dataset from HuggingFace and build a 10k-sample fine-tuning dataset. Each sample pairs the input text (as the LLM prompt) with a reworked privacy mask (as the expected tool call response from the LLM).

In [None]:
from datasets import load_dataset

# Load the dataset from huggingface
dataset = load_dataset("ai4privacy/pii-masking-200k")

In [None]:
import json
from google.colab import files

# OpenPipe dataset size
# A larger value may improve accuracy; a smaller one makes the fine-tune faster and cheaper
TRAINING_SIZE = 10000

# Create a dataset for OpenPipe
openpipe_dataset = []
for idx, item in enumerate(dataset['train'].select(range(0, TRAINING_SIZE))):
    function_arguments = {
        "fields_to_redact": []
    }
    for i in item["privacy_mask"]:
        function_arguments["fields_to_redact"].append({
            "string": i["value"],
            "pii_type": i["label"]
        })
    dataitem = {
        "messages": [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": item["source_text"]
            },
            {
                "role": "assistant",
                "content": None,
                "tool_calls":
                    [
                        {
                            "id":"",
                            "type":"function",
                            "function":
                            {
                                "name": "redact",
                                "arguments": json.dumps(function_arguments, indent=2)
                            }
                        }
                    ]
            },
        ],
        "tools": tools,
        "tool_choice": tool_choice
    }
    openpipe_dataset.append(dataitem)

with open('openpipe_dataset.jsonl', 'w') as outfile:
    for entry in openpipe_dataset:
        json.dump(entry, outfile)
        outfile.write('\n')

# This will let you download the file on your browser if you're using Google CoLab
files.download('openpipe_dataset.jsonl')

# 2. Fine tune the LLM

We'll use OpenPipe for this step.

## Upload the dataset to OpenPipe

Check your downloads folder: you should find an `openpipe_dataset.jsonl` file in there.
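
Before uploading, it can be worth checking that every line of the file is valid JSON and has the layout built in the previous cell. The sketch below only checks the structure produced by this notebook (a system, user and assistant message, with JSON tool call arguments). `check_jsonl_lines` is a hypothetical helper and is not a substitute for OpenPipe's own validation.

```python
import json

def check_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, "invalid JSON: {}".format(err.msg)))
            continue
        messages = entry.get("messages", []) if isinstance(entry, dict) else []
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != ["system", "user", "assistant"]:
            problems.append((number, "unexpected roles: {}".format(roles)))
            continue
        try:
            # The assistant message must carry parseable tool call arguments
            arguments = messages[-1]["tool_calls"][0]["function"]["arguments"]
            json.loads(arguments)
        except (KeyError, IndexError, TypeError, json.JSONDecodeError):
            problems.append((number, "tool call arguments are missing or not valid JSON"))
    return problems

# Usage on the file written above (uncomment in the notebook):
# with open("openpipe_dataset.jsonl") as infile:
#     print(check_jsonl_lines(infile) or "No problems found")
```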

Now follow the instructions on [this page](https://docs.openpipe.ai/features/exporting-data#dataset-export) to upload your dataset to OpenPipe.

1. On your OpenPipe console, click on "Datasets" in the bar on the left.
2. Click on the "+ New Dataset" button at the top right of the window.
3. Click on the "Upload Data" button at the top left of the window.
4. Drop the jsonl file that was just downloaded in the "Upload File" window.
5. Click on the Upload button and wait for the dataset to finish uploading.

You'll get the confirmation window below if the dataset gets successfully uploaded.

![upload confirmation window](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/openpipe_dataset_uploaded.png)


You'll see your dataset entries under the "Dataset view": 10,000 of them, which should have been split into 9,000 training and 1,000 test entries.

![dataset view](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/dataset_view.png)

Hit "Settings", and rename your Dataset if need be.

## Launch a fine-tune

Under the Dataset view on the OpenPipe console, you can launch a fine tune by clicking on the "Fine Tune" button at the top right of the window.

You can define a model ID - this lets us uniquely identify the resulting fine tuned model.

Next you can choose your base model:
* You can choose between open source models (Llama, Mistral) or closed source models (GPT). Selecting an open source model gives you ownership of the weights, and lets you deploy the model on the platform of your choice. For this notebook we'll select the "Llama 3 8B 32K" model.
* You'll see that 10k training samples give us a good working set size.
* Finally you can choose to tweak the advanced options but we'll leave them as-is.

![fine tuning settings](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/finetuning_launch.png)

Let's go ahead and hit "Start Training" to kick off the fine-tuning job.

# 3. Deploy the fine-tuned LLM

In this section we'll export the model weights in LoRA FP16 form to be hosted on OctoAI.

You'll be informed that the fine tune job has completed by email. Once you've been notified you can proceed with the steps below.

## Export your model weights

Access your fine-tuned model by clicking on "Fine Tune" on the OpenPipe console.
Click on the model that was just fine tuned.

![model fine tune](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/model_finetune.png)

At the bottom of the page, you can select the format that the model weights get exported in. Select "LoRA:FP16" under the "Format" drop-down and hit Export Weights.

It will take a couple of minutes until your model weights are ready for download. When the weights are ready, right click on "Download Weights" to copy the link to the weights and set the URL aside, which we'll need in the next step to set `lora_url`.

![fine tune download link](https://raw.githubusercontent.com/tmoreau89/image-assets/main/fine_tune/model_finetune_link.png)

## Upload the LoRA to OctoAI


First, we'll install the `octoai` CLI (and the jq library for later use).



In [None]:
%%capture
!curl https://s3.amazonaws.com/downloads.octoai.cloud/octoai/install_octoai_cli_and_sdk.sh -sSfL | sh
!apt-get install jq


Next, log in to the CLI with your token. If you don't have one, you can follow the instructions under the pre-requisites section.

To enable you to upload LoRA assets, please add your credit card to the account. Note that you have $10 in free credits, which should more than cover test usage.

In [None]:
!octoai login

Below, you'll need to set the `lora_url` to the URL you just copied from the "Download Weights" link in the previous step.

In [None]:
import random

# Llama-3-8B-Instruct (32K token context)
checkpoint_name = "octoai:openpipe-llama-3-8b-32k"
lora_url = "SET ME"
assert lora_url != "SET ME", "Please update lora_url"

# Define an asset name on OctoAI to uniquely identify the LoRA
lora_asset_name = "pii-redaction-finetune-{}".format(str(random.randint(1000, 999999)))

# Set checkpoint name and LoRA URL as env vars
%env checkpoint_name=$checkpoint_name
%env lora_url=$lora_url
%env lora_asset_name=$lora_asset_name

Below, we upload the LoRA to OctoAI.

We need to specify what base checkpoint and architecture ("engine") the model corresponds to.

The command below uses "--upload-from-url" which lets you upload these files from the OpenPipe download URL. Note also that there is an "upload-from-dir" that lets you specify a local directory if you've downloaded the LoRA zip file on your local drive.

The "--wait" flag blocks until the upload has completed, which makes scripting possible.

In [None]:
%%bash

octoai asset create \
  --checkpoint "$checkpoint_name" \
  --format safetensors \
  --type lora \
  --engine text/llama-3-8b \
  --name "$lora_asset_name" \
  --data-type fp16 \
  --upload-from-url "$lora_url" \
  --wait

You can double check that the LoRA got added using the "octoai asset get" command:

In [None]:
!octoai asset get -n $lora_asset_name

## Sanity checking that the fine-tune is running on OctoAI

Let's go ahead and use OctoAI to run a test inference with the code below. We do this by supplying the LoRA name as the `peft` parameter (a LoRA is a type of Parameter-Efficient Fine Tune) when making a call to the model.

In [None]:
import requests

print("Using asset ", lora_asset_name)

test_prompt = "Dear Omer, as per our records, your license 78B5R2MVFAHJ48500 is still registered in our records for access to the educational tools. Please feedback on it's operability."

messages=[
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": test_prompt
    },
]

req = requests.post("https://text.octoai.run/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {OCTOAI_TOKEN}"
    },
    json={
        "messages": messages,
        "model": "openpipe-llama-3-8b-32k",
        "max_tokens": 512,
        "presence_penalty": 0,
        "temperature": 0,
        "top_p": 0.9,
        "peft": lora_asset_name,
        "tool_choice": tool_choice,
        "tools": tools
    }
)

print(json.dumps(req.json(),indent=4))

The output looks good! Let's now run a more exhaustive set of quality evaluations to see where this model stands next to very capable LLMs like GPT-4.

# 4. Quality Evaluations

In this section we'll run more extensive quality tests between our fine-tuned model and OpenAI's GPT-4-Turbo.

While the other parts of this tutorial were covered by the \$100 fine-tuning credits on OpenPipe and \$10 inference credits on OctoAI, you'll have to spend a bit on OpenAI inference to obtain the full set of comparative results.

We estimate that running this evaluation on 1000 test samples should cost you about \$10 in GPT-4-Turbo usage. If you wish to spend less, you can simply reduce the size of the test set (e.g. 100 samples should run you \$1).

## LLM inferences

In the code below we'll evaluate our fine-tune against GPT-4-Turbo on 1000 test samples. Feel free to scale the evaluation down to a smaller number (e.g. 100), since it takes some time to generate the responses.

Most importantly, we need to make sure that the test samples are not taken from the data used to fine-tune our LoRA. This is why we start at offset 10,000 (our fine-tuning dataset contains 10,000 samples).
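
The train/test separation can be checked with a few lines of Python. This is a small sketch that repeats the `TRAINING_SIZE` and `TEST_SIZE` values used in this notebook. It only verifies that the row indices do not overlap, and assumes the dataset does not contain the same text at two different indices.

```python
TRAINING_SIZE = 10000  # same value as in the dataset preparation cell
TEST_SIZE = 1000       # same value as in the evaluation cell below

train_indices = range(0, TRAINING_SIZE)
test_indices = range(TRAINING_SIZE, TRAINING_SIZE + TEST_SIZE)

# No row index may appear in both ranges
overlap = set(train_indices) & set(test_indices)
assert not overlap, "Test samples overlap with the fine-tuning data"

print("Train rows: {}..{}".format(train_indices[0], train_indices[-1]))
print("Test rows:  {}..{}".format(test_indices[0], test_indices[-1]))
```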

In [None]:
import concurrent.futures

# Max concurrent threads
MAX_THREADS = 10

# Test size
TEST_SIZE = 1000

# Run the LLM prediction
def predict_redaction(text, model):
    model_args = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}],
        "max_tokens": 512,
        "temperature": 0,
        "tool_choice": tool_choice,
        "tools": tools,
    }

    if model.startswith("gpt"):
        # Run on OpenAI
        from openai import OpenAI
        client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
        response = client.chat.completions.create(**model_args)
        try:
            return json.loads(response.choices[0].message.tool_calls[0].function.arguments)
        except Exception:
            # Missing tool call or malformed JSON: treat as no prediction
            return None
    else:
        # Run on OctoAI
        model_args["model"] = "openpipe-llama-3-8b-32k"
        model_args["peft"] = model
        response = requests.post(
            "https://text.octoai.run/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer {}".format(OCTOAI_TOKEN)
            },
            json=model_args
        )
        try:
            tool_call = response.json()["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
            return json.loads(tool_call)
        except Exception:
            # Missing tool call or malformed JSON: treat as no prediction
            return None

# Task to run in parallel
def task(idx, item):
    print("Evaluating test sample {}".format(idx))
    entry = {
      "input": item["source_text"],
      "output": {
         "fields_to_redact": []
      }
    }
    for i in item["privacy_mask"]:
        entry["output"]["fields_to_redact"].append({
            "string": i["value"],
            "pii_type": i["label"]
        })
    entry["gpt4_output"] = predict_redaction(item["source_text"], "gpt-4-0125-preview")
    entry["finetune_output"] = predict_redaction(item["source_text"], lora_asset_name)
    return entry

# We store the results in test_data
test_data = []

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    # Submit tasks to the executor
    futures = [
        executor.submit(task, idx, item) for idx, item in enumerate(dataset['train'].select(range(TRAINING_SIZE, TRAINING_SIZE+TEST_SIZE)))
    ]
    # Collect the results
    test_data = [future.result() for future in concurrent.futures.as_completed(futures)]

# Show test results
print(test_data)



## Quality Metric

All quality evaluations start by defining a quality metric. In our case, we already have a labeled dataset from AI4Privacy, which we can use as our evaluation ground truth.

We introduce a scoring system that works fairly simply. Each PII that needs to be redacted is represented as a pair containing:
* The PII string itself, e.g. `5943919109159496`
* The PII class, e.g. `CREDITCARDNUMBER`

We use `SequenceMatcher` from Python's `difflib` module to obtain a similarity score between the ground truth PII and the one inferred by the LLMs.

If the PII string and class match perfectly, we get a score of 1.0. If any information starts to diverge (e.g. the LLM classifies PII as `MIDDLENAME` instead of `FIRSTNAME`), the score becomes lower, but is not 0.
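
To make this concrete, here is a small standalone sketch of the pairwise similarity. `pair_similarity` is an illustrative name that mirrors the `similar()` function defined in the next cell, and the sample values are made up for this example.

```python
from difflib import SequenceMatcher

def pair_similarity(a, b):
    """Compare two (string, pii_type) pairs as flat strings."""
    a_string = "{}, {}".format(a["string"], a["pii_type"])
    b_string = "{}, {}".format(b["string"], b["pii_type"])
    return SequenceMatcher(None, a_string, b_string).ratio()

truth = {"string": "Omer", "pii_type": "FIRSTNAME"}
exact = {"string": "Omer", "pii_type": "FIRSTNAME"}
wrong_class = {"string": "Omer", "pii_type": "MIDDLENAME"}

# Identical pairs score 1.0
print(pair_similarity(truth, exact))
# Same string but a different class: lower than 1.0, but above 0
print(pair_similarity(truth, wrong_class))
```

Partial credit of this kind means a near miss is penalized less than a missing or invented PII entry.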

In [None]:
from difflib import SequenceMatcher

def similar(a, b):
    a_string = "{}, {}".format(a['string'], a['pii_type'])
    b_string = "{}, {}".format(b['string'], b['pii_type'])
    return SequenceMatcher(None, a_string, b_string).ratio()

def derive_score(ref, test):
    # Score a model's predicted redactions (test) against the ground truth (ref)
    final_score = 0
    if test and "fields_to_redact" in test:
        # sum of all of the best similarity scores across the test PII
        score = 0
        for t in test["fields_to_redact"]:
            # we retain the best similarity score across all pairwise PII comparisons
            best_score = 0
            for r in ref["fields_to_redact"]:
                sim_score = similar(r, t)
                if sim_score > best_score:
                    best_score = sim_score
            score += best_score
        # divide the sum by the max of the number of PII entries in reference data and test data
        # this is a simple formula to introduce a penalty in case we have a false positive or false negative
        num_entries = max(len(ref["fields_to_redact"]), len(test["fields_to_redact"]))
        # when neither side lists any PII, the prediction is trivially correct (and we avoid dividing by zero)
        final_score = score / num_entries if num_entries else 1.0
    return final_score

gpt4_total_score = 0
finetune_total_score = 0

for elem in test_data:
    # Retrieve pii masks
    ref_output = elem["output"]
    gpt4_output = elem["gpt4_output"]
    ft_output = elem["finetune_output"]
    print("Ground truth")
    print("\t{}".format(ref_output))
    # Assess GPT4 accuracy
    print("Eval GPT4")
    print("\t{}".format(gpt4_output))
    score = derive_score(ref_output, gpt4_output)
    print("\tScore = {}".format(score))
    gpt4_total_score += score
    # Assess fine-tune accuracy
    print("Eval fine-tune")
    print("\t{}".format(ft_output))
    score = derive_score(ref_output, ft_output)
    print("\tScore = {}".format(score))
    finetune_total_score += score
    print("\n\n")


Take a look at the scores now for GPT-4-Turbo vs. your fine-tune!

GPT-4 costs \$30.00 / 1M output tokens, while Llama-3-8B on OctoAI costs \$0.15 / 1M output tokens.

Essentially with fine tuning you're getting a 200x cost reduction, while improving overall quality of processing significantly!
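
The arithmetic behind these numbers is easy to reproduce. The sketch below uses only the prices quoted in this notebook, which may have changed since it was written.

```python
GPT4_TURBO_PRICE = 30.00  # USD per 1M output tokens, as quoted above
LLAMA3_8B_PRICE = 0.15    # USD per 1M output tokens on OctoAI, as quoted above
OCTOAI_CREDITS = 10.00    # USD in free sign-up credits

# How many times cheaper the fine-tuned model is per output token
cost_ratio = GPT4_TURBO_PRICE / LLAMA3_8B_PRICE

# How many millions of Llama-3-8B tokens the free credits cover
tokens_millions = OCTOAI_CREDITS / LLAMA3_8B_PRICE

print("Cost ratio: {:.0f}x".format(cost_ratio))
print("Tokens covered by credits: {:.1f}M".format(tokens_millions))
```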

In [None]:
print("GPT-4-Turbo Total Score: {:.3f}".format(gpt4_total_score/TEST_SIZE))
print("Fine-tune Total Score: {:.3f}".format(finetune_total_score/TEST_SIZE))