The [last post in this series](https://mlops.systems/posts/2024-06-15-isafpr-first-finetune.html) showed that finetuning an LLM needn't be
particularly difficult. I used `axolotl` to produce finetuned versions of
Llama3, Mistral and TinyLlama models. During the course we were given a bunch of
credits by various companies in the LLM and finetuning space. Among those were
credits from some finetuning-as-a-service companies and, having done the
process manually a few times, I thought this would be a good time to try those
services out.

I picked three to try out: Predibase, OpenPipe and OpenAI. All were surprisingly
similar in the approach they took. I'll give a few details on the experience for
each and how they compare to each other. With all the services, the process was
roughly the same as when I did it manually:

1. Upload custom data
2. Select some hyperparameters
3. Start the finetuning
4. Try the model

The step I had the most trouble with was the custom data upload, since each
provider wanted the data in a different format. Converting the data from the
Pydantic models I had previously created was not a huge deal, but I wasn't sure
about the tradeoffs that I was making (or that were being made for me) by
converting my data into these formats.
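To give a rough idea of what that conversion involves, here is a minimal sketch that turns one labelled press release into the chat-style JSONL record that OpenAI's finetuning accepts (one JSON object per line, each with a `messages` list). The `LabelledRelease` dataclass, its field names and the example values are hypothetical stand-ins for my Pydantic models, not the actual schema, and the system prompt is shortened.

```python
import json
from dataclasses import dataclass, asdict

# Shortened stand-in for the full system prompt used later in this post.
SYSTEM_PROMPT = "You are an expert at identifying events in a press release."


@dataclass
class LabelledRelease:
    """Hypothetical stand-in for a Pydantic model holding one labelled example."""
    text: str
    event_type: list
    province: list
    target_group: list


def to_chat_record(release, system_prompt=SYSTEM_PROMPT):
    """Build one chat-format training record: system, user, then assistant."""
    labels = asdict(release)
    text = labels.pop("text")  # the press release is the input, not a label
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
            # The target output is the labels serialised as a JSON string.
            {"role": "assistant", "content": json.dumps(labels)},
        ]
    }


def to_jsonl(releases):
    """Serialise the records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(to_chat_record(r)) for r in releases)


example = LabelledRelease(
    text="A combined Afghan and coalition security force detained two suspected insurgents.",
    event_type=["detention"],
    province=["badakhshan"],
    target_group=["haqqani"],
)
print(to_jsonl([example]))
```

The tradeoff I mentioned shows up in `to_chat_record`: the structured labels get flattened into a string in the assistant turn, so any validation the original models gave you has to be reapplied to the model's output later.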

## Predibase

I started with [Predibase](https://predibase.com/) since I had enjoyed the talk
Travis Addair had given during the course. Predibase is famous for their work on
LoRA adapters, particularly their [LoRA
Land](https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4)
demonstration, which gave some examples of how finetuned LoRA adapters could
outperform GPT-4.

Predibase requires that the data you upload has certain column names depending
on the task you select for the finetuning. At the moment they have instruction
tuning and text completion as their two tasks, but it wasn't clear to me which
to select. (They also have [a Colab notebook](https://colab.research.google.com/drive/1r505Aq_SWZdaSkBIs3ovh4F8c36DHwAh?usp=sharing) to help with constructing splits from your
data.)
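For illustration, here is a minimal sketch of building such a file with the standard library. It assumes the instruction-tuning task wants `prompt` and `completion` columns and that an optional `split` column takes the values `train` and `evaluation`; that is my reading of the Predibase docs, so check the current requirements before relying on it. The prompt template mirrors the one I use for inference later in this post, and `build_rows` / `rows_to_csv` are hypothetical helper names.

```python
import csv
import io
import random

# Same layout as the inference prompt used later in this post.
PROMPT_TEMPLATE = (
    "{system}\n\n### Instruction:\n\nPRESS RELEASE TEXT: '{text}'\n\n### Response:\n"
)


def build_rows(examples, system, eval_fraction=0.1, seed=42):
    """Turn (press_release_text, completion) pairs into rows with a split label.

    Column names and split values are assumptions about what Predibase expects.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rows = []
    for text, completion in examples:
        split = "evaluation" if rng.random() < eval_fraction else "train"
        rows.append(
            {
                "prompt": PROMPT_TEMPLATE.format(system=system, text=text),
                "completion": completion,
                "split": split,
            }
        )
    return rows


def rows_to_csv(rows):
    """Write the rows to CSV text; the csv module quotes multi-line prompts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion", "split"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(
    [("A security force detained two suspected insurgents.", '{"event_type": ["detention"]}')],
    system="You are an expert at identifying events in a press release.",
)
print(rows_to_csv(rows))
```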

Once your data is ready and validated, you can select the model you want to
finetune along with a few other hyperparameters. This is the full extent of what
you can set from the web UI:

![Screenshot of Predibase website and the hyperparameters you can set](images/predibase-hyperparameters.png)

There's also a helpful dataset preview pane to give a final sanity check for
your data, to make sure that the inputs and outputs look like what you'd expect:

![Screenshot of Predibase website and the dataset preview
pane](images/predibase-dataset-preview.png)

As you'll read in a little bit, this feature helps catch potentially costly
errors before you start the finetuning process.

Once you click the button to start the training, there isn't a great deal of
information available to you beyond (eventually) a loss curve.
I chose to finetune Qwen2 in Predibase and this took about 53 minutes on an
A100 GPU.

Once your model is ready, you can prompt it in the UI or through their
REST API / Python SDK. They give code snippets prefilled with some dummy text
that you can easily try out locally. Before you can run an inference query,
though, you first have to deploy the model. I hadn't expected
this extra step, and it takes a while to spin up since it's deploying the
adapter along with the base model it was finetuned from. My Qwen2 model has
a context window of 131072 tokens and supposedly would cost $3.90 per hour that
it was up (as a dedicated deployment).
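To put that hourly figure in perspective, here is a back-of-the-envelope sketch. It assumes the cost scales linearly with the time the deployment is up, which is my assumption rather than anything I confirmed about how Predibase bills; `deployment_cost` is just an illustrative helper.

```python
HOURLY_RATE_USD = 3.90  # the rate quoted for my dedicated Qwen2 deployment


def deployment_cost(hours_up, hourly_rate=HOURLY_RATE_USD):
    """Estimated cost in USD, assuming simple linear per-hour billing."""
    return round(hours_up * hourly_rate, 2)


for label, hours in [("one hour", 1), ("a working day", 8), ("a full week", 24 * 7)]:
    print(f"{label}: ${deployment_cost(hours):.2f}")
```

Under that assumption, forgetting to tear the deployment down for a week costs far more than the finetuning run itself took in GPU time.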

Let's show the results we got:

In [None]:
pr1 = """2011-11-S-011 ISAF Joint Command - Afghanistan For Immediate Release
      KABUL, Afghanistan (Nov. 7, 2011) — A combined Afghan and coalition
      security force conducted an operation in search of a Haqqani facilitator
      in Argo district, Badakshan province. The facilitator coordinates suicide
      attacks with other insurgent leaders in the area. During the operation, a
      local national male failed to comply with repeated verbal warnings and
      displayed hostile intent toward the security force. The security force
      engaged the individual, resulting in his death. The security force
      confiscated a shotgun and intelligence linking the local national to the
      Haqqani network. The security force also detained two suspected insurgents during the operation."""

prompt = f"""You are an expert at identifying events in a press release. You are precise and always make sure you are correct, drawing inference from the text of the press release. event_types = ['airstrike', 'detention', 'captureandkill', 'insurgentskilled', 'exchangeoffire', 'civiliancasualty'], provinces = ['badakhshan', 'badghis', 'baghlan', 'balkh', 'bamyan', 'day_kundi', 'farah', 'faryab', 'ghazni', 'ghor', 'helmand', 'herat', 'jowzjan', 'kabul', 'kandahar', 'kapisa', 'khost', 'kunar', 'kunduz', 'laghman', 'logar', 'nangarhar', 'nimroz', 'nuristan', 'paktya', 'paktika', 'panjshir', 'parwan', 'samangan', 'sar_e_pul', 'takhar', 'uruzgan', 'wardak', 'zabul'], target_groups = ['taliban', 'haqqani', 'criminals', 'aq', 'hig', 'let', 'imu', 'judq', 'iju', 'hik', 'ttp', 'other']

### Instruction:

PRESS RELEASE TEXT: '{pr1}'

### Response:
"""

In [None]:
import os
from predibase import Predibase

# Read the API token from the environment rather than hardcoding it
pb = Predibase(api_token=os.getenv("PREDIBASE_API_KEY"))

# Get a client for the deployment (mine is named "isafpr") and send the prompt
lorax_client = pb.deployments.client("isafpr")
print(lorax_client.generate(prompt, max_new_tokens=100).generated_text)

SOME TEXT GOES HERE

## OpenAI

I was actually surprised that this is even a thing that people do or that is
offered by OpenAI. Currently you're able to finetune three versions of GPT-3.5 as
well as `babbage-002` and `davinci-002`. In the OpenAI presentation during the
course they mentioned that they were working to make it possible to finetune
GPT-4 as well, but no timeline was given on this.

So why would someone want to finetune GPT-3.5? I think there are some problems
that are sufficiently complex, or of a specific enough nature, that the OpenAI
GPT family shines and the open LLMs just aren't there yet. In those cases you
might want to squeeze out a final bit of performance.

The OpenAI models are sort of the antithesis of an
'open' model and nothing about the finetuning process lent itself to disabusing
you of that idea. This is the form you fill in to finetune a model and, as
you can see, there aren't many options available to you.

![OpenAI Finetuning UI](/images/openai-finetuning-ui.png)

Supposedly the data you upload (options for train as well as a separate test set
here) will never be used by OpenAI to train their models but you have to just
trust them on that front.
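Before uploading the train and test files, it can be worth a quick local sanity check of the JSONL. The sketch below assumes the chat finetuning format (each line is a JSON object with a `messages` list of `role` / `content` pairs); the specific checks, including the rule that the last message should come from the assistant, are my own conservative assumptions rather than OpenAI's official validation, and `validate_chat_jsonl` is a hypothetical helper.

```python
import json

# Conservative assumption: plain chat data only uses these three roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(text):
    """Return a list of human-readable problems found in chat-format JSONL text."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: missing or empty 'messages' list")
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append(f"line {lineno}: message with unexpected role")
            elif not isinstance(message.get("content"), str):
                problems.append(f"line {lineno}: message content is not a string")
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append(f"line {lineno}: last message is not from the assistant")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "press release text"},
    {"role": "assistant", "content": "{}"},
]})
print(validate_chat_jsonl(good + "\n" + "not json"))
```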

![UI metadata during finetuning 1](/images/openai-finetuning-ui-2.png)
![UI metadata during finetuning 2](/images/openai-finetuning-ui-3.png)

As with Predibase, during finetuning you don't have access to any logs or even
too much feedback during training. You get a loss curve and a few scraps of
metadata and that's it. The training took around 90 minutes to run and then
you're able to prompt the model to see how it works, using the standard OpenAI
interface and methods you're used to:

In [2]:
from openai import OpenAI
from rich import print
import json
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-SOME_EXTRA_STUFF_HERE_FOR_MY_MODEL",
    messages=[
        {
            "role": "system",
            "content": "You are an expert at identifying events in a press release. You are precise and always make sure you are correct, drawing inference from the text of the press release. event_types = ['airstrike', 'detention', 'captureandkill', 'insurgentskilled', 'exchangeoffire', 'civiliancasualty'], provinces = ['badakhshan', 'badghis', 'baghlan', 'balkh', 'bamyan', 'day_kundi', 'farah', 'faryab', 'ghazni', 'ghor', 'helmand', 'herat', 'jowzjan', 'kabul', 'kandahar', 'kapisa', 'khost', 'kunar', 'kunduz', 'laghman', 'logar', 'nangarhar', 'nimroz', 'nuristan', 'paktya', 'paktika', 'panjshir', 'parwan', 'samangan', 'sar_e_pul', 'takhar', 'uruzgan', 'wardak', 'zabul'], target_groups = ['taliban', 'haqqani', 'criminals', 'aq', 'hig', 'let', 'imu', 'judq', 'iju', 'hik', 'ttp', 'other']."
        },
        {
            "role": "user",
            "content": pr1
        }
    ],
    temperature=0
)

print(json.loads(response.choices[0].message.content))

They also give you an interface to see the response of the base model
side-by-side against the finetuned model:

![Side-by-side UI of base model and finetuned model inference](/images/openai-finetuning-ui-4.png)

As you can see, it's done pretty well! It stuck to the JSON structure, and the
extracted metadata looks good. Of course, since this is a GPT-3.5 model, there's
no way to download it and run it locally. You're hostage to OpenAI,
to being online and so on. Not a scenario I'd like to be in, so I don't think
I'll pursue this much further; I'd rather use my OpenAI credits for other
purposes.

All that said, I do think there might be some scenarios where only the OpenAI
models are reliable enough to use (be that in terms of accuracy or with
sufficient guardrails) and there were people in the course who were in this
boat.
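Eyeballing a single response only goes so far, so a programmatic check is a natural next step whichever hosted model you use. Here is a minimal sketch, assuming the model returns a JSON object with `event_type` and `target_group` keys; those key names are hypothetical, while the allowed values are copied from the system prompt above. The bare `json.loads` call in my snippets would raise on malformed output, whereas this version reports problems instead.

```python
import json

# Allowed values copied from the system prompt; the key names are assumptions.
ALLOWED = {
    "event_type": {
        "airstrike", "detention", "captureandkill",
        "insurgentskilled", "exchangeoffire", "civiliancasualty",
    },
    "target_group": {
        "taliban", "haqqani", "criminals", "aq", "hig", "let",
        "imu", "judq", "iju", "hik", "ttp", "other",
    },
}


def check_prediction(raw):
    """Parse a model response and list any values outside the allowed sets.

    Returns (parsed_dict_or_None, problems).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    problems = []
    for key, allowed in ALLOWED.items():
        values = data.get(key, [])
        if isinstance(values, str):
            values = [values]  # accept a single string as well as a list
        if not isinstance(values, list):
            problems.append(f"{key}: expected a string or list")
            continue
        for value in values:
            if not isinstance(value, str) or value not in allowed:
                problems.append(f"{key}: unexpected value {value!r}")
    return data, problems


print(check_prediction('{"event_type": ["detention"], "target_group": ["haqqani"]}'))
print(check_prediction('{"event_type": ["raid"]}'))
```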

## OpenPipe

In [None]:
# pip install openpipe

from openpipe import OpenAI
from rich import print
import json
import os

client = OpenAI(
  openpipe={"api_key": os.getenv("OPENPIPE_API_KEY")}
)

completion = client.chat.completions.create(
    model="openpipe:fine-steaks-taste",
    messages=[
        {
            "role": "system",
            "content": "You are an expert at identifying events in a press release. You are precise and always make sure you are correct, drawing inference from the text of the press release. event_types = ['airstrike', 'detention', 'captureandkill', 'insurgentskilled', 'exchangeoffire', 'civiliancasualty'], provinces = ['badakhshan', 'badghis', 'baghlan', 'balkh', 'bamyan', 'day_kundi', 'farah', 'faryab', 'ghazni', 'ghor', 'helmand', 'herat', 'jowzjan', 'kabul', 'kandahar', 'kapisa', 'khost', 'kunar', 'kunduz', 'laghman', 'logar', 'nangarhar', 'nimroz', 'nuristan', 'paktya', 'paktika', 'panjshir', 'parwan', 'samangan', 'sar_e_pul', 'takhar', 'uruzgan', 'wardak', 'zabul'], target_groups = ['taliban', 'haqqani', 'criminals', 'aq', 'hig', 'let', 'imu', 'judq', 'iju', 'hik', 'ttp', 'other']."
        },
        {
            "role": "user",
            "content": pr1
        }
    ],
    temperature=0,
    openpipe={
        "tags": {
            "prompt_id": "counting",
            "any_key": "any_value"
        }
    },
)

print(json.loads(completion.choices[0].message.content))