# Fine-tuning using LLM Engine

Fine-tuning improves model performance by training on specific examples of prompts and desired responses. LLMs are initially trained on large, general-purpose text corpora, much of it collected from the internet. With fine-tuning, an LLM can be optimized for a specific domain by learning from examples in that domain. Smaller LLMs that have been fine-tuned on a specific use case often outperform larger, more generally trained ones on that use case.

In this notebook, we will demonstrate fine-tuning open source models in order to classify emails into two categories, based on their content.

## Preparing the Data

We will prepare 950 examples to fine-tune on and hold out 50 examples to test the performance of the fine-tuned model.

In [None]:
%pip install scikit-learn

In [1]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
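df = pd.DataFrame(zip(texts, labels), columns=['raw_prompt', 'response'])[:1000]
# Copy the slices so that later column assignments do not raise SettingWithCopyWarning.
df_train = df[:950].copy()
df_test = df[950:].copy()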

Let's first take a look at our data:

In [2]:
df_train['raw_prompt'].iloc[0]

"From: dougb@comm.mot.com (Doug Bank)\nSubject: Re: Info needed for Cleveland tickets\nReply-To: dougb@ecs.comm.mot.com\nOrganization: Motorola Land Mobile Products Sector\nDistribution: usa\nNntp-Posting-Host: 145.1.146.35\nLines: 17\n\nIn article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:\n\n|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.\n|> Does anybody know if the Tribe will be in town on those dates, and\n|> if so, who're they playing and if tickets are available?\n\nThe tribe will be in town from April 16 to the 19th.\nThere are ALWAYS tickets available! (Though they are playing Toronto,\nand many Toronto fans make the trip to Cleveland as it is easier to\nget tickets in Cleveland than in Toronto.  Either way, I seriously\ndoubt they will sell out until the end of the season.)\n\n-- \nDoug Bank                       Private Systems Division\ndougb@ecs.comm.mot.com          Motorola Communications Sect

In [3]:
df_train['response'].value_counts()

response
baseball    498
hockey      452
Name: count, dtype: int64

In [4]:
df_test['response'].value_counts()

response
baseball    25
hockey      25
Name: count, dtype: int64

Since we are training a text generation model, let's do a bit of (extremely basic) prompt engineering to use the model for classification.

In [5]:
def build_prompt(text: str) -> str:
    # Wrap the raw email in the template used for both training and prediction.
    return f"Prompt: {text}\nCategory: "

def prepare_df(df: pd.DataFrame) -> None:
    # Replace the raw text column with the templated prompt, modifying df in place.
    df['prompt'] = df['raw_prompt'].apply(build_prompt)
    df.drop('raw_prompt', axis=1, inplace=True)
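
To make the template concrete, here is a minimal self-contained sketch that applies `build_prompt` (redefined here exactly as in the cell above) to a made-up email. The sample text is purely illustrative and does not come from the dataset. The label the model learns to produce is whatever follows `Category: `.

```python
def build_prompt(text: str) -> str:
    # Same template as the notebook cell above.
    return f"Prompt: {text}\nCategory: "

# Hypothetical sample text, for illustration only.
sample = "Subject: Who is pitching tonight?"
prompt = build_prompt(sample)

# repr() makes the embedded newline and the trailing space visible.
print(repr(prompt))
```

Note the trailing space after `Category:`. The same template has to be used at prediction time, so any change here should be mirrored there.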

In [None]:
prepare_df(df_train)

In [7]:
df_train.head()

| | response | prompt |
|---|---|---|
| 0 | baseball | `Prompt: From: dougb@comm.mot.com (Doug Bank)\n...` |
| 1 | hockey | `Prompt: From: gld@cunixb.cc.columbia.edu (Gary...` |
| 2 | baseball | `Prompt: From: rudy@netcom.com (Rudy Wade)\nSub...` |
| 3 | hockey | `Prompt: From: monack@helium.gas.uug.arizona.ed...` |
| 4 | baseball | `Prompt: Subject: Let it be Known\nFrom: <ISSBT...` |


The data needs to end up in a CSV file that has two columns: `prompt` and `response`, and that is publicly accessible.

In [None]:
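# index=False keeps the file to exactly the two expected columns (no extra index column).
df_train.to_csv("sports_training_dataset.csv", index=False)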
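
Before uploading, it can help to confirm that the file really has the `prompt` and `response` columns. The following is a rough sketch using only the standard library. The helper name `missing_columns` and the tiny inline CSV are assumptions made for illustration and are not part of LLM Engine.

```python
import csv
import io

REQUIRED_COLUMNS = {"prompt", "response"}

def missing_columns(csv_text: str) -> set:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header

# Tiny illustrative CSV, not the real training file.
example_csv = 'response,prompt\nbaseball,"Prompt: some text\nCategory: "\n'
print(missing_columns(example_csv))
```

For the real file, the text could be read with `open("sports_training_dataset.csv", newline="").read()` and passed to the same helper. An empty set means both columns are present.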

Currently, data must be uploaded to a publicly accessible web URL (HTTP or HTTPS) so that it can be read for fine-tuning. Support for privately sharing data with the LLM Engine API is coming shortly. For quick iteration, tools like Pastebin or GitHub Gists can host your CSV files publicly. We created an example GitHub Gist, which you can see [here](https://gist.github.com/tigss/7cec73251a37de72756a3b15eace9965). To use the gist, use the URL you get when you click the "Raw" button ([URL](https://gist.githubusercontent.com/tigss/7cec73251a37de72756a3b15eace9965/raw/85d9742890e1e6b0c06468507292893b820c13c9/llm_sample_data.csv)).

We've uploaded our CSV file to `s3://scale-demo-datasets/sports/sports_training_dataset.csv`, which maps to a URL of `https://scale-demo-datasets.s3.us-west-2.amazonaws.com/sports/sports_training_dataset.csv`.
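
The S3-to-URL mapping above follows the virtual-hosted-style pattern `https://<bucket>.s3.<region>.amazonaws.com/<key>`. As a minimal sketch, the helpers below (`s3_to_https` and `is_http_url`, both hypothetical names) build that URL and check that a URL uses one of the supported schemes. They only manipulate strings. They do not verify that the object exists or is publicly readable, and the region must be supplied by you.

```python
from urllib.parse import urlparse

def s3_to_https(s3_uri: str, region: str) -> str:
    """Map s3://bucket/key to a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri}")
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com{parsed.path}"

def is_http_url(url: str) -> bool:
    """True if the URL uses HTTP or HTTPS, the schemes the text lists as supported."""
    return urlparse(url).scheme in ("http", "https")

url = s3_to_https(
    "s3://scale-demo-datasets/sports/sports_training_dataset.csv", "us-west-2"
)
print(url)
print(is_http_url(url))
```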

## Fine-Tuning the Model

Next, we create the fine-tune from our training file via the FineTune API. Note: with a few hundred examples, a job can take roughly 15-20 minutes from launch to completion, partly because jobs wait in a queue before they run.

For this section, you will need an API key to interact with Scale. To retrieve your API key, head to [Scale Spellbook](https://spellbook.scale.com/) where you will get an API key on the [settings](https://spellbook.scale.com/settings) page.

In [8]:
# Note: you must have the environment variable SCALE_API_KEY set to your Spellbook API key. 

from llmengine import FineTune, Completion, Model

FineTune.validate_api_key()

In [9]:
create_fine_tune_response = FineTune.create(
    model="llama-2-7b",
    training_file="https://scale-demo-datasets.s3.us-west-2.amazonaws.com/sports/sports_training_dataset.csv",
    validation_file=None,
    hyperparameters={"epochs": "1", "lr": "0.0002"},
    suffix="my-first-fine-tune"
)

fine_tune_id = create_fine_tune_response.fine_tune_id

In [10]:
# Check the fine-tune status; re-run this cell until it reaches a terminal state

fine_tune_status = FineTune.get(fine_tune_id).status
print(fine_tune_status)
if fine_tune_status == "SUCCESS":
    print("Fine-Tune Succeeded!")
elif fine_tune_status in ["FAILURE", "CANCELLED"]:
    raise ValueError("Fine-Tune failed")


BatchJobStatus.RUNNING
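
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, not part of the LLM Engine client. The helper `wait_for_terminal_status`, its default interval, and its poll limit are all assumptions. It takes any zero-argument callable that returns the current status, so it can be exercised offline with a stand-in.

```python
import time
from typing import Callable

# Terminal states taken from the status check in the cell above.
TERMINAL_STATUSES = ("SUCCESS", "FAILURE", "CANCELLED")

def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    max_polls: int = 60,
) -> str:
    """Call get_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("fine-tune did not reach a terminal status in time")

# Stand-in for the real API call so the sketch runs without network access.
fake_statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
final = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

With the objects from this notebook, the callable could be `lambda: FineTune.get(fine_tune_id).status`, assuming the returned status compares equal to these strings, as the cell above relies on.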


Once the fine-tune completes, you can find your fine-tuned model by listing all the models available to you.

In [None]:
# We can look at all the models that are available for completions
all_models = Model.list().model_endpoints
all_models

In [13]:
your_fine_tuned_model = "llama-2-7b.my-first-fine-tune.2023-07-19-00-48-07"  # Note: you will have a different model!
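
Rather than copying the name by hand, you could filter the listed model names by the suffix passed to `FineTune.create`. This is a sketch under two assumptions: that fine-tuned models are named `<base model>.<suffix>.<timestamp>`, which is inferred from the single example above, and that you have already collected the names as plain strings from the entries of `Model.list().model_endpoints`. Which attribute of each entry holds the name is not shown in this notebook. The helper `models_with_suffix` is hypothetical.

```python
from typing import List

def models_with_suffix(model_names: List[str], suffix: str) -> List[str]:
    """Return the names containing '.<suffix>.', sorted lexicographically."""
    marker = f".{suffix}."
    return sorted(name for name in model_names if marker in name)

# Illustrative names; the second follows the pattern shown in the cell above.
names = [
    "llama-2-7b",
    "llama-2-7b.my-first-fine-tune.2023-07-19-00-48-07",
    "llama-2-7b.other-run.2023-07-18-10-00-00",
]
matches = models_with_suffix(names, "my-first-fine-tune")
print(matches)
```

For a single base model, lexicographic order matches timestamp order in this naming scheme, so the last match would be the most recent run.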

## Test the Fine-Tune

Next, we run our model on the test dataset via the Completions API. Since we trained using a prompt template, use that prompt template when making predictions. Note: you may have to wait a few minutes after the fine-tune succeeds in order for your model to be loaded.

In [15]:
def get_classification(prompt: str) -> str:
    # Retry up to 5 times; the endpoint may still be loading right after the fine-tune succeeds.
    for _ in range(5):
        try:
            response = Completion.create(
                model=your_fine_tuned_model,
                prompt=build_prompt(prompt),
                max_new_tokens=2,
                temperature=0.01,
            )
            return response.output.text.rstrip("\n")
        except Exception as e:
            print(e)
    # All attempts failed.
    return "Error"
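
Because generation is capped at two new tokens, the raw output may carry stray whitespace, different casing, or a truncated label. As a hedged sketch, the hypothetical helper `normalize_label` below maps raw output onto the two known labels and returns `None` for anything else, such as the `"Error"` fallback. Whether the model ever produces truncated labels is not something this notebook shows, so treat the prefix matching as a precaution.

```python
from typing import Optional, Sequence

LABELS = ("baseball", "hockey")

def normalize_label(raw: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Map raw model output to a known label, or None if nothing matches."""
    cleaned = raw.strip().lower()
    if not cleaned:
        return None
    for label in labels:
        # Accept exact matches, truncated prefixes ("hock"), and overruns ("hockey.").
        if label.startswith(cleaned) or cleaned.startswith(label):
            return label
    return None

print(normalize_label(" Baseball\n"))
print(normalize_label("hock"))
print(normalize_label("Error"))
```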

In [None]:
df_test["predicted_response"] = df_test["raw_prompt"].apply(get_classification)

Let's peek at the data and calculate our test accuracy!

In [17]:
df_test.head()

| | raw_prompt | response | predicted_response |
|---|---|---|---|
| 950 | `From: tedward@cs.cornell.edu (Edward [Ted] Fis...` | baseball | baseball |
| 951 | `From: smorris@venus.lerc.nasa.gov (Ron Morris ...` | hockey | hockey |
| 952 | `From: shah@pitt.edu (Ravindra S Shah)\nSubject...` | hockey | hockey |
| 953 | `From: timlin@spot.Colorado.EDU (Michael Timlin...` | baseball | baseball |
| 954 | `From: gp2011@andy.bgsu.edu (George Pavlic)\nSu...` | hockey | hockey |


In [18]:
num_correct = len(df_test[df_test["predicted_response"] == df_test["response"]])
num_correct / len(df_test)

0.98

In [19]:
df_test[df_test["predicted_response"] != df_test["response"]]

| | raw_prompt | response | predicted_response |
|---|---|---|---|
| 974 | `From: maX <maX@maxim.rinaco.msk.su>\nSubject: ...` | hockey | baseball |
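
A single accuracy number hides which class the errors fall in. The sketch below counts (actual, predicted) pairs with the standard library, which gives a small confusion table. The helpers `confusion_counts` and `accuracy` are hypothetical, and the labels are toy data for illustration, not the notebook's results.

```python
from collections import Counter
from typing import Iterable

def confusion_counts(actual: Iterable[str], predicted: Iterable[str]) -> Counter:
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(counts: Counter) -> float:
    """Fraction of pairs where the prediction equals the actual label."""
    total = sum(counts.values())
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / total if total else 0.0

# Toy labels for illustration; not the notebook's results.
actual = ["baseball", "hockey", "hockey", "baseball"]
predicted = ["baseball", "hockey", "baseball", "baseball"]
counts = confusion_counts(actual, predicted)
print(counts[("hockey", "baseball")])
print(accuracy(counts))
```

With the notebook's data, the inputs would be `df_test["response"]` and `df_test["predicted_response"]`.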
