# Milestone 3: Fine-Tuning and Testing a ChatGPT Model

In [2]:
from datasets import load_dataset
from dotenv import load_dotenv

dataset = load_dataset("common-pile/caselaw_access_project", split="train", data_files="cap_00000.jsonl.gz")

Generating train split: 40000 examples [00:02, 16887.33 examples/s]


I chose the Caselaw Access Project (CAP) dataset from Hugging Face because it provides a collection of legal texts, which is ideal for exploring legal AI applications. I loaded a single file (`cap_00000.jsonl.gz`) to manage size and cost constraints. Originally, I planned to fine-tune a model for writing grant proposals, but that would have required more examples of funded projects than I could find, and I would have had to create the dataset manually.

In [9]:
import pandas as pd

df = pd.DataFrame(dataset)
print(df.head())
print("Columns:", df.columns.tolist())

                          id                  source  \
0  f2d_474/html/0001-01.html  Caselaw Access Project   
1  f2d_474/html/0003-01.html  Caselaw Access Project   
2  f2d_474/html/0004-01.html  Caselaw Access Project   
3  f2d_474/html/0006-01.html  Caselaw Access Project   
4  f2d_474/html/0008-01.html  Caselaw Access Project   

                        added                     created  \
0  2024-08-24T03:29:51.129235  2024-08-24T03:29:51.129683   
1  2024-08-24T03:29:51.129235  2024-08-24T03:29:51.129683   
2  2024-08-24T03:29:51.129235  2024-08-24T03:29:51.129683   
3  2024-08-24T03:29:51.129235  2024-08-24T03:29:51.129683   
4  2024-08-24T03:29:51.129235  2024-08-24T03:29:51.129683   

                                            metadata  \
0  {'author': 'PER CURIAM:', 'license': 'Public D...   
1  {'author': '
      PER CURIAM:', 'license': 'P...   
2  {'author': 'PER CURIAM:', 'license': 'Public D...   
3  {'author': 'PER CURIAM:', 'license': 'Public D...   
4  {'author': 'P

The CAP file loaded with 40,000 rows. The `text` column contains the case narratives, and the `metadata` column provides author and license information. I’ll focus on `text`.

In [7]:
df["text"][1]

"\n    John OTERO and Grace Otero, his wife, Appellants, v. INTERNATIONAL UNION OF ELECTRICAL, RADIO AND MACHINE WORKERS (IUE) an association, Ap-pellee.\n    No. 71-1716.\n    United States Court of Appeals, Ninth Circuit.\n    Feb. 9, 1973.\n    W. Roy Tribble (argued), Chandler, Ariz., for appellants.\n    Melvin Warshaw, Asst. Gen. Counsel ' (argued), Ruth Weyand, Richard Seupi, International Union of Electrical, Radio, & Machine Workers, Washington, D.C., Herbert B. Finn, of Finn & Van Baalen, Phoenix, Ariz., for appellee.\n    Before BARNES, KILKENNY and GOODWIN, Circuit Judges.\n   PER CURIAM:\n\nThe district court had jurisdiction of this action, though not by reason of diversity, which does not here exist. 28 U.S.C. § 1332; United Steel Workers of America v. Bouligny, Inc., 382 U.S. 145, 150-151, 86 S.Ct. 272 (1965). Jurisdiction depends on the existence herein of a collective bargaining contract between an employer (itself a union) and a “labor organization” representing the 

In [11]:
def label_case(text):
    text = str(text).lower() if text else ""
    if "affirmed" in text and "appellee" in text:
        return "Favorable"  # Favorable to appellee
    elif "reversed" in text and "appellant" in text:
        return "Favorable"  # Favorable to appellant
    elif "dismissed" in text and "appellant" in text:
        return "Unfavorable"  # Unfavorable to appellant
    elif "reversed" in text and "appellee" in text:
        return "Unfavorable"  # Unfavorable to appellee
    else:
        return None  # Insufficient context

df["label"] = df["text"].apply(label_case)
df = df.dropna(subset=["label"])
print("Labeled rows:", len(df))
print("Unlabeled rows dropped:", len(dataset) - len(df))

Labeled rows: 30560
Unlabeled rows dropped: 9440


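I labeled cases as “Favorable” or “Unfavorable” using keywords such as “affirmed” or “reversed” in the `text` column, then dropped the 9,440 rows that matched no rule, leaving 30,560 labeled examples. This simple rule-based approach may miss more complex outcomes: the keywords can appear anywhere in the opinion, the first matching rule wins, and “Favorable” sometimes refers to the appellee and sometimes to the appellant.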
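Because the labels come from keyword rules, it is worth checking how many cases landed in each class. The following is a minimal sketch using only the standard library. The helper name `label_distribution` and the sample labels are made up for illustration and are not part of the notebook.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, n / total) for label, n in counts.items()}


# Made-up labels for illustration; in the notebook this would be df["label"].
sample_labels = ["Favorable"] * 8 + ["Unfavorable"] * 2

for label, (n, fraction) in label_distribution(sample_labels).items():
    print(f"{label}: {n} ({fraction:.0%})")
```

On the DataFrame itself, `df["label"].value_counts(normalize=True)` should report the same fractions. A heavily skewed split would be a reason to rebalance before sampling training data.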

In [42]:
import json

def convert_to_jsonl(df, filename):
    """Write one chat-format fine-tuning example per DataFrame row to a JSONL file."""
    with open(filename, "w", encoding="utf-8", errors="replace") as f:
        for _, row in df.iterrows():
            try:
                # Replace any characters that cannot be encoded as UTF-8
                text = str(row["text"]).encode(errors="replace").decode()
                # Flatten newlines; json.dumps escapes quotes itself, so escaping
                # them here as well would leave literal backslashes in the content
                user_content = text.replace("\n", " ")
                # Keep only the first 500 characters of the case text
                user_content = user_content[:500]
                assistant_content = str(row["label"])
                messages = [
                    {"role": "system", "content": "You're a chatbot that only responds with emojis!"},
                    {"role": "user", "content": user_content},
                    {"role": "assistant", "content": assistant_content}
                ]
                f.write(json.dumps({"messages": messages}) + "\n")
            except Exception as e:
                print(f"Error processing row: {e}, skipping row")
                continue

from sklearn.model_selection import train_test_split
train_df, valid_df = train_test_split(df, test_size=0.2)
convert_to_jsonl(train_df, "train.jsonl")
convert_to_jsonl(valid_df, "valid.jsonl")

with open("train.jsonl") as f:
    print(f.read()[:500])

{"messages": [{"role": "system", "content": "You're a chatbot that only responds with emojis!"}, {"role": "user", "content": "     Charles B. CANNON et al., Plaintiffs-Appellants, v. U. S. ACOUSTICS CORPORATION et al., Defendants-Appellees.     No. 75-1810.     United States Court of Appeals, Seventh Circuit.     Argued Jan. 20, 1976.     Decided March 31, 1976.                 N. A. Giambalvo, Chicago, Ill., for plaintiffs-appellants.     Francis D. Morrissey, Joseph M. Fasano, Chicago, Ill., f
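The conversion keeps only the first 500 characters of each case, while the labels were derived from keywords found anywhere in the full text. One quick check, sketched here under those assumptions, is to measure how often a labeling keyword still appears in the truncated text. The function names and sample strings are hypothetical, and the keyword list simply mirrors the words used in `label_case`.

```python
# Keywords that label_case looks for when assigning an outcome.
OUTCOME_KEYWORDS = ("affirmed", "reversed", "dismissed")


def keyword_survives(text, limit=500):
    """True if any outcome keyword appears within the first `limit` characters."""
    head = text.replace("\n", " ")[:limit].lower()
    return any(keyword in head for keyword in OUTCOME_KEYWORDS)


def survival_rate(texts, limit=500):
    """Fraction of texts whose truncated form still contains an outcome keyword."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(keyword_survives(t, limit) for t in texts) / len(texts)


# Made-up samples: the first hides its outcome after 600 filler characters.
samples = [
    "Smith, Appellant, v. Jones, Appellee. " + "x" * 600 + " Judgment affirmed.",
    "Judgment reversed. Doe, Appellant, v. Roe, Appellee.",
]
print(f"Keyword survives truncation in {survival_rate(samples):.0%} of samples")
```

If the rate on the real data is low, the model sees mostly case headers and has little signal from which to learn the label.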


My first attempt to create the job was rejected because it would have cost over $100:
> Creating this fine-tuning job would exceed your hard limit, please check your plan and billing details. Cost of job ftjob-kfKUttFukXm0vQr6qEUQKvaD: USD 104.30. Quota remaining for your project proj_5GjmgsX2kTZp9OuilKK176GW: USD 9.28.

So I reduced the dataset size to 200 training and 50 validation samples for testing.
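Before uploading, a rough size estimate can show whether a file is likely to fit the budget. The sketch below assumes about four characters per token, which is only a rule of thumb, and assumes that billed training tokens scale with the number of epochs. The function name `estimate_trained_tokens` and the sample data are made up for illustration.

```python
import json


def estimate_trained_tokens(jsonl_lines, n_epochs=3, chars_per_token=4.0):
    """Rough estimate: total message characters / chars_per_token, times epochs.

    chars_per_token=4.0 is an assumed average, not a measured value.
    """
    total_chars = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        total_chars += sum(len(m["content"]) for m in example["messages"])
    return int(total_chars / chars_per_token * n_epochs)


# Made-up data: 200 identical examples with a 400-character user message.
example = {
    "messages": [
        {"role": "user", "content": "a" * 400},
        {"role": "assistant", "content": "Favorable"},
    ]
}
lines = [json.dumps(example)] * 200
print("Estimated trained tokens:", estimate_trained_tokens(lines))
```

For exact counts, the `tiktoken` package installed in the next cell can tokenize each message instead of relying on the character heuristic.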

In [44]:
train_sample = train_df.sample(n=200) if len(train_df) > 200 else train_df
valid_sample = valid_df.sample(n=50) if len(valid_df) > 50 else valid_df
convert_to_jsonl(train_sample, "train.jsonl")
convert_to_jsonl(valid_sample, "valid.jsonl")

with open("train.jsonl") as f:
    print(f.read()[:500])

{"messages": [{"role": "system", "content": "You're a chatbot that only responds with emojis!"}, {"role": "user", "content": "     BITUMINOUS FIRE & MARINE INSURANCE COMPANY, Appellant, v. IZZY ROSEN\u2019S, INC., Appellee.     No. 73-1313.     United States Court of Appeals, Sixth Circuit. .     Argued Nov. 27, 1973.     Decided and Filed March 21, 1974.            Eugene C. Gaerig, Memphis, Tenn., on brief, for appellant.     Thomas R. Prewitt, and Edward M. Kaplan, Memphis, Tenn., on brief, f


In [45]:
!pip install openai requests tiktoken numpy


[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: pip install --upgrade pip


In [46]:
import openai
import os
import json
import requests
import time
import tiktoken
import numpy as np
from collections import defaultdict
from dotenv import load_dotenv

load_dotenv()

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

I encountered a `JSONDecodeError` at character 406 in `train.jsonl`, which indicates a malformed JSON object, likely caused by unescaped characters or truncation. As of 08:46 PM EDT on July 20, 2025, I’m inspecting the first line to identify the issue.
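When a JSONL file fails to parse, it helps to know which line is malformed and where. The following is a minimal sketch, assuming one JSON object per line. The helper `find_bad_lines` and both sample lines are hypothetical. It relies on the `colno` and `msg` attributes of `json.JSONDecodeError`.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, column, message) for each line that fails to parse."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, err.colno, err.msg))
    return problems


# A line produced by json.dumps (quotes escaped correctly) ...
good = json.dumps({"messages": [{"role": "user", "content": 'He said "hi"'}]})
# ... and a hand-built line with unescaped inner quotes.
bad = '{"messages": [{"role": "user", "content": "He said "hi""}]}'

for number, column, message in find_bad_lines([good, bad]):
    print(f"line {number}, column {column}: {message}")
```

In the notebook, the same function could be called on `open("train.jsonl", encoding="utf-8")`, since a file object yields its lines one at a time.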

In [47]:
with open("train.jsonl", encoding="utf-8") as f:
    first_line = f.readline()
print("First line of train.jsonl:")
print(first_line)
print(f"Length: {len(first_line)}")
# Guard the index so a line shorter than 407 characters does not raise IndexError
char_406 = first_line[406] if len(first_line) > 406 else "(not available)"
print(f"Character 406: {char_406!r}")

First line of train.jsonl:
{"messages": [{"role": "system", "content": "You're a chatbot that only responds with emojis!"}, {"role": "user", "content": "     BITUMINOUS FIRE & MARINE INSURANCE COMPANY, Appellant, v. IZZY ROSEN\u2019S, INC., Appellee.     No. 73-1313.     United States Court of Appeals, Sixth Circuit. .     Argued Nov. 27, 1973.     Decided and Filed March 21, 1974.            Eugene C. Gaerig, Memphis, Tenn., on brief, for appellant.     Thomas R. Prewitt, and Edward M. Kaplan, Memphis, Tenn., on brief, for appellee.     Before PHILLIPS, Chief Judge, and CELEBREZZE and MILLER, Circuit Judges.    WILLIAM E. MILLER, Circuit Judge.  Thi"}, {"role": "assistant", "content": "Favorable"}]}

Length: 683
Character 406: 'b'


In [48]:
import json

data_file = "train.jsonl"

def basic_checks(data_file):
    try:
        with open(data_file, encoding="utf-8") as f:
            dataset = [json.loads(line) for line in f]

        print(f"Basic checks for file {data_file}:")
        print("Count of examples in training dataset:", len(dataset))
        print("First example:")
        print(dataset[0])
        return True, dataset
    except FileNotFoundError as e:
        print(f"File not found error occurred in file {data_file}: {e}")
        return False, None
    except json.JSONDecodeError as e:
        print(f"JSON decoding error occurred in file {data_file}: {e}")
        return False, None
    except Exception as e:
        print(f"An error occurred in file {data_file}: {e}")
        return False, None

valid, dataset = basic_checks(data_file)

Basic checks for file train.jsonl:
Count of examples in training dataset: 200
First example:
{'messages': [{'role': 'system', 'content': "You're a chatbot that only responds with emojis!"}, {'role': 'user', 'content': '     BITUMINOUS FIRE & MARINE INSURANCE COMPANY, Appellant, v. IZZY ROSEN’S, INC., Appellee.     No. 73-1313.     United States Court of Appeals, Sixth Circuit. .     Argued Nov. 27, 1973.     Decided and Filed March 21, 1974.            Eugene C. Gaerig, Memphis, Tenn., on brief, for appellant.     Thomas R. Prewitt, and Edward M. Kaplan, Memphis, Tenn., on brief, for appellee.     Before PHILLIPS, Chief Judge, and CELEBREZZE and MILLER, Circuit Judges.    WILLIAM E. MILLER, Circuit Judge.  Thi'}, {'role': 'assistant', 'content': 'Favorable'}]}


In [49]:
def format_checks(dataset, filename):
    format_errors = defaultdict(int)

    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
        messages = ex.get("messages")
        if not isinstance(messages, list) or len(messages) < 3:
            format_errors["missing_or_invalid_messages"] += 1
            continue
        expected_roles = ["system", "user", "assistant"]
        for i, (msg, expected_role) in enumerate(zip(messages, expected_roles)):
            if not isinstance(msg, dict):
                format_errors[f"message_{i}_not_dict"] += 1
                continue
            if msg.get("role") != expected_role:
                format_errors[f"message_{i}_role_incorrect"] += 1
            if "content" not in msg or not isinstance(msg["content"], str) or not msg["content"].strip():
                format_errors[f"message_{i}_missing_or_empty_content"] += 1

    if format_errors:
        print(f"Formatting errors found in file {filename}:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
        return False
    print(f"No formatting errors found in file {filename}")
    return True

format_ok = format_checks(dataset, data_file)

No formatting errors found in file train.jsonl


In [51]:
with open("train.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")
train_file_id = train_file.id

with open("valid.jsonl", "rb") as f:
    valid_file = client.files.create(file=f, purpose="fine-tune")
valid_file_id = valid_file.id

print(f"Train file uploaded with ID: {train_file_id}")
print(f"Valid file uploaded with ID: {valid_file_id}")

Train file uploaded with ID: file-MfNr19Xz118qptzpnsKAtU
Valid file uploaded with ID: file-KTz5y9wafNqAmsgDHT95yv


In [52]:
fine_tuning_job = client.fine_tuning.jobs.create(
    training_file=train_file_id,
    validation_file=valid_file_id,
    model="gpt-4o",
    hyperparameters={
        "n_epochs": 3
    },
    suffix="case_law"
)

print(f"Fine-tuning job created with ID: {fine_tuning_job.id}")

Fine-tuning job created with ID: ftjob-kuRap6Olf5Vgu22vqOvK8BSq


In [59]:
job_id = "ftjob-cvGcW25lZtw4UH1txiSiEuZs"

api_key = os.getenv("OPENAI_API_KEY")
headers = {
    "Authorization": f"Bearer {api_key}"
}

start_time = time.time()

while True:
    response = requests.get(f"https://api.openai.com/v1/fine_tuning/jobs/{job_id}", headers=headers)
    response.raise_for_status()  # surface HTTP errors instead of failing on a missing "status" key
    job = response.json()
    job_status = job["status"]
    elapsed = int(time.time() - start_time)
    print(f"Elapsed time: {elapsed // 60} minutes {elapsed % 60} seconds")
    print(f"Job Status: {job_status}")
    if job_status == "succeeded":
        fine_tuned_model_name = job["fine_tuned_model"]
        print(f"Fine-tuning complete. Model name: {fine_tuned_model_name}")
        break
    elif job_status in ("failed", "cancelled"):
        # Both are terminal states; without this check a cancelled job would loop forever
        print(f"Fine-tuning {job_status}.")
        break
    time.sleep(30)

Elapsed time: 0 minutes 0 seconds
Job Status: succeeded
Fine-tuning complete. Model name: ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR


Once again, using `gpt-3.5-turbo` caused moderation failures, so I switched to `gpt-4o`. The job then completed successfully. The `created_at` and `finished_at` timestamps in the job record correspond to a run of about 29 minutes.
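The polling pattern used above can be separated from the HTTP call, which makes it easier to test and to bound with a timeout. This is a sketch under stated assumptions, not the notebook's actual code: `wait_for_job` and `fetch_status` are hypothetical names, and the fake status sequence stands in for real API responses.

```python
import time

# Statuses after which a fine-tuning job will not change again.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30, timeout_seconds=7200, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in for the API: yields a made-up sequence of statuses.
fake_statuses = iter(["validating_files", "queued", "running", "succeeded"])
final = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("Final status:", final)
```

In the notebook, `fetch_status` could be something like `lambda: client.fine_tuning.jobs.retrieve(job_id).status`, which reuses the client call from the next cell.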

In [60]:
result = client.fine_tuning.jobs.retrieve(job_id)
print(result)

FineTuningJob(id='ftjob-cvGcW25lZtw4UH1txiSiEuZs', created_at=1753066955, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR', finished_at=1753068710, hyperparameters=Hyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=3), model='gpt-4o-2024-08-06', object='fine_tuning.job', organization_id='org-x0P8HBiXM7Scb14exTndOhLB', result_files=['file-Uj5LVLNjVAVc2LemcQXaZp'], seed=1079288, status='succeeded', trained_tokens=104532, training_file='file-MfNr19Xz118qptzpnsKAtU', validation_file='file-KTz5y9wafNqAmsgDHT95yv', estimated_finish=None, integrations=[], metadata=None, method=Method(type='supervised', dpo=None, reinforcement=None, supervised=SupervisedMethod(hyperparameters=SupervisedHyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=3))), user_provided_suffix='emoji', usage_metrics=None, shared_with_openai=False, eval_id=None)


In [62]:
result_file_id = result.result_files[0]
result_file_content = client.files.content(result_file_id).read()
import base64
decoded = base64.b64decode(result_file_content).decode()

with open("decoded_result.csv", "w") as f:
    f.write(decoded)

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("decoded_result.csv")

df['step'] = pd.to_numeric(df['step'], errors='coerce')
df['train_loss'] = pd.to_numeric(df['train_loss'], errors='coerce')
df['train_accuracy'] = pd.to_numeric(df['train_accuracy'], errors='coerce')
df['valid_loss'] = pd.to_numeric(df['valid_loss'], errors='coerce')
df['valid_mean_token_accuracy'] = pd.to_numeric(df['valid_mean_token_accuracy'], errors='coerce')

plt.figure(figsize=(12, 5))
plt.plot(df['step'], df['train_loss'], label='Training Loss', linewidth=2)
plt.plot(df['step'], df['valid_loss'], label='Validation Loss', linewidth=2, linestyle='--')
plt.title('Loss over Training Steps')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

plt.figure(figsize=(12, 5))
plt.plot(df['step'], df['train_accuracy'], label='Training Accuracy', linewidth=2)
plt.plot(df['step'], df['valid_mean_token_accuracy'], label='Validation Accuracy', linewidth=2, linestyle='--')
plt.title('Accuracy over Training Steps')
plt.xlabel('Step')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

The fine-tuned model completed training successfully on the third attempt, processing a total of 104,532 tokens over 3 epochs. The training and validation losses were extremely low (0.001 and 0.002, respectively), which suggests strong learning on the training set. The full validation loss was 0.195. The loss curves show rapid convergence within the first 200 steps and were relatively flat thereafter. Accuracy was variable mid-training, which could mean the validation data contained a mix of easy and hard samples. Overall, the training metrics suggest a highly successful fine-tune.
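The numbers in the job record can be related to each other with a little arithmetic. The sketch below assumes that `trained_tokens` counts every epoch and that each step processes `batch_size` examples. All values are taken from the job record and the 200-example training file.

```python
import math

n_examples = 200          # rows in train.jsonl
n_epochs = 3              # from the job's hyperparameters
batch_size = 1            # from the job's hyperparameters
trained_tokens = 104532   # from the job record

# Assumption: one optimizer step per batch, repeated for every epoch.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * n_epochs

# Assumption: trained_tokens is summed over all epochs.
tokens_per_example = trained_tokens / (n_examples * n_epochs)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"average tokens per example: {tokens_per_example:.1f}")
```

Under these assumptions, the first 200 steps correspond to roughly one pass over the training file, so the loss flattened after about one epoch.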

In [67]:
messages = [
    {"role": "system", "content": "You are a legal case classifier."},
    {"role": "user", "content": "BITUMINOUS FIRE & MARINE INSURANCE COMPANY, Appellant, v. IZZY ROSEN’S, INC., Appellee. United States Court of Appeals, Sixth Circuit. Argued Nov. 27, 1973. Decided and Filed March 21, 1974. Eugene C. Gaerig, Memphis, Tenn., on brief, for appellant. Thomas R. Prewitt, and Edward M. Kaplan, Memphis, Tenn., on brief, for appellee. Before PHILLIPS, Chief Judge, and CELEBREZZE and MILLER, Circuit Judges. WILLIAM E. MILLER, Circuit Judge."}
]

response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR",
    messages=messages
)

print(response.choices[0].message.content)

Favorable


In [68]:
messages = [
    {"role": "system", "content": "You are a legal case classifier."},
    {"role": "user", "content": "Priscilla E. CHAVEZ, Plaintiff-Appellant, v. TEMPE UNION HIGH SCHOOL DISTRICT #213, et al. United States Court of Appeals, Ninth Circuit. Dec. 5, 1977. Theodore C. Jarvi, Scottsdale, Ariz., for plaintiff-appellant."}
]

response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR",
    messages=messages
)

print(response.choices[0].message.content)

Favorable


In [69]:
messages = [
    {"role": "system", "content": "You are a legal case classifier."},
    {"role": "user", "content": "UNITED STATES of America, Plaintiff-Appellant, v. James E. HOOKER, Defendant-Appellee. United States Court of Appeals, Ninth Circuit. Oct. 26, 1979. Donald M. Currie, Asst. U.S. Atty., Seattle, Wash., for plaintiff-appellant. Irwin H. Schwartz, Federal Public Defender, Tacoma, Wash., for defendant-appellee. Before DUNIWAY, PECK and HUG, Circuit Judges."}
]

response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR",
    messages=messages
)

print(response.choices[0].message.content)

Favorable


In [70]:
messages = [
    {"role": "system", "content": "You are a legal case classifier."},
    {"role": "user", "content": "UNITED STATES of America, Plaintiff-Appellee, v. John Doe, Defendant-Appellant. The Ninth Circuit affirmed the lower court’s conviction of the defendant on all counts, rejecting arguments about improper jury instructions and ineffective assistance of counsel. The court held that the trial was fair and the evidence against the defendant was overwhelming."}
]
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:personal:emoji:BvbXkZhR",
    messages=messages
)

print(response.choices[0].message.content)

Favorable


Despite training metrics that showed a good fit, the model does not perform well on these test cases. It predicted "Favorable" for all four inputs, including ones I expected to be "Unfavorable". This suggests the model may have overfit to the training data or to an imbalance between the two classes. More diverse training examples could help. Additional epochs are less likely to, since they tend to worsen overfitting. I should also check the training samples to ensure a balanced representation of both classes. The test prompts also used a different system message ("You are a legal case classifier.") from the one in the training files, which may affect the results. Overall, while the fine-tuning process completed successfully, the model's real-world performance needs improvement.
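A handful of manual prompts gives only an impression. A small evaluation over held-out examples, compared against a majority-class baseline, would make the problem measurable. The following is a sketch with made-up data: `evaluate` and `always_favorable` are hypothetical names, and in practice `predict` would wrap the chat completion call for the fine-tuned model.

```python
from collections import Counter


def evaluate(examples, predict):
    """Score predict(text) against true labels.

    examples: list of (text, true_label) pairs.
    Returns accuracy, the majority-class baseline, and confusion counts
    keyed by (true_label, predicted_label).
    """
    if not examples:
        raise ValueError("examples must not be empty")
    confusion = Counter()
    for text, truth in examples:
        confusion[(truth, predict(text))] += 1
    total = len(examples)
    correct = sum(n for (truth, pred), n in confusion.items() if truth == pred)
    majority = Counter(truth for _, truth in examples).most_common(1)[0][1]
    return {
        "accuracy": correct / total,
        "majority_baseline": majority / total,
        "confusion": dict(confusion),
    }


def always_favorable(text):
    """Stand-in model that mimics the behaviour seen in the tests above."""
    return "Favorable"


# Made-up held-out examples for illustration.
examples = [
    ("case a", "Favorable"),
    ("case b", "Favorable"),
    ("case c", "Favorable"),
    ("case d", "Unfavorable"),
]
report = evaluate(examples, always_favorable)
print(report)
```

If the fine-tuned model's accuracy merely matches the majority baseline, it has probably learned the class prior instead of anything about the case text.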