<img src="../../docs/images/DSPy8.png" alt="DSPy7 Image" height="150"/>

# DSPy Finetuning Demo

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/examples/finetuning/finetuning_demo.ipynb)

This demo notebook shows an example of how fine-tuning models works in DSPy.
In particular, we fine-tune `OpenAI`'s `GPT-4o-mini` on `HotPotQA`.
Note that the fine-tuning features in DSPy are experimental and are subject to change.

## I. Module Setup

The fine-tuning features are available on a special branch named `demo_finetune`.
The following code block checks out the DSPy repository, switches to this branch, and installs DSPy locally.

In [None]:
# Replace the following with the path to the DSPy repository on your machine,
# if you have already cloned it. Make sure your working tree is clean! If you
# don't specify a path, we will clone the repository to the current directory.
# 
# Leave as is to run on Google Colab.
repo_path = ''

if not repo_path:
    try:
        repo_path = 'dspy'
        !git clone https://github.com/stanfordnlp/dspy $repo_path
    except:
        raise Exception("Could not clone and install the repository!")


!cd $repo_path && git checkout demo_finetune  # Switch to the branch
!pip install -e $repo_path  # Install the local package

We make use of a demo cache for this notebook, located in "dspy/examples/finetuning/demo_cache" on the `demo_finetune` branch of the DSPy repository.
This allows us to utilize cached calls to the OpenAI API.
If you are not modifying this notebook, you don't need to set up the `OPENAI_API_KEY`.

In [2]:
import os

demo_cache_path = os.path.join(repo_path, "dspy", "examples", "finetuning", "demo_cache")
os.environ["DSP_CACHEDIR"] = demo_cache_path

If you would like to modify the notebook and change the requests being made to the OpenAI API, you should set the `OPENAI_API_KEY`.
The next block demonstrates how.

In [3]:
# Here are the different options for setting the OPENAI_API_KEY.
# (1) If you are running this notebook locally, you can set it in your shell:
# export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>

# (2) If you are running this notebook in a Google Colab, you can set it in
#     the secrets panel and import by uncommenting the following lines:
# from google.colab import userdata
# os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# (3) You can set it directly in the notebook by uncommenting the following
#     line with the appropriate key. If you pick this option, make sure to
#     remove your key before sharing your notebook or committing to a
#     repository.
# os.environ["OPENAI_API_KEY"] = <OPENAI_API_KEY>

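We will make use of a random number generator in this notebook.
We create a seeded `Random` object here so that the shuffles in this notebook are repeatable across runs.

In [2]:
import random

# Seed the generator; an unseeded Random() would shuffle differently on each run.
rng = random.Random(0)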
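
As a quick illustration of why a seed matters, the sketch below shuffles the same list with two `random.Random` instances that share a seed (the value `0` is an arbitrary choice for illustration) and with one unseeded instance. Only the seeded pair is guaranteed to agree.

```python
import random

items = list(range(10))


def shuffled_copy(rng, data):
    """Return a shuffled copy of `data` using the given Random instance."""
    out = list(data)
    rng.shuffle(out)
    return out


# Two generators built from the same seed produce the same shuffle.
first = shuffled_copy(random.Random(0), items)
second = shuffled_copy(random.Random(0), items)
print(first == second)

# An unseeded generator is seeded from a system source, so its shuffle
# will generally differ from run to run. It is still a permutation.
unseeded = shuffled_copy(random.Random(), items)
print(sorted(unseeded) == items)
```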

## II. Task Setup

In typical DSPy fashion, we start by setting up the task, dataset, metric, models, and optimizer settings used for `HotPotQA`!

In [3]:
import dspy
from dspy.datasets import HotPotQA
from dspy.evaluate import Evaluate
from dsp.utils.utils import deduplicate


# We are setting the experimental flag to True to make use of the fine-tuning
# features that are still in development.
dspy.settings.configure(experimental=True)

# Define the program
class BasicMH(dspy.Module):
    def __init__(self, passages_per_hop=3, num_hops=2):
        super().__init__()
        self.num_hops = num_hops
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = [dspy.ChainOfThought("context, question -> search_query") for _ in range(self.num_hops)]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = []
        
        for hop in range(self.num_hops):
            search_query = self.generate_query[hop](context=context, question=question).search_query
            passages = self.retrieve(search_query).passages
            context = deduplicate(context + passages)

        answer = self.generate_answer(context=context, question=question).copy(context=context)
        return answer

# Prepare the dataset
TRAIN_SIZE = 1000
DEV_SIZE = 500
dataset = HotPotQA(train_seed=1, eval_seed=2023, test_size=0, only_hard_examples=True)
trainset = [x.with_inputs('question') for x in dataset.train][:TRAIN_SIZE]
devset = [x.with_inputs('question') for x in dataset.dev][:DEV_SIZE]

# Prepare the metric and evaluator
NUM_THREADS = 12
metric = dspy.evaluate.answer_exact_match
evaluate = Evaluate(devset=devset, metric=metric, num_threads=NUM_THREADS, display_progress=True)

# Prepare the retriever model
COLBERT_V2_ENDPOINT = "http://20.102.90.50:2017/wiki17_abstracts"
retriever = dspy.ColBERTv2(url=COLBERT_V2_ENDPOINT)
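
To make the control flow of `BasicMH.forward` concrete without calling an LM or a retriever, here is a minimal sketch that replaces both with hypothetical stand-ins. The names `fake_generate_query`, `fake_retrieve`, `FAKE_INDEX`, and `deduplicate_keep_order` are not part of DSPy; the last one assumes that the imported `deduplicate` keeps the first occurrence of each passage in order. The sketch only illustrates how the context accumulates across hops.

```python
def deduplicate_keep_order(seq):
    """Drop repeated items while keeping the first occurrence of each."""
    seen = set()
    out = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


# Hypothetical toy index: search query -> retrieved passages.
FAKE_INDEX = {
    "hop-0 query": ["passage A", "passage B"],
    "hop-1 query": ["passage B", "passage C"],
}


def fake_generate_query(hop, context, question):
    """Stand-in for the per-hop query generator; ignores its inputs."""
    return f"hop-{hop} query"


def fake_retrieve(query):
    """Stand-in for the retriever; looks the query up in the toy index."""
    return FAKE_INDEX.get(query, [])


def multi_hop_context(question, num_hops=2):
    """Accumulate deduplicated passages over `num_hops` retrieval rounds."""
    context = []
    for hop in range(num_hops):
        query = fake_generate_query(hop, context, question)
        passages = fake_retrieve(query)
        context = deduplicate_keep_order(context + passages)
    return context


context = multi_hop_context("toy question")
print(context)  # ['passage A', 'passage B', 'passage C']
```

The passage returned by both hops appears only once in the final context, which is the reason for deduplicating after each hop.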

Here we specify the parameters required for `BootstrapFewShotWithRandomSearch` at the beginning of our notebook so that they can be reused.

In [4]:
from dspy.teleprompt.random_search import BootstrapFewShotWithRandomSearch

# Define the hyperparameters for prompt optimization
MAX_BOOTSTRAPPED_DEMOS = 3
MAX_LABELED_DEMOS = 3
NUM_CANDIDATE_PROGRAMS = 6
OPTIMIZER_NUM_TRAIN = 100
OPTIMIZER_NUM_VAL = 150
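
The two `OPTIMIZER_NUM_*` sizes are used in Section III to carve the optimizer's training and validation sets out of a shuffled copy of the trainset. A minimal sketch of that slicing, using integers as stand-ins for examples and an arbitrary seed, shows that the two slices have the intended sizes and do not overlap:

```python
import random

OPTIMIZER_NUM_TRAIN = 100
OPTIMIZER_NUM_VAL = 150

# Integers stand in for the 1000 training examples.
toy_trainset = list(range(1000))

toy_rng = random.Random(0)  # arbitrary seed, for illustration only
shuffled = list(toy_trainset)
toy_rng.shuffle(shuffled)

# The first slice feeds the optimizer, the next one validates candidates.
toy_optimizer_trainset = shuffled[:OPTIMIZER_NUM_TRAIN]
toy_optimizer_valset = shuffled[OPTIMIZER_NUM_TRAIN:OPTIMIZER_NUM_TRAIN + OPTIMIZER_NUM_VAL]

print(len(toy_optimizer_trainset), len(toy_optimizer_valset))  # 100 150
print(set(toy_optimizer_trainset).isdisjoint(toy_optimizer_valset))  # True
```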

In the next block, we set up our fine-tunable LM. In `DSPy`, all
fine-tunable LMs are subclasses of the `TrainableLM` class, which is itself a
subclass of the `LM` class.
This allows us to communicate with `TrainableLM`s using a common interface.

`TrainableLM`s can be used as regular `LM`s, but they can also be used to fine-tune models!

In [5]:
# Model parameters
GPT_4O_MINI = "gpt-4o-mini-2024-07-18"
MAX_TOKENS = 1024
MODEL_PARAMETERS = {
  "max_tokens": MAX_TOKENS,
  "temperature": 0,
}

# Prepare the language model
lm = dspy.TrainableOpenAI(model=GPT_4O_MINI, **MODEL_PARAMETERS)

# Showing that the model is an instance of the TrainableLM class
isinstance(lm, dspy.TrainableLM)

True

Let's evaluate a vanilla program on our LM!

In [6]:
vanilla_program = BasicMH()

with dspy.context(lm=lm, rm=retriever):
  print("Evaluating the vanilla program on the devset...")
  vanilla_base_eval = evaluate(vanilla_program)

Evaluating the vanilla program on the devset...


Average Metric: 214 / 500  (42.8): 100%|██████████| 500/500 [00:04<00:00, 118.23it/s]


## III. Prompt Optimize the Base Model

We start by prompt optimizing the base model, which will allow us to get higher quality bootstraps for training!

In [7]:
# Prepare the training and validation sets for the optimizer using the original
# trainset. This ensures that our "devset" is left untouched.
shuffled_trainset = [d for d in trainset]
rng.shuffle(shuffled_trainset)
optimizer_trainset = shuffled_trainset[:OPTIMIZER_NUM_TRAIN]
optimizer_valset = shuffled_trainset[OPTIMIZER_NUM_TRAIN:OPTIMIZER_NUM_TRAIN+OPTIMIZER_NUM_VAL]

# Initialize the optimizer
bfrs_optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=MAX_BOOTSTRAPPED_DEMOS,
    max_labeled_demos=MAX_LABELED_DEMOS,
    num_candidate_programs=NUM_CANDIDATE_PROGRAMS,
    num_threads=NUM_THREADS
)

# Compile the program with the optimizer
with dspy.context(lm=lm, rm=retriever):
    vanilla_program = BasicMH()
    bfrs_base_program = bfrs_optimizer.compile(vanilla_program, trainset=optimizer_trainset, valset=optimizer_valset)

Going to sample between 1 and 3 traces per predictor.
Will attempt to bootstrap 6 candidate sets.


Average Metric: 69 / 150  (46.0): 100%|██████████| 150/150 [00:01<00:00, 125.64it/s]


Score: 46.0 for set: [0, 0, 0]
New best score: 46.0 for seed -3
Scores so far: [46.0]
Best score: 46.0


Average Metric: 77 / 150  (51.3): 100%|██████████| 150/150 [00:00<00:00, 285.04it/s]


Score: 51.33 for set: [3, 3, 3]
New best score: 51.33 for seed -2
Scores so far: [46.0, 51.33]
Best score: 51.33


  5%|▌         | 5/100 [00:00<00:01, 87.84it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


Average Metric: 83 / 150  (55.3): 100%|██████████| 150/150 [00:01<00:00, 118.81it/s]


Score: 55.33 for set: [3, 3, 3]
New best score: 55.33 for seed -1
Scores so far: [46.0, 51.33, 55.33]
Best score: 55.33
Average of max per entry across top 1 scores: 0.5533333333333333
Average of max per entry across top 2 scores: 0.6066666666666667
Average of max per entry across top 3 scores: 0.6266666666666667
Average of max per entry across top 5 scores: 0.6266666666666667
Average of max per entry across top 8 scores: 0.6266666666666667
Average of max per entry across top 9999 scores: 0.6266666666666667


  6%|▌         | 6/100 [00:00<00:01, 90.50it/s]


Bootstrapped 2 full traces after 7 examples in round 0.


Average Metric: 80 / 150  (53.3): 100%|██████████| 150/150 [00:01<00:00, 115.86it/s]


Score: 53.33 for set: [3, 3, 3]
Scores so far: [46.0, 51.33, 55.33, 53.33]
Best score: 55.33
Average of max per entry across top 1 scores: 0.5533333333333333
Average of max per entry across top 2 scores: 0.6
Average of max per entry across top 3 scores: 0.6333333333333333
Average of max per entry across top 5 scores: 0.6466666666666666
Average of max per entry across top 8 scores: 0.6466666666666666
Average of max per entry across top 9999 scores: 0.6466666666666666


  2%|▏         | 2/100 [00:00<00:01, 87.84it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 79 / 150  (52.7): 100%|██████████| 150/150 [00:01<00:00, 109.82it/s]


Score: 52.67 for set: [3, 3, 3]
Scores so far: [46.0, 51.33, 55.33, 53.33, 52.67]
Best score: 55.33
Average of max per entry across top 1 scores: 0.5533333333333333
Average of max per entry across top 2 scores: 0.6
Average of max per entry across top 3 scores: 0.6133333333333333
Average of max per entry across top 5 scores: 0.6466666666666666
Average of max per entry across top 8 scores: 0.6466666666666666
Average of max per entry across top 9999 scores: 0.6466666666666666


  2%|▏         | 2/100 [00:00<00:00, 124.16it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 85 / 150  (56.7): 100%|██████████| 150/150 [00:01<00:00, 136.13it/s]


Score: 56.67 for set: [3, 3, 3]
New best score: 56.67 for seed 2
Scores so far: [46.0, 51.33, 55.33, 53.33, 52.67, 56.67]
Best score: 56.67
Average of max per entry across top 1 scores: 0.5666666666666667
Average of max per entry across top 2 scores: 0.6133333333333333
Average of max per entry across top 3 scores: 0.6333333333333333
Average of max per entry across top 5 scores: 0.64
Average of max per entry across top 8 scores: 0.6533333333333333
Average of max per entry across top 9999 scores: 0.6533333333333333


  1%|          | 1/100 [00:00<00:01, 75.49it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 81 / 150  (54.0): 100%|██████████| 150/150 [00:01<00:00, 130.89it/s]


Score: 54.0 for set: [3, 3, 3]
Scores so far: [46.0, 51.33, 55.33, 53.33, 52.67, 56.67, 54.0]
Best score: 56.67
Average of max per entry across top 1 scores: 0.5666666666666667
Average of max per entry across top 2 scores: 0.6133333333333333
Average of max per entry across top 3 scores: 0.6466666666666666
Average of max per entry across top 5 scores: 0.66
Average of max per entry across top 8 scores: 0.66
Average of max per entry across top 9999 scores: 0.66


  1%|          | 1/100 [00:00<00:01, 75.95it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 75 / 150  (50.0): 100%|██████████| 150/150 [00:01<00:00, 135.02it/s]


Score: 50.0 for set: [3, 3, 3]
Scores so far: [46.0, 51.33, 55.33, 53.33, 52.67, 56.67, 54.0, 50.0]
Best score: 56.67
Average of max per entry across top 1 scores: 0.5666666666666667
Average of max per entry across top 2 scores: 0.6133333333333333
Average of max per entry across top 3 scores: 0.6466666666666666
Average of max per entry across top 5 scores: 0.66
Average of max per entry across top 8 scores: 0.66
Average of max per entry across top 9999 scores: 0.66


  3%|▎         | 3/100 [00:00<00:00, 118.88it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


Average Metric: 79 / 150  (52.7): 100%|██████████| 150/150 [00:01<00:00, 130.65it/s]


Score: 52.67 for set: [3, 3, 3]
Scores so far: [46.0, 51.33, 55.33, 53.33, 52.67, 56.67, 54.0, 50.0, 52.67]
Best score: 56.67
Average of max per entry across top 1 scores: 0.5666666666666667
Average of max per entry across top 2 scores: 0.6133333333333333
Average of max per entry across top 3 scores: 0.6466666666666666
Average of max per entry across top 5 scores: 0.66
Average of max per entry across top 8 scores: 0.66
Average of max per entry across top 9999 scores: 0.66
9 candidate programs found.


Let's evaluate our prompt optimized program on the base model!

In [8]:
with dspy.context(lm=lm, rm=retriever):
  print("Evaluating the prompt optimized program on the devset...")
  bfrs_base_eval = evaluate(bfrs_base_program)

Evaluating the prompt optimized program on the devset...


Average Metric: 262 / 500  (52.4): 100%|██████████| 500/500 [00:04<00:00, 115.45it/s]
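
The two devset results reported so far can be compared directly. The short sketch below recomputes the percentages from the raw counts shown in the progress bars (214 and 262 correct out of 500) and the gain in percentage points. The variable names are chosen for illustration.

```python
DEV_TOTAL = 500
vanilla_correct = 214  # vanilla program, base model
bfrs_correct = 262     # prompt optimized program, base model


def accuracy_percent(correct, total):
    """Exact-match accuracy as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


vanilla_acc = accuracy_percent(vanilla_correct, DEV_TOTAL)
bfrs_acc = accuracy_percent(bfrs_correct, DEV_TOTAL)
gain_points = round(bfrs_acc - vanilla_acc, 1)

print(vanilla_acc, bfrs_acc, gain_points)  # 42.8 52.4 9.6
```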


## IV. Bootstrap Data

In this section, we bootstrap data for fine-tuning.
In the code block below, we are deciding which program should be used to collect the bootstraps.
We are setting this to the prompt optimized program, but one could also set this to the vanilla program, though doing so would lead to lower quality bootstraps.

In [9]:
bootstrap_program = bfrs_base_program  # Can also set to vanilla_program

We are now ready to bootstrap traces using our `trainset`!
The code block below does this using the `bootstrap_data` function, which takes in a program and a dataset, runs the program on the dataset, and returns the collected data as a list of dictionaries.
More information is included in the code comments below.

In [10]:
from dspy.teleprompt.finetune_teleprompter import bootstrap_data


# Bootstrap traces using the prompt optimized program
with dspy.context(lm=lm, rm=retriever):
    bootstrapped_data = bootstrap_data(program=bootstrap_program, dataset=trainset, metric=metric, num_threads=NUM_THREADS)

# The bootstrap_data function returns a list of dictionaries, each dictionary
# corresponding to an example in the data, with the following keys and
# values:
# - The `example` field corresponding to the example itself
# - The `prediction` field corresponding to the prediction made by the program
#   on the example.
# - The `trace` field corresponding to the trace generated by the program on the
#   example.
# - The `score` field corresponding to the metric score of the example, if the
#   metric is provided. Otherwise, it is not included in the data.
print(f"\nTrainset has {len(trainset)} examples.")
print(f"Bootstrapped data has {len(bootstrapped_data)} examples.")
print(f"Bootstrapped data dictionaries have the following keys: {list(bootstrapped_data[0].keys())}")

Average Metric: 540 / 1000  (54.0): 100%|██████████| 1000/1000 [00:07<00:00, 128.03it/s]



Trainset has 1000 examples.
Bootstrapped data has 1000 examples.
Bootstrapped data dictionaries have the following keys: ['example', 'prediction', 'trace', 'score']


Shuffle our data for training!

In [11]:
rng.shuffle(bootstrapped_data)

We have now bootstrapped examples!
Before proceeding to fine-tuning, we filter out the unsuccessful examples, because we only want to keep the positive examples for our fine-tuning stage.

In [12]:
# Filter the bootstrapped data to only include examples where the metric score
# is 1
filtered_bootstrapped_data = [d for d in bootstrapped_data if d['score']]
print(f"Filtered bootstrapped data has {len(filtered_bootstrapped_data)} examples.")

Filtered bootstrapped data has 540 examples.


Once we have settled on the particular bootstrap examples we want to use, we need to convert our bootstrap data (e.g. example traces) into prompt and completion "pairs" that we can use for training.
To do so, we utilize the `trace` in the dictionaries returned by the `bootstrap_data` function.
In particular, we create module-level prompt and completion pairs, where the prompt corresponds to any text passed to the model _initially_.
The completion then includes any intermediate generation (e.g. chain-of-thought rationale) as well as the final outputs of the module (e.g. "Answer:").

In [13]:
from dspy.teleprompt.finetune_teleprompter import convert_to_module_level_prompt_completion_data

# To train a model on our program traces, we need a dataset that consists of
# the prompts and the completions for each module in our program. We can use the
# helper function `convert_to_module_level_prompt_completion_data` to generate
# these pairs. This function takes in a list of dictionaries, each of which must
# contain the "trace" field, which is used for generating the prompts and
# completions for each module in the program. The function returns a list of
# dictionaries, each of which contains the keys "prompt" and "completion", among
# others. Refer to the documentation for more information.
#
# The `exclude_demos` flag controls whether the demonstrations should be
# left out of the prompts. We are setting it to True here to have the "vanilla"
# prompts as part of our finetuning data.
finetune_data = convert_to_module_level_prompt_completion_data(filtered_bootstrapped_data, exclude_demos=True)
print(f"For each datapoint, we get one prompt completion pair for each module in our program of {len(vanilla_program.predictors())} modules.")
print(f"As a result, finetune data has {len(filtered_bootstrapped_data)} * {len(bootstrap_program.predictors())} = {len(finetune_data)} examples.")
print(f"The finetune data dictionaries have the following keys: {list(finetune_data[0].keys())}")

For each datapoint, we get one prompt completion pair for each module in our program of 3 modules.
As a result, finetune data has 540 * 3 = 1620 examples.
The finetune data dictionaries have the following keys: ['prompt', 'completion', 'predictor_ind']
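
To see where the `540 * 3 = 1620` count comes from, the sketch below flattens per-example traces into one record per predictor call. It is a simplified, hypothetical stand-in for the conversion step, not the DSPy implementation: here each trace entry is assumed to be a plain `(prompt, completion)` tuple, whereas the real traces carry richer objects.

```python
def to_module_level_pairs(data):
    """Flatten per-example traces into one record per predictor call."""
    pairs = []
    for item in data:
        for predictor_ind, (prompt, completion) in enumerate(item["trace"]):
            pairs.append({
                "prompt": prompt,
                "completion": completion,
                "predictor_ind": predictor_ind,
            })
    return pairs


# Two toy examples, each traced through a three-predictor program.
toy_data = [
    {"trace": [("query-0 prompt", " c0"), ("query-1 prompt", " c1"), ("answer prompt", " c2")]},
    {"trace": [("query-0 prompt", " d0"), ("query-1 prompt", " d1"), ("answer prompt", " d2")]},
]

toy_pairs = to_module_level_pairs(toy_data)
print(len(toy_pairs))  # 2 examples * 3 predictors = 6
print([p["predictor_ind"] for p in toy_pairs])  # [0, 1, 2, 0, 1, 2]
```

Because the records are emitted example by example, the pairs that come from the same trace end up next to each other in the output list.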


Let's look at an example prompt completion pair!

In [14]:
print(f"This prompt completion pair is collected from the following program using the predictor at index {finetune_data[0]['predictor_ind']}.\n")
print(bootstrap_program)

This prompt completion pair is collected from the following program using the predictor at index 0.

generate_query[0] = Predict(StringSignature(context, question -> rationale, search_query
    instructions='Given the fields `context`, `question`, produce the fields `search_query`.'
    context = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Context:', 'desc': '${context}'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${produce the output fields}. We ...', '__dspy_field_type': 'output'})
    search_query = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Search Query:', 'desc': '${search_query}'})
))
generate_query[1] = Predict(StringSignature(context, q

In [15]:
print(finetune_data[0]['prompt'])

Given the fields `context`, `question`, produce the fields `search_query`.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the output fields}. We ...

Search Query: ${search_query}

---

Context: N/A

Question: Did Jeffrey Steele and Frank Iero both become Nashville songwriters?

Reasoning: Let's think step by step in order to


In [16]:
print(finetune_data[0]['completion'])

 determine if both Jeffrey Steele and Frank Iero became Nashville songwriters. First, I need to verify the careers of both individuals to see if they have worked as songwriters in Nashville. This involves looking up their professional backgrounds and any notable contributions they may have made to the Nashville music scene.
Search Query: Jeffrey Steele Frank Iero Nashville songwriters


Note that we don't shuffle this dataset again.
As a result, the prompt and completion pairs that come from the same trace stay next to each other.
We now save our prompt completion data to a file, where each line is a separate dictionary.
This is the format expected by the fine-tuning methods in DSPy!

In [17]:
import ujson

train_path = "trainset_data.jsonl"

print(f"Writing dataset with length {len(finetune_data)} to {train_path}")
with open(train_path, "w") as f:
    ujson.dump(finetune_data, f)

Writing dataset with length 1620 to trainset_data.jsonl
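
For reference, a file in which each line is a separate dictionary (the JSON Lines layout) can be written and read back as sketched below. Note that a single `dump` call on a list writes one JSON array rather than one object per line. The sketch uses the standard-library `json` module, toy records, and a temporary path; whether the fine-tuning code on this branch expects exactly this layout is an assumption to check against its documentation.

```python
import json
import os
import tempfile

# Toy records with the same keys as the fine-tuning data.
records = [
    {"prompt": "p0", "completion": " c0", "predictor_ind": 0},
    {"prompt": "p1", "completion": " c1", "predictor_ind": 1},
]

toy_path = os.path.join(tempfile.mkdtemp(), "toy_data.jsonl")

# Write one JSON object per line.
with open(toy_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping blank lines.
with open(toy_path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```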


## V. Fine-tune!

We are now ready to start fine-tuning!
You can execute the next code block to get a fine-tuned model.
If `USE_CACHED_MODEL` is set, the block loads a reference to an already trained model from `CACHED_FINETUNED_MODEL_ID` instead of training a new one.
You won't be able to make any new queries to a model we trained, as `OpenAI` models aren't publicly shareable for the time being, but you can query it using the examples that are cached.
As written, `USE_CACHED_MODEL` is `False`, so the block trains your own model.

As mentioned, all the trainable LMs in DSPy implement the `dspy.TrainableLM` class.
The main interface to trainable LMs is the `get_finetune` method, which takes in a training method and a path to a train file, along with other kwargs that vary based on the specific subclass implementing the interface.
This method then returns a `future` object that holds a reference to the LM that's being fine-tuned.
Once the fine-tuning process completes, the `result` of this future object is populated with the fine-tuned model (or with any errors if the training didn't complete).

Training from scratch took us around 61 minutes. It may take you more or less time depending on how busy the `OpenAI` servers are.
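
The future-based pattern described above can be illustrated with the standard library alone. The sketch below assumes the object returned by `get_finetune` behaves like a `concurrent.futures.Future`; `fake_finetune` and `failing_finetune` are hypothetical stand-ins for a training job, not DSPy functions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_finetune(train_path):
    """Stand-in for a long-running training job; returns a fake model ID."""
    time.sleep(0.1)
    return f"finetuned-model-from-{train_path}"


def failing_finetune(train_path):
    """Stand-in for a training job that does not complete."""
    raise RuntimeError("training failed")


executor = ThreadPoolExecutor(max_workers=1)

# submit() returns immediately with a future; the job runs in the background.
future = executor.submit(fake_finetune, "trainset_data.jsonl")

# result() blocks until the job finishes, then returns its value.
model_id = future.result()
print(future.done(), model_id)

# If the job raised, result() re-raises the same exception.
failed = executor.submit(failing_finetune, "trainset_data.jsonl")
try:
    failed.result()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

executor.shutdown()
```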

In [None]:
from dsp.modules.lm import TrainingMethod

USE_CACHED_MODEL = False

if USE_CACHED_MODEL:
    CACHED_FINETUNED_MODEL_ID = ""
    finetuned_lm = dspy.TrainableOpenAI(model=CACHED_FINETUNED_MODEL_ID, **MODEL_PARAMETERS)
else:
    # Define the hyperparameters for finetuning. To look at the specifics of
    # what kinds of arguments should be passed to `get_finetune`, refer to the
    # documentation for the particular subclass you are using, in this case,
    # `dspy.TrainableOpenAI`.
    hyperparameters = {
        "n_epochs": 3
    }

    # Select the training method. Must be one of the options in the
    # `TrainingMethod` enum.
    method = TrainingMethod.SFT

    # Get a future object for the finetuned language model. This will get populated
    # with the finetuned model once the training is complete.
    future_finetuned_lm = lm.get_finetune(
        method=method,
        train_path=train_path,
        hyperparameters=hyperparameters
    )

    # Call the result method on the future object to get the finetuned language
    # model. This will block the execution until the training is complete.
    finetuned_lm = future_finetuned_lm.result()

# Record the model ID to access the model again in the future. You can directly
# substitute this model ID for the `CACHED_FINETUNED_MODEL_ID` variable above,
# and the next line will load the model directly.
finetuned_lm.kwargs['model']

Let's evaluate our fine-tuned model!

In [None]:
vanilla_program = BasicMH()

with dspy.context(lm=finetuned_lm, rm=retriever):
  print("Evaluating the vanilla program on the devset using the finetuned model...")
  vanilla_finetune_eval = evaluate(vanilla_program, devset=devset)

## VI. Prompt Optimize the Fine-tuned Model

Now that we have our fine-tuned model, we can prompt optimize it again!

In [None]:
# Prepare the training and validation sets for the optimizer using the original
# trainset. This ensures that our "devset" is left untouched.
shuffled_trainset = [d for d in trainset]
rng.shuffle(shuffled_trainset)
optimizer_trainset = shuffled_trainset[:OPTIMIZER_NUM_TRAIN]
optimizer_valset = shuffled_trainset[OPTIMIZER_NUM_TRAIN:OPTIMIZER_NUM_TRAIN+OPTIMIZER_NUM_VAL]

# Initialize the optimizer
bfrs_optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=MAX_BOOTSTRAPPED_DEMOS,
    max_labeled_demos=MAX_LABELED_DEMOS,
    num_candidate_programs=NUM_CANDIDATE_PROGRAMS,
    num_threads=NUM_THREADS
)

# Compile the optimizer and save the returned program
with dspy.context(lm=finetuned_lm, rm=retriever):
    vanilla_program = BasicMH()
    bfrs_finetune_program = bfrs_optimizer.compile(vanilla_program, trainset=optimizer_trainset, valset=optimizer_valset)

Let's see how well our prompt optimized fine-tuned model does!

In [None]:
vanilla_program = BasicMH()

with dspy.context(lm=finetuned_lm, rm=retriever):
  print("Evaluating the prompt optimized program on the devset using the finetuned model...")
  bfrs_finetune_eval = evaluate(bfrs_finetune_program, devset=devset)

## VII. Summary

The code block below summarizes all the results we have looked at!

In [None]:
print(f"Results for fine-tuning {GPT_4O_MINI} on HotPotQA, starting from a trainset of {len(trainset)} examples. Out of the {len(trainset)} examples, only {len(filtered_bootstrapped_data)} were judged successful when run through the program prompt optimized with BootstrapFewShotWithRandomSearch (bfrs) on the base model. Traces from these examples are used for SFT. The reported results are computed on a held-out devset of {len(devset)} examples.\n")
print(f"    Base model (vanilla program): {vanilla_base_eval}")
print(f"    Base model (bfrs program): {bfrs_base_eval}")
print(f"    Fine-tuned model (vanilla program): {vanilla_finetune_eval}")
print(f"    Fine-tuned model (bfrs program): {bfrs_finetune_eval}")

This was a demo notebook to showcase the experimental fine-tuning capabilities in DSPy!
We are actively working on making fine-tuning a first-class citizen in DSPy. Do let us know if you have any suggestions!