# DSPy: Named Entity Recognition

This tutorial demonstrates how to perform entity extraction using the CoNLL-2003 dataset with DSPy. The focus is on extracting entities referring to people. 

We will:
- Extract and label entities from the CoNLL-2003 dataset that refer to people.
- Define a DSPy program for extracting entities that refer to people.
- Optimize and evaluate the program on a subset of the CoNLL-2003 dataset.

By the end of this tutorial, you'll understand how to structure tasks in DSPy using signatures and modules, evaluate your system's performance, and improve its quality with optimizers.

In [1]:
from os import getenv
from typing import Any, Dict, List

import dspy
from datasets import load_dataset
from dotenv import load_dotenv

load_dotenv()

True

## Prepare the Dataset

In [3]:
def extract_people_entities(data_row: Dict[str, Any]) -> List[str]:
    """
    Extracts entities referring to people from a row of the CoNLL-2003 dataset.

    Args:
        data_row (Dict[str, Any]): A row from the dataset containing tokens and NER tags.

    Returns:
        List[str]: List of tokens tagged as people.
    """

    return [
        token
        for token, ner_tag in zip(data_row["tokens"], data_row["ner_tags"])
        if ner_tag in (1, 2)  # CoNLL-2003 tag ids for people: 1 = B-PER, 2 = I-PER
    ]


def prepare_dataset(data_split, start: int, end: int) -> List[dspy.Example]:
    """
    Prepares a sliced dataset split for use with DSPy.

    Args:
        data_split: The dataset split (e.g., train or test).
        start (int): Starting index of the slice.
        end (int): Ending index of the slice.

    Returns:
        List[dspy.Example]: List of DSPy Examples with tokens and expected labels.
    """

    return [
        dspy.Example(
            tokens=row["tokens"], expected_extracted_people=extract_people_entities(row)
        ).with_inputs("tokens")
        for row in data_split.select(range(start, end))
    ]


# Load the dataset
dataset = load_dataset("conll2003")

# Prepare the training and test sets
train_set = prepare_dataset(dataset["train"], 0, 50)
test_set = prepare_dataset(dataset["test"], 0, 200)

Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]
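
To make the tag filter concrete, here is a minimal sketch that applies the same idea to a hand-written row instead of the downloaded dataset. The label order below is the one we believe the Hugging Face `conll2003` dataset uses; treat it as an assumption and confirm it with `dataset["train"].features["ner_tags"].feature.names`. The names `CONLL_LABELS`, `people_tokens`, and `toy_row` are hypothetical, and the row itself is made up.

```python
# Assumed label order for the Hugging Face `conll2003` dataset (verify before relying on it).
CONLL_LABELS = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def people_tokens(tokens, ner_tags):
    """Keep the tokens whose tag name ends in PER (B-PER or I-PER)."""
    return [
        token
        for token, tag in zip(tokens, ner_tags)
        if CONLL_LABELS[tag].endswith("PER")
    ]


# Hand-written row for illustration only: "Rob Andrew" is a person, "Newcastle" an organisation.
toy_row = {
    "tokens": ["Rob", "Andrew", "joined", "Newcastle", "."],
    "ner_tags": [1, 2, 0, 3, 0],
}

people = people_tokens(toy_row["tokens"], toy_row["ner_tags"])
print(people)
```

Filtering by label name rather than by the integers 1 and 2 is only a readability choice; under the assumed label order both select the same tokens.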

## Configure DSPy and create an Entity Extraction Program

Here, we define a DSPy program for extracting entities referring to people from tokenized text.

Key DSPy concepts:
- Signatures: Define structured input/output schemas for your program.
- Modules: Encapsulate program logic in reusable, composable units. 

In [4]:
class PeopleExtraction(dspy.Signature):
    """
    Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
    Output a list of tokens. In other words, do not combine multiple tokens into a single value.
    """

    tokens: list[str] = dspy.InputField(desc="tokenized text")
    extracted_people: list[str] = dspy.OutputField(
        desc="all tokens referring to specific people extracted from the tokenized text"
    )


people_extractor = dspy.ChainOfThought(PeopleExtraction)

In [5]:
lm = dspy.LM(
    model="openrouter/azure/gpt-4o",
    api_key=getenv("OPENROUTER_API_KEY"),
    base_url=getenv("OPENROUTER_BASE_URL"),
)
dspy.settings.configure(lm=lm)

In [6]:
people_extractor(
    tokens=[
        "John",
        "is",
        "going",
        "to",
        "the",
        "store",
        "with",
        "Mary",
        "and",
        "their",
        "dog",
        ".",
    ]
)

Prediction(
    reasoning='The text mentions two specific people, "John" and "Mary". These are proper nouns and clearly refer to individuals. Other tokens like "dog" do not refer to people, so they are excluded.',
    extracted_people=['John', 'Mary']
)

In [7]:
dspy.inspect_history(n=1)





[2025-03-17T14:25:20.985169]

System message:

Your input fields are:
1. `tokens` (list[str]): tokenized text

Your output fields are:
1. `reasoning` (str)
2. `extracted_people` (list[str]): all tokens referring to specific people extracted from the tokenized text

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## tokens ## ]]
{tokens}

[[ ## reasoning ## ]]
{reasoning}

[[ ## extracted_people ## ]]
{extracted_people}        # note: the value you produce must adhere to the JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
        Output a list of tokens. In other words, do not combine multiple tokens into a single value.


User message:

[[ ## tokens ## ]]
["John", "is", "going", "to", "the", "store", "with", "Mary", "
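
The prompt above delimits every field with a `[[ ## name ## ]]` header and asks for `extracted_people` as a JSON array. The following is a minimal sketch of how a completion in that layout could be split back into fields. It is an illustration of the format only, not DSPy's own parsing code; `split_fields` and `sample_completion` are hypothetical names, and the sample text is hand-written.

```python
import json
import re

# Matches a header line such as "[[ ## reasoning ## ]]".
FIELD_HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\]\s*$", re.MULTILINE)


def split_fields(completion: str) -> dict[str, str]:
    """Split a completion into {field name: raw text} using the header lines."""
    parts = FIELD_HEADER.split(completion)
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...].
    names, bodies = parts[1::2], parts[2::2]
    return {name: body.strip() for name, body in zip(names, bodies)}


# Hand-written completion that follows the layout shown in the prompt.
sample_completion = """[[ ## reasoning ## ]]
The text mentions "John" and "Mary".

[[ ## extracted_people ## ]]
["John", "Mary"]

[[ ## completed ## ]]
"""

fields = split_fields(sample_completion)
extracted_people = json.loads(fields["extracted_people"])  # JSON array -> Python list
print(extracted_people)
```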

## Define Metric and Evaluation Functions

In DSPy, evaluating a program's performance is critical for iterative development. A good evaluation framework allows us to:

- Measure the quality of our program's outputs.
- Compare outputs against ground-truth labels.
- Identify areas for improvement.

In [8]:
def extraction_correctness_metric(
    example: dspy.Example, prediction: dspy.Prediction, trace=None
) -> bool:
    """
    Computes correctness of entity extraction predictions.

    Args:
        example (dspy.Example): The dataset example containing expected people entities.
        prediction (dspy.Prediction): The prediction from the DSPy people extraction program.
        trace: Optional trace object for debugging.

    Returns:
        bool: True if the predicted list equals the expected list exactly (same tokens, same order), False otherwise.
    """

    return prediction.extracted_people == example.expected_extracted_people


evaluate_correctness = dspy.Evaluate(
    devset=test_set,
    metric=extraction_correctness_metric,
    num_threads=24,
    display_progress=True,
    display_table=True,
)

In [9]:
evaluate_correctness(people_extractor, devset=test_set)

Average Metric: 196.00 / 200 (98.0%): : 202it [02:31,  1.33it/s]                       

2025/03/17 14:28:02 INFO dspy.evaluate.evaluate: Average Metric: 196 / 200 (98.0%)





| | tokens | expected_extracted_people | reasoning | extracted_people | extraction_correctness_metric |
|---|---|---|---|---|---|
| 0 | [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... | [CHINA] | The tokenized text does not contain any references to specific peo... | [] | |
| 1 | [Nadim, Ladki] | [Nadim, Ladki] | The tokens "Nadim" and "Ladki" appear to form a name referring to ... | [Nadim, Ladki] | ✔️ [True] |
| 2 | [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] | [] | The tokens provided appear to describe a location and a date, but ... | [] | ✔️ [True] |
| 3 | [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... | [] | The tokenized text describes a sports event involving Japan and Sy... | [] | ✔️ [True] |
| 4 | [But, China, saw, their, luck, desert, them, in, the, second, matc... | [] | The tokenized text describes a soccer match involving China and Uz... | [] | ✔️ [True] |
| ... | ... | ... | ... | ... | ... |
| 195 | ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... | [David, Campese] | The text mentions "David Campese" as a specific person, referring ... | [David, Campese] | ✔️ [True] |
| 196 | ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... | [] | The tokenized text mentions "the 34-year-old winger," but it does ... | [] | ✔️ [True] |
| 197 | ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... | [Campese, Rob, Andrew] | The text mentions two specific individuals: "Campese" and "Rob And... | [Campese, Rob, Andrew] | ✔️ [True] |
| 198 | ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... | [Campo, Andrew] | The text mentions two specific people: "Campo" and "Andrew." These... | [Campo, Andrew] | ✔️ [True] |


98.0
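
The metric used here is all-or-nothing: a prediction that misses a single token scores the same as one that misses everything. As a point of comparison, the sketch below contrasts exact match with a token-level F1 score. This is an illustrative alternative, not part of the tutorial's pipeline; `token_f1_metric` is a hypothetical name, and `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction` so the snippet runs without an LM.

```python
from collections import Counter
from types import SimpleNamespace


def exact_match_metric(example, prediction, trace=None) -> bool:
    """Same rule as the tutorial's metric: the two lists must be identical."""
    return prediction.extracted_people == example.expected_extracted_people


def token_f1_metric(example, prediction, trace=None) -> float:
    """Hypothetical softer metric: F1 over the multiset of tokens."""
    expected = example.expected_extracted_people
    predicted = prediction.extracted_people
    if not expected and not predicted:
        return 1.0  # nothing to find and nothing predicted
    overlap = sum((Counter(expected) & Counter(predicted)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


# Stand-ins: the prediction misses one of three expected tokens.
example = SimpleNamespace(expected_extracted_people=["Campese", "Rob", "Andrew"])
prediction = SimpleNamespace(extracted_people=["Campese", "Andrew"])

exact = exact_match_metric(example, prediction)
f1 = token_f1_metric(example, prediction)
print(exact, round(f1, 2))
```

Because it keeps the `(example, prediction, trace=None)` signature, a function like this should be usable wherever the tutorial passes `extraction_correctness_metric`, though the reported averages would then be mean F1 rather than the share of exact matches.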

## Optimize the Model

DSPy includes powerful optimizers that can improve the quality of your system.

Here, we use DSPy's MIPROv2 optimizer to:
- Automatically tune the program's language model (LM) prompt in two ways: (1) using the LM to rewrite the prompt's instructions, and (2) building few-shot examples from the training dataset, augmented with reasoning generated by `dspy.ChainOfThought`.
- Maximize correctness as measured by our metric. In the run logged below, candidates are scored on a 40-example validation set drawn from the 50 training examples.

In [10]:
mipro_optimizer = dspy.MIPROv2(
    metric=extraction_correctness_metric,
    auto="medium",
)
optimized_people_extractor = mipro_optimizer.compile(
    people_extractor,
    trainset=train_set,
    max_bootstrapped_demos=4,
    requires_permission_to_run=False,
    minibatch=False,
)

2025/01/17 07:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: False
num_candidates: 19
valset size: 40

2025/01/17 07:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/01/17 07:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/01/17 07:50:00 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


 40%|████      | 4/10 [00:06<00:09,  1.54s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/19


 40%|████      | 4/10 [00:03<00:04,  1.21it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 5/19


 20%|██        | 2/10 [00:03<00:12,  1.51s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 6/19


 20%|██        | 2/10 [00:00<00:00, 619.73it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 7/19


 10%|█         | 1/10 [00:00<00:00, 471.96it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/19


 20%|██        | 2/10 [00:00<00:00, 727.99it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 9/19


 30%|███       | 3/10 [00:00<00:00, 946.65it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 10/19


 10%|█         | 1/10 [00:02<00:18,  2.02s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 11/19


 30%|███       | 3/10 [00:01<00:04,  1.65it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 12/19


 20%|██        | 2/10 [00:00<00:00, 427.92it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 13/19


 30%|███       | 3/10 [00:01<00:03,  1.81it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 14/19


 20%|██        | 2/10 [00:00<00:00, 660.73it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 15/19


 10%|█         | 1/10 [00:00<00:00, 665.76it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 16/19


 10%|█         | 1/10 [00:00<00:00, 669.80it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 17/19


 30%|███       | 3/10 [00:00<00:00, 481.22it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 18/19


 20%|██        | 2/10 [00:00<00:00, 914.09it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 19/19


 40%|████      | 4/10 [00:00<00:00, 992.09it/s]
2025/01/17 07:50:18 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/01/17 07:50:18 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


2025/01/17 07:50:27 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/01/17 07:53:16 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/01/17 07:53:16 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Extract contiguous tokens referring to specific people, if any, from a list of string tokens.
Output a list of tokens. In other words, do not combine multiple tokens into a single value.

2025/01/17 07:53:16 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Identify all contiguous tokens that represent the full names of individuals mentioned in the provided text. Return these tokens as a list. Do not combine multiple tokens, and focus solely on person names, excluding organizations, locations, or other entities.  Provide a step-by-step rationale explaining your selections.

2025/01/17 07:53:16 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a highly specialized AI tasked with identifying individuals mentioned in breaking news articles for

Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:16<00:00,  2.40it/s] 

2025/01/17 07:53:32 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 92.5

2025/01/17 07:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/01/17 07:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

2025/01/17 07:53:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 1 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:14<00:00,  2.68it/s] 

2025/01/17 07:53:47 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: [92mBest full score so far![0m Score: 95.0
2025/01/17 07:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/01/17 07:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0]
2025/01/17 07:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/01/17 07:53:47 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 2 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:12<00:00,  3.29it/s] 

2025/01/17 07:54:00 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:54:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/01/17 07:54:00 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0]
2025/01/17 07:54:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 95.0


2025/01/17 07:54:00 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 3 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:11<00:00,  3.38it/s] 

2025/01/17 07:54:11 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:54:11 INFO dspy.teleprompt.mipro_optimizer_v2: [92mBest full score so far![0m Score: 97.5
2025/01/17 07:54:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:54:11 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5]
2025/01/17 07:54:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:54:11 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 4 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:13<00:00,  2.91it/s] 

2025/01/17 07:54:25 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:54:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].
2025/01/17 07:54:25 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0]
2025/01/17 07:54:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:54:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 5 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:12<00:00,  3.28it/s] 

2025/01/17 07:54:37 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:54:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:54:37 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5]
2025/01/17 07:54:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:54:37 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 6 / 25 =====



Average Metric: 36.00 / 40 (90.0%): 100%|██████████| 40/40 [00:15<00:00,  2.59it/s] 

2025/01/17 07:54:53 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/01/17 07:54:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1'].
2025/01/17 07:54:53 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0]
2025/01/17 07:54:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:54:53 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:13<00:00,  2.88it/s] 

2025/01/17 07:55:07 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:55:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12'].
2025/01/17 07:55:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0]
2025/01/17 07:55:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:55:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 8 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:13<00:00,  2.94it/s] 

2025/01/17 07:55:20 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:55:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13'].
2025/01/17 07:55:20 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0]
2025/01/17 07:55:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:55:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 9 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:14<00:00,  2.67it/s] 

2025/01/17 07:55:35 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].
2025/01/17 07:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5]
2025/01/17 07:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:55:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:13<00:00,  3.00it/s] 

2025/01/17 07:55:49 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 1'].
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5]
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 2207.15it/s]

2025/01/17 07:55:49 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5]
2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:55:49 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 12 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:12<00:00,  3.13it/s] 

2025/01/17 07:56:02 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:56:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:56:02 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5]
2025/01/17 07:56:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:02 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:11<00:00,  3.45it/s] 

2025/01/17 07:56:13 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:56:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:56:13 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5]
2025/01/17 07:56:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:13 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 14 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:16<00:00,  2.48it/s] 

2025/01/17 07:56:30 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 6'].
2025/01/17 07:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5]
2025/01/17 07:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:30 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 15 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:13<00:00,  2.92it/s] 

2025/01/17 07:56:43 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 14'].
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5]
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 16 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:00<00:00, 3544.21it/s] 

2025/01/17 07:56:43 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5]
2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:43 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 17 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:12<00:00,  3.31it/s]

2025/01/17 07:56:55 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:56:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 16'].
2025/01/17 07:56:55 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5]
2025/01/17 07:56:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:56:55 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 18 / 25 =====



Average Metric: 36.00 / 40 (90.0%): 100%|██████████| 40/40 [00:11<00:00,  3.50it/s] 

2025/01/17 07:57:07 INFO dspy.evaluate.evaluate: Average Metric: 36 / 40 (90.0%)
2025/01/17 07:57:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 90.0 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 5'].
2025/01/17 07:57:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0]
2025/01/17 07:57:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:57:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:13<00:00,  2.95it/s] 

2025/01/17 07:57:21 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:57:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 17'].
2025/01/17 07:57:21 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0]
2025/01/17 07:57:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:57:21 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 20 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:11<00:00,  3.61it/s] 

2025/01/17 07:57:32 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 15'].
2025/01/17 07:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5]
2025/01/17 07:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:57:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:11<00:00,  3.35it/s] 

2025/01/17 07:57:44 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:57:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 18'].
2025/01/17 07:57:44 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5, 97.5]
2025/01/17 07:57:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:57:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 22 / 25 =====



Average Metric: 39.00 / 40 (97.5%): 100%|██████████| 40/40 [00:11<00:00,  3.40it/s] 

2025/01/17 07:57:55 INFO dspy.evaluate.evaluate: Average Metric: 39 / 40 (97.5%)
2025/01/17 07:57:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.5 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4'].
2025/01/17 07:57:55 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5, 97.5, 97.5]
2025/01/17 07:57:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:57:55 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 23 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:12<00:00,  3.32it/s] 

2025/01/17 07:58:07 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:58:07 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 10'].
2025/01/17 07:58:07 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5, 97.5, 97.5, 95.0]
2025/01/17 07:58:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:58:07 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 24 / 25 =====



Average Metric: 38.00 / 40 (95.0%): 100%|██████████| 40/40 [00:12<00:00,  3.16it/s] 

2025/01/17 07:58:20 INFO dspy.evaluate.evaluate: Average Metric: 38 / 40 (95.0%)
2025/01/17 07:58:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 95.0 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 4'].
2025/01/17 07:58:20 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5, 97.5, 97.5, 95.0, 95.0]
2025/01/17 07:58:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:58:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 =====



Average Metric: 37.00 / 40 (92.5%): 100%|██████████| 40/40 [00:11<00:00,  3.35it/s] 

2025/01/17 07:58:32 INFO dspy.evaluate.evaluate: Average Metric: 37 / 40 (92.5%)
2025/01/17 07:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.5 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 9'].
2025/01/17 07:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Scores so far: [92.5, 95.0, 95.0, 97.5, 95.0, 97.5, 90.0, 95.0, 95.0, 97.5, 92.5, 97.5, 97.5, 97.5, 92.5, 92.5, 97.5, 92.5, 90.0, 95.0, 92.5, 97.5, 97.5, 95.0, 95.0, 92.5]
2025/01/17 07:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best score so far: 97.5


2025/01/17 07:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 97.5!
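
The trial log above amounts to a search over pairs of (instruction candidate, few-shot set), keeping the best-scoring pair. The log states that MIPROv2 uses Bayesian Optimization for this; the sketch below deliberately swaps in plain random search to show only the shape of the loop. `toy_score` is a made-up scoring function, not real evaluation results, and all names are hypothetical. The sizes (19 candidates, 25 trials) and the default score of 92.5 are taken from the log.

```python
import random

# Sizes taken from the log above; everything else is a toy stand-in.
NUM_INSTRUCTIONS = 19   # instruction candidates, indices 0..18
NUM_FEWSHOT_SETS = 19   # bootstrapped few-shot sets, indices 0..18
NUM_TRIALS = 25
DEFAULT_SCORE = 92.5    # "Default program score" reported in the log


def toy_score(instruction_idx: int, fewshot_idx: int) -> float:
    """Hypothetical scorer; a real run evaluates the program on the validation set."""
    return 90.0 + ((instruction_idx * 7 + fewshot_idx * 3) % 4) * 2.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
scores = [DEFAULT_SCORE]
best_score, best_params = DEFAULT_SCORE, None

for _ in range(NUM_TRIALS):
    # Random search: sample one instruction and one few-shot set per trial.
    params = (rng.randrange(NUM_INSTRUCTIONS), rng.randrange(NUM_FEWSHOT_SETS))
    score = toy_score(*params)
    scores.append(score)
    if score > best_score:
        best_score, best_params = score, params

print(f"Best score {best_score} with (instruction, few-shot set) = {best_params}")
```

A Bayesian optimizer differs in how it picks `params`: instead of sampling uniformly, it uses the scores seen so far to favor promising combinations, which is consistent with the log returning to `Few-Shot Set 18` in many later trials.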





In [11]:
evaluate_correctness(optimized_people_extractor, devset=test_set)

Average Metric: 196.00 / 200 (98.0%): 100%|██████████| 200/200 [00:16<00:00, 12.14it/s]

2025/01/17 07:59:01 INFO dspy.evaluate.evaluate: Average Metric: 196 / 200 (98.0%)





| | tokens | expected_extracted_people | rationale | extracted_people | extraction_correctness_metric |
|---|---|---|---|---|---|
| 0 | [SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, IN, SURPRISE, DEFEAT... | [CHINA] | ${[]}. We found no names of specific people in the input text. Alt... | [] | |
| 1 | [Nadim, Ladki] | [Nadim, Ladki] | ${["Nadim", "Ladki"]}. We extract "Nadim Ladki" as a person's name. | [Nadim, Ladki] | ✔️ [True] |
| 2 | [AL-AIN, ,, United, Arab, Emirates, 1996-12-06] | [] | ${[]}. We did not find any mentions of individuals. | [] | ✔️ [True] |
| 3 | [Japan, began, the, defence, of, their, Asian, Cup, title, with, a... | [] | ${[]}. We found no people mentioned. "Japan" and "Syria" are coun... | [] | ✔️ [True] |
| 4 | [But, China, saw, their, luck, desert, them, in, the, second, matc... | [] | We extracted no people. There are no tokens that refer to specifi... | [] | ✔️ [True] |
| ... | ... | ... | ... | ... | ... |
| 195 | ['The', 'Wallabies', 'have', 'their', 'sights', 'set', 'on', 'a', ... | [David, Campese] | ${["David", "Campese"]}. We extracted "David Campese" because it a... | [David, Campese] | ✔️ [True] |
| 196 | ['The', 'Wallabies', 'currently', 'have', 'no', 'plans', 'to', 'ma... | [] | We extracted no people. While the text refers to a "34-year-old w... | [] | ✔️ [True] |
| 197 | ['Campese', 'will', 'be', 'up', 'against', 'a', 'familiar', 'foe',... | [Campese, Rob, Andrew] | ${["Campese", "Rob", "Andrew"]}. We extracted "Campese" and "Rob A... | [Campese, Rob, Andrew] | ✔️ [True] |
| 198 | ['"', 'Campo', 'has', 'a', 'massive', 'following', 'in', 'this', '... | [Campo, Andrew] | ${["Campo", "Andrew"]}. We extracted "Campo" and "Andrew" as they ... | [Campo, Andrew] | ✔️ [True] |


98.0

In [12]:
dspy.inspect_history(n=1)





[2025-01-17T07:59:01.772698]

System message:

Your input fields are:
1. `tokens` (list[str]): tokenized text

Your output fields are:
1. `rationale` (str): ${produce the extracted_people}. We ...
2. `extracted_people` (list[str]): all tokens referring to specific people extracted from the tokenized text

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## tokens ## ]]
{tokens}

[[ ## rationale ## ]]
{rationale}

[[ ## extracted_people ## ]]
{extracted_people}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        You are a highly specialized AI tasked with identifying individuals mentioned in sensitive intelligence reports. Accurate extraction is crucial for national security.  Extract contiguous tokens referring to specific people, if any, fro

In [13]:
cost = sum(
    [x["cost"] for x in lm.history if x["cost"] is not None]
)  # cost in USD, as calculated by LiteLLM for certain providers
cost

3.6951775000000042
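
The `is not None` filter in the cell above matters because summing a sequence that contains `None` raises a `TypeError`. The sketch below shows this with a hand-written stand-in for `lm.history`. The entries and amounts are invented for illustration, and real history entries contain more than a `cost` key.

```python
# Hand-written stand-in for lm.history; the amounts are invented.
fake_history = [
    {"cost": 0.0125},
    {"cost": None},   # an entry for which no cost was recorded
    {"cost": 0.0075},
]

# Same filter as the tutorial, written as a generator expression.
total_cost = sum(
    entry["cost"] for entry in fake_history if entry["cost"] is not None
)

# Without the filter, sum() fails as soon as it reaches the None entry.
try:
    sum(entry["cost"] for entry in fake_history)
    unfiltered_failed = False
except TypeError:
    unfiltered_failed = True

print(total_cost, unfiltered_failed)
```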

## Saving and Loading Optimized Programs

DSPy supports saving and loading programs, enabling you to reuse optimized systems without the need to re-optimize from scratch. This feature is especially useful for deploying your programs in production environments or sharing them with collaborators.

In [14]:
optimized_people_extractor.save("optimized_extractor.json")

loaded_people_extractor = dspy.ChainOfThought(PeopleExtraction)
loaded_people_extractor.load("optimized_extractor.json")

loaded_people_extractor(
    tokens=["Italy", "recalled", "Marcello", "Cuttitta"]
).extracted_people

['Marcello', 'Cuttitta']