# Calibrating Meaning Similarity

This notebook evaluates and tunes the judge that scores whether a rewritten sentence still means the same as the original.  
It uses a golden set of sentence pairs with human-assigned scores and reasoning, and searches for the judge prompt whose automatic scores agree most closely with those human ratings.
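To make "matching human opinions" concrete, here is a minimal sketch of an agreement score between a gold rating and a judge rating. The logs further down mention a `score_agreement_metric`, but its implementation is not shown in this notebook, so the formula below (one minus the absolute difference) and the names `Rating`, `agreement`, and `mean_agreement` are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """A single score in [0, 1]; stands in for a gold label or a judge output."""
    score: float


def agreement(gold: Rating, predicted: Rating) -> float:
    """Hypothetical agreement: 1.0 for identical scores, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(gold.score - predicted.score))


def mean_agreement(pairs) -> float:
    """Average agreement over (gold, predicted) pairs."""
    return sum(agreement(g, p) for g, p in pairs) / len(pairs)


# Three illustrative pairs: a near miss, an exact match, and a clear disagreement.
pairs = [
    (Rating(0.6), Rating(0.5)),
    (Rating(0.2), Rating(0.2)),
    (Rating(0.2), Rating(0.6)),
]
print(f"mean agreement: {mean_agreement(pairs):.3f}")
```

A fractional metric of this kind would be consistent with the non-integer totals in the evaluation logs (for example `31.30 / 38`), although the project's actual metric may be defined differently.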

### Setting up DSPy

In [11]:
import os
os.chdir('../promptsmith')

In [12]:
from promptsmith.dspy_init import get_dspy
dspy = get_dspy()

### Loading the golden set and printing a few examples

In [13]:
from promptsmith.golden.load import load_golden_set


dev_set, test_set = load_golden_set(judge_name="meaning", seed=42)

print(f"Loaded {len(dev_set)} dev examples and {len(test_set)} test examples")


Loaded 48 dev examples and 12 test examples
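The 48/12 counts above are what a seeded 80/20 split of 60 examples would produce. The sketch below shows one way such a reproducible split could work; `split_golden_set`, the 60-item list, and the 20% test fraction are assumptions for illustration, not the actual logic inside `load_golden_set`.

```python
import random


def split_golden_set(examples, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then cut off the first test_fraction as the test set."""
    rng = random.Random(seed)      # local RNG, so global random state is untouched
    shuffled = list(examples)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-ins for the golden examples.
examples = [f"pair-{i}" for i in range(60)]
dev, test = split_golden_set(examples)
print(f"Loaded {len(dev)} dev examples and {len(test)} test examples")
```

Fixing the seed keeps the dev and test sets identical across runs, so dev scores from different calibration attempts stay comparable.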


In [14]:
def print_examples(examples, n=1):
    """Print the first n golden examples with their gold score and reasoning."""
    for i, example in enumerate(examples[:n], start=1):
        print(f"\nExample {i}:")
        print(f"Input: {example.input_text}")
        print(f"Output: {example.output_text}")
        print(f"Gold score: {example.gold_score}")
        print(f"Gold reasoning: {example.gold_reasoning}")

print_examples(dev_set, n=3)


Example 1:
Input: Linda borrowed the book from Mark yesterday.
Output: Mark borrowed the book from Linda yesterday.
Gold score: 0.6
Gold reasoning: Roles reversed; rest identical.

Example 2:
Input: The novel was written by George Orwell in 1949.
Output: The novel was written by J. K. Rowling in 1949.
Gold score: 0.2
Gold reasoning: Author misattributed; timeframe correct but misleading.

Example 3:
Input: He earned a salary of $75,000 last year.
Output: He earned a salary of $750,000 last year.
Gold score: 0.2
Gold reasoning: Order‑of‑magnitude hallucination.


### Baseline prompt

The baseline prompt is the original set of instructions given to the model for judging whether the meaning of a sentence is preserved after it’s rewritten.  
It asks the model to compare the original and rewritten sentences, check if all important ideas are still present, and give a score from 0 (meaning lost) to 1 (meaning fully kept), along with an explanation.

In [15]:
from promptsmith.judges import JudgeMeaning
from dspy import Predict

baseline_judge = Predict(JudgeMeaning)
print(baseline_judge.signature.instructions)

You are an expert in evaluating meaning preservation.

Your task is to determine whether the restructured text preserves the essential meaning of the original input text.

Please consider:
1. Are all key ideas and important details still present?
2. Were any important parts of the original meaning lost or changed significantly?
3. Are minor rephrasings or small generalizations acceptable if they don't distort the overall meaning?

Assign a score between 0 (poor preservation) and 1 (perfect preservation).

Explain your reasoning clearly, mentioning what was preserved well and any important losses or changes.


### Calibrating the judge

In [16]:
from promptsmith.judges.metrics import calibrate_judge

calibrated_judge, found_optimized = calibrate_judge(devset=dev_set, judge_class=JudgeMeaning)

2025/05/04 13:49:13 INFO dspy.evaluate.evaluate: Average Metric: 38.0 / 48 (79.2%)
2025/05/04 13:49:13 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/04 13:49:13 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/05/04 13:49:13 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=3 sets of demonstrations...


Baseline dev acc: 79.170
Bootstrapping set 1/3
Bootstrapping set 2/3
Bootstrapping set 3/3


  0%|          | 0/10 [00:00<?, ?it/s]2025/05/04 13:49:15 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'input_text': 'Linda borrowed the book from Mark yesterday.', 'output_text': 'Mark borrowed the book from Linda yesterday.', 'gold_score': 0.6, 'gold_reasoning': 'Roles reversed; rest identical.'}) (input_keys={'input_text', 'output_text'}) with <function score_agreement_metric at 0x105a86700> due to getattr(): attribute name must be string.
 10%|█         | 1/10 [00:02<00:22,  2.50s/it]2025/05/04 13:49:18 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'input_text': 'The novel was written by George\xa0Orwell in 1949.', 'output_text': 'The novel was written by J.\u202fK.\xa0Rowling in 1949.', 'gold_score': 0.2, 'gold_reasoning': 'Author misattributed; timeframe correct but misleading.'}) (input_keys={'input_text', 'output_text'}) with <function score_agreement_metric at 0x105a86700> due to getattr(): attribute name must 

Average Metric: 31.30 / 38 (82.4%): 100%|██████████| 38/38 [00:12<00:00,  2.97it/s]

2025/05/04 13:50:23 INFO dspy.evaluate.evaluate: Average Metric: 31.300000000000004 / 38 (82.4%)
2025/05/04 13:50:23 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 82.37

2025/05/04 13:50:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 7 - Minibatch ==



Average Metric: 28.10 / 35 (80.3%): 100%|██████████| 35/35 [00:13<00:00,  2.55it/s]

2025/05/04 13:50:37 INFO dspy.evaluate.evaluate: Average Metric: 28.10000000000001 / 35 (80.3%)
2025/05/04 13:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1'].
2025/05/04 13:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.29]
2025/05/04 13:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37]
2025/05/04 13:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.37


2025/05/04 13:50:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 7 - Minibatch ==



Average Metric: 29.30 / 35 (83.7%): 100%|██████████| 35/35 [00:14<00:00,  2.48it/s]

2025/05/04 13:50:51 INFO dspy.evaluate.evaluate: Average Metric: 29.299999999999997 / 35 (83.7%)
2025/05/04 13:50:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 83.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0'].
2025/05/04 13:50:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.29, 83.71]
2025/05/04 13:50:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37]
2025/05/04 13:50:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.37


2025/05/04 13:50:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 7 - Minibatch ==



Average Metric: 28.40 / 35 (81.1%): 100%|██████████| 35/35 [00:11<00:00,  3.02it/s]

2025/05/04 13:51:03 INFO dspy.evaluate.evaluate: Average Metric: 28.399999999999995 / 35 (81.1%)
2025/05/04 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1'].
2025/05/04 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.29, 83.71, 81.14]
2025/05/04 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37]
2025/05/04 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.37


2025/05/04 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 7 - Minibatch ==



Average Metric: 28.05 / 35 (80.1%): 100%|██████████| 35/35 [00:11<00:00,  3.18it/s]

2025/05/04 13:51:14 INFO dspy.evaluate.evaluate: Average Metric: 28.050000000000008 / 35 (80.1%)
2025/05/04 13:51:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2'].
2025/05/04 13:51:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.29, 83.71, 81.14, 80.14]
2025/05/04 13:51:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37]
2025/05/04 13:51:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.37


2025/05/04 13:51:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 7 - Minibatch ==



Average Metric: 28.40 / 35 (81.1%): 100%|██████████| 35/35 [00:10<00:00,  3.19it/s]

2025/05/04 13:51:25 INFO dspy.evaluate.evaluate: Average Metric: 28.400000000000006 / 35 (81.1%)
2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0'].
2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.29, 83.71, 81.14, 80.14, 81.14]
2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37]
2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.37


2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 7 - Full Evaluation =====
2025/05/04 13:51:25 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 82.425) from minibatch trials...



Average Metric: 31.45 / 38 (82.8%): 100%|██████████| 38/38 [00:11<00:00,  3.30it/s]

2025/05/04 13:51:36 INFO dspy.evaluate.evaluate: Average Metric: 31.45 / 38 (82.8%)
2025/05/04 13:51:36 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 82.76
2025/05/04 13:51:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [82.37, 82.76]
2025/05/04 13:51:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 82.76
2025/05/04 13:51:36 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/04 13:51:36 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 82.76!





2025/05/04 13:51:56 INFO dspy.evaluate.evaluate: Average Metric: 37.6 / 48 (78.3%)


Tuned dev acc: 78.330
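The log line "Doing full eval on next top averaging program (Avg Score: 82.425)" can be checked by hand from the minibatch scores reported in trials 2 to 6. The sketch below only redoes that arithmetic on the logged values; it does not reproduce the optimizer's internals, and the variable names are chosen for illustration.

```python
from collections import defaultdict

# (candidate, minibatch score) pairs copied from trials 2 to 6 of the log above.
minibatch_trials = [
    ("Instruction 1", 80.29),
    ("Instruction 0", 83.71),
    ("Instruction 1", 81.14),
    ("Instruction 2", 80.14),
    ("Instruction 0", 81.14),
]

# Group the scores by candidate instruction.
scores_by_candidate = defaultdict(list)
for name, score in minibatch_trials:
    scores_by_candidate[name].append(score)

# Average each candidate's minibatch scores and pick the highest average.
averages = {name: sum(s) / len(s) for name, s in scores_by_candidate.items()}
best = max(averages, key=averages.get)

for name in sorted(averages):
    print(f"{name}: {averages[name]:.3f}")
print(f"Top averaging candidate: {best}")
```

Instruction 0 averages (83.71 + 81.14) / 2 = 82.425, which matches the logged value and explains why it was chosen for the full evaluation in trial 7.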


In [17]:
if found_optimized:
    print("\nOptimization successful - found better prompt!")
    print("\n=== BASELINE PROMPT ===")
    print(baseline_judge.signature.instructions)
    print("\n=== OPTIMIZED PROMPT ===")
    print(calibrated_judge.signature.instructions)
else:
    print("\nNo better prompt found - using baseline")
    print("\n=== BASELINE PROMPT ===")
    print(baseline_judge.signature.instructions)


No better prompt found - using baseline

=== BASELINE PROMPT ===
You are an expert in evaluating meaning preservation.

Your task is to determine whether the restructured text preserves the essential meaning of the original input text.

Please consider:
1. Are all key ideas and important details still present?
2. Were any important parts of the original meaning lost or changed significantly?
3. Are minor rephrasings or small generalizations acceptable if they don't distort the overall meaning?

Assign a score between 0 (poor preservation) and 1 (perfect preservation).

Explain your reasoning clearly, mentioning what was preserved well and any important losses or changes.
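
The outcome above is consistent with the two dev scores printed during calibration: the baseline reached 79.17 and the tuned judge 78.33. A minimal sketch of such a selection rule follows; `select_judge` and `min_gain` are hypothetical names, and the real comparison inside `calibrate_judge` is not shown in this notebook and may differ.

```python
def select_judge(baseline_score, tuned_score, min_gain=0.0):
    """Keep the tuned judge only if it beats the baseline by more than min_gain."""
    found_optimized = tuned_score > baseline_score + min_gain
    choice = "tuned" if found_optimized else "baseline"
    return choice, found_optimized


# Dev scores taken from the calibration output above.
choice, found_optimized = select_judge(baseline_score=79.17, tuned_score=78.33)
print(f"Selected: {choice} (found_optimized={found_optimized})")
```

Under this rule the tuned judge is rejected even though it scored best inside the optimizer (82.76), because that score was measured on a 38-example subset rather than on all 48 dev examples.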
