# DSPy - Advanced Prompt Engineering

1. Breakout Room #1:
  - Task 1: Dependencies
  - Task 2: Loading Our Model
  - Task 3: Loading Our Data
  - Task 4: Setting Our Signature
  - Task 5: Creating a Predictor
  - Task 6: Making a Chain, I mean...Module
  - Task 7: Evaluate
  - Task 8: Program Optimization
2. Breakout Room #2:
  - Task 1: Defining Application
  - Task 2: Hyper-Parameters and Data
  - Task 3: Signature And Module Creation
  - Task 4: Evaluating Our LongFormQA Module
  - Task 5: Adding Assertions

---

In the following notebook, we'll explore an introduction to DSPy and what it can do in just a few lines of code!

# 🤝 Breakout Room #1

## Task 1: Dependencies

We'll start by installing DSPy, `nltk` (for later) and including our OpenAI API key.

In [2]:
!pip install -qU dspy-ai nltk

DSPy can leverage OpenAI's models under the hood, and still provide an advantage - in order to do so, however, we'll need to provide an OpenAI API Key!

In [1]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

## Task 2: Loading Our Model

Now we can set up our OpenAI language model - which we'll use through the remaining cells in the notebook.

In [3]:
from dspy import LM

llm = LM(model='openai/gpt-3.5-turbo')

* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'smart_union' has been removed


Similar to other libraries, we can call the LLM directly with a string to get a response!

In [4]:
llm("What is the square root of pi?")

['The square root of pi is approximately 1.77245385091.']

In [7]:
llm("On a scale of 0-4 how dope is the phrase 'This is top tier.'")

['I would rate the phrase "This is top tier" a 3 out of 4. It conveys a sense of high quality and excellence, making it a strong and impactful statement.']

We'll also call `dspy.settings.configure` with our OpenAI model in the `lm` (Language Model) field. This sets a default LM to use whenever we don't specify one when calling our DSPy `Predictors`.

In [8]:
import dspy

dspy.settings.configure(lm=llm)

## Task 3: Load Our Data

We're going to be using a dataset that provides a number of example sentences, along with a rating that indicates their "dopeness" level.

In [9]:
from datasets import load_dataset

dataset = load_dataset("llm-wizard/dope_or_nope_v2")

We have a total of 99 rows of data, and will be splitting that into a `trainset` and a `valset` - for training and evaluation.

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Rating', 'Fire Emojis'],
        num_rows: 99
    })
})

Due to the nature of the dataset, we'll need to shuffle it to ensure our labels are not clumped together, and that our `valset` is reasonably representative of our `trainset`.

In [11]:
dataset = dataset.shuffle(seed=42)  # randomly rearranges the dataset - uses a fixed random seed for reproducibility
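
To see why shuffling matters when the validation set is taken from the tail of the data, here is a small, self-contained sketch. It uses a made-up list of ratings sorted by label (an assumption for illustration; the real dataset's ordering may differ) and Python's `random` module rather than the `datasets` API. The helper `tail_split` is hypothetical.

```python
import random
from collections import Counter

# Hypothetical ratings, deliberately sorted so that equal labels sit together.
ratings = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 20 + [4] * 19

def tail_split(items, holdout=10):
    """Keep the last `holdout` items for validation, the rest for training."""
    return items[:-holdout], items[-holdout:]

# Without shuffling, the validation slice contains a single label.
_, val_unshuffled = tail_split(ratings)
print(Counter(val_unshuffled))

# With a seeded shuffle, the split is reproducible and the labels are mixed.
shuffled = ratings[:]
random.Random(42).shuffle(shuffled)
train_shuffled, val_shuffled = tail_split(shuffled)
print(Counter(val_shuffled))
```

With the sorted list, every held-out item has the same rating, so a validation score would say nothing about the other labels.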

We'll move our `Dataset` into the expected format in DSPy which is the [`Example`](https://dspy-docs.vercel.app/docs/deep-dive/data-handling/examples)!


Our examples will have two keys:

- `sentence`, our input sentence to be rated
- `rating`, our rating label

We'll specify our input as `sentence` to properly leverage the DSPy framework.

In [12]:
from dspy import Example

trainset = []

for row in dataset["train"].select(range(0,len(dataset["train"])-10)):
  trainset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(trainset)

89

We'll repeat the same process for our `valset` as well.

In [13]:
valset = []

for row in dataset["train"].select(range(len(trainset),len(dataset["train"]))):
  print(row)
  valset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(valset)

{'Sentence': 'This is top tier.', 'Rating': 4, 'Fire Emojis': '🔥🔥🔥🔥'}
{'Sentence': 'Big mood.', 'Rating': 3, 'Fire Emojis': '🔥🔥🔥'}
{'Sentence': 'The presentation was outstanding.', 'Rating': 1, 'Fire Emojis': '🔥'}
{'Sentence': "I'm living my best life.", 'Rating': 4, 'Fire Emojis': '🔥🔥🔥🔥'}
{'Sentence': "Sksksksk, that's hilarious.", 'Rating': 3, 'Fire Emojis': '🔥🔥🔥'}
{'Sentence': 'The report is comprehensive.', 'Rating': 1, 'Fire Emojis': '🔥'}
{'Sentence': 'This is next level.', 'Rating': 4, 'Fire Emojis': '🔥🔥🔥🔥'}
{'Sentence': 'The meeting was productive.', 'Rating': 1, 'Fire Emojis': '🔥'}
{'Sentence': 'The analysis was insightful.', 'Rating': 1, 'Fire Emojis': '🔥'}
{'Sentence': 'I stan a legend.', 'Rating': 3, 'Fire Emojis': '🔥🔥🔥'}


10

Let's take a peek at an example from our `trainset` and `valset`!

In [15]:
train_example = trainset[0]
print(f"Sentence: {train_example.sentence}")
print(f"Rating: {train_example.rating}")

Sentence: The results were satisfactory.
Rating: 0


In [16]:
valset_example = valset[0]
print(f"Sentence: {valset_example.sentence}")
print(f"Rating: {valset_example.rating}")

Sentence: This is top tier.
Rating: 4


## Task 4: Setting Our Signature

The first foundational unit in DSPy is the `Signature`.

In a sense, a `Signature` can be thought of as both a prompt, as well as metadata about that prompt.

Going beyond just a simple `SystemMessage`, as seen in other frameworks, the `Signature` helps DSPy validate datatypes, create examples, and more.

> NOTE: DSPy's [documentation](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures#what-is-a-signature) goes into more detail about what exactly a `Signature` is.

In [19]:
from dspy import Signature, InputField, OutputField

class DopeOrNopeSignature(Signature):
  """Rate a sentence from 0 to 4 on a dopeness scale"""
  sentence: str = InputField()
  rating: int = OutputField()

## Task 5: Creating a Predictor

Now that we have our `Signature`, we can build a `Predictor` that leverages it.

A `Predictor`, in the simplest terms, is what calls the LLM using our signature. Importantly, the `Predictor` knows how to leverage our signature to call the LLM. From DSPy's documentation, one of the most interesting parts of a `Predictor` is that it can *learn* to become better at the desired task!

Let's take a look at our `TypedPredictor` below to see more.

In [20]:
from dspy.functional import TypedPredictor

predict_dopeness = TypedPredictor(DopeOrNopeSignature)

In [21]:
predict_dopeness

TypedPredictor(DopeOrNopeSignature(sentence -> rating
    instructions='Rate a sentence from 0 to 4 on a dopeness scale'
    sentence = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Sentence:', 'desc': '${sentence}'})
    rating = Field(annotation=int required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Rating:', 'desc': '${rating}'})
))

In [22]:
dopeness_prediction = predict_dopeness(sentence=valset_example.sentence)
print(f"Sentence: {valset_example.sentence}")
print(f"Prediction: {dopeness_prediction}")

Sentence: This is top tier.
Prediction: Prediction(
    rating=4
)


We can, at any time, check our LLM's most recent prompt and response through `inspect_history`.

In [23]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This is top tier.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Response:

[[ ## rating ## ]]
4
[[ ## completed ## ]]

Notice how, without our input - the `TypedPredictor` has included format instructions to the LLM to help ensure our returned data resembles what we desire.
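
The `[[ ## field ## ]]` markers in the history above make the response easy to split back into typed fields. As a rough illustration of that idea (a sketch, not DSPy's actual parser), the hypothetical helper `parse_marked_fields` below splits a response on those markers and casts each value:

```python
import re

# Matches headers such as "[[ ## rating ## ]]" and captures the field name.
FIELD_HEADER = re.compile(r"\[\[ ## (\w+) ## \]\]")

def parse_marked_fields(response, types):
    """Split a response on field headers and cast each value (hypothetical helper)."""
    parts = FIELD_HEADER.split(response)
    # parts = [text_before_first_header, name1, value1, name2, value2, ...]
    fields = {}
    for name, value in zip(parts[1::2], parts[2::2]):
        if name == "completed":
            continue  # end-of-response marker, carries no value
        fields[name] = types.get(name, str)(value.strip())
    return fields

response = "[[ ## rating ## ]]\n4\n[[ ## completed ## ]]"
print(parse_marked_fields(response, {"rating": int}))  # {'rating': 4}
```

Because the signature declares `rating: int`, a parser like this knows to cast the raw text `"4"` to an integer, and can raise an error when the model returns something that is not a number.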

Let's look at another example of a `Predictor` - this time with Chain of Thought.

In order to use this - we don't have to do anything with our `Signature`! We can leave it exactly as is - and allow the `Predictor` to adapt to it.

> NOTE: We won't be using this predictor going forward - this is just to showcase the ease of using another `Predictor` with a `Signature`.

In [24]:
from dspy.functional import TypedChainOfThought

generate_rating_with_chain_of_thought = TypedChainOfThought(DopeOrNopeSignature)

rating_prediction = generate_rating_with_chain_of_thought(sentence=valset_example.sentence)

In [25]:
print(f"Sentence: {valset_example.sentence}")
print(f"Reasoning: {rating_prediction.reasoning}")
print(f"Ground Truth Label: {valset_example.rating}")
print(f"Prediction: {rating_prediction.rating}")

Sentence: This is top tier.
Reasoning: I would rate this sentence as a 4 because it conveys a high level of excellence or superiority.
Ground Truth Label: 4
Prediction: 4


We can, again, check our LLM's history to see what the actual prompt/response is.


In [26]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `reasoning` (str): ${produce the rating}. We ...
2. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## reasoning ## ]]
{reasoning}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This is top tier.

Respond with the corresponding output fields, starting with the field `reasoning`, then `rating`, and then ending with the marker for `completed`.


Response:

[[ ## reasoning ## ]]
I would rate this sentence as a 4 because it conveys a high level of excellence or superiority. 

[[ ## rating ## ]]
4

[[ ## completed ## ]]

## Task 6: Making a Chain, I mean...Module.

Now that we have our `TypedPredictor`, we can create a `Module`!

A `Module` is useful because it allows us to interact with the `Predictor` and `Signature` in a way that DSPy can leverage for optimization.

This helps the DSPy framework determine paths through your program - and helps during the `compilation` or optimization steps (formerly `teleprompting`).

> NOTE: You might notice this looks strikingly similar to PyTorch, and this is by design!

In [27]:
from dspy import Module, Prediction

class DopeOrNopeStudent(Module):
  def __init__(self):
    super().__init__()

    self.generate_rating = TypedPredictor(DopeOrNopeSignature)

  def forward(self, sentence):
    prediction = self.generate_rating(sentence=sentence)
    return Prediction(rating=prediction.rating)
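
To make the PyTorch resemblance concrete, here is a toy sketch of the pattern: calling the module instance runs `forward`, and predictors stored as attributes can be discovered by the framework. `ToyModule`, `ToyPredictor`, and `ToyStudent` are hypothetical stand-ins written for this illustration, not DSPy classes.

```python
class ToyPredictor:
    """Stand-in for a predictor: holds few-shot demos an optimizer could fill in."""
    def __init__(self, rule):
        self.rule = rule
        self.demos = []

    def __call__(self, sentence):
        return self.rule(sentence)


class ToyModule:
    """Minimal PyTorch-style base class: calling the instance runs forward()."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def named_predictors(self):
        # Discover predictors by scanning instance attributes.
        return [(name, value) for name, value in vars(self).items()
                if isinstance(value, ToyPredictor)]


class ToyStudent(ToyModule):
    def __init__(self):
        # Hypothetical rule standing in for an LLM call.
        self.generate_rating = ToyPredictor(lambda s: 4 if "!" in s else 1)

    def forward(self, sentence):
        return {"rating": self.generate_rating(sentence)}


student = ToyStudent()
print(student("This is top tier!"))
print([name for name, _ in student.named_predictors()])
```

Keeping predictors as attributes is what lets an optimizer find them and attach demonstrations without the program's control flow changing.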

## Task 7: Evaluate

As with any good framework, DSPy has the ability to `Evaluate` - we can leverage this to determine how our current DSPy "program" (our `Module` in this case) operates.

> NOTE: DSPy's "program" could be loosely related to a "chain" from the popular LLM Framework LangChain.

In [30]:
from dspy.evaluate.evaluate import Evaluate

evaluate_fewshot = Evaluate(devset=valset, num_threads=1, display_progress=True, display_table=10)

def exact_match_metric(answer, pred, trace=None):
  return answer.rating == pred.rating

evaluate_fewshot(DopeOrNopeStudent(), metric=exact_match_metric)

Average Metric: 5 / 10  (50.0): 100%|██████████| 10/10 [00:00<00:00, 260.81it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,4,✔️ [True]
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,4,
3,I'm living my best life.,4,3,
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,3,
6,This is next level.,4,4,✔️ [True]
7,The meeting was productive.,1,3,
8,The analysis was insightful.,1,3,
9,I stan a legend.,3,3,✔️ [True]


50.0
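
The score reported above is the mean of the metric over the `valset`. As a rough sketch of that arithmetic (not the `Evaluate` implementation), the snippet below recomputes it from the gold and predicted ratings in the table; `average_metric` is a hypothetical helper.

```python
from types import SimpleNamespace

def exact_match_metric(answer, pred, trace=None):
    return answer.rating == pred.rating

# (gold rating, predicted rating) pairs copied from the table above.
rows = [(4, 4), (3, 3), (1, 4), (4, 3), (3, 3), (1, 3), (4, 4), (1, 3), (1, 3), (3, 3)]

def average_metric(pairs, metric):
    """Score every pair and return the mean as a percentage."""
    scores = [metric(SimpleNamespace(rating=gold), SimpleNamespace(rating=pred))
              for gold, pred in pairs]
    return round(100 * sum(scores) / len(scores), 1)

print(average_metric(rows, exact_match_metric))  # 50.0
```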

#### ❓Question #1:

Does DSPy lend itself to more complex, less exactly defined evaluations? Provide reasoning for your answer.

#### ! Answer #1:

I believe DSPy is designed to handle complex, less exactly defined evaluations.

DSPy supports the following, which help with complex evaluations:
- Modularity - complex tasks can be broken into simpler components, and complex Modules can be composed from simpler Modules
- Flexibility - inputs, outputs, data flow, and evaluation can all be custom-defined
- Self-optimization - programs are tuned based on performance feedback
- Choice - many options can be plugged in without changing the fundamental process built on Examples, Signatures, and Modules, including the different TypedPredictors
- Customization - custom evaluators can be created, since a metric is just a Python function of an example and a prediction

Since DSPy separates the program flow from the parameters controlling the behavior of a module or model (weights or configurations), we can focus on the goals of the program and not worry about the details of parameter tuning. This is helpful in handling complex evaluations.
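
As one illustration of a less exact evaluation, the sketch below defines a graded metric with the same `(example, pred, trace=None)` shape as `exact_match_metric`. The name `within_one_metric` and the scoring scheme are assumptions made for this example, and the stricter behavior when `trace` is set is an assumed convention rather than something this notebook demonstrates.

```python
from types import SimpleNamespace

def within_one_metric(example, pred, trace=None):
    """Hypothetical graded metric: full credit for exact, half for off-by-one."""
    distance = abs(int(example.rating) - int(pred.rating))
    score = {0: 1.0, 1: 0.5}.get(distance, 0.0)
    if trace is not None:
        # Assumed convention: be strict when the score gates bootstrapped demos.
        return score == 1.0
    return score

gold = SimpleNamespace(rating=4)
print(within_one_metric(gold, SimpleNamespace(rating=4)))  # 1.0
print(within_one_metric(gold, SimpleNamespace(rating=3)))  # 0.5
print(within_one_metric(gold, SimpleNamespace(rating=1)))  # 0.0
```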


## Task 8: Program Optimization (the Artist Formerly Known as Teleprompting)

Optimization is the crux of the DSPy framework - it is what allows it to operate at a level beyond traditional prompt engineering.

At a high level, optimization is a way for the DSPy framework to take the program, a training set, and a metric - and make changes/tweaks to our program to improve our metrics on our dataset.

Let's get started with the `LabeledFewShot` optimizer.

The `LabeledFewShot` optimizer very simply provides a sample of the `trainset` as few-shot examples!
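
Conceptually, this amounts to picking `k` labeled examples and prepending them to every prompt. The sketch below shows that sampling step in plain Python; `sample_labeled_demos`, the seed, and the miniature trainset are assumptions for illustration, not the optimizer's actual code.

```python
import random

def sample_labeled_demos(trainset, k, seed=0):
    """Pick up to k labeled examples to prepend to every prompt (illustrative)."""
    rng = random.Random(seed)
    return rng.sample(trainset, min(k, len(trainset)))

# Hypothetical miniature trainset of (sentence, rating) pairs.
toy_trainset = [
    ("The results were satisfactory.", 0),
    ("The approval was granted.", 1),
    ("I admire your dedication.", 1),
    ("This tea is piping hot.", 4),
    ("Big mood.", 3),
]

demos = sample_labeled_demos(toy_trainset, k=4)
for sentence, rating in demos:
    print(f"Sentence: {sentence} -> Rating: {rating}")
```

Since nothing checks whether the sampled examples actually help, the quality of the result depends entirely on which examples happen to be drawn.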

In [31]:
from dspy.teleprompt import LabeledFewShot

labeled_fewshot_optimizer = LabeledFewShot(k=4)

Once we define our optimizer, we can compile our program!

In [32]:
compiled_dspy = labeled_fewshot_optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

Let's evaluate!

In [33]:
evaluate_fewshot(compiled_dspy, metric=exact_match_metric)

Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:00<00:00, 198.13it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,3,
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,3,
3,I'm living my best life.,4,3,
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,2,
6,This is next level.,4,3,
7,The meeting was productive.,1,3,
8,The analysis was insightful.,1,3,
9,I stan a legend.,3,3,✔️ [True]


30.0

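`LabeledFewShot` takes almost no effort to apply, but it is not guaranteed to help: in this run, our score on the `valset` dropped from 50.0 to 30.0. A likely explanation is that the four sampled demonstrations are not representative of the full range of labels - notice that almost every prediction above is now a 3.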

Let's try another optimizer - this time: [`BootstrapFewShot`](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/bootstrap-fewshot).

The key thing to note is that this optimizer works with even very few examples - it uses the LLM to generate new demonstrations!

In [34]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=4, max_labeled_demos=12)

compiled_dspy_BOOTSTRAP = optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

  6%|▌         | 5/89 [00:00<00:00, 252.33it/s]

Bootstrapped 4 full traces after 6 examples in round 0.





In [35]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
The approval was granted.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
1

[[ ## completed ## ]]


User message:

[[ ## sentence ## ]]
I admire your dedication.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
1

[[ ## complete

In [38]:
print(len(llm.history))
llm.history[len(llm.history)-2]

31


{'prompt': None,
 'messages': [{'role': 'system',
   'content': 'Your input fields are:\n1. `sentence` (str)\n\nYour output fields are:\n1. `rating` (int): ${rating} (Respond with a single int value)\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## sentence ## ]]\n{sentence}\n\n[[ ## rating ## ]]\n{rating}\n\n[[ ## completed ## ]]\n\nIn adhering to this structure, your objective is: \n        Rate a sentence from 0 to 4 on a dopeness scale'},
  {'role': 'user',
   'content': '[[ ## sentence ## ]]\nThe approval was granted.\n\nRespond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.'},
  {'role': 'assistant',
   'content': '[[ ## rating ## ]]\n1\n\n[[ ## completed ## ]]'},
  {'role': 'user',
   'content': '[[ ## sentence ## ]]\nI admire your dedication.\n\nRespond with the corresponding output fields, starting with the field `rating`, and then ending with t

#### 🏗️ Activity #1:

Outline how `BootstrapFewShot` works "under the hood" in natural language or create a diagram of the workflow.


### Activity #1:
Basic flow is as follows:
- Create the signature defining the inputs and outputs
- Prepare the training dataset with one or more Example objects containing data conforming to the Signature
- Create a Module that defines the Predictor based on the Signature as well as the forward() function that defines input -> output process
- Create a metric function based on the input and output
- Set up the parameters for number of bootstrapped demonstrations, number of labeled demonstrations and the number of bootstrap rounds
- Call the `compile` function, which:
    - Runs a teacher program (by default, a copy of the student) on training examples - each run goes through the module's `forward()` method
    - The `forward()` method sets up the prompt for the LLM using the signature, the LM parameters, and any few-shot examples already attached
    - The prompt is sent to the LLM, which returns the prediction
    - Each prediction is scored with the metric - when it passes, the traced inputs and outputs are kept as a new bootstrapped demonstration
    - Bootstrapping stops once `max_bootstrapped_demos` passing traces have been collected (the log above shows 4 full traces after 6 examples)
    - Examples that fail can be retried, up to `max_rounds` rounds
    - The bootstrapped demonstrations, topped up with labeled examples from the trainset up to `max_labeled_demos`, are attached to the student's predictors
    - The end result is a more performant, compiled version of the original module

![dspy flow](images/dspy.jpg)
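
The workflow above can be sketched in a few lines of plain Python. This is a toy version of the bootstrapping idea under simplifying assumptions (a single round, one predictor, a rule-based `toy_teacher` instead of an LLM); `bootstrap_few_shot` is a hypothetical function, not DSPy's implementation.

```python
def bootstrap_few_shot(teacher, metric, trainset, max_bootstrapped=4, max_labeled=12):
    """Toy version of the bootstrapping idea (not DSPy's implementation)."""
    bootstrapped, used = [], set()
    for index, (sentence, gold) in enumerate(trainset):
        if len(bootstrapped) >= max_bootstrapped:
            break
        prediction = teacher(sentence)
        if metric(gold, prediction):
            # Keep only demonstrations that the metric accepts.
            bootstrapped.append({"sentence": sentence, "rating": prediction, "augmented": True})
            used.add(index)
    # Top up with raw labeled examples that were not already bootstrapped.
    room = max(0, max_labeled - len(bootstrapped))
    labeled = [{"sentence": s, "rating": r}
               for i, (s, r) in enumerate(trainset) if i not in used][:room]
    return bootstrapped + labeled


# Hypothetical teacher: a fixed rule standing in for an LLM call.
def toy_teacher(sentence):
    return 4 if "hot" in sentence else 1

toy_trainset = [
    ("This tea is piping hot.", 4),
    ("Big mood.", 3),
    ("Your professionalism is appreciated.", 1),
    ("The approval was granted.", 1),
]

demos = bootstrap_few_shot(toy_teacher, lambda gold, pred: gold == pred,
                           toy_trainset, max_bootstrapped=2, max_labeled=3)
for demo in demos:
    print(demo)
```

Here the teacher gets "Big mood." wrong, so that example is not bootstrapped; it is only added afterwards as a plain labeled demonstration.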

Let's finally evaluate!

In [36]:
evaluate_fewshot(compiled_dspy_BOOTSTRAP, metric=exact_match_metric)

Average Metric: 8 / 10  (80.0): 100%|██████████| 10/10 [00:00<00:00, 101.39it/s]


Unnamed: 0,sentence,example_rating,pred_rating,exact_match_metric
0,This is top tier.,4,4,✔️ [True]
1,Big mood.,3,3,✔️ [True]
2,The presentation was outstanding.,1,3,
3,I'm living my best life.,4,4,✔️ [True]
4,"Sksksksk, that's hilarious.",3,3,✔️ [True]
5,The report is comprehensive.,1,1,✔️ [True]
6,This is next level.,4,4,✔️ [True]
7,The meeting was productive.,1,1,✔️ [True]
8,The analysis was insightful.,1,2,
9,I stan a legend.,3,3,✔️ [True]


80.0

We can see that this optimization helps our program score 30 points higher on our evaluation than the unoptimized baseline (80.0 vs. 50.0)!

In [37]:
llm.inspect_history(n=1)

System message:

Your input fields are:
1. `sentence` (str)

Your output fields are:
1. `rating` (int): ${rating} (Respond with a single int value)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## rating ## ]]
{rating}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Rate a sentence from 0 to 4 on a dopeness scale


User message:

[[ ## sentence ## ]]
This tea is piping hot.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
4

[[ ## completed ## ]]


User message:

[[ ## sentence ## ]]
Your professionalism is appreciated.

Respond with the corresponding output fields, starting with the field `rating`, and then ending with the marker for `completed`.


Assistant message:

[[ ## rating ## ]]
1

[[ ##

In [38]:
for name, parameter in compiled_dspy_BOOTSTRAP.named_parameters():
  print(f"Parameter {name}: Num Examples: {len(parameter.demos)}, {parameter.demos[0]}")
  print()

Parameter generate_rating.predictor: Num Examples: 12, Example({'augmented': True, 'sentence': 'This tea is piping hot.', 'rating': 4}) (input_keys=None)



# 🤝 Breakout Room #2

## Task 1: Defining Application

In this breakout room, we'll be using DSPy to optimize a Multi-Hop QA module with `Assertions`.

So what is a "Multi-Hop QA module"?

Well - going beyond naive RAG retrieval, Multi-Hop QA lets us create applications that are well-suited to questions that (potentially) require multiple "hops" to answer.

For instance: "Who is the top goal scorer that has ever played on the Winnipeg Jets, and what years did he play for the Winnipeg Jets?"

You can see that there are two "hops" required to respond correctly:

1. Who is the top goal scorer for the Winnipeg Jets?
2. What years did X player play for the Winnipeg Jets?

While this is a toy example, the idea is the same across complexity: Questions that take more than one step of reasoning to answer.

Let's grab some data, set up some hyper-parameters, and then get to implementation!

## Task 2: Hyper-Parameters and Data

We'll use the DSPy ColBERT abstracts as our retrieval system for this example.

We'll also use `GPT-4o-Mini` as our LM to keep things light and inexpensive as we'll be sending quite a few LLM calls.

In [43]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm=colbertv2_wiki17_abstracts)
lm_openai_four_mini = dspy.LM(model='openai/gpt-4o-mini', max_tokens=500)
dspy.settings.configure(lm=lm_openai_four_mini, trace=[], temperature=0.7)

We'll be using the [`HotPotQA`](https://hotpotqa.github.io/) dataset which is a number of multi-hop QA pairs that includes context, and is based on Wikipedia (for compatibility with our Retriever system).

- train_seed - random seed for sampling the training set
- train_size - limits the number of examples taken from the training set
- eval_seed - random seed for sampling the evaluation set
- dev_size - limits the number of examples in the development (evaluation) set
- test_size - number of test examples; set to 0 here, so no test set is created
- keep_details - whether additional details or metadata about each example are preserved


In [44]:
from dspy.datasets import HotPotQA

dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0, keep_details=True)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

We can look at a few examples:

In [45]:
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")
print(f"Relevant Wikipedia Titles: {train_example.gold_titles}")

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt
Relevant Wikipedia Titles: {'Townes Van Zandt', 'At My Window (album)'}


In [46]:
dev_example = devset[18]
print(f"Question: {dev_example.question}")
print(f"Answer: {dev_example.answer}")
print(f"Relevant Wikipedia Titles: {dev_example.gold_titles}")

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Answer: English
Relevant Wikipedia Titles: {'Restaurant: Impossible', 'Robert Irvine'}


## Task 3: Signature and Module Creation

As we learned above - the bread and butter for DSPy is the `Signature` and `Module`, so we'll create each below.

For our `Signatures`, things are fairly straightforward. We need to:

1. Create a `Signature` that will allow us to generate sub-questions.
2. Create a `Signature` that will provide citations for our responses.

In [47]:
from dsp.utils import deduplicate

class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

class GenerateCitedParagraph(dspy.Signature):
    """Generate a paragraph with citations."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    paragraph = dspy.OutputField(desc="includes citations")

Our `Module` is a bit more complex than what we've seen before - so let's walk through what's happening inside of it. We're going to concern ourselves with the `forward` method - as that is where the logic of our `Module` is contained.

In the `forward` method we:

1. Create an empty list of contexts.
2. For each `hop` in our `max_hops` (by default, it will be 2) we:
  - Generate a new `query` using our `GenerateSearchQuery` with a `ChainOfThought` predictor.
  - Retrieve a number (default 3) of `passages` based on that new `query`.
  - Add unique (non-present) `passages` into our `context` list.
3. Take all that `context` and our original `question`, generate a cited paragraph, and return it together with the `context` as our prediction.

In [48]:
class LongFormQA(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_cited_paragraph = dspy.ChainOfThought(GenerateCitedParagraph)
        self.max_hops = max_hops

    def forward(self, question):
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        pred = self.generate_cited_paragraph(context=context, question=question)
        pred = dspy.Prediction(context=context, paragraph=pred.paragraph)
        return pred
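
To isolate the context-accumulation logic from the LLM calls, here is a small sketch that runs the same loop against a dictionary-backed retriever. Everything in it is assumed for illustration: the `CORPUS` contents, `toy_retrieve`, the fixed list of queries (the real module generates them with the LM at each hop), and the order-preserving behavior of this local `deduplicate`.

```python
def deduplicate(items):
    """Drop repeated items while keeping first-seen order (assumed behaviour)."""
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

# Hypothetical corpus and retriever; passages use the "title | text" layout.
CORPUS = {
    "at my window": ["At My Window (album) | At My Window is an album by Townes Van Zandt."],
    "townes van zandt": [
        "Townes Van Zandt | John Townes Van Zandt was an American singer-songwriter.",
        "At My Window (album) | At My Window is an album by Townes Van Zandt.",
    ],
}

def toy_retrieve(query):
    return CORPUS.get(query.lower(), [])

def multi_hop_context(queries):
    """Accumulate passages hop by hop, as the forward() loop above does."""
    context = []
    for query in queries:
        context = deduplicate(context + toy_retrieve(query))
    return context

context = multi_hop_context(["At My Window", "Townes Van Zandt"])
print(len(context))  # 2 unique passages, although 3 were retrieved
```

Deduplicating after every hop keeps the context short, which matters because the whole list is passed back into the next query-generation prompt.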

Next, we'll need a way to evaluate how we're doing!

## Task 4: Evaluating our LongFormQA Module.

Now we'd like to evaluate our module - we'll need a number of helper functions to do so - which will be instantiated below.

#### Utility Functions for Citation Checking

In [49]:
import nltk
import regex as re

from nltk.tokenize import sent_tokenize
nltk.download('punkt')

def extract_text_by_citation(paragraph):
    # extracts text chunks and associates it with the citation number
    citation_regex = re.compile(r'(.*?)(\[\d+\]\.)', re.DOTALL)
    parts_with_citation = citation_regex.findall(paragraph)
    citation_dict = {}
    for part, citation in parts_with_citation:
        part = part.strip()
        citation_num = re.search(r'\[(\d+)\]\.', citation).group(1)
        citation_dict.setdefault(str(int(citation_num) - 1), []).append(part)
    return citation_dict

def correct_citation_format(paragraph):
    # validates citation format - every window of one to two sentences must contain a citation marker
    modified_sentences = []
    sentences = sent_tokenize(paragraph)
    for sentence in sentences:
        modified_sentences.append(sentence)
    citation_regex = re.compile(r'\[\d+\]\.')
    i = 0
    if len(modified_sentences) == 1:
      has_citation = bool(citation_regex.search(modified_sentences[i]))
    while i < len(modified_sentences):
      if len(modified_sentences[i:i+2]) == 2:
        sentence_group = " ".join(modified_sentences[i:i+2])
        has_citation = bool(citation_regex.search(sentence_group))
        if not has_citation:
            return False
        i += 2 if has_citation and i+1 < len(modified_sentences) and citation_regex.search(modified_sentences[i+1]) else 1
      else:
        return True
    return True

def has_citations(paragraph):
    # checks for citations in the paragraph (e.g., [1]., [2].)
    return bool(re.search(r'\[\d+\]\.', paragraph))

def citations_check(paragraph):
    # combines checks that it has citations and they are valid format
    return has_citations(paragraph) and correct_citation_format(paragraph)

[nltk_data] Downloading package punkt to /home/rchrdgwr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
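
A quick usage example helps show what these helpers expect. The sketch below restates two of them using the standard-library `re` module (sufficient for these patterns) and runs them on a made-up paragraph; note that a citation only counts when the marker is immediately followed by a period.

```python
import re  # the standard-library module is enough for these patterns

def extract_text_by_citation(paragraph):
    """Same logic as the helper above: map zero-based source index -> cited text."""
    citation_dict = {}
    for part, citation in re.findall(r'(.*?)(\[\d+\]\.)', paragraph, re.DOTALL):
        citation_num = re.search(r'\[(\d+)\]\.', citation).group(1)
        citation_dict.setdefault(str(int(citation_num) - 1), []).append(part.strip())
    return citation_dict

def has_citations(paragraph):
    return bool(re.search(r'\[\d+\]\.', paragraph))

# Made-up paragraph in the expected style: citation marker, then the period.
paragraph = "At My Window is an album [1]. It was released by Townes Van Zandt [2]."
print(extract_text_by_citation(paragraph))
print(has_citations("No markers here."))
```

The keys are shifted down by one so that citation `[1]` lines up with `context[0]`.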


### Checking Citation Faithfulness

We will create a number of useful metrics for our pipeline - including "Faithfulness", as well as a number of more traditional metrics.

In [50]:
class CheckCitationFaithfulness(dspy.Signature):
    # check cited text is based on the provided context 
    """Verify that the text is based on the provided context."""
    context = dspy.InputField(desc="may contain relevant facts")
    text = dspy.InputField(desc="between 1 to 2 sentences")
    faithfulness = dspy.OutputField(desc="boolean indicating if text is faithful to context")

def citation_faithfulness(example, pred, trace):
    paragraph, context = pred.paragraph, pred.context
    citation_dict = extract_text_by_citation(paragraph)
    if not citation_dict:
        return False, None
    context_dict = {str(i): context[i].split(' | ')[1] for i in range(len(context))}
    faithfulness_results = []
    unfaithful_citations = []
    check_citation_faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
    for citation_num, texts in citation_dict.items():
        if citation_num not in context_dict:
            continue
        current_context = context_dict[citation_num]
        for text in texts:
            try:
                result = check_citation_faithfulness(context=current_context, text=text)
                is_faithful = result.faithfulness.lower() == 'true'
                faithfulness_results.append(is_faithful)
                if not is_faithful:
                    unfaithful_citations.append({'paragraph': paragraph, 'text': text, 'context': current_context})
            except ValueError as e:
                faithfulness_results.append(False)
                unfaithful_citations.append({'paragraph': paragraph, 'text': text, 'error': str(e)})
    final_faithfulness = all(faithfulness_results)
    if not faithfulness_results:
        return False, None
    return final_faithfulness, unfaithful_citations

#### ❓Question #2:

How is faithfulness being determined here? How is this different from Ragas Faithfulness?


#### ! Answer #2:

The CheckCitationFaithfulness class defines the following:
- context - input field - the source where citations are to be drawn
- text - input field - the cited information
- faithfulness - output field - boolean indication faithfulness - true/false

The citation_faithfulness function is responsible for determining whether the text is faithful to the context provided ie the cited statement (or text) is based on facts or details found in the context. 

The process is as follows:
- extract all citations from from the predicted paragraph - if none is found it is considered unfaithful
- find all provided citations from the context
- create a chain of thought predictor to check a citations faithfulness comparing the text and corresponding context
- iterating over all citations in the context 
    - determine if the citation number is in the context
    - use the COT predictor to determine if the text returned is based on the context
    - if so record that it is faithful
    - othewise log it to the unfaithful_citations
- if all of the citations are faithful it is considered faithful

In summary - this process validates that all of the text citations in the paragraph are faithful to the source context. It is a boolean and considered faithful or not. It is stricter than RAGAS and is used to ensure that cited information is properly represented

In RAGAS, faithfulness measures how accurately the generated text reflects the information in the context, based on whether the claims in the answer can be inferred from the given context. It is defined as a ratio:

        (number of claims in the answer that can be inferred from the context) / (total number of claims in the answer)

RAGAS faithfulness ranges from 0 to 1.
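
To make the contrast concrete, here is a minimal sketch of the two scoring styles applied to the same per-claim verdicts. The function names and the sample verdicts are hypothetical, and returning 0.0 for an empty claim list in the ratio version is an assumption made for this sketch rather than documented RAGAS behavior.

```python
def strict_faithfulness(claim_supported):
    """Boolean score in the style of citation_faithfulness: every claim must pass."""
    # An empty list counts as unfaithful, mirroring how the notebook
    # treats a paragraph in which no citations were checked.
    return bool(claim_supported) and all(claim_supported)


def ratio_faithfulness(claim_supported):
    """Ratio-style score: supported claims divided by total claims."""
    if not claim_supported:
        return 0.0  # assumption for this sketch
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: three of four claims are supported by the context.
claims = [True, True, False, True]

strict_score = strict_faithfulness(claims)
ratio_score = ratio_faithfulness(claims)
print(strict_score, ratio_score)
```

A single unsupported claim makes the strict score `False`, while the ratio score only drops to 0.75.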




Next, we can create a number of useful metrics that rely on more traditional evaluations, like Precision, Recall, and "does this contain the answer".

In [51]:
import re

from dsp.utils import normalize_text

def extract_cited_titles_from_paragraph(paragraph, context):
    cited_indices = [int(m.group(1)) for m in re.finditer(r'\[(\d+)\]\.', paragraph)]
    cited_indices = [index - 1 for index in cited_indices if index <= len(context)]
    cited_titles = [context[index].split(' | ')[0] for index in cited_indices]
    return cited_titles

def calculate_recall(example, pred, trace=None):
    gold_titles = set(example['gold_titles'])
    found_cited_titles = set(extract_cited_titles_from_paragraph(pred.paragraph, pred.context))
    intersection = gold_titles.intersection(found_cited_titles)
    recall = len(intersection) / len(gold_titles) if gold_titles else 0
    return recall

def calculate_precision(example, pred, trace=None):
    gold_titles = set(example['gold_titles'])
    found_cited_titles = set(extract_cited_titles_from_paragraph(pred.paragraph, pred.context))
    intersection = gold_titles.intersection(found_cited_titles)
    precision = len(intersection) / len(found_cited_titles) if found_cited_titles else 0
    return precision

def answer_correctness(example, pred, trace=None):
    assert hasattr(example, 'answer'), "Example does not have 'answer'."
    normalized_context = normalize_text(pred.paragraph)
    if isinstance(example.answer, str):
        gold_answers = [example.answer]
    elif isinstance(example.answer, list):
        gold_answers = example.answer
    else:
        raise ValueError("'example.answer' is not string or list.")
    return 1 if any(normalize_text(answer) in normalized_context for answer in gold_answers) else 0
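
As a worked example of the title-based precision and recall above, the sketch below re-implements the citation extraction on toy data. The context passages, paragraph, and gold titles are all made up for illustration, and `cited_titles` is a simplified stand-in for `extract_cited_titles_from_paragraph`.

```python
import re


def cited_titles(paragraph, context):
    # Same pattern as the notebook: a citation is "[n]" immediately followed by a period.
    indices = [int(m.group(1)) for m in re.finditer(r'\[(\d+)\]\.', paragraph)]
    # Keep citation numbers that fit inside the context and convert them to 0-based positions.
    positions = [index - 1 for index in indices if index <= len(context)]
    # Each context entry is "title | passage text"; keep only the title.
    return [context[position].split(' | ')[0] for position in positions]


# Hypothetical retrieved context and generated paragraph.
context = [
    "Title A | passage text about A",
    "Title B | passage text about B",
    "Title C | passage text about C",
]
paragraph = "First claim [1]. Second claim [2]. Third claim [3]."
gold_titles = {"Title A", "Title D"}

found = set(cited_titles(paragraph, context))
overlap = gold_titles & found

recall = len(overlap) / len(gold_titles) if gold_titles else 0
precision = len(overlap) / len(found) if found else 0
print(recall, precision)
```

Only "Title A" is both gold and cited, so recall is 1 of 2 gold titles and precision is 1 of 3 cited titles.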

### Creating the Evaluation Function

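In essence, this function calls each of the metrics above on the first 20 `devset` examples and averages the results. A generation that raises an exception is skipped, but the averages still divide by the number of examples attempted, so a failure effectively counts as a score of 0.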

In [52]:
from tqdm import tqdm

def evaluate(module):
    correctness_values = []
    recall_values = []
    precision_values = []
    citation_faithfulness_values = []
    for i in tqdm(range(len(devset[:20]))):
        example = devset[i]
        try:
            pred = module(question=example.question)
            correctness_values.append(answer_correctness(example, pred))
            citation_faithfulness_score, _ = citation_faithfulness(None, pred, None)
            citation_faithfulness_values.append(citation_faithfulness_score)
            recall = calculate_recall(example, pred)
            precision = calculate_precision(example, pred)
            recall_values.append(recall)
            precision_values.append(precision)
        except Exception as e:
            print(f"Failed generation with error: {e}")

    average_correctness = sum(correctness_values) / len(devset[:20]) if correctness_values else 0
    average_recall = sum(recall_values) / len(devset[:20]) if recall_values else 0
    average_precision = sum(precision_values) / len(devset[:20]) if precision_values else 0
    average_citation_faithfulness = sum(citation_faithfulness_values) / len(devset[:20]) if citation_faithfulness_values else 0

    print(f"\nAverage Correctness: {average_correctness}")
    print(f"Average Recall: {average_recall}")
    print(f"Average Precision: {average_precision}")
    print(f"Average Citation Faithfulness: {average_citation_faithfulness}")
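
One detail of `evaluate` that is easy to miss is the denominator: it divides by the number of examples attempted, not by the number of successful generations. The sketch below shows the difference on hypothetical scores; both helper names are made up for illustration.

```python
def average_over_attempted(values, attempted):
    """Notebook-style average: failed generations implicitly count as 0."""
    return sum(values) / attempted if values else 0


def average_over_successes(values):
    """Alternative: average only over the generations that succeeded."""
    return sum(values) / len(values) if values else 0


# Hypothetical run: 10 examples attempted, 2 raised exceptions and were skipped.
scores = [1, 1, 1, 0, 1, 1, 1, 1]
attempted = 10

over_attempted = average_over_attempted(scores, attempted)
over_successes = average_over_successes(scores)
print(over_attempted, over_successes)
```

Dividing by the attempted count penalizes a module that fails to generate, which is a reasonable choice when comparing modules on the same set of examples.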

### Evaluating our LongFormQA Module

Finally, we can evaluate our module!

In [53]:
longformqa = LongFormQA()
evaluate(longformqa)

100%|██████████| 20/20 [00:00<00:00, 125.96it/s]


Average Correctness: 0.9
Average Recall: 0.0
Average Precision: 0.0
Average Citation Faithfulness: 0.0





This did surprisingly poorly on `Recall`, `Precision` and `Citation Faithfulness`.

#### ❓Question #3:

Why did our `Module` do surprisingly poorly on `Recall`, `Precision` and `Citation Faithfulness`?

> HINT: The name `LongFormQA` should provide a fairly big hint.

#### ! Answer #3:

LongFormQA is designed for longer, more detailed answers that pull data from multiple sources, in this case using multi-hop retrieval. This can introduce additional content and hallucination, and it is more difficult to control than short, precise answers. This leads to a drop in precision and recall.

The recall and precision are based on cited titles from the paragraphs and comparing them to the "gold titles" provided in the example. The cited titles could differ, impacting precision and recall.

Long-form answers might paraphrase or add additional information, which could make it harder to find an exact match with the "gold answer".

Recall - impacted by missing relevant titles or citations

Precision - impacted by finding irrelevant or incorrect titles

Answer correctness - impacted by longer, paraphrased response, introduction of variation and imprecision

We also used a temperature of 0.7 for the language model, which could introduce some randomness.
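
One factor worth checking, based on the extraction code used by the metrics, is the citation format itself. The pattern only matches a bracketed number immediately followed by a period, so a paragraph that cites differently (or not at all) yields no cited titles, and recall and precision both fall to 0. The sample sentences below are hypothetical.

```python
import re

# Pattern used by extract_cited_titles_from_paragraph in this notebook.
CITATION_PATTERN = re.compile(r'\[(\d+)\]\.')

samples = {
    "expected format": "Paris is the capital of France [1].",
    "period before citation": "Paris is the capital of France. [1]",
    "no citation": "Paris is the capital of France.",
    "parenthetical": "Paris is the capital of France (1).",
}

# Count how many citations the pattern finds in each sample.
counts = {label: len(CITATION_PATTERN.findall(text)) for label, text in samples.items()}
for label, count in counts.items():
    print(f"{label}: {count} citation(s) found")
```

Only the first sample is recognized as containing a citation.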

## Task 5: Adding `Assertions`.

DSPy comes equipped with two extremely useful features: `Assertions` and `Suggestions`.

Let's take a look at what each one does:

1. `dspy.Assert` - a hard rule that must be followed. If the check fails, DSPy retries the step with the failure message as feedback, and if it still fails once the retries are exhausted, an exception is raised.
2. `dspy.Suggest` - a looser rule, or guiding principle. It triggers the same retries, but if the rule still isn't met, DSPy logs the failure and continues instead of raising an exception.
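
The difference between a hard rule and a soft rule can be illustrated without DSPy at all. The sketch below uses two hypothetical helpers, `hard_check` and `soft_check` (neither is part of DSPy), and it deliberately leaves out any retry behavior. It only shows what happens when a check ultimately fails.

```python
import logging

logger = logging.getLogger("constraints_sketch")


class HardConstraintError(Exception):
    """Raised by the toy hard check; stands in for a failed assertion."""


def hard_check(condition, message):
    # Hard rule: a failure stops execution.
    if not condition:
        raise HardConstraintError(message)
    return True


def soft_check(condition, message):
    # Soft rule: a failure is logged and execution continues.
    if not condition:
        logger.warning("Suggestion not met: %s", message)
    return condition


# Hypothetical paragraph that lacks a citation.
paragraph = "DSPy is a framework for programming language models."
has_citation = "[1]." in paragraph
message = "Add a citation in 'text... [x].' format."

soft_ok = soft_check(has_citation, message)

try:
    hard_check(has_citation, message)
    hard_failed = False
except HardConstraintError:
    hard_failed = True

print(soft_ok, hard_failed)
```

With the soft check the program keeps going and still returns a paragraph, which is why soft rules suit an evaluation loop where every example should produce some output.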

Let's improve our `Module` with some `dspy.Suggest`s!


In [54]:
class LongFormQAWithAssertions(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_cited_paragraph = dspy.ChainOfThought(GenerateCitedParagraph)
        self.max_hops = max_hops

    def forward(self, question):
        # Multi-hop retrieval: each hop writes a new query from the context gathered so far.
        context = []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)
        # Generate the cited paragraph from the accumulated context.
        pred = self.generate_cited_paragraph(context=context, question=question)
        pred = dspy.Prediction(context=context, paragraph=pred.paragraph)
        # Soft check 1: every 1-2 sentences should carry a citation.
        dspy.Suggest(citations_check(pred.paragraph), "Make sure every 1-2 sentences has citations. If any 1-2 sentences lack citations, add them in 'text... [x].' format.", target_module=self.generate_cited_paragraph)
        # Soft check 2: every cited statement should be supported by its cited passage.
        _, unfaithful_outputs = citation_faithfulness(None, pred, None)
        if unfaithful_outputs:
            unfaithful_pairs = [(output['text'], output['context']) for output in unfaithful_outputs]
            for _, unfaithful_context in unfaithful_pairs:
                # The condition is always False in this branch, so the suggestion always counts as failed.
                dspy.Suggest(len(unfaithful_pairs) == 0, f"Make sure your output is based on the following context: '{unfaithful_context}'.", target_module=self.generate_cited_paragraph)
        return pred

#### 🏗️ Activity #2:

Write out the above flow in natural language or using a drawing program.

What is the key advantage provided by using `dspy.Suggest`?

### Response #2:

- Using multi-hop retrieval, create the search queries and retrieve relevant passages based on the input question
- create the paragraph with citations based on the retrieved content
- using `dspy.Suggest`, check that there is a citation every 1-2 sentences; if not, ask the model to add one
- check the faithfulness of the generated text to the context
- using `dspy.Suggest`, ask the model to revise the response if unfaithful text is found, so that it is based on the context


Advantage of using `dspy.Suggest`:
- it provides guidance to improve the response based on checks for citations and citation faithfulness
- this encourages (not forces) the model to correct itself during the generation process
- it verifies that each 1-2 sentences has a citation
- if there is no citation, it suggests adding the citation
- this helps the created content adhere to the citation standards, resulting in better citation precision
- for each unfaithful output, it suggests that the text be revised based on the provided context. This reminds the model to align the output with the specific retrieved context, which should improve citation faithfulness

The suggestions help ensure:
- citations are correctly placed every 1-2 sentences
- unfaithful outputs are corrected, which reduces hallucinations and improves faithfulness

Overall, `dspy.Suggest` can help the module improve by identifying weaknesses in the response and suggesting ways to address them.
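
The self-correction idea can be sketched as a plain retry loop in which the failure message is fed back into the next attempt. This is not DSPy's implementation: `toy_generator` is a hypothetical stand-in for the language model call, and `generate_with_suggestion` is a made-up helper that mimics a soft rule by returning the last attempt instead of raising.

```python
import re


def has_citations(paragraph):
    # A citation is "[n]" immediately followed by a period, as in the notebook's checks.
    return re.search(r'\[\d+\]\.', paragraph) is not None


def toy_generator(question, feedback=None):
    """Hypothetical stand-in for an LM call: it only cites when reminded to."""
    if feedback:
        return "DSPy compiles declarative modules into prompts [1]."
    return "DSPy compiles declarative modules into prompts."


def generate_with_suggestion(question, max_retries=2):
    feedback = None
    paragraph = ""
    for attempt in range(max_retries + 1):
        paragraph = toy_generator(question, feedback)
        if has_citations(paragraph):
            return paragraph, attempt
        # The check failed: pass the suggestion message into the next attempt.
        feedback = "Make sure every 1-2 sentences has citations."
    # Soft rule: give back the last attempt rather than raising.
    return paragraph, max_retries


paragraph, retries_used = generate_with_suggestion("What does DSPy do?")
print(retries_used, paragraph)
```

In this toy run the first attempt lacks a citation and the second attempt, which receives the feedback, includes one.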

In [55]:
from dspy.primitives.assertions import assert_transform_module, backtrack_handler
from dspy.predict import Retry

longformqa_with_assertions = assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler)
evaluate(longformqa_with_assertions)

100%|██████████| 20/20 [02:43<00:00,  8.20s/it]


Average Correctness: 0.9
Average Recall: 0.45
Average Precision: 0.5416666666666666
Average Citation Faithfulness: 0.4





Next, we set up a few-shot optimization process using `BootstrapFewShotWithRandomSearch` and use it to optimize and evaluate our `LongFormQAWithAssertions` module.

`BootstrapFewShotWithRandomSearch` - bootstrapped few-shot learning with random search
- selects a few labeled examples and improves performance by bootstrapping additional examples

Compile the student-teacher model
- Both student and teacher are instances of `LongFormQAWithAssertions`
- This allows them to backtrack and retry predictions using the `Retry` predictor
- The student model learns from the teacher model's predictions
- The teacher's successful traces serve as few-shot demonstrations for the student
- This helps it learn to generate long-form answers with proper citations and faithfulness

Candidate programs are evaluated on the valset
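
The random-search part can be sketched in plain Python: score a baseline with no demonstrations, then score several randomly sampled demonstration sets and keep the best one. Everything here is hypothetical (`random_search`, `toy_score`, and the demo pool), and the scoring rule is rigged so that demonstrations never help, purely to show that the baseline can end up as the winner.

```python
import random


def random_search(demo_pool, score_fn, num_candidates=6, demos_per_candidate=2, seed=0):
    rng = random.Random(seed)
    # Start from the baseline program with no demonstrations.
    best_demos = []
    best_score = score_fn(best_demos)
    scores = [best_score]
    for _ in range(num_candidates):
        # Sample a candidate set of demonstrations and score it on the validation set.
        demos = rng.sample(demo_pool, demos_per_candidate)
        score = score_fn(demos)
        scores.append(score)
        # Strict comparison: ties keep the earlier candidate.
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos, scores


def toy_score(demos):
    # Hypothetical scoring rule: in this toy setup, adding demos never helps.
    return 88.0 - 4.0 * len(demos)


demo_pool = ["demo_a", "demo_b", "demo_c", "demo_d"]
best_score, best_demos, scores = random_search(demo_pool, toy_score)
print(best_score, best_demos, scores)
```

When no candidate beats the baseline, the search returns the baseline unchanged, so downstream metrics stay the same.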

In [56]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

longformqa = LongFormQAWithAssertions()

# Student and teacher are separate instances of the same assertion-wrapped module.
student = assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler)
teacher = assert_transform_module(LongFormQAWithAssertions().map_named_predictors(Retry), backtrack_handler)

# Candidates are scored with answer_correctness only; the citation metrics are not part of the search.
teleprompter = BootstrapFewShotWithRandomSearch(metric=answer_correctness, max_bootstrapped_demos=2, num_candidate_programs=6)
cited_longformqa_student_teacher = teleprompter.compile(student=student, teacher=teacher, trainset=trainset, valset=devset[:25])
evaluate(cited_longformqa_student_teacher)

Going to sample between 1 and 2 traces per predictor.
Will attempt to bootstrap 6 candidate sets.


Average Metric: 22 / 25  (88.0): 100%|██████████| 25/25 [01:23<00:00,  3.35s/it]  


New best score: 88.0 for seed -3
Scores so far: [88.0]
Best score so far: 88.0


Average Metric: 22 / 25  (88.0): 100%|██████████| 25/25 [00:00<00:00, 196.44it/s] 


Scores so far: [88.0, 88.0]
Best score so far: 88.0


 20%|██        | 4/20 [00:48<03:14, 12.15s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 20 / 25  (80.0): 100%|██████████| 25/25 [04:32<00:00, 10.90s/it]


Scores so far: [88.0, 88.0, 80.0]
Best score so far: 88.0


 10%|█         | 2/20 [00:23<03:27, 11.51s/it]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 19 / 25  (76.0): 100%|██████████| 25/25 [06:02<00:00, 14.49s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0]
Best score so far: 88.0


  5%|▌         | 1/20 [00:14<04:39, 14.70s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 18 / 25  (72.0): 100%|██████████| 25/25 [05:28<00:00, 13.14s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0, 72.0]
Best score so far: 88.0


  5%|▌         | 1/20 [00:16<05:17, 16.70s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 18 / 25  (72.0): 100%|██████████| 25/25 [05:44<00:00, 13.77s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0, 72.0, 72.0]
Best score so far: 88.0


 10%|█         | 2/20 [00:11<01:39,  5.52s/it]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 19 / 25  (76.0): 100%|██████████| 25/25 [03:28<00:00,  8.35s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0, 72.0, 72.0, 76.0]
Best score so far: 88.0


  5%|▌         | 1/20 [00:13<04:11, 13.22s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 20 / 25  (80.0): 100%|██████████| 25/25 [05:05<00:00, 12.22s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0, 72.0, 72.0, 76.0, 80.0]
Best score so far: 88.0


 15%|█▌        | 3/20 [00:56<05:18, 18.72s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 19 / 25  (76.0): 100%|██████████| 25/25 [05:21<00:00, 12.84s/it]


Scores so far: [88.0, 88.0, 80.0, 76.0, 72.0, 72.0, 76.0, 80.0, 76.0]
Best score so far: 88.0
9 candidate programs found.


100%|██████████| 20/20 [00:00<00:00, 218.10it/s]


Average Correctness: 0.9
Average Recall: 0.45
Average Precision: 0.5416666666666666
Average Citation Faithfulness: 0.4





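The optimization ran for about 40 minutes.

The best score (88.0) came from the very first candidate (seed -3), and none of the later candidates beat it.

The evaluation metrics are identical to those of the `LongFormQAWithAssertions` run before optimization.

It is not clear what this optimization run accomplished. Note that the teleprompter was given `answer_correctness` as its metric, which was already at 0.9 before optimization, so the search was never scoring citation recall, precision, or faithfulness.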