# Getting Started with RAG in DSPy

This notebook will show you how to use DSPy to compile a RAG program! DSPy compilation is a fairly new tool for LLM developers, so let's start with an overview of the concept. By `compiling`, we mean finding the prompts that elicit the behavior we want from LLMs when connected in some kind of pipeline.

For example, RAG is a very common LLM pipeline. In it's simplest form, RAG consists of 2 steps, (1) Retrieve and (2) Answer a Question. Part (2), Answering a Question, has an associated prompt, for example, people generally use:

```
--

Please answer the question based on the following context.

context  {context}

question {question}

--
```

This prompt may be a good initial point for an LLM to understand the task. However, it is not the *optimal* prompt. DSPy optimizes the prompt for you by jointly (1) tweaking the instructions, such as rewriting an initial prompt like: 

```
Please answer the question based on the following context.
```

to 

```
Assess the context and answer the given questions that are predominantly about software usage, process optimization, and troubleshooting. Focus on providing accurate information related to tech or software-related queries.
```

Further, DSPy (2) finds examples of desired input-outputs in the prompt to further improve performance, also known as `In-Context Learning`. In this example, we will begin with the simple prompt: `Please answer the question based on the following context.` and end up with:

```

```

In order to leverage black-box optimization techniques like random search, bayesian optimization, or evolutionary algorithms, we need a metric. Coming up with metrics to describe desired system behavior has been a longstanding challenge in Machine Learning research. Excitingly, LLMs have made amazing progress. For example, we can evaluate a RAG answer by prompting an LLM with, `Is the assessed text grounded in the context? Say no if it includes significant facts not in the context`. We then optimize the RAG program to increase the metric LLM's assessment of answer quality.

This example contains 4 parts:

- 0: DSPy Settings and Installation
- 1: DSPy Datasets with `dspy.Example`
- 2: LLM Metrics in DSPy
- 3: LLM Programming with `dspy.Module`
- 4: Optimization with `BootstrapFewShot`, `BootstrapFewShotRandomSearch`, and `BayesianSignatureOptimizer`.


We are using 2 datasets for this example. Firstly, we have an index of the Weaviate Blog Posts. We will use the Weaviate Blog Posts as the retrieved context to help with our second dataset, the Weaviate FAQs. The Weaviate FAQs consists of 44 question-answer pairs of frequently asked Weaviate questions such as: `Do I need to know about Docker (Compose) to use Weaviate?`

We isolate 10 examples to use as our test set and optimize our program with the remaining 34.

Our uncompiled RAG program achieves a score of 270 on the held-out test set.

Our RAG program compiled with the `BayesianSignatureOptimizer` achieves a score of 340! A ~30% improvement!

# 0: DSPy Settings and Installations

In [1]:
# %pip uninstall dspy-ai weaviate-client
# %pip install dspy-ai==2.1.9 weaviate-client==3.26.2

In [26]:
# Connect to Weaviate Retriever and configure LLM
import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM
from wcs_client_adapter import WcsClientAdapter
from wcs_client_adapter import COLLECTION_TEXT_KEY, WCS_COLLECTION_NAME

def display_md(content):
  display(Markdown(content))

llm = dspy.OpenAI(model="gpt-3.5-turbo")

wcs_client = WcsClientAdapter.get_wcs_client()
retriever_model = WeaviateRM(WCS_COLLECTION_NAME, weaviate_client=wcs_client, weaviate_collection_text_key=COLLECTION_TEXT_KEY)
dspy.settings.configure(lm=llm, rm=retriever_model)

# 1. DSPy Datasets with `dspy.Example`

Our retrieval engine is filled with chunks from Weaviate Blog posts.

Please see weaviate/recipes/integrations/dspy/Weaviate-Import.ipynb for a full tutorial.

# Index Paper to WCS

In [3]:
from indexers import NaiveWcsIndexer

doc_uri = "https://arxiv.org/html/2312.10997v5"
indexer = NaiveWcsIndexer(doc_uri)

## Import User Questions from CSV

In [4]:
import csv
from typing import List

answerable_questions_path = "./data/answerable-questions.csv"
unanswerable_questions_path = "./data/unanswerable-questions.csv"

def load_questions_from_csv(file_path: str) -> List[str]:
    questions = []
    with open(file_path, mode='r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        for row in reader:
            if row:
                questions.append(row[0])
    return questions

answerable_questions = load_questions_from_csv(answerable_questions_path)
unanswerable_questions = load_questions_from_csv(unanswerable_questions_path)
all_questions = answerable_questions + unanswerable_questions
print(f"Answerable Questions (Top 5): \n" + "\n".join(answerable_questions[:5]) + "\n")
print(f"Unanswerable Questions (Top 5): \n" + "\n".join(unanswerable_questions[:5]) + "\n")

Answerable Questions (Top 5): 
What are the core challenges that RAG aims to solve in large language models (LLMs)?
How does the paper define and differentiate between Naive RAG Advanced RAG and Modular RAG?
What specific improvements does Advanced RAG introduce over Naive RAG?
Can you explain the roles of retrieval generation and augmentation processes in the RAG framework?
How are external knowledge sources integrated during the retrieval phase to enhance the generation quality?

Unanswerable Questions (Top 5): 
How do different RAG implementations impact the latency of response generation in real-time systems?
What metrics are used to evaluate the trade-off between retrieval accuracy and generation quality in RAG systems?
How can RAG be optimized for low-resource languages or dialects?
What are the implications of data drift on RAG systems over time?
How can developers ensure that RAG systems do not inadvertently propagate fake news or misinformation?



In [5]:
len(all_questions)

100

# Wrap each FAQ into an `Example` object

The dspy `Example` object optionally lets you attach metadata, or additional labels, to input/output pairs.

For example, you may want to jointly supervise the answer as well as the context the retrieval system produced to feed into the answer generator.

In [6]:
import dspy
from typing import List, NamedTuple
import random

class DataSplits(NamedTuple):
    train: List
    dev: List
    test: List

def split_data(data: List, train_size: float, dev_size: float, test_size: float) -> DataSplits:
    if train_size + dev_size + test_size != 1:
        raise ValueError("The sum of train_size, dev_size, and test_size must be 1.")

    random.shuffle(data)  
    
    train_end = int(train_size * len(data))
    dev_end = train_end + int(dev_size * len(data))
    
    train_set = data[:train_end]
    dev_set = data[train_end:dev_end]
    test_set = data[dev_end:]
    
    return DataSplits(train=train_set, dev=dev_set, test=test_set)

splits = split_data(all_questions, 0.7, 0.15, 0.15)

trainset = splits.train
devset = splits.dev
testset = splits.test

# 2. LLM Metrics

Define a Metric for Performance.

In [7]:
# Reference - https://github.com/stanfordnlp/dspy/blob/main/examples/tweets/tweet_metric.py

metricLM = dspy.OpenAI(model='gpt-4-turbo', max_tokens=1000, model_type='chat')

# Signature for LLM assessments.

class Assess(dspy.Signature):
    """Assess the quality of an answer to a question."""
    
    context = dspy.InputField(desc="The context for answering the question.")
    assessed_question = dspy.InputField(desc="The evaluation criterion.")
    assessed_answer = dspy.InputField(desc="The answer to the question.")
    assessment_answer = dspy.OutputField(desc="A rating between 1 and 5. Only output the rating and nothing else.")

def llm_metric(gold, pred, trace=None):
    predicted_answer = pred.answer
    question = gold.question
    
    print(f"Test Question: {question}")
    print(f"Predicted Answer: {predicted_answer}")
    
    detail = "Is the assessed answer detailed?"
    faithful = "Is the assessed text grounded in the context? Say no if it includes significant facts not in the context."
    overall = f"Please rate how well this answer answers the question, `{question}` based on the context.\n `{predicted_answer}`"
    
    with dspy.context(lm=metricLM):
        context = dspy.Retrieve(k=5)(question).passages
        detail = dspy.ChainOfThought(Assess)(context="N/A", assessed_question=detail, assessed_answer=predicted_answer)
        faithful = dspy.ChainOfThought(Assess)(context=context, assessed_question=faithful, assessed_answer=predicted_answer)
        overall = dspy.ChainOfThought(Assess)(context=context, assessed_question=overall, assessed_answer=predicted_answer)
    
    print(f"Faithful: {faithful.assessment_answer}")
    print(f"Detail: {detail.assessment_answer}")
    print(f"Overall: {overall.assessment_answer}")
    
    
    total = float(detail.assessment_answer) + float(faithful.assessment_answer)*2 + float(overall.assessment_answer)
    
    return total / 5.0

## Inspect the metric

In [8]:
test_example = dspy.Example(question="What are the core challenges that RAG aims to solve in large language models (LLMs)?")
test_pred = dspy.Example(answer="Hallucinations, outdated knowledge, and opaque reasoning processes.")

type(llm_metric(test_example, test_pred))

Test Question: What are the core challenges that RAG aims to solve in large language models (LLMs)?
Predicted Answer: Hallucinations, outdated knowledge, and opaque reasoning processes.
Faithful: 5
Detail: 1
Overall: 5


float

In [9]:
test_example = dspy.Example(question="How does the paper define and differentiate between Naive RAG Advanced RAG and Modular RAG?")
test_pred = dspy.Example(answer="Naive, Advanced, and Modular RAG evolve in complexity and flexibility.")

type(llm_metric(test_example, test_pred))

Test Question: How does the paper define and differentiate between Naive RAG Advanced RAG and Modular RAG?
Predicted Answer: Naive, Advanced, and Modular RAG evolve in complexity and flexibility.
Faithful: 5
Detail: 2
Overall: 2


float

In [10]:
metricLM.inspect_history(n=3)




Assess the quality of an answer to a question.

---

Follow the following format.

Context: The context for answering the question.

Assessed Question: The evaluation criterion.

Assessed Answer: The answer to the question.

Reasoning: Let's think step by step in order to ${produce the assessment_answer}. We ...

Assessment Answer: A rating between 1 and 5. Only output the rating and nothing else.

---

Context: N/A

Assessed Question: Is the assessed answer detailed?

Assessed Answer: Naive, Advanced, and Modular RAG evolve in complexity and flexibility.

Reasoning: Let's think step by step in order to[32m produce the assessment_answer. We need to determine if the assessed answer provides detailed information. The answer mentions three types of RAG (Naive, Advanced, and Modular) but does not elaborate on what these terms mean, how they differ from each other, or any specific details about their complexity and flexibility. The answer is very brief and lacks depth, which is necessar

'\n\n\nAssess the quality of an answer to a question.\n\n---\n\nFollow the following format.\n\nContext: The context for answering the question.\n\nAssessed Question: The evaluation criterion.\n\nAssessed Answer: The answer to the question.\n\nReasoning: Let\'s think step by step in order to ${produce the assessment_answer}. We ...\n\nAssessment Answer: A rating between 1 and 5. Only output the rating and nothing else.\n\n---\n\nContext: N/A\n\nAssessed Question: Is the assessed answer detailed?\n\nAssessed Answer: Naive, Advanced, and Modular RAG evolve in complexity and flexibility.\n\nReasoning: Let\'s think step by step in order to\x1b[32m produce the assessment_answer. We need to determine if the assessed answer provides detailed information. The answer mentions three types of RAG (Naive, Advanced, and Modular) but does not elaborate on what these terms mean, how they differ from each other, or any specific details about their complexity and flexibility. The answer is very brief a

# 3. The DSPy Programming Model

This block of first code will initilaize the `GenerateAnswer` signature.

Then we will compose a `dspy.Module` consisting of:
- Retrieve
- GenerateAnswer

The DSPy programming model is one of the most powerful aspects of DSPy, we get:
- An intuitive interface to compose prompts into programs.
- A clean way to organize prompts into Signatures.
- Structured output parsing with `dspy.OutputField`
- Built-in prompt extensions such as `ChainOfThought`, `ReAct`, and more!

In [11]:
class GenerateAnswer(dspy.Signature):
    """Answer questions based on the context."""
    
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField()

In [12]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(answer=prediction.answer)

# A little more info on built-in dspy modules

The DSPy programming model gives you a lot of cool features out of the box. Observe how different modules implement signatures with additional prompting techniques like `ChainOfThought` and `ReAct`. `Predict` is the base class to observe what a standrd prompt looks like without the module extensions.

### dspy.Predict

In [13]:
dspy.Predict(GenerateAnswer)(question="What are the core challenges that RAG aims to solve in large language models (LLMs)?")
llm.inspect_history(n=1)




Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts
Question: ${question}
Answer: ${answer}

---

Question: What are the core challenges that RAG aims to solve in large language models (LLMs)?
Answer:[32m Context: RAG (Retrieval-Augmented Generation) aims to improve large language models by incorporating a retrieval mechanism to enhance their performance in generating text.
Question: What are the core challenges that RAG aims to solve in large language models (LLMs)?
Answer: RAG aims to address issues such as factual accuracy, coherence, and relevance in generated text by leveraging a retrieval mechanism to incorporate external knowledge sources.[0m





'\n\n\nAnswer questions based on the context.\n\n---\n\nFollow the following format.\n\nContext: may contain relevant facts\nQuestion: ${question}\nAnswer: ${answer}\n\n---\n\nQuestion: What are the core challenges that RAG aims to solve in large language models (LLMs)?\nAnswer:\x1b[32m Context: RAG (Retrieval-Augmented Generation) aims to improve large language models by incorporating a retrieval mechanism to enhance their performance in generating text.\nQuestion: What are the core challenges that RAG aims to solve in large language models (LLMs)?\nAnswer: RAG aims to address issues such as factual accuracy, coherence, and relevance in generated text by leveraging a retrieval mechanism to incorporate external knowledge sources.\x1b[0m\n\n\n'

### dspy.ChainOfThought

In [14]:
dspy.ChainOfThought(GenerateAnswer)(question="What are the core challenges that RAG aims to solve in large language models (LLMs)?")
llm.inspect_history(n=1)




Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Question: What are the core challenges that RAG aims to solve in large language models (LLMs)?

Reasoning: Let's think step by step in order to[32m Context: RAG (Retrieval-Augmented Generation) is a model that combines retrieval-based and generation-based approaches in large language models (LLMs) to improve performance in natural language processing tasks.

Question: What are the core challenges that RAG aims to solve in large language models (LLMs)?

Reasoning: Let's think step by step in order to understand the purpose of RAG in LLMs. RAG aims to address challenges such as information retrieval, context understanding, and response generation in LLMs by incorporating a retrieval mechanism to enhance the generation process.

Answer: RAG aims to so

"\n\n\nAnswer questions based on the context.\n\n---\n\nFollow the following format.\n\nContext: may contain relevant facts\n\nQuestion: ${question}\n\nReasoning: Let's think step by step in order to ${produce the answer}. We ...\n\nAnswer: ${answer}\n\n---\n\nQuestion: What are the core challenges that RAG aims to solve in large language models (LLMs)?\n\nReasoning: Let's think step by step in order to\x1b[32m Context: RAG (Retrieval-Augmented Generation) is a model that combines retrieval-based and generation-based approaches in large language models (LLMs) to improve performance in natural language processing tasks.\n\nQuestion: What are the core challenges that RAG aims to solve in large language models (LLMs)?\n\nReasoning: Let's think step by step in order to understand the purpose of RAG in LLMs. RAG aims to address challenges such as information retrieval, context understanding, and response generation in LLMs by incorporating a retrieval mechanism to enhance the generation pro

### dspy.ReAct

In [15]:
dspy.ReAct(GenerateAnswer, tools=[dspy.settings.rm])(question="What are the core challenges that RAG aims to solve in large language models (LLMs)?")
llm.inspect_history(n=1)




You will be given `context`, `question` and you will respond with `answer`.

To do this, you will interleave Thought, Action, and Observation steps.

Thought can reason about the current situation, and Action can be the following types:

(1) Search[query], which takes a search query and returns one or more potentially relevant passages from a corpus
(2) Finish[answer], which returns the final `answer` and finishes the task

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Thought 1: next steps to take based on last observation

Action 1: always either Search[query] or, when done, Finish[answer]

Observation 1: observations based on action

Thought 2: next steps to take based on last observation

Action 2: always either Search[query] or, when done, Finish[answer]

Observation 2: observations based on action

Thought 3: next steps to take based on last observation

Action 3: always either Search[query] or, when done, Finish[answer]

---

C

"\n\n\nYou will be given `context`, `question` and you will respond with `answer`.\n\nTo do this, you will interleave Thought, Action, and Observation steps.\n\nThought can reason about the current situation, and Action can be the following types:\n\n(1) Search[query], which takes a search query and returns one or more potentially relevant passages from a corpus\n(2) Finish[answer], which returns the final `answer` and finishes the task\n\n---\n\nFollow the following format.\n\nContext: may contain relevant facts\n\nQuestion: ${question}\n\nThought 1: next steps to take based on last observation\n\nAction 1: always either Search[query] or, when done, Finish[answer]\n\nObservation 1: observations based on action\n\nThought 2: next steps to take based on last observation\n\nAction 2: always either Search[query] or, when done, Finish[answer]\n\nObservation 2: observations based on action\n\nThought 3: next steps to take based on last observation\n\nAction 3: always either Search[query] or

# Initialize DSPy Program

In [16]:
uncompiled_rag = RAG()

# Test uncompiled inference 

In [17]:
print(uncompiled_rag("What are the core challenges that RAG aims to solve in large language models (LLMs)?").answer)

The core challenges that RAG aims to solve in large language models (LLMs) are handling queries beyond training data, ensuring current information accuracy, and reducing the risk


# Check the last call to the LLM

In [18]:
llm.inspect_history(n=1)




Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «paper introduces up-to-date evaluation framework and benchmark. At the end, this article delineates the challenges currently faced and points out prospective avenues for research and development. IIntroduction Large language models (LLMs) have achieved remarkable success, though they still face significant limitations, especially in domain-specific or knowledge-intensive tasks[1], notably producing “hallucinations”[2]when handling queries beyond their training data or requiring current information. To overcome challenges, Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant document chunks from external knowledge base through semantic similarity calculation. By referencing external knowledge, RAG effectively reduce

'\n\n\nAnswer questions based on the context.\n\n---\n\nFollow the following format.\n\nContext: may contain relevant facts\n\nQuestion: ${question}\n\nReasoning: Let\'s think step by step in order to ${produce the answer}. We ...\n\nAnswer: ${answer}\n\n---\n\nContext:\n[1] «paper introduces up-to-date evaluation framework and benchmark. At the end, this article delineates the challenges currently faced and points out prospective avenues for research and development. IIntroduction Large language models (LLMs) have achieved remarkable success, though they still face significant limitations, especially in domain-specific or knowledge-intensive tasks[1], notably producing “hallucinations”[2]when handling queries beyond their training data or requiring current information. To overcome challenges, Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant document chunks from external knowledge base through semantic similarity calculation. By referencing external knowledge, 

# 4. DSPy Optimization

# Evaluate our RAG Program before it is compiled

In [19]:
# Reminder our dataset looks like this:

devset[0]

'What are the latest evaluation frameworks and benchmarks used to measure the effectiveness of RAG systems?'

In [20]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

evaluate(RAG(), metric=llm_metric)

  0%|          | 0/15 [00:00<?, ?it/s]

ERROR:dspy.evaluate.evaluate:[2m2024-04-28T02:38:47.429068Z[0m [[31m[1merror    [0m] [1mError for example in dev set: 		 'str' object has no attribute 'inputs'[0m [[0m[1m[34mdspy.evaluate.evaluate[0m][0m [36mfilename[0m=[35mevaluate.py[0m [36mlineno[0m=[35m147[0m


Average Metric: 0.0 / 1  (0.0):   7%|▋         | 1/15 [00:00<00:01, 10.15it/s]

ERROR:dspy.evaluate.evaluate:[2m2024-04-28T02:38:47.431058Z[0m [[31m[1merror    [0m] [1mError for example in dev set: 		 'str' object has no attribute 'inputs'[0m [[0m[1m[34mdspy.evaluate.evaluate[0m][0m [36mfilename[0m=[35mevaluate.py[0m [36mlineno[0m=[35m147[0m


Average Metric: 0.0 / 2  (0.0):  13%|█▎        | 2/15 [00:00<00:00, 20.04it/s]

ERROR:dspy.evaluate.evaluate:[2m2024-04-28T02:38:47.432423Z[0m [[31m[1merror    [0m] [1mError for example in dev set: 		 'str' object has no attribute 'inputs'[0m [[0m[1m[34mdspy.evaluate.evaluate[0m][0m [36mfilename[0m=[35mevaluate.py[0m [36mlineno[0m=[35m147[0m


Average Metric: 0.0 / 3  (0.0):  20%|██        | 3/15 [00:00<00:00, 29.92it/s]

ERROR:dspy.evaluate.evaluate:[2m2024-04-28T02:38:47.433776Z[0m [[31m[1merror    [0m] [1mError for example in dev set: 		 'str' object has no attribute 'inputs'[0m [[0m[1m[34mdspy.evaluate.evaluate[0m][0m [36mfilename[0m=[35mevaluate.py[0m [36mlineno[0m=[35m147[0m


Average Metric: 0.0 / 4  (0.0):  20%|██        | 3/15 [00:00<00:00, 29.92it/s]

AttributeError: 'str' object has no attribute 'inputs'

Average Metric: 0.0 / 4  (0.0):  27%|██▋       | 4/15 [00:20<00:00, 29.92it/s]

# Metric Analysis

The maximum value per rating is (5 + 5*2 + 5) / 5 = 4

4 * 10 test questions = 40

In [None]:
llm.inspect_history(n=1)

In [None]:
metricLM.inspect_history(n=3)

# BootstrapFewShot

In [None]:
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=llm_metric, max_labeled_demos=8, max_rounds=3)

# also common to init here, e.g. Rag()
compiled_rag = teleprompter.compile(uncompiled_rag, trainset=trainset)

### Inspect the compiled prompt

In [None]:
compiled_rag("What do cross encoders do?").answer

In [None]:
llm.inspect_history(n=1)

### Evaluate the Compiled RAG Program

In [None]:
evaluate(compiled_rag, metric=llm_metric)

# BootstrapFewShotWithRandomSearch

In [None]:
# Accidentally spent $12 on this with `num_candidate_programs=20`, caution!

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

teleprompter = BootstrapFewShotWithRandomSearch(metric=llm_metric, 
                                                max_bootstrapped_demos=4,
                                                max_labeled_demos=4, 
                                                max_rounds=1,
                                                num_candidate_programs=2,
                                                num_threads=2)

# also common to init here, e.g. Rag()
second_compiled_rag = teleprompter.compile(uncompiled_rag, trainset=trainset)

In [None]:
second_compiled_rag("What do cross encoders do?")

In [None]:
llm.inspect_history(n=1)

In [None]:
evaluate(second_compiled_rag, metric=llm_metric)

# BayesianSignatureOptimizer

In [None]:
from dspy.teleprompt import BayesianSignatureOptimizer

llm_prompter = dspy.OpenAI(model='gpt-4', max_tokens=2000, model_type='chat')

teleprompter = BayesianSignatureOptimizer(task_model=dspy.settings.lm,
                                          metric=llm_metric,
                                          prompt_model=llm_prompter,
                                          n=5,
                                          verbose=False)

kwargs = dict(num_threads=1, display_progress=True, display_table=0)
third_compiled_rag = teleprompter.compile(RAG(), devset=devset,
                                         optuna_trials_num=3,
                                         max_bootstrapped_demos=4,
                                         max_labeled_demos=4,
                                         eval_kwargs=kwargs)

In [None]:
third_compiled_rag("What do cross encoders do?")

# Check this out!!

Below you can see how the BayesianSignatureOptimizer jointly (1) optimizes the task instruction to:

```
Assess the context and answer the given questions that are predominantly about software usage, process optimization, and troubleshooting. Focus on providing accurate information related to tech or software-related queries.
```

As well as sourcing input-output examples for the prompt!

In [None]:
llm.inspect_history(n=1)

In [None]:
evaluate(third_compiled_rag, metric=llm_metric)

# Test Set Eval

In [None]:
# Evaluate Uncompiled
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate = Evaluate(devset=testset, num_threads=1, display_progress=True, display_table=5)

In [None]:
evaluate(uncompiled_rag, metric=llm_metric)

In [None]:
evaluate(compiled_rag, metric=llm_metric)

In [None]:
evaluate(second_compiled_rag, metric=llm_metric)

In [None]:
evaluate(third_compiled_rag, metric=llm_metric)