[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/dspy/3.Adding-Depth-to-RAG-Programs.ipynb)

# Adding Depth to RAG Programs

This tutorial builds on the previous notebook, `Getting Started with RAG`, by adding **Depth**: additional LLM components in our RAG program.

## RAG Programs

We will explore 4 variants of RAG:

1. [RAG] 

Retrieve->`GenerateAnswer`

2. [RAGwithSummarizer] 

Retrieve->`SummarizeContext`->`GenerateAnswer`

3. [MultiHopRAG] 

Loop(`AskQuestion`->Retrieve)->`GenerateAnswer`

4. [MultiHopRAGwithSummarizer] 

Loop(`AskQuestion`->Retrieve->`SummarizeContext`)->`GenerateAnswer`
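
To make the notation concrete, here is a minimal sketch of the fourth variant in plain Python. The callables `ask_question`, `retrieve`, `summarize`, and `generate_answer` are hypothetical stand-ins for the LLM and retriever components. The toy implementations exist only so the control flow can be run without a model or a Weaviate instance.

```python
from typing import Callable, List


def multi_hop_with_summarizer(
    question: str,
    ask_question: Callable[[List[str], str], str],
    retrieve: Callable[[str], List[str]],
    summarize: Callable[[List[str], str], str],
    generate_answer: Callable[[List[str], str], str],
    max_hops: int = 2,
) -> str:
    """Loop(AskQuestion -> Retrieve -> SummarizeContext) -> GenerateAnswer."""
    context: List[str] = []
    for _ in range(max_hops):
        query = ask_question(context, question)     # AskQuestion
        passages = retrieve(query)                  # Retrieve
        context.append(summarize(passages, query))  # SummarizeContext
    return generate_answer(context, question)       # GenerateAnswer


# Toy stand-ins (not real LLM or retriever calls).
def toy_ask(context, question):
    return f"{question} (hop {len(context) + 1})"


def toy_retrieve(query):
    return [f"passage about {query}"]


def toy_summarize(passages, query):
    return f"summary of {len(passages)} passage(s) for '{query}'"


def toy_answer(context, question):
    return f"answer to '{question}' from {len(context)} summaries"


result = multi_hop_with_summarizer(
    "What do cross encoders do?", toy_ask, toy_retrieve, toy_summarize, toy_answer
)
print(result)
```

The implementation later in this notebook additionally seeds the context with an initial retrieval for the original question before the loop starts.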

## LLM Metric

We will use an LLM metric to evaluate these systems along a weighted combination of (1) how detailed the responses are, (2) how well supported they are by the context, and (3) how well they align with a given ground truth answer to the question.

Our LLM metric is itself a 3-layer DSPy program:

`SummarizeContext` -> `JudgeAnswer` -> `ParseFloat`

The LLM Metric scores answer quality as a combination of:

`Detail` + `Faithfulness`*2 + `Alignment`
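
As a rough sketch of the arithmetic, assume each of the three ratings is an integer from 1 to 5 (as the `Evaluator` signature later in this notebook requests) and that the weighted sum is divided by 5 (as the metric's `forward` method does). The function name `combine_ratings` is illustrative and not part of the notebook.

```python
def combine_ratings(detail: int, faithfulness: int, alignment: int) -> float:
    """Weighted combination: Detail + 2 * Faithfulness + Alignment, scaled by 1/5."""
    for rating in (detail, faithfulness, alignment):
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
    return (detail + 2 * faithfulness + alignment) / 5.0


print(combine_ratings(5, 5, 5))  # 4.0, the best possible score
print(combine_ratings(1, 1, 1))  # 0.8, the worst possible score
print(combine_ratings(3, 5, 4))  # 3.4
```

Under these assumptions a single example scores between 0.8 and 4.0, with faithfulness counting twice as much as either of the other two criteria.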

## Optimizers

We will use the `BootstrapFewShot` optimizer (BFS in the tables below) to bootstrap 2 synthetic input-output examples for each component in the 4 RAG programs, and then let the `BayesianSignatureOptimizer` (BSO) loose on our most complex program, `MultiHopRAGwithSummarizer`.

## Results

### Development Set

| Program                    | Optimizer | Devset Uncompiled | Devset Compiled |
|----------------------------|-----------|-------------------|-----------------|
| RAG                        | BFS       | 248               | 288             |
| RAGwithSummarizer          | BFS       | 288               | 276             |
| MultiHopRAG                | BFS       | 240               | 260             |
| MultiHopRAGwithSummarizer  | BFS       | 204               | 288             |
| MultiHopRAGwithSummarizer  | BSO       | 204               | 300             |

### Test Set

| Program                    | Optimizer | Testset Uncompiled | Testset Compiled |
|----------------------------|-----------|--------------------|------------------|
| RAG                        | BFS       | 315                | 350              |
| RAGwithSummarizer          | BFS       | 325                | 292              |
| MultiHopRAG                | BFS       | 299                | 302              |
| MultiHopRAGwithSummarizer  | BFS       | 316                | 341              |
| MultiHopRAGwithSummarizer  | BSO       | 316                | 301              |
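
The numbers in these tables are the percentages printed by the evaluation cells later in the notebook (for example `Average Metric: 12.4 / 5 (248.0%)`). Below is a minimal sketch of that arithmetic, assuming the reported value is the sum of per-example metric scores divided by the number of examples and multiplied by 100. The helper name `average_metric_percent` and the example scores are hypothetical.

```python
from typing import Sequence


def average_metric_percent(scores: Sequence[float]) -> float:
    """Return 100 * sum(scores) / len(scores), rounded to one decimal place."""
    if not scores:
        raise ValueError("scores must not be empty")
    return round(100.0 * sum(scores) / len(scores), 1)


# Five hypothetical per-example scores that sum to 12.4.
example_scores = [2.2, 2.4, 2.6, 2.4, 2.8]
print(average_metric_percent(example_scores))  # 248.0
```

Because a single example can score above 1.0 with this notebook's metric, values above 100 are expected. If every rating is at most 5, the ceiling is 400.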

# A Deeper Look into the Discovered Prompts

(From the `BayesianSignatureOptimizer` on `MultiHopRAGwithSummarizer`)

| Component                | Token Count |
|--------------------------|-------------|
| QueryWriter Prompt       | 4,090       |
| Summarize Prompt         | 3,209       |
| OptimizedAnswer Prompt   | 4,324       |


#### Each prompt has `4` input-output examples of the task, and a task description that is either kept as is or replaced by an LLM-written version, whichever scores better on the metric.

#### Note how you get `Reasoning` input-output traces in the few-shot examples!

# The full optimized prompts can be found at the bottom of the notebook.

## Load Data into Weaviate
**You need a running Weaviate cluster with data**:
1. Learn about the installation options [here](https://weaviate.io/developers/weaviate/installation)
2. Import your data:
    1. You can follow the `Weaviate-Import.ipynb` notebook to load in the Weaviate blogs (recipes/integrations/dspy/Weaviate-Import.ipynb)
    2. Or follow this [Quickstart Guide](https://weaviate.io/developers/weaviate/quickstart)

# Installation and Settings

In [9]:
# Connect to Weaviate Retriever and configure LLM
import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM
import weaviate
import openai

gpt_turbo = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=4000)
mistral_ollama = dspy.OllamaLocal(model="mistral", max_tokens=4000, timeout_s=480)

weaviate_client = weaviate.Client("http://localhost:8080")

retriever_model = WeaviateRM("WeaviateBlogChunk", weaviate_client=weaviate_client)
dspy.settings.configure(lm=gpt_turbo, rm=retriever_model)

In [10]:
gpt_turbo("What is Artificial Intelligence")

['Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, problem-solving, perception, and language understanding. AI technologies are used in a wide range of applications, such as speech recognition, image recognition, natural language processing, and autonomous vehicles. AI systems can be designed to perform tasks that typically require human intelligence, and they are becoming increasingly advanced and prevalent in our daily lives.']

In [11]:
mistral_ollama("What is Artificial Intelligence?")

[' Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. These machines are designed to perform tasks that would normally require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI systems use various techniques and algorithms, including machine learning, deep learning, natural language processing, and reasoning, to analyze data, identify patterns, make predictions, and solve problems. The ultimate goal of AI is to create intelligent machines that can work and learn autonomously, adapt to new situations, and improve their performance over time.']

# Dataset

In [5]:
# Read the dataset from a JSON file stored next to this notebook.

import json

# Assumes 'WeaviateBlogRAG-0-0-0.json' is in the same directory as this notebook
file_path = './WeaviateBlogRAG-0-0-0.json'

# Load the dataset
with open(file_path, 'r') as file:
    dataset = json.load(file)

# Initialize empty lists for gold_answers and queries
gold_answers = []
queries = []

# Parse the gold_answers and queries
for row in dataset:
    gold_answers.append(row["gold_answer"])
    queries.append(row["query"])

In [6]:
import dspy
data = []

for i in range(len(gold_answers)):
    data.append(dspy.Example(gold_answer=gold_answers[i], question=queries[i]).with_inputs("question"))

trainset = data[:30]
devset = data[30:35] # Small Devset
testset = data[35:]
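
A small sanity check on the split can catch an off-by-one slice early. The sketch below uses plain dictionaries with made-up rows instead of `dspy.Example` objects so that it runs without DSPy. The helper `split_dataset` and the row count of 50 are assumptions for illustration.

```python
def split_dataset(rows, n_train=30, n_dev=5):
    """Split rows into train/dev/test using the same slice boundaries as above."""
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]
    return train, dev, test


# 50 made-up rows; the real dataset size is not assumed here.
rows = [{"question": f"q{i}", "gold_answer": f"a{i}"} for i in range(50)]
train, dev, test = split_dataset(rows)

# Every row lands in exactly one split.
assert len(train) + len(dev) + len(test) == len(rows)
assert not {r["question"] for r in train} & {r["question"] for r in test}
print(len(train), len(dev), len(test))  # 30 5 15
```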

# How do you create a dummy endpoint in FastAPI that returns `{"Hello": "World"}` when accessed? 

In [7]:
print(devset[1].gold_answer)

To create a dummy endpoint in FastAPI that returns `{"Hello": "World"}` when accessed, you need to follow these steps:

1. Import the FastAPI module: `from fastapi import FastAPI`
2. Create an instance of the FastAPI class: `app = FastAPI()`
3. Define a route that responds to HTTP GET requests at the root ("/") URL. This is done by using the `@app.get("/")` decorator followed by a function that returns the desired message. The function could look like this:
```python
def read_root():
    """
    Say hello to the world
    """
    return {"Hello": "World"}
```
So, the complete code would look like this:
```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    """
    Say hello to the world
    """
    return {"Hello": "World"}
```
When this code is run and the application is accessed at its root URL, it will respond with `{"Hello": "World"}`.


# What optimization has Weaviate introduced to manage memory usage during parallel data imports?

In [8]:
print(devset[2].gold_answer)

Weaviate has introduced thread pooling optimization to manage memory usage during parallel data imports. This optimization ensures that the parallelization does not exceed the number of CPU cores, thus providing maximum performance without unnecessary memory usage.


# LLM Metric

In [16]:
class Evaluator(dspy.Signature):
    """Evaluate the quality of a system's answer to a question according to a given criterion."""
    
    context = dspy.InputField(desc="The context for answering the question.")
    criterion = dspy.InputField(desc="The evaluation criterion.")
    question = dspy.InputField(desc="The question asked to the system.")
    ground_truth_answer = dspy.InputField(desc="An expert written Ground Truth Answer to the question.")
    predicted_answer = dspy.InputField(desc="The system's answer to the question.")
    rating = dspy.OutputField(desc="A rating between 1 and 5. IMPORTANT!! Only output the rating as an `int` and nothing else.")

class RatingParser(dspy.Signature):
    """Parse the rating from a string."""
    
    raw_rating_response = dspy.InputField(desc="The string that contains the rating in it.")
    rating = dspy.OutputField(desc="An integer valued rating.")
    
class Summarizer(dspy.Signature):
    """Summarize the information provided in the search results in 5 sentences."""
    
    question = dspy.InputField(desc="a question to a search engine")
    context = dspy.InputField(desc="context filtered as relevant to the query by a search engine")
    summary = dspy.OutputField(desc="a 5 sentence summary of information in the context that would help answer the question.")

class RAGMetricProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.evaluator = dspy.ChainOfThought(Evaluator)
        self.rating_parser = dspy.Predict(RatingParser)
        self.summarizer = dspy.ChainOfThought(Summarizer)
    
    def forward(self, gold, pred, trace=None):
        # Todo add trace to interface with teleprompters
        predicted_answer = pred.answer
        question = gold.question
        ground_truth_answer = gold.gold_answer
        
        detail = "Is the assessed answer detailed?"
        faithful = "Is the assessed answer factually supported by the context?"
        ground_truth = f"The Ground Truth Answer to the Question: {question} is given as: \n \n {ground_truth_answer} \n \n How aligned is this Predicted Answer? {predicted_answer}"
        
        # Judgement
        with dspy.context(lm=gpt_turbo):
            context = dspy.Retrieve(k=10)(question).passages
            # Context Summary
            context = self.summarizer(question=question, context=context).summary
            raw_detail_response = self.evaluator(context=context, 
                                 criterion=detail,
                                 question=question,
                                 ground_truth_answer=ground_truth_answer,
                                 predicted_answer=predicted_answer).rating
            raw_faithful_response = self.evaluator(context=context, 
                                 criterion=faithful,
                                 question=question,
                                 ground_truth_answer=ground_truth_answer,
                                 predicted_answer=predicted_answer).rating
            raw_ground_truth_response = self.evaluator(context=context, 
                                 criterion=ground_truth,
                                 question=question,
                                 ground_truth_answer=ground_truth_answer,
                                 predicted_answer=predicted_answer).rating
        
        # Structured Output Parsing
        with dspy.context(lm=gpt_turbo):
            detail_rating = self.rating_parser(raw_rating_response=raw_detail_response).rating
            faithful_rating = self.rating_parser(raw_rating_response=raw_faithful_response).rating
            ground_truth_rating = self.rating_parser(raw_rating_response=raw_ground_truth_response).rating
        
        total = float(detail_rating) + float(faithful_rating)*2 + float(ground_truth_rating)
    
        return total / 5.0

toy_ground_truth_answer = """
Cross encoders score the relevance of a document to a query. They are commonly used to rerank documents.
"""

lgtm_query = "What do cross encoders do?"
lgtm_example = dspy.Example(question=lgtm_query, gold_answer=toy_ground_truth_answer)


# If this is your first time exploring LLM metrics,
# I recommend trying the exercise of improving this answer to achieve a higher LLM rating.

lgtm_pred = dspy.Example(answer="They re-rank documents.")

llm_metric = RAGMetricProgram()
llm_metric_rating = llm_metric(lgtm_example, lgtm_pred)
print(llm_metric_rating)

def MetricWrapper(gold, pred, trace=None):
    return llm_metric(gold, pred)

3.4
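
The `float(...)` calls in `forward` raise a `ValueError` if the parser returns anything other than a bare number (for example `"Rating: 4"`). One possible safeguard, sketched here as an assumption rather than something the notebook does, is a small regex-based fallback that extracts the first number and clamps it to the 1 to 5 range. `parse_rating` and its `default` parameter are hypothetical names.

```python
import re


def parse_rating(raw: str, default: float = 1.0) -> float:
    """Extract the first number from `raw` and clamp it to the range [1, 5]."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return default  # no number found, fall back to the lowest rating
    return min(5.0, max(1.0, float(match.group())))


print(parse_rating("4"))            # 4.0
print(parse_rating("Rating: 5"))    # 5.0
print(parse_rating("I give it 7"))  # 5.0 (clamped)
print(parse_rating("no rating"))    # 1.0 (default)
```

Falling back to the lowest rating is a conservative choice; raising an error instead would make parsing failures more visible during evaluation.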


# RAG

In [13]:
# Note: we will re-use this initial signature when defining the other RAG programs.

# Copy-paste signature initialization from 1. Getting Started with RAG in DSPy
class GenerateAnswer(dspy.Signature):
    """Assess the context and answer the given questions that are predominantly about software usage, process optimization, and troubleshooting. Focus on providing accurate information related to tech or software-related queries."""
    
    context = dspy.InputField(desc="Helpful information for answering the question.")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="A detailed answer that is supported by the context.")

In [14]:
class RAG(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        with dspy.context(lm=mistral_ollama):
            pred = self.generate_answer(context=context, question=question).answer
        return dspy.Prediction(context=context, answer=pred, question=question)

# LGTM Test

In [17]:
uncompiled_Prediction = RAG()(lgtm_query)
print(f"LGTM test query: {lgtm_query} \n \n ")
print(f"Uncompiled Answer: {uncompiled_Prediction.answer} \n \n")
test_example = dspy.Example(question=lgtm_query, gold_answer=toy_ground_truth_answer)
test_pred = uncompiled_Prediction
llm_metric_rating = llm_metric(test_example, test_pred)
print(f"LLM Metric Rating: {llm_metric_rating}")

LGTM test query: What do cross encoders do? 
 
 
Uncompiled Answer: Cross Encoders are a type of ranking model that uses a classification mechanism to determine the similarity between data pairs (query and data object) instead of producing vector embeddings for data. They achieve high accuracy but require more time and computational resources compared to other methods like Metadata Rankers or Score Rankers. To use Cross Encoders, you need to provide a pair of items as input, which can be the query and a data object in the context of search. 
 

LLM Metric Rating: 4.0


# Adding Depth to RAG

# RAG with Search Result Summarization

In [11]:
class Summarizer(dspy.Signature):
    """Please summarize all relevant information in the context."""
    
    context = dspy.InputField(desc="Documents determined to be relevant to the question.")
    question = dspy.InputField()
    summarized_context = dspy.OutputField(desc="A detailed summary of information in the context.")

In [15]:
class RAGwithSummarizer(dspy.Module):
    def __init__(self):
        super().__init__()
        
        self.retrieve = dspy.Retrieve(k=10)
        self.summarizer = dspy.ChainOfThought(Summarizer)
        self.summarizer._compiled = True
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        with dspy.context(lm=gpt_turbo):
            summarized_context = self.summarizer(context=context, question=question).summarized_context
        with dspy.context(lm=mistral_ollama):
            pred = self.generate_answer(context=summarized_context, question=question).answer
        return dspy.Prediction(answer=pred, summarized_context=summarized_context, context=context, question=question)

In [16]:
uncompiled_Prediction = RAGwithSummarizer()(lgtm_query)
print(f"Uncompiled Answer: {uncompiled_Prediction.answer} \n \n")
test_example = dspy.Example(question=lgtm_query, gold_answer=toy_ground_truth_answer)
test_pred = uncompiled_Prediction
llm_metric_rating = llm_metric(test_example, test_pred)
print(f"LLM Metric Rating: {llm_metric_rating}")

Uncompiled Answer: Cross Encoders are models that perform content-based re-ranking by comparing the similarity between data pairs using a classification mechanism. They are known for their accuracy but slower speed compared to Bi-Encoders, which is why they are often used in specialized use cases. Pre-trained Cross-Encoder models are available for such purposes, and Large Language Models (LLMs) can also 
 

LLM Metric Rating: 3.2


# Multi-Hop RAG

This implementation is based on the Baleen architecture from Khattab et al.

In [12]:
class GenerateSearchQuery(dspy.Signature):
    """Write a search query that will help answer a complex question."""

    context = dspy.InputField(desc="Contexts produced from previous search queries.")
    question = dspy.InputField(desc="The complex question we began with.")
    query = dspy.OutputField(desc="A search query that will help answer the question.")

In [18]:
from dsp.utils import deduplicate

class MultiHopRAG(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_question = dspy.ChainOfThought(GenerateSearchQuery)
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question):
        context_history = []
        queries = []
        
        # Note to self, discuss how this differs from AutoGPT
        context = self.retrieve(question).passages
        for hop in range(self.max_hops):
            query = self.generate_question(context=context, question=question).query
            queries.append(query) # For inspection
            passages = self.retrieve(query).passages
            context_history.append(passages)
            context = deduplicate(context + passages)
    
        with dspy.context(lm=mistral_ollama):
            pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(answer=pred.answer, queries=queries, context_history=context_history, question=question)
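
`deduplicate` is imported from `dsp.utils`. The sketch below is a stand-in that assumes it removes repeated passages while preserving first-seen order, which is the behavior the hop loop relies on when it merges `context + passages`. The name `deduplicate_preserving_order` is illustrative.

```python
def deduplicate_preserving_order(passages):
    """Drop repeated passages, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for passage in passages:
        if passage not in seen:
            seen.add(passage)
            unique.append(passage)
    return unique


context = ["A", "B", "C"]
new_passages = ["B", "D", "A"]
print(deduplicate_preserving_order(context + new_passages))  # ['A', 'B', 'C', 'D']
```

Keeping the original order means passages retrieved for the user's question stay ahead of passages retrieved for later generated queries.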

In [19]:
uncompiled_Prediction = MultiHopRAG()(lgtm_query)
print(f"Uncompiled Answer: {uncompiled_Prediction.answer} \n \n")
test_example = dspy.Example(question=lgtm_query, gold_answer=toy_ground_truth_answer)
test_pred = uncompiled_Prediction
llm_metric_rating = llm_metric(test_example, test_pred)
print(f"LLM Metric Rating: {llm_metric_rating}")

Uncompiled Answer: Cross-encoder models are a type of machine learning model used to measure the similarity between two data points, such as two sentences or two images. They differ from Bi-Encoder models in that they take a pair of items as input and output a single score representing their similarity (Figure 3), while Bi-Encoders produce separate embeddings for each item. Cross-encoder models have been shown to achieve higher accuracy than Bi-Encoders when trained on representative datasets. However, due to the need to use the model with every data item during a search in combination with the query, this method is less efficient for large-scale semantic search applications. To address the efficiency issue, we can combine the strengths of both models by 
 

LLM Metric Rating: 2.8


# Multi-Hop RAG with Summarization

In [13]:
class MultiHopRAGwithSummarizer(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_question = dspy.ChainOfThought(GenerateSearchQuery)
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.summarizer = dspy.ChainOfThought(Summarizer)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question):
        queries = []
        summarized_context_log = []
        
        context = self.retrieve(question).passages
        for hop in range(self.max_hops):
            query = self.generate_question(context=context, question=question).query
            queries.append(query)
            passages = self.retrieve(query).passages
            summarized_passages = self.summarizer(question=query, context=passages).summarized_context
            summarized_context_log.append(summarized_passages)
            context.append(summarized_passages)
        with dspy.context(lm=mistral_ollama):
            pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer, queries=queries, summarized_context_log=summarized_context_log, question=question)

In [21]:
uncompiled_Prediction = MultiHopRAGwithSummarizer()(lgtm_query)
print(f"Uncompiled Answer: {uncompiled_Prediction.answer} \n \n")
test_example = dspy.Example(question=lgtm_query, gold_answer=toy_ground_truth_answer)
test_pred = uncompiled_Prediction
llm_metric_rating = llm_metric(test_example, test_pred)
print(f"LLM Metric Rating: {llm_metric_rating}")

Uncompiled Answer: Cross Encoders are models that utilize pre-trained models to rank the relevance of documents. They take a `(query, document)` pair as input and output a high precision relevance score. These models can also be used as guardrails for generative models to prevent harmful or NSFW content from making it through the search pipeline. Cross Encoders are more accurate than 
 

LLM Metric Rating: 3.6


# Collect Programs

In [22]:
# https://github.com/stanfordnlp/dspy/blob/main/testing/optimizer_tester.py

class ProgramWrapper:
    def __init__(self, name, program):
        self.name = name
        self.uncompiled = program()
        self.uncompiled_score = 0.0
        self.compiled = None
        self.compiled_score = 0.0

programs = {
    "RAG": ProgramWrapper("RAG", RAG),
    "RAGwithSummarizer": ProgramWrapper("RAGwithSummarizer", RAGwithSummarizer),
    "MultiHopRAG": ProgramWrapper("MultiHopRAG", MultiHopRAG),
    "MultiHopRAGwithSummarizer": ProgramWrapper("MultiHopRAGwithSummarizer", MultiHopRAGwithSummarizer),
}

print(programs["RAG"].uncompiled("What are cross encoders?").answer)

Cross Encoders are a type of ranking model used in semantic search applications for content-based re-ranking. They achieve higher accuracy than Bi-Encoders by representing data and queries as vectors in a high-dimensional space and comparing their similarity to rank results. However, they are less efficient due to their time-consuming nature. To address this issue, the combination of Bi-Encoders and Cross-Encoders is used. In this approach, Bi-Encoders are employed for efficient retrieval of candidate results based on a query, followed by Cross-Encoders for accurate reranking of these candidates. This method allows us to benefit from the strengths of both models and handle large scale


# LGTM Test (Uncompiled)

In [23]:
for program_key in programs.keys():
    response = programs[program_key].uncompiled(lgtm_query)
    print(f"{program_key}: {response.answer}\n")
    metric_pred = llm_metric(test_example, response)
    print(f"LLM Rating: {metric_pred}\n")

RAG: Cross Encoders are a type of ranking model that uses a classification mechanism to determine the similarity between data pairs

LLM Rating: 2.2

RAGwithSummarizer: Cross Encoders are models that perform content-based re-ranking by comparing the similarity between data pairs using a classification mechanism. They are known for their accuracy but slower speed compared to Bi-Encoders, which is why they are often used in specialized use cases. Pre-trained Cross-Encoder models are available for such purposes, and Large Language Models (LLMs) can also

LLM Rating: 3.2

MultiHopRAG: Cross-encoder models are a type of machine learning model used to measure the similarity between two data points, such as two sentences or two images. They differ from Bi-Encoder models in that they take a pair of items as input and output a single score representing their similarity (Figure 3), while Bi-Encoders produce separate embeddings for each item. Cross-encoder models have been shown to achieve higher

# Evaluate

In [24]:
devset[0]

Example({'gold_answer': 'The strategy for chunking text for vectorization when dealing with a 512 token length limit involves using a Large Language Model to identify suitable places to cut up text chunks. This process, known as "chunking", breaks down long documents into smaller sections, each containing an important piece of information. This approach not only helps to stay within the LLMs token limit but also enhances the retrieval of information. It\'s important to note that the chunking should be done thoughtfully, not just splitting a list of items into 2 chunks because the first half fell into the tail end of a chunk[:512] loop.', 'question': 'What is the strategy for chunking text for vectorization when dealing with a 512 token length limit?'}) (input_keys={'question'})

In [35]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=devset, num_threads=4, display_progress=False)

for program_key in programs.keys():
    print(f"Evaluating... {program_key}\n")
    uncompiled_score = evaluate(programs[program_key].uncompiled, metric=MetricWrapper)
    print(f"Uncompiled score for {program_key} = {uncompiled_score}.\n")
    programs[program_key].uncompiled_score = uncompiled_score

Evaluating... RAG

Average Metric: 12.399999999999999 / 5  (248.0%)
Uncompiled score for RAG = 248.0.

Evaluating... RAGwithSummarizer

Average Metric: 14.399999999999999 / 5  (288.0%)
Uncompiled score for RAGwithSummarizer = 288.0.

Evaluating... MultiHopRAG

Average Metric: 12.0 / 5  (240.0%)
Uncompiled score for MultiHopRAG = 240.0.

Evaluating... MultiHopRAGwithSummarizer

Average Metric: 10.200000000000001 / 5  (204.0%)
Uncompiled score for MultiHopRAGwithSummarizer = 204.0.



# Compile Programs (BootstrapFewShot)

In [26]:
from dspy.teleprompt import BootstrapFewShot

#teacherLM = dspy.OpenAI(model="gpt-4", max_tokens=12_000, model_type='chat')
teleprompter = BootstrapFewShot(metric=llm_metric, max_bootstrapped_demos=2, max_rounds=1)

for program_key in programs.keys():
    print(f"Compiling... {program_key}\n")
    compiled_program = teleprompter.compile(programs[program_key].uncompiled, trainset=trainset)
    programs[program_key].compiled = compiled_program

Compiling... RAG



  7%|██▉                                         | 2/30 [00:51<11:58, 25.65s/it]


Bootstrapped 2 full traces after 3 examples in round 0.
Compiling... RAGwithSummarizer



  7%|██▉                                         | 2/30 [00:33<07:49, 16.76s/it]


Bootstrapped 2 full traces after 3 examples in round 0.
Compiling... MultiHopRAG



  7%|██▉                                         | 2/30 [00:41<09:41, 20.75s/it]


Bootstrapped 2 full traces after 3 examples in round 0.
Compiling... MultiHopRAGwithSummarizer



  7%|██▉                                         | 2/30 [01:03<14:45, 31.62s/it]

Bootstrapped 2 full traces after 3 examples in round 0.





In [27]:
programs["MultiHopRAGwithSummarizer"].compiled("What are cross encoders").answer

'Cross Encoders are machine learning models used for content-based re-ranking of search results. They utilize pre-trained models, such as those available on Sentence Transformers, to rank the relevance of documents based on their semantic similarity with a given query. The advantage of using cross encoders is that they can reason about the relevance of results without requiring specialized training for each specific domain or dataset, leading to a more personalized and context-aware search experience.'

In [20]:
#gpt_turbo.inspect_history(n=1)
#mistral_ollama.inspect_history(n=1)

# LGTM Test (Compiled)

In [30]:
for program_key in programs.keys():
    response = programs[program_key].compiled(lgtm_query)
    print(f"{program_key}: {response.answer}\n")
    metric_pred = llm_metric(test_example, response)
    print(f"LLM Rating: {metric_pred}\n")

RAG: Cross encoders are a type of ranking model used for content-based re-ranking that achieve high accuracy within their domain but are more time-consuming compared to other methods like bi-encoders. Instead of producing vector embeddings for data, cross encoders use a classification mechanism for data pairs. The input of the model is always a pair of items, such as two sentences, and it outputs a value between 0 and 1 indicating their similarity. To use cross encoders for search, you need to calculate the similarity between each data item and the search query by passing them as pairs to the model.

LLM Rating: 4.0

RAGwithSummarizer: Cross encoders are models used for content-based re-ranking. They do not generate vector embeddings for data but rather compare data pairs using a classification mechanism. This approach allows for determining the similarity between data items and a query, resulting in higher accuracy compared to bi-encoders. However, cross encoders can be less efficient

# Evaluate Compiled

In [31]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=devset, num_threads=4, display_progress=False)

for program_key in programs.keys():
    print(f"Evaluating... {program_key}\n")
    compiled_score = evaluate(programs[program_key].compiled, metric=MetricWrapper)
    print(f"Compiled score for {program_key} = {compiled_score}.\n")
    programs[program_key].compiled_score = compiled_score

Evaluating... RAG

Average Metric: 14.399999999999999 / 5  (288.0%)
Compiled score for RAG = 288.0.

Evaluating... RAGwithSummarizer

Average Metric: 13.8 / 5  (276.0%)
Compiled score for RAGwithSummarizer = 276.0.

Evaluating... MultiHopRAG

Average Metric: 13.0 / 5  (260.0%)
Compiled score for MultiHopRAG = 260.0.

Evaluating... MultiHopRAGwithSummarizer

Average Metric: 14.4 / 5  (288.0%)
Compiled score for MultiHopRAGwithSummarizer = 288.0.



# Optimizing 4-layer programs with the BayesianSignatureOptimizer!

In [14]:
from dspy.teleprompt import BayesianSignatureOptimizer

llm_prompter = dspy.OpenAI(model='gpt-4', max_tokens=4000, model_type='chat')

teleprompter = BayesianSignatureOptimizer(task_model = dspy.settings.lm,
                                          metric=MetricWrapper,
                                          prompt_model=llm_prompter,
                                          n=5,
                                          verbose=False)

kwargs = dict(num_threads=4, display_progress=False)

BSO_optimized_MultiHopRAGwithSummarizer = teleprompter.compile(MultiHopRAGwithSummarizer(),
                                                              devset=devset,
                                                              optuna_trials_num=3,
                                                              max_bootstrapped_demos=4,
                                                              max_labeled_demos=4,
                                                              eval_kwargs=kwargs)

 80%|████████████████████████████████████         | 4/5 [00:45<00:11, 11.38s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


 80%|████████████████████████████████████         | 4/5 [00:46<00:11, 11.60s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


 80%|████████████████████████████████████         | 4/5 [00:44<00:11, 11.20s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


 80%|████████████████████████████████████         | 4/5 [00:44<00:11, 11.19s/it]
[I 2024-03-01 17:32:40,174] A new study created in memory with name: no-name-1603d8be-db64-48b6-bf67-fc547136396f


Bootstrapped 4 full traces after 5 examples in round 0.


[I 2024-03-01 17:33:39,365] Trial 0 finished with value: 284.0 and parameters: {'10762695104_predictor_instruction': 1, '10762695104_predictor_demos': 4, '10762694384_predictor_instruction': 3, '10762694384_predictor_demos': 2, '10762694624_predictor_instruction': 0, '10762694624_predictor_demos': 0}. Best is trial 0 with value: 284.0.


Average Metric: 14.2 / 5  (284.0%)




Parse the rating from a string.

---

Follow the following format.

Raw Rating Response: The string that contains the rating in it.
Rating: An integer valued rating.

---

Raw Rating Response: 5
Rating: 5
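
The prompt above asks the LLM to pull an integer rating out of a free-form response. For comparison, here is a minimal deterministic sketch of the same step using a regular expression. This is a swapped-in technique, not what the notebook does (the notebook uses an LLM call for parsing), and `parse_rating` is a hypothetical helper name.

```python
import re


def parse_rating(raw_response):
    """Return the first integer found in the response, or None if there is none.

    Hypothetical helper: the notebook delegates this parsing to an LLM.
    """
    match = re.search(r"\d+", raw_response)
    if match is None:
        return None
    return int(match.group())


print(parse_rating("5"))
print(parse_rating("I would rate this answer a 4."))
print(parse_rating("No rating given."))
```

A regex is cheaper and reproducible, but it picks the first number it sees, so a response such as "2 of the 3 claims are supported, rating 4" would be misread. An LLM parser is the more forgiving choice for that kind of output.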





[I 2024-03-01 17:34:42,225] Trial 1 finished with value: 300.0 and parameters: {'10762695104_predictor_instruction': 0, '10762695104_predictor_demos': 4, '10762694384_predictor_instruction': 3, '10762694384_predictor_demos': 3, '10762694624_predictor_instruction': 0, '10762694624_predictor_demos': 4}. Best is trial 1 with value: 300.0.


Average Metric: 15.0 / 5  (300.0%)




Parse the rating from a string.

---

Follow the following format.

Raw Rating Response: The string that contains the rating in it.
Rating: An integer valued rating.

---

Raw Rating Response: 5
Rating: 5





[I 2024-03-01 17:35:51,242] Trial 2 finished with value: 300.0 and parameters: {'10762695104_predictor_instruction': 4, '10762695104_predictor_demos': 1, '10762694384_predictor_instruction': 0, '10762694384_predictor_demos': 0, '10762694624_predictor_instruction': 1, '10762694624_predictor_demos': 2}. Best is trial 1 with value: 300.0.


Average Metric: 15.0 / 5  (300.0%)




Parse the rating from a string.

---

Follow the following format.

Raw Rating Response: The string that contains the rating in it.
Rating: An integer valued rating.

---

Raw Rating Response: 5
Rating: 5
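
Each Optuna trial line above reports one integer per key, and the keys come in pairs that share a numeric prefix. A small sketch for regrouping those parameters per predictor is shown here. Reading each key as `<predictor id>_predictor_<instruction|demos>`, with the value being the index of the chosen candidate, is an assumption based on the key names; `group_by_predictor` is a hypothetical helper.

```python
from collections import defaultdict

# Parameters copied from the "Trial 1" log line above.
trial_1_params = {
    "10762695104_predictor_instruction": 0,
    "10762695104_predictor_demos": 4,
    "10762694384_predictor_instruction": 3,
    "10762694384_predictor_demos": 3,
    "10762694624_predictor_instruction": 0,
    "10762694624_predictor_demos": 4,
}


def group_by_predictor(params):
    """Group flat trial parameters into {predictor_id: {kind: index}}."""
    grouped = defaultdict(dict)
    for key, index in params.items():
        # e.g. "10762695104_predictor_demos" -> ("10762695104", "predictor", "demos")
        predictor_id, _, kind = key.split("_", 2)
        grouped[predictor_id][kind] = index
    return dict(grouped)


for predictor_id, choice in group_by_predictor(trial_1_params).items():
    print(predictor_id, choice)
```

Grouped this way, it is easier to see that trials 1 and 2 reach the same score of 300.0 with different instruction and demo choices for every predictor.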





In [37]:
evaluate(BSO_optimized_MultiHopRAGwithSummarizer, metric=MetricWrapper)

Average Metric: 15.0 / 5  (300.0%)


300.0
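
The `Average Metric: 15.0 / 5  (300.0%)` line reads as the summed metric score over the number of examples, followed by the mean expressed as a percentage. Because a single example can score above 1, the percentage can exceed 100%. A minimal sketch of that arithmetic follows; the per-example scores are hypothetical placeholders (only their sum and count match the output above), and `average_metric` is an illustrative helper, not part of DSPy.

```python
def average_metric(scores):
    """Return (sum of scores, number of examples, mean as a percentage)."""
    total = sum(scores)
    count = len(scores)
    return total, count, round(100 * total / count, 2)


# Hypothetical per-example scores that sum to 15.0 over 5 examples.
hypothetical_scores = [3.0, 3.0, 3.0, 3.0, 3.0]

total, count, percent = average_metric(hypothetical_scores)
print(f"Average Metric: {total} / {count}  ({percent}%)")
```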

In [15]:
BSO_optimized_MultiHopRAGwithSummarizer("What are cross encoders?")

Prediction(
    context=['[Cross Encoders](#cross-encoders) (collapsing the use of Large Language Models for ranking into this category as well)\n1. [Metadata Rankers](#metadata-rankers)\n1. [Score Rankers](#score-rankers)\n\n## Cross Encoders\nCross Encoders are one of the most well known ranking models for content-based re-ranking. There is quite a collection of pre-trained cross encoders available on [sentence transformers](https://www.sbert.net/docs/pretrained_cross-encoders.html). We are currently envisioning interfacing cross encoders with Weaviate using the following syntax.', 'Bi-Encoders are fast, but are not as accurate as the expensive fisherman aka the Cross-Encoder. Cross-Encoders are time-consuming, like the fisherman who would need to limit the number of fishing rounds they could do. So we can chain those two methods behind each other (see Figure 5). First, you use a Bi-Encoder to retrieve a *list of result candidates*, then you use a Cross-Encoder on this list of candid

In [42]:
mistral_ollama.inspect_history(n=1)





Assess the context and answer the given questions that are predominantly about software usage, process optimization, and troubleshooting. Focus on providing accurate information related to tech or software-related queries.

---

Follow the following format.

Context: Helpful information for answering the question.

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: A detailed answer that is supported by the context.

---

Context:
[1] «Similarly to the original Gorilla paper’s use of Abstract Syntax Tree evaluation, we are also considering an n-gram match where we construct keywords for each query such as “bm25”, “query”, “title” and check how many are contained in the generated query. We can also use the finer-grained perplexity metric that measures the log probability of the ground truth tokens at each step of decoding. We are currently using a simple greedy decoding algorithm to sample from the LoRA fine-tuned LlaMA 7B L

# Alright, we are finished optimizing. Evaluate on the held-out test set.

In [45]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=testset, num_threads=4, display_progress=False)

for program_key in programs.keys():
    print(f"Evaluating... {program_key}\n")
    # This cell scores the uncompiled programs as a baseline.
    uncompiled_score = evaluate(programs[program_key].uncompiled, metric=MetricWrapper)
    print(f"Uncompiled score for {program_key} = {uncompiled_score}.\n")

Evaluating... RAG

Average Metric: 47.199999999999996 / 15  (314.7%)
Uncompiled score for RAG = 314.67.

Evaluating... RAGwithSummarizer

Average Metric: 48.800000000000004 / 15  (325.3%)
Uncompiled score for RAGwithSummarizer = 325.33.

Evaluating... MultiHopRAG

Average Metric: 44.800000000000004 / 15  (298.7%)
Uncompiled score for MultiHopRAG = 298.67.

Evaluating... MultiHopRAGwithSummarizer

Average Metric: 47.400000000000006 / 15  (316.0%)
Uncompiled score for MultiHopRAGwithSummarizer = 316.0.



In [44]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=testset, num_threads=4, display_progress=False)

for program_key in programs.keys():
    print(f"Evaluating... {program_key}\n")
    # This cell scores the BootstrapFewShot-compiled programs.
    compiled_score = evaluate(programs[program_key].compiled, metric=MetricWrapper)
    print(f"Compiled score for {program_key} = {compiled_score}.\n")

Evaluating... RAG

Average Metric: 52.6 / 15  (350.7%)
Compiled score for RAG = 350.67.

Evaluating... RAGwithSummarizer

Average Metric: 43.8 / 15  (292.0%)
Compiled score for RAGwithSummarizer = 292.0.

Evaluating... MultiHopRAG

Average Metric: 45.400000000000006 / 15  (302.7%)
Compiled score for MultiHopRAG = 302.67.

Evaluating... MultiHopRAGwithSummarizer

Average Metric: 51.2 / 15  (341.3%)
Compiled score for MultiHopRAGwithSummarizer = 341.33.



In [46]:
evaluate(BSO_optimized_MultiHopRAGwithSummarizer, metric=MetricWrapper)

Average Metric: 45.2 / 15  (301.3%)


301.33
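
To compare the test-set numbers side by side, the sketch below tabulates the scores printed by the three evaluation cells above: the first cell runs the `.uncompiled` programs, the second the `.compiled` ones, and the third the BSO-optimized program. The scores are copied from that output; `score_deltas` is a hypothetical helper added for illustration.

```python
# (uncompiled, compiled) test-set scores copied from the evaluation output above.
test_scores = {
    "RAG": (314.67, 350.67),
    "RAGwithSummarizer": (325.33, 292.0),
    "MultiHopRAG": (298.67, 302.67),
    "MultiHopRAGwithSummarizer": (316.0, 341.33),
}
# Test-set score of the BSO-optimized MultiHopRAGwithSummarizer.
bso_score = 301.33


def score_deltas(scores):
    """Return compiled minus uncompiled score for each program."""
    return {
        name: round(compiled - uncompiled, 2)
        for name, (uncompiled, compiled) in scores.items()
    }


for name, delta in score_deltas(test_scores).items():
    print(f"{name:<27} {delta:+.2f}")

bso_delta = round(bso_score - test_scores["MultiHopRAGwithSummarizer"][0], 2)
print(f"{'BSO vs uncompiled':<27} {bso_delta:+.2f}")
```

On this 15-example test set, compilation helps three of the four programs and hurts RAGwithSummarizer, while the BSO-optimized program scores below its uncompiled counterpart. With so few examples these differences should be read as rough signals only.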

# Final LGTM tests

In [47]:
BSO_optimized_MultiHopRAGwithSummarizer("What is ref2vec?")

Prediction(
    context=['---\ntitle: What is Ref2Vec and why you need it for your recommendation system\nslug: ref2vec-centroid\nauthors: [connor]\ndate: 2022-11-23\ntags: [\'integrations\', \'concepts\']\nimage: ./img/hero.png\ndescription: "Weaviate introduces Ref2Vec, a new module that utilises Cross-References for Recommendation!"\n---\n![Ref2vec-centroid](./img/hero.png)\n\n<!-- truncate -->\n\nWeaviate 1.16 introduced the [Ref2Vec](/developers/weaviate/modules/retriever-vectorizer-modules/ref2vec-centroid) module. In this article, we give you an overview of what Ref2Vec is and some examples in which it can add value such as recommendations or representing long objects. ## What is Ref2Vec? The name Ref2Vec is short for reference-to-vector, and it offers the ability to vectorize a data object with its cross-references to other objects. The Ref2Vec module currently holds the name ref2vec-**centroid** because it uses the average, or centroid vector, of the cross-referenced vectors t

In [48]:
BSO_optimized_MultiHopRAGwithSummarizer("What version of Weaviate added re-ranking?")

Prediction(
    context=['An `alpha` of 0 is pure bm25 and an alpha of 1 is pure vector search. Therefore, the set `alpha` is dependent on your data and application. Another emerging development is the effectiveness of zero-shot re-ranking models. Weaviate currently offers 2 [re-ranking models from Cohere](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/reranker-cohere): `rerank-english-v2.0` and `rerank-multilingual-v2.0`. As evidenced from the name, these models mostly differ because of the training data used and the resulting multilingual capabilities.', '### rankedFusion\n\nThe `rankedFusion` algorithm is the original hybrid fusion algorithm that has been available since the launch of hybrid search in Weaviate. In this algorithm, each object is scored according to its position in the results for the given search, starting from the highest score for the top-ranked object and decreasing down the order. The total score is calculated by adding these rank-ba

# BSO Optimized `QueryWriter` Prompt

# BSO Optimized `Summarize` Prompt

# BSO Optimized `Answer` Prompt