<a href="https://colab.research.google.com/github/wenqiglantz/edd-recursive-doc-agent-vs-metadata-replacement/blob/main/edd_recursive_doc_agent_metadata_replacement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation Driven Development for Multi Document RAG Pipeline

This notebook demonstrates how to use Evaluation Driven Development (EDD) to decide which of these two strategies performs better for a multi-document RAG pipeline:


*   Recursive retriever + document agent
*   Metadata replacement + node sentence window



In [4]:
!pip install llama_index==0.8.41 pypdf sentence-transformers

Collecting llama_index==0.8.41
  Downloading llama_index-0.8.41-py3-none-any.whl (868 kB)
Installing collected packages: llama_index
  Attempting uninstall: llama_index
    Found existing installation: llama-index 0.8.40
    Uninstalling llama-index-0.8.40:
      Successfully uninstalled llama-index-0.8.40
Successfully installed llama_index-0.8.41


In [5]:
import os, openai, logging, sys

os.environ["OPENAI_API_KEY"] = "sk-################################"
openai.api_key = os.environ["OPENAI_API_KEY"]

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

## Common Tasks

### Load documents

In [6]:
from llama_index import SimpleDirectoryReader

titles = [
    "DevOps_Self-Service_Pipeline_Architecture",
    "DevOps_Self-Service_Terraform_Project_Structure",
    "DevOps_Self-Service_Pipeline_Security_Guardrails"
    ]

documents = {}
for title in titles:
    documents[title] = SimpleDirectoryReader(input_files=[f"./data/{title}.pdf"]).load_data()
print(f"loaded documents with {len(documents)} documents")

loaded documents with 3 documents


## Recursive retriever + document agent

In [7]:
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    ServiceContext,
    Response
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer
from llama_index.agent import OpenAIAgent
import pandas as pd


In [8]:
#define LLM
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Create document agents

In [9]:
# Build agents dictionary
agents = {}

for title in titles:

    # build vector index
    vector_index = VectorStoreIndex.from_documents(documents[title], service_context=service_context)

    # build summary index
    list_index = SummaryIndex.from_documents(documents[title], service_context=service_context)

    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = list_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=f"Useful for retrieving specific context related to {title}",
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=f"Useful for summarization questions related to {title}",
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=False,
    )

    agents[title] = agent

### Create index nodes

In [10]:
# define index nodes that link to the document agents
nodes = []
for title in titles:
    doc_summary = (
        f"This content contains details about {title}. "
        f"Use this index if you need to lookup specific facts about {title}.\n"
        "Do not use this index if you want to query multiple documents."
    )
    node = IndexNode(text=doc_summary, index_id=title)
    nodes.append(node)

# define retriever
vector_index = VectorStoreIndex(nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

### Define recursive retriever and query engine

In [11]:
# define recursive retriever
# note: can pass `agents` dict as `query_engine_dict` since every agent can be used as a query engine
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=agents,
    verbose=False,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

# define query engine
recursive_query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    response_synthesizer=response_synthesizer,
    service_context=service_context,
)
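
The routing idea behind the cells above can be sketched without any library: a top-1 lookup over one summary node per document, followed by a hand-off to whatever that node links to. This is only a toy under stated assumptions. `toy_similarity` (word overlap standing in for embedding similarity), `toy_nodes`, `toy_agents`, and `route_query` are hypothetical names, and the real `RecursiveRetriever` and `OpenAIAgent` do considerably more than this.

```python
# Toy sketch of recursive retrieval: pick the best-matching summary node,
# then let the object it links to answer the query.
# All names here are hypothetical and not part of llama_index.

def toy_similarity(query: str, text: str) -> float:
    """Word-overlap score standing in for embedding similarity."""
    q = set(query.lower().split())
    t = set(text.lower().replace("_", " ").split())
    return len(q & t) / max(len(q), 1)

toy_titles = [
    "Pipeline_Architecture",
    "Terraform_Project_Structure",
    "Pipeline_Security_Guardrails",
]

# one summary node per document, keyed by the id of the linked "agent"
toy_nodes = {t: f"This content contains details about {t}" for t in toy_titles}

# stand-ins for the per-document agents: plain callables
toy_agents = {t: (lambda q, t=t: f"[{t}] answer to: {q}") for t in toy_titles}

def route_query(query: str) -> str:
    # step 1: top-1 retrieval over the summary nodes
    best_id = max(toy_nodes, key=lambda node_id: toy_similarity(query, toy_nodes[node_id]))
    # step 2: follow the link and let the linked agent answer
    return toy_agents[best_id](query)

routed = route_query("How is the terraform project structure organized?")
print(routed)
```

Because `similarity_top_k=1` is used for the summary nodes in this notebook, each query reaches exactly one document agent, which is consistent with the "do not use this index if you want to query multiple documents" hint in the node text.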

### Run test queries

In [12]:
response = recursive_query_engine.query("Give me a summary of DevOps self-service-centric pipeline security and guardrails.")
print(str(response))

DevOps self-service-centric pipeline security and guardrails involve implementing security measures and guardrails to ensure the security of pipelines. One tool that can assist with this is Trivy, an open source security scanner that scans container images for known vulnerabilities. By integrating Trivy into your CI/CD workflow, you can quickly identify and address security risks. Additionally, using the `--soft-fail` flag in your GitHub Actions workflow allows the workflow to continue even if vulnerabilities are found, providing flexibility and efficiency in the development process. It's important to note that there is a known issue with TFSec not working well with pinned Terraform reusable modules, but the TFSec team is actively working on a fix for this issue. For more information on Trivy and how to use it, you can refer to the documentation provided by the aquasecurity/trivy-action repository.


In [13]:
response = recursive_query_engine.query("What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?")
print(str(response))

Harden Runner in DevOps self-service-centric pipeline security and guardrails refers to the process of securing the runner environment used in CI/CD pipelines. This involves implementing security measures such as access controls, security patches, secure communication protocols, and authentication and authorization mechanisms. By hardening the runner environment, organizations can ensure that the execution of CI/CD pipelines is done securely, reducing the risk of unauthorized access and security incidents.


## Metadata Replacement + Node Sentence Window

### Set up node parser, service context

In [14]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding
from llama_index.node_parser import SentenceWindowNodeParser, SimpleNodeParser

# create the sentence window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
simple_node_parser = SimpleNodeParser.from_defaults()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)
ctx = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model
)

### Extract nodes and build index

In [15]:
from llama_index import VectorStoreIndex

document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index = VectorStoreIndex(nodes, service_context=ctx)

### Define query engine

In [16]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
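
To make the two moving parts of this strategy concrete, here is a minimal pure-Python sketch. It assumes that each sentence becomes its own record, that the window covers up to `window_size` sentences on each side, and that the post-processing step simply swaps the matched sentence for its stored window. `build_sentence_windows` and `replace_with_window` are hypothetical helpers, not llama_index functions, and the regex sentence splitter is a simplification.

```python
import re

def build_sentence_windows(text: str, window_size: int = 3):
    """Split text into sentences and attach a window of neighbours to each one."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        records.append({
            "text": sentence,  # what would be embedded and matched
            "metadata": {
                "window": " ".join(sentences[lo:hi]),  # what the LLM would see
                "original_text": sentence,
            },
        })
    return records

def replace_with_window(record, target_metadata_key="window"):
    """Return the stored window if present, otherwise the sentence itself."""
    return record["metadata"].get(target_metadata_key, record["text"])

sample = "A one. B two. C three. D four. E five."
records = build_sentence_windows(sample, window_size=1)
hit = records[2]                      # pretend retrieval matched "C three."
context = replace_with_window(hit)    # neighbours are restored for synthesis
print(hit["text"], "->", context)
```

The design trade-off is that matching happens on a single sentence (fine-grained), while answer synthesis sees the wider window (more context).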

### Run test queries

In [17]:
query = "Give me a summary of DevOps self-service-centric pipeline security and guardrails."
response = metadata_query_engine.query(query)
print(str(response))

DevOps self-service-centric pipeline security and guardrails involve implementing a list of hand-picked actions to ensure the security and compliance of pipelines, infrastructure, source code, base images, and dependent libraries. These actions are implemented in reusable workflows for both infrastructure and application pipelines, and developers are expected to adhere to them when developing workflows for their applications. The goal is to provide a self-service environment where developers can confidently build and deploy their applications while maintaining the necessary security measures.


In [18]:
query = "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
response = metadata_query_engine.query(query)
print(str(response))

Harden-Runner is a purpose-built security monitoring agent for pipelines in DevOps self-service-centric pipeline security and guardrails. It is designed to detect and prevent malicious patterns observed during past software supply chain security breaches. Some of its main features include automatically discovering and correlating outbound traffic with each step in the pipeline, preventing exfiltration of credentials, and detecting tampering of source code during the build.


## Evaluations

### Generate evaluation questions

In [19]:
import random
random.seed(42)
from llama_index.evaluation import DatasetGenerator
import nest_asyncio

nest_asyncio.apply()

# load data
document_list = SimpleDirectoryReader("data").load_data()

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", "r") as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(document_list)
    generated_questions = data_generator.generate_questions_from_nodes()
    print(f"Generated {len(generated_questions)} questions.")

    # randomly pick up to 30 questions (random.sample raises ValueError if fewer are available)
    generated_questions = random.sample(generated_questions, min(30, len(generated_questions)))
    question_dataset.extend(generated_questions)
    print(f"Randomly picked {len(question_dataset)} questions.")

    # save the questions into a txt file
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

1. What is the high-level design of DevOps pipelines?
2. What is a recently introduced feature in Infracost Cloud?
3. What is the purpose of Infracost in cloud cost management?
4. Why is it important to include TruffleHog in your pipelines?
5. How can you fix the vulnerability in the base image according to the provided instructions?
6. What is the purpose of the aquasecurity/trivy-action in the GitHub Actions CI workflow?
7. What are the optional parameters that can be used with the Checkov action?
8. How can Infracost be integrated into the infrastructure pipeline?
9. How are application pipelines triggered?
10. Give me a summary of DevOps Self-Service Pipeline Architecture and Its 3–2–1 Rule.
11. What command is used to generate the Infracost report in HTML format?
12. How does Terraform enable the creation of reusable infrastructure?
13. How can the GitHub Actions workflow be configured to dynamically select the backend configuration file based on the environment?
14. What is the d

### Define evaluators

In [20]:
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# use gpt-4 to evaluate; the model is selected with the `model` argument
gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0.1, model="gpt-4"))

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=gpt4_service_context)
relevancy_gpt4 = RelevancyEvaluator(service_context=gpt4_service_context)

### Define evaluation batch runner

In [21]:
from llama_index.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=10,
    show_progress=True
)

In [22]:
def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Correct: {correct}. Score: {score}")
    return score
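
The score computed by `get_eval_results` is a plain pass rate: the fraction of results whose `passing` flag is true. The sketch below reproduces that arithmetic on made-up data so it can be run without an API key. `ToyResult`, `pass_rate`, and the sample values are hypothetical stand-ins rather than llama_index objects, and the empty-list guard is an addition that the helper above does not have.

```python
from dataclasses import dataclass

@dataclass
class ToyResult:
    """Hypothetical stand-in for an evaluation result with a `passing` flag."""
    passing: bool

def pass_rate(results):
    """Fraction of results that passed; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.passing) / len(results)

# made-up results, for illustration only
toy_eval_results = {
    "faithfulness": [ToyResult(True), ToyResult(True), ToyResult(False), ToyResult(True)],
    "relevancy": [ToyResult(True), ToyResult(False), ToyResult(False), ToyResult(True)],
}

toy_scores = {key: pass_rate(results) for key, results in toy_eval_results.items()}
print(toy_scores)
```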

### Evaluation of recursive retriever + document agent

In [23]:
eval_results = await runner.aevaluate_queries(
    recursive_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [03:53<00:00,  7.77s/it]
100%|██████████| 60/60 [00:02<00:00, 20.01it/s]

------------------
faithfulness Correct: 30. Score: 1.0
relevancy Correct: 29. Score: 0.9666666666666667





### Evaluation of metadata replacement + node sentence window

In [24]:
eval_results = await runner.aevaluate_queries(
    metadata_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:08<00:00,  3.49it/s]
100%|██████████| 60/60 [00:02<00:00, 20.76it/s]

------------------
faithfulness Correct: 24. Score: 0.8
relevancy Correct: 26. Score: 0.8666666666666667
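
The two runs can be put side by side with a few lines of arithmetic. The pass counts below are copied from the outputs above (30 questions per run). Ranking the strategies by the mean of the two scores is an assumption made for this sketch, not something the notebook prescribes, and it ignores cost and latency: the progress bars above show roughly 3 min 53 s for the first strategy versus about 8 s for the second.

```python
# Pass counts taken from the evaluation outputs above (30 questions each).
total_questions = 30
pass_counts = {
    "recursive retriever + document agent": {"faithfulness": 30, "relevancy": 29},
    "metadata replacement + node sentence window": {"faithfulness": 24, "relevancy": 26},
}

# convert counts to pass rates
scores = {
    name: {metric: count / total_questions for metric, count in counts.items()}
    for name, counts in pass_counts.items()
}

for name, metrics in scores.items():
    summary = ", ".join(f"{metric}={value:.3f}" for metric, value in metrics.items())
    print(f"{name}: {summary}")

# assumption: rank by the unweighted mean of the two metrics
best = max(scores, key=lambda name: sum(scores[name].values()) / len(scores[name]))
print(f"Higher mean score: {best}")
```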



