# End-to-End Evaluation for RAG Pipeline without Recursive Document Agents

Let's evaluate this RAG pipeline for DevSecOps which was implemented without recursive document agents.

## Set up the query engine

### Install LlamaIndex and set up

In [91]:
!pip install llama_index==0.8.12
!pip install pypdf



In [92]:
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    Response
)
from llama_index.evaluation import (
    DatasetGenerator,
    QueryResponseEvaluator,
    ResponseEvaluator
)
from llama_index.llms import OpenAI
import pandas as pd
import openai
import os

In [93]:
openai.api_key = 'YOUR-API-KEY'

#define LLM
llm = OpenAI(temperature=0.1, model_name="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

In [35]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load documents, build index and query engine

In [94]:
# load documents
document_list = SimpleDirectoryReader("data").load_data()

# build vector index and query engine
vector_index = VectorStoreIndex.from_documents(document_list, service_context=service_context)
query_engine = vector_index.as_query_engine()

## End-to-End Evaluation

### Generate the dataset

In [96]:
import random
random.seed(42)
from llama_index.prompts import Prompt

gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0.1, llm="gpt-4"))

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", "r") as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(
        document_list,
        text_question_template=Prompt(
            "A sample from the documents is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Using the documentation sample, carefully follow the instructions below:\n"
            "{query_str}"
        ),
        question_gen_query=(
            "You are an evaluator for a search pipeline. Your task is to write a list of summarization "
            "questions or question/answer questions using the provided documents. Restrict the questions to the "
            "context information provided.\n"
            "Question: "
        ),
        # set this to be low, so we can generate more questions
        service_context=gpt4_service_context
    )
    generated_questions = data_generator.generate_questions_from_nodes()
    print(f"Generated {len(generated_questions)} questions.")

    # randomly pick 30 questions from each dataset
    generated_questions = random.sample(generated_questions, 30)
    question_dataset.extend(generated_questions)

    print(f"Randomly picked {len(question_dataset)} questions.")

    # save the questions!
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

### Print the questions

In [97]:
for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

1. What is the high-level design of DevOps pipelines?
2. What is a recently introduced feature in Infracost Cloud?
3. What is the purpose of Infracost in cloud cost management?
4. Why is it important to include TruffleHog in your pipelines?
5. How can you fix the vulnerability in the base image according to the provided instructions?
6. What is the purpose of the aquasecurity/trivy-action in the GitHub Actions CI workflow?
7. What are the optional parameters that can be used with the Checkov action?
8. How can Infracost be integrated into the infrastructure pipeline?
9. How are application pipelines triggered?
10. What is the topic of the second part in the series?
11. What command is used to generate the Infracost report in HTML format?
12. How does Terraform enable the creation of reusable infrastructure?
13. How can the GitHub Actions workflow be configured to dynamically select the backend configuration file based on the environment?
14. What is the diff feature in Infracost and ho

In [98]:
# define jupyter display function
def display_eval_df(query: str, response: Response, eval_result: str) -> None:
    eval_df = pd.DataFrame(
        {
            "Query": query,
            "Response": str(response),
            "Source": response.get_formatted_sources(500) + "...",
            "Evaluation Result": eval_result,
        },
        index=[0],
    )
    eval_df = eval_df.style.set_properties(
        **{
            "inline-size": "600px",
            "overflow-wrap": "break-word",
        },
        subset=["Response", "Source"]
    )
    display(eval_df)

### Evaluating Response, test with one question first


In [118]:
evaluator = ResponseEvaluator(service_context=gpt4_service_context)
response_vector = query_engine.query(question_dataset[0])
eval_result = evaluator.evaluate(response_vector)

pd.set_option("display.max_colwidth", 0)
display_eval_df(question_dataset[0], response_vector, eval_result)

Unnamed: 0,Query,Response,Source,Evaluation Result
0,What is the high-level design of DevOps pipelines?,"The high-level design of DevOps pipelines involves two types of pipelines: infrastructure pipelines and application pipelines. The infrastructure pipeline is responsible for provisioning the infrastructure using Terraform, which is an open-source tool for building infrastructure as code. The Terraform GitHub Actions workflow is used to automate the creation of GitHub secrets after successful infrastructure provisioning. These secrets are then used by the application pipelines to kick off CI/CD for the specified GitHub environment. On the other hand, the application pipelines are developed using GitHub Actions and can vary depending on the nature of the applications.","> Source (Doc id: e74193a1-280c-4977-b625-e2f79a15b38c): • How do we tie infrastructure pipelines with application pipelines to make them work together seamlessly? We need a glue to integrate these two types of pipelines. And this glue is GitHu b secrets creation automation. Upon successful infrastructure provisioning, we can use Terraform to automate GitHub secrets creation by calling the GitHub provider. Notice the double -ended arrows for the infrastructure pipeline in the diagram above, as the Terraform outputs for the secrets get insert... > Source (Doc id: 23c321e3-2a5c-4032-a317-b6e3b034f2ac): diagram by author Note : The diagram does not depict alternative flows, such as for terraform destroy . You can always add alternative flows per your pipeline requirements. Application Pipelines Our application pipelines are developed using GitHub Actions. Below is a high -level overview of the two typical pipelines (CI and CD for microservices). These are mere examples. Your workflows could contain different steps depending on the nature of your applications. Microservice CI Git......",YES


### Evaluating Response for Hallucination



In [100]:
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()

def evaluate_query_engine(evaluator, query_engine, questions):
    async def run_query(query_engine, q):
        try:
            return await query_engine.aquery(q)
        except:
            return Response(response="Error, query failed.")

    total_correct = 0
    all_results = []
    for batch_size in range(0, len(questions), 5):
        batch_qs = questions[batch_size:batch_size+5]

        tasks = [run_query(query_engine, q) for q in batch_qs]
        responses = asyncio.run(asyncio.gather(*tasks))
        print(f"finished batch {(batch_size // 5) + 1} out of {len(questions) // 5}")

        for response in responses:
            eval_result = 1 if "YES" in evaluator.evaluate(response) else 0
            total_correct += eval_result
            all_results.append(eval_result)

        # helps avoid rate limits
        time.sleep(1)

    return total_correct, all_results

In [101]:
total_correct, all_results = evaluate_query_engine(evaluator, query_engine, question_dataset)

print(f"Hallucination? Scored {total_correct} out of {len(question_dataset)} questions correctly.")

finished batch 1 out of 6
finished batch 2 out of 6
finished batch 3 out of 6
finished batch 4 out of 6
finished batch 5 out of 6
finished batch 6 out of 6
Hallucination? Scored 27 out of 30 questions correctly.


### Find out the hallucinated questions and investigate why

In [102]:
import numpy as np

hallucinated_questions = np.array(question_dataset)[np.array(all_results) == 0]
print(hallucinated_questions)

['What is the purpose of uploading the report to an artifact?'
 'What severity levels does Trivy consider for vulnerabilities?'
 'What is the topic of the document?']


In [105]:
response = query_engine.query('What is the topic of the document?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=1000))

The topic of the document is "DevOps Self-Service Centric Terraform Project Structure".
-----------------
> Source (Doc id: 56053b40-728f-4f95-a1a6-d13cc764a0ee): image by author

> Source (Doc id: b13abc3b-9fd7-4670-9320-572d47f211bd): image by author


### Evaluating Response for Answer Quality

In [106]:
import time
import asyncio
import nest_asyncio
nest_asyncio.apply()
from llama_index import Response

def evaluate_query_engine(evaluator, query_engine, questions):
    async def run_query(query_engine, q):
        try:
            return await query_engine.aquery(q)
        except:
            return Response(response="Error, query failed.")

    total_correct = 0
    all_results = []
    for batch_size in range(0, len(questions), 5):
        batch_qs = questions[batch_size:batch_size+5]

        tasks = [run_query(query_engine, q) for q in batch_qs]
        responses = asyncio.run(asyncio.gather(*tasks))
        print(f"finished batch {(batch_size // 5) + 1} out of {len(questions) // 5}")

        for question, response in zip(batch_qs, responses):
            eval_result = 1 if "YES" in evaluator.evaluate(question, response) else 0
            total_correct += eval_result
            all_results.append(eval_result)

        # helps avoid rate limits
        time.sleep(1)

    return total_correct, all_results

In [107]:
evaluator = QueryResponseEvaluator(service_context=gpt4_service_context)

total_correct, all_results = evaluate_query_engine(evaluator, query_engine, question_dataset)

print(f"Response satisfies the query? Scored {total_correct} out of {len(question_dataset)} questions correctly.")

finished batch 1 out of 6
finished batch 2 out of 6
finished batch 3 out of 6
finished batch 4 out of 6
finished batch 5 out of 6
finished batch 6 out of 6
Response satisfies the query? Scored 23 out of 30 questions correctly.


### Find out unanswered queries and investigate why

In [108]:
import numpy as np

unanswered_queries = np.array(question_dataset)[np.array(all_results) == 0]
print(unanswered_queries)

['What is a recently introduced feature in Infracost Cloud?'
 'What is the topic of the second part in the series?'
 'What command is used to generate the Infracost report in HTML format?'
 'What is the diff feature in Infracost and how does it serve as a guardrail for cloud cost management?'
 'What is the purpose of uploading the report to an artifact?'
 'Can you provide a link to a website that provides information on creating Terraform modules?'
 'What is the intended audience for these documents?']


In [116]:
response = query_engine.query('What is the topic of the second part in the series?')
print(str(response))
print("-----------------")
print(response.get_formatted_sources(length=256))

The topic of the second part in the series is "DevOps Self-Service Centric Pipeline Security and Guardrails."
-----------------
> Source (Doc id: 6d5a6daf-46ce-4035-be17-e7421a77f581): DevOps Self -Service Centric GitHub Actions’ Workflow Orchestration  
How to orchestrate GitHub Actions’ workflows that a re driven by image immutability  
betterprogramming.pub  
 
 
DevOps Self -Service Centric Pipeline Security and Guardrails  
A lis...

> Source (Doc id: 892a4b0f-1894-473d-9801-8ecc13db644d): popularity of the Terraform tool and IaC practices to automate deployments in the cloud 
increasingly.  
 
Image source:  The top programming languages | The State of the Octoverse 
(github.com)
