<a href="https://colab.research.google.com/github/wenqiglantz/edd-recursive-doc-agent-vs-metadata-replacement/blob/main/edd_zephyr_7b_gpt3_5_metadata_replacement_multi_doc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation Driven Development for Multi Document RAG Pipeline with GPT-3.5 and Zephyr-7b

This notebook demonstrates how to use EDD to decide which of the two LLMs perform best for a multi document RAG pipeline for Metadata replacement + node sentence window:


*   gpt-3.5-turbo
*   zephyr-7b-alpha

Suggest to upgrade to Colab Pro to run on T4 high-RAM. I tried to run on the free tier T4 GPU but failed during the download of Zephyr-7b.


In [1]:
!pip install llama_index==0.8.45.post1 pypdf sentence-transformers transformers accelerate bitsandbytes



### Load documents

In [2]:
from llama_index import SimpleDirectoryReader

titles = [
    "DevOps_Self-Service_Pipeline_Architecture",
    "DevOps_Self-Service_Terraform_Project_Structure",
    "DevOps_Self-Service_Pipeline_Security_Guardrails"
    ]

documents = {}
for title in titles:
    documents[title] = SimpleDirectoryReader(input_files=[f"./data/{title}.pdf"]).load_data()
print(f"loaded documents with {len(documents)} documents")

loaded documents with 3 documents


### Set up node parser, service context

In [3]:
from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding, HuggingFaceEmbedding
from llama_index.node_parser import SentenceWindowNodeParser, SimpleNodeParser

# create the sentence window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
simple_node_parser = SimpleNodeParser.from_defaults()

## on gpt-3.5-turbo

### Extract nodes and build index

In [4]:
import os, openai, logging, sys

os.environ["OPENAI_API_KEY"] = "sk-##############"
openai.api_key = os.environ["OPENAI_API_KEY"]

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

In [5]:
#define LLM and embedding model
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
ctx = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-base-en-v1.5"
)

from llama_index import VectorStoreIndex

# extract nodes and build index
document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index = VectorStoreIndex(nodes, service_context=ctx)

### Define query engine

In [6]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

### Run test queries

In [7]:
response = metadata_query_engine.query("Give me a summary of DevOps self-service-centric pipeline security and guardrails.")
print(str(response))

DevOps self-service-centric pipeline security and guardrails involve implementing a set of actions to ensure the security of pipelines, infrastructure, source code, base images, and dependent libraries. These actions are hand-picked and aim to provide security scans and guardrails for various components of the DevOps process. The goal is to establish a secure and reliable self-service environment for DevOps practices.


In [8]:
response = metadata_query_engine.query("What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?")
print(str(response))

Harden-Runner is a purpose-built security monitoring agent that is used in all pipelines, including infrastructure and application pipelines for both CI and CD workflows. It automatically discovers and correlates outbound traffic with each step in the pipeline to detect and prevent malicious patterns. Its main purpose is to prevent the exfiltration of credentials in the pipeline.


## zephyr-7b

In [9]:
import logging

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

In [10]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


llm_zephyr = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:

from llama_index import ServiceContext

service_context_zephyr = ServiceContext.from_defaults(
    llm=llm_zephyr,
    embed_model="local:BAAI/bge-base-en-v1.5"
)

### Extract nodes and build index

In [12]:
from llama_index import VectorStoreIndex

document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index_zephyr = VectorStoreIndex(nodes, service_context=service_context_zephyr)

### Define query engine

In [13]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine_zephyr = sentence_index_zephyr.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

### Run test queries

In [14]:
query = "Give me a summary of DevOps self-service-centric pipeline security and guardrails."
response = metadata_query_engine_zephyr.query(query)
print(str(response))



The article discusses DevOps self-service-centric pipeline security and guardrails, providing a list of hand-picked actions for security scans and guardrails for pipelines, infrastructure, source code, base images, and dependent libraries. The author acknowledges that coming from a traditional DevOps mindset, security measures and guardrails may be a concern. The article is part of a series that explores DevOps self-service pipeline architecture, Terraform project structure, and GitHub Actions workflow orchestration.


In [15]:
query = "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
response = metadata_query_engine_zephyr.query(query)
print(str(response))

Harden Runner is a purpose-built security monitoring agent for pipelines that automatically discovers and correlates outbound traffic with each step in the pipeline to detect and prevent malicious patterns observed during past software supply chain security breaches. It is used in all pipelines, including infrastructure and application pipelines for CI and CD, and is the only action used in all pipelines due to its unique nature and purpose. Its main features include preventing exfiltration of credentials in the pipeline.


## Evaluations

### Generate evaluation questions

In [16]:
import random
random.seed(42)
from llama_index.evaluation import DatasetGenerator
import nest_asyncio

nest_asyncio.apply()

# load data
document_list = SimpleDirectoryReader("data").load_data()

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", encoding='utf-8') as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(document_list)
    generated_questions = data_generator.generate_questions_from_nodes()
    print(f"Generated {len(generated_questions)} questions.")

    # randomly pick 30 questions
    generated_questions = random.sample(generated_questions, 30)
    question_dataset.extend(generated_questions)
    print(f"Randomly picked {len(question_dataset)} questions.")

    # save the questions into a txt file
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

1. What is the high-level design of DevOps pipelines?
2. What is a recently introduced feature in Infracost Cloud?
3. What is the purpose of Infracost in cloud cost management?
4. Why is it important to include TruffleHog in your pipelines?
5. How can you fix the vulnerability in the base image according to the provided instructions?
6. What is the purpose of the aquasecurity/trivy-action in the GitHub Actions CI workflow?
7. What are the optional parameters that can be used with the Checkov action?
8. How can Infracost be integrated into the infrastructure pipeline?
9. How are application pipelines triggered?
10. Give me a summary of DevOps Self-Service Pipeline Architecture and Its 3–2–1 Rule.
11. What command is used to generate the Infracost report in HTML format?
12. How does Terraform enable the creation of reusable infrastructure?
13. How can the GitHub Actions workflow be configured to dynamically select the backend configuration file based on the environment?
14. What is the d

### Define evaluators

In [17]:
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# use gpt-4 to evaluate
gpt4_service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0.1, llm="gpt-4"))

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=gpt4_service_context)
relevancy_gpt4 = RelevancyEvaluator(service_context=gpt4_service_context)

### Define evaluation batch runner

In [18]:
from llama_index.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=10,
    show_progress=True
)

In [19]:
def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Correct: {correct}. Score: {score}")
    return score

### Evaluation on gpt-3.5-turbo

In [20]:
eval_results = await runner.aevaluate_queries(
    metadata_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:05<00:00,  5.08it/s]
100%|██████████| 60/60 [00:02<00:00, 22.74it/s]

------------------
faithfulness Correct: 28. Score: 0.9333333333333333
relevancy Correct: 27. Score: 0.9





### Evaluation on zephyr-7b

In [21]:
eval_results = await runner.aevaluate_queries(
    metadata_query_engine_zephyr, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [04:32<00:00,  9.07s/it]  
100%|██████████| 60/60 [00:02<00:00, 20.01it/s]

------------------
faithfulness Correct: 28. Score: 0.9333333333333333
relevancy Correct: 29. Score: 0.9666666666666667



