This notebook demonstrates a simple synchronous RAG smoke test:

- It loads the ground-truth questions from sample_gt.csv into gt_dict and keeps the first two records as test_dict, giving you a short list of prompts to exercise the pipeline.

- `ProcessChunks` is instantiated with default_config, then get_chunks, embed_chunks, and index_chunks run sequentially over a few PDF pages (start_page=1 to end_page=3) to build the in-memory VectorSearch. The tqdm progress bar tracks this blocking work, so you can see embedding progress even in a synchronous run.

- The test loop is a plain for loop over test_dict wrapped with tqdm. Each iteration calls run_rag(question) synchronously, waits for the OpenAI response, and appends the resulting RAGResult to all_results. Because the calls run serially, the bar advances after every completed call, giving you immediate feedback on per-query latency.

- Finally, it converts all_results into a pandas DataFrame to inspect question, answer, retrieved context, and the token/cost metrics populated inside RAGResult. That table is the quick health check confirming the synchronous pipeline works end-to-end before trying any async batching or larger evaluations.

In [1]:
import pandas as pd

gt_sample = pd.read_csv("sample_gt.csv")
gt_dict = gt_sample.to_dict(orient="records")
gt_dict[2]

{'question': 'performance differences in model APIs',
 'summary_answer': 'The excerpt mentions that the same AI model may perform differently across various APIs due to optimization techniques used, which necessitates thorough testing when switching APIs.',
 'difficulty': 'intermediate',
 'text': 'After developing a model, a developer can choose to open source it, make it\naccessible via an API, or both. Many model developers are also model\nservice providers. Cohere and Mistral open source some models and\nprovide APIs for some. OpenAI is typically known for their commercial\nmodels, but they’ve also open sourced models (GPT-2, CLIP). Typically,\nmodel providers open source weaker models and keep their best models\nbehind paywalls, either via APIs or to power their products.\nModel APIs can be available through model providers (such as OpenAI and\nAnthropic), cloud service providers (such as Azure and GCP [Google Cloud\nPlatform]), or third-party API providers (such as Databricks Mosa

In [2]:
test_dict = gt_dict[:2]
test_dict

[{'question': 'importance of reliable benchmarks in AI',
  'summary_answer': 'It emphasizes that many benchmarks may not accurately measure the intended metrics, stressing the need for careful evaluation.',
  'difficulty': 'beginner',
  'text': 'writing benchmarks. As new benchmarks are constantly introduced and old\nbenchmarks become saturated, you should look for the latest benchmarks.\nMake sure to evaluate how reliable a benchmark is. Because anyone can\ncreate and publish a benchmark, many benchmarks might not be measuring\nwhat you expect them to measure.',
  'id': 'a762b76e'},
 {'question': 'what is evaluation-driven development in ai?',
  'summary_answer': 'Evaluation-driven development refers to the approach of establishing evaluation criteria before building an AI application, similar to how test-driven development focuses on writing tests before code.',
  'difficulty': 'beginner',
  'text': 'Before investing time, money, and resources into building an application,\nit’s impo

In [3]:
from rag import ProcessChunks, default_config

pc = ProcessChunks(config=default_config)
chunks = pc.get_chunks(300, 250, start_page=1, end_page=3)
embeddings = pc.embed_chunks(chunks)
vector_index = pc.index_chunks(embeddings, chunks)

100%|██████████| 11/11 [00:01<00:00,  8.35it/s]
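
The two positional arguments to get_chunks are not documented in this notebook. If they are a window size and a step in characters (an assumption, although the `start` offsets of 750 and 2000 in the final results table are both multiples of 250), overlapping chunks could be produced roughly as follows. `sliding_window` is a hypothetical helper for illustration, not part of the rag module.

```python
def sliding_window(text: str, size: int, step: int) -> list[dict]:
    """Split text into overlapping chunks of `size` chars, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start": start, "text": text[start:start + size]})
        # Stop once a window reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks


# Toy input: 800 characters, window of 300, step of 250 (50 chars of overlap).
sample = "x" * 800
chunks = sliding_window(sample, size=300, step=250)
starts = [c["start"] for c in chunks]
```

With a step smaller than the window, consecutive chunks share `size - step` characters, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.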


In [4]:
# # test
# test_question = 'importance of reliable benchmarks in AI'

# from rag import search
# import json

# result = search(user_query=test_question)
# print(json.dumps(result, indent=2))
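
The commented-out cell above queries the index built in the previous cell. How the rag module ranks chunks is not shown here; one common approach for a small in-memory index is cosine similarity over the stored embeddings. The sketch below uses toy 3-dimensional vectors and hypothetical names (`cosine`, `top_k`, `toy_index`), so read it as an illustration of the idea rather than the module's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[dict], k: int = 2) -> list[dict]:
    """Return the k entries whose embeddings are most similar to query_vec."""
    scored = [(cosine(query_vec, item["embedding"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


# Toy index: each entry pairs a chunk with a made-up embedding.
toy_index = [
    {"start": 0, "text": "benchmarks", "embedding": [1.0, 0.0, 0.0]},
    {"start": 250, "text": "evaluation", "embedding": [0.0, 1.0, 0.0]},
    {"start": 500, "text": "apis", "embedding": [0.7, 0.7, 0.0]},
]
hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
hit_starts = [h["start"] for h in hits]
```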

In [5]:
from rag import run_rag
from tqdm import tqdm

all_results = []
for q in tqdm(test_dict, total=len(test_dict)):
    result = run_rag(q['question'])
    all_results.append(result)
all_results

100%|██████████| 2/2 [00:11<00:00,  5.70s/it]


[RAGResult(question='importance of reliable benchmarks in AI', answer="Reliable benchmarks in AI are crucial for several reasons:\n\n1. **Model Selection**: With an overwhelming number of foundation models available, benchmarks help in evaluating and comparing these models based on different criteria relevant to specific applications. This aids teams in making informed choices regarding which models to deploy.\n\n2. **Evaluation of Performance**: Trustworthy benchmarks enable developers to assess how well a model performs on tasks critical to its intended uses, such as factual consistency, domain-specific capabilities (like math and reasoning), and overall effectiveness in real-world applications.\n\n3. **Visibility and Evaluation**: Many AI applications struggle with uncertain returns on investment, often due to a lack of clarity around how these applications are performing. Reliable benchmarks provide a means for continuous evaluation, ensuring that deployed models can be monitored e
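
The tqdm bar reports only an average time per iteration (5.70s/it above). To keep a latency figure for each question, the call can be timed explicitly. In this sketch `fake_run_rag` is a stand-in so the block runs without the rag module or an API key; in the notebook it would be replaced by `run_rag`. The `timed_calls` helper is likewise hypothetical.

```python
import time


def fake_run_rag(question: str) -> str:
    """Stand-in for run_rag: sleeps briefly instead of calling a model."""
    time.sleep(0.01)
    return f"answer to: {question}"


def timed_calls(questions, fn):
    """Call fn on each question serially and record wall-clock seconds per call."""
    rows = []
    for question in questions:
        started = time.perf_counter()
        answer = fn(question)
        elapsed = time.perf_counter() - started
        rows.append({"question": question, "answer": answer, "seconds": elapsed})
    return rows


rows = timed_calls(["q1", "q2"], fake_run_rag)
```

Because the loop is serial, the sum of the `seconds` values approximates the total wall-clock time of the run, which is a useful baseline before comparing against an async version.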

In [6]:
pd.DataFrame(all_results)

| | question | answer | context | input_tokens | output_tokens | input_cost | output_cost | total_cost |
|---|---|---|---|---|---|---|---|---|
| 0 | importance of reliable benchmarks in AI | Reliable benchmarks in AI are crucial for seve... | [{'start': 750, 'text': 'of foundation models ... | 910 | 282 | 0.0001365 | 0.0001692 | 0.0003057 |
| 1 | what is evaluation-driven development in ai? | **Evaluation-driven development** in AI is an ... | [{'start': 2000, 'text': 'st even more. AI app... | 911 | 193 | 0.00013665 | 0.0001158 | 0.00025245 |
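
The cost columns in the table above are consistent with a flat per-token rate: 910 input tokens at 0.15 USD per million tokens gives 0.0001365, and 282 output tokens at 0.60 USD per million gives 0.0001692. These rates are inferred from the numbers in the table, not read from the rag module, so treat them as assumptions. The sketch below recomputes the three cost columns with a hypothetical `Usage` dataclass.

```python
from dataclasses import dataclass

# Assumed rates in USD per token, inferred from the table values.
INPUT_RATE = 0.15 / 1_000_000
OUTPUT_RATE = 0.60 / 1_000_000


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int

    @property
    def input_cost(self) -> float:
        return self.input_tokens * INPUT_RATE

    @property
    def output_cost(self) -> float:
        return self.output_tokens * OUTPUT_RATE

    @property
    def total_cost(self) -> float:
        return self.input_cost + self.output_cost


# Token counts copied from the two rows of the results table.
usages = [Usage(910, 282), Usage(911, 193)]
for usage in usages:
    print(f"{usage.input_cost:.8f} {usage.output_cost:.8f} {usage.total_cost:.8f}")

# Total spend for the whole smoke test.
run_total = sum(usage.total_cost for usage in usages)
```

Summing `total_cost` over all rows gives a quick estimate of what a larger evaluation would cost at the same average token counts.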
