<a href="https://colab.research.google.com/github/xprilion/gemini-as-a-judge-for-rag-evals/blob/main/Step_3_Perform_Eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemini As A Judge for RAG Evals

## Perform Evaluations

### Load the dataset

In [1]:
!wget https://raw.githubusercontent.com/xprilion/gemini-as-a-judge-for-rag-evals/refs/heads/main/qna_dataset.json

--2025-03-02 02:36:46--  https://raw.githubusercontent.com/xprilion/gemini-as-a-judge-for-rag-evals/refs/heads/main/qna_dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 974753 (952K) [text/plain]
Saving to: ‘qna_dataset.json’


2025-03-02 02:36:46 (42.7 MB/s) - ‘qna_dataset.json’ saved [974753/974753]



### Packages

In [2]:
%%capture
!pip install "qdrant-client[fastembed]"
!pip install google-genai
!pip install weave

### Imports

In [3]:
import pandas as pd
import json
import os
import time
from tqdm import tqdm
from google import genai
from google.genai import types
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import uuid
import weave
from weave import Evaluation
import asyncio

from google.colab import userdata


### Helpers

In [4]:
collection_name = "product_reviews"

In [5]:
os.environ["WANDB_API_KEY"] = userdata.get("WANDB_TOKEN")

In [6]:
GEMINI_KEY = userdata.get('GEMINI_API_KEY')
gemini_client = genai.Client(
    api_key=GEMINI_KEY
)

In [7]:
def getGeminiResponse(prompt, max_tokens=8192, response_type="text/plain"):
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(
                    text=prompt
                ),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        temperature=0,
        top_p=0.95,
        top_k=40,
        max_output_tokens=max_tokens,
        response_mime_type=response_type,
    )
    response = gemini_client.models.generate_content(
        model="gemini-2.0-flash", contents=contents, config=generate_content_config
    )
    return response.text

In [8]:
getGeminiResponse("What is 2+3?")

'2 + 3 = 5\n'

### EDA

In [11]:
with open("qna_dataset.json") as f:
    qa_dataset = json.load(f)

In [12]:
len(qa_dataset)

6701

In [13]:
qa_dataset[0]

{'answer': 'Index 0, 4, 12',
 'matched_indexes': [0, 4, 12],
 'question': 'Best hammer for DIY projects'}

In [23]:
qa_dataset = qa_dataset[:500]

### Connect Qdrant

In [14]:
QDRANT_URL = "https://qdrant-1.sg-1.cloudtop.dev"
QDRANT_KEY = userdata.get('PERSONAL_QDRANT_KEY')

In [15]:
qdrant_client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_KEY, port=None)

### Ask questions from the QA dataset

In [16]:
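def getRagResponse(question, k=10, skip_ai=False):
    # Returns (response, matched_ids). With skip_ai=True the assembled prompt is
    # returned in place of a Gemini response, so no LLM call is made.
    search_result = qdrant_client.query(collection_name=collection_name, query_text=question, limit=k)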
    system_prompt = """
      You are an intelligent assistant designed to provide accurate and informative answers based on retrieved documents.

      Your primary task is to:

      Understand the user's query.
      Retrieve relevant information from the provided context (documents).
      Synthesize the retrieved information into a coherent and accurate response.

      documents:

      """

    documents_text = ""
    matched_ids = []

    # Number the retrieved documents from 1 and record each one's dataset index.
    for doc_count, result in enumerate(search_result, start=1):
        documents_text += str(doc_count) + ": \n" + result.document + "\n\n"
        matched_ids.append(result.metadata["index"])

    users_query = "\n\n The user is asking: " + question

    prompt = system_prompt + documents_text + users_query

    if skip_ai:
        return prompt, matched_ids

    response = getGeminiResponse(prompt)

    return response, matched_ids
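The prompt layout that `getRagResponse` assembles can be inspected without Qdrant or Gemini. The following is a minimal sketch of that layout only: `build_rag_prompt`, its default `instructions` string, and the two sample documents are hypothetical stand-ins, not part of the notebook or its dataset.

```python
def build_rag_prompt(question, documents,
                     instructions="Answer using only the documents below."):
    """Number each document from 1, then append the user's question.

    `instructions` is a placeholder for the notebook's longer system prompt.
    """
    numbered = "".join(
        f"{i}: \n{doc}\n\n" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{instructions}\n\ndocuments:\n\n{numbered}"
        f"\n\n The user is asking: {question}"
    )


# Made-up documents, used only to show the resulting layout.
sample_prompt = build_rag_prompt(
    "Best hammer for DIY projects",
    ["Sturdy claw hammer, great grip.", "Lightweight hammer for small jobs."],
)
print(sample_prompt)
```

Printing the prompt this way is a quick check that document numbering and the trailing question look as intended before any retrieval or LLM call is involved.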

In [25]:
user_query = "What are the key features of the heavy-duty workbench?"

In [26]:
result, indexes = getRagResponse(user_query, 5, True)

In [27]:
indexes

[355, 357, 351, 358, 350]

In [28]:
evals = []
matches = 0
k = 5
for eval_ques in tqdm(qa_dataset):
    query = eval_ques["question"]
    # skip_ai=True: only retrieval is evaluated here, no Gemini call is made.
    _, indexes = getRagResponse(query, k, True)
    expected = eval_ques["matched_indexes"]
    intersection = set(indexes) & set(expected)
    # A hit is any overlap with the expected indexes, or an empty retrieval
    # when nothing was expected.
    hit = len(intersection) > 0 if expected else len(indexes) == 0
    evals.append({"query": eval_ques, "result": hit})
    if hit:
        matches += 1
    if len(evals) % 100 == 0:
        print(f"Checks: {matches}/{len(evals)} of {len(qa_dataset)}")

 20%|██        | 100/500 [00:30<02:00,  3.33it/s]
Checks: 41/100 of 500
 40%|████      | 200/500 [01:00<01:30,  3.31it/s]
Checks: 78/200 of 500
 60%|██████    | 300/500 [01:30<00:59,  3.35it/s]
Checks: 111/300 of 500
 80%|████████  | 400/500 [02:00<00:29,  3.37it/s]
Checks: 153/400 of 500
100%|██████████| 500/500 [02:30<00:00,  3.32it/s]
Checks: 189/500 of 500

In [29]:
matches

189

### Accuracy

In [30]:
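# Named retrieval_accuracy so it is not shadowed by the accuracy() scorer defined below.
retrieval_accuracy = matches / len(evals)
retrieval_accuracy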

0.378
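The accuracy above is a hit rate: a question counts as correct when at least one retrieved index appears in its expected indexes, or when nothing was expected and nothing was retrieved. Below is a minimal sketch of that rule on toy data; `is_hit`, `hit_rate`, and `toy_cases` are illustrative names and values, not taken from the notebook's dataset.

```python
def is_hit(retrieved, expected):
    """True if any retrieved index is expected, or both lists are empty."""
    if expected:
        return bool(set(retrieved) & set(expected))
    return len(retrieved) == 0


def hit_rate(cases):
    """Fraction of (retrieved, expected) pairs that count as hits."""
    hits = sum(is_hit(retrieved, expected) for retrieved, expected in cases)
    return hits / len(cases)


# Toy (retrieved, expected) pairs covering the four possible situations.
toy_cases = [
    ([0, 7, 9], [0, 4, 12]),  # overlap on index 0: hit
    ([1, 2, 3], [0, 4, 12]),  # no overlap: miss
    ([], []),                 # nothing expected, nothing retrieved: hit
    ([5], []),                # nothing expected, something retrieved: miss
]
toy_rate = hit_rate(toy_cases)
print(toy_rate)  # 0.5
```

Note that this rule rewards a single correct document as much as retrieving all of them, so it is expected to rise as `k` grows.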

### Observing with Weave

In [31]:
@weave.op()
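def accuracy(question, output):
    # Scorer: `question` is a dataset row, `output` is the list of indexes
    # returned by the retrieval function being evaluated.
    intersection = list(set(output) & set(question["matched_indexes"]))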
    if len(question["matched_indexes"]) > 0 and len(intersection) > 0:
        return True
    if len(question["matched_indexes"]) == 0 and len(output) == 0:
        return True
    return False

In [32]:
@weave.op()
def top_5(question):
    result, indexes = getRagResponse(question["question"], 5, True)
    return indexes

In [33]:
@weave.op()
def top_10(question):
    result, indexes = getRagResponse(question["question"], 10, True)
    return indexes

In [34]:
@weave.op()
def top_20(question):
    result, indexes = getRagResponse(question["question"], 20, True)
    return indexes
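The three functions above differ only in the value of `k`. One possible way to avoid the repetition is a small factory, sketched below under assumptions: `make_top_k` is a hypothetical helper, and `fake_retrieve` is a stand-in for `getRagResponse` so the example runs without Qdrant. Wrapping the returned functions with `weave.op()` is deliberately left out, since how Weave names dynamically created ops is not verified here.

```python
def make_top_k(k, retrieve):
    """Build a function that returns the top-k retrieved indexes for a dataset row."""
    def top_k(question):
        # Same call shape as getRagResponse(question, k, skip_ai).
        _, indexes = retrieve(question["question"], k, True)
        return indexes

    top_k.__name__ = f"top_{k}"  # readable name, e.g. "top_5"
    return top_k


def fake_retrieve(question, k, skip_ai):
    # Stand-in retriever: returns a placeholder prompt and the first k integers.
    return "prompt", list(range(k))


retrievers = {k: make_top_k(k, fake_retrieve) for k in (5, 10, 20)}
print(retrievers[5]({"question": "Best hammer for DIY projects"}))  # [0, 1, 2, 3, 4]
```

In the notebook, `getRagResponse` would be passed in place of `fake_retrieve`.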

In [35]:
evaluation = Evaluation(
    dataset=[{"question": x} for x in qa_dataset], scorers=[accuracy]
)

In [36]:
weave.init('gemini-rag-eval')

Logged in as Weights & Biases user: xprilion.
View Weave data at https://wandb.ai/xprilion/gemini-rag-eval/weave


<weave.trace.weave_client.WeaveClient at 0x7bd1b556f790>

In [37]:
await evaluation.evaluate(top_5)

🍩 https://wandb.ai/xprilion/gemini-rag-eval/r/call/019554be-2109-7f40-ac90-a7e65515519d


{'accuracy': {'true_count': 189, 'true_fraction': 0.378},
 'model_latency': {'mean': 0.7037788381576539}}

In [38]:
await evaluation.evaluate(top_10)

{'accuracy': {'true_count': 287, 'true_fraction': 0.574},
 'model_latency': {'mean': 0.6618615546226502}}

In [39]:
await evaluation.evaluate(top_20)

🍩 https://wandb.ai/xprilion/gemini-rag-eval/r/call/019554be-935b-7631-a22f-1d0daa55157b


{'accuracy': {'true_count': 347, 'true_fraction': 0.694},
 'model_latency': {'mean': 0.6650017681121826}}
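The three runs can be put side by side with a few lines of arithmetic. This sketch only restates the `true_count` values reported above for the 500-question subset; the variable names are illustrative.

```python
# true_count values reported by the three Weave evaluations above.
results = {5: 189, 10: 287, 20: 347}
total = 500  # size of the evaluated subset (qa_dataset[:500])

for k, hits in results.items():
    print(f"top_{k}: {hits}/{total} = {hits / total:.3f}")

# Additional questions answered correctly each time k is doubled.
gain_5_to_10 = results[10] - results[5]
gain_10_to_20 = results[20] - results[10]
print(gain_5_to_10, gain_10_to_20)  # 98 60
```

Doubling `k` from 5 to 10 adds 98 hits, while doubling again to 20 adds 60, so the gain from retrieving more documents shrinks in this run.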