## 🔍 Retrieval System Testing & Evaluation for RAG

This notebook is designed to **experiment with multiple retrieval flows** and compare their performance using **retrieval evaluation methods**.  
The goal is to identify which retrieval strategy works best for our RAG (Retrieval-Augmented Generation) system.

---

### 1. Retrieval Flows to Test

- **Dense Retrieval (Embeddings via Qdrant)**
  - Store document embeddings in Qdrant
  - Qdrant handles efficient nearest-neighbor search for semantic similarity

- **Hybrid Retrieval (BM25 + Qdrant)**
  - Combine lexical retrieval (e.g., BM25 from Elasticsearch/Whoosh) with dense retrieval from Qdrant
  - Weighted scoring or rank fusion improves precision and recall

- **Structured Filtering + Semantic Search**
  - Apply metadata/numeric filters (e.g., date, category, calories) directly in Qdrant
  - Perform dense retrieval within the filtered subset
  - Ensures results are both **relevant** and **contextually constrained**



### 2. Retrieval Evaluation Methods

- **Hit Rate@K**
  - Measures whether at least one relevant document appears in the top-K results  
  - Binary (hit or miss), useful for quick assessment  

- **Mean Reciprocal Rank (MRR)**
  - Evaluates the rank position of the first relevant document  
  - Higher score = relevant results appear earlier in the ranking  

- **LLM-as-a-Judge**
  - Use an LLM to assess whether retrieved documents are relevant to a query  
  - Useful when no human-annotated labels exist  
  - Can provide **graded relevance scores** instead of binary labels  



In [99]:
from openai import OpenAI
from api_key import openAI_api_key 

import json
import uuid
from qdrant_client import QdrantClient, models

import pandas as pd
from tqdm.auto import tqdm

### LLM prompt template

In [100]:
prompt_template = """
You are a helpful baking assistant. A client has asked the following question:

Client question:
{user_query}

Here are the top {number_of_results} potentially relevant recipes retrieved from the knowledge base:
{context_block}

Task:
- Choose the single most relevant document that best answers the client's question.
- Return the recipe from that document only.
- Do NOT combine multiple recipes together.
- Provide the recipe in natural human-readable format, including all of the following information:
    - Recipe name / title
    - Difficulty level
    - Ingredients list
    - Step-by-step instructions
    - Total cooking time in minutes
    - Calories (kcal)
- If none of the documents are relevant, respond: "I don’t have a recipe for that."

Provide your answer as a step-by-step baking recipe.
"""

result_template = """
"name"              :   {name},
"difficult"         :   {difficult},
"total_cooking_min" :   {total_cooking_min},
"kcal"              :   {kcal},
"ingredients"       :   {ingredients},
"steps"             :   {steps}
"""

def format_prompt(prompt_template, question, search_results):
    result_number = len(search_results)

    formatted_result = []
    for result in search_results:
        dum = result_template.format(**result).strip()
        formatted_result.append(dum)
        
    # joining the list of text into one block of text (seperated by ;;)
    context = ";;".join(formatted_result)
    prompt = prompt_template.format(user_query = question, number_of_results = result_number, context_block = context)

    return prompt.strip()

In [None]:
def llm_prompt(prompt,  model = "gpt-5-mini"):
    response = llm_client.chat.completions.create(
            model= model,
            messages=[{"role": "user", "content": prompt}]
        )

    result = response.choices[0].message.content # extract the results message out
    return result


In [102]:
with open ("../data/baking_cleaned.json", "r") as f:
    baking_recipes = json.load(f)

In [103]:
qd_client = QdrantClient("http://localhost:6333")
llm_client = OpenAI(api_key = openAI_api_key)

In [104]:
qd_client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='baking_recipes_dense')])

###  Retrieval Flow -- Qdrant (Dense Retrieval)
---

In [92]:
qd_client.create_collection(
    collection_name = "baking_recipes_dense",
    vectors_config=models.VectorParams(
        size = 512,
        distance=models.Distance.COSINE
    )
)

True

In [12]:
points_vector = []
for recipe in baking_recipes:
    ingred = ";".join(recipe["ingredients"]) # joining list of text to a block of text
    txt = recipe["name"] + " | " + recipe["difficult"] + " | " + recipe["dish_type"] + " | " + recipe["description"] + " | " + ingred 
    point = models.PointStruct(
        id = recipe["id"],

        vector = models.Document(
                    text = txt,
                    model = "jinaai/jina-embeddings-v2-small-en"
                ),

        payload = {
            "id"                :   recipe["id"],
            "name"              :   recipe["name"],
            "dish_type"         :   recipe["dish_type"],
            "difficult"         :   recipe["difficult"],
            "ingredients"       :   recipe["ingredients"],
            "steps"             :   recipe["steps"],
            "preparation_min"   :   recipe['prep_mins'], 
            "cooking_min"       :   recipe['cook_mins'], 
            "total_cooking_min" :   recipe["total_mins"],
            "kcal"              :   recipe['kcal'], 
            "fat"               :   recipe['fat'], 
            "saturated fat"     :   recipe['saturates'], 
            "carbohydrates"     :   recipe['carbs'], 
            "sugars"            :   recipe['sugars'], 
            "fibre"             :   recipe['fibre'], 
            "protein"           :   recipe['protein'], 
            "salt"              :   recipe['salt'],
            "rating"           :   recipe["rattings"]
        }         
    )

    points_vector.append(point)

In [13]:
qd_client.upsert(
    collection_name = "baking_recipes_dense",
    points = points_vector
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [14]:
def vector_search(query):

    query_points = qd_client.query_points(
        collection_name = "baking_recipes_dense",
        query = models.Document(
            text = query,
            model = "jinaai/jina-embeddings-v2-small-en"
        ),
        limit = 3,
        with_payload = True
    )
    
    results = []
    
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [15]:
def dense_rag(query, model_name = "gpt-5-mini"):
    search_results = vector_search(query)
    prompt = format_prompt(prompt_template, query, search_results)
    answer = llm_prompt(prompt, model= model_name)
    return answer

In [None]:
query = "I want to make a chocolate cake"
ans = dense_rag(query)
ans

'Very chocolatey cake\n\nDifficulty: Easy\nTotal cooking time: 60 minutes\nCalories: 235 kcal\n\nIngredients:\n- 3 eggs\n- 200 g golden caster sugar\n- 200 g very soft butter\n- 200 g self-raising flour\n- 1 tsp baking powder\n- 3 tbsp cocoa powder\n- 100 g chocolate drops (milk, plain, white or a mix)\n- 300 g soft butter (for icing)\n- 100 g icing sugar (for icing)\n- 400 g melted plain chocolate (for icing)\n\nStep-by-step instructions:\n\n1. Preheat the oven to 180°C (160°C fan) / Gas 4. Grease and line two 20 cm (8 in) round cake tins (or use tins of similar size).\n\n2. Crack the eggs into a small bowl, check for any shell fragments and remove them, then tip the eggs into a large mixing bowl.\n\n3. Add the 200 g golden caster sugar and 200 g very soft butter to the bowl with the eggs.\n\n4. Sift the 200 g self-raising flour, 1 tsp baking powder and 3 tbsp cocoa powder over the egg, sugar and butter mixture.\n\n5. Beat everything together until well combined and smooth. You can us

###  Retrieval Flow -- Qdrant (hybrid search)

In [105]:
qd_client.create_collection(
    collection_name= "baking_recipes_description",
    vectors_config={
        "jina-v2" : models.VectorParams(
            size = 512, #embedding dimensionality
            distance = models.Distance.COSINE,
        )
    },
    sparse_vectors_config = {
        "bm25" : models.SparseVectorParams(
            modifier = models.Modifier.IDF
        )
    }
)

True

In [54]:
points_vector = []
for recipe in baking_recipes:
    ingred = ";".join(recipe["ingredients"]) # joining list of text to a block of text
    txt = recipe["name"] + " | " + recipe["difficult"] + " | " + recipe["dish_type"] + " | " + recipe["description"] + " | " + ingred 
    point = models.PointStruct(
        id = recipe["id"],
        
        vector = {
            "jina-v2": models.Document(
                    text = txt,
                    model ="jinaai/jina-embeddings-v2-small-en",
                ),
            "bm25": models.Document(
                    text = txt, 
                    model ="Qdrant/bm25",
                )
        },

        payload = {
            "id"                :   recipe["id"],
            "name"              :   recipe["name"],
            "dish_type"         :   recipe["dish_type"],
            "difficult"         :   recipe["difficult"],
            "ingredients"       :   recipe["ingredients"],
            "steps"             :   recipe["steps"],
            "preparation_min"   :   recipe['prep_mins'], 
            "cooking_min"       :   recipe['cook_mins'], 
            "total_cooking_min" :   recipe["total_mins"],
            "kcal"              :   recipe['kcal'], 
            "fat"               :   recipe['fat'], 
            "saturated fat"     :   recipe['saturates'], 
            "carbohydrates"     :   recipe['carbs'], 
            "sugars"            :   recipe['sugars'], 
            "fibre"             :   recipe['fibre'], 
            "protein"           :   recipe['protein'], 
            "salt"              :   recipe['salt'],
            "rating"           :   recipe["rattings"]
        }         
    )

    points_vector.append(point)

In [55]:
qd_client.upsert(
    collection_name="baking_recipes_description",
    points = points_vector 
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)

In [65]:
def rrf_description_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = qd_client.query_points(
        collection_name="baking_recipes_description",
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-v2",
                limit=(3 * limit),
            ),
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="Qdrant/bm25",
                ),
                using="bm25",
                limit=(3 * limit),
            ),
        ],
        # Fusion query enables fusion on the prefetched results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        with_payload=True,
    )


    search_results = []
    
    for point in results.points:
        search_results.append(point.payload)

    return search_results

In [60]:
def hybrid_description_rag(query, model_name = "gpt-5-mini"):
    search_results = rrf_description_search(query)
    prompt = format_prompt(prompt_template, query, search_results)
    answer = llm_prompt(prompt, model= model_name)
    return answer

In [67]:
answer = hybrid_description_rag("Crowd-friendly pizza roll recipe that can be doubled to feed a group of kids—easy to make?")
answer

"Pizza rolls\nDifficulty: Easy\nTotal cooking time: 30 minutes\nCalories: 275 kcal (per roll, as listed)\n\nIngredients\n- 6 crusty bread rolls\n- 2 tbsp tomato purée\n- 6 slices ham\n- 3 tomatoes, sliced\n- 2 balls mozzarella, sliced (we used Sainsbury's Basics)\n- 2 tsp dried oregano\n- 6 black olives (optional)\n\nStep-by-step instructions\n1. Heat the oven to 180°C / 160°C fan / Gas 4.\n2. Cut the tops off the rolls and scoop out the insides to make little hollow bread cups.\n3. Spread the inside of each roll with tomato purée.\n4. Fill each roll with a slice of ham, a few slices of tomato, then top with slices of mozzarella.\n5. Scatter dried oregano over each filled roll and top each one with an olive if using.\n6. Place the filled rolls on a baking tray and bake for about 15 minutes, until the rolls are crusty brown and the cheese is bubbling.\n7. Leave to rest for 1 minute, then serve hot (they go nicely with a simple side salad).\n\nNote: To serve a larger group of kids, doubl

###  Retrieval Flow -- Qdrant (hybrid search + filter)

In [16]:
qd_client.create_collection(
    collection_name= "baking_recipes_hybrid",
    vectors_config={
        "jina-v2" : models.VectorParams(
            size = 512, #embedding dimensionality
            distance = models.Distance.COSINE,
        )
    },
    sparse_vectors_config = {
        "bm25" : models.SparseVectorParams(
            modifier = models.Modifier.IDF
        )
    }
)

True

In [17]:
def point_setup(js : json, section, txt : str, counter = ""):
    point = models.PointStruct(
        id = str(uuid.uuid4()),
        vector = {
            "jina-v2": models.Document(
                    text = txt,
                    model ="jinaai/jina-embeddings-v2-small-en",
                ),
            "bm25": models.Document(
                    text = txt, 
                    model ="Qdrant/bm25",
                )
        },
        payload = {
            "id"                :   js["id"],
            "section"           :   section,
            "name"              :   js["name"],
            "dish_type"         :   js["dish_type"],
            "difficult"         :   js["difficult"],
            "ingredients"       :   js["ingredients"],
            "steps"             :   js["steps"],
            "preparation_min"   :   js['prep_mins'], 
            "cooking_min"       :   js['cook_mins'], 
            "total_cooking_min" :   js["total_mins"],
            "kcal"              :   js['kcal'], 
            "fat"               :   js['fat'], 
            "saturated fat"     :   js['saturates'], 
            "carbohydrates"     :   js['carbs'], 
            "sugars"            :   js['sugars'], 
            "fibre"             :   js['fibre'], 
            "protein"           :   js['protein'], 
            "salt"              :   js['salt'],
            "rating"           :   js["rattings"]
        }         
    )

    return point

In [18]:
qd_points = []

for i, recipe in enumerate(baking_recipes):
    # join name, description, dish type and difficult into a section
    description_txt = recipe["name"] + " | " + recipe["difficult"] + " | " + recipe["dish_type"] + " | " + recipe["description"]
    des_point = point_setup(recipe, "description", description_txt)
    qd_points.append(des_point)


    # join ingredients into a section
    ingredient_txt = "; ".join(recipe["ingredients"])
    ing_point = point_setup(recipe, "ingredients", ingredient_txt)
    qd_points.append(ing_point)


    # chunked steps into different section
    counter = 1
    for each_step in recipe["steps"]:
        step_point = point_setup(recipe, "steps", each_step, str(counter))
        qd_points.append(step_point)
        counter += 1


In [19]:
BATCH_SIZE = 500

for i in range(0, len(qd_points), BATCH_SIZE):
    batch = qd_points[i:i + BATCH_SIZE]
    qd_client.upsert(
        collection_name="baking_recipes_hybrid",
        points=batch
    )

In [20]:
def rrf_search(query: str, filters, limit: int = 1) -> list[models.ScoredPoint]:
    results = qd_client.query_points(
        collection_name="baking_recipes_hybrid",
        
        query_filter = models.Filter(
            must = filters
        ),

        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-v2",
                limit=(2 * limit),
            ),
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="Qdrant/bm25",
                ),
                using="bm25",
                limit=(2 * limit),
            ),
        ],
        # Fusion query enables fusion on the prefetched results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        with_payload=True,
    )

    return results.points

In [115]:
def question_classifier(query, model = "gpt-5-mini"):
    """
    this block is using LLM to classify which section for this question is below to
    """

    prompt_template = """
    Given a user question, output JSON with:

    1. "relevant_sections": choose only from:
    [description, ingredients, steps, preparation_min, cooking_min, total_cooking_min, kcal, fat, saturated_fat, carbohydrates, sugars, fibre, protein, salt, rating]

    Rules:
    - description → dish/general info
    - ingredients → if components mentioned
    - steps → only if instructions asked
    - numeric fields → only if numbers mentioned (time, calories, etc.)

    2. "filter": list numeric filters as "field operator value"
    - Convert hours → minutes
    - For ranges, keep the larger value
    - rating must be numeric
    - use "lte" = ≤, "gte" = ≥

    Output format (no code block):
    {{
    "relevant_sections": [...],
    "filter": {{...}}
    }}

    Example:
    Q: Easy white bread rolls ready in 60 minutes with saturated fat < 1g?
    A: {{
    "relevant_sections": ["description","total_cooking_min","saturated_fat"],
    "filter": {{"total_cooking_min":"lte,60","saturated_fat":"lte,1"}}
    }}

    Question: {question}
    """

    prompt = prompt_template.format(question = query).strip()
    response = llm_prompt(prompt, model= model)
    result_js = json.loads(response)

    return result_js

In [22]:
def text_search(query, classifier):
    seen_ids = set()
    search_results = []

    for section in classifier["relevant_sections"]:
        qd_filter = []
        # set up relevant section for text search
        section_filter = models.FieldCondition(
                            key = "section",
                            match = models.MatchValue(value = section)
                        )
        qd_filter.append(section_filter)
        # filter section and then do a vector search
        results = rrf_search(query = query, filters = qd_filter)

        # combining the rrf_search with the previous rrf_search
        for point in results:
            # if the point is not appear before then will put in to search results list
            if point.payload["id"] not in seen_ids:
                search_results.append(point)
                seen_ids.add(point.payload["id"])
            else:
                pass

    # formatting results
    final_results = []
    for point in search_results:
        final_results.append(point.payload)

    return final_results

In [23]:
def text_numeric_search(query, classifier):
    
    seen_ids = set()
    search_results = []

    # first : hybrid (text) search
    for section in classifier["relevant_sections"]:
        qd_filter = []
        # set up relevant section for text search
        section_filter = models.FieldCondition(
                        key = "section",
                        match = models.MatchValue(value = section)
                    )
        qd_filter.append(section_filter)

        # second : numeric search 
        for numeric_filter in classifier["filter"].keys():
            filter_val = classifier["filter"][numeric_filter].split(',')
            # if the value is >= xx then is gt 
            if filter_val[0] == "gte":
                num_filter = models.FieldCondition(
                                key = numeric_filter,
                                range = models.Range(gte = filter_val[1])
                            )
                qd_filter.append(num_filter)
            # if the value is <= xx then is lte 
            elif filter_val[0] == "lte":
                num_filter = models.FieldCondition(
                                key = numeric_filter,
                                range = models.Range(lte = filter_val[1])
                            )
                qd_filter.append(num_filter)
            else:
                pass

        # filter section and then do a vector search
        results = rrf_search(query = query, filters = qd_filter)
        
        # combining the rrf_search with the previous rrf_search
        for point in results:
            # if the point is not appear before then will put in to search results list
            if point.payload["id"] not in seen_ids:
                search_results.append(point)
                seen_ids.add(point.payload["id"])
            else:
                pass

    # formatting results
    final_results = []
    for point in search_results:
        final_results.append(point.payload)
                
    return final_results


In [24]:
def hybrid_search(query, model_name = "gpt-5-mini"):
    """
    identifying what kinda of question is it and then filtering it with text or numerical filtering
    """
    unable_to_run = []
    classifier = question_classifier(query, model = model_name)

    try:
        # removing section that need numeric filter
        for numeric_filter in classifier["filter"].keys():
            classifier["relevant_sections"].remove(numeric_filter)

        if len(classifier["filter"].keys()) == 0:
            search_results = text_search(query, classifier)
        else:
            search_results = text_numeric_search(query, classifier)
    except:
        print(f"unexpected error: {query}")
        unable_to_run.append(query)
    
    return search_results

In [25]:
def hybrid_rag(query, model_name = "gpt-5-mini"):
    
    search_results = hybrid_search(query, model_name = model_name)
    # formatting prompt and ask llm to give the correct results
    prompt = format_prompt(prompt_template, query, search_results)
    answer = llm_prompt(prompt, model= model_name)

    return answer

In [None]:
ans = hybrid_rag("Crowd-friendly pizza roll recipe that can be doubled to feed a group of kids—easy to make?")
ans

'Recipe: Pizza rolls\nDifficulty: Easy\nTotal cooking time: 30 minutes\nCalories: 275 kcal\n\nIngredients\n- 6 crusty bread rolls\n- 2 tbsp tomato purée\n- 6 slices ham\n- 3 tomatoes, sliced\n- 2 balls mozzarella, sliced\n- 2 tsp dried oregano\n- 6 black olives (optional)\n\nStep-by-step instructions\n1. Heat the oven to 180°C / 160°C fan / Gas 4.\n2. Cut the tops off the rolls and scoop out the insides to make a hollow cavity in each roll.\n3. Spread each hollowed roll with tomato purée.\n4. Fill each roll with a slice of ham, a few slices of tomato, and top with slices of mozzarella.\n5. Scatter each filled roll with dried oregano. Top each roll with a black olive if using.\n6. Place the filled rolls on a baking tray and bake for 15 minutes, until the rolls are crusty brown and the cheese is bubbling.\n7. Leave to rest for 1 minute, then serve hot (served well with a side salad).\n\nEnjoy your pizza rolls!'

In [None]:
ans_ = hybrid_rag("Lunch idea: Ploughman's-style cheese rolls with apple and poppy seeds, around 250 kcal per roll?")
ans_

'Ploughman’s rolls\nDifficulty: Easy\nTotal cooking time: 62 minutes\nCalories: 250 kcal (per roll)\n\nIngredients\n- 1 tbsp celery seeds, plus a few extra\n- 500g pack bread dough\n- 200ml milk, plus a splash to glaze\n- 100g extra-mature cheddar, grated\n- 85g English Brie or camembert, diced\n- 1 small apple, cored and diced into small chunks\n- 2 spring onions, finely chopped\n- 1 tbsp poppy seed\n- Plain flour, for dusting\n- Pickle, to serve\n\nStep-by-step instructions\n1. Stir 1 tbsp celery seeds into the bread dough mix.\n2. Pour 200ml milk into a jug and make up to 325ml with water. Warm to hand temperature (you can do this briefly in the microwave). Add the warmed liquid to the bread mix and bring together following the pack instructions. Leave the dough to rise in a warm place until about doubled in size.\n3. Meanwhile, mix together 85g of the grated cheddar, the diced Brie or camembert, diced apple and finely chopped spring onions. Season lightly.\n4. Once the dough has ri

## Retrieval evaluation
---

In [68]:
eva_df = pd.read_csv("../data/ground_truth_retrieval.csv", usecols=['id', 'questions'])
eva_df.head(2)

Unnamed: 0,id,questions
0,16a94310-cea8-435f-90e8-10f8b02b7bfe,Easy white bread rolls for sandwiches or burge...
1,16a94310-cea8-435f-90e8-10f8b02b7bfe,Simple homemade rolls using strong white bread...


In [69]:
eva_js = eva_df.to_dict(orient="records")

In [29]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [30]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

In [33]:
def evaluate_func(ground_truth, search_function, method):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        try:
            results = search_function(q)
        except:
            relevance = [False, False, False, False, False]

        if method == "vector_search":
            relevance = [d['id'] == doc_id for d in results]
        elif method == "hybrid_search":
            relevance = [d['id'] == doc_id for d in results]
        else:
            print("wrong method")
            break
        
        
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [None]:
evaluate_func(eva_js, lambda x: vector_search(x["questions"]), "vector_search")

100%|██████████| 3070/3070 [01:37<00:00, 31.57it/s]


{'hit_rate': 0.9276872964169381, 'mrr': 0.8225298588490784}

In [70]:
evaluate_func(eva_js, lambda x: rrf_description_search(x["questions"]), "vector_search")

100%|██████████| 3070/3070 [01:35<00:00, 32.00it/s]


{'hit_rate': 0.9716612377850163, 'mrr': 0.8546525515743764}

In [42]:
evaluate_func(eva_js, lambda x: hybrid_search(x["questions"]), "hybrid_search")

 95%|█████████▍| 142/150 [20:27<01:21, 10.23s/it]

unexpected error: Top-rated (5-star) aubergine and tomato baklava with dates and honey — what are the prep and cooking times?


100%|██████████| 150/150 [21:30<00:00,  8.60s/it]


{'hit_rate': 0.8866666666666667, 'mrr': 0.7378412698412699}

## LLM evaluation 
---

In [73]:
LLMeva_prompt_template = """
You are an evaluator for a RAG system.
Classify the generated answer’s relevance to the question as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Input:

Question: {question}

Answer: {llm_ans}

Output (JSON, no code block):

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Brief reason for classification]"
}}
"""

In [74]:
sampled_df = eva_df.sample(200)

In [None]:
def llm_evaulation(df, model):
    evaluate_results = []

    for index, row in tqdm(df.iterrows()):
        question = row["questions"]
        ans_llm = hybrid_description_rag(question, model_name = model)
        # reformat the prompt
        prompt = LLMeva_prompt_template.format(question= question, llm_ans = ans_llm)

        # LLM-as-a-judge
        response = llm_prompt(prompt, model = model)
        evaluate = json.loads(response)

        # formatting for data storage
        result_js = {
            "id"            :   row["id"],
            "question"      :   question,
            "llm_answer"    :   ans_llm,
            "relevance"     :   evaluate["Relevance"],
            "explanation"   :   evaluate["Explanation"]
        }

        evaluate_results.append(result_js)

    return evaluate_results

In [80]:
evaluation_5mini = llm_evaulation(sampled_df, model = "gpt-5-mini")

200it [1:11:02, 21.31s/it]


In [81]:
eva_5mini_df = pd.DataFrame(evaluation_5mini)
eva_5mini_df.head(2)

Unnamed: 0,id,question,llm_answer,relevance,explanation
0,88cdc06c-19a2-4c4a-b7bd-8c7600a25552,Do you have a chocolate and vanilla celebratio...,Chocolate & vanilla celebration cake\nDifficul...,RELEVANT,The answer provides a full chocolate-and-vanil...
1,72a6dbdd-e1f7-47a5-b2a7-e46259ae7cdf,"Family-friendly rhubarb crumble dessert, rough...",Rhubarb crumble\nDifficulty: Easy\nTotal cooki...,RELEVANT,The answer provides a complete family-friendly...


In [82]:
eva_5mini_df["relevance"].value_counts()

relevance
RELEVANT           156
PARTLY_RELEVANT     44
Name: count, dtype: int64

In [83]:
eva_5mini_df.to_csv("../data/GPT_5mini_evaluation.csv", index = False)

In [84]:
evaluation_5nano = llm_evaulation(sampled_df, model = "gpt-5-nano")

200it [1:06:15, 19.88s/it]


In [85]:
eva_5nano_df = pd.DataFrame(evaluation_5nano)
eva_5nano_df.head(2)

Unnamed: 0,id,question,llm_answer,relevance,explanation
0,88cdc06c-19a2-4c4a-b7bd-8c7600a25552,Do you have a chocolate and vanilla celebratio...,Chocolate & vanilla celebration cake\nDifficul...,RELEVANT,The answer provides a complete recipe for a ch...
1,72a6dbdd-e1f7-47a5-b2a7-e46259ae7cdf,"Family-friendly rhubarb crumble dessert, rough...",Rhubarb crumble\nDifficulty: Easy\nTotal cooki...,RELEVANT,The answer provides a full rhubarb crumble rec...


In [86]:
eva_5nano_df["relevance"].value_counts()

relevance
RELEVANT           165
PARTLY_RELEVANT     33
NON_RELEVANT         2
Name: count, dtype: int64

In [87]:
eva_5nano_df.to_csv("../data/GPT_5nano_evaluation.csv", index = False)