# Retrieval Evaluation Notebook 

This notebook contains controlled experiments to evaluate the **retrieval component** of a RAG system using an **LLM-as-a-judge**. Two parameters are compared:

1. **Chunking strategy**

   * Markdown header-based chunking
   * Fixed-size chunking
2. **Retrieval search type**

   * Similarity
   * MMR (Maximal Marginal Relevance)

For each query, the judge scores the four retrieval sets on relevance, coverage, noise, and redundancy (0–5 each) and picks a single winner.

---

## Data

* **Source:** Academic project reports and technical documentation
* **Formats:** PDF and Markdown (`.pdf`, `.md`)

---

## Tooling

* **Vector DB:** Qdrant (running via Docker)
* **Embedding model:** `text-embedding-3-small`
* **Reranker:** `cross-encoder/ms-marco-MiniLM-L-6-v2`
* **PDF → Markdown conversion:** Docling
* **OCR:** EasyOCR
* **Retrieval framework:** LangChain
* **Judge LLM:** OpenAI API (`o3-mini`)
* **Structured evaluation outputs:** Pydantic

---

## Experiment Design

Each query is executed under **4 retrieval conditions**:

| Set | Chunking Strategy   | Retrieval Type |
| --- | ------------------- | -------------- |
| A   | Markdown chunking   | Similarity     |
| B   | Markdown chunking   | MMR            |
| C   | Fixed-size chunking | Similarity     |
| D   | Fixed-size chunking | MMR            |

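All retrieved documents (`k = 10` per condition) are **reranked** with a cross-encoder. Chunks with a non-positive reranker score are dropped, and the remainder is passed to the judge in descending score order.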
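
A minimal sketch of the filter-then-sort logic behind the reranking step, with hard-coded scores standing in for the cross-encoder. The `rerank` helper and its `threshold` parameter are illustrative names; the notebook's `RetrievalExperiment.run` applies the same idea to real `CrossEncoder.predict` scores.

```python
def rerank(documents, scores, threshold=0.0):
    """Keep documents scoring above `threshold`, best first."""
    kept = [(doc, score) for doc, score in zip(documents, scores) if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical chunks and scores; in the notebook the scores come from
# the cross-encoder applied to (query, chunk) pairs.
docs = ["chunk about ETL", "unrelated chunk", "chunk about data sources"]
scores = [2.7, -4.1, 5.3]

ranked = rerank(docs, scores)
print(ranked)
```

One consequence of the threshold is that a retrieval set can reach the judge with fewer than `k` documents, or with none at all.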

---

## Chunking Strategies

### Strategy 1: Markdown Chunking (Structure-aware)

**Markdown files**

* Split using `MarkdownHeaderTextSplitter`
* Header levels: `#`, `##`, `###`, `####`

**PDF files**

* Convert PDF → Markdown using **Docling + EasyOCR**
* Then split using `MarkdownHeaderTextSplitter`

This strategy preserves document hierarchy and aligns chunks with semantic structure.
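
To illustrate what header-based splitting produces, here is a simplified standard-library stand-in. It is not `MarkdownHeaderTextSplitter`, which also records the header hierarchy in chunk metadata; `split_on_headers` is a hypothetical helper that only cuts the text before each heading of level 1 to 4 and keeps the heading inside its chunk.

```python
import re

# Zero-width lookahead: split *before* each heading line so the heading
# stays attached to the text that follows it.
HEADING = re.compile(r"(?=^#{1,4}\s+)", flags=re.MULTILINE)


def split_on_headers(markdown: str) -> list[str]:
    """Return one chunk per heading-delimited section."""
    return [part.strip() for part in HEADING.split(markdown) if part.strip()]


sample = """# Report
Intro text.

## Methods
We used Qdrant.

## Results
Similarity search won.
"""

chunks = split_on_headers(sample)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with its own heading, which is the property the stability fix later in this notebook relies on.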

---

### Strategy 2: Fixed-Size Chunking (Uniform + robust)

**Markdown files**

1. Split into sections by headers
2. Chunk using `RecursiveCharacterTextSplitter`

   * `chunk_size = 1172`
   * `chunk_overlap = 0`

**PDF files**

1. Split by page (page text extracted with PyMuPDF)
2. Chunk using `RecursiveCharacterTextSplitter`

   * `chunk_size = 2090`
   * `chunk_overlap = 200`

This strategy aims for consistent chunk lengths, which is intended to improve recall on long, dense passages.
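
The effect of `chunk_size` and overlap can be sketched with a plain sliding window. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as newlines, whereas the hypothetical `fixed_size_chunks` helper below cuts at exact character offsets.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Stand-in for one PDF page of 5000 characters (not real data).
page = "".join(str(i % 10) for i in range(5000))
chunks = fixed_size_chunks(page, chunk_size=2090, overlap=200)
print([len(c) for c in chunks])
```

With the PDF settings used here, a 5000-character page yields three chunks, and each chunk begins with the last 200 characters of the previous one.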

---

## Rationale for Chunk Sizes

Chunk sizes were selected empirically by analyzing **character-count distributions** of PDF pages and Markdown sections using **histograms and boxplots**.

### PDF files

* Page lengths showed a broad but stable distribution (no extreme outliers)
* A page generally contained more text than a Markdown section
  → **Chunk size chosen:** **median page character count = 2090**

### Markdown files

* Section lengths were skewed, with large outliers
* Typical sections were much shorter than the extremes
  → **Chunk size chosen:** **upper whisker of the boxplot (the largest non-outlier value, bounded by Q3 + 1.5×IQR) = 1172**
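
Both statistics can be computed with the standard library. The sketch below uses made-up character counts, not the notebook's data, and assumes the usual boxplot convention in which the upper whisker is the largest observation that does not exceed Q3 + 1.5×IQR. The quartile method (`"inclusive"`) is also an assumption; other methods give slightly different fences.

```python
from statistics import median, quantiles


def upper_whisker(lengths):
    """Largest observation not exceeding the fence Q3 + 1.5 * IQR."""
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return max(x for x in lengths if x <= fence)


# Hypothetical character counts for illustration only.
page_lengths = [1800, 1950, 2090, 2200, 2400]
section_lengths = [200, 350, 400, 600, 800, 900, 1100, 6000]

pdf_chunk_size = median(page_lengths)
md_chunk_size = upper_whisker(section_lengths)
print(pdf_chunk_size, md_chunk_size)
```

In the hypothetical section data, the 6000-character section lies above the fence, so it does not inflate the chosen chunk size.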

---

## Retrieval Search Types

1. **Similarity** (dense vector similarity search)
2. **MMR** (diversity-aware retrieval, trades off relevance vs novelty)
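
The relevance/novelty trade-off in MMR can be sketched as a greedy selection loop. This is an illustrative implementation on toy 2-D vectors, not the code LangChain or Qdrant runs; the `mmr` and `cosine` helpers are hypothetical, and `lambda_mult` (1.0 = pure relevance, lower values favor diversity) is named after the parameter LangChain exposes for MMR search.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: return indices of up to `k` documents in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected document (0 if none yet).
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # very relevant
    [1.0, 0.11],   # near-duplicate of the first
    [0.8, -0.60],  # less relevant, but points in a different direction
]

balanced = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(balanced, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct document, while `lambda_mult=1.0` reduces to plain similarity ranking.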

---

## LLM-as-a-Judge Setup

### Queries

A list of **10 evaluation queries** was prepared, with an approximate expectation of what chunks should be retrieved for each.

### Judge model choice

A **small, cost-effective reasoning model** (`o3-mini`) was used because:

* the task is comparative ranking (selection + scoring), and
* it requires less reasoning than generating a full final answer.

### Prompt iterations (what changed)

1. **Baseline prompt**

   * Too vague / open-ended
   * Produced unsatisfactory scoring behavior
2. **Improved prompt**

   * More explicit role definition
   * Clear scoring criteria + 0–5 scale
   * Decision rules (winner selection, handling of insufficient sets)
   * Forced **structured output** (Pydantic schema)

### Stability issue and fix

* Even with the improved prompt, results were **inconsistent across 3 runs**
* Error analysis identified a key issue:

  * **Markdown splitter was stripping headings**, weakening chunk semantics and harming judge reliability
* Fix:

  * Configured `MarkdownHeaderTextSplitter` with `strip_headers=False` so that headings are **preserved** in `page_content`

After the fix, across the next 3 runs:

* The judge produced a **consistent winner for 8 of the 10 queries**
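
One way to quantify this kind of stability is to count the queries whose winner is identical in every run. The sketch below assumes that reading of "consistent" and uses made-up winners, not the notebook's results; `consistent_queries` is a hypothetical helper.

```python
def consistent_queries(runs):
    """Return the queries whose winner is the same in every run.

    `runs` is a list of dicts, each mapping query -> winning set label.
    """
    queries = runs[0].keys()
    return [q for q in queries if len({run[q] for run in runs}) == 1]


# Hypothetical winners for three runs over three queries.
runs = [
    {"ETL architecture": "A", "thesis contribution": "A", "bikes data sources": "C"},
    {"ETL architecture": "A", "thesis contribution": "B", "bikes data sources": "C"},
    {"ETL architecture": "A", "thesis contribution": "A", "bikes data sources": "C"},
]

stable = consistent_queries(runs)
print(f"{len(stable)}/{len(runs[0])} queries had a consistent winner")
```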

---

## Result Summary

**Winner:** **Markdown chunking strategy + Similarity retrieval**

This combination produced the strongest balance of:

* semantic relevance (header-aligned chunks),
* sufficient coverage for query intent,
* lower noise than fixed-size chunking on these documents.

---

## Notes / Takeaways

* Evaluation stability depends heavily on **chunk content fidelity** (e.g., preserving headings).
* LLM judges can be sensitive to formatting changes; enforcing structure via Pydantic helps, but **instruction quality dominates**.
* For technical academic documents, **structure-aware chunking** paired with straightforward similarity retrieval was most reliable in this setup.

In [1]:
from __future__ import annotations

from json import load, dump
from dataclasses import dataclass
from glob import glob
from pathlib import Path
from re import MULTILINE, DOTALL, compile
from typing import Any, List, Literal, Sequence, Tuple, Annotated

from docling.datamodel.accelerator_options import (
    AcceleratorDevice,
    AcceleratorOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
)
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
)

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

from qdrant_client import QdrantClient

from sentence_transformers import CrossEncoder
from openai import OpenAI

from pymupdf import Document as PdfDocument

from pydantic import BaseModel, Field

from dotenv import load_dotenv
load_dotenv()

True

In [2]:
class VectorStoreBase:
    COLLECTION: str
    PROCESSOR_CLS: type

    def __init__(self, url="http://localhost:6333", port=6333):
        self.url = url
        self.port = port

    def get_vector_store(self):
        client = QdrantClient(url=self.url, port=self.port)

        if client.collection_exists(self.COLLECTION):
            return self._load()

        return self._create()

    def _create(self):
        processor = self.PROCESSOR_CLS()
        documents = processor.build_documents()

        return QdrantVectorStore.from_documents(
            documents=documents,
            embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
            collection_name=self.COLLECTION,
            url=self.url,
            port=self.port,
        )

    def _load(self):
        return QdrantVectorStore.from_existing_collection(
            embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
            collection_name=self.COLLECTION,
            url=self.url,
            port=self.port,
        )

In [3]:
class FileProcessorMarkdownChunks:
    INPUT_GLOB = "../data/*"
    CONFIG_PATH = "../configs/data_config.json"

    def __init__(self):
        self.files = glob(self.INPUT_GLOB)
        self.config = self._load_config()
        self.splitter = self._markdown_splitter()
        self.pdf_converter = self._pdf_converter()

    def build_documents(self):
        documents = []
        for file in self.files:
            documents.extend(self._process_file(file))
        return documents

    def _process_file(self, file):
        text = self._to_markdown(Path(file))
        docs = self.splitter.split_text(text)

        for d in docs:
            d.metadata.update(self.config[Path(file).name])

        return docs

    def _to_markdown(self, path: Path) -> str:
        if path.suffix == ".pdf":
            return self.pdf_converter.convert(path).document.export_to_markdown()
        if path.suffix == ".md":
            return path.read_text(encoding="utf-8")

        raise ValueError(f"Unsupported format: {path.suffix}")

    @staticmethod
    def _markdown_splitter():
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "Header 1"),
                ("##", "Header 2"),
                ("###", "Header 3"),
                ("####", "Header 4")
            ],
            strip_headers=False
        )

    @staticmethod
    def _pdf_converter():
        return DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=PdfPipelineOptions(
                        do_ocr=True,
                        ocr_options=EasyOcrOptions(lang=["en"]),
                        do_table_structure=False,
                        accelerator_options=AcceleratorOptions(
                            num_threads=4,
                            device=AcceleratorDevice.CUDA,
                        ),
                    )
                )
            }
        )

    @staticmethod
    def _load_config():
        with open(FileProcessorMarkdownChunks.CONFIG_PATH) as f:
            return load(f)

In [4]:
class FileProcessorFixedSizeChunks:
    INPUT_GLOB = "../data/*"
    CONFIG_PATH = "../configs/data_config.json"

    SPLITTERS = {
        ".pdf": (2090, 200),
        ".md": (1172, 0),
    }

    MULTIPLE_WHITESPACE = compile(r"[ \t]{2,}")
    IMAGES = compile(r"!\[.*?\]\(.*?\)", flags=DOTALL)
    MULTIPLE_NEWLINES = compile(r"\n{2,}")
    LINEBREAKS = compile(r"^\s*---\s*$", flags=MULTILINE)
    HEADINGS = compile(r"(?=^#{1,6}\s+)", flags=MULTILINE)

    def __init__(self):
        self.files = glob(self.INPUT_GLOB)
        self.config = self._load_config()

    def build_documents(self):
        documents = []

        for file in self.files:
            path = Path(file)
            if path.suffix not in self.SPLITTERS:
                raise ValueError(f"Unsupported format: {path.suffix}")

            sections = self._extract_sections(path)
            documents.extend(self._chunk(path, sections))

        return documents

    def _extract_sections(self, path: Path):
        if path.suffix == ".pdf":
            pages = PdfDocument(path)
            sections = [p.get_text("text") for p in pages]
        elif path.suffix == ".md":
            text = path.read_text(encoding="utf-8")
            sections = self.HEADINGS.split(text)
        
        return (self._clean(s) for s in sections if isinstance(s, str) and s.strip())

    def _chunk(self, path: Path, sections):
        chunk_size, overlap = self.SPLITTERS[path.suffix]
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
        )

        metadata = self.config[path.name]
        documents = []

        for section in sections:
            for chunk in splitter.split_text(section):
                documents.append(Document(page_content=chunk, metadata=metadata))

        return documents

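    def _clean(self, text: str) -> str:
        # str.isprintable() is False for "\n" and "\t", so keep them explicitly;
        # otherwise the line-based patterns below never match and the splitter
        # loses its newline separators.
        text = "".join(c for c in text if c.isprintable() or c in "\n\t")
        text = self.MULTIPLE_WHITESPACE.sub(" ", text)
        text = self.IMAGES.sub("", text)
        text = self.LINEBREAKS.sub("", text)
        text = self.MULTIPLE_NEWLINES.sub("\n", text)
        return text.strip()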

    @staticmethod
    def _load_config():
        with open(FileProcessorFixedSizeChunks.CONFIG_PATH) as f:
            return load(f)


In [5]:
class VectorStoreMarkdownChunks(VectorStoreBase):
    COLLECTION = "sanath_projects_markdown_chunks"
    PROCESSOR_CLS = FileProcessorMarkdownChunks

class VectorStoreFixedSizeChunks(VectorStoreBase):
    COLLECTION = "sanath_projects_fixed_size_chunks"
    PROCESSOR_CLS = FileProcessorFixedSizeChunks

In [6]:
@dataclass(frozen=True)
class RetrievalConfig:
    name: str
    search_type: str
    k: int = 10

@dataclass
class ExperimentResult:
    experiment: str
    strategy: str
    ranked_documents: List[Tuple]

class RetrievalExperiment:
    def __init__(self, name: str, vector_store, reranker):
        self.name = name
        self.vector_store = vector_store
        self.reranker = reranker

    def run(self, query: str, config: RetrievalConfig):
        retriever = self.vector_store.as_retriever(
            search_type=config.search_type,
            search_kwargs={"k": config.k},
        )

        documents = retriever.invoke(query)

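        # Score every (query, chunk) pair with the cross-encoder.
        pairs = [(query, doc.page_content) for doc in documents]
        scores = self.reranker.predict(pairs)

        # Keep only chunks with a positive reranker score.
        ranked = [
            (doc, score)
            for doc, score in zip(documents, scores)
            if score > 0
        ]

        # Highest-scoring chunks first.
        ranked.sort(key=lambda x: x[1], reverse=True)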

        return ExperimentResult(
            experiment=self.name,
            strategy=config.name,
            ranked_documents=ranked,
        )

class ExperimentRunner:
    def __init__(self, experiments: list[RetrievalExperiment], configs: list[RetrievalConfig]):
        self.experiments = experiments
        self.configs = configs

    def run_all(self, query: str) -> list[ExperimentResult]:
        results = []

        for experiment in self.experiments:
            for config in self.configs:
                results.append(experiment.run(query, config))

        return results

In [7]:
Verdict = Literal["SUFFICIENT", "INSUFFICIENT"]
Winner = Literal["A", "B", "C", "D"]

class SetScore(BaseModel):
    relevance: Annotated[int, Field(ge=0, le=5)]
    coverage: Annotated[int, Field(ge=0, le=5)]
    noise: Annotated[int, Field(ge=0, le=5)]
    redundancy: Annotated[int, Field(ge=0, le=5)]
    verdict: Verdict

class JudgeOutput(BaseModel):
    winner: Winner
    set_A: SetScore
    set_B: SetScore
    set_C: SetScore
    set_D: SetScore

In [8]:
def _safe(s: str) -> str:
    return (s or "").strip()

def format_retrieval_block(
    ranked_documents: Sequence[tuple[Any, float]],
) -> str:
    lines: list[str] = []
    for i, (doc, _) in enumerate(ranked_documents):
        content = _safe(getattr(doc, "page_content", ""))

        lines.append(
            f"- doc_{i+1}:\n"
            f"{content}\n"
        )

    return "\n".join(lines) if lines else "(no documents)"

In [9]:
def build_judge_prompt(
    query: str,
    retrieval_A: str,
    retrieval_B: str,
    retrieval_C: str,
    retrieval_D: str,
) -> str:
    return f"""
user_query: {_safe(query)}

retrieval_A:
{retrieval_A}

retrieval_B:
{retrieval_B}

retrieval_C:
{retrieval_C}

retrieval_D:
{retrieval_D}
""".strip()

In [10]:
def map_results_to_sets(results) -> dict[str, Any]:
    by_key = {(r.experiment, r.strategy): r for r in results}

    A = by_key[("markdown_chunks", "similarity")]
    B = by_key[("markdown_chunks", "mmr")]
    C = by_key[("fixed_chunks", "similarity")]
    D = by_key[("fixed_chunks", "mmr")]

    return {"A": A, "B": B, "C": C, "D": D}

In [11]:
system_prompt = \
"""
You are an impartial evaluation model acting as a judge for the retrieval component of a Retrieval-Augmented Generation (RAG) system.

You will compare four retrieval results for the SAME user query. Your job is to determine which set would enable a better final answer to the query.

### What you will receive (each list has doc_id and content)
1) user_query: a natural-language query
2) retrieval_A: a list of documents
3) retrieval_B: a list of documents
4) retrieval_C: a list of documents
5) retrieval_D: a list of documents

### Core evaluation dimensions (apply to each set)
1. Relevance
   - How directly the documents address the query intent.

2. Coverage / Completeness
   - Whether the set contains enough information to answer the query fully.
   - Identify missing key aspects.

3. Noise
   - Penalize boilerplate, metadata, formatting artifacts, tangential content, and irrelevant sections.
   - Prefer dense, query-focused evidence.

4. Redundancy
   - Penalize excessive duplication unless it adds complementary details.

### Scoring
Score each set on a 0–5 integer scale for each dimension:
0 = unacceptable / none
1 = poor
2 = weak
3 = adequate
4 = strong
5 = excellent

### Decision rules
- Pick a single winner: "A", "B", "C", or "D".
- If all sets are too weak to answer the query well, pick the less-bad winner AND mark all as insufficient.
- Prefer the set that is simultaneously:
  - more relevant,
  - more complete,
  - less noisy,
  - and more reliable for answering.

### Output format
Do not add a preamble.
"""

In [12]:
class RetrievalJudge:
    def __init__(self, client: OpenAI, model_name: str = "o3-mini"):
        self.client = client
        self.model_name = model_name

    def judge(self, *, query: str, results) -> JudgeOutput | None:
        sets = map_results_to_sets(results)

        retrieval_A = format_retrieval_block(
            sets["A"].ranked_documents
        )
        retrieval_B = format_retrieval_block(
            sets["B"].ranked_documents
        )
        retrieval_C = format_retrieval_block(
            sets["C"].ranked_documents
        )
        retrieval_D = format_retrieval_block(
            sets["D"].ranked_documents
        )

        prompt = build_judge_prompt(
            query=query,
            retrieval_A=retrieval_A,
            retrieval_B=retrieval_B,
            retrieval_C=retrieval_C,
            retrieval_D=retrieval_D,
        )

        # Structured output: let the API enforce JSON schema
        resp = self.client.responses.parse(
            model=self.model_name,
            input=[
                {
                    "role": "system",
                    "content": system_prompt,
                },
                {"role": "user", "content": prompt},
            ],
            text_format=JudgeOutput
        )

        return resp.output_parsed

In [13]:
reranker_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

vs_markdown = VectorStoreMarkdownChunks().get_vector_store()
vs_fixed = VectorStoreFixedSizeChunks().get_vector_store()

experiments = [
    RetrievalExperiment("markdown_chunks", vs_markdown, reranker_model),
    RetrievalExperiment("fixed_chunks", vs_fixed, reranker_model),
]

configs = [
    RetrievalConfig(name="similarity", search_type="similarity", k=10),
    RetrievalConfig(name="mmr", search_type="mmr", k=10),
]

runner = ExperimentRunner(experiments, configs)

client = OpenAI()
judge = RetrievalJudge(client)

2025-12-18 20:24:56,655 - INFO - Use pytorch device: cuda:0
2025-12-18 20:24:57,782 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
2025-12-18 20:24:57,801 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks/exists "HTTP/1.1 200 OK"
2025-12-18 20:24:57,862 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-18 20:25:01,633 - INFO - Going to convert document batch...
2025-12-18 20:25:01,637 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 29d7ac5255fd1ef2d81020aeb11ae19b
2025-12-18 20:25:01,662 - INFO - Loading plugin 'docling_defaults'
2025-12-18 20:25:01,668 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-18 20:25:01,687 - INFO - Loading plugin 'docling_defaults'
2025-12-18 20:25:01,692 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-12-18 20:25:02,367 - INFO - Accelerator device: 'cuda:0'
2025-12-18 20:25:05,222 - INF

In [14]:
queries = [
    "thesis contribution",
    "llm content evaluation",
    "bikes data sources",
    "visualization tools used",
    "table extraction configuration parameters",
    "Exploratory Data Analysis for cruise ship analysis",
    "Employment reference Generation",
    "ETL architecture",
    "chatbot LLMs used",
    "what are pseudonymization and anonymization techniques?",
]
all_res = {}
all_outs = {}
for query in queries:
    results = runner.run_all(query)
    all_res[query] = results
    out = judge.judge(query=query, results=results)
    if out:
        all_outs[query] = out

2025-12-18 20:27:08,217 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:08,284 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"
2025-12-18 20:27:08,697 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:09,136 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:09,552 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:09,610 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"
2025-12-18 20:27:10,882 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:10,920 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"
2025-12-18 20:27:11,023 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:11,393 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:13,646 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:13,712 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"
2025-12-18 20:27:27,774 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:27:28,387 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:28,425 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:28,653 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:29,005 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:29,411 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:29,458 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:29,878 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:29,913 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:29,989 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:31,061 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:31,709 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:31,777 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:50,673 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:27:51,161 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:51,209 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:51,465 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:54,750 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:55,115 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:55,182 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:57,880 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:57,918 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:27:58,036 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:27:58,283 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:58,701 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:27:58,745 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:23,686 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:28:24,500 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:24,543 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:24,731 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:28:25,220 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:25,502 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:25,581 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:26,557 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:26,584 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:26,621 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:28:26,899 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:27,121 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:27,165 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:51,131 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:28:51,634 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:51,680 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:52,047 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:28:52,795 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:53,177 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:53,210 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:53,795 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:53,824 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:28:53,918 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:28:54,226 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:55,978 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:28:56,030 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:14,990 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:29:15,812 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:15,851 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:16,243 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:29:16,623 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:17,040 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:17,083 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:17,751 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:17,812 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:17,912 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:29:18,467 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:19,163 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:19,200 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:39,254 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:29:39,769 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:39,799 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:40,040 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:29:40,591 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:41,098 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:41,253 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-12-18 20:29:41,657 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:41,684 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:29:41,788 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:29:42,170 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:42,533 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:29:42,593 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:02,016 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:30:02,602 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:02,634 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:03,009 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:03,528 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:03,861 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:03,897 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:04,922 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:04,957 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:05,114 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:05,422 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:06,700 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:06,740 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:19,241 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:30:19,851 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:19,925 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:20,212 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:20,770 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:22,001 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:22,036 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:22,719 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:22,780 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:22,886 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:23,250 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:23,558 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:23,624 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:51,651 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"
2025-12-18 20:30:52,271 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:52,342 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:52,478 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_markdown_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:54,423 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:55,946 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:55,974 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_markdown_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:56,837 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:56,866 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:30:57,004 - INFO - HTTP Request: GET http://localhost:6333/collections/sanath_projects_fixed_size_chunks "HTTP/1.1 200 OK"
2025-12-18 20:30:57,303 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:57,552 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-12-18 20:30:57,623 - INFO - HTTP Request: POST http://localhost:6333/collections/sanath_projects_fixed_size_chunks/points/query "HTTP/1.1 200 OK"


2025-12-18 20:31:13,010 - INFO - HTTP Request: POST https://api.openai.com/v1/responses "HTTP/1.1 200 OK"


In [15]:
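def serialize_all_res(all_res: dict) -> dict:
    """Convert retrieval results into JSON-serializable dicts, keyed by query."""
    serialized = {}

    for query, results in all_res.items():
        serialized[query] = []

        # One entry per retrieval condition that was run for this query.
        for r in results:
            serialized[query].append({
                "experiment": r.experiment,
                "strategy": r.strategy,
                "documents": [
                    {
                        # Cast to a built-in float so json can serialize the
                        # score (NumPy scalars, for example, are rejected).
                        "score": float(score),
                        "page_content": doc.page_content,
                        "metadata": doc.metadata,
                    }
                    for doc, score in r.ranked_documents
                ],
            })

    return serialized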

def serialize_all_outs(all_outs: dict) -> dict:
    """Dump each query's judge output (a Pydantic model) to a plain dict."""
    return {
        query: out.model_dump()
        for query, out in all_outs.items()
    }

output_dir = Path("../target/evaluation_outputs")
# parents=True also creates ../target if it does not exist yet.
output_dir.mkdir(parents=True, exist_ok=True)

with open(output_dir / "retrieval_results.json", "w", encoding="utf-8") as f:
    dump(
        serialize_all_res(all_res),
        f,
        ensure_ascii=False,
        indent=2,
    )

with open(output_dir / "judge_results.json", "w", encoding="utf-8") as f:
    dump(
        serialize_all_outs(all_outs),
        f,
        ensure_ascii=False,
        indent=2,
    )
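
To check that a saved file round-trips cleanly, it can be read back with `json.load`. The sketch below is illustrative only: it writes to a temporary directory rather than `../target/evaluation_outputs`, and the `sample` payload (query text, `experiment` and `strategy` values, score, metadata) is a made-up placeholder that merely mimics the shape produced by `serialize_all_res`.

```python
import json
import tempfile
from pathlib import Path

# Placeholder payload with the same shape as serialize_all_res output.
# All values here are hypothetical, not real experiment results.
sample = {
    "example query": [
        {
            "experiment": "A",
            "strategy": "markdown",
            "documents": [
                {
                    "score": 0.5,
                    "page_content": "placeholder text",
                    "metadata": {"source": "example.md"},
                }
            ],
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "retrieval_results.json"

    # Write with the same options used in the cell above.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=2)

    # Read the file back and confirm nothing was lost.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Number of reranked documents stored per retrieval run, per query.
doc_counts = {
    query: [len(run["documents"]) for run in runs]
    for query, runs in loaded.items()
}
print(doc_counts)
```

Pointing `path` at the real `retrieval_results.json` would give the same per-query document counts for the actual runs.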


In [16]:
for q, o in all_outs.items():
    print(q, o.winner)

thesis contribution A
llm content evaluation A
bikes data sources A
visualization tools used A
table extraction configuration parameters A
Exploratory Data Analysis for cruise ship analysis A
Employment reference Generation A
ETL architecture A
chatbot LLMs used C
what are pseudonymization and anonymization techniques? A
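
The per-query winners above can be tallied by set to see the overall picture. This is a minimal sketch: the `winners` dict is copied by hand from the printed output, and `conditions` restates the Experiment Design table. Both names are introduced here for illustration.

```python
from collections import Counter

# Winners copied from the printed output above (query -> winning set).
winners = {
    "thesis contribution": "A",
    "llm content evaluation": "A",
    "bikes data sources": "A",
    "visualization tools used": "A",
    "table extraction configuration parameters": "A",
    "Exploratory Data Analysis for cruise ship analysis": "A",
    "Employment reference Generation": "A",
    "ETL architecture": "A",
    "chatbot LLMs used": "C",
    "what are pseudonymization and anonymization techniques?": "A",
}

# Set labels as defined in the Experiment Design table.
conditions = {
    "A": ("Markdown chunking", "Similarity"),
    "B": ("Markdown chunking", "MMR"),
    "C": ("Fixed-size chunking", "Similarity"),
    "D": ("Fixed-size chunking", "MMR"),
}

# Count how many queries each set won, most frequent first.
tally = Counter(winners.values())
for label, wins in tally.most_common():
    chunking, search = conditions[label]
    print(f"Set {label} ({chunking}, {search}): {wins}/{len(winners)} queries")
```

Inside the notebook the same tally could be built directly with `Counter(o.winner for o in all_outs.values())`. On these ten queries, Set A wins nine and Set C wins one. Both use similarity search, so no MMR set wins any query.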
