Manufacturing RAG — Evaluation Notebooks

Companion evaluation suite for the Manufacturing RAG System.

Most RAG tutorials measure one thing: whether the final answer sounds correct. This series measures every layer — because retrieval failures are usually data problems, and data problems don't show up in the final answer until it's too late.

The notebooks are numbered in dependency order. Each one builds on the findings of the one before it.

Why Evaluate Each Layer?

Raw Content
    │
    ▼
[NB1] Corpus Quality      ← Is the data worth indexing?
    │
    ▼
[NB2] Chunking & Signals  ← Are documents represented correctly?
    │
    ▼
[NB3] Retrieval Eval      ← Does the system find the right documents?
    │
    ▼
[NB4] Answer Eval         ← Does retrieval produce correct answers?

Each layer's quality constrains the next. A perfectly tuned retriever over a low-quality corpus still returns low-quality results. A correct answer generated from the wrong retrieved documents is a hallucination waiting to happen.

Notebooks

`01_corpus_quality_evaluation.ipynb`

Question: Is the data worth indexing?

Most retrieval failures are data problems in disguise. This notebook establishes a quality baseline before retrieval is attempted — measuring document usefulness, signal density, domain coverage, redundancy, and information entropy across content types (research papers, vendor manuals, standards docs, podcasts).

Key outputs:

Usefulness rate by threshold
Signal density rankings by content type
Pairwise Jaccard similarity (redundancy detection)
Per-document entropy scores
Composite quality score per document
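
As a rough illustration of two of the outputs listed above, the sketch below computes pairwise Jaccard similarity and a per-document Shannon entropy over word tokens. It is not taken from the notebook: the tokenizer, the function names, and the three sample documents are all hypothetical, and the notebook may define entropy over a different unit (characters, n-grams) than the word tokens assumed here.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two documents' token sets: |A & B| / |A | B|."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per token, of the document's token distribution."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample documents, for illustration only.
docs = {
    "manual_a": "Reset the PLC before loading the new ladder program",
    "manual_b": "Reset the PLC before loading the new firmware image",
    "boilerplate": "click here click here click here click here",
}

# Redundancy: similarity for every unordered pair of documents.
pairwise = {(x, y): jaccard(docs[x], docs[y]) for x, y in combinations(docs, 2)}

# Information content: repetitive text scores low, varied text scores higher.
entropy = {name: shannon_entropy(text) for name, text in docs.items()}
```

Under these assumptions, near-duplicate manuals show up as high-Jaccard pairs, while repetitive boilerplate stands out through low entropy rather than through similarity to other documents.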

`02_chunking_signal_design.ipynb`

Question: Are documents represented correctly in the vector index?

Chunking strategy determines what the retriever can and cannot find. This notebook evaluates how documents are split and embedded — covering chunk size distributions, boundary coherence, overlap redundancy, signal extraction quality, and embedding space structure.

Key outputs:

BoundaryScore and TopicConsistency per chunk
Expansion factor by content type
Signal specificity and redundancy rates
Chunking configuration recommendations
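
To make the link between overlap and redundancy concrete, here is a minimal sketch of fixed-size chunking with overlap and the expansion factor it produces. The definition used (total tokens stored across chunks divided by tokens in the source document) is an assumption, as are the function names and the size and overlap values; the notebook's chunker and its exact metric may differ.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into fixed-size windows that share `overlap` tokens with their neighbor."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def expansion_factor(tokens: list[str], chunks: list[list[str]]) -> float:
    """Tokens stored across all chunks divided by tokens in the source document."""
    if not tokens:
        return 0.0
    return sum(len(chunk) for chunk in chunks) / len(tokens)


# Hypothetical 100-token document, chunked at 40 tokens with a 10-token overlap.
tokens = [f"t{i}" for i in range(100)]
chunks = chunk_tokens(tokens, size=40, overlap=10)
factor = expansion_factor(tokens, chunks)
```

A factor of 1.0 means no token is stored twice. Larger overlaps push the factor up, which costs index space and tends to surface near-identical chunks in the same result list.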

`03_retrieval_evaluation.ipynb`

Question: Does the system return the right documents when asked?

Runs a labeled test set of 31 domain-specific queries (human-judged relevance) against the retriever and measures retrieval quality using standard IR metrics. Classifies every query outcome into one of four types (hit, buried, miss, absent) and identifies which query types and corpus characteristics drive failures.

Configurations evaluated: dense only, hybrid (dense + BM25, fused with reciprocal rank fusion, RRF), hybrid + cross-encoder reranker.
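
For readers unfamiliar with RRF (reciprocal rank fusion), the sketch below shows how a dense ranking and a BM25 ranking can be merged using ranks alone, so the two score scales never need to be normalized. The function name and document IDs are hypothetical, and `k = 60` is the constant conventionally used with RRF; whether the notebook uses the same value is an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from each retriever for the same query.
dense_hits = ["doc_opcua", "doc_scada", "doc_plc"]
bm25_hits = ["doc_plc", "doc_opcua", "doc_nist"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that appear in both lists accumulate two terms and rise above documents that only one retriever found, which is the behavior a hybrid configuration relies on.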

Key outputs:

Recall@k, Precision@k, MRR across configurations
Failure taxonomy breakdown
Score distribution analysis
Optimal similarity threshold
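
The metrics and the four-way taxonomy above can be sketched in a few lines. Recall@k, Precision@k, and MRR follow their standard definitions. The taxonomy rules are an assumption, since this README only names the categories: here a hit has a relevant document in the top k, buried means a relevant document was retrieved but ranked below k, a miss means relevant documents exist but none were retrieved, and absent means the corpus holds no relevant document for the query. All function names and sample data are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def classify_outcome(retrieved: list[str], relevant: set[str], k: int) -> str:
    """Assign one of the four outcome labels (assumed definitions, see lead-in)."""
    if not relevant:
        return "absent"  # nothing in the corpus answers this query
    if relevant & set(retrieved[:k]):
        return "hit"  # a relevant document is inside the top k
    if relevant & set(retrieved):
        return "buried"  # retrieved, but ranked below k
    return "miss"  # relevant documents exist, none retrieved


# Hypothetical runs: (ranked retrieved IDs, human-labeled relevant IDs).
runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d9"}),
    (["d2", "d3", "d4", "d7", "d5"], {"d7"}),
    (["d2", "d3", "d4"], {"d8"}),
    (["d2", "d3"], set()),
]

outcomes = [classify_outcome(retrieved, relevant, k=3) for retrieved, relevant in runs]

# MRR is averaged over queries that have at least one relevant document.
answerable = [(retrieved, relevant) for retrieved, relevant in runs if relevant]
mrr = sum(reciprocal_rank(r, rel) for r, rel in answerable) / len(answerable)
```

Separating buried from miss matters for diagnosis under these definitions: a buried result points at ranking or reranking, a miss points at recall in the first-stage retriever, and an absent result is a corpus gap that no retriever change can fix.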

`04_end_to_end_answer_evaluation.ipynb`

Question: Does better retrieval actually produce better answers?

The final notebook connects the retrieval layer to an LLM (Gemini via Vertex AI) and evaluates the full RAG pipeline end-to-end — whether generated answers are correct, grounded in retrieved context, and faithful to source material. Tests whether the retrieval improvements measured in NB3 translate to meaningfully better answers.

In progress.

Domain

All notebooks use a manufacturing and industrial automation corpus — PLCs, SCADA, OPC UA, NIST standards, vendor manuals (Siemens, Rockwell, Schneider), and academic papers. The domain was chosen deliberately: evals are only meaningful when you know what a correct answer looks like.

Setup

Notebooks require a connection to the project DuckDB and LanceDB instances from the Manufacturing RAG System.

conda activate whisper-310
jupyter lab

License

MIT
