rhussain21/manufacturing-rag-eval

Manufacturing RAG — Evaluation Notebooks

Companion evaluation suite for the Manufacturing RAG System.

Most RAG tutorials measure one thing: whether the final answer sounds correct. This series measures every layer — because retrieval failures are usually data problems, and data problems don't show up in the final answer until it's too late.

The notebooks are numbered in dependency order; each one builds on the findings of the one before it.


Why Evaluate Each Layer?

Raw Content
    │
    ▼
[NB1] Corpus Quality      ← Is the data worth indexing?
    │
    ▼
[NB2] Chunking & Signals  ← Are documents represented correctly?
    │
    ▼
[NB3] Retrieval Eval      ← Does the system find the right documents?
    │
    ▼
[NB4] Answer Eval         ← Does retrieval produce correct answers?

Each layer's quality constrains the next. A perfectly tuned retriever over a low-quality corpus still returns low-quality results. A correct answer generated from the wrong retrieved documents is a hallucination waiting to happen.


Notebooks

01_corpus_quality_evaluation.ipynb

Question: Is the data worth indexing?

Most retrieval failures are data problems in disguise. This notebook establishes a quality baseline before retrieval is attempted — measuring document usefulness, signal density, domain coverage, redundancy, and information entropy across content types (research papers, vendor manuals, standards docs, podcasts).

Key outputs:

  • Usefulness rate by threshold
  • Signal density rankings by content type
  • Pairwise Jaccard similarity (redundancy detection)
  • Per-document entropy scores
  • Composite quality score per document
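Two of the outputs above, pairwise Jaccard similarity and per-document entropy, can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the `tokenize` helper, the token-set definition of Jaccard, and the choice of token-level Shannon entropy are all assumptions.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple, hypothetical tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two documents."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_jaccard(docs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of document ids."""
    return {
        (i, j): jaccard(docs[i], docs[j])
        for i, j in combinations(sorted(docs), 2)
    }


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per token, of a document's token distribution."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Pairs with a Jaccard score near 1.0 are candidates for redundancy review, and documents with very low entropy tend to be repetitive text. Any cutoff for either measure would need to be chosen against the actual corpus.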

02_chunking_signal_design.ipynb

Question: Are documents represented correctly in the vector index?

Chunking strategy determines what the retriever can and cannot find. This notebook evaluates how documents are split and embedded — covering chunk size distributions, boundary coherence, overlap redundancy, signal extraction quality, and embedding space structure.

Key outputs:

  • BoundaryScore and TopicConsistency per chunk
  • Expansion factor by content type
  • Signal specificity and redundancy rates
  • Chunking configuration recommendations
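The relationship between overlap and expansion factor can be illustrated with a small sliding-window chunker. This is a sketch under stated assumptions: the README does not define expansion factor, so it is taken here to mean total tokens across all chunks divided by tokens in the source document, and the function names and parameters are hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into fixed-size windows where neighbours share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no trailing chunk
        # consists only of tokens that were already covered.
        if start + size >= len(tokens):
            break
    return chunks


def expansion_factor(tokens: list[str], chunks: list[list[str]]) -> float:
    """Total chunk tokens divided by source tokens (assumed definition)."""
    if not tokens:
        return 0.0
    return sum(len(chunk) for chunk in chunks) / len(tokens)
```

With no overlap the factor is 1.0; it grows as overlap grows, which is one way to see how much duplicated text a chunking configuration pushes into the index.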

03_retrieval_evaluation.ipynb

Question: Does the system return the right documents when asked?

Runs a labeled test set of 31 domain-specific queries (human-judged relevance) through the retriever and measures retrieval quality using standard IR metrics. Classifies every query outcome into one of four types (hit, buried, miss, absent) and identifies which query types and corpus characteristics drive failures.
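The four labels are named here but not defined, so the sketch below uses one plausible reading and should be treated as an assumption: hit means a relevant document appears in the top k, buried means it was retrieved but ranked below k, miss means it exists in the corpus but was not retrieved, and absent means no relevant document is in the corpus at all. The function name, its parameters, and the default `k` are hypothetical.

```python
def classify_outcome(
    ranked_ids: list[str],
    relevant_ids: set[str],
    corpus_ids: set[str],
    k: int = 5,
) -> str:
    """Label one query's retrieval outcome under the assumed definitions."""
    relevant = set(relevant_ids)
    if not relevant & set(corpus_ids):
        return "absent"   # nothing relevant was ever indexed
    if relevant & set(ranked_ids[:k]):
        return "hit"      # relevant document inside the top k
    if relevant & set(ranked_ids[k:]):
        return "buried"   # retrieved, but ranked below the cutoff
    return "miss"         # indexed, yet not retrieved at all
```

Separating absent from miss matters because the first is a corpus gap and the second is a retrieval problem, and they call for different fixes.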

Configurations evaluated: dense only, hybrid (dense + BM25, fused with RRF), hybrid + cross-encoder reranker.
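RRF (reciprocal rank fusion) merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below shows the idea for a dense ranking and a BM25 ranking. The function name is hypothetical, and `k = 60` is the constant commonly used in the literature, not necessarily the value used in the notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["d1", "d2", "d3"]   # illustrative ids, not real results
bm25_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Because RRF uses ranks rather than raw scores, it does not need the dense and BM25 scores to be on a comparable scale.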

Key outputs:

  • Recall@k, Precision@k, MRR across configurations
  • Failure taxonomy breakdown
  • Score distribution analysis
  • Optimal similarity threshold
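For reference, the three IR metrics listed above can be computed as follows. This is a minimal sketch using the standard definitions with binary relevance labels; the function names and the input shape (a ranked list of ids plus a set of relevant ids per query) are assumptions, not the notebook's API.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

Computing these per configuration over the same labeled queries gives directly comparable numbers for dense, hybrid, and reranked retrieval.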

04_end_to_end_answer_evaluation.ipynb

Question: Does better retrieval actually produce better answers?

The final notebook connects the retrieval layer to an LLM (Gemini via Vertex AI) and evaluates the full RAG pipeline end-to-end — whether generated answers are correct, grounded in retrieved context, and faithful to source material. Tests whether the retrieval improvements measured in NB3 translate to meaningfully better answers.

In progress.


Domain

All notebooks use a manufacturing and industrial automation corpus — PLCs, SCADA, OPC UA, NIST standards, vendor manuals (Siemens, Rockwell, Schneider), and academic papers. The domain was chosen deliberately: evals are only meaningful when you know what a correct answer looks like.


Setup

Notebooks require a connection to the project DuckDB and LanceDB instances from the Manufacturing RAG System.

conda activate whisper-310
jupyter lab

License

MIT
