A learning sandbox for exploring LLM-as-Judge (autograder) evaluation techniques. This project is designed for learning and experimentation, not as a production framework.
LLM-as-Judge is a technique where one LLM evaluates the output of another LLM. Instead of humans manually reviewing every AI response, we use a "judge" LLM to automatically classify responses as correct or incorrect.
In this system:
- Judge LLM: GPT-4o-mini evaluates responses for hallucinations
- Task: Given a context + response pair, classify as "grounded" or "hallucinated" (see the example below)
- Ground Truth: 100 hand-labeled test cases across 7 failure modes
- Metrics: F1, Precision, Recall, Correlation, Calibration, and more
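To make the classification task concrete, here is an illustrative case in the shape the judge sees. It is hypothetical (not one of the labeled cases), and the field names are assumptions rather than the project's exact schema:

```python
# Hypothetical example of the judging task (not an actual entry from ground_truth.json).
example_case = {
    "context": "The Eiffel Tower was completed in 1889 and stands 330 metres tall.",
    "response": (
        "The Eiffel Tower, finished in 1889, is about 330 metres tall and was built "
        "for the World's Fair marking the centenary of the French Revolution."
    ),
    # The World's Fair claim is true but not supported by the context, so under a
    # strict groundedness definition it is labeled "hallucinated" (FM1: Factual Addition).
    "label": "hallucinated",
}
```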
Why use LLM-as-Judge?
- Scale: Evaluate thousands of responses automatically
- Speed: Get results in seconds, not days
- Consistency: Same criteria applied every time
- Cost: Cheaper than human annotation at scale
- Measure hallucination detection performance across multiple prompting strategies
- Identify and test failure modes systematically
- Compare prompt effectiveness through A/B testing
- Monitor production metrics with drift detection
- Ensure model reliability through regression testing on held-out data
- 100 hand-labeled test cases across 7 failure modes
- 6 prompting strategies (v1: zero-shot to v6: calibrated confidence)
- 12+ evaluation metrics (accuracy, F1, TNR, bias, consistency, correlation)
- A/B testing framework with statistical significance
- Drift monitoring for production safety
- Regression testing on held-out set (never used in development)
- Streamlit dashboard with slice analysis and learning guides
- Daily orchestrator for automated evaluation runs
- Database backend (SQLite/PostgreSQL) for metrics history
```
evaluation-system/
├── docs/
│ ├── DESIGN.md # Architecture & methodology
│ ├── METRICS.md # All metrics with interpretation
│ ├── FAILURE_MODES.md # 7 failure modes explained
│ ├── PROMPTING_STRATEGIES.md # 6 strategies compared
│ └── USER_GUIDE.md # Comprehensive user guide
├── data/
│ ├── test_cases/
│ │ ├── ground_truth.json # 80 training cases
│ │ └── regression.json # 20 held-out cases
│ ├── results/ # Timestamped evaluation runs
│ ├── daily_runs/ # Orchestrator-generated results
│ └── metrics.db # SQLite metrics database
├── prompts/
│ ├── v1_zero_shot.txt # Baseline
│ ├── v2_few_shot.txt # With examples
│ ├── v3_chain_of_thought.txt # Step-by-step reasoning
│ ├── v4_rubric_based.txt # Explicit criteria
│ ├── v5_structured_output.txt # JSON format
│ ├── v6_calibrated_confidence.txt # Calibrated confidence (recommended)
│ └── prompt_registry.json # Version tracking
├── src/
│ ├── metrics.py # All metric calculations
│ ├── evaluator.py # LLM evaluation engine
│ ├── database.py # Database abstraction (SQLite/PostgreSQL)
│ ├── orchestrator.py # Daily evaluation runner
│ ├── ab_testing.py # A/B test framework
│ ├── monitoring.py # Drift detection
│ ├── regression.py # Regression testing
│ └── dashboard_v2.py # Streamlit dashboard (primary)
├── tests/
│ └── test_metrics.py # Unit tests
├── config/
│ └── settings.yaml # Configuration
├── requirements.txt
└── README.md
```
```bash
pip install -r requirements.txt
```

Option A: Using .env file (Recommended - Most Secure)

- Copy the example file:

  ```bash
  cp .env.example .env
  ```

- Edit `.env` and add your actual API key:

  ```
  OPENAI_API_KEY=sk-your-actual-api-key-here
  ```

- The code will automatically load from `.env` - no manual setup needed
- IMPORTANT: Never commit `.env` to git (it's in `.gitignore`)
Option B: Using Environment Variable (Less Secure)
```powershell
# PowerShell
$env:OPENAI_API_KEY = "sk-your-api-key"
```

```bash
# Bash/Linux
export OPENAI_API_KEY="sk-your-api-key"
```

```python
from src.evaluator import Evaluator
from src.metrics import MetricsCalculator
# Initialize evaluator (automatically loads API key from .env)
evaluator = Evaluator(model="gpt-4")
# Run on training set with v1 prompt
results = evaluator.evaluate_batch(
    prompt_id="v1_zero_shot",
    test_case_ids=[case for case in evaluator.list_test_cases() if not case.startswith("REG_")],
)
# Calculate metrics
y_true = [r["ground_truth"] for r in results]
y_pred = [r["prediction"] for r in results]
calc = MetricsCalculator(y_true, y_pred)
metrics = calc.all_metrics()
for name, metric in metrics.items():
    print(f"{name}: {metric.value:.3f} - {metric.interpretation}")
# Check production readiness
is_ready, details = calc.production_ready()
print(f"Production ready: {is_ready}")
# Save results
evaluator.save_results(results, run_name="v1_baseline")
```

```bash
# Using Python module (recommended)
python -m streamlit run src/dashboard_v2.py
```

Navigate to http://localhost:8501
The dashboard includes:
- Getting Started: Workflow explanation and quick start
- Failure Modes: Detailed explanations of FM1-FM7
- Metrics Overview: Current performance metrics
- Slice Analysis: Performance breakdown by failure mode, difficulty, label
- Trends: Historical metrics over time
- Run Evaluation: Execute new evaluation runs
- Baseline (v1): Establish zero-shot performance
- Few-shot (v2): Add examples if baseline F1 < 0.70
- Chain-of-thought (v3): Add reasoning if few-shot insufficient
- Rubric-based (v4): Explicit criteria
- Structured Output (v5): JSON parsing
- Calibrated Confidence (v6): Best correlation metrics (recommended for production); a scripted version of this workflow is sketched below
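A minimal sketch of the iteration loop, reusing the Quick Start API and assuming each result dict carries "ground_truth" and "prediction" keys as shown above (the judge model and the float return of `f1()` are also assumptions):

```python
from src.evaluator import Evaluator
from src.metrics import MetricsCalculator

evaluator = Evaluator(model="gpt-4o-mini")  # assumed judge model; adjust to your setup
train_ids = [c for c in evaluator.list_test_cases() if not c.startswith("REG_")]

# Evaluate each prompt version on the training set and keep the best F1.
scores = {}
for prompt_id in ["v1_zero_shot", "v2_few_shot", "v3_chain_of_thought",
                  "v4_rubric_based", "v5_structured_output", "v6_calibrated_confidence"]:
    results = evaluator.evaluate_batch(prompt_id=prompt_id, test_case_ids=train_ids)
    calc = MetricsCalculator([r["ground_truth"] for r in results],
                             [r["prediction"] for r in results])
    scores[prompt_id] = calc.f1()  # assumed to return a float; use .value if it returns a Metric

best = max(scores, key=scores.get)
print(f"Best prompt on the training set: {best} (F1 = {scores[best]:.3f})")
```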
```python
from src.ab_testing import ABTester
tester = ABTester()
# Run both prompts on training set
results_v2 = evaluator.evaluate_batch("v2_few_shot", ...)
results_v3 = evaluator.evaluate_batch("v3_chain_of_thought", ...)
# A/B test
test_result = tester.run_test(results_v2, results_v3, "v2_few_shot", "v3_chain_of_thought")
print(f"Winner: {test_result['winner']}")
print(f"F1 improvement: {test_result['summary']['f1_improvement']:.1%}")
# Set winner as production
tester.set_production_prompt("v3_chain_of_thought")
tester.update_registry_with_results(test_result)
```

```python
from src.regression import RegressionTester
tester = RegressionTester()
# Evaluate on regression set
regression_results = evaluator.evaluate_batch(
    prompt_id="v3_chain_of_thought",  # Production prompt
    test_case_ids=[c for c in evaluator.list_test_cases() if c.startswith("REG_")],
)
# Check thresholds
passed, details = tester.run_regression_test(regression_results)
if passed:
    print("✓ Ready for production!")
else:
    print(tester.format_report(passed, details))
```

```python
from src.monitoring import DriftMonitor
monitor = DriftMonitor(baseline_results_path="data/results/results_baseline.json")
# Check current run for drift
drift_analysis = monitor.compare_to_baseline(current_results)
print(f"Health: {drift_analysis['overall_health']}")
for alert in monitor.get_alerts(drift_analysis):
    print(alert)

if monitor.should_rollback(drift_analysis):
    print("⚠️ Recommend rollback!")
```

- F1: ≥ 0.75 (balanced correctness)
- TNR: ≥ 0.65 (specificity; avoid over-flagging)
- Bias: |bias| ≤ 0.15 (balanced FP/FN)
- Precision: When we flag hallucinations, are we right?
- Recall: Do we catch all hallucinations?
- Accuracy: Overall correctness
- Cohen's Kappa: Agreement beyond chance
- Spearman/Pearson: Correlation (if scores provided)
- MAE/RMSE: Calibration quality
- Variance: Consistency across runs
- Self-consistency: Per-case consistency
See docs/METRICS.md for complete interpretation guide.
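To sanity-check the headline numbers by hand, the core metrics reduce to confusion-matrix counts. A minimal sketch using standard definitions, reusing `y_true` / `y_pred` from the Quick Start; the exact bias formula used by MetricsCalculator is not documented here, so FPR minus FNR is assumed:

```python
# Treat "hallucinated" as the positive class.
def confusion_counts(y_true, y_pred, positive="hallucinated"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp) if tp + fp else 0.0   # flagged hallucinations that really are
recall = tp / (tp + fn) if tp + fn else 0.0      # hallucinations we actually catch (TPR)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
tnr = tn / (tn + fp) if tn + fp else 0.0         # specificity: grounded cases not over-flagged
bias = recall - tnr                              # assumed definition: FPR - FNR == TPR - TNR
```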
| Mode | Definition | Cases | Difficulty |
|---|---|---|---|
| FM1: Factual Addition | True facts not in context | 15 | Easy-Hard |
| FM2: Fabrication | Outright false claims | 15 | Easy-Medium |
| FM3: Subtle Distortion | Small changes to facts | 15 | Medium-Hard |
| FM4: Valid Inference | Logical inferences | 15 | Easy-Hard |
| FM5: Verbatim Grounded | Direct quotes | 15 | Easy-Hard |
| FM6: Fluent Hallucination | Well-written but wrong | 15 | Hard |
| FM7: Partial Grounding | Mixed grounded/hallucinated | 10 | Hard |
See docs/FAILURE_MODES.md for detailed analysis.
| Strategy | Token Cost | Expected F1 | Best For |
|---|---|---|---|
| V1: Zero-shot | 1x | 0.55 | Baseline |
| V2: Few-shot | 1.3x | 0.70 | Quick iteration |
| V3: Chain-of-thought | 1.8x | 0.75 | Complex reasoning |
| V4: Rubric-based | 1.6x | 0.75 | Domain-specific |
| V5: Structured | 2.0x | 0.80 | Production systems |
| V6: Calibrated confidence | 1.5x | 0.78 | Best correlation metrics |
Recommended: V6 for production use - provides best calibration and correlation metrics.
See docs/PROMPTING_STRATEGIES.md for pros/cons.
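V5 and V6 ask the judge for structured output rather than a bare label. Below is a plausible shape for a V6 response and how it might be parsed; the JSON schema is an assumption, not the project's actual prompt contract. The parsed confidence is what feeds the Spearman/Pearson correlation and MAE/RMSE calibration metrics:

```python
import json

# Hypothetical raw judge output for the v6_calibrated_confidence prompt (schema assumed).
raw = (
    '{"label": "hallucinated", "confidence": 0.82, '
    '"rationale": "The response adds a completion date not present in the context."}'
)

parsed = json.loads(raw)
prediction = parsed["label"]               # feeds accuracy / F1 / TNR
confidence = float(parsed["confidence"])   # feeds correlation and calibration metrics
```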
- F1 ≥ 0.75 on training set
- TNR ≥ 0.65 (don't over-flag)
- |Bias| ≤ 0.15 (balanced)
- Regression test passed on held-out set
- A/B test winner selected
- Baseline metrics recorded
- Drift thresholds configured
- Dashboard monitoring active
- Error handling implemented
- Logging/alerts configured
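A minimal sketch of automating the first four checklist items as a single release gate, reusing the APIs shown earlier (return shapes assumed to match the examples above):

```python
from src.metrics import MetricsCalculator
from src.regression import RegressionTester

def release_gate(train_results, regression_results):
    """Pass only if training-set thresholds and the held-out regression test both pass."""
    calc = MetricsCalculator([r["ground_truth"] for r in train_results],
                             [r["prediction"] for r in train_results])
    metrics_ok, _ = calc.production_ready()  # checks F1 / TNR / bias thresholds

    tester = RegressionTester()
    regression_ok, details = tester.run_regression_test(regression_results)
    if not regression_ok:
        print(tester.format_report(regression_ok, details))

    return metrics_ok and regression_ok
```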
- DESIGN.md - Complete evaluation methodology
- METRICS.md - Every metric explained with interpretation scales
- FAILURE_MODES.md - 7 failure modes catalog
- PROMPTING_STRATEGIES.md - 6 strategies compared
- SLICE_ANALYSIS.md - Performance breakdown by data slices
- USER_GUIDE.md - Comprehensive user guide and tutorials
```bash
python -m pytest tests/test_metrics.py -v
```

```python
evaluator = Evaluator(model="gpt-4", api_key="...")
results = evaluator.evaluate_batch("v1_zero_shot")
evaluator.save_results(results, run_name="baseline")
```

```python
calc = MetricsCalculator(y_true, y_pred)
f1 = calc.f1() # Individual metric
metrics = calc.all_metrics() # All metrics
is_ready, details = calc.production_ready()  # Production check
```

```python
tester = ABTester()
test_result = tester.run_test(results_a, results_b, "v2", "v3")
tester.set_production_prompt("v3")monitor = DriftMonitor(baseline_results_path="...")
drift = monitor.compare_to_baseline(current_results)
alerts = monitor.get_alerts(drift)
should_rollback = monitor.should_rollback(drift)
```

Live demo: https://evaluation-system-v3.streamlit.app/

```python
tester = RegressionTester()
passed, details = tester.run_regression_test(regression_results)
report = tester.format_report(passed, details)
```

When adding new features:
- Update test cases if adding failure modes
- Document new metrics in METRICS.md
- Add unit tests in tests/
- Update this README
Portfolio project - Open for review and discussion
See the documentation files for detailed explanations. Each module includes comprehensive docstrings.
Last Updated: January 2026
Status: For demonstration and learning purposes