A framework built on Google ADK (Agent Development Kit) that automatically generates comprehensive evaluation test cases for a given set of dimensions and a system prompt (currently the only supported input).
- Python 3.12+
- uv package manager
- Make (for development commands)
- **Clone the repository**

  ```bash
  git clone <repository-url>
  cd eval_harness
  ```
- **Install dependencies**

  ```bash
  make install
  ```
- **Verify installation**

  ```bash
  uv run python -c "import google.adk; print('✓ Installation successful')"
  ```
- **Start the web interface**

  ```bash
  make run
  # OR directly with ADK:
  cd src && PYTHONPATH=.. adk web .
  ```
- **Access the interface**
  - Open http://localhost:8000 in your browser
  - Navigate to the orchestrator app
  - Input your AUT (Application Under Test) prompt
- **View results**
  - Generated tests are saved in timestamped folders under `outputs/`
  - Each dimension produces a separate JSON file with test cases
```
AUT Prompt → Rule Extraction → Dimension Processing → Test Generation
    ↓              ↓                   ↓                    ↓
Static Config → Preprocessing → Parallel Agents → JSON Output
```
- OrchestratorAgent: Main orchestration and pipeline management
- DimensionParallelAgent: Processes multiple test dimensions in parallel
- SavePromptAgent: Automatically saves generated test cases to organized files
- Dynamic Agents: Runtime creation of specialized testing agents per dimension
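
To make the dynamic-agent idea concrete, here is a minimal sketch of how per-dimension agents could be fanned out with ADK's `LlmAgent` and `ParallelAgent`. The function name and the `guides` dict are illustrative assumptions, not the actual implementation in `src/orchestrator/agent.py`:

```python
from google.adk.agents import LlmAgent, ParallelAgent

def build_dimension_agent(leaf: str, guide_text: str) -> LlmAgent:
    """Create one specialized test-generation agent for a taxonomy dimension."""
    l4 = leaf.split("__")[-1]  # last hierarchy level, e.g. "Fabrication_of_facts"
    return LlmAgent(
        name=f"{l4}_SieveAgent",
        model="gemini-1.5-flash",
        instruction=f"Generate test cases for this dimension:\n{guide_text}",
    )

# guides: dict mapping leaf names to loaded North Star guide text (assumed)
guides = {"Safety__Privacy__Data_Protection__PII_Exposure": "..."}

# Run all dimension agents concurrently, as DimensionParallelAgent does
parallel = ParallelAgent(
    name="DimensionParallelAgent",
    sub_agents=[build_dimension_agent(leaf, text) for leaf, text in guides.items()],
)
```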
```bash
# Format code (Google Python Style)
make format

# Run comprehensive linting
make lint

# Type checking
make typecheck

# Security scanning
make security

# Run all checks
make all

# Quick development iteration
make quick
```

```
├── src/
│   ├── orchestrator/          # Main orchestration logic
│   │   └── agent.py           # Core agent implementations
│   └── shared/                # Shared utilities and configuration
│       ├── config.py          # Configuration management
│       ├── config.yaml        # Main configuration file
│       └── util.py            # Utility functions
├── artifacts/
│   ├── taxonomy/              # Taxonomy definitions
│   │   ├── taxonomies.jsonl   # Taxonomy registry
│   │   └── *.md               # Individual dimension guides
│   └── coverage_engine/       # Coverage engine definitions
├── outputs/                   # Generated test outputs
└── pyproject.toml             # Project dependencies and config
```
The taxonomy system defines test dimensions hierarchically (L1 → L2 → L3 → L4) for comprehensive evaluation coverage.
- **Define the taxonomy structure** in `artifacts/taxonomy/taxonomies.jsonl`:

  ```json
  {
    "l1": "Reliability",           // Top-level category
    "l2": "Correctness",           // Domain area
    "l3": "Factual Accuracy",      // Specific aspect
    "l4": "Fabrication of facts",  // Test dimension
    "leaf": "Reliability__Correctness__Factual_Accuracy__Fabrication_of_facts"
  }
  ```

  Important naming conventions:

  - Use double underscores (`__`) to separate hierarchy levels in the `leaf` field
  - Replace spaces with underscores in the `leaf` name
  - Special characters in parentheses are allowed but will be sanitized for agent names
  - The `leaf` field must exactly match the corresponding `.md` filename
- **Create the taxonomy guide** at `artifacts/taxonomy/{leaf_name}.md`. Each guide should include:

  ```markdown
  # **North Star Guide - [Test Dimension Name]**

  Definition: Clear explanation of what this dimension tests

  **Purpose:** Explain why testing this dimension is important

  ## **Diagnostic Hooks**
  - Specific patterns to detect
  - Key indicators of issues
  - Expected vs problematic behaviors

  ## **Testing Approach**
  - How to construct test cases
  - Edge cases to consider
  - Adversarial testing strategies

  ## **Evaluation Criteria**
  - Pass/fail conditions
  - Severity levels
  - Mitigation strategies
  ```
- **Add to `taxonomies.jsonl`:**

  ```json
  {
    "l1": "Safety",
    "l2": "Privacy",
    "l3": "Data Protection",
    "l4": "PII Exposure",
    "leaf": "Safety__Privacy__Data_Protection__PII_Exposure"
  }
  ```

- **Create `Safety__Privacy__Data_Protection__PII_Exposure.md`:**

  ```markdown
  # **North Star Guide - PII Exposure**

  Definition: Testing for unauthorized disclosure of personally identifiable information

  **Purpose:** Ensure the agent doesn't reveal sensitive user data...
  ```
Update `src/shared/config.yaml` to customize paths:

```yaml
taxonomy:
  taxonomy_root: "../artifacts/taxonomy/"  # Path to taxonomy files
outputs:
  outputs_dir: "../outputs/"               # Where test results are saved
```
For adding multiple taxonomies at once:

- **Prepare your taxonomy data** in a spreadsheet or script

- **Generate JSONL entries** (one JSON object per line):

  ```python
  import json

  taxonomies = [
      {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "PII Exposure"},
      {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "Data Retention"},
      # ... more entries
  ]

  with open('taxonomies.jsonl', 'a') as f:
      for tax in taxonomies:
          tax['leaf'] = f"{tax['l1']}__{tax['l2']}__{tax['l3']}__{tax['l4']}".replace(' ', '_')
          f.write(json.dumps(tax) + '\n')
  ```
- **Batch create guide files** using templates for consistency (see the sketch below)
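
One way to do the batch creation, sketched under the assumption that a stub guide only needs the L4 name filled into the North Star template (the `TEMPLATE` string here is a hypothetical minimal version):

```python
import json
from pathlib import Path

TEMPLATE = "# **North Star Guide - {name}**\n\nDefinition: TODO\n\n**Purpose:** TODO\n"

root = Path("artifacts/taxonomy")
for line in (root / "taxonomies.jsonl").read_text().splitlines():
    if not line.strip():
        continue
    tax = json.loads(line)
    guide = root / f"{tax['leaf']}.md"  # leaf must match the .md filename exactly
    if not guide.exists():              # never overwrite a hand-written guide
        guide.write_text(TEMPLATE.format(name=tax["l4"]))
```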
- **Hierarchy Design**: Keep L1-L3 broad and reusable; make L4 specific to test cases
- **Naming**: Use descriptive but concise names; avoid overly long leaf names
- **Documentation**: Each taxonomy guide should be self-contained with clear examples
- **Version Control**: Track changes to taxonomies as they evolve with your testing needs
Configure testing approaches in `artifacts/coverage_engine/`:

- `coverage_engines.jsonl`: Engine registry
- `{engine_name}.md`: Implementation guides
Update `src/shared/config.yaml`:

```yaml
model:
  default: "gemini-1.5-flash"  # or any LiteLLM-supported model
outputs:
  outputs_dir: "../outputs/"
```

The system processes your AUT prompt through:

1. Rule extraction from the prompt
2. Dimension-specific filtering and annotation
3. Coverage goal generation
4. Test case generation across all 50 dimensions
5. Automatic JSON output organization
Example AUT prompt:

> "A customer service chatbot that helps users with account issues, billing questions, and technical support while maintaining professional tone and protecting user privacy."

The resulting output structure:

```
outputs/
└── run_2025-01-24_10-30-45/
    ├── Fabrication_of_nonexistent_facts_candidate_prompts.json
    ├── Confabulated_references_candidate_prompts.json
    ├── Privacy_leakage_candidate_prompts.json
    └── ... (50 dimension-specific files)
```
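
A quick way to inspect the latest run from Python. This sketch assumes each `*_candidate_prompts.json` file contains a JSON array of test cases, which may differ from the actual output schema:

```python
import json
from pathlib import Path

latest = sorted(Path("outputs").glob("run_*"))[-1]  # most recent timestamped run
for f in sorted(latest.glob("*_candidate_prompts.json")):
    cases = json.loads(f.read_text())
    print(f"{f.stem}: {len(cases)} test cases")
```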
The system automatically converts hierarchical taxonomy names to valid ADK agent identifiers:

- `Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts` → `Fabrication_of_nonexistent_facts_SieveAgent`
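
The conversion roughly amounts to keeping the L4 segment and stripping characters that are not valid in an ADK agent name. The helper below is illustrative, not the actual implementation:

```python
import re

def to_agent_name(leaf: str, suffix: str = "SieveAgent") -> str:
    """Illustrative: derive an ADK-safe agent name from a taxonomy leaf."""
    l4 = leaf.split("__")[-1]                 # keep only the L4 dimension segment
    safe = re.sub(r"[^A-Za-z0-9_]", "_", l4)  # agent names must be valid identifiers
    return f"{safe}_{suffix}"

print(to_agent_name(
    "Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts"
))  # Fabrication_of_nonexistent_facts_SieveAgent
```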
- Fallback to hardcoded values when taxonomy files are missing
- Real-time loading of test dimension guides
- Flexible coverage engine configuration
- Each run creates a unique timestamped folder
- Dimension-specific JSON output files
- Complete traceability of test generation sessions
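
The folder naming follows the `run_YYYY-MM-DD_HH-MM-SS` pattern shown above; creating one is a short sketch (the actual implementation may differ):

```python
from datetime import datetime
from pathlib import Path

run_dir = Path("outputs") / datetime.now().strftime("run_%Y-%m-%d_%H-%M-%S")
run_dir.mkdir(parents=True, exist_ok=True)  # one unique folder per generation session
```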
- **Setup development environment**

  ```bash
  uv sync --dev
  uv run pre-commit install
  ```
- **Follow code standards**
  - Google Python Style Guide
  - Comprehensive type hints
  - Security-first development
- **Run full validation**

  ```bash
  make all  # Format, lint, typecheck, security scan
  ```
**Agent naming errors**: Ensure taxonomy leaf names don't contain special characters. The system sanitizes names automatically, but complex characters may still cause issues.

**Missing taxonomy files**: The system provides fallback values, but create `artifacts/taxonomy/taxonomies.jsonl` and the corresponding `.md` files for full functionality.

**uv environment conflicts**: Run `uv sync` in a clean directory to avoid environment path conflicts.
- Check the comprehensive make help: `make help`
- Review configuration in `src/shared/config.yaml`
- Examine example taxonomy files in `artifacts/taxonomy/`
For enterprise deployments, consider:
- Custom taxonomy development for domain-specific testing
- Integration with existing CI/CD pipelines
- Scaling across multiple AI model providers
- Custom coverage engine implementations
The system's modular architecture supports extensive customization while maintaining the core testing pipeline integrity.