A framework built on Google ADK (Agent Development Kit) that automatically generates comprehensive evaluation test cases for a given set of dimensions and a system prompt (currently the only supported input).
- Python 3.12+
- uv package manager
- Make (for development commands)
- **Clone the repository**

  ```bash
  git clone <repository-url>
  cd eval_harness
  ```
- **Install dependencies**

  ```bash
  make install
  ```
- **Verify installation**

  ```bash
  uv run python -c "import google.adk; print('✓ Installation successful')"
  ```
- **Start the web interface**

  ```bash
  make run
  # OR directly with ADK:
  cd src && PYTHONPATH=.. adk web .
  ```
- **Access the interface**
  - Open http://localhost:8000 in your browser
  - Navigate to the orchestrator app
  - Input your AUT (Application Under Test) prompt
- **View results**
  - Generated tests are saved in timestamped folders under `outputs/`
  - Each dimension produces a separate JSON file with test cases
```
AUT Prompt → Rule Extraction → Dimension Processing → Test Generation
    ↓              ↓                   ↓                    ↓
Static Config → Preprocessing → Parallel Agents → JSON Output
```
- OrchestratorAgent: Main orchestration and pipeline management
- DimensionParallelAgent: Processes multiple test dimensions in parallel
- SavePromptAgent: Automatically saves generated test cases to organized files
- Dynamic Agents: Runtime creation of specialized testing agents per dimension
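
To make the dynamic-agent idea concrete, here is a minimal sketch of how per-dimension agents could be fanned out with ADK's `LlmAgent` and `ParallelAgent`. The function name and the `guides` dict are illustrative assumptions, not the actual implementation in `src/orchestrator/agent.py`:

```python
from google.adk.agents import LlmAgent, ParallelAgent

def build_dimension_agent(leaf: str, guide_text: str) -> LlmAgent:
    """Create one specialized test-generation agent for a taxonomy dimension."""
    l4 = leaf.split("__")[-1]  # last hierarchy level, e.g. "Fabrication_of_facts"
    return LlmAgent(
        name=f"{l4}_SieveAgent",
        model="gemini-1.5-flash",
        instruction=f"Generate test cases for this dimension:\n{guide_text}",
    )

# guides: dict mapping leaf names to loaded North Star guide text (assumed)
guides = {"Safety__Privacy__Data_Protection__PII_Exposure": "..."}

# Run all dimension agents concurrently, as DimensionParallelAgent does
parallel = ParallelAgent(
    name="DimensionParallelAgent",
    sub_agents=[build_dimension_agent(leaf, text) for leaf, text in guides.items()],
)
```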
```bash
# Format code (Google Python Style)
make format

# Run comprehensive linting
make lint

# Type checking
make typecheck

# Security scanning
make security

# Run all checks
make all

# Quick development iteration
make quick
```

```
├── src/
│   ├── orchestrator/          # Main orchestration logic
│   │   └── agent.py           # Core agent implementations
│   └── shared/                # Shared utilities and configuration
│       ├── config.py          # Configuration management
│       ├── config.yaml        # Main configuration file
│       └── util.py            # Utility functions
├── artifacts/
│   ├── taxonomy/              # Taxonomy definitions
│   │   ├── taxonomies.jsonl   # Taxonomy registry
│   │   └── *.md               # Individual dimension guides
│   └── coverage_engine/       # Coverage engine definitions
├── outputs/                   # Generated test outputs
└── pyproject.toml             # Project dependencies and config
```
The taxonomy system defines test dimensions hierarchically (L1 → L2 → L3 → L4) for comprehensive evaluation coverage.
- **Define the taxonomy structure** in `artifacts/taxonomy/taxonomies.jsonl`:

  ```json
  {
    "l1": "Reliability",           // Top-level category
    "l2": "Correctness",           // Domain area
    "l3": "Factual Accuracy",      // Specific aspect
    "l4": "Fabrication of facts",  // Test dimension
    "leaf": "Reliability__Correctness__Factual_Accuracy__Fabrication_of_facts"
  }
  ```

  Important naming conventions:

  - Use double underscores (`__`) to separate hierarchy levels in the `leaf` field
  - Replace spaces with underscores in the `leaf` name
  - Special characters in parentheses are allowed but will be sanitized for agent names
  - The `leaf` field must exactly match the corresponding `.md` filename
- **Create the taxonomy guide** at `artifacts/taxonomy/{leaf_name}.md`. Each guide should include:

  ```markdown
  # **North Star Guide - [Test Dimension Name]**

  Definition: Clear explanation of what this dimension tests

  **Purpose:** Explain why testing this dimension is important

  ## **Diagnostic Hooks**
  - Specific patterns to detect
  - Key indicators of issues
  - Expected vs problematic behaviors

  ## **Testing Approach**
  - How to construct test cases
  - Edge cases to consider
  - Adversarial testing strategies

  ## **Evaluation Criteria**
  - Pass/fail conditions
  - Severity levels
  - Mitigation strategies
  ```
- **Add to `taxonomies.jsonl`:**

  ```json
  {
    "l1": "Safety",
    "l2": "Privacy",
    "l3": "Data Protection",
    "l4": "PII Exposure",
    "leaf": "Safety__Privacy__Data_Protection__PII_Exposure"
  }
  ```

- **Create `Safety__Privacy__Data_Protection__PII_Exposure.md`:**

  ```markdown
  # **North Star Guide - PII Exposure**

  Definition: Testing for unauthorized disclosure of personally identifiable information

  **Purpose:** Ensure the agent doesn't reveal sensitive user data...
  ```
Update `src/shared/config.yaml` to customize paths:

```yaml
taxonomy:
  taxonomy_root: "../artifacts/taxonomy/"  # Path to taxonomy files
outputs:
  outputs_dir: "../outputs/"               # Where test results are saved
```
For adding multiple taxonomies at once:

- **Prepare your taxonomy data** in a spreadsheet or script

- **Generate JSONL entries** (one JSON object per line):

  ```python
  import json

  taxonomies = [
      {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "PII Exposure"},
      {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "Data Retention"},
      # ... more entries
  ]

  with open('taxonomies.jsonl', 'a') as f:
      for tax in taxonomies:
          tax['leaf'] = f"{tax['l1']}__{tax['l2']}__{tax['l3']}__{tax['l4']}".replace(' ', '_')
          f.write(json.dumps(tax) + '\n')
  ```
- **Batch create guide files** using templates for consistency (see the sketch below)
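
One way to do the batch creation, sketched under the assumption that a stub guide only needs the L4 name filled into the North Star template (the `TEMPLATE` string here is a hypothetical minimal version):

```python
import json
from pathlib import Path

TEMPLATE = "# **North Star Guide - {name}**\n\nDefinition: TODO\n\n**Purpose:** TODO\n"

root = Path("artifacts/taxonomy")
for line in (root / "taxonomies.jsonl").read_text().splitlines():
    if not line.strip():
        continue
    tax = json.loads(line)
    guide = root / f"{tax['leaf']}.md"  # leaf must match the .md filename exactly
    if not guide.exists():              # never overwrite a hand-written guide
        guide.write_text(TEMPLATE.format(name=tax["l4"]))
```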
- **Hierarchy Design**: Keep L1-L3 broad and reusable; make L4 specific to test cases
- **Naming**: Use descriptive but concise names; avoid overly long leaf names
- **Documentation**: Each taxonomy guide should be self-contained with clear examples
- **Version Control**: Track changes to taxonomies as they evolve with your testing needs
Configure testing approaches in `artifacts/coverage_engine/`:

- `coverage_engines.jsonl`: Engine registry
- `{engine_name}.md`: Implementation guides
Update `src/shared/config.yaml`:

```yaml
model:
  default: "gemini-1.5-flash"  # or any LiteLLM-supported model
outputs:
  outputs_dir: "../outputs/"
```

The system processes your AUT prompt through:

1. Rule extraction from the prompt
2. Dimension-specific filtering and annotation
3. Coverage goal generation
4. Test case generation across all 50 dimensions
5. Automatic JSON output organization
Example AUT prompt:

> "A customer service chatbot that helps users with account issues, billing questions, and technical support while maintaining professional tone and protecting user privacy."

The resulting output structure:

```
outputs/
└── run_2025-01-24_10-30-45/
    ├── Fabrication_of_nonexistent_facts_candidate_prompts.json
    ├── Confabulated_references_candidate_prompts.json
    ├── Privacy_leakage_candidate_prompts.json
    └── ... (50 dimension-specific files)
```
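
A quick way to inspect the latest run from Python. This sketch assumes each `*_candidate_prompts.json` file contains a JSON array of test cases, which may differ from the actual output schema:

```python
import json
from pathlib import Path

latest = sorted(Path("outputs").glob("run_*"))[-1]  # most recent timestamped run
for f in sorted(latest.glob("*_candidate_prompts.json")):
    cases = json.loads(f.read_text())
    print(f"{f.stem}: {len(cases)} test cases")
```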
The system automatically converts hierarchical taxonomy names to valid ADK agent identifiers:

- `Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts` → `Fabrication_of_nonexistent_facts_SieveAgent`
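
The conversion roughly amounts to keeping the L4 segment and stripping characters that are not valid in an ADK agent name. The helper below is illustrative, not the actual implementation:

```python
import re

def to_agent_name(leaf: str, suffix: str = "SieveAgent") -> str:
    """Illustrative: derive an ADK-safe agent name from a taxonomy leaf."""
    l4 = leaf.split("__")[-1]                 # keep only the L4 dimension segment
    safe = re.sub(r"[^A-Za-z0-9_]", "_", l4)  # agent names must be valid identifiers
    return f"{safe}_{suffix}"

print(to_agent_name(
    "Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts"
))  # Fabrication_of_nonexistent_facts_SieveAgent
```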
- Fallback to hardcoded values when taxonomy files are missing
- Real-time loading of test dimension guides
- Flexible coverage engine configuration
- Each run creates a unique timestamped folder
- Dimension-specific JSON output files
- Complete traceability of test generation sessions
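
The folder naming follows the `run_YYYY-MM-DD_HH-MM-SS` pattern shown above; creating one is a short sketch (the actual implementation may differ):

```python
from datetime import datetime
from pathlib import Path

run_dir = Path("outputs") / datetime.now().strftime("run_%Y-%m-%d_%H-%M-%S")
run_dir.mkdir(parents=True, exist_ok=True)  # one unique folder per generation session
```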
- **Setup development environment**

  ```bash
  uv sync --dev
  uv run pre-commit install
  ```
- **Follow code standards**
  - Google Python Style Guide
  - Comprehensive type hints
  - Security-first development
- **Run full validation**

  ```bash
  make all  # Format, lint, typecheck, security scan
  ```
**Agent naming errors**: Ensure taxonomy leaf names don't contain special characters. The system sanitizes names automatically, but complex characters may still cause issues.

**Missing taxonomy files**: The system provides fallback values, but create `artifacts/taxonomy/taxonomies.jsonl` and the corresponding `.md` files for full functionality.

**uv environment conflicts**: Run `uv sync` in a clean directory to avoid environment path conflicts.
- Check the comprehensive make help: `make help`
- Review configuration in `src/shared/config.yaml`
- Examine example taxonomy files in `artifacts/taxonomy/`
For enterprise deployments, consider:
- Custom taxonomy development for domain-specific testing
- Integration with existing CI/CD pipelines
- Scaling across multiple AI model providers
- Custom coverage engine implementations
The system's modular architecture supports extensive customization while maintaining the core testing pipeline integrity.