
Evaluation Harness Generator

A framework built on Google ADK (Agent Development Kit) that automatically generates comprehensive evaluation test cases for a set of test dimensions from a system prompt (currently the only supported input).

📋 Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager
  • Make (for development commands)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd eval_harness
  2. Install dependencies

    make install
  3. Verify installation

    uv run python -c "import google.adk; print('✅ Installation successful')"

Running the System

  1. Start the web interface

    make run
    # OR directly with ADK:
    cd src && PYTHONPATH=.. adk web .
  2. Access the interface

    • Open http://localhost:8000 in your browser
    • Navigate to the orchestrator app
    • Input your AUT (Application Under Test) prompt
  3. View results

    • Generated tests are saved in timestamped folders under outputs/
    • Each dimension produces a separate JSON file with test cases

πŸ—οΈ Architecture

Pipeline Overview

AUT Prompt → Rule Extraction → Dimension Processing → Test Generation
     ↓              ↓                    ↓                ↓
Static Config → Preprocessing → Parallel Agents → JSON Output

Key Components

  • OrchestratorAgent: Main orchestration and pipeline management
  • DimensionParallelAgent: Processes multiple test dimensions in parallel
  • SavePromptAgent: Automatically saves generated test cases to organized files
  • Dynamic Agents: Runtime creation of specialized testing agents per dimension

🔧 Development

Code Quality Commands

# Format code (Google Python Style)
make format

# Run comprehensive linting  
make lint

# Type checking
make typecheck

# Security scanning
make security

# Run all checks
make all

# Quick development iteration
make quick

Project Structure

├── src/
│   ├── orchestrator/          # Main orchestration logic
│   │   └── agent.py           # Core agent implementations
│   └── shared/                # Shared utilities and configuration
│       ├── config.py          # Configuration management
│       ├── config.yaml        # Main configuration file
│       └── util.py            # Utility functions
├── artifacts/
│   ├── taxonomy/              # Taxonomy definitions
│   │   ├── taxonomies.jsonl   # Taxonomy registry
│   │   └── *.md               # Individual dimension guides
│   └── coverage_engine/       # Coverage engine definitions
├── outputs/                   # Generated test outputs
└── pyproject.toml             # Project dependencies and config

βš™οΈ Configuration

Taxonomy Management

The taxonomy system defines test dimensions hierarchically (L1→L2→L3→L4) for comprehensive evaluation coverage.

Creating Taxonomy Entries

  1. Define the taxonomy structure in artifacts/taxonomy/taxonomies.jsonl:

    {
      "l1": "Reliability",           // Top-level category
      "l2": "Correctness",           // Domain area
      "l3": "Factual Accuracy",      // Specific aspect
      "l4": "Fabrication of facts",  // Test dimension
      "leaf": "Reliability__Correctness__Factual_Accuracy__Fabrication_of_facts"
    }

    Important naming conventions:

    • Use double underscores (__) to separate hierarchy levels in the leaf field
    • Replace spaces with underscores in the leaf name
    • Special characters in parentheses are allowed but will be sanitized for agent names
    • The leaf field must exactly match the corresponding .md filename
  2. Create the taxonomy guide at artifacts/taxonomy/{leaf_name}.md:

    Each guide should include:

    # **North Star Guide — [Test Dimension Name]**
    
    Definition: Clear explanation of what this dimension tests
    
    **Purpose:**
    Explain why testing this dimension is important
    
    ## **Diagnostic Hooks**
    - Specific patterns to detect
    - Key indicators of issues
    - Expected vs problematic behaviors
    
    ## **Testing Approach**
    - How to construct test cases
    - Edge cases to consider
    - Adversarial testing strategies
    
    ## **Evaluation Criteria**
    - Pass/fail conditions
    - Severity levels
    - Mitigation strategies

Example: Adding a New Privacy Test

  1. Add to taxonomies.jsonl:

    {
      "l1": "Safety",
      "l2": "Privacy",
      "l3": "Data Protection",
      "l4": "PII Exposure",
      "leaf": "Safety__Privacy__Data_Protection__PII_Exposure"
    }
  2. Create Safety__Privacy__Data_Protection__PII_Exposure.md:

    # **North Star Guide — PII Exposure**
    
    Definition: Testing for unauthorized disclosure of personally identifiable information
    
    **Purpose:**
    Ensure the agent doesn't reveal sensitive user data...

Configuration Settings

Update src/shared/config.yaml to customize paths:

taxonomy:
  taxonomy_root: "../artifacts/taxonomy/"  # Path to taxonomy files

outputs:
  outputs_dir: "../outputs/"  # Where test results are saved
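
Both values are relative paths. As a rough illustration (not the project's actual loader, which lives in src/shared/config.py), the settings could be read with PyYAML like this:

# Illustrative only: read config.yaml with PyYAML. The project's actual
# loader is src/shared/config.py and may behave differently.
from pathlib import Path

import yaml

config_path = Path("src/shared/config.yaml")  # adjust to your checkout
config = yaml.safe_load(config_path.read_text())

taxonomy_root = Path(config["taxonomy"]["taxonomy_root"])
outputs_dir = Path(config["outputs"]["outputs_dir"])
print(f"taxonomies: {taxonomy_root}, outputs: {outputs_dir}")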

Bulk Taxonomy Operations

For adding multiple taxonomies at once:

  1. Prepare your taxonomy data in a spreadsheet or script

  2. Generate JSONL entries (one JSON object per line):

    import json
    
    taxonomies = [
        {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "PII Exposure"},
        {"l1": "Safety", "l2": "Privacy", "l3": "Data Protection", "l4": "Data Retention"},
        # ... more entries
    ]
    
    # Append to the registry; each leaf joins the four levels with double
    # underscores and replaces spaces, per the naming conventions above.
    with open('taxonomies.jsonl', 'a') as f:
        for tax in taxonomies:
            tax['leaf'] = f"{tax['l1']}__{tax['l2']}__{tax['l3']}__{tax['l4']}".replace(' ', '_')
            f.write(json.dumps(tax) + '\n')
  3. Batch create guide files using templates for consistency (see the sketch after this list)
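
A hedged sketch of step 3: generate a stub guide per registry entry from a simple template. The template text and the skip-if-exists check are illustrative choices, not part of the project:

# Illustrative sketch: create a stub guide file for every taxonomy entry.
# Assumes taxonomies.jsonl already contains the new entries.
import json
from pathlib import Path

TAXONOMY_ROOT = Path("artifacts/taxonomy")

TEMPLATE = (
    "# **North Star Guide — {l4}**\n\n"
    "Definition: TODO\n\n"
    "**Purpose:**\nTODO\n\n"
    "## **Diagnostic Hooks**\n- TODO\n\n"
    "## **Testing Approach**\n- TODO\n\n"
    "## **Evaluation Criteria**\n- TODO\n"
)

with (TAXONOMY_ROOT / "taxonomies.jsonl").open() as f:
    for line in f:
        entry = json.loads(line)
        guide_path = TAXONOMY_ROOT / f"{entry['leaf']}.md"
        if not guide_path.exists():  # never overwrite hand-written guides
            guide_path.write_text(TEMPLATE.format(l4=entry["l4"]))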

Best Practices

  • Hierarchy Design: Keep L1-L3 broad and reusable, make L4 specific to test cases
  • Naming: Use descriptive but concise names; avoid overly long leaf names
  • Documentation: Each taxonomy guide should be self-contained with clear examples
  • Version Control: Track changes to taxonomies as they evolve with your testing needs

Coverage Engines

Configure testing approaches in artifacts/coverage_engine/:

  • coverage_engines.jsonl: Engine registry
  • {engine_name}.md: Implementation guides

Model Configuration

Update src/shared/config.yaml:

model:
  default: "gemini-1.5-flash"  # or any LiteLLM-supported model

outputs:
  outputs_dir: "../outputs/"

🧪 Example Usage

Basic Test Generation

# The system processes your AUT prompt through:
# 1. Rule extraction from the prompt
# 2. Dimension-specific filtering and annotation  
# 3. Coverage goal generation
# 4. Test case generation across all 50 dimensions
# 5. Automatic JSON output organization

# Example AUT prompt:
"A customer service chatbot that helps users with account issues, 
billing questions, and technical support while maintaining 
professional tone and protecting user privacy."

Output Structure

outputs/
└── run_2025-01-24_10-30-45/
    ├── Fabrication_of_nonexistent_facts_candidate_prompts.json
    ├── Confabulated_references_candidate_prompts.json
    ├── Privacy_leakage_candidate_prompts.json
    └── ... (50 dimension-specific files)
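
Each dimension file can be consumed on its own. A small illustrative snippet that walks the latest run folder and reports how many entries each file holds (the internal schema of the JSON files is not specified here, so the count is approximate):

# Illustrative sketch: summarize the most recent run folder under outputs/.
# The internal schema of each *_candidate_prompts.json file may vary.
import json
from pathlib import Path

runs = sorted(Path("outputs").glob("run_*"))  # timestamped names sort chronologically
if not runs:
    raise SystemExit("no run folders found under outputs/")

for dimension_file in sorted(runs[-1].glob("*_candidate_prompts.json")):
    data = json.loads(dimension_file.read_text())
    count = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{dimension_file.stem}: {count} entries")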

πŸ” Key Features Deep Dive

Agent Name Sanitization

The system automatically converts hierarchical taxonomy names to valid ADK agent identifiers:

  • Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts
  • → Fabrication_of_nonexistent_facts_SieveAgent
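
The exact sanitization rules are internal to the orchestrator; a rough, assumed equivalent that keeps only the last hierarchy level and replaces invalid identifier characters would look like:

# Assumed behavior, not the project's actual implementation: keep the last
# hierarchy level and replace any character that is invalid in an identifier.
import re

def sanitize_agent_name(leaf: str, suffix: str = "SieveAgent") -> str:
    last_level = leaf.split("__")[-1]
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", last_level)
    return f"{cleaned}_{suffix}"

name = sanitize_agent_name(
    "Reliability__Correctness__Factual_Accuracy__Fabrication_of_nonexistent_facts"
)
print(name)  # Fabrication_of_nonexistent_facts_SieveAgent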

Dynamic Configuration Loading

  • Fallback to hardcoded values when taxonomy files are missing (sketched after this list)
  • Real-time loading of test dimension guides
  • Flexible coverage engine configuration
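
A sketch of the fallback pattern described above; the default entry mirrors the registry example earlier and stands in for the real hardcoded values:

# Illustrative fallback pattern; the default entry below stands in for the
# real hardcoded values.
import json
from pathlib import Path

FALLBACK_TAXONOMIES = [
    {
        "l1": "Reliability", "l2": "Correctness",
        "l3": "Factual Accuracy", "l4": "Fabrication of facts",
        "leaf": "Reliability__Correctness__Factual_Accuracy__Fabrication_of_facts",
    },
]

def load_taxonomies(taxonomy_root: str) -> list[dict]:
    registry = Path(taxonomy_root) / "taxonomies.jsonl"
    if not registry.exists():
        return FALLBACK_TAXONOMIES  # registry missing: use built-in defaults
    with registry.open() as f:
        return [json.loads(line) for line in f if line.strip()]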

Timestamped Organization

  • Each run creates a unique timestamped folder (see the sketch after this list)
  • Dimension-specific JSON output files
  • Complete traceability of test generation sessions
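
For reference, a run folder name in this format can be derived in a couple of lines (illustrative; the project's own naming code may differ):

# Illustrative: derive a run folder name in the format shown above.
from datetime import datetime
from pathlib import Path

run_dir = Path("outputs") / f"run_{datetime.now():%Y-%m-%d_%H-%M-%S}"
run_dir.mkdir(parents=True, exist_ok=True)
print(run_dir)  # e.g. outputs/run_2025-01-24_10-30-45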

🤝 Contributing

  1. Setup development environment

    uv sync --dev
    uv run pre-commit install
  2. Follow code standards

    • Google Python Style Guide
    • Comprehensive type hints
    • Security-first development
  3. Run full validation

    make all  # Format, lint, typecheck, security scan

🔧 Troubleshooting

Common Issues

Agent naming errors: Keep taxonomy leaf names free of unusual special characters. The system sanitizes names automatically, but complex characters can still produce invalid agent identifiers.

Missing taxonomy files: The system provides fallback values, but create artifacts/taxonomy/taxonomies.jsonl and corresponding .md files for full functionality.

uv environment conflicts: Use uv sync in a clean directory to avoid environment path conflicts.

Getting Help

  • Check the comprehensive make help: make help
  • Review configuration in src/shared/config.yaml
  • Examine example taxonomy files in artifacts/taxonomy/

🌟 Advanced Usage

For enterprise deployments, consider:

  • Custom taxonomy development for domain-specific testing
  • Integration with existing CI/CD pipelines
  • Scaling across multiple AI model providers
  • Custom coverage engine implementations

The system's modular architecture supports extensive customization while preserving the integrity of the core testing pipeline.
