Professional AI agent evaluation and testing framework with multi-tier assessment capabilities
ASTK is a comprehensive testing framework for AI agents that evaluates performance, intelligence, and capabilities through diverse scenarios. Test your agents against real-world tasks ranging from basic functionality to expert-level multi-layer evaluation using OpenAI's professional grading infrastructure.
- 🧠 3-Tier Evaluation System: From basic testing to PhD-level assessment
- 🔬 Rigorous Multi-Layer Evaluation: Professional-grade assessment using multiple OpenAI evaluators
- 📊 Performance Metrics: Comprehensive analysis with response time, success rate, and quality scoring
- 🔧 Easy Installation: Simple pip install from PyPI with full cross-platform support
- 🌐 Universal Testing: Works with CLI agents, REST APIs, Python modules, and more
- 🤖 Agent Ready: Compatible with LangChain, OpenAI, and custom agents
- 📚 Built-in Examples: File Q&A agent and project templates included
- ⚙️ GitHub Actions: Ready-to-use CI/CD workflow templates
- 🎯 OpenAI Evals Integration: Enterprise-grade evaluation using OpenAI's infrastructure
- 💰 Cost Management: Built-in cost estimation and budgeting controls
ASTK provides three distinct testing tiers to meet different development and assessment needs:
| Tier | Purpose | Cost | Time | Pass Rate | Use Case |
|---|---|---|---|---|---|
| 🟢 TIER 1 Basic Testing | Development feedback | FREE | 2-3 min | 80-100% | Daily development iterations |
| 🟡 TIER 2 Professional | Production validation | $1-5 | 5-10 min | 60-80% | Pre-deployment assessment |
| 🔴 TIER 3 Rigorous | Expert assessment | $7-15 | 10-20 min | 10-30% | Research, competition, academic |
# Standard installation
pip install agent-sprint-testkit
# With OpenAI Evals support for rigorous evaluation
pip install agent-sprint-testkit[evals]
export OPENAI_API_KEY="your-api-key-here"
python -m astk.cli init my-agent-tests
cd my-agent-tests
# Tier 1: Basic Testing (FREE)
python -m astk.cli benchmark examples/agents/file_qa_agent.py
# Tier 2: Professional Evaluation ($1-5)
python -m astk.cli evals create my_agent.py --eval-type code_qa
# Tier 3: Rigorous Multi-Layer Assessment ($7-15)
python -m astk.cli rigorous run my_agent.py --max-cost 10.0
Our flagship feature provides professional-grade AI agent assessment using multiple specialized OpenAI evaluators.
- 🎯 Multiple Evaluation Layers: Each scenario uses 2-4 different OpenAI models as specialized evaluators
- 🎓 Expert-Level Assessment: 4-tier difficulty system from foundational to expert integration
- 🧠 Domain-Specific Grading: Specialized evaluation prompts for security, ethics, systems thinking, etc.
- 📊 Comprehensive Scoring: Detailed dimension scores with weighted overall assessment
- 💰 Cost Transparent: Built-in cost estimation and budgeting controls
- ⚡ Parallel Execution: Optional parallel processing for faster results
| Tier | Difficulty | Focus | Scenarios | Pass Threshold |
|---|---|---|---|---|
| 🎯 Tier 1 | Foundational | Mathematical reasoning, Technical explanations | 2 scenarios | 7.0+ |
| 🧠 Tier 2 | Advanced | Creative problem-solving, Ethical dilemmas | 2 scenarios | 7.5+ |
| ⚡ Tier 3 | Expert | Systems analysis, Security assessment | 2 scenarios | 8.0+ |
| 🌀 Tier 4 | Extreme | Logical paradoxes, Crisis coordination | 2 scenarios | 8.5+ |
| 🔥 Chaos | Adversarial | Prompt injection resistance | 1 scenario | 9.0+ |
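As a concrete illustration of the weighted overall assessment and per-tier pass thresholds described above, here is a minimal sketch of combining per-dimension evaluator scores into one score. The dimension names and weights are illustrative assumptions, not ASTK's actual configuration:

```python
# Illustrative sketch: combine per-dimension evaluator scores (0-10)
# into one weighted overall score. Dimension names and weights are
# assumptions for illustration, not ASTK internals.
DIMENSION_WEIGHTS = {
    "reasoning": 0.35,
    "accuracy": 0.35,
    "safety": 0.20,
    "clarity": 0.10,
}

def weighted_overall(scores: dict) -> float:
    """Weighted mean of dimension scores on a 0-10 scale."""
    total = sum(DIMENSION_WEIGHTS[d] * scores[d] for d in DIMENSION_WEIGHTS)
    return round(total, 2)

def passes_tier(overall: float, threshold: float) -> bool:
    """Compare the overall score against a tier's pass threshold."""
    return overall >= threshold

scores = {"reasoning": 8.5, "accuracy": 9.0, "safety": 7.0, "clarity": 8.0}
overall = weighted_overall(scores)
```

With these example scores the agent clears the Expert threshold (8.0+) but not the Extreme one (8.5+).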
# Basic rigorous evaluation
python -m astk.cli rigorous run my_agent.py
# Development-friendly (lower cost)
python -m astk.cli rigorous run my_agent.py --max-cost 3.0 --fail-fast
# Professional assessment (parallel execution)
python -m astk.cli rigorous run my_agent.py \
--evaluators gpt-4 o1-preview gpt-4-turbo \
--max-cost 15.0 \
--parallel \
--output-format detailed \
--save-results
# Set cost limits
python -m astk.cli rigorous run my_agent.py --max-cost 10.0
# Use cost-effective evaluator combinations
python -m astk.cli rigorous run my_agent.py --evaluators gpt-4
# Quick development testing
python -m astk.cli rigorous run my_agent.py --max-cost 2.0 --fail-fast
Estimated Costs:
- Complete rigorous suite (9 scenarios): ~$6.50
- Single expert scenario: ~$0.50-$1.30
- Foundational scenarios: ~$0.30-$0.40
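A back-of-the-envelope way to reason about these numbers: total cost scales roughly with scenarios × evaluators × cost per grading call. The sketch below assumes a flat $0.24 per evaluator call, chosen so that 9 scenarios with ~3 evaluators each lands near the ~$6.50 quoted above; it is not ASTK's actual pricing logic:

```python
# Rough cost estimator: scenarios x evaluators x approximate cost per
# grading call. The $0.24 per-call figure is an illustrative assumption.
COST_PER_EVALUATOR_CALL = 0.24  # USD, assumed average

def estimate_cost(num_scenarios: int, evaluators_per_scenario: int) -> float:
    return round(num_scenarios * evaluators_per_scenario * COST_PER_EVALUATOR_CALL, 2)

def within_budget(num_scenarios: int, evaluators: int, max_cost: float) -> bool:
    """Mirrors the idea behind --max-cost: skip the run if the estimate exceeds it."""
    return estimate_cost(num_scenarios, evaluators) <= max_cost

full_suite = estimate_cost(9, 3)
```

This kind of pre-flight estimate is why `--max-cost` can refuse a run before any API calls are made.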
Professional-grade agent evaluation using OpenAI's enterprise infrastructure.
- 🚀 Enterprise-grade evaluation using OpenAI's infrastructure
- 🎯 AI-powered grading with detailed scoring explanations
- ⚙️ Easy A/B testing between agent versions
- 📊 Comparative analysis with industry benchmarks
# Create evaluation for your agent
python -m astk.cli evals create my_agent.py --eval-type code_qa --grader gpt-4
# Run evaluation
python -m astk.cli evals run eval_12345
# Compare two models
python -m astk.cli evals compare eval_12345 gpt-4o-mini gpt-4-turbo
Always use this format for reliable behavior across all environments:
python -m astk.cli <command>
This format works regardless of PATH configuration, virtual environments, or installation method.
# Show comprehensive help with all tiers
python -m astk.cli --help
# Initialize new project with templates
python -m astk.cli init <project-name>
# Run basic benchmarks (Tier 1)
python -m astk.cli benchmark <agent-path>
# Show examples and tier guide
python -m astk.cli examples
# Generate reports
python -m astk.cli report <results-dir>
# Complete rigorous evaluation suite
python -m astk.cli rigorous run <agent-path>
# Custom evaluation with specific parameters
python -m astk.cli rigorous run <agent-path> \
--scenarios path/to/scenarios.yaml \
--evaluators gpt-4 o1-preview gpt-4-turbo \
--parallel \
--max-cost 20.0 \
--output-format detailed \
--save-results \
--fail-fast
# Available options:
# --scenarios: Custom scenarios YAML file
# --evaluators: OpenAI models (gpt-4, o1-preview, gpt-4-turbo)
# --parallel: Run scenarios in parallel (faster, more expensive)
# --max-cost: Maximum total cost in USD
# --output-format: json, yaml, or detailed markdown
# --save-results: Save comprehensive results to file
# --fail-fast: Stop on first scenario failure
# --retry-failures: Number of retry attempts (default: 1)
# Create professional evaluation
python -m astk.cli evals create <agent-path> --eval-type code_qa --grader gpt-4
# Run evaluation from logs
python -m astk.cli evals run <eval-id>
# Compare two models
python -m astk.cli evals compare <eval-id> baseline-model test-model
# Available eval types: general, code_qa, customer_service, research
# Available graders: gpt-4, gpt-4-turbo, o3, o3-mini
Your agent must accept queries as command-line arguments:
#!/usr/bin/env python3
import sys
import asyncio


class MyAgent:
    def __init__(self):
        # Initialize your agent here
        pass

    async def process_query(self, query: str) -> str:
        """Process a query and return response"""
        # Your agent logic here
        return f"Agent response to: {query}"


async def main():
    agent = MyAgent()
    if len(sys.argv) > 1:
        query = " ".join(sys.argv[1:])
        response = await agent.process_query(query)
        print(f"Agent: {response}")
    else:
        print("Agent: Ready for queries!")


if __name__ == "__main__":
    asyncio.run(main())
Test your agent:
# Make sure your agent works
python my_agent.py "What is 2+2?"
# Then run ASTK evaluation
python -m astk.cli benchmark my_agent.py
Tests 8 fundamental capabilities:
- 📁 File Discovery: Find Python files and entry points
- ⚙️ Config Analysis: Analyze configuration files
- 📖 README Comprehension: Read and explain project documentation
- 🏗️ Code Structure: Analyze directory architecture
- 🔍 Documentation Search: Explore project documentation
- 📊 Dependency Analysis: Analyze requirements and dependencies
- 💡 Example Exploration: Discover example implementations
- 🧪 Test Discovery: Find testing frameworks and patterns
Foundational Reasoning - Multi-step mathematical problem with verification:
- Tests calculation accuracy and step-by-step reasoning
- Multiple evaluators verify mathematical correctness
Creative Constraint Problem - Design offline food delivery app:
- Evaluates innovation within severe technical constraints
- Cultural sensitivity and business viability assessment
Ethical AI Dilemma - Healthcare ICU bed allocation:
- Tests ethical reasoning and moral framework application
- Legal compliance and practical implementation evaluation
Complex Systems Analysis - Universal Basic Income impact:
- 6-domain analysis across economic, social, political dimensions
- Systems thinking with feedback loop identification
Adversarial Security Analysis - Cryptocurrency exchange security:
- Security expertise and threat modeling evaluation
- Risk assessment and mitigation strategy analysis
Crisis Coordination - Multi-modal disaster response:
- Hurricane + COVID + cyberattack simultaneous crisis
- Resource allocation and stakeholder coordination
Logical Paradoxes - AI self-reference and consistency:
- Tests handling of logical contradictions
- Philosophical reasoning and consistency evaluation
Prompt Injection Resistance - Adversarial input testing:
- Security robustness against manipulation attempts
- Attack resistance and safe response generation
ASTK provides detailed analysis across multiple dimensions:
{
"success_rate": 0.78,
"complexity_score": 0.65,
"total_duration_seconds": 45.2,
"average_response_length": 1247,
"difficulty_breakdown": {
"foundational": { "success_rate": 1.0, "scenarios": "2/2" },
"advanced": { "success_rate": 0.6, "scenarios": "3/5" },
"expert": { "success_rate": 0.4, "scenarios": "2/5" }
},
"category_breakdown": {
"reasoning": { "success_rate": 0.67, "scenarios": "2/3" },
"creativity": { "success_rate": 0.5, "scenarios": "1/2" },
"ethics": { "success_rate": 1.0, "scenarios": "2/2" },
"security": { "success_rate": 0.3, "scenarios": "1/3" }
}
}
Based on Complexity Score:
- 🏆 Exceptional AI (80%+): Expert-level reasoning across multiple domains
- 🥈 Advanced AI (60-79%): Strong performance on sophisticated tasks
- 🟡 Competent AI (40-59%): Good basic capabilities, room for improvement
- 📈 Developing AI (<40%): Focus on foundational skills
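These bands map directly onto a simple threshold check over the `complexity_score` field shown in the sample results above; a sketch (the function name is ours, not part of the ASTK API):

```python
# Map a complexity score (0.0-1.0) onto the rating bands listed above.
# Hypothetical helper for illustration, not an ASTK function.
def rating(complexity_score: float) -> str:
    if complexity_score >= 0.80:
        return "Exceptional AI"
    if complexity_score >= 0.60:
        return "Advanced AI"
    if complexity_score >= 0.40:
        return "Competent AI"
    return "Developing AI"
```

The sample results above, with a complexity score of 0.65, would land in the Advanced band.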
ASTK/
├── 🤖 examples/
│   ├── agents/                    # Example AI agents
│   │   └── file_qa_agent.py       # LangChain File Q&A agent
│   └── benchmarks/scenarios/      # Evaluation scenarios
│       └── rigorous_multilayer_scenarios.yaml  # Expert scenarios
├── 📁 scripts/                    # Benchmark and utility scripts
│   ├── simple_benchmark.py        # Intelligent benchmark runner
│   ├── simple_run.py              # Basic agent runner
│   └── astk.py                    # Advanced CLI
├── 🔧 astk/                       # Core ASTK framework
│   ├── benchmarks/                # Benchmark modules
│   ├── cli.py                     # Command-line interface
│   ├── evals_integration.py       # OpenAI Evals integration
│   ├── schema.py                  # Data schemas
│   └── *.py                       # Core modules
├── 📊 benchmark_results/          # Generated benchmark results
├── ⚙️ config/                     # Configuration files
└── 📚 docs/                       # Documentation
# Quick development testing (FREE)
python -m astk.cli benchmark my_agent.py
# Check specific capabilities
python -m astk.cli benchmark my_agent.py --scenarios 5
# View results
python -m astk.cli report astk_results/
# Professional evaluation
python -m astk.cli evals create my_agent.py --eval-type code_qa
# Run comprehensive assessment
python -m astk.cli evals run eval_12345
# Complete rigorous evaluation
python -m astk.cli rigorous run my_agent.py --max-cost 15.0
# Parallel execution for speed
python -m astk.cli rigorous run my_agent.py \
--parallel \
--evaluators gpt-4 o1-preview gpt-4-turbo \
--save-results
# .github/workflows/astk.yml
name: ASTK Agent Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install ASTK
        run: pip install agent-sprint-testkit[evals]

      - name: Run Basic Benchmarks
        run: python -m astk.cli benchmark agents/my_agent.py

      - name: Run Professional Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m astk.cli rigorous run agents/my_agent.py \
            --max-cost 5.0 \
            --fail-fast \
            --save-results

      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-results
          path: rigorous_evaluation_*.json
# Update pip and reinstall
pip install --upgrade pip
pip install --upgrade agent-sprint-testkit[evals]
# Verify installation
python -m astk.cli --version
python -c "import astk; print('ASTK loaded successfully')"
✅ Always use the reliable format:
# Recommended (always works)
python -m astk.cli benchmark my_agent.py
# Avoid (may fail with PATH issues)
astk benchmark my_agent.py
# Verify API key is set
echo $OPENAI_API_KEY
# Set API key
export OPENAI_API_KEY="sk-..."
# Test API access
python -c "import openai; print('OpenAI client ready')"
# Set strict cost limits
python -m astk.cli rigorous run my_agent.py --max-cost 5.0
# Use fewer evaluators to reduce costs
python -m astk.cli rigorous run my_agent.py --evaluators gpt-4
# Development testing with fail-fast
python -m astk.cli rigorous run my_agent.py --max-cost 2.0 --fail-fast
Your agent must:
- Accept queries as command-line arguments
- Print responses to stdout
- Exit with code 0 on success
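These three requirements can be smoke-checked before spending money on an evaluation run. The script below is a hypothetical helper, not part of ASTK, and assumes the agent is a Python script:

```python
#!/usr/bin/env python3
"""Smoke-check that an agent script meets the ASTK contract:
accepts a query argument, prints to stdout, exits with code 0.
Hypothetical helper, not part of the ASTK package."""
import subprocess
import sys


def check_agent(agent_path: str, query: str = "test question") -> bool:
    """Run the agent once and verify exit code 0 plus non-empty stdout."""
    result = subprocess.run(
        [sys.executable, agent_path, query],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.returncode == 0 and result.stdout.strip() != ""


if __name__ == "__main__" and len(sys.argv) > 1:
    print("PASS" if check_agent(sys.argv[1]) else "FAIL")
```

Run it as `python check_agent.py my_agent.py` before kicking off any paid tier.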
Test your agent:
python your_agent.py "test question"
# Should print a response and exit cleanly
- ⚡ Faster Responses: Optimize your agent's processing pipeline
- 🧠 Better Intelligence: Use more sophisticated reasoning patterns
- 💰 Cost Optimization: Use `--max-cost` limits and selective evaluators
- 🔧 Custom Scenarios: Create domain-specific evaluation scenarios
- ⚡ Parallel Processing: Use `--parallel` for faster rigorous evaluation
- 🎯 Targeted Testing: Focus on specific capability categories
- Fork the repository and create a feature branch
- Add new agents in `examples/agents/`
- Create new scenarios in `examples/benchmarks/scenarios/`
- Test thoroughly with all evaluation tiers
- Submit a pull request with comprehensive test results
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
For commercial use or derivative works, please contact: admin@blackbox-dev.com
See LICENSE file for complete details.
# Quick installation and first test
pip install agent-sprint-testkit[evals]
export OPENAI_API_KEY="your-key"
python -m astk.cli init my-tests
cd my-tests
python -m astk.cli examples
# Run first evaluation
python -m astk.cli benchmark examples/agents/file_qa_agent.py
# Try rigorous assessment
python -m astk.cli rigorous run examples/agents/file_qa_agent.py --max-cost 3.0
Ready to evaluate your AI agent? Start with basic testing and progress through our three-tier system as your agent improves!
📖 For package-specific installation and usage instructions, see README-PACKAGE.md