An intelligent reward system for Reinforcement Learning training that combines LLM-based evaluation with semantic memory. The system evaluates model rollouts, detects reward hacking, learns from training history, and dynamically evolves evaluation criteria (rubrics) over time.
Full Documentation: the topic-based docs live under docs/. Start at docs/index.md for audience-specific reading paths (integrators, contributors, researchers). This README is a self-contained quick start.
- Dynamic Rubric Evolution: Evaluation criteria automatically adapt based on observed model behaviors and training patterns
- Reward Hacking Detection: Identifies when models game evaluation metrics without genuine improvement
- Long-term Semantic Memory: Uses ChromaDB to learn from training history and inform future evaluations
- Two-Tier Rubric System: Immutable anchor rubrics + fully modifiable adaptive rubrics
- Async-First Architecture: Background processing doesn't block RL training loops
- OpenAI-Compatible: Works with OpenAI, Azure OpenAI, Ollama, vLLM, or any OpenAI-compatible API
- Architecture Overview
- Installation
- Quick Start
- Configuration
- API Reference
- Client Library
- Data Models
- Pipeline Workflow
- Advanced Features
- Utility Scripts
- Project Structure
┌─────────────────────────────────────────────────────────────────────────────┐
│ RL Training Loop │
│ │
│ ┌───────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │ Generate │───►│ Score via │───►│ Update Model│ │
│ │ Rollouts │ │ RewardClient │ │ (PPO, etc.) │ │
│ └───────────┘ └───────┬───────┘ └──────────────┘ │
│ │ │
└────────────────────────────┼────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Reward System Server │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Pipeline Orchestrator │ │
│ │ │ │
│ │ ┌─────────┐ ┌──────────┐ ┌────────────┐ ┌─────────────────┐ │ │
│ │ │ Scorer │──►│ Analyzer │──►│ Summarizer │──►│ Rubric Updater │ │ │
│ │ │ Agent │ │ Agent │ │ Agent │ │ Agent │ │ │
│ │ └─────────┘ └────┬─────┘ └──────┬─────┘ └────────┬────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ ChromaDB (Semantic Memory) │ │ │
│ │ │ • Individual Analyses • Batch Summaries │ │ │
│ │ │ • Rubric Update History • Meta Summaries │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────┐ ┌────────────────────────────────────────┐ │
│ │ Rubric Manager │ │ LLM Agents (6) │ │
│ │ • Anchor Rubrics │ │ Scorer, Analyzer, Summarizer, │ │
│ │ • Adaptive Rubrics│ │ MetaSummarizer, QueryGenerator, │ │
│ │ • Persistence │ │ Updater │ │
│ └────────────────────┘ └────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Agent | Purpose | When Called |
|---|---|---|
| Scorer | Fast evaluation of rollout against current rubrics | Every /score request |
| Analyzer | Deep analysis of individual rollouts (hacking detection, performance assessment) | Background after scoring |
| Summarizer | Aggregate insights from batch of analyses | After batch complete |
| MetaSummarizer | Consolidate multiple batch summaries (for chunked processing) | Large batches with chunking |
| QueryGenerator | Generate targeted semantic search queries | Before rubric update |
| Updater | Evolve rubrics based on summary + historical context | End of each batch |
- Python >= 3.12
- OpenAI API key (or compatible provider)
uv is a fast Python package manager:
uv pip install -e .
Or, with pip:
pip install -e .
Dependencies:
openai>=1.55.0 # OpenAI SDK (works with all compatible providers)
chromadb>=0.5.3 # Vector database for semantic memory
fastapi>=0.115.0 # Web framework
uvicorn>=0.32.0 # ASGI server
httpx>=0.27.0 # HTTP client
pydantic>=2.9.0 # Data validation
pyyaml>=6.0.0 # Configuration parsing
json-repair>=0.30.0 # JSON repair for imperfect LLM outputs
Create or modify config.yaml:
client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var if null
    base_url: null  # null = OpenAI default
agents:
  default:
    model: "gpt-4o-mini"
    response_format: { type: json_object }
memory:
  path: "./data/chroma"
  rubrics_path: "./data/rubrics.json"
  anchor_rubrics:
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []
server:
  host: "0.0.0.0"
  port: 8002
pipeline:
  mode: "async"
Set your API key and start the server:
export OPENAI_API_KEY=your_key_here
python -m amaris.server.main --config config.yaml
Integrate with your training loop:
import asyncio
from amaris import RewardClient
# total_steps, generate_rollouts() and compute_rl_loss() are placeholders for your own training code
async def train():
    async with RewardClient("http://localhost:8002") as client:
        for step in range(total_steps):
            rollouts = generate_rollouts()
            metadata = {
                "high_level_goal": "Train a helpful math assistant",
                "model_size": "7B",
                "current_step": step,
                "total_steps": total_steps,
            }
            # Score all rollouts concurrently
            scores = await asyncio.gather(*(
                client.score_async(
                    rollout={
                        "id": f"step{step}_r{i}",
                        "input": r.prompt,
                        "output": r.response,
                        "supplemental_context": r.ground_truth,
                    },
                    metadata=metadata,
                )
                for i, r in enumerate(rollouts)
            ))
            # Use scores for RL update
            loss = compute_rl_loss(scores)
            # Signal batch complete only in manual mode (rollouts_per_step = null)
            client.rollouts_complete()
            # Block until the rubric update for this step has finished
            client.rubric_updated()
asyncio.run(train())
# OpenAI API client settings
client:
  default:
    api_key: null  # API key (null = use OPENAI_API_KEY env var)
    base_url: null  # Custom endpoint (null = OpenAI default)
  # Per-agent overrides (optional)
  analyzer:
    api_key: "different_key"
    base_url: "https://custom-endpoint.com"
# LLM model options (per-agent customization)
agents:
  default:
    model: "gpt-4o-mini"  # Model name
    service_tier: "flex"  # OpenAI service tier
    response_format: { type: json_object }
  scorer:
    model: "gpt-4o-mini"  # Fast model for scoring
  analyzer:
    model: "gpt-4o"  # More capable model for analysis
  # Available agents: scorer, analyzer, summarizer, meta_summarizer, query_generator, updater
  # Additional per-agent options: extra_body (dict for provider-specific parameters, e.g., OpenRouter)
# Concurrency control for LLM API calls
concurrency:
  total: null  # Max parallel LLM calls across all agents (null = unlimited)
  scorer: null  # Per-agent limits (null = unlimited)
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null
# Memory and rubric settings
memory:
  path: "./data/chroma"  # ChromaDB storage path
  rubrics_path: "./data/rubrics.json"  # Rubric persistence file
  collection: "reward_memory"  # ChromaDB collection name
  embedding_model: null  # null = ChromaDB default (all-MiniLM-L6-v2)
  static_memory_steps: 5  # Recent batch summaries for context
  dynamic_memory_limit_per_query: 3  # Semantic search results per query
  max_queries: null  # Max search queries per update; null = adaptive
  rubrics_mode: "global"  # "global" or "per_instance"
  per_instance_dir: "./data/per_instance"  # Per-instance data directory
  per_instance_rubrics_path: null  # Path to JSON with pre-defined rubrics per input_id
  anchor_rubrics:  # Immutable evaluation criteria
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []  # Dynamic criteria (initially empty)
# Custom prompt templates (optional)
prompts:
  scorer: null  # null = use built-in prompt
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null
# Server settings
server:
  host: "0.0.0.0"
  port: 8002
# Pipeline behavior
pipeline:
  mode: "async"  # "async" or "sync"
  rollouts_per_step: null  # null = manual /complete; set integer for auto-complete (manual /complete is ignored)
  use_chunking: true  # Enable chunked processing for large batches
  chunk_size: 32  # Rollouts per chunk
  rubric_update_frequency: 1  # Update rubrics every N steps (1 or null = every step)
# Logging
logging:
  path: null  # null = defaults to data parent directory
# Provider example: OpenAI
client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var
    base_url: null
agents:
  default:
    model: "gpt-4o-mini"
# Provider example: Azure OpenAI
client:
  default:
    api_key: "your_azure_api_key"
    base_url: "https://<resource>.openai.azure.com"
agents:
  default:
    model: "gpt-4o"  # Your deployment name
# Provider example: Ollama
client:
  default:
    api_key: "ollama"  # Any value works
    base_url: "http://localhost:11434/v1"
agents:
  default:
    model: "llama3.2"
# Provider example: vLLM
client:
  default:
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"
agents:
  default:
    model: "meta-llama/Llama-3.2-3B-Instruct"
/score: Score a rollout against current rubrics.
Request:
{
  "rollout": {
    "id": "unique_rollout_id",
    "input": "User prompt or question",
    "output": "Model's response",
    "supplemental_context": "Ground truth, reference solution, etc."
  },
  "metadata": {
    "high_level_goal": "Train a helpful assistant",
    "model_size": "7B",
    "current_step": 5,
    "total_steps": 100,
    "other_context": "Optional additional context"
  }
}
Response:
{
  "score": 0.75
}
Behavior:
- Async mode: Returns immediately; analysis runs in background
- Sync mode: Blocks until rubric update completes, returns re-scored value
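The request body shown above can be assembled and sanity-checked before it is sent. The following is a minimal sketch using only the standard library. `build_score_request` and `ROLLOUT_FIELDS` are hypothetical names, not part of the amaris package, and treating `id`, `input`, and `output` as required is an assumption based on the fields that appear in every example in this README.

```python
import json

# Assumption: these three rollout fields are required; supplemental_context is treated as optional.
ROLLOUT_FIELDS = ("id", "input", "output")


def build_score_request(rollout: dict, metadata: dict) -> str:
    """Hypothetical helper: return the JSON body for a /score request."""
    missing = [field for field in ROLLOUT_FIELDS if field not in rollout]
    if missing:
        raise ValueError(f"rollout is missing fields: {missing}")
    return json.dumps({"rollout": rollout, "metadata": metadata})


body = build_score_request(
    rollout={
        "id": "step5_r0",
        "input": "What is 2 + 2?",
        "output": "4.",
        "supplemental_context": "Answer: 4",
    },
    metadata={
        "high_level_goal": "Train a helpful assistant",
        "model_size": "7B",
        "current_step": 5,
        "total_steps": 100,
    },
)
parsed = json.loads(body)  # round-trip to confirm the body is valid JSON
```

In practice RewardClient.score_async builds and sends this request for you; a helper like this is mainly useful when calling the server from another language or from a test harness.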
/complete: Signal that a batch of rollouts is complete. Triggers the analysis, summarization, and rubric update pipeline.
If pipeline.rollouts_per_step is set, this endpoint is a no-op.
Response (manual mode):
{
  "status": "complete"
}
Response (when pipeline.rollouts_per_step is set):
{
  "status": "noop"
}
/rubric_status/wait: Wait for the rubric update to complete (blocking). Returns once all inputs in the step are complete.
Response:
{
  "status": "updated",
  "mode": "global"
}
The RewardClient provides a simple API with async scoring and sync/async batch control methods. Scoring is async-only to enforce concurrent rollout submission, which prevents deadlocks in sync pipeline mode.
import asyncio
from amaris import RewardClient
# batch_size is a placeholder for the number of rollouts in your batch
async def train():
    async with RewardClient("http://localhost:8002", timeout=600) as client:
        # 1. Score rollouts concurrently
        scores = await asyncio.gather(*(
            client.score_async(
                rollout={"id": f"r{i}", "input": "What is 2 + 2?", "output": "4.", "supplemental_context": "Answer: 4"},
                metadata={"high_level_goal": "Train a math assistant", "model_size": "7B", "current_step": 1, "total_steps": 100},
            )
            for i in range(batch_size)
        ))
        # 2. Signal batch complete (manual mode only, sync)
        client.rollouts_complete()
        # 3. Wait for rubric evolution (sync)
        client.rubric_updated()
asyncio.run(train())
For fully async training loops, async versions of batch control are also available:
# Uses the same imports as above; rollouts and meta are placeholders for your own data
async def train_step():
    async with RewardClient("http://localhost:8002") as client:
        scores = await asyncio.gather(*(client.score_async(r, meta) for r in rollouts))
        await client.rollouts_complete_async()
        await client.rubric_updated_async()
class Rollout:
    id: str  # Unique identifier
    input: str  # User prompt/question
    output: str  # Model response
    supplemental_context: str  # Ground truth, reference, etc.
    metadata: dict  # Additional metadata
class Rubric:
    id: int  # Auto-generated ID
    text: str  # Evaluation criterion text
    weight: float  # Importance weight (default: 1.0)
    is_anchor: bool  # True = immutable
    created_at: datetime
    metadata: dict
class TrainingMetadata:
    high_level_goal: str  # Training objective description
    model_size: str  # Model size (e.g., "7B")
    current_step: int  # Current training step
    total_steps: int  # Total training steps
    other_context: str  # Additional context
Deep analysis of a single rollout:
class IndividualAnalysis:
    analysis_id: str
    rollout_id: str
    step: int
    overall_assessment: OverallAssessment
    performance_at_stage: PerformanceAtStage
    reward_hacking_analysis: RewardHackingAnalysis
    performance_advancement_strategy: PerformanceAdvancementStrategy
    rubric_evaluation: RubricEvaluation
    other_observations: list[OtherObservation]
Aggregated insights from a batch:
class BatchSummary:
    batch_summary_id: str
    step: int
    executive_summary: ExecutiveSummary
    systemic_patterns: list[SystemicPattern]
    aggregated_reward_hacking: AggregatedHackingAnalysis
    rubric_meta_evaluation: RubricMetaEvaluation
    performance_advancement_analysis: PerformanceAdvancementAnalysis
    rubric_evolution_plan: RubricEvolutionPlan
    memory_entry: MemoryEntry
Rubric modification result:
class RubricUpdate:
    rubric_update_id: str
    step: int
    update_strategy: str  # "MAINTENANCE", "CORRECTIVE", or "ADVANCEMENT"
    memory_consultation: MemoryConsultation
    rationale: UpdateRationale
    rubric_modifications: list[RubricModification]
    final_adaptive_rubric_set: list[Rubric]
    memory_artifact: MemoryArtifact
Async mode:
Client calls /score
│
▼
┌───────────────────┐
│ Scorer Agent │──────► Returns score immediately
│ (Fast eval) │
└───────────────────┘
│
▼ (Background)
┌───────────────────┐
│ Analyzer Agent │──────► Store to ChromaDB
│ (Deep analysis) │
└───────────────────┘
│
▼ (On /complete, manual only)
┌───────────────────┐
│ Summarizer Agent │──────► Aggregate batch insights
└───────────────────┘
│
▼
┌───────────────────┐
│ QueryGenerator │──────► Generate memory search queries
└───────────────────┘
│
▼
┌───────────────────┐
│ Updater Agent │──────► Evolve rubrics
└───────────────────┘
│
▼
Rubric update complete
(Client /rubric_status/wait returns)
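The async flow above can be approximated in a few lines of asyncio. This is a minimal sketch, not the server's actual implementation: `score`, `analyze`, and `complete` are hypothetical stand-ins for the Scorer, the Analyzer, and the /complete handling, the list `analyses` stands in for ChromaDB, and the returned score is a fixed placeholder.

```python
import asyncio

analyses = []  # stand-in for ChromaDB storage (assumption for illustration)


async def analyze(rollout_id: str) -> None:
    """Simulated deep analysis; the sleep stands in for a slow LLM call."""
    await asyncio.sleep(0.01)
    analyses.append(rollout_id)


async def score(rollout_id: str, background: set) -> float:
    """Schedule analysis in the background and return a score right away."""
    task = asyncio.create_task(analyze(rollout_id))
    background.add(task)  # keep a reference so the task is not garbage-collected
    return 0.75  # placeholder score


async def complete(background: set) -> str:
    """Wait for all pending analyses, then 'summarize' them."""
    await asyncio.gather(*background)
    return f"summarized {len(analyses)} analyses"


async def main():
    background = set()
    scores = await asyncio.gather(*(score(f"r{i}", background) for i in range(3)))
    finished_when_scored = len(analyses)  # analyses are still running at this point
    summary = await complete(background)
    return scores, finished_when_scored, summary


scores, finished_when_scored, summary = asyncio.run(main())
```

The point of the pattern is that the caller receives scores before any analysis has finished, so the training loop is delayed only by the fast scoring call.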
Sync mode:
Client calls /score
│
▼
┌───────────────────┐
│ Full Pipeline │──────► Score → Analyze → Summarize → Update
│ (Blocking) │
└───────────────────┘
│
▼
Re-score with NEW rubrics
│
▼
Returns updated score
For heterogeneous tasks (math, coding, writing), use separate rubrics per input type:
Input IDs are derived from rollout.input and handled internally.
memory:
  rubrics_mode: "per_instance"
  per_instance_dir: "./data/per_instance"
score = await client.score_async(
    rollout={
        "id": "r1",
        "input": "Write a poem",
        "output": "Roses are red..."
    },
    metadata={...}
)
# One /complete triggers updates for all inputs in the step (manual mode only)
client.rollouts_complete()
For large batches (>100 rollouts), enable chunking to prevent memory issues:
pipeline:
  use_chunking: true
  chunk_size: 32
This splits the batch, summarizes each chunk separately, then uses MetaSummarizer to consolidate.
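The split, summarize, and consolidate steps can be sketched as follows. This is an illustration under stated assumptions: `summarize` and `meta_summarize` are hypothetical placeholders for the Summarizer and MetaSummarizer agents (which call an LLM in the real system), and skipping the meta step for batches that fit in one chunk is an assumption, not documented behavior.

```python
def chunked(items: list, chunk_size: int) -> list:
    """Split items into consecutive chunks of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def summarize(analyses: list) -> str:
    """Placeholder for the Summarizer agent."""
    return f"summary of {len(analyses)} analyses"


def meta_summarize(summaries: list) -> str:
    """Placeholder for the MetaSummarizer agent."""
    return f"meta-summary of {len(summaries)} chunk summaries"


def summarize_batch(analyses: list, chunk_size: int = 32, use_chunking: bool = True) -> str:
    # Assumption: a batch that fits in a single chunk needs no consolidation step.
    if not use_chunking or len(analyses) <= chunk_size:
        return summarize(analyses)
    chunk_summaries = [summarize(chunk) for chunk in chunked(analyses, chunk_size)]
    return meta_summarize(chunk_summaries)


batch = [f"analysis_{i}" for i in range(100)]
chunk_sizes = [len(chunk) for chunk in chunked(batch, 32)]  # [32, 32, 32, 4]
result = summarize_batch(batch)  # "meta-summary of 4 chunk summaries"
```

With chunk_size: 32, a batch of 100 rollouts yields four chunks, so no single summarization call sees more than 32 analyses.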
Automatically trigger batch processing after N rollouts:
pipeline:
  rollouts_per_step: 32
No need to call client.rollouts_complete() manually. When rollouts_per_step is set, /complete is ignored.
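The interaction between auto-complete and manual /complete can be modeled with a small counter. This is a sketch of the documented behavior only; `StepTracker` and its methods are hypothetical names, and the real server's bookkeeping may differ.

```python
from typing import Optional


class StepTracker:
    """Hypothetical model of batch completion in manual and auto-complete mode."""

    def __init__(self, rollouts_per_step: Optional[int] = None):
        self.rollouts_per_step = rollouts_per_step  # None = manual mode
        self.pending = 0
        self.batches_triggered = 0

    def on_score(self) -> None:
        """Called once per scored rollout."""
        self.pending += 1
        if self.rollouts_per_step is not None and self.pending >= self.rollouts_per_step:
            self._trigger()

    def on_complete(self) -> dict:
        """Models /complete: a no-op whenever rollouts_per_step is set."""
        if self.rollouts_per_step is not None:
            return {"status": "noop"}
        self._trigger()
        return {"status": "complete"}

    def _trigger(self) -> None:
        self.pending = 0
        self.batches_triggered += 1


auto = StepTracker(rollouts_per_step=32)
for _ in range(64):
    auto.on_score()  # two batches are triggered automatically
auto_response = auto.on_complete()  # {"status": "noop"}

manual = StepTracker()
for _ in range(3):
    manual.on_score()
manual_response = manual.on_complete()  # {"status": "complete"}
```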
Control how often rubrics are updated to reduce LLM costs during training:
pipeline:
  rubric_update_frequency: 5  # Update rubrics every 5 steps
- 1 or null: Update rubrics every step (default)
- N > 1: Update rubrics at steps 0, N, 2N, 3N, ... (e.g., rubric_update_frequency: 5 updates at steps 0, 5, 10, 15, ...)
Summarization still runs every step to preserve memory context; only the expensive rubric update is skipped on non-update steps.
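The schedule described above reduces to a modulo check. A minimal sketch follows; `should_update_rubrics` is a hypothetical helper name, not a function from the amaris package.

```python
from typing import Optional


def should_update_rubrics(step: int, frequency: Optional[int]) -> bool:
    """Return True on steps where the rubric update should run."""
    if frequency is None or frequency <= 1:
        return True  # 1 or null: update every step
    return step % frequency == 0  # N > 1: update at steps 0, N, 2N, ...


update_steps = [step for step in range(16) if should_update_rubrics(step, 5)]
# update_steps == [0, 5, 10, 15]
```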
Override default prompts by specifying file paths:
prompts:
  scorer: "./prompts/my_scorer.txt"
  analyzer: "./prompts/my_analyzer.txt"
Initialize rubrics from config (clears existing):
python setup_rubrics.py
Inspect ChromaDB memory contents:
python inspect_chroma_db.py ./data/chroma --limit 50
Output includes:
- Step histogram (document count by training step)
- Type × Step breakdown
- Recent documents with preview
Run a simulated RL training loop:
python test_rl_simulation.py
Simulates:
- Math problem training
- Various model behaviors (correct, wrong, verbose, sycophant, reward hacking)
- Behavior evolution across training stages
- Score tracking and rubric evolution
AMARIS/
├── config.yaml # Main configuration
├── requirements.txt # Dependencies
├── pyproject.toml # Package metadata
├── README.md # This documentation
│
├── setup_rubrics.py # Rubric initialization utility
├── inspect_chroma_db.py # Memory inspection utility
├── test_rl_simulation.py # RL training simulation
├── per_instance_rubrics_example.json # Example per-instance rubrics
│
├── amaris/
│ ├── __init__.py # Exports RewardClient
│ │
│ ├── client/
│ │ ├── __init__.py
│ │ └── client.py # RewardClient (3-function API)
│ │
│ └── server/
│ ├── __init__.py
│ ├── main.py # Server entry point
│ ├── api.py # FastAPI endpoints
│ ├── config.py # Configuration loader
│ ├── models.py # Pydantic data models
│ ├── pipeline.py # Processing orchestrator
│ ├── memory.py # ChromaDB abstraction
│ ├── rubrics.py # Rubric management
│ ├── json_logger.py # LLM call logging
│ │
│ ├── agents/ # LLM agents
│ │ ├── __init__.py
│ │ ├── factory.py # OpenAI client factory
│ │ ├── scorer.py
│ │ ├── analyzer.py
│ │ ├── summarizer.py
│ │ ├── meta_summarizer.py
│ │ ├── query_generator.py
│ │ └── updater.py
│ │
│ └── prompts/ # Prompt templates
│ ├── __init__.py
│ ├── scorer.py
│ ├── analyzer.py
│ ├── summarizer.py
│ ├── meta_summarizer.py
│ ├── query_generator.py
│ └── updater.py
│
└── data/ # Runtime data (gitignored)
├── chroma/ # ChromaDB vector storage
├── rubrics.json # Persisted rubrics
├── logs/ # LLM call logs
└── per_instance/ # Per-instance data (if enabled)
| Component | Location | Description |
|---|---|---|
| ChromaDB | ./data/chroma/ | Vector database with semantic memory |
| Rubrics | ./data/rubrics.json | Current rubric set (JSON) |
| LLM Logs | ./data/logs/llm_log_*.jsonl | JSONL logs of all LLM calls |
| Per-Instance | ./data/per_instance/{id}/ | Separate rubrics/memory per input type |
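Because the LLM logs are plain JSONL, they can be inspected with the standard library alone. The sketch below counts records per log file. `count_log_records` is a hypothetical helper, and the `agent` field in the demo data is invented for illustration, since the actual record schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def count_log_records(log_dir: Path) -> dict:
    """Count the JSON records in every llm_log_*.jsonl file under log_dir."""
    counts = {}
    for path in sorted(log_dir.glob("llm_log_*.jsonl")):
        records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises if a line is not valid JSON
                    records += 1
        counts[path.name] = records
    return counts


# Demo on a temporary directory; the record contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "llm_log_demo.jsonl"
    sample.write_text('{"agent": "scorer"}\n{"agent": "analyzer"}\n', encoding="utf-8")
    counts = count_log_records(Path(tmp))
```

Against a real run, the same function would be pointed at Path("./data/logs").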
- Modular: Each component is independently testable
- OpenAI-Compatible: Works with any provider supporting OpenAI API format
- Async-First: Background processing doesn't block training
- Memory-Driven: Learns from history via semantic search
- Flexible Rubrics: Supports both shared and per-input customization
- JSON-Robust: Agents use regex fallback for imperfect LLM outputs
- Production-Ready: Comprehensive logging, error handling, graceful degradation
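The JSON-robustness principle can be illustrated with a small parser that tries strict parsing first and falls back to a regular expression. This is a sketch of the general technique, not the agents' actual code; `parse_llm_json` is a hypothetical name, and the greedy pattern assumes the output contains a single JSON object.

```python
import json
import re


def parse_llm_json(text: str):
    """Parse LLM output as JSON, falling back to the first-to-last brace span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM output")
    return json.loads(match.group(0))


raw = 'Here is the evaluation: {"score": 0.75} Let me know if you need more.'
parsed = parse_llm_json(raw)  # {"score": 0.75}
```

A fallback like this handles chatty wrappers around otherwise valid JSON; output that is malformed inside the braces still raises, which is the case the json-repair dependency is meant to cover.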
- Check Python version: python --version (requires >= 3.12)
- Verify dependencies: pip install -r requirements.txt
- Check config file path: python -m amaris.server.main --config config.yaml
- Verify API key: echo $OPENAI_API_KEY
- Test endpoint: curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
- Check model availability for your API tier
Enable chunking:
pipeline:
  use_chunking: true
  chunk_size: 32
- Ensure /complete is called after each batch (manual mode only)
- Wait for update: /rubric_status/wait
- Check logs for LLM errors
MIT License