AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS is a reward system for reinforcement learning (RL) training that combines LLM-based evaluation with semantic memory. It scores model rollouts, detects reward hacking, learns from training history, and evolves its evaluation criteria (rubrics) over time.

Full Documentation: the topic-based docs live under docs/. Start at docs/index.md for audience-specific reading paths (integrators, contributors, researchers). This README is a self-contained quick start.

Key Features

  • Dynamic Rubric Evolution: Evaluation criteria automatically adapt based on observed model behaviors and training patterns
  • Reward Hacking Detection: Identifies when models game evaluation metrics without genuine improvement
  • Long-term Semantic Memory: Uses ChromaDB to learn from training history and inform future evaluations
  • Two-Tier Rubric System: Immutable anchor rubrics + fully modifiable adaptive rubrics
  • Async-First Architecture: Background processing doesn't block RL training loops
  • OpenAI-Compatible: Works with OpenAI, Azure OpenAI, Ollama, vLLM, or any OpenAI-compatible API

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              RL Training Loop                               │
│                                                                             │
│   ┌───────────┐    ┌───────────────┐    ┌──────────────┐                    │
│   │  Generate │───►│  Score via    │───►│  Update Model│                    │
│   │  Rollouts │    │  RewardClient │    │  (PPO, etc.) │                    │
│   └───────────┘    └───────┬───────┘    └──────────────┘                    │
│                            │                                                │
└────────────────────────────┼────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Reward System Server                               │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                         Pipeline Orchestrator                       │   │
│   │                                                                     │   │
│   │  ┌─────────┐   ┌──────────┐   ┌────────────┐   ┌─────────────────┐  │   │
│   │  │ Scorer  │──►│ Analyzer │──►│ Summarizer │──►│ Rubric Updater  │  │   │
│   │  │ Agent   │   │ Agent    │   │ Agent      │   │ Agent           │  │   │
│   │  └─────────┘   └────┬─────┘   └──────┬─────┘   └────────┬────────┘  │   │
│   │                     │                │                  │           │   │
│   │                     ▼                ▼                  ▼           │   │
│   │              ┌──────────────────────────────────────────────┐       │   │
│   │              │          ChromaDB (Semantic Memory)          │       │   │
│   │              │  • Individual Analyses • Batch Summaries     │       │   │
│   │              │  • Rubric Update History • Meta Summaries    │       │   │
│   │              └──────────────────────────────────────────────┘       │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│   ┌────────────────────┐    ┌────────────────────────────────────────┐      │
│   │   Rubric Manager   │    │            LLM Agents (6)              │      │
│   │  • Anchor Rubrics  │    │  Scorer, Analyzer, Summarizer,         │      │
│   │  • Adaptive Rubrics│    │  MetaSummarizer, QueryGenerator,       │      │
│   │  • Persistence     │    │  Updater                               │      │
│   └────────────────────┘    └────────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Agent Responsibilities

| Agent | Purpose | When Called |
|-------|---------|-------------|
| Scorer | Fast evaluation of rollout against current rubrics | Every /score request |
| Analyzer | Deep analysis of individual rollouts (hacking detection, performance assessment) | Background after scoring |
| Summarizer | Aggregate insights from batch of analyses | After batch complete |
| MetaSummarizer | Consolidate multiple batch summaries (for chunked processing) | Large batches with chunking |
| QueryGenerator | Generate targeted semantic search queries | Before rubric update |
| Updater | Evolve rubrics based on summary + historical context | End of each batch |

Installation

Requirements

  • Python >= 3.12
  • OpenAI API key (or compatible provider)

Install with uv (Recommended)

uv is a fast Python package manager:

uv pip install -e .

Install with pip

pip install -e .

Dependencies

openai>=1.55.0      # OpenAI SDK (works with all compatible providers)
chromadb>=0.5.3     # Vector database for semantic memory
fastapi>=0.115.0    # Web framework
uvicorn>=0.32.0     # ASGI server
httpx>=0.27.0       # HTTP client
pydantic>=2.9.0     # Data validation
pyyaml>=6.0.0       # Configuration parsing
json-repair>=0.30.0 # JSON repair for imperfect LLM outputs

Quick Start

1. Configure the System

Create or modify config.yaml:

client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var if null
    base_url: null  # null = OpenAI default

agents:
  default:
    model: "gpt-4o-mini"
    response_format: { type: json_object }

memory:
  path: "./data/chroma"
  rubrics_path: "./data/rubrics.json"
  anchor_rubrics:
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []

server:
  host: "0.0.0.0"
  port: 8002

pipeline:
  mode: "async"

2. Start the Server

export OPENAI_API_KEY=your_key_here
python -m amaris.server.main --config config.yaml

3. Integrate with Training Loop

import asyncio
from amaris import RewardClient

async def train():
    async with RewardClient("http://localhost:8002") as client:
        for step in range(total_steps):
            rollouts = generate_rollouts()
            metadata = {
                "high_level_goal": "Train a helpful math assistant",
                "model_size": "7B",
                "current_step": step,
                "total_steps": total_steps,
            }
            # Score all rollouts concurrently
            scores = await asyncio.gather(*(
                client.score_async(
                    rollout={
                        "id": f"step{step}_r{i}",
                        "input": r.prompt,
                        "output": r.response,
                        "supplemental_context": r.ground_truth,
                    },
                    metadata=metadata,
                )
                for i, r in enumerate(rollouts)
            ))
            # Use scores for RL update
            loss = compute_rl_loss(scores)

            # Signal batch complete only in manual mode (rollouts_per_step = null)
            client.rollouts_complete()
            client.rubric_updated()

asyncio.run(train())

Configuration

Full Configuration Reference

# OpenAI API client settings
client:
  default:
    api_key: null           # API key (null = use OPENAI_API_KEY env var)
    base_url: null          # Custom endpoint (null = OpenAI default)
  # Per-agent overrides (optional)
  analyzer:
    api_key: "different_key"
    base_url: "https://custom-endpoint.com"

# LLM model options (per-agent customization)
agents:
  default:
    model: "gpt-4o-mini"           # Model name
    service_tier: "flex"           # OpenAI service tier
    response_format: { type: json_object }
  scorer:
    model: "gpt-4o-mini"           # Fast model for scoring
  analyzer:
    model: "gpt-4o"                # More capable model for analysis
  # Available agents: scorer, analyzer, summarizer, meta_summarizer, query_generator, updater
  # Additional per-agent options: extra_body (dict for provider-specific parameters, e.g., OpenRouter)

# Concurrency control for LLM API calls
concurrency:
  total: null                # Max parallel LLM calls across all agents (null = unlimited)
  scorer: null               # Per-agent limits (null = unlimited)
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null

# Memory and rubric settings
memory:
  path: "./data/chroma"                    # ChromaDB storage path
  rubrics_path: "./data/rubrics.json"      # Rubric persistence file
  collection: "reward_memory"              # ChromaDB collection name
  embedding_model: null                    # null = ChromaDB default (all-MiniLM-L6-v2)
  static_memory_steps: 5                   # Recent batch summaries for context
  dynamic_memory_limit_per_query: 3        # Semantic search results per query
  max_queries: null                        # Max search queries per update; null = adaptive
  rubrics_mode: "global"                   # "global" or "per_instance"
  per_instance_dir: "./data/per_instance"  # Per-instance data directory
  per_instance_rubrics_path: null           # Path to JSON with pre-defined rubrics per input_id
  anchor_rubrics:                          # Immutable evaluation criteria
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []                     # Dynamic criteria (initially empty)

# Custom prompt templates (optional)
prompts:
  scorer: null          # null = use built-in prompt
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null

# Server settings
server:
  host: "0.0.0.0"
  port: 8002

# Pipeline behavior
pipeline:
  mode: "async"              # "async" or "sync"
  rollouts_per_step: null    # null = manual /complete; set integer for auto-complete (manual /complete is ignored)
  use_chunking: true         # Enable chunked processing for large batches
  chunk_size: 32             # Rollouts per chunk
  rubric_update_frequency: 1 # Update rubrics every N steps (1 or null = every step)

# Logging
logging:
  path: null                 # null = defaults to data parent directory
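
The `client` and `agents` sections both pair a `default` block with optional per-agent blocks. A minimal sketch of how such a lookup could be resolved, assuming a shallow merge in which per-agent keys win over defaults (the helper name `resolve_agent_options` is hypothetical and not part of AMARIS):

```python
from typing import Any


def resolve_agent_options(section: dict[str, dict[str, Any]], agent: str) -> dict[str, Any]:
    """Shallow-merge the `default` block with an optional per-agent block."""
    merged = dict(section.get("default") or {})
    merged.update(section.get(agent) or {})  # per-agent keys override defaults
    return merged


# Mirrors part of the `agents` section from the reference above.
agents_section = {
    "default": {"model": "gpt-4o-mini", "service_tier": "flex"},
    "analyzer": {"model": "gpt-4o"},
}

analyzer_opts = resolve_agent_options(agents_section, "analyzer")
scorer_opts = resolve_agent_options(agents_section, "scorer")
print(analyzer_opts)  # {'model': 'gpt-4o', 'service_tier': 'flex'}
print(scorer_opts)    # {'model': 'gpt-4o-mini', 'service_tier': 'flex'}
```

Under this assumption, an agent with no block of its own simply inherits every default value.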

Provider-Specific Examples

OpenAI (Default)

client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var
    base_url: null

agents:
  default:
    model: "gpt-4o-mini"

Azure OpenAI

client:
  default:
    api_key: "your_azure_api_key"
    base_url: "https://<resource>.openai.azure.com"

agents:
  default:
    model: "gpt-4o"  # Your deployment name

Ollama (Local)

client:
  default:
    api_key: "ollama"  # Any value works
    base_url: "http://localhost:11434/v1"

agents:
  default:
    model: "llama3.2"

vLLM (Local Inference)

client:
  default:
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"

agents:
  default:
    model: "meta-llama/Llama-3.2-3B-Instruct"

API Reference

Endpoints

POST /score

Score a rollout against current rubrics.

Request:

{
  "rollout": {
    "id": "unique_rollout_id",
    "input": "User prompt or question",
    "output": "Model's response",
    "supplemental_context": "Ground truth, reference solution, etc."
  },
  "metadata": {
    "high_level_goal": "Train a helpful assistant",
    "model_size": "7B",
    "current_step": 5,
    "total_steps": 100,
    "other_context": "Optional additional context"
  }
}

Response:

{
  "score": 0.75
}

Behavior:

  • Async mode: Returns immediately; analysis runs in background
  • Sync mode: Blocks until rubric update completes, returns re-scored value
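
For callers that cannot use `RewardClient`, the request above can be sent with any HTTP client. The following is a rough sketch using only the standard library; the helper names `build_score_request` and `post_score` are illustrative, and the host and port are taken from the default `server` settings:

```python
import json
import urllib.request


def build_score_request(rollout_id, prompt, response, reference, step, total_steps):
    """Assemble the JSON body documented for POST /score."""
    return {
        "rollout": {
            "id": rollout_id,
            "input": prompt,
            "output": response,
            "supplemental_context": reference,
        },
        "metadata": {
            "high_level_goal": "Train a helpful assistant",
            "model_size": "7B",
            "current_step": step,
            "total_steps": total_steps,
        },
    }


def post_score(base_url: str, payload: dict, timeout: float = 60.0) -> float:
    """POST the payload to /score and return the `score` field (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["score"]


payload = build_score_request("step5_r0", "What is 2 + 2?", "4.", "Answer: 4", 5, 100)
print(json.dumps(payload, indent=2))
# score = post_score("http://localhost:8002", payload)  # uncomment with the server running
```

In sync mode the call blocks until the rubric update finishes, so a generous timeout is advisable.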

POST /complete

Signal that a batch of rollouts is complete. Triggers analysis/summarization/update pipeline. If pipeline.rollouts_per_step is set, this endpoint is a no-op.

Response (manual mode):

{
  "status": "complete"
}

Response (when pipeline.rollouts_per_step is set):

{
  "status": "noop"
}

GET /rubric_status/wait

Wait for rubric update to complete (blocking). Waits until all inputs in the step are complete.

Response:

{
  "status": "updated",
  "mode": "global"
}

Client Library

The RewardClient provides a simple API with async scoring and sync/async batch control methods. Scoring is async-only to enforce concurrent rollout submission, which prevents deadlocks in sync pipeline mode.

import asyncio
from amaris import RewardClient

async def train():
    async with RewardClient("http://localhost:8002", timeout=600) as client:
        # 1. Score rollouts concurrently
        scores = await asyncio.gather(*(
            client.score_async(
                rollout={"id": f"r{i}", "input": "What is 2 + 2?", "output": "4.", "supplemental_context": "Answer: 4"},
                metadata={"high_level_goal": "Train a math assistant", "model_size": "7B", "current_step": 1, "total_steps": 100},
            )
            for i in range(batch_size)
        ))

        # 2. Signal batch complete (manual mode only, sync)
        client.rollouts_complete()

        # 3. Wait for rubric evolution (sync)
        client.rubric_updated()

asyncio.run(train())

Async Batch Control Methods

For fully async training loops, async versions of batch control are also available:

async def train_step():
    async with RewardClient("http://localhost:8002") as client:
        scores = await asyncio.gather(*(client.score_async(r, meta) for r in rollouts))
        await client.rollouts_complete_async()
        await client.rubric_updated_async()

Data Models

Core Models

Rollout

class Rollout:
    id: str                        # Unique identifier
    input: str                     # User prompt/question
    output: str                    # Model response
    supplemental_context: str      # Ground truth, reference, etc.
    metadata: dict                 # Additional metadata

Rubric

class Rubric:
    id: int                        # Auto-generated ID
    text: str                      # Evaluation criterion text
    weight: float                  # Importance weight (default: 1.0)
    is_anchor: bool                # True = immutable
    created_at: datetime
    metadata: dict
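
The `is_anchor` flag is what separates the two rubric tiers. As a rough illustration only (the class names `RubricSketch` and `RubricSetSketch` and the example adaptive criterion are hypothetical, and the real Rubric Manager may enforce immutability differently), a container could keep anchors in a fixed tuple and allow only the adaptive list to be replaced:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RubricSketch:
    text: str
    weight: float = 1.0
    is_anchor: bool = False


@dataclass
class RubricSetSketch:
    anchors: tuple[RubricSketch, ...]                      # fixed for the whole run
    adaptive: list[RubricSketch] = field(default_factory=list)

    def replace_adaptive(self, new_rubrics: list[RubricSketch]) -> None:
        """Swap in a new adaptive set; anchors are rejected and never touched."""
        if any(r.is_anchor for r in new_rubrics):
            raise ValueError("anchor rubrics cannot be introduced by an update")
        self.adaptive = list(new_rubrics)

    def active(self) -> list[RubricSketch]:
        """Anchors first, then the current adaptive rubrics."""
        return [*self.anchors, *self.adaptive]


rubrics = RubricSetSketch(
    anchors=(RubricSketch("The model provides the correct final answer.", is_anchor=True),)
)
# Hypothetical adaptive criterion, for illustration only.
rubrics.replace_adaptive([RubricSketch("Shows intermediate reasoning steps.", weight=0.5)])
print([r.text for r in rubrics.active()])
```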

TrainingMetadata

class TrainingMetadata:
    high_level_goal: str           # Training objective description
    model_size: str                # Model size (e.g., "7B")
    current_step: int              # Current training step
    total_steps: int               # Total training steps
    other_context: str             # Additional context

Analysis Models

IndividualAnalysis

Deep analysis of a single rollout:

class IndividualAnalysis:
    analysis_id: str
    rollout_id: str
    step: int
    overall_assessment: OverallAssessment
    performance_at_stage: PerformanceAtStage
    reward_hacking_analysis: RewardHackingAnalysis
    performance_advancement_strategy: PerformanceAdvancementStrategy
    rubric_evaluation: RubricEvaluation
    other_observations: list[OtherObservation]

BatchSummary

Aggregated insights from a batch:

class BatchSummary:
    batch_summary_id: str
    step: int
    executive_summary: ExecutiveSummary
    systemic_patterns: list[SystemicPattern]
    aggregated_reward_hacking: AggregatedHackingAnalysis
    rubric_meta_evaluation: RubricMetaEvaluation
    performance_advancement_analysis: PerformanceAdvancementAnalysis
    rubric_evolution_plan: RubricEvolutionPlan
    memory_entry: MemoryEntry

RubricUpdate

Rubric modification result:

class RubricUpdate:
    rubric_update_id: str
    step: int
    update_strategy: str           # "MAINTENANCE", "CORRECTIVE", "ADVANCEMENT"
    memory_consultation: MemoryConsultation
    rationale: UpdateRationale
    rubric_modifications: list[RubricModification]
    final_adaptive_rubric_set: list[Rubric]
    memory_artifact: MemoryArtifact

Pipeline Workflow

Async Mode (Default)

Client calls /score
        │
        ▼
┌───────────────────┐
│   Scorer Agent    │──────► Returns score immediately
│   (Fast eval)     │
└───────────────────┘
        │
        ▼ (Background)
┌───────────────────┐
│  Analyzer Agent   │──────► Store to ChromaDB
│ (Deep analysis)   │
└───────────────────┘
        │
        ▼ (On /complete, manual only)
┌───────────────────┐
│ Summarizer Agent  │──────► Aggregate batch insights
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ QueryGenerator    │──────► Generate memory search queries
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Updater Agent    │──────► Evolve rubrics
└───────────────────┘
        │
        ▼
  Rubric update complete
  (Client /rubric_status/wait returns)
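
The control flow in the diagram can be mimicked with plain asyncio. This is a toy sketch, not the actual orchestrator: every agent is a stub, a Python list stands in for ChromaDB, and the names `PipelineSketch`, `score`, and `complete` are assumptions made for illustration.

```python
import asyncio


class PipelineSketch:
    """Toy stand-in for the async pipeline; every agent is a stub."""

    def __init__(self) -> None:
        self._background: list[asyncio.Task] = []
        self.memory: list[str] = []            # stands in for ChromaDB
        self.rubric_updated = asyncio.Event()

    async def _analyze(self, rollout_id: str) -> None:
        await asyncio.sleep(0.01)              # pretend this is a slow LLM call
        self.memory.append(f"analysis:{rollout_id}")

    async def score(self, rollout_id: str) -> float:
        # The score is returned right away; analysis continues in the background.
        self._background.append(asyncio.create_task(self._analyze(rollout_id)))
        return 0.75                            # placeholder score

    async def complete(self) -> None:
        # Wait for outstanding analyses, then summarize and signal the update.
        await asyncio.gather(*self._background)
        self._background.clear()
        self.memory.append("batch_summary")
        self.rubric_updated.set()


async def main():
    pipeline = PipelineSketch()
    scores = await asyncio.gather(*(pipeline.score(f"r{i}") for i in range(3)))
    await pipeline.complete()
    await pipeline.rubric_updated.wait()       # analogous to /rubric_status/wait
    return scores, pipeline.memory


scores, memory = asyncio.run(main())
print(scores, memory)
```

The point of the design is that `score` never waits on analysis, so the training loop is only blocked when it explicitly waits for the rubric update.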

Sync Mode

Client calls /score
        │
        ▼
┌───────────────────┐
│   Full Pipeline   │──────► Score → Analyze → Summarize → Update
│   (Blocking)      │
└───────────────────┘
        │
        ▼
  Re-score with NEW rubrics
        │
        ▼
  Returns updated score

Advanced Features

Per-Instance Rubrics

For heterogeneous tasks (math, coding, writing), use separate rubrics per input type. Input IDs are derived from rollout.input and handled internally.

memory:
  rubrics_mode: "per_instance"
  per_instance_dir: "./data/per_instance"

score = await client.score_async(
    rollout={
        "id": "r1",
        "input": "Write a poem",
        "output": "Roses are red..."
    },
    metadata={...}
)

# One /complete triggers updates for all inputs in the step (manual mode only)
client.rollouts_complete()
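
How the input ID is derived is not documented here. Purely as an illustration of the idea, and assuming a content hash of `rollout.input` (the truncated SHA-256 and both function names below are hypothetical choices, not the library's behavior), identical prompts would map to the same per-instance directory:

```python
import hashlib
from pathlib import Path


def input_id_sketch(rollout_input: str) -> str:
    """Hypothetical stable ID: first 16 hex chars of the SHA-256 of the input text."""
    return hashlib.sha256(rollout_input.encode("utf-8")).hexdigest()[:16]


def instance_dir_sketch(per_instance_dir: str, rollout_input: str) -> Path:
    """Directory that would hold this input's rubrics and memory."""
    return Path(per_instance_dir) / input_id_sketch(rollout_input)


a = instance_dir_sketch("./data/per_instance", "Write a poem")
b = instance_dir_sketch("./data/per_instance", "Write a poem")
c = instance_dir_sketch("./data/per_instance", "Solve 2 + 2")
print(a == b, a == c)  # True False
```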

Chunked Processing

For large batches (>100 rollouts), enable chunking to prevent memory issues:

pipeline:
  use_chunking: true
  chunk_size: 32

This splits the batch, summarizes each chunk separately, then uses MetaSummarizer to consolidate.
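
The split, summarize, consolidate flow can be sketched as follows. The two summarizer functions are stubs standing in for LLM calls, and skipping the meta step when only one chunk exists is an assumption of this sketch:

```python
def chunked(items: list, chunk_size: int) -> list[list]:
    """Split items into consecutive chunks of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def summarize_stub(analyses: list) -> str:
    return f"summary of {len(analyses)} analyses"


def meta_summarize_stub(summaries: list) -> str:
    return f"meta-summary of {len(summaries)} chunk summaries"


def summarize_batch(analyses: list, chunk_size: int = 32) -> str:
    chunks = chunked(analyses, chunk_size)
    if len(chunks) <= 1:
        # Small batch: a single summary is enough (assumed behavior).
        return summarize_stub(analyses)
    return meta_summarize_stub([summarize_stub(c) for c in chunks])


analyses = [f"analysis_{i}" for i in range(100)]
print([len(c) for c in chunked(analyses, 32)])  # [32, 32, 32, 4]
print(summarize_batch(analyses))                # meta-summary of 4 chunk summaries
```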

Auto-Completion

Automatically trigger batch processing after N rollouts:

pipeline:
  rollouts_per_step: 32

No need to call client.rollouts_complete() manually. When rollouts_per_step is set, /complete is ignored.
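
Conceptually, auto-completion amounts to counting scored rollouts and firing the batch pipeline on every Nth one. A minimal sketch, assuming a simple counter that resets after each trigger (the class `AutoCompleteCounter` is hypothetical):

```python
from typing import Optional


class AutoCompleteCounter:
    """Counts scored rollouts and reports when a batch is full."""

    def __init__(self, rollouts_per_step: Optional[int]) -> None:
        self.rollouts_per_step = rollouts_per_step
        self._seen = 0

    def record(self) -> bool:
        """Count one scored rollout; return True when the batch should be processed."""
        if self.rollouts_per_step is None:
            return False                      # manual mode: wait for /complete
        self._seen += 1
        if self._seen >= self.rollouts_per_step:
            self._seen = 0                    # start counting the next step
            return True
        return False


counter = AutoCompleteCounter(rollouts_per_step=32)
triggers = [counter.record() for _ in range(64)]
print(triggers.count(True))  # 2
print(triggers.index(True))  # 31
```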

Rubric Update Frequency

Control how often rubrics are updated to reduce LLM costs during training:

pipeline:
  rubric_update_frequency: 5  # Update rubrics every 5 steps

  • 1 or null: Update rubrics every step (default)
  • N > 1: Update rubrics at steps 0, N, 2N, 3N, ... (e.g., N = 5 updates at steps 0, 5, 10, 15, ...)

Summarization still runs every step to preserve memory context; only the expensive rubric update is skipped on non-update steps.
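
The schedule described above reduces to a modulo check on the step number. A small sketch (the function name `should_update_rubrics` is illustrative):

```python
from typing import Optional


def should_update_rubrics(step: int, frequency: Optional[int]) -> bool:
    """Return True on steps where the rubric update should run."""
    if frequency is None or frequency <= 1:
        return True                # every step
    return step % frequency == 0   # steps 0, N, 2N, ...


print([s for s in range(16) if should_update_rubrics(s, 5)])  # [0, 5, 10, 15]
```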

Custom Prompt Templates

Override default prompts by specifying file paths:

prompts:
  scorer: "./prompts/my_scorer.txt"
  analyzer: "./prompts/my_analyzer.txt"

Utility Scripts

setup_rubrics.py

Initialize rubrics from config (clears existing):

python setup_rubrics.py

inspect_chroma_db.py

Inspect ChromaDB memory contents:

python inspect_chroma_db.py ./data/chroma --limit 50

Output includes:

  • Step histogram (document count by training step)
  • Type × Step breakdown
  • Recent documents with preview

test_rl_simulation.py

Run a simulated RL training loop:

python test_rl_simulation.py

Simulates:

  • Math problem training
  • Various model behaviors (correct, wrong, verbose, sycophant, reward hacking)
  • Behavior evolution across training stages
  • Score tracking and rubric evolution

Project Structure

AMARIS/
├── config.yaml                    # Main configuration
├── requirements.txt               # Dependencies
├── pyproject.toml                 # Package metadata
├── README.md                      # This documentation
│
├── setup_rubrics.py               # Rubric initialization utility
├── inspect_chroma_db.py           # Memory inspection utility
├── test_rl_simulation.py              # RL training simulation
├── per_instance_rubrics_example.json  # Example per-instance rubrics
│
├── amaris/
│   ├── __init__.py                # Exports RewardClient
│   │
│   ├── client/
│   │   ├── __init__.py
│   │   └── client.py              # RewardClient (3-function API)
│   │
│   └── server/
│       ├── __init__.py
│       ├── main.py                # Server entry point
│       ├── api.py                 # FastAPI endpoints
│       ├── config.py              # Configuration loader
│       ├── models.py              # Pydantic data models
│       ├── pipeline.py            # Processing orchestrator
│       ├── memory.py              # ChromaDB abstraction
│       ├── rubrics.py             # Rubric management
│       ├── json_logger.py         # LLM call logging
│       │
│       ├── agents/                # LLM agents
│       │   ├── __init__.py
│       │   ├── factory.py         # OpenAI client factory
│       │   ├── scorer.py
│       │   ├── analyzer.py
│       │   ├── summarizer.py
│       │   ├── meta_summarizer.py
│       │   ├── query_generator.py
│       │   └── updater.py
│       │
│       └── prompts/               # Prompt templates
│           ├── __init__.py
│           ├── scorer.py
│           ├── analyzer.py
│           ├── summarizer.py
│           ├── meta_summarizer.py
│           ├── query_generator.py
│           └── updater.py
│
└── data/                          # Runtime data (gitignored)
    ├── chroma/                    # ChromaDB vector storage
    ├── rubrics.json               # Persisted rubrics
    ├── logs/                      # LLM call logs
    └── per_instance/              # Per-instance data (if enabled)

Data Storage

| Component | Location | Description |
|-----------|----------|-------------|
| ChromaDB | ./data/chroma/ | Vector database with semantic memory |
| Rubrics | ./data/rubrics.json | Current rubric set (JSON) |
| LLM Logs | ./data/logs/llm_log_*.jsonl | JSONL logs of all LLM calls |
| Per-Instance | ./data/per_instance/{id}/ | Separate rubrics/memory per input type |

Design Principles

  1. Modular: Each component is independently testable
  2. OpenAI-Compatible: Works with any provider supporting OpenAI API format
  3. Async-First: Background processing doesn't block training
  4. Memory-Driven: Learns from history via semantic search
  5. Flexible Rubrics: Supports both shared and per-input customization
  6. JSON-Robust: Agents use regex fallback for imperfect LLM outputs
  7. Production-Ready: Comprehensive logging, error handling, graceful degradation
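
To make the JSON-robustness principle concrete, here is one way a regex fallback could look. It is a standard-library sketch only: the agents may instead rely on json-repair (listed under Dependencies) or on different logic, and `parse_llm_json` is a hypothetical name.

```python
import json
import re
from typing import Any, Optional


def parse_llm_json(text: str) -> Optional[dict[str, Any]]:
    """Try strict parsing first, then fall back to the outermost {...} span."""
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Fallback: grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


raw = 'Sure, here is the result: {"score": 0.75} Hope that helps!'
print(parse_llm_json(raw))             # {'score': 0.75}
print(parse_llm_json("no json here"))  # None
```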

Troubleshooting

Server won't start

  1. Check Python version: python --version (requires >= 3.12)
  2. Verify dependencies: pip install -r requirements.txt
  3. Check config file path: python -m amaris.server.main --config config.yaml

LLM API errors

  1. Verify API key: echo $OPENAI_API_KEY
  2. Test endpoint: curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
  3. Check model availability for your API tier

Memory issues with large batches

Enable chunking:

pipeline:
  use_chunking: true
  chunk_size: 32

Rubrics not evolving

  1. Ensure /complete is called after batch (manual mode only)
  2. Wait for update: /rubric_status/wait
  3. Check logs for LLM errors

License

MIT License
