AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

An intelligent reward system for Reinforcement Learning training that combines LLM-based evaluation with semantic memory. The system evaluates model rollouts, detects reward hacking, learns from training history, and dynamically evolves evaluation criteria (rubrics) over time.

Full Documentation: the topic-based docs live under docs/. Start at docs/index.md for audience-specific reading paths (integrators, contributors, researchers). This README is a self-contained quick start.

Key Features

Dynamic Rubric Evolution: Evaluation criteria automatically adapt based on observed model behaviors and training patterns
Reward Hacking Detection: Identifies when models game evaluation metrics without genuine improvement
Long-term Semantic Memory: Uses ChromaDB to learn from training history and inform future evaluations
Two-Tier Rubric System: Immutable anchor rubrics + fully modifiable adaptive rubrics
Async-First Architecture: Background processing doesn't block RL training loops
OpenAI-Compatible: Works with OpenAI, Azure OpenAI, Ollama, vLLM, or any OpenAI-compatible API

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              RL Training Loop                               │
│                                                                             │
│   ┌───────────┐    ┌───────────────┐    ┌──────────────┐                    │
│   │  Generate │───►│  Score via    │───►│  Update Model│                    │
│   │  Rollouts │    │  RewardClient │    │  (PPO, etc.) │                    │
│   └───────────┘    └───────┬───────┘    └──────────────┘                    │
│                            │                                                │
└────────────────────────────┼────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Reward System Server                               │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                         Pipeline Orchestrator                       │   │
│   │                                                                     │   │
│   │  ┌─────────┐   ┌──────────┐   ┌────────────┐   ┌─────────────────┐  │   │
│   │  │ Scorer  │──►│ Analyzer │──►│ Summarizer │──►│ Rubric Updater  │  │   │
│   │  │ Agent   │   │ Agent    │   │ Agent      │   │ Agent           │  │   │
│   │  └─────────┘   └────┬─────┘   └──────┬─────┘   └────────┬────────┘  │   │
│   │                     │                │                  │           │   │
│   │                     ▼                ▼                  ▼           │   │
│   │              ┌──────────────────────────────────────────────┐       │   │
│   │              │          ChromaDB (Semantic Memory)          │       │   │
│   │              │  • Individual Analyses • Batch Summaries     │       │   │
│   │              │  • Rubric Update History • Meta Summaries    │       │   │
│   │              └──────────────────────────────────────────────┘       │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│   ┌────────────────────┐    ┌────────────────────────────────────────┐      │
│   │   Rubric Manager   │    │            LLM Agents (6)              │      │
│   │  • Anchor Rubrics  │    │  Scorer, Analyzer, Summarizer,         │      │
│   │  • Adaptive Rubrics│    │  MetaSummarizer, QueryGenerator,       │      │
│   │  • Persistence     │    │  Updater                               │      │
│   └────────────────────┘    └────────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Agent Responsibilities

Agent	Purpose	When Called
Scorer	Fast evaluation of rollout against current rubrics	Every `/score` request
Analyzer	Deep analysis of individual rollouts (hacking detection, performance assessment)	Background after scoring
Summarizer	Aggregate insights from batch of analyses	After batch complete
MetaSummarizer	Consolidate multiple batch summaries (for chunked processing)	Large batches with chunking
QueryGenerator	Generate targeted semantic search queries	Before rubric update
Updater	Evolve rubrics based on summary + historical context	End of each batch

Installation

Requirements

Python >= 3.12
OpenAI API key (or compatible provider)

Install with uv (Recommended)

uv is a fast Python package manager:

uv pip install -e .

Install with pip

pip install -e .

Dependencies

openai>=1.55.0      # OpenAI SDK (works with all compatible providers)
chromadb>=0.5.3     # Vector database for semantic memory
fastapi>=0.115.0    # Web framework
uvicorn>=0.32.0     # ASGI server
httpx>=0.27.0       # HTTP client
pydantic>=2.9.0     # Data validation
pyyaml>=6.0.0       # Configuration parsing
json-repair>=0.30.0 # JSON repair for imperfect LLM outputs

Quick Start

1. Configure the System

Create or modify config.yaml:

client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var if null
    base_url: null  # null = OpenAI default

agents:
  default:
    model: "gpt-4o-mini"
    response_format: { type: json_object }

memory:
  path: "./data/chroma"
  rubrics_path: "./data/rubrics.json"
  anchor_rubrics:
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []

server:
  host: "0.0.0.0"
  port: 8002

pipeline:
  mode: "async"

2. Start the Server

export OPENAI_API_KEY=your_key_here
python -m amaris.server.main --config config.yaml

3. Integrate with Training Loop

import asyncio
from amaris import RewardClient

# total_steps, generate_rollouts(), and compute_rl_loss() stand in for your own training code.
async def train():
    async with RewardClient("http://localhost:8002") as client:
        for step in range(total_steps):
            rollouts = generate_rollouts()
            metadata = {
                "high_level_goal": "Train a helpful math assistant",
                "model_size": "7B",
                "current_step": step,
                "total_steps": total_steps,
            }
            # Score all rollouts concurrently
            scores = await asyncio.gather(*(
                client.score_async(
                    rollout={
                        "id": f"step{step}_r{i}",
                        "input": r.prompt,
                        "output": r.response,
                        "supplemental_context": r.ground_truth,
                    },
                    metadata=metadata,
                )
                for i, r in enumerate(rollouts)
            ))
            # Use scores for RL update
            loss = compute_rl_loss(scores)

            # Signal batch complete (needed only in manual mode, rollouts_per_step = null)
            client.rollouts_complete()
            # Block until the rubric update for this step has finished
            client.rubric_updated()

asyncio.run(train())

Configuration

Full Configuration Reference

# OpenAI API client settings
client:
  default:
    api_key: null           # API key (null = use OPENAI_API_KEY env var)
    base_url: null          # Custom endpoint (null = OpenAI default)
  # Per-agent overrides (optional)
  analyzer:
    api_key: "different_key"
    base_url: "https://custom-endpoint.com"

# LLM model options (per-agent customization)
agents:
  default:
    model: "gpt-4o-mini"           # Model name
    service_tier: "flex"           # OpenAI service tier
    response_format: { type: json_object }
  scorer:
    model: "gpt-4o-mini"           # Fast model for scoring
  analyzer:
    model: "gpt-4o"                # More capable model for analysis
  # Available agents: scorer, analyzer, summarizer, meta_summarizer, query_generator, updater
  # Additional per-agent options: extra_body (dict for provider-specific parameters, e.g., OpenRouter)

# Concurrency control for LLM API calls
concurrency:
  total: null                # Max parallel LLM calls across all agents (null = unlimited)
  scorer: null               # Per-agent limits (null = unlimited)
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null

# Memory and rubric settings
memory:
  path: "./data/chroma"                    # ChromaDB storage path
  rubrics_path: "./data/rubrics.json"      # Rubric persistence file
  collection: "reward_memory"              # ChromaDB collection name
  embedding_model: null                    # null = ChromaDB default (all-MiniLM-L6-v2)
  static_memory_steps: 5                   # Recent batch summaries for context
  dynamic_memory_limit_per_query: 3        # Semantic search results per query
  max_queries: null                        # Max search queries per update; null = adaptive
  rubrics_mode: "global"                   # "global" or "per_instance"
  per_instance_dir: "./data/per_instance"  # Per-instance data directory
  per_instance_rubrics_path: null           # Path to JSON with pre-defined rubrics per input_id
  anchor_rubrics:                          # Immutable evaluation criteria
    - text: "The model provides the correct final answer."
      weight: 1.0
  adaptive_rubrics: []                     # Dynamic criteria (initially empty)

# Custom prompt templates (optional)
prompts:
  scorer: null          # null = use built-in prompt
  analyzer: null
  summarizer: null
  meta_summarizer: null
  query_generator: null
  updater: null

# Server settings
server:
  host: "0.0.0.0"
  port: 8002

# Pipeline behavior
pipeline:
  mode: "async"              # "async" or "sync"
  rollouts_per_step: null    # null = manual /complete; set integer for auto-complete (manual /complete is ignored)
  use_chunking: true         # Enable chunked processing for large batches
  chunk_size: 32             # Rollouts per chunk
  rubric_update_frequency: 1 # Update rubrics every N steps (1 or null = every step)

# Logging
logging:
  path: null                 # null = defaults to data parent directory

Provider-Specific Examples

OpenAI (Default)

client:
  default:
    api_key: null  # Uses OPENAI_API_KEY env var
    base_url: null

agents:
  default:
    model: "gpt-4o-mini"

Azure OpenAI

client:
  default:
    api_key: "your_azure_api_key"
    base_url: "https://<resource>.openai.azure.com"

agents:
  default:
    model: "gpt-4o"  # Your deployment name

Ollama (Local)

client:
  default:
    api_key: "ollama"  # Any value works
    base_url: "http://localhost:11434/v1"

agents:
  default:
    model: "llama3.2"

vLLM (Local Inference)

client:
  default:
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"

agents:
  default:
    model: "meta-llama/Llama-3.2-3B-Instruct"
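All provider examples rely on the same two rules from the configuration reference: a per-agent section overrides client.default, and a null api_key falls back to the OPENAI_API_KEY environment variable. The sketch below shows one way those rules could be applied. The function resolve_client_settings is hypothetical, and the key-by-key merge is an assumption about how partial overrides behave.

```python
import os


def resolve_client_settings(config, agent):
    """Hypothetical: merge client.default with client.<agent>; agent values win."""
    client_cfg = config.get("client", {})
    merged = dict(client_cfg.get("default") or {})
    merged.update(client_cfg.get(agent) or {})
    # A null api_key falls back to the OPENAI_API_KEY environment variable.
    if merged.get("api_key") is None:
        merged["api_key"] = os.environ.get("OPENAI_API_KEY")
    return merged


config = {
    "client": {
        "default": {"api_key": None, "base_url": None},
        "analyzer": {
            "api_key": "different_key",
            "base_url": "https://custom-endpoint.com",
        },
    }
}

# Set here only so the example is self-contained.
os.environ["OPENAI_API_KEY"] = "env_key"

scorer_settings = resolve_client_settings(config, "scorer")
analyzer_settings = resolve_client_settings(config, "analyzer")
```

With this config, the scorer inherits the default section and picks up the key from the environment, while the analyzer uses its own key and endpoint.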

API Reference

Endpoints

POST /score

Score a rollout against current rubrics.

Request:

{
  "rollout": {
    "id": "unique_rollout_id",
    "input": "User prompt or question",
    "output": "Model's response",
    "supplemental_context": "Ground truth, reference solution, etc."
  },
  "metadata": {
    "high_level_goal": "Train a helpful assistant",
    "model_size": "7B",
    "current_step": 5,
    "total_steps": 100,
    "other_context": "Optional additional context"
  }
}

Response:

{
  "score": 0.75
}

Behavior:

Async mode: Returns immediately; analysis runs in background
Sync mode: Blocks until rubric update completes, returns re-scored value
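For callers that cannot use RewardClient, the request shape above can be sent with any HTTP client. The following is a minimal sketch using only the Python standard library; build_score_request is a hypothetical helper, and the snippet only builds the request so it can run without a server.

```python
import json
import urllib.request


def build_score_request(base_url, rollout, metadata):
    """Build (but do not send) a POST /score request."""
    body = json.dumps({"rollout": rollout, "metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_score_request(
    "http://localhost:8002",
    rollout={
        "id": "r1",
        "input": "What is 2 + 2?",
        "output": "4.",
        "supplemental_context": "Answer: 4",
    },
    metadata={
        "high_level_goal": "Train a math assistant",
        "model_size": "7B",
        "current_step": 1,
        "total_steps": 100,
    },
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     score = json.load(response)["score"]
```

In sync pipeline mode the call blocks until the rubric update completes, so a generous timeout on the sending side is advisable.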

POST /complete

Signal that a batch of rollouts is complete. This triggers the analysis/summarization/update pipeline. If pipeline.rollouts_per_step is set, this endpoint is a no-op.

Response (manual mode):

{
  "status": "complete"
}

Response (when pipeline.rollouts_per_step is set):

{
  "status": "noop"
}

GET /rubric_status/wait

Wait for rubric update to complete (blocking). Waits until all inputs in the step are complete.

Response:

{
  "status": "updated",
  "mode": "global"
}

Client Library

The RewardClient provides a simple API with async scoring and sync/async batch control methods. Scoring is async-only to enforce concurrent rollout submission, which prevents deadlocks in sync pipeline mode.

import asyncio
from amaris import RewardClient

async def train(batch_size: int = 8):
    async with RewardClient("http://localhost:8002", timeout=600) as client:
        # 1. Score rollouts concurrently
        scores = await asyncio.gather(*(
            client.score_async(
                rollout={"id": f"r{i}", "input": "What is 2 + 2?", "output": "4.", "supplemental_context": "Answer: 4"},
                metadata={"high_level_goal": "Train a math assistant", "model_size": "7B", "current_step": 1, "total_steps": 100},
            )
            for i in range(batch_size)
        ))

        # 2. Signal batch complete (manual mode only, sync)
        client.rollouts_complete()

        # 3. Wait for rubric evolution (sync)
        client.rubric_updated()

asyncio.run(train())

Async Batch Control Methods

For fully async training loops, async versions of batch control are also available:

import asyncio
from amaris import RewardClient

async def train_step(rollouts, meta):
    async with RewardClient("http://localhost:8002") as client:
        scores = await asyncio.gather(*(client.score_async(r, meta) for r in rollouts))
        await client.rollouts_complete_async()
        await client.rubric_updated_async()
        return scores

Data Models

Core Models

Rollout

class Rollout:
    id: str                        # Unique identifier
    input: str                     # User prompt/question
    output: str                    # Model response
    supplemental_context: str      # Ground truth, reference, etc.
    metadata: dict                 # Additional metadata

Rubric

class Rubric:
    id: int                        # Auto-generated ID
    text: str                      # Evaluation criterion text
    weight: float                  # Importance weight (default: 1.0)
    is_anchor: bool                # True = immutable
    created_at: datetime
    metadata: dict
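The is_anchor flag is what separates the two rubric tiers: anchor rubrics are immutable, adaptive rubrics can be changed freely. The sketch below shows how such a guard might look. RubricSketch and RubricSetSketch are hypothetical simplifications, not the actual classes in models.py or rubrics.py, and the second rubric text is made up for illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RubricSketch:
    """Simplified stand-in for the Rubric model."""
    id: int
    text: str
    weight: float = 1.0
    is_anchor: bool = False


@dataclass
class RubricSetSketch:
    """Hypothetical container that refuses edits to anchor rubrics."""
    rubrics: dict = field(default_factory=dict)

    def add(self, rubric):
        self.rubrics[rubric.id] = rubric

    def replace_text(self, rubric_id, new_text):
        current = self.rubrics[rubric_id]
        if current.is_anchor:
            raise PermissionError(f"rubric {rubric_id} is an anchor and cannot be modified")
        self.rubrics[rubric_id] = replace(current, text=new_text)


rubric_set = RubricSetSketch()
rubric_set.add(RubricSketch(1, "The model provides the correct final answer.", is_anchor=True))
rubric_set.add(RubricSketch(2, "Shows intermediate steps."))  # made-up adaptive rubric

rubric_set.replace_text(2, "Shows each intermediate step.")
try:
    rubric_set.replace_text(1, "Anything goes.")
    anchor_blocked = False
except PermissionError:
    anchor_blocked = True
```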

TrainingMetadata

class TrainingMetadata:
    high_level_goal: str           # Training objective description
    model_size: str                # Model size (e.g., "7B")
    current_step: int              # Current training step
    total_steps: int               # Total training steps
    other_context: str             # Additional context

Analysis Models

IndividualAnalysis

Deep analysis of a single rollout:

class IndividualAnalysis:
    analysis_id: str
    rollout_id: str
    step: int
    overall_assessment: OverallAssessment
    performance_at_stage: PerformanceAtStage
    reward_hacking_analysis: RewardHackingAnalysis
    performance_advancement_strategy: PerformanceAdvancementStrategy
    rubric_evaluation: RubricEvaluation
    other_observations: list[OtherObservation]

BatchSummary

Aggregated insights from a batch:

class BatchSummary:
    batch_summary_id: str
    step: int
    executive_summary: ExecutiveSummary
    systemic_patterns: list[SystemicPattern]
    aggregated_reward_hacking: AggregatedHackingAnalysis
    rubric_meta_evaluation: RubricMetaEvaluation
    performance_advancement_analysis: PerformanceAdvancementAnalysis
    rubric_evolution_plan: RubricEvolutionPlan
    memory_entry: MemoryEntry

RubricUpdate

Rubric modification result:

class RubricUpdate:
    rubric_update_id: str
    step: int
    update_strategy: str           # "MAINTENANCE", "CORRECTIVE", "ADVANCEMENT"
    memory_consultation: MemoryConsultation
    rationale: UpdateRationale
    rubric_modifications: list[RubricModification]
    final_adaptive_rubric_set: list[Rubric]
    memory_artifact: MemoryArtifact

Pipeline Workflow

Async Mode (Default)

Client calls /score
        │
        ▼
┌───────────────────┐
│   Scorer Agent    │──────► Returns score immediately
│   (Fast eval)     │
└───────────────────┘
        │
        ▼ (Background)
┌───────────────────┐
│  Analyzer Agent   │──────► Store to ChromaDB
│ (Deep analysis)   │
└───────────────────┘
        │
        ▼ (On /complete, manual only)
┌───────────────────┐
│ Summarizer Agent  │──────► Aggregate batch insights
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ QueryGenerator    │──────► Generate memory search queries
└───────────────────┘
        │
        ▼
┌───────────────────┐
│  Updater Agent    │──────► Evolve rubrics
└───────────────────┘
        │
        ▼
  Rubric update complete
  (Client /rubric_status/wait returns)
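The key property of async mode is that the score is returned before the deep analysis finishes. A minimal asyncio sketch of that pattern follows. All function names are hypothetical stand-ins, the constant score is a placeholder, and a plain list replaces ChromaDB.

```python
import asyncio


async def fast_score(rollout):
    """Stand-in for the Scorer agent (placeholder constant score)."""
    return 0.75


async def deep_analysis(rollout, store):
    """Stand-in for the Analyzer agent; appends to an in-memory list."""
    await asyncio.sleep(0.01)
    store.append({"rollout_id": rollout["id"]})


async def handle_score(rollout, store, background):
    score = await fast_score(rollout)
    # Schedule analysis without awaiting it, so the caller gets the score now.
    background.add(asyncio.create_task(deep_analysis(rollout, store)))
    return score


async def demo():
    store, background = [], set()
    score = await handle_score({"id": "r1"}, store, background)
    analyses_at_return = len(store)    # analysis has not finished yet
    await asyncio.gather(*background)  # what a batch-complete handler might wait for
    return score, analyses_at_return, len(store)


result = asyncio.run(demo())
```

Keeping the tasks in a set holds a reference to each one, which prevents them from being garbage-collected before they finish.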

Sync Mode

Client calls /score
        │
        ▼
┌───────────────────┐
│   Full Pipeline   │──────► Score → Analyze → Summarize → Update
│   (Blocking)      │
└───────────────────┘
        │
        ▼
  Re-score with NEW rubrics
        │
        ▼
  Returns updated score

Advanced Features

Per-Instance Rubrics

For heterogeneous tasks (math, coding, writing), use separate rubrics per input type. Input IDs are derived from rollout.input and handled internally.

memory:
  rubrics_mode: "per_instance"
  per_instance_dir: "./data/per_instance"

score = await client.score_async(
    rollout={
        "id": "r1",
        "input": "Write a poem",
        "output": "Roses are red..."
    },
    metadata={...}
)

# One /complete triggers updates for all inputs in the step (manual mode only)
client.rollouts_complete()
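Because the input ID is derived from rollout.input, two rollouts with the same prompt share one rubric set. The derivation itself is internal and not documented here; the sketch below shows one plausible approach purely as an assumption, using a truncated SHA-256 hash. The functions input_id_for and instance_dir are hypothetical.

```python
import hashlib
from pathlib import Path


def input_id_for(rollout_input):
    """Hypothetical: stable ID from the prompt text (whitespace-trimmed)."""
    normalized = rollout_input.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def instance_dir(base_dir, rollout_input):
    """Directory that would hold this input's rubrics and memory."""
    return Path(base_dir) / input_id_for(rollout_input)


poem_id = input_id_for("Write a poem")
same_poem_id = input_id_for("  Write a poem\n")
math_id = input_id_for("What is 2 + 2?")
poem_dir = instance_dir("./data/per_instance", "Write a poem")
```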

Chunked Processing

For large batches (>100 rollouts), enable chunking to prevent memory issues:

pipeline:
  use_chunking: true
  chunk_size: 32

This splits the batch, summarizes each chunk separately, then uses MetaSummarizer to consolidate.
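The split, summarize, consolidate flow can be sketched in a few lines. The helper names are hypothetical, and summarize and meta_summarize are caller-supplied stand-ins for the Summarizer and MetaSummarizer agents. Skipping consolidation when there is only one chunk is an assumption.

```python
def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def summarize_batch(analyses, chunk_size, summarize, meta_summarize):
    """Summarize each chunk, then consolidate the chunk summaries."""
    chunks = split_into_chunks(analyses, chunk_size)
    summaries = [summarize(chunk) for chunk in chunks]
    # Assumption: a single chunk needs no consolidation step.
    return summaries[0] if len(summaries) == 1 else meta_summarize(summaries)


analyses = list(range(100))  # stand-ins for 100 individual analyses
chunk_sizes = [len(chunk) for chunk in split_into_chunks(analyses, 32)]

# Toy agents: "summarize" counts items, "meta_summarize" adds the counts.
total_seen = summarize_batch(analyses, 32, summarize=len, meta_summarize=sum)
```

With chunk_size 32, a batch of 100 analyses yields three full chunks and one chunk of 4, so each summarizer call sees at most 32 analyses.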

Auto-Completion

Automatically trigger batch processing after N rollouts:

pipeline:
  rollouts_per_step: 32

No need to call client.rollouts_complete() manually. When rollouts_per_step is set, /complete is ignored.
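Conceptually, auto-completion is a counter that fires once every rollouts_per_step scored rollouts. The sketch below illustrates that behavior. StepCounter is a hypothetical class, and the server's actual bookkeeping (for example under concurrent requests) may differ.

```python
class StepCounter:
    """Hypothetical counter that reports when a step's rollouts are all in."""

    def __init__(self, rollouts_per_step=None):
        self.rollouts_per_step = rollouts_per_step
        self.seen = 0

    def record_rollout(self):
        """Return True when this rollout completes the step."""
        if self.rollouts_per_step is None:
            return False  # manual mode: the client must call /complete
        self.seen += 1
        if self.seen >= self.rollouts_per_step:
            self.seen = 0  # start counting the next step
            return True
        return False


auto = StepCounter(rollouts_per_step=3)
auto_triggers = [auto.record_rollout() for _ in range(7)]

manual = StepCounter(rollouts_per_step=None)
manual_triggers = [manual.record_rollout() for _ in range(7)]
```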

Rubric Update Frequency

Control how often rubrics are updated to reduce LLM costs during training:

pipeline:
  rubric_update_frequency: 5  # Update rubrics every 5 steps

1 or null: Update rubrics every step (default)
N > 1: Update rubrics at steps 0, N, 2N, 3N, ... (e.g., N = 5 updates at steps 0, 5, 10, 15, ...)

Summarization still runs every step to preserve memory context; only the expensive rubric update is skipped on non-update steps.
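The schedule described above reduces to a modulo check on the step number. A minimal sketch, with should_update_rubrics as a hypothetical helper name:

```python
def should_update_rubrics(step, frequency=None):
    """Return True if rubrics are updated at this step.

    frequency of 1 or None means every step; N > 1 means steps 0, N, 2N, ...
    """
    if frequency is None or frequency <= 1:
        return True
    return step % frequency == 0


update_steps = [s for s in range(16) if should_update_rubrics(s, 5)]
every_step = [s for s in range(4) if should_update_rubrics(s, None)]
```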

Custom Prompt Templates

Override default prompts by specifying file paths:

prompts:
  scorer: "./prompts/my_scorer.txt"
  analyzer: "./prompts/my_analyzer.txt"
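Agents without a configured path (null) keep their built-in prompt. The fallback could be resolved as in the sketch below. The function load_prompt and the prompt strings are hypothetical, and the snippet writes a temporary file so that it runs on its own.

```python
import tempfile
from pathlib import Path


def load_prompt(configured_path, builtin_prompt):
    """Return the custom template if a path is configured, else the built-in one."""
    if configured_path is None:
        return builtin_prompt
    return Path(configured_path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    custom_file = Path(tmp) / "my_scorer.txt"
    custom_file.write_text("Score the rollout strictly.", encoding="utf-8")
    custom_prompt = load_prompt(str(custom_file), "built-in scorer prompt")

default_prompt = load_prompt(None, "built-in scorer prompt")
```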

Utility Scripts

setup_rubrics.py

Initialize rubrics from config (clears existing):

python setup_rubrics.py

inspect_chroma_db.py

Inspect ChromaDB memory contents:

python inspect_chroma_db.py ./data/chroma --limit 50

Output includes:

Step histogram (document count by training step)
Type × Step breakdown
Recent documents with preview

test_rl_simulation.py

Run a simulated RL training loop:

python test_rl_simulation.py

Simulates:

Math problem training
Various model behaviors (correct, wrong, verbose, sycophant, reward hacking)
Behavior evolution across training stages
Score tracking and rubric evolution

Project Structure

AMARIS/
├── config.yaml                    # Main configuration
├── requirements.txt               # Dependencies
├── pyproject.toml                 # Package metadata
├── README.md                      # This documentation
│
├── setup_rubrics.py               # Rubric initialization utility
├── inspect_chroma_db.py           # Memory inspection utility
├── test_rl_simulation.py              # RL training simulation
├── per_instance_rubrics_example.json  # Example per-instance rubrics
│
├── amaris/
│   ├── __init__.py                # Exports RewardClient
│   │
│   ├── client/
│   │   ├── __init__.py
│   │   └── client.py              # RewardClient (3-function API)
│   │
│   └── server/
│       ├── __init__.py
│       ├── main.py                # Server entry point
│       ├── api.py                 # FastAPI endpoints
│       ├── config.py              # Configuration loader
│       ├── models.py              # Pydantic data models
│       ├── pipeline.py            # Processing orchestrator
│       ├── memory.py              # ChromaDB abstraction
│       ├── rubrics.py             # Rubric management
│       ├── json_logger.py         # LLM call logging
│       │
│       ├── agents/                # LLM agents
│       │   ├── __init__.py
│       │   ├── factory.py         # OpenAI client factory
│       │   ├── scorer.py
│       │   ├── analyzer.py
│       │   ├── summarizer.py
│       │   ├── meta_summarizer.py
│       │   ├── query_generator.py
│       │   └── updater.py
│       │
│       └── prompts/               # Prompt templates
│           ├── __init__.py
│           ├── scorer.py
│           ├── analyzer.py
│           ├── summarizer.py
│           ├── meta_summarizer.py
│           ├── query_generator.py
│           └── updater.py
│
└── data/                          # Runtime data (gitignored)
    ├── chroma/                    # ChromaDB vector storage
    ├── rubrics.json               # Persisted rubrics
    ├── logs/                      # LLM call logs
    └── per_instance/              # Per-instance data (if enabled)

Data Storage

Component	Location	Description
ChromaDB	`./data/chroma/`	Vector database with semantic memory
Rubrics	`./data/rubrics.json`	Current rubric set (JSON)
LLM Logs	`./data/logs/llm_log_*.jsonl`	JSONL logs of all LLM calls
Per-Instance	`./data/per_instance/{id}/`	Separate rubrics/memory per input type

Design Principles

Modular: Each component is independently testable
OpenAI-Compatible: Works with any provider supporting OpenAI API format
Async-First: Background processing doesn't block training
Memory-Driven: Learns from history via semantic search
Flexible Rubrics: Supports both shared and per-input customization
JSON-Robust: Agents use regex fallback for imperfect LLM outputs
Production-Ready: Comprehensive logging, error handling, graceful degradation
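As an illustration of the JSON-Robust principle, a regex fallback for model output that wraps JSON in extra text might look like the sketch below. The function parse_llm_json is hypothetical and far simpler than a production parser; it only handles a single JSON object surrounded by prose.

```python
import json
import re


def parse_llm_json(text):
    """Parse model output as JSON, falling back to the outermost {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


clean = parse_llm_json('{"score": 0.75}')
wrapped = parse_llm_json('Sure! Here is the result: {"score": 0.75} Hope that helps.')
```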

Troubleshooting

Server won't start

Check Python version: python --version (requires >= 3.12)
Verify dependencies: pip install -r requirements.txt
Check config file path: python -m amaris.server.main --config config.yaml

LLM API errors

Verify API key: echo $OPENAI_API_KEY
Test endpoint: curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
Check model availability for your API tier

Memory issues with large batches

Enable chunking:

pipeline:
  use_chunking: true
  chunk_size: 32

Rubrics not evolving

Ensure /complete is called after batch (manual mode only)
Wait for update: /rubric_status/wait
Check logs for LLM errors

License

MIT License
