# RLLM SDK: Make Any Agent Trainable With Almost No Adaptation

This tutorial walks you through making **any existing agent code** trainable with RL with minimal changes. Specifically, we'll take an existing countdown agent and make it trainable using the RLLM SDK.

**Key Steps:** 
- Replace the OpenAI client with the SDK-provided OpenAI client
- Write a rollout function

## Testing The Agent

Start the proxy (LiteLLM) for testing. During training, the Trainer manages this automatically. The proxy logs all LLM calls to a SQLite database, which allows the exact token IDs to be retrieved during training.

In [None]:
from pathlib import Path
from rllm.sdk.proxy.proxy_manager import ProxyManager
import os

# Setup
DB_PATH = "/tmp/rllm_demo.db"
MODEL = "gpt-4o-mini"

openai_api_key = os.environ["OPENAI_API_KEY"]
# openai_api_key="abc"

# Clean up
Path(DB_PATH).unlink(missing_ok=True)

# Start proxy
proxy_manager = ProxyManager(proxy_port=4000, admin_token="my-shared-secret")
config = {
    "model_list": [
        {
            "model_name": MODEL,
            "litellm_params": {
                "model": MODEL,
                "api_key": openai_api_key,
            },
        }
    ]
}
proxy_manager.start_proxy_subprocess(config=config, db_path=DB_PATH, project="demo")
proxy_url = proxy_manager.get_proxy_url(include_v1=True)

print(f"✓ Proxy started at {proxy_url}")
print(f"✓ Database: {DB_PATH}")

### Prepare the Countdown Dataset
**The Countdown Task:**  
Given a set of numbers and a target, find an arithmetic expression using those numbers to reach the target. Each number must be used exactly once, which is what the judge prompt and the reward function later in this tutorial enforce. For example: numbers `[30, 32, 76]` and target `78` → a valid solution is `76 + 32 - 30 = 78`.
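
To make the task concrete, here is a minimal sketch that checks the example above. It mirrors the stricter check used by the reward function later in this tutorial (every given number appears exactly once), but it is an illustration only: the helper names `evaluate` and `check` are hypothetical, and it walks the expression with Python's `ast` module rather than using the tutorial's own scoring code.

```python
import ast
import operator
import re

# Only the four basic arithmetic operators are allowed.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")


def check(expression, numbers, target):
    """Return True if the expression uses exactly the given numbers and hits the target."""
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    value = evaluate(ast.parse(expression, mode="eval"))
    return used == sorted(numbers) and abs(value - target) < 1e-5


print(check("76 + 32 - 30", [30, 32, 76], 78))
```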

In [None]:
!python rllm/examples/solver_judge/prepare_countdown_data.py

In [None]:
from rllm.data.dataset import DatasetRegistry

train_dataset = DatasetRegistry.load_dataset("countdown", "train")
test_dataset = DatasetRegistry.load_dataset("countdown", "test")

train_dataset[0]

## Your Original Agent Code

A typical agent using the standard OpenAI client. This agent follows a solver-verifier workflow: the solver generates multiple solution attempts, then the verifier (called the judge in the code) selects the best one.

In [None]:
from openai import AsyncOpenAI
import asyncio
import re

judge_prompt = """You are an expert verifier. Given a countdown problem and multiple solution attempts, select a correct solution.
Problem:
{problem}
Solutions to evaluate:
{solutions}
A correct solution must satisfy the following criteria:
1. The solution uses only the given numbers.
2. Each number is used exactly once.
3. Only basic arithmetic operations (+, -, *, /) are used.
4. The calculation results in the target number.
5. The final answer is clearly marked within <answer>...</answer> tags.
Output the index of your selected solution within <answer>...</answer> tags, e.g., <answer>1</answer> for the first solution, <answer>2</answer> for the second solution, etc. If multiple solutions are correct, output the index of the first correct solution."""


def parse_answer_in_xml(solution_str):
    answer_matches = re.findall(r"<answer>(.*?)</answer>", solution_str, re.IGNORECASE | re.DOTALL)
    if answer_matches:
        return answer_matches[-1].strip()
    else:
        return "No solution found."


class CountdownAgent:
    """A simple math solving agent - ORIGINAL VERSION"""

    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        # Standard OpenAI client
        self.client = AsyncOpenAI(api_key=api_key)
        self.model = model

    async def solve(self, problem: str) -> str:
        """Solve a math problem."""
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": f"{problem}. Output the final answer within <answer>...</answer>",
                }
            ],
            max_tokens=1000,
        )
        return parse_answer_in_xml(response.choices[0].message.content)

    async def judge(self, problem, solutions: list[str]) -> str:
        """Judge multiple solutions to a problem."""
        formatted_solutions = "\n".join([f"Solution {i + 1}:\n<answer>{sol}</answer>\n" for i, sol in enumerate(solutions)])
        prompt = judge_prompt.format(problem=problem, solutions=formatted_solutions)

        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1000,
        )
        print("Judge:", response.choices[0].message.content)
        return parse_answer_in_xml(response.choices[0].message.content)

    async def run(self, problem: str, n_solutions: int = 2) -> str:
        """Generate multiple solutions and judge them."""
        solutions = await asyncio.gather(*[self.solve(problem) for _ in range(n_solutions)])
        print("First solution attempt:")
        print(solutions[0])

        selected_index = await self.judge(problem, solutions)
        if selected_index:
            try:
                selected_index = int(selected_index)
                selected_solution = solutions[selected_index - 1]
            except (ValueError, IndexError):  # non-numeric or out-of-range index
                selected_solution = ""
        else:
            selected_solution = ""

        return selected_solution


# Use it
agent = CountdownAgent(api_key=openai_api_key, model=MODEL)
await agent.run(train_dataset[0]["question"])

## Make It Trainable

**Change:** Import the SDK client instead of OpenAI client  

**What's `session()`?** A lightweight primitive that tracks all LLM calls within its scope and injects metadata into each call. Every LLM call made inside `with session()` can later be retrieved using the session ID, `sess._uid`.

In [None]:
import re
from rllm.sdk import get_chat_client_async, session


class TrainableAgent:
    """A simple math solving agent - TRAINABLE VERSION"""

    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        # Replace standard OpenAI client with SDK client
        # self.client = AsyncOpenAI(api_key=api_key)
        self.client = get_chat_client_async(api_key=api_key, base_url=proxy_url)
        self.model = model

    async def solve(self, problem: str) -> str:
        """Solve a math problem."""
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "user",
                    "content": f"{problem}. Output the final answer within <answer>...</answer>",
                }
            ],
            max_tokens=1000,
        )
        return parse_answer_in_xml(response.choices[0].message.content)

    async def judge(self, problem, solutions: list[str]) -> str:
        """Judge multiple solutions to a problem."""
        formatted_solutions = "\n".join([f"Solution {i + 1}:\n<answer>{sol}</answer>\n" for i, sol in enumerate(solutions)])
        prompt = judge_prompt.format(problem=problem, solutions=formatted_solutions)

        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1000,
        )
        return parse_answer_in_xml(response.choices[0].message.content)

    async def run(self, problem: str, n_solutions: int = 2) -> str:
        """Generate multiple solutions and judge them."""
        solutions = await asyncio.gather(*[self.solve(problem) for _ in range(n_solutions)])

        selected_index = await self.judge(problem, solutions)
        if selected_index:
            try:
                selected_index = int(selected_index)
                selected_solution = solutions[selected_index - 1]
            except (ValueError, IndexError):  # non-numeric or out-of-range index
                selected_solution = ""
        else:
            selected_solution = ""

        return selected_solution

### Automatic LLM Call Tracking

The `session()` primitive automatically captures every LLM interaction.

Access the automatically tracked LLM calls via `sess.steps`. Each step is a concise view of a single LLM call (trace), with a reward attached.

A step object has the following fields:
- id: Trace ID, unique per trace, can be used to retrieve the full trace from the store
- input: LLM input (from trace)
- output: LLM response (from trace)
- action: Parsed action (set manually by user)
- reward: Step reward
- metadata: Additional tracking data (can include model, tokens, latency, etc.)

In [None]:
# Use it
agent = TrainableAgent(api_key=openai_api_key, model=MODEL)
with session() as sess:
    solution = await agent.run(train_dataset[0]["question"])
    print("selected solution: ", solution)

In [None]:
# Access traces directly from the session
steps = sess.steps
print(f"✅ Tracked {len(steps)} steps\n")

# Inspect the third tracked step (with two solver calls, this should be the judge call)
step = steps[2]

print("=" * 70 + "\nTRACE DETAILS\n" + "=" * 70)
print("\nInput Messages:")
for msg in step.input["messages"]:
    print(f"  [{msg['role']}]: {msg['content']}")
print(f"\nOutput:  {step.output['choices'][0]['message']['content']}")

## Evaluate the Agent

We need to define a reward function that scores agent outputs for training.

In [None]:
def validate_equation(equation_str, available_numbers):
    """Validate that equation only uses available numbers and each number once."""
    try:
        # Extract all numbers from the equation
        numbers_in_eq = [int(n) for n in re.findall(r"\d+", equation_str)]

        # Check if all numbers in equation are available
        available_numbers = sorted(available_numbers)
        numbers_in_eq = sorted(numbers_in_eq)

        # Each number should be used exactly once
        return numbers_in_eq == available_numbers
    except Exception:
        return False


def evaluate_equation(equation_str):
    """Safely evaluate the arithmetic equation using eval() with precautions."""
    try:
        # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace
        allowed_pattern = r"^[\d+\-*/().\s]+$"
        if not re.match(allowed_pattern, equation_str):
            raise ValueError("Invalid characters in equation.")

        # Evaluate the equation with restricted globals and locals
        result = eval(equation_str, {"__builtins__": None}, {})
        return result
    except Exception:
        return None


def reward_fn(equation, numbers, target):
    """The scoring function for the countdown task.

    Args:
        equation: the equation string extracted from the solution
        numbers: list of available numbers
        target: target number

    Returns:
        float: 1.0 if correct, 0.0 if incorrect
    """

    if equation is None or not validate_equation(equation, numbers):
        return 0.0

    # Evaluate equation
    try:
        result = evaluate_equation(equation)

        if result is None:
            return 0.0

        if abs(result - target) < 1e-5:  # Account for floating point precision
            return 1.0
        else:
            return 0.0
    except Exception:
        return 0.0

## Defining the Rollout Function

A rollout function takes the initial input (prompt, task description, etc.), runs the agent, and returns a final reward.

In [None]:
async def rollout_v1(
    question: str,
    ground_truth: str,
    nums: list,
    target: float,
    model="Qwen/Qwen3-4B-Instruct-2507",
    **kwargs,
) -> float:
    # The rollout function must return a reward
    agent = TrainableAgent(api_key=openai_api_key, model=model)
    equation = await agent.run(question)
    print("Target:", target, "\nEquation:", equation)

    reward = reward_fn(equation, nums, target)
    return reward


reward = await rollout_v1(**train_dataset[0], model="gpt-4o-mini")
print("reward =", reward)

## Train!

Plug the rollout function you defined directly into the `AgentTrainer`:

In [None]:
# Training
from rllm.trainer import AgentTrainer
from hydra import initialize_config_dir, compose
import os

with initialize_config_dir(config_dir=os.path.abspath("."), version_base=None):
    config = compose(config_name="tutorial_config")

trainer = AgentTrainer(
    agent_run_func=rollout_v1,
    config=config,
    train_dataset=train_dataset,
    val_dataset=test_dataset,
)

trainer.train()

## Bonus: Using @trajectory Decorator for Step-Level Control

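The `@trajectory` decorator is almost **equivalent to `with session()`**: it tracks every LLM call made inside the decorated function, and calling the function returns a `TrajectoryView` rather than the raw return value. The first code cell below compares the two forms.

A `TrajectoryView` object groups steps with an optional reward and metadata: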
- name: Trajectory name
- steps: List of StepViews (auto-generated from traces)
- reward: Trajectory reward (set manually)
- input: Function arguments (dict)
- output: Function return value (Any)
- metadata: Additional tracking data
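
As a mental model, the step and trajectory views can be pictured as plain records. The sketch below uses hypothetical stand-in dataclasses (`ToyStep`, `ToyTrajectory`) that only mirror the field names listed in this tutorial; they are not the SDK's actual classes, and the default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyStep:
    """Stand-in for a step view: one LLM call plus its reward."""
    id: str
    input: dict
    output: dict
    action: Any = None
    reward: float = 0.0
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyTrajectory:
    """Stand-in for a trajectory view: a named group of steps."""
    name: str
    steps: list = field(default_factory=list)
    reward: Optional[float] = None
    input: dict = field(default_factory=dict)
    output: Any = None
    metadata: dict = field(default_factory=dict)


step = ToyStep(
    id="trace-1",
    input={"messages": [{"role": "user", "content": "Reach 78 with 30, 32, 76"}]},
    output={"choices": [{"message": {"content": "<answer>76 + 32 - 30</answer>"}}]},
)
traj = ToyTrajectory(name="solver", steps=[step], output="76 + 32 - 30")

# Rewards are attached to individual steps after the fact.
traj.steps[-1].reward = 1.0
print(traj.name, traj.steps[-1].reward)
```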

In [None]:
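from rllm.sdk import session, trajectory

# These two are roughly equivalent
# v1 - explicit session
with session(agent_name="countdown") as sess:
    output = await rollout_v1(**train_dataset[0], model=MODEL)

steps = sess.steps


# v2 - using the trajectory decorator
# (wrapped under a new name so the original rollout_v1 stays usable)
@trajectory(name="countdown")
async def rollout_v1_traj(**kwargs):
    return await rollout_v1(**kwargs)


traj = await rollout_v1_traj(**train_dataset[0], model=MODEL)
steps = traj.steps
output = traj.result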

In [None]:
from rllm.sdk import trajectory, get_chat_client_async
from rllm.sdk.protocol import TrajectoryView


class TrainableAgentV2:
    """A simple math solving agent - TRAINABLE VERSION V2"""

    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        # Replace standard OpenAI client with SDK client
        # self.client = AsyncOpenAI(api_key=api_key)
        self.client = get_chat_client_async(api_key=api_key, base_url=proxy_url)
        self.model = model

    @trajectory(name="solver")
    async def solve(self, problem: str) -> str:
        """Solve a math problem."""
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": f"{problem}. Output the final answer within <answer>...</answer>"}],
            max_tokens=1000,
        )
        return parse_answer_in_xml(response.choices[0].message.content)

    @trajectory(name="judge")
    async def judge(self, problem, solutions: list[str]) -> str:
        """Judge multiple solutions to a problem."""
        formatted_solutions = "\n".join([f"Solution {i + 1}:\n<answer>{sol}</answer>\n" for i, sol in enumerate(solutions)])
        prompt = judge_prompt.format(problem=problem, solutions=formatted_solutions)

        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1000,
        )
        return parse_answer_in_xml(response.choices[0].message.content)

    async def run(self, problem: str, nums, target, n_solutions: int = 2) -> list[TrajectoryView]:
        """Generate multiple solutions, judge them, and return the reward-annotated trajectories."""
        solver_trajs = await asyncio.gather(*[self.solve(problem) for _ in range(n_solutions)])
        solutions = [traj.result for traj in solver_trajs]

        judge_traj = await self.judge(problem, solutions)
        selected_index = judge_traj.result

        if selected_index:
            try:
                selected_index = int(selected_index)
                selected_solution = solutions[selected_index - 1]
            except (ValueError, IndexError):  # non-numeric or out-of-range index
                selected_solution = ""
        else:
            selected_solution = ""

        # assign rewards
        for traj in solver_trajs:
            traj.steps[-1].reward = reward_fn(equation=traj.result, numbers=nums, target=target)

        judge_traj.steps[-1].reward = reward_fn(equation=selected_solution, numbers=nums, target=target)
        return solver_trajs + [judge_traj]


async def rollout_v2(question: str, nums, target, model, **kwargs) -> list[TrajectoryView]:
    agent = TrainableAgentV2(None, model=model)
    trajs = await agent.run(question, nums, target)
    return trajs


await rollout_v2(**train_dataset[0], model="gpt-4o-mini")

## How Does It Work?

Here's what happens under the hood:

1. **Trace Collection:** The LiteLLM proxy captures all LLM calls and gets the token IDs directly from the vLLM server. This avoids retokenization, which can make the training data off-policy and destabilize training.
2. **Trace Storage:** The SQLite-based database stores all LLM calls with appropriate metadata for fast retrieval.
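
To see why retokenization is a problem, consider a toy illustration. The vocabulary and the greedy longest-match tokenizer below are made up for this example (real tokenizers are more involved), but they show the same effect: decoding sampled tokens to text and encoding that text again does not necessarily give back the tokens the model actually sampled.

```python
# Toy vocabulary: "ab" exists as a single token alongside "a" and "b".
VOCAB = ["ab", "a", "b", "c"]


def greedy_tokenize(text):
    """Tokenize by always taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens


# Suppose the model sampled "a" then "b" as two separate tokens.
sampled_tokens = ["a", "b", "c"]
text = "".join(sampled_tokens)

# Re-encoding the decoded text merges "a" + "b" into the single token "ab".
retokenized = greedy_tokenize(text)

print(sampled_tokens)
print(retokenized)
```

Training on the retokenized sequence would score tokens the policy never produced, which is why recording the exact token IDs at generation time matters.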

The following code shows conceptually how the training loop works:

In [None]:
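# Conceptual pseudocode (not runnable): `rollout_id`, `rollout`, `inputs`, `store`, and `group` are placeholders.
# The AgentTrainer wraps your rollout function in a session tagged with a unique `rollout_id`:

with session(rollout_id=rollout_id) as sess:
    await rollout(**inputs)

steps = store.retrieve(sess._uid)  # fetch every LLM call logged for this session
trajectories = group(steps)  # group steps into trajectories based on their metadata

trainer.step(trajectories)  # update the model weights using GRPO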
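
The `group(steps)` call in the loop above can be pictured as a simple bucketing by metadata. The sketch below is a hypothetical illustration: the step dictionaries, the `"trajectory"` metadata key, and the `group_steps` helper are all assumptions, not the trainer's real grouping logic.

```python
from collections import defaultdict

# Hypothetical retrieved steps; each carries metadata naming its trajectory.
steps = [
    {"id": "t1", "metadata": {"trajectory": "solver-0"}, "reward": 1.0},
    {"id": "t2", "metadata": {"trajectory": "solver-1"}, "reward": 0.0},
    {"id": "t3", "metadata": {"trajectory": "judge"}, "reward": 1.0},
]


def group_steps(steps, key="trajectory"):
    """Bucket steps by a metadata key, preserving the order they were logged in."""
    grouped = defaultdict(list)
    for step in steps:
        grouped[step["metadata"].get(key, "default")].append(step)
    return dict(grouped)


trajectories = group_steps(steps)
for name, traj_steps in trajectories.items():
    print(name, [s["id"] for s in traj_steps])
```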

## Design Details (For The Curious)

**Why a proxy?**  
Transparent LLM call interception without modifying agent code. It works with inference providers other than `OpenAI`, e.g., `Anthropic`, so you don't need to change your agent implementation: replacing the chat client suffices.

**How does session tracking work?**  
Uses Python's **`contextvars`** module for automatic context propagation. `with session()` or `@trajectory` creates a context that automatically groups all LLM calls made inside it. This is thread-safe and requires no manual tracking.

**Why SQLite storage?**  
Zero installation overhead with no live service dependencies.
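
As a rough picture of what such a store involves, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the helper names `log_trace` and `retrieve` are assumptions made for illustration; the SDK's actual schema may look quite different.

```python
import json
import sqlite3

# An in-memory database keeps the example self-contained; a file path works the same way.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traces ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "session_uid TEXT NOT NULL, "
    "payload TEXT NOT NULL)"
)
# Index on the session ID so per-session lookups stay fast.
conn.execute("CREATE INDEX idx_traces_session ON traces (session_uid)")


def log_trace(session_uid, prompt, completion, token_ids):
    """Store one LLM call as a JSON payload tagged with its session ID."""
    payload = json.dumps(
        {"prompt": prompt, "completion": completion, "token_ids": token_ids}
    )
    conn.execute(
        "INSERT INTO traces (session_uid, payload) VALUES (?, ?)",
        (session_uid, payload),
    )
    conn.commit()


def retrieve(session_uid):
    """Return all traces for a session in the order they were logged."""
    rows = conn.execute(
        "SELECT payload FROM traces WHERE session_uid = ? ORDER BY id",
        (session_uid,),
    ).fetchall()
    return [json.loads(row[0]) for row in rows]


log_trace("sess-1", "2 + 2?", "4", [17, 42])
log_trace("sess-2", "unrelated", "ok", [7])
log_trace("sess-1", "3 + 3?", "6", [18, 43])

print(retrieve("sess-1"))
```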

In [None]:
# Cleanup
proxy_manager.shutdown_proxy()
print("✓ Proxy shutdown complete")