<a href="https://colab.research.google.com/github/OpenPipe/ART/blob/art-mcp/examples/mcp-rl/mcp-rl-alphavantage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train a model for your custom task, click _Runtime_ and press _Run all_. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**MCP•RL: Tool-driven agent training**

This notebook shows how to train a Qwen 2.5 7B model to use any MCP server effectively. Define the server's tools and resources, and the notebook will:

1. Query the server's tools and resources
2. Generate diverse input examples for your task
3. Train the model using RULER's automatic evaluation
4. Test the trained model on new inputs against the server

RULER learns what makes a good output purely from the MCP server's tools and resources - no expected outputs required!


In [None]:
# @title 💿 Installation

!uv pip install -q openpipe-art==0.3.11.post5 langchain-core tenacity "mcp>=1.11.0" "gql<4" fastmcp --prerelease allow --no-cache-dir

<a name="Configuration"></a>

### 🎯 Configuration - Edit These Settings

Add an OpenRouter key and customize your training by modifying the values below.

By default your model will be trained to retrieve and analyze stock and crypto market data from the Alphavantage MCP server. To teach your model to use another MCP server, configure it to run in the [MCP server](#mcp) cell below!

In [None]:
# Required - Used for generating training inputs and RULER evaluation
OPENROUTER_API_KEY = ""

# Optional - Enables metric logging
WANDB_API_KEY = ""

# Shared key for the demo - DO NOT USE IN PRODUCTION, AND EXPECT RATE LIMITS
ALPHAVANTAGE_API_KEY = "HR32X84C3B4HJ93C"

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # Options: "Qwen/Qwen2.5-3B-Instruct", "Qwen/Qwen2.5-7B-Instruct", etc.

In [None]:
# @title Advanced Settings

# Model configuration
MODEL_NAME = "mcp-7b-alphavantage"  # Name for your trained model
PROJECT_NAME = "mcp-rl"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 16,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 3,  # Number of times through all data
    "rollouts_per_group": 3,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

MAX_TURNS = 10  # Maximum number of turns for the model to generate during one rollout

NUM_TEST_INPUTS = 8  # Number of test inputs to generate
RULER_MODEL = "openrouter/openai/o4-mini"  # Model for RULER evaluation
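# Model for generating training inputs. It is called through OpenRouter directly,
# so it omits the "openrouter/" prefix that RULER_MODEL carries.
INPUT_GENERATION_MODEL = "openai/o4-mini"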

# GPU configuration (for T4 — keep these as-is unless you have a reason to change them)
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.7  # GPU memory usage (0.0-1.0)
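
Before launching a run, it can help to estimate how much work these settings imply. The sketch below is only an approximation: `training_budget` is a hypothetical helper (not part of ART), and it assumes each epoch is cut into batches of `groups_per_step` inputs with any partial final batch kept.

```python
import math
from typing import Dict, Optional


def training_budget(
    num_inputs: int,
    groups_per_step: int,
    num_epochs: int,
    rollouts_per_group: int,
    max_steps: Optional[int] = None,
) -> Dict[str, int]:
    """Estimate training steps and rollouts implied by the config values."""
    # Assumption: a partial final batch still counts as one step.
    steps_per_epoch = math.ceil(num_inputs / groups_per_step)
    total_steps = steps_per_epoch * num_epochs
    if max_steps is not None:
        total_steps = min(total_steps, max_steps)
    # Every step runs each input in the batch `rollouts_per_group` times.
    rollouts_per_step = groups_per_step * rollouts_per_group
    return {
        "steps": total_steps,
        "rollouts_per_step": rollouts_per_step,
        "total_rollouts": total_steps * rollouts_per_step,
    }


# Defaults from TRAINING_CONFIG above
budget = training_budget(
    num_inputs=16, groups_per_step=2, num_epochs=3, rollouts_per_group=3
)
print(budget)
# {'steps': 24, 'rollouts_per_step': 6, 'total_rollouts': 144}
```

Each rollout can take up to `MAX_TURNS` model calls plus tool calls, so the total rollout count is a rough proxy for both runtime and API usage.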

In [None]:
# @title MCP server

import asyncio
import os
from typing import Any, Dict

import aiohttp
from dotenv import load_dotenv
from fastmcp import FastMCP
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

load_dotenv()

# Required for Alphavantage demo
if ALPHAVANTAGE_API_KEY:
    os.environ["ALPHAVANTAGE_API_KEY"] = ALPHAVANTAGE_API_KEY
else:
    raise ValueError("ALPHAVANTAGE_API_KEY is required for the Alphavantage demo.")


class AlphaVantageClient:
    """Client for interacting with Alpha Vantage API"""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://www.alphavantage.co/query"

    async def fetch_data(self, function: str, **params) -> Dict[str, Any]:
        """Fetch data from Alpha Vantage API"""
        query_params = {
            "function": function,
            "apikey": self.api_key,
            "datatype": "json",
            **params,
        }

        async with aiohttp.ClientSession() as session:
            async with session.get(self.base_url, params=query_params) as response:
                if response.status != 200:
                    raise Exception(f"API request failed: {response.status}")

                data = await response.json()

                if "Error Message" in data:
                    raise Exception(f"Alpha Vantage API Error: {data['Error Message']}")

                # Rate-limit notices appear to arrive as a message value, not as
                # a key, so scan the values rather than testing `in data`.
                rate_limit_notice = next(
                    (
                        value
                        for value in data.values()
                        if isinstance(value, str)
                        and "Thank you for using Alpha Vantage!" in value
                    ),
                    None,
                )
                if rate_limit_notice:
                    raise Exception(f"Alpha Vantage API Error: {rate_limit_notice}")

                return data


def _format_json(data: Dict[str, Any]) -> str:
    """Format JSON data for display"""
    import json

    return json.dumps(data, indent=2)


# Initialize FastMCP server
mcp = FastMCP("mcp-alphavantage")
client = AlphaVantageClient(os.getenv("ALPHAVANTAGE_API_KEY"))


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def get_stock_quote(symbol: str) -> str:
    """Get real-time stock quote for a symbol

    Args:
        symbol: Stock symbol (e.g., AAPL, MSFT)
    """
    data = await client.fetch_data("GLOBAL_QUOTE", symbol=symbol)
    return f"Stock Quote for {symbol}:\n{_format_json(data)}"


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def get_time_series_daily(symbol: str, outputsize: str = "compact") -> str:
    """Get daily time series data for a stock

    Args:
        symbol: Stock symbol (e.g., AAPL, MSFT)
        outputsize: Output size: compact (latest 100 data points)
    """
    data = await client.fetch_data(
        "TIME_SERIES_DAILY",
        symbol=symbol,
        outputsize=outputsize,
    )
    return f"Daily Time Series for {symbol}:\n{_format_json(data)}"


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def search_symbol(keywords: str) -> str:
    """Search for stock symbols by keywords

    Args:
        keywords: Keywords to search for (e.g., company name)
    """
    data = await client.fetch_data("SYMBOL_SEARCH", keywords=keywords)
    return f"Symbol Search Results for '{keywords}':\n{_format_json(data)}"


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def get_company_overview(symbol: str) -> str:
    """Get fundamental data and company overview

    Args:
        symbol: Stock symbol (e.g., AAPL, MSFT)
    """
    data = await client.fetch_data("OVERVIEW", symbol=symbol)
    return f"Company Overview for {symbol}:\n{_format_json(data)}"


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def get_sma(
    symbol: str,
    interval: str = "daily",
    time_period: int = 30,
    series_type: str = "close",
) -> str:
    """Get Simple Moving Average (SMA) technical indicator

    Args:
        symbol: Stock symbol (e.g., AAPL, MSFT)
        interval: Time interval (1min, 5min, 15min, 30min, 60min, daily, weekly, monthly)
        time_period: Number of data points for SMA calculation
        series_type: Price type to use for calculation (close, open, high, low)
    """
    data = await client.fetch_data(
        "SMA",
        symbol=symbol,
        interval=interval,
        time_period=time_period,
        series_type=series_type,
    )
    tech_analysis_key = "Technical Analysis: SMA"
    # Alpha Vantage returns a dict keyed by timestamp; convert to list to slice
    data[tech_analysis_key] = dict(list(data[tech_analysis_key].items())[:time_period])
    return f"SMA for {symbol}:\n{_format_json(data)}"


@mcp.tool
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    retry=retry_if_exception_type(
        (aiohttp.ClientError, asyncio.TimeoutError, Exception)
    ),
)
async def get_rsi(
    symbol: str,
    interval: str = "daily",
    time_period: int = 14,
    series_type: str = "close",
) -> str:
    """Get Relative Strength Index (RSI) technical indicator

    Args:
        symbol: Stock symbol (e.g., AAPL, MSFT)
        interval: Time interval (daily, weekly, monthly)
        time_period: Number of data points for RSI calculation
        series_type: Price type to use for calculation (close, open, high, low)
    """
    data = await client.fetch_data(
        "RSI",
        symbol=symbol,
        interval=interval,
        time_period=time_period,
        series_type=series_type,
    )
    tech_analysis_key = "Technical Analysis: RSI"
    # Alpha Vantage returns a dict keyed by timestamp; convert to list to slice
    data[tech_analysis_key] = dict(list(data[tech_analysis_key].items())[:time_period])
    return f"RSI for {symbol}:\n{_format_json(data)}"


# For in-memory usage, we don't need server_params anymore
# The FastMCP server is now available as the 'mcp' variable
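
Every tool above is wrapped in a retry policy with up to 5 attempts and exponential waits bounded between 1 and 30 seconds. The sketch below illustrates what such a policy costs in the worst case. It uses only the standard library and assumes the delay before retry `n` is `multiplier * 2**(n - 1)` clamped to the bounds; tenacity's exact formula may differ between versions, and `backoff_schedule` is a hypothetical helper.

```python
from typing import List


def backoff_schedule(
    attempts: int,
    multiplier: float = 1.0,
    min_wait: float = 1.0,
    max_wait: float = 30.0,
) -> List[float]:
    """Return the waits (in seconds) between attempts under capped exponential backoff."""
    waits = []
    # There is no wait after the final attempt, hence attempts - 1 entries.
    for attempt in range(1, attempts):
        raw = multiplier * 2 ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


schedule = backoff_schedule(5)
print(schedule, sum(schedule))
# [1.0, 2.0, 4.0, 8.0] 15.0
```

Under these assumptions a tool call that keeps failing adds about 15 seconds of waiting before the error reaches the agent, which is worth keeping in mind when the shared API key is rate limited.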

<a name="mcp"></a>

In [None]:
# @title Let's generate our train and validation scenarios!

import json
import os
import random
from typing import Any, Dict, List

import openai
from dotenv import load_dotenv
from fastmcp import Client

load_dotenv()

# Required
if OPENROUTER_API_KEY:
    os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
else:
    raise ValueError(
        "OPENROUTER_API_KEY is required for data generation and RULER evaluation."
    )


async def generate_scenarios(
    mcp_server: FastMCP,
    num_scenarios: int = 24,
) -> List[Dict[str, Any]]:
    # Connect to MCP server using in-memory transport
    async with Client(mcp_server) as client:
        # Get available tools
        tools_result = await client.list_tools()
        tools_info = []
        for tool in tools_result:
            tool_info = {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.inputSchema,
            }
            tools_info.append(tool_info)

        # Get available resources
        try:
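            resources_result = await client.list_resources()
            resources_info = []
            # list_resources() may return a plain list (as list_tools() does above)
            # or a result object with a `.resources` attribute; handle both.
            for resource in getattr(resources_result, "resources", resources_result):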
                resource_info = {
                    "uri": str(resource.uri),
                    "name": resource.name,
                    "description": resource.description,
                    "mimeType": resource.mimeType,
                }
                resources_info.append(resource_info)
        except Exception:
            # Some servers might not have resources
            resources_info = []

    # Prepare the prompt for the input-generation model
    tools_description = json.dumps(tools_info, indent=2)
    resources_description = (
        json.dumps(resources_info, indent=2)
        if resources_info
        else "No resources available"
    )

    prompt = f"""You are an expert at creating realistic scenarios for testing AI agents that interact with MCP (Model Context Protocol) servers.

Given the following available tools and resources from an MCP server, generate {num_scenarios} diverse, realistic scenarios that a user might want to accomplish using these tools.

AVAILABLE TOOLS:
{tools_description}

AVAILABLE RESOURCES:
{resources_description}

Requirements for scenarios:
1. Each scenario should be a task that can be accomplished using the available tools
2. Scenarios should vary in complexity - some simple (1-2 tool calls), some complex (multiple tool calls)
3. Scenarios should cover different use cases and tool combinations (though the task should not specify which tools to use)
4. Each scenario should be realistic - something a real user might actually want to do
5. Assign a difficulty rating from 1 (easy, single tool call) to 5 (hard, complex multi-step analysis)
6. The task should always include generating a summary of the work done and a thorough analysis and report of the results

You must respond with a JSON object containing a "scenarios" array of exactly {num_scenarios} objects. Each object must have:
- "task": string describing the scenario
- "difficulty": integer from 1-5 representing complexity

Example:
{{
  "scenarios": [
    {{"task": "Get the current stock price for Apple (AAPL)", "difficulty": 1}},
    {{"task": "Compare the 30-day SMA with current price for Tesla and determine if it's above or below the moving average and generate a thorough analysis and report", "difficulty": 2}}
  ]
}}"""

    # Call the input-generation model through OpenRouter with structured JSON output
    client_openai = openai.OpenAI(
        api_key=os.getenv("OPENROUTER_API_KEY"),
        base_url="https://openrouter.ai/api/v1",
    )

    # Define the JSON schema for the response
    response_schema = {
        "type": "object",
        "properties": {
            "scenarios": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "task": {"type": "string"},
                        "difficulty": {"type": "integer", "minimum": 1, "maximum": 5},
                    },
                    "required": ["task", "difficulty"],
                    "additionalProperties": False,
                },
                "minItems": num_scenarios,
                "maxItems": num_scenarios,
            }
        },
        "required": ["scenarios"],
        "additionalProperties": False,
    }

    response = client_openai.chat.completions.create(
        model=INPUT_GENERATION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=8000,
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "scenario_list", "schema": response_schema},
        },
    )

    # Parse the JSON response
    content = response.choices[0].message.content
    try:
        result = json.loads(content)
    except Exception as e:
        print("Error parsing JSON response:", e)
        print("Response content:", content)
        raise e

    # Extract scenarios from the response
    if "scenarios" in result:
        scenarios = result["scenarios"]
    else:
        # If the response is just an array
        scenarios = result if isinstance(result, list) else list(result.values())[0]

    # Validate we got exactly the right number
    if len(scenarios) != num_scenarios:
        raise ValueError(f"Expected {num_scenarios} scenarios, got {len(scenarios)}")

    return scenarios


num_scenarios = TRAINING_CONFIG["num_training_inputs"] + NUM_TEST_INPUTS
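scenarios = None
for attempt in range(10):
    try:
        scenarios = await generate_scenarios(
            mcp,  # Use the FastMCP server directly
            num_scenarios=num_scenarios,
        )
        break
    except Exception as e:
        # generate_scenarios raises on malformed JSON or a wrong scenario count
        print(f"Scenario generation attempt {attempt + 1} failed: {e}")

if scenarios is None:
    raise RuntimeError("Failed to generate scenarios after 10 attempts.")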


print(f"\nGenerated {len(scenarios)} scenarios:")
for i, scenario in enumerate(scenarios, 1):
    print(f"{i}. Task: {scenario['task']}")
    print(f"   Difficulty: {scenario['difficulty']}/5")

# Shuffle scenarios randomly
random.shuffle(scenarios)

raw_train_scenarios = scenarios[: TRAINING_CONFIG["num_training_inputs"]]
raw_val_scenarios = scenarios[TRAINING_CONFIG["num_training_inputs"] :]
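
The cell above trusts the generated JSON and shuffles with the global random state, so the train/validation split changes from run to run. A minimal sketch of a stricter variant follows. `validate_scenarios` and `split_scenarios` are hypothetical helpers, not part of ART, and the seed value is an arbitrary choice.

```python
import random
from typing import Any, Dict, List, Tuple

Scenario = Dict[str, Any]


def validate_scenarios(scenarios: List[Scenario]) -> List[Scenario]:
    """Keep scenarios with a non-empty task and an integer difficulty from 1 to 5."""
    valid = []
    for scenario in scenarios:
        task = scenario.get("task")
        difficulty = scenario.get("difficulty")
        if (
            isinstance(task, str)
            and task.strip()
            and isinstance(difficulty, int)
            and not isinstance(difficulty, bool)
            and 1 <= difficulty <= 5
        ):
            valid.append(scenario)
    return valid


def split_scenarios(
    scenarios: List[Scenario], num_train: int, seed: int = 42
) -> Tuple[List[Scenario], List[Scenario]]:
    """Shuffle a copy with a private, seeded RNG and split it into train and validation."""
    rng = random.Random(seed)
    shuffled = list(scenarios)
    rng.shuffle(shuffled)
    return shuffled[:num_train], shuffled[num_train:]


example = [
    {"task": "Get the current stock price for Apple (AAPL)", "difficulty": 1},
    {"task": "", "difficulty": 2},  # rejected: empty task
    {"task": "Summarize the RSI trend for MSFT", "difficulty": 9},  # rejected: out of range
    {"task": "Compare the 30-day SMA with the current price for Tesla", "difficulty": 2},
    {"task": "Look up the ticker for Nvidia", "difficulty": 1},
]

valid = validate_scenarios(example)
train, val = split_scenarios(valid, num_train=2)
print(len(valid), len(train), len(val))
# 3 2 1
```

Using a dedicated `random.Random` instance keeps the split reproducible without depending on where `random.seed` is called elsewhere in the notebook.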

In [None]:
# @title Run this cell to train your model!

from dataclasses import dataclass

import mcp.types as types
import weave
from openai import AsyncOpenAI

import art
from art.local import LocalBackend
from art.rewards import ruler_score_group
from art.utils import iterate_dataset

# Optional
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
    weave.init(PROJECT_NAME)
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")


random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=MAX_SEQ_LENGTH,
    ),
    engine_args=art.dev.EngineArgs(
        enforce_eager=True,
        gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    ),
)

# Initialize the server
backend = LocalBackend(
    in_process=True,
    path="./.art",
)

# Register the model with the local Backend
await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)

# =============== Rollout function code ===============


def get_content_text(result) -> str:
    # Extract text content from tool call result
    if isinstance(result, str):
        return result
    elif hasattr(result, "content") and result.content:
        if isinstance(result.content, list):
            # Handle list of content items
            content_text = ""
            for item in result.content:
                if isinstance(item, types.TextContent):
                    content_text += item.text
                else:
                    content_text += str(item)
        elif isinstance(result.content[0], types.TextContent):
            content_text = result.content[0].text
        else:
            content_text = str(result.content)
    else:
        content_text = str(result)

    return content_text


@dataclass
class McpScenario:
    """A scenario for MCP agent evaluation."""

    task_description: str
    mcp_server: FastMCP
    max_turns: int = 10


@weave.op()
async def rollout(
    model: art.Model,
    scenario: McpScenario,
    debug: bool = False,
) -> art.Trajectory:
    """Run an MCP agent rollout with FastMCP server.

    Args:
        model: The ART model to use for the agent
        scenario: The MCP scenario to run (must include mcp_server)
        debug: If True, print tool schemas, model choices, and tool results

    Returns:
        Trajectory containing the results of the rollout
    """
    traj = art.Trajectory(
        messages_and_choices=[],
        reward=0,
        metadata={"task": scenario.task_description},
        metrics={
            "task_completed": False,
            "success": False,
            "ran_out_of_turns": False,
        },
        scenario=scenario,
    )

    # Initialize system prompt
    system_prompt = f"""You are an MCP (Model Context Protocol) agent.\n\nYou have access to MCP tools through the server. Use them to complete your task.\n\nWhen you believe you have completed the task, call the 'complete_task' function with a summary of what you accomplished. You have a total of {scenario.max_turns} turns to complete the task. Only use tool calls, do not write any content. After you have completed the task, call the 'complete_task' function with a summary of what you accomplished. Call complete_task by itself, not in conjunction with any other tools."""

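    # Initialize loop state up front so the bookkeeping after the try block
    # still works if connecting to the server fails.
    num_turns = 0
    task_completed = False

    # Connect to FastMCP server using in-memory transport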
    try:
        async with Client(scenario.mcp_server) as client:
            # Get available tools from the server
            tools_result = await client.list_tools()

            # Convert to OpenAI format
            tool_schemas = []
            for tool in tools_result:
                tool_schema = {
                    "type": "function",
                    "function": {
                        "name": tool.name,
                        "description": tool.description or f"MCP tool: {tool.name}",
                        "parameters": tool.inputSchema
                        or {"type": "object", "properties": {}},
                    },
                }
                tool_schemas.append(tool_schema)

            if debug:
                available_tools = [tool["function"]["name"] for tool in tool_schemas]
                print(f"Available MCP tools: {available_tools}")

            # Add completion tool schema
            tool_schemas.append(
                {
                    "type": "function",
                    "function": {
                        "name": "complete_task",
                        "description": "Complete the task with a summary",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "summary": {
                                    "type": "string",
                                    "description": "Summary of accomplishments",
                                }
                            },
                            "required": ["summary"],
                        },
                    },
                }
            )

            traj.tools = tool_schemas

            # Initialize conversation
            traj.messages_and_choices = [
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": f"Please complete this task: {scenario.task_description}",
                },
            ]

            if debug:
                print(traj.messages())

            num_turns = 0
            task_completed = False

            # Main interaction loop
            while num_turns < scenario.max_turns and not task_completed:
                num_turns += 1

                try:
                    # Get LLM response
                    async with traj.track_duration("llm_completion"):
                        openai_client = AsyncOpenAI(
                            api_key=model.inference_api_key,
                            base_url=model.inference_base_url,
                        )

                        response = await openai_client.chat.completions.create(
                            model=model.inference_model_name
                            if model.inference_model_name
                            else model.name,
                            messages=traj.messages(),
                            tools=tool_schemas,
                            max_completion_tokens=8000,
                        )

                    choice = response.choices[0]

                    if debug:
                        print(f"Choice: {choice.message}")

                    traj.messages_and_choices.append(choice)

                    # Handle tool calls
                    if choice.message.tool_calls:
                        for tool_call in choice.message.tool_calls:
                            try:
                                tool_args = json.loads(tool_call.function.arguments)

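                                if tool_call.function.name == "complete_task":
                                    # Set the loop flag as well, so the rollout stops
                                    # once the model signals completion.
                                    task_completed = True
                                    traj.metrics["task_completed"] = True
                                    traj.logs.append(
                                        f"Task completion attempted with summary: {tool_args['summary']}"
                                    )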
                                else:
                                    # Call MCP tool through FastMCP client
                                    result = await client.call_tool(
                                        tool_call.function.name, tool_args
                                    )

                                    content_text = get_content_text(result)

                                    if len(content_text) > 20000:
                                        print(
                                            f"Tool call result for {tool_call.function.name} is too long: {len(content_text)}"
                                        )
                                        print(f"Args: {tool_args}")
                                        # print first and last 1000 characters
                                        print(content_text[:1000])
                                        print(content_text[-1000:])
                                        raise Exception(
                                            f"Tool call result for {tool_call.function.name} is too long: {len(content_text)}"
                                        )

                                    # Add tool response
                                    traj.messages_and_choices.append(
                                        {
                                            "role": "tool",
                                            "tool_call_id": tool_call.id,
                                            "content": content_text,
                                        }
                                    )

                                # complete_task produces no tool result to print
                                if debug and tool_call.function.name != "complete_task":
                                    print(f"Tool call result: {content_text}")

                            except Exception as e:
                                traj.logs.append(f"Tool call error: {e}")

                                # Add error response
                                traj.messages_and_choices.append(
                                    {
                                        "role": "tool",
                                        "tool_call_id": tool_call.id,
                                        "content": f"Error: {str(e)}",
                                    }
                                )
                    else:
                        # No tool calls, just continue conversation
                        break

                except Exception as e:
                    traj.logs.append(f"Error in turn {num_turns}: {e}")
                    break

    except Exception as e:
        traj.logs.append(f"MCP server error: {e}")
    if not task_completed and num_turns == scenario.max_turns:
        traj.metrics["ran_out_of_turns"] = True

    traj.metrics["num_turns"] = num_turns

    if debug:
        for message in traj.messages_and_choices:
            print("\n")
            print(message)
            print("\n")

    return traj.finish()


# =============== Training code ===============

load_dotenv()

print(
    f"Using config: max_turns={MAX_TURNS}, rollouts_per_group={TRAINING_CONFIG['rollouts_per_group']}, groups_per_step={TRAINING_CONFIG['groups_per_step']}, num_epochs={TRAINING_CONFIG['num_epochs']}, learning_rate={TRAINING_CONFIG['learning_rate']}"
)

await model.register(backend)

train_scenarios = [
    McpScenario(
        task_description=scenario["task"],
        mcp_server=mcp,  # Use the FastMCP server directly
        max_turns=MAX_TURNS,
    )
    for scenario in raw_train_scenarios
]

# Create dataset iterator over the McpScenario objects built above
train_iterator = iterate_dataset(
    train_scenarios,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),  # Resume from checkpoint
)

# Main training loop using iterate_dataset
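for batch in train_iterator:
    # Honor the optional step cap from TRAINING_CONFIG
    max_steps = TRAINING_CONFIG["max_training_steps"]
    if max_steps is not None and batch.step >= max_steps:
        break

    print("Gathering trajectory groups with RULER scoring...")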

    # Use gather_trajectory_groups with ruler_score_group
    groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(model, scenario, False)
                for _ in range(TRAINING_CONFIG["rollouts_per_group"])
            )
            for scenario in batch.items
        ),
        pbar_desc=f"train gather step {batch.step}",
    )

    scored_groups = []
    for group in groups:
        # Use RULER to assign relative scores to each trajectory
        judged_group = await ruler_score_group(
            group, judge_model=RULER_MODEL, debug=True, swallow_exceptions=True
        )
        scored_groups.append(judged_group)

    print("starting train")
    await model.train(
        scored_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
    )
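
The rollout function converts each MCP tool into the OpenAI function-calling format before passing it to the model. The conversion can be isolated and checked on its own, as sketched below. `ToolStub` is a hypothetical stand-in for the tool objects returned by `list_tools()`; only the three attributes the rollout reads are modeled.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolStub:
    """Hypothetical stand-in exposing the attributes the rollout reads from a tool."""

    name: str
    description: Optional[str] = None
    inputSchema: Optional[Dict[str, Any]] = None


def to_openai_tool(tool: Any) -> Dict[str, Any]:
    """Convert one tool description to an OpenAI-style function schema."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            # Fall back to a generic description and an empty parameter
            # object when the server leaves these fields unset.
            "description": tool.description or f"MCP tool: {tool.name}",
            "parameters": tool.inputSchema or {"type": "object", "properties": {}},
        },
    }


quote_tool = ToolStub(
    name="get_stock_quote",
    description="Get real-time stock quote for a symbol",
    inputSchema={
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"],
    },
)
bare_tool = ToolStub(name="ping")

schemas = [to_openai_tool(tool) for tool in (quote_tool, bare_tool)]
print(schemas[1]["function"]["description"])
# MCP tool: ping
```

Keeping the fallbacks in one function makes it easier to test how tools without a description or schema will appear to the model.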

In [None]:
# @title Test Your Model!

# Build test scenarios from the held-out inputs generated earlier
print("Preparing test inputs...")
val_scenarios = [
    McpScenario(
        task_description=scenario["task"],
        mcp_server=mcp,  # Use the FastMCP server directly
        max_turns=MAX_TURNS,
    )
    for scenario in raw_val_scenarios
]

print(f"\n🧪 Testing the trained model on {len(val_scenarios)} new inputs:\n")
print("=" * 80)

for i, scenario in enumerate(val_scenarios):
    print(f"\nTest {i + 1}:")
    print(f"Input: {scenario.task_description}")

    # Run the model
    result_trajectory = await rollout(model, scenario)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]["content"] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(
    f"\nYour model '{MODEL_NAME}' has been trained to effectively use the Alphavantage MCP server."
)
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print(
    "3. Or continue training with more examples by adjusting the configuration at the top"
)
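
Because the system prompt asks the agent to respond only with tool calls, its final report is carried in the `summary` argument of `complete_task`, and the `content` of the last message may be empty. The sketch below shows one way to recover that summary. It assumes the messages are OpenAI-style dicts in which assistant messages hold a `tool_calls` list with JSON-encoded `arguments`; the exact shape returned by `messages()` may differ, and `extract_summary` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List, Optional


def extract_summary(messages: List[Dict[str, Any]]) -> Optional[str]:
    """Return the summary from the most recent complete_task call, if any."""
    for message in reversed(messages):
        for call in message.get("tool_calls") or []:
            function = call.get("function", {})
            if function.get("name") != "complete_task":
                continue
            try:
                args = json.loads(function.get("arguments", "{}"))
            except json.JSONDecodeError:
                return None
            return args.get("summary") if isinstance(args, dict) else None
    return None


messages = [
    {"role": "user", "content": "Please complete this task: get the AAPL quote"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_stock_quote",
                    "arguments": '{"symbol": "AAPL"}',
                },
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "Stock Quote for AAPL"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_2",
                "type": "function",
                "function": {
                    "name": "complete_task",
                    "arguments": '{"summary": "Fetched the AAPL quote."}',
                },
            }
        ],
    },
]

summary = extract_summary(messages)
print(summary)
# Fetched the AAPL quote.
```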

In [None]:
# @title Upload to Hugging Face 🤗

import torch
from unsloth import FastLanguageModel

lora_model_path = (
    f".art/{model.project}/models/{model.name}/{await model.get_step():04d}"
)

peft_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_path,
    max_seq_length=16384,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)

if False:  # Change to True to upload finetune
    peft_model.push_to_hub_merged(f"HF_ACCOUNT/{model.name}", tokenizer, token="hf_...")

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A pre-built MCP server
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Increase `num_training_inputs` to generate more varied input examples
2. **Longer training**: Increase `num_epochs` (or the number of inputs) to run more training steps
3. **More comparisons**: Increase `rollouts_per_group` for better RULER comparisons
4. **MCP server refinement**: Add better tools and resources to the server
5. **Hyperparameter tuning**: Adjust `learning_rate`, `groups_per_step`, and the other `TRAINING_CONFIG` values

Remember: RULER learns what "good" means from your MCP server alone - no labeled data required!
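
To see why several rollouts per input matter, consider how relative scores within a group can become a training signal. The sketch below is an illustration only: it assumes a GRPO-style normalization in which each score is centered on the group mean and scaled by the group standard deviation. This notebook does not show how ART computes its updates, and `group_relative_advantages` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Center scores on the group mean and scale by the group standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma < eps:
        # All rollouts scored the same: nothing distinguishes them.
        return [0.0] * len(scores)
    return [(score - mu) / sigma for score in scores]


# Three rollouts for one input, with hypothetical judge scores
varied = group_relative_advantages([0.9, 0.5, 0.1])
flat = group_relative_advantages([0.7, 0.7, 0.7])
print(varied)
print(flat)
```

Under this assumption a group whose rollouts all receive the same score contributes no signal, so larger groups raise the chance that at least one rollout stands out from the rest.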

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).