<a href="https://colab.research.google.com/github/OpenPipe/ART/blob/auto-art/clean_auto_art.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train a model for your custom task, click _Runtime_ and press _Run all_. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**Custom Task Training with ART**

This notebook shows how to train a Qwen 2.5 7B model to perform any single-turn task you describe - no labeled data needed! Simply describe what you want the model to learn, and this notebook will:

1. Generate diverse input examples for your task
2. Create an appropriate system prompt
3. Train the model using RULER's automatic evaluation
4. Test the trained model on new inputs

RULER learns what makes a good output purely from your task description - no expected outputs required!

You will learn how to use RULER to train without labeled outputs, define custom [rollouts](https://art.openpipe.ai/resources/glossary#rollout), and run a [training loop](https://art.openpipe.ai/fundamentals/training-loop) that automatically improves your model.
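
To make the role of RULER's scores more concrete: each input gets several rollouts, RULER scores each one, and training favors responses that scored above their group's average. The sketch below assumes a GRPO-style normalization (subtract the group mean, divide by the group standard deviation). It is an illustration only, not ART's internal implementation, and `group_relative_advantages` is a hypothetical helper.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    """Center and scale scores within one group of rollouts for the same input."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps avoids division by zero when every rollout receives the same score
    return [(s - mu) / (sigma + eps) for s in scores]


# Example scores of the kind RULER might assign to three responses
advantages = group_relative_advantages([0.95, 0.70, 0.10])
print([round(a, 2) for a in advantages])

# Identical scores carry no learning signal under this assumption
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

Under this assumption, a group whose rollouts all receive the same score contributes nothing to the update, which is one reason to generate several different responses per input.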

In [1]:
#@title 💿 Installation

!uv pip install -q openpipe-art==0.3.11.post2 langchain-core tenacity "gql<4" --prerelease allow --no-cache-dir

<a name="Configuration"></a>

### 🎯 Configuration - Edit These Settings

Add an OpenRouter key and customize your training by modifying the values below:
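
If you would rather not paste secrets into the notebook, one option is to fall back to environment variables when a field is left empty. This is a minimal sketch: `resolve_key` is a hypothetical helper and not part of ART. The training cell raises an error when `OPENROUTER_API_KEY` is an empty string, so you would assign the resolved value to that variable before running it.

```python
import os


def resolve_key(explicit_value: str, env_var: str) -> str:
    """Prefer the value typed into the notebook; otherwise read the environment."""
    return explicit_value.strip() or os.environ.get(env_var, "").strip()


# Usage: leave the notebook field empty and export the variable beforehand
openrouter_key = resolve_key("", "OPENROUTER_API_KEY")
print("OpenRouter key found" if openrouter_key else "OpenRouter key missing")
```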

In [2]:
# Required - Used for generating training inputs and RULER evaluation
OPENROUTER_API_KEY = ""

# Optional - Enables metric logging
WANDB_API_KEY = ""

# Describe your custom task (be specific!)
TASK_DESCRIPTION = """
Convert informal bug reports into structured JIRA-style tickets with these exact sections:
- SUMMARY: (one line title)
- PRIORITY: (Critical/High/Medium/Low based on impact)
- STEPS TO REPRODUCE: (numbered list)
- EXPECTED RESULT: (what should happen)
- ACTUAL RESULT: (what actually happens)
- ENVIRONMENT: (extracted system/version info)
"""

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # Options: "Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-3B-Instruct", etc.

In [3]:
#@title Advanced Settings

# Model configuration
MODEL_NAME = "custom-task-model-001"  # Name for your trained model
PROJECT_NAME = "custom-task-training"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 25,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 1,  # Number of times through all data
    "rollouts_per_group": 3,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

# Evaluation configuration
RULER_MODEL = "openrouter/deepseek/deepseek-r1-0528"  # Model for RULER evaluation
SYSTEM_PROMPT_GENERATION_MODEL = "openrouter/moonshotai/kimi-k2"  # Model that writes the system prompt
INPUT_GENERATION_MODEL = "openrouter/moonshotai/kimi-k2"  # Model that generates training and test inputs
NUM_TEST_INPUTS = 5  # Number of test inputs to generate

# GPU configuration (for T4; keep these as-is unless you have a reason to change them)
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.7  # GPU memory usage (0.0-1.0)
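
As a quick sanity check on the settings above, the arithmetic below estimates how many training steps and rollouts a configuration implies. It assumes each step consumes `groups_per_step` inputs and that a final partial batch still counts as a step. `estimate_workload` is a hypothetical helper, not an ART function.

```python
import math


def estimate_workload(
    num_inputs: int, groups_per_step: int, num_epochs: int, rollouts_per_group: int
) -> dict:
    """Estimate step and rollout counts implied by a training configuration."""
    steps_per_epoch = math.ceil(num_inputs / groups_per_step)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
        "rollouts_per_full_step": groups_per_step * rollouts_per_group,
        "total_rollouts": num_inputs * rollouts_per_group * num_epochs,
    }


print(estimate_workload(num_inputs=25, groups_per_step=2, num_epochs=1, rollouts_per_group=3))
# → {'steps_per_epoch': 13, 'total_steps': 13, 'rollouts_per_full_step': 6, 'total_rollouts': 75}
```

With the default values this gives 13 steps of up to 6 rollouts each. The training loop calls `ruler_score_group` once per input group, so the number of judge calls per epoch is roughly the number of inputs.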

In [4]:
#@title Run this cell to train your model!
import art
from art.local import LocalBackend
from art.rewards import ruler_score_group
from art.utils.litellm import convert_litellm_choice_to_openai
from art.utils import iterate_dataset

import weave
from typing import List
import os
import random
from pydantic import BaseModel, Field
from litellm import acompletion
from dotenv import load_dotenv

load_dotenv()

# Required
if OPENROUTER_API_KEY:
    os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
else:
    raise ValueError(
        "OPENROUTER_API_KEY is required for data generation and RULER evaluation."
    )

# Optional
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")


class TrainingInput(BaseModel):
    input: str = Field(description="The input text for the task")

class TrainingDataset(BaseModel):
    inputs: List[TrainingInput] = Field(description="List of training inputs")

async def generate_training_inputs(task_description: str, num_examples: int = 50) -> List[str]:
    """Generate diverse training inputs for the given task"""

    system_prompt = f"""You are a helpful assistant that generates diverse, high-quality training inputs.

Task: {task_description}

Generate {num_examples} diverse INPUT examples that someone might provide for this task.
Make sure the inputs:
1. Cover a wide range of cases and edge cases
2. Are realistic and practical
3. Vary in length and complexity
4. Represent real-world scenarios

Only generate the INPUTS, not the outputs. RULER will evaluate the model's attempts automatically.
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Generate {num_examples} input examples for the task described above. Return them in the form of a list."}
    ]

    print(f"Generating {num_examples} training inputs...")

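    inputs: List[str] = []

    # Retry up to 5 times, accumulating unique inputs across attempts so a
    # short response does not discard examples that were already collected.
    for _ in range(5):
        if len(inputs) >= num_examples:
            break
        response = await acompletion(
            model=INPUT_GENERATION_MODEL,
            messages=messages,
            response_format=TrainingDataset,
            temperature=1.0,
        )

        dataset = TrainingDataset.model_validate_json(response.choices[0].message.content)
        for ex in dataset.inputs:
            if ex.input not in inputs:
                inputs.append(ex.input)

    if len(inputs) < num_examples:
        raise ValueError(f"Failed to generate {num_examples} training inputs.")

    # The model may return more than requested; keep exactly num_examples.
    return inputs[:num_examples]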

# Generate training inputs
training_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=TRAINING_CONFIG["num_training_inputs"])
print(f"\nGenerated {len(training_inputs)} training inputs!")
print("\nFirst 5 examples:")
for i, input_text in enumerate(training_inputs[:5]):
    print(f"\nExample {i+1}: {input_text}")

# =========== Model Creation Code ===========

random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=MAX_SEQ_LENGTH,
    ),
    engine_args=art.dev.EngineArgs(
        enforce_eager=True,
        gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    ),
)

# Initialize the server
backend = LocalBackend(
    in_process=True,
    path="./.art",
)

# Register the model with the local Backend
await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)

# ============ Rollout Function Code =============



if os.getenv("WANDB_API_KEY", ""):
    weave.init(PROJECT_NAME, settings={"print_call_link": False})

# Generate a system prompt for the task
async def generate_system_prompt(task_description: str) -> str:
    """Generate an appropriate system prompt for the task"""

    messages = [
        {
            "role": "system",
            "content": "Generate a clear, concise system prompt for a model that will perform the following task. The prompt should be direct and instructional."
        },
        {
            "role": "user",
            "content": f"Task: {task_description}\n\nGenerate a system prompt for this task."
        }
    ]

    response = await acompletion(
        model=SYSTEM_PROMPT_GENERATION_MODEL,
        messages=messages,
        temperature=0.3,
    )

    return response.choices[0].message.content.strip()

SYSTEM_PROMPT = await generate_system_prompt(TASK_DESCRIPTION)
print(f"Generated system prompt:\n\n{SYSTEM_PROMPT}")

class TaskInput(BaseModel):
    step: int
    input_text: str

@weave.op
async def rollout(model: art.Model, task_input: TaskInput) -> art.Trajectory:
    """Execute a single rollout for the custom task"""

    traj = art.Trajectory(
        reward=0.0,
        messages_and_choices=[],
        metadata={
            "step": task_input.step,
            "input": task_input.input_text,
        },
    )

    # Build the conversation
    traj.messages_and_choices = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task_input.input_text},
    ]

    # Get model response
    if model.trainable:
        litellm_model_name = f"hosted_vllm/{model.name}"
    else:
        litellm_model_name = model.name

    response = await acompletion(
        model=litellm_model_name,
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
        temperature=0.7,
        messages=traj.messages(),
        caching=False,
    )

    # Add the model's response to the trajectory
    traj.messages_and_choices.append(
        convert_litellm_choice_to_openai(response.choices[0])
    )

    return traj

print("\nRollout function defined!")


# Test RULER with example outputs for a text formalization task
test_input = "hey can u send me the report asap? thx"

base_messages = [
    {"role": "system", "content": "Convert informal text to formal business language."},
    {"role": "user", "content": test_input},
]

good_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Could you please send me the report at your earliest convenience? Thank you."},
    ],
    reward=0,
)

mediocre_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Can you send me the report soon? Thanks."},
    ],
    reward=0,
)

bad_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "hey send report quick thx"},
    ],
    reward=0,
)

sample_group = art.TrajectoryGroup(
    trajectories=[good_trajectory, mediocre_trajectory, bad_trajectory]
)

# RULER will score these based on how well they accomplish the task.
# Retry up to three times in case of API rate limiting.
judged_group = None  # defined up front so the assert below fails cleanly if every attempt errors
for i in range(3):
    try:
        judged_group = await ruler_score_group(sample_group, RULER_MODEL, debug=True)
        break
    except Exception as e:
        print(f"Error scoring group: {e}")

assert judged_group is not None, "RULER scoring failed after 3 attempts"

# Display rankings
sorted_trajectories = sorted(
    judged_group.trajectories, key=lambda t: t.reward, reverse=True
)
for rank, traj in enumerate(sorted_trajectories, 1):
    messages = traj.messages()
    print(f"\nRank {rank}: Score {traj.reward:.3f}")
    print(f"  Response: {messages[-1]['content']}")


# ============ Training Loop =============

# Convert training inputs to TaskInput objects
training_task_inputs = [
    TaskInput(step=0, input_text=inp)
    for inp in training_inputs
]

# Create training iterator
training_iterator = iterate_dataset(
    training_task_inputs,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),
)

print(f"Starting training with {len(training_task_inputs)} inputs...")
print(f"Training for {TRAINING_CONFIG['num_epochs']} epoch(s)")
print(f"Generating {TRAINING_CONFIG['rollouts_per_group']} responses per input for RULER to compare")
print("\nWhy multiple responses? RULER needs to compare different attempts to learn what's good!")

for batch, epoch, global_step, epoch_step in training_iterator:
    print(f"\nTraining step {global_step}, epoch {epoch}, epoch step {epoch_step}")
    print(f"Batch contains {len(batch)} inputs")

    # Create trajectory groups for this batch
    groups = []
    for task_input in batch:
        # Update step number
        task_input.step = global_step

        # Generate multiple responses for each input (RULER will compare these)
        groups.append(
            art.TrajectoryGroup(
                (
                    rollout(model, task_input)
                    for _ in range(TRAINING_CONFIG["rollouts_per_group"])
                )
            )
        )

    # Gather all trajectory groups
    finished_groups = await art.gather_trajectory_groups(
        groups,
        pbar_desc="Generating responses",
        max_exceptions=TRAINING_CONFIG["rollouts_per_group"] * len(batch),
    )

    # Use RULER to score each group
    judged_groups = []
    for group in finished_groups:
        judged_group = await ruler_score_group(
            group,
            RULER_MODEL,
            debug=False
        )
        judged_groups.append(judged_group)

    # Train on the scored trajectories
    await model.delete_checkpoints()
    await model.train(
        judged_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
        _config={"logprob_calculation_chunk_size": 8},
    )

    print(f"Completed training step {global_step}")

    # Stop after configured steps (if limit is set)
    if TRAINING_CONFIG["max_training_steps"] and global_step >= TRAINING_CONFIG["max_training_steps"]:
        print(f"Reached maximum training steps ({TRAINING_CONFIG['max_training_steps']})")
        break

print("\n✅ Training completed!")

INFO 07-29 03:20:21 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-29 03:20:21 [__init__.py:239] Automatically detected platform cuda.
Generating 25 training inputs...

Generated 25 training inputs!

First 5 examples:

Example 1: Hey guys, just noticed that nothing happens when i try to sign in using my google account from the checkout page on safari. I click the big blue 'Continue with Google' button, but the popup doesn't open at all. Any ideas? iOS 17.2, latest MacBook Pro.

Example 2: Pls fix ASAP - enter any gift card longer than 12 digits and the whole iOS app hard-crashes. Every user on TestFlight build 8.12.3 hits the same thing (just send a 13-digit GC like 9999999999999 and boom). Had 500+ crash logs overnight. Running iPhone 14 Pro.

Example 3: Super weird - if my email address contains a plus sign (+), the password-reset email never arrives. I'm on joe+test@mydomain.com, nothing ever shows up, yet it works fine when I do joe@test.com. Gmai


Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.097 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit with actual GPU utilization = 78.47%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.1 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 368.
Unsloth: vLLM's KV Cache can use up to 56.2 GB. Also swap space = 6 GB.
INFO 07-29 03:22:20 [config.py:717] 

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]




INFO 07-29 03:22:27 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 07-29 03:22:28 [model_runner.py:1140] Model loading took 6.7340 GiB and 5.049031 seconds
INFO 07-29 03:22:32 [worker.py:287] Memory profiling takes 3.29 seconds
INFO 07-29 03:22:32 [worker.py:287] the current vLLM instance can use total_gpu_memory (79.10GiB) x gpu_memory_utilization (0.70) = 55.37GiB
INFO 07-29 03:22:32 [worker.py:287] model weights take 6.73GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 2.04GiB; the rest of the memory reserved for KV Cache is 46.45GiB.
INFO 07-29 03:22:32 [executor_base.py:112] # cuda blocks: 54363, # CPU blocks: 7021
INFO 07-29 03:22:32 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 212.36x
INFO 07-29 03:22:37 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 8.84 seconds
Unsloth: Just some info: will skip parsing ['k_norm', 'q_norm', 'pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth

Unsloth 2025.5.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
Unsloth: Already have LoRA adapters! We shall skip this step.


Model created!
Base model: Qwen/Qwen2.5-7B-Instruct
Model name: custom-task-model-001
Project name: custom-task-training


[36m[1mweave[0m: weave version 0.51.59 is available!  To upgrade, please run:
[36m[1mweave[0m:  $ pip install weave --upgrade
INFO:weave.trace.init_message:weave version 0.51.59 is available!  To upgrade, please run:
 $ pip install weave --upgrade
[36m[1mweave[0m: Logged in as Weights & Biases user: openpipe.
[36m[1mweave[0m: View Weave data at https://wandb.ai/openpipe-team/custom-task-training/weave
INFO:weave.trace.init_message:Logged in as Weights & Biases user: openpipe.
View Weave data at https://wandb.ai/openpipe-team/custom-task-training/weave


Generated system prompt:

You are a bug-ticket formatter.  
For every user-supplied informal bug report, output a JIRA-style ticket that contains **only** the following six sections in the exact order and labels shown:

SUMMARY: <one-line title of the issue>

PRIORITY: <Critical|High|Medium|Low>  (choose based on user impact)

STEPS TO REPRODUCE:  
1. <first step>  
2. <second step>  
…

EXPECTED RESULT: <what should happen>

ACTUAL RESULT: <what actually happens>

ENVIRONMENT: <extracted system, version, browser, OS, etc.>

Do not add extra commentary, markdown, or sections.

Rollout function defined!



Rank 1: Score 0.950
  Response: Could you please send me the report at your earliest convenience? Thank you.

Rank 2: Score 0.700
  Response: Can you send me the report soon? Thanks.

Rank 3: Score 0.100
  Response: hey send report quick thx
Starting training with 25 inputs...
Training for 1 epoch(s)
Generating 3 responses per input for RULER to compare

Why multiple responses? RULER needs to compare different attempts to learn what's good!


Iterating dataset:  62%|######1   | 8/13 [00:00<?, ?batch/s]


Training step 8, epoch 0, epoch step 8
Batch contains 2 inputs


Generating responses:   0%|          | 0/6 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0007


[34m[1mwandb[0m: Currently logged in as: [33mopenpipe[0m ([33mopenpipe-team[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Wandb run initialized! You can view it at https://wandb.ai/openpipe-team/custom-task-training/runs/custom-task-model-001
Packed 6 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 30,000,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 20,185,088/7,000,000,000 (0.29% trained)


Unsloth: Will smartly offload gradients to save VRAM!
Completed training step 8

Training step 9, epoch 0, epoch step 9
Batch contains 2 inputs


Generating responses:   0%|          | 0/6 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0008
Packed 6 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 9

Training step 10, epoch 0, epoch step 10
Batch contains 2 inputs


Generating responses:   0%|          | 0/6 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0009
Packed 3 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 10

Training step 11, epoch 0, epoch step 11
Batch contains 2 inputs


Generating responses:   0%|          | 0/6 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0010
Packed 3 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 11

Training step 12, epoch 0, epoch step 12
Batch contains 1 inputs


Generating responses:   0%|          | 0/3 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0011
Packed 3 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 12

✅ Training completed!
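
The log above starts at step 8 of 13 because the iterator was given `initial_step=await model.get_step()`, so steps finished in an earlier run are skipped. The sketch below is a simplified stand-in that illustrates this resume behavior. It is not ART's `iterate_dataset` implementation (which may, for example, shuffle the data), and `iterate_batches` is a hypothetical name.

```python
import math
from typing import Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def iterate_batches(
    items: List[T], groups_per_step: int, num_epochs: int = 1, initial_step: int = 0
) -> Iterator[Tuple[List[T], int, int, int]]:
    """Yield (batch, epoch, global_step, epoch_step), skipping steps already done."""
    steps_per_epoch = math.ceil(len(items) / groups_per_step)
    for epoch in range(num_epochs):
        for epoch_step in range(steps_per_epoch):
            global_step = epoch * steps_per_epoch + epoch_step
            if global_step < initial_step:
                continue  # this step was completed in an earlier run
            start = epoch_step * groups_per_step
            yield items[start:start + groups_per_step], epoch, global_step, epoch_step


# 25 inputs, 2 per step, resuming from step 8
steps = [(g, len(b)) for b, _, g, _ in iterate_batches(list(range(25)), 2, initial_step=8)]
print(steps)
# → [(8, 2), (9, 2), (10, 2), (11, 2), (12, 1)]
```

The last step holds a single input because 25 does not divide evenly by 2, which matches the "Batch contains 1 inputs" line for step 12.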


In [5]:
#@title Test Your Model!

# Generate test inputs
print("Generating test inputs...")
test_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=NUM_TEST_INPUTS)

print(f"\n🧪 Testing the trained model on {len(test_inputs)} new inputs:\n")
print("=" * 80)

for i, test_input in enumerate(test_inputs):
    print(f"\nTest {i+1}:")
    print(f"Input: {test_input}")

    # Run the model
    test_task_input = TaskInput(
        step=999,
        input_text=test_input
    )
    result_trajectory = await rollout(model, test_task_input)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]['content'] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(f"\nYour model '{MODEL_NAME}' has been trained to: {TASK_DESCRIPTION}")
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print("3. Or continue training with more examples by adjusting the configuration at the top")

Generating test inputs...
Generating 5 training inputs...

🧪 Testing the trained model on 5 new inputs:


Test 1:
Input: yo, so every time I open the iOS app on my 12 pro max running 15.6 and try to upload a pic that's any bigger than like maybe 2 mb, the whole thing just fucking dies and goes back to the home screen. it's like the image is radioactive lmfao. smaller images go through fine so it's obviously the file size but cmon this ain't 2003. help or ima rage uninstall thx
Model output: SUMMARY: App crashes when uploading images larger than 2MB on iOS

PRIORITY: Medium

STEPS TO REPRODUCE:  
1. Open the iOS app on iPhone 12 Pro Max running iOS 15.6  
2. Attempt to upload an image larger than 2MB  

EXPECTED RESULT: Image uploads successfully

ACTUAL RESULT: App crashes and returns to the home screen

ENVIRONMENT: iOS 15.6, iPhone 12 Pro Max
--------------------------------------------------------------------------------

Test 2:
Input: Bug spotted: CSV button on the anal

[36m[1mweave[0m: üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/01985439-f8c8-7be2-b43b-470687306a2d
INFO:weave.trace.op:üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/01985439-f8c8-7be2-b43b-470687306a2d
[36m[1mweave[0m: üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-3b56-7430-bc6a-8575e6aba080
INFO:weave.trace.op:üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-3b56-7430-bc6a-8575e6aba080
[36m[1mweave[0m: üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-73df-7817-966f-802713d30304
INFO:weave.trace.op:üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-73df-7817-966f-802713d30304
[36m[1mweave[0m: üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-ae28-7113-9cac-49d72178b808
INFO:weave.trace.op:üç© https://wandb.ai/openpipe-team/custom-task-training/r/call/0198543a-ae28-7113-9cac-49d72178b808
[36m[1mweave[0m: üç© https:/

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A task description
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Generate more varied input examples
2. **Longer training**: Increase the number of training steps
3. **More comparisons**: Increase `rollouts_per_group` for better RULER comparisons
4. **Task refinement**: Make your task description more specific and detailed
5. **Hyperparameter tuning**: Adjust learning rate, batch size, etc.

Remember: RULER learns what "good" means from your task description alone - no labeled data required!
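
When reviewing test outputs for the default bug-ticket task, a lightweight structural check can complement RULER's judgment. The sketch below verifies that the six sections from the task description appear in order and that the priority is one of the allowed values. `check_ticket` is a hypothetical helper written for this example. It is not part of ART and plays no role in training.

```python
import re

REQUIRED_SECTIONS = [
    "SUMMARY",
    "PRIORITY",
    "STEPS TO REPRODUCE",
    "EXPECTED RESULT",
    "ACTUAL RESULT",
    "ENVIRONMENT",
]
VALID_PRIORITIES = {"Critical", "High", "Medium", "Low"}


def check_ticket(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means the ticket passed."""
    problems = []
    positions = []
    for name in REQUIRED_SECTIONS:
        # Section labels must start a line and be followed by a colon
        match = re.search(rf"^{re.escape(name)}:", text, flags=re.MULTILINE)
        if match is None:
            problems.append(f"missing section: {name}")
        else:
            positions.append(match.start())
    if positions != sorted(positions):
        problems.append("sections are out of order")
    priority = re.search(r"^PRIORITY:\s*(\S+)", text, flags=re.MULTILINE)
    if priority and priority.group(1) not in VALID_PRIORITIES:
        problems.append(f"unexpected priority: {priority.group(1)}")
    return problems


sample = """SUMMARY: App crashes when uploading images larger than 2MB on iOS

PRIORITY: Medium

STEPS TO REPRODUCE:
1. Open the iOS app on iPhone 12 Pro Max running iOS 15.6
2. Attempt to upload an image larger than 2MB

EXPECTED RESULT: Image uploads successfully

ACTUAL RESULT: App crashes and returns to the home screen

ENVIRONMENT: iOS 15.6, iPhone 12 Pro Max"""

print(check_ticket(sample))  # an empty list means no structural problems were found
```

A check like this only covers format. Whether the summary is accurate or the priority is sensible still needs a judge such as RULER or a human reviewer.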

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).