# Tutorial: RLM on LongBench-v2 Code Repository Understanding

This tutorial demonstrates **Recursive Language Models (RLM)** on the [LongBench-v2](https://huggingface.co/datasets/THUDM/LongBench-v2) Code Repository Understanding benchmark.

LongBench-v2 is a challenging benchmark with 503 multiple-choice questions requiring deep understanding and reasoning over contexts ranging from 8K to 2M words. The **Code Repository Understanding** split contains 50 questions about real code repositories.

Reference:
- ["Recursive Language Models" (Zhang, Kraska, Khattab, 2025)](https://arxiv.org/abs/placeholder)
- ["LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (Bai et al., 2024)](https://arxiv.org/abs/2412.15204)

Install dependencies: `pip install dspy datasets e2b-code-interpreter python-dotenv`

## Setup

Configure DSPy with an LLM and load environment variables for E2B.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from dotenv import load_dotenv
load_dotenv()

import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='pydantic')

import dspy

lm = dspy.LM("openai/gpt-5")
dspy.configure(lm=lm)


## Load the LongBench-v2 Dataset

LongBench-v2 contains 503 challenging multiple-choice questions across 6 domains:
- Single-Document QA (175)
- Multi-Document QA (125)
- Long In-context Learning (81)
- Code Repository Understanding (50)
- Long-dialogue History Understanding (39)
- Long Structured Data Understanding (33)

We focus on **Code Repository Understanding**: questions about real code repositories, with contexts in this split ranging from about 100K to over 16M characters.

In [3]:
from datasets import load_dataset
from collections import Counter

# Load the dataset
dataset = load_dataset('THUDM/LongBench-v2', split='train')

print(f"Total examples: {len(dataset)}")
print(f"\nDomains:")
for domain, count in sorted(Counter(row['domain'] for row in dataset).items()):
    print(f"  {domain}: {count}")

Total examples: 503

Domains:
  Code Repository Understanding: 50
  Long In-context Learning: 81
  Long Structured Data Understanding: 33
  Long-dialogue History Understanding: 39
  Multi-Document QA: 125
  Single-Document QA: 175


In [4]:
# Filter for Code Repository Understanding
code_examples = [row for row in dataset if row['domain'] == 'Code Repository Understanding']

print(f"Code Repository Understanding: {len(code_examples)} examples")
print(f"\nDifficulty distribution:")
print(f"  {dict(Counter(row['difficulty'] for row in code_examples))}")
print(f"\nLength distribution:")
print(f"  {dict(Counter(row['length'] for row in code_examples))}")

# Show context length statistics
context_lens = [len(row['context']) for row in code_examples]
print(f"\nContext lengths:")
print(f"  Min: {min(context_lens):,} chars")
print(f"  Max: {max(context_lens):,} chars")
print(f"  Mean: {sum(context_lens)//len(context_lens):,} chars")

Code Repository Understanding: 50 examples

Difficulty distribution:
  {'easy': 18, 'hard': 32}

Length distribution:
  {'long': 29, 'short': 12, 'medium': 9}

Context lengths:
  Min: 101,348 chars
  Max: 16,182,936 chars
  Mean: 3,560,623 chars
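
To put these character counts in perspective, the sketch below converts them to approximate token counts. The 4-characters-per-token ratio is an assumption (a common rule of thumb, not something measured in this notebook), and real tokenizers can differ noticeably on source code, so treat the numbers as order-of-magnitude estimates only.

```python
# Rough token estimate from the character counts printed above.
# ASSUMPTION: ~4 characters per token; actual tokenizers vary, especially on code.
CHARS_PER_TOKEN = 4

context_chars = {
    "min": 101_348,
    "mean": 3_560_623,
    "max": 16_182_936,
}

# Integer division keeps the estimate conservative and easy to read.
approx_tokens = {name: chars // CHARS_PER_TOKEN for name, chars in context_chars.items()}

for name, tokens in approx_tokens.items():
    print(f"{name:>4}: ~{tokens:,} tokens")
```

Under this assumption the mean context alone is on the order of several hundred thousand tokens, which is why the notebook has the model explore the context programmatically rather than read it in one prompt.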


## Prepare Examples for Evaluation

Convert the dataset to `dspy.Example` format. Each question has 4 choices (A, B, C, D) and one correct answer.

In [5]:
import random
random.seed(42)

def make_example(row):
    """Convert a LongBench-v2 row to a DSPy example."""
    # Format choices as part of the query - emphasize single letter answer
    choices_text = f"""

Choices:
A) {row['choice_A']}
B) {row['choice_B']}
C) {row['choice_C']}
D) {row['choice_D']}"""
    
    return dspy.Example(
        id=row["_id"],
        context=row["context"],
        query=row["question"] + choices_text,
        answer=row["answer"],  # A, B, C, or D
        difficulty=row["difficulty"],
        length=row["length"],
        choice_A=row["choice_A"],
        choice_B=row["choice_B"],
        choice_C=row["choice_C"],
        choice_D=row["choice_D"],
    ).with_inputs("context", "query")

# Create examples and shuffle
examples = [make_example(row) for row in code_examples]
random.shuffle(examples)

# Split: first 25 as devset, remaining 25 as testset
devset = examples[:25]
testset = examples[25:]

print(f"Devset: {len(devset)} examples")
print(f"Testset: {len(testset)} examples")

# Show an example
ex = devset[0]
print(f"\nExample:")
print(f"  Difficulty: {ex.difficulty}")
print(f"  Length: {ex.length}")
print(f"  Context: {len(ex.context):,} chars")
print(f"  Question: {ex.query[:200]}...")
print(f"  Answer: {ex.answer}")

Devset: 25 examples
Testset: 25 examples

Example:
  Difficulty: easy
  Length: short
  Context: 108,019 chars
  Question: In this cloud storage system, the scheduler is a crucial component responsible for allocating file blocks to the appropriate slave servers. There are three Raspberry Pi devices acting as slave servers...
  Answer: D


## Define the Metric and Signature

Simple multiple-choice accuracy: exact match on the letter (A, B, C, or D).

The signature, defined in the next section, uses a `Literal` type to constrain the output to valid choices.

In [6]:
def longbench_code_metric(example, pred, trace=None):
    """Multiple choice accuracy metric."""
    gold = example.answer.strip().upper()
    predicted = pred.answer.strip().upper() if pred.answer else ""
    return 1.0 if predicted == gold else 0.0
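
A quick way to sanity-check the metric without calling an LLM is to feed it stand-in objects. The sketch below repeats the metric so it runs on its own and uses `types.SimpleNamespace` in place of `dspy.Example` and prediction objects; that substitution is an assumption made purely for illustration, since the metric only reads the `.answer` attribute.

```python
from types import SimpleNamespace

def longbench_code_metric(example, pred, trace=None):
    """Same exact-match logic as the metric cell above."""
    gold = example.answer.strip().upper()
    predicted = pred.answer.strip().upper() if pred.answer else ""
    return 1.0 if predicted == gold else 0.0

# Stand-ins for dspy.Example / prediction objects (illustration only).
gold = SimpleNamespace(answer="D")

checks = {
    "exact match": longbench_code_metric(gold, SimpleNamespace(answer="D")),
    "lowercase + spaces": longbench_code_metric(gold, SimpleNamespace(answer=" d ")),
    "wrong letter": longbench_code_metric(gold, SimpleNamespace(answer="A")),
    "missing answer": longbench_code_metric(gold, SimpleNamespace(answer=None)),
}

for name, score in checks.items():
    print(f"{name}: {score}")
```

Case and surrounding whitespace are forgiven, while a wrong or missing answer scores zero.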

## Initialize RLM with E2B Sandbox

We use `E2BSandbox` which runs code in a secure cloud sandbox. This is ideal for:
- **Very long contexts**: Contexts in this split average about 3.6M characters, and the largest exceeds 16M
- **Security**: Untrusted code runs in isolated Firecracker microVMs
- **Tool support**: `llm_query()` is available for semantic analysis

The LLM can:
- Navigate the code repository using string operations, regex, etc.
- Call `llm_query(prompt)` to ask questions about code snippets
- Use `FINAL("A")` to submit the final answer
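
As an illustration of the first capability, here is a minimal sketch of the kind of regex search a model might write inside the sandbox. The helper name `find_snippets` and the toy context are hypothetical and not taken from RLM or from the benchmark; in the sandbox the real `context` variable would hold the full repository text.

```python
import re

def find_snippets(context, pattern, window=80, max_hits=5):
    """Return up to max_hits text windows surrounding regex matches in context."""
    snippets = []
    for match in re.finditer(pattern, context):
        start = max(0, match.start() - window)
        end = min(len(context), match.end() + window)
        snippets.append(context[start:end])
        if len(snippets) >= max_hits:
            break
    return snippets

# Made-up stand-in for a repository dump (NOT the benchmark's actual code).
toy_context = (
    "class RawClient:\n"
    "    def send(self, data): ...\n"
    "\n"
    "class Scheduler:\n"
    "    def allocate(self, blocks, slaves):\n"
    "        return []\n"
)

hits = find_snippets(toy_context, r"class\s+Scheduler|def\s+allocate")
for hit in hits:
    print("---")
    print(hit)
```

Snippets found this way are small enough to pass to `llm_query(prompt)` for a closer reading.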

In [None]:
from dspy.primitives.e2b_sandbox import E2BSandbox
from dspy.predict.rlm import RLM
from typing import Literal

# Define signature with Literal type for answer
class CodeQA(dspy.Signature):
    """Answer a multiple choice question about a code repository."""
    context: str = dspy.InputField(desc="The code repository contents")
    query: str = dspy.InputField(desc="The question with choices A, B, C, D")
    answer: Literal["A", "B", "C", "D"] = dspy.OutputField(desc="The answer: A, B, C, or D")

# Create sandbox (picks up OPENAI_API_KEY and E2B_API_KEY from environment)
sandbox = E2BSandbox()

# Create RLM module with typed signature
rlm = RLM(
    CodeQA,
    max_iterations=15,
    interpreter=sandbox,
    verbose=True,
)

print("RLM initialized with E2BSandbox")
print(f"  Max iterations: {rlm.max_iterations}")
print(f"  LLM: {dspy.settings.lm.model}")

## Run on a Single Example

Let's test RLM on one example to see how it explores the code repository.

In [8]:
def print_trajectory(trajectory):
    """Pretty-print an RLM trajectory."""
    for i, step in enumerate(trajectory):
        print(f"\n{'='*60}")
        print(f"Step {i+1}")
        print(f"{'='*60}")
        
        if step.get("reasoning"):
            reasoning = step['reasoning']
            if len(reasoning) > 300:
                reasoning = reasoning[:300] + "..."
            print(f"\nReasoning: {reasoning}")
        
        print(f"\nCode:")
        print(f"```python")
        print(step["code"])
        print(f"```")
        
        print(f"\nOutput:")
        output = step["output"]
        if len(output) > 500:
            print(output[:500] + "\n... (truncated)")
        else:
            print(output if output else "(no output)")

In [9]:
# Pick an example (prefer shorter context for faster demo)
example = min(devset, key=lambda x: len(x.context))

print(f"Example:")
print(f"  Difficulty: {example.difficulty}")
print(f"  Context length: {len(example.context):,} chars")
print(f"  Question: {example.query}")
print(f"  Expected answer: {example.answer}")
print(f"\nRunning RLM...")

result = rlm(context=example.context, query=example.query)

print(f"\n{'='*60}")
print(f"RESULT")
print(f"{'='*60}")
print(f"Predicted: {result.answer}")
print(f"Expected: {example.answer}")
print(f"Correct: {result.answer == example.answer}")
print(f"Steps: {len(result.trajectory)}")

Example:
  Difficulty: easy
  Context length: 108,019 chars
  Question: In this cloud storage system, the scheduler is a crucial component responsible for allocating file blocks to the appropriate slave servers. There are three Raspberry Pi devices acting as slave servers, named pi1, pi2, and pi3, with remaining storage capacities of 800 bytes, 700 bytes, and 500 bytes, respectively. A file has been divided into 9 blocks, each of 100 bytes, and two copies (including the original) need to be stored. Based on the scheduler’s allocation strategy, analyze how the two copies are distributed across the three Raspberry Pi devices:

Choices:
A) Copy 1: pi3(500), pi2(400). Copy 2: pi2(300), pi1(600)
B) Copy 1: pi1(400), pi2(400), pi3(100). Copy 2: pi3(400), pi2(300), pi1(200)
C) Copy 1: pi1(400), pi2(200), pi3(300). Copy 2: pi1(400), pi1(500)
D) Copy 1: pi1(800), pi2(100). Copy 2: pi2(600), pi3(300)
  Expected answer: D

Running RLM...


2026/01/08 10:37:12 INFO dspy.predict.rlm: RLM iteration 1/15
Reasoning: Step 1 - Explore: I need to inspect the provided variables. I will print lengths and small samples of both `context` and `query` to understand their contents and see if the query includes the multiple-choice options and if the context contains the scheduler logic.
Code:
print("context length:", len(context))
print("context sample:\n", context[:1000])

print("\nquery length:", len(query))
print("query full:\n", query)
2026/01/08 10:37:28 INFO dspy.predict.rlm: RLM iteration 2/15
Reasoning: Step 1 (DONE previously): Explored lengths and samples of context and query.

Step 2 - Analyze: I need to locate the scheduler/allocation strategy in the large codebase contained in `context`. I will search for classes/functions like Scheduler/schedule/allocate, and keywords such as "replica", "copy", "block", "slave", "capacity", "remaining". Then I will read the matched snippets to infer the allocation strategy and compute how 


RESULT
Predicted: D
Expected: D
Correct: True
Steps: 5


In [10]:
# Show the trajectory
print_trajectory(result.trajectory)


Step 1

Reasoning: Step 1 - Explore: I need to inspect the provided variables. I will print lengths and small samples of both `context` and `query` to understand their contents and see if the query includes the multiple-choice options and if the context contains the scheduler logic.

Code:
```python
print("context length:", len(context))
print("context sample:\n", context[:1000])

print("\nquery length:", len(query))
print("query full:\n", query)
```

Output:
context length: 108019
context sample:
 import socket
from IO.IOStream import *
from Constants import *

class RawClient:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.knock = Knock(method='socket', host=host, port=port)
        self.io_stream = self.knock.knock()

    def send(self, data, is_byte = False):
        print(f"Sending: {data[:100]}")
        self.io_stream.send(data, is_byte = is_byte)

    def recv(self, is_byte = False):
       
... (truncated)

Step 2

Reasoning

## Run Evaluation on Devset

Use `dspy.Evaluate` to run RLM on the devset. This run uses a single thread (`num_threads=1`).

In [None]:
# Shutdown the test sandbox from the single example run
sandbox.shutdown()

# NOTE: as written, no interpreter is passed below, so RLM falls back to its
# default LocalSandbox (local Deno/Pyodide sandbox).
# To evaluate on E2B's cloud sandbox instead, uncomment the three eval_sandbox
# lines: the creation here, `interpreter=` below, and the shutdown at the end.
# eval_sandbox = E2BSandbox()

rlm_eval = RLM(
    CodeQA,
    max_iterations=15,
    # interpreter=eval_sandbox,
    verbose=False,
)

# Run evaluation on the full devset (slow because of the large contexts)
eval_set = devset

evaluate = dspy.Evaluate(
    devset=eval_set,
    metric=longbench_code_metric,
    num_threads=1,  # Single thread (a single E2B sandbox, if enabled, would be shared)
    display_progress=True,
    display_table=5,
    provide_traceback=True,
)

print(f"Evaluating on {len(eval_set)} examples...")
results = evaluate(rlm_eval)

# Clean up
# eval_sandbox.shutdown()

In [14]:
print(f"\nLongBench-v2 Code Repository Understanding Results:")
print(f"  Examples: {len(eval_set)}")
print(f"  Accuracy: {results.score:.1f}%")
print(f"  LLM: {dspy.settings.lm.model}")

# Trajectory statistics
if results.results:
    trajectory_lengths = [len(r[1].trajectory) for r in results.results if hasattr(r[1], 'trajectory')]
    if trajectory_lengths:
        print(f"  Avg trajectory length: {sum(trajectory_lengths)/len(trajectory_lengths):.1f}")


LongBench-v2 Code Repository Understanding Results:
  Examples: 25
  Accuracy: 52.0%
  LLM: openai/gpt-5
  Avg trajectory length: 3.4
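
A single accuracy number hides how performance varies with difficulty and context length. The sketch below groups scores by an example field. It assumes each evaluation record is an `(example, prediction, score)` triple, which is consistent with how the cell above indexes `r[1]` but is otherwise an assumption; the records here are made-up stand-ins, not results from this run.

```python
from collections import defaultdict
from types import SimpleNamespace

def accuracy_by(records, field):
    """Group (example, prediction, score) triples by an example field and average the scores."""
    buckets = defaultdict(list)
    for example, _prediction, score in records:
        buckets[getattr(example, field)].append(float(score))
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}

# Made-up records for illustration only (not actual evaluation results).
toy_records = [
    (SimpleNamespace(difficulty="easy", length="short"), None, 1.0),
    (SimpleNamespace(difficulty="easy", length="long"), None, 0.0),
    (SimpleNamespace(difficulty="hard", length="long"), None, 1.0),
    (SimpleNamespace(difficulty="hard", length="medium"), None, 0.0),
    (SimpleNamespace(difficulty="hard", length="long"), None, 0.0),
]

by_difficulty = accuracy_by(toy_records, "difficulty")
by_length = accuracy_by(toy_records, "length")

print("By difficulty:", by_difficulty)
print("By length:", by_length)
```

With the real run, something like `accuracy_by(results.results, "difficulty")` would be the analogous call, provided the records have the assumed shape.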


## Analysis

LongBench-v2 Code Repository Understanding is challenging because:

1. **Massive contexts**: Contexts in this split average about 3.6M characters and reach over 16M
2. **Deep understanding required**: Questions require understanding code structure, function signatures, and logic
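3. **Human expert baseline is 53.7%**: The LongBench v2 paper reports this accuracy for human experts across the full benchmark under a 15-minute time limit, so even experts struggle with these questions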
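
When comparing the 52.0% devset accuracy with a baseline like this, keep in mind that it rests on only 25 examples (13 correct, given that the metric scores each example 0 or 1). The sketch below computes a Wilson score interval to show how wide the uncertainty is. Using the Wilson interval is a choice made here for illustration; it is not part of the original evaluation.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

low, high = wilson_interval(13, 25)
print(f"13/25 = {13 / 25:.1%}, approx. 95% interval: [{low:.1%}, {high:.1%}]")
```

The interval spans roughly a third to more than two thirds, so a 25-example run cannot separate the model from the human baseline in either direction.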

RLM helps by:
- Allowing the LLM to programmatically search through the codebase
- Using `llm_query()` to reason about specific code snippets
- Building up understanding iteratively through multiple code executions
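
One way to picture the second and third points is a filter-then-ask loop: split the context into overlapping windows, keep the windows that mention a keyword, and send only those to the sub-model. This is a sketch under stated assumptions, not RLM's actual strategy. `chunk_text`, `ask_relevant_chunks`, and `fake_llm_query` are hypothetical names, and `fake_llm_query` is a local stub standing in for the sandbox's `llm_query(prompt)`.

```python
def chunk_text(text, size, overlap):
    """Split text into windows of `size` chars, each overlapping the previous by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def fake_llm_query(prompt):
    """HYPOTHETICAL stub for llm_query(); it only reports the prompt size."""
    return f"(model answer for a {len(prompt)}-char prompt)"

def ask_relevant_chunks(context, question, keywords, size=2000, overlap=200,
                        query_fn=fake_llm_query):
    """Query only the chunks that mention at least one keyword (case-insensitive)."""
    notes = []
    for index, chunk in enumerate(chunk_text(context, size, overlap)):
        lowered = chunk.lower()
        if any(keyword.lower() in lowered for keyword in keywords):
            notes.append((index, query_fn(f"{question}\n\nCode:\n{chunk}")))
    return notes

# Made-up context: filler lines around one function of interest.
toy_context = "x = 1\n" * 50 + "def allocate(blocks): pass\n" + "y = 2\n" * 50

notes = ask_relevant_chunks(
    toy_context, "What does allocate do?", ["allocate"], size=120, overlap=20
)
for index, note in notes:
    print(index, note)
```

The overlap means a definition that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated sub-model calls.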

## Next Steps

1. **Try different LLMs**: Swap the model passed to `dspy.LM` to compare accuracy across models
2. **Increase iterations**: Complex questions may need more exploration steps
3. **Optimize prompts**: The RLM prompts can be tuned for code understanding
4. **Run full evaluation**: Evaluate on the held-out `testset` (25 examples) as well, so that results cover all 50 questions