# RLM Log Analysis Functions

This notebook provides utility functions to extract key data from RLM log files:
- **Final answer**: The agent's concluding response
- **Code blocks**: All code executed during the session
- **RLM calls**: Sub-LLM calls made via `llm_query()` / `llm_query_batched()`

In [17]:
import sys
import importlib


sys.path.append('data/analysis')
import rlm_log_utils
importlib.reload(rlm_log_utils)
from rlm_log_utils import *

## Usage Example

Load the log file and extract key information:

# Load

In [18]:
LOG_PATH = "/home/winnieyangwn/rlm/logs/rlm_2026-01-31_11-05-04_ee149f54.jsonl"

# Load the log - first entry is metadata, rest are iterations
entries = load_rlm_log(LOG_PATH)
metadata = entries[0]
iterations = entries[1:]

print(f"Loaded {len(iterations)} iterations")

Loaded 2 iterations
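
The implementation of `load_rlm_log` lives in `rlm_log_utils` and is not shown in this notebook. As a rough sketch only, assuming the `.jsonl` file holds one JSON object per line and that each entry carries the `type` field seen in the outputs here (`metadata` or `iteration`), the loading and splitting step could look like this. The names `load_jsonl_sketch` and `split_entries` are hypothetical and are not part of `rlm_log_utils`.

```python
import json
from pathlib import Path


def load_jsonl_sketch(path):
    """Hypothetical stand-in for load_rlm_log: parse one JSON object per line."""
    entries = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def split_entries(entries):
    """Separate the metadata entry from iteration entries using the 'type' field."""
    metadata = next((e for e in entries if e.get("type") == "metadata"), None)
    iterations = [e for e in entries if e.get("type") == "iteration"]
    return metadata, iterations
```

Splitting on `type` rather than on position (`entries[0]` / `entries[1:]`) is a design choice in this sketch: it still works if a log happens to have no metadata line.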


# Metadata

In [19]:
# View metadata
print("=== METADATA ===")
for k, v in metadata.items():
    if k != "backend_kwargs":
        print(f"{k}: {v}")

=== METADATA ===
type: metadata
timestamp: 2026-01-31T11:05:04.339193
root_model: gpt-5
max_depth: 1
max_iterations: 10
backend: azure_openai
environment_type: local
environment_kwargs: {}
other_backends: None


# Iterations

# Iteration #1

In [28]:
iterations[0].keys()

dict_keys(['type', 'iteration', 'timestamp', 'prompt', 'response', 'code_blocks', 'final_answer', 'iteration_time'])

In [None]:
iterations[0]["iteration"]

1

In [31]:
iterations[0]["timestamp"]

'2026-01-31T11:05:16.345724'

In [32]:
iterations[0]["prompt"]

[{'role': 'system',
  'content': 'You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.\n\nThe REPL environment is initialized with:\n1. A `context` variable that contains extremely important information about your query. You should check the content of the `context` variable to understand what you are working with. Make sure you look through it sufficiently as you answer your query.\n2. A `llm_query` function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment.\n3. A `llm_query_batched` function that allows you to query multiple prompts concurrently: `llm_query_batched(prompts: List[str]) -> List[str]`. This is much faster than sequential `llm_query` calls when you have multiple inde

In [33]:
iterations[0]["response"]

'```repl\ntotal = len(context)\nvalid = sum(1 for r in context if r.get("valid_submission"))\npercentage = (valid / total) * 100\nprint(f"Total rollouts: {total}, Valid submissions: {valid}, Percentage: {percentage:.2f}%")\n```'
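
The response above wraps the code to execute in a fenced `repl` block, and the same code appears under `code_blocks`. How the RLM runtime parses the response is not shown here; the following is a minimal sketch, assuming a simple regular-expression match on that fence. `extract_repl_blocks` is a hypothetical helper, and the fence string is built programmatically only to keep this example readable.

```python
import re

# Assumption: executable code is wrapped in a fence of three backticks
# followed by the tag "repl", as in the response shown above.
FENCE = "`" * 3
REPL_BLOCK = re.compile(
    re.escape(FENCE) + r"repl\s*\n(.*?)\n" + re.escape(FENCE),
    re.DOTALL,
)


def extract_repl_blocks(response):
    """Return the body of every fenced repl block found in a response string."""
    return [body.strip() for body in REPL_BLOCK.findall(response)]
```

Applied to `iterations[0]["response"]`, a helper like this would be expected to return a single string matching `iterations[0]["code_blocks"][0]["code"]`, provided the runtime uses the same fence convention.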

In [34]:
iterations[0]["iteration_time"]

4.414399859495461

# Code Block


In [37]:
len(iterations[0]["code_blocks"])

1

In [42]:
iterations[0]["code_blocks"][0].keys()

dict_keys(['code', 'result'])


### Code
```
total = len(context)
valid = sum(1 for r in context if r.get("valid_submission"))
percentage = (valid / total) * 100
print(f"Total rollouts: {total}, Valid submissions: {valid}, Percentage: {percentage:.2f}%")
```


### Result

#### Locals

The `locals` snapshot captures every variable in the REPL namespace after the code block has executed.

The code block itself created the `total`, `valid`, and `percentage` variables.

The `context` and `context_0` variables were injected by the RLM environment during setup.
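
One way to check which names a code block created is to parse the code and compare its assignments with the snapshot keys. The sketch below is a simplification: it only looks at plain top-level assignments (it ignores imports, loops, and function definitions), and `top_level_assigned_names` is a hypothetical helper, not part of `rlm_log_utils`. The snapshot keys are copied from the `locals` keys printed in this notebook.

```python
import ast


def top_level_assigned_names(code):
    """Names bound by plain top-level assignments in `code` (simplified)."""
    names = []
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AnnAssign, ast.AugAssign)):
            targets = [node.target]
        else:
            continue
        for target in targets:
            # ast.walk also covers tuple unpacking such as `a, b = 1, 2`
            for sub in ast.walk(target):
                is_store = isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store)
                if is_store and sub.id not in names:
                    names.append(sub.id)
    return names


code = (
    "total = len(context)\n"
    'valid = sum(1 for r in context if r.get("valid_submission"))\n'
    "percentage = (valid / total) * 100\n"
)
snapshot_keys = ["json", "f", "context_0", "context", "total", "valid", "percentage"]

created = top_level_assigned_names(code)
pre_existing = [k for k in snapshot_keys if k not in created]
print(created)       # ['total', 'valid', 'percentage']
print(pre_existing)  # ['json', 'f', 'context_0', 'context']
```

The loop variable `r` does not show up because it is bound inside the generator expression, not by a top-level assignment.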


#### Statefulness of locals

NOTE: `locals` persists across iterations because the same environment instance is reused for every iteration.

The loop in `rlm.py` (line 234) shows this:

```
for i in range(self.max_iterations):
    ...
    iteration = self._completion_turn(
        prompt=current_prompt,
        lm_handler=lm_handler,
        environment=environment,  # Same environment instance each time
    )
```

The environment object (e.g., `LocalREPL`) maintains `self.locals` as **persistent** state. When code runs:

- Variables are added or modified in `self.locals`.
- Those variables remain available to the next code block or iteration.

The `locals` in each log entry is a snapshot taken at that moment, so it shows the cumulative state up to that point.
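
The persistence described above can be reproduced with a toy environment. This is a minimal sketch, not the actual `LocalREPL` implementation: `TinyREPL` is a hypothetical class that keeps one namespace dictionary and passes it to Python's built-in `exec` for every code block.

```python
class TinyREPL:
    """Hypothetical stand-in for an environment such as LocalREPL."""

    def __init__(self, context):
        # One namespace dict, created once and reused for every execute() call
        self.locals = {"context": context}

    def execute(self, code):
        # exec() reads and writes names in self.locals, so state accumulates
        exec(code, self.locals)
        # Snapshot of user-visible names (exec adds '__builtins__' to the dict)
        return sorted(k for k in self.locals if not k.startswith("__"))


env = TinyREPL(context=[{"valid_submission": True}, {"valid_submission": False}])

first = env.execute("total = len(context)")
second = env.execute(
    "valid = sum(1 for r in context if r['valid_submission'])\n"
    "ratio = valid / total"
)

print(first)   # ['context', 'total']
print(second)  # ['context', 'ratio', 'total', 'valid']
```

The second block can use `total` only because the first block left it in the shared namespace, and the second snapshot contains the names from both blocks.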


In [None]:
iterations[0]["code_blocks"][0]["code"]

In [41]:
iterations[0]["code_blocks"][0]["result"]["locals"].keys()

dict_keys(['json', 'f', 'context_0', 'context', 'total', 'valid', 'percentage'])

In [45]:
len(iterations[0]["code_blocks"][0]["result"]["locals"]["context_0"])

4800

# Iteration #2

In [26]:
iterations[1]

{'type': 'iteration',
 'iteration': 2,
 'timestamp': '2026-01-31T11:05:27.813493',
 'prompt': [{'role': 'system',
   'content': 'You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.\n\nThe REPL environment is initialized with:\n1. A `context` variable that contains extremely important information about your query. You should check the content of the `context` variable to understand what you are working with. Make sure you look through it sufficiently as you answer your query.\n2. A `llm_query` function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment.\n3. A `llm_query_batched` function that allows you to query multiple prompts concurrently: `llm_query_batched(prompts: List[str]) -

In [20]:
# Get the final answer
final_answer = get_final_answer(iterations)
print("=== FINAL ANSWER ===")
print(final_answer[:2000] if final_answer else "No final answer found")
print(f"\n(Total length: {len(final_answer) if final_answer else 0} chars)")

=== FINAL ANSWER ===
79.29%

(Total length: 6 chars)


In [21]:
# Get all code blocks
code_blocks = get_all_code_with_results(iterations)
print(f"=== CODE BLOCKS ({len(code_blocks)} total) ===\n")

for i, block in enumerate(code_blocks[:3]):  # Show first 3
    print(f"--- Block {i+1} (Iteration {block['iteration']}) ---")
    print(block["code"][:500])
    if block.get("stdout"):
        print(f"\n[stdout]: {block['stdout'][:200]}...")
    print()

=== CODE BLOCKS (1 total) ===

--- Block 1 (Iteration 1) ---
total = len(context)
valid = sum(1 for r in context if r.get("valid_submission"))
percentage = (valid / total) * 100
print(f"Total rollouts: {total}, Valid submissions: {valid}, Percentage: {percentage:.2f}%")

[stdout]: Total rollouts: 4800, Valid submissions: 3806, Percentage: 79.29%
...



In [23]:
# Get RLM calls summary
summary = get_sub_rlm_calls_summary(iterations)
print("=== RLM CALLS SUMMARY ===")
print(f"Total sub-LLM calls: {summary['total_calls']}")
print(f"Total input tokens: {summary['total_input_tokens']:,}")
print(f"Total output tokens: {summary['total_output_tokens']:,}")
print(f"Models used: {summary['models_used']}")
print(f"Calls per iteration: {summary['calls_by_iteration']}")

=== RLM CALLS SUMMARY ===
Total sub-LLM calls: 0
Total input tokens: 0
Total output tokens: 0
Models used: []
Calls per iteration: {}


In [24]:
# Get detailed RLM calls
rlm_calls = get_sub_rlm_calls(iterations)
print(f"=== RLM CALLS DETAIL ({len(rlm_calls)} calls) ===\n")

for i, call in enumerate(rlm_calls[:2]):  # Show first 2 calls
    print(f"--- Call {i+1} (Iteration {call['iteration']}, Block {call['code_block_idx']}) ---")
    print(f"Model: {call['root_model']}")
    print(f"Execution time: {call['execution_time']:.2f}s")
    prompt_preview = str(call['prompt'])[:300]
    print(f"Prompt preview: {prompt_preview}...")
    response_preview = call['response'][:300] if call['response'] else "None"
    print(f"Response preview: {response_preview}...")
    print()

=== RLM CALLS DETAIL (0 calls) ===



In [25]:
# Extract everything at once
all_data = extract_all(LOG_PATH)
print("=== FULL EXTRACTION ===")
print(f"Metadata keys: {list(all_data['metadata'].keys()) if all_data['metadata'] else 'None'}")
print(f"Number of iterations: {all_data['num_iterations']}")
print(f"Number of code blocks: {len(all_data['code_blocks'])}")
print(f"Number of RLM calls: {len(all_data['rlm_calls'])}")
print(f"Has final answer: {all_data['final_answer'] is not None}")

=== FULL EXTRACTION ===
Metadata keys: ['type', 'timestamp', 'root_model', 'max_depth', 'max_iterations', 'backend', 'backend_kwargs', 'environment_type', 'environment_kwargs', 'other_backends']
Number of iterations: 2
Number of code blocks: 1
Number of RLM calls: 0
Has final answer: True
