# Transformer model + RPG testing pipeline

This document lays out the procedure for:
- Training the forward model to learn the RPG environment and saving it to disk
- Using the trained forward model to generate expert memories and saving them
- Loading the expert memories into the transformer model
- Training the transformer model

## Table of Contents
1. [Train the forward model](#Train-the-forward-model)
2. [Generate memories and save to disk](#Generate-memories-and-save-to-disk)
3. [Evaluate the transformer model](#Evaluate-the-transformer-model)

In [8]:
# --------------- #
# region: Imports #
# --------------- #

# Import base packages
import os
import sys
from pathlib import Path
from datetime import datetime


from sorrel.examples.treasurehunt.entities import EmptyEntity
from sorrel.examples.treasurehunt.env import TreasurehuntEnv
from sorrel.examples.treasurehunt.world import TreasurehuntWorld
from sorrel.utils.logging import TensorboardLogger

# --------------- #
# endregion       #
# --------------- #

## Train the forward model

To give the transformer model something to learn, we need to first train a model to solve the RPG task.

To change details of the model training regime, edit the configuration file (by default, `../configs/config.yaml` relative to this Python notebook). Some settings must be changed together for the model to run successfully.

Each epoch prints a short summary to the console. More detailed data is written to TensorBoard if `log` in the configuration file is set to `True`.

In [9]:
STATIC_RUNTIME = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Load configuration
config = {
    "experiment": {
        "epochs": 1000,
        "max_turns": 100,
        "record_period": 50,
        "log_dir": Path("../data/logs/forward_model") / STATIC_RUNTIME,
    },
    "model": {
        "agent_vision_radius": 4,
        "epsilon_decay": 0.0005,
    },
    "world": {
        "height": 20,
        "width": 20,
        "gem_value": 10,
        "food_value": 5,
        "bone_value": -10,
        "spawn_prob": 0.01,
    },
}

# construct the world
world = TreasurehuntWorld(config=config, default_entity=EmptyEntity())

# construct the environment
env = TreasurehuntEnv(world, config)

# model save path
model_path = Path("../data/checkpoints") / f"treasurehunt_model_{STATIC_RUNTIME}.pkl"
model_path.parent.mkdir(parents=True, exist_ok=True)

# run the experiment with default parameters
env.run_experiment(
    output_dir=Path("../data"),
    logger=TensorboardLogger.from_config(config),
)

# save model (only the first agent's model is written, hence the break below)
print("\nSaving model...")
for agent in env.agents:
    agent.model.save(file_path=str(model_path))
    print(f"Model saved to: {model_path}")
    break


Saving model...
Model saved to: ../data/checkpoints/treasurehunt_model_2025-12-22_16-55-42.pkl


## Generate memories and save to disk

Memories of game trajectories, replays of games, and model scores on games are generated with the `eval_model()` function. This function plays a specified number of games (1 by default) and returns a dictionary of output variables. The memories file is fairly large (about 1.5 GB for 1024 games), so plan disk space accordingly.

Flags that can be used:
- `'memories'` returns a stored memory buffer of size (n_games x max_turns).
- `'frames'` returns a list of size (n_games x max_turns) with images of each turn.
- `'scores'` returns a record of the model's reward on each turn.
- `'jupyter-mode'` should be added when you are using `'frames'` from a Python notebook rather than from the command line.

**NOTE**: As generating the frames takes a while, it's generally faster to generate animated replays and memories separately.
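
As a rough planning aid, the size figure quoted above can be scaled to other game counts. This is a minimal sketch that assumes the file size grows linearly with the number of games, which the notebook does not confirm; `estimate_memory_size_gb` is a hypothetical helper, not part of `sorrel`.

```python
# Assumption: file size grows roughly linearly with the number of games.
REFERENCE_GAMES = 1024
REFERENCE_SIZE_GB = 1.5  # approximate figure quoted in the text above


def estimate_memory_size_gb(num_games: int) -> float:
    """Scale the reference size linearly to `num_games` (hypothetical helper)."""
    if num_games < 0:
        raise ValueError("num_games must be non-negative")
    return REFERENCE_SIZE_GB * num_games / REFERENCE_GAMES


for n in (1, 128, 1024):
    print(f"{n:>5} games: ~{estimate_memory_size_gb(n):.4f} GB")
```

Under that assumption, the 128 games generated in the next cell would need on the order of 0.19 GB.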

In [10]:
model_path = Path(f"../data/checkpoints/treasurehunt_model_{STATIC_RUNTIME}.pkl")
print(f"Using model: {model_path}")

# Reconstructing the same environment
world = TreasurehuntWorld(config=config, default_entity=EmptyEntity())
env = TreasurehuntEnv(world, config)

# Loading saved weights (the same checkpoint is loaded into every agent)
for agent in env.agents:
    agent.model.load(file_path=str(model_path))  # type: ignore

# Generating memories
output_dir = Path("../data")
num_games = 128

env.generate_memories(num_games=num_games, animate=False, output_dir=output_dir)


print("\nMemories saved!")
print(f"Agent 0: {output_dir / 'memories/agent0.npz'}")
print(f"Agent 1: {output_dir / 'memories/agent1.npz'}")

Using model: ../data/checkpoints/treasurehunt_model_2025-12-22_16-55-42.pkl

Memories saved!
Agent 0: ../data/memories/agent0.npz
Agent 1: ../data/memories/agent1.npz
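
Before loading a memories file into the transformer, it can help to check which arrays it contains. The sketch below uses only NumPy's generic `.npz` reader. The array names and shapes in the demo archive are hypothetical stand-ins, since the layout that `sorrel` writes is not shown in this notebook, and `summarize_npz` is an illustrative helper rather than a library function.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    summary = {}
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Demo on a throwaway archive. The names "states" and "actions" are
# hypothetical and may not match what the memories files really contain.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.npz"
    np.savez(
        demo_path,
        states=np.zeros((10, 6, 9, 9), dtype=np.float32),
        actions=np.zeros(10, dtype=np.int64),
    )
    summary = summarize_npz(demo_path)

for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

Pointing `summarize_npz` at `../data/memories/agent0.npz` would list the real arrays, assuming the file is a plain NumPy archive.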


In [11]:
import numpy.random as random
from datetime import datetime
from pathlib import Path

from sorrel.models.pytorch.transformer import ViTOneHot
from sorrel.buffers import TransformerBuffer
from sorrel.utils.logging import TensorboardLogger

TRAINING_EPOCHS = 1000
STATIC_RUNTIME = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

transformer_buffer = TransformerBuffer.load(Path("../data") / "memories/agent1.npz")
model = ViTOneHot(
    state_size=(6, 9, 9),
    action_space=4,
    layer_size=192,
    patch_size=3,
    num_frames=5,
    num_heads=3,
    batch_size=64,
    num_layers=2,
    memory=transformer_buffer,
    LR=0.001,
    device="cpu",
    seed=random.randint(1, 1000),
)
logger = TensorboardLogger(
    TRAINING_EPOCHS,
    Path("../data/logs/inverse_model") / STATIC_RUNTIME,
    "state_loss",
    "action_loss",
    "state_targets",
    "state_preds",
)

## Evaluate the transformer model

The transformer model uses a separate configuration file (by default, `../configs/transformer.yaml`). This file is shorter and handles only the details of the transformer model itself, since all of the details of the environment were already established when training the forward model. Make sure that the input parameters are compatible with those used by the forward model.

**NOTE**: At this stage, some configurations are not compatible with the transformer model. For example, the transformer model requires the height and width of the state to be evenly divisible by the patch size. Since the agent vision produces an odd-numbered H x W, the state in the forward model must have an H x W of, e.g., 9 x 9, 15 x 15, or 21 x 21 to use a patch size of 3.
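
The divisibility constraint can be checked before training. This sketch assumes the view is a square of side `2 * radius + 1` centred on the agent, which matches `agent_vision_radius: 4` and the 9 x 9 state used in this notebook but is not confirmed for other settings. Both helper functions are hypothetical.

```python
def vision_side(radius: int) -> int:
    """Side length of the agent's view (assumed to be 2 * radius + 1)."""
    return 2 * radius + 1


def patches_per_frame(radius: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that tile one frame."""
    side = vision_side(radius)
    if side % patch_size != 0:
        raise ValueError(
            f"side {side} is not evenly divisible by patch size {patch_size}"
        )
    return (side // patch_size) ** 2


# Radii from 1 to 10 whose view can be tiled with 3 x 3 patches.
compatible = [r for r in range(1, 11) if vision_side(r) % 3 == 0]
print(compatible)
print(patches_per_frame(4, 3))
```

With a radius of 4 and a patch size of 3, each 9 x 9 frame splits into 9 patches. A radius of 5 (an 11 x 11 view) would raise an error.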

In [12]:
for epoch in range(TRAINING_EPOCHS):
    state_loss, action_loss = model.train_model()
    state_predictions, state_targets = model.plot_trajectory()

    logger.record_turn(
        epoch=epoch,
        loss=action_loss + state_loss,
        reward=0.0,
        action_loss=action_loss,
        state_loss=state_loss,
    )
    logger.writer.add_images(
        "state_targets", state_targets[:, 1:4], epoch, dataformats="NCHW"
    )
    logger.writer.add_images(
        "state_preds", state_predictions[:, 1:4], epoch, dataformats="NCHW"
    )

**Simple test to validate weight transfer hypothesis**

In [None]:
import numpy as np
import torch
from pathlib import Path
from sorrel.buffers import TransformerBuffer

# transformer_buffer already holds the agent1.npz memories ("Agent 1" below).
# Load agent0.npz as the second buffer ("Agent 2" below).
buffer_agent_2 = TransformerBuffer.load(Path("../data/memories/agent0.npz"))

# Evaluate on Agent 1 (Self)

print("EVALUATING ON SELF")

model.memory = transformer_buffer

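# NOTE: the loops below reuse train_model() to obtain losses. If that method
# also performs an optimizer step, the weights keep changing during evaluation.
model.eval()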
eval_losses_agent_1 = []
eval_state_losses_1 = []
eval_action_losses_1 = []

print("\nEvaluating on Agent 1")

for _ in range(500):
    state_loss, action_loss = model.train_model()
    eval_losses_agent_1.append(state_loss + action_loss)
    eval_state_losses_1.append(state_loss)
    eval_action_losses_1.append(action_loss)

avg_loss_agent_1 = np.mean(eval_losses_agent_1)
std_loss_agent_1 = np.std(eval_losses_agent_1)
avg_state_loss_1 = np.mean(eval_state_losses_1)
avg_action_loss_1 = np.mean(eval_action_losses_1)

print(f"Total Loss:  {avg_loss_agent_1:.4f} ± {std_loss_agent_1:.4f}")
print(f"State Loss:  {avg_state_loss_1:.4f}")
print(f"Action Loss: {avg_action_loss_1:.4f}")

# Cross-Agent Transfer (Agent 2)

print("\nCROSS-AGENT TRANSFER")

# Switch to Agent 2's memory
model.memory = buffer_agent_2

eval_losses_agent_2 = []
eval_state_losses_2 = []
eval_action_losses_2 = []

print("\nEvaluating on Agent 2")
for _ in range(500):
    state_loss, action_loss = model.train_model()
    eval_losses_agent_2.append(state_loss + action_loss)
    eval_state_losses_2.append(state_loss)
    eval_action_losses_2.append(action_loss)

avg_loss_agent_2 = np.mean(eval_losses_agent_2)
std_loss_agent_2 = np.std(eval_losses_agent_2)
avg_state_loss_2 = np.mean(eval_state_losses_2)
avg_action_loss_2 = np.mean(eval_action_losses_2)

print(f"Total Loss:  {avg_loss_agent_2:.4f} ± {std_loss_agent_2:.4f}")
print(f"State Loss:  {avg_state_loss_2:.4f}")
print(f"Action Loss: {avg_action_loss_2:.4f}")

# RESULTS

print("RESULTS")

# Transfer metrics
total_transfer_gap = (avg_loss_agent_2 - avg_loss_agent_1) / avg_loss_agent_1 * 100
state_transfer_gap = (avg_state_loss_2 - avg_state_loss_1) / avg_state_loss_1 * 100
action_transfer_gap = (avg_action_loss_2 - avg_action_loss_1) / avg_action_loss_1 * 100

print("\nTransfer Gaps:")
print(f"Total Loss:  {total_transfer_gap:>+6.1f}%")
print(f"State Loss:  {state_transfer_gap:>+6.1f}%")
print(f"Action Loss: {action_transfer_gap:>+6.1f}%")

EVALUATING ON SELF

Evaluating on Agent 1
Total Loss:  0.6601 ± 0.0522
State Loss:  0.5749
Action Loss: 0.0852

CROSS-AGENT TRANSFER

Evaluating on Agent 2
Total Loss:  0.7055 ± 0.0702
State Loss:  0.5722
Action Loss: 0.1334
RESULTS

Transfer Gaps:
Total Loss:    +6.9%
State Loss:    -0.5%
Action Loss:  +56.5%
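
The transfer gap reported above is the relative change in average loss when moving from the model's own memories to the other agent's memories. The sketch below recomputes it from the rounded averages printed in the output. `transfer_gap` is an illustrative helper that mirrors the expression in the cell.

```python
def transfer_gap(self_loss: float, transfer_loss: float) -> float:
    """Relative change in loss, in percent, from self data to transfer data."""
    return (transfer_loss - self_loss) / self_loss * 100


# Rounded averages copied from the output above: (Agent 1, Agent 2).
reported = {
    "Total": (0.6601, 0.7055),
    "State": (0.5749, 0.5722),
    "Action": (0.0852, 0.1334),
}

for name, (self_loss, transfer_loss) in reported.items():
    print(f"{name} Loss: {transfer_gap(self_loss, transfer_loss):+.1f}%")
```

Because the inputs here are rounded to four decimals, the action gap comes out as +56.6% rather than the +56.5% computed from the unrounded averages. In this run the state loss is nearly unchanged between agents, while the action loss accounts for almost all of the gap.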
