# Hello World, TurboZero Backgammon 🏁

`turbozero` provides a vectorized implementation of AlphaZero. 

In a nutshell, this means we can massively speed up training by collecting many self-play games and running Monte Carlo Tree Search in parallel across one or more GPUs!

As the user, you just need to provide:
* environment dynamics functions (step and init) that adhere to the TurboZero spec
* a conversion function for environment state -> neural net input
* and a few hyperparameters!

TurboZero takes care of the rest. 😀 

## Getting Started

Follow the instructions in the repo README to install dependencies and set up your environment.

## Environments

In order to take advantage of the batched implementation of AlphaZero, we need to pair it with a vectorized environment.

Fortunately, there are many great vectorized RL environment libraries. One I like in particular is [pgx](https://github.com/sotetsuk/pgx).

In [1]:
import jax
print("Jax Version: ", jax.__version__)
# jax.config.update('jax_platform_name', 'gpu')  # uncomment to force the GPU backend
print("Default backend:", jax.default_backend())

import pgx
import pgx.backgammon as bg

print(jax.__version__)


env = bg.Backgammon(simple_doubles=True)
print(env.simple_doubles)
print(env.num_actions)
print(env.stochastic_action_probs)

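# Create a PRNG key and initialize the environment
key = jax.random.PRNGKey(0)
state = env.init(key)
# Render the initial board as SVG inside the notebook
from IPython.display import HTML, display
display(HTML(state.to_svg()))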





Jax Version:  0.5.3
Default backend: gpu
0.5.3
True
156
[0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]


## Environment Dynamics

TurboZero needs to interface with the environment in order to build search trees and collect self-play episodes.

We can define this interface with the following functions:
* `env_step_fn`: given an environment state and an action, return the new environment state and its metadata
```python
    EnvStepFn = Callable[[chex.ArrayTree, int], Tuple[chex.ArrayTree, StepMetadata]]
```
* `env_init_fn`: given a key, initialize and return a new environment state and its metadata
```python
    EnvInitFn = Callable[[chex.PRNGKey], Tuple[chex.ArrayTree, StepMetadata]]
```
Fortunately, environment libraries implement these for us! We just need to extract a few key pieces of information 
from the environment state so that we can match the TurboZero specification. We store this in a StepMetadata object:

In [2]:
from core.types import StepMetadata
%psource StepMetadata

* `rewards`: the rewards emitted for each player at the given timestep
* `action_mask`: a mask across all possible actions, where legal actions are set to `True` and illegal actions are set to `False`
* `terminated`: `True` if the environment is terminated/completed
* `cur_player_id`: id of the current player
* `step`: step number
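
As a rough picture of what such a container holds, here is a minimal stand-in written with `typing.NamedTuple` and plain Python lists. The name `StepMetadataSketch` and the example values are hypothetical; the real `StepMetadata` in `core.types` holds JAX arrays and may be defined differently.

```python
from typing import List, NamedTuple


class StepMetadataSketch(NamedTuple):
    """Illustrative stand-in for StepMetadata (not the library class)."""
    rewards: List[float]      # one reward per player for this timestep
    action_mask: List[bool]   # True where an action is legal
    terminated: bool          # True once the episode is over
    cur_player_id: int        # id of the player to move
    step: int                 # number of steps taken so far


# Hypothetical two-player game with four actions, two of which are legal.
meta = StepMetadataSketch(
    rewards=[0.0, 0.0],
    action_mask=[True, False, True, False],
    terminated=False,
    cur_player_id=0,
    step=0,
)

# The mask is what lets the search ignore illegal moves.
legal_actions = [a for a, ok in enumerate(meta.action_mask) if ok]
print(legal_actions)  # [0, 2]
```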

We can define the environment interface for `Backgammon` as follows:

In [3]:
import chex
from typing import Tuple

def step_fn(state: bg.State, action: int, key: chex.PRNGKey) -> Tuple[bg.State, StepMetadata]:
    """Combined step function for backgammon environment that handles both deterministic and stochastic actions."""

    # Handle stochastic vs deterministic branches
    def stochastic_branch(operand):
        s, a, _ = operand # state, action, key (key ignored for stochastic step)
        # Use env instance captured by closure (assuming env is accessible in this scope)
        return env.stochastic_step(s, a)

    def deterministic_branch(operand):
        s, a, k = operand # state, action, key
        # Use env instance captured by closure
        return env.step(s, a, k)

    # Use conditional to route to the appropriate branch
    # The key is only needed for the deterministic branch
    new_state = jax.lax.cond(
        state.is_stochastic,
        stochastic_branch,
        deterministic_branch,
        (state, action, key) # Pass all required operands
    )

    # Create standard metadata
    metadata = StepMetadata(
        rewards=new_state.rewards,
        action_mask=new_state.legal_action_mask,
        terminated=new_state.terminated,
        cur_player_id=new_state.current_player,
        step=new_state._step_count
    )

    return new_state, metadata

def init_fn(key):
    """Initializes a new environment state."""
    state = env.init(key)
    # No need to force non-stochastic, let the environment handle it
    return state, StepMetadata(
        rewards=state.rewards,
        action_mask=state.legal_action_mask,
        terminated=state.terminated,
        cur_player_id=state.current_player,
        step=state._step_count
    )

Pretty easy!
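
To make the routing in `step_fn` concrete, here is a toy version in plain Python. Everything in it (`ToyState`, the two step functions, the one-piece "game") is made up for illustration and is unrelated to the pgx implementation. A plain `if` stands in for `jax.lax.cond`, which additionally requires both branches to return values with the same structure.

```python
from typing import NamedTuple


class ToyState(NamedTuple):
    is_stochastic: bool  # True when the next step is a dice roll
    die: int             # last die value (0 before any roll)
    position: int        # position of a single piece


def stochastic_step(state: ToyState, action: int) -> ToyState:
    # At a chance node the "action" indexes an outcome: 0..5 maps to faces 1..6.
    return ToyState(is_stochastic=False, die=action + 1, position=state.position)


def deterministic_step(state: ToyState, action: int) -> ToyState:
    # At a decision node, action 1 moves the piece by the rolled die; action 0 passes.
    new_position = state.position + (state.die if action == 1 else 0)
    return ToyState(is_stochastic=True, die=0, position=new_position)


def toy_step(state: ToyState, action: int) -> ToyState:
    # Both branches return a ToyState, so the caller never needs to know
    # which kind of node it just stepped through.
    if state.is_stochastic:
        return stochastic_step(state, action)
    return deterministic_step(state, action)


state = ToyState(is_stochastic=True, die=0, position=0)
state = toy_step(state, 3)  # chance node: the die shows 4
state = toy_step(state, 1)  # decision node: move by the die
print(state)
```

After these two calls the piece sits on position 4 and the state is back at a chance node, waiting for the next roll.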

## Neural Network

Next, we'll need to define the architecture of the neural network.

A simple implementation of the residual neural network used in the _AlphaZero_ paper is included for your convenience. In this notebook we use a small MLP instead.

You can also implement your own architecture using `flax.linen`.

In [4]:
from core.networks.mlp import MLPConfig, MLP

# Use a small MLP in place of the ResNet
mlp_network = MLP(MLPConfig(
    hidden_dims=[128, 128, 64],  # Adjust layer sizes as needed
    policy_head_out_size=env.num_actions,
    value_head_out_size=1
))


We also need a way to convert our environment's state into something our neural network can take as input (i.e. structured data -> Array). `pgx` conveniently includes this in `state.observation`, but for other environments you may need to perform the conversion yourself.

In [5]:
def state_to_nn_input(state):
    return state.observation
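
For an environment that does not ship an observation, the conversion might look like the following sketch. It uses a hypothetical tic-tac-toe board stored as a flat list; the function name, the encoding, and the board layout are assumptions for illustration, and a real implementation would return a JAX array rather than a Python list.

```python
from typing import List


def board_to_nn_input(board: List[int], current_player: int) -> List[float]:
    """Encode a 3x3 board (0 = empty, 1 / 2 = player marks) as 18 floats.

    The first 9 entries mark the current player's pieces and the last 9 mark
    the opponent's, so the network always sees the position from the mover's side.
    """
    opponent = 2 if current_player == 1 else 1
    mine = [1.0 if cell == current_player else 0.0 for cell in board]
    theirs = [1.0 if cell == opponent else 0.0 for cell in board]
    return mine + theirs


board = [1, 0, 2,
         0, 1, 0,
         0, 0, 2]
features = board_to_nn_input(board, current_player=1)
print(len(features), sum(features))  # 18 4.0
```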

## Evaluator

Next, we can initialize our evaluator. This notebook uses `StochasticMCTS`, an AlphaZero-style MCTS evaluator, which takes the following parameters:

* `eval_fn`: function used to evaluate a leaf node (returns a policy and value)
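* `num_iterations`: number of MCTS iterations to run before returning the final policy
* `max_nodes`: maximum capacity of the search tree
* `branching_factor`: branching factor of the search tree (equal to the policy size)
* `stochastic_action_probs`: probabilities of the chance outcomes (here, the dice rolls reported by `env.stochastic_action_probs`); this one is specific to the `StochasticMCTS` evaluator used in this notebook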
* `action_selector`: the algorithm used to select an action to take at any given search node, choose between:
    * `PUCTSelector`: AlphaZero action selection algorithm
    * `MuZeroPUCTSelector`: MuZero action selection algorithm
    * or write your own! :)
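
For intuition, the selection rule from the AlphaZero paper can be sketched in a few lines of plain Python. This is the textbook formula, not necessarily what `PUCTSelector` computes internally; the constant name `c_puct` and its value are assumptions.

```python
import math
from typing import List


def puct_scores(q: List[float], prior: List[float], visits: List[int],
                c_puct: float = 1.0) -> List[float]:
    """Score each action as Q + c_puct * P * sqrt(total visits) / (1 + N)."""
    total_visits = sum(visits)
    return [
        q_a + c_puct * p_a * math.sqrt(total_visits) / (1 + n_a)
        for q_a, p_a, n_a in zip(q, prior, visits)
    ]


# Three actions: one well explored, one visited once, one never visited.
q = [0.5, 0.0, 0.0]
prior = [0.2, 0.5, 0.3]
visits = [8, 1, 0]

scores = puct_scores(q, prior, visits)
best_action = max(range(len(scores)), key=lambda a: scores[a])
print(best_action)  # 2
```

The unvisited action wins here even though its value estimate is zero: the exploration term shrinks as an action accumulates visits, which is what pushes the search toward moves it has not tried yet.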

There are also a few other optional parameters. Some of the important ones are:
* `temperature`: temperature applied to move probabilities prior to sampling (0.0 == argmax, ->inf == completely random sampling). I recommend setting this to 1.0 for training (default) and 0.0 for evaluation.
* `dirichlet_alpha`: magnitude of Dirichlet noise to add to root policy (default 0.3). Generally, the more actions are possible in a game, the smaller this value should be. 
* `dirichlet_epsilon`: proportion of root policy composed of Dirichlet noise (default 0.25)
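
The effect of these three parameters can be sketched in plain Python. The snippet assumes the usual AlphaZero conventions (temperature applied to root visit counts, noise mixed as `(1 - epsilon) * policy + epsilon * noise`); the function names are hypothetical and TurboZero's internals may differ in detail.

```python
import random
from typing import List


def apply_temperature(visit_counts: List[int], temperature: float) -> List[float]:
    """Turn visit counts into move probabilities."""
    if temperature == 0.0:
        # Argmax: all probability mass goes to the most-visited action.
        best = max(range(len(visit_counts)), key=lambda a: visit_counts[a])
        return [1.0 if a == best else 0.0 for a in range(len(visit_counts))]
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


def add_dirichlet_noise(policy: List[float], alpha: float, epsilon: float,
                        rng: random.Random) -> List[float]:
    """Mix Dirichlet(alpha, ..., alpha) noise into a root policy."""
    # A Dirichlet sample is a set of independent Gamma(alpha, 1) draws, normalized.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in policy]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1.0 - epsilon) * p + epsilon * n for p, n in zip(policy, noise)]


visit_counts = [10, 30, 60]
training_policy = apply_temperature(visit_counts, temperature=1.0)
greedy_policy = apply_temperature(visit_counts, temperature=0.0)
print(greedy_policy)  # [0.0, 0.0, 1.0]

rng = random.Random(0)
noisy_root = add_dirichlet_noise(training_policy, alpha=0.3, epsilon=0.25, rng=rng)
print(sum(noisy_root))  # still sums to 1 (up to floating-point error)
```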


We use `make_nn_eval_fn` to create a leaf evaluation function that uses our neural network to generate a policy and a value for the given state. 

In [6]:

from core.evaluators.evaluation_fns import make_nn_eval_fn
from core.evaluators.mcts.action_selection import PUCTSelector
from core.evaluators.mcts.stochastic_mcts import StochasticMCTS
import jax.numpy as jnp

# Training evaluator: StochasticMCTS using NN
evaluator = StochasticMCTS(  # temperature=1.0 keeps self-play exploratory
    eval_fn=make_nn_eval_fn(mlp_network, state_to_nn_input),
    stochastic_action_probs=env.stochastic_action_probs,
    num_iterations=200,  
    max_nodes=300,      
    branching_factor=env.num_actions,
    action_selector=PUCTSelector(),
    temperature=1.0,
)

We also define a separate evaluator to use for testing. In this example it keeps the same search budget as the training evaluator (`num_iterations=200`), though you may want to give it a larger one, and it sets the temperature to zero so it always chooses the most-visited action after search is complete.

In [7]:

evaluator_test = StochasticMCTS(  # greedy move selection (temperature=0.0)
    eval_fn=make_nn_eval_fn(mlp_network, state_to_nn_input),
    stochastic_action_probs=env.stochastic_action_probs,
    num_iterations=200,  # same search budget as the training evaluator
    max_nodes=300,       # same tree capacity as the training evaluator
    branching_factor=env.num_actions,
    action_selector=PUCTSelector(),
    temperature=0.0,
)

We can use similar ideas to write a greedy baseline evaluation function, one that doesn't use a neural network at all!

Instead, it compares pip counts, i.e. the total number of points each player's checkers still have to travel. The value is higher for states where the active player's pip count is lower than the other player's.

Using similar techniques as before, we can create another `StochasticMCTS` evaluator to test against.
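
As a plain-Python illustration of the pip-count idea used by `backgammon_pip_count_eval` in the next cell, here is a worked example on a hand-made position. The signed-list board layout (positive counts for player 0, negative for player 1) mirrors what that cell assumes and is not a statement about pgx's internal representation; checkers on the bar are left out for brevity.

```python
from typing import List, Tuple


def pip_counts(points: List[int]) -> Tuple[int, int]:
    """points[i] is the signed checker count on point i + 1 (24 points).

    Positive entries belong to player 0, negative entries to player 1.
    """
    pips_p0 = sum(max(0, n) * (i + 1) for i, n in enumerate(points))
    pips_p1 = sum(max(0, -n) * (25 - (i + 1)) for i, n in enumerate(points))
    return pips_p0, pips_p1


points = [0] * 24
points[5] = 2     # two player-0 checkers on point 6  -> 2 * 6 = 12 pips
points[12] = 1    # one player-0 checker on point 13  -> 13 pips
points[18] = -3   # three player-1 checkers on point 19 -> 3 * (25 - 19) = 18 pips

p0, p1 = pip_counts(points)
print(p0, p1)  # 25 18

# Normalized difference from player 0's point of view, as in the eval function.
value_p0 = (p1 - p0) / (p0 + p1)
print(value_p0 < 0)  # True: player 0 has further to go, so the position is worse
```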

In [8]:
from core.evaluators.evaluation_fns import make_nn_eval_fn_no_params_callable
import chex

# --- Pip Count Eval Fn (for test evaluator) ---
def backgammon_pip_count_eval(state: chex.ArrayTree, params: chex.ArrayTree, key: chex.PRNGKey):
    """Calculates value based on pip count difference. Ignores params/key."""
    board = state._board
    loc_player_0 = jnp.maximum(0, board[1:25])
    loc_player_1 = jnp.maximum(0, -board[1:25])
    points = jnp.arange(1, 25)
    pip_player_0 = jnp.sum(loc_player_0 * points)
    pip_player_1 = jnp.sum(loc_player_1 * (25 - points))
    pip_player_0 += jnp.maximum(0, board[0]) * 25
    pip_player_1 += jnp.maximum(0, -board[25]) * 25
    total_pips = pip_player_0 + pip_player_1 + 1e-6
    value_p0_perspective = (pip_player_1 - pip_player_0) / total_pips
    value = jnp.where(state.current_player == 0, value_p0_perspective, -value_p0_perspective)
    # Uniform policy over legal actions for greedy baseline
    policy_logits = jnp.where(state.legal_action_mask, 0.0, -jnp.inf)
    return policy_logits, jnp.array(value)


# Baseline evaluator: StochasticMCTS using the pip-count eval fn (no neural network)
pip_count_mcts_evaluator_test = StochasticMCTS(
    eval_fn=backgammon_pip_count_eval,  # pip-count evaluation instead of a network
    stochastic_action_probs=env.stochastic_action_probs,
    num_iterations=30,  # much smaller search budget than the NN evaluators
    max_nodes=100,
    branching_factor=env.num_actions,
    action_selector=PUCTSelector(),
    temperature=0.0  # deterministic action selection for testing
)

## Replay Memory Buffer

Next, we'll initialize a replay memory buffer to hold self-play trajectories that we can sample from during training. This object only defines an interface; the buffer state itself is initialized and managed internally.

The replay buffer is batched: it retains a buffer of trajectories across a batch dimension. We specify a `capacity`, the number of samples stored in a single buffer. The total capacity of the entire replay buffer is then `batch_size * capacity`, where `batch_size` is the number of environments/self-play games being run in parallel.
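
To put numbers on this, the arithmetic below uses the `capacity` chosen in the next cell and the `batch_size=128` passed to the trainer later in this notebook. The fullness calculation assumes the logged `buffer/fullness_pct` metric is simply size divided by total capacity, which agrees with the values printed in the training log.

```python
batch_size = 128   # parallel self-play environments (trainer setting used later)
capacity = 1000    # samples stored per environment

total_capacity = batch_size * capacity
print(total_capacity)  # 128000

# Fullness for a buffer holding 50,970 samples (the size logged after epoch 0).
held = 50_970
fullness_pct = 100.0 * held / total_capacity
print(round(fullness_pct, 4))  # 39.8203
```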

In [9]:
from core.memory.replay_memory import EpisodeReplayBuffer

replay_memory = EpisodeReplayBuffer(capacity=1000)

## Rendering
We can optionally provide a `render_fn` that records games played by our model against one of the baselines and saves them as `.gif` files.

I've included a helper function that takes care of this. It depends on CairoSVG, which in turn depends on `cairo`, which you'll need to install on your system.

On Ubuntu, this can be done with:

In [10]:
! apt-get update && apt-get -y install libcairo2-dev



If you're on another OS, consult https://www.cairographics.org/download/ for installation guidelines.

In [11]:
from functools import partial
from core.testing.utils import render_pgx_2p
render_fn = partial(render_pgx_2p, p1_label='Black', p2_label='White', duration=900)


## Trainer Initialization
Now that we have all the proper pieces defined, we are ready to initialize a Trainer and start training!

The `Trainer` takes many parameters, so let's walk through them all:
* `batch_size`: # of parallel environments used to collect self-play games
* `train_batch_size`: size of minibatch used during training step
* `warmup_steps`: # of steps (per batch) to collect via self-play prior to entering the training loop. This is used to populate the replay memory with some initial samples
* `collection_steps_per_epoch`: # of steps (per batch) to collect via self-play per epoch
* `train_steps_per_epoch`: # of train steps per epoch
* `nn`: neural network (`linen.Module`)
* `loss_fn`: loss function used for training, we use a provided default loss which implements the loss function used in the `AlphaZero` paper
* `optimizer`: an `optax` optimizer used for training
* `evaluator`: the `Evaluator` to use during self-play; we initialized ours using `StochasticMCTS`
* `memory_buffer`: the memory buffer used to store samples from self-play games; we initialized ours using `EpisodeReplayBuffer`
* `max_episode_steps`: maximum number of steps/turns to allow before truncating an episode
* `env_step_fn`: environment step function (we defined ours above)
* `env_init_fn`: environment init function (we defined ours above)
* `state_to_nn_input_fn`: function to convert environment state to nn input (we defined ours above)
* `testers`: any number of `Tester`s, each of which evaluates a given model and takes its own parameters. We'll use the pip-count evaluator defined above to initialize a single `TwoPlayerBaseline` tester.
* `evaluator_test`: (Optional) Evaluator used within Testers. Defaults to `evaluator`, but you may want to test with, for example, a larger MCTS iteration budget or a different move sampling temperature
* `data_transform_fns`: (optional) list of data transform functions to apply to self-play experiences (e.g. rotation, reflection, etc.)
* `extract_model_params_fn`: (Optional) in special cases we need to define how to extract all model parameters from a flax `TrainState`. The default function handles BatchNorm, but if another special-case technique applied across batches is used (e.g. Dropout) we would need to define a function to extract the appropriate parameters. You usually won't need to define this!
* `wandb_project_name`: (Optional) Weights and Biases project name. You will be prompted to login if a name is provided. If a name is provided, a run will be initialized and loss and other metrics will be logged to the given wandb project.
* `ckpt_dir`: (Optional) directory to store checkpoints in, by default this is set to `/tmp/turbozero_checkpoints`
* `max_checkpoints`: (Optional) maximum number of most-recent checkpoints to retain (default: 2)
* `num_devices`: (Optional) number of hardware accelerators (GPUs/TPUs) to use. If not given, all available hardware accelerators are used
* `wandb_run`: (Optional) continues from an initialized `wandb` run if provided, otherwise a new one is initialized
* `extra_wandb_config`: (Optional) any extra metadata to store in the `wandb` run config

A training epoch consists of M collection steps, followed by N training steps that sample minibatches from replay memory. Optionally, any number of Testers then evaluate the current model. At the end of each epoch, a checkpoint is saved.
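
The shape of that loop can be sketched with stub callbacks. All names here (`run_epochs` and its arguments) are hypothetical, and the rule "test when `epoch % eval_every == 0`" is an assumption about how `eval_every` behaves; the real trainer is batched and JIT-compiled rather than a Python loop like this one.

```python
from typing import Callable, Dict, List


def run_epochs(num_epochs: int, collection_steps: int, train_steps: int,
               eval_every: int,
               collect: Callable[[], None],
               train: Callable[[], None],
               test: Callable[[int], None],
               save_checkpoint: Callable[[int], None]) -> None:
    for epoch in range(num_epochs):
        for _ in range(collection_steps):  # M self-play collection steps
            collect()
        for _ in range(train_steps):       # N training steps on sampled minibatches
            train()
        if epoch % eval_every == 0:        # optional evaluation by the testers
            test(epoch)
        save_checkpoint(epoch)             # checkpoint at the end of every epoch


counts: Dict[str, int] = {"collect": 0, "train": 0}
log: List[str] = []


def _collect() -> None:
    counts["collect"] += 1


def _train() -> None:
    counts["train"] += 1


run_epochs(
    num_epochs=3, collection_steps=4, train_steps=2, eval_every=2,
    collect=_collect,
    train=_train,
    test=lambda epoch: log.append(f"test {epoch}"),
    save_checkpoint=lambda epoch: log.append(f"ckpt {epoch}"),
)
print(counts)  # {'collect': 12, 'train': 6}
print(log)     # ['test 0', 'ckpt 0', 'ckpt 1', 'test 2', 'ckpt 2']
```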

If you are using one or more GPUs (recommended), TurboZero will run on all your available hardware by default.

In [12]:
from functools import partial
from core.testing.two_player_baseline import TwoPlayerBaseline
from core.training.loss_fns import az_default_loss_fn
from core.training.stochastic_train import StochasticTrainer
import optax

trainer = StochasticTrainer(
    batch_size=128,                  # parallel self-play environments
    train_batch_size=50,             # minibatch size per training step
    warmup_steps=0,                  # no warmup collection before training
    collection_steps_per_epoch=512,  # self-play steps (per environment) per epoch
    train_steps_per_epoch=512,       # training steps per epoch
    nn=mlp_network,
    loss_fn=partial(az_default_loss_fn, l2_reg_lambda=0.0),
    optimizer=optax.adam(1e-4),
    # Use the stochastic evaluator for self-play
    evaluator=evaluator,
    memory_buffer=replay_memory,
    max_episode_steps=500,           # truncate episodes after 500 steps
    env_step_fn=step_fn,
    env_init_fn=init_fn,
    state_to_nn_input_fn=state_to_nn_input,
    ckpt_dir="/tmp/ckpts",
    testers=[
        # Play against the pip-count MCTS baseline
        TwoPlayerBaseline(
            num_episodes=20,
            baseline_evaluator=pip_count_mcts_evaluator_test,
            # render_fn=render_fn,
            # render_dir='training_eval/pip_count_baseline',
            name='pip_count_baseline'
        )
    ],
    # Use the temperature-0 NN evaluator inside testers
    evaluator_test=evaluator_test,
    data_transform_fns=[],  # no data transforms
    wandb_project_name=None  # set a project name to enable wandb logging
)



## Training

Now all that's left to do is to kick off the training loop! We need to pass an initial seed for reproducibility, and the number of epochs to run for!

If you've set up `wandb`, you can track metrics via plots in the run dashboard. Metrics will also be printed to the console. 

IMPORTANT: The first epoch will run slowly. Nearly all of the training loop is JIT-compiled, and JAX traces and compiles each function the first time it runs, so much of the first epoch is compilation overhead. GPU utilization will typically be low or zero during this period. Expect later epochs to run much faster.

It's also worth mentioning that the hyperparameters in this notebook are just here for example purposes. Regardless of the task, they will need to be tuned according to the characteristics of the environment as well as your available hardware and time/cost constraints.
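
As a rough sense of scale for the example settings, the arithmetic below works out how much data one epoch collects and trains on. The collection time is the `perf/collect_time_sec` value logged for epoch 0 in the output that follows; it is specific to that run and hardware.

```python
batch_size = 128
collection_steps_per_epoch = 512
train_batch_size = 50
train_steps_per_epoch = 512

# Every collection step advances all parallel environments by one game step.
game_steps_collected = batch_size * collection_steps_per_epoch
# Every training step consumes one minibatch sampled from replay memory.
samples_trained_on = train_batch_size * train_steps_per_epoch
print(game_steps_collected, samples_trained_on)  # 65536 25600

collect_time_sec = 311.6890  # logged for epoch 0 of the example run
game_steps_per_sec = game_steps_collected / collect_time_sec
print(round(game_steps_per_sec, 2))  # 210.26
```

In that run, collection took roughly 20 times longer than training per epoch, so the search settings (`num_iterations`, `max_nodes`) and `batch_size` are the first places to look when trading speed against search quality.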

In [None]:
output = trainer.train_loop(seed=0, num_epochs=150, eval_every=5)

Epoch 0: {'l2_reg': '0.0000', 'loss': '2.2401', 'policy_entropy': '0.9424', 'policy_loss': '0.9889', 'value_loss': '1.2512', 'buffer/size': '50970.0000', 'buffer/valid_samples': '118200.0000', 'buffer/fullness_pct': '39.8203', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '311.6890', 'perf/train_time_sec': '21.3884', 'perf/collect_steps_per_sec': '1.6427', 'perf/collect_game_steps_per_sec': '210.2609', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 0: {'pip_count_baseline_avg_outcome': '0.6500'}
Epoch 1: {'l2_reg': '0.0000', 'loss': '2.1494', 'policy_entropy': '0.9680', 'policy_loss': '1.0241', 'value_loss': '1.1253', 'buffer/size': '102050.0000', 'buffer/valid_samples': '118450.0000', 'buffer/fullness_pct': '79.7266', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '307.0906', 'perf/train_time_sec': '14.6771', 'perf/collect_

Epoch 15: {'l2_reg': '0.0000', 'loss': '1.8232', 'policy_entropy': '0.9831', 'policy_loss': '0.9926', 'value_loss': '0.8306', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118190.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '312.1035', 'perf/train_time_sec': '14.5774', 'perf/collect_steps_per_sec': '1.6405', 'perf/collect_game_steps_per_sec': '209.9816', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 15: {'pip_count_baseline_avg_outcome': '-0.4000'}
Epoch 16: {'l2_reg': '0.0000', 'loss': '1.8789', 'policy_entropy': '0.9840', 'policy_loss': '0.9953', 'value_loss': '0.8836', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118201.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '309.8896', 'perf/train_time_sec': '14.7018', 'perf/c

Epoch 30: {'l2_reg': '0.0000', 'loss': '1.8467', 'policy_entropy': '0.9640', 'policy_loss': '0.9724', 'value_loss': '0.8743', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118194.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '311.4151', 'perf/train_time_sec': '14.4230', 'perf/collect_steps_per_sec': '1.6441', 'perf/collect_game_steps_per_sec': '210.4458', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 30: {'pip_count_baseline_avg_outcome': '-0.3000'}
Epoch 31: {'l2_reg': '0.0000', 'loss': '1.8478', 'policy_entropy': '0.9720', 'policy_loss': '0.9798', 'value_loss': '0.8679', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118957.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '315.2819', 'perf/train_time_sec': '14.6543', 'perf/c

Epoch 45: {'l2_reg': '0.0000', 'loss': '1.8106', 'policy_entropy': '0.9595', 'policy_loss': '0.9642', 'value_loss': '0.8464', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118550.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '314.7660', 'perf/train_time_sec': '14.6716', 'perf/collect_steps_per_sec': '1.6266', 'perf/collect_game_steps_per_sec': '208.2054', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 45: {'pip_count_baseline_avg_outcome': '0.2000'}
Epoch 46: {'l2_reg': '0.0000', 'loss': '1.7820', 'policy_entropy': '0.9554', 'policy_loss': '0.9604', 'value_loss': '0.8216', 'buffer/size': '128000.0000', 'buffer/valid_samples': '119261.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '321.8299', 'perf/train_time_sec': '14.7894', 'perf/co

Epoch 60: {'l2_reg': '0.0000', 'loss': '1.7532', 'policy_entropy': '0.9649', 'policy_loss': '0.9694', 'value_loss': '0.7838', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118267.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '312.3078', 'perf/train_time_sec': '14.5276', 'perf/collect_steps_per_sec': '1.6394', 'perf/collect_game_steps_per_sec': '209.8443', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 60: {'pip_count_baseline_avg_outcome': '-0.3500'}
Epoch 61: {'l2_reg': '0.0000', 'loss': '1.7514', 'policy_entropy': '0.9640', 'policy_loss': '0.9663', 'value_loss': '0.7852', 'buffer/size': '128000.0000', 'buffer/valid_samples': '117405.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '311.1055', 'perf/train_time_sec': '14.4274', 'perf/c

Epoch 75: {'l2_reg': '0.0000', 'loss': '1.7975', 'policy_entropy': '0.9680', 'policy_loss': '0.9709', 'value_loss': '0.8266', 'buffer/size': '128000.0000', 'buffer/valid_samples': '119041.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '315.0400', 'perf/train_time_sec': '14.7482', 'perf/collect_steps_per_sec': '1.6252', 'perf/collect_game_steps_per_sec': '208.0244', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 75: {'pip_count_baseline_avg_outcome': '0.7000'}
Epoch 76: {'l2_reg': '0.0000', 'loss': '1.7200', 'policy_entropy': '0.9585', 'policy_loss': '0.9624', 'value_loss': '0.7576', 'buffer/size': '128000.0000', 'buffer/valid_samples': '117907.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '313.8790', 'perf/train_time_sec': '14.6841', 'perf/co

Epoch 90: {'l2_reg': '0.0000', 'loss': '1.7531', 'policy_entropy': '0.9552', 'policy_loss': '0.9558', 'value_loss': '0.7973', 'buffer/size': '128000.0000', 'buffer/valid_samples': '117804.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '316.4263', 'perf/train_time_sec': '14.5600', 'perf/collect_steps_per_sec': '1.6181', 'perf/collect_game_steps_per_sec': '207.1130', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 90: {'pip_count_baseline_avg_outcome': '-0.0500'}
Epoch 91: {'l2_reg': '0.0000', 'loss': '1.7657', 'policy_entropy': '0.9474', 'policy_loss': '0.9510', 'value_loss': '0.8147', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118645.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '311.7017', 'perf/train_time_sec': '14.6374', 'perf/c

Epoch 105: {'l2_reg': '0.0000', 'loss': '1.6651', 'policy_entropy': '0.9454', 'policy_loss': '0.9445', 'value_loss': '0.7206', 'buffer/size': '128000.0000', 'buffer/valid_samples': '117637.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '319.6617', 'perf/train_time_sec': '14.8104', 'perf/collect_steps_per_sec': '1.6017', 'perf/collect_game_steps_per_sec': '205.0168', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 105: {'pip_count_baseline_avg_outcome': '0.7500'}
Epoch 106: {'l2_reg': '0.0000', 'loss': '1.6713', 'policy_entropy': '0.9317', 'policy_loss': '0.9293', 'value_loss': '0.7420', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118603.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '314.9574', 'perf/train_time_sec': '14.8559', 'perf

Epoch 120: {'l2_reg': '0.0000', 'loss': '1.6774', 'policy_entropy': '0.9352', 'policy_loss': '0.9316', 'value_loss': '0.7457', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118049.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '311.7599', 'perf/train_time_sec': '14.6518', 'perf/collect_steps_per_sec': '1.6423', 'perf/collect_game_steps_per_sec': '210.2131', 'game/terminated_count': '0.0000', 'game/terminated_pct': '0.0000'}
Epoch 120: {'pip_count_baseline_avg_outcome': '0.0500'}
Epoch 121: {'l2_reg': '0.0000', 'loss': '1.6731', 'policy_entropy': '0.9316', 'policy_loss': '0.9274', 'value_loss': '0.7457', 'buffer/size': '128000.0000', 'buffer/valid_samples': '118535.0000', 'buffer/fullness_pct': '100.0000', 'perf/train_steps_per_sec': '512000000.0000', 'perf/samples_per_sec': '25600000000.0000', 'perf/collect_time_sec': '317.7277', 'perf/train_time_sec': '14.6373', 'perf

Any GIFs generated will appear in the same directory as this notebook, and also on your `wandb` dashboard.