In [None]:
import sys
from pathlib import Path
sys.path.append(str(Path("..").resolve()))

# RBC AlphaZero-like Bot

Reconnaissance Blind Chess (RBC) is a partially observable variant of chess in which players have perfect information about their own pieces but only limited observations of the opponent.

This notebook describes a learning-based RBC agent inspired by the AlphaZero framework, combining neural network evaluation, search, and self-play training. The emphasis is on a clear and rule-compliant implementation that can be trained and evaluated end to end.

## DEPENDENCIES


This section lists the external libraries required to run the notebook.

The implementation relies on standard numerical and deep learning tools for tensor computation and optimization, together with a chess engine library and the ReconChess framework to ensure correct handling of game rules and interaction between agents.

These dependencies provide the basic infrastructure for representing game states, running self-play matches, training neural networks, and evaluating the resulting agent.

In [None]:
%pip -q install python-chess reconchess




Install required dependencies.
 - python-chess: standard chess representation and move generation
 - reconchess: official framework for Reconnaissance Blind Chess,
   enabling rule-compliant gameplay against other bots


In [None]:
import os
import math
import random
from dataclasses import dataclass
from typing import Dict, Tuple, List, Optional, Any
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import chess
import reconchess
from reconchess import Player, Color, Square
import csv
import datetime

## REPRODUCIBILITY

To make runs repeatable, this section fixes the random seeds through `set_seeds` and prints the compute device selected in `src.config`.

In [None]:
import sys
from pathlib import Path
sys.path.append(str(Path("..").resolve()))

from src.utils import set_seeds
from src.config import DEVICE

set_seeds(0)
print("DEVICE:", DEVICE)


## ACTION ENCODING (20480 fixed policy head)


In order to use a fixed-size policy head, all possible chess moves are mapped to a discrete action space of fixed dimensionality.

Each move is encoded as an index in a predefined action set of size 20,480, covering all standard chess moves, including promotions. This encoding allows the policy network to produce a fixed-length output independent of the current position.

During play, only the subset of actions corresponding to legal moves provided by the game environment is considered, while the remaining entries are masked implicitly.
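As an illustration of how a fixed 20,480-entry action space can arise, the sketch below assumes the layout 64 origin squares × 64 destination squares × 5 promotion options (none, knight, bishop, rook, queen). This factorization and the ordering of `PROMO_TO_ID` are assumptions made for the example; the actual `src.encoding` module may arrange the index differently.

```python
# Hypothetical layout: index = (from_square * 64 + to_square) * 5 + promo_id.
# The real src.encoding module may order the factors differently.
NUM_SQUARES = 64
PROMO_TO_ID = {None: 0, "n": 1, "b": 2, "r": 3, "q": 4}  # assumed ordering
ID_TO_PROMO = {v: k for k, v in PROMO_TO_ID.items()}
POLICY_SIZE = NUM_SQUARES * NUM_SQUARES * len(PROMO_TO_ID)  # 64 * 64 * 5


def move_to_index(from_sq, to_sq, promo=None):
    """Map (from square, to square, promotion) to a flat action index."""
    return (from_sq * NUM_SQUARES + to_sq) * len(PROMO_TO_ID) + PROMO_TO_ID[promo]


def index_to_move(index):
    """Inverse of move_to_index."""
    pair, promo_id = divmod(index, len(PROMO_TO_ID))
    from_sq, to_sq = divmod(pair, NUM_SQUARES)
    return from_sq, to_sq, ID_TO_PROMO[promo_id]
```

Under this layout every (origin, destination, promotion) triple has exactly one index, so the mapping can be inverted without a lookup table.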


In [None]:
from src.encoding import (
    PROMO_TO_ID,
    ID_TO_PROMO,
    POLICY_SIZE,
    move_to_index,
    index_to_move,   # if it exists in your block
)


## BELIEF TENSOR (7 channels) + SENSE UPDATE


Uncertainty about the opponent’s pieces is represented through a per-square belief tensor with seven channels, corresponding to the six standard chess piece types plus an explicit EMPTY channel.

For each board square, the belief tensor stores a probability distribution over these channels, normalized independently per square. This representation makes uncertainty explicit while remaining simple and easy to inspect.

Sensing actions update the belief deterministically within the sensed 3×3 region: observed squares are set to the corresponding piece type or to EMPTY, while beliefs outside the sensed area remain unchanged. This local update rule provides a lightweight mechanism to incorporate new information without maintaining a full probabilistic game history.
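The local update rule can be sketched as follows. This is a simplified stand-in for `apply_sense_to_belief`, not its actual code: the belief is a plain list of 64 rows, the channel order (0 to 5 for piece types, 6 for EMPTY) is assumed, and the sense result is assumed to be already converted to (square, channel) pairs.

```python
EMPTY = 6          # assumed channel order: 0..5 = piece types, 6 = EMPTY
NUM_CHANNELS = 7


def uniform_belief():
    """64 squares, each with a uniform distribution over the 7 channels."""
    return [[1.0 / NUM_CHANNELS] * NUM_CHANNELS for _ in range(64)]


def apply_sense(belief, sense_result):
    """Overwrite sensed squares with one-hot rows; leave all others untouched.

    sense_result: list of (square, channel) pairs, where channel is EMPTY
    when the square holds no opponent piece.
    """
    for square, channel in sense_result:
        row = [0.0] * NUM_CHANNELS
        row[channel] = 1.0
        belief[square] = row
    return belief
```

Because each sensed row becomes one-hot, it stays normalized without a separate renormalization pass.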

In [None]:
from src.belief import (
    normalize_over_channels,
    init_belief_from_initial,
    apply_sense_to_belief,
)


## SENSE SELECTION: entropy-max 3×3 center


At each turn, the agent selects a sensing action by evaluating the uncertainty of the opponent’s belief distribution.

For each allowed sensing square, the total entropy over the corresponding 3×3 region is computed, and the square that maximizes this value is selected. This heuristic prioritizes sensing actions that are expected to provide the largest reduction in uncertainty.

The approach is purely information-driven and independent of the immediate move selection, making it simple, efficient, and consistent with the belief representation.
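A minimal sketch of the entropy heuristic, assuming the belief is a list of 64 per-square probability rows and that squares are numbered rank-major from 0 to 63. The function names are illustrative and are not taken from `src.sense`.

```python
import math


def square_entropy(probs):
    """Shannon entropy (natural log) of one square's channel distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_squares(center):
    """Squares of the 3x3 window centred on `center`, clipped to the board."""
    rank, file = divmod(center, 8)
    return [r * 8 + f
            for r in range(rank - 1, rank + 2)
            for f in range(file - 1, file + 2)
            if 0 <= r < 8 and 0 <= f < 8]


def choose_sense(belief, sense_actions):
    """Return the allowed centre whose 3x3 window has the highest total entropy."""
    return max(
        sense_actions,
        key=lambda c: sum(square_entropy(belief[s]) for s in window_squares(c)),
    )
```

Ties are resolved in favour of the first candidate in `sense_actions`, which is simply the behaviour of Python's `max`.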

In [None]:
from src.sense import *


## GREEDY DETERMINIZATION FROM BELIEF + remaining opponent inventory


To enable fast planning with standard chess move generation, the opponent’s hidden position is approximated by constructing a single fully specified “determinized” board state from the belief tensor.

For each opponent piece type, the algorithm places the remaining pieces on the highest-probability squares according to the belief distribution, while respecting already occupied squares (including all known own pieces). A simple opponent inventory is maintained to ensure that the determinized position contains a consistent number of pieces of each type.

The resulting determinized board is used only as a hypothesis for search and evaluation; it provides a concrete state on which legal moves can be checked and simulated efficiently.
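The greedy placement can be sketched as below. The data layout (belief rows indexed by piece-type channel, an inventory dictionary, a set of occupied squares) and the order in which piece types are processed are assumptions for illustration; `src.determinize` may differ.

```python
def greedy_determinize(belief, inventory, occupied):
    """Place each remaining opponent piece on its most probable free square.

    belief:    64 rows of per-channel probabilities (channel = piece type id)
    inventory: {channel: number of opponent pieces of that type still alive}
    occupied:  squares already taken, e.g. by the agent's own pieces
    Returns {square: channel} describing the hypothesised opponent position.
    """
    taken = set(occupied)
    placement = {}
    for channel, count in inventory.items():
        # Rank the free squares by the belief mass assigned to this piece type.
        candidates = sorted(
            (s for s in range(64) if s not in taken),
            key=lambda s: belief[s][channel],
            reverse=True,
        )
        for square in candidates[:count]:
            placement[square] = channel
            taken.add(square)
    return placement
```

Processing piece types one after another means an earlier type can claim a square that a later type also rates highly, which is the price of keeping the procedure greedy and fast.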

In [None]:
from src.determinize import *


## ENCODER (own pieces + belief + small metadata) → 15×8×8


The neural network input is a stack of 2D feature planes with fixed spatial resolution (8×8), producing a tensor of shape 15×8×8.

The encoding includes: (i) six binary planes for the agent’s own pieces (one per piece type), (ii) the seven-channel opponent belief tensor, and (iii) a small set of global metadata planes (side to move and a normalized move counter).

This representation keeps the input compact while preserving the spatial structure of the board, allowing convolutional layers to exploit local patterns and piece configurations.
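The plane layout (6 + 7 + 2 = 15) can be sketched with nested lists as follows. The plane order, the side-to-move convention, and the `max_moves` normalizer are assumptions for this example; the real encoder in `src.encoder` presumably returns a tensor rather than lists.

```python
def encode(own_pieces, belief, white_to_move, move_number, max_moves=200):
    """Build a 15 x 8 x 8 nested list of feature planes.

    Planes 0-5:  own pieces, one binary plane per piece type
    Planes 6-12: opponent belief, one plane per channel
    Plane 13:    side to move (1.0 for white, assumed convention)
    Plane 14:    move counter scaled to [0, 1] (max_moves is an assumed constant)

    own_pieces: {square: piece type id in 0..5}
    belief:     64 rows of 7 probabilities
    """
    planes = [[[0.0] * 8 for _ in range(8)] for _ in range(15)]
    for square, piece_type in own_pieces.items():
        planes[piece_type][square // 8][square % 8] = 1.0
    for square, probs in enumerate(belief):
        for channel, p in enumerate(probs):
            planes[6 + channel][square // 8][square % 8] = p
    side = 1.0 if white_to_move else 0.0
    progress = min(move_number / max_moves, 1.0)
    for r in range(8):
        for f in range(8):
            planes[13][r][f] = side
            planes[14][r][f] = progress
    return planes
```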

In [None]:
from src.encoder import *


## SMALL POLICY/VALUE NET


The agent uses a lightweight convolutional neural network with a shared trunk and two output heads: a policy head and a value head.

The policy head produces logits over the fixed 20,480-action encoding, which are later restricted to the legal moves available in the current position. The value head outputs a single scalar estimating the expected game outcome from the current player’s perspective.

The network is intentionally small to keep self-play and training fast while still capturing the spatial structure of the board representation.
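Restricting the policy output to legal moves amounts to a softmax over the legal entries only. The helper below is a hypothetical, framework-free illustration of that step; it is not part of `FastPolicyValueNet`.

```python
import math


def legal_policy(logits, legal_indices):
    """Softmax over the logits of legal actions only; other entries are ignored."""
    # Subtract the largest legal logit for numerical stability.
    m = max(logits[i] for i in legal_indices)
    exps = {i: math.exp(logits[i] - m) for i in legal_indices}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}
```

Illegal actions never enter the normalization, so they receive no probability mass regardless of their logits.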

In [None]:
from src.model import FastPolicyValueNet


## ROOT-only PUCT (search on determinization, choose among provided legal actions)


Move selection is performed using a lightweight, root-only PUCT search guided by the network’s policy priors.

The search is run on the determinized board hypothesis and considers only the move actions provided by the ReconChess environment for the current turn. Each simulation selects the move that maximizes a PUCT score combining an exploitation term (estimated value) and an exploration term weighted by the network prior.

The final move is chosen from the resulting visit counts, producing a search-improved policy target that is also reused during training.
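A root-only PUCT loop can be sketched as follows. The `evaluate` callable is a stand-in for "apply the move on the determinized board and query the value head", and the constant `c_puct`, the simulation count, and the `sqrt(total + 1)` form of the exploration term are illustrative choices rather than values taken from `src.search`.

```python
import math


def root_puct(priors, evaluate, num_simulations=64, c_puct=1.5):
    """Root-only PUCT over a fixed set of legal actions.

    priors:   {action: prior probability} from the policy head
    evaluate: callable(action) -> value in [-1, 1] from the mover's perspective
    Returns the visit count of each action.
    """
    visits = {a: 0 for a in priors}
    value_sum = {a: 0.0 for a in priors}
    for _ in range(num_simulations):
        total = sum(visits.values())

        def score(a):
            q = value_sum[a] / visits[a] if visits[a] else 0.0      # exploitation
            u = c_puct * priors[a] * math.sqrt(total + 1) / (1 + visits[a])  # exploration
            return q + u

        action = max(priors, key=score)
        value_sum[action] += evaluate(action)
        visits[action] += 1
    return visits


def visits_to_policy(visits):
    """Normalize visit counts into a policy target."""
    total = sum(visits.values())
    return {a: n / total for a, n in visits.items()}
```

The normalized visit counts serve both as the move-selection distribution and as the policy target π stored for training.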

In [None]:
from src.search import *


## THE **RECONCHESS PLAYER** (FAST, rule-compliant)


The agent is implemented as a ReconChess Player, fully compliant with the game’s interface and rules.

All game interactions—including sensing, move selection, belief updates, and board state tracking—are handled through the standard ReconChess callbacks. This ensures that the agent can play against other bots without relying on privileged information or modified game mechanics. The focus is on robustness and correct interaction rather than maximal performance.


In [None]:
from src.player import *


## LOCAL MATCH HARNESS (ReconChess) — smoke test


A local match harness is used to run short games against a baseline opponent, serving as an end-to-end smoke test for rule compliance and framework integration.

In [None]:
# See play_local.py for local smoke testing


### Run smoke test (uncomment)


## RUN ALL CHECKS (fast sanity gate)
These checks are meant to fail fast if something is inconsistent. If they pass, the agent is generally safe to run in local matches and self-play.


# SELF-PLAY + TRAINING (FAST)

This section makes the bot trainable with minimal extra code:

1) ReplayBuffer in RAM (FAST)
2) Training step: KL(policy) + MSE(value)
3) Self-play game generator (ReconChess local runner)
4) Iterative loop: self-play → train → eval → checkpoint


In [None]:
# Fast sanity gate moved to sanity_checks.py


## ReplayBuffer (RAM) + Dataset


Self-play generates training samples over time, so the implementation stores them in a replay buffer kept in RAM.

The buffer collects tuples (X,π,z), where X is the encoded state, π is the search-improved policy target, and z is the final game outcome from the player’s perspective. A maximum capacity is enforced by discarding the oldest samples, keeping the dataset bounded and biased toward more recent experience.

A lightweight PyTorch Dataset wrapper exposes the buffer in a format suitable for batching with a DataLoader, enabling standard supervised updates of the policy and value network.
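A bounded buffer of this kind can be sketched with `collections.deque`, whose `maxlen` argument drops the oldest entries automatically. The class below is a simplified illustration with assumed method names; the version in `src.replay` may be organized differently.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded FIFO store of (X, pi, z) samples; the oldest are dropped first."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def add_game(self, xs, pis, zs):
        """Append one episode given as parallel lists."""
        self.samples.extend(zip(xs, pis, zs))

    def sample(self, batch_size):
        """Draw a random mini-batch without replacement."""
        return random.sample(list(self.samples), min(batch_size, len(self.samples)))

    def __len__(self):
        return len(self.samples)
```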

In [None]:
from src.replay import *


## Training step (KL policy + MSE value)


Network parameters are updated using supervised learning on batches sampled from the replay buffer.

The policy head is trained by minimizing the Kullback–Leibler divergence between the network’s predicted action distribution and the search-derived policy target. In parallel, the value head is trained using a mean squared error loss against the final game outcome.

The two losses are combined into a single objective and optimized using standard gradient-based methods, with gradient clipping applied for stability.
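For a single sample, the combined objective can be written out as below. The direction of the divergence (target relative to prediction) and the `value_weight` parameter are assumptions for this sketch; in PyTorch the same quantities are typically obtained with `F.kl_div` on log-probabilities and `F.mse_loss`.

```python
import math


def policy_value_loss(pred_probs, target_probs, pred_value, outcome, value_weight=1.0):
    """KL(target || prediction) for the policy plus squared error for the value."""
    eps = 1e-12  # guards against log(0) when the prediction assigns zero mass
    kl = sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target_probs, pred_probs)
        if t > 0.0
    )
    mse = (pred_value - outcome) ** 2
    return kl + value_weight * mse
```

The loss is zero only when the predicted distribution matches the search target and the predicted value equals the game outcome.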

In [None]:
from src.train_loop import train_steps


## Checkpoint helpers


To support long-running experiments and allow training to be resumed across sessions, helper functions are provided to save and load model checkpoints.

Each checkpoint stores the network parameters, optimizer state, and basic training metadata, ensuring that training can be restarted consistently without loss of information.

In [None]:
from src.checkpoint import *


In [None]:
# Use src.checkpoint in the notebook


## One self-play game (ReconChess) → (X, P, Z)


A single self-play game is executed by running two instances of the same ReconChess player against each other using the local game runner.

During the game, each player records training samples (X,π) at decision time, where X is the encoded state and π is derived from the search visit counts. After the game ends, the final outcome is converted into a value target z and assigned to all samples collected by each player.

The resulting lists (X,P,Z) provide one complete episode of training data that can be appended to the replay buffer.
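Assigning the outcome to the recorded samples can be sketched as follows, assuming the common convention of +1 for a win, -1 for a loss, and 0 for a game without a winner. The function name and signature are hypothetical.

```python
def value_targets(num_samples, winner, player_color):
    """Return one z per recorded sample, from the given player's perspective.

    winner is the winning colour, or None if the game had no winner.
    """
    if winner is None:
        z = 0.0
    else:
        z = 1.0 if winner == player_color else -1.0
    return [z] * num_samples
```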

## Self-play → train → eval loop (FAST)


In [None]:
from src.selfplay import *


The main training loop alternates between data generation and network updates.

At each iteration, a small batch of self-play games is generated to produce new (X,π,z) samples, which are appended to the replay buffer. The policy/value network is then updated for a fixed number of gradient steps using mini-batches sampled from the buffer.

After training, the current model is evaluated in a short match series against a simple baseline opponent to provide a quick progress signal, and a checkpoint plus a CSV log entry are saved for later inspection.
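The overall control flow can be sketched as below. All four callables are hypothetical stand-ins for the notebook's self-play, training, evaluation, and checkpoint helpers, the CSV column names are invented for the example, and a plain list replaces the capped replay buffer.

```python
import csv
import datetime
import io


def training_loop(num_iters, play_games, train, evaluate, save_checkpoint, log_file):
    """Alternate self-play, training, evaluation, and checkpointing."""
    buffer = []  # stand-in for the bounded replay buffer
    writer = csv.writer(log_file)
    writer.writerow(["time", "iteration", "buffer_size", "loss", "winrate"])
    for it in range(1, num_iters + 1):
        buffer.extend(play_games())   # new (X, pi, z) samples
        loss = train(buffer)          # fixed number of gradient steps
        winrate = evaluate()          # short match series against a baseline
        save_checkpoint(it)
        writer.writerow(
            [datetime.datetime.now().isoformat(), it, len(buffer), loss, winrate]
        )


# Example run with trivial stand-ins, logging to an in-memory file.
log = io.StringIO()
training_loop(
    num_iters=2,
    play_games=lambda: [("x", "pi", 0.0)],
    train=lambda buf: 0.5,
    evaluate=lambda: 0.0,
    save_checkpoint=lambda it: None,
    log_file=log,
)
```

Passing the helpers in as arguments keeps the loop itself independent of how games are played or how the network is trained.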

In [None]:
from src.selfplay import *


### Run a small self-play training (start tiny)


## Results plots
After training, plot the training loss and the win rate against a random opponent.


In [None]:
from src.plots import *


This notebook provides a complete and executable reference implementation of a learning-based RBC agent, suitable for experimentation and further extensions.