# UCI Tokenizer Example

This notebook demonstrates how to use the BPE tokenizer for chess moves in UCI notation.

The tokenizer uses the Byte-Pair Encoding (BPE) algorithm with **ByteLevel pre-tokenization**. Consecutive moves are separated by an explicit **`<STEP>`** token.

Special tokens included:
- `<PAD>`: Padding token
- `<START>`: Start of sequence token
- `<END>`: End of sequence token
- `<STEP>`: Move separator token (inserted between consecutive moves in place of a bare space)
- `<1-0>`: White wins
- `<0-1>`: Black wins
- `<1/2-1/2>`: Draw
- `<UNK>`: Unknown token

**Important**: The `<STEP>` token explicitly marks move boundaries, making it impossible to confuse where one move ends and another begins.
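
The input format described above can be produced with a small helper. The sketch below is only an illustration: `format_game` and `RESULT_TOKENS` are hypothetical names, not part of the tokenizer or of this project. It assumes moves arrive as a list of UCI strings and that the result, if given, is one of the three result tokens listed above. `<START>` and `<END>` are deliberately not added here, since the encoding output in section 2 shows the tokenizer inserting them itself.

```python
from typing import Optional, Sequence

# Result tokens taken from the special-token list above.
RESULT_TOKENS = {"<1-0>", "<0-1>", "<1/2-1/2>"}


def format_game(moves: Sequence[str], result: Optional[str] = None) -> str:
    """Join UCI moves with ' <STEP> ' and optionally append a result token.

    Hypothetical helper for illustration; not part of the tokenizer itself.
    """
    if result is not None and result not in RESULT_TOKENS:
        raise ValueError(f"Unknown result token: {result!r}")
    text = " <STEP> ".join(moves)
    if result is not None:
        # Avoid a leading space when the move list is empty.
        text = f"{text} {result}" if text else result
    return text


print(format_game(["e2e4", "e7e5", "f1c4"], result="<1-0>"))
# prints: e2e4 <STEP> e7e5 <STEP> f1c4 <1-0>
```

Working from a list rather than a space-separated string avoids accidental double separators if the source text contains irregular whitespace.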

## 1. Load the Pre-trained Tokenizer

We'll load a BPE tokenizer that was pre-trained on Lichess games.

In [1]:
from tokenizers import Tokenizer
from pathlib import Path

# Load the pre-trained tokenizer
tokenizer_path = Path("../src/chesstransformer/data/tokenizer_models/bpe_tokenizer_vocab2000.json")
tokenizer = Tokenizer.from_file(str(tokenizer_path))

print("Tokenizer loaded successfully!")
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

Tokenizer loaded successfully!
Vocabulary size: 1987


## 2. Encode Chess Moves

Let's encode a sequence of chess moves in UCI notation.

In [2]:
# Example game: Scholar's Mate (known in French as "Coup du Berger")
# Insert a <STEP> token between consecutive moves
moves = "e2e4 e7e5 f1c4 b8c6 d1h5 g8f6 h5f7"
input_text = moves.replace(" ", " <STEP> ")
# Append the game result token (White wins)
input_text += " <1-0>"

# Encode the moves
encoding = tokenizer.encode(input_text)

print("Original moves:", moves)
print("\nEncoded:")
print("Token IDs:", encoding.ids)
print("Tokens:", encoding.tokens)
print("\nNote: <STEP> tokens clearly mark where each move ends!")

Original moves: e2e4 e7e5 f1c4 b8c6 d1h5 g8f6 h5f7

Encoded:
Token IDs: [1, 101, 7, 111, 7, 133, 7, 106, 7, 370, 7, 99, 7, 943, 3, 2]
Tokens: ['<START>', 'e2e4', '<STEP>', 'e7e5', '<STEP>', 'f1c4', '<STEP>', 'b8c6', '<STEP>', 'd1h5', '<STEP>', 'g8f6', '<STEP>', 'h5f7', '<1-0>', '<END>']

Note: <STEP> tokens clearly mark where each move ends!
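
Because `<STEP>` delimits moves, a token list like the one printed above can be regrouped into moves without relying on whitespace. The following is a minimal sketch, not part of the tokenizer: `group_moves` and `NON_MOVE_TOKENS` are hypothetical names, and the code assumes that any subword pieces of a move can be concatenated directly (in the output above every move happens to be a single token, and no ByteLevel marker characters appear).

```python
from typing import List

# Special tokens that never belong to a move (from the list at the top of the notebook).
NON_MOVE_TOKENS = {"<PAD>", "<START>", "<END>", "<1-0>", "<0-1>", "<1/2-1/2>", "<UNK>"}


def group_moves(tokens: List[str]) -> List[str]:
    """Rebuild moves by concatenating the tokens found between <STEP> markers."""
    moves: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok == "<STEP>":
            # A separator closes the move collected so far.
            if current:
                moves.append("".join(current))
            current = []
        elif tok not in NON_MOVE_TOKENS:
            current.append(tok)
    if current:
        # The last move is followed by a result or <END> token, not by <STEP>.
        moves.append("".join(current))
    return moves


tokens = ['<START>', 'e2e4', '<STEP>', 'e7e5', '<STEP>', 'h5f7', '<1-0>', '<END>']
print(group_moves(tokens))
# prints: ['e2e4', 'e7e5', 'h5f7']
```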


## 3. Decode Token IDs Back to Moves

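We can decode the token IDs back to the original chess moves. By default, `decode` skips special tokens (`skip_special_tokens=True`), so `<START>`, `<STEP>`, `<1-0>`, and `<END>` are dropped and only the moves remain.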

In [3]:
# Decode the token IDs back to text
decoded_moves = tokenizer.decode(encoding.ids)

print("Decoded moves:", decoded_moves)
print("Match original?", decoded_moves.strip() == moves)

Decoded moves: e2e4 e7e5 f1c4 b8c6 d1h5 g8f6 h5f7
Match original? True
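
The string comparison above confirms the round trip for this one game. A complementary check, sketched below, is to verify that every decoded move is syntactically well-formed UCI. The helper name `invalid_uci_moves` is hypothetical, and the pattern covers only ordinary moves and promotions: it assumes the decoded text is space-separated, it does not accept the UCI null move `0000`, and it says nothing about whether a move is legal in a given position.

```python
import re
from typing import List

# Source square, target square, optional promotion piece (e.g. "e7e8q").
UCI_MOVE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


def invalid_uci_moves(decoded: str) -> List[str]:
    """Return the whitespace-separated items that are not well-formed UCI moves."""
    return [move for move in decoded.split() if not UCI_MOVE.fullmatch(move)]


print(invalid_uci_moves("e2e4 e7e5 f1c4 b8c6 d1h5 g8f6 h5f7"))
# prints: []
print(invalid_uci_moves("e2e4 e9e5 a7a8q"))
# prints: ['e9e5']
```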
