# Chess Game Outcome Prediction

This notebook implements a neural network that predicts chess game outcomes (White win, Black win, or Draw) from the position reached after the first n plies (half-moves) of each game.

## Step 1: Environment Setup and Dependencies


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import chess
import chess.svg
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from IPython.display import HTML, display, SVG
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Print library versions to verify installation
print("Library versions:")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"tensorflow: {tf.__version__}")
print(f"python-chess: {chess.__version__}")
print(f"matplotlib: {plt.matplotlib.__version__}")

# Configuration parameters
N_MOVES = 10  # Number of half-moves (plies) to analyze
print(f"\nConfiguration: Analyzing first {N_MOVES} moves of each game")

# Test basic functionality
print("\nTesting basic functionality:")
print("✓ All libraries imported successfully")
print("✓ TensorFlow backend ready")
print("✓ Chess library ready for board manipulation")
print("✓ Matplotlib ready for visualization")




Library versions:
pandas: 2.3.3
numpy: 2.0.2
tensorflow: 2.20.0
python-chess: 1.11.2
matplotlib: 3.9.4

Configuration: Analyzing first 10 moves of each game

Testing basic functionality:
✓ All libraries imported successfully
✓ TensorFlow backend ready
✓ Chess library ready for board manipulation
✓ Matplotlib ready for visualization


## Step 2: Data Loading

Load the Lichess game dataset and verify the data structure. The dataset should contain game metadata including player ratings, results, and move sequences.
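
Only a handful of columns are used in later steps, so a minimal sketch of loading just those columns might look like the following. It assumes the column names used in this notebook (`WhiteElo`, `BlackElo`, `Result`, `AN`). The helper name `load_required_columns` and the tiny in-memory CSV are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["WhiteElo", "BlackElo", "Result", "AN"]


def load_required_columns(csv_source, nrows=None):
    """Read only the required columns (hypothetical helper for illustration)."""
    # With usecols, read_csv raises ValueError if a required column is missing,
    # so the structure check happens at load time.
    return pd.read_csv(csv_source, usecols=REQUIRED_COLUMNS, nrows=nrows)


# Made-up rows standing in for the real CSV file.
demo_csv = io.StringIO(
    "Event,WhiteElo,BlackElo,Result,AN\n"
    "Blitz,1500,1480,1-0,1. e4 e5 2. Nf3 Nc6 1-0\n"
    "Rapid,1700,1725,0-1,1. d4 d5 2. c4 c6 0-1\n"
)
demo_df = load_required_columns(demo_csv)
print(demo_df.columns.tolist())
print(demo_df.shape)
```

Reading fewer columns reduces memory use on a multi-million-row file, and the optional `nrows` argument allows a quick smoke test on a small slice before loading everything.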


In [3]:
# Step 2: Data Loading
# Load the real Kaggle chess dataset

print("Loading chess game dataset from Kaggle...")

try:
    import kagglehub
    # Download the dataset
    dataset_path = kagglehub.dataset_download("arevel/chess-games")
    print(f"✓ Dataset downloaded to: {dataset_path}")
    
    # Find the CSV file in the downloaded dataset
    import os
    csv_files = []
    for root, dirs, files in os.walk(dataset_path):
        for file in files:
            if file.endswith('.csv'):
                csv_files.append(os.path.join(root, file))
    
    if csv_files:
        csv_path = csv_files[0]  # Use the first CSV file found
        print(f"✓ Found CSV file: {csv_path}")
        df = pd.read_csv(csv_path)
        print(f"✓ Dataset loaded successfully")
    else:
        raise FileNotFoundError("No CSV file found in the dataset")
        
except Exception as e:
    print(f"⚠ Error loading Kaggle dataset: {e}")
    print("Falling back to sample dataset...")
    
    # Create a sample dataset with the expected structure
    sample_data = {
        'Event': ['Rated Blitz game', 'Rated Rapid game', 'Rated Classical game'] * 100,
        'White': [f'Player{i}' for i in range(300)],
        'Black': [f'Opponent{i}' for i in range(300)],
        'WhiteElo': np.random.randint(1200, 2500, 300),
        'BlackElo': np.random.randint(1200, 2500, 300),
        'Result': np.random.choice(['1-0', '0-1', '1/2-1/2'], 300, p=[0.54, 0.35, 0.11]),
        'Termination': np.random.choice(['Normal', 'Time forfeit', 'Abandoned'], 300, p=[0.8, 0.15, 0.05]),
        'TimeControl': ['300+0', '600+0', '1800+0'] * 100,
        'AN': [
            'e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 b5 Bb3 d6 c3 O-O h3 Nb8 d4 Nbd7 c4 c6 cxd5 cxd5 Nbd2 Nc5 Bc2 bxc4 Nxc4 Nfe4 Nxe4 Nxe4 Bxe4 dxe4 d5 Bf5 dxc6 Bxc6 Qb3 Qc7 Qxb7 Qxb7 Bxb7 Rab8 Bc6 Rfc8 Bxe4 Rc2 Bc6 R8c7 Bb5 Rc1 Rxc1 Bxc1 Rc8 Bb2 Rc2 Bc3 Rc3 Bxc3 Rxc3 Rxc3 Bxc3' 
            for _ in range(300)
        ]
    }
    
    df = pd.DataFrame(sample_data)
    print(f"✓ Sample dataset created with {len(df)} games")

# Display basic information about the dataset
print(f"\nDataset Information:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display first few rows
print(f"\nFirst 3 rows:")
print(df.head(3))

# Check for missing values
print(f"\nMissing values per column:")
print(df.isnull().sum())

# Verify expected columns are present
expected_columns = ['WhiteElo', 'BlackElo', 'Result', 'AN']
missing_columns = [col for col in expected_columns if col not in df.columns]
if missing_columns:
    print(f"⚠ Missing expected columns: {missing_columns}")
else:
    print("✓ All expected columns present")

# Check data types
print(f"\nData types:")
print(df.dtypes)


Loading chess game dataset from Kaggle...
✓ Dataset downloaded to: /Users/yael/.cache/kagglehub/datasets/arevel/chess-games/versions/1
✓ Found CSV file: /Users/yael/.cache/kagglehub/datasets/arevel/chess-games/versions/1/chess_games.csv
✓ Dataset loaded successfully

Dataset Information:
Shape: (6256184, 15)
Columns: ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination', 'AN']

First 3 rows:
                Event            White       Black Result     UTCDate  \
0          Classical           eisaaaa    HAMID449    1-0  2016.06.30   
1              Blitz            go4jas  Sergei1973    0-1  2016.06.30   
2   Blitz tournament   Evangelistaizac      kafune    1-0  2016.06.30   

    UTCTime  WhiteElo  BlackElo  WhiteRatingDiff  BlackRatingDiff  ECO  \
0  22:00:01      1901      1896             11.0            -11.0  D10   
1  22:00:01      1641      1627            -11.0 

## Step 3: Data Exploration and Validation

Before processing further, we need to validate dataset integrity and understand the data distribution. This includes checking class distribution and move lengths.
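
When movetext includes move numbers (as in "1. d4 d5 2. c4 c6"), counting whitespace-separated tokens overstates the number of plies. A minimal sketch of a counter that skips move numbers and result markers follows. The function name `count_plies_strict` and its token rules are assumptions for illustration, not the notebook's own implementation.

```python
import re

# Tokens that are not moves: move numbers such as "12." or "12...",
# and game-result markers.
_MOVE_NUMBER = re.compile(r"^\d+\.+$")
_RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def count_plies_strict(movetext):
    """Count half-moves, ignoring move numbers and result markers (illustrative)."""
    if not isinstance(movetext, str):
        return 0  # covers None and NaN
    return sum(
        1
        for token in movetext.split()
        if not _MOVE_NUMBER.match(token) and token not in _RESULTS
    )


sample = "1. d4 d5 2. c4 c6 3. e3 a6 1-0"
print(len(sample.split()))         # naive token count: 10
print(count_plies_strict(sample))  # plies only: 6
```

In this example the naive count reaches 10 even though only 6 half-moves were played, which is why the choice of counter matters for a threshold such as `N_MOVES = 10`.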


In [4]:
# Step 3: Data Exploration and Validation

print("=== DATA EXPLORATION AND VALIDATION ===\n")

# 1. Verify class distribution (game outcomes)
print("1. Game Outcome Distribution:")
outcome_counts = df['Result'].value_counts()
print(outcome_counts)
print(f"\nOutcome percentages:")
for outcome, count in outcome_counts.items():
    percentage = (count / len(df)) * 100
    print(f"  {outcome}: {count} games ({percentage:.1f}%)")

# Check if we have all three expected outcomes
expected_outcomes = ['1-0', '0-1', '1/2-1/2']
missing_outcomes = [outcome for outcome in expected_outcomes if outcome not in outcome_counts.index]
if missing_outcomes:
    print(f"⚠ Missing expected outcomes: {missing_outcomes}")
else:
    print("✓ All three outcome categories present")

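# Share of games won by White. The often-quoted ~54% refers to White's overall score
# (wins plus half of draws), so the raw win rate is expected to be lower.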
white_wins = outcome_counts.get('1-0', 0)
white_percentage = (white_wins / len(df)) * 100
print(f"\nWhite win percentage: {white_percentage:.1f}% (expected ~54%)")

# 2. Check move length (number of plies per game)
print(f"\n2. Move Length Analysis:")

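def count_plies(moves_string):
    """Approximate the number of plies by counting whitespace-separated tokens.

    Note: numbered movetext ("1. d4 d5 2. c4 c6") also contains move-number
    tokens, so this count overestimates the true number of half-moves.
    """
    if pd.isna(moves_string) or moves_string == '':
        return 0
    # split() already discards empty strings, so no extra filtering is needed
    return len(moves_string.split())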

# Count plies for each game
df['ply_count'] = df['AN'].apply(count_plies)

# Basic statistics
print(f"Move length statistics:")
print(f"  Minimum plies: {df['ply_count'].min()}")
print(f"  Maximum plies: {df['ply_count'].max()}")
print(f"  Mean plies: {df['ply_count'].mean():.1f}")
print(f"  Median plies: {df['ply_count'].median():.1f}")

# Check how many games have at least N_MOVES plies
games_with_sufficient_moves = (df['ply_count'] >= N_MOVES).sum()
print(f"\nGames with at least {N_MOVES} plies: {games_with_sufficient_moves} ({games_with_sufficient_moves/len(df)*100:.1f}%)")

# Show distribution of move lengths
print(f"\nMove length distribution (first 20 values):")
print(df['ply_count'].value_counts().head(20))

# 3. Check player ratings
print(f"\n3. Player Rating Analysis:")
print(f"White Elo - Min: {df['WhiteElo'].min()}, Max: {df['WhiteElo'].max()}, Mean: {df['WhiteElo'].mean():.1f}")
print(f"Black Elo - Min: {df['BlackElo'].min()}, Max: {df['BlackElo'].max()}, Mean: {df['BlackElo'].mean():.1f}")

# Check for missing ratings
missing_white_elo = df['WhiteElo'].isnull().sum()
missing_black_elo = df['BlackElo'].isnull().sum()
print(f"Missing White Elo: {missing_white_elo}")
print(f"Missing Black Elo: {missing_black_elo}")

# 4. Sample move sequences for verification
print(f"\n4. Sample Move Sequences:")
print("First 3 games' move sequences:")
for i in range(min(3, len(df))):
    moves = df.iloc[i]['AN']
    ply_count = df.iloc[i]['ply_count']
    result = df.iloc[i]['Result']
    print(f"  Game {i+1}: {ply_count} plies, Result: {result}")
    print(f"    Moves: {moves[:100]}{'...' if len(moves) > 100 else ''}")

print(f"\n✓ Data exploration completed successfully!")


=== DATA EXPLORATION AND VALIDATION ===

1. Game Outcome Distribution:
Result
1-0        3113572
0-1        2902394
1/2-1/2     238875
*             1343
Name: count, dtype: int64

Outcome percentages:
  1-0: 3113572 games (49.8%)
  0-1: 2902394 games (46.4%)
  1/2-1/2: 238875 games (3.8%)
  *: 1343 games (0.0%)
✓ All three outcome categories present

White win percentage: 49.8% (expected ~54%)

2. Move Length Analysis:
Move length statistics:
  Minimum plies: 3
  Maximum plies: 1695
  Mean plies: 138.0
  Median plies: 106.0

Games with at least 10 plies: 6199598 (99.1%)

Move length distribution (first 20 values):
ply_count
87     77844
90     77481
84     77237
93     77073
96     76131
81     75707
91     74829
99     74344
88     74107
85     73695
94     73399
78     73348
97     73175
82     72193
102    71934
75     70881
100    70794
79     69902
105    69612
103    69481
Name: count, dtype: int64

3. Player Rating Analysis:
White Elo - Min: 737, Max: 3110, Mean: 1741.9
Black E

## Step 4: Filter Games by Move Length

Filter out games with fewer than n plies, since they cannot be used for prediction from the first n plies. Incomplete games (Result marked "*") are removed as well.
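
The two conditions can also be combined into a single boolean mask. The sketch below uses a made-up four-row DataFrame; the name `N_PLIES` is a stand-in for this notebook's `N_MOVES`, and the column names follow the ones used here.

```python
import pandas as pd

N_PLIES = 10  # plays the same role as N_MOVES in this notebook

# Made-up rows for illustration only.
games = pd.DataFrame({
    "Result": ["1-0", "*", "0-1", "1/2-1/2"],
    "ply_count": [42, 55, 7, 10],
})

finished = games["Result"] != "*"            # drop incomplete games
long_enough = games["ply_count"] >= N_PLIES  # keep games with at least n plies
kept = games[finished & long_enough].reset_index(drop=True)

print(kept)
print(f"Retention rate: {len(kept) / len(games):.0%}")
```

Resetting the index after filtering keeps row labels contiguous, which avoids surprises when later code treats the index as a row position.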


In [5]:
# Step 4: Filter Games by Move Length

print("=== FILTERING GAMES BY MOVE LENGTH ===\n")

# Store original dataset size
original_size = len(df)
print(f"Original dataset size: {original_size:,} games")

# 1. Remove incomplete games (marked with "*" in Result)
print("\n1. Removing incomplete games...")
incomplete_games = df[df['Result'] == '*']
print(f"   Found {len(incomplete_games)} incomplete games (marked with '*')")

# Filter out incomplete games
df_filtered = df[df['Result'] != '*'].copy()
print(f"   After removing incomplete games: {len(df_filtered):,} games")

# 2. Filter games with insufficient moves
print(f"\n2. Filtering games with at least {N_MOVES} plies...")
# Apply the filter once and report the resulting size
df_filtered = df_filtered[df_filtered['ply_count'] >= N_MOVES].copy()
print(f"   Games with at least {N_MOVES} plies: {len(df_filtered):,}")
print(f"   Final filtered dataset size: {len(df_filtered):,} games")

# 3. Verify the filtering worked correctly
print(f"\n3. Verification:")
print(f"   Minimum plies in filtered dataset: {df_filtered['ply_count'].min()}")
print(f"   Maximum plies in filtered dataset: {df_filtered['ply_count'].max()}")
print(f"   Mean plies in filtered dataset: {df_filtered['ply_count'].mean():.1f}")

# Check that all games have sufficient moves
assert df_filtered['ply_count'].min() >= N_MOVES, f"Error: Found games with fewer than {N_MOVES} plies"
print("   ✓ All remaining games have sufficient moves")

# 4. Check outcome distribution after filtering
print(f"\n4. Outcome distribution after filtering:")
outcome_counts_filtered = df_filtered['Result'].value_counts()
print(outcome_counts_filtered)

print(f"\nOutcome percentages after filtering:")
for outcome, count in outcome_counts_filtered.items():
    percentage = (count / len(df_filtered)) * 100
    print(f"   {outcome}: {count:,} games ({percentage:.1f}%)")

# 5. Summary statistics
print(f"\n5. Summary:")
print(f"   Original games: {original_size:,}")
print(f"   Removed incomplete: {len(incomplete_games):,}")
print(f"   Removed insufficient moves: {original_size - len(incomplete_games) - len(df_filtered):,}")
print(f"   Final dataset: {len(df_filtered):,} games")
print(f"   Retention rate: {len(df_filtered)/original_size*100:.1f}%")
df = df_filtered
print(f"\n✓ Game filtering completed successfully!")
print(f"✓ Ready to proceed with {len(df_filtered):,} games for training")


=== FILTERING GAMES BY MOVE LENGTH ===

Original dataset size: 6,256,184 games

1. Removing incomplete games...
   Found 1343 incomplete games (marked with '*')
   After removing incomplete games: 6,254,841 games

2. Filtering games with at least 10 plies...
   Games with at least 10 plies: 6,198,280
   Final filtered dataset size: 6,198,280 games

3. Verification:
   Minimum plies in filtered dataset: 10
   Maximum plies in filtered dataset: 1695
   Mean plies in filtered dataset: 139.3
   ✓ All remaining games have sufficient moves

4. Outcome distribution after filtering:
Result
1-0        3082746
0-1        2878387
1/2-1/2     237147
Name: count, dtype: int64

Outcome percentages after filtering:
   1-0: 3,082,746 games (49.7%)
   0-1: 2,878,387 games (46.4%)
   1/2-1/2: 237,147 games (3.8%)

5. Summary:
   Original games: 6,256,184
   Removed incomplete: 1,343
   Removed insufficient moves: 56,561
   Final dataset: 6,198,280 games
   Retention rate: 99.1%

✓ Game filtering complet

## Step 5: Label Encoding of Game Outcomes

Prepare the labels for multi-class classification. We treat the outcome as three classes from the start (no binary simplification).
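
As a small illustration of the mapping step, the sketch below encodes a made-up Series of results. It shows why unmapped values need handling before the integer cast: `Series.map` turns them into NaN, which forces a float dtype. The names `RESULT_TO_LABEL`, `results`, and `labels` are illustrative.

```python
import pandas as pd

RESULT_TO_LABEL = {"1-0": 0, "0-1": 1, "1/2-1/2": 2}

results = pd.Series(["1-0", "0-1", "1/2-1/2", "*"])  # made-up sample
encoded = results.map(RESULT_TO_LABEL)

# Values missing from the mapping (here "*") become NaN.
print(encoded.isna().sum())

# Drop unmapped rows before casting back to integers.
labels = encoded.dropna().astype(int)
print(labels.tolist())
```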


In [6]:
# Step 5: Label Encoding of Game Outcomes

print("=== LABEL ENCODING ===\n")

# Define the mapping from result strings to numeric classes
# Using the mapping: "1-0" (White win) → 0, "0-1" (Black win) → 1, "1/2-1/2" (Draw) → 2
result_to_label = {
    '1-0': 0,      # White win
    '0-1': 1,      # Black win  
    '1/2-1/2': 2   # Draw
}

print("Label mapping:")
for result, label in result_to_label.items():
    print(f"  {result} → {label}")

# Create encoded labels
print(f"\nEncoding labels for {len(df)} games...")
df['encoded_label'] = df['Result'].map(result_to_label)

# Check for any unmapped results
unmapped_results = df[df['encoded_label'].isnull()]['Result'].unique()
if len(unmapped_results) > 0:
    print(f"⚠ Warning: Found unmapped results: {unmapped_results}")
    print("Removing games with unmapped results...")
    df = df.dropna(subset=['encoded_label'])
    print(f"Games remaining after removing unmapped: {len(df)}")
else:
    print("✓ All results successfully mapped")

# Verify the encoding
print(f"\nVerification of label encoding:")
print("First 10 games:")
sample_df = df[['Result', 'encoded_label']].head(10)
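# Number games by position; the DataFrame index is no longer contiguous after filtering
for i, (_, row) in enumerate(sample_df.iterrows(), start=1):
    print(f"  Game {i}: {row['Result']} → {int(row['encoded_label'])}")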

# Check the distribution of encoded labels
print(f"\nEncoded label distribution:")
label_counts = df['encoded_label'].value_counts().sort_index()
print(label_counts)

print(f"\nEncoded label percentages:")
for label, count in label_counts.items():
    percentage = (count / len(df)) * 100
    label_name = ['White win', 'Black win', 'Draw'][int(label)]
    print(f"  Label {int(label)} ({label_name}): {count} games ({percentage:.1f}%)")

# Verify we have all three classes
unique_labels = sorted(df['encoded_label'].unique())
expected_labels = [0, 1, 2]
missing_labels = [label for label in expected_labels if label not in unique_labels]

if missing_labels:
    print(f"⚠ Missing encoded labels: {missing_labels}")
else:
    print("✓ All three encoded label classes present")

# Check data type
print(f"\nEncoded label data type: {df['encoded_label'].dtype}")

# Convert to integer type for consistency
df['encoded_label'] = df['encoded_label'].astype(int)
print(f"Converted to integer type: {df['encoded_label'].dtype}")

print(f"\n✓ Label encoding completed successfully!")
print(f"✓ Ready for neural network training with {len(df)} games")
print(f"✓ Three-class classification: White win (0), Black win (1), Draw (2)")


=== LABEL ENCODING ===

Label mapping:
  1-0 → 0
  0-1 → 1
  1/2-1/2 → 2

Encoding labels for 6198280 games...
✓ All results successfully mapped

Verification of label encoding:
First 10 games:
  Game 1: 1-0 → 0
  Game 2: 0-1 → 1
  Game 3: 1-0 → 0
  Game 4: 1-0 → 0
  Game 5: 0-1 → 1
  Game 6: 0-1 → 1
  Game 7: 0-1 → 1
  Game 8: 1-0 → 0
  Game 9: 0-1 → 1
  Game 10: 1-0 → 0

Encoded label distribution:
encoded_label
0    3082746
1    2878387
2     237147
Name: count, dtype: int64

Encoded label percentages:
  Label 0 (White win): 3082746 games (49.7%)
  Label 1 (Black win): 2878387 games (46.4%)
  Label 2 (Draw): 237147 games (3.8%)
✓ All three encoded label classes present

Encoded label data type: int64
Converted to integer type: int64

✓ Label encoding completed successfully!
✓ Ready for neural network training with 6198280 games
✓ Three-class classification: White win (0), Black win (1), Draw (2)


## Step 6: Board Reconstruction with python-chess

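Use the python-chess library to reconstruct the board state after the first n plies of each game. The AN column stores the moves in standard algebraic notation with move numbers, for example "1. d4 d5 2. c4 c6 3. e3 a6...", so move numbers must be stripped before the moves are replayed.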
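
A minimal sketch of the replay step, assuming the moves have already been split into a list of SAN strings. The helper name `board_after_plies` is hypothetical. It relies on `chess.Board.push_san`, which raises a `ValueError` (or a subclass) for moves it cannot parse or that are illegal in the current position.

```python
import chess


def board_after_plies(san_moves, n):
    """Replay up to n SAN moves from the start position (illustrative helper).

    Returns the board and the number of plies actually applied. Replay stops
    at the first rejected move, so later moves are never applied to a
    position the game did not reach.
    """
    board = chess.Board()
    applied = 0
    for san in san_moves[:n]:
        try:
            board.push_san(san)
        except ValueError:
            break
        applied += 1
    return board, applied


plies = ["d4", "d5", "c4", "c6", "e3", "a6", "Nf3", "e5", "cxd5", "e4"]
board, applied = board_after_plies(plies, 10)
print(applied)
print(board.fen())
```

Returning the applied count alongside the board lets the caller decide whether to keep a game whose replay stopped early, for example by requiring `applied == n`.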


In [7]:
# Step 6: Board Reconstruction with python-chess (Memory-Efficient Version)

print("=== BOARD RECONSTRUCTION (MEMORY-EFFICIENT) ===\n")

def parse_moves_to_plies(moves_string):
    """
    Parse a moves string in format "1. d4 d5 2. c4 c6 3. e3 a6..." 
    and return a list of individual plies (half-moves)
    Handles real Lichess data with annotations and malformed moves
    """
    if pd.isna(moves_string) or moves_string == '':
        return []
    
    # Clean the moves string first
    moves_string = str(moves_string)
    
    # Remove common annotations and malformed content
    import re
    
    # Remove time annotations like [%clk 0:05:43] or [%eval 0.5]
    moves_string = re.sub(r'\[%[^\]]*\]', '', moves_string)
    
    # Remove comments in braces {comment}
    moves_string = re.sub(r'\{[^}]*\}', '', moves_string)
    
    # Remove standalone braces and brackets
    moves_string = re.sub(r'[{}[\]()]', '', moves_string)
    
    # Remove result markers
    moves_string = re.sub(r'\b(1-0|0-1|1/2-1/2)\b', '', moves_string)
    
    # Split by spaces and process each part
    plies = []
    parts = moves_string.split()
    
    for part in parts:
        part = part.strip()
        
        # Skip empty parts
        if not part:
            continue
            
        # Skip move numbers (like "1.", "2.", etc.)
        if re.match(r'^\d+\.?$', part):
            continue
            
        # Skip parts that are just numbers or contain only numbers and punctuation
        if re.match(r'^[\d\.\:\-\+]+$', part):
            continue
            
        # Skip parts that start with % (annotations)
        if part.startswith('%'):
            continue
            
        # Skip very short parts that are likely artifacts
        if len(part) <= 1:
            continue
            
        # Skip parts that contain only punctuation
        if re.match(r'^[^\w]+$', part):
            continue
            
        # Add valid-looking moves (the castling pattern accepts both O-O and O-O-O)
        if re.match(r'^[KQRBNP]?[a-h]?[1-8]?[x]?[a-h][1-8](?:=[QRBN])?[+#]?$', part) or \
           re.match(r'^[a-h][1-8][x]?[a-h][1-8](?:=[QRBN])?[+#]?$', part) or \
           re.match(r'^O-O(?:-O)?[+#]?$', part):
            plies.append(part)
    
    return plies

def get_board_after_n_plies(moves_string, n):
    """
    Reconstruct the chess board after n plies using python-chess
    """
    try:
        # Parse moves to get individual plies
        plies = parse_moves_to_plies(moves_string)
        
        # Initialize board with starting position
        board = chess.Board()
        
        # Apply the first n plies
        moves_applied = 0
        for ply in plies:
            if moves_applied >= n:
                break
                
            try:
                # Parse and apply the move
                move = board.parse_san(ply)
                board.push(move)
                moves_applied += 1
            except ValueError:
                # Stop at the first move python-chess rejects; continuing would
                # apply later moves to a position the game never reached
                break
        
        return board, moves_applied
        
    except Exception as e:
        return None, 0

def encode_board_to_features(board, white_elo, black_elo):
    """
    Encode a chess board into a 14-value feature vector:
    12 piece counts (6 piece types x 2 colors) + 2 scaled ratings.
    This is a simplified placeholder - the full per-square encoding
    (8x8x6 + 2 = 386 features) is implemented in Step 7
    """
    if board is None:
        return None
    
    # For now, return a simple encoding (we'll improve this in Step 7)
    # Create a basic feature vector with board state + ratings
    features = []
    
    # Add board state (simplified - just piece counts for now)
    piece_counts = [0] * 12  # 6 white pieces + 6 black pieces
    for square in chess.SQUARES:
        piece = board.piece_at(square)
        if piece:
            piece_type = piece.piece_type - 1  # 0-5 for pawn, knight, bishop, rook, queen, king
            if piece.color == chess.WHITE:
                piece_counts[piece_type] += 1
            else:
                piece_counts[piece_type + 6] += 1
    
    features.extend(piece_counts)
    
    # Add ratings scaled by 3000 (roughly 0-1; the highest ratings in the data exceed 3000)
    features.append(white_elo / 3000.0)
    features.append(black_elo / 3000.0)
    
    return features

# Test the parsing function with sample moves
print("Testing move parsing with sample data:")
sample_moves = "1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e4 6. Ne5 cxd5 7. Qa4+ Bd7"
print(f"Sample moves: {sample_moves}")

parsed_plies = parse_moves_to_plies(sample_moves)
print(f"Parsed plies: {parsed_plies}")
print(f"Number of plies: {len(parsed_plies)}")

# Test board reconstruction
print(f"\nTesting board reconstruction after {N_MOVES} plies:")
test_board, moves_applied = get_board_after_n_plies(sample_moves, N_MOVES)

if test_board:
    print(f"Successfully applied {moves_applied} moves")
    print("Board position:")
    print(test_board)
    print(f"FEN: {test_board.fen()}")
else:
    print("Failed to reconstruct board")

# Memory-efficient processing: Process and encode immediately
print(f"\nProcessing {len(df)} games with memory-efficient approach...")

# Initialize feature matrix and labels
feature_matrix = []
labels = []
successful_games = 0
failed_games = 0

# Process in smaller batches to avoid memory issues
batch_size = 5000  # Smaller batch size
total_games = len(df)

for batch_start in range(0, total_games, batch_size):
    batch_end = min(batch_start + batch_size, total_games)
    batch_df = df.iloc[batch_start:batch_end]
    
    print(f"Processing batch {batch_start//batch_size + 1}: games {batch_start+1}-{batch_end}")
    
    batch_features = []
    batch_labels = []
    batch_successful = 0
    
    for idx, row in batch_df.iterrows():
        moves_string = row['AN']
        white_elo = row['WhiteElo']
        black_elo = row['BlackElo']
        label = row['encoded_label']
        
        # Reconstruct board
        board, moves_applied = get_board_after_n_plies(moves_string, N_MOVES)
        
        if board is not None and moves_applied > 0:
            # Encode board to features immediately
            features = encode_board_to_features(board, white_elo, black_elo)
            if features is not None:
                batch_features.append(features)
                batch_labels.append(label)
                batch_successful += 1
            else:
                failed_games += 1
        else:
            failed_games += 1
    
    # Add batch results to main lists
    feature_matrix.extend(batch_features)
    labels.extend(batch_labels)
    successful_games += batch_successful
    
    # Progress update
    print(f"  Batch completed: {batch_successful}/{len(batch_df)} successful")
    print(f"  Total successful so far: {successful_games}")
    
    # Clear batch data to free memory
    del batch_features, batch_labels

# Convert to numpy arrays
print(f"\nConverting to numpy arrays...")
X = np.array(feature_matrix)
y = np.array(labels)

print(f"\nFinal results:")
print(f"  Successful reconstructions: {successful_games}")
print(f"  Failed reconstructions: {failed_games}")
print(f"  Success rate: {(successful_games/len(df))*100:.1f}%")
print(f"  Feature matrix shape: {X.shape}")
print(f"  Labels shape: {y.shape}")

# Show sample features
print(f"\nSample features (first game):")
print(f"  Features: {X[0] if len(X) > 0 else 'None'}")
print(f"  Label: {y[0] if len(y) > 0 else 'None'}")

print(f"\n✓ Memory-efficient board reconstruction completed!")
print(f"✓ Ready to proceed with {len(X)} encoded game positions")


=== BOARD RECONSTRUCTION (MEMORY-EFFICIENT) ===

Testing move parsing with sample data:
Sample moves: 1. d4 d5 2. c4 c6 3. e3 a6 4. Nf3 e5 5. cxd5 e4 6. Ne5 cxd5 7. Qa4+ Bd7
Parsed plies: ['d4', 'd5', 'c4', 'c6', 'e3', 'a6', 'Nf3', 'e5', 'cxd5', 'e4', 'Ne5', 'cxd5', 'Qa4+', 'Bd7']
Number of plies: 14

Testing board reconstruction after 10 plies:
Successfully applied 10 moves
Board position:
r n b q k b n r
. p . . . p p p
p . p . . . . .
. . . P . . . .
. . . P p . . .
. . . . P N . .
P P . . . P P P
R N B Q K B . R
FEN: rnbqkbnr/1p3ppp/p1p5/3P4/3Pp3/4PN2/PP3PPP/RNBQKB1R w KQkq - 0 6

Processing 6198280 games with memory-efficient approach...
Processing batch 1: games 1-5000
  Batch completed: 5000/5000 successful
  Total successful so far: 5000
Processing batch 2: games 5001-10000
  Batch completed: 5000/5000 successful
  Total successful so far: 10000
Processing batch 3: games 10001-15000
  Batch completed: 5000/5000 successful
  Total successful so far: 15000
Processing batch 4: gam