#### Data splitting

Here the three datasets (`chessData.csv`, `random_evals.csv`, `tactic_evals.csv`) are concatenated and split into:
- `training_data` (70%)
- `validation_data` (20%)
- `testing_data` (10%)

Note: This split only needs to run once, to generate the persistent datasets. Once the files are saved, they can be loaded directly without repeating the split.
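One way to make the "run once" rule explicit is to check for the saved files before splitting. The sketch below is only an illustration: the helper `splits_exist` and the constant `SPLIT_FILES` are hypothetical names, and the file names and directory are assumed to match the ones used by the save cell later in this notebook.

```python
from pathlib import Path

# File names assumed to match the ones written by the save cell of this notebook.
SPLIT_FILES = (
    "training_data.parquet",
    "validation_data.parquet",
    "testing_data.parquet",
)


def splits_exist(split_dir: Path) -> bool:
    """Return True only if all three split files are already on disk."""
    return all((split_dir / name).is_file() for name in SPLIT_FILES)


# Hypothetical usage: decide whether the splitting cells need to run.
SPLIT_DIR = Path("../data/processed")
if splits_exist(SPLIT_DIR):
    print("Splits found: load them with pd.read_parquet instead of re-splitting.")
else:
    print("Splits missing: run the splitting cells once to create them.")
```

Requiring all three files, rather than any one of them, avoids treating a partially written output directory as a finished split.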

In [6]:
from pathlib import Path
import pandas as pd
import numpy as np

DATA_DIR = Path('../data/raw')

print('Loading datasets...')
chess_evals = pd.read_csv(DATA_DIR / 'chessData.csv')
print(f'  chessData.csv: {len(chess_evals):,} rows')

random_evals = pd.read_csv(DATA_DIR / 'random_evals.csv')
print(f'  random_evals.csv: {len(random_evals):,} rows')

tactic_evals = pd.read_csv(DATA_DIR / 'tactic_evals.csv')
print(f'  tactic_evals.csv: {len(tactic_evals):,} rows')

# join='inner' keeps only the columns shared by all three files
raw_data = pd.concat([chess_evals, random_evals, tactic_evals], join='inner', ignore_index=True)
print(f'\nConcatenated dataset: {len(raw_data):,} rows, {len(raw_data.columns)} columns')
print(f'Columns: {list(raw_data.columns[:10])}...')

Loading datasets...
  chessData.csv: 12,958,035 rows
  random_evals.csv: 1,000,273 rows
  tactic_evals.csv: 2,628,219 rows

Concatenated dataset: 16,586,527 rows, 2 columns
Columns: ['FEN', 'Evaluation']...


In [7]:
# Split the concatenated dataset into train (70%), validation (20%), and test (10%)
from sklearn.model_selection import train_test_split

# first split: 70% train, 30% temp (validation + test)
training_data, temp_data = train_test_split(
    raw_data, 
    test_size=0.30, 
    random_state=42, 
    shuffle=True
)

# second split: divide temp_data into validation (2/3 of temp, 20% of the total) and test (1/3 of temp, 10% of the total)
validation_data, testing_data = train_test_split(
    temp_data,
    test_size=0.333333,  # 10% of the total (1/3 of the 30%)
    random_state=42, 
    shuffle=True
)

print('\nSplit complete:')
print(f'  training_data:   {len(training_data):,} rows ({len(training_data)/len(raw_data)*100:.1f}%)')
print(f'  validation_data: {len(validation_data):,} rows ({len(validation_data)/len(raw_data)*100:.1f}%)')
print(f'  testing_data:    {len(testing_data):,} rows ({len(testing_data)/len(raw_data)*100:.1f}%)')
print(f'  Total:           {len(training_data) + len(validation_data) + len(testing_data):,} rows')


Split complete:
  training_data:   11,610,568 rows (70.0%)
  validation_data: 3,317,307 rows (20.0%)
  testing_data:    1,658,652 rows (10.0%)
  Total:           16,586,527 rows
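The two fractions compose as 0.30 × 1/3 ≈ 0.10 for test and 0.30 × 2/3 ≈ 0.20 for validation, which is why a second `test_size` of about one third gives a 70/20/10 split. The sketch below works through that arithmetic. The function `two_stage_split_sizes` is a hypothetical helper, and it assumes the held-out side of each split is rounded up while the other side takes the remainder, which is consistent with the row counts printed above.

```python
import math


def two_stage_split_sizes(n_rows, first_test_size, second_test_size):
    """Row counts for a train/validation/test split made in two stages.

    Assumption: the held-out part of each stage is rounded up with ceil,
    and the remaining rows go to the other part.
    """
    n_temp = math.ceil(first_test_size * n_rows)   # validation + test
    n_train = n_rows - n_temp
    n_test = math.ceil(second_test_size * n_temp)  # carved out of temp
    n_val = n_temp - n_test
    return n_train, n_val, n_test


n_train, n_val, n_test = two_stage_split_sizes(16_586_527, 0.30, 0.333333)
total = n_train + n_val + n_test
print(f"train={n_train:,} val={n_val:,} test={n_test:,} total={total:,}")
print(f"fractions: {n_train / total:.3f} / {n_val / total:.3f} / {n_test / total:.3f}")
```

Under that rounding assumption the helper reproduces the three counts reported by the cell above for 16,586,527 input rows.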


In [8]:

SPLIT_DIR = Path('../data/processed')
SPLIT_DIR.mkdir(parents=True, exist_ok=True)

print('\nSaving split datasets...')

# saved in Parquet format for efficiency
training_data.to_parquet(SPLIT_DIR / 'training_data.parquet', index=False)
print(f'  ✓ training_data.parquet saved ({len(training_data):,} rows)')

validation_data.to_parquet(SPLIT_DIR / 'validation_data.parquet', index=False)
print(f'  ✓ validation_data.parquet saved ({len(validation_data):,} rows)')

testing_data.to_parquet(SPLIT_DIR / 'testing_data.parquet', index=False)
print(f'  ✓ testing_data.parquet saved ({len(testing_data):,} rows)')

print(f'\nAll split datasets saved to: {SPLIT_DIR.resolve()}')


Saving split datasets...
  ✓ training_data.parquet saved (11,610,568 rows)
  ✓ validation_data.parquet saved (3,317,307 rows)
  ✓ testing_data.parquet saved (1,658,652 rows)

All split datasets saved to: C:\Users\USUARIO\Documents\Code\stocksalmon\data\processed


In [1]:

# load the split datasets from the other notebooks
from pathlib import Path
import pandas as pd

SPLIT_DIR = Path('../data/processed')

training_data = pd.read_parquet(SPLIT_DIR / 'training_data.parquet')
validation_data = pd.read_parquet(SPLIT_DIR / 'validation_data.parquet')
testing_data = pd.read_parquet(SPLIT_DIR / 'testing_data.parquet')

print('Datasets loaded:')
print(f'  training_data:   {len(training_data):,} rows')
print(f'  validation_data: {len(validation_data):,} rows')
print(f'  testing_data:    {len(testing_data):,} rows')

Datasets loaded:
  training_data:   11,610,568 rows
  validation_data: 3,317,307 rows
  testing_data:    1,658,652 rows
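Row counts alone do not show whether the same position ended up in more than one split. Because the three source files are concatenated without deduplication, that could happen if a FEN appears in more than one file. The sketch below is one possible check, not part of the original pipeline: `shared_positions` is a hypothetical helper and the FEN strings in the example are made up.

```python
from collections import Counter


def shared_positions(*splits):
    """Return the set of FEN strings that occur in more than one split.

    Each argument is any iterable of FEN strings, for example
    training_data['FEN'] from this notebook.
    """
    seen_in = Counter()
    for split in splits:
        # set() collapses repeats inside a single split, so only
        # repeats across different splits are counted.
        seen_in.update(set(split))
    return {fen for fen, n_splits in seen_in.items() if n_splits > 1}


# Tiny made-up example (not real rows from the datasets).
train = ["8/8/8/8/8/8/8/K6k w - - 0 1", "8/8/8/8/8/8/8/K5k1 w - - 0 1"]
val = ["8/8/8/8/8/8/8/K6k w - - 0 1"]
test = ["8/8/8/8/8/8/8/K4k2 w - - 0 1"]

overlap = shared_positions(train, val, test)
print(f"Positions shared across splits: {len(overlap)}")
```

With the real data the call would be `shared_positions(training_data['FEN'], validation_data['FEN'], testing_data['FEN'])`. On roughly 16.6 million rows this builds large in-memory sets, so it may need a machine with enough RAM.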


In [10]:
# verify the splits

total = len(training_data) + len(validation_data) + len(testing_data)
print(f'Total rows: {total:,}')
print(f'  Train:      {len(training_data):,} ({len(training_data)/total*100:.1f}%)')
print(f'  Validation: {len(validation_data):,} ({len(validation_data)/total*100:.1f}%)')
print(f'  Test:       {len(testing_data):,} ({len(testing_data)/total*100:.1f}%)')

# columns present in all three splits
common_columns = set(training_data.columns) & set(validation_data.columns) & set(testing_data.columns)
print(f'\nColumns in common: {len(common_columns)}')

print('\nFirst 3 rows of training_data:')
print(training_data.head(3))
print('\nFirst 3 rows of validation_data:')
print(validation_data.head(3))
print('\nFirst 3 rows of testing_data:')
print(testing_data.head(3))

Total rows: 16,586,527
  Train:      11,610,568 (70.0%)
  Validation: 3,317,307 (20.0%)
  Test:       1,658,652 (10.0%)

Columns in common: 2

First 3 rows of training_data:
                                                 FEN Evaluation
0  r4rk1/pp1qnppp/1n2p1b1/2RpP1B1/1P1P2PN/1Q5P/P3...        +34
1  r4rk1/1ppbqppp/p1n1p3/3n4/P1QP4/5NP1/1P1NPPBP/...          0
2  r1q2r2/ppp2B1k/2bp1Qnp/8/4P3/5N1P/PPP3P1/3RR1K...       -312

First 3 rows of validation_data:
                                                 FEN Evaluation
0  5rk1/2qnppbp/r2p2p1/1NpP4/P3P3/R4P2/1P4PP/2BQK...        +15
1              8/2p2kp1/3p4/3P4/6K1/8/8/8 w - - 1 57      -4795
2  1k1r4/ppq2ppp/2p1pn2/4P3/2P5/2Q3N1/PPK2PPP/4R3...          0

First 3 rows of testing_data:
                                                 FEN Evaluation
0  8/4Np1k/3p1Prp/p1p1bRp1/P7/5RPP/1P5K/4r3 b - -...       -310
1  5rr1/1p3p1p/5k2/1P2pP2/8/2P3PP/3Q1P1K/8 b - - ...       +544
2  r4rk1/p1p3pp/2nqpn2/1Qbpp3/4P3/2PP1N2/