## Data Processing

This notebook consolidates all playoff datasets (2015–2025), applies consistent preprocessing, and generates train/validation/test splits for model development. The goal is to create standardized feature sets ready for machine learning pipelines while ensuring chronological integrity and reproducibility.

---

#### Combining All Modeled Datasets
This section merges all playoff round/game modeling outputs into a single master dataset. By consolidating these datasets, we ensure that every feature and game context is represented in one place, which supports consistent preprocessing and model training.

##### Key Steps:
1. **Import Libraries**: Load `pandas` for data handling. `glob` is imported as well, although the file paths in this cell are listed explicitly.
2. **Define File Paths**: List the CSV outputs for each playoff round and game scenario.
3. **Load and Combine**: Read each file into a DataFrame and concatenate them into one dataset.
4. **Save Combined Dataset**: Export the merged dataset as `all_modeled_playoff_games.csv` for use in preprocessing.
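The steps above list every file by hand. As a hedged alternative, the sketch below discovers the files with `glob` and tags each row with the file it came from. The helper name `combine_modeling_files`, the `sourceFile` column, and the toy CSVs are assumptions made for illustration; they are not part of the original notebook.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_modeling_files(directory, pattern="modeling_round*.csv"):
    """Read every CSV matching `pattern` in `directory` and stack them row-wise."""
    # Sort so the row order does not depend on filesystem listing order
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r} in {directory!r}")
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # Hypothetical provenance column, not present in the original dataset
        frame["sourceFile"] = os.path.basename(path)
        frames.append(frame)
    # Columns are aligned by name; a column missing from a file becomes NaN
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two tiny made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"gameId": [1, 2], "homeWin": [True, False]}).to_csv(
        os.path.join(tmp, "modeling_round1_game1.csv"), index=False
    )
    pd.DataFrame({"gameId": [3], "homeWin": [True], "extraStat": [0.5]}).to_csv(
        os.path.join(tmp, "modeling_round2_game1.csv"), index=False
    )
    combined = combine_modeling_files(tmp)

print(combined)
```

Pattern-based discovery picks up new files automatically, but it also picks up anything else that matches the pattern, so an explicit list is the safer choice when the set of inputs must be fixed.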

In [1]:
# Import libraries
import pandas as pd
import glob

# Load file paths for each modeling dataset
file_paths = [
    "../data/processed/modeling_round1_game1.csv",
    "../data/processed/modeling_round1_game2.csv",
    "../data/processed/modeling_round1_games3to7.csv",
    "../data/processed/modeling_round2_game1.csv",
    "../data/processed/modeling_round2_game2to7.csv",
    "../data/processed/modeling_round3_game1.csv",
    "../data/processed/modeling_round3_games2to7.csv",
    "../data/processed/modeling_round4_game1.csv",
    "../data/processed/modeling_round4_games2to7.csv"
]

# Load each file and concatenate into a single DataFrame
all_games = pd.concat((pd.read_csv(path) for path in file_paths), ignore_index=True)

# Export the final dataset
all_games.to_csv("../data/processed/all_modeled_playoff_games.csv", index=False)

# Reload the exported file and print the first 5 rows as a sanity check
df = pd.read_csv("../data/processed/all_modeled_playoff_games.csv")
print(df.head())

     gameId  season               homeTeam                awayTeam  homeSeed  \
0  42400151    2025        Houston Rockets   Golden State Warriors       2.0   
1  42400101    2025    Cleveland Cavaliers              Miami Heat       1.0   
2  42400111    2025         Boston Celtics           Orlando Magic       2.0   
3  42400141    2025  Oklahoma City Thunder       Memphis Grizzlies       1.0   
4  42400161    2025     Los Angeles Lakers  Minnesota Timberwolves       3.0   

   awaySeed  homeCourt  homeWin  gameNumber  round  ...  eFG%_home_r4_roll  \
0       7.0       True    False           1    1.0  ...                NaN   
1       8.0       True     True           1    1.0  ...                NaN   
2       7.0       True     True           1    1.0  ...                NaN   
3       8.0       True     True           1    1.0  ...                NaN   
4       6.0       True    False           1    1.0  ...                NaN   

   eFG%_away_r4_roll  TOV%_home_r4_roll  TOV%_away

#### Build Processed Feature Sets
This section splits the combined dataset by season into train/validation/test sets, drops identifier columns, and fits a preprocessing pipeline on the training split only. The pipeline imputes and scales numeric features and one-hot encodes categorical features, so every model receives input in the same format.
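The season-based split can be sketched on toy data together with a check that the three splits are chronologically ordered and non-overlapping. The helper name `split_by_season` and the toy rows are assumptions for illustration; the season boundaries (2015–2022, 2023–2024, 2025) are the ones this notebook uses.

```python
import pandas as pd


def split_by_season(df, train_range=(2015, 2022), val_range=(2023, 2024), test_season=2025):
    """Return (train, val, test) copies of `df` partitioned on the `season` column."""
    train = df[df["season"].between(*train_range)].copy()
    val = df[df["season"].between(*val_range)].copy()
    test = df[df["season"] == test_season].copy()
    return train, val, test


# Made-up rows, one game per season, purely for illustration
toy = pd.DataFrame({
    "season": list(range(2015, 2026)),
    "homeWin": [True, False] * 5 + [True],
})

train_df, val_df, test_df = split_by_season(toy)

# Chronological integrity: every training season precedes every validation
# season, which in turn precedes the test season
assert train_df["season"].max() < val_df["season"].min()
assert val_df["season"].max() < test_df["season"].min()

# No row is lost or duplicated across the three splits
assert len(train_df) + len(val_df) + len(test_df) == len(toy)

print(len(train_df), len(val_df), len(test_df))  # 8 2 1
```

Splitting on whole seasons rather than at random keeps future games out of the training data, which is what the chronological checks verify.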

##### Key Steps:
1. **Import Libraries**: Use `pandas` for data handling, `sklearn` for preprocessing, and `joblib`/`json` for saving artifacts.
2. **Load Combined Dataset**: Read the `all_modeled_playoff_games.csv` file.
3. **Season Split**:  
   - Train: 2015–2022  
   - Validation: 2023–2024  
   - Test: 2025
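4. **Drop Non-Predictive Columns**: Remove the identifier columns `homeTeam`, `awayTeam`, and `gameDate`. Note that `gameId` and `season` are not dropped, so they pass through as numeric features.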
5. **Separate Features/Target**: Target is `homeWin`.
6. **Detect Column Types**: Identify numeric and categorical columns.
7. **Fit Preprocessor**: Fit a `ColumnTransformer` on the training split only; it mean-imputes and standardizes numeric columns and one-hot encodes categorical columns.
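8. **Transform Splits**: Apply the fitted preprocessor to train/val/test.
9. **Save Outputs**: Write the processed matrices, labels, fitted preprocessor, and feature names to disk.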

In [3]:
# Import Libraries
import os
import json
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import joblib

# Define the input file path and output directories
all_models  = "../data/processed/all_modeled_playoff_games.csv"
processed_dir = "../data/processed"
model_dir = "../models"
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

# Load Combined Dataset
df = pd.read_csv(all_models)

# Season Split
train_df = df[df["season"].between(2015, 2022)].copy()
val_df   = df[df["season"].between(2023, 2024)].copy()
test_df  = df[df["season"] == 2025].copy()

# Drop Non-Predictive Columns
cols_to_drop = ["homeTeam", "awayTeam", "gameDate"]
for _d in (train_df, val_df, test_df):
    _d.drop(columns=cols_to_drop, inplace=True, errors="ignore")  # errors="ignore" skips absent columns

# Separate Target and Features
target = "homeWin"
y_train = train_df[target]; X_train = train_df.drop(columns=[target])
y_val   = val_df[target];   X_val   = val_df.drop(columns=[target])
y_test  = test_df[target];  X_test  = test_df.drop(columns=[target])

# Detect Column Types
numeric_cols = X_train.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

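# Build Preprocessor: mean-impute + standardize numeric columns, one-hot encode categorical columns
# Note: with drop="first", an unseen category encodes as all zeros, the same as the dropped category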
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler())
        ]), numeric_cols),
        ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), categorical_cols)
    ],
    remainder="drop"
)

# Fit on TRAIN only, then transform all splits
preprocessor.fit(X_train)

X_train_proc = preprocessor.transform(X_train)
X_val_proc   = preprocessor.transform(X_val)
X_test_proc  = preprocessor.transform(X_test)

# Retrieve post-transform feature names
def get_feature_names_out(preprocessor, num_cols, cat_cols):
    """Return output names: 'num__'-prefixed numeric columns, then unprefixed one-hot columns."""
    num_feats = [f"num__{c}" for c in num_cols]
    cat_encoder = preprocessor.named_transformers_["cat"]
    cat_feats = cat_encoder.get_feature_names_out(cat_cols).tolist()
    return num_feats + cat_feats

feature_names = get_feature_names_out(preprocessor, numeric_cols, categorical_cols)

# Convert to DataFrames with consistent column order
import numpy as np
X_train_df = pd.DataFrame(X_train_proc, columns=feature_names, index=X_train.index)
X_val_df   = pd.DataFrame(X_val_proc,   columns=feature_names, index=X_val.index)
X_test_df  = pd.DataFrame(X_test_proc,  columns=feature_names, index=X_test.index)

# Save processed matrices & labels
X_train_df.to_csv(f"{processed_dir}/X_train_processed.csv", index=False)
X_val_df.to_csv(f"{processed_dir}/X_val_processed.csv", index=False)
X_test_df.to_csv(f"{processed_dir}/X_test_processed.csv", index=False)

y_train.to_csv(f"{processed_dir}/y_train.csv", index=False)
y_val.to_csv(f"{processed_dir}/y_val.csv", index=False)
y_test.to_csv(f"{processed_dir}/y_test.csv", index=False)

# Save preprocessor and feature names 
joblib.dump(preprocessor, f"{model_dir}/final_preprocessor.pkl")
with open(f"{model_dir}/feature_names.json", "w") as f:
    json.dump(feature_names, f)

# Print the first 5 rows for each
print("X_train (processed):")
print(X_train_df.head(), "\n")

print("X_val (processed):")
print(X_val_df.head(), "\n")

print("X_test (processed):")
print(X_test_df.head(), "\n")

print("y_train:")
print(y_train.head(), "\n")

print("y_val:")
print(y_val.head(), "\n")

print("y_test:")
print(y_test.head(), "\n")

X_train (processed):
    num__gameId  num__season  num__homeSeed  num__awaySeed  num__gameNumber  \
24     1.502849     1.497183      -2.059523       2.083832        -1.367616   
25     1.502762     1.497183      -0.812439       0.838660        -1.367616   
26     1.502719     1.497183      -1.435981       1.461246        -1.367616   
27     1.502675     1.497183      -2.059523       2.083832        -1.367616   
28     1.502936     1.497183      -0.812439       0.838660        -1.367616   

    num__round  num__eFG%_home_season  num__TOV%_home_season  \
24   -0.812235               0.918033              -1.398159   
25   -0.812235               0.741434              -0.959716   
26   -0.812235               0.536276              -0.352787   
27   -0.812235               0.820496               1.000534   
28   -0.812235               1.035282               1.135686   

    num__ORB%_home_season  num__DRB%_home_season  ...  num__eFG%_away_r4_roll  \
24              -0.560221             