## Data Processing

This notebook processes modeled NBA playoff datasets from 2015–2025, preparing them for machine learning model training and evaluation.  
It applies data cleaning, feature preprocessing, and season-based train, validation, and test splits. Fitting the preprocessor on the training seasons only keeps the splits consistent and prevents data leakage.

#### Combining All Modeled Datasets
This section merges all playoff round/game modeling outputs into a single master dataset. Combining these datasets into one file ensures that all features and game contexts are available for consistent preprocessing and model training.

##### Overview of Steps:
1. **Import Libraries**: Load `pandas` for data handling. `glob` is also imported, although the file paths are listed explicitly in the next step.
2. **Define File Paths**: List the CSV outputs for each playoff round and game scenario.
3. **Load and Combine**: Read each file into a DataFrame and concatenate them into one dataset.
4. **Save Combined Dataset**: Export the merged dataset as `all_modeled_playoff_games.csv` for use in preprocessing.
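
Step 2 could also be done by pattern matching instead of listing every file by hand. The sketch below assumes that every modeling output matches `modeling_round*_game*.csv`, a pattern inferred from the file names used in this notebook. `find_modeling_files` is a hypothetical helper, not part of the original workflow.

```python
import glob
import os


def find_modeling_files(processed_dir, pattern="modeling_round*_game*.csv"):
    """Return the sorted paths in processed_dir that match pattern.

    The default pattern is an assumption based on this notebook's file names.
    """
    matches = glob.glob(os.path.join(processed_dir, pattern))
    # glob returns paths in arbitrary order, so sort for a reproducible row order
    return sorted(matches)
```

Calling `find_modeling_files("../data/processed")` would then stand in for the hand-written list. The combined file `all_modeled_playoff_games.csv` does not match the pattern, so it is not picked up on a re-run.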

In [39]:
# Import libraries
import pandas as pd
import glob

# Define the file path of each modeling dataset
file_paths = [
    "../data/processed/modeling_round1_game1.csv",
    "../data/processed/modeling_round1_game2.csv",
    "../data/processed/modeling_round1_games3to7.csv",
    "../data/processed/modeling_round2_game1.csv",
    "../data/processed/modeling_round2_game2to7.csv",
    "../data/processed/modeling_round3_game1.csv",
    "../data/processed/modeling_round3_games2to7.csv",
    "../data/processed/modeling_round4_game1.csv",
    "../data/processed/modeling_round4_games2to7.csv"
]

# Load and combine
dfs = [pd.read_csv(path) for path in file_paths]
all_games = pd.concat(dfs, ignore_index=True)

# Export the final dataset
all_games.to_csv("../data/processed/all_modeled_playoff_games.csv", index=False)
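
`pd.concat` aligns frames on column names and fills any column that a file lacks with `NaN`. Before combining, it can help to see which columns each file is missing. The sketch below uses toy frames with invented column names; `missing_columns_report` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def missing_columns_report(frames_by_name):
    """Map each name to the sorted columns it lacks relative to the union of all columns."""
    all_columns = set()
    for frame in frames_by_name.values():
        all_columns.update(frame.columns)
    return {
        name: sorted(all_columns.difference(frame.columns))
        for name, frame in frames_by_name.items()
    }


# Toy frames standing in for two of the per-round CSV files (column names are made up)
frames = {
    "game1": pd.DataFrame({"season": [2015], "homeWin": [1]}),
    "games2to7": pd.DataFrame({"season": [2015], "homeWin": [0], "seriesLead": [1]}),
}
report = missing_columns_report(frames)

# Rows from "game1" get NaN in "seriesLead" after concatenation
combined = pd.concat(list(frames.values()), ignore_index=True)
```

Any column listed in the report ends up with missing values for that file's rows, which the mean imputation step later fills.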

#### Build Processed Feature Sets
We split the combined dataset by season, remove non-predictive identifiers, and apply a preprocessing pipeline fit on the training seasons only. Numeric features get mean imputation followed by standard scaling; categorical features are one-hot encoded, and categories not seen during fitting are ignored rather than raising an error.  
Outputs are saved for downstream modeling.

##### Overview of Steps:
1. **Import Libraries**: Use `pandas` for data handling and `sklearn` for preprocessing.
2. **Load Combined Dataset**: Read the `all_modeled_playoff_games.csv` file.
3. **Season Split**:  
   - Train: 2015–2022  
   - Validation: 2023–2024  
   - Test: 2025
4. **Drop Non-Predictive Columns**: Remove `homeTeam`, `awayTeam`, `gameDate` (identifiers).
5. **Separate Features/Target**: Target is `homeWin`.
6. **Detect Column Types**: Identify numeric and categorical columns.
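7. **Fit Preprocessor**: Fit a `ColumnTransformer` on the training split only. Numeric columns get mean imputation and standard scaling; categorical columns get one-hot encoding.
8. **Transform Splits**: Apply the fitted preprocessor to the train, validation, and test splits.
9. **Save Outputs**: Write the processed feature matrices and labels to CSV, and save the fitted preprocessor and feature names for reuse.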
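
To make the fit and transform steps concrete, here is a minimal sketch on toy data. The column names `netRating` and `homeConference` are invented for illustration, and the encoder omits `drop="first"` to keep the example small. It shows that a missing numeric value is filled with the training mean and that a category unseen during fitting becomes an all-zero one-hot row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; the column names are invented for illustration
train = pd.DataFrame({
    "netRating": [1.0, 3.0, np.nan, 5.0],
    "homeConference": ["East", "West", "East", "West"],
})
# The validation row has a missing value and a category never seen in train
val = pd.DataFrame({"netRating": [np.nan], "homeConference": ["Central"]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["netRating"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["homeConference"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_preprocessor.fit(train)  # statistics come from the training rows only
val_out = toy_preprocessor.transform(val)
# val_out is [[0., 0., 0.]]: the imputed train mean scales to 0,
# and the unseen category maps to zeros in both one-hot columns
```

Because the mean and standard deviation are learned from `train` alone, nothing about the validation row influences how it is scaled.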

In [44]:
# Import Libraries
import json
import os
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import joblib

# Define input and output paths
all_models  = "../data/processed/all_modeled_playoff_games.csv"
processed_dir = "../data/processed"
model_dir = "../models"
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

# Load Combined Dataset
df = pd.read_csv(all_models)

# Season Split
train_df = df[df["season"].between(2015, 2022)].copy()
val_df   = df[df["season"].between(2023, 2024)].copy()
test_df  = df[df["season"] == 2025].copy()

# Drop Non-Predictive Columns
cols_to_drop = ["homeTeam", "awayTeam", "gameDate"]
for _d in (train_df, val_df, test_df):
    # errors="ignore" skips any listed column that is absent
    _d.drop(columns=cols_to_drop, inplace=True, errors="ignore")

# Separate Target and Features
target = "homeWin"
y_train = train_df[target]
X_train = train_df.drop(columns=[target])
y_val = val_df[target]
X_val = val_df.drop(columns=[target])
y_test = test_df[target]
X_test = test_df.drop(columns=[target])

# Detect Column Types
numeric_cols = X_train.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

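# Build Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler())
        ]), numeric_cols),
        # With drop="first", an unknown category encodes as all zeros, the same
        # as the dropped first category (this combination needs scikit-learn >= 1.0).
        ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), categorical_cols)
    ],
    remainder="drop",
    sparse_threshold=0  # always return a dense array so it can go straight into a DataFrame
)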

# Fit on TRAIN only, then transform all splits
preprocessor.fit(X_train)

X_train_proc = preprocessor.transform(X_train)
X_val_proc   = preprocessor.transform(X_val)
X_test_proc  = preprocessor.transform(X_test)

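# Retrieve post-transform feature names
def get_feature_names_out(preprocessor, num_cols, cat_cols):
    """Return output column names: prefixed numeric names, then one-hot names."""
    num_feats = [f"num__{c}" for c in num_cols]
    # Guard against an empty categorical list: there are no one-hot names to add
    if not cat_cols:
        return num_feats
    cat_encoder = preprocessor.named_transformers_["cat"]
    cat_feats = cat_encoder.get_feature_names_out(cat_cols).tolist()
    return num_feats + cat_feats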

feature_names = get_feature_names_out(preprocessor, numeric_cols, categorical_cols)

# Convert to DataFrames with consistent column order
import numpy as np
X_train_df = pd.DataFrame(X_train_proc, columns=feature_names, index=X_train.index)
X_val_df   = pd.DataFrame(X_val_proc,   columns=feature_names, index=X_val.index)
X_test_df  = pd.DataFrame(X_test_proc,  columns=feature_names, index=X_test.index)

# Save processed matrices & labels
X_train_df.to_csv(f"{processed_dir}/X_train_processed.csv", index=False)
X_val_df.to_csv(f"{processed_dir}/X_val_processed.csv", index=False)
X_test_df.to_csv(f"{processed_dir}/X_test_processed.csv", index=False)

y_train.to_csv(f"{processed_dir}/y_train.csv", index=False)
y_val.to_csv(f"{processed_dir}/y_val.csv", index=False)
y_test.to_csv(f"{processed_dir}/y_test.csv", index=False)

# Save preprocessor and feature names 
joblib.dump(preprocessor, f"{model_dir}/final_preprocessor.pkl")
with open(f"{model_dir}/feature_names.json", "w") as f:
    json.dump(feature_names, f)
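
A downstream notebook would typically reload these artifacts and confirm that the feature names still match the transformed width. The sketch below is self-contained and uses toy stand-ins written to a temporary directory; only the file names `final_preprocessor.pkl` and `feature_names.json` come from this notebook, and the toy scaler and column name are assumptions for illustration.

```python
import json
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the saved artifacts; the real ones live in ../models
toy_X = pd.DataFrame({"netRating": [1.0, 2.0, 3.0]})
toy_scaler = StandardScaler().fit(toy_X)
toy_names = ["num__netRating"]

with tempfile.TemporaryDirectory() as tmp_dir:
    joblib.dump(toy_scaler, os.path.join(tmp_dir, "final_preprocessor.pkl"))
    with open(os.path.join(tmp_dir, "feature_names.json"), "w") as f:
        json.dump(toy_names, f)

    # Downstream: reload both artifacts
    loaded = joblib.load(os.path.join(tmp_dir, "final_preprocessor.pkl"))
    with open(os.path.join(tmp_dir, "feature_names.json")) as f:
        loaded_names = json.load(f)

    # The number of names must equal the number of transformed columns
    transformed = loaded.transform(toy_X)
    assert transformed.shape[1] == len(loaded_names)
    reloaded_df = pd.DataFrame(transformed, columns=loaded_names)
```

Because the processed CSVs are written with `index=False`, features and labels line up by row position rather than by index when they are read back.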