# Final Inference Pipeline
This step builds a **single, deployable pipeline** that bundles preprocessing (imputation, scaling, one-hot encoding) and a classifier (**SVM**).  
It is designed for the **simulator**, so inference needs only one artifact: `final_pipeline_model.pkl`.

#### Overview of Steps

1. **Import Libraries**: Load `pandas` for data handling, scikit-learn transformers (`StandardScaler`, `OneHotEncoder`, `SimpleImputer`, `ColumnTransformer`), pipeline utilities, `SVC` for classification, and `joblib`/`os` for saving artifacts.
2. **Load Dataset**: Read the merged modeling dataset `all_modeled_playoff_games.csv`.
3. **Season-Based Split**: Create Train (2015–2022), Validation (2023–2024), and Test (2025 holdout) partitions.
4. **Drop Non-Predictive Columns**: Remove identifiers that should not drive predictions (`homeTeam`, `awayTeam`, `gameDate`).
5. **Separate Target and Features**: Target is `homeWin`; form `X_train/X_val/X_test` and `y_train/y_val/y_test`.
6. **Identify Column Types**: Detect numeric vs. categorical feature lists to drive the transformers.
7. **Build Preprocessing Pipeline**  
   - **Numeric:** mean imputation → standard scaling  
   - **Categorical:** one-hot encoding with `drop='first'` and `handle_unknown='ignore'`
8. **Create Full Pipeline (Preprocess + Model)**: Chain the preprocessor with `SVC(kernel='rbf', C=1, probability=True, class_weight='balanced')` so training and inference run consistently.
9. **Train Model**: Fit the full pipeline on the training split only.
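10. **Validate**: Predict on the validation split and print a `classification_report` (precision, recall, F1, and accuracy).
11. **Save Pipeline**: Write the fitted pipeline to `../model/final_pipeline_model.pkl` with `joblib`.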

# Import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import joblib
import os

# Load all modeled playoff games dataset file
df = pd.read_csv("../data/processed/all_modeled_playoff_games.csv")

# Train/Val/Test Split
train_df = df[df['season'].between(2015, 2022)].copy()
val_df   = df[df['season'].between(2023, 2024)].copy()
test_df  = df[df['season'] == 2025].copy()


# Drop Non-Predictive Columns
columns_to_drop = ['homeTeam', 'awayTeam', 'gameDate']
train_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
val_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
test_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Separate Target and Features
target = 'homeWin'

y_train = train_df[target]
X_train = train_df.drop(columns=[target])

y_val = val_df[target]
X_val = val_df.drop(columns=[target])

y_test = test_df[target]
X_test = test_df.drop(columns=[target])

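# Identify Numeric & Categorical Columns
# Note: only int64/float64 and object/category columns are selected here. Columns of any
# other dtype (e.g. bool) are silently dropped by ColumnTransformer's default remainder='drop'.
numeric_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()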

# Build Preprocessing Pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), numeric_cols),
    
    # Combining drop='first' with handle_unknown='ignore' requires scikit-learn >= 1.0
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
])

# Create Full Pipeline (Preprocessing + Model)
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SVC(kernel='rbf', C=1, probability=True, class_weight='balanced'))
])

# Train Model
full_pipeline.fit(X_train, y_train)

# Evaluate on Validation Set
val_preds = full_pipeline.predict(X_val)
print("\nValidation Performance:")
print(classification_report(y_val, val_preds))

# Save Final Pipeline
os.makedirs("../model/", exist_ok=True)
joblib.dump(full_pipeline, "../model/final_pipeline_model.pkl")


Validation Performance:
              precision    recall  f1-score   support

       False       0.49      0.72      0.59        68
        True       0.72      0.49      0.58        98

    accuracy                           0.58       166
   macro avg       0.61      0.61      0.58       166
weighted avg       0.63      0.58      0.58       166



['../model/final_pipeline_model.pkl']
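
Because preprocessing and classifier live in one artifact, the inference side only needs `joblib.load` followed by a prediction call on raw feature rows. The sketch below illustrates that round trip under stated assumptions: it trains a small stand-in pipeline on synthetic data, saves it to a temporary directory rather than `../model/`, and uses two hypothetical numeric columns (`homeRating`, `awayRating`) that are not taken from the real dataset.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data; both column names are hypothetical.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "homeRating": rng.normal(size=60),
    "awayRating": rng.normal(size=60),
})
y_toy = X_toy["homeRating"] > X_toy["awayRating"]  # boolean target, like homeWin

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["homeRating", "awayRating"]),
    ])),
    ("classifier", SVC(kernel="rbf", C=1, probability=True, class_weight="balanced")),
])
toy_pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "final_pipeline_model.pkl")
    joblib.dump(toy_pipeline, model_path)
    # Inference side: a single load restores both preprocessing and classifier.
    loaded = joblib.load(model_path)

# Raw, unscaled rows go straight in; the missing value is imputed by the pipeline.
new_games = pd.DataFrame({
    "homeRating": [1.5, -1.0, np.nan],
    "awayRating": [-0.5, 0.8, 0.0],
})
proba = loaded.predict_proba(new_games)

# Pick the probability column that corresponds to the True (home win) class.
home_win_proba = proba[:, list(loaded.classes_).index(True)]
print(home_win_proba)
```

Two caveats apply to the real artifact as well. `SVC(probability=True)` derives its probabilities from an internal cross-validated calibration, so `predict_proba` can occasionally disagree with `predict` near the decision boundary. A pickled pipeline should also be loaded with the same scikit-learn version that produced it, since compatibility across versions is not guaranteed.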