# Model Creation

## Introduction

Welcome to the model creation portion of my basketball prediction project. This is the last notebook needed to have a complete prediction model built on manually scraped and pruned data. In the future, I will need to create a failure model that takes these predictions and uses a confidence interval to flag a team as failed if it loses too many games and as not failed if it wins enough.

## Methods

In [234]:
# imports
import pandas as pd

# machine learning imports
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# model selection
from sklearn.model_selection import GridSearchCV
 
# data preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

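# src imports
# (a `%reset -f` at this point would wipe every import above; if a clean
# namespace is needed, run it in its own cell before the imports)
import sys, os
sys.path.append(os.path.abspath(".."))
from src.model import build_preprocessor, build_elasticnet, build_knn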

ImportError: cannot import name 'build_elasticnet' from 'src.model' (/Users/trustanprice/Desktop/Personal/Basketball-Predictions/src/model.py)

In [235]:
import src.model
print(src.model.__file__)
print(dir(src.model))


/Users/trustanprice/Desktop/Personal/Basketball-Predictions/src/model.py
['OneHotEncoder', 'RobustScaler', 'SimpleImputer', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'build_preprocessor', 'make_column_transformer', 'make_pipeline']
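
The listing above shows that the loaded `src.model` exposes only `build_preprocessor`, which explains the `ImportError`. If `build_elasticnet` and `build_knn` were added to `src/model.py` after the kernel first imported it, a stale cached module is one possible cause. The sketch below demonstrates `importlib.reload` on a throwaway module; `demo_model` is a hypothetical stand-in, not the project's file.

```python
import importlib
import pathlib
import sys
import tempfile

# Skip bytecode caching so the quick edit-and-reload below always reads the source.
sys.dont_write_bytecode = True

# Throwaway directory standing in for the project root (hypothetical).
tmp_dir = pathlib.Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_model.py"
module_path.write_text("def build_preprocessor():\n    return 'preprocessor'\n")

sys.path.insert(0, str(tmp_dir))
import demo_model

before = hasattr(demo_model, "build_elasticnet")

# Simulate editing the file after the first import.
module_path.write_text(
    "def build_preprocessor():\n    return 'preprocessor'\n\n"
    "def build_elasticnet():\n    return 'elasticnet'\n"
)

importlib.invalidate_caches()
demo_model = importlib.reload(demo_model)
after = hasattr(demo_model, "build_elasticnet")
print(before, after)
```

If the functions are simply not defined in the file on disk, reloading will not help and they need to be written first.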


### Data

In [None]:
# load data
master_df = pd.read_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/master_df.csv")
print(master_df.head())

Next, I create the training and testing sets: the 2016-2023 seasons (8 seasons) for training and the 2024 and 2025 seasons (2 seasons) for testing. Because the goal is to predict upcoming seasons, the split is time-based rather than random, while still approximating the traditional 80/20 ratio.

In [None]:
# train/test split
master_test = master_df[master_df["Season"].isin([2024, 2025])]
master_train = master_df[~master_df["Season"].isin([2024, 2025])]

print("Train shape:", master_train.shape)
print("Test shape:", master_test.shape)
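
As a sanity check on the time-based split, the sketch below rebuilds it on a synthetic frame and confirms the roughly 80/20 ratio and that every training season precedes every testing season. The 30-rows-per-season layout is an assumption for illustration, not a property of `master_df`.

```python
import pandas as pd

# Synthetic stand-in: 10 seasons with 30 rows each (assumed, for illustration only).
seasons = range(2016, 2026)
toy_df = pd.DataFrame({"Season": [s for s in seasons for _ in range(30)], "W": 41})

test_seasons = [2024, 2025]
is_test = toy_df["Season"].isin(test_seasons)
toy_train, toy_test = toy_df[~is_test], toy_df[is_test]

# Share of rows held out, and a check that no training season follows a test season.
test_share = len(toy_test) / len(toy_df)
train_precedes_test = toy_train["Season"].max() < toy_test["Season"].min()
print(f"test share: {test_share:.2f}, train precedes test: {train_precedes_test}")
```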

In [208]:
# Numeric features (continuous or counts)
numeric_features = [
    "GP", "W", "L", "WIN%", "Min", "PTS", "FGM", "FGA", "FG%",
    "3PM", "3PA", "3P%", "FTM", "FTA", "FT%", "OREB", "DREB",
    "REB", "AST", "TOV", "STL", "BLK", "BLKA", "PF", "PFD",
    "PLUS_MINUS", "Home_W", "Home_L", "Road_W", "Road_L",
    "E_W", "E_L", "W_W", "W_L", "Pre-ASG_W", "Pre-ASG_L",
    "Post-ASG_W", "Post-ASG_L", "SOS", "Yw/Franch", "YOverall",
    "CareerW", "CareerL", "CareerW%", "Pk", "Coach_Count", "Payroll",
    
    # Player-aggregated features
    "avg_age", "avg_pts_top10", "avg_production_score", "injury_rate"
]

# Categorical features (labels, identifiers, strings)
categorical_features = [
    "Season"
]

# Define target column (predict next season’s wins)
target_column = "NWins"

### Models

In [209]:
# --- Training data ---
X_train = master_train.drop(columns=["NWins"])   # drop target column
y_train = master_train["NWins"]                  # target = next season wins

# --- Testing data ---
X_test = master_test.drop(columns=["NWins"])     # drop target column
y_test = master_test["NWins"]                    # target = next season wins (NaN for 2025)
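
Because `NWins` is missing for the 2025 rows, those rows cannot be scored and need to be dropped before computing test metrics. How `NWins` was built is not shown in this notebook; the sketch below assumes one common construction (shifting each team's wins back by one season) on a toy frame, where the `Team` column is a hypothetical name.

```python
import pandas as pd

# Toy data; "Team" is an assumed identifier column, not confirmed for master_df.
toy = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B", "B"],
    "Season": [2023, 2024, 2025, 2023, 2024, 2025],
    "W": [50, 45, 40, 30, 35, 42],
})

# Next-season wins: shift each team's win column up by one season.
toy = toy.sort_values(["Team", "Season"])
toy["NWins"] = toy.groupby("Team")["W"].shift(-1)

# The latest season has no following season, so its target is NaN and is not scorable.
scorable = toy.dropna(subset=["NWins"])
print(scorable)
```

Applied to the test set, the same `dropna(subset=["NWins"])` step would leave only the 2024 rows for evaluation, while the 2025 rows remain usable for forward predictions.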

In [None]:
# build preprocessor for the full feature set
preprocessor = build_preprocessor(numeric_features, categorical_features)

# choose a model
model = build_elasticnet(preprocessor)  

# train
model.fit(X_train, y_train)

print("Best Params:", model.best_params_)
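
Neither helper's body appears in this notebook, so the following is only a guess at their shape. The later cells use `best_params_`, `best_estimator_`, and the step names `columntransformer`, `pipeline-2`, `onehotencoder`, and `elasticnet`, which are the default names produced by `make_pipeline` and `make_column_transformer`. The imputation strategies, `handle_unknown="ignore"`, and the parameter grid below are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def build_preprocessor(numeric_features, categorical_features):
    """Impute and scale numeric columns; impute and one-hot encode categoricals."""
    numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
    categorical_pipe = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        # Assumed: ignore categories (e.g. later seasons) unseen during training.
        OneHotEncoder(handle_unknown="ignore"),
    )
    return make_column_transformer(
        (numeric_pipe, numeric_features),
        (categorical_pipe, categorical_features),
    )


def build_elasticnet(preprocessor, cv=5):
    """Wrap preprocessor + ElasticNet in a grid search (grid values are assumed)."""
    pipeline = make_pipeline(preprocessor, ElasticNet(max_iter=10_000))
    param_grid = {
        "elasticnet__alpha": [0.01, 0.1, 1.0],
        "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
    }
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="neg_mean_absolute_error")


# Synthetic data to exercise the helpers.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "W": rng.integers(20, 60, size=60),
    "PTS": rng.normal(110, 5, size=60),
    "Season": rng.choice([2021, 2022, 2023], size=60),
})
toy_y = toy_X["W"] * 0.8 + rng.normal(0, 2, size=60)

toy_model = build_elasticnet(build_preprocessor(["W", "PTS"], ["Season"]))
toy_model.fit(toy_X, toy_y)

step_names = list(toy_model.best_estimator_.named_steps)
transformer_names = list(
    toy_model.best_estimator_.named_steps["columntransformer"].named_transformers_
)
print(step_names, transformer_names)
```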

In [None]:
# --- Get feature importances (coefficients) ---
best_pipeline = model.best_estimator_

# 1. Get one-hot encoder feature names
cat_ohe = best_pipeline.named_steps['columntransformer'] \
    .named_transformers_['pipeline-2'] \
    .named_steps['onehotencoder'] \
    .get_feature_names_out(categorical_features)

# 2. Combine numeric + categorical feature names
all_features = numeric_features + list(cat_ohe)

# 3. Get coefficients
coefficients = best_pipeline.named_steps['elasticnet'].coef_

feature_importance = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefficients
}).sort_values(by="Coefficient", key=abs, ascending=False)

print(feature_importance.head(30))  # top 30 features by absolute coefficient


In [None]:
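# --- Keep only features with non-zero coefficients ---
nonzero_features = [
    f for f, c in zip(all_features, coefficients) if abs(c) > 1e-6
]

# Map one-hot names (e.g. "Season_2016") back to their source column so that
# every selected name is an actual column of X_train
important_features = [f for f in numeric_features if f in nonzero_features] + [
    col for col in categorical_features
    if any(name.startswith(f"{col}_") for name in nonzero_features)
]

print("Selected features:", important_features)

# --- Reduce train/test sets to the selected columns ---
X_train_reduced = X_train[important_features].copy()
X_test_reduced = X_test[important_features].copy()

print("Reduced train shape:", X_train_reduced.shape)
print("Reduced test shape:", X_test_reduced.shape)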

In [None]:

# split the selected columns back into numeric and categorical groups
reduced_numeric_features = [col for col in X_train_reduced.columns if col in numeric_features]
reduced_categorical_features = [col for col in X_train_reduced.columns if col in categorical_features]

preprocessor_reduced = build_preprocessor(reduced_numeric_features, reduced_categorical_features)

knn = KNeighborsRegressor()

# Minkowski distance (the default metric) with p=1 is Manhattan and with p=2 is
# Euclidean, so searching over p alone covers both without duplicate candidates
param_grid = {
    'kneighborsregressor__n_neighbors': [3, 5, 7, 10],
    'kneighborsregressor__weights': ['uniform', 'distance'],
    'kneighborsregressor__p': [1, 2]
}

pipeline = make_pipeline(
    preprocessor_reduced,
    knn,
)

model = GridSearchCV(
    pipeline, 
    param_grid, 
    cv=10, 
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
    verbose=1
)

model.fit(X_train_reduced, y_train)
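
With the Minkowski metric, `p=1` is the Manhattan distance and `p=2` is the Euclidean distance, so searching over `p` already covers both. The sketch below checks that equivalence on synthetic data, which is unrelated to the basketball features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=40)
X_new = rng.normal(size=(5, 3))


def knn_predict(**kwargs):
    """Fit a 5-neighbor regressor with the given distance settings and predict."""
    return KNeighborsRegressor(n_neighbors=5, **kwargs).fit(X, y).predict(X_new)


# Named metrics versus their Minkowski equivalents.
same_l1 = np.allclose(knn_predict(metric="manhattan"), knn_predict(metric="minkowski", p=1))
same_l2 = np.allclose(knn_predict(metric="euclidean"), knn_predict(metric="minkowski", p=2))
print(same_l1, same_l2)
```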

## Results

In [214]:
# report model metrics
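
This cell is still a placeholder. One way it might be filled in is sketched below, using MAE (the grid-search scoring metric) and RMSE, and skipping rows whose target is missing. The numbers are made-up toy values, not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets and predictions; the NaN mimics a season with no known next-season wins.
y_true = np.array([45.0, 40.0, np.nan, 35.0])
y_pred = np.array([43.0, 44.0, 41.0, 35.0])

# Score only the rows that have a target.
has_target = ~np.isnan(y_true)
mae = mean_absolute_error(y_true[has_target], y_pred[has_target])
rmse = np.sqrt(mean_squared_error(y_true[has_target], y_pred[has_target]))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```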

In [215]:
# summary figure

In [216]:
# serialize model
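
This cell is also a placeholder. A minimal sketch of serializing a fitted estimator with `joblib` follows; the toy model and temporary file path are illustrative assumptions. For a fitted `GridSearchCV`, `model.best_estimator_` would be a natural object to save.

```python
import pathlib
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy fitted model standing in for the tuned pipeline.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=30)
toy_model = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Save to a temporary location (hypothetical path) and load it back.
model_path = pathlib.Path(tempfile.mkdtemp()) / "toy_model.joblib"
joblib.dump(toy_model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
round_trip_ok = np.allclose(toy_model.predict(X), restored.predict(X))
print(round_trip_ok)
```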

## Discussion