# Model Creation

## Introduction

Welcome to the model-creation portion of my basketball prediction project. This is the last notebook needed to complete a prediction model built on manually scraped and pruned data. In the future, I will need to create a failure model that takes these predictions and uses a confidence interval to flag a team as failed if it loses too many games and as not failed if it wins enough.

## Methods

In [57]:
# imports
import pandas as pd

# machine learning imports
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# model selection
from sklearn.model_selection import GridSearchCV
 
# data preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# src imports
from src.model import build_preprocessor, build_elasticnet, build_knn

### Data

In [58]:
# load data
master_df = pd.read_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/master_df.csv")
print(master_df.head())

   Season                   Team  GP     W     L   WIN%   Min    PTS   FGM  \
0    2025  Oklahoma City Thunder  82  68.0  14.0  0.829  48.1  120.5  44.6   
1    2025  Oklahoma City Thunder  82  68.0  14.0  0.829  48.1  120.5  44.6   
2    2025  Oklahoma City Thunder  82  68.0  14.0  0.829  48.1  120.5  44.6   
3    2025    Cleveland Cavaliers  82  64.0  18.0  0.780  48.2  121.9  44.5   
4    2025    Cleveland Cavaliers  82  64.0  18.0  0.780  48.2  121.9  44.5   

    FGA  ...            Coach  Yw/Franch  YOverall  CareerW  CareerL  \
0  92.7  ...  Mark Daigneault          5         5      211      189   
1  92.7  ...  Mark Daigneault          5         5      211      189   
2  92.7  ...  Mark Daigneault          5         5      211      189   
3  90.8  ...   Kenny Atkinson          1         5      182      208   
4  90.8  ...   Kenny Atkinson          1         5      182      208   

   CareerW%  Pk  Coach_Count      Payroll  NWins  
0     0.528  15            1  166418720.0    Na

I will now create the training and testing data: the 2016-2023 seasons (8 seasons) for training and the 2024 and 2025 seasons (2 seasons) for testing. Because I am predicting seasons to come, I use a time-based split that approximates the traditional 80/20 ratio instead of a random one.

In [59]:
# train/test split
master_test = master_df[master_df["Season"].isin([2024, 2025])]
master_train = master_df[~master_df["Season"].isin([2024, 2025])]

print("Train shape:", master_train.shape)
print("Test shape:", master_test.shape)

Train shape: (515, 56)
Test shape: (132, 56)
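
The shapes above give a test share of 132 / (515 + 132), or about 20.4%, which is close to the 80/20 target. As a minimal sketch of the same idea on toy data, the helper below splits by season and checks that no test season precedes a training season. The name `time_based_split` and the toy frame are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def time_based_split(df, season_col, test_seasons):
    """Hypothetical helper: rows from `test_seasons` go to test, the rest to train."""
    is_test = df[season_col].isin(test_seasons)
    return df[~is_test], df[is_test]


# Toy data (not the project's): 10 seasons with 3 made-up teams each.
toy = pd.DataFrame({
    "Season": [season for season in range(2016, 2026) for _ in range(3)],
    "Team": ["A", "B", "C"] * 10,
})

train, test = time_based_split(toy, "Season", [2024, 2025])
test_fraction = len(test) / len(toy)

print("Train shape:", train.shape)
print("Test shape:", test.shape)
print("Test fraction:", round(test_fraction, 3))
```

Splitting on the season column keeps every training row earlier in time than every test row, which a random 80/20 split would not guarantee.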


In [60]:
# Numeric features (continuous or counts)
numeric_features = [
    "GP", "W", "L", "WIN%", "Min", "PTS", "FGM", "FGA", "FG%",
    "3PM", "3PA", "3P%", "FTM", "FTA", "FT%", "OREB", "DREB",
    "REB", "AST", "TOV", "STL", "BLK", "BLKA", "PF", "PFD",
    "PLUS_MINUS", "Home_W", "Home_L", "Road_W", "Road_L",
    "E_W", "E_L", "W_W", "W_L", "Pre-ASG_W", "Pre-ASG_L",
    "Post-ASG_W", "Post-ASG_L", "SOS", "Yw/Franch", "YOverall",
    "CareerW", "CareerL", "CareerW%", "Pk", "Coach_Count", "Payroll",
    
    # Player-aggregated features
    "avg_age", "avg_pts_top10", "avg_production_score", "injury_rate"
]

# Categorical features (labels, identifiers, strings)
categorical_features = [
    "Season"
]

# Define target column (predict next season’s wins)
target_column = "NWins"

### Models

In [61]:
# --- Training data ---
X_train = master_train.drop(columns=["NWins"])   # drop target column
y_train = master_train["NWins"]                  # target = next season wins

# --- Testing data ---
X_test = master_test.drop(columns=["NWins"])     # drop target column
y_test = master_test["NWins"]                    # target = next season wins (NaN for 2025)

In [62]:
# build preprocessor for the full feature set
preprocessor = build_preprocessor(numeric_features, categorical_features)

# fit an ElasticNet grid search; its coefficients drive feature selection below
elasticnet_model = build_elasticnet(preprocessor, X_train, y_train)

Fitting 10 folds for each of 12 candidates, totalling 120 fits
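
The helpers in `src.model` are not shown in this notebook. The sketch below is one way they could look. It is inferred only from the imports above, the step names used later (`columntransformer`, `pipeline-2`, `onehotencoder`, `elasticnet`), and the "10 folds, 12 candidates" log line. The function names prefixed with `sketch_`, the imputation strategies, the grid values, and the scoring metric are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def sketch_build_preprocessor(numeric_features, categorical_features):
    """Hypothetical stand-in for src.model.build_preprocessor."""
    numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
    # handle_unknown="ignore" is assumed so unseen seasons do not raise at predict time
    categorical_pipe = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    )
    # make_column_transformer auto-names these "pipeline-1" and "pipeline-2"
    return make_column_transformer(
        (numeric_pipe, numeric_features),
        (categorical_pipe, categorical_features),
    )


def sketch_build_elasticnet(preprocessor, X, y, cv=10):
    """Hypothetical stand-in for src.model.build_elasticnet."""
    pipeline = make_pipeline(preprocessor, ElasticNet(max_iter=10_000))
    # 4 x 3 = 12 candidates; the actual grid values are unknown
    param_grid = {
        "elasticnet__alpha": [0.01, 0.1, 1.0, 10.0],
        "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
    }
    search = GridSearchCV(
        pipeline, param_grid, cv=cv,
        scoring="neg_root_mean_squared_error", verbose=1,
    )
    return search.fit(X, y)


# Tiny synthetic demo (not the project's data).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "PLUS_MINUS": rng.normal(size=80),
    "SOS": rng.normal(size=80),
    "Season": rng.choice([2016, 2017, 2018], size=80),
})
demo_y = 41 + 3 * demo_X["PLUS_MINUS"] + rng.normal(size=80)

demo_search = sketch_build_elasticnet(
    sketch_build_preprocessor(["PLUS_MINUS", "SOS"], ["Season"]), demo_X, demo_y
)
print(demo_search.best_params_)
```

`build_knn` presumably follows the same pattern with `KNeighborsRegressor` as the final step, given the `kneighborsregressor__n_neighbors` and `kneighborsregressor__p` keys that appear in its `cv_results_`.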


In [63]:
# --- Get feature importances (coefficients) ---
best_pipeline = elasticnet_model.best_estimator_

# 1. Get one-hot encoder feature names
cat_ohe = best_pipeline.named_steps['columntransformer'] \
    .named_transformers_['pipeline-2'] \
    .named_steps['onehotencoder'] \
    .get_feature_names_out(categorical_features)

# 2. Combine numeric + categorical feature names
all_features = numeric_features + list(cat_ohe)

# 3. Get coefficients
coefficients = best_pipeline.named_steps['elasticnet'].coef_

feature_importance = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefficients
}).sort_values(by="Coefficient", key=abs, ascending=False)

print(feature_importance.head(30))  # top 30 features by absolute coefficient


                 Feature  Coefficient
25            PLUS_MINUS     3.011835
38                   SOS    -1.342478
2                      L    -1.220410
1                      W     1.219661
3                   WIN%     1.219532
20                   STL     0.565105
4                    Min     0.333147
12                   FTM     0.150562
27                Home_L    -0.074596
43              CareerW%    -0.000000
37            Post-ASG_L     0.000000
36            Post-ASG_W    -0.000000
35             Pre-ASG_L    -0.000000
39             Yw/Franch    -0.000000
40              YOverall     0.000000
41               CareerW    -0.000000
42               CareerL     0.000000
0                     GP    -0.000000
46               Payroll    -0.000000
44                    Pk     0.000000
45           Coach_Count    -0.000000
33                   W_L    -0.000000
47               avg_age    -0.000000
48         avg_pts_top10    -0.000000
49  avg_production_score     0.000000
50          

In [64]:
# --- Step 2: Keep only non-zero features ---
important_features = [
    f for f, c in zip(all_features, coefficients) if abs(c) > 1e-6
]

print("Selected features:", important_features)

# --- Step 3: Reduce train/test sets ---
X_train_reduced = X_train[important_features].copy()
X_test_reduced = X_test[important_features].copy()

print("Reduced train shape:", X_train_reduced.shape)
print("Reduced test shape:", X_test_reduced.shape)

Selected features: ['W', 'L', 'WIN%', 'Min', 'FTM', 'STL', 'PLUS_MINUS', 'Home_L', 'SOS']
Reduced train shape: (515, 9)
Reduced test shape: (132, 9)


In [65]:

# get reduced features from your selection step
reduced_numeric_features = [col for col in X_train_reduced.columns if col in numeric_features]
reduced_categorical_features = [col for col in X_train_reduced.columns if col in categorical_features]

preprocessor_reduced = build_preprocessor(reduced_numeric_features, reduced_categorical_features)

knn_model = build_knn(preprocessor_reduced, X_train_reduced, y_train)

Fitting 10 folds for each of 48 candidates, totalling 480 fits


## Results

In [66]:
# report model metrics
cv_results = pd.DataFrame(knn_model.cv_results_)

cv_results = cv_results[
    [
        "param_kneighborsregressor__n_neighbors",
        "param_kneighborsregressor__p",
        "mean_test_score",
        "std_test_score",
        "rank_test_score",
    ]
].sort_values(by="rank_test_score")

cv_results.columns = [
    "n_neighbors",
    "p",
    "mean_test_rmse",
    "std_test_rmse",
    "rank_test_score",
]

# flip the sign, assuming build_knn scores with neg_root_mean_squared_error
cv_results["mean_test_rmse"] = -cv_results["mean_test_rmse"]

cv_results.head()


    n_neighbors  p  mean_test_rmse  std_test_rmse  rank_test_score
47           10  2        8.931255       1.829801                1
13           10  1        8.931255       1.829801                1
15           10  2        8.931255       1.829801                1
31           10  2        9.081091       1.878438                4
29           10  1        9.081091       1.878438                4
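
The sign flip in the cell above makes sense only if the search was scored with a negated error metric. scikit-learn's `"neg_root_mean_squared_error"` scorer behaves this way because its model-selection tools always maximize the score. A minimal sketch on synthetic data, assuming that scorer is the one `build_knn` uses:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (not the project's).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Each fold's score is -RMSE, so every value is zero or negative.
scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=10), X, y,
    cv=10, scoring="neg_root_mean_squared_error",
)

# Negate to report RMSE on its natural, positive scale.
rmse_per_fold = -scores
print("mean RMSE:", rmse_per_fold.mean())
print("std RMSE:", rmse_per_fold.std())
```

The standard deviation is unaffected by the negation, which is why only `mean_test_rmse` needs its sign changed.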


In [67]:
# --- Step 1: Get 2025 teams ---
test_2025 = master_df[master_df["Season"] == 2025].copy()

# --- Step 2: Drop target (NWins) ---
X_2025 = test_2025.drop(columns=["NWins"], errors="ignore")

# --- Step 3: Predict with trained KNN model ---
test_2025["NWins"] = knn_model.predict(X_2025)

# --- Step 4: Select only required columns ---
results_2025 = test_2025[["W", "NWins", "Season", "Team"]]

# --- Step 5: Drop duplicates by team ---
results_2025 = results_2025.drop_duplicates(subset=["Team"]).reset_index(drop=True)

print(results_2025)


       W      NWins  Season                    Team
0   68.0  49.724750    2025   Oklahoma City Thunder
1   64.0  49.464971    2025     Cleveland Cavaliers
2   61.0  51.591667    2025          Boston Celtics
3   52.0  48.722488    2025         Houston Rockets
4   51.0  46.960380    2025         New York Knicks
5   50.0  41.732522    2025          Indiana Pacers
6   50.0  46.496481    2025    Los Angeles Clippers
7   49.0  50.896277    2025  Minnesota Timberwolves
8   48.0  43.753867    2025   Golden State Warriors
9   48.0  51.510377    2025       Memphis Grizzlies
10  48.0  45.572334    2025         Milwaukee Bucks
11  44.0  39.702885    2025         Detroit Pistons
12  41.0  47.215597    2025           Orlando Magic
13  40.0  54.608953    2025           Atlanta Hawks
14  40.0  43.453704    2025        Sacramento Kings
15  39.0  36.213564    2025           Chicago Bulls
16  39.0  45.667655    2025        Dallas Mavericks
17  37.0  29.523473    2025              Miami Heat
18  36.0  34

In [69]:
# summary figure

In [70]:
# serialize model

## Discussion