# Model Creation

## Introduction

Welcome to the prediction model creation portion of my basketball prediction project. This is the last notebook needed for a complete prediction model built on manually scraped and pruned data. In the future, I will need to create the failure model, which will take these predictions and use a confidence interval to flag a team as a failure if it loses too many games and as a non-failure if it wins enough.

## Libraries/Imports

In [166]:
# data manipulation
import pandas as pd
import numpy as np

# machine learning imports
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# model selection
from sklearn.model_selection import GridSearchCV
 
# data preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# result metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error

# internal imports
import sys, os
sys.path.append(os.path.abspath(".."))
from src.model import build_preprocessor, build_elasticnet, build_knn

### Data

In [167]:
# load data
master_df = pd.read_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/master_df.csv")
print(master_df.head())

   Season           Team  GP          W          L   WIN%        Min  \
0    2016  Atlanta Hawks  82  48.000000  34.000000  0.585  48.400000   
1    2017  Atlanta Hawks  82  43.000000  39.000000  0.524  48.500000   
2    2018  Atlanta Hawks  82  24.000000  58.000000  0.293  48.100000   
3    2019  Atlanta Hawks  82  29.000000  53.000000  0.354  48.400000   
4    2020  Atlanta Hawks  67  24.477612  57.522388  0.299  59.480597   

          PTS        FGM         FGA  ...  Yw/Franch  YOverall  CareerW  \
0  102.800000  38.600000   84.400000  ...          3         3      146   
1  103.200000  38.100000   84.400000  ...          4         4      189   
2  103.400000  38.200000   85.500000  ...          5         5      213   
3  113.300000  41.400000   91.800000  ...          1         1       29   
4  136.829851  49.689552  110.883582  ...          2         2       49   

   CareerL  CareerW%  FirstRoundPicks  SecondRoundPicks  Coach_Count  \
0      100     0.593         1.000000       

I will now create the training and testing data, using the 2016-2023 seasons (8 seasons) for training and the 2024 and 2025 seasons (2 seasons) for testing. Because I am predicting seasons to come, I want a time-based split that still approximates the traditional 80/20 split.

In [168]:
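# train/test split (time-based); .copy() makes each subset an independent DataFrame,
# so adding columns later does not trigger SettingWithCopyWarning
test_seasons = [2024, 2025]
master_test = master_df[master_df["Season"].isin(test_seasons)].copy()
master_train = master_df[~master_df["Season"].isin(test_seasons)].copy()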

print("Train shape:", master_train.shape)
print("Test shape:", master_test.shape)

Train shape: (260, 57)
Test shape: (68, 57)
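
With the shapes printed above, the training share is 260 / (260 + 68) ≈ 0.79, close to the intended 80/20. The same split could be wrapped in a small helper that also guards against leakage. This is only a sketch: `time_split` and the toy frame are hypothetical names, not part of `src/model.py`.

```python
import pandas as pd


def time_split(df, test_seasons, season_col="Season"):
    """Split rows by season so that no test season appears in training."""
    is_test = df[season_col].isin(test_seasons)
    train, test = df[~is_test].copy(), df[is_test].copy()
    # guard against leakage: every training season must precede every test season
    assert train[season_col].max() < test[season_col].min()
    return train, test


# toy frame with one row per season (hypothetical data)
toy = pd.DataFrame({"Season": list(range(2016, 2026)), "W": list(range(10))})
train, test = time_split(toy, [2024, 2025])
train_share = len(train) / len(toy)
print(train.shape, test.shape, train_share)
```

The assertion fails loudly if a future edit accidentally puts a later season into the training set.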


In [169]:
# Numeric features (continuous or counts)
numeric_features = [
    "GP", "W", "L", "WIN%", "Min", "PTS", "FGM", "FGA", "FG%",
    "3PM", "3PA", "3P%", "FTM", "FTA", "FT%", "OREB", "DREB",
    "REB", "AST", "TOV", "STL", "BLK", "BLKA", "PF", "PFD",
    "PLUS_MINUS", "Home_W", "Home_L", "Road_W", "Road_L",
    "E_W", "E_L", "W_W", "W_L", "Pre-ASG_W", "Pre-ASG_L",
    "Post-ASG_W", "Post-ASG_L", "SOS", "Yw/Franch", "YOverall",
    "CareerW", "CareerL", "CareerW%", "FirstRoundPicks","SecondRoundPicks", "Coach_Count", "Payroll",
    
    # Player-aggregated features
    "avg_age", "avg_pts_top10", "avg_production_score", "injury_rate"
]

# Categorical features (labels, identifiers, strings)
categorical_features = [
    "Season"
]

# Define target column (predict next season’s wins)
target_column = "NWins"

### Models

In [170]:
# --- Training data ---
X_train = master_train.drop(columns=[target_column])   # features only
y_train = master_train[target_column]                  # target = next season wins

# --- Testing data ---
X_test = master_test.drop(columns=[target_column])     # features only
y_test = master_test[target_column]                    # target = next season wins (NaN for 2025)

After building the training and testing sets, I will use elastic net for feature selection because the original dataset has so many predictors.
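
The helpers imported from `src/model.py` are not shown in this notebook. The following is a minimal sketch of what such a preprocessor and elastic net search might look like, built only from the scikit-learn pieces imported above. The names `sketch_preprocessor` and `sketch_elasticnet`, the imputation strategies, and the 4 x 3 parameter grid are all assumptions for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def sketch_preprocessor(numeric_cols, categorical_cols):
    """Impute + scale numeric columns, impute + one-hot encode categorical ones."""
    numeric = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
    categorical = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    )
    return make_column_transformer(
        (numeric, numeric_cols), (categorical, categorical_cols)
    )


def sketch_elasticnet(preprocessor, X, y, cv=10):
    """Grid-search an elastic net over a hypothetical 4 x 3 = 12 candidate grid."""
    pipe = make_pipeline(preprocessor, ElasticNet(max_iter=10_000))
    grid = {
        "elasticnet__alpha": [0.01, 0.1, 1.0, 10.0],
        "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
    }
    search = GridSearchCV(pipe, grid, cv=cv, scoring="neg_root_mean_squared_error")
    return search.fit(X, y)


# toy data standing in for master_train (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "W": rng.integers(15, 65, 80).astype(float),
    "SOS": rng.normal(0, 1, 80),
    "Season": rng.choice([2016, 2017, 2018], 80),
})
toy_y = 0.6 * toy["W"] + 16 + rng.normal(0, 3, 80)

search = sketch_elasticnet(sketch_preprocessor(["W", "SOS"], ["Season"]), toy, toy_y, cv=5)
print(search.best_params_)
```

Because `make_pipeline` names steps after their lowercased class names, a pipeline built this way exposes the `columntransformer` and `elasticnet` steps.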

In [171]:
# build preprocessor for the full feature set
preprocessor = build_preprocessor(numeric_features, categorical_features)

# fit the elastic net used for feature selection
elasticnet_model = build_elasticnet(preprocessor, X_train, y_train)

Fitting 10 folds for each of 12 candidates, totalling 120 fits


In [172]:
# --- Get feature importances (coefficients) ---
best_pipeline = elasticnet_model.best_estimator_

# 1. Get one-hot encoder feature names
cat_ohe = best_pipeline.named_steps['columntransformer'] \
    .named_transformers_['pipeline-2'] \
    .named_steps['onehotencoder'] \
    .get_feature_names_out(categorical_features)

# 2. Combine numeric + categorical feature names
all_features = numeric_features + list(cat_ohe)

# 3. Get coefficients
coefficients = best_pipeline.named_steps['elasticnet'].coef_

feature_importance = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefficients
}).sort_values(by="Coefficient", key=abs, ascending=False)

print(feature_importance.head(30))  # top 30 features


                 Feature  Coefficient
25            PLUS_MINUS     2.463480
2                      L    -1.642310
1                      W     1.641432
3                   WIN%     1.622578
38                   SOS    -0.250511
27                Home_L    -0.201869
44       FirstRoundPicks    -0.198625
26                Home_W     0.173738
39             Yw/Franch    -0.000000
42               CareerL    -0.000000
41               CareerW    -0.000000
40              YOverall    -0.000000
0                     GP    -0.000000
37            Post-ASG_L     0.000000
36            Post-ASG_W    -0.000000
35             Pre-ASG_L    -0.000000
43              CareerW%     0.000000
45      SecondRoundPicks     0.000000
33                   W_L     0.000000
46           Coach_Count    -0.000000
47               Payroll     0.000000
48               avg_age    -0.000000
49         avg_pts_top10    -0.000000
50  avg_production_score     0.000000
51           injury_rate    -0.000000
52          

The elastic net has shrunk the coefficients of the uninformative predictors to zero, so they no longer affect the prediction. As these results show, the predictors left after feature selection are:
- PLUS_MINUS
- L
- W
- WIN%
- SOS
- Home_L
- FirstRoundPicks
- Home_W

In [173]:
# --- Keep only features with non-zero coefficients ---
important_features = [
    f for f, c in zip(all_features, coefficients) if abs(c) > 1e-6
]

# --- Reduce train/test sets to those features ---
X_train_reduced = X_train[important_features].copy()
X_test_reduced = X_test[important_features].copy()

Now that I've saved the important features, I will run the KNN model using the function in src/model.py, which builds a KNN pipeline and tunes its hyperparameters with GridSearchCV. I set it up this way deliberately: having first used the elastic net to reduce the predictors, I can now fit a more flexible model without excessive computation time, since it runs only on the features the elastic net kept.
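
As with the elastic net helper, `build_knn` lives in `src/model.py` and is not shown here. A minimal sketch of a grid-searched KNN pipeline follows. The name `sketch_knn` and the 8 x 2 x 3 = 48 candidate grid are assumptions chosen for illustration; the real grid may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def sketch_knn(preprocessor, X, y, cv=10):
    """Grid-search a KNN regressor over a hypothetical 48-candidate grid."""
    pipe = make_pipeline(preprocessor, KNeighborsRegressor())
    grid = {
        "kneighborsregressor__n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17],
        "kneighborsregressor__weights": ["uniform", "distance"],
        "kneighborsregressor__p": [1, 2, 3],  # Minkowski distance exponent
    }
    search = GridSearchCV(pipe, grid, cv=cv, scoring="neg_root_mean_squared_error")
    return search.fit(X, y)


# toy data standing in for X_train_reduced / y_train (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "PLUS_MINUS": rng.normal(0, 5, 60),
    "WIN%": rng.uniform(0.2, 0.8, 60),
})
toy_y = 41 + 2.5 * toy["PLUS_MINUS"] + rng.normal(0, 3, 60)

pre = make_column_transformer((RobustScaler(), ["PLUS_MINUS", "WIN%"]))
search = sketch_knn(pre, toy, toy_y, cv=5)
print(search.best_params_)
```

Scaling matters more for KNN than for most models because distances are computed directly on the transformed features, which is why the scaler sits inside the pipeline.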

In [174]:
# get reduced features from your selection step
reduced_numeric_features = [col for col in X_train_reduced.columns if col in numeric_features]
reduced_categorical_features = [col for col in X_train_reduced.columns if col in categorical_features]

preprocessor_reduced = build_preprocessor(reduced_numeric_features, reduced_categorical_features)

knn_model = build_knn(preprocessor_reduced, X_train_reduced, y_train)

Fitting 10 folds for each of 48 candidates, totalling 480 fits


This marks the current end of the ML model portion of this notebook. The rest is geared towards understanding the results of the model (accuracy, predictive power, etc.).

## Results

First, I will test the model's predictive power using the 2024 NBA season as testing data.
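
One way to keep predictions and actuals aligned is to score each row against its own `NWins`, assuming (as the target comment states) that `NWins` on a row holds that team's wins in the following season. The sketch below also reports a naive "same wins as last season" baseline for context. `score_next_season` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def score_next_season(df, pred_col="Pred_NWins", target_col="NWins", baseline_col="W"):
    """RMSE/MAE of the predictions and of a 'repeat last season' baseline."""
    scored = df.dropna(subset=[target_col])  # rows without a known target are skipped
    err = scored[pred_col] - scored[target_col]
    base_err = scored[baseline_col] - scored[target_col]
    return {
        "rmse": float(np.sqrt((err ** 2).mean())),
        "mae": float(err.abs().mean()),
        "baseline_rmse": float(np.sqrt((base_err ** 2).mean())),
        "baseline_mae": float(base_err.abs().mean()),
    }


# toy rows: team D has no target yet, like the Season 2025 rows
toy = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "W": [50, 30, 41, 20],
    "NWins": [48, 36, 41, np.nan],
    "Pred_NWins": [46.0, 34.0, 44.0, 30.0],
})
scores = score_next_season(toy)
print(scores)
```

If the model's errors are not clearly below the baseline's, the features are adding little beyond last season's record.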

In [175]:
# Predict next-season wins for every test row (Seasons 2024 and 2025)
master_test["Pred_NWins"] = knn_model.predict(X_test)

total_required_wins = (82 / 2) * 30  # = 1230

# --- Compare 2023 → 2024 ---
eval_2024 = master_test[master_test["Season"] == 2024].copy()
eval_2024 = eval_2024.groupby("Team", as_index=False)["Pred_NWins"].mean()

# Normalize 2024 predictions
pred_sum_2024 = eval_2024["Pred_NWins"].sum()
scaling_factor_2024 = total_required_wins / pred_sum_2024
eval_2024["Pred_NWins"] = eval_2024["Pred_NWins"] * scaling_factor_2024

train_2023 = (
    master_train[master_train["Season"] == 2023]
    .groupby("Team", as_index=False)["NWins"].mean()
    .rename(columns={"NWins": "NWins_2023true"})
)

eval_2024 = eval_2024.merge(train_2023, on="Team")
eval_2024 = eval_2024[["Team", "NWins_2023true", "Pred_NWins"]]

rmse_2024 = np.sqrt(mean_squared_error(eval_2024["NWins_2023true"], eval_2024["Pred_NWins"]))
mae_2024 = mean_absolute_error(eval_2024["NWins_2023true"], eval_2024["Pred_NWins"])

print("2023 → 2024 Predictions")
print(eval_2024)
print("Total normalized wins:", round(eval_2024["Pred_NWins"].sum()))
print("RMSE:", rmse_2024, "MAE:", mae_2024)

2023 → 2024 Predictions
                      Team  NWins_2023true  Pred_NWins
0            Atlanta Hawks       39.333333   39.333229
1           Boston Celtics       64.000000   50.402542
2            Brooklyn Nets       38.500000   44.994055
3        Charlotte Hornets       21.000000   32.868216
4            Chicago Bulls       39.000000   42.201844
5      Cleveland Cavaliers       48.000000   46.450984
6         Dallas Mavericks       50.000000   41.580628
7           Denver Nuggets       57.000000   46.978181
8          Detroit Pistons       14.000000   29.210934
9    Golden State Warriors       46.000000   39.139384
10         Houston Rockets       41.000000   43.738006
11          Indiana Pacers       47.000000   44.850759
12    Los Angeles Clippers       51.000000   42.528739
13      Los Angeles Lakers       47.000000   43.749361
14       Memphis Grizzlies       27.000000   30.028467
15              Miami Heat       46.000000   45.538259
16         Milwaukee Bucks       49.00000



Next, I will make the predictions for the 2025-2026 NBA season, scaling them so that the league as a whole has the correct number of wins.
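
Every game produces exactly one win, so a 30-team league playing 82 games each must total 82 × 30 / 2 = 1230 wins. Rounding each scaled prediction independently can leave the total slightly off that figure. A minimal sketch of one way around this is largest-remainder rounding, shown here in plain Python; `normalize_wins` and the four-team toy league are hypothetical.

```python
def normalize_wins(raw_preds, games_per_team=82):
    """Scale predictions to the league total, then round so the sum is exact."""
    total = games_per_team * len(raw_preds) // 2  # one win per game played
    factor = total / sum(raw_preds.values())
    scaled = {team: pred * factor for team, pred in raw_preds.items()}

    # start from the floor of each scaled value ...
    wins = {team: int(value) for team, value in scaled.items()}
    leftover = total - sum(wins.values())

    # ... then hand the remaining wins to the largest fractional parts
    by_fraction = sorted(scaled, key=lambda t: scaled[t] - wins[t], reverse=True)
    for team in by_fraction[:leftover]:
        wins[team] += 1
    return wins


# four-team toy league: 82 * 4 / 2 = 164 wins to distribute
toy = {"A": 50.0, "B": 45.0, "C": 40.0, "D": 35.5}
result = normalize_wins(toy)
print(result, sum(result.values()))
```

Here the scaled values are roughly 48.09, 43.28, 38.48 and 34.15, so the floors sum to 163 and the one leftover win goes to team C, which has the largest fractional part.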

In [176]:
# get the predictions
master_test["Pred_NWins"] = knn_model.predict(X_test)

results = master_test[["Season", "Team", "W", "Pred_NWins"] + important_features]
results_2025 = results[results["Season"] == 2025].copy()
results_2025 = results_2025.drop_duplicates(subset=["Team"]).reset_index(drop=True)



In [177]:
total_required_wins = (82 / 2) * 30
pred_sum = results_2025["Pred_NWins"].sum()
scaling_factor = total_required_wins / pred_sum

# replace Pred_NWins with normalized version
results_2025["Pred_NWins"] = (results_2025["Pred_NWins"] * scaling_factor).round()

print(results_2025)
print("Check total:", results_2025["Pred_NWins"].sum())

    Season                    Team     W  Pred_NWins     W     L   WIN%  \
0     2025           Atlanta Hawks  40.0        35.0  40.0  42.0  0.488   
1     2025          Boston Celtics  61.0        50.0  61.0  21.0  0.744   
2     2025           Brooklyn Nets  26.0        31.0  26.0  56.0  0.317   
3     2025       Charlotte Hornets  19.0        31.0  19.0  63.0  0.232   
4     2025           Chicago Bulls  39.0        40.0  39.0  43.0  0.476   
5     2025     Cleveland Cavaliers  64.0        56.0  64.0  18.0  0.780   
6     2025        Dallas Mavericks  39.0        43.0  39.0  43.0  0.476   
7     2025          Denver Nuggets  50.0        42.0  50.0  32.0  0.610   
8     2025         Detroit Pistons  44.0        47.0  44.0  38.0  0.537   
9     2025   Golden State Warriors  48.0        41.0  48.0  34.0  0.585   
10    2025         Houston Rockets  52.0        49.0  52.0  30.0  0.634   
11    2025          Indiana Pacers  50.0        48.0  50.0  32.0  0.610   
12    2025    Los Angeles

### Summary Figures

In [178]:
# summary figure

### Serialize Results

In [179]:
# Save as normal CSV
results_2025.to_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/results_2025.csv", index=False)

# Save as gzip-compressed CSV
results_2025.to_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/processed/master-data/results_2025.csv.gz", index=False, compression="gzip")
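
A minimal sketch of writing the same two files relative to a project root, so the notebook does not depend on one machine's directory layout. `save_results` is a hypothetical helper, and putting both files under `data/processed/master-data` is an assumption made to keep the example short. A temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_results(df, project_root, name="results_2025"):
    """Write a plain CSV and a gzip-compressed CSV under the project root."""
    out_dir = Path(project_root) / "data" / "processed" / "master-data"
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{name}.csv"
    gz_path = out_dir / f"{name}.csv.gz"
    df.to_csv(csv_path, index=False)
    df.to_csv(gz_path, index=False, compression="gzip")
    return csv_path, gz_path


toy = pd.DataFrame({"Team": ["A", "B"], "Pred_NWins": [48, 34]})
with tempfile.TemporaryDirectory() as root:
    csv_path, gz_path = save_results(toy, root)
    # read_csv infers gzip compression from the .gz suffix
    round_trip = pd.read_csv(gz_path)

print(round_trip)
```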

## Discussion

I still have a lot of work left to do on the predictive power of this model; however, I am happy with how far I have gotten in less than a week! In the future, I will try to use strength of schedule (SOS) as a better predictive element. Currently the model predicts from the previous season's SOS, when it should use the next season's strength of schedule, calculated from how good the teams on that schedule were in the previous year.
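
A minimal sketch of that idea, assuming the upcoming schedule is available as a count of games against each opponent. It averages each opponent's previous-season WIN%, weighted by games scheduled. `projected_sos`, the three-team schedule, and the win percentages are hypothetical, and this is only one possible definition; it is not necessarily on the same scale as the existing `SOS` column.

```python
def projected_sos(schedule, prev_win_pct):
    """Games-weighted average of opponents' previous-season WIN% for each team."""
    sos = {}
    for team, opponents in schedule.items():
        games = sum(opponents.values())
        weighted = sum(prev_win_pct[opp] * n for opp, n in opponents.items())
        sos[team] = weighted / games
    return sos


# previous-season records (toy values)
prev_win_pct = {"A": 0.750, "B": 0.500, "C": 0.250}

# upcoming schedule: games against each opponent (symmetric by construction)
schedule = {
    "A": {"B": 4, "C": 4},
    "B": {"A": 4, "C": 2},
    "C": {"A": 4, "B": 2},
}

sos = projected_sos(schedule, prev_win_pct)
print(sos)
```

In this toy league the strongest team, A, gets the easiest projected schedule (0.375) because it never has to play itself, which is the same effect real SOS figures show for top teams.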

Looking back, I am happy I was able to tune the model as much as I did, but when I come back I will do much more feature engineering and data processing. I am excited for what's to come, and I hope anyone reading this can enjoy the project for what it currently is!

#### LeBron James is the GOAT