# Model Creation

## Introduction

Welcome to the model creation portion of my basketball prediction project. This is the last notebook needed for a complete prediction model built on manually scraped and pruned data. In the future, I will need to create a failure model that takes these predictions and uses a confidence interval to flag a team as failed if it loses too many games and as not failed if it wins enough.

## Libraries/Imports

In [249]:
# data manipulation
import pandas as pd
import numpy as np

# machine learning imports
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# model selection
from sklearn.model_selection import GridSearchCV
 
# data preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# result metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error

# internal imports
import sys, os
sys.path.append(os.path.abspath(".."))
from src.model import *
from src.utils import nba_teams, team_map
from src.data_loader import *

### Data

In [250]:
# load data
master_df = pd.read_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/master_df.csv")
print(master_df.head())

   Season           Team  GP          W          L   WIN%        Min  \
0    2016  Atlanta Hawks  82  48.000000  34.000000  0.585  48.400000   
1    2017  Atlanta Hawks  82  43.000000  39.000000  0.524  48.500000   
2    2018  Atlanta Hawks  82  24.000000  58.000000  0.293  48.100000   
3    2019  Atlanta Hawks  82  29.000000  53.000000  0.354  48.400000   
4    2020  Atlanta Hawks  67  24.477612  57.522388  0.299  59.480597   

          PTS        FGM         FGA  ...  Yw/Franch  YOverall  CareerW  \
0  102.800000  38.600000   84.400000  ...          3         3      146   
1  103.200000  38.100000   84.400000  ...          4         4      189   
2  103.400000  38.200000   85.500000  ...          5         5      213   
3  113.300000  41.400000   91.800000  ...          1         1       29   
4  136.829851  49.689552  110.883582  ...          2         2       49   

   CareerL  CareerW%  FirstRoundPicks  SecondRoundPicks  Coach_Count  \
0      100     0.593         1.000000       

I will now create the training and testing data. The training data covers the 2016-2022 seasons (7 seasons) and the testing data covers the 2023-2025 seasons (3 seasons). I split by season because I am predicting seasons to come, so the held-out data should be the most recent seasons; this keeps the spirit of the traditional 80/20 split while making it time-based.

In [251]:
# train/test split
master_test = master_df[master_df["Season"].isin([2023, 2024, 2025])]
master_train = master_df[~master_df["Season"].isin([2023, 2024, 2025])]

print("Train shape:", master_train.shape)
print("Test shape:", master_test.shape)

Train shape: (227, 57)
Test shape: (101, 57)
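
With the shapes printed above, 227 of 328 rows (about 69%) land in training. As a minimal sketch of the same season-based split, the helper below wraps the logic in a function and returns explicit copies. The name `time_based_split` and the toy frame are illustrative assumptions, not part of `src/`.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, test_seasons: list[int], season_col: str = "Season"):
    """Split rows by season so that no held-out season appears in the training data."""
    is_test = df[season_col].isin(test_seasons)
    # .copy() returns independent frames, so columns can be added later
    # without triggering pandas' SettingWithCopyWarning.
    return df[~is_test].copy(), df[is_test].copy()


# Toy frame: two teams, seasons 2016-2025 (hypothetical data)
toy = pd.DataFrame({
    "Season": list(range(2016, 2026)) * 2,
    "Team": ["A"] * 10 + ["B"] * 10,
})

train, test = time_based_split(toy, test_seasons=[2023, 2024, 2025])
print(f"train={len(train)}, test={len(test)}, test fraction={len(test) / len(toy):.0%}")
```

Printing the test fraction next to the shapes makes it easy to see how far a season-based split sits from a nominal 80/20.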


In [252]:
# Numeric features (continuous or counts)
numeric_features = [
    "GP", "W", "L", "WIN%", "Min", "PTS", "FGM", "FGA", "FG%",
    "3PM", "3PA", "3P%", "FTM", "FTA", "FT%", "OREB", "DREB",
    "REB", "AST", "TOV", "STL", "BLK", "BLKA", "PF", "PFD",
    "PLUS_MINUS", "Home_W", "Home_L", "Road_W", "Road_L",
    "E_W", "E_L", "W_W", "W_L", "Pre-ASG_W", "Pre-ASG_L",
    "Post-ASG_W", "Post-ASG_L", "SOS", "Yw/Franch", "YOverall",
    "CareerW", "CareerL", "CareerW%", "FirstRoundPicks","SecondRoundPicks", "Coach_Count", "Payroll",
    
    # Player-aggregated features
    "avg_age", "avg_pts_top10", "avg_production_score", "injury_rate"
]

# Categorical features (labels, identifiers, strings)
categorical_features = [
    "Season"
]

# Define target column (predict next season’s wins)
target_column = "NWins"

### Models

In [253]:
# --- Training data ---
X_train = master_train.drop(columns=["NWins"])   # drop target column
y_train = master_train["NWins"]                  # target = next season wins

# --- Testing data ---
X_test = master_test.drop(columns=["NWins"])     # drop target column
y_test = master_test["NWins"]                    # target = next season wins (NaN for 2025)

After loading my training and testing data, I will now use Elastic Net for feature selection because of the number of predictors in my original dataset.

In [254]:
# build preprocessor for reduced features
preprocessor = build_preprocessor(numeric_features, categorical_features)

# choose a model
elasticnet_model = build_elasticnet(preprocessor, X_train, y_train)

Fitting 10 folds for each of 12 candidates, totalling 120 fits
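
The helpers `build_preprocessor` and `build_elasticnet` live in `src/model.py` and are not shown in this notebook. The following is only a sketch of what they might look like, inferred from the imports above, the step names used in the next cell (`columntransformer`, `pipeline-2`, `onehotencoder`, `elasticnet`), and the "10 folds, 12 candidates" log line. The `_sketch` function names, imputation strategies, grid values, scoring metric, and toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def build_preprocessor_sketch(numeric_features, categorical_features):
    """Impute and scale numeric columns; impute and one-hot encode categorical columns."""
    numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
    categorical_pipe = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    )
    # make_column_transformer auto-names the two sub-pipelines "pipeline-1" and "pipeline-2"
    return make_column_transformer(
        (numeric_pipe, numeric_features),
        (categorical_pipe, categorical_features),
    )


def build_elasticnet_sketch(preprocessor, X, y):
    """Grid-search an ElasticNet pipeline with 10-fold CV (grid values are guesses)."""
    pipeline = make_pipeline(preprocessor, ElasticNet(max_iter=10_000))
    param_grid = {
        "elasticnet__alpha": [0.01, 0.1, 1.0, 10.0],  # 4 values
        "elasticnet__l1_ratio": [0.2, 0.5, 0.8],      # x 3 values = 12 candidates
    }
    search = GridSearchCV(pipeline, param_grid, cv=10, scoring="neg_mean_squared_error")
    return search.fit(X, y)


# Toy data (hypothetical): one informative column, one pure-noise column
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "Season": rng.choice([2016, 2017, 2018], size=n),
})
y_toy = 41 + 10 * X_toy["signal"] + rng.normal(scale=0.5, size=n)

search = build_elasticnet_sketch(
    build_preprocessor_sketch(["signal", "noise"], ["Season"]), X_toy, y_toy
)
coefs = search.best_estimator_.named_steps["elasticnet"].coef_
print(search.best_params_)
print(coefs)
```

Because the L1 part of the penalty can push coefficients to exactly zero, the fitted `coef_` vector can double as a feature filter, which is how the next cells use it.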


In [255]:
# --- Get feature importances (coefficients) ---
best_pipeline = elasticnet_model.best_estimator_

# 1. Get one-hot encoder feature names
cat_ohe = best_pipeline.named_steps['columntransformer'] \
    .named_transformers_['pipeline-2'] \
    .named_steps['onehotencoder'] \
    .get_feature_names_out(categorical_features)

# 2. Combine numeric + categorical feature names
all_features = numeric_features + list(cat_ohe)

# 3. Get coefficients
coefficients = best_pipeline.named_steps['elasticnet'].coef_

feature_importance = pd.DataFrame({
    "Feature": all_features,
    "Coefficient": coefficients
}).sort_values(by="Coefficient", key=abs, ascending=False)

print(feature_importance.head(30))  # top 30 features


                 Feature  Coefficient
25            PLUS_MINUS     2.788172
2                      L    -1.564991
1                      W     1.564768
3                   WIN%     1.562024
8                    FG%     0.298940
27                Home_L    -0.273870
38                   SOS    -0.203967
44       FirstRoundPicks    -0.192135
26                Home_W     0.092364
43              CareerW%     0.000000
36            Post-ASG_W     0.000000
37            Post-ASG_L     0.000000
35             Pre-ASG_L    -0.000000
39             Yw/Franch     0.000000
40              YOverall    -0.000000
41               CareerW    -0.000000
42               CareerL     0.000000
0                     GP    -0.000000
45      SecondRoundPicks     0.000000
33                   W_L    -0.000000
46           Coach_Count    -0.000000
47               Payroll     0.000000
48               avg_age    -0.000000
49         avg_pts_top10     0.000000
50  avg_production_score     0.000000
51          

The Elastic Net model has shrunk the coefficients of the predictors it found least useful to zero, so they do not affect the final prediction. As these results show, the predictors left after feature selection are:
- PLUS_MINUS
- L
- W
- WIN%
- FG%
- SOS
- Home_L
- FirstRoundPicks
- Home_W

Now that I have identified the important features, I will run the KNN model using the function in src/model.py, which builds a KNN pipeline and tunes its hyperparameters with GridSearchCV. I set it up this way as a best practice: Elastic Net first reduces the predictors, so the more flexible KNN model can be tuned without excessive computation time because it runs only on the features that survived selection.

In [256]:
# --- Step 1: Keep features whose Elastic Net coefficient is non-zero ---
important_features = [f for f, c in zip(all_features, coefficients) if abs(c) > 1e-6]

# --- Step 2: Reduce train/test sets to only those features ---
X_train_reduced = X_train[important_features].copy()
X_test_reduced  = X_test[important_features].copy()

# --- Step 3: Build preprocessor for reduced features ---
reduced_numeric_features = [col for col in important_features if col in numeric_features]
reduced_categorical_features = [col for col in important_features if col in categorical_features]
preprocessor_reduced = build_preprocessor(reduced_numeric_features, reduced_categorical_features)

# --- Step 4: Train KNN on the reduced feature set ---
knn_model = build_knn(
    preprocessor_reduced,
    X_train_reduced,
    y_train
)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
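
`build_knn` is also defined in `src/model.py` and is not shown here. As a hedged sketch, a grid-searched KNN pipeline consistent with the "5 folds, 48 candidates" log line could look like the following. The function name `build_knn_sketch`, the grid values, the scoring metric, and the toy data are assumptions. Scaling comes before the regressor because KNN is distance-based, so unscaled columns with large ranges would dominate the neighbor search.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def build_knn_sketch(preprocessor, X, y):
    """Tune a KNN regressor with 5-fold CV; this guessed grid yields 48 candidates."""
    pipeline = make_pipeline(preprocessor, KNeighborsRegressor())
    param_grid = {
        "kneighborsregressor__n_neighbors": [3, 5, 7, 9, 11, 13, 15, 17],        # 8 values
        "kneighborsregressor__weights": ["uniform", "distance"],                 # x 2
        "kneighborsregressor__metric": ["euclidean", "manhattan", "chebyshev"],  # x 3 = 48
    }
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
    return search.fit(X, y)


# Toy data (hypothetical): next-season wins driven mostly by point differential
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    "PLUS_MINUS": rng.normal(0, 5, 80),
    "WIN%": rng.uniform(0.2, 0.8, 80),
})
y_toy = 41 + 2.5 * X_toy["PLUS_MINUS"] + rng.normal(0, 2, 80)

# A bare RobustScaler stands in for the notebook's column-transformer preprocessor
knn_search = build_knn_sketch(RobustScaler(), X_toy, y_toy)
print(knn_search.best_params_)
```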


This marks the current end of the ML model portion of this notebook. The rest is geared towards understanding the model's results (accuracy, predictive power, etc.).

## Results

First, I will test the predictive power on the 2024 NBA season: the predictions made from the 2023 rows are forecasts of each team's 2024 wins.

In [257]:
# --- Get the predictions ---
# Work on an explicit copy so the new column is not written to a slice of master_df
# (this avoids pandas' SettingWithCopyWarning)
master_test = master_test.copy()
master_test["Pred_NWins"] = knn_model.predict(X_test_reduced)

# Keep only necessary columns
results = master_test[["Season", "Team", "W", "Pred_NWins"] + important_features]

# Filter for 2023 season
results_2023 = results[results["Season"] == 2023].copy()

# Drop duplicates by team
results_2023 = results_2023.drop_duplicates(subset=["Team"]).reset_index(drop=True)

# --- Normalize predictions so league total = 1230 ---
total_required_wins = (82 / 2) * 30  # 1230
pred_sum = results_2023["Pred_NWins"].sum()
scaling_factor = total_required_wins / pred_sum

results_2023["Pred_NWins"] = (results_2023["Pred_NWins"] * scaling_factor).round()


In [258]:
# --- Get the predictions ---
# Filter for 2024 season
results_2024 = results[results["Season"] == 2024].copy()

# Drop duplicates by team
results_2024 = results_2024.drop_duplicates(subset=["Team"]).reset_index(drop=True)

# --- Normalize predictions so league total = 1230 ---
total_required_wins = (82 / 2) * 30  # 1230
pred_sum = results_2024["Pred_NWins"].sum()
scaling_factor = total_required_wins / pred_sum

results_2024["Pred_NWins"] = (results_2024["Pred_NWins"] * scaling_factor).round()
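# Pred_Wins = wins predicted for this season from the previous season's rows.
# The assignment aligns on the row index, so both frames must list teams in the same order.
results_2024["Pred_Wins"] = results_2023["Pred_NWins"]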

# Drop any duplicate columns just in case
results_2024 = results_2024.loc[:, ~results_2024.columns.duplicated()].copy()

# --- Output ---
print(results_2024)
print("Check total:", results_2024["Pred_NWins"].sum())


    Season                    Team     W  Pred_NWins     L   WIN%   FG%  \
0     2024           Atlanta Hawks  36.0        39.0  46.0  0.439  46.5   
1     2024          Boston Celtics  64.0        56.0  18.0  0.780  48.7   
2     2024           Brooklyn Nets  32.0        42.0  50.0  0.390  45.6   
3     2024       Charlotte Hornets  21.0        34.0  61.0  0.256  46.0   
4     2024           Chicago Bulls  39.0        39.0  43.0  0.476  47.0   
5     2024     Cleveland Cavaliers  48.0        47.0  34.0  0.585  47.9   
6     2024        Dallas Mavericks  50.0        42.0  32.0  0.610  48.1   
7     2024          Denver Nuggets  57.0        44.0  25.0  0.695  49.6   
8     2024         Detroit Pistons  14.0        28.0  68.0  0.171  46.3   
9     2024   Golden State Warriors  46.0        46.0  36.0  0.561  47.7   
10    2024         Houston Rockets  41.0        46.0  41.0  0.500  45.9   
11    2024          Indiana Pacers  47.0        46.0  35.0  0.573  50.7   
12    2024    Los Angeles
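
The cell above copies `results_2023["Pred_NWins"]` into `results_2024["Pred_Wins"]`, and pandas aligns that assignment on the row index. After `reset_index`, the result is only correct if both frames list the same teams in the same order. A minimal sketch of a key-based alternative, using toy frames with hypothetical values, looks each team up by name instead:

```python
import pandas as pd

# Hypothetical frames: "prev" holds last season's forecasts, "curr" holds this season's results
prev = pd.DataFrame({"Team": ["A", "B", "C"], "Pred_NWins": [39.0, 56.0, 42.0]})
# Same teams in a different row order, to show why positional alignment is fragile
curr = pd.DataFrame({"Team": ["C", "A", "B"], "W": [32.0, 36.0, 64.0]})

# Series.map looks up each team by name; the lookup index must be unique,
# which holds here because each team appears once per frame.
pred_by_team = prev.set_index("Team")["Pred_NWins"]
curr["Pred_Wins"] = curr["Team"].map(pred_by_team)
print(curr)
```

A team missing from the previous season would get `NaN` here rather than silently inheriting another team's forecast.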

Next, I will make the predictions for the 2025-2026 NBA season, scaling them so that the league as a whole has the correct number of wins.

In [259]:
# get the predictions
results_2025 = results[results["Season"] == 2025].copy()
results_2025 = results_2025.drop_duplicates(subset=["Team"]).reset_index(drop=True)

total_required_wins = (82 / 2) * 30
pred_sum = results_2025["Pred_NWins"].sum()
scaling_factor = total_required_wins / pred_sum

# replace Pred_NWins with normalized version
results_2025["Pred_NWins"] = (results_2025["Pred_NWins"] * scaling_factor).round()
# Pred_Wins = wins predicted for 2025 from the 2024 rows (aligned on the row index)
results_2025["Pred_Wins"] = results_2024["Pred_NWins"]

results_2025 = results_2025.loc[:, ~results_2025.columns.duplicated()].copy()

print(results_2025)
print("Check total:", results_2025["Pred_NWins"].sum()) 

    Season                    Team     W  Pred_NWins     L   WIN%   FG%  \
0     2025           Atlanta Hawks  40.0        40.0  42.0  0.488  47.2   
1     2025          Boston Celtics  61.0        51.0  21.0  0.744  46.2   
2     2025           Brooklyn Nets  26.0        29.0  56.0  0.317  43.7   
3     2025       Charlotte Hornets  19.0        34.0  63.0  0.232  43.0   
4     2025           Chicago Bulls  39.0        42.0  43.0  0.476  47.0   
5     2025     Cleveland Cavaliers  64.0        55.0  18.0  0.780  49.1   
6     2025        Dallas Mavericks  39.0        48.0  43.0  0.476  47.9   
7     2025          Denver Nuggets  50.0        46.0  32.0  0.610  50.6   
8     2025         Detroit Pistons  44.0        46.0  38.0  0.537  47.6   
9     2025   Golden State Warriors  48.0        41.0  34.0  0.585  45.1   
10    2025         Houston Rockets  52.0        48.0  30.0  0.634  45.5   
11    2025          Indiana Pacers  50.0        44.0  32.0  0.610  48.8   
12    2025    Los Angeles
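
The scaling step used in the cells above multiplies every prediction by 1230 / (sum of predictions) and then rounds each team independently. Plain rounding does not guarantee that the rounded values still sum to exactly 1230. One possible way to enforce the total is largest-remainder rounding. This is a technique I am adding for illustration, not what the notebook does. The function name `normalize_wins` and the toy inputs are assumptions.

```python
import numpy as np


def normalize_wins(pred, games_per_team=82, n_teams=30):
    """Rescale predictions to the league total, then round while preserving that total."""
    pred = np.asarray(pred, dtype=float)
    target_total = games_per_team * n_teams // 2  # every game produces exactly one win
    scaled = pred * target_total / pred.sum()
    wins = np.floor(scaled).astype(int)
    shortfall = target_total - wins.sum()         # wins still unassigned after flooring
    # Hand the leftover wins to the teams with the largest fractional parts
    order = np.argsort(-(scaled - wins), kind="stable")
    wins[order[:shortfall]] += 1
    return wins


rng = np.random.default_rng(0)
raw_predictions = rng.uniform(25, 60, size=30)    # hypothetical raw model outputs
normalized = normalize_wins(raw_predictions)
print(normalized.sum())  # 1230

# Tiny example: 3 teams playing 2 games each, so 3 wins to distribute
small = normalize_wins([50, 30, 20], games_per_team=2, n_teams=3)
print(small)
```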

### Summary Figures

In [260]:
# --- Calculate within-threshold accuracy and add it to the DataFrame ---
def calculate_accuracy(df: pd.DataFrame, threshold: int, save_path: str | None = None) -> pd.DataFrame:
    df = df.copy()
    df["within_threshold"] = (df["Pred_Wins"] - df["W"]).abs() <= threshold
    accuracy = df["within_threshold"].mean()

    print(f"Accuracy on {df['Season'].iloc[0]} season (±{threshold} wins): {accuracy:.2%}")

    # Save updated results with accuracy column if path provided
    if save_path:
        df.to_csv(save_path, index=False)
        print(f"Saved results with accuracy to {save_path}")

    return df

# Run calculations
results_2025 = calculate_accuracy(results_2025, threshold=10)
results_2024 = calculate_accuracy(results_2024, threshold=10)

# Combine into one file
combined = pd.concat([results_2024, results_2025], ignore_index=True)
combined.to_csv(
    "/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/test_results.csv",
    index=False
)
print("Saved combined results with accuracy for 2024 and 2025.")


Accuracy on 2025 season (±10 wins): 63.33%
Accuracy on 2024 season (±10 wins): 70.00%
Saved combined results with accuracy for 2024 and 2025.
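
The threshold accuracy above counts a prediction that is 1 win off the same as one that is 10 wins off. As a complement, here is a small sketch of continuous error metrics (MAE, RMSE, and mean signed error). The helper name `error_summary` and the toy numbers are assumptions. `sklearn.metrics.mean_absolute_error` and `mean_squared_error`, imported at the top of this notebook, compute the same MAE and MSE.

```python
import numpy as np


def error_summary(actual, predicted, threshold=10):
    """Summarize prediction error in wins."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual
    return {
        "mae": float(np.mean(np.abs(errors))),         # average miss, in wins
        "rmse": float(np.sqrt(np.mean(errors ** 2))),  # penalizes large misses more
        "bias": float(np.mean(errors)),                # > 0 means over-prediction on average
        "within_threshold": float(np.mean(np.abs(errors) <= threshold)),
    }


# Toy values (hypothetical): errors are -5, +2, +11, 0
summary = error_summary(actual=[50, 40, 30, 20], predicted=[45, 42, 41, 20])
print(summary)
```

In the notebook, `actual` would be the `W` column and `predicted` the `Pred_Wins` column of `results_2024` or `results_2025`.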


### Serialize Results

In [261]:
# # Save as normal CSV
# results_2025.to_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/raw/master-stats/results_2025.csv", index=False)

# # Save as gzip-compressed CSV
# results_2025.to_csv("/Users/trustanprice/Desktop/Personal/Basketball-Predictions/data/processed/master-data/results_2025.csv.gz", index=False, compression="gzip")

## Discussion

I still have a lot of work left to improve the predictive power of this model; however, I am happy with how far I have gotten in less than a week! In the future, I will try to make Strength of Schedule (SOS) a better predictive element. Currently the model predicts from the previous season's SOS, when it should use the next season's strength of schedule, calculated from how good the opposing teams were in the previous year.
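
As a rough sketch of that idea, a forward-looking SOS could be the average of each scheduled opponent's previous-season win percentage, weighted by the number of games against that opponent. The notebook does not show how the existing `SOS` column is computed, so the function name `projected_sos`, the schedule structure, and the numbers below are all hypothetical.

```python
def projected_sos(schedule, prev_win_pct):
    """Average opponents' previous-season win percentage, weighted by games scheduled."""
    sos = {}
    for team, opponents in schedule.items():
        total_games = sum(opponents.values())
        weighted = sum(prev_win_pct[opp] * games for opp, games in opponents.items())
        sos[team] = weighted / total_games
    return sos


# Hypothetical three-team league: last season's win percentages
prev_win_pct = {"A": 0.75, "B": 0.50, "C": 0.25}
# Hypothetical upcoming schedule: team -> {opponent: number of games}
schedule = {
    "A": {"B": 4, "C": 4},
    "B": {"A": 4, "C": 4},
    "C": {"A": 4, "B": 4},
}

sos = projected_sos(schedule, prev_win_pct)
print(sos)
```

In this toy league the weakest team from last season faces the hardest projected schedule, because it never plays itself.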

Looking back, I am happy I was able to tune the model as much as I did, but when I come back I will do much more feature engineering and data processing. I am excited for what's to come, and hopefully anyone reading this is able to enjoy the project for what it currently is!

#### LeBron James is the GOAT