# NBA Player Stats Prediction

**Team Members:** Ryan, Momoka, Jesus, Angel, Harshil   
**Course:** CS4661 - Introduction to Data Science  
**Objective:** Predict NBA player statistics using machine learning

---

## Project Overview

This notebook demonstrates a complete machine learning pipeline for predicting NBA player statistics:
- **Target Variables:** PTS (points a player scores in a game) and team win/loss (classification)
- **Models:** Linear Regression, Random Forest, Gradient Boosting
- **Approach:** Modular, reusable functions for scalability and maintainability

## 1. Imports and Setup

In [12]:
import data_utils
import training

## 2. Load and Explore Data

In [13]:
# Load dataset (only need to do this once!)
player_stats_df = data_utils.load_nba_data()

Downloading dataset...
Path to dataset files: /Users/ryan/.cache/kagglehub/datasets/eduardopalmieri/nba-player-stats-season-2425/versions/37

Available CSV files: ['database_24_25.csv']

DATASET OVERVIEW

Dataset shape: (16512, 25)

Column names:
['Player', 'Tm', 'Opp', 'Res', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', 'Data']

First few rows:
          Player   Tm  Opp Res     MP  FG  FGA    FG%  3P  3PA  ...  DRB  TRB  \
0   Jayson Tatum  BOS  NYK   W  30.30  14   18  0.778   8   11  ...    4    4   
1  Anthony Davis  LAL  MIN   W  37.58  11   23  0.478   1    3  ...   13   16   
2  Derrick White  BOS  NYK   W  26.63   8   13  0.615   6   10  ...    3    3   
3   Jrue Holiday  BOS  NYK   W  30.52   7    9  0.778   4    6  ...    2    4   
4  Miles McBride  NYK  BOS   L  25.85   8   10  0.800   4    5  ...    0    0   

   AST  STL  BLK  TOV  PF  PTS  GmSc        Data  
0   10    1    1    1  

## 3. Predict Points Scored (PTS)

Points scored (PTS) is the total number of points a player scores in a single game.
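
As a sanity check on the target, a player's points follow directly from the box-score counts. The sketch below assumes the usual convention that `FG` counts all made field goals (three-pointers included), so every make is worth two points plus one extra for each `3P`. The function name `points_from_box_score` and the sample numbers are illustrative and not part of this project's code.

```python
def points_from_box_score(fg: int, three_p: int, ft: int) -> int:
    """Points implied by made field goals, made threes, and made free throws.

    Assumes `fg` already includes three-pointers (usual box-score convention).
    """
    return 2 * fg + three_p + ft


# Made-up stat line: 10 field goals (3 of them threes) and 4 free throws.
example_pts = points_from_box_score(fg=10, three_p=3, ft=4)
print(example_pts)  # 27
```

Note that the feature list in the pipeline output includes `3P` and `FT` but not `FG`, so the models have to infer made two-pointers from `FGA` and the other features.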

In [14]:
# Run complete pipeline for PTS prediction
training.predict_target(player_stats_df, "PTS", classification=False)


################################################################################
# PREDICTION PIPELINE FOR: PTS
################################################################################

Target variable: PTS
Feature variables (16 total): ['MP', 'FGA', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF']
Final dataset shape: X=(16512, 16), y=(16512,)

Train set: 9907 samples
Test set: 6605 samples

MODEL TRAINING FOR PTS

--------------------------------------------------------------------------------
Model: Linear Regression
--------------------------------------------------------------------------------
RMSE: 2.1743
MAE: 1.5885
R²: 0.9391

Feature Coefficients:
  MP: 0.2254
  FGA: 6.5157
  3P: 4.6443
  3PA: -3.6856
  3P%: -0.0842
  FT: 1.9715
  FTA: 0.2723
  FT%: 0.0838
  ORB: -0.0956
  DRB: 0.1303
  TRB: 0.0647
  AST: -0.0781
  STL: 0.0363
  BLK: 0.0855
  TOV: -0.0170
  PF: -0.0298

----------------------------------------------------

({'Linear Regression': {'RMSE': np.float64(2.174298662942923),
   'MAE': 1.588466536068367,
   'R²': 0.9390734842086711},
  'Random Forest': {'RMSE': np.float64(2.3868183618586643),
   'MAE': 1.716802422407267,
   'R²': 0.9265813007880593},
  'Gradient Boosting': {'RMSE': np.float64(2.27493488059965),
   'MAE': 1.6624387004339314,
   'R²': 0.9333030638190561}},
                        RMSE       MAE        R²
 Linear Regression  2.174299  1.588467  0.939073
 Random Forest      2.386818  1.716802  0.926581
 Gradient Boosting  2.274935  1.662439  0.933303)
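
For reference, the three regression metrics in the table can be computed from scratch as sketched below. This is a minimal illustration using only the standard library; the function name `regression_metrics` and the sample values are made up, and the project's `training` module may compute them differently (for example through scikit-learn).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # MAE: average absolute error, in the same units as the target (points).
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the mean squared error; penalizes large misses more.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R²: share of the target's variance explained by the predictions.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "R²": r2}


# Made-up points totals and predictions, for illustration only.
metrics = regression_metrics([10, 20, 30, 15], [12, 18, 33, 14])
print(metrics)
```

RMSE and MAE are both measured in points, so the Linear Regression row above reads as a typical miss of roughly 1.6 to 2.2 points per player-game.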

## 4. Next Steps (To Be Completed)

### TODO List for Team:

1. **Exploratory Data Analysis (EDA)** - Assigned to: Angel (Week 1-2)
   - Distribution plots for PTS (player-level)
   - Distribution plots for aggregated team statistics (team-level)
   - Correlation heatmaps (both player and team level)
   - Win vs Loss feature comparisons
   - Temporal trends

2. **Feature Engineering** - Assigned to: Ryan + Momoka (Week 1)
   - Encode categorical variables (Tm, Opp, Res)
   - Create derived features (shooting efficiency, etc.)
   - Stretch goal: Rolling averages for player form
   - Stretch goal: PCA (dimensionality reduction)

3. **Team Win Prediction (Classification)** - Assigned to: **Ryan** (Week 1-2)
   - Transform player-level data to team-game level using aggregation
   - Binary classification models (Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier)
   - Evaluate with accuracy, precision, recall, F1-score, ROC curves
   - Compare classification performance across models
   - **Deliverable:** New prediction pipeline for binary classification + results comparison

4. **Hyperparameter Tuning & Additional Modeling** - Assigned to: Jesus (Week 1-2)
   - Implement `tune_hyperparameters()` function with GridSearchCV (cv=5)
   - Add XGBoost and LightGBM for both regression (PTS) and classification (Team Win)
   - Tune hyperparameters for all models (regression and classification)
   - Compare tuned vs baseline models
   - Further instructions are in `models.py`
   - **Deliverable:** Tuned models + comparison table
     
5. **Visualization & Analysis** - Assigned to: Harshil (Week 1-2)
   - Residual plots
   - Feature importance charts
   - Prediction vs actual scatter plots

6. **Documentation** - Assigned to: All (Week 2)
   - Executive summary (Ryan, Momoka)
   - Methodology explanation (Ryan, Jesus)
   - Results interpretation (Ryan, Angel, Harshil)
   - Conclusions and recommendations (All)
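
For item 4, a grid search with 5-fold cross-validation could look roughly like the sketch below. The signature of `tune_hyperparameters()` shown here is hypothetical (the real one is specified in `models.py`), the parameter grid is an arbitrary small example, and the data is synthetic rather than the NBA dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_hyperparameters(estimator, param_grid, X, y, cv=5):
    """Grid-search `param_grid` with k-fold CV and return the fitted search.

    Hypothetical signature: the project's own function may differ.
    """
    search = GridSearchCV(
        estimator,
        param_grid,
        cv=cv,
        # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X, y)
    return search


# Synthetic stand-in for the 16 player-level features and the PTS target.
X, y = make_regression(n_samples=200, n_features=16, noise=5.0, random_state=0)

search = tune_hyperparameters(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [25, 50], "max_depth": [4, None]},
    X,
    y,
)
print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
```

To compare tuned and baseline models fairly, the search should be fitted on the training split only and the winning configuration scored once on the held-out test set.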



## 5. Predict Game Result (Item 3 of Next Steps)

In [15]:
team_stats_df = data_utils.aggregate_team_game_stats(player_stats_df)
team_stats_df

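| | Tm | Opp | Data | Res | FG | FGA | 3P | 3PA | FT | FTA | ... | AST | STL | BLK | TOV | PF | MP | team_fg_pct | team_3p_pct | team_ft_pct | win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL | BOS | 2024-11-04 | L | 37 | 89 | 6 | 31 | 13 | 16 | ... | 23 | 9 | 3 | 19 | 13 | 21.817273 | 0.415730 | 0.193548 | 0.812500 | 0 |
| 1 | ATL | BOS | 2024-11-12 | W | 50 | 99 | 10 | 32 | 7 | 13 | ... | 35 | 16 | 2 | 16 | 17 | 26.665556 | 0.505051 | 0.312500 | 0.538462 | 1 |
| 2 | ATL | BOS | 2025-01-18 | W | 44 | 93 | 9 | 37 | 22 | 28 | ... | 27 | 9 | 10 | 17 | 17 | 29.444444 | 0.473118 | 0.243243 | 0.785714 | 1 |
| 3 | ATL | BRK | 2024-10-23 | W | 39 | 80 | 9 | 28 | 33 | 46 | ... | 25 | 12 | 9 | 16 | 20 | 26.664444 | 0.487500 | 0.321429 | 0.717391 | 1 |
| 4 | ATL | CHI | 2024-11-09 | L | 41 | 89 | 9 | 29 | 22 | 30 | ... | 31 | 8 | 5 | 13 | 19 | 24.001000 | 0.460674 | 0.310345 | 0.733333 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1529 | WAS | PHO | 2025-01-16 | L | 46 | 92 | 13 | 41 | 18 | 27 | ... | 28 | 13 | 4 | 13 | 22 | 23.999000 | 0.500000 | 0.317073 | 0.666667 | 0 |
| 1530 | WAS | PHO | 2025-01-25 | L | 42 | 90 | 18 | 41 | 7 | 9 | ... | 28 | 8 | 5 | 8 | 17 | 26.663333 | 0.466667 | 0.439024 | 0.777778 | 0 |
| 1531 | WAS | SAC | 2025-01-19 | L | 32 | 89 | 10 | 42 | 26 | 32 | ... | 19 | 6 | 1 | 16 | 16 | 18.459231 | 0.359551 | 0.238095 | 0.812500 | 0 |
| 1532 | WAS | SAS | 2024-11-13 | L | 49 | 95 | 18 | 44 | 14 | 16 | ... | 27 | 5 | 5 | 18 | 23 | 26.666667 | 0.515789 | 0.409091 | 0.875000 | 0 |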
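
The player-to-team aggregation can be sketched with a pandas `groupby`, as below. This is an illustrative stand-in and not the actual `data_utils.aggregate_team_game_stats` implementation: the function name, the choice of grouping keys, and the toy rows are assumptions. The percentage columns are computed from summed makes and attempts, which is consistent with the first row of the output above (37 / 89 ≈ 0.4157).

```python
import pandas as pd


def aggregate_team_game_stats_sketch(player_df: pd.DataFrame) -> pd.DataFrame:
    """Collapse player rows into one row per team per game (illustrative)."""
    keys = ["Tm", "Opp", "Data", "Res"]
    count_cols = ["FG", "FGA", "3P", "3PA", "FT", "FTA"]
    # Sum the counting stats over all players on the same team in the same game.
    team = player_df.groupby(keys, as_index=False)[count_cols].sum()
    # Team percentages come from the summed makes and attempts,
    # not from averaging each player's percentage.
    team["team_fg_pct"] = team["FG"] / team["FGA"]
    team["team_3p_pct"] = team["3P"] / team["3PA"]
    team["team_ft_pct"] = team["FT"] / team["FTA"]
    # Binary target: 1 for a win, 0 for a loss.
    team["win"] = (team["Res"] == "W").astype(int)
    return team


# Made-up rows: two players per team in a single game.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Tm": ["AAA", "AAA", "BBB", "BBB"],
    "Opp": ["BBB", "BBB", "AAA", "AAA"],
    "Data": ["2025-01-01"] * 4,
    "Res": ["W", "W", "L", "L"],
    "FG": [5, 3, 4, 2], "FGA": [10, 6, 12, 8],
    "3P": [1, 1, 2, 0], "3PA": [4, 4, 5, 3],
    "FT": [2, 1, 3, 3], "FTA": [2, 2, 4, 4],
})
team_toy = aggregate_team_game_stats_sketch(toy)
print(team_toy[["Tm", "team_fg_pct", "win"]])
```

Each game yields two rows under this scheme, one per team, which would explain the exactly balanced 50% win rate in the classification data.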


In [16]:
training.predict_target(team_stats_df, 'win', classification=True)


################################################################################
# PREDICTION PIPELINE FOR: win
################################################################################

Target variable: win
Feature variables (18 total): ['FG', 'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'MP', 'team_fg_pct', 'team_3p_pct', 'team_ft_pct']
Final dataset shape: X=(1534, 18), y=(1534,)

Train set: 920 samples
Test set: 614 samples
Train win rate: 50.00%
Test win rate: 50.00%

MODEL TRAINING FOR win

--------------------------------------------------------------------------------
Model: Logistic Regression
--------------------------------------------------------------------------------
Accuracy: 0.8436
Precision: 0.8505
Recall: 0.8339
F1: 0.8421
ROC-AUC: 0.9208

Confusion Matrix:
  Predicted:  Loss | Win
Actual Loss:  262 |  45
Actual Win:    51 | 256

Feature Coefficients:
  FG: 0.5745
  FGA: -1.8381
  3P: 0.9303
  3PA: -0.1143
  FT: -0.0

({'Logistic Regression': {'Accuracy': 0.8436482084690554,
   'Precision': 0.8504983388704319,
   'Recall': 0.8338762214983714,
   'F1': 0.8421052631578947,
   'ROC-AUC': 0.9208055257880721},
  'Random Forest': {'Accuracy': 0.7899022801302932,
   'Precision': 0.7763975155279503,
   'Recall': 0.8143322475570033,
   'F1': 0.794912559618442,
   'ROC-AUC': 0.8707413341255611},
  'Gradient Boosting': {'Accuracy': 0.8061889250814332,
   'Precision': 0.7974683544303798,
   'Recall': 0.8208469055374593,
   'F1': 0.8089887640449438,
   'ROC-AUC': 0.8924657025538733}},
                      Accuracy  Precision    Recall        F1   ROC-AUC
 Logistic Regression  0.843648   0.850498  0.833876  0.842105  0.920806
 Random Forest        0.789902   0.776398  0.814332  0.794913  0.870741
 Gradient Boosting    0.806189   0.797468  0.820847  0.808989  0.892466)
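
The Logistic Regression metrics can be checked by hand from its confusion matrix. The sketch below treats a win as the positive class and plugs in the four counts printed above; the helper name `metrics_from_confusion` is illustrative.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall, and F1 with 'win' as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of predicted wins, how many were wins
    recall = tp / (tp + fn)     # of actual wins, how many were predicted
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}


# Counts from the Logistic Regression confusion matrix:
# actual losses -> 262 predicted loss, 45 predicted win
# actual wins   -> 51 predicted loss, 256 predicted win
lr_metrics = metrics_from_confusion(tn=262, fp=45, fn=51, tp=256)
for name, value in lr_metrics.items():
    print(f"{name}: {value:.4f}")
```

The results agree with the reported Accuracy (0.8436), Precision (0.8505), Recall (0.8339), and F1 (0.8421). ROC-AUC cannot be recovered this way because it depends on the predicted probabilities, not only on the thresholded counts.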