# NBA Player Stats Prediction - Refactored

**Team Members:** Ryan, Momoka, Jesus, Angel, Harshil  
**Course:** CS4661 - Introduction to Data Science  
**Objective:** Predict NBA player statistics using machine learning

---

## Project Overview

This notebook builds an end-to-end machine learning pipeline that predicts per-game NBA player statistics from the other box-score columns of the same game:
- **Target Variables:** Field Goals (FG) and Field Goal Attempts (FGA)
- **Models:** Linear Regression, Random Forest, Gradient Boosting
- **Approach:** Modular, reusable functions for scalability and maintainability

## 1. Imports and Setup

In [8]:
import pandas as pd
import numpy as np
import kagglehub
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

## 2. Reusable Functions

Each pipeline step (loading, feature preparation, training and evaluation, summary) is its own function, so the same code runs unchanged for both targets.

In [9]:
def load_nba_data():
    """
    Download and load NBA player stats dataset from Kaggle.
    
    Returns:
        pd.DataFrame: Raw dataset
    """
    print("Downloading dataset...")
    path = kagglehub.dataset_download("eduardopalmieri/nba-player-stats-season-2425")
    print(f"Path to dataset files: {path}")
    
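    # Sort the listing: os.listdir returns files in arbitrary order
    csv_files = sorted(f for f in os.listdir(path) if f.endswith('.csv'))
    print(f"\nAvailable CSV files: {csv_files}")
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {path}")
    
    # Load the first CSV in alphabetical order
    df = pd.read_csv(os.path.join(path, csv_files[0]))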
    
    print("\n" + "="*80)
    print("DATASET OVERVIEW")
    print("="*80)
    print(f"\nDataset shape: {df.shape}")
    print(f"\nColumn names:\n{df.columns.tolist()}")
    print(f"\nFirst few rows:\n{df.head()}")
    
    missing_values = df.isnull().sum()
    if missing_values.sum() > 0:
        print(f"\nMissing values:\n{missing_values[missing_values > 0]}")
    else:
        print("\nNo missing values found!")
    
    return df


def prepare_features(df, target_col, exclude_cols=None):
    """
    Prepare features and target variable for modeling.
    
    Args:
        df: DataFrame with raw data
        target_col: Name of target variable column
        exclude_cols: List of columns to exclude (default: auto-detected)
    
    Returns:
        tuple: (X, y, feature_names)
    """
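    if exclude_cols is None:
        # Default exclusions: non-feature columns plus stats computed from
        # FG/FGA (FG%, PTS, GmSc), which would leak the target
        exclude_cols = ['Player', 'Data', 'FG%', 'PTS', 'GmSc']
        
        # If predicting FG, exclude FGA and vice versa
        if target_col == 'FG':
            exclude_cols.append('FGA')
        elif target_col == 'FGA':
            exclude_cols.append('FG')
    
    # Always exclude the target itself, even when exclude_cols is supplied
    exclude_cols = set(exclude_cols) | {target_col}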
    
    # Get feature columns
    feature_cols = [col for col in df.columns if col not in exclude_cols]
    
    # Select only numeric features for now
    numeric_cols = df[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
    
    print(f"\nTarget variable: {target_col}")
    print(f"Feature variables ({len(numeric_cols)} total): {numeric_cols}")
    
    # Create feature matrix and target vector
    X = df[numeric_cols].copy()
    y = df[target_col].copy()
    
    # Clean data: treat infinities as missing, then drop incomplete rows
    X = X.replace([np.inf, -np.inf], np.nan)
    valid_indices = X.notna().all(axis=1) & y.notna()
    X = X[valid_indices]
    y = y[valid_indices]
    
    print(f"Final dataset shape: X={X.shape}, y={y.shape}")
    
    return X, y, numeric_cols


def create_model_configs():
    """
    Create model configurations for training.
    
    Returns:
        dict: Model configurations
    """
    models = {
        'Linear Regression': {
            'model': LinearRegression(),
            'use_scaled': True,
            'has_coef': True
        },
        'Random Forest': {
            'model': RandomForestRegressor(n_estimators=100, random_state=42),
            'use_scaled': False,
            'has_coef': False
        },
        'Gradient Boosting': {
            'model': GradientBoostingRegressor(n_estimators=100, random_state=42),
            'use_scaled': False,
            'has_coef': False
        }
    }
    return models


def train_and_evaluate_models(X_train, X_test, y_train, y_test, feature_cols, target_name):
    """
    Train and evaluate all models for a given target variable.
    
    Args:
        X_train: Training features
        X_test: Test features
        y_train: Training target
        y_test: Test target
        feature_cols: List of feature column names
        target_name: Name of target variable (for display)
    
    Returns:
        dict: Results for each model
    """
    print("\n" + "="*80)
    print(f"MODEL TRAINING FOR {target_name}")
    print("="*80)
    
    models = create_model_configs()
    results = {}
    
    # Initialize scaler once for all models that need it
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for model_name, config in models.items():
        print("\n" + "-"*80)
        print(f"Model: {model_name}")
        print("-"*80)
        
        # Select scaled or unscaled data based on model requirements
        X_train_use = X_train_scaled if config['use_scaled'] else X_train
        X_test_use = X_test_scaled if config['use_scaled'] else X_test
        
        # Train model
        model = config['model']
        model.fit(X_train_use, y_train)
        y_pred = model.predict(X_test_use)
        
        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        
        results[model_name] = {
            'RMSE': rmse,
            'MAE': mae,
            'R²': r2
        }
        
        # Print metrics
        print(f"RMSE: {rmse:.4f}")
        print(f"MAE: {mae:.4f}")
        print(f"R²: {r2:.4f}")
        
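        # Print coefficients or feature importances. Linear coefficients are per
        # standard deviation of each feature, because that model is fit on scaled inputs.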
        if config['has_coef'] and hasattr(model, 'coef_'):
            print("\nFeature Coefficients:")
            for feature, coef in zip(feature_cols, model.coef_):
                print(f"  {feature}: {coef:.4f}")
        elif hasattr(model, 'feature_importances_'):
            print("\nFeature Importances:")
            for feature, importance in zip(feature_cols, model.feature_importances_):
                print(f"  {feature}: {importance:.4f}")
    
    return results


def summarize_results(results, target_name):
    """
    Print summary of model results.
    
    Args:
        results: Dictionary of model results
        target_name: Name of target variable
    
    Returns:
        pd.DataFrame: One row per model with RMSE, MAE, and R² columns
    """
    print("\n" + "="*80)
    print(f"SUMMARY OF RESULTS FOR {target_name}")
    print("="*80)
    
    results_df = pd.DataFrame(results).T
    print(f"\n{'':<20s}{'RMSE':>10s}{'MAE':>12s}{'R²':>10s}")
    for model_name, row in results_df.iterrows():
        print(f"{model_name:<20s}{row['RMSE']:>10.6f}{row['MAE']:>12.6f}{row['R²']:>10.6f}")
    
    # Identify best models
    best_model_r2 = results_df['R²'].idxmax()
    best_model_rmse = results_df['RMSE'].idxmin()
    best_model_mae = results_df['MAE'].idxmin()
    
    print(f"\nBest Model (by R²): {best_model_r2}")
    print(f"Best Model (by RMSE): {best_model_rmse}")
    print(f"Best Model (by MAE): {best_model_mae}")
    
    return results_df


def predict_target(df, target_col, test_size=0.4, random_state=42):
    """
    Complete pipeline for predicting a target variable.
    
    Args:
        df: DataFrame with data
        target_col: Target variable to predict
        test_size: Proportion of data for testing
        random_state: Random seed for reproducibility
    
    Returns:
        tuple: (results, results_df), the per-model metrics as a dict and as a DataFrame
    """
    print("\n" + "#"*80)
    print(f"# PREDICTION PIPELINE FOR: {target_col}")
    print("#"*80)
    
    # Prepare features
    X, y, feature_cols = prepare_features(df, target_col)
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    print(f"\nTrain set: {X_train.shape[0]} samples")
    print(f"Test set: {X_test.shape[0]} samples")
    
    # Train and evaluate models
    results = train_and_evaluate_models(
        X_train, X_test, y_train, y_test, feature_cols, target_col
    )
    
    # Summarize results
    results_df = summarize_results(results, target_col)
    
    return results, results_df

## 3. Load and Explore Data

In [10]:
# Load the dataset once; both prediction runs below reuse df
df = load_nba_data()

Downloading dataset...
Path to dataset files: /Users/ryan/.cache/kagglehub/datasets/eduardopalmieri/nba-player-stats-season-2425/versions/37

Available CSV files: ['database_24_25.csv']

DATASET OVERVIEW

Dataset shape: (16512, 25)

Column names:
['Player', 'Tm', 'Opp', 'Res', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', 'Data']

First few rows:
          Player   Tm  Opp Res     MP  FG  FGA    FG%  3P  3PA  ...  DRB  TRB  \
0   Jayson Tatum  BOS  NYK   W  30.30  14   18  0.778   8   11  ...    4    4   
1  Anthony Davis  LAL  MIN   W  37.58  11   23  0.478   1    3  ...   13   16   
2  Derrick White  BOS  NYK   W  26.63   8   13  0.615   6   10  ...    3    3   
3   Jrue Holiday  BOS  NYK   W  30.52   7    9  0.778   4    6  ...    2    4   
4  Miles McBride  NYK  BOS   L  25.85   8   10  0.800   4    5  ...    0    0   

   AST  STL  BLK  TOV  PF  PTS  GmSc        Data  
0   10    1    1    1  

## 4. Predict Field Goals (FG)

Field Goals (FG) is the number of shots a player makes in a game, counting two-point and three-point baskets but not free throws.
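
Made field goals feed directly into points, which is why the default exclusions in `prepare_features` drop `PTS`. Every made field goal is worth two points, plus one extra if it is a three-pointer, so `PTS = 2*FG + 3P + FT`, and a model given `PTS`, `3P`, and `FT` could recover `FG` exactly. A minimal sketch of that leakage on made-up box-score rows (the numbers are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative box-score rows (made-up numbers, not from the Kaggle dataset)
toy = pd.DataFrame({
    'FG': [14, 11, 8, 7, 3, 0],
    '3P': [8, 1, 6, 4, 0, 0],
    'FT': [1, 13, 2, 0, 4, 2],
})
# Points follow from the scoring rules: 2 per made field goal,
# 1 extra per made three-pointer, 1 per made free throw
toy['PTS'] = 2 * toy['FG'] + toy['3P'] + toy['FT']

# With PTS among the features, a linear model recovers FG exactly
leaky_features = toy[['PTS', '3P', 'FT']]
leaky = LinearRegression().fit(leaky_features, toy['FG'])
leaky_r2 = leaky.score(leaky_features, toy['FG'])

print(f"R² with PTS as a feature: {leaky_r2:.4f}")
print(f"Coefficients (PTS, 3P, FT): {np.round(leaky.coef_, 3)}")
```

The fitted coefficients are just the identity rearranged, `FG = 0.5*PTS - 0.5*3P - 0.5*FT`, so a perfect score here would say nothing about predictive skill.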

In [11]:
# Run complete pipeline for FG prediction
fg_results, fg_results_df = predict_target(df, 'FG')


################################################################################
# PREDICTION PIPELINE FOR: FG
################################################################################

Target variable: FG
Feature variables (15 total): ['MP', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF']
Final dataset shape: X=(16512, 15), y=(16512,)

Train set: 9907 samples
Test set: 6605 samples

MODEL TRAINING FOR FG

--------------------------------------------------------------------------------
Model: Linear Regression
--------------------------------------------------------------------------------
RMSE: 1.8485
MAE: 1.3734
R²: 0.6749

Feature Coefficients:
  MP: 1.1350
  3P: 1.5072
  3PA: -0.2837
  3P%: -0.1100
  FT: -0.1143
  FTA: 0.6943
  FT%: 0.0123
  ORB: 0.2277
  DRB: 0.1227
  TRB: 0.1858
  AST: 0.1766
  STL: 0.0625
  BLK: 0.0646
  TOV: 0.2008
  PF: -0.1140

[Output truncated here; the Random Forest and Gradient Boosting metrics for FG appear in the Section 6 comparison table.]
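
One caution when reading coefficients like those above: the feature list contains `ORB`, `DRB`, and `TRB`, and total rebounds are by definition the sum of offensive and defensive rebounds. If that identity holds exactly in the data, the three columns are perfectly collinear, so the regression cannot attribute an effect to each one separately and their individual coefficients should not be interpreted on their own. A minimal sketch of how to check for this, using synthetic counts in place of the real columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic rebound counts standing in for the real ORB/DRB columns
toy = pd.DataFrame({
    'ORB': rng.integers(0, 6, size=200),
    'DRB': rng.integers(0, 12, size=200),
})
toy['TRB'] = toy['ORB'] + toy['DRB']  # total rebounds by definition

# Check 1: does the identity hold on every row?
identity_holds = bool((toy['TRB'] == toy['ORB'] + toy['DRB']).all())

# Check 2: a rank below the column count signals linear dependence
rank = np.linalg.matrix_rank(toy.to_numpy(dtype=float))

print(f"TRB == ORB + DRB on all rows: {identity_holds}")
print(f"Columns: {toy.shape[1]}, matrix rank: {rank}")
```

On the real data the same checks would run on `df[['ORB', 'DRB', 'TRB']]`. If they confirm the dependence, dropping `TRB` is one way to make the remaining rebound coefficients identifiable.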

## 5. Predict Field Goal Attempts (FGA)

Field Goal Attempts (FGA) is the total number of shots a player attempts in a game, made or missed, excluding free throws.

In [5]:
# Run complete pipeline for FGA prediction
fga_results, fga_results_df = predict_target(df, 'FGA')


################################################################################
# PREDICTION PIPELINE FOR: FGA
################################################################################

Target variable: FGA
Feature variables (15 total): ['MP', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF']
Final dataset shape: X=(16512, 15), y=(16512,)

Train set: 9907 samples
Test set: 6605 samples

MODEL TRAINING FOR FGA

--------------------------------------------------------------------------------
Model: Linear Regression
--------------------------------------------------------------------------------
RMSE: 2.7500
MAE: 2.0331
R²: 0.7919

Feature Coefficients:
  MP: 1.9140
  3P: -0.1039
  3PA: 2.9191
  3P%: -0.1271
  FT: 0.0386
  FTA: 1.0450
  FT%: -0.0554
  ORB: 0.5157
  DRB: 0.1077
  TRB: 0.2872
  AST: 0.4037
  STL: 0.0830
  BLK: 0.0410
  TOV: 0.3919
  PF: -0.1855

[Output truncated here; the Random Forest and Gradient Boosting metrics for FGA appear in the Section 6 comparison table.]

## 6. Side-by-Side Comparison

Compare model performance across both prediction tasks.
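
When comparing across the two targets, note that RMSE and MAE are in the units of the target (made shots for FG, attempted shots for FGA), so a larger error for FGA does not by itself mean a worse model. R² is unitless and is arguably the more natural cross-target comparison. As a reminder of what each metric measures, here is a small hand computation on made-up numbers, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 4.0, 5.0, 10.0])

errors = y_pred - y_true                         # [1, 0, -1, 2]
rmse = np.sqrt(np.mean(errors ** 2))             # sqrt(6 / 4)
mae = np.mean(np.abs(errors))                    # 4 / 4
ss_res = np.sum(errors ** 2)                     # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20
r2 = 1 - ss_res / ss_tot                         # 1 - 6/20

print(f"RMSE: {rmse:.4f}")  # 1.2247
print(f"MAE:  {mae:.4f}")   # 1.0000
print(f"R²:   {r2:.4f}")    # 0.7000

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```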

In [6]:
print("\n" + "="*80)
print("COMPARISON: FG vs FGA PREDICTION")
print("="*80)

print("\n" + "-"*40 + " FG Results " + "-"*40)
print(fg_results_df)

print("\n" + "-"*40 + " FGA Results " + "-"*40)
print(fga_results_df)

print("\n" + "="*80)
print("KEY INSIGHTS")
print("="*80)
print(f"Best model for FG: {fg_results_df['R²'].idxmax()} (R² = {fg_results_df['R²'].max():.4f})")
print(f"Best model for FGA: {fga_results_df['R²'].idxmax()} (R² = {fga_results_df['R²'].max():.4f})")


COMPARISON: FG vs FGA PREDICTION

---------------------------------------- FG Results ----------------------------------------
                       RMSE       MAE        R²
Linear Regression  1.848504  1.373403  0.674866
Random Forest      1.882773  1.379354  0.662699
Gradient Boosting  1.827591  1.340221  0.682181

---------------------------------------- FGA Results ----------------------------------------
                       RMSE       MAE        R²
Linear Regression  2.749998  2.033133  0.791892
Random Forest      2.801822  2.041688  0.783975
Gradient Boosting  2.724104  1.985679  0.795793

KEY INSIGHTS
Best model for FG: Gradient Boosting (R² = 0.6822)
Best model for FGA: Gradient Boosting (R² = 0.7958)
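
The R² gaps between the three models are small (about 0.02 or less for each target) and come from a single 60/40 split, so the ranking may not be stable. The 5-fold cross-validation listed in the next section would give a mean and spread per model instead. A minimal sketch of that idea, assuming synthetic data from `make_regression` in place of the NBA features and smaller ensembles to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix X and target y
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
candidates = {
    'Linear Regression': make_pipeline(StandardScaler(), LinearRegression()),
    'Random Forest': RandomForestRegressor(n_estimators=30, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=30, random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, estimator in candidates.items():
    # One R² value per fold
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring='r2')
    cv_scores[name] = scores
    print(f"{name:<20s} R² = {scores.mean():.4f} ± {scores.std():.4f}")
```

Because each row of the real dataset is one player-game, the same players appear on both sides of a random split. Grouping folds by `Player` with scikit-learn's `GroupKFold` would be a stricter variant worth trying.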


## 7. Next Steps (To Be Completed)

### TODO List for Team:

1. **Exploratory Data Analysis (EDA)** - Assigned to: Angel
   - Distribution plots for FG and FGA
   - Correlation heatmap
   - Feature relationships
   - Temporal trends

2. **Feature Engineering** - Assigned to: Ryan
   - Encode categorical variables (Tm, Opp, Res)
   - Create derived features (shooting efficiency, etc.)
   - Rolling averages for player form

3. **Advanced Modeling** - Assigned to: Jesus
   - Cross-validation (5-fold)
   - Hyperparameter tuning
   - Add XGBoost and LightGBM
   - Feature selection

4. **Visualization & Analysis** - Assigned to: TBD
   - Residual plots
   - Feature importance charts
   - Prediction vs actual scatter plots

5. **Documentation** - Assigned to: Momoka
   - Executive summary
   - Methodology explanation
   - Results interpretation
   - Conclusions and recommendations
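
As a starting point for the feature-engineering item, here is a minimal sketch of two of the listed ideas: one-hot encoding a categorical column and a per-player rolling average that only uses earlier games. It runs on a tiny made-up game log. The `Player`, `Res`, and `FG` names mirror the dataset, but the `game_no` ordering column, the window of 3, and the `FG_form3` feature name are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up game log; 'game_no' is a hypothetical ordering column
log = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'game_no': [1, 2, 3, 4, 1, 2, 3],
    'Res': ['W', 'L', 'W', 'W', 'L', 'L', 'W'],
    'FG': [4, 6, 8, 2, 10, 12, 5],
})
log = log.sort_values(['Player', 'game_no']).reset_index(drop=True)

# Rolling form: mean FG over up to 3 previous games. shift(1) keeps the
# current game's FG out of its own feature, which would leak the target.
log['FG_form3'] = (
    log.groupby('Player')['FG']
       .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)

# One-hot encode the game result (W/L) into indicator columns
encoded = pd.get_dummies(log, columns=['Res'], prefix='Res')

print(encoded)
```

Each player's first game has no history, so `FG_form3` is missing there. Those rows would need to be dropped or imputed before modeling, since `prepare_features` removes rows with missing feature values.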