## 1) Inspiration

I've always been a huge NBA fan, and I'm very interested in players' stats and averages. I think the dataset is worth exploring to see how net rating relates to a player's overall statistics. Some players in the NBA put up impressive stat lines but do not necessarily contribute to winning basketball games. This project will explore which basketball statistics actually contribute to winning.

## 2) Stakeholders

General Managers and coaching staff of NBA teams would be interested in hearing the results of this project. This project would benefit them by showing how their players' statistics contribute to winning on the court.

## 3) Task and Metrics

This project will be a regression problem since net_rating is a continuous variable.

My evaluation metric will be MAE. I don't need to penalize bigger errors with RMSE, so MAE will do.
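To make the difference between the two metrics concrete, here is a minimal sketch on made-up error values (the lists `even` and `spiky` are hypothetical and not drawn from the NBA data). Both lists have the same MAE, but RMSE is larger for the one that concentrates its error in a single big miss.

```python
import math

def mae(errors):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: squaring weights large errors more heavily."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical prediction errors (prediction minus actual), for illustration only
even = [2.0, -2.0, 2.0, -2.0]   # four moderate misses
spiky = [0.0, 0.0, 0.0, 8.0]    # three perfect predictions and one large miss

print(mae(even), rmse(even))    # 2.0 2.0
print(mae(spiky), rmse(spiky))  # 2.0 4.0
```

Choosing MAE therefore treats a model with one large miss the same as a model with several moderate misses of the same total size.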

## 4) Data

My dataset contains all the NBA players from the 1996 to 2021 seasons and all their descriptive information (height, weight, college) and statistics (points, rebounds, assists, shooting percentages).

Link: https://www.kaggle.com/datasets/justinas/nba-players-data?resource=download

There are 12844 observations and 22 columns. There are 8 categorical variables and 14 numeric variables.

Each observation is an NBA player's season. Since it is data from 1996 to 2021, players that played multiple seasons in that time have an observation for each season. The variables represent their statistics from that season. My response variable, 'net_rating', is the team's point differential per 100 possessions while the player is on the court. 'ts_pct' is True Shooting Percentage, 'usg_pct' is usage percentage, 'pts' is average points, 'reb' is average rebounds, 'ast' is average assists, 'ast_pct' is assist percentage, 'oreb_pct' is offensive rebound percentage, and 'dreb_pct' is defensive rebound percentage.

The response variable is the 'net_rating' column, and the predictor variables are usg_pct, pts, reb, ast, ts_pct, ast_pct, oreb_pct, dreb_pct, gp, and age.

CLEANING: I excluded observations from the dataset where gp was 20 or fewer (the code keeps only rows with gp > 20). If a player appears in so few games in a season, their statistical averages may be skewed or unrepresentative because of the small sample.
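The games-played filter can be sketched on toy data. The `toy` frame and its values are made up for illustration; only the `gp` column name comes from the dataset. Note that a strict `> 20` comparison, as used in the code for this report, also drops players with exactly 20 games, while `>= 20` would keep them.

```python
import pandas as pd

# Hypothetical three-player frame; only the 'gp' column name matches the real data
toy = pd.DataFrame({"player_name": ["A", "B", "C"], "gp": [19, 20, 21]})

strict = toy[toy["gp"] > 20]      # keeps players with 21 or more games
inclusive = toy[toy["gp"] >= 20]  # keeps players with 20 or more games

print(list(strict["gp"]))     # [21]
print(list(inclusive["gp"]))  # [20, 21]
```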


## 5) Prediction

For my first model, I used 'ts_pct' as my predictor because it had the highest correlation with net_rating (0.36). The training MAE was 4.67, and the test MAE was also 4.67.
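Picking the first predictor by its correlation with the response could look roughly like the sketch below. The `toy` frame is made-up data (only the column names mirror the dataset), so the printed ranking says nothing about the real correlations.

```python
import pandas as pd

# Hypothetical toy data, not the NBA dataset
toy = pd.DataFrame({
    "net_rating": [-4.0, -1.0, 0.5, 2.0, 5.0],
    "ts_pct": [0.48, 0.52, 0.55, 0.57, 0.61],
    "age": [31, 24, 28, 22, 27],
})

# Pearson correlation of every candidate predictor with the response
corr = toy.corr()["net_rating"].drop("net_rating")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = corr.abs().sort_values(ascending=False)
first_predictor = ranked.index[0]

print(ranked)
print(first_predictor)  # ts_pct
```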

Here is a table showing the remaining predictors being added one at a time, along with the training and test MAE and the training and test R^2:

In [171]:
#| echo: false
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

data = pd.read_csv('all_seasons.csv')

data = data[data['gp'] > 20]

predictors = ['pts', 'reb', 'ast', 'ts_pct', 'ast_pct', 'oreb_pct', 'dreb_pct', 'age', 'gp', 'usg_pct']
results = []

X = data[['ts_pct']]  # Chosen as the first predictor because it has the highest correlation with net_rating
y = data['net_rating']

# Split into train and test 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# No scaling here: OLS predictions (and therefore MAE and R^2) do not depend on feature scale

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate MAE and R^2
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Df with results
results.append({
    'Predictors': 'ts_pct',
    'Train MAE': train_mae,
    'Test MAE': test_mae,
    'Train R^2': r2_train,
    'Test R^2': r2_test
})

X = X.copy()


# Note: 'ts_pct' is already in X, so its iteration re-fits the same model and repeats a row
for predictor in predictors:

    if predictor in data.columns:
        X[predictor] = data[predictor]
    else:
        print(f"Warning: '{predictor}' not found in the DataFrame columns. Skipping it.")
        continue

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    X_train_const = sm.add_constant(X_train)
    X_test_const = sm.add_constant(X_test)

    model = sm.OLS(y_train, X_train_const).fit()

    y_train_pred = model.predict(X_train_const)
    y_test_pred = model.predict(X_test_const)

    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    r2_train = r2_score(y_train, y_train_pred)
    r2_test = r2_score(y_test, y_test_pred)

    results.append({
        'Predictors': ', '.join(X.columns),
        'Train MAE': train_mae,
        'Test MAE': test_mae,
        'Train R^2': r2_train,
        'Test R^2': r2_test
    })

results_df = pd.DataFrame(results)

print("\nPerformance with each set of predictors:\n")
results_df


Performance with each set of predictors:



| | Predictors | Train MAE | Test MAE | Train R^2 | Test R^2 |
|---|---|---|---|---|---|
| 0 | ts_pct | 4.670336 | 4.671505 | 0.135979 | 0.139392 |
| 1 | ts_pct, pts | 4.616341 | 4.612677 | 0.15619 | 0.16397 |
| 2 | ts_pct, pts, reb | 4.607351 | 4.604622 | 0.159138 | 0.166505 |
| 3 | ts_pct, pts, reb, ast | 4.5617 | 4.556252 | 0.172178 | 0.184453 |
| 4 | ts_pct, pts, reb, ast | 4.5617 | 4.556252 | 0.172178 | 0.184453 |
| 5 | ts_pct, pts, reb, ast, ast_pct | 4.557704 | 4.54811 | 0.1756 | 0.187443 |
| 6 | ts_pct, pts, reb, ast, ast_pct, oreb_pct | 4.557575 | 4.5473 | 0.175661 | 0.18764 |
| 7 | ts_pct, pts, reb, ast, ast_pct, oreb_pct, dreb... | 4.551968 | 4.52889 | 0.178205 | 0.193172 |
| 8 | ts_pct, pts, reb, ast, ast_pct, oreb_pct, dreb... | 4.468096 | 4.463113 | 0.206976 | 0.211481 |
| 9 | ts_pct, pts, reb, ast, ast_pct, oreb_pct, dreb... | 4.423564 | 4.426429 | 0.222523 | 0.226889 |


After creating a model with my non-linear terms, I found that no predictors had a coefficient of 0, which is expected because Ridge shrinks coefficients without setting them exactly to zero. Here are the five most important and five least important terms from my trained and tuned polynomial model, regularized using Ridge:

In [190]:
#| echo: false
from sklearn.linear_model import RidgeCV  # the only linear model used in this cell
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train) 
X_test_poly = poly.transform(X_test)

scaler = StandardScaler()
scaler.fit(X_train_poly)
X_train_poly_scaled = scaler.transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

alphas = 10**np.linspace(10,-2,200)*0.5

rcv = RidgeCV(alphas=alphas, cv=5)

rcv.fit(X_train_poly_scaled, y_train)

y_pred = rcv.predict(X_test_poly_scaled)

mae = mean_absolute_error(y_test, y_pred)



coefficients = rcv.coef_

feature_names = poly.get_feature_names_out(X_train.columns)
coef_df = pd.DataFrame({
    'Predictor': feature_names,
    'Coefficient': coefficients
})

coef_df['Abs_Coefficient'] = coef_df['Coefficient'].abs()

coef_df_sorted = coef_df.sort_values(by='Abs_Coefficient', ascending=False)

print("\nTop 5 Most Important Predictors:")
print(coef_df_sorted.head(5)[['Predictor', 'Coefficient']])

print("\nTop 5 Least Important Predictors:")
print(coef_df_sorted.tail(5)[['Predictor', 'Coefficient']])

zero_coef_df = coef_df_sorted[coef_df_sorted['Coefficient'] == 0]
if not zero_coef_df.empty:
    print("\nTerms with Zero Coefficients:")
    print(zero_coef_df[['Predictor', 'Coefficient']])


Top 5 Most Important Predictors:
         Predictor  Coefficient
31     reb ast_pct   -12.066458
9          usg_pct    -8.158342
1              pts    -8.003121
43     ast usg_pct     6.078292
19  ts_pct usg_pct     5.464256

Top 5 Least Important Predictors:
           Predictor  Coefficient
62              gp^2     0.198761
27            pts gp     0.137600
45  ast_pct oreb_pct    -0.111271
22           pts ast     0.061311
2                reb    -0.036364


After cross-validating, I found the best hyperparameter to be 1.95, and the Test MAE of my trained and tuned model is 4.25. The best CV score is 0.27.
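The cell above does not print these values, so here is a hedged sketch of how they can be read off a fitted `RidgeCV` object. It uses synthetic data (an assumption for illustration, not the NBA dataset) and a shorter alpha grid than the report. To my understanding, `best_score_` is the mean cross-validated score of the best alpha, which is R^2 by default when no `scoring` argument is given.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with one interaction effect (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Degree-2 expansion, then scaling fitted on the training split only
poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
X_train_ps = scaler.fit_transform(poly.fit_transform(X_train))
X_test_ps = scaler.transform(poly.transform(X_test))

# 5-fold cross-validation over a grid of candidate penalties
alphas = 10 ** np.linspace(3, -2, 50)
rcv = RidgeCV(alphas=alphas, cv=5).fit(X_train_ps, y_train)

best_alpha = rcv.alpha_          # penalty selected by cross-validation
best_cv_score = rcv.best_score_  # CV score achieved by that penalty
test_mae = mean_absolute_error(y_test, rcv.predict(X_test_ps))

print(best_alpha, best_cv_score, test_mae)
```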

Looking at my list of most important predictors, usg_pct seems to be very informative. The original predictor of usg_pct is in the list, and it is also included in two interaction terms, ast * usg_pct and ts_pct * usg_pct.

From looking at my performance table, where I add one predictor to the model at a time, pts is influential in reducing the MAE, lowering the Training and Test MAE by ~0.05 to 0.06. Another predictor that seems to have a large impact is age. When age is added into the model, Training MAE drops by ~0.08 and Test MAE drops by ~0.07. These are the two largest single-step reductions in the table.
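The step-by-step changes can be checked directly from the performance table. The sketch below copies the Train and Test MAE columns from that table and computes how much each row improves on the previous one. Rows are identified by table index only: row 1 is the step that adds pts, and row 8 is the step the discussion above attributes to age.

```python
# MAE columns copied from the performance table (rows 0 through 9)
train_mae = [4.670336, 4.616341, 4.607351, 4.5617, 4.5617,
             4.557704, 4.557575, 4.551968, 4.468096, 4.423564]
test_mae = [4.671505, 4.612677, 4.604622, 4.556252, 4.556252,
            4.54811, 4.5473, 4.52889, 4.463113, 4.426429]

def drops(values):
    """Reduction in MAE at each row relative to the row before it."""
    return {i: values[i - 1] - values[i] for i in range(1, len(values))}

train_drop = drops(train_mae)
test_drop = drops(test_mae)

for row in train_drop:
    print(row, round(train_drop[row], 3), round(test_drop[row], 3))

# Row with the largest single-step reduction in Test MAE
largest_step = max(test_drop, key=test_drop.get)
print(largest_step)  # 8
```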

From looking at my Least Important Predictors table, gp seems to not be informative to the model. The transformation term gp^2 and the interaction term pts * gp are both in this list. Looking at my performance table, the variables oreb_pct and dreb_pct also seem to be low-impact predictors because adding either one changes the Training and Test MAE only slightly (by less than 0.02).

There seem to be non-linearities in the relationship between net_rating and the predictors. This can be seen because the Test MAE of the polynomial model is 4.25, which is lower than the Test MAE of the earlier linear model (about 4.43). This suggests that the non-linear terms in the new model fit the relationship better, reducing error.

## 6) Inference

My original predictors for this model are reb, ast_pct, usg_pct, pts, ast, and ts_pct, and my interaction terms are reb * ast_pct, ast * usg_pct, and ts_pct * usg_pct. I got these terms from my top 5 most informative list in my table in my prediction section.

Here is a summary of the model that includes these influential predictors:

In [197]:
#| echo: false
import statsmodels.formula.api as smf

model = smf.ols(formula= 'net_rating ~ scale(reb) + scale(ast_pct) + scale(usg_pct) + scale(pts) + scale(ast) + scale(ts_pct) + scale(reb*ast_pct) + scale(ast*usg_pct) + scale(ts_pct*usg_pct)', data=data).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | net_rating | R-squared: | 0.191 |
| Model: | OLS | Adj. R-squared: | 0.19 |
| Method: | Least Squares | F-statistic: | 278.8 |
| Date: | Tue, 18 Mar 2025 | Prob (F-statistic): | 0.0 |
| Time: | 21:33:09 | Log-Likelihood: | -33628.0 |
| No. Observations: | 10627 | AIC: | 67280.0 |
| Df Residuals: | 10617 | BIC: | 67350.0 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.0456 | 0.056 | -18.807 | 0.000 | -1.155 | -0.937 |
| scale(reb) | 0.4826 | 0.127 | 3.809 | 0.000 | 0.234 | 0.731 |
| scale(ast_pct) | -0.4689 | 0.165 | -2.848 | 0.004 | -0.792 | -0.146 |
| scale(usg_pct) | -6.1785 | 0.532 | -11.616 | 0.000 | -7.221 | -5.136 |
| scale(pts) | 0.3528 | 0.183 | 1.929 | 0.054 | -0.006 | 0.711 |
| scale(ast) | 1.6525 | 0.301 | 5.485 | 0.000 | 1.062 | 2.243 |
| scale(ts_pct) | -0.2160 | 0.211 | -1.025 | 0.305 | -0.629 | 0.197 |
| scale(reb * ast_pct) | -0.0311 | 0.170 | -0.182 | 0.855 | -0.365 | 0.303 |
| scale(ast * usg_pct) | -0.2358 | 0.294 | -0.801 | 0.423 | -0.813 | 0.341 |

| | | | |
|---|---|---|---|
| Omnibus: | 122.102 | Durbin-Watson: | 1.932 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 162.902 |
| Skew: | -0.164 | Prob(JB): | 4.23e-36 |
| Kurtosis: | 3.51 | Cond. No. | 34.7 |


Now, I will discuss the effect of each predictor on the response, net_rating, using the coefs from the summary. Because every term in the formula is wrapped in scale(), each coefficient describes the change in net_rating associated with a one-standard-deviation increase in that term, holding the other terms fixed, rather than the change per raw unit:

- reb (0.4826) - A one-standard-deviation increase in reb is associated with net_rating going up by 0.4826.
- ast_pct (-0.4689) - A one-standard-deviation increase in ast_pct is associated with net_rating going down by 0.4689.
- usg_pct (-6.1785) - A one-standard-deviation increase in usg_pct is associated with net_rating going down by 6.1785.
- pts (0.3528) - A one-standard-deviation increase in pts is associated with net_rating going up by 0.3528.
- ast (1.6525) - A one-standard-deviation increase in ast is associated with net_rating going up by 1.6525.
- ts_pct (-0.2160) - A one-standard-deviation increase in ts_pct is associated with net_rating going down by 0.2160.
- reb * ast_pct (-0.0311) - A one-standard-deviation increase in the product reb * ast_pct is associated with net_rating going down by 0.0311.
- ast * usg_pct (-0.2358) - A one-standard-deviation increase in the product ast * usg_pct is associated with net_rating going down by 0.2358.
- ts_pct * usg_pct (6.2596) - A one-standard-deviation increase in the product ts_pct * usg_pct is associated with net_rating going up by 6.2596.
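Since the formula wraps each term in scale(), it helps to see how a coefficient on a standardized predictor relates to a per-unit slope. The sketch below uses synthetic data with a single predictor (the values of `x` and `y` are assumptions for illustration) and shows that dividing the standardized slope by the predictor's standard deviation recovers the raw-unit slope.

```python
import numpy as np

# Synthetic single-predictor data (illustration only, not the NBA dataset)
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.5, size=500)
y = 1.0 + 0.8 * x + rng.normal(scale=1.0, size=500)

# Standardize x: subtract the mean and divide by the standard deviation
sd = x.std()
z = (x - x.mean()) / sd

# np.polyfit with degree 1 returns [slope, intercept]
slope_raw = np.polyfit(x, y, 1)[0]  # change in y per one raw unit of x
slope_std = np.polyfit(z, y, 1)[0]  # change in y per one standard deviation of x

print(slope_raw, slope_std, slope_std / sd)
```

Applied to this report, a per-rebound or per-assist effect would require dividing the corresponding coefficient by that column's standard deviation, which the summary does not show.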

Now, I will discuss the reliability of the effect of each predictor using the p-values from the summary:

- reb (0.000) – The p-value is below 0.001, indicating that reb has a statistically reliable association with the target variable. A p-value speaks to reliability rather than size, so the strength of the effect should be read from the coefficient.
- ast_pct (0.004) – With a p-value well below 0.05, ast_pct is statistically significant, demonstrating a reliable effect on the target variable. This means that ast_pct contributes meaningfully to the model's predictive power.
- usg_pct (0.000) – The p-value is extremely low, indicating a highly reliable and significant effect. This confirms that usg_pct is an important and consistent predictor.
- pts (0.054) – The p-value is just above the 0.05 threshold, which means that the effect of pts in the model is not statistically significant at that level.
- ast (0.000) – The p-value is extremely low, confirming a highly significant and reliable effect. This indicates that ast is a strong and consistent predictor.
- ts_pct (0.305) – The p-value is much higher than 0.05, indicating no reliable effect. This suggests that ts_pct does not significantly influence the target variable and may not be a meaningful predictor in the model.
- reb * ast_pct (0.855) – With a very high p-value, this interaction term is not statistically significant, indicating no reliable effect. The combination of reb and ast_pct does not meaningfully influence the target variable.
- ast * usg_pct (0.423) – The p-value is well above 0.05, showing no reliable effect. This indicates that the interaction between ast and usg_pct does not significantly contribute to the model.
- ts_pct * usg_pct (0.000) – The p-value is extremely low, making this interaction term highly significant and reliable. This suggests that the combined effect of ts_pct and usg_pct has a meaningful and consistent influence on the target variable.


The overall model has an R-squared value of 0.191, which means that 19.1% of the variation in net_rating can be explained by the predictors included in the model.
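As a quick arithmetic check, the adjusted R-squared in the summary can be reproduced from the R-squared, the number of observations, and the number of predictors reported there, using the standard adjustment formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The variable names below are my own.

```python
# Values taken from the regression summary
r_squared = 0.191     # R-squared
n_obs = 10627         # No. Observations
n_predictors = 9      # Df Model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))  # 0.19
```

With more than ten thousand observations and only nine predictors, the adjustment is tiny, which is why the two values in the summary are nearly identical.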

## 7) Recommendation to Stakeholders

Here are the main action items for NBA coaching staff or GMs to focus on based on my analysis:

- Focus on rebounds (reb): Since rebounds have a strong and reliable effect on net_rating, coaches should prioritize players who can consistently record more rebounds to improve team performance.
- Manage player usage (usg_pct): The negative relationship between usg_pct and net_rating suggests that players with a high usage percentage may lower the team's overall effectiveness. NBA coaches may want to adjust how often high-usage players are involved in plays to avoid diminishing returns on net_rating.
- Encourage assists (ast): Since assists have a significant positive effect on net_rating, focusing on improving playmaking abilities and encouraging more assists could improve team performance.
- Minimize reliance on TS% (ts_pct): Given that TS% has no significant effect on net_rating (with a high p-value), coaches may want to reassess its emphasis as a key performance indicator, focusing more on other factors like assists and rebounds.
- Consider interaction between ts_pct and usg_pct: The significant interaction between ts_pct and usg_pct suggests that players with higher usage may benefit from an improved TS%, and this combination should be considered when managing players.

Here are some limitations of my analysis:
- Limited explanatory power: The R-squared value of 0.191 indicates that only 19.1% of the variation in net_rating can be explained by the model. This means there are other important factors affecting net_rating that aren't captured in this analysis.
- Insignificant predictors: Several predictors, such as pts, ts_pct, and the interaction terms (reb * ast_pct, ast * usg_pct), do not show a significant effect on net_rating (with p-values above 0.05). This indicates that the model may be overlooking important variables or interactions that could better explain the target variable.
- Potential model misspecification: The model may be missing key predictors or interactions that could improve the predictive power. Additionally, some of the included variables might not fully represent the real-world dynamics of player performance.
- Contextual factors missing: The analysis does not account for contextual factors such as game situation, team strategy, or player roles that could influence the relationship between the predictors and net_rating.

To overcome these limitations, I would conduct further analysis with a wider range of variables, such as minutes per game, defensive metrics, or PER (player efficiency rating). I would also try adding more non-linear or interaction terms in order to capture more of the variance in net_rating with my model.

## 8) Conclusion

In this analysis, I examined the relationship between various player statistics and net_rating, using both prediction and inference methods. I found that rebounds (reb), assists (ast), and player usage percentage (usg_pct) have significant and reliable effects on net_rating. Specifically, rebounds and assists are positively associated with a player’s net_rating, while higher usage percentages are negatively associated with it. Points (pts) and true shooting percentage (ts_pct) showed weaker or insignificant effects, and the interaction between ts_pct and usg_pct proved meaningful. The model explains 19.1% of the variation in net_rating, so these predictors offer some insight into player performance while leaving most of the variation unexplained.