# DSCI 408/508 - Team Capstone Project - Part 2
### By Eric Cowan, Clay Bruner, Tyler Dreiling, and Sam Risenhoover
<br>
The NBA is the one professional sports league where a single player or a single decision can create the opportunity for sustained success. In recent years, as player salaries have skyrocketed and the player empowerment era has taken hold, drafting well and building around those young players has become more important than ever. In the past two seasons alone, franchise-altering players have been drafted and have already attained great success, both for their teams and individually. The goal of this model is to accurately predict the rookie-season NBA statistics of any given college player. An important note is that this is a pre-draft predictor, so the prediction does not depend on which team drafts the player. The goal is to accurately predict the numbers a rookie would put up in the ideal situation for them, as if the role they played in college translated seamlessly to the NBA.
<br>
Another important note is that players drafted out of high school or from overseas leagues, such as the EuroLeague, are not in this data. The data covers NCAA men's basketball players only, and the model would not be accurate for players from other leagues.

In [1]:
import pandas as pd
import numpy as np
import scipy
from sklearn.model_selection import train_test_split

import tensorflow as tf
from keras import Sequential
from keras.layers import Dense

from numpy import mean
from numpy import std

Using TensorFlow backend.


**Importing the data**

I will import the data that was exported in the first part of the project. This is the data that will be used in the modeling process.

In [2]:
#Standardized Statistics
standard = pd.read_csv('CollegeRookieStats_standardized.csv')
#PCA2, from Part 1
pca2 = pd.read_csv('CollegeRookieStats_pca2.csv')
#PCA3, from Part 1
pca3 = pd.read_csv('CollegeRookieStats_pca3.csv')
# Raw Data
stats = pd.read_csv('CollegeRookieStats.csv')

**Selecting the target variables**

These are the target variables that we will be predicting: each player's NBA rookie-season statistics. Strong rookie statistics are a good indicator of success in the NBA. Efficiency statistics, such as field goal percentage, are not included in the prediction because efficiency is typically poor in the rookie season.

In [3]:
y = stats.loc[:,['NBATRB', 'NBAAST', 'NBASTL', 'NBABLK', 'NBAPTS','Year' ]]
y = y.drop(y[y['Year']==2019].index)
y = y.drop('Year', axis = 1)

For the standardized data, I will have to drop the columns that would not affect the output. Transposing the DataFrame makes it easier to see which columns have no bearing on predicting future statistics.

In [4]:
standard.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,626,627,628,629,630,631,632,633,634,635
Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,627,628,629,630,631,632,633,634,635,636
ID,1,2,3,4,5,6,7,8,9,10,...,627,628,629,630,631,632,633,634,635,636
Name,Kenyon Martin,Stromile Swift,Marcus Fizer,Mike Miller,DerMarr Johnson,Chris Mihm,Jamal Crawford,Keyon Dooling,Jerome Moiso,Mateen Cleaves,...,Jalen McDaniels,Justin Wright-Foreman,Marial Shayok,Kyle Guy,Jaylen Hands,Jordan Bone,Miye Oni,Cam Reddish,Kevin PorterJr,Ja Morant
College,Cincinnati,LSU,Iowa State,Florida,Cincinnati,Texas,Michigan,Missouri,UCLA,Michigan State,...,San Diego State,Hofstra,Iowa State,Virginia,UCLA,Tennessee,Yale,Duke,USC,Murray State
NBATeam,NJN,VAN,CHI,ORL,ATL,CHI,CLE,ORL,BOS,DET,...,CHO,UTA,PHI,NYK,LAC,NOP,GSW,ATL,MIL,MEM
DraftYear,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,...,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019
DraftPick,1,2,4,5,6,7,8,10,11,14,...,52,53,54,55,56,57,58,10,30,2
G,0.755102,0.306122,0.62585,0.408163,0.183673,0.619048,0.0816327,0.367347,0.387755,0.802721,...,0.421769,0.816327,0.197279,0.687075,0.401361,0.612245,0.557823,0.210884,0.108844,0.408163
MP,0.562874,0.649701,0.838323,0.652695,0.676647,0.793413,0.868263,0.694611,0.655689,0.739521,...,0.688623,0.694611,0.838323,0.724551,0.700599,0.634731,0.802395,0.742515,0.51497,0.91018
FG,0.436782,0.517241,0.747126,0.448276,0.436782,0.528736,0.643678,0.367816,0.494253,0.436782,...,0.505747,0.689655,0.701149,0.436782,0.402299,0.344828,0.528736,0.425287,0.333333,0.632184


The columns ID, Name, NBATeam, DraftRange, Unnamed: 0, and DraftPick will not be included in the final model. College, I feel, could be a predictor of NBA success: different programs have different strengths and could be more likely to produce successful players. I will also remove the 2019 draft class, since those are the rookies whose stats we will predict.

In [4]:
standard = standard.drop(['ID', 'Name', 'NBATeam', 'DraftPick', 'DraftRange', 'Unnamed: 0'], axis = 1)

In [5]:
def find_cat(data):
    col_list = list(data.columns)
    numList = []
    objList = []
    for col in col_list:
        if data.dtypes[col] == object:
            objList.append(col)
        elif data.dtypes[col] in [ 'int64','float64']:
            numList.append(col)
        else:
            print(f'The column {col} is neither object nor int64 nor float64')
    return [objList, numList]

In [6]:
cat, num = find_cat(standard)
standard[cat].nunique()

College    138
dtype: int64

That is a large number of distinct colleges, which could clog up the model and slow down the training of the neural network. For now, I will drop the College column and look only at the players themselves. The SOS parameter, Strength of Schedule, should still indicate whether the player's team was facing real competition.
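
To make the concern concrete, here is a minimal sketch of how one-hot encoding a categorical column widens the feature matrix. The toy DataFrame and its values are invented for illustration and are not the project data; `pd.get_dummies` is the only real pandas call involved. With 138 distinct colleges, the same operation would add 138 indicator columns to a training set of 586 rows.

```python
import pandas as pd

# Hypothetical toy data: 5 players from 4 colleges (not the project data).
toy = pd.DataFrame({
    'College': ['Duke', 'Duke', 'Virginia', 'UCLA', 'Yale'],
    'PTS': [0.61, 0.45, 0.30, 0.52, 0.38],
})

# One-hot encode the categorical column; each distinct college becomes a column.
encoded = pd.get_dummies(toy, columns=['College'])

n_colleges = toy['College'].nunique()
print(encoded.shape)   # the PTS column plus one indicator column per college
print(n_colleges)
```

Grouping rare colleges into an "other" bucket would be one way to keep the column while limiting the width, but that is a design choice this notebook does not explore.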

In [7]:
standard = standard.drop('College', axis = 1)

In [9]:
standard.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,626,627,628,629,630,631,632,633,634,635
DraftYear,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0,2019.0
G,0.755102,0.306122,0.62585,0.408163,0.183673,0.619048,0.081633,0.367347,0.387755,0.802721,...,0.421769,0.816327,0.197279,0.687075,0.401361,0.612245,0.557823,0.210884,0.108844,0.408163
MP,0.562874,0.649701,0.838323,0.652695,0.676647,0.793413,0.868263,0.694611,0.655689,0.739521,...,0.688623,0.694611,0.838323,0.724551,0.700599,0.634731,0.802395,0.742515,0.51497,0.91018
FG,0.436782,0.517241,0.747126,0.448276,0.436782,0.528736,0.643678,0.367816,0.494253,0.436782,...,0.505747,0.689655,0.701149,0.436782,0.402299,0.344828,0.528736,0.425287,0.333333,0.632184
FGA,0.328125,0.40625,0.661458,0.416667,0.416667,0.479167,0.71875,0.416667,0.447917,0.5,...,0.458333,0.645833,0.645833,0.463542,0.453125,0.375,0.567708,0.5625,0.328125,0.59375
FGP,0.61828,0.548387,0.416667,0.341398,0.327957,0.379032,0.150538,0.147849,0.373656,0.134409,...,0.397849,0.341398,0.376344,0.206989,0.142473,0.182796,0.193548,0.0,0.30914,0.346774
P2,0.5,0.5625,0.8125,0.3625,0.3,0.5875,0.475,0.275,0.55,0.35,...,0.525,0.4875,0.525,0.2,0.25,0.275,0.35,0.175,0.225,0.55
P2A,0.455172,0.517241,0.848276,0.337931,0.268966,0.634483,0.565517,0.337931,0.593103,0.427586,...,0.531034,0.475862,0.503448,0.248276,0.310345,0.331034,0.386207,0.255172,0.227586,0.551724
P2P,0.579088,0.565684,0.402145,0.512064,0.538874,0.345845,0.252011,0.209115,0.351206,0.214477,...,0.439678,0.485255,0.520107,0.187668,0.176944,0.225201,0.310992,0.053619,0.383378,0.458445
P3,0.0,0.05,0.05,0.3,0.4,0.025,0.475,0.275,0.025,0.3,...,0.1,0.55,0.525,0.6,0.425,0.25,0.5,0.625,0.325,0.325


From Part 1, PCA2 is a principal component analysis of the following variables: FG, FGA, FG%, 2P, 2PA, 2P%, 3P, 3PA, 3P%, FT, FTA, and FT%. The analysis suggests keeping only PC1 and using it in place of those variables in the standardized dataset, which is what I will do. In the end, those 12 variables will be replaced by the PC1 component from PCA2. PCA3 covers the remaining variables.
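
The idea of replacing a block of correlated columns with a single first-component score can be sketched as follows. This is only an illustration on random data, computed with a plain NumPy SVD. It is not the Part 1 code, and the names `shooting`, `pc1`, and `explained` are made up here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 12 correlated shooting columns:
# one shared underlying factor plus a little independent noise.
factor = rng.normal(size=(200, 1))
shooting = factor @ rng.normal(size=(1, 12)) + 0.1 * rng.normal(size=(200, 12))

# Center the columns, then take the SVD; rows of vt are the principal axes.
centered = shooting - shooting.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Score of every row on the first principal component (the "PC1" column).
pc1 = centered @ vt[0]

# Share of the total variance captured by each component.
explained = s**2 / np.sum(s**2)
print(pc1.shape)
print(explained[0])
```

When the first entry of `explained` is close to 1, a single score per player carries most of the information in the original block, which is the situation the PCA2 result describes.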

In [8]:
pc = pca2.loc[:,'PC1']
standard['PC'] = pc
X = standard.loc[:,['PC', 'G', 'MP', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
                    'SOS','DraftYear','Pos.C', 'Pos.PF', 'Pos.PG', 'Pos.SF', 'Pos.SG']]
rookies_x = X.loc[X.DraftYear == 2019]
rookies_x = rookies_x.drop('DraftYear', axis = 1)
X = X.drop(X[X['DraftYear']==2019].index, axis = 0)
X = X.drop('DraftYear', axis = 1)

X.shape

(586, 16)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Now we have our features and our outcomes for training. The next step is to build the model. I will be using Keras and TensorFlow to construct a multi-output regression neural network to predict the 5 statistics.

In [10]:
from keras.models import Sequential
from keras.layers import Dense,Dropout
from keras.optimizers import Adam

After several rounds of parameter tuning, this is the model that we settled on. I started with one hidden layer, but that did not produce a wide enough variance in predictions. Once I added the second hidden layer and increased the size of the input layer, the model functioned as desired. From observation, the loss, which is mean squared error, is the best indicator of model quality. The reported accuracy jumps to around 95% after the first few epochs, but the loss continues to decrease as the number of epochs grows, and the lower the loss, the better the model functions as a predictor.
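
Because the targets are continuous, an error measured in the units of each statistic is easier to interpret than a classification-style accuracy. The sketch below is one possible way to inspect this and is not part of the original notebook: the arrays are invented, and `per_target_mae` is a hypothetical helper.

```python
import numpy as np

TARGETS = ['NBATRB', 'NBAAST', 'NBASTL', 'NBABLK', 'NBAPTS']

def per_target_mae(y_true, y_pred):
    """Mean absolute error for each output column (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = np.abs(y_true - y_pred).mean(axis=0)
    return {name: float(err) for name, err in zip(TARGETS, errors)}

# Invented numbers for two players, five statistics each.
y_true = np.array([[5.0, 2.0, 1.0, 0.5, 10.0],
                   [3.0, 4.0, 0.5, 0.5, 8.0]])
y_pred = np.array([[4.0, 2.0, 1.0, 1.0, 13.0],
                   [3.0, 3.0, 0.5, 0.0, 9.0]])

mae = per_target_mae(y_true, y_pred)
# Overall mean squared error, averaged over every player and every statistic.
mse = float(np.mean((y_true - y_pred) ** 2))
print(mae)
print(mse)
```

In this toy case the single MSE value is dominated by the points column, which is why a per-statistic breakdown can be a useful companion to the overall loss.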

In [51]:
# get the model
model = Sequential()
# Input layer
model.add(Dense(64, activation='relu', input_dim=16))
# Hidden Layers
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
# Output Layer
model.add(Dense(5, activation = 'linear'))
# Compile the model
model.compile(optimizer='Adam', metrics = ['acc'], loss = 'mse')

In [73]:
model.fit(X_train, y_train, epochs=1000, verbose=1)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.callbacks.History at 0x27055c3aa48>

In [74]:
y_pred_test=model.predict(X_test)

score = model.evaluate(X_test, y_test, verbose = 0)
print(score)

y_pred_train = model.predict(X_train)

[9.725679575386694, 0.9152542352676392]


In [75]:
y_pred1 = pd.DataFrame(np.round(y_pred_test,2), index = y_test.index,
                            columns = ['PRED_TRB', 'PRED_AST', 'PRED_STL', 'PRED_BLK', 'PRED_PTS'])

y_pred2 = pd.DataFrame(np.round(y_pred_train,2), index = y_train.index,
                            columns = ['PRED_TRB', 'PRED_AST', 'PRED_STL', 'PRED_BLK', 'PRED_PTS'])

test_predictions = y_pred1.join(stats.loc[:,['Name', 'G', 'School', 'Pk']])

train_predictions = y_pred2.join(stats.loc[:,['Name', 'G', 'School', 'Pk']])

test_predictions.to_csv('test_predictions.csv')
train_predictions.to_csv('train_predictions.csv')

Overfitting the model seems to occur once the loss metric dips below 0.7, and the best results are found around 1000 epochs.
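
One common way to choose the number of epochs without relying on a fixed loss threshold is to watch the loss on held-out data and stop once it no longer improves. The sketch below shows that logic in plain Python. The loss values are invented and `best_epoch` is a hypothetical helper, not something used in this notebook. Keras provides an `EarlyStopping` callback built on the same idea.

```python
def best_epoch(val_losses, patience):
    """Return (epoch_index, loss) for the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_idx, best_loss = 0, float('inf')
    waited = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx, best_loss

# Invented validation-loss curve: improves, then drifts upward.
history = [12.0, 9.5, 8.1, 7.9, 8.0, 8.3, 8.2, 8.6, 7.0]
print(best_epoch(history, patience=3))
```

A small `patience` stops early and may miss a later improvement, while a large one trains longer, so the value is itself something to tune.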
<br>
Here I will use the model to predict the stats for the 2019 rookie class, which is headlined by Zion Williamson, Ja Morant, and RJ Barrett.

In [76]:
rookies_pred = model.predict(rookies_x)

rookies_pred = pd.DataFrame(np.round(rookies_pred,2),index = rookies_x.index,
                            columns = ['PRED_TRB', 'PRED_AST', 'PRED_STL', 'PRED_BLK', 'PRED_PTS'])

Next, I merge the predictions with the information about each player.
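
The merge relies on index alignment: the prediction frame carries the original row labels, so `DataFrame.join` can match each prediction to the right player. Here is a minimal sketch with made-up names and numbers (`info`, `raw_pred`, and the values are all hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical player table; the index plays the role of the original row labels.
info = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Pk': [3, 1, 4, 2]},
                    index=[10, 11, 12, 13])

# Raw model output is a bare array, so the row labels must be re-attached.
raw_pred = np.array([[7.5], [12.25]])
pred = pd.DataFrame(raw_pred, index=[11, 13], columns=['PRED_PTS'])

# DataFrame.join aligns on the index (left join by default), then sort by pick.
merged = pred.join(info).sort_values(by='Pk')
print(merged)
```

If the prediction frame were built without `index=...`, it would get a fresh 0-based index and the join would silently attach the wrong players.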

In [77]:
rookie_predictions = rookies_pred.join(stats.loc[:,['Name', 'G', 'School', 'Pk']]).sort_values(by = 'Pk')

rookie_predictions.to_csv('rookie_predictions.csv')

rookie_predictions.head(15)

Unnamed: 0,PRED_TRB,PRED_AST,PRED_STL,PRED_BLK,PRED_PTS,Name,G,School,Pk
586,16.190001,6.46,2.23,1.89,33.779999,Zion Williamson,33,Duke,1
635,3.33,4.92,1.19,0.25,9.65,Ja Morant,65,Murray State,2
587,1.4,0.99,0.38,0.23,6.47,RJ Barrett,38,Duke,3
588,1.33,0.57,0.24,0.16,3.74,DeAndre Hunter,71,Virginia,4
589,1.38,1.73,0.46,0.16,5.11,Darius Garland,5,Vanderbilt,5
590,3.44,2.24,0.77,0.31,10.56,Jarrett Culver,75,Texas Tech,6
591,2.87,3.63,0.93,0.19,12.31,Coby White,35,UNC,7
592,7.01,2.32,0.95,0.82,16.07,Jaxson Hayes,32,Texas,8
593,6.63,2.57,0.97,0.68,16.32,Rui Hachimura,102,Gonzaga,9
633,2.03,1.74,0.6,0.17,8.42,Cam Reddish,36,Duke,10


The top two players in the draft are definitely the ones you would want to pick if this predictor is the metric you will base your decision on.

In [93]:
rookie_pred = pd.read_csv('rookie_predictions_good.csv')
rookie_pred.drop('Unnamed: 0', axis = 1).head()

Unnamed: 0,PRED_TRB,PRED_AST,PRED_STL,PRED_BLK,PRED_PTS,Name,G,School,Pk
0,9.4,3.55,1.61,1.08,25.23,Zion Williamson,33,Duke,1
1,6.98,7.74,1.75,0.64,20.77,Ja Morant,65,Murray State,2
2,4.49,3.68,1.07,0.43,17.01,RJ Barrett,38,Duke,3
3,2.52,0.73,0.5,0.25,7.25,DeAndre Hunter,71,Virginia,4
4,2.94,3.96,1.04,0.23,13.07,Darius Garland,5,Vanderbilt,5
