##### All-Star/NBA Predictions

An MLP implementation with TensorFlow and Keras.

I tried to keep this code generally in line with my scikit-learn implementation in the MLPmodeling.py script.

First, we need to import a number of necessary packages and functions.

In [384]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.activations import relu, sigmoid
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

My historical player stats live at the path below. Make sure the data is present there (or change the path) before running this cell.

I explain how I arrived at these 7 main features in the `README.md` file in my GitHub at https://github.com/rogersheu/NBA-ML-Predictions/blob/master/README.md.

In [385]:
fileName = 'C:/Users/Roger/Documents/GitHub/All-Star-Predictions/baseData/ML/all_stats_20211201.csv'
df = pd.read_csv(fileName)
print('All players data loaded.')

X = df[['RPG','APG','SBPG','PPG','TS','WS48','Perc']]
y = df['allLeague']

All players data loaded.


Because this data set is fairly imbalanced, with around 1000 all-League (either All-Star or All-NBA) player-seasons out of roughly 7400 (~13.5%), I used train_test_split with stratification so that the training and test sets contain the same proportions of 1's and 0's. With a test set of well over 1000 player-seasons, a badly skewed split is unlikely even without stratification, but it's a good idea to preempt such issues.

In [386]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y)
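To see what stratification buys you, here is a minimal sketch on synthetic data. The labels and the 7 placeholder features are made up (135 positives out of 1000, to mirror the ~13.5% rate above), and `random_state` is only there to make the demo repeatable; none of this is the real player data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 135 positives out of 1000 (13.5%).
rng = np.random.default_rng(0)
y_demo = np.array([1] * 135 + [0] * 865)
X_demo = rng.normal(size=(1000, 7))  # 7 placeholder features

# stratify=y_demo asks for the same class proportions in both splits.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Dropping the `stratify` argument and rerunning with a few different seeds shows how much the two rates can drift apart by chance.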

I decided to use scikit-learn's StandardScaler rather than Keras' normalize function. As far as I can tell, normalize with its default arguments rescales each row to unit L2 norm, which squeezes these mostly non-negative stats into the 0-to-1 range, whereas StandardScaler rescales each feature to a mean of 0 and a standard deviation of 1, which is preferable for a multilayer perceptron. Running Keras' normalize instead of the scaler gives unfortunate results, as the model quickly converges to assigning all 1's to the binary classification.

In [390]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
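The fit-on-train, transform-on-test pattern above can be checked by hand. The sketch below uses two made-up columns on very different scales (the numbers are arbitrary, not real stats) and compares StandardScaler's output with the same computation written out in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two made-up features on very different scales.
train = np.column_stack([rng.normal(10, 6, 500), rng.normal(0.55, 0.05, 500)])
test = np.column_stack([rng.normal(10, 6, 100), rng.normal(0.55, 0.05, 100)])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# Same computation by hand (StandardScaler uses the population std, ddof=0).
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual))
```

Reusing the training mean and standard deviation on the test set keeps information about the test data from leaking into the preprocessing step.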

Here, we finally set up the neural network. There are a few other ways to do this, such as passing a list of layers to Sequential([ ... ]), but I like the modularity of adding layers one at a time.

As mentioned at the top, to closely align with my scikit-learn-based code, I used a 4-node relu hidden layer followed by a 1-node sigmoid output. I also tried a pair of 2-node relu-activated hidden layers. That set-up worked adequately here, but in scikit-learn it led to odd results, with correct classifications but oddly maxed-out probabilities. Therefore, I defaulted to the single 4-node relu-activated layer here.

In [391]:
model = tf.keras.models.Sequential()

model.add(Dense(4, activation='relu', name = "hidden1"))
#model.add(Dense(2, activation='relu', name = "hidden1"))
#model.add(Dense(2, activation='relu', name = "hidden2"))
model.add(Dense(1, activation='sigmoid', name = "sigmoid"))
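For a feel of what this small network computes, here is a NumPy-only sketch of the forward pass of a 7-input, 4-unit relu, 1-unit sigmoid network. The weights are random placeholders rather than trained values, and the function names are my own, so treat it as an illustration of the architecture and not as what Keras does internally.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_features, n_hidden = 7, 4

# Random placeholder weights; a trained model would have learned values.
W1 = rng.normal(scale=0.5, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    hidden = relu(X @ W1 + b1)        # plays the role of "hidden1"
    return sigmoid(hidden @ W2 + b2)  # plays the role of the "sigmoid" output layer

X_batch = rng.normal(size=(5, n_features))  # 5 synthetic, already-scaled rows
probs = forward(X_batch)
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

The parameter count works out to 7×4 + 4 + 4×1 + 1 = 37, which is tiny compared with the several thousand training rows.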

In [392]:
# Passing metrics as a list works across Keras versions; newer releases reject a bare string.
model.compile(optimizer='adam',
              loss="binary_crossentropy",
              metrics=['accuracy'])
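The loss chosen here, binary cross-entropy, penalizes confident wrong probabilities much more heavily than hesitant ones. Below is a small NumPy sketch of the standard formula; the helper name, the clipping constant `eps`, and the example numbers are my own choices for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Hand-made examples: same labels, good vs. bad probabilities.
confident_right = binary_crossentropy([1, 0, 0], [0.9, 0.1, 0.2])
confident_wrong = binary_crossentropy([1, 0, 0], [0.1, 0.9, 0.8])
coin_flip = binary_crossentropy([1, 0], [0.5, 0.5])  # equals ln(2)

print(confident_right, coin_flip, confident_wrong)
```

A model that always outputs 0.5 scores ln(2) ≈ 0.693, so any loss comfortably below that means the model has learned something.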

In [393]:
# Remove 'verbose=0' to see training epochs, especially if you're suspicious of this fitting going so quietly.
model.fit(X_train, y_train, verbose=0, epochs=100);

We can quickly evaluate the model here, though I also include a confusion matrix and classification report below, so we will have multiple ways of evaluating the model output. This is the first time the model is used on the test data (X_test and y_test), and the accuracy is quite good, provided the model trained for enough epochs.

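If your accuracy is not higher than around 86%, which is roughly what predicting 0 for every player would score on this test set, the model is doing no better than that trivial baseline, and you may need to run the fit again.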

In [394]:
# These are test-set metrics (no separate validation split is used here).
test_loss, test_acc = model.evaluate(X_test, y_test)
print(test_loss, test_acc)

0.15123455226421356 0.9407407641410828
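The 86% figure can be checked against the class counts in the classification report further down (1277 zeros and 208 ones in the test set). A model that always predicts 0 gets every zero right and every one wrong, so its accuracy is just the share of zeros:

```python
# Test-set class counts taken from the "support" column of the classification report.
n_negative, n_positive = 1277, 208

# Accuracy of a trivial model that predicts 0 for every player-season.
baseline_acc = n_negative / (n_negative + n_positive)

reported_acc = 0.9407407641410828  # accuracy printed by model.evaluate above
print(f"baseline={baseline_acc:.4f}, model={reported_acc:.4f}")
```

Because of this imbalance, accuracy alone flatters the model, which is why the precision and recall figures below are worth reading too.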


In [395]:
y_true, y_pred = y_test, model.predict(X_test)

y_pred_bin = np.zeros([len(y_pred),1], dtype=int)

Keras, sklearn, or some other package probably contains a function to do this quickly, but I just wanted to convert model predictions into a binary array to make the confusion matrix and classification report. Python's list comprehension is a really nice way to present this conversion.

Feel free to adjust the threshold if you want to lean more toward precision or recall, too, though keep in mind the impact this may have on your eventual new data predictions.

In [402]:
# Each prediction is a length-1 array, so compare its single element to the threshold.
y_pred_bin = [1 if item[0] > 0.5 else 0 for item in y_pred]

print(confusion_matrix(y_true, y_pred_bin))

[[1257   20]
 [  68  140]]
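To make the precision/recall trade-off behind the threshold concrete, here is a small sketch with hand-made labels and probabilities (not model output). The helper `precision_recall` is my own; it flattens its inputs so that a Keras-style column of predictions would also work.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the positive class at a given probability threshold."""
    y_true = np.ravel(y_true)
    y_hat = (np.ravel(y_prob) > threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hand-made labels and probabilities, purely for illustration.
y_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_demo = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05, 0.02]

for t in (0.25, 0.5, 0.75):
    p, r = precision_recall(y_demo, p_demo, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 moves precision from 0.67 up to 1.00 while recall falls from 1.00 to 0.50.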


In [403]:
print(classification_report(y_true, y_pred_bin))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97      1277
           1       0.88      0.67      0.76       208

    accuracy                           0.94      1485
   macro avg       0.91      0.83      0.86      1485
weighted avg       0.94      0.94      0.94      1485
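The class-1 row of this report follows directly from the confusion matrix printed earlier. As a quick check, using scikit-learn's layout (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[1257, 20],
               [68, 140]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the players flagged as all-League, how many were
recall = tp / (tp + fn)      # of the actual all-League players, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The recall of 0.67 is the number to watch: the model misses about a third of actual all-League player-seasons at the 0.5 threshold.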



Once we are happy with our model, we can use it on some new data.

In the following block, I bring in data from the ongoing season, extracting the same columns as the historical data.

After transforming the data as before and using those transformed data for the prediction, we get all-League probabilities in the variable df_2022pred, which can then be exported as needed.

In [404]:
df_2022 = pd.read_csv("C:/Users/Roger/Documents/GitHub/All-Star-Predictions/baseData/dailystats/2022-01-06/stats_20220106.csv")
X_2022 = df_2022[['RPG','APG','SBPG','PPG','TS','WS48','Perc']]

df_2022pred = df_2022.copy()
X_2022 = scaler.transform(X_2022)
X_predicted = model.predict(X_2022)
df_2022pred['Prob'] = [i[0] for i in X_predicted]
df_2022pred; # Export to target CSV if so desired
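If you do want to export the predictions, one way is to sort by probability and write a CSV. The sketch below uses a tiny made-up frame in place of df_2022pred; the `Player` column name and the output path are hypothetical, so substitute whatever your data actually contains.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for df_2022pred; 'Player' is a hypothetical column name.
demo_pred = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Prob": [0.12, 0.91, 0.47],
})

# Highest all-League probability first.
ranked = demo_pred.sort_values("Prob", ascending=False).reset_index(drop=True)

# Placeholder output location; point this at your own target file.
out_path = os.path.join(tempfile.gettempdir(), "allstar_probs_demo.csv")
ranked.to_csv(out_path, index=False)
print(ranked)
```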

That's it! If you have made it this far, thank you for reading through my implementation of a multilayer perceptron (MLP) using TensorFlow/Keras, along with some scikit-learn tools (splitting, scaling, and output evaluation).