<h1>Neural Model<h1>

In [38]:
import copy
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.regularizers import l1, l2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
import rbo

In this notebook, I will be using a multi-layer perceptron to make predictions for each award. This notebook includes the code I used to create this neural model and make predictions for each award.
<br><br>
This code should be run after you have finished running the 'preprocessing' notebook.

<h2>Helper Functions</h2>

Here I am loading helper functions that I wrote in order to load my data, scale the data, fill in missing/NA values, and calculate and print my accuracy.

In [67]:
GET_TEST_RESULTS = True

In [68]:
# retrieves train, dev, and test data for the specified award
def get_data(award_name):
    x_train_pnames = pd.read_csv(f'data/train_x_{award}.csv', index_col=0)
    y_train = pd.read_csv(f'data/train_y_{award}.csv', index_col=0)
    x_dev_pnames = pd.read_csv(f'data/dev_x_{award}.csv', index_col=0)
    y_dev = pd.read_csv(f'data/dev_y_{award}.csv', index_col=0)
    x_test_pnames = pd.read_csv(f'data/test_x_{award}.csv', index_col=0)
    y_test = pd.read_csv(f'data/test_y_{award}.csv', index_col=0)
    x_train = x_train_pnames.drop(columns=['player', 'season'])
    x_dev = x_dev_pnames.drop(columns=['player', 'season'])
    x_test = x_test_pnames.drop(columns=['player', 'season'])
    return x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test

There are some missing statistics in my dataset. These occur mostly for older players, as there are some statistics (such as steals and blocks) that weren't kept track of when the NBA first started. I will deal with these missing statistics in three ways:
<br>
Method 2 - Fill in zeros for all of the missing stats, and use all the data
<br>
Method 5 - Fill in mean values for all of missing stats, and use all of the data
<br>
Method 4 - Use linear regressions to predict each of the missing stats, and use all of the data
<br>
<br>

Some other ways I may implement in the future to explore dealing with missing statistics are:
<br>
Remove all of these rows (This should end up using all player seasons after 1979)
<br>
Drop all the columns with missing stats, and use all of the data

In [70]:
# fill in the missing statistics (that were not kept track of for older players)
# using one of the three methods mentioned above
def fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames, x_test, x_test_pnames, method=4):
    if method == 2:
        x_train_filled = x_train.fillna(0)
        x_dev_filled = x_dev.fillna(0)
        x_test_filled = x_test.fillna(0)
    elif method == 4:
        x_train_filled = copy.copy(x_train_pnames)
        x_dev_filled = copy.copy(x_dev_pnames)
        x_test_filled = copy.copy(x_test_pnames)
        # three_p and three_pa - if season < 1980, set to NA
        x_train_filled['three_p'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_p)
        x_train_filled['three_pa'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_pa)
        
        x_train_filled = x_train_filled.drop(columns=['player', 'season'])
        x_dev_filled = x_dev_filled.drop(columns=['player', 'season'])
        x_test_filled = x_test_filled.drop(columns=['player', 'season'])

        # predict all missing values for these stats using lin reg
        # train data: player seasons where stat != NaN
        # test data: player seasons where stat == NaN
        for stat in ['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov']:
            # check if this stat has missing values
            if x_train_filled[stat].isnull().values.any():
                train_stat = x_train_filled[x_train_filled[stat].notna()]
                # for all lin reg predictions, don't use any of the columns that have missing values
                x_train_stat = train_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])
                y_train_stat = train_stat[[stat]]
                test_stat = x_train_filled[~x_train_filled[stat].notna()]
                x_test_stat = test_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])

                reg_stat = LinearRegression().fit(x_train_stat, y_train_stat)
                pred_test_stat = reg_stat.predict(x_test_stat)
                pred_stat_dict = {index: round(value[0], 3) for index, value in zip(x_train_filled[~x_train_filled[stat].notna()].index, pred_test_stat)}
                x_train_filled[stat] = x_train_filled[stat].fillna(pred_stat_dict)

        # fill in three_pct - three_p / three_pa, gs_pct - gs / g
        x_train_filled['three_pct'] = x_train_filled['three_p'] / x_train_filled['three_pa']
        x_train_filled['gs_pct'] = x_train_filled['gs'] / x_train_filled['g']
        x_train_filled = x_train_filled.fillna(0)
        x_dev_filled['three_pct'] = x_dev_filled['three_p'] / x_dev_filled['three_pa']
        x_dev_filled = x_dev_filled.fillna(0)
        x_test_filled['three_pct'] = x_test_filled['three_p'] / x_test_filled['three_pa']
        x_test_filled = x_test_filled.fillna(0)
    elif method == 5:
        x_train_filled = x_train.fillna(x_train.mean())
        x_dev_filled = x_dev.fillna(x_dev.mean())
        x_test_filled = x_test.fillna(x_test.mean())
    else:
        print('method of dealing with missing stats must be either 2, 4, or 5')
    return x_train_filled, x_dev_filled, x_test_filled

The award points that players receive are very skewed, as most players received zero points for a given award. To reduce this skew, I scaled all of the award point values logarithmically.
<br>
<br>
I also applied a Min Max scaler so that each feature would be in the range (0, 1).

In [6]:
# log scale all the y values, then min-max scale all the x and y values
def scale_vals(x_vals, y_vals):
    # add 1 so that you can take log of players with 0 award points
    y_log_vals = np.log(y_vals.award_pts_won + 1).values.reshape(-1, 1)
    x_scaler = MinMaxScaler().fit(x_vals, y_vals)
    x_vals = x_scaler.transform(x_vals)
    y_scaler = MinMaxScaler()
    y_vals = y_scaler.fit_transform(y_log_vals)
    return x_vals, y_vals, y_scaler
    
def unscale_vals(y_vals_scaled, y_scaler):
    y_log_vals = y_scaler.inverse_transform(y_vals_scaled)
    y_vals = np.expm1(y_log_vals)
    return y_vals

I decided to measure my models' performance using three metrics: Mean Squared Error, % of correct MVP predictions, and Rank-Biased Overlap.
<br>
I chose to use rank-biased overlap because it is an accuracy metric for rankings that weights higher ranked items more than lower ranked items. In addition, when comparing two lists using this metric, rank-biased overlap can deal with items that occur in one list but are not seen in the other list.
<br>
For more information on rank-biased overlap, see this article: http://codalism.com/research/papers/wmz10_tois.pdf

In [74]:
# print accuracy metrics by comparing the given lists (y_pred and y_actual)
# accuracy metrics that I am using are % of correct winners picked, rank biased overlap, and mean squared error
def print_accuracy(y_pred, y_actual, x_pnames, rbo_cutoff = None, verbose=0):
    all_data = copy.copy(x_pnames)
    all_data['award_pts_actual'] = y_actual['award_pts_won']
    all_data['award_pts_pred'] = y_pred
    num_correct = 0
    num_yrs = 0
    rbo_vals = []
    for year in set(all_data.season):
        # a. % of correct winners picked
        data_in_yr = all_data[all_data['season'] == year]
        pred_winner_row = data_in_yr['award_pts_pred'].argmax()
        actual_winner_row = data_in_yr['award_pts_actual'].argmax()
        pred_winner, pred_pts = data_in_yr.iloc[pred_winner_row]['player'], data_in_yr.iloc[pred_winner_row]['award_pts_pred']
        actual_winner, actual_pts = data_in_yr.iloc[actual_winner_row]['player'], data_in_yr.iloc[actual_winner_row]['award_pts_actual']
        if verbose > 0:
            print(f'{year}')
            print(f'Predicted Winner: {pred_winner} ({pred_pts} award pts)')
            print(f'Actual Winner: {actual_winner} ({actual_pts} award pts)')
        if pred_winner == actual_winner:
            num_correct += 1
        num_yrs += 1
        
        # b. Rank-Biased Overlap
        # calculate RBO:
        # get rows in given year with players that received votes - sorted by num votes
        vote_getters_df = data_in_yr[data_in_yr['award_pts_actual'] > 0]
        num_vote_getters = len(vote_getters_df)
        vote_getters_df = vote_getters_df.sort_values(by=['award_pts_actual'], ascending=False)
        # get top-(num_vote_getters) rows from predictions
        pred_vote_getters_df = data_in_yr.sort_values(by=['award_pts_pred'], ascending=False)
        if rbo_cutoff == None:
            pred_vote_getters_df = pred_vote_getters_df[:num_vote_getters]
        else:
            cutoff = min(rbo_cutoff, num_vote_getters)
            vote_getters_df = vote_getters_df[:cutoff]
            pred_vote_getters_df = pred_vote_getters_df[:cutoff]
        vote_getters = vote_getters_df['player'].values
        pred_vote_getters = pred_vote_getters_df['player'].values
        # deal with edge case where two vote getters have exact same name
        vote_getters = list(set(vote_getters))
        pred_vote_getters = list(set(pred_vote_getters))
        #print(len(vote_getters))
        #print(len(pred_vote_getters))
        if verbose > 1:
            print('Actual vote getters:')
            print(vote_getters)
            print(f'Predicted top-{num_vote_getters} vote getters:')
            print(pred_vote_getters)
        # compute RBO from these two lists
        rbo_num = rbo.RankingSimilarity(vote_getters, pred_vote_getters).rbo()
        rbo_vals.append(rbo_num)
        if verbose > 0:
            print(f'Rank Biased Overlap: {rbo_num}')
        
    print(f'% of winners predicted correctly: {round(num_correct / num_yrs * 100, 2)}%')
    print(f'Average Rank-Biased Overlap: {round(sum(rbo_vals) / len(rbo_vals), 3)}')
    # c. MSE
    mse = mean_squared_error(y_actual, y_pred)
    print(f'Mean Squared Error: {mse}')

<h2>Implementing Neural Model</h2>

Here, I implement a separate neural model for predicting each of the five awards. I manually tested changing a bunch of different hyperparameters one at a time in order to come up with the best set of hyperparameters I could. One area for future improvement would be to use grid search or another method to search for better hyperparameters. 

In [75]:
for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
    print(f'\n\n*****{award}*****')
    x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
    for fill_na_method in [4, 2, 5]:
        print(f'\nfilling na values using method {fill_na_method}')
        x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
                                                                         x_test, x_test_pnames, method=fill_na_method)
        x_train_scaled, y_train_scaled, y_train_scaler = scale_vals(x_train_filled, y_train)
        x_dev_scaled, y_dev_scaled, y_dev_scaler = scale_vals(x_dev_filled, y_dev)
        
        # create model
        model = Sequential()
        #model.add(Dense(40, activation='relu', kernel_regularizer=l2(l=0.6)))
        model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu', kernel_regularizer=l2(l=0.6)))        
        #model.add(Dense(40, activation='relu'))
        #model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu'))
        #model.add(Dense(20, activation='relu', kernel_regularizer=l2(l=0.1)))
        #model.add(Dense(1, activation='sigmoid', kernel_regularizer=l2(l=0.1)))
        model.add(Dense(40, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        #model.add(Dense(1))
        #model.compile(loss='mse', optimizer=Adam(lr=0.01), metrics=['accuracy'])
        model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9, clipnorm=1.0), metrics=['accuracy'])
        #model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['accuracy'])
        model.fit(x_train_scaled, y_train_scaled, epochs=50, batch_size=300, verbose=0)
        
        train_pred_scaled = model.predict(x_train_scaled)
        dev_pred_scaled = model.predict(x_dev_scaled)
        train_pred = unscale_vals(train_pred_scaled, y_train_scaler)
        dev_pred = unscale_vals(dev_pred_scaled, y_dev_scaler)
        print('\nTrain accuracy:')
        print_accuracy(train_pred, y_train, x_train_pnames)
        print('\nDev accuracy:')
        print_accuracy(dev_pred, y_dev, x_dev_pnames, verbose=0)



*****mvp*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 40.0%
Average Rank-Biased Overlap: 0.608
Mean Squared Error: 3455.0053425848796

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.448
Mean Squared Error: 5248.896052518296

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 38.18%
Average Rank-Biased Overlap: 0.59
Mean Squared Error: 3455.0033176317493

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.438
Mean Squared Error: 5248.89100670029

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 38.18%
Average Rank-Biased Overlap: 0.584
Mean Squared Error: 3454.952450350006

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.439
Mean Squared Error: 5248.842520457459


*****dpoy*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 10.7

<h2>Get Final Results</h2>
After testing and tweaking my model, I calculated its final accuracy on my test data:

In [76]:
if GET_TEST_RESULTS:
    for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
        print(f'\n\n*****{award}*****')
        x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
        for fill_na_method in [4]:
            print(f'\nfilling na values using method {fill_na_method}')
            x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
                                                                             x_test, x_test_pnames, method=fill_na_method)
            x_train_scaled, y_train_scaled, y_train_scaler = scale_vals(x_train_filled, y_train)
            x_dev_scaled, y_dev_scaled, y_dev_scaler = scale_vals(x_dev_filled, y_dev)
            x_test_scaled, y_test_scaled, y_test_scaler = scale_vals(x_test_filled, y_test)

            # create model
            model = Sequential()
            #model.add(Dense(40, activation='relu', kernel_regularizer=l2(l=0.6)))
            model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu', kernel_regularizer=l2(l=0.6)))        
            #model.add(Dense(40, activation='relu'))
            #model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu'))
            #model.add(Dense(20, activation='relu', kernel_regularizer=l2(l=0.1)))
            #model.add(Dense(1, activation='sigmoid', kernel_regularizer=l2(l=0.1)))
            model.add(Dense(40, activation='relu'))
            model.add(Dense(1, activation='sigmoid'))
            #model.add(Dense(1))
            #model.compile(loss='mse', optimizer=Adam(lr=0.01), metrics=['accuracy'])
            model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9, clipnorm=1.0), metrics=['accuracy'])
            #model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['accuracy'])
            model.fit(x_train_scaled, y_train_scaled, epochs=50, batch_size=300, verbose=0)

            train_pred_scaled = model.predict(x_train_scaled)
            test_pred_scaled = model.predict(x_test_scaled)
            train_pred = unscale_vals(train_pred_scaled, y_train_scaler)
            test_pred = unscale_vals(test_pred_scaled, y_test_scaler)
            print('\nTest accuracy:')
            print_accuracy(test_pred, y_test, x_test_pnames, verbose=0)



*****mvp*****

filling na values using method 4

Test accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.367
Mean Squared Error: 3766.1329367187127


*****dpoy*****

filling na values using method 4

Test accuracy:
% of winners predicted correctly: 0.0%
Average Rank-Biased Overlap: 0.291
Mean Squared Error: 616.9849134132764


*****roy*****

filling na values using method 4

Test accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.755
Mean Squared Error: 3684.5887546505296


*****mip*****

filling na values using method 4

Test accuracy:
% of winners predicted correctly: 0.0%
Average Rank-Biased Overlap: 0.407
Mean Squared Error: 660.1662631386804


*****smoy*****

filling na values using method 4

Test accuracy:
% of winners predicted correctly: 40.0%
Average Rank-Biased Overlap: 0.518
Mean Squared Error: 823.7149630759205
