# Working Under Pressure: Predicting NBA Players' Performance in the Playoffs (Championship Games) Using a Random Forest Classifier

## Introduction 

Michael Jordan is widely considered the best basketball player of all time. One of the main reasons is that he elevated his performance when it mattered most: in the NBA Playoffs (the NBA championship games). For example, in the regular season he scored 30.1 points per game with a 50.1% eFG% (effective Field Goal Percentage), while in the playoffs, where the competition intensifies, he improved his averages to 33.4 points per game with a 50.3% eFG%.

In this project, I try to predict which NBA players perform better in the playoffs, based on their eFG% stats.
The eFG% measure is considered a better measure of shooting ability than plain field goal percentage, because it gives a higher weight to the more valuable 3-point shots than to 2-point shots.
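
As a quick illustration of that weighting, eFG% is commonly computed as (FG + 0.5 × 3P) / FGA, where FG counts all made field goals and 3P the made three-pointers. The helper name `efg_pct` and the sample numbers below are invented for illustration and are not part of the project code.

```python
def efg_pct(fg_made: int, three_made: int, fg_attempted: int) -> float:
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    if fg_attempted <= 0:
        raise ValueError("fg_attempted must be positive")
    return (fg_made + 0.5 * three_made) / fg_attempted

# Two hypothetical players, both making 5 of 10 shots
inside_only = efg_pct(fg_made=5, three_made=0, fg_attempted=10)  # all makes are 2-pointers
from_deep = efg_pct(fg_made=5, three_made=5, fg_attempted=10)    # all makes are 3-pointers
print(inside_only, from_deep)
```

Both players have the same raw field goal percentage, but the second one gets a higher eFG% because each made 3-pointer counts as 1.5 made shots.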

In the project, I use a random forest binary classifier. The response variable is assigned a value of 1 if a player improved his eFG% in the playoffs compared with his eFG% in the regular season, and 0 otherwise. The independent variables are the player's personal regular-season stats and some stats of the player's team.

## Data scraping

To create the dataset, I scraped the players' personal stats from the "Basketball Reference" website using Python's BeautifulSoup library and stored the data in a data frame.
The data includes regular per-game stats (points, assists, etc.) and more advanced stats (among others, efficiency measures such as PER, True Shooting Percentage and Win Shares).
Similarly, I scraped the stats of each player's team from the "NBA Miner" website.
The data covers 1997-2019, as data for earlier years is not available on the NBA Miner website.

***Important*** - 
There is no need to run the scraping code below, which could take a long time, as the files needed for the rest of the project have already been stored. 

In [None]:
#Running the code below is not needed. Files are already saved.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

#scraping players' stats data from basketball reference website
years = np.arange(1997,2020)
url_dic = {
            'pg': "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html",
            'advanced': "https://www.basketball-reference.com/leagues/NBA_{}_advanced.html",
            'playoffs_pg': "https://www.basketball-reference.com/playoffs/NBA_{}_per_game.html"
          }
for url in url_dic:
    for year in years:
        html = urlopen(url_dic[url].format(year))
        soup = BeautifulSoup(html)
        #get column headers
        headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
        headers = headers[1:]
        #get stats data by player 
        rows = soup.findAll('tr')[1:]
        player_stats = [[td.getText() for td in rows[i].findAll('td')]
                    for i in range(len(rows))]
        stats = pd.DataFrame(player_stats, columns = headers)
        #change numeric data to numeric format
        stats['Age'] = pd.to_numeric(stats['Age'])
        stats_numeric = stats.iloc[:,4:].apply(pd.to_numeric, errors='coerce')
        stats.iloc[:,4:] = stats_numeric
        #add year column
        listyear = [year]*len(stats.index)
        stats.insert(loc=0, column ='Year', value = listyear)
        #concat year data to main table collecting all years stats
        if year == years[0]:
            stats_data = stats
        else:
            stats_data=pd.concat([stats_data, stats]).reset_index(drop=True)
    
    stats_data.to_csv('stats_data_' + url + '.csv')



In [None]:
#Running the code below is not needed. Files are already saved.

#scraping players' team stats data from nba miner website
url = "http://www.nbaminer.com/nbaminer_nbaminer/advanced_team_stats.php?partitionpage={}"
years = np.arange(1,24)
for year in years:
    html = urlopen(url.format(year))
    soup = BeautifulSoup(html)
    table = soup.find('table', id="team_statsGrid")
    rows = table.findAll('tr')
    headers = [th.getText().strip() for th in rows[1].findAll('th')]
    #handling the fact that until 2004 there were 29 teams in the league and 30 teams later on (pages run from newest to oldest season)
    if year<16:
        rows = rows[3:33]
    else:
        rows = rows[3:32]
        
    player_stats = [[td.getText().strip() for td in rows[i].findAll('td')] 
                   for i in range(len(rows))]
    stats = pd.DataFrame(player_stats, columns = headers)
    #get image caption for team name
    a = [img["title"] for img in table.select("img[title]")]
    stats["Team"] = a 
    stats.iloc[:,3:] = stats.iloc[:,3:].apply(pd.to_numeric, errors='coerce')
    stats = stats.iloc[:,1:]
    listyear = [2020-year]*len(stats.index)
    stats.insert(loc=0, column ='Year', value = listyear)
    if year == years[0]:
        stats_data = stats
    else:
        stats_data=pd.concat([stats_data, stats]).reset_index(drop=True)
    
stats_data.to_csv('stats_team_advanced.csv')

## Data Preprocessing

First, let's import the required Python libraries. Then, I import the two "regular season" stats datasets (Per Game stats and Advanced stats) and concatenate them into one dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import RFECV

#import and concat regular season per game and advanced stats
data_pg = pd.read_csv('stats_data_pg.csv')
data_advanced = pd.read_csv('stats_data_advanced.csv')
data_season = pd.concat([data_pg.iloc[:, 1:], data_advanced.iloc[:, 8:]], axis=1)

As a sanity check, let's see how many observations the dataset contains for each player:

In [2]:
data_season['Player'].value_counts().head()

Nazr Mohammed    28
Andre Miller     25
Vince Carter     25
Joe Smith        25
Joe Johnson      23
Name: Player, dtype: int64

At first glance, it seems unreasonable that the data spans 23 years (1997-2019) yet some players have more observations than that. But players can move between teams during the regular season. For example, let's look at the teams of Nazr Mohammed (the most frequent player in the dataset) over the years:

In [3]:
data_season.loc[data_season['Player']=='Nazr Mohammed',['Year', 'Tm']]

|      | Year | Tm  |
|------|------|-----|
| 1487 | 1999 | PHI |
| 2024 | 2000 | PHI |
| 2554 | 2001 | TOT |
| 2555 | 2001 | PHI |
| 2556 | 2001 | ATL |
| 3095 | 2002 | ATL |
| 3598 | 2003 | ATL |
| 4160 | 2004 | TOT |
| 4161 | 2004 | ATL |
| 4162 | 2004 | NYK |


We can see that on several occasions Mohammed played for more than one team in a season; in those cases the dataset also includes a TOT ('Total') observation that aggregates his stats over the different teams he played for that season. The dataset is already sorted so that the last team listed for every player and year is also the last team the player actually played for that year (which is also the team he played for in the playoffs). Since my goal is to compare a player's performance in the regular season with his performance in the playoffs, I keep only the data of the last team the player played for. After doing so, the player value counts make more sense (and more familiar names are the most frequent in the dataset):

In [4]:
#drop duplicates and keep only the players' last team stats. 
data_season.drop_duplicates(subset =['Year', 'Player'],  keep = 'last', inplace = True)
data_season = data_season.reset_index(drop=True)
data_season['Player'].value_counts().head()

Dirk Nowitzki    21
Vince Carter     21
Kevin Garnett    20
Kobe Bryant      20
Jason Terry      19
Name: Player, dtype: int64

Dealing with NaN values (for both the regular season and playoff datasets):

- Drop garbage columns.

- Drop observations of players who didn't shoot the ball at all (keep only rows with FGA>0).

- Fill the 3P% column's NaN values with 0, as almost all of these observations are front-line players who are not 3-point shooters.

- Fill the NaN values of the 2P% and FT% columns with the mean value of the column.

In this piece of code, I also add a new feature, Total FGA, which I use when merging the datasets.
Then, I merge the regular season and playoffs datasets.

In [5]:
data_season = data_season.drop(['\xa0', '\xa0.1'], axis=1)
data_season = data_season[data_season['FGA']>0]
data_season['3P%'] = data_season['3P%'].fillna(0)
data_season['2P%'] = data_season['2P%'].fillna(data_season['2P%'].mean())
data_season['FT%'] = data_season['FT%'].fillna(data_season['FT%'].mean())
data_season['Total_FGA'] = data_season['FGA']*data_season['G']  

#import playoffs data
data_playoffs = pd.read_csv('stats_data_playoffs_pg.csv')
#data preprocessing: handle NaN values
data_playoffs = data_playoffs.dropna(subset=['Player'])
data_playoffs = data_playoffs[data_playoffs['FGA']>0]
data_playoffs['3P%'] = data_playoffs['3P%'].fillna(0)
#fill with the column mean (the original call passed the column itself, which filled nothing)
data_playoffs['2P%'] = data_playoffs['2P%'].fillna(data_playoffs['2P%'].mean())
data_playoffs['FT%'] = data_playoffs['FT%'].fillna(data_playoffs['FT%'].mean())
data_playoffs['Total_FGA'] = data_playoffs['FGA']*data_playoffs['G']

#merge season and playoffs data
data = pd.merge(data_season, data_playoffs[['Year', 'Player', 'eFG%', 'Total_FGA']], on=['Year', 'Player'], how='left', 
                suffixes = ('', '_Pfs'), indicator = 'Exist')
data = data[data['Exist'] == 'both'].reset_index(drop = True)
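
The merge-and-filter step above can be illustrated on a toy example. This is a minimal sketch with invented players and numbers, meant only to show what the `indicator` column contains and which rows survive the `'both'` filter.

```python
import pandas as pd

# Hypothetical toy tables; names and numbers are invented for illustration
season = pd.DataFrame({
    "Year": [2019, 2019, 2019],
    "Player": ["A", "B", "C"],
    "eFG%": [0.52, 0.48, 0.55],
})
playoffs = pd.DataFrame({
    "Year": [2019, 2019],
    "Player": ["A", "C"],
    "eFG%": [0.50, 0.58],
})

merged = pd.merge(season, playoffs, on=["Year", "Player"], how="left",
                  suffixes=("", "_Pfs"), indicator="Exist")
# Rows found only in `season` are flagged 'left_only' and have NaN playoff stats
both = merged[merged["Exist"] == "both"].reset_index(drop=True)
print(both[["Player", "eFG%", "eFG%_Pfs"]])
```

Player B did not appear in the playoffs table, so he is dropped; the remaining rows are the same ones an inner join on `Year` and `Player` would keep.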

Next, I enrich the data with stats of each player's team. Here I add some features (Wins, Pace) of the player's own team.

In [6]:
#import teams' stats data

#handling different names for same teams
pd.options.mode.chained_assignment = None  # ignoring copy warning
data.loc[data['Tm']=='CHH', 'Tm'] = 'CHO'
data.loc[data['Tm']=='WSB', 'Tm'] = 'WAS'

team = pd.read_csv('stats_team_advanced.csv').iloc[:,1:]
Team_key = pd.read_csv('Team_key.csv')
team = pd.merge(team, Team_key, on=['Team'], how='left')
data = pd.merge(data,team[['Year', 'Tm', 'Win', 'Pace']], on=['Year', 'Tm'], how='left')


Here I add a feature that relates to the quality of the defense the player is facing. Improving personal stats is much harder against good defensive teams than against poor ones. Hence, I calculate the weighted average of the eFG% that the player's playoff opponents allowed to their rivals in the regular season. The eFG% is weighted by the number of games the player's team played against each opponent in the playoffs.

In the block of code below:

- I import and preprocess the file 'Series.csv', downloaded from 'Basketball Reference', which contains all the playoff matchups in NBA history.

- I merge the opponent eFG% feature, taken from the team stats dataset, into the Series dataset.

- I calculate the weighted average of the quality-of-defense measure and merge that feature into the main dataset.
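
The weighting described above can be sketched on a toy example. The team names and numbers are invented; the column names mirror the ones used in the project code.

```python
import pandas as pd

# Hypothetical playoff run: one team faces two opponents (numbers invented)
series = pd.DataFrame({
    "Year": [2019, 2019],
    "Tm": ["Team X", "Team X"],
    "Team": ["Opp A", "Opp B"],   # opponent faced in the series
    "G": [4, 6],                  # games played in the series
    "Opp. EFg%": [0.50, 0.55],    # eFG% the opponent allowed in the regular season
})

# Weight each opponent's allowed eFG% by its share of the team's playoff games
games_total = series.groupby(["Year", "Tm"])["G"].transform("sum")
series["weighted"] = series["G"] / games_total * series["Opp. EFg%"]
wa = series.groupby(["Year", "Tm"])["weighted"].sum()
print(wa)
```

Here the team played 4 of its 10 playoff games against an opponent allowing 0.50 and 6 against one allowing 0.55, so the feature is 0.4 × 0.50 + 0.6 × 0.55 = 0.53.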


In [7]:
#Import and preprocess Series dataset
Series = pd.read_csv('Series.csv')[['Yr', 'Team', 'W', 'Team.1', 'W.1']]
Series = Series[Series.Yr>=1997]
Series2 = Series[['Yr', 'Team.1', 'W.1', 'Team', 'W']]
Series2.columns = Series.columns
Series = pd.concat([Series, Series2]).sort_values(by=['Yr', 'Team']).reset_index(drop=True)
Series['Team'] = Series.Team.str.split(pat='(', expand=True)[0].str.rstrip()
Series['Team.1'] = Series['Team.1'].str.split(pat='(', expand=True)[0].str.rstrip()
Series['G'] = Series['W']+Series['W.1']
Series = Series.rename(columns={"Yr": "Year", "Team": 'Tm', "Team.1":  "Team"})
Series.loc[Series['Team']=='Washington Bullets', 'Team'] = 'Washington Wizards'
Series.loc[Series['Tm']=='Washington Bullets', 'Tm'] = 'Washington Wizards'

#Merging the opp. eFG% feature to the Series dataset from the team dataset
Series = pd.merge(Series,team[['Year', 'Team', 'Opp. EFg%']], on=['Year', 'Team'], how='left')

#Calculating the weighted average of opponent eFG%
g = Series.groupby(['Year', 'Tm'])
Series['WA_opp_FG%'] = Series.G / g.G.transform("sum") * Series['Opp. EFg%']
g = g[['WA_opp_FG%']].sum()
g=g.reset_index()
g = g.rename(columns={"Tm": 'Team'})
Team_key = pd.read_csv('Team_key.csv')
g = pd.merge(g, Team_key, on=['Team'], how='left')
g = g.rename(columns={"key": 'Tm'})

#Merge to the main dataset
data = pd.merge(data, g[['Year', 'Tm', 'WA_opp_FG%']], on=['Year', 'Tm'], how='left')
data = data.dropna(subset=['WA_opp_FG%'])

## Training the model

Before training the model, I create the target variable, y, which indicates whether a player improved (or at least matched) his regular-season eFG% in the playoffs. The dataframe X contains the relevant features of a player, including his personal stats and team stats.

I designed the code so that the target stat can be chosen by assigning its name to the string variable 'target'; no other changes to the code are needed.

In [8]:
target = 'eFG%'
data['target_diff']=data[target+'_Pfs']-data[target]
data['target']=np.where(data['target_diff']>=0, 1, 0)
rng = list(range(5,51))+list(range(54,57))
X = data.iloc[:, rng]
y = data['target'] 

print('X shape: ', X.shape)

X shape:  (4407, 49)


In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2
                                ,random_state = 0)

print('y_train mean: ', y_train.mean())
print('y_test mean: ', y_test.mean())

y_train mean:  0.38070921985815603
y_test mean:  0.35034013605442177


I train a random forest classifier and use grid search to tune its hyperparameters (such as max depth). I use a 5-fold cross-validation scheme and then evaluate the best classifier on the test set.

In [10]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0, n_estimators=100)

from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators': [100, 200, 300], 'criterion': ['gini', 'entropy'], 
               'max_depth': [3,4,5,6]}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'roc_auc',
                           cv=5,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_

classifier = RandomForestClassifier(n_estimators =best_parameters['n_estimators'],
                                    criterion=best_parameters['criterion'],
                                    max_depth=best_parameters['max_depth'],
                                    random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print('X_train shape: ', X_train.shape)
print('RF accuracy: ', (cm[1,1]+cm[0,0])/cm.sum())
print('RF auc: ', roc_auc_score(y_test, y_pred))

X_train shape:  (3525, 49)
RF accuracy:  0.6621315192743764
RF auc:  0.528236669547095
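
One caveat when reading the AUC above: `roc_auc_score` is called on the hard 0/1 predictions from `predict`, not on scores. With hard labels, ROC AUC reduces to the average of sensitivity and specificity (balanced accuracy), so it ignores how the classifier ranks players. The sketch below shows the difference with a hand-rolled pairwise AUC and invented scores; the helper `pairwise_auc` is hypothetical and not part of scikit-learn. In the notebook, the scores would presumably come from `classifier.predict_proba(X_test)[:, 1]`, and the resulting number would differ from the one printed above.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and predicted probabilities for six players
y_true = [1, 1, 1, 0, 0, 0]
proba = [0.45, 0.40, 0.30, 0.35, 0.20, 0.10]
hard = [1 if p >= 0.5 else 0 for p in proba]   # hard labels at a 0.5 cutoff

auc_from_proba = pairwise_auc(y_true, proba)   # uses the ranking of the scores
auc_from_hard = pairwise_auc(y_true, hard)     # every prediction is 0, so all pairs tie
print(auc_from_proba, auc_from_hard)
```

In this toy case the scores rank 8 of the 9 positive/negative pairs correctly, yet the hard labels carry no ranking information at all and give an AUC of exactly 0.5.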


We can see that the classifier barely learns anything from the dataset, as the ROC AUC score is ~53%. One reason is that many players in the dataset didn't play enough or take enough shots (in the regular season or in the playoffs). When the sample size is small, the sample average has a large variance and can be far from the true average.
Hence, I decided to filter the dataset by shots taken, minutes played and games played, in order to keep only players whose regular-season and playoff averages are more stable.
After applying the filter, the ROC AUC improves to more than 58%, despite the fact that the model is trained on fewer observations.
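
To make the sample-size argument concrete: if each shot is treated as an independent make/miss trial with success probability p (a simplifying assumption that also ignores the 3-point weighting in eFG%), the standard error of the observed shooting percentage over n attempts is sqrt(p(1-p)/n). The helper name `shooting_pct_se` is invented for this sketch.

```python
import math

def shooting_pct_se(p: float, n_attempts: int) -> float:
    """Standard error of a make-rate over n independent attempts (binomial model)."""
    return math.sqrt(p * (1 - p) / n_attempts)

# Hypothetical 50% shooter observed over different numbers of attempts
se_by_attempts = {n: round(shooting_pct_se(0.5, n), 3) for n in (10, 50, 500)}
print(se_by_attempts)
```

Under this model, a 50% shooter observed over 10 attempts has a standard error of about 16 percentage points, against about 7 points over 50 attempts and about 2 points over 500, which is why comparing regular-season and playoff percentages for low-volume shooters is mostly noise.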

In [11]:
data = data[(data['Total_FGA_Pfs']>50) & (data['Total_FGA']>50) ]
data = data[(data['MP']>15) & (data['G']>20)].reset_index(drop=True)

X = data.iloc[:, rng]
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2
                                ,random_state = 0)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_

classifier = RandomForestClassifier(n_estimators =best_parameters['n_estimators'],
                                    criterion=best_parameters['criterion'],
                                    max_depth=best_parameters['max_depth'],
                                    random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print('X_train shape: ', X_train.shape)
print('RF accuracy: ', (cm[1,1]+cm[0,0])/cm.sum())
print('RF auc: ', roc_auc_score(y_test, y_pred))

X_train shape:  (1467, 49)
RF accuracy:  0.6839237057220708
RF auc:  0.5827241557769439


Furthermore, I remove highly correlated features (correlation of 0.9 or more): keeping both features of such a pair doesn't contribute to prediction and only adds noise. Removing these features improves the model's ROC AUC to 59%.

In [12]:
corrX = X.corr()
columns = np.full((corrX.shape[0],), True, dtype=bool)
for i in range(corrX.shape[0]):
    for j in range(i+1, corrX.shape[0]):
        if corrX.iloc[i,j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = X.columns[columns]
X = X[selected_columns]
y = data['target']
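
The nested loop above drops a column whenever it has a correlation of 0.9 or more with any earlier column. As a hedged sketch, the same rule can be written without explicit loops by masking the correlation matrix to its upper triangle; the toy data and variable names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy feature table (invented): 'b' nearly duplicates 'a', 'c' is unrelated
rng_toy = np.random.default_rng(0)
a = rng_toy.normal(size=200)
toy = pd.DataFrame({"a": a,
                    "b": a + rng_toy.normal(scale=0.01, size=200),
                    "c": rng_toy.normal(size=200)})

corr = toy.corr()
# Keep only entries above the diagonal, so each pair (i, j) with i < j appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop column j if it correlates >= 0.9 with any earlier column i
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Like the loop, this only looks at positive correlations; comparing `upper.abs()` instead would also catch strongly negative pairs, which would be a change to the method rather than a restatement of it.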

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2
                                ,random_state = 0)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_

classifier = RandomForestClassifier(n_estimators =best_parameters['n_estimators'],
                                    criterion=best_parameters['criterion'],
                                    max_depth=best_parameters['max_depth'],
                                    random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print('X_train shape: ', X_train.shape)
print('RF accuracy: ', (cm[1,1]+cm[0,0])/cm.sum())
print('RF auc: ', roc_auc_score(y_test, y_pred))

X_train shape:  (1467, 35)
RF accuracy:  0.6866485013623979
RF auc:  0.58993724932074


Next, I check whether there is a better combination of features using RFECV (recursive feature elimination with cross-validation), which removes features one at a time and checks whether doing so improves the model according to the defined scoring criterion. RFECV keeps all 35 features that remain after removing the correlated ones, so eliminating further features does not improve the cross-validated score.

In [15]:
rfecv = RFECV(estimator=classifier, step=1, cv=5, scoring='roc_auc',  n_jobs = -1)
rfecv.fit(X_train, y_train)
print('Optimal number of features: {}'.format(rfecv.n_features_))

Optimal number of features: 35


## Summary

In this project I tried to predict which players improve their performance in the NBA playoffs compared with the NBA regular season. Using a random forest algorithm and some feature engineering, I achieved ~69% accuracy and a 59% ROC AUC score.
The model could be improved in the following ways:
- Add more features, especially game-by-game data, as the model might benefit from features that describe the player's performance against the specific teams he faced in the playoffs.
- Train the model on more data: my model was limited to data from 1997 onward, which is what is available on the NBA Miner website.