# Analyzing NBA salaries in relation to performance: finding the most overpaid and underpaid players

## Amolak Singh

The goal of this mini-project is to determine which NBA players will be overpaid and underpaid in the 2018-19 season relative to their performance in the prior season. In the past, the NBA may have paid players based more on reputation and less informative statistics, but advanced analytics are now in common use. The methodology of this project is as follows:
* 1st notebook: 1_nba_data_prep_clustering.ipynb
  1. We will use advanced statistics pulled from *basketball-reference.com* to cluster players into groups based on their performance in the 2017-18 season. Grouping similar players together lets us make apples-to-apples comparisons when analyzing their salaries.
  2. After creating the group labels, we will pull salary information from *basketball-reference.com* and clean that data to join with the data above.
* 2nd notebook: 2_nba_results_analysis_clean.ipynb
  3. We will analyze the salaries in relation to performance. Within each cluster, we will calculate the percentile rank of each player's salary. Players in the bottom quartile will be considered underpaid relative to their performance, and players in the top quartile will be considered overpaid.
  4. We will finish with a simple regression analysis and some final conclusions.
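
The within-cluster ranking described in step 3 can be sketched in pandas as follows. This is a minimal illustration on made-up data, not the code from the second notebook; the column names `label` and `salary` and the helper `flag_pay_status` are assumptions chosen for the example.

```python
import pandas as pd

def flag_pay_status(df, group_col="label", salary_col="salary"):
    """Rank each salary within its cluster and flag the outer quartiles.

    Hypothetical helper: the column names are illustrative assumptions.
    """
    out = df.copy()
    # Percentile rank of each salary among players in the same cluster.
    out["salary_pct"] = out.groupby(group_col)[salary_col].rank(pct=True)
    out["status"] = "fair"
    out.loc[out["salary_pct"] <= 0.25, "status"] = "underpaid"
    out.loc[out["salary_pct"] > 0.75, "status"] = "overpaid"
    return out

# Made-up players in two clusters.
toy = pd.DataFrame({
    "Player": list("ABCDEFGH"),
    "label": [0, 0, 0, 0, 1, 1, 1, 1],
    "salary": [1e6, 5e6, 10e6, 30e6, 2e6, 3e6, 4e6, 25e6],
})
flagged = flag_pay_status(toy)
print(flagged[["Player", "label", "salary_pct", "status"]])
```

Ranking inside each cluster, rather than across the whole league, is what keeps the comparison between similar players.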

### Imports

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from sklearn.preprocessing import scale,StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

### Variables to be used later

In [2]:
year = 2018
url = 'https://www.basketball-reference.com/leagues/NBA_{}_advanced.html'.format(year)
salary_url = 'https://www.basketball-reference.com/contracts/players.html'
seed = 6
clusters = 10

In [3]:
not_yet_num_cols = ['Age','G','MP','PER','TS%','3PAr','FTr','ORB%','DRB%','TRB%',
                    'AST%','STL%','BLK%','TOV%','USG%','OWS','DWS','WS','WS/48',
                    'OBPM','DBPM','BPM','VORP']
df_interesting_variables = ['TS%','3PAr','FTr','ORB%','DRB%','TRB%','AST%','STL%',
                            'BLK%','TOV%','USG%','OWS','DWS','WS','WS/48','OBPM',
                            'DBPM','BPM', 'VORP']

### Pulling the player analytics with BeautifulSoup

In [4]:
html = urlopen(url)
soup = BeautifulSoup(html,"lxml")
cols = [th.getText() for th in soup.findAll('tr',limit=3)[0].findAll('th')]
cols.remove('Rk')

In [5]:
rows = soup.findAll('tr')[1:]
dat = [[td.getText() for td in row.findAll('td')] for row in rows]
df = pd.DataFrame(dat,columns=cols)
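
The row extraction can be tried offline on a small, made-up HTML table. The sketch below assumes the page layout that the cells above imply: the rank sits in a `th` cell of each data row, and header rows may be repeated inside the table body. The HTML string and the name `toy_df` are hypothetical, and the built-in `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML mimicking the assumed layout: a header row, data rows,
# and a repeated header row in the middle of the table body.
html_text = """
<table>
  <tr><th>Rk</th><th>Player</th><th>G</th></tr>
  <tr><th>1</th><td>Player A</td><td>75</td></tr>
  <tr><th>Rk</th><th>Player</th><th>G</th></tr>
  <tr><th>2</th><td>Player B</td><td>70</td></tr>
</table>
"""

soup = BeautifulSoup(html_text, "html.parser")
all_rows = soup.find_all("tr")

# Column names come from the first row; 'Rk' is dropped because the rank
# is stored in a <th> cell, not a <td>, in each data row.
cols = [th.get_text() for th in all_rows[0].find_all("th")]
cols.remove("Rk")

# Extract the <td> text of every later row, then drop rows with no <td> cells
# (the repeated header rows).
data = [[td.get_text() for td in row.find_all("td")] for row in all_rows[1:]]
data = [row for row in data if row]

toy_df = pd.DataFrame(data, columns=cols)
print(toy_df)
```

Without the filter on empty rows, repeated header rows would show up in the frame as rows of missing values.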

In [6]:
df = pd.concat([df[['Player','Pos','Tm',]], df[not_yet_num_cols].apply(pd.to_numeric, errors='ignore')], axis=1)
# keep only players who played at least 50 games and at least 800 minutes and who had
# positive win shares per 48 minutes (the 'WS/48' column, which needs bracket indexing)
df = df[(df.G >= 50) & (df.MP >= 800) & (df['WS/48'] > 0)]
df = df.reset_index()
df.to_csv('full_nba_data')
df.head(5)

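|   | index | Player | Pos | Tm | Age | G | MP | PER | TS% | 3PAr | ... | TOV% | USG% | OWS | DWS | WS | WS/48 | OBPM | DBPM | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Alex Abrines | SG | OKC | 24.0 | 75.0 | 1134.0 | 9.0 | 0.567 | 0.759 | ... | 7.4 | 12.7 | 1.3 | 1.0 | 2.2 | 0.094 | -0.5 | -1.7 | -2.2 | -0.1 |
| 1 | 1 | Quincy Acy | PF | BRK | 27.0 | 70.0 | 1359.0 | 8.2 | 0.525 | 0.8 | ... | 13.3 | 14.4 | -0.1 | 1.1 | 1.0 | 0.036 | -2.0 | -0.2 | -2.2 | -0.1 |
| 2 | 2 | Steven Adams | C | OKC | 24.0 | 76.0 | 2487.0 | 20.6 | 0.63 | 0.003 | ... | 13.2 | 16.7 | 6.7 | 3.0 | 9.7 | 0.187 | 2.2 | 1.1 | 3.3 | 3.3 |
| 3 | 3 | Bam Adebayo | C | MIA | 20.0 | 69.0 | 1368.0 | 15.7 | 0.57 | 0.021 | ... | 13.6 | 15.9 | 2.3 | 1.9 | 4.2 | 0.148 | -1.6 | 1.8 | 0.2 | 0.8 |
| 4 | 6 | LaMarcus Aldridge | C | SAS | 32.0 | 75.0 | 2509.0 | 25.0 | 0.57 | 0.068 | ... | 6.8 | 29.1 | 7.4 | 3.5 | 10.9 | 0.209 | 3.0 | 0.3 | 3.3 | 3.3 |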
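
Column names that contain a slash, such as `WS/48`, cannot be reached with attribute access: `df.WS/48` is parsed as the `WS` column divided by the number 48. The short sketch below, on made-up numbers, shows the difference and why bracket indexing is the safer way to filter on that column.

```python
import pandas as pd

# Made-up values, chosen so the two expressions give visibly different results.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "WS": [4.0, 96.0, -1.0],
    "WS/48": [0.10, 0.20, -0.05],
})

# Attribute access: the WS column divided by the number 48.
divided = toy.WS / 48
# Bracket access: the column that is literally named 'WS/48'.
per_48 = toy["WS/48"]

print(divided.tolist())
print(per_48.tolist())

# Filtering on the named column.
positive = toy[toy["WS/48"] > 0]
print(positive["Player"].tolist())
```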


In [7]:
df_k_means = df[df_interesting_variables]
df_k_means.to_csv('clustering_nba_data')

In [8]:
# We have a high degree of collinearity in our data, which makes analytical methods such as
# multiple OLS regression undesirable. We will try unsupervised learning techniques instead.
corr = df_k_means.corr()
# note: Styler.set_precision was removed in pandas 2.0; there, use .format(precision=2) instead
corr.style.background_gradient().set_precision(2)

|   | TS% | 3PAr | FTr | ORB% | DRB% | TRB% | AST% | STL% | BLK% | TOV% | USG% | OWS | DWS | WS | WS/48 | OBPM | DBPM | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TS% | 1.0 | -0.14 | 0.35 | 0.33 | 0.24 | 0.3 | -0.089 | -0.11 | 0.25 | -0.049 | 0.095 | 0.66 | 0.22 | 0.59 | 0.73 | 0.57 | 0.13 | 0.56 | 0.47 |
| 3PAr | -0.14 | 1.0 | -0.55 | -0.73 | -0.52 | -0.63 | -0.069 | 0.0018 | -0.53 | -0.37 | -0.13 | -0.15 | -0.24 | -0.2 | -0.32 | 0.17 | -0.5 | -0.19 | -0.13 |
| FTr | 0.35 | -0.55 | 1.0 | 0.5 | 0.41 | 0.47 | 0.099 | 0.055 | 0.42 | 0.27 | 0.23 | 0.4 | 0.3 | 0.42 | 0.5 | 0.16 | 0.37 | 0.38 | 0.34 |
| ORB% | 0.33 | -0.73 | 0.5 | 1.0 | 0.75 | 0.89 | -0.29 | -0.17 | 0.65 | 0.22 | -0.094 | 0.19 | 0.28 | 0.25 | 0.46 | -0.13 | 0.59 | 0.28 | 0.17 |
| DRB% | 0.24 | -0.52 | 0.41 | 0.75 | 1.0 | 0.97 | -0.11 | -0.15 | 0.59 | 0.2 | 0.13 | 0.22 | 0.48 | 0.35 | 0.46 | -0.036 | 0.62 | 0.37 | 0.31 |
| TRB% | 0.3 | -0.63 | 0.47 | 0.89 | 0.97 | 1.0 | -0.18 | -0.17 | 0.65 | 0.23 | 0.053 | 0.23 | 0.44 | 0.34 | 0.5 | -0.067 | 0.65 | 0.37 | 0.29 |
| AST% | -0.089 | -0.069 | 0.099 | -0.29 | -0.11 | -0.18 | 1.0 | 0.39 | -0.23 | 0.46 | 0.53 | 0.33 | 0.2 | 0.33 | 0.21 | 0.49 | -0.018 | 0.4 | 0.46 |
| STL% | -0.11 | 0.0018 | 0.055 | -0.17 | -0.15 | -0.17 | 0.39 | 1.0 | -0.073 | 0.27 | 0.1 | 0.047 | 0.29 | 0.14 | 0.13 | 0.2 | 0.3 | 0.36 | 0.32 |
| BLK% | 0.25 | -0.53 | 0.42 | 0.65 | 0.59 | 0.65 | -0.23 | -0.073 | 1.0 | 0.22 | -0.063 | 0.11 | 0.35 | 0.21 | 0.39 | -0.16 | 0.73 | 0.34 | 0.21 |
| TOV% | -0.049 | -0.37 | 0.27 | 0.22 | 0.2 | 0.23 | 0.46 | 0.27 | 0.22 | 1.0 | -0.062 | -0.12 | 0.12 | -0.045 | 0.0075 | -0.17 | 0.42 | 0.13 | 0.097 |
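
To make the collinearity concrete, the strongly correlated column pairs can be listed directly from a correlation matrix. The following is a minimal sketch on synthetic data; the helper name `high_corr_pairs` and the 0.8 threshold are assumptions made for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Synthetic columns: 'a_noisy' is nearly a copy of 'a', 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=200),
    "b": rng.normal(size=200),
})
result = high_corr_pairs(toy)
print(result)
```

Applied to the clustering columns, a list like this would surface pairs such as `DRB%` and `TRB%`, whose correlation in the table above is 0.97.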


### Clustering our players and creating group labels

We chose 10 clusters because too few clusters (5-7) did not differentiate players well enough, while too many clusters (13-18) split similar players into separate groups. Ten is a reasonable middle ground.
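
The cluster count here was settled by inspecting the resulting groups. One way to complement that judgment is to compare inertia and silhouette scores across a range of cluster counts. The sketch below does this on synthetic blobs rather than on the player data; the blob centers, the range of `k`, and the variable names are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player statistics: 4 well-separated groups.
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=6)

scores = {}
for k in range(2, 8):
    pipe = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=k, n_init=10, random_state=6))
    labels = pipe.fit_predict(X)
    # Score the clustering in the same scaled space that KMeans saw.
    X_scaled = pipe.named_steps["standardscaler"].transform(X)
    inertia = pipe.named_steps["kmeans"].inertia_
    scores[k] = (inertia, silhouette_score(X_scaled, labels))
    print(k, round(inertia, 1), round(scores[k][1], 3))

best_k = max(scores, key=lambda k: scores[k][1])
print("highest silhouette at k =", best_k)
```

On real player data the scores rarely point to a single obvious value, so they are best read alongside a look at who ends up in each cluster.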

In [9]:
scaler = StandardScaler()
kmeans = KMeans(n_clusters=clusters, random_state = seed)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit_transform(df_k_means)

array([[  3.56327972,   7.8432778 ,   3.6776219 , ...,  10.45152617,
         10.70726077,   5.02273776],
       [  2.49841862,   7.24571452,   3.84066854, ...,  11.09889861,
         10.70093602,   5.70838771],
       [  9.2584267 ,   4.1158974 ,   6.47098628, ...,   7.0684398 ,
          4.23176146,   7.00808363],
       ..., 
       [  3.46494263,   8.64437405,   4.43613896, ...,  11.1101057 ,
         11.48729918,   5.25958131],
       [  5.38020716,   5.29425728,   3.57706196, ...,   8.09613375,
          7.15412442,   4.77983356],
       [  4.87536628,   4.0060952 ,   3.70092642, ...,   9.91360687,
          8.05168184,   5.57309797]])

In [10]:
labs = pipeline.predict(df_k_means)
results_df = pd.concat([df[['Player','Pos','Tm',]], df_k_means], axis=1)
results_df['label'] = labs.tolist()

### Pulling and cleaning salary data

In [11]:
salaries = pd.read_html(salary_url, header=1)[0]
salaries = salaries[salaries['Rk'] != 'Rk']
salaries = salaries[['Player', '2018-19', 'Guaranteed']]

In [12]:
# removing commas and dollar signs from the money columns
for col in ['2018-19','Guaranteed']:
    salaries[col] = salaries[col].replace(r'[\$,]', '', regex=True).astype(float)
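
The cleaning step can be tried on a few made-up salary strings. The sketch below uses `pd.to_numeric(..., errors='coerce')` in place of `astype(float)`, a swapped-in choice that turns anything unparseable into `NaN` instead of raising; the sample values are hypothetical.

```python
import pandas as pd

# Made-up salary strings, including one missing value.
raw = pd.Series(["$37,457,154", "$1,349,383", None], name="2018-19")

# Strip dollar signs and thousands separators, then convert to numbers.
# A raw string (r'...') keeps the backslash in the pattern intact.
cleaned = raw.str.replace(r"[\$,]", "", regex=True)
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric.tolist())
```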

In [13]:
salaries.to_csv('player_salaries')
salaries.head()

|   | Player | 2018-19 | Guaranteed |
|---|---|---|---|
| 0 | Stephen Curry | 37457154.0 | 166476240.0 |
| 1 | Chris Paul | 35654150.0 | 159730592.0 |
| 2 | LeBron James | 35654150.0 | 113310573.0 |
| 3 | Russell Westbrook | 35350000.0 | 158382000.0 |
| 4 | Blake Griffin | 31873932.0 | 102704892.0 |


In [14]:
combined_df = pd.merge(results_df, salaries, how='left', left_on=['Player'], right_on=['Player'])

In [15]:
# we will fill the null salaries with 0. These players are either free agents who have not signed new deals yet,
# players who will be playing abroad, or potentially retirees
combined_df = combined_df.fillna(0)
combined_df.head()
combined_df.to_csv('joined_nba_data')
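
Because the join is on player name alone, it can help to see which players found no salary row before those gaps are filled with 0. The sketch below uses the `indicator` parameter of `merge` on made-up rows; the frames `players` and `pay` are hypothetical stand-ins for `results_df` and `salaries`.

```python
import pandas as pd

# Made-up stand-ins for the clustered players and the salary table.
players = pd.DataFrame({"Player": ["A", "B", "C"], "label": [0, 1, 1]})
pay = pd.DataFrame({"Player": ["A", "C"], "2018-19": [5e6, 12e6]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = players.merge(pay, how="left", on="Player", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Player"].tolist()
print(unmatched)

# Fill only the salary column, leaving the other columns untouched.
merged["2018-19"] = merged["2018-19"].fillna(0)
print(merged.drop(columns="_merge"))
```

Reviewing the unmatched names helps separate genuine free agents from players whose names are simply spelled differently in the two tables.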