Imports first, then load the raw data:

In [1]:
import re
import os
import pickle
import pandas as pd
import numpy as np
from functools import reduce

In [2]:
def load_pickle(path):
    # Open with a context manager so the file handle is closed after loading
    with open(path, "rb") as f:
        return pickle.load(f)

money = load_pickle("../data/raw/salaries.pickle")
advs = load_pickle("../data/raw/advstats.pickle")
stats = load_pickle("../data/raw/regstats.pickle")
rookies = load_pickle("../data/raw/rookies.pickle")

# Salary List Data

Quickly look at one of the dataframes:

In [5]:
pd.concat([money[2018].head(5), money[2018].tail(5)])  # look at the first and last five rows

Unnamed: 0,0,1,2,3
0,RK,NAME,TEAM,SALARY
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154"
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,"$35,654,150"
3,3,"Chris Paul, PG",Houston Rockets,"$35,654,150"
4,4,"Blake Griffin, PF",Detroit Pistons,"$32,088,932"
477,434,"Jonathan Gibson, PG",Boston Celtics,"$17,092"
478,435,"Tarik Phillip, G",Washington Wizards,"$9,474"
479,436,"Duncan Robinson, SF",Miami Heat,"$9,474"
480,437,"Theo Pinson, SG",Brooklyn Nets,"$4,737"
481,438,"Kendrick Nunn, SG",Miami Heat,"$4,737"


Things we need to fix before aggregating:

1. Fix the headers (0 -> Rk, 1 -> Player, 2 -> Tm, 3 -> Salary)
2. Remove the rows that contain those header labels, as the headers were repeated in the website tables
3. Add a Year column, needed once the yearly tables are aggregated into a single dataframe

These we can do after aggregation:

1. Change the Salary format (remove $ and commas)
2. Split the position from the name into a new column
3. Change the Salary datatype to int
4. Remove the Rk column; it carries no useful information
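
To see what a table looks like once all seven steps are applied, here is a minimal sketch on a toy table. The helper name `clean_salary_table` and the sample rows are hypothetical, made up for illustration; the toy frame only assumes the scraped layout shown above (integer column labels, header rows repeated as data).

```python
import pandas as pd

def clean_salary_table(raw, year):
    """Hypothetical helper: apply the seven cleaning steps to one scraped table."""
    df = raw.copy()
    df.columns = ["Rk", "Player", "Tm", "Salary"]        # fix headers
    df = df[df["Rk"] != "RK"].copy()                      # drop repeated header rows
    df["Year"] = year                                     # tag the year
    df["Salary"] = (
        df["Salary"]
        .str.replace("$", "", regex=False)                # literal "$", not a regex anchor
        .str.replace(",", "", regex=False)
        .astype(int)
    )
    # Split "Name, Pos" once on the first ", "
    df[["Player", "Pos"]] = df["Player"].str.split(", ", n=1, expand=True)
    return df.drop(columns="Rk")

# Made-up rows that mimic the scraped layout
toy = pd.DataFrame([
    ["RK", "NAME", "TEAM", "SALARY"],
    ["1", "Player A, PG", "Team X", "$1,000,000"],
    ["RK", "NAME", "TEAM", "SALARY"],
    ["2", "Player B, C", "Team Y", "$4,737"],
])
cleaned = clean_salary_table(toy, 2018)
print(cleaned)
```

Working on a copy keeps the raw table untouched, which makes it safe to re-run the cell.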

In [6]:
combined = {}

for k, v in money.items():
    v.columns = ['Rk', 'Player', 'Tm', 'Salary']
    # Drop the repeated header rows; .copy() avoids a SettingWithCopyWarning on the next line
    v = v[v.Rk != "RK"].copy()
    v["Year"] = k
    combined[k] = v


In [10]:
kk = combined[2018]

In [11]:
kk[kk.Player == "LeBron James"]

Unnamed: 0,Rk,Player,Tm,Salary,Year


Now we can combine all the dataframes into a single one and get the FA information for 2011-2018 (the 2018 salary information will be our test_y):

In [None]:
salaries = pd.concat(list(combined.values()))

In [None]:
salaries.shape

In [None]:
salaries.info()

In [None]:
salaries.head(5)

Now we can change the Salary format to remove $ and commas, and split the position out of the Player column:

In [None]:
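# regex=False treats '$' as a literal character rather than a regex anchor
salaries["Salary"] = salaries["Salary"].str.replace('$', '', regex=False).str.replace(',', '', regex=False)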

In [None]:
salaries['Salary'] = salaries['Salary'].astype(int)
del salaries['Rk']
# Split "Name, Pos" once on the first ", " into two columns
salaries[['Player', 'Pos']] = salaries['Player'].str.split(', ', n=1, expand=True)

In [None]:
salaries.head(5)

# Player Stats Data

Now we can clean up the stats data. First, take a quick look at one regular stats dataframe and one advanced stats dataframe:

In [None]:
stats.keys()

In [None]:
stats[2019].tail(10)

In [None]:
advs[2015].head(7)

## Before Aggregation of Regular and Advanced Stats Year-wise

Some things I notice:

- Both: Players who were traded mid-season appear once per team they played for, plus one row for the season total. I want to keep the cumulative total row (Tm set to TOT) and drop the partial per-team rows.
- Both: It will again be useful to add a Year column for when I aggregate the yearly tables into a single dataframe. This can be done after this step.
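
The TOT rule can be checked on a toy table. This is a minimal sketch with made-up rows, not the scraped data; it assumes, as in the stats tables, that a player-season is identified by `Player` and `Age` and that the season total row has `Tm == "TOT"`.

```python
import pandas as pd

# Made-up season table: "Player A" was traded and has two team rows plus a TOT row
toy = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Age":    [25, 25, 25, 30],
    "Tm":     ["AAA", "TOT", "BBB", "CCC"],
    "G":      [40, 70, 30, 82],
})

deduped = (
    toy.assign(total=toy["Tm"] == "TOT")
       .sort_values("total", ascending=False)   # TOT rows sort to the top
       .drop_duplicates(["Player", "Age"])      # keep the first row per player: TOT if present
       .drop(columns="total")
       .sort_values("Player")
)
print(deduped)
```

Players who were never traded have a single row, so they pass through unchanged.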

In [None]:
for i,j in stats.items():
    temp = j 
    temp['total'] = (temp['Tm'] == 'TOT')
    temp = temp.sort_values('total', ascending=False).drop_duplicates(['Player','Age']).drop(columns='total')
    stats[i]=temp

In [None]:
for i,j in advs.items():
    temp = j 
    temp['total'] = (temp['Tm'] == 'TOT')
    temp = temp.sort_values('total', ascending=False).drop_duplicates(['Player','Age']).drop(columns='total')
    advs[i]=temp

We can now aggregate advanced stats and regular stats PER year first:

In [None]:
combined={}
for (a1,b1),(a2,b2) in zip(stats.items(),advs.items()):
    df = b1.merge(b2, how="inner",on=["Player","Age","Pos","Tm","G"])#,"MP"])
#     pd.DataFrame(sorted(df.values, key=lambda x: x[1].split(' ')[::-1]),columns=df.columns)
    combined[a1]=df.sort_values("Player")
    print("Stats Row for "+str(a1)+": "+str(b1.shape[0])
          +", Adv Row for "+str(a2)+": "+str(b2.shape[0])+", After combined: "+str(df.shape[0]))
    

## Before Aggregation of Yearly stats into a single dataframe

Now we have combined stats for each year. We can take a look at one of them:

In [None]:
combined[2017].head(10)

In [None]:
combined[2015].columns

Some basic cleaning we can do before we combine all of the years into one dataframe:

1. Some columns can be eliminated. The two "Unnamed" columns were empty spacer columns on the website when I scraped it, and Rk_x and Rk_y are rankings by alphabetical order, which are insignificant as well.
2. We can remove one of the MP (Minutes Played) columns. The merge produced two of them because the regular stats data reports minutes played as a per-game average, whereas the advanced stats data reports minutes played as a season total. I will remove MP_y.
3. Add the year of the player stats in a column called 'Year'.
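
The `_x` and `_y` names come from how pandas handles columns that exist in both tables but are not merge keys. A minimal sketch with made-up rows (the values and the reduced column set are assumptions for illustration):

```python
import pandas as pd

# Made-up rows: MP is a per-game average on the left and a season total on the right
regular = pd.DataFrame({"Player": ["Player A"], "G": [70], "Rk": [1], "MP": [32.5]})
advanced = pd.DataFrame({"Player": ["Player A"], "G": [70], "Rk": [1], "MP": [2275], "PER": [18.0]})

merged = regular.merge(advanced, how="inner", on=["Player", "G"])
# Shared non-key columns get the default suffixes: _x from the left table, _y from the right
print(merged.columns.tolist())

tidy = merged.drop(columns=["Rk_x", "Rk_y", "MP_y"])
print(tidy.columns.tolist())
```

Here `MP_x` keeps the per-game value from the regular stats, which is the one retained in this notebook.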

In [None]:
for k,v in combined.items():
    v=v.drop(['Rk_x','Unnamed: 19','Unnamed: 24', 'Rk_y','MP_y'], axis=1)
    v['Year'] = k
    combined[k]=v

In [None]:
# Skip the 2019 and 2008 seasons
combined_stats = reduce(lambda x,y:pd.concat([x,y]),[v for k,v in combined.items() if k not in (2019, 2008)])
combined_stats = combined_stats.reset_index(drop=True)

In [None]:
combined_stats.info()

Now we can convert some of the datatypes into what we want: 

- Player, Position, Tm, Year -> unchanged
- Age, G, GS -> int
- Everything else -> floats

In [None]:
unchanged = ['Player','Pos','Tm','Year']
intlist = ['Age','G','GS']
floatlist= combined_stats.columns.difference(unchanged+intlist)

In [None]:
combined_stats[intlist] = combined_stats[intlist].astype(int)
combined_stats[floatlist] = combined_stats[floatlist].astype(float)

In [None]:
combined_stats.info()

In [None]:
combined_stats.head(10)

# Rookies list and using it to remove from player stats

Rookies have no previous-year stats (because they were in college or overseas), so we cannot use their data. Therefore we must identify the rookies for each year and remove them from the stats list.

First, look at the rookies data loaded earlier:

In [None]:
rookies[2017].Player.value_counts()

Some rows have the literal value "Player" in the Player column; these are table headers that were repeated on Basketball Reference. Some null values were also picked up during scraping. We can remove both kinds of rows. We can also add a Year column to identify which year each rookie belongs to (just like our other lists). After that we can concatenate the dataframes into a single one.

In [None]:
frames = []
for year, df in rookies.items():
    temp = df[df.Player != 'Player']            # drop repeated header rows
    temp = temp[temp.Player.notnull()].copy()   # drop rows with no player name
    temp['Year'] = year
    frames.append(temp)
combined_rookies = pd.concat(frames)


In [None]:
combined_rookies.head(5)

In [None]:
combined_rookies.info()

# Remove the rookies from the stats list

Now that we have our rookies list, we are ready to remove those players' stats.
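
Removing rows that match another table is an anti-join. A minimal sketch on made-up rows, assuming a rookie season is identified by `Player` and `Year` (the toy frames and their values are hypothetical):

```python
import pandas as pd

stats_toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player A"],
    "Year":   [2017, 2017, 2018],
    "PTS":    [10.0, 5.0, 12.0],
})
rookies_toy = pd.DataFrame({"Player": ["Player A"], "Year": [2017]})

keys = ["Player", "Year"]
# indicator=True adds a _merge column saying where each row was found
flagged = stats_toy.merge(
    rookies_toy[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
# "left_only" rows exist in the stats but not in the rookies list
veterans = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(veterans)
```

Player A's 2017 rookie season is dropped, while the same player's 2018 season is kept.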

In [None]:
combined_stats.shape

In [None]:
combined_rookies.shape

In [None]:
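COLS = ['Player','Year']

# Anti-join on Player and Year only: keep the stats rows that have no match in the rookies list
no_rookies = combined_stats.merge(combined_rookies[COLS].drop_duplicates(), on=COLS, indicator=True, how='left')
no_rookies = no_rookies[no_rookies['_merge'] == 'left_only']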

# combined_stats_no_rookies = pd.merge(combined_stats,combined_rookies, on=['Player','Year'], how='left')

In [None]:
no_rookies.shape

In [None]:
del no_rookies['_merge']

# Add Player stats into our Salary list

Now we need to extract the player information we need and add it to our FA list to finish cleaning the data. We have to decide what we need and how to get it. For now, we want to look at data from the season before the player becomes a free agent.

But first, let's save the data.

In [None]:
def save_dataset(data,filename):
    with open(filename, 'wb') as w:
        pickle.dump(data,w)

In [None]:
save_dataset(no_rookies,"../data/interim/stats.pickle")
save_dataset(salaries, "../data/interim/salaries.pickle")

Now we want to filter the data down to the FAs on our list, but before we do that, we must deal with one thing: player names are written differently in the salaries list (from ESPN.com) and the stats list (from Basketball-Reference.com). I noticed two differences:

1. Suffixes (notably Jr.) are missing in the Basketball-Reference list
2. Players who go by initials (e.g. J.J. Redick in Basketball-Reference) are missing the periods in the other list (e.g. JJ Redick in ESPN).
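
The two differences above can be handled by one normalization function. This is a minimal sketch, not the cells used in this notebook: `normalize_name` is a hypothetical helper that applies both rules to both sources, and "Player A Jr." is a made-up name.

```python
import pandas as pd

def normalize_name(name):
    """Hypothetical helper: drop a ' Jr.' suffix, then remove all periods."""
    return name.replace(" Jr.", "").replace(".", "")

espn_names = pd.Series(["JJ Redick", "Player A Jr."])     # second name is made up
bbref_names = pd.Series(["J.J. Redick", "Player A"])

espn_clean = espn_names.map(normalize_name)
bbref_clean = bbref_names.map(normalize_name)
print(espn_clean.tolist(), bbref_clean.tolist())
```

Applying the same function to both sides means a name only needs to be fixed in one place if another mismatch turns up later.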

In [None]:
salaries['Player'] = salaries['Player'].map(lambda x: x.replace(' Jr.',""))

In [None]:
no_rookies['Player'] = no_rookies['Player'].map(lambda x: x.replace('.',""))

In [None]:
data_all = pd.merge(salaries,no_rookies, on=['Player','Year'], how='left')

In [None]:
data_all.columns

In [None]:
salaries.shape

In [None]:
data_all.shape

In [None]:
data_all.info()

A lot of data is missing; what is going on?

## Some column conflicts happened on the merge:

- Tm: I will use Tm_x, as it comes from the salary data. It is the team that paid the player in the following season (the 2018 salary is for the 2018-19 season, while the 2018 stats are for the 2017-18 season).
- Pos: I will also use Pos_x, although position labels are largely up to the data collector; some players can play either guard position or either forward position, and some could be listed as SG/SF. Positions are no longer strictly defined, as the NBA is becoming more positionless: a guard can do what forwards used to do, and vice versa. Even some centers handle the ball like a guard!

In [None]:
del data_all['Tm_y']
del data_all['Pos_y']


In [None]:
data_all.rename(columns={'Tm_x': 'Tm','Pos_x':'Pos','MP_x':'MP'}, inplace=True)

## Missing Data:

Here are the possibilities based on some research:

1. Some players have no stats because they were out of the NBA (not playing basketball at all, or playing overseas). These players should be removed from consideration.

2. Some players are missing only a couple of stats (how to treat those datapoints will be explored in the next section)
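
`dropna(thresh=...)` separates these two cases: it keeps a row only if it has at least that many non-null values. A minimal sketch on a made-up table (the column set and the threshold of 4 are assumptions chosen for the toy data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Salary": [100, 200, 300],
    "G":      [70.0, np.nan, 65.0],
    "PTS":    [10.0, np.nan, 8.0],
    "3P%":    [0.35, np.nan, np.nan],
})

# Player B has no stats at all (2 non-null values); Player C is missing one stat (4 non-null)
kept = toy.dropna(thresh=4)
print(kept)
```

Player B is dropped as a player with no stats, while Player C stays with a single missing value to handle later.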

In [None]:
data_all[(data_all.isnull().any(axis=1))]

In [None]:
playerinfo =['Player','Tm','Salary','Year','Pos']
rest = data_all.columns.difference(playerinfo)

In [None]:
played = data_all.dropna(thresh=20)

In [None]:
played.info()

Before saving the data and moving on to EDA, flag which players were free agents:

In [None]:
with open("../data/raw/freeagents.pickle", "rb") as f:
    FA = pickle.load(f)

In [None]:
FA[2018].head()

In [None]:
FAS={}
for k,v in FA.items():
    v.columns=[re.sub(r"Player.+","Player",col) for col in v.columns]
    v.columns=[re.sub(r"\d+ Cap Hit","Cap Hit",col) for col in v.columns]
    v["Year"] = k
    FAS[k]=v

In [None]:
FAS[2018].head(5)

In [None]:
freeagents = reduce(lambda x,y:pd.concat([x,y]),[v for k,v in FAS.items() if k != 2019])

In [None]:
freeagents.head(5)

In [None]:
freeagents = freeagents[['Player','Year']]

In [None]:
freeagents

In [None]:
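# Match on Player and Year; dropping duplicate keys keeps one output row per row of played
FA_check = played.merge(freeagents.drop_duplicates(), on=['Player','Year'], indicator=True, how='left')

In [None]:
# The merge resets the index, so assign by position instead of aligning on the index
played = played.copy()
played["FA"] = np.where(FA_check["_merge"] == "both", "Yes", "No")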

In [None]:
played

In [None]:
save_dataset(played,"../data/processed/dataset2.pickle")