Imports first, then load the raw data:

In [1]:
import re
import os
import pickle
import pandas as pd
from functools import reduce

In [2]:
fas=pickle.load(open("../data/raw/salaries.pickle", "rb"))
advs=pickle.load(open("../data/raw/advstats.pickle", "rb"))
stats=pickle.load(open("../data/raw/regstats.pickle", "rb"))
rookies=pickle.load(open("../data/raw/rookies.pickle", "rb"))

# Salary List Data

Quickly look at one of the dataframes:

In [3]:
fas[2018].apply(lambda x: x.head(5).append(x.tail(5))) #look at first and last five

Unnamed: 0,0,1,2,3
0,RK,NAME,TEAM,SALARY
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154"
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,"$35,654,150"
3,3,"Chris Paul, PG",Houston Rockets,"$35,654,150"
4,4,"Blake Griffin, PF",Detroit Pistons,"$32,088,932"
477,434,"Jonathan Gibson, PG",Boston Celtics,"$17,092"
478,435,"Tarik Phillip, G",Washington Wizards,"$9,474"
479,436,"Duncan Robinson, SF",Miami Heat,"$9,474"
480,437,"Theo Pinson, SG",Brooklyn Nets,"$4,737"
481,438,"Kendrick Nunn, SG",Miami Heat,"$4,737"


In [4]:
fas.keys()

dict_keys([2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009])

In [5]:
fas[2017].columns

Int64Index([0, 1, 2, 3], dtype='int64')

Certain things we need to fix: 

1. Fix some headers (0 -> Rank, 1-> Name, 2-> Team, 3-> Salary)
2. Remove rows with those header labels, as they were repeated in the website tables
3. Add year column for when the lists are aggregated into a single dataframe

This we can do after aggregations:

1. Change Salary format (remove $ and commas)
2. Split position from the name into a new column
3. Change Salary datatype to int
4. Remove Rk column; it's not significant

In [6]:
import re
combined ={}

for k,v in fas.items():
    v.columns = ['Rk','Player','Tm','Salary']
    v = v[v.Rk!= "RK"]
    v["Year"] = k
    combined[k]=v

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys


In [7]:
combined[2017].head(10)

Unnamed: 0,Rk,Player,Tm,Salary,Year
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154",2017
2,2,"Blake Griffin, PF",LA Clippers,"$32,088,932",2017
3,3,"Paul Millsap, PF",Denver Nuggets,"$31,269,231",2017
4,4,"Kyle Lowry, PG",Toronto Raptors,"$31,200,000",2017
5,5,"Gordon Hayward, SF",Boston Celtics,"$29,727,900",2017
6,6,"Mike Conley, PG",Memphis Grizzlies,"$28,530,608",2017
7,7,"Russell Westbrook, PG",Oklahoma City Thunder,"$28,530,608",2017
8,8,"James Harden, PG",Houston Rockets,"$28,299,399",2017
9,9,"DeMar DeRozan, SG",Toronto Raptors,"$27,739,975",2017
10,10,"Al Horford, C",Boston Celtics,"$27,734,406",2017


Now we can combine all the dataframes into a single one and get the FA information from 2011-2018 (2018 Salary information will be our test_y)

In [8]:
from functools import reduce
salaries = reduce(lambda x,y:pd.concat([x,y]),[v for k,v in combined.items()])

In [9]:
salaries.shape

(4511, 5)

In [10]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4511 entries, 1 to 463
Data columns (total 5 columns):
Rk        4511 non-null object
Player    4511 non-null object
Tm        4511 non-null object
Salary    4511 non-null object
Year      4511 non-null int64
dtypes: int64(1), object(4)
memory usage: 211.5+ KB


In [11]:
salaries.head(5)

Unnamed: 0,Rk,Player,Tm,Salary,Year
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154",2018
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,"$35,654,150",2018
3,3,"Chris Paul, PG",Houston Rockets,"$35,654,150",2018
4,4,"Blake Griffin, PF",Detroit Pistons,"$32,088,932",2018
5,5,"Gordon Hayward, SF",Boston Celtics,"$31,214,295",2018


Now we can changed the Salary format to remove $ and commas, as well as splitting 

In [12]:
salaries["Salary"] = salaries["Salary"].str.replace('$','').str.replace(',','')

In [13]:
salaries.Salary = salaries.Salary.astype(int)
del salaries['Rk']
salaries['Player'], salaries['Pos'] = salaries['Player'].str.split(', ', 1).str

In [14]:
salaries.head(5)

Unnamed: 0,Player,Tm,Salary,Year,Pos
1,Stephen Curry,Golden State Warriors,37457154,2018,PG
2,Russell Westbrook,Oklahoma City Thunder,35654150,2018,PG
3,Chris Paul,Houston Rockets,35654150,2018,PG
4,Blake Griffin,Detroit Pistons,32088932,2018,PF
5,Gordon Hayward,Boston Celtics,31214295,2018,SF


# Player Stats Data

Now we can clean up the stats data. First quickly look at one of the regular stats dataframe and advanced stats dataframe:

In [15]:
stats.keys()

dict_keys([2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008])

In [16]:
stats[2019].tail(10)

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
724,525,Thaddeus Young,PF,30,IND,81,81,30.7,5.5,10.4,...,0.644,2.4,4.1,6.5,2.5,1.5,0.4,1.5,2.4,12.6
725,526,Trae Young,PG,20,ATL,81,81,30.9,6.5,15.5,...,0.829,0.8,2.9,3.7,8.1,0.9,0.2,3.8,1.7,19.1
726,527,Cody Zeller,C,26,CHO,49,47,25.4,3.9,7.0,...,0.787,2.2,4.6,6.8,2.1,0.8,0.8,1.3,3.3,10.1
727,528,Tyler Zeller,C,29,TOT,6,1,15.5,2.7,5.0,...,0.778,1.8,2.2,4.0,0.7,0.2,0.5,0.7,3.3,7.7
728,528,Tyler Zeller,C,29,ATL,2,0,5.5,0.0,1.0,...,,1.0,2.0,3.0,0.5,0.0,0.0,0.0,2.0,0.0
729,528,Tyler Zeller,C,29,MEM,4,1,20.5,4.0,7.0,...,0.778,2.3,2.3,4.5,0.8,0.3,0.8,1.0,4.0,11.5
730,529,Ante Zizic,C,22,CLE,59,25,18.3,3.1,5.6,...,0.705,1.8,3.6,5.4,0.9,0.2,0.4,1.0,1.9,7.8
731,530,Ivica Zubac,C,21,TOT,59,37,17.6,3.6,6.4,...,0.802,1.9,4.2,6.1,1.1,0.2,0.9,1.2,2.3,8.9
732,530,Ivica Zubac,C,21,LAL,33,12,15.6,3.4,5.8,...,0.864,1.6,3.3,4.9,0.8,0.1,0.8,1.0,2.2,8.5
733,530,Ivica Zubac,C,21,LAC,26,25,20.2,3.8,7.2,...,0.733,2.3,5.3,7.7,1.5,0.4,0.9,1.4,2.5,9.4


In [17]:
advs[2015].head(7)

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,Unnamed: 19,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP
0,1,Quincy Acy,PF,24,NYK,68,1287,11.9,0.533,0.181,...,,1.0,0.7,1.7,0.063,,-2.3,-0.8,-3.1,-0.3
1,2,Jordan Adams,SG,20,MEM,30,248,12.8,0.489,0.291,...,,0.0,0.4,0.4,0.073,,-1.8,1.2,-0.6,0.1
2,3,Steven Adams,C,21,OKC,70,1771,14.1,0.549,0.005,...,,1.9,2.2,4.1,0.111,,-1.4,1.8,0.4,1.1
3,4,Jeff Adrien,PF,28,MIN,17,215,14.2,0.494,0.0,...,,0.2,0.2,0.4,0.087,,-2.7,0.5,-2.2,0.0
4,5,Arron Afflalo,SG,29,TOT,78,2502,10.7,0.533,0.377,...,,1.6,1.0,2.6,0.05,,-0.5,-1.3,-1.8,0.1
5,5,Arron Afflalo,SG,29,DEN,53,1750,11.7,0.533,0.37,...,,1.4,0.4,1.8,0.05,,0.0,-1.6,-1.6,0.2
6,5,Arron Afflalo,SG,29,POR,25,752,8.2,0.533,0.396,...,,0.2,0.5,0.8,0.049,,-1.5,-0.8,-2.3,-0.1


## Before Aggregation of Regular and Advanced Stats Year-wise

Some things I notice:

- Both: There are players who were traded mid-season that have appeared as rows of both teams, and a total season. I want to keep the cumulative total row (Tm assigned to TOT) and get rid of partial team stats.
- Both: It will be useful to again add a Year column for after I aggregate each lists into a single dataframe. This can be done after this step

In [18]:
for i,j in stats.items():
    temp = j 
    temp['total'] = (temp['Tm'] == 'TOT')
    temp = temp.sort_values('total', ascending=False).drop_duplicates(['Player','Age']).drop('total', 1)
    stats[i]=temp

In [19]:
for i,j in advs.items():
    temp = j 
    temp['total'] = (temp['Tm'] == 'TOT')
    temp = temp.sort_values('total', ascending=False).drop_duplicates(['Player','Age']).drop('total', 1)
    advs[i]=temp

We can now aggregate advanced stats and regular stats PER year first:

In [20]:
combined={}
for (a1,b1),(a2,b2) in zip(stats.items(),advs.items()):
    df = b1.merge(b2, how="inner",on=["Player","Age","Pos","Tm","G"])#,"MP"])
#     pd.DataFrame(sorted(df.values, key=lambda x: x[1].split(' ')[::-1]),columns=df.columns)
    combined[a1]=df.sort_values("Player")
    print("Stats Row for "+str(a1)+": "+str(b1.shape[0])
          +", Adv Row for "+str(a2)+": "+str(b2.shape[0])+", After combined: "+str(df.shape[0]))
    

Stats Row for 2019: 530, Adv Row for 2019: 530, After combined: 530
Stats Row for 2018: 540, Adv Row for 2018: 540, After combined: 540
Stats Row for 2017: 486, Adv Row for 2017: 486, After combined: 486
Stats Row for 2016: 476, Adv Row for 2016: 476, After combined: 476
Stats Row for 2015: 492, Adv Row for 2015: 492, After combined: 492
Stats Row for 2014: 482, Adv Row for 2014: 482, After combined: 482
Stats Row for 2013: 469, Adv Row for 2013: 469, After combined: 469
Stats Row for 2012: 478, Adv Row for 2012: 478, After combined: 478
Stats Row for 2011: 452, Adv Row for 2011: 452, After combined: 452
Stats Row for 2010: 442, Adv Row for 2010: 442, After combined: 442
Stats Row for 2009: 445, Adv Row for 2009: 445, After combined: 445
Stats Row for 2008: 451, Adv Row for 2008: 451, After combined: 451


## Before Aggregation of Yearly stats into a single dataframe

Now we have combined stats for each year. We can take a look at one of them:

In [21]:
combined[2017].head(10)

Unnamed: 0,Rk_x,Player,Pos,Age,Tm,G,GS,MP_x,FG,FGA,...,Unnamed: 19,OWS,DWS,WS,WS/48,Unnamed: 24,OBPM,DBPM,BPM,VORP
476,170,A.J. Hammons,C,24,DAL,22,0,7.4,0.8,1.9,...,,-0.2,0.2,0.0,-0.001,,-7.5,2.0,-5.6,-0.1
355,58,Aaron Brooks,PG,32,IND,65,0,13.8,1.9,4.6,...,,-0.2,0.5,0.3,0.016,,-2.1,-2.5,-4.6,-0.6
454,157,Aaron Gordon,SF,21,ORL,80,72,28.7,4.9,10.8,...,,2.0,1.7,3.7,0.077,,-0.2,-0.4,-0.7,0.8
462,181,Aaron Harrison,SG,22,CHO,5,0,3.4,0.0,0.8,...,,-0.1,0.0,-0.1,-0.146,,-9.6,-2.1,-11.6,0.0
83,352,Adreian Payne,PF,25,MIN,18,0,7.5,1.3,3.0,...,,0.0,0.2,0.2,0.086,,-2.2,0.7,-1.5,0.0
396,203,Al Horford,C,30,BOS,68,68,32.3,5.6,11.8,...,,3.6,2.7,6.3,0.138,,1.0,2.0,3.1,2.8
399,221,Al Jefferson,C,32,IND,66,1,14.1,3.6,7.1,...,,1.2,1.1,2.3,0.119,,-1.5,-1.5,-3.1,-0.2
343,10,Al-Farouq Aminu,PF,26,POR,61,25,29.1,3.0,7.6,...,,-0.1,2.0,1.9,0.051,,-2.3,1.2,-1.1,0.4
345,12,Alan Anderson,SF,34,LAC,30,0,10.3,1.0,2.7,...,,0.0,0.1,0.1,0.02,,-2.6,-2.3,-4.9,-0.2
194,464,Alan Williams,C,24,PHO,47,0,15.1,2.9,5.7,...,,1.1,0.9,2.1,0.142,,-1.8,0.1,-1.7,0.1


In [22]:
combined[2015].columns

Index(['Rk_x', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP_x', 'FG', 'FGA',
       'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA',
       'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Rk_y', 'MP_y', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%',
       'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'Unnamed: 19', 'OWS', 'DWS',
       'WS', 'WS/48', 'Unnamed: 24', 'OBPM', 'DBPM', 'BPM', 'VORP'],
      dtype='object')

Some basic cleaning we can do before we combine all of the years into one dataframe:

1. Some columns can be eliminated (Two "unnamed" arbitrary empty columns were on the website when I scraped. Rk_x and Rk_y were arbitrary rankings done by alphabetic order and is insiginficant as well)
2. We can remove one of the MP (Minutes Played) columns; there was a conflict during dataframe merge because regular stats data compiled minutes played as per game average, whereas the Advance stats data compiled minutes played as season total. I will remove MP_y.
3. Add the year of the player stat here in a column called 'Year'.

In [23]:
for k,v in combined.items():
    v=v.drop(['Rk_x','Unnamed: 19','Unnamed: 24', 'Rk_y','MP_y'], axis=1)
    v['Year'] = k
    combined[k]=v

In [24]:
combined_stats = reduce(lambda x,y:pd.concat([x,y]),[v for k,v in combined.items() if k != 2019])
combined_stats = combined_stats.reset_index(drop=True);

In [25]:
combined_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5213 entries, 0 to 5212
Data columns (total 50 columns):
Player    5213 non-null object
Pos       5213 non-null object
Age       5213 non-null object
Tm        5213 non-null object
G         5213 non-null object
GS        5213 non-null object
MP_x      5213 non-null object
FG        5213 non-null object
FGA       5213 non-null object
FG%       5197 non-null object
3P        5213 non-null object
3PA       5213 non-null object
3P%       4525 non-null object
2P        5213 non-null object
2PA       5213 non-null object
2P%       5179 non-null object
eFG%      5197 non-null object
FT        5213 non-null object
FTA       5213 non-null object
FT%       5029 non-null object
ORB       5213 non-null object
DRB       5213 non-null object
TRB       5213 non-null object
AST       5213 non-null object
STL       5213 non-null object
BLK       5213 non-null object
TOV       5213 non-null object
PF        5213 non-null object
PTS       5213 non-null o

Now we can convert some of the datatypes into what we want: 

- Player, Position, Tm, Year -> unchanged
- Age, G, GS -> int
- Everything else -> floats

In [26]:
unchanged = ['Player','Pos','Tm','Year']
intlist = ['Age','G','GS']
floatlist= combined_stats.columns.difference(unchanged+intlist)

In [27]:
combined_stats[intlist] = combined_stats[intlist].astype(int)
combined_stats[floatlist] = combined_stats[floatlist].astype(float)

In [28]:
combined_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5213 entries, 0 to 5212
Data columns (total 50 columns):
Player    5213 non-null object
Pos       5213 non-null object
Age       5213 non-null int64
Tm        5213 non-null object
G         5213 non-null int64
GS        5213 non-null int64
MP_x      5213 non-null float64
FG        5213 non-null float64
FGA       5213 non-null float64
FG%       5197 non-null float64
3P        5213 non-null float64
3PA       5213 non-null float64
3P%       4525 non-null float64
2P        5213 non-null float64
2PA       5213 non-null float64
2P%       5179 non-null float64
eFG%      5197 non-null float64
FT        5213 non-null float64
FTA       5213 non-null float64
FT%       5029 non-null float64
ORB       5213 non-null float64
DRB       5213 non-null float64
TRB       5213 non-null float64
AST       5213 non-null float64
STL       5213 non-null float64
BLK       5213 non-null float64
TOV       5213 non-null float64
PF        5213 non-null float64
PTS   

In [29]:
combined_stats.head(10)

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP_x,FG,FGA,FG%,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,Year
0,Aaron Brooks,PG,33,MIN,32,1,5.9,0.9,2.2,0.406,...,19.9,0.1,0.1,0.1,0.033,-0.8,-3.6,-4.3,-0.1,2018
1,Aaron Gordon,PF,22,ORL,58,57,32.9,6.5,14.9,0.434,...,24.7,0.9,2.0,2.9,0.072,0.0,0.0,0.0,1.0,2018
2,Aaron Harrison,SG,23,DAL,9,3,25.9,2.1,7.7,0.275,...,15.5,-0.3,0.2,-0.1,-0.014,-3.6,-0.5,-4.1,-0.1,2018
3,Aaron Jackson,PG,31,HOU,1,0,35.0,3.0,9.0,0.333,...,13.7,0.0,0.0,0.0,-0.017,-4.6,-3.0,-7.7,-0.1,2018
4,Abdel Nader,SF,24,BOS,48,1,10.9,1.0,3.1,0.336,...,17.1,-0.9,0.8,-0.1,-0.014,-5.9,0.3,-5.6,-0.5,2018
5,Adreian Payne,PF,26,ORL,5,0,8.6,1.4,2.0,0.7,...,15.0,0.1,0.0,0.2,0.203,0.1,-4.1,-4.1,0.0,2018
6,Al Horford,C,31,BOS,72,72,31.6,5.1,10.5,0.489,...,18.4,4.0,3.8,7.8,0.165,1.1,2.9,4.0,3.5,2018
7,Al Jefferson,C,33,IND,36,1,13.4,3.1,5.8,0.534,...,22.5,0.8,0.8,1.6,0.158,-1.7,1.3,-0.5,0.2,2018
8,Al-Farouq Aminu,PF,27,POR,69,67,30.0,3.3,8.4,0.395,...,14.8,0.9,3.2,4.2,0.097,-0.8,1.9,1.1,1.6,2018
9,Alan Williams,PF,25,PHO,5,0,14.0,1.4,3.6,0.389,...,17.5,-0.1,0.1,0.0,-0.004,-6.5,2.8,-3.7,0.0,2018


# Rookies list and using it to remove from player stats

Rookies have no previous year's stats (because they were in college or overseas), so we cannot use their data. Therefore we must identify the rookies for each year and remove them from the stats list.

First load the rookies pickle data:

In [30]:
rookies[2017].Player.value_counts()

Player                     4
Isaiah Taylor              1
Semaj Christon             1
Joel Bolomboy              1
Gary Payton                1
Marquese Chriss            1
A.J. Hammons               1
Yogi Ferrell               1
Brice Johnson              1
Timothe Luwawu-Cabarrot    1
Jaylen Brown               1
Fred VanVleet              1
Ben Bentil                 1
Caris LeVert               1
Patricio Garino            1
Dario Saric                1
Tyler Ulis                 1
Cheick Diallo              1
Daniel Ochefu              1
Deyonta Davis              1
Malik Beasley              1
Jake Layman                1
David Nwaba                1
Okaro White                1
Dragan Bender              1
Henry Ellenson             1
Juan Hernangomez           1
Brandon Ingram             1
Taurean Waller-Prince      1
Ivica Zubac                1
                          ..
Isaiah Whitehead           1
Michael Gbinije            1
Kris Dunn                  1
Danuel House  

There are some weird "Player" that shows up. Also some null values got picked up when it was scraped. We can remove those rows ("Player" is part of the table header that got repeated on Basketball Reference). We can also add a Year column that will be useful to identify which year the rookies belong (just like our other lists). After that we can concatenate the dataframes into a single one.

In [31]:
combined_rookies = pd.DataFrame()
for v,k in rookies.items():
    temp = rookies[v][rookies[v].Player != 'Player']
    temp = temp[~(temp.Player.isnull())]
    temp['Year']=v
    combined_rookies = pd.concat([combined_rookies,temp])
    

In [32]:
combined_rookies.head(5)

Unnamed: 0,Player,Year
0,Bam Adebayo,2018
1,Jarrett Allen,2018
2,Kadeem Allen,2018
3,Ike Anigbogu,2018
4,OG Anunoby,2018


In [33]:
combined_rookies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 851 entries, 0 to 69
Data columns (total 2 columns):
Player    851 non-null object
Year      851 non-null int64
dtypes: int64(1), object(1)
memory usage: 19.9+ KB


# Remove the rookies from the stats list

Now we have our rookies list and ready to remove the stats from those players.

In [34]:
combined_stats.shape

(5213, 50)

In [35]:
combined_rookies.shape

(851, 2)

In [66]:
COLS = ['Player','Year']

no_rookies = combined_stats.merge(combined_rookies, indicator=True, how='outer')
no_rookies = no_rookies[no_rookies['_merge'] == 'left_only']

# combined_stats_no_rookies = pd.merge(combined_stats,combined_rookies, on=['Player','Year'], how='left')

In [67]:
no_rookies.shape

(4360, 51)

In [68]:
del no_rookies['_merge']

# Add Player stats into our Salary list

Now we need to extract needed player information and add it to our FA list to complete our cleaned data process. We have to determine what we need and how we are going to do it. For now we want to look at data from the season previous to when the player becomes a free agent.

But first, let's save the data.

In [39]:
def save_dataset(data,filename):
    with open(filename, 'wb') as w:
        pickle.dump(data,w)

In [40]:
save_dataset(no_rookies,"../data/interim/stats.pickle")
save_dataset(salaries, "../data/interim/salaries.pickle")

Now we want to filter out the data for all the FA's on our list, but before we do that, we must look at one thing: the way naming is done differently between the salaries list (from ESPN.com) and the stats list (from Basketball-Reference.com). I noticed two weird things:

1. Suffixes are missing (notably Jr.) in Basketball-Reference list
2. Also players that go by initials (i.e. J.J. Redick in Basketball-Reference) are missing periods in the other list (i.e. JJ Redick in ESPN).

In [63]:
salaries['Player'] = salaries['Player'].map(lambda x: x.replace('Jr.',""))

In [65]:
no_rookies[no_rookies.Player == "Otto Porte"]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP_x,FG,FGA,FG%,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,Year
407,Otto Porte,SF,24,WAS,77,77,31.6,5.8,11.5,0.503,...,18.4,5.0,3.1,8.1,0.161,2.4,1.2,3.6,3.4,2018
901,Otto Porte,SF,23,WAS,80,80,32.6,5.2,10.0,0.516,...,15.1,6.4,3.0,9.4,0.173,3.0,0.9,3.9,3.9,2017
1384,Otto Porte,SF,22,WAS,75,73,30.3,4.5,9.6,0.473,...,16.2,3.0,2.7,5.6,0.119,0.8,1.0,1.8,2.2,2016
1870,Otto Porte,SF,21,WAS,74,13,19.4,2.4,5.3,0.45,...,15.1,1.0,1.8,2.7,0.092,-1.4,0.9,-0.4,0.6,2015


In [47]:
data_all = pd.merge(salaries,no_rookies, on=['Player','Year'], how='left')

In [48]:
data_all.columns

Index(['Player', 'Tm_x', 'Salary', 'Year', 'Pos_x', 'Pos_y', 'Age', 'Tm_y',
       'G', 'GS', 'MP_x', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA',
       '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%',
       'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS',
       'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP'],
      dtype='object')

In [49]:
data_all.shape

(4512, 53)

In [50]:
data_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4512 entries, 0 to 4511
Data columns (total 53 columns):
Player    4512 non-null object
Tm_x      4512 non-null object
Salary    4512 non-null int64
Year      4512 non-null int64
Pos_x     4512 non-null object
Pos_y     2890 non-null object
Age       2890 non-null float64
Tm_y      2890 non-null object
G         2890 non-null float64
GS        2890 non-null float64
MP_x      2890 non-null float64
FG        2890 non-null float64
FGA       2890 non-null float64
FG%       2888 non-null float64
3P        2890 non-null float64
3PA       2890 non-null float64
3P%       2592 non-null float64
2P        2890 non-null float64
2PA       2890 non-null float64
2P%       2883 non-null float64
eFG%      2888 non-null float64
FT        2890 non-null float64
FTA       2890 non-null float64
FT%       2859 non-null float64
ORB       2890 non-null float64
DRB       2890 non-null float64
TRB       2890 non-null float64
AST       2890 non-null float64
STL   

Missing alot of data; what is going on?

## Missing Data:

Here are the possibilities based on some research:

1. Some did not have stats because they were out of the NBA (not playing basketball entirely or overseas). These players should also be removed from the considerations.

2. Some players are missing a couple stats only (way to treat those datapoints will be explored in the next section)

In [51]:
data_all[data_all.Player=="Bam Adebayo"]

Unnamed: 0,Player,Tm_x,Salary,Year,Pos_x,Pos_y,Age,Tm_y,G,GS,...,TOV%,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP
251,Bam Adebayo,Miami Heat,2955840,2018,C,,,,,,...,,,,,,,,,,
690,Bam Adebayo,Miami Heat,2955840,2017,C,,,,,,...,,,,,,,,,,


## Some column conflicts happened on the merge:

- Tm: I will use Tm_x as that was from the salary data. It is the team that paid the player on the season following (2018 salary is for the 2018-19 season, while 2018 stats is for the 2017-18 season)
- Pos: I will also use Pos_x although position ambiguity is really on the data collector's hands; some players can be either guard position, or either forward position, or some could be SG/SF. There is no real definition on positions now, as NBA is becoming more positionless and a guard is able to do what forwards used to, and vice versa. Even some centers handle the ball like a guard!

In [None]:
del data_all['Tm_y']
del data_all['Pos_y']

Save the data and move onto EDA:

In [None]:
save_dataset(data_all,"../data/processed/dataset.pickle")