In [1]:
import re
import os
import pickle
import pandas as pd

In [2]:
with open("../data/raw/salaries.pickle", "rb") as f:
    fas = pickle.load(f)  # dict mapping year -> raw salary dataframe

Quickly look at one of the dataframes:

In [14]:
pd.concat([fas[2018].head(5), fas[2018].tail(5)])  # look at the first and last five rows

Unnamed: 0,Rk,Player,Tm,Salary
0,RK,NAME,TEAM,SALARY
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154"
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,"$35,654,150"
3,3,"Chris Paul, PG",Houston Rockets,"$35,654,150"
4,4,"Blake Griffin, PF",Detroit Pistons,"$32,088,932"
477,434,"Jonathan Gibson, PG",Boston Celtics,"$17,092"
478,435,"Tarik Phillip, G",Washington Wizards,"$9,474"
479,436,"Duncan Robinson, SF",Miami Heat,"$9,474"
480,437,"Theo Pinson, SG",Brooklyn Nets,"$4,737"
481,438,"Kendrick Nunn, SG",Miami Heat,"$4,737"


In [13]:
fas.keys()

dict_keys([2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009])

In [4]:
fas[2017].columns

Int64Index([0, 1, 2, 3], dtype='int64')

A few things need fixing before the per-year tables can be combined:

1. Rename the integer column headers (0 -> Rk, 1 -> Player, 2 -> Tm, 3 -> Salary)
2. Remove the rows that repeat the header labels (RK, NAME, TEAM, SALARY), as they were repeated in the website tables
3. Add a Year column so rows stay identifiable once the tables are aggregated into a single dataframe
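
Before renaming, it may be worth confirming that every year's table really has four columns, since assigning a four-name list to a frame of a different width raises an error. A minimal sketch, assuming a dictionary shaped like `fas`; the `tables` dictionary below is a hypothetical stand-in, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the `fas` dictionary: year -> raw scraped table
tables = {
    2018: pd.DataFrame([["RK", "NAME", "TEAM", "SALARY"]]),
    2017: pd.DataFrame([["RK", "NAME", "TEAM", "SALARY"]]),
}

# Record the column count per year and flag any year that is not four wide
widths = {year: df.shape[1] for year, df in tables.items()}
unexpected = {year: w for year, w in widths.items() if w != 4}

print(widths)
print(unexpected)
```

An empty `unexpected` dictionary means the same column names can be applied to every year.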

These steps can wait until after aggregation:

1. Change the Salary format (remove $ and commas)
2. Split the position from the player name into a new column
3. Change the Salary datatype to int
4. Remove the Rk column; it's not significant

In [11]:
combined = {}

for k, v in fas.items():
    v.columns = ['Rk', 'Player', 'Tm', 'Salary']
    # drop the repeated header rows; .copy() keeps the next assignment from
    # triggering SettingWithCopyWarning
    v = v[v.Rk != "RK"].copy()
    v["Year"] = k
    combined[k] = v
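
The per-year cleanup can also be wrapped in a small helper that works on explicit copies, which is one common way to avoid pandas' SettingWithCopyWarning and to leave the raw tables untouched. A minimal sketch; `clean_year_table` and `toy` are illustrative names that do not appear elsewhere in this notebook, and the toy rows are taken from the 2018 output shown earlier:

```python
import pandas as pd

def clean_year_table(raw, year):
    """Rename columns, drop repeated header rows, and tag the season."""
    out = raw.copy()                      # leave the caller's frame untouched
    out.columns = ["Rk", "Player", "Tm", "Salary"]
    out = out[out["Rk"] != "RK"].copy()   # explicit copy of the filtered rows
    out["Year"] = year
    return out

# Toy frame shaped like one scraped table, with the header row repeated
toy = pd.DataFrame([
    ["RK", "NAME", "TEAM", "SALARY"],
    ["1", "Stephen Curry, PG", "Golden State Warriors", "$37,457,154"],
    ["RK", "NAME", "TEAM", "SALARY"],
    ["2", "Russell Westbrook, PG", "Oklahoma City Thunder", "$35,654,150"],
])

cleaned = clean_year_table(toy, 2018)
print(cleaned)
```

Because the helper copies its input, `toy` keeps its original integer column labels, unlike the loop above, which renames the columns of the frames stored in `fas`.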


In [12]:
combined[2017].head(10)

Unnamed: 0,Rk,Player,Tm,Salary,Year
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154",2017
2,2,"Blake Griffin, PF",LA Clippers,"$32,088,932",2017
3,3,"Paul Millsap, PF",Denver Nuggets,"$31,269,231",2017
4,4,"Kyle Lowry, PG",Toronto Raptors,"$31,200,000",2017
5,5,"Gordon Hayward, SF",Boston Celtics,"$29,727,900",2017
6,6,"Mike Conley, PG",Memphis Grizzlies,"$28,530,608",2017
7,7,"Russell Westbrook, PG",Oklahoma City Thunder,"$28,530,608",2017
8,8,"James Harden, PG",Houston Rockets,"$28,299,399",2017
9,9,"DeMar DeRozan, SG",Toronto Raptors,"$27,739,975",2017
10,10,"Al Horford, C",Boston Celtics,"$27,734,406",2017


Now we can combine all the dataframes into a single one and get the FA information for every year in the dictionary, which per the keys above runs from 2009 to 2018 (the 2018 salary information will be our test_y).

In [26]:
# stack the per-year frames in one call; each frame keeps its own row labels
salaries = pd.concat(list(combined.values()))

In [27]:
salaries.shape

(4511, 5)

In [28]:
salaries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4511 entries, 1 to 463
Data columns (total 5 columns):
Rk        4511 non-null object
Player    4511 non-null object
Tm        4511 non-null object
Salary    4511 non-null object
Year      4511 non-null int64
dtypes: int64(1), object(4)
memory usage: 211.5+ KB
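
The `Int64Index: 4511 entries, 1 to 463` line suggests that row labels repeat across years, since each per-year frame keeps its own labels through `pd.concat`. That does not affect the cleaning steps in this notebook, but a label-based lookup such as `.loc[1]` would return one row per year. A minimal sketch on toy frames (illustrative values, not the real tables) of how `ignore_index=True` produces unique labels instead:

```python
import pandas as pd

# Two toy per-year frames with overlapping index labels (illustrative values only)
a = pd.DataFrame({"Player": ["A", "B"], "Year": 2018}, index=[1, 2])
b = pd.DataFrame({"Player": ["C", "D"], "Year": 2017}, index=[1, 2])

kept = pd.concat([a, b])                      # labels 1, 2, 1, 2
fresh = pd.concat([a, b], ignore_index=True)  # labels 0, 1, 2, 3

print(kept.index.is_unique)   # False
print(fresh.index.is_unique)  # True
print(len(kept.loc[1]))       # 2, because both frames have a row labelled 1
```

Calling `reset_index(drop=True)` on an already combined frame is an equivalent option.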


In [29]:
salaries.head(5)

Unnamed: 0,Rk,Player,Tm,Salary,Year
1,1,"Stephen Curry, PG",Golden State Warriors,"$37,457,154",2018
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,"$35,654,150",2018
3,3,"Chris Paul, PG",Houston Rockets,"$35,654,150",2018
4,4,"Blake Griffin, PF",Detroit Pistons,"$32,088,932",2018
5,5,"Gordon Hayward, SF",Boston Celtics,"$31,214,295",2018


Now we can change the Salary format to remove $ and commas, convert it to int, drop the Rk column, and split the position from the player name into its own column.

In [30]:
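# regex=False treats '$' as a literal character rather than a regex anchor
salaries["Salary"] = salaries["Salary"].str.replace('$', '', regex=False).str.replace(',', '', regex=False)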

In [31]:
salaries.head(5)

Unnamed: 0,Rk,Player,Tm,Salary,Year
1,1,"Stephen Curry, PG",Golden State Warriors,37457154,2018
2,2,"Russell Westbrook, PG",Oklahoma City Thunder,35654150,2018
3,3,"Chris Paul, PG",Houston Rockets,35654150,2018
4,4,"Blake Griffin, PF",Detroit Pistons,32088932,2018
5,5,"Gordon Hayward, SF",Boston Celtics,31214295,2018


In [32]:
salaries.Salary = salaries.Salary.astype(int)
del salaries['Rk']
# split "Name, POS" once; expand=True returns the two parts as separate columns
salaries[['Player', 'Pos']] = salaries['Player'].str.split(', ', n=1, expand=True)

In [37]:
salaries.head(5)

Unnamed: 0,Player,Tm,Salary,Year,Pos
1,Stephen Curry,Golden State Warriors,37457154,2018,PG
2,Russell Westbrook,Oklahoma City Thunder,35654150,2018,PG
3,Chris Paul,Houston Rockets,35654150,2018,PG
4,Blake Griffin,Detroit Pistons,32088932,2018,PF
5,Gordon Hayward,Boston Celtics,31214295,2018,SF
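
As a final check, the salary and position cleaning can be replayed on a few rows and asserted against expected values. A minimal sketch, assuming the three rows below (copied from the raw output earlier in the notebook) are representative; `raw` and `clean` are illustrative names, not part of the notebook's state:

```python
import pandas as pd

# Rows copied from the raw 2018 output shown earlier in this notebook
raw = pd.DataFrame({
    "Player": ["Stephen Curry, PG", "Blake Griffin, PF", "Tarik Phillip, G"],
    "Salary": ["$37,457,154", "$32,088,932", "$9,474"],
})

clean = raw.copy()
# Strip "$" and "," as literal characters, then convert to integers
clean["Salary"] = (
    clean["Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
# Split "Name, POS" once into two columns
clean[["Player", "Pos"]] = clean["Player"].str.split(", ", n=1, expand=True)

# Sanity checks: integer salaries, a position for every row, no leftover commas
assert clean["Salary"].dtype.kind == "i"
assert clean["Pos"].notna().all()
assert not clean["Player"].str.contains(",").any()
print(clean)
```

The same three assertions could be run against the full `salaries` frame; a row whose name has no ", POS" suffix would fail the `notna()` check.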
