# Introduction

In this notebook, we append seasonal-award features, scraped from websites, to each player's career statistics. Specifically, we include All-Star MVP, Finals MVP, Defensive Player of the Year, whether a player's team won division championships or final championships, All-Star selections, All Teams selections, and MVP.

# Appending Features

### Importing Packages

In [2]:
#matplotlib inline

import numpy as np
import pandas as pd
from matplotlib import pyplot
import unidecode
import seaborn as sns

print(pd.__version__)

0.23.4


### Importing Dataset

In [3]:
player_stats = pd.read_csv('../data/player_stats.csv')

Exploring the features, we find several award winners who share a legal name with another player. To append the award features automatically, we assume that a Hall of Fame inductee is also likely to have won seasonal awards. We therefore find the feature most correlated with 'HOF', assume it is also highly correlated with seasonal awards, and use it to distinguish between players who share a legal name.

In [4]:
X = player_stats.drop(["Player","HOF"], axis = 1)
X = (X - X.mean())/X.std()
X.corrwith(player_stats['HOF']).nlargest(n=1)

VORP    0.436834
dtype: float64
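
A side note on the cell above: Pearson correlation is unchanged by shifting or rescaling a column, so the z-scoring step does not affect which feature ranks first. The sketch below checks this on made-up numbers. The toy frame and its values are hypothetical and are not taken from player_stats.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "VORP": [22.0, -0.2, 63.0, 1.5, 10.0],
    "G":    [1000.0, 16.0, 1335.0, 300.0, 820.0],
    "HOF":  [0, 0, 1, 0, 1],
})

X = toy.drop(columns=["HOF"])
X_std = (X - X.mean()) / X.std()

raw_corr = X.corrwith(toy["HOF"])
std_corr = X_std.corrwith(toy["HOF"])

# Standardizing each column leaves the correlations unchanged.
same = np.allclose(raw_corr.values, std_corr.values)
best_feature = raw_corr.idxmax()
print(same, best_feature)
```

Because the ranking is identical either way, the standardization can be kept for readability or dropped without changing the selected feature.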

In [5]:
def add_features(df,df_feature,feature):
    # Append a feature column to dataframe df automatically.
    # df has columns: 'Player', 'Birth Year', 'VORP'
    # df_feature has columns: 'Player', feature (the name of the feature), and optionally 'Birth Year'
    
    # Check whether every entry in the 'Player' column of df_feature can be found in the 'Player' column of df.
    # A player who is in df_feature but not in df could either
    #      (1) have played before 1982 or 
    #      (2) indicate an error in the code 
    list_phantom_player = list(set(df_feature['Player'].tolist())-set(df['Player'].tolist()))
    if list_phantom_player:
        print('There are phantom players as below\n')
        print(list_phantom_player)
        print('\n')
    
    if 'Birth Year' in list(df_feature.columns):
        # if 'Birth Year' is in df_feature, we create a dictionary with keys from ('Player','Birth Year') 
        # and values from 'feature.' This way the players can be identified without the help of 'VORP'
        d = df_feature.set_index(['Player','Birth Year']).to_dict()[feature]
        
        # Create a new column in df for the new feature indexed by 'Player' and 'Birth Year.'
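        # The Series reuses df.index so the assignment aligns row by row even if df is not indexed 0..n-1.
        df[feature] = pd.Series(list(zip(df['Player'],df['Birth Year'])), index=df.index).map(d)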
        df[feature] = df[feature].fillna(value = 0) #  Fill new NaN values with 0.
        
    else:
        # We index the data frame by 'Player' to make matching the new data to the old possible.
        # We create a dictionary with keys from 'Player' and values from 'feature'

        d = df_feature.set_index('Player').to_dict()[feature]

        # Create a new column in df for the new feature indexed by 'Player.'
        df[feature] = df['Player'].map(d)
        df[feature] = df[feature].fillna(value = 0) #  Fill new NaN values with 0.

        # To avoid giving accolades to the son of the player who deserves them, 
        # we discriminate between players with the same name by giving the award 
        # to whoever has highest 'VORP'. 

        player_list = df_feature['Player'].tolist()
        for name in player_list:
            temp = (df['Player']==name)  
            if temp.sum()>1: 
                # As a quick check, we print the players who share a legal name with an award winner
                print('The repeated name is {}, repeated {} times\n'.format(name,temp.sum()))
                repeated_name_index = df.index[temp]
                # repeated_name_index holds index labels, so we select with .loc rather than .iloc
                most_likely_index = df.loc[repeated_name_index, 'VORP'].idxmax()
                for i in repeated_name_index:
                    if i != most_likely_index:
                        df.at[i,feature]=0
           
    return df
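
The name-collision rule in the `else` branch can also be expressed without an explicit loop. The sketch below is an illustration on hypothetical toy data, not the notebook's implementation: it maps the award by name, then keeps it only on the row with the highest 'VORP' within each name group.

```python
import pandas as pd

# Hypothetical toy frames; names echo the notebook, numbers are illustrative.
stats = pd.DataFrame({
    "Player": ["Glen Rice", "Glen Rice", "A.C. Green"],
    "VORP":   [22.0, -0.2, 30.0],
})
award = pd.DataFrame({"Player": ["Glen Rice"], "AllStar_MVP": [1]})

# Map the award onto every row that shares the winner's name.
lookup = award.set_index("Player")["AllStar_MVP"]
stats["AllStar_MVP"] = stats["Player"].map(lookup).fillna(0)

# Within each name, only the row with the highest VORP keeps the award.
best_rows = stats.groupby("Player")["VORP"].idxmax()
stats.loc[~stats.index.isin(best_rows), "AllStar_MVP"] = 0

print(stats)
```

Like the loop above, this relies on the assumption that the award winner is the namesake with the highest 'VORP'.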
    

### All Star MVP

We now append the All-Star MVP award to the dataframe as 'AllStar_MVP'.

In [6]:
AllStar_MVP = pd.read_csv('../data/All-Star-MVP.csv')
AllStar_MVP['Player']= AllStar_MVP['Player'].str.replace('Lew Alcindor','Kareem Abdul-Jabbar')
df_new = add_features(player_stats, AllStar_MVP,'AllStar_MVP')

There are phantom players as below

['Bill Russell', 'Elgin Baylor', 'Bob Pettit', 'Oscar Robertson', 'Rick Barry', 'George Mikan', 'Wilt Chamberlain', 'Nate Archibald', 'Paul Arizin', 'Hal Greer', 'Adrian Smith', 'Lenny Wilkens', 'Jerry Lucas', 'Jerry West', 'Dave Bing', 'Bill Sharman', 'Walt Frazier', 'Bob Cousy', 'Willis Reed', 'Ed Macauley']
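
The phantom-player list above comes from a set difference, which does not preserve the order of the award table. As a hedged alternative, the same check can be sketched with `Series.isin`, which keeps the original row order. The toy frames and award counts below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; the counts are placeholders.
stats = pd.DataFrame({"Player": ["Glen Rice", "Michael Jordan"]})
award = pd.DataFrame({
    "Player":      ["Bill Russell", "Michael Jordan"],
    "AllStar_MVP": [1, 3],
})

# Award winners whose names do not appear in the stats table.
is_missing = ~award["Player"].isin(stats["Player"])
missing = award.loc[is_missing, "Player"].tolist()

print(missing)
```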


The repeated name is Glen Rice, repeated 2 times



In [7]:
df_new.query('Player == "Glen Rice"')[['VORP','Birth Year', 'AllStar_MVP']]

Unnamed: 0,VORP,Birth Year,AllStar_MVP
978,22.0,1968.0,1.0
979,-0.2,1991.0,0.0
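
The two rows above differ in 'Birth Year', which is why a ('Player', 'Birth Year') key avoids the 'VORP' heuristic whenever the award table carries a birth year. As a hedged alternative to the dictionary lookup in add_features, the same join can be sketched with `DataFrame.merge`. The toy frames below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames keyed by name and birth year.
stats = pd.DataFrame({
    "Player":     ["Glen Rice", "Glen Rice"],
    "Birth Year": [1968.0, 1991.0],
})
award = pd.DataFrame({
    "Player":      ["Glen Rice"],
    "Birth Year":  [1968.0],
    "AllStar_MVP": [1],
})

# A left join keeps every row of stats; unmatched rows get NaN, then 0.
merged = stats.merge(award, on=["Player", "Birth Year"], how="left")
merged["AllStar_MVP"] = merged["AllStar_MVP"].fillna(0)

print(merged)
```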


### Final MVP

We now append the Finals MVP award to the dataframe as 'Final_MVP'.

In [8]:
Final_MVP = pd.read_csv('../data/Final-MVP.csv')
# Credit the awards listed under 'Lew Alcindor' to 'Kareem Abdul-Jabbar'
K_index = Final_MVP.index[Final_MVP['Player'] == 'Kareem Abdul-Jabbar'] 
L_index = Final_MVP.index[Final_MVP['Player'] == 'Lew Alcindor'] 
# Only merge when a 'Lew Alcindor' row exists (same guard as in the MVP cell)
if list(L_index):
    Final_MVP.at[K_index[0],'Final_MVP'] = Final_MVP.at[K_index[0],'Final_MVP'] + Final_MVP.at[L_index[0],'Final_MVP']
    Final_MVP.drop(L_index,inplace = True)
df_new = add_features(df_new, Final_MVP,'Final_MVP')

There are phantom players as below

['Rick Barry', 'Wilt Chamberlain', 'Jerry West', 'Wes Unseld', 'Jo Jo White', 'John Havlicek', 'Willis Reed']
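
The index bookkeeping used to fold 'Lew Alcindor' into 'Kareem Abdul-Jabbar' can also be written as a rename followed by a group-wise sum. This is a sketch on hypothetical counts (the numbers below are placeholders, not the contents of Final-MVP.csv), and the `aliases` dictionary is a name introduced here for illustration. It also works when the alias row is absent.

```python
import pandas as pd

# Hypothetical award counts; the real values live in Final-MVP.csv.
final_mvp = pd.DataFrame({
    "Player":    ["Lew Alcindor", "Kareem Abdul-Jabbar", "Magic Johnson"],
    "Final_MVP": [1, 1, 3],
})

# Rename the alias, then add up the rows that now share a name.
aliases = {"Lew Alcindor": "Kareem Abdul-Jabbar"}
final_mvp["Player"] = final_mvp["Player"].replace(aliases)
final_mvp = final_mvp.groupby("Player", as_index=False)["Final_MVP"].sum()

print(final_mvp)
```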




### Defensive Player Of The Year

We now append the Defensive Player of the Year award to the dataframe as 'DPOY'.

In [9]:
DPOY = pd.read_csv('../data/DPOY.csv')
# replace the name 'Ron Artest' with 'Metta World Peace'
DPOY['Player'] = DPOY['Player'].str.replace('Ron Artest','Metta World Peace')
df_new = add_features(df_new, DPOY,'DPOY')

The repeated name is Gary Payton, repeated 2 times



In [10]:
df_new.query('Player == "Gary Payton"')[['VORP','DPOY']]

Unnamed: 0,VORP,DPOY
940,63.0,1.0
941,-0.1,0.0


### Players Whose Teams Reached The Finals Or Won Championships

We now append the number of times a player's team reached the Finals as 'Final', and the number of championships the player's team won as 'Champion'.

In [11]:
Final_Champion = pd.read_csv('../data/Final-and-Champion.csv')
df_new = add_features(df_new, Final_Champion,'Final')
df_new = add_features(df_new, Final_Champion,'Champion')

### AllStar

We now append the All-Star selections as 'AllStar'.
Note that, in the scraped All-Star data, names containing special characters were garbled. We normalize those names by mapping accented characters to their closest Latin equivalents with unidecode.

In [12]:
AllStar = pd.read_csv('../data/AllStarTable.csv', encoding='UTF-8')

# This maps special, accented characters to their closest latin character.
AllStar['Player'] = AllStar['Player'].apply(unidecode.unidecode)

df_new = add_features(df_new, AllStar,'#')
# Rename only the column; passing index=str would turn the integer row labels into strings.
df_new.rename(columns = {"#": "AllStar"}, inplace = True)

There are phantom players as below

['Fred Scolari', 'Sidney Wicks', 'Wes Unseld', 'Rudy Tomjanovich', 'Bob Boozer', 'Dolph Schayes', 'Bob Davies', 'Pete Maravich', 'Joe Caldwell', 'Alex Groza', 'Nate Archibald', 'Jack Molinas', 'Dick Van Arsdale', 'Tom Meschery', 'Joe Fulks', 'B. J. Armstrong', 'Jim Pollard', 'Adrian Smith', 'Chet Walker', 'Johnny Kerr', 'Slater Martin', 'Paul Silas', 'Maurice Stokes', 'Dave Bing', 'Bill Sharman', 'Tom Gola', 'Guy Rodgers', 'John Havlicek', 'Red Rocha', 'Paul Walther', 'Connie Hawkins', 'Walt Frazier', 'Lucious Jackson', 'Jon McGlocklin', 'Gene Shue', 'Penny Hardaway', 'Mel Hutchins', 'Elgin Baylor', 'Geoff Petrie', 'Walter Dukes', 'Clyde Lee', 'Earl Monroe', 'Bob Harrison', 'Jim Price', 'John Block', 'Dick Barnett', 'Frank Selvy', 'Micheal Ray Richardson', 'Norm Van Lier', 'Willie Naulls', 'Hal Greer', 'Don Ohl', 'Len Chappell', 'Lenny Wilkens', 'Vern Mikkelsen', 'Lee Shaffer', 'Max Zaslofsky', 'Wayne Embry', 'Billy Cunningham', 'Jack George', 'Phil 
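
For reference, the accent stripping done by unidecode in the cell above can be approximated with the standard library alone. The helper below, `strip_accents`, is a hypothetical name introduced for illustration. Unlike unidecode, it only removes combining marks and does not transliterate characters that have no decomposition (for example 'ø').

```python
import unicodedata

def strip_accents(name):
    """Remove combining accent marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Manu Ginóbili"))
```

Names that are already plain ASCII pass through unchanged, so the helper could be applied to a whole 'Player' column with `.apply(strip_accents)`.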

In [13]:
df_new.query('Player == "Glen Rice"')[['GS','Birth Year', 'AllStar']]

Unnamed: 0,GS,Birth Year,AllStar
978,876.0,1968.0,3.0
979,1.0,1991.0,0.0


### All Teams

We now append the All Teams selections as the features 'First team', 'Second team', and 'Third team'.

In [14]:
AllTeams = pd.read_csv('../data/AllTeamsTable.csv',encoding='latin-1')
AllTeams.head(2)

features = ["First team", "Second team", "Third team"]
for feature in features:
    df_new = add_features(df_new, AllTeams, feature)

There are phantom players as below

['Wilt Chamberlain', 'Bill Russell', 'Elgin Baylor', 'Jerry West', 'Bob Pettit', 'John Havlicek', 'Bob Cousy', 'Dolph Schayes', 'Oscar Robertson']


There are phantom players as below

['Wilt Chamberlain', 'Bill Russell', 'Elgin Baylor', 'Jerry West', 'Bob Pettit', 'John Havlicek', 'Bob Cousy', 'Dolph Schayes', 'Oscar Robertson']


There are phantom players as below

['Wilt Chamberlain', 'Bill Russell', 'Elgin Baylor', 'Jerry West', 'Bob Pettit', 'John Havlicek', 'Bob Cousy', 'Dolph Schayes', 'Oscar Robertson']




### MVP

We now append the MVP award as the feature 'MVPs'.

In [15]:
MVP = pd.read_csv('../data/MVPTable.csv',encoding='latin-1')
# give 'Kareem Abdul-Jabbar' credit of 'Lew Alcindor'
K_index = MVP.index[MVP['Player'] == 'Kareem Abdul-Jabbar'] 
L_index = MVP.index[MVP['Player'] == 'Lew Alcindor'] 
if list(L_index):
    MVP.at[K_index[0],'MVPs'] = MVP.at[K_index[0],'MVPs'] + MVP.at[L_index[0],'MVPs']
    MVP.drop(L_index,inplace = True)
df_new = add_features(df_new, MVP, "MVPs")

There are phantom players as below

['Wilt Chamberlain', 'Bill Russell', 'Wes Unseld', 'Bob Pettit', 'Bob Cousy', 'Willis Reed', 'Oscar Robertson']




In [16]:
df_new.head()

Unnamed: 0,Player,Birth Year,G,GS,MP,FG,FGA,FG%,3P,3PA,...,AllStar_MVP,Final_MVP,DPOY,Final,Champion,AllStar,First team,Second team,Third team,MVPs
0,A.C. Green,1964.0,1278.0,832.0,36552.0,4544.0,9202.0,0.493806,124.0,489.0,...,0.0,0.0,0.0,5,3,0.0,0.0,0.0,0.0,0.0
1,A.J. Bramlett,1977.0,8.0,0.0,61.0,4.0,21.0,0.190476,0.0,0.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
2,A.J. English,1968.0,151.0,18.0,3108.0,617.0,1418.0,0.43512,9.0,65.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
3,A.J. Guyton,1979.0,80.0,14.0,1246.0,166.0,440.0,0.377273,73.0,193.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0
4,A.J. Hammons,1993.0,22.0,0.0,163.0,17.0,42.0,0.404762,5.0,10.0,...,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0


### Check The Correlation Between Awards And VORP

In [17]:
# Check the correlations between VORP and the awards
# to inform the decision about which stat should 'trigger'
# the second part of the add_features() function.

awards_table = df_new[df_new.columns[-11:]] 
awards = df_new.columns[-11:] 

awards_table.corrwith(df_new['VORP'])

HOF            0.436834
AllStar_MVP    0.473098
Final_MVP      0.449949
DPOY           0.273940
Final          0.427286
Champion       0.352213
AllStar        0.753804
First team     0.487816
Second team    0.410567
Third team     0.366637
MVPs           0.442968
dtype: float64

In [18]:
# Save the new dataframe to a CSV file
df_new.to_csv('../data/player_stats_and_awards.csv', index = False, sep = ',')