# Data Preparation

This stage is evaluated on the following criteria:

- **Integration**: converting different data value formats and matching entities across sources;

- **Quality**: assessment of quality dimensions;

- **Cleaning**: systematic removal of redundant attributes; handling of missing data, including replacing missing values (MVs) with complex methods (e.g. regression, classification) under a correct experimental setup; identifying and discussing outliers and addressing them with complex approaches (technical or domain-dependent);

- **Transformation**: for algorithm compatibility, adequate complex discretization or rescaling;

- **Feature Engineering and Selection**: from tabular data, using complex methods (e.g. aggregation) and knowledge (e.g. business concepts), with correct and combined use of filter- and wrapper-based methods;

- **Sampling**: for domain-specific purposes, focusing on the appropriate subset of the population; for development, starting with a very small sample and scaling up to a significant sample;

- **Unbalanced data**: correct use of advanced methods (e.g. SMOTE).

In [415]:
import pandas as pd

awards_players = pd.read_csv("data/awards_players.csv")
coaches = pd.read_csv("data/coaches.csv")
players_teams = pd.read_csv("data/players_teams.csv")
players = pd.read_csv("data/players.csv")
series_post = pd.read_csv("data/series_post.csv")
teams_post = pd.read_csv("data/teams_post.csv")
teams = pd.read_csv("data/teams.csv")

tables = {
    "Awards Players": awards_players,
    "Coaches": coaches,
    "Players Teams": players_teams,
    "Players": players,
    "Series Post": series_post,
    "Teams Post": teams_post,
    "Teams": teams
}

# remove columns where all the entries have the same value
def clean_remove_columns_equal( name ):
    
    before = len( tables[ name ].columns )
    nunique = tables[ name ].nunique()
    cols_to_drop = nunique[ nunique == 1 ].index
    tables[ name ] = tables[ name ].drop( cols_to_drop, axis = 1 )
    
    if( before != len( tables[ name ].columns ) ):
        print( f"{ name } went from { before } to { len( tables[ name ].columns ) } columns : { cols_to_drop.to_list() }" )

# remove columns where all the entries have null values
def clean_remove_columns_na( name ):
    
    before_cols = tables[ name ].columns
    tables[ name ] = tables[ name ].dropna( axis = 1, how = 'all')
    after_cols = tables[ name ].columns
    
    if( len( before_cols ) != len( after_cols )):
        print( f"{ name } went from { len( before_cols ) } to { len( after_cols ) } columns : { list( filter( lambda x: x not in after_cols, before_cols ) ) }" )

# identify columns that have null values
def clean_identify_columns_na( name ):
    
    cols_with_empty_values = tables[ name ].columns[ tables[ name ].isnull().any() ].to_list()
    size = len( cols_with_empty_values )
    
    if size > 0:
        print(f"{ name } has { size } columns with missing data : { cols_with_empty_values }")
        
# identify pairs of very correlated features
def clean_identify_correlated_features( name ):    
    features = []
    data_converted = tables[ name ].copy()
    
    for col in data_converted.select_dtypes( include = [ 'object' ] ).columns:
        data_converted[ col ], _ = pd.factorize( data_converted[ col ] )

    matrix = data_converted.corr()
    for i in range( len( matrix.columns ) ):
        for j in range( i ):
            value = matrix.iloc[ i, j ]
            if abs( value ) > 0.95:
                name_i = matrix.columns[ i ]
                name_j = matrix.columns[ j ]
                features.append( ( name_i, name_j, value ) )

    return features
        

We first need to replace the `tmID` values in the other datasets with the corresponding `franchID` from the teams dataset, because a team's `tmID` may differ from its `franchID`.

In [416]:
# Map each tmID to its franchID (for a repeated tmID, the last row wins)
mapTeam = dict(zip(teams['tmID'], teams['franchID']))

tables['Coaches']['tmID'] = tables['Coaches']['tmID'].replace(mapTeam) 
tables['Players Teams']['tmID'] = tables['Players Teams']['tmID'].replace(mapTeam) 
tables['Series Post']['tmIDWinner'] = tables['Series Post']['tmIDWinner'].replace(mapTeam) 
tables['Series Post']['tmIDLoser'] = tables['Series Post']['tmIDLoser'].replace(mapTeam) 
tables['Teams Post']['tmID'] = tables['Teams Post']['tmID'].replace(mapTeam)
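
Before relying on this replacement, it may be worth checking that every `tmID` points to a single `franchID`, since a dictionary silently keeps only the last pair it sees. The following is a minimal sketch on toy data; the team codes and the names `toy_teams`, `toy_coaches` and `team_to_franchise` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the teams table; the codes are illustrative only.
toy_teams = pd.DataFrame({
    "year": [1, 2, 3, 3],
    "tmID": ["AAA", "AAA", "BBB", "CCC"],
    "franchID": ["AAA", "AAA", "AAA", "CCC"],
})

# Number of distinct franchises each tmID points to; more than 1 means ambiguity.
franchises_per_team = toy_teams.groupby("tmID")["franchID"].nunique()
ambiguous = franchises_per_team[franchises_per_team > 1].index.tolist()

# Build the mapping from the unique (tmID, franchID) pairs.
pairs = toy_teams[["tmID", "franchID"]].drop_duplicates()
team_to_franchise = dict(zip(pairs["tmID"], pairs["franchID"]))

# Apply it to another table; codes missing from the mapping are left unchanged.
toy_coaches = pd.DataFrame({"tmID": ["BBB", "CCC", "ZZZ"]})
toy_coaches["tmID"] = (
    toy_coaches["tmID"].map(team_to_franchise).fillna(toy_coaches["tmID"])
)
```

If `ambiguous` is non-empty, the mapping would need an extra key (for example the year) before it can be applied safely.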

Several attributes in the Teams dataset are highly correlated with each other and may not be necessary, so we make the following changes:

- As noted above, `tmID` and `franchID` are almost always the same. They differ only when the name under which the team took part in the competition (`tmID`) is not its current name (`franchID`), that is, when the team changed its name.

- The `name` and `arena` attributes, which give us the full name of the team and the name of its arena, are not necessary, as `franchID` is sufficient as an identifier.

- Regarding the season performance attributes, `GP` takes one of two values, 32 or 34, depending on the year. In both cases the number of games is even, with half of them played at home and the rest away. Therefore:

    - The `GP` attribute can be obtained by doing `won + lost`.
    
    - The `homeL` attribute can be obtained by doing `GP/2 - homeW`.
    
    - The `awayL` attribute can be obtained by doing `lost - homeL`.
    
    - The `awayW` attribute can be obtained by doing `won - homeW`.

- The offensive statistics also contain derivable attributes, and the same reasoning applies to the defensive statistics.

    - The `o_reb` attribute can be obtained by doing `o_oreb + o_dreb`.

    - The `d_reb` attribute can be obtained by doing `d_oreb + d_dreb`.

    - The `o_pts` attribute can be obtained by doing `2 * ( o_fgm - o_3pm ) + 3 * o_3pm + o_ftm`, assuming `o_fgm` counts all made field goals, three-pointers included.

    - The `d_pts` attribute can be obtained by doing `2 * ( d_fgm - d_3pm ) + 3 * d_3pm + d_ftm`, under the same assumption.

- The same holds for the team rebounding attributes, both for the team and for its opponents.

    - The `tmTRB` attribute can be obtained by doing `tmORB + tmDRB`.

    - The `opptmTRB` attribute can be obtained by doing `opptmORB + opptmDRB`.
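
These identities can be checked against the data before any column is dropped. The sketch below does so on a small invented table (the numbers are not taken from the dataset), and the points formula assumes that `o_fgm` already includes three-pointers, which is an assumption about how the dataset records field goals. One row is deliberately inconsistent to show how a violation surfaces.

```python
import pandas as pd

# Invented team-season rows; the second o_pts value is deliberately wrong.
toy = pd.DataFrame({
    "won": [20, 10], "lost": [12, 24], "GP": [32, 34],
    "homeW": [12, 6], "homeL": [4, 11], "awayW": [8, 4], "awayL": [8, 13],
    "o_oreb": [300, 280], "o_dreb": [700, 650], "o_reb": [1000, 930],
    "o_fgm": [800, 750], "o_3pm": [150, 100], "o_ftm": [500, 450],
    "o_pts": [2250, 2060],
})

# Each candidate column paired with the value reconstructed from other columns.
identities = {
    "GP": toy["won"] + toy["lost"],
    "homeL": toy["GP"] / 2 - toy["homeW"],
    "awayL": toy["lost"] - toy["homeL"],
    "awayW": toy["won"] - toy["homeW"],
    "o_reb": toy["o_oreb"] + toy["o_dreb"],
    # Assumption: o_fgm counts two- and three-point field goals together.
    "o_pts": 2 * (toy["o_fgm"] - toy["o_3pm"]) + 3 * toy["o_3pm"] + toy["o_ftm"],
}

# Count the rows where the stored value disagrees with the reconstruction.
violations = {
    col: int((toy[col] != expected).sum()) for col, expected in identities.items()
}
safe_to_drop = [col for col, n in violations.items() if n == 0]
```

Only the columns in `safe_to_drop` are fully recoverable from the others; any column with violations deserves a closer look before it is removed.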

In [417]:
print( f"Before Teams had { len( tables[ 'Teams' ].columns ) } columns ")
tables[ 'Teams' ] = tables[ 'Teams' ].drop( [ 'tmID', 'name', 'arena' ], axis = 1)
tables[ 'Teams' ] = tables[ 'Teams' ].drop( [ 'GP', 'homeL', 'awayL', 'awayW' ], axis = 1)
tables[ 'Teams' ] = tables[ 'Teams' ].drop( [ 'o_reb', 'd_reb', 'o_pts', 'd_pts' ], axis = 1)
tables[ 'Teams' ] = tables[ 'Teams' ].drop( [ 'tmTRB', 'opptmTRB' ], axis = 1)
print( f"After Teams has { len( tables[ 'Teams' ].columns ) } columns ")

Before Teams had 61 columns 
After Teams has 48 columns 


In [418]:
tables['Teams'].columns = ['teams_' + col for col in tables['Teams'].columns]
tables['Teams Post'].columns = ['teams_post_' + col for col in tables['Teams Post'].columns]
tables['Coaches'].columns = ['coaches_' + col for col in tables['Coaches'].columns]

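# NOTE: this join matches on the franchise only, not on the season, so each team-season
# row is paired with every post-season row of the same franchise
tables['Teams'] = pd.merge( tables['Teams'], tables['Teams Post'], left_on='teams_franchID', right_on='teams_post_tmID', how='left')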

# Not merged for now: the same team can have different coaches in the same year
# tables['Teams'] = pd.merge( tables['Teams'], tables['Coaches'], left_on='teams_franchID', right_on='coaches_tmID', how='left')

tables['Teams'] = tables['Teams'].drop( ['teams_post_year', 'coaches_year'], axis=1, errors='ignore')

print(len(tables[ 'Teams'].columns))
clean_remove_columns_equal('Teams')
clean_remove_columns_na('Teams')
print(len(tables[ 'Teams'].columns))


52
Teams went from 52 to 45 columns : ['teams_lgID', 'teams_seeded', 'teams_tmORB', 'teams_tmDRB', 'teams_opptmORB', 'teams_opptmDRB', 'teams_post_lgID']
Teams went from 45 to 44 columns : ['teams_divID']
44
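
One possible way to handle teams with several coaches in the same year is to collapse the coaches table to one row per team and season before merging, and to join on both the team and the year. The sketch below illustrates this on invented data; the columns `coaches_coachID`, `coaches_won`, `coaches_lost` and `teams_won` are assumptions about the prefixed tables, and summing the records is only one of several reasonable aggregations.

```python
import pandas as pd

# Invented team-season rows (one row per franchise and year).
toy_teams = pd.DataFrame({
    "teams_year": [1, 2],
    "teams_franchID": ["AAA", "AAA"],
    "teams_won": [20, 10],
})

# Invented coaches rows: in year 2 the team had two different coaches.
toy_coaches = pd.DataFrame({
    "coaches_coachID": ["c1", "c2", "c3"],
    "coaches_year": [1, 2, 2],
    "coaches_tmID": ["AAA", "AAA", "AAA"],
    "coaches_won": [20, 3, 7],
    "coaches_lost": [12, 10, 14],
})

# Collapse to one row per team-season: count the coaches and sum their records.
coaches_per_season = (
    toy_coaches
    .groupby(["coaches_tmID", "coaches_year"], as_index=False)
    .agg(
        coaches_n=("coaches_coachID", "nunique"),
        coaches_won=("coaches_won", "sum"),
        coaches_lost=("coaches_lost", "sum"),
    )
)

# Join on team AND year; validate raises if either side has duplicate keys.
merged = pd.merge(
    toy_teams,
    coaches_per_season,
    left_on=["teams_franchID", "teams_year"],
    right_on=["coaches_tmID", "coaches_year"],
    how="left",
    validate="one_to_one",
)
```

Because the keys include the year and the right-hand side is aggregated first, `merged` keeps exactly one row per team-season instead of multiplying rows.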
