Load in Data (Regular Season, Tournament Seeds, Tournament Game Results)

In [2]:
import pandas as pd
import numpy as np
reg_df = pd.read_csv('../data/raw/MRegularSeasonCompactResults.csv')
seed_df = pd.read_csv('../data/raw/MNCAATourneySeeds.csv')
tourney_df = pd.read_csv('../data/raw/MNCAATourneyCompactResults.csv')

From the Regular Season data, compute the average differences that will be used as training data.
The model predicts tournament games from regular season data plus the tournament seeds. The entire regular season is complete
before the tournament starts, so regular season averages are available at prediction time.
Features: Win Percentage Difference, Average Points Scored (PPG) Difference, Average Points Allowed (APG) Difference, Seed Difference
reg_stats_df stores the aggregate data from the regular season:
WinPCT = total wins of season / total games of season
PointsScored = total points scored in the season, across both wins and losses (intermediate value)
PointsAllowed = total points opponents scored in the season, across both wins and losses (intermediate value)
PPG = PointsScored / total games of season
APG (average points allowed per game) = PointsAllowed / total games of season
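
As a quick check of these definitions, here is a minimal sketch on a made-up three-game season. The `toy_df` frame, its scores, and the `Scored`/`Allowed`/`Games` column names are invented for illustration; only the `Season`, `WTeamID`, `WScore`, `LTeamID`, `LScore` columns follow the compact results file. It stacks every game into one row per team, which is an alternative way to reach the same averages rather than the winner/loser tables built in this notebook.

```python
import pandas as pd

# Hypothetical mini-season: team 1 beats team 2 twice, team 2 beats team 3 once.
toy_df = pd.DataFrame({
    'Season':  [2020, 2020, 2020],
    'WTeamID': [1, 1, 2],
    'WScore':  [80, 70, 65],
    'LTeamID': [2, 2, 3],
    'LScore':  [60, 68, 55],
})

keep = ['Season', 'TeamID', 'Scored', 'Allowed']

# One row per team per game, seen from that team's side.
as_winner = toy_df.rename(
    columns={'WTeamID': 'TeamID', 'WScore': 'Scored', 'LScore': 'Allowed'}
)[keep].assign(Win=1)
as_loser = toy_df.rename(
    columns={'LTeamID': 'TeamID', 'LScore': 'Scored', 'WScore': 'Allowed'}
)[keep].assign(Win=0)
games = pd.concat([as_winner, as_loser], ignore_index=True)

# Mean of the 0/1 Win flag is the win percentage; means of points are PPG and APG.
toy_stats = games.groupby(['Season', 'TeamID']).agg(
    Games=('Win', 'size'),
    WinPCT=('Win', 'mean'),
    PPG=('Scored', 'mean'),
    APG=('Allowed', 'mean'),
).reset_index()
print(toy_stats)
```

In this toy season team 1 never loses and team 3 never wins, and both still get a row, which is a useful property to confirm for any aggregation of this kind.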

In [3]:
# First intermediate winner and loser tables
winner_df = reg_df.groupby(['Season', 'WTeamID']).size().reset_index(name = "Wins")
winner_df.rename(columns={'WTeamID': 'TeamID'}, inplace = True)
loser_df = reg_df.groupby(['Season', 'LTeamID']).size().reset_index(name = "Loses")
loser_df.rename(columns={'LTeamID': 'TeamID'}, inplace = True)
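# Outer merge so a team with no wins or no losses in a season is kept; its missing count becomes 0
wl_df = pd.merge(winner_df, loser_df, on = ['Season', 'TeamID'], how = 'outer')
wl_df[['Wins', 'Loses']] = wl_df[['Wins', 'Loses']].fillna(0).astype(int)
# Add the win percentage (Wins / Total Games), where total games is (Wins + Losses) for each team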
wl_df['WinPCT'] = wl_df['Wins'] / (wl_df['Wins'] + wl_df['Loses'])

In [4]:
# Intermediate tables again for finding the points scored and allowed for each team and season
wpoints_df = reg_df.groupby(['Season', 'WTeamID']).agg({'WScore': 'sum', 'LScore': 'sum'}).reset_index()
wpoints_df.rename(columns = {'WTeamID': 'TeamID', 'WScore': 'PointsScored', 'LScore': 'PointsAllowed'}, inplace = True)
lpoints_df = reg_df.groupby(['Season', 'LTeamID']).agg({'LScore': 'sum', 'WScore': 'sum'}).reset_index()
lpoints_df.rename(columns = {'LTeamID': 'TeamID', 'LScore': 'PointsScored', 'WScore': 'PointsAllowed'}, inplace = True)
# Stores the total points a team scored (whether they won or lost) and allowed for the season
points_df = pd.concat([wpoints_df, lpoints_df]).groupby(['Season', 'TeamID']).sum().reset_index()
# Merge the points_df and wl_df, then compute the averages
reg_stats_df = pd.merge(wl_df, points_df, on = ('Season', 'TeamID'))
reg_stats_df['PPG'] = reg_stats_df['PointsScored'] / (reg_stats_df['Wins'] + reg_stats_df['Loses'])
reg_stats_df['APG'] = reg_stats_df['PointsAllowed'] / (reg_stats_df['Wins'] + reg_stats_df['Loses'])

Now that the stats are ready, add the individual tournament games and their outcomes, so the model sees one tournament game alongside
each team's regular season stats when making a prediction. The initial table is Season, Team1, Team2, Winner. Each team's stats are then attached and
reduced to differences to keep the column count small. Team1 is the team with the greater TeamID and Team2 the team with the lesser one. Assigning Team1 = WTeamID and Team2 = LTeamID would not work, because the model would simply learn that Team1 always wins. Winner is 1 if Team1 == WTeamID (Team1 won) and 0 if Team1 lost (Team2 won).
t_df holds all the temporary values so far, so it has a lot of columns.
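
To make the labeling rule concrete, here is a small sketch on invented data. The team IDs and the `toy_tourney` frame are hypothetical; the column names match the tournament compact results file.

```python
import numpy as np
import pandas as pd

# Hypothetical tournament results; the team IDs are made up.
toy_tourney = pd.DataFrame({
    'Season':  [2020, 2020, 2020],
    'WTeamID': [1101, 1350, 1200],
    'LTeamID': [1250, 1120, 1400],
})

# Team1 is always the greater ID, regardless of who won.
team1 = np.maximum(toy_tourney['WTeamID'], toy_tourney['LTeamID'])
team2 = np.minimum(toy_tourney['WTeamID'], toy_tourney['LTeamID'])

toy_matchups = pd.DataFrame({
    'Season': toy_tourney['Season'],
    'Team1': team1,
    'Team2': team2,
    # 1 when the greater-ID team is the one that won, else 0.
    'Winner': (toy_tourney['WTeamID'] == team1).astype(int),
})
print(toy_matchups)
```

Because the ordering depends on the IDs and not on the result, the `Winner` column contains both 0s and 1s, which is what keeps the label from being trivially predictable.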

In [6]:
# Reshape tourney_df to keep only what the final matchup + stats table needs:
# take Season, WTeamID, and LTeamID, reshape to Season, Team1, Team2, Winner, then attach the regular season stats
t_data = {
    'Season': tourney_df['Season'],
    'Team1': np.maximum(tourney_df['WTeamID'], tourney_df['LTeamID']),
    'Team2': np.minimum(tourney_df['WTeamID'], tourney_df['LTeamID']),
    'Winner': (tourney_df['WTeamID'] == np.maximum(tourney_df['WTeamID'], tourney_df['LTeamID'])).astype(int),
}
t_df = pd.DataFrame(t_data)

# More temporary merging tables!
t_df = t_df.merge(reg_stats_df, how = 'left', left_on = ('Season', 'Team1'), right_on = ('Season', 'TeamID'))
t_df.drop(columns=['TeamID'], inplace = True)
t_df = t_df.merge(reg_stats_df, how = 'left', left_on = ('Season', 'Team2'), right_on = ('Season', 'TeamID'), suffixes = ('', '_2'))
t_df.drop(columns=['TeamID'], inplace = True)

# Attach seeds here, first modify seed table to have numeric values
seed_df['SeedNum'] = seed_df['Seed'].str.extract(r'(\d+)', expand = False).astype(int)
t_df = t_df.merge(seed_df, how = 'left', left_on = ('Season', 'Team1'), right_on = ('Season', 'TeamID'))
t_df.drop(columns=['TeamID'], inplace = True)
t_df = t_df.merge(seed_df, how = 'left', left_on = ('Season', 'Team2'), right_on = ('Season', 'TeamID'), suffixes = ('', '_2'))
t_df.drop(columns=['TeamID'], inplace = True)

# Compute the differences now that the values are in
t_df['WinPCTDiff'] = t_df['WinPCT'] - t_df['WinPCT_2']
t_df['PPGDiff'] = t_df['PPG'] - t_df['PPG_2']
t_df['APGDiff'] = t_df['APG'] - t_df['APG_2']
t_df['SeedDiff'] = t_df['SeedNum'] - t_df['SeedNum_2']

In [7]:
# Create the final feature table from everything that's been calculated
feature_data = {
    'Season': t_df['Season'], 
    'WinPCTDiff': t_df['WinPCTDiff'],
    'PPGDiff': t_df['PPGDiff'],
    'APGDiff': t_df['APGDiff'],
    'Winner': t_df['Winner'],
    'SeedDiff': t_df['SeedDiff']
}
features = pd.DataFrame(feature_data)
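
The stats and seeds are attached with left merges, so a matchup whose team is absent from either table would end up with NaN in the difference columns. Below is a minimal sketch of one way to check for that before training. The `count_missing` helper and the `toy_features` values are hypothetical and only mirror the column names of the feature table.

```python
import numpy as np
import pandas as pd

def count_missing(df, columns):
    """Hypothetical helper: number of NaN values in each of the given columns."""
    return df[columns].isna().sum()

# Invented rows shaped like the feature table; the second row simulates a failed stats merge.
toy_features = pd.DataFrame({
    'Season': [2020, 2020, 2021],
    'WinPCTDiff': [0.10, np.nan, -0.25],
    'PPGDiff': [3.5, np.nan, -1.0],
    'APGDiff': [-2.0, np.nan, 4.0],
    'SeedDiff': [-3, 5, 0],
    'Winner': [1, 0, 0],
})

diff_cols = ['WinPCTDiff', 'PPGDiff', 'APGDiff', 'SeedDiff']
missing = count_missing(toy_features, diff_cols)
complete = toy_features.dropna(subset=diff_cols)

print(missing)
print(len(complete), "of", len(toy_features), "rows have all features")
```

Whether incomplete rows should be dropped or investigated depends on why the merge missed them, so the `dropna` call here is only one option.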