<img src="https://upload.wikimedia.org/wikipedia/en/thumb/2/28/March_Madness_logo.svg/1920px-March_Madness_logo.svg.png" width="1000px">


# Create Average Team Stats from RegularSeasonDetailedResults

Since the competition didn't provide a dataset of aggregated team stats for each year, and these will likely be important for any model, I aggregated them myself. These are just the basic stats provided for each game; I'll likely calculate more advanced stats and use them in future models.

In [None]:
# Load libraries:
import pandas as pd
import numpy as np
import os


pd.options.display.max_rows=99
pd.options.display.max_columns=999

The competition provides game-by-game stats in the RegularSeasonDetailedResults csv file, where each row corresponds to a game. Each row contains the winning team ID and the losing team ID, followed by the winning team's stats and the losing team's stats for that game. Because a team's stats sit in the W columns for games it won and in the L columns for games it lost, summing a team's data over a season takes a little extra work. <br>
<br>
First, I aggregate the winning and losing columns separately into two dataframes, "WTeam_ID_mean" and "LTeam_ID_mean", which contain the season, team ID, and mean statistics for the respective columns. I also count the number of wins and losses for each team so that I can calculate the weighted arithmetic mean of each statistic.<br>
<br>
I rename the columns of each dataframe to a shared set of names so that the two can be combined. I then split the weighted mean calculation into three steps: multiplying the winning and losing means by the number of wins and losses, summing the results, and dividing by the total number of games played. <br>
<br>
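To make the weighting concrete, here is a minimal sketch with made-up numbers: a hypothetical team with 3 wins and 1 loss, using values that are illustrative and not taken from the competition data. It suggests that weighting the per-outcome means by games played recovers the plain mean over all games.

```python
import numpy as np

# Hypothetical field goals made (FGM) for one team in one season.
fgm_in_wins = np.array([30, 28, 32])   # 3 wins
fgm_in_losses = np.array([20])         # 1 loss

wins, losses = len(fgm_in_wins), len(fgm_in_losses)

# Step 1: multiply each mean by the number of games it summarizes.
weighted_wins = fgm_in_wins.mean() * wins
weighted_losses = fgm_in_losses.mean() * losses

# Steps 2 and 3: sum the two parts and divide by total games played.
weighted_mean = (weighted_wins + weighted_losses) / (wins + losses)

# Averaging all four games directly should give the same value.
overall_mean = np.concatenate([fgm_in_wins, fgm_in_losses]).mean()

print(weighted_mean, overall_mean)
```

A simple average of the two means would instead give 25.0 here, which overweights the single loss.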
Let's take a look at the raw data, and then calculate the average stats:



## Raw Data:

In [None]:
# load data: first read only a few rows to get the dtypes pandas infers
season_results = '../input/datafiles/RegularSeasonDetailedResults.csv'
season_df = pd.read_csv(season_results, nrows=100)

season_dtypes = season_df.dtypes.to_dict()
season_df = pd.read_csv(season_results, dtype = season_dtypes, low_memory=False)
season_df.head()
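The two-pass read above can be sketched on a small in-memory CSV. The column subset and values below are hypothetical stand-ins for the competition file, used only so the example runs without it.

```python
import io
import pandas as pd

# Hypothetical stand-in for the competition file, kept in memory.
csv_text = "Season,WTeamID,WScore\n2003,1104,68\n2003,1272,70\n2004,1266,73\n"

# Pass 1: read a small sample and record the dtypes pandas infers.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
inferred_dtypes = sample.dtypes.to_dict()

# Pass 2: read the full data with those dtypes fixed up front.
full = pd.read_csv(io.StringIO(csv_text), dtype=inferred_dtypes)

print(full.dtypes)
```

This relies on the sampled rows being representative. If a later row held a missing value in a column inferred as integer, the second read would likely fail, so the approach fits best when the file is known to be complete.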

## Calculate Average Stats:

In [None]:
# collect stats for each team:
#stat columns:
w_cols = ['Season', 'WTeamID', 'WFGM', 'WFGA','WFGM3','WFGA3','WFTM','WFTA',
          'WOR','WDR','WAst','WTO','WStl','WBlk','WPF']

l_cols = ['Season', 'LTeamID', 'LFGM','LFGA','LFGM3','LFGA3','LFTM','LFTA',
          'LOR','LDR','LAst','LTO','LStl','LBlk','LPF']

# collecting stats:
# average stats per team for winning games
WTeam_ID_mean = season_df[w_cols].groupby(['Season', 'WTeamID']).mean().reset_index()

# number of wins per team
WTeam_counts = season_df.groupby(['Season', 'WTeamID']).agg({'WScore':'count'}).reset_index()
# average stats per team for losing games
LTeam_ID_mean = season_df[l_cols].groupby(['Season', 'LTeamID']).mean().reset_index()
# number of losses per team
LTeam_counts = season_df.groupby(['Season', 'LTeamID']).agg({'LScore':'count'}).reset_index()

# rename columns:
col_names = ['Season', 'TeamID', 'FGM', 'FGA','FGM3','FGA3','FTM','FTA','OR',
             'DR','Ast','TO','Stl','Blk','PF']
WTeam_ID_mean.columns = col_names
LTeam_ID_mean.columns = col_names

# rename columns for win/loss counts
WTeam_counts.columns = ['Season', 'TeamID', 'Wins']
LTeam_counts.columns = ['Season', 'TeamID', 'Losses']

# merge number of wins and losses with their respective average stats,
# then index by (Season, TeamID) so the two tables align by team rather
# than by row position (a team with no wins or no losses in a season
# appears in only one of the two tables)
keys = ['Season', 'TeamID']
WTeam_ID_mean = WTeam_ID_mean.merge(WTeam_counts, how='left', on=keys).set_index(keys)
LTeam_ID_mean = LTeam_ID_mean.merge(LTeam_counts, how='left', on=keys).set_index(keys)

# weighted mean:
cols = ['FGM', 'FGA','FGM3','FGA3','FTM','FTA','OR', 'DR','Ast','TO','Stl','Blk','PF']
stats_wins = WTeam_ID_mean[cols].mul(WTeam_ID_mean['Wins'], axis=0) # multiply by number of wins
stats_losses = LTeam_ID_mean[cols].mul(LTeam_ID_mean['Losses'], axis=0) # multiply by number of losses

# sum of total games (fill_value=0 covers teams missing from one table):
games = WTeam_ID_mean['Wins'].add(LTeam_ID_mean['Losses'], fill_value=0)

# weighted mean calculation:
team_stats = stats_wins.add(stats_losses, fill_value=0).div(games, axis=0)

# restore Season and TeamID as ordinary columns
final_stats = team_stats.reset_index()


The final output will look like this:

In [None]:
final_stats.head()
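As a cross-check on the averages, one alternative (not the method used in this kernel) is to stack the winner and loser columns into one long table and take a plain mean per team. The sketch below uses a hypothetical three-game season with a single stat; `toy_games` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical three-game season with one stat, for illustration only.
toy_games = pd.DataFrame({
    'Season':  [2003, 2003, 2003],
    'WTeamID': [1, 1, 2],
    'LTeamID': [2, 3, 3],
    'WFGM':    [30, 28, 25],
    'LFGM':    [20, 22, 24],
})

# Relabel the winner and loser columns to a shared schema.
winners = toy_games[['Season', 'WTeamID', 'WFGM']].rename(
    columns={'WTeamID': 'TeamID', 'WFGM': 'FGM'})
losers = toy_games[['Season', 'LTeamID', 'LFGM']].rename(
    columns={'LTeamID': 'TeamID', 'LFGM': 'FGM'})

# One row per team per game, then a plain mean per (Season, TeamID).
long_df = pd.concat([winners, losers], ignore_index=True)
check = long_df.groupby(['Season', 'TeamID'], as_index=False).mean()

print(check)
```

Team 1 never loses and team 3 never wins in this toy season, so it also exercises the case where a team shows up in only one of the win and loss tables.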

I'll likely calculate more advanced stats, winning percentage, etc. in the future, but hopefully this kernel will speed up other people's models. <br>
<br>
Good luck! <br>
<br>
Matan Freedman