### Data review for first pass model

* teams - two columns, TeamID and TeamName. Only needed to map IDs to names, so the name is probably not used in the model
* seasons - shows the year-by-year differences in how the competition was set up: start dates and regions
* seeds - lists the seeds for each tournament year. Obviously very important for comparing two teams, but can also be used for building a picture of how the season has gone for each team
* seasonResults - for every regular season game, has the date, venue, number of overtimes and result. I expect the score margin in each game to be more informative than the win/loss result alone
* tourneyResults - same fields as seasonResults, but for the tournament games. The key question is how to weigh the season to date when predicting these games: is all of that information already captured in the seed?

Next steps:
1. Build a basic model using some obvious features (seed, season win record, season points for and against) to establish the model pipeline
2. Refine the features used and modelling methodology
3. Add further data items from the remaining datasets


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
# helper to load a file from ../input and print a quick summary

def filecheck(file):
    """Read a CSV from ../input, print its first rows and shape, and return the DataFrame."""
    df = pd.read_csv("../input/{}".format(file))
    print(df.head())
    print(df.shape)
    return df
    
teams = filecheck("WTeams.csv")

**teams**
<br>This table just maps the IDs to names and can be ignored for modelling.

In [None]:
# import the rest of the data section 1 files

seasons = filecheck("WSeasons.csv")

**seasons**
<br>Gives info on the region split in each season, which may come into later models. Would likely use the "Season" field as the key.

In [None]:
seeds = filecheck("WNCAATourneySeeds.csv")

In [None]:
# create a unique key for each team-season combination

seeds['key'] = seeds.Season.astype(str) + seeds.TeamID.astype(str)
# the two counts below should match if the key is unique
print(len(seeds['key']))
print(len(seeds['key'].unique()))

seeds.head()

**seeds**
<br>The table is now unique on the combination of Season and TeamID.
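As a minimal sketch of the same uniqueness check, the snippet below builds the key with an explicit separator and tests it on a toy table. The helper name `add_team_season_key`, the underscore separator and the toy values are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-in for the seeds table; the values are made up for illustration.
toy_seeds = pd.DataFrame({
    "Season": [2017, 2017, 2018],
    "TeamID": [3101, 3102, 3101],
    "Seed": ["W01", "X02", "W03"],
})

def add_team_season_key(df, team_col="TeamID", key_col="key"):
    """Return a copy of df with a 'Season_TeamID' string key (hypothetical helper)."""
    out = df.copy()
    out[key_col] = out["Season"].astype(str) + "_" + out[team_col].astype(str)
    return out

keyed = add_team_season_key(toy_seeds)

# Two equivalent ways of confirming one row per team-season.
key_is_unique = keyed["key"].is_unique
n_duplicates = int(keyed.duplicated(subset=["Season", "TeamID"]).sum())
print(keyed)
print(key_is_unique, n_duplicates)
```

The separator only matters if the two IDs could ever have varying widths; with four-digit seasons and four-digit team IDs the plain concatenation is already unambiguous.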

In [None]:
seasonResults = filecheck("WRegularSeasonCompactResults.csv")

**seasonResults**
<br>This table could use some additional information, derived from its own fields as well as from the seeds table:

* key to match seeds table
* seed of each team
* difference in seed
* weighted value of result (account for Home/Away and seed) - this will require further analysis
* group field by season and teamid to give ratings of each team in each season
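The first two items in the list above (matching to the seeds table and adding each team's seed) could be sketched as a single helper that joins on Season and TeamID directly. The function name `attach_seeds` and the toy data are assumptions for illustration; the column names follow the tables loaded above.

```python
import pandas as pd

def attach_seeds(results, seeds):
    """Left-join winner and loser seeds onto a results table (hypothetical helper)."""
    out = results.copy()
    for side in ("W", "L"):
        team_col = side + "TeamID"
        seed_col = side.lower() + "Seed"
        # Rename the seeds columns so they line up with this side of the game.
        side_seeds = seeds.rename(columns={"TeamID": team_col, "Seed": seed_col})
        out = out.merge(side_seeds[["Season", team_col, seed_col]],
                        how="left", on=["Season", team_col])
    return out

# Toy data: team 3103 is unseeded, so its seed should come back as NaN.
toy_seeds = pd.DataFrame({"Season": [2018, 2018],
                          "TeamID": [3101, 3102],
                          "Seed": ["W01", "X05"]})
toy_results = pd.DataFrame({
    "Season": [2018, 2018],
    "WTeamID": [3101, 3103],
    "LTeamID": [3102, 3101],
    "WScore": [70, 65],
    "LScore": [60, 61],
})
with_seeds = attach_seeds(toy_results, toy_seeds)
print(with_seeds)
```

Joining on the two columns avoids the need for a concatenated string key, at the cost of a rename per side.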

In [None]:
# add key to match winning and losing teams

seasonResults['wKey'] = seasonResults.Season.astype(str) + seasonResults.WTeamID.astype(str) 
seasonResults['lKey'] = seasonResults.Season.astype(str) + seasonResults.LTeamID.astype(str)

seasonResults = pd.merge(seasonResults, seeds[['Seed','key']],
                         how='left', left_on='wKey', right_on='key')
seasonResults = seasonResults.rename(columns={'Seed':'wSeed'})
seasonResults = seasonResults.drop(['key'], axis = 1)

seasonResults = pd.merge(seasonResults, seeds[['Seed','key']],
                         how='left', left_on='lKey', right_on='key')
seasonResults = seasonResults.rename(columns={'Seed':'lSeed'})
seasonResults = seasonResults.drop(['key'], axis = 1)

seasonResults.head()

In [None]:
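# pivot the table to find the number of distinct winning teams and seeds in each year
# (len(x.unique()) counts NaN as a value, so the wSeed count is one higher whenever unseeded teams are present)
# can use this to determine how to treat unseeded teams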

seasonResults.pivot_table(index='Season', values=['WTeamID','wSeed'],
                          aggfunc=lambda x: len(x.unique()))
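A small sketch of how the distinct-count behaves when unseeded teams are present: `len(x.unique())` treats NaN as one more value, while `nunique()` skips it. The toy table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy results: team 3103 is unseeded, so its wSeed is NaN.
toy = pd.DataFrame({
    "Season": [2018, 2018, 2018, 2018],
    "WTeamID": [3101, 3102, 3103, 3101],
    "wSeed": ["W01", "X02", np.nan, "W01"],
})

# Counts NaN as a distinct value.
with_nan = toy.groupby("Season")["wSeed"].agg(lambda x: len(x.unique())).loc[2018]
# Ignores NaN by default.
without_nan = toy.groupby("Season")["wSeed"].nunique().loc[2018]
print(with_nan, without_nan)
```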

In [None]:
# Unseeded teams need a numeric stand-in for their seed. Spreading them evenly behind the seeded
# teams gives ((350-65)/4)/2 + 16 ~= 52 as the average "seed" for unseeded teams. This feels like it will
# penalise too severely any losses to unseeded teams, so for now I'll use 20 in place of NaNs

# characters 1-2 of the seed string hold the seed number (e.g. "W01" -> "01")
seasonResults['wSeedNum'] = seasonResults.wSeed.str[1:3].fillna(20).astype(int)
seasonResults['lSeedNum'] = seasonResults.lSeed.str[1:3].fillna(20).astype(int)

# winner's seed minus loser's seed
seasonResults['seedDiff'] = seasonResults.wSeedNum - seasonResults.lSeedNum
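The seed parsing could also be sketched as a small function that pulls out the digits with a regular expression, so it does not depend on the exact position or length of the seed string. The helper name `seed_to_number` is an assumption, and the `"Y11a"` value is a made-up case included only to show that a trailing letter would not break the parsing.

```python
import pandas as pd

def seed_to_number(seed_series, unseeded_value=20):
    """Extract the numeric part of seed strings such as 'W01'.

    Missing seeds (unseeded teams) are replaced by unseeded_value.
    """
    digits = seed_series.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits).fillna(unseeded_value).astype(int)

# Toy seeds: a missing value stands for an unseeded team.
toy = pd.Series(["W01", "X16", None, "Y11a"])
seed_numbers = seed_to_number(toy).tolist()
print(seed_numbers)
```

Keeping the placeholder as a parameter makes it easy to try values other than 20 later on.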

In [None]:
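# per team-season: number of wins/losses and total points scored in those games
wins = seasonResults.pivot_table(index='wKey', values=['WScore'], aggfunc=('count','sum'))
losses = seasonResults.pivot_table(index='lKey', values=['LScore'], aggfunc=('count','sum'))

# outer join so teams with no wins or no losses in a season are kept, with zeros for the missing side
seasonSumm = pd.merge(wins, losses, how='outer', left_index=True, right_index=True).fillna(0)

seasonSumm['gamesPlayed'] = seasonSumm['WScore']['count'] + seasonSumm['LScore']['count']
seasonSumm['record'] = seasonSumm['WScore']['count']/seasonSumm['gamesPlayed']
# note: this is points scored in wins minus points scored in losses,
# not points for minus points against
seasonSumm['pointsDiff'] = seasonSumm['WScore']['sum'] - seasonSumm['LScore']['sum']
seasonSumm['meanPointsDiff'] = seasonSumm['pointsDiff']/seasonSumm['gamesPlayed']

# flatten the two-level column names, e.g. ('WScore', 'count') -> 'WScore count'
seasonSumm.columns = [' '.join(col).strip() for col in seasonSumm.columns.values]

seasonSumm.head()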
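If the aim is the "season points for and against" feature from the next steps, one possible sketch is to stack each game twice, once from each team's point of view, and aggregate. This is an alternative illustration rather than the summary built above; the function name `season_summary` and the toy results are assumptions.

```python
import pandas as pd

def season_summary(results):
    """Per team-season games played, win rate and mean points margin (illustrative sketch)."""
    # One row per game from the winner's point of view...
    winners = pd.DataFrame({
        "Season": results["Season"], "TeamID": results["WTeamID"],
        "pointsFor": results["WScore"], "pointsAgainst": results["LScore"], "win": 1,
    })
    # ...and one row per game from the loser's point of view.
    losers = pd.DataFrame({
        "Season": results["Season"], "TeamID": results["LTeamID"],
        "pointsFor": results["LScore"], "pointsAgainst": results["WScore"], "win": 0,
    })
    long_form = pd.concat([winners, losers], ignore_index=True)
    summ = long_form.groupby(["Season", "TeamID"]).agg(
        gamesPlayed=("win", "size"),
        record=("win", "mean"),
        pointsFor=("pointsFor", "sum"),
        pointsAgainst=("pointsAgainst", "sum"),
    )
    summ["meanPointsDiff"] = (summ["pointsFor"] - summ["pointsAgainst"]) / summ["gamesPlayed"]
    return summ

# Toy season: 3101 is undefeated, 3103 has no wins.
toy_results = pd.DataFrame({
    "Season": [2018, 2018, 2018],
    "WTeamID": [3101, 3101, 3102],
    "LTeamID": [3102, 3103, 3103],
    "WScore": [70, 80, 66],
    "LScore": [60, 50, 64],
})
summary = season_summary(toy_results)
print(summary)
```

Because every team appears in the stacked table whether it won or lost, undefeated and winless teams are kept without any special handling.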

In [None]:
tourneyResults = filecheck("WNCAATourneyCompactResults.csv")

In [None]:
# First add the target to the tourney result data, 1 if the Winning Team has a lower ID than the losing 
# one, and 0 otherwise

tourneyResults['target'] = (tourneyResults['WTeamID'] < tourneyResults['LTeamID']).astype(int)

# create key and merge in the features created above in the seasonSumm
tourneyResults['wKey'] = tourneyResults.Season.astype(str) + tourneyResults.WTeamID.astype(str)
tourneyResults['lKey'] = tourneyResults.Season.astype(str) + tourneyResults.LTeamID.astype(str)
tourneyResults = pd.merge(tourneyResults, seasonSumm.iloc[:, 4:],
                          how='left', left_on='wKey', right_index=True)
tourneyResults = tourneyResults.rename(columns={'gamesPlayed':'w_gamesPlayed', 
                                                'record':'w_record',
                                                'pointsDiff':'w_pointsDiff',
                                                'meanPointsDiff':'w_meanPointsDiff'})
tourneyResults = pd.merge(tourneyResults, seasonSumm.iloc[:, 4:],
                          how='left', left_on='lKey', right_index=True)
tourneyResults = tourneyResults.rename(columns={'gamesPlayed':'l_gamesPlayed', 
                                                'record':'l_record',
                                                'pointsDiff':'l_pointsDiff',
                                                'meanPointsDiff':'l_meanPointsDiff'})


print(tourneyResults.head())
print(tourneyResults.shape)

In [None]:
tourneyResults = pd.merge(tourneyResults, seeds[['Seed','key']],
                          how='left', left_on='wKey', right_on='key')
tourneyResults = tourneyResults.rename(columns={'Seed':'wSeed'})
tourneyResults = tourneyResults.drop(['key'], axis = 1)

tourneyResults = pd.merge(tourneyResults, seeds[['Seed','key']],
                          how='left', left_on='lKey', right_on='key')
tourneyResults = tourneyResults.rename(columns={'Seed':'lSeed'})
tourneyResults = tourneyResults.drop(['key'], axis = 1)


In [None]:
# characters 1-2 of the seed string hold the seed number (e.g. "W01" -> "01"); 20 stands in for any missing seed
tourneyResults['wSeedNum'] = tourneyResults.wSeed.str[1:3].fillna(20).astype(int)
tourneyResults['lSeedNum'] = tourneyResults.lSeed.str[1:3].fillna(20).astype(int)

# winner's seed minus loser's seed
tourneyResults['seedDiff'] = tourneyResults.wSeedNum - tourneyResults.lSeedNum

In [None]:
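# all differences are winner minus loser, whereas the target is defined relative to the lower TeamID
tourneyResults['playedDiff'] = tourneyResults.w_gamesPlayed - tourneyResults.l_gamesPlayed
tourneyResults['recordDiff'] = tourneyResults.w_record - tourneyResults.l_record
tourneyResults['pointsDiffDiff'] = tourneyResults.w_pointsDiff - tourneyResults.l_pointsDiff
# despite the name, this is the difference in mean points diff per game, not a ratio
tourneyResults['pointsRatioDiff'] = tourneyResults.w_meanPointsDiff - tourneyResults.l_meanPointsDiff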
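The target is defined relative to the lower TeamID, while the difference features are winner minus loser, so their sign follows the winner rather than the target. One way to line them up before modelling, sketched under the assumption that a simple sign flip is what is wanted, is shown below. The helper name `orient_to_lower_id` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def orient_to_lower_id(games, diff_cols):
    """Flip winner-minus-loser differences to lower-ID team minus higher-ID team."""
    out = games.copy()
    # +1 where the winner already has the lower TeamID, -1 where the sign needs flipping.
    sign = np.where(out["WTeamID"] < out["LTeamID"], 1, -1)
    for col in diff_cols:
        out[col] = out[col] * sign
    # Same target definition as in the notebook: 1 if the lower-ID team won.
    out["target"] = (out["WTeamID"] < out["LTeamID"]).astype(int)
    return out

# Toy games: in the second row the winner has the higher TeamID.
toy = pd.DataFrame({
    "WTeamID": [3101, 3205],
    "LTeamID": [3202, 3102],
    "seedDiff": [-3, -5],
    "recordDiff": [0.10, 0.20],
})
oriented = orient_to_lower_id(toy, ["seedDiff", "recordDiff"])
print(oriented)
```

After the flip, a negative `seedDiff` means the lower-ID team is the better seed, regardless of who won.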

In [None]:
tourneyResults.to_csv('JH_augmented_tourney.csv')
seasonResults.to_csv('JH_augmented_season.csv')
seasonSumm.to_csv('JH_season_summary.csv')