# Homecourt Advantage

This notebook examines homecourt advantage, measured in both win percentage and point differential, and how it has changed over time. Because homecourt advantage varies from year to year, it affects how a team's strength should be evaluated.

This notebook was created using the data from the [2018 March Madness Kaggle Competition](https://www.kaggle.com/c/mens-machine-learning-competition-2018). 

In [2]:
%matplotlib inline
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
path = "../input/RegularSeasonCompactResults.csv"
df_rs = pd.read_csv(path, usecols=['Season', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc'])
df_rs.head()


In [4]:
### Throw away neutral games

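# .copy() makes the filtered frame independent, so adding 'Diff' does not trigger a SettingWithCopyWarning
df_rs = df_rs[df_rs['WLoc'] != 'N'].copy()
df_rs['Diff'] = df_rs['WScore'] - df_rs['LScore']  # winner's margin of victory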
df_rs.head()

## Homecourt Advantage in Wins

In [5]:
n_games = len(df_rs)
print(n_games)

First let's look at the historical homecourt advantage by wins. 

In [6]:
home_mask = df_rs.WLoc=='H'
away_mask = df_rs.WLoc=='A'

home_wins = len(df_rs.loc[home_mask, 'WScore'])
away_wins = len(df_rs.loc[away_mask, 'WScore'])
home_wp = home_wins/(home_wins+away_wins)
print("Home Win-Loss: {}-{} (Winning %: {})".format(home_wins, away_wins, round(home_wp,3)))

The home team wins roughly twice as often as the away team!
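As a small illustration of how the count above works, here is a sketch on made-up rows. `WLoc` records the winner's location, so once neutral games are dropped, `'H'` means the home team won and `'A'` means the visitor won. The `toy` frame and its values are assumptions for illustration only, not the Kaggle data.

```python
import pandas as pd

# Made-up stand-in for the non-neutral regular-season results.
toy = pd.DataFrame({"WLoc": ["H", "H", "A", "H", "A", "H"]})

# Count games won by the home team ('H') and by the visitor ('A').
toy_counts = toy["WLoc"].value_counts()
toy_home_wins = int(toy_counts.get("H", 0))
toy_away_wins = int(toy_counts.get("A", 0))

# Home win percentage = home wins / all non-neutral games.
toy_home_wp = toy_home_wins / (toy_home_wins + toy_away_wins)
print("Home Win-Loss: {}-{} (Winning %: {})".format(
    toy_home_wins, toy_away_wins, round(toy_home_wp, 3)))
```

Equivalently, `toy["WLoc"].eq("H").mean()` gives the same fraction in one step.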

### Homecourt Wins Over Time

Next we're going to look at the home team's win percentage over time. 

In [10]:
games_by_season = df_rs[['Season','WLoc']].groupby('Season').agg('count')
games_by_season.columns = ['Games']
wins_by_season = df_rs[home_mask][['Season','WLoc']].groupby('Season').agg('count')
wins_by_season.columns = ['Wins']
#wins_by_season.rename({'WLoc':'Wins'}, axis=1, inplace=True)

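# Home win % per season; naming the Series explicitly avoids relying on how pd.DataFrame handles an unnamed Series
df_home = (wins_by_season['Wins'] / games_by_season['Games']).rename('WinP').reset_index()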
df_home.head()
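The per-season win percentage can also be computed in a single pass, since the mean of a boolean "home team won" column is the home win fraction. The sketch below uses a made-up `toy_games` frame (an assumption for illustration) with the same `Season` and `WLoc` column names as `df_rs`.

```python
import pandas as pd

# Made-up non-neutral games across two seasons.
toy_games = pd.DataFrame({
    "Season": [2016, 2016, 2016, 2017, 2017, 2017, 2017],
    "WLoc":   ["H",  "H",  "A",  "H",  "A",  "A",  "H"],
})

# Flag home wins, then average the flag within each season.
toy_winp = (
    toy_games.assign(HomeWin=toy_games["WLoc"].eq("H"))
    .groupby("Season")["HomeWin"]
    .mean()
    .rename("WinP")
    .reset_index()
)
print(toy_winp)
```

This avoids building separate win and game counts, at the cost of not keeping the raw counts around for later cells.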

In [11]:
sns.set_style("darkgrid")  # set the style before plotting so it applies to this figure
sns.regplot(x="Season", y='WinP', data=df_home, color='g')
plt.title("Home Win % Since 1985")
plt.ylabel("Home Win %")


From the plot, it appears that homecourt advantage is getting slightly weaker. Let's zoom in on the last 10 seasons.

In [12]:
sns.regplot(x="Season",y='WinP',data=df_home.tail(10), color='g')
plt.title("Home Win % Over Last 10 Seasons")
plt.ylabel('Win %')

The trendline is steeper over this shorter period, which suggests that homecourt advantage may not be as prominent now as it was before.
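One way to put a number on "steeper" is to compare least-squares slopes over the full history and over the most recent seasons. The sketch below is only an illustration: `trend_slope` is a hypothetical helper, and `toy_home` holds made-up values rather than the notebook's `df_home`.

```python
import numpy as np
import pandas as pd

def trend_slope(frame, x="Season", y="WinP"):
    """Least-squares slope of y on x, i.e. change in y per season."""
    xs = frame[x].to_numpy(dtype=float)
    ys = frame[y].to_numpy(dtype=float)
    # Centering x does not change the slope and keeps the fit well conditioned.
    slope, _intercept = np.polyfit(xs - xs.mean(), ys, deg=1)
    return slope

# Made-up series: flat for six seasons, then falling by 0.01 per season.
toy_home = pd.DataFrame({
    "Season": np.arange(2001, 2013),
    "WinP": [0.68, 0.68, 0.68, 0.68, 0.68, 0.68,
             0.67, 0.66, 0.65, 0.64, 0.63, 0.62],
})

full_slope = trend_slope(toy_home)
recent_slope = trend_slope(toy_home.tail(6))
print("Full-period slope:", full_slope)
print("Recent slope:", recent_slope)
```

In the notebook, the same comparison would be `trend_slope(df_home)` against `trend_slope(df_home.tail(10))`. With only ten points in the recent window, the fitted slope is sensitive to one or two unusual seasons.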

## Point Differential

Next we do the same steps but with point differential. 

In [13]:
df_rs.head()

In [14]:
# Average home margin = (total home points - total away points) / number of games
home_points = df_rs.loc[home_mask, 'WScore'].sum() + df_rs.loc[away_mask, 'LScore'].sum()
away_points = df_rs.loc[home_mask, 'LScore'].sum() + df_rs.loc[away_mask, 'WScore'].sum()
home_ptdiff = (home_points - away_points) / n_games
print("Historical Average Homecourt Advantage (points): ", home_ptdiff)


On average, the home team outscores the away team by 5.8 points. This is actually a large advantage. 

Based on my experience as a college basketball fan, I suspect the gap is this large partly because smaller schools with weaker programs more often travel to play bigger schools during the non-conference season.
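The average home margin computed above can also be read as the mean of a per-game signed margin: the winner's margin when the home team won, and its negative when the visitor won. The sketch below checks on made-up rows that both routes agree. `toy_rs` and its scores are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up non-neutral games.
toy_rs = pd.DataFrame({
    "WScore": [80, 70, 65, 76],
    "LScore": [70, 68, 60, 75],
    "WLoc":   ["H", "A", "H", "A"],
})

# Route 1: signed margin per game from the home team's point of view.
win_margin = toy_rs["WScore"] - toy_rs["LScore"]
is_home_win = toy_rs["WLoc"].eq("H")
toy_rs["HomeMargin"] = np.where(is_home_win, win_margin, -win_margin)
toy_home_ptdiff = toy_rs["HomeMargin"].mean()

# Route 2: total home points minus total away points, divided by games.
home_total = toy_rs.loc[is_home_win, "WScore"].sum() + toy_rs.loc[~is_home_win, "LScore"].sum()
away_total = toy_rs.loc[is_home_win, "LScore"].sum() + toy_rs.loc[~is_home_win, "WScore"].sum()
toy_ptdiff_from_totals = (home_total - away_total) / len(toy_rs)

print(toy_home_ptdiff, toy_ptdiff_from_totals)
```

Keeping a per-game `HomeMargin` column also makes per-season averages a simple `groupby('Season')` mean.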

In [17]:
# Per-season points scored by home teams and by away teams, as (n_seasons, 1) arrays
home_scores = (df_rs.loc[home_mask, ['Season', 'WScore']].groupby('Season').agg('sum').values
               + df_rs.loc[away_mask, ['Season', 'LScore']].groupby('Season').agg('sum').values)
away_scores = (df_rs.loc[home_mask, ['Season', 'LScore']].groupby('Season').agg('sum').values
               + df_rs.loc[away_mask, ['Season', 'WScore']].groupby('Season').agg('sum').values)
# Dividing by games_by_season keeps its Season index on the result
df_ptdiff = pd.DataFrame((home_scores - away_scores) / games_by_season)
#df_ptdiff.rename({'Games':'PtDiff'}, axis=1, inplace=True)
df_ptdiff.columns = ['PtDiff']
df_ptdiff = df_ptdiff.reset_index()
df_home['PtDiff'] = df_ptdiff.PtDiff
df_home.head()

In [18]:
sns.set_style("darkgrid")  # set the style before plotting so it applies to this figure
sns.regplot(x="Season", y='PtDiff', data=df_home, color='g')
plt.title("Home Point Differential Since 1985")
plt.ylabel("Home Team Point Differential")


The trendline is flatter, but that is likely due to the anomalous 1985-1987 seasons. The last 10 seasons show a downward trend consistent with the decline in home win percentage above.

In [19]:
sns.set_style("darkgrid")
sns.regplot(x="Season", y='PtDiff', data=df_home.tail(10), color='g')
plt.title("Home Point Differential Over Last 10 Seasons")
plt.ylabel("Home Team Point Differential")


### Deviation

When weighing the impact of homecourt advantage over time, it may be more helpful to see how each season deviates from the historical average homecourt advantage. Below are the same plots as above, but for deviations instead of raw values.

In [20]:
df_home['WinPDev'] = df_home['WinP'] - home_wp
df_home['PtDiffDev'] =  df_home['PtDiff'] - home_ptdiff
df_home.head()
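One detail worth keeping in mind: the baselines `home_wp` and `home_ptdiff` are pooled over all games, so they are games-weighted averages rather than plain means of the season values. The deviations therefore sum to zero only when weighted by games per season. The sketch below shows this on made-up counts; `toy_seasons` is an assumption for illustration.

```python
import pandas as pd

# Made-up season totals with unequal numbers of games.
toy_seasons = pd.DataFrame({
    "Season": [2015, 2016, 2017],
    "Games":  [100, 300, 100],
    "Wins":   [70, 180, 60],
})
toy_seasons["WinP"] = toy_seasons["Wins"] / toy_seasons["Games"]

# Pooled baseline: all home wins over all games.
pooled_wp = toy_seasons["Wins"].sum() / toy_seasons["Games"].sum()

# Deviation of each season from the pooled baseline.
toy_seasons["WinPDev"] = toy_seasons["WinP"] - pooled_wp

# Weighted by games, the deviations cancel; unweighted, they need not.
weighted_sum = (toy_seasons["WinPDev"] * toy_seasons["Games"]).sum()
unweighted_mean = toy_seasons["WinPDev"].mean()
print(pooled_wp, weighted_sum, unweighted_mean)
```

For the trend plots this only shifts the vertical axis by a constant, so the fitted slopes are the same as for the raw values.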

In [21]:
sns.set_style("darkgrid")
sns.regplot(x="Season", y='WinPDev', data=df_home, color='g')
plt.title("Home Win % Deviation Since 1985")
plt.ylabel("Home Win % Deviation")

In [22]:
sns.set_style("darkgrid")
sns.regplot(x="Season", y='WinPDev', data=df_home.tail(10), color='g')
plt.title("Home Win % Deviation Over Last 10 Seasons")
plt.ylabel("Home Win % Deviation")

In [35]:
sns.set_style("darkgrid")
sns.regplot(x="Season", y='PtDiffDev', data=df_home, color='g')
plt.title("Home Point Differential Deviation Since 1985")
plt.ylabel("Home Point Differential Deviation")

In [23]:
sns.set_style("darkgrid")
sns.regplot(x="Season", y='PtDiffDev', data=df_home.tail(10), color='g')
plt.title("Home Point Differential Deviation Over Last 10 Seasons")
plt.ylabel("Home Point Differential Deviation")

In [24]:
### Save homecourt advantage measurements to a csv
df_home.to_csv('homecourt.csv', index=False)  # the row index carries no information, so leave it out

## Next Tasks:
- Examine homecourt advantage for each team over time
- Filter out non-tournament teams to control for stronger basketball programs playing "cupcakes". In other words, take out games where a smaller school with a historically weak basketball program travels to a historically stronger program, which inflates some teams' homecourt advantage.
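As a rough sketch of the second task, one possible filter keeps only games in which both teams made the tournament that season. Everything here is an assumption for illustration: `toy_tourney` is a hypothetical table of (Season, TeamID) tournament participants, and `toy_rs` is made-up data using the same column names as `df_rs`.

```python
import numpy as np
import pandas as pd

# Made-up regular-season games.
toy_rs = pd.DataFrame({
    "Season":  [2017, 2017, 2017, 2018],
    "WTeamID": [1101, 1102, 1103, 1101],
    "LTeamID": [1102, 1104, 1101, 1103],
    "WLoc":    ["H", "H", "A", "H"],
})

# Hypothetical list of tournament participants, one row per (Season, TeamID).
toy_tourney = pd.DataFrame({
    "Season": [2017, 2017, 2017, 2018],
    "TeamID": [1101, 1102, 1103, 1101],
})

# Set of (season, team) pairs for fast membership checks.
tourney_keys = set(zip(toy_tourney["Season"], toy_tourney["TeamID"]))

# A game is kept only if both the winner and the loser were tournament teams.
winner_in = np.array([k in tourney_keys for k in zip(toy_rs["Season"], toy_rs["WTeamID"])])
loser_in = np.array([k in tourney_keys for k in zip(toy_rs["Season"], toy_rs["LTeamID"])])
toy_filtered = toy_rs[winner_in & loser_in]
print(toy_filtered)
```

Requiring both teams to qualify is a strict choice and will discard most games; other definitions of a "strong program" may suit the analysis better.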