I don't know how, but at some point in my life I got it in my head that baseball does not have a significant advantage for the home team.  I had heard there were sports where it plays a huge part (the NFL for example) but it was all hearsay.  Baseball for me just always felt immune to it.  I guess when even the best teams have winning percentages barely over .500 through 162 games, there is an air of randomness to any given baseball game.  

The best teams are the ones that can persist through a long, arduous 162 game season.  I mean, that's the logic for a 7 game series, right? Increase the sample size.  So, I was talking with my recently-converted-from-cricket-to-baseball-fanatic friend who told me he looked at all the games from 2000 to present day.  The home team won 54% of the time!  I mean, that's a pretty large sample size and he gave me a very convincing p-value.  But, c'mon, there's probably got to be some other reason for it, right?  I mean his sample was biased, for sure!  He only looked at seasons in recent history.  Technology in recent years has probably allowed for innumerable methods for the home team to cheat.  Tiny cameras in the batter's eye relaying signs to the dugout. Managers communicating these signs to the headpiece installed in the hitter's helmet.  Visiting team bat boys acting as spies, collecting intelligence and rearranging sunflower seed stashes as a type of psychological warfare.  I mean one time I saw the Phillie Phanatic win the affection of the 3rd base umpire so much so that they broke out into a dance number in the middle of the 6th!  Iggy the Iguana once swallowed an umpire whole! (__Figure 1.__)  

<center><img src=images/iggy.jpg></center>
<center><b>Figure 1.</b> Home field Advantage.</center>

My point is that the significance he saw is due to recent changes in the game. Surely, I would still be right. It's the nature of baseball; advantages like home field don't exist.  Given a large enough sample size, the winner is the better team.  I decided to look at the home team advantage since the beginning of baseball.  I downloaded the information from [retrosheet.org](http://www.retrosheet.org/) and stored it in a folder called `seasons`.  Each file is named by season with the prefix `GL` for Game Log.  I used the Python module [`glob`](https://docs.python.org/2/library/glob.html) to aggregate them all.  

In [1]:
import pandas as pd
import glob

In [2]:
# glob returns files in arbitrary order; sort so the seasons are chronological
seasons_names = sorted(glob.glob('seasons/GL*.TXT'))

In [3]:
import re
# pull the season year out of each file name (raw string avoids an invalid escape warning)
years = [int(re.findall(r'\d+', filename)[0]) for filename in seasons_names]

In [4]:
seasons = [pd.read_csv(season, index_col=0, header=None) for season in seasons_names]

Each game log file is a comma-separated file with each column representing a particular piece of game information (home team, stadium, etc.), all of which can be found [here](http://www.retrosheet.org/gamelogs/glfields.txt).  In this analysis, I simply walked through each season and counted the number of times the home team won and the number of times the visiting team won.  Apparently, there have been a number of ties throughout the years.  I ignored these results.  

In [5]:
# column 9 is visiting team score, 10 is home
wins_by_year = []
for season in seasons:
    home_team_wins = 0
    visiting_team_wins = 0
    for game, info in season.iterrows():
        # ignore ties
        if info[9] != info[10]:
            home_team_wins += int(info[9] < info[10])
            visiting_team_wins += int(info[9] > info[10])
    wins_by_year.append((home_team_wins, visiting_team_wins))
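
As a quick sanity check on the counting logic, here is a minimal standard-library sketch that applies the same rule to a tiny in-memory log. The rows, team codes, and the `count_wins` helper are made up for illustration (real game logs have many more fields); the only thing carried over from the analysis is the assumption that position 9 is the visiting score and position 10 is the home score.

```python
import csv
import io

# Made-up rows for illustration only (hypothetical teams and dates).
# Position 9 is the visiting team's score, position 10 is the home team's.
TOY_LOG = """\
"20990401","0","Thu","AAA","XL",1,"BBB","XL",1,3,4
"20990402","0","Fri","AAA","XL",2,"BBB","XL",2,1,4
"20990403","0","Sat","CCC","XL",1,"DDD","XL",1,2,2
"20990404","0","Sun","CCC","XL",2,"DDD","XL",2,5,3
"""


def count_wins(lines, visitor_col=9, home_col=10):
    """Return (home_wins, visiting_wins) for one season, skipping ties."""
    home_wins = 0
    visiting_wins = 0
    for row in csv.reader(lines):
        visitor_score = int(row[visitor_col])
        home_score = int(row[home_col])
        if home_score > visitor_score:
            home_wins += 1
        elif visitor_score > home_score:
            visiting_wins += 1
        # equal scores are ties and count for neither side
    return home_wins, visiting_wins


toy_counts = count_wins(io.StringIO(TOY_LOG))
print(toy_counts)
```

The toy log has two home wins, one tie, and one visiting win, so the tie is the only row that does not show up in the counts.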

To get an idea of some sort of trend, I calculated an average of the previous five years and the next five for each season where applicable.  

In [6]:
def calc_average(years) -> float:
    """ calculate the average home winning percentage from a list of (home, away) win counts """
    winning_percent = [year[0]/sum(year) for year in years]
    return sum(winning_percent) / len(winning_percent)

In [7]:
ten_yr_avg = []
for i in range(5, len(wins_by_year)-5):
    # previous five seasons plus the next five, centered on season i
    ten_yr_avg.append(calc_average(wins_by_year[i-5:i] + wins_by_year[i+1:i+6]))
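
To make the windowing concrete, here is a small sketch of one way to read "the previous five years and the next five": for each season, average the five values before it and the five after it, leaving the season itself out. The `centered_average` helper and its `half_width` parameter are my own names for illustration, and the input is just a list of numbers rather than real winning percentages.

```python
def centered_average(values, half_width=5):
    """Average the `half_width` values on each side of every position.

    Positions without a full window on both sides are skipped, so the
    result is shorter than the input by 2 * half_width entries.
    """
    averages = []
    for i in range(half_width, len(values) - half_width):
        # neighbors before position i plus neighbors after it (i itself excluded)
        window = values[i - half_width:i] + values[i + 1:i + half_width + 1]
        averages.append(sum(window) / len(window))
    return averages


# Twelve dummy values leave room for exactly two centered windows.
demo = centered_average(list(range(12)))
print(demo)
```

Since the first and last five seasons have no full window, the averaged series lines up with `years[5:-5]` when plotted.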

As always, I used [bokeh](http://bokeh.pydata.org/en/latest/) to plot the data.

In [8]:
from bokeh.plotting import figure, show, output_notebook

output_notebook()

For every season, I used [SciPy's](https://www.scipy.org/) [binomial test](https://en.wikipedia.org/wiki/Binomial_test) to test whether the difference in home vs. away wins was statistically significant for that particular year.  A red dot means the value was significant (p < 0.05).
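
To show what that test is doing under the hood, here is a standard-library sketch of an exact two-sided binomial test against a fair coin (p = 0.5). It is not SciPy's implementation, just the textbook definition: add up the probability of every outcome that is no more likely than the one observed. The function name `two_sided_binomial_p` is my own, and I would expect it to agree closely with SciPy's result for the same counts. (Newer SciPy releases expose the test as `binomtest` rather than `binom_test`.)

```python
from math import comb


def two_sided_binomial_p(home_wins, visiting_wins):
    """Exact two-sided p-value for H0: home and visiting wins are equally likely."""
    n = home_wins + visiting_wins
    # With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so
    # comparing probabilities is the same as comparing binomial coefficients.
    observed = comb(n, home_wins)
    as_or_more_extreme = sum(
        comb(n, k) for k in range(n + 1) if comb(n, k) <= observed
    )
    return as_or_more_extreme / 2 ** n


# Small hand-checkable case: 8 home wins against 2 visiting wins.
p_small = two_sided_binomial_p(8, 2)
print(p_small)
```

Working in integers until the final division keeps the comparison exact, which matters with a couple of thousand games in a season.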

In [9]:
from scipy.stats import binom_test

In [10]:
f = figure(y_range=(0, 1), 
           x_range=(min(years), max(years)),
           plot_width=800,
           plot_height=800
          )


f.circle(y=[year[0]/sum(year) for year in wins_by_year],
         x=years,
         color = ['red' if binom_test(wins) < 0.05 else 'blue' for wins in wins_by_year]
        )
f.line(y=[0.5, 0.5], 
       x=[min(years), max(years)], 
       line_color='black',
       line_dash='dashed'
      )

f.patch(y=[0.5, 1, 1, 0.5],
        x=[min(years), min(years), max(years), max(years)],
        fill_color='blue',
        line_color='blue',
        fill_alpha=0.2,
        line_alpha=0.2,
        legend='home team advantage'
        
       )

f.patch(y=[0, 0.5, 0.5, 0],
        x=[min(years), min(years), max(years), max(years)],
        fill_color='red',
        line_color='red',
        fill_alpha=0.2,
        line_alpha=0.2,
        legend='visiting team advantage'
       )

f.line(y=ten_yr_avg, 
       x=years[5:-5], 
       line_color='green',
       line_dash='dashed',
       line_width=2,
       legend='ten year average'
      )


f.yaxis.axis_label = 'Home Team Winning %'
f.xaxis.axis_label = 'Season'
f.xaxis.major_label_orientation = 3.14/4
f.xgrid.grid_line_color = None

from bokeh.models import FixedTicker

f.xaxis[0].ticker=FixedTicker(ticks=[x for x in years if x%10==0])

show(f)

<center><img src=images/plot.png></center>

Needless to say, I was quite shocked.  In all of baseball...from 1871-2016...the visiting team never had a significant advantage.  In fact, it was the exact opposite of what I thought: the home team advantage was higher in the early years and has leveled out since.  So, my friend and I were discussing the reasons for the higher early advantage.  It seems to line up with the dead ball era, for one.  Another hypothesis could be the interesting dimensions of parks back then.  Take the Polo Grounds for example (__Figure 2.__).

<center><img src=images/PoloGrounds.jpg></center>
<center><b>Figure 2.</b> Polo Grounds dimensions. (src=http://www.andrewclem.com/Baseball/PoloGrounds.html)</center>

The home team playing 81 games on that monstrosity must be an advantage, right?  I dunno, it would be interesting to look into that a little more.  Also, these game logs contain a ton of information so it would be nice to look into some other relationships as well.  