## Introduction

In many sports, youth participation is organized into annual age groups using specific cutoff dates. Although the intention is to provide fair play among youngsters of similar age, in some sports those born early in the selection period have a higher chance of reaching the professional level. This phenomenon is known as the [Relative Age Effect](https://en.wikipedia.org/wiki/Relative_age_effect) (RAE).

The first major study of this effect was published in the Journal of the Canadian Association for Health, Physical Education, and Recreation in 1985 by Barnsley et al. This study determined that NHL players of the early 1980s were more than four times as likely to be born in the first three months of the calendar year as in the last three months.

Malcolm Gladwell explains this effect in his book [Outliers: The Story of Success](http://gladwell.com/outliers/):

>The explanation for this is quite simple. It has nothing to do with astrology, nor is there anything magical about the first three months of the year. It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year — and at that age, in preadolescence, a twelvemonth gap in age represents an enormous difference in physical maturity.

>This being Canada, the most hockey-crazed country on earth, coaches start to select players for the traveling “rep” squad — the all-star teams — at the age of nine or ten, and of course they are more likely to view as talented the bigger and more coordinated players, who have had the benefit of critical extra months of maturity.

>And what happens when a player gets chosen for a rep squad? He gets better coaching, and his teammates are better, and he plays fifty or seventy-five games a season instead of twenty games a season like those left behind in the “house” league, and he practices twice as much as, or even three times more than, he would have otherwise. In the beginning, his advantage isn’t so much that he is inherently better but only that he is a little older. But by the age of thirteen or fourteen, with the benefit of better coaching and all that extra practice under his belt, he really is better, so he’s the one more likely to make it to the Major Junior A league, and from there into the big leagues.

>Barnsley argues that these kinds of skewed age distributions exist whenever three things happen: selection, streaming, and differentiated experience. If you make a decision about who is good and who is not good at an early age; if you separate the “talented” from the “untalented”; and if you provide the “talented” with a superior experience, then you’re going to end up giving a huge advantage to that small group of people born closest to the cutoff date.

In 1991, Thompson et al. observed a similar effect in American baseball. For many years, July 31 was the cutoff date used by virtually all nonschool baseball leagues in the United States. This gave players born in August an unfair advantage over players born in July.

The goal of this analysis is to check whether this effect is still visible in the available data.

## Questions

* How does the month of birth correlate with the participation rate in the professional leagues?
* Does the month of birth correlate with professional player performance, i.e., does the unfair advantage of those born earlier in the baseball year remain relevant after they turn pro?

## Data Wrangling

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%pylab inline

In [None]:
# Create auxiliary series for month names
monthNames = pd.Series(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    index=range(1, 13))

relativeMonthNames = pd.Series(
     ['Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul'])

# Auxiliary functions
def normalizeValues(s):
    '''returns the proportion of each value in relation to the sum of all values.'''
    return s / s.sum()
    

### General population data
The general population natality data comes from the [Centers for Disease Control and Prevention](http://www.nber.org/data/vital-statistics-natality-data.html) (CDC). The total number of births per month covers 1994 to 2002 and is used as a proxy for the birth rate per month of the general population.

In [None]:
d = {
    'Year' : [1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,1995,1995,1995,1995,1995,1995,1995,1995,1995,1995,1995,1995,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,1998,1999,1999,1999,1999,1999,1999,1999,1999,1999,1999,1999,1999,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,2001,2001,2001,2001,2001,2001,2001,2001,2001,2001,2001,2001,2002,2002,2002,2002,2002,2002,2002,2002,2002,2002,2002,2002],
    'Month' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'Births' : [320705, 301327, 339736, 317392, 330295, 329737, 345862, 352173, 339223, 330172, 319397, 326748, 316013, 295094, 328503, 309119, 334543, 329805, 340873, 350737, 339103, 330012, 310817, 314970, 314283, 301763, 322581, 312595, 325708, 318525, 345162, 346317, 336348, 336346, 309397, 322469, 317211, 291541, 321212, 314230, 330331, 321867, 346506, 339122, 333600, 328657, 307282, 329335, 319340, 298711, 329436, 319758, 330519, 327091, 348651, 344736, 343384, 332790, 313241, 333896, 319182, 297568, 332939, 316889, 328526, 332201, 349812, 351371, 349409, 332980, 315289, 333251, 330108, 317377, 340553, 317180, 341207, 341206, 348975, 360080, 347609, 343921, 333811, 336787, 335198, 303534, 338684, 323613, 344017, 331085, 351047, 361802, 342564, 344074, 323746, 326569, 330674, 303977, 331505, 324432, 339007, 327588, 357669, 359417, 348814, 345814, 318573, 334256]}

generalBirths = pd.DataFrame(d)
generalBirths.head()

The chart below shows the mean number of births in the general population for each month of the year. As can be seen, there is a slight increase in births over the summer (July to September) and a decrease in February.

In [None]:
generalBirthsByMonth = generalBirths.groupby('Month')['Births'].mean() / 1000

# plot a horizontal line at the average of the monthly means
meanBirths = generalBirthsByMonth.mean()
plt.axhline(meanBirths, color='red', zorder=1)

# plot the distribution of births per month
generalBirthsByMonth.plot(kind='bar', zorder=2)
# a bar chart of 12 months has 12 tick positions (0 to 11), one per label
plt.xticks(range(12), monthNames, rotation=0)
plt.ylabel('Mean number of births (thousand)')
plt.title('U.S. month of birth distribution of the general population from 1994 to 2002');

### Baseball players data
The data for this analysis comes from the 2015 edition of [The Lahman Baseball Database](http://www.seanlahman.com/baseball-archive/statistics/), which contains complete batting and pitching statistics from 1871 to 2015, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.

The following tables will be used in this analysis:
* Master: the master table contains player names, DOB, and biographical info
* Batting: batting statistics

In [None]:
master = pd.read_csv('../input/player.csv')
print(master.columns)
master.head()

In [None]:
batting = pd.read_csv('../input/batting.csv')

# replace NaN with 0 so the batting stats can be used to calculate performance measures
batting = batting.fillna(value=0)
batting.head()

## Baseball players relative age effect
The chart below shows the month of birth distribution of baseball players in the database.

In [None]:
playerBirthsByMonth = master['birth_month'].value_counts().sort_index()
minYear = master['birth_year'].min()
maxYear = master['birth_year'].max()
total = playerBirthsByMonth.sum()

# plot a line indicating the expected count per month under a uniform distribution
meanBirths = total / 12
plt.axhline(meanBirths, color='red', zorder=1)

# plot the distribution of births per months
playerBirthsByMonth.plot(kind='bar', zorder=2)
# a bar chart of 12 months has 12 tick positions (0 to 11), one per label
plt.xticks(range(12), monthNames, rotation=0)
plt.ylabel('Births')
plt.title('Month of birth distribution of {} American baseball players from {:4.0f} to {:4.0f}'.format(total, minYear, maxYear))

birthsBefore = playerBirthsByMonth.loc[5:7].sum()
birthsAfter = playerBirthsByMonth.loc[8:10].sum()
print('Players born in May, June or July: {}'.format(birthsBefore))
print('Players born in August, September or October: {}'.format(birthsAfter))
print('Increase: {:4.2f}%'.format((birthsAfter - birthsBefore) / birthsBefore * 100))

This chart shows a tendency for professional baseball players to have been born early in the baseball year (which starts in August and ends in July). For instance, 21.86% more players were born in August, September, and October than in May, June, and July.

Although the chart makes this tendency visible, it still needs to be measured. To measure the impact of the RAE on baseball players, the following steps were executed:
1. Calculate the relative month for the baseball players and for the general population. The relative month measures how many months after August (the first month of the baseball year) the month of birth is.
2. Calculate the birth rate for each relative month for the general population and for the baseball players.
3. Calculate the deviations between the baseball players and the general population birth rates.
4. Calculate the correlation coefficient between the deviations of birth rates and the relative months.
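
Steps 2 to 4 can be previewed on a small example before applying them to the real data. The sketch below uses made-up counts per relative month (the arrays `player_counts` and `population_counts` are hypothetical and not taken from either dataset), so only the mechanics carry over, not the numbers.

```python
import numpy as np

# Hypothetical counts per relative month (index 0 = August, 11 = July).
# These numbers are made up for illustration; they are not from the datasets.
player_counts = np.array([120, 115, 112, 108, 104, 100, 98, 96, 94, 92, 90, 88])
population_counts = np.array([350, 342, 336, 318, 327, 321, 301, 332, 316, 334, 329, 348])

# Step 2: turn counts into proportions so both groups are comparable.
player_rates = player_counts / player_counts.sum()
population_rates = population_counts / population_counts.sum()

# Step 3: a positive deviation means the month is over-represented among players.
toy_deviations = player_rates - population_rates

# Step 4: Pearson correlation between the deviations and the relative month.
toy_r = np.corrcoef(toy_deviations, np.arange(12))[0, 1]
print('r = {:.2f}'.format(toy_r))
```

Because both sets of proportions sum to 1, the deviations always sum to zero: an excess in the early months has to be balanced by a deficit elsewhere.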

### 1. Relative month

In [None]:
# The relative month shifts the month of birth so that August maps to 0 and July to 11.

# master dataframe
master = master.assign(relativeMonth = (master.birth_month - 8) % 12)

# general population dataframe
generalBirths = generalBirths.assign(relativeMonth = (generalBirths.Month - 8) % 12)
generalBirths.head(n = 12)
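
As a quick check of the modular arithmetic above, the sketch below maps each calendar month to its relative month. The helper name `relative_month` and the constant `CALENDAR_MONTHS` are introduced here for illustration only and are not part of the notebook.

```python
# Month abbreviations in calendar order (position 0 = January).
CALENDAR_MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def relative_month(birth_month, first_month=8):
    """Months elapsed since the start of the baseball year (0 = August)."""
    return (birth_month - first_month) % 12


# Build the full calendar-month -> relative-month table.
mapping = {CALENDAR_MONTHS[m - 1]: relative_month(m) for m in range(1, 13)}
print(mapping)
```

August maps to 0 and July to 11, which matches the order of `relativeMonthNames` defined earlier.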

### 2. Birth rate per relative month

In [None]:
generalMeanBirthsByRelativeMonth = generalBirths.groupby('relativeMonth').mean()['Births']
generalBirthRatesByRelativeMonth = normalizeValues(generalMeanBirthsByRelativeMonth)

playersBirthRatesByRelativeMonth = master['relativeMonth'].value_counts(normalize=True).sort_index()

The chart below shows that the birth rates of baseball players are higher in the first months of the baseball year and lower in the last months.

In [None]:
generalBirthRatesByRelativeMonth.plot(label='General Population')
playersBirthRatesByRelativeMonth.plot(label='Baseball Players')

plt.legend()
plt.ylabel('Proportion')
plt.title('General population vs baseball players birth rates')
plt.xticks(range(12), relativeMonthNames, rotation=45);

### 3. Deviations between birth rates
The deviations between birth rates are calculated by subtracting the general population birth rate distribution from the baseball players birth rate distribution, so a positive value marks a month that is over-represented among players.

In [None]:
deviations = playersBirthRatesByRelativeMonth - generalBirthRatesByRelativeMonth
deviations.plot(kind='bar')
plt.ylabel('Frequency deviation')
plt.title('Frequency deviation between baseball players birth rate and the general population birth rate')
plt.xticks(range(12), relativeMonthNames, rotation=0);

### 4. Correlation coefficient

In [None]:
# Pearson correlation between the deviations and the relative month (0 to 11)
r = np.corrcoef(deviations, np.arange(12))[0, 1]
print('r = {}'.format(r))
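
With only 12 data points, the coefficient alone does not say how easily such a value could arise by chance. One possible check, not part of the original analysis, is a permutation test: shuffle the deviations many times and count how often the shuffled correlation is at least as strong as the observed one. The function name `permutation_p_value` and the example values are assumptions made for this sketch.

```python
import numpy as np


def permutation_p_value(values, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for the correlation of `values`
    with their position (0, 1, 2, ...)."""
    values = np.asarray(values, dtype=float)
    positions = np.arange(len(values))
    observed = np.corrcoef(values, positions)[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(values)
        if abs(np.corrcoef(shuffled, positions)[0, 1]) >= abs(observed):
            hits += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical deviations, made up for illustration only.
example_deviations = [0.010, 0.008, 0.007, 0.008, 0.003, 0.001,
                      0.004, -0.005, -0.003, -0.009, -0.009, -0.016]
example_r, example_p = permutation_p_value(example_deviations, n_permutations=2000)
print('r = {:.2f}, p ~ {:.4f}'.format(example_r, example_p))
```

In the notebook, calling `permutation_p_value(deviations)` would apply the same check to the series computed in step 3.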

## Month of birth vs player performance
In order to correlate the month of birth with player performance, the following steps were executed:
1. Calculate the [On-base percentage](https://en.wikipedia.org/wiki/On-base_percentage) (OBP), a traditional batting performance measurement.
2. Calculate the mean OBP per relative month
3. Normalize the mean OBP per relative month
4. Plot the results and observe if there is a correlation

In [None]:
# On-base percentage = (H + BB + HBP) / (AB + BB + HBP + SF)
batting = batting.assign(
    OBP=(batting.h + batting.bb + batting.hbp)
        / (batting.ab + batting.bb + batting.hbp + batting.sf))

# select the OBP column before averaging so non-numeric columns are ignored
meanOBPByMonth = batting.merge(master, on='player_id').groupby('relativeMonth')['OBP'].mean()
meanOBPFrequencyByMonth = normalizeValues(meanOBPByMonth)
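
To make the OBP formula concrete, the sketch below evaluates it for a single stat line. The function `on_base_percentage` and the season numbers are hypothetical and only illustrate the arithmetic, including the case where the denominator is zero.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF); None when undefined."""
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        return None  # no counted plate appearances, so OBP is undefined
    return (h + bb + hbp) / denominator


# Hypothetical season line, made up for illustration.
example_obp = on_base_percentage(h=150, bb=60, hbp=5, ab=500, sf=5)
print(round(example_obp, 3))  # (150 + 60 + 5) / (500 + 60 + 5 + 5) = 215 / 570 -> 0.377
```

In the vectorized pandas version, rows whose denominator is zero should evaluate to `NaN` rather than raise an error, and the group mean skips `NaN` values by default, so such rows do not pull the monthly averages toward zero.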

In [None]:
meanOBPFrequencyByMonth.plot(label='mean OBP')
playersBirthRatesByRelativeMonth.plot(label='Player Month of Birth')

plt.legend()
plt.xlabel('Month')
plt.ylabel('Proportion')
plt.title('Frequency distribution of mean OBP vs players months of birth')
plt.xticks(range(12), relativeMonthNames, rotation=45);

As can be seen in the chart above, the mean OBP is roughly the same for every month of birth.

## Conclusions
The results of this analysis support the existence of a marked tendency for professional baseball players to have been born early in the baseball year (which starts in August and ends in July). To measure the impact of the Relative Age Effect, the correlation coefficient between the relative month of birth and the deviation of the players birth rates from the general population birth rates was calculated.

The correlation coefficient obtained was -0.9, indicating a strong negative linear relationship: the further the month of birth is from August, the lower the chance of participating in the professional league. It appears that a significant number of budding baseball players are prevented from reaching their potential because of an accident of birth.

On the other hand, the chart of mean OBP shows no apparent relationship between the month of birth and player performance. The unfair advantage of those born earlier in the baseball year does not seem to remain relevant after they turn pro.

## References
http://www.nber.org/data/vital-statistics-natality-data.html

http://www.slate.com/articles/sports/sports_nut/2008/04/the_boys_of_late_summer.html

https://en.wikipedia.org/wiki/Relative_age_effect

http://gladwell.com/outliers/

https://en.wikipedia.org/wiki/On-base_percentage

http://www.seanlahman.com/baseball-archive/statistics/

Thompson A, Barnsley R, Stebelsky G. ‘Born to play ball’: the relative age effect and major league baseball. Sociol Sport J 1991; 8: 146-51

