In [None]:
from IPython.core.display import display, HTML
display(HTML("""<h1>Learning about Baseball</h1>

<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>"""))

Although I chose this dataset for my first script on Kaggle, I have little to no knowledge of the sport. What I know of the game extends only to some terminology, such as bases, hits and home runs. This analysis is therefore more of an exploratory journey to learn a few things about baseball. I doubt that by the end of it I will be able to hold a debate about baseball, but I should be able to hold a basic conversation about the sport's statistics. The analysis ended up a bit longer than I expected, as I kept trying things out; some were removed, while the rest remain.

I will mainly be looking at the fielding and batting datasets.

## Setting up Python

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import linregress, percentileofscore
import matplotlib.pyplot as plt

from bokeh.plotting import *
from bokeh.charts import *
from bokeh.models import  Callback, ColumnDataSource, Rect, Select,CustomJS
from bokeh.models.widgets import Slider
from IPython.display import Math
#from bokeh.io import output_notebook
# Ensure the plots are inline
output_notebook()

## Loading the Dataset

In [None]:
#Location of Lahman's dataset
prefix = '../input/'

master = pd.read_csv(prefix + 'player.csv')
batting = pd.read_csv(prefix + 'batting.csv')
fielding = pd.read_csv(prefix + 'fielding.csv')
salaries = pd.read_csv(prefix + 'salary.csv')

### Cleaning the data

Starting off with the data from the master table, I parse the debut and final-game dates and compute how long each player was active.

In [None]:
master['debut'] = pd.to_datetime(master['debut'])
master['final_game'] = pd.to_datetime(master['final_game'])
master['active'] = master['final_game'] - master['debut']

In [None]:
master = master[['player_id', 'weight', 'height', 'throws', 'debut', 'final_game', 'active']]
master.isnull().sum()

Since a few rows have missing values in several of the variables, I removed them from the dataset.

In [None]:
master = master.dropna()

Besides the presence of *NA* values, there were a few players where the data showed that they ended their career before they began playing.

In [None]:
print(master['active'].min())
print(master[master['active'] < pd.Timedelta('1 days')].count())

As a result, I restricted the dataset to players with at least a week's worth of experience according to the dataset.

In [None]:
master = master[master['active'] >= pd.Timedelta('7 days')]
master['active'].min()
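
As a small illustration of the filter above, here is a minimal sketch in plain Python using the standard `datetime` module instead of pandas. The player records and the names `careers`, `MIN_ACTIVE` and `active_span` are invented for this example and are not part of the dataset.

```python
from datetime import date, timedelta

# Hypothetical (player_id, debut, final_game) records, invented for illustration.
careers = [
    ("playera01", date(1990, 4, 9), date(2001, 9, 30)),  # long career
    ("playerb01", date(1995, 6, 1), date(1995, 6, 1)),   # a single game
    ("playerc01", date(2003, 5, 2), date(2003, 4, 30)),  # final game before debut
]

MIN_ACTIVE = timedelta(days=7)


def active_span(debut, final_game):
    """Time between a player's first and last recorded game."""
    return final_game - debut


# Keep only players whose recorded career spans at least a week.
kept = [pid for pid, debut, final in careers
        if active_span(debut, final) >= MIN_ACTIVE]
print(kept)
```

A negative span, as for the third record, is what "ending a career before it began" looks like in the data, and the same threshold removes it along with single-game careers.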

One issue with the dataset is that salaries are not directly comparable across years, mainly because of inflation. The salary for each year was therefore normalised to 2015 dollars using the Consumer Price Index (CPI) from 1984 to 2015.

In [None]:
cpi = [102.100,105.700,109.900,111.400,116.000,121.200,127.500,134.700,138.300
       ,142.800,146.300,150.500,154.700,159.400,162.000,164.700,169.300,175.600
       ,177.700,182.600,186.300,191.600,199.300,203.437,212.174,211.933,217.488
       ,221.187,227.860,231.641,235.436,234.954]
years = range(1984, 2016, 1)
cpi = dict(zip(years, cpi))
salaries['salary'] = salaries.apply(lambda x: x['salary'] * cpi[2015] / cpi[x['year']], axis=1)
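
To make the adjustment concrete, here is a minimal worked example in plain Python. The three CPI values are copied from the table above; the salary figure and the function name `to_2015_dollars` are made up for illustration.

```python
# Annual CPI values copied from the notebook's table (only the years used here).
CPI = {1985: 105.700, 2000: 169.300, 2015: 234.954}


def to_2015_dollars(salary, year, cpi=CPI):
    """Rescale a nominal salary from `year` into 2015 dollars."""
    return salary * cpi[2015] / cpi[year]


# A hypothetical $500,000 salary paid in 1985, 2000 and 2015.
for year in (1985, 2000, 2015):
    print(year, round(to_2015_dollars(500_000, year)))
```

The further back the year, the larger the scaling factor, so an older salary is worth more in 2015 dollars than its nominal value suggests.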

## Data Exploration

In this part of the project, I will show part of my exploration of the data, starting with some player-specific information and moving on to parts of team play: fielding and batting.

### Player Info
#### *Start and End of Career*

When exactly do players start their career? And at which point in the year do they retire?

In [None]:
dYear = master['debut'].map(lambda x: x.year)
fYear = master[master['final_game'] < pd.to_datetime('2015-01-01')]['final_game'].map(lambda x: x.year)
dp = Histogram(dYear, 'debut', bins=10, plot_width=400, xlabel='Year', ylabel='# of Players', title='# of Players starting their careers')
fp = Histogram(fYear, 'final_game', bins=10, plot_width=400, xlabel='Year', ylabel='# of Players', title='# of Players who retired')
show(hplot(dp, fp))

In [None]:
dMonth = master['debut'].map(lambda x: x.month)
fMonth = master[master['final_game'] < pd.to_datetime('2015-01-01')]['final_game'].map(lambda x: x.month)
dp = Histogram(dMonth, 'debut', bins=12, plot_width=400, xlabel='Month', ylabel='# of Players', title='# of Players starting their careers')
fp = Histogram(fMonth, 'final_game', bins=12, plot_width=400, xlabel='Month', ylabel='# of Players', title='# of Players who retired')
show(hplot(dp, fp))

In [None]:
dDay = master['debut'].map(lambda x: x.day)
fDay = master[master['final_game'] < pd.to_datetime('2015-01-01')]['final_game'].map(lambda x: x.day)
dp = Histogram(dDay, 'debut', bins=31, plot_width=400, xlabel='Day', ylabel='# of Players', title='# of Players starting their careers')
fp = Histogram(fDay, 'final_game', bins=31, plot_width=400, xlabel='Day', ylabel='# of Players', title='# of Players who retired')
show(hplot(dp, fp))

#### *Experience*
How long exactly do players play baseball for?

In [None]:
master['active_years'] = (master['active'] / np.timedelta64(365, 'D')).astype(float)
sub = master[master['final_game'] < pd.to_datetime('01-01-2015')]
p = Histogram(sub['active_years'], title='Distribution Showing the Active Years of Retired players',
              xlabel='Active Years', ylabel='# of Players')
show(p)

Plotted against the number of active years, the number of players appears to decay exponentially. A log-level linear regression confirms this, with an R² of 0.9265.

In [None]:
# Count retired players by whole years of activity
y = sub.groupby(sub['active_years'].astype(int)).size()
ly = np.log(y)
x = range(len(y))
slope, intercept, rvalue, _, _ = linregress(x, ly)
print("R^2", rvalue ** 2)
Math('y = e^{' + '{:.1f}'.format(slope) + 'x + ' + '{:.1f}'.format(intercept) + '}')
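
To show what a log-level fit does, here is a minimal sketch that regresses log(y) on x by hand with ordinary least squares, rather than calling `linregress`. The data is synthetic and the function name `fit_log_level` is an assumption made for this example; it is not the notebook's own code.

```python
import math


def fit_log_level(xs, ys):
    """Ordinary least squares of log(y) on x; returns (slope, intercept, r_squared)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    syy = sum((l - mean_l) ** 2 for l in logs)
    slope = sxy / sxx
    intercept = mean_l - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic counts that decay exactly exponentially: y = 1000 * exp(-0.3 * x).
xs = list(range(10))
ys = [1000 * math.exp(-0.3 * x) for x in xs]
slope, intercept, r2 = fit_log_level(xs, ys)
print(slope, intercept, r2)
```

Because log(y) = intercept + slope * x, the fitted line corresponds to y = e^(slope * x + intercept), which is the form printed by the cell above.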

#### *BMI*

What about health? Usually, anyone with a Body Mass Index (BMI) score below 18.5 is considered to be underweight, while a BMI above 25 is deemed to be overweight, and a score of 30 or higher is obese. So how do the players stack up?
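
As a quick illustration of those thresholds, here is a minimal sketch of the BMI calculation for weights in pounds and heights in inches, where the factor 703 converts the result to the usual kg/m² scale. The function names and the example player are hypothetical.

```python
def bmi_imperial(weight_lb, height_in):
    """BMI from pounds and inches; 703 converts lb/in^2 to kg/m^2."""
    return 703 * weight_lb / height_in ** 2


def bmi_category(bmi):
    """Category using the cut-offs quoted in the text (18.5, 25 and 30)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


# A hypothetical player who is 72 inches tall and weighs 200 pounds.
value = bmi_imperial(200, 72)
print(round(value, 1), bmi_category(value))
```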

In [None]:
wp = Histogram(master, 'weight', bins=30, xlabel='Weight (in pounds)', ylabel='# of Players', plot_width=400, title='Weight')
hp = Histogram(master, 'height', bins=20, xlabel='Height (in inches)', ylabel='# of Players', plot_width=400, title='Height')
show(hplot(wp, hp))

In [None]:
master['BMI'] = (master['weight'] / np.power(master['height'], 2)) * 703
p = Histogram(master, 'BMI', bins=30, ylabel='# of Players', title='BMI')
show(p)

In [None]:
# Roughly the percentage of players whose BMI lies above each threshold
r = [100 - percentileofscore(master['BMI'], i) for i in [18.5, 26, 30, 40]]
r

Based on the data, only 0.02% are underweight, 26.49% are overweight and 2.16% are obese.
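
Subtracting a percentile rank from 100 gives the share of players above a threshold, and band shares follow by differencing. Here is a minimal sketch of that arithmetic in plain Python; the BMI values and the helper `share_above` are invented for illustration, and the 25 cut-off is the one quoted in the text.

```python
def share_above(values, threshold):
    """Percentage of values strictly greater than `threshold`."""
    return 100 * sum(v > threshold for v in values) / len(values)


# Hypothetical BMI values, invented for illustration.
bmis = [17.9, 21.0, 23.5, 24.8, 26.1, 27.3, 28.9, 31.2, 22.2, 25.5]

above_25 = share_above(bmis, 25)
above_30 = share_above(bmis, 30)

underweight = 100 - share_above(bmis, 18.5)  # at or below 18.5
overweight = above_25 - above_30             # above 25, up to 30
obese = above_30                             # above 30
print(underweight, overweight, obese)
```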

### Fielding
#### *Distribution of Positions*

How are the players distributed in terms of positions when fielding? Also, has this distribution changed over time, or have the team dynamics remained the same?

In [None]:
#For each year, group and count the players by the position
years = fielding['year'].unique()
df = pd.DataFrame({str(years[0]):fielding[fielding['year'] == years[0]].groupby('pos').size()}).reset_index()
for i in years[1:]:
    p =  pd.DataFrame({str(i):fielding[fielding['year'] == i].groupby('pos').size()}).reset_index()
    df = pd.merge(df, p, on='pos', how='outer')
df.fillna(0, inplace=True)
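
The loop above builds a table with one row per position and one column per year, filling missing combinations with 0. The same counting logic can be sketched in plain Python with `collections.Counter`; the rows below are invented for illustration and stand in for the `year` and `pos` columns of the fielding data.

```python
from collections import Counter

# Hypothetical fielding rows as (year, pos) pairs, invented for illustration.
rows = [
    (1871, "P"), (1871, "C"), (1871, "C"),
    (2015, "P"), (2015, "P"), (2015, "P"), (2015, "C"), (2015, "SS"),
]

counts = Counter(rows)  # (year, pos) -> number of rows
years = sorted({year for year, _ in rows})
positions = sorted({pos for _, pos in rows})

# One entry per position, one value per year; missing combinations count as 0.
table = {pos: [counts.get((year, pos), 0) for year in years] for pos in positions}
print(years)
print(table)
```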

In [None]:
#Add bottom and height columns to be used by the bar plots.
pos = df['pos'].unique()
df['bottom'] = np.zeros(len(pos))
df['height'] = df[str(np.min(years))]

In [None]:
y_max = np.max(df.max()[1:])

src1 = ColumnDataSource(df)

p = figure(title="# of Players for each position according to year", 
           x_range=sorted(pos.tolist()), y_range=[0, y_max + 50],plot_width=700, 
           plot_height = 500, outline_line_color= None)
p.quad(top='height', bottom='bottom', left='pos', right='pos', source=src1, line_width=30)
p.xaxis.axis_label = 'Position'
p.yaxis.axis_label = '# of Players'

#Update the 'height' column to show the values for the selected year
callback = CustomJS(args = {'source':src1}, code="""
    var f = cb_obj.get('value');
    var data = source.get('data');
    data['height'] = data[f];
    source.trigger('change');
    """)
slider = Slider(start=np.min(years), end=np.max(years), value=np.min(years), step=1, 
                title="Year", callback=callback)
show(vplot(slider, p))

Generally, the number of players increases with each passing year. One interesting observation is that pitcher, the position with the fewest players in 1871, grew to become the position with the most players.

In [None]:
#Process and display the average number of players per position each team had
pos = pd.DataFrame({'count':fielding.groupby(['year','pos', 'team_id']).size()}).reset_index()
pos = pos.groupby(['year', 'pos']).mean().reset_index()
p1 = Bar(pos[pos['year'] == 1871], values='count', label='pos', agg='mean', plot_width=450, 
         title='1871', xlabel='Position', ylabel='Avg # of Players per team')
p2 = Bar(pos[pos['year'] == 2015], values='count', label='pos', agg='mean', plot_width=450, 
         title='2015', xlabel='Position', ylabel='Avg # of Players per team')
show(hplot(p1, p2))

#### *Salary*

Given the variety of positions available, is there a noticeable difference in the salaries of the players by position?

In [None]:
field_salary = pd.merge(fielding, salaries, on=['year', 'player_id', 'team_id', 'league_id'])[['year', 'player_id', 'pos', 'salary']].dropna()
years = field_salary['year'].unique()
groups = field_salary[field_salary['year'] == years[0]][['pos','salary']].groupby('pos')
df = groups.median().reset_index()
df.columns = ['pos', str(years[0])]
p1 = groups.quantile(.25).reset_index()
p1.columns = ['pos', str(years[0]) + 'l']
p2 = groups.quantile(.75).reset_index()
p2.columns = ['pos', str(years[0]) + 'u']
df = pd.merge(df, pd.merge(p1, p2, on='pos'), on='pos')
for i in years[1:]:
    groups = field_salary[field_salary['year'] == i][['pos', 'salary']].groupby('pos')
    p = groups.median()['salary'].reset_index()
    p.columns = ['pos', str(i)]
    p1 = groups.quantile(.25).reset_index()
    p1.columns = ['pos', str(i) + 'l']
    p2 = groups.quantile(.75).reset_index()
    p2.columns = ['pos', str(i) + 'u']
    df = pd.merge(df, pd.merge(p, pd.merge(p1, p2, on='pos'), on='pos'), on='pos', how='outer')
df.fillna(0, inplace=True)

In [None]:
pos = df['pos'].unique()
df['bottom'] = np.zeros(len(pos))
df['height'] = df[str(np.min(years))]
df['1quant'] = df[str(np.min(years)) + 'l']
df['3quant'] = df[str(np.min(years)) + 'u']

In [None]:
y_max = np.max(df.max()[1:])

src1 = ColumnDataSource(df)

p1 = figure(title="Median Salary by Position", 
            x_range=pos.tolist(), y_range=[0, y_max + 50],plot_width=700, plot_height = 500,
            outline_line_color= None)
p1.quad(top='height', bottom='bottom', left='pos', right='pos', source=src1, line_width=30)
p1.quad(top='3quant', bottom='1quant', left='pos', right='pos', source=src1, line_width=3, color='red')

callback = CustomJS(args = {'source':src1}, code="""
    var f = cb_obj.get('value');
    var data = source.get('data');
    data['height'] = data[f];
    data['1quant'] = data[f+'l'];
    data['3quant'] = data[f+'u'];
    source.trigger('change');
    """)
slider = Slider(start=np.min(years), end=np.max(years), value=np.min(years), step=1, title="Year", callback=callback)
show(vplot(slider, p1))

The above plot shows the median salary (blue bar), as well as the 1st and 3rd quartiles (red bar) for each of the individual positions.
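
For reference, here is a minimal sketch of the three summary values per position, using the standard `statistics` module on made-up numbers. The dictionary `salaries_by_pos` is hypothetical. The `method="inclusive"` option interpolates linearly between data points, which, as far as I know, corresponds to the default of the pandas `quantile` call used above.

```python
from statistics import median, quantiles

# Hypothetical inflation-adjusted salaries by position, invented for illustration.
salaries_by_pos = {
    "P": [1.0, 2.0, 3.0, 4.0, 5.0],
    "1B": [2.0, 4.0, 6.0, 8.0, 10.0],
}

summary = {}
for pos, values in salaries_by_pos.items():
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    summary[pos] = {"q1": q1, "median": median(values), "q3": q3}

print(summary)
```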

### Batting

Before analysing the batting data, I restricted the dataset to batting records with more than 100 at-bats.

In [None]:
# Keep only batting rows with more than 100 at-bats
players_batting = pd.merge(master, batting[batting['ab'] > 100], on='player_id')
# Singles are the hits that were not doubles, triples or home runs
players_batting['single'] = players_batting['h'] - players_batting['hr'] - players_batting['double'] - players_batting['triple']
# Rate of each hit type per at-bat
players_batting['A1B'] = players_batting['single'] / players_batting['ab']
players_batting['A2B'] = players_batting['double'] / players_batting['ab']
players_batting['A3B'] = players_batting['triple'] / players_batting['ab']
players_batting['AHR'] = players_batting['hr'] / players_batting['ab']
# Share of each hit type among all hits
players_batting['N1B'] = players_batting['single'] / players_batting['h']
players_batting['N2B'] = players_batting['double'] / players_batting['h']
players_batting['N3B'] = players_batting['triple'] / players_batting['h']
players_batting['NHR'] = players_batting['hr'] / players_batting['h']
# Overall hits per at-bat
players_batting['AH'] = players_batting['h'] / players_batting['ab']
# Steal attempts, and the share that failed (ACS) or succeeded (ASB)
players_batting['sa'] = players_batting['cs'] + players_batting['sb']
players_batting['ACS'] = players_batting['cs'] / players_batting['sa']
players_batting['ASB'] = players_batting['sb'] / players_batting['sa']
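
To see how the two families of columns differ, here is a minimal worked example for a single made-up season line. The function `batting_rates` and the numbers are hypothetical; the column names mirror those computed above.

```python
def batting_rates(ab, h, double, triple, hr):
    """Per-at-bat rates (A*) and shares of hits (N*) for one season line."""
    single = h - double - triple - hr
    per_ab = {"A1B": single / ab, "A2B": double / ab,
              "A3B": triple / ab, "AHR": hr / ab, "AH": h / ab}
    per_hit = {"N1B": single / h, "N2B": double / h,
               "N3B": triple / h, "NHR": hr / h}
    return single, per_ab, per_hit


# A hypothetical season: 500 at-bats, 150 hits (30 doubles, 5 triples, 20 home runs).
single, per_ab, per_hit = batting_rates(ab=500, h=150, double=30, triple=5, hr=20)
print(single, per_ab, per_hit)
```

The per-at-bat rates add up to the overall hit rate `AH`, while the per-hit shares add up to 1, which is why the second set suits a stacked plot.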

In [None]:
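# Group the data into 5 lb weight bins and compute the mean of each bin
width = 5
# Floor division keeps the bin edges on multiples of the bin width
low = int(players_batting['weight'].min()) // width * width
high = int(players_batting['weight'].max() + width) // width * width
bins = np.arange(low, high, width)
df = players_batting.groupby(np.digitize(players_batting['weight'], bins)).mean()
# Label each group with the left edge of its bin (digitize indices start at 1)
df['weight'] = bins[df.index.values - 1]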
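
The binning step above can be illustrated without NumPy. Here is a minimal sketch that assigns each weight to a fixed-width bin by floor division and averages a statistic per bin; the player data and the helper `bin_left_edge` are invented for this example.

```python
from collections import defaultdict


def bin_left_edge(weight, width=5):
    """Left edge of the fixed-width bin that contains `weight`."""
    return int(weight // width) * width


# Hypothetical (weight in lb, hits per at-bat) pairs, invented for illustration.
players = [(181, 0.250), (184, 0.270), (185, 0.300), (189, 0.280), (190, 0.260)]

grouped = defaultdict(list)
for weight, avg in players:
    grouped[bin_left_edge(weight)].append(avg)

# Mean hits per at-bat in each 5 lb bin, keyed by the bin's left edge.
bin_means = {edge: sum(vals) / len(vals) for edge, vals in sorted(grouped.items())}
print(bin_means)
```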

In [None]:
p = figure(plot_width=700, plot_height=700, title='Average hits by weight')
p.line(df['weight'], df['AH'])
show(p)

In [None]:
#p = Bar(df, label='weight', values='SA')
p1 = Area(df.dropna(), x='weight', y=['A1B', 'A2B', 'AHR', 'A3B'], legend='top_right',
          title='Average hits to bases by weight', xlabel='Weight (in lbs)', ylabel='Average hits',
          stack=False)
p2 = Area(df.dropna(), x='weight', y=['N1B', 'N2B', 'N3B', 'NHR'], legend='bottom_left',
          title='Proportion of hits to bases by weight', xlabel='Weight (in lbs)', stack=True)
#p = Bar(players_batting, label='weight', values='ACS', agg='mean')
show(vplot(p1, p2))

Based on the above plots, we can see that heavier players are more likely to hit doubles and home runs.

Also, does weight affect their chances of stealing bases?

In the following plots, 'sb' and 'cs' are the number of bases the player stole and the number of times the player was caught stealing, respectively. 'ASB' and 'ACS' are the same values normalised by the number of steal attempts.
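
As a small worked example of that normalisation, here is a minimal sketch with made-up numbers. The function `steal_rates` is hypothetical, and returning `None` for a player with no attempts is a choice made for this example (the pandas version yields NaN in that case).

```python
def steal_rates(sb, cs):
    """Return (ASB, ACS): shares of steal attempts that succeeded or failed."""
    attempts = sb + cs
    if attempts == 0:
        return None, None  # no attempts, so neither rate is defined
    return sb / attempts, cs / attempts


# A hypothetical player with 30 stolen bases and 10 times caught stealing.
asb, acs = steal_rates(sb=30, cs=10)
print(asb, acs)
```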

In [None]:
#p = Bar(df, label='weight', values='SA')
p1 = Area(df.dropna(), x='weight', y=['cs','sb'][::-1], legend='top_right', stack=True,
         xlabel = 'Weight (in lbs)', ylabel='Average')
p2 = Area(df.dropna(), x='weight', y=['ASB', 'ACS'], legend='top_right', stack=True,
         xlabel = 'Weight (in lbs)', ylabel='Average')
#p = Bar(players_batting, label='weight', values='ACS', agg='mean')
show(vplot(p1, p2))

## Conclusion

So to sum up my exploration in a few points:

- Pitcher has become the position with the most players.
- Although pitchers are in high demand, pitcher is not the highest-paying fielding position.
- Heavier batters are more likely to hit doubles and home runs than lighter batters.
- Heavier batters are less likely to attempt stealing a base than lighter batters.

## References

A few places I got some inspiration from:

- https://baseballwithr.wordpress.com/
- https://www.kaggle.com/jagelves/d/kaggle/the-history-of-baseball/does-size-matter-in-batting-19