# Moneyball
### Using Statistics to help the Oakland A's become competitive 

In 2002, the Oakland Athletic's made prominent use of statistical analysis to obtain undervalued players and front a competitive team with the third lowest salary in the Majors. They reevaluated key player performance metrics, discovering that older statistics such as RBI's and Batting Average where no good indicators of player performance. Instead they focused on metrics like 'On Base Percentage' and 'Slugging Percentage' as indicators of good offense. In this notebook, we will pretend to go back in time and help the Oakland A's find such talent that has been overlooked and undervalued by the rest of the league.

In [2]:
#imports
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

 ## Reading in the data

In [3]:
batting_df = pd.read_csv('Batting.csv')

batting_df.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,G_batting,AB,R,H,...,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_old
0,aardsda01,2004,1,SFN,NL,11,11.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0
1,aardsda01,2006,1,CHN,NL,45,43.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,45.0
2,aardsda01,2007,1,CHA,AL,25,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3,aardsda01,2008,1,BOS,AL,47,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,5.0
4,aardsda01,2009,1,SEA,AL,73,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,


In [132]:
batting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97889 entries, 0 to 97888
Data columns (total 24 columns):
playerID     97889 non-null object
yearID       97889 non-null int64
stint        97889 non-null int64
teamID       97889 non-null object
lgID         97152 non-null object
G            97889 non-null int64
G_batting    96483 non-null float64
AB           91476 non-null float64
R            91476 non-null float64
H            91476 non-null float64
2B           91476 non-null float64
3B           91476 non-null float64
HR           91476 non-null float64
RBI          91052 non-null float64
SB           90176 non-null float64
CS           68022 non-null float64
BB           91476 non-null float64
SO           83638 non-null float64
IBB          54912 non-null float64
HBP          88656 non-null float64
SH           85138 non-null float64
SF           55443 non-null float64
GIDP         65368 non-null float64
G_old        92700 non-null float64
dtypes: float64(18), int64(3), objec

### Feature Engineering

We'll be adding three more statistics to the dataframe:
* Batting Average
* On Base Percentage
* Slugging Percentage


##### Batting Average

Batting Average is defined as:
\begin{equation*}
AVG = \frac{H}{AB}
\end{equation*}

Where \begin{equation*} {H} \end{equation*} is the number of hits and \begin{equation*} {AB} \end{equation*} is the number of 'At Bats'.

##### Here we create a new column on our batting_df dataframe called 'BA'

In [4]:
def get_bavg(x):
    H = x['H']
    AB = x['AB']
    if AB == 0 or H == 0:
        return 0
    else:
        return H/AB

In [5]:
batting_df['BA'] = batting_df.apply(get_bavg, axis=1)

##### On Base Percentage

On Base Percentage is defined as:
\begin{equation*}
OBP = \frac{H+BB+HBP}{AB+BB+HBP+SF}
\end{equation*}

In [6]:
# define a function to calculate the OBP that takes in a dataframe as a parameter
def get_obp(x):
    H = x['H']
    BB = x['BB']
    HBP = x['HBP']
    SF = x['SF']
    AB = x['AB']
    Denominator = (AB+BB+HBP+SF)
#    if Denominator == 0:
#        Denominator = 0.001 #prevent python from approximating to zero and getting div by zero errors
    if AB == 0:
        return 0
    else:
        return (H+BB+HBP)/Denominator

In [7]:
batting_df['OBP'] = batting_df.apply(get_obp, axis=1)

In [8]:
batting_df['OBP']

0        0.000000
1        0.000000
2        0.000000
3        0.000000
4        0.000000
5        0.000000
6             NaN
7        0.322068
8        0.366261
9        0.364885
10       0.377778
11       0.385542
12       0.401154
13       0.352410
14       0.380597
15       0.389805
16       0.390756
17       0.392744
18       0.378717
19       0.356105
20       0.369208
21       0.353550
22       0.395931
23       0.384615
24       0.410122
25       0.389706
26       0.402151
27       0.341207
28       0.332103
29       0.314935
           ...   
97859    0.000000
97860    0.257576
97861    0.000000
97862    0.000000
97863    0.193548
97864    0.000000
97865    0.257576
97866    0.000000
97867    0.000000
97868    0.000000
97869    0.000000
97870    0.390244
97871    0.000000
97872    0.000000
97873    0.369963
97874    0.305085
97875    0.280000
97876    0.302405
97877    0.328571
97878    0.289773
97879    0.317961
97880    0.335938
97881    0.320755
97882    0.000000
97883    0

##### Singles

To calculate Slugging Percentage, we'll need to calculate singles. To do that, we can get the total hits and subtract Home Runs, Doubles, and Triples

In [9]:
def get_singles(x):
    H = x['H']
    doubles = x['2B']
    triples = x['3B']
    HR = x['HR']
    singles = H-(doubles-triples-HR)
    if singles <=0:
        return 0
    else:
        return singles

In [10]:
batting_df['1B'] = batting_df.apply(get_singles, axis=1)

##### Slugging Percentage

Slugging Percentage is defined as
\begin{equation*}
SLG = \frac{(1B)+(2\times2B)+(3\times3B)+(4\times(HR))}{AB}
\end{equation*}

In [11]:
def get_slg(x):
    H = x['H']
    doubles = x['2B']
    triples = x['3B']
    HR = x['HR']
    singles = x['1B']
    AB = x['AB']
    if AB == 0:
        return 0
    else:
        return (singles + 2*doubles + 3*triples + 4*HR)/(AB)

In [12]:
batting_df['SLG'] = batting_df.apply(get_slg,axis=1)

In [13]:
batting_df.head(7)

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,G_batting,AB,R,H,...,IBB,HBP,SH,SF,GIDP,G_old,BA,OBP,1B,SLG
0,aardsda01,2004,1,SFN,NL,11,11.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0
1,aardsda01,2006,1,CHN,NL,45,43.0,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,45.0,0.0,0.0,0.0,0.0
2,aardsda01,2007,1,CHA,AL,25,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
3,aardsda01,2008,1,BOS,AL,47,5.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0
4,aardsda01,2009,1,SEA,AL,73,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0
5,aardsda01,2010,1,SEA,AL,53,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0
6,aardsda01,2012,1,NYA,AL,1,,,,,...,,,,,,,,,,


### Adding in Salary Data

We now have to add salary data to our data frame. This will give us a good idea at who is undervalued i.e cheap prospects

In [143]:
# Read in Salary data

In [14]:
salary_df = pd.read_csv('Salaries.csv')

In [145]:
salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23956 entries, 0 to 23955
Data columns (total 5 columns):
yearID      23956 non-null int64
teamID      23956 non-null object
lgID        23956 non-null object
playerID    23956 non-null object
salary      23956 non-null int64
dtypes: int64(2), object(3)
memory usage: 935.9+ KB


##### Since we only have salary data for 1985 and beyond, we need to limit our batting data to 1985 and beyond

In [15]:
battingtwo_df = batting_df.loc[batting_df['yearID'] >= 1985]

#### Merge both data frames on playerID and yearID

In [16]:
test = pd.merge(battingtwo_df, salary_df, on=['playerID','yearID'])

## Analyzing the key players lost during the off season

The A's lost 3 key players during the 2000 offseason. These were 2000 AL MVP 1st basemen Jason Giambi(giambja01), outfielder Johnny Damon(damonjo01), and infielder Rainer Gustavo Olmedo(saenzol01) *Along with Jason Isringhausen (though we are focusing on offense in this case, Jason was a pitcher) *

In [28]:
keyplayers_df = test[(test.playerID=='giambja01') | (test.playerID=='damonjo01') | (test.playerID=='saenzol01')]

keyplayers2001_df = keyplayers_df[keyplayers_df['yearID'] == 2001]

keyplayers2001_df[['playerID','H','2B','3B','HR','OBP','SLG','BA','AB']]

keyplayers2001sub_df = keyplayers2001_df[['playerID','H','2B','3B','HR','OBP','SLG','BA','AB','salary']]

In [30]:
keyplayers2001sub_df

Unnamed: 0,playerID,H,2B,3B,HR,OBP,SLG,BA,AB,salary
4955,damonjo01,165.0,34.0,4.0,9.0,0.323529,0.403727,0.256211,644.0,7100000
7586,giambja01,178.0,47.0,2.0,38.0,0.4769,0.813462,0.342308,520.0,4103333
19422,saenzol01,67.0,21.0,1.0,9.0,0.291176,0.44918,0.219672,305.0,290000


   Following the film/book's argument, runs wins games and getting on base means scoring runs. Therefore the higher the percentage that a player is on base, the more runs they'll score. Following this metric (On Base Percentage): We can already see that Johnny Damon is decent getting onto base but Olmedo isnt't that great. So perhaps the A's shouldnt' feel so bad that they are losing those two. Giambi however has a decent OBP and a very good Slugging Average, it's HIS offense that we must focus on replacing.

# Finding Replacements for the lost three

We have three constraints for finding our replacement players:
* The total combined salary of the three players cannot exceed 15 million dollars
* The combined number of at bats must be equal to or greater than the three lost players
* The mean OBP has to be equal to or greater than the mean OBP of the lost players

Set up the statistics:

In [20]:
At_Bat_Total_Lost = keyplayers2001sub_df['AB'].sum()

Mean_OBP_Lost = keyplayers2001sub_df['OBP'].mean()

test_2001 = test[test.yearID == 2001]

In [21]:
At_Bat_Total_Lost

1469.0

In [22]:
Mean_OBP_Lost

0.36386867712807924

Since the three combined to a total of 1469 At Bats, we would look for a minimum of players with at least 489 At Bats. We should also look for players with OBP greater than or equal to 0.363 

In [23]:
potentials = test_2001[(test_2001.OBP >= 0.363) & (test_2001.AB >= 489) & (test_2001.teamID_x != 'OAK') & (test_2001.salary < 15000000)]

potentials_sorted = potentials.sort_values(['OBP','SLG','salary'], ascending=[True,True, False])

slim_potentials = potentials_sorted[['playerID','H','OBP','SLG','HR','AB','salary']]

In [24]:
slim_potentials

Unnamed: 0,playerID,H,OBP,SLG,HR,AB,salary
4088,cirilje01,165.0,0.364103,0.55303,17.0,528.0,4850000
9823,higgibo02,150.0,0.367089,0.530499,17.0,541.0,5325000
7734,glaustr01,147.0,0.367232,0.676871,41.0,588.0,1250000
741,aurilri01,206.0,0.368805,0.704403,37.0,636.0,3250000
3686,caseyse01,165.0,0.369048,0.506567,13.0,533.0,3000000
11605,kentje01,181.0,0.369253,0.599671,22.0,607.0,6000000
7923,gonzaju03,173.0,0.369748,0.725564,35.0,532.0,10000000
21122,stewash01,202.0,0.37106,0.521875,12.0,640.0,2183333
2280,boonebr01,206.0,0.372263,0.70626,37.0,623.0,3250000
8195,greensh01,184.0,0.372325,0.768982,49.0,619.0,12166667


In [25]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

series1 = go.Scatter(
    x = slim_potentials.OBP,
    y = slim_potentials.salary,
    mode = 'markers',
    name= 'On Base Percentage'
)

series2 = go.Scatter(
    x = slim_potentials.SLG,
    y = slim_potentials.salary,
    mode = 'markers',
    name= 'Slugging percentage'
)


data = [series1,series2]


fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Salary vs OBP','Salary vs SLG'))

fig.append_trace(series1, 1, 1)
fig.append_trace(series2, 1, 2)

fig['layout'].update(height=600, width=1000, title='Salary & Performance')


py.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



The above scatter plot shows us that there are certainly undervalued players in the market. That is to say, there is very little correlation if any at all between Salary and OBP/SLG percentage. This is good news for us and the A's.

## Potential Replacements


There are a few players that stick out:
* berkmla01 ID:(1756) OBP: .430 SLG: .775 Salary: 305,000
* pujuolal01 ID:(17844) OBP: .402 SLG: .749 Salary: 200,000
* mientdo01 ID:(14643) OBP: .386 SLG: .523 Salary: 215,000

The above players would get the A's the best bang for their buck. However since the three players are so undervalued, the team can choose two of the above and still sign a higher salaried player (but a slightly better performing one) like Todd Helton or or even Walke from CO.

In [26]:
slim_potentials = potentials_sorted[['playerID','H','OBP','SLG','HR','AB','salary']]

In [32]:
replacements = slim_potentials[(slim_potentials.playerID == 'heltoto01')| (slim_potentials.playerID=='berkmla01') | (slim_potentials.playerID == 'walkela01')| (slim_potentials.playerID=='pujolal01')| (slim_potentials.playerID=='mientdo01')].sort_values(['OBP','SLG','salary'], ascending=[True,True, False])

In [27]:
slim_potentials[(slim_potentials.playerID == 'heltoto01')| (slim_potentials.playerID=='berkmla01') | (slim_potentials.playerID == 'walkela01')| (slim_potentials.playerID=='pujolal01')| (slim_potentials.playerID=='mientdo01')].sort_values(['OBP','SLG','salary'], ascending=[True,True, False])

Unnamed: 0,playerID,H,OBP,SLG,HR,AB,salary
14643,mientdo01,166.0,0.386581,0.52302,15.0,543.0,215000
17844,pujolal01,194.0,0.402963,0.749153,37.0,590.0,200000
1756,berkmla01,191.0,0.430233,0.755633,34.0,577.0,305000
9400,heltoto01,197.0,0.431655,0.858603,49.0,587.0,4950000
22972,walkela01,174.0,0.449251,0.826962,38.0,497.0,12166667


These players, espeically the first three would be the best choice for the front office to go after, leaving money to spend elsewhere in improving the team.

In [49]:
series1 = go.Scatter(
    x = keyplayers2001sub_df.OBP,
    y = keyplayers2001sub_df.salary,
    mode = 'markers',
    name= 'Lost Three',
    marker= dict(size= 14,
                    line= dict(width=1),
                    color = '#ffff00'
                   ),
        text= keyplayers2001sub_df['playerID'])


series2 = go.Scatter(
    x = replacements.OBP,
    y = replacements.salary,
    mode = 'markers',
    name= 'Potential Replacements',
    marker= dict(size= 14,
                    line= dict(width=1),
                    color = '#009933'
                   ),
        text= replacements['playerID'])




data = [series1,series2]

layout = dict(title = 'Old vs New'
             )

fig = dict(data=data, layout=layout)

py.iplot(fig)

# Conclusion

What have we learned? I believe the biggest takeaway here is that baseball analytics was very different. The introduction of statistical processes and metrics changed the way performance valuation was made. We see this through how the league undervalued top performing players.

I hope you enjoyed reading through this as much as I did working through it. Thanks for reading!