# Who's Winning? Offense or Defense

## Introduction
In baseball, a hitter's ultimate goal is to reach base and score runs. This notebook measures that with batting average, which counts base hits (walks also help, but they are outside the focus here). A pitcher's goal is to prevent runs from scoring, measured by earned run average. Both the offensive and defensive sides of the game are rewarded and punished in different ways based on performance during a game. For hitters, the ultimate reward is clubbing a home run over the outfield wall; for pitchers, it is striking a hitter out. This notebook compares these goals and rewards using offensive and defensive statistics for Major League Baseball players over the history of recorded professional baseball.

### Data
The database used is Sean Lahman's History of Baseball database. It holds seasonal statistics for every player who has played in Major League Baseball from 1871 to 2015. In this project, only the regular season batting and pitching tables were used.

### Questions Seeking Answers
1. How have certain baseball statistics changed over time?
2. Have the different eras affected these statistics?
3. If so why and how?
4. Are there any relationships among these statistics?

In [3]:
%matplotlib inline

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine

# Create connection.
engine = create_engine('sqlite:///:memory:')

In [7]:
#Load batting data (pd.DataFrame.from_csv was removed from pandas; read_csv replaces it)
batting = pd.read_csv('../input/batting.csv', encoding = 'utf-8')
batting.head()

In [8]:
#Load pitching data
pitching = pd.read_csv('../input/pitching.csv', encoding = 'utf-8')
pitching.head()

In [9]:
batting.to_sql('batting', engine, index = False)

In [10]:
pd.read_sql_table('batting', engine).head()

In [11]:
pitching.to_sql('pitching', engine, index = False)
pd.read_sql_table('pitching', engine).head()

In [12]:
#Queries to pull data for league wide HR and AVG by year
total_hr = pd.read_sql_query('SELECT year, SUM(hr) AS total_hr FROM batting GROUP BY year', engine)
total_avg = pd.read_sql_query('SELECT year, SUM(ab) AS total_ab, SUM(h) AS total_h FROM batting GROUP BY year', engine)

Batting average is calculated by using the following formula:

<i>Batting Average = Hits / At Bats</i>

In [13]:
#Adding a column to calculate league wide batting average.
ab = total_avg['total_ab']
h = total_avg['total_h']
total_avg['avg'] = (h / ab)
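
The league-wide average above divides total hits by total at bats rather than averaging each player's own batting average. A minimal sketch of why that matters, using made-up rows (the `sample` frame and its numbers are hypothetical and not taken from the database):

```python
import pandas as pd

# Hypothetical player-season rows (made-up numbers, not from the database).
sample = pd.DataFrame({
    'year': [2000, 2000, 2000],
    'ab':   [500, 100, 0],   # the third player never came to bat
    'h':    [150, 20, 0],
})

# League-wide average: total hits divided by total at bats.
totals = sample.groupby('year')[['ab', 'h']].sum()
league_avg = totals['h'] / totals['ab']

# Mean of individual averages: every player counts equally,
# and 0 hits / 0 at bats becomes NaN, which mean() skips by default.
player_avg = sample['h'] / sample['ab']
mean_of_avgs = player_avg.mean()

print(league_avg.loc[2000], mean_of_avgs)
```

The two numbers differ because the summed version weights each player by at bats, so a part-time hitter does not count as much as an everyday player.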

In [14]:
#astype returns a new Series, so the result has to be assigned
year = total_hr['year'].astype('int')
hr = total_hr['total_hr']
year_avg = total_avg['year']
avg = total_avg['avg']

In [15]:
#Query to pull total league strikeouts
df_so = pd.read_sql_query('SELECT year, SUM(so) AS so FROM pitching GROUP BY year', engine)
year_so = df_so['year']
so = df_so['so']

In [16]:
#Query to pull data for ERA
#Divide by 3.0 so SQLite does not truncate partial innings with integer division
df_era = pd.read_sql_query('SELECT year, SUM(er) AS total_er, SUM(ipouts) / 3.0 AS total_ip FROM pitching GROUP BY year', engine)


The league-wide ERA for each season is not stored in the database, so it needed to be calculated with this formula:

<i>Earned Run Average = (Earned Runs / Innings Pitched) x 9</i>

In [17]:
#Calculations for league wide ERA
df_era['yr_era'] = (df_era['total_er'] / df_era['total_ip']) * 9
year_era = df_era['year']
era = df_era['yr_era']
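
As a quick check on the formula, here is a small sketch that works from outs recorded (the `ipouts` column counts outs, three per inning). The function name `earned_run_average` and the example numbers are assumptions for illustration only:

```python
def earned_run_average(earned_runs, ip_outs):
    """Return ERA given earned runs and outs recorded (three outs per inning)."""
    if ip_outs == 0:
        return float('nan')  # no outs recorded, so ERA is undefined
    innings_pitched = ip_outs / 3  # true division keeps partial innings
    return earned_runs / innings_pitched * 9


# Made-up example: 70 earned runs over 600 outs (200 innings).
example_era = earned_run_average(70, 600)
print(round(example_era, 2))
```

Keeping the division in floating point matters because a pitcher who records 601 outs has thrown 200 and one-third innings, not 200.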


### Eras of Baseball
Throughout the history of Major League Baseball, the game has changed and developed. This notebook displays how certain statistics have changed across the different eras of baseball. There are eight different eras of Major League Baseball.
- [<i>19th Century Era (1871-1900)</i>](https://www.baseball-reference.com/bullpen/19th_Century)- The beginning of baseball 
- [<i>Dead Ball Era (1901-1919)</i>](https://www.baseball-reference.com/bullpen/Deadball_Era)- era highly focused on pitching and defense
- [<i>Lively Ball Era (1920-1941)</i>](https://www.baseball-reference.com/bullpen/Lively_ball_era)- era of increased offense
- [<i>Integration Era (1942-1960)</i>](https://www.baseball-reference.com/bullpen/Integration)- integration of MLB
- [<i>Expansion Era (1961-1976)</i>](https://en.wikipedia.org/wiki/1961_Major_League_Baseball_expansion)- addition of two more teams to both the National and American Leagues
- [<i>Free Agency Era (1977-1993)</i>](https://news.illinois.edu/blog/view/6367/198486)- free agency is introduced in MLB
- [<i>Steroid Era (1994-2005)</i>](http://www.espn.com/mlb/topics/_/page/the-steroids-era)- increased offense and widely used PEDs
- <i>Modern Era (2006-Present)</i> - present day MLB
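
The era ranges listed above can also be attached to the data as labels. A minimal sketch using `pd.cut`, where `era_edges` and `era_names` are hypothetical names, and the upper edge of 2015 is an assumption based on the last season in the data:

```python
import pandas as pd

# Era boundaries from the list above; each bin is (left, right],
# so 1900 falls in '19th Century' and 1901 in 'Dead Ball'.
era_edges = [1870, 1900, 1919, 1941, 1960, 1976, 1993, 2005, 2015]
era_names = ['19th Century', 'Dead Ball', 'Lively Ball', 'Integration',
             'Expansion', 'Free Agency', 'Steroid', 'Modern']

years = pd.Series([1871, 1920, 1941, 1942, 1994, 2006, 2015])
eras = pd.cut(years, bins=era_edges, labels=era_names)

print(pd.DataFrame({'year': years, 'era': eras}))
```

With a label column like this, per-era summaries become a single `groupby` instead of hand-written year filters.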


In [19]:
fig = plt.figure(figsize=(8,4), dpi=100)

In [21]:
fig, (ax1, ax2) = plt.subplots(2,1, figsize = (15, 15), sharex = True)

ax1.bar(year, hr, align = 'center', width = .7, alpha = .5, color = 'red')
ax1.set_xlim([1871,2016])
ax1.set_xlabel('Year')
ax1.set_ylabel('Total Home Runs Hit')
ax1.set_title('Home Run History')

ax2.bar(year_so, so, align = 'center', width = .7, alpha = .5, color = 'blue')
ax2.set_xlim([1871,2016])
ax2.set_xlabel('Year')
ax2.set_ylabel('Total Strikeouts')
ax2.set_title('Strikeout History')

#Last season of each era: 19th Century, Dead Ball, Lively Ball, Integration, Expansion, Free Agency, Steroid
era_ends = [1900.5, 1919.5, 1941.5, 1960.5, 1976.5, 1993.5, 2005.5]
#Draw each boundary once per axis instead of redrawing every line for every year
for boundary in era_ends:
    ax1.axvline(x=boundary, c="black", linewidth=.5)
    ax2.axvline(x=boundary, c="black", linewidth=.5)

#Green line: mound lowered and strike zone shrunk for the 1969 season
ax1.axvline(x=1968.5, c="green", linewidth=1)

ax1.text(1880, -575, '19th Century', fontsize = 12, color = 'black')
ax1.text(1905, -575, 'Dead Ball', fontsize = 12, color = 'black')
ax1.text(1925, -575, 'Lively Ball', fontsize = 12, color = 'black')
ax1.text(1946, -575, 'Integration', fontsize = 12, color = 'black')
ax1.text(1964, -575, 'Expansion', fontsize = 12, color = 'black')
ax1.text(1979, -575, 'Free Agency', fontsize = 12, color = 'black')
ax1.text(1996, -575, 'Steroid', fontsize = 12, color = 'black')
ax1.text(2007, -575, 'Modern', fontsize = 12, color = 'black')
ax1.text(1862, -575, 'Eras', fontsize = 15, color = 'black')
None

### Home Runs
The growth in home runs is clear from the graph above. The [MLB schedule](https://en.wikipedia.org/wiki/Major_League_Baseball_schedule) has not always been 162 games a season with 30 teams playing. This increase in teams and games is a significant cause of the rising totals, but it is not the only one.

The first spike in home runs, in the 19th Century Era, was a result of the pitching mound being pushed back to today's distance of 60 feet 6 inches. During the Dead Ball Era, pitching and defense were emphasized, which shows in the low number of league-wide home runs. In the transition to the Lively Ball Era, home runs increased immediately as offense became the main focus of the league.

Home runs took a dive during World War II. Many of the [MLB's best players](http://www.baseballinwartime.com/those_who_served/those_who_served_atoz.htm) took up arms and went to war. With most of the best players gone, older or less experienced players took their places after [Franklin D. Roosevelt requested that Judge Kenesaw Mountain Landis, the commissioner of baseball](http://ftw.usatoday.com/2014/01/fdr-franklin-roosevelt-letter-to-mlb-commissioner-kenesaw-landis), keep the game going. Once World War II was over, MLB began to integrate. With the addition of amazing talent like Jackie Robinson and Larry Doby, home runs increased again.

Home runs increased during the Expansion Era due to the addition of four more teams. With more players going to bat than ever before, home runs were bound to increase. The green line on the graph marks the 1969 season, when the mound was lowered and the strike zone was shrunk because the league believed pitchers had too much of an upper hand against hitters. During the Free Agency Era there is a massive drop in the 1981 season. [This was due to a players' strike](https://en.wikipedia.org/wiki/1981_Major_League_Baseball_strike). Some of the league's top hitters were involved in the strike and some games were cancelled, causing the dip in home runs. The Steroid Era, which began in 1994, brought an enormous increase in home runs. With steroids running through the league, players' power increased, and so did the number of home runs hit.

### Strikeouts
In early baseball, strikeouts dominated league statistics. In the 19th Century and Dead Ball Eras, strikeouts were unusually high relative to the later growth of total strikeouts. This can be attributed to stacked pitching talent and poorer hitting mechanics compared to today's hitters. Another spike occurred in the Expansion Era, when two teams were added to both the National and American Leagues and the number of players increased. Probably the most interesting observation is the increase in modern-day baseball and how it compares to the number of home runs hit.


In [22]:
hr_so = pd.merge(total_hr, df_so, on = 'year')

fig, (ax1, ax2) = plt.subplots(1,2, figsize = (15, 8), sharex = True)

sns.regplot(x="year", y="total_hr", truncate=False, data=hr_so, ax=ax1, color='red')
ax1.set_xlabel('Year')
ax1.set_ylabel('Home Runs')
ax1.set_title('League Home Run Totals')
ax1.set_ylim([0,6000])

sns.regplot(x="year", y="so", truncate=False, data=hr_so, ax=ax2, color='blue')
ax2.set_xlabel('Year')
ax2.set_ylabel('Strikeouts')
ax2.set_title('League Strikeout Totals')
ax2.set_ylim([0,40000])
None

### What To Take Away
Over the history of baseball, total strikeouts and total home runs rise together. This can be attributed to the culture shift in baseball. When swinging for the fences, batters take more of a risk of swinging and missing. Taking this risk has rewarded batters with more home runs, but they have suffered the consequence of striking out more.
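
One way to put a number on this relationship is a Pearson correlation between the two yearly totals. The sketch below uses a made-up `toy` frame with the same column names as `hr_so`; the values are invented for illustration and are not results from the database:

```python
import pandas as pd

# Toy league totals (made-up values for illustration only).
toy = pd.DataFrame({
    'year':     [2001, 2002, 2003, 2004, 2005],
    'total_hr': [100, 120, 130, 150, 170],
    'so':       [1000, 1100, 1250, 1300, 1500],
})

# Pearson correlation between yearly home run and strikeout totals.
r = toy['total_hr'].corr(toy['so'])

# Both totals also grow as the league adds teams and games, so it can help
# to correlate the year-to-year changes as a rough check.
r_diff = toy['total_hr'].diff().corr(toy['so'].diff())

print(round(r, 3), round(r_diff, 3))
```

A high correlation on raw totals may partly reflect league growth rather than hitting style, which is why the second figure is worth looking at alongside the first.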

In [23]:
fig, (ax1, ax2) = plt.subplots(2,1, figsize = (15, 15), sharex = True)

ax1.bar(year_avg, avg, align = 'center', width = .7, alpha = .5, color = 'red')
ax1.set_xlim([1871,2016])
ax1.set_xlabel('Year')
ax1.set_ylabel('League Wide Batting Average')
ax1.set_title('Batting Average History')

ax2.bar(year_era, era, align = 'center', width = .7, alpha = .5, color = 'blue')
ax2.set_xlim([1871,2016])
ax2.set_xlabel('Year')
ax2.set_ylabel('League Wide Earned Run Average')
ax2.set_title('ERA History')

#Last season of each era: 19th Century, Dead Ball, Lively Ball, Integration, Expansion, Free Agency, Steroid
era_ends = [1900.5, 1919.5, 1941.5, 1960.5, 1976.5, 1993.5, 2005.5]
#Draw each boundary once per axis instead of redrawing every line for every year
for boundary in era_ends:
    ax1.axvline(x=boundary, c="black", linewidth=.5)
    ax2.axvline(x=boundary, c="black", linewidth=.5)

ax1.text(1875, -.03, '19th Century', fontsize = 12, color = 'black')
ax1.text(1902, -.03, 'Dead Ball', fontsize = 12, color = 'black')
ax1.text(1926, -.03, 'Lively Ball', fontsize = 12, color = 'black')
ax1.text(1943, -.03, 'Integration', fontsize = 12, color = 'black')
ax1.text(1963, -.03, 'Expansion', fontsize = 12, color = 'black')
ax1.text(1978, -.03, 'Free Agency', fontsize = 12, color = 'black')
ax1.text(1995, -.03, 'Steroid', fontsize = 12, color = 'black')
ax1.text(2007, -.03, 'Modern', fontsize = 12, color = 'black')
ax1.text(1862, -.03, 'Eras', fontsize = 15, color = 'black')
None

### Batting Average
In the early years of baseball, there was significant fluctuation in the league's batting average. In the 19th Century, batting averages increased when the league decided to push the mound back. Giving hitters more time to react to a pitch proved to be an advantage for them. Once the Dead Ball Era began, averages plummeted, due in part to an increased focus on pitching and defense. This supports the idea that after the mound was pushed back and offense dominated the sport, teams overcompensated by improving their defensive ability before their offensive ability. After the Dead Ball Era there is a spike in the Lively Ball Era, but for the most part the league average levels out to around .250.

### Earned Run Average
The first thing to notice in the ERA graph is how it correlates with the league's batting average in the early years of the game. Earned Run Average, or ERA, is the statistic that reveals how many earned runs a pitcher allows for every nine innings pitched. It is a good indication of how well the pitcher prevents runs from scoring, so the lower the ERA, the better. In the early eras of baseball, contact hitting was the major philosophy for hitters. Getting base hits was the main priority, and this is how many teams scored their runs. Therefore, batting average and ERA appear to have a relationship with each other. However, the relationship is not so apparent later. After World War II, ERA increases over time while batting average stays about the same. ERA appears to have more of a relationship with home runs than with batting average. As home runs go up, ERA goes up, and vice versa.

In [24]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (15, 8), sharex = True)

sns.regplot(x="year", y="avg", truncate=False, data=total_avg, ax=ax1, color='red')
ax1.set_xlabel('Year')
ax1.set_ylabel('Batting Average')
ax1.set_title('League Batting Average')

sns.regplot(x="year", y="yr_era", truncate=False, data=df_era, ax=ax2, color='blue')
ax2.set_xlabel('Year')
ax2.set_ylabel('Earned Run Average')
ax2.set_title('League Earned Run Average')
None

### What To Take Away
The most interesting aspect of this comparison between batting average and ERA is how the relationship dissolves over time and ERA gains more of a relationship with home runs. From the scatter plots, it is obvious that batting average is declining, while ERA is rising at a much greater pace.

## Level Playing Field
The Expansion Era marks the point where the number of games and teams in MLB starts to resemble today's league. Before it, the league had fewer teams, and those teams played fewer games.
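
Another way to put seasons of different sizes on the same footing is to look at rates instead of totals. This is only a sketch with made-up rows; `toy_batting` and the `hr_per_100_ab` column are hypothetical names, though `ab` and `hr` follow the batting table:

```python
import pandas as pd

# Toy player-season rows (made-up numbers); column names follow the batting table.
toy_batting = pd.DataFrame({
    'year': [1950, 1950, 2000, 2000, 2000],
    'ab':   [400, 600, 500, 500, 1000],
    'hr':   [10, 20, 20, 30, 50],
})

season = toy_batting.groupby('year')[['ab', 'hr']].sum()

# Home runs per 100 at bats: a rate that does not grow just because
# more players come to bat in a bigger league.
season['hr_per_100_ab'] = season['hr'] / season['ab'] * 100

print(season)
```

In this toy data the later season has more than three times the home runs but a rate that is less than double, which is the kind of gap that totals alone can hide.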

In [25]:
exp_hr = total_hr.loc[total_hr['year'] >= 1961, :].reset_index()
exp_avg = total_avg.loc[total_avg['year'] >= 1961, :].reset_index()
exp_so = df_so.loc[df_so['year'] >= 1961, :].reset_index()
exp_era = df_era.loc[df_era['year'] >= 1961, :].reset_index()
exp_era.head()

In [26]:
fig, (ax1, ax2) = plt.subplots(2,1, figsize = (15, 15), sharex = True)

ax1.bar(exp_hr['year'], exp_hr['total_hr'], align = 'center', width = .7, alpha = .5, color = 'red')
ax1.set_xlim([1960,2016])
ax1.set_xlabel('Year')
ax1.set_ylabel('League Home Runs')
ax1.set_title('Home Run History After Expansion')

ax2.bar(exp_so['year'], exp_so['so'], align = 'center', width = .7, alpha = .5, color = 'blue')
ax2.set_xlim([1960,2016])
ax2.set_xlabel('Year')
ax2.set_ylabel('League Wide Strikeouts')
ax2.set_title('Strikeout History After Expansion')
None

In [27]:
fig, (ax1, ax2) = plt.subplots(2,1, figsize = (15, 15), sharex = True)

ax1.bar(exp_avg['year'], exp_avg['avg'], align = 'center', width = .7, alpha = .5, color = 'red')
ax1.set_xlim([1960,2016])
ax1.set_xlabel('Year')
ax1.set_ylabel('League Wide Batting Average')
ax1.set_title('Batting Average History After Expansion Era')

ax2.bar(exp_era['year'], exp_era['yr_era'], align = 'center', width = .7, alpha = .5, color = 'blue')
ax2.set_xlim([1960,2016])
ax2.set_xlabel('Year')
ax2.set_ylabel('League Wide Earned Run Average')
ax2.set_title('ERA History After Expansion Era')
None

In [28]:
fig, ax = plt.subplots(2,2, figsize = (15, 8), sharex = True)

sns.regplot(x="year", y="total_hr", truncate=False, data=exp_hr, ax=ax[0,0], color='red')
ax[0,0].set_xlabel('Year')
ax[0,0].set_ylabel('Home Runs')
ax[0,0].set_title('League Home Run Totals After Expansion Era')
ax[0,0].set_ylim([0,6000])

sns.regplot(x="year", y="so", truncate=False, data=exp_so, ax=ax[1,0], color='blue')
ax[1,0].set_xlabel('Year')
ax[1,0].set_ylabel('Strikeouts')
ax[1,0].set_title('League Strikeout Totals After Expansion Era')
ax[1,0].set_ylim([0,40000])

sns.regplot(x="year", y="avg", truncate=False, data=exp_avg, ax=ax[0,1], color='red')
ax[0,1].set_xlabel('Year')
ax[0,1].set_ylabel('Batting Average')
ax[0,1].set_title('League Batting Average After Expansion Era')

sns.regplot(x="year", y="yr_era", truncate=False, data=exp_era, ax=ax[1,1], color='blue')
ax[1,1].set_xlabel('Year')
ax[1,1].set_ylabel('Earned Run Average')
ax[1,1].set_title('League Earned Run Average After Expansion Era')
None

### What To Take Away
From these graphs, all four statistics trend upward over the period. The regression plots show something interesting, though. All four have a positive linear regression slope, but three of the four have data points below the regression line after 2010. Home runs, batting average, and ERA all show a drop-off around 2005; strikeouts do not. This deserves more detailed attention.
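
Whether a season sits above or below the regression line can be checked directly from the residuals. The sketch below fits a straight line with `numpy.polyfit` on made-up values; the arrays and variable names are assumptions for illustration, and this is an ordinary least squares line rather than a reproduction of the `sns.regplot` output:

```python
import numpy as np

# Toy yearly values (made-up for illustration).
years = np.array([2006, 2007, 2008, 2009, 2010, 2011, 2012], dtype=float)
values = np.array([10.0, 12.0, 14.0, 16.0, 15.0, 14.0, 13.0])

# Center the years so the fit is numerically well behaved.
x = years - years.mean()

# Fit a straight line: value = slope * x + intercept.
slope, intercept = np.polyfit(x, values, deg=1)
residuals = values - (slope * x + intercept)

# Seasons whose value falls below the fitted line.
below_line = years[residuals < 0].astype(int).tolist()

print(slope, below_line)
```

A positive slope with a run of negative residuals at the end of the series is the pattern described above: the long-run trend is up, but the most recent seasons fall short of it.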

## Modern Day Baseball
Looking at the graphs above, it is obvious that modern-day baseball shows a different trend from the eras that came before it.

In [29]:
fig = plt.figure(figsize=(10,10), dpi=100)

In [30]:
fig, ax = plt.subplots(2, 2, figsize = (15, 10), sharex = True)
#Filter each table on its own year column rather than reusing the mask from total_hr
modern_hr = total_hr.loc[total_hr['year'] >= 2005, :].reset_index()
modern_avg = total_avg.loc[total_avg['year'] >= 2005, :].reset_index()
modern_so = df_so.loc[df_so['year'] >= 2005, :].reset_index()
modern_era = df_era.loc[df_era['year'] >= 2005, :].reset_index()

ax[0,0].bar(modern_hr['year'], modern_hr['total_hr'], align = 'center', width = .7, alpha = .5, color = 'red')
ax[0,0].set_ylabel('Total Home Runs')
ax[0,0].set_title('Home Runs')
ax[0,1].bar(modern_avg['year'], modern_avg['avg'], align = 'center', width = .7, alpha = .5, color = 'red')
ax[0,1].set_ylabel('League Batting Average')
ax[0,1].set_title('Batting Average')
ax[1,0].bar(modern_so['year'], modern_so['so'], align = 'center', width = .7, alpha = .5, color = 'blue')
ax[1,0].set_xlabel('Year')
ax[1,0].set_ylabel('Total Strikeouts')
ax[1,0].set_title('Strikeouts')
ax[1,1].bar(modern_era['year'], modern_era['yr_era'], align = 'center', width = .7, alpha = .5, color = 'blue')
ax[1,1].set_xlabel('Year')
ax[1,1].set_ylabel('League ERA')
ax[1,1].set_title('Earned Run Average')
None

### What To Take Away
From the graphs shown, it is obvious that there is a decline in offensive production and an improvement in pitching. Home runs vary year to year but still show signs of a decline. Batting average shows a consistent decline over the ten-year stretch. Both pitching stats show significant improvement in modern-day baseball, with total strikeouts increasing by over 5000 in just ten years. With strikeouts increasing and offense decreasing, ERA should go down, as the Earned Run Average graph shows.

## Conclusion

### Eras
The game of baseball has changed significantly over time. With the culture change in baseball, the statistics recorded year to year have changed since baseball began. Hitters are shooting for the long ball and pitchers are aiming to ring them up on a called third strike more than ever. Examining the data from era to era provides insight into how events in society and changes in the rules of the game can affect stats as well. With all these changes, there are a few questions that must be asked moving forward.
- Can players from different eras really be compared to each other?
- What caused the culture of baseball to shift from a hitting philosophy based on contact to one based on power?
- Has the culture shift improved offensive production?

### Modern Day
Given the dominance of power hitting in modern-day baseball, has swinging for the fences improved offensive production? The data shows a decline in offense as more and more hitters thirst for power. Pitchers' strikeout numbers are improving and ERAs are dropping, while hitters' averages fall along with their home run numbers. Is it more beneficial for hitters to focus on contact rather than power? Maybe. Maybe not. But the numbers show that the age of the long ball goes to the pitchers.