### P2: Baseball Managers
Thomas Hrabchak <br>
February 2015

### Introduction
Major League Baseball (MLB) is known for its all-stars earning multimillion-dollar salaries and hitting towering home runs. Often overlooked is the manager, who rarely leaves the dugout except to argue with an umpire, throw his hat, and get ejected from the game. Some managers are well known, lead winning teams, and make it into the Hall of Fame; many others are forgotten by fans. This project will explore the role of the manager on an MLB team in terms of team statistics.

Behind the scenes, the manager makes many decisions on every pitch, such as the positioning of fielders, the preparation of a relief pitcher, or whether the batter should try to bunt. These decisions affect the outcome of the game and, cumulatively, the season. However, the results of managerial decisions manifest themselves in the performance of the players, which makes directly measuring the performance of the manager difficult.

Additionally, managers have responsibilities that are unrelated or only indirectly related to winning games, such as interacting with the media. This paper will not consider skills that do not result in wins for the team, although such non-baseball-related skills are relevant to assessing the overall performance of managers.

#### Background
Methods for assessing the performance of baseball managers have been proposed in several academic papers. The most popular method is James's (1986) "Pythagorean theorem", in which a manager's performance is assessed using an estimate of expected wins. Additionally, Bradbury (2006) assesses a manager in terms of impact on player performance.
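
For reference, the Pythagorean expectation is commonly written as expected win fraction = RS^2 / (RS^2 + RA^2), where RS is runs scored and RA is runs allowed. The block below is a minimal sketch of that formula, assuming the classic exponent of 2. The function name and the season numbers are hypothetical and are not taken from the Lahman data. One possible reading of the method is to compare a team's actual wins with this expectation.

```python
def pythagorean_expected_wins(runs_scored, runs_allowed, games, exponent=2.0):
    """Expected wins from runs scored and allowed (classic exponent of 2)."""
    rs = float(runs_scored) ** exponent
    ra = float(runs_allowed) ** exponent
    return games * rs / (rs + ra)

# Made-up season line for illustration only.
expected = pythagorean_expected_wins(runs_scored=800, runs_allowed=700, games=162)
actual_wins = 95

# Positive values mean the team won more games than its run totals suggest.
wins_above_expected = actual_wins - expected
```

In terms of the Teams table used later, `runs_scored` and `runs_allowed` would correspond to the R and RA columns.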

In 2014, Randy Silvers and Raul Susmel explored the compensation of managers in their paper "Compensation of a Manager: The Case of Major League Baseball". Instead of attempting to directly assess the performance of a manager using a metric derived from baseball statistics, they hypothesized that the economic market for baseball managers would result in the best-performing managers (highest team winning percentage and number of playoff appearances) being paid the highest salaries. The results of their analysis showed that a manager's past performance affects the manager's current salary, but the manager's current salary does not affect the current performance of the manager's team. Silvers and Susmel note that in an efficient market a manager's compensation would be a sufficient measure of his expected productivity, yet salary has been shown to be insignificant in predicting any team performance metric. This implies that the market for MLB managers is not efficient.

Based on Silvers and Susmel and the previous papers, there still appears to be room for improvement in understanding the role of MLB managers and the impact they have on the performance of their teams.

### Questions
There are many possible questions about MLB managers, assessing their performance, and assessing their impact on their team. To limit the scope of this paper, only the following question will be explored:

- When a manager transfers teams, does the relative performance of any team statistic from the manager's previous team transfer to the manager's new team?

This question is actionable in terms of a statistical approach and isolates the change to the individual manager. We can generalize this question for individual managers over their entire career and then compare managerial careers based on which team statistic they were best at improving on their team.
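
One way to make the question concrete is to correlate a team statistic's rank before a transfer with its rank after the transfer. The block below is a minimal sketch under stated assumptions: the table, its column names (`rel_SB_before`, `rel_SB_after`), and all values are hypothetical toy data, not results from the Lahman database.

```python
import pandas as pd

# Hypothetical toy table: one row per manager transfer, holding the team's
# rank in stolen bases among all teams that season (1 = best), measured
# before and after the move.
transfers = pd.DataFrame({
    "playerID":      ["mgr_a", "mgr_b", "mgr_c", "mgr_d", "mgr_e"],
    "rel_SB_before": [1, 4, 8, 12, 15],
    "rel_SB_after":  [3, 2, 9, 14, 11],
})

# Pearson correlation of the two rank columns. Because the inputs are
# already ranks, this behaves like a rank correlation.
carry_over = transfers["rel_SB_before"].corr(transfers["rel_SB_after"])
```

A value near 1 would suggest that the statistic tends to follow the manager to the new team, while a value near 0 would suggest no carry-over. Repeating this for every statistic gives one number per statistic to compare.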

#### Alternative Questions For Future Exploration
Below are other questions which I considered addressing in this paper. However, they need further refinement in order to be actionable.
- How important is having a good manager to the success of a Major League Baseball team?
- Which, if any, team statistics correlate with the winningest managers?
- What are the characteristics of managers who make it into the Hall of Fame?
- What are the characteristics of managers who have short careers?
- Are any managers particularly good in the postseason?
- Do any team statistics correlate with a long managerial career?
- Who is the most recent player-manager, and what were the circumstances?
- Are there any trends in the salaries of managers?
- Has a 'bad' manager ever won the World Series?

### Wrangle
The data for this project comes from the 2014 edition of The Lahman Baseball Database. We will need to refine the data from this source to better address our question.

As an overview, below is a summary of the steps we will take in the data wrangling phase.
1. Import relevant data from The Lahman Baseball Database.
2. Append the relative performance for each team statistic for each year to the teams_df DataFrame.
3. Associate managers with teams, keeping only managers who lasted the full year.
4. Generate a table of manager team transfers.
5. Average the statistics over the stints before and after each transfer, and record the statistics for the single year before and the single year after.
6. Aggregate transfer correlations, using the averaged before/after statistics, to determine which statistics are most correlated with managers.
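
Step 4 above can be sketched as follows. This is an illustrative sketch, not the notebook's implementation: the toy rows and the `prev_teamID` column are hypothetical, and it assumes one full-season row per manager and year with the playerID, yearID, and teamID columns of the Managers table.

```python
import pandas as pd

# Hypothetical toy data shaped like the Managers table
# (one full-season row per manager and year).
seasons = pd.DataFrame({
    "playerID": ["mgr_a", "mgr_a", "mgr_a", "mgr_b", "mgr_b"],
    "yearID":   [2001, 2002, 2003, 2001, 2002],
    "teamID":   ["AAA", "AAA", "BBB", "CCC", "CCC"],
})

seasons = seasons.sort_values(["playerID", "yearID"])

# Team the same manager led in his previous recorded season
# (missing for each manager's first season).
seasons["prev_teamID"] = seasons.groupby("playerID")["teamID"].shift(1)

# A transfer is a season whose team differs from the previous one.
is_transfer = seasons["prev_teamID"].notnull() & (
    seasons["teamID"] != seasons["prev_teamID"]
)
transfers = seasons.loc[is_transfer, ["playerID", "yearID", "prev_teamID", "teamID"]]
```

In this toy data only mgr_a's 2003 season qualifies, because that is the only season in which the team differs from the manager's previous one. With real data, a change in teamID caused by a franchise relocation would also be flagged, so comparing on franchID instead may be worth considering.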

#### Import Data
This project uses data from the 2014 edition of The Lahman Baseball Database, hosted on GitHub.

In [145]:
# Import Libraries and Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%pylab inline

repository = "https://raw.githubusercontent.com/thrabchak/Udacity-Data-Analysis/"
folder = "master/P2%20Baseball%20Managers/data/"

awards_managers_df  = pd.read_csv(repository + folder + "AwardsManagers.csv")
hall_of_fame_df     = pd.read_csv(repository + folder + "HallOfFame.csv")
managers_df         = pd.read_csv(repository + folder + "Managers.csv")
master_df           = pd.read_csv(repository + folder + "Master.csv")
series_post_df      = pd.read_csv(repository + folder + "SeriesPost.csv")
teams_df            = pd.read_csv(repository + folder + "Teams.csv")
teams_franchises_df = pd.read_csv(repository + folder + "TeamsFranchises.csv")

# Show columns of the imported DataFrames
def print_df_columns(show):
    """Print columns of imported dataframes if 'show' is True."""
    # print(...) with a single argument behaves the same in Python 2 and 3
    if show:
        print("Awards Managers: ")
        print(awards_managers_df.columns)
        print("Hall of Fame: ")
        print(hall_of_fame_df.columns)
        print("Managers: ")
        print(managers_df.columns)
        print("Master: ")
        print(master_df.columns)
        print("Series Post: ")
        print(series_post_df.columns)
        print("Teams: ")
        print(teams_df.columns)
        print("Teams Franchises: ")
        print(teams_franchises_df.columns)

print_df_columns(True)

Populating the interactive namespace from numpy and matplotlib
Awards Managers: 
Index([u'playerID', u'awardID', u'yearID', u'lgID', u'tie', u'notes'], dtype='object')
Hall of Fame: 
Index([u'playerID', u'yearid', u'votedBy', u'ballots', u'needed', u'votes',
       u'inducted', u'category', u'needed_note'],
      dtype='object')
Managers: 
Index([u'playerID', u'yearID', u'teamID', u'lgID', u'inseason', u'G', u'W',
       u'L', u'rank', u'plyrMgr'],
      dtype='object')
Master: 
Index([u'playerID', u'birthYear', u'birthMonth', u'birthDay', u'birthCountry',
       u'birthState', u'birthCity', u'deathYear', u'deathMonth', u'deathDay',
       u'deathCountry', u'deathState', u'deathCity', u'nameFirst', u'nameLast',
       u'nameGiven', u'weight', u'height', u'bats', u'throws', u'debut',
       u'finalGame', u'retroID', u'bbrefID'],
      dtype='object')
Series Post: 
Index([u'yearID', u'round', u'teamIDwinner', u'lgIDwinner', u'teamIDloser',
       u'lgIDloser', u'wins', u'losses', u'ties'

#### Team Statistic Overall Relative Performance
We want to append the relative statistical performance of teams to the existing teams_df DataFrame.
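
As a compact illustration of what a relative rank means here, the sketch below ranks teams within each season using pandas' grouped `rank`. It is an alternative sketch on hypothetical toy data, not the notebook's implementation, and the `higher_is_better` dictionary is an illustrative stand-in that covers only two statistics.

```python
import pandas as pd

# Hypothetical toy slice shaped like the Teams table.
toy_teams = pd.DataFrame({
    "yearID": [2000, 2000, 2000, 2001, 2001],
    "teamID": ["AAA", "BBB", "CCC", "AAA", "BBB"],
    "W":      [95, 81, 70, 88, 90],
    "ERA":    [3.50, 4.20, 4.90, 4.00, 3.80],
})

# True = larger is better, False = smaller is better.
higher_is_better = {"W": True, "ERA": False}

for stat, bigger_first in higher_is_better.items():
    # With ascending=False the largest value in each year gets place 1;
    # method="first" breaks ties by row order so every place is unique.
    toy_teams["rel_" + stat] = (
        toy_teams.groupby("yearID")[stat]
                 .rank(method="first", ascending=not bigger_first)
                 .astype(int)
    )
```

Each `rel_` column holds a team's place among all teams in the same season, with 1 as the best place for that statistic.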

In [153]:
# Create a dictionary to determine how each statistic should be ordered.
# The key is the statistic and the value is True if it should be ordered from
# largest to smallest, False if it should be ordered smallest to largest.
team_stat = {
    # Overall Statistics
    'Rank':   False, # Position in final standings (1 is best, so smallest first)
    "W":      True,  # Wins
    "L":      False, # Losses
    "DivWin": True,  # Division Winner (Y or N)
    "WCWin":  True,  # Wild Card Winner (Y or N)
    "LgWin":  True,  # League Champion(Y or N)
    "WSWin":  True,  # World Series Winner (Y or N)
    
    # Batting Statistics
    "R":      True,  # Runs scored
    "H":      True,  # Hits
    "2B":     True,  # Doubles
    "3B":     True,  # Triples
    "HR":     True,  # Homeruns
    "BB":     True,  # Walks
    "SO":     False, # Strikeouts
    "SB":     True,  # Stolen bases
    "CS":     False, # Caught stealing
    "HBP":    True,  # Batters hit by pitch
    "SF":     True,  # Sacrifice flies
    
    # Pitching Statistics
    "RA":     False, # Opponents' runs scored
    "ER":     False, # Earned runs allowed
    "ERA":    False, # Earned run average
    "CG":     True,  # Complete games pitched
    "SHO":    True,  # Shutouts
    "SV":     True,  # Saves
    "HA":     False, # Hits allowed
    "HRA":    False, # Homeruns allowed
    "BBA":    False, # Walks allowed
    "SOA":    True,  # Strikeouts by pitchers
    
    # Fielding Statistics
    "E":      False, # Errors
    "DP":     True,  # Double plays
    "FP":     True   # Fielding percentage
}

# Initialize data structure which we will create a DataFrame from
def create_data_structure():
    """Returns a dictionary with team stats as keys and lists as values"""
    # Create list of columns
    columns = ['yearID', 'teamID']
    for key in team_stat.keys():
        columns.append('rel_' + key)   
    # Create 2D array for data
    data = {}
    for column in columns:
        data[column] = []
    return data

def get_rel_stats_for_year(my_teams_df, year):
    """Returns a dictionary of dictionaries for this year in this teams DataFrame.
       The return dictionary will take the form of:
       team -> {stat -> rel_place}"""
    # Select this year's rows from the DataFrame that was passed in
    # (not from the global teams_df)
    year_df = my_teams_df[my_teams_df.yearID == year]

    # Create year_dict
    year_dict = {} # team -> {stat -> rel_place}
    for team in year_df.teamID.unique():
        year_dict[team] = {}

    for stat in team_stat.keys():
        # Create an ordered list of teams in this year for this statistic
        sorted_df = year_df.sort_values(stat, ascending=(not team_stat[stat]))
        teams_pos = sorted_df.teamID.unique()

        # Record each team's place (1 = best) for this statistic
        for place, team in enumerate(teams_pos, 1):
            year_dict[team]['rel_' + stat] = place
    return year_dict

# Find the relative performance for each statistic as columns to teams_df
def create_rel_team_stats(my_teams_df):
    """Returns a DataFrame containing the relative statistical performance 
       for each team for each year."""
    data = create_data_structure()
    
    # Iterate through each year and determine the placement of each team for each statistic
    for year in pd.unique(my_teams_df.yearID.ravel()):       
        year_dict = get_rel_stats_for_year(my_teams_df, year)
            
        # Transform year_dict into data
        for team in year_dict.keys():
            for col in data.keys():
                if col == 'yearID':
                    data['yearID'].append(year)
                elif col == 'teamID':
                    data['teamID'].append(team)
                else:
                    data[col].append(year_dict[team][col])
                
    # Create DataFrame    
    return pd.DataFrame(data=data)

# Merge to existing teams_df
rel_teams_df = pd.merge(teams_df, create_rel_team_stats(teams_df), on=['yearID', 'teamID'])

#print rel_teams_df

      yearID lgID teamID franchID divID  Rank    G  Ghome   W   L    ...      \
0       1871  NaN    BS1      BNA   NaN     3   31    NaN  20  10    ...       
1       1871  NaN    CH1      CNA   NaN     2   28    NaN  19   9    ...       
2       1871  NaN    CL1      CFC   NaN     8   29    NaN  10  19    ...       
3       1871  NaN    FW1      KEK   NaN     7   19    NaN   7  12    ...       
4       1871  NaN    NY2      NNA   NaN     5   33    NaN  16  17    ...       
5       1871  NaN    PH1      PNA   NaN     1   28    NaN  21   7    ...       
6       1871  NaN    RC1      ROK   NaN     9   25    NaN   4  21    ...       
7       1871  NaN    TRO      TRO   NaN     6   29    NaN  13  15    ...       
8       1871  NaN    WS3      OLY   NaN     4   32    NaN  15  15    ...       
9       1872  NaN    BL1      BLC   NaN     2   58    NaN  35  19    ...       
10      1872  NaN    BR1      ECK   NaN     9   29    NaN   3  26    ...       
11      1872  NaN    BR2      BRA   NaN 

#### Associate Managers To Teams
The provided managers_df associates managers with teams by year; however, there may be more than one manager for a team in a given year. This muddles the data, so we only want to include managers who lasted the full year.
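
A minimal sketch of this filter, assuming that "lasted the full year" can be approximated by a team having exactly one manager listed for that season. The toy rows are hypothetical; only the playerID, yearID, teamID, and G column names come from the Managers table.

```python
import pandas as pd

# Hypothetical toy rows shaped like Managers.csv.
toy_managers = pd.DataFrame({
    "playerID": ["mgr_a", "mgr_b", "mgr_c", "mgr_d"],
    "yearID":   [2000, 2000, 2000, 2001],
    "teamID":   ["AAA", "BBB", "BBB", "AAA"],
    "G":        [162, 100, 62, 162],
})

# Number of distinct managers each team used in each season,
# aligned row by row with toy_managers.
managers_per_season = (
    toy_managers.groupby(["yearID", "teamID"])["playerID"].transform("nunique")
)

# Keep only seasons in which a single manager handled the team.
full_season_managers = toy_managers[managers_per_season == 1]
```

In the toy data, team BBB used two managers in 2000, so both of its rows are dropped and only mgr_a and mgr_d remain.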

#### Generate List of Manager Transfers

#### Compare Manager Before and After Stints

### Explore
- build intuition
- find patterns

### Conclusion
- answers to questions based on exploration

### References

Bradbury, J. C. (2006), “Hired to Be Fired: The Publicity Value of Managers,” unpublished manuscript, Kennesaw State University.

Lahman, Sean, comp. The Lahman Baseball Database. 2014 ed. Print. [link](http://www.seanlahman.com/baseball-archive/statistics/)

Ruggiero, J., Hadley, L., Ruggiero, G., & Knowles, S. (1997). A Note on the Pythagorean Theorem of Baseball Production. Managerial and Decision Economics, 18(4), 335–342. Retrieved from http://www.jstor.org/stable/3108205

Silvers, Randy, and Susmel, Raul. "Compensation of a Manager: The Case of Major League Baseball." University of Houston, 1 Apr. 2014. Web. 14 Jan. 2016. [link](http://www.bauer.uh.edu/rsusmel/Academic/MLB Manager Salaries_1.pdf).


### Appendix
