# Behind the Plate in the '90s Exploring the Defensive Impact of Catchers

To analyze the defensive effectiveness of catchers, we use the average value per defensive play involving a catcher as our metric. The value of a defensive play is determined by calculating the difference between the expected values of the beginning state and the expected values of the end state of a defensive play involving the catcher.

It's important to note that this analysis does not provide a definitive measurement of a catcher's defensive effectiveness. Rather, it offers an interesting metric to gauge the average impact of defensive plays involving the catcher. When evaluating a catcher's defensive effectiveness, it's recommended to consider other metrics such as fielding percentage, zone rating, percentage of attempted steals thrown out, and pitch calling capabilities, among others.

We also compute the normalized values of a catcher's caught stealing percentage and examine the correlation with the normalized average defensive effectiveness. The analysis reveals that there is not a strong correlation between the two metrics. This suggests that measuring the average defensive effectiveness has the potential to be a unique indicator of an impactful catcher or defensive player.

In [2]:
###Function allows us to find the state of runner positions and number of outs at a given index
def current_state(index,data):
    current_base_runners =[]
    outs = data['outs_when_up'][index]
    base_1=str(data['on_1b'][index])
    base_2=str(data['on_2b'][index])
    base_3=str(data['on_3b'][index])
    
    return [base_3+base_2+base_1,outs]

#Base Runner State Index
# 000 = no runners on base
# 001 = runner on first no runners on second and third
# 010 = runner on second no runners on first and third
# 100 = runner on third no runners on first and second
# 011 = runners on first and second
# 101 = runners on first and third
# 110 = runners on second and third
# 111 = bases loaded


###Function that allows us to compute the expectancy matrix of a given set of games.

def create_expectancy_matrix(data):

    ## Generate all possible combinations of base runner situations
    length = 3
    entries = [0, 1]
    combinations = list(itertools.product(entries, repeat=length))

    # Create the expectancy matrix of zeros
    matrix = np.zeros((8, 3))

    # Set the row labels using the entries of combinations
    row_labels = [''.join(map(str, combo)) for combo in combinations]

    # Set the column labels based on number of outs
    outs=[0,1,2]

    # Create a DataFrame with the matrix, row labels, and column labels
    expectancy_matrix = pd.DataFrame(matrix, index=row_labels, columns=outs)

    for out in outs:
        for base_combination in combinations:

            # Create a data frame where the situation is that there are 2 outs and a man on first
            situation = data[(data['on_1b'] == base_combination[2]) & (data['on_2b'] == base_combination[1]) & (data['on_3b'] == base_combination[0]) & (data['outs_when_up'] == out)]

            # Drop the bottom of ninth inning
            situation.drop(situation[(situation['inning'] == 9) & (situation['inning_topbot'] == 0)].index)


            # We want to record the indexes of when an individual batter begins a plate appearance and when they end.
            # This will be useful for computing the number of runs scored by the batter's at-bat appearance.
            last_pitch_indexes = []
            first_pitch_indexes = []
            pitch_counts = []
            situation_indexes = situation.index.tolist()
            n = 0

            while n < len(situation_indexes):
                first_pitch_index = situation_indexes[n]
                first_pitch_indexes.append(first_pitch_index)
                index = first_pitch_index
                pitch_count = 1
                batter = data.at[index, 'batter']

                while index + 1 < len(data) and data.at[index + 1, 'batter'] == batter:
                    pitch_count += 1
                    index += 1
                    n += 1

                last_pitch_indexes.append(index)
                pitch_counts.append(pitch_count)
                n += 1
                index += 1

            # Keep track of the indexes of when an inning ends
            inning_over_indexes = []
            for index in last_pitch_indexes:
                inning_type=data['inning_topbot'][index]
                if index ==len(data):
                    index+=1
                while index<len(data) and data.at[index,'inning_topbot']==inning_type:
                    index+=1
                inning_over_indexes.append(index-1)

            # Keep track of the number of runs scored until the end of the inning
            Runs_Scored_Home = []
            Runs_Scored_Away = []

            for i in range(len(first_pitch_indexes)):
                if data.at[first_pitch_indexes[i], 'inning_topbot'] == 1:
                    runs_scored = data.at[inning_over_indexes[i], 'post_away_score'] - data.at[first_pitch_indexes[i], 'away_score']
                    Runs_Scored_Away.append(runs_scored)

                if data.at[first_pitch_indexes[i], 'inning_topbot'] == 0:
                    runs_scored = data.at[inning_over_indexes[i], 'post_home_score'] - data.at[first_pitch_indexes[i], 'home_score']
                    Runs_Scored_Home.append(runs_scored)

            Runs_Scored = Runs_Scored_Home + Runs_Scored_Away

            #Store expectancy matrix as a dataframe
            #The rows are labeled by runner positions and columns by number of outs.
            #The row labeled '100' corresponds to a runner on third and no runners on second and first
            expectancy_matrix.loc[''.join(map(str, base_combination)), out] = np.mean(Runs_Scored)

    return expectancy_matrix

In [5]:
###Upload play-by-play data from all 1990s regular season games

#PACKAGES
import pandas as pd
import pybaseball
from pybaseball import statcast
import itertools
import numpy as np

# Retrieve the statcast data for a specific time period
pybaseball.cache.enable()
start_date = '1990-01-01'
end_date = '1999-12-31'
data = statcast(start_date, end_date)


###DATA CLEANING

#Drop events that occur in the bottom of the 9th
data = data[(data['game_type'] == 'R') & ((data['inning'] != 9) | (data['inning_topbot'] != 0))]

#Change order of dataframe
data  = data[::-1]

# Relabeling the indexing of the columns
data.reset_index(drop=True, inplace=True)

# Change Top of inning value to 1 and Bottom of inning value to 0
data['inning_topbot'] = data['inning_topbot'].map({'Top': 1, 'Bot': 0})


# Change values for bases to 1 or 0 if there is a player on base or not
data['on_1b'] = np.where(data['on_1b'].isna(), 0, 1)
data['on_2b'] = np.where(data['on_2b'].isna(), 0, 1)
data['on_3b'] = np.where(data['on_3b'].isna(), 0, 1)

data_no_alter = data.copy()


This is a large query, it may take a moment to complete
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates
Skipping offseason dates


100%|██████████| 2460/2460 [40:11<00:00,  1.02it/s]  


In [15]:
data_no_alter

Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,,1990-04-09,,,,"Saberhagen, Bret",111349,121604,,swinging_strike,...,0,0,0,0,0,,,,0.0,-0.038
1,,1990-04-09,,,,"Saberhagen, Bret",111349,121604,,swinging_strike,...,0,0,0,0,0,,,,0.0,-0.048
2,,1990-04-09,,,,"Saberhagen, Bret",111349,121604,,foul,...,0,0,0,0,0,,,,0.0,0.0
3,,1990-04-09,,,,"Saberhagen, Bret",111349,121604,,ball,...,0,0,0,0,0,,,,0.0,0.025
4,,1990-04-09,,,,"Saberhagen, Bret",111349,121604,,ball,...,0,0,0,0,0,,,,0.0,0.044
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5992474,,1999-10-04,,,,"Leiter, Al",123675,117652,,ball,...,5,5,0,0,5,,,,0.0,0.037
5992475,,1999-10-04,,,,"Leiter, Al",123675,117652,,ball,...,5,5,0,0,5,,,,0.0,0.072
5992476,,1999-10-04,,,,"Leiter, Al",123675,117652,walk,ball,...,5,5,0,0,5,,,,0.002,0.102
5992477,,1999-10-04,,,,"Leiter, Al",124693,117652,,foul,...,5,5,0,0,5,,,,0.0,-0.09


In [16]:
data_no_alter.columns

Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'spin_dir', 'spin_rate_deprecated',
       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
       'estima

In [235]:
#Create expectancy matrix for the 1990's regular season games
expectancy_matrix = create_expectancy_matrix(data_no_alter)
expectancy_matrix

Unnamed: 0,0,1,2
0,0.46789,0.224032,0.052246
1,0.858483,0.486218,0.175528
10,1.114729,0.658349,0.286464
11,1.478387,0.889381,0.407272
100,1.373564,0.935105,0.338966
101,1.780269,1.133007,0.470492
110,1.974039,1.365017,0.54934
111,2.284808,1.550458,0.732575


The expectancy matrix provides a valuable tool for calculating the value of defensive plays involving a catcher. Let's consider a specific scenario: no outs and two runners on base, with one at first base and the other at second base. The value of this state can be determined using the expectancy matrix by referring to the 011 row and 0 column. The corresponding value is 1.478387.

Now, let's imagine that on the next pitch, both runners attempt to steal second and third base. The catcher successfully throws out the runner attempting to steal third base. As a result, there is now a single out and a single runner at second base. To calculate the value of this new state, we can refer to the 010 row and 1 column of the expectancy matrix. The corresponding value is 0.658349.

By subtracting the value of the first play (1.478387) from the value of the second play (0.658349), we obtain a difference of 0.8200379. This difference represents the change in value between the two states.

Therefore, we assign a value of 0.8200379 to the defensive play made by the catcher in this specific scenario. This value reflects the impact of the catcher's successful throw-out on the overall defensive performance.

In [282]:
#Gather relevant catchers and their fielding data from the Lahman database.
#We only consider catchers with at least 3000 innings of play during the 1990's.
from pybaseball.lahman import *
download_lahman()

from pybaseball import playerid_reverse_lookup

# fielding stats by year 
fielding = fielding()
fielding = fielding[(fielding['yearID']>=1990) & (fielding['yearID']<=1999)]
people = people()

# Filter rows for the 1990s
fielding_1990s = fielding[fielding['yearID'].between(1990, 1999)]
catcher_fielding_info = fielding_1990s[(fielding_1990s['POS'] == 'C')].copy()

# Calculate the sum of 'InnOut' values for each unique 'playerID'
sum_innouts_by_player = catcher_fielding_info.groupby('playerID')['InnOuts'].sum()

# Get the player IDs whose sum is at least 3000
players_with_sum_3000 = sum_innouts_by_player[sum_innouts_by_player >= 3000].index

# Filter the original DataFrame based on the player IDs
catcher_fielding_info = catcher_fielding_info[catcher_fielding_info['playerID'].isin(players_with_sum_3000)]


catchers_1990s_playerIDs = catcher_fielding_info['playerID'].unique()
catchers_1990s = []

for ID in catchers_1990s_playerIDs:
    first_name = people[people['playerID'] == ID]['nameFirst'].iloc[0]
    last_name = people[people['playerID'] == ID]['nameLast'].iloc[0]
    name = first_name + ' ' + last_name
    
    catchers_1990s.append([first_name,last_name])


The following code calculates the normalized defensive value for all catchers from the 1990s who played at least 3000 innings as a catcher.

The computation of the normalized defensive value involves the following steps:

Firstly, we determine the average value per defensive play for each catcher in the catchers_1990s list.

Next, we calculate the mean and standard deviation of these average values.

The normalized defensive value for a specific catcher is determined by measuring the number of standard deviations that their average value per defensive play deviates from the mean of all catchers' average values. The results of this computation are stored in a dictionary called catcher_normalized_defensive_performance.

We also find each catcher's caught stealing percentage and store the normalized results in a a dictionary called catcher_caught_stealing_ratio.

In [283]:
from pybaseball import playerid_lookup
from pybaseball import playerid_reverse_lookup
#Compute the normalized defensive value of each catcher
data = data_no_alter[data_no_alter['des'].str.contains('catcher', case=False, na=False)].copy()
data = data.dropna(subset=['events'])
mlb_ids = []
player_IDs=[]
catcher_means=[]
catcher_performance={}
catcher_caught_stealing_ratio={}

for ID in catchers_1990s_playerIDs:
    first_name = people[people['playerID'] == ID]['nameFirst'].iloc[0]
    last_name = people[people['playerID'] == ID]['nameLast'].iloc[0]   
    result = playerid_lookup(last_name.lower(), first_name.lower())
    
    if len(result) == 0:
        continue
        
    mlb_id = result['key_mlbam'].values[0]
    mlb_ids.append(mlb_id)
    player_ID = people[(people['nameFirst'] == first_name) & (people['nameLast'] == last_name)]['playerID'].iloc[0]
    player_IDs.append(player_ID)

    
    #Create data frame where the catcher being examined is behind the plate
    data_player = data[data['fielder_2'] == mlb_id].copy()


    #Index list of events involving the catcher
    index_list = data_player.index.tolist()

    #expectancy_matrix = expectancy_matrix(data)
    
    
    for index in index_list:
        pre_state= current_state(index,data_no_alter)
        post_state = current_state(index+1,data_no_alter)

        #Compute the value of the event if it ends the inning
        if (data_no_alter['inning'][index] != data_no_alter['inning'][index+1]) or (data_no_alter['inning_topbot'][index] != data_no_alter['inning_topbot'][index+1]):
            play_value = -data_no_alter['post_away_score'][index]+data_no_alter['away_score'][index] - data_no_alter['post_home_score'][index]+data_no_alter['home_score'][index]+expectancy_matrix.at[pre_state[0],pre_state[1]]
            all_play_values.append(play_value)


        #Compute the value of the event if it does not end the inning
        else: 
            play_value = - data_no_alter['post_away_score'][index]-data_no_alter['away_score'][index] - data_no_alter['post_home_score'][index]+data_no_alter['home_score'][index]-expectancy_matrix.at[post_state[0],post_state[1]]+expectancy_matrix.at[pre_state[0],pre_state[1]]
            all_play_values.append(play_value)
            
    filtered_data = catcher_fielding_info[catcher_fielding_info['playerID'] == ID]

    # Compute the sum of CS and SB
    cs_sum = filtered_data['CS'].sum()
    sb_sum = filtered_data['SB'].sum()
    
    all_play_valuesdf = pd.DataFrame(all_play_values)
    if len(all_play_valuesdf)!=0 and sb_sum !=0:
        catcher_means.append(all_play_valuesdf.mean()[0])
        catcher_performance[mlb_id] = -all_play_valuesdf.mean()[0]
        cs_percentage = cs_sum / (cs_sum + sb_sum)
        catcher_caught_stealing_ratio[ID] = cs_percentage  
        
            
catcher_meansdf =pd.DataFrame(catcher_means)
catcher_meansdf = catcher_meansdf.dropna() 
mean_catcher_impact = catcher_meansdf.mean()[0]
std_catcher_impact = catcher_meansdf.std()[0]

        
catcher_normalized_defensive_performance = {}
for catcher in catcher_performance.keys():
    catcher_normalized_defensive_performance[catcher] = (catcher_performance[catcher]-mean_catcher_impact)/std_catcher_impact

catcher_caught_stealing_ratiodf = pd.DataFrame(list(catcher_caught_stealing_ratio.items()), columns=['catcher', 'caught_stealing_ratio'])
catcher_caught_stealing_ratiodf = catcher_caught_stealing_ratiodf.dropna()
mean_stealing = catcher_caught_stealing_ratiodf['caught_stealing_ratio'].mean()
std_stealing = catcher_caught_stealing_ratiodf['caught_stealing_ratio'].std()

catcher_normalized_caught_stealing_ratio = {}
for catcher in catcher_caught_stealing_ratio.keys():
    catcher_normalized_caught_stealing_ratio[catcher] = (catcher_caught_stealing_ratio[catcher] - mean_stealing) / std_stealing


    

We examine the defensive performance of several catchers from the 1990s. We focus on two key metrics: normalized defensive performance and normalized caught stealing percentage.

The following code will compute and display the normalized defensive performance and caught stealing percentage for each of the selected catchers. We provide a brief discussion of the findings.

In [333]:
catchers =[['Javy','Lopez'],['Ivan', 'Rodriguez'], ['Mike', 'Piazza'], ['Brian','Harper'], ['Ed','Taubensee'],['Jason','Kendall']]
for catcher in catchers:
    first_name = catcher[0]
    last_name = catcher[1]
    playerID = people[(people['nameFirst'].str.lower() == first_name.lower()) & (people['nameLast'].str.lower() == last_name.lower())]['playerID'].values[0]
    mlb_id = playerid_lookup(last_name.lower(),first_name.lower())['key_mlbam'][0]
    
    print(f'{first_name} {last_name}\'s normalized defensive performance is {catcher_normalized_defensive_performance[mlb_id]}.')
    print(f'{first_name} {last_name}\'s normalized caught stealing percentage is {catcher_normalized_caught_stealing_ratio[playerID]}.')
    
    print('--------------')
    print('--------------')
    
    
    
    
    

Javy Lopez's normalized defensive performance is 1.096450534464186.
Javy Lopez's normalized caught stealing percentage is -0.9345066830196516.
--------------
--------------
Ivan Rodriguez's normalized defensive performance is 1.6483480595234306.
Ivan Rodriguez's normalized caught stealing percentage is 3.247808944157454.
--------------
--------------
Mike Piazza's normalized defensive performance is 1.8878083323064965.
Mike Piazza's normalized caught stealing percentage is -1.1438936115663683.
--------------
--------------
Brian Harper's normalized defensive performance is 0.8286199440904888.
Brian Harper's normalized caught stealing percentage is -0.3952909941969597.
--------------
--------------
Ed Taubensee's normalized defensive performance is 1.0529629361622808.
Ed Taubensee's normalized caught stealing percentage is -1.543096053156842.
--------------
--------------
Jason Kendall's normalized defensive performance is -1.8146353808932179.
Jason Kendall's normalized caught stealing 

In summary, the findings indicate that Ivan Rodriguez, a 13-time Gold Glove award recipient, shows strong performance in both normalized defensive effectiveness and normalized caught stealing percentage. This aligns with his reputation as an exceptional defensive catcher.

On the other hand, Jason Kendall's results suggest lower performance in both metrics. This is consistent with the understanding that Kendall was not considered a standout defensive catcher during his career.

Overall, the analysis can be used to provide insights into the defensive effectiveness of catchers in the 1990s.

# Correlation between Defensive Effectiveness and Caught Stealing Values

In [335]:
def compute_correlation(dictionary1, dictionary2):
    # Convert the dictionaries to numpy arrays
    values1 = np.array(list(dictionary1.values()))
    values2 = np.array(list(dictionary2.values()))

    # Calculate the correlation coefficient
    correlation = np.corrcoef(values1, values2)[0, 1]

    return correlation

# Compute the correlation
correlation = compute_correlation(catcher_normalized_defensive_performance, catcher_normalized_caught_stealing_ratio)

print(f"The correlation between normalized defensive performance and normalized caught stealing ratio is: {correlation}.")


The correlation between normalized defensive performance and normalized caught stealing ratio is: -0.034104123054837274.


# Correlation between Defensive Performance and Caught Stealing Ratio

In our analysis, we examined the correlation between the normalized defensive performance values and the normalized caught stealing ratio of catchers from the 1990s. The computed correlation coefficient between these two metrics is approximately -0.03.

This correlation value suggests that there is a weak and negligible relationship between the normalized defensive performance and the caught stealing ratio. From a probabilistic standpoint, these two metrics can be considered uncorrelated. Therefore, the normalized defensive performance has the potential to serve as an independent and unique indicator of a catcher's defensive effectiveness.