## Situational Pitching Behavior ##

# Imports and Data Query #
For this project I use data from the entire 2017 regular season. The query takes a very long time to run and should only be run if final_pitching_data.csv is not already available.

In [None]:
from pybaseball import statcast
import pandas as pd
import matplotlib.pyplot as plt
from pybaseball import playerid_lookup

In [None]:
# query - very large, only run if there is no existing data csv
data = statcast(start_dt='2017-04-01', end_dt='2017-10-01')

Read the data from the csv into a dataframe.

In [None]:
# write query data to a csv if necessary, otherwise read from the csv
# data.to_csv("final_pitching_data.csv")
data = pd.read_csv("final_pitching_data.csv") 
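
The "query only if the csv is missing" rule above can also be made automatic. The sketch below is a minimal illustration and not part of the original notebook: `load_or_query` and its `query_fn` parameter are hypothetical names, and the demo uses a stand-in query function with made-up rows so that it runs without network access. In this notebook, `query_fn` could be something like `lambda: statcast(start_dt='2017-04-01', end_dt='2017-10-01')`.

```python
import os
import tempfile

import pandas as pd


def load_or_query(csv_path, query_fn):
    """Return cached data from csv_path, or call query_fn and cache its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = query_fn()
    # index=False avoids writing an extra unnamed index column to the cache
    frame.to_csv(csv_path, index=False)
    return frame


# Stand-in for the slow Statcast query (made-up data, for illustration only)
calls = []


def fake_query():
    calls.append(1)
    return pd.DataFrame({"pitcher": [1, 1, 2], "pitch_type": ["FF", "SL", "CU"]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_pitching_data.csv")
    first = load_or_query(path, fake_query)   # runs the query and writes the cache
    second = load_or_query(path, fake_query)  # reads the cache; no second query
```

With this pattern the expensive query runs at most once per cache file, and later runs read the csv directly.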

# Core Functions #
    - lookup: responsible for pulling the specific pitching data from the dataframe
    - isolate_data: statistical preprocessing of the lookup data
    - process_data: shape the data into structures that can be easily displayed
    - display_results: generate plots and print statements of processed data
    - driver: driver function for submitting a specific situation query

In [None]:
def lookup(playerID, on_1b, on_2b, on_3b):
    columns = ['pitcher', 'at_bat_number', 'pitch_type', 'type', 'plate_x', 'plate_z',
               'strikes', 'balls', 'on_1b', 'on_2b', 'on_3b']
    # start with every pitch thrown by this pitcher
    mask = data['pitcher'] == playerID
    # a base is occupied when its column is not null and empty when it is null
    for base, occupied in (('on_1b', on_1b), ('on_2b', on_2b), ('on_3b', on_3b)):
        mask = mask & (data[base].notnull() if occupied else data[base].isnull())
    situation_data = data.loc[mask, columns]
    return situation_data
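
To see how the null checks on `on_1b`, `on_2b`, and `on_3b` encode the eight possible base situations, it can help to label every pitch with its base state at once. The following is a small sketch on made-up data, not part of the original notebook; the `base_state` helper and the `'1-3'`-style labels are hypothetical conventions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the Statcast frame: on_1b/on_2b/on_3b hold a runner ID
# when the base is occupied and are null when it is empty (values are made up).
toy = pd.DataFrame({
    "pitcher": [10, 10, 10, 10, 10],
    "on_1b": [None, 501.0, 501.0, None, 501.0],
    "on_2b": [None, None, 502.0, 502.0, 502.0],
    "on_3b": [None, None, None, 503.0, 503.0],
})


def base_state(frame):
    """Label each pitch with a three-character base state such as '1-3'."""
    first = frame["on_1b"].notnull().map({True: "1", False: "-"})
    second = frame["on_2b"].notnull().map({True: "2", False: "-"})
    third = frame["on_3b"].notnull().map({True: "3", False: "-"})
    return first + second + third


toy["base_state"] = base_state(toy)
# Number of pitches seen in each base state
pitches_per_state = toy["base_state"].value_counts().to_dict()
```

A table like `pitches_per_state` gives a quick check of how many data points exist for each situation before submitting a query for it.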

In [None]:
def isolate_data(sit_data):
    # count pitch types and outcomes and compute the mean pitch location for this situation
    pitches = {}
    descriptions = {}
    pitch_type_counts = sit_data.groupby(['pitch_type']).count().reset_index()    
    total_pitches = pitch_type_counts['pitcher'].sum()

    #add the pitch types and their frequency to a dictionary 
    for index, row in pitch_type_counts.iterrows():
        pitches[row['pitch_type']] = row['pitcher']
        
    # get mean location data
    x_location_mean = sit_data['plate_x'].mean()
    z_location_mean = sit_data['plate_z'].mean()
 
    # get number of strikes/balls/balls in play thrown in situation
    pitch_description_counts = sit_data.groupby(['type']).count().reset_index()
    for index, row in pitch_description_counts.iterrows():
        descriptions[row['type']] = row['pitcher']
    
    return total_pitches, pitches, descriptions, x_location_mean, z_location_mean
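
The same counts can be produced without `groupby` and `iterrows`. The sketch below, on made-up pitches, uses `Series.value_counts()` as one possible alternative; it is an illustration rather than the notebook's implementation, and the toy values are assumptions.

```python
import pandas as pd

# Made-up pitches for a single pitcher in a single situation
sit_data = pd.DataFrame({
    "pitcher": [10] * 6,
    "pitch_type": ["FF", "FF", "FF", "SL", "SL", "CU"],
    "type": ["S", "B", "S", "X", "S", "B"],
    "plate_x": [0.5, -0.5, 0.0, 1.0, -1.0, 0.0],
    "plate_z": [2.0, 3.0, 2.5, 1.5, 2.5, 3.5],
})

# value_counts() returns one count per distinct value, skipping missing values
pitches = sit_data["pitch_type"].value_counts().to_dict()
descriptions = sit_data["type"].value_counts().to_dict()
total_pitches = sum(pitches.values())

# Mean horizontal and vertical location of the pitches
x_location_mean = sit_data["plate_x"].mean()
z_location_mean = sit_data["plate_z"].mean()
```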

In [None]:
def process_data(isol_data):
    # if there is data for this situation, isolate the data points we are looking for
    tot_pitches, pitch_counts, pitch_descriptions, x_loc_mean, z_loc_mean = isolate_data(isol_data)
    # dictionary to store the types of pitches they could throw and the relative frequency of each
    pitch_freqs = {}
    for key, value in pitch_counts.items():
        pitch_freqs[key] = value/tot_pitches

    # dictionary to store outcome percentages
    outcome_freqs = {}
    for key, value in pitch_descriptions.items():
        outcome_freqs[key] = value/tot_pitches

    return tot_pitches, pitch_freqs, outcome_freqs, x_loc_mean, z_loc_mean
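
One detail worth checking in the frequency calculation: the total is summed over pitch-type groups, and pandas `groupby` drops rows whose key is missing by default. If any pitch has no recorded `pitch_type`, the outcome frequencies are divided by a total that is too small and can sum to more than 1. The sketch below demonstrates this on made-up data; whether the 2017 data actually contains such rows is not verified here, and the adjustment at the end is only one possible choice.

```python
import pandas as pd

# Made-up data: the last pitch has no recorded pitch_type
sit_data = pd.DataFrame({
    "pitcher": [10, 10, 10, 10],
    "pitch_type": ["FF", "FF", "SL", None],
    "type": ["S", "B", "S", "X"],
})

# Same counting approach as isolate_data: groupby drops rows whose key is missing
by_pitch = sit_data.groupby(["pitch_type"]).count().reset_index()
tot_pitches = by_pitch["pitcher"].sum()  # counts 3 of the 4 rows

by_outcome = sit_data.groupby(["type"]).count().reset_index()
outcome_counts = dict(zip(by_outcome["type"], by_outcome["pitcher"]))
outcome_freqs = {k: v / tot_pitches for k, v in outcome_counts.items()}
freq_sum = sum(outcome_freqs.values())  # exceeds 1 because the total is too small

# One possible adjustment: divide outcome counts by the number of rows instead
adjusted = {k: v / len(sit_data) for k, v in outcome_counts.items()}
```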
       

In [None]:
def display_results(p_data):
    print("There are", p_data[0], "data points for this pitcher in this situation")
    
    print("The odds that the pitcher will throw a specific pitch in this situation are the following \n", p_data[1])
    plt.bar(range(len(p_data[1])), list(p_data[1].values()), align='center')
    plt.xticks(range(len(p_data[1])), list(p_data[1].keys()))
    plt.title("Pitch Frequencies")
    plt.xlabel("Pitch Type")
    plt.ylabel("Frequency")
    plt.show()
    
    print("The odds of a specific outcome (B - Ball, S - Strike, X - Ball In Play) are the following \n", p_data[2])
    plt.bar(range(len(p_data[2])), list(p_data[2].values()), align='center')
    plt.xticks(range(len(p_data[2])), list(p_data[2].keys()))
    plt.title("Outcome Frequencies")
    plt.xlabel("Result of Pitch")
    plt.ylabel("Frequency")
    plt.show()
    
    # get the average location of where the pitch will be thrown based off the mean x (horizontal) location and mean z (vertical) location
    print("The average position of the ball over the plate, from the catcher's perspective,\n (0 in the x direction is directly over the center of the plate horizontally; the vertical value indicates how high the ball is above the plate) \n in this situation for this pitcher is \n(", p_data[3], ",", p_data[4],")")
    plt.plot(p_data[3], p_data[4], 'ro', markersize=15)
    plt.grid(color='black', linestyle='-', linewidth=.8)
    plt.axis([-5, 5, -5, 5])
    plt.xlabel("Horizontal Position")
    plt.ylabel("Vertical Position")
    plt.title("Average Pitch Location")
    plt.show()

In [None]:
def driver(first, last, on_1b, on_2b, on_3b):
    playerID = playerid_lookup(last, first)
    pid = playerID.iloc[0]['key_mlbam']
    # call the lookup function with the specified situational parameters
    SPB_data = lookup(pid, on_1b, on_2b, on_3b)
    # if the length of the returned data frame is 0 there is no data for that specific situation
    if (len(SPB_data) == 0):
        print("No data available for described situation")
        return
    
    # process the isolated data
    processed_data = process_data(SPB_data)

    # display results
    display_results(processed_data)
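
Note that `driver` takes the first row of the `playerid_lookup` result, so a misspelled name that returns an empty table would raise an `IndexError` at `iloc[0]`. The sketch below shows one way to guard against that. It is an assumption-laden illustration: `first_mlbam_id` is a hypothetical helper, and the stand-in frames use made-up IDs and only the columns needed here.

```python
import pandas as pd


def first_mlbam_id(lookup_result):
    """Return the first key_mlbam in a lookup table, or None if it is empty."""
    if lookup_result.empty:
        return None
    return int(lookup_result.iloc[0]["key_mlbam"])


# Stand-ins for playerid_lookup output (made-up IDs, reduced set of columns)
found = pd.DataFrame({"name_last": ["doe"], "name_first": ["john"], "key_mlbam": [123456]})
missing = pd.DataFrame({"name_last": [], "name_first": [], "key_mlbam": []})

found_id = first_mlbam_id(found)      # ID from the first matching row
missing_id = first_mlbam_id(missing)  # None, so the caller can print a message and return
```

In `driver`, a `None` result could be handled the same way as the existing "No data available" case.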

## Demonstration ##
Below is a demonstration of how this application can be used. I provide examples that show how pitchers change their behavior as situations change, but the application can be applied to any pitcher, not just the ones in my demonstration.

## Kyle Freeland ##

In [None]:
# first name, last name, on_1b, on_2b, on_3b
driver("kyle", "freeland", False, False, False)

In [None]:
# first name, last name, on_1b, on_2b, on_3b
driver("kyle", "freeland", True, True, False)

In [None]:
# first name, last name, on_1b, on_2b, on_3b
driver("kyle", "freeland", True, True, True)

## Statistical Comparison ##
SPB is very difficult to compare to traditional statistics. Instead, I take mlb.com's list of the most clutch pitchers of all time (https://www.mlb.com/news/best-clutch-pitchers-in-mlb-history-c298312438) and look at how those pitchers performed in high-pressure situations in the 2017 data from the SPB perspective.

## Justin Verlander ##
Verlander is credited with a very low career ERA and is said to be a very clutch pitcher. I will use SPB to show how Verlander has performed in three different high-pressure situations.

#### Bases Loaded ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# bases loaded
driver("justin", "verlander", True, True, True)

#### Runners on Second and Third ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# runners on second and third 
driver("justin", "verlander", False, True, True)

#### Runners on First and Second ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# runner on first and second 
driver("justin", "verlander", True, True, False)

## Analysis ##

From an intuitive perspective, the three situations above all support the case that Justin Verlander is a "clutch" pitcher. With the bases loaded he throws strikes 57% of the time. With runners on second and third he throws strikes 45% of the time. With runners on first and second he throws strikes 49% of the time. The data also shows that in these high-pressure situations Verlander strongly favors the four-seam fastball (FF), which can be useful for batters who have to face him in these situations. Verlander also has a very low ball-in-play percentage in all three situations. Typically in these high-pressure situations Verlander throws the pitch straight over the center of the plate.

## Jake Arrieta ##

#### Bases Loaded ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# bases loaded
driver("jake", "arrieta", True, True, True)

#### Runners on Second and Third ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# runners on second and third
driver("jake", "arrieta", False, True, True)

#### Runners on First and Second ####

In [None]:
# first name, last name, on_1b, on_2b, on_3b
# runners on first and second
driver("jake", "arrieta", True, True, False)

## Analysis ##
We can see that Jake Arrieta has a lower strike percentage than Verlander and a heavy preference for throwing sliders in these high-pressure situations. This is very useful information for batters who have to face Arrieta; in fact, Arrieta throws almost no four-seam fastballs in these situations. The one similarity between Arrieta and Verlander is their tendency to throw directly over the plate in these situations.