# Exploratory Analysis: MLB Pitch Selection

<img src="https://calltothepen.com/wp-content/uploads/getty-images/2018/08/1070840128.jpeg" width="500px">
https://calltothepen.com/wp-content/uploads/getty-images/2018/08/1070840128.jpeg


This is my first data-related project, so I wanted to tackle something relatively simple that is still educational for baseball fans! Credit to https://www.kaggle.com/pschale/mlb-pitch-data-20152018 for providing the data used in this analysis. One line in the 2019_pitches csv appeared to have a wrong outcome code that broke my analysis, so I edited that one line and am using the fixed file instead of the original. Any suggestions on how to expand this topic or improve the analysis are much appreciated, as I'm still doing my best to learn more!

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import matplotlib as mpl

For my analysis, I am only looking at pitches thrown in the 2019 MLB season, so only the 2019 files are read in.

In [None]:
pitches = pd.read_csv("../input/2019-pitches-fixed/2019_pitches.csv") #Includes pitch types,location,outcome for every pitch etc.
at_bats = pd.read_csv("../input/mlb-pitch-data-20152018/2019_atbats.csv") #Includes the result of each atbat

# Intro:
The point of this project is to conduct exploratory data analysis to see which pitches are best at getting a specific type of outcome when the ball is put in play, so strikeouts and pickoffs are not included. A ball put in play is either considered a groundball or something in the air, such as a popup or a flyout. I also count homeruns as flyballs, since a homerun is guaranteed to be hit in the air. With this in mind, I'll split the main pandas DataFrame between groundouts and flyouts/homeruns.

In [None]:
outs = ['Flyout','Groundout','Pop Out','Forceout','Sac Fly','Lineout','Fielders Choice Out','Homerun','Grounded Into DP','Bunt Groundout','Bunt Lineout','Sac Fly Double Play']
ground = ['Groundout','Forceout','Fielders Choice Out','Grounded Into DP','Bunt Groundout'] #Any outcome that involves a groundball
air = ['Flyout','Pop Out','Sac Fly','Lineout','Homerun','Bunt Lineout','Sac Fly Double Play'] #Any outcome that involves a flyball
results_groundballs = at_bats[at_bats['event'].isin(ground)]
results_flyballs = at_bats[at_bats['event'].isin(air)]
results = at_bats[at_bats['event'].isin(outs)] #For plotting purposes later


Code X means the ball was put in play resulting in an out, and E means the ball was put in play resulting in runs, which covers sacrifice flies and homeruns. This block also connects each at-bat from the at_bats csv to the pitch in the 2019 pitches csv that produced the outcome.

In [None]:
pitchTypes_ground = pitches[((pitches['code']=='X') | (pitches['code'] == 'E')) & pitches['ab_id'].isin(results_groundballs['ab_id'])]
pitchTypes_air = pitches[((pitches['code']=='X') | (pitches['code'] == 'E')) & pitches['ab_id'].isin(results_flyballs['ab_id'])]
pitchTypes_total = pitches[((pitches['code']=='X') | (pitches['code'] == 'E')) & pitches['ab_id'].isin(results['ab_id'])]
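The same link can also be sketched as a merge, which carries the at-bat's `event` onto the pitch row instead of only filtering by `ab_id`. This is just an illustration, not a replacement for the cell above: the tiny frames are made-up stand-ins for the real CSVs, and the `batted_ball` column and the `_demo` names are introduced here as assumptions.

```python
import pandas as pd

# Toy stand-ins for the real CSVs (values are made up for illustration).
pitches_demo = pd.DataFrame({
    "ab_id": [1, 1, 2, 3, 3],
    "code": ["B", "X", "X", "S", "E"],
    "pitch_type": ["FF", "SI", "CU", "FF", "FF"],
})
at_bats_demo = pd.DataFrame({
    "ab_id": [1, 2, 3],
    "event": ["Groundout", "Flyout", "Homerun"],
})

ground_events = {"Groundout", "Forceout", "Fielders Choice Out",
                 "Grounded Into DP", "Bunt Groundout"}

# Keep only the pitch that was put in play, then attach the at-bat result.
in_play = pitches_demo[pitches_demo["code"].isin(["X", "E"])]
linked = in_play.merge(at_bats_demo, on="ab_id", how="inner")

# Label each in-play pitch as a ground ball or an air ball.
linked["batted_ball"] = linked["event"].apply(
    lambda e: "ground" if e in ground_events else "air"
)
print(linked[["ab_id", "pitch_type", "event", "batted_ball"]])
```

With one labelled frame, later steps can group by `pitch_type` and `batted_ball` instead of keeping separate ground and air frames.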


To start simple, I can analyze the three main groups of pitches: fastballs, curveballs, and offspeed pitches. Curveballs can be considered offspeed, but I believe they are distinct enough from other offspeed pitches to be treated separately.

In [None]:
fastballs = ['FC','FF','FT','SI'] #Cutter, 4-seam,2-seam,sinker
curveballs = ['CU','EP','KC','SC'] #Curveball,Eephus,Knucklecurve,Screwball
offspeed = ['CH','FS','KN','SL'] #Changeup, Splitter, Knuckleball,Slider
ground = ['Groundout','Forceout','Fielders Choice Out','Grounded Into DP','Bunt Groundout'] #Any outcome that involves a groundball
air = ['Flyout','Pop Out','Sac Fly','Lineout','Homerun','Bunt Lineout','Sac Fly Double Play'] #Any outcome that involves a flyball

fastball_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'].isin(fastballs)]
fastball_air = pitchTypes_air[pitchTypes_air['pitch_type'].isin(fastballs)]
offspeed_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'].isin(offspeed)]
offspeed_air = pitchTypes_air[pitchTypes_air['pitch_type'].isin(offspeed)]
curveballs_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'].isin(curveballs)]
curveballs_air = pitchTypes_air[pitchTypes_air['pitch_type'].isin(curveballs)]
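As a possible alternative to keeping three separate lists, the grouping can be sketched as a single lookup table applied with `Series.map`. The `PITCH_FAMILY` dictionary, the `family` column, and the demo frame are names made up for this sketch, and `XX` is a placeholder for any code outside the three groups.

```python
import pandas as pd

# Pitch code -> pitch group, using the same grouping as the lists above.
PITCH_FAMILY = {
    "FC": "fastball", "FF": "fastball", "FT": "fastball", "SI": "fastball",
    "CU": "curveball", "EP": "curveball", "KC": "curveball", "SC": "curveball",
    "CH": "offspeed", "FS": "offspeed", "KN": "offspeed", "SL": "offspeed",
}

# Made-up pitch codes; "XX" stands for a code that belongs to no group.
demo = pd.DataFrame({"pitch_type": ["FF", "SL", "KC", "SI", "CH", "XX"]})

# Codes missing from the dictionary become NaN, which makes them easy to spot.
demo["family"] = demo["pitch_type"].map(PITCH_FAMILY)
print(demo["family"].value_counts(dropna=False))
```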



In [None]:
totalg = len(fastball_ground)+len(offspeed_ground)+len(curveballs_ground)
totala = len(fastball_air)+len(offspeed_air)+len(curveballs_air)
total=len(fastball_ground)+len(offspeed_ground)+len(curveballs_ground) + len(fastball_air)+len(offspeed_air)+len(curveballs_air)
print(totala,"total air datapoints for analysis")
print(totalg,"total ground datapoints for analysis")

This block displays the first set of results, showing the best ways of getting groundballs and flyballs.

In [None]:
gfb = (len(fastball_ground)/totalg)*((len(fastball_ground)+len(offspeed_ground)+len(curveballs_ground))/total)
gos = (len(offspeed_ground)/totalg)*((len(fastball_ground)+len(offspeed_ground)+len(curveballs_ground))/total)
gcv =  (len(curveballs_ground)/totalg)*((len(fastball_ground)+len(offspeed_ground)+len(curveballs_ground))/total)

afb = (len(fastball_air)/totala)*((len(fastball_air)+len(offspeed_air)+len(curveballs_air))/total)
aos = (len(offspeed_air)/totala)*((len(fastball_air)+len(offspeed_air)+len(curveballs_air))/total)
acv = (len(curveballs_air)/totala)*((len(fastball_air)+len(offspeed_air)+len(curveballs_air))/total)

print("Percentage of groundballs:", (totalg)/total*100)
print("Percentage of flyballs:",(totala)/total*100)
print()
print("Percentage of groundballs via fastball:",len(fastball_ground)/totalg*100)
print("Percentage of groundballs via offspeed:",len(offspeed_ground)/totalg*100)
print("Percentage of groundballs via curveball:",len(curveballs_ground)/totalg*100)
print("Percentage of flyballs via fastball:",len(fastball_air)/totala*100)
print("Percentage of flyballs via offspeed:",len(offspeed_air)/totala*100)
print("Percentage of flyballs via curveball:",len(curveballs_air)/totala*100)
print()
print("Odds of groundball and fb:",gfb)
print("Odds of groundball and os:",gos)
print("Odds of groundball and cv:",gcv)
print()
print("Odds of flyball and fb:",afb)
print("Odds of flyball and os:",aos) 
print("Odds of flyball and cv:",acv)

These percentages are relative to the total number of air and ground data points, so they don't yet accurately represent the true outcome odds for a given pitch. Bayes' Theorem will be useful here.
<img src="https://miro.medium.com/max/1994/1*CnoTGGO7XeUpUMeXDrIfvA.png" width="500px">
https://miro.medium.com/max/1994/1*CnoTGGO7XeUpUMeXDrIfvA.png


In [None]:
def conditional(TOT,OPP):
    #TOT is the numerator of Bayes' equation: P(pitch|outcome)*P(outcome).
    #OPP is the same product for the opposite outcome, so TOT+OPP is the denominator, P(pitch).
    return (TOT/(TOT+OPP))
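To make the arithmetic concrete, here is a small sketch with hypothetical counts (not taken from the dataset). It applies the same Bayes form and compares it with the direct ratio of groundball fastballs to all fastballs put in play. The two should agree, because the outcome totals cancel out.

```python
def conditional(tot, opp):
    """Share of `tot` in `tot + opp` (same form as the notebook's helper)."""
    return tot / (tot + opp)

# Hypothetical counts, chosen only to illustrate the arithmetic.
n_fb_ground, n_fb_air = 300, 500      # fastballs put in play, by outcome
n_ground_all, n_air_all = 1000, 1000  # all balls in play, by outcome
n_all = n_ground_all + n_air_all

# Bayes form: P(fastball | ground) * P(ground), and the same for air.
joint_ground = (n_fb_ground / n_ground_all) * (n_ground_all / n_all)
joint_air = (n_fb_air / n_air_all) * (n_air_all / n_all)
p_bayes = conditional(joint_ground, joint_air)

# Direct form: groundball fastballs over all fastballs put in play.
p_direct = n_fb_ground / (n_fb_ground + n_fb_air)

print(p_bayes, p_direct)
```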

In [None]:
print("Odds of groundball given fastball:",conditional(gfb,afb))
print("Odds of groundball given offspeed:",conditional(gos,aos))
print("Odds of groundball given curveball:",conditional(gcv,acv))
print()
print("Odds of flyball given fastball:",conditional(afb,gfb))
print("Odds of flyball given offspeed:",conditional(aos,gos))
print("Odds of flyball given curveball:",conditional(acv,gcv))

In [None]:
labels= 'Groundballs','Flyballs'
sizes = [conditional(gfb,afb),conditional(afb,gfb)]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Fastballs')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional(gos,aos),conditional(aos,gos)]
colors = ['navajowhite','aquamarine']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Offspeed')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional(gcv,acv),conditional(acv,gcv)]
colors = ['cornsilk','slateblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Curveball')
plt.show()




Based on these initial results, some type of fastball is the best bet for drawing a flyball or possibly a popout. Note this could also include line drives. Offspeed pitches are essentially a toss-up if you're not looking for a specific result. Curveballs are the best bet for groundballs, especially since most curveballs are located in the lower part of the strike zone.

Shouldn't sinkers cause more groundballs, though? Shouldn't different types of offspeed pitches produce different results? The next part of the project repeats the process, this time looking at each specific pitch.

If you're not sure what a certain pitch is or how it acts, you can read about them here: https://en.wikipedia.org/wiki/Pitch_(baseball)

In [None]:
#Split the pandas DataFrames by specific pitch type
cutter_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'FC']
cutter_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'FC']

fourseam_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'FF']
fourseam_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'FF']

twoseam_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'FT']
twoseam_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'FT']

sinker_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'SI']
sinker_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'SI']

curveball_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'CU']
curveball_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'CU']

eephus_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'EP']
eephus_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'EP']

knucklecurve_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'KC']
knucklecurve_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'KC']

screwball_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'SC']
screwball_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'SC']

changeup_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'CH']
changeup_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'CH']

splitter_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'FS']
splitter_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'FS']

knuckleball_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'KN']
knuckleball_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'KN']

slider_ground = pitchTypes_ground[pitchTypes_ground['pitch_type'] == 'SL']
slider_air = pitchTypes_air[pitchTypes_air['pitch_type'] == 'SL']

This just lists how many datapoints I have for each pitch so I can see which ones I can work with.

In [None]:
print(len(cutter_ground),len(fourseam_ground),len(twoseam_ground),len(sinker_ground),len(curveball_ground),len(eephus_ground),len(knucklecurve_ground),len(screwball_ground),len(changeup_ground),len(splitter_ground),len(knuckleball_ground),len(slider_ground))
print(len(cutter_air),len(fourseam_air),len(twoseam_air),len(sinker_air),len(curveball_air),len(eephus_air),len(knucklecurve_air),len(screwball_air),len(changeup_air),len(splitter_air),len(knuckleball_air),len(slider_air))

It's generally a good idea to have at least 15 datapoints for both ground and air, because the smaller the sample size, the greater the potential error in our findings, so any pitch with fewer than that for either ground or air is excluded from the analysis. Based on the numbers above, I won't include eephus pitches, screwballs, or knuckleballs. A few pitchers occasionally throw an eephus, and screwballers and knuckleballers are pretty much extinct in MLB right now, but that could change in the future.
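The cutoff could also be applied programmatically rather than by reading the printed counts. The sketch below assumes a frame with one row per in-play pitch and a ground/air label. The `demo` data, the `batted_ball` column, and `MIN_SAMPLES` are made up for illustration.

```python
import pandas as pd

MIN_SAMPLES = 15  # threshold used in the text

# Made-up in-play pitches: pitch type plus a ground/air label.
demo = pd.DataFrame({
    "pitch_type": ["FF"] * 40 + ["SI"] * 35 + ["EP"] * 5,
    "batted_ball": (["ground"] * 18 + ["air"] * 22      # FF
                    + ["ground"] * 20 + ["air"] * 15    # SI
                    + ["ground"] * 2 + ["air"] * 3),    # EP
})

# One row per pitch type, one column per outcome, cells are counts.
counts = pd.crosstab(demo["pitch_type"], demo["batted_ball"])

# Keep pitch types with enough samples in BOTH columns.
usable = counts[(counts >= MIN_SAMPLES).all(axis=1)]
print(usable)
```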

In [None]:
ground_total = len(cutter_ground)+len(fourseam_ground)+len(twoseam_ground)+len(sinker_ground)+len(curveball_ground)+len(knucklecurve_ground)+len(changeup_ground)+len(splitter_ground)+len(slider_ground)
air_total = len(cutter_air)+len(fourseam_air)+len(twoseam_air)+len(sinker_air)+len(curveball_air)+len(knucklecurve_air)+len(changeup_air)+len(splitter_air)+len(slider_air)
total = ground_total + air_total

print("Percentage of groundballs: ",ground_total/total*100)
print("Percentage of flyballs:",air_total/total*100)
print()
print("Percentage of groundballs via cutter",len(cutter_ground)/ground_total*100)
print("Percentage of groundballs via fourseam",len(fourseam_ground)/ground_total*100)
print("Percentage of groundballs via twoseam",len(twoseam_ground)/ground_total*100)
print("Percentage of groundballs via sinker",len(sinker_ground)/ground_total*100)
print("Percentage of groundballs via curveball",len(curveball_ground)/ground_total*100)
print("Percentage of groundballs via knucklecurve",len(knucklecurve_ground)/ground_total*100)
print("Percentage of groundballs via changeup",len(changeup_ground)/ground_total*100)
print("Percentage of groundballs via splitter",len(splitter_ground)/ground_total*100)
print("Percentage of groundballs via slider",len(slider_ground)/ground_total*100)
print()
print("Percentage of flyballs via cutter",len(cutter_air)/air_total*100)
print("Percentage of flyballs via fourseam",len(fourseam_air)/air_total*100)
print("Percentage of flyballs via twoseam",len(twoseam_air)/air_total*100)
print("Percentage of flyballs via sinker",len(sinker_air)/air_total*100)
print("Percentage of flyballs via curveball",len(curveball_air)/air_total*100)
print("Percentage of flyballs via knucklecurve",len(knucklecurve_air)/air_total*100)
print("Percentage of flyballs via changeup",len(changeup_air)/air_total*100)
print("Percentage of flyballs via splitter",len(splitter_air)/air_total*100)
print("Percentage of flyballs via slider",len(slider_air)/air_total*100)

Like before, we'll apply Bayes' Theorem to get the odds of a groundball or flyball given the specific pitch. Since there are only two possible outcomes in my analysis, I'll simply compute 1 minus the odds of a groundball to find the flyball odds.

In [None]:
print("Odds of grounder given cutter:",conditional((len(cutter_ground)/ground_total)*(ground_total/total),(len(cutter_air)/air_total)*(air_total/total)))
print("Odds of flyball given cutter:",1-conditional((len(cutter_ground)/ground_total)*(ground_total/total),(len(cutter_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given fourseam:",conditional((len(fourseam_ground)/ground_total)*(ground_total/total),(len(fourseam_air)/air_total)*(air_total/total)))
print("Odds of flyball given fourseam:",1-conditional((len(fourseam_ground)/ground_total)*(ground_total/total),(len(fourseam_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given twoseam:",conditional((len(twoseam_ground)/ground_total)*(ground_total/total),(len(twoseam_air)/air_total)*(air_total/total)))
print("Odds of flyball given twoseam:",1-conditional((len(twoseam_ground)/ground_total)*(ground_total/total),(len(twoseam_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given sinker:",conditional((len(sinker_ground)/ground_total)*(ground_total/total),(len(sinker_air)/air_total)*(air_total/total)))
print("Odds of flyball given sinker:",1-conditional((len(sinker_ground)/ground_total)*(ground_total/total),(len(sinker_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given curveball:",conditional((len(curveball_ground)/ground_total)*(ground_total/total),(len(curveball_air)/air_total)*(air_total/total)))
print("Odds of flyball given curveball:",1-conditional((len(curveball_ground)/ground_total)*(ground_total/total),(len(curveball_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given knucklecurve:",conditional((len(knucklecurve_ground)/ground_total)*(ground_total/total),(len(knucklecurve_air)/air_total)*(air_total/total)))
print("Odds of flyball given knucklecurve:",1-conditional((len(knucklecurve_ground)/ground_total)*(ground_total/total),(len(knucklecurve_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given changeup:",conditional((len(changeup_ground)/ground_total)*(ground_total/total),(len(changeup_air)/air_total)*(air_total/total)))
print("Odds of flyball given changeup:",1-conditional((len(changeup_ground)/ground_total)*(ground_total/total),(len(changeup_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given splitter:",conditional((len(splitter_ground)/ground_total)*(ground_total/total),(len(splitter_air)/air_total)*(air_total/total)))
print("Odds of flyball given splitter:",1-conditional((len(splitter_ground)/ground_total)*(ground_total/total),(len(splitter_air)/air_total)*(air_total/total)))
print()
print("Odds of grounder given slider:",conditional((len(slider_ground)/ground_total)*(ground_total/total),(len(slider_air)/air_total)*(air_total/total)))
print("Odds of flyball given slider:",1-conditional((len(slider_ground)/ground_total)*(ground_total/total),(len(slider_air)/air_total)*(air_total/total)))
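The repeated print statements could be condensed into a loop. In the expression above, the `ground_total`, `air_total`, and `total` factors cancel, so each result reduces to grounders divided by grounders plus flyballs for that pitch. The sketch below uses hypothetical counts. In the notebook they would come from values such as `len(cutter_ground)` and `len(cutter_air)`. The `ground_share` helper and `demo_counts` dictionary are names made up here.

```python
def ground_share(n_ground, n_air):
    """Probability of a grounder given the pitch, from the two counts."""
    return n_ground / (n_ground + n_air)

# Hypothetical (ground, air) counts per pitch, for illustration only.
demo_counts = {
    "cutter": (120, 180),
    "sinker": (300, 200),
    "slider": (210, 240),
}

for name, (n_ground, n_air) in demo_counts.items():
    p = ground_share(n_ground, n_air)
    print(f"Odds of grounder given {name}: {p:.3f}")
    print(f"Odds of flyball given {name}: {1 - p:.3f}")
    print()
```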



In [None]:
labels= 'Groundballs','Flyballs'
sizes = [conditional((len(cutter_ground)/ground_total)*(ground_total/total),(len(cutter_air)/air_total)*(air_total/total)),1-conditional((len(cutter_ground)/ground_total)*(ground_total/total),(len(cutter_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Cutters')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(fourseam_ground)/ground_total)*(ground_total/total),(len(fourseam_air)/air_total)*(air_total/total)),1-conditional((len(fourseam_ground)/ground_total)*(ground_total/total),(len(fourseam_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Fourseamers')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(twoseam_ground)/ground_total)*(ground_total/total),(len(twoseam_air)/air_total)*(air_total/total)),1-conditional((len(twoseam_ground)/ground_total)*(ground_total/total),(len(twoseam_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Twoseamers')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(sinker_ground)/ground_total)*(ground_total/total),(len(sinker_air)/air_total)*(air_total/total)),1-conditional((len(sinker_ground)/ground_total)*(ground_total/total),(len(sinker_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Sinkers')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(curveball_ground)/ground_total)*(ground_total/total),(len(curveball_air)/air_total)*(air_total/total)),1-conditional((len(curveball_ground)/ground_total)*(ground_total/total),(len(curveball_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Curveballs')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(knucklecurve_ground)/ground_total)*(ground_total/total),(len(knucklecurve_air)/air_total)*(air_total/total)),1-conditional((len(knucklecurve_ground)/ground_total)*(ground_total/total),(len(knucklecurve_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Knuckle-Curveballs')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(changeup_ground)/ground_total)*(ground_total/total),(len(changeup_air)/air_total)*(air_total/total)),1-conditional((len(changeup_ground)/ground_total)*(ground_total/total),(len(changeup_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Changeups')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(splitter_ground)/ground_total)*(ground_total/total),(len(splitter_air)/air_total)*(air_total/total)),1-conditional((len(splitter_ground)/ground_total)*(ground_total/total),(len(splitter_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Splitters')
plt.show()

labels= 'Groundballs','Flyballs'
sizes = [conditional((len(slider_ground)/ground_total)*(ground_total/total),(len(slider_air)/air_total)*(air_total/total)),1-conditional((len(slider_ground)/ground_total)*(ground_total/total),(len(slider_air)/air_total)*(air_total/total))]
colors = ['orange','lightblue']

plt.pie(sizes,labels=labels,colors=colors,shadow=True,startangle=90,autopct='%1.5f%%',wedgeprops={'edgecolor':'black'})
plt.axis('equal')
plt.title('Odds of GB/FB Given Sliders')
plt.show()
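Since every pie chart above follows the same pattern, the plotting could be wrapped in a small helper. This is only a sketch: `plot_gb_fb_pie` and the probabilities in `demo_odds` are made up for illustration, and the `Agg` backend line is there so the snippet also runs outside a notebook (drop it inside Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
from matplotlib import pyplot as plt

def plot_gb_fb_pie(p_ground, title, colors=("orange", "lightblue")):
    """Draw one groundball/flyball pie chart and return its Axes."""
    fig, ax = plt.subplots()
    ax.pie([p_ground, 1 - p_ground],
           labels=["Groundballs", "Flyballs"],
           colors=list(colors),
           startangle=90,
           autopct="%1.1f%%",
           wedgeprops={"edgecolor": "black"})
    ax.axis("equal")
    ax.set_title(title)
    return ax

# Hypothetical groundball probabilities, for illustration only.
demo_odds = {"Cutters": 0.45, "Sinkers": 0.60}
axes = [plot_gb_fb_pie(p, f"Odds of GB/FB Given {name}")
        for name, p in demo_odds.items()]
```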



There really isn't too much surprise in these results. I did find cutters interesting: they usually induce weak contact, which should result in more grounders, but the data does not suggest this is the case. The fourseam fastball isn't a surprise, as it's the only pitch here that moves straight with minimal drop, making it easier to lift the ball. Twoseamers and sinkers are virtually the same in terms of movement, so it makes sense that the odds are also virtually the same for both. They can also dive down sharply depending on the pitcher, which explains the heavy groundball rate. While curveballs are a toss-up between results, knucklecurves lean more towards grounders, which makes sense since knucklecurves traditionally have a sharper drop. Changeups and splitters also lean towards groundballs, but splitters lean that way a little more since they too have sharper break than changeups. I do find sliders a little interesting, as they lean slightly towards flyballs. This could be because batters pop up sliders more, or because pitchers simply aren't locating their sliders well.

Next, I'll explore another major factor in whether a ball in play is a groundball or a flyball: location. Many baseball fans know that a well-located pitch is arguably better than one with sharp break that ends up down the middle of the plate. However, does the data support these assumptions?

We'll start with something that should be relatively predictable: sinkers.

While there isn't a visible strike zone in my plots, we can infer its rough shape from all called strikes in the 2019 season.


In [None]:
ax = plt.gca()
strikezone = pitches[pitches['code'] == 'C'] #Code for called strikes
strikezone.plot(kind='scatter',x='px',y='pz',color='orange',ax=ax)
ax.set_facecolor('teal')
plt.xlim(-3,3)
plt.ylim(0,4.5)
plt.title("Called Strikes")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

Besides some outliers, the location of most called strikes is pretty consistent, with a clear shape shown. Umpires are not perfect and in many cases call strikes on what should have been balls, and vice versa. They usually still make the right call on pitches in the strike zone, and more often than not extend the zone further than it really is. The data shows the strike zone running from about -1.25 to 1.25 px, but since umpires extend the zone a little, the true dimensions are more likely -1.1 to 1.1 or -1 to 1 px. The height of the strike zone depends on the height of the batter, so there isn't a uniform range for all players. Because of this, we can just rely on the range given by the data, which is around 1 or 1.1 to 4+ pz.
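One way to read such bounds off the data instead of the plot is to take quantiles of the called-strike locations. The sketch below runs on synthetic points, not the real `pitches` frame, and the 1%/99% cutoffs are an arbitrary choice made here to keep a few outliers from stretching the box.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic called strikes: NOT real data, just a stand-in with px/pz columns.
# In the notebook this would be pitches[pitches['code'] == 'C'].
called = pd.DataFrame({
    "px": rng.normal(0.0, 0.6, size=5000),
    "pz": rng.normal(2.4, 0.5, size=5000),
})

# Use the central 98% of called strikes as a rough zone estimate.
lo, hi = 0.01, 0.99
zone = called[["px", "pz"]].quantile([lo, hi])
left, right = zone.loc[lo, "px"], zone.loc[hi, "px"]
bottom, top = zone.loc[lo, "pz"], zone.loc[hi, "pz"]

print(f"px: {left:.2f} to {right:.2f}, pz: {bottom:.2f} to {top:.2f}")
```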

In [None]:
ax = plt.gca()
pitchTypes_total.plot(kind='scatter',x='px',y='pz',color='blue',ax = ax,label = 'Not Sinkers')
sinker_ground.plot(kind='scatter',x='px',y='pz',color='darkred',ax = ax,label = 'Sinkers-Groundballs')
sinker_air.plot(kind='scatter',x='px',y='pz',color='yellowgreen',ax = ax,label = 'Sinkers-Flyballs',alpha = 0.3)
ax.set_facecolor('teal')
plt.xlim(-3,2.5)
plt.ylim(-1,6)
plt.legend(loc = 'lower left')
plt.title("2019 pitches put in play for out or homerun")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

In [None]:
sinker_ground.plot(kind = 'hexbin',x='px',y='pz',gridsize= 30,sharex=False,cmap=plt.cm.Greens)
plt.title("Heatmap of Sinkers-Groundballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,1.5)
plt.ylim(1,4)
plt.show()

sinker_air.plot(kind = 'hexbin',x='px',y='pz',gridsize= 20,sharex=False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Sinkers-Flyballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.gca().set_facecolor('white') #Use the current heatmap axes; ax still points at the earlier scatter plot
plt.xlim(-2,1.5)
plt.ylim(1,4)
plt.show()


While most sinkers ended up around the middle of the plate, the sinkers that resulted in groundballs were more common in the lower part of the zone than the flyball sinkers. Groundball sinkers were most common between 1.4 and 2.5 pz, while flyball sinkers were more common between 2 and 3 pz. With the sharp drop on a sinker, it makes sense that a low one is going to cause more groundballs. While sinkers mostly resulted in groundballs, fourseamers resulted in more flyballs, so we can look at them next to see why that is the case.
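The height ranges quoted here are read from the heatmaps by eye. A numeric check could compare quartiles of `pz` for the two groups. The sketch below uses synthetic heights as stand-ins for `sinker_ground['pz']` and `sinker_air['pz']`, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic pitch heights, stand-ins for the real pz columns.
ground_pz = pd.Series(rng.normal(2.0, 0.5, size=2000))
air_pz = pd.Series(rng.normal(2.5, 0.5, size=2000))

# Quartiles of pitch height for each outcome, side by side.
quartiles = [0.25, 0.5, 0.75]
summary = pd.DataFrame({
    "ground": ground_pz.quantile(quartiles),
    "air": air_pz.quantile(quartiles),
})
summary.index.name = "quantile"
print(summary.round(2))
```

The middle 50% of each group (the 0.25 to 0.75 rows) gives a range that can be compared directly with the bands seen in the heatmaps.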

In [None]:
ax = plt.gca()
pitchTypes_total.plot(kind='scatter',x='px',y='pz',color='blue',ax = ax,label = 'Not Fourseamers')
fourseam_ground.plot(kind='scatter',x='px',y='pz',color='darkred',ax = ax,label = 'Fourseamers-Groundballs')
fourseam_air.plot(kind='scatter',x='px',y='pz',color='yellowgreen',ax = ax,label = 'Fourseamers-Flyballs',alpha = 0.3)
plt.xlim(-3,2.5)
plt.ylim(-1,6)
plt.legend(loc = 'lower left')
plt.title("2019 pitches put in play for out or homerun")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

In [None]:
fourseam_ground.plot(kind = 'hexbin',x='px',y='pz',gridsize= 26,sharex= False,cmap=plt.cm.Greens)
plt.title("Heatmap of Fourseamers-Grounders")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.show()

fourseam_air.plot(kind = 'hexbin',x='px',y='pz',gridsize= 35, sharex= False,cmap=plt.cm.Greens)
plt.title("Heatmap of Fourseamers-Flyballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.show()

The fourseam fastball is the most common pitch in the game, as it is typically the pitcher's fastest pitch, so we have many datapoints to work with here. A well-timed fourseamer can be effective in virtually any location of the strike zone, even down the middle. This is supported by both heatmaps, which cover most of the strike zone for both grounders and flyballs. However, fourseam grounders tend to be more frequent from 1.5 to 3.25 pz, with fourseam flyballs more common from 2 to 3.5 pz. It's a small difference, but it only takes a few inches of height to change the outcome of an at-bat. This difference, along with the minimal movement of the pitch itself, once again supports the finding that fourseamers are most likely to be put in the air.

Now, I'll look at another grounder-heavy pitch: knucklecurves.

In [None]:
ax = plt.gca()
pitchTypes_total.plot(kind='scatter',x='px',y='pz',color='blue',ax = ax,label = 'Not Knucklecurves')
knucklecurve_ground.plot(kind='scatter',x='px',y='pz',color='darkred',ax = ax,label = 'Knucklecurves-Groundballs')
knucklecurve_air.plot(kind='scatter',x='px',y='pz',color='yellowgreen',ax = ax,label = 'Knucklecurves-Flyballs',alpha = 0.4)
plt.xlim(-3,2.5)
plt.ylim(-1,6)
plt.legend(loc = 'lower left')
plt.title("2019 pitches put in play for out or homerun")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

In [None]:
knucklecurve_ground.plot(kind = 'hexbin',x='px',y='pz',gridsize= 18,sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Knucklecurves-Groundballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,2)
plt.ylim(0,4)
plt.show()

knucklecurve_air.plot(kind = 'hexbin',x='px',y='pz',gridsize= 15, sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Knucklecurves-Flyballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,2)
plt.ylim(0,4)
plt.show()

The sample size is small for knucklecurves, but enough were thrown to see the pattern. More knucklecurves were thrown higher in the zone than expected, but the general trend does show that knucklecurve grounders were common between 1.25 and 2.75 pz, with flyball knucklecurves more common between 1.5 and 3 pz.

Next, I'll choose a pitch from the offspeed group: splitters.

In [None]:
ax = plt.gca()
pitchTypes_total.plot(kind='scatter',x='px',y='pz',color='blue',ax = ax,label = 'Not Splitters')
splitter_ground.plot(kind='scatter',x='px',y='pz',color='darkred',ax = ax,label = 'Splitters-Groundballs')
splitter_air.plot(kind='scatter',x='px',y='pz',color='yellowgreen',ax = ax,label = 'Splitters-Flyballs',alpha = 0.4)
plt.xlim(-3,2.5)
plt.ylim(-1,6)
plt.legend(loc = 'lower left')
plt.title("2019 pitches put in play for out or homerun")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

In [None]:
splitter_ground.plot(kind = 'hexbin',x='px',y='pz',gridsize= 17,sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Splitters-Groundballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,1.5)
plt.ylim(0,4)
plt.show()

splitter_air.plot(kind = 'hexbin',x='px',y='pz',gridsize= 15,sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Splitters-Flyballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,1.5)
plt.ylim(0,4)
plt.show()

Not many splitters were put in the air, but like the other pitches, there is a trend of lower pitches resulting in more grounders. Splitter groundballs are mostly between 1 and 2.75 pz, while splitter flyballs are between 1.5 and 3 pz. Splitters have a sharp drop and are typically faster than other offspeed pitches, so it makes sense that they result in more grounders.

Finally, I'll plot a pitch that somewhat surprised me in the earlier analysis: sliders.

In [None]:
ax = plt.gca()
pitchTypes_total.plot(kind='scatter',x='px',y='pz',color='blue',ax = ax,label = 'Not Sliders')
slider_ground.plot(kind='scatter',x='px',y='pz',color='darkred',ax = ax,label = 'Sliders-Groundballs')
slider_air.plot(kind='scatter',x='px',y='pz',color='yellowgreen',ax = ax,label = 'Sliders-Flyballs',alpha = 0.4)
plt.xlim(-3,2.5)
plt.ylim(-1,6)
plt.legend(loc = 'lower left')
plt.title("2019 pitches put in play for out or homerun")
plt.xlabel('X Coordinate of Pitch Location')
plt.ylabel('Z Coordinate of Pitch Location')
plt.show()

In [None]:
slider_ground.plot(kind = 'hexbin',x='px',y='pz',gridsize= 30,sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Sliders-Groundballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,1.5)
plt.ylim(0,4)
plt.show()

slider_air.plot(kind = 'hexbin',x='px',y='pz',gridsize= 30,sharex = False,cmap=plt.cm.Greens) 
plt.title("Heatmap of Sliders-Flyballs")
plt.xlabel("X Coordinate of Pitch Location")
plt.ylabel("Z Coordinate of Pitch Location")
plt.xlim(-2,1.5)
plt.ylim(0,4)
plt.show()

There really isn't much difference between the two maps, but groundball sliders tend to be more common between 1.5 and 2.5 pz, with flyball sliders between 2 and 3 pz. Once again, there's very little difference between them, but sometimes just a slight elevation in a pitch is all that is needed to change the outcome. Another interesting fact is that most sliders were on the right side of the plate, indicating either a lot of backdoor sliders by righties or many well-placed sliders by lefties.
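The righty/lefty question could be checked by splitting slider locations by pitcher handedness. The sketch below assumes the at-bats file carries a handedness column named `p_throws`. That column name is an assumption worth verifying against the real CSV, and all values here are made up.

```python
import pandas as pd

# Made-up slider locations and pitcher handedness, for illustration only.
sliders_demo = pd.DataFrame({
    "ab_id": [1, 2, 3, 4, 5, 6],
    "px": [0.6, 0.4, -0.3, 0.8, -0.5, 0.2],
})
at_bats_demo = pd.DataFrame({
    "ab_id": [1, 2, 3, 4, 5, 6],
    "p_throws": ["R", "R", "L", "L", "R", "L"],  # assumed column name
})

# Attach handedness to each slider, then bucket by side of the plate.
merged = sliders_demo.merge(at_bats_demo, on="ab_id")
merged["side"] = merged["px"].gt(0).map({True: "px > 0", False: "px <= 0"})

# Rows: pitcher handedness. Columns: side of the plate. Cells: pitch counts.
table = pd.crosstab(merged["p_throws"], merged["side"])
print(table)
```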
# Conclusion:
There really weren't many surprises in the analysis, especially for those who know baseball well, but it's good to reinforce our understanding and assumptions of the game with actual data! If you feel this project could explore other aspects of pitching or be improved in general, just let me know!