The code below tries to quantify the effect of pitch tunneling across different pitch types. Several articles discuss pitch tunneling, including ones from [Baseball Prospectus](https://www.baseballprospectus.com/news/article/31030/prospectus-feature-introducing-pitch-tunnels/) and [Fangraphs](https://tht.fangraphs.com/pitch-tunneling-is-it-real-and-how-do-pitchers-actually-pitch/). However, tunneling can be confounded with other effects. For example, a pitch that lands well outside the strike zone can't really tunnel with other pitches, but that pitch is bad because of poor control anyway. To isolate the tunneling effect, we need a model that controls for pitch location (and other factors).

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import math
pd.options.display.max_columns = 999

from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

In [None]:
pitch_stat = pd.read_csv('/kaggle/input/mlb-statcast-data/Statcast_2019.csv')
pitch_stat = pitch_stat[(pitch_stat['balls'] <= 3)& (pitch_stat['strikes'] <= 2)]

Calculate inferred pitch data using [Alan Nathan](https://twitter.com/pobguy)'s calculator.

In [None]:
pitch_stat = pitch_stat[(pitch_stat.plate_z > 0) & (pitch_stat.plate_z < 5)]

Here pitch tunneling is defined as the difference in ball trajectory at the tunnel point, 23.8 feet from home plate, following the Baseball Prospectus definition.

In [None]:
# Release point along y (feet from home plate); the rubber is 60.5 ft away
pitch_stat['yR'] = 60.5 - pitch_stat.release_extension
# Tunnel plane: 23.8 ft in front of home plate, whose front edge is at y = 17/12 ft
y_tunnel = 23.8 + 17/12
# Time from the y = 50 ft reference point back to release (negative)
pitch_stat['tR'] = (-pitch_stat.vy0 - (pitch_stat.vy0**2-2*pitch_stat.ay*(50-pitch_stat.yR))**0.5)/pitch_stat.ay
# Velocity components at release
pitch_stat['vxR'] = pitch_stat.vx0 + pitch_stat.ax * pitch_stat.tR
pitch_stat['vyR'] = pitch_stat.vy0 + pitch_stat.ay * pitch_stat.tR
pitch_stat['vzR'] = pitch_stat.vz0 + pitch_stat.az * pitch_stat.tR
# Gap between reported release speed (mph) and computed release speed (ft/s / 1.467)
pitch_stat['dv0'] = pitch_stat.release_speed - (pitch_stat.vxR**2+pitch_stat.vyR**2+pitch_stat.vzR**2)**0.5/1.467
# Time of flight from release to the tunnel plane
pitch_stat['tf'] = (-pitch_stat.vyR-(pitch_stat.vyR**2-2*pitch_stat.ay*(pitch_stat.yR-y_tunnel))**0.5)/pitch_stat.ay

# Position at the tunnel plane under constant acceleration: p0 + v*t + 0.5*a*t^2
pitch_stat['x_tunnel'] = pitch_stat.release_pos_x + pitch_stat.vxR * pitch_stat.tf + 0.5 * pitch_stat.ax * pitch_stat.tf*pitch_stat.tf
pitch_stat['z_tunnel'] = pitch_stat.release_pos_z + pitch_stat.vzR * pitch_stat.tf + 0.5 * pitch_stat.az * pitch_stat.tf*pitch_stat.tf

# Average velocity between release and the tunnel plane
pitch_stat['vxbar'] = (2*pitch_stat.vxR+pitch_stat.ax*pitch_stat.tf)/2
pitch_stat['vybar'] = (2*pitch_stat.vyR+pitch_stat.ay*pitch_stat.tf)/2
pitch_stat['vzbar'] = (2*pitch_stat.vzR+pitch_stat.az*pitch_stat.tf)/2
pitch_stat['vbar'] = (pitch_stat.vxbar**2+pitch_stat.vybar**2+pitch_stat.vzbar**2)**0.5
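
The constant-acceleration kinematics behind these columns can be checked on a single pitch. The sketch below is only an illustration: the numbers are made up (though typical in magnitude), and `time_to_plane` and `position_at` are hypothetical helper names, not part of Statcast or Alan Nathan's calculator.

```python
import math

def time_to_plane(y_start, vy, ay, y_target):
    """Time (s) to travel from y_start to y_target (feet), given velocity vy
    (ft/s, negative toward the plate) and constant acceleration ay (ft/s^2)."""
    disc = vy ** 2 - 2 * ay * (y_start - y_target)
    return (-vy - math.sqrt(disc)) / ay

def position_at(p0, v, a, t):
    """Constant-acceleration position: p0 + v*t + 0.5*a*t^2."""
    return p0 + v * t + 0.5 * a * t ** 2

# Illustrative values only (not taken from the data set)
y_release, vy_release, ay = 54.5, -131.0, 25.0
t = time_to_plane(y_release, vy_release, ay, y_target=25.0)

# Plugging t back into the position formula should recover the target plane
y_check = position_at(y_release, vy_release, ay, t)
# Horizontal position at that same instant, from made-up x components
x_at_plane = position_at(-2.0, 5.0, -10.0, t)
print(round(t, 3), round(y_check, 3), round(x_at_plane, 3))
```

Recovering `y_target` from `position_at` is a quick way to confirm that the time formula and the position formula use the same convention, including the factor of 0.5 on the acceleration term.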



In [None]:
pitch_stat['inning_diff'] = pitch_stat['inning_topbot'].ne(pitch_stat['inning_topbot'].shift(-1)).astype(int)
pitch_stat['AB_diff'] =  pitch_stat['at_bat_number'].diff(-1)

In [None]:
pitch_stat['pitch_type_diff'] = pitch_stat['pitch_type'].ne(pitch_stat['pitch_type'].shift(-1)).astype(int)
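
To see what these flags do, here is a minimal sketch on a toy pitch log. The five rows are invented for illustration and assume, as the code above does, that adjacent rows are consecutive pitches.

```python
import pandas as pd

# Toy pitch log: adjacent rows are consecutive pitches
toy = pd.DataFrame({
    'at_bat_number': [7, 7, 7, 8, 8],
    'pitch_type':    ['FF', 'SL', 'SL', 'FF', 'CH'],
})

# 0 when the next row belongs to the same at bat; NaN on the last row
toy['AB_diff'] = toy['at_bat_number'].diff(-1)
# 1 when the next row has a different pitch type
toy['pitch_type_diff'] = toy['pitch_type'].ne(toy['pitch_type'].shift(-1)).astype(int)

# Keep rows whose neighbour is in the same at bat and is a different pitch type
keep = toy[(toy['AB_diff'] == 0) & (toy['pitch_type_diff'] == 1)]
print(keep)
```

Only the first and fourth rows survive: each is followed by a different pitch type within the same at bat.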

In [None]:
# Take the tunnel differences while adjacent rows are still consecutive pitches;
# after filtering, neighbouring rows may no longer be back-to-back pitches
pitch_stat['tunnel_xdiff'] = pitch_stat['x_tunnel'].diff(-1)
pitch_stat['tunnel_zdiff'] = pitch_stat['z_tunnel'].diff(-1)

The sample keeps only pitches whose adjacent pitch is in the same at bat and has a different pitch type.

In [None]:
pitch_stat = pitch_stat[(pitch_stat['inning_diff'] == 0) & (pitch_stat['AB_diff'] == 0) & (pitch_stat['pitch_type_diff'] == 1)].copy()

In [None]:
pitch_stat['on_1b_code'] = 1-pitch_stat.on_1b.isna().astype(int)
pitch_stat['on_2b_code'] = 1-pitch_stat.on_2b.isna().astype(int)
pitch_stat['on_3b_code'] = 1-pitch_stat.on_3b.isna().astype(int)
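
The base-runner encoding can be illustrated on a toy frame. This is a sketch with made-up runner ids, assuming an empty base is stored as NaN as the code above implies.

```python
import numpy as np
import pandas as pd

# Toy runner columns: an id when the base is occupied, NaN when it is empty
# (the ids are made up for illustration)
toy = pd.DataFrame({'on_1b': [592450.0, np.nan, 605141.0],
                    'on_2b': [np.nan, np.nan, 543037.0]})

for base in ['on_1b', 'on_2b']:
    # notna() is True for an occupied base, which is equivalent to 1 - isna()
    toy[base + '_code'] = toy[base].notna().astype(int)

print(toy[['on_1b_code', 'on_2b_code']])
```

Using `notna()` gives the same 0/1 codes as `1 - isna()` and reads a little more directly.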

In [None]:
pitch_stat = pitch_stat.dropna(subset=['release_speed','pitch_type','plate_x' ,'plate_z', 'balls', 'strikes','on_1b_code','on_2b_code','on_3b_code','delta_run_exp','home_team','stand','p_throws'])

The methodology is to measure the excess runs attributable to tunneling. Statcast provides 'delta_run_exp', the change in run expectancy between pitches. If a tunneling effect exists, well-tunneled pitches should yield a lower run expectancy change than predicted by a model that does not include tunneling. As an example, let's first create a model without pitch speed:

In [None]:
X = pitch_stat[['pitch_type','plate_x' ,'plate_z', 'balls', 'strikes','on_1b_code','on_2b_code','on_3b_code','home_team','stand','p_throws']]
y = pitch_stat['delta_run_exp']

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.5, random_state=42)

The model uses CatBoost to regress on the change in run expectancy:

In [None]:
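cat_features = [0,3,4,5,6,7,8,9,10]  # positions of the categorical columns in X
model = CatBoostRegressor()
# Fit the model, treating the listed columns as categorical
model.fit(X_train, y_train, cat_features=cat_features)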
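
Hard-coded positions are easy to get out of sync when the feature list changes. As a sketch, the same indices can be derived from the column names. `feature_cols` mirrors the list used for `X` above, while `numeric_cols` is an assumption about which columns are treated as continuous.

```python
# Same order as the columns of X above
feature_cols = ['pitch_type', 'plate_x', 'plate_z', 'balls', 'strikes',
                'on_1b_code', 'on_2b_code', 'on_3b_code',
                'home_team', 'stand', 'p_throws']

# Assumption: only the plate location columns are continuous
numeric_cols = ['plate_x', 'plate_z']

# Every column that is not continuous is passed to the model as categorical
cat_features = [i for i, col in enumerate(feature_cols) if col not in numeric_cols]
print(cat_features)  # [0, 3, 4, 5, 6, 7, 8, 9, 10]
```

Appending a continuous column such as `release_speed` to both lists leaves the derived indices unchanged.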

Excess run is defined as the true change in run expectancy minus the model output:

In [None]:
pitch_stat['expected_run'] =  model.predict(X)
pitch_stat['excess_run'] =  pitch_stat['delta_run_exp'] - pitch_stat['expected_run']
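
The residual idea can be shown on a toy example. In this sketch a per-count average stands in for `model.predict`. That is a deliberate simplification rather than the CatBoost model used here, and the numbers are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'balls':         [0, 0, 0, 3, 3, 3],
    'delta_run_exp': [-0.04, 0.02, -0.01, 0.10, 0.06, 0.11],
})

# Stand-in for model.predict: the mean run change for each ball count
toy['expected_run'] = toy.groupby('balls')['delta_run_exp'].transform('mean')

# Negative excess run means the pitch gave up fewer runs than expected
toy['excess_run'] = toy['delta_run_exp'] - toy['expected_run']
print(toy)
```

Because the stand-in predicts the group mean, the excess run averages to zero within each count. Any remaining pattern against another variable is what the later plots look for.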

Plot fastball release speed against excess run:

In [None]:
plt.ylim([-0.02,0.02])
plt.xlim([85,100])
plt.title('Excess run vs fastball release speed')
ff = pitch_stat[pitch_stat['pitch_type'] == 'FF']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ff['release_speed'], y=ff['excess_run'], scatter=False)

The plot above shows that excess run becomes more negative at higher release speeds, which means faster pitches give up fewer runs and are therefore better than slower ones, exactly as we would expect.
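
A trend like this can also be checked numerically by fitting a straight line to the residuals. The sketch below uses synthetic data with a small negative trend built in, so the numbers say nothing about the real data set. It only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
speed = rng.uniform(85, 100, size=5000)  # synthetic release speeds (mph)
# Synthetic residuals: a small built-in negative trend plus noise
excess = -0.001 * (speed - 92.5) + rng.normal(0, 0.02, size=5000)

# Least-squares line through the residuals; a negative slope means
# faster pitches have lower (better) excess run
slope, intercept = np.polyfit(speed, excess, deg=1)
print(f"slope per mph: {slope:.5f}")
```

A slope close to zero would indicate that the model already accounts for speed.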

Now we add release speed back in to see the effect of pitch tunneling:

In [None]:
X = pitch_stat[['pitch_type','plate_x' ,'plate_z', 'balls', 'strikes','on_1b_code','on_2b_code','on_3b_code','home_team','stand','p_throws','release_speed']]
y = pitch_stat['delta_run_exp']

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)

In [None]:
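cat_features = [0,3,4,5,6,7,8,9,10]  # release_speed (position 11) stays numeric
model = CatBoostRegressor()
# Fit the model, treating the listed columns as categorical
model.fit(X_train, y_train, cat_features=cat_features)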

In [None]:
pitch_stat['expected_run'] =  model.predict(X)
pitch_stat['excess_run'] =  pitch_stat['delta_run_exp'] - pitch_stat['expected_run']

In [None]:
plt.ylim([-0.01,0.01])
plt.xlim([85,100])
ff = pitch_stat[pitch_stat['pitch_type'] == 'FF']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ff['release_speed'], y=ff['excess_run'], scatter=False)
plt.title('Excess run vs fastball release speed')

The graph above shows roughly no excess run across release speeds, which means the model is well calibrated with respect to speed. Next, plot excess run against the tunnel distance for each pitch type.

In [None]:
plt.xlim([0,3])
plt.ylim([-0.01,0.01])
ff = pitch_stat[pitch_stat['pitch_type'] == 'FF']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ff['tunnel_xdiff'].abs(), y=ff['excess_run'], scatter=False, order=2)
plt.title('Excess run fastball vs horizontal tunnel distance')
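
With `order=2`, `regplot` draws a second-degree polynomial fit. The sketch below reproduces that kind of fit directly with `np.polyfit` on synthetic data. The curvature is built into the fake data on purpose and says nothing about the real pitches.

```python
import numpy as np

rng = np.random.default_rng(1)
dist = rng.uniform(0, 3, size=5000)  # synthetic tunnel distances (feet)
# Synthetic residuals: a built-in U shape plus noise
excess = 0.005 * (dist - 1.5) ** 2 - 0.00375 + rng.normal(0, 0.02, size=5000)

# Coefficients come back highest degree first: c2*x^2 + c1*x + c0
c2, c1, c0 = np.polyfit(dist, excess, deg=2)

# Evaluate the fitted curve on a grid, as a plot would
grid = np.linspace(0, 3, 7)
fitted = np.polyval([c2, c1, c0], grid)
print(np.round(fitted, 4))
```

The sign and size of `c2` summarize the curvature that the plots show visually.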

In [None]:
plt.xlim([-1,3])
plt.ylim([-0.01,0.01])
ff = pitch_stat[pitch_stat['pitch_type'] == 'FF']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ff['tunnel_zdiff'], y=ff['excess_run'], scatter=False, order=2)
plt.title('Excess run fastball vs vertical tunnel distance')

In [None]:
plt.xlim([0,3])
plt.ylim([-0.01,0.01])
ch = pitch_stat[pitch_stat['pitch_type'] == 'CH']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ch['tunnel_xdiff'].abs(), y=ch['excess_run'], scatter=False, order=2)
plt.title('Excess run changeup vs horizontal tunnel distance')

In [None]:
plt.xlim([-1.5,2])
plt.ylim([-0.01,0.01])
ch = pitch_stat[pitch_stat['pitch_type'] == 'CH']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=ch['tunnel_zdiff'], y=ch['excess_run'], scatter=False, order=2)
plt.title('Excess run changeup vs vertical tunnel distance')

In [None]:
plt.xlim([0,3])
plt.ylim([-0.01,0.01])
fc = pitch_stat[pitch_stat['pitch_type'] == 'FC']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=fc['tunnel_xdiff'].abs(), y=fc['excess_run'], scatter=False, order=2)
plt.title('Excess run cutter vs horizontal tunnel distance')

In [None]:
plt.xlim([-1.5,2])
plt.ylim([-0.01,0.01])
fc = pitch_stat[pitch_stat['pitch_type'] == 'FC']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=fc['tunnel_zdiff'], y=fc['excess_run'], scatter=False, order=2)
plt.title('Excess run cutter vs vertical tunnel distance')

In [None]:
plt.xlim([0,3])
plt.ylim([-0.01,0.01])
sl = pitch_stat[pitch_stat['pitch_type'] == 'SL']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=sl['tunnel_xdiff'].abs(), y=sl['excess_run'], scatter=False, order=2)

plt.title('Excess run slider vs horizontal tunnel distance')

In [None]:
plt.xlim([-2.5,1.5])
plt.ylim([-0.01,0.01])
sl = pitch_stat[pitch_stat['pitch_type'] == 'SL']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=sl['tunnel_zdiff'], y=sl['excess_run'], scatter=False, order=2)

plt.title('Excess run slider vs vertical tunnel distance')

In [None]:
plt.xlim([0,3])
plt.ylim([-0.01,0.01])
cu = pitch_stat[pitch_stat['pitch_type'] == 'CU']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=cu['tunnel_xdiff'].abs(), y=cu['excess_run'], scatter=False, order=2)

plt.title('Excess run curveball vs horizontal tunnel distance')

In [None]:
plt.xlim([-2.5,1])
plt.ylim([-0.01,0.01])
cu = pitch_stat[pitch_stat['pitch_type'] == 'CU']
# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=cu['tunnel_zdiff'], y=cu['excess_run'], scatter=False, order=2)

plt.title('Excess run curveball vs vertical tunnel distance')

From the figures above, tunneling does not seem to have much effect on runs scored, except for changeups and curveballs.