
## Pitch Modeling


Goals for a new approach to pitch modeling:

- Improve the target variable. Most stuff models use target metrics such as context-neutral RE24/RE288 run values, context-specific run values like `delta_run_exp`, or called-plus-swinging-strike rate (CSW%). My approach uses an expected run value in the spirit of expected wOBA, which predicts wOBA from exit velocity and launch angle. I won't use wOBA as the base metric, for reasons I've discussed [here](https://sam-walsh.github.io/posts/double-plays/) and [here](https://sam-walsh.github.io/posts/fixing-xwoba/). Instead, I predict run values for balls in play from exit velocity, launch angle, and spray-angle bins normalized for batter handedness. This reduces noise in the target metric while still accounting for context-specific outcomes like double plays and sacrifice flies.

- Improve location modeling. Scott Powers, a former Dodgers analyst and professor of sports analytics, and his student Vicente Iglesias recently gave a presentation at Saberseminar on improving location modeling with Bayesian hierarchical models. More about that [here](https://github.com/saberpowers/predictive-pitch-score/blob/main/documentation/2023-08-12_saberseminar/slides.pdf). I will take a similar approach, which should improve location modeling, especially in small samples.
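
To make the target idea in the first goal concrete, here is a minimal sketch of a binned run-value target. It is not the pipeline that produced `predicted_run_value` in this notebook. The column names `launch_speed`, `launch_angle`, and `delta_run_exp` follow Statcast conventions, `spray_angle_adj` is a hypothetical handedness-normalized spray angle, and the bin widths are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

def binned_run_value(bip: pd.DataFrame) -> pd.Series:
    """Replace each ball in play's run value with the mean of its (EV, LA, spray) bin."""
    ev_bin = np.floor(bip["launch_speed"] / 3)   # 3 mph exit-velocity bins (assumed width)
    la_bin = np.floor(bip["launch_angle"] / 5)   # 5 degree launch-angle bins (assumed width)
    # Three spray bins; treating negative angles as the pull side is an assumed convention
    spray_bin = pd.cut(bip["spray_angle_adj"], bins=[-45, -15, 15, 45],
                       labels=["pull", "center", "oppo"], include_lowest=True)
    grouped = bip.groupby([ev_bin, la_bin, spray_bin], observed=True)["delta_run_exp"]
    return grouped.transform("mean")

# Made-up batted balls: two hard-hit pulled fly balls, two weak grounders up the middle
toy_bip = pd.DataFrame({
    "launch_speed":    [100.0, 101.0, 70.0, 71.5],
    "launch_angle":    [26.0, 27.0, -11.0, -12.0],
    "spray_angle_adj": [-20.0, -25.0, 5.0, 3.0],
    "delta_run_exp":   [1.40, -0.25, -0.30, -0.80],
})
toy_bip["binned_rv"] = binned_run_value(toy_bip)
```

Because the run values being averaged are context-specific, outcomes such as a double play still pull the bin mean down, while the luck of any single batted ball is smoothed out.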

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GroupShuffleSplit
import optuna
from optuna import Trial
from optuna.samplers import TPESampler
import xgboost as xgb
import joblib

In [2]:
# load in data with predicted run values
df = pd.read_csv('statcast_data/df_all_spin.csv')

In [3]:
df['release_pos_y'].mean()

54.153003240913726

In [4]:
# Approximate vertical and horizontal approach angles (degrees) for each pitch,
# using the straight line from the release point to the plate location
df['vaa'] = np.arctan((df['plate_z'] - df['release_pos_z']) / (df['release_pos_y'])) * (180 / np.pi)
df['haa'] = np.arctan((df['plate_x'] - df['release_pos_x']) / (df['release_pos_y'])) * (180 / np.pi)

# Flip the sign of the spin-axis deviation for left-handers so both hands share one convention
df['axis_deviation_adj'] = np.where(df['p_throws']=='L', df['diff_measured_inferred'].mul(-1), df['diff_measured_inferred'])
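
The angles above are a straight-line approximation from release point to plate location. A common alternative derives the approach angle from the ball's velocity as it reaches the plate, assuming constant acceleration. The sketch below assumes the Statcast fields `vx0`, `vy0`, `vz0`, `ax`, `ay`, and `az` are available (only `ax` and `az` are used elsewhere in this notebook, so the rest is an assumption) and that velocities are reported at y = 50 ft.

```python
import numpy as np
import pandas as pd

Y0 = 50.0          # distance (ft) at which Statcast reports vx0/vy0/vz0
Y_PLATE = 17 / 12  # front edge of home plate (ft)

def approach_angles(pitches: pd.DataFrame) -> pd.DataFrame:
    """Return kinematic VAA/HAA in degrees under constant acceleration."""
    # Velocity toward the plate (negative y) when the ball reaches the front of the plate
    vy_f = -np.sqrt(pitches["vy0"] ** 2 - 2 * pitches["ay"] * (Y0 - Y_PLATE))
    t = (vy_f - pitches["vy0"]) / pitches["ay"]  # flight time from y = 50 ft to the plate
    vz_f = pitches["vz0"] + pitches["az"] * t
    vx_f = pitches["vx0"] + pitches["ax"] * t
    return pd.DataFrame({
        "vaa_kinematic": -np.degrees(np.arctan(vz_f / vy_f)),
        "haa_kinematic": -np.degrees(np.arctan(vx_f / vy_f)),
    }, index=pitches.index)

# One made-up pitch, roughly fastball-like
example = pd.DataFrame({"vx0": [5.0], "vy0": [-135.0], "vz0": [-5.0],
                        "ax": [-10.0], "ay": [28.0], "az": [-15.0]})
angles = approach_angles(example)
```

The two definitions will not agree exactly, since the straight-line version ignores the curvature of the pitch's path.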

In [5]:
df.loc[df['p_throws']=='R'].groupby('pitch_type')[['vaa', 'haa']].mean()

pitch_type,vaa,haa
CH,-4.156372,1.637548
CS,-4.240454,1.662471
CU,-4.345157,1.947331
EP,-3.545236,1.426134
FA,-3.575805,1.781344
FC,-3.762419,2.319228
FF,-3.258477,1.91927
FO,-5.080818,1.895338
FS,-4.433563,1.589453
KC,-4.3745,2.102859


In [6]:
df.pitch_type.unique()

array(['FF', 'SL', 'SI', 'FC', 'CU', 'CH', 'KC', 'CS', 'FS', 'ST', 'SV',
       'EP', 'FA', nan, 'KN', 'PO', 'SC', 'FO'], dtype=object)

Group pitch types

In [7]:
fastballs = ['FF', 'SI', 'FC']
offspeed = ['CH', 'FS', 'FO']
breaking_balls = ['KC', 'CU', 'SL', 'ST', 'SV', 'CS', 'SC']
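
As a small illustration of how these lists could be used, the sketch below maps each pitch type to a group label and sends anything unlisted to `'other'`. The `group_lookup` dict and the `'other'` label are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

fastballs = ['FF', 'SI', 'FC']
offspeed = ['CH', 'FS', 'FO']
breaking_balls = ['KC', 'CU', 'SL', 'ST', 'SV', 'CS', 'SC']

# Invert the three lists into a pitch_type -> group lookup
groups = {'fastball': fastballs, 'offspeed': offspeed, 'breaking': breaking_balls}
group_lookup = {pt: name for name, pts in groups.items() for pt in pts}

# Unlisted types (e.g. KN) and missing values fall through to 'other'
pitch_types = pd.Series(['FF', 'SL', 'KN', None, 'FS'])
pitch_group = pitch_types.map(group_lookup).fillna('other')
```

An explicit label like this makes it easy to see which rows a filter such as `~df['pitch_type'].isin(fastballs)` lets through: it keeps `EP`, `FA`, `KN`, `PO`, and rows with a missing pitch type as well as the off-speed and breaking pitches.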

Train on 2020-2022 data, use 2023 as holdout set

In [8]:
df_fastballs = df.loc[(df['pitch_type'].isin(fastballs)) & (df['game_year'].isin([2020, 2021, 2022]))]
df_fastballs_holdout = df.loc[(df['pitch_type'].isin(fastballs)) & (df['game_year']==2023)]

In [9]:
df_fastballs.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,...,predicted_run_value,player_id,api_pitch_type,active_spin_formatted,hawkeye_measured,movement_inferred,diff_measured_inferred,vaa,haa,axis_deviation_adj
0,0,2875.0,FF,2020-09-27,91.6,2.31,6.19,"Hand, Brad",605137,543272,...,-0.023925,543272.0,FF,89.0,129.2,139.2,-10.0,-3.002928,-2.721848,10.0
6,6,3604.0,FF,2020-09-27,92.8,2.39,6.27,"Hand, Brad",663647,543272,...,-0.237219,543272.0,FF,89.0,129.2,139.2,-10.0,-3.475779,-2.956457,10.0
10,10,2843.0,SI,2020-09-27,96.7,-2.2,5.92,"Cederlind, Blake",596019,664977,...,-0.19518,,,,,,,-4.351309,2.456165,
11,11,2951.0,SI,2020-09-27,97.1,-2.2,5.99,"Cederlind, Blake",596019,664977,...,-0.06,,,,,,,-3.498537,2.966262,
12,12,3160.0,SI,2020-09-27,97.3,-2.34,5.94,"Cederlind, Blake",596019,664977,...,0.06,,,,,,,-3.579655,3.569029,


In [10]:
fastball_features = [
    'release_speed', 'az', 'ax', 'active_spin_formatted',
    'plate_x', 'plate_z', 'axis_deviation_adj', 'vaa', 'haa'
]
target = 'predicted_run_value'

In [11]:
df_fastballs[fastball_features].info()

<class 'pandas.core.frame.DataFrame'>
Index: 961810 entries, 0 to 1717492
Data columns (total 9 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   release_speed          961782 non-null  float64
 1   az                     961796 non-null  float64
 2   ax                     961796 non-null  float64
 3   active_spin_formatted  941051 non-null  float64
 4   plate_x                961796 non-null  float64
 5   plate_z                961796 non-null  float64
 6   axis_deviation_adj     941051 non-null  float64
 7   vaa                    961652 non-null  float64
 8   haa                    961652 non-null  float64
dtypes: float64(9)
memory usage: 73.4 MB


In [12]:
df_fastballs[target].info()

<class 'pandas.core.series.Series'>
Index: 961810 entries, 0 to 1717492
Series name: predicted_run_value
Non-Null Count   Dtype  
--------------   -----  
961810 non-null  float64
dtypes: float64(1)
memory usage: 14.7 MB


In [13]:
df_fastballs = df_fastballs.dropna(subset=fastball_features)
df_fastballs = df_fastballs.dropna(subset=[target])

df_fastballs_holdout = df_fastballs_holdout.dropna(subset=fastball_features)
df_fastballs_holdout = df_fastballs_holdout.dropna(subset=[target])

In [14]:
df_fastballs.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,...,predicted_run_value,player_id,api_pitch_type,active_spin_formatted,hawkeye_measured,movement_inferred,diff_measured_inferred,vaa,haa,axis_deviation_adj
0,0,2875.0,FF,2020-09-27,91.6,2.31,6.19,"Hand, Brad",605137,543272,...,-0.023925,543272.0,FF,89.0,129.2,139.2,-10.0,-3.002928,-2.721848,10.0
6,6,3604.0,FF,2020-09-27,92.8,2.39,6.27,"Hand, Brad",663647,543272,...,-0.237219,543272.0,FF,89.0,129.2,139.2,-10.0,-3.475779,-2.956457,10.0
23,23,4283.0,FF,2020-09-27,91.3,2.59,5.99,"Hand, Brad",624428,543272,...,0.30122,543272.0,FF,89.0,129.2,139.2,-10.0,-4.009259,-3.386027,10.0
24,24,4455.0,FF,2020-09-27,92.5,2.43,6.14,"Hand, Brad",624428,543272,...,0.06,543272.0,FF,89.0,129.2,139.2,-10.0,-3.643167,-3.466656,10.0
25,25,4472.0,FF,2020-09-27,90.5,2.52,6.02,"Hand, Brad",624428,543272,...,-0.06,543272.0,FF,89.0,129.2,139.2,-10.0,-4.321128,-3.717588,10.0


In [15]:
# Split by pitcher so that no pitcher appears in both the training and validation sets
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)

# Get the indices for the training and validation sets
train_idx, val_idx = next(gss.split(df_fastballs, groups=df_fastballs['pitcher']))

# Create the training and validation sets
train = df_fastballs.iloc[train_idx]
val = df_fastballs.iloc[val_idx]
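
A quick way to confirm that a grouped split behaves as intended is to check that no group lands on both sides. The sketch below does this on a toy frame with made-up pitcher ids; the same check could be applied to `train` and `val` above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

toy = pd.DataFrame({
    "pitcher": np.repeat([101, 102, 103, 104, 105], 4),  # made-up pitcher ids
    "release_speed": np.linspace(90, 99, 20),
})

gss_check = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
toy_train_idx, toy_val_idx = next(gss_check.split(toy, groups=toy["pitcher"]))

train_pitchers = set(toy.iloc[toy_train_idx]["pitcher"])
val_pitchers = set(toy.iloc[toy_val_idx]["pitcher"])
overlap = train_pitchers & val_pitchers  # expected to be empty for a grouped split
```

Splitting by pitcher matters here because pitches from the same pitcher are highly correlated, so a row-level split would let the model see near-duplicates of its validation rows.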


In [16]:

# # Define the objective function for Optuna
# def objective(trial: Trial) -> float:
#     params = {
#         'device': 'cuda',  # Use GPU acceleration
#         'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
#         'max_depth': trial.suggest_int('max_depth', 3, 10),
#         'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1),
#         'subsample': trial.suggest_float('subsample', 0.5, 1.0),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
#     }

#     model = xgb.XGBRegressor(**params)
#     model.fit(train[fastball_features], train[target])

#     val_preds = model.predict(val[fastball_features])
#     val_error = np.sqrt(((val_preds - val[target]) ** 2).mean())

#     return val_error

# # Run the Optuna optimization
# study = optuna.create_study(direction='minimize', sampler=TPESampler(seed=42))
# study.optimize(objective, n_trials=50)

# # Print the best parameters
# print(study.best_params)
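
The objective above returns a root-mean-squared error on the validation pitchers. Per-pitch run values are noisy, so the raw RMSE is easier to read next to a constant-prediction baseline. The sketch below uses made-up numbers purely to show the comparison; in practice the constant would be the training-set mean.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root-mean-squared error, the same quantity as val_error in the objective."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_val = np.array([-0.06, 0.06, -0.24, 0.30, -0.02])        # made-up run values
model_preds = np.array([-0.01, 0.00, -0.03, 0.02, -0.01])  # made-up model output
baseline_preds = np.full_like(y_val, y_val.mean())         # constant prediction

model_rmse = rmse(y_val, model_preds)
baseline_rmse = rmse(y_val, baseline_preds)
```

A model whose RMSE is only slightly below the baseline can still be useful once predictions are averaged over hundreds of pitches per pitcher.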


In [17]:
# params = study.best_params
# params['device'] = 'cuda'
# xgb_fastball = xgb.XGBRegressor(**params)
# xgb_fastball.fit(df_fastballs[fastball_features], df_fastballs[target])

In [None]:
# joblib.dump(xgb_fastball, 'models/xgb_fastball_model.joblib')

In [18]:
xgb_fastball = joblib.load('models/xgb_fastball_model.joblib')

In [19]:
df_fastballs_holdout['xgb_preds'] = xgb_fastball.predict(df_fastballs_holdout[fastball_features])


In [20]:
test_preds = df_fastballs_holdout.groupby(['player_name', 'pitch_type'], as_index=False)[['predicted_run_value', 'xgb_preds', 'release_speed']] \
    .agg({'predicted_run_value':'mean', 'xgb_preds':'mean', 'release_speed':'count'}) \
    .reset_index() \
    .rename(columns={'release_speed':'count'})
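
The same per-pitcher summary can also be written with pandas named aggregation, which avoids the extra `index` column that `reset_index()` adds when `as_index=False` is already set. This is a sketch on a toy frame with placeholder names, not a change to the cell above.

```python
import pandas as pd

holdout = pd.DataFrame({
    "player_name": ["A", "A", "A", "B", "B"],  # placeholder names
    "pitch_type": ["FF", "FF", "SI", "FF", "FF"],
    "predicted_run_value": [-0.02, 0.04, 0.00, 0.01, 0.03],
    "xgb_preds": [-0.01, -0.01, 0.00, 0.02, 0.02],
})

# One row per pitcher and pitch type: mean target, mean prediction, pitch count
summary = (
    holdout.groupby(["player_name", "pitch_type"], as_index=False)
    .agg(
        predicted_run_value=("predicted_run_value", "mean"),
        xgb_preds=("xgb_preds", "mean"),
        count=("xgb_preds", "count"),
    )
)
```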

Generate percentile rankings for fastballs

In [21]:
test_preds['predicted_run_value_percentile'] = (1 - test_preds['predicted_run_value'].rank(pct=True).round(3)).mul(100)
test_preds['stuff_percentile'] = (1 - test_preds['xgb_preds'].rank(pct=True).round(3)).mul(100)
test_preds.query('count > 100').sort_values('xgb_preds', ascending=True).head(50)

,index,player_name,pitch_type,predicted_run_value,xgb_preds,count,predicted_run_value_percentile,stuff_percentile
790,790,"Minter, A.J.",FC,-0.003167,-0.01774,340,65.8,99.8
954,954,"Rasmussen, Drew",FF,-0.005837,-0.017576,199,74.9,99.8
433,433,"Graterol, Brusdar",SI,-0.017705,-0.017394,370,92.5,99.7
82,82,"Bautista, Félix",FF,-0.019317,-0.016323,679,93.4,99.6
1107,1107,"Stewart, Brock",FF,-0.019187,-0.015337,151,93.3,99.3
1208,1208,"Vesia, Alex",FF,-0.006955,-0.015178,486,77.5,99.2
327,327,"Estrada, Jeremiah",FF,0.022413,-0.014387,199,16.0,99.1
953,953,"Rasmussen, Drew",FC,-0.01794,-0.013992,242,92.6,98.9
991,991,"Rogers, Tyler",SI,-0.013752,-0.013945,480,88.8,98.8
1053,1053,"Sewald, Paul",FF,-0.016657,-0.013899,513,91.4,98.7
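
The percentile columns come from an inverted rank: a lower (more negative) run value is better for the pitcher, so the rank is subtracted from 1. The toy example below, with made-up values, shows the behavior of that formula.

```python
import pandas as pd

# Made-up mean run values for four pitches; lower is better for the pitcher
run_values = pd.Series([-0.018, -0.005, 0.001, 0.022], index=["a", "b", "c", "d"])

# Same formula as the notebook cell: invert the percentile rank, scale to 0-100
percentile = (1 - run_values.rank(pct=True).round(3)).mul(100)
```

With this formula the best of n groups receives 100 * (1 - 1/n) rather than 100, and the worst receives 0. The ranks are also computed before the `count` filter, so low-volume pitches still occupy percentile slots.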


In [22]:
test_preds.loc[test_preds['player_name']=='Harrison, Kyle']

,index,player_name,pitch_type,predicted_run_value,xgb_preds,count,predicted_run_value_percentile,stuff_percentile
464,464,"Harrison, Kyle",FF,-0.000771,0.001352,170,61.6,55.5


Top 50 four-seam fastballs

In [23]:
test_preds.loc[test_preds['pitch_type']=='FF', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 25').sort_values('stuff_percentile', ascending=False).head(50)

,player_name,pitch_type,stuff_percentile,count
954,"Rasmussen, Drew",FF,99.8,199
82,"Bautista, Félix",FF,99.6,679
1107,"Stewart, Brock",FF,99.3,151
1208,"Vesia, Alex",FF,99.2,486
327,"Estrada, Jeremiah",FF,99.1,199
160,"Burdi, Nick",FF,98.9,41
1053,"Sewald, Paul",FF,98.7,513
419,"Glasnow, Tyler",FF,98.4,658
1265,"Wheeler, Zack",FF,98.3,1178
1119,"Strider, Spencer",FF,98.3,1587


Top 20 cutters

In [24]:
test_preds.loc[test_preds['pitch_type']=='FC', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 100').sort_values('stuff_percentile', ascending=False).head(20)

,player_name,pitch_type,stuff_percentile,count
790,"Minter, A.J.",FC,99.8,340
953,"Rasmussen, Drew",FC,98.9,242
552,"Jansen, Kenley",FC,97.9,559
359,"France, J.P.",FC,97.1,361
911,"Phillips, Evan",FC,96.8,187
217,"Clase, Emmanuel",FC,96.4,655
1191,"Urías, Julio",FC,96.2,162
405,"Gibaut, Ian",FC,95.7,259
337,"Faucher, Calvin",FC,95.3,158
1262,"Wesneski, Hayden",FC,95.2,132


Top 50 sinkers

In [25]:
test_preds.loc[test_preds['pitch_type']=='SI', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 100').sort_values('stuff_percentile', ascending=False).head(50)

,player_name,pitch_type,stuff_percentile,count
433,"Graterol, Brusdar",SI,99.7,370
991,"Rogers, Tyler",SI,98.8,480
201,"Chapman, Aroldis",SI,98.5,140
452,"Hader, Josh",SI,98.2,634
28,"Almonte, Yency",SI,98.0,210
1140,"Suárez, Ranger",SI,97.7,465
746,"May, Dustin",SI,97.6,245
689,"Loáisiga, Jonathan",SI,97.3,131
18,"Alcantara, Sandy",SI,97.2,765
996,"Romero, JoJo",SI,97.1,184


#### Breaking ball / off-speed feature engineering
Calculate some features relative to the fastball (velocity, spin-axis, and movement deltas).

In [26]:
from feature_engineering import compute_fastball_relative_features
df = compute_fastball_relative_features(df)
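
`compute_fastball_relative_features` lives in a local `feature_engineering` module that is not shown here. The sketch below is one guess at what such a helper might do, not the author's implementation: average each pitcher-season's most-thrown fastball and subtract those averages from every pitch. The Statcast columns `spin_axis`, `pfx_x`, and `pfx_z`, the "most-thrown fastball" rule, and the function name `fastball_relative_sketch` are all assumptions; only the output names match the features used in this notebook.

```python
import pandas as pd

FASTBALLS = ["FF", "SI", "FC"]

def fastball_relative_sketch(pitches: pd.DataFrame) -> pd.DataFrame:
    keys = ["pitcher", "game_year"]
    fb = pitches[pitches["pitch_type"].isin(FASTBALLS)]
    # Most-thrown fastball type per pitcher-season (assumed definition of "primary")
    counts = fb.groupby(keys + ["pitch_type"]).size().reset_index(name="n")
    primary = (counts.sort_values("n", ascending=False)
                     .drop_duplicates(keys)[keys + ["pitch_type"]])
    # Average characteristics of that primary fastball
    base = (fb.merge(primary, on=keys + ["pitch_type"])
              .groupby(keys, as_index=False)[["release_speed", "spin_axis", "pfx_z", "pfx_x"]]
              .mean()
              .rename(columns={"release_speed": "fb_velo", "spin_axis": "fb_axis",
                               "pfx_z": "fb_vert", "pfx_x": "fb_horz"}))
    out = pitches.merge(base, on=keys, how="left")
    out["velo_delta"] = out["release_speed"] - out["fb_velo"]
    # Naive difference; ignores the 0/360 degree wraparound of spin axis
    out["spin_axis_delta"] = out["spin_axis"] - out["fb_axis"]
    out["vert_delta"] = out["pfx_z"] - out["fb_vert"]
    out["horz_delta"] = out["pfx_x"] - out["fb_horz"]
    return out.drop(columns=["fb_velo", "fb_axis", "fb_vert", "fb_horz"])

# Made-up pitcher-season: two four-seamers, one sinker, one slider
toy_pitches = pd.DataFrame({
    "pitcher": [1, 1, 1, 1],
    "game_year": [2023] * 4,
    "pitch_type": ["FF", "FF", "SI", "SL"],
    "release_speed": [95.0, 97.0, 94.0, 86.0],
    "spin_axis": [200.0, 210.0, 230.0, 90.0],
    "pfx_z": [1.4, 1.6, 0.8, 0.1],
    "pfx_x": [-0.6, -0.8, -1.3, 0.4],
})
toy_rel = fastball_relative_sketch(toy_pitches)
slider = toy_rel[toy_rel["pitch_type"] == "SL"].iloc[0]
```

Pitchers with no fastball in a season get missing deltas under this scheme, which would be consistent with the lower non-null counts for the delta columns in the `info()` output further down.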

In [27]:
non_fastball_features = [
    'release_speed', 'az', 'ax', 'plate_x', 'plate_z',
    'axis_deviation_adj', 'vaa', 'haa', 'velo_delta',
    'spin_axis_delta', 'vert_delta', 'horz_delta'
]
target = 'predicted_run_value'

In [28]:
# Everything that is not FF/SI/FC, including rows with a missing pitch_type
df_non_fastballs = df.loc[(~df['pitch_type'].isin(fastballs)) & (df['game_year'].isin([2020, 2021, 2022]))]
df_non_fastballs_holdout = df.loc[(~df['pitch_type'].isin(fastballs)) & (df['game_year']==2023)]

In [29]:
df_non_fastballs[non_fastball_features].info()

<class 'pandas.core.frame.DataFrame'>
Index: 755712 entries, 1 to 1717521
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   release_speed       730541 non-null  float64
 1   az                  730548 non-null  float64
 2   ax                  730548 non-null  float64
 3   plate_x             730548 non-null  float64
 4   plate_z             730548 non-null  float64
 5   axis_deviation_adj  706676 non-null  float64
 6   vaa                 730482 non-null  float64
 7   haa                 730482 non-null  float64
 8   velo_delta          714150 non-null  float64
 9   spin_axis_delta     706676 non-null  float64
 10  vert_delta          714156 non-null  float64
 11  horz_delta          714156 non-null  float64
dtypes: float64(12)
memory usage: 75.0 MB


In [30]:
df_non_fastballs = df_non_fastballs.dropna(subset=non_fastball_features)
df_non_fastballs = df_non_fastballs.dropna(subset=[target])

df_non_fastballs_holdout = df_non_fastballs_holdout.dropna(subset=non_fastball_features)
df_non_fastballs_holdout = df_non_fastballs_holdout.dropna(subset=[target])

In [31]:
# Use GroupShuffleSplit so each pitcher's pitches fall entirely in either the training or the validation set
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)

# Get the indices for the training and validation sets
train_idx, val_idx = next(gss.split(df_non_fastballs, groups=df_non_fastballs['pitcher']))

# Create the training and validation sets
train = df_non_fastballs.iloc[train_idx]
val = df_non_fastballs.iloc[val_idx]


Train and tune non-fastball model

In [32]:
# # Define the objective function for Optuna
# def objective(trial: Trial) -> float:
#     params = {
#         'device': 'cuda',  # Use GPU acceleration
#         'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
#         'max_depth': trial.suggest_int('max_depth', 3, 10),
#         'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1),
#         'subsample': trial.suggest_float('subsample', 0.5, 1.0),
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
#     }

#     model = xgb.XGBRegressor(**params)
#     model.fit(train[non_fastball_features], train[target])

#     val_preds = model.predict(val[non_fastball_features])
#     val_error = np.sqrt(((val_preds - val[target]) ** 2).mean())

#     return val_error

# # Run the Optuna optimization
# study = optuna.create_study(direction='minimize', sampler=TPESampler(seed=42))
# study.optimize(objective, n_trials=50)

# # Print the best parameters
# params = study.best_params
# params['device'] = 'cuda'

# xgb_non_fastball = xgb.XGBRegressor(**params)
# xgb_non_fastball.fit(df_non_fastballs[non_fastball_features], df_non_fastballs[target])


In [33]:

# joblib.dump(xgb_non_fastball, 'models/xgb_non_fastball_model.joblib')

In [34]:
xgb_non_fastball = joblib.load('models/xgb_non_fastball_model.joblib')

In [35]:
df_non_fastballs_holdout['xgb_preds'] = xgb_non_fastball.predict(df_non_fastballs_holdout[non_fastball_features])

In [36]:
test_preds_non_fb = df_non_fastballs_holdout.groupby(['player_name', 'pitch_type'], as_index=False)[['predicted_run_value', 'xgb_preds', 'release_speed']] \
    .agg({'predicted_run_value':'mean', 'xgb_preds':'mean', 'release_speed':'count'}) \
    .reset_index() \
    .rename(columns={'release_speed':'count'})

In [37]:
test_preds_non_fb['predicted_run_value_percentile'] = (1 - test_preds_non_fb['predicted_run_value'].rank(pct=True).round(3)).mul(100)
test_preds_non_fb['stuff_percentile'] = (1 - test_preds_non_fb['xgb_preds'].rank(pct=True).round(3)).mul(100)
test_preds_non_fb.query('count > 100').sort_values('xgb_preds', ascending=True).head(50)

,index,player_name,pitch_type,predicted_run_value,xgb_preds,count,predicted_run_value_percentile,stuff_percentile
1579,1579,"deGrom, Jacob",SL,-0.027671,-0.027571,144,96.1,99.9
1226,1226,"Santos, Gregory",SL,-0.024521,-0.021639,526,94.2,99.9
939,939,"Miller, Bobby",SL,0.006294,-0.019726,292,42.0,99.8
1517,1517,"Williams, Devin",CH,-0.013002,-0.019514,473,82.2,99.7
247,247,"Clase, Emmanuel",SL,-0.01755,-0.018862,302,88.0,99.6
557,557,"Harvey, Hunter",FS,-0.003771,-0.01866,149,64.8,99.5
1108,1108,"Pressly, Ryan",SL,-0.020341,-0.01855,357,90.6,99.4
1043,1043,"Ortiz, Luis F.",ST,-0.011842,-0.01775,113,79.9,99.4
663,663,"Jax, Griffin",ST,-0.017318,-0.01742,504,87.7,99.2
1137,1137,"Ragans, Cole",SL,-0.033876,-0.017302,114,97.7,99.1


In [39]:
test_preds_non_fb.loc[test_preds_non_fb['player_name']=='Harrison, Kyle']

,index,player_name,pitch_type,predicted_run_value,xgb_preds,count,predicted_run_value_percentile,stuff_percentile
551,551,"Harrison, Kyle",CH,0.073428,-0.005175,13,2.2,70.8
552,552,"Harrison, Kyle",SV,0.008386,-0.006737,58,37.7,79.2


Top changeups

In [None]:
test_preds_non_fb.loc[test_preds_non_fb['pitch_type']=='CH', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 100').sort_values('stuff_percentile', ascending=False).head(50)

,player_name,pitch_type,stuff_percentile,count
1517,"Williams, Devin",CH,99.7,473
963,"Montero, Rafael",CH,98.9,232
160,"Brazoban, Huascar",CH,98.7,263
27,"Alcantara, Sandy",CH,98.4,722
1302,"Snell, Blake",CH,97.2,505
1083,"Peralta, Wandy",CH,97.0,363
446,"Gallen, Zac",CH,95.4,374
164,"Brieske, Beau",CH,92.9,101
1210,"Sale, Chris",CH,92.8,157
826,"Luzardo, Jesús",CH,92.4,564


Top curveballs

In [None]:
test_preds_non_fb.loc[test_preds_non_fb['pitch_type']=='CU', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 100').sort_values('stuff_percentile', ascending=False).head(50)

,player_name,pitch_type,stuff_percentile,count
583,"Herget, Jimmy",CU,91.5,114
359,"Duran, Jhoan",CU,91.2,241
1200,"Ruiz, José",CU,90.8,285
716,"Kikuchi, Yusei",CU,90.3,420
726,"Kirby, George",CU,88.7,307
901,"McClanahan, Shane",CU,87.0,294
938,"Miller, Bobby",CU,86.1,262
460,"García, Yimi",CU,84.6,320
764,"Lange, Alex",CU,84.4,572
667,"Johnson, Pierce",CU,81.7,565


Top splitters

In [None]:
test_preds_non_fb.loc[test_preds_non_fb['pitch_type']=='FS', ['player_name', 'pitch_type', 'stuff_percentile', 'count']].query('count > 100').sort_values('stuff_percentile', ascending=False).head(50)

,player_name,pitch_type,stuff_percentile,count
557,"Harvey, Hunter",FS,99.5,149
360,"Duran, Jhoan",FS,98.5,247
405,"Finnegan, Kyle",FS,97.1,240
1541,"Winn, Keaton",FS,93.7,206
548,"Harris, Hobie",FS,93.0,175
374,"Eovaldi, Nathan",FS,92.2,511
99,"Bednar, David",FS,91.1,155
92,"Bautista, Félix",FS,81.5,237
252,"Cobb, Alex",FS,75.9,908
472,"Gausman, Kevin",FS,75.1,1016
