# Data Preprocessing

In this notebook, I clean my data and perform feature selection in addition to feature engineering. 

I train my models on pitching data from five seasons (2019-2023) and then test them on data from the 2024 season. Because the 2020 season was much shorter due to COVID-19, I decided to include the 2019 season as an additional dataset. The seasonal data also includes 2018 because a pitcher's overall stats from the previous year could be useful for training, and the 2019 pitch-by-pitch data needs a previous season to draw on.
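
As a rough illustration of how previous-year seasonal stats could be attached to pitch-by-pitch rows and how the season-based split might look, here is a minimal sketch on toy data. The column names `season` and `prev_era` are assumptions made for the example, not the actual columns in my seasonal files.

```python
import pandas as pd

# Toy pitch-by-pitch rows; the real frames have over 100 columns.
pbp = pd.DataFrame({
    "pitcher": [1, 1, 2, 2],
    "game_year": [2023, 2024, 2023, 2024],
    "pitch_type": ["FF", "SL", "CH", "FF"],
})

# Toy seasonal aggregates; "season" and "prev_era" are hypothetical names.
seasonal = pd.DataFrame({
    "pitcher": [1, 1, 2],
    "season": [2022, 2023, 2023],
    "prev_era": [3.10, 2.85, 4.20],
})

# Shift each seasonal row forward one year so it lines up with the season
# in which it counts as "previous-year" information.
lagged = seasonal.assign(game_year=seasonal["season"] + 1).drop(columns="season")

# Left join keeps every pitch; pitchers without a prior season get NaN.
merged = pbp.merge(lagged, on=["pitcher", "game_year"], how="left")

# Split by season: everything up to 2023 for training, 2024 held out.
train = merged[merged["game_year"] <= 2023]
test = merged[merged["game_year"] == 2024]
```

Splitting by season rather than at random keeps the test set strictly later in time than the training set, so no 2024 information leaks into training.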

The following operations are applied to both the pitch-by-pitch data and the overall seasonal data:

- Deal with missing values
- Feature selection (100+ features likely not feasible for this project)
- Dimensionality reduction on features
- Feature engineering
- Target variable modification (condensing pitch types down to fewer than 14, the current number of different pitch types on the MLB website)
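
To make the last item concrete, here is a minimal sketch of one way pitch types could be condensed. The grouping into `fastball`, `breaking`, `offspeed`, and `other` is purely an assumption for illustration, not the mapping I settle on; the keys are standard Statcast `pitch_type` codes.

```python
import pandas as pd

# Hypothetical grouping of Statcast pitch_type codes into broader classes.
PITCH_GROUPS = {
    "FF": "fastball", "SI": "fastball", "FC": "fastball",
    "SL": "breaking", "ST": "breaking", "CU": "breaking", "KC": "breaking",
    "CH": "offspeed", "FS": "offspeed",
}

pitches = pd.Series(["FF", "SL", "CH", "KN", None], name="pitch_type")

# Map known codes to their group; unknown codes and nulls both become NaN here.
mapped = pitches.map(PITCH_GROUPS)

# Codes missing from the mapping become "other"; true nulls stay null
# so they can be handled with the rest of the missing values.
grouped = mapped.mask(pitches.notna() & mapped.isna(), "other")
```

Keeping nulls separate from the `other` bucket avoids treating a missing label as if it were a rare pitch type.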


In [47]:
# Import libraries
import pybaseball
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from pybaseball import pitching_stats_bref
from pybaseball import statcast
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
from pybaseball import playerid_reverse_lookup

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [48]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

In [49]:
# Set max display options so that I can see everything I need to see
pd.set_option("display.max_columns", 150)
pd.set_option("display.max_rows", 200)

## Load the data

In [50]:
# Pitch-by-pitch data from the 2019-2024 seasons
pbp2019 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2019.csv')
pbp2020 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2020.csv')
pbp2021 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2021.csv')
pbp2022 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2022.csv')
pbp2023 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2023.csv')
pbp2024 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2024.csv')

In [51]:
# Seasonal aggregate pitching data from the 2018-2024 seasons
season2018 = pd.read_csv('../data/seasonal/pitching_stats_2018.csv')
season2019 = pd.read_csv('../data/seasonal/pitching_stats_2019.csv')
season2020 = pd.read_csv('../data/seasonal/pitching_stats_2020.csv')
season2021 = pd.read_csv('../data/seasonal/pitching_stats_2021.csv')
season2022 = pd.read_csv('../data/seasonal/pitching_stats_2022.csv')
season2023 = pd.read_csv('../data/seasonal/pitching_stats_2023.csv')
season2024 = pd.read_csv('../data/seasonal/pitching_stats_2024.csv')

## Data Cleaning

In [52]:
# Convert game_date to datetime in pitch-by-pitch data
pbp2019['game_date'] = pd.to_datetime(pbp2019['game_date'])
pbp2020['game_date'] = pd.to_datetime(pbp2020['game_date'])
pbp2021['game_date'] = pd.to_datetime(pbp2021['game_date'])
pbp2022['game_date'] = pd.to_datetime(pbp2022['game_date'])
pbp2023['game_date'] = pd.to_datetime(pbp2023['game_date'])
pbp2024['game_date'] = pd.to_datetime(pbp2024['game_date'])

### Dealing with null values

As discovered during EDA, there are numerous columns with only null values. Let's take a closer look at what percentage of each column is missing.

In [53]:
def null_percentage_report(df):
    """
    Function to calculate the percentage of null values in each column of a DataFrame.
    Returns:
    pd.DataFrame: A DataFrame with column names and the percentage of nulls.
    """
    null_percentage = (df.isnull().sum() / len(df)) * 100
    report = pd.DataFrame({
        'Column Name': df.columns,
        'Null Percentage (%)': null_percentage.round(2)
    }).sort_values(by='Null Percentage (%)', ascending=False).reset_index(drop=True)

    return report

In [54]:
# 2019 season
null_percentage_report(pbp2019).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,arm_angle,spin_dir,umpire,sv_id,tfs_zulu_deprecated,bat_speed,swing_length,break_length_deprecated,break_angle_deprecated,spin_rate_deprecated,tfs_deprecated,on_3b,estimated_slg_using_speedangle,estimated_ba_using_speedangle,launch_speed_angle,hc_y,hc_x,bb_type,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,events,babip_value,iso_value,hit_distance_sc,launch_angle,launch_speed,hyper_speed,on_1b,pitcher_days_since_prev_game,pitcher_days_until_next_game,release_spin_rate,batter_days_since_prev_game,batter_days_until_next_game,delta_run_exp,delta_pitcher_run_exp,effective_speed,pitch_name,pitch_type,release_extension,release_pos_y,spin_axis,release_pos_z,release_pos_x,release_speed,api_break_z_with_gravity,api_break_x_arm,pfx_x,pfx_z,plate_x,plate_z,api_break_x_batter_in,sz_bot,sz_top,az,ay,ax,vz0,vy0,vx0,zone,of_fielding_alignment,if_fielding_alignment,age_pit,age_bat,player_name,post_fld_score,delta_home_win_exp,game_type,des,n_thruorder_pitcher,description,pitcher,batter,home_score_diff,bat_score_diff,home_win_exp,bat_win_exp,age_pit_legacy,age_bat_legacy,n_priorpa_thisgame_player_at_bat,inning_topbot,post_bat_score,strikes,outs_when_up,game_date,game_pk,fielder_2,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,game_year,balls,post_home_score,type,away_team,home_team,p_throws,at_bat_number,pitch_number,stand,inning,away_score,bat_score,fld_score,post_away_score,home_score
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.82,83.33,83.33,83.33,82.62,82.62,82.57,81.73,77.59,75.49,75.43,74.2,74.2,74.2,74.2,74.1,72.91,72.91,72.75,69.77,6.62,6.01,3.69,3.55,3.46,2.23,2.23,2.0,1.99,1.99,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.98,1.86,1.86,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
# 2020 season
null_percentage_report(pbp2020).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,spin_dir,bat_speed,tfs_deprecated,tfs_zulu_deprecated,umpire,sv_id,swing_length,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,on_3b,launch_speed_angle,estimated_slg_using_speedangle,estimated_ba_using_speedangle,hc_x,hc_y,bb_type,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,iso_value,events,babip_value,launch_angle,launch_speed,hit_distance_sc,hyper_speed,on_1b,arm_angle,pitcher_days_since_prev_game,pitcher_days_until_next_game,batter_days_since_prev_game,batter_days_until_next_game,release_spin_rate,spin_axis,effective_speed,release_extension,of_fielding_alignment,if_fielding_alignment,pitch_name,pfx_x,api_break_x_batter_in,api_break_x_arm,api_break_z_with_gravity,release_speed,release_pos_x,release_pos_z,delta_pitcher_run_exp,zone,delta_run_exp,release_pos_y,pitch_type,plate_x,sz_bot,plate_z,vx0,vy0,vz0,ax,ay,sz_top,az,pfx_z,pitcher,batter,delta_home_win_exp,description,bat_score_diff,des,home_score_diff,game_year,home_win_exp,bat_win_exp,post_fld_score,age_bat_legacy,age_pit,age_bat,n_thruorder_pitcher,n_priorpa_thisgame_player_at_bat,player_name,outs_when_up,age_pit_legacy,post_bat_score,strikes,post_home_score,game_date,game_pk,fielder_2,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,balls,type,away_team,home_team,p_throws,stand,game_type,at_bat_number,pitch_number,inning,home_score,away_score,bat_score,fld_score,post_away_score,inning_topbot
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.39,83.48,83.48,83.48,83.38,83.38,83.37,81.02,78.32,75.03,74.94,74.83,74.83,74.83,74.83,70.48,70.48,70.33,70.32,68.39,16.0,9.49,9.19,2.62,2.61,0.48,0.48,0.22,0.22,0.19,0.19,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:
# 2021 season
null_percentage_report(pbp2021).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,umpire,spin_dir,sv_id,bat_speed,swing_length,break_length_deprecated,break_angle_deprecated,spin_rate_deprecated,tfs_deprecated,tfs_zulu_deprecated,on_3b,estimated_ba_using_speedangle,estimated_slg_using_speedangle,launch_speed_angle,hc_x,hc_y,bb_type,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,babip_value,iso_value,events,on_1b,launch_speed,launch_angle,hit_distance_sc,hyper_speed,pitcher_days_since_prev_game,pitcher_days_until_next_game,arm_angle,batter_days_since_prev_game,batter_days_until_next_game,delta_pitcher_run_exp,delta_run_exp,of_fielding_alignment,if_fielding_alignment,release_spin_rate,spin_axis,effective_speed,release_extension,release_pos_y,release_pos_x,release_pos_z,vx0,api_break_x_batter_in,api_break_x_arm,api_break_z_with_gravity,release_speed,zone,vy0,pitch_name,pfx_x,pfx_z,pitch_type,vz0,ax,ay,az,sz_top,sz_bot,plate_x,plate_z,game_pk,inning_topbot,batter,pitcher,description,game_type,des,bat_score_diff,delta_home_win_exp,home_score_diff,age_pit,home_win_exp,bat_win_exp,age_pit_legacy,age_bat_legacy,p_throws,age_bat,n_thruorder_pitcher,n_priorpa_thisgame_player_at_bat,player_name,stand,home_team,fielder_2,post_fld_score,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,game_date,game_year,strikes,balls,type,away_team,at_bat_number,pitch_number,outs_when_up,home_score,inning,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,away_score
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.52,83.35,83.35,83.35,82.48,82.48,82.46,81.0,77.22,76.06,75.92,73.79,73.79,73.79,73.79,69.18,68.14,68.14,67.89,67.86,10.07,9.69,8.88,6.99,6.84,5.58,5.58,3.21,3.21,2.86,2.86,2.64,2.63,2.54,2.54,2.54,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,2.51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [57]:
# 2022 season
null_percentage_report(pbp2022).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,umpire,spin_dir,sv_id,bat_speed,swing_length,break_length_deprecated,break_angle_deprecated,spin_rate_deprecated,tfs_deprecated,tfs_zulu_deprecated,on_3b,estimated_ba_using_speedangle,estimated_slg_using_speedangle,launch_speed_angle,hc_x,hc_y,bb_type,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,babip_value,iso_value,events,on_1b,launch_speed,launch_angle,hit_distance_sc,hyper_speed,pitcher_days_until_next_game,pitcher_days_since_prev_game,batter_days_since_prev_game,batter_days_until_next_game,arm_angle,delta_pitcher_run_exp,delta_run_exp,of_fielding_alignment,if_fielding_alignment,release_spin_rate,spin_axis,effective_speed,release_extension,api_break_x_batter_in,api_break_x_arm,api_break_z_with_gravity,pfx_x,vx0,release_pos_y,release_speed,release_pos_x,release_pos_z,zone,vy0,pitch_name,pfx_z,pitch_type,vz0,ax,ay,az,sz_top,sz_bot,plate_x,plate_z,game_pk,inning_topbot,batter,pitcher,description,game_type,des,bat_score_diff,delta_home_win_exp,home_score_diff,age_pit,home_win_exp,bat_win_exp,age_pit_legacy,age_bat_legacy,p_throws,age_bat,n_thruorder_pitcher,n_priorpa_thisgame_player_at_bat,player_name,stand,home_team,fielder_2,post_fld_score,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,game_date,game_year,strikes,balls,type,away_team,at_bat_number,pitch_number,outs_when_up,home_score,inning,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,away_score
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.92,83.12,83.12,83.12,81.97,81.97,81.96,81.39,76.75,76.27,76.18,73.48,73.48,73.48,73.48,69.29,67.85,67.8,67.65,67.61,11.07,11.0,8.05,8.04,7.52,6.9,6.9,4.95,4.95,3.74,3.74,3.66,3.61,3.21,3.21,3.21,3.21,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
# 2023 season
null_percentage_report(pbp2023).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,umpire,spin_dir,sv_id,bat_speed,swing_length,break_length_deprecated,break_angle_deprecated,spin_rate_deprecated,tfs_deprecated,tfs_zulu_deprecated,on_3b,estimated_ba_using_speedangle,estimated_slg_using_speedangle,launch_speed_angle,bb_type,hc_x,hc_y,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,babip_value,iso_value,events,on_1b,launch_speed,launch_angle,hit_distance_sc,hyper_speed,pitcher_days_since_prev_game,pitcher_days_until_next_game,batter_days_since_prev_game,batter_days_until_next_game,arm_angle,delta_pitcher_run_exp,delta_run_exp,of_fielding_alignment,if_fielding_alignment,release_spin_rate,spin_axis,effective_speed,release_extension,vx0,release_pos_y,api_break_x_batter_in,api_break_x_arm,api_break_z_with_gravity,release_speed,release_pos_x,release_pos_z,zone,vy0,pitch_name,pfx_x,pfx_z,pitch_type,vz0,ax,ay,az,sz_top,sz_bot,plate_x,plate_z,game_pk,inning_topbot,batter,pitcher,description,game_type,des,bat_score_diff,delta_home_win_exp,home_score_diff,age_pit,home_win_exp,bat_win_exp,age_pit_legacy,age_bat_legacy,p_throws,age_bat,n_thruorder_pitcher,n_priorpa_thisgame_player_at_bat,player_name,stand,home_team,fielder_2,post_fld_score,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,game_date,game_year,strikes,balls,type,away_team,at_bat_number,pitch_number,outs_when_up,home_score,inning,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,away_score
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.29,83.13,83.13,83.13,82.39,82.39,82.39,80.82,77.25,75.92,75.81,73.86,73.86,73.86,73.86,69.17,67.76,67.72,67.6,67.57,9.86,9.57,6.49,6.38,6.28,5.36,5.36,3.41,3.41,2.71,2.71,2.3,2.27,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,2.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
# 2024 season
null_percentage_report(pbp2024).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
Column Name,umpire,sv_id,break_length_deprecated,break_angle_deprecated,spin_rate_deprecated,spin_dir,tfs_deprecated,tfs_zulu_deprecated,on_3b,estimated_ba_using_speedangle,estimated_slg_using_speedangle,launch_speed_angle,hc_x,hc_y,bb_type,on_2b,hit_location,estimated_woba_using_speedangle,woba_denom,woba_value,babip_value,iso_value,events,on_1b,launch_speed,launch_angle,hit_distance_sc,hyper_speed,bat_speed,swing_length,pitcher_days_since_prev_game,pitcher_days_until_next_game,batter_days_since_prev_game,batter_days_until_next_game,arm_angle,delta_pitcher_run_exp,delta_run_exp,of_fielding_alignment,if_fielding_alignment,release_spin_rate,spin_axis,effective_speed,release_extension,api_break_z_with_gravity,release_pos_y,pfx_x,vy0,zone,release_pos_z,release_pos_x,release_speed,plate_x,plate_z,vx0,api_break_x_arm,api_break_x_batter_in,sz_bot,sz_top,az,ay,ax,vz0,pfx_z,pitch_name,pitch_type,fielder_9,fielder_8,fielder_3,description,pitcher,batter,home_score_diff,bat_score_diff,home_win_exp,bat_win_exp,age_pit_legacy,age_bat_legacy,age_pit,age_bat,n_thruorder_pitcher,n_priorpa_thisgame_player_at_bat,player_name,fielder_2,game_pk,game_date,outs_when_up,inning,inning_topbot,des,game_type,delta_home_win_exp,fielder_5,fielder_7,fielder_6,game_year,strikes,balls,type,away_team,at_bat_number,pitch_number,home_score,stand,fielder_4,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,home_team,p_throws,away_score
Null Percentage (%),100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.65,82.92,82.92,82.92,82.3,82.3,82.29,81.36,77.17,75.69,75.58,73.95,73.95,73.95,73.95,69.64,67.23,67.18,67.03,67.01,57.43,57.43,9.28,8.95,5.78,5.76,5.09,4.68,4.68,2.88,2.88,2.41,2.41,2.11,2.07,1.97,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.96,1.81,1.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Most of the features that are majority null can just be dropped. There are, however, some that require more nuanced consideration.

The columns that indicate whether a player is on base have a null value whenever there is no player on that base. Although the value is recorded as null, it isn't actually missing: it means the base is empty. Having no player on base is also useful information that a pitcher will take into account while making his decisions. In this case, I will turn the runner-on-base columns into binary indicators: 0 for no player on the base, 1 for a player on the base. While knowing who specifically is on base is potentially important (some players are great at stealing bases), I will not factor that into my model in order to reduce feature complexity a little.

The columns with all nulls are not relevant to my modelling, so they can just be dropped.

#### Converting runner on base (on_1b, on_2b, on_3b) columns to binary indicators

Before proceeding to drop columns with majority nulls, I'll first convert the runner on base columns to binary indicators.
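
A small sketch of what this encoding does on toy data. The runner IDs below are made-up placeholder values; the only assumption is that a non-null entry means the base is occupied.

```python
import numpy as np
import pandas as pd

# Toy frame: a runner ID when the base is occupied, NaN when it is empty.
# The IDs are made-up placeholder values.
toy = pd.DataFrame({
    "on_1b": [np.nan, 111111.0, np.nan],
    "on_2b": [np.nan, np.nan, 222222.0],
    "on_3b": [np.nan, np.nan, 333333.0],
})

base_cols = ["on_1b", "on_2b", "on_3b"]

# 1 if a runner ID is present, 0 if the base is empty.
encoded = toy[base_cols].notnull().astype(int)

# 1 if at least one base is occupied on that pitch.
encoded["any_on_base"] = encoded[base_cols].any(axis=1).astype(int)
```

The first row (bases empty) ends up all zeros, while the other two rows get `any_on_base = 1`.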

In [60]:
def encode_runners_on_base(df, columns=('on_1b', 'on_2b', 'on_3b')):
    """
    Encodes the presence of runners on base as binary indicators in a df.

    Each column in `columns` is overwritten in place with 1 (runner present)
    or 0 (base empty), and an `any_on_base` column is added.
    """
    # Only work with the base columns that actually exist in this df
    present = [col for col in columns if col in df.columns]

    for col in present:
        # Replace column values with binary indicator (1 if not null, 0 if null)
        df[col] = df[col].notnull().astype(int)

    # Add an "any_on_base" column for whether any runner is on base.
    # The encoded columns keep their original names, so they are selected directly.
    df['any_on_base'] = df[present].any(axis=1).astype(int)

    return df

In [61]:
pbp2019 = encode_runners_on_base(pbp2019)
pbp2020 = encode_runners_on_base(pbp2020)
pbp2021 = encode_runners_on_base(pbp2021)
pbp2022 = encode_runners_on_base(pbp2022)
pbp2023 = encode_runners_on_base(pbp2023)
pbp2024 = encode_runners_on_base(pbp2024)

#### Dropping columns with all/majority nulls

In [None]:
def drop_columns_by_null_threshold(df, threshold):
    """
    Function for dropping columns from a df where the percentage of null values meets or exceeds a given threshold.

    Returns:
    - pd.DataFrame: A df with columns at or above the threshold removed.
    - list: A list of dropped columns.
    """
    # Calculate the percentage of nulls for each column
    null_percentage = (df.isnull().sum() / len(df)) * 100
    
    # Identify columns to drop
    columns_to_drop = null_percentage[null_percentage >= threshold].index.tolist()
    
    # Drop the identified columns
    cleaned_df = df.drop(columns=columns_to_drop)
    
    return cleaned_df, columns_to_drop
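
The comparison against the threshold is inclusive, which matters for columns that sit exactly on the boundary. A minimal sketch on toy data (all names here are made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "all_null": [np.nan, np.nan, np.nan, np.nan],   # 100% null
    "half_null": [1.0, np.nan, 3.0, np.nan],        # exactly 50% null
    "quarter_null": [1.0, 2.0, 3.0, np.nan],        # 25% null
    "complete": [1, 2, 3, 4],                       # 0% null
})

# Same quantity as isnull().sum() / len(df), expressed as a percentage.
null_pct = toy.isnull().mean() * 100

# ">=" drops a column that sits exactly on the threshold; ">" would keep it.
dropped_inclusive = null_pct[null_pct >= 50].index.tolist()
dropped_strict = null_pct[null_pct > 50].index.tolist()
```

With a threshold of 50, the inclusive version drops both `all_null` and `half_null`, while the strict version drops only `all_null`.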

In [63]:
# Dropping columns with at least 50% nulls in 2019 pitch-by-pitch
pbp2019, dropped_columns = drop_columns_by_null_threshold(pbp2019, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 28 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed', 'arm_angle']


In [64]:
# Dropping columns with at least 50% nulls in 2020 pitch-by-pitch
pbp2020, dropped_columns = drop_columns_by_null_threshold(pbp2020, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 27 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed']


In [65]:
# Dropping columns with at least 50% nulls in 2021 pitch-by-pitch
pbp2021, dropped_columns = drop_columns_by_null_threshold(pbp2021, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 27 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed']


In [66]:
# Dropping columns with at least 50% nulls in 2022 pitch-by-pitch
pbp2022, dropped_columns = drop_columns_by_null_threshold(pbp2022, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 27 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed']


In [67]:
# Dropping columns with at least 50% nulls in 2023 pitch-by-pitch
pbp2023, dropped_columns = drop_columns_by_null_threshold(pbp2023, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 27 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed']


In [68]:
# Dropping columns with at least 50% nulls in 2024 pitch-by-pitch
pbp2024, dropped_columns = drop_columns_by_null_threshold(pbp2024, 50)
print("Dropped columns:", len(dropped_columns), dropped_columns)

Dropped columns: 27 ['events', 'spin_dir', 'spin_rate_deprecated', 'break_angle_deprecated', 'break_length_deprecated', 'hit_location', 'bb_type', 'hc_x', 'hc_y', 'tfs_deprecated', 'tfs_zulu_deprecated', 'umpire', 'sv_id', 'hit_distance_sc', 'launch_speed', 'launch_angle', 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle', 'woba_value', 'woba_denom', 'babip_value', 'iso_value', 'launch_speed_angle', 'bat_speed', 'swing_length', 'estimated_slg_using_speedangle', 'hyper_speed']


It looks like a few stats, such as bat speed, only started receiving actual measurements in the 2024 season. The arm_angle column likewise only has measurements from the 2020 season onward. This is an interesting feature because it is directly related to how a pitcher pitches. However, according to an [article](https://www.mlb.com/news/how-arm-slot-and-arm-angle-affect-pitches?partnerID=web_article-share) by Mike Petriello on MLB's official website, most pitchers are relatively consistent with their arm angle when pitching: 75% of pitchers stay within 3 degrees year-to-year. There are of course cases where pitchers vary their arm angle successfully, but that is a rarer occurrence, and given the amount of missing arm angle data I'll just drop it as a feature.
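
Differences like these are easier to spot when the null percentages for every season sit side by side. A minimal sketch, using toy frames with made-up values in place of the real per-season DataFrames:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the per-season frames (values are made up).
seasons = {
    2019: pd.DataFrame({"arm_angle": [np.nan, np.nan], "bat_speed": [np.nan, np.nan]}),
    2020: pd.DataFrame({"arm_angle": [31.5, np.nan], "bat_speed": [np.nan, np.nan]}),
    2024: pd.DataFrame({"arm_angle": [28.0, 40.2], "bat_speed": [71.3, np.nan]}),
}

# One row per column, one column per season; values are null percentages.
null_by_season = pd.DataFrame(
    {year: df.isnull().mean() * 100 for year, df in seasons.items()}
).round(2)
```

In the resulting table, a column whose null percentage drops sharply from one season to the next is one that only started being measured in the later season.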

In [70]:
# Drop arm_angle (already dropped from pbp2019 above, where it was 100% null)
pbp2020.drop(columns=['arm_angle'], inplace=True)
pbp2021.drop(columns=['arm_angle'], inplace=True)
pbp2022.drop(columns=['arm_angle'], inplace=True)
pbp2023.drop(columns=['arm_angle'], inplace=True)
pbp2024.drop(columns=['arm_angle'], inplace=True)

In [76]:
def check_columns_consistency(*dataframes):
    """
    Checks whether all input DataFrames have the same columns
    """
    # Get the columns from each df
    columns_list = [set(df.columns) for df in dataframes]
    
    # Check if all column sets are identical
    if all(columns == columns_list[0] for columns in columns_list):
        consistent = 'Yes'
    else:
        consistent = 'No'
    
    return consistent, [list(columns) for columns in columns_list]

In [77]:
# Always good to check if the columns are consistent across my DataFrames
consistent, columns_per_df = check_columns_consistency(pbp2019, pbp2020, pbp2021, pbp2022, pbp2023, pbp2024)
print("Are all DataFrames consistent in columns?", consistent)
if consistent == 'No':  # the function returns the strings 'Yes'/'No', and 'No' is truthy
    print("Columns in each DataFrame:", columns_per_df)

Are all DataFrames consistent in columns? Yes
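
Had the check failed, the next step would be finding which columns differ. A minimal sketch of one way to do that, using toy frames with hypothetical names:

```python
import pandas as pd

# Toy frames standing in for two seasons with mismatched columns.
frames = {
    "a": pd.DataFrame(columns=["pitch_type", "zone", "arm_angle"]),
    "b": pd.DataFrame(columns=["pitch_type", "zone"]),
}

# Columns present in every frame.
common = set.intersection(*(set(df.columns) for df in frames.values()))

# For each frame, the columns it has beyond the shared set.
extras = {name: sorted(set(df.columns) - common) for name, df in frames.items()}
```

Here `extras` maps each frame to its leftover columns, so a stray column such as `arm_angle` shows up under the frame that still contains it.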


### Feature selection/elimination

Now that the only remaining features have a small proportion of missing values, I will look at eliminating features that are not important for my goals. The remaining null values will be dealt with later.

In [None]:
# 86 columns remain after dropping columns with at least 50% nulls and adding any_on_base
pbp2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 760498 entries, 0 to 760497
Data columns (total 86 columns):
 #   Column                            Non-Null Count   Dtype         
---  ------                            --------------   -----         
 0   pitch_type                        745397 non-null  object        
 1   game_date                         760498 non-null  datetime64[ns]
 2   release_speed                     745470 non-null  float64       
 3   release_pos_x                     745449 non-null  float64       
 4   release_pos_z                     745449 non-null  float64       
 5   player_name                       760498 non-null  object        
 6   batter                            760498 non-null  int64         
 7   pitcher                           760498 non-null  int64         
 8   description                       760498 non-null  object        
 9   zone                              745449 non-null  float64       
 10  des                             