# Data Preprocessing

In this notebook, I clean my data and perform feature selection and feature engineering.

I have chosen to use pitching data from five seasons (2019-2023) to train my models, which I will then test on data from the 2024 season. Because the 2020 season was much shorter due to COVID-19 (60 regular-season games instead of the usual 162), I decided to include the 2019 season as an additional dataset.
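
The season-based split described above can be sketched as follows. This is a minimal illustration on toy frames, not the code used later in the notebook: the `frames` dictionary, the `season` column, and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the per-season pitch-by-pitch frames (hypothetical values).
frames = {
    year: pd.DataFrame({
        "game_date": pd.to_datetime([f"{year}-04-01", f"{year}-07-15"]),
        "pitch_type": ["FF", "SL"],
    })
    for year in range(2019, 2025)
}

train_years = [2019, 2020, 2021, 2022, 2023]
test_year = 2024

# Tag each row with its season so the split can be audited later.
train = pd.concat(
    [frames[y].assign(season=y) for y in train_years], ignore_index=True
)
test = frames[test_year].assign(season=test_year).reset_index(drop=True)

# No 2024 rows should leak into the training set.
assert train["game_date"].max() < test["game_date"].min()
```

Splitting by season rather than at random keeps every 2024 pitch out of training, which matches how the models would be used on a season they have not seen.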

The following operations are applied to both the pitch-by-pitch data and the overall seasonal data:

- Deal with missing values
- Feature selection (using all 100+ features is likely not feasible for this project)
- Dimensionality reduction on features
- Feature engineering
- Target variable modification (condensing pitch types down to fewer than 14, the current number of distinct pitch types listed on the MLB website)
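
Two of the steps above, checking for missing values and condensing the target variable, can be sketched on a toy frame. The column names `pitch_type` and `release_speed` follow Statcast naming and are assumed to match the CSV files. The `PITCH_GROUPS` mapping is a hypothetical grouping for illustration only, not the condensing scheme settled on in this project.

```python
import pandas as pd

# Toy pitch-by-pitch rows; the values are made up for illustration.
pbp = pd.DataFrame({
    "pitch_type": ["FF", "SI", "ST", "SL", "KC", "CH", None, "FS"],
    "release_speed": [95.1, 93.4, 82.0, 86.3, 80.2, 85.0, None, 87.5],
})

# Share of missing values per column, largest first.
missing_share = pbp.isna().mean().sort_values(ascending=False)

# Hypothetical grouping of Statcast pitch codes into broader classes.
PITCH_GROUPS = {
    "FF": "fastball", "SI": "fastball", "FC": "fastball",
    "SL": "breaking", "ST": "breaking", "SV": "breaking",
    "CU": "breaking", "KC": "breaking",
    "CH": "offspeed", "FS": "offspeed",
}

# Rows without a target cannot be used for supervised training.
labeled = pbp.dropna(subset=["pitch_type"]).copy()
# Codes missing from the mapping fall into an explicit "other" class.
labeled["pitch_group"] = labeled["pitch_type"].map(PITCH_GROUPS).fillna("other")
```

Routing unmapped codes to `"other"` instead of leaving them as `NaN` keeps rare pitch types visible in class counts, so the choice to drop or keep them can be made explicitly.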


In [12]:
# Import libraries
import pybaseball
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from pybaseball import (
    pitching_stats_bref,
    playerid_lookup,
    playerid_reverse_lookup,
    statcast,
    statcast_pitcher,
)

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

## Load the data

In [None]:
# Pitch-by-pitch data from the 2019-2024 seasons
pbp2019 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2019.csv')
pbp2020 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2020.csv')
pbp2021 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2021.csv')
pbp2022 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2022.csv')
pbp2023 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2023.csv')
pbp2024 = pd.read_csv('../data/pitch-by-pitch/pitch_by_pitch_2024.csv')

In [None]:
# Seasonal aggregate pitching data from the 2019-2024 seasons
season2019 = pd.read_csv('../data/seasonal/pitching_stats_2019.csv')
season2020 = pd.read_csv('../data/seasonal/pitching_stats_2020.csv')
season2021 = pd.read_csv('../data/seasonal/pitching_stats_2021.csv')
season2022 = pd.read_csv('../data/seasonal/pitching_stats_2022.csv')
season2023 = pd.read_csv('../data/seasonal/pitching_stats_2023.csv')
season2024 = pd.read_csv('../data/seasonal/pitching_stats_2024.csv')

## Data Cleaning

In [10]:
# Convert game_date to datetime in pitch-by-pitch data (each frame is modified in place)
for df in (pbp2019, pbp2020, pbp2021, pbp2022, pbp2023, pbp2024):
    df['game_date'] = pd.to_datetime(df['game_date'])
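
If any `game_date` value is malformed, the conversion above raises an error. A possible variant, shown here on a toy frame with made-up values, passes `errors="coerce"` so that unparseable entries become `NaT` and can be counted. This is an optional check and assumes nothing about whether the real files contain bad dates.

```python
import pandas as pd

# Toy frame with one malformed date (hypothetical values).
pbp = pd.DataFrame({"game_date": ["2024-03-28", "2024-03-29", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(pbp["game_date"], errors="coerce")
n_bad = int(parsed.isna().sum())

pbp["game_date"] = parsed
```

A nonzero `n_bad` would indicate rows to inspect before any date-based filtering or sorting.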