In [12]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas_profiling import ProfileReport

# Importing and loading files

Let us import and load our datasets.

In [2]:
statcast_df = pd.read_csv('data/2019-statcast.csv')
batters_df = pd.read_csv('data/batter-names.csv')

# Exploratory Data Analysis

Here are the available columns of the `statcast_df` dataset. Their meaning is described in the [Statcast documentation](https://baseballsavant.mlb.com/csv-docs).

In [3]:
statcast_df.columns

Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
       'description', 'zone', 'des', 'game_type', 'stand', 'p_throws',
       'home_team', 'away_team', 'type', 'balls', 'strikes', 'game_year',
       'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y', 'vx0', 'vy0',
       'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot', 'hit_distance_sc',
       'launch_speed', 'launch_angle', 'effective_speed', 'release_spin_rate',
       'release_extension', 'game_pk', 'pitcher.1', 'release_pos_y',
       'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',
       'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
       'post_home_score', 'post_bat_score', 'post_fld_score',
       'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
       'delta_home_win_exp', 'delta_run_exp'],
      dty

Here, instead, are the first entries of the `batters_df` dataset.

In [4]:
batters_df.head()

| | key_mlbam | batter_name |
|---|---|---|
| 0 | 547989 | abreu, josé |
| 1 | 660670 | acuna, ronald |
| 2 | 542436 | adames, cristhian |
| 3 | 642715 | adames, willy |
| 4 | 613534 | adams, austin |


I have chosen the `description` column as my target column. It is a **categorical feature** with 13 unique categories.

In [5]:
statcast_df.description.unique()

array(['swinging_strike', 'foul', 'ball', 'called_strike',
       'hit_into_play', 'swinging_strike_blocked', 'blocked_ball',
       'hit_by_pitch', 'foul_bunt', 'foul_tip', 'missed_bunt', 'pitchout',
       'bunt_foul_tip'], dtype=object)
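With 13 classes in the target, it may be worth checking how balanced they are before any modelling. The snippet below is a minimal sketch on a toy series whose values are made up for illustration; on the real data the same calls would be applied to `statcast_df.description`.

```python
import pandas as pd

# Toy stand-in for statcast_df.description (illustrative values only)
description = pd.Series(
    ["ball", "foul", "ball", "called_strike", "ball", "hit_into_play"],
    name="description",
)

# Number of distinct classes in the target
n_classes = description.nunique()

# Relative frequency of each class, most frequent first
class_share = description.value_counts(normalize=True)

print(n_classes)
print(class_share)
```

A strongly skewed `class_share` would suggest using stratified splits and metrics other than plain accuracy later on.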

We select a subset of the features, narrowing them down to the following: `['inning', 'inning_topbot', 'outs_when_up', 'balls', 'strikes', 'stand', 'p_throws', 'effective_speed', 'release_spin_rate', 'spin_axis', 'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'sz_top', 'sz_bot']`. The target column `description` is kept alongside them.

In [6]:
statcast_df = statcast_df[['inning', 'inning_topbot', 'outs_when_up', 'balls', 'strikes', 'stand', 
 'p_throws', 'effective_speed', 'release_spin_rate', 'spin_axis', 'pfx_x', 
 'pfx_z', 'plate_x', 'plate_z', 'sz_top', 'sz_bot', 'description']]

Our `statcast_df` dataset now looks like this.

In [7]:
statcast_df.head()

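| | inning | inning_topbot | outs_when_up | balls | strikes | stand | p_throws | effective_speed | release_spin_rate | spin_axis | pfx_x | pfx_z | plate_x | plate_z | sz_top | sz_bot | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | Bot | 2 | 3 | 2 | L | R | 87.8 | 2461.0 | 175.0 | 0.02 | 0.21 | 0.88 | 1.03 | 3.35 | 1.4 | swinging_strike |
| 1 | 9 | Bot | 2 | 3 | 2 | L | R | 94.4 | 2572.0 | 201.0 | -0.57 | 1.52 | -0.47 | 1.92 | 3.35 | 1.56 | foul |
| 2 | 9 | Bot | 2 | 2 | 2 | L | R | 95.3 | 2637.0 | 205.0 | -0.66 | 1.4 | 1.68 | 1.35 | 3.53 | 1.63 | ball |
| 3 | 9 | Bot | 2 | 2 | 1 | L | R | 94.9 | 2598.0 | 208.0 | -0.81 | 1.5 | 0.75 | 2.05 | 3.35 | 1.56 | foul |
| 4 | 9 | Bot | 2 | 1 | 1 | L | R | 87.0 | 2598.0 | 186.0 | -0.05 | 0.47 | 1.27 | 2.17 | 3.59 | 1.63 | ball |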


## EDA - Pandas Profiling

Let us take a peek at the basic features of the dataset: histograms, missing values and correlations (the latter only as a first check for multicollinearity; drawing conclusions at this stage would be premature!).

In [13]:
profile = ProfileReport(statcast_df, title="Pandas Profiling Report")

In [15]:
profile.to_notebook_iframe()

We see that some data is missing: in particular, `sz_bot`, `sz_top`, `plate_z`, `plate_x`, `pfx_z`, `pfx_x` and `spin_axis` all have exactly 7332 missing entries. `release_spin_rate` has 20084 missing entries (2.7% of the entire dataset) and `effective_speed` has 4889 missing entries.

From the **missing values heatmap** we can see that, within the first group of features, missingness is perfectly correlated: when one of them is missing, all the others are missing too. This suggests that these values are absent for a common reason: either the measurement tools were unavailable for those events, or the nature of the recorded event made the measurements unnecessary or impossible.
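The same check can be done by hand, without the profiling report. The following is a minimal sketch on a toy frame with made-up values that mimics the pattern described above; the correlation of missingness indicators is similar in spirit to what the missing values heatmap displays.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) where plate_x and plate_z are missing together
toy = pd.DataFrame({
    "plate_x": [0.88, np.nan, 1.68, np.nan],
    "plate_z": [1.03, np.nan, 1.35, np.nan],
    "release_spin_rate": [2461.0, 2572.0, np.nan, np.nan],
})

# Number of missing entries per column
missing_counts = toy.isna().sum()

# Boolean mask: True where a value is missing
missing_mask = toy.isna()

# Two columns share a missingness pattern if their masks agree on every row
same_pattern = missing_mask["plate_x"].equals(missing_mask["plate_z"])

# Correlation between the 0/1 missingness indicators of each pair of columns
missing_corr = missing_mask.astype(int).corr()

print(missing_counts)
print(same_pattern)
print(missing_corr)
```

On the real data, `statcast_df.isna().sum()` should reproduce the counts reported by the profiling report.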

Either way, I am not a *connoisseur* of baseball technicalities 🙃 so I will simply delete those rows. I could come up with ways to impute the data, but I would have to know the meaning of zero, NaN or infinite values in those variables. I know this is rough, but it is surely better than having to deal with NaNs.

In [17]:
statcast_df = statcast_df.dropna()

# Preprocessing data

We will now prepare the data for the first stages of the analysis. First, we would like to **correctly encode categorical features with numerical values**, and we will use the `OrdinalEncoder` provided by `sklearn`. In this case we will have to do this with the features `stand`, `p_throws` and `description`, as well as `inning_topbot`, which also holds strings.
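As a rough illustration of what this encoding step does, here is a minimal sketch of `OrdinalEncoder` applied to a toy frame with `stand`, `p_throws` and `description`. The rows are made up and the variable names (`toy`, `cat_cols`, `encoded`) are hypothetical, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows (made up) using three of the categorical columns
toy = pd.DataFrame({
    "stand": ["L", "R", "L"],
    "p_throws": ["R", "R", "L"],
    "description": ["foul", "ball", "foul"],
})

cat_cols = ["stand", "p_throws", "description"]

# Each category is mapped to an integer code, in sorted order per column
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy[cat_cols]),
    columns=cat_cols,
    index=toy.index,
)

print(encoded)
print(encoder.categories_)  # learned categories, one array per column
```

The integer codes impose an arbitrary order on the categories, which is harmless for binary columns such as `stand` but may matter for some models. For the target column, `sklearn` also offers `LabelEncoder`, which is intended for labels; using it instead would be a design choice, not something the notebook prescribes.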

We will then **standardize** or **normalize** the numerical features, so that features measured on very different scales do not distort the analysis. The choice between the two will be made [depending on the sparsity of the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html).
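To make the difference between the two options concrete, here is a minimal sketch using `StandardScaler` and `MinMaxScaler` from `sklearn` on a small array. The numbers are only illustrative, and the choice of these two scalers is an assumption about how the step could be carried out.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy values for two numeric columns on very different scales
X = np.array([
    [87.8, 2461.0],
    [94.4, 2572.0],
    [95.3, 2637.0],
])

# Standardization: each column gets zero mean and unit variance
standardized = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the [0, 1] range
normalized = MinMaxScaler().fit_transform(X)

print(standardized)
print(normalized)
```

In a real pipeline the scaler would typically be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into preprocessing.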

## EDA - Visualization

In [18]:
sns.heatmap(statcast_df)

ValueError: could not convert string to float: 'Bot'
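The error above arises because `sns.heatmap` expects purely numeric data, while the frame still contains string columns such as `inning_topbot`. One possible way around it, sketched here on a toy frame with made-up values, is to keep only the numeric columns and plot their correlation matrix instead of the raw data.

```python
import pandas as pd

# Toy frame (made-up values) mixing a string column with numeric ones
toy = pd.DataFrame({
    "inning_topbot": ["Bot", "Top", "Bot", "Top"],
    "balls": [3, 2, 2, 1],
    "strikes": [2, 2, 1, 1],
    "effective_speed": [87.8, 94.4, 95.3, 94.9],
})

# Keep only the numeric columns, dropping strings such as 'Bot' / 'Top'
numeric = toy.select_dtypes(include="number")

# Pairwise Pearson correlation between the numeric columns
corr = numeric.corr()

print(corr)
```

Passing `corr` to `sns.heatmap` (for example `sns.heatmap(corr, annot=True)`) should then render without the `ValueError`.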

# Splitting Training set and Test set

Let's do the important tasks right away: let us split the data into *training* and *test* sets. It is crucial to do this as early as possible, so that no information from the test data leaks into our modelling choices and we do not end up overfitting to it. We can subsequently divide the original training set into two further sets (the *training* and *validation* sets) in order to compare the performance of different algorithms on our dataset.
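A minimal sketch of such a two-stage split with `train_test_split` follows. The arrays are toy data, and the split proportions, the `stratify` argument and the `random_state` are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, balanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# First split: hold out 20% of the samples as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Second split: carve a validation set out of the remaining samples
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying on the target keeps the class proportions similar across the three sets, which can help when some of the 13 `description` classes are rare.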