# Called Third Strike
## Part 2. Data Exploration 

This project's goal is to build probability models that estimate whether a pitch will be called a strike. The intended models are:
1. A neural network (NN) based approach.
2. A non-NN based approach.

---
__**This Notebook's**__ objective is to explore the data to gain further familiarity with it, and to identify candidate features for our models.

---

### Import Data

We have saved local versions of the data, so we will ingest them from there.

We'll use `pandas` to read into dataframes and explore. 

In [2]:
### Data handling
import pandas as pd

In [3]:
df_train = pd.read_csv('../data/train_ingested.csv')
df_test = pd.read_csv('../data/test_ingested.csv')

Let's work with the training data.

In [5]:
df_train.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| pitch_id | 01311c57-5046-48d7-ac68-000060a98ccb | 208d0186-b7c9-46bd-8297-0001539b714c | 4a24d09e-2d9b-4d12-a0eb-0004723ce539 | 486aa6b8-7c43-4974-8a53-000611a9c649 | 2aff251b-099b-447b-9862-00100124b7c1 |
| season | 2021 | 2021 | 2021 | 2021 | 2021 |
| game_date | 2021-05-13 | 2021-07-29 | 2021-05-15 | 2021-06-05 | 2021-06-13 |
| inning | 7 | 9 | 1 | 1 | 3 |
| side | home | home | home | home | home |
| run_diff | -2 | 4 | 0 | 2 | -2 |
| at_bat_index | 54 | 69 | 1 | 5 | 24 |
| pitch_of_ab | 5 | 2 | 3 | 3 | 5 |
| batter | 405947 | 468294 | 406141 | 615134 | 626949 |
| pitcher | 756778 | 778005 | 451846 | 564585 | 784463 |


On broad visual inspection, the data seems to make sense. We'll have an opportunity to raise more questions as we go through each feature and the target, which is the plan.

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354039 entries, 0 to 354038
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   pitch_id              354039 non-null  object 
 1   season                354039 non-null  int64  
 2   game_date             354039 non-null  object 
 3   inning                354039 non-null  int64  
 4   side                  354039 non-null  object 
 5   run_diff              354039 non-null  int64  
 6   at_bat_index          354039 non-null  int64  
 7   pitch_of_ab           354039 non-null  int64  
 8   batter                354039 non-null  int64  
 9   pitcher               354039 non-null  int64  
 10  catcher               354039 non-null  int64  
 11  umpire                354039 non-null  int64  
 12  bside                 354039 non-null  object 
 13  pside                 354039 non-null  object 
 14  stringer_zone_bottom  354039 non-null  float64
 15  
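The `info()` output above reports `game_date` as `object`, meaning pandas read the dates as plain strings. If date-based grouping or arithmetic turns out to be useful later, the column could be converted to a datetime dtype. A minimal sketch, using a small hypothetical frame (`toy`) in place of `df_train` so that it runs on its own; the `game_month` column is an illustrative name, not one from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for df_train; dates and innings copied from the head() output.
toy = pd.DataFrame({
    "game_date": ["2021-05-13", "2021-07-29", "2021-05-15"],
    "inning": [7, 9, 1],
})

# Convert ISO-formatted strings to datetime64; errors="raise" surfaces malformed dates.
toy["game_date"] = pd.to_datetime(toy["game_date"], format="%Y-%m-%d", errors="raise")

# Datetime accessors become available once the dtype is datetime64.
toy["game_month"] = toy["game_date"].dt.month  # illustrative derived column

print(toy.dtypes)
```

Whether a date-derived feature is actually worth keeping is a separate question for the feature-by-feature review.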

### Feature Exploration

Since we have a manageable number of features, let's go through them one by one to identify what each represents and how useful it is for our goal.

In [10]:
# We should visualize some of the data, so let's get seaborn
import seaborn as sns

##### Summarize Columns

In [14]:
print(list(df_train.columns))

['pitch_id', 'season', 'game_date', 'inning', 'side', 'run_diff', 'at_bat_index', 'pitch_of_ab', 'batter', 'pitcher', 'catcher', 'umpire', 'bside', 'pside', 'stringer_zone_bottom', 'stringer_zone_top', 'on_1b_mlbid', 'on_2b_mlbid', 'on_3b_mlbid', 'outs', 'balls', 'strikes', 'pitch_speed', 'px', 'pz', 'break_x', 'break_z', 'angle_x', 'angle_z', 'pitch_type', 'strike_bool']
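Before going column by column, it may help to have a compact overview of every column at once. The sketch below builds one under stated assumptions: `summarize_columns` is a hypothetical helper (not part of pandas or this notebook), and `toy` is a made-up frame that borrows three column names from the list above so the example is self-contained.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, null count, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "n_unique": df.nunique(),  # nunique() ignores missing values by default
    })

# Made-up values for illustration only; not taken from the real data.
toy = pd.DataFrame({
    "side": ["home", "home", "away"],
    "balls": [0, 1, 1],
    "pitch_speed": [92.1, None, 88.4],
})

summary = summarize_columns(toy)
print(summary)
```

Applied to the real data, the call would presumably be `summarize_columns(df_train)`; columns with very few distinct values are likely categorical, while ID-like columns will have counts close to the row count.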


---

`pitch_id`

Self-explanatory, distinct ID given to each pitch.

In [19]:
# Number of rows in dataset
display(f'There are {df_train.shape[0]} rows in training data set.')

'There are 354039 rows in training data set.'

In [27]:
# Number of distinct pitch IDs
display(f"There are {df_train['pitch_id'].nunique()} distinct pitch IDs in training data set.")

'There are 354035 distinct pitch IDs in training data set.'

Interesting: there are 4 fewer distinct pitch IDs than rows, which suggests that 4 pitches appear in the dataset twice. At least that's my theory. Since that is such a small number relative to the dataset, let's go ahead and identify and remove those dupes.

In [71]:
vc_pitch_id = df_train['pitch_id'].value_counts()

In [72]:
drop_pitch_id = vc_pitch_id[vc_pitch_id > 1].index.values.tolist()

In [73]:
drop_pitch_id

['39b8ee0e-f2aa-4422-8fd8-d2d0f885a108',
 '11aa870a-fc24-482f-b3e6-4aaeac474cfe',
 '0edd497b-23fd-4c7d-91b8-f6040416d3ad',
 '6eb345fe-931d-4707-a5a1-9ba8181085e8']

Smells right, only 4 as expected.
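Having the duplicated IDs is one thing; whether the repeated rows are exact copies or conflicting records for the same pitch is another. A minimal sketch of how that could be checked, using a made-up frame (`toy`) rather than `df_train`, with `px` standing in for the remaining columns:

```python
import pandas as pd

# Made-up data: "b" is duplicated exactly, "c" is duplicated with conflicting values.
toy = pd.DataFrame({
    "pitch_id": ["a", "b", "b", "c", "c"],
    "px": [0.1, 0.5, 0.5, -0.3, 0.9],
})

# keep=False marks every row whose pitch_id occurs more than once.
dupe_rows = toy[toy.duplicated(subset="pitch_id", keep=False)]

# Distinct full rows per duplicated ID: 1 means exact copies, more than 1 means conflicts.
distinct_per_id = dupe_rows.drop_duplicates().groupby("pitch_id").size()
print(distinct_per_id)
```

If every duplicated ID came back with a count of 1, keeping one copy of each would lose nothing; with conflicting records, dropping both copies is the more cautious choice.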

In [75]:
keep_pitch_id = vc_pitch_id[vc_pitch_id == 1].index.values.tolist()

In [76]:
len(keep_pitch_id)

354031

Makes sense, since each of the 4 duplicated IDs should account for 2 rows (4 × 2 = 8 rows in total). We can confirm:

In [78]:
(len(keep_pitch_id) + 2*len(drop_pitch_id)) == df_train.shape[0]

True
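With the list of IDs to keep in hand, one way the removal itself could be carried out is sketched below. It assumes that both copies of each duplicated pitch are dropped, which is what a keep-list built from `value_counts() == 1` implies. The frame `toy` and the names `toy_clean` and `toy_alt` are hypothetical stand-ins so the example runs on its own.

```python
import pandas as pd

# Made-up data with one duplicated ID ("b").
toy = pd.DataFrame({
    "pitch_id": ["a", "b", "b", "c"],
    "px": [0.1, 0.5, 0.7, -0.3],
})

# Same pattern as the cells above: IDs that occur exactly once are kept.
vc = toy["pitch_id"].value_counts()
keep_ids = vc[vc == 1].index

# Filter with isin, then reset the index so it is contiguous again.
toy_clean = toy[toy["pitch_id"].isin(keep_ids)].reset_index(drop=True)

# Equivalent one-liner: drop every row belonging to any duplicated ID.
toy_alt = toy.drop_duplicates(subset="pitch_id", keep=False).reset_index(drop=True)

assert toy_clean.equals(toy_alt)
print(toy_clean)
```

On the real data, the analogous filter would be `df_train[df_train['pitch_id'].isin(keep_pitch_id)]`, which should leave 354031 rows.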