# Called Third Strike
## Part 4. First crude non-neural network model

![](./resources/baseball_umpire_home_plate_1.jpg)

This project's goal is to build probability models of whether a pitch will be called a strike. The intended models are:
1. A neural network (NN) based approach.
2. A non-NN based approach.

---

__**This notebook's**__ objective is to quickly build a simple working ML classification model that serves as a baseline to compare against as we iterate on versions.

---
---

### Table of Contents<a id='toc'></a>

<a href='#data_prep'>1. Data Preprocessing</a>

<a href='#build_model'>2. Build Model</a>

<a href='#random_search'>3. Random Hyperparameter Search</a>

...

<a href='#the_end'>Go to the End</a>

---

---  

<span style="font-size:0.5em;">Tag 1</span>

### Data Preprocessing<a id='data_prep'></a>

<span style="font-size:0.5em;"><a href='#toc'>Back to TOC</a></span>


#### Libraries


In [1]:
# Data wrangling and operations
import pandas as pd
import numpy as np
from datetime import datetime
import pickle

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# try:
#     import plotly_express as pex
# except ImportError:
#     !pip install plotly_express
# except ModuleNotFoundError:
#     !pip install plotly_express

# Estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.base import BaseEstimator

# Processing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Assessment
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import f1_score

---

#### Data Import 

Retrieve the latest version of our train/test files that we built during [initial exploration](02_data_exploration.ipynb).

In [2]:
df_train = pd.read_pickle('../data/train_enriched.pkl')
df_test = pd.read_pickle('../data/test_enriched.pkl')

Let's refamiliarize ourselves with the features.

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 353983 entries, 0 to 354038
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   pitch_id              353983 non-null  object        
 1   inning                353983 non-null  int64         
 2   side                  353983 non-null  object        
 3   run_diff              353983 non-null  int64         
 4   at_bat_index          353983 non-null  int64         
 5   pitch_of_ab           353983 non-null  int64         
 6   batter                353983 non-null  int64         
 7   pitcher               353983 non-null  int64         
 8   catcher               353983 non-null  int64         
 9   umpire                353983 non-null  int64         
 10  bside                 353983 non-null  object        
 11  pside                 353983 non-null  object        
 12  stringer_zone_bottom  353983 non-null  float64       
 13 

---

#### Feature Selection and Prep

Select features:

For our initial super simple model, let's just pick four features: 
- `px` which is the horizontal location of the pitch at the plate
- `pz` which is the vertical location
- `stringer_zone_bottom` which is an estimate of current batter's strike zone bottom
- `stringer_zone_top` which is an estimate of current batter's strike zone top
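
To make the role of these four features concrete, here is a minimal sketch of a purely geometric zone check. It rests on assumptions the notebook does not state: that `px` and `pz` are in feet, that `px` is measured from the center of the plate, and that the ball's own radius can be ignored. The name `in_nominal_zone` and the sample values are hypothetical.

```python
# Assumption: distances are in feet and px is measured from the center of the plate.
PLATE_HALF_WIDTH_FT = (17 / 12) / 2  # home plate is 17 inches wide


def in_nominal_zone(px, pz, zone_bottom, zone_top):
    """Return True if the pitch location falls inside the nominal zone rectangle."""
    inside_horizontally = abs(px) <= PLATE_HALF_WIDTH_FT
    inside_vertically = zone_bottom <= pz <= zone_top
    return inside_horizontally and inside_vertically


# Toy pitches (made-up values, not rows from the training data)
print(in_nominal_zone(px=0.1, pz=2.5, zone_bottom=1.6, zone_top=3.4))  # middle of the zone
print(in_nominal_zone(px=1.2, pz=2.5, zone_bottom=1.6, zone_top=3.4))  # well off the plate
print(in_nominal_zone(px=0.0, pz=3.9, zone_bottom=1.6, zone_top=3.4))  # above the zone
```

A fitted classifier is free to deviate from this rectangle wherever the called strikes in the data do, which is the point of modeling the calls rather than hard-coding the rule.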

In [4]:
# Features selected
feat_select = ['px', 'pz', 'stringer_zone_bottom', 'stringer_zone_top']

Prep features:

In [5]:
# Get just the selected features
df_X = df_train[feat_select]
df_X.info()  # info() prints directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 353983 entries, 0 to 354038
Data columns (total 4 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   px                    353983 non-null  float64
 1   pz                    353983 non-null  float64
 2   stringer_zone_bottom  353983 non-null  float64
 3   stringer_zone_top     353983 non-null  float64
dtypes: float64(4)
memory usage: 13.5 MB

While we are at it, let's prep our targets:

In [6]:
df_y = df_train['strike_bool']
df_y.to_frame().info()  # info() prints directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 353983 entries, 0 to 354038
Data columns (total 1 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   strike_bool  353983 non-null  int64
dtypes: int64(1)
memory usage: 5.4 MB

---

#### Feature Engineering

Scale features:

As with a neural network, the data must be represented numerically (e.g. one-hot encoding any categorical features). Scaling/normalizing matters less here: tree-based models are largely insensitive to feature scale, but regularized logistic regression benefits from it.

All the selected features for this round are numerical, so we don't have to worry about encoding any categories. For scaling I'm choosing to *standardize*.

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_X)
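
For reference, standardizing means subtracting a learned mean and dividing by a learned standard deviation. The plain-Python sketch below shows that arithmetic on made-up numbers; the helper names `fit_standardizer` and `standardize` are hypothetical. It uses the population standard deviation, which is what `StandardScaler` uses. It also learns the statistics from a training portion only, whereas the cell above fits on all rows before splitting; fitting on the training split alone is the more common way to keep test information out of preprocessing.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from the given values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]


# Toy vertical locations (made-up numbers), split into train and test portions
train_pz = [1.0, 2.0, 3.0, 4.0]
test_pz = [2.5, 5.0]

mean, std = fit_standardizer(train_pz)      # statistics come from train only
train_z = standardize(train_pz, mean, std)
test_z = standardize(test_pz, mean, std)    # test reuses the train statistics

print(round(fmean(train_z), 6), round(pstdev(train_z), 6))
print(test_z)
```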

---

#### Train / Test


Create train/test splits.

Break out the parameters for easy access.

In [8]:
# Parameters
test_size = 0.20
random_state = 24

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, df_y, test_size=test_size
                                                    ,stratify=df_y, random_state=random_state)
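
The `stratify=df_y` argument keeps the strike/ball mix the same in both splits. As a rough illustration of the idea (not scikit-learn's actual implementation), the sketch below splits each class separately; the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.20, seed=24):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)                        # shuffle within the class
        n_test = round(len(idxs) * test_size)    # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def strike_rate(labels, idxs):
    """Share of positive labels among the selected rows."""
    return sum(labels[i] for i in idxs) / len(idxs)


labels = [1] * 30 + [0] * 70   # toy target with 30% strikes
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(strike_rate(labels, train_idx), strike_rate(labels, test_idx))
```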

---  

<span style="font-size:0.5em;">Tag 2</span>

### Build Model / Hyperparameter Search<a id='build_model'></a>

<span style="font-size:0.5em;"><a href='#toc'>Back to TOC</a></span>

---  

For expediency we are going to use a pipeline and random hyperparameter search to scour through different classifiers and hyperparameter sets to come up with a good candidate.

We'll create a `DummyEstimator` class to act as a placeholder to which we can pass the different classifiers.

In [10]:
class DummyEstimator(BaseEstimator):
    def fit(self): pass
    def score(self): pass

Create the pipeline, plus a list of dictionaries that pairs each candidate classifier with its grid of potential hyperparameters.

In [48]:
# Create a pipeline
pipe = Pipeline([('clf', DummyEstimator())]) # Placeholder Estimator

# Candidate learning algorithms and their hyperparameters
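search_space = [{'clf': [LogisticRegression()], # Actual Estimator
                 # NOTE: the default 'lbfgs' solver only supports 'l2'; 'l1' and 'elasticnet'
                 # need solver='saga' (and 'elasticnet' also needs an l1_ratio)
                 'clf__penalty': ['l1', 'l2', 'elasticnet'],
                 'clf__C': np.logspace(-4, 4, 20)
                },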
                {'clf': [DecisionTreeClassifier()],  # Actual Estimator
                 'clf__criterion': ['gini', 'entropy'],
                 'clf__max_depth': np.arange(3,11,1),
                 'clf__min_samples_split': np.linspace(0.1, 0.5, 10, endpoint=True),
                 'clf__max_features': [0.25, 0.5, 0.75, 1]
                },
                {'clf': [RandomForestClassifier()],  # Actual Estimator
                 'clf__n_estimators': [50, 100, 200],
                 'clf__criterion': ['gini', 'entropy'],                 
                 'clf__max_depth': np.arange(3,11,1),
                 'clf__min_samples_split': np.linspace(0.1, 0.5, 10, endpoint=True),
                 'clf__max_features': ["auto", "sqrt", "log2"],
                 'clf__max_samples': [0.25, 0.5, 0.75, 1]
                },
                {'clf': [GradientBoostingClassifier()],  # Actual Estimator
                 'clf__n_estimators': [50, 100, 200],
                 # NOTE: 'gini'/'entropy' are not valid criteria for GradientBoostingClassifier
                 # (it expects e.g. 'friedman_mse'), so these candidates fail to fit
                 'clf__criterion': ['gini', 'entropy'],                 
                 'clf__max_depth': np.arange(3,11,1),
                 'clf__min_samples_split': np.linspace(0.1, 0.5, 10, endpoint=True),
                 'clf__max_features': ["auto", "sqrt", "log2"],
                 # 'clf__max_samples': [0.25, 0.5, 0.75, 1]
                },
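                {'clf': [XGBClassifier()],  # Actual Estimator
                 'clf__n_estimators': [50, 100, 200],
                 # NOTE: criterion, min_samples_split, max_features and max_samples are
                 # scikit-learn tree parameters, not XGBoost ones; XGBoost warns that they
                 # "might not be used" when fitting
                 'clf__criterion': ['gini', 'entropy'],                 
                 'clf__max_depth': np.arange(3,11,1),
                 'clf__min_samples_split': np.linspace(0.1, 0.5, 10, endpoint=True),
                 'clf__max_features': ["auto", "sqrt", "log2"],
                 'clf__max_samples': [0.25, 0.5, 0.75, 1]
                }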
               ]

# Create the randomized search (an exhaustive GridSearchCV would take no random_state)
# gs = GridSearchCV(pipe, search_space)
rs = RandomizedSearchCV(pipe, search_space, random_state=random_state
                        ,n_iter=10 # To limit search time 
                        ,cv=3      # I like 5 as rule of thumb, but again to limit search time
                        ,verbose=True, refit=True, scoring='accuracy')
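
It helps to know how many combinations each classifier contributes. When every entry in the search space is a list or array rather than a distribution, my understanding is that `RandomizedSearchCV` draws its `n_iter` candidates without replacement from the combined grid, so classifiers with larger grids are drawn more often. The sketch below only counts combinations; the counts are read off the search space above and the variable names are hypothetical.

```python
from math import prod

# Number of options per hyperparameter, read off the search space above
option_counts = {
    "LogisticRegression": [3, 20],                   # penalty, C
    "DecisionTreeClassifier": [2, 8, 10, 4],         # criterion, max_depth, min_samples_split, max_features
    "RandomForestClassifier": [3, 2, 8, 10, 3, 4],   # n_estimators, criterion, max_depth, min_samples_split, max_features, max_samples
    "GradientBoostingClassifier": [3, 2, 8, 10, 3],  # same as above without max_samples
    "XGBClassifier": [3, 2, 8, 10, 3, 4],            # same keys as the random forest entry
}

grid_sizes = {name: prod(counts) for name, counts in option_counts.items()}
total = sum(grid_sizes.values())

for name, size in grid_sizes.items():
    print(f"{name:28s} {size:6d} combinations  {size / total:6.1%} of the grid")
print(f"{'total':28s} {total:6d}")
```

Under that assumption, logistic regression makes up well under 1% of the combined grid, so a budget of ten draws will rarely include it.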

Fit the random search cv object.

In [None]:
rs_results = rs.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Parameters: { criterion, max_features, max_samples, min_samples_split } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Parameters: { criterion, max_features, max_samples, min_samples_split } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Parameters: { criterion, max_features, max_samples, min_samples_split } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verifica

Check the results.

In [50]:
rs_results.best_estimator_

Pipeline(steps=[('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, criterion='entropy', gamma=0,
                               gpu_id=-1, importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=7, max_features='auto',
                               max_samples=0.5, min_child_weight=1,
                               min_samples_split=0.4111111111111111,
                               missing=nan, monotone_constraints='()',
                               n_estimators=50, n_jobs=16, num_parallel_tree=1,
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_param

In [51]:
rs_results.best_params_

{'clf__n_estimators': 50,
 'clf__min_samples_split': 0.4111111111111111,
 'clf__max_samples': 0.5,
 'clf__max_features': 'auto',
 'clf__max_depth': 7,
 'clf__criterion': 'entropy',
 'clf': XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
               colsample_bynode=None, colsample_bytree=None, criterion='entropy',
               gamma=None, gpu_id=None, importance_type='gain',
               interaction_constraints=None, learning_rate=None,
               max_delta_step=None, max_depth=7, max_features='auto',
               max_samples=0.5, min_child_weight=None,
               min_samples_split=0.4111111111111111, missing=nan,
               monotone_constraints=None, n_estimators=50, n_jobs=None,
               num_parallel_tree=None, random_state=None, reg_alpha=None,
               reg_lambda=None, scale_pos_weight=None, subsample=None,
               tree_method=None, validate_parameters=None, verbosity=None)}

In [52]:
rs_results.best_score_

0.9279166346193067

In [53]:
rs_results.cv_results_

{'mean_fit_time': array([4.23047154e+01, 5.80357432e+00, 8.24969197e+01, 1.64146320e+00,
        9.46557408e+01, 1.49212432e+00, 3.20498943e-02, 1.45993568e+02,
        3.35522493e-02, 3.60049033e+01]),
 'std_fit_time': array([7.81819631e-01, 1.52984071e-01, 4.01336070e+00, 3.18320922e-02,
        3.49107406e+00, 4.40841065e-02, 2.84990362e-04, 3.33727149e+00,
        3.02022475e-03, 4.59957423e+00]),
 'mean_score_time': array([0.25502078, 0.41718825, 0.34918467, 0.33792988, 0.35999831,
        0.32580272, 0.        , 0.48696152, 0.        , 0.2239395 ]),
 'std_score_time': array([0.04673232, 0.00365037, 0.05106508, 0.01415307, 0.00223086,
        0.00580321, 0.        , 0.04432186, 0.        , 0.0415595 ]),
 'param_clf__n_estimators': masked_array(data=[50, 100, 200, 100, 100, 100, 100, 200, 50, 50],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_clf__min_samples

In [54]:
df_rs_res = pd.DataFrame(rs_results.cv_results_)

In [56]:
df_rs_res.shape

(10, 18)

In [55]:
df_rs_res.head(10).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 42.304715 | 5.803574 | 82.49692 | 1.641463 | 94.655741 | 1.492124 | 0.03205 | 145.993568 | 0.033552 | 36.004903 |
| std_fit_time | 0.78182 | 0.152984 | 4.013361 | 0.031832 | 3.491074 | 0.044084 | 0.000285 | 3.337271 | 0.00302 | 4.599574 |
| mean_score_time | 0.255021 | 0.417188 | 0.349185 | 0.33793 | 0.359998 | 0.325803 | 0.0 | 0.486962 | 0.0 | 0.223939 |
| std_score_time | 0.046732 | 0.00365 | 0.051065 | 0.014153 | 0.002231 | 0.005803 | 0.0 | 0.044322 | 0.0 | 0.04156 |
| param_clf__n_estimators | 50 | 100 | 200 | 100 | 100 | 100 | 100 | 200 | 50 | 50 |
| param_clf__min_samples_split | 0.188889 | 0.366667 | 0.144444 | 0.277778 | 0.455556 | 0.366667 | 0.233333 | 0.233333 | 0.322222 | 0.411111 |
| param_clf__max_samples | 0.25 | 0.75 | 0.5 | 0.25 | 0.75 | 1 | | 0.75 | | 0.5 |
| param_clf__max_features | sqrt | sqrt | log2 | log2 | log2 | sqrt | sqrt | sqrt | auto | auto |
| param_clf__max_depth | 8 | 3 | 4 | 10 | 9 | 7 | 3 | 7 | 5 | 7 |
| param_clf__criterion | entropy | gini | gini | gini | gini | entropy | gini | entropy | entropy | entropy |


Let's pickle the random search results.

In [60]:
str_ts = datetime.now().strftime("%Y%m%d_%H%M")
file_nm = 'classic_rs_1st_pass_' + str_ts
file_path = './models/rs_results/' + file_nm + '.pickle'
# pred_path = './predictions/test/' + file_nm + '.csv'

display(file_path)


with open(file_path, 'wb') as f:
    # Pickle the fitted search object using the highest protocol available.
    pickle.dump(rs_results, f, pickle.HIGHEST_PROTOCOL)

# Test loading pickle
load_path = './models/rs_results/classic_rs_1st_pass_20220505_1544.pickle'

with open(load_path, 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    rs_rez_unpkl = pickle.load(f)    

In [61]:
rs_results.best_estimator_

Pipeline(steps=[('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, criterion='entropy', gamma=0,
                               gpu_id=-1, importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=7, max_features='auto',
                               max_samples=0.5, min_child_weight=1,
                               min_samples_split=0.4111111111111111,
                               missing=nan, monotone_constraints='()',
                               n_estimators=50, n_jobs=16, num_parallel_tree=1,
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_param

In [62]:
rs_rez_unpkl.best_estimator_

Pipeline(steps=[('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, criterion='entropy', gamma=0,
                               gpu_id=-1, importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=7, max_features='auto',
                               max_samples=0.5, min_child_weight=1,
                               min_samples_split=0.4111111111111111,
                               missing=nan, monotone_constraints='()',
                               n_estimators=50, n_jobs=16, num_parallel_tree=1,
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_param

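Good enough: the two printouts match. Testing equality with a simple `==` won't work, since these objects don't define value equality and `==` falls back to identity.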
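
A small stand-in shows the same behavior without needing the fitted search. The sketch uses `random.Random` purely as an example of a picklable object with no value-based `==`; it is not part of the notebook's workflow. For the search results, a comparable check might compare `best_score_` or the predictions on the same rows.

```python
import pickle
import random

original = random.Random(24)   # stand-in for a fitted object
restored = pickle.loads(pickle.dumps(original, pickle.HIGHEST_PROTOCOL))

# No value-based __eq__ is defined, so == only checks identity
print(original == restored)

# Comparing the underlying state is the meaningful check
print(original.getstate() == restored.getstate())
```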

----
### Results review from the NN notebook

In [None]:
rs_results = rs.fit(X_train, y_train)

Let's look at the best accuracy score, and the parameters that produced them.

In [None]:
display(rs_results.best_score_, rs_results.best_params_)

Let's look at it as an estimator object.

In [None]:
rs_best = rs_results.best_estimator_
display(rs_best)

Let's look at a high level plot of the model, to make sure it makes sense vis a vis the best params.

In [None]:
plot_model(rs_best.model_)

Look at summary:

In [None]:
rs_best.model_.summary()

Now let's look at performance vs Test.

In [None]:
display(f"test accuracy: {rs_best.score(X_test, y_test)}")
display(f"mean CV accuracy (train): {rs_results.best_score_}")

Test accuracy is consistent with the cross-validated training score.

Let's get some predictions and review some high level classification metrics:

In [None]:
rs_pred = rs_best.predict(X_test)

*Confusion Matrix*

In [None]:
from sklearn.metrics import confusion_matrix  # not imported in the Libraries cell

cm = confusion_matrix(y_test, rs_pred)
display(cm)

*Accuracy*

In [None]:
display("What % of pitches did we correctly categorize?")
display(f"test accuracy: {round(rs_best.score(X_test, y_test), 3)}")

*Precision*

In [None]:
from sklearn.metrics import precision_score  # not imported in the Libraries cell

precision = precision_score(y_test, rs_pred)
display("What % of predicted strikes did we get correct?")
display(f"Positive Predictive Value: {round(precision,3)}")

*Recall*

In [None]:
from sklearn.metrics import recall_score  # not imported in the Libraries cell

recall = recall_score(y_test, rs_pred)
display("What percent of actual strikes did we capture?")
display(f"True Positive Rate: {round(recall, 3)}")
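
All three metrics come straight from the four confusion-matrix counts. The plain-Python sketch below works through the arithmetic on made-up labels (not the notebook's predictions); the variable names are chosen for illustration.

```python
# Toy labels (made up): 1 = called strike, 0 = ball
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # strikes predicted as strikes
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # strikes predicted as balls
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # balls predicted as strikes
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # balls predicted as balls

toy_accuracy = (tp + tn) / len(pairs)   # share of all pitches classified correctly
toy_precision = tp / (tp + fp)          # share of predicted strikes that were strikes
toy_recall = tp / (tp + fn)             # share of actual strikes that were found

print(tp, fn, fp, tn)
print(toy_accuracy, toy_precision, toy_recall)
```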

*ROC/AUC*

In [None]:
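# plot_roc_curve was removed in scikit-learn 1.2; newer versions use
# RocCurveDisplay.from_estimator(...) with the same arguments instead.
from sklearn.metrics import plot_roc_curve

# NOTE: this draws the curve on the training split; pass X_test, y_test for a held-out estimate
plot_roc_curve(rs_best, X_train, y_train)
plt.show()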

AUC of 0.98!
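
One way to read an AUC value: it is the probability that a randomly chosen strike receives a higher predicted score than a randomly chosen ball. The sketch below computes that directly by comparing every positive/negative pair on made-up scores; the function name `pairwise_auc` is hypothetical, and this brute-force approach is only practical for small samples.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos, neg in product(pos_scores, neg_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy predicted strike probabilities (made up)
strike_scores = [0.9, 0.8, 0.4]
ball_scores = [0.7, 0.3, 0.2, 0.1]

toy_auc = pairwise_auc(strike_scores, ball_scores)
print(round(toy_auc, 4))
```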

*Save the current state*

Let's save out the best model!

In [None]:
str_ts = datetime.now().strftime("%Y%m%d_%H%M")
model_nm = 'nn_1st_pass_' + str_ts
model_path = './models/' + model_nm
pred_path = './predictions/test/' + model_nm + '.csv'

display(model_path, pred_path)

# save out predicted probabilities
pred_curr = rs_best.predict_proba(X_test)
np.savetxt(pred_path, pred_curr, delimiter=",")

# Save out Keras model
rs_best.model_.save(model_path)

---  

<span style="font-size:0.5em;">Tag 4</span>

<a id='the_end'></a>

<span style="font-size:0.5em;"><a href='#toc'>Back to TOC</a></span>

-----

### Archive

Here is the model creation function I started building, before I stumbled across `scikeras` and decided to run with their code.

In [None]:
# Assumes the TensorFlow-bundled Keras; adjust the imports if using standalone Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def create_binary_nn_model(num_inputs, learning_rate=0.01, num_layers=1
                           ,num_nodes=2, activation='relu'):
    """Create binary neural network Sequential model"""
    # Create Adam optimizer ('lr' is the deprecated spelling of 'learning_rate')
    opt = Adam(learning_rate=learning_rate)
    
    # Create Sequential model
    model=Sequential()
    
    # Input layer
    model.add(Dense(num_nodes, input_shape=(num_inputs,) 
              ,activation=activation, name='Input'))
    
    # Additional Hidden Layers
    for i in range(num_layers-1): #if only 1 then assume just input/hidden
        model.add(Dense(num_nodes, activation=activation))
    
    # Add a 1-neuron output layer (num_outputs was undefined in the original)
    model.add(Dense(1, activation='sigmoid', name='Output'))

    # Compile your model
    model.compile(loss='binary_crossentropy', optimizer=opt
                  ,metrics=['accuracy']
                 )

    return model

Let's do a test on the function:

In [None]:
mod_1 = create_binary_nn_model(num_inputs=X_train.shape[1]
                               ,learning_rate=1
                               ,num_layers=1
                              )

In [None]:
hh = mod_1.fit(X_train, y_train, epochs = n_epochs
                        ,validation_split=val_split)

In [None]:
plot_model(mod_1)

Seems correct. I think the learning rate of 1 that I plugged in caused the epochs to be basically the same, with each update skipping over the minimum.
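
The overshooting effect is easy to reproduce on a toy problem. The sketch below runs plain gradient descent (not Adam, and not the network above) on f(w) = w², where a learning rate of 1 makes every step jump to the mirror-image point so the loss never improves. The function name `descend` and the starting value are hypothetical.

```python
def descend(w, learning_rate, steps):
    """Run plain gradient descent on f(w) = w**2 and return the visited points."""
    path = [w]
    for _ in range(steps):
        grad = 2 * w                    # derivative of w**2
        w = w - learning_rate * grad    # standard gradient step
        path.append(w)
    return path


too_big = descend(3.0, learning_rate=1.0, steps=5)
modest = descend(3.0, learning_rate=0.1, steps=5)

print(too_big)   # bounces between two points on either side of the minimum
print(modest)    # shrinks steadily toward the minimum at 0
```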