# Called Third Strike
### _Building a Strike Probability Model_
<div>
<img src="resources/baseball_umpire_home_plate_1.jpg" width="600"/>
</div>

---
## Part 8. Produce Strike Probabilities - Alt Version

This notebook alters the work in notebook #7, with the goal of creating predictions for the _training_ data. This data will be used for a visualization exercise.

We will use only the non-NN (XGBoost) model.


---

**This notebook's** objective is to produce the following deliverable:
- ~Strike probabilities for test set data as predicted by the selected [neural network model](./05_improved_neural_network.ipynb).~
- Predicted strike classes for the _training_ set data from the selected non-neural network model, in this case [an XGBoost model](./06_improved_non_nn_model.ipynb).


- The deliverable file will be a `.csv` containing:
    - all columns of the training dataset, including `pitch_id` and the actual outcome (`strike_actual`)
    - `strike_pred`, the predicted class


The majority of the content here will be reproductions from the ~[neural network notebook](./05_improved_neural_network.ipynb) and the~ [XGBoost notebook](./06_improved_non_nn_model.ipynb), but I wanted to unify it here for easier reference. 

---
---

### Table of Contents<a id='7_toc'></a>

<a href='#7_data'>1. Data Preparation</a>

~<a href='#7_nn_pred'>2. Predictions - Neural Network</a>~

<a href='#7_xgb_pred'>3. Predictions - XGBoost</a>

<a href='#7_dx'>4. Diagnostics</a>

...

<a href='#7_the_end'>Go to the End</a>

<span style="font-size:0.75em;">Note that some hyperlinks in this notebook may only work in a local context.</span>

---

---  
### 1. Data Preparation<a id='7_data'></a>
<span style="font-size:0.5em;"><a href='#7_toc'>Back to TOC</a></span>

For demo purposes we will ingest the enriched training set and apply all the preprocessing steps we used during development to create the model input.


#### Libraries

In [2]:
!pip install xgboost==1.3.3

Defaulting to user installation because normal site-packages is not writeable
Collecting xgboost==1.3.3
  Downloading xgboost-1.3.3-py3-none-manylinux2010_x86_64.whl (157.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m157.5/157.5 MB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: xgboost
Successfully installed xgboost-1.3.3


In [3]:
# Data wrangling and operations
import pandas as pd
import numpy as np
from datetime import datetime, timezone
import pytz
import pickle

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# try:
#     import plotly_express as pex
# except ImportError:
#     !pip install plotly_express
# except ModuleNotFoundError:
#     !pip install plotly_express

# Estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.base import BaseEstimator

# Processing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Assessment
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, precision_score, auc
from sklearn.metrics import roc_curve, RocCurveDisplay

# Custom code
from project_helpers import get_clf, DummyEstimator

2023-01-25 18:43:51.994125: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-01-25 18:43:51.994148: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


Install and import the following packages if needed (they are not included in the default packages on the platform I'm currently using).

In [4]:
%%capture
try:
    import scikeras
except ImportError:
    # ModuleNotFoundError is a subclass of ImportError, so one handler is enough
    !pip install scikeras[tensorflow]

In [5]:
from scikeras.wrappers import KerasClassifier

We need an older version of `xgboost` (1.3.3, installed above) because the pickled model file depends on it.

In [6]:
!pip freeze | grep xgboost

xgboost==1.3.3
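
Since unpickling a model under a different library version can fail or behave differently, one option is to check the installed version before loading anything. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `require_version` are hypothetical and not part of the project code.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def require_version(package, expected):
    """Raise early if `package` is missing or not pinned to `expected`."""
    found = installed_version(package)
    if found != expected:
        raise RuntimeError(
            f"{package}=={expected} is required for the pickled model, "
            f"but found {found!r}"
        )
    return found


# Hypothetical usage in this notebook: require_version("xgboost", "1.3.3")
```

Failing before `pickle.load` gives a clearer error than a version mismatch surfacing somewhere inside the model object.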


---
#### a. Import Raw Data

For this we will import the training data set and pre-process it as we did with the test set in [notebook #7](./07_produce_strike_probabilities.ipynb).

In [7]:
df_train = pd.read_pickle('../data/train_enriched.pkl')
# df_train['strike_bool'].value_counts(normalize=True)

In [8]:
# df_train.describe()
df_train.shape

(350959, 31)

### Alias the csv's for cleanliness
url_test = 'https://drive.google.com/file/d/1Cfb7CBORgo5tpJoPUwmOkw3HBzXIorlE/view?usp=sharing'
url_test ='https://drive.google.com/uc?id=' + url_test.split('/')[-2]
f'URL for test data: {url_test}'

### Import test.csv
df_test = pd.read_csv(url_test)

In [9]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350959 entries, 0 to 354038
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   pitch_id              350959 non-null  object        
 1   inning                350959 non-null  int64         
 2   side                  350959 non-null  object        
 3   run_diff              350959 non-null  int64         
 4   at_bat_index          350959 non-null  int64         
 5   pitch_of_ab           350959 non-null  int64         
 6   batter                350959 non-null  int64         
 7   pitcher               350959 non-null  int64         
 8   catcher               350959 non-null  int64         
 9   umpire                350959 non-null  int64         
 10  bside                 350959 non-null  object        
 11  pside                 350959 non-null  object        
 12  stringer_zone_bottom  350959 non-null  float64       
 13 

In [10]:
f'There are {df_train.shape[0]} rows.'

'There are 350959 rows.'

---
#### b. Feature Selection and Engineering

We decided on the following features for our final models; for simplicity this selection is the same for both the neural network and XGBoost model:


- `px` which is the horizontal location of the pitch at the plate
- `pz` which is the vertical location
- `stringer_zone_bottom` which is an estimate of current batter's strike zone bottom
- `stringer_zone_top` which is an estimate of current batter's strike zone top
- `break_x`, the horizontal break of the ball at the plate
- `break_z`, the vertical break of the ball at the plate
- `angle_x`, the horizontal angle of the ball at the plate, compared to if it had traveled in a straight line from release
- `angle_z`, the vertical angle of the ball at the plate, compared to if it had traveled in a straight line from release
- `pitch_speed`, how fast the ball is traveling
- `bside`, the batter's side (will be one-hot encoded)
- `pside`, the pitcher's handedness (will be one-hot encoded)

---
Here are the selected features:

Numeric:

In [11]:
feat_select = ['px',
               'pz',
               'stringer_zone_bottom',
               'stringer_zone_top',
               'break_x',
               'break_z',
               'angle_x',
               'angle_z',
               'pitch_speed',
               ]

##### Categorical
`bside`, `pside`

These are strings so I will one-hot encode them. Their order should end up identical to their order in the original model training.

In [12]:
feat_cat_select = ['bside', 'pside']

In [13]:
# df_hold_cat_ohe = pd.get_dummies(df_test[feat_cat_select])
df_train_cat_ohe = pd.get_dummies(df_train[feat_cat_select])

In [14]:
df_train_cat_ohe.head()

Unnamed: 0,bside_L,bside_R,pside_L,pside_R
0,0,1,1,0
1,1,0,0,1
2,1,0,1,0
3,0,1,1,0
5,1,0,0,1


Sort the columns alphabetically to ensure the column order is the same as the one the model was trained on.
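
`pd.get_dummies` only creates columns for categories that are present in the data, so a subset with, say, no left-handed batters would yield fewer columns than the model expects. One way to guard against that, sketched below under stated assumptions, is to `reindex` against a fixed column list. The toy data and the `expected_cols` name are illustrative and not taken from the project code.

```python
import pandas as pd

# Toy data (assumed): every batter is right-handed, so `bside_L` never appears.
toy = pd.DataFrame({"bside": ["R", "R", "R"], "pside": ["L", "R", "L"]})

ohe = pd.get_dummies(toy[["bside", "pside"]])
print(list(ohe.columns))  # no 'bside_L' column

# Column order the model is assumed to have been trained with (alphabetical).
expected_cols = ["bside_L", "bside_R", "pside_L", "pside_R"]

# Add any missing dummy columns as zeros and enforce a fixed order.
ohe_fixed = ohe.reindex(columns=expected_cols, fill_value=0)
print(list(ohe_fixed.columns))
```

With the full training set both categories are present for each column, so this guard matters mainly when the same steps are reused on smaller slices.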

#### c. Create input files

Here for each model we will:
- `concat` the raw numerical features with the OHE'd categoricals.
- Then we will standard-scale the combined features.
    - The `StandardScaler` fitted during training was previously `pickle`d and will be applied here.
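
For reference, `StandardScaler.transform` subtracts each column's mean and divides by its standard deviation, using the statistics learned when the scaler was fitted rather than recomputing them on the new data. The NumPy sketch below imitates that behaviour on made-up numbers; `fit_stats` and `apply_stats` are hypothetical helpers, not scikit-learn functions.

```python
import numpy as np


def fit_stats(X):
    """Learn per-column mean and population standard deviation (ddof=0)."""
    return X.mean(axis=0), X.std(axis=0)


def apply_stats(X, mean, scale):
    """Scale X with previously learned statistics; column order must match."""
    return (X - mean) / scale


# Made-up values standing in for two numeric features.
X_fit = np.array([[90.0, 2.0], [95.0, 3.0], [100.0, 4.0]])
mean, scale = fit_stats(X_fit)

# New rows are scaled with the stored statistics, not their own.
X_new = np.array([[95.0, 3.0], [105.0, 1.0]])
X_new_scaled = apply_stats(X_new, mean, scale)
print(X_new_scaled)
```

Because the statistics are stored per column position, the input columns have to arrive in the same order as at fit time, which is why the columns are sorted before scaling.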

In [15]:
# Get just the selected features
df_train_slct = pd.concat([df_train[feat_select], df_train_cat_ohe], axis=1)

display(df_train_slct.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350959 entries, 0 to 354038
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   px                    350959 non-null  float64
 1   pz                    350959 non-null  float64
 2   stringer_zone_bottom  350959 non-null  float64
 3   stringer_zone_top     350959 non-null  float64
 4   break_x               350959 non-null  float64
 5   break_z               350959 non-null  float64
 6   angle_x               350959 non-null  float64
 7   angle_z               350959 non-null  float64
 8   pitch_speed           350959 non-null  float64
 9   bside_L               350959 non-null  uint8  
 10  bside_R               350959 non-null  uint8  
 11  pside_L               350959 non-null  uint8  
 12  pside_R               350959 non-null  uint8  
dtypes: float64(9), uint8(4)
memory usage: 28.1 MB


None

In [16]:
col_primary_order = list(df_train_slct.columns)
col_primary_order.sort()
col_primary_order

['angle_x',
 'angle_z',
 'break_x',
 'break_z',
 'bside_L',
 'bside_R',
 'pitch_speed',
 'pside_L',
 'pside_R',
 'px',
 'pz',
 'stringer_zone_bottom',
 'stringer_zone_top']

In [17]:
df_train_slct = df_train_slct[col_primary_order]

In [18]:
df_train_slct.head()

Unnamed: 0,angle_x,angle_z,break_x,break_z,bside_L,bside_R,pitch_speed,pside_L,pside_R,px,pz,stringer_zone_bottom,stringer_zone_top
0,3.02727,5.59379,1.91535,-9.54142,0,1,97.4298,1,0,-1.2981,2.30217,1.56,3.41
1,-1.56782,6.86676,-12.1373,-21.9427,1,0,91.7712,0,1,1.41222,1.57443,1.59,3.47
2,2.04966,7.17281,-0.992261,-25.5107,1,0,87.813,1,0,-0.18119,2.11248,1.68,3.58
3,2.96845,8.50392,-2.8393,-27.2509,0,1,86.5546,1,0,-0.885538,0.598692,1.63,3.55
5,-1.6463,9.33291,8.90615,-59.4133,1,0,72.0904,0,1,-1.45954,3.39951,1.5,3.3


Fetch `StandardScaler`

In [19]:
load_path = './models/scalers/nn_scaler_20220511_1134.pickle'

with open(load_path, 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    scaler = pickle.load(f)    

In [20]:
df_train_slct_scaled = scaler.transform(df_train_slct)

df_train_slct_scaled.shape

(350959, 13)

In [21]:
df_train_slct_scaled[:5]

array([[ 1.61150631, -0.52799452,  0.32485034,  1.25079619, -0.83902664,
         0.83902664,  1.40738267,  1.54231376, -1.54231376, -1.3403589 ,
         0.09292537, -0.10202344,  0.016214  ],
       [-0.35435097,  0.0804241 , -1.17793851,  0.23968821,  1.19185727,
        -1.19185727,  0.4821331 , -0.6483765 ,  0.6483765 ,  1.33759252,
        -0.53266107,  0.32478776,  0.48676475],
       [ 1.1932682 ,  0.22670132,  0.01391074, -0.05121993,  1.19185727,
        -1.19185727, -0.1650805 ,  1.54231376, -1.54231376, -0.23678784,
        -0.07013766,  1.60522137,  1.34944113],
       [ 1.58634212,  0.86290808, -0.18361141, -0.19310288, -0.83902664,
         0.83902664, -0.37084412,  1.54231376, -1.54231376, -0.93272401,
        -1.3714338 ,  0.89386937,  1.11416576],
       [-0.38792605,  1.25912555,  1.07244573, -2.81538537,  1.19185727,
        -1.19185727, -2.73591584, -0.6483765 ,  0.6483765 , -1.49987087,
         1.03623071, -0.95564585, -0.84646238]])

---  
### 3. Predictions - XGBoost<a id='7_xgb_pred'></a>

<span style="font-size:0.5em;"><a href='#7_toc'>Back to TOC</a></span>

---  

#### Make Predictions!

~- For completeness we will produce various files for both strike outcomes and probabilities, though we will only deliver the probabilities files.~

For the Tableau dataset we will need predicted classes, and the feature data.

### A. Prep dataset for Tableau

- We'll use `pitch_id` as identifier, and we want the training features.
- We'll also want the actual strike outcome `strike_bool`
- Let's use the training features
- Get predicted classes and concatenate

In fact, why don't we just use the **entire training dataset**, not only the model features? It is a bit of overkill, but on the Tableau side it would be convenient to have columns we didn't model on, such as `pitch_type`, which could add some color/context to the application.

#### Prep training dataset

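We'll need to `reset_index` so that it `concat`s cleanly with our prediction set: the training frame's index has gaps (350,959 rows indexed from 0 to 354,038), while the predictions will carry a fresh 0-based index.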

In [26]:
pitch_ids = df_train.reset_index(drop=True)
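
To illustrate why the reset matters, here is a small sketch on toy data (the frames and values are made up): `pd.concat(..., axis=1)` aligns rows on index labels, so an index with gaps produces extra rows and `NaN`s instead of a side-by-side join.

```python
import pandas as pd

# Toy stand-in for the training frame: index label 2 is missing.
features = pd.DataFrame({"pitch_id": ["a", "b", "c"]}, index=[0, 1, 3])

# Toy stand-in for predictions built from a NumPy array: fresh 0..n-1 index.
preds = pd.DataFrame({"strike_pred": [0, 1, 0]})

# Without a reset, rows are matched by label, not by position.
misaligned = pd.concat([features, preds], axis=1)
print(misaligned)

# After the reset, both frames share the same 0..n-1 index.
aligned = pd.concat([features.reset_index(drop=True), preds], axis=1)
print(aligned)
```

The misaligned result has four rows for three pitches, which is the kind of silent error the later row-count and null checks are meant to catch.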

In [33]:
display(pitch_ids.head().T)
display(pitch_ids.tail().T)

Unnamed: 0,0,1,2,3,4
pitch_id,01311c57-5046-48d7-ac68-000060a98ccb,208d0186-b7c9-46bd-8297-0001539b714c,4a24d09e-2d9b-4d12-a0eb-0004723ce539,486aa6b8-7c43-4974-8a53-000611a9c649,5c9afebb-b70b-45d3-95ee-0017115df7c9
inning,7,9,1,1,2
side,home,home,home,home,home
run_diff,-2,4,0,2,-2
at_bat_index,54,69,1,5,12
pitch_of_ab,5,2,3,3,1
batter,405947,468294,406141,615134,582836
pitcher,756778,778005,451846,564585,582729
catcher,528871,594400,633795,633812,594400
umpire,482420,583103,423579,482532,575678


Unnamed: 0,350954,350955,350956,350957,350958
pitch_id,debd3bc1-d5bf-491b-865f-fff16ed8ed94,d0d2f501-0f00-4a1c-afbd-fff3d840b12a,e0144b0e-26da-493e-9284-fff54874c94a,cb8fcf04-02fc-475f-afc9-fffa77e8a70a,f6f124c5-cadb-49b4-afc9-fffeeb5fbbfa
inning,8,6,6,6,3
side,away,home,away,away,home
run_diff,-1,0,7,-5,0
at_bat_index,64,47,50,54,16
pitch_of_ab,2,6,5,1,1
batter,614712,548676,466175,615238,777737
pitcher,529849,582661,582938,614737,529217
catcher,573687,577406,781416,564041,784663
umpire,480948,594132,574236,423580,596240


In [34]:
pitch_ids.shape

(350959, 31)

##### Rename `strike_bool` to `strike_actual`


In [35]:
pitch_ids.rename(columns={'strike_bool': 'strike_actual'}, inplace=True)

In [36]:
display(pitch_ids.head())
display(pitch_ids.tail())

Unnamed: 0,pitch_id,inning,side,run_diff,at_bat_index,pitch_of_ab,batter,pitcher,catcher,umpire,bside,pside,stringer_zone_bottom,stringer_zone_top,on_1b_mlbid,on_2b_mlbid,on_3b_mlbid,outs,balls,strikes,pitch_speed,px,pz,break_x,break_z,angle_x,angle_z,pitch_type,strike_actual,game_date_dt,strike_bool_tf
0,01311c57-5046-48d7-ac68-000060a98ccb,7,home,-2,54,5,405947,756778,528871,482420,R,L,1.56,3.41,,,,1,3,1,97.4298,-1.2981,2.30217,1.91535,-9.54142,3.02727,5.59379,FA,0,2021-05-13,False
1,208d0186-b7c9-46bd-8297-0001539b714c,9,home,4,69,2,468294,778005,594400,583103,L,R,1.59,3.47,614736.0,561368.0,,0,1,0,91.7712,1.41222,1.57443,-12.1373,-21.9427,-1.56782,6.86676,FA,0,2021-07-29,False
2,4a24d09e-2d9b-4d12-a0eb-0004723ce539,1,home,0,1,3,406141,451846,633795,423579,L,L,1.68,3.58,,,,1,2,0,87.813,-0.18119,2.11248,-0.992261,-25.5107,2.04966,7.17281,SL,1,2021-05-15,True
3,486aa6b8-7c43-4974-8a53-000611a9c649,1,home,2,5,3,615134,564585,633812,482532,R,L,1.63,3.55,433785.0,,,2,0,2,86.5546,-0.885538,0.598692,-2.8393,-27.2509,2.96845,8.50392,SL,0,2021-06-05,False
4,5c9afebb-b70b-45d3-95ee-0017115df7c9,2,home,-2,12,1,582836,582729,594400,575678,L,R,1.5,3.3,577470.0,,,2,0,0,72.0904,-1.45954,3.39951,8.90615,-59.4133,-1.6463,9.33291,CU,0,2021-04-04,False


Unnamed: 0,pitch_id,inning,side,run_diff,at_bat_index,pitch_of_ab,batter,pitcher,catcher,umpire,bside,pside,stringer_zone_bottom,stringer_zone_top,on_1b_mlbid,on_2b_mlbid,on_3b_mlbid,outs,balls,strikes,pitch_speed,px,pz,break_x,break_z,angle_x,angle_z,pitch_type,strike_actual,game_date_dt,strike_bool_tf
350954,debd3bc1-d5bf-491b-865f-fff16ed8ed94,8,away,-1,64,2,614712,529849,573687,480948,R,R,1.59,3.47,,,,0,1,0,98.9278,-2.04498,4.05574,-12.4306,-12.8704,1.45618,3.26582,FA,0,2021-07-07,False
350955,d0d2f501-0f00-4a1c-afbd-fff3d840b12a,6,home,0,47,6,548676,582661,577406,594132,R,R,1.6,3.49,630493.0,563994.0,,2,3,2,94.0491,-0.660754,4.61266,-3.84278,-12.1593,-1.1082,2.62405,FA,0,2021-08-05,False
350956,e0144b0e-26da-493e-9284-fff54874c94a,6,away,7,50,5,466175,582938,781416,574236,R,R,1.59,3.47,757494.0,,,1,1,2,92.8665,-1.51669,2.30947,-11.238,-22.9221,1.39454,7.15734,FA,0,2021-09-24,False
350957,cb8fcf04-02fc-475f-afc9-fffa77e8a70a,6,away,-5,54,1,615238,614737,564041,423580,R,R,1.63,3.55,,,,1,0,0,82.7588,-1.36022,2.85837,-0.383448,-35.8633,0.00739,7.23086,SL,0,2021-06-14,False
350958,f6f124c5-cadb-49b4-afc9-fffeeb5fbbfa,3,home,0,16,1,777737,529217,784663,596240,L,R,1.59,3.47,,,,2,0,0,93.476,-0.679888,2.98412,-11.2071,-9.66032,-0.069135,3.85099,FA,1,2021-04-21,True


#### Get predicted classes (XGBoost model)

We had previously pickled the candidate XGBoost model. (*Note that the file was moved manually from where it was originally created in [this notebook](./06_improved_non_nn_model.ipynb)*)

In [37]:
#%%script echo skipping

best_file_path = ('./models/best_models/classic_2nd_pass_best_model_' +
                  'xgb_20220511_2250.pickle'
                 )
with open(best_file_path, 'rb') as f:
    selected_model = pickle.load(f)

In [38]:
selected_model.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('clf',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, criterion='entropy', gamma=0,
                               gpu_id=-1, importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=4, max_features='sqrt', max_samples=1,
                               min_child_weight=1, min_samples_split=0.38,
                               missing=nan, monotone_constraints='()',
                               n_estimators=200, n_jobs=16, num_parallel_tree=1,
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,


In [39]:
pred_classes = selected_model.predict(df_train_slct_scaled)
display(pred_classes.shape)

(350959,)

Quick look at results:

In [40]:
pred_classes[:100]

array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1])

In [41]:
pred_classes.sum()

110328

**Sniff Test #1:**  
Look at relative ratio of strikes:

Over this **train** set:

In [42]:
df_pred = pd.DataFrame(pred_classes, columns=['strike_pred'])
df_pred.value_counts(normalize=True)

strike_pred
0              0.685638
1              0.314362
dtype: float64

For comparison, relative ratio of actuals in **training** data:

In [43]:
df_train = pd.read_pickle('../data/train_enriched.pkl')
df_train['strike_bool'].value_counts(normalize=True)

0    0.686864
1    0.313136
Name: strike_bool, dtype: float64

The ratios are close, as expected, since this is the data the model was trained on.
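
Matching class ratios alone do not show that individual pitches are classified correctly, since errors in both directions can cancel out. A cross-tabulation of actual versus predicted classes makes that visible. The sketch below uses a small made-up frame with the column names `strike_actual` and `strike_pred` from this notebook; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "strike_actual": [0, 0, 1, 1, 0, 1],
    "strike_pred":   [0, 1, 1, 0, 0, 1],
})

# Both columns have the same share of strikes...
print(toy["strike_actual"].mean(), toy["strike_pred"].mean())

# ...but the cross-tabulation shows one error in each direction.
confusion = pd.crosstab(toy["strike_actual"], toy["strike_pred"])
print(confusion)

accuracy = (toy["strike_actual"] == toy["strike_pred"]).mean()
print(f"accuracy: {accuracy:.3f}")
```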

**Sniff Test #2:**  
Look at the strike probabilities.

In [44]:
pred_proba = selected_model.predict_proba(df_train_slct_scaled)

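Check the sum. Each row holds the probabilities of a ball and of a strike, which sum to 1, so the total should be close to the number of rows (350,959).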

In [45]:
np.sum(pred_proba)

350959.1

In [46]:
np.sum(pred_proba, axis=0)

array([240940.58, 110018.4 ], dtype=float32)
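
The two totals above are aggregate checks. A stricter variant, sketched here with NumPy on made-up probabilities, verifies that every individual row sums to 1 within float32 rounding error.

```python
import numpy as np

# Made-up two-class probabilities in float32, the dtype shown in the output above.
toy_proba = np.array([[0.998, 0.002],
                      [0.004, 0.996],
                      [0.838, 0.162]], dtype=np.float32)

row_sums = toy_proba.sum(axis=1)

# Every row should sum to 1 within float32 rounding error.
assert np.allclose(row_sums, 1.0, atol=1e-5)

# The grand total should therefore be close to the number of rows.
print(toy_proba.sum(), len(toy_proba))
```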

Quick look at results:

- Look at strike probabilities

In [47]:
pred_proba_strike_only = pred_proba[:, -1]
np.round(pred_proba_strike_only[:10], 3)

array([0.002, 0.   , 0.996, 0.   , 0.   , 0.162, 0.975, 0.955, 0.   ,
       0.   ], dtype=float32)
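
For a two-class model, predicted classes typically correspond to thresholding the strike probability at 0.5. Assuming that holds for the pipeline used here (an assumption, not something verified in this notebook), the relationship can be sketched with NumPy on made-up values; the `threshold` name is illustrative.

```python
import numpy as np

# Made-up strike probabilities (second column of a predict_proba-style array).
strike_proba = np.array([0.002, 0.0, 0.996, 0.162, 0.975, 0.51], dtype=np.float32)

threshold = 0.5  # assumed default decision threshold
strike_class = (strike_proba > threshold).astype(int)
print(strike_class)

# The sum of probabilities is the expected number of strikes, which need not
# equal the number of pitches classified as strikes.
print(strike_proba.sum(), strike_class.sum())
```

This would be consistent with the column total of strike probabilities (about 110,018) differing slightly from the count of predicted strikes (110,328).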

#### Create output file
Put into a `DataFrame` for readability and ease of export to `.csv`

In [50]:
df_strk_pred = pd.concat([pitch_ids, df_pred], axis=1)

# df_strk_pred.columns = ['pitch_id', 'pred_strike']

display(df_strk_pred.head(10).T)
display(df_strk_pred.tail(10).T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
pitch_id,01311c57-5046-48d7-ac68-000060a98ccb,208d0186-b7c9-46bd-8297-0001539b714c,4a24d09e-2d9b-4d12-a0eb-0004723ce539,486aa6b8-7c43-4974-8a53-000611a9c649,5c9afebb-b70b-45d3-95ee-0017115df7c9,15a7f649-1ac6-47b2-8d6c-00186e729b3c,5db09bc8-22bc-4781-9f05-001a9f7d2dc8,4b8f95bc-b96d-42c5-accb-001ca191968d,145255da-ebe2-4da6-9cfc-001e9288a854,39b9e470-9f67-4443-abef-0022f82b60e5
inning,7,9,1,1,2,7,5,3,9,6
side,home,home,home,home,home,away,home,home,home,away
run_diff,-2,4,0,2,-2,-6,1,0,-1,0
at_bat_index,54,69,1,5,12,73,35,17,60,43
pitch_of_ab,5,2,3,3,1,1,2,2,1,1
batter,405947,468294,406141,615134,582836,562082,614690,529245,561161,784827
pitcher,756778,778005,451846,564585,582729,529457,561153,471807,538326,785097
catcher,528871,594400,633795,633812,594400,627109,781948,437382,529785,469396
umpire,482420,583103,423579,482532,575678,482764,440501,482905,482939,574301


Unnamed: 0,350949,350950,350951,350952,350953,350954,350955,350956,350957,350958
pitch_id,ef172bc9-4151-4a34-8083-ffbea6f1c1e1,d325171b-0792-40e4-a53d-ffc9f606086a,e5f86caa-1e71-4697-8fa5-ffcbd7f03352,cddf1d95-8de5-4ae4-a8c1-ffe460657877,d2087d6a-b663-4de3-9186-ffe745d5c516,debd3bc1-d5bf-491b-865f-fff16ed8ed94,d0d2f501-0f00-4a1c-afbd-fff3d840b12a,e0144b0e-26da-493e-9284-fff54874c94a,cb8fcf04-02fc-475f-afc9-fffa77e8a70a,f6f124c5-cadb-49b4-afc9-fffeeb5fbbfa
inning,6,5,1,10,1,8,6,6,6,3
side,away,away,home,home,home,away,home,away,away,home
run_diff,0,3,0,0,0,-1,0,7,-5,0
at_bat_index,45,45,0,80,1,64,47,50,54,16
pitch_of_ab,6,4,4,1,3,2,6,5,1,1
batter,781948,531864,394001,578286,406732,614712,548676,466175,615238,777737
pitcher,405952,529032,529374,776776,772075,529849,582661,582938,614737,529217
catcher,490814,782684,768784,469434,573523,573687,577406,781416,564041,784663
umpire,634203,583088,440496,482316,482337,480948,594132,574236,423580,596240


Check that there are no extra rows:

In [51]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350959 entries, 0 to 354038
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   pitch_id              350959 non-null  object        
 1   inning                350959 non-null  int64         
 2   side                  350959 non-null  object        
 3   run_diff              350959 non-null  int64         
 4   at_bat_index          350959 non-null  int64         
 5   pitch_of_ab           350959 non-null  int64         
 6   batter                350959 non-null  int64         
 7   pitcher               350959 non-null  int64         
 8   catcher               350959 non-null  int64         
 9   umpire                350959 non-null  int64         
 10  bside                 350959 non-null  object        
 11  pside                 350959 non-null  object        
 12  stringer_zone_bottom  350959 non-null  float64       
 13 

In [52]:
sum(df_strk_pred['pitch_id'].isna())

0

In [54]:
sum(df_strk_pred['strike_pred'].isna())

0
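
The row and null checks can also be written as assertions against the concatenated frame itself, so that a misaligned `concat` fails loudly. This is a sketch on toy data; `check_output` is a hypothetical helper, and the uniqueness check assumes `pitch_id` is meant to be unique per row.

```python
import pandas as pd


def check_output(df_out, n_expected):
    """Validate a prediction frame before export; raises AssertionError on failure."""
    assert len(df_out) == n_expected, "row count changed during concat"
    assert df_out["pitch_id"].notna().all(), "missing pitch_id values"
    assert df_out["strike_pred"].notna().all(), "missing predictions"
    assert df_out["pitch_id"].is_unique, "duplicate pitch_id values"
    return True


toy_out = pd.DataFrame({
    "pitch_id": ["a", "b", "c"],
    "strike_pred": [0, 1, 0],
})
print(check_output(toy_out, n_expected=3))
```

In this notebook the equivalent call would compare the output frame against the number of rows in the training set.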

In [55]:
str_ts = datetime.now(timezone.utc).astimezone(pytz.timezone('US/Pacific')).strftime("%Y%m%d_%H%M")
# file_class_nm = 'xgb_pred_vals_' + str_ts
# file_proba_nm = 'xgb_pred_proba_' + str_ts
# file_proba_stk_nm = 'xgb_strike_proba_by_pitch_id_' + str_ts
file_class_stk_nm = 'xgb_strike_class_by_pitch_id_' + str_ts

#file_class_path = './predictions/for_tableau/' + file_class_nm + '.csv'
# file_proba_path = './predictions/holdout/' + file_proba_nm + '.csv'
# file_proba_stk_path = './predictions/holdout/deliverables/' + file_proba_stk_nm + '.csv'
file_class_stk_path = './predictions/for_tableau/' + file_class_stk_nm + '.csv'

display(file_class_stk_path)
# display(file_proba_path)
# display(file_proba_stk_path)

'./predictions/for_tableau/xgb_strike_class_by_pitch_id_20230125_1053.csv'

In [56]:
# Actual deliverable in desired format
df_strk_pred.to_csv(file_class_stk_path, index=False)
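
`to_csv` does not create missing directories, so the export can fail on a fresh checkout if `./predictions/for_tableau/` does not exist yet. Below is a minimal sketch of building the timestamped path with `pathlib` and creating the folder first. `timestamped_csv_path` is a hypothetical helper, the temporary directory stands in for the project folder, and the sketch uses UTC where the notebook converts to US/Pacific.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def timestamped_csv_path(base_dir, stem, now=None):
    """Return base_dir/<stem><YYYYmmdd_HHMM>.csv, creating base_dir if needed."""
    now = now or datetime.now(timezone.utc)
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    return base / f"{stem}{now.strftime('%Y%m%d_%H%M')}.csv"


with tempfile.TemporaryDirectory() as tmp:
    # Fixed timestamp so the resulting file name is reproducible.
    fixed = datetime(2023, 1, 25, 10, 53, tzinfo=timezone.utc)
    path = timestamped_csv_path(
        Path(tmp) / "predictions" / "for_tableau",
        "xgb_strike_class_by_pitch_id_",
        now=fixed,
    )
    print(path.name)
    print(path.parent.is_dir())
```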

---  

<span style="font-size:0.5em;">End of Current Work</span>

<a id='7_the_end'></a>

<span style="font-size:0.5em;"><a href='#7_toc'>Back to TOC</a></span>

-----