# Toronto Blue Jays - Assignment
### Date: Sept 5, 2023
### By: Xenel Nazar
### Contact Info: xenel.nazar@gmail.com

### Introduction

This notebook is my submission for the Toronto Blue Jays Technical Assignment.

Provided are two files, `deploy.csv` and `training.csv`, which contain the following columns (`InPlay` appears only in `training.csv`):
- `InPlay` – A binary column indicating if the batter put the ball in play (1 = in play, 0 = not in play)
- `Velo` – The velocity of the pitch at release (in mph)
- `SpinRate` – The Spin Rate of the pitch at release (in rpm)
- `HorzBreak` – The amount of movement the pitch had in the horizontal direction (in inches)
- `InducedVertBreak` – The amount of movement (in inches) the pitch had in the vertical direction after accounting for the effects of gravity. A positive value means the pitch would move up in a gravity-free environment that still had the same air resistance.


We will then utilize the data to answer the following questions:

A right-handed pitcher is curious about how velocity, movement, and spin rates on fastballs affect the chances of batters putting the ball in play. Attached are two CSV files. Both files contain 10,000 random pitches of fastballs thrown in the strike zone by right-handed pitchers to right-handed batters (swings and takes are both included). One of them (training.csv) also includes whether the batter was able to put the ball in play.

1. Predict the chance of a pitch being put in play. Please use this model to predict the chance of each pitch in the “deploy.csv” file being put in play and return a CSV with your predictions.
2. In one paragraph, please explain your process and reasoning for any decisions you made in Question 1.
3. In one or two sentences, please describe to the pitcher how these 4 variables affect the batter’s ability to put the ball in play. You can also include one plot or table to show to the pitcher if you think it would help.
4. In one or two sentences, please describe what you would see as the next steps with your model and/or results if you were in the analyst role and had another week to work on the question posed by the pitcher.



#### Import Libraries
We will first load the libraries needed to assist with Data Wrangling and Data Cleaning to help prepare for the next steps in EDA, as well as Modeling.

In [127]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [128]:
# import train data
train = pd.read_csv('training.csv')

In [129]:
train.head(10)

Unnamed: 0,InPlay,Velo,SpinRate,HorzBreak,InducedVertBreak
0,0,95.33,2893.0,10.68,21.33
1,0,94.41,2038.0,17.13,5.77
2,0,90.48,2183.0,6.61,15.39
3,0,93.04,2279.0,9.33,14.57
4,0,95.17,2384.0,6.99,17.62
5,0,95.0,2580.0,7.16,16.07
6,0,97.94,2376.0,12.29,18.11
7,0,95.42,2103.0,7.98,10.98
8,0,94.12,2535.0,5.68,18.59
9,0,93.23,2242.0,4.1,16.95


In [130]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   InPlay            10000 non-null  int64  
 1   Velo              10000 non-null  float64
 2   SpinRate          9994 non-null   float64
 3   HorzBreak         10000 non-null  float64
 4   InducedVertBreak  10000 non-null  float64
dtypes: float64(4), int64(1)
memory usage: 390.8 KB


In [131]:
print('Train Data Shape:', train.shape)
print('\nNumber of rows:', train.shape[0])
print('\nNumber of columns:', train.shape[1])

Train Data Shape: (10000, 5)

Number of rows: 10000

Number of columns: 5


In [132]:
# import deploy data
deploy = pd.read_csv('deploy.csv')

In [133]:
deploy.head(10)

Unnamed: 0,Velo,SpinRate,HorzBreak,InducedVertBreak
0,94.72,2375.0,3.1,18.15
1,95.25,2033.0,11.26,14.5
2,92.61,2389.0,11.0,21.93
3,94.94,2360.0,6.84,18.11
4,97.42,2214.0,16.7,13.38
5,95.98,2495.0,11.25,17.12
6,94.88,1998.0,15.13,15.22
7,92.73,2049.0,1.55,18.47
8,92.39,1955.0,18.15,7.25
9,95.77,1976.0,10.04,14.56


In [134]:
deploy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Velo              10000 non-null  float64
 1   SpinRate          9987 non-null   float64
 2   HorzBreak         10000 non-null  float64
 3   InducedVertBreak  10000 non-null  float64
dtypes: float64(4)
memory usage: 312.6 KB


In [135]:
print('Deploy Data Shape:', deploy.shape)
print('\nNumber of rows:', deploy.shape[0])
print('\nNumber of columns:', deploy.shape[1])

Deploy Data Shape: (10000, 4)

Number of rows: 10000

Number of columns: 4


We can see that the deploy csv file does not include the `InPlay` column, which is expected, since these are the values our model will predict.

#### Verify Null Values
From the info output for the train csv file, we can see some null values in the `SpinRate` column. The deploy data also has null values in its `SpinRate` column.

In [136]:
x = train['SpinRate'].isnull().sum()
y = round((x/train.shape[0])*100,2)
print(f"Total number of SpinRate null values:",x)
print(f"The total number of null equates to:", y, "%")

Total number of SpinRate null values: 6
The total number of null equates to: 0.06 %


For our train data, the null values make up 0.06% of all rows. Based on this, we can go forward and remove the rows with null values.

In [137]:
# drop null values
train.dropna(inplace=True)

In [138]:
# reset index
train.reset_index(drop = True, inplace = True)

In [139]:
# verify
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   InPlay            9994 non-null   int64  
 1   Velo              9994 non-null   float64
 2   SpinRate          9994 non-null   float64
 3   HorzBreak         9994 non-null   float64
 4   InducedVertBreak  9994 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 390.5 KB


The null values have been dropped from our train data. 

In [140]:
x = deploy['SpinRate'].isnull().sum()
y = round((x/deploy.shape[0])*100,2)
print(f"Total number of SpinRate null values:",x)
print(f"The total number of null equates to:", y, "%")

Total number of SpinRate null values: 13
The total number of null equates to: 0.13 %


For our deploy data, the null values make up 0.13% of all rows. Based on this small value, we can go forward and remove the rows with null values.

In [141]:
# drop null values
deploy.dropna(inplace=True)

In [142]:
# reset index
deploy.reset_index(drop = True, inplace = True)

In [143]:
deploy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9987 entries, 0 to 9986
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Velo              9987 non-null   float64
 1   SpinRate          9987 non-null   float64
 2   HorzBreak         9987 non-null   float64
 3   InducedVertBreak  9987 non-null   float64
dtypes: float64(4)
memory usage: 312.2 KB


Our null values in our deploy data are now removed.
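Dropping rows from the deploy data means those 13 pitches receive no prediction, while the assignment asks for a prediction for each pitch. As a minimal sketch of one alternative, assuming median imputation is acceptable here, the missing `SpinRate` values could be filled with the training median instead. The helper `impute_with_median` and the sample values are hypothetical and only illustrate the idea.

```python
from statistics import median


def impute_with_median(values, fill_from):
    """Replace None entries in `values` with the median of the non-missing `fill_from` values."""
    fill = median(v for v in fill_from if v is not None)
    return [fill if v is None else v for v in values]


# Hypothetical SpinRate samples; None marks a missing reading.
train_spin = [2893.0, 2038.0, 2183.0, None, 2384.0]
deploy_spin = [2375.0, None, 2389.0]

# The fill value is computed from the training data only.
deploy_filled = impute_with_median(deploy_spin, fill_from=train_spin)
print(deploy_filled)
```

With pandas, the same idea would likely be a one-liner such as `deploy['SpinRate'].fillna(train['SpinRate'].median())`.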

#### Verify Duplicate Values

In [144]:
x = train.duplicated().sum()
y = round((x/train.shape[0])*100,2)

print(f"Total number of duplicate rows:",x)
print(f"The total number of duplicate rows equates to:", y, "%")

Total number of duplicate rows: 0
The total number of duplicate rows equates to: 0.0 %


There are currently no duplicate rows in the train data.

In [145]:
x = deploy.duplicated().sum()
y = round((x/deploy.shape[0])*100,2)

print(f"Total number of duplicate rows:",x)
print(f"The total number of duplicate rows equates to:", y, "%")

Total number of duplicate rows: 0
The total number of duplicate rows equates to: 0.0 %


There are currently no duplicate rows in the deploy data.

#### Review Column Values

In [146]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   InPlay            9994 non-null   int64  
 1   Velo              9994 non-null   float64
 2   SpinRate          9994 non-null   float64
 3   HorzBreak         9994 non-null   float64
 4   InducedVertBreak  9994 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 390.5 KB


#### InPlay

In [147]:
count_unique = train['InPlay'].nunique()
print('# Number of Inplay Values:', count_unique) 
print('InPlay Values:')
train['InPlay'].unique()

# Number of Inplay Values: 2
InPlay Values:


array([0, 1])

We currently see two values for `InPlay`: `0` and `1`.

In [148]:
# Calculate the count of each unique value in the 'InPlay' column
value_counts = train['InPlay'].value_counts().sort_index()

# Create a bar chart
fig = px.bar(x=value_counts.index, y=value_counts.values, title= "InPlay Values")

# Show the plot
fig.show()

For the `InPlay` values there are *7,278* values showing `0` and *2,716* values showing `1`. 

This imbalance between the `0` and `1` values could pose a concern when it comes time to model, as there are more values for 0 than 1, which would skew results toward 0 more often than not.
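To make the imbalance concrete, the class counts above give the accuracy that a model would reach by always predicting `0`. The short calculation below is a sketch using only those two counts; any model's accuracy can be read against this baseline.

```python
# Class counts reported above for the cleaned training data.
n_not_in_play = 7278
n_in_play = 2716
n_total = n_not_in_play + n_in_play

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy = n_not_in_play / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```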

#### Velo

In [149]:
fig = px.histogram(train, x="Velo", opacity=0.8, color_discrete_sequence=['cornflowerblue'])
fig.update_layout(title='Velo Values',xaxis_title='Velo (mph)', yaxis_title='Count')
fig.show()
fig.write_html("visualizations/Velo.html")

In [150]:
train['Velo'].describe()

count    9994.000000
mean       93.957601
std         2.684281
min        59.760000
25%        92.540000
50%        94.100000
75%        95.660000
max       102.040000
Name: Velo, dtype: float64

In [151]:
# Calculate the mean
mean_value = train['Velo'].mean()

# Calculate the median
median_value = train['Velo'].median()

# Calculate the mode
mode_value = train['Velo'].mode().values[0]

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")

Mean: 93.95760056033625
Median: 94.1
Mode: 94.69


The distribution of the `Velo` column is left-skewed, with the majority of the data in the 85-100 mph range. It has a mean of ~93.96 mph, a median of 94.1 mph, and a mode of 94.69 mph. There are likely outliers at the low end of the `Velo` data (the minimum is 59.76 mph); we are keeping them for the time being.
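One common way to flag likely outliers is the 1.5 × IQR rule. The sketch below applies it to the `Velo` quartiles from the `describe()` output above. The function name `iqr_fences` and the choice of this rule are assumptions for illustration, not something the assignment prescribes.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes taken from train['Velo'].describe() above.
q1, q3 = 92.54, 95.66
velo_min, velo_max = 59.76, 102.04

lower, upper = iqr_fences(q1, q3)
print(f"Fences: {lower:.2f} to {upper:.2f} mph")
print("Minimum below lower fence:", velo_min < lower)
print("Maximum above upper fence:", velo_max > upper)
```

Under this rule both the slowest and the fastest pitch fall outside the fences, so a rule like this would need a baseball sanity check before any rows are removed.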

#### SpinRate

In [152]:
fig = px.histogram(train, x="SpinRate", opacity=0.8, color_discrete_sequence=['cornflowerblue'])
fig.update_layout(title='SpinRate Values',xaxis_title='SpinRate (rpm)', yaxis_title='Count')
fig.show()
fig.write_html("visualizations/SpinRate.html")

In [153]:
train['SpinRate'].describe()

count    9994.000000
mean     2238.952471
std       196.041323
min       770.000000
25%      2107.000000
50%      2241.000000
75%      2367.000000
max      3061.000000
Name: SpinRate, dtype: float64

In [154]:
# Calculate the mean
mean_value = train['SpinRate'].mean()

# Calculate the median
median_value = train['SpinRate'].median()

# Calculate the mode
mode_value = train['SpinRate'].mode().values[0]

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")

Mean: 2238.95247148289
Median: 2241.0
Mode: 2313.0


We also see a left skewed distribution for our `SpinRate` column with a majority of values from around 1650-2750 rpm. The SpinRate column has a mean of ~2238.95 rpm, median of 2241 rpm, and a mode of 2313 rpm.

#### HorzBreak

In [155]:
fig = px.histogram(train, x="HorzBreak", opacity=0.8, color_discrete_sequence=['cornflowerblue'])
fig.update_layout(title='HorzBreak Values',xaxis_title='HorzBreak (in)', yaxis_title='Count')
fig.show()
fig.write_html("visualizations/HorzBreak.html")

In [156]:
train['HorzBreak'].describe()

count    9994.000000
mean        9.547185
std         5.052148
min        -6.270000
25%         5.730000
50%         9.430000
75%        13.600000
max        28.040000
Name: HorzBreak, dtype: float64

In [157]:
# Calculate the mean
mean_value = train['HorzBreak'].mean()

# Calculate the median
median_value = train['HorzBreak'].median()

# Calculate the mode
mode_value = train['HorzBreak'].mode().values[0]

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")

Mean: 9.547185311186716
Median: 9.43
Mode: 7.42


We can see that the `HorzBreak` column is roughly normally distributed, with a mean of ~9.55 inches, a median of 9.43 inches, and a mode of 7.42 inches.

#### InducedVertBreak

In [158]:
fig = px.histogram(train, x="InducedVertBreak", opacity=0.8, color_discrete_sequence=['cornflowerblue'])
fig.update_layout(title='InducedVertBreak Values',xaxis_title='InducedVertBreak (in)', yaxis_title='Count')
fig.show()
fig.write_html("visualizations/InducedVertBreak.html")

In [159]:
train['InducedVertBreak'].describe()

count    9994.000000
mean       14.175964
std         4.604653
min        -6.820000
25%        11.370000
50%        15.160000
75%        17.630000
max        24.860000
Name: InducedVertBreak, dtype: float64

In [160]:
# Calculate the mean
mean_value = train['InducedVertBreak'].mean()

# Calculate the median
median_value = train['InducedVertBreak'].median()

# Calculate the mode
mode_value = train['InducedVertBreak'].mode().values[0]

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")

Mean: 14.175963578146895
Median: 15.16
Mode: 16.51


We can see that there is a left skew in our data. We have a mean of ~14.18 inches, a median of 15.16 inches, and a mode of 16.51 inches.

### X and Y Split

We will now split our train data based on our target column (`InPlay`) and the balance of our columns in preparation of modeling.

In [161]:
# x and y split
y_col = 'InPlay'
y_clean = train[y_col]
X_clean = train[train.columns.drop(y_col)]

In [162]:
# verify
X_clean

Unnamed: 0,Velo,SpinRate,HorzBreak,InducedVertBreak
0,95.33,2893.0,10.68,21.33
1,94.41,2038.0,17.13,5.77
2,90.48,2183.0,6.61,15.39
3,93.04,2279.0,9.33,14.57
4,95.17,2384.0,6.99,17.62
...,...,...,...,...
9989,93.61,2074.0,13.08,7.39
9990,90.72,1928.0,14.10,6.08
9991,94.19,2694.0,0.98,14.95
9992,92.65,2176.0,9.28,17.62


In [163]:
# verify
y_clean

0       0
1       0
2       0
3       0
4       0
       ..
9989    0
9990    1
9991    1
9992    0
9993    0
Name: InPlay, Length: 9994, dtype: int64

In [164]:
# Check target variable proportions
y_clean.value_counts()/len(y_clean)*100

0    72.823694
1    27.176306
Name: InPlay, dtype: float64

As mentioned earlier, the class imbalance in our data is still a concern. With more time, we could try oversampling or undersampling our data, or a technique such as SMOTE.
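As a rough illustration of the simplest of these options, the sketch below performs random oversampling with the standard library only. The helper `random_oversample` and the toy rows are hypothetical. In practice, resampling would be applied to the training subset only, never to the validation data.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_count = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sampling with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=majority_count - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: four pitches not in play (0), two in play (1).
rows = [[95.3], [94.4], [90.5], [93.0], [95.2], [95.0]]
labels = [0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```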

#### Train-Validation Split

We will create a validation subset to help with our modelling. For simplicity we can do a 70/30 Train-Validation split. For reference we originally have 9,994 rows in our data.

In [165]:
# Import the helper for splitting our data
from sklearn.model_selection import train_test_split
# Split into train and validation
X_train, X_validation, y_train, y_validation = train_test_split(X_clean, 
                                                                      y_clean, 
                                                                      test_size = 0.30, 
                                                                      stratify=y_clean,
                                                                      random_state=1)

In [166]:
# Verify # of rows in train and validation data
i = X_train.shape[0]
j = X_validation.shape[0]
n = i+j

print('Train Data # of Rows:', i)
print('Train Data equates to', round((i/n)*100),'% of our data')
print('Validation Data # of Rows:', j)
print('Validation Data equates to', round((j/n)*100),'% of our data')
print('Total rows', n)

Train Data # of Rows: 6995
Train Data equates to 70 % of our data
Validation Data # of Rows: 2999
Validation Data equates to 30 % of our data
Total rows 9994


In [167]:
# Check TRAIN class proportions
print('---TRAIN---')
y_train.value_counts()/len(y_train)

---TRAIN---


0    0.728234
1    0.271766
Name: InPlay, dtype: float64

In [168]:
# Check VALIDATION class proportions
print('---VALIDATION---')
y_validation.value_counts()/len(y_validation)

---VALIDATION---


0    0.728243
1    0.271757
Name: InPlay, dtype: float64

We have successfully split our train data into Train and Validation subsets.

#### Scaling

For this assignment, we will use a Standard Scaler, as it is a common default for scaling data and should help with the differing ranges of our numerical columns.
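For reference, standard scaling subtracts the training mean and divides by the training (population) standard deviation, then reuses those same two numbers on any other data. The sketch below shows that arithmetic on a handful of `Velo` values taken from the tables above. The helper `standardize` is hypothetical and stands in for what `StandardScaler` does per column.

```python
from statistics import fmean, pstdev


def standardize(train_values, other_values):
    """Scale both lists using the mean and population std of the training values only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)

    def scale(xs):
        return [(x - mu) / sigma for x in xs]

    return scale(train_values), scale(other_values)


# First few Velo values from the train and deploy previews above.
velo_train = [95.33, 94.41, 90.48, 93.04, 95.17]
velo_deploy = [94.72, 95.25]

train_scaled, deploy_scaled = standardize(velo_train, velo_deploy)
print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in deploy_scaled])
```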

In [169]:
# For Scaling Data
from sklearn.preprocessing import StandardScaler

# Instantiate
scaler = StandardScaler()

# Fit
scaler.fit(X_train)

# Transform the train, validation, and deploy data
X_scaled_train = scaler.transform(X_train)
X_scaled_validation = scaler.transform(X_validation)
X_scaled_test = scaler.transform(deploy) 

#### Logistic Regression

We will first look at running a Logistic Regression model on the scaled data.

In [170]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

my_logreg = LogisticRegression(solver='lbfgs', random_state=1, max_iter=1000)

# Fitting to the scaled train data
my_logreg.fit(X_scaled_train,y_train)

# Scoring on the scaled train and validation sets
print(f'Train Score: {my_logreg.score(X_scaled_train , y_train)}')
print(f'Validation Score: {my_logreg.score(X_scaled_validation, y_validation)}')

Train Score: 0.7279485346676198
Validation Score: 0.7275758586195399


| Model                         | Result |
|-------------------------------|--------|
| Model 1 - Logistic Regression | 73%    |

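As we can see, the baseline logistic model has an accuracy of 73%. The train and validation scores are closely aligned, which suggests the model is not overfitting. However, 73% is essentially the share of `0` values in the data (72.8%), so the model does no better than always predicting `0`.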

#### SVM

We will now look at the LinearSVC model for this data.

In [171]:
# SVC
from sklearn.svm import LinearSVC

# Instantiate
SVM_model = LinearSVC(random_state=1, max_iter=1000)

# Fit
SVM_model.fit(X_scaled_train, y_train)

# Evaluate
print(f"The Train classification accuracy is: {SVM_model.score(X_scaled_train, y_train)}")
print(f"The Validation classification accuracy is: {SVM_model.score(X_scaled_validation, y_validation)}")

The Train classification accuracy is: 0.7280914939242316
The Validation classification accuracy is: 0.7279093031010336


We can see that the SVM model performed in line with the Logistic Regression model at 73%.

We can look at tree-based models, such as Decision Trees, Random Forest, and ensemble learning tree methods, like XGBoost.

| Model                         | Result |
|-------------------------------|--------|
| Model 1 - Logistic Regression | 73%    |
| Model 2 - SVM                 | 73%    |

#### Decision Trees

We will now look at Decision Trees with our data.

In [172]:
# Decision Trees
from sklearn.tree import DecisionTreeClassifier

# Instantiate
dt = DecisionTreeClassifier(max_depth=1)

# Fit
dt.fit(X_train, y_train)

# Score
print('Train Dataset Score:',dt.score(X_train, y_train))
print('Validation Dataset Score:',dt.score(X_validation, y_validation))

Train Dataset Score: 0.7282344531808435
Validation Dataset Score: 0.7282427475825275


The Decision Tree model provided an accuracy of 73%. The speed of the model will help in tuning the model to improve overall accuracy if needed.
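One check worth making on a score like this is whether it matches what a constant prediction of `0` would achieve. The sketch below derives the validation class counts from the row count and proportions reported in the split checks above, then compares the resulting accuracy with the reported validation score. The variable names are illustrative, and the comparison is only consistent with, not proof of, the tree predicting `0` for every pitch.

```python
import math

# Validation set size and class-0 share reported in the split checks above.
n_val = 2999
n_neg = round(0.728243 * n_val)  # pitches not put in play
n_pos = n_val - n_neg            # pitches put in play

# A model that predicts 0 for every pitch gets every negative right
# and every positive wrong.
accuracy_all_zero = n_neg / n_val
recall_in_play = 0 / n_pos

reported_validation_score = 0.7282427475825275
print("All-zero accuracy:", accuracy_all_zero)
print("Matches reported score:", math.isclose(accuracy_all_zero, reported_validation_score))
print("Recall on in-play pitches:", recall_in_play)
```

Accuracy alone hides this behaviour, which is why recall on the `1` class or a probability-based metric is a useful companion.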

| Model                         | Result |
|-------------------------------|--------|
| Model 1 - Logistic Regression | 73%    |
| Model 2 - SVM                 | 73%    |
| Model 3 - Decision Trees      | 73%    |

#### Random Forest

We can also look at another tree-based model by running a Random Forest. We can set `n_estimators` to 50 for simplicity.

In [173]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings("ignore")

#Instantiate
my_random_forest = RandomForestClassifier(n_estimators=50)

#Fit
my_random_forest.fit(X_train, y_train)

RandomForestClassifier(n_estimators=50)

In [174]:
# Score on Training Data
decision_tree_scores = []
for sub_tree in my_random_forest.estimators_:
    decision_tree_scores.append(sub_tree.score(X_train, y_train))
    
print("Performance on fitted data:")
print(f"Average Decision Tree: {np.mean(decision_tree_scores)}")
print(f"Random Forest: {my_random_forest.score(X_train, y_train)}")

Performance on fitted data:
Average Decision Tree: 0.8566433166547534
Random Forest: 0.9998570407433881


In [175]:
# Score on Validation Data
decision_tree_scores = []
for sub_tree in my_random_forest.estimators_:
    decision_tree_scores.append(sub_tree.score(X_validation, y_validation))

print("Performance on Validation data:")
print(f"Average Decision Tree: {np.mean(decision_tree_scores)}")
print(f"Random Forest: {my_random_forest.score(X_validation, y_validation)}")

Performance on Validation data:
Average Decision Tree: 0.6056685561853952
Random Forest: 0.699566522174058


While the Random Forest scored almost perfectly on our training data (~100%), it scored only 70% on our validation data, which indicates overfitting.

| Model                         | Result |
|-------------------------------|--------|
| Model 1 - Logistic Regression | 73%    |
| Model 2 - SVM                 | 73%    |
| Model 3 - Decision Trees      | 73%    |
| Model 4 - Random Forest       | 70%    |

#### XGBoost

We can also look at an ensemble boosting method, such as XGBoost for the data.

In [176]:
# XGBoost
from xgboost import XGBClassifier

# Instantiate
XGB_model = XGBClassifier()

#Fit
XGB_model.fit(X_train, y_train)

print(f"XG Boost Train score: {XGB_model.score(X_train, y_train)}")
print(f"XG Boost Validation score: {XGB_model.score(X_validation, y_validation)}")

XG Boost Train score: 0.8577555396711937
XG Boost Validation score: 0.6918972990996999




| Model                         | Result |
|-------------------------------|--------|
| Model 1 - Logistic Regression | 73%    |
| Model 2 - SVM                 | 73%    |
| Model 3 - Decision Trees      | 73%    |
| Model 4 - Random Forest       | 70%    |
| Model 5 - XGBoost             | 69%    |

### Model Tuning

#### Decision Trees
We will first look at the decision tree model, and adjust the `max_depth` and `min_samples_leaf` parameters. We will also use a 5-fold cross validation.

In [177]:
# To do a cross-validated grid search
from sklearn.model_selection import GridSearchCV
# To build a pipeline
from sklearn.pipeline import Pipeline

# Decision Tree

from tempfile import mkdtemp
cachedir = mkdtemp()

# List max_depths, and min_samples_leaf:
m_depths = list(range(1, 11))
m_leafs = list(range(1, 5))

estimators = [
    ('model', DecisionTreeClassifier())
]

my_pipe = Pipeline(estimators, memory = cachedir)

param_grid = [
    # Decision Tree
    {
        'model': [DecisionTreeClassifier()],
        'model__max_depth': m_depths,
        'model__min_samples_leaf': m_leafs
    }
]

grid = GridSearchCV(my_pipe, param_grid, cv=5)

fittedgrid = grid.fit(X_train, y_train)

In [178]:
# Best estimator object
fittedgrid.best_estimator_

Pipeline(memory='/var/folders/91/mptwb_gx05g7b144ydz52tcc0000gn/T/tmpj3lvgy9w',
         steps=[('model', DecisionTreeClassifier(max_depth=1))])

In [179]:
# Best hyperparameters
fittedgrid.best_params_

{'model': DecisionTreeClassifier(max_depth=1),
 'model__max_depth': 1,
 'model__min_samples_leaf': 1}

In [180]:
# Decision Trees
# Instantiate
dt_t = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)

# Fit
dt_t.fit(X_train, y_train)

# Score
print('Train Dataset Score:',dt_t.score(X_train, y_train))
print('Validation Dataset Score:',dt_t.score(X_validation, y_validation))

Train Dataset Score: 0.7282344531808435
Validation Dataset Score: 0.7282427475825275


Optimizing the parameters for the Decision Tree did not improve our score for our Decision Tree.
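`GridSearchCV` ranks classifier candidates by accuracy unless told otherwise, which favours a stump that predicts the majority class. Since the task asks for a chance of a ball in play, one option is to rank candidates by log loss instead. The sketch below is an assumption-laden illustration on synthetic data (not the assignment data), and the parameter ranges are arbitrary.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 features and an imbalanced binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 1.0).astype(int)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [1, 2, 3, 4], "min_samples_leaf": [5, 20, 50]},
    cv=5,
    scoring="neg_log_loss",  # rank candidates by probability quality, not accuracy
)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best mean negative log loss:", grid.best_score_)
```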

| Model                           | Result |
|---------------------------------|--------|
| Model 1 - Logistic Regression   | 73%    |
| Model 2 - SVM                   | 73%    |
| Model 3 - Decision Trees        | 73%    |
| Model 4 - Random Forest         | 70%    |
| Model 5 - XGBoost               | 69%    |
| Model 6 - Decision Tree (Tuned) | 73%    |

#### Random Forest

We will also look at the Random Forest model, and adjust the parameters for `n_estimators` and `max_depth`. We will also conduct 5-fold cross validation.

In [181]:
# Random Forest

from tempfile import mkdtemp
cachedir = mkdtemp()

# List max_depths, and n_estimators:
n_est = list(range(50, 100, 10))
m_depths = list(range(1, 11))

estimators = [
    ('model', RandomForestClassifier())
]

param_grid = [
    # Random Forest
    {
        'model': [RandomForestClassifier()],
        'model__max_depth': m_depths,
        'model__n_estimators': n_est
    }
]

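# my_pipe is reused from the Decision Tree cell; the 'model' entry in param_grid swaps in the Random Forest
grid = GridSearchCV(my_pipe, param_grid, cv=5)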

fittedgrid = grid.fit(X_train, y_train)

In [182]:
# Best estimator object
fittedgrid.best_estimator_

Pipeline(memory='/var/folders/91/mptwb_gx05g7b144ydz52tcc0000gn/T/tmpj3lvgy9w',
         steps=[('model',
                 RandomForestClassifier(max_depth=6, n_estimators=60))])

In [183]:
# Best hyperparameters
fittedgrid.best_params_

{'model': RandomForestClassifier(max_depth=6, n_estimators=60),
 'model__max_depth': 6,
 'model__n_estimators': 60}

In [188]:
# Random Forest

#Instantiate
my_random_forest_t = RandomForestClassifier(n_estimators=60, max_depth=6)

#Fit
my_random_forest_t.fit(X_train, y_train)

RandomForestClassifier(max_depth=6, n_estimators=60)

In [189]:
# Score on Training Data
decision_tree_scores = []
for sub_tree in my_random_forest_t.estimators_:
    decision_tree_scores.append(sub_tree.score(X_train, y_train))
    
print("Performance on fitted data:")
print(f"Average Decision Tree: {np.mean(decision_tree_scores)}")
print(f"Random Forest: {my_random_forest_t.score(X_train, y_train)}")

Performance on fitted data:
Average Decision Tree: 0.7297116988324995
Random Forest: 0.7292351679771265


In [190]:
# Score on Validation Data
decision_tree_scores = []
for sub_tree in my_random_forest_t.estimators_:
    decision_tree_scores.append(sub_tree.score(X_validation, y_validation))

print("Performance on Validation data:")
print(f"Average Decision Tree: {np.mean(decision_tree_scores)}")
print(f"Random Forest: {my_random_forest_t.score(X_validation, y_validation)}")

Performance on Validation data:
Average Decision Tree: 0.7138268311659441
Random Forest: 0.7285761920640214


Tuning our Random Forest helped improve the score from 70% to 73%.

| Model                           | Result |
|---------------------------------|--------|
| Model 1 - Logistic Regression   | 73%    |
| Model 2 - SVM                   | 73%    |
| Model 3 - Decision Trees        | 73%    |
| Model 4 - Random Forest         | 70%    |
| Model 5 - XGBoost               | 69%    |
| Model 6 - Decision Tree (Tuned) | 73%    |
| Model 7 - Random Forest (Tuned) | 73%    |

#### XGBoost

We will now look at optimizing the parameters for the XGBoost model by adjusting `n_estimators` as well as `max_depth`.

In [191]:
# XGBoost

from tempfile import mkdtemp
cachedir = mkdtemp()

# List max_depths, and n_estimators:
n_est = list(range(1, 50, 10))
m_depths = list(range(1, 11))

estimators = [
    ('model', XGBClassifier())
]

param_grid = [
    # XGBoost
    {
        'model': [XGBClassifier()],
        'model__max_depth': m_depths,
        'model__n_estimators': n_est
    }
]

# my_pipe is reused from the Decision Tree cell; the 'model' entry in param_grid swaps in XGBoost
grid = GridSearchCV(my_pipe, param_grid, cv=5)

fittedgrid = grid.fit(X_train, y_train)

In [192]:
# Best estimator object
fittedgrid.best_estimator_

Pipeline(memory='/var/folders/91/mptwb_gx05g7b144ydz52tcc0000gn/T/tmpj3lvgy9w',
         steps=[('model',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=1, min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=1,
                               n_jobs=0, num_parallel_tree=1, random_state=0,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                               subsample=1, tree_method='exact',
                               validate_parameters=1, verbosity=None))])

In [193]:
# Best hyperparameters
fittedgrid.best_params_

{'model': XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
               colsample_bynode=None, colsample_bytree=None, gamma=None,
               gpu_id=None, importance_type='gain', interaction_constraints=None,
               learning_rate=None, max_delta_step=None, max_depth=1,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               n_estimators=1, n_jobs=None, num_parallel_tree=None,
               random_state=None, reg_alpha=None, reg_lambda=None,
               scale_pos_weight=None, subsample=None, tree_method=None,
               validate_parameters=None, verbosity=None),
 'model__max_depth': 1,
 'model__n_estimators': 1}

In [194]:
# XGBoost

# Instantiate
XGB_model_t = XGBClassifier(n_estimators=1, max_depth=1)

#Fit
XGB_model_t.fit(X_train, y_train)

print(f"XG Boost Train score: {XGB_model_t.score(X_train, y_train)}")
print(f"XG Boost Validation score: {XGB_model_t.score(X_validation, y_validation)}")

XG Boost Train score: 0.7282344531808435
XG Boost Validation score: 0.7282427475825275


The tuned XGBoost model scored 73%.

| Model                           | Result |
|---------------------------------|--------|
| Model 1 - Logistic Regression   | 73%    |
| Model 2 - SVM                   | 73%    |
| Model 3 - Decision Trees        | 73%    |
| Model 4 - Random Forest         | 70%    |
| Model 5 - XGBoost               | 69%    |
| Model 6 - Decision Tree (Tuned) | 73%    |
| Model 7 - Random Forest (Tuned) | 73%    |
| Model 8 - XGBoost (Tuned)       | 73%    |

#### Predictions on Tuned Models

In [195]:
# Generate Predictions
dt_predictions = dt_t.predict(deploy)
RF_predictions = my_random_forest_t.predict(deploy)
XGB_predictions = XGB_model_t.predict(deploy)


### Score on Predictions

In [196]:
print('Decision Tree Deploy Dataset Score:',dt_t.score(deploy, dt_predictions))
print('Random Forest Deploy Dataset Score:',my_random_forest_t.score(deploy, RF_predictions))
print('XGBoost Deploy Dataset Score:',XGB_model_t.score(deploy, XGB_predictions))

Decision Tree Deploy Dataset Score: 1.0
Random Forest Deploy Dataset Score: 1.0
XGBoost Deploy Dataset Score: 1.0


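These scores of 1.0 are expected by construction: each model is scored against its own predictions, so they say nothing about accuracy or overfitting. The deploy data has no `InPlay` labels, so model quality can only be assessed on the validation subset.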
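Since the deploy data has no `InPlay` column, any quality check has to happen on the validation subset, and for a predicted chance a probability-based metric is a natural fit. The sketch below uses the Brier score (mean squared error between predicted probability and outcome) as one such metric. This is an assumption for illustration, with class counts derived from the 2,999 validation rows and the proportions reported earlier.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Validation class counts derived from 2,999 rows at ~27.18% in play.
n_pos, n_neg = 815, 2184
y_true = [1] * n_pos + [0] * n_neg
base_rate = n_pos / len(y_true)

# Hard prediction of 0 for every pitch versus a constant base-rate probability.
hard_zero = brier_score(y_true, [0.0] * len(y_true))
constant_rate = brier_score(y_true, [base_rate] * len(y_true))

print(f"Brier score, always 0:         {hard_zero:.4f}")
print(f"Brier score, always base rate: {constant_rate:.4f}")
```

Lower is better, so a useful model should beat the constant base-rate forecast on the validation data.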

### Model Selection & Evaluation

Since all of our tuned models perform similarly, we will use our untuned XGBoost model for the sake of evaluation.

We can look at the features with the highest feature importance in the model.


In [202]:
feature_important = XGB_model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"])
data

Unnamed: 0,score
InducedVertBreak,669
HorzBreak,707
Velo,722
SpinRate,684


### Questions

Q1. Predict the chance of a pitch being put in play. Please use this model to predict the chance of each pitch in the “deploy.csv” file being put in play and return a CSV with your predictions.

In [204]:
# Create a DataFrame to store the predictions
predictions_df = pd.DataFrame({
    'Decision Tree Predictions': dt_predictions,
    'Random Forest Predictions': RF_predictions,
    'XGBoost Predictions': XGB_predictions
})

In [206]:
# Path
csv_file_path = 'model_predictions.csv'

# Save the DataFrame to a CSV file
predictions_df.to_csv(csv_file_path, index=False)

In [208]:
# verify
p = pd.read_csv('model_predictions.csv')
p.head()

Unnamed: 0,Decision Tree Predictions,Random Forest Predictions,XGBoost Predictions
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,0
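The columns above hold class labels (`0` or `1`), whereas the question asks for the chance of each pitch being put in play. A minimal sketch of how probabilities could be exported instead is shown below, assuming a scikit-learn classifier with `predict_proba`. The data is synthetic, and the column names `pitch_index` and `InPlayProbability` are hypothetical.

```python
import csv
import io

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the four pitch features (not the assignment data).
rng = np.random.default_rng(1)
X_fit = rng.normal(size=(500, 4))
y_fit = (X_fit[:, 0] + rng.normal(size=500) > 0.6).astype(int)
X_new = rng.normal(size=(5, 4))

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# predict_proba columns follow model.classes_, so look up the column for class 1.
col = list(model.classes_).index(1)
in_play_prob = model.predict_proba(X_new)[:, col]

# Write one probability per pitch; an in-memory buffer stands in for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["pitch_index", "InPlayProbability"])
for i, p in enumerate(in_play_prob):
    writer.writerow([i, round(float(p), 4)])

print(buffer.getvalue())
```

The same pattern should carry over to the tree-based models in this notebook, since they also expose `predict_proba`.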


Q2. In one paragraph, please explain your process and reasoning for any decisions you made in Question 1.

I trained multiple models to find the best one for predicting `InPlay` values. The results are summarized in the table below. I decided to include predictions from the tuned Decision Tree, Random Forest, and XGBoost models to showcase my technical ability and to give the team options to review as needed.

| Model                           | Result |
|---------------------------------|--------|
| Model 1 - Logistic Regression   | 73%    |
| Model 2 - SVM                   | 73%    |
| Model 3 - Decision Trees        | 73%    |
| Model 4 - Random Forest         | 70%    |
| Model 5 - XGBoost               | 69%    |
| Model 6 - Decision Tree (Tuned) | 73%    |
| Model 7 - Random Forest (Tuned) | 73%    |
| Model 8 - XGBoost (Tuned)       | 73%    |

Q3. In one or two sentences, please describe to the pitcher how these 4 variables affect the batter’s ability to put the ball in play. You can also include one plot or table to show to the pitcher if you think it would help.

In [209]:
# Graph Top Features
rslt_df = data.sort_values(by = 'score', ascending = False).head(20)
fig = px.bar(rslt_df)
fig.update_layout( title='Top Features by Feature Importance',
                  xaxis_title='Feature', yaxis_title='Score', showlegend=False)
fig.show()
# export Graph
fig.write_html(f"visualizations/Top_Features.html")

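The graph above shows how often the untuned XGBoost model split on each variable (`importance_type='weight'`), which is one indication of importance that would be worthwhile to know as a pitcher. In this case `Velo` and `HorzBreak` are the top 2 variables, although all four scores are close (669 to 722), and split counts alone do not show whether a higher value makes a ball in play more or less likely.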
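To describe the direction of an effect to the pitcher, a simple table of in-play rate per bucket of a variable may be easier to read than an importance score. The sketch below shows the idea with a hypothetical helper `in_play_rate_by_bin` and made-up values. The numbers are for illustration only and do not reflect the actual relationship in the data.

```python
def in_play_rate_by_bin(values, in_play, edges):
    """Return (bin label, pitch count, in-play rate) for each half-open bin [lo, hi)."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hits = [y for v, y in zip(values, in_play) if lo <= v < hi]
        rate = sum(hits) / len(hits) if hits else None
        rows.append((f"{lo}-{hi}", len(hits), rate))
    return rows


# Made-up velocities (mph) and outcomes, purely to exercise the helper.
velo = [91.2, 92.8, 93.5, 94.1, 95.0, 96.3, 97.4, 98.1]
in_play = [1, 1, 0, 1, 0, 0, 1, 0]

table = in_play_rate_by_bin(velo, in_play, edges=[90, 93, 96, 99])
for label, count, rate in table:
    print(label, count, rate)
```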

Q4. In one or two sentences, please describe what you would see as the next steps with your model and/or results if you were in the analyst role and had another week to work on the question posed by the pitcher.

If I had another week to work on the data, I would look to improve my models and overall predictions. Further exploration of the data would help in understanding each of the variables, as would further data cleaning to remove any potential outliers that may be impacting model scores and our predictions.