# Toronto Blue Jays - Assignment
### Date: Sept 5, 2023
### By: Xenel Nazar
### Contact Info: xenel.nazar@gmail.com

### Introduction

This notebook is my submission for the Toronto Blue Jays Technical Assignment.

Two files are provided, `deploy.csv` and `training.csv`, which contain the following columns (`InPlay` appears only in `training.csv`):
- `InPlay` – A binary column indicating if the batter put the ball in play (1 = in play, 0 = not in play)
- `Velo` – The velocity of the pitch at release (in mph)
- `SpinRate` – The Spin Rate of the pitch at release (in rpm)
- `HorzBreak` – The amount of movement the pitch had in the horizontal direction (in inches)
- `InducedVertBreak` – The amount of movement (in inches) the pitch had in the vertical direction after accounting for the effects of gravity. A positive value means the pitch would move up in a gravity-free environment that still had the same air resistance.


We will then utilize the data to answer the following questions:

A right-handed pitcher is curious about how velocity, movement, and spin rates on fastballs affect the chances of batters putting the ball in play. Attached are two CSV files. Both files contain 10,000 random pitches of fastballs thrown in the strike zone by right-handed pitchers to right-handed batters (swings and takes are both included). One of them (training.csv) also includes whether the batter was able to put the ball in play.

1. Predict the chance of a pitch being put in play. Please use this model to predict the chance of each pitch in the “deploy.csv” file being put in play and return a CSV with your predictions.
2. In one paragraph, please explain your process and reasoning for any decisions you made in Question 1.
3. In one or two sentences, please describe to the pitcher how these 4 variables affect the batter’s ability to put the ball in play. You can also include one plot or table to show to the pitcher if you think it would help.
4. In one or two sentences, please describe what you would see as the next steps with your model and/or results if you were in the analyst role and had another week to work on the question posed by the pitcher.



#### Import Libraries
We first load the libraries needed for data wrangling and data cleaning, which prepare the data for the later EDA and modeling steps.

In [2]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [7]:
# import train data
train = pd.read_csv('training.csv')

In [9]:
train.head(10)

Unnamed: 0,InPlay,Velo,SpinRate,HorzBreak,InducedVertBreak
0,0,95.33,2893.0,10.68,21.33
1,0,94.41,2038.0,17.13,5.77
2,0,90.48,2183.0,6.61,15.39
3,0,93.04,2279.0,9.33,14.57
4,0,95.17,2384.0,6.99,17.62
5,0,95.0,2580.0,7.16,16.07
6,0,97.94,2376.0,12.29,18.11
7,0,95.42,2103.0,7.98,10.98
8,0,94.12,2535.0,5.68,18.59
9,0,93.23,2242.0,4.1,16.95


In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   InPlay            10000 non-null  int64  
 1   Velo              10000 non-null  float64
 2   SpinRate          9994 non-null   float64
 3   HorzBreak         10000 non-null  float64
 4   InducedVertBreak  10000 non-null  float64
dtypes: float64(4), int64(1)
memory usage: 390.8 KB


In [19]:
# import deploy data
deploy = pd.read_csv('deploy.csv')

In [20]:
deploy.head(10)

Unnamed: 0,Velo,SpinRate,HorzBreak,InducedVertBreak
0,94.72,2375.0,3.1,18.15
1,95.25,2033.0,11.26,14.5
2,92.61,2389.0,11.0,21.93
3,94.94,2360.0,6.84,18.11
4,97.42,2214.0,16.7,13.38
5,95.98,2495.0,11.25,17.12
6,94.88,1998.0,15.13,15.22
7,92.73,2049.0,1.55,18.47
8,92.39,1955.0,18.15,7.25
9,95.77,1976.0,10.04,14.56


In [21]:
deploy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Velo              10000 non-null  float64
 1   SpinRate          9987 non-null   float64
 2   HorzBreak         10000 non-null  float64
 3   InducedVertBreak  10000 non-null  float64
dtypes: float64(4)
memory usage: 312.6 KB


The deploy CSV file does not include the `InPlay` column. This is expected, since `InPlay` is the value our model will predict for these pitches.
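The expectation above can also be checked programmatically rather than by reading the two `info()` outputs. The following is a minimal sketch that uses two small stand-in frames built from rows shown earlier; `train_demo`, `deploy_demo`, and `missing_in_deploy` are illustrative names, and in the notebook the same comparison would be run on `train` and `deploy`.

```python
import pandas as pd

# Stand-in frames (illustrative); the notebook would use `train` and `deploy`.
train_demo = pd.DataFrame({
    "InPlay": [0, 0],
    "Velo": [95.33, 94.41],
    "SpinRate": [2893.0, 2038.0],
    "HorzBreak": [10.68, 17.13],
    "InducedVertBreak": [21.33, 5.77],
})
deploy_demo = pd.DataFrame({
    "Velo": [94.72, 95.25],
    "SpinRate": [2375.0, 2033.0],
    "HorzBreak": [3.1, 11.26],
    "InducedVertBreak": [18.15, 14.5],
})

# Columns present in the training frame but absent from the deploy frame.
# Only the target column is expected to show up here.
missing_in_deploy = train_demo.columns.difference(deploy_demo.columns).tolist()
print(missing_in_deploy)
```

If the list contains anything other than `InPlay`, the two files do not share the same feature columns and the model inputs would need to be reconciled first.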

#### Verify Null Values
The `info()` output for the train CSV file shows some null values in the `SpinRate` column. The deploy data also has null values in its `SpinRate` column.

In [11]:
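# count the SpinRate nulls and express them as a percentage of all rows
null_count = train['SpinRate'].isnull().sum()
null_pct = round((null_count / train.shape[0]) * 100, 2)
print(f"Total number of SpinRate null values: {null_count}")
print(f"The total number of null equates to: {null_pct} %")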

Total number of SpinRate null values: 6
The total number of null equates to: 0.06 %


For our train data, the null values make up 0.06% of all rows. Because this share is so small, we can go ahead and remove the rows that contain them.

In [12]:
# drop null values
train.dropna(inplace=True)

In [13]:
# reset index
train.reset_index(drop = True, inplace = True)

In [14]:
# verify
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   InPlay            9994 non-null   int64  
 1   Velo              9994 non-null   float64
 2   SpinRate          9994 non-null   float64
 3   HorzBreak         9994 non-null   float64
 4   InducedVertBreak  9994 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 390.5 KB


The null values have been dropped from our train data. 
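Since the same null check is repeated for the deploy data, it could be wrapped in a small helper. This is only a sketch: `null_summary` is a hypothetical function name, not part of pandas, and `demo` is a made-up frame used so the example runs on its own.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null percentage for each column of df."""
    counts = df.isnull().sum()
    pct = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"null_count": counts, "null_pct": pct})


# Made-up frame with one missing SpinRate value out of four rows.
demo = pd.DataFrame({
    "Velo": [95.33, 94.41, 90.48, 93.04],
    "SpinRate": [2893.0, np.nan, 2183.0, 2279.0],
})
summary = null_summary(demo)
print(summary)
```

In the notebook this would be called as `null_summary(train)` and `null_summary(deploy)`, which reports every column at once instead of only `SpinRate`.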

In [22]:
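# count the SpinRate nulls and express them as a percentage of all rows
null_count = deploy['SpinRate'].isnull().sum()
null_pct = round((null_count / deploy.shape[0]) * 100, 2)
print(f"Total number of SpinRate null values: {null_count}")
print(f"The total number of null equates to: {null_pct} %")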

Total number of SpinRate null values: 13
The total number of null equates to: 0.13 %


For our deploy data, the null values make up 0.13% of all rows. Because this share is so small, we can go ahead and remove the rows that contain them.

In [23]:
# drop null values
deploy.dropna(inplace=True)

In [24]:
# reset index
deploy.reset_index(drop = True, inplace = True)

In [25]:
deploy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9987 entries, 0 to 9986
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Velo              9987 non-null   float64
 1   SpinRate          9987 non-null   float64
 2   HorzBreak         9987 non-null   float64
 3   InducedVertBreak  9987 non-null   float64
dtypes: float64(4)
memory usage: 312.2 KB


Our null values in our deploy data are now removed.
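One consequence of dropping rows from the deploy data is that those 13 pitches receive no prediction, while Question 1 asks for a prediction for each pitch. A possible alternative, shown here only as a sketch and not as the approach taken in this notebook, is to fill the missing `SpinRate` values with the median from the training data so that every deploy row is kept. The frames and the names `spin_median` and `deploy_filled` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the training and deploy frames.
train_demo = pd.DataFrame({"SpinRate": [2000.0, 2200.0, 2400.0]})
deploy_demo = pd.DataFrame({
    "Velo": [94.72, 95.25, 92.61],
    "SpinRate": [2375.0, np.nan, 2389.0],
})

# Use a statistic learned from the training data only, so the deploy
# data does not influence how its own gaps are filled.
spin_median = train_demo["SpinRate"].median()

# assign() returns a new frame; the original deploy_demo is left untouched.
deploy_filled = deploy_demo.assign(
    SpinRate=deploy_demo["SpinRate"].fillna(spin_median)
)
print(deploy_filled)
```

Whether median imputation is appropriate depends on why the spin readings are missing, so this would need to be checked before relying on the filled values.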

#### Verify Duplicate Values

In [18]:
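# count fully duplicated rows and express them as a percentage of all rows
dup_count = train.duplicated().sum()
dup_pct = round((dup_count / train.shape[0]) * 100, 2)

print(f"Total number of duplicate rows: {dup_count}")
print(f"The total number of duplicate rows equates to: {dup_pct} %")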

Total number of duplicate rows: 0
The total number of duplicate rows equates to: 0.0 %


There are currently no duplicate rows in the train data.
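For reference, `duplicated()` with its default `keep='first'` flags only the later copies of a repeated row, so the count above is the number of rows that would be removed by `drop_duplicates()`. The sketch below illustrates the difference on a made-up frame; `demo`, `n_repeats`, `n_involved`, and `deduped` are illustrative names.

```python
import pandas as pd

# Made-up frame where the third row repeats the first.
demo = pd.DataFrame({
    "Velo": [95.33, 94.41, 95.33],
    "SpinRate": [2893.0, 2038.0, 2893.0],
})

# Default keep='first': only the later copy is flagged.
n_repeats = int(demo.duplicated().sum())

# keep=False: every row that has an identical twin is flagged.
n_involved = int(demo.duplicated(keep=False).sum())

# Drop the later copies and renumber the index.
deduped = demo.drop_duplicates().reset_index(drop=True)
print(n_repeats, n_involved, len(deduped))
```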

In [None]:
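# count fully duplicated rows and express them as a percentage of all rows
dup_count = deploy.duplicated().sum()
dup_pct = round((dup_count / deploy.shape[0]) * 100, 2)

print(f"Total number of duplicate rows: {dup_count}")
print(f"The total number of duplicate rows equates to: {dup_pct} %")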

In [15]:
count_unique = train['InPlay'].nunique()
print('# Number of InPlay Values:', count_unique) 
print('InPlay Values:')
train['InPlay'].unique()

# Number of InPlay Values: 2
InPlay Values:


array([0, 1])
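Knowing that `InPlay` takes two values does not show how often each occurs, which matters when choosing a model and metric for a binary target. A minimal sketch of that check follows, using a made-up series; in the notebook it would be applied to `train['InPlay']`. The names `in_play`, `counts`, and `shares` are illustrative.

```python
import pandas as pd

# Made-up target values; the notebook would use train['InPlay'].
in_play = pd.Series([0, 0, 1, 0], name="InPlay")

# Absolute count of each class.
counts = in_play.value_counts()

# Share of each class as a fraction of all rows.
shares = in_play.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```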

In [8]:
# import deploy data
deploy = pd.read_csv('deploy.csv')