# Stolen Bases: A Matter of Skill or Guts?

<img src='https://img.mlbstatic.com/mlb-images/image/upload/t_2x1/t_w1536/mlb/yblyorebwvue0kwl7y0b.jpg' width='600' align='center'/>

# Business Understanding

MLB saw an increase in stolen bases in 2023. One MLB team wants to improve viewer retention and grow its fan base by adding more action to the game, and it has decided to do so through stolen bases. In this project, I advise the team on **how to increase SB stats** for its players. Is it a matter of skill or guts? The team wants more stolen bases, but not at the cost of wins.

I investigate the following questions:
1. Was the increase in stolen bases from 2022 to 2023 significant?
2. What is the number of **stolen base attempts** that would **increase action** in the game ***without*** greatly **increasing the chances of a loss**?
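
One possible way to approach question 1 is a permutation test on the difference in mean SB per player between the two seasons. The sketch below is an assumption about method, not the analysis used in this project, and the two lists are made-up placeholder values rather than Baseball Savant data. The function name `permutation_test_mean_diff` is hypothetical.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for the difference in means (b minus a)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])

    # Count how often a random relabelling gives a difference at least as extreme
    extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Placeholder values for illustration only (NOT real player data)
sb_2022_example = [2, 5, 0, 11, 7, 3, 20, 1]
sb_2023_example = [4, 9, 1, 15, 8, 6, 27, 3]

observed_diff, p_value = permutation_test_mean_diff(sb_2022_example, sb_2023_example)
print(f"Observed difference in mean SB: {observed_diff:.2f}, p-value: {p_value:.3f}")
```

With the real data, the inputs would be the `r_total_stolen_base` columns of the 2022 and 2023 datasets. Because many players appear in both seasons, a paired comparison on the shared `player_id` values may be a better fit than treating the two groups as independent.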

# Data Understanding

This data was extracted from a custom leaderboard I created on **Baseball Savant**. There are 2 datasets with the same format - [one from 2022](https://baseballsavant.mlb.com/leaderboard/custom?year=2022&type=batter&filter=&min=q&selections=r_total_caught_stealing%2Cr_total_stolen_base%2Cn_bolts%2Csprint_speed&chart=true&x=r_total_caught_stealing&y=r_total_caught_stealing&r=no&chartType=beeswarm&sort=r_total_stolen_base&sortDir=desc) and [one from 2023](https://baseballsavant.mlb.com/leaderboard/custom?year=2023&type=batter&filter=&min=q&selections=r_total_caught_stealing%2Cr_total_stolen_base%2Cn_bolts%2Csprint_speed&chart=true&x=r_total_caught_stealing&y=r_total_caught_stealing&r=no&chartType=beeswarm&sort=r_total_stolen_base&sortDir=desc). The data includes **CS** (caught stealing), **SB** (stolen base), **Bolts** and **Sprint Speed** from **players in the MLB** (Major League Baseball). The 2022 dataset has stats for 130 players and the 2023 dataset has stats for 133 players. This data has been collected from **Statcast**, "a state-of-the-art tracking technology, capable of measuring previously unquantifiable aspects of the game."([Baseball Savant](https://baseballsavant.mlb.com/about#:~:text=Where%20is%20the%20data%20from,probable%20pitchers%20for%20upcoming%20days.))

I began by importing the libraries needed for data preparation and exploratory data analysis: pandas and NumPy for data manipulation, and Matplotlib and seaborn for data visualization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [14]:
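df_2022 = pd.read_csv('stats_2022.csv')
df_2023 = pd.read_csv('stats_2023.csv')
# Assumption: player_data_df (used below but never defined) is the 2023 data; the outputs show 133 rows, all from 2023
player_data_df = df_2023.copy()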

## Data Preparation

Since the data was extracted from a custom leaderboard I created on [Baseball Savant](https://baseballsavant.mlb.com/), minimal cleaning was needed.

First, I previewed the first 5 entries of the table to get a general idea of how to clean and handle the data. I noticed:
1. I want to **clean the column names**.
2. There are already **NaNs in n_bolts**. I will have to decide how to handle them.
3. Team names are not included in the dataset. I could add team rosters and feature engineer a team column, but I am more **focused on specific players** stealing bases than on overall team statistics.
4. On Baseball Savant, my custom leaderboard was **sorted** by descending stolen bases, but the data did *not* import that way. I will keep that in mind as I proceed with my analysis.

In [3]:
# View the first 5 entries of the table
player_data_df.head()

Unnamed: 0,"last_name, first_name",player_id,year,r_total_caught_stealing,r_total_stolen_base,n_bolts,sprint_speed
0,"Candelario, Jeimer",600869,2023,1,8,,27.5
1,"McMahon, Ryan",641857,2023,5,5,,25.8
2,"Muncy, Max",571970,2023,2,1,,26.9
3,"Soler, Jorge",624585,2023,0,1,,26.6
4,"Edman, Tommy",669242,2023,4,27,7.0,28.8
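
If the leaderboard order is needed later, it can be restored with a sort. The following is a minimal sketch that rebuilds the five preview rows above as a small stand-in frame (the variable names `preview` and `sorted_preview` are hypothetical) and sorts them by descending stolen bases.

```python
import pandas as pd

# Stand-in frame built from the five preview rows shown above
preview = pd.DataFrame({
    "last_name, first_name": [
        "Candelario, Jeimer", "McMahon, Ryan", "Muncy, Max",
        "Soler, Jorge", "Edman, Tommy",
    ],
    "r_total_caught_stealing": [1, 5, 2, 0, 4],
    "r_total_stolen_base": [8, 5, 1, 1, 27],
})

# Sort by stolen bases, highest first; a stable sort keeps tied players in their original order
sorted_preview = (
    preview
    .sort_values("r_total_stolen_base", ascending=False, kind="stable")
    .reset_index(drop=True)
)
print(sorted_preview)
```

The same `sort_values` call should work on the full DataFrame, assuming the column name is unchanged at that point in the notebook.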


In [4]:
# View the overall shape, dtypes and null counts for each column
player_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   last_name, first_name    133 non-null    object 
 1   player_id                133 non-null    int64  
 2   year                     133 non-null    int64  
 3   r_total_caught_stealing  133 non-null    int64  
 4   r_total_stolen_base      133 non-null    int64  
 5   n_bolts                  50 non-null     float64
 6   sprint_speed             133 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 7.4+ KB


Viewing the overall makeup of the data, I noticed that n_bolts is the only column with nulls (50 non-null values out of 133). I will investigate these further before deciding how to handle them.

In [5]:
# View random 20 entries where n_bolts is NaN
player_data_df[player_data_df['n_bolts'].isna()].sample(20)

Unnamed: 0,"last_name, first_name",player_id,year,r_total_caught_stealing,r_total_stolen_base,n_bolts,sprint_speed
14,"Perez, Salvador",521692,2023,0,0,,24.3
129,"Machado, Manny",592518,2023,2,3,,26.3
2,"Muncy, Max",571970,2023,2,1,,26.9
43,"Tucker, Kyle",663656,2023,5,30,,26.6
41,"Bogaerts, Xander",593428,2023,2,19,,27.6
65,"Canha, Mark",592192,2023,1,11,,27.8
118,"Ramírez, José",608070,2023,6,28,,27.8
102,"Jung, Josh",673962,2023,3,1,,26.8
19,"Torres, Gleyber",650402,2023,6,13,,26.4
93,"France, Ty",664034,2023,0,1,,25.0


"A Bolt is any run where the Sprint Speed (defined as 'feet per second in a player's fastest one-second window') of the runner is at least 30 ft/sec." ([MLB](https://www.mlb.com/glossary/statcast/bolt)). Every sampled row where n_bolts is NaN has a Sprint Speed below 30, so it seems safe to assume that these players did not record any bolts. Sprint Speed in this table is a season-level figure, so this is supporting evidence rather than proof: Edman, Tommy has 7 bolts at a Sprint Speed of 28.8. I will change the NaNs to 0s.

In [6]:
# Replace NaNs in n_bolts with 0
player_data_df['n_bolts'] = player_data_df['n_bolts'].fillna(0)
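
The assumption behind this fill can be checked across every row with a missing value rather than a random sample. The sketch below uses a small stand-in frame built from rows shown in this notebook. The names `sample`, `BOLT_THRESHOLD`, `all_below` and `zero_never_recorded` are hypothetical, and the second check is an added suggestion rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from rows shown in this notebook (four NaN rows and one recorded value)
sample = pd.DataFrame({
    "last_name, first_name": [
        "Candelario, Jeimer", "McMahon, Ryan", "Muncy, Max",
        "Soler, Jorge", "Edman, Tommy",
    ],
    "n_bolts": [np.nan, np.nan, np.nan, np.nan, 7.0],
    "sprint_speed": [27.5, 25.8, 26.9, 26.6, 28.8],
})

BOLT_THRESHOLD = 30.0  # ft/sec, from the MLB definition of a Bolt
missing = sample["n_bolts"].isna()

# Check 1: every row with a missing n_bolts has a sprint speed below the threshold
all_below = bool((sample.loc[missing, "sprint_speed"] < BOLT_THRESHOLD).all())

# Check 2: if 0 is never recorded explicitly, a blank plausibly stands for "no bolts"
zero_never_recorded = bool((sample.loc[~missing, "n_bolts"] > 0).all())

# Only fill when both checks support the assumption
if all_below and zero_never_recorded:
    sample["n_bolts"] = sample["n_bolts"].fillna(0)

print(all_below, zero_never_recorded)
print(sample)
```

Neither check proves that a blank means zero, but if either one fails on the full dataset, the fill deserves a second look.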

The summary statistics of the dataset are shown below. Although every column in the summary is numeric, player_id and year are really categorical identifiers, so their statistics can be ignored. r_total_caught_stealing, r_total_stolen_base, n_bolts, and sprint_speed, however, give me a general overview of the values I expect to see in the dataset.

In [7]:
# View summary statistics of the dataset.
player_data_df.describe()

Unnamed: 0,player_id,year,r_total_caught_stealing,r_total_stolen_base,n_bolts,sprint_speed
count,133.0,133.0,133.0,133.0,133.0,133.0
mean,633932.578947,2023.0,2.796992,11.75188,7.541353,27.395489
std,49936.834584,0.0,2.85975,12.585108,22.734139,1.242301
min,457759.0,2023.0,0.0,0.0,0.0,24.3
25%,605204.0,2023.0,1.0,3.0,0.0,26.4
50%,656305.0,2023.0,2.0,8.0,0.0,27.4
75%,666969.0,2023.0,4.0,16.0,2.0,28.3
max,807799.0,2023.0,15.0,73.0,149.0,30.5
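
Since the player_id and year statistics carry no meaning, the summary can be restricted to the four stat columns. This is a minimal sketch on a stand-in frame built from the first five rows of the data, with n_bolts already filled. The names `stats` and `stat_columns` are hypothetical.

```python
import pandas as pd

# Stand-in frame built from the first five rows of the data (n_bolts already filled with 0)
stats = pd.DataFrame({
    "player_id": [600869, 641857, 571970, 624585, 669242],
    "year": [2023, 2023, 2023, 2023, 2023],
    "r_total_caught_stealing": [1, 5, 2, 0, 4],
    "r_total_stolen_base": [8, 5, 1, 1, 27],
    "n_bolts": [0.0, 0.0, 0.0, 0.0, 7.0],
    "sprint_speed": [27.5, 25.8, 26.9, 26.6, 28.8],
})

# Summarize only the columns that are true measurements
stat_columns = [
    "r_total_caught_stealing",
    "r_total_stolen_base",
    "n_bolts",
    "sprint_speed",
]
summary = stats[stat_columns].describe()
print(summary)
```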


Below I checked for duplicates. I didn't think there would be any but you never know!

In [8]:
# Count fully duplicated rows
len(player_data_df[player_data_df.duplicated()])

0

In [11]:
# Rewrite column names in title case (personal preference)
player_data_df.columns = player_data_df.columns.str.title()

In [10]:
player_data_df

Unnamed: 0,"Last_Name, First_Name",Player_Id,Year,R_Total_Caught_Stealing,R_Total_Stolen_Base,N_Bolts,Sprint_Speed
0,"Candelario, Jeimer",600869,2023,1,8,0.0,27.5
1,"McMahon, Ryan",641857,2023,5,5,0.0,25.8
2,"Muncy, Max",571970,2023,2,1,0.0,26.9
3,"Soler, Jorge",624585,2023,0,1,0.0,26.6
4,"Edman, Tommy",669242,2023,4,27,7.0,28.8
...,...,...,...,...,...,...,...
128,"Benintendi, Andrew",643217,2023,2,13,0.0,27.3
129,"Machado, Manny",592518,2023,2,3,0.0,26.3
130,"Meneses, Joey",608841,2023,0,0,0.0,25.6
131,"Arcia, Orlando",606115,2023,0,1,0.0,25.9
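
Question 2 is framed in terms of stolen base attempts, which is not a column in the data. An attempt ends in either a stolen base or a caught stealing, so attempts can be derived as SB + CS. The sketch below uses a stand-in frame built from rows in the table above. The column names `Sb_Attempts` and `Sb_Success_Rate` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from rows shown in the table above
players = pd.DataFrame({
    "Last_Name, First_Name": [
        "Candelario, Jeimer", "McMahon, Ryan", "Muncy, Max",
        "Soler, Jorge", "Edman, Tommy", "Meneses, Joey",
    ],
    "R_Total_Caught_Stealing": [1, 5, 2, 0, 4, 0],
    "R_Total_Stolen_Base": [8, 5, 1, 1, 27, 0],
})

# Attempts = successful steals + times caught stealing
players["Sb_Attempts"] = (
    players["R_Total_Stolen_Base"] + players["R_Total_Caught_Stealing"]
)

# Success rate = SB / attempts; players with zero attempts get NaN instead of a division error
players["Sb_Success_Rate"] = (
    players["R_Total_Stolen_Base"] / players["Sb_Attempts"].replace(0, np.nan)
)

print(players)
```

Leaving the rate as NaN for players with no attempts keeps them out of any average success rate instead of counting them as 0%.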


# Exploratory Data Analysis

# Conclusions

## Limitations

## Recommendations

## Next Steps