# Stolen Bases: A Matter of Skill or Guts?

<img src='https://img.mlbstatic.com/mlb-images/image/upload/t_2x1/t_w1536/mlb/yblyorebwvue0kwl7y0b.jpg' width='600' align='center'/>

# Business Understanding

MLB saw an increase in stolen bases in 2023. An MLB team wants to increase viewer retention and grow its fan base by adding more action to the game, and it has decided to do so through stolen bases (SB). In this project, I advise the team on **how to increase SB stats** for its players. Is it a matter of skill or guts? While the team wants to increase SB stats, it does not want that to jeopardize its wins.

I investigate the following questions:
1. Was the increase in stolen bases from 2022 to 2023 **significant**?
2. What contributes to a high number of **stolen bases**?

# Data Understanding

This data was extracted from a custom leaderboard I created on **Baseball Savant**. There are 2 datasets with the same format - [one from 2022](https://baseballsavant.mlb.com/leaderboard/custom?year=2022&type=batter&filter=&min=q&selections=r_total_caught_stealing%2Cr_total_stolen_base%2Cn_bolts%2Csprint_speed&chart=true&x=r_total_caught_stealing&y=r_total_caught_stealing&r=no&chartType=beeswarm&sort=r_total_stolen_base&sortDir=desc) and [one from 2023](https://baseballsavant.mlb.com/leaderboard/custom?year=2023&type=batter&filter=&min=q&selections=r_total_caught_stealing%2Cr_total_stolen_base%2Cn_bolts%2Csprint_speed&chart=true&x=r_total_caught_stealing&y=r_total_caught_stealing&r=no&chartType=beeswarm&sort=r_total_stolen_base&sortDir=desc). The data includes **CS** (caught stealing), **SB** (stolen base), **Bolts** and **Sprint Speed** from **players in the MLB** (Major League Baseball). The 2022 dataset has stats for 130 players and the 2023 dataset has stats for 133 players. This data has been collected from **Statcast**, "a state-of-the-art tracking technology, capable of measuring previously unquantifiable aspects of the game."([Baseball Savant](https://baseballsavant.mlb.com/about#:~:text=Where%20is%20the%20data%20from,probable%20pitchers%20for%20upcoming%20days.))

I began by importing the libraries needed for data preparation and exploratory data analysis: pandas and NumPy for data manipulation, SciPy for the statistical test, and Matplotlib and seaborn for visualization.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

%matplotlib inline

I then imported the data as pandas dataframes. The 2022 data is saved as `df_2022` and the 2023 data is saved as `df_2023`.

In [None]:
# import datasets
df_2022 = pd.read_csv('stats_2022.csv')
df_2023 = pd.read_csv('stats_2023.csv')

## Data Preparation

Since the data was extracted from a custom leaderboard I created on [Baseball Savant](https://baseballsavant.mlb.com/), there was minimal cleaning that needed to be done. Both tables follow the same format so I will be following the same cleaning steps for each.

First, I previewed the first 5 entries of each table. This gives me a general idea of how I want to clean and handle the data. I notice:
1. I want to **clean the column names**.
2. I already see **NaNs for n_bolts**. I will have to decide how to handle that.
3. Team names are not included in the dataset. I could add team rosters to a variable and feature engineer a column for teams, but I am more **focused on specific players** stealing bases, not on overall team statistics.
4. On Baseball Savant, my custom leaderboard was **sorted** by descending stolen bases, but the data did *not* import that way. I will keep that in mind as I proceed with my analysis.

In [None]:
# View the first 5 entries of 2022 data
df_2022.head()

In [None]:
# View the first 5 entries of 2023 data
df_2023.head()

In [None]:
# View the overall shape, dtypes and null counts for each column in df_2022
df_2022.info()

In [None]:
# View the overall shape, dtypes and null counts for each column in df_2023
df_2023.info()

While viewing the overall makeup of the data above, I noticed that `n_bolts` is the only column with **nulls** in either dataframe. I want to handle these, so I will investigate further.

In [None]:
# View random 20 entries where n_bolts is NaN
df_2023[df_2023['n_bolts'].isna()].sample(20)

"A Bolt is any run where the Sprint Speed (defined as "feet per second in a player's fastest one-second window") of the runner is at least 30 ft/sec." ([MLB](https://www.mlb.com/glossary/statcast/bolt)). All of the sampled rows where `n_bolts` is NaN have Sprint Speeds below 30 ft/sec. It therefore seems reasonable to assume that these players did not record any Bolts, although this is an assumption rather than a certainty. I will **change the NaNs to 0s**.

In [None]:
# Replace NaNs in N_Bolts with 0
df_2022['n_bolts'] = df_2022['n_bolts'].fillna(0)
df_2023['n_bolts'] = df_2023['n_bolts'].fillna(0)
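
The zero-fill above rests on the observation that every missing Bolt count belongs to a player with a Sprint Speed under 30 ft/sec. A minimal sketch of how that could be checked on every missing row instead of a random sample, assuming a small made-up frame (the values are hypothetical; the column names match the raw export before renaming):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw leaderboard export (made-up values)
demo = pd.DataFrame({
    "sprint_speed": [30.1, 28.0, 26.4, 27.2],
    "n_bolts": [133.0, 18.0, np.nan, np.nan],
})

# Rows whose Bolt count is missing
missing = demo[demo["n_bolts"].isna()]

# Fastest Sprint Speed among the players with a missing Bolt count
max_speed_missing = missing["sprint_speed"].max()
print(len(missing), max_speed_missing)

# Fill with 0 only if every missing row sits below the 30 ft/sec Bolt threshold
if max_speed_missing < 30:
    demo["n_bolts"] = demo["n_bolts"].fillna(0)
```

If the maximum came out at 30 or above, the missing values would deserve a closer look before being treated as zeros.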

Below are the **summary statistics** of each dataset. Although every column below is numerical, `player_id` and `year` act as categorical data, so those can be ignored. `r_total_caught_stealing`, `r_total_stolen_base`, `n_bolts`, and `sprint_speed`, however, give me a general overview of the values I expect to see in the dataset.

I do notice the difference in r_total_stolen_base means from 2022 to 2023. The 2022 mean is 8.9 and the 2023 mean is 11.75. It is a difference of 2.85, but **is that significant?** I will analyze further in the EDA. 

In [None]:
# View summary statistics of df_2022
df_2022.describe()

In [None]:
# View summary statistics of df_2023
df_2023.describe()

Below I checked for **duplicates**. I didn't think there would be any but you never know!

In [None]:
# Count fully duplicated rows in df_2022
df_2022.duplicated().sum()

In [None]:
# Count fully duplicated rows in df_2023
df_2023.duplicated().sum()

In [None]:
# Rewrite column names in title case for preference
df_2022.columns = df_2022.columns.map(lambda x: x.title())
df_2023.columns = df_2023.columns.map(lambda x: x.title())

### Feature Engineering

I create three columns that will be useful in my analysis:
1. `Total_Steal_Attempts`
2. `Stolen_Base_%`: This one is a column on Baseball Savant but was empty, therefore I am calculating it here.
3. `Is_Top_10`: This is a boolean column indicating whether the player's SB total is among the 10 highest. Two pairs of players share the same SB total, therefore there are 12 players with True in this column.

In [None]:
# Create Total_Steal_Attempts by adding R_Total_Caught_Stealing and R_Total_Stolen_Base
df_2022['Total_Steal_Attempts'] = df_2022['R_Total_Caught_Stealing'] + df_2022['R_Total_Stolen_Base']
df_2023['Total_Steal_Attempts'] = df_2023['R_Total_Caught_Stealing'] + df_2023['R_Total_Stolen_Base']

In [None]:
# Create Stolen_Base_% by dividing R_Total_Stolen_Base by Total_Steal_Attempts
df_2022['Stolen_Base_%'] = df_2022['R_Total_Stolen_Base'] / df_2022['Total_Steal_Attempts']
df_2023['Stolen_Base_%'] = df_2023['R_Total_Stolen_Base'] / df_2023['Total_Steal_Attempts']
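
One edge case in this division is a player with no steal attempts at all, where the ratio is 0 / 0. A minimal sketch of what pandas does in that case, assuming a made-up three-row frame (only the 73 SB and 14 CS in the first row correspond to figures quoted in this notebook; the rest are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a heavy base stealer, a player with no attempts, and a 50% stealer
demo = pd.DataFrame({
    "R_Total_Stolen_Base": [73, 0, 5],
    "R_Total_Caught_Stealing": [14, 0, 5],
})
demo["Total_Steal_Attempts"] = (
    demo["R_Total_Stolen_Base"] + demo["R_Total_Caught_Stealing"]
)

# 0 / 0 yields NaN in pandas, so a player with no attempts gets a missing rate, not 0%
demo["Stolen_Base_%"] = demo["R_Total_Stolen_Base"] / demo["Total_Steal_Attempts"]

# Flag the players whose rate is undefined
no_attempts = demo["Total_Steal_Attempts"].eq(0)
print(demo[["Total_Steal_Attempts", "Stolen_Base_%"]])
```

Leaving these rows as NaN keeps them out of histograms and means, which is usually preferable to recording a misleading 0% success rate.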

In [None]:
# Create Is_Top_10 boolean column indicating if SB is in the top 10
df_2022['Is_Top_10'] = df_2022['R_Total_Stolen_Base'].isin(df_2022['R_Total_Stolen_Base'].nlargest(10))
df_2023['Is_Top_10'] = df_2023['R_Total_Stolen_Base'].isin(df_2023['R_Total_Stolen_Base'].nlargest(10))
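
Because `isin` matches on value, this flag keeps every player whose SB total equals one of the ten largest values, so ties at the cutoff can produce more than 10 flagged players. A minimal sketch of that behaviour with a top-3 cutoff on made-up totals (the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical SB totals with a three-way tie at the cutoff
sb = pd.Series([50, 40, 30, 30, 30, 20], name="R_Total_Stolen_Base")

# nlargest(3) returns exactly three rows: 50, 40 and the first 30
top_values = sb.nlargest(3)

# isin compares values, so all three players tied at 30 are flagged
is_top = sb.isin(top_values)
print(int(is_top.sum()))

# A rank-based flag that should behave the same way:
# method="min" gives tied players the same (best) rank
is_top_by_rank = sb.rank(method="min", ascending=False) <= 3
```

The rank-based version makes the tie handling explicit, which can be easier to reason about when the number of flagged rows differs from the cutoff.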

I then create a new dataframe `top_10_sb` with just the top 10 SB from 2023. There are 2 SB totals that are held by two players each, therefore there are 12 entries in this dataframe. This data is to be used in the EDA to analyze the stats of the players with the top 10 SB for 2023.

In [None]:
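# Create df with the players holding the top 10 SB totals, ordered from highest to lowest
# (the imported data is not sorted by SB, so sort explicitly)
top_10_sb = (
    df_2023[df_2023['Is_Top_10']]
    .sort_values('R_Total_Stolen_Base', ascending=False)
    .reset_index(drop=True)
)

# Start the index at 1 so it reads as a position in the ranking
top_10_sb.index = top_10_sb.index + 1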

In [None]:
# Preview how the df looks now
df_2023.head()

In [None]:
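# Concatenate the two dataframes, renumbering rows so the index has no duplicates
sb_data = pd.concat([df_2022, df_2023], axis=0, ignore_index=True)

# Export new dataframe to visualize in Tableau (without the index column)
sb_data.to_csv('sb_data.csv', index=False)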

# Exploratory Data Analysis

There are two things I am looking to explore in this Exploratory Data Analysis:
1. Was the increase in stolen bases from 2022 to 2023 **significant**?
2. What contributes to a high number of **stolen bases**?

## 1. Was the increase in stolen bases from 2022 to 2023 significant?

In order to determine if the increase in stolen bases from 2022 to 2023 was significant, I am going to conduct a **two sample t-test**. This will determine if there is a significant difference between the stolen base means of 2022 and 2023, as was reported by [articles](https://www.mlb.com/news/mlb-records-3000th-stolen-base-in-2023) and fans alike.

These are the steps I will take to conduct the hypothesis test:
1. Set up null and alternative hypotheses.
2. Choose a significance level.
3. Calculate the critical t-value.
4. Calculate the t-statistic and p-value.
5. Compare t-value with critical t-value to reject or fail to reject the null hypothesis.

Before beginning the two sample t-test, I visualize each distribution as a **histogram with a KDE curve**. The two distributions differ slightly, which **warrants further investigation** to determine whether the difference is significant. Each distribution is **skewed to the right**, with the long tail coming from the few players with high SB totals.

In [None]:
# Visualize distribution plots
sns.set_theme(context='notebook', palette='bright', rc={'figure.figsize':(6,4)})
sns.histplot(df_2022['R_Total_Stolen_Base'], kde=True, stat='probability', label='2022')
sns.histplot(df_2023['R_Total_Stolen_Base'], kde=True, stat='probability', label='2023')

# Label axes and title
plt.xlabel('Total Stolen Base')
plt.ylabel('Probability')
plt.title('Total Stolen Base Distributions in 2022 and 2023')

plt.legend();
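
The right skew seen in the plot can also be checked numerically. A minimal sketch, assuming a made-up list of SB totals (hypothetical values, not the real data): a positive result from `scipy.stats.skew` and a mean that sits above the median both point to a long right tail.

```python
import numpy as np
from scipy import stats

# Hypothetical SB totals: many low values and a few very high ones
sb_demo = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 10, 15, 30, 54, 73])

# Sample skewness: positive values indicate a longer right tail
skewness = stats.skew(sb_demo)

# In a right-skewed sample the mean is pulled above the median
mean_sb = sb_demo.mean()
median_sb = np.median(sb_demo)

print(f"skewness={skewness:.2f}, mean={mean_sb:.1f}, median={median_sb:.1f}")
```

Applied to the real data, the same two checks would be run on `df_2022['R_Total_Stolen_Base']` and `df_2023['R_Total_Stolen_Base']`.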

### 1. Set up null and alternative hypotheses.

**Null Hypothesis**: The mean number of stolen bases in the MLB did not increase from 2022 to 2023.
<br>
**Alternative Hypothesis**: The mean number of stolen bases in the MLB increased from 2022 to 2023.

### 2. Choose a significance level.

I will choose the standard 5% significance level for this hypothesis test.

In [None]:
# Save 5% significance level as alpha for future use
alpha = 0.05

### 3. Calculate the critical t-value.

In [None]:
# Save n variables for critical t formula
n_2022 = len(df_2022['R_Total_Stolen_Base'])
n_2023 = len(df_2023['R_Total_Stolen_Base'])


# Calculate critical t-value at the chosen significance level for a one-tailed test
t_crit = stats.t.ppf(1 - alpha, n_2022 + n_2023 - 2)
print(f'The critical t-value is {t_crit}.')

### 4. Calculate the t-statistic and p-value.

In [None]:
# Two-sample t-test with pooled variance; ttest_ind returns a two-sided p-value by default
t_stat, p_value = stats.ttest_ind(df_2023['R_Total_Stolen_Base'], df_2022['R_Total_Stolen_Base'])
print(f'The t-statistic is {t_stat} and the p-value is {p_value}.')
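
The hypotheses in this test are one-sided, while `stats.ttest_ind` reports a two-sided p-value unless told otherwise. A minimal sketch of the one-sided version, assuming made-up Poisson samples in place of the real SB columns and a SciPy version that supports the `alternative` argument (1.6 or later). It computes the pooled t-statistic by hand and compares it with SciPy's result:

```python
import numpy as np
from scipy import stats

# Hypothetical samples standing in for the two seasons (made-up values)
rng = np.random.default_rng(42)
sb_2022 = rng.poisson(lam=9, size=130)
sb_2023 = rng.poisson(lam=12, size=133)

n1, n2 = len(sb_2023), len(sb_2022)

# Pooled variance under the equal-variance assumption of the standard two-sample t-test
pooled_var = (
    (n1 - 1) * sb_2023.var(ddof=1) + (n2 - 1) * sb_2022.var(ddof=1)
) / (n1 + n2 - 2)

# t-statistic: difference in means divided by its standard error
t_manual = (sb_2023.mean() - sb_2022.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-sided p-value: area of the t-distribution to the right of the t-statistic
p_manual = stats.t.sf(t_manual, df=n1 + n2 - 2)

# Same test via SciPy; alternative="greater" tests mean(2023) > mean(2022)
t_scipy, p_scipy = stats.ttest_ind(sb_2023, sb_2022, alternative="greater")

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

When the t-statistic is positive, the one-sided p-value is half of the two-sided one, so a result that is significant two-sided remains significant one-sided.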

### 5. Compare t-value with critical t-value to reject or fail to reject the null hypothesis.

With the following findings:
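- t-statistic of approximately 2.1, which exceeds the critical t-value of 1.65
- p-value of approximately 0.035, which is below the alpha value of 5%. This is the two-sided p-value that `stats.ttest_ind` returns by default; because the t-statistic is positive, the one-sided p-value is half of that (roughly 0.0175)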
<br>

I **reject the null hypothesis** at a significance level of 5%.

The following plot of the t-distribution marks the critical t-value in red; the rejection region extends to the right of it. The t-statistic is marked in black and falls to the right of the critical t-value, inside the rejection region.

In [None]:
fig = plt.figure(figsize=(6,4))
x_axis = np.linspace(-4,4,50)
 
# use stats.t.pdf to get values on the probability density function for the t-distribution
y_axis = stats.t.pdf(x_axis, (n_2022+n_2023-2), 0, 1)

plt.plot(x_axis, y_axis)
    
# Draw one sided boundary for critical-t
plt.axvline(t_crit, color='red', linestyle='--', lw=2, label='critical t')
plt.axvline(t_stat, color='black', linestyle='-', lw=2, label='t-statistic')
plt.legend()
plt.show()

## 2. What contributes to a high number of stolen bases?

In order to increase SB for an MLB team, I am going to analyze the data from `df_2023`. When creating my custom leaderboard, I chose stats that are widely accepted to have an effect on [SB (Stolen Base)](https://www.mlb.com/glossary/standard-stats/stolen-base): [Sprint Speed](https://www.mlb.com/glossary/statcast/sprint-speed), [Bolt](https://www.mlb.com/glossary/statcast/bolt), and SB Attempts. The relationship between SB and attempts is shown through the [SB% (Stolen Base Percentage)](https://www.mlb.com/glossary/standard-stats/stolen-base-percentage) column, which is SB divided by the total number of attempts.
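
Written out as a formula, with CS (caught stealing) making up the rest of the attempts, and using Acuña's 2023 figures quoted later in this section (73 SB in 87 attempts, which implies 14 CS) as a worked case:

$$
\text{SB\%} = \frac{\text{SB}}{\text{SB} + \text{CS}}, \qquad \frac{73}{73 + 14} = \frac{73}{87} \approx 0.839
$$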

Do players with a high number of stolen bases also have a high number of caught stealing (CS)?

In [None]:
ax = sns.scatterplot(df_2023, x='R_Total_Caught_Stealing', y='R_Total_Stolen_Base')
ax.set_xlabel('Total CS')
ax.set_ylabel('Total SB')
ax.set_title('CS and SB Correlation');
corr = np.corrcoef(df_2023['R_Total_Caught_Stealing'], df_2023['R_Total_Stolen_Base'])
corr

In [None]:
ax = sns.scatterplot(df_2023, x='Sprint_Speed', y='R_Total_Stolen_Base')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Total SB')
ax.set_title('Sprint Speed and SB Correlation');
corr = np.corrcoef(df_2023['Sprint_Speed'], df_2023['R_Total_Stolen_Base'])
corr

In [None]:
ax = sns.histplot(df_2023, x='Sprint_Speed')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Count')
ax.set_title('2023 Sprint Speed Distribution');

In [None]:
ax = sns.scatterplot(df_2023, x='N_Bolts', y='R_Total_Stolen_Base')
ax.set_xlabel('Bolts')
ax.set_ylabel('Total SB')
ax.set_title('Bolt and SB Correlation');
corr = np.corrcoef(df_2023['N_Bolts'], df_2023['R_Total_Stolen_Base'])
corr

In [None]:
ax = sns.histplot(df_2023, x='Sprint_Speed', y='R_Total_Stolen_Base')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Total SB')
ax.set_title('Sprint Speed and SB Joint Distribution');
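
The pairwise correlations computed one at a time above can also be collected in a single table. A minimal sketch, assuming a made-up seven-player frame that uses the renamed column names (all values are hypothetical): `DataFrame.corr` gives the Pearson correlation of every feature with SB, and `scipy.stats.pearsonr` adds a p-value for a single pair.

```python
import pandas as pd
from scipy import stats

# Hypothetical mini-table using the renamed column names (made-up values)
demo = pd.DataFrame({
    "R_Total_Stolen_Base": [60, 45, 30, 20, 8, 3, 1],
    "R_Total_Caught_Stealing": [12, 6, 9, 5, 3, 2, 1],
    "Sprint_Speed": [28.2, 30.0, 29.6, 28.8, 27.5, 26.9, 26.0],
    "N_Bolts": [20, 120, 90, 40, 5, 0, 0],
})

# Pearson correlation of every other column with SB, in one call
sb_corr = demo.corr()["R_Total_Stolen_Base"].drop("R_Total_Stolen_Base")
print(sb_corr.sort_values(ascending=False))

# pearsonr returns the same coefficient for one pair, plus a p-value
r, p = stats.pearsonr(demo["Sprint_Speed"], demo["R_Total_Stolen_Base"])
print(f"r={r:.3f}, p={p:.3f}")
```

Sorting the column of coefficients makes it easier to see at a glance which of CS, Sprint Speed, and Bolts moves most closely with SB.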

I am going to examine the stats of the players with the top 10 SB in 2023.

**Ronald Acuña Jr.** of the Atlanta Braves lands in first place with **73 [stolen bases](https://www.mlb.com/glossary/standard-stats/stolen-base)**. He does not have the fastest [Sprint Speed](https://www.mlb.com/glossary/statcast/sprint-speed) at 28 ft/sec, and his [Number of Bolts](https://www.mlb.com/glossary/statcast/bolt) at 18 is far from the highest, but he does lead in Total Number of Steal Attempts with **87 attempts**. This leaves Acuña with an [SB% (Stolen Base Percentage)](https://www.mlb.com/glossary/standard-stats/stolen-base-percentage) of approximately 84%.

**2nd and 3rd place** in SB go to **Corbin Carroll** of the Arizona Diamondbacks and **Bobby Witt Jr.** of the Kansas City Royals, with **54 and 49 SB** respectively. Unlike Acuña, these two players hold Sprint Speeds in the top 3, at **30.1 and 30.5** ft/sec respectively. They also hold the two highest Bolt counts, at **133 and 149** respectively. Carroll has an impressive SB% of approximately 92%, while Witt has a lower SB% of approximately 77%.

In [None]:
top_10_sb

In [None]:
ax = sns.scatterplot(top_10_sb, x='Sprint_Speed', y='R_Total_Stolen_Base')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Total SB')
ax.set_title('Sprint Speed and SB Correlation');
corr = np.corrcoef(top_10_sb['Sprint_Speed'], top_10_sb['R_Total_Stolen_Base'])
corr

In [None]:
ax = sns.histplot(top_10_sb, x='Sprint_Speed')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Count')
ax.set_title('Top 10 SB: Sprint Speed Distribution');

In [None]:
ax = sns.histplot(df_2023, x='Stolen_Base_%', label='All Qualified Players')
ax = sns.histplot(top_10_sb, x='Stolen_Base_%', label='Top 10')
ax.set_xlabel('Stolen Base %')
ax.set_ylabel('Count')
ax.set_title('2023 Stolen Base %')

ax.legend();

In [None]:
ax = sns.histplot(df_2023, x='Sprint_Speed', label='All Qualified Players')
ax = sns.histplot(top_10_sb, x='Sprint_Speed', label='Top 10')
ax.set_xlabel('Sprint Speed')
ax.set_ylabel('Count')
ax.set_title('2023 Sprint Speeds')

ax.legend();

# Conclusions

This analysis has led me to the following two conclusions:
1. At a 5% significance level, I reject the null hypothesis in favor of the alternative hypothesis. Therefore, the increase in stolen bases from 2022 to 2023 ***was*** significant. Using a two-sample t-test, I found the p-value to be less than the alpha level and the t-statistic to be more extreme than the critical t-value. The numbers back the hype.
2. 

## Limitations

1. I used a rather **small dataset**. I analyzed the SB numbers based on the 2023 data, which has data from only 133 players. 

## Recommendations

1. Choose the players with **top Sprint Speeds** to focus on increasing SB attempts. While this seems intuitive, these players will have to **increase their willingness to take risks**. This comes more easily to some personalities than to others, so the coach will have to drill stolen base practice and expect caught stealing to increase. The goal is to retain more viewers and attract more fans by increasing the action in the game, so this is a risk the team can be willing to take.
2. 

## Next Steps

1. I would like to **repeat this analysis** with a **larger dataset**, using all of the historical data available. This would include analyzing SB across all years and digging into the increases and decreases. It would also include comparing stats from players with top SB across all years to investigate any trends. I believe this could give me further insight into SB statistics.
2. 