# MLB Statcast + Bat Tracking Dataset (2024-2025)

Generate full pitch-by-pitch Statcast data, including bat tracking metrics, for the 2024 and 2025 seasons.

**Output:** `statcast_bat_tracking_2024_2025.csv` (~1.4M rows, 118 columns; ~2.4 GB in memory as a DataFrame)

In [1]:
# Install required packages
!pip install pybaseball -q


In [2]:
import pandas as pd
import numpy as np
from pybaseball import statcast
from datetime import date

## Fetch 2024 Season

In [3]:
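# 2024 regular season: March 20 (Seoul Series) - September 29.
# This window also spans late spring training dates; check `game_type` if only regular-season rows are wanted.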
print('=== Fetching 2024 Season ===')
df_2024 = statcast(start_dt='2024-03-20', end_dt='2024-09-29')
print(f'2024 rows: {len(df_2024):,}')
print(f'2024 bat tracking: {df_2024["bat_speed"].notna().sum():,}')

=== Fetching 2024 Season ===
This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)

2024 rows: 731,904
2024 bat tracking: 316,354
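
The warning above points out that a long fetch can lose progress if it is interrupted, and recommends `pybaseball.cache.enable()`. A complementary approach, sketched here as an assumption rather than anything the notebook does, is to split the season into calendar-month windows so that each window can be fetched and saved on its own. The helper name `month_chunks` is hypothetical and uses only the standard library.

```python
from datetime import date, timedelta


def month_chunks(start: date, end: date):
    """Yield (start, end) ISO date strings, one pair per calendar month."""
    cursor = start
    while cursor <= end:
        # Find the first day of the next month, then step back one day.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        chunk_end = min(next_month - timedelta(days=1), end)
        yield cursor.isoformat(), chunk_end.isoformat()
        cursor = next_month


chunks = list(month_chunks(date(2024, 3, 20), date(2024, 9, 29)))
for start_dt, end_dt in chunks:
    # Each pair could be passed to statcast(start_dt=..., end_dt=...)
    # and the result written to its own file before moving on.
    print(start_dt, end_dt)
```

Writing each month to disk before requesting the next one means a failure costs at most one window of work.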


## Fetch 2025 Season

In [4]:
# 2025 regular season: March 27 - September 28 (the March 18-19 Tokyo Series falls outside this window)
print('\n=== Fetching 2025 Season ===')
df_2025 = statcast(start_dt='2025-03-27', end_dt='2025-09-28')
print(f'2025 rows: {len(df_2025):,}')
print(f'2025 bat tracking: {df_2025["bat_speed"].notna().sum():,}')


=== Fetching 2025 Season ===
This is a large query, it may take a moment to complete


That's a nice request you got there. It'd be a shame if something were to happen to it.
We strongly recommend that you enable caching before running this. It's as simple as `pybaseball.cache.enable()`.
Since the Statcast requests can take a *really* long time to run, if something were to happen, like: a disconnect;
gremlins; computer repair by associates of Rudy Giuliani; electromagnetic interference from metal trash cans; etc.;
you could lose a lot of progress. Enabling caching will allow you to immediately recover all the successful
subqueries if that happens.
  data_copy[column] = data_copy[column].apply(pd.to_datetime, errors='ignore', format=date_format)

2025 rows: 711,897
2025 bat tracking: 329,094
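
Because both fetches are defined by date windows, it may be worth confirming that every returned row belongs to the regular season. The sketch below uses a small synthetic frame (all values invented) and the Statcast `game_type` column, where `R` marks regular-season games and `S` marks spring training. Whether the real result contains any non-`R` rows is an assumption to check, not something the output above shows.

```python
import pandas as pd

# Tiny synthetic frame standing in for the Statcast result (values invented).
demo = pd.DataFrame({
    "game_date": ["2024-03-20", "2024-03-22", "2024-03-28", "2024-10-01"],
    "game_type": ["R", "S", "R", "F"],
    "bat_speed": [71.2, None, 68.4, 74.0],
})

# Keep regular-season pitches only.
regular = demo[demo["game_type"] == "R"].reset_index(drop=True)
print(regular["game_type"].value_counts().to_dict())
```

On the real data, running `df_2024["game_type"].value_counts()` first would show which game types are present before any filtering.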


## Combine and Save

In [5]:
# Combine both seasons
df = pd.concat([df_2024, df_2025], ignore_index=True)
print('\n=== Combined Dataset ===')
print(f'Total rows: {len(df):,}')
print(f'Total columns: {len(df.columns)}')
print(f'Bat tracking rows: {df["bat_speed"].notna().sum():,}')
print(f'Bat tracking coverage: {df["bat_speed"].notna().sum() / len(df) * 100:.1f}%')


=== Combined Dataset ===
Total rows: 1,443,801
Total columns: 118
Bat tracking rows: 645,448
Bat tracking coverage: 44.7%
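
The combined 44.7% figure can be broken down by season using only the counts printed above. This is plain arithmetic on those numbers, not a new query. Coverage is measured over all pitches, so rows without a swing most likely account for much of the missing `bat_speed`, though that explanation is an assumption here.

```python
# Row counts and non-null bat_speed counts as printed by the cells above.
counts = {
    2024: {"rows": 731_904, "tracked": 316_354},
    2025: {"rows": 711_897, "tracked": 329_094},
}

# Percentage of pitches with a bat_speed value, per season.
coverage = {
    season: round(c["tracked"] / c["rows"] * 100, 1)
    for season, c in counts.items()
}
total_rows = sum(c["rows"] for c in counts.values())
total_tracked = sum(c["tracked"] for c in counts.values())

print(coverage)
print(total_rows, total_tracked)
```

The totals match the combined output (1,443,801 rows and 645,448 bat tracking rows), and the per-season values come out to roughly 43.2% for 2024 and 46.2% for 2025.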


In [6]:
# Check memory usage
size_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f'\nMemory usage: {size_mb:.1f} MB')


Memory usage: 2430.7 MB
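
A footprint of about 2.4 GB can be tight on hosted notebooks. One common way to shrink it, shown here on synthetic data (values invented) rather than on the real frame, is to convert repetitive string columns to the pandas `category` dtype. Which Statcast columns benefit most is an assumption; `pitch_type` and `events` are used only as plausible examples.

```python
import pandas as pd

# Synthetic stand-in for repetitive string columns (values invented).
demo = pd.DataFrame({
    "pitch_type": ["FF", "SL", "CH", "FF"] * 25_000,
    "events": ["field_out", "single", None, "home_run"] * 25_000,
})

before = demo.memory_usage(deep=True).sum()

# Each distinct string is stored once and rows hold small integer codes.
compact = demo.astype({"pitch_type": "category", "events": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before / 1024**2:.1f} MB, after: {after / 1024**2:.1f} MB")
```

Note that a CSV file does not preserve the `category` dtype, so the conversion would need to be repeated after reloading.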


In [7]:
# Save to CSV
output_file = 'statcast_bat_tracking_2024_2025.csv'
print(f'\nSaving to {output_file}...')
df.to_csv(output_file, index=False)
print('Done!')
print('\nDownload the file and upload to Kaggle.')


Saving to statcast_bat_tracking_2024_2025.csv...
Done!

Download the file and upload to Kaggle.
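
Anyone loading the saved CSV later may not need all 118 columns at once. The sketch below writes a small synthetic file (values invented, hypothetical file name `demo_statcast.csv`) and reads it back with `usecols` and `chunksize`, keeping only the rows that have bat tracking. The same pattern should apply to the real export, with a much larger chunk size.

```python
import os
import tempfile

import pandas as pd

# Small synthetic file standing in for the real export (values invented).
demo = pd.DataFrame({
    "game_date": ["2024-09-29", "2024-09-29", "2025-04-01"],
    "pitch_type": ["FF", "SL", "CH"],
    "bat_speed": [80.7, None, 68.9],
    "release_speed": [95.1, 86.3, 84.0],
})
path = os.path.join(tempfile.mkdtemp(), "demo_statcast.csv")
demo.to_csv(path, index=False)

# Load only the needed columns, in chunks, keeping rows with bat tracking.
wanted = ["game_date", "bat_speed", "release_speed"]
parts = []
for chunk in pd.read_csv(path, usecols=wanted, chunksize=2):
    parts.append(chunk[chunk["bat_speed"].notna()])
swings = pd.concat(parts, ignore_index=True)
print(swings)
```

Filtering inside the loop keeps peak memory close to the size of one chunk plus the rows retained so far.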


In [8]:
# Sample data preview (in this output, player_name is the pitcher's name, not the batter's)
print('\n=== Sample Data (with bat tracking) ===')
sample_cols = ['game_date', 'pitcher', 'batter', 'player_name', 'events',
               'bat_speed', 'swing_length', 'launch_speed', 'launch_angle',
               'release_speed', 'pitch_type', 'pfx_x', 'pfx_z']
sample = df.loc[df['bat_speed'].notna(), sample_cols].head(20)
print(sample)


=== Sample Data (with bat tracking) ===
    game_date  pitcher  batter    player_name     events  bat_speed  \
0  2024-09-29   669194  444482   Nelson, Ryne  field_out       80.7   
1  2024-09-29   669194  630105   Nelson, Ryne  field_out       68.9   
2  2024-09-29   669194  630105   Nelson, Ryne        NaN       63.3   
4  2024-09-29   669194  665487   Nelson, Ryne  field_out       82.3   
5  2024-09-29   673513  664983   Matsui, Yuki  field_out       68.9   
6  2024-09-29   673513  664983   Matsui, Yuki        NaN       36.6   
8  2024-09-29   663362  553993  Waldron, Matt     single       77.2   
12 2024-09-29   663362  553993  Waldron, Matt        NaN       64.5   
13 2024-09-29   663362  545341  Waldron, Matt   home_run       76.5   
15 2024-09-29   663362  545341  Waldron, Matt        NaN       70.4   
16 2024-09-29   663362  545341  Waldron, Matt        NaN       78.0   
17 2024-09-29   663362  545341  Waldron, Matt        NaN       70.8   
20 2024-09-29   663362  545341  Wald