# WBC 2026 Scouting — Statcast EDA

Exploratory analysis of MLB Statcast data for players selected to the **WBC 2026 (World Baseball Classic)** rosters.

This dataset covers pitch-by-pitch records for:
- **324,099 pitches** faced by WBC batters (18 countries)
- **217,139 pitches** thrown by WBC pitchers (14 countries)

All data reflects MLB regular season performance; WBC game data is not included.

---

**Interactive Dashboards:** Full per-player scouting dashboards (spray charts, heatmaps, pitch movement) are available at  
https://github.com/yasumorishima/wbc-scouting

---

## Contents
1. Roster Overview
2. Batting: Country Comparison
3. Batting: Exit Velocity vs xwOBA
4. Pitching: Country Comparison
5. Pitching: Velocity Distribution
6. Pitch Type Distribution

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 10

# Load datasets
rosters      = pd.read_csv('/kaggle/input/wbc-2026-scouting/rosters.csv')
bat_sum      = pd.read_csv('/kaggle/input/wbc-2026-scouting/batter_summary.csv')
pit_sum      = pd.read_csv('/kaggle/input/wbc-2026-scouting/pitcher_summary.csv')
bat_raw      = pd.read_csv('/kaggle/input/wbc-2026-scouting/statcast_batters.csv', low_memory=False)
pit_raw      = pd.read_csv('/kaggle/input/wbc-2026-scouting/statcast_pitchers.csv', low_memory=False)

print('Rosters:          ', rosters.shape)
print('Batter summary:   ', bat_sum.shape)
print('Pitcher summary:  ', pit_sum.shape)
print('Statcast batters: ', bat_raw.shape)
print('Statcast pitchers:', pit_raw.shape)

## 1. Roster Overview

In [None]:
# Players per country and role
roster_count = rosters.groupby(['country', 'pool', 'role']).size().unstack(fill_value=0).reset_index()
roster_count.columns.name = None
roster_count['total'] = roster_count.get('batter', 0) + roster_count.get('pitcher', 0)
roster_count = roster_count.sort_values('total', ascending=False)

fig, ax = plt.subplots(figsize=(10, 5))
x = range(len(roster_count))
b_vals = roster_count.get('batter', pd.Series([0]*len(roster_count))).values
p_vals = roster_count.get('pitcher', pd.Series([0]*len(roster_count))).values
ax.bar(x, b_vals, label='Batter', color='steelblue')
ax.bar(x, p_vals, bottom=b_vals, label='Pitcher', color='tomato')
ax.set_xticks(list(x))
ax.set_xticklabels(roster_count['country'].tolist(), rotation=45, ha='right')
ax.set_ylabel('Players')
ax.set_title('WBC 2026 — MLB-Affiliated Players per Country')
ax.legend()
plt.tight_layout()
plt.show()

**Roster notes:** USA and Dominican Republic have the largest contingents of MLB-affiliated players. Brazil and Czechia have no MLB-affiliated players and are not included in the Statcast data.

In [None]:
# Position distribution for batters
batters_r = rosters[rosters['role'] == 'batter'].copy()
pos_counts = batters_r['position'].value_counts()

fig, ax = plt.subplots(figsize=(8, 4))
pos_counts.plot(kind='bar', ax=ax, color='steelblue', edgecolor='white')
ax.set_title('Position Distribution — WBC 2026 Batters')
ax.set_xlabel('Position')
ax.set_ylabel('Count')
ax.tick_params(axis='x', labelrotation=0)  # keep position labels horizontal
plt.tight_layout()
plt.show()

## 2. Batting: Country Comparison

Country-level batting stats derived from MLB regular season Statcast data. Players with fewer than 50 PA are excluded from country averages to reduce small-sample noise.
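
The country figures in this section are unweighted means of per-player rates, so a 60-PA hitter counts as much as a 600-PA regular. As a rough sketch of how much that choice can matter, the toy example here compares the unweighted mean with a PA-weighted mean. The player values and country codes are invented for illustration, and a PA-weighted OPS is only an approximation of a true pooled OPS.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'country': ['AAA', 'AAA', 'BBB'],
    'PA':      [600, 60, 300],
    'OPS':     [0.900, 0.600, 0.750],
})

# Same qualification rule as the notebook: at least 50 PA
qualified = toy[toy['PA'] >= 50]

# Unweighted mean: every qualified player counts equally
unweighted = qualified.groupby('country')['OPS'].mean()

# PA-weighted mean: players with more plate appearances count more
totals = (
    qualified.assign(ops_x_pa=qualified['OPS'] * qualified['PA'])
    .groupby('country')[['ops_x_pa', 'PA']]
    .sum()
)
weighted = totals['ops_x_pa'] / totals['PA']

comparison = pd.DataFrame({'unweighted': unweighted, 'pa_weighted': weighted}).round(3)
print(comparison)
```

On the real data, the same comparison could be run on `bat_q` to see which countries are most sensitive to the weighting choice.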

In [None]:
# Country-level batting averages (min 50 PA)
bat_q = bat_sum[bat_sum['PA'] >= 50].copy()

country_bat = bat_q.groupby('country').agg(
    players=('mlbam_id', 'count'),
    PA=('PA', 'sum'),
    AVG=('AVG', 'mean'),
    OBP=('OBP', 'mean'),
    SLG=('SLG', 'mean'),
    OPS=('OPS', 'mean'),
    xwOBA=('xwOBA', 'mean'),
    K_pct=('K_pct', 'mean'),
    BB_pct=('BB_pct', 'mean'),
    avg_ev=('avg_exit_velo', 'mean'),
).round(3).reset_index().sort_values('OPS', ascending=False)

print(country_bat[['country','players','PA','AVG','OBP','SLG','OPS','xwOBA','K_pct','BB_pct','avg_ev']].to_string(index=False))

In [None]:
# Bar chart: OPS and xwOBA by country
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

cb = country_bat.sort_values('OPS', ascending=True)
axes[0].barh(cb['country'], cb['OPS'], color='steelblue')
axes[0].axvline(cb['OPS'].mean(), color='red', linestyle='--', label=f'Mean {cb["OPS"].mean():.3f}')
axes[0].set_title('OPS by Country (avg, min 50 PA players)')
axes[0].legend()

cb2 = country_bat.sort_values('xwOBA', ascending=True)
axes[1].barh(cb2['country'], cb2['xwOBA'], color='darkorange')
axes[1].axvline(cb2['xwOBA'].mean(), color='red', linestyle='--', label=f'Mean {cb2["xwOBA"].mean():.3f}')
axes[1].set_title('xwOBA by Country (avg, min 50 PA players)')
axes[1].legend()

plt.tight_layout()
plt.show()

**Batting overview:** OPS and xwOBA provide complementary views of offensive production: OPS measures actual results, while xwOBA estimates expected production from quality of contact plus strikeouts and walks. Each country value is an unweighted mean of per-player rates, so countries with only a few qualified hitters (e.g., GBR, COL) can be moved substantially by a single player.

In [None]:
# Top 20 batters by xwOBA (min 100 PA)
top_bat = bat_sum[bat_sum['PA'] >= 100].sort_values('xwOBA', ascending=False).head(20)
top_bat[['player_name','country','PA','HR','AVG','OBP','SLG','OPS','xwOBA','avg_exit_velo']]

## 3. Batting: Exit Velocity vs xwOBA

In [None]:
# Scatter: exit velocity vs xwOBA (min 100 PA)
sc = bat_sum[bat_sum['PA'] >= 100].dropna(subset=['avg_exit_velo', 'xwOBA'])

# Assign color per country
countries = sorted(sc['country'].unique())
cmap = plt.get_cmap('tab20', len(countries))  # plt.cm.get_cmap was removed in Matplotlib 3.9
c_idx = {c: i for i, c in enumerate(countries)}
colors = [cmap(c_idx[c]) for c in sc['country']]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(sc['avg_exit_velo'], sc['xwOBA'], c=colors, alpha=0.7, s=sc['PA']/10, edgecolors='none')

# Annotate top-right players
top = sc.nlargest(8, 'xwOBA')
for _, row in top.iterrows():
    ax.annotate(row['player_name'].split(',')[0], (row['avg_exit_velo'], row['xwOBA']),
                fontsize=7, ha='left', va='bottom')

# Legend for countries
patches = [mpatches.Patch(color=cmap(c_idx[c]), label=c) for c in countries]
ax.legend(handles=patches, fontsize=7, ncol=3, loc='lower right')
ax.set_xlabel('Avg Exit Velocity (mph)')
ax.set_ylabel('xwOBA')
ax.set_title('Exit Velocity vs xwOBA — WBC 2026 Batters (min 100 PA, dot size = PA)')
plt.tight_layout()
plt.show()

**Exit Velocity vs xwOBA:** Higher exit velocity is generally associated with higher xwOBA. The dot size represents plate appearances — larger dots indicate a more substantial MLB sample. Players in the upper-right quadrant may be considered among the more impactful hitters heading into WBC 2026.
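
One way to put a number on "generally associated" is a Pearson correlation between the two columns. The sketch below uses `Series.corr` on invented values; on the real data it would be applied to the filtered `sc` frame from the scatter cell. The numbers here are illustrative only and say nothing about the actual dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy_sc = pd.DataFrame({
    'avg_exit_velo': [86.0, 88.0, 90.0, 92.0],
    'xwOBA':         [0.290, 0.310, 0.345, 0.360],
})

# Pearson correlation (the default method of Series.corr)
r = toy_sc['avg_exit_velo'].corr(toy_sc['xwOBA'])
print(f'Pearson r = {r:.3f}')
```

A value near 1 would indicate a tight linear relationship. With a small number of qualified hitters, the estimate can be noisy, so it is best read alongside the scatter plot rather than on its own.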

## 4. Pitching: Country Comparison

Country-level pitching stats from MLB regular season data. Pitchers with fewer than 200 pitches are excluded from country averages.

In [None]:
# Country-level pitching averages (min 200 pitches)
pit_q = pit_sum[pit_sum['total_pitches'] >= 200].copy()

country_pit = pit_q.groupby('country').agg(
    pitchers=('mlbam_id', 'count'),
    pitches=('total_pitches', 'sum'),
    K_pct=('K_pct', 'mean'),
    BB_pct=('BB_pct', 'mean'),
    xwOBA=('xwOBA_against', 'mean'),
    avg_velo=('avg_velo', 'mean'),
    opp_AVG=('opp_AVG', 'mean'),
    opp_SLG=('opp_SLG', 'mean'),
).round(3).reset_index().sort_values('xwOBA', ascending=True)

print(country_pit.to_string(index=False))

In [None]:
# Bar charts: K% and xwOBA against by country
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

cp = country_pit.sort_values('K_pct', ascending=True)
axes[0].barh(cp['country'], cp['K_pct'], color='steelblue')
axes[0].axvline(cp['K_pct'].mean(), color='red', linestyle='--', label=f'Mean {cp["K_pct"].mean():.1f}%')
axes[0].set_title('K% by Country (avg, min 200 pitches)')
axes[0].set_xlabel('K%')
axes[0].legend()

cp2 = country_pit.sort_values('xwOBA', ascending=False)
axes[1].barh(cp2['country'], cp2['xwOBA'], color='tomato')
axes[1].axvline(cp2['xwOBA'].mean(), color='navy', linestyle='--', label=f'Mean {cp2["xwOBA"].mean():.3f}')
axes[1].set_title('xwOBA Against by Country (lower = better)')
axes[1].legend()

plt.tight_layout()
plt.show()

**Pitching overview:** K% reflects strikeout ability; xwOBA against summarizes expected offensive production allowed, combining quality of contact with strikeouts and walks. Countries with few qualified pitchers will show higher variance, since each country value is an unweighted mean of per-pitcher rates. These metrics reflect MLB regular season performance and may not translate directly to a WBC tournament context.
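
For reference, K% and BB% are conventionally strikeouts and walks divided by plate appearances. The sketch below derives both from pitch-level rows, assuming the standard Statcast `events` column (filled only on the final pitch of a plate appearance) is present in `statcast_pitchers.csv`. That is an assumption, and the summary file may have been built differently. The toy rows are invented.

```python
import pandas as pd

# Toy pitch-level rows (hypothetical); `events` is null on non-terminal pitches
pitches = pd.DataFrame({
    'pitcher': [1, 1, 1, 1, 1, 1],
    'events':  [None, 'strikeout', None, 'walk', 'single', 'strikeout'],
})

# Keep one row per completed plate appearance.
# Simplification: every non-null event is treated as a plate appearance,
# which slightly overcounts when baserunning events (e.g. pickoffs) appear.
pa = pitches.dropna(subset=['events'])

strikeouts = pa['events'].isin(['strikeout', 'strikeout_double_play'])
walks = pa['events'] == 'walk'

k_pct = strikeouts.mean() * 100   # share of plate appearances ending in a strikeout
bb_pct = walks.mean() * 100       # share of plate appearances ending in a walk
print(f'K% = {k_pct:.1f}, BB% = {bb_pct:.1f}')
```

Whether intentional walks count toward BB% varies by source, so results from a calculation like this may differ slightly from the `K_pct` and `BB_pct` columns in the summary file.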

In [None]:
# Top 20 pitchers by xwOBA against (min 500 pitches)
top_pit = pit_sum[pit_sum['total_pitches'] >= 500].sort_values('xwOBA_against', ascending=True).head(20)
top_pit[['player_name','country','total_pitches','K_pct','BB_pct','xwOBA_against','avg_velo','primary_pitch','pitch_type_count']]

## 5. Pitching: Velocity Distribution

In [None]:
# Velocity distribution by country (fastballs only: FF, SI, FC)
fb_types = ['FF', 'SI', 'FC']
fb = pit_raw[pit_raw['pitch_type'].isin(fb_types)].copy()
fb = fb.dropna(subset=['release_speed'])

# Country median velo
velo_by_country = fb.groupby('country_name')['release_speed'].median().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 5))
velo_by_country.plot(kind='bar', ax=ax, color='steelblue', edgecolor='white')
ax.axhline(velo_by_country.mean(), color='red', linestyle='--', label=f'Mean of country medians {velo_by_country.mean():.1f} mph')
ax.set_title('Median Fastball Velocity (FF/SI/FC) by Country')
ax.set_ylabel('Velocity (mph)')
ax.set_ylim(88, 96)
ax.legend()
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Fastball velocity:** Median fastball velocity (FF = Four-seam, SI = Sinker, FC = Cutter) across WBC pitching staffs. Differences between countries are relatively modest at this level of MLB competition.
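
Because cutters usually sit a few mph below four-seamers, a staff's pitch mix can shift its pooled median even when individual pitch velocities are similar. A minimal sketch of splitting the median by pitch type follows; it reuses the `country_name`, `pitch_type`, and `release_speed` column names from the cell above, with invented values and country codes.

```python
import pandas as pd

# Toy fastball rows (hypothetical values, not taken from the dataset)
fb_toy = pd.DataFrame({
    'country_name':  ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'pitch_type':    ['FF', 'FF', 'FC', 'FF', 'SI', 'SI'],
    'release_speed': [95.0, 96.0, 89.0, 94.0, 93.0, 92.0],
})

# Pooled median across all fastball types (the approach used in the chart)
pooled = fb_toy.groupby('country_name')['release_speed'].median()

# Median per pitch type; countries become rows, pitch types become columns
by_type = (
    fb_toy.groupby(['country_name', 'pitch_type'])['release_speed']
    .median()
    .unstack()
)

print(pooled)
print(by_type)
```

Cells in `by_type` are `NaN` where a country has no pitches of that type, which is itself a hint about how the pooled figure was composed.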

In [None]:
# K% vs BB% scatter for pitchers (min 500 pitches)
pit_sc = pit_sum[pit_sum['total_pitches'] >= 500].dropna(subset=['K_pct', 'BB_pct'])

countries_p = sorted(pit_sc['country'].unique())
cmap_p = plt.get_cmap('tab20', len(countries_p))  # plt.cm.get_cmap was removed in Matplotlib 3.9
c_idx_p = {c: i for i, c in enumerate(countries_p)}
colors_p = [cmap_p(c_idx_p[c]) for c in pit_sc['country']]

fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(pit_sc['BB_pct'], pit_sc['K_pct'], c=colors_p, alpha=0.7,
           s=pit_sc['total_pitches']/50, edgecolors='none')

# Annotate top K% pitchers
top_k = pit_sc.nlargest(6, 'K_pct')
for _, row in top_k.iterrows():
    ax.annotate(row['player_name'].split(',')[0], (row['BB_pct'], row['K_pct']),
                fontsize=7, ha='left', va='bottom')

patches_p = [mpatches.Patch(color=cmap_p(c_idx_p[c]), label=c) for c in countries_p]
ax.legend(handles=patches_p, fontsize=7, ncol=2, loc='upper right')
ax.set_xlabel('BB%')
ax.set_ylabel('K%')
ax.set_title('K% vs BB% — WBC 2026 Pitchers (min 500 pitches, dot size = pitch count)')
plt.tight_layout()
plt.show()

**K% vs BB%:** Upper-left quadrant (high K%, low BB%) generally indicates more effective pitching. Dot size represents total pitch count — larger dots indicate a heavier MLB workload.
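
The two axes can also be collapsed into a single number, K-BB%, a common shorthand for how far a pitcher sits toward the upper-left corner. The sketch below assumes `K_pct` and `BB_pct` are on the same scale (both percentages or both fractions); the pitcher names and values are invented.

```python
import pandas as pd

# Toy pitcher rows (hypothetical values, not taken from the dataset)
pit_toy = pd.DataFrame({
    'player_name': ['Pitcher A', 'Pitcher B', 'Pitcher C'],
    'K_pct':       [30.0, 22.0, 27.0],
    'BB_pct':      [10.0, 5.0, 6.0],
})

# K-BB%: strikeout rate minus walk rate, higher is better
pit_toy['K_minus_BB'] = pit_toy['K_pct'] - pit_toy['BB_pct']

ranked = pit_toy.sort_values('K_minus_BB', ascending=False)
print(ranked[['player_name', 'K_pct', 'BB_pct', 'K_minus_BB']].to_string(index=False))
```

On the real data, the same column could be added to `pit_sc` to rank pitchers or to color the scatter.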

## 6. Pitch Type Distribution

In [None]:
# Pitch type usage across all WBC pitchers
pt_counts = pit_raw['pitch_type'].value_counts()
pt_pct = (pt_counts / pt_counts.sum() * 100).round(1)

PITCH_NAMES = {
    'FF': 'Four-seam FB', 'SI': 'Sinker', 'FC': 'Cutter',
    'SL': 'Slider', 'CU': 'Curveball', 'KC': 'Knuckle Curve',
    'CH': 'Changeup', 'FS': 'Splitter', 'ST': 'Sweeper',
    'SV': 'Slurve', 'EP': 'Eephus', 'KN': 'Knuckleball',
    'SC': 'Screwball', 'FO': 'Forkball', 'CS': 'Slow Curve',
    'PO': 'Pitchout',
}
pt_pct.index = [PITCH_NAMES.get(p, p) for p in pt_pct.index]
pt_pct = pt_pct[pt_pct >= 0.5]  # filter out near-zero

fig, ax = plt.subplots(figsize=(10, 5))
pt_pct.plot(kind='bar', ax=ax, color='steelblue', edgecolor='white')
ax.set_title('Pitch Type Usage — All WBC 2026 Pitchers (MLB regular season)')
ax.set_ylabel('Usage %')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Pitch type usage:** Four-seam fastballs, sliders, and sinkers are the most common pitch types across all WBC staffs. The Sweeper (ST) has grown in prevalence in recent seasons.
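
The pooled distribution above is weighted toward pitchers who threw the most pitches. For a per-pitcher view, usage shares can be computed with `pd.crosstab` and `normalize='index'`. This is a sketch on invented rows, and the primary pitch it derives may not match how `primary_pitch` in `pitcher_summary.csv` was actually built.

```python
import pandas as pd

# Toy pitch-level rows (hypothetical, not taken from the dataset)
pitches = pd.DataFrame({
    'player_name': ['Pitcher A'] * 4 + ['Pitcher B'] * 4,
    'pitch_type':  ['FF', 'FF', 'SL', 'CH', 'SI', 'SI', 'SI', 'ST'],
})

# Rows are pitchers, columns are pitch types, values are usage in percent;
# normalize='index' makes each pitcher's row sum to 100
usage = pd.crosstab(
    pitches['player_name'], pitches['pitch_type'], normalize='index'
) * 100

# Most-used pitch per pitcher
primary = usage.idxmax(axis=1)

print(usage.round(1))
print(primary)
```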

In [None]:
# Primary pitch by country (most common pitch type per pitcher, then mode per country)
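# Guard against countries whose primary_pitch values are all missing:
# value_counts() would be empty there and index[0] would raise IndexError
primary_by_country = pit_sum.groupby('country')['primary_pitch'].apply(
    lambda x: x.value_counts().index[0] if x.notna().any() else 'N/A'
).reset_index()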
primary_by_country.columns = ['country', 'most_common_primary']

# Arsenal variety (avg pitch types per pitcher by country)
arsenal_by_country = pit_sum.groupby('country')['pitch_type_count'].mean().round(1).reset_index()
arsenal_by_country.columns = ['country', 'avg_pitch_types']

summary = primary_by_country.merge(arsenal_by_country, on='country').sort_values('avg_pitch_types', ascending=False)
print(summary.to_string(index=False))

## Summary

This notebook provides a high-level overview of WBC 2026 roster players using MLB Statcast data.

**Key takeaways:**
- USA and Dominican Republic have the deepest MLB-affiliated rosters
- There is meaningful variation in batting (xwOBA, exit velocity) and pitching (K%, velocity) across national teams
- These patterns may offer some context for WBC 2026, though performance in the tournament will depend on many additional factors

**Dataset files:**

| File | Rows | Description |
|---|---|---|
| `statcast_batters.csv` | 324,099 | Pitch-by-pitch batting data, 18 countries |
| `statcast_pitchers.csv` | 217,139 | Pitch-by-pitch pitching data, 14 countries |
| `batter_summary.csv` | 105 | Per-player batting summary |
| `pitcher_summary.csv` | 86 | Per-player pitching summary |
| `rosters.csv` | 309 | Full WBC 2026 roster (20 countries) |
| `stadiums.csv` | 1,002 | MLB stadium coordinates |

**Interactive dashboards:** Spray charts, zone heatmaps, and pitch movement charts for all WBC teams  
https://github.com/yasumorishima/wbc-scouting