# Exploratory Data Analysis

**Team:** XOH - SOOMI OH, YOO MI OH

**Date:** February 2026  

---

## 1. Executive Summary

### Key Findings

- **Recent tournament data is available and usable:** StatsBomb covers 466 international tournament matches from 2022–2025 (2022 World Cup, 2024 Euros, 2024 Copa América, 2025 Women's Euro), giving us modern, tactically relevant data directly preceding the 2026 cycle.
- **All 8 tactical dimensions are fully computable:** Every metric in our team profiling framework (PPDA, Field Tilt, Possession %, Progression Ratio, xG Buildup, Defensive Line Height, xG Totals, and EPR) maps cleanly to available event types with no data gaps.
- **Player trajectories are trackable across time:** 10,808 unique players appear across multiple seasons, enabling decay-weighted quality scoring and forward projection to 2026.
- **Tactical diversity is real and separable:** Preliminary heatmap analysis confirms meaningful variation across all 8 dimensions, which supports K-means clustering into 6–8 archetypes.
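
As one illustration of how a tactical dimension maps to event types, the sketch below computes PPDA (passes per defensive action) from a flat list of events. It is a minimal sketch under stated assumptions, not the project's implementation: the record keys (`team`, `type`, `x`), the set of defensive action types, and the 60% pressing zone are all assumptions chosen for illustration. The 120-unit pitch length follows StatsBomb's coordinate system, where locations are recorded from the acting team's attacking perspective.

```python
from typing import Iterable, Mapping

# Assumption: event types counted as defensive actions for PPDA.
DEFENSIVE_ACTIONS = {"Duel", "Interception", "Foul Committed"}

PITCH_LENGTH = 120.0      # StatsBomb pitch length in coordinate units
OPP_PASS_MAX_X = 72.0     # assumption: opponent's first 60% of the pitch
DEF_ACTION_MIN_X = PITCH_LENGTH - OPP_PASS_MAX_X  # same zone in the presser's frame


def ppda(events: Iterable[Mapping], team: str) -> float:
    """Opponent passes in the pressing zone per defensive action by `team`.

    Lower values suggest a more intense press. Returns infinity when the
    team made no defensive actions in the zone.
    """
    opponent_passes = 0
    defensive_actions = 0
    for ev in events:
        if ev["team"] != team and ev["type"] == "Pass" and ev["x"] <= OPP_PASS_MAX_X:
            opponent_passes += 1
        elif (
            ev["team"] == team
            and ev["type"] in DEFENSIVE_ACTIONS
            and ev["x"] >= DEF_ACTION_MIN_X
        ):
            defensive_actions += 1
    if defensive_actions == 0:
        return float("inf")
    return opponent_passes / defensive_actions
```

The other dimensions could follow the same pattern: filter events by type and pitch zone, then aggregate per team and match.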

### Final Deliverable Path

**Track 2: Deployed Soccer Analytics Dashboard**

Our EDA confirms the dataset supports a fully interactive dashboard built on three pillars:
1. **Tactical DNA** — cluster teams into 6–8 archetypes using 8 tactical dimensions
2. **Player Quality** — compute decay-weighted scores across 10 position-specific dimensions
3. **2026 Predictions** — forecast Men's World Cup & Women's Euro outcomes using the above
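
The decay weighting in the second pillar can be sketched as an exponentially weighted average, where older matches contribute less. This is a hedged illustration only: the function name, the `(match_date, score)` input shape, and the one-year half-life are assumptions, not values taken from the project.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def decay_weighted_score(
    observations: Iterable[Tuple[date, float]],
    as_of: date,
    half_life_days: float = 365.0,  # assumption: weight halves every year
) -> Optional[float]:
    """Average per-match scores, down-weighting older matches exponentially.

    Matches dated after `as_of` are ignored. Returns None when no usable
    observations remain.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for match_date, score in observations:
        age_days = (as_of - match_date).days
        if age_days < 0:
            continue  # skip matches from the future relative to `as_of`
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * score
        weight_total += weight
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total
```

With this form, a match played exactly one half-life before `as_of` counts half as much as a match played on `as_of` itself. A shorter half-life would make the score react faster to recent form.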

> For detailed analysis, intermediate experiments, and supporting visualizations, see **`EDA.ipynb`**.

---

## 2. Data Retrieval

### Data Sources

We use the **StatsBomb Open Data** dataset, accessed via the official template repository scripts:

| Dataset | Format | Description |
|---|---|---|
| Matches | Parquet/CSV | Match metadata — competition, season, teams, score, date |
| Events | Parquet/CSV | 12.2M event-level rows — passes, shots, pressures, carries, etc. |
| Lineups | Parquet/CSV | Player-match lineups with positions |
| 360° Frames | Parquet/CSV | Freeze-frame tracking (select matches only) |

Data was downloaded by running `download_data.py` per the root `README.md`, then read locally from `data/Statsbomb/`.

### Preprocessing Steps
- No joins were performed at the EDA stage; each file was analyzed independently using the template utilities
- Event type labels were extracted from nested JSON-style fields via the `eda_starter_template.py` helper functions
- Tournament matches were filtered by competition name to isolate international data from club fixtures
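
The label extraction and competition filter described above can be sketched as follows. This does not reproduce the helpers in `eda_starter_template.py`; the function names, the `competition_name` key, and the competition name strings are assumptions for illustration and should be checked against the actual matches file.

```python
import json
from typing import Any, Iterable, List, Mapping, Optional

# Assumption: competition names as they appear in the matches file.
INTERNATIONAL_COMPETITIONS = {
    "FIFA World Cup",
    "UEFA Euro",
    "Copa America",
    "UEFA Women's Euro",
}


def extract_label(field: Any) -> Optional[str]:
    """Return the 'name' of a nested field given as a dict or a JSON string.

    Plain string labels are returned unchanged; anything else yields None.
    """
    if isinstance(field, str):
        try:
            field = json.loads(field)
        except json.JSONDecodeError:
            return field  # not JSON, so treat it as an already-flat label
    if isinstance(field, dict):
        name = field.get("name")
        return name if isinstance(name, str) else None
    return field if isinstance(field, str) else None


def filter_international(
    matches: Iterable[Mapping[str, Any]],
    competitions: Iterable[str] = INTERNATIONAL_COMPETITIONS,
) -> List[Mapping[str, Any]]:
    """Keep only matches whose competition name is in `competitions`."""
    allowed = set(competitions)
    return [m for m in matches if m.get("competition_name") in allowed]
```

Filtering on an explicit allow-list rather than a substring match keeps club competitions with similar names out of the tournament subset.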

### Assumptions & Limitations
- StatsBomb open data skews toward elite competitions, so tactical patterns from lower leagues are not represented
- 360° freeze-frame data is sparse (select matches only) and is excluded from the primary metrics
- Player projections assume stylistic consistency; major injuries, managerial changes, and positional shifts are not modeled

```python
import sys
from pathlib import Path

# Make the repository root importable from the notebook's directory.
sys.path.append(str(Path('..').resolve()))

from eda.eda_starter_template import (
    header, sub, dist, desc, top, safe_run,
    analyze_sb_matches,
    analyze_sb_events,
    analyze_sb_lineups,
    analyze_sb_360,
    STATSBOMB_DIR
)

from src.eda_functions import (
    analyze_tournament_coverage,
    analyze_metric_data_readiness,
    plot_tournament_timeline,
    plot_tactical_diversity_heatmap,
    plot_player_trajectory_concept,
)

print(f"Data directory: {STATSBOMB_DIR}")
```

```text
Data directory: /Users/yoomioh/Library/CloudStorage/OneDrive-GeorgiaInstituteofTechnology/2026 Spring/Capstone/soccer-analytics-capstone-XOh/data/Statsbomb
```


---

## 3. General Dataset Overview

### 3.1 Matches — Data Integrity & Descriptive Stats

## 4. Soccer Analytics Dashboard