# Executive Exploratory Data Analysis

## Georgia Tech MSA Spring 2026 Practicum

**Team:** XOH - SOOMI OH, YOO MI OH  
**Date:** February 2026  

## 1. Executive Summary

### Key Findings:

1. **Dataset Scale & Quality:** The StatsBomb dataset contains **3,464 matches** (1958-2025) with **12.2M events** across 21 competitions. Data quality is high: 99.25% location coverage, zero duplicate IDs, and a comprehensive taxonomy of 35 event types. Core tactical events such as Pass, Carry, and Pressure represent 67% of all events, which suits event-based metric calculation.

2. **Women's Football Focus:** **504 women's matches** (14.5% of the dataset) provide a robust foundation: **326 FA WSL matches** for the club baseline and **178 tournament matches** (62 UEFA Women's Euro, 116 FIFA Women's World Cup). The higher average of 3.00 goals per match (vs. 2.83 in men's matches) may indicate more attacking play.

3. **Temporal Concentration Challenge:** 53.7% of the dataset is concentrated in 2015-2016, which biases it toward mid-2010s tactics. **Solution:** We focus exclusively on 2018-2025 women's data to capture modern tactical evolution and avoid this historical bias.

4. **Event Richness for Tactical Analysis:** Shot analysis shows a median xG of 0.055, with 72.3% of shots being low-quality chances, which suggests the dataset captures the full spectrum from speculative shots to big chances. Pass completion (77.7%) and pressure data (20.9% under pressure) support possession and defensive metrics.

5. **Dashboard-Ready Data Structure:** In the lineups data, 32.5% of entries record actual playing time (the remaining 67.5% are unused bench players), which enables minute-adjusted player-level metrics. With 141 countries represented and 10,803 unique players tracked, the data is diverse enough for archetype clustering and comparative analysis.

### Track Selection: **Track 2 - Deployed Soccer Analytics Dashboard**

**Focus:** Tactical archetype identification and club-to-tournament translation patterns in women's football using 12 advanced event-based metrics.

---

## 2. Data Retrieval

### Data Sources

We use the official StatsBomb dataset provided via the capstone template repository. Data was downloaded using `download_data.py` following repository instructions.

**Primary Files:**
- `matches.parquet` - Match-level metadata (3,464 matches)
- `events.parquet` - Event-level data (12.2M events, 17.8 GB)
- `lineups.parquet` - Player participation data (165K entries)
- `reference.parquet` - Lookup tables (teams, players, positions)
- `three_sixty.parquet` - 360° tracking data (9.3% coverage)

### Focus Competitions for Analysis

**Club Baseline:**
- FA Women's Super League: 326 matches (2018-2021)
  - 3 seasons: 2018/19 (108), 2019/20 (87), 2020/21 (131)

**Tournament Data:**
- UEFA Women's Euro: 62 matches (2022, 2025)
- FIFA Women's World Cup: 116 matches (2019, 2023)
- **Total tournaments:** 178 matches

**Total Dataset for Analysis:** 504 women's football matches
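
The match counts above can be cross-checked with simple arithmetic. The sketch below hard-codes the figures quoted in this report rather than reading the parquet files, and the variable names are illustrative only.

```python
# Counts copied from this report; nothing here is read from the data files.
wsl_by_season = {"2018/19": 108, "2019/20": 87, "2020/21": 131}
tournaments = {"UEFA Women's Euro": 62, "FIFA Women's World Cup": 116}
total_matches = 3464  # full StatsBomb dataset, from the Executive Summary

wsl_total = sum(wsl_by_season.values())
tournament_total = sum(tournaments.values())
womens_total = wsl_total + tournament_total
womens_share = womens_total / total_matches

print(f"FA WSL matches:     {wsl_total}")
print(f"Tournament matches: {tournament_total}")
print(f"Women's matches:    {womens_total}")
print(f"Share of dataset:   {womens_share:.1%}")
```

Running the same sums against `matches.parquet` would be the stronger check; this version only confirms that the reported figures are internally consistent.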

### Preprocessing Pipeline

1. **Event Filtering:** Focus on tactical event types (Pass, Pressure, Carry, Shot, Duel)
2. **Metric Calculation:** 12 tactical metrics computed via custom pipeline (see Section 4)
3. **Aggregation:** Player-level metrics → Team-level metrics → Season-level profiles
4. **Quality Checks:** Validated 99.25% location coverage, zero duplicate IDs
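
Steps 1 and 4 of the pipeline can be sketched on a toy table as follows. This is a minimal illustration, not the project's actual pipeline: the column names `id`, `type`, and `location` are assumptions about the schema of `events.parquet`, and the five rows are made up for the example.

```python
import pandas as pd

# Toy stand-in for events.parquet. Column names are assumed, not confirmed.
events_demo = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5"],
    "type": ["Pass", "Pressure", "Ball Receipt*", "Shot", "Carry"],
    "location": [[60.0, 40.0], [35.5, 20.0], None, [108.0, 38.0], [50.0, 10.0]],
})

# Event types listed in step 1 of the pipeline.
TACTICAL_TYPES = ["Pass", "Pressure", "Carry", "Shot", "Duel"]

# Step 1: keep only the tactical event types.
tactical = events_demo[events_demo["type"].isin(TACTICAL_TYPES)]

# Step 4: quality checks on the unfiltered table.
location_coverage = events_demo["location"].notna().mean()
duplicate_ids = int(events_demo["id"].duplicated().sum())

print(f"Tactical events kept: {len(tactical)} of {len(events_demo)}")
print(f"Location coverage:    {location_coverage:.2%}")
print(f"Duplicate IDs:        {duplicate_ids}")
```

On the real data, the same two checks would be expected to reproduce the 99.25% coverage and zero duplicates reported above, provided the column names match.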

### Assumptions & Limitations

**Assumptions:**
- StatsBomb event definitions are consistent across competitions
- Missing player positions (20.6% of records) can be inferred from event patterns
- Match context (home/away, competition importance) does not fundamentally alter tactical metrics

**Limitations:**
- **Temporal coverage:** Men's data heavily weighted to 2015-2016; women's data more recent (2018-2025)
- **360 data:** Only 9.3% of matches have 360° tracking data, which limits the packing metric to a supplementary role
- **Playing time:** 67.5% of lineup entries are unused bench players (filtered in analysis)
- **Selection bias:** StatsBomb focuses on top leagues/tournaments
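
The playing-time filter mentioned above could look roughly like the sketch below. It assumes a hypothetical `minutes_played` column; the real `lineups.parquet` may encode playing time differently (for example as start and end timestamps), so treat this as an illustration of the idea rather than the filter used in the analysis.

```python
import pandas as pd

# Toy stand-in for lineups.parquet. "minutes_played" is a hypothetical column.
lineups_demo = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "player_id": [10, 11, 12, 13],
    "minutes_played": [90.0, 62.0, 0.0, None],
})

# Treat missing or zero minutes as an unused bench entry.
played_mask = lineups_demo["minutes_played"].fillna(0) > 0
played = lineups_demo[played_mask]
bench_share = 1 - played_mask.mean()

print(f"Entries with playing time: {len(played)}")
print(f"Unused bench share:        {bench_share:.1%}")
```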

### Note on Dataset Composition

The full StatsBomb dataset contains **2,960 men's matches** (85.5%) and **504 women's matches** (14.5%). While men's data offers rich comparative potential for future analysis, we focus exclusively on women's football for this deliverable to enable:

1. Deep tactical profiling within a cohesive competitive ecosystem
2. Novel insights in an under-analyzed domain
3. Manageable scope for dashboard development within project timeline

The men's dataset remains available for comparative analysis in the final report. For the complete men's football analysis, see **`EDA.ipynb` Section 2.5**.
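
A split like the one described in this note can be computed with a single `value_counts` call. The sketch below uses a toy table and assumes a `competition_gender` column with values `"male"` and `"female"`; both the column name and its values are assumptions to verify against the actual `matches.parquet` schema.

```python
import pandas as pd

# Toy stand-in for matches.parquet. Column name and values are assumed.
matches_demo = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "competition_gender": ["male", "male", "female", "male"],
})

# Share of matches by gender, as fractions that sum to 1.
gender_share = matches_demo["competition_gender"].value_counts(normalize=True)

# Subset used for the women's-only analysis.
womens_matches = matches_demo[matches_demo["competition_gender"] == "female"]

print(gender_share)
print(f"Women's matches: {len(womens_matches)}")
```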

---

## 3. General Dataset Overview

This section provides a high-level understanding of the StatsBomb dataset before diving into tactical analysis.

**Detailed analysis:** See `EDA.ipynb` for:
- Complete schema inspection (Sections 2.1, 3.1, 4.1)
- Full competition breakdown (Section 2.3)
- Temporal pattern analysis (Section 2.4)
- Men's vs Women's comparative statistics (Sections 2.5-2.6)
- Comprehensive event type analysis (Section 3.3)
- Shot, pass, and defensive event deep dives (Sections 3.6-3.8)
- Complete lineups analysis (Section 4)

```python
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

DATA_DIR = Path("..") / "data" / "Statsbomb"

# Load data
matches = pd.read_parquet(DATA_DIR / "matches.parquet")
events = pd.read_parquet(DATA_DIR / "events.parquet")
lineups = pd.read_parquet(DATA_DIR / "lineups.parquet")

print("Datasets loaded successfully!")
```

```
Datasets loaded successfully!
```
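
Because `events.parquet` is listed at 17.8 GB, loading every column may not fit in memory on a smaller machine. One option is the `columns` argument of `pd.read_parquet`, which reads only the named columns. The sketch below is a possible alternative, not the loading code used in this notebook, and the column names in `EVENT_COLUMNS` are hypothetical placeholders.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("..") / "data" / "Statsbomb"
EVENTS_PATH = DATA_DIR / "events.parquet"

# Hypothetical column names; replace with names from the real schema.
EVENT_COLUMNS = ["id", "match_id", "type", "location"]


def load_event_subset(path, columns):
    """Read only the requested columns from a parquet file."""
    return pd.read_parquet(path, columns=columns)


# Guarded so the sketch also runs where the data has not been downloaded.
if EVENTS_PATH.exists():
    events_small = load_event_subset(EVENTS_PATH, EVENT_COLUMNS)
    print(events_small.shape)
else:
    print(f"{EVENTS_PATH} not found; skipping load")
```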
