# Exploratory Data Analysis

**Team:** XOH - SOOMI OH, YOO MI OH  
**Date:** February 2026  

---

## 1. Executive Summary

### Key Findings:

1. **Temporal Bias Identified:** Men's football data shows **62.8% concentration in 2015/2016** (1,860/2,960 matches), creating severe tactical bias toward mid-2010s European football. Women's data (504 matches, 2018-2025) has **0% of its matches in that period**, enabling modern tactical analysis without historical bias.

2. **Gender-Specific Tactical Signals:** Women's football averages **3.00 goals/match vs 2.83 for men's** (6% higher), suggesting different tactical dynamics. Men's data is dominated by the **top 5 European leagues** (2,442 matches, 82.5%), while women's data provides a **cohesive club-tournament ecosystem** (326 FA WSL + 178 tournament matches).

3. **Production-Ready Quality:** **99.25% location coverage** across 12.2M events, **zero duplicate IDs**, realistic distributions (77.7% pass completion, median xG 0.055). The data is ready for the 12-metric framework without additional cleaning. Tactical event types (Pass, Carry, Pressure) make up 67% of events, providing rich signal for archetype clustering.

4. **Strategic Focus on Women's Football:** The 504 matches are structured as **326 FA WSL (club baseline)** + **178 tournament matches (Euro + World Cup)**, which suits a club→tournament tactical translation study. Modern coverage (2018-2025) captures current tactical evolution, and the domain is under-analyzed, leaving room for novel contributions.

5. **Men's Football for Comparative Analysis:** Men's tournament data is comprehensive (333 matches: World Cup 147, Euro 102, Copa America 32, African Cup 52) and enables future comparative analysis: "Do men's and women's teams exhibit the same tactical compression from club→tournament?" Hypothesis: archetype translation patterns differ by gender.


### Track Selection: **Track 2 - Deployed Soccer Analytics Dashboard**

**Focus:** Tactical archetype identification and club-to-tournament translation patterns in women's and men's football, and cross-gender comparative analysis.

**Deliverable:** Interactive dashboard enabling tactical profiling across 12 advanced metrics (Possession, Progression, Defensive dimensions).

---

## 2. Data Retrieval

### Sources
**StatsBomb Open Data** 
- We use the official StatsBomb dataset provided via the capstone template repository. Data was downloaded using `download_data.py` following repository instructions.
- 5 datasets: matches, events, lineups, reference, three_sixty

**Primary Files:**
- `matches.parquet` - Match-level metadata (3,464 matches)
- `events.parquet` - Event-level data (12.2M events, 17.8GB)
- `lineups.parquet` - Player participation data (165K entries)
- `reference.parquet` - Lookup tables (teams, players, positions)
- `three_sixty.parquet` - 360° tracking data (9.3% coverage)


### Preprocessing
- **Event Filtering:** Focus on tactical event types (Pass, Pressure, Carry, Shot, Duel)
- **Metric Calculation:** 12 tactical metrics computed via custom pipeline `run_pipeline.py`
- **Aggregation:** Player-level metrics → Team-level metrics → Season-level profiles
- **Quality Checks:** Validated 99.25% location coverage, zero duplicate IDs
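
The two quality checks above can be sketched in a few lines. This is a minimal illustration, not the code in `run_pipeline.py`: the record layout (dicts with `id` and `location` keys) and the function names are assumptions made for the example.

```python
from collections import Counter


def location_coverage(events):
    """Share of events that carry a non-empty location (assumed 'location' key)."""
    if not events:
        return 0.0
    with_location = sum(1 for e in events if e.get("location"))
    return with_location / len(events)


def duplicate_ids(events):
    """Return the event ids that appear more than once (assumed 'id' key)."""
    counts = Counter(e["id"] for e in events)
    return sorted(i for i, n in counts.items() if n > 1)


# Toy records for illustration only, not real StatsBomb rows.
sample = [
    {"id": "a", "location": [60.0, 40.0]},
    {"id": "b", "location": [30.5, 12.0]},
    {"id": "c", "location": None},
    {"id": "d", "location": [100.0, 44.0]},
]

print(f"coverage: {location_coverage(sample):.2%}")
print("duplicates:", duplicate_ids(sample))
```

On the full events table the same logic would more likely run as a vectorized column operation; the plain-Python form is used here only to keep the sketch self-contained.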


### Key Assumptions
- StatsBomb event definitions consistent across competitions
- Missing positions (20.6%) inferred from event patterns during metric calculation
- Tournament context (knockout pressure) captured through match-level analysis, not event-level flags

### Known Limitations
- **360° tracking data:** Only 9.3% of matches have tracking data, which limits the packing metric to a supplementary role
- **Selection bias:** StatsBomb focuses on elite competitions only (top leagues/tournaments)
- **Temporal scope:** Men's data is heavily weighted to 2015-2016, while women's data is more recent (2018-2025)

---

## 3. General Dataset Overview

### 3.1 Scale & Composition

| Metric | Full Dataset | Men's | Women's (Phase 1 Focus) |
|--------|--------------|-------|-------------------------|
| Matches | 3,464 | 2,960 (85.5%) | 504 (14.5%) |
| Events | 12.2M | ~10.3M | ~1.9M |
| Players | 10,803 | ~9,500 | ~1,300 |
| Competitions | 21 | 15 | 3 |
| Date Range | 1958-2025 | 1958-2024 | 2018-2025 |
| Avg Goals/Match | 2.85 | 2.83 | **3.00** |
| 2015-16 Concentration | 52.7% | **62.8%** | **0%** |

**Men's Football Breakdown:**
- **Top 5 Leagues:** La Liga (868), Ligue 1 (435), Premier League (418), Serie A (381), Bundesliga (340) = 2,442 matches (82.5%)
- **Tournaments:** FIFA WC (147), UEFA Euro (102), Copa America (32), African Cup (52) = 333 matches
- **Temporal Issue:** 1,860/2,960 matches (62.8%) in 2015-2016 → severe mid-2010s bias
- **Phase 2 Potential:** Filter to 2018+ or use as historical baseline for tactical evolution study

**Women's Football Breakdown:**
- **Club Data:** FA Women's Super League (326 matches, 2018-2021)
- **Tournaments:** FIFA Women's WC (116), UEFA Women's Euro (62) = 178 matches
- **Temporal Balance:** Even distribution 2018-2025, 0% in 2015-2016
- **Phase 1 Focus:** Ideal structure for club→tournament translation analysis

**Key Insights:**
- Women's **6% higher scoring** (3.00 vs 2.83) suggests different tactical dynamics
- Men's data is **fragmented** across 15 competitions vs women's **focused** on 3 competitions
- Women's **modern coverage** (2018-2025) vs men's **historical bias** (62.8% in 2015-16)
- Both genders have **tournament data** enabling Phase 2 comparative analysis
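
As a quick arithmetic check on the scoring figures, the sketch below recomputes the relative difference and the pooled goals-per-match rate from the match counts and averages in the table above. The helper names are ours, introduced for illustration.

```python
def relative_difference(a, b):
    """How much larger a is than b, as a fraction of b."""
    return (a - b) / b


def pooled_rate(groups):
    """Match-weighted average rate; groups is a list of (matches, rate) pairs."""
    total_matches = sum(m for m, _ in groups)
    total_goals = sum(m * r for m, r in groups)
    return total_goals / total_matches


women = (504, 3.00)   # matches, goals per match (from the table above)
men = (2960, 2.83)

print(f"{relative_difference(women[1], men[1]):.1%}")  # 6.0%
print(f"{pooled_rate([men, women]):.2f}")              # 2.85
```

The pooled value matches the full-dataset average of 2.85, so the three goals-per-match figures are mutually consistent.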

### 3.2 Critical Finding: Temporal Concentration

**From Basic EDA:**
```
Season Distribution:
2015/2016:  1,824 matches (52.7%)
2020/2021:    166 matches
2018/2019:    143 matches
...
```
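
A distribution like the one above can be produced with a simple count over the season label of each match. The sketch assumes one season string per match; the function name and the toy input are illustrative and do not reproduce the real counts.

```python
from collections import Counter


def season_shares(seasons):
    """Return (season, matches, share) tuples, most frequent season first."""
    counts = Counter(seasons)
    total = sum(counts.values())
    return [(s, n, n / total) for s, n in counts.most_common()]


# Toy input: one season label per match (not the real distribution).
toy = ["2015/2016"] * 5 + ["2020/2021"] * 3 + ["2018/2019"] * 2

for season, n, share in season_shares(toy):
    print(f"{season}: {n:>5} matches ({share:.1%})")
```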

**Impact on Analysis:**
- **Problem:** With 52.7% of matches in 2015/2016, any general "football insights" are in practice "mid-2010s football insights"
- **Implication:** Archetype clustering would be dominated by outdated tactical patterns
- **Solution:** Women's data has **0% in 2015/2016**, spans 2018-2025 (modern tactics)

| Period | Men's Data | Women's Data |
|--------|------------|---------------|
| 2015-2016 | 62.8% | 0%  |
| 2018-2025 | 3.2% | 100%  |

**Decision Validation:** Choosing women's football is a **data quality decision**, not just novelty.

### 3.3 Data Quality Assessment

**Production-Ready Metrics:**

| Quality Check | Result | Status |
|---------------|--------|--------|
| Location Coverage | 99.25% (12.1M/12.2M events) | Excellent |
| Duplicate IDs | 0 | Perfect |
| Event Types | 35 unique, 67% tactical | Rich |
| Under Pressure | 20.9% of events | Good context |
| Pass Completion | 77.7% | Realistic |
| Shot Quality (median xG) | 0.055 | Full spectrum |

**Implications:**
- **Spatial metrics reliable:** 99.25% coverage enables Field Tilt, Defensive Line Height
- **No data corruption:** Zero duplicates confirms integrity
- **Context-aware analysis:** 20.9% of events carry an under-pressure flag, which supports pressing-related metrics such as PPDA
- **Realistic distributions:** 77.7% pass completion allows quality filtering
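
For reference, the pass completion and median xG figures can be derived roughly as follows. The sketch assumes flattened field names (`pass_outcome`, `shot_statsbomb_xg`) and relies on the StatsBomb convention that a completed pass has no outcome recorded; the actual column names in `events.parquet` may differ.

```python
from statistics import median


def pass_completion(events):
    """Completed passes / all passes; a missing outcome is treated as completed."""
    passes = [e for e in events if e.get("type") == "Pass"]
    if not passes:
        return None
    completed = sum(1 for e in passes if e.get("pass_outcome") is None)
    return completed / len(passes)


def median_xg(events):
    """Median xG over shots that have an xG value (assumed field name)."""
    values = [
        e["shot_statsbomb_xg"]
        for e in events
        if e.get("type") == "Shot" and e.get("shot_statsbomb_xg") is not None
    ]
    return median(values) if values else None


# Toy events for illustration only.
toy_events = [
    {"type": "Pass", "pass_outcome": None},
    {"type": "Pass", "pass_outcome": "Incomplete"},
    {"type": "Pass"},
    {"type": "Pass", "pass_outcome": None},
    {"type": "Shot", "shot_statsbomb_xg": 0.05},
    {"type": "Shot", "shot_statsbomb_xg": 0.30},
    {"type": "Shot", "shot_statsbomb_xg": 0.02},
]

print(pass_completion(toy_events), median_xg(toy_events))
```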

**Conclusion:** No additional cleaning required. Dataset ready for 12-metric framework.

---

## 4. Soccer Analytics Dashboard Exploration

### Research Question

**How do tactical systems translate from club to tournament play in women's football?**

We hypothesize "tactical compression" occurs in tournaments:
- Shorter possession sequences (fatigue, pressure)
- Higher PPDA (less aggressive pressing)
- Lower defensive lines (risk aversion)
- Fewer progressive actions (conservative play)
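
One way to quantify each of these hypothesized shifts is a standardized mean difference between club and tournament values of a metric. The comparison method is not fixed in this report, so the sketch below (Cohen's d with a pooled standard deviation) is an assumption, and the PPDA values are made-up toy numbers.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(club, tournament):
    """Standardized mean difference (tournament minus club), pooled sample std."""
    n1, n2 = len(club), len(tournament)
    pooled_var = (
        (n1 - 1) * stdev(club) ** 2 + (n2 - 1) * stdev(tournament) ** 2
    ) / (n1 + n2 - 2)
    return (mean(tournament) - mean(club)) / sqrt(pooled_var)


# Toy per-team PPDA values, invented for illustration.
club_ppda = [8.0, 9.0, 10.0]
tournament_ppda = [10.0, 11.0, 12.0]

# A positive value means PPDA is higher (pressing is less aggressive) in tournaments.
print(round(cohens_d(club_ppda, tournament_ppda), 2))  # 2.0
```

An effect size is used here rather than a significance test because the tournament sample (178 matches) is small; a formal test could be layered on top.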

### 12-Metric Framework

Our metrics span three tactical dimensions:

**Possession (Control Quality):**
1. Possession % — Team control
2. Field Tilt — Territorial dominance
3. Possession Value (EPR) — Quality over quantity
4. Sequence Length — Patient vs direct play
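
To make one possession metric concrete, here is a minimal sketch of Field Tilt as a team's share of final-third passes. Several definitions are in use (some count touches rather than passes), so the definition, the 120-unit pitch length, the `x >= 80` threshold, and the record layout are all assumptions rather than the pipeline's actual implementation.

```python
FINAL_THIRD_X = 80.0  # assumed: 120-unit pitch, each team attacking toward x = 120


def field_tilt(events, team):
    """Team's share of all final-third passes in a match (one common definition)."""
    final_third = [
        e for e in events
        if e["type"] == "Pass" and e["location"][0] >= FINAL_THIRD_X
    ]
    if not final_third:
        return None
    own = sum(1 for e in final_third if e["team"] == team)
    return own / len(final_third)


# Toy events; locations are in the acting team's attacking frame.
toy_match = [
    {"team": "A", "type": "Pass", "location": [90.0, 40.0]},
    {"team": "A", "type": "Pass", "location": [100.0, 20.0]},
    {"team": "A", "type": "Pass", "location": [85.0, 60.0]},
    {"team": "B", "type": "Pass", "location": [95.0, 40.0]},
    {"team": "A", "type": "Pass", "location": [40.0, 40.0]},   # outside final third
    {"team": "B", "type": "Pass", "location": [50.0, 40.0]},   # outside final third
    {"team": "A", "type": "Shot", "location": [110.0, 40.0]},  # not a pass
]

print(field_tilt(toy_match, "A"))  # 0.75
```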

**Progression (Forward Movement):**
5. Progressive Passes — 10+ yards toward goal
6. Progressive Carries — Dribbling advancement
7. Progressive Actions — Combined metric
8. Packing* — Opponents eliminated (*360° limited)
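
The "10+ yards toward goal" rule can be read in more than one way. The sketch below takes it as a reduction of at least 10 units in straight-line distance to the centre of the opponent's goal, assuming a 120 x 80 pitch with the goal at `(120, 40)`. Both the interpretation and the function names are assumptions for illustration.

```python
from math import hypot

GOAL = (120.0, 40.0)  # assumed goal centre on a 120 x 80 pitch


def distance_to_goal(point):
    """Straight-line distance from a pitch location to the goal centre."""
    return hypot(GOAL[0] - point[0], GOAL[1] - point[1])


def is_progressive(start, end, min_gain=10.0):
    """True if the action ends at least min_gain units closer to goal."""
    return distance_to_goal(start) - distance_to_goal(end) >= min_gain


print(is_progressive((60.0, 40.0), (75.0, 40.0)))  # True: 15 units closer
print(is_progressive((60.0, 40.0), (65.0, 40.0)))  # False: only 5 units closer
print(is_progressive((60.0, 40.0), (60.0, 10.0)))  # False: sideways, further away
```

The same predicate applies to carries by passing the carry's start and end locations, which is how a combined Progressive Actions count could be built.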

**Defensive (Pressing & Structure):**
9. PPDA — Pressing intensity
10. High Turnovers — Gegenpressing success
11. Defensive Line Height — High line vs deep block
12. Defensive Actions by Zone — Spatial distribution
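
PPDA (passes per defensive action) is usually computed as opponent passes in the opponent's own 60% of the pitch divided by the pressing team's defensive actions in that same zone. The sketch below follows that convention. The set of event types counted as defensive actions, the zone thresholds, and the record layout are assumptions and may differ from `src/metrics.py`.

```python
# Assumed set of event types counted as defensive actions.
DEFENSIVE_ACTIONS = {"Duel", "Interception", "Foul Committed"}

# Locations are assumed to be in each acting team's own attacking frame
# (0 = own goal line, 120 = opponent goal line), so the opponent's own 60%
# is x <= 72 for their passes and x >= 48 for the pressing team's actions.
OPP_PASS_MAX_X = 72.0
PRESS_ACTION_MIN_X = 48.0


def ppda(events, pressing_team):
    """Opponent passes allowed per defensive action; lower means more pressing."""
    opp_passes = sum(
        1 for e in events
        if e["team"] != pressing_team
        and e["type"] == "Pass"
        and e["location"][0] <= OPP_PASS_MAX_X
    )
    actions = sum(
        1 for e in events
        if e["team"] == pressing_team
        and e["type"] in DEFENSIVE_ACTIONS
        and e["location"][0] >= PRESS_ACTION_MIN_X
    )
    return opp_passes / actions if actions else None


# Toy events for illustration only.
toy_press = [
    {"team": "B", "type": "Pass", "location": [30.0, 40.0]},
    {"team": "B", "type": "Pass", "location": [50.0, 20.0]},
    {"team": "B", "type": "Pass", "location": [70.0, 60.0]},
    {"team": "B", "type": "Pass", "location": [100.0, 40.0]},        # outside zone
    {"team": "A", "type": "Interception", "location": [80.0, 30.0]},
    {"team": "A", "type": "Duel", "location": [20.0, 30.0]},          # too deep
    {"team": "A", "type": "Pass", "location": [60.0, 30.0]},
]

print(ppda(toy_press, "A"))  # 3.0
```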

### Expected Tactical Archetypes

Based on metric combinations, we expect 4-6 archetypes:

1. **Dominant Possession** — High Poss%, Field Tilt, Long Sequences, Low PPDA
2. **High Press Counter** — Med Poss%, High Prog Carries, Very Low PPDA, High Turnovers
3. **Deep Block Counter** — Low Poss%, Short Sequences, High PPDA, Low Def Line
4. **Mid-Block Control** — High Poss%, Balanced Progression, Medium PPDA

### Analysis Pipeline

```
504 Women's Matches
        ↓
Event-Level Metrics (run_pipeline.py)
        ↓
Player & Team Aggregation
        ↓
Dimensionality Reduction (PCA/t-SNE)
        ↓
Clustering (K-means, K=4-6)
        ↓
Archetype Profiling
        ↓
Club vs Tournament Comparison
        ↓
Dashboard Visualization
```
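
The clustering stage can be illustrated with a small, dependency-free version of K-means on standardized team profiles. This is a sketch only: it skips the PCA/t-SNE step, seeds centroids with the first `k` rows for determinism, and uses invented toy profiles. A library implementation (for example scikit-learn) would be the more likely choice in the real pipeline.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so metrics on different scales are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns keep scale 1
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [mean(c) for c in zip(*members)]
    return labels, centroids


# Toy team profiles: [possession %, PPDA], invented for illustration.
profiles = [
    [62.0, 7.5], [38.0, 14.0], [60.0, 8.0],
    [40.0, 13.5], [61.0, 7.0], [39.0, 15.0],
]

labels, _ = kmeans(standardize(profiles), k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Standardizing first matters because possession % and PPDA sit on different scales; without it the metric with the larger numeric range would dominate the distance calculation.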

### Why This Dataset Enables Novel Insights

**1. Club-Tournament Structure:**
- 326 FA WSL matches → establish club-level tactical baselines
- 178 tournament matches → measure tactical adaptation
- Overlapping player pool → helps separate context effects from personnel effects

**2. Modern Tactical Trends:**
- 2018-2025 coverage captures current football
- No 2015-2016 bias (unlike men's 62.8%)
- Relevant to 2024+ dashboard users

**3. Under-Analyzed Domain:**
- Women's football less studied than men's
- 6% higher scoring → different tactical dynamics
- Novel contributions to emerging field

**4. Data Quality:**
- 99.25% location coverage → spatial metrics reliable
- 67% tactical events → rich signal
- Zero duplicates → clean analysis

---

## 5. Next Steps


---

## 6. Conclusion

### Summary



**For comprehensive analysis details:** See `EDA.ipynb`

**For metric calculation code:** See `run_pipeline.py` and `src/metrics.py`