# Premier League EDA - Phase 2 Analysis

## Overview & Objectives

This notebook contains exploratory data analysis for the Premier League dataset, focusing on:

- **Match Outcomes**: Distribution of wins, draws, losses across historical data
- **Betting Market Analysis**: Calibration of bookmaker odds vs actual outcomes
- **Financial Analytics**: Relationship between squad values and wage bills
- **Injury Impact**: Time-series analysis of injury burden patterns
- **Data Quality Assessment**: Coverage and completeness validation

## Data Sources

- Historical match results (E0*.csv)
- Club financial data (Club Value.csv, Club wages.csv)
- Injury records (Injury list 2002-2016.csv)
- Possession/xG statistics (Possession data 24-25.csv)
- Manager tenure data (Premier League Managers.csv)

---

In [None]:
# Import required libraries
import polars as pl
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Define data paths
data_raw = Path("../data/raw")
data_processed = Path("../data/processed")
figures_dir = Path("../outputs/figures")
# parents=True so a missing ../outputs directory is created as well
figures_dir.mkdir(parents=True, exist_ok=True)

print("⚽ Premier League EDA - Setup Complete")
print(f"📁 Raw data directory: {data_raw}")
print(f"📁 Processed data directory: {data_processed}")
print(f"📁 Figures output: {figures_dir}")

---

## 2. Match Outcomes Distribution

Analyzing the distribution of match results (wins, draws, losses) across historical Premier League data.

**Questions to explore:**

- What is the overall distribution of home wins vs away wins vs draws?
- Has this distribution changed over time?
- Are there seasonal patterns in outcome frequencies?

In [None]:
# Load historical match data
historical_files = list(data_raw.glob("E0*.csv"))
print(f"Found {len(historical_files)} historical match files")

# TODO: Load and combine historical match data
# TODO: Analyze FTR (Full Time Result) distribution
# TODO: Create visualization of outcomes distribution
# TODO: Examine trends over time
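The FTR distribution TODO could start from something like the sketch below. It uses only the standard library, so it does not depend on how the match files end up being loaded. `outcome_shares` is a hypothetical helper, the sample list is made up for illustration, and the code assumes that `FTR` holds the codes `H`, `D`, and `A` for home win, draw, and away win.

```python
from collections import Counter


def outcome_shares(results):
    """Return the share of each full-time result code (H, D, A) in `results`."""
    counts = Counter(results)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter returns 0 for a missing code, so absent outcomes get share 0.0
    return {code: counts[code] / total for code in ("H", "D", "A")}


# Made-up sample, not taken from the E0 files
sample_ftr = ["H", "H", "D", "A", "H", "A", "D", "H"]
shares = outcome_shares(sample_ftr)
for code, share in shares.items():
    print(f"{code}: {share:.1%}")
```

Once the files are combined into a Polars frame, the input list could come from the `FTR` column via `Series.to_list()`.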

---

## 3. Betting Odds Calibration Analysis

Examining how well bookmaker odds predict actual match outcomes.

**Key Metrics:**

- Implied probability vs empirical probability
- Calibration curves for different bet types
- Market efficiency indicators

**Bookmaker columns to analyze:**

- B365H, B365D, B365A (Bet365 odds)
- Other available bookmakers for comparison
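As a minimal sketch of the implied-probability step, assuming the bookmaker columns hold decimal odds: the raw implied probability of an outcome is the reciprocal of its odds, and the three reciprocals sum to slightly more than 1 because of the bookmaker's margin (the overround). The hypothetical helper `implied_probabilities` rescales them proportionally so they sum to 1, which is only one of several possible normalization choices. The example odds are made up.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to normalized implied probabilities.

    Returns ((p_home, p_draw, p_away), margin), where margin is the
    overround: the amount by which the raw reciprocals exceed 1.
    """
    odds = (home_odds, draw_odds, away_odds)
    if any(o <= 1.0 for o in odds):
        raise ValueError("decimal odds are expected to be greater than 1")
    raw = [1.0 / o for o in odds]
    total = sum(raw)
    # Proportional normalization: divide each raw value by their sum
    probs = tuple(p / total for p in raw)
    return probs, total - 1.0


# Made-up odds for a single match
probs, margin = implied_probabilities(2.10, 3.40, 3.60)
print("P(home), P(draw), P(away):", [round(p, 3) for p in probs])
print(f"Bookmaker margin: {margin:.2%}")
```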

In [None]:
# TODO: Extract bookmaker odds columns
# TODO: Convert odds to implied probabilities
# TODO: Calculate empirical probabilities by binned predictions
# TODO: Create calibration plots
# TODO: Assess market efficiency
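The binning TODO could be approached along the lines of the sketch below. It groups predicted probabilities into equal-width bins and compares the mean prediction in each bin with the observed hit rate. `calibration_table` is a hypothetical helper, equal-width bins are an assumption (quantile bins are another option), and the sample inputs are made up.

```python
def calibration_table(predicted, outcomes, n_bins=5):
    """Compare mean predicted probability with the empirical rate per bin.

    `predicted` holds probabilities in [0, 1]; `outcomes` holds 1 if the
    predicted event happened and 0 otherwise. Empty bins are skipped.
    """
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(predicted, outcomes):
        # A probability of exactly 1.0 is placed in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += p
        hits[idx] += int(y)
        counts[idx] += 1
    rows = []
    for i in range(n_bins):
        if counts[i] == 0:
            continue
        rows.append({
            "bin_low": i / n_bins,
            "bin_high": (i + 1) / n_bins,
            "mean_predicted": sums[i] / counts[i],
            "empirical_rate": hits[i] / counts[i],
            "count": counts[i],
        })
    return rows


# Made-up home-win probabilities and whether the home team actually won
predicted = [0.10, 0.15, 0.45, 0.55, 0.50, 0.85, 0.90]
outcomes = [0, 0, 1, 0, 1, 1, 1]
rows = calibration_table(predicted, outcomes, n_bins=5)
for row in rows:
    print(row)
```

For a well-calibrated market, `mean_predicted` and `empirical_rate` should be close in every bin that has enough matches; plotting one against the other gives the calibration curve.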

---

## 4. Financial Analytics: Squad Value vs Wage Bill

Exploring the relationship between club valuations and wage expenditure.

**Research Questions:**

- Is there a correlation between squad value and wage bill?
- Which clubs are over/under-spending relative to their squad value?
- How does financial power translate to on-field performance?

In [None]:
# Load financial data
if (data_raw / "Club Value.csv").exists():
    club_values = pl.read_csv(data_raw / "Club Value.csv")
    print("✅ Loaded club values data")
    print(club_values.head())
else:
    print("⚠️ Club Value.csv not found")

# TODO: Load Club wages.csv
# TODO: Merge datasets on team names (apply canonicalization)
# TODO: Create scatter plot with annotations
# TODO: Identify outliers and interesting patterns
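The canonicalization TODO is sketched below under stated assumptions: the two files may spell club names differently, so names are normalized before matching. The alias table, the helper names `canonical_team` and `merge_on_team`, and all figures are hypothetical; the real column names and spellings in `Club Value.csv` and `Club wages.csv` would need to be checked first.

```python
# Hypothetical alias table: lower-cased variant -> canonical club name
ALIASES = {
    "man utd": "Manchester United",
    "manchester utd": "Manchester United",
    "spurs": "Tottenham Hotspur",
}


def canonical_team(name):
    """Collapse whitespace and map known variants to one canonical name."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned.lower(), cleaned)


def merge_on_team(values, wages):
    """Inner-join two {team: number} mappings on the canonical team name."""
    canon_values = {canonical_team(k): v for k, v in values.items()}
    canon_wages = {canonical_team(k): w for k, w in wages.items()}
    shared = sorted(canon_values.keys() & canon_wages.keys())
    return [(team, canon_values[team], canon_wages[team]) for team in shared]


# Made-up figures, for illustration only
values = {"Man Utd": 100.0, "Arsenal ": 80.0, "Spurs": 60.0}
wages = {"Manchester United": 50.0, "Arsenal": 40.0, "Everton": 20.0}
merged = merge_on_team(values, wages)
print(merged)
```

Clubs that appear in only one source are dropped by the inner join, so it is worth listing them separately when assessing coverage.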

---

## 5. Injury Burden Time-Series Analysis

Analyzing injury patterns and their potential impact on team performance.

**Analysis Goals:**

- Identify seasonal injury patterns
- Examine injury burden by team/position
- Correlate injury levels with performance metrics

In [None]:
# Load injury data if available
injury_file = data_raw / "Injury list 2002-2016.csv"
if injury_file.exists():
    injuries = pl.read_csv(injury_file)
    print("✅ Loaded injury data")
    print(f"Injury records: {injuries.height:,} rows")
    print("Columns:", injuries.columns)
else:
    print("⚠️ Injury list 2002-2016.csv not found")

# TODO: Parse injury dates and durations
# TODO: Aggregate by time periods (monthly/seasonal)
# TODO: Create time-series visualizations
# TODO: Analyze patterns by position/severity
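The monthly aggregation TODO might look roughly like the sketch below. It assumes injury dates are available as ISO strings (`YYYY-MM-DD`); the actual date format in the injury file is not known here and may need a different parser. `injuries_per_month` is a hypothetical helper and the sample dates are made up.

```python
from collections import Counter
from datetime import date


def injuries_per_month(date_strings):
    """Count injury records per (year, month), sorted chronologically."""
    parsed = [date.fromisoformat(s) for s in date_strings]
    counts = Counter((d.year, d.month) for d in parsed)
    return dict(sorted(counts.items()))


# Made-up injury start dates
sample_dates = ["2010-08-14", "2010-08-28", "2010-09-11", "2011-01-03"]
monthly = injuries_per_month(sample_dates)
for (year, month), n in monthly.items():
    print(f"{year}-{month:02d}: {n}")
```

Months with no injuries are absent from the result, so a time-series plot would need those gaps filled with zeros first.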

---

## 6. Research Questions & Hypotheses

### Key Questions to Investigate

**Match Outcomes & Patterns:**

- [ ] Is there a significant home advantage in the Premier League?
- [ ] How has the competitive balance changed over different seasons?
- [ ] Are there specific matchweeks with unusual outcome patterns?

**Financial Impact:**

- [ ] Do higher wage bills correlate with better league positions?
- [ ] Which clubs show the best value-for-money in terms of points per £ spent?
- [ ] Has financial fair play affected spending patterns?

**Betting Market Efficiency:**

- [ ] Are bookmaker odds well-calibrated across different probability ranges?
- [ ] Do certain types of matches show systematic mispricing?
- [ ] How quickly do odds adjust to team news and form?

**Performance Analytics:**

- [ ] How well do xG metrics predict actual goals scored?
- [ ] Which teams consistently over/under-perform their xG?
- [ ] What factors drive the largest xG vs actual goal differences?

**Injury & Squad Management:**

- [ ] Do injury-prone teams have different performance patterns?
- [ ] Are there optimal squad rotation strategies visible in the data?
- [ ] How do manager changes correlate with performance shifts?

### Working Hypotheses

1. **Home Advantage Hypothesis**: Home teams should win ~45-50% of matches
2. **Financial Power Hypothesis**: Wage bill correlates more strongly with performance than squad value
3. **Market Efficiency Hypothesis**: Betting odds should be well-calibrated for high-volume markets
4. **Squad Depth Hypothesis**: Teams with more balanced squads perform better over full seasons

---

## Next Steps

After completing this EDA:

1. Validate key hypotheses with statistical tests
2. Identify features for predictive modeling
3. Document data quality issues for Phase 3
4. Generate insights for tactical/strategic analysis

---

*Generated by Premier League Data Pipeline - Phase 2*