# 03 - Metric Calculations: Building the Evidence
**Project**: Premier League Competitiveness Analysis (2000-2024)  
**Purpose**: Calculate competitiveness and financial inequality metrics that tell the story of how money reshaped English football

## The Story We're Telling
> The Premier League was once unpredictable. Over 24 years, money created a permanent hierarchy — and we can prove it with numbers.

## Metrics We'll Calculate
1. **Points Gini Coefficient**: overall inequality in league points each season
2. **Squad Value Gini Coefficient**: financial inequality across clubs
3. **Champion's Winning Margin**: how dominant is the top team?
4. **Top 6 Stability Rate**: are the same clubs always at the top?
5. **Money vs Position Correlation**: does money buy success?
6. **HHI on Market Value**: financial concentration
7. **Market Value Standard Deviation**: financial spread in millions of euros
8. **Points Regression on Market Value**: how much of success does money explain?
9. **Dominance Trend**: is the champion pulling further away over time?

In [0]:
from pyspark.sql import functions as F  # namespaced: a wildcard import would shadow builtins such as sum, min and max

df = spark.table("unified_season_data")
print(f"Loaded: {df.count()} rows, {len(df.columns)} columns")

# Convert to pandas for metric calculations
pdf = df.toPandas()
pdf.head()

## Metric 1: Points Gini Coefficient
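The Gini coefficient measures inequality on a scale from 0 to 1. Applied to league points:
- **0** = every team finishes on the same points (perfect equality)
- **1** = the theoretical limit in which a single team holds all the points. With 20 clubs the formula used here tops out at 0.95, and real league tables sit far below that (0.15 to 0.22 in this dataset)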

If the Premier League is becoming less competitive, this number should be **rising** over time.
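
To make the scale concrete, here is a minimal pure-Python sketch of the same sorted-rank formula, applied to two made-up 20-club points tables. The function name `gini` and both tables are illustrative assumptions, not part of the project data.

```python
def gini(values):
    """Gini coefficient via the sorted-rank formula."""
    xs = sorted(float(v) for v in values)
    n = len(xs)
    # Weight each value by its ascending rank (1..n)
    weighted_sum = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted_sum / (n * sum(xs)) - (n + 1) / n

# Made-up points tables for a 20-club league
equal_table = [50] * 20              # every club on the same points
one_club_table = [0] * 19 + [100]    # one club holds every point

gini_equal = gini(equal_table)       # perfect equality
gini_extreme = gini(one_club_table)  # maximum inequality for 20 clubs
print(f"equal: {gini_equal:.3f}, extreme: {gini_extreme:.3f}")
```

With this formula the maximum for `n` clubs is `(n - 1) / n`, which is why a 20-club league cannot reach exactly 1.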


In [0]:
import numpy as np

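def calculate_gini(values):
    """Gini coefficient of a 1-D collection of non-negative values."""
    # Sort ascending so each value can be weighted by its rank (1..n)
    values = np.sort(np.array(values, dtype=float))
    n = len(values)
    index = np.arange(1, n+1)
    return (2 * np.sum(index * values) / (n * np.sum(values))) - (n + 1) / n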

In [0]:
gini_pdf = pdf.groupby('season_start_year')['points'].apply(calculate_gini)
gini_pdf = gini_pdf.reset_index().rename(columns = {'points': 'gini'})

print(gini_pdf.to_string(index=False))

## Metric 2: Squad Value Gini Coefficient
Same Gini formula, applied to squad market values instead of points. This measures **financial inequality** — how unevenly the money is distributed across clubs. Available from 2004/05 onward.

In [0]:
financial_gini_pdf = pdf[pdf['has_financial_data'] == True].groupby('season_start_year')['total_market_value_eur'].apply(calculate_gini)
financial_gini_pdf = financial_gini_pdf.reset_index().rename(columns = {'total_market_value_eur': 'financial_gini'})

print(financial_gini_pdf.to_string(index=False))

## Metric 3: Champion's Winning Margin
The points gap between 1st and 2nd place. A large margin means the champion dominated; a small margin means a tight title race. We'll later test whether this gap is growing over time.
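
As a rough illustration of the calculation, the sketch below computes the margin on a made-up two-season table by pivoting the top two positions into columns. The club names and points are invented, and the pivot is just one possible way to get the same number as a merge of 1st and 2nd place.

```python
import pandas as pd

# Made-up league table: three clubs, two seasons
toy = pd.DataFrame({
    'season_start_year': [2000, 2000, 2000, 2001, 2001, 2001],
    'club': ['Club A', 'Club B', 'Club C', 'Club B', 'Club A', 'Club C'],
    'position': [1, 2, 3, 1, 2, 3],
    'points': [80, 70, 60, 75, 74, 50],
})

# One row per season, one column per position (1 and 2)
top_two = toy[toy['position'] <= 2].pivot(
    index='season_start_year', columns='position', values='points'
)

# Champion's points minus runner-up's points
margin = (top_two[1] - top_two[2]).rename('margin')
print(margin)
```

In this toy data the first season is a comfortable title win and the second is decided by a single point.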

In [0]:
# Find the margin between the champion and the runner-up for each season
pos_1 = pdf[pdf['position']==1]
pos_1 = pos_1[['season_start_year', 'club', 'points']].reset_index(drop=True)

pos_2 = pdf[pdf['position']==2]
pos_2 = pos_2[['season_start_year', 'club', 'points']].reset_index(drop=True)

margin_pdf = pos_1.merge(pos_2, on='season_start_year').rename(columns={'club_x': 'champion', 'points_x': 'champion_pts', 'club_y': 'runner_up', 'points_y': 'runner_up_pts'})
margin_pdf['margin'] = margin_pdf['champion_pts'] - margin_pdf['runner_up_pts']

margin_pdf = margin_pdf.reset_index(drop=True)
margin_pdf

## Metric 4: Top 6 Stability Rate
How many of last season's top 6 clubs remain in the top 6? High stability (5-6) means the elite is locked in. Low stability (2-3) means there's genuine movement. This directly measures whether the league has created permanent tiers.
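
The idea reduces to a set intersection between consecutive seasons. A minimal sketch with placeholder club names (the letters are hypothetical, not real clubs):

```python
# Made-up top-6 sets for three consecutive seasons
top6_by_season = {
    2000: {'A', 'B', 'C', 'D', 'E', 'F'},
    2001: {'A', 'B', 'C', 'D', 'G', 'H'},
    2002: {'A', 'B', 'C', 'D', 'G', 'H'},
}

# For each season with a predecessor, count the clubs that stayed in the top 6
stability = {
    year: len(clubs & top6_by_season[year - 1])
    for year, clubs in top6_by_season.items()
    if (year - 1) in top6_by_season
}
print(stability)  # {2001: 4, 2002: 6}
```

The first season has no predecessor, so it produces no stability value.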

In [0]:
import pandas as pd

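# Set of top-6 clubs per season, indexed by season_start_year
stability_set = pdf[pdf['position'] <= 6].groupby('season_start_year')['club'].agg(set)

# Count how many of the previous season's top 6 stayed in the top 6.
# Seasons with no preceding season in the data (such as the first) are skipped.
rows = [
    {'season_start_year': year, 'stability': len(teams & stability_set[year - 1])}
    for year, teams in stability_set.items()
    if (year - 1) in stability_set.index
]
stability_pdf = pd.DataFrame(rows, columns=['season_start_year', 'stability'])

stability_pdf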

## Metric 5: Money vs Position Correlation
Correlation between squad market value and final league position each season. A strong negative correlation means richer clubs reliably finish higher. If this correlation is **strengthening** over time, money's grip on results is tightening.
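
Because league position is an ordinal rank, it can be worth comparing the Pearson correlation with a rank-based one. The sketch below uses made-up squad values and computes a Spearman-style coefficient as the Pearson correlation of the ranks. This is a swapped-in sanity check, not the metric the notebook stores, and the column name `market_value_m` is an assumption for the example.

```python
import pandas as pd

# Made-up season: one club is far richer than the rest
toy = pd.DataFrame({
    'position': [1, 2, 3, 4],
    'market_value_m': [1000, 200, 150, 100],
})

# Pearson: sensitive to how far apart the values are
pearson = toy['market_value_m'].corr(toy['position'])

# Rank correlation: Pearson applied to the ranks of both columns
rank_corr = toy['market_value_m'].rank().corr(toy['position'].rank())

print(f"pearson: {pearson:.3f}, rank: {rank_corr:.3f}")
```

Here the richest-to-poorest ordering matches the table exactly, so the rank correlation is at its negative limit, while Pearson is weaker because the relationship is not linear.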

In [0]:
def calc_corr(group):
    return group['total_market_value_eur'].corr(group['position'])


financial_corr_pdf = pdf[pdf['has_financial_data'] == True].groupby('season_start_year').apply(calc_corr, include_groups=False)

financial_corr_pdf = financial_corr_pdf.reset_index().rename(columns = {0: 'financial_corr'})

financial_corr_pdf

In [0]:
def era(start_year):
    if 2000 <= start_year <= 2002:
        return "Pre-Abramovich"
    elif 2003 <= start_year <= 2007:
        return "Abramovich"
    elif 2008 <= start_year <= 2010:
        return "Man City Enters"
    elif 2011 <= start_year <= 2015:
        return "FFP Era"
    elif 2016 <= start_year <= 2019:
        return "TV Money Explosion"
    else:
        return "State Ownership Maturity"

In [0]:
pdf['era'] = pdf['season_start_year'].apply(era)
pdf.head()

spark.createDataFrame(pdf).write.mode('overwrite').option("overwriteSchema", "true").saveAsTable('unified_season_data')

## Metric 6: HHI (Herfindahl-Hirschman Index) on Market Value
Measures how concentrated financial power is. Each club's share of total league market value is squared, and the squares are summed. With 20 clubs it ranges from 0.05 (every club holds an equal 5% share) to 1.0 (one club holds all the value). Economists use it as a standard measure of market concentration.
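
The two ends of that range can be checked with a short pure-Python sketch. The helper name `hhi` and the squad values are made up for illustration.

```python
def hhi(values):
    """Sum of squared shares of the total."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)

# Made-up squad values for a 20-club league
hhi_equal = hhi([100] * 20)            # every club has a 5% share
hhi_monopoly = hhi([100] + [0] * 19)   # one club holds everything

print(f"equal: {hhi_equal:.3f}, monopoly: {hhi_monopoly:.3f}")
```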

In [0]:
def calc_hhi(group):
    shares = group['total_market_value_eur'] / group['total_market_value_eur'].sum()
    return (shares**2).sum()

hhi_pdf = pdf[pdf['has_financial_data'] == True].groupby('season_start_year').apply(calc_hhi, include_groups=False)

hhi_pdf = hhi_pdf.reset_index().rename(columns = {0: 'hhi'})

hhi_pdf

## Metric 7: Market Value Standard Deviation
How spread out are squad values in millions of euros? A rising standard deviation means the gap between rich and poor clubs is widening in absolute terms, even if relative concentration (HHI) is stable.
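
A small sketch of why the two measures can move differently: if every squad value is multiplied by the same factor, the shares (and so the HHI) do not change, but the standard deviation scales with the factor. The values below are invented, and `statistics.stdev` is used because it is a sample standard deviation, like the pandas default.

```python
from statistics import stdev

def hhi(values):
    """Sum of squared shares of the total."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)

# Made-up squad values in EUR millions
values_early = [50, 100, 150, 300]
values_late = [v * 4 for v in values_early]   # every club's value quadruples

std_ratio = stdev(values_late) / stdev(values_early)
hhi_change = hhi(values_late) - hhi(values_early)

print(f"std ratio: {std_ratio:.2f}, HHI change: {hhi_change:.6f}")
```

The absolute gap grows fourfold while relative concentration is unchanged.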

In [0]:
def calc_std(group):
    return group['total_market_value_eur'].std() / 1000000


mv_std_pdf = pdf[pdf['has_financial_data'] == True].groupby('season_start_year').apply(calc_std, include_groups=False)

mv_std_pdf = mv_std_pdf.reset_index().rename(columns = {0: 'mv_std'})

mv_std_pdf

## Metric 8: Points Regression on Market Value (R-squared)
Fits a linear regression: `points = β₀ + β₁ × market_value` each season. R-squared is the **share of the variance in points that a straight-line fit on market value accounts for**. If this is rising, results are tracking squad value more closely, which is what a more pay-to-win league would look like.
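
For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables. The sketch below checks this by hand on a made-up season; the numbers are invented and the calculation is written out in plain Python rather than with `scipy`.

```python
from math import sqrt

# Made-up season: squad value (EUR millions) and points for six clubs
market_value = [100, 200, 300, 400, 500, 600]
points = [35, 42, 40, 55, 70, 78]

n = len(points)
mean_x = sum(market_value) / n
mean_y = sum(points) / n
sxx = sum((x - mean_x) ** 2 for x in market_value)
syy = sum((y - mean_y) ** 2 for y in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(market_value, points))

# Least-squares fit: points = intercept + slope * market_value
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared from its definition: 1 - residual variation / total variation
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(market_value, points))
r_squared = 1 - ss_res / syy

# Pearson correlation between market value and points
pearson_r = sxy / sqrt(sxx * syy)
print(f"R-squared: {r_squared:.4f}, r squared: {pearson_r ** 2:.4f}")
```

This is why squaring `rvalue` from a one-variable regression gives R-squared directly.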

In [0]:
from scipy import stats

def calc_lin(group):
    # Simple linear regression of points on market value; rvalue**2 is R-squared
    r = stats.linregress(group['total_market_value_eur'], group['points'])
    return r.rvalue**2

regression_pdf = pdf[pdf['has_financial_data'] == True].groupby('season_start_year').apply(calc_lin, include_groups=False)

regression_pdf = regression_pdf.reset_index().rename(columns = {0: 'r2_value'})

regression_pdf

## Metric 9: Dominance Trend Over Time
Tests whether the champion's winning margin is systematically increasing by regressing margin against time. A significant positive slope would mean the champion is pulling further away each year.

In [0]:
dom_regres = stats.linregress(margin_pdf['season_start_year'], margin_pdf['margin'])

print(f"Slope: {dom_regres.slope}")
print(f"R2 Value: {dom_regres.rvalue**2}")
print(f"P Value: {dom_regres.pvalue}")

## Combine All Metrics
Merge all metric DataFrames into a single unified metrics table and save to Delta Lake for dashboard creation.
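
One thing to keep in mind when merging: `DataFrame.merge` defaults to an inner join, so a season missing from any one metric table drops out of the combined result. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values: points Gini for three seasons, financial Gini for only two
points_gini = pd.DataFrame({'season_start_year': [2003, 2004, 2005],
                            'gini': [0.15, 0.16, 0.17]})
financial_gini = pd.DataFrame({'season_start_year': [2004, 2005],
                               'financial_gini': [0.36, 0.35]})

# Default how='inner': 2003 is dropped because it has no financial row
inner = points_gini.merge(financial_gini, on='season_start_year')

# how='left': 2003 is kept, with NaN for the missing financial metric
left = points_gini.merge(financial_gini, on='season_start_year', how='left')

print(inner['season_start_year'].tolist())
print(left['season_start_year'].tolist())
```

Passing `how='left'` would keep the earlier seasons, at the cost of missing values in the financial columns.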

In [0]:
# Inner-join every per-season metric on season_start_year
competitiveness_metrics = (
    gini_pdf
    .merge(financial_gini_pdf, on='season_start_year')
    .merge(margin_pdf, on='season_start_year')
    .merge(stability_pdf, on='season_start_year')
    .merge(financial_corr_pdf, on='season_start_year')
    .merge(hhi_pdf, on='season_start_year')
    .merge(mv_std_pdf, on='season_start_year')
    .merge(regression_pdf, on='season_start_year')
)

competitiveness_metrics['era'] = competitiveness_metrics['season_start_year'].apply(era)

display(competitiveness_metrics)

In [0]:
# Overwrite so the notebook can be re-run; without a mode, saveAsTable fails once the table exists
spark.createDataFrame(competitiveness_metrics).write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("competitiveness_metrics")

## Era Comparison
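Grouping metrics by financial era shows which watershed moments had the biggest impact on competitive balance. Each era below starts with a major financial event that reshaped the league. The Pre-Abramovich seasons (2000-2002) are labelled in the data but should not appear in this comparison, because the combined metrics table only keeps seasons with squad value data (2004/05 onward). For the same reason, the Abramovich average covers 2004-2007 only.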

| Era | Period | Trigger |
|-----|--------|---------|
| Abramovich | 2003-2007 | Roman Abramovich buys Chelsea |
| Man City Enters | 2008-2010 | Abu Dhabi United Group buys Man City |
| FFP Era | 2011-2015 | Financial Fair Play introduced |
| TV Money Explosion | 2016-2019 | Record £5.1bn TV deal kicks in |
| State Ownership Maturity | 2020-2024 | COVID + state-owned clubs fully established |
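
The era boundaries can also be written as bins rather than an if/elif chain. The sketch below uses `pd.cut` with right-closed intervals and includes the Pre-Abramovich label that `era()` assigns to 2000-2002. The names `era_bins` and `era_labels` are assumptions for this example. One difference to be aware of: years outside the bins come back as missing values rather than falling into the last era.

```python
import pandas as pd

# Right-closed bins: (1999, 2002], (2002, 2007], ..., (2019, 2024]
era_bins = [1999, 2002, 2007, 2010, 2015, 2019, 2024]
era_labels = [
    'Pre-Abramovich',
    'Abramovich',
    'Man City Enters',
    'FFP Era',
    'TV Money Explosion',
    'State Ownership Maturity',
]

# Boundary years on both sides of each era change
years = pd.Series([2002, 2003, 2010, 2011, 2019, 2020, 2024])
eras = pd.cut(years, bins=era_bins, labels=era_labels).astype(str)

print(dict(zip(years.tolist(), eras.tolist())))
```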

In [0]:
era_summary = competitiveness_metrics.groupby('era').mean(numeric_only=True)

era_summary = era_summary.reset_index()

era_summary

In [0]:
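# Overwrite so the cell can be re-run without failing on an existing table
spark.createDataFrame(era_summary).write.format('delta').mode('overwrite').option('overwriteSchema', 'true').saveAsTable('era_comparison')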

## Summary of Key Findings

| Metric | Finding | Story Implication |
|--------|---------|-------------------|
| Points Gini | Rising (0.15 → 0.22) | League is becoming more unequal |
| Financial Gini | Declining (0.36 → 0.27) | Money is spreading out across clubs |
| HHI | Declining (0.076 → 0.062) | Financial concentration decreasing |
| Market Value Std Dev | Quadrupled (€86m → €324m) | Absolute gap widening dramatically |
| Money-Position Correlation | Consistently strong (-0.63 to -0.89) | Money reliably buys league position |
| R-squared | Mostly 0.67-0.89 | Squad value accounts for 67-89% of the variance in points |
| Champion's Margin Trend | No significant trend (p=0.92) | Title race itself isn't getting worse |
| Top 6 Stability | Generally high (4-6) | Same clubs dominate the top |

### The Paradox
Financial concentration is **decreasing** in relative terms (Financial Gini and HHI are both falling), yet points inequality is **increasing** and the absolute gap in squad values has roughly quadrupled. One reading is a threshold effect: once the top clubs are past a certain level of spending, extra money elsewhere in the league does not close the gap. On that reading, the league's problem is not that money is concentrated but that **the baseline cost of competing has risen beyond what most clubs can sustain**.

### Tables Created
| Table | Description |
|-------|-------------|
| `competitiveness_metrics` | All metrics per season with era labels |
| `era_comparison` | Average metrics grouped by financial era |

### Next Step
`04_dashboard_the_decline`: First Plotly dashboard, giving visual proof of declining competitiveness