## üìà Predicting Premier League Final Positions Using Betting Odds, Probabilistic Modelling & Simulation

**Competition:** English Premier League 2025/26  
**Purpose:** Estimate probabilities of final league positions using betting market information and simulation  
**Methods:** Odds-implied probabilities, Monte Carlo simulation, scenario analysis  
**Author:** [Victoria Friss de Kereki](https://www.linkedin.com/in/victoria-friss-de-kereki/)  
**Medium Articles:**  
[Predicting Premier League Final Positions Using Betting Odds, Probabilistic Modelling & Simulation](https://medium.com/p/2720ec335c3c)  
[Building a Probabilistic Premier League Simulator in Python](https://medium.com/p/2720ec335c3chttps://medium.com/@vickyfrissdekereki/building-a-probabilistic-premier-league-simulator-in-python-34b5248f81b9)

---

**Notebook first written:** `07/02/2026`  
**Last updated:** `07/02/2026`  

> This notebook develops a probabilistic framework to predict final Premier League final positions using betting odds as market-based expectations.
>
> Betting odds are transformed into implied probabilities and adjusted for bookmaker margin. These probabilities are then used to simulate the remainder of the season via Monte Carlo methods, generating distributions over final points totals and league positions.
>
> The analysis focuses on estimating the likelihood of key outcomes such as title wins, top-four finishes, relegation, and mid-table placements. Results are presented at team level with uncertainty intervals, and the framework can be extended to incorporate form, fixture difficulty, or alternative predictive inputs beyond betting markets.


<div style="text-align: left;">
    <img src="Images and others/Predicting Premier League Final Positions Using Betting Odds, Probabilistic Modelling & Simulation.png" alt="Predicting Premier League Final Positions Using Betting Odds, Probabilistic Modelling & Simulation" width="600">
</di>
>

In [1]:
# Core
from datetime import datetime, timedelta
import os

# Data manipulation
import numpy as np
import pandas as pd

# APIs & environment
import requests
from dotenv import load_dotenv

# Statistics
from scipy.stats import poisson

# Visualisation
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Nicer printing of tables, no wrapping
pd.set_option("display.width", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)

## 1. Premier League Final Standings (ESPN Scraping)
##### Using the ESPN scraper I built in my previous project.

In [2]:
import pandas as pd

year = 2025  # current season start year

leagues = {
    "ENG.1": "premierleague_england",
    "ITA.1": "seriea_italy",
    "ESP.1": "laliga_spain",
    "GER.1": "bundesliga_germany",
    "FRA.1": "ligue1_france",
}

for league_code, df_name in leagues.items():
    url = f"https://www.espn.com/soccer/standings/_/league/{league_code}/season/{year}"
    tables = pd.read_html(url)

    teams_raw = tables[0]
    stats = tables[1]

    teams = pd.DataFrame()
    teams["position"] = teams_raw.iloc[:, 0].str.extract(r"^(\d+)").astype(int)
    teams["team"] = (
        teams_raw.iloc[:, 0]
        .str.replace(r"^\d+", "", regex=True)
        .str.replace(r"^[A-Z]{2,3}", "", regex=True)
        .str.strip()
    )

    stats.columns = ["gp", "w", "d", "l", "gf", "ga", "gd", "pts"]
    stats = stats.apply(
        lambda c: c.astype(str)
                  .str.replace("+", "", regex=False)
                  .astype(int)
    )

    globals()[df_name] = pd.concat([teams, stats], axis=1)

In [3]:
print("\nPremier League (England)")
print(premierleague_england.head(3))

print("\nSerie A (Italy)")
print(seriea_italy.head(3))

print("\nLa Liga (Spain)")
print(laliga_spain.head(3))

print("\nBundesliga (Germany)")
print(bundesliga_germany.head(3))

print("\nLigue 1 (France)")
print(ligue1_france.head(3))


Premier League (England)
   position             team  gp   w  d  l  gf  ga  gd  pts
0         1          Arsenal  25  17  5  3  49  17  32   56
1         2  Manchester City  24  14  5  5  49  23  26   47
2         3      Aston Villa  25  14  5  6  36  27   9   47

Serie A (Italy)
   position            team  gp   w  d  l  gf  ga  gd  pts
0         1  Internazionale  23  18  1  4  52  19  33   55
1         2        AC Milan  23  14  8  1  38  17  21   50
2         3          Napoli  24  15  4  5  36  23  13   49

La Liga (Spain)
   position             team  gp   w  d  l  gf  ga  gd  pts
0         1        Barcelona  23  19  1  3  63  23  40   58
1         2      Real Madrid  22  17  3  2  47  18  29   54
2         3  Atl√©tico Madrid  22  13  6  3  38  17  21   45

Bundesliga (Germany)
   position               team  gp   w  d  l  gf  ga  gd  pts
0         1      Bayern Munich  20  16  3  1  74  18  56   51
1         2  Borussia Dortmund  21  14  6  1  43  20  23   48
2         3    

In [59]:
leagues_data = {
    "Premier League (England)": premierleague_england,
    "Serie A (Italy)": seriea_italy,
    "La Liga (Spain)": laliga_spain,
    "Bundesliga (Germany)": bundesliga_germany,
    "Ligue 1 (France)": ligue1_france,
}

matches_unplayed_ = {}

for league_name, df in leagues_data.items():
    num_teams = len(df)
    total_matches = num_teams * (num_teams - 1)  # double round-robin, total matches counted twice for GP
    matches_played = df["gp"].sum() /2              # GP already counts each match per team
    matches_unplayed = total_matches - matches_played
    
    matches_unplayed_[league_name] = matches_unplayed
    print(f"{league_name}: {matches_unplayed} matches unplayed")

Premier League (England): 132.0 matches unplayed
Serie A (Italy): 147.0 matches unplayed
La Liga (Spain): 158.0 matches unplayed
Bundesliga (Germany): 120.0 matches unplayed
Ligue 1 (France): 122.0 matches unplayed


## 2. Get betting odds using API

In [4]:
# Load variables from API_KEY.env
load_dotenv("API_KEY.env")

API_KEY = os.getenv("ODDS_DATA_API_KEY")

if API_KEY is None:
    raise ValueError("API_KEY not found. Check API_KEY.env")

print("API key loaded successfully")

API key loaded successfully


In [5]:
import requests

API_KEY = API_KEY  # assuming already defined

leagues = {
    "soccer_epl": "odds_premierleague_england",
    "soccer_italy_serie_a": "odds_seriea_italy",
    "soccer_spain_la_liga": "odds_laliga_spain",
    "soccer_germany_bundesliga": "odds_bundesliga_germany",
    "soccer_france_ligue_one": "odds_ligue1_france",
}

base_url = "https://api.the-odds-api.com/v4/sports/{}/odds"

params = {
    "apiKey": API_KEY,
    "regions": "uk",
    "markets": "h2h",
    "oddsFormat": "decimal",
    "dateFormat": "iso",
    "days": 365
}

for sport_key, var_name in leagues.items():
    url = base_url.format(sport_key)

    response = requests.get(url, params=params)
    response.raise_for_status()

    globals()[var_name] = response.json()

In [6]:
print("Premier League (England):", len(odds_premierleague_england))
print("Serie A (Italy):", len(odds_seriea_italy))
print("La Liga (Spain):", len(odds_laliga_spain))
print("Bundesliga (Germany):", len(odds_bundesliga_germany))
print("Ligue 1 (France):", len(odds_ligue1_france))

Premier League (England): 22
Serie A (Italy): 17
La Liga (Spain): 16
Bundesliga (Germany): 11
Ligue 1 (France): 19


In [7]:
def flatten_odds(data):
    rows = []

    for match in data:
        match_id = match["id"]
        home = match["home_team"]
        away = match["away_team"]
        time = match["commence_time"]

        for book in match["bookmakers"]:
            bookmaker = book["title"]

            # Find head-to-head (h2h) market. Find the market where key == 'h2h' (win/draw/win odds). If not found, skip this bookmaker.
            h2h = next((m for m in book["markets"] if m["key"] == "h2h"), None)
            if not h2h:
                continue

            outcomes = {o["name"]: o["price"] for o in h2h["outcomes"]}

            rows.append({
                "match_id": match_id,
                "commence_time": time,
                "home_team": home,
                "away_team": away,
                "bookmaker": bookmaker,
                "home_odds": outcomes.get(home),
                "draw_odds": outcomes.get("Draw"),
                "away_odds": outcomes.get(away),
            })

    return pd.DataFrame(rows)

In [8]:
# Flatten odds into DataFrames
df_premierleague_england = flatten_odds(odds_premierleague_england)
df_seriea_italy = flatten_odds(odds_seriea_italy)
df_laliga_spain = flatten_odds(odds_laliga_spain)
df_bundesliga_germany = flatten_odds(odds_bundesliga_germany)
df_ligue1_france = flatten_odds(odds_ligue1_france)

In [9]:
print("\nPremier League (England)")
print(df_premierleague_england.head(3))

print("\nSerie A (Italy)")
print(df_seriea_italy.head(3))

print("\nLa Liga (Spain)")
print(df_laliga_spain.head(3))

print("\nBundesliga (Germany)")
print(df_bundesliga_germany.head(3))

print("\nLigue 1 (France)")
print(df_ligue1_france.head(3))


Premier League (England)
                           match_id         commence_time                 home_team       away_team    bookmaker  home_odds  draw_odds  away_odds
0  a7f9683fe58c4fc6a5ac52396f279456  2026-02-08T14:00:00Z  Brighton and Hove Albion  Crystal Palace  Unibet (UK)       2.00        3.6       3.75
1  a7f9683fe58c4fc6a5ac52396f279456  2026-02-08T14:00:00Z  Brighton and Hove Albion  Crystal Palace      Sky Bet       1.95        3.5       3.75
2  a7f9683fe58c4fc6a5ac52396f279456  2026-02-08T14:00:00Z  Brighton and Hove Albion  Crystal Palace  Paddy Power       1.95        3.4       3.75

Serie A (Italy)
                           match_id         commence_time home_team away_team     bookmaker  home_odds  draw_odds  away_odds
0  14cea9dda59eac8cbb063c8d777171bd  2026-02-08T11:30:00Z   Bologna     Parma  William Hill       1.65        3.7        5.0
1  14cea9dda59eac8cbb063c8d777171bd  2026-02-08T11:30:00Z   Bologna     Parma      888sport       1.61        3.7        5.

In [10]:
def bookmaker_implied_probs(df):
    # Convert odds to implied probabilities per bookmaker
    df = df.assign(
        p_home_raw=1 / df["home_odds"],
        p_draw_raw=1 / df["draw_odds"],
        p_away_raw=1 / df["away_odds"],
    )

    # Remove bookmaker margin (normalise)
    total = (
        df["p_home_raw"] +
        df["p_draw_raw"] +
        df["p_away_raw"]
    )

    df = df.assign(
        p_home_book=df["p_home_raw"] / total,
        p_draw_book=df["p_draw_raw"] / total,
        p_away_book=df["p_away_raw"] / total,
    )

    # Average normalised probabilities across bookmakers
    betting_odds_avg = (
        df.groupby(["match_id", "home_team", "away_team"], as_index=False)
          .agg(
              p_home_book=("p_home_book", "mean"),
              p_draw_book=("p_draw_book", "mean"),
              p_away_book=("p_away_book", "mean"),
          )
    )

    # Keep only required fields
    betting_odds_avg = betting_odds_avg[
        [
            "home_team",
            "away_team",
            "p_home_book",
            "p_draw_book",
            "p_away_book",
        ]
    ]

    return betting_odds_avg

In [11]:
betting_odds_premierleague_england = bookmaker_implied_probs(df_premierleague_england)
betting_odds_seriea_italy = bookmaker_implied_probs(df_seriea_italy)
betting_odds_laliga_spain = bookmaker_implied_probs(df_laliga_spain)
betting_odds_bundesliga_germany = bookmaker_implied_probs(df_bundesliga_germany)
betting_odds_ligue1_france = bookmaker_implied_probs(df_ligue1_france)

In [12]:
print("\nPremier League (England)")
print(betting_odds_premierleague_england.head(3))

print("\nSerie A (Italy)")
print(betting_odds_seriea_italy.head(3))

print("\nLa Liga (Spain)")
print(betting_odds_laliga_spain.head(3))

print("\nBundesliga (Germany)")
print(betting_odds_bundesliga_germany.head(3))

print("\nLigue 1 (France)")
print(betting_odds_ligue1_france.head(3))


Premier League (England)
        home_team                away_team  p_home_book  p_draw_book  p_away_book
0  Crystal Palace                  Burnley     0.580750     0.248025     0.171224
1  Crystal Palace  Wolverhampton Wanderers     0.509145     0.275321     0.215534
2     Aston Villa             Leeds United     0.508923     0.264455     0.226623

Serie A (Italy)
     home_team away_team  p_home_book  p_draw_book  p_away_book
0      Bologna     Parma     0.567115     0.252303     0.180582
1  Inter Milan  Juventus     0.449615     0.283706     0.266680
2         Pisa  AC Milan     0.154081     0.237670     0.608250

La Liga (Spain)
     home_team      away_team  p_home_book  p_draw_book  p_away_book
0       Getafe     Villarreal     0.326691     0.312062     0.361247
1   Villarreal       Espanyol     0.559892     0.237987     0.202121
2  Real Madrid  Real Sociedad     0.690659     0.179938     0.129403

Bundesliga (Germany)
       home_team       away_team  p_home_book  p_draw_book

## 3. Get fixtures for upcoming EPL games

In [13]:
# Load variables from API_KEY.env
load_dotenv("API_KEY.env")

API_KEY = os.getenv("FOOTBALL_DATA_API_KEY")

if API_KEY is None:
    raise ValueError("API_KEY not found. Check API_KEY.env")

print("API key loaded successfully")

API key loaded successfully


In [14]:
competitions = {
    "PL": "fixtures_premierleague_england",
    "SA": "fixtures_seriea_italy",
    "PD": "fixtures_laliga_spain",
    "BL1": "fixtures_bundesliga_germany",
    "FL1": "fixtures_ligue1_france",
}

headers = {
    "X-Auth-Token": API_KEY
}

today = datetime.utcnow().date()
end_of_season = today + timedelta(days=365)

params = {
    "status": "SCHEDULED",
    "dateFrom": today.isoformat(),
    "dateTo": end_of_season.isoformat()
}

for comp_code, df_name in competitions.items():
    url = f"https://api.football-data.org/v4/competitions/{comp_code}/matches"

    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()

    data = response.json()
    fixtures = data["matches"]

    df_fixtures = pd.DataFrame(fixtures)

    df_fixtures_clean = df_fixtures[
        ["utcDate", "status", "homeTeam", "awayTeam"]
    ].copy()  # copy avoids SettingWithCopyWarning

    # Extract team names
    df_fixtures_clean["homeTeam"] = df_fixtures_clean["homeTeam"].apply(lambda x: x["name"])
    df_fixtures_clean["awayTeam"] = df_fixtures_clean["awayTeam"].apply(lambda x: x["name"])

    globals()[df_name] = df_fixtures_clean

  today = datetime.utcnow().date()


In [15]:
print("Premier League (England):", len(fixtures_premierleague_england))
print("Serie A (Italy):", len(fixtures_seriea_italy))
print("La Liga (Spain):", len(fixtures_laliga_spain))
print("Bundesliga (Germany):", len(fixtures_bundesliga_germany))
print("Ligue 1 (France):", len(fixtures_ligue1_france))

Premier League (England): 131
Serie A (Italy): 147
La Liga (Spain): 156
Bundesliga (Germany): 120
Ligue 1 (France): 122


In [16]:
print("Premier League (England):", fixtures_premierleague_england.head(3))
print("Serie A (Italy):", fixtures_seriea_italy.head(3))
print("La Liga (Spain):", fixtures_laliga_spain.head(3))
print("Bundesliga (Germany):", fixtures_bundesliga_germany.head(3))
print("Ligue 1 (France):", fixtures_ligue1_france.head(3))

Premier League (England):                 utcDate status                   homeTeam            awayTeam
0  2026-02-08T14:00:00Z  TIMED  Brighton & Hove Albion FC   Crystal Palace FC
1  2026-02-08T16:30:00Z  TIMED               Liverpool FC  Manchester City FC
2  2026-02-10T19:30:00Z  TIMED                 Chelsea FC     Leeds United FC
Serie A (Italy):                 utcDate status            homeTeam                  awayTeam
0  2026-02-08T11:30:00Z  TIMED     Bologna FC 1909         Parma Calcio 1913
1  2026-02-08T14:00:00Z  TIMED            US Lecce            Udinese Calcio
2  2026-02-08T17:00:00Z  TIMED  US Sassuolo Calcio  FC Internazionale Milano
La Liga (Spain):                 utcDate status                 homeTeam             awayTeam
0  2026-02-08T13:00:00Z  TIMED         Deportivo Alav√©s            Getafe CF
1  2026-02-08T15:15:00Z  TIMED            Athletic Club           Levante UD
2  2026-02-08T17:30:00Z  TIMED  Club Atl√©tico de Madrid  Real Betis Balompi√©
Bundeslig

## 4. Get this season (2025/26) and last season (2024/25) results

In [18]:
competitions = {
    "PL": "premierleague_england",
    "SA": "seriea_italy",
    "PD": "laliga_spain",
    "BL1": "bundesliga_germany",
    "FL1": "ligue1_france",
}

seasons = [2025, 2024]  # finished seasons you want

headers = {
    "X-Auth-Token": API_KEY
}

for comp_code, league_name in competitions.items():
    for season in seasons:
        url = f"https://api.football-data.org/v4/competitions/{comp_code}/matches"
        params = {
            "season": season,
            "status": "FINISHED"
        }

        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()

        matches = response.json()["matches"]

        clean_rows = []
        for m in matches:
            clean_rows.append({
                "utcDate": m["utcDate"],
                "matchday": m["matchday"],
                "status": m["status"],
                "homeTeam": m["homeTeam"]["name"],
                "awayTeam": m["awayTeam"]["name"],
                "homeGoals": m["score"]["fullTime"]["home"],
                "awayGoals": m["score"]["fullTime"]["away"],
                "winner": m["score"]["winner"],
            })

        df_clean = pd.DataFrame(clean_rows)

        globals()[f"past_matches_{league_name}_{season}_clean"] = df_clean

In [19]:
for league in [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]:
    for season in [2025, 2024]:
        df = globals()[f"past_matches_{league}_{season}_clean"]
        print(f"\n{league.replace('_', ' ').title()} ‚Äì Season {season}")
        print(df.tail(2))


Premierleague England ‚Äì Season 2025
                  utcDate  matchday    status                    homeTeam      awayTeam  homeGoals  awayGoals     winner
246  2026-02-07T15:00:00Z        25  FINISHED  Wolverhampton Wanderers FC    Chelsea FC          1          3  AWAY_TEAM
247  2026-02-07T17:30:00Z        25  FINISHED         Newcastle United FC  Brentford FC          2          3  AWAY_TEAM

Premierleague England ‚Äì Season 2024
                  utcDate  matchday    status                    homeTeam                   awayTeam  homeGoals  awayGoals     winner
378  2025-05-25T15:00:00Z        38  FINISHED        Tottenham Hotspur FC  Brighton & Hove Albion FC          1          4  AWAY_TEAM
379  2025-05-25T15:00:00Z        38  FINISHED  Wolverhampton Wanderers FC               Brentford FC          1          1       DRAW

Seriea Italy ‚Äì Season 2025
                  utcDate  matchday    status        homeTeam    awayTeam  homeGoals  awayGoals     winner
231  2026-02-07T17:0

## 5. Combine and calculate probabilities of W/D/L for each match

In [20]:
leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

for league in leagues:
    # Load DataFrames
    df_current = globals()[f"past_matches_{league}_2025_clean"]
    df_prev = globals()[f"past_matches_{league}_2024_clean"]
    df_future = globals()[f"fixtures_{league}"]

    # Combine all past fixtures together
    df_all = pd.concat([df_prev, df_current], ignore_index=True)

    # Store results
    globals()[f"past_matches_{league}_all"] = df_all
    globals()[f"future_matches_{league}"] = df_future

In [21]:
leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

for league in leagues:
    df_all = globals()[f"past_matches_{league}_all"].copy()

    # Convert date
    df_all["date"] = pd.to_datetime(df_all["utcDate"])

    # Sort so newer matches get higher weight
    df_all = df_all.sort_values("date").reset_index(drop=True)

    # Add linear weights (oldest ‚Üí newest)
    df_all["weight"] = np.linspace(1, 2, len(df_all))

    # Store weighted dataset
    globals()[f"past_matches_{league}_weighted"] = df_all

In [22]:
for league in [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]:
    df = globals()[f"past_matches_{league}_weighted"]
    print(f"\n{league.replace('_', ' ').title()} ‚Äì weighted past matches")
    print(df.tail(2))


Premierleague England ‚Äì weighted past matches
                  utcDate  matchday    status                    homeTeam      awayTeam  homeGoals  awayGoals     winner                      date    weight
626  2026-02-07T15:00:00Z        25  FINISHED  Wolverhampton Wanderers FC    Chelsea FC          1          3  AWAY_TEAM 2026-02-07 15:00:00+00:00  1.998405
627  2026-02-07T17:30:00Z        25  FINISHED         Newcastle United FC  Brentford FC          2          3  AWAY_TEAM 2026-02-07 17:30:00+00:00  2.000000

Seriea Italy ‚Äì weighted past matches
                  utcDate  matchday    status        homeTeam    awayTeam  homeGoals  awayGoals     winner                      date    weight
611  2026-02-07T17:00:00Z        24  FINISHED       Genoa CFC  SSC Napoli          2          3  AWAY_TEAM 2026-02-07 17:00:00+00:00  1.998366
612  2026-02-07T19:45:00Z        24  FINISHED  ACF Fiorentina   Torino FC          2          2       DRAW 2026-02-07 19:45:00+00:00  2.000000

Laliga Spa

In [23]:
# compute home advantage per league and save it to globals().

leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

home_advantage_by_league = {}

for league in leagues:
    # Access the weighted past matches for this league
    df_all = globals()[f"past_matches_{league}_weighted"]

    # Compute home advantage
    home_adv = df_all["homeGoals"].mean() - df_all["awayGoals"].mean()

    # Save to dictionary
    home_advantage_by_league[league] = home_adv

    # Save to globals (for your Poisson model)
    globals()[f"home_advantage_{league}"] = home_adv

    # Print nicely
    print(f"{league.replace('_', ' ').title()}: {home_adv:.3f}")

Premierleague England: 0.186
Seriea Italy: 0.124
Laliga Spain: 0.332
Bundesliga Germany: 0.181
Ligue1 France: 0.317


In [24]:
leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

for league in leagues:
    df_all = globals()[f"past_matches_{league}_weighted"]

    # All teams in the league
    teams = pd.unique(df_all[["homeTeam", "awayTeam"]].values.ravel("K"))

    attack = pd.Series(1.0, index=teams)
    defense = pd.Series(1.0, index=teams)

    team_stats = {}

    for team in teams:
        home_games = df_all[df_all["homeTeam"] == team]
        away_games = df_all[df_all["awayTeam"] == team]

        goals_scored = (
            (home_games["homeGoals"] * home_games["weight"]).sum() +
            (away_games["awayGoals"] * away_games["weight"]).sum()
        )

        goals_against = (
            (home_games["awayGoals"] * home_games["weight"]).sum() +
            (away_games["homeGoals"] * away_games["weight"]).sum()
        )

        matches = home_games["weight"].sum() + away_games["weight"].sum()

        team_stats[team] = {
            "scored": goals_scored / matches,
            "against": goals_against / matches
        }

    # League average goals per team per match
    league_avg_scored = (
        df_all["homeGoals"].mean() + df_all["awayGoals"].mean()
    ) / 2

    for team in teams:
        attack[team] = team_stats[team]["scored"] / league_avg_scored
        defense[team] = team_stats[team]["against"] / league_avg_scored

    # Store outputs
    globals()[f"attack_{league}"] = attack
    globals()[f"defense_{league}"] = defense
    globals()[f"league_avg_scored_{league}"] = league_avg_scored

üî• Summary

This function:
+ Calculates expected goals for each team
+ Uses Poisson distribution to compute goal probabilities
+ Converts score probabilities into match outcome probabilities
+ Returns probabilities for:
++ home win
++ draw
++ away win

The Poisson distribution models the number of goals a team scores in a match based on an expected goal rate (Œª). Using the formula \(P(X=k)=e^{-\lambda}\lambda^k/k!\), it calculates the probability of scoring 0, 1, 2, ‚Ä¶ goals, where Œª is estimated from team attack/defense strengths and league averages. In the model, I compute separate Poisson probabilities for home and away goals, then combine them to get the probabilities of each possible scoreline and therefore the probabilities of a home win, draw, or away win.


In [25]:
def match_probabilities_league(
    home,
    away,
    attack,
    defense,
    league_avg_scored,
    home_advantage,
    max_goals=6,
):
    # expected goals
    exp_home = np.exp(
        np.log(league_avg_scored)
        + np.log(attack[home])
        + np.log(defense[away])
        + home_advantage
    )

    exp_away = np.exp(
        np.log(league_avg_scored)
        + np.log(attack[away])
        + np.log(defense[home])
    )

    p_home = poisson.pmf(range(max_goals + 1), exp_home)
    p_away = poisson.pmf(range(max_goals + 1), exp_away)

    p_win = p_draw = p_loss = 0.0

    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            prob = p_home[i] * p_away[j]
            if i > j:
                p_win += prob
            elif i == j:
                p_draw += prob
            else:
                p_loss += prob

    return p_win, p_draw, p_loss

In [67]:
leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

for league in leagues:
    df_future = globals()[f"fixtures_{league}"]  # corrected variable name

    attack = globals()[f"attack_{league}"]
    defense = globals()[f"defense_{league}"]
    league_avg_scored = globals()[f"league_avg_scored_{league}"]
    home_advantage = globals()[f"home_advantage_{league}"]

    results = []

    for _, row in df_future.iterrows():
        home = row["homeTeam"]
        away = row["awayTeam"]

        p_win, p_draw, p_loss = match_probabilities_league(
            home,
            away,
            attack,
            defense,
            league_avg_scored,
            home_advantage,
        )

        results.append({
            "utcDate": row["utcDate"],
            "homeTeam": home,
            "awayTeam": away,
            "p_home_win": p_win,
            "p_draw": p_draw,
            "p_away_win": p_loss,
        })

    globals()[f"df_odds_{league}"] = pd.DataFrame(results)

In [68]:
for league in leagues:
    print(f"\n=== {league.upper()} ===")
    print(globals()[f"df_odds_{league}"].head(2))


=== PREMIERLEAGUE_ENGLAND ===
                utcDate                   homeTeam            awayTeam  p_home_win    p_draw  p_away_win
0  2026-02-08T14:00:00Z  Brighton & Hove Albion FC   Crystal Palace FC    0.488375  0.236573    0.272690
1  2026-02-08T16:30:00Z               Liverpool FC  Manchester City FC    0.394116  0.226530    0.375626

=== SERIEA_ITALY ===
                utcDate         homeTeam           awayTeam  p_home_win    p_draw  p_away_win
0  2026-02-08T11:30:00Z  Bologna FC 1909  Parma Calcio 1913    0.583324  0.224736    0.188892
1  2026-02-08T14:00:00Z         US Lecce     Udinese Calcio    0.261169  0.294228    0.444289

=== LALIGA_SPAIN ===
                utcDate          homeTeam    awayTeam  p_home_win    p_draw  p_away_win
0  2026-02-08T13:00:00Z  Deportivo Alav√©s   Getafe CF    0.445404  0.308900    0.245490
1  2026-02-08T15:15:00Z     Athletic Club  Levante UD    0.631968  0.205064    0.157929

=== BUNDESLIGA_GERMANY ===
                utcDate           h

## 6. Compare calculated probabilities to bookmaker ones

In [69]:
# --- Step 1: Normalization function ---
def normalize_team(name):
    name = name.lower()
    name = name.replace(" fc", "")
    name = name.replace(" afc", "")
    name = name.replace("&", "and")
    name = name.replace("afc ", "")
    name = name.replace(".", "")  # remove periods
    name = name.replace("  ", " ")  # remove double spaces
    name = name.strip()
    return name

# --- Step 2: Manual mapping for remaining differences ---
manual_mapping = {
    # Premier League
    "a bournemouth": "bournemouth",
    "sunderland a": "sunderland",

    # Serie A
    "ac pisa 1909": "pisa",
    "acf fiorentina": "fiorentina",
    "bologna 1909": "bologna",
    "ssc napoli": "napoli",
    "ss lazio": "lazio",
    "genoa cfc": "genoa",
    "parma calcio 1913": "parma",
    "us sassuolo calcio": "sassuolo",
    "fc internazionale milano": "inter milan",
    "us lecce": "lecce",
    "como 1907": "como",
    "us cremonese": "cremonese",
    "cagliari calcio": "cagliari",
    "udinese calcio": "udinese",

    # La Liga
    "rayo vallecano de madrid": "rayo vallecano",
    "valencia cf": "valencia",
    "athletic club": "athletic bilbao",
    "fc barcelona": "barcelona",
    "getafe cf": "getafe",
    "real betis balompi√©": "real betis",
    "rcd mallorca": "mallorca",
    "club atl√©tico de madrid": "atl√©tico madrid",
    "villarreal cf": "villarreal",
    "rc celta de vigo": "celta vigo",
    "real madrid cf": "real madrid",
    "rcd espanyol de barcelona": "espanyol",
    "real oviedo": "oviedo",
    "levante ud": "levante",
    "real sociedad de f√∫tbol": "real sociedad",
    "deportivo alav√©s": "alav√©s",

    # Bundesliga
    "tsg 1899 hoffenheim": "tsg hoffenheim",
    "fc st pauli 1910": "fc st pauli",
    "fc augsburg": "augsburg",
    "bayer 04 leverkusen": "bayer leverkusen",
    "1 union berlin": "union berlin",
    "1 heidenheim 1846": "1 heidenheim",
    "1 fsv mainz 05": "fsv mainz 05",
    "fc bayern m√ºnchen": "bayern munich",
    "borussia m√∂nchengladbach": "borussia monchengladbach",
    "sv werder bremen": "werder bremen",

    # Ligue 1
    "rc strasbourg alsace": "strasbourg",
    "lille osc": "lille",
    "fc lorient": "lorient",
    "stade brestois 29": "brest",
    "stade rennais 1901": "rennes",
    "fc metz": "metz",
    "angers sco": "angers",
    "olympique lyonnais": "lyon",
    "olympique de marseille": "marseille",
    "paris saint-germain": "paris saint germain",
    "fc nantes": "nantes",
    "le havre ac": "le havre",
    "racing club de lens": "rc lens",
    "ogc nice": "nice",
    "aj auxerre": "auxerre",
}

# --- Step 3: Normalize model and bookmaker names ---
leagues = [
    "premierleague_england",
    "seriea_italy",
    "laliga_spain",
    "bundesliga_germany",
    "ligue1_france",
]

for league in leagues:
    print(f"\n=== {league.replace('_', ' ').title()} ===")

    # Fetch dfs
    df_odds = globals()[f"df_odds_{league}"]
    betting_odds = globals()[f"betting_odds_{league}"]

    # Normalize
    df_odds["home_norm"] = df_odds["homeTeam"].apply(lambda x: manual_mapping.get(normalize_team(x), normalize_team(x)))
    df_odds["away_norm"] = df_odds["awayTeam"].apply(lambda x: manual_mapping.get(normalize_team(x), normalize_team(x)))

    betting_odds["home_norm"] = betting_odds["home_team"].apply(lambda x: manual_mapping.get(normalize_team(x), normalize_team(x)))
    betting_odds["away_norm"] = betting_odds["away_team"].apply(lambda x: manual_mapping.get(normalize_team(x), normalize_team(x)))

    # Compare
    model_teams = set(df_odds["home_norm"].unique()) | set(df_odds["away_norm"].unique())
    book_teams = set(betting_odds["home_norm"].unique()) | set(betting_odds["away_norm"].unique())

    print("Teams in model not in bookmaker:", model_teams - book_teams)
    print("Teams in bookmaker not in model:", book_teams - model_teams)


=== Premierleague England ===
Teams in model not in bookmaker: set()
Teams in bookmaker not in model: set()

=== Seriea Italy ===
Teams in model not in bookmaker: set()
Teams in bookmaker not in model: set()

=== Laliga Spain ===
Teams in model not in bookmaker: set()
Teams in bookmaker not in model: set()

=== Bundesliga Germany ===
Teams in model not in bookmaker: set()
Teams in bookmaker not in model: set()

=== Ligue1 France ===
Teams in model not in bookmaker: set()
Teams in bookmaker not in model: set()


In [70]:
# Map the short league names to the full league names used in matches_unplayed_
league_name_map = {
    "premierleague_england": "Premier League (England)",
    "seriea_italy": "Serie A (Italy)",
    "laliga_spain": "La Liga (Spain)",
    "bundesliga_germany": "Bundesliga (Germany)",
    "ligue1_france": "Ligue 1 (France)"
}

df_compare_all = {}

for league in leagues:
    full_name = league_name_map[league]
    print(f"\n=== {full_name} ===")
    
    df_odds = globals()[f"df_odds_{league}"]
    betting_odds = globals()[f"betting_odds_{league}"]

    # Merge on normalized names using outer join to catch all differences
    df_compare = df_odds.merge(
        betting_odds,
        left_on=["home_norm", "away_norm"],
        right_on=["home_norm", "away_norm"],
        how="outer",
        indicator=True
    )

    total_model = len(df_odds)
    total_book = len(betting_odds)
    matched = len(df_compare[df_compare["_merge"] == "both"])
    matches_to_play = matches_unplayed_[full_name]  # now uses correct key

    print(f"Total model rows: {total_model}")
    print(f"Matches unplayed (expected): {matches_to_play}")
    print(f"Total bookmaker rows: {total_book}")
    print(f"Matched rows: {matched}")

    # Flag differences
    if total_model != matches_to_play:
        print(f"‚ö†Ô∏è Warning: Total model rows ({total_model}) != matches unplayed ({matches_to_play})")
    if matched != total_book:
        print(f"‚ö†Ô∏è Warning: Not all bookmaker rows matched ({matched} matched vs {total_book} in bookmaker)")

    # Show only truly missing matches
    if len(unmatched_book) > 0:
        print("\nBookmaker rows not in model (‚ö†Ô∏è truly missing matches):")
        display(unmatched_book[["home_team", "away_team", "home_norm", "away_norm"]])
    else:
        print("‚úÖ All bookmaker matches exist in the model")


    df_compare_all[league] = df_compare


=== Premier League (England) ===
Total model rows: 131
Matches unplayed (expected): 132.0
Total bookmaker rows: 22
Matched rows: 22
‚úÖ All bookmaker matches exist in the model

=== Serie A (Italy) ===
Total model rows: 147
Matches unplayed (expected): 147.0
Total bookmaker rows: 17
Matched rows: 17
‚úÖ All bookmaker matches exist in the model

=== La Liga (Spain) ===
Total model rows: 156
Matches unplayed (expected): 158.0
Total bookmaker rows: 16
Matched rows: 15
‚úÖ All bookmaker matches exist in the model

=== Bundesliga (Germany) ===
Total model rows: 120
Matches unplayed (expected): 120.0
Total bookmaker rows: 11
Matched rows: 11
‚úÖ All bookmaker matches exist in the model

=== Ligue 1 (France) ===
Total model rows: 122
Matches unplayed (expected): 122.0
Total bookmaker rows: 19
Matched rows: 19
‚úÖ All bookmaker matches exist in the model


In [71]:
for league in leagues:
    df_odds = globals()[f"df_odds_{league}"]
    betting_odds = globals()[f"betting_odds_{league}"]

    # Identify bookmaker matches missing from the model
    model_matches = set(zip(df_odds["home_norm"], df_odds["away_norm"]))
    book_matches = set(zip(betting_odds["home_norm"], betting_odds["away_norm"]))
    missing_matches = book_matches - model_matches

    # If any are missing, append them to the model DataFrame
    if missing_matches:
        print(f"{league}: Adding {len(missing_matches)} missing matches from bookmaker")
        rows_to_add = []
        for home_norm, away_norm in missing_matches:
            row = betting_odds[
                (betting_odds["home_norm"] == home_norm) &
                (betting_odds["away_norm"] == away_norm)
            ].iloc[0]  # take first row if duplicates
            rows_to_add.append({
                "utcDate": row.get("utcDate", pd.Timestamp.now()),  # fallback if missing
                "homeTeam": row["home_team"],
                "awayTeam": row["away_team"],
                "p_home_win": np.nan,  # model probability unknown
                "p_draw": np.nan,
                "p_away_win": np.nan,
                "home_norm": home_norm,
                "away_norm": away_norm,
            })
        df_odds = pd.concat([df_odds, pd.DataFrame(rows_to_add)], ignore_index=True)
        globals()[f"df_odds_{league}"] = df_odds

laliga_spain: Adding 1 missing matches from bookmaker


In [74]:
rmse_results = {}

for league in leagues:
    print(f"\n=== {league.replace('_', ' ').title()} ===")
    
    df_compare = df_compare_all[league].copy()

    # Skip rows where bookmaker odds are missing
    df_compare = df_compare[df_compare["p_home_book"].notna()]

    # Compute differences
    df_compare["diff_home"] = df_compare["p_home_win"] - df_compare["p_home_book"]
    df_compare["diff_draw"] = df_compare["p_draw"] - df_compare["p_draw_book"]
    df_compare["diff_away"] = df_compare["p_away_win"] - df_compare["p_away_book"]

    # RMSE per outcome
    rmse_home = np.sqrt(np.mean(df_compare["diff_home"]**2))
    rmse_draw = np.sqrt(np.mean(df_compare["diff_draw"]**2))
    rmse_away = np.sqrt(np.mean(df_compare["diff_away"]**2))

    # Total RMSE
    rmse_total = np.sqrt(np.mean(
        df_compare["diff_home"]**2 +
        df_compare["diff_draw"]**2 +
        df_compare["diff_away"]**2
    ))

    rmse_avg = np.mean([rmse_home, rmse_draw, rmse_away])

    # Absolute differences
    df_compare["abs_diff"] = abs(df_compare["diff_home"]) + abs(df_compare["diff_draw"]) + abs(df_compare["diff_away"])

    # Top 10 differences
    top5_diff = df_compare.sort_values("abs_diff", ascending=False)[
        ["homeTeam", "awayTeam", "diff_home", "diff_draw", "diff_away"]
    ].head(5)

    # Store results
    rmse_results[league] = {
        "rmse_home": rmse_home,
        "rmse_draw": rmse_draw,
        "rmse_away": rmse_away,
        "rmse_total": rmse_total,
        "rmse_avg": rmse_avg,
        "top10_diff": top10_diff
    }

    # Print summary
    print(f"RMSE Home: {rmse_home:.4f}, Draw: {rmse_draw:.4f}, Away: {rmse_away:.4f}")
    print(f"RMSE Total: {rmse_total:.4f}, Average: {rmse_avg:.4f}")
    #print("Top 5 differences:")
    #display(top5_diff)


=== Premierleague England ===
RMSE Home: 0.0562, Draw: 0.0406, Away: 0.0474
RMSE Total: 0.0840, Average: 0.0481

=== Seriea Italy ===
RMSE Home: 0.0514, Draw: 0.0352, Away: 0.0574
RMSE Total: 0.0847, Average: 0.0480

=== Laliga Spain ===
RMSE Home: 0.0753, Draw: 0.0509, Away: 0.0669
RMSE Total: 0.1129, Average: 0.0644

=== Bundesliga Germany ===
RMSE Home: 0.0488, Draw: 0.0415, Away: 0.0461
RMSE Total: 0.0789, Average: 0.0455

=== Ligue1 France ===
RMSE Home: 0.0812, Draw: 0.0607, Away: 0.0547
RMSE Total: 0.1152, Average: 0.0655


## 7. Replace my estimates probabilities with the ones I have from odds, creating my final match probabilities

In [75]:
df_final_probabilities_all = {}

for league in leagues:
    print(f"\n=== {league.replace('_', ' ').title()} ===")
    
    df_odds = globals()[f"df_odds_{league}"]
    betting_odds_avg = globals()[f"betting_odds_{league}"]

    # Merge model and bookmaker probabilities
    df_final_probabilities = df_odds.merge(
        betting_odds_avg,
        left_on=["home_norm", "away_norm"],
        right_on=["home_norm", "away_norm"],
        how="left"
    )

    # Keep relevant columns
    df_final_probabilities = df_final_probabilities[[
        "utcDate",
        "homeTeam",
        "awayTeam",
        "p_home_win",
        "p_draw",
        "p_away_win",
        "p_home_book",
        "p_draw_book",
        "p_away_book",
    ]]

    # Replace model probabilities with bookmaker odds where available
    df_final_probabilities["p_home_final"] = np.where(
        df_final_probabilities["p_home_book"].notna(),
        df_final_probabilities["p_home_book"],
        df_final_probabilities["p_home_win"]
    )

    df_final_probabilities["p_draw_final"] = np.where(
        df_final_probabilities["p_draw_book"].notna(),
        df_final_probabilities["p_draw_book"],
        df_final_probabilities["p_draw"]
    )

    df_final_probabilities["p_away_final"] = np.where(
        df_final_probabilities["p_away_book"].notna(),
        df_final_probabilities["p_away_book"],
        df_final_probabilities["p_away_win"]
    )

    print("Used betting odds:", df_final_probabilities["p_home_book"].notna().sum())
    print("Used model:", df_final_probabilities["p_home_book"].isna().sum())

    # Keep only final probabilities
    df_final_probabilities = df_final_probabilities[[
        "utcDate",
        "homeTeam",
        "awayTeam",
        "p_home_final",
        "p_draw_final",
        "p_away_final"
    ]]

    df_final_probabilities_all[league] = df_final_probabilities
    display(df_final_probabilities.head(2))


=== Premierleague England ===
Used betting odds: 22
Used model: 109


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final
0,2026-02-08T14:00:00Z,Brighton & Hove Albion FC,Crystal Palace FC,0.476419,0.267663,0.255917
1,2026-02-08T16:30:00Z,Liverpool FC,Manchester City FC,0.408544,0.254683,0.336774



=== Seriea Italy ===
Used betting odds: 17
Used model: 130


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final
0,2026-02-08T11:30:00Z,Bologna FC 1909,Parma Calcio 1913,0.567115,0.252303,0.180582
1,2026-02-08T14:00:00Z,US Lecce,Udinese Calcio,0.299649,0.333784,0.366567



=== Laliga Spain ===
Used betting odds: 16
Used model: 141


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final
0,2026-02-08T13:00:00Z,Deportivo Alav√©s,Getafe CF,0.402789,0.349568,0.247642
1,2026-02-08T15:15:00Z,Athletic Club,Levante UD,0.582598,0.245482,0.17192



=== Bundesliga Germany ===
Used betting odds: 11
Used model: 109


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final
0,2026-02-08T14:30:00Z,1. FC K√∂ln,RB Leipzig,0.282259,0.247316,0.470425
1,2026-02-08T16:30:00Z,FC Bayern M√ºnchen,TSG 1899 Hoffenheim,0.757061,0.138074,0.104865



=== Ligue1 France ===
Used betting odds: 19
Used model: 108


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final
0,2026-02-08T14:00:00Z,OGC Nice,AS Monaco FC,0.299656,0.247905,0.452439
1,2026-02-08T16:15:00Z,Le Havre AC,RC Strasbourg Alsace,0.228102,0.263473,0.508425


In [76]:
for league in leagues:
    print(f"\n=== {league.replace('_', ' ').title()} ===")
    df = df_final_probabilities_all[league]
    home_names = df["homeTeam"].unique()
    away_names = df["awayTeam"].unique()
    all_names = sorted(set(home_names) | set(away_names))
    print(all_names)


=== Premierleague England ===
['AFC Bournemouth', 'Arsenal FC', 'Aston Villa FC', 'Brentford FC', 'Brighton & Hove Albion FC', 'Burnley FC', 'Chelsea FC', 'Crystal Palace FC', 'Everton FC', 'Fulham FC', 'Leeds United FC', 'Liverpool FC', 'Manchester City FC', 'Manchester United FC', 'Newcastle United FC', 'Nottingham Forest FC', 'Sunderland AFC', 'Tottenham Hotspur FC', 'West Ham United FC', 'Wolverhampton Wanderers FC']

=== Seriea Italy ===
['AC Milan', 'AC Pisa 1909', 'ACF Fiorentina', 'AS Roma', 'Atalanta BC', 'Bologna FC 1909', 'Cagliari Calcio', 'Como 1907', 'FC Internazionale Milano', 'Genoa CFC', 'Hellas Verona FC', 'Juventus FC', 'Parma Calcio 1913', 'SS Lazio', 'SSC Napoli', 'Torino FC', 'US Cremonese', 'US Lecce', 'US Sassuolo Calcio', 'Udinese Calcio']

=== Laliga Spain ===
['Athletic Club', 'CA Osasuna', 'Club Atl√©tico de Madrid', 'Deportivo Alav√©s', 'Elche CF', 'FC Barcelona', 'Getafe CF', 'Girona', 'Girona FC', 'Levante UD', 'RC Celta de Vigo', 'RCD Espanyol de Barcel

In [77]:
league_name_maps = {
    "Premier League": {
        "Aston Villa FC": "Aston Villa",
        "Brighton & Hove Albion FC": "Brighton & Hove Albion",
        "AFC Bournemouth": "AFC Bournemouth",
        "Bournemouth": "AFC Bournemouth",
        "Sunderland AFC": "Sunderland",
        "Newcastle United FC": "Newcastle United",
        "Manchester City FC": "Manchester City",
        "Manchester United FC": "Manchester United",
        "West Ham United FC": "West Ham United",
        "Wolverhampton Wanderers FC": "Wolverhampton Wanderers",
        "Tottenham Hotspur FC": "Tottenham Hotspur",
        "Crystal Palace FC": "Crystal Palace",
        "Brentford FC": "Brentford",
        "Everton FC": "Everton",
        "Leeds United FC": "Leeds United",
        "Chelsea FC": "Chelsea",
        "Liverpool FC": "Liverpool",
        "Nottingham Forest FC": "Nottingham Forest",
        "Burnley FC": "Burnley",
        "Fulham FC": "Fulham",
        "Arsenal FC": "Arsenal"
    },
    "Serie A": {
        "AC Milan": "AC Milan",
        "AC Pisa 1909": "Pisa",
        "ACF Fiorentina": "Fiorentina",
        "AS Roma": "Roma",
        "Atalanta BC": "Atalanta",
        "Bologna FC 1909": "Bologna",
        "Cagliari Calcio": "Cagliari",
        "Como 1907": "Como",
        "FC Internazionale Milano": "Inter Milan",
        "Genoa CFC": "Genoa",
        "Hellas Verona FC": "Hellas Verona",
        "Juventus FC": "Juventus",
        "Parma Calcio 1913": "Parma",
        "SS Lazio": "Lazio",
        "SSC Napoli": "Napoli",
        "Torino FC": "Torino",
        "US Cremonese": "Cremonese",
        "US Lecce": "Lecce",
        "US Sassuolo Calcio": "Sassuolo",
        "Udinese Calcio": "Udinese"
    },
    "La Liga": {
        "Athletic Club": "Athletic Bilbao",
        "CA Osasuna": "Osasuna",
        "Club Atl√©tico de Madrid": "Atl√©tico Madrid",
        "Deportivo Alav√©s": "Alav√©s",
        "Elche CF": "Elche",
        "FC Barcelona": "Barcelona",
        "Getafe CF": "Getafe",
        "Girona": "Girona",
        "Girona FC": "Girona",
        "Levante UD": "Levante",
        "RC Celta de Vigo": "Celta Vigo",
        "RCD Espanyol de Barcelona": "Espanyol",
        "RCD Mallorca": "Mallorca",
        "Rayo Vallecano de Madrid": "Rayo Vallecano",
        "Real Betis Balompi√©": "Real Betis",
        "Real Madrid CF": "Real Madrid",
        "Real Oviedo": "Oviedo",
        "Real Sociedad de F√∫tbol": "Real Sociedad",
        "Sevilla": "Sevilla",
        "Sevilla FC": "Sevilla",
        "Valencia CF": "Valencia",
        "Villarreal CF": "Villarreal"
    },
    "Bundesliga": {
        "1. FC Heidenheim 1846": "1. Heidenheim",
        "1. FC K√∂ln": "1. FC K√∂ln",
        "1. FC Union Berlin": "Union Berlin",
        "1. FSV Mainz 05": "FSV Mainz 05",
        "Bayer 04 Leverkusen": "Bayer Leverkusen",
        "Borussia Dortmund": "Borussia Dortmund",
        "Borussia M√∂nchengladbach": "Borussia Monchengladbach",
        "Eintracht Frankfurt": "Eintracht Frankfurt",
        "FC Augsburg": "Augsburg",
        "FC Bayern M√ºnchen": "Bayern Munich",
        "FC St. Pauli 1910": "FC St. Pauli",
        "Hamburger SV": "Hamburger SV",
        "RB Leipzig": "RB Leipzig",
        "SC Freiburg": "SC Freiburg",
        "SV Werder Bremen": "Werder Bremen",
        "TSG 1899 Hoffenheim": "TSG Hoffenheim",
        "VfB Stuttgart": "VfB Stuttgart",
        "VfL Wolfsburg": "VfL Wolfsburg"
    },
    "Ligue 1": {
        "AJ Auxerre": "Auxerre",
        "AS Monaco FC": "Monaco",
        "Angers SCO": "Angers",
        "FC Lorient": "Lorient",
        "FC Metz": "Metz",
        "FC Nantes": "Nantes",
        "Le Havre AC": "Le Havre",
        "Lille OSC": "Lille",
        "OGC Nice": "Nice",
        "Olympique Lyonnais": "Lyon",
        "Olympique de Marseille": "Marseille",
        "Paris FC": "Paris FC",
        "Paris Saint-Germain FC": "Paris Saint Germain",
        "RC Strasbourg Alsace": "Strasbourg",
        "Racing Club de Lens": "RC Lens",
        "Stade Brestois 29": "Brest",
        "Stade Rennais FC 1901": "Rennes",
        "Toulouse FC": "Toulouse"
    }
}


In [79]:
league_key_map = {
    "premierleague_england": "Premier League",
    "seriea_italy": "Serie A",
    "laliga_spain": "La Liga",
    "bundesliga_germany": "Bundesliga",
    "ligue1_france": "Ligue 1"
}

for league in leagues:
    df = df_final_probabilities_all[league]
    map_dict = league_name_maps[league_key_map[league]]
    
    df["home_team_norm"] = df["homeTeam"].replace(map_dict)
    df["away_team_norm"] = df["awayTeam"].replace(map_dict)
    
    df_final_probabilities_all[league] = df

In [80]:
# Dictionary to store simulation dataframes for each league
df_simulation_all = {}

# Columns for probability normalization
prob_cols = ["p_home_final", "p_draw_final", "p_away_final"]

for league in leagues:
    df = df_final_probabilities_all[league].copy()
    
    # Normalize probabilities so each row sums to 1
    df[prob_cols] = df[prob_cols].div(df[prob_cols].sum(axis=1), axis=0)
    
    # Store in dictionary
    df_simulation_all[league] = df
    
    # Preview top 3 rows
    print(f"\n=== {league.replace('_', ' ').title()} ===")
    display(df.head(3))


=== Premierleague England ===


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final,home_team_norm,away_team_norm
0,2026-02-08T14:00:00Z,Brighton & Hove Albion FC,Crystal Palace FC,0.476419,0.267663,0.255917,Brighton & Hove Albion,Crystal Palace
1,2026-02-08T16:30:00Z,Liverpool FC,Manchester City FC,0.408544,0.254683,0.336774,Liverpool,Manchester City
2,2026-02-10T19:30:00Z,Chelsea FC,Leeds United FC,0.598002,0.224396,0.177602,Chelsea,Leeds United



=== Seriea Italy ===


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final,home_team_norm,away_team_norm
0,2026-02-08T11:30:00Z,Bologna FC 1909,Parma Calcio 1913,0.567115,0.252303,0.180582,Bologna,Parma
1,2026-02-08T14:00:00Z,US Lecce,Udinese Calcio,0.299649,0.333784,0.366567,Lecce,Udinese
2,2026-02-08T17:00:00Z,US Sassuolo Calcio,FC Internazionale Milano,0.131802,0.211473,0.656725,Sassuolo,Inter Milan



=== Laliga Spain ===


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final,home_team_norm,away_team_norm
0,2026-02-08T13:00:00Z,Deportivo Alav√©s,Getafe CF,0.402789,0.349568,0.247642,Alav√©s,Getafe
1,2026-02-08T15:15:00Z,Athletic Club,Levante UD,0.582598,0.245482,0.17192,Athletic Bilbao,Levante
2,2026-02-08T17:30:00Z,Club Atl√©tico de Madrid,Real Betis Balompi√©,0.653252,0.203575,0.143173,Atl√©tico Madrid,Real Betis



=== Bundesliga Germany ===


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final,home_team_norm,away_team_norm
0,2026-02-08T14:30:00Z,1. FC K√∂ln,RB Leipzig,0.282259,0.247316,0.470425,1. FC K√∂ln,RB Leipzig
1,2026-02-08T16:30:00Z,FC Bayern M√ºnchen,TSG 1899 Hoffenheim,0.757061,0.138074,0.104865,Bayern Munich,TSG Hoffenheim
2,2026-02-13T19:30:00Z,Borussia Dortmund,1. FSV Mainz 05,0.62679,0.212587,0.160623,Borussia Dortmund,FSV Mainz 05



=== Ligue1 France ===


Unnamed: 0,utcDate,homeTeam,awayTeam,p_home_final,p_draw_final,p_away_final,home_team_norm,away_team_norm
0,2026-02-08T14:00:00Z,OGC Nice,AS Monaco FC,0.299656,0.247905,0.452439,Nice,Monaco
1,2026-02-08T16:15:00Z,Le Havre AC,RC Strasbourg Alsace,0.228102,0.263473,0.508425,Le Havre,Strasbourg
2,2026-02-08T16:15:00Z,AJ Auxerre,Paris FC,0.375692,0.304229,0.320079,Auxerre,Paris FC


## 8. Run simulations to build the Premier League table probabilities

In [81]:
def simulate_once(fixtures, table):
    table_sim = table.copy()

    # Use normalized team name column
    points = dict(zip(table_sim["team_norm"], table_sim["pts"]))

    for _, row in fixtures.iterrows():
        home = row["home_team_norm"]
        away = row["away_team_norm"]

        # choose outcome
        probs = [row["p_home_final"], row["p_draw_final"], row["p_away_final"]]
        outcome = np.random.choice(["H", "D", "A"], p=probs)

        if outcome == "H":
            points[home] += 3
        elif outcome == "D":
            points[home] += 1
            points[away] += 1
        else:
            points[away] += 3

    result_df = table_sim.copy()
    result_df["pts"] = result_df["team_norm"].map(points)

    # sort by points and goal difference
    result_df = result_df.sort_values(["pts", "gd"], ascending=[False, False])
    result_df["position"] = np.arange(1, len(result_df)+1)

    return result_df


def run_simulations(fixtures, table, n_sim=10000):
    position_counts = {team: np.zeros(len(table)) for team in table["team_norm"]}

    for _ in range(n_sim):
        final_table = simulate_once(fixtures, table)

        for _, row in final_table.iterrows():
            position_counts[row["team_norm"]][row["position"]-1] += 1

    pos_df = pd.DataFrame(position_counts, index=np.arange(1, len(table)+1))
    pos_df.index.name = "position"
    return pos_df

In [None]:
n_sim = 10000  # total simulations

for league in leagues:
    print(f"\n=== {league.replace('_', ' ').title()} ===")
    
    fixtures = df_simulation_all[league]
    table = league_tables[league]
    
    if "team_norm" not in table.columns:
        table["team_norm"] = table["team"]
    
    position_counts = {team: np.zeros(len(table)) for team in table["team_norm"]}
    
    # Run simulations with progress printing every 2000 sims
    for i in range(n_sim):
        final_table = simulate_once(fixtures, table)
        for _, row in final_table.iterrows():
            position_counts[row["team_norm"]][row["position"]-1] += 1
        
        if (i+1) % 2000 == 0:
            print(f"{i+1}/{n_sim} simulations done...")
    
    pos_df = pd.DataFrame(position_counts, index=np.arange(1, len(table)+1))
    pos_df.index.name = "position"
    pos_df_t = pos_df.T
    pos_df_pct = pos_df_t.div(pos_df_t.sum(axis=1), axis=0) * 100
    
    position_distribution_all[league] = pos_df
    position_distribution_pct_all[league] = pos_df_pct
    
    print(f"Finished simulations for {league} ‚úÖ")


=== Premierleague England ===
2000/10000 simulations done...


In [None]:
# RUN
position_distribution = run_simulations(df_simulation, premierleague_england, n_sim=20000)

In [None]:
position_distribution.index.name = "TEAM"
position_distribution_t = position_distribution.T

In [None]:
position_distribution_pct = position_distribution_t.div(
    position_distribution_t.sum(axis=1),
    axis=0
) * 100


## 9. Preview and present the results graphically

In [None]:
# Build label mapping: "position  team" (extra space for 1-9)
team_labels = (
    premierleague_england[["team", "position"]]
    .set_index("team")["position"]
    .map(lambda pos: f"{pos}{'  ' if pos < 10 else ' '}")
)

# Join position and team name into one label
team_labels = (
    premierleague_england[["team", "position"]]
    .assign(
        label=lambda df: df.apply(
            lambda r: f"{r['position']}{'&nbsp;&nbsp;&nbsp;&nbsp;' if r['position'] < 10 else '&nbsp;&nbsp;'}{r['team']}",
            axis=1
        )
    )
    .set_index("team")["label"]
)


# Apply labels to your table index
position_distribution_pct.index = position_distribution_pct.index.map(team_labels)

# Drop position column if present
position_distribution_pct = position_distribution_pct.drop(columns=["position"], errors="ignore")

# Remove index name
position_distribution_pct.index.name = None


In [None]:
greens = plt.cm.Greens
green_cmap = LinearSegmentedColormap.from_list(
    "Greens_soft",
    greens(np.linspace(0.03, 0.65, 256))
)

vmax = 25

def zero_style(val):
    if val < 0.005:
        return "background-color: white !important;"
    return ""

# ---- transform ONLY for colouring ----
color_data = position_distribution_pct.copy()
color_data = (color_data / vmax).pow(0.65) * vmax

position_distribution_pct.style \
    .background_gradient(
        cmap=green_cmap,
        vmin=0,
        vmax=vmax,
        gmap=color_data,
        axis=None          # üîë THIS FIXES THE ERROR
    ) \
    .applymap(zero_style) \
    .format("{:.2f}%") \
    .set_table_styles([
        {"selector": "th", "props": [
            ("background-color", "#e6edf4"),
            ("color", "#333"),
            ("text-align", "center"),
            ("font-family", "Inter, Roboto, Arial, sans-serif"),
            ("font-size", "13px"),
            ("font-weight", "600")
        ]},

        {"selector": "th.col_heading", "props": [
            ("text-align", "center")
        ]},

        {"selector": "th.row_heading", "props": [
            ("text-align", "left"),
            ("font-size", "13px"),
            ("font-weight", "600"),
            ("white-space", "nowrap"),
            ("max-width", "250px"),
            ("overflow", "hidden"),
            ("text-overflow", "ellipsis")
        ]},

        {"selector": "tr:nth-child(odd) th.row_heading", "props": [
            ("background-color", "#fbfcfe")
        ]},
        {"selector": "tr:nth-child(even) th.row_heading", "props": [
            ("background-color", "#e6edf4")
        ]},

        {"selector": "td", "props": [
            ("text-align", "center"),
            ("font-family", "Inter, Roboto, Arial, sans-serif"),
            ("font-size", "12px"),
            ("font-weight", "500"),
            ("color", "#000")
        ]}
    ])
