# Data preparation for predicting World Cup and Euro results from qualification performance

Dependencies

In [1]:
import pandas as pd
from datetime import datetime

**Data**: the dataset is downloaded from kaggle: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017 
From 3 files, results.csv is used here.

In [2]:
df_raw = pd.read_csv("../data/raw/results.csv")
df_raw.info() ; df_raw.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47399 entries, 0 to 47398
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        47399 non-null  object
 1   home_team   47399 non-null  object
 2   away_team   47399 non-null  object
 3   home_score  47399 non-null  int64 
 4   away_score  47399 non-null  int64 
 5   tournament  47399 non-null  object
 6   city        47399 non-null  object
 7   country     47399 non-null  object
 8   neutral     47399 non-null  bool  
dtypes: bool(1), int64(2), object(6)
memory usage: 2.9+ MB


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [3]:
df_recent = df_raw.copy()
df_recent["date"] = pd.to_datetime(df_recent["date"], format = "%Y-%m-%d")

## Creating datasets for qualification periods and main events

Here, we focus on World Cup and Euro, because these two happen with two-year intervals from each other (except Covid-related delays), and make it easy to divide the entire timeframe neatly into qualification and main-event phases, where we can calculate performance metrics from the former, and build models for the latter. Other continental tournaments are not so regular (they switch between 2-, 3- and 4-year intervals, and odd- and even-numbered years, sometimes overlapping with the World Cup).

The idea is to divide the entire time frame into 2-year periods named after the main event that concludes the period. Then, within each period, qualification and main-event phases are distinguished. The first main tournament considered here is World Cup 1978 (the first Euro with group stage is 1980). In order to include the qualification phase of World Cup 78, we also take into account the end date of Euro 1976.

To begin with, main event dates are saved in a dictionary and a dataframe below. End dates are one day after the final, hence the beginning of the next period.

In [4]:
event_dates = {
    "Euro24" : [datetime(2024, 6, 14), datetime(2024, 7, 15)], 
    "Euro20" : [datetime(2021, 6, 11), datetime(2021, 7, 12)], 
    "Euro16" : [datetime(2016, 6, 10), datetime(2016, 7, 11)], 
    "Euro12" : [datetime(2012, 6, 8), datetime(2012, 7, 2)], 
    "Euro08" : [datetime(2008, 6, 7), datetime(2008, 6, 30)], 
    "Euro04" : [datetime(2004, 6, 12), datetime(2004, 7, 5)], 
    "Euro00" : [datetime(2000, 6, 10), datetime(2000, 7, 3)], 
    "Euro96" : [datetime(1996, 6, 8), datetime(1996, 7, 1)], 
    "Euro92" : [datetime(1992, 6, 10), datetime(1992, 6, 27)], 
    "Euro88" : [datetime(1988, 6, 10), datetime(1988, 6, 26)],
    "Euro84" : [datetime(1984, 6, 12), datetime(1984, 6, 28)], 
    "Euro80" : [datetime(1980, 6, 11), datetime(1980, 6, 23)], 
    "Euro76" : [datetime(1976, 6, 16), datetime(1976, 6, 21)], 
    "WC22" :   [datetime(2022, 11, 20), datetime(2022, 12, 19)], 
    "WC18" :   [datetime(2018, 6, 14), datetime(2018, 7, 16)], 
    "WC14" :   [datetime(2014, 6, 12), datetime(2014, 7, 14)], 
    "WC10" :   [datetime(2010, 6, 11), datetime(2010, 7, 12)], 
    "WC06" :   [datetime(2006, 6, 9), datetime(2005, 7, 10)], 
    "WC02" :   [datetime(2002, 5, 31), datetime(2002, 7, 1)], 
    "WC98" :   [datetime(1998, 6, 10), datetime(1998, 7, 13)], 
    "WC94" :   [datetime(1994, 6, 17), datetime(1994, 7, 18)], 
    "WC90" :   [datetime(1990, 6, 8), datetime(1990, 7, 9)], 
    "WC86" :   [datetime(1986, 5, 31), datetime(1986, 6, 30)], 
    "WC82" :   [datetime(1982, 6, 13), datetime(1982, 7, 12)], 
    "WC78" :   [datetime(1978, 6, 1), datetime(1978, 6, 26)]
}

df_event_dates = pd.DataFrame(event_dates).T.reset_index()\
    .rename(columns = {"index" : "event", 0 : "start", 1 : "end"})\
    .sort_values("start").reset_index(drop = True)

We add periods and phases (main and qualification) to the main dataframe using these dates as cut-off.

In [5]:
df_recent["period"] = ""
df_recent["phase"] = ""

for i in df_event_dates.iloc[:-1, :].index:
    # i is the previous event, i+1 as the current event
    df_recent.loc[
        # period as between the end dates of main events, prveious and current 
        (df_recent["date"] >= df_event_dates.loc[i, "end"]) & 
        (df_recent["date"] < df_event_dates.loc[i+1, "end"]), 
        "period"
        ] = df_event_dates.loc[i+1, "event"]
    df_recent.loc[
        # qualification phase as between the end of the previous event and the start of the current event
        (df_recent["date"] >= df_event_dates.loc[i, "end"]) &
        (df_recent["date"] < df_event_dates.loc[i+1, "start"]),
        "phase"
        ] = "qualification"
    df_recent.loc[
        # main phase as between the start and end dates of the current event
        (df_recent["date"] >= df_event_dates.loc[i+1, "start"]) &
        (df_recent["date"] < df_event_dates.loc[i+1, "end"]),
        "phase"
        ] = "main"
        

And we divide the games into qualifications and main events. In the timeframe of main events, there are also other games; so, we specifiy the tournament name as World Cup or Euro.

In [6]:
df_quals = df_recent[df_recent["phase"] == "qualification"].drop(columns = "phase").copy()
df_mains = df_recent[(df_recent["phase"] == "main") & 
                     ((df_recent["tournament"] == "FIFA World Cup") | (df_recent["tournament"] == "UEFA Euro"))].drop(columns = "phase").copy()

## Calculating qualification performance metrics

For simplicity, we will not distinguish between home and away performance. Since the data consists of games, we have to process it from both home and away team perspective.

In [7]:
df_quals_hometeam = df_quals.drop(columns = "away_team")\
    .rename(columns = {
        "home_team" : "team",
        "home_score" : "scored",
        "away_score" : "conceded"
    }).reset_index(names = "match_id").copy()
df_quals_awayteam = df_quals.drop(columns = "home_team")\
    .rename(columns = {
        "away_team" : "team",
        "away_score" : "scored",
        "home_score" : "conceded"
    }).reset_index(names = "match_id").copy()
df_quals_teams = pd.concat([df_quals_hometeam, df_quals_awayteam], axis = 0).sort_values("match_id").reset_index(drop = True)

The performance metrics to be calculated are win and draw ratios, and average goals scored and conceded.

In [8]:
df_quals_teams["win"] = (df_quals_teams["scored"] > df_quals_teams["conceded"]).astype(int)
df_quals_teams["draw"] = (df_quals_teams["scored"] == df_quals_teams["conceded"]).astype(int)

In [9]:
df_quals_performance = df_quals_teams.groupby(["team", "period"]).agg(
    win_ratio = ("win", "mean"),
    draw_ratio = ("draw", "mean"),
    avg_goals_scored = ("scored", "mean"),
    avg_goals_conceded = ("conceded", "mean")
).reset_index()

## Adding performance metrics to main event matches

Since these are mostly in a neutral venue in a host country, we begin with more meaningful column names.

In [10]:
df_mains = df_mains.rename(columns = {
    "home_team" : "team_A",
    "away_team" : "team_B",
    "home_score" : "A_score",
    "away_score" : "B_score"
})
df_mains["host_advantage"] = df_mains["neutral"] * (-1) + 1
df_mains.drop(columns = ["neutral", "tournament", "city", "country"], inplace = True)

And we add qualification performance metrics with respect to team A and team B.

In [11]:
df_mains_perf = df_mains\
    .merge(df_quals_performance, left_on = ["team_A", "period"], right_on = ["team", "period"])\
    .rename(columns = {
        "win_ratio" : "A_win_ratio",
        "draw_ratio" : "A_draw_ratio",
        "avg_goals_scored" : "A_avg_goals_scored",
        "avg_goals_conceded" : "A_avg_goals_conceded"
    })\
    .drop(columns = "team")\
    .merge(df_quals_performance, left_on = ["team_B", "period"], right_on = ["team", "period"])\
    .rename(columns = {
        "win_ratio" : "B_win_ratio",
        "draw_ratio" : "B_draw_ratio",
        "avg_goals_scored" : "B_avg_goals_scored",
        "avg_goals_conceded" : "B_avg_goals_conceded"
    })\
    .drop(columns = "team")\
    .sort_values("date").reset_index(drop = True)

We will try to predict the match results; so, we create a column for it.

In [12]:
def get_match_result(row):
    if row["A_score"] > row["B_score"]:
        return "A_win"
    elif row["A_score"] < row["B_score"]:
        return "B_win"
    else:
        return "Draw"
df_mains_perf["result"] = df_mains_perf.apply(get_match_result, axis = 1)

## Inspecting class balance

It is reasonable to expect that host teams have an advantage. But for neutral venues, the probabilities for A and B should be roughly equal.

In [13]:
df_mains_perf[df_mains_perf["host_advantage"] == 0].value_counts("result", normalize = True)

result
A_win    0.407101
B_win    0.337278
Draw     0.255621
Name: proportion, dtype: float64

For some reason, the winning teams are more often put under team A (home team originally). To reduce potential bias, let's reverse the order of teams for half of the neutral games.

In [14]:
df_hostteam = df_mains_perf[df_mains_perf["host_advantage"] == 1].copy()
df_neutral = df_mains_perf[df_mains_perf["host_advantage"] == 0].sample(frac = 1, random_state = 42).copy()

half_len = len(df_neutral) // 2
df_neutral_keep = df_neutral.iloc[:half_len, :].copy()
df_neutral_reverse = df_neutral.iloc[half_len:, :].copy()

df_neutral_reverse = df_neutral_reverse.rename(columns = {
    "team_A" : "team_B",
    "team_B" : "team_A",
    "A_score" : "B_score",
    "B_score" : "A_score",
    "A_win_ratio" : "B_win_ratio",
    "B_win_ratio" : "A_win_ratio",
    "A_draw_ratio" : "B_draw_ratio",
    "B_draw_ratio" : "A_draw_ratio",
    "A_avg_goals_scored" : "B_avg_goals_scored",
    "B_avg_goals_scored" : "A_avg_goals_scored",
    "A_avg_goals_conceded" : "B_avg_goals_conceded",
    "B_avg_goals_conceded" : "A_avg_goals_conceded"
})

df_neutral_reverse["result"] = df_neutral_reverse.apply(get_match_result, axis = 1)


Combine back together, check balance.

In [15]:
df_mains_perf_balanced = pd.concat([df_hostteam, df_neutral_keep, df_neutral_reverse], axis = 0).sort_values("date").reset_index(drop = True)

df_mains_perf_balanced[df_mains_perf_balanced["host_advantage"] == 0].value_counts("result", normalize = True)

result
A_win    0.382249
B_win    0.362130
Draw     0.255621
Name: proportion, dtype: float64

## Saving the final dataframe

In [76]:
df_mains_perf_balanced.to_pickle("../data/processed/qualification_performance.pkl")