The aim of the project is to predict which team will win each match. To do that, we scraped data from the ESPN results tab for each team (see the raw data folder). The data covers the last two years of matches for each team or, in some cases, only one year.
All the code and the posterior checks of the model can be found in the Jupyter notebook.
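The vector of observed scores for the $g$-th game, $y_g = (y_{g1}, y_{g2})$, is modeled as two independent Poisson counts:

$$
y_{gj} \mid \theta_{gj} \sim \text{Poisson}(\theta_{gj}), \quad j = 1, 2
$$

where the parameters $\theta_g = (\theta_{g1}, \theta_{g2})$ represent the scoring intensity in the $g$-th game for the team playing at home ($j = 1$) and away ($j = 2$), respectively.

We model these parameters according to a formulation that has been widely used in the statistical literature (see Karlis & Ntzoufras 2003), assuming a log-linear random effect model:

$$
\begin{aligned}
\log \theta_{g1} &= \mu + \eta + \alpha_{h(g)} + \delta_{a(g)} \\
\log \theta_{g2} &= \mu + \alpha_{a(g)} + \delta_{h(g)}
\end{aligned}
$$

The parameter $\eta$ represents the home field effect and $\mu$ is the average score on the log scale. The scoring intensity is determined jointly by the attack ($\alpha$) and defense ($\delta$) skills of the two teams involved, where $h(g)$ and $a(g)$ index the home and away team in the $g$-th game.

As suggested by various works, we need to impose some identifiability constraints on the team-specific parameters. We use a sum-to-zero constraint over the $T$ teams, that is:

$$
\sum_{t=1}^{T} \alpha_t = 0, \qquad \sum_{t=1}^{T} \delta_t = 0
$$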
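To make the log-linear structure concrete, here is a minimal pure-Python sketch that centers attack and defense values so they satisfy the sum-to-zero constraint and then computes the two scoring intensities for a single match. The team labels, the parameter values, and the helper names `center` and `scoring_intensities` are illustrative assumptions, not estimates or code from the project.

```python
import math


def center(values):
    """Subtract the mean so that the values sum to zero (identifiability constraint)."""
    mean = sum(values.values()) / len(values)
    return {team: value - mean for team, value in values.items()}


def scoring_intensities(mu, eta, alpha, delta, home, away):
    """Return (theta_home, theta_away) under the log-linear model."""
    theta_home = math.exp(mu + eta + alpha[home] + delta[away])
    theta_away = math.exp(mu + alpha[away] + delta[home])
    return theta_home, theta_away


# Hypothetical unconstrained parameters, chosen only for illustration.
raw_alpha = {"A": 0.6, "B": 0.1, "C": -0.1}  # attack skill
raw_delta = {"A": -0.4, "B": 0.0, "C": 0.1}  # defense term (lower means fewer goals conceded)

alpha = center(raw_alpha)
delta = center(raw_delta)

theta_home, theta_away = scoring_intensities(
    mu=0.4, eta=0.2, alpha=alpha, delta=delta, home="A", away="C"
)
print(f"expected goals: home={theta_home:.2f}, away={theta_away:.2f}")
```

Centering by the mean is the same operation the model applies to `alpha` and `delta` before they enter the linear predictor.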
Using the PyMC probabilistic programming framework, we can build such a model in just a few lines of code:
import aesara.tensor as at  # PyMC v4 backend; with PyMC >= 5 use `import pytensor.tensor as at`
import pymc as pm

# `data` holds one row per match; `teams` and `team2id` (team name -> integer index)
# are assumed to have been built from it beforehand
home_idx = data['home_team'].map(team2id).values
away_idx = data['away_team'].map(team2id).values

coords = {"team": teams}

with pm.Model(coords=coords) as model:
    # Data inputs
    home_team = pm.MutableData("home_team", home_idx)
    away_team = pm.MutableData("away_team", away_idx)

    # Home field effect
    eta = pm.Normal("eta", mu=0, sigma=1)

    # Average score (log scale)
    mu = pm.Normal('mu', 0.4, sigma=1)

    # Team attack and defense skills
    alpha = pm.Normal("alpha", mu=0, sigma=1, dims="team")
    delta = pm.Normal("delta", mu=0, sigma=1, dims="team")

    # Constrain the attack and defense skills to sum to zero
    alpha_star = pm.Deterministic("alpha_star", alpha - at.mean(alpha), dims="team")
    delta_star = pm.Deterministic("delta_star", delta - at.mean(delta), dims="team")

    # Expected scores for the home and away teams in each game
    home_theta = at.exp(mu + eta + alpha_star[home_team] + delta_star[away_team])
    away_theta = at.exp(mu + alpha_star[away_team] + delta_star[home_team])

    # Likelihoods of the observed scores
    home_score = pm.Poisson(
        "home_score",
        mu=home_theta,
        observed=data["home_score"].to_numpy()
    )
    away_score = pm.Poisson(
        "away_score",
        mu=away_theta,
        observed=data["away_score"].to_numpy()
    )

The result is the following hierarchical model:
We trained the model on all matches in our data that were played before the World Cup. For testing purposes, we checked the results against the first matches for which we have ground-truth labels.
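One way to turn scoring intensities into a prediction that can be checked against ground-truth labels is to compute win, draw, and loss probabilities under two independent Poisson scores and pick the most probable outcome. The sketch below uses only the standard library. The intensities, the labels, and the helper names are hypothetical, and plugging in point values for the intensities is a simplifying assumption (the project itself works with posterior predictive samples).

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def outcome_probabilities(theta_home, theta_away, max_goals=15):
    """Win/draw/loss probabilities for independent Poisson scores.

    The double sum is truncated at `max_goals` goals per team, which drops
    only a negligible tail for typical football scoring rates.
    """
    probs = {"home": 0.0, "draw": 0.0, "away": 0.0}
    for h in range(max_goals + 1):
        p_h = poisson_pmf(h, theta_home)
        for a in range(max_goals + 1):
            p = p_h * poisson_pmf(a, theta_away)
            if h > a:
                probs["home"] += p
            elif h == a:
                probs["draw"] += p
            else:
                probs["away"] += p
    return probs


def accuracy(predicted, actual):
    """Fraction of matches whose predicted outcome equals the observed one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical (theta_home, theta_away) pairs and observed outcomes.
intensities = [(2.1, 0.7), (1.0, 1.6), (1.3, 1.2)]
actual = ["home", "away", "draw"]

predicted = []
for theta_home, theta_away in intensities:
    probs = outcome_probabilities(theta_home, theta_away)
    predicted.append(max(probs, key=probs.get))

print(predicted, accuracy(predicted, actual))
```

Picking the single most probable outcome rarely selects a draw, so a proper scoring rule on the three probabilities may be a fairer test than plain accuracy.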
We can easily simulate matches that have not been played yet with a few lines of code:
# Predict future matches
home_teams = ['Argentina', 'Denmark', 'Mexico', 'France', 'Morocco', 'Germany']
away_teams = ['Saudi Arabia', 'Tunisia', 'Poland', 'Australia', 'Croatia', 'Japan']

home_teams_idx = [team2id[team] for team in home_teams]
away_teams_idx = [team2id[team] for team in away_teams]

with model:
    # Swap the training fixtures for the fixtures we want to predict
    pm.set_data({
        "home_team": home_teams_idx,
        "away_team": away_teams_idx,
    })
    # `trace` holds the posterior samples obtained when fitting the model
    predictions = pm.sample_posterior_predictive(
        trace,
        predictions=True,
        random_seed=42,
    )

As we can see, the model is not terrible, but it is not good either. A few changes could easily improve it:
- The home field effect is not the same in all matches; it depends on the type of tournament.
- Because we have little data on teams from different regions playing each other, teams that are the strongest in relatively weaker regions end up with biased strength estimates. We could collect more data or make the strength parameters depend on the type of tournament.
- We could add historical data to better capture teams that are usually strong, and also add a time dependency so that recent matches carry higher weights.
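
The last point could be sketched with exponential-decay weights, where a match loses half of its weight every fixed number of days. This is only one possible weighting scheme, not something the project implements. The function name `recency_weight`, the one-year half-life, and the dates are assumptions chosen for illustration.

```python
from datetime import date


def recency_weight(match_date, reference_date, half_life_days=365.0):
    """Exponential-decay weight for a match.

    Returns 1.0 for a match played on the reference date and halves
    for every `half_life_days` of age.
    """
    age_days = (reference_date - match_date).days
    if age_days < 0:
        raise ValueError("match_date must not be after reference_date")
    return 0.5 ** (age_days / half_life_days)


# Hypothetical reference date and match dates.
reference = date(2022, 11, 20)
match_dates = [date(2022, 11, 20), date(2021, 11, 20), date(2018, 7, 15)]

weights = [recency_weight(d, reference) for d in match_dates]
for d, w in zip(match_dates, weights):
    print(d.isoformat(), round(w, 3))
```

Such weights could then scale each match's contribution to the log-likelihood, so that older results inform the strength parameters less than recent ones.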


