# Euro Cup Predictions
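Look at the score distribution of a typical Euro Cup and mimic it: predict each score about as often as it has occurred historically.

Use FIFA rankings to assign those scores by goal difference: the larger the rank gap between two teams, the larger the predicted goal difference.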

A blog post about these predictions can be found at [Towards Data Science](https://sijmenvdw.medium.com/predict-euro-cup-matches-with-simple-statistics-2fc913678117).

Code can also be found on [GitHub](https://github.com/sijmenw/predict-euro-cup-simple-stats).

By: Sijmen van der Willik

### Import libraries

In [None]:
import json
import pandas as pd
import numpy as np

### Load data

In [None]:
# download external data files
!wget -O fifa.json https://github.com/sijmenw/predict-euro-cup-simple-stats/raw/master/fifa.json
!wget -O games_group_stage.json https://github.com/sijmenw/predict-euro-cup-simple-stats/raw/master/games_group_stage.json

In [None]:
with open("fifa.json") as f:
    fifa = json.load(f)
with open("games_group_stage.json") as f:
    games = json.load(f)
df = pd.read_csv("../input/all-euro-cup-football-games-19602016/euro_cup_games.csv")

### Wrangle data

In [None]:
# only use non-qualifying games
df = df[df['round'] != 'QUALIFYING'].copy()

In [None]:
# add max-min column
#  - stores the max - min goals from a match so outcomes can be easily compared
#  - i.e. 2-1 and 1-2 both output 2-1
def get_maxmin(score):
    goals = [int(g) for g in score.split("-")]
    return f"{max(goals)}-{min(goals)}"

df['maxmin'] = df['score'].apply(get_maxmin)
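
To make the normalisation concrete, here is a minimal standalone sketch of the same idea. The function name `normalize_score` and the example scores are hypothetical, and the sketch assumes every score is written as two plain integers separated by a hyphen.

```python
def normalize_score(score):
    """Return the score as 'max-min', so mirrored results compare equal."""
    goals = [int(part) for part in score.split("-")]
    return f"{max(goals)}-{min(goals)}"


# made-up example scores, not taken from the data set
examples = ["2-1", "1-2", "0-0", "0-3"]
normalized = [normalize_score(s) for s in examples]
print(normalized)  # ['2-1', '2-1', '0-0', '3-0']
```

Because goals are compared as integers rather than as characters, a double-digit score such as `10-2` would also be handled correctly.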

In [None]:
# add scores as int so difference can be calculated
df[['score1', 'score2']] = df['score'].str.split("-", n=1, expand=True)
df['score1'] = df['score1'].astype(int)
df['score2'] = df['score2'].astype(int)

In [None]:
# count the number of occurrences of each match outcome per edition
df2 = df.groupby("edition")['maxmin'].value_counts()

# init DataFrame for totals
totals = pd.DataFrame(index=sorted(df['maxmin'].unique()))

# fill the DataFrame
for idx, v in zip(df2.index, df2.values):
    edition, maxmin = idx
    
    if edition not in totals:
        totals[edition] = np.nan
    
    totals.loc[maxmin, edition] = v
totals = totals.fillna(0)
totals = totals.astype(int)
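
The loop above produces a table of counts with one row per outcome and one column per edition. As a sketch of what that table looks like, the snippet below builds the same shape on made-up data with `pd.crosstab`, which is a swapped-in alternative to the loop and not the method used in this notebook. The editions `A` and `B` and their results are invented for illustration.

```python
import pandas as pd

# made-up results: three games in edition A, two games in edition B
toy = pd.DataFrame({
    "edition": ["A", "A", "A", "B", "B"],
    "maxmin": ["1-0", "1-0", "2-1", "1-0", "0-0"],
})

# rows are outcomes, columns are editions, cells are occurrence counts;
# combinations that never occur are filled with 0
toy_totals = pd.crosstab(toy["maxmin"], toy["edition"])
print(toy_totals)
```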

In [None]:
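# calculate estimates
#  - convert the counts to fractions per edition, so every edition weighs equally
#  - average the fractions over editions and scale to the 36 group-stage games to predict
fractions = totals/totals.sum()
ests = fractions.mean(axis=1) * 36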
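
A small worked example may help to see what the two lines above compute. The counts below are made up (two hypothetical editions `A` and `B`); only the arithmetic mirrors the cell above.

```python
import pandas as pd

# made-up counts: edition A had 4 games, edition B had 2 games
toy_totals = pd.DataFrame(
    {"A": [1, 2, 1], "B": [1, 1, 0]},
    index=["0-0", "1-0", "2-1"],
)

# fraction of games per outcome within each edition (every column sums to 1)
toy_fractions = toy_totals / toy_totals.sum()

# average over editions, then scale to the 36 games to predict
toy_ests = toy_fractions.mean(axis=1) * 36
print(toy_ests)
```

Here `1-0` makes up half of the games in both editions, so it is expected in 18 of the 36 games. Edition B counts as much as edition A even though it has fewer games, because the average is taken over fractions and not over raw counts.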

In [None]:
# save estimates to DataFrame
edf = pd.DataFrame(ests)
edf.columns = ['E']

# add maxmin and score differentials as columns for easy comparison later
edf['maxmin'] = edf.index
edf['diff'] = edf['maxmin'].apply(lambda x: int(x.split("-")[0]) - int(x.split("-")[1]))

In [None]:
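# round the estimates to whole games with the largest remainder method:
# floor every estimate, then hand out the remaining games one by one to the
# outcomes with the largest fractional parts until 36 games are assigned
def build_n(e):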
    n = [int(x) for x in e]
    e = [x % 1 for x in e]
    
    while sum(n) < 36:
        idx = np.argmax(e)
        e[idx] = 0
        n[idx] += 1
    
    return n

edf['n'] = build_n(edf['E'])
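
The rounding step can be sketched in a more general form, with the total passed in as a parameter. The function name `largest_remainder` and the example estimates are hypothetical, and the sketch assumes the estimates are non-negative and sum (up to rounding error) to the requested total.

```python
import numpy as np


def largest_remainder(expected, total):
    """Round non-negative estimates to integers that sum to `total`."""
    expected = np.asarray(expected, dtype=float)
    counts = np.floor(expected).astype(int)
    remainders = expected - counts

    # number of games still to hand out after flooring
    shortfall = total - counts.sum()

    # outcomes with the largest fractional part get one extra game;
    # a stable sort keeps the original order when remainders tie
    order = np.argsort(-remainders, kind="stable")
    counts[order[:shortfall]] += 1
    return counts.tolist()


# made-up estimates that sum to 36
counts = largest_remainder([13.6, 18.3, 4.1], 36)
print(counts)  # [14, 18, 4]
```

Flooring gives 13 + 18 + 4 = 35 games, so one game is left over; it goes to the first outcome because 0.6 is the largest remainder.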

In [None]:
# create game DataFrame
gdf = pd.DataFrame(games)
gdf.columns = ['c1', 'c2']

# add columns for FIFA ranks and differences
gdf['r1'] = gdf['c1'].apply(lambda x: fifa[x])
gdf['r2'] = gdf['c2'].apply(lambda x: fifa[x])
gdf['r_diff'] = gdf['r1'] - gdf['r2']
gdf['r_diff_abs'] = np.abs(gdf['r_diff'])

In [None]:
# create a sorted list of the scores to distribute
score_list = []

for _, row in edf.sort_values(by='diff').iterrows():
    for i in range(row['n']):
        score_list.append(row['maxmin'])

In [None]:
# sort the games by FIFA rank difference
gdf = gdf.sort_values(by=['r_diff_abs'])

In [None]:
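# add the scores to the games DataFrame
gdf['pred'] = score_list
# flip the score when team 1 has the larger rank number, so team 2 gets the higher score
gdf['pred'] = gdf.apply(lambda row: "-".join(row['pred'].split("-")[::-1]) if row['r_diff'] > 0 else row['pred'], axis=1)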
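
The matching step can be illustrated on a toy example. The team names, ranks and scores below are made up, and the sketch assumes that a lower rank number means a stronger team, which is what the flip on `r_diff > 0` implies.

```python
import pandas as pd

# made-up fixtures with made-up ranks (assumption: lower number = stronger team)
toy_games = pd.DataFrame({
    "c1": ["Team A", "Team B", "Team C"],
    "c2": ["Team D", "Team E", "Team F"],
    "r1": [1, 20, 12],
    "r2": [30, 18, 5],
})
toy_games["r_diff"] = toy_games["r1"] - toy_games["r2"]
toy_games["r_diff_abs"] = toy_games["r_diff"].abs()

# scores already ordered from smallest to largest goal difference
toy_scores = ["1-1", "1-0", "3-0"]


def flip(score):
    """Swap the two sides of a 'x-y' score."""
    first, second = score.split("-")
    return f"{second}-{first}"


# the closest match-up gets the smallest goal difference, and so on
toy_games = toy_games.sort_values("r_diff_abs")
toy_games["pred"] = toy_scores

# give the higher score to team 2 when team 1 has the larger rank number
toy_games["pred"] = [
    flip(p) if d > 0 else p
    for p, d in zip(toy_games["pred"], toy_games["r_diff"])
]

toy_games = toy_games.sort_index()
print(toy_games[["c1", "c2", "r_diff", "pred"]])
```

Team A against Team D has the largest rank gap (29), so it receives the 3-0. Team C against Team F receives the 1-0, flipped to 0-1 because Team F has the better rank.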

## Result

In [None]:
# show the games with predictions in their original order
gdf.sort_index()

## Plots

Additional plots are created for the [blog post](https://sijmenvdw.medium.com/predict-euro-cup-matches-with-simple-statistics-2fc913678117).

In [None]:
import matplotlib.pyplot as plt

### Occurrence table

In [None]:
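# work on a copy so the column names of `totals` are not changed in place
t = totals.copy()
t.columns = [x.split("-")[0] for x in t.columns]
# blank out zero counts for readability
t.astype(str).replace("0", "")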

### Heatmap

In [None]:
# use a separate name so the games DataFrame `df` is not overwritten
hm = fractions

plt.figure(figsize=(14,12))
plt.pcolor(hm[::-1], cmap='coolwarm')
yticks = np.arange(0.5, len(hm.index), 1)
plt.yticks(list(reversed(yticks)), hm.index)
plt.xticks(np.arange(0.5, len(hm.columns), 1), [x.split("-")[0] for x in hm.columns])
plt.title("Relative occurrence of scores in Euro Cup by year")
plt.show()