# WDSS Football Forecasting Competition
<img src="logo_subtitle.png" width=500 height=500 />

## The Competition

- Predicting the scores of football matches isn't a new idea. 


- People try their luck at the bookies daily with nothing more than guesses; at WDSS, however, we prefer a more data-driven approach.

- The competition aims to bring together the best implementations of such data-driven approaches, in the hope of finding the best model possible.

## The Format

- 

## The Rules

- All weekly submissions must be accompanied by a model created in a Python Jupyter notebook (.ipynb file) or an R Markdown notebook (.Rmd file). For any queries about this, please message us.


- Scores predicted must match the output of your model (we will check this and disqualify inconsistent submissions). 



- There will be some flexibility in what constitutes a model; however, please refer to our demo model as a baseline.



- Have Fun! 

## The Winner & Prize

- The competition will be split across Term 1 & 2 with a prize pool of £700 for each iteration. 


- This prize pool will be split between the top competitors as ranked by the accuracy of their predictions.


- Additionally, a £100 prize will be awarded for the model that displays the most ingenuity, creativity, and good statistical practice. 

# Getting Started: let's build a demo model

- Today we will guide you in building a baseline model for our upcoming Premier League forecasting competition.

- This model will NOT win the competition for you, but it will help point you in the right direction.

## Start out with our imports

In [1]:
# Dependencies
from scipy.stats import poisson, skellam
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Gathering our data

In [4]:
#### FUNCTION TO RETRIEVE PREMIER LEAGUE DATA ####
def get_premier_league_data(start_year):
    """
    Function to get Premier League data

    :param int start_year: starting year of the season (e.g. 2018 for 2018/19)
    :return: DataFrame with one row per match
    """
    # Season code as used in the URL, e.g. 2018 -> "1819"
    season = str(start_year)[-2:] + str(start_year + 1)[-2:]
    data = pd.read_csv("http://www.football-data.co.uk/mmz4281/" + season + "/E0.csv")
    return data

In [6]:
# Get data from the 2018/2019 season
data = get_premier_league_data(2018)
data.head()

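| | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAv<2.5 | BbAH | BbAHh | BbMxAHH | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E0 | 10/08/2018 | Man United | Leicester | 2 | 1 | H | 1 | 0 | H | ... | 1.79 | 17 | -0.75 | 1.75 | 1.7 | 2.29 | 2.21 | 1.55 | 4.07 | 7.69 |
| 1 | E0 | 11/08/2018 | Bournemouth | Cardiff | 2 | 0 | H | 1 | 0 | H | ... | 1.83 | 20 | -0.75 | 2.2 | 2.13 | 1.8 | 1.75 | 1.88 | 3.61 | 4.7 |
| 2 | E0 | 11/08/2018 | Fulham | Crystal Palace | 0 | 2 | A | 0 | 1 | A | ... | 1.87 | 22 | -0.25 | 2.18 | 2.11 | 1.81 | 1.77 | 2.62 | 3.38 | 2.9 |
| 3 | E0 | 11/08/2018 | Huddersfield | Chelsea | 0 | 3 | A | 0 | 2 | A | ... | 1.84 | 23 | 1.0 | 1.84 | 1.8 | 2.13 | 2.06 | 7.24 | 3.95 | 1.58 |
| 4 | E0 | 11/08/2018 | Newcastle | Tottenham | 1 | 2 | A | 1 | 2 | A | ... | 1.81 | 20 | 0.25 | 2.2 | 2.12 | 1.8 | 1.76 | 4.74 | 3.53 | 1.89 |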


## Light cleaning

In [7]:
# Filtering and renaming columns of interest
columns = ["HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]
data = data[columns].rename(
    columns={"FTHG": "HomeGoals", "FTAG": "AwayGoals", "FTR": "Result"}
)

In [8]:
# Remove final week of fixtures
data = data[:-10]

## Simple analysis: Home team advantage?


In [9]:
# Compute the average number of home and away goals
data[["HomeGoals", "AwayGoals"]].mean()

HomeGoals    1.575676
AwayGoals    1.224324
dtype: float64
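
Average goals are one view of home advantage; the share of match results is another. The sketch below uses only the five matches shown in the `data.head()` output above, which is far too few to draw conclusions from. The names `sample` and `result_share` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five 2018/19 matches shown in the data.head() output above
sample = pd.DataFrame(
    {
        "HomeGoals": [2, 2, 0, 0, 1],
        "AwayGoals": [1, 0, 2, 3, 2],
        "Result": ["H", "H", "A", "A", "A"],
    }
)

# Share of home wins (H), draws (D) and away wins (A)
result_share = (
    sample["Result"]
    .value_counts(normalize=True)
    .reindex(["H", "D", "A"], fill_value=0.0)
)
print(result_share)
```

On the full season, the same idea would be applied to the `Result` column of `data`.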

## Towards a match prediction model 

- One way to predict the match score is to consider the number of goals scored by each team


- We will denote the number of home team goals by $y_i$ where $i$ indicates the particular match


- Furthermore, we will use *regression analysis* to model $\mathbb{E}[y_i | X_i]$ 

## The Poisson distribution

- The Poisson distribution is often used to model the probability distribution of *count events* (that is, the same event happening a specific number of times in a fixed time frame)


- It is a *discrete* distribution parameterized by a constant mean rate of occurrence $\lambda$

- It assumes that event occurrences within the interval are *independent* of one another

- It can be especially useful to model the number of goals we expect a team to score
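
To make the role of $\lambda$ concrete, the probability of exactly $k$ events is $\lambda^k e^{-\lambda} / k!$. The sketch below evaluates this by hand for an illustrative rate of 1.58 goals per match (roughly the home average computed earlier); `poisson_pmf` is a helper written for this example, not a library function.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean rate lam."""
    return lam**k * exp(-lam) / factorial(k)


# Illustrative rate: roughly the average home goals per match seen earlier
lam = 1.58
probs = [poisson_pmf(k, lam) for k in range(9)]

for k, p in enumerate(probs):
    print(f"P({k} goals) = {p:.3f}")
```

`scipy.stats.poisson.pmf(k, lam)` returns the same values and is what the rest of the notebook uses.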

In [None]:
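# Draw samples from a Poisson distribution with mean 2 and plot their frequencies
x = np.random.poisson(2, 100_000)
# One bin per integer count so that the bars line up with 0, 1, 2, ...
plt.hist(x, bins=np.arange(x.max() + 2) - 0.5, density=True)
plt.show()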

In [None]:
# Prepare the dataset
# Separate home and away teams/goals - then concatenate
goal_model_data = pd.concat(
    [
        data[["HomeTeam", "AwayTeam", "HomeGoals"]]
        .assign(home=1)
        .rename(
            columns={"HomeTeam": "team", "AwayTeam": "opponent", "HomeGoals": "goals"}
        ),
        data[["AwayTeam", "HomeTeam", "AwayGoals"]]
        .assign(home=0)
        .rename(
            columns={"AwayTeam": "team", "HomeTeam": "opponent", "AwayGoals": "goals"}
        ),
    ]
)
goal_model_data.head()

## The Poisson regression model

- To predict the number of goals each team scores, we will fit the following model using match-level data


$$y_i \sim \text{Poisson}(\lambda_i)$$
$$\ln (\lambda_i) = X_i\beta$$

- Here, $\lambda_i$ is the mean number of goals scored by the home team in match $i$, which we aim to predict using variables $X_i$
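
Because the model is linear on the log scale, each coefficient acts multiplicatively on expected goals. The sketch below uses made-up coefficients (hypothetical numbers, not fitted values) and assumes a linear predictor made of an intercept, a home indicator, and team and opponent effects.

```python
from math import exp, isclose

# Hypothetical coefficients on the log scale (NOT fitted values)
intercept = 0.20
beta_home = 0.25       # effect of playing at home
beta_team = 0.10       # effect of the scoring team
beta_opponent = -0.15  # effect of the opponent (negative = harder to score against)


def expected_goals(home):
    """Invert the log link: lambda = exp(linear predictor)."""
    return exp(intercept + beta_home * home + beta_team + beta_opponent)


lam_home = expected_goals(home=1)
lam_away = expected_goals(home=0)

# The home effect multiplies expected goals by exp(beta_home)
print(lam_home, lam_away, lam_home / lam_away)
```

Under these assumed numbers, playing at home scales expected goals by `exp(0.25)`, regardless of which teams are involved.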

In [None]:
# Building the model
# Poisson Regression: log-linear model
poisson_model = smf.glm(
    formula="goals ~ home + team + opponent",
    data=goal_model_data,
    family=sm.families.Poisson(),
).fit()

In [None]:
# Get a statistical summary of the poisson model
poisson_model.summary()

### Simulation & Validation

In [None]:
# Build a function to simulate a match using the newly generated poisson model
# Outputs the probability distribution
# Considers 8 goals as a maximum for either team


def simulate_match(homeTeam, awayTeam, max_goals=8, foot_model=poisson_model):
    home_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": homeTeam, "opponent": awayTeam, "home": 1}, index=[1]
        )
    ).values[0]
    away_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": awayTeam, "opponent": homeTeam, "home": 0}, index=[1]
        )
    ).values[0]
    team_pred = [
        [poisson.pmf(i, team_avg) for i in range(0, max_goals + 1)]
        for team_avg in [home_goals_avg, away_goals_avg]
    ]
    return np.outer(np.array(team_pred[0]), np.array(team_pred[1]))


simulate_match("Chelsea", "Man City")
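
A score matrix like the one returned by `simulate_match` (rows are home goals, columns are away goals) can be collapsed into home win, draw and away win probabilities. The sketch below is self-contained and uses illustrative expected goals of 1.58 and 1.22 (roughly the league averages computed earlier) rather than model predictions. It also cross-checks the draw probability with the Skellam distribution, which describes the difference of two independent Poisson variables.

```python
import numpy as np
from scipy.stats import poisson, skellam

# Illustrative expected goals (roughly the league averages computed earlier)
home_avg, away_avg = 1.58, 1.22
max_goals = 8

goals = np.arange(max_goals + 1)
# Rows index home goals, columns index away goals
score_matrix = np.outer(poisson.pmf(goals, home_avg), poisson.pmf(goals, away_avg))

home_win = np.tril(score_matrix, -1).sum()  # home goals > away goals
draw = np.trace(score_matrix)               # home goals == away goals
away_win = np.triu(score_matrix, 1).sum()   # home goals < away goals

# Cross-check: the goal difference of two independent Poissons is Skellam
draw_skellam = skellam.pmf(0, home_avg, away_avg)

print(home_win, draw, away_win, draw_skellam)
```

The three probabilities sum to slightly less than 1 because scorelines above `max_goals` are truncated.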

In [None]:
# Similar function, selecting most probable scoreline


def simulate_match_output(homeTeam, awayTeam, max_goals=8, foot_model=poisson_model):
    # Predict avg goals
    home_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": homeTeam, "opponent": awayTeam, "home": 1}, index=[1]
        )
    ).values[0]
    away_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": awayTeam, "opponent": homeTeam, "home": 0}, index=[1]
        )
    ).values[0]

    team_pred = [
        [poisson.pmf(i, team_avg) for i in range(0, max_goals + 1)]
        for team_avg in [home_goals_avg, away_goals_avg]
    ]
    distribution = np.outer(
        np.array(team_pred[0]), np.array(team_pred[1])
    )  # multiply distributions together

    # Get most likely score from the matrix
    h, a = np.unravel_index(np.argmax(distribution), distribution.shape)
    print(homeTeam + ": " + str(h) + "\n" + awayTeam + ": " + str(a))

    return (int(h), int(a))


simulate_match_output("Chelsea", "Man City")

In [None]:
# Similar function, built to be iterated


def simulate_match_clean(homeTeam, awayTeam, max_goals=8, foot_model=poisson_model):
    # Predict avg goals
    home_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": homeTeam, "opponent": awayTeam, "home": 1}, index=[1]
        )
    ).values[0]
    away_goals_avg = foot_model.predict(
        pd.DataFrame(
            data={"team": awayTeam, "opponent": homeTeam, "home": 0}, index=[1]
        )
    ).values[0]

    team_pred = [
        [poisson.pmf(i, team_avg) for i in range(0, max_goals + 1)]
        for team_avg in [home_goals_avg, away_goals_avg]
    ]
    distribution = np.outer(
        np.array(team_pred[0]), np.array(team_pred[1])
    )  # *multiply distributions together

    # Get most likely score
    h, a = np.unravel_index(np.argmax(distribution), distribution.shape)

    return (int(h), int(a))


simulate_match_clean("Chelsea", "Man City")

In [None]:
# Simulate matches for any given PL season
# Takes in dataset as input


def simulate_test(x):
    # Reset the index so that rows are numbered 0..n-1
    data = x.copy().reset_index(drop=True)

    home_pred = []
    away_pred = []

    for homeTeam, awayTeam in zip(data["HomeTeam"], data["AwayTeam"]):
        # Use the returned scoreline rather than global variables
        h, a = simulate_match_clean(homeTeam, awayTeam)
        home_pred.append(h)
        away_pred.append(a)

    data["HomePred"] = home_pred
    data["AwayPred"] = away_pred

    return data


epl_1819 = data  # cleaned 2018/19 results prepared above
epl_1819_post = simulate_test(epl_1819)
epl_1819_post

In [None]:
type(epl_1819_post["AwayPred"][2])

In [None]:
# Add prediction results column


def update_df_res(data):

    ResultPred = []

    # Compare predicted home and away goals row by row
    for home_pred, away_pred in zip(data["HomePred"], data["AwayPred"]):
        if home_pred == away_pred:
            ResultPred.append("D")
        elif home_pred < away_pred:
            ResultPred.append("A")
        else:
            ResultPred.append("H")

    data = data.copy()
    data["ResultPred"] = ResultPred

    return data
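
The same H/D/A labelling can be written without an explicit loop. The sketch below uses `np.select` on a toy frame of predicted scorelines; the values and the name `preds` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy predicted scorelines (illustrative values only)
preds = pd.DataFrame({"HomePred": [2, 1, 0], "AwayPred": [1, 1, 3]})

# np.select picks the first label whose condition is true for each row
preds["ResultPred"] = np.select(
    [preds["HomePred"] > preds["AwayPred"], preds["HomePred"] == preds["AwayPred"]],
    ["H", "D"],
    default="A",
)
print(preds)
```

Vectorised comparisons like this avoid index mix-ups and scale better to a full season of fixtures.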

In [None]:
# Validate full time result prediction


def update_df_ftr(data):

    correctFTR = []

    for i in range(len(data)):
        # Compare string values with ==, not identity (`is`)
        if str(data["ResultPred"][i]) == str(data["Result"][i]):
            correctFTR.append(True)
        else:
            correctFTR.append(False)

    data = pd.concat([data, pd.Series(correctFTR)], axis=1).rename(
        {0: "correctFTR"}, axis=1
    )

    return data

In [None]:
# Validate scoreline prediction


def update_df_correct(data):

    correctScore = []

    for i in range(len(data)):
        if (
            data["HomeGoals"][i] == data["HomePred"][i]
            and data["AwayGoals"][i] == data["AwayPred"][i]
        ):
            correctScore.append(True)
        elif data["HomeGoals"][i] == data["HomePred"][i]:
            correctScore.append("Home")
        elif data["AwayGoals"][i] == data["AwayPred"][i]:
            correctScore.append("Away")
        else:
            correctScore.append(False)

    data = pd.concat([data, pd.Series(correctScore)], axis=1).rename(
        {0: "correctScore"}, axis=1
    )

    return data

In [None]:
epl_1819_post = update_df_correct(epl_1819_post)

In [None]:
epl_1819_post.columns

In [None]:
# Add validation columns


def update_df(data):

    data = data.copy()

    # Predicted full-time result from the predicted scoreline
    ResultPred = []

    for home_pred, away_pred in zip(data["HomePred"], data["AwayPred"]):
        if home_pred > away_pred:
            ResultPred.append("H")
        elif home_pred == away_pred:
            ResultPred.append("D")
        else:
            ResultPred.append("A")

    data["ResultPred"] = ResultPred

    # Was the predicted result correct?
    correctFTR = []

    for pred, actual in zip(data["ResultPred"], data["Result"]):
        correctFTR.append(str(pred) == str(actual))

    data["correctFTR"] = correctFTR

    # Was the predicted scoreline correct (fully, home only, away only, or not at all)?
    correctScore = []

    for home, away, home_pred, away_pred in zip(
        data["HomeGoals"], data["AwayGoals"], data["HomePred"], data["AwayPred"]
    ):
        if home == home_pred and away == away_pred:
            correctScore.append(True)
        elif home == home_pred:
            correctScore.append("Home")
        elif away == away_pred:
            correctScore.append("Away")
        else:
            correctScore.append(False)

    data["correctScore"] = correctScore

    return data


epl_1819_post = update_df(epl_1819_post)

In [None]:
epl_1819_post.head(20)

In [None]:
# Breakdown of scoreline predictions: True = exact score, "Home"/"Away" = one side correct, False = neither
epl_1819_post["correctScore"].value_counts()

In [None]:
# At least one correct prediction (home, away or both): every row that is not False
epl_1819_post["correctScore"].ne(False).sum()

In [None]:
# Most common incorrect predictions (draws seem pretty common here?!)
# epl_1819_post2 = epl_1819_post[epl_1819_post['correctFTR'] == False]
# epl_1819_post2['ResultPred'].value_counts()

# WILL NOT WORK UNTIL DRAWS ARE ACCOUNTED FOR

### Plots

In [None]:
chelsea_mancity = simulate_match("Chelsea", "Man City")

In [None]:
def goal_matrix(homeTeam, awayTeam):

    x = simulate_match(homeTeam, awayTeam)

    # Axes labels, matched to the size of the score matrix
    goals = list(range(x.shape[0]))

    # Plot figure
    fig, ax = plt.subplots(figsize=(3, 3), dpi=400)
    fig.tight_layout()
    im = ax.imshow(x, cmap="winter")

    # Add grid
    ax.minorticks_on()
    ax.xaxis.set_minor_locator(AutoMinorLocator(2))
    ax.yaxis.set_minor_locator(AutoMinorLocator(2))
    ax.tick_params(axis="both", which="minor", color="w", length=0)
    ax.grid(which="minor", color="b", linestyle="-", linewidth=0.6)

    # Set ticks and parameters
    ax.set_yticks(np.arange(len(goals)))
    ax.set_yticklabels(goals, fontsize=5)

    ax.xaxis.tick_top()
    ax.set_xticks(np.arange(len(goals)))
    ax.set_xticklabels(goals, fontsize=5)

    ax.tick_params(axis="both", which="major", length=2, pad=1.5)

    ax.set_ylabel("Home Goals", fontsize=6)  # axes label (y)
    ax.set_xlabel("Away Goals", fontsize=6)
    ax.xaxis.set_label_position("top")

    # Title with the home and away teams
    ax.set_title(
        str(homeTeam) + " vs. " + str(awayTeam) + " Forecast", fontsize=7, y=1.1
    )

    # Round probabilities to three decimals and add them to the plot
    rounded = np.round(x, 3)

    for i in range(len(goals)):
        for j in range(len(goals)):
            ax.text(
                j,
                i,
                rounded[i, j],
                ha="center",
                va="center",
                color="black",
                fontsize=3,
                fontfamily="monospace",
            )

    plt.show()


goal_matrix("Man City", "Burnley")

In [None]:
# Define a function to get the season's fixtures from csv
def get_epl_fixtures(season):
    """
    Takes in season formatted as YYYY (first year of the season)
    """
    x = pd.read_csv(
        "https://fixturedownload.com/download/epl-"
        + str(season)
        + "-GMTStandardTime.csv"
    )  # input season year within hyperlink

    x = x[
        ["Round Number", "Home Team", "Away Team", "Result", "Date"]
    ]  # isolate required columns
    x = x.rename(
        columns={"Round Number": "GW", "Home Team": "HomeTeam", "Away Team": "AwayTeam"}
    )

    return x

In [None]:
epl_fixtures_2021 = get_epl_fixtures(2021)
epl_fixtures_2021

In [None]:
# Get this week's fixtures
epl_fixtures_2021.iloc[50:60]

#### Season Standings

https://www.rotowire.com//soccer/tables/standings.php?league=EPL&length=total&season=2019

https://www.rotowire.com/soccer/league-table.php?season=2018

### Next Steps

Build on the baseline.

Participate in the competition.