# Introduction

Predicting the results of football matches is an area of active research, with state-of-the-art models reaching accuracies of only about 50\%. In this paper, additional data describing each football club (e.g. club worth, attack score) was sourced, and an exponentially weighted average of historical match statistics was calculated in order to model time decay, in the style of the Dixon-Coles model of goal likelihood. Some dimensionality reduction was also attempted on the feature set using Linear Discriminant Analysis, with little effect. The resulting data was then tested with five different classifier models, all of which achieved accuracies between 0.468 and 0.534. Of these, a Support Vector Machine was chosen as it consistently demonstrated the highest accuracy.

Lastly, sequential feature selection, where features are added and removed to determine their effect on accuracy, demonstrated that accuracy peaked when three features were being considered at once. The cross-validation accuracy of the model was calculated to be 0.55 with a standard deviation of 0.014. In terms of predictions using training data, the class that was correctly predicted most often was home wins, with an accuracy of 0.78, whereas the probability of an away win being correctly predicted was 0.56. Interestingly, the model made no predictions for draws, with the probability of incorrectly predicting a draw as a home win being 22\% higher than predicting it as an away win.

# Data Import

This section imports all the required libraries. Then, the required CSV files are loaded into variables for use throughout the notebook. There are three required CSV files.

`epl-training.csv` - contains the provided match data

`epl-teams.csv` - contains scraped team scores from [sofifa.com](https://sofifa.com)

`epl-test.csv` - contains the final test data

## Import libraries

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_selection import SequentialFeatureSelector

from sklearn.metrics import ConfusionMatrixDisplay

import seaborn as sns
sns.set_theme(style="whitegrid")

import matplotlib.pyplot as plt

## Scraping tool

The scraping tool is stored in a separate Python file, `fifa_scraper.py`. If `epl-teams.csv` has not been generated, the scraper must be run. Configure the leagues for which to retrieve team scores through the `years` variable. Each year has a corresponding league ID, which can be found on [sofifa.com](https://sofifa.com).

The team scores that are obtained are ratings from FIFA, which represent a measure of how popular and good each team is considered to be. This can be used to augment the training data.

The following ratings are collected:

- **OA**: Overall score (0-100)
- **AT**: Attack score (0-100)
- **MD**: Midfield score (0-100)
- **DF**: Defense score (0-100)
- **CW**: Club worth (millions)


In [None]:
from fifa_scraper import scrape_premier_league_teams

years = [('240016', 2024), ('230054', 2023), ('220069', 2022), 
         ('210064', 2021), ('200061', 2020), ('190075', 2019), 
         ('180084', 2018), ('170099', 2017), ('160058', 2016), 
         ('150059', 2015), ('140052', 2014), ('130034', 2013)]

scrape_premier_league_teams(years)

## Data loading

The required data is now loaded. The match data is stored in `df` and the team scores are stored in `df_teams`. In the following sections, the two datasets will be merged.

In [None]:
root_dir = './' # ENTER YOUR WORKING DIRECTORY HERE #
epl_training_path = root_dir + 'epl-training.csv'
epl_teams_path = root_dir + 'epl-teams.csv'
epl_test_path = root_dir + 'epl-test.csv'

# Load and describe the csv file
df = pd.read_csv(epl_training_path)
df_teams = pd.read_csv(epl_teams_path)
df_test = pd.read_csv(epl_test_path)

### _Raw Training Data_
View the match data and verify content.

In [None]:
df.describe()

In [None]:
df.tail()

### _Raw Team Scores_
View the raw team scores and verify content

In [None]:
df_teams.tail()

# Data Transformation and Exploration

## Data Cleaning

This section checks the match data stored in `df` for missing values and NaNs, which might cause problems later on. All rows with NaN values are dropped.

`df_teams` need not be sanitized as it is sanitized by the scraper tool during the scraping process.

In [None]:
# Drop every row that contains a NaN in any column, then parse the match dates
df = df.dropna()
df['Date'] = pd.to_datetime(df['Date'])

# Check that all invalid fields have been removed
assert(df['HomeTeam'].apply(lambda x: not isinstance(x, str)).sum() == 0)
assert(df['AwayTeam'].apply(lambda x: not isinstance(x, str)).sum() == 0)

df.describe()

This results in roughly half of the match data being discarded. However, the remaining 4082 matches are sufficient examples for the model to achieve good performance.

## Encoding classes as labels

To encode classes as labels, the `LabelEncoder` from sklearn is used. There are two primary pieces of class information, namely the team names and the final result. The team names in `df_teams` are already standardized to the names used in the match data, so the same `LabelEncoder` can be used in both cases.

In [None]:
# Instantiate LabelEncoder
team_encoder = LabelEncoder()

# Fit on the 'HomeTeam' column, then transform every team-name column with the same encoder
team_encoder = team_encoder.fit(df['HomeTeam'])
df['HomeTeamLabel'] = team_encoder.transform(df['HomeTeam'])
df['AwayTeamLabel'] = team_encoder.transform(df['AwayTeam'])
df_teams['TeamLabel'] = team_encoder.transform(df_teams['Team'])

print(f"Team Names: {team_encoder.classes_}")

In [None]:
result_encoder = LabelEncoder()
result_encoder = result_encoder.fit(df['FTR'])
df['FTRLabel'] = result_encoder.transform(df['FTR'])
df['HTRLabel'] = result_encoder.transform(df['HTR'])

print(f"Result names: {result_encoder.classes_}")
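
As a small illustration of what the encoders above produce, the following sketch fits a `LabelEncoder` on a handful of made-up result codes (the toy list and the `toy_` names are assumptions, not taken from the dataset). `LabelEncoder` sorts the classes before assigning integers, so with `A`/`D`/`H` codes an away win maps to 0, a draw to 1, and a home win to 2.

```python
from sklearn.preprocessing import LabelEncoder

# Toy full-time results (hypothetical values, for illustration only)
toy_results = ['H', 'D', 'A', 'H']

toy_encoder = LabelEncoder().fit(toy_results)
toy_labels = toy_encoder.transform(toy_results)

print(toy_encoder.classes_)                       # classes are stored in sorted order
print(toy_labels)                                 # one integer label per result
print(toy_encoder.inverse_transform(toy_labels))  # recovers the original codes
```

The same mapping is what makes `result_encoder.inverse_transform` usable later to turn predicted integers back into result codes.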

## Analyzing home team advantage

From the data, the empirical probabilities of a home win, an away win, and a draw can be calculated. The home team is clearly more likely to win than the away team, which shows the existence of home team advantage.

In [None]:
ftr_counts = df['FTR'].value_counts(normalize=True) # Normalize to convert to probabilities

print(ftr_counts)

plt.figure(figsize=(8, 6))
ftr_counts.plot(kind='bar', edgecolor='black')
plt.xlabel('Game Outcome')
plt.ylabel('Probability')
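# Derive tick labels from the index so they always match the bar order
outcome_names = {'H': 'Home Win', 'A': 'Away Win', 'D': 'Draw'}
plt.xticks(range(len(ftr_counts)), [outcome_names.get(k, k) for k in ftr_counts.index], rotation=0)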
plt.grid(axis='y')
    
plt.tight_layout()
plt.show()

## Combined match data and team scores

The following snippet of code combines the data from each match and the team scores. While doing so, it performs data transformation on the match data to make it more suitable for future training.

### _Feature engineering from match data_

One concern with the match data in its current form is that a process must be developed to roll up historical match data into a feature vector which can be used to classify the next result. A naive implementation might just use the features of the immediately preceding home and away game, but a better approach would include the historical data of the team.

The **Dixon-Coles model** uses a Poisson distribution to model the goal likelihood of the home and away teams in order to make predictions on game outcomes. Its two key contributions to the basic Poisson model are a correction term for low-scoring matches (games with 0-0, 0-1, 1-0, and 1-1 outcomes) and an exponential time weighting component that weights recent matches more strongly than old ones.

The second element is the one relevant to the feature engineering stage. By applying an exponentially weighted average to each column of the match data, a time-decayed feature can be constructed per column. The features can be precomputed as follows:

In [None]:
home_columns = ['FTHG', 'HTHG', 'HS', 'HST', 'HF', 'HC', 'HY', 'HR']
away_columns = ['FTAG', 'HTAG', 'AS', 'AST', 'AF', 'AC', 'AY', 'AR']
team_columns_single = ['OA', 'AT', 'MD', 'DF', 'CW']
team_columns = ['HOA', 'HAT', 'HMD', 'HDF', 'HCW', 'AOA', 'AAT', 'AMD', 'ADF', 'ACW']

td = pd.Timedelta(days=365) # Adjust for exponential weighting

# Precomputing features
precomputed_features = {}

for team in team_encoder.classes_:
    #  Calculating the Home features
    df_home = df[df['HomeTeam'] == team]
    df_home = df_home[['Date', *home_columns]]

    # Applying the exponential weighting mean
    df_home[home_columns] = df_home[home_columns].ewm(times=df_home['Date'], halflife=td).mean()
    precomputed_features[team] = {}
    precomputed_features[team]['Home'] = df_home
    
    # Calculating the Away features
    df_away = df[df['AwayTeam'] == team]
    df_away = df_away[['Date', *away_columns]]

    # Applying the exponential weighting mean
    df_away[away_columns] = df_away[away_columns].ewm(times=df_away['Date'], halflife=td).mean()
    precomputed_features[team]['Away'] = df_away
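
To illustrate what the time-weighted mean above computes, here is a minimal sketch on made-up data (the toy values, the 10-day half-life, and all variable names are assumptions chosen for easy arithmetic). With `times` and `halflife` given, each past observation is weighted by 0.5 raised to its age measured in half-lives, and the result is the weighted average of all observations up to and including the current row.

```python
import numpy as np
import pandas as pd

# Toy goal counts for one team (hypothetical values), one match every 10 days
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-11', '2023-01-21']),
    'FTHG': [1.0, 3.0, 2.0],
})
halflife = pd.Timedelta(days=10)

# Time-aware exponentially weighted mean, same call pattern as in the cell above
ewm_mean = toy['FTHG'].ewm(times=toy['Date'], halflife=halflife).mean()

# Manual check for the last row: a match that is one half-life old gets weight 0.5
age_in_halflives = (toy['Date'].iloc[-1] - toy['Date']) / halflife
weights = 0.5 ** age_in_halflives
manual_last = (weights * toy['FTHG']).sum() / weights.sum()

print(ewm_mean.iloc[-1], manual_last)
```

Because each row's mean includes that row's own match, the merge step has to look up the latest row dated strictly before the match being predicted, which is what the `< row['Date']` filter in the merging code does.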


### _Merging match data with team scores_

To construct the final feature matrix, the transformed match data must be merged with the team scores. For each match, a feature vector is constructed by taking the exponentially weighted mean up to the previous match, and the home and away team scores for that league are appended to it.

Additionally, only match data from the 2013 league onwards is considered useful, as a decade of training examples is deemed sufficient to minimize bias in predictions. Older match data might mislead the model, since it reflects players who no longer play for those teams.

The team scores are assumed to be constant throughout a league.

In [None]:
# List to store league date ranges
league_dates = []

# Loop through the league start years from 2013 to 2023 (inclusive)
for year in range(2013, 2023 + 1):
    # Define the start and end dates for each league year (assuming August 1 to June 1)
    start_date = f"{year}-08-01"
    end_date = f"{year + 1}-06-01"
    
    # Append the start and end dates as a tuple to the league_dates list
    league_dates.append((start_date, end_date, year))

feature_columns = ['Date', 'League', 'HomeTeam', 'AwayTeam', *home_columns, *away_columns, *team_columns, 'FTRLabel']
df_features = pd.DataFrame(columns=feature_columns)
feature_list = []

for start, end, year in league_dates:
    print(f"{start} : {end}")
    league = df[(df['Date'] >= start) & (df['Date'] <= end)]
    
    for index in range(len(league)):
        row = league.iloc[index]
        home =  precomputed_features[row['HomeTeam']]['Home']
        home = home[home['Date'] < row['Date']]
        if home.empty:
            continue
        else:
            home = home.iloc[-1]

        # Check if ratings exist for the year
        home_team = df_teams[df_teams['Team'] == row['HomeTeam']]
        home_team = home_team[home_team['Year'] == year]

        # Else use the latest one
        if home_team.empty:
            home_team = df_teams[df_teams['Team'] == row['HomeTeam']]
            home_team = home_team[home_team['Year'] < year]
            # Else use sensible default team stats
            if home_team.empty:
                home_team = pd.DataFrame([[60, 60, 60, 60, 1]], columns=team_columns_single) # Defaults
                home_team = home_team.iloc[0]
            else:
                home_team = home_team.iloc[0]
        else:
            home_team = home_team.iloc[0]

        away =  precomputed_features[row['AwayTeam']]['Away']
        away = away[away['Date'] < row['Date']]
        if away.empty:
            continue
        else:
            away = away.iloc[-1]

        # Check if ratings exist for the year
        away_team = df_teams[df_teams['Team'] == row['AwayTeam']]
        away_team = away_team[away_team['Year'] == year]

        # Else use the latest one
        if away_team.empty:
            away_team = df_teams[df_teams['Team'] == row['AwayTeam']]
            away_team = away_team[away_team['Year'] < year]
            if away_team.empty:
                # Else use sensible default team stats
                away_team = pd.DataFrame([[60, 60, 60, 60, 1]], columns=team_columns_single) # Defaults
                away_team = away_team.iloc[0]
            else:
                away_team = away_team.iloc[0]
        else:
            away_team = away_team.iloc[0]

        feature_list.append([row['Date'], year, row['HomeTeam'], row['AwayTeam'],
                            *home[home_columns], *away[away_columns],
                            *home_team[team_columns_single], *away_team[team_columns_single], row['FTRLabel']])

df_features = pd.DataFrame(feature_list, columns=feature_columns)
df_features.tail()
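
The rating lookup in the cell above is repeated for the home and the away team. One possible way to factor it out is sketched below. The function name `lookup_team_ratings`, the constants, and the toy table are hypothetical and not part of the notebook. Unlike the cell above, which takes the first matching row, this sketch sorts by `Year` explicitly so that "latest earlier year" does not depend on the row order of the CSV.

```python
import pandas as pd

RATING_COLUMNS = ['OA', 'AT', 'MD', 'DF', 'CW']
# Same fallback values as the defaults used in the notebook
DEFAULT_RATINGS = pd.Series([60, 60, 60, 60, 1], index=RATING_COLUMNS)

def lookup_team_ratings(df_teams, team, year):
    """Ratings for `team` in `year`, else the most recent earlier year, else defaults."""
    rows = df_teams[df_teams['Team'] == team]
    exact = rows[rows['Year'] == year]
    if not exact.empty:
        return exact.iloc[0][RATING_COLUMNS]
    earlier = rows[rows['Year'] < year]
    if not earlier.empty:
        return earlier.sort_values('Year').iloc[-1][RATING_COLUMNS]
    return DEFAULT_RATINGS.copy()

# Toy ratings table (made-up numbers) to exercise the three branches
toy_teams = pd.DataFrame({
    'Team': ['Arsenal', 'Arsenal', 'Chelsea'],
    'Year': [2021, 2019, 2021],
    'OA': [80, 78, 82], 'AT': [81, 79, 83], 'MD': [80, 77, 82],
    'DF': [79, 76, 81], 'CW': [500, 450, 600],
})

print(lookup_team_ratings(toy_teams, 'Arsenal', 2021)['OA'])  # exact year
print(lookup_team_ratings(toy_teams, 'Arsenal', 2020)['OA'])  # falls back to 2019
print(lookup_team_ratings(toy_teams, 'Luton', 2023)['OA'])    # no data, defaults
```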

## Plot the correlation matrix

It would be unsurprising if many of the features in the augmented feature set were redundant, as many aspects of a match are correlated. For example, half-time goals are correlated with full-time goals for both the home and away team. It is worth getting an idea of which features are correlated with which others, both to aid feature selection and to help respect the conditional independence assumptions that some classifiers make.

In [None]:
corr_mat = df_features[[*home_columns, *away_columns, *team_columns]].corr().stack().reset_index(name="correlation")

# Draw each cell as a scatter point with varying size and color
g = sns.relplot(
    data=corr_mat,
    x="level_0", y="level_1", hue="correlation", size="correlation",
    palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
    height=10, sizes=(50, 250), size_norm=(-.2, .8),
)

# Tweak the figure to finalize
g.set(xlabel="", ylabel="", aspect="equal")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)
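
As a complement to the plot, the strongly correlated pairs can also be listed as a table. The sketch below is one way to do this. The helper name `correlated_pairs`, the 0.8 threshold, and the toy frame are assumptions for illustration. In the notebook it could be called on `df_features[[*home_columns, *away_columns, *team_columns]]`.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once and self-pairs are dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]  # also discards any masked (NaN) entries
    return pairs.sort_values(key=np.abs, ascending=False)

# Toy frame (made-up values): 'b' is an exact multiple of 'a', 'c' is only weakly related
toy_frame = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],
    'c': [1, -1, 1, -1],
})
toy_pairs = correlated_pairs(toy_frame)
print(toy_pairs)
```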

# Methodology Overview

## Train-test split

The performance of teams in leagues is expected to vary through the duration of a league. This could be due to tiredness, loss of morale and other factors. Therefore, the performance of any model could vary significantly during a league.

Thus, the test setup must take this into account, and entire leagues must be set aside for testing. In this study, the years 2013-2019 are considered training data and 2020 onwards is considered test data. Within the training data, **5-fold cross validation** is used for hyperparameter tuning. The data is split as follows.

- Training Set: 2462 matches (67.8%)
- Test Set: 1170 matches (32.2%)

In [None]:
df_features_train = df_features[df_features['League'] <= 2019] # Train on everything before and including 2019
df_features_train.describe()

In [None]:
df_features_test = df_features[df_features['League'] > 2019] # Test on 2020 and onwards
df_features_test.describe()

In [None]:
X_train = df_features_train[[*home_columns, *away_columns, *team_columns]].values
y_train = df_features_train[['FTRLabel']].values.ravel()

In [None]:
# Apply LDA with 2 components for visualization (can be changed)
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_train, y_train)

targets = list(set(y_train))
colors = ['blue', 'orange', 'green']

for target, color in zip(targets, colors):
    indices_to_keep = y_train == target
    plt.scatter(X_lda[indices_to_keep, 0], X_lda[indices_to_keep, 1], c=color, label=target, edgecolor='k')

plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')
plt.title('LDA of features')
plt.legend()
plt.tight_layout()

plt.show()

## Model Selection

There are a variety of possible models that could be deployed for this problem. The following section attempts to compare different models and their performance for the task.

In [None]:
def train_and_compare_classifiers(X, y, classifiers, title="Comparison of Classifier Accuracies"):
    accuracies = {}
    std_devs = {}
    
    for clf_name, clf in classifiers.items():
        skf = StratifiedKFold(n_splits=5)
        scores = cross_val_score(clf, X, y, cv=skf)
        accuracies[clf_name] = scores.mean()
        std_devs[clf_name] = scores.std()

        print(f"{clf_name}: Mean {scores.mean()}, Std. Deviation {scores.std()}")

    # Plotting accuracies
    plt.figure(figsize=(10, 6))
    bars = plt.bar(accuracies.keys(), accuracies.values(), yerr=list(std_devs.values()), capsize=5, alpha=0.7)
    plt.ylabel('Accuracy')
    plt.title(title)
    plt.ylim([0, 1])
    plt.tight_layout()

    # Add labels on top of each bar
    for bar in bars:
        height = bar.get_height()
        label = height * 100
        plt.text(bar.get_x() + bar.get_width()/2, height + 0.05, f'{label:.3f}%', ha='center', va='bottom')

    plt.show()

### _Setup 5 potential classifier types_

For most classifier types, normalizing the features tends to prevent a single feature from dominating the classifier's loss function. Thus a `StandardScaler` is applied to the data within a `Pipeline` to normalize it.

The tested classifiers are: Support Vector Machine, KNN, Random Forest, Naive Bayes and Logistic Regression. Of these, the Support Vector Machine has the best empirical performance.

In [None]:
classifiers = {
    'SVM': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', SVC())]),
    'KNN': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', KNeighborsClassifier())]),
    'Random Forest': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', RandomForestClassifier())]),
    'Naive Bayes': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', GaussianNB())]),
    'Logistic Regression': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', LogisticRegression())]),
}

train_and_compare_classifiers(X_train, y_train, classifiers)

### _Dimensionality Reduction_
Since there are many features, and many of those features are correlated, there might be opportunities where dimensionality reduction could improve the performance of the model. Since the Support Vector Machine is seen to have the best classifier performance, it is chosen for further investigation.

A standard `SVC` is compared to an `SVC` with Principal Component Analysis applied, and another with Linear Discriminant Analysis applied, as a preprocessing step. A `Pipeline` is used to prevent data leakage.

Dimensionality reduction does not help.

In [None]:
classifiers = {
    'SVM': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', SVC())]),
    'SVM + PCA': Pipeline([
            ('scaler', StandardScaler()),
            ('pca', PCA(n_components=2)), # Same number of components as the LDA variant
            ('classifier', SVC())]),
    'SVM + LDA': Pipeline([
            ('scaler', StandardScaler()),
            ('lda', LDA(n_components=2)), # 3 classes allow at most 2 discriminants
            ('classifier', SVC())]),
}

train_and_compare_classifiers(X_train, y_train, classifiers)
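
The reason a `Pipeline` prevents leakage can be seen in a small sketch on synthetic data (the random data, the fixed 40/20 split, and the `demo` names are assumptions for illustration). When the pipeline is fitted on a training fold, the scaler's statistics come from that fold only, so nothing about the held-out rows reaches the classifier.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))       # synthetic features
y_demo = rng.integers(0, 3, size=60)    # synthetic 3-class labels

# One manual train/validation split, standing in for a single CV fold
train_idx, val_idx = np.arange(0, 40), np.arange(40, 60)

demo_clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
demo_clf.fit(X_demo[train_idx], y_demo[train_idx])

# The scaler inside the pipeline was fitted on the training rows only
fold_mean = demo_clf.named_steps['scaler'].mean_
print(np.allclose(fold_mean, X_demo[train_idx].mean(axis=0)))
print(demo_clf.score(X_demo[val_idx], y_demo[val_idx]))
```

Scaling the full matrix once before cross validation would instead let the validation rows influence the mean and variance used during training.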

### _Hyperparameter Tuning_

This section varies the `C` hyperparameter of the `SVC` to see whether hyperparameter tuning can improve on the default model.

In [None]:
# One scaled SVM pipeline per candidate value of C
classifiers = {
    f'SVM (C={C})': Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', SVC(C=C))])
    for C in [0.001, 0.01, 0.1, 1.0, 10, 100, 1000]
}

train_and_compare_classifiers(X_train, y_train, classifiers, title="Hyperparameter tuning for SVM")
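
The same sweep could also be expressed with `GridSearchCV`, which the notebook already imports. The sketch below is a hedged alternative rather than the method used above. It runs on synthetic data from `make_classification` so that it is self-contained, and the grid values and `demo` names are assumptions. In the notebook, `X_train` and `y_train` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the match features
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=0
)

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {'classifier__C': [0.01, 0.1, 1.0, 10]}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5), scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```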

### _Sequential Feature Selection_
Since there are still many correlated features, and dimensionality reduction did not help, sequential feature selection is used to find optimal features to include. Mean validation accuracy peaks at **7 features**, and the 7 selected features are used in the final model.

In [None]:
clf = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', SVC(C=1.0))])

accuracies = {}
std_devs = {}

for i in range(1, 11):
    sfs_forward = SequentialFeatureSelector(
        clf, n_features_to_select=i, direction="forward", n_jobs=4
    ).fit(X_train, y_train)

    selected_features = sfs_forward.get_support()
    X_train_selected = X_train[:, selected_features]

    skf = StratifiedKFold(n_splits=5)
    scores = cross_val_score(clf, X_train_selected, y_train, cv=skf)
    accuracies[f"{i}"] = scores.mean()
    std_devs[f"{i}"] = scores.std()

    print(f"Using {i} features, mean score: {scores.mean()}, std dev: {scores.std()}")

# Plotting accuracies
plt.figure(figsize=(10, 6))
bars = plt.bar(accuracies.keys(), accuracies.values(), yerr=list(std_devs.values()), capsize=5, alpha=0.7)
plt.ylabel('Accuracy')
plt.xlabel('Number of Selected Features')
plt.title('Feature Selection with SVM')
plt.ylim([0.5, 0.6])

# Add labels on top of each bar (this must happen before plt.show())
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{height:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Model Training and Validation

In the full training run, the following setup is used.

* The training data is all the data up to and including the **2019** league.
* The test data is the **2020** league and onwards.
* Forward Sequential Feature Selection is used to identify the top 10 features for the model.
* All features are scaled to zero mean and unit variance.
* A Support Vector Machine classifier (`SVC` with `C=1.0`) is used as the model.

In [None]:
X_train = df_features_train[[*home_columns, *away_columns, *team_columns]].values
y_train = df_features_train[['FTRLabel']].values.ravel()

X_test = df_features_test[[*home_columns, *away_columns, *team_columns]].values
y_test = df_features_test[['FTRLabel']].values.ravel()

print(f"Training set size is {X_train.shape}")
print(f"Test set size is {X_test.shape}")

In [None]:
# Define the pipeline
clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC(C=1.0))])

In [None]:
# Sequential Feature Selection, select 10
sfs_forward = SequentialFeatureSelector(
    clf, n_features_to_select=10, direction="forward", n_jobs=4
).fit(X_train, y_train)

selected_features = sfs_forward.get_support()
X_train_selected = X_train[:, selected_features]

feature_names = np.array([*home_columns, *away_columns, *team_columns])
selected_feature_names = feature_names[selected_features]
print(f"The selected features are {selected_feature_names}")

In [None]:
X_train = df_features_train[selected_feature_names].values
y_train = df_features_train[['FTRLabel']].values.ravel()

X_test = df_features_test[selected_feature_names].values
y_test = df_features_test[['FTRLabel']].values.ravel()

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, X_train, y_train, cv=skf)
print(f"Cross validation accuracy of the pipeline is {scores.mean()}. Std dev: {scores.std()}")

# Results

In [None]:
test_score = clf.fit(X_train, y_train).score(X_test, y_test)
print(f"Test accuracy of the pipeline is {test_score}")

disp = ConfusionMatrixDisplay.from_estimator(
    clf,
    X_test,
    y_test,
    display_labels=result_encoder.classes_,
    normalize="true",
)
print(disp.confusion_matrix)
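
To make the row-normalized confusion matrix easier to read, the following sketch works through a tiny made-up example (the toy labels and variable names are assumptions, not results from the model). With `normalize='true'`, each row sums to 1, the diagonal holds the per-class recall, and an all-zero column means that class was never predicted.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels using the encoder's convention: 0 = away win, 1 = draw, 2 = home win
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 2, 2, 0, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes, each row sums to 1
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2], normalize='true')

per_class_recall = cm.diagonal()                      # fraction of each true class predicted correctly
never_predicted = np.where(cm.sum(axis=0) == 0)[0]    # classes the toy "model" never outputs

print(cm)
print(per_class_recall)
print(never_predicted)
```

In this toy example the draw column is empty, which mirrors the kind of behaviour described in the introduction, where the model made no draw predictions.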

# Final Predictions on Test Set

## Generate features

In [None]:
# Setup variables to store the computed features
test_feature_list = []
test_feature_columns = ['Date', 'League', 'HomeTeam', 'AwayTeam', *home_columns, *away_columns, *team_columns]

# Name map because test team names are different from training team names
name_map = {
    'Man City': 'Man City',
    'Arsenal': 'Arsenal',
    'Liverpool': 'Liverpool',
    'Man Utd': 'Man United',
    'Spurs': 'Tottenham',
    'Aston Villa': 'Aston Villa',
    'Chelsea': 'Chelsea',
    'Newcastle': 'Newcastle',
    'West Ham': 'West Ham',
    'Everton': 'Everton',
    'Nottingham Forest': "Nott'm Forest",
    'Brighton': 'Brighton',
    'Wolves': 'Wolves',
    'Fulham': 'Fulham',
    'Crystal Palace': 'Crystal Palace',
    'Brentford': 'Brentford',
    'AFC Bournemouth': 'Bournemouth',
    'Burnley': 'Burnley',
    'Cardiff City': 'Cardiff',
    'Huddersfield Town': 'Huddersfield',
    'Hull City': 'Hull',
    'Leeds United': 'Leeds',
    'Leicester City': 'Leicester',
    'Luton Town': 'Luton',
    'Middlesbrough': 'Middlesbrough',
    'Norwich City': 'Norwich',
    'Queens Park Rangers': 'QPR',
    'Reading': 'Reading',
    'Sheff Utd': 'Sheffield United',
    'Southampton': 'Southampton',
    'Stoke City': 'Stoke',
    'Sunderland': 'Sunderland',
    'Swansea City': 'Swansea',
    'Watford': 'Watford',
    'West Bromwich Albion': 'West Brom',
    'Wigan Athletic': 'Wigan'
}

for index in range(len(df_test)):
    year = 2024
    row = df_test.iloc[index]
    hometeam = name_map[row['HomeTeam']]
    awayteam = name_map[row['AwayTeam']]
    
    home =  precomputed_features[hometeam]['Home']
    home = home[home['Date'] < row['Date']]
    if home.empty:
        continue
    else:
        home = home.iloc[-1]

    # Check if ratings exist for the year
    home_team = df_teams[df_teams['Team'] == hometeam]
    home_team = home_team[home_team['Year'] == year]

    # Else use the latest one
    if home_team.empty:
        home_team = df_teams[df_teams['Team'] == hometeam]
        home_team = home_team[home_team['Year'] < year]
        # Else use sensible default team stats
        if home_team.empty:
            home_team = pd.DataFrame([[60, 60, 60, 60, 1]], columns=team_columns_single) # Defaults
            home_team = home_team.iloc[0]
        else:
            home_team = home_team.iloc[0]
    else:
        home_team = home_team.iloc[0]

    away =  precomputed_features[awayteam]['Away']
    away = away[away['Date'] < row['Date']]
    if away.empty:
        continue
    else:
        away = away.iloc[-1]

    # Check if ratings exist for the year
    away_team = df_teams[df_teams['Team'] == awayteam]
    away_team = away_team[away_team['Year'] == year]

    # Else use the latest one
    if away_team.empty:
        away_team = df_teams[df_teams['Team'] == awayteam]
        away_team = away_team[away_team['Year'] < year]
        if away_team.empty:
            # Else use sensible default team stats
            away_team = pd.DataFrame([[60, 60, 60, 60, 1]], columns=team_columns_single) # Defaults
            away_team = away_team.iloc[0]
        else:
            away_team = away_team.iloc[0]
    else:
        away_team = away_team.iloc[0]

    test_feature_list.append([row['Date'], year, hometeam, awayteam,
                        *home[home_columns], *away[away_columns],
                        *home_team[team_columns_single], *away_team[team_columns_single]])

test_df_features = pd.DataFrame(test_feature_list, columns=test_feature_columns)
test_df_features.tail()

## Train on all the data and make predictions

In [None]:
# Use selected features from earlier
X = df_features[selected_feature_names].values
y = df_features[['FTRLabel']].values.ravel()

X_test = test_df_features[selected_feature_names].values

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(C=1.0))
])

y_pred = pipeline.fit(X,y).predict(X_test)
y_pred = result_encoder.inverse_transform(y_pred)
print(y_pred)

## Store to the csv in desired format

In [None]:
df_test['FTR'] = y_pred
df_test.to_csv('epl-test-populated.csv', index=False)
df_test.head()

# References

- [Forecasting football](https://mercurius.io/en/learn/predicting-forecasting-football)
- [Prediction of football match results with Machine Learning](https://www.sciencedirect.com/science/article/pii/S1877050922007955)
- [Predicting Football Matches Results using Bayesian Networks for English Premier League (EPL)](https://iopscience.iop.org/article/10.1088/1757-899X/226/1/012099)
- [Predicting Football Results Using Machine Learning Techniques](https://www.imperial.ac.uk/media/imperial-college/faculty-of-engineering/computing/public/1718-ug-projects/Corentin-Herbinet-Using-Machine-Learning-techniques-to-predict-the-outcome-of-profressional-football-matches.pdf)
- [Forecasting football match results using a player rating based model](https://www.sciencedirect.com/science/article/pii/S016920702300033X)
- [Forecasting football matches by predicting match statistics](https://content.iospress.com/articles/journal-of-sports-analytics/jsa200462)
  
- Datasets
    - https://www.kaggle.com/datasets/hugomathien/soccer
    - https://www.football-data.co.uk/ratings.pdf