<div style="display:flex; align-items:center; background-color:black; padding:20px;">
  <span style="font-size:3em; color:red; font-weight:bold; margin-right:30px;">Predicting the English Premier League</span>
  <img src="logo.png" alt="logo" style="max-width:120px; height:auto; border-radius:10px; box-shadow:0 0 10px #222; background:white;">
</div>


<font size="6.5" color="yellow">Introduction to the Project</font>


Football is the most popular sport in the world, and the English Premier League is arguably its most competitive league, which makes it a natural choice to study. The game is constantly evolving, and features that are important in one generation may become less so later; there has never been an analytic definition of the game. This is where machine learning could prove invaluable.

In this project, we apply statistical and machine learning methods to predict match results from historic data on the teams that play the match. A major part of the project builds a framework that takes raw match data and creates suitable features from historic statistics that can be related to match results. We then train decision tree and random forest classifiers, and finally a deep learning model, combining them with a number of feature selection tools, including PCA, to get the best accuracy.

<font size="5" color="cyan">Problem Statement</font>

The premise, then, is to build a machine learning framework that uses historic data from football matches between two teams and learns how best to predict the outcomes of games. The general steps towards such a framework are as follows:
* Obtain a source of data with enough information to link team performance to the final result.
* Build a tool which takes the raw data and transforms it into a format in which each match has a number of historic features based on previous games, such as 'Average goals in x games'.
* Use feature selection tools to filter out unimportant/redundant features.
* Feed features into a number of machine learning algorithms and decide on the best one to fine-tune.
* Fine-tune the chosen machine learning algorithm for the greatest accuracy.
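
The second step above, turning raw results into historic features such as 'Average goals in x games', can be sketched as follows. This is only an illustration in plain Python, not the project's actual tool: the function name `add_average_goals`, the output keys `HomeAvgGoals` and `AwayAvgGoals`, and the toy teams "A", "B" and "C" are all hypothetical, and it assumes the matches are already sorted by date. The field names `HomeTeam`, `AwayTeam`, `FTHG` and `FTAG` follow the dataset described later.

```python
from collections import defaultdict, deque


def _mean(values):
    """Average of a sequence, or None when there is no history yet."""
    return sum(values) / len(values) if values else None


def add_average_goals(matches, x=3):
    """Add each team's average goals over its previous x games.

    `matches` is a chronologically ordered list of dicts (an assumption).
    """
    # team -> goals scored in its most recent x games (older games fall out)
    history = defaultdict(lambda: deque(maxlen=x))
    rows = []
    for match in matches:
        home, away = match["HomeTeam"], match["AwayTeam"]
        row = dict(match)
        # Use only games played BEFORE this one, so the result being
        # predicted never leaks into its own features.
        row["HomeAvgGoals"] = _mean(history[home])
        row["AwayAvgGoals"] = _mean(history[away])
        rows.append(row)
        # Record this match's goals for use in later matches.
        history[home].append(match["FTHG"])
        history[away].append(match["FTAG"])
    return rows


# Toy data with hypothetical teams, for illustration only.
toy_matches = [
    {"HomeTeam": "A", "AwayTeam": "B", "FTHG": 2, "FTAG": 1},
    {"HomeTeam": "B", "AwayTeam": "A", "FTHG": 0, "FTAG": 3},
    {"HomeTeam": "A", "AwayTeam": "C", "FTHG": 1, "FTAG": 1},
]
features = add_average_goals(toy_matches, x=2)
print(features[2]["HomeAvgGoals"])  # 2.5, the mean of A's previous 2 and 3 goals
```

A team with no earlier games gets `None`, so such rows would need to be dropped or filled before they are fed to a model.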

<font size="5" color="cyan">Background & Metrics</font>

The use of machine learning in the gambling industry has grown in recent years. Academic interest has risen alongside it, with many journal articles and papers detailing the methodology and results gathered. One such study (PETTERSSON, NYQUIST, 2017)[1] used recurrent neural networks to predict the outcome of games from in-game information as the game progressed; the study used an accuracy metric, which ranged from 33.5% to 98% as the game progressed. Another study (HERBINET, 2018)[2] used publicly available databases along with classical machine learning algorithms (regression, SVM, etc.) to predict match outcomes and expected goals using aggregated statistics and player ratings. The study correctly predicted the outcome of matches 50% of the time.

In this project, we use the accuracy metric, which is simply the proportion of correct predictions out of all predictions made on the test data:

$$Accuracy = \frac{\sum N_{C}}{N_{total}}$$

where $N_{C} = 1$ for a correct prediction (and 0 otherwise), and $N_{total}$ is the total number of predictions.
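
As a small worked example of the accuracy metric, the sketch below counts matching labels and divides by the number of predictions. The function name `accuracy` and the toy label lists are assumptions made for illustration; the labels use the H/D/A result coding of the dataset.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual results."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    # Each correct prediction contributes 1 to the numerator.
    n_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return n_correct / len(actual)


# Toy labels for illustration: H = home win, D = draw, A = away win.
predicted = ["H", "D", "A", "H"]
actual = ["H", "A", "A", "D"]
print(accuracy(predicted, actual))  # 0.5 (2 correct out of 4)
```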

<font size="6" color="yellow">Data Preprocessing & Exploration</font>

We sourced Premier League match results from football-data.co.uk. The website has downloadable CSV files for each season of the Premier League from 1993. We used data from the 2000-2001 season onwards, as this data had the most consistent set of features throughout.

On first inspection, we see a number of features that describe the events that occur in the game, such as total goals by the home side (FTHG). The list below describes what each of them means.

&nbsp;&nbsp;&nbsp;&nbsp;Div = League Division<br>&nbsp;&nbsp;&nbsp;&nbsp;Date = Match Date (dd/mm/yy)<br>&nbsp;&nbsp;&nbsp;&nbsp;HomeTeam = Home Team<br>&nbsp;&nbsp;&nbsp;&nbsp;AwayTeam = Away Team<br>&nbsp;&nbsp;&nbsp;&nbsp;FTHG and HG = Full Time Home Team Goals<br>&nbsp;&nbsp;&nbsp;&nbsp;FTAG and AG = Full Time Away Team Goals<br>&nbsp;&nbsp;&nbsp;&nbsp;FTR and Res = Full Time Result (H=Home Win, D=Draw, A=Away Win)<br>&nbsp;&nbsp;&nbsp;&nbsp;HTHG = Half Time Home Team Goals<br>&nbsp;&nbsp;&nbsp;&nbsp;HTAG = Half Time Away Team Goals<br>&nbsp;&nbsp;&nbsp;&nbsp;HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)<br>&nbsp;&nbsp;&nbsp;&nbsp;HS = Home Team Shots<br>&nbsp;&nbsp;&nbsp;&nbsp;AS = Away Team Shots<br>&nbsp;&nbsp;&nbsp;&nbsp;HST = Home Team Shots on Target<br>&nbsp;&nbsp;&nbsp;&nbsp;AST = Away Team Shots on Target<br>&nbsp;&nbsp;&nbsp;&nbsp;HHW = Home Team Hit Woodwork<br>&nbsp;&nbsp;&nbsp;&nbsp;AHW = Away Team Hit Woodwork<br>&nbsp;&nbsp;&nbsp;&nbsp;HC = Home Team Corners<br>&nbsp;&nbsp;&nbsp;&nbsp;AC = Away Team Corners<br>&nbsp;&nbsp;&nbsp;&nbsp;HF = Home Team Fouls Committed<br>&nbsp;&nbsp;&nbsp;&nbsp;AF = Away Team Fouls Committed<br>&nbsp;&nbsp;&nbsp;&nbsp;HFKC = Home Team Free Kicks Conceded<br>&nbsp;&nbsp;&nbsp;&nbsp;AFKC = Away Team Free Kicks Conceded<br>&nbsp;&nbsp;&nbsp;&nbsp;HO = Home Team Offsides<br>&nbsp;&nbsp;&nbsp;&nbsp;AO = Away Team Offsides<br>&nbsp;&nbsp;&nbsp;&nbsp;HY = Home Team Yellow Cards<br>&nbsp;&nbsp;&nbsp;&nbsp;AY = Away Team Yellow Cards<br>&nbsp;&nbsp;&nbsp;&nbsp;HR = Home Team Red Cards
<br>&nbsp;&nbsp;&nbsp;&nbsp;AR = Away Team Red Cards



import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import copy
# Work from the folder that holds the season CSV files.
os.chdir("capstone_proj/csv_files")

# In this section, we take an in-depth view of the dataset: what is
# available, and whether we can make some initial predictions.
s0001 = pd.read_csv("2000_2001.csv")

# Columns describing the match, its result, and the in-game statistics.
colsToKeep = [
    "Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "HG", "FTAG", "AG",
    "FTR", "Res", "HTHG", "HTAG", "HTR", "HS", "AS", "HST", "AST",
    "HHW", "AHW", "HC", "AC", "HF", "AF", "HO", "AO", "HY", "AY", "HR", "AR",
]

# Drop every other column, keeping the file's original column order.
allCols = s0001.columns
colsToDrop = [col for col in allCols if col not in colsToKeep]
s0001 = s0001.drop(columns=colsToDrop)

# Print the first rows without truncating any rows or columns.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    print(s0001.head())
