## Football Match Outcome Prediction
### 1. Baseline Results

This notebook presents baseline results that serve as a reference for evaluating and improving more sophisticated machine learning models.

The baseline methods are: random guessing; always predicting a home team win, a draw, or a home team loss; and always predicting the outcome with the minimum, middle, or maximum odds.

### Highlights

* **Among all baseline methods, always picking the outcome with the minimum odds performs best. A useful machine learning model should beat this baseline.**
* **None of the baseline methods makes money: every one of them has a negative average return per match.**

* **Some baseline results are summarized below:**

|                    Baseline Method                  |Prediction Acc. | Avg. Interest Rate per Match|
| :--------------------------------------------------:| :------------: | :--------------------------:|
| Random Guess                                        |    33.33%      |             -7%             |
| Always choosing home team win                       |    45.91%      |          -4.7%              |
| Always choosing min odds result (human level)       |    53.33%      |          -3.7%              |



##### Metrics: Average Interest Rate Per Match

What we actually care about is how much money the predictions can make. An important metric for evaluating the results is therefore the average interest rate per match (AIRPM), which is the average rate of return on investment per match.
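
The helper that computes AIRPM lives in a separate `evaluation` module that is not reproduced in this notebook. As a rough sketch only, assuming decimal odds and a stake of one unit on the predicted outcome of every match, the metric could be computed along the following lines. The name `airpm_sketch` and the toy arrays are hypothetical, and the real `avg_interest_rate_per_match` may differ in its details.

```python
import numpy as np


def airpm_sketch(y_true, y_pred, odds):
    """Mean return per unit stake, assuming decimal odds (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    odds = np.asarray(odds, dtype=float)

    # Odds offered on the outcome that was predicted for each match
    picked_odds = odds[np.arange(len(y_pred)), y_pred]

    # A correct bet earns (odds - 1) in profit; a wrong bet loses the unit stake
    profit = np.where(y_pred == y_true, picked_odds - 1.0, -1.0)
    return profit.mean()


# Toy example: three matches, columns are home win / draw / home loss odds
toy_odds = np.array([[1.5, 4.0, 6.0],
                     [2.5, 3.0, 3.0],
                     [3.0, 3.0, 2.0]])
toy_true = np.array([0, 1, 2])
toy_pred = np.array([0, 0, 2])

toy_airpm = airpm_sketch(toy_true, toy_pred, toy_odds)
print('Toy AIRPM: {:.2f}%'.format(toy_airpm * 100))
```

Under these assumptions, a negative AIRPM means that the strategy loses money on average, which is how the negative values in the summary table should be read.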

#### Data Analysis Details

The original database can be downloaded from Kaggle: [European Soccer Database](https://www.kaggle.com/hugomathien/soccer). The data preprocessing for this notebook can be found [here](https://github.com/xzl524/football_data_analysis/tree/master/projects/european_soccer_database_analysis/process_ball_events).

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import pickle
import sqlite3
import os
from tqdm import tqdm

sns.set_style('whitegrid')
sns.set_context('poster')

print('All general modules are imported.')

All general modules are imported.


In [2]:
# Reload preprocessed match stats dataset
# Change the directory as you wish
data_folder = os.path.join(os.path.pardir, os.path.pardir, 'data_source', 'kaggle')
match_stats_table_name = 'match_stats.data'

try:
    with open(os.path.join(data_folder, match_stats_table_name), 'rb') as f:
        print('Loading saved match stats table...')
        match = pd.read_pickle(f)
        print('Successfully reload {}.'.format(match_stats_table_name))
except Exception as e:
    # Report the underlying error and stop: later cells cannot run without `match`
    print('Unable to reload {}: {}'.format(match_stats_table_name, e))
    raise

Loading saved match stats table...
Successfully reload match_stats.data.


In [3]:
# Most of the operations in this cell are not needed to compute the baseline results.
# They are kept for consistency with the preprocessing steps used in later machine learning modeling.

# Select source of odds
# B365, BW, IW, LB, PS, WH, SJ, VC, GB, BS
odds_source = 'B365'

# H: home win odds, D: draw odds, A: away win odds
odds_columns = ['{}H'.format(odds_source), '{}D'.format(odds_source), '{}A'.format(odds_source)]

# Select the required columns
selected_columns = ['country', 'league', 'season', 'stage', 'date',
                    'home_team', 'away_team'] + odds_columns + ['home_team_goal', 'away_team_goal']

selected_data = match[selected_columns]

# Remove matches with missing odds; copy so the 'result' column can be added safely
selected_data = selected_data[selected_data['{}H'.format(odds_source)].notnull()].copy()

# Extract home team's goal difference
goal_diff = selected_data['home_team_goal'] - selected_data['away_team_goal']

# Encode the match result from the home team's point of view
# home team win:  0
# draw:           1
# home team loss: 2
selected_data['result'] = np.where(goal_diff > 0, 0, np.where(goal_diff < 0, 2, 1))

# Verify that no null values remain in the dataframe
if not any(selected_data.isnull().sum().values):
    print(selected_data.info())
else:
    print('NULL DATA FOUND!!!!')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22592 entries, 0 to 24556
Data columns (total 13 columns):
country           22592 non-null object
league            22592 non-null object
season            22592 non-null object
stage             22592 non-null int64
date              22592 non-null object
home_team         22592 non-null object
away_team         22592 non-null object
B365H             22592 non-null float64
B365D             22592 non-null float64
B365A             22592 non-null float64
home_team_goal    22592 non-null int64
away_team_goal    22592 non-null int64
result            22592 non-null int32
dtypes: float64(3), int32(1), int64(3), object(6)
memory usage: 2.3+ MB
None


In [4]:
# Select the inputs required by the models
y = selected_data['result'].values
odds = selected_data[odds_columns].values

#### 1.1 Random Guess

First of all, random guessing yields a theoretical prediction accuracy of about 33%.

Suppose the proportions of home team wins, draws, and losses are $p_{w}$, $p_{d}$, and $p_{l}$, respectively. A uniform random guess is independent of the true result and matches it with probability $1/3$ whatever the outcome, so the expected prediction accuracy is:

$$\frac{p_w}{3}+\frac{p_d}{3}+\frac{p_l}{3}=\frac{p_w+p_d+p_l}{3}=\frac{1}{3}\approx 33.3\%$$
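
The same identity can be checked on synthetic labels before turning to the real data. The sketch below is only an illustration: the outcome frequencies in `class_probs`, the sample size, and the seed are arbitrary assumptions, chosen to show that the result does not depend on how unbalanced the outcomes are.

```python
import numpy as np

rng = np.random.RandomState(0)
n_matches = 200000

# Deliberately unbalanced outcome frequencies (win, draw, loss); illustrative only
class_probs = [0.5, 0.2, 0.3]
y_sim = rng.choice([0, 1, 2], size=n_matches, p=class_probs)

# Uniform random guess, drawn independently of the true outcome
guess = rng.choice([0, 1, 2], size=n_matches)

# Fraction of matches where the guess happens to equal the true outcome
sim_accuracy = np.mean(y_sim == guess)
print('Simulated accuracy: {:.4f}'.format(sim_accuracy))
```

The simulated accuracy should land close to one third even though half of the synthetic matches are home wins.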

Next, verify the calculation with the data.

In [5]:
# avg_interest_rate_per_match is a user-defined function (from the local evaluation module) that computes AIRPM
from evaluation import avg_interest_rate_per_match
from sklearn.metrics import accuracy_score

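# Random guess: draw each prediction uniformly from {0, 1, 2}
# (no seed is set, so the numbers below vary slightly between runs)
y_pred = np.random.choice([0, 1, 2], len(y))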

# Prediction Report
print('RANDOM GUESS')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))


RANDOM GUESS
Prediction Accuracy:             33.28%
Average Interest Rate Per Match: -8.37%


Always predicting a home team win, a draw, or a home team loss yields a prediction accuracy of 46%, 25%, and 29%, respectively; the corresponding AIRPM is -4.7%, -8.6%, and -9.6%.

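
A constant prediction is correct exactly when the true result equals that constant, so the accuracy of each "always predict X" baseline is simply the relative frequency of outcome X in the data. A minimal sketch on made-up labels (the array `y_toy` is purely illustrative; in the notebook the same computation would apply to `y`):

```python
import numpy as np

# Toy labels: 0 = home win, 1 = draw, 2 = home loss (illustrative data only)
y_toy = np.array([0, 0, 0, 1, 2, 2, 0, 1, 2, 0])

# Relative frequency of each outcome
class_freq = np.bincount(y_toy, minlength=3) / float(len(y_toy))

# Accuracy of each constant predictor, computed directly
const_acc = np.array([np.mean(y_toy == k) for k in range(3)])

print('Outcome frequencies:          ', class_freq)
print('Constant-prediction accuracy: ', const_acc)
```

The two arrays coincide, which is why the three accuracies reported in sections 1.2 to 1.4 add up to 100%.
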
#### 1.2 Always predicting home team win

In [6]:
# Always predicting home team win (integer labels, matching the dtype of y)
y_pred = np.zeros(len(y), dtype=int)

# Prediction Report
print('ALWAYS PREDICTING HOME TEAM WIN')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING HOME TEAM WIN
Prediction Accuracy:             45.91%
Average Interest Rate Per Match: -4.73%


#### 1.3 Always predicting draw

In [7]:
# Always predicting draw (integer labels, matching the dtype of y)
y_pred = np.ones(len(y), dtype=int)

# Prediction Report
print('ALWAYS PREDICTING DRAW')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING DRAW
Prediction Accuracy:             25.30%
Average Interest Rate Per Match: -8.55%


#### 1.4 Always predicting home team loss

In [8]:
# Always predicting home team lose
y_pred = np.full(len(y), 2)

# Prediction Report
print('ALWAYS PREDICTING HOME TEAM LOSE')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING HOME TEAM LOSE
Prediction Accuracy:             28.79%
Average Interest Rate Per Match: -9.62%


Always picking the outcome with the minimum, middle, or maximum odds yields a prediction accuracy of 53%, 26%, and 21%, respectively; the corresponding AIRPM is -3.7%, -9.2%, and -9.5%.

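
To make the three odds-based rules concrete, the sketch below applies them to a small made-up odds matrix whose columns follow the notebook's order (home win, draw, home loss). It assumes decimal odds, where the smallest value marks the bookmaker's favourite. The array `toy_odds` is illustrative only and does not come from the dataset.

```python
import numpy as np

toy_odds = np.array([[1.50, 4.00, 6.50],   # clear home favourite
                     [3.10, 3.30, 2.30],   # away side favoured
                     [2.60, 3.20, 2.80]])  # fairly even match

# Column indices of each row, sorted from lowest to highest odds
order = np.argsort(toy_odds, axis=1)

pred_min = order[:, 0]    # same as toy_odds.argmin(axis=1) when there are no ties
pred_mid = order[:, 1]    # outcome with the middle odds
pred_max = order[:, -1]   # same as toy_odds.argmax(axis=1) when there are no ties

print('min odds picks:', pred_min)
print('mid odds picks:', pred_mid)
print('max odds picks:', pred_max)
```

Each row gets three different picks, so on any single match exactly one of the three rules is right.
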
#### 1.5 Always predicting with min odds

In [9]:
# Always predicting with min odds
y_pred = odds.argmin(axis=1)

# Prediction Report
print('ALWAYS PREDICTING WITH MIN ODDS')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING WITH MIN ODDS
Prediction Accuracy:             53.33%
Average Interest Rate Per Match: -3.65%


#### 1.6 Always predicting with middle odds

In [10]:
# Always predicting with middle odds
y_pred = np.argsort(odds, axis=1)[:, 1]

# Prediction Report
print('ALWAYS PREDICTING WITH MIDDLE ODDS')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING WITH MIDDLE ODDS
Prediction Accuracy:             25.93%
Average Interest Rate Per Match: -9.15%


#### 1.7 Always predicting with max odds

In [11]:
# Always predicting with max odds
y_pred = odds.argmax(axis=1)

# Prediction Report
print('ALWAYS PREDICTING WITH MAX ODDS')

# Prediction Accuracy
print('Prediction Accuracy:             {:.2f}%'.format(accuracy_score(y, y_pred)*100))

# Average Interest Rate Per Match
print('Average Interest Rate Per Match: {:.2f}%'.format(avg_interest_rate_per_match(y, y_pred, odds)*100))

ALWAYS PREDICTING WITH MAX ODDS
Prediction Accuracy:             20.92%
Average Interest Rate Per Match: -9.48%
