# ML bootcamp hackathon challenge: predict the outcome of EURO 2016 soccer matches for fun and profit


At this stage you have learned what machine learning is and you have seen examples of how to use machine learning. Now it is time to see how well YOU can generalize what you have learned to a new challenge.

We will be using machine learning to **predict the outcome of EURO 2016 soccer matches for fun and profit**.
The hackathon is based on the code of the following repo:
https://github.wdf.sap.corp/I073941/euro2016_ml

The code below will help you get started. At the final pitch session, each team can present its model and test how accurately it would have predicted the games of the EURO 2016 competition.

Check out the Goldman Sachs report on predicting the 2014 World Cup:
http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-book-2014-statistical-model.html

In [1]:
# import packages that we need
from sklearn.metrics import accuracy_score
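# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides cross_val_score
from sklearn import model_selection as cross_validation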
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier

from bs4 import BeautifulSoup
import pandas as pd
import os
import numpy as np
import random
from subprocess import call
from dateutil.parser import parse
import datetime

# set random seed, make results reproducible
np.random.seed(42)


In [2]:
# helper functions to download and parse historic soccer matches
# from http://www.eloratings.net/
#
# Note: the Elo rating system is a method for calculating the relative skill levels of players in
# games such as chess (https://en.wikipedia.org/wiki/Elo_rating_system)

def load_elo(country):
    """Download ELO ratings from the web, parse the HTML page, and return a data frame."""
    print("load %s.." % country)
    url_template = "http://www.eloratings.net/%s.htm"
    data_dir = '/tmp/elo/'
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    elo_file = os.path.join(data_dir, "ELO_%s.htm" % country)
    # NOTE: urllib gets a HTTP 412 error, use curl workaround for download
    if not os.path.isfile(elo_file):
        call(["curl", url_template % country, "-o", elo_file, "-x", "http://proxy.wdf.sap.corp:8080"])
    with open(elo_file) as fin:
        html = fin.read()
    historic_games = list(parse_elo(html))
    # at the moment we only extract team names, the match score, match type, date and the ELO score of each team
    # there are more columns, consider adding them as additional attributes
    df = pd.DataFrame(historic_games, columns=["Team1", "Team2", "Score1", "Score2", "ELO1", "ELO2", "MatchType", "Date"])
    return df

def text_with_newline(elem):
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, str):
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text

def clean_string(s):
    # remove non-breaking space character
    s = s.replace(u'\xa0', u' ')
    return s

def parse_elo(html):
    """Parse ELO ratings and extract dates and ranking"""
    soup = BeautifulSoup(html, 'html.parser')
    # find the main result table
    table = soup.body.find('table', class_='results')
    for row in table.find_all('tr', class_='nh'):
        fields = row.find_all('td')
        if len(fields) < 6:
            continue
        date = parse(text_with_newline(fields[0]).replace('\n', ' '))
        team1, team2 = clean_string(text_with_newline(fields[1])).split('\n')
        score1, score2 = map(int, text_with_newline(fields[2]).split('\n'))
        elo1, elo2 = map(int, text_with_newline(fields[5]).split('\n'))
        match_type = clean_string(text_with_newline(fields[3]).replace('\n', ' ')).strip()
        # the elo rating is already adjusted to the outcome of the game
        # so calculate what the rating was before the game
        elo1_change, elo2_change = map(int, text_with_newline(fields[4]).split('\n'))
        elo1 -= elo1_change
        elo2 -= elo2_change
        yield team1, team2, score1, score2, elo1, elo2, match_type, date
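
For intuition about what the ELO columns encode, the standard Elo model turns a rating difference into an expected score between 0 and 1. The sketch below is not part of the original notebook: `elo_expected_score` is a hypothetical helper that uses the textbook logistic formula with a scale of 400 points and ignores any site-specific adjustments that eloratings.net may apply.

```python
def elo_expected_score(elo1, elo2):
    """Expected score of team 1 against team 2 under the standard Elo model.

    Returns a value in (0, 1); 0.5 means both teams are rated equally.
    """
    return 1.0 / (1.0 + 10 ** ((elo2 - elo1) / 400.0))


# a 400-point advantage corresponds to an expected score of 10/11
print(elo_expected_score(2000, 1600))
# equal ratings give an even match
print(elo_expected_score(1800, 1800))
```

Predicting a win for whichever team has an expected score above 0.5 could serve as a simple non-learned baseline to compare the classifiers against.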


In [4]:
# load data
countries = ["Albania", "Austria", "Belgium", "Croatia", "Czech Republic", "England", "France", "Germany", "Hungary", "Iceland", "Italy", "Northern Ireland", "Poland", "Portugal", "Ireland", "Romania", "Russia", "Slovakia", "Spain", "Sweden", "Switzerland", "Turkey", "Ukraine", "Wales"]
country_canonical = {"Czech Republic": "Czechia", "Northern Ireland": "Nthrn_Irelnd"}
df = pd.DataFrame([], columns=["Team1", "Team2", "Score1", "Score2", "ELO1", "ELO2", "Date"])

# load elo ranking
df = pd.concat((load_elo(country_canonical.get(country, country)) for country in countries), ignore_index=True).sort_values("Date")

# split data frame into EURO 2016 games (test) and others (train)
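# use datetime.datetime so the bounds compare cleanly against the parsed datetime values in df['Date']
euro2016_start = datetime.datetime(2016, 6, 10)
euro2016_end = datetime.datetime(2016, 7, 10)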

is_euro2016 = (df['MatchType'] == "European Championship in France") & (df['Date'] >= euro2016_start) & (df['Date'] <= euro2016_end)
df_euro2016 = df.loc[ is_euro2016 ]
df_train = df.loc[ ~is_euro2016 ]

# print the first 5 rows of each data frame and look at the data
# you can do more advanced data exploration and plotting here
print("=== training data ===")
print(df_train.head(5))
print("\n\n=== EURO2016 ===")
print(df_euro2016.head(5))

load Albania..
load Austria..
load Belgium..
load Croatia..
load Czechia..
load England..
load France..
load Germany..
load Hungary..
load Iceland..
load Italy..
load Nthrn_Irelnd..
load Poland..
load Portugal..
load Ireland..
load Romania..
load Russia..
load Slovakia..
load Spain..
load Sweden..
load Switzerland..
load Turkey..
load Ukraine..
load Wales..
=== training data ===
         Team1     Team2  Score1  Score2  ELO1  ELO2             MatchType  \
2838  Scotland   England       0       0  1800  1800  Friendly in Scotland   
2839   England  Scotland       4       2  1803  1797   Friendly in England   
2840  Scotland   England       2       1  1786  1814  Friendly in Scotland   
2841   England  Scotland       2       2  1806  1794   Friendly in England   
2842  Scotland   England       3       0  1797  1803  Friendly in Scotland   

           Date  
2838 1872-11-30  
2839 1873-03-08  
2840 1874-03-07  
2841 1875-03-06  
2842 1876-03-04  


=== EURO2016 ===
             Team1    

In [5]:
def generate_learning_to_rank(df):
    """Create learning to rank feature matrix X and labels y. Yield (x, y) instances as generator."""
    for _, row in df.iterrows():
        score_delta = row.Score1 - row.Score2
        elo_delta = row.ELO1 - row.ELO2
        # learning to rank as binary classification
        if score_delta == 0:
            # draw, ignore for learning to rank
            continue
        label = np.sign(score_delta)
        # at the moment the ELO delta is the only feature, you can add more features here
        features = [elo_delta]
        # yield (x, y)
        yield (features, label)
        # yield symmetric instance (-x, -y)
        yield ([-x for x in features], -label)


In [6]:
## learning to rank                                                                                                 
# generate features and labels                                                                                      
print("== Learning to rank ==")
Xy_train = list(generate_learning_to_rank(df_train))
X_train = np.array([x for x,y in Xy_train])
y_train = np.array([y for x,y in Xy_train])

# random guess                                                                                                      
y_rand = [random.choice([-1, 1]) for _ in range(len(y_train))]
print("Baseline random guess")
print("Accuracy random : %.4f " % accuracy_score(y_train, y_rand))
print("")

# cross validation with different classifiers
# note: newer scikit-learn releases name the logistic loss "log_loss" instead of "log"
models = [
    ("logistic regression SGD", SGDClassifier(loss="log")),
    ("perceptron SGD", SGDClassifier(loss="perceptron")),
    ("SGD hinge", SGDClassifier(loss="hinge")),
    ("Passive Agressive", PassiveAggressiveClassifier()),
]
best_model = None
best_score = 0.0
print("Cross-validation for model selection")
for clf_name, clf in models:
    scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=10)
    print(clf_name)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    if best_model is None or scores.mean() > best_score:
        print("--> New best score")
        best_score = scores.mean()
        best_model = (clf_name, clf)
    print("")

# train the best model on the whole training data and test on EURO 2016
clf_name, clf = best_model
clf.fit(X_train, y_train)
Xy_test = list(generate_learning_to_rank(df_euro2016))
X_test = np.array([x for x,y in Xy_test])
y_test = np.array([y for x,y in Xy_test])
y_guess = clf.predict(X_test)
print("Test accuracy on EURO 2016 with model %s : %0.2f" % (clf_name, float(sum(y_test == y_guess)) / len(y_test)))

== Learning to rank ==
Baseline random guess
Accuracy random : 0.4953 

Cross-validation for model selection
logistic regression SGD
Accuracy: 0.68 (+/- 0.27)
--> New best score

perceptron SGD
Accuracy: 0.69 (+/- 0.25)
--> New best score

SGD hinge
Accuracy: 0.64 (+/- 0.34)

Passive Agressive
Accuracy: 0.50 (+/- 0.42)

Test accuracy on EURO 2016 with model perceptron SGD : 0.56


## What to do next
Congratulations, you have created a simple learning-to-rank binary classifier! 

You can try to improve this classifier by **adding more features**: for example, a feature indicating which team is the home team, or whether the match is a friendly or a competitive match. You can also give more weight to recent training examples, or create trend features like 'percentage of matches won out of the last 5 matches'. Be careful with time-dependent features when you do cross-validation: they must be re-computed for each test instance, using only matches played before it.
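
As a starting point for two of these feature ideas, here is a minimal sketch. The helper names `home_indicator` and `recent_win_rate_deltas` are hypothetical, and the code assumes that match types look like "Friendly in Scotland" (host country after the last " in "), that matches are sorted by date, and that duplicate rows have been removed (the same match may appear twice when both teams are in `countries`).

```python
from collections import defaultdict, deque


def home_indicator(team1, team2, match_type):
    """Return +1 if team1 plays at home, -1 if team2 does, 0 for a neutral venue.

    Assumes the host country follows the last " in " of the match type.
    """
    _, sep, venue = match_type.rpartition(" in ")
    if not sep:
        return 0  # no venue information in the match type
    if venue == team1:
        return 1
    if venue == team2:
        return -1
    return 0


def recent_win_rate_deltas(matches, window=5):
    """Yield win-rate(team1) - win-rate(team2) over each team's last `window` matches.

    `matches` is an iterable of (team1, team2, score1, score2), sorted by date.
    The feature for a match is computed before its result is recorded,
    so it only uses matches played earlier.
    """
    history = defaultdict(lambda: deque(maxlen=window))

    def rate(team):
        past = history[team]
        # teams without any recorded match get a neutral 0.5
        return sum(past) / len(past) if past else 0.5

    for team1, team2, score1, score2 in matches:
        yield rate(team1) - rate(team2)
        history[team1].append(1 if score1 > score2 else 0)
        history[team2].append(1 if score2 > score1 else 0)


demo_matches = [("A", "B", 1, 0), ("A", "C", 2, 2), ("B", "A", 0, 3)]
print(list(recent_win_rate_deltas(demo_matches)))
print(home_indicator("Scotland", "England", "Friendly in Scotland"))
```

Both features change sign when the two teams are swapped, which fits the mirrored `(-x, -y)` instances that `generate_learning_to_rank` yields. A plain "is friendly" flag does not have this property, so it may work better combined with another feature, for example multiplied by the ELO delta.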

Or you can **try out more machine learning classifiers** and try to **fine-tune the hyper-parameters** of each model.
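
One possible way to tune hyper-parameters is a grid search with cross-validation. The sketch below is an illustration, not the notebook's method: it runs on synthetic stand-in data (`X_demo`, `y_demo` are made up here) rather than the real `X_train` and `y_train`, and the parameter grid is an arbitrary example. It also standardizes the feature first, on the assumption that SGD-based models behave better on scaled inputs.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for X_train / y_train: one ELO-delta feature, noisy labels
rng = np.random.RandomState(42)
elo_delta = rng.normal(0, 200, size=400)
X_demo = elo_delta.reshape(-1, 1)
y_demo = np.where(elo_delta + rng.normal(0, 150, size=400) > 0, 1, -1)

# scale the feature, then classify; the scaler is re-fit inside every CV fold
pipeline = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))

# parameter names are prefixed with the lower-cased step name of the pipeline
param_grid = {
    "sgdclassifier__loss": ["hinge", "perceptron", "modified_huber"],
    "sgdclassifier__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV accuracy: %.2f" % search.best_score_)
```

To apply this to the hackathon data, replace `X_demo` and `y_demo` with `X_train` and `y_train`; `search.best_estimator_` is then refit on the whole training set and can be used to predict the EURO 2016 matches.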

You can also extend this notebook to **more advanced tasks**: try making this a 3-way multi-class classification where you predict win, lose, or draw, or, if that is still too easy, convert it to a regression problem and predict the exact score of the matches.
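
For the 3-way variant, the main change is to keep draws as their own class instead of skipping them. A minimal sketch, assuming rows of `(score1, score2, elo1, elo2)`; `generate_three_way` is a hypothetical counterpart to `generate_learning_to_rank`, not part of the original code:

```python
def generate_three_way(rows):
    """Yield (features, label) with label 1 = team1 wins, 0 = draw, -1 = team1 loses.

    `rows` is an iterable of (score1, score2, elo1, elo2) tuples.
    """
    for score1, score2, elo1, elo2 in rows:
        score_delta = score1 - score2
        # sign of the score difference: 1, 0 or -1
        label = (score_delta > 0) - (score_delta < 0)
        features = [elo1 - elo2]
        yield features, label
        # mirrored instance: swapping the teams flips win and loss, a draw stays a draw
        yield [-x for x in features], -label


demo_rows = [(2, 1, 1900, 1800), (0, 0, 1700, 1750), (1, 3, 1600, 1650)]
for features, label in generate_three_way(demo_rows):
    print(features, label)
```

The resulting labels can be passed to any classifier that supports more than two classes. Note that with the ELO delta as the only feature, a linear model may struggle to separate draws from wins and losses, so additional features are likely needed.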