# Introduction

This project predicts the results of the group-stage matches at the 2018 FIFA World Cup, the knockout-stage bracket, and the eventual World Cup winner.

## Data

The main data source for this project is the [European Soccer Database](https://www.kaggle.com/hugomathien/soccer) from [Kaggle](https://www.kaggle.com/). The dataset contains the following information:  
* +25,000 matches
* +10,000 players
* 11 European Countries with their lead championship
* Seasons 2008 to 2016
* Player and team attributes sourced from EA Sports' FIFA video game series, including the weekly updates
* Team line up with squad formation (X, Y coordinates)
* Betting odds from up to 10 providers
* Detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for +10,000 matches  

The specific data used for this project are:
* European league match attributes from 2008 to 2016
* Starting lineup of the teams for all matches
* Player attributes sourced from EA Sports' FIFA game ([sofifa.com](https://sofifa.com/))

## Features

Each data sample is a match labeled Win, Defeat or Draw. The features are the starting XI players of both teams. For each player, I used 3 attributes: overall rating (FIFA game), potential (FIFA game) and age before the match. With 22 starting players per match and 3 attributes per player, each data sample has 66 features.  
Here's the format of a data sample:  
* Player Features: Overall Rating, Potential, Age
* Team Features: Player1 Features, Player2 Features, ..., Player11 Features
* Match Features: Team1 Features, Team2 Features
* Match Label: Match result with respect to Team1. For example, if the match features are [Team1, Team2], the label is Win if Team1 wins, Defeat if Team1 loses, or Draw in case of a draw.  

Sofifa has many more attributes for each player, but I ended up using only overall rating and potential, plus the age derived from the player's date of birth.
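
The layout described above can be sketched as follows. This is a minimal illustration, not the code that built `training_data.csv`; the helper names `player_features` and `match_features` and all sample values are assumptions made up for the example.

```python
# Hypothetical helpers illustrating the 66-feature layout (not the original code).

def player_features(overall_rating, potential, age):
    """One player contributes three consecutive values."""
    return [overall_rating, potential, age]

def match_features(team1, team2):
    """Flatten two starting XIs (11 players each) into one feature vector.

    Each team is a list of 11 (overall_rating, potential, age) tuples.
    Team1 fills columns 0-32 and Team2 fills columns 33-65.
    """
    assert len(team1) == 11 and len(team2) == 11
    features = []
    for player in team1 + team2:
        features.extend(player_features(*player))
    return features

# Placeholder values, purely for illustration.
team1 = [(80, 85, 25)] * 11
team2 = [(75, 78, 29)] * 11
sample = match_features(team1, team2)
print(len(sample))  # 66
```

With this ordering, column `i` holds overall rating when `i % 3 == 0`, potential when `i % 3 == 1` and age when `i % 3 == 2`, which is the convention `filter_features` relies on.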

## Classifier

A **Naive Bayes classifier** was trained on the dataset described above. It is a reasonable fit because the attributes of each player can be assumed to be largely independent of his teammates' attributes. A player's potential and overall rating are related to each other, but they are not identical: for some players the overall rating is lower than the potential. Even when the individual features are not fully independent, a Naive Bayes classifier can still give good results in practice.  
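
To make the independence assumption concrete, here is a minimal from-scratch sketch of what a Gaussian Naive Bayes classifier computes: the log prior of each class plus a sum of per-feature Gaussian log-likelihoods. It is an illustration only, not scikit-learn's implementation. The function names, the toy two-feature data and the `var_floor` constant (added just to avoid division by zero) are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = float(len(rows))
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalised log posterior: log prior + sum of feature log-likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Each feature is treated as independent given the class.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy data: [team1 rating, team2 rating] with made-up labels.
X = [[85.0, 70.0], [82.0, 72.0], [88.0, 69.0],
     [70.0, 84.0], [72.0, 86.0], [68.0, 83.0]]
y = ['Win', 'Win', 'Win', 'Defeat', 'Defeat', 'Defeat']
model = fit_gaussian_nb(X, y)
print(predict(model, [86.0, 71.0]))  # Win
```

Because the per-feature terms are simply added, correlated features (such as rating and potential) are effectively counted more than once, which is the price of the "naive" assumption.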

In [110]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pandas as pd
import numpy as np
import csv
import json
import datetime

In [111]:
# load data
data = pd.read_csv('data/training_data.csv')
X = data.values[:,:-1]
Y = data.values[:,-1]

In [112]:
def filter_features(feat, overall_rating=True, potential=True, age=True):
    ''' Filters out specific attributes of the players '''
    keep_col_nums = [i for i in range(feat.shape[1]) if (i%3==0 and overall_rating) or (i%3==1 and potential) or (i%3==2 and age)]
    return feat[:,keep_col_nums]

def mirror_teams(X,Y):
    ''' Data -> [Team1,Team2], [Team2,Team1] '''
    num_cols = X.shape[1]
    # integer division so the result can be used as a slice index
    team1_ind = num_cols//2
    mirror_X = np.concatenate((X[:,team1_ind:], X[:,:team1_ind]), axis=1)
    feat = np.concatenate((X, mirror_X), axis=0)
    # copy the labels so the originals are not overwritten while flipping
    mirror_Y = Y.copy()
    mirror_Y[Y=='Win'] = 'Defeat'
    mirror_Y[Y=='Defeat'] = 'Win'
    labels = np.concatenate((Y,mirror_Y), axis=0)
    return feat, labels
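
The two helpers above can be illustrated on a tiny example. The sketch below uses plain Python lists instead of NumPy so that it runs without dependencies; `kept_columns` and `mirror_sample` are hypothetical names, and the sample has one player per team rather than eleven.

```python
def kept_columns(num_cols, overall_rating=True, potential=True, age=True):
    """Column indices kept for repeating (rating, potential, age) triples."""
    wanted = (overall_rating, potential, age)
    return [i for i in range(num_cols) if wanted[i % 3]]

def mirror_sample(features, label):
    """Swap the two teams and flip the label accordingly."""
    half = len(features) // 2
    flipped = {'Win': 'Defeat', 'Defeat': 'Win'}.get(label, label)
    return features[half:] + features[:half], flipped

# Keep only the overall-rating columns of a 6-column sample.
print(kept_columns(6, overall_rating=True, potential=False, age=False))  # [0, 3]

# One player per team: (rating, potential, age) for Team1, then Team2.
print(mirror_sample([80, 85, 25, 75, 78, 29], 'Win'))
# ([75, 78, 29, 80, 85, 25], 'Defeat')
```

Mirroring doubles the training set and removes any dependence on which team happens to be listed first; a Draw stays a Draw.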

In [113]:
# Cross Validation accuracy (mean over 5 folds)
clf = GaussianNB()

# all features
print(sum(cross_val_score(clf, X, Y, cv=5, n_jobs=-1))/5)

# overall rating
feat = filter_features(X, True, False, False)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)

# potential
feat = filter_features(X, False, True, False)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)

# age
feat = filter_features(X, False, False, True)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)

# overall rating and potential
feat = filter_features(X, True, True, False)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)

# overall rating and age
feat = filter_features(X, True, False, True)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)

# potential and age
feat = filter_features(X, False, True, True)
print(sum(cross_val_score(clf, feat, Y, cv=5, n_jobs=-1))/5)


0.4757733204091652
0.48602615173636804
0.4878049259472824
0.45437940423145956
0.4744156752678772
0.48630691910579343
0.4879448276038788


From the results above, potential and age together give the best cross-validation accuracy (about 0.488), although the differences between the feature subsets are small. In the end, I used all the features anyway for the final classifier, which is saved in the data folder as the pickle file `gaussian.pkl`.
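
The seven scores are easier to compare once each is paired with the feature subset that produced it, in the order the cell prints them. The snippet below only re-tabulates the numbers from the output above; the dictionary name `cv_accuracy` is introduced here for illustration.

```python
# Mean 5-fold cross-validation accuracies copied from the output above.
cv_accuracy = {
    'all features': 0.4757733204091652,
    'overall rating': 0.48602615173636804,
    'potential': 0.4878049259472824,
    'age': 0.45437940423145956,
    'overall rating and potential': 0.4744156752678772,
    'overall rating and age': 0.48630691910579343,
    'potential and age': 0.4879448276038788,
}

best = max(cv_accuracy, key=cv_accuracy.get)
gap = cv_accuracy[best] - cv_accuracy['all features']
print(best)           # potential and age
print(round(gap, 4))  # 0.0122
```

The gap between the best subset and the full feature set is roughly 1.2 percentage points.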

In [121]:
all_feat_trained_model = joblib.load('data/gaussian.pkl')

## Predicting 2018 World Cup Matches Results

Since the player data used to train the model came from sofifa.com, I used the same source to predict the World Cup matches. I started by preparing a list of probable starting XI players, taken from the probable starting XIs published by foxsports.com, and then scraped the current stats for those players from sofifa.com. The starting XI for each team is in the `starting_xi.csv` file in the data folder, and the scraped player attributes are in the `sofifa_data` folder inside the data directory, with one JSON file for each of the 32 participating countries. Besides the starting XI players, I also scraped other good players from each country so that the predictions can be updated when the official starting XIs are announced before the matches.

In [122]:
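# Starting XI: maps country name -> list of player names
starting_xi = {}
# the 'rU' mode was removed in Python 3.11; newline='' is what the csv module recommends
with open('data/starting_xi.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        starting_xi[row[0]] = row[1:]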

In [123]:
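def get_age_before_match(birthday, match_date):
    ''' Age in days on the match date; birthday is like 'Jun 24, 1987', match_date like '2018-06-14' '''
    t1 = datetime.datetime.strptime(birthday,'%b %d, %Y')
    t2 = datetime.datetime.strptime(match_date,'%Y-%m-%d')
    return (t2-t1).days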

def read_sofifa_data(country):
    player_info = {}
    file_name = '_'.join(country.lower().split())+'.json'
    with open('data/sofifa_data/'+file_name) as f:
        for line in f:
            obj = json.loads(line.strip())
            player_info[obj['name']] = [float(obj['overall_rating']),
                                        float(obj['potential']),
                                        obj['bday'].replace('(','').replace(')','')]
    return player_info


def get_features(country1, country2, match_date):
    team1 = read_sofifa_data(country1)
    team2 = read_sofifa_data(country2)
    
    feat = [f for pl in starting_xi[country1] for f in team1[pl][:2]+[get_age_before_match(team1[pl][2], match_date)]]
    feat += [f for pl in starting_xi[country2] for f in team2[pl][:2]+[get_age_before_match(team2[pl][2], match_date)]]
    return feat
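
To show what the functions above expect, here is a small sketch that parses one JSON line and turns it into the three per-player features. The record is hypothetical: the field names match what `read_sofifa_data` reads, but the player, the values, and the assumption that ratings are stored as strings are made up for illustration.

```python
import datetime
import json

# Hypothetical record with the keys read_sofifa_data expects.
line = ('{"name": "Example Player", "overall_rating": "84", '
        '"potential": "86", "bday": "(Jun 24, 1990)"}')

obj = json.loads(line)
# The birthday is wrapped in parentheses, which are stripped before parsing.
birthday = obj['bday'].replace('(', '').replace(')', '')
born = datetime.datetime.strptime(birthday, '%b %d, %Y')
match_day = datetime.datetime.strptime('2018-06-14', '%Y-%m-%d')

# Same order as the training features: rating, potential, age (in days).
features = [float(obj['overall_rating']),
            float(obj['potential']),
            (match_day - born).days]

# Country names map to file names by lower-casing and joining with underscores.
file_name = '_'.join('Saudi Arabia'.lower().split()) + '.json'
print(file_name)  # saudi_arabia.json
```

Note that `%b` parses month abbreviations according to the current locale, so this assumes an English locale.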

In [126]:
def predict_result(trained_model, country1, country2, match_date):
    ''' Prints the predicted result; label 0 means country1 wins, 2 means country2 wins, anything else is a draw '''
    f = get_features(country1,country2,match_date)
    result = trained_model.predict(np.array(f).reshape(1,-1))
    pre_string = '{country1} vs {country2} result: '.format(country1=country1, country2=country2)
    if result[0]==0:
        print(pre_string+'{winner} wins'.format(winner=country1))
    elif result[0]==2:
        print(pre_string+'{winner} wins'.format(winner=country2))
    else:
        print(pre_string+'Draw')
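
The label-to-text mapping can be tried without the pickled model or the scraped data. In the sketch below, `AlwaysTeam1Wins` is a made-up stand-in for the trained classifier and `describe_result` is a hypothetical variant of the mapping that returns a string instead of printing it, which makes it easier to reuse when building the bracket.

```python
class AlwaysTeam1Wins(object):
    """Stand-in for the trained model; always predicts label 0."""
    def predict(self, features):
        return [0 for _ in features]

def describe_result(label, country1, country2):
    """Map a predicted label to text using the same codes as predict_result."""
    if label == 0:
        return '{} wins'.format(country1)
    if label == 2:
        return '{} wins'.format(country2)
    return 'Draw'

model = AlwaysTeam1Wins()
# One sample with 66 placeholder features, shaped like a single match.
label = model.predict([[0.0] * 66])[0]
print(describe_result(label, 'Russia', 'Saudi Arabia'))  # Russia wins
```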

## Matches

In [127]:
pred_model = all_feat_trained_model

# Day 1
print("############# DAY 1 Results ###########")
predict_result(pred_model, 'Russia','Saudi Arabia','2018-06-14')
print('')
# Day 2
print("############# DAY 2 Results ###########")
predict_result(pred_model, 'Egypt','Uruguay','2018-06-15')
predict_result(pred_model, 'Morocco','Iran','2018-06-15')
predict_result(pred_model, 'Portugal','Spain','2018-06-15')
print('')
# Day 3
print("############# DAY 3 Results ###########")
predict_result(pred_model, 'France','Australia','2018-06-16')
predict_result(pred_model, 'Argentina','Iceland','2018-06-16')
predict_result(pred_model, 'Peru','Denmark','2018-06-16')
predict_result(pred_model, 'Croatia','Nigeria','2018-06-16')
print('')
# Day 4
print("############# DAY 4 Results ###########")
predict_result(pred_model, 'Costa Rica','Serbia','2018-06-17')
predict_result(pred_model, 'Germany','Mexico','2018-06-17')
predict_result(pred_model, 'Brazil','Switzerland','2018-06-17')
print('')
# Day 5
print("############# DAY 5 Results ###########")
predict_result(pred_model, 'Sweden','South Korea','2018-06-18')
predict_result(pred_model, 'Belgium','Panama','2018-06-18')
predict_result(pred_model, 'Tunisia','England','2018-06-18')
print('')
# Day 6
print("############# DAY 6 Results ###########")
predict_result(pred_model, 'Colombia','Japan','2018-06-19')
predict_result(pred_model, 'Poland','Senegal','2018-06-19')
predict_result(pred_model, 'Russia','Egypt','2018-06-19')
print('')
# Day 7
print("############# DAY 7 Results ###########")
predict_result(pred_model, 'Portugal','Morocco','2018-06-20')
predict_result(pred_model, 'Uruguay','Saudi Arabia','2018-06-20')
predict_result(pred_model, 'Iran','Spain','2018-06-20')
print('')
# Day 8
print("############# DAY 8 Results ###########")
predict_result(pred_model, 'Denmark','Australia','2018-06-21')
predict_result(pred_model, 'France','Peru','2018-06-21')
predict_result(pred_model, 'Argentina','Croatia','2018-06-21')
print('')
# Day 9
print("############# DAY 9 Results ###########")
predict_result(pred_model, 'Brazil','Costa Rica','2018-06-22')
predict_result(pred_model, 'Nigeria','Iceland','2018-06-22')
predict_result(pred_model, 'Serbia','Switzerland','2018-06-22')
print('')
# Day 10
print("############# DAY 10 Results ###########")
predict_result(pred_model, 'Belgium','Tunisia','2018-06-23')
predict_result(pred_model, 'South Korea','Mexico','2018-06-23')
predict_result(pred_model, 'Germany','Sweden','2018-06-23')
print('')
# Day 11
print("############# DAY 11 Results ###########")
predict_result(pred_model, 'England','Panama','2018-06-24')
predict_result(pred_model, 'Japan','Senegal','2018-06-24')
predict_result(pred_model, 'Poland','Colombia','2018-06-24')
print('')
# Day 12
print("############# DAY 12 Results ###########")
predict_result(pred_model, 'Saudi Arabia','Egypt','2018-06-25')
predict_result(pred_model, 'Uruguay','Russia','2018-06-25')
predict_result(pred_model, 'Iran','Portugal','2018-06-25')
predict_result(pred_model, 'Spain','Morocco','2018-06-25')
print('')
# Day 13
print("############# DAY 13 Results ###########")
predict_result(pred_model, 'Australia','Peru','2018-06-26')
predict_result(pred_model, 'Denmark','France','2018-06-26')
predict_result(pred_model, 'Nigeria','Argentina','2018-06-26')
predict_result(pred_model, 'Iceland','Croatia','2018-06-26')
print('')
# Day 14
print("############# DAY 14 Results ###########")
predict_result(pred_model, 'Mexico','Sweden','2018-06-27')
predict_result(pred_model, 'South Korea','Germany','2018-06-27')
predict_result(pred_model, 'Switzerland','Costa Rica','2018-06-27')
predict_result(pred_model, 'Serbia','Brazil','2018-06-27')
print('')
# Day 15
print("############# DAY 15 Results ###########")
predict_result(pred_model, 'Senegal','Colombia','2018-06-28')
predict_result(pred_model, 'Japan','Poland','2018-06-28')
predict_result(pred_model, 'England','Belgium','2018-06-28')
predict_result(pred_model, 'Panama','Tunisia','2018-06-28')
print('')


############# DAY 1 Results ###########
Russia vs Saudi Arabia result: Russia wins

############# DAY 2 Results ###########
Egypt vs Uruguay result: Uruguay wins
Morocco vs Iran result: Morocco wins
Portugal vs Spain result: Spain wins

############# DAY 3 Results ###########
France vs Australia result: France wins
Argentina vs Iceland result: Argentina wins
Peru vs Denmark result: Denmark wins
Croatia vs Nigeria result: Croatia wins

############# DAY 4 Results ###########
Costa Rica vs Serbia result: Serbia wins
Germany vs Mexico result: Germany wins
Brazil vs Switzerland result: Brazil wins

############# DAY 5 Results ###########
Sweden vs South Korea result: Sweden wins
Belgium vs Panama result: Belgium wins
Tunisia vs England result: England wins

############# DAY 6 Results ###########
Colombia vs Japan result: Colombia wins
Poland vs Senegal result: Poland wins
Russia vs Egypt result: Russia wins

############# DAY 7 Results ###########
Portugal vs Morocco result: Portugal wins

## Knockout Stage

Based on the group-stage predictions above, here are the match-ups and predictions for the knockout stage.
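
The notebook does not show how the match-ups were derived from the group results, so the following is only one possible sketch. It tallies group points (3 for a win, 1 for a draw) from predicted outcomes. The function name `group_points` and the tuple format are assumptions, and only the three Group A predictions visible in the output above are used, so the resulting table is partial. Because the model predicts outcomes rather than scores, ties on points cannot be broken by goal difference here.

```python
from collections import Counter

def group_points(results):
    """Tally points from (team1, team2, outcome) tuples.

    outcome is 'team1', 'team2' or 'draw'. A win is worth 3 points and a
    draw 1 point for each side, as in the World Cup group stage.
    """
    points = Counter()
    for team1, team2, outcome in results:
        points[team1] += 0  # make sure teams without points still appear
        points[team2] += 0
        if outcome == 'team1':
            points[team1] += 3
        elif outcome == 'team2':
            points[team2] += 3
        else:
            points[team1] += 1
            points[team2] += 1
    return points

# The three Group A predictions shown in the output above.
group_a = [
    ('Russia', 'Saudi Arabia', 'team1'),
    ('Egypt', 'Uruguay', 'team2'),
    ('Russia', 'Egypt', 'team1'),
]
standings = sorted(group_points(group_a).items(), key=lambda kv: -kv[1])
print(standings[0])  # ('Russia', 6)
```

In the actual tournament format, the top two teams of each group advance, and the winner of one group meets the runner-up of its paired group in the Round of 16.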

### Round of 16

In [130]:
predict_result(pred_model, 'Uruguay','Portugal','2018-06-30')
predict_result(pred_model, 'France','Croatia','2018-06-30')
predict_result(pred_model, 'Brazil','Mexico','2018-07-02')
predict_result(pred_model, 'Belgium','Poland','2018-07-02')
predict_result(pred_model, 'Spain','Russia','2018-07-01')
predict_result(pred_model, 'Argentina','Denmark','2018-07-01')
predict_result(pred_model, 'Germany','Switzerland','2018-07-03')
predict_result(pred_model, 'Colombia','England','2018-07-03')

Uruguay vs Portugal result: Portugal wins
France vs Croatia result: France wins
Brazil vs Mexico result: Brazil wins
Belgium vs Poland result: Belgium wins
Spain vs Russia result: Spain wins
Argentina vs Denmark result: Argentina wins
Germany vs Switzerland result: Germany wins
Colombia vs England result: England wins


### Quarterfinals

In [131]:
predict_result(pred_model, 'Portugal','France','2018-07-06')
predict_result(pred_model, 'Brazil','Belgium','2018-07-06')
predict_result(pred_model, 'Spain','Argentina','2018-07-07')
predict_result(pred_model, 'Germany','England','2018-07-07')

Portugal vs France result: France wins
Brazil vs Belgium result: Brazil wins
Spain vs Argentina result: Spain wins
Germany vs England result: Germany wins


### Semifinals

In [132]:
predict_result(pred_model, 'France','Brazil','2018-07-10')
predict_result(pred_model, 'Spain','Germany','2018-07-11')

France vs Brazil result: France wins
Spain vs Germany result: Spain wins


### Finals

In [133]:
predict_result(pred_model, 'France','Spain','2018-07-15')

France vs Spain result: Spain wins
