After a month of intense and suspenseful matches, the final is here. And it features two teams that most predictions missed: France vs Croatia. (Well, I got one right :) )

## Probable Starting Lineups

### France
<img src="images/france_lineup.png" height=500 width=500>

### Croatia
<img src="images/croatia_lineup.png" height=500 width=500>

The lineups above are from the semifinals. France hasn't changed its lineup in the knockout stage, so it will most probably stick to the same starting lineup. Croatia, on the other hand, has made a few changes, but for prediction purposes I chose to go with its semifinal lineup as well.

In [28]:
from sklearn.naive_bayes import GaussianNB
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pandas as pd
import numpy as np
import csv
import json
import datetime
import math

In [41]:
trained_model = joblib.load('data/gaussian.pkl')

In [30]:
starting_xi = {}
        
starting_xi['France'] = ['Hugo Lloris',
                         'Benjamin Pavard',
                         'Raphael Varane',
                         'Samuel Umtiti',
                         'Lucas Hernandez',
                         "N'Golo Kante",
                         'Paul Pogba',
                         'Kylian Mbappe',
                         'Antoine Griezmann',
                         'Blaise Matuidi',
                         'Olivier Giroud',
                         ]
starting_xi['Croatia'] = ['Danijel Subasic',
                          'Sime Vrsaljko',
                          'Dejan Lovren',
                          'Domagoj Vida',
                          'Ivan Strinic',
                          'Marcelo Brozovic',
                          'Ante Rebic',
                          'Luka Modric',
                          'Ivan Rakitic',
                          'Ivan Perisic',
                          'Mario Mandzukic'
                         ]

In [60]:
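def get_age_before_match(d1,d2):
    """Return the player's age in days on match day.

    d1 is the birthday in '%b %d, %Y' format, d2 the match date in '%Y-%m-%d' format.
    """
    t1 = datetime.datetime.strptime(d1,'%b %d, %Y')
    t2 = datetime.datetime.strptime(d2,'%Y-%m-%d')
    return (t2-t1).days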

def read_sofifa_data(country):
    player_info = {}
    file_name = '_'.join(country.lower().split())+'.json'
    with open('data/sofifa_data/'+file_name) as f:
        for line in f:
            obj = json.loads(line.strip())
            player_info[obj['name']] = [float(obj['overall_rating']),
                                        float(obj['potential']),
                                        obj['bday'].replace('(','').replace(')','')]
    return player_info


def get_features(country1, country2, match_date):
    team1 = read_sofifa_data(country1)
    team2 = read_sofifa_data(country2)
    
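    # Three values per player: [overall_rating, potential, age in days].
    # Team 1's 33 values come first, followed by team 2's 33 values.
    feat = [f for pl in starting_xi[country1] for f in team1[pl][:2]+[get_age_before_match(team1[pl][2], match_date)]]
    feat += [f for pl in starting_xi[country2] for f in team2[pl][:2]+[get_age_before_match(team2[pl][2], match_date)]]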
    return feat

def predict_result(trained_model, country1, country2, match_date):
    f = get_features(country1,country2,match_date)
    result = trained_model.predict(np.array(f).reshape(1,-1))
    pre_string = '{country1} vs {country2} result: '.format(country1=country1, country2=country2)
    # Labels: 0 = country1 wins, 2 = country2 wins, anything else = draw
    if result[0]==0:
        print(pre_string+'{winner} wins'.format(winner=country1))
    elif result[0]==2:
        print(pre_string+'{winner} wins'.format(winner=country2))
    else:
        print(pre_string+'Draw')
        
def get_prob_value(feat, sigma, theta):
    return np.exp(-math.pow(feat-theta, 2)/(2*math.pow(sigma,2)))/(sigma*math.sqrt(2*math.pi))

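def get_feature_prob(trained_model, feat, label, n=5):
    # Per-class feature means and spreads learned by GaussianNB.
    # Note: scikit-learn documents sigma_ as per-feature variances,
    # while get_prob_value treats the value as a standard deviation.
    theta = trained_model.theta_[label]
    sigma = trained_model.sigma_[label]

    # Integer split point between team 1's and team 2's features
    div_ind = len(feat)//2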
    # Team 1
    team1_prob = [get_prob_value(feat[i], sigma[i], theta[i]) for i in range(div_ind)]
    
    # Team 2
    team2_prob = [get_prob_value(feat[i], sigma[i], theta[i]) for i in range(div_ind,len(feat))]
    return team1_prob, team2_prob

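To make the age feature concrete, here is a small worked example of the date arithmetic behind `get_age_before_match`. The helper name `days_between` and the sample birthday string are illustrative only, and the parenthesised shape of the raw `bday` field is an assumption inferred from the `replace` calls in `read_sofifa_data`.

```python
import datetime

def days_between(birthday, match_date):
    """Days elapsed from a 'Mon DD, YYYY' birthday to a 'YYYY-MM-DD' match date."""
    born = datetime.datetime.strptime(birthday, '%b %d, %Y')
    played = datetime.datetime.strptime(match_date, '%Y-%m-%d')
    return (played - born).days

# Illustrative value, assumed to mirror the raw 'bday' field before cleaning
raw_bday = '(Dec 20, 1998)'
cleaned = raw_bday.replace('(', '').replace(')', '')

age_in_days = days_between(cleaned, '2018-07-15')
print(age_in_days)  # 7147
```

Note that `%b` parses English month abbreviations only under an English or C locale.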

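As a reference point for `get_prob_value`, the sketch below writes out the per-feature Gaussian likelihood used by Gaussian Naive Bayes with the variance as the spread parameter. scikit-learn documents `sigma_` as the per-class variance of each feature (renamed `var_` in later releases). The names `gaussian_pdf` and `per_feature_likelihoods` and the sample numbers are hypothetical and not part of the notebook.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution N(mean, variance) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def per_feature_likelihoods(features, means, variances):
    """Evaluate each feature independently, as the Naive Bayes assumption allows."""
    return [gaussian_pdf(x, m, v) for x, m, v in zip(features, means, variances)]

# Hypothetical ratings, class means and class variances, for illustration only
likelihoods = per_feature_likelihoods([80.0, 85.0], [78.0, 85.0], [4.0, 9.0])
print(likelihoods)
```

Ranking features by these values, as the notebook does with `np.argsort`, shows which inputs sit closest to what the model expects for a given outcome.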
In [61]:
predict_result(trained_model, 'Croatia','France','2018-07-15')

Croatia vs France result: France wins


In [71]:
f = get_features('Croatia','France','2018-07-15')
_,t2 = get_feature_prob(trained_model, f, 2)
t1,_ = get_feature_prob(trained_model, f, 0)

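# f[1:33:3] and f[34::3] select every player's `potential` for team 1 and team 2 respectively
print("Team 1 best players:  {}".format([starting_xi['Croatia'][ind] for ind in np.argsort([-t for t in f[1:33:3]])[:5]]))
print("Team 2 best players:  {}".format([starting_xi['France'][ind] for ind in np.argsort([-t for t in f[34::3]])[:5]]))
# Indices are positions within each team's own 33-value block, sorted by descending likelihood
print("Team 1 top features:  {}".format(np.argsort([-t for t in t1])[:10]))
print("Team 2 top features:  {}".format(np.argsort([-t for t in t2])[:10]))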

Team 1 best players:  ['Luka Modric', 'Ivan Rakitic', 'Ivan Perisic', 'Danijel Subasic', 'Sime Vrsaljko']
Team 2 best players:  ['Kylian Mbappe', 'Raphael Varane', 'Paul Pogba', 'Samuel Umtiti', "N'Golo Kante"]
Team 1 top features:  [13  4 12 19  7 10 16  6  3 18]
Team 2 top features:  [13  4 12  3 19 16  7 10  6 31]


### Important Features/Players

#### Croatia (Top 5 players according to EA Sports data)
* Luka Modric
* Ivan Rakitic
* Ivan Perisic
* Danijel Subasic
* Sime Vrsaljko

#### France (Top 5 players according to EA Sports data)
* Kylian Mbappe
* Raphael Varane
* Paul Pogba
* Samuel Umtiti
* N'Golo Kante

Because the Naive Bayes classifier assumes that features are conditionally independent given the label, each feature's likelihood under a given outcome can be evaluated on its own, and the features can then be ranked by that likelihood. According to the trained model, the following are the key players expected to contribute most to their respective teams. Note that this set of players can differ from the team's best players.

#### Croatia (Key players)
* Ivan Strinic
* Sime Vrsaljko
* Ante Rebic
* Dejan Lovren
* Domagoj Vida

#### France (Key players)
* Lucas Hernandez
* Benjamin Pavard
* Paul Pogba
* N'Golo Kante
* Raphael Varane
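
The key-player lists appear to follow from the `Team 1 top features` and `Team 2 top features` indices printed earlier. Here is a minimal sketch of that mapping, assuming each player contributes three consecutive features (overall rating, potential, age in days) in lineup order. The names `ATTRIBUTES`, `describe_feature` and `key_players` are hypothetical helpers, not part of the notebook.

```python
# Lineup copied from the starting_xi cell
CROATIA_XI = ['Danijel Subasic', 'Sime Vrsaljko', 'Dejan Lovren', 'Domagoj Vida',
              'Ivan Strinic', 'Marcelo Brozovic', 'Ante Rebic', 'Luka Modric',
              'Ivan Rakitic', 'Ivan Perisic', 'Mario Mandzukic']

# Assumed per-player feature order, matching get_features
ATTRIBUTES = ('overall_rating', 'potential', 'age_in_days')

def describe_feature(index, lineup):
    """Map an index within one team's feature block to (player, attribute)."""
    player_pos, attr_pos = divmod(index, len(ATTRIBUTES))
    return lineup[player_pos], ATTRIBUTES[attr_pos]

def key_players(top_indices, lineup, n=5):
    """Return the first n distinct players, in the order their features appear."""
    players = []
    for idx in top_indices:
        player, _ = describe_feature(idx, lineup)
        if player not in players:
            players.append(player)
    return players[:n]

# "Team 1 top features" as printed by the notebook
croatia_top = [13, 4, 12, 19, 7, 10, 16, 6, 3, 18]
print(key_players(croatia_top, CROATIA_XI))
# ['Ivan Strinic', 'Sime Vrsaljko', 'Ante Rebic', 'Dejan Lovren', 'Domagoj Vida']
```

The same helper applies to France, because the team 2 indices are also positions within that team's own block of 33 values.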