# Predicting FIFA World Cup 2018 using Machine Learning.

Launay Christian & Alfred Chantharath 

With the 2018 FIFA World Cup finished, we were curious to know which team a Machine Learning model would have picked as the winner.



# Goal
1. Use Machine Learning to predict who is going to win the FIFA World Cup 2018.
1. Predict the outcome of individual matches for the entire competition.
1. Run a simulation of the later rounds, i.e. the quarter-finals, semi-finals and final.

These goals present a unique real-world Machine Learning prediction problem and involve solving various Machine Learning tasks: data integration, feature modelling and outcome prediction.

# Data
We used four datasets from Kaggle: FIFA Soccer Rankings, International football results from 1872 to 2017, fixture_cup and World Cup 2018.
We will use the results of historical matches since the first World Cup (1930) for all participating teams.

Limitation: the FIFA ranking was only created in the 1990s, so it is missing for a large portion of the match history. So let’s stick to historical match records.

Environment and tools: jupyter notebook, numpy, pandas, seaborn, matplotlib and scikit-learn.

We will first do some exploratory analysis on the FIFA Soccer Rankings and International football results datasets, do some feature engineering to select the most relevant features for prediction, do some data manipulation, choose a Machine Learning model and finally deploy it on the World Cup fixtures.



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
world_cup = pd.read_csv('../input/world-cup-2018/World Cup 2018 Dataset.csv')

In [None]:
#load data 
results = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')

In [None]:
world_cup.head()


In [None]:
results.head()


# Exploratory Analysis


Exploratory analysis and feature engineering, which involve establishing which features are relevant for the Machine Learning model, are the most time-consuming part of any data science project.

Let’s now add goal difference and outcome columns to the results dataset and check out the new results dataframe.





In [None]:
# Establish the winner of each match ('Draw' when the scores are equal)
results['winning_team'] = np.where(
    results['home_score'] > results['away_score'], results['home_team'],
    np.where(results['home_score'] < results['away_score'], results['away_team'], 'Draw'))

# Add the (absolute) goal difference column
results['goal_difference'] = np.absolute(results['home_score'] - results['away_score'])

results.head()

Next we’ll work with a subset of the data: only the games played by France. This will help us focus on which features are interesting for one country before expanding to all the countries participating in the World Cup.

In [None]:
# Let's work with a subset of the data: the games played by France, stored in a France dataframe
df = results[(results['home_team'] == 'France') | (results['away_team'] == 'France')]
france = df.copy()  # explicit copy, so that columns can be added without a SettingWithCopyWarning
france.head()

The first World Cup was played in 1930. Create a column for the year and keep all the games played from 1930 onwards.

In [None]:
# Create a year column (the first four characters of the date); the first World Cup was held in 1930
france = france.assign(match_year=france['date'].str[:4].astype(int))
france_1930 = france[france.match_year >= 1930]
france_1930.count()

We can now visualize the most common match outcome for France throughout the years.



In [None]:
# What is the most common game outcome for France? (visualisation)
wins = []
for row in france_1930['winning_team']:
    if row != 'France' and row != 'Draw':
        wins.append('Loss')
    else:
        wins.append(row)
winsdf = pd.DataFrame(wins, columns=['France_Results'])

# Plotting (the style is set first so that it applies to the new figure)
sns.set(style='darkgrid')
fig, ax = plt.subplots(1)
fig.set_size_inches(10.7, 6.27)
sns.countplot(x='France_Results', data=winsdf, ax=ax)

Getting the winning rate for every country that will participate in the world cup is a useful metric and we could use it to predict the most likely outcome of each match in the tournament.

The venue of the matches won’t matter that much, since there are no real “home” or “away” teams in World Cup games.
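As a rough sketch of how such a winning rate could be computed, the snippet below uses a small hand-made table with the same `home_team`, `away_team` and `winning_team` columns as `results`. The `win_rate` helper and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the `results` dataframe (made-up rows, for illustration only)
toy_results = pd.DataFrame({
    'home_team':    ['France', 'Brazil', 'France', 'Germany'],
    'away_team':    ['Brazil', 'Germany', 'Germany', 'France'],
    'winning_team': ['France', 'Draw', 'Germany', 'France'],
})

def win_rate(matches, team):
    """Share of the matches played by `team` that `team` won (hypothetical helper)."""
    played = matches[(matches['home_team'] == team) | (matches['away_team'] == team)]
    if len(played) == 0:
        return float('nan')
    return (played['winning_team'] == team).mean()

teams = ['France', 'Brazil', 'Germany']
rates = pd.Series({team: win_rate(toy_results, team) for team in teams})
print(rates.sort_values(ascending=False))
```

On the real data, the same helper could be applied to `results` (or to the 1930 onwards subset) for each of the participating teams.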

# Narrowing to the teams participating in the World Cup
Create a dataframe with all the participating teams.

We then filter the results dataframe to keep only the games involving teams in this year’s World Cup, and drop duplicates.

In [None]:
# Narrowing to the teams participating in the World Cup
# ('Iran' must not carry a leading space, otherwise it never matches a team name)
worldcup_teams = ['Australia', 'Iran', 'Japan', 'Korea Republic', 
            'Saudi Arabia', 'Egypt', 'Morocco', 'Nigeria', 
            'Senegal', 'Tunisia', 'Costa Rica', 'Mexico', 
            'Panama', 'Argentina', 'Brazil', 'Colombia', 
            'Peru', 'Uruguay', 'Belgium', 'Croatia', 
            'Denmark', 'England', 'France', 'Germany', 
            'Iceland', 'Poland', 'Portugal', 'Russia', 
            'Serbia', 'Spain', 'Sweden', 'Switzerland']
df_teams_home = results[results['home_team'].isin(worldcup_teams)]
df_teams_away = results[results['away_team'].isin(worldcup_teams)]
df_teams = pd.concat((df_teams_home, df_teams_away))
# drop_duplicates() returns a new dataframe, so the result must be assigned back
df_teams = df_teams.drop_duplicates()
df_teams.count()

In [None]:
df_teams.head()


Create a year column and drop the games played before 1930, as well as the columns that won’t affect the match outcome, for example date, home_score, away_score, tournament, city, country, goal_difference and match_year.

In [None]:
# Create a year column (the first four characters of the date) to drop games before 1930
df_teams['match_year'] = df_teams['date'].str[:4].astype(int)
df_teams_1930 = df_teams[df_teams.match_year >= 1930]
df_teams_1930.head()

In [None]:
# Dropping columns that will not affect match outcomes
# (drop from df_teams_1930, not df_teams, so that the 1930 filter is kept)
df_teams_1930 = df_teams_1930.drop(['date', 'home_score', 'away_score', 'tournament', 'city', 'country', 'goal_difference', 'match_year'], axis=1)
df_teams_1930.head()

Modify the “Y” (prediction label) in order to simplify our model’s processing.

The winning_team column will show “2” if the home team has won, “1” if it was a tie, and “0” if the away team has won.

In [None]:
#Building the model
#the prediction label: The winning_team column will show "2" if the home team has won, "1" if it was a tie, and "0" if the away team has won.

df_teams_1930 = df_teams_1930.reset_index(drop=True)
df_teams_1930.loc[df_teams_1930.winning_team == df_teams_1930.home_team,'winning_team']=2
df_teams_1930.loc[df_teams_1930.winning_team == 'Draw', 'winning_team']=1
df_teams_1930.loc[df_teams_1930.winning_team == df_teams_1930.away_team, 'winning_team']=0

df_teams_1930.head()

Convert home_team and away_team from categorical variables to numeric inputs by creating dummy variables.

We use the pandas get_dummies() function. It replaces categorical columns with their one-hot (numbers ‘1’ and ‘0’) representations, enabling them to be loaded into a scikit-learn model.

We then separate the X and Y sets and split the data into 75 percent training and 25 percent test.
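To make the one-hot step concrete, here is a minimal sketch on three made-up matches; the toy rows are an assumption for illustration, while the `get_dummies` call mirrors the one used in this notebook.

```python
import pandas as pd

# Toy matches (made-up rows); label: 2 = home win, 1 = draw, 0 = away win
toy = pd.DataFrame({
    'home_team':    ['France', 'Brazil', 'France'],
    'away_team':    ['Brazil', 'Germany', 'Germany'],
    'winning_team': [2, 1, 0],
})

# Each team name becomes its own 0/1 column, prefixed by the role it played
encoded = pd.get_dummies(toy, prefix=['home_team', 'away_team'],
                         columns=['home_team', 'away_team'])
print(encoded.columns.tolist())
print(encoded.astype(int))
```

Every row ends up with exactly two active columns, one for the home team and one for the away team, which is all the model gets to see about a match.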



In [None]:
# Convert home team and away team from categorical variables to numeric inputs
# Get dummy variables
final = pd.get_dummies(df_teams_1930, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Separate X and y sets
X = final.drop(['winning_team'], axis=1)
y = final["winning_team"]
y = y.astype('int')

# Separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

We will use logistic regression, a classification algorithm. How does this algorithm work? It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, specifically the cumulative logistic distribution.

In other words, logistic regression attempts to predict an outcome (here a win, a draw or a loss) given a set of data points (stats) that likely influence that outcome.

In practice, you feed the algorithm past games, each with both the aforementioned “set of data” and the actual outcome of the game. The model then learns whether each piece of data influences the outcome of the game positively or negatively, and to what extent.

Give it enough (good) data and you have a model that you can use to predict future outcomes.

A model is as good as the data you give it.
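The role of the logistic function can be illustrated with a small sketch. It is deliberately simplified to two classes and one made-up feature (the notebook itself has three outcomes), and checks that squashing the learned linear score through the logistic function reproduces the probabilities returned by scikit-learn. All data below is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature dataset: larger x makes class 1 more likely
X_toy = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_toy, y_toy)

# Linear score w*x + b, squashed into (0, 1) by the logistic function
scores = X_toy @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

The learned coefficient plays the role described above: its sign says whether the feature pushes the outcome up or down, and its size says to what extent.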

Let’s have a look at our final dataframe:

In [None]:
final.head()


In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
score = logreg.score(X_train, y_train)
score2 = logreg.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Our model reached 57% accuracy on the training set and 56% accuracy on the test set. This doesn’t look great, but let’s move on.
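One way to put such an accuracy in context is to compare it with a trivial baseline that always predicts the most frequent outcome. The sketch below does this on made-up labels using scikit-learn's `DummyClassifier`; the toy label distribution is an assumption and says nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 2 = home win, 1 = draw, 0 = away win
y_toy = np.array([2, 2, 2, 2, 2, 1, 1, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print("Baseline accuracy: ", '%.3f' % baseline_accuracy)
```

In this notebook the baseline would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`, so that it can be compared directly with the logistic regression scores.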

At this point we will create the dataframe on which we will deploy our model.

We will start by loading the FIFA ranking dataset and a dataset containing the fixtures of the group stage of the tournament. The team that is positioned higher in the FIFA ranking will be considered the “favourite” for the match and will therefore be placed in the “home_team” column, since there are no “home” or “away” teams in World Cup games. We then add teams to the new prediction dataset based on the ranking position of each team. The next step will be to create dummy variables and deploy the machine learning model.
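The two ideas in this paragraph, putting the better-ranked team in the home slot and giving the prediction set the same dummy columns as the training set, can be sketched on toy data as follows. The ranks, fixtures and `train_columns` list are made up, and `reindex(..., fill_value=0)` is used here as one possible way to align the columns.

```python
import pandas as pd

# Made-up rank lookup and training columns, for illustration only
toy_rank = pd.Series({'Germany': 1, 'Brazil': 2, 'Mexico': 15})
train_columns = ['home_team_Brazil', 'home_team_Germany', 'home_team_Mexico',
                 'away_team_Brazil', 'away_team_Germany', 'away_team_Mexico']

fixtures_toy = [('Mexico', 'Germany'), ('Brazil', 'Mexico')]

rows = []
for team_a, team_b in fixtures_toy:
    # The better-ranked team (lower rank number) takes the 'home' slot
    if toy_rank[team_a] < toy_rank[team_b]:
        home, away = team_a, team_b
    else:
        home, away = team_b, team_a
    rows.append({'home_team': home, 'away_team': away})

pred_toy = pd.get_dummies(pd.DataFrame(rows), prefix=['home_team', 'away_team'],
                          columns=['home_team', 'away_team'])
# Align with the training columns: teams absent from the fixtures become all-zero columns
pred_toy = pred_toy.reindex(columns=train_columns, fill_value=0)
print(pred_toy.astype(int))
```

Keeping the column order identical to the training data matters, because the fitted model identifies features by position.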



# Match Prediction


# Deploying the model to the dataset
We start with deploying the model to the group matches.

In [None]:
#adding Fifa rankings
#the team which is positioned higher on the FIFA Ranking will be considered "favourite" for the match
#and therefore, will be positioned under the "home_teams" column
#since there are no "home" or "away" teams in World Cup games. 

# Loading new datasets
ranking = pd.read_csv('../input/fifa-international-soccer-mens-ranking-1993now/fifa_ranking.csv') 
fixtures = pd.read_csv('../input/fixture-cup/cupp.csv')
rankings = ranking.drop_duplicates(subset='country_full',)
# List for storing the group stage games
pred_set = []

In [None]:
# Create new columns with ranking position of each team
rank_by_country = rankings.set_index('country_full', verify_integrity=True)['rank']
fixtures.insert(1, 'first_position', fixtures['Home Team'].map(rank_by_country))
fixtures.insert(2, 'second_position', fixtures['Away Team'].map(rank_by_country))

# We only need the group stage games, so we have to slice the dataset
fixtures = fixtures.iloc[:48, :]
fixtures.tail()

In [None]:

# Loop to add teams to new prediction dataset based on the ranking position of each team
for index, row in fixtures.iterrows():
    if row['first_position'] < row['second_position']:
        pred_set.append({'home_team': row['Home Team'], 'away_team': row['Away Team'], 'winning_team': None})
    else:
        pred_set.append({'home_team': row['Away Team'], 'away_team': row['Home Team'], 'winning_team': None})
        
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

pred_set.head()

In [None]:
# Get dummy variables and drop winning_team column
pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

# Add missing columns compared to the model's training dataset
missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]

# Remove winning team column
pred_set = pred_set.drop(['winning_team'], axis=1)

pred_set.head()

Here are the results of group stages.



In [None]:
# Group matches
predictions = logreg.predict(pred_set)
# Probability columns follow logreg.classes_: 0 = away win, 1 = draw, 2 = home win
probabilities = logreg.predict_proba(pred_set)
for i in range(fixtures.shape[0]):
    # Select teams by column name; positional indexing depends on the DataFrame's column order
    home = backup_pred_set.loc[i, 'home_team']
    away = backup_pred_set.loc[i, 'away_team']
    print(home + " and " + away)
    if predictions[i] == 2:
        print("Winner: " + home)
    elif predictions[i] == 1:
        print("Draw")
    elif predictions[i] == 0:
        print("Winner: " + away)
    print('Probability of ' + home + ' winning: ', '%.3f' % probabilities[i][2])
    print('Probability of Draw: ', '%.3f' % probabilities[i][1])
    print('Probability of ' + away + ' winning: ', '%.3f' % probabilities[i][0])
    print("")

In [None]:
# Round of 16 matches (list of tuples)
group_16 = [('Uruguay', 'Portugal'),
            ('France', 'Croatia'),
            ('Brazil', 'Sweden'),
            ('England', 'Colombia'),
            ('Spain', 'Russia'),
            ('Argentina', 'Denmark'),
            ('Germany', 'Switzerland'),
            ('Poland', 'Belgium')]

In [None]:
def clean_and_predict(matches, ranking, final, logreg):

    # Creating the list of rows for the prediction DataFrame
    pred_set = []

    for team_1, team_2 in matches:
        # Retrieve each team's position according to the FIFA ranking
        # (uses the `ranking` argument; `.iloc[0]` picks the first listed rank for the team)
        position_1 = ranking.loc[ranking['country_full'] == team_1, 'rank'].iloc[0]
        position_2 = ranking.loc[ranking['country_full'] == team_2, 'rank'].iloc[0]

        # If the position of the first team is better, it will be the 'home' team, and vice versa
        if position_1 < position_2:
            pred_set.append({'home_team': team_1, 'away_team': team_2})
        else:
            pred_set.append({'home_team': team_2, 'away_team': team_1})

    # Convert list into DataFrame
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    # Get dummy variables and drop winning_team column
    pred_set = pd.get_dummies(pred_set, prefix=['home_team', 'away_team'], columns=['home_team', 'away_team'])

    # Add missing columns compared to the model's training dataset
    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    # Remove winning team column
    pred_set = pred_set.drop(['winning_team'], axis=1)

    # Predict!
    predictions = logreg.predict(pred_set)
    # Probability columns follow logreg.classes_: 0 = away win, 1 = draw, 2 = home win
    probabilities = logreg.predict_proba(pred_set)
    for i in range(len(pred_set)):
        # Select teams by column name; positional indexing depends on the DataFrame's column order
        home = backup_pred_set.loc[i, 'home_team']
        away = backup_pred_set.loc[i, 'away_team']
        print(home + " and " + away)
        if predictions[i] == 2:
            print("Winner: " + home)
        elif predictions[i] == 1:
            print("Draw")
        elif predictions[i] == 0:
            print("Winner: " + away)
        print('Probability of ' + home + ' winning: ', '%.3f' % probabilities[i][2])
        print('Probability of Draw: ', '%.3f' % probabilities[i][1])
        print('Probability of ' + away + ' winning: ', '%.3f' % probabilities[i][0])
        print("")

In [None]:
clean_and_predict(group_16, ranking, final, logreg)


In [None]:
# List of matches
quarters = [('France', 'Russia'),
            ('Uruguay', 'Argentina'),
            ('Brazil', 'England'),
            ('Germany', 'Belgium')]

In [None]:
clean_and_predict(quarters, ranking, final, logreg)


In [None]:
# List of matches
semi = [('Russia', 'Brazil'),
        ('Argentina', 'Germany')]

In [None]:
clean_and_predict(semi, ranking, final, logreg)


In [None]:
# Finals
finals = [('Brazil', 'Germany')]

In [None]:
clean_and_predict(finals, ranking, final, logreg)


According to this model Germany was likely to win this World Cup.

# Areas of further Research/Improvement
We used the global FIFA ranking, whereas the 2018 FIFA ranking could have been more realistic for the predictions. Including the influence of key players among the features would also be interesting. We will probably improve this work in the weeks to come.
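As a possible starting point for the ranking improvement, the sketch below keeps the most recent ranking of each country instead of the first one listed. It runs on a made-up table; the `rank_date` column name is an assumption about the ranking file, while `country_full` and `rank` are the columns already used in this notebook.

```python
import pandas as pd

# Made-up ranking history; `rank_date` is an assumed column name
toy_ranking = pd.DataFrame({
    'country_full': ['Germany', 'France', 'Germany', 'France'],
    'rank':         [1, 12, 1, 7],
    'rank_date':    ['1993-08-08', '1993-08-08', '2018-06-07', '2018-06-07'],
})

# Keep, for every country, the row with the most recent rank_date
latest = (toy_ranking
          .assign(rank_date=pd.to_datetime(toy_ranking['rank_date']))
          .sort_values('rank_date')
          .drop_duplicates(subset='country_full', keep='last')
          .set_index('country_full')['rank'])
print(latest)
```

By contrast, `drop_duplicates` with its default `keep='first'` retains whichever row for a country appears first in the file.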

