#Predicting the result of a game between Team 1 and Team 2

You have been recruited as a football analyst at Mchezopesa Ltd and given the task below.

Predict the result of a game between team 1 and team 2, based on who's home and who's away, and on whether or not the game is friendly (include rank in your training).

You have two possible approaches (as shown below), given the datasets that will be provided.

Input: Home team, Away team, Tournament type (World cup, Friendly, Other)
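The three tournament types listed in the input can be derived from a raw tournament label. The sketch below is one possible mapping, not the project's prescribed method: the helper `tournament_type` is hypothetical, the sample rows are made up, and the label `"FIFA World Cup"` is an assumption about how World Cup matches are named in the data.

```python
import pandas as pd


def tournament_type(name: str) -> str:
    """Collapse a raw tournament label into World cup, Friendly, or Other."""
    if name == "Friendly":
        return "Friendly"
    if name == "FIFA World Cup":  # assumed label for World Cup matches
        return "World cup"
    return "Other"


# Made-up rows, for illustration only
sample = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team C"],
    "away_team": ["Team B", "Team C", "Team A"],
    "tournament": ["Friendly", "FIFA World Cup", "Some regional cup"],
})

# Add the three-level tournament type as a new column
sample["tournament_type"] = sample["tournament"].map(tournament_type)
print(sample[["home_team", "away_team", "tournament_type"]])
```

Whether qualification matches should count as "World cup" or "Other" is a modelling choice; this sketch places them under "Other".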

#1. Defining the Question

##a) Specifying the Question

To predict the results of a game between team 1 and team 2, based on who's home and who's away, and on whether or not the game is friendly

##b) Defining the Metric of success

To be able to create a suitable model that predicts the number of goals a team playing in a tournament will score using polynomial or logistic regression 

##c) Understanding the context

There are two datasets available for this project: results and fifa_ranking.
The results dataset records, for each match, the home team and the away team, the scores, the type of tournament, and the country the match was played in.
The FIFA ranking dataset shows the rankings of different countries by points and ranking dates.
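Because the task asks for rank to be included in training, each match needs the ranking that was in force for both teams on the match date. A minimal sketch of one way to do this with `pd.merge_asof`, assuming the column names `date`, `home_team`, `away_team`, `rank`, `country_full` and `rank_date`; the helper `add_rank`, the team names and all values are made up for illustration.

```python
import pandas as pd

# Made-up rankings: two ranking dates for two hypothetical teams
rankings = pd.DataFrame({
    "rank_date": pd.to_datetime(
        ["2018-05-17", "2018-05-17", "2018-06-07", "2018-06-07"]
    ),
    "country_full": ["Team A", "Team B", "Team A", "Team B"],
    "rank": [10, 25, 8, 30],
})

# Made-up matches
matches = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-01", "2018-06-20"]),
    "home_team": ["Team A", "Team B"],
    "away_team": ["Team B", "Team A"],
    "tournament": ["Friendly", "Friendly"],
})


def add_rank(matches, rankings, team_col, rank_col):
    """Attach the most recent rank on or before each match date for team_col."""
    ranked = rankings.rename(columns={"rank": rank_col})
    ranked = ranked[["rank_date", "country_full", rank_col]].sort_values("rank_date")
    merged = pd.merge_asof(
        matches.sort_values("date"),  # merge_asof needs both sides sorted by the key
        ranked,
        left_on="date",
        right_on="rank_date",
        left_by=team_col,
        right_by="country_full",
        direction="backward",  # only use rankings published before the match
    )
    # Drop the helper columns so the function can be applied a second time
    return merged.drop(columns=["rank_date", "country_full"], errors="ignore")


matches = add_rank(matches, rankings, "home_team", "home_rank")
matches = add_rank(matches, rankings, "away_team", "away_rank")
print(matches)
```

Matching backward in time avoids leaking a ranking published after the match into the features. Matches played before a team's first ranking date would get a missing rank and would need to be handled separately.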

##d) Recording the experimental design

i) Perform EDA and any necessary feature engineering 

ii) Check for multicollinearity

iii) Model building

iv) Cross-validation of the model

v) Compute RMSE

vi) Create residual plots for the models, and assess heteroscedasticity of the models using Bartlett’s test

vii) Perform appropriate regressions on the data and justify the choice of each

viii) Challenge the solution by providing insights on how it could be improved.
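Steps v) and vi) can be sketched on toy numbers. The block below is a minimal illustration, not the analysis itself: the goal counts and predictions are made up, `rmse` is a hypothetical helper, and splitting the residuals into a low-prediction group and a high-prediction group is just one simple way to form the groups that `scipy.stats.bartlett` compares.

```python
import numpy as np
from scipy import stats


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up goals scored and made-up model predictions
y_true = np.array([0, 4, 2, 2, 3, 1, 0, 2])
y_pred = np.array([1, 3, 2, 1, 3, 2, 1, 1])

model_rmse = rmse(y_true, y_pred)
residuals = y_true - y_pred

# Split residuals by predicted value: lower half vs upper half
order = np.argsort(y_pred, kind="stable")
half = len(order) // 2
low_group = residuals[order[:half]]
high_group = residuals[order[half:]]

# Bartlett's test: the null hypothesis is that the groups have equal variances
statistic, p_value = stats.bartlett(low_group, high_group)

print(f"RMSE: {model_rmse:.3f}")
print(f"Bartlett statistic: {statistic:.3f}, p-value: {p_value:.3f}")
```

A small p-value would suggest that the residual variance differs between the groups, which is a sign of heteroscedasticity. Bartlett's test is sensitive to departures from normality, so the result is best read alongside the residual plot.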

#Importing our libraries

In [15]:
#Importing numpy and pandas for data manipulation, and matplotlib and seaborn for data visualization

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
#importing sklearn libraries that will be used for regression and predictions

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


#Loading our datasets

In [17]:
# We load the csv files for the fifa ranking dataset and the results dataset

fifa = pd.read_csv('fifa_ranking.csv')
results = pd.read_csv('results.csv')

Previewing the datasets



In [18]:
# previewing the fifa ranking dataset

fifa.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
1,2,Italy,ITA,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
2,3,Switzerland,SUI,0.0,50,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
3,4,Sweden,SWE,0.0,55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
4,5,Argentina,ARG,0.0,51,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08


In [19]:
# previewing columns
fifa.columns

Index(['rank', 'country_full', 'country_abrv', 'total_points',
       'previous_points', 'rank_change', 'cur_year_avg',
       'cur_year_avg_weighted', 'last_year_avg', 'last_year_avg_weighted',
       'two_year_ago_avg', 'two_year_ago_weighted', 'three_year_ago_avg',
       'three_year_ago_weighted', 'confederation', 'rank_date'],
      dtype='object')

In [20]:
# previewing the results dataset

results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [21]:
# previewing columns

results.columns

Index(['date', 'home_team', 'away_team', 'home_score', 'away_score',
       'tournament', 'city', 'country', 'neutral'],
      dtype='object')

# Data Exploration

Checking the number of records for each dataset. This will show us the size of our datasets.

In [23]:
# Checking the number of records of each dataset 

print(fifa.shape)
print(results.shape)

# the fifa ranking dataset has 57,793 rows and 16 columns
# the results dataset has 40,839 rows and 9 columns

(57793, 16)
(40839, 9)


In [32]:
# Checking for the information of each dataset

print(fifa.info())
print("**********************************************************")
print(results.info())

# The data types are all correct except for the rank_date column in the fifa ranking dataset
# and the date column in the results dataset
# these should have a datetime dtype
# we will convert them during data cleaning

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57793 entries, 0 to 57792
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   rank                     57793 non-null  int64  
 1   country_full             57793 non-null  object 
 2   country_abrv             57793 non-null  object 
 3   total_points             57793 non-null  float64
 4   previous_points          57793 non-null  int64  
 5   rank_change              57793 non-null  int64  
 6   cur_year_avg             57793 non-null  float64
 7   cur_year_avg_weighted    57793 non-null  float64
 8   last_year_avg            57793 non-null  float64
 9   last_year_avg_weighted   57793 non-null  float64
 10  two_year_ago_avg         57793 non-null  float64
 11  two_year_ago_weighted    57793 non-null  float64
 12  three_year_ago_avg       57793 non-null  float64
 13  three_year_ago_weighted  57793 non-null  float64
 14  confederation         
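The date conversion mentioned in the cell above can be done with `pd.to_datetime`. A minimal sketch on made-up rows that use the same column names and the `YYYY-MM-DD` string layout seen in the previews; the frame names `fifa_sample` and `results_sample` are placeholders.

```python
import pandas as pd

# Made-up rows with date strings stored as plain objects
fifa_sample = pd.DataFrame({"rank_date": ["1993-08-08", "1993-09-23"]})
results_sample = pd.DataFrame({"date": ["1872-11-30", "1873-03-08"]})

# Parse the strings into datetime values using an explicit format
fifa_sample["rank_date"] = pd.to_datetime(fifa_sample["rank_date"], format="%Y-%m-%d")
results_sample["date"] = pd.to_datetime(results_sample["date"], format="%Y-%m-%d")

print(fifa_sample.dtypes)
print(results_sample.dtypes)
```

Once the columns are datetimes, parts such as the year or month are available through the `.dt` accessor, for example `results_sample["date"].dt.year`.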

We describe each dataset to get the general summary statistics.

In [30]:
# Describing our datasets

print(fifa.describe())
print("**************************************************************************************")
print(results.describe())

               rank  total_points  previous_points   rank_change  \
count  57793.000000  57793.000000     57793.000000  57793.000000   
mean     101.628086    122.068637       332.302926     -0.009897   
std       58.618424    260.426863       302.872948      5.804309   
min        1.000000      0.000000         0.000000    -72.000000   
25%       51.000000      0.000000        56.000000     -2.000000   
50%      101.000000      0.000000       272.000000      0.000000   
75%      152.000000     92.790000       525.000000      1.000000   
max      209.000000   1775.030000      1920.000000     92.000000   

       cur_year_avg  cur_year_avg_weighted  last_year_avg  \
count  57793.000000           57793.000000   57793.000000   
mean      61.798602              61.798602      61.004602   
std      138.014883             138.014883     137.688204   
min        0.000000               0.000000       0.000000   
25%        0.000000               0.000000       0.000000   
50%        0.000000  

# Data cleaning

In [33]:
# checking for missing values

print(fifa.isnull().sum())
print("*************************")
print(results.isnull().sum())

# there are no missing values in either of the datasets

rank                       0
country_full               0
country_abrv               0
total_points               0
previous_points            0
rank_change                0
cur_year_avg               0
cur_year_avg_weighted      0
last_year_avg              0
last_year_avg_weighted     0
two_year_ago_avg           0
two_year_ago_weighted      0
three_year_ago_avg         0
three_year_ago_weighted    0
confederation              0
rank_date                  0
dtype: int64
*************************
date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64


In [34]:
#checking for duplicates

print(fifa.duplicated().sum())
print("*************************")
print(results.duplicated().sum())

# there are 37 duplicates in the fifa ranking dataset and 
# no duplicates in the results dataset

37
*************************
0
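For reference, `duplicated()` flags only the second and later copies of a fully identical row, so the count of 37 is the number of rows that removing exact duplicates would drop. A minimal sketch on made-up data, assuming that dropping exact duplicates with `drop_duplicates()` is the intended handling:

```python
import pandas as pd

# Made-up ranking rows; the third row is an exact copy of the second
sample = pd.DataFrame({
    "rank": [1, 2, 2, 3],
    "country_full": ["Team A", "Team B", "Team B", "Team C"],
    "rank_date": ["1993-08-08"] * 4,
})

# Count rows that repeat an earlier row in every column
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```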


Managing Duplicates