<b><center style="font-size:250%; font-family:verdana;">Project "SIUUU!"</center></b>

<hr class="solid">

<b><center style="font-size:140%; font-family:verdana;">A Machine Learning Tutorial To Football Match Predictions — By Zijie Cai</center></b>

<center><img src="siu.gif" width="300" height="500" vspace="12"></center>

<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">1. Introduction</center></b>

<p style="font-family:verdana;">Welcome to Project "SIUUU!", an engaging exploration into applying machine learning to the world of football. Using multiple datasets from Kaggle that include historical match records and various football statistics, I'll navigate through the entire machine learning process.</p>

<p style="font-family:verdana;">This adventure involves data cleaning, preprocessing, feature engineering, model selection, training, testing, and tuning. The end goal? To create a robust model that can precisely predict the outcome of any football match happening today.</p>

<p style="font-family:verdana;">This project is designed for everyone. Whether you're a data science enthusiast, a football fan with a love for stats, or just someone interested in machine learning, there's something here for you. Project "SIUUU!" is all about demonstrating practical machine learning applications in sports analytics in a friendly, approachable way. So, join me on this exciting journey to discover the intersection of data science and football! SIUUU!</p>

<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">2. Data Collecting</center></b>

<p style="font-family:verdana;">For the scope of this project, I will utilize two main datasets from Kaggle, which will serve as the backbone of my analysis and model training:</p>

- <p style="font-family:verdana;"><b>International Football Results from 1872 to 2023</b>: This comprehensive dataset contains over 40,000 records of international football matches dating from 1872 to 2023. It is an invaluable historical resource that will provide the features necessary for our deep dive into predictive analysis. <a href="https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017">Access the dataset here</a>.</p>

- <p style="font-family:verdana;"><b>FIFA World Cup 2022</b>: This up-to-date dataset offers data on international soccer matches and team strengths from 1993 to 2022. It includes the latest FIFA rankings, points, and team scores for national teams, providing additional feature sets to enrich our model and ensure it is tuned with the latest team performance metrics. <a href="https://www.kaggle.com/datasets/brenda89/fifa-world-cup-2022">Access the dataset here</a>.</p>

<p style="font-family:verdana;">By integrating both datasets, we are equipped with extensive historical context and the latest performance indicators. These multi-faceted features form a robust foundation for our predictive analysis, allowing us to build a model that can accurately forecast football match outcomes.</p>

<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">3. Data Cleaning and Preprocessing</center></b>

<p style="font-family:verdana;">In this section, I'll prepare my data for analysis. This step is crucial in any machine learning project, as the quality and quantity of the data that I feed into my models will directly determine how well they can learn and make predictions. The process involves various stages, including importing necessary libraries, loading the datasets, dealing with missing values, and encoding categorical features.</p>

<b><p style="font-family:verdana;">Step 1: Import the Necessary Libraries</p></b>

<p style="font-family:verdana;">Each of these libraries serves a specific purpose:</p>

- `pandas`: A powerful tool for manipulating and analyzing data.
- `numpy`: Used for mathematical and logical operations on arrays.
- `matplotlib` and `seaborn`: Libraries for data visualization.
- `sklearn`: An extensive library that contains a variety of machine learning algorithms as well as utilities for preprocessing data, fine-tuning model parameters, and evaluating model performance.
- `math`: Provides mathematical functions.
- `IPython.display`: Allows us to display HTML code in the notebook.

<p style="font-family:verdana;"><b>Here is the code to import these libraries:</b></p>

In [340]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import math 
import seaborn as sns
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from IPython.display import display_html

<b><p style="font-family:verdana;">Step 2: Load the Dataset</p></b>

<p style="font-family:verdana;">Now, we move on to loading our dataset. We will read the data from the "shootouts.csv" file into a pandas DataFrame:</p>

In [341]:
# Import the dataset
df_shootouts = pd.read_csv('shootouts.csv')

<b><p style="font-family:verdana;">Step 3: Examine the data</p></b>

<p style="font-family:verdana;">With the data loaded, the next step is to understand its structure, features, and data types. We'll inspect our DataFrame by displaying the first few rows, and examining the data types and any null values:</p>

In [342]:
# Display the first few rows of the dataset
df_shootouts.head()

Unnamed: 0,date,home_team,away_team,winner
0,1967-08-22,India,Taiwan,Taiwan
1,1971-11-14,South Korea,Vietnam Republic,South Korea
2,1972-05-07,South Korea,Iraq,Iraq
3,1972-05-17,Thailand,South Korea,South Korea
4,1972-05-19,Thailand,Cambodia,Thailand


In [343]:
# Check the data types and any null values
df_shootouts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547 entries, 0 to 546
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   date       547 non-null    object
 1   home_team  547 non-null    object
 2   away_team  547 non-null    object
 3   winner     547 non-null    object
dtypes: object(4)
memory usage: 17.2+ KB


<b><p style="font-family:verdana;">Step 4: Data Cleaning</p></b>

<p style="font-family:verdana;">After examining our DataFrame, we can see that there are no missing/null values. However, we need to convert the 'date' column from an object to a DateTime type. This will make filtering by date much easier:</p>

In [344]:
# Convert the 'date' column to datetime format
df_shootouts['date'] = pd.to_datetime(df_shootouts['date'])

<p style="font-family:verdana;">Also, note that the data ranges from 1967 all the way up to 2022. To simulate the World Cup when testing the final model, we'll only consider matches played before the 2022 World Cup, which started on November 20, 2022:</p>

In [345]:
# Keep only the data before the 2022 World Cup
df_shootouts = df_shootouts[df_shootouts['date'] < '2022-11-20']
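
<p style="font-family:verdana;">As a minimal sketch of the two steps above, the toy DataFrame below (made-up rows, not taken from the dataset) shows how converting to datetime enables date comparisons, and how a strict <code>&lt;</code> comparison excludes the cutoff day while <code>&lt;=</code> keeps it. The names <code>toy</code> and <code>cutoff</code> are my own, chosen for illustration.</p>

```python
import pandas as pd

# Toy rows for illustration only (not real matches)
toy = pd.DataFrame({
    'date': ['2022-11-19', '2022-11-20', '2022-12-18'],
    'home_team': ['A', 'B', 'C'],
})

# Strings become datetime64 values that support date comparisons
toy['date'] = pd.to_datetime(toy['date'])

# Hypothetical name for the World Cup start date
cutoff = pd.Timestamp('2022-11-20')

strictly_before = toy[toy['date'] < cutoff]   # excludes the cutoff day
up_to_cutoff = toy[toy['date'] <= cutoff]     # includes the cutoff day

print(len(strictly_before), len(up_to_cutoff))
```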

<p style="font-family:verdana;">After these steps, our data should be clean and ready for preprocessing. Here is the resulting DataFrame:</p>

In [346]:
# Display the last few rows of the dataset
df_shootouts.tail()

Unnamed: 0,date,home_team,away_team,winner
537,2022-09-23,Iraq,Oman,Oman
538,2022-09-25,Malaysia,Tajikistan,Tajikistan
539,2022-11-16,Lithuania,Iceland,Iceland
540,2022-11-16,Latvia,Estonia,Latvia
541,2022-11-19,Latvia,Iceland,Iceland


In [347]:
# Import the dataset
df_international_matches = pd.read_csv('international_matches.csv')

df_international_matches.columns

Index(['date', 'home_team', 'away_team', 'home_team_continent',
       'away_team_continent', 'home_team_fifa_rank', 'away_team_fifa_rank',
       'home_team_total_fifa_points', 'away_team_total_fifa_points',
       'home_team_score', 'away_team_score', 'tournament', 'city', 'country',
       'neutral_location', 'shoot_out', 'home_team_result',
       'home_team_goalkeeper_score', 'away_team_goalkeeper_score',
       'home_team_mean_defense_score', 'home_team_mean_offense_score',
       'home_team_mean_midfield_score', 'away_team_mean_defense_score',
       'away_team_mean_offense_score', 'away_team_mean_midfield_score'],
      dtype='object')

In [348]:
df_international_matches['shoot_out']

0         No
1         No
2         No
3         No
4         No
        ... 
23916     No
23917     No
23918    Yes
23919     No
23920     No
Name: shoot_out, Length: 23921, dtype: object

In [349]:
df_international_matches.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,away_team_total_fifa_points,home_team_score,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
0,1993-08-08,Bolivia,Uruguay,South America,South America,59,22,0,0,3,...,No,Win,,,,,,,,
1,1993-08-08,Brazil,Mexico,South America,North America,8,14,0,0,1,...,No,Draw,,,,,,,,
2,1993-08-08,Ecuador,Venezuela,South America,South America,35,94,0,0,5,...,No,Win,,,,,,,,
3,1993-08-08,Guinea,Sierra Leone,Africa,Africa,65,86,0,0,1,...,No,Win,,,,,,,,
4,1993-08-08,Paraguay,Argentina,South America,South America,67,5,0,0,1,...,No,Lose,,,,,,,,


In [350]:
# Convert the 'date' column to datetime format
df_international_matches['date'] = pd.to_datetime(df_international_matches['date'])

# Keep only the data before the 2022 World Cup
df_international_matches = df_international_matches[df_international_matches['date'] < '2022-11-20']

df_international_matches.tail()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,away_team_total_fifa_points,home_team_score,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
23916,2022-06-14,Moldova,Andorra,Europe,Europe,180,153,932,1040,2,...,No,Win,65.0,,,,,,,
23917,2022-06-14,Liechtenstein,Latvia,Europe,Europe,192,135,895,1105,0,...,No,Lose,,65.0,,,,,,
23918,2022-06-14,Chile,Ghana,South America,Africa,28,60,1526,1387,0,...,Yes,Lose,79.0,74.0,75.5,76.7,78.2,75.5,76.0,78.2
23919,2022-06-14,Japan,Tunisia,Asia,Africa,23,35,1553,1499,0,...,No,Lose,73.0,,75.2,75.0,77.5,70.8,72.3,74.0
23920,2022-06-14,Korea Republic,Egypt,Asia,Africa,29,32,1519,1500,4,...,No,Win,75.0,,73.0,80.0,73.8,,79.3,70.8


In [351]:
# Keep only penalty shootout matches
df_international_matches = df_international_matches[df_international_matches['shoot_out'] == 'Yes']

df_international_matches.tail()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,away_team_total_fifa_points,home_team_score,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
23535,2022-03-25,Tajikistan,Uganda,Asia,Africa,115,84,1152,1279,1,...,Yes,Lose,,,,,,,63.3,
23578,2022-03-29,Senegal,Egypt,Africa,Africa,18,34,1587,1497,1,...,Yes,Win,83.0,,79.0,80.7,79.0,,79.3,70.8
23628,2022-03-29,Kazakhstan,Moldova,Europe,Europe,120,181,1140,926,0,...,Yes,Win,,65.0,,,,,,
23876,2022-06-13,Australia,Peru,Oceania,South America,42,22,1462,1562,0,...,Yes,Win,77.0,74.0,72.0,72.3,73.5,74.5,73.0,76.8
23918,2022-06-14,Chile,Ghana,South America,Africa,28,60,1526,1387,0,...,Yes,Lose,79.0,74.0,75.5,76.7,78.2,75.5,76.0,78.2


In [352]:
# Keep only shootouts within the date range covered by both datasets
df_shootouts = df_shootouts[df_shootouts['date'] >= '1993-08-15']
df_shootouts = df_shootouts[df_shootouts['date'] <= '2022-06-14']

df_shootouts.tail()

Unnamed: 0,date,home_team,away_team,winner
531,2022-03-25,Tajikistan,Uganda,Uganda
532,2022-03-29,Kazakhstan,Moldova,Kazakhstan
533,2022-03-29,Senegal,Egypt,Senegal
534,2022-06-13,Australia,Peru,Australia
535,2022-06-14,Chile,Ghana,Ghana


In [353]:
# Merge both datasets for more features
df_penalty = pd.merge(df_shootouts, df_international_matches, how='left',
                      on=['date', 'home_team'])

df_penalty

Unnamed: 0,date,home_team,away_team_x,winner,away_team_y,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
0,1993-08-15,Australia,Canada,Australia,Canada,Oceania,North America,52.0,46.0,0.0,...,Yes,Win,,,,,,,,
1,1993-10-24,Burundi,Guinea,Guinea,Guinea,Africa,Africa,98.0,63.0,0.0,...,Yes,Lose,,,,,,,,
2,1994-01-29,Burkina Faso,Guinea,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,0.0,...,Yes,Win,,,,,,,,
3,1994-04-06,Nigeria,Ivory Coast,Nigeria,Côte d'Ivoire,Africa,Africa,18.0,28.0,0.0,...,Yes,Win,,,,,,,,
4,1994-07-05,Mexico,Bulgaria,Bulgaria,Bulgaria,North America,Europe,16.0,29.0,0.0,...,Yes,Lose,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,2022-03-25,Tajikistan,Uganda,Uganda,Uganda,Asia,Africa,115.0,84.0,1152.0,...,Yes,Lose,,,,,,,63.3,
376,2022-03-29,Kazakhstan,Moldova,Kazakhstan,Moldova,Europe,Europe,120.0,181.0,1140.0,...,Yes,Win,,65.0,,,,,,
377,2022-03-29,Senegal,Egypt,Senegal,Egypt,Africa,Africa,18.0,34.0,1587.0,...,Yes,Win,83.0,,79.0,80.7,79.0,,79.3,70.8
378,2022-06-13,Australia,Peru,Australia,Peru,Oceania,South America,42.0,22.0,1462.0,...,Yes,Win,77.0,74.0,72.0,72.3,73.5,74.5,73.0,76.8


In [354]:
# Remove unmatched data
df_penalty = df_penalty[df_penalty['away_team_y'].notna()]
df_penalty

Unnamed: 0,date,home_team,away_team_x,winner,away_team_y,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
0,1993-08-15,Australia,Canada,Australia,Canada,Oceania,North America,52.0,46.0,0.0,...,Yes,Win,,,,,,,,
1,1993-10-24,Burundi,Guinea,Guinea,Guinea,Africa,Africa,98.0,63.0,0.0,...,Yes,Lose,,,,,,,,
2,1994-01-29,Burkina Faso,Guinea,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,0.0,...,Yes,Win,,,,,,,,
3,1994-04-06,Nigeria,Ivory Coast,Nigeria,Côte d'Ivoire,Africa,Africa,18.0,28.0,0.0,...,Yes,Win,,,,,,,,
4,1994-07-05,Mexico,Bulgaria,Bulgaria,Bulgaria,North America,Europe,16.0,29.0,0.0,...,Yes,Lose,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,2022-03-25,Tajikistan,Uganda,Uganda,Uganda,Asia,Africa,115.0,84.0,1152.0,...,Yes,Lose,,,,,,,63.3,
376,2022-03-29,Kazakhstan,Moldova,Kazakhstan,Moldova,Europe,Europe,120.0,181.0,1140.0,...,Yes,Win,,65.0,,,,,,
377,2022-03-29,Senegal,Egypt,Senegal,Egypt,Africa,Africa,18.0,34.0,1587.0,...,Yes,Win,83.0,,79.0,80.7,79.0,,79.3,70.8
378,2022-06-13,Australia,Peru,Australia,Peru,Oceania,South America,42.0,22.0,1462.0,...,Yes,Win,77.0,74.0,72.0,72.3,73.5,74.5,73.0,76.8
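
<p style="font-family:verdana;">To make the merge and filter steps above easier to follow, here is a small sketch on toy frames (placeholder team names, not real data). It assumes the same join keys, <code>date</code> and <code>home_team</code>, and shows where the <code>_x</code> and <code>_y</code> suffixes come from and why unmatched rows can be found by looking for missing values in a right-hand column.</p>

```python
import pandas as pd

# Toy frames for illustration only; team names are placeholders
shootouts = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'home_team': ['A', 'B', 'C'],
    'away_team': ['X', 'Y', 'Z'],
    'winner': ['A', 'Y', 'C'],
})
matches = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'home_team': ['A', 'B'],
    'away_team': ['X', 'Why'],  # spelled differently on purpose
    'home_team_fifa_rank': [10, 20],
})

# Left join: every shootout row is kept, matched or not
merged = pd.merge(shootouts, matches, how='left', on=['date', 'home_team'])

# Overlapping non-key columns get the default suffixes _x (left) and _y (right)
print(merged.columns.tolist())

# Rows without a partner in `matches` have NaN in every right-hand column
matched = merged[merged['away_team_y'].notna()]
print(len(merged), len(matched))
```

<p style="font-family:verdana;">Because <code>away_team</code> is not part of the join key, a spelling difference between the two sources does not prevent a match; both spellings simply appear side by side.</p>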


In [355]:
# Drop columns that are redundant after the merge or not used as features
df_penalty = df_penalty.drop(columns=['winner', 'shoot_out', 'away_team_y', 'home_team_total_fifa_points', 
                                      'away_team_total_fifa_points', 'home_team_score', 'away_team_score', 
                                      'home_team_goalkeeper_score', 'away_team_goalkeeper_score', 
                                      'home_team_mean_defense_score', 'home_team_mean_offense_score',
                                      'home_team_mean_midfield_score', 'away_team_mean_defense_score',
                                      'away_team_mean_offense_score', 'away_team_mean_midfield_score', 
                                      'city', 'country'])

In [356]:
df_penalty = df_penalty.rename(columns={'away_team_x': 'away_team'})
df_penalty

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,tournament,neutral_location,home_team_result
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,FIFA World Cup qualification,False,Win
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,African Cup of Nations qualification,True,Lose
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,Friendly,False,Win
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,African Cup of Nations,True,Win
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,FIFA World Cup,True,Lose
...,...,...,...,...,...,...,...,...,...,...
375,2022-03-25,Tajikistan,Uganda,Asia,Africa,115.0,84.0,Navruz Cup,True,Lose
376,2022-03-29,Kazakhstan,Moldova,Europe,Europe,120.0,181.0,UEFA Nations League,False,Win
377,2022-03-29,Senegal,Egypt,Africa,Africa,18.0,34.0,FIFA World Cup qualification,False,Win
378,2022-06-13,Australia,Peru,Oceania,South America,42.0,22.0,FIFA World Cup qualification,True,Win


In [357]:
df_penalty.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 310 entries, 0 to 379
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 310 non-null    datetime64[ns]
 1   home_team            310 non-null    object        
 2   away_team            310 non-null    object        
 3   home_team_continent  310 non-null    object        
 4   away_team_continent  310 non-null    object        
 5   home_team_fifa_rank  310 non-null    float64       
 6   away_team_fifa_rank  310 non-null    float64       
 7   tournament           310 non-null    object        
 8   neutral_location     310 non-null    object        
 9   home_team_result     310 non-null    object        
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 26.6+ KB


<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">4. Feature Engineering</center></b>

<p style="font-family:verdana;">In this stage, I will create new features using the existing data, aiming to highlight hidden relationships and patterns that can significantly improve the performance of our model.</p>

<ul style="font-family:verdana;">
    <li><b>Feature 1: Difference in FIFA Rankings:</b><br>
    I'll calculate the difference in FIFA rankings between the home team and the away team. This feature can act as an indicator of the relative strength between the two teams.
    </li>
</ul>

In [358]:
# Create 'rank_difference' feature
df_penalty['rank_difference'] = df_penalty['home_team_fifa_rank'] - df_penalty['away_team_fifa_rank']

df_penalty.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,tournament,neutral_location,home_team_result,rank_difference
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,FIFA World Cup qualification,False,Win,6.0
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,African Cup of Nations qualification,True,Lose,35.0
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,Friendly,False,Win,64.0
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,African Cup of Nations,True,Win,-10.0
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,FIFA World Cup,True,Lose,-13.0
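
<p style="font-family:verdana;">A quick sketch of how to read the sign of this feature, using two rank pairs from the table above. Since a smaller FIFA rank number means a stronger team, a negative difference means the home team is the better-ranked side. The column <code>home_ranked_higher</code> is a hypothetical helper added only for this illustration.</p>

```python
import pandas as pd

# Rank pairs copied from two rows of the table above
toy = pd.DataFrame({
    'home_team_fifa_rank': [18.0, 127.0],
    'away_team_fifa_rank': [28.0, 63.0],
})

toy['rank_difference'] = toy['home_team_fifa_rank'] - toy['away_team_fifa_rank']

# Hypothetical helper: True when the home team has the better (smaller) rank
toy['home_ranked_higher'] = toy['rank_difference'] < 0

print(toy)
```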


<ul style="font-family:verdana;">
    <li><b>Feature 2: Neutral Location Weighting and Encoding</b><br>
    I'll create a feature that assigns a weight of 0.5 to neutral locations and 1 otherwise. This represents the home team's advantage when they are not in a neutral location. I'll also convert the 'neutral_location' feature into binary format, with 'True' mapped to 1 and 'False' mapped to 0.
    </li>
</ul>

In [359]:
# Create 'neutral_weight' feature and convert 'neutral_location' into binary format
df_penalty['neutral_weight'] = df_penalty['neutral_location'].apply(lambda x: 0.5 if x else 1)
df_penalty['neutral_location'] = df_penalty['neutral_location'].map({True: 1, False: 0})


df_penalty.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,tournament,neutral_location,home_team_result,rank_difference,neutral_weight
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,FIFA World Cup qualification,0,Win,6.0,1.0
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,African Cup of Nations qualification,1,Lose,35.0,0.5
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,Friendly,0,Win,64.0,1.0
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,African Cup of Nations,1,Win,-10.0,0.5
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,FIFA World Cup,1,Lose,-13.0,0.5
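
<p style="font-family:verdana;">The same two columns can also be built without a row-by-row <code>apply</code>. The sketch below is one possible vectorized alternative, assuming the column holds real boolean values; it is not the code used in this notebook.</p>

```python
import numpy as np
import pandas as pd

# Toy column for illustration; assumes genuine booleans
toy = pd.DataFrame({'neutral_location': [False, True, True]})

# 0.5 where the venue is neutral, 1.0 otherwise
toy['neutral_weight'] = np.where(toy['neutral_location'], 0.5, 1.0)

# True/False become 1/0
toy['neutral_location'] = toy['neutral_location'].astype(int)

print(toy)
```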


<ul style="font-family:verdana;">
    <li><b>Feature 3: Tournament Weighting and One-Hot Encoding</b><br>
    I'll create a feature to weight the importance of the tournament type, assigning a lower weight to 'Friendly' matches and a higher weight to other matches. Additionally, I'll convert the 'tournament' feature using One-Hot Encoding, creating binary features for each unique tournament category.
    </li>
</ul>

In [360]:
# Create 'tournament_weight' feature and one-hot encode 'tournament' feature
df_penalty['tournament_weight'] = df_penalty['tournament'].apply(lambda x: 0.5 if x=='Friendly' else 1)
df_penalty = pd.get_dummies(df_penalty, columns=['tournament'], prefix='tournament')

df_penalty.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,neutral_location,home_team_result,rank_difference,...,tournament_SAFF Cup,tournament_South Pacific Games,tournament_Superclásico de las Américas,tournament_TIFOCO Tournament,tournament_UEFA Euro,tournament_UEFA Euro qualification,tournament_UEFA Nations League,tournament_UNCAF Cup,tournament_United Arab Emirates Friendship Tournament,tournament_WAFF Championship
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,0,Win,6.0,...,0,0,0,0,0,0,0,0,0,0
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,1,Lose,35.0,...,0,0,0,0,0,0,0,0,0,0
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,0,Win,64.0,...,0,0,0,0,0,0,0,0,0,0
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,1,Win,-10.0,...,0,0,0,0,0,0,0,0,0,0
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,1,Lose,-13.0,...,0,0,0,0,0,0,0,0,0,0
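
<p style="font-family:verdana;">For readers new to One-Hot Encoding, this toy sketch shows what <code>pd.get_dummies</code> produces for three tournament names. Passing <code>dtype=int</code> is my own addition: recent pandas versions return boolean columns by default, and the explicit dtype keeps the 0/1 look shown in the table above.</p>

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'tournament': ['Friendly', 'FIFA World Cup', 'UEFA Euro', 'Friendly'],
})

# Lower weight for friendlies, as described above
toy['tournament_weight'] = toy['tournament'].map(
    lambda t: 0.5 if t == 'Friendly' else 1.0
)

# One binary column per unique tournament; the original column is removed
encoded = pd.get_dummies(toy, columns=['tournament'], prefix='tournament', dtype=int)

print(encoded)
```

<p style="font-family:verdana;">Each row has exactly one 1 among the <code>tournament_*</code> columns, which is why the wide table above is mostly zeros.</p>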


<ul style="font-family:verdana;">
    <li><b>Feature 4: Result Encoding</b><br>
    I'll convert the 'home_team_result' feature into numerical format, with 'Win' mapped to 1 and 'Lose' mapped to 0. This numerical representation will be more suitable for our model.
    </li>
</ul>

In [361]:
# Convert 'home_team_result' into numerical format
df_penalty['home_team_result'] = df_penalty['home_team_result'].map({'Win': 1, 'Lose': 0})

df_penalty.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,neutral_location,home_team_result,rank_difference,...,tournament_SAFF Cup,tournament_South Pacific Games,tournament_Superclásico de las Américas,tournament_TIFOCO Tournament,tournament_UEFA Euro,tournament_UEFA Euro qualification,tournament_UEFA Nations League,tournament_UNCAF Cup,tournament_United Arab Emirates Friendship Tournament,tournament_WAFF Championship
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,0,1,6.0,...,0,0,0,0,0,0,0,0,0,0
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,1,0,35.0,...,0,0,0,0,0,0,0,0,0,0
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,0,1,64.0,...,0,0,0,0,0,0,0,0,0,0
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,1,1,-10.0,...,0,0,0,0,0,0,0,0,0,0
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,1,0,-13.0,...,0,0,0,0,0,0,0,0,0,0
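
<p style="font-family:verdana;">One thing worth knowing about <code>map</code> with a dictionary: any label that is not a key in the dictionary becomes <code>NaN</code>. The sketch below uses a toy Series with a deliberately added 'Draw' label to show how a simple check could catch that.</p>

```python
import pandas as pd

# Toy labels; 'Draw' is added on purpose to show the unmapped case
results = pd.Series(['Win', 'Lose', 'Draw', 'Win'])

encoded = results.map({'Win': 1, 'Lose': 0})

# Labels missing from the dictionary show up as NaN
print(encoded.tolist())
print(int(encoded.isna().sum()), 'unmapped label(s)')
```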


<ul style="font-family:verdana;">
    <li><b>Feature 5: Average Rank</b><br>
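    For the fifth feature, I'll calculate each team's average FIFA rank across its matches in this dataset: the mean home rank over the matches it played as the home team, and the mean away rank over the matches it played as the away team. This gives a better sense of a team's overall standing than the rank at the time of a single match.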
    </li>
</ul>

In [362]:
# Create 'home_team_avg_rank' and 'away_team_avg_rank' features
home_team_avg_rank = df_penalty.groupby('home_team')['home_team_fifa_rank'].transform('mean')
away_team_avg_rank = df_penalty.groupby('away_team')['away_team_fifa_rank'].transform('mean')

df_penalty['home_team_avg_rank'] = home_team_avg_rank
df_penalty['away_team_avg_rank'] = away_team_avg_rank

df_penalty.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,neutral_location,home_team_result,rank_difference,...,tournament_Superclásico de las Américas,tournament_TIFOCO Tournament,tournament_UEFA Euro,tournament_UEFA Euro qualification,tournament_UEFA Nations League,tournament_UNCAF Cup,tournament_United Arab Emirates Friendship Tournament,tournament_WAFF Championship,home_team_avg_rank,away_team_avg_rank
0,1993-08-15,Australia,Canada,Oceania,North America,52.0,46.0,0,1,6.0,...,0,0,0,0,0,0,0,0,47.25,46.0
1,1993-10-24,Burundi,Guinea,Africa,Africa,98.0,63.0,1,0,35.0,...,0,0,0,0,0,0,0,0,98.0,64.333333
2,1994-01-29,Burkina Faso,Guinea,Africa,Africa,127.0,63.0,0,1,64.0,...,0,0,0,0,0,0,0,0,81.4,64.333333
3,1994-04-06,Nigeria,Ivory Coast,Africa,Africa,18.0,28.0,1,1,-10.0,...,0,0,0,0,0,0,0,0,34.0,31.6
4,1994-07-05,Mexico,Bulgaria,North America,Europe,16.0,29.0,1,0,-13.0,...,0,0,0,0,0,0,0,0,12.714286,29.0
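
<p style="font-family:verdana;">The key detail in the cell above is <code>transform('mean')</code>, which returns one value per original row rather than one per group. A minimal sketch on toy data (placeholder teams and ranks):</p>

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'home_team': ['A', 'B', 'A'],
    'home_team_fifa_rank': [10.0, 50.0, 20.0],
})

# The group mean is broadcast back to every row of that group
toy['home_team_avg_rank'] = (
    toy.groupby('home_team')['home_team_fifa_rank'].transform('mean')
)

print(toy)
```

<p style="font-family:verdana;">Both rows for team A receive the same average, so the result can be assigned directly as a new column.</p>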



<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">5. Preparing the Data for Splitting</center></b>

<p style="font-family:verdana;">Before splitting the data into training and test sets, let's make sure it contains only numeric features, as our model cannot process non-numeric data. The team names have already been used to derive the average-rank features, and the remaining 'date', 'home_team_continent', and 'away_team_continent' columns are not used as features here, so we can drop all five columns.</p>

In [363]:
# Drop non-numeric columns
df_penalty = df_penalty.drop(['date', 'home_team', 'away_team', 
                              'home_team_continent', 'away_team_continent'], 
                             axis=1)

df_penalty.head()

Unnamed: 0,home_team_fifa_rank,away_team_fifa_rank,neutral_location,home_team_result,rank_difference,neutral_weight,tournament_weight,tournament_ABCS Tournament,tournament_AFC Asian Cup,tournament_AFF Championship,...,tournament_Superclásico de las Américas,tournament_TIFOCO Tournament,tournament_UEFA Euro,tournament_UEFA Euro qualification,tournament_UEFA Nations League,tournament_UNCAF Cup,tournament_United Arab Emirates Friendship Tournament,tournament_WAFF Championship,home_team_avg_rank,away_team_avg_rank
0,52.0,46.0,0,1,6.0,1.0,1.0,0,0,0,...,0,0,0,0,0,0,0,0,47.25,46.0
1,98.0,63.0,1,0,35.0,0.5,1.0,0,0,0,...,0,0,0,0,0,0,0,0,98.0,64.333333
2,127.0,63.0,0,1,64.0,1.0,0.5,0,0,0,...,0,0,0,0,0,0,0,0,81.4,64.333333
3,18.0,28.0,1,1,-10.0,0.5,1.0,0,0,0,...,0,0,0,0,0,0,0,0,34.0,31.6
4,16.0,29.0,1,0,-13.0,0.5,1.0,0,0,0,...,0,0,0,0,0,0,0,0,12.714286,29.0
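
<p style="font-family:verdana;">If you would rather not list the columns by hand, one possible check is to ask pandas which columns are non-numeric. This is a sketch on a toy frame; <code>non_numeric</code> and <code>numeric_only</code> are hypothetical names.</p>

```python
import pandas as pd

# Toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    'home_team': ['A', 'B'],
    'rank_difference': [6.0, 35.0],
    'neutral_location': [0, 1],
})

# Columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)

# Drop them and confirm nothing non-numeric is left
numeric_only = toy.drop(columns=non_numeric)
print(numeric_only.select_dtypes(exclude='number').empty)
```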


<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">6. Splitting the Data</center></b>

<p style="font-family:verdana;">With the features prepared, we are ready to split the dataset into a training set and a test set. This will allow us to evaluate the performance of our model on unseen data:</p>

In [364]:
# Split data into features (X) and target (y)
X = df_penalty.drop('home_team_result', axis=1)
y = df_penalty['home_team_result']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<p style="font-family:verdana;">The training set will be used to train our model, while the test set will serve as new, unseen data for evaluating the model's performance. This is a crucial step in machine learning modeling to ensure that our model generalizes well to new data and does not simply memorize the training data.</p>
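
<p style="font-family:verdana;">To see what <code>test_size</code> and <code>random_state</code> do in practice, here is a small sketch on synthetic arrays (100 made-up rows, not the match data). The variable names are chosen for illustration only.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, alternating labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```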

<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">7. Model Selection and Training</center></b>

<p style="font-family:verdana;">Now that our data is fully prepared, I will select a model for training. Given the nature of our data and the task, I've chosen to use the Random Forest Classifier. This powerful ensemble model combines multiple decision trees to make predictions and is known for its robustness and ability to handle complex datasets with many features.</p>

In [365]:
# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

<p style="font-family:verdana;">In the code above, we first create an instance of the RandomForestClassifier and then fit the model to our training data. The 'fit' function trains the model to find patterns in our data that can be used to make predictions.</p>
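
<p style="font-family:verdana;">The following sketch repeats the same create-then-fit pattern on synthetic data where the answer is known in advance: the label depends only on the first feature. It is an illustration under that assumption, not a result from the match data, and <code>n_estimators=50</code> is an arbitrary choice to keep it fast.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 rows, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))

# The label depends only on the first feature
y_toy = (X_toy[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_toy, y_toy)

# Class predictions and class probabilities for the first five rows
print(model.predict(X_toy[:5]))
print(model.predict_proba(X_toy[:5]).shape)

# Importances sum to 1; the first feature should dominate here
print(model.feature_importances_)
```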

<hr class="solid">

<b><center style="font-size:200%; font-family:verdana;">8. Model Evaluation</center></b>

<p style="font-family:verdana;">Now that we've trained our model, it's time to see how well it performs. We'll use our test set to evaluate the model. We'll start by making predictions using the test set, and then compare these predictions to the actual values to get an idea of how accurate our model is.</p>

In [366]:
# Make predictions using the test set
y_pred = rf_model.predict(X_test)

# Import the metrics module from sklearn for evaluation
from sklearn import metrics

# Compute and print the confusion matrix and classification report
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

[[ 6 20]
 [12 24]]
              precision    recall  f1-score   support

           0       0.33      0.23      0.27        26
           1       0.55      0.67      0.60        36

    accuracy                           0.48        62
   macro avg       0.44      0.45      0.44        62
weighted avg       0.46      0.48      0.46        62
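
<p style="font-family:verdana;">To connect the two outputs above, the sketch below recomputes a few of the reported numbers directly from the confusion matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels. The variable names are my own; the counts are copied from the printed matrix.</p>

```python
# Counts copied from the confusion matrix printed above
tn, fp = 6, 20    # true class 0: predicted 0, predicted 1
fn, tp = 12, 24   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the matches predicted as wins, how many were wins
recall_1 = tp / (tp + fn)      # of the actual wins, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Share of the larger class, i.e. always predicting a home win
majority_baseline = max(tn + fp, fn + tp) / total

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(majority_baseline, 2))
```

<p style="font-family:verdana;">The recomputed values match the report (0.48, 0.55, 0.67, and 0.60). For comparison, always predicting a home win would score about 0.58 on this test set, so the current model has not yet beaten that simple baseline.</p>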

