## EDA of International Football Matches

In this notebook we take a closer look at how international football matches (i.e., matches played between national teams) have evolved over the years.

#### Questions to be explored:
- How does a World Cup-winning team perform in the years that precede the tournament?

#### Columns of results.csv:
- **date**: Date of the match
- **home_team**: Name of the home team
- **away_team**: Name of the away team
- **home_score**: Home team goals
- **away_score**: Away team goals
- **tournament**: Tournament name
- **city**: City where the match took place
- **country**: Country where the match took place
- **neutral**: Whether the match took place at a neutral venue
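
As a rough sketch of how the question listed earlier could be approached with these columns, the snippet below computes a team's win rate over a fixed window before a cutoff date. The helper name `win_rate_before`, the toy data, and the two-year window are all assumptions made for illustration; they are not part of the dataset or of the analysis that follows.

```python
import pandas as pd

# Toy matches using the same column names as results.csv (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2001-06-01", "2002-03-01", "2003-01-01"]),
    "home_team": ["A", "B", "A", "A"],
    "away_team": ["B", "A", "C", "B"],
    "home_score": [1, 0, 2, 0],
    "away_score": [0, 0, 1, 1],
})

def win_rate_before(df, team, cutoff, years):
    """Share of matches won by `team` in the `years` before `cutoff` (hypothetical helper)."""
    cutoff = pd.Timestamp(cutoff)
    start = cutoff - pd.DateOffset(years=years)

    # Keep only matches involving the team inside the window [start, cutoff)
    played = (df["home_team"] == team) | (df["away_team"] == team)
    in_window = (df["date"] >= start) & (df["date"] < cutoff)
    games = df[played & in_window]
    if games.empty:
        return float("nan")

    # A win is a home match with more home goals or an away match with more away goals
    won_home = (games["home_team"] == team) & (games["home_score"] > games["away_score"])
    won_away = (games["away_team"] == team) & (games["away_score"] > games["home_score"])
    return (won_home | won_away).mean()

# Team "A" played two matches in the window: one draw and one win
rate = win_rate_before(toy, "A", "2002-06-01", years=2)
```

Draws count as non-wins here; a points-based measure (for example 3 for a win, 1 for a draw) would be an equally reasonable choice.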

In [1]:
# IMPORTS

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# Start the analysis by reading the results file from the "data" folder
df_results = pd.read_csv("data/results.csv")

# Convert the date column to datetime
df_results['date'] = pd.to_datetime(df_results['date'])

print(f"There are {len(df_results)} matches")
print(f"The first game was {df_results['date'].min().date()} and the last was {df_results['date'].max().date()}")

df_results.head(5)

There are 43752 matches
The first game was 1872-11-30 and the last was 2022-06-14


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [38]:
# Create some new columns that might be useful

def get_game_outcome(home_score, away_score):
    """
    The outcome will be encoded as:
        - D: Draw
        - H: Home team wins
        - A: Away team wins
    """
    if (home_score == away_score):
        return 'D'
    elif (home_score > away_score):
        return 'H'
    elif (home_score < away_score):
        return 'A'

def winning_team(home_team, away_team, outcome):
    """Return the name of the winning team, or '-' for a draw."""
    if outcome == 'H':
        return home_team
    elif outcome == 'A':
        return away_team
    else:
        return '-'

def losing_team(home_team, away_team, outcome):
    """Return the name of the losing team, or '-' for a draw."""
    if outcome == 'A':
        return home_team
    elif outcome == 'H':
        return away_team
    else:
        return '-'

# The outcome of the game
df_results['outcome'] = df_results.apply(lambda x: get_game_outcome(x.home_score, x.away_score), axis=1)

# Name of the winning team
df_results['winning_team'] = df_results.apply(lambda x: winning_team(x.home_team, x.away_team, x.outcome), axis=1)

# Name of the losing team
df_results['losing_team'] = df_results.apply(lambda x: losing_team(x.home_team, x.away_team, x.outcome), axis=1)

# Absolute score difference (column arithmetic, no row-wise apply needed)
df_results['score_difference'] = (df_results['home_score'] - df_results['away_score']).abs()
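
The row-wise `apply` calls above work, but they can be slow on tens of thousands of matches. A possible vectorized alternative, shown here on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration), builds the same columns from boolean masks:

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as results.csv (values are made up)
toy = pd.DataFrame({
    "home_team": ["A", "B", "C"],
    "away_team": ["B", "C", "A"],
    "home_score": [2, 1, 0],
    "away_score": [2, 0, 3],
})

# Boolean masks for the two decisive outcomes; everything else is a draw
home_win = toy["home_score"] > toy["away_score"]
away_win = toy["home_score"] < toy["away_score"]

# Same encoding as get_game_outcome: 'H', 'A', or 'D'
toy["outcome"] = np.select([home_win, away_win], ["H", "A"], default="D")

# Pick the team name matching each mask, falling back to '-' for draws
toy["winning_team"] = toy["home_team"].where(home_win, toy["away_team"].where(away_win, "-"))
toy["losing_team"] = toy["away_team"].where(home_win, toy["home_team"].where(away_win, "-"))

# Absolute goal difference
toy["score_difference"] = (toy["home_score"] - toy["away_score"]).abs()
```

One difference to keep in mind: if a score were missing, `get_game_outcome` would return `None`, whereas this sketch would label the match as a draw.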

In [56]:
df_geo_regions = pd.read_csv('data/geographic-regions.csv')
df_geo_regions.rename(columns={'Country or Area': 'country'}, inplace=True)

# Left merge keeps every match; matches without a region get NaN in 'Region Name'
df_merged = df_results.merge(df_geo_regions, how='left', on='country')

# Count matches per country name that found no region
df_merged.loc[df_merged['Region Name'].isna(), 'country'].value_counts()

England                721
Scotland               407
Wales                  344
Turkey                 337
Tanzania               311
                      ... 
Lautoka                  1
Portuguese Guinea        1
Bohemia and Moravia      1
Belgian Congo            1
Mali Federation          1
Name: country, Length: 72, dtype: int64
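
The output above lists 72 country names with no matching region. One way to narrow that list, sketched here on toy data, is to map the match-file names onto the names used in the regions file before merging, and to use the `indicator` option of `merge` to flag what is still unmatched. The mapping `name_fixes` and the region-file spelling "United Kingdom" are assumptions for illustration; the actual spellings in `geographic-regions.csv` would need to be checked.

```python
import pandas as pd

# Toy stand-ins for df_results and df_geo_regions (values are illustrative only)
matches = pd.DataFrame({"country": ["England", "Brazil", "England", "Scotland"]})
regions = pd.DataFrame({
    "country": ["Brazil", "United Kingdom"],
    "Region Name": ["Americas", "Europe"],
})

def unmatched_counts(left, right, key):
    """Count left-side rows per country whose key has no partner in `right`."""
    merged = left.merge(right, how="left", left_on=key, right_on="country",
                        suffixes=("", "_region"), indicator=True)
    return merged.loc[merged["_merge"] == "left_only", "country"].value_counts()

# Before any fix: England and Scotland have no region row
unmatched_before = unmatched_counts(matches, regions, key="country")

# Hypothetical mapping from match-file names to region-file names
name_fixes = {"England": "United Kingdom", "Scotland": "United Kingdom"}
matches["country_key"] = matches["country"].replace(name_fixes)

# After the fix: every match finds a region
unmatched_after = unmatched_counts(matches, regions, key="country_key")
```

Keeping the original `country` column and merging on a separate key preserves the venue name as recorded. Historical entities in the list (for example Belgian Congo) have no obvious modern equivalent and would need a case-by-case decision.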

In [39]:
df_results

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,outcome,winning_team,losing_team,score_difference
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,D,-,-,0
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,H,England,Scotland,2
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,H,Scotland,England,1
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,D,-,-,0
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,H,Scotland,England,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43747,2022-06-14,Moldova,Andorra,2,1,UEFA Nations League,Chișinău,Moldova,False,H,Moldova,Andorra,1
43748,2022-06-14,Liechtenstein,Latvia,0,2,UEFA Nations League,Vaduz,Liechtenstein,False,A,Latvia,Liechtenstein,2
43749,2022-06-14,Chile,Ghana,0,0,Kirin Cup,Suita,Japan,True,D,-,-,0
43750,2022-06-14,Japan,Tunisia,0,3,Kirin Cup,Suita,Japan,False,A,Tunisia,Japan,3


In [48]:
df_geo_regions.dtypes

Region Name                 object
Sub-region Name             object
Intermediate Region Name    object
country                     object
dtype: object

In [49]:
df_results.dtypes

date                datetime64[ns]
home_team                   object
away_team                   object
home_score                   int64
away_score                   int64
tournament                  object
city                        object
country                     object
neutral                       bool
outcome                     object
winning_team                object
losing_team                 object
score_difference             int64
dtype: object