Introduction

Hi guys, this notebook contains a general look at the international football games dataset from 1872 until 2019.

My goal was to offer some interesting data visualization and to take a unique look at the data. Hopefully, this will provide some interesting insights and inspire other people to look deeper into some topics.

Football is the most popular sport in the world and a source of entertainment for millions. With teams from almost every nation in the world, it's a highly competitive sport and a source of national pride. In the next segments, we will take a general look at how the game, tournaments and certain nations evolved.


Data preparation

First, we are going to import the basic libraries and check the dataset for any missing and inconsistent data. Afterwards, we do a basic check for outliers just to be safe.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
%matplotlib inline

In [None]:
df= pd.read_csv("../input/international-football-results-from-1872-to-2017/results.csv")

In [None]:
df.isnull().sum()

There are no missing values in any column.

In [None]:
df.info() #We will use .info to check the data type of the columns.

Two columns have a numeric data type and one has a boolean type, while the rest are stored as object.

In [None]:
df.describe() #Describe will help us do a broad outlier check on the data.

The Max values look realistic enough, but just to be sure, we will check them out.

In [None]:
df.head()

We now have a general idea of the data and we can proceed with making a to-do list.

1.	We will take a look at the rows where the number of goals is 20 or more (for both home and away games).
2.	We need to create a column that specifies who won/lost in the particular game. We will also make it display draws.
3.	The date column needs to be converted to datetime, so we can extract the year for future use.
4.	To keep the years consistent, I will remove 2019 as the year is still incomplete.

In [None]:
above_20 = df[df["home_score"] >= 20]
above_20_a = df[df["away_score"] >= 20]

In [None]:
above_20.head(15)

In [None]:
above_20_a.head(15)

It appears that games with an overwhelming difference in goals are rare, but they exist. We can also see that some of them happened during FIFA World Cup qualifications (I am looking at you, Australia).

In [None]:
def winner(row): #Function for wins
    if row["home_score"] > row["away_score"]:
        return row["home_team"]
    if row["home_score"] == row["away_score"]:
        return "Draw"
    if row["home_score"] < row["away_score"]:
        return row["away_team"]
def loser(row): #Function for losses
    if row["home_score"] > row["away_score"]:
        return row["away_team"]
    if row["home_score"] == row["away_score"]:
        return "Draw"
    if row["home_score"] < row["away_score"]:
        return row["home_team"]

In [None]:
df["Winner"] = df.apply(winner, axis=1)
df["Loser"] = df.apply(loser, axis=1)
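
As a side note, the same two columns can be built without a row-wise apply. The following is only a minimal sketch on a small made-up table (the rows in `toy` are hypothetical and only mirror the column names of the dataset); it uses `Series.where` to pick the home team, the away team or "Draw".

```python
import pandas as pd

# Hypothetical stand-in for the results table, same column names as the dataset.
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", "Wales"],
    "away_team": ["England", "Scotland", "Scotland"],
    "home_score": [0, 4, 1],
    "away_score": [0, 2, 3],
})

home_win = toy["home_score"] > toy["away_score"]
away_win = toy["home_score"] < toy["away_score"]

# Keep the home team where it won; otherwise fall back to the away team
# where it won; everything that is left is a draw.
toy["Winner"] = toy["home_team"].where(home_win, toy["away_team"].where(away_win, "Draw"))
toy["Loser"] = toy["away_team"].where(home_win, toy["home_team"].where(away_win, "Draw"))

print(toy[["Winner", "Loser"]])
```

On a table of this size the difference does not matter, but column-wise operations tend to be faster than `apply` with `axis=1` on larger frames.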

In [None]:
df["date"] = pd.to_datetime(df["date"]) # Converting the column to datetime

In [None]:
df["Year"] = df["date"].dt.year #Extracting the year from the date column
df = df[df["Year"] < 2019] #Removing 2019 from the mix

The rise of football

While there is evidence that a kind of football was played in earlier civilizations, the current form arose in England in the middle of the 19th century. For part of its history, the game was interchangeable with rugby and had no clear rules. The first event that shaped the game as we know it today came in 1863, when it was decided that carrying the ball with the hands was not allowed. However, the game has undergone many changes over the years, and we are witnessing it evolve even today. It quickly became popular across the British Empire and soon all over the world.
 
Using the data we have available to us, we will look at the historic rise in popularity of the game, mark some historic events (both in the football world and outside it) that had a direct impact on the game, and generally look at its evolution.

The scores dataset will allow us to resample the data more easily and get the insights we want. It consists of the numeric columns of the dataset and the date column that we will use as an index.

In [None]:
scores = df.groupby("date")[["home_score","away_score"]].agg(["sum","count"]) #Using groupby to get the data we need (a list of columns needs double brackets)

In [None]:
scores.info()

In [None]:
scores.columns = ["Home_Number_goals", "Home_Games","Away_Number_goals", "Away_Games"] #Renaming the columns

We will only use one of the columns to get the number of games per year. The reason for this is that one game has one home and one away team. Summing them would result in an inflated number of games.
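
To make the double-counting point concrete, here is a small sketch on made-up rows (the values in `toy` are hypothetical). It groups by the calendar year of the date index instead of resampling, which is an assumption made only to keep the example short.

```python
import pandas as pd

# Hypothetical games: three in 1900 (two on the same day) and one in 1901.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1900-03-01", "1900-03-01", "1900-06-10", "1901-04-02"]),
    "home_score": [2, 1, 0, 3],
    "away_score": [1, 1, 0, 0],
})

per_day = toy.groupby("date")[["home_score", "away_score"]].agg(["sum", "count"])
per_day.columns = ["Home_Number_goals", "Home_Games", "Away_Number_goals", "Away_Games"]

# Counting one side gives the real number of games per year.
games_per_year = per_day["Home_Games"].groupby(per_day.index.year).sum()

# Adding both sides counts every game twice.
double_counted = (per_day["Home_Games"] + per_day["Away_Games"]).groupby(per_day.index.year).sum()

print(games_per_year.tolist(), double_counted.tolist())
```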

In [None]:
scores_year = pd.DataFrame(scores["Home_Games"].resample("Y").sum())

In [None]:
scores_goals_h = pd.DataFrame(scores["Home_Number_goals"].resample("Y").sum())
scores_goals_a = pd.DataFrame(scores["Away_Number_goals"].resample("Y").sum())

In [None]:
scores_goals_a.head()

The plot below represents the total number of games played each year. To add additional insight, I have added some historic events that affected the number of games played (WW1 & WW2), the founding dates of some major football confederations and tournaments, and the dates of major FIFA World Cup changes.

In [None]:
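# Set the style before creating the figure so it takes effect
# (Matplotlib 3.6 and later name this style 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-darkgrid')
fig, ax = plt.subplots(figsize=(17, 10))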
ax.plot(scores_year["Home_Games"], label="Games per year", color="black")
ax.tick_params(labelsize=12)
plt.legend(loc=0, fontsize="large")
fig.suptitle("Games per Year", fontsize=20)


ax.annotate("Start of WW2", xy=('1939', 105),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-30, -60), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("End of WW2", xy=('1945', 35),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(35, -30), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("Start of WW1", xy=('1914', 35),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-50, 20), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("End of WW1", xy=('1918', 30),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-35, -30), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("First FIFA World Cup", xy=('1930', 85),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(35, 60), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("First AFC Asian Cup", xy=('1956', 150),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-35, 60), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("Founding of AFCON", xy=('1957', 200),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(50, -50), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("Founding of CONMEBOL", xy=('1916', 30),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(10, 50), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("FIFA-Expansion to 24 teams", xy=('1982', 500),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-50, 80), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("FIFA-Expansion to 32 teams", xy=('1998', 850),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(-90, 80), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))
ax.annotate("FIFA decision to expand to 48 teams", xy=('2017', 1050),  xycoords='data',size=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(0, -180), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))

The two major observations we can make are that WW2 had a substantial effect on the number of games played during that period and that there is a spike in games every few years. While the first one is obvious, the second is because most major tournaments are played every few years. This becomes more apparent later in the dataset because of two factors:
1. Most of the football confederations/tournaments decided to move their games to a year before/after the FIFA World Cup.
2. The number of teams added during the FIFA World Cup expansions.


But what about the goals?

There is a belief about the “new” football that there is less excitement, safer play and fewer goals. Using the dataset we can put this belief to the test.

In [None]:
# Yearly goal totals with a 5-year simple moving average
fig, ax = plt.subplots(figsize=(14, 8))
ax.plot(scores_goals_h, label="Home goals per year")
ax.plot(scores_goals_a, label="Away goals per year")
ax.plot(scores_goals_h.rolling(window=5).mean(), color = "red", label="5-year rolling mean of home goals")
ax.plot(scores_goals_a.rolling(window=5).mean(), color = "red", label="5-year rolling mean of away goals")
fig.suptitle("Goals per year", fontsize=20)
plt.legend(loc=0, fontsize="large")

As we can see, more goals are scored per year than ever, although these are yearly totals, so the rise also reflects the growing number of games. Additionally, I have decided to separate home and away goals per year to showcase that statistically there is an advantage when playing at the home stadium.
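
Because yearly totals grow with the number of games, a per-game figure is a useful cross-check. The sketch below uses made-up numbers (the rows in `toy` are hypothetical and say nothing about the real dataset); it only shows how totals and goals per game can move in opposite directions.

```python
import pandas as pd

# Hypothetical data: two games in 1900, four games in 1950.
toy = pd.DataFrame({
    "Year": [1900, 1900, 1950, 1950, 1950, 1950],
    "home_score": [3, 1, 2, 1, 0, 1],
    "away_score": [1, 1, 0, 1, 2, 0],
})

by_year = toy.groupby("Year")[["home_score", "away_score"]]

# Total goals per year (what a plot of yearly sums shows).
total_goals = by_year.sum().sum(axis=1)

# Average goals per game: mean home goals plus mean away goals.
goals_per_game = by_year.mean().sum(axis=1)

print(total_goals.tolist(), goals_per_game.tolist())
```

In this toy example the total rises while goals per game fall, so applying the same idea to `df` grouped by `"Year"` would be needed before drawing a conclusion about the real data.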

In [None]:
h_goals_year = pd.DataFrame(scores["Home_Number_goals"].resample("10A").sum())
a_goals_year = pd.DataFrame(scores["Away_Number_goals"].resample("10A").sum())

In [None]:
h_goals_year["Decade"] = h_goals_year.index
h_goals_year["Decade"] = h_goals_year["Decade"].dt.year
a_goals_year["Decade"] = a_goals_year.index
a_goals_year["Decade"] = a_goals_year["Decade"].dt.year
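
Resample bins are anchored to the date range of the data, so they do not necessarily line up with calendar decades. If calendar decades (1870s, 1880s, ...) are wanted, one possible alternative is integer division on the year. This is a sketch on hypothetical rows, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical rows with a Year column like the one created earlier.
toy = pd.DataFrame({
    "Year": [1872, 1879, 1880, 1889, 1890],
    "home_score": [0, 2, 5, 1, 3],
    "away_score": [0, 1, 4, 0, 2],
})

# 1872 -> 1870, 1889 -> 1880, and so on.
toy["Decade"] = toy["Year"] // 10 * 10

goals_by_decade = toy.groupby("Decade")[["home_score", "away_score"]].sum()
print(goals_by_decade)
```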

To further highlight the number of goals, I have made this alternative plot that shows the increase in home and away goals per decade, without the historic landmarks.

In [None]:
fig,ax = plt.subplots(figsize=(15,7))
p1 = plt.bar(h_goals_year["Decade"],h_goals_year["Home_Number_goals"],color="g",width=5,label="Home goals")
p2 = plt.bar(a_goals_year["Decade"],a_goals_year["Away_Number_goals"],bottom=h_goals_year["Home_Number_goals"],width=5,color="r",label="Away goals")
plt.xticks(h_goals_year["Decade"])
plt.legend()

Where do we play?

Below, we will take a quick look at the cities and countries where the game is played.

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(14,7))
sns.countplot(y = df["city"],order=df["city"].value_counts().index[:15],ax=ax1)
sns.countplot(y = df["country"],order=df["country"].value_counts().index[:15], ax=ax2)
fig.suptitle("Most frequent locations", fontsize=20)

Interestingly enough, the city with the most games played is Kuala Lumpur. It should not be too much of a surprise, as the “Bukit Jalil National Stadium” hosted AFF & AFC tournaments multiple times. The city also has multiple stadiums, and Malaysia is 4th on the country part of the plot.

Regarding the country that has hosted the most games, the US takes a surprising first place with more than 1,000 games hosted.

What tournaments are the most played?

In [None]:
fig,ax = plt.subplots(figsize=(15,7))
sns.countplot(y = df["tournament"],order=df["tournament"].value_counts().index[:15])
fig.suptitle("Most games per tournament", fontsize=20)

Friendly games are the most prominent. This makes sense, as most countries tend to play them to prepare for major tournaments. Also, note that some major tournaments are split between the qualification stage and the actual tournament on the plot.

To better understand the tournament data, I have plotted each major tournament on its own line plot in comparison to the other tournaments. This will allow us to look into the historic evolution of each tournament and how its growth compares to all the others.

In [None]:
best_tournaments = df[df["tournament"].isin(df["tournament"].value_counts().index[:12])]
df_cups = best_tournaments.pivot_table(index=best_tournaments["date"],
                                columns=["tournament"],aggfunc="size", fill_value=0).resample("Y").sum()
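
For readers who want to see what such a games-per-year-per-tournament table looks like, here is a small sketch using `pd.crosstab` on made-up rows (the dates and tournaments in `toy` are hypothetical). It is a swapped-in alternative to the pivot table above, not the same call, and unlike `resample` it only lists years that actually contain games.

```python
import pandas as pd

# Hypothetical games with the same two columns used in this section.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1930-07-13", "1930-07-14", "1930-05-01", "1934-05-27"]),
    "tournament": ["FIFA World Cup", "FIFA World Cup", "Friendly", "FIFA World Cup"],
})

# Rows are years, columns are tournaments, cells are the number of games.
counts = pd.crosstab(toy["date"].dt.year, toy["tournament"])
print(counts)
```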

In [None]:
# Small multiples: one panel per tournament, highlighted against all the others
plt.style.use('seaborn-darkgrid')
fig = plt.figure(figsize=(15,7))
for num, column in enumerate(df_cups, start=1):
    plt.subplot(3,4, num)
    # All tournaments as faint background lines
    for v in df_cups:
        plt.plot(df_cups.index,v,data=df_cups,marker='', color='black', linewidth=0.9, alpha=0.3)
    # The current tournament, drawn once on top
    plt.plot(df_cups.index,column, data=df_cups,color="red", linewidth=2.4, alpha=0.9, label=column)
    plt.tick_params(labelbottom=False)
    plt.title(column, loc='left', fontsize=12, fontweight=0, color="black")
plt.suptitle("Historic increase of tournament games ", fontsize=20, fontweight=0, color='black', style='italic', y=1.02)

What about the 2018 FIFA World Cup teams?

For this, we will only take the data from the teams that competed in the 2018 FIFA World Cup.

Initially, I had doubts about whether I should include the following plot. One reason was that I was unsure whether “wins per year” is a good metric for success. The other is that the number of nations is a bit high for this kind of comparison. Nevertheless, I decided to leave it in for the following reasons:

1. It gives a general idea of how a national team is doing right now.
2. It also gives us insight into the football history of that nation.

In [None]:
Fifa_2018_teams = ["Argentina","Australia","Belgium","Brazil","Colombia","Costa Rica","Croatia","Denmark","Egypt","England","France","Germany","Iceland","Iran","Japan",
"South Korea","Mexico","Morocco","Nigeria","Panama","Peru","Poland","Portugal","Russia","Saudi Arabia","Senegal","Serbia","Spain","Sweden",
"Switzerland","Tunisia","Uruguay"]
df_fifa_teams = df[df["Winner"].isin(Fifa_2018_teams)]

In [None]:
fifa_teams_wins = df_fifa_teams.groupby(["Year","Winner"])["Winner"].agg("count")
fifa_teams_wins = pd.DataFrame(fifa_teams_wins)
fifa_teams_wins["Country"] = fifa_teams_wins.index.get_level_values(1)
fifa_teams_wins["Date"] = fifa_teams_wins.index.get_level_values(0)
#fifa_teams_lose = df_fifa_teams.groupby(["Year","Loser"])["Loser"].agg("count")
#fifa_teams_lose = pd.DataFrame(fifa_teams_lose)
#fifa_teams_lose["Country"] = fifa_teams_lose.index.get_level_values(1)
#fifa_teams_lose["Date"] = fifa_teams_lose.index.get_level_values(0)

In [None]:
g = sns.FacetGrid(fifa_teams_wins, col="Country", hue="Country", col_wrap=4)
g = g.map(plt.plot, "Date", "Winner").set_titles("{col_name}")

A closer look.

While the plot offers some insight, I wanted to examine some national teams in more detail. For this purpose, I have picked four European and four South American national teams. Please note that I picked these teams more or less randomly from the nations that have a strong football history.

For Europe I have picked England, France, Germany and Spain. I will compare each team's yearly number of wins to the mean of all four. This is the first version of the notebook and I plan to use the mean wins of all European nations in the future (same goes for SA).

In [None]:
south_america = ["Brazil","Argentina","Uruguay","Colombia"]
europe = ["Germany","England","France","Spain"]

In [None]:
europe_wins = fifa_teams_wins[fifa_teams_wins["Country"].isin(europe)]

In [None]:
mean_europe = europe_wins.groupby("Year")["Winner"].agg(["mean"])
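
One thing to keep in mind with this kind of mean: a team with no wins in a given year has no row for that year, so it is left out of that year's average instead of counting as zero. The sketch below uses hypothetical rows to show the difference and one possible way (an assumption, not what this notebook does) to include the zero-win years.

```python
import pandas as pd

# Hypothetical wins: France has no win in 1951.
toy = pd.DataFrame({
    "Year": [1950, 1950, 1950, 1951],
    "Winner": ["England", "England", "France", "England"],
})

wins = toy.groupby(["Year", "Winner"]).size()

# Mean over the teams that have at least one win in that year.
mean_present = wins.groupby(level="Year").mean()

# Mean over all teams, with missing team-years filled in as zero wins.
mean_all = wins.unstack(fill_value=0).mean(axis=1)

print(mean_present.tolist(), mean_all.tolist())
```

Whether zero-win years should count depends on the question, for example a team may simply not have existed or played in that year.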

In [None]:
fig,[[ax1, ax2],[ax3, ax4]] = plt.subplots(2,2,figsize=(14,7),sharey=True)
#fig,ax = plt.subplots(2,2,figsize=(14,7))
ax1.plot("Date","Winner",data=europe_wins[europe_wins["Country"]=="England"],color="r",label="England wins")
ax1.plot(mean_europe.index,mean_europe["mean"],color = "black",label = "EU mean")
ax1.title.set_text("English wins")

ax2.plot("Date","Winner",data=europe_wins[europe_wins["Country"]=="France"],color="skyblue")
ax2.plot(mean_europe.index,mean_europe["mean"],color = "black",label = "EU mean")
ax2.title.set_text("French wins")

ax3.plot("Date","Winner",data=europe_wins[europe_wins["Country"]=="Germany"],color="y")
ax3.plot(mean_europe.index,mean_europe["mean"],color = "black",label = "EU mean")
ax3.title.set_text("German wins")

ax4.plot("Date","Winner",data=europe_wins[europe_wins["Country"]=="Spain"],color="orange")
ax4.plot(mean_europe.index,mean_europe["mean"],color = "black",label = "EU mean")
ax4.title.set_text("Spanish wins")

Representing South America in this plot are Brazil, Argentina, Uruguay and Colombia. We will apply the same plot as above.

In [None]:
sa_wins = fifa_teams_wins[fifa_teams_wins["Country"].isin(south_america)]
sa_mean = sa_wins.groupby("Year")["Winner"].agg(["mean"])

In [None]:
fig,[[ax1, ax2],[ax3, ax4]] = plt.subplots(2,2,figsize=(14,7),sharey=True)

ax1.plot("Date","Winner",data=sa_wins[sa_wins["Country"]=="Brazil"],color="g")
ax1.plot(sa_mean.index,"mean",data=sa_mean,color="black")
ax1.title.set_text("Brazilian wins")

ax2.plot("Date","Winner",data=sa_wins[sa_wins["Country"]=="Argentina"],color="b")
ax2.plot(sa_mean.index,"mean",data=sa_mean,color="black")
ax2.title.set_text("Argentinian wins")

ax3.plot("Date","Winner",data=sa_wins[sa_wins["Country"]=="Colombia"],color="red")
ax3.plot(sa_mean.index,"mean",data=sa_mean,color="black")
ax3.title.set_text("Colombian wins")

ax4.plot("Date","Winner",data=sa_wins[sa_wins["Country"]=="Uruguay"],color="y")
ax4.plot(sa_mean.index,"mean",data=sa_mean,color="black")
ax4.title.set_text("Uruguay wins")

Finally, we will use the mean wins of these continents as comparison.

In [None]:
fig,ax =plt.subplots(figsize=(14,7))
ax.plot(sa_mean.index,"mean",data=sa_mean,color="orange",label="SA mean")
ax.plot(mean_europe.index,"mean",data=mean_europe,color="skyblue",label = "EU mean")
ax.title.set_text("European mean vs SA mean")
ax.legend()

We can see that the European national teams have built a slight lead over their South American counterparts. Of course, we are only looking at the mean wins, and these numbers can be inflated by playing against weaker football nations, so we have to remain sceptical.
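
One simple refinement, sketched here on made-up rows, is to divide wins by games played. This does not correct for opponent strength, but it does remove the effect of one team simply playing more games than another. The table `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical games with the Winner column built earlier in the notebook.
toy = pd.DataFrame({
    "home_team": ["Brazil", "Spain", "Brazil", "Spain"],
    "away_team": ["Spain", "Brazil", "Peru", "Peru"],
    "Winner": ["Brazil", "Draw", "Brazil", "Peru"],
})

# Games played per team, counting home and away appearances.
played = pd.concat([toy["home_team"], toy["away_team"]]).value_counts()

# Wins per team; draws are not wins for anyone.
wins = toy.loc[toy["Winner"] != "Draw", "Winner"].value_counts()

# Teams without a win get 0 instead of a missing value.
win_rate = wins.reindex(played.index, fill_value=0) / played
print(win_rate)
```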

Another thing I would like to see is the goal distribution between the continents.

In [None]:
europe_goals = df[df["home_team"].isin(europe)]
sa_goals = df[df["home_team"].isin(south_america)]

The distribution of home goals.

In [None]:
f, (ax1,ax2) = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
sns.distplot(europe_goals["home_score"] , color="skyblue", ax=ax1)
sns.distplot(sa_goals["home_score"]  , color="olive", ax=ax2)

Distribution of away goals.

In [None]:
f, (ax1,ax2) = plt.subplots(1, 2, figsize=(14, 5), sharex=True)
sns.distplot(europe_goals["away_score"] , color="skyblue", ax=ax1)
sns.distplot(sa_goals["away_score"]  , color="olive", ax=ax2)

As we can see, there are no substantial differences in the distribution.

The end?

Well, for now, yes. There are still some things that I would like to do with this dataset. Some goals for future updates of this notebook are:

1. Adding Asian and African nations to the detailed comparison.
2. Using the mean wins of all Europe/SA national teams.
3. Comparing the total/home/mean goals between nations as a metric of success.

I hope you liked this short notebook and that it inspired you to do your own EDA.