# **World Marathon Majors**

# Introduction

The World Marathon Majors (WMM) is a series of top-level marathons held annually in six major cities:

* Tokyo (Japan)
* Boston (USA)
* London (UK)
* Berlin (Germany)
* Chicago (USA)
* New York City (USA)

While the modern WMM was established in 2006, our dataset also includes historical winners of the six marathons since their start (the earliest data is from the Boston marathon in the late 1890s). 

The data offers insight into how the competition has developed. This notebook explores which countries and individual athletes have been the most successful in the history of these races, and how results have changed over time and across host cities.

In [None]:
# Libraries used in this notebook; all are preinstalled in the Kaggle Python 3 environment
# (kaggle/python Docker image: https://github.com/kaggle/docker-python)

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# The dataset

In [None]:
df = pd.read_csv("../input/world-marathons-majors/world_marathon_majors.csv", encoding="latin1")

In [None]:
df.head()

In [None]:
df.info()

# Most successful athletes

Which athletes have won the most races across these six marathons? Norwegian runner Grete Waitz takes first place with 11 wins, while Ingrid Kristiansen (Norway) and Bill Rodgers (USA) won 8 each.

In [None]:
# Count wins per athlete and keep the ten most frequent winners
top10 = df.winner.value_counts().iloc[:10]
top10.plot(kind="bar", figsize=(15, 8), fontsize=14, cmap="RdYlBu_r")
plt.ylabel("No. of Victories", fontsize=14)
plt.show()

# Marathon wins by country

When we consider the athletes' country of origin, the most successful countries have been Kenya (136 wins overall), the US (104) and Ethiopia (51).

In [None]:
df.country.value_counts()

In [None]:
top_countries = df.country.value_counts().iloc[:15]
top_countries.plot(kind = "bar", figsize=(15,8), fontsize=14)
plt.ylabel("No. of Victories", fontsize=14)
plt.show()

The success of individual countries tends to vary over the decades. Let's take the first two countries in the ranking as an example. Kenyan athletes have won most of their marathons since 1995, with 35 wins in the years 2010-2014 alone, so their momentum in this discipline has been growing.

In [None]:
kenya = df.loc[df.country=="Kenya"].sort_values("year")
us = df.loc[df.country=="United States"].sort_values("year")

In [None]:
px.histogram(kenya, x="year", title="Kenya - Marathon Wins")
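
The 2010-2014 figure quoted above can be checked with a simple year filter. The following is a minimal sketch, assuming `year` is an integer column. The helper `wins_in_period` and the toy frame are invented for illustration and are not part of the dataset or the original notebook.

```python
import pandas as pd


def wins_in_period(frame, country, start, end):
    """Count rows for `country` whose year lies in [start, end], both ends included."""
    mask = (frame["country"] == country) & frame["year"].between(start, end)
    return int(mask.sum())


# Toy data, made up for illustration only (not the real dataset)
toy = pd.DataFrame({
    "year": [2009, 2010, 2012, 2014, 2015, 2011],
    "country": ["Kenya", "Kenya", "Kenya", "Kenya", "Kenya", "Ethiopia"],
})

kenya_2010_2014 = wins_in_period(toy, "Kenya", 2010, 2014)
print(kenya_2010_2014)
```

On the real data, the equivalent call would be `wins_in_period(df, "Kenya", 2010, 2014)`.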

US participation, on the other hand, has a much longer history, spanning well over a century. The most successful decade for US athletes was the 1970s, when they won 36 marathons.

In [None]:
px.histogram(us, x="year", title="USA - Marathon Wins by Decade")
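
A histogram over `year` uses automatically chosen bins, so exact per-decade totals are easier to read from an explicit count. The sketch below is one possible way to do this, again assuming an integer `year` column. The function name `wins_by_decade` and the toy data are hypothetical.

```python
import pandas as pd


def wins_by_decade(frame, country):
    """Return the number of wins per decade for one country, ordered by decade."""
    subset = frame.loc[frame["country"] == country]
    # Integer division maps e.g. 1975 -> 1970
    decade = (subset["year"] // 10) * 10
    return decade.value_counts().sort_index()


# Toy data, made up for illustration only (not the real dataset)
toy = pd.DataFrame({
    "year": [1968, 1971, 1975, 1979, 1983, 1976],
    "country": ["United States"] * 5 + ["Finland"],
})

us_decades = wins_by_decade(toy, "United States")
print(us_decades)
print("Best decade:", us_decades.idxmax())
```

Applied to the real data as `wins_by_decade(df, "United States")`, this would give the per-decade totals behind the statement about the 1970s.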

# Marathon wins by location

But what if we isolate each of the six WMM locations while looking at wins by country? Would the results still look the same, with races dominated by Kenya and the US, or would we see more variation?

In [None]:
London = df.loc[df.marathon == "London"]
London.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="summer")
plt.title("London Marathon - Wins by Country", fontsize=16)
plt.show()

In [None]:
Tokyo = df.loc[df.marathon == "Tokyo"]
Tokyo.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="hot")
plt.title("Tokyo Marathon - Wins by Country", fontsize=16)
plt.show()

In [None]:
Berlin = df.loc[df.marathon == "Berlin"]
Berlin.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="spring")
plt.title("Berlin Marathon - Wins by Country", fontsize=16)
plt.show()

In [None]:
Boston = df.loc[df.marathon == "Boston"]
Boston.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="autumn")
plt.title("Boston Marathon - Wins by Country", fontsize=16)
plt.show()

In [None]:
NYC = df.loc[df.marathon == "NYC"]
NYC.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="cividis")
plt.title("NYC Marathon - Wins by Country", fontsize=16)
plt.show()

In [None]:
Chicago = df.loc[df.marathon == "Chicago"]
Chicago.country.value_counts().iloc[:5].plot(kind = "bar", figsize=(15,6), cmap="viridis")
plt.title("Chicago Marathon - Wins by Country", fontsize=16)
plt.show()

As the graphs above show, Kenya retains its strong position across the board, coming first in London, Tokyo, NYC, and Chicago, and second in Berlin and Boston. There is more to observe, however: the home country seems to have some advantage in how many of its athletes have won the local race. Germany comes out on top in Berlin, and the UK holds a strong second place in London. The US has by far the most wins in Boston and loses only to Kenya in Chicago and NYC. Interestingly, the US does not even appear in the top 5 in Berlin, Tokyo, or London.
There could be many reasons for this. One possibility is higher barriers to entry for athletes based abroad, such as travel costs, which would make it comparatively easier for local athletes to compete.
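
The six per-location rankings can also be produced in a single table, which makes the home-country pattern easier to compare side by side. This is a minimal sketch using a grouped count. The function `top_countries_by_marathon` and the toy frame are assumptions introduced here, not part of the original analysis.

```python
import pandas as pd


def top_countries_by_marathon(frame, n=5):
    """Return the n countries with the most wins for each marathon."""
    counts = (
        frame.groupby(["marathon", "country"])
        .size()
        .rename("wins")
        .reset_index()
        .sort_values(["marathon", "wins"], ascending=[True, False])
    )
    # head(n) keeps the first n rows of each marathon in the sorted order
    return counts.groupby("marathon").head(n).reset_index(drop=True)


# Toy data, made up for illustration only (not the real dataset)
toy = pd.DataFrame({
    "marathon": ["Berlin", "Berlin", "Berlin", "Boston", "Boston", "Boston"],
    "country": ["Germany", "Germany", "Kenya",
                "United States", "United States", "Kenya"],
})

top1 = top_countries_by_marathon(toy, n=1)
print(top1)
```

With the real data, `top_countries_by_marathon(df)` would return up to five rows per location, matching the bar charts above.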

# Winning times by WMM location

Are all marathons equal, or do winning performances vary between locations? A simple box plot gives a first visual check of how winning times are distributed at each location.

In [None]:
px.box(df.sort_values(by="time"), y="time", x="marathon")

Winning times vary widely at each of the six WMM locations. However, median winning times appear to be slightly faster in some locations (e.g. London) than in others (such as Boston or NYC); the gap is most evident between London and Boston.
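
To put numbers on the medians read off the box plot, the times can be converted to minutes and grouped. The sketch below assumes that `time` is stored as an `H:MM:SS` string and that `gender` separates the men's and women's races. Both are assumptions about the dataset, and the labels in the toy frame are invented. Grouping by gender as well as location avoids mixing the two fields in one median.

```python
import pandas as pd


def median_minutes(frame):
    """Median winning time in minutes per (marathon, gender) group."""
    # Assumes `time` holds strings such as "2:05:30"
    minutes = pd.to_timedelta(frame["time"]).dt.total_seconds() / 60
    return (
        frame.assign(minutes=minutes)
        .groupby(["marathon", "gender"])["minutes"]
        .median()
    )


# Toy data, made up for illustration only (not the real dataset)
toy = pd.DataFrame({
    "marathon": ["London", "London", "London", "Boston", "Boston", "Boston"],
    "gender": ["Male"] * 6,
    "time": ["2:04:00", "2:06:00", "2:08:00", "2:08:00", "2:10:00", "2:12:00"],
})

medians = median_minutes(toy)
print(medians)
```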

While it has little statistical relevance in itself, 11-time WMM winner Grete Waitz also recorded her two best winning performances in London.

In [None]:
gw = df.loc[df.winner == "Grete Waitz"].sort_values(by="time").drop(columns=["gender", "country"])
gw