It is always difficult to determine who is the best of all time when it comes to sports that have been played for decades. Running seems to be different, since athletes do not need to face each other in order for them to be compared. And indeed the fastest man in history is also the current Olympic champion. 

However, simply comparing results doesn't account for changes in training, gear, nutrition, culture, etc. It is very clear that Usain Bolt has the best result, but this does not necessarily mean that he would have been faster than, say, Jesse Owens, had they lived in the same era.

In this script I will try to see who was ahead of his or her time.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('../input/results.csv', names = ['Gender', 'Event', 'Location', 'Year', 'Medal', 'Name', 'Nationality', 'Result', 'Wind'])

We need to choose a model for the general improvement. It is very clear that a linear model would be a terrible choice: extrapolated far enough, it implies that sooner or later gold medalists would be able to travel faster than the speed of light, and eventually finish in zero or negative time. An exponential model, which decays toward a constant floor, seems more reasonable.
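
As a rough illustration of that argument, the sketch below fits a straight line to made-up winning times and extrapolates it. The numbers (`2.5`, `0.02`, `9.5`) are invented for the example and are not taken from the dataset.

```python
import numpy as np

# Synthetic, made-up winning times (seconds) that decay toward a floor of 9.5 s.
years = np.arange(0, 124, 4)                 # years since a hypothetical first Games
times = 2.5 * np.exp(-0.02 * years) + 9.5

# Straight-line trend fitted to the same points.
slope, intercept = np.polyfit(years, times, 1)

# A line with negative slope reaches zero after a finite number of years.
years_until_zero = -intercept / slope

# The exponential form a*exp(-b*x) + c never drops below its asymptote c.
far_future = 2.5 * np.exp(-0.02 * 10_000) + 9.5

print(f"linear model hits 0 s after about {years_until_zero:.0f} years")
print(f"exponential model after 10,000 years: {far_future:.2f} s")
```

The exponential curve levels off at `c`, which can be read as the model's estimate of the limit of human performance.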

In [None]:
def exponenial_func(x, a, b, c):
    return a*np.exp(-b*x)+c

# 100M Men

In [None]:
sprint = df[df.Event == '100M Men']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning below
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))

plt.figure(figsize = (8,8))
plt.plot(sprint.Year[sprint.Medal == 'G'].values,sprint.Result[sprint.Medal == 'G'].values,'o', color = 'y', markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'S'].values,sprint.Result[sprint.Medal == 'S'].values,'o', color = 'gray',markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'B'].values,sprint.Result[sprint.Medal == 'B'].values,'o', color = 'r',markersize = 10, alpha = 0.4)

x = sprint.Year.values - 1896
y = sprint.Result
popt, pcov = curve_fit(exponenial_func, x, y, p0=(12, 1e-6, 1))
yy = exponenial_func(np.unique(x), *popt)
plt.plot(np.unique(sprint.Year.values),yy)
plt.xlabel('Year')
plt.ylabel('Result [sec]')
plt.legend(['Gold','Silver','Bronze'])
plt.title('100M Men')
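
The same filtering and type conversion recurs for every event in this notebook. One way to avoid the repetition is a small helper; `clean_event` is a hypothetical name, and the toy frame below only mimics the `Event`, `Year` and `Result` columns, assuming (as in the cells here) that missing results are stored as the string `'None'`.

```python
import pandas as pd

def clean_event(df, event):
    """Return the rows of one event with numeric Year and Result columns."""
    out = df[df["Event"] == event].copy()
    # Missing results appear as the string 'None'; coerce them to NaN and drop them.
    out["Result"] = pd.to_numeric(out["Result"], errors="coerce")
    out = out.dropna(subset=["Result"])
    out["Year"] = pd.to_numeric(out["Year"]).astype(int)
    return out.reset_index(drop=True)

# Toy data standing in for results.csv (values are illustrative only).
toy = pd.DataFrame({
    "Event": ["100M Men", "100M Men", "200M Men"],
    "Year": ["1896", "1900", "1900"],
    "Result": ["12.0", "None", "22.2"],
})
cleaned = clean_event(toy, "100M Men")
print(cleaned)
```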

# 100M Women

In [None]:
sprint = df[df.Event == '100M Women']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning below
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))

plt.figure(figsize = (8,8))
plt.plot(sprint.Year[sprint.Medal == 'G'].values,sprint.Result[sprint.Medal == 'G'].values,'o', color = 'y', markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'S'].values,sprint.Result[sprint.Medal == 'S'].values,'o', color = 'gray',markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'B'].values,sprint.Result[sprint.Medal == 'B'].values,'o', color = 'r',markersize = 10, alpha = 0.4)



x = sprint.Year.values - 1932
y = sprint.Result
# Fit the trend once and draw it once
popt, pcov = curve_fit(exponenial_func, x, y, p0=(12, 1e-6, 1))
yy = exponenial_func(np.unique(x), *popt)
plt.plot(np.unique(sprint.Year.values),yy)
plt.xlabel('Year')
plt.ylabel('Result [sec]')
plt.legend(['Gold','Silver','Bronze'])
plt.title('100M Women')

# 200M Men

In [None]:
sprint = df[df.Event == '200M Men']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning below
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))

plt.figure(figsize = (8,8))
plt.plot(sprint.Year[sprint.Medal == 'G'].values,sprint.Result[sprint.Medal == 'G'].values,'o', color = 'y', markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'S'].values,sprint.Result[sprint.Medal == 'S'].values,'o', color = 'gray',markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'B'].values,sprint.Result[sprint.Medal == 'B'].values,'o', color = 'r',markersize = 10, alpha = 0.4)

x = sprint.Year.values - 1896
y = sprint.Result
popt, pcov = curve_fit(exponenial_func, x, y, p0=(12, 1e-6, 1))
yy = exponenial_func(np.unique(x), *popt)
plt.plot(np.unique(sprint.Year.values),yy)
plt.xlabel('Year')
plt.ylabel('Result [sec]')
plt.legend(['Gold','Silver','Bronze'])
plt.title('200M Men')

# 200M Women

In [None]:
sprint = df[df.Event == '200M Women']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning below
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))

plt.figure(figsize = (8,8))
plt.plot(sprint.Year[sprint.Medal == 'G'].values,sprint.Result[sprint.Medal == 'G'].values,'o', color = 'y', markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'S'].values,sprint.Result[sprint.Medal == 'S'].values,'o', color = 'gray',markersize = 10, alpha = 0.4)
plt.plot(sprint.Year[sprint.Medal == 'B'].values,sprint.Result[sprint.Medal == 'B'].values,'o', color = 'r',markersize = 10, alpha = 0.4)

x = sprint.Year.values - 1948
y = sprint.Result
popt, pcov = curve_fit(exponenial_func, x, y, p0=(12, 1e-6, 1))
yy = exponenial_func(np.unique(x), *popt)
plt.plot(np.unique(sprint.Year.values),yy)
plt.xlabel('Year')
plt.ylabel('Result [sec]')
plt.legend(['Gold','Silver','Bronze'])
plt.title('200M Women')
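
Each fit above returns `pcov` but never uses it. A minimal sketch of how it could be turned into rough parameter uncertainties is shown below, on synthetic noisy data. The function name `decay`, the starting values in `p0` and all the numbers are assumptions made for the example, not values from the dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b, c):
    # Same functional form as the model used in this notebook.
    return a * np.exp(-b * x) + c

# Synthetic results: a known curve plus a little random noise.
rng = np.random.default_rng(0)
x = np.arange(0, 124, 4, dtype=float)
y = decay(x, 2.5, 0.02, 9.5) + rng.normal(scale=0.05, size=x.size)

popt, pcov = curve_fit(decay, x, y, p0=(2.0, 0.01, 9.0))

# The square roots of the diagonal of pcov are one-standard-deviation
# estimates for a, b and c.
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

A large uncertainty on `c` would suggest the data do not yet pin down where the curve levels off.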

So who are the athletes who beat the expectations by the largest margin? I measured this as the ratio of the expected result (taken from the fitted curve) to the actual result, so a ratio above 1 means the athlete ran faster than the trend predicted. This approach is not flawless, as we are going to see:

In [None]:
def get_expectation(df,event,year):
    # Fit the exponential trend to all results of `event` and evaluate it at `year`
    temp_df = df[df.Event == event]
    temp_df = temp_df[temp_df.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning
    temp_df['Year'] = temp_df.Year.apply(lambda x: int(x))
    temp_df['Result'] = temp_df.Result.apply(lambda x: float(x))

    # Measure time in years since the event's first appearance
    x = temp_df.Year.values - temp_df.Year.min()
    y = temp_df.Result
    popt, pcov = curve_fit(exponenial_func, x, y, p0=(np.mean(y), 1e-6, 1))
    return exponenial_func(year -  temp_df.Year.min() , *popt)


get_expectation(df,'200M Men',2016)
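
To make the ratio concrete, here is a small worked example. The runner names and times are hypothetical; only the arithmetic (`expectation / Result`) mirrors what the cells in this notebook compute.

```python
import pandas as pd

# Three hypothetical runners facing the same expected time of 10.00 s.
toy = pd.DataFrame({
    "Name": ["Runner A", "Runner B", "Runner C"],
    "Result": [9.80, 10.00, 10.30],
    "expectation": [10.00, 10.00, 10.00],
})

# Ratio > 1: faster than the trend; ratio < 1: slower than the trend.
toy["ratio"] = toy["expectation"] / toy["Result"]
ranked = toy.sort_values("ratio", ascending=False).reset_index(drop=True)
print(ranked)
```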

# Ahead of their time - 100M Men

In [None]:
sprint = df[df.Event == '100M Men']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))
sprint = sprint.reset_index()
sprint['expectation'] = 0.0  # float column to hold the fitted values
for i in sprint.index:
    sprint.loc[i,'expectation'] = get_expectation(df,'100M Men',float(sprint.loc[i,'Year']))
    
sprint['ratio'] = np.true_divide(sprint.expectation,sprint.Result)
sprint.sort_values('ratio', ascending = False).head(20)[['Name','Year','Result','expectation']].reset_index().drop('index', axis = 1)
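
The loop above refits the curve once per row, although the fit for a given event never changes. A possible alternative is to fit once and evaluate the curve for all years in a single call. This is only a sketch on synthetic data; `decay`, the `p0` values and the numbers are assumptions for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b, c):
    # Same functional form as the model used in this notebook.
    return a * np.exp(-b * x) + c

# Synthetic, noise-free results for a hypothetical event.
years = np.arange(1896, 2020, 4)
x = (years - years.min()).astype(float)
results = decay(x, 2.5, 0.02, 9.5)

# Fit once ...
popt, _ = curve_fit(decay, x, results, p0=(2.0, 0.01, 9.0))

# ... then evaluate for every year at once instead of row by row.
expectation_vectorised = decay(x, *popt)
expectation_loop = np.array([decay(xi, *popt) for xi in x])

print(np.allclose(expectation_vectorised, expectation_loop))
```

Both approaches give the same expectations; the vectorised one simply avoids repeating the optimisation.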

It seems that this metric is biased in favor of runners from the early 20th century. Donovan Bailey broke the world record at the 1996 Olympic Games in Atlanta, but is ranked far below many athletes from the very first Olympic Games.

Regardless of this bias, the legendary Jesse Owens is ranked much lower than Usain Bolt. Jim Hines is also ranked high, with his 9.9 result in 1968, a time that could easily be worth a bronze medal even at the next Olympic Games.
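
One possible explanation for the bias, which is an assumption and not something tested in this notebook, is that early results were more widely scattered around the trend. In that case the same relative standing within an era would produce a larger ratio in the early years. The numbers below are invented purely to show the arithmetic.

```python
# Hypothetical eras: expected time and typical spread of results, in seconds.
# All values are made up for illustration.
eras = {
    "early": {"expectation": 11.0, "spread": 0.30},
    "modern": {"expectation": 9.9, "spread": 0.08},
}

ratios = {}
for era, p in eras.items():
    # A runner who beats the trend by two "typical spreads" of that era.
    result = p["expectation"] - 2 * p["spread"]
    ratios[era] = p["expectation"] / result

print(ratios)
```

Under these assumed spreads, the early runner gets the higher ratio even though both runners are equally exceptional relative to their contemporaries.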

# 100M Women

In [None]:
sprint = df[df.Event == '100M Women']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))
sprint = sprint.reset_index()
sprint['expectation'] = 0.0  # float column to hold the fitted values
for i in sprint.index:
    sprint.loc[i,'expectation'] = get_expectation(df,'100M Women',float(sprint.loc[i,'Year']))
    
sprint['ratio'] = np.true_divide(sprint.expectation,sprint.Result)
sprint.sort_values('ratio', ascending = False).head(20)[['Name','Year','Result','expectation']].reset_index().drop('index', axis = 1)

# 200M Men

In [None]:
sprint = df[df.Event == '200M Men']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))
sprint = sprint.reset_index()
sprint['expectation'] = 0.0  # float column to hold the fitted values
for i in sprint.index:
    sprint.loc[i,'expectation'] = get_expectation(df,'200M Men',float(sprint.loc[i,'Year']))
    
sprint['ratio'] = np.true_divide(sprint.expectation,sprint.Result)
sprint.sort_values('ratio', ascending = False).head(20)[['Name','Year','Result','expectation']].reset_index().drop('index', axis = 1)

Michael Johnson is ranked 2nd, after Archie Hahn, the gold medalist from 1904 (it took 20 years until someone improved his Olympic record). In this case, Usain Bolt is ranked below Jesse Owens. There are also 2 prominent silver medalists in this list, Frank Fredericks and Yohan Blake, both of whom would easily have been gold medalists had it not been for Johnson and Bolt.

# 200M Women

In [None]:
sprint = df[df.Event == '200M Women']
sprint = sprint[sprint.Result != 'None'].copy()  # explicit copy avoids SettingWithCopyWarning
sprint['Year'] = sprint.Year.apply(lambda x: int(x))
sprint['Result'] = sprint.Result.apply(lambda x: float(x))
sprint = sprint.reset_index()
sprint['expectation'] = 0.0  # float column to hold the fitted values
for i in sprint.index:
    sprint.loc[i,'expectation'] = get_expectation(df,'200M Women',float(sprint.loc[i,'Year']))
    
sprint['ratio'] = np.true_divide(sprint.expectation,sprint.Result)
sprint.sort_values('ratio', ascending = False).head(20)[['Name','Year','Result','expectation']].reset_index().drop('index', axis = 1)