Let's see if there's anything hiding in this data!

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import csv

%matplotlib inline

import time
import datetime

Read in the CSV, naming the columns explicitly and adding an extra one for wind speed. Some rows carry an additional wind field, and declaring the column up front lets pandas read those rows without complaining.

In [None]:
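# Name the columns explicitly, including the extra 'Wind' field that only
# some rows carry, and skip the file's own header row so it is not read as data.
df = pd.read_csv('../input/results.csv', skiprows = 1, names = 
    ['Gender',
     'Event',
     'Location',
     'Year',
     'Medal',
     'Name',
     'Nationality',
     'Result',
     'Wind'])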
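
To illustrate why the extra column name helps, here is a minimal sketch on a made-up two-row sample (the values and the shortened column list are assumptions for illustration, not taken from the dataset). Rows that lack the trailing field simply get NaN in that column.

```python
import io
import pandas as pd

# Made-up sample: the header names two columns, but the first data row
# carries a third field (a wind reading).
raw = "Event,Result\n100M,9.9,0.5\nMarathon,2:10:00\n"

# Supplying one more name than the header has, and skipping the header row,
# lets every row parse; rows without the third field get NaN there.
sample = pd.read_csv(io.StringIO(raw), names=["Event", "Result", "Wind"], skiprows=1)
print(sample)
```
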

Make sure that the 'Wind' column is just 0 for all of the rows that don't have a wind adjustment.

In [None]:
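# Fill missing wind readings with 0, then preview the result.
# (Only the last expression in a cell is displayed, so head() goes last.)
df["Wind"] = df["Wind"].fillna(0)
df.head()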

Let's check out what events there are!

In [None]:
df["Event"].unique()

It seems like every event name ends in Men or Women, even though there is a gender column already. We can get rid of that suffix and end up with something a little nicer to look at. In particular, for the running events, we can get the name down to just the distance number.

In [None]:
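# Strip the " Men" / " Women" suffix. regex=True is required on recent pandas,
# where str.replace treats the pattern as a literal string by default.
df['Event'] = df['Event'].str.replace(r'\s(?:Men|Women)', '', regex=True)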
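
As a quick check of what that substitution does, here is a small sketch on a few illustrative event names (assumed to resemble the dataset's naming, not copied from it).

```python
import pandas as pd

# Illustrative event names in the "<event> <gender>" style.
events = pd.Series(["100M Men", "100M Women", "Marathon Women"])

# Remove a whitespace character followed by "Men" or "Women".
cleaned = events.str.replace(r"\s(?:Men|Women)", "", regex=True)
print(cleaned.tolist())
```
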

In [None]:
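def distance_map(value):
    # Convert an event name that is a plain number (e.g. "100") to an integer
    # distance in meters; anything else becomes NaN.
    try:
        return int(value)
    except (ValueError, TypeError):
        return np.nan

# Strip the trailing "M", then map the marathon to its 42,195 m length.
df['Distance'] = df['Event'].str.replace(r"M$", "", regex=True).apply(lambda value: 42195.0 if value == "Marathon" else distance_map(value))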
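
The same mapping could also be written without a helper function. The sketch below is one possible vectorized alternative on made-up event names; it uses `pd.to_numeric(..., errors="coerce")`, which is slightly more permissive than `int()` (it would accept a decimal such as "1.5"), so treat it as an approximation of the cell above rather than an exact equivalent.

```python
import pandas as pd

# Illustrative event names after the gender suffix has been removed.
events = pd.Series(["100M", "10000M", "Marathon", "High Jump"])

# Drop a trailing "M" and coerce what remains to a number (NaN if not numeric).
stripped = events.str.replace(r"M$", "", regex=True)
distance = pd.to_numeric(stripped, errors="coerce")

# The marathon has no number in its name, so assign its length in meters.
distance = distance.mask(events == "Marathon", 42195.0)
print(distance.tolist())
```
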

Now, looking at the Result column for running events, we can convert the time mark into something a little more useful and standard, i.e. a number of seconds. The results are written in quite a few different formats, however, so I tried to capture a few of them. I've definitely missed a few...
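
To make the conversion concrete, here is a small sketch of parsing one time string with one `strptime` format. The helper name `to_seconds` and the sample strings are made up for illustration. A string that does not match the format raises `ValueError`, which is why several formats have to be tried in turn.

```python
import datetime

def to_seconds(text, fmt):
    """Parse text with a single strptime format and return elapsed seconds."""
    x = datetime.datetime.strptime(text, fmt)
    return datetime.timedelta(hours=x.hour, minutes=x.minute,
                              seconds=x.second,
                              microseconds=x.microsecond).total_seconds()

# Hypothetical results in three different notations.
print(to_seconds("2:10:00", "%H:%M:%S"))   # hours:minutes:seconds
print(to_seconds("3:30.5", "%M:%S.%f"))    # minutes:seconds.fraction
print(to_seconds("10.25", "%S.%f"))        # seconds.fraction
```
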

We could also calculate pace per kilometer for the time and distance given so that all of the events are on the same relative scale.

In [None]:
# Result formats to try, in order; the first one that parses wins.
TIME_FORMATS = [
    "%H:%M:%S",
    "%H:%M:%S.%f",
    "%Hh%M:%S",
    "%H-%M:%S",
    "%H-%M:%S.%f",
    "%M:%S.%f",
    "%S.%f",
]

def time_map(row):
    # Only events with a numeric distance (running events) are converted.
    if np.isnan(row['Distance']):
        return np.nan
    for fmt in TIME_FORMATS:
        try:
            x = datetime.datetime.strptime(row['Result'], fmt)
        except (ValueError, TypeError):
            # Wrong format (or a non-string result): try the next one.
            continue
        return datetime.timedelta(hours=x.hour, minutes=x.minute, seconds=x.second, microseconds=x.microsecond).total_seconds()
    # No known format matched.
    return np.nan

df['Seconds'] = df.apply(time_map, axis=1)
df['Pace'] = df['Seconds']/(df['Distance']/1000)
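
As a sanity check on the units, pace here is seconds per kilometer: the time in seconds divided by the distance in kilometers. A tiny worked example with a hypothetical 30:00 result over 10,000 m (not a value from the dataset):

```python
# Hypothetical result: 30 minutes over 10,000 meters.
distance_m = 10000.0
seconds = 30 * 60.0

# Same formula as the Pace column: seconds / (meters / 1000).
pace = seconds / (distance_m / 1000)
print(pace)  # seconds per kilometer
```
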

Let's make some visualizations.

In [None]:
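# Coerce Year to numeric; any non-numeric entry (such as a stray header row)
# becomes NaN and is removed by the dropna() below.
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')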
g = sns.lmplot(x = "Year", y = "Pace", col = "Event", 
               hue = "Medal", data = df.dropna(), col_wrap = 3)

Pace and times are going down over the years, meaning the athletes are getting faster. That makes sense.

In [None]:
g = sns.lmplot(x = "Year", y = "Pace", col = "Event", 
               hue = "Gender", data = df.dropna(), col_wrap = 3)

As expected, the elite women tend to be slower than the elite men, by a fairly consistent margin. Some of the regressions look a little strange only because women did not have results in some events (for example, in the marathon) before 1984.
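
One way to check where the women's results begin is to group by event and gender and take the earliest year. The sketch below runs on a tiny made-up frame; the gender codes "M" and "W" are an assumption about how the Gender column is encoded.

```python
import pandas as pd

# Made-up rows for illustration; gender codes "M"/"W" are assumed.
sample = pd.DataFrame({
    "Gender": ["M", "M", "W", "W"],
    "Event": ["Marathon"] * 4,
    "Year": [1896, 1984, 1984, 1988],
})

# Earliest year with a result for each event and gender.
first_year = sample.groupby(["Event", "Gender"])["Year"].min()
print(first_year)
```

On the real data, the same `groupby` on `df` would show which events have a shorter history for one gender.
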

In [None]:
g = sns.regplot(x = "Distance", y = "Pace",
                data = df.dropna(), logx = True)

Interesting: pace seems to follow a logarithmic trend across distances. The difference in pace between the 5K, 10K, and Marathon is not as large as the difference in pace among the shorter events, from the 100M up to the Mile.
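
With `logx = True`, seaborn fits a model of the form pace ~ a + b * log(distance). The sketch below shows the same kind of fit done by hand with NumPy on synthetic data; the coefficients 60 and 12 are made up so the fit can be checked, and are not estimates from the Olympic results.

```python
import numpy as np

# Synthetic distances in meters and a made-up logarithmic relationship.
distance = np.array([100.0, 400.0, 1500.0, 10000.0, 42195.0])
a_true, b_true = 60.0, 12.0
pace = a_true + b_true * np.log(distance)

# A degree-1 polynomial fit on log(distance); polyfit returns
# the slope first, then the intercept.
b_fit, a_fit = np.polyfit(np.log(distance), pace, 1)
print(a_fit, b_fit)
```

Because the synthetic data is exactly logarithmic, the fit recovers the made-up coefficients; on the real results the points would scatter around the fitted curve.
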