## Reading and preparing data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import os
import sys
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

### Reading CSV files

A first inspection of the downloaded files, made without reading them, shows 15 CSV files covering the years 2003 to 2017. The 2003 file is noticeably smaller than the others.

In [None]:
raw_folder = '/kaggle/input/ultratrail-du-montblanc-20032017'
files = sorted(os.listdir(raw_folder))

def full_file(dirname, filename):
    return os.path.join(dirname, filename)

print(files)

To load each file as a dataframe without explicitly defining a variable for each one, I store them in a dictionary whose keys are the years and whose values are the corresponding dataframes. The annual dataframes are also concatenated into a single dataframe so that all the data can be inspected together, and a `Year` column is added to each one. The reading logic lives in an auxiliary function because it will be reused later. For this first step, reading only a couple of rows per file is enough.

In [None]:
# helper function to load the csv files; if nrows is 0 or negative, all rows are read


def read_csvs(nrows, show=True):
    dic = dict()
    df = pd.DataFrame()
    for file in files:
        year = file[5:9]
        if nrows > 0:
            args = {'filepath_or_buffer': full_file(raw_folder, file), 'nrows': nrows}
        else:
            args = {'filepath_or_buffer': full_file(raw_folder, file)}
        dic[year] = pd.read_csv(**args)
        dfd = dic[year]
        dfd['Year'] = year
        if show:
            print(year, list(dfd))
        df = pd.concat([df, dfd], sort=False)
    return dic, df


df_dict, df = read_csvs(nrows=2)
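The slice `file[5:9]` relies on the year sitting at a fixed position in every file name. As a minimal sketch, assuming each name contains a four-digit year somewhere (the file names and the helper `year_from_filename` below are hypothetical, not taken from the dataset), a regular expression makes the extraction independent of position:

```python
import re


def year_from_filename(filename):
    """Return the first four-digit year (1900-2099) found in a file name, or None."""
    match = re.search(r'(?:19|20)\d{2}', filename)
    return match.group(0) if match else None


# hypothetical file names, used only for illustration
examples = ['utmb_2003.csv', 'results-2017-final.csv', 'notes.txt']
years = [year_from_filename(name) for name in examples]
print(years)
```

Returning `None` for names without a year makes it easy to skip unexpected files instead of creating a dictionary key from arbitrary characters.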


### Data preparation

We see that the number and names of the columns vary considerably from one dataframe to another.

In [None]:
df.shape

In the global dataframe we have 136 columns, whose names are listed below. Keep in mind that columns with the same name in several source files appear only once, so hard work awaits us. I begin by listing the columns in alphabetical order:

In [None]:
columns = np.array(list(df))
columns.sort()
columns

Two things stand out:
  1. Some names are repeated with different capitalization, e.g. `'Champ'` and `'champ'`.
  2. Many names look alike, e.g. `'Champex La'` and `'Champex Lac'`.

We convert all column names to lowercase and keep only the unique values.

In [None]:
columns = np.char.lower(np.array(list(df)))
columns = np.unique(columns)
columns

We verify that we have managed to reduce the number of columns.

In [None]:
len(columns)

Let's take a first look at the content of the global dataframe to see what data it stores. 10 random observations are displayed.

In [None]:
# option to show all columns of the dataframe
pd.set_option('display.max_columns', df.shape[1])
# we choose 10 at random
df.sample(10)

#### Most of the columns are passing times, and their names are almost certainly the places where the runners were timed. There are also other data such as number, name, team, category, position and the like.
#### To continue, each column must be mapped to a new name that groups its "reasonable resemblances" in an n-to-1 relationship. For example, it is logical to map `'bert', 'berton', 'bertone'` to `'Bertone'`. For this task, each name must be matched to a place on the route. *Behind the scenes*, using the race website, Google Maps and some patience, a table with these relationships was assembled. Five of the names could not be identified.
#### To distinguish the time columns, the unknown columns and the remaining ones, a qualifying field is added. This process took two iterations because the final concatenation of dataframes failed for two years: 2012 has two fields, `'conta'` and `' coun r'`, that were both assigned to `'Contamines'`, and 2015 has two fields, `'courm1'` and `'courm2'`, that were both mapped to `'Courmayeur'`.
#### A convenience field `Order` has also been added, by which the columns of the final dataframe will be sorted. All the time columns are located at the end of the dataframe, starting at index 8, and their positions follow the race course.
#### The dataframe has two fields, `'Cham'` and `'cham'`, which could be either Champex or Chamonix. The listing above shows that their times are higher than the passing times at Champex and similar to the arrival times in Chamonix. They are therefore assigned to the latter, and the field is named `Arrivee`, respecting the original French word for arrival but without the accent. Accents have also been removed from the other field names, as have spaces and hyphens, keeping only the letters of the English alphabet.
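
#### The name clean-up described above (no accents, spaces or hyphens, only English letters) was done by hand in the mapping table. As a rough sketch of the same rule in code, assuming it is applied character by character, a hypothetical helper `normalize_name` could look like this:

```python
import unicodedata


def normalize_name(name):
    """Strip accents, then drop every character that is not an ASCII letter."""
    # NFKD splits an accented letter into the base letter plus a combining mark
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if ch.isascii() and ch.isalpha())


# hypothetical raw names, used only for illustration
raw_names = ['Arrivée', 'Col de la Seigne', 'Saint-Gervais']
cleaned = [normalize_name(n) for n in raw_names]
print(cleaned)
```

#### This only covers the character clean-up; grouping look-alike names such as `'bert'` and `'berton'` still needs the manual mapping table.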

#### The result of the work described is saved in a csv file, which is loaded in the cell below.

In [None]:
file = 'columns.csv'
data_folder = '/kaggle/input/utmbcolumns'
cols_map = pd.read_csv(full_file(data_folder, file))
cols_map[:15]

#### The unknown variables are:

In [None]:
unknowns = cols_map.loc[cols_map['type'] == 'Unknown']
unknowns

#### To continue, this time we load all the data, without displaying it on screen, using the auxiliary function `read_csvs` defined earlier.

In [None]:
df_dict, df = read_csvs(nrows=-1, show=False)

#### Now the columns of each dataframe have to be renamed. This must be done on each individual dataframe before concatenating them: if it were done on the global dataframe, columns whose original names differ only in capitalization would end up with identical names. The global dataframe, into which all the annual dataframes are concatenated, is initialized first.

In [None]:
# initialize the global dataframe
df = pd.DataFrame()
# we go through the dictionary
for k, dfd in df_dict.items():
    # create a one-column dataframe with the column names in lowercase
    cols_dfd = pd.DataFrame(np.char.lower(np.array(list(dfd))))
    # we give that column the same name as its equivalent column in the map
    cols_dfd.columns = ['old']
    # merge the original columns and the map on 'old'
    cols_new = pd.merge(cols_dfd, cols_map)['new'].tolist()
    # we rename the columns in the original
    dfd.columns = cols_new
    # add the corrected annual dataframe to the global dataframe
    df = pd.concat([df, dfd], sort=False)

df.shape
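
#### To illustrate why renaming each frame before concatenating works, here is a small sketch on toy data. The mapping `name_map`, the two frames and the helper `rename_with_map` are hypothetical stand-ins for `cols_map` and the annual dataframes:

```python
import pandas as pd

# hypothetical n-to-1 mapping from lowercase original names to the new name
name_map = {'champ': 'Champex', 'champex la': 'Champex', 'champex lac': 'Champex'}

# two toy "annual" frames whose column names differ
df_a = pd.DataFrame({'Champ': ['10:00:00']})
df_b = pd.DataFrame({'Champex Lac': ['11:00:00']})


def rename_with_map(frame, mapping):
    """Rename every column through the mapping, matching on the lowercase name."""
    return frame.rename(columns=lambda c: mapping[c.lower()])


renamed = [rename_with_map(f, name_map) for f in (df_a, df_b)]
combined = pd.concat(renamed, ignore_index=True, sort=False)
print(combined)
```

#### Both source columns end up in a single `Champex` column with one row per frame, because each frame holds only one of the variants at the time it is renamed.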

#### We check its dimensions and see that the variables have gone from 136 to 53. Not bad ...

#### We sort the columns according to the column map.

In [None]:
# Order is repeated in the new columns that group the original ones;
# the minimum value is taken, but the maximum or the average would give the same result
group = cols_map.groupby('new')['order'].min().sort_values()
# we save the ordered columns
columns_order = list(group.index.values)
# the global dataframe is reindexed with ordered columns
df = df.reindex(columns=columns_order)

Now we check if there are variables with all null observations.

In [None]:
empties = df.columns[df.isna().all()].tolist()
empties


One appears and we delete it.

In [None]:
df.drop(columns=empties, inplace=True)
df.shape

The data in the chrono columns are `timedeltas`. We will make sure that all of them are properly formatted before finishing the cleaning.
If we look at the data in `Timediff`, we see that some observations have the form `mm:ss.0`, while the function expects `hh:mm:ss.0`. We are going to convert these values to the `hh:mm:ss` format, which is also valid for `pd.to_timedelta()`.
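
The conversion described above can be sketched on plain strings. This is only an illustration with made-up values; the helper `normalize_timediff` is hypothetical, and it assumes that only values of the exact form `mm:ss.0` need rewriting:

```python
import re


def normalize_timediff(value):
    """Rewrite 'mm:ss.0' as '00:mm:ss'; leave any other value unchanged."""
    if not isinstance(value, str):
        return value  # None / NaN pass through untouched
    match = re.fullmatch(r'(\d{1,2}):(\d{2})\.0', value)
    if match is None:
        return value
    minutes, seconds = match.groups()
    return f'00:{int(minutes):02d}:{seconds}'


# made-up sample values
samples = ['12:34.0', '5:07.0', '01:12:34', None]
normalized = [normalize_timediff(v) for v in samples]
print(normalized)
```

Matching the whole string keeps values that already carry an hour part untouched.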

In [None]:
df['Timediff'].head()

In [None]:
# we reset the index since we are going to use masks to act only
# on affected observations
df = df.reset_index(drop=True)
# we create a mask with the values that end in .0, treating nulls as False
mask = df['Timediff'].str.endswith('.0', na=False)
# on the masked rows only, the trailing .0 is removed using regex syntax
# and the hh: part is added at the beginning; other rows are left untouched
df.loc[mask, 'Timediff'] = '00:' + df.loc[mask, 'Timediff'].str.replace(
    r'\.0$', '', regex=True)
df['Timediff'].head()

#### We verify that all values accept the conversion to timedelta.

In [None]:
cronos_list = list(df)[8:]
cronos_df = df[df.columns.intersection(cronos_list)]
for crono in cronos_list:
    print(crono)
    pd.to_timedelta(cronos_df[crono])
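
#### The loop above stops with an exception at the first value that cannot be converted. As an alternative sketch on made-up data, `errors='coerce'` turns unparseable values into `NaT`, so they can be listed instead:

```python
import pandas as pd

# made-up raw values: two valid times, one invalid string, one missing value
raw = pd.Series(['20:05:59', '00:12:34', 'abc', None])

# invalid strings become NaT instead of raising an exception
parsed = pd.to_timedelta(raw, errors='coerce')

# values that were present in the input but could not be parsed
bad = raw[parsed.isna() & raw.notna()]
print(bad)
```

#### Comparing `parsed.isna()` with `raw.notna()` separates genuinely missing times from malformed ones.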


 # Exploration

In [None]:
pd.set_option('display.max_columns', df.shape[1])
df[:3]


### Column type conversion

#### The chrono columns start at index 8, so we retrieve them directly by position without going through the column map. At the end, we subtract two times to verify that we can operate on the values correctly.

In [None]:
df['Year'] = df['Year'].astype(int)

cronos = list(df)[8:]
for crono in cronos:
    print(crono)
    td = pd.to_timedelta(df[crono])
    df[crono] = td

df.Time[30] - df.Time[29]


### Verifications

#### We will do some basic checks. We start by checking the `Arrivee` variable.

In [None]:
df.groupby('Year')['Arrivee'].min().sort_values()

#### Three abnormal values are seen:
  1. 2012: the arrival time is half that of the other reported years.
  2. 2003: there is no arrival time.
  3. 2010: there is no arrival time.

#### We prepare an auxiliary function that retrieves only the columns containing data for a given year. Because the schema varies from year to year, many variables are empty in any given year; dropping them lets us perform the analysis with less noise.

In [None]:
# auxiliary function to recover a year by eliminating
# columns with all nans values

def get_year(df, year):
    year_df = df.loc[df.Year == year]
    notnas = year_df.columns[year_df.notna().any()].tolist()
    return year_df[notnas]

 #### Year 2012

In [None]:
get_year(df, 2012)[:3]

#### For 2012 the data looks correct and reasonable, but the passing points and times indicate that it is a different race, so we cannot use this year to compare against the others. Since we do not even know which race it is, we exclude it from the rest of the analysis. We delete the observations whose year is 2012 and then delete the variables exclusive to that year, which will now contain no data.

In [None]:
# we exclude the observations from the year 2012 (it's another race)
df = df[df.Year != 2012]
# we locate the variables exclusive to that year, knowing they will be all nans
empties = df.columns[df.isna().all()].tolist()
display(empties)
# we delete them from the dataframe; drop returns a new frame, so we do not
# modify a filtered slice in place
df = df.drop(columns=empties)
display(df.shape)

 #### Year 2003

In [None]:
get_year(df, 2003)[:3]

#### For 2003 there are no intermediate passing times, only the total time. Its observations can still inform the analysis, so we update the `Arrivee` column with the value of `Time`.

In [None]:
mask = df.Year == 2003
df.loc[mask, 'Arrivee'] = df.loc[mask, 'Time']
get_year(df, 2003)[:3]

 #### Year 2010

In [None]:
get_year(df, 2010)[:3]

#### 2010 is not a great year for our analysis: there are only two passing columns and no total time. We keep the observations because they can be useful for analysing participants when times are not required.

#### Finally, we display the first rows of every year as a basic quality check.

In [None]:
years = np.arange(2003, 2018)
for y in years:
    if y != 2012:
        display(get_year(df, y).head(3))


#### A new problem appears: everything seems correct except for a negative passing time in 2005 at Contamines for the runner with number 1. Let's review the negative times.

In [None]:
# we retrieve the time data by position
# knowing they are consecutive
cronos = df[list(df)[8:]]


def check_less_than_0(cronos):
    # we filter negative times and not nans
    less_than_0 = (cronos < pd.Timedelta(0)) & pd.notna(cronos)
    # we add the result to count them (sum of booleans)
    totals_lt0 = less_than_0.sum().copy()
    # we save those that have at least one negative value in an orderly fashion
    totals_gt0 = totals_lt0[totals_lt0 > 0].sort_values(ascending=False)
    display(totals_gt0)
    return totals_gt0.index.values


cols_lt0 = check_less_than_0(cronos)


#### There are not many, but it is not correct to leave them in the dataframe. We set them to `nan`.

In [None]:
for col in cols_lt0:
    mask = (cronos[col] < pd.Timedelta(0)) & pd.notna(cronos[col])
    df.loc[mask, col] = np.nan

cronos = df[list(df)[8:]]
check_less_than_0(cronos)
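
#### The same clean-up can be written without a loop. A minimal sketch on made-up data, using `DataFrame.mask` to blank out every negative time at once (the column names and values here are only illustrative):

```python
import pandas as pd

# made-up passing times: one negative value and one missing value
times = pd.DataFrame({
    'Contamines': [pd.Timedelta(hours=-1), pd.Timedelta(hours=3, minutes=10), pd.NaT],
    'Arrivee': [pd.Timedelta(hours=21), pd.Timedelta(hours=22), pd.Timedelta(hours=25)],
})

# True where the time is negative; missing values compare as False
negative = times < pd.Timedelta(0)

# mask replaces the flagged cells with missing values
cleaned = times.mask(negative)
print(cleaned)
```

#### Because missing values never compare as negative, no separate null check is needed in the condition.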


#### Let's now see the quality of the Nationality variable.

In [None]:
df.Nationality = df.Nationality.str.lower()

countries = df.groupby('Nationality')[
    'Nationality'].count().sort_values(ascending=False)

display(countries[:5])

display(countries[-5:])

display(countries.index.values)

#### **Caution!** There is a country with a blank space as its code. We will take it into account later.

#### We review the categories. The listings above showed some categories containing spaces.

In [None]:
# we convert all values to lowercase
df.Category = df.Category.str.lower()
# we remove the spaces
df.Category = df.Category.str.replace(' ', '')
# we group and count the categories
categories = df.groupby('Category')['Category'].count()
# we show the first ones
display(categories[:5])
# we show the last ones
display(categories[-5:])
# we show all values
display(categories.index.values)

#### The category variable is made up of the category and gender. Let's separate them.

In [None]:
df['Sex'] = df.Category.str[2]
display(df['Sex'][:3])

df['Category'] = df.Category.str[:2]

cols = list(df)
cols.remove('Sex')
cols.insert(8, 'Sex')
df = df[cols]
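
#### The split relies on the code having two characters for the category followed by one for the sex. A small sketch of that rule on plain strings; the sample codes and the helper `split_category` are hypothetical:

```python
def split_category(raw):
    """Split a combined code such as 'V1 H' into (category, sex)."""
    # same normalization as above: lowercase and no spaces
    code = raw.lower().replace(' ', '')
    # first two characters are the category, the third one is the sex
    return code[:2], code[2:3]


# hypothetical codes, used only for illustration
pairs = [split_category(c) for c in ['V1 H', 'se f', 'V2H']]
print(pairs)
```

#### Slicing with `code[2:3]` rather than indexing returns an empty string instead of raising an error when a code is shorter than three characters.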

 # Visualization

 ### Settings

In [None]:
sns.set()
sns.set_context('notebook', font_scale=1.3, rc={'lines.linewidth': 2})
plot_width = 12
plot_height = 8
plt.rcParams.update({'figure.max_open_warning': 0})

 ### Data visualization

#### Let's start by looking at the evolution of the number of participants per year showing those who managed to finish the race. Remember that for 2010 we do not have arrival data, so we exclude it.

In [None]:
fdf = df[df.Year != 2010]

runners = pd.DataFrame(fdf.groupby('Year')['Id'].count())
runners.rename(columns={'Id': 'Runners'}, inplace=True)
runners.reset_index(level=0, inplace=True)

finishers = pd.DataFrame(
    fdf.loc[fdf.Arrivee.notna()].groupby('Year')['Id'].count())
finishers.rename(columns={'Id': 'Finishers'}, inplace=True)
finishers.reset_index(level=0, inplace=True)

f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.set_color_codes("pastel")
sns.barplot(data=runners, x='Year', y='Runners',
            label="Runners", color='b')

sns.set_color_codes("muted")
sns.barplot(data=finishers, x='Year', y='Finishers',
            label="Finishers", color='b')

ax.set(xlabel='Year', ylabel='Runners')
ax.legend(ncol=1, loc="upper left", frameon=False)

#### We see that in the first year everyone finished. There were probably more participants, but the dataset for that year must contain only those who finished. In 2014 barely a third of the runners finished, and in 2015 and 2016 fewer than half did. The remaining years show no regular pattern. We would have to try to find out the cause.
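
#### The runner and finisher counts can also be obtained in a single pass with named aggregation. This is a sketch on made-up data, relying on the fact that `count` ignores missing values, so counting `Arrivee` counts only the finishers:

```python
import pandas as pd

# made-up results: a missing arrival time means the runner did not finish
results = pd.DataFrame({
    'Year': [2004, 2004, 2004, 2005, 2005],
    'Id': [1, 2, 3, 1, 2],
    'Arrivee': [pd.Timedelta(hours=21), pd.NaT, pd.Timedelta(hours=30),
                pd.Timedelta(hours=22), pd.NaT],
})

# one row per year with both counts
summary = results.groupby('Year').agg(
    Runners=('Id', 'count'),
    Finishers=('Arrivee', 'count'),
)
summary['FinishersPerc'] = summary['Finishers'] / summary['Runners'] * 100
print(summary)
```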

#### The winners' times are plotted below.

In [None]:
winners = pd.DataFrame(fdf.groupby('Year')['Arrivee'].min().dropna())
winners.reset_index(level=0, inplace=True)
winners['Hours'] = winners.Arrivee / np.timedelta64(1, 'h')

f, ax = plt.subplots(figsize=(plot_width, plot_height))

# sns.set_color_codes("pastel")
sns.lineplot(data=winners, x='Year', y='Hours',
             label="Winner time", color='b')
ax.set_ylim(18.5, 22.5)

ax.set(xlabel='Year', ylabel='Hours')
ax.legend(ncol=1, loc="upper left", frameon=False)

#### The 2003 time is the second fastest, which is a bit surprising: in those years mountain races were just beginning and the participants were amateurs. That professional runners did not beat it until 14 years later does not seem credible. It is more plausible that the route was modified after the first edition.

#### Is the race time related to the percentage of runners who finish? In a race that runs at an altitude of more than 2,000 m for most of its length, weather conditions and the state of the terrain may influence race speed. Let's see. We have to leave 2003 out of this analysis since there is no data on non-finishers.

In [None]:

merged1 = pd.merge(winners, runners)
merged1 = pd.merge(merged1, finishers)

merged1['FinishersPerc'] = merged1.Finishers / merged1.Runners * 100

merged1 = merged1[merged1.Year != 2003]

f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.scatterplot(data=merged1, x="FinishersPerc", y="Hours",
                s=100)

ax.set(xlabel='% finishers', ylabel='Hours')

#### Looking at the points to the right of 55%, we see a negative correlation: the higher the percentage of finishers, the better the winning time. But so far we are only looking at the winners' times, so we extend the study to the average time of all finishers.

In [None]:
total_time = pd.DataFrame(
    fdf.loc[fdf.Arrivee.notna()].groupby('Year')['Arrivee'].sum())
total_time.rename(columns={'Arrivee': 'TotalTime'}, inplace=True)

total_time['TotalHours'] = total_time.TotalTime / np.timedelta64(1, 'h')

total_time.reset_index(level=0, inplace=True)

merged2 = pd.merge(merged1, total_time)

merged2['AvgFinishersTime'] = merged2.TotalHours / merged2.Finishers


f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.scatterplot(data=merged2, x="FinishersPerc", y="AvgFinishersTime", s=100)

ax.set(xlabel='% finishers', ylabel='Hours')

#### Here there is an upward trend: in general, the higher the percentage of finishers, the higher the average number of hours per finisher. The correlation between the two series is 0.67:

In [None]:
merged3 = merged2[['FinishersPerc', 'AvgFinishersTime']]
corr = merged3.corr()
corr
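
#### By default `DataFrame.corr` computes the Pearson coefficient. As a sketch of what that number means, here is the same formula written out in plain Python on made-up values (they are not the race data, so the result differs from 0.67):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# made-up values, used only for illustration
finishers_perc = [40.0, 50.0, 60.0, 70.0]
avg_hours = [38.0, 39.0, 41.0, 40.0]
r = pearson(finishers_perc, avg_hours)
print(round(r, 2))
```

#### A value near 1 means the two series rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.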

#### Using `jointplot`, the relationship looks like this:

In [None]:
j = sns.jointplot(data=merged2, x='FinishersPerc',
                  y='AvgFinishersTime', kind='reg')
j.ax_joint.set_xlabel('% finishers')
j.ax_joint.set_ylabel('Hours')

j = sns.jointplot(data=merged2, x='FinishersPerc',
                  y='AvgFinishersTime', kind='kde')
j.ax_joint.set_xlabel('% finishers')
j.ax_joint.set_ylabel('Hours')

#### We take the opportunity to see the evolution of the average time of those who finished the race.

In [None]:
f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.lineplot(data=merged2, x='Year', y='AvgFinishersTime',
             label="Finishers average time", color='b')

ax.set(xlabel='Year', ylabel='Hours')
ax.legend(ncol=1, loc="upper left", frameon=False)

#### The average time of those who finished the race shows an upward trend until 2011, after which it begins to oscillate.

#### In terms of participants by country, the presence of French runners is overwhelming. We show two charts for easier viewing. This study uses the complete dataframe, including the year 2010, since times are not involved.

In [None]:
countries = pd.DataFrame(df.groupby('Nationality')['Nationality'].count().sort_values(ascending=False))

countries.rename(columns={'Nationality': 'Quantity'}, inplace=True)

countries.reset_index(level=0, inplace=True)
countries.rename(columns={'Nationality': 'Country'}, inplace=True)

countries = countries.loc[countries.Country != ' ']

display(countries[:10])


countries_above = countries.loc[countries.Quantity >= 50]
countries_below = countries.loc[countries.Quantity < 50]


f, ax = plt.subplots(figsize=(plot_width, plot_width))

sns.barplot(data=countries_above, x='Quantity', y='Country',
            label="Runners", color='b')

ax.set(title='Countries with 50 or more runners',
       xlabel='Quantity', ylabel='Country')
ax.legend(ncol=1, loc="lower right", frameon=False)

f, ax = plt.subplots(figsize=(plot_width, plot_width))

sns.barplot(data=countries_below, x='Quantity', y='Country',
            label="Runners", color='b')

ax.set(title='Countries with fewer than 50 runners',
       xlabel='Quantity', ylabel='Country')
ax.legend(ncol=1, loc="lower right", frameon=False)

#### Let's check what happens with participation by gender.

In [None]:
sex_df = df.pivot_table('Id', index='Year', columns='Sex', aggfunc='count')

sex_df.reset_index(level=0, inplace=True)
display(sex_df)

f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.set_color_codes("pastel")
sns.barplot(data=sex_df, x='Year', y='h',
            label="Men", color='b')

sns.set_color_codes("muted")
sns.barplot(data=sex_df, x='Year', y='f',
            label="Women", color='b')

ax.set(xlabel='Year', ylabel='Runners')
ax.legend(ncol=1, loc="upper left", frameon=False)


#### Wow, the girls' turnout is consistently only around 10%. Let's compare the performance of the girls with that of the boys.

In [None]:
times_sex_df = fdf.pivot_table('Arrivee', index='Year', columns='Sex',
                               aggfunc=['sum', 'min', 'count'])

cols = ['tf', 'th', 'mf', 'mh', 'cf', 'ch']
times_sex_df.columns = cols

times_sex_df.tf = times_sex_df.tf / np.timedelta64(1, 'h')
times_sex_df.th = times_sex_df.th / np.timedelta64(1, 'h')
times_sex_df.mf = times_sex_df.mf / np.timedelta64(1, 'h')
times_sex_df.mh = times_sex_df.mh / np.timedelta64(1, 'h')
times_sex_df['tt'] = times_sex_df.th + times_sex_df.tf


times_sex_df['af'] = times_sex_df.tf / times_sex_df.cf
times_sex_df['ah'] = times_sex_df.th / times_sex_df.ch
times_sex_df['at'] = times_sex_df.tt / (times_sex_df.ch + times_sex_df.cf)


times_sex_df.reset_index(level=0, inplace=True)
display(times_sex_df)

f, ax = plt.subplots(figsize=(plot_width, plot_height))

sns.lineplot(data=times_sex_df, x='Year', y='af',
             label="Avg girls", color='g')
sns.lineplot(data=times_sex_df, x='Year', y='ah',
             label="Avg boys", color='b')
sns.lineplot(data=times_sex_df, x='Year', y='at',
             label="Avg all", color='r')
sns.lineplot(data=times_sex_df, x='Year', y='mf',
             label="Min girls", color='m')
sns.lineplot(data=times_sex_df, x='Year', y='mh',
             label="Min boys", color='y')

ax.set(xlabel='Year', ylabel='Hours')
ax.legend(ncol=1, loc='best', frameon=False)


#### Well! The difference between the averages in the last three years is around half an hour, and in 2011 it is a matter of minutes. It also stands out that the average of all runners is dominated by the large difference in participation between boys and girls, running almost parallel to the boys' average. On the other hand, the gap by sex differs markedly between the winners and the averages: the girls' average is close to the boys', but between the winners the lines are much further apart.
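
#### The overall average follows the larger group because it is a weighted mean of the two group averages. A short worked example with made-up numbers (900 men and 100 women, not the race data):

```python
def combined_average(count_a, avg_a, count_b, avg_b):
    """Average of two groups, weighted by the size of each group."""
    return (count_a * avg_a + count_b * avg_b) / (count_a + count_b)


# made-up numbers: 900 men averaging 40.0 h and 100 women averaging 42.0 h
overall = combined_average(900, 40.0, 100, 42.0)
print(round(overall, 2))
```

#### With a 90/10 split the overall average sits one tenth of the way from the men's average to the women's, which is why the two lines run almost parallel.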

In [None]:
g = sns.catplot(data=df, x='Year', aspect=4.0, kind='count',
                hue='Category', order=range(2003, 2018))
g.set_xlabels('Year')
g.set_ylabels('Runners by category')


#### Regarding the comparative evolution of the categories, the distribution of participants is almost identical from one year to the next in recent years.

#### To see how the first n runners of each race progress through each passing point, an auxiliary function has been defined that takes the desired year and the number of runners as parameters (for clarity, using more than 10 is not recommended). The function builds the list of passing points dynamically, showing those that correspond to each year. We run it for the top 10 female and male runners. Remember that 2003 has no intermediate points, so straight lines are shown. No big comebacks are seen in any year.

In [None]:
def plot_topN(df, year, n=5, sex='all'):

    def get_year(df, year, sex):
        # filter by year
        year_df = df.loc[df.Year == year]
        
        if sex == 'h' or sex == 'f':
            year_df = year_df.loc[year_df.Sex == sex]
        
        notnas = year_df.columns[year_df.notna().any()].tolist()
        return year_df[notnas]

    
    ydf = get_year(df, year, sex)
    
    n = min(ydf.shape[0], n)

    
    topN = ydf.sort_values(by='Arrivee')[:n]
    
    cols = list(topN)[10:]
    
    cronos = topN[cols] / np.timedelta64(1, 'h')
    
    cronos['Name'] = topN.Name
    
    cronos.set_index('Name', inplace=True)

    
    plt.subplots(figsize=(plot_width, plot_height))
    plt.tight_layout(pad=2)
    
    y = np.arange(0, len(cols))
    
    for i in np.arange(0, n):
    
        x = cronos.iloc[i]
    
        plt.yticks(y, cols)
        plt.plot(x, y, label=x.name)
    plt.legend(framealpha=1, frameon=True)
    
    fig = plt.gcf()
    if sex == 'f':
        who = 'Girls'
    elif sex == 'h':
        who = 'Boys'
    else:
        who = 'Girls & Boys'
    fig.suptitle('Evolution of the top ' + str(n) +
                 '. Year: ' + str(year) + '. ' + who + '.')


for x in np.arange(2003, 2018):

    if not (x == 2010 or x == 2012):
        plot_topN(df, x, 10, 'f')
        plot_topN(df, x, 10, 'h')


#### **Boys**. In 2005 there are a few nulls. In 2006 there is a passing point with no data and a clearly wrong timing at Tseppes for runner Vincent Delebarre. In general, the gap between the first and the tenth is more than three hours, except in 2016, when it is under two hours and the lines at the finish are very close.

#### **Girls**. In 2004, 2007 and 2008 there are also a few nulls. Here the gap between the first and the tenth is much wider, except in 2016, when it is less than two hours and the lines at the finish are very close.