## Exploratory Data Analysis - Team MenChesthair United - Iditarod 2017

In this exploratory data analysis, we look at the 2017 Iditarod race, using data from <a href="https://www.kaggle.com/iditarod/iditarod-race" target="blank">Kaggle</a>. First, we explain what the Iditarod is. Next, we import the modules and the data. After that, we explore the data.

## Background information: what is Iditarod?

The Iditarod Trail Sled Dog Race is an annual long-distance sled dog race run in early March from Settler's Bay to Nome, entirely within the US state of Alaska. Mushers and a team of 16 dogs, of which at least 5 must be on the towline at the finish line, cover the distance in 8–15 days or more. The Iditarod began in 1973 as an event to test the best sled dog mushers and teams but evolved into today's highly competitive race. The second fastest winning time, a record when it was set, was recorded in 2016 by Dallas Seavey: 8 days, 11 hours, 20 minutes, and 16 seconds. As of 2012, Dallas Seavey was also the youngest musher to win the race, at the age of 25. In 2017, at the age of 57, Dallas' father Mitch Seavey became the oldest and fastest person ever to win the race, crossing the line in Nome in 8 days, 3 hours, 40 minutes and 13 seconds. Dallas finished second, two hours and 44 minutes behind (Wikipedia, April 2017).

## Import all necessary modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from copy import deepcopy
import re
sns.set()

%matplotlib inline

## Loading & setting up the Data Frame and descriptives

Before exploring the data, we load and set up the data frame with the code below. The data is loaded, column names are cleaned up, and new columns are introduced. The finish position of each musher is added. In addition, a column called 'arrival day' is added. This is useful because not all mushers arrive at the same place on the same day.

In [None]:
#importing the dataset
race = pd.read_csv('../input/report.csv')

# Lowercasing and replacing spaces with underscores (enables code such as race.column_name)
race.columns = [x.lower().replace(" ", "_") for x in race.columns]

# Select unique racers before dataframe order changes 
#racers = race.ix[0:71,0:4]

# Load the final standings so the finish position can be added to the race data
new_df = pd.read_csv('https://raw.githubusercontent.com/piecurus/JBP0102017/master/EDA-MenChesthairUnited/iditarod2017.csv',sep=';')
pos_name = new_df[['finish_position', 'name']]
#pos_name= new_df.ix[:, 0:2].drop([0])
#pos_name.columns= ['finish_position', 'name']

#inserting the position into the race dataframe
race= pd.merge(race, pos_name, on="name", how="inner")

# Create a second dataframe (with the original column names) for later operations;
# it reads the same file as `race` above
times = pd.read_csv('../input/report.csv',usecols=['Number','Name','Checkpoint','Distance','Speed','Time','Elapsed Time','Departure Dogs'])
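An inner merge silently drops every row whose name has no match in the other frame. The sketch below shows one way to check for that, using a left merge with `indicator=True`. The frames `race_demo` and `pos_demo` and the musher names are invented stand-ins, not the real data.

```python
import pandas as pd

# Invented stand-ins for the `race` and `pos_name` frames (hypothetical names)
race_demo = pd.DataFrame({
    "name": ["Musher A", "Musher A", "Musher B", "Musher C"],
    "checkpoint": ["Koyukuk", "Nulato", "Koyukuk", "Koyukuk"],
})
pos_demo = pd.DataFrame({
    "finish_position": [1, 2],
    "name": ["Musher A", "Musher B"],
})

# A left merge keeps every race row; the _merge column flags rows without a match
checked = pd.merge(race_demo, pos_demo, on="name", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "name"].unique()
print(list(unmatched))
```

Names listed in `unmatched` would be lost in an inner merge, for example because of a spelling difference between the two files.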

Five weeks ago, 72 mushers started the 2017 race. During the race, data was collected on the variables described below.

### Description of variables: 

* Number: the identification number of each musher (sled driver).
* Name: name of the musher.
* Status: rookie or veteran.
* Country: the musher's country of origin.
* Checkpoint: name of the checkpoint; all columns to the right of it describe only that checkpoint.
* Latitude: north-south position of the checkpoint (lines parallel to the equator).
* Longitude: east-west position of the checkpoint (lines running from pole to pole).
* Distance: the distance from the previous checkpoint.
* Time: time it took to get from the previous checkpoint to the current one, in decimal hours (e.g. 5.85).
* Speed: average speed between the previous and the current checkpoint, in miles per hour.
* Arrival_date: date of arrival at the checkpoint.
* Arrival_time: time of arrival at the checkpoint.
* Arrival_dogs: number of dogs that arrived at the checkpoint.
* Elapsed_time: time that elapsed while resting at the checkpoint.
* Departure_date: day of departure from the checkpoint.
* Departure_dogs: number of dogs that departed from the checkpoint.
* Finish_position: the musher's finish position in the race.
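The later cells multiply the time columns by 3600 to get seconds, which suggests the values are decimal hours. Under that assumption, a small helper (the name `decimal_hours_to_hm` is made up for this sketch) converts a value into hours and minutes:

```python
def decimal_hours_to_hm(hours):
    """Convert decimal hours (e.g. 5.85) to whole hours and rounded minutes."""
    total_minutes = round(hours * 60)
    return divmod(total_minutes, 60)

h, m = decimal_hours_to_hm(5.85)
print(f"{h} h {m} min")  # 5 h 51 min
```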

### Head, shape and info

In [None]:
race.head()

In [None]:
print(race.shape)

As we can see above, we have 1129 observations for 18 variables. 

In [None]:
print(race.info())

As we can see above, some data points are missing for some variables. This probably has to do with drop-outs.
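To see where the gaps are, the missing values can be counted per column and the affected rows listed. This is a minimal sketch on an invented frame (`demo`, with made-up names and numbers); the same calls would apply to `race`.

```python
import numpy as np
import pandas as pd

# Invented example rows; the last row mimics a musher with incomplete data
demo = pd.DataFrame({
    "name": ["Musher A", "Musher A", "Musher B", "Musher B"],
    "checkpoint": ["Koyukuk", "Nulato", "Koyukuk", "Nulato"],
    "speed": [7.5, 8.1, 6.9, np.nan],
    "departure_dogs": [16, 15, 16, np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()
# Rows that contain at least one missing value
rows_with_gaps = demo[demo.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps[["name", "checkpoint"]])
```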

In [None]:
len(race.name.unique())

As we can see, there are 71 unique names, which means the dataset contains 71 mushers (contestants).

## Exploratory Data Analysis

### Nationalities

As shown above, the dataset contains 71 mushers. Most of them, 59, come from the US. Two mushers come from Canada and all the others are from European countries. The pie chart below shows how the nationalities are distributed.

In [None]:
countries = pd.Series(race[['country','name']].drop_duplicates().country.value_counts())
print(countries)

In [None]:
countries.plot(kind='pie', figsize=(7,7))

### Rookie vs Veteran

As in every competition, there are experienced participants (veterans) and new ones (rookies). The pie chart below shows that almost a quarter of the mushers are participating for the first time.

In [None]:
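# Count each musher once; the raw data has one row per musher per checkpoint
status = race[['status','name']].drop_duplicates().status.value_counts()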
print(status)

In [None]:
status.plot(kind='pie', figsize=(7,7))
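Counting statuses per row and per musher can give different shares, because mushers who stay in the race longer contribute more rows. The sketch below illustrates the difference on invented data (`demo`), where one rookie drops out after the first checkpoint.

```python
import pandas as pd

# Invented data: three veterans with three rows each, one rookie with a single row
demo = pd.DataFrame({
    "name": ["Musher A"] * 3 + ["Musher B"] * 3 + ["Musher C"] * 3 + ["Musher D"],
    "status": ["Veteran"] * 9 + ["Rookie"],
})

# Share of rows per status (weights mushers by how many checkpoints they reached)
per_row = demo["status"].value_counts(normalize=True)
# Share of mushers per status (each musher counted once)
per_musher = demo[["name", "status"]].drop_duplicates()["status"].value_counts(normalize=True)

print(per_row["Rookie"], per_musher["Rookie"])
```

In this toy example the rookie share is 10% per row but 25% per musher.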

### Distribution of dogs

<span style="color:red">These two cells seem not to work all the time. When we start up the notebook it does, but if the cell is run a second time it gives an error.</span>

Each team is composed of twelve to sixteen dogs, and no more may be added during the race. At least five dogs must be on the towline when crossing the finish line in Nome. Mushers keep a veterinary diary on the trail and are required to have it signed by a veterinarian at each checkpoint. Dogs that become exhausted or injured may be carried in the sled's "basket" to the next "dog-drop" site. 

Therefore it is useful to analyze the number of dogs that departed from each checkpoint.

The boxplot shows that all dogs depart from the first 2 checkpoints. As the race progresses, the median drops from 16 to 10. Still, there is a big difference between the minimum and maximum number of dogs departing from each checkpoint. For example, some mushers left Elim and Safety with only 7 dogs, while others still had all 16.

In [None]:
# Calculate the cumulative time needed to reach each checkpoint for every musher
checkpoints = times['Checkpoint'].unique()
mushers = pd.DataFrame(index=checkpoints)
for nr in range(2,74):
    # .copy() avoids a SettingWithCopyWarning when adding the new column
    test = times[times['Number'] == nr].copy()
    if test.empty:
        continue  # no musher with this number
    test['total time'] = (test['Time'] + test['Elapsed Time']).cumsum()
    musher1 = test[['total time']]
    # Pad mushers who dropped out with zeros so that every column has one row per checkpoint
    if len(musher1) != len(checkpoints):
        empty = pd.DataFrame([0.0] * (len(checkpoints) - len(musher1)),columns=['total time'])
        musher1 = pd.concat([musher1,empty])
    musher1.index = checkpoints
    musher1.columns = [test['Name'].iloc[0]]
    mushers = pd.concat([mushers,musher1],axis=1)

In [None]:
# Collect the number of dogs that departed from each checkpoint for every musher
checkpoints = times['Checkpoint'].unique()
dogs = pd.DataFrame(index=checkpoints)
for nr in range(2,74):
    test = times[times['Number'] == nr]
    if test.empty:
        continue  # no musher with this number
    dogs1 = test[['Departure Dogs']]
    # Pad mushers who dropped out with zeros, one row per missed checkpoint
    if len(dogs1) != len(checkpoints):
        empty = pd.DataFrame([0.0] * (len(checkpoints) - len(dogs1)),columns=['Departure Dogs'])
        dogs1 = pd.concat([dogs1,empty])
    dogs1.index = checkpoints
    dogs1.columns = [test['Name'].iloc[0]]
    dogs = pd.concat([dogs,dogs1],axis=1)
# One row per musher, one column per checkpoint
dogs = dogs.transpose()

#Plot dogs departed per checkpoint as boxplots

dogs.plot(kind='box', figsize=(15,8))
plt.xlabel('Checkpoint')
plt.ylabel('Departure Dogs')
plt.title('Box Plots of Departure Dogs at each Checkpoint')
plt.show()
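The musher-by-checkpoint table can also be built without a loop. The sketch below uses `DataFrame.pivot` on an invented frame (`demo`) with the same column names as `times`. Unlike the zero padding used in the loop, mushers who dropped out get `NaN` for the checkpoints they never reached, so they do not pull the summary statistics down.

```python
import pandas as pd

# Invented rows; Musher B drops out before Elim
demo = pd.DataFrame({
    "Name": ["Musher A", "Musher A", "Musher A", "Musher B", "Musher B"],
    "Checkpoint": ["Koyukuk", "Nulato", "Elim", "Koyukuk", "Nulato"],
    "Departure Dogs": [16, 15, 12, 16, 14],
})

# Checkpoints in order of first appearance (pivot alone would sort them alphabetically)
order = demo["Checkpoint"].unique()
dogs_wide = (
    demo.pivot(index="Name", columns="Checkpoint", values="Departure Dogs")
        .reindex(columns=order)
)

print(dogs_wide)
print(dogs_wide.median())  # medians skip the missing values
```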

### Participants

Not everyone reaches the finish line. The following bar chart shows the number of contestants left at each checkpoint.

In [None]:
# The number of rows per checkpoint equals the number of mushers that reached it
contestents = race.checkpoint.value_counts()
ax = contestents.plot(kind="bar", figsize=(10,6), fontsize=12, colormap='summer')
ax.set_title('Number of contestants left at each checkpoint', size=14)
ax.set_ylabel('Number of contestants', size=14)
ax.set_xlabel('Checkpoints', size=14)

### Arrival and departure dogs

Not every dog makes it to the finish line either, as the first graph below shows. Around 1100 dogs start the race and around 650 of them reach the finish line. When the blue and green bars differ a lot, the leg leading to that checkpoint can be seen as demanding. An extra graph therefore shows the distance between the checkpoints.

For example, the distance between Koyukuk and Nulato is small, so almost all the dogs that arrived in Nulato also departed from there: the leg was not too demanding in terms of distance.

In [None]:
# Number the checkpoints by the position of their first row, so they can be sorted in race order
race['checkpoint_number'] = [list(race.checkpoint).index(checkpoint) for checkpoint in race.checkpoint]
# Total number of dogs arriving at and departing from each checkpoint
race_dogs = race[['checkpoint', 'departure_dogs', 'arrival_dogs', 'distance']].groupby('checkpoint')[['departure_dogs', 'arrival_dogs']].sum().reset_index()
race_dogs['dogs_dropout'] = race_dogs.arrival_dogs - race_dogs.departure_dogs
# Mean leg distance per checkpoint
distance = race[['checkpoint', 'distance', 'checkpoint_number']].groupby('checkpoint').mean().reset_index()
distance_dogs = pd.merge(distance, race_dogs, on='checkpoint', how='outer').set_index('checkpoint_number').sort_index()

# kind must be lowercase: 'Bar' is not a valid plot kind
dogs_dropout_pot = distance_dogs.plot(x='checkpoint', y=['arrival_dogs', 'departure_dogs'], kind='bar')
dogs_dropout_pot.set_ylim(600, 1200)
distance_plot = distance_dogs.plot(x='checkpoint', y='distance', kind='line', rot=90)
distance_plot.set_ylim(0,150)
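Numbering the checkpoints in order of first appearance can also be done with `pd.factorize`, which avoids scanning the whole list for every row. This is a sketch on an invented series; the resulting codes (0, 1, 2, ...) differ from the row positions used above but sort the checkpoints in the same order.

```python
import pandas as pd

# Invented checkpoint column: rows are not grouped by checkpoint
checkpoint = pd.Series(["Koyukuk", "Nulato", "Elim", "Koyukuk", "Nulato", "Koyukuk"])

# codes: one integer per row, numbered by first appearance; uniques: the distinct values
codes, uniques = pd.factorize(checkpoint)

print(list(codes))    # [0, 1, 2, 0, 1, 0]
print(list(uniques))  # ['Koyukuk', 'Nulato', 'Elim']
```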

#### Arrival and departure dogs difference between rookies and veterans

Here we visualise the dog losses of rookies and veterans to see whether the two groups differ.

In [None]:
# Calculate dog_loss for every musher and checkpoint
test_df = deepcopy(race)
test_df['dog_loss'] = test_df['arrival_dogs'] - test_df['departure_dogs']

# Calculate race_time (travel time plus rest time) for every musher and checkpoint
test_df['race_time'] = test_df['time'] + test_df['elapsed_time']
test_df['race_time (sec.)'] = test_df['race_time']*3600

In [None]:
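# Sum the numeric columns per musher and sort the mushers by total race_time
test_df_timesorted = (test_df.groupby(['status','name']).sum(numeric_only=True).reset_index()).sort_values(by='race_time', ascending=True)
# Keep only the mushers who covered the full distance, i.e. finished the race
test_df_timesorted = test_df_timesorted.loc[test_df_timesorted.distance == 968]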

In [None]:
# Separate veterans from rookies; test_df_timesorted is already sorted on race time
# and limited to the mushers who finished the race
test_df_timesorted_veterans = test_df_timesorted[test_df_timesorted.status == 'Veteran']
test_df_timesorted_rookie = test_df_timesorted[test_df_timesorted.status == 'Rookie']

In [None]:
# Scatter plot with rookie (red) and veteran (blue) dog losses on the x-axis and the race time on the y-axis
plt.scatter(test_df_timesorted_veterans['dog_loss'], test_df_timesorted_veterans['race_time'], color='blue', label='Veteran')
plt.scatter(test_df_timesorted_rookie['dog_loss'], test_df_timesorted_rookie['race_time'], color='red', label='Rookie')
plt.xlabel('Dog losses')
plt.ylabel('Race time')
plt.legend()

### The relationship between race time and speed, and between race time and dog losses

The joint plots below show density estimates of race time against speed and of race time against dog losses. The relationship between dog losses and race time appears more telling than the one between speed and race time. In both plots, the distributions look approximately normal.

In [None]:
# Visualising the occurrences of race_time against speed
sns.jointplot(x="race_time", y="speed", data=test_df_timesorted, kind='kde')

# Visualising the occurrences of race_time against dog_loss
sns.jointplot(x="race_time", y="dog_loss", data=test_df_timesorted, kind='kde')
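A correlation coefficient is one way to put a number on how "telling" a relationship is. The sketch below computes Pearson correlations with `DataFrame.corr` on invented values (`demo`); the numbers are made up for illustration and say nothing about the actual race.

```python
import pandas as pd

# Invented per-musher totals, chosen so that dog losses track race time and speed does not
demo = pd.DataFrame({
    "race_time": [200.0, 210.0, 225.0, 240.0, 260.0],
    "dog_loss": [3, 4, 6, 7, 9],
    "speed": [7.0, 6.5, 7.2, 6.4, 6.9],
})

# Pairwise Pearson correlation between all numeric columns
corr = demo.corr(method="pearson")

print(corr.loc["race_time", ["dog_loss", "speed"]])
```

A value close to 1 or -1 indicates a strong linear relationship, and a value close to 0 a weak one.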

### Drop-outs per checkpoint

The following figure suggests that, after the checkpoint Unalakleet, two groups form: a quicker one and a slower one.

In [None]:
# Show how the cumulative times of the mushers develop over the checkpoints

mushers.plot(kind='line',figsize=(18,8),legend=False)
plt.xlabel('Checkpoint')
plt.ylabel('Time in hours since start of race')
plt.title('Arrival time of each musher for all checkpoints')
plt.show()
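The cumulative time per musher shown in this plot is a running sum of travel time plus rest time within each musher's rows. A minimal sketch of that calculation with `groupby` and `cumsum`, on invented data (`demo`) that uses the column names of `times`:

```python
import pandas as pd

# Invented travel times ("Time") and rest times ("Elapsed Time") in hours
demo = pd.DataFrame({
    "Name": ["Musher A", "Musher A", "Musher A", "Musher B", "Musher B", "Musher B"],
    "Checkpoint": ["Koyukuk", "Nulato", "Elim"] * 2,
    "Time": [0.0, 3.5, 6.0, 0.0, 4.0, 7.5],
    "Elapsed Time": [0.5, 4.0, 0.0, 1.0, 6.0, 0.0],
})

# Running total per musher; the result stays aligned with the original rows
leg_total = demo["Time"] + demo["Elapsed Time"]
demo["total time"] = leg_total.groupby(demo["Name"]).cumsum()

print(demo[["Name", "Checkpoint", "total time"]])
```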

### Speed per checkpoint
In the plot below, we can see that the speed of each musher varies a lot per checkpoint. This could imply that some legs are harder than others, which might have to do with the weather or the length of the leg.

In [None]:
# Show the average speed at each checkpoint for 8 mushers (numbers 2 to 9)
checkpoints = times['Checkpoint'].unique()
mushers = pd.DataFrame(index=checkpoints)
for nr in range(2,10):
    test = times[times['Number'] == nr]
    if test.empty:
        continue  # no musher with this number
    musher1 = test[['Speed']]
    # Pad mushers who dropped out with zeros so that every column has one row per checkpoint
    if len(musher1) != len(checkpoints):
        empty = pd.DataFrame([0.0] * (len(checkpoints) - len(musher1)),columns=['Speed'])
        musher1 = pd.concat([musher1,empty])
    musher1.index = checkpoints
    musher1.columns = [test['Name'].iloc[0]]
    mushers = pd.concat([mushers,musher1],axis=1)

mushers.plot(kind='line', figsize=(18,8))
# plt.xlabel('Checkpoint')
# plt.ylabel('Average speed over distance between last and current checkpoint')
plt.title('The development of the average speed over the course of the race')
plt.show()

### The relationship between speed and number of dogs

In [None]:
# Scatterplot of the speed on each leg against the number of dogs that departed from the previous checkpoint

speed_matrix = pd.DataFrame()
for nr in range(2,74):
    test = times[times['Number'] == nr]
    if len(test) > 1:
        dogs = test['Departure Dogs'][:len(test)-1]
        speed = test['Speed'][1:]
        speed.index = dogs.index
        matrix = pd.concat([dogs,speed],axis=1)
        speed_matrix = pd.concat([speed_matrix,matrix])

speed_matrix.plot.scatter(x='Departure Dogs',y='Speed')
plt.title('Scatterplot of # of dogs vs speed')
plt.show()
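The loop above pairs the speed on a leg with the number of dogs that left the previous checkpoint. The same pairing can be sketched with a grouped `shift`, shown here on invented data (`demo`, hypothetical numbers) with the column names of `times`:

```python
import pandas as pd

# Invented rows for two mushers; the first checkpoint has no speed yet
demo = pd.DataFrame({
    "Number": [2, 2, 2, 3, 3],
    "Departure Dogs": [16, 15, 13, 16, 14],
    "Speed": [float("nan"), 8.0, 7.0, float("nan"), 6.5],
})

# For each row, take the departure count from the same musher's previous checkpoint
demo["dogs_on_leg"] = demo.groupby("Number")["Departure Dogs"].shift(1)
# Drop the first checkpoint of each musher, which has no previous row
pairs = demo.dropna(subset=["dogs_on_leg", "Speed"])[["dogs_on_leg", "Speed"]]

print(pairs)
```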

### Relationship between distance and the speed of the leg
The scatterplot below suggests a relationship between the distance between two checkpoints and the speed on that leg. Speeds vary more on legs of middle distance than on the shortest and longest legs.

In [None]:
# Scatterplot displaying the relationship between the speed and the distance between checkpoints

distance_matrix = pd.DataFrame()
for nr in range(2,74):
    test = times[times['Number'] == nr]
    if len(test) > 1:
        distance = test['Distance'][1:]
        speed = test['Speed'][1:]
        matrix = pd.concat([distance,speed],axis=1)
        distance_matrix = pd.concat([distance_matrix,matrix])

distance_matrix.plot.scatter(x='Distance',y='Speed')
plt.title('Scatterplot of distance per leg vs. speed')
plt.show()

### Do veterans do better?
This swarmplot shows that the worst finish positions (the highest numbers) are mostly taken by rookies.

In [None]:
# Bee swarm plot of status vs finish position in the race
ax= sns.swarmplot(x='status', y='finish_position', data= race.sort_values('checkpoint')[1074:], size=6)
ax.set_xlabel("Musher status (rookie or veteran)")
ax.set_ylabel("Finish position in the race")
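A per-group summary can back up what the swarm plot shows. The sketch below computes the count and median finish position per status on invented data (`demo`, one row per musher); the values are made up and do not come from the race.

```python
import pandas as pd

# Invented finish positions, one row per musher
demo = pd.DataFrame({
    "name": ["Musher A", "Musher B", "Musher C", "Musher D", "Musher E"],
    "status": ["Veteran", "Veteran", "Veteran", "Rookie", "Rookie"],
    "finish_position": [1, 2, 4, 3, 5],
})

# Number of mushers and median finish position for each status
summary = demo.groupby("status")["finish_position"].agg(["count", "median"])

print(summary)
```

A lower median means a better typical result for that group.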