# Five Facts About Le Tour

# Table of Contents

1. [Data](#Data)
2. [Objectives](#Objectives)
3. [Imports](#Imports)
4. [Review Data and Clean](#Review-Data-and-Clean)
5. [Fact 1 Tour de France has a History of Doping](#Fact-1---Tour-de-France-has-a-History-of-Doping)
6. [Fact 2 The Tour is Very European](#Fact-2---The-Tour-is-Very-European)
7. [Fact 3 Le Tour's Course is Always Changing](#Fact-3---Le-Tour's-Course-is-Always-Changing)
8. [Fact 4 It is a Very Difficult Race](#Fact-4---It-is-a-Very-Difficult-Race)
9. [Fact 5 - Cannibalism](#Fact-5---Cannibalism)


# Data

The Tour de France is without a doubt one of the most historic, brutal and controversial sporting events. This dataset holds a plethora of information, ranging from dates and distances to start and finish locations. By looking at this data, we can get a good idea of how the race once was and how it has evolved over time. Let's get started!

# Objectives

- Let's first understand what kind of data we have, how clean it is and if it needs any modifications.
- Review winner data and doping.
- See which nations participate in the Tour de France.
- Analyze the course starts and finishes. Which locations have the most Tour de France appearances?
- Get a feel for the length of the race, and how the difficulty of the race has evolved over time. 
- Who are the famous stage winners, and what kind of riding style do they have? 

# Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Review Data and Clean

In [None]:
df = pd.read_csv('../input/stages_TDF.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
plt.figure(figsize=(9,6))
sns.heatmap(df.isnull(),cmap='summer',yticklabels=False,cbar=False)
plt.title('Missing Data?',fontsize=20)
plt.xticks(fontsize=15,rotation=90)
#plt.xticks(fontsize=10)
plt.show()
#df.isnull()

In [None]:
df[df['Winner_Country'].isnull()].head(10)

Most null data in the 'Winner_Country' column comes from 'Team time trial' events. My guess is that teams often have members from various nations, which makes it hard to assign a single country to the win.

However, there are also other very interesting events that resulted in missing data that were not 'Team time trial' events. 

- In 1998, the [Festina affair](https://en.wikipedia.org/wiki/Festina_affair) caused numerous teams to strike and protest the race after one team was caught with illegal performance-enhancing drugs.

- Stage 16 of the 1995 Tour de France was held in a non-competitive spirit following the death of [Fabio Casartelli](https://en.wikipedia.org/wiki/Fabio_Casartelli), who died a stage earlier on a descent in the mountains.

- In the 5th stage of 1982, steel workers from Usinor blocked the road.

- The 12th stage of the 1978 tour ended when the riders protested an early start after a late finish the night before. They protested by riding at an average of 12 miles per hour, reaching the finish line behind schedule.

- In stage 18 of 1977, no winner was declared because the first two finishers were caught cheating, and the rest of the riders finished at the same time.
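
The guess that team time trials account for most of the missing values can be checked directly by counting the null rows per stage type. The sketch below runs on a tiny made-up frame (`sample` and its rows are assumptions for illustration, not the real file); the same expression should work on `df`.

```python
import pandas as pd

# Hypothetical miniature of the stages table, not the real stages_TDF.csv.
sample = pd.DataFrame({
    'Type': ['Team time trial', 'Team time trial', 'Plain stage', 'Mountain stage'],
    'Winner_Country': [None, None, None, 'FRA'],
})

# Keep only the rows with no winner country, then count them by stage type.
missing_by_type = sample.loc[sample['Winner_Country'].isnull(), 'Type'].value_counts()
print(missing_by_type)
```

Any stage type other than 'Team time trial' that shows up in this count points to one of the unusual events listed above.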

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year

In [None]:
df.duplicated().value_counts()

In [None]:
df['Date'].duplicated().value_counts()

Interesting. No duplicate rows, but some values in the 'Date' column repeat. Are there dates where more than one stage was held per day?

In [None]:
df[df['Date']=='1991-07-07']

Pretty gnarly: on July 7, 1991, there were two stages on the same day. One was a traditional 70 mile plain stage, fairly easy by Tour de France standards. The second was a team time trial. Also notice the 'NaN' for Winner_Country in the time trial row. Maybe that is because teams often have members from various countries.
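
To list every day with more than one stage rather than checking a single date, `duplicated(keep=False)` can flag all rows that share a date. This is a minimal sketch on invented rows (the stage types for the neighbouring dates are assumptions); on the real data, `sample` would be `df`.

```python
import pandas as pd

# Hypothetical rows; only the doubled 1991-07-07 date mirrors the notebook.
sample = pd.DataFrame({
    'Date': pd.to_datetime(['1991-07-06', '1991-07-07', '1991-07-07', '1991-07-08']),
    'Type': ['Individual time trial', 'Plain stage', 'Team time trial', 'Plain stage'],
})

# keep=False marks every row of a repeated date, not just the later copies.
split_days = sample[sample['Date'].duplicated(keep=False)]

# Number of stages held on each of those days.
stages_per_day = split_days.groupby('Date').size()
print(stages_per_day)
```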

In [None]:
df['Type'].value_counts()

## Convert Kilometers to Miles

Let's create a column that displays miles instead of kilometers for our non-metric system friends.

In [None]:
df['DistanceInMiles'] = df['Distance'] * 0.621371  # 1 km is about 0.621371 miles

## Reorganize Terrain Type

In [None]:
df['Type'].value_counts()

I understand that I may be losing some meaning by aggressively restructuring the values in the 'Type' column, but having so many types of flat stages, plains stages and mountain stages makes the data harder to understand.
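
One way to limit that loss, sketched here on a made-up frame, is to keep the original labels in a separate column before regrouping. The names `STAGE_GROUPS`, `TypeRaw` and `sample` are assumptions introduced for illustration; the labels themselves are the ones grouped in this notebook.

```python
import pandas as pd

# Lookup table from detailed stage labels to the two broad groups.
STAGE_GROUPS = {
    'Plain stage': 'Flat stage',
    'Flat Stage': 'Flat stage',
    'Flat cobblestone stage': 'Flat stage',
    'Plain stage with cobblestones': 'Flat stage',
    'Stage with mountain(s)': 'Mountain stage',
    'High mountain stage': 'Mountain stage',
    'Medium mountain stage': 'Mountain stage',
    'Stage with mountain': 'Mountain stage',
    'Mountain Stage': 'Mountain stage',
}

# Hypothetical rows standing in for df.
sample = pd.DataFrame({'Type': ['Plain stage', 'High mountain stage', 'Team time trial']})

# Preserve the detailed label, then map; unmapped labels keep their original value.
sample['TypeRaw'] = sample['Type']
sample['Type'] = sample['Type'].map(STAGE_GROUPS).fillna(sample['Type'])
print(sample)
```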

In [None]:
def changeStageNames(x):
    
    plainsList = ['Plain stage', 'Flat Stage','Flat cobblestone stage','Plain stage with cobblestones']
    mountainsList = ['Stage with mountain(s)','High mountain stage','Medium mountain stage','Stage with mountain','Mountain Stage']
    
    if x in plainsList:
        return 'Flat stage'
    if x in mountainsList:
        return 'Mountain stage'
    else:
        return x

In [None]:
df['Type'] = df['Type'].apply(changeStageNames)

In [None]:
df['Type'].value_counts()

# Fact 1 - Tour de France has a History of Doping

Doping in professional cycling is widespread. Check out this Wikipedia article: in certain eras, most of the podium finishers were [doping](https://en.wikipedia.org/wiki/Doping_at_the_Tour_de_France).

In [None]:
df[df['Winner']=='Alberto Contador']

In [None]:
df[df['Winner']=='Alberto Contador[n 1]']

In [None]:
df[df['Winner']=='Jan Ullrich'].head(5)

In [None]:
df[df['Winner']=='Jan Ullrich[n 1]']

In [None]:
df[df['Winner']=='Lance Armstrong']

In [None]:
df[df['Winner']=='Lance Armstrong[n 1]'].head(5)

It seems that Armstrong is singled out in this dataset, while other famous dopers were left unmarked by the '[n 1]' symbol. Let's remove the marking so that all of his stage wins are counted under a single name.
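
A more general alternative, shown only as a sketch, is to strip any trailing footnote marker with a regular expression while keeping a flag of which rows were marked. The pattern assumes the markers always look like `[n 1]` at the end of the name; `Some Rider[n 2]` is an invented value to show the pattern handles other numbers.

```python
import pandas as pd

# Hypothetical winner names; only the Armstrong marker appears in the notebook.
winners = pd.Series(['Lance Armstrong[n 1]', 'Lance Armstrong', 'Eddy Merckx', 'Some Rider[n 2]'])

# Assumed marker shape: "[n <digits>]" at the very end of the name.
marker = r'\[n \d+\]$'

# Remember which rows carried a marker before removing it.
flagged = winners.str.contains(marker, regex=True)
cleaned = winners.str.replace(marker, '', regex=True).str.strip()
print(cleaned.value_counts())
```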

In [None]:
def lanceArmstrong(x):
    if x == 'Lance Armstrong[n 1]':
        return 'Lance Armstrong'
    return x

In [None]:
df['Winner'] = df['Winner'].apply(lanceArmstrong)

In [None]:
df['Winner'].value_counts().head(10)

# Fact 2 - The Tour is Very European

Le Tour is massively European. Let's take a closer look by examining the winners' native countries.

In [None]:
pre1970 = df[df['Year']<1970]  # strictly before 1970, so it does not overlap post1970
post1970 = df[df['Year']>=1970]
post2000 = df[df['Year']>=2000]
post2010 = df[df['Year']>=2010]

In [None]:
plt.figure(figsize=(9,6))
pre1970['Winner_Country'].value_counts().head(15).plot(kind='bar')
plt.title('Stage Winners by Country Origin (Pre 1970)',fontsize=20)
plt.xticks(fontsize=12,rotation=90)
plt.show()
print('Stage Winners by Country Origin\n')
print(pre1970['Winner_Country'].value_counts().head(15))

France, Belgium and Italy clearly take most of the stage wins before 1970.

In [None]:
plt.figure(figsize=(9,6))
post1970['Winner_Country'].value_counts().head(15).plot(kind='bar')
plt.title('Stage Winners by Country Origin (Year 1970 and Beyond)',fontsize=20)
plt.xticks(fontsize=12,rotation=0)
plt.show()
print('Stage Winners by Country Origin\n')
print(post1970['Winner_Country'].value_counts().head(15))

France and Belgium still hold the top two spots.

In [None]:
plt.figure(figsize=(9,6))
post2000['Winner_Country'].value_counts().head(15).plot(kind='bar')
plt.title('Stage Winners by Country Origin (Year 2000 and Beyond)',fontsize=20)
plt.xticks(fontsize=12,rotation=0)
plt.show()
print('Stage Winners by Country Origin\n')
print(post2000['Winner_Country'].value_counts().head(15))

Pretty clear here that the French are losing their grip on the Tour de France stage victories, and Belgium is now in the middle of the pack.
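
The trend can also be expressed as a single number per decade: the share of stages won by French riders. The sketch below uses invented rows and assumes France is coded as `'FRA'` in `Winner_Country` (the notebook only confirms the `FRG` and `COL` codes).

```python
import pandas as pd

# Hypothetical stage results, for illustration only.
sample = pd.DataFrame({
    'Year': [1955, 1958, 1962, 2003, 2005, 2011, 2012, 2014],
    'Winner_Country': ['FRA', 'FRA', 'BEL', 'USA', 'FRA', 'GBR', 'GBR', 'GER'],
})

# Bucket each stage into its decade, e.g. 1958 -> 1950.
sample['Decade'] = (sample['Year'] // 10) * 10

# True where the winner is French ('FRA' is an assumed code); missing countries count as False.
is_french = sample['Winner_Country'].eq('FRA')

# The mean of a boolean column is the share of True values.
french_share = is_french.groupby(sample['Decade']).mean()
print(french_share)
```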

In [None]:
plt.figure(figsize=(9,6))
post2010['Winner_Country'].value_counts().head(15).plot(kind='bar')
plt.title('Stage Winners by Country Origin (Year 2010 and Beyond)',fontsize=20)
plt.xticks(fontsize=12,rotation=0)
plt.show()
print('Stage Winners by Country Origin\n')
print(post2010['Winner_Country'].value_counts().head(15))

Recently, with Mr. Cavendish, Mr. Froome and Team Sky, the Tour has seen a British invasion.

In [None]:
df[df['Winner_Country']=='FRG']['Year'].value_counts()

FRG is West Germany

In [None]:
df[df['Winner_Country']=='COL']['Type'].value_counts()

COL is Colombia

Europe and North America aside, Colombia makes a notable appearance in the Tour. Colombians are also well known for their climbing ability. Andes, anyone?

# Fact 3 - Le Tour's Course is Always Changing

Le Tour de France is not the same route every year. The race course is always changing, making it very dynamic and exciting to watch. 
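
One rough way to quantify how much the course changes is to count how often the same origin and destination pair is reused. This is only a sketch on invented rows, and it treats a stage as identical whenever both towns match, which ignores the actual road taken.

```python
import pandas as pd

# Hypothetical stages; the towns are placeholders, not real Tour data.
sample = pd.DataFrame({
    'Year': [2001, 2001, 2002, 2002, 2003],
    'Origin': ['Paris', 'Lyon', 'Paris', 'Nice', 'Pau'],
    'Destination': ['Lille', 'Grenoble', 'Lille', 'Marseille', 'Bordeaux'],
})

# How many times each (Origin, Destination) pair has been ridden.
route_counts = sample.groupby(['Origin', 'Destination']).size()

# Share of distinct routes that were used exactly once.
share_unique = (route_counts == 1).mean()
print(route_counts)
print(share_unique)
```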

In [None]:
plt.figure(figsize=(9,6))
df['Destination'].value_counts().head(20).plot(kind='bar',color='green')
plt.xlabel('Destinations',fontsize=16)
plt.ylabel('Count',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Count of Destinations',fontsize=20)
plt.show()
print(df['Destination'].value_counts().head(20))

In [None]:
plt.figure(figsize=(9,6))
df['Origin'].value_counts().head(20).plot(kind='bar',color='purple')
plt.xlabel('Origin',fontsize=16)
plt.ylabel('Count',fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Count of Origins',fontsize=20)
plt.show()
print(df['Origin'].value_counts().head(20))

In [None]:
plt.figure(figsize=(20,8))
df.groupby('Year').count()['Type'].plot(kind='bar')
plt.title('Number of Stages by Year',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=16)
plt.show()

Since the late 80s, the Tour has been very consistent in the number of stages per edition.

There are a couple of gaps when Le Tour was not held.  
- From 1915 to 1918 because of WW1.
- From 1940 to 1946 because of WW2.
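
The years without a race can also be read straight off the data instead of the bar chart. The helper below, `missing_years`, is a hypothetical name introduced for this sketch; on the real data it could be called with `df['Year'].unique()`.

```python
def missing_years(years):
    """Return the years between the first and last edition that have no stages."""
    present = {int(y) for y in years}
    return [y for y in range(min(present), max(present) + 1) if y not in present]

# Small example list of edition years around the First World War.
years_with_stages = [1912, 1913, 1914, 1919, 1920]
gaps = missing_years(years_with_stages)
print(gaps)  # [1915, 1916, 1917, 1918]
```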

# Fact 4 - It is a Very Difficult Race

In [None]:
plt.figure(figsize=(20,8))
df.groupby('Year')['DistanceInMiles'].sum().plot(kind='bar')
plt.title('Total Miles Ridden per Tour',fontsize=20)
plt.yticks(fontsize=18)
plt.ylabel('Distance in Miles',fontsize=18)
plt.xticks(fontsize=12)
plt.xlabel('Year',fontsize=18)
plt.show()

https://en.wikipedia.org/wiki/1926_Tour_de_France

In 1926, the longest Tour ever, the organizer led the riders around the perimeter of France.
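
The longest edition can be picked out of the per-year totals with `idxmax`. The distances below are invented round numbers, used only to show the mechanics; on the real data the same two lines would run on `df`.

```python
import pandas as pd

# Hypothetical stage distances in kilometers, not real Tour figures.
sample = pd.DataFrame({
    'Year': [1925, 1925, 1926, 1926, 1927],
    'Distance': [300.0, 250.0, 400.0, 350.0, 200.0],
})

# Total distance per edition, then the year with the largest total.
total_km = sample.groupby('Year')['Distance'].sum()
longest_year = total_km.idxmax()
print(longest_year, total_km.loc[longest_year])
```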

In [None]:
plt.figure(figsize=(20,8))
df.groupby('Year')['DistanceInMiles'].mean().plot(kind='bar')
plt.title('Average Miles Ridden per Stage per Tour',fontsize=20)
plt.yticks(fontsize=18)
plt.ylabel('Distance in Miles',fontsize=18)
plt.xticks(fontsize=12)
plt.xlabel('Year',fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(9,6))
sns.histplot(df['DistanceInMiles'],kde=True)  # distplot is deprecated in recent seaborn releases
plt.title('Histogram of Tour de France Stage Distance in Miles',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Distance in Miles',fontsize=12)
plt.show()
print(f"The median distance of a Tour de France stage is {df['DistanceInMiles'].median():.1f} miles.")
print(f"The shortest Tour de France stage was {df['DistanceInMiles'].min():.1f} miles.")
print(f"The longest Tour de France stage was {df['DistanceInMiles'].max():.1f} miles.")

In [None]:
plt.figure(figsize=(9,6))
df['Type'].value_counts().plot(kind='bar')
plt.title('Stage Types',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

# Fact 5 - Cannibalism

[Eddy Merckx](https://en.wikipedia.org/wiki/Eddy_Merckx), AKA 'The Cannibal', is the most decorated Tour de France rider with 34 stage wins.

In [None]:
df['Winner'].value_counts().head(5)

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df[df['Winner']=='Eddy Merckx'],x='Year',hue='Type')
plt.title('Eddy Merckx Wins by Stage Type',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Wins',fontsize=16)
plt.legend(loc=1)
plt.show()
df[df['Winner']=='Eddy Merckx'].groupby('Year').count()['Winner']

As you can see from the plot above, the Cannibal feasted on his prey on various stage types over a span of 6 years. It should also be noted that Merckx is widely considered the best competitive cyclist of all time, with 11 [Grand Tour](https://en.wikipedia.org/wiki/Grand_Tour_(cycling)) victories, victories in all of the [Classics](https://en.wikipedia.org/wiki/Classic_cycle_races), 3 [world championships](https://en.wikipedia.org/wiki/UCI_Road_World_Championships) and a broken [hour record](https://en.wikipedia.org/wiki/Hour_record). Here is a [video](https://www.youtube.com/watch?v=KNCamaNuxwE) of Merckx riding in le Tour.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df[df['Winner']=='Mark Cavendish'],x='Year',hue='Type')
plt.title('Mark Cavendish Wins by Stage Type',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Wins',fontsize=16)
plt.legend(loc=1)
plt.show()
df[df['Winner']=='Mark Cavendish'].groupby('Year').count()['Winner']

Mr. Cavendish can be considered one of the best sprinters ever. All of his victories have come from punchy sprints at the end of flat stages. Check out some of his sprinting victories in this [video](https://www.youtube.com/watch?v=7PPlYnhWYj0). Pretty amazing, and scary!
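
A rider's style can also be summarized as a table instead of a plot, using `pd.crosstab` to count wins per stage type. The rows below are invented to show the shape of the result and do not reflect the riders' real records.

```python
import pandas as pd

# Hypothetical wins, for illustration only.
sample = pd.DataFrame({
    'Winner': ['Mark Cavendish', 'Mark Cavendish', 'Eddy Merckx', 'Eddy Merckx', 'Eddy Merckx'],
    'Type': ['Flat stage', 'Flat stage', 'Mountain stage', 'Individual time trial', 'Flat stage'],
})

# Raw win counts per rider and stage type.
wins_by_type = pd.crosstab(sample['Winner'], sample['Type'])

# Same table as row shares, so each rider's row sums to 1.
type_share = pd.crosstab(sample['Winner'], sample['Type'], normalize='index')
print(wins_by_type)
print(type_share)
```

A pure sprinter shows up as a row with a share close to 1.0 in the 'Flat stage' column, while an all-rounder is spread across several columns.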

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df[df['Winner']=='Bernard Hinault'],x='Year',hue='Type')
plt.title('Bernard Hinault Wins by Stage Type',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Wins',fontsize=16)
plt.legend(loc=1)
plt.show()
df[df['Winner']=='Bernard Hinault'].groupby('Year').count()['Winner']

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df[df['Winner']=='André Leducq'],x='Year',hue='Type')
plt.title('André Leducq Wins by Stage Type',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Wins',fontsize=16)
plt.legend(loc=1)
plt.show()
df[df['Winner']=='André Leducq'].groupby('Year').count()['Winner']

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df[df['Winner']=='Lance Armstrong'],x='Year',hue='Type')
plt.title('Lance Armstrong Wins by Stage Type',fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year',fontsize=16)
plt.ylabel('Wins',fontsize=16)
plt.legend(loc=2)
plt.show()
df[df['Winner']=='Lance Armstrong'].groupby('Year').count()['Winner']

Armstrong went from being a flat stage sprinting cannonball to a climbing, time trialing animal. This transition also falls in the period for which he later admitted to doping.

And for all of the Armstrong fans, here is his famous ['look back'](https://www.youtube.com/watch?v=F94TCxLYZew) Alpe d'Huez performance. Enjoy.

# Thank You!