# This notebook explores results from the 2018 road cycling season

The data<sup>1</sup> cover categorized races ridden by all riders on World Tour teams<sup>2</sup>: 18 teams with 522 riders in total.
#### The goal is to find features that can predict the general classification (GC) winner of the Tour de France (TdF); predicting the top 10 may be a better target.  
<sup>1</sup>Data are scraped from CQranking  
<sup>2</sup>Pro Continental teams are excluded for now

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.width', 100)
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.max_rows", 100)

In [54]:
df_riders = pd.read_csv(r'E:\cycling\cycling\data\riders_2018.csv',encoding='iso-8859-1')
df_rider_results = pd.read_csv(r'E:\cycling\cycling\data\rider_results_2018.csv',encoding='iso-8859-1')

In [55]:
df_riders.head(20)

Unnamed: 0,team,Rider,Date of birth,rank_start,point_start,rank_end,point_end,Comments,Country
0,ALM,BAGDONAS Gediminas,26/12/1985,510.0,140.0,317.0,226.0,,LTU
1,ALM,BAKELANTS Jan,14/02/1986,117.0,472.0,565.0,120.0,,BEL
2,ALM,BARBIER Rudy,18/12/1992,104.0,506.0,717.0,90.0,,FRA
3,ALM,BARDET Romain,09/11/1990,30.0,1162.0,14.0,1492.0,,FRA
4,ALM,BIDARD François,19/03/1992,781.0,80.0,523.0,137.0,,FRA
5,ALM,BOUCHARD Geoffrey,01/04/1992,,,1140.0,42.0,Trainee as from 31/07,FRA
6,ALM,CHAMPOUSSIN Clément,29/05/1998,2949.0,2.0,978.0,55.0,Trainee as from 31/07,FRA
7,ALM,CHEREL Mikael,17/03/1986,552.0,127.0,492.0,147.0,,FRA
8,ALM,CHEVRIER Clément,29/06/1992,844.0,70.0,1416.0,25.0,,FRA
9,ALM,COSNEFROY Benoit,17/10/1995,375.0,191.0,311.0,229.0,,FRA


In [6]:
df_results = df_rider_results.merge(df_riders,on='Rider',how='inner',suffixes=('_l','_r'))
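
Because the merge keys on `Rider` alone, a rider listed twice in `df_riders` would duplicate every one of that rider's result rows, and an inner join silently drops results whose rider name has no match. Whether either case occurs in this data is not confirmed above. A minimal sketch of a guard using the `validate` and `indicator` options of `merge`, on toy frames (the rider `DOE John` and all values are made up):

```python
import pandas as pd

# Toy stand-ins for df_rider_results and df_riders (illustrative values only).
toy_results = pd.DataFrame({
    'Rider': ['BARDET Romain', 'BARDET Romain', 'DOE John'],
    'Race': ['Race A', 'Race B', 'Race A'],
    'CQ': [10, 5, 3],
})
toy_riders = pd.DataFrame({
    'Rider': ['BARDET Romain', 'BAKELANTS Jan'],
    'team': ['ALM', 'ALM'],
})

# validate='many_to_one' raises MergeError if a rider name repeats in toy_riders.
# indicator=True adds a '_merge' column so unmatched rows can be inspected first.
toy_merged = toy_results.merge(toy_riders, on='Rider', how='left',
                               validate='many_to_one', indicator=True)

# Riders with results but no entry in the rider table.
unmatched = toy_merged.loc[toy_merged['_merge'] == 'left_only', 'Rider'].unique()

# Equivalent to the inner join once the unmatched rows have been reviewed.
matched = toy_merged[toy_merged['_merge'] == 'both'].drop(columns='_merge')
```

Inspecting `unmatched` before discarding those rows shows whether the inner join loses results because of spelling or encoding differences in rider names.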

Clean up the data types

Replace the non-numeric `Rank` values as follows:  
leader (leader of the stage race): 0  
\- (usually means TTT): nan  
DNF (did not finish): -1  
OOT (out of time): -2  
DNS (did not start): -5  
DQ (disqualified): -10

In [8]:
# Map status strings to the numeric codes listed above, then cast to float.
rank_codes = {'leader': 0, 'DNF': -1, 'OOT': -2, '-': np.nan, 'DNS': -5, 'DQ': -10}
df_results['Rank'] = df_results['Rank'].replace(rank_codes).astype(float)

df_results['Date'] = pd.to_datetime(df_results['Date'],format='%d/%m/%Y')
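
The negative codes keep non-finishes in a numeric column, but they sort ahead of real placings and would distort any mean or minimum of `Rank`. A minimal sketch on a toy series, assuming a hypothetical `status_codes` dict that mirrors the mapping above:

```python
import numpy as np
import pandas as pd

# Hypothetical dict mirroring the replacement table in the notebook.
status_codes = {'leader': 0, 'DNF': -1, 'OOT': -2, 'DNS': -5, 'DQ': -10, '-': np.nan}

# Toy rank column as it might look straight after scraping (made-up values).
toy_rank = pd.Series(['1', 'DNF', '12', '-', 'leader', 'DQ'])

# Map status strings to codes; anything not in the dict is kept and parsed as a number.
toy_rank_num = toy_rank.map(lambda v: status_codes.get(v, v)).astype(float)

# Real placings are strictly positive, so a simple mask separates them from status codes.
finished = toy_rank_num[toy_rank_num > 0]
```

The later cells select placings with `between(1, n)`, which already excludes the codes; the mask matters mainly when sorting or averaging ranks.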

TdF results

In [10]:
df_tdf_gc = df_results.loc[df_results['Race']=='Tour de France',['Rider','Rank']]

In [11]:
df_tdf_gc.sort_values('Rank').head(10)

Unnamed: 0,Rider,Rank
25245,THOMAS Geraint,1.0
25720,DUMOULIN Tom,2.0
24110,FROOME Chris,3.0
23036,ROGLIC Primoz,4.0
22587,KRUIJSWIJK Steven,5.0
128,BARDET Romain,6.0
14372,LANDA MEANA Mikel,7.0
29786,MARTIN Daniel,8.0
21861,ZAKARIN Ilnur,9.0
14649,QUINTANA ROJAS Nairo Alexander,10.0


Riders who participated in TdF

In [12]:
tdf_riders = df_results.loc[df_results['Race'].str.contains('Tour de France'),'Rider'].unique()
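
The GC table above uses an exact match on `Race`, while the participant list uses a substring match. A small sketch of the difference, assuming that stage rows carry a longer label that contains the race name (the stage label below is hypothetical; the real CQ naming may differ):

```python
import pandas as pd

# Hypothetical race labels, for illustration only.
toy_races = pd.Series(['Tour de France', 'Tour de France - Stage 1', 'Tour de Suisse'])

# Exact match: only the row labelled with the bare race name.
is_gc = toy_races == 'Tour de France'

# Substring match: every row whose label contains the race name.
# regex=False treats the name as literal text rather than a regular expression.
is_any_tdf = toy_races.str.contains('Tour de France', regex=False)
```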

Top ten finishers

In [14]:
top10_riders = df_tdf_gc.loc[df_tdf_gc['Rank'].between(1,10),'Rider'].values

Non top ten finishers

In [16]:
non_top10_riders = list(set(tdf_riders) - set(top10_riders))

Find races before TdF (2018-07-07)

In [17]:
mask = (df_results['Date'] < '2018-07-07')
df_pre_tdf = df_results.loc[mask]

Calculate rider performance in pre-TdF races

In [19]:
points_sum = []
race_days = []
wins = []
podiums = []
top_tens = []
for r in tdf_riders:
    df_r = df_pre_tdf[df_pre_tdf['Rider']==r]
    points_sum.append(df_r['CQ'].sum())
    race_days.append(df_r['Race'].count())
    wins.append(df_r.loc[df_r['Rank']==1,'Race'].count())
    podiums.append(df_r.loc[df_r['Rank'].between(1,3),'Race'].count())
    top_tens.append(df_r.loc[df_r['Rank'].between(1,10),'Race'].count())
   
df_performance = pd.DataFrame(
    {'riders': tdf_riders,
     'win': wins,
     'podium': podiums,
     'top_ten': top_tens,
     'points': points_sum,
     'race_days': race_days,
    },columns=['riders','win','podium','top_ten','points','race_days'])
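
The per-rider loop can also be written as a single grouped aggregation. A minimal sketch on a toy frame, assuming the same `Rider`, `Race`, `Rank` and `CQ` columns as `df_pre_tdf` (rider names and values are made up):

```python
import pandas as pd

# Toy stand-in for df_pre_tdf.
toy_pre = pd.DataFrame({
    'Rider': ['A', 'A', 'A', 'B', 'B'],
    'Race': ['r1', 'r2', 'r3', 'r1', 'r2'],
    'Rank': [1.0, 3.0, -1.0, 7.0, 15.0],
    'CQ': [50, 20, 0, 8, 0],
})

toy_perf = (
    toy_pre
    # Boolean flags per result row; summing them per rider gives the counts.
    .assign(win=toy_pre['Rank'].eq(1),
            podium=toy_pre['Rank'].between(1, 3),
            top_ten=toy_pre['Rank'].between(1, 10))
    .groupby('Rider')
    .agg(win=('win', 'sum'),
         podium=('podium', 'sum'),
         top_ten=('top_ten', 'sum'),
         points=('CQ', 'sum'),
         race_days=('Race', 'count'))
    .reset_index()
)
```

One difference from the loop: a rider with no rows before the cutoff does not appear in the grouped result, whereas the loop records zeros for that rider. Reindexing the grouped frame on the full rider list with `fill_value=0` would restore those rows.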

In [22]:
df_performance.sort_values(by=['win', 'podium','top_ten'],ascending=False).head(10)

Unnamed: 0,riders,win,podium,top_ten,points,race_days
71,VALVERDE BELMONTE Alejandro,11,17,26,1600,42
105,GROENEWEGEN Dylan,9,10,11,438,29
61,IMPEY Daryl,8,16,19,838,49
109,ROGLIC Primoz,7,14,18,982,37
115,KWIATKOWSKI Michal,7,11,17,805,43
74,GAVIRIA RENDON Fernando,7,10,12,312,43
39,SAGAN Peter,6,14,26,1246,43
112,BERNAL GOMEZ Egan Arley,6,11,20,866,39
51,GREIPEL André,6,10,16,438,46
30,VAN AVERMAET Greg,5,10,21,785,47


In [23]:
df_performance[df_performance['riders'].isin(top10_riders)]

Unnamed: 0,riders,win,podium,top_ten,points,race_days
0,BARDET Romain,1,6,17,772,32
67,LANDA MEANA Mikel,1,4,13,500,33
68,QUINTANA ROJAS Nairo Alexander,1,7,14,606,31
103,ZAKARIN Ilnur,0,0,4,190,34
107,KRUIJSWIJK Steven,0,0,9,362,32
109,ROGLIC Primoz,7,14,18,982,37
114,FROOME Chris,3,4,15,837,41
119,THOMAS Geraint,4,11,15,804,36
122,DUMOULIN Tom,1,5,12,646,39
141,MARTIN Daniel,1,3,8,370,37


We see that:
1. Many riders with strong pre-TdF results are sprinters, such as Dylan Groenewegen, who do not contend for the TdF general classification.  
2. TdF top-10 riders don't necessarily perform well in the early season, as they are still building up towards the TdF.  

Let's look instead at the Critérium du Dauphiné and the Tour de Suisse, which are TdF preparation races and good indicators of rider form leading up to the TdF.

In [101]:
def get_race_performance(race_name):
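    # Python 3 strings are already Unicode, so no decode is needed; match the name literally, not as a regex.
    df_race = df_results.loc[df_results['Race'].str.contains(race_name, regex=False),:]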
    riders = df_race['Rider'].unique()
    points_sum = []
    wins = []
    podiums = []
    top_tens = []
    for r in riders:
        df_r = df_race[df_race['Rider']==r]
        points_sum.append(df_r['CQ'].sum())
        wins.append(df_r.loc[df_r['Rank']==1,'Race'].count())
        podiums.append(df_r.loc[df_r['Rank'].between(1,3),'Race'].count())
        top_tens.append(df_r.loc[df_r['Rank'].between(1,10),'Race'].count())

    df_race_performance = pd.DataFrame(
        {'riders': riders,
         'win': wins,
         'podium': podiums,
         'top_ten': top_tens,
         'points': points_sum,
        },columns=['riders','win','podium','top_ten','points'])
    
    return df_race_performance

In [102]:
df_dauphine_perform = get_race_performance('Critérium du Dauphiné')
df_swiss_perform = get_race_performance('Tour de Suisse')

In [103]:
df_dauphine_perform.sort_values(by=['win', 'podium','top_ten'],ascending=False).head(10)

Unnamed: 0,riders,win,podium,top_ten,points
102,THOMAS Geraint,2,5,6,372
99,KWIATKOWSKI Michal,2,2,3,99
55,YATES Adam,1,3,6,268
120,MARTIN Daniel,1,3,5,230
52,IMPEY Daryl,1,2,3,81
63,ALAPHILIPPE Julian,1,2,3,92
100,MOSCON Gianni,1,2,3,45
28,ACKERMANN Pascal,1,2,2,63
97,CASTROVIEJO NICOLAS Jonathan,1,1,2,17
98,GEOGHEGAN HART Tao,1,1,2,59


In [105]:
df_swiss_perform.sort_values(by=['win', 'podium','top_ten'],ascending=False).head(10)

Unnamed: 0,riders,win,podium,top_ten,points
24,PORTE Richie,2,3,5,337
23,KÜNG Stefan,2,2,2,105
106,ANDERSEN Søren Kragh,1,3,3,85
34,SAGAN Peter,1,2,5,90
61,QUINTANA ROJAS Nairo Alexander,1,2,4,209
27,VAN GARDEREN Tejay,1,2,2,34
126,ULISSI Diego,1,1,5,126
14,COLBRELLI Sonny,1,1,3,56
26,VAN AVERMAET Greg,1,1,2,19
36,DEMARE Arnaud,1,1,2,50


We're getting closer to isolating the actual GC riders, such as Geraint Thomas, but there are still many irrelevant riders.  
These features might help predict stage wins, but for GC results we need to scrutinize further.  
Also, some big contenders, such as Chris Froome, rode neither race.  
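
One way to keep riders who skipped a preparation race is to left-join each race's features onto the full TdF rider list and record participation explicitly. A minimal sketch, assuming frames shaped like the output of `get_race_performance`; the toy frames below reuse two `points` values from the tables above, and the `*_rode` flag columns are hypothetical names:

```python
import pandas as pd

# Toy stand-ins for tdf_riders, df_dauphine_perform and df_swiss_perform.
toy_tdf_riders = ['THOMAS Geraint', 'FROOME Chris', 'QUINTANA ROJAS Nairo Alexander']
toy_dauphine = pd.DataFrame({'riders': ['THOMAS Geraint'], 'points': [372]})
toy_swiss = pd.DataFrame({'riders': ['QUINTANA ROJAS Nairo Alexander'], 'points': [209]})

features = pd.DataFrame({'riders': toy_tdf_riders})
for prefix, df_race in [('dauphine', toy_dauphine), ('swiss', toy_swiss)]:
    # Keep only the points column here so the two merges cannot collide on column names.
    renamed = df_race[['riders', 'points']].rename(columns={'points': f'{prefix}_points'})
    features = features.merge(renamed, on='riders', how='left')
    # Flag participation before filling, so "did not ride" differs from "rode and scored 0".
    features[f'{prefix}_rode'] = features[f'{prefix}_points'].notna()
    features[f'{prefix}_points'] = features[f'{prefix}_points'].fillna(0)
```

With this layout, a rider such as Chris Froome stays in the feature table with zero points and both flags set to `False`, instead of disappearing from the analysis.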

### Next step: figure out how to extract more information, e.g. rider specialty and stage type.