## Gathering data from procyclingstats.com

This notebook scrapes the input data from [procyclingstats.com](https://www.procyclingstats.com/) using the [**procyclingsstats**](https://github.com/themm1/procyclingstats) scraping library. I add some high-level cleaning and assembling functionality on top to make the scraping easier.

It collects:
- For a large number of riders from the best teams...
- Metadata for each rider, but most importantly...
- Their results in one-day or multi-stage...
- High-level races...
- For up to a few years in the past

The data is transformed into a simple matrix (pandas DataFrame) format, so that it can be used in the next step's algorithm to find hidden factors (called embeddings) determining a racer's and a race's profile. All having to specify very little about the type of race! Ready, set, go!

In [1]:
# TODO: parse all races
# TODO: general classification slug gets results from final stage, not yet gc

## Imports

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from procyclingstats import (
    Race,          # Race("race/tour-de-france/2022/overview").parse()
    Rider,         # Rider("rider/tadej-pogacar").parse()
    Stage,         # Stage("race/tour-de-france/2018/stage-18").parse()
    Team,          # Team("team/bora-hansgrohe-2021").parse()
    RiderResults,  # RiderResults("rider/alberto-contador/results").parse()
    RaceStartlist,
    RaceClimbs,
    Ranking        # Ranking("rankings/me/individual").parse() --> Summation of PCS points over a 12-month + 2 weeks overlap period
)

The scraping classes I focus on are: `Race`, `Rider`, `Stage`, and `Team`.

## Functions

In [3]:
def print_parse_info(slug):
    print(f"Parsing > {slug} ...")
    
def try_to_parse(obj, slug, printit=False):
    if printit:
        print_parse_info(slug)
    
    p = None  # fallback
    try:
        p = obj(slug).parse()
    except:
        print(f"Oopsie! This one failed: {slug}")
    return p

def parse_results_from_stage(stage, rid="results"):
    results = None  # fallback
    if stage is not None:
        if stage[rid] is not None:
            results = [(r["rider_name"], r["rank"]) for r in stage[rid]]  # e.g. [(WVA, 1), (MVDP, 2), (Pogiboy, 3), ...]
    return results

## Config

In [4]:
YEARS = [2018, 2019, 2020, 2021, 2022, 2023]

I use the 2023 races as base calendar, inluding only UCI Worldtour, UCI ProSeries, and Europe Tour races. Of course, races change over the years but not so much. U23 (xU) and championships (NN/CC) races are dropped. I also had to remove a few duplicates. The idea is that we deduce the most important riders based on who participated in these races. Doing the inverse seems less straightforward with the API package.

In [5]:
df_races = pd.read_excel("../data/races.xlsx")
df_races = df_races.dropna()
RACES = df_races.set_index("Race").transpose().to_dict("list")

In [6]:
df_races.Class.unique().tolist()  # 1.x = one-day race, 2.x = multi-day race & .UWT > .Pro > .1 > .2

['2.UWT', '1.UWT', '2.Pro', '1.Pro', '1.1', '1.2', '2.1', '2.2']

In [7]:
cutoff_date = "2023-04-30"
print(cutoff_date)

2023-04-30


## Parse results

In [8]:
lst_out = []
for year in YEARS:
    races, classes, stages = [], [], []
    print(f"----- {year} -----\nParsing...")
    for race_key, race_info in RACES.items():  
        _, race_class, race_slug = race_info
        race_slug_full = f"race/{race_slug}/{year}/overview"   
        race_p = try_to_parse(Race, race_slug_full)     
        if race_p is None:
            continue
        else:
            # do not process if race end date is beyond dataset cutoff date
            # but keep going, because races are not ordered chronologically
            if race_p["enddate"] > cutoff_date:
                continue
            stage_slug_base = race_slug_full.replace("overview", "")  # has general classification if multi-day race
            if race_p["is_one_day_race"] is True:
                stage_slugs = [stage_slug_base]  # single stage
            elif "stages" in race_p:
                stage_slugs = [stage_slug_base] + [s["stage_url"] for s in race_p["stages"]]  # multiple stages
            races += [race_key] * len(stage_slugs)
            classes += [race_class] * len(stage_slugs)
            stages += stage_slugs
    lst_out.append(pd.DataFrame({"year": year, "race": races, "class": classes, "stage_slug": stages}))
    print("")
        
df_races_out = pd.concat(lst_out)

----- 2018 -----
Parsing...
Oopsie! This one failed: race/uae-tour/2018/overview
Oopsie! This one failed: race/zlm-tour/2018/overview
Oopsie! This one failed: race/mont-ventoux-denivele-challenge/2018/overview
Oopsie! This one failed: race/maryland-cycling-classic/2018/overview
Oopsie! This one failed: race/giro-del-veneto/2018/overview
Oopsie! This one failed: race/veneto-classic/2018/overview
Oopsie! This one failed: race/gp-de-valence/2018/overview
Oopsie! This one failed: race/trofeo-calvia/2018/overview
Oopsie! This one failed: race/trofeo-alcudia/2018/overview
Oopsie! This one failed: race/grand-prix-aspendos/2018/overview
Oopsie! This one failed: race/grand-prix-apollon-temple-me/2018/overview
Oopsie! This one failed: race/figueira-champions-classic/2018/overview
Oopsie! This one failed: race/clasica-jaen-paraiso-interior/2018/overview
Oopsie! This one failed: race/gran-camino/2018/overview
Oopsie! This one failed: race/alanya-cup/2018/overview
Oopsie! This one failed: race/le-t

In [9]:
df_races_out.sample(10)

Unnamed: 0,year,race,class,stage_slug
545,2021,Tour de l'Ain,2.1,race/tour-de-l-ain/2021/stage-3
396,2022,Visit South Aegean Islands,2.2,race/south-aegean-tour/2022/stage-2
436,2020,Sibiu Cycling Tour,2.1,race/sibiu-cycling-tour/2020/
5,2019,Santos Tour Down Under,2.UWT,race/tour-down-under/2019/stage-5
228,2018,Tour de Hongrie,2.Pro,race/tour-de-hongrie/2018/
64,2019,Tour de Romandie,2.UWT,race/tour-de-romandie/2019/stage-3
363,2018,Japan Cup Cycle Road Race,1.Pro,race/japan-cup/2018/
231,2023,Belgrade Banjaluka,2.2,race/banja-luka-belgrade-i/2023/stage-1
473,2019,Vuelta Asturias Julio Alvarez Mendo,2.1,race/vuelta-asturias/2019/
622,2022,Grand Prix de la ville de Pérenchies,1.2,race/grand-prix-de-la-ville-de-perenchies/2022/


In [10]:
df_races_out["parsed"] = df_races_out["stage_slug"].apply(lambda x: try_to_parse(Stage, x))

Oopsie! This one failed: race/great-ocean-race/2018/
Oopsie! This one failed: race/tirreno-adriatico/2018/stage-1
Oopsie! This one failed: race/itzulia-basque-country/2018/stage-4
Oopsie! This one failed: race/eschborn-frankfurt/2018/
Oopsie! This one failed: race/giro-d-italia/2018/stage-20
Oopsie! This one failed: race/dauphine/2018/stage-3
Oopsie! This one failed: race/tour-de-suisse/2018/stage-1
Oopsie! This one failed: race/tour-de-france/2018/stage-3
Oopsie! This one failed: race/tour-de-france/2018/stage-9
Oopsie! This one failed: race/vuelta-a-la-comunidad-valenciana/2018/stage-3
Oopsie! This one failed: race/clasica-de-almeria/2018/
Oopsie! This one failed: race/faun-ardeche-classic/2018/
Oopsie! This one failed: race/la-drome-classic/2018/
Oopsie! This one failed: race/milano-torino/2018/
Oopsie! This one failed: race/tour-of-norway/2018/stage-1
Oopsie! This one failed: race/tour-of-taihu-lake/2018/stage-4
Oopsie! This one failed: race/gp-samyn/2018/
Oopsie! This one failed: 

In [11]:
# handy to help developers debug > GitHub Issues
# example: Stage("race/vuelta-a-espana/2022/stage-1").parse() bugs because key of DNS rider is still considered but should be dropped
stages_not_parsed = df_races_out[df_races_out.parsed.isnull()]["stage_slug"].tolist()
len(stages_not_parsed)

817

In [12]:
df_races_out.dropna(inplace=True)  # drop stages that couldn't be parsed

In [13]:
df_races_out["results"] = df_races_out["parsed"].apply(parse_results_from_stage)

In [14]:
# override results for multi-stage general classifications (gc) with actual gc outcome (not final-stage results)
mask_gc = (df_races_out["class"].str.contains("2")) & ~(df_races_out["stage_slug"].str.contains("stage-"))
df_races_out.loc[mask_gc, "results"] = df_races_out.loc[mask_gc, "parsed"].apply(parse_results_from_stage, rid="gc")

In [15]:
df_races_out.tail()

Unnamed: 0,year,race,class,stage_slug,parsed,results
229,2023,Tour du Doubs,1.1,race/tour-du-doubs/2023/,"{'arrival': 'Pontarlier - Le Larmont ', 'climb...","[(HERRADA Jesús, 1), (PINOT Thibaut, 2), (PETE..."
236,2023,Vuelta Asturias Julio Alvarez Mendo,2.1,race/vuelta-asturias/2023/,"{'arrival': 'Oviedo', 'climbs': [], 'date': '2...","[(FORTUNATO Lorenzo, 1), (RUBIO Einer Augusto,..."
237,2023,Vuelta Asturias Julio Alvarez Mendo,2.1,race/vuelta-asturias/2023/stage-1,"{'arrival': 'Pola de Lena', 'climbs': [], 'dat...","[(HOWSON Damien, 1), (ALBANESE Vincenzo, 2), (..."
238,2023,Vuelta Asturias Julio Alvarez Mendo,2.1,race/vuelta-asturias/2023/stage-2,"{'arrival': 'Cangas del Narcea', 'climbs': [],...","[(FORTUNATO Lorenzo, 1), (RUBIO Einer Augusto,..."
239,2023,Vuelta Asturias Julio Alvarez Mendo,2.1,race/vuelta-asturias/2023/stage-3,"{'arrival': 'Oviedo', 'climbs': [], 'date': '2...","[(SÁNCHEZ Pelayo, 1), (ALBANESE Vincenzo, 2), ..."


In [16]:
df_races_out.shape

(2979, 6)

In [17]:
vec = DictVectorizer()

measurements = df_races_out["results"].apply(lambda x: {} if x is None else dict(x))
df_results = pd.DataFrame(
    vec.fit_transform(measurements).toarray(),
    columns=vec.get_feature_names_out(),
    # set year, stage slug, and class as indices
    index=pd.MultiIndex.from_frame(pd.concat([df_races_out["year"],
                                              df_races_out["stage_slug"].str.replace("race/", ""),
                                              df_races_out["class"]],
                                             axis=1))
)
df_results.replace(0, np.nan, inplace=True)  # initially NaN = did not finish race, 0 = did not participate; this replace() drops distinction

In [18]:
df_results.sample(4)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,AAEN JØRGENSEN Jonas,AAGAARD HANSEN Tobias,AALRUST Håkon,AALTO Jimi,AASHEIM Aksel,AASHEIM Ludvig,AASKOV PALLESEN Jeppe,AASVOLD Kristian,ABAY Burak,ABAZI Qendrim,...,ŠTOČEK Matúš,ŠTYBAR Zdeněk,ŠTĀLS Renāts,ŠĒLIS Jānis,ŤOUPALÍK Adam,ŤOUPALÍK Jakub,ŻUBER Adam,ŻUREK Jakub,ŽUMER Matic,ȚVETCOV Serghei
year,stage_slug,class,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
2022,tour-of-norway/2022/stage-2,2.Pro,,,91.0,,,,67.0,39.0,,,...,,,,,,,,,,
2022,tour-of-szeklerland/2022/stage-3,2.2,,,,,,,,,,,...,,,,,,,,,,
2022,famenne-ardenne-classic/2022/,1.1,,,,,,,,,,,...,,,,,,,,,,
2019,tour-of-norway/2019/stage-1,2.Pro,,,,,,,,31.0,,,...,,,,,,,,,,


In [19]:
df_results.filter(regex="VAN AERT Wout").dropna().loc[2022]

Unnamed: 0_level_0,Unnamed: 1_level_0,VAN AERT Wout
stage_slug,class,Unnamed: 2_level_1
omloop-het-nieuwsblad/2022/,1.UWT,1.0
paris-nice/2022/,2.UWT,32.0
paris-nice/2022/stage-1,2.UWT,3.0
paris-nice/2022/stage-2,2.UWT,2.0
paris-nice/2022/stage-3,2.UWT,3.0
paris-nice/2022/stage-4,2.UWT,1.0
paris-nice/2022/stage-5,2.UWT,98.0
paris-nice/2022/stage-6,2.UWT,3.0
paris-nice/2022/stage-7,2.UWT,62.0
paris-nice/2022/stage-8,2.UWT,2.0


In [20]:
print(df_results.shape)
df_results = df_results.dropna(axis=0, how="all")  # drop races that were cancelled or couldn't be parsed
print(df_results.shape)

(2979, 6964)
(2887, 6964)


In [21]:
df_results.to_csv("../data/results_matrix.csv", index=True)