## Gathering data from procyclingstats.com

This notebook scrapes the input data from [procyclingstats.com](https://www.procyclingstats.com/) using the [**procyclingstats**](https://github.com/themm1/procyclingstats) scraping library. I add some high-level cleaning and assembling functionality on top to make the scraping easier.

It collects:
- For a large number of riders from the best teams...
- Metadata for each rider, but most importantly...
- Their results in one-day or multi-stage...
- High-level races...
- For up to a few years in the past

The data is transformed into a simple matrix (pandas DataFrame) format, so that the next step's algorithm can use it to find hidden factors (called embeddings) that determine a rider's and a race's profile, all while having to specify very little about the type of race. Ready, set, go!

A script version of this notebook is in `scripts/scrape.py`.

## Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from procyclingstats import (
    Race,          # Race("race/tour-de-france/2022/overview").parse()
    Rider,         # Rider("rider/tadej-pogacar").parse()
    Stage,         # Stage("race/tour-de-france/2018/stage-18").parse()
    Team,          # Team("team/bora-hansgrohe-2021").parse()
    RiderResults,  # RiderResults("rider/alberto-contador/results").parse()
    RaceStartlist,
    RaceClimbs,
    Ranking        # Ranking("rankings/me/individual").parse() --> Summation of PCS points over a 12-month + 2 weeks overlap period
)

The scraping classes I focus on are: `Race`, `Rider`, `Stage`, and `Team`.

## Functions

In [None]:
def try_to_parse(obj, slug, printit=False):
    if printit:
        print(f"Parsing > {slug} ...")
    
    p = None  # fallback
    try:
        p = obj(slug).parse()
    except Exception:  # swallow parsing errors, but let KeyboardInterrupt through
        print(f"Oopsie! This one failed: {slug}")
    return p

def parse_results_from_stage(stage, rid="results"):
    results = None  # fallback
    if stage is not None and stage.get(rid) is not None:  # .get() tolerates a missing key
        results = [(r["rider_name"], r["rank"]) for r in stage[rid]]  # e.g. [(WVA, 1), (MVDP, 2), (Pogiboy, 3), ...]
    return results
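
To see how the two helpers behave without hitting the network, the sketch below feeds them a stand-in parser class. `FakeStage` and its payload are hypothetical and only mimic the shape the helpers expect (a dict with a `results` list of `rider_name`/`rank` entries); the real `Stage.parse()` output has more keys. The helpers are re-declared so the block runs on its own.

```python
class FakeStage:
    """Hypothetical stand-in for a procyclingstats scraper class (no network access)."""

    def __init__(self, slug):
        self.slug = slug

    def parse(self):
        if "broken" in self.slug:
            raise ValueError("simulated parsing failure")
        return {
            "results": [
                {"rider_name": "VAN AERT Wout", "rank": 1},
                {"rider_name": "POGACAR Tadej", "rank": 2},
            ],
            "gc": None,  # no general classification in this toy payload
        }


def try_to_parse(obj, slug, printit=False):
    if printit:
        print(f"Parsing > {slug} ...")
    p = None  # fallback
    try:
        p = obj(slug).parse()
    except Exception:
        print(f"Oopsie! This one failed: {slug}")
    return p


def parse_results_from_stage(stage, rid="results"):
    results = None  # fallback
    if stage is not None and stage.get(rid) is not None:
        results = [(r["rider_name"], r["rank"]) for r in stage[rid]]
    return results


ok = try_to_parse(FakeStage, "race/demo/2023/stage-1")
failed = try_to_parse(FakeStage, "race/demo/2023/broken")  # prints the failure message

stage_results = parse_results_from_stage(ok)            # list of (rider, rank) tuples
gc_results = parse_results_from_stage(ok, rid="gc")     # None: "gc" is empty here
missing = parse_results_from_stage(failed)              # None: nothing was parsed
```

A failed parse therefore propagates as `None` through both helpers instead of raising, which is what lets the later cells count and drop unparsed stages.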

## Config

In [None]:
YEARS = [2022, 2023]

I use the 2023 races as the base calendar, including only UCI WorldTour, UCI ProSeries, and Europe Tour races. Races change over the years, of course, but not by much. U23 (xU) and championship (NN/CC) races are dropped. I also had to remove a few duplicates. The idea is to deduce the most important riders from who participated in these races. Doing the inverse seems less straightforward with the API package.
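
The filtering described above could look roughly like the sketch below. The column names (`Race`, `Class`, `Slug`), the toy rows, and the exact way U23 and championship classes are detected are assumptions for illustration; the actual calendar was prepared separately and its preparation is not shown in this notebook.

```python
import pandas as pd

# Toy calendar; the two "Demo" rows are invented to exercise the filters
calendar = pd.DataFrame(
    {
        "Race": ["Tour de France", "Ronde van Vlaanderen", "Demo U23 Classic",
                 "Demo Championships", "Tour de France"],
        "Class": ["2.UWT", "1.UWT", "1.2U", "CC", "2.UWT"],
        "Slug": ["tour-de-france", "ronde-van-vlaanderen", "demo-u23",
                 "demo-cc", "tour-de-france"],
    }
)

is_u23 = calendar["Class"].str.endswith("U")            # assumed encoding of "xU"
is_championship = calendar["Class"].isin(["NN", "CC"])  # class codes named in the text

kept = (
    calendar[~is_u23 & ~is_championship]
    .drop_duplicates(subset="Slug")  # remove repeated entries of the same race
    .reset_index(drop=True)
)
print(kept)
```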

In [None]:
cutoff_date = "2023-04-30"
print(cutoff_date)
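
The cutoff is later compared to race end dates as plain strings. That is safe as long as both sides are zero-padded ISO 8601 dates (`YYYY-MM-DD`), because such strings sort lexicographically in chronological order. A small sketch with made-up end dates, assuming the scraper returns dates in that form, alongside a stricter alternative using `datetime.date`:

```python
from datetime import date

cutoff_date = "2023-04-30"
end_dates = ["2023-04-02", "2023-07-23", "2022-10-08"]  # made-up examples

# String comparison: valid only for zero-padded ISO dates
kept_str = [d for d in end_dates if d <= cutoff_date]

# Same filter on real date objects; raises ValueError on malformed input
cutoff = date.fromisoformat(cutoff_date)
kept_date = [d for d in end_dates if date.fromisoformat(d) <= cutoff]

print(kept_str == kept_date)
```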

In [None]:
df_races = pd.read_excel("../data/races.xlsx")
df_races = df_races.dropna()
RACES = df_races.set_index("Race").transpose().to_dict("list")
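
The `set_index(...).transpose().to_dict("list")` chain turns each row into a `{race name: [remaining column values]}` entry, which is why the parsing loop can unpack three values per race. A sketch on a toy frame; the column names `Date` and `Slug` and the date values are assumptions (only `Race` and `Class` are visible in this notebook):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Race": ["Tour de France", "Ronde van Vlaanderen"],
        "Date": ["01.07", "02.04"],  # placeholder values
        "Class": ["2.UWT", "1.UWT"],
        "Slug": ["tour-de-france", "ronde-van-vlaanderen"],
    }
)

# After the transpose, every race is a column; to_dict("list") maps it to its values
races = toy.set_index("Race").transpose().to_dict("list")

for race_key, (race_date, race_class, race_slug) in races.items():
    print(race_key, race_class, race_slug)
```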

In [None]:
df_races.Class.unique().tolist()  # 1.x = one-day race, 2.x = multi-day race & .UWT > .Pro > .1 > .2
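
The class codes carry two pieces of information: the prefix gives the race format and the suffix gives the tier. A minimal sketch of decoding them; the helper name `describe_class` and the numeric tier ranks are illustrative choices, not part of the notebook's pipeline:

```python
# Lower number = higher tier, following .UWT > .Pro > .1 > .2
TIER_ORDER = {"UWT": 0, "Pro": 1, "1": 2, "2": 3}


def describe_class(race_class):
    """Split a class code such as '2.UWT' into (format, tier rank)."""
    prefix, _, tier = race_class.partition(".")
    race_format = {"1": "one-day", "2": "multi-day"}.get(prefix, "unknown")
    return race_format, TIER_ORDER.get(tier)  # tier rank is None if unrecognised


classes = ["2.1", "1.UWT", "2.Pro", "1.2"]
ranked = sorted(classes, key=lambda c: describe_class(c)[1])
print(ranked)
```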

## Parse results

In [None]:
lst_out = []
for year in YEARS:
    races, classes, stages = [], [], []
    print(f"----- {year} -----")
    for race_key, race_info in RACES.items():  
        _, race_class, race_slug = race_info
        race_slug_full = f"race/{race_slug}/{year}/overview"   
        race_p = try_to_parse(Race, race_slug_full)     
        if race_p is None:
            continue
        else:
            # do not process if race end date is beyond dataset cutoff date
            # but keep going, because races are not ordered chronologically
            if race_p["enddate"] > cutoff_date:
                continue
            stage_slug_base = race_slug_full.replace("overview", "")  # has general classification if multi-stage race
            if race_p["is_one_day_race"] is True:
                stage_slugs = [stage_slug_base]  # single stage
            elif "stages" in race_p:
                stage_slugs = [stage_slug_base] + [s["stage_url"] for s in race_p["stages"]]  # multiple stages
            else:
                continue  # neither one-day nor staged: skip rather than reuse stale stage_slugs
            races += [race_key] * len(stage_slugs)
            classes += [race_class] * len(stage_slugs)
            stages += stage_slugs
    lst_out.append(pd.DataFrame({"year": year, "race": races, "class": classes, "stage_slug": stages}))
    print("")
        
df_races_out = pd.concat(lst_out)
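
The slug-building step inside the loop can be isolated into a small function, which makes the three cases (one-day race, stage race, neither) easy to check. The function name `build_stage_slugs` and the toy race dicts are hypothetical; the keys `is_one_day_race`, `stages`, and `stage_url` are the ones the loop reads.

```python
def build_stage_slugs(race_p, race_slug_full):
    """Return the slugs to scrape for one parsed race overview."""
    base = race_slug_full.replace("overview", "")  # e.g. "race/demo-tour/2023/"
    if race_p.get("is_one_day_race") is True:
        return [base]  # the race page itself holds the result
    if "stages" in race_p:
        # base slug (general classification) followed by every stage
        return [base] + [s["stage_url"] for s in race_p["stages"]]
    return []  # nothing usable


one_day = {"is_one_day_race": True}
stage_race = {
    "is_one_day_race": False,
    "stages": [
        {"stage_url": "race/demo-tour/2023/stage-1"},
        {"stage_url": "race/demo-tour/2023/stage-2"},
    ],
}

one_day_slugs = build_stage_slugs(one_day, "race/demo-classic/2023/overview")
stage_race_slugs = build_stage_slugs(stage_race, "race/demo-tour/2023/overview")
empty_slugs = build_stage_slugs({"is_one_day_race": False}, "race/demo-odd/2023/overview")
```

Returning an empty list in the last case keeps a race with neither flag from picking up slugs left over from a previous iteration.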

In [None]:
print(len(df_races_out))
df_races_out.sample(10)

In [None]:
df_races_out["parsed"] = df_races_out["stage_slug"].apply(lambda x: try_to_parse(Stage, x))

In [None]:
# handy to help developers debug > GitHub Issues
# example: Stage("race/vuelta-a-espana/2022/stage-1").parse() bugs because key of DNS rider is still considered but should be dropped
stages_not_parsed = df_races_out[df_races_out.parsed.isnull()]["stage_slug"].tolist()
print(f"{len(stages_not_parsed)} out of {len(df_races_out)} race results were not parsed")

In [None]:
df_races_out.dropna(subset=["parsed"], inplace=True)  # drop stages that couldn't be parsed

In [None]:
df_races_out["results"] = df_races_out["parsed"].apply(parse_results_from_stage)

In [None]:
# override results for multi-stage general classifications (gc) with actual gc outcome (not final-stage results)
mask_gc = (df_races_out["class"].str.startswith("2.")) & (df_races_out["stage_slug"].str.endswith("/"))  # startswith, because "1.2" (one-day) also contains "2"; alternative: does not contain 'stage-' or 'prologue'
df_races_out.loc[mask_gc, "results"] = df_races_out.loc[mask_gc, "parsed"].apply(parse_results_from_stage, rid="gc")
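
How the class is matched matters for this mask: a substring test for `"2"` also matches the one-day class `"1.2"`, whereas a prefix test for `"2."` selects only multi-day races. A sketch on a toy frame with invented slugs:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "class": ["2.UWT", "1.2", "2.1", "2.1"],
        "stage_slug": [
            "race/a/2023/",         # stage race, base slug (general classification)
            "race/b/2023/",         # one-day race of class 1.2
            "race/c/2023/",         # stage race, base slug
            "race/c/2023/stage-1",  # individual stage
        ],
    }
)

ends_with_slash = toy["stage_slug"].str.endswith("/")
mask_substring = toy["class"].str.contains("2") & ends_with_slash
mask_prefix = toy["class"].str.startswith("2.") & ends_with_slash

print(pd.DataFrame({"substring": mask_substring, "prefix": mask_prefix}))
```

Only the prefix version leaves the one-day race in row 1 out of the general-classification override.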

In [None]:
df_races_out.shape

In [None]:
vec = DictVectorizer()

measurements = df_races_out["results"].apply(lambda x: {} if x is None else dict(x))
df_results = pd.DataFrame(
    vec.fit_transform(measurements).toarray(),
    columns=vec.get_feature_names_out(),
    # set year, stage slug, and class as indices
    index=pd.MultiIndex.from_frame(pd.concat([df_races_out["year"],
                                              df_races_out["stage_slug"].str.replace("race/", ""),
                                              df_races_out["class"]],
                                             axis=1))
)
df_results.replace(0, np.nan, inplace=True)  # initially NaN = did not finish race, 0 = did not participate; this replace() drops distinction

In [None]:
df_results.sample(5)

In [None]:
df_results.filter(regex="VAN AERT Wout").dropna().loc[2022]

In [None]:
print(df_results.shape)
df_results = df_results.dropna(axis=0, how="all")  # drop races that were cancelled or couldn't be parsed
print(df_results.shape)

In [None]:
df_results.to_csv("../data/results_matrix.csv", index=True)