## Scraping & Verifying HYROX Results Across Seasons

**Competitions:** HYROX Seasons 1–8 (Official Results Portal)  
**Purpose:** Scrape all available race results and verify that every race, division, and gender has been successfully collected  
**Methods:** Selenium automation, controlled multi-threading, dynamic pagination handling, structured CSV export, division detection, empty-file tracking, and race-level completeness checks  
**Author:** [Victoria Friss de Kereki](https://www.linkedin.com/in/victoria-friss-de-kereki/)  

---

**Notebook first written:** `23/02/2026`  
**Last updated:** `27/02/2026`  

> This notebook builds a **robust scraping and verification pipeline** for HYROX competition results.
> 
> The workflow:
> 
> - 🌐 Scrapes race results directly from the official HYROX results platform  
> - 🗂 Organises outputs by **Season, Race, and Division**  
> - 🔄 Handles dynamic page loading and pagination safely  
> - 📁 Saves structured CSV files for each race-division combination  
> - ⚠️ Tracks empty divisions and failed scrapes  
> - 🏁 Verifies that every race includes all the available divisions
> 
> The objective of this notebook is to ensure **complete and reliable data extraction**, creating a solid foundation for downstream cleaning, validation, and analytical modelling in subsequent notebooks.
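
To make the output layout concrete, here is a minimal sketch of how one race-division CSV path is assembled. `safe_filename` follows the same replacement rule as the notebook's scraper; `csv_path` is a hypothetical helper introduced only for illustration, and `PurePosixPath` is used so the printed path looks the same on every platform (the notebook itself uses `os.path.join` with a Windows-style root). The race and division names come from the scraper log, while the season is chosen purely as an example.

```python
from pathlib import PurePosixPath


def safe_filename(text):
    """Same rule as the notebook: spaces become "_", slashes become "-"."""
    return text.replace(" ", "_").replace("/", "-")


def csv_path(save_root, season, race_name, division):
    """Hypothetical helper: <save_root>/<season>/<race>_<division>.csv."""
    filename = f"{safe_filename(race_name)}_{safe_filename(division)}.csv"
    return PurePosixPath(save_root) / season / filename


# Illustrative values only; the season assignment is an assumption.
path = csv_path("Datasets/Hyrox", "Season_2", "2019 New York", "WCHE Elite 12")
print(path)  # Datasets/Hyrox/Season_2/2019_New_York_WCHE_Elite_12.csv
```

Because the season is a folder and the race and division are both encoded in the filename, a completeness check can work from filenames alone without opening any CSV.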

------------------

### Scrape all available seasons, races, and divisions in parallel, including empty ones

In [None]:
import time
import random
import pandas as pd
import os
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException


# ============================================================
# CONFIGURATION
# ============================================================

# HYROX Seasons 1–8
BASE_URLS = {
    f"Season_{i}": f"https://results.hyrox.com/season-{i}/"
    for i in range(1, 9)
}

SAVE_ROOT = r"Datasets\Hyrox"

# Each worker thread launches its own headless Chrome instance;
# lower this value if the machine or the website struggles under the load
MAX_THREADS = 20


# ============================================================
# UTILITY FUNCTIONS
# ============================================================

def human_pause(a=2, b=5):
    """Random sleep to mimic human behaviour."""
    time.sleep(random.uniform(a, b))


def safe_filename(text):
    """Make race/division names safe for filenames."""
    return text.replace(" ", "_").replace("/", "-")


# ============================================================
# DRIVER SETUP
# ============================================================

def create_driver():
    """Create a headless Chrome driver."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--start-maximized")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
    return webdriver.Chrome(options=options)


# ============================================================
# DROPDOWN CONTEXT SELECTION
# ============================================================

def select_context(driver, base_url, race_value, division=None, gender=None):
    """
    Load season page and select:
    - Race
    - Division (optional)
    - Gender (optional)
    """
    driver.get(base_url)
    human_pause(2, 4)

    # Select race
    Select(driver.find_element(By.ID, "default-lists-event_main_group"))\
        .select_by_value(race_value)

    # Select division if provided
    if division:
        Select(driver.find_element(By.ID, "default-lists-event"))\
            .select_by_visible_text(division)

    # Select gender if provided
    if gender:
        Select(driver.find_element(By.ID, "default-lists-sex"))\
            .select_by_value(gender)


# ============================================================
# PAGE SCRAPING (PAGINATION HANDLING)
# ============================================================

def scrape_pages(driver, race_name, division, gender_label, race_results):
    """
    Scrape all pages for a given race/division/gender combination.
    Handles pagination until no next page exists.
    """

    is_doubles = "DOUBLES" in division.upper()
    page_number = 1

    while True:

        # Wait until either results appear OR "no results" message appears
        try:
            WebDriverWait(driver, 25).until(
                lambda d:
                "There are currently no results available" in d.page_source
                or len(d.find_elements(By.CSS_SELECTOR, "li.list-group-item.row")) > 1
            )
        except TimeoutException:
            print(f"[{race_name}] Timeout waiting for {division} - {gender_label}")
            return False

        # Check if no results message is displayed
        try:
            no_result_elem = driver.find_element(
                By.XPATH,
                "//*[contains(text(),'There are currently no results available')]"
            )
            if no_result_elem.is_displayed():
                print(f"[{race_name}] No results for {division} - {gender_label}")
                return False
        except NoSuchElementException:
            pass

        # Extract result rows
        rows = driver.find_elements(By.CSS_SELECTOR, "li.list-group-item.row")
        rows = [r for r in rows if "list-group-header" not in r.get_attribute("class")]

        if not rows:
            print(f"[{race_name}] No rows found for {division} - {gender_label}")
            return False

        scraped_any = False

        for row in rows:
            try:
                rank = row.find_element(By.CSS_SELECTOR, ".place-primary").text
                age_rank = row.find_element(By.CSS_SELECTOR, ".place-secondary").text
                total_time = row.find_element(
                    By.CSS_SELECTOR, ".type-time"
                ).text.replace("Total", "").strip()
                age_group = row.find_element(
                    By.CSS_SELECTOR, ".type-age_class"
                ).text.replace("Age Group", "").strip()

                if is_doubles:
                    members = row.find_elements(By.CSS_SELECTOR, ".type-relay_member a")
                    member_names = " & ".join([m.text for m in members])

                    race_results.append([
                        race_name, division, gender_label,
                        rank, age_rank, member_names, "",
                        age_group, total_time
                    ])

                else:
                    name = row.find_element(
                        By.CSS_SELECTOR, "h4.type-fullname"
                    ).text
                    nation = row.find_element(
                        By.CSS_SELECTOR, ".nation__abbr"
                    ).text

                    race_results.append([
                        race_name, division, gender_label,
                        rank, age_rank, name, nation,
                        age_group, total_time
                    ])

                scraped_any = True

            except Exception as e:
                print(f"[{race_name}] Row parsing error: {e}")
                continue

        print(f"[{race_name}] Page {page_number} scraped for {division} - {gender_label}")
        page_number += 1

        # Try to click next page
        try:
            next_button = driver.find_element(By.XPATH, "//a[text()='>']")
            driver.execute_script("arguments[0].click();", next_button)
            human_pause(2, 5)
        except NoSuchElementException:
            break

    return scraped_any


# ============================================================
# RACE SCRAPING
# ============================================================

def scrape_race(season, base_url, race_name, race_value):
    """
    Scrape all divisions and genders for a single race.
    Saves one CSV per race-division combination.
    """

    driver = create_driver()
    season_folder = os.path.join(SAVE_ROOT, season)
    os.makedirs(season_folder, exist_ok=True)

    safe_race = safe_filename(race_name)

    print(f"\n[{race_name}] Starting scraping")

    # Load race once to collect available divisions
    select_context(driver, base_url, race_value)
    division_dropdown = Select(driver.find_element(By.ID, "default-lists-event"))
    # Strip whitespace so filenames match those expected by the verification cell
    divisions = [o.text.strip() for o in division_dropdown.options]

    # Skip race if ALL division files already exist
    if all(
        os.path.exists(
            os.path.join(season_folder, f"{safe_race}_{safe_filename(div)}.csv")
        )
        for div in divisions
    ):
        print(f"[{race_name}] All division CSVs exist. Skipping race.")
        driver.quit()
        return f"[{race_name}] Skipped (all divisions exist)"

    # Process each division
    for div in divisions:

        safe_div = safe_filename(div)
        file_path = os.path.join(season_folder, f"{safe_race}_{safe_div}.csv")

        print(f"[{race_name}] Scraping division: {div}")

        # Select race + division
        select_context(driver, base_url, race_value, division=div)

        # Get genders (some divisions may not have gender dropdown)
        try:
            gender_dropdown = Select(driver.find_element(By.ID, "default-lists-sex"))
            genders = [(o.get_attribute("value"), o.text) for o in gender_dropdown.options]
        except NoSuchElementException:
            genders = [("", "All")]

        division_results = []
        division_has_data = False

        # Process each gender
        for gender_code, gender_label in genders:

            print(f"[{race_name}] Scraping gender: {gender_label} in {div}")

            select_context(
                driver,
                base_url,
                race_value,
                division=div,
                gender=gender_code if gender_code else None
            )

            # Set results per page to 100
            Select(driver.find_element(By.ID, "default-num_results"))\
                .select_by_value("100")

            driver.find_element(By.ID, "default-submit").click()
            human_pause(2, 4)

            has_data = scrape_pages(
                driver,
                race_name,
                div,
                gender_label,
                division_results
            )

            if has_data:
                division_has_data = True

        # Save CSV (even if empty)
        df = pd.DataFrame(
            division_results,
            columns=[
                "Race", "Division", "Gender",
                "Rank Overall", "Rank Age Group",
                "Name", "Nation", "Age Group", "Total Time"
            ]
        )

        df.to_csv(file_path, index=False)

        if division_has_data:
            print(f"[{race_name}] Saved {len(df)} rows for {div}")
        else:
            print(f"[{race_name}] Division {div} had no data. Empty CSV saved.")

    driver.quit()
    return f"[{race_name}] Finished scraping"


# ============================================================
# MAIN EXECUTION
# ============================================================

all_tasks = []

# Collect all season → race combinations
for season, base_url in BASE_URLS.items():

    driver_main = create_driver()
    driver_main.get(base_url)
    human_pause(2, 4)

    try:
        WebDriverWait(driver_main, 20).until(
            EC.presence_of_element_located(
                (By.ID, "default-lists-event_main_group")
            )
        )
    except TimeoutException:
        print(f"Failed loading {season}")
        driver_main.quit()
        continue

    race_dropdown = Select(
        driver_main.find_element(By.ID, "default-lists-event_main_group")
    )

    races = [
        (opt.text.strip(), opt.get_attribute("value"))
        for opt in race_dropdown.options
    ]

    driver_main.quit()

    for race_name, race_value in races:
        all_tasks.append((season, base_url, race_name, race_value))


# Run scraping in parallel (controlled threads)
with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    futures = [
        executor.submit(scrape_race, *task)
        for task in all_tasks
    ]

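    for f in futures:
        # result() re-raises any exception from the worker thread;
        # catch it so one failed race does not abort the whole report
        try:
            print(f.result())
        except Exception as e:
            print(f"Race task failed: {e}")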


print("\nALL DONE.")


[2019 Nürnberg] Starting scraping

[World Championships] Starting scraping

[2018 Essen] Starting scraping
[2019 Wien] Starting scraping


[2018 Wien] Starting scraping

[2019 New York] Starting scraping

[2018 Stuttgart] Starting scraping

[2020 Dallas] Starting scraping

[2020 Karlsruhe] Starting scraping

[2019 Hamburg] Starting scraping

[2020 Elite 12] Starting scraping

[2018 Hamburg] Starting scraping

[2020 Chicago] Starting scraping

[2019 Essen] Starting scraping

[2019 Oberhausen] Starting scraping

[2019 Karlsruhe] Starting scraping

[2018 Leipzig] Starting scraping

[2019 Frankfurt] Starting scraping

[2019 Hannover] Starting scraping

[2020 Hannover] Starting scraping
[2019 New York] Scraping division: WCHE Elite 12
[2019 Wien] Scraping division: WCHE Elite 12
[2020 Dallas] Scraping division: WCHE Elite 12
[2020 Karlsruhe] Scraping division: WCHE Elite 12
[2018 Stuttgart] All division CSVs exist. Skipping race.
[2020 Chicago] Scraping division: WCHE Elite 12
[2019 Nürn
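
The scraper above saves a CSV for every division, even when no results were found, so an empty division shows up on disk as a file containing only the header row. The following is a minimal sketch, not part of the original notebook, of how such header-only files could be listed afterwards. `find_empty_csvs` is a hypothetical helper, and the demo files are made up.

```python
import csv
import os
import tempfile


def find_empty_csvs(folder):
    """Return sorted names of CSV files in `folder` that have no data rows."""
    empty = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(folder, name), newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            next(reader, None)              # skip the header row, if present
            if next(reader, None) is None:  # no first data row: empty division
                empty.append(name)
    return empty


# Demo on a throwaway folder with made-up files
header = ["Race", "Division", "Gender", "Rank Overall", "Rank Age Group",
          "Name", "Nation", "Age Group", "Total Time"]
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "RaceA_DivX.csv"), "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow(header)
    with open(os.path.join(tmp, "RaceA_DivY.csv"), "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerow(["RaceA", "DivY", "Men", "1", "1", "Example Athlete",
                         "GER", "30-34", "01:05:00"])
    print(find_empty_csvs(tmp))  # ['RaceA_DivX.csv']
```

Pointing the helper at a season folder such as `Datasets\Hyrox\Season_1` would give a quick list of divisions to review manually, since a header-only file does not distinguish a genuinely empty division from a scrape that timed out.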

### Check that all existing seasons, races, and divisions have been downloaded
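
Once the division names for a race are known, the check in this section reduces to comparing expected filenames against the files on disk. A minimal offline sketch of that comparison is shown here, assuming the same naming rule as the scraper. `missing_divisions` is a hypothetical helper, and "Example Division" is a made-up division name used only to trigger a miss.

```python
def safe_filename(text):
    """Same rule as the notebook: spaces become "_", slashes become "-"."""
    return text.replace(" ", "_").replace("/", "-")


def missing_divisions(race_name, divisions, local_files):
    """Return the divisions of a race whose expected CSV is not in local_files."""
    local = set(local_files)
    safe_race = safe_filename(race_name)
    return [
        div for div in divisions
        if f"{safe_race}_{safe_filename(div)}.csv" not in local
    ]


# Made-up file listing for one season folder
local_files = ["2019_Wien_WCHE_Elite_12.csv"]
print(missing_divisions("2019 Wien", ["WCHE Elite 12", "Example Division"], local_files))
# ['Example Division']
```

The Selenium-based cell that follows does the same comparison, but reads the race and division names live from the results site.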

In [1]:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# ==========================
# CONFIG
# ==========================

BASE_URLS = {f"Season_{i}": f"https://results.hyrox.com/season-{i}/" for i in range(1, 9)}
SAVE_ROOT = r"Datasets\Hyrox"

# ==========================
# DRIVER
# ==========================

def create_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--start-maximized")
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
    return webdriver.Chrome(options=options)

# ==========================
# VERIFICATION
# ==========================

def verify_all_downloads():

    print("\n==============================")
    print("VERIFYING HYROX DATASET")
    print("==============================")

    for season, base_url in BASE_URLS.items():

        print(f"\n========== CHECKING {season} ==========")

        season_folder = os.path.join(SAVE_ROOT, season)

        if not os.path.exists(season_folder):
            print(f"❌ Season folder missing: {season}")
            continue

        local_files = set(os.listdir(season_folder))

        driver = create_driver()
        driver.get(base_url)

        try:
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.ID, "default-lists-event_main_group"))
            )
        except TimeoutException:
            print(f"❌ Could not load season page: {season}")
            driver.quit()
            continue

        race_dropdown = Select(driver.find_element(By.ID, "default-lists-event_main_group"))
        races = [(opt.text.strip(), opt.get_attribute("value"))
                 for opt in race_dropdown.options]

        missing_races = []
        missing_divisions = []

        for race_name, race_value in races:

            safe_race = race_name.replace(" ", "_").replace("/", "-")

            print(f"Checking race: {race_name}")

            driver.get(base_url)
            time.sleep(2)

            Select(driver.find_element(By.ID, "default-lists-event_main_group"))\
                .select_by_value(race_value)
            time.sleep(2)

            division_dropdown = Select(driver.find_element(By.ID, "default-lists-event"))
            divisions = [opt.text.strip() for opt in division_dropdown.options]

            race_files = [f for f in local_files if f.startswith(safe_race + "_")]

            if not race_files:
                missing_races.append(race_name)

            race_missing_divs = []

            for div in divisions:
                safe_div = div.replace(" ", "_").replace("/", "-")
                expected_filename = f"{safe_race}_{safe_div}.csv"

                if expected_filename not in local_files:
                    race_missing_divs.append(div)

            if race_missing_divs:
                missing_divisions.append((race_name, race_missing_divs))

        driver.quit()

        # REPORT
        if not missing_races:
            print("✅ No missing races.")
        else:
            print("\n❌ Missing races:")
            for r in missing_races:
                print(f"   - {r}")

        if not missing_divisions:
            print("✅ All divisions present.")
        else:
            print("\n⚠ Missing divisions:")
            for race, divs in missing_divisions:
                print(f"\n  {race}")
                for d in divs:
                    print(f"     - {d}")

        print(f"\n========== DONE {season} ==========")

    print("\n==============================")
    print("VERIFICATION COMPLETE")
    print("==============================")

# ==========================
# RUN
# ==========================

verify_all_downloads()


VERIFYING HYROX DATASET

Checking race: World Championships
Checking race: 2019 Oberhausen
Checking race: 2019 Karlsruhe
Checking race: 2019 Nürnberg
Checking race: 2019 Hannover
Checking race: 2018 Stuttgart
Checking race: 2018 Wien
Checking race: 2018 Essen
Checking race: 2018 Hamburg
Checking race: 2018 Leipzig
✅ No missing races.
✅ All divisions present.


Checking race: 2020 Elite 12
Checking race: 2020 Karlsruhe
Checking race: 2020 Dallas
Checking race: 2020 Hannover
Checking race: 2020 Chicago
Checking race: 2019 New York
Checking race: 2019 Frankfurt
Checking race: 2019 Wien
Checking race: 2019 Hamburg
Checking race: 2019 Essen
Checking race: 2019 Leipzig
Checking race: 2019 Miami
✅ No missing races.
✅ All divisions present.


Checking race: 2021 Dallas
Checking race: 2021 Chicago
Checking race: 2021 Orlando
Checking race: 2021 Austin
✅ No missing races.
✅ All divisions present.


Checking race: 2022 Los Angeles
Checking race: 2022 Frankfurt
Checking race: 2022 Da