# **8a.nu Sport Climbing Scraper**

## **Introduction**

This notebook scrapes sport climbing data from climber profiles on [8a.nu](https://www.8a.nu/), a popular platform for tracking climbing ascents. Using **Selenium** and **BeautifulSoup**, it extracts key performance metrics for each climber, including:

- **Highest Grade Climbed**: The most difficult route completed.
- **8c+ and Above Ascent Count**: The total number of ascents at grade 8c+ or higher.
- **Average Grade**: Weighted average of the first 5 unique grades (converted via `grade_to_linear_scale`), using ascent counts. The top 5 grades are used instead of all grades to avoid skewing the average downward for climbers who have many lower-grade ascents alongside significant high-grade achievements.

The data is sourced from individual climber profile pages (e.g., sport climbing sections) identified in a prior notebook (`1.2_8anu_profile_finder`).

8a.nu uses the French grading system. For more context on sport climbing grades, see [this website](https://www.guidedolomiti.com/en/rock-climbing-grades/).

---

### **Challenges Encountered**

Several obstacles were navigated during the development of this scraper:

- **Grade Conversion Complexity**: Converting climbing grades (e.g., 8a+, 9b) into numerical values for averaging required custom logic. A linear scale was implemented (e.g., 7a = 1, 7a+ = 2) to handle this, with weights applied based on ascent counts.

- **Locating Statistics Data**: Extracting data from the profile pages demanded precise parsing of the `statistics-body` HTML block. Early attempts struggled with inconsistent page structures, necessitating a refined approach using BeautifulSoup.

- **Dropdown Navigation Issues**: Profile pages default to a “Last 12 Months” view in a dropdown menu, but “All Time” data was needed. Automating the switch to “All Time” with Selenium proved challenging and remains buggy: it often requires the WebDriver UI to be visible and fails in headless mode.

---
Let's start scraping!

#### **Imports**

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup
import re
import time
import os
import random
from selenium.common.exceptions import TimeoutException, WebDriverException

#### **Grade Conversion Function**

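The `grade_to_number` function converts French sport climbing grades (e.g., 8a, 8a+, 9a) into numerical values so that grades at or above 8c+ can be identified. It uses a simplified scale, despite the non-linear difficulty progression in reality. The scale is also not one-to-one: "8a+" and "8b" both map to 8.25, and "8b+" and "8c" both map to 8.5. This is harmless for the 8c+ check, because 8c+ (8.75) is the only grade below 9a that reaches the threshold, but the values should not be used for ranking or averaging. The process: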

- **Input**: A grade string (e.g., "8a+"), stripped of whitespace.
- **Conversion**:
  - Number sets the base (e.g., 8).
  - Letter adds: a = 0, b = 0.25, c = 0.5.
  - "+" adds 0.25 if present.
  - Examples: "8a" = 8.0, "8a+" = 8.25, "8b" = 8.25, "9a" = 9.0.
- **Output**: A float (e.g., 8.25) or 0 if invalid.

To count ascents at 8c+ and above, we sum the ascent counts of every grade whose numerical value is 8.75 or higher.

In [2]:
# Function to convert climbing grade to numerical value for comparison and ranking
def grade_to_number(grade):
    grade = grade.strip()
    # Handle sport climbing grades (e.g., 8a+, 9a)
    match = re.match(r"(\d+)([abc])(\+)?", grade)
    if match:
        number, letter, plus = match.groups()
        number = int(number)

        # Base grade based on letter
        if letter == "a": base = number
        elif letter == "b": base = number + 0.25
        elif letter == "c": base = number + 0.5

        # Add 0.25 if there's a "+"
        if plus: base += 0.25
        return base

    return 0  # Invalid grade
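
As a quick sanity check of the scale described above, the sketch below restates the conversion compactly and applies the 8.75 threshold to a few sample rows. The sample grades and ascent counts are made up for illustration, and `count_at_or_above` is a hypothetical helper that is not part of the notebook.

```python
import re

def grade_to_number(grade):
    """Compact restatement of the conversion above (same numerical values)."""
    match = re.match(r"(\d+)([abc])(\+)?", grade.strip())
    if not match:
        return 0  # Invalid grade
    number, letter, plus = match.groups()
    base = int(number) + {"a": 0.0, "b": 0.25, "c": 0.5}[letter]
    return base + 0.25 if plus else base

def count_at_or_above(rows, threshold=8.75):
    """Hypothetical helper: sum ascent counts for rows at or above the threshold."""
    return sum(count for grade, count in rows if grade_to_number(grade) >= threshold)

# Made-up (grade, ascent count) rows, hardest first
sample_rows = [("9a", 1), ("8c+", 2), ("8c", 4), ("8b+", 7)]

for grade, _ in sample_rows:
    print(grade, grade_to_number(grade))

print(count_at_or_above(sample_rows))  # 9a and 8c+ qualify: 1 + 2 = 3
```

Note that "8c" and "8b+" print the same value (8.5), which is why this scale is only used for the threshold check and not for averaging.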

#### **Linear Grade Scale Function**

The `grade_to_linear_scale` function converts French sport climbing grades (e.g., 7a, 7a+, 9b) into a linear numerical scale starting at 7a = 1, 7a+ = 2, etc., for **simplified averaging.** The process:

- **Input**: A grade string (e.g., "7a+"), stripped of whitespace.
- **Conversion**:
  - Base: `(number - 7) * 6` counts steps from 7a (each number grade spans 6 steps: a, a+, b, b+, c, c+).
  - Letter adds: a = 0, b = 2, c = 4.
  - "+" adds 1 if present.
  - Add 1 to shift 7a to 1 (not 0).
  - Examples: "7a" = 1, "7a+" = 2, "7b" = 3, "8a" = 7, "9c+" = 18.
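- **Output**: An integer, or 0 if the grade is invalid. Grades below 7a map to zero or negative values (e.g., "6c+" = 0), so a result of 0 does not always mean the input was invalid.

This linear value is used to compute the climber's average grade.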

In [3]:
# Function to convert climbing grade to linear scale (7a=1, 7a+=2, etc.)
def grade_to_linear_scale(grade):
    grade = grade.strip()
    match = re.match(r"(\d+)([abc])(\+)?", grade)
    if not match:
        return 0

    number, letter, plus = match.groups()
    number = int(number)

    # Base value starts at 7a = 1
    base_value = (number - 7) * 6  # Each number adds 6 values (a, a+, b, b+, c, c+)

    # Add for the letter
    if letter == "a": letter_value = 0
    elif letter == "b": letter_value = 2
    elif letter == "c": letter_value = 4

    # Add for the plus
    plus_value = 1 if plus else 0

    # Final value (7a = 1, 7a+ = 2, 7b = 3, etc.)
    linear_value = base_value + letter_value + plus_value + 1

    return linear_value
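
To make the averaging step concrete, here is a minimal sketch of an ascent-weighted mean on the linear scale, together with a way to read the result back as a grade. The rows are made-up data, and `linear_to_grade` and `weighted_average` are hypothetical helpers introduced only for this example.

```python
import re

STEPS = ["a", "a+", "b", "b+", "c", "c+"]  # Six steps per number grade

def grade_to_linear_scale(grade):
    """Compact restatement of the linear scale above (7a = 1, 7a+ = 2, ...)."""
    match = re.match(r"(\d+)([abc])(\+)?", grade.strip())
    if not match:
        return 0
    number, letter, plus = match.groups()
    return (int(number) - 7) * 6 + STEPS.index(letter) + (1 if plus else 0) + 1

def linear_to_grade(value):
    """Hypothetical inverse: map a linear value (>= 1) back to a grade label."""
    number, step = divmod(int(value) - 1, 6)
    return f"{7 + number}{STEPS[step]}"

def weighted_average(rows):
    """Ascent-weighted mean of linear values for (grade, ascent count) rows."""
    total = sum(count for _, count in rows)
    if total == 0:
        return 0
    return sum(grade_to_linear_scale(g) * c for g, c in rows) / total

# Made-up rows: 2 ascents at 9a, 3 at 8c+, 5 at 8c
rows = [("9a", 2), ("8c+", 3), ("8c", 5)]
avg = weighted_average(rows)
print(round(avg, 2))                # 11.7
print(linear_to_grade(round(avg)))  # 8c+
```

Rounding the average to the nearest integer before converting back is one possible convention; the notebook itself only stores the numeric average.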

#### **Sport Climbing Data Extraction Function**

The `get_sport_climbing_data` function scrapes a climber’s sport climbing stats from an 8a.nu profile URL using Selenium and BeautifulSoup. It returns the highest grade, the number of ascents at 8c+ or above, and the ascent-weighted average grade of the first 5 grade rows. The process:

- **Input**: A Selenium `driver`, profile `url`, and `retries` (default 2), which is the total number of attempts made before giving up.
- **Steps**:
  - Loads the URL and waits for the `statistics-header` to appear (20s timeout).
  - Scrolls 300px to reveal the dropdown, then sets it to “All Time” if needed using `ActionChains`.
  - Parses the page with BeautifulSoup after confirming `statistics-body` loads.
  - Extracts:
    - **Highest Grade**: Text from the first row’s `difficulty` span (e.g., "9a").
    - **8c+ Count**: Sums ascents ≥ 8c+ (numerical value ≥ 8.75) from all rows.
    - **Average Grade**: Weighted average of the first 5 rows’ grades (converted via `grade_to_linear_scale`), using ascent counts.
- **Error Handling**: Retries on timeouts or WebDriver errors with random delays; returns `(None, 0, 0)` if all retries fail.
- **Output**: Tuple of `(highest_grade, count_8c_plus, avg_grade_linear)`.
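
The parsing step can be tried without a browser. The sketch below runs the same BeautifulSoup lookups on a small synthetic HTML snippet; the markup is an assumed, simplified structure built from the class names used in this notebook, not a copy of the real page, and `parse_rows` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Synthetic HTML that mimics the class names the scraper looks for.
# The real page markup may differ; this structure is an assumption.
SAMPLE_HTML = """
<div class="statistics-body">
  <div class="statistics-line stats">
    <span class="difficulty">9a</span>
    <div class="number-grid">
      <div class="number-cell">1</div><div class="number-cell">0</div><div class="number-cell">1</div>
    </div>
  </div>
  <div class="statistics-line stats">
    <span class="difficulty">8c+</span>
    <div class="number-grid">
      <div class="number-cell">2</div><div class="number-cell">1</div><div class="number-cell">3</div>
    </div>
  </div>
</div>
"""

def parse_rows(html):
    """Return (grade, total ascents) pairs, taking the last number-cell as the total."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for line in soup.find_all("div", {"class": "statistics-line stats"}):
        grade_elem = line.find("span", {"class": "difficulty"})
        number_grid = line.find("div", {"class": "number-grid"})
        if grade_elem and number_grid:
            cells = number_grid.find_all("div", {"class": "number-cell"})
            rows.append((grade_elem.text.strip(), int(cells[-1].text.strip())))
    return rows

print(parse_rows(SAMPLE_HTML))  # [('9a', 1), ('8c+', 3)]
```

Passing the full string `"statistics-line stats"` matches only elements whose `class` attribute is exactly that string, so a change in class order on the site would break the lookup.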

In [4]:
# Function to get sport climbing data: highest grade, 8c+ count, and average of first 5 unique grade rows
def get_sport_climbing_data(driver, url, retries=2):
    for attempt in range(retries):
        try:
            driver.get(url)
            # Wait for the page to load key elements
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CLASS_NAME, "statistics-header"))
            )
            # Scroll the page down by 300 pixels to ensure the dropdown is visible
            driver.execute_script("window.scrollTo(0, 300);")
            time.sleep(1.5)  # Brief pause for JavaScript to settle

            # Target the correct dropdown more precisely
            try:
                # Scope to statistics-header to avoid other dropdowns
                stats_header = driver.find_element(By.CLASS_NAME, "statistics-header")
                # Use a general selector for the dropdown input
                dropdown_input = stats_header.find_element(By.CSS_SELECTOR, "input[type='text']")
                # Fall back to an empty string so the "All Time" check below never sees None
                current_value = dropdown_input.get_attribute("placeholder") or dropdown_input.get_attribute("value") or ""
                print(f"Current dropdown value: {current_value}")

                # If not on "All Time," adjust it
                if "All Time" not in current_value:
                    ActionChains(driver).move_to_element(dropdown_input).click().perform()
                    print("Clicked dropdown input")
                    time.sleep(0.5)  # Wait for options to appear

                    # Select "All Time" from options
                    all_time_option = WebDriverWait(driver, 10).until(
                        EC.element_to_be_clickable((By.XPATH, "//*[contains(text(), 'All Time')]"))
                    )
                    ActionChains(driver).move_to_element(all_time_option).click().perform()
                    print("Selected 'All Time' option")
                    time.sleep(1)  # Wait for page update
                else:
                    print("Already on 'All Time', no change needed")

            except Exception as e:
                print(f"Error interacting with dropdown: {e}")
                print("Proceeding with current selection")

            # Extract data with optimized content extraction
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.CLASS_NAME, "statistics-body"))
            )

            # Use optimized parsing approach
            soup = BeautifulSoup(driver.page_source, "html.parser")
            stats_lines = soup.find_all("div", {"class": "statistics-line stats"})

            if not stats_lines:
                print(f"No statistics lines found for {url}")
                return None, 0, 0

            # Highest grade (first row)
            grade_elem = stats_lines[0].find("span", {"class": "difficulty"})
            highest_grade = grade_elem.text.strip() if grade_elem else None

            # Count 8c+ or above ascents; skip the scan when even the top grade is below 8c+
            count_8c_plus = 0
            if highest_grade and grade_to_number(highest_grade) >= 8.75:  # 8c+ threshold
                for line in stats_lines:
                    grade_elem = line.find("span", {"class": "difficulty"})
                    if grade_elem and grade_to_number(grade_elem.text.strip()) >= 8.75:
                        number_grid = line.find("div", {"class": "number-grid"})
                        if number_grid:
                            # The last number-cell is read as the row's total ascent count
                            total = int(number_grid.find_all("div", {"class": "number-cell"})[-1].text.strip())
                            count_8c_plus += total

            # Weighted average over the first 5 grade rows (fewer if the profile has fewer rows)
            weighted_sum, total_ascents = 0, 0
            for line in stats_lines[:5]:
                grade_elem = line.find("span", {"class": "difficulty"})
                if grade_elem:
                    grade = grade_elem.text.strip()
                    linear_value = grade_to_linear_scale(grade)
                    number_grid = line.find("div", {"class": "number-grid"})
                    if number_grid:
                        total_ascents_row = int(number_grid.find_all("div", {"class": "number-cell"})[-1].text.strip())
                        weighted_sum += linear_value * total_ascents_row
                        total_ascents += total_ascents_row

            avg_grade_linear = weighted_sum / total_ascents if total_ascents > 0 else 0
            return highest_grade, count_8c_plus, round(avg_grade_linear, 2)

        except TimeoutException as e:
            print(f"Timeout on attempt {attempt + 1} for {url}: {e}")
            if attempt == retries - 1:
                return None, 0, 0
            time.sleep(random.uniform(2, 4))
        except WebDriverException as e:
            print(f"WebDriver error on attempt {attempt + 1} for {url}: {e}")
            if attempt == retries - 1:
                return None, 0, 0
            time.sleep(random.uniform(2, 4))
        except Exception as e:
            print(f"Unexpected error on attempt {attempt + 1} for {url}: {e}")
            if attempt == retries - 1:
                return None, 0, 0
            time.sleep(random.uniform(2, 4))

    return None, 0, 0
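
The three `except` branches above follow the same retry pattern: log the failure, give up on the last attempt, otherwise pause for a random interval. A minimal sketch of that pattern as a standalone helper is shown below; `with_retries` and the `flaky` stand-in task are hypothetical and not part of the notebook.

```python
import random
import time

def with_retries(task, attempts=2, delay_range=(2, 4), sleep=time.sleep):
    """Hypothetical helper: call task() up to `attempts` times.

    Returns task()'s result, or None if every attempt raises.
    """
    for attempt in range(attempts):
        try:
            return task()
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < attempts - 1:
                sleep(random.uniform(*delay_range))  # Random pause before retrying
    return None

# Usage with a stand-in task that fails once, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return "ok"

result = with_retries(flaky, sleep=lambda seconds: None)  # Skip real sleeping in the demo
print(result)  # ok
```

Making the `sleep` function injectable keeps the demo fast; in a real run the default `time.sleep` would apply the random delay.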

#### **Main Script**

The main script orchestrates the scraping of sport climbing data from 8a.nu profiles, processing climbers in batches and saving results to CSV. The workflow:

- **WebDriver Setup**: Initializes a Selenium Chrome WebDriver with `ChromeDriverManager` and a 10-second implicit wait; exits on failure.
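- **CSV Loading**: Reads climber names and URLs from `8a_nu_profiles.csv`; exits if the file is missing or invalid. The file is read with `header=None`, so if it contains a header row, that row is treated as a climber. The cell output suggests this happens here (`Processing name (1/192)...`): the resulting invalid URL fails and the row is skipped.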
- **Batch Processing**:
  - Splits the climber list into batches of 10 for manageable execution.
  - For each climber:
    - Constructs the sport climbing URL (`<base_url>/sportclimbing`).
    - Calls `get_sport_climbing_data` to extract highest grade, 8c+ count, and average of the first 5 grades.
    - Skips climbers with no data; logs errors and continues on exceptions.
    - Stores results in a list of dictionaries (`filtered_data`).
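- **Intermediate Saves**: Rewrites `8a_nu_climbing_data.csv` with all results collected so far whenever the number of successful results reaches a multiple of 10, and again after the last climber if that climber was processed successfully. The output directory is created if needed.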
- **Cleanup and Output**:
  - Closes the WebDriver, handling any errors.
  - Prints a summary of included climbers.
  - Saves the final dataset to `8a_nu_climbing_data.csv` as a DataFrame, using empty columns if no data is collected.
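
The batch boundaries in the workflow above come from stepping through the climber list in strides of `batch_size`. A small sketch of that arithmetic, using a hypothetical `batch_ranges` helper and a made-up list length of 25:

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks of batch_size."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# 25 is a made-up list length for illustration
for number, (start, end) in enumerate(batch_ranges(25, 10), start=1):
    print(f"Batch {number}: climbers {start + 1}-{end}")
```

The last batch is shorter than `batch_size` whenever the list length is not a multiple of it, which is why the end index is capped with `min`.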

In [5]:
# Main script
# Set up the Selenium Chrome WebDriver with a 10-second implicit wait
try:
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.implicitly_wait(10)
except Exception as e:
    print(f"Failed to initialize WebDriver: {e}")
    exit(1)

csv_path = "../data/8anu_data/8a_nu_profiles.csv"
try:
    df = pd.read_csv(csv_path, header=None, names=["name", "url"])
    print(f"Loaded CSV with {len(df)} climbers.")
except FileNotFoundError:
    print(f"CSV file not found at {csv_path}.")
    driver.quit()
    exit(1)
except Exception as e:
    print(f"Error loading CSV from {csv_path}: {e}")
    driver.quit()
    exit(1)

# Create a smaller batch for processing to allow periodic breaks
batch_size = 10
total_climbers = len(df)
filtered_data = []

for batch_start in range(0, total_climbers, batch_size):
    batch_end = min(batch_start + batch_size, total_climbers)
    print(f"\nProcessing batch {batch_start//batch_size + 1} (climbers {batch_start+1}-{batch_end})...")

    # Process each climber in the batch
    for index in range(batch_start, batch_end):
        name = df.iloc[index]["name"]
        base_url = df.iloc[index]["url"]

        print(f"Processing {name} ({index+1}/{total_climbers})...")
        try:
            sportclimbing_url = f"{base_url}/sportclimbing"
            highest_grade, count_8c_plus, avg_grade_first5 = get_sport_climbing_data(driver, sportclimbing_url)

            if highest_grade is None:
                print(f"Could not extract sport climbing data for {name}, skipping.")
                continue

            print(f"{name} Highest Grade: {highest_grade}, 8c+ Ascents: {count_8c_plus}, Avg Grade (First 5): {avg_grade_first5}")

            filtered_data.append({
                "name": name,
                "url": base_url,
                "highest_grade": highest_grade,
                "count_8c_plus": count_8c_plus,
                "avg_grade_first5": avg_grade_first5
            })

            # Save intermediate results every batch
            if len(filtered_data) % batch_size == 0 or index == total_climbers - 1:
                temp_df = pd.DataFrame(filtered_data)
                output_dir = "../data/8anu_data"
                os.makedirs(output_dir, exist_ok=True)
                temp_output_path = os.path.join(output_dir, "8a_nu_climbing_data.csv")
                temp_df.to_csv(temp_output_path, index=False)
                print(f"Intermediate results saved after processing {len(filtered_data)} climbers")

        except Exception as e:
            print(f"Error processing {name}: {e}")
            continue

try:
    driver.quit()
except Exception as e:
    print(f"Error closing WebDriver: {e}")

print("\nClimbers included in the output CSV:")
for climber in filtered_data:
    print(f"- {climber['name']}: Highest Grade: {climber['highest_grade']}, 8c+ Ascents: {climber['count_8c_plus']}, Avg First 5: {climber['avg_grade_first5']}")

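if filtered_data:
    filtered_df = pd.DataFrame(filtered_data)
else:
    filtered_df = pd.DataFrame(columns=["name", "url", "highest_grade", "count_8c_plus", "avg_grade_first5"])

# Write the final dataset to disk before reporting success
output_dir = "../data/8anu_data"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "8a_nu_climbing_data.csv")
filtered_df.to_csv(output_path, index=False)
print(f"Final results saved to {output_path}")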

Loaded CSV with 192 climbers.

Processing batch 1 (climbers 1-10)...
Processing name (1/192)...
WebDriver error on attempt 1 for profile_url/sportclimbing: Message: invalid argument
  (Session info: chrome=134.0.6998.166)
Stacktrace:
0   chromedriver                        0x000000010107b6c8 cxxbridge1$str$ptr + 2791212
1   chromedriver                        0x0000000101073c9c cxxbridge1$str$ptr + 2759936
2   chromedriver                        0x0000000100bc5ca4 cxxbridge1$string$len + 92532
3   chromedriver                        0x0000000100bafb58 cxxbridge1$string$len + 2088
4   chromedriver                        0x0000000100baddc4 chromedriver + 187844
5   chromedriver                        0x0000000100bae910 chromedriver + 190736
6   chromedriver                        0x0000000100bc8de0 cxxbridge1$string$len + 105136
7   chromedriver                        0x0000000100c4f118 cxxbridge1$string$len + 654824
8   chromedriver                        0x0000000100c4e5f8 cxxbridge1$s

Let's preview what our dataset looks like.

In [6]:
# Load the saved dataset to preview the results
df = pd.read_csv("../data/8anu_data/8a_nu_climbing_data.csv")
df.head()

| | name | url | highest_grade | count_8c_plus | avg_grade_first5 |
|---|---|---|---|---|---|
| 0 | Anze Peharc | https://www.8a.nu/user/ane-peharc | 8b | 0 | 7.78 |
| 1 | Hannes Van Duysen | https://www.8a.nu/user/hannes-van-duysen | 8a | 0 | 7.0 |
| 2 | Jakob Schubert | https://www.8a.nu/user/jakob-schubert | 9a+ | 26 | 11.21 |
| 3 | Mejdi Schalck | https://www.8a.nu/user/mejdi-schalck | 9a+ | 10 | 11.33 |
| 4 | Adam Ondra | https://www.8a.nu/user/adam-ondra | 9c | 388 | 13.54 |
