# Scraping Diplomeo

[Diplomeo](https://diplomeo.com/etablissements/resultats)

## Objectives

- Name
- Category
- Rating (stored as `Note`)
- Address
- Email
- Phone number

## Scraping notes

### 1) Results listing page

1. **Handle pagination**

   * Keep clicking the **"voir plus"** ("see more") button:

     ```html
     <div data-action="click->pagination#loadMoreTrainings"></div>
     ```
   * Repeat until the button is no longer present (or clickable).

2. **Extract URLs from `<li>` elements**

   * First locate `ul[data-cy="hub-schools-results"]`.
   * For each `<li>` element, check whether it contains an `<a>` tag.
   * If the `<a>` tag’s `href` contains `diplomeo.com`:

     * Save the `href` (URL).
     * Save the text content of the `<a>`.
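
The extraction rule above can be tried offline on a saved copy of the page. The following is a minimal sketch using only the standard library's `html.parser`, not the Selenium code used later in this notebook. The class name `SchoolLinkParser` and the sample HTML (including its URLs) are made up for illustration.

```python
from html.parser import HTMLParser


class SchoolLinkParser(HTMLParser):
    """Collect <a> links found inside ul[data-cy="hub-schools-results"]."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0          # > 0 while inside the results <ul>
        self.current_href = None   # href of the <a> currently being read
        self.current_text = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            # Enter the target list, or track a nested <ul> inside it.
            if self.ul_depth or attrs.get("data-cy") == "hub-schools-results":
                self.ul_depth += 1
        elif tag == "a" and self.ul_depth:
            href = attrs.get("href") or ""
            if "diplomeo.com" in href:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth:
            self.ul_depth -= 1
        elif tag == "a" and self.current_href is not None:
            name = "".join(self.current_text).strip()
            if name:
                self.results.append({"name": name, "url": self.current_href})
            self.current_href = None


# Hypothetical sample markup, not copied from the real site.
SAMPLE = """
<ul data-cy="hub-schools-results">
  <li><a href="https://diplomeo.com/etablissement-example_school-1">Example School</a></li>
  <li><a href="https://example.org/ad">Sponsored link</a></li>
  <li>No link here</li>
</ul>
<ul><li><a href="https://diplomeo.com/other">Outside the results list</a></li></ul>
"""

parser = SchoolLinkParser()
parser.feed(SAMPLE)
print(parser.results)
```

Only the first link is kept: the second has a foreign `href`, the third `<li>` has no link, and the last link sits outside the results list.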

### 2) Individual school page


1. **Category**

   * Path:
     `main → first div → first div → first div → second div → first div → div → first div`
   * Extract the text from this element.

2. **Note** (rating out of 5)

   * Path:
     `main → first div → first div → first div → second div → first div → div → second div → div → first div`
   * Extract the text.

3. **Address**

   * Find the `<address>` tag.
   * Extract its text content.

4. **Email**

   * Locate the `<h2>` element with the text:
     `"Envie d’étudier avec nous ?"`
   * Take the first `<ul>` following this `<h2>`.
   * Inside that list, find the `<li>` containing a `<div>` whose `data-l` value starts with `"xznvygb"`.
   * Decode the value of `data-l`: strip the `xznvygb:` prefix and the trailing character, apply ROT13, then replace `=cg=` with `.`.

5. **Phone**

   * Use the same `<h2>` and `<ul>` as for the email.
   * Inside that list, find the `<li>` containing a `<div>` whose `data-l` value starts with `"xgry"`.
   * Decode the value of `data-l`: strip the `xgry:` prefix and the trailing character, then prepend `+`.
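
The two decoding rules can be checked without a browser. Below is a minimal sketch, assuming the value layout that the notebook's own decoders rely on (a fixed prefix, one trailing junk character, and `=cg=` standing for a dot after ROT13). The function names and the sample values are hypothetical, and `fake_mail_value` exists only to build test input. `str.removeprefix` requires Python 3.9 or later.

```python
import codecs

MAIL_PREFIX = "xznvygb:"
PHONE_PREFIX = "xgry:"


def decode_mail_value(data_l: str) -> str:
    """Prefix off, last character off, ROT13, then '=cg=' -> '.'."""
    body = data_l.removeprefix(MAIL_PREFIX)[:-1]
    return codecs.decode(body, "rot_13").replace("=cg=", ".")


def decode_phone_value(data_l: str) -> str:
    """Prefix off, last character off, then a leading '+'."""
    return "+" + data_l.removeprefix(PHONE_PREFIX)[:-1]


def fake_mail_value(email: str) -> str:
    """Hypothetical inverse of decode_mail_value, used only for the demo."""
    body = codecs.encode(email.replace(".", "=cg="), "rot_13")
    return MAIL_PREFIX + body + "0"  # "0" plays the trailing junk character


sample_mail = fake_mail_value("info@example.org")
print(sample_mail)
print(decode_mail_value(sample_mail))
print(decode_phone_value("xgry:331234567890"))
```

`str.removeprefix` is used rather than `str.lstrip` because `lstrip` treats its argument as a set of characters: on the sample above it would also remove the leading `v` of the encoded address, and the result would decode to `nfo@example.org`.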

## First step: finding the description page URL of each school


In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.chrome.options import Options

import time
import pandas as pd
import random
import codecs

In [2]:
# --- Config ---
url = "https://diplomeo.com/etablissements/resultats" 
output_file = "first_step_name_and_description_url.xlsx"

# --- Start Selenium ---
options = Options()
options.add_argument("--start-maximized")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
driver = webdriver.Chrome(options=options)

driver.get(url)
wait = WebDriverWait(driver, 5)

try:
    consent_btn = wait.until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'Continuer sans accepter')]"))
    )
    driver.execute_script("arguments[0].click();", consent_btn)
    print("✅ Clicked 'Continuer sans accepter'")
except TimeoutException:
    print("ℹ️ No cookie banner found, continuing...")

while True:
    try:
        voir_plus = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "div[data-action='click->pagination#loadMoreTrainings']"))
        )
        
        driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", voir_plus)
        time.sleep(1)

        driver.execute_script("arguments[0].click();", voir_plus)

        time.sleep(random.uniform(3, 6))
    except (TimeoutException, ElementClickInterceptedException, NoSuchElementException):
        break

results = []
uls = driver.find_elements(By.CSS_SELECTOR, 'ul[data-cy="hub-schools-results"]')

for ul in uls:
    li_elements = ul.find_elements(By.CSS_SELECTOR, ':scope > li a[href*="diplomeo.com"]')
    for a in li_elements:
        href = a.get_attribute("href")
        text = a.text.strip()
        if href and text:
            print({"name": text, "url": href})
            results.append({"name": text, "url": href})

df = pd.DataFrame(results)
df.to_excel(output_file, index=False)

print(f"✅ Saved {len(results)} results to {output_file}")

# driver.quit()

✅ Clicked 'Continuer sans accepter'
{'name': 'École d’Ingénieurs de PURPAN', 'url': 'https://diplomeo.com/etablissement-ecole_d_ingenieurs_de_purpan-2390'}
{'name': 'DSTI School of Engineering - Paris', 'url': 'https://diplomeo.com/etablissement-dsti_school_of_engineering_paris-11920'}
{'name': 'DSTI School of Engineering - French Riviera', 'url': 'https://diplomeo.com/etablissement-dsti_school_of_engineering_french_riviera-11922'}
{'name': 'European School of Cybersecurity', 'url': 'https://diplomeo.com/etablissement-european_school_of_cybersecurity-12253'}
{'name': 'CFA DIFCAM Ile-de-France', 'url': 'https://diplomeo.com/etablissement-cfa_difcam_ile_de_france-12429'}
{'name': 'IFFP - Institut Français de Formation Professionnelle', 'url': 'https://diplomeo.com/etablissement-iffp_institut_francais_de_formation_professionnelle-12897'}
{'name': 'Purple Campus - Alès', 'url': 'https://diplomeo.com/etablissement-purple_campus_ales-13138'}
{'name': 'Purple Campus - Nîmes', 'url': 'https://
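
The "voir plus" loop above ends only when the button stops appearing. If the page kept offering the button, the loop would never finish, so a click cap can serve as a safety net. Here is a minimal sketch of that idea, written against plain callables so it runs without a browser. `click_until_gone`, `find_button`, `click` and `max_clicks` are illustrative names, not Selenium APIs.

```python
def click_until_gone(find_button, click, max_clicks=500):
    """Click the button until find_button() returns None or the cap is hit.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        click(button)
        clicks += 1
    return clicks


# Stand-in for a page that offers the button three times.
remaining = [3]


def fake_find_button():
    return "button" if remaining[0] > 0 else None


def fake_click(_button):
    remaining[0] -= 1


print(click_until_gone(fake_find_button, fake_click))  # 3
```

With Selenium, `find_button` could wrap the `wait.until(...)` call and return `None` on `TimeoutException`, while `click` could run the same JavaScript click followed by the random sleep.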

## Second step: scraping data for each school

In [3]:
def decode_mail(encoded: str) -> str:
    """
    Decode obfuscated Diplomeo email (data-l starting with 'xznvygb').
    """
    if not encoded:
        return None
    # Strip the "xznvygb:" prefix and the trailing junk char.
    # removeprefix (Python 3.9+) drops only the exact prefix; lstrip would
    # also eat leading characters of the address that belong to that set.
    encoded = encoded.removeprefix("xznvygb:")
    encoded = encoded[:-1]
    # ROT13
    decoded = codecs.decode(encoded, "rot_13")
    # Replace obfuscation for dots
    decoded = decoded.replace("=cg=", ".")
    # Remove "mailto:" if present
    return decoded.removeprefix("mailto:")


def decode_phone(encoded: str) -> str:
    """
    Decode obfuscated Diplomeo phone (data-l starting with 'xgry').
    """
    if not encoded:
        return None
    # Strip the exact "xgry:" prefix (removeprefix needs Python 3.9+)
    encoded = encoded.removeprefix("xgry:")
    # Add "+" in front and drop trailing junk char
    return "+" + encoded[:-1]


def scrape_school(driver, url):
    driver.get(url)
    wait = WebDriverWait(driver, 5)

    # Give time for page load
    time.sleep(random.uniform(3, 6))

    try:
        consent_btn = wait.until(
            EC.element_to_be_clickable(
                (By.XPATH, "//button[contains(., 'Continuer sans accepter')]")
            )
        )
        driver.execute_script("arguments[0].click();", consent_btn)
        print("✅ Clicked 'Continuer sans accepter'")
    except TimeoutException:
        print("ℹ️ No cookie banner found, continuing...")

    data = {
        "Category": None,
        "Note": None,
        "Address": None,
        "Email": None,
        "Phone": None,
    }

    try:
        # Category
        cat = driver.find_element(
            By.CSS_SELECTOR,
            "main > div:first-child > div:first-child > div:first-child > div:nth-of-type(2) > div > div:first-child",
        ).text.strip()
        data["Category"] = cat
    except NoSuchElementException:
        pass

    try:
        # Note
        note = driver.find_element(
            By.CSS_SELECTOR,
            "main > div:first-child > div:first-child > div:first-child > div:nth-of-type(2) > div > div:nth-of-type(2)",
        ).text.strip()
        note_clean = note.split("/5")[0].strip()
        data["Note"] = note_clean + "/5"
    except NoSuchElementException:
        pass

    try:
        # Address
        addr = driver.find_element(By.TAG_NAME, "address").text.strip()
        data["Address"] = addr
    except NoSuchElementException:
        pass

    try:
        # Email + Phone (from section under "Envie d’étudier avec nous ?")
        h2 = driver.find_element(
            By.XPATH,
            "//h2[contains(normalize-space(.), 'avec nous')]"
        )        
        section = h2.find_element(By.XPATH, "./following::ul[1]")
        items = section.find_elements(By.TAG_NAME, "li")
        for li in items:
            try:
                inner_div = li.find_element(By.CSS_SELECTOR, "div[data-l]")
                data_l = inner_div.get_attribute("data-l")
            except NoSuchElementException:
                data_l = None
                
            if data_l:
                if data_l.startswith("xznvygb"):
                    data["Email"] = decode_mail(data_l)
                elif data_l.startswith("xgry"):
                    data["Phone"] = decode_phone(data_l)
    except NoSuchElementException:
        pass

    return data


# Load Excel with first step results
df = pd.read_excel("first_step_name_and_description_url.xlsx")

options = Options()
options.add_argument("--start-maximized")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)

results = []

for idx, row in df.iterrows():
    print(f"Scraping {idx+1}/{len(df)}: {row['url']}")
    try:
        info = scrape_school(driver, row["url"])
        combined = {**row.to_dict(), **info}
        print(combined)
        results.append(combined)
    except Exception as e:
        print(f"❌ Error scraping {row['url']}: {e}")
        combined = {
            **row.to_dict(),
            "Category": None,
            "Note": None,
            "Address": None,
            "Email": None,
            "Phone": None,
        }
        results.append(combined)

    # polite wait
    time.sleep(random.uniform(2, 5))

driver.quit()

# Save results
final_df = pd.DataFrame(results)
final_df.to_excel("scraped_schools.xlsx", index=False)
print("✅ Saved results to scraped_schools.xlsx")

Scraping 1/10000: https://diplomeo.com/etablissement-ecole_d_ingenieurs_de_purpan-2390
✅ Clicked 'Continuer sans accepter'
{'name': 'École d’Ingénieurs de PURPAN', 'url': 'https://diplomeo.com/etablissement-ecole_d_ingenieurs_de_purpan-2390', 'Category': "École d'ingénieurs", 'Note': '4.3/5', 'Address': '75 voie du Toec, Toulouse 31076', 'Email': 'communication@purpan.fr', 'Phone': '+33561153030'}
Scraping 2/10000: https://diplomeo.com/etablissement-dsti_school_of_engineering_paris-11920
ℹ️ No cookie banner found, continuing...
{'name': 'DSTI School of Engineering - Paris', 'url': 'https://diplomeo.com/etablissement-dsti_school_of_engineering_paris-11920', 'Category': "École d'ingénierie informatique", 'Note': '4.8/5', 'Address': '4 Rue de la Collégiale, Paris 75005', 'Email': 'contact@dsti.institute', 'Phone': '+33489412944'}
Scraping 3/10000: https://diplomeo.com/etablissement-dsti_school_of_engineering_french_riviera-11922
ℹ️ No cookie banner found, continuing...
{'name': 'DSTI Scho

KeyboardInterrupt: 
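
The run above was interrupted after a few schools, and because results are written only once the loop finishes, nothing was saved. One possible remedy is to append each record to a file as soon as it is scraped and to skip URLs already recorded on the next run. This is a minimal sketch under those assumptions, using CSV instead of Excel so that rows can be appended. `scrape_all`, `append_row`, `load_done_urls` and the file name are hypothetical, and `scrape_one` stands in for `scrape_school`.

```python
import csv
import os
import tempfile

FIELDS = ["name", "url", "Category", "Note", "Address", "Email", "Phone"]


def load_done_urls(path):
    """Return the set of URLs already present in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}


def append_row(path, row):
    """Append one record, writing the header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({key: row.get(key) for key in FIELDS})


def scrape_all(rows, scrape_one, path):
    """Scrape every row whose URL is not yet in the checkpoint file."""
    done = load_done_urls(path)
    for row in rows:
        if row["url"] in done:
            continue
        info = scrape_one(row["url"])
        append_row(path, {**row, **info})
        done.add(row["url"])


# Demo with a fake scraper and made-up URLs.
rows = [
    {"name": "A", "url": "https://diplomeo.com/a"},
    {"name": "B", "url": "https://diplomeo.com/b"},
]
calls = []


def fake_scrape(url):
    calls.append(url)
    return {"Category": None, "Note": None, "Address": None,
            "Email": None, "Phone": None}


checkpoint = os.path.join(tempfile.mkdtemp(), "scraped_schools.csv")
scrape_all(rows, fake_scrape, checkpoint)
scrape_all(rows, fake_scrape, checkpoint)  # second run scrapes nothing new
print(len(calls))  # 2
```

Because each row is flushed when its file handle closes, an interruption loses at most the school being scraped at that moment.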