# Selenium vs SupaCrawler: Complete Performance Comparison

This notebook compares Selenium WebDriver against SupaCrawler for both single-page scraping and multi-page crawling. We focus on performance, resource usage, and infrastructure complexity.

Note: these tests were run locally on an M4 Mac with 24 GB of memory.

In [None]:
# Installation requirements
# !pip install selenium webdriver-manager
# !pip install supacrawler

In [23]:
import time
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from supacrawler import SupacrawlerClient
from urllib.parse import urljoin, urlparse
from collections import deque
import random

## Part 1: Single Page Scraping Comparison
For the first comparison, we scrape a single page with both tools, rendering JavaScript in each. Both Selenium and SupaCrawler run locally:

In [5]:
def selenium_scrape(url):
    """Single page scraping with Selenium"""
    start = time.time()
    
    # Setup Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    
    # Setup Chrome driver
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    try:
        driver.get(url)
        time.sleep(2)  # Wait for page load
        
        title = driver.title
        body = driver.find_element(By.TAG_NAME, "body")
        content = body.text[:200] + "..." if len(body.text) > 200 else body.text
        
        result = {
            'title': title,
            'content': content,
            'time': time.time() - start,
            'javascript_support': True,
            'resource_usage': 'High (local browser)'
        }
        
    finally:
        driver.quit()
    
    return result

def supacrawler_scrape(url):
    """Single page scraping with SupaCrawler"""
    start = time.time()
    
    client = SupacrawlerClient(api_key='')
    response = client.scrape(url, format='markdown', render_js=True, fresh=True)
    
    title = response.metadata.title if response.metadata else 'No title'
    content = response.content if response.content else "No content"
    
    return {
        'title': title,
        'content': content[:200] + "..." if len(content) > 200 else content,
        'metadata': response.metadata,
        'time': time.time() - start,
        'javascript_support': True,
        'resource_usage': 'Zero local resources'
    }

# Test single page scraping
test_url = 'https://supabase.com'

print("Single Page Scraping Comparison")
print("=" * 35)
print(f"Test URL: {test_url}")
print()

print("Selenium WebDriver:")
selenium_result = selenium_scrape(test_url)
print(f"Title: {selenium_result['title']}")
print(f"Time: {selenium_result['time']:.2f}s")
print()

print("SupaCrawler:")
sc_result = supacrawler_scrape(test_url)
print(f"Title: {sc_result['title']}")
print(f"Time: {sc_result['time']:.2f}s")
print()

if selenium_result['time'] > 0 and sc_result['time'] > 0:
    ratio = selenium_result['time'] / sc_result['time']
    print(f"Performance: SupaCrawler is {ratio:.1f}x faster")

Single Page Scraping Comparison
Test URL: https://supabase.com

Selenium WebDriver:
Title: Supabase | The Postgres Development Platform.
Time: 4.08s

SupaCrawler:
Title: Supabase | The Postgres Development Platform.
Time: 1.37s

Performance: SupaCrawler is 3.0x faster
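
When reading the single-page numbers, note that `selenium_scrape` starts its timer before the driver is created and also includes a fixed 2-second sleep, so the Selenium figure covers browser startup as well as the page fetch. One way to see how a total splits up is to time each phase separately. The following is a minimal sketch using only the standard library; `PhaseTimer` and the phase names are hypothetical and not part of either SDK.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects wall-clock durations for named phases (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the helper can be exercised without sleeping.
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Accumulate, so a phase entered twice is summed rather than overwritten.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())
```

In a scraper this could wrap driver creation in `with timer.phase("setup"):` and the `driver.get(...)` call plus extraction in `with timer.phase("fetch"):`, then print `timer.durations` next to the overall time.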


## Part 2: Multi-Page Crawling Comparison

### How SupaCrawler Crawl Works

1. **Crawl dispatch** → Crawl breaks the target domain into a queue of page URLs.
2. **Per-page scrape** → For each URL, crawl calls the exact same **scrape service** that is exposed publicly (`/scrape`).
3. **Retry & backoff** → Each scrape runs with up to **3 retries** and **exponential backoff** (1s → 2s → 4s).
4. **Error handling** → If all retries fail, the page is marked as failed.
5. **Aggregation** → Successfully scraped pages are aggregated into the crawl result.

The crawl path is literally just a loop of our scrape service. That means the **retry, backoff, and error logic** are identical between crawl and scrape. To make the Selenium and Playwright benchmarks fair, we applied the **same retry + backoff algorithm** when fetching each page.

Therefore, **all three systems (SupaCrawler, Selenium, Playwright)** are operating under identical conditions — same error tolerance, same retry count, same backoff schedule.

This guarantees the benchmark is **apples-to-apples**: differences come from **architecture and performance**, not from error-handling or retry policies.
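
The retry policy described above can be sketched as a small wrapper around any fetch function. This is an illustration only, not SupaCrawler's actual implementation: `retry_with_backoff`, `fetch`, and the injectable `sleep` parameter are hypothetical names, and it assumes "3 retries" means three additional attempts after the first one.

```python
import time


def retry_with_backoff(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure, retry up to max_retries more times.

    The delay doubles after each failed attempt: 1s, 2s, 4s with the defaults.
    Returns (result, None) on success, or (None, last_error) once retries run out.
    """
    last_error = None
    for attempt in range(max_retries + 1):  # one initial try plus max_retries retries
        try:
            return fetch(url), None
        except Exception as exc:  # a real crawler would catch narrower error types
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s
    return None, last_error
```

A failed page then shows up as `(None, error)` and can be marked as failed, while successful results are aggregated. Note that `selenium_crawl` below treats `max_retries=3` as three attempts in total, so its loop only ever sleeps for 1s and 2s.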


In [26]:
from selenium.common.exceptions import WebDriverException

def selenium_crawl(start_url, max_pages=5, max_retries=3):
    """Manual crawling with Selenium WebDriver + retries/backoff"""
    start_time = time.time()
    
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    visited = set()
    to_visit = deque([start_url])
    results = []
    
    try:
        while to_visit and len(results) < max_pages:
            url = to_visit.popleft()
            if url in visited:
                continue
            visited.add(url)

            attempt = 0
            success = False

            while attempt < max_retries and not success:
                try:
                    driver.get(url)
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.TAG_NAME, "body"))
                    )
                    time.sleep(2)  # let JS render

                    title = driver.title or "N/A"
                    body = driver.find_element(By.TAG_NAME, "body")
                    content = body.text[:500] + "..." if len(body.text) > 500 else body.text

                    # Meta tags
                    try:
                        description = driver.find_element(By.XPATH, "//meta[@name='description']").get_attribute("content")
                    except WebDriverException:  # includes NoSuchElementException
                        description = None
                    try:
                        keywords = driver.find_element(By.XPATH, "//meta[@name='keywords']").get_attribute("content")
                    except WebDriverException:  # includes NoSuchElementException
                        keywords = None

                    # Headers
                    headers = []
                    for tag in ["h1", "h2", "h3"]:
                        for elem in driver.find_elements(By.TAG_NAME, tag):
                            text = elem.text.strip()
                            if text:
                                headers.append({tag: text})

                    # Links (same domain only)
                    base_domain = urlparse(start_url).netloc
                    links = []
                    try:
                        link_elements = driver.find_elements(By.TAG_NAME, "a")
                        for link in link_elements:
                            href = link.get_attribute("href")
                            if href and href.startswith("http"):
                                if urlparse(href).netloc == base_domain:
                                    links.append(href)
                                    if href not in visited and len(to_visit) < 20:
                                        to_visit.append(href)
                    except WebDriverException:
                        pass  # e.g. a link went stale mid-iteration; keep what was collected

                    # Metadata
                    metadata = {
                        "url": url,
                        "title": title,
                        "description": description,
                        "keywords": keywords,
                        "headers": headers,
                        "word_count": len(body.text.split()),
                        "links_found": len(links)
                    }

                    results.append({
                        "content": content,
                        "metadata": metadata
                    })

                    # Success, mark and stop retry loop
                    success = True
                    # Random sleep to avoid fixed pattern
                    time.sleep(random.uniform(1, 2))

                except WebDriverException as e:
                    attempt += 1
                    if attempt < max_retries:
                        backoff = 2 ** (attempt - 1)  # 1s, then 2s; the third failure gives up without sleeping
                        print(f"Retry {attempt} for {url} after {backoff}s due to {e}")
                        time.sleep(backoff)
                    else:
                        print(f"Failed to scrape {url} after {max_retries} attempts")
                        break

    finally:
        driver.quit()
    
    end_time = time.time()
    return {
        "pages_crawled": len(results),
        "total_time": end_time - start_time,
        "avg_time_per_page": (end_time - start_time) / len(results) if results else 0,
        "results": results
    }

def supacrawler_crawl(start_url, max_pages=5):
    """Built-in crawling with SupaCrawler (SDK-native usage)"""
    start_time = time.time()
    client = SupacrawlerClient(api_key="")

    try:
        job = client.create_crawl_job(
            url=start_url,
            format="markdown",
            link_limit=max_pages,
            depth=3,
            include_subdomains=False,
            render_js=True,
            fresh=True # fresh never uses cached results
        )

        crawl_output = client.wait_for_crawl(
            job.job_id,
            interval_seconds=1.0,
            timeout_seconds=120.0
        )

        crawl_data = crawl_output.data.crawl_data
        end_time = time.time()

        return {
            "pages_crawled": len(crawl_data),
            "total_time": end_time - start_time,
            "avg_time_per_page": (end_time - start_time) / len(crawl_data) if crawl_data else 0,
            "crawl_data": crawl_data  # keep the native objects
        }

    except Exception as e:
        end_time = time.time()
        return {
            "pages_crawled": 0,
            "total_time": end_time - start_time,
            "error": str(e)
        }
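
The link-queueing part of `selenium_crawl` can be looked at on its own, without a browser. The sketch below is a simplified variant rather than the exact logic above: it deduplicates when a link is enqueued instead of when it is popped, strips `#fragment` suffixes, and has no cap on queue size. `same_domain_links`, `crawl_order`, and `get_links` are hypothetical names introduced for illustration.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(page_url, hrefs, base_domain):
    """Resolve hrefs against page_url and keep http(s) links on base_domain."""
    kept = []
    for href in hrefs:
        # urljoin handles relative links; urldefrag drops any "#section" suffix.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == base_domain:
            kept.append(absolute)
    return kept


def crawl_order(start_url, get_links, max_pages=5):
    """Breadth-first traversal; get_links(url) returns the raw hrefs found on url."""
    base_domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so nothing is enqueued twice
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in same_domain_links(url, get_links(url), base_domain):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

In the Selenium version, `get_links` would correspond to collecting the `href` attribute of every `<a>` element on the loaded page.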

Selenium results:

In [3]:
# Test crawling
crawl_url = "https://docs.python.org"
max_pages = 10

print("Multi-Page Crawling Comparison")
print("=" * 35)
print(f"Test URL: {crawl_url}")
print(f"Max pages: {max_pages}")
print()

# --- Selenium ---
print("Selenium manual crawling:")
selenium_crawl_result = selenium_crawl(crawl_url, max_pages)
print(f"Pages crawled: {selenium_crawl_result['pages_crawled']}")
print(f"Total time: {selenium_crawl_result['total_time']:.2f}s")
print(f"Average per page: {selenium_crawl_result['avg_time_per_page']:.2f}s")
print()

if selenium_crawl_result["pages_crawled"] > 0:
    first_page = selenium_crawl_result["results"][0]
    print("\nFirst Selenium page result:")
    print("URL:", first_page["metadata"]["url"])
    print("Title:", first_page["metadata"]["title"])
    print("Text Preview:", first_page.get("content", "")[:30], "...\n")
    print("Metadata:", first_page["metadata"])

Multi-Page Crawling Comparison
Test URL: https://docs.python.org
Max pages: 10

Selenium manual crawling:
Pages crawled: 10
Total time: 42.11s
Average per page: 4.21s


First Selenium page result:
URL: https://docs.python.org
Title: 3.13.7 Documentation
Text Preview: dev (3.15)
pre (3.14)
3.13.7
3 ...

Metadata: {'url': 'https://docs.python.org', 'title': '3.13.7 Documentation', 'description': 'The official Python documentation.', 'keywords': None, 'headers': [{'h1': 'Python 3.13.7 documentation'}], 'word_count': 244, 'links_found': 70}


SupaCrawler results:

In [13]:
# --- SupaCrawler ---
print("SupaCrawler built-in crawling:")
sc_crawl_result = supacrawler_crawl(crawl_url, max_pages)

if "error" in sc_crawl_result:
    print(f"Error: {sc_crawl_result['error']}")
else:
    print(f"Pages crawled: {sc_crawl_result['pages_crawled']}")
    print(f"Total time: {sc_crawl_result['total_time']:.2f}s")
    print(f"Average per page: {sc_crawl_result['avg_time_per_page']:.2f}s")

    # Show preview of first page (SDK-native)
    if sc_crawl_result["pages_crawled"] > 0:
        first_url = sc_crawl_result["crawl_data"][0]
        print("\nFirst SupaCrawler page result:")
        print("Markdown Preview:", first_url.markdown[:200], "...")
        print("\nMetadata:", first_url.metadata.to_json())

SupaCrawler built-in crawling:
Pages crawled: 10
Total time: 2.05s
Average per page: 0.20s

First SupaCrawler page result:
Markdown Preview: # Python 3.13.7 documentation
Welcome! This is the official documentation for Python 3.13.7.
**Documentation sections:**
[What's new in Python 3.13?](whatsnew/3.13.html)
Or [all "What's new" documents ...

Metadata: {'title': '3.13.7 Documentation', 'status_code': 200, 'description': 'The official Python documentation.', 'canonical': 'https://docs.python.org/3/index.html', 'favicon': 'https://docs.python.org/_static/py.svg', 'og_title': 'Python 3.13 documentation', 'og_description': 'The official Python documentation.', 'og_image': 'https://docs.python.org/3/_static/og-image.png', 'source_url': 'https://docs.python.org'}


Now let's compare their performance!

In [14]:
# --- Performance summary ---
if (
    selenium_crawl_result["pages_crawled"] > 0
    and sc_crawl_result.get("pages_crawled", 0) > 0
    and "error" not in sc_crawl_result
):
    speed_ratio = selenium_crawl_result["avg_time_per_page"] / sc_crawl_result["avg_time_per_page"]
    print(f"\n\nPerformance: SupaCrawler is {speed_ratio:.1f}x faster per page")



Performance: SupaCrawler is 20.6x faster per page


## Part 3: A More Comprehensive Multi-Page Crawl
This time we crawl up to 50 pages on each of three different websites:
- supabase.com
- docs.python.org
- ai.google.dev

In [29]:
# Benchmark Runner: Compare Selenium vs SupaCrawler across multiple sites
test_sites = [
    "https://supabase.com",
    "https://docs.python.org",
    "https://ai.google.dev",
]
max_pages = 50

def run_comparison(crawl_url: str, max_pages: int):
    print("\n" + "=" * 60)
    print(f"Benchmarking: {crawl_url}")
    print(f"Max pages: {max_pages}")
    print("=" * 60)

    # --- Selenium ---
    print("\n[Selenium] Manual crawling:")
    selenium_result = selenium_crawl(crawl_url, max_pages)
    print(f"  Pages crawled: {selenium_result['pages_crawled']}")
    print(f"  Total time: {selenium_result['total_time']:.2f}s")
    print(f"  Avg per page: {selenium_result['avg_time_per_page']:.2f}s")

    if selenium_result["pages_crawled"] > 0:
        first_page = selenium_result["results"][0]
        print("  First page title:", first_page["metadata"]["title"])

    # --- SupaCrawler ---
    print("\n[SupaCrawler] Built-in crawling:")
    sc_result = supacrawler_crawl(crawl_url, max_pages)

    if "error" in sc_result:
        print(f"  Error: {sc_result['error']}")
    else:
        print(f"  Pages crawled: {sc_result['pages_crawled']}")
        print(f"  Total time: {sc_result['total_time']:.2f}s")
        print(f"  Avg per page: {sc_result['avg_time_per_page']:.2f}s")

        if sc_result["pages_crawled"] > 0:
            first_url = sc_result["crawl_data"][0]
            print("  First page markdown preview:", first_url.markdown[:120], "...")

    # --- Performance summary ---
    if (
        selenium_result["pages_crawled"] > 0
        and sc_result.get("pages_crawled", 0) > 0
        and "error" not in sc_result
    ):
        ratio = selenium_result["avg_time_per_page"] / sc_result["avg_time_per_page"]
        print(f"\n⚡ Performance: SupaCrawler is {ratio:.1f}x faster per page")

# --- Run benchmark for all test sites ---
print("Multi-Site Crawling Comparison\n")
for site in test_sites:
    run_comparison(site, max_pages)


Multi-Site Crawling Comparison


Benchmarking: https://supabase.com
Max pages: 50

[Selenium] Manual crawling:
  Pages crawled: 50
  Total time: 241.49s
  Avg per page: 4.83s
  First page title: Supabase | The Postgres Development Platform.

[SupaCrawler] Built-in crawling:
  Pages crawled: 50
  Total time: 36.26s
  Avg per page: 0.73s
  First page markdown preview: # Build in a weekendScale to millions
Supabase is the Postgres development platform.
Start your project with a Postgres  ...

⚡ Performance: SupaCrawler is 6.7x faster per page

Benchmarking: https://docs.python.org
Max pages: 50

[Selenium] Manual crawling:
  Pages crawled: 50
  Total time: 206.45s
  Avg per page: 4.13s
  First page title: 3.13.7 Documentation

[SupaCrawler] Built-in crawling:
  Pages crawled: 50
  Total time: 41.33s
  Avg per page: 0.83s
  First page markdown preview: # Python 3.13.7 documentation
Welcome! This is the official documentation for Python 3.13.7.
**Documentation sections:** ...

⚡ Performance

### Multi-Site Crawling Comparison

| Site | Tool        | Pages Crawled | Total Time (s) | Avg per Page (s) | First Page Title / Preview | Performance Gain |
|------|------------|---------------|----------------|------------------|----------------------------|------------------|
| [Supabase](https://supabase.com) | **Selenium**     | 50 | 241.49 | 4.83 | *Supabase \| The Postgres Development Platform.* | — |
|      | **SupaCrawler** | 50 | 36.26  | 0.73 | `# Build in a weekend — Scale to millions` … | ⚡ **6.7× faster** |
| [Python Docs](https://docs.python.org) | **Selenium**     | 50 | 206.45 | 4.13 | *3.13.7 Documentation* | — |
|      | **SupaCrawler** | 50 | 41.33  | 0.83 | `# Python 3.13.7 documentation` … | ⚡ **5.0× faster** |
| [Google AI Docs](https://ai.google.dev) | **Selenium**     | 50 | 455.67 | 9.11 | *Gemini Developer API \| Gemma open models \| Google AI for Developers* | — |
|      | **SupaCrawler** | 50 | 37.25  | 0.74 | `[New Gemini 2.5 Flash Image …]` | ⚡ **12.2× faster** |
