# Playwright vs SupaCrawler: Complete Performance Comparison

This notebook compares Playwright (Python) against SupaCrawler for both single-page scraping and multi-page crawling. We focus on performance, infrastructure requirements, and development complexity.

In [None]:
# Installation requirements
# !pip install playwright
# !playwright install chromium
# !pip install supacrawler

In [None]:
import time
from playwright.async_api import async_playwright
from supacrawler import SupacrawlerClient
from urllib.parse import urljoin, urlparse
from collections import deque

## Part 1: Single Page Scraping Comparison

In [10]:
async def playwright_scrape(url):
    """Single page scraping with Playwright"""
    start = time.time()
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()
        
        try:
            await page.goto(url, wait_until='networkidle')
            
            title = await page.title()
            content = await page.evaluate('document.body.textContent')
            
            result = {
                'title': title,
                'content': content[:200] + "..." if len(content) > 200 else content,
                'time': time.time() - start,
                'javascript_support': True,
            }
            
        finally:
            await browser.close()
    
    return result

def supacrawler_scrape(url):
    """Single page scraping with SupaCrawler"""
    start = time.time()
    
    client = SupacrawlerClient(api_key='')
    response = client.scrape(url, format='markdown', render_js=True, fresh=True)
    
    title = response.metadata.title if response.metadata else 'No title'
    content = response.markdown if response.markdown else "No content"
    
    return {
        'title': title,
        'content': content[:200] + "..." if len(content) > 200 else content,
        'time': time.time() - start,
        'javascript_support': True,
    }

# Test single page scraping
async def run_scraping_test():
    test_url = 'https://example.com'
    
    print("Single Page Scraping Comparison")
    print("=" * 35)
    print(f"Test URL: {test_url}")
    print()
    
    print("Playwright:")
    playwright_result = await playwright_scrape(test_url)
    print(f"Title: {playwright_result['title']}")
    print(f"Time: {playwright_result['time']:.2f}s")
    print()
    
    print("SupaCrawler:")
    sc_result = supacrawler_scrape(test_url)
    print(f"Title: {sc_result['title']}")
    print(f"Time: {sc_result['time']:.2f}s")
    print()
    
    if playwright_result['time'] > 0 and sc_result['time'] > 0:
        ratio = playwright_result['time'] / sc_result['time']
        print(f"Performance: SupaCrawler is {ratio:.1f}x faster")
    
    print(f"Setup complexity: Playwright (browser install + 1GB), SupaCrawler (API key only)")

# Run the async function
await run_scraping_test()

Single Page Scraping Comparison
Test URL: https://example.com

Playwright:
Title: Example Domain
Time: 7.58s

SupaCrawler:
Title: Example Domain
Time: 1.21s

Performance: SupaCrawler is 6.3x faster
Setup complexity: Playwright (browser install + 1GB), SupaCrawler (API key only)


## Part 2: Multi-Page Crawling Comparison

### How SupaCrawler Crawl Works

1. **Crawl dispatch** → Crawl breaks the target domain into a queue of page URLs.
2. **Per-page scrape** → For each URL, crawl calls the exact same **scrape service** that is exposed publicly (`/scrape`).
3. **Retry & backoff** → Each scrape runs with up to **3 retries** and **exponential backoff** (1s → 2s → 4s).
4. **Error handling** → If all retries fail, the page is marked as failed.
5. **Aggregation** → Successfully scraped pages are aggregated into the crawl result.

The crawl path is literally just a loop over our scrape service, so the **retry, backoff, and error logic** are identical between crawl and scrape. To keep the Playwright benchmark fair, we applied the **same retry + backoff algorithm** when fetching each page.

Therefore, **both systems (SupaCrawler and Playwright)** operate under the same conditions: same error tolerance, same retry count, same backoff schedule.

This guarantees the benchmark is **apples-to-apples**: differences come from **architecture and performance**, not from error-handling or retry policies.
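
For illustration, the retry policy described above can be sketched in a few lines of Python. This is a minimal sketch, not SupaCrawler's actual implementation: `backoff_delays` and `call_with_retries` are hypothetical names, and the injectable `sleep` argument is an assumption added so the schedule can be checked without waiting.

```python
import time

def backoff_delays(retries=3, base=1.0):
    """Delay before each retry; 1s, 2s, 4s with the defaults."""
    return [base * 2 ** i for i in range(retries)]

def call_with_retries(fn, retries=3, base=1.0, sleep=time.sleep):
    """Run fn(); on failure, wait and retry up to `retries` more times."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: the caller marks the page as failed
            sleep(delays[attempt])
```

Note that the Playwright loop in the next cell counts attempts rather than retries (`max_retries=3` means three attempts in total), so it sleeps for 1s and 2s before giving up.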


In [2]:
import asyncio
import random
import time
from collections import deque
from urllib.parse import urlparse, urljoin
from playwright.async_api import async_playwright

async def playwright_crawl(start_url, max_pages=5, max_retries=3):
    """Manual crawling with Playwright + retries/backoff"""
    start_time = time.time()
    
    visited = set()
    to_visit = deque([start_url])
    results = []
    
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        )
        
        try:
            while to_visit and len(results) < max_pages:
                url = to_visit.popleft()
                if url in visited:
                    continue
                visited.add(url)

                attempt = 0
                success = False

                while attempt < max_retries and not success:
                    try:
                        page = await context.new_page()
                        await page.goto(url, wait_until="networkidle", timeout=30000)

                        # Extract title and content
                        title = await page.title() or "N/A"
                        body_content = await page.evaluate("document.body.textContent")
                        content = body_content[:500] + "..." if len(body_content) > 500 else body_content

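                        # Meta tags (get_attribute() auto-waits for the element, so a
                        # missing tag only raises after the locator's default timeout)
                        try:
                            description = await page.locator("meta[name=description]").get_attribute("content")
                        except Exception:
                            description = None
                        try:
                            keywords = await page.locator("meta[name=keywords]").get_attribute("content")
                        except Exception:
                            keywords = None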

                        # Headers
                        headers = []
                        for tag in ["h1", "h2", "h3"]:
                            elems = await page.locator(tag).all()
                            for elem in elems:
                                text = (await elem.inner_text()).strip()
                                if text:
                                    headers.append({tag: text})

                        # Links (same domain only)
                        base_domain = urlparse(start_url).netloc
                        links = []
                        link_elements = await page.locator("a[href]").all()
                        for link in link_elements:
                            href = await link.get_attribute("href")
                            if href:
                                full_url = urljoin(url, href)
                                if urlparse(full_url).netloc == base_domain:
                                    links.append(full_url)
                                    if full_url not in visited and len(to_visit) < 20:
                                        to_visit.append(full_url)

                        # Metadata
                        metadata = {
                            "url": url,
                            "title": title,
                            "description": description,
                            "keywords": keywords,
                            "headers": headers,
                            "word_count": len(body_content.split()),
                            "links_found": len(links),
                        }

                        results.append({
                            "content": content,
                            "metadata": metadata
                        })

                        success = True
                        await page.close()
                        await asyncio.sleep(random.uniform(1, 2))  # avoid fixed patterns

                    except Exception as e:
                        attempt += 1
                        if attempt < max_retries:
                            backoff = 2 ** (attempt - 1)  # exponential backoff
                            print(f"Retry {attempt} for {url} after {backoff}s due to {e}")
                            await asyncio.sleep(backoff)
                        else:
                            print(f"Failed to scrape {url} after {max_retries} attempts")
                            break

        finally:
            await browser.close()
    
    end_time = time.time()
    return {
        "pages_crawled": len(results),
        "total_time": end_time - start_time,
        "avg_time_per_page": (end_time - start_time) / len(results) if results else 0,
        "results": results
    }


def supacrawler_crawl(start_url, max_pages=5):
    """Built-in crawling with SupaCrawler (SDK-native usage)"""
    start_time = time.time()
    client = SupacrawlerClient(api_key="")

    try:
        job = client.create_crawl_job(
            url=start_url,
            format="markdown",
            link_limit=max_pages,
            depth=3,
            include_subdomains=False,
            render_js=True,
            fresh=True # fresh never uses cached results
        )

        crawl_output = client.wait_for_crawl(
            job.job_id,
            interval_seconds=1.0,
            timeout_seconds=120.0
        )

        crawl_data = crawl_output.data.crawl_data
        end_time = time.time()

        return {
            "pages_crawled": len(crawl_data),
            "total_time": end_time - start_time,
            "avg_time_per_page": (end_time - start_time) / len(crawl_data) if crawl_data else 0,
            "crawl_data": crawl_data  # keep the native objects
        }

    except Exception as e:
        end_time = time.time()
        return {
            "pages_crawled": 0,
            "total_time": end_time - start_time,
            "error": str(e)
        }


In [12]:
# Test crawling
async def run_crawling_test():
    crawl_url = "https://docs.python.org"
    max_pages = 5
    
    print("Multi-Page Crawling Comparison")
    print("=" * 35)
    print(f"Test URL: {crawl_url}")
    print(f"Max pages: {max_pages}")
    print()
    
    print("Playwright manual crawling:")
    playwright_crawl_result = await playwright_crawl(crawl_url, max_pages)
    print(f"Pages crawled: {playwright_crawl_result['pages_crawled']}")
    print(f"Total time: {playwright_crawl_result['total_time']:.2f}s")
    print(f"Average per page: {playwright_crawl_result['avg_time_per_page']:.2f}s\n")
    
    print("SupaCrawler built-in crawling:")
    sc_crawl_result = supacrawler_crawl(crawl_url, max_pages)
    
    if 'error' in sc_crawl_result:
        print(f"Error: {sc_crawl_result['error']}")
    else:
        print(f"Pages crawled: {sc_crawl_result['pages_crawled']}")
        print(f"Total time: {sc_crawl_result['total_time']:.2f}s")
        print(f"Average per page: {sc_crawl_result['avg_time_per_page']:.2f}s")
        
    
    # Performance summary
    if (playwright_crawl_result['pages_crawled'] > 0 and sc_crawl_result.get('pages_crawled', 0) > 0 and 'error' not in sc_crawl_result):
        speed_ratio = playwright_crawl_result['avg_time_per_page'] / sc_crawl_result['avg_time_per_page']
        print(f"Performance: SupaCrawler is {speed_ratio:.1f}x faster per page")
    
    print(f"Playwright success rate: {playwright_crawl_result['pages_crawled'] / max_pages * 100:.1f}%")
    print(f"SupaCrawler success rate: {sc_crawl_result.get('pages_crawled', 0) / max_pages * 100:.1f}%")

# Run the async function
await run_crawling_test()

Multi-Page Crawling Comparison
Test URL: https://docs.python.org
Max pages: 5

Playwright manual crawling:
Pages crawled: 5
Total time: 164.39s
Average per page: 32.88s

SupaCrawler built-in crawling:
Pages crawled: 5
Total time: 5.08s
Average per page: 1.02s
Performance: SupaCrawler is 32.4x faster per page
Playwright success rate: 100.0%
SupaCrawler success rate: 100.0%
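
The reported ratio is computed from the unrounded per-page averages, which is why it differs slightly from dividing the rounded values printed above (32.88 / 1.02 ≈ 32.2). A small sketch of the arithmetic, using a hypothetical helper `per_page_speedup` and the totals copied from this run:

```python
def per_page_speedup(slow_total_s, slow_pages, fast_total_s, fast_pages):
    """Ratio of average seconds per page (slow / fast)."""
    slow_avg = slow_total_s / slow_pages
    fast_avg = fast_total_s / fast_pages
    return slow_avg / fast_avg

# Totals copied from the output above (5 pages each)
ratio = per_page_speedup(164.39, 5, 5.08, 5)
print(f"{ratio:.1f}x")  # prints 32.4x
```

When both crawlers return the same number of pages, this is simply the ratio of total times.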


## Part 3: A More Comprehensive Multi-Page Crawl Test
Let's run 50 pages on each of three different websites this time:
-  supabase.com
-  docs.python.org
-  ai.google.dev

In [3]:
async def run_comparison(crawl_url: str, max_pages: int):
    print("\n" + "=" * 60)
    print(f"Benchmarking: {crawl_url}")
    print(f"Max pages: {max_pages}")
    print("=" * 60)

    # --- Playwright ---
    print("\n[Playwright] Manual crawling:")
    playwright_result = await playwright_crawl(crawl_url, max_pages)
    print(f"  Pages crawled: {playwright_result['pages_crawled']}")
    print(f"  Total time: {playwright_result['total_time']:.2f}s")
    print(f"  Avg per page: {playwright_result['avg_time_per_page']:.2f}s")

    if playwright_result["pages_crawled"] > 0:
        first_page = playwright_result["results"][0]
        print("  First page title:", first_page["metadata"]["title"])

    # --- SupaCrawler ---
    print("\n[SupaCrawler] Built-in crawling:")
    sc_result = supacrawler_crawl(crawl_url, max_pages)

    if "error" in sc_result:
        print(f"  Error: {sc_result['error']}")
    else:
        print(f"  Pages crawled: {sc_result['pages_crawled']}")
        print(f"  Total time: {sc_result['total_time']:.2f}s")
        print(f"  Avg per page: {sc_result['avg_time_per_page']:.2f}s")

        if sc_result["pages_crawled"] > 0:
            first_url = sc_result["crawl_data"][0]
            print("  First page markdown preview:", first_url.markdown[:120], "...")

    # --- Performance summary ---
    if (
        playwright_result["pages_crawled"] > 0
        and sc_result.get("pages_crawled", 0) > 0
        and "error" not in sc_result
    ):
        ratio = playwright_result["avg_time_per_page"] / sc_result["avg_time_per_page"]
        print(f"\n⚡ Performance: SupaCrawler is {ratio:.1f}x faster per page")

In [5]:
# Benchmark Runner: Compare Playwright vs SupaCrawler across multiple sites
test_sites = [
    "https://supabase.com",
    "https://docs.python.org",
    "https://ai.google.dev",
]
max_pages = 50

await run_comparison(test_sites[0], max_pages)


Benchmarking: https://supabase.com
Max pages: 50

[Playwright] Manual crawling:
Retry 1 for https://supabase.com/dashboard after 1s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://supabase.com/dashboard", waiting until "networkidle"

Retry 2 for https://supabase.com/dashboard after 2s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://supabase.com/dashboard", waiting until "networkidle"

Failed to scrape https://supabase.com/dashboard after 3 attempts
Retry 1 for https://supabase.com/dashboard/support/new after 1s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://supabase.com/dashboard/support/new", waiting until "networkidle"

Retry 2 for https://supabase.com/dashboard/support/new after 2s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://supabase.com/dashboard/support/new", waiting until "networkidle"

Failed to scrape https://supabase.com/dashboard/support/ne

In [6]:
await run_comparison(test_sites[1], max_pages)


Benchmarking: https://docs.python.org
Max pages: 50

[Playwright] Manual crawling:
  Pages crawled: 50
  Total time: 2775.49s
  Avg per page: 55.51s
  First page title: 3.13.7 Documentation

[SupaCrawler] Built-in crawling:
  Pages crawled: 50
  Total time: 35.32s
  Avg per page: 0.71s
  First page markdown preview: # Python 3.13.7 documentation
Welcome! This is the official documentation for Python 3.13.7.
**Documentation sections:** ...

⚡ Performance: SupaCrawler is 78.6x faster per page


In [7]:
await run_comparison(test_sites[2], max_pages)


Benchmarking: https://ai.google.dev
Max pages: 50

[Playwright] Manual crawling:
Retry 1 for https://ai.google.dev/gemini-api/docs/video after 1s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://ai.google.dev/gemini-api/docs/video", waiting until "networkidle"

Retry 2 for https://ai.google.dev/gemini-api/docs/video after 2s due to Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://ai.google.dev/gemini-api/docs/video", waiting until "networkidle"

Failed to scrape https://ai.google.dev/gemini-api/docs/video after 3 attempts
  Pages crawled: 50
  Total time: 1433.59s
  Avg per page: 28.67s
  First page title: Gemini Developer API | Gemma open models  |  Google AI for Developers

[SupaCrawler] Built-in crawling:
  Pages crawled: 50
  Total time: 36.44s
  Avg per page: 0.73s
  First page markdown preview: [NewGemini 2.5 Flash Image (aka Nano Banana) is now available in the Gemini API!](https://ai.google.dev/gemini-api/docs/ ...

⚡ Pe