
# Fair BeautifulSoup vs Supacrawler: Complete Performance Comparison

This notebook compares BeautifulSoup (with requests) against Supacrawler, giving the BeautifulSoup side retry logic modeled on Supacrawler's implementation. The comparison is set up as follows:

1. **Identical Retry Logic**: Uses the same exponential backoff and retry conditions as Supacrawler (delays of 1s, then 2s, across three attempts)
2. **Smart Error Classification**: Retries only retryable errors (429, 502, 503, 504, timeouts, connection resets/refusals), not 403/404
3. **Same Max Retries**: 3 attempts total, matching the Supacrawler service
4. **Timeout Consistency**: Uses a 10s request timeout, matching Supacrawler's HTTP timeout

Note: BeautifulSoup + requests performs plain HTTP scraping without executing JavaScript, as does Supacrawler with `render_js=False`, so the comparison is fair for non-JS content only.
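
The backoff schedule from point 1 can be sketched on its own. This is a minimal illustration, assuming the delay before retry number `attempt` is `1 << (attempt - 1)` seconds, which is the formula the scraper cells in this notebook use. `backoff_schedule` is a hypothetical helper name, not part of Supacrawler or requests.

```python
def backoff_schedule(max_attempts):
    """Return the delays (in seconds) slept before each retry.

    The first attempt runs immediately, so max_attempts attempts
    produce max_attempts - 1 delays.
    """
    return [1 << (attempt - 1) for attempt in range(1, max_attempts)]


print(backoff_schedule(3))  # [1, 2]
print(backoff_schedule(4))  # [1, 2, 4]
```

With three attempts only the 1s and 2s delays are ever applied; a 4s delay would require a fourth attempt.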

In [1]:
import time
import requests
from bs4 import BeautifulSoup
from supacrawler import SupacrawlerClient

In [2]:
def is_retryable_error(error):
    """
    Check if error should be retried - EXACTLY matching Supacrawler's retry logic.
    This mirrors the logic in supacrawler/internal/core/scrape/service.go:isRetryableScrapingError
    """
    if error is None:
        return False
    
    error_str = str(error).lower()
    
    # Rate limiting and server errors (retryable)
    if any(term in error_str for term in ["429", "too many requests", "rate limit"]):
        return True
    if any(term in error_str for term in ["503", "service unavailable", "502", "bad gateway", "504", "gateway timeout"]):
        return True
    if any(term in error_str for term in ["connection reset", "connection refused", "timeout"]) and "permanent" not in error_str:
        return True
    
    # Client errors (NOT retryable - this is crucial for fairness)
    if any(term in error_str for term in ["403", "forbidden", "404", "not found"]):
        return False
    
    return False
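
Matching on the error message can misfire, for example when the URL in the message happens to contain `503`. As a hedged alternative, the same policy can be keyed on the numeric status code. `is_retryable_status` and `RETRYABLE_STATUSES` are hypothetical names for illustration; they are not taken from Supacrawler's source, and the set of codes is simply the one used in `is_retryable_error` above.

```python
from http import HTTPStatus

# Status codes treated as retryable in is_retryable_error (assumed policy)
RETRYABLE_STATUSES = {
    HTTPStatus.TOO_MANY_REQUESTS,    # 429
    HTTPStatus.BAD_GATEWAY,          # 502
    HTTPStatus.SERVICE_UNAVAILABLE,  # 503
    HTTPStatus.GATEWAY_TIMEOUT,      # 504
}


def is_retryable_status(status_code):
    """Return True only for status codes worth retrying."""
    return status_code in RETRYABLE_STATUSES


print(is_retryable_status(503))  # True
print(is_retryable_status(404))  # False
```

With requests, the status code of a failed response is available as `e.response.status_code` on the `requests.exceptions.HTTPError` raised by `raise_for_status()`.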

In [3]:
def beautifulsoup_scrape_single(url, max_retries=3):
    """Single page scraping with BeautifulSoup + requests - FAIR version"""
    start = time.time()
    
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
    })
    
    # EXACT Supacrawler retry logic
    for attempt in range(max_retries):
        try:
            # Apply backoff BEFORE attempt (if retry)
            if attempt > 0:
                # EXACT Supacrawler formula: d := time.Duration(1<<(attempt-1)) * time.Second
                backoff = 1 << (attempt - 1)  # 1s, then 2s (with max_retries=3)
                print(f"Retry {attempt} for {url} after {backoff}s")
                time.sleep(backoff)
            
            # Timeout matching Supacrawler's 10s HTTP timeout
            response = session.get(url, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.content, 'html.parser')
            # soup.title.string is None when <title> is empty or has nested tags
            title = soup.title.string if soup.title and soup.title.string else 'No title'
            
            # Extract text content (similar to body.textContent)
            text = soup.get_text()
            content = text[:200] + "..." if len(text) > 200 else text
            
            return {
                'title': title.strip(),
                'content': content.strip(),
                'time': time.time() - start,
                'javascript_support': False,  # BeautifulSoup doesn't execute JS
                'resource_usage': 'Low (HTTP only)'
            }
            
        except Exception as e:
            # CRUCIAL: Use Supacrawler's EXACT retry logic
            if is_retryable_error(e) and attempt < max_retries - 1:
                print(f"Retryable error for {url}: {e}")
                continue  # Will apply backoff on next iteration
            else:
                # Non-retryable error or max retries reached
                if attempt >= max_retries - 1:
                    print(f"Failed to scrape {url} after {max_retries} attempts: {e}")
                else:
                    print(f"Non-retryable error for {url}: {e}")
                return {
                    'title': 'Error',
                    'content': f"Error: {e}",
                    'time': time.time() - start,
                    'javascript_support': False,
                    'resource_usage': 'Low (HTTP only)'
                }
    
def supacrawler_scrape_single(url):
    """Single page scraping with Supacrawler (non-JS for fair comparison)"""
    start = time.time()
    
    client = SupacrawlerClient(api_key='')
    # Use render_js=False for fair comparison with BeautifulSoup
    response = client.scrape(url, format='markdown', render_js=False, fresh=True)
    
    title = response.metadata.title if response.metadata else 'No title'
    content = response.content if response.content else "No content"
    
    return {
        'title': title,
        'content': content[:200] + "..." if len(content) > 200 else content,
        'metadata': response.metadata,
        'time': time.time() - start,
        'javascript_support': False,  # Fair comparison
        'resource_usage': 'Zero local resources'
    }


In [11]:
url = 'https://supabase.com'

bs_output = beautifulsoup_scrape_single(url)
print(bs_output)

sc_output = supacrawler_scrape_single(url)
print(sc_output)

{'title': 'Supabase | The Postgres Development Platform.', 'content': 'Supabase | The Postgres Development Platform.Product Developers Solutions PricingDocsBlog88.5KSign inStart your projectOpen main menuBuild in a weekendScale to millionsSupabase is the Postgres develop...', 'time': 0.3613901138305664, 'javascript_support': False, 'resource_usage': 'Low (HTTP only)'}
{'title': 'Supabase | The Postgres Development Platform.', 'content': '# Build in a weekendScale to millions\nSupabase is the Postgres development platform.\nStart your project with a Postgres database, Authentication, instant APIs, Edge Functions, Realtime subscriptions, ...', 'metadata': <supacrawler.types.PageMetadata object at 0x10daffe50>, 'time': 0.1987929344177246, 'javascript_support': False, 'resource_usage': 'Zero local resources'}


In [12]:
# time difference
print(f"Time difference: {sc_output['time'] - bs_output['time']} seconds")
print(f"Supacrawler is {1 / (sc_output['time'] / bs_output['time'])} times faster than BeautifulSoup")

Time difference: -0.1625971794128418 seconds
Supacrawler is 1.8179223265107378 times faster than BeautifulSoup
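
A single pair of requests is a noisy basis for a speedup figure, since network latency varies from run to run. Below is a minimal sketch of a more robust measurement, assuming a hypothetical `time_median` helper that repeats a callable and reports the median wall-clock time. The workload shown is a stand-in, not a real scrape.

```python
import statistics
import time


def time_median(fn, repeats=5):
    """Call fn() `repeats` times and return the median elapsed seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in the notebook this would be a scraper call
elapsed = time_median(lambda: sum(range(10_000)), repeats=5)
print(f"median: {elapsed:.6f}s")
```

In this notebook the callable could be `lambda: beautifulsoup_scrape_single(url)` or `lambda: supacrawler_scrape_single(url)`, and the two medians compared instead of two single samples.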


In [14]:
# Compare the extracted content previews
print(f"Beautifulsoup Content: \n{(bs_output['content'])}\n")

print(f"Supacrawler Content: \n{(sc_output['content'])}\n")


Beautifulsoup Content: 
Supabase | The Postgres Development Platform.Product Developers Solutions PricingDocsBlog88.3KSign inStart your projectOpen main menuBuild in a weekendScale to millionsSupabase is the Postgres develop...

Supacrawler Content: 
# Build in a weekendScale to millions
Supabase is the Postgres development platform.
Start your project with a Postgres database, Authentication, instant APIs, Edge Functions, Realtime subscriptions, ...



In [6]:
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def beautifulsoup_crawl(start_url, max_pages=5, max_retries=3):
    """Multi-page crawling with BeautifulSoup + requests - FAIR version"""
    start_time = time.time()
    
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'
    })
    
    visited = set()
    to_visit = deque([start_url])
    results = []
    failures = []  # tracked separately; not counted in pages_crawled
    
    while to_visit and len(results) < max_pages:
        url = to_visit.popleft()
        
        if url in visited:
            continue
        visited.add(url)
        
        for attempt in range(max_retries):
            try:
                if attempt > 0:
                    backoff = 1 << (attempt - 1)
                    time.sleep(backoff)
                
                response = session.get(url, timeout=10)
                response.raise_for_status()
                
                soup = BeautifulSoup(response.content, 'html.parser')
                title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
                text = soup.get_text()
                content = text[:200] + "..." if len(text) > 200 else text
                
                base_domain = urlparse(start_url).netloc
                links = []
                for link in soup.find_all("a", href=True):
                    href = urljoin(url, link["href"])
                    if urlparse(href).netloc == base_domain:
                        if href not in visited and len(to_visit) < 20:
                            links.append(href)
                            to_visit.append(href)
                
                results.append({
                    "url": url,
                    "title": title,
                    "content": content.strip(),
                    "links_found": len(links),
                    "metadata": {
                        "status_code": response.status_code,
                        "word_count": len(text.split()),
                        "headers": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
                    }
                })
                
                break  # success, exit retry loop
                
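            except Exception as e:
                # Retry only retryable errors, as in the single-page scraper
                if is_retryable_error(e) and attempt < max_retries - 1:
                    continue
                failures.append({
                    "url": url,
                    "error": str(e)
                })
                break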
    
    end_time = time.time()
    
    return {
        "pages_crawled": len(results),   # only successes
        "failures": failures,            # track errors separately
        "total_time": end_time - start_time,
        "avg_time_per_page": (end_time - start_time) / len(results) if results else 0,
        "results": results
    }


def supacrawler_crawl(start_url, max_pages=5):
    """Built-in crawling with Supacrawler (SDK-native usage)"""
    start_time = time.time()
    client = SupacrawlerClient(api_key="")

    try:
        job = client.create_crawl_job(
            url=start_url,
            format="markdown",
            link_limit=max_pages,
            depth=3,
            include_subdomains=False,
            render_js=False,
            fresh=True # fresh never uses cached results
        )

        crawl_output = client.wait_for_crawl(
            job.job_id,
            interval_seconds=1.0,
            timeout_seconds=120.0
        )

        crawl_data = crawl_output.data.crawl_data
        end_time = time.time()

        return {
            "pages_crawled": len(crawl_data),
            "total_time": end_time - start_time,
            "avg_time_per_page": (end_time - start_time) / len(crawl_data) if crawl_data else 0,
            "crawl_data": crawl_data  # keep the native objects
        }

    except Exception as e:
        end_time = time.time()
        return {
            "pages_crawled": 0,
            "total_time": end_time - start_time,
            "error": str(e)
        }
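
One detail the manual crawler leaves open is URL normalization: links that differ only by fragment (`/docs#intro` vs `/docs`) are queued as separate pages. A hedged sketch of a normalizer built on the standard library's `urldefrag` follows. `normalize_link` is a hypothetical helper, not part of BeautifulSoup, requests, or Supacrawler.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url, href, allowed_netloc):
    """Resolve href against base_url, drop the fragment, and
    return the URL only if it stays on allowed_netloc (else None)."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    if urlparse(url).netloc != allowed_netloc:
        return None
    return url


print(normalize_link("https://go.dev/doc/", "install#linux", "go.dev"))
# https://go.dev/doc/install
print(normalize_link("https://go.dev/doc/", "https://example.com/", "go.dev"))
# None
```

Applied inside the link loop of `beautifulsoup_crawl`, this would keep fragment-only variants of a page from consuming the `max_pages` budget.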


In [7]:
# Benchmark runner: compare BeautifulSoup vs Supacrawler across multiple sites
test_sites = [
    "https://nodejs.org/docs",
    "https://docs.python.org",
    "https://go.dev/doc/",
]
max_pages = 50

def run_comparison(crawl_url: str, max_pages: int):
    print("\n" + "=" * 60)
    print(f"Benchmarking: {crawl_url}")
    print(f"Max pages: {max_pages}")
    print("=" * 60)

    # --- BeautifulSoup ---
    print("\n[BeautifulSoup] Manual crawling:")
    bs_result = beautifulsoup_crawl(crawl_url, max_pages)
    print(f"  Pages crawled: {bs_result['pages_crawled']}")
    print(f"  Total time: {bs_result['total_time']:.2f}s")
    print(f"  Avg per page: {bs_result['avg_time_per_page']:.2f}s")

    if bs_result["pages_crawled"] > 0:
        first_page = bs_result["results"][0]
        print("  First page title:", first_page["title"])
        print("  Metadata:", first_page["metadata"])

    # --- SupaCrawler ---
    print("\n[SupaCrawler] Built-in crawling:")
    sc_result = supacrawler_crawl(crawl_url, max_pages)

    if "error" in sc_result:
        print(f"  Error: {sc_result['error']}")
    else:
        print(f"  Pages crawled: {sc_result['pages_crawled']}")
        print(f"  Total time: {sc_result['total_time']:.2f}s")
        print(f"  Avg per page: {sc_result['avg_time_per_page']:.2f}s")

        if sc_result["pages_crawled"] > 0:
            first_page = sc_result["crawl_data"][0]
            print("  First page markdown preview:", first_page.markdown[:120], "...")
            print("  Metadata:", first_page.metadata.to_json())

    # --- Performance summary ---
    if (
        bs_result["pages_crawled"] > 0
        and sc_result.get("pages_crawled", 0) > 0
        and "error" not in sc_result
    ):
        ratio = bs_result["avg_time_per_page"] / sc_result["avg_time_per_page"]
        print(f"\n⚡ Performance: SupaCrawler is {ratio:.1f}x faster per page")

# --- Run benchmark for all test sites ---
print("Multi-Site Crawling Comparison\n")
for site in test_sites:
    run_comparison(site, max_pages)


Multi-Site Crawling Comparison


Benchmarking: https://nodejs.org/docs
Max pages: 50

[BeautifulSoup] Manual crawling:


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


  Pages crawled: 50
  Total time: 109.19s
  Avg per page: 2.18s
  First page title: Index of /docs/
  Metadata: {'status_code': 200, 'word_count': 3454, 'headers': ['Index of /docs/']}

[SupaCrawler] Built-in crawling:
  Pages crawled: 50
  Total time: 65.51s
  Avg per page: 1.31s
  First page markdown preview: # Run JavaScript Everywhere
Node.js® is a free, open-source, cross-platform JavaScript runtime environment
that lets dev ...
  Metadata: {'title': 'Node.js — Run JavaScript Everywhere', 'status_code': 200, 'description': 'Node.js® is a free, open-source, cross-platform JavaScript runtime environment that lets developers create servers, web apps, command line tools and scripts.', 'canonical': 'https://nodejs.org/en', 'favicon': 'https://nodejs.org/static/images/favicons/favicon.png', 'og_title': 'Node.js — Run JavaScript Everywhere', 'og_description': 'Node.js® is a free, open-source, cross-platform JavaScript runtime environment that lets developers create servers, web apps, com