Web Scraping API Benchmarker

A comprehensive benchmarking tool for evaluating web scraping APIs across diverse target domains with varying levels of protection and complexity.

🎯 Latest Results

Live benchmark results are continuously updated at: https://scrapingtest.com/web-scraping-api-benchmark

📊 Overview

This benchmarking suite was designed to provide the web scraping community with objective, reproducible performance data across major scraping API providers. The tool tests each API against 16 carefully selected domains representing the most commonly scraped websites, ranging from simple API endpoints to heavily protected sites with advanced WAF (Web Application Firewall) protection and JavaScript rendering requirements.

Tested Domain Categories

  • E-Commerce: Amazon, Best Buy, eBay, Walmart
  • Social Media: Instagram, Reddit, YouTube, X/Twitter
  • Professional Networks: GitHub, Indeed, LinkedIn
  • Review Platforms: Trustpilot, G2, Capterra
  • Search Engines: Google
  • Real Estate: Zillow

Supported APIs

  • Oxylabs Web Unblocker
  • ScrapingDog API
  • ScraperAPI
  • ScrapingAnt API
  • WebScrapingAPI
  • Scrape.do
  • ZenRows Universal Scraper
  • Bright Data Web Unlocker
  • ScrapingBee HTML API
  • ⚠️ Scrapfly Web Scraping API: I haven't been able to get an account verified for this API, so the code is untested and might not work. If someone from Scrapfly can reach out with an API key for testing, I'd be happy to get it working :)

🔬 How It Works

Architecture

The benchmarker employs a modular architecture with clear separation of concerns:

├── run-test.py          # Main benchmarking engine
├── APIs/                # Modular API implementations
│   ├── base.py          # Base classes and utilities
│   ├── oxylabs.py       # Oxylabs implementation
│   ├── scrapedo.py      # scrape.do implementation
│   └── ...              # Other API modules
├── config/              # Configuration files
│   ├── api_credentials.json    # API keys and credentials
│   └── domain_configs.json     # Domain test configurations
└── results/             # Benchmark output files
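
The modules under APIs/ all lean on base.py for shared request/response plumbing. The real file isn't reproduced here; inferred from how the modules use it in "Adding New APIs" below, its interface looks roughly like this hypothetical sketch:

import time
import requests

class APIResponse:
    """Normalized result of one scraping attempt (assumed shape)."""
    def __init__(self, status_code, content, response_time_ms, error=None):
        self.status_code = status_code
        self.content = content
        self.response_time_ms = response_time_ms
        self.error = error

    def to_dict(self):
        return vars(self)

class BaseAPI:
    def _make_request(self, method, url, **kwargs):
        # Time the call and fold success or failure into an APIResponse
        start = time.time()
        try:
            resp = requests.request(method, url, timeout=120, **kwargs)
            return APIResponse(resp.status_code, resp.text,
                               (time.time() - start) * 1000)
        except requests.RequestException as exc:
            return APIResponse(0, "", (time.time() - start) * 1000, str(exc))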

Testing Methodology

  1. Concurrent API Testing: All APIs are tested simultaneously against each domain to ensure fair timing comparisons
  2. Batch Processing: Requests are processed in batches of 50 so that persistently failing APIs can be terminated early (see the sketch after this list)
  3. Dual Verification System:
    • HTTP Status Validation: Checks for 200 OK responses
    • Content Verification: Validates actual page content using CSS selectors and text matching
  4. False Positive Detection: Identifies cases where APIs return 200 status but deliver blocked/captcha content
  5. Progressive Monitoring: Real-time progress tracking with 10-second interval updates
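
Putting items 1-3 together, the per-domain loop works roughly like the following minimal sketch. The helper names and the "verified" result key are assumptions; the actual engine lives in run-test.py:

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50
MAX_THREADS = 3
EARLY_TERMINATION_THRESHOLD = 40

def run_domain(api_func, domain, url, api_key, total_requests=300):
    results, consecutive_failures = [], 0
    for start in range(0, total_requests, BATCH_SIZE):
        size = min(BATCH_SIZE, total_requests - start)
        # Fire one batch of concurrent requests against this API
        with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
            batch = list(pool.map(lambda _: api_func(domain, url, api_key),
                                  range(size)))
        for outcome in batch:
            # "verified" is an assumed key; counting consecutive failures
            # across a concurrent batch is necessarily approximate
            consecutive_failures = (0 if outcome.get("verified")
                                    else consecutive_failures + 1)
        results.extend(batch)
        if consecutive_failures >= EARLY_TERMINATION_THRESHOLD:
            break  # persistently blocked; stop burning API credits
    return results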

Configuration System

Each domain is configured with:

  • Target URL: The specific page to scrape
  • Verification Selectors: CSS selectors and text strings that must be present for successful scraping
  • API-Specific Parameters: Custom headers, proxy settings, and rendering options per API
  • Category Classification: Grouping for analysis (e-commerce, social media, etc.)

Example domain configuration:

{
  "amazon.com": {
    "url": "https://www.amazon.com/product-page",
    "category": "E-Commerce",
    "verification_selectors": [
      "#productTitle",
      ".a-price-whole"
    ],
    "api_configs": {
      "scrapingbee": {
        "params": {
          "render_js": "false"
        }
      }
    }
  }
}

📈 Metrics Explained

Success Rate

  • Status Success Rate: Percentage of requests returning HTTP 200
  • Verified Success Rate: Percentage of requests with valid content (post-verification)
  • False Positive Rate: Percentage of 200 responses that contained blocked content
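
The three rates relate as follows; the field names in this sketch are assumptions, not the tool's actual schema:

def summarize(outcomes):
    total = len(outcomes)
    status_ok = [o for o in outcomes if o["status_code"] == 200]
    verified = [o for o in status_ok if o["content_verified"]]
    return {
        "status_success_rate": len(status_ok) / total if total else 0.0,
        "verified_success_rate": len(verified) / total if total else 0.0,
        # Share of 200 responses that were actually blocked/captcha pages
        "false_positive_rate": (len(status_ok) - len(verified)) / len(status_ok)
                               if status_ok else 0.0,
    }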

Performance Metrics

  • Average Response Time: Mean response time for successful requests (milliseconds)
  • Average Content Length: Mean size of successfully retrieved content (characters)
  • Early Termination: An API is cut off for a domain after 40 consecutive failures to avoid wasting requests and credits

Content Verification

Content verification combines several complementary checks:

  • CSS Selector Matching: Validates presence of specific DOM elements
  • Text Content Matching: Searches for expected text strings
  • Lenient Validation: Allows 1 missing selector to account for minor page variations
  • BeautifulSoup Integration: Robust HTML parsing for accurate element detection
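
Taken together, the verification step might look roughly like this sketch. The rule that selectors starting with "#", ".", or "[" are treated as CSS and everything else as a text match is my assumption, not necessarily how verify_html_elements decides:

from bs4 import BeautifulSoup

def verify_content(html, selectors, allowed_missing=1):
    soup = BeautifulSoup(html, "html.parser")
    missing = 0
    for sel in selectors:
        if sel.startswith(("#", ".", "[")):
            found = soup.select_one(sel) is not None  # CSS selector check
        else:
            found = sel in soup.get_text()            # plain text match
        if not found:
            missing += 1
    return missing <= allowed_missing  # lenient: tolerate one miss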

🚀 Quick Start

Prerequisites

  • Python 3.7+
  • Valid API credentials for the services you want to test

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/web-scraping-api-benchmarker.git
    cd web-scraping-api-benchmarker
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure API credentials: Edit config/api_credentials.json with your actual API keys:

    {
      "scrapingbee": {
        "api_key": "your_actual_api_key_here"
      },
      "oxylabs": {
        "username": "your_username",
        "password": "your_password"
      }
    }
  4. Run the benchmark:

    python run-test.py

Configuration

Adding New APIs

  1. Create API module in APIs/ directory:

    # APIs/newapi.py
    from .base import BaseAPI, APIResponse
    
    class NewAPIAPI(BaseAPI):
        def test_request(self, domain, url, credentials, domain_configs=None):
            # Build the provider's endpoint and query parameters here
            # (placeholder endpoint; replace with the real one)
            api_url = "https://api.newapi.example/v1/scrape"
            params = {"api_key": credentials["api_key"], "url": url}
            return self._make_request('GET', api_url, params=params)
    
    def test_newapi(domain, url, api_key, domain_configs=None):
        api = NewAPIAPI()
        credentials = {"api_key": api_key}
        response = api.test_request(domain, url, credentials, domain_configs)
        return response.to_dict()
  2. Update APIs/__init__.py:

    from .newapi import test_newapi
    
    API_FUNCTIONS = {
        # ... existing APIs
        "newapi": test_newapi,
    }
  3. Add credentials to config/api_credentials.json:

    {
      "newapi": {
        "api_key": "YOUR_NEWAPI_KEY"
      }
    }

Adding New Domains

Edit config/domain_configs.json:

{
  "newdomain.com": {
    "url": "https://www.newdomain.com/test-page",
    "category": "Your Category",
    "verification_selectors": [
      ".expected-element",
      "Expected text content"
    ],
    "api_configs": {
      "api_name": {
        "params": {
          "custom_param": "value"
        }
      }
    }
  }
}

Removing APIs/Domains

  • Remove API: Delete from config/api_credentials.json and APIs/__init__.py
  • Remove Domain: Delete from config/domain_configs.json

📊 Output Format

Results are saved in results/ directory with timestamp:

  • benchmarks_YYYYMMDD_HHMMSS.json: Final results
  • benchmarks_partial_YYYYMMDD_HHMMSS.json: Incremental progress saves

Result Structure

[
  {
    "domain": "amazon.com",
    "api": "scrapingbee",
    "successRate": 0.85,
    "avgResponseTimeMs": 1250.5,
    "category": "E-Commerce",
    "falsePositiveRate": 0.02
  }
]
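
To slice a results file yourself, something like the following works against the structure above (the filename is a placeholder):

import json
from collections import defaultdict

with open("results/benchmarks_20250101_120000.json") as f:
    rows = json.load(f)

by_api = defaultdict(list)
for row in rows:
    by_api[row["api"]].append(row["successRate"])

for api, rates in sorted(by_api.items()):
    print(f"{api}: mean success rate {sum(rates) / len(rates):.1%}")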

βš™οΈ Advanced Configuration

Test Parameters

Modify constants in run-test.py:

REQUESTS_PER_DOMAIN = 300        # Requests per API per domain
MAX_THREADS = 3                  # Concurrent requests per API
EARLY_TERMINATION_THRESHOLD = 40 # Failures before API termination

Custom Verification

Add domain-specific verification logic by extending the verify_html_elements function or adding custom selectors to domain configurations.
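
For instance, a per-domain hook could wrap the existing check like this. The dispatch scheme is hypothetical, and it assumes verify_html_elements takes (html, selectors) and returns a bool:

def verify_amazon(html):
    # Extra domain rule: real product pages include the buy box,
    # while block/captcha pages usually do not
    return "add-to-cart-button" in html

CUSTOM_VERIFIERS = {"amazon.com": verify_amazon}

def verify_with_custom_rules(domain, html, selectors):
    if not verify_html_elements(html, selectors):  # the tool's own check
        return False
    extra = CUSTOM_VERIFIERS.get(domain)
    return extra(html) if extra else True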

🤝 Contributing

Dear API owners or seasoned scrapers, contributions are welcome!

Please feel free to:

  • Add new API integrations
  • Fix existing API logic
  • Improve verification logic
  • Add new test domains
  • Enhance documentation
  • Report bugs and suggest features

πŸ™ Acknowledgments

This tool was created to provide the web scraping community with transparent, objective performance data. Special thanks to all API providers for their services and the open-source community for the underlying libraries that make this project possible.


Disclaimer: This tool is for educational and research purposes. Always respect robots.txt files and website terms of service when scraping.
