Web Scraping API Benchmarker

A comprehensive benchmarking tool for evaluating web scraping APIs across diverse target domains with varying levels of protection and complexity.

🎯 Latest Results

Live benchmark results are continuously updated at: https://scrapingtest.com/web-scraping-api-benchmark

📊 Overview

This benchmarking suite was designed to provide the web scraping community with objective, reproducible performance data across major scraping API providers. The tool tests each API against 16 carefully selected domains representing the most commonly scraped websites, ranging from simple API endpoints to heavily protected sites with advanced WAF (Web Application Firewall) protection and JavaScript rendering requirements.

Tested Domain Categories

  • E-Commerce: Amazon, Best Buy, eBay, Walmart
  • Social Media: Instagram, Reddit, YouTube, X/Twitter
  • Professional Networks: GitHub, Indeed, LinkedIn
  • Review Platforms: Trustpilot, G2, Capterra
  • Search Engines: Google
  • Real Estate: Zillow

Supported APIs

  • Oxylabs Web Unblocker
  • ScrapingDog API
  • ScraperAPI
  • ScrapingAnt API
  • WebScrapingAPI
  • Scrape.do
  • ZenRows Universal Scraper
  • Bright Data Web Unlocker
  • ScrapingBee HTML API
  • ⚠️ Scrapfly Web Scraping API: I haven't been able to get an account verified for this API, so the code is untested and might not work. If someone from Scrapfly can reach out with an API key for testing, I'd be happy to get it working :)

🔬 How It Works

Architecture

The benchmarker employs a modular architecture with clear separation of concerns:

├── run-test.py          # Main benchmarking engine
├── APIs/                # Modular API implementations
│   ├── base.py          # Base classes and utilities
│   ├── oxylabs.py       # Oxylabs implementation
│   ├── scrapedo.py      # scrape.do implementation
│   └── ...              # Other API modules
├── config/              # Configuration files
│   ├── api_credentials.json    # API keys and credentials
│   └── domain_configs.json     # Domain test configurations
└── results/             # Benchmark output files
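
The modules under APIs/ all lean on base.py for shared request/response plumbing. The real file isn't reproduced here; inferred from how the modules use it in "Adding New APIs" below, its interface looks roughly like this hypothetical sketch:

import time
import requests

class APIResponse:
    """Normalized result of one scraping attempt (assumed shape)."""
    def __init__(self, status_code, content, response_time_ms, error=None):
        self.status_code = status_code
        self.content = content
        self.response_time_ms = response_time_ms
        self.error = error

    def to_dict(self):
        return vars(self)

class BaseAPI:
    def _make_request(self, method, url, **kwargs):
        # Time the call and fold success or failure into an APIResponse
        start = time.time()
        try:
            resp = requests.request(method, url, timeout=120, **kwargs)
            return APIResponse(resp.status_code, resp.text,
                               (time.time() - start) * 1000)
        except requests.RequestException as exc:
            return APIResponse(0, "", (time.time() - start) * 1000, str(exc))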

Testing Methodology

  1. Concurrent API Testing: All APIs are tested simultaneously against each domain to ensure fair timing comparisons
  2. Batch Processing: Requests are processed in batches of 50 so that persistently failing APIs can be terminated early (see the sketch after this list)
  3. Dual Verification System:
    • HTTP Status Validation: Checks for 200 OK responses
    • Content Verification: Validates actual page content using CSS selectors and text matching
  4. False Positive Detection: Identifies cases where APIs return 200 status but deliver blocked/captcha content
  5. Progressive Monitoring: Real-time progress tracking with 10-second interval updates
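
Putting items 1-3 together, the per-domain loop works roughly like the following minimal sketch. The helper names and the "verified" result key are assumptions; the actual engine lives in run-test.py:

from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50
MAX_THREADS = 3
EARLY_TERMINATION_THRESHOLD = 40

def run_domain(api_func, domain, url, api_key, total_requests=300):
    results, consecutive_failures = [], 0
    for start in range(0, total_requests, BATCH_SIZE):
        size = min(BATCH_SIZE, total_requests - start)
        # Fire one batch of concurrent requests against this API
        with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
            batch = list(pool.map(lambda _: api_func(domain, url, api_key),
                                  range(size)))
        for outcome in batch:
            # "verified" is an assumed key; counting consecutive failures
            # across a concurrent batch is necessarily approximate
            consecutive_failures = (0 if outcome.get("verified")
                                    else consecutive_failures + 1)
        results.extend(batch)
        if consecutive_failures >= EARLY_TERMINATION_THRESHOLD:
            break  # persistently blocked; stop burning API credits
    return results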

Configuration System

Each domain is configured with:

  • Target URL: The specific page to scrape
  • Verification Selectors: CSS selectors and text strings that must be present for successful scraping
  • API-Specific Parameters: Custom headers, proxy settings, and rendering options per API
  • Category Classification: Grouping for analysis (e-commerce, social media, etc.)

Example domain configuration:

{
  "amazon.com": {
    "url": "https://www.amazon.com/product-page",
    "category": "E-Commerce",
    "verification_selectors": [
      "#productTitle",
      ".a-price-whole"
    ],
    "api_configs": {
      "scrapingbee": {
        "params": {
          "render_js": "false"
        }
      }
    }
  }
}

📈 Metrics Explained

Success Rate

  • Status Success Rate: Percentage of requests returning HTTP 200
  • Verified Success Rate: Percentage of requests with valid content (post-verification)
  • False Positive Rate: Percentage of 200 responses that contained blocked content
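
The three rates relate as follows; the field names in this sketch are assumptions, not the tool's actual schema:

def summarize(outcomes):
    total = len(outcomes)
    status_ok = [o for o in outcomes if o["status_code"] == 200]
    verified = [o for o in status_ok if o["content_verified"]]
    return {
        "status_success_rate": len(status_ok) / total if total else 0.0,
        "verified_success_rate": len(verified) / total if total else 0.0,
        # Share of 200 responses that were actually blocked/captcha pages
        "false_positive_rate": (len(status_ok) - len(verified)) / len(status_ok)
                               if status_ok else 0.0,
    }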

Performance Metrics

  • Average Response Time: Mean response time for successful requests (milliseconds)
  • Average Content Length: Mean size of successfully retrieved content (characters)
  • Early Termination: An API is cut off for a domain after 40 consecutive failures to avoid wasting requests and credits

Content Verification

Content verification combines several complementary checks:

  • CSS Selector Matching: Validates presence of specific DOM elements
  • Text Content Matching: Searches for expected text strings
  • Lenient Validation: Allows 1 missing selector to account for minor page variations
  • BeautifulSoup Integration: Robust HTML parsing for accurate element detection
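
Taken together, the verification step might look roughly like this sketch. The rule that selectors starting with "#", ".", or "[" are treated as CSS and everything else as a text match is my assumption, not necessarily how verify_html_elements decides:

from bs4 import BeautifulSoup

def verify_content(html, selectors, allowed_missing=1):
    soup = BeautifulSoup(html, "html.parser")
    missing = 0
    for sel in selectors:
        if sel.startswith(("#", ".", "[")):
            found = soup.select_one(sel) is not None  # CSS selector check
        else:
            found = sel in soup.get_text()            # plain text match
        if not found:
            missing += 1
    return missing <= allowed_missing  # lenient: tolerate one miss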

🚀 Quick Start

Prerequisites

  • Python 3.7+
  • Valid API credentials for the services you want to test

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/web-scraping-api-benchmarker.git
    cd web-scraping-api-benchmarker
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure API credentials: Edit config/api_credentials.json with your actual API keys:

    {
      "scrapingbee": {
        "api_key": "your_actual_api_key_here"
      },
      "oxylabs": {
        "username": "your_username",
        "password": "your_password"
      }
    }
  4. Run the benchmark:

    python run-test.py

Configuration

Adding New APIs

  1. Create API module in APIs/ directory:

    # APIs/newapi.py
    from .base import BaseAPI, APIResponse
    
    class NewAPIAPI(BaseAPI):
        def test_request(self, domain, url, credentials, domain_configs=None):
            # Build the provider's endpoint and query parameters here
            # (placeholder endpoint; replace with the real one)
            api_url = "https://api.newapi.example/v1/scrape"
            params = {"api_key": credentials["api_key"], "url": url}
            return self._make_request('GET', api_url, params=params)
    
    def test_newapi(domain, url, api_key, domain_configs=None):
        api = NewAPIAPI()
        credentials = {"api_key": api_key}
        response = api.test_request(domain, url, credentials, domain_configs)
        return response.to_dict()
  2. Update APIs/__init__.py:

    from .newapi import test_newapi
    
    API_FUNCTIONS = {
        # ... existing APIs
        "newapi": test_newapi,
    }
  3. Add credentials to config/api_credentials.json:

    {
      "newapi": {
        "api_key": "YOUR_NEWAPI_KEY"
      }
    }

Adding New Domains

Edit config/domain_configs.json:

{
  "newdomain.com": {
    "url": "https://www.newdomain.com/test-page",
    "category": "Your Category",
    "verification_selectors": [
      ".expected-element",
      "Expected text content"
    ],
    "api_configs": {
      "api_name": {
        "params": {
          "custom_param": "value"
        }
      }
    }
  }
}

Removing APIs/Domains

  • Remove API: Delete from config/api_credentials.json and APIs/__init__.py
  • Remove Domain: Delete from config/domain_configs.json

📊 Output Format

Results are saved in results/ directory with timestamp:

  • benchmarks_YYYYMMDD_HHMMSS.json: Final results
  • benchmarks_partial_YYYYMMDD_HHMMSS.json: Incremental progress saves

Result Structure

[
  {
    "domain": "amazon.com",
    "api": "scrapingbee",
    "successRate": 0.85,
    "avgResponseTimeMs": 1250.5,
    "category": "E-Commerce",
    "falsePositiveRate": 0.02
  }
]
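
To slice a results file yourself, something like the following works against the structure above (the filename is a placeholder):

import json
from collections import defaultdict

with open("results/benchmarks_20250101_120000.json") as f:
    rows = json.load(f)

by_api = defaultdict(list)
for row in rows:
    by_api[row["api"]].append(row["successRate"])

for api, rates in sorted(by_api.items()):
    print(f"{api}: mean success rate {sum(rates) / len(rates):.1%}")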

βš™οΈ Advanced Configuration

Test Parameters

Modify constants in run-test.py:

REQUESTS_PER_DOMAIN = 300        # Requests per API per domain
MAX_THREADS = 3                  # Concurrent requests per API
EARLY_TERMINATION_THRESHOLD = 40 # Failures before API termination

Custom Verification

Add domain-specific verification logic by extending the verify_html_elements function or adding custom selectors to domain configurations.
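
For instance, a per-domain hook could wrap the existing check like this. The dispatch scheme is hypothetical, and it assumes verify_html_elements takes (html, selectors) and returns a bool:

def verify_amazon(html):
    # Extra domain rule: real product pages include the buy box,
    # while block/captcha pages usually do not
    return "add-to-cart-button" in html

CUSTOM_VERIFIERS = {"amazon.com": verify_amazon}

def verify_with_custom_rules(domain, html, selectors):
    if not verify_html_elements(html, selectors):  # the tool's own check
        return False
    extra = CUSTOM_VERIFIERS.get(domain)
    return extra(html) if extra else True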

🤝 Contributing

Dear API owners or seasoned scrapers, contributions are welcome!

Please feel free to:

  • Add new API integrations
  • Fix existing API logic
  • Improve verification logic
  • Add new test domains
  • Enhance documentation
  • Report bugs and suggest features

πŸ™ Acknowledgments

This tool was created to provide the web scraping community with transparent, objective performance data. Special thanks to all API providers for their services and the open-source community for the underlying libraries that make this project possible.


Disclaimer: This tool is for educational and research purposes. Always respect robots.txt files and website terms of service when scraping.
