A comprehensive benchmarking tool for evaluating web scraping APIs across diverse target domains with varying levels of protection and complexity.
Live benchmark results are continuously updated at: https://scrapingtest.com/web-scraping-api-benchmark
This benchmarking suite was designed to provide the web scraping community with objective, reproducible performance data across major scraping API providers. The tool tests each API against 16 carefully selected domains representing the most commonly scraped websites, ranging from simple API endpoints to heavily protected sites that sit behind advanced WAFs (Web Application Firewalls) and require JavaScript rendering. The domains span six categories:
- E-Commerce: Amazon, Best Buy, eBay, Walmart
- Social Media: Instagram, Reddit, YouTube, X/Twitter
- Professional Networks: GitHub, Indeed, LinkedIn
- Review Platforms: Trustpilot, G2, Capterra
- Search Engines: Google
- Real Estate: Zillow
The following APIs are currently benchmarked:

- Oxylabs Web Unblocker
- ScrapingDog API
- ScraperAPI
- ScrapingAnt API
- WebScrapingAPI
- Scrape.do
- ZenRows Universal Scraper
- Bright Data Web Unlocker
- ScrapingBee HTML API
⚠️ Scrapfly Web Scraping API: I've been unable to get verification for this API, so the code might not work. If someone from Scrapfly can reach out to provide an API key for testing, I'd be happy to get it working :)
The benchmarker employs a modular architecture with clear separation of concerns:
```
├── run-test.py               # Main benchmarking engine
├── APIs/                     # Modular API implementations
│   ├── base.py               # Base classes and utilities
│   ├── oxylabs.py            # Oxylabs implementation
│   ├── scrapedo.py           # scrape.do implementation
│   └── ...                   # Other API modules
├── config/                   # Configuration files
│   ├── api_credentials.json  # API keys and credentials
│   └── domain_configs.json   # Domain test configurations
└── results/                  # Benchmark output files
```
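Every API module builds on the shared plumbing in `APIs/base.py`. As a rough, hypothetical sketch of what that base layer provides, inferred only from the names the modules use (`BaseAPI`, `APIResponse`, `_make_request`, `to_dict`; the real file may differ):

```python
# Hypothetical sketch of APIs/base.py -- actual fields and signatures may differ.
import time
from dataclasses import dataclass, asdict
from typing import Optional

import requests


@dataclass
class APIResponse:
    """Normalized result shared by all API modules."""
    status_code: int
    content: str
    response_time_ms: float
    error: Optional[str] = None

    def to_dict(self) -> dict:
        return asdict(self)


class BaseAPI:
    """Common request plumbing; subclasses implement test_request()."""

    def _make_request(self, method, url, **kwargs) -> APIResponse:
        start = time.monotonic()
        try:
            resp = requests.request(method, url, timeout=120, **kwargs)
            elapsed = (time.monotonic() - start) * 1000
            return APIResponse(resp.status_code, resp.text, elapsed)
        except requests.RequestException as exc:
            elapsed = (time.monotonic() - start) * 1000
            return APIResponse(0, "", elapsed, error=str(exc))
```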
- Concurrent API Testing: All APIs are tested simultaneously against each domain to ensure fair timing comparisons
- Batch Processing: Requests are processed in batches of 50 to enable early termination on persistent failures (sketched after this list)
- Dual Verification System:
  - HTTP Status Validation: Checks for 200 OK responses
  - Content Verification: Validates actual page content using CSS selectors and text matching
  - False Positive Detection: Identifies cases where APIs return a 200 status but deliver blocked/captcha content
- Progressive Monitoring: Real-time progress tracking with 10-second interval updates
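A minimal sketch of the batch-and-terminate pattern, not the actual `run-test.py` internals (function and variable names here are made up):

```python
# Illustrative batch loop with early termination on a failure streak.
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50
MAX_THREADS = 3
EARLY_TERMINATION_THRESHOLD = 40

def benchmark_api(api_func, url, total_requests=300):
    results = []
    consecutive_failures = 0
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        for start in range(0, total_requests, BATCH_SIZE):
            size = min(BATCH_SIZE, total_requests - start)
            for result in pool.map(lambda _: api_func(url), range(size)):
                results.append(result)
                # Any non-200 response counts toward the failure streak.
                if result.get("status_code") == 200:
                    consecutive_failures = 0
                else:
                    consecutive_failures += 1
            if consecutive_failures >= EARLY_TERMINATION_THRESHOLD:
                break  # stop wasting requests on a consistently blocked API
    return results
```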
Each domain is configured with:
- Target URL: The specific page to scrape
- Verification Selectors: CSS selectors and text strings that must be present for successful scraping
- API-Specific Parameters: Custom headers, proxy settings, and rendering options per API
- Category Classification: Grouping for analysis (e-commerce, social media, etc.)
Example domain configuration:

```json
{
  "amazon.com": {
    "url": "https://www.amazon.com/product-page",
    "category": "E-Commerce",
    "verification_selectors": [
      "#productTitle",
      ".a-price-whole"
    ],
    "api_configs": {
      "scrapingbee": {
        "params": {
          "render_js": "false"
        }
      }
    }
  }
}
```
For each API/domain pair, the benchmark reports the following metrics (a sketch of how they can be derived from raw results follows the list):

- Status Success Rate: Percentage of requests returning HTTP 200
- Verified Success Rate: Percentage of requests with valid content (post-verification)
- False Positive Rate: Percentage of 200 responses that contained blocked content
- Average Response Time: Mean response time for successful requests (milliseconds)
- Average Content Length: Mean size of successfully retrieved content (characters)
- Early Termination: APIs are terminated after 40 consecutive failures to prevent resource waste
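As an illustration only (the per-request record fields below are assumptions, not the exact internals of `run-test.py`), the rates can be derived like so:

```python
# Illustrative metric computation over per-request records, each assumed
# to look like: {"status_code": 200, "verified": True,
#                "response_ms": 840, "content_len": 52310}
def summarize(records):
    total = len(records)
    ok = [r for r in records if r["status_code"] == 200]
    verified = [r for r in ok if r["verified"]]
    return {
        "statusSuccessRate": len(ok) / total if total else 0.0,
        "verifiedSuccessRate": len(verified) / total if total else 0.0,
        # Share of 200 responses that actually contained blocked content.
        "falsePositiveRate": (len(ok) - len(verified)) / len(ok) if ok else 0.0,
        "avgResponseTimeMs": sum(r["response_ms"] for r in verified) / len(verified) if verified else 0.0,
        "avgContentLength": sum(r["content_len"] for r in verified) / len(verified) if verified else 0.0,
    }
```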
The tool implements intelligent content verification (a simplified sketch follows this list):
- CSS Selector Matching: Validates presence of specific DOM elements
- Text Content Matching: Searches for expected text strings
- Lenient Validation: Allows 1 missing selector to account for minor page variations
- BeautifulSoup Integration: Robust HTML parsing for accurate element detection
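A minimal sketch of this check, under the assumption that selectors starting with `.` or `#` are treated as CSS and everything else as plain text (this approximates, but is not, the actual `verify_html_elements` implementation):

```python
# Approximate sketch of selector/text verification with one allowed miss.
from bs4 import BeautifulSoup

def verify_content(html, verification_selectors):
    soup = BeautifulSoup(html, "html.parser")
    misses = 0
    for selector in verification_selectors:
        if selector.startswith((".", "#")):
            # CSS selector: the element must exist in the parsed DOM.
            found = soup.select_one(selector) is not None
        else:
            # Plain string: the text must appear somewhere in the raw HTML.
            found = selector in html
        if not found:
            misses += 1
    # Lenient validation: tolerate one missing selector for minor page variations.
    return misses <= 1
```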
To run the benchmark yourself, you'll need:

- Python 3.7+
- Valid API credentials for the services you want to test
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/web-scraping-api-benchmarker.git
  cd web-scraping-api-benchmarker
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure API credentials: edit `config/api_credentials.json` with your actual API keys:

  ```json
  {
    "scrapingbee": {
      "api_key": "your_actual_api_key_here"
    },
    "oxylabs": {
      "username": "your_username",
      "password": "your_password"
    }
  }
  ```
- Run the benchmark:

  ```bash
  python run-test.py
  ```
To add a new API integration:

- Create an API module in the `APIs/` directory:

  ```python
  # APIs/newapi.py
  from .base import BaseAPI, APIResponse

  class NewAPIAPI(BaseAPI):
      def test_request(self, domain, url, credentials, domain_configs=None):
          # Build the provider-specific request here; the endpoint and
          # params below are placeholders.
          api_url = "https://api.newapi.example/scrape"
          params = {"api_key": credentials["api_key"], "url": url}
          return self._make_request('GET', api_url, params=params)

  def test_newapi(domain, url, api_key, domain_configs=None):
      api = NewAPIAPI()
      credentials = {"api_key": api_key}
      response = api.test_request(domain, url, credentials, domain_configs)
      return response.to_dict()
  ```
- Register it in `APIs/__init__.py`:

  ```python
  from .newapi import test_newapi

  API_FUNCTIONS = {
      # ... existing APIs
      "newapi": test_newapi,
  }
  ```
- Add credentials to `config/api_credentials.json`:

  ```json
  {
    "newapi": {
      "api_key": "YOUR_NEWAPI_KEY"
    }
  }
  ```
To add a new test domain, edit `config/domain_configs.json`:

```json
{
  "newdomain.com": {
    "url": "https://www.newdomain.com/test-page",
    "category": "Your Category",
    "verification_selectors": [
      ".expected-element",
      "Expected text content"
    ],
    "api_configs": {
      "api_name": {
        "params": {
          "custom_param": "value"
        }
      }
    }
  }
}
```
- Remove an API: delete its entries from `config/api_credentials.json` and `APIs/__init__.py`
- Remove a domain: delete its entry from `config/domain_configs.json`
Results are saved in the `results/` directory with a timestamp:

- `benchmarks_YYYYMMDD_HHMMSS.json`: Final results
- `benchmarks_partial_YYYYMMDD_HHMMSS.json`: Incremental progress saves
Each entry in the output file looks like:

```json
[
  {
    "domain": "amazon.com",
    "api": "scrapingbee",
    "successRate": 0.85,
    "avgResponseTimeMs": 1250.5,
    "category": "E-Commerce",
    "falsePositiveRate": 0.02
  }
]
```
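For example, to rank APIs by average success rate across all domains, a quick post-processing snippet (not part of the tool itself; the filename is a placeholder) might look like:

```python
# Rank APIs by mean success rate across domains from a results file.
import json
from collections import defaultdict

with open("results/benchmarks_YYYYMMDD_HHMMSS.json") as f:  # placeholder filename
    entries = json.load(f)

rates = defaultdict(list)
for entry in entries:
    rates[entry["api"]].append(entry["successRate"])

for api, values in sorted(rates.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{api:20s} {sum(values) / len(values):.1%}")
```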
Modify the constants in `run-test.py`:

```python
REQUESTS_PER_DOMAIN = 300         # Requests per API per domain
MAX_THREADS = 3                   # Concurrent requests per API
EARLY_TERMINATION_THRESHOLD = 40  # Failures before API termination
```
Add domain-specific verification logic by extending the `verify_html_elements` function or by adding custom selectors to domain configurations, as in the sketch below.
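For instance, a domain-specific wrapper could reject known block pages before running the standard selector check (illustrative only; this reuses the `verify_content` sketch from the verification section above in place of the real `verify_html_elements`):

```python
# Illustrative domain-specific override layered on the generic check.
BLOCK_MARKERS = ["Enter the characters you see below", "unusual traffic"]

def verify_with_block_check(html, selectors):
    # Known captcha/interstitial phrases fail verification outright,
    # even when the configured selectors happen to match.
    if any(marker in html for marker in BLOCK_MARKERS):
        return False
    return verify_content(html, selectors)  # generic selector/text check
```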
Dear API owners and seasoned scrapers, contributions are welcome!
Please feel free to:
- Add new API integrations
- Fix existing API logic
- Improve verification logic
- Add new test domains
- Enhance documentation
- Report bugs and suggest features
This tool was created to provide the web scraping community with transparent, objective performance data. Special thanks to all API providers for their services and the open-source community for the underlying libraries that make this project possible.
Disclaimer: This tool is for educational and research purposes. Always respect robots.txt files and website terms of service when scraping.