Profile any URL for scrapability — before you write a single line of scraper code.
scrape-check is a small CLI that tells you what you'll be up against if you try
to scrape a website. It makes two HTTP requests (one to /robots.txt, one to
your target) and reports:
- `robots.txt`: is your path allowed? What `Crawl-delay` applies? Where are the sitemaps?
- Anti-bot stack: Cloudflare, Akamai Bot Manager, DataDome, HUMAN (PerimeterX), Imperva, Sucuri, Fastly, reCAPTCHA, hCaptcha, Turnstile, detected from headers, cookies, and body signatures.
- Rendering mode: server-rendered, client-rendered, or hybrid? Which framework (Next.js, Nuxt, Gatsby, SvelteKit, Remix, Astro, React, Vue, Angular)?
- HTTP basics: status, redirects, HTTP version, rate-limit headers, `Retry-After`.
- A recommendation: `requests` is fine / use stealth headers / needs a headless browser / don't scrape.
It does not execute JavaScript, solve challenges, or attempt to bypass anything — this is a planning tool, not a scraper.
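To make the `robots.txt` questions concrete, here is a minimal sketch using only Python's standard library. It is not scrape-check's own code. The `SAMPLE_ROBOTS` body and the `robots_summary` helper are hypothetical names invented for this illustration, and the file is parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, used here instead of a live fetch.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""


def robots_summary(robots_txt: str, url: str, user_agent: str = "*") -> dict:
    """Answer the three robots.txt questions for one URL (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),    # is the path allowed?
        "crawl_delay": parser.crawl_delay(user_agent),   # None if not declared
        "sitemaps": parser.site_maps() or [],            # site_maps() needs Python 3.8+
    }


summary = robots_summary(SAMPLE_ROBOTS, "https://example.com/private/report")
print(summary)
```

For a live site you would call `parser.set_url(...)` and `parser.read()` instead of `parse()`, at the cost of one network request.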
From source:
pip install git+https://github.com/python-web-scraping-com/scrape-check.git

Requires Python 3.9+.
scrape-check example.com
scrape-check https://news.ycombinator.com
scrape-check https://www.zillow.com --json
scrape-check https://example.com --user-agent "MyBot/1.0"

The exit code reflects the recommendation, which is useful in CI:
| Strategy | Exit code |
|---|---|
| `requests` | 0 |
| `stealth-headers` | 1 |
| `headless` | 1 |
| `do-not-scrape` | 2 |
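As a sketch of how a CI step might act on these exit codes, the snippet below wraps the CLI with `subprocess`. The `should_fail_build` policy and its `allow_extra_effort` parameter are assumptions made for illustration, not part of scrape-check. Note that exit code 1 covers both `stealth-headers` and `headless`, so the code alone cannot tell them apart.

```python
import subprocess

# Exit codes as documented in the table above; code 1 covers two strategies.
EXIT_CODE_STRATEGIES = {
    0: ("requests",),
    1: ("stealth-headers", "headless"),
    2: ("do-not-scrape",),
}


def run_scrape_check(url: str) -> int:
    """Run the CLI and return its exit code (needs scrape-check on PATH)."""
    completed = subprocess.run(["scrape-check", url], check=False)
    return completed.returncode


def should_fail_build(exit_code: int, allow_extra_effort: bool = True) -> bool:
    """Hypothetical CI policy built on the documented exit codes."""
    if exit_code not in EXIT_CODE_STRATEGIES:
        return True  # undocumented code: fail closed (our assumption)
    if exit_code == 2:
        return True  # do-not-scrape
    if exit_code == 1:
        return not allow_extra_effort  # stealth headers or headless browser
    return False  # plain requests is enough


# Typical use inside a CI script:
#   import sys
#   sys.exit(1 if should_fail_build(run_scrape_check("https://example.com")) else 0)
```

If you need to distinguish the two strategies behind exit code 1, the `--json` output is the better source.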
╭─ scrape-check ──────────────────────────────────────────────╮
│ https://example.com │
╰─────────────────────────────────────────────────────────────╯
HTTP
status 200
http version HTTP/1.1
server nginx
elapsed 142 ms
robots.txt
url https://example.com/robots.txt
status 200
allowed yes
sitemaps https://example.com/sitemap.xml
╭─ Anti-bot ──────────────────────────────────────────────────╮
│ no anti-bot signals detected │
╰─────────────────────────────────────────────────────────────╯
╭─ Rendering ─────────────────────────────────────────────────╮
│ mode: ssr │
│ • visible-text ratio: 21.4% │
╰─────────────────────────────────────────────────────────────╯
╭─ Recommendation ────────────────────────────────────────────╮
│ requests — Safe to use plain HTTP (requests / httpx) │
│ │
│ • server-rendered HTML, no anti-bot signals detected │
╰─────────────────────────────────────────────────────────────╯
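The Rendering panel above reports a visible-text ratio. How scrape-check computes it is not documented here, so the following is only one plausible sketch: the share of the HTML payload that is human-visible text, ignoring `<script>`, `<style>`, and similar elements. The `visible_text_ratio` function and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside script, style, noscript, or template."""

    HIDDEN = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.chunks.append(data)


def visible_text_ratio(html: str) -> float:
    """Visible characters divided by total HTML length (0.0 for empty input)."""
    if not html:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())  # collapse whitespace
    return len(text) / len(html)


# Hypothetical pages: one server-rendered, one an empty client-side shell.
ssr_page = "<html><body><p>Hello world</p></body></html>"
csr_page = '<html><body><div id="root"></div><script>var x = 1;</script></body></html>'
print(visible_text_ratio(ssr_page), visible_text_ratio(csr_page))
```

A very low ratio hints that content is filled in by JavaScript, but any cut-off between "ssr" and "csr" would be a judgment call.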
from scrape_check import analyze
report = analyze("https://example.com")
print(report.recommendation.strategy) # "requests"
print(report.antibot.detected) # []
print(report.rendering.mode) # "ssr"
print(report.robots.sitemaps)  # ["https://example.com/sitemap.xml"]

`analyze()` returns a dataclass `Report`; call `report.to_dict()` to serialize.
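One way to use the library API on several URLs is to group them by recommended strategy. The sketch below relies only on the `report.recommendation.strategy` attribute shown above. `group_by_strategy` and `fake_analyze` are hypothetical helpers, and the stand-in analyzer returns made-up strategies so the example runs without network access or the package installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_strategy(urls, analyze_fn):
    """Group URLs by report.recommendation.strategy."""
    groups = defaultdict(list)
    for url in urls:
        report = analyze_fn(url)
        groups[report.recommendation.strategy].append(url)
    return dict(groups)


def fake_analyze(url):
    """Offline stand-in for scrape_check.analyze; strategies are invented."""
    strategy = "headless" if "app." in url else "requests"
    return SimpleNamespace(recommendation=SimpleNamespace(strategy=strategy))


plan = group_by_strategy(
    ["https://example.com", "https://app.example.com"], fake_analyze
)
print(plan)
```

With the real package installed, pass `analyze` from `scrape_check` in place of `fake_analyze`.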
- `scrape-check` only loads the initial HTML. Sites that gate content behind scroll, click, or login will look fine here but block real scraping.
- Anti-bot products are detected from signatures. A missing detection doesn't mean a site has no protection; it might just be dormant for unauthenticated requests.
- Rendering classification is a heuristic. When in doubt, run with `--json` and inspect the `rendering.signals` field.
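To illustrate what signature-based detection means and why it can miss things, here is a minimal sketch that matches a few publicly known header names, cookie prefixes, and script URLs. These tables are illustrative assumptions, not scrape-check's actual rule set, and `detect_antibot` is a hypothetical function.

```python
# Illustrative signatures only; the tool's real rules may differ.
HEADER_SIGNATURES = {
    "cf-ray": "Cloudflare",
    "x-sucuri-id": "Sucuri",
}
COOKIE_PREFIX_SIGNATURES = {
    "__cf_bm": "Cloudflare",
    "_abck": "Akamai Bot Manager",
    "datadome": "DataDome",
    "_px": "HUMAN (PerimeterX)",
    "incap_ses_": "Imperva",
}
BODY_SIGNATURES = {
    "challenges.cloudflare.com/turnstile": "Turnstile",
    "hcaptcha.com": "hCaptcha",
    "google.com/recaptcha": "reCAPTCHA",
}


def detect_antibot(headers, cookie_names, body):
    """Return a sorted list of vendors whose signatures appear in the response."""
    found = set()
    header_names = {name.lower() for name in headers}
    for header, vendor in HEADER_SIGNATURES.items():
        if header in header_names:
            found.add(vendor)
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_PREFIX_SIGNATURES.items():
            if cookie.lower().startswith(prefix):
                found.add(vendor)
    body_lower = body.lower()
    for needle, vendor in BODY_SIGNATURES.items():
        if needle in body_lower:
            found.add(vendor)
    return sorted(found)


print(detect_antibot({"CF-RAY": "abc123"}, ["__cf_bm", "_px3"], ""))
```

An empty result only means none of the listed signatures matched this one response, which is why a missing detection is not proof that a site is unprotected.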
Once scrape-check tells you what you're dealing with, the
python-web-scraping.com guides cover what to do next:
- 🔰 Complete Guide to Python Web Scraping: the fundamentals you'll need for `requests`-friendly sites.
- 🛡 Advanced Scraping & Anti-bot Evasion: headless browsers, stealth patching, proxy rotation, and getting past Cloudflare / Akamai.
- ⚖️ Legal, Ethical & Compliance: `robots.txt`, GDPR, copyright, and the responsible scraping playbook.
git clone https://github.com/python-web-scraping-com/scrape-check.git
cd scrape-check
pip install -e ".[dev]"
pytest

MIT. See LICENSE.