rayen03/ai-scraper


# Resilient AI Web Scraper v2.0

**Stealth · Infinite Scroll · Pagination · AI Extraction**

Scrape any modern e-commerce website — no HTML, no CSS selectors, no coding.


## Project Files

| File | Purpose |
|------|---------|
| `scraper.py` | Main script — full interactive CLI |
| `quick_scrape.py` | Quick launcher — pick a preset site in seconds |
| `presets.py` | Site profiles (eMAG, Ethnasia, Books, HN) |
| `requirements.txt` | Python dependencies |
| `setup.bat` | One-click installer |
| `run.bat` | Runs `scraper.py` |
| `run_quick.bat` | Runs `quick_scrape.py` |

## Setup (run once)

```bat
cd C:\dev\ai-scraper
setup.bat
```

Or manually:

```bat
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
playwright install chromium
```

## Groq API Key

Set the key in PowerShell before running:

```powershell
$env:GROQ_API_KEY = "gsk_xxxxxxxxxxxxxxxxxxxx"
```

Or edit `scraper.py` directly.
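If you want to fail fast when the key is missing, a minimal sketch of reading it from the environment (`get_groq_key` is a hypothetical helper, not a function shipped in `scraper.py`):

```python
import os

def get_groq_key() -> str:
    """Read the Groq API key from the environment.

    Groq keys start with "gsk_", so a quick prefix check catches both
    an unset variable and an obviously wrong value before any request.
    """
    key = os.environ.get("GROQ_API_KEY", "")
    if not key.startswith("gsk_"):
        raise RuntimeError("GROQ_API_KEY is not set; export it or edit scraper.py")
    return key
```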


## Two Ways to Run

### Option A — Interactive (any site)

```bat
python scraper.py
```

You'll answer 5 quick questions:

1. **URL** — paste any product listing URL
2. **Data points** — e.g. Product Name, Price, Rating
3. **Infinite scroll?** — `y` for sites like Ethnasia
4. **Paginate?** — `y` for sites like eMAG
5. **Next button selector** — a CSS selector, or press Enter for the default

### Option B — Quick preset (zero config)

```bat
python quick_scrape.py
```

Pick from:

```
[1]  eMAG.ro — Products
[2]  Ethnasia.com — Products
[3]  Books to Scrape (demo site)
[4]  Hacker News — Top Stories
[c]  Custom URL
```

## How the 6 Resilience Layers Work

### 1. Smarter Page Loading

Instead of waiting for `networkidle` (which times out on heavy sites), the script:

- Waits for `domcontentloaded` first (fast)
- Then either waits for your item selector to appear, or falls back to a 5-second timeout
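A minimal sketch of that strategy using Playwright's sync API (`load_listing` is a hypothetical name; the real logic lives in `scraper.py`):

```python
def load_listing(page, url, item_selector=None, fallback_ms=5000):
    """Load a listing page without relying on networkidle.

    Waits for domcontentloaded, then for the item selector if one is
    known; on timeout (or with no selector) falls back to a fixed wait.
    Returns which strategy applied, which is handy for logging.
    """
    page.goto(url, wait_until="domcontentloaded")
    if item_selector:
        try:
            page.wait_for_selector(item_selector, timeout=fallback_ms)
            return "selector"
        except Exception:
            return "selector-timeout"
    page.wait_for_timeout(fallback_ms)
    return "fixed-wait"
```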

### 2. Stealth Mode

- Rotates from a pool of 4 real Chrome User-Agents
- Injects JavaScript to patch `navigator.webdriver` to `undefined`
- Spoofs `navigator.plugins`, `navigator.languages`, `window.chrome`
- Sets a realistic viewport, locale, and timezone
- Disables Chromium's `AutomationControlled` flag
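A sketch of what these patches can look like. The UA pool and `STEALTH_JS` below are illustrative, not the exact values in `scraper.py`; a Playwright context would apply them via `browser.new_context(user_agent=...)` and `context.add_init_script(STEALTH_JS)`:

```python
import random

# Illustrative pool; the real list of 4 UAs lives in scraper.py.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

# JavaScript injected before any page script runs, hiding automation tells.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
window.chrome = window.chrome || { runtime: {} };
"""

def pick_user_agent() -> str:
    """Rotate User-Agents per run to avoid a static fingerprint."""
    return random.choice(USER_AGENTS)
```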

### 3. Infinite Scroll Engine

- Scrolls in 600 px increments with random 0.8–2.2 s pauses
- Detects stalls (4 rounds with no height change) and stops
- Hard cap of 60 scroll rounds to prevent runaway loops
- Optionally counts item elements to confirm new content
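The stall-detection loop can be sketched like this (a simplified version under the same constants; only `page.evaluate` from Playwright's Page API is assumed):

```python
import random
import time

def auto_scroll(page, step_px=600, max_rounds=60, stall_threshold=4,
                pause_range=(0.8, 2.2)):
    """Scroll until the page stops growing or the round cap is hit.

    Returns the number of scroll rounds performed.
    """
    last_height = 0
    stalls = 0
    rounds = 0
    for _ in range(max_rounds):
        page.evaluate(f"window.scrollBy(0, {step_px})")
        time.sleep(random.uniform(*pause_range))  # human-like pause
        rounds += 1
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            stalls += 1
            if stalls >= stall_threshold:
                break  # page stopped growing: no new content loading
        else:
            stalls = 0
            last_height = height
    return rounds
```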

### 4. Pagination Engine

- Finds the "Next" / "Load More" button by CSS selector
- Clicks it with a random 1–3 s human-like delay
- Stops when the button is missing, hidden, or disabled
- Supports up to 20 pages (configurable in `scraper.py`)
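The stopping conditions above can be sketched as one loop (`paginate` and `collect` are hypothetical names; `page` follows Playwright's Page API with `query_selector`, and the ElementHandle methods `is_visible`, `is_disabled`, `click`):

```python
import random
import time

def paginate(page, next_selector, collect, max_pages=20, delay_range=(1, 3)):
    """Walk listing pages via the Next button, accumulating records.

    `collect(page)` is a caller-supplied function returning the current
    page's records as a list of dicts.
    """
    records = []
    for _ in range(max_pages):
        records.extend(collect(page))
        btn = page.query_selector(next_selector)
        if btn is None or not btn.is_visible() or btn.is_disabled():
            break  # missing, hidden, or disabled: last page reached
        time.sleep(random.uniform(*delay_range))  # human-like delay
        btn.click()
    return records
```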

### 5. Cumulative Collection

- Collects all records across all pages and scrolls
- Merges inconsistent keys gracefully (union of all keys)
- Exports one clean `scraped_data.csv` at the end
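A key-union export like this can be done entirely with the standard library's `csv.DictWriter` (a sketch; `export_records` is a hypothetical name):

```python
import csv

def export_records(records, path="scraped_data.csv"):
    """Write records with inconsistent keys to one CSV.

    The header is the union of every record's keys in first-seen order;
    missing values are left blank via DictWriter's `restval`.
    """
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return fieldnames
```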

### 6. Human Behavior

- All delays between scrolls, clicks, and actions are randomized
- Avoids fixed-interval patterns that rate-limiters detect
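The whole layer reduces to one tiny helper, sketched here (`human_pause` is a hypothetical name):

```python
import random
import time

def human_pause(lo=0.8, hi=2.2):
    """Sleep for a random interval and return it.

    Jittered delays avoid the fixed-interval signature that
    rate-limiters and bot detectors look for.
    """
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay
```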

## Tuning Constants (top of `scraper.py`)

| Constant | Default | Effect |
|----------|---------|--------|
| `SCROLL_STEP_PX` | 600 | Pixels per scroll step |
| `SCROLL_PAUSE_MIN_MS` | 800 | Fastest scroll pause |
| `SCROLL_PAUSE_MAX_MS` | 2200 | Slowest scroll pause |
| `MAX_SCROLL_ROUNDS` | 60 | Max scroll iterations |
| `STALL_THRESHOLD` | 4 | Rounds with no change before stopping |
| `MAX_PAGES` | 20 | Max pages to paginate |
| `MAX_HTML_CHARS` | 28000 | Max chars sent to the LLM |
| `MODEL` | `llama-3.1-8b-instant` | Groq model |

## Site-Specific Selector Guide

| Site | Item Selector | Next Button Selector |
|------|---------------|----------------------|
| eMAG.ro | `.card-item` | `.pagination-next a` |
| Ethnasia.com | `.product-item` | (scroll only) |
| Shopify stores | `.product-card` | `a[rel='next']` |
| WooCommerce | `.product` | `a.next.page-numbers` |
| Generic | (leave blank) | `a[rel='next']` |

## Troubleshooting

| Problem | Fix |
|---------|-----|
| Page loads but no items | Increase the wait — raise `wait_for_timeout` to 8000 |
| Bot detection / CAPTCHA | Try `headless=False` in `launch_stealth_browser()` |
| LLM returns an empty array | Check `MAX_HTML_CHARS` — try increasing to 40000 |
| Pagination stops too early | Inspect the actual Next button selector with DevTools |
| CSV missing columns | LLM inconsistency — retry, or try a larger model |

## Adding a New Preset

Open `presets.py` and add:

```python
"mysite": {
    "label":             "My Site — Products",
    "url":               "https://mysite.com/products",
    "data_points":       ["Name", "Price", "SKU"],
    "do_scroll":         True,
    "do_paginate":       False,
    "next_btn_selector": None,
    "item_selector":     ".product-tile",
},
```

## About

An AI-powered web scraper designed to extract and structure data from complex websites using LLMs. Features automated parsing, dynamic content handling, and structured data output.
