Scrape any modern e-commerce website — no HTML, no CSS selectors, no coding.

| File | Purpose |
|---|---|
| `scraper.py` | Main script: full interactive CLI |
| `quick_scrape.py` | Quick launcher: pick a preset site in seconds |
| `presets.py` | Site profiles (eMAG, Ethnasia, Books, HN) |
| `requirements.txt` | Python dependencies |
| `setup.bat` | One-click installer |
| `run.bat` | Runs `scraper.py` |
| `run_quick.bat` | Runs `quick_scrape.py` |

    cd C:\dev\ai-scraper
    setup.bat

Or manually:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    playwright install chromium

Set your Groq API key (PowerShell):

    $env:GROQ_API_KEY = "gsk_xxxxxxxxxxxxxxxxxxxx"

Or edit `scraper.py` directly.
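
As a rough illustration of the environment-variable route, a script could read the key at startup and fail early with a clear message when it is missing. The helper name `load_groq_api_key` is hypothetical and is not taken from `scraper.py`.

```python
import os


def load_groq_api_key(env=None):
    """Return the Groq API key from the environment, or raise a clear error."""
    # Accept an explicit mapping so the function is easy to test.
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; set it in your shell first")
    return key
```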

    python scraper.py

You'll answer 5 quick questions:
- URL: paste any product listing URL
- Data points: e.g. `Product Name, Price, Rating`
- Infinite scroll? Enter `y` for sites like Ethnasia
- Paginate? Enter `y` for sites like eMAG
- Next button selector: a CSS selector, or press Enter for the default

    python quick_scrape.py

Pick from:

    [1] eMAG.ro — Products
    [2] Ethnasia.com — Products
    [3] Books to Scrape (demo site)
    [4] Hacker News — Top Stories
    [c] Custom URL

Instead of `networkidle` (which times out on heavy sites), the script:
- Waits for `domcontentloaded` first (fast)
- Then either waits for your item selector to appear, or falls back to a 5-second timeout
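
The wait strategy above can be sketched roughly as follows. This is a minimal illustration, not the code in `scraper.py`: the function name `load_page`, its return values, and the choice to reuse the 5-second value as the selector timeout are assumptions. `page` is expected to behave like a Playwright sync `Page`.

```python
def load_page(page, url, item_selector=None, fallback_ms=5000):
    """Navigate without waiting for network idle; report how the wait ended."""
    # Return as soon as the DOM is parsed instead of waiting for "networkidle".
    page.goto(url, wait_until="domcontentloaded")
    if item_selector:
        try:
            # Wait until the first matching item appears.
            page.wait_for_selector(item_selector, timeout=fallback_ms)
            return "selector"
        except Exception:
            # Playwright raises playwright.sync_api.TimeoutError here;
            # a broad except keeps this sketch free of the import.
            return "timeout"
    # No selector given: fall back to a fixed pause.
    page.wait_for_timeout(fallback_ms)
    return "fixed-wait"
```

Returning a label rather than raising lets the caller decide whether a timeout is fatal or whether it should still try to extract whatever has rendered.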

To reduce bot detection, it:
- Rotates from a pool of 4 real Chrome User-Agents
- Injects JavaScript to patch `navigator.webdriver = undefined`
- Spoofs `navigator.plugins`, `navigator.languages`, `window.chrome`
- Sets a realistic viewport, locale, and timezone
- Disables Chromium's `AutomationControlled` flag
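
A sketch of what these stealth settings might look like with Playwright. Everything concrete here is an assumption for illustration: the User-Agent strings, the viewport size, the locale, the timezone, and the names `USER_AGENTS`, `LAUNCH_ARGS`, `STEALTH_JS`, and `stealth_context_options` are not taken from `scraper.py`.

```python
import random

# Illustrative pool; a real pool would hold current Chrome User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

# Chromium switch that turns off the AutomationControlled Blink feature.
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]

# Runs in every page before site scripts; hides common automation tells.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
window.chrome = window.chrome || { runtime: {} };
"""


def stealth_context_options(rng=random):
    """Keyword arguments intended for Playwright's browser.new_context()."""
    return {
        "user_agent": rng.choice(USER_AGENTS),       # rotate per session
        "viewport": {"width": 1366, "height": 768},  # assumed desktop size
        "locale": "en-US",                           # assumed locale
        "timezone_id": "Europe/Bucharest",           # assumed timezone
    }
```

With Playwright's sync API, these would typically be applied as `browser = p.chromium.launch(args=LAUNCH_ARGS)`, then `context = browser.new_context(**stealth_context_options())`, then `context.add_init_script(STEALTH_JS)`.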
- Scrolls in 600px increments with random 0.8–2.2s pauses
- Detects stalls (4 rounds with no height change) and stops
- Hard cap of 60 scroll rounds to prevent runaway loops
- Optionally counts item elements to confirm new content
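
The scroll loop described above could look roughly like this. It is a sketch under stated assumptions: the constants mirror the documented defaults, but the function name `scroll_until_stalled` and the exact stall check are illustrative, and `page` is assumed to behave like a Playwright sync `Page`.

```python
import random

SCROLL_STEP_PX = 600
SCROLL_PAUSE_MIN_MS = 800
SCROLL_PAUSE_MAX_MS = 2200
MAX_SCROLL_ROUNDS = 60
STALL_THRESHOLD = 4


def scroll_until_stalled(page, item_selector=None, rng=random):
    """Scroll in steps until nothing changes; return the rounds performed."""

    def snapshot():
        # Page height, plus the item count when a selector is supplied.
        height = page.evaluate("document.body.scrollHeight")
        count = len(page.query_selector_all(item_selector)) if item_selector else 0
        return height, count

    last = snapshot()
    stalled = rounds = 0
    while rounds < MAX_SCROLL_ROUNDS and stalled < STALL_THRESHOLD:
        page.evaluate(f"window.scrollBy(0, {SCROLL_STEP_PX})")
        # Randomized pause so scrolls do not arrive at a fixed interval.
        page.wait_for_timeout(rng.randint(SCROLL_PAUSE_MIN_MS, SCROLL_PAUSE_MAX_MS))
        rounds += 1
        current = snapshot()
        if current == last:
            stalled += 1            # no new height and no new items
        else:
            stalled, last = 0, current
    return rounds
```

The hard cap and the stall counter are independent exits: the cap bounds the worst case on endless feeds, while the stall counter ends the loop early once the page stops growing.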
- Finds "Next" / "Load More" button by CSS selector
- Clicks it with a random 1–3s human-like delay
- Stops when button is missing, hidden, or disabled
- Supports up to 20 pages (configurable via `MAX_PAGES` in `scraper.py`)
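
A minimal sketch of the pagination loop, assuming `page` behaves like a Playwright sync `Page`. The function name `paginate` and the `handle_page` callback are hypothetical; a real script would also wait for the next page's items after each click.

```python
import random

MAX_PAGES = 20


def paginate(page, next_selector, handle_page, max_pages=MAX_PAGES, rng=random):
    """Call handle_page(page) on each page; return the number of pages visited."""
    visited = 0
    while True:
        handle_page(page)
        visited += 1
        if visited >= max_pages:
            break
        button = page.query_selector(next_selector)
        # Stop when the button is missing, hidden, or disabled.
        if button is None or not button.is_visible() or not button.is_enabled():
            break
        # Human-like pause of 1 to 3 seconds before clicking.
        page.wait_for_timeout(rng.randint(1000, 3000))
        button.click()
    return visited
```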
- Collects all records across all pages and scrolls
- Merges inconsistent keys gracefully (union of all keys)
- Exports one clean `scraped_data.csv` at the end
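
Merging records with inconsistent keys can be done with the standard library alone. This sketch assumes each record is a plain `dict`; the helper names `merge_fieldnames` and `write_csv` are illustrative, not the ones used in `scraper.py`.

```python
import csv
import io


def merge_fieldnames(records):
    """Union of all keys across records, in first-seen order."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    return fieldnames


def write_csv(records, out):
    """Write records to a text stream; missing values become empty cells."""
    writer = csv.DictWriter(out, fieldnames=merge_fieldnames(records), restval="")
    writer.writeheader()
    writer.writerows(records)
```

To write the file itself, open it with `newline=""` as the `csv` module recommends: `with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f: write_csv(records, f)`.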
- All delays between scrolls, clicks, and actions are randomized
- Avoids fixed-interval patterns that rate-limiters detect

| Constant | Default | Effect |
|---|---|---|
| `SCROLL_STEP_PX` | `600` | Pixels per scroll step |
| `SCROLL_PAUSE_MIN_MS` | `800` | Shortest scroll pause (ms) |
| `SCROLL_PAUSE_MAX_MS` | `2200` | Longest scroll pause (ms) |
| `MAX_SCROLL_ROUNDS` | `60` | Max scroll iterations |
| `STALL_THRESHOLD` | `4` | Rounds with no change before stopping |
| `MAX_PAGES` | `20` | Max pages to paginate |
| `MAX_HTML_CHARS` | `28000` | Max chars sent to LLM |
| `MODEL` | `llama-3.1-8b-instant` | Groq model |
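
To illustrate what `MAX_HTML_CHARS` controls, here is a sketch of truncating page HTML before it is placed in the LLM prompt. The prompt wording and the function name `build_prompt` are assumptions; only the character cap comes from the table above.

```python
MAX_HTML_CHARS = 28000


def build_prompt(html, data_points, max_chars=MAX_HTML_CHARS):
    """Build an extraction prompt from page HTML, truncated to max_chars."""
    fields = ", ".join(data_points)
    # Anything past max_chars is dropped, so items late in the page are lost.
    return (
        f"Extract these fields from each item as a JSON array of objects: {fields}.\n"
        "HTML:\n" + html[:max_chars]
    )
```

Under this reading, raising the cap lets more of the page reach the model, at the cost of a longer prompt.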

| Site | Item Selector | Next Button Selector |
|---|---|---|
| eMAG.ro | `.card-item` | `.pagination-next a` |
| Ethnasia.com | `.product-item` | (scroll only) |
| Shopify stores | `.product-card` | `a[rel='next']` |
| WooCommerce | `.product` | `a.next.page-numbers` |
| Generic | (leave blank) | `a[rel='next']` |

| Problem | Fix |
|---|---|
| Page loads but no items | Increase the wait: raise `wait_for_timeout` to `8000` (ms) |
| Bot detection / CAPTCHA | Try `headless=False` in `launch_stealth_browser()` |
| LLM returns empty array | Check `MAX_HTML_CHARS`; try increasing it to `40000` |
| Pagination stops too early | Inspect the actual Next button selector with DevTools |
| CSV missing columns | LLM inconsistency: retry, or consider a larger model |

Open `presets.py` and add:

    "mysite": {
        "label": "My Site — Products",
        "url": "https://mysite.com/products",
        "data_points": ["Name", "Price", "SKU"],
        "do_scroll": True,
        "do_paginate": False,
        "next_btn_selector": None,
        "item_selector": ".product-tile",
    },