rayen03/ai-scraper


# Resilient AI Web Scraper v2.0

**Stealth · Infinite Scroll · Pagination · AI Extraction**

Scrape any modern e-commerce website — no HTML, no CSS selectors, no coding.


## Project Files

| File | Purpose |
|------|---------|
| `scraper.py` | Main script — full interactive CLI |
| `quick_scrape.py` | Quick launcher — pick a preset site in seconds |
| `presets.py` | Site profiles (eMAG, Ethnasia, Books, HN) |
| `requirements.txt` | Python dependencies |
| `setup.bat` | One-click installer |
| `run.bat` | Runs `scraper.py` |
| `run_quick.bat` | Runs `quick_scrape.py` |

## Setup (run once)

```bat
cd C:\dev\ai-scraper
setup.bat
```

Or manually:

```bat
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
playwright install chromium
```

## Groq API Key

Set the key in PowerShell before running:

```powershell
$env:GROQ_API_KEY = "gsk_xxxxxxxxxxxxxxxxxxxx"
```

Or edit `scraper.py` directly.
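If you want to fail fast when the key is missing, a minimal sketch of reading it from the environment (`get_groq_key` is a hypothetical helper, not a function shipped in `scraper.py`):

```python
import os

def get_groq_key() -> str:
    """Read the Groq API key from the environment.

    Groq keys start with "gsk_", so a quick prefix check catches both
    an unset variable and an obviously wrong value before any request.
    """
    key = os.environ.get("GROQ_API_KEY", "")
    if not key.startswith("gsk_"):
        raise RuntimeError("GROQ_API_KEY is not set; export it or edit scraper.py")
    return key
```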


## Two Ways to Run

### Option A — Interactive (any site)

```bat
python scraper.py
```

You'll answer 5 quick questions:

1. **URL** — paste any product listing URL
2. **Data points** — e.g. Product Name, Price, Rating
3. **Infinite scroll?** — `y` for sites like Ethnasia
4. **Paginate?** — `y` for sites like eMAG
5. **Next button selector** — a CSS selector, or press Enter for the default

### Option B — Quick preset (zero config)

```bat
python quick_scrape.py
```

Pick from:

```
[1]  eMAG.ro — Products
[2]  Ethnasia.com — Products
[3]  Books to Scrape (demo site)
[4]  Hacker News — Top Stories
[c]  Custom URL
```

## How the 6 Resilience Layers Work

### 1. Smarter Page Loading

Instead of waiting for `networkidle` (which times out on heavy sites), the script:

- Waits for `domcontentloaded` first (fast)
- Then either waits for your item selector to appear, or falls back to a 5-second timeout
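A minimal sketch of that strategy using Playwright's sync API (`load_listing` is a hypothetical name; the real logic lives in `scraper.py`):

```python
def load_listing(page, url, item_selector=None, fallback_ms=5000):
    """Load a listing page without relying on networkidle.

    Waits for domcontentloaded, then for the item selector if one is
    known; on timeout (or with no selector) falls back to a fixed wait.
    Returns which strategy applied, which is handy for logging.
    """
    page.goto(url, wait_until="domcontentloaded")
    if item_selector:
        try:
            page.wait_for_selector(item_selector, timeout=fallback_ms)
            return "selector"
        except Exception:
            return "selector-timeout"
    page.wait_for_timeout(fallback_ms)
    return "fixed-wait"
```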

### 2. Stealth Mode

- Rotates from a pool of 4 real Chrome User-Agents
- Injects JavaScript to patch `navigator.webdriver` to `undefined`
- Spoofs `navigator.plugins`, `navigator.languages`, `window.chrome`
- Sets a realistic viewport, locale, and timezone
- Disables Chromium's `AutomationControlled` flag
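A sketch of what these patches can look like. The UA pool and `STEALTH_JS` below are illustrative, not the exact values in `scraper.py`; a Playwright context would apply them via `browser.new_context(user_agent=...)` and `context.add_init_script(STEALTH_JS)`:

```python
import random

# Illustrative pool; the real list of 4 UAs lives in scraper.py.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]

# JavaScript injected before any page script runs, hiding automation tells.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
window.chrome = window.chrome || { runtime: {} };
"""

def pick_user_agent() -> str:
    """Rotate User-Agents per run to avoid a static fingerprint."""
    return random.choice(USER_AGENTS)
```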

### 3. Infinite Scroll Engine

- Scrolls in 600 px increments with random 0.8–2.2 s pauses
- Detects stalls (4 rounds with no height change) and stops
- Hard cap of 60 scroll rounds to prevent runaway loops
- Optionally counts item elements to confirm new content
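The stall-detection loop can be sketched like this (a simplified version under the same constants; only `page.evaluate` from Playwright's Page API is assumed):

```python
import random
import time

def auto_scroll(page, step_px=600, max_rounds=60, stall_threshold=4,
                pause_range=(0.8, 2.2)):
    """Scroll until the page stops growing or the round cap is hit.

    Returns the number of scroll rounds performed.
    """
    last_height = 0
    stalls = 0
    rounds = 0
    for _ in range(max_rounds):
        page.evaluate(f"window.scrollBy(0, {step_px})")
        time.sleep(random.uniform(*pause_range))  # human-like pause
        rounds += 1
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            stalls += 1
            if stalls >= stall_threshold:
                break  # page stopped growing: no new content loading
        else:
            stalls = 0
            last_height = height
    return rounds
```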

### 4. Pagination Engine

- Finds the "Next" / "Load More" button by CSS selector
- Clicks it with a random 1–3 s human-like delay
- Stops when the button is missing, hidden, or disabled
- Supports up to 20 pages (configurable in `scraper.py`)
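The stopping conditions above can be sketched as one loop (`paginate` and `collect` are hypothetical names; `page` follows Playwright's Page API with `query_selector`, and the ElementHandle methods `is_visible`, `is_disabled`, `click`):

```python
import random
import time

def paginate(page, next_selector, collect, max_pages=20, delay_range=(1, 3)):
    """Walk listing pages via the Next button, accumulating records.

    `collect(page)` is a caller-supplied function returning the current
    page's records as a list of dicts.
    """
    records = []
    for _ in range(max_pages):
        records.extend(collect(page))
        btn = page.query_selector(next_selector)
        if btn is None or not btn.is_visible() or btn.is_disabled():
            break  # missing, hidden, or disabled: last page reached
        time.sleep(random.uniform(*delay_range))  # human-like delay
        btn.click()
    return records
```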

### 5. Cumulative Collection

- Collects all records across all pages and scrolls
- Merges inconsistent keys gracefully (union of all keys)
- Exports one clean `scraped_data.csv` at the end
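A key-union export like this can be done entirely with the standard library's `csv.DictWriter` (a sketch; `export_records` is a hypothetical name):

```python
import csv

def export_records(records, path="scraped_data.csv"):
    """Write records with inconsistent keys to one CSV.

    The header is the union of every record's keys in first-seen order;
    missing values are left blank via DictWriter's `restval`.
    """
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return fieldnames
```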

### 6. Human Behavior

- All delays between scrolls, clicks, and actions are randomized
- Avoids fixed-interval patterns that rate-limiters detect
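The whole layer reduces to one tiny helper, sketched here (`human_pause` is a hypothetical name):

```python
import random
import time

def human_pause(lo=0.8, hi=2.2):
    """Sleep for a random interval and return it.

    Jittered delays avoid the fixed-interval signature that
    rate-limiters and bot detectors look for.
    """
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay
```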

## Tuning Constants (top of `scraper.py`)

| Constant | Default | Effect |
|----------|---------|--------|
| `SCROLL_STEP_PX` | 600 | Pixels per scroll step |
| `SCROLL_PAUSE_MIN_MS` | 800 | Fastest scroll pause |
| `SCROLL_PAUSE_MAX_MS` | 2200 | Slowest scroll pause |
| `MAX_SCROLL_ROUNDS` | 60 | Max scroll iterations |
| `STALL_THRESHOLD` | 4 | Rounds with no change before stopping |
| `MAX_PAGES` | 20 | Max pages to paginate |
| `MAX_HTML_CHARS` | 28000 | Max chars sent to the LLM |
| `MODEL` | `llama-3.1-8b-instant` | Groq model |

## Site-Specific Selector Guide

| Site | Item Selector | Next Button Selector |
|------|---------------|----------------------|
| eMAG.ro | `.card-item` | `.pagination-next a` |
| Ethnasia.com | `.product-item` | (scroll only) |
| Shopify stores | `.product-card` | `a[rel='next']` |
| WooCommerce | `.product` | `a.next.page-numbers` |
| Generic | (leave blank) | `a[rel='next']` |

## Troubleshooting

| Problem | Fix |
|---------|-----|
| Page loads but no items | Increase the wait — raise `wait_for_timeout` to 8000 |
| Bot detection / CAPTCHA | Try `headless=False` in `launch_stealth_browser()` |
| LLM returns an empty array | Check `MAX_HTML_CHARS` — try increasing to 40000 |
| Pagination stops too early | Inspect the actual Next button selector with DevTools |
| CSV missing columns | LLM inconsistency — retry, or try a larger model |

## Adding a New Preset

Open `presets.py` and add:

```python
"mysite": {
    "label":             "My Site — Products",
    "url":               "https://mysite.com/products",
    "data_points":       ["Name", "Price", "SKU"],
    "do_scroll":         True,
    "do_paginate":       False,
    "next_btn_selector": None,
    "item_selector":     ".product-tile",
},
```

## About

An AI-powered web scraper designed to extract and structure data from complex websites using LLMs. Features automated parsing, dynamic content handling, and structured data output.
