
███████╗ ██████╗██████╗  █████╗ ██████╗ ██╗   ██╗██████╗  ██████╗  ██████╗
██╔════╝██╔════╝██╔══██╗██╔══██╗██╔══██╗╚██╗ ██╔╝██╔══██╗██╔═══██╗██╔═══██╗
███████╗██║     ██████╔╝███████║██████╔╝ ╚████╔╝ ██║  ██║██║   ██║██║   ██║
╚════██║██║     ██╔══██╗██╔══██║██╔═══╝   ╚██╔╝  ██║  ██║██║   ██║██║   ██║
███████║╚██████╗██║  ██║██║  ██║██║        ██║   ██████╔╝╚██████╔╝╚██████╔╝
╚══════╝ ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝╚═╝        ╚═╝   ╚═════╝  ╚═════╝  ╚═════╝
>--------------------------------- Scrapy dappy doo crawler for proxy sites


Features

  • Crawls proxy sites for working proxies (see the spider sketch below)
  • Runs a Scrapyd server to schedule crawls and fetch results
  • Retains jobs and logs for recent crawls
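
Each spider follows the usual Scrapy pattern: fetch a proxy listing page, parse the rows, and yield one entry per proxy. A minimal sketch of what such a spider looks like — the spider name, URL, selectors and fields are illustrative assumptions, not the actual freeproxylist implementation:

# Minimal sketch of a proxy-site spider. The spider name, URL,
# selectors and item fields are illustrative assumptions, not the
# real freeproxylist implementation.
import scrapy


class ExampleProxySpider(scrapy.Spider):
    name = "exampleproxy"  # hypothetical name
    start_urls = ["https://example.com/proxy-list"]  # placeholder URL

    def parse(self, response):
        # Assume one proxy per table row: first cell IP, second cell port.
        for row in response.css("table tbody tr"):
            cells = row.css("td::text").getall()
            if len(cells) >= 2:
                yield {"ip": cells[0].strip(), "port": cells[1].strip()}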

Usage

# Copy the example environment file to .env
cp .env.example .env

# Build the docker image and run the container
docker-compose up --build --detach

# Run a scrapy crawl job via cli
# docker-compose exec -it scrapyd scrapy crawl <spider_name>
docker-compose exec -it scrapyd scrapy crawl freeproxylist

# Run a scrapy crawl job via scrapyd api
# Scrapyd documentation: https://scrapyd.readthedocs.io/en/latest/api.html#schedule-json
curl http://localhost:6800/schedule.json -d project=scrapydoo -d spider=freeproxylist

The Scrapyd API is now available at http://localhost:6800.
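
The same scheduling call can be made from Python with any HTTP client; a minimal sketch using requests (an assumed dependency, not part of this project):

# Schedule the same crawl from Python, equivalent to the curl call above.
# Assumes the docker-compose stack is running on localhost:6800.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "scrapydoo", "spider": "freeproxylist"},
)
resp.raise_for_status()
# A successful response looks like {"status": "ok", "jobid": "..."}.
print(resp.json())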

Pages

  • root: / - Scrapyd server
  • jobs: /jobs - crawl jobs
  • items: /items - scraped items
  • logs: /logs - spider logs

Endpoints

Provided by the Scrapyd server (a Python usage sketch follows the list):

  • daemonstatus: /daemonstatus.json - to check the load status of a service
  • addversion: /addversion.json - to add a new version of a project
  • schedule: /schedule.json - to schedule a spider run
  • cancel: /cancel.json - to cancel a spider run
  • listprojects: /listprojects.json - to list all projects
  • listversions: /listversions.json - to list all versions of a project
  • listspiders: /listspiders.json - to list all spiders of a project
  • listjobs: /listjobs.json - to list all pending, running and finished jobs
  • delversion: /delversion.json - to delete a version of a project
  • delproject: /delproject.json - to delete a project
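
For example, listjobs.json and cancel.json can be combined to track and stop a crawl. A minimal sketch, again assuming the stack is running on localhost:6800:

# Inspect job state via listjobs.json and cancel a running job.
import requests

BASE = "http://localhost:6800"

jobs = requests.get(f"{BASE}/listjobs.json", params={"project": "scrapydoo"}).json()
print("pending:", jobs["pending"])
print("running:", jobs["running"])
print("finished:", jobs["finished"])

# Cancel the first running job, if any.
if jobs["running"]:
    cancel = requests.post(
        f"{BASE}/cancel.json",
        data={"project": "scrapydoo", "job": jobs["running"][0]["id"]},
    ).json()
    print(cancel["status"])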

Development

# Poetry is required for installing and managing dependencies
# https://python-poetry.org/docs/#installation
poetry install

# Run the crawlers
# poetry run scrapy crawl <spider_name>
poetry run scrapy crawl freeproxylist

# Install pre-commit hooks
poetry run pre-commit install

# Formatting (formats code in place)
poetry run black .

# Linting (run with --fix to fix issues automatically)
poetry run ruff .
poetry run ruff --fix .

# Type checking
poetry run mypy .
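
Spiders can also be run in-process during development, without the scrapy CLI. A minimal sketch using Scrapy's CrawlerProcess, assuming it is executed from the project root so the project settings resolve:

# Run a spider in-process instead of via the scrapy CLI.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("freeproxylist")  # spider name, as used with the CLI above
process.start()  # blocks until the crawl finishes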

Configuration details can be found in pyproject.toml.

Support

PayPal