MongoDB-based components for Scrapy that allow distributed crawling:
- Scheduler
- Duplication Filter
From GitHub

To install it via pip,
```bash
# install
pip install git+https://github.com/taicaile/scrapy-mongodb

# reinstall
pip install --ignore-installed git+https://github.com/taicaile/scrapy-mongodb
```
or clone it first,
```bash
git clone https://github.com/taicaile/scrapy-mongodb.git
cd scrapy-mongodb
python setup.py install
```
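Note that `python setup.py install` is deprecated by recent setuptools releases; installing with pip from the cloned directory is equivalent:

```bash
# same effect as setup.py install, run from the repository root
pip install .
```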
To install a specific version,
```bash
# replace `v0.1.0` with the tag you want
pip install git+https://github.com/taicaile/scrapy-mongodb@v0.1.0
```
You can put the following in requirements.txt,
```text
scrapy-mongodb@git+https://github.com/taicaile/scrapy-mongodb@v0.1.0
```
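Whichever route you use, a quick way to confirm the install worked is to import the package (the module name `scrapy_mongodb` matches the settings paths below):

```bash
# prints the module path if the package is importable
python -c "import scrapy_mongodb; print(scrapy_mongodb.__file__)"
```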
Enable the components in your settings.py:
```python
# Enable scheduling by storing the requests queue in MongoDB.
SCHEDULER = "scrapy_mongodb.scheduler.Scheduler"

# Specify the host and port to use when connecting to MongoDB (optional).
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "scrapy"
```
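The component list above also names a Duplication Filter, but no setting for it is shown here. By analogy with similar projects such as scrapy-redis, it would be wired in through Scrapy's `DUPEFILTER_CLASS` setting; the class path below is an assumption, so verify it against the repository:

```python
# Hypothetical class path -- check the scrapy-mongodb source for the real one.
DUPEFILTER_CLASS = "scrapy_mongodb.dupefilter.RFPDupeFilter"
```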
The dupefilter and the scheduler queue can persist across crawls; both are disabled by default:
```python
MONGODB_DUPEFILTER_PERSIST = False       # default
MONGODB_SCHEDULER_QUEUE_PERSIST = False  # default
```
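To keep the fingerprint set and the pending-request queue in MongoDB after a crawl stops, set both flags to True. Whether this lets an interrupted crawl resume where it left off is an assumption, based on how the equivalent persist flags behave in scrapy-redis:

```python
# Keep dupefilter fingerprints and queued requests across runs.
MONGODB_DUPEFILTER_PERSIST = True
MONGODB_SCHEDULER_QUEUE_PERSIST = True
```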
Note that this is currently not suitable for distributed crawling.