A SongCi spider project. (Chinese: 宋词爬虫)
An efficient spider, built on Scrapy, that crawls SongCi (Song dynasty ci poetry) from the web.
Results can be saved to files in multiple formats, as well as to MongoDB collections.
Requirements:
- Python 3.5+
- (Optional) Docker and Docker Compose
- (Optional) MongoDB
You may choose one of the following methods to run this project.
Using Docker is the recommended way, as you don't need to install and configure MongoDB or other dependencies yourself.
- Install Docker and Docker Compose.
- (Optional) In case of environment updates, run:
docker-compose build
- Run sc_spider:
docker-compose up
To run with the scrapy command line instead:
- Install MongoDB
- Install project requirements:
pip install -r requirements.txt
- Run sc_spider:
cd sc_scrapy
scrapy crawl gushiwen -s MONGO_URI=localhost:27017
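The -s MONGO_URI=... flag overrides a Scrapy setting for a single run. As a minimal sketch of how an item pipeline could consume that setting, following the MongoDB pipeline pattern shown in the Scrapy documentation: the class name MongoPipelineSketch, the MONGO_DATABASE setting, its default "songci", and the one-collection-per-spider layout are all assumptions for illustration, not this project's actual code.

```python
class MongoPipelineSketch:
    """Hypothetical pipeline: stores each scraped item in MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; its settings
        # include any "-s NAME=value" overrides from the command line.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "songci"),  # assumed setting
        )

    def open_spider(self, spider):
        import pymongo  # imported lazily so the class loads without pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Assumed layout: one collection per spider name.
        self.db[spider.name].insert_one(dict(item))
        return item
```

A pipeline like this only runs if it is listed in ITEM_PIPELINES in the project settings.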
To run via the execute.py script instead:
- Install MongoDB
- Install project requirements:
pip install -r requirements.txt
- Edit your hosts file, adding:
127.0.0.1 mongo
- Alternatively, instead of editing the hosts file, modify the MONGO_URI setting in sc_scrapy/settings.py.
- Run sc_spider:
cd sc_scrapy
python execute.py
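The contents of execute.py are not shown here. A launcher script of this kind is commonly a thin wrapper around scrapy.cmdline.execute, which runs a Scrapy command from Python. The sketch below assumes that approach; the helper names build_crawl_argv and run are hypothetical.

```python
def build_crawl_argv(spider_name, overrides=None):
    """Build the argument list for 'scrapy crawl <spider> [-s KEY=VALUE ...]'."""
    argv = ["scrapy", "crawl", spider_name]
    for key, value in sorted((overrides or {}).items()):
        argv += ["-s", "{}={}".format(key, value)]
    return argv


def run(spider_name="gushiwen", overrides=None):
    # Imported here so the argument builder works without Scrapy installed.
    from scrapy.cmdline import execute

    # Equivalent to typing the same command in a shell from the project directory.
    execute(build_crawl_argv(spider_name, overrides))
```

Calling run(overrides={"MONGO_URI": "localhost:27017"}) from inside sc_scrapy would then correspond to the scrapy crawl command used in the previous method.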
Features:
- Fast and flexible
- Able to pause and resume crawls, as Requests are serializable.
- Multiple output formats (thanks to Scrapy) with support for UTF-8 literals
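Pausing and resuming in Scrapy is normally done through its JOBDIR setting, which persists the scheduler's pending requests to disk, and the encoding of exported feeds is controlled by FEED_EXPORT_ENCODING. Whether this project sets either value is not stated, so the fragment below is only a sketch of what sc_scrapy/settings.py could contain; the directory name is an assumption.

```python
# Directory where Scrapy persists crawl state between runs (assumed path).
# Stop the crawl with a single Ctrl-C, then start it again with the same
# JOBDIR to resume where it left off.
JOBDIR = "crawls/gushiwen-1"

# Write feed exports as UTF-8 so Chinese text appears as literal characters
# rather than \uXXXX escape sequences in JSON output.
FEED_EXPORT_ENCODING = "utf-8"
```

Either value can also be passed for a single run with -s, in the same way as MONGO_URI.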
You can download the latest stable releases from: https://github.com/wings27/sc_spider/releases
All contributions are welcome: you can add new spiders, submit enhancement patches, or resolve issues.
However, please follow these conventions:
- Your coding style should follow PEP 8
- Spiders should only crawl SongCi-related content
- Spiders should obey robots.txt rules
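
In Scrapy, obeying robots.txt is usually a matter of keeping ROBOTSTXT_OBEY = True in the project settings. To illustrate what the rule means for a given URL, here is a small standard-library sketch; the function name is_allowed, the user agent string, and the example.com rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

# A new spider should only request URLs for which this check passes.
print(is_allowed(EXAMPLE_RULES, "sc_spider", "https://example.com/songci/1"))
print(is_allowed(EXAMPLE_RULES, "sc_spider", "https://example.com/private/page"))
```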