This project collects data from the PTT website. The crawler is built with the Scrapy framework in Python and stores its results in MongoDB. A plain Scrapy crawler runs on a single machine only, so we use scrapy-redis, which provides a distributed mechanism for running the spider across multiple client machines, and scrapyd for deployment.
- pttCrawler
Full dependency installation on Ubuntu 16.04
- Python 3 (tested on python 3.7.2)
- redis 3.4.1
- mongodb 4.0.16
- pymongo==3.10.1 (MongoDB driver)
- Scrapy==2.0.0 (crawler framework)
- scrapy-redis==0.6.8 (enables distributed crawling)
- scrapyd==1.2.1 (provides a crawling daemon)
- scrapyd-client==1.1.0 (used to deploy our spider)
- scrapydweb==1.4.0 (web UI for the crawler)
In `settings.py`, we should define the MongoDB settings:

```python
## in settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'ptt-sandbox'
```
and the redis settings:

```python
## in settings.py
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': 'yourpassword',
}
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```
```shell
scrapy crawl ptt -a start={m/d} -a end={m/d}
```

- where `-a` passes an argument to the spider; `{m/d}` means month/day, so `3/5` represents March 5th. For example: `scrapy crawl ptt -a start=3/5 -a end=3/8`
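Inside the spider, the `start`/`end` arguments could be parsed roughly like this (a sketch: `parse_md` is a hypothetical helper, with the year fixed to 2020 as in this project, not the spider's actual code):

```python
from datetime import date

def parse_md(arg, year=2020):
    """Parse an {m/d} argument such as '3/5' into a date.

    The crawler restricts itself to a single year, so only
    month and day are taken from the command line.
    """
    month, day = (int(part) for part in arg.split('/'))
    return date(year, month, day)

# scrapy crawl ptt -a start=3/5 -a end=3/8 would yield:
start, end = parse_md('3/5'), parse_md('3/8')
```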
```shell
redis-cli
auth yourpassword
```

- where `yourpassword` is set in `settings.py` and can be modified there directly.
```shell
lpush ptt:start_urls https://www.ptt.cc/bbs/{board}/index.html
```

- where `{board}` can be any board name, e.g. `Soft_Job` or `Gossiping`.
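The same push can also be scripted instead of typed into `redis-cli` (a sketch: `board_index_url` is a helper name of ours, and the connection parameters mirror `settings.py`):

```python
def board_index_url(board):
    # PTT board index pages live under /bbs/, e.g.
    # https://www.ptt.cc/bbs/Soft_Job/index.html
    return 'https://www.ptt.cc/bbs/%s/index.html' % board

# With redis-py and a running redis, the lpush would look like:
# import redis
# r = redis.Redis(host='localhost', port=6379, password='yourpassword')
# r.lpush('ptt:start_urls', board_index_url('Soft_Job'))
```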
There are three collections in mongoDB:
- Post
- Author
- Comment
Post:

schema | Description |
---|---|
*canonicalUrl | URL of the visited page
authorId | ID of the user who posted the article
title | title of the article
content | content of the article
publishedTime | date the post was created
updateTime | date the post was last updated
board | the PTT board the post belongs to
Author:

schema | Description |
---|---|
*authorId | ID of the user who posted the article
authorName | the author's nickname
Comment:

schema | Description |
---|---|
commentId | ID of the user who posted the comment
commentTime | when the comment was posted
commentContent | content of the comment
board | the PTT board the comment belongs to
Note: the prefix * in a schema marks the primary key.
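For illustration, documents in the three collections might look like the following (all field values are invented samples, not real data):

```python
# a sample document for each collection, mirroring the schemas above
post = {
    'canonicalUrl': 'https://www.ptt.cc/bbs/Soft_Job/M.1583400000.A.123.html',  # primary key
    'authorId': 'someuser',
    'title': '[Hiring] backend engineer',
    'content': 'sample article body',
    'publishedTime': '2020-03-05 18:00:00',
    'updateTime': '2020-03-05 18:30:00',
    'board': 'Soft_Job',
}
author = {
    'authorId': 'someuser',  # primary key
    'authorName': 'nickname',
}
comment = {
    'commentId': 'otheruser',
    'commentTime': '2020-03-05 19:00:00',
    'commentContent': 'push',
    'board': 'Soft_Job',
}
```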
- master-slave architecture
- the master runs the spider with the following command:

```shell
scrapy crawl pttCrawl
```
- start the redis service and open its command-line interface:

```shell
redis-cli
```
- the most important step is to push the urls you want to crawl. Here, we use `lpush` to attain this goal; the redis key is `pttCrawl:start_urls`. We push urls to redis as follows:

```shell
lpush pttCrawl:start_urls {ptt url}
```
- (optional) wake up our slave machines, which have slightly different declarations in `settings.py`, and run the same command:

```shell
scrapy crawl pttCrawl
```
In `settings.py`, we add a line that prevents crawling duplicate requests:

```python
## in settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
In `settings.py`, we also add lines that keep track of the crawler's progress: the requests in the redis queue persist after the crawling process stops, which makes it convenient to resume crawling later.

```python
## in settings.py
# Enable scheduling, storing the requests queue in redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Start from the last endpoint
SCHEDULER_PERSIST = True
```
- scrapyd provides a daemon for crawling. Like an HTTP server, we run it by typing the following command:

```shell
scrapyd
```

- for the purpose of deployment, we install the package `scrapyd-client` and run:

```shell
scrapyd-deploy pttCrawler
```
In case of duplicates in the database, we filter the data here:

```python
## in pipelines.py
def process_item(self, item, spider):
    if isinstance(item, PostItem):
        logging.debug("filter duplicated post items.")
        if item['canonicalUrl'] in self.post_set:
            raise DropItem("Duplicate post found: %s" % item)
        self.post_set.add(item['canonicalUrl'])
    elif isinstance(item, AuthorItem):
        logging.debug("filter duplicated author items.")
        if item['authorId'] in self.author_set:
            raise DropItem("Duplicate author found: %s" % item)
        self.author_set.add(item['authorId'])
    return item
```
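The filtering behaviour can be exercised in isolation with plain Python (a sketch: `DropItem` and `DuplicatesPipeline` below are simplified stand-ins for the Scrapy classes, not the project's actual pipeline):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DuplicatesPipeline:
    def __init__(self):
        self.post_set = set()

    def process_item(self, item):
        # drop any post whose canonicalUrl was already seen
        if item['canonicalUrl'] in self.post_set:
            raise DropItem("Duplicate post found: %s" % item)
        self.post_set.add(item['canonicalUrl'])
        return item

pipeline = DuplicatesPipeline()
first = pipeline.process_item({'canonicalUrl': 'https://www.ptt.cc/bbs/Soft_Job/M.1.html'})
try:
    pipeline.process_item({'canonicalUrl': 'https://www.ptt.cc/bbs/Soft_Job/M.1.html'})
    dropped = False
except DropItem:
    dropped = True  # the second, identical item is rejected
```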
The pipeline then saves the data in MongoDB and generates a JSON file.
To avoid getting banned, we adopt some tricks while we are crawling web pages.
- Download delays

We set `DOWNLOAD_DELAY` in `settings.py` to limit the download rate:

```python
## in settings.py
DOWNLOAD_DELAY = 2
```
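A constant delay is easy for a server to fingerprint; Scrapy's built-in `RANDOMIZE_DOWNLOAD_DELAY` setting (on by default) multiplies `DOWNLOAD_DELAY` by a random factor between 0.5 and 1.5, which this project could rely on as well:

```python
## in settings.py
# wait a random 0.5x-1.5x multiple of DOWNLOAD_DELAY between requests
RANDOMIZE_DOWNLOAD_DELAY = True
```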
- Distributed downloader

scrapy-redis already takes care of this for us.
- User Agent Pool

Randomly choose one user-agent per request through a middleware:

```python
## in middlewares.py
import random

from pttCrawler.settings import UserAgentList

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        agent = random.choice(list(UserAgentList))
        request.headers['User-Agent'] = agent
```

```python
## in settings.py
UserAgentList = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
]
DOWNLOADER_MIDDLEWARES = {
    'pttCrawler.middlewares.RandomUserAgentMiddleware': 543,
}
```
Note: we cannot disable cookies because we have to pass the 'over18' cookie to some PTT boards.
First we need to install scrapydweb:

```shell
pip install scrapydweb
```

Then run it with the following command:

```shell
scrapydweb
```

We can then visit `localhost:5000` to monitor our crawler and track its progress there.
Scrapy comes with a built-in service, called "Scrapyd", which allows you to deploy your projects and control their spiders using a JSON web service.
A full-featured web UI for Scrapyd cluster management, with Scrapy log analysis & visualization supported.
- Spider_app (scrapy-redis)
- Redis
- mongoDB
Before deploying to docker, we need to modify a few parts of `settings.py`:

```python
# local
# MONGO_URI = 'mongodb://localhost:27017'
# docker
MONGO_URI = 'mongodb://mongodb:27017'

# local
# REDIS_HOST = 'localhost'
# docker
REDIS_HOST = 'redis'
```

Since docker resolves the service names defined in the `.yml` file as server hosts, we replace `localhost` with those service names here.
In the main spider script `ptt.py`, for the sake of convenience we restrict the crawled dates to the year 2020. We also set `maximum_missing_count` to 500 to bound the exploration of articles: if no further page can be visited, or the missing count reaches this limit, we stop crawling so as to waste fewer resources.
```python
## in ptt.py
class PTTspider(RedisSpider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='logging.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO)
    name = 'ptt'
    redis_key = 'ptt:start_urls'
    board = None
    ## crawling restrictions
    year = 2020
    maximum_missing_count = 500
```
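The stopping rule described above can be sketched as follows (`should_stop`, `missing_count`, and `has_next_page` are hypothetical names for illustration, not the spider's actual variables):

```python
MAXIMUM_MISSING_COUNT = 500

def should_stop(missing_count, has_next_page):
    # stop when there is no page left to visit, or when the number of
    # missing / out-of-range articles reaches the configured bound
    return (not has_next_page) or missing_count >= MAXIMUM_MISSING_COUNT
```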
- scrapy api: https://scrapy.readthedocs.io/en/0.12/index.html
- scrapy-redis api: https://scrapy-redis.readthedocs.io/en/v0.6.1/readme.html
- jianshu personal note: https://www.jianshu.com/p/8a9176d11372
- SCUTJcfeng's github: https://github.com/SCUTJcfeng/Scrapy-redis-Projects
- ptt website C_Chat board: https://www.ptt.cc/bbs/C_Chat/index.html
- ripples's markdown: http://www.q2zy.com/articles/2015/12/15/note-of-scrapy/
- my8100's ScrapydWeb note: https://iter01.com/149794.html