pttCrawler

In this project we collect data from the PTT website. We adopt the Scrapy framework, written in Python, and use MongoDB as our storage. A single crawler, however, only runs on one machine. To crawl more efficiently, scrapy-redis provides a distributed mechanism that lets us run spiders on multiple client machines. For deployment, we use scrapyd.

Dependencies

Full dependency installation on Ubuntu 16.04

  • Python 3 (tested on python 3.7.2)
  • redis 3.4.1
  • mongodb 4.0.16

Requirements

  • pymongo==3.10.1 (MongoDB driver)
  • Scrapy==2.0.0 (crawler framework)
  • scrapy-redis==0.6.8 (enables distributed crawling)
  • scrapyd==1.2.1 (provides a crawling daemon)
  • scrapyd-client==1.1.0 (used to deploy our spider)
  • scrapydweb==1.4.0 (web UI for the crawler)

Setup

mongodb settings

In settings.py, we should define the mongodb settings:

## in settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'ptt-sandbox'
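
As a quick sanity check that the database is reachable with these settings, a short pymongo snippet can be run (a sketch, not part of the project; the timeout value is arbitrary):

## quick connectivity check (sketch, not part of the project)
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017',
                             serverSelectionTimeoutMS=2000)
client.admin.command('ping')          # raises if the server is unreachable
print(client.list_database_names())   # 'ptt-sandbox' appears after the first crawl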

redis settings

## in settings.py
REDIS_HOST = 'localhost'
REDIS_PARAMS = {
    'password':'yourpassword'
}
REDIS_PORT = 6379
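
To confirm the host and password are correct before starting a crawl, a short redis-py check can be used (a sketch, not part of the project):

## quick connectivity check (sketch, not part of the project)
import redis

r = redis.Redis(host='localhost', port=6379, password='yourpassword')
print(r.ping())   # True if the connection and password are accepted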

(Optional) filter duplicates

DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

Usage

Run the spider with the following command:

scrapy crawl ptt -a start={m/d} -a end={m/d}
  • where -a passes an argument to the spider.
  • {m/d} means month/day, so 3/5 represents March 5th.
    For example: scrapy crawl ptt -a start=3/5 -a end=3/8 (see the parsing sketch below).
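
For context, Scrapy exposes every -a argument as a string attribute on the spider instance (e.g. self.start == '3/5'). A minimal sketch of how such a {m/d} string could be parsed; the helper name is hypothetical and the real spider may parse dates differently:

## sketch: parsing an {m/d} argument such as '3/5' (hypothetical helper)
def parse_md(date_str):
    # '3/5' -> (3, 5), i.e. March 5th
    month, day = map(int, date_str.split('/'))
    return month, day

print(parse_md('3/5'))  # (3, 5)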

Start the Redis server and open its command-line interface:

redis-cli

Before crawling, we need to authenticate with the auth command:

auth yourpassword
  • where yourpassword is defined in settings.py and can be modified there directly.

Push a URL to Redis to start the crawler

lpush ptt:start_urls https://www.ptt.cc/bbs/{board}/index.html
  • where {board} is a board name such as Soft_Job or Gossiping, e.g. lpush ptt:start_urls https://www.ptt.cc/bbs/Soft_Job/index.html

SnapShot

Result in db

(screenshot: post info stored in MongoDB)

Workflow in the local

(screenshot: terminal 1, interacting with Redis using redis-cli)

(screenshot: terminal 2, running the crawler with scrapy crawl ptt -a start={date} -a end={date})

Collections

There are three collections in mongoDB:

  • Post
  • Author
  • Comment

Post

schema           Description
*canonicalUrl    URL of the crawled page
authorId         the user who posted the article
title            title of the article
content          content of the article
publishedTime    the date the post was created
updateTime       the date the post was updated
board            the PTT board the post belongs to

Author

schema           Description
*authorId        the user who posted the article
authorName       the author's nickname

Comment

schema           Description
commentId        the user who posted the comment
commentTime      when the comment was posted
commentContent   content of the comment
board            the PTT board the comment belongs to

Note: a field prefixed with * is the primary key.
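
For reference, the three collections map naturally onto Scrapy item classes. A sketch of what items.py might look like, with field names taken from the schema above (the project's actual definitions may differ):

## in items.py (sketch; field names follow the schema above)
import scrapy

class PostItem(scrapy.Item):
    canonicalUrl = scrapy.Field()   # primary key
    authorId = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    publishedTime = scrapy.Field()
    updateTime = scrapy.Field()
    board = scrapy.Field()

class AuthorItem(scrapy.Item):
    authorId = scrapy.Field()       # primary key
    authorName = scrapy.Field()

class CommentItem(scrapy.Item):
    commentId = scrapy.Field()
    commentTime = scrapy.Field()
    commentContent = scrapy.Field()
    board = scrapy.Field()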

Scrapy-Redis Framework

Distributed crawler

  • master-slave architecture
  1. the master runs the spider with the following command:
scrapy crawl ptt
  2. start the Redis service and open its CLI:
redis-cli
  3. the most important step is to push the URLs you want to crawl. We use lpush with the Redis key ptt:start_urls:
lpush ptt:start_urls {ptt url}
  4. (optional) wake up the slave machines, whose settings.py differs slightly from the master's (see the sketch after this list), and run the same crawl command:
scrapy crawl ptt
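
The main difference on a slave is that scrapy-redis must point at the master's Redis instance instead of localhost. A sketch, where the master's address is a placeholder:

## in settings.py on a slave machine (sketch; the IP is a placeholder)
REDIS_HOST = '192.0.2.10'      # hypothetical address of the master's Redis server
REDIS_PORT = 6379
REDIS_PARAMS = {
    'password': 'yourpassword'
}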

Benefits

filter duplicates

In settings.py, we add one line that prevents the same request from being scheduled twice:

## in settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

scheduler persist

In settings.py, we add two lines that keep track of the crawler's progress. Requests remain in the Redis queue after a crawl stops, so the next run conveniently resumes where the previous one left off.

## in settings.py
# Enable scheduling storing requests queue in redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Start from the last endpoint
SCHEDULER_PERSIST = True

Deploy with scrapyd

  1. scrapyd provides a daemon for crawling. Like an HTTP server, we start it with the following command:
scrapyd
  2. for deployment, we install the package scrapyd-client and run:
scrapyd-deploy pttCrawler
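
Once deployed, a crawl can also be scheduled through scrapyd's JSON API. A sketch, assuming the project was deployed under the name pttCrawler and uses the ptt spider:

curl http://localhost:6800/schedule.json -d project=pttCrawler -d spider=ptt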

Pipeline

DuplicatesPipeline

To keep duplicates out of the database, we filter the items here.

## in pipelines.py
import logging

from scrapy.exceptions import DropItem
from .items import AuthorItem, PostItem

class DuplicatesPipeline(object):

    def __init__(self):
        # keys seen so far in this crawl
        self.post_set = set()
        self.author_set = set()

    def process_item(self, item, spider):
        if isinstance(item, PostItem):
            logging.debug("filter duplicated post items.")
            if item['canonicalUrl'] in self.post_set:
                raise DropItem("Duplicate post found: %s" % item)
            self.post_set.add(item['canonicalUrl'])

        elif isinstance(item, AuthorItem):
            logging.debug("filter duplicated author items.")
            if item['authorId'] in self.author_set:
                raise DropItem("Duplicate author found: %s" % item)
            self.author_set.add(item['authorId'])

        return item

MongoPipeline

Saves items to MongoDB.
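
The README does not show this code, but a typical Scrapy MongoDB pipeline built on the MONGO_URI and MONGO_DATABASE settings above looks roughly like this (a sketch; the collection naming is an assumption):

## in pipelines.py (sketch; the real implementation may differ)
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # assumption: one collection per item type (Post, Author, Comment)
        self.db[type(item).__name__].insert_one(dict(item))
        return item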

JsonPipeline

Exports items to a JSON file.
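
A minimal sketch of such a pipeline using Scrapy's JsonItemExporter; the output file name is an assumption:

## in pipelines.py (sketch; the real implementation may differ)
from scrapy.exporters import JsonItemExporter

class JsonPipeline(object):

    def open_spider(self, spider):
        # hypothetical output file name
        self.file = open('items.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item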

Security Methodology

To avoid getting banned, we adopt some tricks while we are crawling web pages.

  1. Download delays

We set DOWNLOAD_DELAY in settings.py to throttle the download rate.

## in settings.py
DOWNLOAD_DELAY = 2
  2. Distributed downloader

scrapy-redis already takes care of this by distributing requests across the connected machines.

  3. User Agent Pool

Randomly choose a user agent for each request via a downloader middleware.

## in middlewares.py
import random

from .settings import UserAgentList  # assuming the pool is imported from settings.py

class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # pick a random user agent for every outgoing request
        agent = random.choice(list(UserAgentList))
        request.headers['User-Agent'] = agent
## in settings.py
UserAgentList = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17'
]
DOWNLOADER_MIDDLEWARES = {
    'pttCrawler.middlewares.RandomUserAgentMiddleware': 543,
}

Note: we cannot disable cookies because we have to pass the 'over18' message to some ptt boards.
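
For reference, the age check on boards such as Gossiping is usually passed by sending an over18 cookie with each request. A sketch of the idea (the project may handle it differently, e.g. by submitting the confirmation form):

## sketch: passing PTT's age check with a cookie
import scrapy

def age_checked_request(url):
    # PTT stores the age confirmation in an 'over18' cookie
    return scrapy.Request(url, cookies={'over18': '1'})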

Web UI for scrapyd server

First we need to install scrapydweb:

pip install scrapydweb

Then run it with the following command:

scrapydweb

We can then open localhost:5000 to monitor our crawler.

(screenshot: scrapydweb monitor page)

We can also track individual crawl jobs here.

(screenshots: job tracking pages)

Deployment

Scrapyd

Scrapy comes with a built-in service, called "Scrapyd", which allows you to deploy your projects and control their spiders using a JSON web service.

(screenshot: scrapyd running in the terminal)

Scrapydweb

A full-featured web UI for Scrapyd cluster management, with Scrapy log analysis & visualization supported.

docker-compose

container

  • Spider_app (scrapy-redis)
  • Redis
  • mongoDB

memo

Before deploying to Docker, we need to modify a few settings in settings.py:

# local
# MONGO_URI = 'mongodb://localhost:27017'
# docker
MONGO_URI = 'mongodb://mongodb:27017'

# local
# REDIS_HOST = 'localhost'
# docker
REDIS_HOST = 'redis'

Since Docker resolves the service names defined in the .yml file as hostnames, we replace localhost with those service names here.

(screenshot: docker-compose output in the terminal)

Supplement

In the main spider script ptt.py, for the sake of convenience we restrict crawling to dates in the year 2020.
We also set maximum_missing_count to 500 to bound the search for articles: if there are no more pages to visit, or the missing count reaches this limit, we stop crawling so that fewer resources are wasted.

## in ptt.py
import logging

from scrapy.utils.log import configure_logging
from scrapy_redis.spiders import RedisSpider

class PTTspider(RedisSpider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='logging.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO)
    name = 'ptt'
    redis_key = 'ptt:start_urls'
    board = None
    ## the restrictions described above
    year = 2020
    maximum_missing_count = 500
