PiCrawler
=========

.. image:: https://badge.fury.io/py/picrawler.png
.. image:: https://travis-ci.org/studio-ousia/picrawler.png?branch=master

PiCrawler is a distributed web crawler built on PiCloud.

With PiCrawler, you can implement a distributed web crawler in just a few lines of code:

>>> from picrawler import PiCloudConnection
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send(['http://en.wikipedia.org/wiki/Star_Wars',
...                           'http://en.wikipedia.org/wiki/Darth_Vader'])
...     print 'status code:', response[0].status_code
...     print 'content:', response[0].content[:15]
status code: 200
content: <!DOCTYPE html>
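
conn.send returns one response per submitted URL, so you can inspect the whole batch with a short loop. The sketch below only reuses the API shown above; the assumption that results come back in the same order as the input URLs is ours, not documented behavior.

>>> from picrawler import PiCloudConnection
>>>
>>> with PiCloudConnection() as conn:
...     responses = conn.send(['http://en.wikipedia.org/wiki/Star_Wars',
...                            'http://en.wikipedia.org/wiki/Darth_Vader'])
...     for resp in responses:
...         # each response exposes status_code and content, as above
...         print resp.status_code, resp.content[:15]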

Installation
------------

To install PiCrawler, simply:

$ pip install picrawler

Alternatively, with easy_install:

$ easy_install picrawler

PiCloud Setup
-------------

Before using PiCrawler, you need to configure your PiCloud API key.

>>> import cloud
>>> cloud.setkey(API_KEY, API_SECRETKEY)

You can obtain an API key by signing up on PiCloud.

Using Real-time Cores
---------------------

PiCloud lets you reserve dedicated computational resources by requesting real-time cores.

PiCrawler provides a thin wrapper class for requesting these cores.

NOTE: The s1 core type is the most suitable for crawling tasks, because PiCloud ensures that each s1 core has a unique IP address.

>>> from picrawler import RTCoreRequest
>>>
>>> with RTCoreRequest(core_type='s1', num_cores=10):
...     pass
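
For example, you can hold the reserved cores for the duration of a crawl by opening the connection inside the request block. This is a minimal sketch built from the classes shown above; it assumes that a PiCloudConnection opened inside the block runs its jobs on the reserved s1 cores.

>>> from picrawler import PiCloudConnection, RTCoreRequest
>>>
>>> with RTCoreRequest(core_type='s1', num_cores=10):
...     # the cores stay reserved while this block is active
...     with PiCloudConnection() as conn:
...         responses = conn.send(['http://en.wikipedia.org/wiki/Star_Wars',
...                                'http://en.wikipedia.org/wiki/Darth_Vader'])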

Customizing Requests
--------------------

You can customize the request headers and other behaviors by passing Request instances instead of raw URL strings. Since PiCrawler uses the Python requests library internally, it supports all of the arguments that requests accepts.

>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars',
...               'GET',
...               headers={'User-Agent': 'MyCrawler'},
...               args={'timeout': 5})
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])
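
The same pattern scales to a batch of customized requests: build one Request per URL and pass the whole list to conn.send. This sketch only combines the arguments already shown above.

>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> urls = ['http://en.wikipedia.org/wiki/Star_Wars',
...         'http://en.wikipedia.org/wiki/Darth_Vader']
>>> reqs = [Request(url, 'GET',
...                 headers={'User-Agent': 'MyCrawler'},
...                 args={'timeout': 5})
...         for url in urls]
>>>
>>> with PiCloudConnection() as conn:
...     responses = conn.send(reqs)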

Defining Callbacks
------------------

You can also attach success and error callbacks to a request.

>>> import logging
>>>
>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars', 'GET',
...               success_callback=lambda resp: logging.info(resp.content),
...               error_callback=lambda resp: logging.exception(resp.exception))
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])
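
Plain functions work just as well as lambdas, which keeps larger callbacks readable. The sketch below uses only the callback signature and the response attributes (content, exception) shown above.

>>> import logging
>>>
>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> def on_success(resp):
...     # called for each request that completes successfully
...     logging.info('fetched %d bytes', len(resp.content))
...
>>> def on_error(resp):
...     # called for each request that fails; the error is exposed as resp.exception
...     logging.error('request failed: %s', resp.exception)
...
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars', 'GET',
...               success_callback=on_success,
...               error_callback=on_error)
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])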

Documentation
-------------

Documentation is available at http://picrawler.readthedocs.org/.