PiCrawler

PiCrawler is a distributed web crawler using PiCloud.

With PiCrawler, you can implement a distributed web crawler in a few lines of code.

>>> from picrawler import PiCloudConnection
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send(['http://en.wikipedia.org/wiki/Star_Wars',
...                           'http://en.wikipedia.org/wiki/Darth_Vader'])
...     print 'status code:', response[0].status_code
...     print 'content:', response[0].content[:15]
status code: 200
content: <!DOCTYPE html>
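
The example above inspects only `response[0]`, but a crawl usually needs to walk every response. The following is a minimal sketch of that step, assuming each response exposes the `status_code` and `content` attributes shown above. `summarize` and `FakeResponse` are hypothetical names that are not part of PiCrawler; they exist only so the sketch runs without a PiCloud account.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by conn.send(); it only
# models the two attributes used in the example above.
FakeResponse = namedtuple('FakeResponse', ['status_code', 'content'])


def summarize(responses, preview_len=15):
    """Return (index, status_code, content preview) for each response."""
    return [(i, r.status_code, r.content[:preview_len])
            for i, r in enumerate(responses)]


responses = [FakeResponse(200, '<!DOCTYPE html><html>...'),
             FakeResponse(404, 'Not Found')]
for index, status, preview in summarize(responses):
    print('%d: status code %s, content %r' % (index, status, preview))
```

In real use, the list returned by `conn.send` would take the place of the hand-built `responses` list.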

Installation

To install PiCrawler with pip:

$ pip install picrawler

Alternatively, with easy_install:

$ easy_install picrawler

PiCloud Setup

Before using PiCrawler, you need to configure your PiCloud API key.

>>> import cloud
>>> cloud.setkey(API_KEY, API_SECRETKEY)

You can obtain an API key by signing up on PiCloud.
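
To keep the key and secret out of source control, one option is to read them from the environment before calling `cloud.setkey`. This is a minimal sketch under stated assumptions: the helper `load_picloud_credentials` and the variable names `PICLOUD_API_KEY` and `PICLOUD_API_SECRETKEY` are hypothetical and are not defined by PiCloud or PiCrawler.

```python
import os


def load_picloud_credentials(environ=None):
    """Read the API key and secret key from environment variables.

    PICLOUD_API_KEY and PICLOUD_API_SECRETKEY are hypothetical variable
    names chosen for this sketch; they are not defined by PiCloud.
    """
    environ = os.environ if environ is None else environ
    try:
        return environ['PICLOUD_API_KEY'], environ['PICLOUD_API_SECRETKEY']
    except KeyError as exc:
        raise RuntimeError('missing environment variable: %s' % exc.args[0])


# Usage with the setup call shown above (requires the PiCloud client):
#   import cloud
#   cloud.setkey(*load_picloud_credentials())
```

Both values come back as strings, so convert them first if the client expects another type.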

Using Real-time Cores

PiCloud lets you reserve dedicated computational resources by requesting real-time cores.

PiCrawler provides a thin wrapper class for requesting the cores.

NOTE: The s1 core type is the most suitable for crawling tasks because PiCloud ensures that each s1 core has a unique IP address.

>>> from picrawler import RTCoreRequest
>>>
>>> with RTCoreRequest(core_type='s1', num_cores=10):
...     pass
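
The `with` form scopes the reservation to the block, so the crawl code goes where `pass` appears. The sketch below shows one way a wrapper of this kind could be structured. It is not `RTCoreRequest`'s actual implementation: `reserved_cores` and the injected `request` and `release` callables are hypothetical stand-ins for whatever the cloud client provides.

```python
from contextlib import contextmanager


@contextmanager
def reserved_cores(request, release, core_type, num_cores):
    """Reserve cores on entry and always release them on exit.

    `request` and `release` are hypothetical callables standing in for
    whatever the cloud client provides; they are injected so the sketch
    runs without a PiCloud account.
    """
    handle = request(core_type, num_cores)
    try:
        yield handle
    finally:
        # Runs even if the body raises, so the reservation is not leaked.
        release(handle)


# Demonstration with in-memory stand-ins that record each call.
events = []
with reserved_cores(lambda t, n: events.append(('request', t, n)) or 'req-1',
                    lambda h: events.append(('release', h)),
                    core_type='s1', num_cores=10):
    events.append(('crawl',))
```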

Customizing Requests

You can customize the request headers and other internal behaviors by passing Request instances instead of raw URL strings. Because PiCrawler uses the Python requests library internally, it accepts any argument that requests supports.

>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars',
...               'GET',
...               headers={'User-Agent': 'MyCrawler'},
...               args={'timeout': 5})
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])

Defining Callbacks

You can also attach success and error callbacks to a request.

>>> import logging
>>>
>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars', 'GET',
...               success_callback=lambda resp: logging.info(resp.content),
...               error_callback=lambda resp: logging.error(resp.exception))
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])
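
For anything longer than a one-line lambda, named functions or bound methods are easier to test. The sketch below illustrates the success/error dispatch pattern in isolation. It is not PiCrawler's internal code: `StubResponse` and `dispatch` are hypothetical, and the sketch assumes that a response carries `exception` set to `None` when the request succeeded.

```python
from collections import namedtuple

# Hypothetical stand-in for a PiCrawler response; `exception` is assumed
# to be None on success, mirroring the attribute used by the error
# callback above.
StubResponse = namedtuple('StubResponse', ['content', 'exception'])


def dispatch(response, success_callback=None, error_callback=None):
    """Call the callback that matches the outcome of one response."""
    if response.exception is None:
        if success_callback is not None:
            success_callback(response)
    elif error_callback is not None:
        error_callback(response)


succeeded, failed = [], []
for resp in [StubResponse('<!DOCTYPE html>', None),
             StubResponse(None, IOError('timed out'))]:
    dispatch(resp, success_callback=succeeded.append,
             error_callback=failed.append)
```

Here the callbacks simply collect responses into two lists; the same callables could be passed as `success_callback` and `error_callback` when building a `Request`.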

Documentation

Documentation is available at http://picrawler.readthedocs.org/.
