scrapelib

scrapelib is a library for making requests to websites, particularly those that may be less-than-reliable.

scrapelib originated as part of the `Open States <http://openstates.org/>`_ project to scrape the websites of all 50 state legislatures, and as a result was designed with features desirable when dealing with sites that have intermittent errors or require rate-limiting.

As of version 0.7 scrapelib has been retooled to take advantage of the superb requests library.

Advantages of using scrapelib over alternatives like httplib2 or simply using requests as-is:

  • All of the power of the superb requests library
  • HTTP(S) and FTP requests via an identical API
  • support for simple caching with pluggable cache backends
  • request throttling
  • configurable retries for non-permanent site failures
  • optional robots.txt compliance
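To make the throttling and retry features above more concrete, here is a minimal standard-library sketch of the two ideas. It is not scrapelib's implementation: the names Throttle and fetch_with_retries, their parameters, the choice of OSError as the "temporary failure" signal, and the exponential backoff are all assumptions made for illustration, and the sketch targets Python 3.

```python
import time


class Throttle:
    """Hypothetical helper (not part of scrapelib): spaces out calls so that
    no more than `requests_per_minute` are made per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute  # minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was allowed

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval - now)
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


def fetch_with_retries(fetch, retry_attempts=3, retry_wait_seconds=1.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(), retrying on OSError.

    Makes up to `retry_attempts` retries after the first failure and doubles
    the wait between attempts (an assumed backoff policy)."""
    for attempt in range(retry_attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retry_attempts:
                raise  # out of retries: let the caller see the error
            sleep(retry_wait_seconds * 2 ** attempt)
```

A caller would invoke throttle.wait() before each request and wrap the request itself in fetch_with_retries. The clock and sleep parameters exist only so the behaviour can be checked without real waiting.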

scrapelib is a project of Sunlight Labs (c) 2012. All code is released under a BSD-style license; see LICENSE for details.

Written by James Turk <jturk@sunlightfoundation.com>

Contributors:
  • Michael Stephens - initial urllib2/httplib2 version
  • Joe Germuska - fix for IPython embedding
  • Alex Chiang - fix to test suite

Requirements

  • Python >= 2.6 (experimental support for Python 3.2)
  • requests

Installation

scrapelib is available on PyPI and can be installed via pip install scrapelib

PyPI package: http://pypi.python.org/pypi/scrapelib

Source: http://github.com/sunlightlabs/scrapelib

Documentation: http://scrapelib.readthedocs.org/en/latest/

Example Usage

import scrapelib
s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True,
                      follow_robots=True)

# Grab Google front page
s.urlopen('http://google.com')

# Will raise RobotExclusionError
s.urlopen('http://google.com/search')

# Will be throttled to 10 HTTP requests per minute
while True:
    s.urlopen('http://example.com')
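The robots.txt check behind the RobotExclusionError case above can be illustrated with Python 3's standard urllib.robotparser module. This is a sketch of the general idea only, not a description of how scrapelib performs the check. The is_allowed helper, the sample robots.txt contents, and the "example-bot" user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "http://example.com/"))        # True
print(is_allowed(ROBOTS_TXT, "example-bot", "http://example.com/search"))  # False
```

A scraper that honours robots.txt would run a check like this before each request and raise an error instead of fetching a disallowed URL.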