Web scraper framework
alcazar
samples
specimens
tests
.gitignore
.pylintrc
.travis.yml
LICENSE
README.md
run-pylint.sh
run-samples.sh
run-tests-all-versions.sh
run-tests.sh
setup.py

README.md

Alcazar is a Python library that simplifies the task of writing web scrapers.

Some of its core features are:

  • Succinct syntax for locating relevant data within an HTML page, a JSON document, or a string of text
  • HTTP caching to disk for exact replay of scrapes without resubmitting HTTP requests
  • Throttling of requests to the same host
  • Automatic retries when an HTTP request fails, or when a page fails to parse as expected
  • Crawler facilities for maintaining a queue of URLs to visit
  • Fail-fast: by default, we'd rather crash than save incorrect or incomplete data
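To make the throttling feature in the list above more concrete, here is a minimal sketch of per-host throttling using only the standard library. It is not Alcazar's implementation: the `HostThrottle` class, its `min_interval` parameter and the injectable `clock` and `sleep` arguments are all hypothetical names chosen for illustration.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep until a request to this URL's host is allowed; return the delay."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Only the time still missing from the interval needs to be slept.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` before each request would space out requests to the same host while leaving requests to other hosts unaffected.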

Alcazar brings together the following libraries:

Getting Started

Alcazar is available on PyPI, so it can be installed using pip:

pip install alcazar

The simplest way to use the library is to instantiate a Scraper and call its fetch method:

>>> import alcazar
>>> scraper = alcazar.Scraper()
>>> page = scraper.fetch('https://en.wikipedia.org/wiki/Gorgie')
>>> print(page.one('div[@id="toc"]/preceding-sibling::p[./b]').text.normalized)
Gorgie (/ˈɡɔːrɡiː/ GOR-gee) is a densely populated area of Edinburgh, Scotland. It is located in the west of the city and borders Murrayfield, Ardmillan and Dalry.

In this snippet:

  • we've fetched the HTML for the page
    • if a network error or HTTP error occurs, we'll retry the fetch a few times, sleeping for increasingly long delays between attempts
  • we've parsed the HTML into a tree
    • using lxml's excellent handling and recovery from "broken" HTML, as seen in the wild
  • we've located the element we're interested in
    • here using an XPath expression, but we could've used a CSS selector too
    • we've checked that there was one and only one element that matched our query
    • else an exception would've been thrown, ensuring we capture only exactly what we wanted
  • we've extracted its text, removed all tags from it, and normalized its whitespace
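The steps above can be approximated with the standard library alone, which may help show what each one involves. This is a rough sketch, not how Alcazar works internally: `fetch_with_retries`, `one`, `normalized_text` and `flaky_fetch` are hypothetical names, the markup is made-up sample data rather than the real Wikipedia page, and `xml.etree.ElementTree` stands in for lxml, so it only accepts well-formed markup and a limited subset of XPath.

```python
import time
import xml.etree.ElementTree as ET


def fetch_with_retries(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with increasing delays."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: fail rather than return nothing
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def one(matches):
    """Return the single match, or raise if there are zero or several."""
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match, found {len(matches)}")
    return matches[0]


def normalized_text(element):
    """Join all text under element, dropping tags and collapsing whitespace."""
    return " ".join("".join(element.itertext()).split())


# Hypothetical stand-in for a real HTTP call: fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network error")
    return ("<html><body><p><b>Gorgie</b>   is an area of\n "
            "<a href='#'>Edinburgh</a>.</p><div id='toc'/></body></html>")


delays = []  # record the delays instead of actually sleeping
markup = fetch_with_retries(flaky_fetch, "https://example.com/", sleep=delays.append)
tree = ET.fromstring(markup)
paragraph = one(tree.findall(".//p[b]"))  # paragraphs that have a <b> child
print(normalized_text(paragraph))  # prints: Gorgie is an area of Edinburgh.
```

Raising from `one` when the match count is not exactly one mirrors the fail-fast behaviour described above: a page whose layout has changed causes an error instead of silently yielding the wrong element.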

See the samples directory for a taste of how Alcazar works.
