This repository has been archived by the owner. It is now read-only.
Cabu is a simple microservice framework to crawl websites from the cloud.
cabu
docs
.coveragerc
.dockerignore
.gitignore
.travis.yml
AUTHORS
Dockerfile
LICENSE
README.rst
dev_server.py
docker-compose.yml
requirements-dev.txt
requirements.txt
setup.cfg
setup.py
tox.ini

README.rst

Cabu


Cabu is a simple microservice framework for crawling websites remotely. It is built on Flask and Selenium and includes a virtual display wrapper and a few helper methods.

Full documentation here

Usage

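from flask import jsonify

# `app` is assumed to be an already-created Cabu application instance.
@app.route('/gizmodo_last_articles_links')
def gizmodo_last_articles():
    # Load the page in the Selenium-driven browser.
    app.webdriver.get('http://www.gizmodo.com')
    # Collect the href of every headline link.
    headlines = app.webdriver.find_elements_by_css_selector('h1.headline>a')
    articles_links = [link.get_attribute('href') for link in headlines]

    return jsonify({'articles': articles_links})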
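Once such a service is running, the route behaves like any other JSON endpoint. The sketch below shows one way a client could read the links back using only the Python standard library. The helper name `fetch_article_links` and the base URL in the usage note are assumptions for illustration and are not part of Cabu.

```python
import json
import urllib.request


def fetch_article_links(base_url, opener=urllib.request.urlopen):
    """Call the example route and return the list of article links.

    `opener` is injectable so the function can be exercised without a
    running server; by default it performs a real HTTP GET.
    """
    url = base_url.rstrip('/') + '/gizmodo_last_articles_links'
    with opener(url) as response:
        payload = json.load(response)
    # The example route wraps the links in an 'articles' key.
    return payload.get('articles', [])
```

For instance, `fetch_article_links('http://localhost:5000')` would query a service listening on that (hypothetical) address.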

Installing

$ pip install cabu

Features

  • Selenium configuration out of the box
  • Flask wrapping
  • Crawling methods included
  • AWS S3 Export
  • FTP / FTPS
  • Cookies persistence
  • Link extractor
  • Proxy configuration
  • Headless mode can be turned off for local debugging
  • Docker pre-configured distributed environment
  • Database handler
  • Compatible with most Flask extensions (Flask-Admin, Flask-Mail, Flask-OAuth, ...)
  • Twelve-Factor App compliance
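
Cabu's own link extractor API is not shown in this README. To give a rough idea of what link extraction involves, here is a minimal standard-library sketch that collects absolute URLs from anchor tags. The names `LinkCollector` and `extract_links` are hypothetical and do not come from Cabu.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Inside a route, the HTML could plausibly come from Selenium's `page_source` property on `app.webdriver`, though Cabu's built-in extractor may work differently.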

(Likely to come soon)

  • CouchDB support
  • Couchbase support
  • Mobile drivers
  • SFTP
  • HtmlUnit web driver
  • Remote webdriver wrapper
  • Parallelization
  • Neural Network plugins

Testing

All tests run against real Docker services rather than mocks. Mock-based alternatives are planned.

$ pip install -r requirements-dev.txt
$ py.test cabu/tests
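
As a rough idea of what a mock-based test could look like, the sketch below exercises the extraction logic from the usage example against a stand-in driver built with `unittest.mock`. It is not taken from Cabu's test suite; `collect_links` is a hypothetical helper that receives the driver as an argument so that it can be replaced in tests.

```python
from unittest.mock import MagicMock


def collect_links(webdriver, url, selector):
    """Same extraction logic as the usage example, with the driver passed in."""
    webdriver.get(url)
    elements = webdriver.find_elements_by_css_selector(selector)
    return [element.get_attribute('href') for element in elements]


def test_collect_links_with_mock_driver():
    # Stand-in elements that answer get_attribute('href').
    first, second = MagicMock(), MagicMock()
    first.get_attribute.return_value = 'http://example.com/a'
    second.get_attribute.return_value = 'http://example.com/b'

    # Stand-in driver: no browser or Docker service is started.
    driver = MagicMock()
    driver.find_elements_by_css_selector.return_value = [first, second]

    links = collect_links(driver, 'http://example.com', 'h1.headline>a')

    driver.get.assert_called_once_with('http://example.com')
    assert links == ['http://example.com/a', 'http://example.com/b']
```

A test like this checks only the Python logic around the driver; it does not confirm that the CSS selector still matches the live site, which is what the Docker-backed tests are for.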

Contributing

Please see the Contribute page.

Copyright

Cabu is an open source project by Théotime Lévèque.