Analyze scraped data
andersonberg and manycoding New demo spider (#149)
* updating Rules notebook

* adding new items files

* updating examples

* cleaning jupyter notebook output

* fix Rules notebook

* adding schema json for new spider

* update Compare notebook

* updating "Basics" notebook

* changing the jobs as the old ones were deleted by dash

* adding some description

* Rename schema, shorten find_by output

* Update urls to master
Latest commit 9612d71 Aug 23, 2019
Type Name Latest commit message Commit time
docs/source New demo spider (#149) Aug 23, 2019
src/arche Anomalies (#138) Jul 23, 2019
tests Anomalies (#138) Jul 23, 2019
.bumpversion.cfg Bump version: 0.3.5 → 0.3.6 Jul 12, 2019
.readthedocs.yml Docs (#79) May 2, 2019
.travis.yml Add docs tests to travis Jun 18, 2019
CHANGES.md Anomalies (#138) Jul 23, 2019
LICENSE Look what I have done overnight Mar 18, 2019
Pipfile Anomalies (#138) Jul 23, 2019
README.md simplify use case Jul 3, 2019
setup.cfg Set min plotly to 4 Jul 25, 2019
setup.py Bump version: 0.3.1 → 0.3.2, use setup.cfg Apr 18, 2019
tox.ini Schema (#110) Jun 17, 2019

README.md

Arche


pip install arche

Arche (pronounced Arkey) helps verify scraped data using a set of defined rules, for example:

  • Validation with JSON schema
  • Coverage (items, fields, categorical data, including booleans and enums)
  • Duplicates
  • Garbage symbols
  • Comparison of two jobs
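
To make a few of these rules concrete, here is a minimal sketch in plain Python of what coverage, duplicate, and garbage-symbol checks compute. It is an illustration only and does not use Arche's API: the sample `items`, the helper names (`field_coverage`, `duplicated_values`, `garbage_fields`), and the `GARBAGE` pattern are all hypothetical, and the definition of "garbage" used here (leftover HTML tags or entities) is an assumption.

```python
import re
from collections import Counter

# Hypothetical scraped items; in practice these would come from a job or a file.
items = [
    {"name": "Widget", "price": 9.99, "url": "https://example.com/widget"},
    {"name": "Gadget", "price": None, "url": "https://example.com/gadget"},
    {"name": "Widget", "price": 9.99, "url": "https://example.com/widget"},
    {"name": "<b>Gizmo</b>", "price": 4.5, "url": "https://example.com/gizmo"},
]


def field_coverage(items):
    """Return the share of items in which each field has a non-empty value."""
    fields = {key for item in items for key in item}
    return {
        field: sum(item.get(field) not in (None, "") for item in items) / len(items)
        for field in sorted(fields)
    }


def duplicated_values(items, field):
    """Return the values of `field` that occur in more than one item."""
    counts = Counter(item.get(field) for item in items)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)


# Assumption: "garbage" here means leftover HTML tags or HTML entities.
GARBAGE = re.compile(r"<[^>]+>|&\w+;")


def garbage_fields(items):
    """Return (item index, field) pairs whose string value matches GARBAGE."""
    return [
        (i, field)
        for i, item in enumerate(items)
        for field, value in item.items()
        if isinstance(value, str) and GARBAGE.search(value)
    ]


print(field_coverage(items))            # {'name': 1.0, 'price': 0.75, 'url': 1.0}
print(duplicated_values(items, "url"))  # ['https://example.com/widget']
print(garbage_fields(items))            # [(3, 'name')]
```

Each helper returns plain data rather than raising an error, so the results can be inspected in a notebook cell or turned into pass/fail rules with whatever thresholds suit the project.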

We use it at Scrapinghub, among other tools, to ensure the quality of scraped data.

Installation

Arche requires a Jupyter environment and supports both the JupyterLab and Notebook UIs.

For JupyterLab, you will also need to install the plotly extensions.

Then just run pip install arche

Why

To check the quality of scraped data continuously. For example, after scraping a website, a typical approach is to validate the data with Arche. You can also create a schema and then set up Spidermon.

Developer Setup

pipenv install --dev
pipenv shell
tox

Contribution

Any contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.
