# Basics

## Bare

In [1]:
%cd -q ../../../src/

To access Scrapy Cloud data, set your [Scrapinghub API key](https://app.scrapinghub.com/account/apikey) in the `SH_APIKEY` environment variable.
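A quick way to confirm the variable is visible to the notebook process is to check it from Python before creating any `Arche` object. This is a minimal sketch using only the standard library; `api_key_is_set` is a hypothetical helper for illustration, not part of Arche.

```python
import os


def api_key_is_set(var_name: str = "SH_APIKEY") -> bool:
    """Return True if the variable exists and is not blank.

    Hypothetical helper for illustration; not part of Arche.
    """
    return bool(os.environ.get(var_name, "").strip())


if not api_key_is_set():
    # The key itself should come from your shell or a secrets store,
    # not from a value hard-coded in the notebook.
    print("SH_APIKEY is not set; export it before starting the notebook")
```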

In [2]:
import arche
from arche import *

The only required parameter is `source`, which accepts various inputs; see the signature (`?Arche`) or the examples below.

In [3]:
# Reading from json
import json
with open("../docs/source/nbs/data/items_books_1.json") as f:
    raw_items = json.load(f)

In [4]:
a = Arche(raw_items)

In [5]:
a = Arche("381798/1/1")

In [6]:
a.report_all()


Job Outcome:
	Finished

Job Errors:
	No errors

Responses Per Item Ratio:
	Number of responses / Number of scraped items - 1.05

Garbage Symbols:
	PASSED

Fields Coverage:
	PASSED




Fields Coverage (1 message(s)):


We just ran a minimal set of rules. Validation can be improved by adding a JSON schema, so let's infer one from the data we already have.

## JSON schema

In [7]:
basic_json_schema("381798/1/1")

{'$schema': 'http://json-schema.org/draft-07/schema#',
 'additionalProperties': False,
 'definitions': {'float': {'pattern': '^-?[0-9]+\\.[0-9]{2}$'},
                 'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
 'properties': {'category': {'type': 'string'},
                'description': {'type': 'string'},
                'price': {'type': 'string'},
                'title': {'type': 'string'}},
 'required': ['category', 'description', 'price', 'title'],
 'type': 'object'}

By itself, a basic schema is not very helpful, since it only checks that the fields are present and are strings, but you can update it.

In [8]:
a.source_items.df.head()

| | category | description | price | title |
|---|---|---|---|---|
| https://app.scrapinghub.com/p/381798/1/1/item/0 | Travel | “Wherever you go, whatever you do, just . . . ... | £45.17 | It's Only the Himalayas |
| https://app.scrapinghub.com/p/381798/1/1/item/1 | Politics | Libertarianism isn't about winning elections; ... | £51.33 | Libertarianism for Beginners |
| https://app.scrapinghub.com/p/381798/1/1/item/2 | Science Fiction | Andrew Barger, award-winning author and engine... | £37.59 | Mesaerion: The Best Science Fiction Stories 18... |
| https://app.scrapinghub.com/p/381798/1/1/item/3 | Poetry | Part fact, part fiction, Tyehimba Jess's much ... | £23.88 | Olio |
| https://app.scrapinghub.com/p/381798/1/1/item/4 | Music | This is the never-before-told story of the mus... | £57.25 | Our Band Could Be Your Life: Scenes from the A... |


It looks like `price` can be checked with a regex. Let's also add the `category` tag, which shows the distribution of categorical data, and the `unique` tag on `title` to ensure there are no duplicates.
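Before putting a pattern into the schema, it can be tried against a few values with the standard `re` module. This is a sketch under assumptions: the pattern escapes the dot so that it matches only a literal `.`, and the last three sample values are made up to show rejections. JSON Schema's `pattern` keyword is not implicitly anchored, so the example uses `re.search` with explicit `^` and `$`.

```python
import re

# Assumed pattern: pound sign, two digits, a literal dot, two digits.
PRICE_PATTERN = re.compile(r"^£\d{2}\.\d{2}$")

# The first two values appear in the data above; the rest are hypothetical.
samples = ["£45.17", "£51.33", "£5.00", "45.17", "£45,17"]

results = {value: bool(PRICE_PATTERN.search(value)) for value in samples}
for value, matched in results.items():
    print(f"{value!r}: {'ok' if matched else 'rejected'}")
```

Only the first two values match. A single-digit price such as `£5.00` is rejected, so the pattern would need `\d+` instead of `\d{2}` if such prices are valid for your data.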

In [9]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
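Instead of retyping the whole schema, the inferred dictionary can also be extended programmatically. The sketch below is one possible approach using plain dictionary operations: `extend_schema` is a hypothetical helper, not part of Arche, and the `inferred` literal is an abbreviated copy of the output shown earlier (the `definitions` block is left out, and `_key` and `_type` are not added).

```python
import copy

# Abbreviated copy of the inferred schema printed earlier (definitions omitted).
inferred = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "description": {"type": "string"},
        "price": {"type": "string"},
        "title": {"type": "string"},
    },
    "required": ["category", "description", "price", "title"],
}


def extend_schema(schema, updates):
    """Return a deep copy of `schema` with extra keywords merged into each field.

    Hypothetical helper for illustration; not part of Arche.
    """
    result = copy.deepcopy(schema)
    for field, keywords in updates.items():
        # Create the property if it is missing, then merge the new keywords.
        result["properties"].setdefault(field, {}).update(keywords)
    return result


updated = extend_schema(
    inferred,
    {
        "category": {"tag": ["category"]},
        "price": {"pattern": r"^£\d{2}\.\d{2}$"},
        "title": {"tag": ["unique"]},
    },
)
print(updated["properties"]["price"])
```

The result can then be assigned with `a.schema = updated`. Working on a deep copy keeps the inferred schema intact, which makes it easy to compare the two versions later.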

In [10]:
a.validate_with_json_schema()




JSON Schema Validation:
	1000 items were checked, 3 error(s)


If your job is really big, you can use a [backend](https://github.com/horejsek/python-fastjsonschema) that is almost 100x faster:

In [11]:
a.glance()




JSON Schema Validation:
	1000 items were checked, 1 error(s)


We already got something! Let's run the whole report again to see how the `category` tag works.

In [12]:
a.report_all()


Job Outcome:
	Finished

Job Errors:
	No errors

Responses Per Item Ratio:
	Number of responses / Number of scraped items - 1.05

Garbage Symbols:
	PASSED

Fields Coverage:
	PASSED

JSON Schema Validation:
	1000 items were checked, 3 error(s)

Tags:
	Used - category, unique
	Not used - name_field, product_price_field, product_price_was_field, product_url_field

Compare Price Was And Now:
	product_price_field or product_price_was_field tags were not found in schema

Uniqueness:
	'title' contains 1 duplicated value(s)

Duplicated Items:
	'name_field' and 'product_url_field' tags were not found in schema

Coverage For Scraped Categories:
	50 categories in 'category'




Fields Coverage (1 message(s)):



JSON Schema Validation (3 message(s)):



Uniqueness (1 message(s)):



Coverage For Scraped Categories (1 message(s)):
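To make the `Uniqueness` result more concrete, here is a rough sketch of what a duplicate check on a single field amounts to. It uses only the standard library and is not Arche's implementation; `duplicated_values` is a hypothetical helper, and the sample items are made up for illustration.

```python
from collections import Counter


def duplicated_values(items, field):
    """Return {value: count} for values of `field` that occur more than once."""
    counts = Counter(item[field] for item in items if field in item)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up sample items; real data would come from the job.
items = [
    {"title": "Olio"},
    {"title": "It's Only the Himalayas"},
    {"title": "Olio"},
]

print(duplicated_values(items, "title"))  # {'Olio': 2}
```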


## Accessing results data

In [16]:
a.report.results.keys()

dict_keys(['Job Outcome', 'Job Errors', 'Responses Per Item Ratio', 'Garbage Symbols', 'Fields Coverage', 'JSON Schema Validation', 'Tags', 'Compare Price Was And Now', 'Uniqueness', 'Duplicated Items', 'Coverage For Scraped Categories'])

In [17]:
a.report.results.get("Coverage For Scraped Categories").stats

[Cultural                1
 Parenting               1
 Suspense                1
 Adult Fiction           1
 Academic                1
 Crime                   1
 Erotica                 1
 Novels                  1
 Paranormal              1
 Short Stories           1
 Historical              2
 Contemporary            3
 Christian               3
 Politics                3
 Health                  4
 Biography               5
 Sports and Games        5
 Self Help               5
 New Adult               6
 Spirituality            6
 Christian Fiction       6
 Psychology              7
 Religion                7
 Art                     8
 Autobiography           9
 Humor                  10
 Philosophy             11
 Travel                 11
 Thriller               11
 Business               12
 Music                  13
 Science                14
 Science Fiction        16
 Horror                 17
 Womens Fiction         17
 History                18
 Classics               19
 