# Basics

To access Scrapy Cloud data, set your [Scrapinghub API key](https://app.scrapinghub.com/account/apikey) in the `SH_APIKEY` environment variable.
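A quick way to confirm the variable is visible to the notebook kernel is to check it before creating an `Arche` object. This is a minimal sketch; `api_key_is_set` is a hypothetical helper and not part of Arche.

```python
import os


def api_key_is_set(env=os.environ):
    """Return True when SH_APIKEY is present and not blank (hypothetical helper)."""
    return bool(env.get("SH_APIKEY", "").strip())


if not api_key_is_set():
    print("SH_APIKEY is not set, so Scrapy Cloud data cannot be accessed.")
```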

In [None]:
import arche
import pandas as pd
from arche import *

The only required parameter is `source`, which accepts various inputs; see the signature (`?Arche`) or the examples below.

## Data Sources
`Arche`, together with the `pandas` API, can read data from various places and formats.

### `*.json` as iterable

In [None]:
import json
with open("data/items_books_1.json") as f:
    raw_items = json.load(f)

In [None]:
a = Arche(source=raw_items)

### `*.jl.gz` and pandas API

In [None]:
url = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl.gz"

In [None]:
df = pd.read_json(url, lines=True)

`jsonlines` and `json` are not memory efficient if the data contains nested objects. If no other format is available, you can read compressed jsonlines in chunks.

In [None]:
chunks = pd.read_json(url, lines=True, chunksize=500)

In [None]:
dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)

In [None]:
df.shape
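To make the idea of chunked reading concrete, here is a standard-library sketch that reads a gzip-compressed jsonlines payload a few records at a time. It does not use `pandas` and is not how `pd.read_json` is implemented; `iter_chunks` and the sample items are made up for illustration.

```python
import gzip
import io
import json
from itertools import islice


def iter_chunks(lines, chunksize):
    """Yield lists of parsed JSON objects, at most `chunksize` per list."""
    # Skip blank lines so a trailing newline does not produce an error.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        chunk = list(islice(records, chunksize))
        if not chunk:
            return
        yield chunk


# Build a small in-memory .jl.gz payload from hypothetical items.
sample_items = [{"title": f"Book {i}", "price": "£10.00"} for i in range(5)]
payload = gzip.compress(
    "\n".join(json.dumps(item) for item in sample_items).encode("utf-8")
)

# Only one chunk of parsed records is held in memory at a time.
with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    chunk_sizes = [len(chunk) for chunk in iter_chunks(f, chunksize=2)]

print(chunk_sizes)
```

Each chunk can be processed and discarded before the next one is parsed, which is what keeps peak memory low compared with loading the whole file at once.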

Uncompressed jsonlines files, however, need to be downloaded first.

In [None]:
raw_json = arche.tools.s3.get_contents("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl")

In [None]:
chunks = pd.read_json(raw_json, lines=True, chunksize=500)

In [None]:
dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)

In [None]:
df.shape

In [None]:
a = Arche(source=df)

### Scrapy Cloud keys

In [None]:
a = Arche(source="381798/1/1")
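A Scrapy Cloud job key such as the one above has the form `project_id/spider_id/job_id`. The sketch below splits a key into its parts; `JobKey` and `parse_job_key` are hypothetical helpers for illustration and are not part of Arche.

```python
from typing import NamedTuple


class JobKey(NamedTuple):
    """Numeric parts of a Scrapy Cloud job key (hypothetical helper type)."""

    project_id: int
    spider_id: int
    job_id: int


def parse_job_key(key: str) -> JobKey:
    """Split 'project/spider/job' into integers, rejecting other shapes."""
    parts = key.split("/")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a job key: {key!r}")
    return JobKey(*(int(part) for part in parts))


job = parse_job_key("381798/1/1")
print(job.project_id, job.spider_id, job.job_id)
```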

In [None]:
a.report_all()

We just ran a minimal number of rules. Validation can be improved by adding a JSON schema, so let's infer one from the data we already have.

## JSON Schema

In [None]:
basic_json_schema("381798/1/1")

By itself a basic schema is not very helpful, but you can update it.

In [None]:
a.source_items.df.head()

Looks like `price` can be checked with a regex. Let's also add the `category` tag, which helps to see the distribution in categorical data, and the `unique` tag on `title` to ensure there are no duplicates.

In [None]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
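Before relying on a `pattern` in the schema, it can help to try the regex on a few strings. The sketch below uses a price pattern of the form pound sign, two digits, a literal dot, two digits; the sample values are made up and are not taken from the job data.

```python
import re

# JSON Schema patterns are unanchored searches, so the anchors are explicit.
PRICE_PATTERN = re.compile(r"^£\d{2}\.\d{2}$")

# Hypothetical sample values: one well-formed price and three malformed ones.
samples = ["£51.77", "£5.10", "51.77", "£51,77"]

results = {value: bool(PRICE_PATTERN.search(value)) for value in samples}
for value, matched in results.items():
    print(f"{value!r}: {'ok' if matched else 'no match'}")
```

Escaping the dot matters here: an unescaped `.` would also accept a comma or any other character as the separator.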

In [None]:
a.validate_with_json_schema()
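To illustrate what `required` and `"additionalProperties": False` mean for a single item, here is a hand-rolled sketch. It is not how Arche or any JSON Schema validator is implemented, and `check_item`, its error messages, and the sample item are hypothetical.

```python
def check_item(item, properties, required):
    """Return a list of problems for one item (illustrative only)."""
    errors = []
    # Every name listed in `required` must be present in the item.
    for name in required:
        if name not in item:
            errors.append(f"missing required property: {name}")
    # With additionalProperties set to False, unknown names are rejected.
    for name in item:
        if name not in properties:
            errors.append(f"unexpected property: {name}")
    return errors


properties = {"_key", "_type", "category", "description", "price", "title"}
required = ["_key", "_type", "category", "description", "price", "title"]

# Hypothetical item: `description` is missing and `rating` is not declared.
item = {
    "_key": "381798/1/1/0",
    "_type": "dict",
    "category": "Travel",
    "price": "£45.17",
    "title": "Some Book",
    "rating": 4,
}

errors = check_item(item, properties, required)
print(errors)
```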

Or, if your job is really big, you can use an almost 100x faster [backend](https://github.com/horejsek/python-fastjsonschema).

In [None]:
a.glance()

We already got something! Let's run the whole thing again to see how the `category` tag works.

In [None]:
a.report_all()
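As a rough picture of what the `category` and `unique` tags are about, the sketch below counts values per category and looks for repeated titles with the standard library. It is not Arche's implementation, and the items are made up.

```python
from collections import Counter

# Hypothetical items; the real ones come from the job data.
items = [
    {"title": "Book A", "category": "Travel"},
    {"title": "Book B", "category": "Travel"},
    {"title": "Book C", "category": "Mystery"},
    {"title": "Book A", "category": "Mystery"},
]

# Distribution of a categorical field: how many items fall in each value.
category_counts = Counter(item["category"] for item in items)

# Uniqueness of a field: any value seen more than once is a duplicate.
title_counts = Counter(item["title"] for item in items)
duplicate_titles = sorted(t for t, n in title_counts.items() if n > 1)

print(dict(category_counts))
print(duplicate_titles)
```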

## Accessing Results Data

In [None]:
a.report.results.keys()

In [None]:
a.report.results.get("Coverage For Scraped Categories").stats