# Schema

In [1]:
%cd -q ../../../src

In [2]:
import arche
from arche import *

## Creating schema

A schema can be inferred from a job item. `basic_json_schema()` returns its Python dict representation.

In [3]:
schema = basic_json_schema("381798/1/1"); schema

{'$schema': 'http://json-schema.org/draft-07/schema#',
 'additionalProperties': False,
 'definitions': {'float': {'pattern': '^-?[0-9]+\\.[0-9]{2}$'},
                 'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
 'properties': {'_key': {'type': 'string'},
                '_type': {'type': 'string'},
                'category': {'type': 'string'},
                'description': {'type': 'string'},
                'price': {'type': 'string'},
                'title': {'type': 'string'}},
 'required': ['_key', '_type', 'category', 'description', 'price', 'title'],
 'type': 'object'}

There is also a `json()` method. Notice how the boolean values and regex are rendered compared with the Python dict above.

In [4]:
schema.json()

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": false,
    "type": "object",
    "properties": {
        "_key": {
            "type": "string"
        },
        "_type": {
            "type": "string"
        },
        "category": {
            "type": "string"
        },
        "description": {
            "type": "string"
        },
        "price": {
            "type": "string"
        },
        "title": {
            "type": "string"
        }
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
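The difference between the two renderings can be reproduced with the standard library alone. The snippet below is a minimal sketch that uses Python's `json` module on a hand-written fragment of the schema; it is not how `json()` is implemented in the library, and the `fragment` name is only illustrative.

```python
import json

# A small, hand-copied fragment of the schema above (illustrative only).
fragment = {
    "additionalProperties": False,
    "definitions": {"float": {"pattern": "^-?[0-9]+\\.[0-9]{2}$"}},
}

# Python repr: booleans are capitalized and strings use single quotes.
print(repr(fragment))

# JSON text: booleans are lowercase and strings use double quotes.
as_json = json.dumps(fragment, indent=4)
print(as_json)

# Parsing the JSON text gives back an equal dict.
restored = json.loads(as_json)
```

Both forms describe the same schema; only the serialization differs.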


## Setting schema

In [5]:
a = Arche("381798/1/1")

You can set a JSON schema in two ways: pass a `schema` argument when creating the `Arche` instance, or set the `schema` property afterwards.

### From a dict

In [6]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
a.schema

{'$schema': 'http://json-schema.org/draft-07/schema#',
 'definitions': {'float': {'pattern': '^-?[0-9]+\\.[0-9]{2}$'},
  'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
 'additionalProperties': False,
 'type': 'object',
 'properties': {'category': {'type': 'string', 'tag': ['category']},
  'price': {'type': 'string', 'pattern': '^£\\d{2}.\\d{2}$'},
  '_type': {'type': 'string'},
  'description': {'type': 'string'},
  'title': {'type': 'string', 'tag': ['unique']},
  '_key': {'type': 'string'}},
 'required': ['_key', '_type', 'category', 'description', 'price', 'title']}
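To see what the `price` pattern accepts, it can be tried against a few strings. This is a rough sketch using Python's `re` module, which only approximates the regex dialect a JSON Schema validator may use; the sample values and the `strict_pattern` variant are hypothetical and not part of the schema.

```python
import re

# The pattern from the schema above. Its dot is unescaped, so it matches any character.
price_pattern = "^£\\d{2}.\\d{2}$"
# Hypothetical stricter variant with an escaped dot, shown for comparison only.
strict_pattern = r"^£\d{2}\.\d{2}$"

# Made-up sample values.
samples = ["£51.77", "£51x77", "£5.00", "51.77"]

# JSON Schema patterns are not implicitly anchored, so re.search is the closest analogue.
matches = {s: bool(re.search(price_pattern, s)) for s in samples}
strict_matches = {s: bool(re.search(strict_pattern, s)) for s in samples}

for s in samples:
    print(s, matches[s], strict_matches[s])
```

With the schema's pattern, `£51x77` passes because `.` matches any character; only the escaped variant rejects it.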

### From a URL

In [7]:
a.schema = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/books.json"
a.schema, a.schema_source

({'$schema': 'http://json-schema.org/draft-07/schema#',
  'definitions': {'float': {'pattern': '^-?[0-9]+\\.[0-9]{2}$'},
   'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
  'additionalProperties': False,
  'type': 'object',
  'properties': {'category': {'type': 'string', 'tag': ['category']},
   'price': {'type': 'string', 'pattern': '^£\\d{2}.\\d{2}$'},
   '_type': {'type': 'string'},
   'description': {'type': 'string'},
   'title': {'type': 'string', 'tag': ['unique']},
   '_key': {'type': 'string'}},
  'required': ['_key', '_type', 'category', 'description', 'price', 'title']},
 'https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/books.json')

### From a private repo

For GitHub, specify the raw link, which contains a token at the end. The token expires after 5 minutes.

```a.schema = "https://raw.githubusercontent.com/manycoding/repo/master/schema.json?token=AJ6jjTtZtWZr5zyw7DuWduieMJ2ms1ks5ctRC6wA%3%3D"```

For Bitbucket, set the `BITBUCKET_USER` and `BITBUCKET_PASSWORD` environment variables.
In Jupyter, for example:

In [8]:
%env BITBUCKET_USER=your_id
%env BITBUCKET_PASSWORD=your_pass

env: BITBUCKET_USER=your_id
env: BITBUCKET_PASSWORD=your_pass


Then you can use raw links:
```a.schema = "https://bitbucket.org/user/repo/raw/HEAD/schema.json"```
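Outside Jupyter the `%env` magic is not available, but the same variables can be set from plain Python. The following is a minimal sketch, assuming the credentials are read from the process environment at the time the schema is fetched; the `missing_env` helper is hypothetical and not part of the library.

```python
import os

# Placeholder credentials, as in the cell above; replace with real values.
os.environ["BITBUCKET_USER"] = "your_id"
os.environ["BITBUCKET_PASSWORD"] = "your_pass"


def missing_env(names):
    """Return the names from `names` that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name)]


# Hypothetical pre-flight check before assigning a Bitbucket raw link to a.schema.
missing = missing_env(["BITBUCKET_USER", "BITBUCKET_PASSWORD"])
print(missing)
```

Setting the variables in the shell before starting Python works as well and keeps credentials out of the source file.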

### From AWS S3

To get schemas from a private S3 bucket, set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

In [9]:
%env AWS_ACCESS_KEY_ID=your_id
%env AWS_SECRET_ACCESS_KEY=your_key

env: AWS_ACCESS_KEY_ID=your_id
env: AWS_SECRET_ACCESS_KEY=your_key


Then specify the S3 link:

```a.schema = "s3://bucket/schema.json"```
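An `s3://` link combines the bucket name and the object key. The sketch below shows how such a link breaks down using the standard library's `urlparse`; it only illustrates the link format and is not a description of how the library resolves S3 links.

```python
from urllib.parse import urlparse

# The example link from above.
parsed = urlparse("s3://bucket/schema.json")

scheme = parsed.scheme          # URL scheme
bucket = parsed.netloc          # bucket name
key = parsed.path.lstrip("/")   # object key inside the bucket

print(scheme, bucket, key)
```

For keys inside folders, such as `s3://bucket/schemas/books.json`, everything after the bucket name is the key.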