# Schema Validation
This notebook demonstrates how to use the `jsonschema` library to validate a document against the `document-schema.json` schema.

Requires `jsonschema==3.0.1`

In [1]:
from jsonschema import validate
import json

First, load the schema from its JSON file:

In [2]:
# Use a context manager so the file handle is closed after reading
with open("document-schema.json") as f:
    schema = json.load(f)
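
Before validating any documents, it can help to confirm that the schema itself is well formed. The sketch below assumes the schema targets JSON Schema Draft 7 (the notebook does not say which draft `document-schema.json` uses) and uses two small hypothetical schemas, `good_schema` and `bad_schema`, in place of the real file:

```python
from jsonschema import Draft7Validator
from jsonschema.exceptions import SchemaError

# Hypothetical stand-ins for document-schema.json (assumption: Draft 7).
good_schema = {"type": "object", "required": ["_id"]}
bad_schema = {"type": "objekt"}  # misspelled type name

# check_schema raises SchemaError if the schema breaks the metaschema.
Draft7Validator.check_schema(good_schema)

try:
    Draft7Validator.check_schema(bad_schema)
    bad_schema_rejected = False
except SchemaError:
    bad_schema_rejected = True

print(bad_schema_rejected)
```

Checking the schema explicitly separates mistakes in the schema from mistakes in the documents being validated.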

Next, define an instance of a `document`.

We deliberately omit the `extracted_text` field, which the schema requires, to confirm that validation fails:

In [3]:
document = {"_id": "9cba6e4275a4b455bd163fbb2e3c04d95cd36291f51ba85974303558ca739bf6",
            "file_name": "test_file.pdf",
            "file_type": ".pdf",
            "stored_url": "http://stored_url.com",
            "source_url": "http://source_url.com"}

In [4]:
validate(instance=document, schema=schema)

ValidationError: 'extracted_text' is a required property

Failed validating 'required' in schema:
    {'$id': 'http://document-schema.worldmodelers.com',
     'description': 'Defines a schema for World Modelers and DART.',
     'properties': {'_id': {'$id': '#/properties/_id',
                            'description': 'SHA-256 hash of the raw file '
                                           'to generate a unique '
                                           'identifier',
                            'example': '239e910f11880045e6d2533c6ba86651dd89c54265047b26e5ac7e5792255775',
                            'type': 'string'},
                    'category': {'$id': '#/properties/category',
                                 'examples': ['migration',
                                              'food security',
                                              'conflict'],
                                 'type': 'string'},
                    'creation_date': {'$id': '#/properties/creation_date',
                                      'description': 'An object containing '
                                                     'information about '
                                                     "the document's "
                                                     'creation date.',
                                      'properties': {'date': {'$id': '#/properties/creation_date/properties/date',
                                                              'description': 'Raw '
                                                                             'date '
                                                                             'string',
                                                              'examples': ['2017-10-08T19:34:45Z',
                                                                           'June '
                                                                           '1, '
                                                                           '2018',
                                                                           '01/30/2019'],
                                                              'type': 'string'},
                                                     'day': {'$id': '#/properties/creation_date/properties/day',
                                                             'description': 'Extracted '
                                                                            'day '
                                                                            'in '
                                                                            'integer '
                                                                            'format',
                                                             'example': 30,
                                                             'type': 'integer'},
                                                     'month': {'$id': '#/properties/creation_date/properties/month',
                                                               'description': 'Extracted '
                                                                              'month '
                                                                              'in '
                                                                              'integer '
                                                                              'format',
                                                               'example': 2,
                                                               'type': 'integer'},
                                                     'year': {'$id': '#/properties/creation_date/properties/year',
                                                              'description': 'Extracted '
                                                                             'year '
                                                                             'in '
                                                                             'integer '
                                                                             'format',
                                                              'example': 2015,
                                                              'type': 'integer'}},
                                      'type': 'object'},
                    'extracted_text': {'$id': '#/properties/extracted_text',
                                       'description': 'An object whose '
                                                      'keys are the text '
                                                      'extraction tools '
                                                      'run against the raw '
                                                      'file.',
                                       'properties': {'bs4': {'$id': '#/properties/extracted_text/properties/bs4',
                                                              'description': 'The '
                                                                             'text '
                                                                             'extracted '
                                                                             'from '
                                                                             'BeautifulSoup '
                                                                             '(https://www.crummy.com/software/BeautifulSoup/bs4/doc/)',
                                                              'example': 'Increased '
                                                                         'rain '
                                                                         'led '
                                                                         'to '
                                                                         'flooding '
                                                                         'in '
                                                                         'the '
                                                                         'region.',
                                                              'type': 'string'},
                                                      'pypdf2': {'$id': '#/properties/extracted_text/properties/pypdf2',
                                                                 'description': 'The '
                                                                                'text '
                                                                                'extracted '
                                                                                'from '
                                                                                'PyPDF2 '
                                                                                '(https://github.com/mstamy2/PyPDF2)',
                                                                 'example': 'Increased '
                                                                            'rain '
                                                                            'led '
                                                                            'to '
                                                                            'flooding '
                                                                            'in '
                                                                            'the '
                                                                            'region.',
                                                                 'type': 'string'},
                                                      'tika': {'$id': '#/properties/extracted_text/properties/tika',
                                                               'description': 'The '
                                                                              'text '
                                                                              'extracted '
                                                                              'from '
                                                                              'Tika '
                                                                              '(https://tika.apache.org/)',
                                                               'example': 'Increased '
                                                                          'rain '
                                                                          'led '
                                                                          'to '
                                                                          'flooding '
                                                                          'in '
                                                                          'the '
                                                                          'region.',
                                                               'type': 'string'}},
                                       'type': 'object'},
                    'file_name': {'$id': '#/properties/file_name',
                                  'description': 'The name of the original '
                                                 'file',
                                  'example': 'Integrated_Disease_Surveillance_and_Response_(IDSR)_Annexes_25-Sep-17.pdf',
                                  'type': 'string'},
                    'file_type': {'$id': '#/properties/file_type',
                                  'description': 'The type of the original '
                                                 'file',
                                  'examples': ['.pdf', '.docx', '.ppt'],
                                  'type': 'string'},
                    'modification_date': {'$id': '#/properties/modification_date',
                                          'description': 'An object '
                                                         'containing '
                                                         'information '
                                                         'about the '
                                                         "document's "
                                                         'latest '
                                                         'modification '
                                                         'date.',
                                          'properties': {'date': {'$id': '#/properties/modification_date/properties/date',
                                                                  'description': 'Raw '
                                                                                 'date '
                                                                                 'string',
                                                                  'examples': ['2017-10-08T19:34:45Z',
                                                                               'June '
                                                                               '1, '
                                                                               '2018',
                                                                               '01/30/2019'],
                                                                  'type': 'string'},
                                                         'day': {'$id': '#/properties/modification_date/properties/day',
                                                                 'description': 'Extracted '
                                                                                'day '
                                                                                'in '
                                                                                'integer '
                                                                                'format',
                                                                 'example': 30,
                                                                 'type': 'integer'},
                                                         'month': {'$id': '#/properties/modification_date/properties/month',
                                                                   'description': 'Extracted '
                                                                                  'month '
                                                                                  'in '
                                                                                  'integer '
                                                                                  'format',
                                                                   'example': 2,
                                                                   'type': 'integer'},
                                                         'year': {'$id': '#/properties/modification_date/properties/year',
                                                                  'description': 'Extracted '
                                                                                 'year '
                                                                                 'in '
                                                                                 'integer '
                                                                                 'format',
                                                                  'example': 2015,
                                                                  'type': 'integer'}},
                                          'type': 'object'},
                    'source': {'$id': '#/properties/source',
                               'description': 'An object containing '
                                              'information about the '
                                              "document's source.",
                               'properties': {'author_name': {'$id': '#/properties/source/properties/author_name',
                                                              'description': 'The '
                                                                             'name '
                                                                             'of '
                                                                             'the '
                                                                             'author',
                                                              'example': 'Wamala '
                                                                         'Joseph '
                                                                         'Francis',
                                                              'type': 'string'},
                                              'organization_name': {'$id': '#/properties/source/properties/organization_name',
                                                                    'description': 'The '
                                                                                   'name '
                                                                                   'of '
                                                                                   'the '
                                                                                   'source '
                                                                                   'organization',
                                                                    'example': 'South '
                                                                               'Sudan '
                                                                               'Health '
                                                                               'Cluster, '
                                                                               'WHO',
                                                                    'type': 'string'},
                                              'publisher_name': {'$id': '#/properties/source/properties/publisher_name',
                                                                 'description': 'The '
                                                                                'name '
                                                                                'of '
                                                                                'the '
                                                                                'source '
                                                                                'publisher',
                                                                 'example': 'World '
                                                                            'Health '
                                                                            'Organization',
                                                                 'type': 'string'}},
                               'type': 'object'},
                    'source_url': {'$id': '#/properties/source_url',
                                   'description': 'The original web '
                                                  'location of the file',
                                   'example': 'https://www.afro.who.int/sites/default/files/2019-06/South%20Sudan%20IDSR%20Bulletin%20-%20W23%20June%203%20-%20June%209%202019..pdf',
                                   'type': 'string'},
                    'stored_url': {'$id': '#/properties/stored_url',
                                   'description': 'The stored location of '
                                                  'the file for World '
                                                  'Modelers reference (S3)',
                                   'example': 'https://world-modelers.s3.amazonaws.com/documents/migration/Integrated_Disease_Surveillance_and_Response_(IDSR)_Annexes_25-Sep-17.pdf',
                                   'type': 'string'},
                    'title': {'$id': '#/properties/title',
                              'description': 'The title of the file',
                              'example': 'South Sudan IDSR Annex - W39 '
                                         '2017 Sep 25-Oct 1_',
                              'type': 'string'}},
     'required': ['_id',
                  'file_name',
                  'file_type',
                  'extracted_text',
                  'stored_url',
                  'source_url'],
     'title': 'World Modelers Document Schema',
     'type': 'object'}

On instance:
    {'_id': '9cba6e4275a4b455bd163fbb2e3c04d95cd36291f51ba85974303558ca739bf6',
     'file_name': 'test_file.pdf',
     'file_type': '.pdf',
     'source_url': 'http://source_url.com',
     'stored_url': 'http://stored_url.com'}
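
The full traceback above repeats the entire schema, which can be hard to scan. As a minimal sketch, the exception can be caught and only its summary fields read. The schema and document here (`mini_schema`, `incomplete`) are hypothetical stand-ins, not the real `document-schema.json`:

```python
from jsonschema import validate
from jsonschema.exceptions import ValidationError

# Hypothetical reduced schema: only the `required` list matters here.
mini_schema = {
    "type": "object",
    "properties": {"extracted_text": {"type": "object"}},
    "required": ["_id", "extracted_text"],
}

# A document that is missing the required `extracted_text` field.
incomplete = {"_id": "abc123"}

try:
    validate(instance=incomplete, schema=mini_schema)
except ValidationError as err:
    summary = err.message           # short human-readable description
    failed_keyword = err.validator  # schema keyword that failed
    print(summary)
    print(failed_keyword)
```

`err.message` carries the same first line shown in the traceback, and `err.validator` names the schema keyword that was violated.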

Now add `extracted_text` as an empty object and validate the `document` again:

In [5]:
document["extracted_text"] = {}

`validate` returns `None` and raises no exception when the `document` conforms to the `schema`, so this cell should produce no output:

In [6]:
validate(instance=document, schema=schema)
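
Because `validate` stops at the first problem it reports, a validator object can be more convenient when a boolean result or a full list of problems is needed. The following is a sketch under the assumption that Draft 7 is the right draft; `mini_schema`, `complete`, and `incomplete` are hypothetical examples rather than the real schema and documents:

```python
from jsonschema import Draft7Validator

# Hypothetical reduced schema with three required fields.
mini_schema = {
    "type": "object",
    "properties": {"extracted_text": {"type": "object"}},
    "required": ["_id", "file_name", "extracted_text"],
}

validator = Draft7Validator(mini_schema)

complete = {"_id": "abc123", "file_name": "test_file.pdf", "extracted_text": {}}
incomplete = {"_id": "abc123"}

# is_valid returns a boolean instead of raising.
complete_ok = validator.is_valid(complete)
incomplete_ok = validator.is_valid(incomplete)

# iter_errors yields every violation, not only the first one.
messages = sorted(error.message for error in validator.iter_errors(incomplete))
for message in messages:
    print(message)
```

Reusing one validator object also avoids rebuilding it for each document when many documents are checked against the same schema.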