# meemoo - borndigital validation

**Watch** a [short tutorial video](https://greatexpectations.io/videos/getting_started/integrate_expectations) or **read** [the written tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data)

In [1]:
import json
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
import datetime

2020-12-23T17:20:46+0100 - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.


## 1. Get a DataContext
This represents the **project** in this folder (originally created using `great_expectations init`).

In [2]:
context = ge.data_context.DataContext()

## 2. Choose an Expectation Suite

List the expectation suites that you created in your project.

In [3]:
context.list_expectation_suite_names()

['borndigi_test']

In [4]:
expectation_suite_name = 'borndigi_test' # name from the list above

## 3. Generate an SQL query from field list

We need to extract the metadata fields from the database and place them into a tabular structure.
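The `generate_query` function from the `meemoo_util` package builds an SQL query that returns one column per listed field. It takes the list of single-value fields, the list of multiselect fields, and a filter, which here is a borndigital batch id.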

In [5]:
from meemoo_util import generate_query

FIELDS = ["CP", "CP_id", "dc_identifier_localid", "dc_title", "dcterms_created", "dcterms_issued"]
MULTISELECT_FIELDS = ["dc_identifier_localids", "dc_titles", "dc_rights_licenses", "dc_subjects"]
BORNDIGITAL_BATCH_ID = 5

borndigi_query = generate_query(FIELDS, MULTISELECT_FIELDS, {"batch_id":  BORNDIGITAL_BATCH_ID})
borndigi_query

"SELECT CP[1]::text AS CP, CP_id[1]::text AS CP_id, dc_identifier_localid[1]::text AS dc_identifier_localid, dc_title[1]::text AS dc_title, dcterms_created[1]::text AS dcterms_created, dcterms_issued[1]::text AS dcterms_issued,dc_identifier_localids[1]::text AS dc_identifier_localids,dc_titles[1]::text AS dc_titles,dc_rights_licenses[1]::text AS dc_rights_licenses,dc_subjects[1]::text AS dc_subjects FROM (SELECT xpath('/VIAA/CP/text()', xmldata::xml) AS CP, xpath('/VIAA/CP_id/text()', xmldata::xml) AS CP_id, xpath('/VIAA/dc_identifier_localid/text()', xmldata::xml) AS dc_identifier_localid, xpath('/VIAA/dc_title/text()', xmldata::xml) AS dc_title, xpath('/VIAA/dcterms_created/text()', xmldata::xml) AS dcterms_created, xpath('/VIAA/dcterms_issued/text()', xmldata::xml) AS dcterms_issued, xpath('/VIAA/dc_identifier_localids/*', xmldata::xml) AS dc_identifier_localids,xpath('/VIAA/dc_titles/*', xmldata::xml) AS dc_titles,xpath('/VIAA/dc_rights_licenses/*', xmldata::xml) AS dc_rights_licen
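The visible part of the generated query follows a regular pattern: an inner `SELECT` pulls each field out of `xmldata` with `xpath(...)`, and an outer `SELECT` takes the first array element and casts it to text. The sketch below reproduces only that visible pattern. It is not the `meemoo_util` implementation: the function name `sketch_query`, the `table` and `where` parameters, the subquery alias `records`, and the identifier check are all assumptions made for illustration, because the output above is cut off before the `FROM` target and any filter appear.

```python
import re

# Assumption: field names are plain SQL identifiers. They are interpolated
# into the query text, so anything else is rejected up front.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sketch_query(fields, multiselect_fields, table, where=None):
    """Build a query shaped like the visible part of the output above.

    `table`, `where`, and the alias `records` are hypothetical; the real
    query is truncated before that part is shown.
    """
    for name in [*fields, *multiselect_fields, table]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")

    # Outer SELECT: first element of each xpath() array, cast to text.
    outer = ", ".join(
        f"{name}[1]::text AS {name}" for name in [*fields, *multiselect_fields]
    )

    # Inner SELECT: text nodes for single-value fields,
    # child elements for multiselect fields.
    inner_parts = [
        f"xpath('/VIAA/{name}/text()', xmldata::xml) AS {name}" for name in fields
    ]
    inner_parts += [
        f"xpath('/VIAA/{name}/*', xmldata::xml) AS {name}"
        for name in multiselect_fields
    ]
    inner = ", ".join(inner_parts)

    query = f"SELECT {outer} FROM (SELECT {inner} FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query + ") AS records"


example = sketch_query(["dc_title"], ["dc_titles"], "example_table")
print(example)
```

Passing a filter value such as the batch id through a bound query parameter, rather than formatting it into the string, would be the safer choice if the real helper allows it.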

## 4. Load a batch of data you want to validate

To learn more about `get_batch`, see [this tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#load-a-batch-of-data-to-validate)


In [6]:
# list datasources of the type SqlAlchemyDatasource in your project
[datasource['name'] for datasource in context.list_datasources() if datasource['class_name'] == 'SqlAlchemyDatasource']

['borndigi']

In [7]:
datasource_name = 'borndigi' # a datasource name from above

In [8]:
# If you would like to validate the result set of a query:
batch_kwargs = {'query': borndigi_query, 'datasource': datasource_name}

batch = context.get_batch(batch_kwargs, expectation_suite_name)
batch.head()

2020-12-23T17:21:09+0100 - INFO - 	9 expectation(s) included in expectation_suite.


Unnamed: 0,cp,cp_id,dc_identifier_localid,dc_title,dcterms_created,dcterms_issued,dc_identifier_localids,dc_titles,dc_rights_licenses,dc_subjects
0,Huis van Alijn,OR-1v5bc86,FO-60-01479,"Bruidspaar André en Georgette in bootje, Sint-...",1966-07-02/uuuu-uu-uu,,,<alternatief>digitale afbeelding</alternatief>,<multiselect>VIAA-PUBLIEK-METADATA-LTD</multis...,<Trefwoord>Onderwerp | huwelijk</Trefwoord>
1,Huis van Alijn,OR-1v5bc86,FO-60-01481,"Huwelijksmis Betsie en Walter, Gent, 1968",1968-04-05/uuuu-uu-uu,,,<alternatief>digitale afbeelding</alternatief>,<multiselect>VIAA-PUBLIEK-METADATA-LTD</multis...,<Trefwoord>Onderwerp | huwelijk</Trefwoord>
2,Huis van Alijn,OR-1v5bc86,FO-50-01562,"Echtpaar bij auto, Doornzele, 1954",1954-05-15/uuuu-uu-uu,,,<alternatief>digitale afbeelding</alternatief>,<multiselect>VIAA-PUBLIEK-METADATA-LTD</multis...,<Trefwoord>Onderwerp | vervoer</Trefwoord>
3,Huis van Alijn,OR-1v5bc86,FO-60-01484,"Trouwfeest Betsie en Walter, Gent, 1968",1968-04-05/uuuu-uu-uu,,,<alternatief>digitale afbeelding</alternatief>,<multiselect>VIAA-PUBLIEK-METADATA-LTD</multis...,<Trefwoord>Onderwerp | huwelijk</Trefwoord>
4,Huis van Alijn,OR-1v5bc86,FO-60-01485,"Openingsdans Betsie en Walter, Gent, 1968",1968-04-05/uuuu-uu-uu,,,<alternatief>digitale afbeelding</alternatief>,<multiselect>VIAA-PUBLIEK-METADATA-LTD</multis...,<Trefwoord>Onderwerp | huwelijk</Trefwoord>
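The multiselect columns in this preview still contain XML markup (for example `<alternatief>...</alternatief>`), while the single-value columns contain plain text. That follows from the two XPath forms in the query: `/text()` selects text nodes, whereas `/*` selects child elements, which keep their tags when cast to text. The same distinction can be reproduced outside the database with Python's standard library. The sample record below is hypothetical, assembled from values shown in the preview, and ElementTree only approximates what PostgreSQL's `xpath` does.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, assembled from values visible in the preview above.
SAMPLE = (
    "<VIAA>"
    "<dc_title>Echtpaar bij auto, Doornzele, 1954</dc_title>"
    "<dc_titles><alternatief>digitale afbeelding</alternatief></dc_titles>"
    "</VIAA>"
)

root = ET.fromstring(SAMPLE)

# Counterpart of xpath('/VIAA/dc_title/text()'): the text content only.
title = root.findtext("dc_title")

# Counterpart of xpath('/VIAA/dc_titles/*')[1]: the first child element,
# serialized together with its tags.
first_child = root.find("dc_titles")[0]
titles_markup = ET.tostring(first_child, encoding="unicode")

print(title)
print(titles_markup)
```

Expectations written against the multiselect columns therefore need to account for the surrounding tags, for instance by matching on a pattern rather than on an exact value.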


## 5. Validate the batch with Validation Operators

`Validation Operators` provide a convenient way to bundle the validation of
multiple expectation suites and the actions that should be taken after validation.

When deploying Great Expectations in a **real data pipeline**, you will typically discover these needs:

* validating a group of batches that are logically related
* validating a batch against several expectation suites such as using a tiered pattern like `warning` and `failure`
* doing something with the validation results (e.g., saving them for a later review, sending notifications in case of failures, etc.).

[Read more about Validation Operators in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#save-validation-results)
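The tiered `warning` and `failure` pattern mentioned above can be pictured as a small decision rule: a batch that fails the strict suite blocks the pipeline, while a batch that only misses the softer suite is flagged for review. The sketch below is plain Python and not a Great Expectations API; the function name `tiered_outcome` and the three outcome labels are assumptions chosen for illustration.

```python
def tiered_outcome(failure_suite_passed: bool, warning_suite_passed: bool) -> str:
    """Combine the results of two suites into one pipeline decision.

    Hypothetical helper: the `failure` suite holds hard requirements,
    the `warning` suite holds softer ones.
    """
    if not failure_suite_passed:
        return "fail"  # hard requirement broken: stop the pipeline
    if not warning_suite_passed:
        return "warn"  # soft requirement broken: continue, but notify
    return "pass"


for failure_ok, warning_ok in [(True, True), (True, False), (False, True)]:
    print(failure_ok, warning_ok, tiered_outcome(failure_ok, warning_ok))
```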

In [9]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file

"""
Create a run_id. The run_id must be of type RunIdentifier, with optional run_name and run_time instantiation
arguments (or a dictionary with these keys). The run_name can be any string (this could come from your pipeline
runner, e.g. Airflow run id). The run_time can be either a dateutil parsable string or a datetime object.
Note - any provided datetime will be assumed to be a UTC time. If no instantiation arguments are given, run_name will
be None and run_time will default to the current UTC datetime.
"""

run_id = {
  "run_name": "borndigi_run",  # insert your own run_name here
  "run_time": datetime.datetime.now(datetime.timezone.utc)
}

results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
    run_id=run_id)

2020-12-23T17:21:13+0100 - INFO - 	9 expectation(s) included in expectation_suite.
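When this notebook is turned into a script, the outcome usually needs to reach the caller, for example as a process exit code. The sketch below assumes that the object returned by `run_validation_operator` exposes a boolean under the key `"success"`, which is worth confirming against the installed Great Expectations version. The helper name `exit_code_for` is hypothetical, and a plain dictionary stands in for the real result so the example runs on its own.

```python
def exit_code_for(results) -> int:
    """Map a validation result to a conventional process exit code.

    Assumption: `results` supports key access and has a boolean "success"
    entry, as the validation operator result is expected to.
    """
    return 0 if results["success"] else 1


# Stand-in for the object returned by context.run_validation_operator(...).
stand_in_results = {"success": False}
print(exit_code_for(stand_in_results))
```

In a script, `sys.exit(exit_code_for(results))` would then let a scheduler or CI job react to failed validations.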


## 6. View the Validation Results in Data Docs

Let's now build and look at your Data Docs. They now include a **data quality report** built from the `ValidationResults` you just created, which helps you communicate about your data with both machines and humans.

[Read more about Data Docs in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#view-the-validation-results-in-data-docs)

In [10]:
context.open_data_docs()