# _Module 2 lesson 3_: Restructuring and validating data against a schema using Python and _whyqd_

<div class="alert alert-block alert-warning">
    <b>Learning outcomes:</b>
    <br>
    <ul>
        <li>Perform techniques in analysis and coding, using the Whyqd data wrangling package, to create a structured, JSON-formatted method, for restructuring data into a standard schema.</li>
        <li>Write and use modular functions to automatically validate data files against a defined schema using Python.</li>
        <li>Apply techniques in open data publication to prepare data, schemas and validation outputs for publication.</li>
    </ul>
</div>

---

## 3.1 Manual and automated data restructuring

Gains in software automation seem so unexceptional that it is legitimate to ask why a computer can't take a messy spreadsheet and clean it up for you without any supervision. 

Autonomous systems are good at repeating structured tasks but less capable of unstructured tasks where context, meaning and interpretation are important. A computer still can't really _understand_ the intention behind conversation and narrative. Spreadsheets are structured both visually and textually, but they often contain little information within their construction to explain why they were created or what the data represent.

That doesn't mean that a computational system can't - eventually - figure out some meaning, but the scale of the work involved may be far greater than the value of doing it.

![Tutorial spreadsheet](images/tutorial-1-2.jpg)

Textual analysis, using search engines like Google or Bing, can identify terms used in the table headers or references. Hashing the file (a mathematical method of producing a short character string from the string pattern of the file itself) can uniquely identify this file and match it to where exact copies may be referenced online.

Here's an example of hashing:

In [1]:
import hashlib

# A tale of two cities, Charles Dickens (1859)
text = """It was the best of times, it was the worst of times, it was the age of wisdom, 
          it was the age of foolishness, it was the epoch of belief, it was the epoch of 
          incredulity, it was the season of Light, it was the season of Darkness, it was 
          the spring of hope, it was the winter of despair, we had everything before us, 
          we had nothing before us, we were all going direct to Heaven, we were all going 
          direct the other way – in short, the period was so far like the present period, 
          that some of its noisiest authorities insisted on its being received, for good 
          or for evil, in the superlative degree of comparison only."""

# Produce a sha256-encoded hash of the text
hashlib.sha256(text.encode('utf-8')).hexdigest()

'a9da3340c48f5dcb0bf6ec3f6c6f46afefc6262f49c2b9ab75983cc44e5c1808'
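The same idea applies to a whole file rather than a string. The following is a minimal sketch, assuming a hypothetical helper called `hash_file` (it is not part of any package used in this course). It reads the file in chunks so that a large spreadsheet never has to fit in memory all at once:

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the sha256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks until the file is exhausted
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files containing identical bytes produce identical digests, whatever they are named, which is what makes a hash useful for matching exact copies.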

These connections - hashing, textual analysis - could then be used to gather additional metadata to support automated interpretation. A sufficiently complex system can make a reasonable approximation of restructured data, with a coded explanation of the decisions it made.

It can be done. But it _shouldn't have to be!_

This entire course exists because data are deliberately stored in non-accessible formats. Sometimes this happens out of a genuine - but misinformed - belief that human-structured data is "easier" to read. Sometimes it is done as a deliberate act of sabotage: to render data unavailable for use while still permitting the claim that it is technically readable. 

In case you think this is farcical, I once received a 200-page print-out from a PDF of a scanned spreadsheet, with the data font so small and smudged that it was almost unreadable. This was the result of the data owner losing a court case and being forced to publish.

Data professionals must work with data as it is, not as we would wish it to be, but - before we head on - take a moment to reflect on how much time and effort would be saved, how much ingenuity directed towards more beneficial pursuits, if our source data were placed in a structured, machine-readable format in the first place.

---

## 3.2 Auditable data restructuring with _whyqd_

The great thing about software-derived restructuring is that you can - with minimal effort - reuse your code and apply it to other spreadsheets that may have similar problems. Over time, you will develop a range of scripts to help you. Even better, you can share your scripts with others to help them as well, and to demonstrate that your restructuring has not introduced errors into the data.

This second part is critical for data probity; for ensuring an audit trail from source to the final data used in your analysis. Where working manually in Excel means that it is often impossible to trace errors without redoing your work, code can be read. Anyone can follow the decisions captured in your code, run that code, and test whether the output is derived from the input, and whether your code contains any errors.

However, not everyone can read code. [whyqd](https://whyqd.readthedocs.io/en/latest/) is a Python package designed to make restructuring easier, while also providing an auditable record of how that restructuring was performed:

> __whyqd__ provides an intuitive method for restructuring messy data to conform to a standardised metadata schema. It supports data managers and researchers looking to rapidly, and continuously, normalise any messy spreadsheets using a simple series of steps. Once complete, you can import wrangled data into more complex analytical systems or full-feature wrangling tools.

Full disclosure: I also wrote and maintain the package, and it is used on [Sqwyre.com](https://sqwyre.com) where we import hundreds of messy data source files a month and restructure them into a single database.

Before we use this package, let's check whether there have been any updates to the code since we first installed it. In Lesson 1.2.2 we installed `whyqd`; now we're going to update it.

From your Anaconda environment, left-click on the "arrow" next to "datascience" and choose `Open Terminal`. Make sure you click on the right environment; not `base` or `root`, but the name of the environment where you're working: 

![Jupyter terminal](images/jupyter-terminal.JPG "Jupyter terminal")

A new window will open with a command prompt. Type in the following and press Enter.

    pip install whyqd -U
    
![Jupyter terminal update](images/jupyter-terminal-update.jpg "Jupyter terminal update")

The following tutorial is adapted from [whyqd's documentation](https://whyqd.readthedocs.io/en/latest/morph_tutorial.html).

### 3.2.1 Creating a Schema

The objective of your schema is not only to define a structure for your data, but also to provide reference and contextual information for anyone using it. In a research context, definitions are critical to avoid ambiguity, ensure replication, and build trust.

The minimum requirement for a schema is that it have a `name`, but we’re going to give it a `title` and `description` as well, because more information is better.

In [1]:
import whyqd as _w
schema = _w.Schema()

details = {
        "name": "human-development-report",
        "title": "UN Human Development Report 2007 - 2008",
        "description": """
        In 1990 the first Human Development Report introduced a new approach for
        advancing human wellbeing. Human development – or the human development approach - is about
        expanding the richness of human life, rather than simply the richness of the economy in which
        human beings live. It is an approach that is focused on people and their opportunities and choices."""
}
schema.set_details(**details)

Let’s define the fields in our schema and then iterate over the list to add each field. You should see that this is equivalent to the exercise in Lesson 1.2.2. `required`, in 
the `constraints`, means that this column is required to ensure the destination data validate.

In [2]:
fields = [
    {
        "name": "Country Name",
        "title": "Country Name",
        "type": "string",
        "description": "Official country names.",
        "constraints": {
            "required": True
        }
    },
    {
        "name": "HDI Category",
        "title": "HDI Category",
        "type": "string",
        "description": "Human Development Index Category derived from the HDI Rank.",
    },
    {
        "name": "Indicator Name",
        "title": "Indicator Name",
        "type": "string",
        "description": "Indicator described in the data series.",
    },
    {
        "name": "Reference",
        "title": "Reference",
        "type": "string",
        "description": "Reference to data source.",
    },
    {
        "name": "Year",
        "title": "Year",
        "type": "year",
        "description": "Year of release.",
    },
    {
        "name": "Values",
        "title": "Values",
        "type": "number",
        "description": "Value for the Year and Indicator Name.",
        "constraints": {
            "required": True
        }
    },
]
for field in fields:
    schema.set_field(**field)

<div class="alert alert-block alert-info">
    <p>A <code>schema</code> is a <b>choice</b>. Its structure is based on decisions you need to make. You could decide 
        that you want to do it differently. There may be good reasons for a different approach, and there is no one way 
        of defining how data should be structured. What <b>is</b> required is clarity, consistency, and metadata.
    </p>
</div>

From here on we can access any `field` by calling it by `name` and then edit it as required. Note that the software changed the names we put in by lower-casing the text and replacing ` `  with `_`:

In [3]:
schema.field("country_name")

{'name': 'country_name',
 'type': 'string',
 'constraints': {'required': True},
 'title': 'Country Name',
 'description': 'Official country names.'}
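To make the renaming rule concrete, here is a minimal sketch of the behaviour described above (lower-casing and replacing spaces with underscores). The function name `normalise_name` is hypothetical and this is not _whyqd_'s own implementation; it only reproduces the visible result:

```python
def normalise_name(name):
    """Lower-case a field name and join its words with underscores (illustrative only)."""
    return "_".join(name.lower().split())


names = ["Country Name", "HDI Category", "Indicator Name", "Reference", "Year", "Values"]
normalised = [normalise_name(n) for n in names]
print(normalised)
```

Knowing the rule in advance helps when you later refer to fields by `name`, since the normalised form is the one you need to use.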

We can also save our schema to a specified `directory`:

In [4]:
directory = "data/lesson-programmatic/"
# you can also specify an optional filename
# if you leave it out, the filename will default to the schema name
filename = "human-development-report-schema"
# if the file already exists, you'll need to specify `overwrite=True` otherwise you'll get
# an error
schema.save(directory, filename=filename, overwrite=True)

True

### 3.2.2 Creating a Method

`Methods` are how you define the steps `whyqd` must perform to restructure your data and align it with your `Schema`. There isn't a great deal of coding to do, but there are a lot of decisions to make. 

The only compulsory parameter needed when creating a method is a reference to our source schema (the one we created above). We may also offer a working directory. During the process, `whyqd` will create a number of interim working data files, as well as your JSON method file, and your wrangled output data. You need to tell it where to work, or it will simply drop everything into the directory you’re calling the function from.

We can also, at initialisation, provide the list of data sources:

In [5]:
# The following imports and settings ensure that you can get a wide output for your tables
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

import numpy as np
import whyqd as _w

SCHEMA_SOURCE = "data/lesson-programmatic/human-development-report-schema.json"
DIRECTORY = "data/lesson-programmatic/"
INPUT_DATA = [
    "data/lesson-spreadsheet/HDR 2007-2008 Table 03.xlsx"
]
method = _w.Method(SCHEMA_SOURCE, directory=DIRECTORY, input_data=INPUT_DATA)

These data will be copied to your working directory and renamed to a unique hashed `id` (similar to the hash presented earlier).

<div class="alert alert-block alert-info">
    <p><b>Data probity</b> - the ability to audit data and methodology back to source - is critical for research transparency and replication. 
        You may end up with hundreds of similarly-named files in a single directory without much information as to where they come from, or 
        how they were created. Unique ids, referenced in your method file, are a more useful way of ensuring you know what they were for.
    </p>
</div>
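As an illustration of the principle, the sketch below copies a source file into a working directory under a unique name and returns a record linking the copy back to its origin. Everything here is an assumption made for teaching purposes: the function name `copy_with_unique_id` and the keys of the returned record are hypothetical, and _whyqd_'s internal approach may well differ.

```python
import hashlib
import shutil
import uuid
from pathlib import Path


def copy_with_unique_id(source, working_directory):
    """Copy `source` into `working_directory` under a unique name (hypothetical helper).

    Returns a record that ties the renamed copy back to the original file.
    """
    source = Path(source)
    working_directory = Path(working_directory)
    working_directory.mkdir(parents=True, exist_ok=True)
    unique_id = str(uuid.uuid4())
    # Keep the original file extension so the copy can still be opened normally
    destination = working_directory / f"{unique_id}{source.suffix}"
    shutil.copyfile(source, destination)
    return {
        "id": unique_id,
        "original": str(source),
        "file": str(destination),
        "checksum": hashlib.sha256(destination.read_bytes()).hexdigest(),
    }
```

Storing a record like this alongside your method means that a file with an opaque name can always be traced to where it came from, and the checksum lets anyone confirm the copy has not changed.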

The method class provides help at each step. Access it like this:

In [6]:
print(method.help())


**whyqd** provides data wrangling simplicity, complete audit transparency, and at speed.

To get help, type:

	>>> method.help(option)

Where `option` can be any of:

	status
	merge
	structure
	category
	filter
	transform

`status` will return the current method status, and your mostly likely next steps. The other options
will return methodology, and output of that option's result (if appropriate). The `error` will
present an error trace and attempt to guide you to fix the process problem.

Current method status: `Ready to Merge`


_whyqd_ makes no assumptions about our data. We only have one file added to the method, so it is happy for you to head to the next step. However, we know our data and we know it needs work.

The [API](https://whyqd.readthedocs.io/) for _whyqd_ will describe all the functions available to you but it is useful to remember that you are working with your data symbolically, 
rather than directly. Data files can be extremely large, and waiting for them to load, or restructure hundreds of thousands of rows while you twiddle your thumbs, gets old fast. _whyqd_ 
keeps only a few rows available so you can get an idea of what your data look like:

In [7]:
print(method.print_input_data())



Data id: ba1fc184-592c-45f7-b842-160e981ec955
Original source: data/lesson-spreadsheet/HDR 2007-2008 Table 03.xlsx

  ..  Unnamed: 0                                         Unnamed: 1    Unnamed: 2    Monitoring human development: enlarging people's choices …    Unnamed: 4    Unnamed: 5    Unnamed: 6    Unnamed: 7    Unnamed: 8    Unnamed: 9    Unnamed: 10    Unnamed: 11    Unnamed: 12    Unnamed: 13    Unnamed: 14    Unnamed: 15    Unnamed: 16    Unnamed: 17    Unnamed: 18    Unnamed: 19    Unnamed: 20    Unnamed: 21    Unnamed: 22    Unnamed: 23    Unnamed: 24    Unnamed: 25    Unnamed: 26    Unnamed: 27    Unnamed: 28    Unnamed: 29    Unnamed: 30
   0  3 Human and income poverty Developing countries           nan           nan                                                           nan           nan           nan           nan           nan           nan           nan            nan            nan            nan            nan            nan            nan            nan       

You can see the problem we dealt with in Lesson 1.2.2. There doesn’t seem to be any data. At this stage you may be tempted to start hacking at the file directly - as we did then - and
see what you can fix, but our objective is not only clean data, but also an auditable record of how you went from source to final that can demonstrate the decisions you made, and 
whether you were able to maintain all the source data.

_whyqd_ offers a set of `morphs` that permit you to restructure individual tables prior to merging. Let's list all the morph types:

In [8]:
method.default_morph_types

['CATEGORISE', 'DEBLANK', 'DEDUPE', 'DELETE', 'MELT', 'REBASE', 'RENAME']

If you want to know what each of these does, get their individual settings:

In [9]:
# As an example:
method.default_morph_settings("CATEGORISE")

{'name': 'CATEGORISE',
 'title': 'Categorise',
 'type': 'morph',
 'description': 'Convert row-level categories into column categorisations.',
 'structure': ['rows', 'column_names']}

The standard way of writing a morph is:

    ["MORPH_NAME", [rows], [columns], [column_names]]

The presence of the parameters - rows, columns, column_names - is specified in the structure of the morph type.

- __rows__: address the row number of the table. These will remain immutable, so the row number is the row number.
- __columns__: these are the actual column names at that point of the morph. These are mutable and change as you morph.
- __column_names__: these are optional, but you can provide root names that will be used in creating new columns.

When you add your first morph, whyqd will automatically add in `DEBLANK` and `DEDUPE`. Figuring out the exact order of the morphs is trial-and-error, but nothing is committed and you 
can undo and redo as you require.

A few tools to help you … `input_dataframe(id)` returns the complete pandas dataframe for your source data. It will also run all of the morphs up to that point, allowing you to see
the impact of your morph order. You can then explore your data and figure out what you need to do next. We only have one file, and earlier you will have seen the `id` created to 
reference that file:

In [10]:
# Use _id, or some other variable, since `id` is the name of a Python built-in function
_id = method.input_data[0]["id"]
df = method.input_dataframe(_id)
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Monitoring human development: enlarging people's choices …,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,3 Human and income poverty Developing countries,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


If you get to a point where you’re tangled entirely, `reset_input_data_morph(id)` will remove all the morphs and let you start again:

In [11]:
method.reset_input_data_morph(_id)

I encourage you to explore this dataset and see exactly the decisions made, but here’s my approach:

- Let’s rebase the table to the top of the actual data:

In [12]:
method.add_input_data_morph(_id, ["REBASE", 11])

- We can get rid of rows below 144 to the end of the table. These contain metadata that you may want to keep and publish separately:

In [13]:
# We get the value of the last index item, then add 1 to create the range
rows = [int(i) for i in np.arange(144, df.index[-1]+1)]
method.add_input_data_morph(_id, ["DELETE", rows])
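If it helps to picture what these two steps do, here is a rough analogue in plain pandas on a toy table. It assumes that "rebasing" means promoting a chosen row to the header and discarding everything above it; that is my reading of the step, not a statement of how _whyqd_ implements it. The table itself is made up for illustration, apart from the two country rows, which echo the source data.

```python
import pandas as pd

# A toy table with a title row, a blank row, a header row, data, and a trailing note
raw = pd.DataFrame(
    [
        ["Table title", None, None],
        [None, None, None],
        ["Country", "Rank", "Value"],
        ["Singapore", 7, 5.2],
        ["Cyprus", "..", ".."],
        ["Notes: source metadata", None, None],
    ]
)

header_row = 2
# Keep only the rows below the header, then promote the header row to column names
df = raw.iloc[header_row + 1:].copy()
df.columns = raw.iloc[header_row].tolist()
# Row labels are unchanged, so trailing metadata can be dropped by its original row number
df = df.drop(index=[5])
```

Note that the surviving rows keep their original row numbers (3 and 4 here), which mirrors the point made earlier that row numbers stay fixed while you morph.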

- Now let's name the remaining columns based on their original names. Also note that the reference columns were previously unlabeled:

In [15]:
columns = [
    "HDI rank",
    "Country",
    "Human poverty index (HPI-1) - Rank",
    "Reference 1",
    "Human poverty index (HPI-1) - Value (%)",
    "Probability at birth of not surviving to age 40 (% of cohort) 2000-05",
    "Reference 2",
    "Adult illiteracy rate (% aged 15 and older) 1995-2005",
    "Reference 3",
    "Population not using an improved water source (%) 2004",
    "Reference 4",
    "Children under weight for age (% under age 5) 1996-2005",
    "Reference 5",
    "Population below income poverty line (%) - $1 a day 1990-2005",
    "Reference 6",
    "Population below income poverty line (%) - $2 a day 1990-2005",
    "Reference 7",
    "Population below income poverty line (%) - National poverty line 1990-2004",
    "Reference 8",
    "HPI-1 rank minus income poverty rank"
]
method.add_input_data_morph(_id, ["RENAME", columns])

- We haven't finished, but we can have a quick look at what we've achieved so far:

In [16]:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,HDI rank,Country,Human poverty index (HPI-1) - Rank,Reference 1,Human poverty index (HPI-1) - Value (%),Probability at birth of not surviving to age 40 (% of cohort) 2000-05,Reference 2,Adult illiteracy rate (% aged 15 and older) 1995-2005,Reference 3,Population not using an improved water source (%) 2004,Reference 4,Children under weight for age (% under age 5) 1996-2005,Reference 5,Population below income poverty line (%) - $1 a day 1990-2005,Reference 6,Population below income poverty line (%) - $2 a day 1990-2005,Reference 7,Population below income poverty line (%) - National poverty line 1990-2004,Reference 8,HPI-1 rank minus income poverty rank
14,HIGH HUMAN DEVELOPMENT,,,,,,,,,,,,,,,,,,,
15,21,"Hong Kong, China (SAR)",..,,..,1.5,e,..,,..,,..,,..,,..,,..,,..
16,25,Singapore,7,,5.2,1.8,,7.5,,0,,3,,..,,..,,..,,..
17,26,Korea (Republic of),..,,..,2.5,,1.0,,8,,..,,<2,,<2,,..,,..
18,28,Cyprus,..,,..,2.4,,3.2,,0,,..,,..,,..,,..,,..


- If you look through the data, you’ll see that there are rows that define categories for data that appear below it. Here `HIGH HUMAN DEVELOPMENT` is an `HDI Category` and all the
  rows between this row and the next category `MEDIUM HUMAN DEVELOPMENT` form part of that category. What we need to do is “rotate” these rows into a column and assign the category
  to the affected data:

In [17]:
# Get the categorical data row indices
hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]
rows = df[df["HDI rank"].isin(hdi_categories)].index
method.add_input_data_morph(_id, ["CATEGORISE", rows, "HDI category"])

ValueError: Task morph `CATEGORISE` has invalid structure `['rows', 'column_names']`.

Hmm, we got a `ValueError`, meaning our `rows` are not what was expected. This is because `pandas` returns not a `list` but an `Int64Index`. Let's correct that.

In [18]:
type(rows)

pandas.core.indexes.numeric.Int64Index

In [19]:
method.add_input_data_morph(_id, ["CATEGORISE", list(rows), "HDI category"])
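For intuition, the sketch below shows both ideas in plain pandas on a toy table: converting an index to a plain list of integers, and "rotating" category rows into a column by carrying each label down to the rows beneath it. This is an analogue under my own assumptions, not _whyqd_'s implementation, and the last row ("Example country", rank 99) is invented purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "HDI rank": ["HIGH HUMAN DEVELOPMENT", 21, 25, "MEDIUM HUMAN DEVELOPMENT", 99],
        "Country": [None, "Hong Kong, China (SAR)", "Singapore", None, "Example country"],
    }
)
hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]

is_category = df["HDI rank"].isin(hdi_categories)
# `.index` is a pandas Index object; `.tolist()` converts it to plain Python integers
rows = df[is_category].index.tolist()

# Keep the label only on the category rows, carry it down, then drop the category rows
df["HDI category"] = df["HDI rank"].where(is_category).ffill()
df = df[~is_category]
```

The exact class name pandas reports for an index varies between pandas versions, so converting to a list before passing it on is the more portable habit.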

- Most of these columns are actually indicators and can be pivoted into an `Indicator` column with the `Values` assigned into a single column. This is called a `MELT`:

In [20]:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,HDI rank,Country,Human poverty index (HPI-1) - Rank,Reference 1,Human poverty index (HPI-1) - Value (%),Probability at birth of not surviving to age 40 (% of cohort) 2000-05,Reference 2,Adult illiteracy rate (% aged 15 and older) 1995-2005,Reference 3,Population not using an improved water source (%) 2004,...,Children under weight for age (% under age 5) 1996-2005,Reference 5,Population below income poverty line (%) - $1 a day 1990-2005,Reference 6,Population below income poverty line (%) - $2 a day 1990-2005,Reference 7,Population below income poverty line (%) - National poverty line 1990-2004,Reference 8,HPI-1 rank minus income poverty rank,HDI category
15,21,"Hong Kong, China (SAR)",..,,..,1.5,e,..,,..,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
16,25,Singapore,7,,5.2,1.8,,7.5,,0,...,3,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
17,26,Korea (Republic of),..,,..,2.5,,1.0,,8,...,..,,<2,,<2,,..,,..,HIGH HUMAN DEVELOPMENT
18,28,Cyprus,..,,..,2.4,,3.2,,0,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT
19,30,Brunei Darussalam,..,,..,3.0,,7.3,,..,...,..,,..,,..,,..,,..,HIGH HUMAN DEVELOPMENT


In [21]:
# Select all the columns to be melted
columns = [
    "HDI rank",
    "Human poverty index (HPI-1) - Rank",
    "Human poverty index (HPI-1) - Value (%)",
    "Probability at birth of not surviving to age 40 (% of cohort) 2000-05",
    "Adult illiteracy rate (% aged 15 and older) 1995-2005",
    "Population not using an improved water source (%) 2004",
    "Children under weight for age (% under age 5) 1996-2005",
    "Population below income poverty line (%) - $1 a day 1990-2005",
    "Population below income poverty line (%) - $2 a day 1990-2005",
    "Population below income poverty line (%) - National poverty line 1990-2004",
    "HPI-1 rank minus income poverty rank"
]
method.add_input_data_morph(_id, ["MELT", columns, ["Indicator Name", "Indicator Value"]])

- Similarly, the `References` can be pivoted into a separate column as well:

In [22]:
columns = [
    "Reference 1",
    "Reference 2",
    "Reference 3",
    "Reference 4",
    "Reference 5",
    "Reference 6",
    "Reference 7",
    "Reference 8",
]
method.add_input_data_morph(_id, ["MELT", columns, ["Reference Name", "Reference"]])
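If you have not met a melt before, this small pandas example shows the reshaping on two of the indicator columns and two countries from the table above. It uses `DataFrame.melt` directly and is only meant to illustrate the wide-to-long idea; it makes no claim about how _whyqd_ performs the morph internally.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "Country": ["Singapore", "Cyprus"],
        "Adult illiteracy rate (% aged 15 and older) 1995-2005": [7.5, 3.2],
        "Population not using an improved water source (%) 2004": [0, 0],
    }
)

# Each indicator column becomes a pair of (name, value) entries per country
long = wide.melt(
    id_vars=["Country"],
    value_vars=[
        "Adult illiteracy rate (% aged 15 and older) 1995-2005",
        "Population not using an improved water source (%) 2004",
    ],
    var_name="Indicator Name",
    value_name="Indicator Value",
)
print(long)
```

Two rows and two indicator columns become four rows: one for every combination of country and indicator.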

- Let’s add in a final `DEBLANK` just to be sure:

In [23]:
method.add_input_data_morph(_id, ["DEBLANK"])

Get the current implementation of the morphs and have a look:

In [24]:
df = method.input_dataframe(_id)
df.head()

Unnamed: 0,HDI category,Indicator Name,Indicator Value,Country,Reference Name,Reference
0,HIGH HUMAN DEVELOPMENT,HDI rank,21,"Hong Kong, China (SAR)",Reference 1,
1,HIGH HUMAN DEVELOPMENT,HDI rank,25,Singapore,Reference 1,
2,HIGH HUMAN DEVELOPMENT,HDI rank,26,Korea (Republic of),Reference 1,
3,HIGH HUMAN DEVELOPMENT,HDI rank,28,Cyprus,Reference 1,
4,HIGH HUMAN DEVELOPMENT,HDI rank,30,Brunei Darussalam,Reference 1,


Cool, huh? Let’s continue with the `merge` step. If we had more than one input data file, these would need to be consolidated into a single working data file via a merge. 
_whyqd_ will iteratively join files in a list, adding the 2nd to the 1st, then the 3rd, etc.

What we need to do is decide on the order, and identify a column that can be used to uniquely cross-reference rows in each file and link them together.
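To see why the key column matters, here is a small pandas sketch joining two toy tables on `Country`. The choice of an outer join and the `validate="one_to_one"` check are my assumptions for illustration; the text here does not say which join strategy _whyqd_ uses.

```python
import pandas as pd

left = pd.DataFrame({"Country": ["Singapore", "Cyprus"], "Adult illiteracy rate": [7.5, 3.2]})
right = pd.DataFrame({"Country": ["Cyprus", "Singapore"], "No improved water source": [0, 0]})

# `validate` raises an error if the key does not uniquely identify rows in both tables
merged = left.merge(right, on="Country", how="outer", validate="one_to_one")
print(merged)
```

The row order of the two inputs differs, but because `Country` uniquely identifies each row, every value lands against the right country. If the key were not unique, the `validate` check would stop the merge rather than silently duplicating rows.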

We only have one file, but - technically - there's nothing stopping you adding another file from our sample data and so merging two files together. I'll set this as an exercise for you. 
Have a look at the [whyqd tutorial](https://whyqd.readthedocs.io/en/latest/tutorial.html#organise-and-merge-input-data), and read the `merge` help:

In [25]:
print(method.help("merge"))


`merge` will join, in order from right to left, your input data on a common column.

To add input data, where `input_data` is a filename, or list of filenames:

	>>> method.add_input_data(input_data)

To remove input data, where `id` is the unique id for that input data:

	>>> method.remove_input_data(id)

Prepare an `order_and_key` list, where each dict in the list has:

	{{id: input_data id, key: column_name for merge}}

Run the merge by calling (and, optionally - if you need to overwrite an existing merge - setting
`overwrite_working=True`):

	>>> method.merge(order_and_key, overwrite_working=True)

To view your existing `input_data`:

	>>> method.input_data


Data id: 799bbee9-6031-4cbf-bacb-6b6c98a43d5b
Original source: data/lesson-spreadsheet/HDR 2007-2008 Table 03.xlsx

  ..  Unnamed: 0                                         Unnamed: 1    Unnamed: 2    Monitoring human development: enlarging people's choices …    Unnamed: 4    Unnamed: 5    Unnamed: 6    Unnamed: 7    Unnamed:

In [26]:
%time method.merge(overwrite_working=True)

Wall time: 5.68 s


You'll see that this step may have taken a little longer than you normally expect simply because the data are now restructured and all your changes are fixed into a new `working_data` file.

The Jupyter magic command `%time` is a really useful way to see how long that step takes. Still, compare the time it took to write a few commands and process them with the time it took
you to manually cut-and-paste data in your source Excel spreadsheet back in Lesson 1.1.
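`%time` only works inside Jupyter or IPython. If you move your code into an ordinary Python script, a small helper like the one sketched below gives you a similar wall-time measurement. The name `timed` is hypothetical, and the built-in `sum` simply stands in for whatever step you want to measure.

```python
import time


def timed(func, *args, **kwargs):
    """Call `func` and return its result with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


result, elapsed = timed(sum, range(1_000_000))
print(f"Wall time: {elapsed:.3f} s")
```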

The next step is `structure`. 

This is the part of the wrangling process where, depending on the scale of what you’re up to, you reach for Excel, OpenRefine or some commercial alternative. These are sometimes outside
of your workflow, or introduce the potential for human error. They're also very limited to the dataset you're working on. _whyqd_ is for repeatable processing. Next year, when these data
are updated, we will want to import them again. However, they might not be in the same format since a human being prepared and uploaded these data. That person doesn’t know about your 
use-case and probably doesn’t care. Maybe they change some column names. These are simple changes and all that’s required is a minor adjustment to the method to run this process again.

This is the core of the wrangling process and is the process where you define the `actions` which must be performed to restructure your working data.

In [27]:
print(method.help("structure"))


`structure` is the core of the wrangling process and is the process where you define the actions
which must be performed to restructure your working data.

Create a list of methods of the form:

	{
		"schema_field1": ["action", "column_name1", ["action", "column_name2"]],
		"schema_field2": ["action", "column_name1", "modifier", ["action", "column_name2"]],
	}

The format for defining a `structure` is as follows::

	[action, column_name, [action, column_name]]

e.g.::

	["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]]

This permits the creation of quite expressive wrangling structures from simple building
blocks.

The schema for this method consists of the following terms:

['country_name', 'hdi_category', 'indicator_name', 'reference', 'year', 'values']

The actions:

['CALCULATE', 'CATEGORISE', 'JOIN', 'NEW', 'ORDER', 'ORDER_NEW', 'ORDER_OLD', 'RENAME']

The columns from your working data:

['HDI category', 'Indicator Name', 'Indicator Value', 'Country', 'Reference Name', 'Ref

We have a very simple use-case, since our `morphs` took care of most of the problems we may have. All we need to do is connect our working data to our schema that we developed
right at the beginning. This explicitly links data to structure and permits validation, as well as programmatic understanding of our metadata.

The `help` tells us our schema columns:

    ['country_name', 'hdi_category', 'indicator_name', 'reference', 'year', 'values']
    
And the structure for each `action` will be the same:

    "schema_column": ["RENAME", "working_data_column"]
    
In our example, we don't have a `year` column, but you may in yours:

In [28]:
structure = {
    "country_name": ["RENAME", "Country"],
    "hdi_category": ["RENAME", "HDI category"],
    "indicator_name": ["RENAME", "Indicator Name"],
    "reference": ["RENAME", "Reference"],
    "values": ["RENAME", "Indicator Value"],
}
# Note the `**` at the beginning of the parameter name
# This "unpacks" the dictionary so that all the terms are visible to the function
method.set_structure(**structure)
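If the `**` syntax is new to you, this short example shows what unpacking does. The function `set_structure` below is a hypothetical stand-in that simply returns the keyword arguments it receives; it is not the _whyqd_ method of the same name.

```python
def set_structure(**kwargs):
    """Hypothetical stand-in: return the keyword arguments exactly as received."""
    return kwargs


structure = {
    "country_name": ["RENAME", "Country"],
    "values": ["RENAME", "Indicator Value"],
}

# Unpacking the dictionary ...
unpacked = set_structure(**structure)
# ... is the same as writing each key out as a named parameter
explicit = set_structure(
    country_name=["RENAME", "Country"],
    values=["RENAME", "Indicator Value"],
)
assert unpacked == explicit
```

Building the dictionary first and unpacking it keeps the definition of your structure separate from the call that applies it, which makes it easier to save, review, or reuse.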

Despite all this, _whyqd_ has preserved your source data. Now it's time to create your data transformation and save it:

In [29]:
method.transform(overwrite_output=True)
FILENAME = "hdi_report_exercise"
method.save(directory, filename=FILENAME, overwrite=True)

You can review your methods as a JSON output using `.settings` for the entire method, or `.input_data_morphs(_id)` for the morphs themselves:

In [30]:
method.input_data_morphs(_id)

[{'c98c2729-28f1-4a5f-b8d7-4c38558b9d3f': ['DEBLANK']},
 {'1c07de2e-e6e5-42ea-b168-77ad790c4fdd': ['DEDUPE']},
 {'4133ed12-557a-4d4e-bf3b-7f06807fbe47': ['REBASE', [11]]},
 {'019437bf-a49c-44d9-827f-b6efe4554847': ['DELETE',
   [144,
    145,
    146,
    147,
    148,
    149,
    150,
    151,
    152,
    153,
    154,
    155,
    156,
    157,
    158,
    159,
    160,
    161,
    162,
    163,
    164,
    165,
    166,
    167,
    168,
    169,
    170,
    171,
    172,
    173,
    174,
    175,
    176,
    177,
    178,
    179]]},
 {'21d74d44-4cc0-4b0a-a98c-da60cc22b539': ['RENAME',
   ['HDI rank',
    'Country',
    'Human poverty index (HPI-1) - Rank',
    'Reference 1',
    'Human poverty index (HPI-1) - Value (%)',
    'Probability at birth of not surviving to age 40 (% of cohort) 2000-05',
    'Reference 2',
    'Adult illiteracy rate (% aged 15 and older) 1995-2005',
    'Reference 3',
    'Population not using an improved water source (%) 2004',
    'Reference 4',

This completes two steps: 

- Creating a schema and an auditable method to restructure your data;
- Directly connecting the schema to the data, and performing the transformation.

However, does it validate?

---

## 3.3 Data validation and its meaning

If you want to check whether your data validates in _whyqd_ you can just run one command:

In [31]:
%time method.validates

Wall time: 4.3 s


True

It does, which is unsurprising since the code ran and _whyqd_ won't run the transformation if it won't work. However, what does _validation_ mean, especially in the context of requirements 
for quality machine-readable data?

Wikipedia - the definitive source - [declares](https://en.wikipedia.org/wiki/Data_validation):

> __Data validation__ is the process of ensuring data have undergone data cleansing to ensure they have data quality, that is, that they are both correct and useful. 
It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for correctness, meaningfulness, and security of data that 
are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by the inclusion of explicit application program validation 
logic of the computer and its application.

If you read through that Wikipedia article, you'll see that there is a vast range of different types of validation checks, but no universal standard. If _I_ declare a dataset
"valid" it may not be valid for _your_ requirements. 

As it stands, _whyqd_ declared our data valid since it was able to read and manipulate the file into a target schema. It fits. But _whyqd_ assumes that it is not the final point for
your use, which could be importing the data into a database, or using it in an application. You may need to transform the data all over again when you use them, and you will run your
own validation at each step to ensure your own software works.

From a data publisher perspective, it's difficult to predict all the use-cases for your data, and anything you do to "force" data into compliance runs the risk of degrading the 
data accuracy. Consider the following problems in our source data:

- __Data gaps__: a range of terms, such as `..` and `.`, are used to reflect a lack of data. But isn't that still information? These terms tell us directly that, in 2008, we __did not__
  __know__ how many people were illiterate, or how many people didn't have access to clean, running water in some countries. Replacing this with `nan` (not-a-number) doesn't change that 
  lack of information, although it might allow your data to pass a "is this column only numbers" test ... is that good? Wouldn't some users like to know directly?
- __Data approximations__: Sometimes data are collected as ranges, e.g. `20-30`, or `<2`. This may be for data anonymisation (we deliberately bucket people into ranges to ensure that
  individuals can't be identified), or because our data are simply not sufficiently accurate (limits of the mechanism by which we recorded the data). This, also, is information. Forcing
  these ranges to an absolute point may satisfy a data check, but at the risk of losing information on the limits of the data gathering process.
- __Date formats__: When is 10 April also 4 October? When you're trying to figure out whether you're working with American or global date formats, e.g. `10-4-1990` vs `4-10-1990`. Date ranges
  can also be a problem. Should you force `2007-2008` to be a specific year?
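
The date-format problem is easy to demonstrate. The following is a minimal sketch, using only Python's standard library, that tries both day-first and month-first readings of the same string. The helper name `candidate_dates` is hypothetical and is not part of _whyqd_ or `pandas`:

```python
from datetime import datetime

def candidate_dates(text):
    """Return each date `text` could represent under day-first and month-first ordering."""
    formats = {
        "day-month-year": "%d-%m-%Y",
        "month-day-year": "%m-%d-%Y",
    }
    candidates = {}
    for label, fmt in formats.items():
        try:
            candidates[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # The string is not a valid date under this ordering
            continue
    return candidates

# Both readings are valid dates, so the string alone cannot tell us which was meant
ambiguous = candidate_dates("10-4-1990")
# There is no 25th month, so only the day-first reading survives
unambiguous = candidate_dates("25-4-1990")

print(ambiguous)
print(unambiguous)
```

When more than one reading survives, only the metadata published alongside the data can settle which format was intended.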

This type of data validation is certainly critical _at the point of use_, but is it important _at the point of publication_? How far should you go in validating data for publication?

This is where different data scientists may differ. My view is as follows:

> __Data validation__ is the process of ensuring a complete audit trail from source data to publication, where the process for data transformation is documented in a series of metadata
files that transcribe the technical steps which produced the data, and all the definitions required to understand the terms used, data types and structure, and any assumptions or 
deliberate decisions made in preparing the data for publication.

### 3.3.1 Validation and data manipulation

What happens, then, if you need to know what your data contain so that you can work with them? You can use `pandas` to explore your data. Our new output file is available to us:

In [33]:
import pandas as pd
import numpy as np

source = "data/lesson-programmatic/output_729f1bee-da2c-4497-901b-8098aab3b99a.csv"

df = pd.read_csv(source)
df.head()

| | year | country_name | hdi_category | indicator_name | reference | values |
|---|---|---|---|---|---|---|
| 0 | NaN | Hong Kong, China (SAR) | HIGH HUMAN DEVELOPMENT | HDI rank | NaN | 21 |
| 1 | NaN | Singapore | HIGH HUMAN DEVELOPMENT | HDI rank | NaN | 25 |
| 2 | NaN | Korea (Republic of) | HIGH HUMAN DEVELOPMENT | HDI rank | NaN | 26 |
| 3 | NaN | Cyprus | HIGH HUMAN DEVELOPMENT | HDI rank | NaN | 28 |
| 4 | NaN | Brunei Darussalam | HIGH HUMAN DEVELOPMENT | HDI rank | NaN | 30 |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2079 entries, 0 to 2078
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            0 non-null      float64
 1   country_name    2079 non-null   object 
 2   hdi_category    1892 non-null   object 
 3   indicator_name  2079 non-null   object 
 4   reference       693 non-null    object 
 5   values          2079 non-null   object 
dtypes: float64(1), object(5)
memory usage: 97.6+ KB


We expect most of this, since most of the columns are `strings`, but the `values` column would - we may hope - be a `float` or an `int`. Instead it's also an `object`. We know that 
our data contain `<2`, `..` and other artifacts. We can use `pandas` to simply `replace` what we don't like. We can check for specific validation concerns. 

But now we're in the realms not of validation, but of manipulation. On the principle that we should know what our data contain, let's have a deeper look. But just because we know what
is there doesn't mean we are forced to change it.
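
One way to look without changing anything is to ask `pandas` which entries cannot be read as numbers, and to count them. This is a minimal sketch under stated assumptions: the `sample` frame is a hypothetical stand-in for the `values` column of the output file, and the original column is left untouched:

```python
import pandas as pd

# Hypothetical sample standing in for the `values` column of the output file
sample = pd.DataFrame({"values": ["21", "12.5", "..", "<2", ".", "30"]})

# Coerce to numbers where possible; anything unreadable becomes NaN in this copy only
numeric = pd.to_numeric(sample["values"], errors="coerce")

# Keep the original text of every entry that could not be read as a number
non_numeric = sample.loc[numeric.isna(), "values"]

# Count each distinct artifact; the source column itself is not modified
artifact_counts = non_numeric.value_counts()
print(artifact_counts)
```

The same pattern could be applied to `df["values"]` to list the artifacts actually present before deciding whether any of them should be altered.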

We need a new Python package. Just as you installed `whyqd`, you need to install `PandasSchema`. Open your conda terminal, ensure you're in the correct development environment, and:

    pip install pandas_schema
    
The [documentation](https://tmiguelt.github.io/PandasSchema/) offers a range of methods for analysing and validating your data against a schema. We'll touch on this lightly.

`PandasSchema` has a number of validators we might want to use:

- __InListValidation__: Checks that each element in this column is contained within a list of possibilities
- __DateFormatValidation__: Checks that each element in this column is a valid date according to a provided format string
- __InRangeValidation__: Checks that each element in the series is within a given numerical range
- __IsDistinctValidation__: Checks that every element of this column is different from each other element
- __LeadingWhitespaceValidation__: Checks that there is no leading whitespace in this column
- __TrailingWhitespaceValidation__: Checks that there is no trailing whitespace in this column
- __IsDtypeValidation__: Checks that a series has a certain numpy dtype (i.e. whether `object`, `int`, `float`, etc.)

You can get a specific `numpy` `type` like this:

    np.dtype(float)
    np.dtype(int)

Let's see how this goes:

In [36]:
from pandas_schema import Column, Schema
from pandas_schema.validation import LeadingWhitespaceValidation, TrailingWhitespaceValidation, IsDtypeValidation, InListValidation

# We'll test only these columns
columns = ["country_name", "hdi_category", "values"]
# And these categories
hdi_categories = ["HIGH HUMAN DEVELOPMENT", "MEDIUM HUMAN DEVELOPMENT", "LOW HUMAN DEVELOPMENT"]

schema = Schema([
    Column("country_name", [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column("hdi_category", [InListValidation(hdi_categories)]),
    Column("values", [IsDtypeValidation(np.dtype(float)), IsDtypeValidation(np.dtype(int))])
])

errors = schema.validate(df[columns])

print(F"Number of errors: {len(errors)}")
# Just the first 10
for error in errors[:10]:
    print(error)

Number of errors: 189
The column values has a dtype of object which is not a subclass of the required type float64
The column values has a dtype of object which is not a subclass of the required type int32
{row: 112, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 113, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 114, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 115, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 116, column: "hdi_category"}: "nan" is not in the list of legal options (HIGH HUMAN DEVELOPMENT, MEDIUM HUMAN DEVELOPMENT, LOW HUMAN DEVELOPMENT)
{row: 117, column: "hdi_category"}

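We have a lot of errors: missing categorical data in `hdi_category` and non-numeric data in `values`. The count is consistent with `df.info()`: 187 rows have no `hdi_category` (2079 - 1892), and the
remaining two errors are the dtype checks on `values`. Note that a column cannot be both `float` and `int`, so requiring both guarantees at least one dtype error even for clean numeric data. 
Some of this is expected. Some is not. Perhaps we need to go back and tweak our method for transforming the data?

However, the hard work is getting data into a place where a few simple programmatic fixes can manipulate our data into any format that works for the user.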

<div class="alert alert-block alert-warning">
    <p><b>Never trust source data:</b> any data that comes from outside your work environment cannot be trusted until proven otherwise. It does not matter if the publisher is <i>trustworthy</i> or claims their data <i>validates</i>. Until definitively proven to validate by your own systems you can't simply import it into your systems untested.</p>
    <p>A publisher supports their users' workflow by ensuring data is machine-readable, well-structured, that all terms are clearly defined, and there is metadata for everything. After that, trust, but verify.</p>
</div>

As an exercise, describe what decisions you should make to either fix or leave these data as is.

### 3.3.2 Data publication and citation

A `citation` is a reference to a source. In academic publishing, data citation requires a special set of fields, with:

- __authors__: a list of author names
- __title__: the full study title
- __repository__: the organisation, or distributor, responsible for hosting the data and metadata
- __doi__: the persistent [digital object identifier](https://en.wikipedia.org/wiki/Digital_object_identifier) (DOI) for the repository

There are numerous styles of citation, but the metadata are more important than the structure, and ensure that appropriate credit is given to the data creators and maintainers, and that
there is evidence as to the data probity.

_whyqd_ offers the following as additional evidence:

- __hash__: BLAKE2b hash of output data
- __input data__: a list of input data by original source, and the source hash

Those of you familiar with Dataverse’s [universal numerical fingerprint](http://guides.dataverse.org/en/latest/developers/unf/index.html) may be wondering where it is. _whyqd_ 
similarly produces a unique hash for each data source, including inputs, working data, and outputs. The hash is based on [BLAKE2b](https://en.wikipedia.org/wiki/BLAKE_(hash_function)), 
which is sufficiently widely available to ensure you can run it as required.
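
BLAKE2b ships with Python's standard `hashlib` module, so anyone can compute a file checksum independently. The following is a minimal sketch, not _whyqd_'s own implementation: the function name `blake2b_checksum` is hypothetical, and it is an assumption that _whyqd_ hashes the raw file bytes with the default digest size in the same way. The temporary file stands in for a real data file:

```python
import hashlib
import os
import tempfile

def blake2b_checksum(path, chunk_size=65536):
    """Return the BLAKE2b hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical stand-in for a data file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"country_name,values\nSingapore,25\n")

checksum = blake2b_checksum(tmp.name)
os.remove(tmp.name)
print(checksum)
```

With the default digest size of 64 bytes, the result is a 128-character hexadecimal string. Changing a single byte of the file produces a different checksum.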

Anyone with a copy of the method and input data can automatically rerun the entire method file to produce a "new" version of the output data and so confirm more directly the 
validity of the data. The citation is simply the data we provided as part of our schema, along with the hash values that identify the data files (source and output):

In [37]:
for l in method.citation.split(","):
    print(l)

2020-05-08
 UN Human Development Report 2007 - 2008
 29ea76d29f4756c0669a57009a46e14f724c75ee4e9df7058b0ed557179011647baa0e8317c5fde92c59cc8e9e8471186ce67830f56f96e3ed579522953eb7f9
 [input sources: data/lesson-spreadsheet/HDR 2007-2008 Table 03.xlsx
 7d95ebdb36966c7b97b7b4e578cac70ea89463e95f64ccada60cf15a76f29c68b56f64aca9e28b8042e3c9ce37522fc03a13d1a1e8b05eac6edf26e09e5c32d5]


Now you are ready to publish. Package up the following:

- __Input data__: the files you used as inputs. Sometimes you won't publish these, especially if the source data is confidential and you have removed fields that should not be in the
  public domain;
- __Schema__: the metadata schema file, as a JSON or as a table;
- __Method__: if you used _whyqd_, then release the method as well;
- __Output data__: the output data you prepared;
- __Metadata__: if you have other metadata, such as references, specific definitions, or descriptions of assumptions made, release these as well.

Then head over to your organisation's data management platform, upload and publish.

---

## 3.4 Lesson tutorial

<div class="alert alert-block alert-success">
    <p><b>Tutorial:</b></p>
    <p>Complete the processing of the file you started working with in Lesson 1.</p>
    <ul>
        <li><b>Schema</b>: Define a <i>whyqd</i> Schema to describe and specify validation criteria for your dataset.</li>
        <li><b>Morph and Method</b>: Define the appropriate morphs and method required to transform your source data. Run the transformation.</li>
        <li><b>Validation and Manipulation</b>: Use <i>PandasSchema</i> to review your output data and decide how you want to proceed if you have any errors. You can 
            choose to fix those errors, or simply write a justification for leaving the data as-is. It's up to you, but you do need to justify your decision.
        </li>
        <li><b>Citation</b>: Present your citation, and list the files, metadata and information you intend to publish as part of this exercise.</li>
    </ul>
</div>

Please complete the tutorial before continuing with this series.