Merge pull request #12 from whythawk/refactor-workflow
Refactor workflow
turukawa committed May 10, 2023
2 parents 1084f96 + b09480f commit 96b77c8
Showing 434 changed files with 68,142 additions and 57,273 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -44,8 +44,10 @@ pip-log.txt
# Whyqd
###############################
__pycache__/
*.komodoproject
.pytest_cache/
deprecated/
.ipynb_checkpoints/
*.ipynb
.env

setup and distribution.txt
16 changes: 8 additions & 8 deletions .readthedocs.yml
@@ -6,20 +6,20 @@
# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py

# Using test server while Python 3.9 not supported
# https://github.com/readthedocs/readthedocs-docker-images/pull/159#issuecomment-785048185
build:
  image: testing
  os: "ubuntu-20.04"
  tools:
    python: "3.9"

# Build documentation for mkdocs
mkdocs:
  configuration: mkdocs.yml

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.9
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs
    - requirements: docs-requirements.txt
1 change: 1 addition & 0 deletions CHANGELOG
@@ -1,3 +1,4 @@
- 1.0.0: Complete refactor with breaking changes. See documentation changelog.
- 0.6.2: Fix for parsing ambiguity errors, plus Excel row-count exceeded on save.
- 0.6.1: Minor correction for row count.
- 0.6.0: Ensuring consistent optional column 'name' change for header row, plus new row-count in input and working data.
175 changes: 90 additions & 85 deletions README.md
@@ -6,132 +6,137 @@

## What is it?

**whyqd** provides an intuitive method for restructuring messy data to conform to a standardised
metadata schema. It supports data managers and researchers looking to rapidly, and continuously,
normalise any messy spreadsheets using a simple series of steps. Once complete, you can import
wrangled data into more complex analytical systems or full-feature wrangling tools.
> More research, less wrangling
It aims to get you to the point where you can perform automated data munging prior to
committing your data into a database, and no further. It is built on Pandas, and plays well with
existing Python-based data-analytical tools. Each raw source file will produce a json schema and
method file which defines the set of actions to be performed to produce refined data, and a
destination file validated against that schema.
[**whyqd**](https://whyqd.com) (/wɪkɪd/) is a curatorial toolkit intended to produce well-structured and predictable
data for research analysis.

**whyqd** ensures complete audit transparency by saving all actions performed to restructure
your input data to a separate json-defined methods file. This permits others to read and scrutinise
your approach, validate your methodology, or even use your methods to import data in production.
It provides an intuitive method for creating schema-to-schema crosswalks for restructuring messy data to conform to a
standardised metadata schema. It supports rapid and continuous transformation of messy data using a simple series of
steps. Once complete, you can import wrangled data into more complex analytical or database systems.

Once complete, a method file can be shared, along with your input data, and anyone can
import **whyqd** and validate your method to verify that your output data is the product of these
inputs.
**whyqd** plays well with your existing Python-based data-analytical tools. It uses [Ray](https://www.ray.io/) and
[Modin](https://modin.readthedocs.io/) as a drop-in replacement for [Pandas](https://pandas.pydata.org/) to support
processing of large datasets, and [Pydantic](https://pydantic-docs.helpmanual.io/) for data models.
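
As a quick, hedged illustration of what that drop-in replacement means in practice (this sketch is not part of **whyqd**'s API, and the file and column names are placeholders drawn from the output table below), existing pandas analysis code typically only needs to change its import:

```python
# Hedged sketch: Modin exposes the pandas API, so downstream pandas code usually
# runs unchanged after swapping the import. "output.csv" is a placeholder path.
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("output.csv")
print(df.groupby("indicator_name")["values"].describe())
```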

[Read the docs](https://whyqd.readthedocs.io/en/latest/) and there are two worked tutorials to demonstrate
how you can use `whyqd` to support source data curation transparency:
Each definition is saved as a [JSON Schema-compliant](https://json-schema.org/) file. This permits others to read and
scrutinise your approach, validate your methodology, or even use your crosswalks to import and transform data in
production.
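
Because a saved definition is plain JSON, scrutiny doesn't require **whyqd** at all; a reviewer can inspect it with the standard library. This is a hedged illustration with a placeholder filename, not a file **whyqd** creates by default:

```python
# Hedged illustration: read a saved definition as ordinary JSON and list its keys.
import json
from pathlib import Path

definition = json.loads(Path("crosswalk.json").read_text())
print(sorted(definition.keys()))
```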

- [Local-government data](https://whyqd.readthedocs.io/en/latest/tutorial_local_government_data.html)
- [Data produced by Cthulhu](https://whyqd.readthedocs.io/en/latest/tutorial_cthulhu_data.html)
Once complete, a transform file can be shared, along with your input data, and anyone can import and validate your
crosswalk to verify that your output data is the product of these inputs.

## Why use it?

If all you want to do is test whether your source data are even useful, spending days or weeks slogging through data
restructuring could kill a project. If you already have a workflow and established software which includes Python and
pandas, having to change your code every time your source data changes is really, really frustrating.

If you want to go from a Cthulhu dataset like this:
If you want to go from a [Cthulhu dataset](https://whyqd.readthedocs.io/tutorials/tutorial3) like this:

![UNDP Human Development Index 2007-2008: a beautiful example of messy data.](https://raw.githubusercontent.com/whythawk/whyqd/master/docs/images/undp-hdi-2007-8.jpg)

*UNDP Human Development Index 2007-2008: a beautiful example of messy data.*

To this:

| | country_name | indicator_name | reference | year | values |
| --: | :--------------------- | :------------- | :-------- | ---: | -----: |
| 0 | Hong Kong, China (SAR) | HDI rank | e | 2008 | 21 |
| 1 | Singapore | HDI rank | nan | 2008 | 25 |
| 2 | Korea (Republic of) | HDI rank | nan | 2008 | 26 |
| 3 | Cyprus | HDI rank | nan | 2008 | 28 |
| 4 | Brunei Darussalam | HDI rank | nan | 2008 | 30 |
| 5 | Barbados | HDI rank | e,g, f | 2008 | 31 |
| | country_name | indicator_name | reference | year | values |
|:---|:-----------------------|:-----------------|:------------|:-------|:---------|
| 0 | Hong Kong, China (SAR) | HDI rank | e | 2008 | 21 |
| 1 | Singapore | HDI rank | nan | 2008 | 25 |
| 2 | Korea (Republic of) | HDI rank | nan | 2008 | 26 |
| 3 | Cyprus | HDI rank | nan | 2008 | 28 |
| 4 | Brunei Darussalam | HDI rank | nan | 2008 | 30 |
| 5 | Barbados | HDI rank | e,g,f | 2008 | 31 |

With a readable set of scripts to ensure that your process can be audited and repeated:

```
scripts = [
"DEBLANK",
"DEDUPE",
"REBASE < [11]",
f"DELETE_ROWS < {[int(i) for i in np.arange(144, df.index[-1]+1)]}",
"RENAME_ALL > ['HDI rank', 'Country', 'Human poverty index (HPI-1) - Rank;;2008', 'Reference 1', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Reference 2', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Reference 3', 'Population not using an improved water source (%);;2004', 'Reference 4', 'Children under weight for age (% under age 5);;1996-2005', 'Reference 5', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Reference 6', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Reference 7', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'Reference 8', 'HPI-1 rank minus income poverty rank;;2008']",
"PIVOT_CATEGORIES > ['HDI rank'] < [14,44,120]",
"RENAME_NEW > 'HDI Category'::['PIVOT_CATEGORIES_idx_20_0']",
"PIVOT_LONGER > = ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
"SPLIT > ';;'::['PIVOT_LONGER_names_idx_9']",
f"JOIN > 'reference' < {reference_columns}",
"RENAME > 'indicator_name' < ['SPLIT_idx_11_0']",
"RENAME > 'country_name' < ['Country']",
"RENAME > 'year' < ['SPLIT_idx_12_1']",
"RENAME > 'values' < ['PIVOT_LONGER_values_idx_10']",
]
```
```python
schema_scripts = [
f"UNITE > 'reference' < {REFERENCE_COLUMNS}",
"RENAME > 'country_name' < ['Country']",
"PIVOT_LONGER > ['indicator_name', 'values'] < ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
"SEPARATE > ['indicator_name', 'year'] < ';;'::['indicator_name']",
"DEBLANK",
"DEDUPE",
]
```

There are two complex and time-consuming parts to preparing data for analysis: social and technical.
## How does it work?

The social part requires multi-stakeholder engagement with source data-publishers, and with
destination database users, to agree structural metadata. Without any agreement on data publication
formats or destination structure, you are left with the tedious frustration of manually wrangling
each independent dataset into a single schema.
> Crosswalks are mappings of the relationships between fields defined in different metadata
> [schemas](https://whyqd.readthedocs.io/strategies/schema). Ideally, these are one-to-one, where a field in
> one has an exact match in the other. In practice, it's more complicated than that.
**whyqd** allows you to get to work without requiring you to achieve buy-in from anyone or change
your existing code.
Your workflow is:

## Wrangling process
1. Define a single destination schema,
2. Derive a source schema from a data source,
3. Review your source data structure,
4. Develop a crosswalk to define the relationship between source and destination,
5. Transform and validate your outputs,
6. Share your output data, transform definitions, and a citation.

- Create, update or import a data schema which defines the destination data structure,
- Create a new method and associate it with your schema and input data source/s,
- Assign a foreign key column and (if required) merge input data sources,
- Structure input data fields to conform to the requirements for each schema field,
- Assign categorical data identified during structuring,
- Transform and filter input data to produce a final destination data file,
- Share your data and a citation.
It starts like this:

## Installation and dependencies
```python
import whyqd as qd
```
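
A first step, defining the destination schema, might look something like the following. This is a hedged sketch rather than documented API: the field names mirror the output table above, but the exact dictionary terms, field types and method names are illustrative and should be checked against the quickstart.

```python
import whyqd as qd

# Hedged sketch of step 1: define a destination schema. Terms are placeholders.
schema: dict = {
    "name": "human-development-index",
    "title": "UNDP Human Development Index",
    "description": "Destination schema for the HDI crosswalk example.",
}
fields: list[dict] = [
    {"name": "country_name", "title": "Country name", "type": "string"},
    {"name": "indicator_name", "title": "Indicator name", "type": "string"},
    {"name": "reference", "title": "Reference", "type": "string"},
    {"name": "year", "title": "Year", "type": "year"},
    {"name": "values", "title": "Values", "type": "number"},
]
schema_destination = qd.SchemaDefinition()
schema_destination.set(schema=schema)
schema_destination.fields.add_multi(terms=fields)
schema_destination.save()
```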

You'll need at least Python 3.7, then:
[Install](https://whyqd.readthedocs.io/installation) and [get started](https://whyqd.readthedocs.io/quickstart).

`pip install whyqd`
There are three worked tutorials to guide you through three typical scenarios:

Code requirements have been tested on the following versions:
- [Aligning multiple disparate data sources to a single schema](https://whyqd.readthedocs.io/tutorials/tutorial1)
- [Pivoting wide-format data into archival long-format](https://whyqd.readthedocs.io/tutorials/tutorial2)
- [Wrangling Cthulhu data without losing your mind](https://whyqd.readthedocs.io/tutorials/tutorial3)

- numpy>=1.18.1
- openpyxl>=3.0.3
- pandas>=1.0.0
- tabulate>=0.8.3
- xlrd>=1.2.0
## Installation

Version 0.5.0 introduced a new, simplified API, along with script-based transformation actions. You can import and
transform any saved `method.json` files with:
You'll need at least Python 3.8, then install with your favourite package manager:

```bash
pip install whyqd
```
```python
SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
schema_scripts = whyqd.parsers.LegacyScript().parse_legacy_method(
    version="1", schema=SCHEMA, source_path=METHOD_SOURCE_V1
)
```

To derive a source schema from tabular data, import from `DATASOURCE_PATH`, define its `MIMETYPE`, and derive a schema:

```python
import whyqd as qd

datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
schema_source = qd.SchemaDefinition()
schema_source.derive_model(data=datasource.get)
schema_source.fields.set_categories(name=CATEGORY_FIELD, terms=datasource.get_data())
schema_source.save()
```
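
From there, a crosswalk and transform complete the workflow. The continuation below is a hedged sketch rather than documented API: it assumes the crosswalk and transform classes follow the same `Definition` pattern shown above, that a `schema_destination` has already been defined, and that `SCHEMA_SCRIPTS` and `DIRECTORY` are placeholders for your action scripts and output directory.

```python
# Hedged continuation of the sketch above; names in CAPS are placeholders.
crosswalk = qd.CrosswalkDefinition()
crosswalk.set(schema_source=schema_source, schema_destination=schema_destination)
crosswalk.actions.add_multi(terms=SCHEMA_SCRIPTS)
crosswalk.save()

transform = qd.TransformDefinition()
transform.set(crosswalk=crosswalk, data_source=datasource.get)
transform.process()
transform.save(directory=DIRECTORY)
```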

Where SCHEMA_SOURCE is a path to your schema. Existing `schema.json` files should still work.
[Get started...](https://whyqd.readthedocs.io/quickstart)

## Changelog

The version history can be found in the [changelog](https://github.com/whythawk/whyqd/blob/master/CHANGELOG).
The version history can be found in the [changelog](https://whyqd.readthedocs.io/changelog).

## Background and funding

**whyqd** was created to serve a continuous data wrangling process, including collaboration on more complex messy
sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our
database, back to source. You can see the product of that at [openLocal.uk](https://openlocal.uk).

[This project](https://eoscfuture-grants.eu/meet-the-grantees/implementation-no-code-method-schema-schema-data-transformations-interoperability) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from [EOSC Future](https://eoscfuture.eu/) through the [RDA Open Call mechanism](https://eoscfuture-grants.eu/provider/research-data-alliance), based on evaluations of external, independent experts.
**whyqd** [received initial funding](https://eoscfuture-grants.eu/meet-the-grantees/implementation-no-code-method-schema-schema-data-transformations-interoperability)
from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical
development support is from [EOSC Future](https://eoscfuture.eu/) through the
[RDA Open Call mechanism](https://eoscfuture-grants.eu/provider/research-data-alliance), based on evaluations of
external, independent experts.

The 'backronym' for **whyqd** `/wɪkɪd/` is _Whythawk Quantitative Data_, [Whythawk](https://whythawk.com) is an open data science and open research technical consultancy.
The 'backronym' for **whyqd** /wɪkɪd/ is *Whythawk Quantitative Data*, [Whythawk](https://whythawk.com)
is an open data science and open research technical consultancy.

## Licence

[BSD 3](LICENSE)
The [**whyqd** Python distribution](https://github.com/whythawk/whyqd) is licensed under the terms of the
[BSD 3-Clause license](https://github.com/whythawk/whyqd/blob/master/LICENSE). All documentation is released under
[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). **whyqd** tradenames and
marks are copyright [Whythawk](https://whythawk.com).
5 changes: 5 additions & 0 deletions docs-requirements.txt
@@ -0,0 +1,5 @@
mkdocs
mkdocs-exclude
mkdocstrings-python
mkdocs-redirects
mkdocs-material
20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

Binary file removed docs/_build/doctrees/action_api.doctree
Binary file removed docs/_build/doctrees/citation.doctree
Binary file removed docs/_build/doctrees/contributing.doctree
Binary file removed docs/_build/doctrees/environment.pickle
Binary file removed docs/_build/doctrees/field_api.doctree
Binary file removed docs/_build/doctrees/index.doctree
Binary file removed docs/_build/doctrees/installation.doctree
Binary file removed docs/_build/doctrees/method.doctree
Binary file removed docs/_build/doctrees/method_api.doctree
Binary file removed docs/_build/doctrees/morph_api.doctree
Binary file removed docs/_build/doctrees/morph_tutorial.doctree
Binary file removed docs/_build/doctrees/roadmap.doctree
Binary file removed docs/_build/doctrees/schema.doctree
Binary file removed docs/_build/doctrees/schema_api.doctree
Binary file removed docs/_build/doctrees/transform_api.doctree
Binary file removed docs/_build/doctrees/tutorial.doctree
Binary file removed docs/_build/doctrees/tutorial_cthulhu_data.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/validate.doctree
Binary file removed docs/_build/doctrees/validate_api.doctree
4 changes: 0 additions & 4 deletions docs/_build/html/.buildinfo

This file was deleted.

Binary file removed docs/_build/html/_images/undp-hdi-2007-8.jpg
