Merge pull request #12 from whythawk/refactor-workflow
Refactor workflow
turukawa committed May 10, 2023
2 parents 1084f96 + b09480f commit 96b77c8
Showing 434 changed files with 68,142 additions and 57,273 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -44,8 +44,10 @@ pip-log.txt
# Whyqd
###############################
__pycache__/
*.komodoproject
.pytest_cache/
deprecated/
.ipynb_checkpoints/
*.ipynb
.env

setup and distribution.txt
16 changes: 8 additions & 8 deletions .readthedocs.yml
@@ -6,20 +6,20 @@
# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/conf.py

# Using test server while Python 3.9 not supported
# https://github.com/readthedocs/readthedocs-docker-images/pull/159#issuecomment-785048185
build:
  image: testing
  os: "ubuntu-20.04"
  tools:
    python: "3.9"

# Build documentation for mkdocs
mkdocs:
  configuration: mkdocs.yml

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.9
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs
    - requirements: docs-requirements.txt
1 change: 1 addition & 0 deletions CHANGELOG
@@ -1,3 +1,4 @@
- 1.0.0: Complete refactor with breaking changes. See documentation changelog.
- 0.6.2: Fix for parsing ambiguity errors, plus Excel row-count exceeded on save.
- 0.6.1: Minor correction for row count.
- 0.6.0: Ensuring consistent optional column 'name' change for header row, plus new row-count in input and working data.
175 changes: 90 additions & 85 deletions README.md
@@ -6,132 +6,137 @@

## What is it?

**whyqd** provides an intuitive method for restructuring messy data to conform to a standardised
metadata schema. It supports data managers and researchers looking to rapidly, and continuously,
normalise any messy spreadsheets using a simple series of steps. Once complete, you can import
wrangled data into more complex analytical systems or full-feature wrangling tools.
> More research, less wrangling
It aims to get you to the point where you can perform automated data munging prior to
committing your data into a database, and no further. It is built on Pandas, and plays well with
existing Python-based data-analytical tools. Each raw source file will produce a json schema and
method file which defines the set of actions to be performed to produce refined data, and a
destination file validated against that schema.
[**whyqd**](https://whyqd.com) (/wɪkɪd/) is a curatorial toolkit intended to produce well-structured and predictable
data for research analysis.

**whyqd** ensures complete audit transparency by saving all actions performed to restructure
your input data to a separate json-defined methods file. This permits others to read and scrutinise
your approach, validate your methodology, or even use your methods to import data in production.
It provides an intuitive method for creating schema-to-schema crosswalks for restructuring messy data to conform to a
standardised metadata schema. It supports rapid and continuous transformation of messy data using a simple series of
steps. Once complete, you can import wrangled data into more complex analytical or database systems.

Once complete, a method file can be shared, along with your input data, and anyone can
import **whyqd** and validate your method to verify that your output data is the product of these
inputs.
**whyqd** plays well with your existing Python-based data-analytical tools. It uses [Ray](https://www.ray.io/) and
[Modin](https://modin.readthedocs.io/) as a drop-in replacement for [Pandas](https://pandas.pydata.org/) to support
processing of large datasets, and [Pydantic](https://pydantic-docs.helpmanual.io/) for data models.
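
As a quick, hedged illustration of what that drop-in replacement means in practice (this sketch is not part of **whyqd**'s API, and the file and column names are placeholders drawn from the output table below), existing pandas analysis code typically only needs to change its import:

```python
# Hedged sketch: Modin exposes the pandas API, so downstream pandas code usually
# runs unchanged after swapping the import. "output.csv" is a placeholder path.
import modin.pandas as pd  # instead of: import pandas as pd

df = pd.read_csv("output.csv")
print(df.groupby("indicator_name")["values"].describe())
```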

[Read the docs](https://whyqd.readthedocs.io/en/latest/) and there are two worked tutorials to demonstrate
how you can use `whyqd` to support source data curation transparency:
Each definition is saved as a [JSON Schema-compliant](https://json-schema.org/) file. This permits others to read and
scrutinise your approach, validate your methodology, or even use your crosswalks to import and transform data in
production.
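
Because a saved definition is plain JSON, scrutiny doesn't require **whyqd** at all; a reviewer can inspect it with the standard library. This is a hedged illustration with a placeholder filename, not a file **whyqd** creates by default:

```python
# Hedged illustration: read a saved definition as ordinary JSON and list its keys.
import json
from pathlib import Path

definition = json.loads(Path("crosswalk.json").read_text())
print(sorted(definition.keys()))
```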

- [Local-government data](https://whyqd.readthedocs.io/en/latest/tutorial_local_government_data.html)
- [Data produced by Cthulhu](https://whyqd.readthedocs.io/en/latest/tutorial_cthulhu_data.html)
Once complete, a transform file can be shared, along with your input data, and anyone can import and validate your
crosswalk to verify that your output data is the product of these inputs.

## Why use it?

If all you want to do is test whether your source data are even useful, spending days or weeks slogging through data
restructuring could kill a project. If you already have a workflow and established software which includes Python and
pandas, having to change your code every time your source data changes is really, really frustrating.

If you want to go from a Cthulhu dataset like this:
If you want to go from a [Cthulhu dataset](https://whyqd.readthedocs.io/tutorials/tutorial3) like this:

![UNDP Human Development Index 2007-2008: a beautiful example of messy data.](https://raw.githubusercontent.com/whythawk/whyqd/master/docs/images/undp-hdi-2007-8.jpg)

*UNDP Human Development Index 2007-2008: a beautiful example of messy data.*

To this:

| | country_name | indicator_name | reference | year | values |
| --: | :--------------------- | :------------- | :-------- | ---: | -----: |
| 0 | Hong Kong, China (SAR) | HDI rank | e | 2008 | 21 |
| 1 | Singapore | HDI rank | nan | 2008 | 25 |
| 2 | Korea (Republic of) | HDI rank | nan | 2008 | 26 |
| 3 | Cyprus | HDI rank | nan | 2008 | 28 |
| 4 | Brunei Darussalam | HDI rank | nan | 2008 | 30 |
| 5 | Barbados | HDI rank | e,g, f | 2008 | 31 |
| | country_name | indicator_name | reference | year | values |
|:---|:-----------------------|:-----------------|:------------|:-------|:---------|
| 0 | Hong Kong, China (SAR) | HDI rank | e | 2008 | 21 |
| 1 | Singapore | HDI rank | nan | 2008 | 25 |
| 2 | Korea (Republic of) | HDI rank | nan | 2008 | 26 |
| 3 | Cyprus | HDI rank | nan | 2008 | 28 |
| 4 | Brunei Darussalam | HDI rank | nan | 2008 | 30 |
| 5 | Barbados | HDI rank | e,g,f | 2008 | 31 |

With a readable set of scripts to ensure that your process can be audited and repeated:

```
scripts = [
"DEBLANK",
"DEDUPE",
"REBASE < [11]",
f"DELETE_ROWS < {[int(i) for i in np.arange(144, df.index[-1]+1)]}",
"RENAME_ALL > ['HDI rank', 'Country', 'Human poverty index (HPI-1) - Rank;;2008', 'Reference 1', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Reference 2', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Reference 3', 'Population not using an improved water source (%);;2004', 'Reference 4', 'Children under weight for age (% under age 5);;1996-2005', 'Reference 5', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Reference 6', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Reference 7', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'Reference 8', 'HPI-1 rank minus income poverty rank;;2008']",
"PIVOT_CATEGORIES > ['HDI rank'] < [14,44,120]",
"RENAME_NEW > 'HDI Category'::['PIVOT_CATEGORIES_idx_20_0']",
"PIVOT_LONGER > = ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
"SPLIT > ';;'::['PIVOT_LONGER_names_idx_9']",
f"JOIN > 'reference' < {reference_columns}",
"RENAME > 'indicator_name' < ['SPLIT_idx_11_0']",
"RENAME > 'country_name' < ['Country']",
"RENAME > 'year' < ['SPLIT_idx_12_1']",
"RENAME > 'values' < ['PIVOT_LONGER_values_idx_10']",
]
```
```python
schema_scripts = [
f"UNITE > 'reference' < {REFERENCE_COLUMNS}",
"RENAME > 'country_name' < ['Country']",
"PIVOT_LONGER > ['indicator_name', 'values'] < ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
"SEPARATE > ['indicator_name', 'year'] < ';;'::['indicator_name']",
"DEBLANK",
"DEDUPE",
]
```

There are two complex and time-consuming parts to preparing data for analysis: social and technical.
## How does it work?

The social part requires multi-stakeholder engagement with source data-publishers, and with
destination database users, to agree structural metadata. Without any agreement on data publication
formats or destination structure, you are left with the tedious frustration of manually wrangling
each independent dataset into a single schema.
> Crosswalks are mappings of the relationships between fields defined in different metadata
> [schemas](https://whyqd.readthedocs.io/strategies/schema). Ideally, these are one-to-one, where a field in
> one has an exact match in the other. In practice, it's more complicated than that.
**whyqd** allows you to get to work without requiring you to achieve buy-in from anyone or change
your existing code.
Your workflow is:

## Wrangling process
1. Define a single destination schema,
2. Derive a source schema from a data source,
3. Review your source data structure,
4. Develop a crosswalk to define the relationship between source and destination,
5. Transform and validate your outputs,
6. Share your output data, transform definitions, and a citation.

- Create, update or import a data schema which defines the destination data structure,
- Create a new method and associate it with your schema and input data source/s,
- Assign a foreign key column and (if required) merge input data sources,
- Structure input data fields to conform to the requirements for each schema field,
- Assign categorical data identified during structuring,
- Transform and filter input data to produce a final destination data file,
- Share your data and a citation.
It starts like this:

## Installation and dependencies
```python
import whyqd as qd
```
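
A first step, defining the destination schema, might look something like the following. This is a hedged sketch rather than documented API: the field names mirror the output table above, but the exact dictionary terms, field types and method names are illustrative and should be checked against the quickstart.

```python
import whyqd as qd

# Hedged sketch of step 1: define a destination schema. Terms are placeholders.
schema: dict = {
    "name": "human-development-index",
    "title": "UNDP Human Development Index",
    "description": "Destination schema for the HDI crosswalk example.",
}
fields: list[dict] = [
    {"name": "country_name", "title": "Country name", "type": "string"},
    {"name": "indicator_name", "title": "Indicator name", "type": "string"},
    {"name": "reference", "title": "Reference", "type": "string"},
    {"name": "year", "title": "Year", "type": "year"},
    {"name": "values", "title": "Values", "type": "number"},
]
schema_destination = qd.SchemaDefinition()
schema_destination.set(schema=schema)
schema_destination.fields.add_multi(terms=fields)
schema_destination.save()
```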

You'll need at least Python 3.7, then:
[Install](https://whyqd.readthedocs.io/installation) and [get started](https://whyqd.readthedocs.io/quickstart).

`pip install whyqd`
There are three worked tutorials to guide you through three typical scenarios:

Code requirements have been tested on the following versions:
- [Aligning multiple disparate data sources to a single schema](https://whyqd.readthedocs.io/tutorials/tutorial1)
- [Pivoting wide-format data into archival long-format](https://whyqd.readthedocs.io/tutorials/tutorial2)
- [Wrangling Cthulhu data without losing your mind](https://whyqd.readthedocs.io/tutorials/tutorial3)

- numpy>=1.18.1
- openpyxl>=3.0.3
- pandas>=1.0.0
- tabulate>=0.8.3
- xlrd>=1.2.0
## Installation

Version 0.5.0 introduced a new, simplified API, along with script-based transformation actions. You can import and
transform any saved `method.json` files with:
You'll need at least Python 3.8, then install with your favourite package manager:

```bash
pip install whyqd
```
```python
SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
schema_scripts = whyqd.parsers.LegacyScript().parse_legacy_method(
    version="1", schema=SCHEMA, source_path=METHOD_SOURCE_V1
)
```

To derive a source schema from tabular data, import from `DATASOURCE_PATH`, define its `MIMETYPE`, and derive a schema:

```python
import whyqd as qd

datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
schema_source = qd.SchemaDefinition()
schema_source.derive_model(data=datasource.get)
schema_source.fields.set_categories(name=CATEGORY_FIELD, terms=datasource.get_data())
schema_source.save()
```
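
From there, a crosswalk and transform complete the workflow. The continuation below is a hedged sketch rather than documented API: it assumes the crosswalk and transform classes follow the same `Definition` pattern shown above, that a `schema_destination` has already been defined, and that `SCHEMA_SCRIPTS` and `DIRECTORY` are placeholders for your action scripts and output directory.

```python
# Hedged continuation of the sketch above; names in CAPS are placeholders.
crosswalk = qd.CrosswalkDefinition()
crosswalk.set(schema_source=schema_source, schema_destination=schema_destination)
crosswalk.actions.add_multi(terms=SCHEMA_SCRIPTS)
crosswalk.save()

transform = qd.TransformDefinition()
transform.set(crosswalk=crosswalk, data_source=datasource.get)
transform.process()
transform.save(directory=DIRECTORY)
```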

Where SCHEMA_SOURCE is a path to your schema. Existing `schema.json` files should still work.
[Get started...](https://whyqd.readthedocs.io/quickstart)

## Changelog

The version history can be found in the [changelog](https://github.com/whythawk/whyqd/blob/master/CHANGELOG).
The version history can be found in the [changelog](https://whyqd.readthedocs.io/changelog).

## Background and funding

**whyqd** was created to serve a continuous data wrangling process, including collaboration on more complex messy
sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our
database, back to source. You can see the product of that at [openLocal.uk](https://openlocal.uk).

[This project](https://eoscfuture-grants.eu/meet-the-grantees/implementation-no-code-method-schema-schema-data-transformations-interoperability) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from [EOSC Future](https://eoscfuture.eu/) through the [RDA Open Call mechanism](https://eoscfuture-grants.eu/provider/research-data-alliance), based on evaluations of external, independent experts.
**whyqd** [received initial funding](https://eoscfuture-grants.eu/meet-the-grantees/implementation-no-code-method-schema-schema-data-transformations-interoperability)
from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical
development support is from [EOSC Future](https://eoscfuture.eu/) through the
[RDA Open Call mechanism](https://eoscfuture-grants.eu/provider/research-data-alliance), based on evaluations of
external, independent experts.

The 'backronym' for **whyqd** `/wɪkɪd/` is _Whythawk Quantitative Data_, [Whythawk](https://whythawk.com) is an open data science and open research technical consultancy.
The 'backronym' for **whyqd** /wɪkɪd/ is *Whythawk Quantitative Data*, [Whythawk](https://whythawk.com)
is an open data science and open research technical consultancy.

## Licence

[BSD 3](LICENSE)
The [**whyqd** Python distribution](https://github.com/whythawk/whyqd) is licensed under the terms of the
[BSD 3-Clause license](https://github.com/whythawk/whyqd/blob/master/LICENSE). All documentation is released under
[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). **whyqd** tradenames and
marks are copyright [Whythawk](https://whythawk.com).
5 changes: 5 additions & 0 deletions docs-requirements.txt
@@ -0,0 +1,5 @@
mkdocs
mkdocs-exclude
mkdocstrings-python
mkdocs-redirects
mkdocs-material
20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

Binary file removed docs/_build/doctrees/action_api.doctree
Binary file removed docs/_build/doctrees/citation.doctree
Binary file removed docs/_build/doctrees/contributing.doctree
Binary file removed docs/_build/doctrees/environment.pickle
Binary file removed docs/_build/doctrees/field_api.doctree
Binary file removed docs/_build/doctrees/index.doctree
Binary file removed docs/_build/doctrees/installation.doctree
Binary file removed docs/_build/doctrees/method.doctree
Binary file removed docs/_build/doctrees/method_api.doctree
Binary file removed docs/_build/doctrees/morph_api.doctree
Binary file removed docs/_build/doctrees/morph_tutorial.doctree
Binary file removed docs/_build/doctrees/roadmap.doctree
Binary file removed docs/_build/doctrees/schema.doctree
Binary file removed docs/_build/doctrees/schema_api.doctree
Binary file removed docs/_build/doctrees/transform_api.doctree
Binary file removed docs/_build/doctrees/tutorial.doctree
Binary file removed docs/_build/doctrees/tutorial_cthulhu_data.doctree
Binary file not shown.
Binary file removed docs/_build/doctrees/validate.doctree
Binary file removed docs/_build/doctrees/validate_api.doctree
4 changes: 0 additions & 4 deletions docs/_build/html/.buildinfo

This file was deleted.

Binary file removed docs/_build/html/_images/undp-hdi-2007-8.jpg
