This repository illustrates how the data cleaning process described in the paper "Tidy Data" by Hadley Wickham, a member of the RStudio team, can be carried out in Python.
The paper was published in 2014 in the Journal of Statistical Software. The author offers it for free here. Furthermore, the original R code is available here.
After installing the dependencies for this project (see the installation notes below), we recommend reading the paper first to get the big picture and then working through the six Jupyter notebooks listed below.
Tidy data is defined as tabular data that meets the following requirements:
- each variable is a column,
- each observation is a row, and
- each type of observational unit forms a table.
This corresponds to Codd's 3rd normal form, a concept from the theory of relational databases, restated in statistical terms. A dataset that does not satisfy these properties is called messy.
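To make the definition concrete, here is a minimal sketch that stores the same four measurements once in a messy layout and once in a tidy one. It assumes pandas is available; the column names and values are hypothetical toy data and are not taken from the paper or the notebooks.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
# Messy layout: the values "a" and "b" of the variable "treatment"
# appear as column headers.
messy = pd.DataFrame(
    {"person": ["Ann", "Bob"], "a": [1, 3], "b": [2, 4]}
)

# Tidy layout: each variable (person, treatment, result) is a column
# and each observation (one person under one treatment) is a row.
tidy = pd.DataFrame(
    {
        "person": ["Ann", "Ann", "Bob", "Bob"],
        "treatment": ["a", "b", "a", "b"],
        "result": [1, 2, 3, 4],
    }
)

# Pivoting the tidy table back to wide form recovers the messy layout,
# so both tables carry the same information.
wide_again = tidy.pivot(index="person", columns="treatment", values="result")
same_values = (wide_again.to_numpy() == messy[["a", "b"]].to_numpy()).all()
print(same_values)
```

The tidy table is longer, but every variable now has a name, which is what makes it easy to filter, group, and model.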
The five most common problems with messy data are:
- column headers are values, not variable names (see notebook 1)
- multiple variables are stored in one column (see notebook 2)
- variables are stored in both rows and columns (see notebook 3)
- multiple types of observational units are stored in the same table (see notebook 4)
- a single observational unit is stored in multiple tables (see notebook 5)
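As a rough illustration of the first two problems in this list, the sketch below tidies a small table whose column headers encode two variables at once. It assumes pandas is available; the dataset, the `sex_age` header pattern, and all column names are hypothetical, and the notebooks may solve these problems differently.

```python
import pandas as pd

# Hypothetical toy data: headers such as "m_child" combine the values
# of two variables (sex and age group).
messy = pd.DataFrame(
    {
        "country": ["X", "Y"],
        "m_child": [1, 5],
        "m_adult": [2, 6],
        "f_child": [3, 7],
        "f_adult": [4, 8],
    }
)

# Problem 1: move the value-bearing column headers into a single column.
molten = messy.melt(id_vars="country", var_name="group", value_name="cases")

# Problem 2: split the combined "group" column into its two variables.
molten[["sex", "age"]] = molten["group"].str.split("_", expand=True)

tidy = (
    molten.drop(columns="group")
    .sort_values(["country", "sex", "age"])
    .reset_index(drop=True)
)[["country", "sex", "age", "cases"]]
print(tidy)
```

The remaining three problems typically call for other tools, such as pivoting, splitting one table into several, or concatenating several tables into one.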
A case study (see notebook 6) shows the advantages of tidy data as a standardized input to statistical functions.
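The following sketch hints at why a tidy table is a convenient input: because every variable is a named column, a summary statistic per group is a one-liner. It assumes pandas is available and uses hypothetical toy data; it is not the case study from notebook 6.

```python
import pandas as pd

# Hypothetical tidy toy data: one row per person and treatment.
tidy = pd.DataFrame(
    {
        "person": ["Ann", "Ann", "Bob", "Bob"],
        "treatment": ["a", "b", "a", "b"],
        "result": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group by one variable and aggregate another, both referenced by name.
mean_by_treatment = tidy.groupby("treatment")["result"].mean()
total_by_person = tidy.groupby("person")["result"].sum()
print(mean_by_treatment)
print(total_by_person)
```

With a messy layout, the same summaries would first require locating the values across several columns.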
Get a local copy of this repository with git.
git clone https://github.com/webartifex/tidy-data.git
If you are not familiar with git, simply download the latest version of the files in a zip archive here.
This project uses poetry to manage its dependencies. Install all third-party packages into a virtual environment.
poetry install
Alternatively, the Anaconda Distribution should also suffice to run the provided notebooks.