Tidy Data

This repository illustrates how the data cleaning process described in the paper "Tidy Data" by Hadley Wickham, a member of the RStudio team, can be carried out in Python.

The paper was published in 2014 in the Journal of Statistical Software. The author offers it for free here. Furthermore, the original R code is available here.

After installing the dependencies for this project (see the installation notes below), first read the paper to get the big picture and then work through the six Jupyter notebooks referenced in the summary below.

Summary

Definition

Tidy data is defined as data that comes in a table form adhering to the following requirements:

each variable is a column,
each observation is a row, and
each type of observational unit forms a table.

This is equivalent to Codd's 3rd normal form, a concept from relational database theory. A dataset that does not satisfy these properties is called messy.
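As a minimal sketch of the definition, the same small dataset is shown below once in a messy and once in a tidy layout. The names, column labels, and numbers are made up for illustration and are not taken from the notebooks.

```python
import pandas as pd

# Hypothetical messy table: the treatment names are spread across
# the column headers, so one row holds two observations.
messy = pd.DataFrame(
    {
        "person": ["Alice", "Bob"],
        "treatment_a": [16, 3],
        "treatment_b": [11, 1],
    }
)

# The same data in tidy form: each variable (person, treatment, result)
# is a column and each observation is a row.
tidy = pd.DataFrame(
    {
        "person": ["Alice", "Alice", "Bob", "Bob"],
        "treatment": ["a", "b", "a", "b"],
        "result": [16, 11, 3, 1],
    }
)
```

Both tables hold the same four measurements. Only the tidy one lets every measurement be addressed by its row and a fixed set of column names.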

Tidying Data

The five most common problems with messy data are:

column headers are values, not variable names (see notebook 1)
multiple variables are stored in one column (see notebook 2)
variables are stored in both rows and columns (see notebook 3)
multiple types of observational units are stored in the same table (see notebook 4)
a single observational unit is stored in multiple tables (see notebook 5)
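The five problems above can each be sketched with a few lines of pandas. This is one possible approach under stated assumptions, not necessarily the one taken in the notebooks. All table contents and column names below are hypothetical.

```python
import pandas as pd

# Problem 1: column headers are values (income brackets), not variable names.
wide = pd.DataFrame(
    {"religion": ["A", "B"], "<$10k": [27, 12], "$10-20k": [34, 27]}
)
long = wide.melt(id_vars="religion", var_name="income", value_name="freq")

# Problem 2: one column stores two variables (sex and age group).
combined = pd.DataFrame({"group": ["m014", "f1524"], "cases": [5, 7]})
combined["sex"] = combined["group"].str[0]
combined["age"] = combined["group"].str[1:]
split = combined.drop(columns="group")

# Problem 3: variables live in both rows ("element") and columns ("d1", "d2").
weather = pd.DataFrame(
    {
        "station": ["S1", "S1"],
        "element": ["tmax", "tmin"],
        "d1": [30.0, 15.0],
        "d2": [31.0, 16.0],
    }
)
molten = weather.melt(
    id_vars=["station", "element"], var_name="day", value_name="value"
)
tidy_weather = (
    molten.set_index(["station", "day", "element"])["value"]
    .unstack("element")  # turn the "element" values into columns
    .reset_index()
)
tidy_weather.columns.name = None

# Problem 4: two observational units (songs and weekly ranks) in one table.
charts = pd.DataFrame(
    {
        "track": ["Song X", "Song X", "Song Y"],
        "artist": ["Artist 1", "Artist 1", "Artist 2"],
        "week": [1, 2, 1],
        "rank": [10, 8, 5],
    }
)
songs = charts[["track", "artist"]].drop_duplicates().reset_index(drop=True)
songs["song_id"] = songs.index
ranks = charts.merge(songs, on=["track", "artist"])[["song_id", "week", "rank"]]

# Problem 5: one observational unit spread over several tables (one per year).
by_year = {
    2019: pd.DataFrame({"name": ["Ann"], "count": [3]}),
    2020: pd.DataFrame({"name": ["Ann"], "count": [4]}),
}
names = pd.concat(
    [frame.assign(year=year) for year, frame in by_year.items()],
    ignore_index=True,
)
```

Melting covers the first and third problems, string operations the second, and splitting or concatenating tables the last two.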

Case Study

A case study (see notebook 6) shows the advantages of tidy data as a standardized input to statistical functions.
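To hint at why a tidy layout is a convenient input, the sketch below aggregates a small tidy table with a single groupby call. The data and column names are hypothetical and unrelated to the dataset analyzed in notebook 6.

```python
import pandas as pd

# Hypothetical tidy table: one row per (person, treatment) observation.
tidy = pd.DataFrame(
    {
        "person": ["Alice", "Alice", "Bob", "Bob"],
        "treatment": ["a", "b", "a", "b"],
        "result": [16, 11, 3, 1],
    }
)

# Because every variable is a column, grouping and aggregating
# needs no reshaping beforehand.
mean_by_treatment = tidy.groupby("treatment")["result"].mean()
```

The same pattern extends to other aggregations or to passing the columns directly to plotting and modeling functions.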

Installation

Get a local copy of this repository with git.

git clone https://github.com/webartifex/tidy-data.git

If you are not familiar with git, simply download the latest version of the files in a zip archive here.

This project uses poetry to manage its dependencies. Install all third-party packages into a virtual environment.

poetry install

Alternatively, use the Anaconda Distribution, which should also suffice to run the provided notebooks.

