Inconsistent definition of tidy data #968

andtheWings · 2020-05-22T17:38:17Z

On the tidyr website index, the GitHub README, and R4DS, tidy data is defined as:

1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.

Then in the Tidy data article derived from vignettes/tidy-data.Rmd, it's defined as:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Which definition I use changes how I label this motor vehicle collision dataset.

By my assessment, it meets rules 1-3 of the first definition, so I would call it tidy. Under the second definition, it meets rules 1-2 but violates rule 3, because it contains variables for three different observational units:

Individual involved in a collision event:
- PERSONNMB -- Unique numeric sequence value for each person associated with a collision and a vehicle.
- GENDERCDE -- Code indicating person's gender.

Vehicle involved in a collision event:
- UNIT_MR_NUMBER -- Unique numeric sequence value for each vehicle in a collision.
- VEHMAKETXT -- Description indicating the name of the manufacturer of the vehicle.

Collision event:
- INDIVIDUAL_MR_RECORD -- Unique identifier for each collision.
- INJUREDNMB -- Total number of people injured in the collision.
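
To make rule 3 of the second definition concrete, here is a minimal Python sketch (not tidyr code) of how the flat table might be split into one table per observational unit. The sample values and the `split_by_unit` helper are invented for illustration, and the sketch assumes each row of the flat file describes one person; only the column names come from the dataset.

```python
# Hypothetical sample rows: the values are invented, only the column names
# come from the collision dataset. Assumes one row per person.
flat_rows = [
    {"INDIVIDUAL_MR_RECORD": 1001, "INJUREDNMB": 2, "UNIT_MR_NUMBER": 1,
     "VEHMAKETXT": "FORD", "PERSONNMB": 1, "GENDERCDE": "F"},
    {"INDIVIDUAL_MR_RECORD": 1001, "INJUREDNMB": 2, "UNIT_MR_NUMBER": 1,
     "VEHMAKETXT": "FORD", "PERSONNMB": 2, "GENDERCDE": "M"},
    {"INDIVIDUAL_MR_RECORD": 1001, "INJUREDNMB": 2, "UNIT_MR_NUMBER": 2,
     "VEHMAKETXT": "HONDA", "PERSONNMB": 1, "GENDERCDE": "M"},
    {"INDIVIDUAL_MR_RECORD": 1002, "INJUREDNMB": 0, "UNIT_MR_NUMBER": 1,
     "VEHMAKETXT": "TOYOTA", "PERSONNMB": 1, "GENDERCDE": "F"},
]


def split_by_unit(rows):
    """Split flat rows into collision, vehicle, and person tables."""
    collisions = {}  # keyed by collision id
    vehicles = {}    # keyed by (collision id, vehicle number)
    people = []      # one entry per flat row
    for row in rows:
        collision_id = row["INDIVIDUAL_MR_RECORD"]
        vehicle_key = (collision_id, row["UNIT_MR_NUMBER"])
        # Collision-level variables are stored once per collision.
        collisions[collision_id] = {
            "INDIVIDUAL_MR_RECORD": collision_id,
            "INJUREDNMB": row["INJUREDNMB"],
        }
        # Vehicle-level variables are stored once per vehicle.
        vehicles[vehicle_key] = {
            "INDIVIDUAL_MR_RECORD": collision_id,
            "UNIT_MR_NUMBER": row["UNIT_MR_NUMBER"],
            "VEHMAKETXT": row["VEHMAKETXT"],
        }
        # Person-level variables keep the keys needed to join back.
        people.append({
            "INDIVIDUAL_MR_RECORD": collision_id,
            "UNIT_MR_NUMBER": row["UNIT_MR_NUMBER"],
            "PERSONNMB": row["PERSONNMB"],
            "GENDERCDE": row["GENDERCDE"],
        })
    return list(collisions.values()), list(vehicles.values()), people


collisions, vehicles, people = split_by_unit(flat_rows)
print(len(collisions), len(vehicles), len(people))
```

In the flat rows, INJUREDNMB and VEHMAKETXT repeat for every person in the same collision or vehicle; after the split each is stored once, which is the kind of separation rule 3 of the second definition describes.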

Does that mean the dataset isn't tidy?

hadley added the documentation label Aug 28, 2020

nzare mentioned this issue Sep 7, 2020

updated definition of tidy data #1039

Merged

hadley closed this as completed in 501a757 Sep 7, 2020
