# Data: Pivoting Data

*Purpose*: Data is easiest to use when it is *tidy*. In fact, grama is specifically designed to use tidy data. But not all data we'll encounter is tidy! To that end, in this exercise we'll learn how to tidy our data by *pivoting*.

As a result of learning how to quickly *tidy* data, you'll vastly expand the set of datasets you can analyze. Rather than fighting with data, you'll be able to quickly wrangle and extract insights.


## Setup


In [1]:
import grama as gr
DF = gr.Intention()
%matplotlib inline

# For assertion
from pandas.api.types import is_integer_dtype

# Tidy Data

*Tidy Data* is a very simple, but *very powerful*, concept for structuring a dataset.

![Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.](./images/tidydata_1.jpg)
Artwork by @allison_horst


## Tidy Data: Definition

A tidy dataset has three properties:

- each variable forms a column
- each observation forms a row
- each cell is a single measurement


The following dataset is tidy:


In [2]:
from grama.data import df_stang
df_stang.head()


Unnamed: 0,thick,alloy,E,mu,ang
0,0.022,al_24st,10600,0.321,0
1,0.022,al_24st,10600,0.323,0
2,0.032,al_24st,10400,0.329,0
3,0.032,al_24st,10300,0.319,0
4,0.064,al_24st,10500,0.323,0


The observations are all measured material properties taken at a particular angle (with respect to the direction in which the plates were rolled). Each column reports values for just one variable, each row corresponds to a single observation, and every cell reports just one measurement.

However, the following form of the same dataset is *not* tidy:


In [3]:
from grama.data import df_stang_wide
df_stang_wide

Unnamed: 0,thick,E_00,mu_00,E_45,mu_45,E_90,mu_90,alloy
0,0.022,10600,0.321,10700,0.329,10500,0.31,al_24st
1,0.022,10600,0.323,10500,0.331,10700,0.323,al_24st
2,0.032,10400,0.329,10400,0.318,10300,0.322,al_24st
3,0.032,10300,0.319,10500,0.326,10400,0.33,al_24st
4,0.064,10500,0.323,10400,0.331,10400,0.327,al_24st
5,0.064,10700,0.328,10500,0.328,10500,0.32,al_24st
6,0.081,10000,0.315,10000,0.32,9900,0.314,al_24st
7,0.081,10100,0.312,9900,0.312,10000,0.316,al_24st
8,0.081,10000,0.311,-1,-1.0,9900,0.314,al_24st


This dataset is *not* tidy: the angle of each measurement (`00`, `45`, or `90`) is a variable, but its values are encoded in the column names. Put differently, some of the values are not in cells, but rather in the column names.
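To make the "values hiding in column names" idea concrete, here is a minimal sketch in plain Python. It only inspects the column names shown above; the helper `split_name` is a hypothetical name introduced for illustration and is not part of grama.

```python
# Column names of the wide Stang et al. table shown above
columns = ["thick", "E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90", "alloy"]


def split_name(name):
    """Split e.g. 'E_45' into ('E', 45); return None if there is no numeric suffix.

    Hypothetical helper for illustration only.
    """
    var, sep, suffix = name.partition("_")
    if sep and suffix.isdigit():
        return var, int(suffix)
    return None


# Keep only the columns whose names carry an angle value
hidden = {name: split_name(name) for name in columns if split_name(name)}

# The distinct angle values that live in the column names rather than in cells
angles = sorted({angle for _, angle in hidden.values()})
print(angles)
```

The angle values recovered here are exactly the data that a tidy table would store in its own `ang` column.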


## Why tidy data?

Tidy data makes analysis *easier*. Putting our data in tidy form means we can use a *consistent* set of tools to work with *any* dataset.


![On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads “When working with tidy data, we can use the same tools in similar ways for different datasets…” On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. The monster is in front of a messy, cluttered workbench. The text above the frustrated monster reads “...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.”](./images/tidydata_3.jpg)
Artwork by @allison_horst


Note that untidy data are not *bad* data; they are simply harder to work with when doing data analysis. Data often come in untidy form when they are reported, say in a paper or a presentation. For instance, the wide form of the Stang et al. dataset can easily fit on one page:


In [4]:
df_stang_wide

Unnamed: 0,thick,E_00,mu_00,E_45,mu_45,E_90,mu_90,alloy
0,0.022,10600,0.321,10700,0.329,10500,0.31,al_24st
1,0.022,10600,0.323,10500,0.331,10700,0.323,al_24st
2,0.032,10400,0.329,10400,0.318,10300,0.322,al_24st
3,0.032,10300,0.319,10500,0.326,10400,0.33,al_24st
4,0.064,10500,0.323,10400,0.331,10400,0.327,al_24st
5,0.064,10700,0.328,10500,0.328,10500,0.32,al_24st
6,0.081,10000,0.315,10000,0.32,9900,0.314,al_24st
7,0.081,10100,0.312,9900,0.312,10000,0.316,al_24st
8,0.081,10000,0.311,-1,-1.0,9900,0.314,al_24st


However, the tidy form of the same dataset is far less compact:


In [5]:
df_stang

Unnamed: 0,thick,alloy,E,mu,ang
0,0.022,al_24st,10600,0.321,0
1,0.022,al_24st,10600,0.323,0
2,0.032,al_24st,10400,0.329,0
3,0.032,al_24st,10300,0.319,0
4,0.064,al_24st,10500,0.323,0
...,...,...,...,...,...
71,0.064,al_24st,10400,0.327,90
72,0.064,al_24st,10500,0.320,90
73,0.081,al_24st,9900,0.314,90
74,0.081,al_24st,10000,0.316,90


## Exercises


### __q1__ Identify

Inspect the following dataset; answer the questions under *observations* below.


In [6]:
## TASK: No need to edit; run and inspect
df_cases = gr.df_make(
    country=["FR", "DE", "US"],
    year2011=[7000, 5800, 15000],
    year2012=[6900, 6000, 14000],
    year2013=[7000, 6200, 13000],
)
df_cases


Unnamed: 0,country,year2011,year2012,year2013
0,FR,7000,6900,7000
1,DE,5800,6000,6200
2,US,15000,14000,13000


*Observations*

<!-- task-begin -->
- What are the *variables* in this dataset?
  - (Your response here)
- Is this dataset *tidy*? Why or why not?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- What are the *variables* in this dataset?
  - Country, year, and some unknown quantity (n, count, etc.)
- Is this dataset *tidy*? Why or why not?
  - No; the year values are in the column names.
<!-- solution-end -->
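As a rough preview of what tidying this table could look like, the sketch below uses plain pandas (`DataFrame.melt`) rather than grama's own pivoting verbs. The column name `n` for the unknown quantity is an assumption made for illustration.

```python
import pandas as pd

# Same values as df_cases above, built directly with pandas
df_cases = pd.DataFrame({
    "country": ["FR", "DE", "US"],
    "year2011": [7000, 5800, 15000],
    "year2012": [6900, 6000, 14000],
    "year2013": [7000, 6200, 13000],
})

# Move the year columns into cells: one row per (country, year) pair.
# "n" is an assumed name for the unknown measured quantity.
df_cases_long = df_cases.melt(id_vars="country", var_name="year", value_name="n")

# Strip the "year" prefix so the year is stored as an integer value
df_cases_long["year"] = (
    df_cases_long["year"].str.replace("year", "", regex=False).astype(int)
)
print(df_cases_long)
```

Each of the three variables (country, year, and the measured quantity) now has its own column, and each row is a single observation.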

### __q2__ Identify

Inspect the following dataset; answer the questions under *observations* below.


In [7]:
## TASK: No need to edit; run and inspect
df_alloys = gr.df_make(
    thick=[0.022, 0.022, 0.032, 0.032],
    E_00=[10600, 10600, 10400, 10300],
    mu_00=[0.321, 0.323, 0.329, 0.319],
    E_45=[10700, 10500, 10400, 10500],
    mu_45=[0.329, 0.331, 0.318, 0.326],
    rep=[1, 2, 1, 2],
)
df_alloys

Unnamed: 0,thick,E_00,mu_00,E_45,mu_45,rep
0,0.022,10600,0.321,10700,0.329,1
1,0.022,10600,0.323,10500,0.331,2
2,0.032,10400,0.329,10400,0.318,1
3,0.032,10300,0.319,10500,0.326,2


*Observations*

<!-- task-begin -->
- What are the *variables* in this dataset?
  - (Your response here)
- Is this dataset *tidy*? Why or why not?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- What are the *variables* in this dataset?
  - Thickness `thick`, elasticity `E`, Poisson's ratio `mu`, angle (in column names), replication `rep`
- Is this dataset *tidy*? Why or why not?
  - No; the angle values are in the column names.
<!-- solution-end -->

### __q3__ Identify

Inspect the following dataset; answer the questions under *observations* below.


In [8]:
## TASK: No need to edit; run and inspect
df_alloys2 = gr.df_make(
    thick=[0.022, 0.022, 0.032, 0.032],
    var=["E", "mu", "E", "mu"],
    value=[10700, 0.321, 10500, 0.323],
    rep=[1, 2, 1, 2],
    angle=[0, 0, 0, 0],
)
df_alloys2

Unnamed: 0,thick,var,value,rep,angle
0,0.022,E,10700.0,1,0
1,0.022,mu,0.321,2,0
2,0.032,E,10500.0,1,0
3,0.032,mu,0.323,2,0


*Observations*

<!-- task-begin -->
- What are the *variables* in this dataset?
  - (Your response here)
- Is this dataset *tidy*? Why or why not?
  - (Your response here)
<!-- task-end -->
<!-- solution-begin -->
- What are the *variables* in this dataset?
  - Thickness `thick`, elasticity `E`, Poisson's ratio `mu`, `angle`, replication `rep`
- Is this dataset *tidy*? Why or why not?
  - No; the column `value` contains values of two *different* variables `E` and `mu`.
<!-- solution-end -->
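One way to separate the two variables stored in `value` is to spread them into their own columns. The sketch below does this with plain pandas (`DataFrame.pivot`) rather than grama's pivoting verbs, and it assumes an observation is identified by `thick`, `rep`, and `angle` together.

```python
import pandas as pd

# Same values as df_alloys2 above, built directly with pandas
df_alloys2 = pd.DataFrame({
    "thick": [0.022, 0.022, 0.032, 0.032],
    "var": ["E", "mu", "E", "mu"],
    "value": [10700, 0.321, 10500, 0.323],
    "rep": [1, 2, 1, 2],
    "angle": [0, 0, 0, 0],
})

# Spread the names in `var` into separate columns, one per variable.
# Assumption: (thick, rep, angle) identifies a single observation.
df_alloys2_wide = (
    df_alloys2
    .pivot(index=["thick", "rep", "angle"], columns="var", values="value")
    .reset_index()
)
df_alloys2_wide.columns.name = None  # drop the leftover "var" label
print(df_alloys2_wide)
```

In this toy data `E` is recorded only for `rep == 1` and `mu` only for `rep == 2`, so the result contains missing values: separating the variables makes the gaps in the measurements visible.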

# Pivoting Data

One of the first steps
