# Pandas wide vs tidy

> A simple tutorial on pandas Wide vs Tidy format including melt and pivot.
> Animation from https://github.com/gadenbuie/tidy-animated-verbs

In [1]:
import pandas as pd

# Tidy Data

[Tidy data](http://r4ds.had.co.nz/tidy-data.html#tidy-data-1) (Hadley Wickham 2013) follows three rules:

1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

![wide long dataframes](tidy-animated-verbs/images/static/png/original-dfs-tidy.png)

## Melt / Pivot

*or Gather / Spread in tidyr*

![spread animation](tidy-animated-verbs/images/tidyr-spread-gather.gif)

**Melt**

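- Output has two columns, "variable" and "value" by default (the "variable" column takes the name of the column axis when it has one, which is why it appears as "key" in the examples below)
- Each cell value goes to the "value" column
- The column name of each value goes in the "variable" column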
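To make the three points above concrete, here is a minimal sketch that rebuilds a melt by hand. The names `wide`, `pieces` and `manual` are illustrative and not part of the tutorial, and this is not how pandas implements `melt` internally.

```python
import pandas as pd

# Small illustrative frame (hypothetical, separate from df_wide below)
wide = pd.DataFrame({"x": ["a", "b"], "y": ["c", "d"]})

# For every column: its name fills "variable", its cells fill "value"
pieces = [
    pd.DataFrame({"variable": name, "value": wide[name]})
    for name in wide.columns
]
manual = pd.concat(pieces, ignore_index=True)

# The built-in melt yields the same rows in the same order
melted = wide.melt()
print(manual.to_numpy().tolist() == melted.to_numpy().tolist())
```

Stacking one small frame per column is only meant to show where each value ends up; in practice `melt` is the idiomatic call.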


In [2]:
data = {'x': ['a', 'b'], 'y': ['c', 'd'], 'z': ['e', 'f'],}
df_wide = (pd.DataFrame(data)
           .rename_axis('id', axis='index')
           .rename_axis('key', axis='columns'))

In [3]:
df_wide

key,x,y,z
id,,,
0,a,c,e
1,b,d,f


In [4]:
df_tidy = (df_wide
           .reset_index()
           .melt(id_vars='id')
           .set_index('id'))
df_tidy

id,key,value
0,x,a
1,x,b
0,y,c
1,y,d
0,z,e
1,z,f


In [5]:
df_tidy.pivot(columns='key', values='value')

key,x,y,z
id,,,
0,a,c,e
1,b,d,f


In [6]:
df_tidy.pivot(columns='key', values='value') == df_wide

key,x,y,z
id,,,
0,True,True,True
1,True,True,True
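The elementwise `==` above returns a whole frame of booleans. When a single yes/no answer is wanted for the round trip, one option is to reduce that frame with `.all()`, or to call `DataFrame.equals`, which also compares labels and dtypes. The sketch below repeats the construction of `df_wide` so that it runs on its own; `roundtrip`, `same_values` and `same_frame` are illustrative names.

```python
import pandas as pd

data = {'x': ['a', 'b'], 'y': ['c', 'd'], 'z': ['e', 'f']}
df_wide = (pd.DataFrame(data)
           .rename_axis('id', axis='index')
           .rename_axis('key', axis='columns'))

# wide -> tidy -> wide, same steps as in the cells above
roundtrip = (df_wide
             .reset_index()
             .melt(id_vars='id')
             .set_index('id')
             .pivot(columns='key', values='value'))

# Collapse the boolean frame: first per column, then overall
same_values = bool((roundtrip == df_wide).all().all())

# Stricter check: also compares index, columns and dtypes
same_frame = roundtrip.equals(df_wide)

print(same_values, same_frame)
```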


# Wide vs Tidy

## Wide

- More suitable for **gridded** data
- Compact representation (gridded case)
- Always possible to transform to **tidy**

## Tidy

- More general format
- Good abstraction for a wide range of data
- Not always possible/meaningful to transform to **wide**
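One case where the tidy to wide direction breaks down is when the same (id, key) pair occurs more than once. The sketch below uses a made-up frame `tidy` (not from the tutorial) to show that `pivot` refuses to reshape it, while `pivot_table` only succeeds because an aggregation (here the mean, an arbitrary choice) decides how to merge the duplicates.

```python
import pandas as pd

# Hypothetical tidy data: (id=0, key='x') is observed twice
tidy = pd.DataFrame({
    "id": [0, 0, 1],
    "key": ["x", "x", "y"],
    "value": [1.0, 2.0, 3.0],
})

try:
    tidy.pivot(index="id", columns="key", values="value")
    pivot_failed = False
except ValueError as err:
    # pandas cannot place two values in a single wide cell
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates; missing pairs become NaN
wide_mean = tidy.pivot_table(index="id", columns="key",
                             values="value", aggfunc="mean")
print(wide_mean)
```

The wide result is no longer a lossless view of the tidy data: the two observations for `(0, 'x')` are merged, and the pairs that were never observed show up as missing values.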

## Melt: rename the output columns

Change the name of the "variable" / "value" columns:

In [7]:
df_wide.melt(var_name='variables', value_name='values')

,variables,values
0,x,a
1,x,b
2,y,c
3,y,d
4,z,e
5,z,f


# Lossy Melt

*melt while dropping columns*

Melt (or "gather") the DataFrame using only the **selected columns**: each selected column's name goes in the "variable" column and each of its elements goes in the "value" column. Unselected columns are dropped.

In [8]:
df_wide.melt(value_vars=['x', 'y'])

,key,value
0,x,a
1,x,b
2,y,c
3,y,d


Melt (or "gather") the DataFrame while keeping the **selected columns** as identifier ("id") columns. The remaining columns are melted into the two "variable" / "value" columns.

In [9]:
df_wide.melt(id_vars=['x'])

,x,key,value
0,a,y,c
1,b,y,d
2,a,z,e
3,b,z,f


Note that melting only the non-id columns produces the same
"variable" / "value" columns, but without the id column 'x':

In [10]:
pd.melt(df_wide, value_vars=['y', 'z'])

,key,value
0,y,c
1,y,d
2,z,e
3,z,f


In [11]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df

,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [12]:
df.melt()

,variable,value
0,A,a
1,A,b
2,A,c
3,B,1
4,B,3
5,B,5
6,C,2
7,C,4
8,C,6
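A side effect of melting every column at once, as in the last cell, is that strings and integers end up in the same "value" column. The sketch below rebuilds the same `df` and compares that with keeping `'A'` as an id column; `all_melted` and `numeric_only` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'B': [1, 3, 5],
                   'C': [2, 4, 6]})

# Everything melted: "value" mixes strings and integers
all_melted = df.melt()
print(all_melted['value'].dtype)

# 'A' kept as identifier: "value" only holds the integers from B and C
numeric_only = df.melt(id_vars='A')
print(numeric_only['value'].dtype)
```

Keeping the label column in `id_vars` is usually the more useful shape here, since the numeric values stay numeric and each row still says which `A` it belongs to.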
