Data Science Fundamentals: Julia

In [1]:
using DataFrames # load package

## Reshaping DataFrames

### Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])

Row,id,id2,M1,M2
,Int64,Int64,Int64,Int64
1,1,1,11,111
2,2,1,12,112
3,3,2,13,113
4,4,2,14,114


In [3]:
stack(x, [:M1, :M2], :id) # pass the measure variables first, then the id variable

Row,id,variable,value
,Int64,Cat…,Int64
1,1,M1,11
2,2,M1,12
3,3,M1,13
4,4,M1,14
5,1,M2,111
6,2,M2,112
7,3,M2,113
8,4,M2,114


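Add the `view=true` keyword argument to make `stack` return a view instead of a copy. The columns of the resulting data frame then share memory with the columns of the source data frame, so mutating the source afterwards also changes the result; treat the operation as potentially unsafe.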
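As a small sketch of what sharing memory means in practice (assuming DataFrames.jl 1.x; the data frame `src` and its values are made up for illustration), mutating a source column after stacking changes the view but not the copy:

```julia
using DataFrames

# Hypothetical source data, not taken from the cells above
src = DataFrame(id=[1, 2], M1=[11, 12], M2=[111, 112])

copied = stack(src, [:M1, :M2], :id)             # default: columns are copied
viewed = stack(src, [:M1, :M2], :id, view=true)  # columns alias those of src

src.M1[1] = -1   # mutate the source in place

copied.value[1]  # still 11
viewed.value[1]  # now -1, because the view reads from src.M1
```

If the source data frame may change later, the default copying behavior is the safer choice.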

In [4]:
# optionally you can rename columns
stack(x, ["M1", "M2"], "id", variable_name="key", value_name="observed")

Row,id,key,observed
,Int64,Cat…,Int64
1,1,M1,11
2,2,M1,12
3,3,M1,13
4,4,M1,14
5,1,M2,111
6,2,M2,112
7,3,M2,113
8,4,M2,114


If the id variables are omitted in `stack`, all columns not listed as measure variables are treated as id variables.

In [5]:
stack(x, Not([:id, :id2]))

Row,id,id2,variable,value
,Int64,Int64,Cat…,Int64
1,1,1,M1,11
2,2,1,M1,12
3,3,2,M1,13
4,4,2,M1,14
5,1,1,M2,111
6,2,1,M2,112
7,3,2,M2,113
8,4,2,M2,114


In [6]:
stack(x, Not([1, 2])) # column indices can be used instead of column names

Row,id,id2,variable,value
,Int64,Int64,Cat…,Int64
1,1,1,M1,11
2,2,1,M1,12
3,3,2,M1,13
4,4,2,M1,14
5,1,1,M2,111
6,2,1,M2,112
7,3,2,M2,113
8,4,2,M2,114
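To make the inference of id variables concrete, the sketch below (assuming DataFrames.jl 1.x; the names `implicit` and `explicit` are introduced here only for illustration) compares the `Not` form with a call that spells out both the measure and the id variables:

```julia
using DataFrames

x = DataFrame(id=[1, 2, 3, 4], id2=[1, 1, 2, 2],
              M1=[11, 12, 13, 14], M2=[111, 112, 113, 114])

implicit = stack(x, Not([:id, :id2]))         # id variables are inferred
explicit = stack(x, [:M1, :M2], [:id, :id2])  # id variables are spelled out

implicit == explicit  # true
```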


In [7]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

Row,id,id2,a1,a2
,Int64,Char,Float64,Float64
1,1,'a',0.0291231,0.0856714
2,1,'b',0.00764146,0.488177
3,1,'c',0.844594,0.93089


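If `stack` is not passed any measure variables, the columns with a floating-point element type are selected as measures by default. In the example below the `Int64` column `id` is therefore kept as an id variable.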

In [8]:
stack(x)

Row,id,id2,variable,value
,Int64,Char,Cat…,Float64
1,1,'a',a1,0.0291231
2,1,'b',a1,0.00764146
3,1,'c',a1,0.844594
4,1,'a',a2,0.0856714
5,1,'b',a2,0.488177
6,1,'c',a2,0.93089
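The default selection can be checked against an explicit one. This is a minimal sketch assuming DataFrames.jl 1.x; fixed values replace `rand(3)` so that the comparison is reproducible, and the names `default_stack` and `explicit_stack` are illustrative:

```julia
using DataFrames

# Fixed values instead of rand(3) so the result is reproducible
x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[0.4, 0.5, 0.6])

default_stack = stack(x)               # measures chosen by element type
explicit_stack = stack(x, [:a1, :a2])  # the same choice written out

default_stack == explicit_stack  # true: only the Float64 columns were stacked
names(default_stack)             # id and id2 survive as id variables
```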


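Here all columns are floating-point, so all of them are treated as measures (on DataFrames.jl 0.22 and later, a data frame is constructed from a matrix with `DataFrame(rand(3,2), :auto)`):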

In [9]:
stack(DataFrame(rand(3,2)))

Row,variable,value
,Cat…,Float64
1,x1,0.893946
2,x1,0.604032
3,x1,0.0258379
4,x2,0.85813
5,x2,0.804899
6,x2,0.651169


In [10]:
df = DataFrame(rand(3,2))
df.key = [1,1,1]
mdf = stack(df) # duplicates in key are silently accepted

Row,key,variable,value
,Int64,Cat…,Float64
1,1,x1,0.6063
2,1,x1,0.868893
3,1,x1,0.60774
4,1,x2,0.36719
5,1,x2,0.355047
6,1,x2,0.517009


### Long to wide

In [11]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))

Row,id,id2,a1,a2
,Int64,Char,Float64,Float64
1,1,'a',0.967125,0.56558
2,1,'b',0.871342,0.468136
3,1,'c',0.60536,0.114597


In [12]:
y = stack(x)

Row,id,id2,variable,value
,Int64,Char,Cat…,Float64
1,1,'a',a1,0.967125
2,1,'b',a1,0.871342
3,1,'c',a1,0.60536
4,1,'a',a2,0.56558
5,1,'b',a2,0.468136
6,1,'c',a2,0.114597


In [13]:
unstack(y, :id2, :variable, :value) # standard unstack with a specified key

Row,id2,a1,a2
,Char,Float64?,Float64?
1,'a',0.967125,0.56558
2,'b',0.871342,0.468136
3,'c',0.60536,0.114597


In [14]:
unstack(y, :variable, :value) # all other columns are treated as keys

Row,id,id2,a1,a2
,Int64,Char,Float64?,Float64?
1,1,'a',0.967125,0.56558
2,1,'b',0.871342,0.468136
3,1,'c',0.60536,0.114597


In [15]:
# all columns other than those named :variable and :value are treated as keys
unstack(y)

Row,id,id2,a1,a2
,Int64,Char,Float64?,Float64?
1,1,'a',0.967125,0.56558
2,1,'b',0.871342,0.468136
3,1,'c',0.60536,0.114597
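As a sanity check, stacking and then unstacking on the same keys should recover the original values. The sketch below assumes DataFrames.jl 1.x and uses fixed numbers in place of `rand(3)`; `long` and `wide` are illustrative names:

```julia
using DataFrames

x = DataFrame(id=[1, 1, 1], id2=['a', 'b', 'c'],
              a1=[0.1, 0.2, 0.3], a2=[0.4, 0.5, 0.6])

long = stack(x, [:a1, :a2], [:id, :id2])
wide = unstack(long, [:id, :id2], :variable, :value)

wide == x  # true: same column names, order, and values
```

The comparison is by value only: as the `Float64?` headers above indicate, the unstacked columns allow missing values even when none are present.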


In [16]:
# you can rename the unstacked columns
unstack(y, renamecols=n->string("unstacked_", n))

Row,id,id2,unstacked_a1,unstacked_a2
,Int64,Char,Float64?,Float64?
1,1,'a',0.967125,0.56558
2,1,'b',0.871342,0.468136
3,1,'c',0.60536,0.114597


In [17]:
df = stack(DataFrame(rand(3,2)))

Row,variable,value
,Cat…,Float64
1,x1,0.258192
2,x1,0.595795
3,x1,0.511016
4,x2,0.26596
5,x2,0.426824
6,x2,0.826271


In [18]:
unstack(df, :variable, :value) # unable to unstack when no key column is present

ArgumentError: No key column found
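
One way to avoid this situation is to carry an explicit key column through the reshape. The following is a sketch under the assumption of DataFrames.jl 1.x; the `row` column and the fixed values are hypothetical and added only for illustration:

```julia
using DataFrames

# `row` is a hypothetical key column that identifies each original row
df = DataFrame(row=[1, 2, 3], x1=[0.1, 0.2, 0.3], x2=[0.4, 0.5, 0.6])

long = stack(df, [:x1, :x2], :row)             # key column kept as id variable
wide = unstack(long, :row, :variable, :value)  # key column makes rows identifiable

wide == df  # true
```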