![alt text](pandas.png "Title")

In [1]:
import pandas as pd
import numpy as np

# Reshaping dataframes

Quite often we need to rearrange the shape of a dataframe: pivot it, transpose it, and so on. Let's see how we can do that in pandas.

In [2]:
patients = [10010, 10011, 10012]
data = {
    'gender': ['M', 'F', 'F'],
    'age':    [20, 25, 23],
}

df = pd.DataFrame(data, index=patients, columns=['age', 'gender'])
df

,age,gender
10010,20,M
10011,25,F
10012,23,F


In [3]:
df.shape

(3, 2)

## Index swap

In [4]:
# T does a quick swap between the row index and the column index
transposed = df.T
transposed

,10010,10011,10012
age,20,25,23
gender,M,F,F


In [5]:
transposed.shape

(2, 3)

In [9]:
# ... and transpose back
transposed.T

,age,gender
10010,20,M
10011,25,F
10012,23,F
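
One caveat with the round trip: because the frame mixes numbers and strings, each transposed column has to hold both, so pandas falls back to the `object` dtype, and transposing back does not restore the original dtypes. The sketch below rebuilds the same `df` and uses `infer_objects()` to recover them; the variable names are only illustrative.

```python
import pandas as pd

# Same frame as above: one numeric column, one string column
df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)

round_trip = df.T.T                     # values survive, dtypes do not
restored = round_trip.infer_objects()   # re-infer a dtype for each object column

original_dtype = df['age'].dtype            # integer
round_trip_dtype = round_trip['age'].dtype  # object
restored_dtype = restored['age'].dtype      # integer again
```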


## Stack and unstack

Stacking a dataframe produces a pandas Series by pivoting the columns into the rows. The resulting Series has a hierarchical (two-level) index that combines the row index and the column index of the original dataframe.

In [6]:
# Stack df values in a Series:
s = df.stack()
s

10010  age       20
       gender     M
10011  age       25
       gender     F
10012  age       23
       gender     F
dtype: object
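
To make the two-level index more concrete, here is a small sketch that rebuilds the same `df` and selects from the stacked Series, first by the outer level only and then by both levels. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)
s = df.stack()

n_levels = s.index.nlevels          # outer level: patient id, inner level: column name
one_patient = s.loc[10010]          # Series indexed by 'age' and 'gender'
one_value = s.loc[(10011, 'age')]   # a single cell, addressed by both levels
```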

In [7]:
# Create a df out of that Series:
pd.DataFrame(s, columns=['value'])

,,value
10010,age,20
10010,gender,M
10011,age,25
10011,gender,F
10012,age,23
10012,gender,F


In [16]:
# From a hierarchically indexed Series, you can rearrange the data back into a dataframe
s.unstack()

,age,gender
10010,20,M
10011,25,F
10012,23,F
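
By default `unstack()` moves the innermost index level back to the columns. As a small sketch, passing `level=0` moves the outer level (the patient ids) instead, which gives the same layout as the transposed frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 25, 23], 'gender': ['M', 'F', 'F']},
    index=[10010, 10011, 10012],
)
s = df.stack()

by_patient = s.unstack()         # default level=-1: 'age' and 'gender' become columns
by_column = s.unstack(level=0)   # patient ids become the columns instead
```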


## Transpose narrow to wide

In [10]:
# Let's create a narrow (aka vertical) dataframe
narrow = pd.DataFrame(
   [ [10010, 1, 'HR', 86],
     [10010, 1, 'Sysbp', 130],
     [10010, 2, 'HR', 92], 
     [10010, 2, 'Sysbp', 125], 
     [10011, 1, 'HR', 75],
     [10011, 1, 'Sysbp', 110], 
     [10011, 2, 'HR', 69], 
     [10011, 2, 'Sysbp', 115],
   ],
    columns=['subjid', 'visit', 'param', 'result']
)

narrow

,subjid,visit,param,result
0,10010,1,HR,86
1,10010,1,Sysbp,130
2,10010,2,HR,92
3,10010,2,Sysbp,125
4,10011,1,HR,75
5,10011,1,Sysbp,110
6,10011,2,HR,69
7,10011,2,Sysbp,115


In [11]:
# and the same data, but in a wide (aka horizontal) dataframe:
wide = pd.DataFrame(
    np.array(
       [ [10010, 1, 86, 130],
         [10010, 2, 92, 125],
         [10011, 1, 75, 110],
         [10011, 2, 69, 115],                                
       ]
    ), columns=['subjid', 'visit', 'HR', 'Sysbp']
)

wide

,subjid,visit,HR,Sysbp
0,10010,1,86,130
1,10010,2,92,125
2,10011,1,75,110
3,10011,2,69,115


How do we get from wide to narrow or narrow to wide?

In [12]:
# T does not do what we need 
narrow.T

,0,1,2,3,4,5,6,7
subjid,10010,10010,10010,10010,10011,10011,10011,10011
visit,1,1,2,2,1,1,2,2
param,HR,Sysbp,HR,Sysbp,HR,Sysbp,HR,Sysbp
result,86,130,92,125,75,110,69,115


In [14]:
# and neither does stack()
pd.DataFrame(narrow.stack(), columns=['value']).head()

,,value
0,subjid,10010
0,visit,1
0,param,HR
0,result,86
1,subjid,10010


In [186]:
# We can use pivot_table() to transpose data from narrow to wide
# (the comments below map each argument to its SAS Proc Transpose counterpart):
widened = pd.pivot_table(
    narrow,
    values  = 'result',           # equivalent to the 'var' statement in Proc Transpose
    index   = ['subjid','visit'], # equivalent to the 'by' statement in Proc Transpose
    columns = 'param'             # equivalent to the 'id' statement in Proc Transpose
)

# the result is a dataframe with a MultiIndex (subjid, visit) on the rows
widened

subjid,visit,HR,Sysbp
10010,1,86,130
10010,2,92,125
10011,1,75,110
10011,2,69,115


In [187]:
# We can flatten the index:
widened.reset_index()

param,subjid,visit,HR,Sysbp
0,10010,1,86,130
1,10010,2,92,125
2,10011,1,75,110
3,10011,2,69,115
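
After `reset_index()` the column axis still carries the name `param`, which is why it shows up in the header corner of the output. If that label gets in the way, one option is to clear it; the sketch below uses a shortened copy of the narrow data and an illustrative variable name `flat`.

```python
import pandas as pd

# Shortened copy of the narrow data (visit 1 only)
narrow = pd.DataFrame(
    [[10010, 1, 'HR', 86], [10010, 1, 'Sysbp', 130],
     [10011, 1, 'HR', 75], [10011, 1, 'Sysbp', 110]],
    columns=['subjid', 'visit', 'param', 'result'],
)

flat = pd.pivot_table(
    narrow, values='result', index=['subjid', 'visit'], columns='param'
).reset_index()

name_before = flat.columns.name   # 'param', inherited from the pivoted column
flat.columns.name = None          # drop the label; the data is untouched
```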


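Note: pivot_table() is a generalization of pivot() with more features (lists of index columns, aggregation, etc.). When several rows share the same index/column combination, pivot_table() aggregates them, using the mean by default.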

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
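
The difference shows up as soon as the data contains duplicates. The sketch below uses a small made-up frame (not the data above) in which one patient has two HR readings: `pivot()` refuses to reshape it, while `pivot_table()` aggregates the duplicates with the function given in `aggfunc`.

```python
import pandas as pd

# Hypothetical data: patient 10010 has two HR readings
dup = pd.DataFrame(
    [[10010, 'HR', 86], [10010, 'HR', 90], [10011, 'HR', 75]],
    columns=['subjid', 'param', 'result'],
)

# pivot() cannot decide which of the two readings to keep
try:
    dup.pivot(index='subjid', columns='param', values='result')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates instead: mean by default, or whatever aggfunc says
averaged = pd.pivot_table(dup, values='result', index='subjid', columns='param')
highest = pd.pivot_table(dup, values='result', index='subjid', columns='param',
                         aggfunc='max')
```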

## Transpose wide to narrow

We can use melt() for that: it unpivots the measurement columns into one column of names and one column of values.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html

In [15]:
narrowed = wide.melt(
    id_vars    = ['subjid','visit'],  # by variables
    var_name   = 'param',             # name for the new variable holding the names
    value_name = 'result'             # name for the new variable holding the values
)

narrowed

,subjid,visit,param,result
0,10010,1,HR,86
1,10010,2,HR,92
2,10011,1,HR,75
3,10011,2,HR,69
4,10010,1,Sysbp,130
5,10010,2,Sysbp,125
6,10011,1,Sysbp,110
7,10011,2,Sysbp,115
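
By default melt() unpivots every column that is not listed in `id_vars`. As a small sketch, `value_vars` restricts the operation to selected columns; here only `HR` is kept, on a rebuilt copy of the wide data.

```python
import pandas as pd

wide = pd.DataFrame(
    [[10010, 1, 86, 130], [10010, 2, 92, 125],
     [10011, 1, 75, 110], [10011, 2, 69, 115]],
    columns=['subjid', 'visit', 'HR', 'Sysbp'],
)

hr_only = wide.melt(
    id_vars    = ['subjid', 'visit'],  # by variables
    value_vars = ['HR'],               # only unpivot these columns
    var_name   = 'param',
    value_name = 'result'
)
```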


In [193]:
# To compare narrow and narrowed, we first need to put narrowed in the same row order.
# Sorting on param as well makes the order within each visit explicit.
narrowed = narrowed.sort_values(['subjid', 'visit', 'param']).reset_index(drop=True)

# in case you wondered
narrow == narrowed

,subjid,visit,param,result
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True
6,True,True,True,True
7,True,True,True,True
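
For an automated check, rather than reading a table of booleans, one option is `pd.testing.assert_frame_equal`, which raises an AssertionError on the first mismatch. The sketch below runs the full round trip (narrow to wide and back) on a rebuilt copy of the data; `check_dtype=False` is an assumption made here so that only the values are compared, not the column dtypes.

```python
import pandas as pd

narrow = pd.DataFrame(
    [[10010, 1, 'HR', 86], [10010, 1, 'Sysbp', 130],
     [10010, 2, 'HR', 92], [10010, 2, 'Sysbp', 125],
     [10011, 1, 'HR', 75], [10011, 1, 'Sysbp', 110],
     [10011, 2, 'HR', 69], [10011, 2, 'Sysbp', 115]],
    columns=['subjid', 'visit', 'param', 'result'],
)

# narrow -> wide
wide = pd.pivot_table(
    narrow, values='result', index=['subjid', 'visit'], columns='param'
).reset_index()

# wide -> narrow, restoring the original row order and a fresh 0..n-1 index
back = (
    wide.melt(id_vars=['subjid', 'visit'], var_name='param', value_name='result')
        .sort_values(['subjid', 'visit', 'param'])
        .reset_index(drop=True)
)

# Raises AssertionError if any value, label or column name differs
pd.testing.assert_frame_equal(narrow, back, check_dtype=False)
```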


__________________________________________________
Nicolas Dupuis, Methodology and Innovation (IDAR C&SP), 2020+