# Restructuring Data Using Pandas `stack`, `unstack`, `melt`, `pivot` and `pivot_table` Functions

Data analysis is the first and foremost step in the machine learning life cycle. It involves inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision making. Python's pandas library is one of the most powerful and widely used tools for data analysis. In this blog we will try to understand some of the widely used pandas methods for reshaping or restructuring data, listed below.
- `stack`
- `melt` 
- `unstack`
- `pivot`
- `pivot_table` 

### Why should we reshape or restructure data?

Real world data is not always in a consumable form. It often contains missing entries and errors. It is easy to extract data from the rows and columns of a dataset, but there are situations when we need the data in a format that is different from the format in which we received it. It is therefore important to clean and restructure the data into a consumable form. Reshaping data includes converting columns to rows, converting rows to columns and performing aggregation to bring the data into a form that is easy to analyse.

Let us create a sample messy dataset and see how we can apply the above methods to reshape the data into a consumable form. We will also be using the pandas helper methods `set_index`, `reset_index`, `rename` and `rename_axis` to add final touches to the dataframe.

# Sample Dataset

In [1]:
import pandas as pd
import numpy as np

# Sensors data of 2 sensors for last 2 years.
iterables = [['sensor1', 'sensor2'],
             ['Pressure', 'Temperature', 'Flow']]

index = pd.MultiIndex.from_product(iterables,names=['Sensor', 'Metric'])

df_sensors = pd.DataFrame(np.random.randint(low=40, high=100,size=(6,2)),
                          index=index,
                          columns=['2017', '2018']).reset_index()

In [2]:
df_sensors

Unnamed: 0,Sensor,Metric,2017,2018
0,sensor1,Pressure,83,49
1,sensor1,Temperature,49,90
2,sensor1,Flow,61,72
3,sensor2,Pressure,64,58
4,sensor2,Temperature,41,99
5,sensor2,Flow,91,94


# Stack

The `stack` method takes all of the column names in the dataframe and reshapes them to be vertical as a single index level. The input parameters to the function are listed below.
- `level`: (int, str, list, default -1) Level(s) to stack from the column axis onto the index axis.
- `dropna`: (bool, default True) Whether to drop rows with missing values in the resulting frame.

Output:
- stacked dataframe or series.

__Note: By default the `stack` function takes all the columns in the dataframe and reshapes them into a single vertical column, returned as a series. You therefore need to set your index columns explicitly using the `pd.DataFrame.set_index` method before performing `stack`, and then use `pd.DataFrame.reset_index()` to convert the output back to a dataframe.__
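The `level` parameter only matters when the columns themselves have more than one level. The sketch below is a minimal illustration under that assumption; the frame `df_wide` and its values are hypothetical and are not part of the sensor dataset used in this blog.

```python
import pandas as pd

# Hypothetical frame with two column levels: Year (outer) and Metric (inner)
columns = pd.MultiIndex.from_product(
    [['2017', '2018'], ['Pressure', 'Flow']],
    names=['Year', 'Metric'],
)
df_wide = pd.DataFrame(
    [[83, 61, 49, 72],
     [64, 91, 58, 94]],
    index=pd.Index(['sensor1', 'sensor2'], name='Sensor'),
    columns=columns,
)

# Move only the 'Metric' column level into the index; 'Year' stays as columns
stacked_metric = df_wide.stack(level='Metric')

# Move only the 'Year' column level into the index; 'Metric' stays as columns
stacked_year = df_wide.stack(level='Year')

print(stacked_metric)
print(stacked_year)
```

Selecting the level by name rather than by position tends to keep the intent readable when the column hierarchy changes.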

In [3]:
df_sensors.stack()

0  Sensor        sensor1
   Metric       Pressure
   2017               83
   2018               49
1  Sensor        sensor1
   Metric    Temperature
   2017               49
   2018               90
2  Sensor        sensor1
   Metric           Flow
   2017               61
   2018               72
3  Sensor        sensor2
   Metric       Pressure
   2017               64
   2018               58
4  Sensor        sensor2
   Metric    Temperature
   2017               41
   2018               99
5  Sensor        sensor2
   Metric           Flow
   2017               91
   2018               94
dtype: object

As mentioned, `stack` by default takes in all the columns and converts them into a single vertical column at the innermost level. We have to explicitly set the index columns using the `set_index` method before performing the `stack`.

In order to convert the multiple year columns into a single year column, we have to set the columns `Sensor` and `Metric` as the index and then apply the `stack` function.

In [4]:
df_sensors.set_index(['Sensor', 'Metric']).stack()

Sensor   Metric           
sensor1  Pressure     2017    83
                      2018    49
         Temperature  2017    49
                      2018    90
         Flow         2017    61
                      2018    72
sensor2  Pressure     2017    64
                      2018    58
         Temperature  2017    41
                      2018    99
         Flow         2017    91
                      2018    94
dtype: int64

We can see that the output is a `series` with a hierarchical index. In order to convert it to a consumable dataframe we have to use the pandas helper methods `reset_index` and `rename`.

In [5]:
df_sensors.set_index(['Metric'])

Metric,Sensor,2017,2018
Pressure,sensor1,83,49
Temperature,sensor1,49,90
Flow,sensor1,61,72
Pressure,sensor2,64,58
Temperature,sensor2,41,99
Flow,sensor2,91,94


In [6]:
df_sensors.set_index(['Metric']).stack()

Metric             
Pressure     Sensor    sensor1
             2017           83
             2018           49
Temperature  Sensor    sensor1
             2017           49
             2018           90
Flow         Sensor    sensor1
             2017           61
             2018           72
Pressure     Sensor    sensor2
             2017           64
             2018           58
Temperature  Sensor    sensor2
             2017           41
             2018           99
Flow         Sensor    sensor2
             2017           91
             2018           94
dtype: object

In [7]:
(df_sensors.set_index(['Sensor', 'Metric'])
 .stack()
 .reset_index()
 .rename(columns={'level_2': 'Year', 0: 'Value'})
 .head(10))

Unnamed: 0,Sensor,Metric,Year,Value
0,sensor1,Pressure,2017,83
1,sensor1,Pressure,2018,49
2,sensor1,Temperature,2017,49
3,sensor1,Temperature,2018,90
4,sensor1,Flow,2017,61
5,sensor1,Flow,2018,72
6,sensor2,Pressure,2017,64
7,sensor2,Pressure,2018,58
8,sensor2,Temperature,2017,41
9,sensor2,Temperature,2018,99


# Melt

Pandas has different ways to accomplish the same task, the difference being readability and performance. The pandas dataframe method named `melt` works similarly to `stack` but gives more flexibility. The method takes in the 5 parameters below, of which two, namely `id_vars` and `value_vars`, are crucial to understanding how to reshape your data.
- `id_vars`: (list, tuple or ndarray) list of column names that you want to preserve as columns and not reshape. (optional)
- `value_vars`: (list, tuple or ndarray) list of column names that you want to reshape into rows; if omitted, all columns not listed in `id_vars` are used. (optional)
- `var_name`: (scalar) Name to use for the variable column, defaults to `variable`. (optional)
- `value_name`: (scalar) Name to use for the value column, defaults to `value`. (optional)
- `col_level`: (int or string) If the columns are a multi-index, use this level to melt. (optional)

All of the above parameters are optional. Let us try to understand the usage of each of the parameters with examples.

__Note: The advantage of `melt` over `stack` is that you can name the columns you want to preserve without explicitly setting them as the index using `set_index`.__
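Since every parameter is optional, it can help to see what `value_vars` changes. The following is a minimal sketch on a small hypothetical frame, `df_demo`, that mirrors the layout of `df_sensors`; the values are chosen only for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same layout as df_sensors
df_demo = pd.DataFrame({
    'Sensor': ['sensor1', 'sensor2'],
    'Metric': ['Pressure', 'Pressure'],
    '2017': [83, 64],
    '2018': [49, 58],
})

# value_vars omitted: every column not listed in id_vars is melted
melted_all = df_demo.melt(id_vars=['Sensor', 'Metric'])

# value_vars given: only '2017' is melted, the '2018' column is left out
melted_2017 = df_demo.melt(id_vars=['Sensor', 'Metric'], value_vars=['2017'])

print(melted_all)
print(melted_2017)
```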

In [8]:
df_sensors

Unnamed: 0,Sensor,Metric,2017,2018
0,sensor1,Pressure,83,49
1,sensor1,Temperature,49,90
2,sensor1,Flow,61,72
3,sensor2,Pressure,64,58
4,sensor2,Temperature,41,99
5,sensor2,Flow,91,94


In [9]:
df_sensors.melt(id_vars=['Sensor', 'Metric'],
                value_vars=['2017', '2018'])

Unnamed: 0,Sensor,Metric,variable,value
0,sensor1,Pressure,2017,83
1,sensor1,Temperature,2017,49
2,sensor1,Flow,2017,61
3,sensor2,Pressure,2017,64
4,sensor2,Temperature,2017,41
5,sensor2,Flow,2017,91
6,sensor1,Pressure,2018,49
7,sensor1,Temperature,2018,90
8,sensor1,Flow,2018,72
9,sensor2,Pressure,2018,58


As mentioned above, `melt` works similarly to `stack` but gives more flexibility. In the above step we can see that we can specify the identifier and value columns using `id_vars` and `value_vars`. `melt` assigns `variable` and `value` as the default names for the variable and value columns. In order to avoid this, we can pass the parameters `var_name` and `value_name` with the appropriate names.

In [10]:
df_sensors.melt(id_vars=['Sensor', 'Metric'],
                value_vars=['2017', '2018'],
                var_name='Year',
                value_name='value')

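Every name passed in `value_vars` must exist in the dataframe. Passing a column that is not present, for example `'2019'`, raises an error like the following:

KeyError: "The following 'value_vars' are not present in the DataFrame: ['2019']"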

__Important points about `melt`__:
- The `id_vars`, or identification variables, remain in the same columns but repeat for each of the `value_vars`.
- One crucial aspect of `melt` is that it ignores the values in the index; in fact it drops the existing index and replaces it
  with a `RangeIndex`. So if you have values in the index that you want to keep, you need to do a `reset_index` before
  applying `melt`.
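
The second point is easy to miss, so here is a minimal sketch of it. The frame `df_indexed` is hypothetical: it keeps the sensor names in the index instead of in a column.

```python
import pandas as pd

# Hypothetical frame whose sensor names live in the index, not in a column
df_indexed = pd.DataFrame(
    {'2017': [83, 64], '2018': [49, 58]},
    index=pd.Index(['sensor1', 'sensor2'], name='Sensor'),
)

# A plain melt replaces the index with a RangeIndex, so the sensor names are lost
lost = df_indexed.melt(var_name='Year', value_name='value')

# Moving the index into a column first keeps the sensor names
kept = (df_indexed
        .reset_index()
        .melt(id_vars='Sensor', var_name='Year', value_name='value'))

print(lost)
print(kept)
```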

__Note: The transformation of horizontal column names into vertical column values is referred to as `melting`, `stacking`, or `unpivoting`.__

# Unstack

DataFrames have two similar methods, `stack` and `melt`, to convert horizontal column names into vertical column values. DataFrames can also invert these two operations using the `unstack` and `pivot` methods. `stack`/`unstack` are simpler methods that allow control over column/row indexes, whereas `melt`/`pivot` give more flexibility to choose which columns are reshaped.

Below are the parameters for the `unstack` method. By default it takes the innermost index level and returns a dataframe by reshaping its values into columns.
- `level`: (int or string or list of these) level(s) of index to unstack, defaults to -1.
- `fill_value`: value used to replace NaN if `unstack` produces missing values.
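
As a minimal sketch of both parameters, the example below uses a small hypothetical series, `readings`, in which `sensor2` has no `Flow` entry, so that `unstack` has a gap to fill.

```python
import pandas as pd

# Hypothetical long-format readings; sensor2 deliberately has no 'Flow' row
idx = pd.MultiIndex.from_tuples(
    [('sensor1', 'Pressure'), ('sensor1', 'Flow'), ('sensor2', 'Pressure')],
    names=['Sensor', 'Metric'],
)
readings = pd.Series([83, 61, 64], index=idx, name='value')

# Default: the innermost level ('Metric') becomes the columns, the gap is NaN
wide_default = readings.unstack()

# fill_value replaces the gap that unstack produced
wide_filled = readings.unstack(fill_value=0)

# level selects which index level moves to the columns
wide_by_sensor = readings.unstack(level='Sensor')

print(wide_default)
print(wide_filled)
print(wide_by_sensor)
```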

In [None]:
# let us consider melted output and load into a dataframe df_melted
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
                value_vars=['2017', '2018'],
                var_name='Year',
                value_name='value')

In [None]:
df_melted

In order to view the yearly values of each sensor with each metric as a column, we can apply an `unstack` operation to the dataframe. As with the `stack` method, we first have to set the columns that should stay in the index using `set_index` and then apply `unstack`.

In [None]:
df_melted.set_index(['Sensor', 'Year', 'Metric'])

In [None]:
df_melted.set_index(['Sensor', 'Year', 'Metric']).unstack()

In [None]:
df_melted.set_index(['Sensor', 'Year', 'Metric']).unstack(level=[-2,-1])
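
To make the difference between the last two calls concrete, here is a self-contained sketch on a reduced, hypothetical version of the melted data (`df_long` holds only two metrics). It compares the default `unstack()` with `unstack(level=[-2, -1])`.

```python
import pandas as pd

# Hypothetical reduced version of the melted sensor data
df_long = pd.DataFrame({
    'Sensor': ['sensor1'] * 4 + ['sensor2'] * 4,
    'Year':   ['2017', '2017', '2018', '2018'] * 2,
    'Metric': ['Pressure', 'Flow'] * 4,
    'value':  [83, 61, 49, 72, 64, 91, 58, 94],
})
indexed = df_long.set_index(['Sensor', 'Year', 'Metric'])

# Default: only the innermost level ('Metric') moves to the columns,
# leaving (Sensor, Year) as the row index
one_level = indexed.unstack()

# level=[-2, -1]: both 'Year' and 'Metric' move to the columns,
# leaving one row per sensor
two_levels = indexed.unstack(level=[-2, -1])

print(one_level)
print(two_levels)
```

Because the frame still has its `value` column, the resulting columns carry `value` as their outermost level.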

# Pivot

`pivot` takes in 3 parameters (listed below) as input: `index`, `columns` and `values`. In its simplest form each parameter takes a single column name as a string. The column given as `index` stays vertical and becomes the new index. The values of the column referenced by `columns` become the new column names. The values referenced by `values` are tiled to correspond to the intersection of their former index and column labels.

- `index`: (string or object) Column to use for the new frame's index; if None, the current index is used.
- `columns`: (string or object) Column to use for the new frame's columns.
- `values`: (string or object) Column(s) to use for the new frame's values.

__Note__: __`pivot` raises a `ValueError` when any index/column combination has multiple values.__
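
The note above can be reproduced with a minimal sketch. The frame `df_dup` is hypothetical and contains two readings for the same sensor and year; dropping duplicates, as done at the end, is just one possible workaround and not necessarily the right one for real data.

```python
import pandas as pd

# Hypothetical data with two readings for ('sensor1', '2017')
df_dup = pd.DataFrame({
    'Sensor': ['sensor1', 'sensor1', 'sensor2'],
    'Year':   ['2017', '2017', '2017'],
    'value':  [83, 85, 64],
})

try:
    df_dup.pivot(index='Sensor', columns='Year', values='value')
    raised = False
except ValueError as err:
    # pivot cannot decide which of the two values belongs in the cell
    raised = True
    print(f'pivot failed: {err}')

# One possible workaround: keep only the last reading per combination
unique_rows = df_dup.drop_duplicates(subset=['Sensor', 'Year'], keep='last')
pivoted = unique_rows.pivot(index='Sensor', columns='Year', values='value')
print(pivoted)
```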

In [None]:
# let us consider melted output and load into a dataframe df_melted
df_melted = df_sensors.melt(id_vars=['Sensor', 'Metric'],
                value_vars=['2017', '2018'],
                var_name='Year',
                value_name='value')
df_melted.head(10)

In [None]:
# Sensors Yearly Pressure
df_melted[df_melted['Metric'] == 'Pressure'].pivot(index='Sensor', columns='Year', values='value')

In [None]:
# Yearly Temperature
df_melted[df_melted['Metric'] == 'Temperature'].pivot(index='Sensor', columns='Year', values='value')

In [None]:
# Yearly Flow
df_melted[df_melted['Metric'] == 'Flow'].pivot(index='Sensor', columns='Year', values='value')

The `pivot` function can be used to view individual metric values for all the sensors. In older pandas versions `pivot` accepts only a single column for `index`; releases from 1.1 onward also accept a list of columns.
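
The sketch below, which assumes pandas 1.1 or later, passes a list of columns as `index` to `pivot` so that all metrics fit into one frame without filtering first. The frame `df_long_demo` and its values are a hypothetical stand-in for the melted data.

```python
import pandas as pd

# Hypothetical stand-in for the melted sensor data
df_long_demo = pd.DataFrame({
    'Sensor': ['sensor1'] * 4 + ['sensor2'] * 4,
    'Metric': ['Pressure', 'Pressure', 'Flow', 'Flow'] * 2,
    'Year':   ['2017', '2018'] * 4,
    'value':  [83, 49, 61, 72, 64, 58, 91, 94],
})

# Assumes pandas >= 1.1, where `index` may be a list of column names
wide = df_long_demo.pivot(index=['Sensor', 'Metric'],
                          columns='Year',
                          values='value')
print(wide)
```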

# Pivot_table

`pivot_table` is a versatile and flexible function. Its functionality is similar to the pandas `groupby` function. The function parameters are listed below. The `index` parameter takes a column (or columns) that is not pivoted and whose unique values will be placed in the index. The `columns` parameter takes a column (or columns) that is pivoted and whose unique values will form the new columns. The `values` parameter takes a column that will be aggregated. There is also an `aggfunc` parameter, which takes an aggregation function that determines how the `values` column is aggregated; by default the aggregation is `mean`. The `fill_value` parameter replaces missing value intersections with the value specified. The function also has the parameters `margins`, `dropna` and `margins_name`, which have default values and whose usage is shown in the following examples.

- `index`: (column, list, array, Grouper) column(s) intended to stay as the index.
- `columns`: (column, list, array, Grouper) column(s) that are pivoted.
- `values`: column to aggregate.
- `aggfunc`: function or list of aggregation functions.
- `fill_value`: (scalar, default None) Scalar to fill in for missing values in the result.
- `margins`: (bool, default False) Whether to add all rows/columns (e.g. subtotal or grand total).
- `dropna`: (bool, default True) Do not include columns whose values for all rows are NaN.
- `margins_name`: (string, default `All`) Name of the row/column that will contain the totals when `margins` is True.

__Note: The `pivot` method raises a `ValueError` when an index/column combination has duplicate entries; `pivot_table` solves this problem by aggregating the values from rows with duplicate entries for the specified columns.__
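
As a minimal sketch of the note above, the hypothetical frame `df_repeat` has two readings for the same sensor and year. `pivot_table` aggregates them instead of failing, using `mean` unless another `aggfunc` is given.

```python
import pandas as pd

# Hypothetical data with two readings for ('sensor1', '2017')
df_repeat = pd.DataFrame({
    'Sensor': ['sensor1', 'sensor1', 'sensor2'],
    'Year':   ['2017', '2017', '2017'],
    'value':  [83, 85, 64],
})

# Default aggfunc is the mean of the duplicate readings
table_mean = df_repeat.pivot_table(index='Sensor', columns='Year', values='value')

# A different aggregation can be chosen explicitly
table_max = df_repeat.pivot_table(index='Sensor', columns='Year',
                                  values='value', aggfunc='max')

print(table_mean)
print(table_max)
```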

In [None]:
# let us consider melted output and load into a dataframe df_step1
df_step1 = df_sensors.melt(id_vars=['Sensor', 'Metric'],
                value_vars=['2017', '2018'],
                var_name='Year',
                value_name='value')
df_step1.head(10)

In [None]:
df_step1.pivot_table(index=['Sensor', 'Year'],
                     columns='Metric',
                     values='value')

In [None]:
df_step1.pivot_table(index=['Sensor', 'Year'],
                     columns='Metric',
                     values='value',
                     aggfunc=['sum', 'mean'],  # string names avoid the warning newer pandas raises for np.sum/np.mean
                     fill_value=0,
                     margins=True,
                     margins_name='Total')
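
The similarity between `pivot_table` and `groupby` mentioned at the start of this section can be checked with a small sketch. The frame `df_compare` is a hypothetical stand-in for the melted sensor data, and the `groupby` chain shown is one way to arrive at the same table, not the only one.

```python
import pandas as pd

# Hypothetical stand-in for the melted sensor data
df_compare = pd.DataFrame({
    'Sensor': ['sensor1'] * 4 + ['sensor2'] * 4,
    'Metric': ['Pressure', 'Pressure', 'Flow', 'Flow'] * 2,
    'Year':   ['2017', '2018'] * 4,
    'value':  [83, 49, 61, 72, 64, 58, 91, 94],
})

# Mean value per sensor and metric via pivot_table
via_pivot = df_compare.pivot_table(index='Sensor', columns='Metric',
                                   values='value', aggfunc='mean')

# The same table via groupby followed by unstack
via_groupby = (df_compare
               .groupby(['Sensor', 'Metric'])['value']
               .mean()
               .unstack())

print(via_pivot)
print(via_groupby)
```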