# Reshaping data with _pandas_

This Notebook describes some simple patterns for reshaping datasets in *pandas* including transposing rows and columns, and considers the difference between *long* and *wide* formatted data tables and moving between them.

In [None]:
import pandas as pd
import numpy as np

## Sample data in a simple DataFrame

Let's start by considering a simple example - creating a DataFrame of a particular size and shape from a sequence of numbers.

In the following cell, `np.arange(n)` generates an array of the integers `0` to `n-1`. `reshape(m,n)` arranges those values into an array with `m` rows and `n` columns, filled row by row. This array is then turned into a DataFrame with named columns.

In [None]:
df = pd.DataFrame(np.arange(6).reshape(3,2), columns=['a','b'])
df
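
A quick way to check what `arange` and `reshape` produce is to look at the intermediate values. The sketch below uses illustrative names (`values`, `grid`, `example_df`) that are not used elsewhere in this Notebook.

```python
import numpy as np
import pandas as pd

# arange(6) gives the six integers 0..5; the stop value 6 is excluded.
values = np.arange(6)

# reshape(3, 2) lays the values out one row at a time (row-major order).
grid = values.reshape(3, 2)

# Wrapping the array in a DataFrame adds the column labels.
example_df = pd.DataFrame(grid, columns=['a', 'b'])

print(values)
print(grid)
print(example_df)
```

Column `a` therefore holds every second value starting from 0, and column `b` every second value starting from 1.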

The following simple functions describe the shape of a DataFrame, given the DataFrame or its name, in the form of a short sentence.

In [None]:
def shape(df):
    rows,cols = df.shape
    return "That DataFrame has {0} rows and {1} columns.".format(rows, cols)

def shape_name(df_name):
    df = eval(df_name)
    rows,cols = df.shape
    return "The DataFrame '{0}' has {1} rows and {2} columns.".format(df_name, rows, cols)

In [None]:
shape(df)

In [None]:
shape_name('df')

## `transpose()` - turning rows into columns, and columns into rows

The `transpose()` method can be applied to a DataFrame to transpose columns and rows.

In [None]:
df2 = df.transpose()
df2

In [None]:
shape_name('df2')

## Long and wide table representations

Tabular data can be arranged in several different representations - two common forms are the _long_ and _wide_ versions of tables.   

Let's start by creating a fictitious dataset containing total expenses by expense area for a set of local council directorates.

We start by defining some directorates and expense areas.

In [None]:
directorates = ['Community Wellbeing & Social Care',
                'Childrens Services',
                'Economy & Environment',
                'Resources',
                'Corporate']
expensetypes = ['Accommodation Costs',
                'Payment to Private Contractors',
                'Operational Equipment',
                'Professional Services']

We want to create a dummy value for each expense type in each directorate. 

The Python `itertools` library has a `product` function that generates every possible combination of values from these two lists as a sequence of two-tuples, `[ (directorate1, expensetype1), (directorate1, expensetype2), ..., (directorateN, expensetypeM) ]`.

The `zip` function can then be used to 'unzip' the list of two-tuples into two separate sequences, one of directorates and one of expense types, that we can then use to create a DataFrame.

We can also add in a `total` column that contains a randomly generated dummy amount.

In [None]:
import itertools
a = list(itertools.product(directorates, expensetypes))
unzipa = list(zip(*a))
df_long = pd.DataFrame({'directorates':unzipa[0],
                        'expense types':unzipa[1],
                        'total':np.random.randint(0,20000,len(directorates)*len(expensetypes))
                      })
df_long[:6]
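
To see what `product` and `zip` are doing in the cell above, it can help to run them on a much smaller pair of lists. This is only a sketch: the lists `groups` and `kinds` are hypothetical stand-ins for the directorates and expense types.

```python
import itertools

# Hypothetical, shortened lists used only for this illustration.
groups = ['A', 'B']
kinds = ['x', 'y', 'z']

# product() pairs every item in the first list with every item in the second.
pairs = list(itertools.product(groups, kinds))

# zip(*pairs) regroups the pairs into one tuple per position.
group_column, kind_column = zip(*pairs)

print(pairs)
print(group_column)
print(kind_column)
```

The two resulting tuples have the same length (here 2 x 3 = 6), which is what lets them sit side by side as columns of a DataFrame.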

Here's what the shape of the long DataFrame looks like.

In [None]:
shape_name('df_long')

This is in _long_ format as each directorate has many rows of data, one for each expense type in our example.  

Note: it's not long because it has more rows than columns: it's long because each 'thing' in the table (each directorate) has attribute values in multiple rows.

## Exercise

How would you find the accommodation costs by directorate?

In [None]:
# Accommodation costs by directorate


In [None]:
# Sample solution.
#      One way would be to filter the long DataFrame based on the values 
#      contained in the *expense types* column. 
df_long[df_long['expense types']=='Accommodation Costs']

## Moving between long and wide representations
### `pivot()`: long to wide, for three-columned DataFrames

Suppose we want to reorder this _long_ table into a _wide_ format with the expense types as rows, and the directorates as columns:

In [None]:
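# Keyword arguments are used because recent versions of pandas no longer accept them positionally.
df_wide = df_long.pivot(index='expense types', columns='directorates', values='total')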
df_wide

In [None]:
shape_name('df_wide')

Notice how the pivot operation has created a DataFrame with a row index made up from the expense types. The cell values represent the *total*, though this explicit label has been lost.

This DataFrame contains the same information as our original data table, but in a more powerfully structured way. Properly indexed dataframes are much richer representations than simple tables and as such can support a wider range of transformations.
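
One way to see what the pivot has kept and what it has dropped is to inspect the index and column names of the result. The sketch below assumes a small table, `toy_long`, with invented figures; it is not the dataset used in this Notebook.

```python
import pandas as pd

# Invented figures, used only for this illustration.
toy_long = pd.DataFrame({
    'directorates': ['Resources', 'Resources', 'Corporate', 'Corporate'],
    'expense types': ['Accommodation Costs', 'Professional Services',
                      'Accommodation Costs', 'Professional Services'],
    'total': [100, 200, 300, 400],
})

toy_wide = toy_long.pivot(index='expense types',
                          columns='directorates',
                          values='total')

print(toy_wide.index.name)     # name carried over from the 'expense types' column
print(toy_wide.columns.name)   # name carried over from the 'directorates' column
print(list(toy_wide.columns))  # column labels are the directorate values
```

The names of the two key columns survive as the names of the row and column indexes, while the `total` label does not appear anywhere in the wide table.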

## Exercise
Using `df_wide`, write Python code to:
- display a list of expenses associated with the 'Economy & Environment' directorate
- display a list of the accommodation costs by directorate
- show the total amount associated with corporate accommodation costs.


In [None]:
# Display a list of expenses associated with the Economy & Environment directorate.


In [None]:
# Sample solution
#     Remember there are usually several different ways to achieve something like this in Python.

# We pull just the column associated with the required directorate from the DataFrame.
df_wide['Economy & Environment']

In [None]:
# Display a list of the accommodation costs by directorate.


In [None]:
# Sample solution

# To find the accommodation costs by directorate, we can select the row by its index label.
# (`.loc` is used because the older `.ix` indexer has been removed from pandas.)
df_wide.loc['Accommodation Costs']

In [None]:
# What amount is associated with corporate accommodation costs?


In [None]:
# Sample solution

# Either
df_wide.loc['Accommodation Costs', 'Corporate']

#  or
#df_wide['Corporate'].loc['Accommodation Costs']

## Exercise
For the previous examples, how does the `pivot()` operation differ from the `transpose()` method?


In [None]:
# Discussion
# `transpose()` is used to simply switch rows and columns.  
# `pivot()` reorganises values that appear in columns, to form rows.
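
The difference can also be seen by applying both operations to the same long table and comparing the shapes. This is a minimal sketch, assuming a small table `toy_long` with invented figures rather than the `df_long` DataFrame above.

```python
import pandas as pd

# Invented figures, used only for this illustration.
toy_long = pd.DataFrame({
    'directorates': ['Resources', 'Resources', 'Corporate', 'Corporate'],
    'expense types': ['Accommodation Costs', 'Professional Services',
                      'Accommodation Costs', 'Professional Services'],
    'total': [100, 200, 300, 400],
})

# transpose() swaps the axes: the three column names become the row labels.
transposed = toy_long.transpose()

# pivot() builds new row and column labels from the values in two columns.
pivoted = toy_long.pivot(index='expense types',
                         columns='directorates',
                         values='total')

print(transposed.shape)
print(pivoted.shape)
```

The transposed table still has one entry per original row, whereas the pivoted table has one row per expense type and one column per directorate.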

## Exercise

Suppose we want to reorder this _long_ table into a _wide_ format, but this time with the directorates as rows and the expense types as columns. Can you do that?

In [None]:
#df_wide2 = df_long.pivot('<replace this with your parameters>')
#df_wide2

In [None]:
# Sample solution
df_wide2 = df_long.pivot(index='directorates', columns='expense types', values='total')
df_wide2

In [None]:
shape_name('df_wide2')

###  `stack()`: from wide to long form

We can recreate a _long_ form by 'stacking' the data; in this case we will 'stack' the directorate column headings as sub-levels within each expense type.   This creates a Series with a hierarchical row index.

In [None]:
df_new_long = df_wide.stack()
df_new_long

In [None]:
# The output when we stack the df_wide2 DataFrame shouldn't come as a surprise.
df_new_long2 = df_wide2.stack()
df_new_long2

If it *was* a surprise, remember that the long, wide, stacked and unstacked forms only reshape the data; they do not change it. If there is a total value for a directorate and expense type in one form, it will be the same in all the forms.

Note that the `total` attribute name was lost in the `pivot()` operation, and it cannot be recovered from the reshaped data alone.

If you prefer the simple, non-hierarchical, tabular format of the original, we can generate that from the hierarchical stacked form by adding `reset_index()` after applying the `stack()`.

In [None]:
df_new_long.reset_index()
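
Although the reshaped data no longer carries the `total` label, we can supply it again by hand when resetting the index. The following is a sketch only, assuming a small wide table `toy_wide` with invented figures; the `name` argument of `Series.reset_index()` sets the label of the value column.

```python
import pandas as pd

# Invented figures, used only for this illustration.
toy_wide = pd.DataFrame(
    {'Corporate': [300, 400], 'Resources': [100, 200]},
    index=pd.Index(['Accommodation Costs', 'Professional Services'],
                   name='expense types'),
)
toy_wide.columns.name = 'directorates'

# Stacking gives an unnamed Series with a two-level row index.
stacked = toy_wide.stack()

# reset_index() turns both index levels into columns; name= labels the values.
restored = stacked.reset_index(name='total')

print(restored)
```

Without the `name` argument, the value column would simply be labelled `0`.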

With the hierarchically indexed `df_new_long` series, we can go back to the wide form by using `unstack()`. By default this unstacks the 'last' level of the hierarchy, but we can also declare which level to unstack explicitly.

Both `stack()` and `unstack()` operate over hierarchical indexes and columns.
`pivot()` is simpler, operating with just three-columned _long_ table forms (Object, Attribute, Value). 

Here's `unstack()` with a named hierarchical level to unstack:

In [None]:
df_new_long.unstack('directorates')

In [None]:
# And here we unstack using the 'other' level of the hierarchy:
df_new_long.unstack('expense types')
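
The level to unstack can be given by name, as above, or by position. The sketch below assumes a small hand-built Series, `stacked`, with invented figures, and compares the default behaviour with the two explicit forms.

```python
import pandas as pd

# A small two-level index built by hand; the figures are invented.
index = pd.MultiIndex.from_product(
    [['Accommodation Costs', 'Professional Services'],
     ['Corporate', 'Resources']],
    names=['expense types', 'directorates'])
stacked = pd.Series([300, 100, 400, 200], index=index)

by_default = stacked.unstack()               # last level becomes the columns
by_name = stacked.unstack('directorates')    # same level, chosen by name
by_position = stacked.unstack(0)             # first level becomes the columns

print(by_default)
print(by_position)
```

Unstacking the first level instead of the last gives the same values with rows and columns swapped.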

### Transforming wide to long data: *pandas* `melt()`

Data is often provided in wide formats, and we can use the *pandas* `melt()` function to select which columns of the wide table to put in long form.

Let's start by creating another dummy dataset in a simple table with no meaningful index - the sort of DataFrame we might create by simply reading in a CSV file, or a spreadsheet summary table.

In [None]:
simple_wide = pd.DataFrame(np.random.randint(0,20000,
       len(directorates)*len(expensetypes)).reshape(len(directorates), len(expensetypes)),
       columns=expensetypes)
simple_wide['directorates'] = directorates
simple_wide = simple_wide[['directorates']+expensetypes]
simple_wide

In [None]:
shape_name('simple_wide')

The pandas `.melt()` function allows us to 'melt' selected columns in a wide format data table into a long format.

First we indicate the `id_vars`, the columns whose values will form the identifier (keys) for the long format rows (remember the information about specific things in a long table is spread over several rows, each with the same key values).  

The melted table then gains two new columns: a `variable` column that holds the original column name of each melted value, and a `value` column that contains the corresponding cell value.

The next cell turns our `simple_wide` table into the long form using _all_ the `expensetypes` columns.  
The cell below that shows how to select only one or more specific columns from the original `simple_wide` DataFrame.

In [None]:
simple_melt = pd.melt(simple_wide, id_vars=['directorates'], value_vars=expensetypes)
simple_melt

In [None]:
shape_name('simple_melt')


In [None]:
simple_melt_select = pd.melt(simple_wide,
                             id_vars=['directorates'],
                             value_vars=['Accommodation Costs','Operational Equipment'])
simple_melt_select

In [None]:
shape_name('simple_melt_select')
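
The default `variable` and `value` column names can be replaced with more meaningful ones using the `var_name` and `value_name` arguments of `pd.melt()`. A minimal sketch, assuming a small wide table `toy_wide` with invented figures:

```python
import pandas as pd

# Invented figures, used only for this illustration.
toy_wide = pd.DataFrame({
    'directorates': ['Corporate', 'Resources'],
    'Accommodation Costs': [300, 100],
    'Professional Services': [400, 200],
})

# With value_vars omitted, every column not listed in id_vars is melted.
toy_melt = pd.melt(toy_wide,
                   id_vars=['directorates'],
                   var_name='expense types',
                   value_name='total')

print(toy_melt)
```

The result has one row per directorate and expense type, with the same column names as the long table created earlier in this Notebook.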


We can unmelt using `pivot()`, followed by `reset_index()` to turn the `directorates` index back into an ordinary column. Note, however, that the column index is left with the name `variable`. (This doesn't matter, but it's not very tidy!)

In [None]:
simple_unmelt = simple_melt.pivot(index='directorates', columns='variable', values='value').reset_index()
simple_unmelt
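
If the leftover `variable` name is unwanted, one option is to clear the name of the column index after unmelting. This is a sketch under the assumption of a small melted table, `toy_melt`, with invented figures.

```python
import pandas as pd

# Invented figures, laid out like the output of pd.melt().
toy_melt = pd.DataFrame({
    'directorates': ['Corporate', 'Resources', 'Corporate', 'Resources'],
    'variable': ['Accommodation Costs', 'Accommodation Costs',
                 'Professional Services', 'Professional Services'],
    'value': [300, 100, 400, 200],
})

toy_unmelt = toy_melt.pivot(index='directorates',
                            columns='variable',
                            values='value').reset_index()

# The column index still carries the name 'variable' at this point.
name_before = toy_unmelt.columns.name

# Clearing the name leaves a plain table with ordinary column labels.
toy_unmelt.columns.name = None

print(name_before)
print(toy_unmelt)
```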

## Why is reshaping data useful?

By this stage, you may be thinking: *this is all very interesting, but so what?*

Many datasets that you are likely to find will come in shapes determined by the publisher. By reshaping a dataset, you may be able to filter it naturally by row or column in a way that is at worst impossible, or at best difficult or clumsy to achieve, using the data in a different shape.

In addition, many plotting tools require data to be in a particular shape in order to produce charts of a particular style. The graphics libraries can do the plotting work for you, but only if you give the data to them in the shape they need.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 4 Notebooks. It's time to move on to Part 5.