In [None]:
import numpy as np
import pandas as pd

# Reshaping Data

It is often desirable to reshape a dataset into a more concise format. An example of this would be Excel pivot tables.

Pandas provides useful functions to reshape data. To explore these functions, we will load both the Titanic dataset and the Air Quality dataset.

In [None]:
# Open the titanic CSV
titanic = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv")

titanic.head()

In [None]:
# Open the air quality CSV
air_quality = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_long.csv", index_col="date.utc", parse_dates=True)

air_quality.head()

Notice that the air quality data is in a long format, with each observation on a separate row and each variable in a separate column.

# Sorting Rows

Row sorting is done with the `sort_values()` function.

First, let's sort the Titanic data according to the **Age** of the passengers.

In [None]:
# Sort titanic data by age
titanic.sort_values(by="Age").head()

We can sort by multiple columns at once by passing a list of column names. Rows are ordered by the first column, and ties are broken by the next.

This time we will sort by **Pclass** and **Age**.

In [None]:
titanic.sort_values(by=['Pclass', 'Age']).head()

Let's do the same sort, but this time in descending order by setting `ascending=False`.

In [None]:
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
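The `ascending` argument also accepts a list with one entry per sort column, which allows mixed sort directions. The sketch below uses a small made-up table (the `passengers` frame and its values are hypothetical stand-ins for the Titanic columns) so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above
passengers = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [3, 1, 1, 3],
    "Age": [22.0, 38.0, 54.0, None],
})

# Pclass in ascending order, Age in descending order within each class
mixed = passengers.sort_values(by=["Pclass", "Age"], ascending=[True, False])
print(mixed)
```

Note that the passenger with a missing **Age** ends up last within its class: by default `sort_values()` places missing values at the end (`na_position="last"`), regardless of sort direction.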

# Prepare a Demo Subset

We need a small subset of the data to illustrate reshaping.

In the air quality data, what parameter values do we have?

In [None]:
air_quality["parameter"].unique()

We have both PM<sub>2.5</sub> and NO<sub>2</sub> data.

Let's select only the NO<sub>2</sub> data.

In [None]:
# Select no2 data
no2 = air_quality[air_quality["parameter"] == "no2"]

no2.head()

Now we'll get an even smaller subset of data by:

- Sorting by the date (`sort_index`)
- Grouping by location (`groupby`)
- Selecting two measurements for each location (`head`)

In [None]:
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset
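To see what this chain does step by step, here is a minimal sketch on a made-up table. The `readings` frame, the station names `X` and `Y`, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical readings for two stations, deliberately out of date order
readings = pd.DataFrame(
    {"location": ["X", "Y", "X", "Y", "X"],
     "value": [5.0, 7.0, 1.0, 3.0, 9.0]},
    index=pd.to_datetime(
        ["2019-05-03", "2019-05-03", "2019-05-01", "2019-05-01", "2019-05-02"]
    ),
)

# Sort by date, then keep the first two rows of each location group
first_two = readings.sort_index().groupby("location").head(2)
print(first_two)
```

Unlike an aggregation, `head()` on a groupby returns the original rows with their original index, so the result is still a long-format DataFrame containing the two earliest readings per station.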

# Pivot: Reshape long to wide

![Pivot](https://drive.google.com/uc?id=1RATtZcEE5emTa-tuWyt-XvO-7Po7CXjd)

The `pivot()` function reshapes data from long to wide format.

Let's display our **no2_subset** data with the NO<sub>2</sub> values for all three stations next to each other.

In [None]:
no2_subset.pivot(columns="location", values="value")

Comparing this `pivot` output to the original **no2_subset** DataFrame, we see that it contains exactly the same values. The `pivot` has merely reshaped our data.
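The same reshaping can be tried on a tiny made-up table. In this sketch the `long_df` frame, the station names and the values are hypothetical. It also shows a limitation worth knowing: because `pivot()` only rearranges values, it raises a `ValueError` when two rows map to the same cell.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, location) pair
long_df = pd.DataFrame(
    {"location": ["X", "Y", "X", "Y"], "value": [1.0, 3.0, 9.0, 7.0]},
    index=pd.Index(
        pd.to_datetime(["2019-05-01", "2019-05-01", "2019-05-02", "2019-05-02"]),
        name="date.utc",
    ),
)

# Long to wide: one column per location, the index stays as the row labels
wide_df = long_df.pivot(columns="location", values="value")
print(wide_df)

# Adding a duplicate (date, location) pair makes the reshape ambiguous
duplicated = pd.concat([long_df, long_df.iloc[[0]]])
duplicate_error = None
try:
    duplicated.pivot(columns="location", values="value")
except ValueError as err:
    duplicate_error = err
    print("pivot failed:", err)
```

When several rows share the same index and column labels, the values have to be combined in some way, which `pivot()` does not do.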

# Pivot Table: Reshape and Aggregate

The `pivot_table()` function both reshapes data and aggregates values.

We have many measurements over time in our dataset. What if we want to find the mean concentration of both PM<sub>2.5</sub> and NO<sub>2</sub> at each location in table form?

In [None]:
air_quality.pivot_table(values="value", index="location",
                        columns="parameter", aggfunc="mean")

If we also want a summary row and column (both labelled **All**), we can set `margins=True`.

In [None]:
air_quality.pivot_table(values="value", index="location",
                        columns="parameter", aggfunc="mean",
                        margins=True)
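To make the margins concrete, here is a small sketch with made-up numbers that are easy to check by hand. The `measurements` frame, station names and values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements: several readings per (location, parameter) pair
measurements = pd.DataFrame({
    "location": ["X", "X", "X", "Y", "Y", "Y"],
    "parameter": ["no2", "no2", "pm25", "no2", "pm25", "pm25"],
    "value": [10.0, 20.0, 6.0, 30.0, 8.0, 12.0],
})

# Mean value per location and parameter, plus "All" margins
means = measurements.pivot_table(values="value", index="location",
                                 columns="parameter", aggfunc="mean",
                                 margins=True)
print(means)
```

The **All** entries are computed from the underlying rows, not by averaging the cell means. For station `X`, the `no2` cell is 15.0 and the `pm25` cell is 6.0, yet the **All** value is (10 + 20 + 6) / 3 = 12.0 rather than the average of 15.0 and 6.0.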

# Melt: Reshape wide to long

![Melt](https://drive.google.com/uc?id=1sdGEMI90GrhrqfOqjxcOqBT3RJg24WrN)

Let's start with our pivoted results in wide format.

In [None]:
no2_pivoted = no2.pivot(columns="location", values="value")

no2_pivoted.head()

We want the date to be treated as a regular column rather than as the index, so we will reset the index.

In [None]:
no2_pivoted = no2_pivoted.reset_index()

no2_pivoted.head()

We can use `melt()` to reshape this wide format data to long format where each observation has its own row.

In [None]:
no_2 = no2_pivoted.melt(id_vars="date.utc")

no_2.head()

Notice that our column headers have all been "melted" into a single **location** column.

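By default, `melt()` combines all columns that are NOT mentioned in `id_vars`. The result always contains two new columns alongside the `id_vars` columns: one holding the former column headers, and one holding the table values, named **value** by default. The header column is named **location** here because the pivoted columns carried that name; otherwise it would be called **variable**.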
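The names of the two melted columns can be set explicitly with `var_name` and `value_name`, and `value_vars` restricts which columns are melted. A minimal sketch on a made-up wide table follows. The `wide_df` frame, the station columns `X` and `Y`, and the output column name `no2` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per station
wide_df = pd.DataFrame({
    "date.utc": pd.to_datetime(["2019-05-01", "2019-05-02"]),
    "X": [1.0, 9.0],
    "Y": [3.0, 7.0],
})

# Default names: the columns have no name, so the headers land in "variable"
long_default = wide_df.melt(id_vars="date.utc")
print(long_default.columns.tolist())

# Explicit names for the header column and the value column
long_named = wide_df.melt(id_vars="date.utc", value_vars=["X", "Y"],
                          var_name="location", value_name="no2")
print(long_named)
```

Each of the two station columns contributes one row per date, so the melted table has four rows.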

# Summary

- Sorting by one or more columns can be done with `sort_values`
- `pivot` reshapes data from long to wide format without aggregating
- `pivot_table` both reshapes and aggregates data
- The reverse of `pivot` (long to wide format) is `melt` (wide to long format)