# Data Wrangling 2 - Data Reformatting

## Restructuring Data
Depending on where you obtain your data, you might receive it in one of two formats, referred to as wide data and long data. In the wide format, each column holds exactly one type of value, e.g., every entry is a temperature in Celsius, an amount of money in dollars, the name of a person, the address of a business, etc.

In [None]:
import pandas as pd

In [None]:
wide_data = pd.read_csv("https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_03/data/wide_data.csv")

In [None]:
wide_data.head()

In the above dataframe, notice that each column holds a single type of measurement. In this case they are all numbers, but `TMAX` *only* reports the maximum temperature for the day. This format is great for presentation and analysis. When we run the method `df.describe()`, we obtain sensible summary statistics for each column.

In [None]:
wide_data.describe()

Wide data is very commonly used in databases. Because each column has a single type of value, relational databases can host a suite of tables whose information is linked together through keys. These keys allow us to join only the necessary tables to get the information we need without much excess.

One drawback to this design is that it can be fairly inflexible. Imagine a scenario where you are taking a lot of measurements and storing them in a wide format. Your table might look like the following:

| test | Measurement 1 | Measurement 2 | Measurement 3 |
| ---- | - | - | - |
| Observation 1 | 2 | 3 | 3 |
| Observation 2 | 5 | 3 | 2 |
| Observation 3 | 6 | 3 | 3 |
| Observation 4 | 5 | 5 | 5 |
| Observation 5 | 0 | 3 | 3 |

Then, for whatever reason, you might need to add a new measurement to your experiment. As far as the experiment is concerned, all you need to do is add a new sensor or start asking a new question. However, if you are also doing the database administration, you have to perform a schema change, account for the missing values in all previous rows (especially if the new column cannot be null), and update your documentation if you are sharing the data with anyone.

Let's contrast this with the long data format. With the long data format, you often see two important columns present:
- Key
- Value

Combining the key column with some sort of row identifier, we can restructure the wide format into a table with fewer columns but more rows. An example of a long table is below.

In [None]:
long_data = pd.read_csv('https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_03/data/long_data.csv')[['date', 'datatype', 'value']]

In [None]:
long_data.head(15)

If we carefully compare the above table with the wide table that we examined earlier, we will see that all the same data is present. In this table, our key column is called `datatype`, and the value associated with each key is stored in `value`. We will also notice that each date is repeated three times, and that the values in `datatype` are the names of the columns in the wide table. In other words, we have restructured the data so that all of the measurement columns of the previous table are compressed into these two columns.

This format would be very confusing to show another person, but it does remove the need to create new columns. If you start recording a new type of measurement, simply create another key/`datatype` value and start recording the data. This format is often found when dealing with APIs, but it would make for terrible relational database design. Additionally, any attempt at producing summary statistics directly on this table would be nonsensical, since the `value` column mixes different kinds of measurements.
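To make the trade-off concrete, here is a minimal sketch on a small hand-made table. The dates, readings, and the `PRCP` key are invented for illustration and are not taken from the dataset above. Adding a new kind of measurement only appends rows, so the set of columns never changes.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (date, datatype) pair.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "value": [21.1, 8.9, 23.9, 13.9],
})

# A new kind of measurement (an invented "PRCP" key) arrives as extra rows.
new_rows = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "datatype": ["PRCP", "PRCP"],
    "value": [0.0, 1.5],
})
toy_long = pd.concat([toy_long, new_rows], ignore_index=True)

# The columns are unchanged; only the number of rows grew.
print(toy_long.columns.tolist())
print(len(toy_long))

# Summary statistics become meaningful once the rows are split by key.
print(toy_long.groupby("datatype")["value"].mean())
```

Grouping by the key column is one way to get per-measurement statistics without leaving the long format.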

Note that each format has its benefits and reasons for use. Therefore, it is important to know how to switch between the two as needed. Let's start with how to change from long to wide.

### Pivoting DataFrames
Converting from a long format to a wide format creates what is often called a **pivot table**. Microsoft Excel also has functionality for creating pivot tables. In pandas, the method that we call is, as you might expect, `pivot`. We give it a few parameters: which column should be the index of our new table, which column contains the column names, and which column contains the values. Doing so produces the following:

In [None]:
pivot_df = long_data.pivot(
    index='date', 
    columns='datatype',
    values='value'
)
pivot_df.head()

We have now created a new table that looks like the first wide table that we loaded. From here, we could join this table with other datasets or extract summary statistics.
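The same call can be tried on a small hand-made table, which also shows one limitation worth knowing about. This is a sketch on invented values, not the dataset above: `pivot` needs each index/column pair to appear only once, and `pivot_table` is shown as one possible way to aggregate repeats.

```python
import pandas as pd

# Hypothetical toy data: one row per (date, datatype) pair.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "value": [21.1, 8.9, 23.9, 13.9],
})

toy_wide = toy_long.pivot(index="date", columns="datatype", values="value")
print(toy_wide)

# pivot cannot decide what to do with two values for the same cell,
# so a repeated (date, datatype) pair raises a ValueError.
with_duplicate = pd.concat([toy_long, toy_long.iloc[[0]]], ignore_index=True)
try:
    with_duplicate.pivot(index="date", columns="datatype", values="value")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table aggregates the repeated pairs instead (here with the mean).
aggregated = with_duplicate.pivot_table(
    index="date", columns="datatype", values="value", aggfunc="mean"
)
print(aggregated)
```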

One additional feature of pandas is the ability to set up your data with a **hierarchical index**. If you recall from last week, this `value` column was actually the temperature in Celsius. So, let's rename it as such and add the temperature in Fahrenheit.

In [None]:
long_data.rename(columns={"value":"temp_C"}, inplace=True)
long_data["temp_F"] = (long_data.temp_C * 9/5) + 32

In [None]:
long_data.head()

When we pass in multiple columns to the `values` argument, we end up with the following dataframe.

In [None]:
pivot_df = long_data.pivot(
    index='date', 
    columns='datatype',
    values=['temp_C', 'temp_F']
)
pivot_df.head()

We now have two of each column name, one under Celsius and one under Fahrenheit. Under this structure, each top-level column name actually returns a dataframe.

In [None]:
pivot_df['temp_C'].head()

Therefore, in order to reference any single column, we need to provide (what looks like) two column names. 

In [None]:
pivot_df['temp_C']['TMAX'].head()

In [None]:
pivot_df.temp_C.TMAX.head()
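Chaining two lookups works for reading, but the two labels can also be passed together as a tuple. The following is a small sketch on invented values (the toy frame and its numbers are assumptions, not the dataset above) that shows the tuple form and `xs` for selecting by the inner level.

```python
import pandas as pd

# Hypothetical toy data with round numbers so the conversion is easy to check.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [20.0, 10.0, 25.0, 15.0],
})
toy_long["temp_F"] = toy_long["temp_C"] * 9 / 5 + 32

toy_pivot = toy_long.pivot(
    index="date", columns="datatype", values=["temp_C", "temp_F"]
)

# Each column label is a tuple: (top level, inner level).
print(toy_pivot.columns.tolist())

# A tuple selects one column in a single lookup.
tmax_f = toy_pivot[("temp_F", "TMAX")]
print(tmax_f)

# xs selects by the inner level, returning TMAX under both units.
tmax_all = toy_pivot.xs("TMAX", axis=1, level=1)
print(tmax_all)
```

The tuple form is a single indexing operation, which also makes it the safer choice when assigning values with `.loc`.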

### Multi-Index
Pandas also allows us to provide multiple indices for a particular row, just as we have provided multiple indices for a column (remember, column headers are actually treated as index objects). To use this, we call the familiar `set_index` method and pass in multiple columns, in the order that they should be grouped.

In [None]:
multi_index_df = long_data.set_index(['date', 'datatype'])
multi_index_df.head(6)

Here we have grouped the rows together based on the date, with the `datatype` column as a second index level. However, this is still our long format. Additionally, the `pivot` method reads its keys from regular columns, not from the levels of a multi-index. To achieve a table similar to the one from `pivot`, we can use the `unstack` method, which by default moves the innermost level of the multi-index (here `datatype`) into the columns. Picture the values as stacked on top of each other: we are now unstacking them and laying them out side by side in a single row.

In [None]:
unstacked_df = multi_index_df.unstack()
unstacked_df.head()
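Which level moves to the columns can be controlled with the `level` argument. Below is a minimal sketch on a hand-made frame (the values are invented for illustration) comparing the default with an explicit level, and showing how selecting the value column first gives single-level columns like those from `pivot`.

```python
import pandas as pd

# Hypothetical toy data in long format.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [20.0, 10.0, 25.0, 15.0],
})
toy_multi = toy_long.set_index(["date", "datatype"])

# Default: the innermost level (datatype) moves to the columns.
by_date = toy_multi.unstack()
print(by_date)

# Naming a level moves that one instead, leaving datatype as the row index.
by_datatype = toy_multi.unstack(level="date")
print(by_datatype)

# Unstacking a single Series gives plain, single-level column names.
flat = toy_multi["temp_C"].unstack()
print(flat)
```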

### Melting DataFrames
Melting is the inverse action of a pivot. A pivot spreads the values of a key column out into separate columns, while melting a dataframe takes columns and collapses them back into rows. When calling the method `melt`, we can pass in the arguments:
- `id_vars`: which columns uniquely identify a row
- `value_vars`: which columns contain values
- `var_name`: what to name the column with the column names (default is `variable`)
- `value_name`: what to name the column with the values of those columns (default is `value`)

In [None]:
melted_df = wide_data.melt(
    id_vars='date',
    value_vars=['TMAX', 'TMIN', 'TOBS'],
    value_name='temp_C',
    var_name='measurement'
)
melted_df.head()
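To see that `melt` and `pivot` really undo each other, here is a small round-trip sketch. The toy table and its values are invented for illustration. After pivoting back, `reset_index` and `rename_axis` are used only to restore the original column layout.

```python
import pandas as pd

# Hypothetical toy data in wide format.
toy_wide = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "TMAX": [20.0, 25.0],
    "TMIN": [10.0, 15.0],
})

# Wide to long: one row per (date, measurement) pair.
toy_melted = toy_wide.melt(
    id_vars="date",
    value_vars=["TMAX", "TMIN"],
    var_name="measurement",
    value_name="temp_C",
)
print(toy_melted)

# Long back to wide: pivot, then turn the index back into a column
# and drop the leftover "measurement" label on the column axis.
round_trip = (
    toy_melted.pivot(index="date", columns="measurement", values="temp_C")
    .reset_index()
    .rename_axis(columns=None)
)
print(round_trip.equals(toy_wide))
```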

Alternatively, we can use `stack`, the inverse of `unstack`. This takes the columns and stacks them together, creating a multi-index in the process.

In [None]:
stacked_series = wide_data.set_index('date').stack()

In [None]:
stacked_series.head()

Upon inspection, we see that this object is a series, not a dataframe. We can easily change this by calling `to_frame` and passing in the name for the column.

In [None]:
stacked_df = stacked_series.to_frame('values')

In [None]:
stacked_df.head()

As it stands, this dataframe has a multi-level index, but the second level of the index does not have a name.

In [None]:
stacked_df.index

We can name both index levels by calling `rename` on the index.

In [None]:
stacked_df.index.rename(['date', 'datatype'], inplace=True)
stacked_df.index.names

In [None]:
stacked_df.head()
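The same steps can also be written as one method chain. This is a sketch on an invented toy table, using `rename_axis` as an alternative to the in-place `rename` call, followed by `unstack` to confirm the round trip. On some pandas 2.x releases, `stack` may emit a `FutureWarning` about its new implementation; the result for this data is the same.

```python
import pandas as pd

# Hypothetical toy data in wide format.
toy_wide = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "TMAX": [20.0, 25.0],
    "TMIN": [10.0, 15.0],
})

toy_stacked = (
    toy_wide.set_index("date")
    .stack()                            # columns become an inner index level
    .rename_axis(["date", "datatype"])  # name both index levels
    .to_frame("values")
)
print(toy_stacked.index.names)
print(toy_stacked)

# unstack moves the datatype level back into the columns.
restored = toy_stacked["values"].unstack()
print(restored)
```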
