# Reshaping data

Data comes in all shapes and sizes. Often, the way our data is structured is not conducive to the analysis or visualization we want to perform.

Say that we have a basic data set such as the one below, showing the salary each person earned in the last few years.

## Wide data

In [None]:
salaries = [
    {'name': 'Sally',  '2022': 70_000, '2023': 75_000},
    {'name': 'Eloise', '2022': 60_000, '2023': 80_000},
    {'name': 'Ayla',   '2022': 80_000, '2023': 83_000},
]

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(salaries)
df

The above data is often referred to as "wide".

Why do we call it wide? Well, we've kept this example simple, but it's easy to imagine a real-world data set that has many more columns for prior years, for example going all the way back to 2010 or 2000. You may end up with dozens (or even hundreds) of columns depending on the data set.

This structure of data can be very useful for certain types of calculations.

Can you think of any calculations that you could perform on the data in its current form?

One interesting analysis might involve calculating how big a raise each person received -- both in absolute dollars and as a percentage relative to their original salary. You could then determine who got the biggest pay bump in total dollars and as a percentage of their original salary.

Let's do those calculations.

We'll start by calculating the total dollar amount of each person's raise.

In [None]:
df['raise'] = df['2023'] - df['2022']
df

Next, we'll figure out the percent increase in salary for each person.

> Remember the formula for percent change: `(New - Old) / Old`

We've already calculated and stored `New - Old` in our `raise` column, so we just need to do the division step and multiply by 100.

In [None]:
df['pct_raise'] = df['raise'] / df['2022'] * 100
df
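
The parentheses in the percent change formula matter, because division binds more tightly than subtraction. As a small sanity check, here is the calculation for Sally's salaries written out by hand. The `pct_change` helper is a hypothetical name introduced only for this illustration.

```python
def pct_change(old, new):
    """Percent change from old to new: (new - old) / old * 100."""
    return (new - old) / old * 100

# Sally's salary went from 70,000 to 75,000
sally_pct = pct_change(70_000, 75_000)
print(round(sally_pct, 2))

# Without parentheses, Python divides before it subtracts,
# so this computes 75_000 - (70_000 / 70_000 * 100) instead
wrong = 75_000 - 70_000 / 70_000 * 100
print(wrong)
```

In the DataFrame version, subtracting first and storing the result in the `raise` column has the same effect as the parentheses.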

And we could sort the data to see who received the biggest pay bump as a percentage of salary.

In [None]:
df.sort_values('pct_raise', ascending=False)
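
If we only need the name of the person at the top rather than the full sorted table, one possible shortcut is `idxmax`, which returns the index label of the largest value in a column. The sketch below rebuilds the same data so it can run on its own. The variable names `top_dollars` and `top_percent` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Sally',  '2022': 70_000, '2023': 75_000},
    {'name': 'Eloise', '2022': 60_000, '2023': 80_000},
    {'name': 'Ayla',   '2022': 80_000, '2023': 83_000},
])
df['raise'] = df['2023'] - df['2022']
df['pct_raise'] = df['raise'] / df['2022'] * 100

# Look up the name on the row holding the largest value in each column
top_dollars = df.loc[df['raise'].idxmax(), 'name']
top_percent = df.loc[df['pct_raise'].idxmax(), 'name']
print(top_dollars, top_percent)
```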

## Long data

Wide data, as seen above, can be quite useful. But what if we wanted to figure out the total of salaries year-by-year?

This type of calculation is a natural fit for the `pandas.DataFrame.groupby` method.

Alas, our wide data structure is not optimal for the `groupby` operation.

Instead, a "long" data set such as the one below would simplify things.

Below, notice how the columns for each year in the "wide" data set have been transformed into values for a new `year` column.

The table is literally longer (and less wide), hence the name "long" data.

In [None]:
from IPython.display import HTML, display

html = [
'<table><tr><td>name</td><td>year</td><td>salary</td></tr>'
]
for row in salaries:
    for year in ['2022', '2023']:
        tr = f"<tr><td>{row['name']}</td><td>{year}</td><td>{row[year]}</td></tr>"
        html.append(tr)
html.append('</table>')
display(HTML(''.join(html)))

### Melting data

To reshape data from wide to long, we can use a DataFrame's `melt` method.

It can be a bit tricky to use, so here's an annotated snippet.

In [None]:
long_data = df.melt(
    id_vars=['name'], # Column(s) to use as identifier variables
    var_name='year', # The name of the new column, or variable, we're generating
    value_vars=['2022', '2023'] # The columns whose names will populate the new "year" column
).rename({'value': 'salary'}, axis=1) # Rename the resulting "value" column for clarity

long_data
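
As an alternative to renaming afterwards, `melt` also accepts a `value_name` argument that names the values column directly. The sketch below rebuilds the wide data so it can run on its own. The name `long_alt` is only used here for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Sally',  '2022': 70_000, '2023': 75_000},
    {'name': 'Eloise', '2022': 60_000, '2023': 80_000},
    {'name': 'Ayla',   '2022': 80_000, '2023': 83_000},
])

long_alt = df.melt(
    id_vars=['name'],             # Column(s) kept as identifiers
    value_vars=['2022', '2023'],  # Columns to unpivot into rows
    var_name='year',              # Name for the column of old headers
    value_name='salary',          # Name for the column of old cell values
)
print(long_alt)
```

Either approach should produce the same three columns: `name`, `year`, and `salary`.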

Ok, now we're ready to group our data by year to figure out which year had the highest total salary.

> Note: to improve readability, we're wrapping the statement in parentheses so it can span multiple lines. Python treats everything inside the parentheses as one statement, so no backslash line continuations are needed.

In [None]:
(
    long_data
        .groupby('year')
        .salary  # Select the salary column
        .sum()   # Sum the salaries for each year
        .reset_index() # Turn the resulting Series back into a DataFrame
        .sort_values('salary', ascending=False) # Sort from highest to lowest total
)
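
A slightly shorter variant, sketched here as one option among several, passes `as_index=False` to `groupby` so the result stays a DataFrame and the `reset_index` step is not needed. The data is rebuilt so the snippet runs on its own, and `totals` is an illustrative name.

```python
import pandas as pd

long_data = pd.DataFrame({
    'name':   ['Sally', 'Eloise', 'Ayla', 'Sally', 'Eloise', 'Ayla'],
    'year':   ['2022', '2022', '2022', '2023', '2023', '2023'],
    'salary': [70_000, 60_000, 80_000, 75_000, 80_000, 83_000],
})

totals = (
    long_data
        .groupby('year', as_index=False)['salary']  # Keep "year" as a regular column
        .sum()                                      # Sum the salaries for each year
        .sort_values('salary', ascending=False)     # Highest total first
)
print(totals)
```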

## Pivoting data

It's also, of course, possible to go from long to wide data.

Let's say that our original data was in "long" format.

In order to calculate each person's raise between 2022 and 2023, we would want our data to be in wide format, similar to how we started this exercise.

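In this case, the `pandas.DataFrame.pivot_table` method is our friend. It is a close relative of `pandas.DataFrame.pivot` that also aggregates the values (by default, with the mean) if a name and year combination appears more than once.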

In [None]:
long_data.pivot_table(
    index='name',
    columns='year',
    values='salary'
).reset_index()
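
To make the difference between the two pivoting methods concrete, here is a minimal sketch. It uses `pivot` to get back to wide format and recompute each raise, then adds a made-up duplicate row (a second 2023 salary for Sally, which is not part of the original data) to show that `pivot` rejects duplicates while `pivot_table` averages them.

```python
import pandas as pd

long_data = pd.DataFrame({
    'name':   ['Sally', 'Eloise', 'Ayla', 'Sally', 'Eloise', 'Ayla'],
    'year':   ['2022', '2022', '2022', '2023', '2023', '2023'],
    'salary': [70_000, 60_000, 80_000, 75_000, 80_000, 83_000],
})

# pivot only reshapes: each (name, year) pair must appear exactly once
wide = long_data.pivot(index='name', columns='year', values='salary').reset_index()
wide.columns.name = None  # Drop the leftover "year" label on the column axis
wide['raise'] = wide['2023'] - wide['2022']
print(wide)

# Hypothetical duplicate entry, added only to illustrate the difference
extra = pd.DataFrame([{'name': 'Sally', 'year': '2023', 'salary': 77_000}])
dupes = pd.concat([long_data, extra], ignore_index=True)

try:
    dupes.pivot(index='name', columns='year', values='salary')
    pivot_failed = False
except ValueError:
    pivot_failed = True  # pivot cannot place two values in one cell

# pivot_table aggregates the duplicates (the default aggfunc is the mean)
averaged = dupes.pivot_table(index='name', columns='year', values='salary')
print(averaged)
```

When the long data is known to have exactly one row per name and year, `pivot` is the stricter choice because it fails loudly on unexpected duplicates rather than silently averaging them.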


## Keep learning

Pivots can be tricky to get right, and there are plenty of resources online that help explain the concept.

Here's the official pandas tutorial on [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html).

And [another tutorial](https://hausetutorials.netlify.app/posts/2020-05-17-reshape-python-pandas-dataframe-from-long-to-wide-with-pivottable/#long-to-wide-with-pivot_table) that includes some helpful visuals and a variety of related techniques such as aggregating column and row values when you pivot.