# Grouping with Pivot Tables

A pivot table aggregates data at the intersection of the unique values of two (or more) columns of your data. In the pivot table below, the two columns are `race` and `sex`. Pivot tables typically aggregate some other column of data. Here, the salary is averaged. There are 5 unique races and 2 unique values for sex. The pivot table shows the mean salary for each possible combination. Having the data in this structure can make it easier to read and compare.

![1]

[1]: images/pivot_table_example.png

### Creating a simple pivot table in pandas - four components

There are four components to a basic pivot table in pandas.

* One vertical grouping column
* One horizontal grouping column
* One aggregating column
* One aggregating function

In the example above, the two grouping columns are `race` and `sex`. The aggregating column is `salary` and the aggregating function is `mean`.

## Creating pivot tables with pandas

Let's read in the employee dataset and use it to recreate the pivot table above in pandas.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

To use the `pivot_table` method, set the `index`, `columns`, `values`, and `aggfunc` parameters:

* `index` - vertical grouping column
* `columns` - horizontal grouping column
* `values` - aggregating column
* `aggfunc` - aggregating function (defaults to the mean)

In [None]:
emp.pivot_table(index='race', columns='sex', values='salary', aggfunc='mean')

### Round results and convert to integer to reduce noise

The above DataFrame shows many decimal places that aren't important. Rounding to the nearest thousand and changing the data type to integer reduces this noise.

In [None]:
emp.pivot_table(index='race', columns='sex', 
                values='salary', aggfunc='mean').round(-3).astype('int64')

### Easily compare female vs male salary

Pivot tables make comparisons between groups easy. In this instance, the difference between female and male average salary is easily seen.

### Column and index labels are sorted

Notice that the index and column labels of a pivot table come from the unique values of the grouping columns. They are also sorted in alphabetical order. The aggregated data appears at the intersection of each pair of labels.

## Comparison to `groupby`

The `pivot_table` method is very similar to a `groupby` aggregation. Both are capable of producing the exact same results. Below, we replicate the results of our pivot table with `groupby`.

In [None]:
(emp.groupby(['race', 'sex'])
    .agg(mean_salary=('salary', 'mean'))
    .round(-3)
    .astype('int64'))

### Comparisons are more difficult with this shape

This call to `groupby` produced the exact same result as the pivot table but in a different shape. Having all of our data in a vertical column makes it difficult to make comparisons.

### Wide vs long data

Pivot tables produce **wide** data, with a new column for each unique value of one of the grouping columns. Wide data is typically easier to read and make decisions with. The `groupby` method returns **long** data with the results of each group in a single column, which makes comparisons more difficult. Long data, however, is generally easier to continue analyzing with further commands.
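
To make the relationship between the two shapes concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are assumptions for illustration, not the employee dataset). It suggests that moving the inner index level of the long `groupby` result into the columns gives the same wide table that `pivot_table` produces.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['Black', 'Black', 'White', 'White', 'Black', 'White'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'salary': [50000, 60000, 55000, 65000, 70000, 45000],
})

# Wide form: one column per unique value of sex
wide_form = toy.pivot_table(index='race', columns='sex',
                            values='salary', aggfunc='mean')

# Long form: one row per (race, sex) combination
long_form = toy.groupby(['race', 'sex'])['salary'].mean()

# Moving the inner index level (sex) into the columns reshapes long to wide
wide_from_long = long_form.unstack('sex')

print(wide_form)
print(wide_from_long.equals(wide_form))
```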

### The default aggregation is `mean`

By default, the `aggfunc` parameter is set to 'mean'. But, even if you are using the mean as your aggregation function, I advise that you explicitly state it in your call to `pivot_table` so that it's clear what you are doing.

### All aggregation strings are available for `pivot_table`

The same aggregation strings ('min', 'max', 'mean', etc.) are available to `pivot_table` as they are with `groupby`. Here we find the max salary for the same groups.

In [None]:
emp.pivot_table(index='race', columns='sex', values='salary', aggfunc='max')

### Where is the 'pivoting'?

Microsoft Excel is well-known for its pivot tables that are created by dragging and dropping different columns into different boxes, 'pivoting' the data around. With pandas, you'll have to change the parameter values and call the `pivot_table` method again in order to get the same effect. Let's pivot the table by putting sex along the index and race along the columns.

In [None]:
emp.pivot_table(index='sex', columns='race', values='salary', aggfunc='max')
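
For a simple pivot table like this one, with a single aggregating column and a single aggregating function, swapping `index` and `columns` should give the same table as transposing the original result. The sketch below checks this on made-up data; the `toy` DataFrame is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'race': ['Black', 'Black', 'White', 'White', 'Black', 'White'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'salary': [50000, 60000, 55000, 65000, 70000, 45000],
})

# race down the index, sex across the columns
by_race = toy.pivot_table(index='race', columns='sex',
                          values='salary', aggfunc='max')

# 'Pivoted' version: sex down the index, race across the columns
by_sex = toy.pivot_table(index='sex', columns='race',
                         values='salary', aggfunc='max')

# Transposing the first table swaps its rows and columns
print(by_race.T)
print((by_race.T.to_numpy() == by_sex.to_numpy()).all())
```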

## Styling pivot tables

You can style any DataFrame by changing the text color, background color, font, and several other items with the `style` accessor. It works similarly to the `str`, `dt`, and `cat` accessors in that it gives you access to style-only methods through dot notation. [Visit the documentation][1] for descriptions of all of the methods. Let's create a pivot table computing the mean salary by department and race and assign the result to a variable.

[1]: http://pandas.pydata.org/pandas-docs/stable/style.html

In [None]:
dept_race_mean = (emp.pivot_table(index='dept', columns='race', 
                                 values='salary', aggfunc='mean')
                     .round(-3).astype('int64'))
dept_race_mean

### Highlighting the maximum value

The `style` accessor's `highlight_max` method highlights the maximum value in each column or row. By default, it highlights the maximum value of each column, just as most other pandas methods operate down each column by default.

In [None]:
dept_race_mean.style.highlight_max()

### Change direction with `axis`

You can highlight the max of each row by setting the `axis` parameter to `'columns'` or 1. The single maximum cell of the entire table can be highlighted by setting `axis` to `None` (not shown).

In [None]:
dept_race_mean.style.highlight_max(axis='columns')

### Background color gradients

Use `background_gradient` to color the background based on the value of the cell. You can change the colors by choosing a [Matplotlib colormap][1].

[1]: https://matplotlib.org/stable/tutorials/colors/colormaps.html

In [None]:
dept_race_mean.style.background_gradient(cmap='Oranges')

### Adding commas to numbers

Make your data easier to read by inserting commas into large numbers with the `format` method. You must know how to use the [string format specification][0] from core Python. To add commas, use the string `'{:,.0f}'`. If you are unfamiliar with format specifications, use the link provided. As a quick overview, the actual specification is what comes after the colon and does not include the curly braces. Here, it is `',.0f'`. The comma is the thousands separator. The `.0` is the number of digits after the decimal point, which is 0 in this case. The character `'f'` is the 'type' and stands for fixed-point notation. If you wanted to include two decimal places, you would use the format string `'{:,.2f}'`.

[0]: https://docs.python.org/3/library/string.html#formatspec

In [None]:
dept_race_mean.style.format('{:,.0f}')
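
The format strings above are ordinary Python, so you can try them outside of pandas first. The short sketch below uses an arbitrary example number (an assumption, not a value from the dataset) to show the effect of each piece of the specification.

```python
value = 1234567.891  # arbitrary example number

# Comma separator, zero digits after the decimal point
no_decimals = '{:,.0f}'.format(value)
print(no_decimals)    # 1,234,568

# Comma separator, two digits after the decimal point
two_decimals = '{:,.2f}'.format(value)
print(two_decimals)   # 1,234,567.89

# The built-in format function takes the bare specification (no braces or colon)
bare_spec = format(value, ',.0f')
print(bare_spec)      # 1,234,568
```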

### Chaining style methods

You can chain together multiple style methods to your DataFrame. Before doing so, it's important to know that the returned object from a call to one of the style methods is NOT a DataFrame. It's a `Styler` object. We verify this below.

In [None]:
df_styled = dept_race_mean.style.format('{:,.0f}')
type(df_styled)

You cannot use this object as a normal DataFrame. If you try to call a normal DataFrame method such as `mean`, you'll get an error.

In [None]:
df_styled.mean()

You can retrieve the original (unstyled) data with the `data` attribute. Here, we verify it is the same object as the pivot table we calculated above.

In [None]:
df_styled.data is dept_race_mean

The benefit of having this Styler object is that you can chain one style method after another without referencing the `style` attribute again. Here, we add commas and highlight the maximum and minimum value of each column in different colors.

In [None]:
(dept_race_mean.style.format('{:,.0f}')
                     .highlight_max(color='yellow')
                     .highlight_min(color='lightblue'))

## Getting the size of each group

You can use the `pivot_table` method to count the number of rows for each combination of values in the grouping columns. When doing so, it is not necessary to use an aggregating column (the `values` parameter). pandas knows that the size of the group is independent of what it is aggregating, so it does not require you to provide it. Here, we get the size of each unique combination of department and race.

In [None]:
emp.pivot_table(index='dept', columns='race', aggfunc='size')
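
As a sanity check, the same counts can be reached with `groupby` by taking the size of each group and unstacking one level. This is a minimal sketch on made-up data; the `toy` DataFrame is an assumption for illustration, and every combination of department and race appears at least once so that no missing values arise.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police', 'Police', 'Fire'],
    'race': ['Black', 'White', 'Black', 'White', 'White', 'Black'],
    'salary': [40000, 80000, 50000, 70000, 90000, 60000],
})

# Group sizes via pivot_table: no values parameter is needed
sizes = toy.pivot_table(index='dept', columns='race', aggfunc='size')

# The same counts via groupby, reshaped from long to wide
sizes_groupby = toy.groupby(['dept', 'race']).size().unstack('race')

print(sizes)
print((sizes.to_numpy() == sizes_groupby.to_numpy()).all())
```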

## Add margins to get row and column totals

The optional parameter `margins` can be set to `True` to add one additional row and column to the pivot table, each of which applies the same aggregating function to the entire row or column. In the following pivot table, the average salary for all fire department employees is 61,000, and the average salary for all Hispanic employees is 55,000.

In [None]:
(emp.pivot_table(index='dept', columns='race', values='salary', 
                 aggfunc='mean', margins=True)
    .round(-3)
    .style.format('{:,.0f}'))

The average salary of all employees is given by the bottom right value. Let's verify this is correct by computing the average manually.

In [None]:
emp['salary'].mean()
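
One detail worth checking is that the margins are computed from the underlying rows, not from the cell values displayed in the table. The sketch below uses made-up data (the `toy` DataFrame is an assumption for illustration) in which the two calculations give different answers for the fire department.

```python
import math
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Fire', 'Police', 'Police', 'Police'],
    'race': ['Black', 'Black', 'White', 'Black', 'White', 'White'],
    'salary': [40000, 60000, 80000, 50000, 70000, 90000],
})

table = toy.pivot_table(index='dept', columns='race', values='salary',
                        aggfunc='mean', margins=True)
print(table)

# Bottom-right value: the mean of every salary in the data
overall = table.loc['All', 'All']
print(math.isclose(overall, toy['salary'].mean()))

# Row margin for Fire: the mean of the three Fire salaries (60,000)
fire_margin = table.loc['Fire', 'All']

# The mean of the two displayed Fire cells (50,000 and 80,000) is 65,000
fire_mean_of_cells = table.loc['Fire', ['Black', 'White']].mean()
print(fire_margin, fire_mean_of_cells)
```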

## Non-standard pivot tables

There are many different kinds of pivot tables that we can create besides one with exactly two grouping columns, one aggregating column, and one aggregating function. I am calling such pivot tables 'non-standard', though this is just to help differentiate them from the previous ones created above.

### A single grouping column

It is possible to have just a single grouping column in a pivot table. Here, we find the average salary for each department.

In [None]:
emp.pivot_table(index='dept', values='salary', aggfunc='mean').round(-3)

When using a single grouping column, the result will be exactly the same as a `groupby` aggregation. Using `groupby` has the advantage of letting you rename the resulting aggregate column.

In [None]:
emp.groupby('dept').agg(average_salary=('salary', 'mean')).round(-3)

You can pivot this one column table by using the `columns` parameter.

In [None]:
emp.pivot_table(columns='dept', values='salary', aggfunc='mean').round(-3)

### More than two grouping columns

You can use any number of grouping columns when creating pivot tables. Use a list to contain the columns you want for the rows or columns. Here we find the maximum salary by department, sex, and race. The unique combinations of department and sex are placed in the index. The resulting DataFrame has a multi-level index.

In [None]:
(emp.pivot_table(index=['dept', 'sex'], columns='race', 
                 values='salary', aggfunc='max')
    .round(-3)
    .head(10)
    .style.format('{:,.0f}'))

We can pivot this result so that there is a multi-level column index instead.

In [None]:
(emp.pivot_table(index='dept', columns=['race', 'sex'],
                 values='salary', aggfunc='max')
    .round(-3)
    .style.format('{:,.0f}'))

### Keep the multi-level index with pivot tables

I advise that you keep the multiple levels when using a pivot table (if you happen to produce them). This is the opposite of the advice I gave when grouping. The reasoning is that pivot tables are much more likely to be a final product: something that you use in a presentation or a report and won't be doing further analysis on. Therefore, you won't have to handle the multi-level index or columns. After doing a groupby, it is more likely that you'll be running other pandas commands, and doing so is much easier with a normal single-level index.

### Multiple aggregating columns

It is possible to aggregate more than one column as well. Because the employee dataset only has one main aggregating column, we will add a column for years of experience. The data was pulled in 2019, so we will subtract the year hired from it to get the approximate years of experience.

In [None]:
emp['experience'] = 2019 - emp['hire_date'].dt.year
emp.head(3)

Let's find the average salary and experience for every department and race. Notice that the columns now have multiple levels. Because of this, it becomes harder to use methods like `round`, since you might need a different number of decimals for different columns. We compromise here and format our data to a single decimal place. The margins for each aggregating column are also included.

In [None]:
emp.pivot_table(index='dept', columns='race', 
                values=['salary', 'experience'],
                aggfunc='mean', margins=True).style.format('{:,.1f}')

### Multiple aggregating functions

All components of a pivot table are capable of taking multiple values, including the aggregating functions. Here, we find the minimum, maximum, and average salary for each department and sex.

In [None]:
(emp.pivot_table(index='dept', columns='sex', values='salary',
                aggfunc=['min', 'max', 'mean'], margins=True)
    .round(-3)
    .style.format('{:,.0f}'))

### Reducing readability of a pivot table

Pivot tables with more than two grouping columns, or with multiple aggregating columns and aggregating functions, become less readable because the amount of data displayed can be immense. Here, we find the minimum, maximum, and average salary and experience for all combinations of department, sex, and race. Only the first few rows and columns are output.

In [None]:
df_messy_pivot = emp.pivot_table(index=['dept', 'sex'], columns='race', 
                                 values=['salary', 'experience'],
                                 aggfunc=['min', 'max', 'mean'], margins=True)
df_messy_pivot.iloc[:6, :10]

These large tables can be useful, but the results are often better presented as multiple different pivot tables rather than a single one. This pivot table has 19 rows and 36 columns with two levels in the index and three levels in the columns. It won't easily fit on a single screen.

In [None]:
df_messy_pivot.shape
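
If you already have a large pivot table like this, one way to break it into smaller ones is to select from the outer column levels. The sketch below is a minimal illustration on made-up data (the `toy` DataFrame and its values are assumptions); it relies on the aggregating function name being the outermost column level and the aggregating column being the next level.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Fire', 'Police', 'Police', 'Police'],
    'race': ['Black', 'Black', 'White', 'Black', 'White', 'White'],
    'salary': [40000, 60000, 80000, 50000, 70000, 90000],
    'experience': [5, 10, 20, 3, 8, 12],
})

messy = toy.pivot_table(index='dept', columns='race',
                        values=['salary', 'experience'],
                        aggfunc=['min', 'max', 'mean'])

# Three column levels: aggregating function, aggregating column, race
print(messy.columns.nlevels)

# Select one function, then one aggregating column, to get a simple table
mean_salary = messy['mean']['salary']
max_experience = messy['max']['experience']
print(mean_salary)
print(max_experience)
```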

## Exercises

Execute the following cell to read in the flights dataset and insert columns for the day and month name. Use it for the following exercises.

In [None]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.insert(1, 'day_of_week', flights['date'].dt.day_name())
flights.insert(2, 'month', flights['date'].dt.month_name())
flights.head(3)

### Exercise 1

<span style="color:green; font-size:16px">What is the average carrier delay for each day of the week for each airline? Highlight the worst day of the week for each airline.</span>

### Exercise 2

<span style="color:green; font-size:16px">Use a pivot table to find the total number of cancelled flights for each origin airport and airline. Place the airlines in the columns. Use the result to find the origin airport with the most cancelled flights for each airline. Also return this maximum number of cancelled flights.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the total distance flown for each airline for each month. Highlight the month with the most miles flown and use the style `format` method to put commas in the numbers so that they are easier to read.</span>

### Exercise 4

<span style="color:green; font-size:16px">Create a pivot table that shows the number of flights flown for every day of the week for every month.</span>

### Exercise 5

<span style="color:green; font-size:16px">In exercise 4, the months and days of the week are ordered alphabetically. It would be better if these values were ordered chronologically. Can you return a result that has both groups in the correct order? Use Monday as the first day of the week.</span>

### Exercise 6

<span style="color:green; font-size:16px">Create a new column in the flights dataset called `'dep_time_hour'` and set it equal to the hour (this will be an integer 0 through 23) of the flight. Find the average carrier delay for every month and `dep_time_hour`. Place the month in the columns.</span>

### Exercise 7

<span style="color:green; font-size:16px">Use both `groupby` and `pivot_table` to compute the average and median distance flown by day of the week.</span>