# Grouping with Pivot Tables

A pivot table aggregates data, usually at the intersection of the unique values of two columns. In the pivot table below, the two dimensions are `race` and `sex`. Every pivot table must aggregate some other column of data. Here, the salary is averaged.

There are 5 unique races and 2 unique values for sex. The pivot table shows the mean salary for each possible combination. Having the data in this structure can make it easier to read.

![][1]

[1]: images/pivot_table.png

### Creating a simple pivot table in pandas - four components
A basic pivot table in pandas has four components (the two grouping columns count separately).

* Two grouping columns
* One aggregating column
* One aggregating function

In the example above, the two grouping columns are `race` and `sex`. The aggregating column is `salary` and the aggregating function is `mean`.
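To make the four components concrete, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and all of its values are hypothetical and are not taken from the employee dataset.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only
toy = pd.DataFrame({
    'race':   ['Black', 'Black', 'White', 'White', 'White', 'Black'],
    'sex':    ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'salary': [50000, 60000, 55000, 65000, 75000, 70000],
})

# Two grouping columns (race, sex), one aggregating column (salary),
# and one aggregating function (mean)
toy_pivot = toy.pivot_table(index='race', columns='sex',
                            values='salary', aggfunc='mean')
toy_pivot
```

Each cell holds the mean salary of the rows that share that race and sex. For example, the two Black female salaries (50000 and 70000) average to 60000.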

## Creating the pivot table above with pandas

After reading in the data, let's identify the components of this pivot table.

* Grouping columns - race and sex
* Aggregating column - salary
* Aggregating function - mean

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head()

### Mapping the components of the pivot table to the parameters of the `pivot_table` method
The `pivot_table` method creates pivot tables for us in pandas. To create a pivot table, we set the `index`, `columns`, `values`, and `aggfunc` parameters. Each parameter corresponds to one component of the pivot table.

* `index` - grouping column
* `columns` - grouping column
* `values` - aggregating column
* `aggfunc` - aggregating function (defaults to the mean)

In [None]:
emp.pivot_table(index='race', columns='sex', values='salary', aggfunc='mean')

### Trick to reduce noise in dataframe - use `astype('int')`
The above DataFrame has many decimal places that add nothing to this result. Changing the data type of the columns from float to integer eliminates the noisy decimals. We do this with the `astype` method, which takes a string naming the new data type you would like to enforce on your data.

In [None]:
emp.pivot_table(index='race', columns='sex', 
                values='salary', aggfunc='mean').astype('int')

We can also round to the nearest thousand by calling `round(-3)` before converting to integers.

In [None]:
emp.pivot_table(index='race', columns='sex', 
                values='salary', aggfunc='mean').round(-3).astype('int')
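One detail worth checking is that `astype('int')` drops the fractional part rather than rounding it. The sketch below compares the two approaches on a small made-up table of averages. The name `means` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical mean salaries, invented for illustration only
means = pd.DataFrame({'Female': [50999.9, 61500.4],
                      'Male': [58499.6, 70000.0]},
                     index=['Black', 'White'])

truncated = means.astype('int')            # drops the decimals: 50999.9 -> 50999
rounded = means.round(0).astype('int')     # nearest whole number: 50999.9 -> 51000
thousands = means.round(-3).astype('int')  # nearest thousand: 61500.4 -> 62000
thousands
```

If some combination of labels has no rows, the pivot table holds a missing value there and `astype('int')` raises an error, so this trick assumes a fully populated table.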

### Easily compare female vs male salary
It is now trivial to compare female and male salaries for every race.

## Comparison to groupby
Since we already know the grouping columns, aggregating columns, and the aggregating functions, we should have no problem using a groupby.

In [None]:
emp.groupby(['race', 'sex']).agg({'salary': 'mean'}).astype('int')

### Comparisons are more difficult with this shape
This groupby has produced the exact same data as the pivot table but in a different shape. Having all of our data in a vertical column makes it difficult to make comparisons.

### Wide vs long data
Pivot tables produce **wide** data, which is often easier to read and to make decisions with. The `groupby` method returns **long** data, which takes a bit more effort to compare.
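The long and wide shapes hold the same numbers, and one way to see this is to reshape the groupby result with `unstack`. This is a sketch on a made-up DataFrame. The names `toy`, `long_form`, and `wide_form` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only
toy = pd.DataFrame({
    'race':   ['Black', 'Black', 'White', 'White', 'White', 'Black'],
    'sex':    ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'salary': [50000, 60000, 55000, 65000, 75000, 70000],
})

# Long data: one row per (race, sex) combination
long_form = toy.groupby(['race', 'sex'])['salary'].mean()

# Wide data: move the sex level of the index into the columns
wide_form = long_form.unstack('sex')

# The pivot table arrives at the same wide shape directly
pivot = toy.pivot_table(index='race', columns='sex',
                        values='salary', aggfunc='mean')
```

Here `wide_form` and `pivot` contain the same values with the same row and column labels.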

### All aggregation strings are available for `pivot_table`
All the aggregation strings (`'min'`, `'max'`, `'mean'`, etc.) are available to `pivot_table` just as they were with `groupby`.

### The default aggregation is `mean`
By default, `pivot_table` takes the `mean` of each group.
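A quick way to confirm the default is to build the same table with and without `aggfunc`. The DataFrame `toy` below is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only
toy = pd.DataFrame({
    'race':   ['Black', 'Black', 'White', 'White'],
    'sex':    ['Female', 'Male', 'Female', 'Female'],
    'salary': [52000, 61000, 48000, 58000],
})

# aggfunc omitted, so pivot_table falls back to the mean
default = toy.pivot_table(index='race', columns='sex', values='salary')

# aggfunc given explicitly
explicit = toy.pivot_table(index='race', columns='sex',
                           values='salary', aggfunc='mean')

default.equals(explicit)
```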

### Using a different aggregating function
Use any valid aggregation string. Here we find the max salary.

In [None]:
emp.pivot_table(index='race', columns='sex', values='salary', aggfunc='max').astype('int')

### Where is the 'pivoting'?

In Excel, you can pivot the table easily by dragging columns into different boxes. With pandas, you'll have to change the parameter values. Let's pivot the table by putting sex along the index and race along the columns.

In [None]:
emp.pivot_table(index='sex', columns='race', values='salary', aggfunc='max').astype('int')
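Swapping `index` and `columns` gives the same values as transposing the original pivot table. The sketch below checks this on a made-up DataFrame. The names `toy`, `by_race`, and `by_sex` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only
toy = pd.DataFrame({
    'race':   ['Black', 'Black', 'White', 'White', 'White', 'Black'],
    'sex':    ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'salary': [50000, 60000, 55000, 65000, 75000, 70000],
})

by_race = toy.pivot_table(index='race', columns='sex',
                          values='salary', aggfunc='max')
by_sex = toy.pivot_table(index='sex', columns='race',
                         values='salary', aggfunc='max')

# The pivoted table holds the same values as the transpose of the first one
(by_sex.values == by_race.T.values).all()
```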

### The unique values of each grouping column form the labels
Notice that the index and column labels of a pivot table come from the unique values of the grouping columns. The aggregated data appears at the intersection of each pair of labels.
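We can verify where the labels come from by comparing them with the sorted unique values of each grouping column. This is a sketch on a made-up DataFrame named `toy`.

```python
import pandas as pd

# Hypothetical miniature dataset, invented for illustration only
toy = pd.DataFrame({
    'race':   ['White', 'Black', 'Hispanic', 'Black', 'White', 'Hispanic'],
    'sex':    ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'salary': [65000, 50000, 58000, 60000, 55000, 62000],
})

pivot = toy.pivot_table(index='race', columns='sex',
                        values='salary', aggfunc='mean')

# Labels are the unique values of each grouping column, in sorted order
index_labels = list(pivot.index)
column_labels = list(pivot.columns)
print(index_labels, column_labels)
```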

## Styling pivot tables to find important data
You can style your DataFrame by changing the text color, background color, font, and several other items through the `style` property. It works similarly to the `str` and `dt` accessors in that it gives you access to style-only methods through dot notation. [Visit the documentation][2] for descriptions of all the methods.

[2]: http://pandas.pydata.org/pandas-docs/stable/style.html

In [None]:
dept_race_mean = emp.pivot_table(index='dept', columns='race', 
                                 values='salary', aggfunc='mean').round(-3)
dept_race_mean

### Using `highlight_max`
By default, `highlight_max` will highlight the maximum value of each column. This is just like how most other pandas methods work: by going down each column.

In [None]:
dept_race_mean.style.highlight_max()

### Change direction with `axis`
You can highlight the max of each row by changing the `axis` parameter.

In [None]:
dept_race_mean.style.highlight_max(axis='columns')
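Highlighting is visual only. If you need the labels of the highlighted cells as data, `idxmax` follows the same `axis` convention. The sketch below uses a small made-up table. The name `table` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rounded mean salaries, invented for illustration only
table = pd.DataFrame({'Black': [52000, 61000],
                      'White': [58000, 57000]},
                     index=['Fire', 'Police'])

# Going down each column (the default): row label of each column's maximum
col_max = table.idxmax()

# Going across each row: column label of each row's maximum
row_max = table.idxmax(axis='columns')

print(col_max)
print(row_max)
```

This does not use the `style` property at all. It is an alternative way to locate the same cells that `highlight_max` would color.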

### Background color gradients
Use `background_gradient` to color the background based on the value of the cell. You can change the colors by choosing a [Matplotlib colormap][3].

[3]: https://matplotlib.org/tutorials/colors/colormaps.html

In [None]:
dept_race_mean.style.background_gradient(cmap='YlOrRd')

## Exercises

Execute the following cell to read in the flights dataset and to add columns for the day and month name. Use it for the following exercises.

In [None]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights['day_of_week'] = flights['date'].dt.day_name()
flights['month'] = flights['date'].dt.month_name()
flights.head(3)

### Exercise 1
<span  style="color:green; font-size:16px">What is the carrier departure delay for each day of the week for each airline? Highlight the worst day of the week for each airline.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Use a `pivot_table` to find the total number of cancelled flights for each origin airport and airline, making sure the airline is in the columns. Use the result to find the origin airport with the most cancelled flights for each airline. Also return this maximum number of cancelled flights.</span>

### Exercise 3

<span  style="color:green; font-size:16px">Find the total distance flown for each airline for each month. Highlight the month with the greatest number of miles flown and use the style `format` method to put commas in the numbers so that they are easier to read.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Use the City of Houston employee dataset for this exercise. You can create pivot tables with multiple columns in the index or the columns by using a list. Create a pivot table with the department as the index and the race and sex as the columns. Calculate the median salary for these cross sections.</span>