# 02. Reshaping by Pivoting

### Objectives

+ Reshape data with the **`pivot`** method - it is the inverse of **`melt`**
+ Reshape and aggregate with the **`pivot_table`** method

### Resources
+ Read the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

## Preparing to invert melted data
Let's recreate our tidy data from the previous notebook.

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})
df

Unnamed: 0,State,Apple,Orange,Banana
0,Texas,12,10,40
1,Arizona,9,7,12
2,Florida,0,14,190


In [2]:
df_melt = df.melt(id_vars='State', value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit', value_name='Weight')
df_melt

Unnamed: 0,State,Fruit,Weight
0,Texas,Apple,12
1,Arizona,Apple,9
2,Florida,Apple,0
3,Texas,Orange,10
4,Arizona,Orange,7
5,Florida,Orange,14
6,Texas,Banana,40
7,Arizona,Banana,12
8,Florida,Banana,190


## Inverting melted data with `pivot`
Pandas has functionality to invert melted data back to its original messy form. This is sometimes called **pivoting** and is done with the **`pivot`** method. It has three available parameters.

+ **`index`** - The column that will stay vertical. This column becomes the index of the result
+ **`columns`** - The column whose unique values will be turned into the new column names
+ **`values`** - The column whose values will fill the cells of the new DataFrame

In [3]:
df_pivot = df_melt.pivot(index='State', columns='Fruit', values='Weight')
df_pivot

Fruit,Apple,Banana,Orange
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arizona,9,12,7
Florida,0,190,14
Texas,12,40,10


### What is that extra data?
You may be disturbed by seeing **`Fruit`** and **`State`** in the upper-left corner of the DataFrame. These two words are the **names** of their respective indexes. **`Fruit`** is the name of the column index and **`State`** is the name of the row index.

### Remove this noise
The index name can be useful, for example when calling the **`reset_index`** method: the index name becomes the name of the new column.

In [4]:
df_pivot2 = df_pivot.reset_index()
df_pivot2

Fruit,State,Apple,Banana,Orange
0,Arizona,9,12,7
1,Florida,0,190,14
2,Texas,12,40,10


### The name of the columns
That ugly name **`Fruit`** is still there. You can see it at the end of the output when you display the columns.

In [5]:
df_pivot2.columns

Index(['State', 'Apple', 'Banana', 'Orange'], dtype='object', name='Fruit')

### Remove this name
Set the **`name`** attribute of the columns to **`None`**.

In [6]:
df_pivot2.columns.name = None
df_pivot2.columns

Index(['State', 'Apple', 'Banana', 'Orange'], dtype='object')

## Output the cleaned-up DataFrame
We successfully removed the index and column names.

In [7]:
df_pivot2

Unnamed: 0,State,Apple,Banana,Orange
0,Arizona,9,12,7
1,Florida,0,190,14
2,Texas,12,40,10


All steps in a single cell:

In [8]:
df_pivot = df_melt.pivot(index='State', columns='Fruit', values='Weight')
df_pivot = df_pivot.reset_index()
df_pivot.columns.name = None
df_pivot

Unnamed: 0,State,Apple,Banana,Orange
0,Arizona,9,12,7
1,Florida,0,190,14
2,Texas,12,40,10
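
The same three steps can also be written as a single method chain. The sketch below is one possible alternative, not the approach used above: it assumes **`rename_axis(columns=None)`** as a substitute for assigning **`columns.name = None`**, and it rebuilds **`df_melt`** so that the cell runs on its own.

```python
import pandas as pd

# Rebuild the melted data from the start of this notebook
df = pd.DataFrame(data={'State': ['Texas', 'Arizona', 'Florida'],
                        'Apple': [12, 9, 0],
                        'Orange': [10, 7, 14],
                        'Banana': [40, 12, 190]})
df_melt = df.melt(id_vars='State', value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit', value_name='Weight')

# pivot -> move State back into a column -> drop the 'Fruit' name on the columns
df_chain = (df_melt
            .pivot(index='State', columns='Fruit', values='Weight')
            .reset_index()
            .rename_axis(columns=None))
df_chain
```

Note that the rows and columns come back sorted alphabetically, so the result holds the same values as the original **`df`** but not in the same order.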


## Aggregating with `pivot_table` 
The **`pivot_table`** method is similar to the **`pivot`** method, except that it aggregates the values for each combination of **`index`** and **`columns`**. Let's start with the same tidy DataFrame from above. We will then use the **`sample`** method to create many new rows of data. We will also change the weight to a random integer.

In [9]:
np.random.seed(1)
df_dupes = df_melt.sample(n=200, replace=True)
df_dupes['Weight'] = np.random.randint(1, 100, 200)
df_dupes.shape

(200, 3)

In [10]:
df_dupes.head()

Unnamed: 0,State,Fruit,Weight
5,Florida,Orange,43
8,Florida,Banana,75
5,Florida,Orange,67
0,Texas,Apple,89
0,Texas,Apple,99


## Attempting to reshape this won't work
The **`df_dupes`** DataFrame has the same exact columns and the same values for State and Fruit as **`df_melt`** but will produce an error when calling the **`pivot`** method. This is because there exists more than one row for each state and fruit combination.

For example, there are many rows that contain the state as Florida and the Fruit as Orange. Because there are multiple Weight values for this particular intersection, the **`pivot`** method will not work. 

Notice that in the first few rows, we have values of 43 and 67 for the weight of Florida Orange. We cannot pivot this data and put those two values in the same cell.

In [11]:
df_dupes.pivot(index='State', columns='Fruit', values='Weight')

ValueError: Index contains duplicate entries, cannot reshape

### Verify this with `groupby`
The **`pivot`** method works when there exists exactly one row for each combination of index and columns values, as is the case in the **`df_melt`** DataFrame. You can verify the number of occurrences of each state-fruit combination with the **`groupby`** method:

In [12]:
df_melt.groupby(['State', 'Fruit']).size()

State    Fruit 
Arizona  Apple     1
         Banana    1
         Orange    1
Florida  Apple     1
         Banana    1
         Orange    1
Texas    Apple     1
         Banana    1
         Orange    1
dtype: int64

In [13]:
df_dupes.groupby(['State', 'Fruit']).size()

State    Fruit 
Arizona  Apple     20
         Banana    31
         Orange    23
Florida  Apple     20
         Banana    22
         Orange    19
Texas    Apple     25
         Banana    18
         Orange    22
dtype: int64
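
A quicker check than reading the group sizes is to ask directly whether any state-fruit pair repeats. The following is a minimal sketch using the **`duplicated`** method; **`df_small`** is a made-up stand-in for **`df_dupes`**, not data from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_dupes: the Florida/Orange pair appears twice
df_small = pd.DataFrame({'State': ['Florida', 'Florida', 'Texas'],
                         'Fruit': ['Orange', 'Orange', 'Apple'],
                         'Weight': [43, 67, 89]})

# True if at least one State/Fruit pair occurs more than once,
# which is the situation in which pivot raises the ValueError
has_dupes = df_small.duplicated(subset=['State', 'Fruit']).any()
print(has_dupes)

# keep=False marks every row that belongs to a repeated pair
repeated_rows = df_small[df_small.duplicated(subset=['State', 'Fruit'], keep=False)]
print(repeated_rows)
```

If the check comes back `False`, each pair is unique and **`pivot`** can be used safely.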

## Introducing `pivot_table` to aggregate those values
The **`pivot_table`** method works very similarly to **`pivot`** but aggregates all the values at the intersection of the index and columns (State and Fruit here). By default, it takes the mean of all the values in each intersection.

In [14]:
df_dupes.pivot_table(index='State', columns='Fruit', values='Weight')

Fruit,Apple,Banana,Orange
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arizona,48.15,45.645161,52.913043
Florida,56.2,54.181818,49.526316
Texas,49.28,55.055556,46.590909


### Use the `aggfunc` parameter to choose the aggregation method
The **`aggfunc`** parameter may be passed a string to select the aggregation function. The same strings work here as with the **`agg`** GroupBy method.

In [15]:
df_dupes.pivot_table(index='State', columns='Fruit', values='Weight', aggfunc='sum')

Fruit,Apple,Banana,Orange
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arizona,963,1415,1217
Florida,1124,1192,941
Texas,1232,991,1025
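
**`aggfunc`** is not limited to a single string. As a sketch, passing a list of strings computes several aggregations at once and returns one block of columns per function. The small DataFrame **`df_small`** below is made up for illustration.

```python
import pandas as pd

# Made-up data with two rows per State/Fruit pair
df_small = pd.DataFrame({'State': ['Florida', 'Florida', 'Texas', 'Texas'],
                         'Fruit': ['Orange', 'Orange', 'Apple', 'Apple'],
                         'Weight': [43, 67, 89, 99]})

# One block of columns per aggregation; the columns become a two-level MultiIndex
df_minmax = df_small.pivot_table(index='State', columns='Fruit',
                                 values='Weight', aggfunc=['min', 'max'])
df_minmax
```

Combinations that never occur in the data, such as Florida Apple here, show up as missing values.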


## A practical `pivot_table` example
We can use **`pivot_table`** in a more practical example, such as finding the median salary for each combination of race and gender in the employee dataset.

In [16]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22


In [17]:
race_gen_sal = emp.pivot_table(index='race', columns='gender', values='salary', aggfunc='median')
race_gen_sal

gender,Female,Male
race,Unnamed: 1_level_1,Unnamed: 2_level_1
Asian,57227.5,55461.0
Black,44491.0,46486.5
Hispanic,43087.0,54090.5
Native American,58855.0,60347.0
Other,63785.0,38771.0
White,62264.5,62540.0


## `pivot_table` is very similar to `groupby`
The **`pivot_table`** method is very similar to the **`groupby`** method and produces the same values, but in a different shape. The two columns passed to the **`index`** and **`columns`** parameters become the grouping columns with the **`groupby`** method. Which output do you prefer?

In [18]:
emp.groupby(['race', 'gender']).agg({'salary': 'median'})

Unnamed: 0_level_0,Unnamed: 1_level_0,salary
race,gender,Unnamed: 2_level_1
Asian,Female,57227.5
Asian,Male,55461.0
Black,Female,44491.0
Black,Male,46486.5
Hispanic,Female,43087.0
Hispanic,Male,54090.5
Native American,Female,58855.0
Native American,Male,60347.0
Other,Female,63785.0
Other,Male,38771.0
White,Female,62264.5
White,Male,62540.0
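
To see that the two results differ only in shape, the grouped result can be reshaped with **`unstack`**, which moves the inner index level (**`gender`**) into the columns. This is a minimal sketch that uses a small made-up DataFrame, **`emp_small`**, in place of the employee dataset.

```python
import pandas as pd

# Made-up stand-in for the employee data
emp_small = pd.DataFrame({'race': ['Asian', 'Asian', 'Black', 'Black', 'Black'],
                          'gender': ['Female', 'Male', 'Female', 'Male', 'Male'],
                          'salary': [50000.0, 60000.0, 40000.0, 45000.0, 55000.0]})

by_pivot = emp_small.pivot_table(index='race', columns='gender',
                                 values='salary', aggfunc='median')

# groupby returns a Series with a two-level index;
# unstack moves 'gender' from the index into the columns
by_group = emp_small.groupby(['race', 'gender'])['salary'].median().unstack()

# Compare the two tables cell by cell
print(by_pivot.equals(by_group))
```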


## Use the `style` attribute to highlight the max of each row

In [19]:
race_gen_sal.style.highlight_max(axis='columns')

gender,Female,Male
race,Unnamed: 1_level_1,Unnamed: 2_level_1
Asian,57227.5,55461.0
Black,44491.0,46486.5
Hispanic,43087.0,54090.5
Native American,58855.0,60347.0
Other,63785.0,38771.0
White,62264.5,62540.0


# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read the file **`tidy/clean_movie1.csv`** and then use the **`pivot`** method to put the country names as the columns. Put the **`count`** as the new values for the DataFrame.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Read in the NYC deaths dataset and select only males from 2007. Pivot this information so we can more clearly see the breakdown of causes of death by race. Assign the result to a variable.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Use the result from problem 2 and highlight the leading cause of death for each race. Is it the same for each one?</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Read in the flights dataset. Find the total number of flights from each airline by their origin airport. Hint: When making a pivot table of just frequency, it's not necessary to have a `values` column. Save the results to a variable.</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Highlight the origin airport with the most flights for each airline. Do a few online searches to determine if those airports are hubs for those airlines.</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Read in the bikes dataset. For each type of weather event (the `events` column) find the median temperature for males and females.</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Reshape the movie dataset so that there are two columns, one for all of the actors and one for the content rating of each of their respective movies. Filter this DataFrame so that it contains the top 10 most common actors. Then create a table that displays the number of movies each actor made by content rating. The actor names should be in the index, with the content ratings in the columns, with the counts as the values.</span>

In [None]:
# your code here