# Reshaping with pivot_table and groupby

### Objectives
After this lesson you should be able to...

+ Master the reshaping methods: **`pivot_table`**
+ Know how **`pivot_table`** and **`groupby`** can produce equivalent results

### Prepare for this lesson by...
+ Read the [reshaping pandas documentation page](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)

## `pivot_table` and `groupby` are very similar
Now that you have seen how **`stack`** and **`melt`** are similar, and how **`unstack`** and **`pivot`** invert those operations, there is one more pair of reshaping methods that do nearly the same thing: **`pivot_table`** and **`groupby`** (when aggregating).

Since we already covered **`groupby`** thoroughly and had an example with **`pivot_table`**, we will jump right into a more complex example with the college dataset.

In [1]:
import pandas as pd
import numpy as np

college = pd.read_csv('../data/college.csv')
pd.options.display.max_columns = 40

college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Redo a `groupby` operation with `pivot_table`
It's not apparent at first, but **`groupby`** and **`pivot_table`** use almost the exact same inputs. The **`groupby`** below is passed three different lists, one for each part of the operation.

+ **['STABBR', 'RELAFFIL']** - these are grouping columns.
+ **['UGDS', 'SATMTMID']** - these are the columns being aggregated
+ **['size', 'min', 'max']** - these are the aggregating functions applied to each column

The **`pivot_table`** method (also a function) uses the parameter **`index`** for the first list, **`values`** for the second and **`aggfunc`** for the third.

In [2]:
# use a complex groupby from a previous notebook
cg = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)
cg

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [3]:
# replicate with pivot_table
cp = college.pivot_table(index=['STABBR', 'RELAFFIL'], 
                         values=['UGDS', 'SATMTMID'], 
                         aggfunc=[np.size, np.min, np.max]).head(12)
cp 

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,amin,amin,amax,amax
Unnamed: 0_level_1,Unnamed: 1_level_1,SATMTMID,UGDS,SATMTMID,UGDS,SATMTMID,UGDS
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7.0,7.0,,109.0,,12865.0
AK,1,3.0,3.0,503.0,27.0,503.0,275.0
AL,0,72.0,72.0,420.0,12.0,590.0,29851.0
AL,1,24.0,24.0,400.0,13.0,560.0,3033.0
AR,0,68.0,68.0,427.0,18.0,565.0,21405.0
AR,1,18.0,18.0,495.0,20.0,600.0,4485.0
AS,0,1.0,1.0,,1276.0,,1276.0
AZ,0,124.0,124.0,503.0,1.0,580.0,151558.0
AZ,1,9.0,9.0,480.0,25.0,480.0,4102.0
CA,0,609.0,609.0,445.0,0.0,785.0,44744.0


### `pivot_table` needs more help for exact replication
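Unfortunately, the result is not an exact replica. This call passes NumPy functions (`np.size`, `np.min`, `np.max`) to **`aggfunc`** rather than the string names used with **`groupby`**, so the labels in the output above appear as `amin` and `amax` instead of `min` and `max`. Recent pandas versions also accept string names such as `'min'` and `'max'` in **`aggfunc`**. The column levels are also swapped and in a different order. The **`swaplevel`** and **`sort_index`** DataFrame methods can fix this.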

In [4]:
# swap column levels and sort the top column level
cp.swaplevel(0, 1, axis='columns').sort_index(axis='columns', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,amin,amax,size,amin,amax
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7.0,109.0,12865.0,7.0,,
AK,1,3.0,27.0,275.0,3.0,503.0,503.0
AL,0,72.0,12.0,29851.0,72.0,420.0,590.0
AL,1,24.0,13.0,3033.0,24.0,400.0,560.0
AR,0,68.0,18.0,21405.0,68.0,427.0,565.0
AR,1,18.0,20.0,4485.0,18.0,495.0,600.0
AS,0,1.0,1276.0,1276.0,1.0,,
AZ,0,124.0,1.0,151558.0,124.0,503.0,580.0
AZ,1,9.0,25.0,4102.0,9.0,480.0,480.0
CA,0,609.0,0.0,44744.0,609.0,445.0,785.0


### Taking Advantage of Index Level Names
The original grouped data has two index levels with the same names as the columns they once were. These level names can be used in place of the numeric level positions that we used above. See the examples below, which pass the index level name to methods.

In [5]:
# original college grouped data
cg

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [6]:
# sort by religious affiliation
cg.sort_index(level='RELAFFIL', sort_remaining=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AL,0,72,12.0,29851.0,72,420.0,590.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
CA,0,609,0.0,44744.0,609,445.0,785.0
CO,0,118,0.0,25873.0,118,424.0,680.0
AK,1,3,27.0,275.0,3,503.0,503.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,1,18,20.0,4485.0,18,495.0,600.0


In [7]:
# get all the values of one level
cg.index.get_level_values('STABBR')

Index(['AK', 'AK', 'AL', 'AL', 'AR', 'AR', 'AS', 'AZ', 'AZ', 'CA', 'CA', 'CO'], dtype='object', name='STABBR')

### Crazy reshaping using `stack` and `unstack` with index and column level names
The level names really come in handy when stacking and unstacking data whose index and columns have multiple levels. Behold the wizardry below. First we will name the column levels, since they currently have no names.

In [8]:
# now all four index and column levels have names
# looks a little odd doesn't it?
cg = cg.rename_axis(['Agg_Cols', 'Agg_Funcs'], axis='columns')
cg

Unnamed: 0_level_0,Agg_Cols,UGDS,UGDS,UGDS,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Agg_Funcs,size,min,max,size,min,max
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
AK,0,7,109.0,12865.0,7,,
AK,1,3,27.0,275.0,3,503.0,503.0
AL,0,72,12.0,29851.0,72,420.0,590.0
AL,1,24,13.0,3033.0,24,400.0,560.0
AR,0,68,18.0,21405.0,68,427.0,565.0
AR,1,18,20.0,4485.0,18,495.0,600.0
AS,0,1,1276.0,1276.0,1,,
AZ,0,124,1.0,151558.0,124,503.0,580.0
AZ,1,9,25.0,4102.0,9,480.0,480.0
CA,0,609,0.0,44744.0,609,445.0,785.0


In [9]:
# commence wizardry
# Stack all the values in the Agg_Cols level
cg.stack('Agg_Cols')

Unnamed: 0_level_0,Unnamed: 1_level_0,Agg_Funcs,max,min,size
STABBR,RELAFFIL,Agg_Cols,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AK,0,SATMTMID,,,7
AK,0,UGDS,12865.0,109.0,7
AK,1,SATMTMID,503.0,503.0,3
AK,1,UGDS,275.0,27.0,3
AL,0,SATMTMID,590.0,420.0,72
AL,0,UGDS,29851.0,12.0,72
AL,1,SATMTMID,560.0,400.0,24
AL,1,UGDS,3033.0,13.0,24
AR,0,SATMTMID,565.0,427.0,68
AR,0,UGDS,21405.0,18.0,68


In [10]:
# stack values in Agg_Funcs
cg.stack('Agg_Funcs')

Unnamed: 0_level_0,Unnamed: 1_level_0,Agg_Cols,UGDS,SATMTMID
STABBR,RELAFFIL,Agg_Funcs,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,0,size,7.0,7.0
AK,0,min,109.0,
AK,0,max,12865.0,
AK,1,size,3.0,3.0
AK,1,min,27.0,503.0
AK,1,max,275.0,503.0
AL,0,size,72.0,72.0
AL,0,min,12.0,420.0
AL,0,max,29851.0,590.0
AL,1,size,24.0,24.0


In [11]:
# stack both into a Series
# now with 4 index levels!
s4 = cg.stack(['Agg_Funcs', 'Agg_Cols'])
s4

STABBR  RELAFFIL  Agg_Funcs  Agg_Cols
AK      0         size       UGDS             7.0
                             SATMTMID         7.0
                  min        UGDS           109.0
                  max        UGDS         12865.0
        1         size       UGDS             3.0
                             SATMTMID         3.0
                  min        UGDS            27.0
                             SATMTMID       503.0
                  max        UGDS           275.0
                             SATMTMID       503.0
AL      0         size       UGDS            72.0
                             SATMTMID        72.0
                  min        UGDS            12.0
                             SATMTMID       420.0
                  max        UGDS         29851.0
                             SATMTMID       590.0
        1         size       UGDS            24.0
                             SATMTMID        24.0
                  min        UGDS            13.0
            

In [12]:
# now unstack
s4.unstack('STABBR')

Unnamed: 0_level_0,Unnamed: 1_level_0,STABBR,AK,AL,AR,AS,AZ,CA,CO
RELAFFIL,Agg_Funcs,Agg_Cols,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,size,UGDS,7.0,72.0,68.0,1.0,124.0,609.0,118.0
0,size,SATMTMID,7.0,72.0,68.0,1.0,124.0,609.0,118.0
0,min,UGDS,109.0,12.0,18.0,1276.0,1.0,0.0,0.0
0,min,SATMTMID,,420.0,427.0,,503.0,445.0,424.0
0,max,UGDS,12865.0,29851.0,21405.0,1276.0,151558.0,44744.0,25873.0
0,max,SATMTMID,,590.0,565.0,,580.0,785.0,680.0
1,size,UGDS,3.0,24.0,18.0,,9.0,164.0,
1,size,SATMTMID,3.0,24.0,18.0,,9.0,164.0,
1,min,UGDS,27.0,13.0,20.0,,25.0,8.0,
1,min,SATMTMID,503.0,400.0,495.0,,480.0,441.0,


In [13]:
s4.unstack(['STABBR','Agg_Funcs'])

Unnamed: 0_level_0,STABBR,AK,AK,AK,AL,AL,AL,AR,AR,AR,AS,AS,AS,AZ,AZ,AZ,CA,CA,CA,CO,CO,CO
Unnamed: 0_level_1,Agg_Funcs,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max,size,min,max
RELAFFIL,Agg_Cols,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
0,UGDS,7.0,109.0,12865.0,72.0,12.0,29851.0,68.0,18.0,21405.0,1.0,1276.0,1276.0,124.0,1.0,151558.0,609.0,0.0,44744.0,118.0,0.0,25873.0
0,SATMTMID,7.0,,,72.0,420.0,590.0,68.0,427.0,565.0,1.0,,,124.0,503.0,580.0,609.0,445.0,785.0,118.0,424.0,680.0
1,UGDS,3.0,27.0,275.0,24.0,13.0,3033.0,18.0,20.0,4485.0,,,,9.0,25.0,4102.0,164.0,8.0,6745.0,,,
1,SATMTMID,3.0,503.0,503.0,24.0,400.0,560.0,18.0,495.0,600.0,,,,9.0,480.0,480.0,164.0,441.0,665.0,,,


### 'Wide' vs 'Long' format
'Wide' and 'long' are common (but perhaps imprecise) terms for describing the shape of data. Long format typically refers to data that has many rows and few columns. Stacked and tidy data would be 'long' data. 'Wide' data is the opposite and contains many columns and fewer rows. Pivoted and messy data is wide.
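
To make the distinction concrete, here is a minimal sketch on an invented toy DataFrame (the `state`, year and `enrollment` names are assumptions made for illustration and are not part of the college dataset). It melts a wide table into long form and then pivots it back.

```python
import pandas as pd

# Hypothetical toy data in 'wide' form: one row per state, one column per year
wide = pd.DataFrame({
    'state': ['AK', 'AL', 'AR'],
    '2015': [10, 20, 30],
    '2016': [11, 21, 31],
})

# 'Long' form: one row per (state, year) observation
long = wide.melt(id_vars='state', var_name='year', value_name='enrollment')

# Pivot the long data back to the original wide layout
wide_again = (long.pivot(index='state', columns='year', values='enrollment')
                  .reset_index()
                  .rename_axis(None, axis='columns'))

print(long)
print(wide_again)
```

The long table has six rows and three columns, while the wide table has three rows and one column for every year.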

### Back to `pivot_table` `groupby` equivalence
We just saw how we can use the **`pivot_table`** method to emulate a **`groupby`**. It's also possible to do the opposite. **`pivot_table`** offers the **`columns`** argument, which turns the unique values of a column into column names while aggregating. **`groupby`** offers no direct way to mimic this behavior, but with the help of **`unstack`** it is possible to create the equivalence.

In [14]:
# use pivot_table to transpose the STABBR column
college.pivot_table(index='RELAFFIL', 
                    columns='STABBR', 
                    values='UGDS', 
                    aggfunc=np.mean)

STABBR,AK,AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,...,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY
RELAFFIL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1
0,3508.857143,3248.774648,1793.691176,1276.0,4363.533898,3802.08981,2324.619469,1890.573171,2008.285714,2247.75,2676.443149,2344.0,2704.517483,2808.5,2531.05,2872.57377,1624.222222,2251.887446,2884.929293,2236.323529,...,2491.77628,1610.865517,1347.015748,2409.0,1698.535117,1599.431818,602.0,3504.894737,2197.853933,1560.913043,1594.900709,3206.222826,2862.363636,2635.720588,1971.0,1602.684211,2293.683168,2879.130952,1873.857143,2244.363636
1,123.333333,979.722222,917.785714,,692.75,1356.342105,2332.25,1674.142857,4874.75,3788.666667,993.642857,,2288.24,65.0,1509.0,1076.827586,6344.5,1851.860465,1835.5,646.761905,...,1330.269231,1382.098039,1387.5,953.857143,1703.157895,1590.8,,2043.333333,1283.0,691.857143,1450.068966,1680.758621,4938.0,3030.25,,942.0,2064.909091,1716.2,716.428571,


The trick is to group by all the columns in both the **`index`** and **`columns`** parameters of **`pivot_table`** and then use **`unstack`**.

In [15]:
cg2 = college.groupby(['RELAFFIL','STABBR'])['UGDS'].agg('mean')
cg2.head(10)

RELAFFIL  STABBR
0         AK        3508.857143
          AL        3248.774648
          AR        1793.691176
          AS        1276.000000
          AZ        4363.533898
          CA        3802.089810
          CO        2324.619469
          CT        1890.573171
          DC        2008.285714
          DE        2247.750000
Name: UGDS, dtype: float64

In [16]:
# unstack STABBR
cg2.unstack('STABBR')

STABBR,AK,AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,...,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY
RELAFFIL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1
0,3508.857143,3248.774648,1793.691176,1276.0,4363.533898,3802.08981,2324.619469,1890.573171,2008.285714,2247.75,2676.443149,2344.0,2704.517483,2808.5,2531.05,2872.57377,1624.222222,2251.887446,2884.929293,2236.323529,...,2491.77628,1610.865517,1347.015748,2409.0,1698.535117,1599.431818,602.0,3504.894737,2197.853933,1560.913043,1594.900709,3206.222826,2862.363636,2635.720588,1971.0,1602.684211,2293.683168,2879.130952,1873.857143,2244.363636
1,123.333333,979.722222,917.785714,,692.75,1356.342105,2332.25,1674.142857,4874.75,3788.666667,993.642857,,2288.24,65.0,1509.0,1076.827586,6344.5,1851.860465,1835.5,646.761905,...,1330.269231,1382.098039,1387.5,953.857143,1703.157895,1590.8,,2043.333333,1283.0,691.857143,1450.068966,1680.758621,4938.0,3030.25,,942.0,2064.909091,1716.2,716.428571,
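
The equivalence can also be checked programmatically rather than by eye. The sketch below uses a small invented frame (the column names mirror the college data, but the values are made up) and compares the two results with `DataFrame.equals`.

```python
import pandas as pd

# Hypothetical toy data; AS has no RELAFFIL == 1 row, so a NaN appears after reshaping
toy = pd.DataFrame({
    'STABBR':   ['AK', 'AK', 'AL', 'AL', 'AL', 'AS'],
    'RELAFFIL': [0, 1, 0, 0, 1, 0],
    'UGDS':     [100.0, 20.0, 300.0, 500.0, 40.0, 70.0],
})

# pivot_table with the columns parameter
via_pivot = toy.pivot_table(index='RELAFFIL', columns='STABBR',
                            values='UGDS', aggfunc='mean')

# groupby on both columns, then unstack the one that should become column names
via_groupby = toy.groupby(['RELAFFIL', 'STABBR'])['UGDS'].mean().unstack('STABBR')

print(via_pivot)
print(via_pivot.equals(via_groupby))
```

`equals` treats missing values in the same position as equal, which matters here because not every state has institutions in both groups.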


# Exercises
Solutions below.

The first set of problems will use NY state demographic data found on [data.gov](https://catalog.data.gov/dataset).

### Problem 1
<span  style="color:green; font-size:16px">Read in the `ny_demographics.csv` dataset. Is this a tidy dataset? Explain why or why not.</span>

In [17]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Reshape the NY demographic data so that it has three variables: JURISDICTION NAME, Gender and Count</span>

In [18]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Reshape the NY demographic data in the same way you did in problem 2 except with a different command. HINT: If you use stack, put columns that you don't want stacked in the index.</span>

Bonus: If you use stack, use method chaining to rename all the columns correctly.

In [19]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Find a different variable in the columns and tidy that variable by creating another three column DataFrame. Store your resulting DataFrame in **`df_count`**.</span>

In [20]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">For the same variable you used in problem 4, create another three column tidy dataset using the percentage column instead of the count. Store your resulting DataFrame in **`df_perc`**.</span>

In [21]:
# your code here

### Problem 6: Advanced
<span  style="color:green; font-size:16px">Add a **`Percent`** column to **`df_count`** that calculates the percent found in **`df_perc`**. Create an additional column, **`Percent_orig`**, from **`df_perc`** to **`df_count`**. Check that the calculated percentage and original percentage match.</span>

In [22]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">If you use the **`stack`** method on a 10 row, 5 column DataFrame (that has single level indexes), what will be the resulting shape and data structure? Answer this problem first without writing any code. Then confirm it by testing it on a DataFrame.</span>

In [23]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Give a name to both the index and the column index of the following DataFrame.</span>

In [24]:
df = pd.DataFrame(np.random.rand(2,2))
df

Unnamed: 0,0,1
0,0.092447,0.968751
1,0.12562,0.46091


In [25]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Use the `groupby` method (with other reshaping methods) to recreate the same DataFrame produced by `pivot_table` below.</span>

In [26]:
employee = pd.read_csv('../data/employee.csv')
# recreate this table
employee_pivot = employee.pivot_table(index='RACE', columns='GENDER', values='BASE_SALARY', aggfunc=np.mean)
employee_pivot

GENDER,Female,Male
RACE,Unnamed: 1_level_1,Unnamed: 2_level_1
American Indian or Alaskan Native,60238.8,60305.4
Asian/Pacific Islander,63226.3,61033.906667
Black or African American,48915.421233,51082.074074
Hispanic/Latino,46503.316176,54782.819018
Others,63785.0,38771.0
White,66793.352941,63940.388119


In [27]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Use the `melt` method to make the DataFrame `employee_pivot` tidy.</span>

In [28]:
employee_pivot = employee.pivot_table(index=['RACE', 'DEPARTMENT'], 
                            columns='GENDER', 
                            values='BASE_SALARY', 
                            aggfunc=np.mean)\
                .reset_index()\
                .rename_axis(None, axis='columns')
employee_pivot.head(10)

Unnamed: 0,RACE,DEPARTMENT,Female,Male
0,American Indian or Alaskan Native,Dept of Neighborhoods (DON),,26125.0
1,American Indian or Alaskan Native,Health & Human Services,54117.0,
2,American Indian or Alaskan Native,Housing and Community Devp.,98536.0,
3,American Indian or Alaskan Native,Houston Airport System (HAS),68299.0,
4,American Indian or Alaskan Native,Houston Fire Department (HFD),,78355.0
5,American Indian or Alaskan Native,Houston Police Department-HPD,,65682.333333
6,American Indian or Alaskan Native,Library,26125.0,
7,Asian/Pacific Islander,Admn. & Regulatory Affairs,72293.666667,
8,Asian/Pacific Islander,City Controller's Office,59077.0,
9,Asian/Pacific Islander,Fleet Management Department,,46010.0


In [29]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">Use the `stack` method to make the DataFrame `employee_pivot` from problem 10 tidy.</span>

In [30]:
# your code here

### Problem 12
<span  style="color:green; font-size:16px">Make the column levels `first` and `second` index levels. Make the index level `two` a column level.</span>

In [31]:
index = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd', 'e']], names=['one', 'two'])
columns = pd.MultiIndex.from_product([['A', 'B'], ['C', 'D']], names=['first', 'second'])
df = pd.DataFrame(np.random.rand(6,4), index=index, columns=columns)
df

Unnamed: 0_level_0,first,A,A,B,B
Unnamed: 0_level_1,second,C,D,C,D
one,two,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,c,0.549765,0.166687,0.827741,0.71002
a,d,0.860739,0.557555,0.518176,0.444524
a,e,0.57907,0.896733,0.132004,0.772665
b,c,0.663679,0.079129,0.796362,0.112729
b,d,0.949948,0.694059,0.009932,0.597996
b,e,0.724896,0.912991,0.088399,0.894031


In [None]:
# your code here

# Solutions

### Problem 1
<span  style="color:green; font-size:16px">Read in the `ny_demographics.csv` dataset. Is this a tidy dataset? Explain why or why not</span>

In [20]:
ny_demo = pd.read_csv('../data/ny_demographics.csv')

In [21]:
pd.options.display.max_columns = 50
ny_demo.head()

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,PERCENT PACIFIC ISLANDER,COUNT HISPANIC LATINO,PERCENT HISPANIC LATINO,COUNT AMERICAN INDIAN,PERCENT AMERICAN INDIAN,COUNT ASIAN NON HISPANIC,PERCENT ASIAN NON HISPANIC,COUNT WHITE NON HISPANIC,PERCENT WHITE NON HISPANIC,COUNT BLACK NON HISPANIC,PERCENT BLACK NON HISPANIC,COUNT OTHER ETHNICITY,PERCENT OTHER ETHNICITY,COUNT ETHNICITY UNKNOWN,PERCENT ETHNICITY UNKNOWN,COUNT ETHNICITY TOTAL,PERCENT ETHNICITY TOTAL,COUNT PERMANENT RESIDENT ALIEN,PERCENT PERMANENT RESIDENT ALIEN,COUNT US CITIZEN,PERCENT US CITIZEN,COUNT OTHER CITIZEN STATUS,PERCENT OTHER CITIZEN STATUS,COUNT CITIZEN STATUS UNKNOWN,PERCENT CITIZEN STATUS UNKNOWN,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
0,10001,44,22,0.5,22,0.5,0,0,44,100,0,0.0,16,0.36,0,0.0,3,0.07,1,0.02,21,0.48,3,0.07,0,0.0,44,100,2,0.05,42,0.95,0,0.0,0,0,44,100,20,0.45,24,0.55,0,0,44,100
1,10002,35,19,0.54,16,0.46,0,0,35,100,0,0.0,1,0.03,0,0.0,28,0.8,6,0.17,0,0.0,0,0.0,0,0.0,35,100,2,0.06,33,0.94,0,0.0,0,0,35,100,2,0.06,33,0.94,0,0,35,100
2,10003,1,1,1.0,0,0.0,0,0,1,100,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,0,0.0,0,0.0,0,0.0,1,100,0,0.0,1,1.0,0,0.0,0,0,1,100,0,0.0,1,1.0,0,0,1,100
3,10004,0,0,0.0,0,0.0,0,0,0,0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0,0,0.0,0,0.0,0,0.0,0,0,0,0,0,0.0,0,0.0,0,0,0,0
4,10005,2,2,1.0,0,0.0,0,0,2,100,0,0.0,0,0.0,0,0.0,1,0.5,0,0.0,1,0.5,0,0.0,0,0.0,2,100,1,0.5,1,0.5,0,0.0,0,0,2,100,0,0.0,2,1.0,0,0,2,100


No, it is not tidy. Many column names, such as COUNT MALE and COUNT FEMALE, contain values of a variable (here, gender) rather than being variable names themselves.

### Problem 2
<span  style="color:green; font-size:16px">Read in the `ny_demographics.csv` dataset. Reshape the data so that it has three variables: JURISDICTION NAME, Gender and Count.</span>

**`JURISDICTION NAME`** is the zip code. Feel free to rename it.

In [22]:
df = ny_demo.melt(id_vars='JURISDICTION NAME', 
                  value_vars=['COUNT FEMALE', 'COUNT MALE'], 
                  var_name='Gender',
                  value_name='Count')
df.head()

Unnamed: 0,JURISDICTION NAME,Gender,Count
0,10001,COUNT FEMALE,22
1,10002,COUNT FEMALE,19
2,10003,COUNT FEMALE,1
3,10004,COUNT FEMALE,0
4,10005,COUNT FEMALE,2


In [23]:
# can make gender column cleaner
mapper = {'COUNT FEMALE':'F', 'COUNT MALE':'M'}
df.Gender = df.Gender.map(mapper)
df.head()

Unnamed: 0,JURISDICTION NAME,Gender,Count
0,10001,F,22
1,10002,F,19
2,10003,F,1
3,10004,F,0
4,10005,F,2


### Problem 3
<span  style="color:green; font-size:16px">Reshape the NY demographic data in the same way you did in problem 2 except with a different command. HINT: If you use stack, put columns that you don't want stacked in the index.</span>

In [24]:
ny_demo[['JURISDICTION NAME', 'COUNT MALE', 'COUNT FEMALE']].set_index('JURISDICTION NAME')\
                                                            .stack()\
                                                            .reset_index()\
                                                            .head(10)

Unnamed: 0,JURISDICTION NAME,level_1,0
0,10001,COUNT MALE,22
1,10001,COUNT FEMALE,22
2,10002,COUNT MALE,16
3,10002,COUNT FEMALE,19
4,10003,COUNT MALE,0
5,10003,COUNT FEMALE,1
6,10004,COUNT MALE,0
7,10004,COUNT FEMALE,0
8,10005,COUNT MALE,0
9,10005,COUNT FEMALE,2


In [25]:
ny_demo[['JURISDICTION NAME', 'COUNT MALE', 'COUNT FEMALE']].set_index('JURISDICTION NAME')\
                                                            .stack()\
                                                            .rename_axis(['Zip Code', 'Gender'])\
                                                            .rename('Count')\
                                                            .reset_index()\
                                                            .head(10)

Unnamed: 0,Zip Code,Gender,Count
0,10001,COUNT MALE,22
1,10001,COUNT FEMALE,22
2,10002,COUNT MALE,16
3,10002,COUNT FEMALE,19
4,10003,COUNT MALE,0
5,10003,COUNT FEMALE,1
6,10004,COUNT MALE,0
7,10004,COUNT FEMALE,0
8,10005,COUNT MALE,0
9,10005,COUNT FEMALE,2


### Problem 4
<span  style="color:green; font-size:16px">Find a different variable in the columns and tidy that variable by creating another three column DataFrame</span>

In [26]:
melted_cols = ['COUNT PACIFIC ISLANDER', 
               'COUNT HISPANIC LATINO', 
               'COUNT AMERICAN INDIAN', 
               'COUNT ASIAN NON HISPANIC', 
               'COUNT WHITE NON HISPANIC',
               'COUNT BLACK NON HISPANIC',
               'COUNT OTHER ETHNICITY',
               'COUNT ETHNICITY UNKNOWN']

df_count = ny_demo.melt( 
             id_vars='JURISDICTION NAME', 
             value_vars=melted_cols,
             var_name='Ethnicity', 
             value_name='Count')

# df_count = df_count.query('Count > 0')

df_count['Ethnicity'] = df_count.Ethnicity.str.replace('COUNT ', '')
df_count.head(10)

Unnamed: 0,JURISDICTION NAME,Ethnicity,Count
0,10001,PACIFIC ISLANDER,0
1,10002,PACIFIC ISLANDER,0
2,10003,PACIFIC ISLANDER,0
3,10004,PACIFIC ISLANDER,0
4,10005,PACIFIC ISLANDER,0
5,10006,PACIFIC ISLANDER,0
6,10007,PACIFIC ISLANDER,0
7,10009,PACIFIC ISLANDER,0
8,10010,PACIFIC ISLANDER,0
9,10011,PACIFIC ISLANDER,0


### Problem 5
<span  style="color:green; font-size:16px">For the same variable you used in problem 4, create another three column tidy dataset using the percentage column instead of the count.</span>

In [27]:
melted_cols = ['PERCENT PACIFIC ISLANDER', 
               'PERCENT HISPANIC LATINO', 
               'PERCENT AMERICAN INDIAN', 
               'PERCENT ASIAN NON HISPANIC', 
               'PERCENT WHITE NON HISPANIC',
               'PERCENT BLACK NON HISPANIC',
               'PERCENT OTHER ETHNICITY',
               'PERCENT ETHNICITY UNKNOWN']

df_perc = ny_demo.melt(id_vars='JURISDICTION NAME', 
                         value_vars=melted_cols,
                         var_name='Ethnicity', 
                         value_name='Percent')

# df_perc = df_perc.query('Percent > 0')

df_perc['Ethnicity'] = df_perc.Ethnicity.str.replace('PERCENT ', '')
df_perc.head(10)

Unnamed: 0,JURISDICTION NAME,Ethnicity,Percent
0,10001,PACIFIC ISLANDER,0.0
1,10002,PACIFIC ISLANDER,0.0
2,10003,PACIFIC ISLANDER,0.0
3,10004,PACIFIC ISLANDER,0.0
4,10005,PACIFIC ISLANDER,0.0
5,10006,PACIFIC ISLANDER,0.0
6,10007,PACIFIC ISLANDER,0.0
7,10009,PACIFIC ISLANDER,0.0
8,10010,PACIFIC ISLANDER,0.0
9,10011,PACIFIC ISLANDER,0.0


### Problem 6: Advanced
<span  style="color:green; font-size:16px">Add a **`Percent`** column to **`df_count`** that calculates the percent found in **`df_perc`**. Create an additional column, **`Percent_orig`**, from **`df_perc`** to **`df_count`**. Check that the calculated percentage and original percentage match</span>

In [28]:
# use transform on the Count column to get each ethnicity's share within a zip code
# round to two decimal places
# some zip codes have 0 count total and will create nan. Fill these with 0
df_count['Percent'] = df_count.groupby('JURISDICTION NAME')['Count']\
                              .transform(lambda x: (x / x.sum()).round(2).fillna(0))

In [29]:
# need to be careful here as the data aligns on the index
# if you don't trust the index, set the index as zip code and ethnicity 
df_count['Percent_orig'] = df_perc['Percent']
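
If the row order of the two frames cannot be trusted, one alternative is to join on the identifying columns instead of relying on the index. The sketch below uses invented stand-ins for `df_count` and `df_perc` (the frames `counts` and `percs` and their values are hypothetical) whose rows are deliberately in a different order.

```python
import pandas as pd

# Hypothetical stand-ins for df_count and df_perc, with rows in different orders
counts = pd.DataFrame({
    'JURISDICTION NAME': [10001, 10001, 10002],
    'Ethnicity': ['A', 'B', 'A'],
    'Count': [3, 1, 2],
})
percs = pd.DataFrame({
    'JURISDICTION NAME': [10002, 10001, 10001],
    'Ethnicity': ['A', 'B', 'A'],
    'Percent_orig': [1.0, 0.25, 0.75],
})

# Join on the key columns; validate raises if a key pair is duplicated on either side
merged = counts.merge(percs, on=['JURISDICTION NAME', 'Ethnicity'],
                      how='left', validate='one_to_one')
print(merged)
```

A plain column assignment would have paired rows by index label and silently attached the wrong percentages in this toy case.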

In [30]:
# find occurrences where not equal
df_count[df_count.Percent != df_perc.Percent]

Unnamed: 0,JURISDICTION NAME,Ethnicity,Count,Percent,Percent_orig
271,10038,HISPANIC LATINO,2,0.12,0.13
829,11220,ASIAN NON HISPANIC,1,0.12,0.13
1537,11220,OTHER ETHNICITY,1,0.12,0.13


In [31]:
# This is not a bug: NumPy rounds values that are exactly halfway to the nearest even digit
# 2/16 = .125 was rounded to .12, while the source data rounded it up to .13
ny_demo[ny_demo['JURISDICTION NAME'] == 10038]

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,PERCENT PACIFIC ISLANDER,COUNT HISPANIC LATINO,PERCENT HISPANIC LATINO,COUNT AMERICAN INDIAN,PERCENT AMERICAN INDIAN,COUNT ASIAN NON HISPANIC,PERCENT ASIAN NON HISPANIC,COUNT WHITE NON HISPANIC,PERCENT WHITE NON HISPANIC,COUNT BLACK NON HISPANIC,PERCENT BLACK NON HISPANIC,COUNT OTHER ETHNICITY,PERCENT OTHER ETHNICITY,COUNT ETHNICITY UNKNOWN,PERCENT ETHNICITY UNKNOWN,COUNT ETHNICITY TOTAL,PERCENT ETHNICITY TOTAL,COUNT PERMANENT RESIDENT ALIEN,PERCENT PERMANENT RESIDENT ALIEN,COUNT US CITIZEN,PERCENT US CITIZEN,COUNT OTHER CITIZEN STATUS,PERCENT OTHER CITIZEN STATUS,COUNT CITIZEN STATUS UNKNOWN,PERCENT CITIZEN STATUS UNKNOWN,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
35,10038,16,11,0.69,5,0.31,0,0,16,100,0,0.0,2,0.13,0,0.0,9,0.56,0,0.0,5,0.31,0,0.0,0,0.0,16,100,1,0.06,15,0.94,0,0.0,0,0,16,100,3,0.19,13,0.81,0,0,16,100
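
The mismatch above comes down to the tie-breaking rule. As a minimal sketch (the variable names are illustrative, and the assumption that the original file was produced with round-half-up is inferred only from the 0.13 it reports), the following compares NumPy's round-half-to-even with round-half-up from the standard library's `decimal` module:

```python
from decimal import Decimal, ROUND_HALF_UP

import numpy as np

share = 2 / 16  # 0.125, which is exactly representable as a binary float

# NumPy (and therefore pandas .round) sends ties to the nearest even digit
half_even = np.round(share, 2)

# Round-half-up, done in decimal arithmetic to avoid binary representation surprises
half_up = float(Decimal(str(share)).quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))

print(half_even, half_up)
```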


In [32]:
df_count.set_index(['JURISDICTION NAME', 'Ethnicity']).unstack().head()

Unnamed: 0_level_0,Count,Count,Count,Count,Count,Count,Count,Count,Percent,Percent,Percent,Percent,Percent,Percent,Percent,Percent,Percent_orig,Percent_orig,Percent_orig,Percent_orig,Percent_orig,Percent_orig,Percent_orig,Percent_orig
Ethnicity,AMERICAN INDIAN,ASIAN NON HISPANIC,BLACK NON HISPANIC,ETHNICITY UNKNOWN,HISPANIC LATINO,OTHER ETHNICITY,PACIFIC ISLANDER,WHITE NON HISPANIC,AMERICAN INDIAN,ASIAN NON HISPANIC,BLACK NON HISPANIC,ETHNICITY UNKNOWN,HISPANIC LATINO,OTHER ETHNICITY,PACIFIC ISLANDER,WHITE NON HISPANIC,AMERICAN INDIAN,ASIAN NON HISPANIC,BLACK NON HISPANIC,ETHNICITY UNKNOWN,HISPANIC LATINO,OTHER ETHNICITY,PACIFIC ISLANDER,WHITE NON HISPANIC
JURISDICTION NAME,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2
10001,0,3,21,0,16,3,0,1,0.0,0.07,0.48,0.0,0.36,0.07,0.0,0.02,0.0,0.07,0.48,0.0,0.36,0.07,0.0,0.02
10002,0,28,0,0,1,0,0,6,0.0,0.8,0.0,0.0,0.03,0.0,0.0,0.17,0.0,0.8,0.0,0.0,0.03,0.0,0.0,0.17
10003,0,1,0,0,0,0,0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
10004,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10005,0,1,1,0,0,0,0,0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0


In [33]:
df_count.head(10)

Unnamed: 0,JURISDICTION NAME,Ethnicity,Count,Percent,Percent_orig
0,10001,PACIFIC ISLANDER,0,0.0,0.0
1,10002,PACIFIC ISLANDER,0,0.0,0.0
2,10003,PACIFIC ISLANDER,0,0.0,0.0
3,10004,PACIFIC ISLANDER,0,0.0,0.0
4,10005,PACIFIC ISLANDER,0,0.0,0.0
5,10006,PACIFIC ISLANDER,0,0.0,0.0
6,10007,PACIFIC ISLANDER,0,0.0,0.0
7,10009,PACIFIC ISLANDER,0,0.0,0.0
8,10010,PACIFIC ISLANDER,0,0.0,0.0
9,10011,PACIFIC ISLANDER,0,0.0,0.0


### Problem 7
<span  style="color:green; font-size:16px">If you use the **`stack`** method on a 10-row, 5-column DataFrame (with a single-level index and single-level columns), what will be the resulting shape and data structure? Answer this problem first without writing any code. Then confirm it by testing it on a DataFrame.</span>

In [34]:
# Answer: a Series with 50 values (10 rows x 5 columns) and a two-level MultiIndex
# Confirm with random data
df = pd.DataFrame(np.random.rand(10, 5))

In [35]:
df.stack().shape

(50,)
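
Beyond the shape, it can help to confirm the data structure and the index as well. The following sketch (variable names are illustrative) checks that the result is a Series with a two-level index and that `unstack` restores the original layout:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5))

stacked = df.stack()
# Each value is now labeled by (row label, column label)
print(type(stacked).__name__, stacked.shape, stacked.index.nlevels)

# unstack moves the innermost index level back to the columns
restored = stacked.unstack()
print(restored.shape)
```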

### Problem 8
<span  style="color:green; font-size:16px">Give a name to both the index and the column index of the following DataFrame.</span>

In [36]:
df = pd.DataFrame(np.random.rand(2,2))
df

Unnamed: 0,0,1
0,0.377716,0.258923
1,0.731304,0.686961


In [37]:
df.index.rename('some index', inplace=True)
df.columns.rename('some columns', inplace=True)
df

some columns,0,1
some index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.377716,0.258923
1,0.731304,0.686961


In [38]:
# another way
df.rename_axis('yet another index', axis=0)\
  .rename_axis('yet another column index', axis=1)

yet another column index,0,1
yet another index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.377716,0.258923
1,0.731304,0.686961


In [39]:
# and another way
df.index.name = 'last index'
df.columns.name = 'last columns'
df

last columns,0,1
last index,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.377716,0.258923
1,0.731304,0.686961


In [40]:
# wait, one more
df.columns.set_names('blah', inplace=True)
df.index.set_names('floop', inplace=True)
df

blah,0,1
floop,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.377716,0.258923
1,0.731304,0.686961


### Problem 9
<span  style="color:green; font-size:16px">Use the `groupby` method (with other reshaping methods) to recreate the same DataFrame produced by `pivot_table` below.</span>

In [41]:
employee = pd.read_csv('../data/employee.csv')
# recreate this table
employee_pivot = employee.pivot_table(index='RACE', columns='GENDER', values='BASE_SALARY', aggfunc='mean')
employee_pivot

GENDER,Female,Male
RACE,Unnamed: 1_level_1,Unnamed: 2_level_1
American Indian or Alaskan Native,60238.8,60305.4
Asian/Pacific Islander,63226.3,61033.906667
Black or African American,48915.421233,51082.074074
Hispanic/Latino,46503.316176,54782.819018
Others,63785.0,38771.0
White,66793.352941,63940.388119


In [42]:
employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().unstack()

GENDER,Female,Male
RACE,Unnamed: 1_level_1,Unnamed: 2_level_1
American Indian or Alaskan Native,60238.8,60305.4
Asian/Pacific Islander,63226.3,61033.906667
Black or African American,48915.421233,51082.074074
Hispanic/Latino,46503.316176,54782.819018
Others,63785.0,38771.0
White,66793.352941,63940.388119
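
The equivalence can also be checked programmatically rather than by eye. The sketch below uses a small hypothetical table (the column names mirror the employee data, but the rows are made up) so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the employee dataset
toy = pd.DataFrame({
    'RACE': ['White', 'White', 'White', 'Others', 'Others', 'Others'],
    'GENDER': ['Female', 'Male', 'Male', 'Female', 'Female', 'Male'],
    'BASE_SALARY': [50000.0, 60000.0, 70000.0, 40000.0, 44000.0, 38000.0],
})

pivoted = toy.pivot_table(index='RACE', columns='GENDER',
                          values='BASE_SALARY', aggfunc='mean')
grouped = toy.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().unstack()

# DataFrame.equals compares labels, dtypes and values
print(pivoted.equals(grouped))
```

The grouping columns become the index levels, and `unstack` moves the innermost one (`GENDER`) to the columns, which is the role the `columns` parameter plays in `pivot_table`.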


### Problem 10
<span  style="color:green; font-size:16px">Use the `melt` method to make the DataFrame `employee_pivot` tidy.</span>

In [43]:
employee_pivot = employee.pivot_table(index=['RACE', 'DEPARTMENT'], 
                            columns='GENDER', 
                            values='BASE_SALARY', 
                            aggfunc='mean')\
                .reset_index()\
                .rename_axis(None, axis='columns')
employee_pivot.head(10)

Unnamed: 0,RACE,DEPARTMENT,Female,Male
0,American Indian or Alaskan Native,Dept of Neighborhoods (DON),,26125.0
1,American Indian or Alaskan Native,Health & Human Services,54117.0,
2,American Indian or Alaskan Native,Housing and Community Devp.,98536.0,
3,American Indian or Alaskan Native,Houston Airport System (HAS),68299.0,
4,American Indian or Alaskan Native,Houston Fire Department (HFD),,78355.0
5,American Indian or Alaskan Native,Houston Police Department-HPD,,65682.333333
6,American Indian or Alaskan Native,Library,26125.0,
7,Asian/Pacific Islander,Admn. & Regulatory Affairs,72293.666667,
8,Asian/Pacific Islander,City Controller's Office,59077.0,
9,Asian/Pacific Islander,Fleet Management Department,,46010.0


In [44]:
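# var_name labels the former column names (the genders);
# value_name labels the cell values (the mean salaries)
employee_pivot.melt(id_vars=['RACE', 'DEPARTMENT'], 
                value_vars=['Female', 'Male'], 
                var_name='GENDER',
                value_name='BASE_SALARY')\
         .head(10)

Unnamed: 0,RACE,DEPARTMENT,GENDER,BASE_SALARY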
0,American Indian or Alaskan Native,Dept of Neighborhoods (DON),Female,
1,American Indian or Alaskan Native,Health & Human Services,Female,54117.0
2,American Indian or Alaskan Native,Housing and Community Devp.,Female,98536.0
3,American Indian or Alaskan Native,Houston Airport System (HAS),Female,68299.0
4,American Indian or Alaskan Native,Houston Fire Department (HFD),Female,
5,American Indian or Alaskan Native,Houston Police Department-HPD,Female,
6,American Indian or Alaskan Native,Library,Female,26125.0
7,Asian/Pacific Islander,Admn. & Regulatory Affairs,Female,72293.666667
8,Asian/Pacific Islander,City Controller's Office,Female,59077.0
9,Asian/Pacific Islander,Fleet Management Department,Female,


### Problem 11
<span  style="color:green; font-size:16px">Use the `stack` method to make the DataFrame `employee_pivot` from Problem 10 tidy.</span>

In [45]:
employee_pivot.set_index(['RACE', 'DEPARTMENT'])\
         .stack()\
         .rename_axis(['RACE', 'DEPARTMENT', 'GENDER'])\
         .rename('BASE_SALARY')\
         .reset_index()\
         .head(10)

Unnamed: 0,RACE,DEPARTMENT,GENDER,BASE_SALARY
0,American Indian or Alaskan Native,Dept of Neighborhoods (DON),Male,26125.0
1,American Indian or Alaskan Native,Health & Human Services,Female,54117.0
2,American Indian or Alaskan Native,Housing and Community Devp.,Female,98536.0
3,American Indian or Alaskan Native,Houston Airport System (HAS),Female,68299.0
4,American Indian or Alaskan Native,Houston Fire Department (HFD),Male,78355.0
5,American Indian or Alaskan Native,Houston Police Department-HPD,Male,65682.333333
6,American Indian or Alaskan Native,Library,Female,26125.0
7,Asian/Pacific Islander,Admn. & Regulatory Affairs,Female,72293.666667
8,Asian/Pacific Islander,City Controller's Office,Female,59077.0
9,Asian/Pacific Islander,Fleet Management Department,Male,46010.0
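
Comparing the outputs of Problems 10 and 11 shows one practical difference: `melt` keeps the rows with missing salaries, while the `stack` output above has none. The sketch below, on a hypothetical two-row table, shows that the two approaches agree once missing values are dropped and rows are sorted. The `dropna` after `stack` is explicit here on the assumption that the default handling of missing values in `stack` differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table with one missing cell
wide = pd.DataFrame({
    'RACE': ['Others', 'White'],
    'Female': [63785.0, np.nan],
    'Male': [38771.0, 63940.0],
})

tidy_melt = (wide.melt(id_vars='RACE', var_name='GENDER', value_name='BASE_SALARY')
                 .dropna(subset=['BASE_SALARY'])
                 .sort_values(['RACE', 'GENDER'])
                 .reset_index(drop=True))

tidy_stack = (wide.set_index('RACE')
                  .stack()
                  .dropna()  # explicit, so the result does not depend on stack's default
                  .rename_axis(['RACE', 'GENDER'])
                  .rename('BASE_SALARY')
                  .reset_index()
                  .sort_values(['RACE', 'GENDER'])
                  .reset_index(drop=True))

print(tidy_melt)
print(tidy_stack)
```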


### Problem 12
<span  style="color:green; font-size:16px">Make the column levels `first` and `second` index levels. Make the index level `two` a column level.</span>

In [46]:
index = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd', 'e']], names=['one', 'two'])
columns = pd.MultiIndex.from_product([['A', 'B'], ['C', 'D']], names=['first', 'second'])
df = pd.DataFrame(np.random.rand(6,4), index=index, columns=columns)
df

Unnamed: 0_level_0,first,A,A,B,B
Unnamed: 0_level_1,second,C,D,C,D
one,two,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,c,0.300647,0.054157,0.824446,0.198426
a,d,0.161373,0.567253,0.860238,0.333736
a,e,0.681574,0.080062,0.600255,0.345057
b,c,0.709219,0.57258,0.7867,0.067699
b,d,0.821983,0.610003,0.936883,0.779836
b,e,0.371681,0.214022,0.408151,0.298428


In [47]:
df.stack(['first', 'second']).unstack('two')

Unnamed: 0_level_0,Unnamed: 1_level_0,two,c,d,e
one,first,second,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,A,C,0.300647,0.161373,0.681574
a,A,D,0.054157,0.567253,0.080062
a,B,C,0.824446,0.860238,0.600255
a,B,D,0.198426,0.333736,0.345057
b,A,C,0.709219,0.821983,0.371681
b,A,D,0.57258,0.610003,0.214022
b,B,C,0.7867,0.936883,0.408151
b,B,D,0.067699,0.779836,0.298428
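
To make the effect of each step easier to follow, the sketch below repeats the solution with deterministic values (`np.arange` instead of random numbers, which is a substitution made only for illustration) and inspects the intermediate shapes and level names:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd', 'e']], names=['one', 'two'])
columns = pd.MultiIndex.from_product([['A', 'B'], ['C', 'D']], names=['first', 'second'])
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=index, columns=columns)

# Stacking every column level leaves a Series with 6 * 4 = 24 values
stacked = df.stack(['first', 'second'])

# Moving 'two' to the columns leaves 2 * 2 * 2 = 8 rows and 3 columns
reshaped = stacked.unstack('two')

print(stacked.shape, reshaped.shape)
print(list(reshaped.index.names), reshaped.columns.name)
```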
