In this course, you'll learn how to leverage pandas' extremely powerful data manipulation engine to get the most out of your data. It is important to be able to **extract, filter, and transform data** from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive. You will learn how to **tidy, rearrange, and restructure your data by pivoting or melting and stacking or unstacking** DataFrames. These are all fundamental next steps on the road to becoming a well-rounded Data Scientist, and you will have the chance to apply all the concepts you learn to real-world datasets.

 #### [Index](index.ipynb)


### Extracting and transforming data
  * #### [Indexing Dataframes](#id)
     * [Position and Labelled Indexing](#pli)
     * [Indexing and column rearrangement](#icr)
  * #### [Slicing Dataframes](#sd)
     * [Slicing rows](#sr)
     * [Slicing columns](#sc)
     * [Subselecting Dataframes with lists](#sdl)
  * #### [Filtering Dataframes](#fd)
     * [Thresholding data](#td)
     * [Filtering columns using other columns](#fcc)
     * [Filtering using NaNs](#fun)
  * #### [Transforming Dataframes](#td)
     * [Using apply() to transform a column](#atc)
     * [Using .map() with a dictionary](#mwd)
     * [Using vectorized functions](#vf)
     
     
 -------
### [Advanced Indexing](#ai)
####  [Index objects and labelled data](#iold)
* [index values and names](#ivn)
* [changing index of a Dataframe](#cid)
* [Changing index name labels](#cinl)
* [Building an index, then a DataFrame](#bid)

#### [ Hierarchical Indexing](#hi)
  * [Extracting data with a MultiIndex](#edm)
  * [Setting and Sorting a MultiIndex](#ssm)
  * [Using .loc with non-unique indexes](#lwni)
  * [Indexing multiple levels of a Multi-Index](#iml)
  
  
 ----
### [Rearranging and Reshaping Data ](#rrd)
#### [Pivoting DataFrames](#pddd)
 * [Pivoting a single variable](#psv)
 * [Pivoting all variables](#pav)
 
#### [Stacking & unstacking DataFrames](#sud)

 * [Stacking & unstacking I](#ss1)
 * [Stacking & unstacking II](#ss2)
 * [Restoring the index order](#rio)
 
#### [Melting DataFrames](#md)
 * [Adding names for readability](#anr)
 * [Going from wide to long](#wtl)
 * [Obtaining key-value pairs with melt()](#kpm)

#### [Pivot tables](#pt)
 * [Setting up a pivot table](#spt)
 * [Using other aggregations in pivot tables](#apt)
 * [Using margins in pivot tables](#mpt)
  

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Extracting and transforming data

<p id='id'> <p>
### Indexing Dataframes

In [3]:
election = pd.read_csv('./data/pennsylvania2012_turnout.csv', index_col='county')
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118


Your job is to select the row for 'Bedford' county and the 'winner' column. Which indexing method is preferred?

In [4]:
election.loc['Bedford', 'winner']

'Romney'

<p id = 'pli'> <p>
### Position and Labelled Indexing

In [5]:
election.iloc[4, 4]

'Romney'
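The two lookups above return the same value because 'Bedford' and 'winner' happen to sit at position 4 on each axis. As a minimal sketch, assuming a hypothetical three-row miniature of the election table (`mini` is not part of the original notebook), the positions can be looked up with `get_loc` rather than hard-coded:

```python
import pandas as pd

# Hypothetical miniature of the election table; values copied from rows shown above
mini = pd.DataFrame(
    {
        "state": ["PA", "PA", "PA"],
        "total": [41973, 614671, 21444],
        "winner": ["Romney", "Obama", "Romney"],
    },
    index=pd.Index(["Adams", "Allegheny", "Bedford"], name="county"),
)

# Label-based lookup: row label first, then column label
by_label = mini.loc["Bedford", "winner"]

# Position-based lookup: derive the integer positions from the labels
row_pos = mini.index.get_loc("Bedford")
col_pos = mini.columns.get_loc("winner")
by_position = mini.iloc[row_pos, col_pos]

print(by_label, by_position)
```

Label-based access keeps working if rows or columns are reordered, which is one reason `.loc[]` is usually the safer choice.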

<p id = 'icr'> <p>
### Indexing and column rearrangement
There are circumstances in which it's useful to modify the order of your DataFrame columns.
    

In [6]:
results = election[['winner', 'total', 'voters']]
results.head()

county,winner,total,voters
Adams,Romney,41973,61156
Allegheny,Obama,614671,924351
Armstrong,Romney,28322,42147
Beaver,Romney,80015,115157
Bedford,Romney,21444,32189


<p id = 'sd'> <p>
## Slicing Dataframes

<p id = 'sr'> <p>
### Slicing rows

In [7]:
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118


In [8]:
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
p_counties.head() 


county,state,total,Obama,Romney,winner,voters,turnout,margin
Perry,PA,18240,29.769737,68.591009,Romney,27245,66.948064,38.821272
Philadelphia,PA,653598,85.224251,14.051451,Obama,1099197,59.461407,71.1728
Pike,PA,23164,43.904334,54.882576,Romney,41840,55.363289,10.978242
Potter,PA,7205,26.259542,72.158223,Romney,10913,66.022175,45.898681


In [9]:
p_counties2 = election.loc[['Perry','Potter'], :]
p_counties2.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Perry,PA,18240,29.769737,68.591009,Romney,27245,66.948064,38.821272
Potter,PA,7205,26.259542,72.158223,Romney,10913,66.022175,45.898681


In [10]:
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev =  election.loc['Potter':'Perry':-1]
p_counties_rev.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Potter,PA,7205,26.259542,72.158223,Romney,10913,66.022175,45.898681
Pike,PA,23164,43.904334,54.882576,Romney,41840,55.363289,10.978242
Philadelphia,PA,653598,85.224251,14.051451,Obama,1099197,59.461407,71.1728
Perry,PA,18240,29.769737,68.591009,Romney,27245,66.948064,38.821272
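Label slices with `.loc[]` include both endpoints, unlike positional slices with `.iloc[]`. The sketch below illustrates this on a hypothetical four-row frame (`mini`, with rounded turnout values) and also repeats the reversed slice:

```python
import pandas as pd

# Hypothetical four-county frame with a sorted label index
mini = pd.DataFrame(
    {"turnout": [66.9, 59.5, 55.4, 66.0]},
    index=pd.Index(["Perry", "Philadelphia", "Pike", "Potter"], name="county"),
)

# Label slice: the stop label 'Pike' is included
forward = mini.loc["Perry":"Pike"]

# Positional slice: the stop position 2 is excluded
by_position = mini.iloc[0:2]

# Reversed label slice: start and stop are swapped and the step is -1
backward = mini.loc["Potter":"Perry":-1]

print(list(forward.index))
print(list(by_position.index))
print(list(backward.index))
```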


<p id = 'sc'> <p>
### Slicing columns

In [11]:
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']
print(type(left_columns))
left_columns.head()


<class 'pandas.core.frame.DataFrame'>


county,state,total,Obama
Adams,PA,41973,35.482334
Allegheny,PA,614671,56.640219
Armstrong,PA,28322,30.696985
Beaver,PA,80015,46.032619
Bedford,PA,21444,22.057452


In [12]:
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']
middle_columns.head()

county,Obama,Romney,winner
Adams,35.482334,63.112001,Romney
Allegheny,56.640219,42.18582,Obama
Armstrong,30.696985,67.901278,Romney
Beaver,46.032619,52.63763,Romney
Bedford,22.057452,76.98657,Romney


In [13]:
# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:, 'Romney': ]
right_columns.head()

county,Romney,winner,voters,turnout,margin
Adams,63.112001,Romney,61156,68.632677,27.629667
Allegheny,42.18582,Obama,924351,66.497575,14.454399
Armstrong,67.901278,Romney,42147,67.19814,37.204293
Beaver,52.63763,Romney,115157,69.483401,6.605012
Bedford,76.98657,Romney,32189,66.619031,54.929118


<p id = 'sdl'> <p>
### Subselecting Dataframes with lists
    
You can use lists to select specific row and column labels with the `.loc[]` accessor. In this exercise, your job is to select the counties `['Philadelphia', 'Centre', 'Fulton']` and the columns `['winner','Obama','Romney']` from the election DataFrame, which has been pre-loaded for you with the index set to `'county'`.

In [14]:
three_counties = election.loc[['Philadelphia', 'Centre', 'Fulton'], ['winner','Obama','Romney'] ]
three_counties.head()

county,winner,Obama,Romney
Philadelphia,Obama,85.224251,14.051451
Centre,Romney,48.948416,48.977486
Fulton,Romney,21.096291,77.748861


<p id = 'fd'> <p>
## Filtering Dataframes

<p id = 'td'> <p>
### Thresholding Data

In [15]:
# Filter the election DataFrame to counties with turnout above 70: high_turnout_df
high_turnout_df = election[election.turnout>70]
high_turnout_df.head()


county,state,total,Obama,Romney,winner,voters,turnout,margin
Bucks,PA,319407,49.96697,48.801686,Obama,435606,73.324748,1.165284
Butler,PA,88924,31.920516,66.816607,Romney,122762,72.436096,34.896091
Chester,PA,248295,49.228539,49.650617,Romney,337822,73.498766,0.422079
Forest,PA,2308,38.734835,59.835355,Romney,3232,71.410891,21.10052
Franklin,PA,62802,30.110506,68.583803,Romney,87406,71.850903,38.473297


<p id = 'fcc'> <p>
### Filtering columns using other columns

In [16]:
election.margin[election.margin<1]

county
Berks      0.589269
Centre     0.029069
Chester    0.422079
Name: margin, dtype: float64

In [17]:
election.winner[election.margin<1]

county
Berks      Romney
Centre     Romney
Chester    Romney
Name: winner, dtype: object

In [18]:
election.loc[election.margin<1, 'winner']  = np.nan

In [19]:
election.winner[election.margin<1]

county
Berks      NaN
Centre     NaN
Chester    NaN
Name: winner, dtype: object
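The assignment in cell 18 passes the boolean mask and the column label to a single `.loc[]` call. A minimal sketch of the same pattern, assuming a hypothetical three-row frame named `mini`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: two close races and one clear win
mini = pd.DataFrame(
    {
        "winner": ["Romney", "Romney", "Obama"],
        "margin": [0.589269, 0.029069, 14.454399],
    },
    index=pd.Index(["Berks", "Centre", "Allegheny"], name="county"),
)

# Boolean Series aligned on the county index
close = mini["margin"] < 1

# One .loc call selects rows (mask) and column (label) and assigns in place
mini.loc[close, "winner"] = np.nan

n_missing = int(mini["winner"].isna().sum())
print(n_missing)
```

Doing the row and column selection in one `.loc[]` call avoids chained indexing, where the assignment might land on a temporary copy instead of the original DataFrame.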

<p id = 'fun'> <p>
### Filtering using NaNs
    
It may be necessary to remove rows and columns with missing data from a DataFrame. The `.dropna()` method is used to perform this action.  
Your job is to select the 'age' and 'cabin' columns of the Titanic dataset and use `.dropna()` twice: once to remove rows where either column has missing data (`how='any'`), and once to remove rows where both columns have missing data (`how='all'`).

In [20]:
titanic = pd.read_csv('./data/titanic.csv')
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [21]:
# Select the 'age' and 'cabin' columns: df
df = titanic[['age', 'cabin']]
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 2 columns):
age      1046 non-null float64
cabin    295 non-null object
dtypes: float64(1), object(1)
memory usage: 20.5+ KB
None


Unnamed: 0,age,cabin
0,29.0,B5
1,0.92,C22 C26
2,2.0,C22 C26
3,30.0,C22 C26
4,25.0,C22 C26


In [22]:
# Use .dropna() to remove rows where any of these two columns contains missing data
df_drop_any=df.dropna(how = 'any')
print(df_drop_any.info())
print(df_drop_any.shape)
print("There are ", 1309-272 , " values missing in any (age or cabin)" )

<class 'pandas.core.frame.DataFrame'>
Int64Index: 272 entries, 0 to 1231
Data columns (total 2 columns):
age      272 non-null float64
cabin    272 non-null object
dtypes: float64(1), object(1)
memory usage: 6.4+ KB
None
(272, 2)
There are  1037  values missing in any (age or cabin)


In [23]:
# Use .dropna() to remove rows where all of these two columns contain missing data.
df_drop_all = df.dropna(how = 'all')
print(df_drop_all.info())
print(df_drop_all.shape)

print("There are ", 1309-1069, " values missing at same index(age and cabin)" )

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1069 entries, 0 to 1308
Data columns (total 2 columns):
age      1046 non-null float64
cabin    295 non-null object
dtypes: float64(1), object(1)
memory usage: 25.1+ KB
None
(1069, 2)
There are  240  values missing at same index(age and cabin)


In [24]:
# Drop columns in titanic with less than 1000 non-missing values
print(titanic.dropna(thresh=1000, axis='columns').info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
pclass      1309 non-null int64
survived    1309 non-null int64
name        1309 non-null object
sex         1309 non-null object
age         1046 non-null float64
sibsp       1309 non-null int64
parch       1309 non-null int64
ticket      1309 non-null object
fare        1308 non-null float64
embarked    1307 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 102.3+ KB
None


Note that the `cabin` column has been dropped because it contains fewer than 1000 non-missing values (only 295, as reported by `df.info()` earlier).
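The three `.dropna()` variants used in this section can be compared side by side on a small frame. This is a sketch with made-up values; the `ticket` column and the `thresh=3` cutoff are assumptions chosen only to make the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with missing values in both columns
df = pd.DataFrame(
    {
        "age": [29.0, np.nan, 2.0, np.nan],
        "cabin": ["B5", "C22", None, None],
    }
)

# how='any': drop a row if at least one value is missing
drop_any = df.dropna(how="any")

# how='all': drop a row only if every value is missing
drop_all = df.dropna(how="all")

# thresh=3 with axis='columns': keep columns with at least 3 non-missing values
padded = df.assign(ticket=["a", "b", "c", "d"])
kept = padded.dropna(thresh=3, axis="columns")

print(list(drop_any.index), list(drop_all.index), list(kept.columns))
```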

<p id = 'td'> <p>
### Transforming Dataframes

<p id = 'atc'> <p>
### Using `apply()` to transform a column
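The `.apply()` method can be used on a pandas DataFrame to apply an arbitrary Python function along an axis, by default to each column in turn. Because the `to_celsius` function in this exercise uses only arithmetic, applying it to each column converts every element. In this exercise you'll use daily weather data for Pittsburgh in 2013, obtained from Weather Underground.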



In [25]:
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)


In [26]:
pitsWeather = pd.read_csv('./data/pittsburgh2013.csv')
pitsWeather.head()

Unnamed: 0,Date,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,Mean Dew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2013-1-1,32,28,21,30,27,16,100,89,77,...,10,6,2,10,8,,0.0,8,Snow,277
1,2013-1-2,25,21,17,14,12,10,77,67,55,...,10,10,10,14,5,,0.0,4,,272
2,2013-1-3,32,24,16,19,15,9,77,67,56,...,10,10,10,17,8,26.0,0.0,3,,229
3,2013-1-4,30,28,27,21,19,17,75,68,59,...,10,10,6,23,16,32.0,0.0,4,,250
4,2013-1-5,34,30,25,23,20,16,75,68,61,...,10,10,10,16,10,23.0,0.21,5,,221


In [27]:
dff = pitsWeather[['Mean TemperatureF', 'Mean Dew PointF']]
dff.head()

Unnamed: 0,Mean TemperatureF,Mean Dew PointF
0,28,27
1,21,12
2,24,15
3,28,19
4,30,20


In [28]:
dfc=dff.apply(to_celsius)
dfc.head()

Unnamed: 0,Mean TemperatureF,Mean Dew PointF
0,-2.222222,-2.777778
1,-6.111111,-11.111111
2,-4.444444,-9.444444
3,-2.222222,-7.222222
4,-1.111111,-6.666667
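To see what `DataFrame.apply()` actually passes to the function, the sketch below wraps `to_celsius` in a hypothetical helper called `spy` that records the type of its argument. The two-row `temps` frame is made up for illustration:

```python
import pandas as pd

def to_celsius(F):
    # Works on scalars, Series and DataFrames because it is pure arithmetic
    return 5 / 9 * (F - 32)

# Hypothetical temperatures in degrees Fahrenheit
temps = pd.DataFrame({"Mean TemperatureF": [32, 50], "Mean Dew PointF": [14, 41]})

seen_types = []

def spy(col):
    # Record what apply() hands over, then convert as usual
    seen_types.append(type(col).__name__)
    return to_celsius(col)

via_apply = temps.apply(spy)   # function is called once per column
direct = to_celsius(temps)     # same arithmetic on the whole DataFrame

print(set(seen_types))
print(via_apply)
```

Since the function receives whole columns, calling `to_celsius(temps)` directly gives the same numbers without going through `.apply()`.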


<p id = 'mwd'> <p>
### Using `.map()` with a dictionary




In [29]:
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118


Your job is to use a dictionary to map the values 'Obama' and 'Romney' in the 'winner' column to the values 'Blue' and 'Red', and assign the output to the new column 'color'.



In [30]:
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'Blue', 'Romney':'Red'}


In [31]:
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin,color
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667,Red
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399,Blue
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293,Red
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012,Red
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118,Red
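One detail worth knowing about `.map()` with a dictionary is that values without a matching key become `NaN`. That matters here because the 'winner' entries for the closest counties were set to `NaN` earlier in this notebook. A small sketch, where the 'Other' entry and the 'Hypothetical' label are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical winner column with one missing value and one unexpected value
winner = pd.Series(
    ["Romney", "Obama", np.nan, "Other"],
    index=["Adams", "Allegheny", "Berks", "Hypothetical"],
)

red_vs_blue = {"Obama": "Blue", "Romney": "Red"}

# Values that are not keys of the dictionary map to NaN
color = winner.map(red_vs_blue)

print(color)
```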


<p id = 'vf'> <p>
### Using vectorized functions

In [32]:
from scipy.stats import zscore

Use the z-score to express how far each county's voter turnout in Pennsylvania deviates from the mean, in units of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation lies above the mean, so a negative value means the observation is below the mean.

In [33]:
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)
turnout_zscore


array([ 0.85373443,  0.43984633,  0.57565034,  1.01864668,  0.46339055,
        0.18992961, -1.62978766, -1.67811834,  1.76328918,  1.59102463,
        0.4115648 , -2.00690534, -0.41140691, -0.64265536,  1.79702245,
       -0.21292049, -0.36907863, -1.76358992, -0.63882099, -0.72673199,
        1.02421347,  0.83473876,  0.86101802, -0.58691702, -0.09392156,
       -2.26015319,  1.39228937,  1.47758532,  0.30389161, -0.71004763,
       -0.62292272, -0.22739249, -0.8586792 ,  1.11463935,  0.14408255,
        1.08675066, -0.25721482,  0.3426399 , -0.04498491, -0.09489986,
        0.71129079, -1.19644405, -0.06680477,  0.48399098, -1.89069251,
        1.68205856, -1.28403638, -0.79798793, -1.33971045,  0.52717328,
       -0.9241102 , -1.71852766,  0.34769042,  0.46386596,  0.99379745,
        0.21159213,  0.95701947,  0.83419812, -0.56442943,  0.65096061,
       -0.16243951, -1.4886646 , -0.18238803,  0.02514726,  1.29021923,
        0.14757638,  0.44085587])

In [34]:
# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin,color,turnout_zscore
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667,Red,0.853734
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399,Blue,0.439846
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293,Red,0.57565
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012,Red,1.018647
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118,Red,0.463391
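The z-score can also be written out by hand, which makes the definition explicit. The sketch below uses only the five turnout values shown in the head of the table, so its numbers will differ from the full-data z-scores above:

```python
import pandas as pd

# Turnout for the first five counties, copied from the table above
turnout = pd.Series(
    [68.632677, 66.497575, 67.198140, 69.483401, 66.619031],
    index=["Adams", "Allegheny", "Armstrong", "Beaver", "Bedford"],
)

# scipy.stats.zscore defaults to ddof=0 (population standard deviation),
# while Series.std() defaults to ddof=1, so ddof is set explicitly here
z = (turnout - turnout.mean()) / turnout.std(ddof=0)

print(z)
```

By construction the result has mean 0 and (population) standard deviation 1.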


<p id = 'ai'> <p>
## Advanced indexing

<p id = 'iold'> <p>
## Index objects and labeled data


<p id = 'ivn'> <p>
### Index values and names

In [35]:
sales = pd.read_csv('./data/sales.csv')
sales

Unnamed: 0,month,eggs,salt,spam
0,Jan,47,12.0,17
1,Feb,110,50.0,31
2,Mar,221,89.0,72
3,Apr,77,87.0,20
4,May,132,,52
5,Jun,205,60.0,55


In [36]:
sales.index  = range(len(sales))

<p id = 'cid'> <p>
### Changing index of a DataFrame
As you saw in the previous exercise, indexes are immutable objects. This means that if you want to change or modify the index in a DataFrame, then you need to change the whole index. You will do this now, using a list comprehension to create the new index.



In [37]:
sales = pd.read_csv('./data/sales.csv', index_col='month')
sales.head()



month,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52


In [38]:
sales.index

Index(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'], dtype='object', name='month')

In [39]:
# Create a list new_idx with the same elements as in sales.index, but with all characters capitalized.
new_idx = [i.upper() for i in sales.index]
new_idx


['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']

In [40]:
sales.index = new_idx
sales

Unnamed: 0,eggs,salt,spam
JAN,47,12.0,17
FEB,110,50.0,31
MAR,221,89.0,72
APR,77,87.0,20
MAY,132,,52
JUN,205,60.0,55
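The immutability mentioned above can be checked directly: assigning to a single position of an Index raises a `TypeError`, while replacing the whole index works. This sketch uses a hypothetical three-row version of `sales` and shows `Index.str.upper()` as an alternative to the list comprehension:

```python
import pandas as pd

# Hypothetical three-month slice of the sales data
sales = pd.DataFrame(
    {"eggs": [47, 110, 221]},
    index=pd.Index(["Jan", "Feb", "Mar"], name="month"),
)

# Item assignment on an Index is not supported
try:
    sales.index[0] = "JAN"
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index is allowed
sales.index = sales.index.str.upper()

print(mutated, list(sales.index))
```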


<p id = 'cinl'> <p>
### Changing index name labels
Notice that in the previous exercise, the index was not labeled with a name. In this exercise, you will set its name to 'MONTHS'.

In [41]:
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'
sales

MONTHS,eggs,salt,spam
JAN,47,12.0,17
FEB,110,50.0,31
MAR,221,89.0,72
APR,77,87.0,20
MAY,132,,52
JUN,205,60.0,55


In [42]:
sales.columns.name = 'PRODUCTS'
sales

PRODUCTS,eggs,salt,spam
MONTHS,,,
JAN,47,12.0,17
FEB,110,50.0,31
MAR,221,89.0,72
APR,77,87.0,20
MAY,132,,52
JUN,205,60.0,55


<p id = 'bid'> <p>
### Building an index, then a DataFrame


In [43]:
#sales = pd.read_csv('./data/sales.csv')


In [44]:
#del sales['month']

In [45]:
sales.head()

PRODUCTS,eggs,salt,spam
MONTHS,,,
JAN,47,12.0,17
FEB,110,50.0,31
MAR,221,89.0,72
APR,77,87.0,20
MAY,132,,52


In [46]:
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']


In [47]:
sales.index = months
sales

PRODUCTS,eggs,salt,spam
Jan,47,12.0,17
Feb,110,50.0,31
Mar,221,89.0,72
Apr,77,87.0,20
May,132,,52
Jun,205,60.0,55


<p id = 'hi'> <p>
## Hierarchical indexing


#### pre-work
---------

<p id = 'ssm'> <p>
### Setting & sorting a MultiIndex



In [48]:
month = [1, 2, 1, 2, 1, 2]
state = ['CA', 'CA', 'NY', 'NY', 'TX', 'TX']


In [49]:
sales = pd.read_csv('./data/sales.csv')
del sales['month']

sales.head()


Unnamed: 0,eggs,salt,spam
0,47,12.0,17
1,110,50.0,31
2,221,89.0,72
3,77,87.0,20
4,132,,52


In [50]:
sales['month'] = month
sales['state'] = state

In [51]:
sales

Unnamed: 0,eggs,salt,spam,month,state
0,47,12.0,17,1,CA
1,110,50.0,31,2,CA
2,221,89.0,72,1,NY
3,77,87.0,20,2,NY
4,132,,52,1,TX
5,205,60.0,55,2,TX


In [52]:
sales = sales.set_index(['state', 'month'])
sales = sales.sort_index()

<p id = 'edm'> <p>
### Extracting data with a MultiIndex



------------------------------
#### postwork

In [53]:
sales.loc[['CA', 'TX']]

state,month,eggs,salt,spam
CA,1,47,12.0,17
CA,2,110,50.0,31
TX,1,132,,52
TX,2,205,60.0,55


In [54]:
sales.loc['CA':'TX']

state,month,eggs,salt,spam
CA,1,47,12.0,17
CA,2,110,50.0,31
NY,1,221,89.0,72
NY,2,77,87.0,20
TX,1,132,,52
TX,2,205,60.0,55


<p id = 'lwni'> <p>
### Using .loc[] with nonunique indexes
It is always preferable to have a meaningful index that uniquely identifies each row. Even though pandas does not require unique index values in DataFrames, it works better if the index values are indeed unique. To see an example of this, you will index your sales data by 'state' in this exercise.




In [55]:
sales = pd.read_csv('./data/sales.csv')
del sales['month']

sales.head()



Unnamed: 0,eggs,salt,spam
0,47,12.0,17
1,110,50.0,31
2,221,89.0,72
3,77,87.0,20
4,132,,52


In [56]:
sales['month'] = month
sales['state'] = state
sales.head()

Unnamed: 0,eggs,salt,spam,month,state
0,47,12.0,17,1,CA
1,110,50.0,31,2,CA
2,221,89.0,72,1,NY
3,77,87.0,20,2,NY
4,132,,52,1,TX


In [57]:
sales=sales.set_index('state')
sales.head()

state,eggs,salt,spam,month
CA,47,12.0,17,1
CA,110,50.0,31,2
NY,221,89.0,72,1
NY,77,87.0,20,2
TX,132,,52,1


In [58]:
# Access the data from 'NY'
sales.loc['NY']

state,eggs,salt,spam,month
NY,221,89.0,72,1
NY,77,87.0,20,2
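With a non-unique index, a single label returns several rows. The sketch below, on a hypothetical four-row version of `sales`, checks uniqueness with `Index.is_unique` and shows one way to make the index unique by appending 'month' as a second level:

```python
import pandas as pd

# Hypothetical four-row version of the sales data, indexed by state only
sales = pd.DataFrame(
    {"eggs": [47, 110, 221, 77], "month": [1, 2, 1, 2]},
    index=pd.Index(["CA", "CA", "NY", "NY"], name="state"),
)

is_unique = sales.index.is_unique   # 'CA' and 'NY' each appear twice
ny = sales.loc["NY"]                # a DataFrame with two rows, not a single row

# Appending 'month' to the index makes every (state, month) pair unique
unique_sales = sales.set_index("month", append=True)
one_row = unique_sales.loc[("NY", 1), :]   # now a single row, returned as a Series

print(is_unique, ny.shape, unique_sales.index.is_unique)
```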


<p id ='iml'><p> 
### Indexing multiple levels of a MultiIndex



In [59]:
sales = pd.read_csv('./data/sales.csv')
del sales['month']

sales['month'] = month
sales['state'] = state
sales=sales.set_index(['state', 'month'])
sales

state,month,eggs,salt,spam
CA,1,47,12.0,17
CA,2,110,50.0,31
NY,1,221,89.0,72
NY,2,77,87.0,20
TX,1,132,,52
TX,2,205,60.0,55


In [60]:
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY', 1)]
NY_month1

eggs    221.0
salt     89.0
spam     72.0
Name: (NY, 1), dtype: float64

In [61]:
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA', 'TX'], 2), :]
CA_TX_month2

state,month,eggs,salt,spam
CA,2,110,50.0,31
TX,2,205,60.0,55


In [62]:
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
all_month2

state,month,eggs,salt,spam
CA,2,110,50.0,31
NY,2,77,87.0,20
TX,2,205,60.0,55
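The `slice(None)` form can be hard to read. As a sketch of two alternatives that pandas also provides, `pd.IndexSlice` allows the usual colon syntax and `.xs()` selects a cross-section by level name. The frame below is a hypothetical one-column version of `sales`:

```python
import pandas as pd

# Hypothetical one-column version of sales with a (state, month) MultiIndex
sales = pd.DataFrame(
    {"eggs": [47, 110, 221, 77, 132, 205]},
    index=pd.MultiIndex.from_product(
        [["CA", "NY", "TX"], [1, 2]], names=["state", "month"]
    ),
)

idx = pd.IndexSlice

with_slice = sales.loc[(slice(None), 2), :]   # form used in the cell above
with_idx = sales.loc[idx[:, 2], :]            # same selection, colon syntax

# Cross-section by level name; the 'month' level is dropped from the result
cross = sales.xs(2, level="month")

print(with_idx)
print(cross)
```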


<p id = 'rrd'> <p>
# Rearranging and Reshaping Data

<p id = 'pddd'> <p>
## Pivoting DataFrames

<p id = 'psv'> <p>
### Pivoting a single variable

In [63]:
users  = pd.read_csv('./data/users.csv', index_col=0)
users.head()

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


Pivot the users DataFrame with the rows indexed by 'weekday', the columns indexed by 'city', and the values populated with 'visitors'.


In [64]:
users.pivot(index = 'weekday', columns='city', values='visitors')

city,Austin,Dallas
weekday,,
Mon,326,456
Sun,139,237


<p id = 'pav'> <p>
### Pivoting all variables
    
If you do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both 'visitors' and 'signups' will be pivoted, creating hierarchical column labels.


Pivot the users DataFrame with the 'signups' indexed by 'weekday' in the rows and 'city' in the columns.



In [65]:
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index = 'weekday', columns = 'city', values='signups')
signups_pivot

city,Austin,Dallas
weekday,,
Mon,3,5
Sun,7,12


In [66]:
# Pivot users with both signups and visitors as values: pivot
pivot = users.pivot(index = 'weekday', columns='city')
pivot


,visitors,visitors,signups,signups
city,Austin,Dallas,Austin,Dallas
weekday,,,,
Mon,326,456,3,5
Sun,139,237,7,12
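Once all variables are pivoted, the hierarchical column labels can be used to pull pieces back out. A minimal sketch, rebuilding `users` inline from the values shown above so that it runs without the CSV file:

```python
import pandas as pd

# users rebuilt inline from the table shown above
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [139, 237, 326, 456],
        "signups": [7, 12, 3, 5],
    }
)

# No 'values' argument: every remaining column is pivoted
pivot = users.pivot(index="weekday", columns="city")

# Selecting the outer column level gives the single-variable pivot
signups = pivot["signups"]

# A (variable, city) tuple addresses one column of the hierarchy
mon_dallas_visitors = pivot.loc["Mon", ("visitors", "Dallas")]

print(signups)
print(mon_dallas_visitors)
```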


<p id = 'sud'> <p>
## Stacking & unstacking DataFrames

<p id = 'ss1'> <p>
### Stacking & unstacking I

In [67]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [68]:
users=users.set_index(['weekday', 'city'])


In [69]:
users

weekday,city,visitors,signups
Sun,Austin,139,7
Sun,Dallas,237,12
Mon,Austin,326,3
Mon,Dallas,456,5


In [70]:
# Unstack the 'weekday' level of users and reassign the result to users
users = users.unstack(level='weekday')
users

,visitors,visitors,signups,signups
weekday,Mon,Sun,Mon,Sun
city,,,,
Austin,326,139,3,7
Dallas,456,237,5,12


In [71]:
users = users.stack(level='weekday')
users

city,weekday,visitors,signups
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


<p id = 'ss2'> <p>
### Stacking & unstacking II

In [72]:
users

city,weekday,visitors,signups
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


In [73]:
# Unstack users by 'city': bycity
bycity = users.unstack(level='city')
bycity.head()

,visitors,visitors,signups,signups
city,Austin,Dallas,Austin,Dallas
weekday,,,,
Mon,326,456,3,5
Sun,139,237,7,12


In [74]:
bycity.stack(level='city')

weekday,city,visitors,signups
Mon,Austin,326,3
Mon,Dallas,456,5
Sun,Austin,139,7
Sun,Dallas,237,12


<p id = 'rio'> <p>
### Restoring the index order

In [75]:
newusers = bycity.stack(level='city')
newusers

weekday,city,visitors,signups
Mon,Austin,326,3
Mon,Dallas,456,5
Sun,Austin,139,7
Sun,Dallas,237,12


In [76]:
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0, 1)
newusers = newusers.sort_index()
newusers


city,weekday,visitors,signups
Austin,Mon,326,3
Austin,Sun,139,7
Dallas,Mon,456,5
Dallas,Sun,237,12


In [77]:
newusers.equals(users)

True
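The round trip above can be condensed into one self-contained sketch: unstack a level, stack it back, then restore the level order and sort. The frame is rebuilt inline from the values shown earlier, and the comparison sorts both sides so that it does not depend on the row order `stack()` happens to produce:

```python
import pandas as pd

# users with a (weekday, city) MultiIndex, rebuilt inline
users = pd.DataFrame(
    {"visitors": [139, 237, 326, 456], "signups": [7, 12, 3, 5]},
    index=pd.MultiIndex.from_tuples(
        [("Sun", "Austin"), ("Sun", "Dallas"), ("Mon", "Austin"), ("Mon", "Dallas")],
        names=["weekday", "city"],
    ),
)

# Move 'weekday' into the columns, then back into the index
wide = users.unstack(level="weekday")
long_again = wide.stack(level="weekday")   # index levels are now (city, weekday)

# Swap the levels back and sort to get a canonical order
restored = long_again.swaplevel(0, 1).sort_index()
reference = users.sort_index()

round_trip_ok = restored.sort_index(axis=1).equals(reference.sort_index(axis=1))
print(list(restored.index.names), round_trip_ok)
```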

<p id = 'md'> <p>
## Melting DataFrames

<p id = 'anr'> <p>
### Adding names for readability

In [107]:
users  = pd.read_csv('./data/users.csv', index_col=0)
users.head()

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [88]:
visitors_by_city_weekday = users.pivot(index = 'weekday', columns='city', values='visitors')
visitors_by_city_weekday

city,Austin,Dallas
weekday,,
Mon,326,456
Sun,139,237


In [89]:
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
visitors_by_city_weekday

city,weekday,Austin,Dallas
0,Mon,326,456
1,Sun,139,237


In [91]:
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name = 'visitors')
visitors

Unnamed: 0,weekday,city,visitors
0,Mon,Austin,326
1,Sun,Austin,139
2,Mon,Dallas,456
3,Sun,Dallas,237


In [95]:
visitors.pivot(index = 'weekday', columns= 'city', values= 'visitors')

city,Austin,Dallas
weekday,,
Mon,326,456
Sun,139,237


<p id = 'wtl'> <p>
### Going from wide to long

In [96]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [103]:
skinny = users.melt(id_vars = ['weekday', 'city'], var_name='usertype', value_name='count')
skinny.set_index('weekday')

weekday,city,usertype,count
Sun,Austin,visitors,139
Sun,Dallas,visitors,237
Mon,Austin,visitors,326
Mon,Dallas,visitors,456
Sun,Austin,signups,7
Sun,Dallas,signups,12
Mon,Austin,signups,3
Mon,Dallas,signups,5
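Melting is reversible as long as the identifier columns are kept. The sketch below melts an inline copy of `users` and then goes back from long to wide by setting a three-level index and unstacking 'usertype'; this is one possible way to invert the melt, not the only one:

```python
import pandas as pd

# users rebuilt inline from the table shown above
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [139, 237, 326, 456],
        "signups": [7, 12, 3, 5],
    }
)

# Wide to long: one row per (weekday, city, usertype)
skinny = users.melt(
    id_vars=["weekday", "city"], var_name="usertype", value_name="count"
)

# Long to wide: move 'usertype' back into the columns
wide = skinny.set_index(["weekday", "city", "usertype"])["count"].unstack("usertype")

print(skinny.shape)
print(wide)
```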


<p id = 'kpm'> <p>
### Obtaining key-value pairs with `melt()`

In [108]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [109]:
users_idx = users.set_index(['city', 'weekday'])
users_idx

city,weekday,visitors,signups
Austin,Sun,139,7
Dallas,Sun,237,12
Austin,Mon,326,3
Dallas,Mon,456,5


Obtain the key-value pairs corresponding to visitors and signups by melting `users_idx` with the keyword argument `col_level=0`.

In [110]:
kv_pairs = users_idx.melt(col_level=0)
kv_pairs

Unnamed: 0,variable,value
0,visitors,139
1,visitors,237
2,visitors,326
3,visitors,456
4,signups,7
5,signups,12
6,signups,3
7,signups,5


<p id = 'pt'> <p>
## Pivot tables

<p id = 'spt'> <p>
### Setting up a pivot table

In [111]:
users

Unnamed: 0,weekday,city,visitors,signups
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [113]:
d1 = users.pivot_table(index= 'weekday', columns ='city' )
d1

,signups,signups,visitors,visitors
city,Austin,Dallas,Austin,Dallas
weekday,,,,
Mon,3,5,326,456
Sun,7,12,139,237


Excellent! Notice the labels of the index and the columns are 'weekday' and 'city', respectively - exactly as you specified.

<p id = 'apt'> <p>
### Using other aggregations in pivot tables

In [124]:
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday', aggfunc='count')
count_by_weekday1

weekday,city,signups,visitors
Mon,2,2,2
Sun,2,2,2


In [123]:
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday', aggfunc=len)
count_by_weekday2

weekday,city,signups,visitors
Mon,2,2,2
Sun,2,2,2


<p id = 'mpt'> <p>
### Using margins in pivot tables

In [127]:
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index = 'weekday', aggfunc= 'sum')
signups_and_visitors


weekday,signups,visitors
Mon,8,782
Sun,19,376


In [126]:
# Add in the margins: signups_and_visitors_total 
signups_and_visitors_total = users.pivot_table(index='weekday', aggfunc='sum', margins=True)

signups_and_visitors_total

weekday,signups,visitors
Mon,8,782
Sun,19,376
All,27,1158
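The margins example can be reproduced without the CSV file. In this sketch the `values` argument is passed explicitly, which is an addition to the cell above: it keeps the string column 'city' out of the aggregation, since how pandas treats non-numeric columns in `pivot_table` may differ between versions.

```python
import pandas as pd

# users rebuilt inline from the table shown above
users = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "visitors": [139, 237, 326, 456],
        "signups": [7, 12, 3, 5],
    }
)

# Sum per weekday, plus an 'All' row holding the column totals
totals = users.pivot_table(
    index="weekday",
    values=["signups", "visitors"],
    aggfunc="sum",
    margins=True,
)

print(totals)
```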

