# More Data Processing with Pandas
## Merging Dataframes
Merge horizontally or concatenate vertically

![Venn Diagram](merging1.png)

Suppose these two populations are the indices of two separate DataFrames. We have to think about how we want to join them.

What if we want a list of all people, regardless of whether they're students or staff, along with all of the information we can get on them? That is a **full outer join / union**:

![Union](merging2.png)

If we want only those people we have the most information about, the students who are also staff, then this is an **inner join / intersection**:

![Intersection](merging3.png)

How do we do this in pandas?

In [1]:
import pandas as pd

# staff
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liason'},
                         {'Name': 'James', 'Role': 'Grader'}])

## index staff by name
staff_df = staff_df.set_index('Name')

# students
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])

student_df = student_df.set_index('Name')

# print out
print(staff_df.head())
print(student_df.head())
                         

                 Role
Name                 
Kelly  Director of HR
Sally   Course liason
James          Grader
            School
Name              
James     Business
Mike           Law
Sally  Engineering


James and Sally are students and staff, but Mike and Kelly are not. The DataFrames are indexed along the value we want to merge them on, 'Name'.

If we want the union of these DataFrames, use `merge()` with `how='outer'`:

In [2]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

Name,Role,School
James,Grader,Business
Kelly,Director of HR,
Mike,,Law
Sally,Course liason,Engineering


If we want just those students who are also staff, do an intersection with `merge()` and `how='inner'`:

In [3]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

Name,Role,School
Sally,Course liason,Engineering
James,Grader,Business


### Common Use Cases: left and right join
Left Join: get a list of all staff members, and if they are students, get their student details

In [4]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Name,Role,School
Kelly,Director of HR,
Sally,Course liason,Engineering
James,Grader,Business


Right Join: get a list of all students, and if they are also staff, get their staff details

In [5]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

Name,Role,School
James,Grader,Business
Mike,,Law
Sally,Course liason,Engineering
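
When comparing the four join types it can help to see which side each row came from. A minimal sketch, assuming the same staff and student frames as above (rebuilt here so the cell runs on its own), using the `indicator` parameter of `merge()`:

```python
import pandas as pd

# Same toy data as above, indexed by name
staff_df = pd.DataFrame(
    {'Role': ['Director of HR', 'Course liason', 'Grader']},
    index=pd.Index(['Kelly', 'Sally', 'James'], name='Name'))
student_df = pd.DataFrame(
    {'School': ['Business', 'Law', 'Engineering']},
    index=pd.Index(['James', 'Mike', 'Sally'], name='Name'))

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(staff_df, student_df, how='outer',
                  left_index=True, right_index=True, indicator=True)
print(merged)
```

Rows marked `both` are the inner join, `left_only` plus `both` make up the left join, and `right_only` plus `both` make up the right join.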


### Other Parameters of `merge()`
We don't have to join on indices. To join on columns instead, specify the column name(s) with `on`.

In [6]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

pd.merge(staff_df, student_df, how='right', on='Name')

Unnamed: 0,Name,Role,School
0,James,Grader,Business
1,Mike,,Law
2,Sally,Course liason,Engineering


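### What if we have conflicts?
Both DataFrames can have a column with the same name that is not a join key (here, 'Location'). `merge()` keeps both columns and appends `_x` to the one from the left DataFrame and `_y` to the one from the right.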

In [7]:
# redo the dataframes
## add office location
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liason', 'Location': 'Washington Ave'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Ave'}])

## add home location
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Ave'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': "Frat House #22"},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])

pd.merge(staff_df, student_df, how='left', on='Name')

Unnamed: 0,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,State Street,,
1,Sally,Course liason,Washington Ave,Engineering,512 Wilson Crescent
2,James,Grader,Washington Ave,Business,1024 Billiard Ave
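
The default `_x` / `_y` suffixes are not very descriptive. A small sketch of overriding them with the `suffixes` parameter of `merge()`; the frames are a trimmed copy of the ones above, and the suffix names `_staff` and `_student` are just illustrative choices:

```python
import pandas as pd

staff_df = pd.DataFrame([
    {'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
    {'Name': 'Sally', 'Role': 'Course liason', 'Location': 'Washington Ave'},
])
student_df = pd.DataFrame([
    {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'},
    {'Name': 'Mike', 'School': 'Law', 'Location': 'Frat House #22'},
])

# suffixes=(left, right) replaces the default ('_x', '_y')
merged = pd.merge(staff_df, student_df, how='left', on='Name',
                  suffixes=('_staff', '_student'))
print(merged.columns.tolist())
```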


### Multi-indexing and Multiple Columns
What if a staff member and a student share the same first name but are different people? Pass a list of column names to `on` in `merge()` to join on several columns at once. Each of those columns must exist in both dataframes.

In [8]:
# specify first and last names
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liason', 'Location': 'Washington Ave'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader', 'Location': 'Washington Ave'}])

## add home location
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business', 'Location': '1024 Billiard Ave'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law', 'Location': "Frat House #22"},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])

# James Wilde and James Hammond don't match


pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])

Unnamed: 0,First Name,Last Name,Role,Location_x,School,Location_y
0,Sally,Brooks,Course liason,Washington Ave,Engineering,512 Wilson Crescent


### Concatenate Dataframes
Concatenation stacks dataframes on top of each other, for example one dataframe per year of data.

In [9]:
%%capture
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df_2011 = pd.read_csv('datasets/college_scorecard/MERGED2011_12_PP.csv', on_bad_lines='skip')
df_2012 = pd.read_csv('datasets/college_scorecard/MERGED2012_13_PP.csv', on_bad_lines='skip')
df_2013 = pd.read_csv('datasets/college_scorecard/MERGED2013_14_PP.csv', on_bad_lines='skip')

In [10]:
df_2011.head(3)

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,


In [11]:
print(len(df_2011))
print(len(df_2012))
print(len(df_2013))

15235
7793
7804


In [12]:
frames = [df_2011, df_2012, df_2013]
pd.concat(frames)

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706.0,105500.0,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724.0,100500.0,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7799,48285703.0,157107.0,1571,Georgia Military College-Columbus Campus,Columbus,GA,31909,,,,...,,,,,,,,,,
7800,48285704.0,157101.0,1571,Georgia Military College-Valdosta Campus,Valdosta,GA,31605,,,,...,,,,,,,,,,
7801,48285705.0,157105.0,1571,Georgia Military College-Warner Robins Campus,Warner Robins,GA,31093,,,,...,,,,,,,,,,
7802,48285706.0,157100.0,1571,Georgia Military College-Online,Milledgeville,GA,31061,,,,...,,,,,,,,,,


After concatenating the 3 dataframes there should be 30,832 rows. Check by summing the individual lengths:

In [13]:
len(df_2011) + len(df_2012) + len(df_2013)

30832

It matches!

However, we now don't know what year each record came from. Use the parameter `keys` in `concat()` to set an extra level of indices:

In [14]:
pd.concat(frames, keys=['2011', '2012', '2013'])

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2011,0,100654.0,100200.0,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2011,1,100663.0,105200.0,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2011,2,100690.0,2503400.0,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2011,3,100706.0,105500.0,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2011,4,100724.0,100500.0,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2013,7799,48285703.0,157107.0,1571,Georgia Military College-Columbus Campus,Columbus,GA,31909,,,,...,,,,,,,,,,
2013,7800,48285704.0,157101.0,1571,Georgia Military College-Valdosta Campus,Valdosta,GA,31605,,,,...,,,,,,,,,,
2013,7801,48285705.0,157105.0,1571,Georgia Military College-Warner Robins Campus,Warner Robins,GA,31093,,,,...,,,,,,,,,,
2013,7802,48285706.0,157100.0,1571,Georgia Military College-Online,Milledgeville,GA,31061,,,,...,,,,,,,,,,
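
Once the year is an index level, it can be used to pull one year back out or be turned into an ordinary column. A minimal sketch on two tiny made-up frames (the school names and the level names `year` and `row` are assumptions for illustration, not part of the scorecard data):

```python
import pandas as pd

# Stand-ins for two yearly files
df_a = pd.DataFrame({'INSTNM': ['School A', 'School B']})
df_b = pd.DataFrame({'INSTNM': ['School A', 'School C', 'School D']})

# keys labels each frame; names labels the resulting index levels
stacked = pd.concat([df_a, df_b], keys=['2011', '2012'], names=['year', 'row'])

# Select every record that came from the 2012 frame
only_2012 = stacked.loc['2012']

# Move the year level into a regular column
flat = stacked.reset_index(level='year')
print(flat)
```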


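`concat()` also has a `join` parameter that controls how columns are combined when the dataframes don't all share the same columns:
* `join='outer'` (the default): keeps every column; cells are `NaN` where a dataframe lacked that column
* `join='inner'`: keeps only the columns present in every dataframe, so the other columns are dropped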
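
A short sketch of the two settings, assuming two made-up frames that share only the `UNITID` column (the values are illustrative). When stacking rows, `join` decides which columns survive; every row is kept either way:

```python
import pandas as pd

df_a = pd.DataFrame({'UNITID': [1, 2], 'CITY': ['Normal', 'Birmingham']})
df_b = pd.DataFrame({'UNITID': [3], 'ZIP': ['36117']})

# outer (default): union of columns, gaps filled with NaN
outer = pd.concat([df_a, df_b], join='outer')

# inner: only the columns found in every frame
inner = pd.concat([df_a, df_b], join='inner')

print(outer)
print(inner)
```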

## Pandas Idioms
Vectorization >> iterative loops

In [2]:
import pandas as pd
import numpy as np
import timeit

df = pd.read_csv('datasets/census.csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


### Method Chaining
In method chaining, each method returns a DataFrame (or Series), so the next method can be called directly on the result. This lets you condense your code into one statement.

Example) pull out the state and county names as a multi-index, but only for rows with a summary level of 50 (that is, summarized at the county level)

In [16]:
# the ~pandorable~ way
(df.where(df['SUMLEV'] == 50) # passed in boolean mask
 .dropna() # need to drop missing values since where() does not by default
 .set_index(['STNAME', 'CTYNAME'])
 .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})) # make column name more readable

STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimates Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
Alabama,Autauga County,50.0,3.0,6.0,1.0,1.0,54571.0,54571.0,54660.0,55253.0,55175.0,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50.0,3.0,6.0,1.0,3.0,182265.0,182265.0,183193.0,186659.0,190396.0,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50.0,3.0,6.0,1.0,5.0,27457.0,27457.0,27341.0,27226.0,27159.0,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50.0,3.0,6.0,1.0,7.0,22915.0,22919.0,22861.0,22733.0,22642.0,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50.0,3.0,6.0,1.0,9.0,57322.0,57322.0,57373.0,57711.0,57776.0,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50.0,4.0,8.0,56.0,37.0,43806.0,43806.0,43593.0,44041.0,45104.0,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50.0,4.0,8.0,56.0,39.0,21294.0,21294.0,21297.0,21482.0,21697.0,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50.0,4.0,8.0,56.0,41.0,21118.0,21118.0,21102.0,20912.0,20989.0,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50.0,4.0,8.0,56.0,43.0,8533.0,8533.0,8545.0,8469.0,8443.0,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


In [17]:
# traditional / non-pandorable way
## boolean-mask indexing keeps only the matching rows, so there are no NaN rows to drop
df = df[df['SUMLEV']==50] 

## set new index
df.set_index(['STNAME', 'CTYNAME'], inplace=True)

## rename columns
df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimates Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50,4,8,56,37,43806,43806,43593,44041,45104,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50,4,8,56,39,21294,21294,21297,21482,21697,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50,4,8,56,41,21118,21118,21102,20912,20989,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50,4,8,56,43,8533,8533,8545,8469,8443,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


Use the `timeit` module to see which way is faster:

In [18]:
def first_approach():
    global df # modifying it here will still change the variable in the global scope
    return (df.where(df['SUMLEV'] == 50)
            .dropna()
            .set_index(['STNAME', 'CTYNAME'])
            .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

df = pd.read_csv('datasets/census.csv')

timeit.timeit(first_approach, number=10)

0.24976828321814537

In [19]:
def second_approach():
    global df
    new_df = df[df['SUMLEV'] == 50]
    new_df.set_index(['STNAME', 'CTYNAME'], inplace=True)
    return new_df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})

df = pd.read_csv('datasets/census.csv')

timeit.timeit(second_approach, number=10)

0.05060093477368355

The second approach is much faster! What appear to be stylistic idioms can carry real performance costs.

### `apply()` Function
Python's `map()` applies a function to each item of an iterable, such as a list, and yields the result of each call. In Python 3 it returns a lazy iterator, which you can wrap in `list()` to get a list of all the results.

Pandas' version is `apply()`, which takes the function and the axis on which to operate as parameters. To call the function once per row, so that it can read across all of that row's columns, pass `axis='columns'` (equivalently `axis=1`).
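
The `axis` argument is easy to get backwards, so here is a small sketch on a made-up two-column frame (the column names `a` and `b` are placeholders). With `axis='columns'` the function receives one row at a time; with `axis='index'` it receives one column at a time:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis='columns': called once per row; 'row' is a Series indexed by column name
row_sums = df.apply(lambda row: row['a'] + row['b'], axis='columns')

# axis='index': called once per column; 'col' is a Series indexed by row label
col_max = df.apply(lambda col: col.max(), axis='index')

print(row_sums.tolist())
print(col_max.to_dict())
```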

Example) there are six columns of population estimates (2010 through 2015), one per year. Create some new columns for the minimum/maximum values. This requires a function which takes in a row of data and finds the min/max.

In [20]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})


In [21]:
df.apply(min_max, axis='columns').head()

Unnamed: 0,min,max
0,4785161,4858979
1,54660,55347
2,183193,203709
3,26489,27341
4,22512,22861


Instead of returning a new series object, add new columns to existing dataframe:

In [22]:
def min_max(row):
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

df.apply(min_max, axis='columns')

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,max,min
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594,4858979,4785161
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333,55347,54660
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499,203709,183193
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299,27341,26489
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861,22861,22512
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195,45162,43593
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747,23125,21297
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351,21102,20822
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961,8545,8316


#### lambda function
lambda: an unnamed (anonymous) function in Python. The ones used here take one parameter `x` and return a single value.

This is how `apply()` is typically used, rather than with large function definitions like those above.

In [23]:
# these are column names; each x passed to the lambda is one row of the dataframe
cols = ['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']

df.apply(lambda x: np.max(x[cols]), axis=1).head()

0    4858979
1      55347
2     203709
3      27341
4      22861
dtype: int64
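
Since vectorization generally beats row-wise calls, the same row maximum can also be computed without `apply()`. A minimal sketch on a made-up three-column frame (the numbers are illustrative, not census data), checking that both routes agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'POPESTIMATE2010': [100, 50],
    'POPESTIMATE2011': [120, 40],
    'POPESTIMATE2012': [110, 60],
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# Row-wise: the lambda runs once per row
via_apply = df.apply(lambda x: np.max(x[cols]), axis=1)

# Vectorized: one call over the whole block of columns
vectorized = df[cols].max(axis=1)

print(via_apply.tolist(), vectorized.tolist())
```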

Example) divide the states into 4 region categories: Northeast, Midwest, South, and West

In [24]:
def get_state_region(x):
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New York', 'New Jersey', 'Pennsylvania']
    midwest = ['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa',
               'Kansas','Minnesota','Missouri','Nebraska','North Dakota',
               'South Dakota']
    south = ['Delaware','Florida','Georgia','Maryland','North Carolina',
             'South Carolina','Virginia','District of Columbia','West Virginia',
             'Alabama','Kentucky','Mississippi','Tennessee','Arkansas',
             'Louisiana','Oklahoma','Texas']
    west = ['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah',
            'Wyoming','Alaska','California','Hawaii','Oregon','Washington']
    
    if x in northeast:
        return "Northeast"
    elif x in midwest:
        return "Midwest"
    elif x in south:
        return "South"
    else:
        return "West"
    

In [25]:
# apply() on the single STNAME column (a Series) calls the function once per state name
# assigning the result to df['state_region'] adds it as a new column, so we still have the full dataframe
df['state_region'] = df['STNAME'].apply(get_state_region)

df[['STNAME', 'state_region']].head()

Unnamed: 0,STNAME,state_region
0,Alabama,South
1,Alabama,South
2,Alabama,South
3,Alabama,South
4,Alabama,South


## Group By
`groupby()` takes a dataframe, splits it into chunks based on some key values, lets us apply a computation to each chunk, and then combines the results back together into another dataframe. This is the **split-apply-combine pattern**.

### Splitting

In [26]:
import pandas as pd
import numpy as np

df = pd.read_csv('datasets/census.csv')
df = df[df['SUMLEV']==50] # exclude state-level summarizations which have sum level value of 40
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


Example) calculate the average county population by state

In [27]:
%%timeit -n 3

for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.34328358209
Counties in state Alaska have an average population of 24490.724137931036
Counties in state Arizona have an average population of 426134.4666666667
Counties in state Arkansas have an average population of 38878.90666666667
Counties in state California have an average population of 642309.5862068966
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.3333333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.5671641791
Counties in state Georgia have an average population of 60928.63522012578
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.86363636364
Counties in state Illinois have an average populat

In [28]:
%%timeit -n 3
# using groupby
# grouping by state = our split

for group, frame in df.groupby('STNAME'):
    # iterating over a groupby object yields tuples:
    ## first value = the value of the key we grouped by, in this case a specific state name
    ## second value = the dataframe containing only the rows for that group
    
    avg = np.average(frame['CENSUS2010POP'])
    
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

Counties in state Alabama have an average population of 71339.34328358209
Counties in state Alaska have an average population of 24490.724137931036
Counties in state Arizona have an average population of 426134.4666666667
Counties in state Arkansas have an average population of 38878.90666666667
Counties in state California have an average population of 642309.5862068966
Counties in state Colorado have an average population of 78581.1875
Counties in state Connecticut have an average population of 446762.125
Counties in state Delaware have an average population of 299311.3333333333
Counties in state District of Columbia have an average population of 601723.0
Counties in state Florida have an average population of 280616.5671641791
Counties in state Georgia have an average population of 60928.63522012578
Counties in state Hawaii have an average population of 272060.2
Counties in state Idaho have an average population of 35626.86363636364
Counties in state Illinois have an average populat
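
The loops above do the split and the apply by hand. The whole split-apply-combine step can also be written as one expression by selecting a column on the grouped object and calling an aggregation on it. A minimal sketch on a made-up frame (the populations are illustrative, not census figures):

```python
import pandas as pd

df = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50],
})

# split by state, apply mean to one column, combine into a Series indexed by state
avg_by_state = df.groupby('STNAME')['CENSUS2010POP'].mean()
print(avg_by_state)
```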

### Provide function to `groupby()`
Instead of a column name, `groupby()` can take a function that decides which group each row belongs to.

Example) Create a function which returns a number between 0 and 2 based on the first character of the state name, and tell `groupby()` to use it to split up our dataframe. The function is called on index values, so first set the index of the dataframe to the column that you want to group by.

In [29]:
df = df.set_index('STNAME')

def set_batch_number(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2
    
    
for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')

There are 1177 records in group 0 for processing.
There are 1134 records in group 1 for processing.
There are 831 records in group 2 for processing.


No column name was passed into `groupby()`. Instead, the index of the dataframe was set to 'STNAME'. When `groupby()` is given a function rather than a column name, it calls that function on each index value.

Example) Airbnb data: look at cancellation_policy and review_scores_value

In [30]:
df = pd.read_csv('datasets/listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [31]:
df = df.set_index(["cancellation_policy", "review_scores_value"])

# When we have a multiindex we need to pass in the levels we are interested in grouping by
for group, frame in df.groupby(level=(0,1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)


Example) what if we still wanted to group by cancellation policy and review scores, but separate out all the 10's from those under 10? Use a function to manage groupings

In [32]:
def grouping_fun(item):
    # check the 'review_scores_value' portion of the index
    # 'item' is in the format of a tuple (cancellation_policy, review_scores_value)
    
    if item[1] == 10.0:
        return (item[0], '10.0')
    else:
        return (item[0], 'not 10.0')
    
for group, frame in df.groupby(by=grouping_fun):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')


In [33]:
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
cancellation_policy,review_scores_value,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,,,f,f,f,1,
moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,10.0,9.0,f,,,t,f,f,1,1.3
moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,9.0,f,,,f,t,f,1,0.47
moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,10.0,f,,,f,f,f,1,1.0
flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,9.0,f,,,f,f,f,1,2.25


At this point, we have only done simple processing on our data after splitting it. Three broad categories of processing can happen during the apply step: **aggregation** of group data, **transformation** of group data, and **filtration** of group data.

## Aggregation
Call the method `agg()` on a GroupBy object. Pass in a dictionary whose keys are the columns to aggregate and whose values are the functions to apply. So far, we've only iterated through a GroupBy object, unpacking each item into a label (the group name) and a dataframe.

Pass the functions as references to functions that return a single value: `np.nanmean`, not `np.nanmean()` or `"nanmean"`.

In [34]:
# reset index
df = df.reset_index() # have to set it = to df again, or pass in 'inplace=True'

# group by cancellation policy and find average review scores by group
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,
moderate,
strict,
super_strict_30,


`np.average` does not ignore `NaN`s! We have to use `np.nanmean`:

In [35]:
df.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313
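
A small sketch of the `NaN` behavior on made-up data (the toy frame below is an assumption for illustration). It also shows that pandas' own `'mean'` aggregation, passed as a string, skips `NaN` within each group by default, so it can stand in for `np.nanmean` here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one 'flexible' score is missing
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, np.nan, 8.0, 9.0],
})

# np.average propagates the NaN through the whole calculation
plain_average = np.average(toy['review_scores_value'].to_numpy())

# The 'mean' aggregation skips NaN within each group
group_means = toy.groupby('cancellation_policy').agg({'review_scores_value': 'mean'})

print(plain_average)
print(group_means)
```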


### Multiple Aggregations

In [36]:
df.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd),
                                       'reviews_per_month': np.nanmean})

Unnamed: 0_level_0,review_scores_value,review_scores_value,reviews_per_month
Unnamed: 0_level_1,nanmean,nanstd,nanmean
cancellation_policy,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143
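
An alternative sketch using named aggregation, which gives flat, readable column names instead of a two-level header. The toy data and the output column names (`score_mean`, `score_std`, `monthly_mean`) are made up for illustration. One caveat: pandas' `'std'` is the sample standard deviation (`ddof=1`), while `np.nanstd` defaults to `ddof=0`, so the two will not give identical numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, 8.0, 8.0, 9.0],
    'reviews_per_month': [1.0, 3.0, 2.0, np.nan],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy.groupby('cancellation_policy').agg(
    score_mean=('review_scores_value', 'mean'),
    score_std=('review_scores_value', 'std'),    # sample std (ddof=1)
    monthly_mean=('reviews_per_month', 'mean'),  # NaN is skipped
)
print(summary)
```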


## Transformation
`agg()` returns a single value per column, so one row per group.

`transform()` returns an object that's the same size as the group: it broadcasts the function you supply over the grouped dataframe and returns a new dataframe, which makes combining data later easy.

Example) Include the average rating for each cancellation policy group, but preserve the dataframe shape so that we can compute the difference between an individual observation and its group mean.

In [37]:
cols = ['cancellation_policy', 'review_scores_value']

transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head(10)

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421
5,9.237421
6,9.081441
7,9.307398
8,9.307398
9,9.081441


The index is actually the same as the original dataframe. Before we join it in, rename the column in the transformed version:

In [38]:
# 'review_scores_value' in the transformed df isn't actually the review value anymore, but the mean review score by cancellation_policy group
transform_df.rename({'review_scores_value':'mean_review_scores'}, axis='columns', inplace=True)

# merge on indices because they're referencing the same thing
df = df.merge(transform_df, left_index=True, right_index=True)
df.head(10)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,,f,,,f,f,f,1,,9.307398
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,9.0,f,,,t,f,f,1,1.3,9.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,9.0,f,,,f,t,f,1,0.47,9.307398
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,10.0,f,,,f,f,f,1,1.0,9.307398
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,9.0,f,,,f,f,f,1,2.25,9.237421
5,flexible,10.0,12386020,https://www.airbnb.com/rooms/12386020,20160906204935,2016-09-07,Private Bedroom + Great Coffee,Super comfy bedroom plus your own bathroom in ...,Our sunny condo is located on the second and t...,Super comfy bedroom plus your own bathroom in ...,...,9.0,f,,,f,f,f,1,1.7,9.237421
6,strict,9.0,5706985,https://www.airbnb.com/rooms/5706985,20160906204935,2016-09-07,New Lrg Studio apt 15 min to Boston,It's a 5 minute walk to Rosi Square to catch t...,The whole house was recently redone and it 's ...,It's a 5 minute walk to Rosi Square to catch t...,...,9.0,f,,,f,f,f,3,4.0,9.081441
7,moderate,10.0,2843445,https://www.airbnb.com/rooms/2843445,20160906204935,2016-09-07,"""Tranquility"" on ""Top of the Hill""","We can accommodate guests who are gluten-free,...",We provide a bedroom and full shared bath. Ra...,"We can accommodate guests who are gluten-free,...",...,10.0,f,,,f,t,t,2,2.38,9.307398
8,moderate,10.0,753446,https://www.airbnb.com/rooms/753446,20160906204935,2016-09-07,6 miles away from downtown Boston!,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,Nice and cozy apartment about 6 miles away to ...,...,9.0,f,,,f,f,f,1,5.36,9.307398
9,strict,9.0,849408,https://www.airbnb.com/rooms/849408,20160906204935,2016-09-07,Perfect & Practical Boston Rental,This is a cozy and spacious two bedroom unit w...,Perfect apartment rental for those in town vis...,This is a cozy and spacious two bedroom unit w...,...,9.0,f,,,f,f,f,2,1.01,9.081441


Create a new column for the difference between a given row's review score and its group mean score:

In [39]:
df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_scores']) # vectorization
df['mean_diff'].head()


0         NaN
1    0.307398
2    0.692602
3    0.692602
4    0.762579
Name: mean_diff, dtype: float64
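
Because the transformed result shares the original index, it can also be assigned straight to a new column, skipping the rename and merge. A minimal sketch on made-up data (the toy frame is an assumption, not the listings dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one 'flexible' score is missing
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, np.nan, 8.0, 9.0],
})

# transform() returns one value per original row, aligned on the original index
toy['mean_review_scores'] = (
    toy.groupby('cancellation_policy')['review_scores_value'].transform('mean'))

# Vectorized absolute difference between each row and its group mean
toy['mean_diff'] = (toy['review_scores_value'] - toy['mean_review_scores']).abs()
print(toy)
```

The row with the missing score still receives its group mean, but its `mean_diff` stays `NaN`.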

## Filtering
The GroupBy object has built-in support for filtering groups as well. 

When you group by a feature and transform the groups, you may want to drop certain groups as part of your cleaning routine. `filter()` takes a function, applies it to each group's dataframe, and expects it to return `True` or `False` depending on whether that group should be included in the results.

Example) Keep only those groups which have a mean rating above 9.2:

In [40]:
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.30,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.00,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3576,flexible,,14689681,https://www.airbnb.com/rooms/14689681,20160906204935,2016-09-07,Beautiful loft style bedroom with large bathroom,You'd be living on the top floor of a four sto...,,You'd be living on the top floor of a four sto...,...,f,,,f,f,f,1,,9.237421,
3577,flexible,,13750763,https://www.airbnb.com/rooms/13750763,20160906204935,2016-09-07,Comfortable Space in the Heart of Brookline,"Our place is close to Coolidge Corner, Allston...",This space consists of 2 Rooms and a private b...,"Our place is close to Coolidge Corner, Allston...",...,f,,,f,f,f,1,,9.237421,
3579,flexible,,14852179,https://www.airbnb.com/rooms/14852179,20160906204935,2016-09-07,Spacious Queen Bed Room Close to Boston Univer...,- Grocery: A full-size Star market is 2 minute...,,- Grocery: A full-size Star market is 2 minute...,...,f,,,f,f,f,1,,9.237421,
3582,flexible,,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,...,f,,,f,f,f,1,,9.237421,


Results are still indexed, but not all indices were copied over because they were in a group with a mean review score less than or equal to 9.2.
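
A toy version of the same filter, to make the "whole groups are kept or dropped" behavior visible. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'flexible' averages 10.0, 'strict' averages 8.5
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, np.nan, 8.0, 9.0],
})

# The lambda sees one group's dataframe at a time and returns True/False
kept = toy.groupby('cancellation_policy').filter(
    lambda g: g['review_scores_value'].mean() > 9.2)
print(kept)
```

Every row of a passing group survives, including the row whose own score is missing, and the surviving rows keep their original index labels.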

## Applying
`apply()` lets you run an arbitrary function on each group in a GroupBy object, then stitches the per-group results back together into a single dataframe in which the index is preserved.

In [41]:
df = pd.read_csv('datasets/listings.csv')

df = df[['cancellation_policy','review_scores_value']]
df.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


To find a listing's deviation from its group's mean review score, we previously needed a two-step process: first `transform()` on the GroupBy object, then a broadcast to create a new column.

With `apply()` we can wrap this logic in one place:

In [42]:
def calc_mean_review_scores(group):
    # 'group' is a dataframe of whatever we have grouped by, e.g. cancellation_policy, so can treat as complete dataframe
    avg = np.nanmean(group['review_scores_value'])
    
    # now broadcast formula and create new column
    # (despite its name, it holds each row's absolute deviation from the group mean)
    group['review_scores_mean'] = np.abs(avg - group['review_scores_value'])
    
    return group


df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()

Unnamed: 0,cancellation_policy,review_scores_value,review_scores_mean
0,moderate,,
1,moderate,9.0,0.307398
2,moderate,10.0,0.692602
3,moderate,10.0,0.692602
4,flexible,10.0,0.762579
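
A hedged variant on made-up data: applying the function to a single selected column rather than to the whole group dataframe. As far as I know, recent pandas versions warn about, or exclude, the grouping column when `apply()` runs on the full frame, so this sketch sidesteps that; check the behavior against your installed version. The `abs_deviation` helper and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, np.nan, 8.0, 9.0],
})

def abs_deviation(scores):
    # 'scores' is the review_scores_value Series for one group
    return (scores - scores.mean()).abs()

# group_keys=False keeps the original row labels, so assignment aligns by index
toy['abs_dev'] = (
    toy.groupby('cancellation_policy', group_keys=False)['review_scores_value']
       .apply(abs_deviation))
print(toy)
```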


# Scales
1. Ratio Scale: units are equally spaced and there is a true zero; all mathematical operations (+, -, *, /) are valid
2. Interval Scale: units are equally spaced, but there is no true zero; multiplication and division are not meaningful; e.g. 0 degrees on a compass doesn't mean an absence of direction; 0 degrees F doesn't mean no temperature
3. Ordinal Scale: order of units is important, but they are not evenly spaced; e.g. letter grades
4. Nominal Scale: categorical data with no order; e.g. teams of a sport

These distinctions matter for statistics and machine learning; pandas lets you convert between measurement scales.

## Nominal / Categorical Data
use `astype('category')` to change your data type

In [1]:
import pandas as pd

df=pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 
                       'ok', 'ok', 'ok', 'poor', 'poor'],
               columns=["Grades"])
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+


In [2]:
df.dtypes

Grades    object
dtype: object

In [3]:
# nominal scale
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): ['A', 'A+', 'A-', 'B', ..., 'C+', 'C-', 'D', 'D+']

In [5]:
# ordinal scale
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'], 
                                    ordered=True)

grades = df['Grades'].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']

Ordering data helps with comparisons and boolean masking.

With the plain string column (no ordering defined), comparing grades to 'C' is a lexicographical comparison and doesn't return the results we want:

In [6]:
df[df['Grades']>'C']

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [7]:
grades[grades>'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']
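
Ordering also makes `min()`, `max()`, and `sort_values()` follow the grade order rather than alphabetical order. A minimal sketch with a shortened, made-up grade list (an assumption for illustration):

```python
import pandas as pd

# Hypothetical four-level ordered scale, lowest to highest
grade_order = pd.CategoricalDtype(categories=['D', 'C', 'B', 'A'], ordered=True)
toy_grades = pd.Series(['B', 'A', 'C', 'D']).astype(grade_order)

# Boolean mask uses the category order, not string comparison
above_c = toy_grades[toy_grades > 'C']

print(above_c.tolist())
print(toy_grades.min(), toy_grades.max())
print(toy_grades.sort_values().tolist())
```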

### Dummy Variables using `get_dummies()`
Use this when you want one column per category, where each value is True/False depending on whether the row belongs to that category.
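
A minimal sketch of `get_dummies()` on a made-up grade column (the toy data is an assumption). Depending on the pandas version, the indicator columns may come back as booleans or as 0/1 integers, so the example casts to `int` before summing.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Grades': ['A', 'B', 'A', 'C']})

# One indicator column per distinct grade
dummies = pd.get_dummies(toy['Grades'])
print(dummies)

# Each row belongs to exactly one category
print(dummies.astype(int).sum(axis=1).tolist())
```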

## Convert Interval/Ratio Scale to Categorical using `cut()`
e.g. making histogram --> frequencies of categories

If you're using an ML classification approach, you need categorical data, so reducing a continuous value to a few bins may be useful just to be able to apply a given technique.

Example) Back to the census data: bin the states by average county size.

In [8]:
import numpy as np

df = pd.read_csv('datasets/census.csv')

df = df[df['SUMLEV']==50]

df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)

df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [9]:
# use 10 bins
pd.cut(df,10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

`cut()` is one way to build categories from your data using an **interval** scale, where every bin has the same width.

But sometimes you want to form categories based on frequency, so that each bin holds the same number of items rather than spanning the same width. It depends on the shape of your data and what you want to do with it.
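
For frequency-based bins, pandas provides `qcut()`, which cuts at quantiles. A small sketch on made-up, deliberately skewed values (an assumption for illustration) shows the contrast with `cut()`:

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, 4)   # four bins of equal width
equal_freq = pd.qcut(values, 4)   # four bins with equal counts (quartiles)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

With equal-width bins almost everything piles into the first bin, while the quantile-based bins each hold two values.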

# Pivot Table
* makes heavy use of aggregation functions
* is itself a DataFrame
* can also include marginal values: the aggregate (e.g. sum or mean) for each row and column

Example) Times Higher Education World University Ranking dataset

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('datasets/cwurData.csv')
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


Add column for Rank_Level where institutions with a world ranking of 1-100 are first tier, 101-200 are second, 201-300 are third, and 301+ are other top universities:

In [4]:
def get_rank(x):
    #x = int(x)
    if x <= 100:
        return "First Tier"
    elif x <= 200:
        return "Second Tier"
    elif x <= 300:
        return "Third Tier"
    else:
        return "Other Top University"
    
    
df['Rank_Level'] = df['world_rank'].apply(get_rank)

df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,Rank_Level
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012,First Tier
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012,First Tier
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012,First Tier
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012,First Tier
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012,First Tier


Now, create a pivot table to compare rank level versus country of the universities in terms of overall score:

In [5]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean]).head()

Unnamed: 0_level_0,mean,mean,mean,mean
Rank_Level,First Tier,Other Top University,Second Tier,Third Tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Argentina,,44.672857,,
Australia,47.9425,44.64575,49.2425,47.285
Austria,,44.864286,,47.066667
Belgium,51.875,45.081,49.084,46.746667
Brazil,,44.499706,49.565,


Pass in two functions to aggregate pivot table:

In [6]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max]).head()

Unnamed: 0_level_0,mean,mean,mean,mean,amax,amax,amax,amax
Rank_Level,First Tier,Other Top University,Second Tier,Third Tier,First Tier,Other Top University,Second Tier,Third Tier
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Argentina,,44.672857,,,,45.66,,
Australia,47.9425,44.64575,49.2425,47.285,51.61,45.97,50.4,47.47
Austria,,44.864286,,47.066667,,46.29,,47.78
Belgium,51.875,45.081,49.084,46.746667,52.03,46.21,49.73,47.14
Brazil,,44.499706,49.565,,,46.08,49.82,


Can also summarize values within a given top-level column, e.g. the overall average score for each country and the max of the maxes. To have pandas provide these marginal values, pass the parameter `margins=True`.

In [7]:
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], margins=True).head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_Level,First Tier,Other Top University,Second Tier,Third Tier,All,First Tier,Other Top University,Second Tier,Third Tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,45.66,,,45.66
Australia,47.9425,44.64575,49.2425,47.285,45.825517,51.61,45.97,50.4,47.47,51.61
Austria,,44.864286,,47.066667,45.139583,,46.29,,47.78,47.78
Belgium,51.875,45.081,49.084,46.746667,47.011,52.03,46.21,49.73,47.14,52.03
Brazil,,44.499706,49.565,,44.781111,,46.08,49.82,,49.82
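
A toy pivot table small enough to check by hand. The countries and scores are made up, and `aggfunc='mean'` is passed as a string here rather than as `np.mean`. It illustrates that the `All` margins are computed from the underlying rows, not by averaging the cells already shown.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'Rank_Level': ['First Tier', 'Second Tier', 'First Tier', 'First Tier'],
    'score': [90.0, 50.0, 80.0, 70.0],
})

table = toy.pivot_table(values='score', index='country', columns='Rank_Level',
                        aggfunc='mean', margins=True)
print(table)
```

The First Tier cells are 90 for A and 75 for B, but the First Tier margin is the mean of the three underlying scores (90, 80, 70), not the mean of those two cells.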


In [8]:
new_df = df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.mean, np.max], margins=True)

print(new_df.index)

print(new_df.columns)

Index(['Argentina', 'Australia', 'Austria', 'Belgium', 'Brazil', 'Bulgaria',
       'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Cyprus',
       'Czech Republic', 'Denmark', 'Egypt', 'Estonia', 'Finland', 'France',
       'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Iran',
       'Ireland', 'Israel', 'Italy', 'Japan', 'Lebanon', 'Lithuania',
       'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Norway', 'Poland',
       'Portugal', 'Puerto Rico', 'Romania', 'Russia', 'Saudi Arabia',
       'Serbia', 'Singapore', 'Slovak Republic', 'Slovenia', 'South Africa',
       'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
       'Turkey', 'USA', 'Uganda', 'United Arab Emirates', 'United Kingdom',
       'Uruguay', 'All'],
      dtype='object', name='country')
MultiIndex([('mean',           'First Tier'),
            ('mean', 'Other Top University'),
            ('mean',          'Second Tier'),
            ('mean',           'Third Tier'),

Columns are hierarchical. The top-level column index has two categories, mean and max, and the lower-level column index has five: one for each of the four rank levels, plus `All` for the margins.

### Querying
How do we query the dataframe to get the average scores of First Tier universities in each country? Use two successive projections: first select `'mean'`, then `'First Tier'`.

In [25]:
new_df['mean']['First Tier'].head()

country
Argentina        NaN
Australia    47.9425
Austria          NaN
Belgium      51.8750
Brazil           NaN
Name: First Tier, dtype: float64

In [17]:
type(new_df['mean']['First Tier'])

pandas.core.series.Series

Find the country that has the maximum average score on First Tier universities using `idxmax()`:

In [18]:
new_df['mean']['First Tier'].idxmax()

'United Kingdom'
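
The two chained projections can also be written as a single lookup with a tuple key. A sketch on made-up data (the toy frame is an assumption, with string aggregation names in place of the NumPy functions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'Rank_Level': ['First Tier', 'Second Tier', 'First Tier', 'First Tier'],
    'score': [90.0, 50.0, 80.0, 70.0],
})

table = toy.pivot_table(values='score', index='country', columns='Rank_Level',
                        aggfunc=['mean', 'max'])

chained = table['mean']['First Tier']      # two projections
direct = table[('mean', 'First Tier')]     # one lookup with a tuple key

print(direct)
print(direct.idxmax())
```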

### (Un)Stacking
change the shape of your pivot table

**Stacking** = pivoting the lowermost column index to become the innermost row index

**Unstacking** = pivoting the innermost row index to become the lowermost column index (the opposite of Stacking)

In [9]:
new_df.head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_Level,First Tier,Other Top University,Second Tier,Third Tier,All,First Tier,Other Top University,Second Tier,Third Tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,44.672857,,,44.672857,,45.66,,,45.66
Australia,47.9425,44.64575,49.2425,47.285,45.825517,51.61,45.97,50.4,47.47,51.61
Austria,,44.864286,,47.066667,45.139583,,46.29,,47.78,47.78
Belgium,51.875,45.081,49.084,46.746667,47.011,52.03,46.21,49.73,47.14,52.03
Brazil,,44.499706,49.565,,44.781111,,46.08,49.82,,49.82


In [14]:
# another way to transpose new_df
new_df.unstack().unstack().head()

Unnamed: 0_level_0,country,Argentina,Australia,Austria,Belgium,Brazil,Bulgaria,Canada,Chile,China,Colombia,...,Switzerland,Taiwan,Thailand,Turkey,USA,Uganda,United Arab Emirates,United Kingdom,Uruguay,All
Unnamed: 0_level_1,Rank_Level,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
mean,First Tier,,47.9425,,51.875,,,53.633846,,53.5925,,...,54.005,54.21,,,61.066726,,,63.937931,,58.350675
mean,Other Top University,44.672857,44.64575,44.864286,45.081,44.499706,44.335,44.760541,44.7675,44.564267,44.4325,...,44.625,44.476667,44.83,44.481,44.871718,44.28,44.22,44.881299,44.255,44.738871
mean,Second Tier,,49.2425,,49.084,49.565,,49.218182,,47.868,,...,48.184,,,,49.069524,,,48.9575,,49.06545
mean,Third Tier,,47.285,47.066667,46.746667,,,46.826364,,46.92625,,...,47.93,47.065,46.55,,46.818333,,,46.862273,,46.84345
mean,All,44.672857,45.825517,45.139583,47.011,44.781111,44.335,47.359306,44.7675,44.992575,44.4325,...,51.208846,45.012391,45.116667,44.481,51.83986,44.28,44.22,49.474653,44.255,47.798395


In [27]:
new_df=new_df.stack()
new_df.head()

# Rank_Level becomes innermost row index; moved to the right of 'country'

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,amax
country,Rank_Level,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,Other Top University,44.672857,45.66
Argentina,All,44.672857,45.66
Australia,First Tier,47.9425,51.61
Australia,Other Top University,44.64575,45.97
Australia,Second Tier,49.2425,50.4


In [28]:
new_df.unstack().head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank_Level,First Tier,Other Top University,Second Tier,Third Tier,All,First Tier,Other Top University,Second Tier,Third Tier,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
All,58.350675,44.738871,49.06545,46.84345,47.798395,100.0,46.34,51.29,47.93,100.0
Argentina,,44.672857,,,44.672857,,45.66,,,45.66
Australia,47.9425,44.64575,49.2425,47.285,45.825517,51.61,45.97,50.4,47.47,51.61
Austria,,44.864286,,47.066667,45.139583,,46.29,,47.78,47.78
Belgium,51.875,45.081,49.084,46.746667,47.011,52.03,46.21,49.73,47.14,52.03


In [29]:
new_df.unstack().unstack().head()

      Rank_Level  country  
mean  First Tier  All          58.350675
                  Argentina          NaN
                  Australia    47.942500
                  Austria            NaN
                  Belgium      51.875000
dtype: float64

End up unstacking to a single column, a Series object. This column is just a 'value', the meaning of which is denoted by the hierarchical index of operation, rank, and country.
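
A two-by-two sketch makes the round trip easier to follow. The values are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: one row per country, one column per statistic
wide = pd.DataFrame({'mean': [44.7, 45.8], 'max': [45.7, 51.6]},
                    index=pd.Index(['Argentina', 'Australia'], name='country'))

# stack(): column labels become the innermost row index level -> a Series
long = wide.stack()
print(long)

# unstack(): innermost row index level goes back to being columns
roundtrip = long.unstack()
print(roundtrip)
```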

# Date/Time Functionality
Four main time-related classes:
1. Timestamp
2. DatetimeIndex
3. Period
4. PeriodIndex

## Timestamp
Represents a single point in time and associates values with it

In [30]:
import pandas as pd
import numpy as np

### Creating a Timestamp Instance

In [31]:
pd.Timestamp('9/1/2019 10:05AM')

Timestamp('2019-09-01 10:05:00')

In [32]:
pd.Timestamp(2019, 12, 20, 0, 0)

Timestamp('2019-12-20 00:00:00')

### Useful Attributes

In [34]:
# 1 = Monday, 7 = Sunday
pd.Timestamp(2019, 12, 20, 0, 0).isoweekday()

5

In [36]:
# extract the specific year, month, day, hour, minute, second from a timestamp with `.____`
pd.Timestamp(2019, 12, 20, 5, 2, 23).second

23

## Period
A single span of time, not a specific point in time

Encapsulates the granularity for arithmetic

In [37]:
pd.Period('1/2016')

Period('2016-01', 'M')

In [38]:
pd.Period('3/5/2016')

Period('2016-03-05', 'D')

In [39]:
pd.Period('1/2016') + 5

Period('2016-06', 'M')

In [40]:
pd.Period('3/5/2016') - 2

Period('2016-03-03', 'D')
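
A short sketch of how a Period relates to Timestamps: it has a start and an end, and a Timestamp can be converted to the Period that contains it. The variable names are hypothetical.

```python
import pandas as pd

month = pd.Period('2016-01', freq='M')

# A Period spans an interval with a first and last instant
print(month.start_time)
print(month.end_time)

# A Timestamp falls inside exactly one monthly Period
stamp = pd.Timestamp('2016-01-15 10:05')
print(stamp.to_period('M') == month)

# Arithmetic moves in units of the Period's frequency (months here)
print(month + 5)
```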

## DatetimeIndex and PeriodIndex
An index made up of Timestamps is a DatetimeIndex

In [41]:
t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])

t1

2016-09-01    a
2016-09-02    b
2016-09-03    c
dtype: object

In [42]:
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

Can also do period-based index

In [43]:
t2 = pd.Series(list('def'), [pd.Period('2016-09'),pd.Period('2016-10'),pd.Period('2016-11')])

t2

2016-09    d
2016-10    e
2016-11    f
Freq: M, dtype: object

In [44]:
type(t2.index)

pandas.core.indexes.period.PeriodIndex

## Convert to Datetime

In [46]:
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']

ts3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=d1,
                   columns = list('ab'))

ts3

Unnamed: 0,a,b
2 June 2013,83,69
"Aug 29, 2014",28,87
2015-06-26,50,29
7/12/16,95,38


use `pd.to_datetime()` to convert to Datetime and put in standard format:

In [47]:
ts3.index = pd.to_datetime(ts3.index)
ts3

Unnamed: 0,a,b
2013-06-02,83,69
2014-08-29,28,87
2015-06-26,50,29
2016-07-12,95,38


change the date parse order for European dates with parameter `dayfirst=True`:

In [48]:
pd.to_datetime('4.7.12', dayfirst=True)

Timestamp('2012-07-04 00:00:00')
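
As far as I know, pandas 2.0 and later infer a single format from the first element, so a list of mixed formats passed in one `to_datetime()` call may raise unless `format='mixed'` is given. A version-agnostic sketch, under that assumption, is to parse each string on its own and build the index from the results:

```python
import pandas as pd

raw = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']

# Parse each string individually, so every element gets its own format inference
parsed = pd.DatetimeIndex([pd.to_datetime(s) for s in raw])
print(parsed)
```

This is slower than a single vectorized call, which only matters for large inputs.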

## Timedelta
differences in time, but not the same as a Period

In [49]:
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')

Timedelta('2 days 00:00:00')

Timedeltas can also be added to timestamps, for example to answer "what's the date and time 12 days and 3 hours past September 2nd at 8:10 AM?"

In [50]:
# lowercase 'h' for hours; the uppercase 'H' unit is deprecated in recent pandas
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3h')

Timestamp('2016-09-14 11:10:00')
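
Because a Timedelta is a fixed duration, it can be inspected and divided like a number. This sketch builds the same 12 days and 3 hours with keyword arguments instead of a string; `delta` and `start` are illustrative names.

```python
import pandas as pd

delta = pd.Timedelta(days=12, hours=3)  # keyword form of 12 days, 3 hours

start = pd.Timestamp('2016-09-02 08:10')
print(start + delta)  # Timestamp('2016-09-14 11:10:00')

# A Timedelta is a fixed duration, so it can be inspected and scaled
print(delta.days)                     # 12
print(delta.total_seconds())          # 1047600.0
print(delta / pd.Timedelta(hours=1))  # 291.0 (hours)
```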

## Offset
* Similar to a Timedelta, but follows calendar rules instead of representing a fixed duration.
* Allows flexibility in the types of time intervals.
* Supports intervals such as business day, end of month, semi-month begin, etc.

In [51]:
# weekday(): 0 = Monday, 6 = Sunday (unlike isoweekday(), which runs 1 to 7)
pd.Timestamp('9/4/2016').weekday()

6

In [52]:
pd.Timestamp('9/4/2016') + pd.offsets.Week()

Timestamp('2016-09-11 00:00:00')

In [53]:
pd.Timestamp('9/4/2016') + pd.offsets.MonthEnd()

Timestamp('2016-09-30 00:00:00')
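
The difference between an offset and a Timedelta shows up as soon as the calendar matters. Below is a minimal sketch using `BusinessDay` and `MonthBegin` from `pd.offsets`, contrasted with a fixed 30-day Timedelta; the name `sunday` is illustrative.

```python
import pandas as pd

sunday = pd.Timestamp('9/4/2016')  # weekday() == 6, a Sunday

# BusinessDay skips the weekend: the next business day after Sunday is Monday
print(sunday + pd.offsets.BusinessDay())  # 2016-09-05

# MonthBegin rolls forward to the first day of the next month
print(sunday + pd.offsets.MonthBegin())   # 2016-10-01

# A Timedelta of 30 days is a fixed length and ignores the calendar
print(sunday + pd.Timedelta(days=30))     # 2016-10-04
```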

## Working with Dates in Dataframes
Example: suppose we have nine measurements taken biweekly, every other Sunday, starting in October 2016. With `date_range`, we can create the matching DatetimeIndex.
* Specify either the start or the end date. A date passed as the first positional argument is treated as the start date.
* Specify the number of periods.
* Specify the frequency: "2W-SUN" means every two weeks, on Sunday.

In [54]:
dates = pd.date_range('10-01-2016', periods=9, freq="2W-SUN")
dates

DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')
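
The same range can be anchored on its last date instead of its first. This is a sketch assuming the final measurement date (2017-01-22) is known; `from_end` and `from_start` are illustrative names.

```python
import pandas as pd

# Nine biweekly Sundays, anchored on the last date instead of the first
from_end = pd.date_range(end='2017-01-22', periods=9, freq='2W-SUN')
from_start = pd.date_range(start='10-01-2016', periods=9, freq='2W-SUN')

print(from_end[0], from_end[-1])

# Both constructions should describe the same nine dates
print(list(from_end) == list(from_start))
```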

`freq='B'` generates business days, skipping weekends:

In [56]:
pd.date_range('10-01-2016', periods=9, freq='B')

DatetimeIndex(['2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06',
               '2016-10-07', '2016-10-10', '2016-10-11', '2016-10-12',
               '2016-10-13'],
              dtype='datetime64[ns]', freq='B')

Quarterly frequencies work too. `QS-JUN` anchors the quarter starts to June, so the dates fall on the first of June, September, December, and March:

In [57]:
pd.date_range('04-01-2016', periods=12, freq='QS-JUN')

DatetimeIndex(['2016-06-01', '2016-09-01', '2016-12-01', '2017-03-01',
               '2017-06-01', '2017-09-01', '2017-12-01', '2018-03-01',
               '2018-06-01', '2018-09-01', '2018-12-01', '2019-03-01'],
              dtype='datetime64[ns]', freq='QS-JUN')

Back to our biweekly Sunday example:

In [59]:
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9).cumsum()}, index=dates)
df

Unnamed: 0,Count 1,Count 2
2016-10-02,102,115
2016-10-16,108,116
2016-10-30,103,121
2016-11-13,99,118
2016-11-27,94,121
2016-12-11,99,124
2016-12-25,95,121
2017-01-08,94,124
2017-01-22,89,129


In [62]:
# double-check that the dates are in fact Sundays
df.index.day_name()  # replaces the removed .weekday_name attribute

Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday'],
      dtype='object')

Use `diff()` to find the change in each value from the previous date:

In [63]:
df.diff()

Unnamed: 0,Count 1,Count 2
2016-10-02,,
2016-10-16,6.0,1.0
2016-10-30,-5.0,5.0
2016-11-13,-4.0,-3.0
2016-11-27,-5.0,3.0
2016-12-11,5.0,3.0
2016-12-25,-4.0,-3.0
2017-01-08,-1.0,3.0
2017-01-22,-5.0,5.0
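
To see what `diff()` computes, the sketch below reproduces it by hand with `shift()`. It reuses the first four 'Count 1' values from the table above; the names `counts` and `by_hand` are illustrative.

```python
import pandas as pd

dates = pd.date_range('10-01-2016', periods=4, freq='2W-SUN')
counts = pd.Series([102, 108, 103, 99], index=dates)  # first four 'Count 1' values

# diff() is each value minus the previous row's value
by_hand = counts - counts.shift(1)
print(counts.diff().equals(by_hand))  # True

# The first row has no predecessor, so its difference is NaN
print(counts.diff())
```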


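What is the mean count for each month? Use resampling.

Downsampling means converting from a higher frequency to a lower one (here, biweekly to monthly). In pandas 2.2 and later the month-end alias is spelled 'ME', and 'M' is deprecated for resampling.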

In [64]:
df.resample('M').mean()

Unnamed: 0,Count 1,Count 2
2016-10-31,104.333333,117.333333
2016-11-30,96.5,119.5
2016-12-31,97.0,122.5
2017-01-31,91.5,126.5


## Datetime Indexing and Slicing
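Partial string indexing selects all rows from a particular year or month. Use `.loc` for this: selecting rows with a bare string, as in `df['2017']`, was removed in pandas 2.0, where it is treated as a column lookup.

In [65]:
df.loc['2017']

Unnamed: 0,Count 1,Count 2
2017-01-08,94,124
2017-01-22,89,129


In [66]:
df.loc['2016-12']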

Unnamed: 0,Count 1,Count 2
2016-12-11,99,124
2016-12-25,95,121


In [67]:
df['2016-12':]

Unnamed: 0,Count 1,Count 2
2016-12-11,99,124
2016-12-25,95,121
2017-01-08,94,124
2017-01-22,89,129
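
The selections above can be sketched on a reproducible frame as follows. The frame reuses the 'Count 1' values shown earlier, and the closed range in the last line is an added illustration of how both endpoints of a partial-string slice are inclusive.

```python
import pandas as pd

dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': [102, 108, 103, 99, 94, 99, 95, 94, 89]}, index=dates)

# Partial-string row selection through .loc
print(df.loc['2017'])               # every row in 2017
print(df.loc['2016-12'])            # every row in December 2016
print(df.loc['2016-11':'2016-12'])  # both endpoints are inclusive
```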
