### Chapter 3 - More Data Processing with Pandas

# Table of Contents

3.1 More Data Processing with Pandas

- Merging DataFrames
- Pandas Idioms
- Group by
- Scales
- Pivot Table
- Date/Time Functionality


# 3.1 More Data Processing with Pandas

## 1. Merging DataFrames

In [1]:
import pandas as pd

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])

staff_df = staff_df.set_index('Name')

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Bsuiness'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])

student_df = student_df.set_index('Name')

print("[Staff]:")
print(staff_df.head())
print("---------------------")
print("[Student]:")
print(student_df.head())


[Staff]:
                 Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader
---------------------
[Student]:
            School
Name              
James     Bsuiness
Mike           Law
Sally  Engineering


#### merge( )

In [2]:
# Outer Join on the left and the right indexes (Name)
pd.merge(staff_df, student_df, how = 'outer', left_index = True, right_index = True)


Name,Role,School
James,Grader,Bsuiness
Kelly,Director of HR,
Mike,,Law
Sally,Course liasion,Engineering


In [3]:
# Inner Join on the left and the right indexes (Name)
pd.merge(staff_df, student_df, how = 'inner', left_index = True, right_index = True)


Name,Role,School
Sally,Course liasion,Engineering
James,Grader,Bsuiness


In [4]:
# Left Join on the left and the right indexes (Name)
pd.merge(staff_df, student_df, how = 'left', left_index = True, right_index = True)


Name,Role,School
Kelly,Director of HR,
Sally,Course liasion,Engineering
James,Grader,Bsuiness


In [5]:
# Right Join on the left and the right indexes (Name)
pd.merge(staff_df, student_df, how = 'right', left_index = True, right_index = True)


Name,Role,School
James,Grader,Bsuiness
Mike,,Law
Sally,Course liasion,Engineering


#### Alternative Way:

In [6]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

# Right Join on 'Name' column
pd.merge(staff_df, student_df, how = 'right', on = 'Name')


Unnamed: 0,Name,Role,School
0,Sally,Course liasion,Engineering
1,James,Grader,Bsuiness
2,Mike,,Law
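
To check which side each row of a join came from, `pd.merge` accepts `indicator = True`. The sketch below uses small stand-in frames (not the exact ones above) and is only meant to illustrate the parameter:

```python
import pandas as pd

# Small stand-in frames, assumed for illustration only
staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course liaison', 'Grader']})
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering']})

# indicator = True adds a '_merge' column holding 'left_only', 'right_only' or 'both'
joined = pd.merge(staff, students, how = 'outer', on = 'Name', indicator = True)

# Map each name to the side(s) it was found on
origin = dict(zip(joined['Name'], joined['_merge'].astype(str)))
print(joined)
```

Rows marked `left_only` or `right_only` are the ones an inner join would have dropped.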


#### Conflicts between the DataFrames?

In [7]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR',
                          'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liaison',
                         'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader',
                         'Location': 'Washington Avenue'}])

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business',
                         'Location': '1024 Billiard Avenue'},
                            {'Name': 'Mike', 'School': 'Law',
                         'Location': 'Fraternity House #22'},
                            {'Name': 'Sally', 'School': 'Engineering',
                         'Location': '512 Wilson Crescent'}])

print("[Staff]:")
print(staff_df)
print("---------------------------------------------------------")
print("[Student]:")
print(student_df)


[Staff]:
    Name            Role           Location
0  Kelly  Director of HR       State Street
1  Sally  Course liaison  Washington Avenue
2  James          Grader  Washington Avenue
---------------------------------------------------------
[Student]:
    Name       School              Location
0  James     Business  1024 Billiard Avenue
1   Mike          Law  Fraternity House #22
2  Sally  Engineering   512 Wilson Crescent


- When both DataFrames share a column name that is not a join key, pandas appends suffixes: _x always marks the left DataFrame's column, and _y always marks the right DataFrame's column

In [8]:
# Left Join on the 'Name' column
pd.merge(staff_df, student_df, how = 'left', on = 'Name')


Unnamed: 0,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,State Street,,
1,Sally,Course liaison,Washington Avenue,Engineering,512 Wilson Crescent
2,James,Grader,Washington Avenue,Business,1024 Billiard Avenue
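
The default `_x` / `_y` suffixes can be replaced with more descriptive ones through the `suffixes` parameter of `pd.merge`. A minimal sketch with stand-in frames (the suffix names `_staff` and `_student` are just an assumption for this example):

```python
import pandas as pd

# Stand-in frames that both carry a 'Location' column
staff = pd.DataFrame({'Name': ['Kelly', 'Sally'],
                      'Location': ['State Street', 'Washington Avenue']})
students = pd.DataFrame({'Name': ['Sally', 'Mike'],
                         'Location': ['512 Wilson Crescent', 'Fraternity House #22']})

# suffixes = (left_suffix, right_suffix) replaces the default ('_x', '_y')
merged = pd.merge(staff, students, how = 'left', on = 'Name',
                  suffixes = ('_staff', '_student'))
print(merged.columns.tolist())
```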


#### Multi-indexing and multiple columns

In [9]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins',
                         'Role': 'Director of HR'}, 
                        {'First Name': 'Sally', 'Last Name': 'Brooks',
                         'Role': 'Course liaison'},
                        {'First Name': 'James', 'Last Name': 'Wilde',
                         'Role': 'Grader'}])

student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond',
                            'School': 'Business'},
                          {'First Name': 'Mike', 'Last Name': 'Smith',
                           'School': 'Law'},
                          {'First Name': 'Sally', 'Last Name': 'Brooks',
                           'School': 'Engineering'}])

print("[Staff]:")
print(staff_df)
print("----------------------------------------")
print("[Student]:")
print(student_df)


[Staff]:
  First Name   Last Name            Role
0      Kelly  Desjardins  Director of HR
1      Sally      Brooks  Course liaison
2      James       Wilde          Grader
----------------------------------------
[Student]:
  First Name Last Name       School
0      James   Hammond     Business
1       Mike     Smith          Law
2      Sally    Brooks  Engineering


In [10]:
# Inner Join on First Name and Last Name (both columns must match)
pd.merge(staff_df, student_df, how = 'inner', on = ['First Name', 'Last Name'])


Unnamed: 0,First Name,Last Name,Role,School
0,Sally,Brooks,Course liaison,Engineering


#### Concatenating Multiple DataFrames

In [11]:
%%capture 
# %%capture suppresses the Jupyter warning messages; on_bad_lines = 'skip' tells read_csv to skip malformed lines
# (on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines = False instead)
# The source files were saved as "CSV UTF-8 (Comma delimited) (.csv)"

df_2011 = pd.read_csv("MERGED2011_12_PP.csv", on_bad_lines = 'skip')
df_2012 = pd.read_csv("MERGED2012_13_PP.csv", on_bad_lines = 'skip')
df_2013 = pd.read_csv("MERGED2013_14_PP.csv", on_bad_lines = 'skip')


In [12]:
df_2011.iloc[:5, :10]


Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,
2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,
3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,
4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,


In [13]:
print(len(df_2011))
print(len(df_2012))
print(len(df_2013))


7675
7793
7804


Concatenating

In [14]:
frames = [df_2011, df_2012, df_2013]
pd.concat(frames).iloc[:5, :10]


Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,
2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,
3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,
4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,


In [15]:
len(df_2011) + len(df_2012) + len(df_2013)


23272

In [16]:
# To differentiate which data is coming from which year, we use the keys parameter
pd.concat(frames, keys = ['2011', '2012', '2013']).iloc[:5, :10]


Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL
2011,0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,
2011,1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,
2011,2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,
2011,3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,
2011,4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,
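
The `keys` argument turns the result into a frame with a two-level index, which makes it easy to pull one year back out or to flatten the year into a regular column. A small sketch with made-up stand-in frames (the level names `year` and `row` are assumptions for this example):

```python
import pandas as pd

# Tiny stand-ins for two yearly files
df_a = pd.DataFrame({'INSTNM': ['School A', 'School B'], 'STABBR': ['AL', 'AK']})
df_b = pd.DataFrame({'INSTNM': ['School A', 'School C'], 'STABBR': ['AL', 'AZ']})

# keys labels each input frame; names gives the two index levels readable names
combined = pd.concat([df_a, df_b], keys = ['2011', '2012'], names = ['year', 'row'])

only_2012 = combined.loc['2012']                # select one year via the outer index level
flat = combined.reset_index(level = 'year')     # move the year into an ordinary column
print(flat)
```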


## 2. Pandas Idioms

In [17]:
import pandas as pd
import numpy as np
import timeit

df = pd.read_csv('census.txt')
df.iloc[:5, :10]


Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010
0,40,3,6,1,0,Alabama,Alabama,4779736,4780125,4785437
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870


In [18]:
(df.where(df['SUMLEV'] == 50)            # df.where(a boolean mask of True/False)
   .dropna()                             # .where() keeps non-matching rows as NaN, so drop them explicitly
   .set_index(['STNAME', 'CTYNAME'])     # set index as 'STNAME' followed by 'CTYNAME'
   .rename(columns = {'ESTIMATESBASE2010': 'Estimates Base 2010'})).iloc[:5, :10]


STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimates Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012
Alabama,Autauga County,50.0,3.0,6.0,1.0,1.0,54571.0,54597.0,54773.0,55227.0,54954.0
Alabama,Baldwin County,50.0,3.0,6.0,1.0,3.0,182265.0,182265.0,183112.0,186558.0,190145.0
Alabama,Barbour County,50.0,3.0,6.0,1.0,5.0,27457.0,27455.0,27327.0,27341.0,27169.0
Alabama,Bibb County,50.0,3.0,6.0,1.0,7.0,22915.0,22915.0,22870.0,22745.0,22667.0
Alabama,Blount County,50.0,3.0,6.0,1.0,9.0,57322.0,57322.0,57376.0,57560.0,57580.0


#### An alternative, non-pandorable way

In [19]:
df = df[df['SUMLEV'] == 50]
df.set_index(['STNAME', 'CTYNAME'], inplace = True)    # inplace = True means modify the DF, not make a copy
df.rename(columns = {'ESTIMATESBASE2010': 'Estimates Base 2010'}).iloc[:5, :10]


STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimates Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012
Alabama,Autauga County,50,3,6,1,1,54571,54597,54773,55227,54954
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183112,186558,190145
Alabama,Barbour County,50,3,6,1,5,27457,27455,27327,27341,27169
Alabama,Bibb County,50,3,6,1,7,22915,22915,22870,22745,22667
Alabama,Blount County,50,3,6,1,9,57322,57322,57376,57560,57580


#### Comparing the two methods above, in terms of time

In [20]:
# The first approach
def first_approach() :
    global df
    return (df.where(df['SUMLEV'] == 50)            
              .dropna()                             
              .set_index(['STNAME', 'CTYNAME'])     
              .rename(columns = {'ESTIMATESBASE2010': 'Estimates Base 2010'}))

# Read in our dataset anew, to be fresh
df = pd.read_csv('census.txt')

timeit.timeit(first_approach, number = 10)


0.7457618880000005

In [21]:
# The second approach
def second_approach() :
    global df
    new_df = df[df['SUMLEV'] == 50]
    new_df.set_index(['STNAME', 'CTYNAME'], inplace = True)    
    return new_df.rename(columns = {'ESTIMATESBASE2010': 'Estimates Base 2010'})

# Read in our dataset anew, to be fresh
df = pd.read_csv('census.txt')

timeit.timeit(second_approach, number = 10)


0.06557216599999904

The second approach is much faster, about 11 times in this run (0.066 s versus 0.746 s for 10 calls), and arguably more readable.
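
The two approaches are not strictly equivalent, which is also why the chained output above shows 50.0 where the second shows 50. `where()` fills non-matching rows with NaN (forcing integer columns to float), and the following `dropna()` also removes any matching row that already had a missing value. A minimal sketch on a made-up frame (the `POP` and `NOTE` columns are assumptions for illustration):

```python
import pandas as pd

# Toy frame: one state-level row (SUMLEV 40) and two county rows, one with a missing note
toy = pd.DataFrame({'SUMLEV': [40, 50, 50],
                    'POP': [100, 20, 30],
                    'NOTE': ['state', None, 'county']})

via_where = toy.where(toy['SUMLEV'] == 50).dropna()   # mask to NaN, then drop every row with any NaN
via_mask = toy[toy['SUMLEV'] == 50]                   # plain boolean indexing

print(len(via_where), via_where['SUMLEV'].dtype)
print(len(via_mask), via_mask['SUMLEV'].dtype)
```

Boolean indexing keeps both county rows and the integer dtype, while the `where().dropna()` chain keeps only the county row without missing values.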

#### apply( ) function

In [22]:
def min_max(row) :
    data = row[['POPESTIMATE2010',    # 6 different columns
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})


In [23]:
df.apply(min_max, axis = 'columns').iloc[:10, :10]


Unnamed: 0,min,max
0,4785437,4852347
1,54727,55227
2,183112,202939
3,26283,27341
4,22521,22870
5,57376,57619
6,10400,10876
7,20162,20932
8,115469,118408
9,33977,34139


#### To add min, max columns to the original DataFrame:

In [24]:
def min_max(row) :
    data = row[['POPESTIMATE2010',    # 6 different columns
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

df.apply(min_max, axis = 'columns').iloc[:5, :10]


Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010
0,40,3,6,1,0,Alabama,Alabama,4779736,4780125,4785437
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870


#### apply( ) function is typically used with lambdas.

In [25]:
# Calculating the row-wise max across the population estimate columns, using apply()
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 
        'POPESTIMATE2015']

df.apply(lambda x: np.max(x[cols]), axis = 1).head(10)    # lambda x: np.max(x[cols]) -> returns a single value per row



0    4852347
1      55227
2     202939
3      27341
4      22870
5      57619
6      10876
7      20932
8     118408
9      34139
dtype: int64
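
For simple reductions like min and max, a row-wise `apply()` is usually not needed: selecting the columns and calling `.max(axis = 1)` gives the same values without a Python-level call per row. A small sketch on a made-up frame with three of the estimate columns:

```python
import pandas as pd

# Toy frame with three of the population estimate columns
toy = pd.DataFrame({'POPESTIMATE2010': [10, 40],
                    'POPESTIMATE2011': [30, 20],
                    'POPESTIMATE2012': [20, 50]})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

row_max_apply = toy.apply(lambda row: row[cols].max(), axis = 1)   # one function call per row
row_max_vec = toy[cols].max(axis = 1)                              # vectorised equivalent
row_min_vec = toy[cols].min(axis = 1)

print(row_max_vec.tolist(), row_min_vec.tolist())
```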

- Dividing the states into four regions: Northeast, Midwest, South, West

In [26]:
def get_state_region(x) :
    northeast = ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire',
                 'Rhode Island', 'Vermont', 'New York', 'New Jersey', 'Pennsylvania']
    midwest = ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa',
              'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota',
              'South Dakota']
    south = ['Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina',
            'South Carolina', 'Virginia', 'District of Columbia', 'West Virginia',
            'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas',
            'Louisiana', 'Oklahoma', 'Texas']
    west = ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah',
           'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']
    
    if x in northeast :
        return "Northeast"
    elif x in midwest :
        return "Midwest"
    elif x in south :
        return "South"
    else :
        return "West"
    

In [27]:
df['state_region'] = df['STNAME'].apply(get_state_region)    # the function can be passed directly, no lambda needed
df[['STNAME', 'state_region']].head(10)


Unnamed: 0,STNAME,state_region
0,Alabama,South
1,Alabama,South
2,Alabama,South
3,Alabama,South
4,Alabama,South
5,Alabama,South
6,Alabama,South
7,Alabama,South
8,Alabama,South
9,Alabama,South
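
An alternative to the if/elif function is a dictionary lookup with `Series.map`. One design difference: `get_state_region` returns "West" for any name it does not recognise (including misspellings), whereas `map` leaves unknown names as NaN so they can be spotted. The lookup table below is a deliberately partial, hypothetical one:

```python
import pandas as pd

# Hypothetical, deliberately partial lookup table (a real one would list every state)
state_to_region = {'Maine': 'Northeast', 'Ohio': 'Midwest',
                   'Alabama': 'South', 'Utah': 'West'}

states = pd.Series(['Alabama', 'Ohio', 'Maine', 'Utah', 'Puerto Rico'])

# map() looks each value up in the dict; values without an entry become NaN
regions = states.map(state_to_region)
print(regions)
```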


## 3. Group By

### Splitting
df.groupby(level = , by = )

In [28]:
import pandas as pd
import numpy as np


In [29]:
df = pd.read_csv('census.txt')
df = df[df['SUMLEV'] == 50]
df.iloc[:5, :10]


Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010
1,50,3,6,1,1,Alabama,Autauga County,54571,54597,54773
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112
3,50,3,6,1,5,Alabama,Barbour County,27457,27455,27327
4,50,3,6,1,7,Alabama,Bibb County,22915,22915,22870
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57376


In [30]:
%%timeit -n 3

for state in df['STNAME'].unique() :
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
    # print('Counties in state ' + state + ' have an average population of ' + str(avg) + ' in 2010')
                     

3.04 s ± 70.1 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)


In [31]:
%%timeit -n 3

for group, frame in df.groupby('STNAME') :      
    avg = np.average(frame['CENSUS2010POP'])
    # print('Counties in state ' + group + ' have an average population of ' + str(avg))
    
    
# Iterating over the result of .groupby() yields one tuple per group, 
# where the first value is the value of the key we were trying to group by (state)
# and the second value is the projected dataframe that was found for that group


15.1 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
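
When the goal is just one statistic per group, the loop can be dropped entirely by selecting the column on the grouped object and calling an aggregation method. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy stand-in for the county-level census rows
toy = pd.DataFrame({'STNAME': ['Alabama', 'Alabama', 'Alaska'],
                    'CENSUS2010POP': [100, 300, 50]})

# One mean per state, returned as a Series indexed by STNAME
avg_by_state = toy.groupby('STNAME')['CENSUS2010POP'].mean()
print(avg_by_state)
```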


#### Another example:

In [32]:
# groupby() will call the function below on each index label, so set the index to the column you want to group by
df = df.set_index('STNAME')

def set_batch_number(item) :
    if item[0] < 'M' :
        return 0              # group 0
    if item[0] < 'Q' :
        return 1              # group 1
    return 2                  # group 2


for group, frame in df.groupby(by = set_batch_number) :
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing')
    # print(frame)

    
# Here, we didn't pass a column name to groupby(). 
# Instead, we set the index of the dataframe to be STNAME and passed a function as by =, 
# so groupby() calls that function on each index label and groups the rows by the returned value.
# With a MultiIndex, the function receives the whole index tuple for each row.
    

There are 1177 records in group 0 for processing
There are 1134 records in group 1 for processing
There are 831 records in group 2 for processing
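
To see what a grouping function actually receives, the sketch below records every label passed to it. The frame and the `seen` list are assumptions made up for this illustration:

```python
import pandas as pd

# Toy frame indexed by state name
toy = pd.DataFrame({'POP': [1, 2, 3, 4]},
                   index = ['Alabama', 'Montana', 'Nevada', 'Texas'])

seen = []   # collects every label handed to the grouping function

def batch(label) :
    seen.append(label)
    if label[0] < 'M' :
        return 0
    if label[0] < 'Q' :
        return 1
    return 2

# The function is applied to the index labels, not to the rows
sizes = toy.groupby(batch).size()
print(sizes)
```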


#### One more example:

In [67]:
df = pd.read_csv('airbnb_listings.csv')
df.iloc[:5, 1:10]


Unnamed: 0,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview
0,https://www.airbnb.com/rooms/10080,20200000000000.0,2/6/19,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, I can add up to 2 twin s...","Stunning two bedroom, two bathroom apartment. ...",none,
1,https://www.airbnb.com/rooms/11400,20200000000000.0,2/6/19,Central Lovely Rm in Victorian Home,Well-appointed room with a view of the garden ...,"Centrally-located lovely, quiet home on tree-l...",Well-appointed room with a view of the garden ...,none,"Very quiet residential area, yet only 1-1/2 bl..."
2,https://www.airbnb.com/rooms/13188,20200000000000.0,2/6/19,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,"Explore Vancouver in a highly sought after, tr...",Garden level studio suite with garden patio - ...,none,
3,https://www.airbnb.com/rooms/13357,20200000000000.0,2/6/19,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...,none,Amanzing bibrant professional neighbourhood. C...
4,https://www.airbnb.com/rooms/13358,20200000000000.0,2/6/19,Urban Boutique Suite heart of Downtown Vancouver!,,Welcome to the Electra Building on Nelson stre...,Welcome to the Electra Building on Nelson stre...,none,


In [68]:
# print(df['cancellation_policy'])
# print(df['review_scores_value'])

df = df.set_index(['cancellation_policy', 'review_scores_value'])
df.iloc[1:10, 1:5]
    

cancellation_policy,review_scores_value,listing_url,scrape_id,last_scraped,name
strict_14_with_grace_period,9.0,https://www.airbnb.com/rooms/11400,20200000000000.0,2/6/19,Central Lovely Rm in Victorian Home
moderate,10.0,https://www.airbnb.com/rooms/13188,20200000000000.0,2/6/19,Garden level studio in ideal loc.
strict_14_with_grace_period,8.0,https://www.airbnb.com/rooms/13357,20200000000000.0,2/6/19,! Wow! 2bed 2bath 1bed den Harbour View Apartm...
strict_14_with_grace_period,9.0,https://www.airbnb.com/rooms/13358,20200000000000.0,2/6/19,Urban Boutique Suite heart of Downtown Vancouver!
strict_14_with_grace_period,10.0,https://www.airbnb.com/rooms/13490,20200000000000.0,2/6/19,Vancouver's best kept secret
strict_14_with_grace_period,9.0,https://www.airbnb.com/rooms/14267,20200000000000.0,2/6/19,EcoLoft Vancouver
moderate,10.0,https://www.airbnb.com/rooms/14508,20200000000000.0,2/6/19,Yaletown - Sea Wall
strict_14_with_grace_period,9.0,https://www.airbnb.com/rooms/16254,20200000000000.0,2/6/19,Close to PNE/Hastings Park and East Village
strict_14_with_grace_period,7.0,https://www.airbnb.com/rooms/16611,20200000000000.0,2/6/19,Broadway skytrain station


In [69]:
for group, frame in df.groupby(level = (0, 1)) :        # level 0 = cancellation_policy, level 1 = review_scores_value
    print(group, str(len(frame)))
    

('flexible', 2.0) 1
('flexible', 4.0) 3
('flexible', 5.0) 1
('flexible', 6.0) 4
('flexible', 7.0) 5
('flexible', 8.0) 48
('flexible', 9.0) 200
('flexible', 10.0) 332
('moderate', 2.0) 1
('moderate', 4.0) 2
('moderate', 5.0) 1
('moderate', 6.0) 9
('moderate', 7.0) 6
('moderate', 8.0) 40
('moderate', 9.0) 422
('moderate', 10.0) 761
('strict_14_with_grace_period', 2.0) 2
('strict_14_with_grace_period', 4.0) 3
('strict_14_with_grace_period', 5.0) 2
('strict_14_with_grace_period', 6.0) 8
('strict_14_with_grace_period', 7.0) 28
('strict_14_with_grace_period', 8.0) 160
('strict_14_with_grace_period', 9.0) 1000
('strict_14_with_grace_period', 10.0) 1089
('super_strict_30', 9.0) 1


In [70]:
def grouping_fun(item) :
    
    if item[1] == 10.0 :
        return (item[0], "10.0")                    # this becomes a group
    else :
        return (item[0], "not 10.0")                # this becomes another group
    
for group, frame in df.groupby(grouping_fun) :      # the function is passed as by = and receives each index tuple
    print(group, str(len(frame)))                   # item[0] = cancellation_policy, item[1] = review_scores_value


('flexible', '10.0') 332
('flexible', 'not 10.0') 558
('moderate', '10.0') 761
('moderate', 'not 10.0') 595
('strict_14_with_grace_period', '10.0') 1089
('strict_14_with_grace_period', 'not 10.0') 1501
('super_strict_30', 'not 10.0') 1
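
The same split can be sketched without setting a MultiIndex, by grouping on a column name together with a derived Series. The toy data and the name `score_bucket` are assumptions for this example:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the listings data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'moderate', 'moderate'],
                    'review_scores_value': [10.0, 8.0, 10.0, np.nan]})

# Derive the bucket label per row; NaN == 10.0 is False, so missing scores land in 'not 10.0'
bucket = (toy['review_scores_value'] == 10.0).map({True: '10.0', False: 'not 10.0'})
bucket = bucket.rename('score_bucket')

# groupby accepts a list mixing column names and Series aligned to the frame's index
counts = toy.groupby(['cancellation_policy', bucket]).size()
print(counts)
```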


### Aggregation
.agg({'column1': (np.function1, np.function2), 'column2': np.function3})

In [71]:
df = df.reset_index()

# np.average does not skip NaN, so any group containing a missing score averages to NaN
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})


cancellation_policy,review_scores_value
flexible,
moderate,
strict_14_with_grace_period,
super_strict_30,9.0


In [72]:
df.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})


cancellation_policy,review_scores_value
flexible,9.397306
moderate,9.532206
strict_14_with_grace_period,9.354276
super_strict_30,9.0


In [73]:
# applying several functions to one column (as a tuple) and aggregating several columns via the dictionary
df.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd),
                                       'reviews_per_month': np.nanmean})


cancellation_policy,review_scores_value (nanmean),review_scores_value (nanstd),reviews_per_month (nanmean)
flexible,9.397306,0.901784,1.949142
moderate,9.532206,0.734338,2.442953
strict_14_with_grace_period,9.354276,0.767728,2.057909
super_strict_30,9.0,,0.47
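
An alternative spelling is named aggregation, where each output column gets an explicit name and the functions are given as strings. The built-in `'mean'` and `'std'` skip NaN by default; note that `'std'` uses the sample standard deviation (ddof = 1), so its values will not match `np.nanstd` (ddof = 0) exactly. A sketch on made-up data, with output column names chosen for this example:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the listings data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'moderate', 'moderate'],
                    'review_scores_value': [10.0, 8.0, 9.0, np.nan],
                    'reviews_per_month': [1.0, 3.0, 2.0, 4.0]})

# new_column_name = (source_column, aggregation)
summary = toy.groupby('cancellation_policy').agg(
    mean_score = ('review_scores_value', 'mean'),
    std_score = ('review_scores_value', 'std'),
    mean_reviews = ('reviews_per_month', 'mean'),
)
print(summary)
```

The result has flat column names, which avoids the two-level column header produced by the dictionary form.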


### Transformation
.transform(np.function1)

- returns an object with the same index as the original DataFrame: every row receives the result computed for its group

In [85]:
cols = ['cancellation_policy', 'review_scores_value']

transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.head(10)    # preserves the original index order from the pre-grouping state


Unnamed: 0,review_scores_value
0,9.354276
1,9.354276
2,9.532206
3,9.354276
4,9.354276
5,9.354276
6,9.354276
7,9.532206
8,9.354276
9,9.354276


.rename({'original_column': 'new_name'})

In [86]:
# renaming the column in the transformed version
transform_df.rename({'review_scores_value': 'mean_review_socres'}, axis = 'columns', inplace = True)
transform_df.head(10)


Unnamed: 0,mean_review_socres
0,9.354276
1,9.354276
2,9.532206
3,9.354276
4,9.354276
5,9.354276
6,9.354276
7,9.532206
8,9.354276
9,9.354276


.merge( )

In [87]:
# df's index and transform_df's index are the same. So we merge on them
df = df.merge(transform_df, left_index = True, right_index = True)
df.iloc[:10]


Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,instant_bookable,is_business_travel_ready,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,mean_review_socres
0,strict_14_with_grace_period,9.0,10080,https://www.airbnb.com/rooms/10080,20200000000000.0,2/6/19,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, I can add up to 2 twin s...","Stunning two bedroom, two bathroom apartment. ...",...,f,f,f,f,31,31,0,0,0.18,9.354276
1,strict_14_with_grace_period,9.0,11400,https://www.airbnb.com/rooms/11400,20200000000000.0,2/6/19,Central Lovely Rm in Victorian Home,Well-appointed room with a view of the garden ...,"Centrally-located lovely, quiet home on tree-l...",Well-appointed room with a view of the garden ...,...,f,f,t,t,1,0,1,0,0.64,9.354276
2,moderate,10.0,13188,https://www.airbnb.com/rooms/13188,20200000000000.0,2/6/19,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,"Explore Vancouver in a highly sought after, tr...",Garden level studio suite with garden patio - ...,...,t,f,f,f,1,1,0,0,1.51,9.532206
3,strict_14_with_grace_period,8.0,13357,https://www.airbnb.com/rooms/13357,20200000000000.0,2/6/19,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...,...,f,f,t,t,2,2,0,0,0.51,9.354276
4,strict_14_with_grace_period,9.0,13358,https://www.airbnb.com/rooms/13358,20200000000000.0,2/6/19,Urban Boutique Suite heart of Downtown Vancouver!,,Welcome to the Electra Building on Nelson stre...,Welcome to the Electra Building on Nelson stre...,...,f,f,f,t,1,1,0,0,3.65,9.354276
5,strict_14_with_grace_period,10.0,13490,https://www.airbnb.com/rooms/13490,20200000000000.0,2/6/19,Vancouver's best kept secret,This apartment rents for one month blocks of t...,"Vancouver city central, 700 sq.ft., main floor...",This apartment rents for one month blocks of t...,...,f,f,f,f,1,1,0,0,0.88,9.354276
6,strict_14_with_grace_period,9.0,14267,https://www.airbnb.com/rooms/14267,20200000000000.0,2/6/19,EcoLoft Vancouver,"The Ecoloft is located in the lovely, family r...",West Coast Modern Laneway House Loft: We call ...,"The Ecoloft is located in the lovely, family r...",...,t,f,f,f,1,1,0,0,0.31,9.354276
7,moderate,10.0,14508,https://www.airbnb.com/rooms/14508,20200000000000.0,2/6/19,Yaletown - Sea Wall,Long term bookings available for 6 months from...,"Long term, furnished rental from May 1st Spaci...",Long term bookings available for 6 months from...,...,f,f,f,f,1,1,0,0,0.28,9.532206
8,strict_14_with_grace_period,9.0,16254,https://www.airbnb.com/rooms/16254,20200000000000.0,2/6/19,Close to PNE/Hastings Park and East Village,,"Location, Quality, Cleanliness, and Amenities....","Location, Quality, Cleanliness, and Amenities....",...,t,f,f,f,1,0,1,0,0.48,9.354276
9,strict_14_with_grace_period,7.0,16611,https://www.airbnb.com/rooms/16611,20200000000000.0,2/6/19,Broadway skytrain station,"My place is close to downtown, Donald's Market...",1 to 3 bedrooms Balcony 5-minute ride to downt...,"My place is close to downtown, Donald's Market...",...,f,f,f,f,9,4,4,1,0.22,9.354276


np.absolute( )

In [90]:
# So now we could create, for instance, the difference between a given row and
# its group (cancellation policy) means.
df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_socres'])
df['mean_diff'].head()


0    0.354276
1    0.354276
2    0.467794
3    1.354276
4    0.354276
Name: mean_diff, dtype: float64
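
Because `transform()` returns a result aligned with the original index, it can also be assigned straight to a new column, skipping the rename and merge steps. A minimal sketch on made-up data:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the listings data
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'moderate', 'flexible', 'moderate'],
                    'review_scores_value': [10.0, 9.0, 8.0, np.nan]})

# Select the column first so transform() returns a Series; 'mean' skips NaN
group_mean = toy.groupby('cancellation_policy')['review_scores_value'].transform('mean')

toy['mean_review_scores'] = group_mean
toy['mean_diff'] = (toy['review_scores_value'] - group_mean).abs()
print(toy)
```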

### Filtering
.filter(function1)

- applies function1 to each group's dataframe; function1 must return True/False, and only the rows of groups that return True are kept

In [107]:
# for each group x:       np.nanmean( x['review_scores_value'] )      >      9.2        -> True/False
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2).iloc[:5, :10]

# Notice that the original row index is preserved, and that rows belonging to a group with a mean
# review score of less than or equal to 9.2 are dropped.

# Note that GroupBy.filter() selects whole groups based on their contents. 
# It is different from DataFrame.filter(), which selects rows or columns by their labels.


Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description
0,strict_14_with_grace_period,9.0,10080,https://www.airbnb.com/rooms/10080,20200000000000.0,2/6/19,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, I can add up to 2 twin s...","Stunning two bedroom, two bathroom apartment. ..."
1,strict_14_with_grace_period,9.0,11400,https://www.airbnb.com/rooms/11400,20200000000000.0,2/6/19,Central Lovely Rm in Victorian Home,Well-appointed room with a view of the garden ...,"Centrally-located lovely, quiet home on tree-l...",Well-appointed room with a view of the garden ...
2,moderate,10.0,13188,https://www.airbnb.com/rooms/13188,20200000000000.0,2/6/19,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,"Explore Vancouver in a highly sought after, tr...",Garden level studio suite with garden patio - ...
3,strict_14_with_grace_period,8.0,13357,https://www.airbnb.com/rooms/13357,20200000000000.0,2/6/19,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...
4,strict_14_with_grace_period,9.0,13358,https://www.airbnb.com/rooms/13358,20200000000000.0,2/6/19,Urban Boutique Suite heart of Downtown Vancouver!,,Welcome to the Electra Building on Nelson stre...,Welcome to the Electra Building on Nelson stre...
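
A small made-up example makes the group-level behaviour easier to see: a whole group is either kept or dropped, regardless of the individual row values.

```python
import pandas as pd

# Toy data: the 'flexible' group averages 9.5, the 'strict' group averages 8.5
toy = pd.DataFrame({'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
                    'review_scores_value': [10.0, 9.0, 8.0, 9.0]})

# Keep only the rows of groups whose mean score exceeds 9.2
kept = toy.groupby('cancellation_policy').filter(
    lambda g: g['review_scores_value'].mean() > 9.2)
print(kept)
```

The 'strict' row with a score of 9.0 is dropped along with its group, while the 'flexible' row with the same score stays.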


### Applying
apply()

- applies an arbitrary function to each group and stitches the per-group results back together into a single dataframe, preserving the index

In [114]:
df = pd.read_csv('airbnb_listings.csv')
df = df[['cancellation_policy', 'review_scores_value']]

df.head()


Unnamed: 0,cancellation_policy,review_scores_value
0,strict_14_with_grace_period,9.0
1,strict_14_with_grace_period,9.0
2,moderate,10.0
3,strict_14_with_grace_period,8.0
4,strict_14_with_grace_period,9.0


- Wrapping the transform() and merge() processes in one step, with apply()

In [116]:
def calc_mean_review_scores(group) :   # group is a dataframe holding the rows for one 'cancellation_policy' value
    avg = np.nanmean(group["review_scores_value"])
    # despite its name, this column stores the absolute difference from the group mean
    group["review_scores_mean"] = np.abs(avg - group["review_scores_value"])
    
    return group

df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()
    

Unnamed: 0,cancellation_policy,review_scores_value,review_scores_mean
0,strict_14_with_grace_period,9.0,0.354276
1,strict_14_with_grace_period,9.0,0.354276
2,moderate,10.0,0.467794
3,strict_14_with_grace_period,8.0,1.354276
4,strict_14_with_grace_period,9.0,0.354276


- Using apply() can be slower than using the specialized functions, especially agg().
- But if your dataframes are not huge, it's a solid general-purpose approach.