## More Data Processing with Pandas

### Group by in Pandas

Sometimes we want to select data by group and understand it at the group level. Pandas provides the `groupby()` function for such tasks. The idea behind `groupby()` is that it takes a DataFrame, splits it into chunks based on some key values, applies a computation to those chunks, and then combines the results into another DataFrame. This is referred to as the **split-apply-combine** pattern.
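
As a rough illustration of the three steps, the sketch below runs split, apply, and combine by hand on a tiny made-up DataFrame and compares the result with a single `groupby()` call. The `toy` frame and its column names are assumptions for illustration only; they are not part of the census data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the census file
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B', 'B'],
    'population': [10, 30, 5, 15, 10],
})

# Split: one sub-DataFrame per key value
chunks = {key: toy[toy['state'] == key] for key in toy['state'].unique()}

# Apply: compute a summary on each chunk
means = {key: chunk['population'].mean() for key, chunk in chunks.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(means, name='population').rename_axis('state')

# The same three steps in a single call
grouped = toy.groupby('state')['population'].mean()

print(manual)
print(grouped)
```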

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Let's use the US census data (county level 'SUMLEV'==50)
df = pd.read_csv('../resources/week-3/datasets/census.csv')
df = df[df['SUMLEV']==50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


### Splitting step

**First example**: Perform a manual grouping by states `'STNAME'` to find the average population in 2010: `'CENSUS2010POP'`.

We will use the cell magic `%%timeit` with `-n 3`, so that each timing run executes the cell 3 times.

In [8]:
%%timeit -n 3 # timeit 3 times

for state in df['STNAME'].unique():

    avg = np.average(df.where(df['STNAME']==state).dropna()['CENSUS2010POP']) # Avg population per unique state

    print('State: ', state, 'Avg Population: ', avg) # It will take a while to finish

State:  Alabama Avg Population:  71339.34328358209
State:  Alaska Avg Population:  24490.724137931036
State:  Arizona Avg Population:  426134.4666666667
State:  Arkansas Avg Population:  38878.90666666667
State:  California Avg Population:  642309.5862068966
State:  Colorado Avg Population:  78581.1875
State:  Connecticut Avg Population:  446762.125
State:  Delaware Avg Population:  299311.3333333333
State:  District of Columbia Avg Population:  601723.0
State:  Florida Avg Population:  280616.5671641791
State:  Georgia Avg Population:  60928.63522012578
State:  Hawaii Avg Population:  272060.2
State:  Idaho Avg Population:  35626.86363636364
State:  Illinois Avg Population:  125790.50980392157
State:  Indiana Avg Population:  70476.10869565218
State:  Iowa Avg Population:  30771.262626262625
State:  Kansas Avg Population:  27172.55238095238
State:  Kentucky Avg Population:  36161.39166666667
State:  Louisiana Avg Population:  70833.9375
State:  Maine Avg Population:  83022.5625
State:

timeit results show that this process took "3.26 s ± 519 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)"

**Second example**: We will try `groupby()`. Iterating over the object returned by `groupby()` yields tuples, where the first value is the value of the *key* we are grouping by (the state name), and the second value is the *projected dataframe* for that group.

In [9]:
%%timeit -n 3

for group, frame in df.groupby('STNAME'): # returns (key, projected df)

    # Now, we include the "apply" step: compute the average population
    avg = np.average(frame['CENSUS2010POP'])

    print('State: ', group, 'Avg Population: ', avg) # Much faster than the manual loop

State:  Alabama Avg Population:  71339.34328358209
State:  Alaska Avg Population:  24490.724137931036
State:  Arizona Avg Population:  426134.4666666667
State:  Arkansas Avg Population:  38878.90666666667
State:  California Avg Population:  642309.5862068966
State:  Colorado Avg Population:  78581.1875
State:  Connecticut Avg Population:  446762.125
State:  Delaware Avg Population:  299311.3333333333
State:  District of Columbia Avg Population:  601723.0
State:  Florida Avg Population:  280616.5671641791
State:  Georgia Avg Population:  60928.63522012578
State:  Hawaii Avg Population:  272060.2
State:  Idaho Avg Population:  35626.86363636364
State:  Illinois Avg Population:  125790.50980392157
State:  Indiana Avg Population:  70476.10869565218
State:  Iowa Avg Population:  30771.262626262625
State:  Kansas Avg Population:  27172.55238095238
State:  Kentucky Avg Population:  36161.39166666667
State:  Louisiana Avg Population:  70833.9375
State:  Maine Avg Population:  83022.5625
State:

This new approach took "23.1 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)" according to timeit. An improvement of roughly two orders of magnitude!
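
The explicit loop is only needed for printing. As a sketch, assuming a small hypothetical frame that reuses the two census column names (the values are made up), the apply and combine steps can also be left to pandas by calling an aggregation directly on the grouped column:

```python
import pandas as pd

# Hypothetical stand-in for the census frame (values are made up)
census = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'CENSUS2010POP': [100, 300, 50, 150],
})

# One call: split by state, average each group, combine into a Series
avg_by_state = census.groupby('STNAME')['CENSUS2010POP'].mean()
print(avg_by_state)
```

The result is a Series indexed by state name, which is usually easier to reuse than printed lines.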

**Using functions with `groupby()`**:

You can also provide a function to `groupby()` and use it to segment your data.

For example, let's say that you want to work on only a third or so of the states. We can create a function which returns a number between zero and two based on the first character of the state name. Then, we call `groupby()` to use this function to split up the DataFrame.

**Important**: You need to set the index of the DataFrame to be the column you want to group by first in this case.

In [10]:
# Batch separator function: first letter of state before 'M' -> 0, before 'Q' -> 1, otherwise 2.
def set_batch_number(item):
     
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

In [11]:
# Set STNAME as index
df = df.set_index('STNAME')

# Group the df according to the batch number function
for group, frame in df.groupby(set_batch_number):

    print('There are ', len(frame), ' records in group ', group) # Did not pass the column name -> It uses the index

There are  1177  records in group  0
There are  1134  records in group  1
There are  831  records in group  2
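
Setting the index is one option. As an alternative sketch, `groupby()` also accepts a Series of group labels aligned with the rows, so the function can be mapped over the column instead. The `counties` frame below is made up, and `set_batch_number` is repeated only so the block runs on its own.

```python
import pandas as pd

def set_batch_number(item):
    # Same rule as above: first letter before 'M' -> 0, before 'Q' -> 1, else 2
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

# Hypothetical county rows (values are made up)
counties = pd.DataFrame({
    'STNAME': ['Alabama', 'Maine', 'Ohio', 'Texas', 'Texas'],
    'CENSUS2010POP': [1, 2, 3, 4, 5],
})

# Build one label per row, then group by that Series; no set_index() needed
labels = counties['STNAME'].map(set_batch_number)
sizes = counties.groupby(labels).size()
print(sizes)
```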


**Multi-index `groupby()`**

In [12]:
# Using the housing data from airbnb: We are interested in 'cancellation_policy' and 'review_scores_value'
df = pd.read_csv('../resources/week-3/datasets/listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


Let's promote these two columns to a multiindex and call the `groupby()` function.

When using multi-index, we need to pass in the levels we are interested in grouping.

In [13]:
df = df.set_index(['cancellation_policy', 'review_scores_value'])

In [14]:
for group, frame in df.groupby(level=(0, 1)):
    print(group)

('flexible', 2.0)
('flexible', 4.0)
('flexible', 5.0)
('flexible', 6.0)
('flexible', 7.0)
('flexible', 8.0)
('flexible', 9.0)
('flexible', 10.0)
('moderate', 2.0)
('moderate', 4.0)
('moderate', 6.0)
('moderate', 7.0)
('moderate', 8.0)
('moderate', 9.0)
('moderate', 10.0)
('strict', 2.0)
('strict', 3.0)
('strict', 4.0)
('strict', 5.0)
('strict', 6.0)
('strict', 7.0)
('strict', 8.0)
('strict', 9.0)
('strict', 10.0)
('super_strict_30', 6.0)
('super_strict_30', 7.0)
('super_strict_30', 8.0)
('super_strict_30', 9.0)
('super_strict_30', 10.0)
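
Index levels can also be referenced by name rather than by position, which tends to read better once the index has more than a couple of levels. A minimal sketch on made-up data (the `listings` frame and its `price` column are hypothetical):

```python
import pandas as pd

# Hypothetical listings (values are made up)
listings = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [9.0, 10.0, 10.0, 10.0],
    'price': [50, 80, 120, 100],
}).set_index(['cancellation_policy', 'review_scores_value'])

# Group by level names instead of level positions
grouped = listings.groupby(level=['cancellation_policy', 'review_scores_value'])
group_keys = [key for key, _ in grouped]
print(group_keys)
```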


Now, let's group by the cancellation policy and review scores, but separate all the 10's from those under 10. In this case, we need a function to manage the splitting.

In [15]:
# Splitting function
def grouping_func(item):
    
    # item = (cancellation_policy, review_scores_value)
    if item[1] == 10.0:
        return (item[0], '10.0')
    else:
        return (item[0], 'not 10.0')

In [16]:
# Now, let's use the function for the groupby
for group, frame in df.groupby(by=grouping_func):
    print(group)

('flexible', '10.0')
('flexible', 'not 10.0')
('moderate', '10.0')
('moderate', 'not 10.0')
('strict', '10.0')
('strict', 'not 10.0')
('super_strict_30', '10.0')
('super_strict_30', 'not 10.0')
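
One detail worth checking is how listings without a review score are handled. Their second index level is `NaN`, and since `NaN == 10.0` is `False`, a function like `grouping_func` places them in the `'not 10.0'` bucket, whereas grouping by level drops `NaN` keys by default (`dropna=True`). The sketch below shows this on a made-up frame; `grouping_func` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def grouping_func(item):
    # item = (cancellation_policy, review_scores_value)
    if item[1] == 10.0:
        return (item[0], '10.0')
    return (item[0], 'not 10.0')

# Hypothetical listings; the last one has no review score
listings = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'flexible'],
    'review_scores_value': [10.0, 9.0, np.nan],
    'price': [50, 80, 120],
}).set_index(['cancellation_policy', 'review_scores_value'])

# Group sizes when splitting with the function: the NaN row is kept
func_sizes = {key: len(frame) for key, frame in listings.groupby(by=grouping_func)}

# Group sizes when splitting by index level: the NaN row is dropped
level_sizes = {key: len(frame) for key, frame in listings.groupby(level=[0, 1])}

print(func_sizes)
print(level_sizes)
```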


In [17]:
df.head()

cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_communication,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,,,f,f,f,1,
moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,10.0,9.0,f,,,t,f,f,1,1.3
moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,9.0,f,,,f,t,f,1,0.47
moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,10.0,f,,,f,f,f,1,1.0
flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,9.0,f,,,f,f,f,1,2.25


### Applying step

There are three broad categories of data processing in this step: *aggregation*, *transformation* and *filtration* of group data.

**Aggregation**:

Aggregation uses the `agg()` method on the groupby object. With `agg()`, we can pass in a dictionary of the columns we are interested in aggregating along with the function we want to apply to each.

In [18]:
# First, let's reset the index
df = df.reset_index()

In [21]:
# Now, let's group by 'cancellation_policy' and find the average 'review_scores_value' by group
df.groupby('cancellation_policy').agg({'review_scores_value': np.average})

cancellation_policy,review_scores_value
flexible,NaN
moderate,NaN
strict,NaN
super_strict_30,NaN


The `NaN` results appear because `np.average` does not ignore `NaN` values: a single missing score makes the whole group's average `NaN`. We need to use the `np.nanmean` function instead.

In [22]:
df.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})

cancellation_policy,review_scores_value
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


We can extend this dictionary to aggregate by multiple functions/columns:

In [23]:
df.groupby('cancellation_policy').agg({'review_scores_value': (np.nanmean, np.nanstd), # Use tuple to include multiple funcs
                                       'reviews_per_month': np.nanmean})

cancellation_policy,review_scores_value (nanmean),review_scores_value (nanstd),reviews_per_month (nanmean)
flexible,9.237421,1.096271,1.82921
moderate,9.307398,0.859859,2.391922
strict,9.081441,1.040531,1.873467
super_strict_30,8.537313,0.840785,0.340143


Basically, the `agg()` function is used with the `groupby` object to apply one or more functions we specify to the group dfs and return a single row per group. We passed in two dictionary entries whose keys name the columns to aggregate, and we supplied a tuple to apply multiple functions to the same column.
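
An alternative spelling, sketched here on made-up data, is named aggregation: each keyword argument becomes an output column and is given as a `(column, function)` pair, which yields flat column names. The output names `mean_score`, `std_score` and `mean_reviews` are hypothetical. The strings `'mean'` and `'std'` refer to pandas' built-in aggregations, which skip `NaN` by default; note that `'std'` is the sample standard deviation (`ddof=1`), so it will not match `np.nanstd`, which defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical listings with some missing values
listings = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [9.0, np.nan, 8.0, 10.0],
    'reviews_per_month': [1.0, 2.0, 3.0, np.nan],
})

# Named aggregation: output_name=(input_column, function)
summary = listings.groupby('cancellation_policy').agg(
    mean_score=('review_scores_value', 'mean'),
    std_score=('review_scores_value', 'std'),
    mean_reviews=('reviews_per_month', 'mean'),
)
print(summary)
```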

**Transformation**:

The `transform()` function returns an object that is the same size as the group. Essentially, it broadcasts the function you supply over the grouped DataFrame, returning a new DataFrame. This makes *combining* data easier.

For example, let's include the average rating values in a given group by cancellation policy, but preserve the df shape so that we could generate a difference between an individual observation and the average.

In [24]:
# Let's define the columns we are interested in:
cols = ['cancellation_policy', 'review_scores_value']

In [25]:
# Let's transform and save in a new df
transformed_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transformed_df.head() # The index is the same as in the original df

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421


In [27]:
# Rename the column in the transformed version
transformed_df.rename({'review_scores_value': 'mean_review_scores'}, axis='columns', inplace=True)

In [28]:
# We can use the index to join this transformed df
df = df.merge(transformed_df, left_index=True, right_index=True)
df.head()

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,review_scores_location,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,,f,,,f,f,f,1,,9.307398
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,9.0,f,,,t,f,f,1,1.3,9.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,9.0,f,,,f,t,f,1,0.47,9.307398
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,10.0,f,,,f,f,f,1,1.0,9.307398
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,9.0,f,,,f,f,f,1,2.25,9.237421


We can see that the new `'mean_review_scores'` column is in place. So, now, we can do operations using this new column.

In [29]:
# Let's check the difference in the review values for a given row and its group means
df['mean_diff'] = np.absolute(df['review_scores_value'] - df['mean_review_scores']) # Vectorized
df.head()

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.3,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.0,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579
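
Because `transform()` returns a result aligned with the original rows, the rename and merge steps can be skipped by assigning the result directly to a new column. A minimal sketch on made-up data, using the built-in `'mean'` aggregation (which skips `NaN` by default) in place of `np.nanmean`:

```python
import numpy as np
import pandas as pd

# Hypothetical listings (values are made up)
listings = pd.DataFrame({
    'cancellation_policy': ['moderate', 'moderate', 'flexible', 'flexible'],
    'review_scores_value': [9.0, 10.0, np.nan, 8.0],
})

# transform() returns a Series aligned with the original rows,
# so it can be assigned as a new column without a merge
listings['mean_review_scores'] = (
    listings.groupby('cancellation_policy')['review_scores_value'].transform('mean')
)

# Absolute difference between each row and its group mean
listings['mean_diff'] = (
    listings['review_scores_value'] - listings['mean_review_scores']
).abs()
print(listings)
```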


**Filtering**:

It's common to want to group by some feature, transform the groups, and then drop certain groups as part of your cleaning routines. The `filter()` function takes in a function which it applies to each group dataframe; that function returns either `True` or `False`, and only the rows of groups for which it returns `True` are kept.

In [30]:
# Let's filter our groups by those with a mean rating above 9.2
df.groupby('cancellation_policy').filter(lambda x: np.nanmean(x['review_scores_value']) > 9.2)

Unnamed: 0,cancellation_policy,review_scores_value,id,listing_url,scrape_id,last_scraped,name,summary,space,description,...,requires_license,license,jurisdiction_names,instant_bookable,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,mean_review_scores,mean_diff
0,moderate,,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",...,f,,,f,f,f,1,,9.307398,
1,moderate,9.0,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,...,f,,,t,f,f,1,1.30,9.307398,0.307398
2,moderate,10.0,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",...,f,,,f,t,f,1,0.47,9.307398,0.692602
3,moderate,10.0,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,...,f,,,f,f,f,1,1.00,9.307398,0.692602
4,flexible,10.0,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",...,f,,,f,f,f,1,2.25,9.237421,0.762579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3576,flexible,,14689681,https://www.airbnb.com/rooms/14689681,20160906204935,2016-09-07,Beautiful loft style bedroom with large bathroom,You'd be living on the top floor of a four sto...,,You'd be living on the top floor of a four sto...,...,f,,,f,f,f,1,,9.237421,
3577,flexible,,13750763,https://www.airbnb.com/rooms/13750763,20160906204935,2016-09-07,Comfortable Space in the Heart of Brookline,"Our place is close to Coolidge Corner, Allston...",This space consists of 2 Rooms and a private b...,"Our place is close to Coolidge Corner, Allston...",...,f,,,f,f,f,1,,9.237421,
3579,flexible,,14852179,https://www.airbnb.com/rooms/14852179,20160906204935,2016-09-07,Spacious Queen Bed Room Close to Boston Univer...,- Grocery: A full-size Star market is 2 minute...,,- Grocery: A full-size Star market is 2 minute...,...,f,,,f,f,f,1,,9.237421,
3582,flexible,,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,...,f,,,f,f,f,1,,9.237421,
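
To make the keep-or-drop behavior easier to see, here is a small sketch on made-up data with one group above the threshold and one below it. Every row of a passing group is kept, including rows whose own score is below the threshold, and every row of a failing group is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: 'moderate' averages 9.5, 'strict' averages 8.0
listings = pd.DataFrame({
    'cancellation_policy': ['moderate', 'moderate', 'strict', 'strict'],
    'review_scores_value': [9.0, 10.0, 8.0, np.nan],
})

# Keep only the groups whose mean score (ignoring NaN) exceeds 9.2
kept = listings.groupby('cancellation_policy').filter(
    lambda g: g['review_scores_value'].mean() > 9.2
)
print(kept)
```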


**Applying**:

The most common operation is the `apply()` function. It allows you to apply an arbitrary function to each group and stitches the results of each call back into a single df where the index is preserved.

In [31]:
# A clean df file for the housings data
df = pd.read_csv('../resources/week-3/datasets/listings.csv')
df = df[['cancellation_policy', 'review_scores_value']]
df.head()

Unnamed: 0,cancellation_policy,review_scores_value
0,moderate,
1,moderate,9.0
2,moderate,10.0
3,moderate,10.0
4,flexible,10.0


Using `apply()`, we can find the average review score of each group and every listing's deviation from that group mean in one step.

In [32]:
# Create a custom function to find the average score and calculate the difference (avg - group mean score)
def calc_mean_review_scores(group):

    # group: cancellation policy
    # Calculate the average of each group
    avg = np.nanmean(group['review_scores_value'])
    
    # Broadcast the formula and create a new column
    group['review_scores_mean'] = np.abs(avg - group['review_scores_value'])

    return group

In [33]:
# Now, you just apply the function to the groups
df.groupby('cancellation_policy', group_keys=False).apply(calc_mean_review_scores).head() # group_keys=False keeps the original index



Unnamed: 0,cancellation_policy,review_scores_value,review_scores_mean
0,moderate,,
1,moderate,9.0,0.307398
2,moderate,10.0,0.692602
3,moderate,10.0,0.692602
4,flexible,10.0,0.762579
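
As a variation, the sketch below avoids modifying the group in place by returning a new frame from `assign()`, selects the score column before calling `apply()` so the function only sees the data it needs, and passes `group_keys=False` so the result keeps the original row index. The frame, the function name `add_deviation` and the column name `deviation` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical listings (values are made up)
listings = pd.DataFrame({
    'cancellation_policy': ['moderate', 'moderate', 'flexible', 'flexible'],
    'review_scores_value': [9.0, 10.0, np.nan, 8.0],
})

def add_deviation(group):
    # Return a new frame instead of modifying the group in place
    avg = group['review_scores_value'].mean()  # skips NaN by default
    return group.assign(deviation=(group['review_scores_value'] - avg).abs())

result = (
    listings
    .groupby('cancellation_policy', group_keys=False)[['review_scores_value']]
    .apply(add_deviation)
    .sort_index()
)
print(result)
```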
