In pandas, `groupby` can be combined with one or more aggregation functions to summarize data quickly.

## Aggregating


- An aggregation function takes multiple individual values and returns a summary.
- In most cases, this summary is a single value.
- The most common aggregation functions are a simple average or a sum of values.
- As of pandas 0.20, you may call an aggregation function on one or more columns of a DataFrame.

In [1]:
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")

df["fare"].agg(["sum", "mean"])

sum     28693.949300
mean       32.204208
Name: fare, dtype: float64
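
The cell above aggregates a single column. As a minimal sketch of the multi-column case, the snippet below applies the same list of aggregations to two columns at once. The `toy` DataFrame and its values are made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"fare": [10.0, 20.0, 30.0], "age": [20.0, 30.0, 40.0]})

# Apply the same list of aggregations to two columns at once;
# the result has one row per aggregation and one column per input column
summary = toy[["fare", "age"]].agg(["sum", "mean"])
print(summary)
```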

__NOTE:__
- There are multiple ways to call an aggregation function.
- As shown above, you may pass a list of functions to apply to one or more columns of data.

#### Pandas Aggregation Options

In [2]:
# Use a list - All aggregations in list will be applied to column
df["fare"].agg(["sum", "mean"])

sum     28693.949300
mean       32.204208
Name: fare, dtype: float64

In [3]:
# Use a dictionary
# Keys are column names, values are lists of aggregations
# Every aggregation in a list is applied to that key's column
df.agg({"fare": ["sum", "mean"], "sex": ["count"]})

|       | fare       | sex   |
|-------|------------|-------|
| count | NaN        | 891.0 |
| mean  | 32.204208  | NaN   |
| sum   | 28693.9493 | NaN   |


In [4]:
# Use a named aggregation tuple

# Pass a (column name, aggregation) tuple for each result
# Only one aggregation can be passed per tuple
# The keyword name becomes the label of that result

df.agg(fare_sum=("fare", "sum"), fare_mean=("fare", "mean"), sex_count=("sex", "count"))

|           | fare       | sex   |
|-----------|------------|-------|
| fare_sum  | 28693.9493 | NaN   |
| fare_mean | 32.204208  | NaN   |
| sex_count | NaN        | 891.0 |


## Groupby
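
The cells in this section all follow the same pattern: split the rows into groups, then aggregate a column within each group. A minimal sketch of that pattern, using a made-up `toy` DataFrame (the names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"town": ["A", "A", "B"], "fare": [10.0, 30.0, 5.0]}
)

# Split the rows by town, then aggregate fare within each group;
# the group keys become the index of the result
by_town = toy.groupby("town")["fare"].agg(["sum", "mean"])
print(by_town)
```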


#### Basic math

In [5]:
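# Note: "mad" (mean absolute deviation) was deprecated in pandas 1.5 and
# removed in pandas 2.0; drop it from the list on recent pandas versions.
agg_func_math = {
    "fare": ["sum", "mean", "median", "min", "max", "std", "var", "mad", "prod"]
}
df.groupby(["embark_town"]).agg(agg_func_math).round(2)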

| embark_town | fare (sum) | fare (mean) | fare (median) | fare (min) | fare (max) | fare (std) | fare (var) | fare (mad) | fare (prod)   |
|-------------|------------|-------------|---------------|------------|------------|------------|------------|------------|---------------|
| Cherbourg   | 10072.3    | 59.95       | 29.7          | 4.01       | 512.33     | 83.91      | 7041.39    | 53.02      | 6.193716e+250 |
| Queenstown  | 1022.25    | 13.28       | 7.75          | 6.75       | 90.0       | 14.19      | 201.3      | 7.87       | 6.458671e+78  |
| Southampton | 17439.4    | 27.08       | 13.0          | 0.0        | 263.0      | 35.89      | 1287.95    | 21.3       | 0.0           |


In [6]:
agg_func_describe = {"fare": ["describe"]}
df.groupby(["embark_town"]).agg(agg_func_describe).round(2)

| embark_town | fare (describe, count) | fare (describe, mean) | fare (describe, std) | fare (describe, min) | fare (describe, 25%) | fare (describe, 50%) | fare (describe, 75%) | fare (describe, max) |
|-------------|------------------------|-----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
| Cherbourg   | 168.0                  | 59.95                 | 83.91                | 4.01                 | 13.7                 | 29.7                 | 78.5                 | 512.33               |
| Queenstown  | 77.0                   | 13.28                 | 14.19                | 6.75                 | 7.75                 | 7.75                 | 15.5                 | 90.0                 |
| Southampton | 644.0                  | 27.08                 | 35.89                | 0.0                  | 8.05                 | 13.0                 | 27.9                 | 263.0                |


#### Counting

After basic math, counting is the next most common aggregation performed on grouped data.

In [8]:
agg_func_count = {"embark_town": ["count", "nunique", "size"]}
df.groupby(["deck"]).agg(agg_func_count)

| deck | embark_town (count) | embark_town (nunique) | embark_town (size) |
|------|---------------------|-----------------------|--------------------|
| A    | 15                  | 2                     | 15                 |
| B    | 45                  | 2                     | 47                 |
| C    | 59                  | 3                     | 59                 |
| D    | 33                  | 2                     | 33                 |
| E    | 32                  | 3                     | 32                 |
| F    | 13                  | 3                     | 13                 |
| G    | 4                   | 1                     | 4                  |
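
The three counting aggregations treat missing values differently: `count` ignores NaN, `size` counts every row, and `nunique` counts distinct non-null values. That is why deck B shows 45 and 47 above. A minimal sketch on made-up data (the `toy` frame and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing town in deck "B"
toy = pd.DataFrame(
    {
        "deck": ["A", "A", "B", "B", "B"],
        "town": ["X", "Y", "X", np.nan, "X"],
    }
)

# count skips NaN, nunique counts distinct non-null values, size counts all rows
counts = toy.groupby("deck")["town"].agg(["count", "nunique", "size"])
print(counts)
```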


#### First and last


In [9]:
agg_func_selection = {"fare": ["first", "last"]}
df.sort_values(by=["fare"], ascending=False).groupby(["embark_town"]).agg(
    agg_func_selection
)

| embark_town | fare (first) | fare (last) |
|-------------|--------------|-------------|
| Cherbourg   | 512.3292     | 4.0125      |
| Queenstown  | 90.0         | 6.75        |
| Southampton | 263.0        | 0.0         |
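
`first` and `last` return the first and last value of each group in the current row order, which is why the data is sorted before grouping. A minimal sketch of the difference, on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data in arbitrary row order
toy = pd.DataFrame(
    {"town": ["A", "B", "A", "B"], "fare": [10.0, 5.0, 30.0, 7.0]}
)

# Without sorting, first/last just follow the original row order
unsorted_sel = toy.groupby("town")["fare"].agg(["first", "last"])

# After sorting by fare (descending), first is the highest fare per group
# and last is the lowest, because groupby keeps the row order within groups
sorted_sel = (
    toy.sort_values(by="fare", ascending=False)
    .groupby("town")["fare"]
    .agg(["first", "last"])
)
print(unsorted_sel)
print(sorted_sel)
```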


Another selection approach is to use idxmax and idxmin to select the index value that corresponds to the maximum or minimum value.



In [10]:
agg_func_max_min = {"fare": ["idxmax", "idxmin"]}
df.groupby(["embark_town"]).agg(agg_func_max_min)

| embark_town | fare (idxmax) | fare (idxmin) |
|-------------|---------------|---------------|
| Cherbourg   | 258           | 378           |
| Queenstown  | 245           | 143           |
| Southampton | 27            | 179           |


We can verify the Cherbourg results by looking up those two index values:

In [11]:
df.loc[[258, 378]]

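|     | survived | pclass | sex    | age  | sibsp | parch | fare     | embarked | class | who   | adult_male | deck | embark_town | alive | alone |
|-----|----------|--------|--------|------|-------|-------|----------|----------|-------|-------|------------|------|-------------|-------|-------|
| 258 | 1        | 1      | female | 35.0 | 0     | 0     | 512.3292 | C        | First | woman | False      | NaN  | Cherbourg   | yes   | True  |
| 378 | 0        | 3      | male   | 20.0 | 0     | 0     | 4.0125   | C        | Third | man   | True       | NaN  | Cherbourg   | no    | True  |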


Here's another shortcut you can use to see the full row with the maximum fare in each class:

In [12]:
df.loc[df.groupby("class")["fare"].idxmax()]

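|     | survived | pclass | sex    | age  | sibsp | parch | fare     | embarked | class  | who   | adult_male | deck | embark_town | alive | alone |
|-----|----------|--------|--------|------|-------|-------|----------|----------|--------|-------|------------|------|-------------|-------|-------|
| 258 | 1        | 1      | female | 35.0 | 0     | 0     | 512.3292 | C        | First  | woman | False      | NaN  | Cherbourg   | yes   | True  |
| 72  | 0        | 2      | male   | 21.0 | 0     | 0     | 73.5     | S        | Second | man   | True       | NaN  | Southampton | no    | True  |
| 159 | 0        | 3      | male   | NaN  | 8     | 2     | 69.55    | S        | Third  | man   | True       | NaN  | Southampton | no    | False |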
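
The shortcut works because `idxmax` returns one index label per group, and `.loc` then looks up the full rows for those labels. A minimal sketch on made-up data (the `toy` frame, its column names, and its index labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with explicit index labels
toy = pd.DataFrame(
    {
        "cls": ["first", "first", "second", "second"],
        "fare": [50.0, 80.0, 20.0, 10.0],
    },
    index=[100, 101, 102, 103],
)

# One index label per group: the row holding that group's maximum fare
top_idx = toy.groupby("cls")["fare"].idxmax()

# Use those labels to pull the complete rows
top_rows = toy.loc[top_idx]
print(top_rows)
```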


### Aggregating with SciPy

SciPy functions can also be passed to `agg`. This example calculates the mode and skew of the fare data.

In [13]:
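from scipy.stats import skew, mode

# skew and mode come from scipy.stats; scipy's mode returns a (mode, count) result,
# while pd.Series.mode returns only the most frequent value(s)
agg_func_stats = {"fare": [skew, mode, pd.Series.mode]}
df.groupby(["embark_town"]).agg(agg_func_stats)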

| embark_town | fare (skew) | fare (mode)      | fare (mode) |
|-------------|-------------|------------------|-------------|
| Cherbourg   | 3.305112    | ([7.2292], [15]) | 7.2292      |
| Queenstown  | 4.265111    | ([7.75], [30])   | 7.75        |
| Southampton | 3.640276    | ([8.05], [43])   | 8.05        |
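
`pd.Series.mode` returns every value tied for most frequent, so a group with a tie produces more than one value. One way to always get a single value per group is a small wrapper that keeps the first mode. This is a sketch, not the approach used above: `first_mode` and the `toy` data are hypothetical names introduced for illustration.

```python
import pandas as pd


def first_mode(s):
    """Return the smallest of the most frequent values, or NaN if there are none."""
    modes = s.mode()  # sorted Series of all values tied for most frequent
    return modes.iloc[0] if not modes.empty else float("nan")


# Hypothetical toy data: town "B" has a tie between 5.0 and 6.0
toy = pd.DataFrame(
    {
        "town": ["A", "A", "A", "B", "B"],
        "fare": [7.0, 7.0, 9.0, 5.0, 6.0],
    }
)

single_mode = toy.groupby("town")["fare"].agg(first_mode)
print(single_mode)
```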


### Working with text


In [14]:
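# mode is scipy.stats.mode, imported above. SciPy 1.11 and later no longer accept
# non-numeric input here, in which case pd.Series.mode can be substituted.
agg_func_text = {"deck": ["nunique", mode, set]}
df.groupby(["class"]).agg(agg_func_text)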

| class  | deck (nunique) | deck (mode)  | deck (set)           |
|--------|----------------|--------------|----------------------|
| First  | 5              | ([C], [59])  | {nan, E, D, C, B, A} |
| Second | 3              | ([F], [8])   | {nan, E, F, D}       |
| Third  | 3              | ([F], [5])   | {nan, E, G, F}       |


### Custom functions


Method 1 - Using partial functions

In [15]:
from functools import partial

# Use partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"

Method 2 - Defining a function

In [16]:
def percentile_25(x):
    return x.quantile(0.25)

Method 3 - Defining a lambda

In [17]:
lambda_25 = lambda x: x.quantile(0.25)
lambda_25.__name__ = "lambda_25%"

Method 4 - Using a lambda function inline

In [18]:
agg_func = {"fare": [q_25, percentile_25, lambda_25, lambda x: x.quantile(0.25)]}

df.groupby(["embark_town"]).agg(agg_func).round(2)

| embark_town | fare (25%) | fare (percentile_25) | fare (lambda_25%) | fare (<lambda_0>) |
|-------------|------------|----------------------|-------------------|-------------------|
| Cherbourg   | 13.7       | 13.7                 | 13.7              | 13.7              |
| Queenstown  | 7.75       | 7.75                 | 7.75              | 7.75              |
| Southampton | 8.05       | 8.05                 | 8.05              | 8.05              |
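
An inline lambda ends up with a generated label such as `<lambda_0>`. One way to get readable labels without setting `__name__` is to combine the lambda with the named aggregation syntax shown earlier. A minimal sketch, assuming a reasonably recent pandas; the `toy` data and the `fare_q25` / `fare_q75` labels are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {
        "town": ["A", "A", "A", "B", "B"],
        "fare": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)

# The keyword names become the column labels, so the lambdas need no __name__
quartiles = toy.groupby("town").agg(
    fare_q25=("fare", lambda x: x.quantile(0.25)),
    fare_q75=("fare", lambda x: x.quantile(0.75)),
)
print(quartiles)
```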


#### Custom function examples


In [19]:
def count_nulls(s):
    """Count the null values in a Series (all rows minus non-null rows)."""
    return s.size - s.count()


def unique_nan(s):
    """Count unique values, treating NaN as a value by passing dropna=False to nunique."""
    return s.nunique(dropna=False)

In [20]:
agg_func_custom_count = {
    "embark_town": ["count", "nunique", "size", unique_nan, count_nulls, set]
}
df.groupby(["deck"]).agg(agg_func_custom_count)

| deck | embark_town (count) | embark_town (nunique) | embark_town (size) | embark_town (unique_nan) | embark_town (count_nulls) | embark_town (set)                    |
|------|---------------------|-----------------------|--------------------|--------------------------|---------------------------|--------------------------------------|
| A    | 15                  | 2                     | 15                 | 2                        | 0                         | {Cherbourg, Southampton}             |
| B    | 45                  | 2                     | 47                 | 3                        | 2                         | {nan, Cherbourg, Southampton}        |
| C    | 59                  | 3                     | 59                 | 3                        | 0                         | {Cherbourg, Queenstown, Southampton} |
| D    | 33                  | 2                     | 33                 | 2                        | 0                         | {Cherbourg, Southampton}             |
| E    | 32                  | 3                     | 32                 | 3                        | 0                         | {Cherbourg, Queenstown, Southampton} |
| F    | 13                  | 3                     | 13                 | 3                        | 0                         | {Cherbourg, Queenstown, Southampton} |
| G    | 4                   | 1                     | 4                  | 1                        | 0                         | {Southampton}                        |


In [31]:
def percentile_90(x):
    """To calculate the 90th percentile value"""
    return x.quantile(0.9)


def trim_mean_10(x):
    """To calculate a trimmed mean where the lowest and highest 10% of values are excluded"""
    from scipy.stats import trim_mean

    # proportiontocut=0.1 slices off 10% of the observations from each end
    return trim_mean(x, 0.1)


def largest(x):
    """If you want the largest value, regardless of the sort order"""
    return x.nlargest(1)
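
`scipy.stats.trim_mean` cuts the given proportion from both ends of the sorted data, not only from the low end. A minimal sketch that checks this against a manual calculation; the `values` array is made up for illustration:

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical data: ten values with one extreme outlier at the top
values = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 1000], dtype=float)

# With 10 values and proportiontocut=0.1, one value is dropped from each end
trimmed = trim_mean(values, 0.1)

# Manual equivalent: sort, drop the smallest and largest value, then average
manual = np.sort(values)[1:-1].mean()

print(values.mean(), trimmed, manual)
```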

In [38]:
import numpy as np
from sparklines import sparklines  # import the function, not just the module


def sparkline_str(x):
    """Render a histogram of the values as a one-line sparkline string"""
    bins = np.histogram(x)[0]
    sl = "".join(sparklines(bins))
    return sl

In [39]:
agg_func_largest = {"fare": [percentile_90, trim_mean_10, largest, sparkline_str]}
df.groupby(["class", "embark_town"]).agg(agg_func_largest)
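
The idea behind `sparkline_str` is to bin the values with `np.histogram` and map each bin count to a block character. As a rough, dependency-free sketch of that idea (this is not the `sparklines` package's implementation; `simple_sparkline`, `BLOCKS`, and the scaling rule are assumptions made for illustration):

```python
import numpy as np

# Eight block characters, from lowest to highest bar
BLOCKS = "▁▂▃▄▅▆▇█"


def simple_sparkline(values, bins=10):
    """Return a one-line text histogram of the values (hypothetical helper)."""
    counts = np.histogram(values, bins=bins)[0]
    top = counts.max()
    if top == 0:
        # No data in any bin: draw a flat line
        return BLOCKS[0] * len(counts)
    # Scale each count to an index between 0 and 7 (the tallest bin gets 7)
    levels = counts * (len(BLOCKS) - 1) // top
    return "".join(BLOCKS[i] for i in levels)


spark = simple_sparkline([1, 1, 1, 2, 3], bins=3)
print(spark)
```

A function like this returns a single string per group, so it can be passed to `agg` in the same way as the other custom functions.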