# Solution (Naive) Multiple Custom Grouping Aggregations

This challenge is going to be fairly difficult, but should answer a question that many pandas users face - What is the best way to do a grouping operation that does many custom aggregations? In this context, a 'custom aggregation' is defined as one that is not directly available to use from pandas and one that you must write a custom function for. 

In Pandas Challenge 1, a single aggregation, which required a custom grouping function, was the desired result. In this challenge, you'll need to make several aggregations when grouping. There are a few different solutions to this problem, but depending on how you arrive at your solution, there could arise enormous performance differences. I am looking for a compact, readable solution with very good performance.

### Sales Data

In this challenge, you will be working with some mock sales data found in the sales.csv file. It contains 200,000 rows and 9 columns.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/sales.csv', parse_dates=['date'])
df.head()

Unnamed: 0,customer_id,date,country,region,delivery_type,cost_type,duration,revenue,cost
0,13763,2019-03-25,Portugal,F,slow,expert,60,553,295
1,13673,2019-12-06,Singapore,I,slow,experienced,60,895,262
2,10287,2018-09-04,India,I,slow,novice,60,857,260
3,14298,2018-06-21,Morocco,F,fastest,expert,120,741,238
4,11523,2019-01-05,Luxembourg,A,fast,expert,120,942,263


In [3]:
df.shape

(200000, 9)

### Challenge

There are many aggregations that you will need to return and it will take some time to understand what they are and how to return them. The following definitions for two time periods will be used throughout the aggregations.

Period **2019H1** is defined as the time period beginning January 1, 2019 and ending June 30, 2019.
Period **2018H1** is defined as the time period beginning January 1, 2018 and ending June 30, 2018.

### Aggregations
Now, I will list all the aggregations that are expected to be returned. Each bullet point represents a single column. Use the first word after the bullet point as the new column name.

For every country and region, return the following:
* recency: Number of days between today's date (9/9/2019) and the maximum value of the 'date' column 
* fast_and_fastest: Number of unique customer_id in period 2019H1 with delivery_type either 'fast' or 'fastest'
* rev_2019: Total revenue for the period 2019H1
* rev_2018: Total revenue for the period 2018H1
* cost_2019: Total cost for period 2019H1
* cost_2019_exp: Total cost for period 2019H1 with cost_type 'expert'
* other_cost: Difference between cost_2019 and cost_2019_exp
* rev_per_60: Total of revenue when duration equals 60 in period 2019H1 divided by number of unique customer_id when duration equals 60 in period 2019H1 
* profit_margin: Take the difference of rev_2019 and cost_2019_exp then divide by rev_2019. Return as percentage
* cost_exp_per_60: Total of cost when duration is 60 and cost_type is 'expert' in period 2019H1 divided by the number of unique customer_id when duration equals 60 and cost_type is 'expert' in period 2019H1 
* growth: Find the percentage growth from revenue in period 2019H1 compared to the revenue in period 2018H1

## Solutions

I will first present a naive solution that returns the correct results, but is extremely slow. It uses a large custom function with the groupby `apply` method. Using the groupby `apply` method has potential to capsize your program as performance can be awful. 

One of my first attempts at using a groupby `apply` to solve a complex grouping problem resulted in a computation that took about eight hours to finish. The dataset was fairly large, at around a million rows, but could still easily fit in memory. I eventually ended up solving the problem using SAS (and not pandas) and shrank the execution time down to a few minutes. This is not an endorsement of SAS, but rather a warning that poor knowledge of how pandas works can lead to horrific performance issues.

### Built-in vs Custom groupby functions

This is a difficult challenge because each aggregation requires a custom calculation that is not provided directly as a pandas groupby method. For example, taking the sum of a column in each group is a simple groupby operation that is built into pandas. No custom function is needed. Here is an example of how we can sum all the revenue for each group.

In [4]:
df.groupby(['country', 'region']).agg({'revenue':'sum'}).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,revenue
country,region,Unnamed: 2_level_1
Argentina,A,284805
Argentina,B,293080
Argentina,C,277825
Argentina,D,264302
Argentina,E,306655


Measuring the performance of this operation with the timeit magic command yields about 25ms for completion.

In [5]:
%timeit df.groupby(['country', 'region']).agg({'revenue':'sum'})

26.4 ms ± 968 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can instead define a very simple custom function which computes the same sum. A pandas Series of the revenue values is passed to the custom function for each group.

In [6]:
def custom_sum(x):
    return x.sum()

In [7]:
df.groupby(['country', 'region']).agg({'revenue': custom_sum}).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,revenue
country,region,Unnamed: 2_level_1
Argentina,A,284805
Argentina,B,293080
Argentina,C,277825
Argentina,D,264302
Argentina,E,306655


Measuring the performance of this custom function, which only has a single line of code has already more than doubled execution time to 60ms.

In [8]:
%timeit df.groupby(['country', 'region']).agg({'revenue': custom_sum})

70.7 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Let's now define a slightly more demanding custom function, one where we return the sum of the revenue based on whether the cost_type column is 'expert'. Here, we must use `apply` as it passes the entire DataFrame of each group to the custom function and not just the Series like `agg` does. 

In [9]:
def custom_sum2(x):
    is_exp = x['cost_type'] == 'expert'
    return x.loc[is_exp, 'revenue'].sum()

In [10]:
df.groupby(['country', 'region']).apply(custom_sum2).head()

country    region
Argentina  A          99605
           B          93364
           C          87975
           D          89515
           E         110613
dtype: int64

We measure performance again (350ms) and see now that we are more than an order of magnitude worse than the original and the only difference is a simple filtering.

In [11]:
%timeit df.groupby(['country', 'region']).apply(custom_sum2)

439 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


There is a better way to solve this problem and will be shown in the optimal second solution. Let's first complete our challenge using this naive approach with a custom function passed to the `apply` groupby method.

## Naive Solution - Use custom function with apply

This challenge lists 11 columns that must be returned in the result. Each of the columns returned requires some kind of customized operation that is not built into the pandas groupby. Let's begin by calculating the first column, recency. 

For every combination of country and region, we need to calculate the number of days between today and the 'maximum' (most recent) date. To do this we write the custom function, `f1`, which will accept the entire DataFrame as its argument.

In [12]:
def f1(x):
    today = pd.Timestamp('today')
    most_recent = x['date'].max()
    recency = (today - most_recent).days
    return recency

In [13]:
df.groupby(['country', 'region']).apply(f1).head()

country    region
Argentina  A        -83
           B        -85
           C        -85
           D        -85
           E        -85
dtype: int64

Instead of returning the scalar result, we can return a Series with the index equal to the column name that we'd like to use, which will be the format for the remainder of this solution.

In [14]:
def f1(x):
    today = pd.Timestamp('today')
    most_recent = x['date'].max()
    recency = (today - most_recent).days
    d = {
        'recency': recency
        }
    return pd.Series(d)

Applying this function to each group yields the following:

In [15]:
df.groupby(['country', 'region']).apply(f1).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,recency
country,region,Unnamed: 2_level_1
Argentina,A,-83
Argentina,B,-85
Argentina,C,-85
Argentina,D,-85
Argentina,E,-85


We measure performance and get 350ms. This doesn't seem too bad for a dataset of 200k rows, but we have 10 more columns to compute.

In [16]:
%timeit df.groupby(['country', 'region']).apply(f1)

358 ms ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's move on to the second column, rev_2019. We need to create a filter for revenue for only the first half of the year 2019.

In [17]:
def f2(x):
    is_2019H1 = x['date'].between('2019-01-01', '2019-06-30')
    recency = (pd.Timestamp('today') - x['date'].max()).days
    rev_2019 =         x.loc[is_2019H1, 'revenue'].sum()
    
    d = {
        'recency': recency,
        'rev_2019': rev_2019
        }
    return pd.Series(d)

Let's apply this new function to our groups.

In [18]:
df.groupby(['country', 'region']).apply(f2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,recency,rev_2019
country,region,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,A,-83,150508
Argentina,B,-85,139048
Argentina,C,-85,118035
Argentina,D,-85,131728
Argentina,E,-85,146201


Measuring performance again and we've jumped up to nearly 1 full second.

In [19]:
%timeit df.groupby(['country', 'region']).apply(f2)

915 ms ± 26.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We can continue to develop our custom function until all columns are calculated in this manner. Here is the final function with all the calculations.

In [20]:
def f_final(x):
    # filters
    is_2019H1 =        x['date'].between('2019-01-01', '2019-06-30')
    is_2018H1 =        x['date'].between('2018-01-01', '2018-06-30')
    is_fast_fastest =  x['delivery_type'].isin({'fast', 'fastest'})
    is_exp =           x['cost_type'] == 'expert'
    is_60 =            x['duration'] == 60
    is_2019H1_exp =    is_2019H1 & is_exp
    is_2019H1_60 =     is_2019H1 & is_60
    is_2019H1_60_exp = is_2019H1 & is_60 & is_exp
    
    # column calculations
    recency =          (pd.Timestamp('today') - x['date'].max()).days
    fast_and_fastest = x.loc[is_fast_fastest, 'customer_id'].nunique()
    rev_2019 =         x.loc[is_2019H1, 'revenue'].sum()
    rev_2018 =         x.loc[is_2018H1, 'revenue'].sum()
    cost_2019 =        x.loc[is_2019H1, 'cost'].sum()
    cost_2019_exp =    x.loc[is_2019H1_exp, 'cost'].sum()
        
    # helper calculations
    rev_2019_60 =           x.loc[is_2019H1_60, 'revenue'].sum()
    uniq_cust_2019_60 =     x.loc[is_2019H1_60, 'customer_id'].nunique()
    cost_2019_exp_60 =      x.loc[is_2019H1_60_exp, 'cost'].sum()
    uniq_cust_2019_exp_60 = x.loc[is_2019H1_60_exp, 'customer_id'].nunique()
    
    # more column calculations
    other_cost =       cost_2019 - cost_2019_exp
    rev_per_60 =       rev_2019_60 / uniq_cust_2019_60
    profit_margin =    (rev_2019 - cost_2019_exp) / rev_2019 * 100
    cost_exp_per_60 =  cost_2019_exp_60 / uniq_cust_2019_60
    growth =           (rev_2019 - rev_2018) / rev_2018 * 100
    
    d = {
        'recency': recency,
        'fast_and_fastest': fast_and_fastest,
        'rev_2019': rev_2019,
        'rev_2018': rev_2018,
        'cost_2019': cost_2019,
        'cost_2019_exp': cost_2019_exp,
        'other_cost': other_cost,
        'rev_per_60': rev_per_60,
        'profit_margin': profit_margin,
        'cost_exp_per_60': cost_exp_per_60,
        'growth': growth
    }
    
    return pd.Series(d)

Applying this final function and formatting the resulting DataFrame yields the following.

In [21]:
df1 = df.groupby(['country', 'region']).apply(f_final)
df1.head().style.format('{:,.0f}')

Unnamed: 0_level_0,Unnamed: 1_level_0,recency,fast_and_fastest,rev_2019,rev_2018,cost_2019,cost_2019_exp,other_cost,rev_per_60,profit_margin,cost_exp_per_60,growth
country,region,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Argentina,A,-83,138,150508,82912,49577,18553,31024,773,88,111,82
Argentina,B,-85,143,139048,92112,46153,15732,30421,750,89,97,51
Argentina,C,-85,129,118035,98472,38786,12661,26125,780,89,105,20
Argentina,D,-85,135,131728,79600,44190,17217,26973,730,87,88,65
Argentina,E,-85,177,146201,93119,49600,18372,31228,747,87,110,57


Our performance with this final function is more than 3.5 seconds. While this is a lot less than 8 hours, the calculations we performed in the custom function were fairly simple and our data was just 200k rows. If the data and complexity of the custom function increases by an 1-2 order of magnitudes each, hours of computation time await.

In [22]:
%timeit df.groupby(['country', 'region']).apply(f_final)

3.68 s ± 78.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Become a pandas expert

If you are looking to completely master the pandas library and become a trusted expert for doing data science work, check out my book [Master Data Analysis with Python][1]. It comes with over 300 exercises with detailed solutions covering the pandas library in-depth.

[1]: https://www.dunderdata.com/master-data-analysis-with-python