# Custom Aggregation

Pandas groupby objects come with many built-in aggregation functions. These are all available as strings within the `agg` method. There are, of course, many other aggregations that are not directly available and only possible by defining a custom aggregation function. These customized functions must return a single value. We begin by reading in the City of Houston dataset.

In [None]:
import pandas as pd
import numpy as np
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

### Writing your own custom aggregation function

Suppose you would like to know the difference between the max and min value of a column for each group. pandas has no built-in aggregation function for this, so you will have to define it yourself.

Each customized aggregation function is defined as a regular Python function with the `def` keyword. Each function is **implicitly** passed the aggregating column of the group as a **Series**. This means that all Series methods will work on the passed argument. The `min_max` function below takes one argument, `s`, which is a Series object. It returns the difference between the max and min values of that Series.

In [None]:
def min_max(s):
    return s.max() - s.min()

## Using a customized aggregation function

Customized aggregation functions are used the same way as their built-in counterparts. Pass the function itself, rather than a string, to the `agg` method. The following finds the difference between the maximum and minimum salaries within each department.

In [None]:
emp.groupby('dept').agg(min_max_salary_diff=('salary', min_max))
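
To make the mechanics concrete without the employee file, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are assumptions for illustration only; the `min_max` function is the one defined above.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Fire', 'Police', 'Police'],
    'salary': [50_000, 65_000, 80_000, 45_000, 90_000],
})

def min_max(s):
    # s is the salary Series of one department at a time
    return s.max() - s.min()

# Pass the function object itself, not a string
result = toy.groupby('dept').agg(min_max_salary_diff=('salary', min_max))
print(result)
```

For this toy data, the Fire range is 80,000 - 50,000 = 30,000 and the Police range is 90,000 - 45,000 = 45,000.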

### Implicit passing of aggregation Series

The above `agg` method passed the `salary` column as a Series to our customized aggregation function, `min_max`, for each group. The parameter `s` takes on this Series. We say this is implicit, because we don't actually see the function executed. Here is an **explicit** call to the `min_max` function where we pass it the entire salary column.

In [None]:
min_max(emp['salary'])

Our custom aggregation function is not passed the salary column of the entire DataFrame as done above, but just the salary column for the current group. Let's manually recreate the grouping that pandas did by first getting the unique department values.

In [None]:
unique_depts = emp['dept'].drop_duplicates().sort_values()
unique_depts.values

We can loop through these departments, filter for just the salaries of each department, and pass that Series into the `min_max` function. The results are then printed to the screen, matching the above work done by pandas.

In [None]:
for dept in unique_depts:
    salary_group = emp.query('dept == @dept')['salary']
    min_max_diff = min_max(salary_group)
    print(f'{dept:30} {min_max_diff}')
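
Rather than comparing printed output by eye, the manual loop and the groupby result can be checked against each other programmatically. The sketch below assumes a hypothetical toy DataFrame named `toy`; it uses boolean selection instead of `query`, which is an equivalent way to filter.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
toy = pd.DataFrame({
    'dept': ['Parks', 'Fire', 'Parks', 'Fire', 'Library'],
    'salary': [40_000, 70_000, 55_000, 52_000, 38_000],
})

def min_max(s):
    return s.max() - s.min()

# Explicit version: filter each department and call the function ourselves
manual = {}
for dept in sorted(toy['dept'].unique()):
    salary_group = toy.loc[toy['dept'] == dept, 'salary']
    manual[dept] = min_max(salary_group)

# Implicit version: pandas passes each group's Series to min_max
grouped = toy.groupby('dept')['salary'].agg(min_max)

matches = grouped.to_dict() == manual
print(matches)
```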

### Using an anonymous function

For custom aggregation functions that have just a single line of code with little logic, an anonymous function can be used. If the aggregation will be reused or there is complex logic, then define a normal function. Here, we use an anonymous function to recreate the result of `min_max`. 

In [None]:
(emp.groupby('dept')
    .agg(min_max_salary_diff=('salary', lambda s: s.max() - s.min())))

### Custom aggregation functions must return a single value

The custom aggregation function you define must return a single value or else an exception will be raised. Let's create a custom aggregation that adds 5 to each value. This will return a Series the same size as the group and not a single value.

In [None]:
def add5(s):
    return s + 5

Attempting to use this custom aggregation results in an error informing us that we did not produce an aggregated value.

In [None]:
emp.groupby('dept').agg(salary_plus_5=('salary', add5))
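
One way to check ahead of time whether a function reduces a Series to a single value is to call it on a sample Series and test the result with `pd.api.types.is_scalar`. This is only a sketch: the `sample` Series is made-up data standing in for one group's salaries.

```python
import pandas as pd

def min_max(s):
    return s.max() - s.min()

def add5(s):
    return s + 5

# Hypothetical stand-in for the salaries of a single group
sample = pd.Series([50_000, 65_000, 80_000])

min_max_is_scalar = pd.api.types.is_scalar(min_max(sample))  # a single number
add5_is_scalar = pd.api.types.is_scalar(add5(sample))        # a Series, same length as the input

print(min_max_is_scalar, add5_is_scalar)
```

A function like `add5` that keeps the shape of its input is not an aggregation, which is why `agg` rejects it.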

### Combine custom and built-in aggregation functions

The custom aggregation function can be used in conjunction with any number of other built-in aggregation functions that we have previously seen.

In [None]:
(emp.groupby(['dept', 'sex'])
    .agg(min_salary=('salary', 'min'),
         max_salary=('salary', 'max'),
         min_max_salary_diff=('salary', min_max))
    .head(10)
    .round(-3)
    .style.format('{:,.0f}'))

## Find the mean salary for the five highest paid employees per department

In this section, we will find the mean salary of the five highest paid employees per department. This again requires a custom aggregation function. The same Series of salaries will be implicitly passed to it. We use the `nlargest` method to select the top five salaries and then take the mean of them to return a single value.

In [None]:
def top5_mean(s):
    return s.nlargest(5).mean()

Let's use this function to generate the desired result.

In [None]:
(emp.groupby('dept')
    .agg(top5_average_salary=('salary', top5_mean))
    .round(-3)
    .style.format('{:,.0f}'))

### What percent of total salary do these five employees represent?

Let's take our question a step further and determine the percentage of the total salary from each department that these five employees represent. Instead of finding the mean, we take the sum of the top five salaries and divide it by the sum of all salaries for that department.

In [None]:
def top5_percent(s):
    return s.nlargest(5).sum() / s.sum() * 100

Again, we pass our function to the `agg` method to get the result.

In [None]:
(emp.groupby('dept')
    .agg(top5_salary_percent=('salary', top5_percent))
    .round(2))
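
The arithmetic behind `top5_mean` and `top5_percent` is easy to follow by hand on a tiny example. The sketch below assumes a hypothetical DataFrame `toy` with one group of six salaries and one group of only two.

```python
import pandas as pd

def top5_mean(s):
    return s.nlargest(5).mean()

def top5_percent(s):
    return s.nlargest(5).sum() / s.sum() * 100

# Hypothetical toy data: group A has six rows, group B has two
toy = pd.DataFrame({
    'dept': ['A'] * 6 + ['B'] * 2,
    'salary': [10, 20, 30, 40, 50, 60, 100, 200],
})

result = toy.groupby('dept').agg(
    top5_mean=('salary', top5_mean),
    top5_percent=('salary', top5_percent),
)
print(result.round(2))
```

For group A the top five salaries sum to 200 out of 210, so the mean is 40 and the share is about 95.24%. Group B has fewer than five rows, so `nlargest(5)` returns all of them and the share is 100%.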

The above results tell us that the top five highest paid employees of the fire department represent 0.65% of the total combined fire department salary. Let's verify this by calculating it without using `groupby`.

In [None]:
fire_sal_sorted = (emp.query('dept == "Fire"')['salary']
                      .sort_values(ascending=False))
fire_sal_sorted.iloc[:5].sum() / fire_sal_sorted.sum() * 100

These results would be more meaningful if we also calculated the percentage of the number of employees that these five represent. For instance, if there are 200 fire department employees, then five would represent 2.5% (5 / 200). Another custom aggregation function is defined below. Notice that the `count` method is used because several employees are missing salaries and `count` tallies only the non-missing values. If there are fewer than five employees in the group, 100% will be returned.

In [None]:
def count5_percent(s):
    return min(5 / s.count(), 1) * 100
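
A quick sketch of how `count5_percent` behaves on two made-up Series, one with a missing value and fewer than five known salaries, and one with ten known salaries. The names `small` and `large` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def count5_percent(s):
    return min(5 / s.count(), 1) * 100

# Hypothetical groups: count() ignores the missing value in `small`
small = pd.Series([70_000, np.nan, 82_000])      # 2 known salaries
large = pd.Series(range(50_000, 60_000, 1_000))  # 10 known salaries

small_pct = count5_percent(small)  # 5 / 2 is capped at 1, giving 100
large_pct = count5_percent(large)  # 5 / 10 gives 50.0

print(small_pct, large_pct)
```

Note that a group with no known salaries at all would make the denominator zero, so a guard may be needed for data where that can happen.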

Let's use both custom aggregation functions along with the built-in `count`.

In [None]:
(emp.groupby('dept')
    .agg(top5_salary_percent=('salary', top5_percent),
         count5_percent=('salary', count5_percent),
         count=('salary', 'count'))
    .round(2))

This gives us more meaning behind the results. The five fire department employees account for 0.11% of all employees, but 0.65% of the total salary. Since not all employees receive the same salary, we expect the top five to have a disproportionate share of the total salary.

### Using a custom aggregation function in a pivot table

As we learned, pivot tables are nearly identical to groupby aggregations. pandas allows you to set the `aggfunc` of the `pivot_table` method to a custom aggregation function. It will implicitly pass the Series of the `values` column to the aggregation function. Here we apply `top5_percent` to the salary column for each unique combination of department and sex.

In [None]:
emp.pivot_table(index='dept', columns='sex', 
                values='salary', aggfunc=top5_percent).round(2)
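
To see that a pivot table with a custom `aggfunc` matches the equivalent groupby, here is a minimal sketch on made-up data. It uses the simpler `min_max` function from earlier in the chapter so the numbers are easy to verify by hand; the DataFrame `toy` is hypothetical.

```python
import pandas as pd

def min_max(s):
    return s.max() - s.min()

# Hypothetical toy data with two rows per dept/sex combination
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Fire', 'Fire',
             'Police', 'Police', 'Police', 'Police'],
    'sex': ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
    'salary': [50_000, 60_000, 55_000, 75_000,
               48_000, 52_000, 61_000, 90_000],
})

# Custom function passed as aggfunc
table = toy.pivot_table(index='dept', columns='sex',
                        values='salary', aggfunc=min_max)

# The same calculation with groupby followed by unstack
same = toy.groupby(['dept', 'sex'])['salary'].agg(min_max).unstack('sex')

print(table)
```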

## Percentage of employees by department with salaries greater than 100,000

In this section, we'll find the percentage of employees in each department with salaries greater than 100,000 using a custom aggregation function.

In [None]:
def perc_gt_100k(s):
    return (s > 100_000).mean() * 100

We use this custom function just like the others in this chapter to get our result.

In [None]:
(emp.groupby('dept')
    .agg(percent_greater_100k=('salary', perc_gt_100k))
    .round(2))

### Not accounting for missing values

Some employees do not have salaries and when we make the comparison `s > 100_000`, those missing values will evaluate as `False`. Let's find the number of employees per department that have missing values with an anonymous custom function.

In [None]:
(emp.groupby('dept')
    .agg(num_missing=('salary', lambda x: x.isna().sum())))

If we want to find the percentage of employees with known salary that have a salary greater than 100,000, then we'll have to either drop these missing values or use a denominator that is equivalent to the number of employees with known values. Below, we use the first option to drop all the values in the current group that are missing. Only the police department's returned statistic changes, as it was the only one that had missing values.

In [None]:
def perc_gt_100k_real(s):
    return (s.dropna() > 100_000).mean() * 100

(emp.groupby('dept')
    .agg(percent_greater_100k=('salary', perc_gt_100k_real))
    .round(2))
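
The effect of missing values on the two functions can be seen on a single made-up Series. This sketch assumes a hypothetical group of four employees, two of whom have no recorded salary.

```python
import numpy as np
import pandas as pd

def perc_gt_100k(s):
    return (s > 100_000).mean() * 100

def perc_gt_100k_real(s):
    return (s.dropna() > 100_000).mean() * 100

# Hypothetical group: two known salaries, two missing
s = pd.Series([120_000, np.nan, 80_000, np.nan])

comparison = (s > 100_000).tolist()    # missing values compare as False
with_missing = perc_gt_100k(s)         # 1 of 4 rows
known_only = perc_gt_100k_real(s)      # 1 of 2 known salaries

print(comparison, with_missing, known_only)
```

The first function reports 25% because the missing rows count in the denominator, while the second reports 50% of employees with a known salary.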

## Optimizing a custom aggregation function

Custom aggregation functions have potential to perform poorly. pandas optimizes its own built-in aggregations but cannot do the same for the ones you define. This section will show you how to optimize the performance of your custom aggregation function.

### Create a larger DataFrame to test performance

Our current DataFrame contains only 24,000 rows. Performance metrics matter more with larger datasets. To simulate a larger dataset, we'll use the `sample` method to select 100,000 random rows with replacement.

In [None]:
emp_large = emp.sample(100_000, replace=True, ignore_index=True)
emp_large.head(3)

Let's verify that we have a DataFrame with 100,000 rows.

In [None]:
emp_large.shape

Let's measure the performance of one of our custom aggregation functions, `perc_gt_100k_real`, with this larger dataset.

In [None]:
%%time
_ = emp_large.groupby('dept').agg({'salary': perc_gt_100k_real})

Here, the built-in mean aggregation function is timed.

In [None]:
%%time
_ = emp_large.groupby('dept').agg({'salary': 'mean'})

On my machine, the custom aggregation function had similar performance to the built-in `mean`. The number of groups can greatly impact the performance of an operation. Let's take a look at the number of unique values in each column.

In [None]:
emp_large.nunique()

There are substantially more unique titles in our dataset. Let's use this column as the grouping column instead of department to show a larger difference in performance.

In [None]:
%%time
_ = emp_large.groupby('title').agg({'salary': perc_gt_100k_real})

In [None]:
%%time
_ = emp_large.groupby('title').agg({'salary':'mean'})

On my machine, grouping by title was more than 10 times slower, showcasing the vast differences in performance when grouping by columns with a large number of groups. Each group must be passed to the custom function, so the more groups, the worse the performance.
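
The `%%time` magic only works in IPython and Jupyter. Outside a notebook, a rough stand-in is `time.perf_counter`. The sketch below is an assumption-laden illustration: it builds synthetic data with 1,000 hypothetical group labels and uses a helper named `timed` that is not part of pandas. Timings vary by machine, so no particular speedup is claimed.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: 50,000 rows spread over 1,000 hypothetical titles
rng = np.random.default_rng(0)
n_rows, n_groups = 50_000, 1_000
df = pd.DataFrame({
    'title': rng.integers(0, n_groups, n_rows).astype(str),
    'salary': rng.normal(60_000, 20_000, n_rows),
})

def timed(func):
    # Hypothetical helper: run func once, return its result and elapsed seconds
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

# Custom function: called once per group in Python
custom, custom_secs = timed(
    lambda: df.groupby('title')['salary'].agg(lambda s: s.mean()))

# Built-in aggregation: handled by optimized pandas code
builtin, builtin_secs = timed(
    lambda: df.groupby('title')['salary'].mean())

print(f'custom:   {custom_secs:.4f} s')
print(f'built-in: {builtin_secs:.4f} s')
```

Both approaches return the same values; only the elapsed time differs.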

### Convert grouping columns to categorical

The first step toward making your custom aggregation (and any aggregation) faster is to convert the grouping column to categorical if it is an object column containing strings. Grouping columns tend to be strings, so this is a change that you'll make often. Here, we create two new category columns for department and title and append them to the end of our DataFrame. When performing this on your own data, overwrite the original columns. We keep the original columns in this instance to show the difference in performance.

In [None]:
cat_cols = emp_large[['dept', 'title']].astype('category')
emp_large[['dept_cat', 'title_cat']] = cat_cols
emp_large.head(3)
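
Converting to categorical changes how the column is stored, not the result of the aggregation. The sketch below checks this on a hypothetical toy DataFrame. It passes `observed=True` explicitly, an assumption made here so that only categories present in the data are returned regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'dept': ['Fire', 'Police', 'Fire', 'Parks', 'Police'],
    'salary': [50_000.0, 60_000.0, 70_000.0, 40_000.0, 80_000.0],
})
toy['dept_cat'] = toy['dept'].astype('category')

by_str = toy.groupby('dept')['salary'].mean()
by_cat = toy.groupby('dept_cat', observed=True)['salary'].mean()

print(toy.dtypes)
print(by_str.tolist() == by_cat.tolist())
```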

Before testing our custom function, let's compare the performance difference of the built-in `mean` function between the two versions of the department column.

In [None]:
%%time
_ = emp_large.groupby('dept').agg({'salary': 'mean'})

In [None]:
%%time
_ = emp_large.groupby('dept_cat').agg({'salary': 'mean'})

Grouping with the categorical column cut the completion time in half on my machine. Executing this comparison using title as the grouping column results in a similar performance gain.

In [None]:
%%time
_ = emp_large.groupby('title').agg({'salary': 'mean'})

In [None]:
%%time
_ = emp_large.groupby('title_cat').agg({'salary': 'mean'})

### Complete operations that are independent of the group outside of the custom function

The next step to optimizing your custom function is to look for operations that are independent of the group. The `perc_gt_100k_real` function has two computations that are independent of the group. Dropping missing values is independent of the group. It doesn't matter which department or title it is. We can drop all missing values BEFORE the groupby instead of WITHIN the custom function. This allows us to use the original `perc_gt_100k` function instead. Let's time the new operation.

In [None]:
%%time
_ = (emp_large.dropna(subset=['salary'])
              .groupby('title_cat')
              .agg({'salary': perc_gt_100k}))
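
Dropping missing values before the groupby gives the same numbers as dropping them inside the function, with one difference worth knowing about. The sketch below uses a hypothetical toy DataFrame in which one title, `Intern`, has no known salaries, and groups by a plain string column.

```python
import numpy as np
import pandas as pd

def perc_gt_100k(s):
    return (s > 100_000).mean() * 100

def perc_gt_100k_real(s):
    return (s.dropna() > 100_000).mean() * 100

# Hypothetical toy data: every Intern salary is missing
toy = pd.DataFrame({
    'title': ['Clerk', 'Clerk', 'Chief', 'Chief', 'Intern'],
    'salary': [40_000, np.nan, 150_000, 90_000, np.nan],
})

# Drop missing values inside the function, once per group
inside = toy.groupby('title').agg(pct=('salary', perc_gt_100k_real))

# Drop missing values once, before grouping
outside = (toy.dropna(subset=['salary'])
              .groupby('title')
              .agg(pct=('salary', perc_gt_100k)))

print(inside)
print(outside)
```

The values for `Chief` and `Clerk` agree. The `Intern` group appears with a missing result in the first version but is absent from the second, because all of its rows were removed before grouping.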

A substantial 1/3 gain in performance was recorded. Dropping missing values causes pandas to create an entire new DataFrame (or Series) in memory. Doing this operation independently for all titles takes more time than doing it once for the entire DataFrame.

### Salary comparison is independent of the group

The comparison between salary and 100,000 is completed for every group, but this operation is also independent of the group. We can do the comparison before any grouping. Below, we create a new boolean column for salaries greater than 100,000 and use it as the aggregating column. Now, there's no need for a custom function at all, which results in a massive performance improvement of about 80%.

In [None]:
%%time
emp_large['salary_greater'] = emp_large['salary'] > 100_000
_ = (emp_large.dropna(subset=['salary'])
              .groupby('title_cat')['salary_greater'].mean())

### Avoid `dropna`

The `dropna` method is fairly expensive as it creates an entire new DataFrame in memory of just the rows that do not have a missing salary. We can avoid it by taking the sum of the `salary_greater` column of each group and dividing by the number of non-missing values (`count` method). 

In [None]:
%%time
emp_large['salary_greater'] = emp_large['salary'] > 100_000
df_temp = emp_large.groupby('title_cat').agg(ct_over_100k=('salary_greater', 'sum'),
                                             total_ct=('salary', 'count'))
_ = df_temp['ct_over_100k'] / df_temp['total_ct']

### Use alternative syntax

Notice that the previous step does not improve performance. In order to see the improvement, you'll have to use alternative groupby syntax where a single Series of data is returned. Below, we assign the result of just the `groupby` method (before any aggregation) to the variable `g`. This object stores the grouping and allows you to calculate different aggregations for different aggregating columns. This improves performance by over 98% from the original.

In [None]:
%%time
emp_large['salary_greater'] = emp_large['salary'] > 100_000
g = emp_large.groupby('title_cat')
_ = g['salary_greater'].sum() /  g['salary'].count()
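
To confirm that the sum-divided-by-count approach gives the same answer as the custom function, here is a minimal sketch on a hypothetical toy DataFrame. Both results are fractions; multiply by 100 for a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing salaries
toy = pd.DataFrame({
    'title': ['Chief', 'Chief', 'Chief', 'Clerk', 'Clerk'],
    'salary': [150_000, 90_000, np.nan, 40_000, np.nan],
})

# Group-independent work done once, up front
toy['salary_greater'] = toy['salary'] > 100_000

# Store the grouping and reuse it for two built-in aggregations
g = toy.groupby('title')
fast = g['salary_greater'].sum() / g['salary'].count()

# Custom-function version for comparison
slow = g['salary'].agg(lambda s: (s.dropna() > 100_000).mean())

print(fast)
```

For `Chief`, one of two known salaries exceeds 100,000, giving 0.5. For `Clerk`, none of the one known salary does, giving 0.0.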

### One last alternative with pandas-only data types

One of the issues with the default pandas data types is that comparisons with missing values evaluate as `False`. With the new pandas-only data types that use `pd.NA` as missing values, they stay missing. Below, we convert the salary column to the new nullable float data type, which becomes a nullable boolean after the comparison, keeping the missing values. Series also have a groupby method and can be grouped by a column of data in a different DataFrame as long as it's the same length. The categorical title column is used below. Finally, the mean of the booleans is calculated on each group, returning the proportion of salaries greater than 100,000 (multiply by 100 for a percentage).

In [None]:
%%time
s = (emp_large['salary'].astype('Float64') > 100_000)
_ = s.groupby(emp_large['title_cat']).mean()
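
The difference between the default and nullable comparison is easiest to see side by side. The sketch below uses a hypothetical four-value salary Series and a made-up `titles` Series as grouping keys.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, one missing
salary = pd.Series([np.nan, 150_000.0, 50_000.0, 120_000.0])
titles = pd.Series(['a', 'a', 'b', 'b'])

default_cmp = salary > 100_000                     # missing becomes False
nullable_cmp = salary.astype('Float64') > 100_000  # missing stays <NA>

default_mean = default_cmp.mean()    # 2 True out of 4 rows
nullable_mean = nullable_cmp.mean()  # 2 True out of 3 known rows

# Group the nullable booleans by a separate Series of the same length
by_title = nullable_cmp.groupby(titles).mean()

print(nullable_cmp)
print(by_title)
```

Because the missing value is skipped rather than counted as `False`, group `a` reports 1.0 (one known salary, which is above 100,000) instead of 0.5.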

## Summary of performance improvement for custom aggregation functions

* Convert grouping columns to categorical
* Complete operations that are independent of the group outside of the custom function 
* Use built-in aggregations if possible
* Use alternative groupby syntax
* Use multiple built-in groupby aggregations if necessary

### Complexity vs Performance

This is usually a topic of debate when deciding on which pandas methods to use. I typically like to avoid custom aggregation functions at all costs as they can drastically reduce performance for larger datasets.

Readability (low complexity) is very valuable when sharing your code or looking back at it at a later date. Custom aggregation functions might provide more readability in some situations, but more often than not this isn't a large enough benefit, so I rarely use them as the performance difference is much more significant.

## Exercises

Execute the cell below to read in the flights dataset and then use it for the following exercises.

In [None]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

### Exercise 1

<span style="color:green; font-size:16px">What are the three airlines with the fewest flights?</span>

### Exercise 2

<span style="color:green; font-size:16px">For each airline, find the 75th percentile of flight distance. Use a custom aggregation function.</span>

### Exercise 3

<span style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on a Tuesday. Use a custom aggregation function.</span>

### Exercise 4

<span style="color:green; font-size:16px">Optimize exercise 3 without using a custom aggregation. What is the performance difference?</span>

### Exercise 5

<span style="color:green; font-size:16px">The range of salaries per department was calculated using the `min_max` custom function from the beginning of this chapter. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

### Exercise 6

<span style="color:green; font-size:16px">Which origin airport has the highest percentage of its flights cancelled?</span>

### Use the college dataset

Execute the following cell which reads in a few columns from the college dataset, sets the institution name as the index and converts 'stabbr' and 'relaffil' to categorical.

In [None]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, 
                      index_col='instnm', dtype={'stabbr': 'category', 
                                                 'relaffil': 'category'})
college.head(3)

### Exercise 7

<span style="color:green; font-size:16px">In how many states do schools with a higher 'satvrmid' than 'satmtmid' outnumber schools with a higher 'satmtmid' than 'satvrmid'? Make sure not to count schools that have missing values for either one.</span>

### Exercise 8

<span style="color:green; font-size:16px">Create a pivot table that shows the percentage of schools with fewer than 1,000 students in each state by religious affiliation. Also return the count of schools.</span>