# 8. Say No to `apply` with `groupby`

### What is `apply` with `groupby`?
Typically, when calling `groupby`, you will use one of the standard aggregation functions that Pandas already provides, such as `sum`, `min`, or `median`. Occasionally, you might need to write your own custom aggregation (or non-aggregation) function. Doing so typically means passing that function to `apply`.

Previously, we learned that `apply` can lead to very poor performance. This is especially true when using it with `groupby`. Let's see this in action with a problem that might lead you to think we absolutely need to use `apply`.

## Challenge

Using the college dataset, for each state, find the share of the state's total student population (the `ugds` column) that is enrolled at the 5 largest colleges of that state.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 100)
college = pd.read_csv('data/college.csv')
college.head()

### A naive solution
We can define a custom aggregation function that sorts each state's population values, sums the 5 largest, and divides that sum by the state total.

In [None]:
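def func1(s):
    # Sort this state's values from largest to smallest
    s = s.sort_values(ascending=False)
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    # Fraction of the state total held by the 5 largest colleges
    return top5_total / total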

In [None]:
# The result is a Series indexed by state abbreviation, not a DataFrame
top5_share = college.groupby('stabbr')['ugds'].apply(func1)
top5_share.head()
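To make the computation concrete, here is a minimal sketch on a tiny hand-made table. The `toy` DataFrame, its state codes `AA` and `BB`, and the name `top5_share_of_total` are made up for illustration and are not part of the college dataset; the function body mirrors the custom function above.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({
    'stabbr': ['AA'] * 6 + ['BB'] * 3,
    'ugds': [100, 50, 40, 30, 20, 10, 5, 3, 2],
})

def top5_share_of_total(s):
    # Same logic as the custom function above: sort, sum top 5, divide by total
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

toy_share = toy.groupby('stabbr')['ugds'].apply(top5_share_of_total)
print(toy_share)
```

State `AA` has six colleges, so its share is (100 + 50 + 40 + 30 + 20) / 250 = 0.96. State `BB` has fewer than five colleges, so all of them count and its share is 1.0.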

## Do as much as you can outside of the custom function
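As a general rule, do as much work as possible outside the custom function. For instance, we can sort the entire DataFrame once instead of sorting within each group. This works because `groupby` preserves the order of rows within each group, so every group arrives at the function already sorted. Let's compare the performance.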

In [None]:
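# Sort once, up front; groupby keeps this row order within each group
df = college.sort_values('ugds', ascending=False)

def func2(s):
    # s is already sorted, so the first 5 values are the 5 largest
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

df.groupby('stabbr')['ugds'].apply(func2).head()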
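Before trusting the faster version, it is worth confirming that both approaches return the same numbers. The sketch below does this on randomly generated data; the `fake` DataFrame and the function names `sort_inside` and `presorted` are assumptions made for illustration, standing in for the college data and the two functions above.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data in place of the college dataset
rng = np.random.default_rng(0)
fake = pd.DataFrame({
    'stabbr': rng.choice(['AK', 'CA', 'NY', 'TX'], size=200),
    'ugds': rng.integers(1, 50_000, size=200).astype(float),
})

def sort_inside(s):
    # Sorts within every group
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

def presorted(s):
    # Assumes the caller has already sorted the data in descending order
    return s.iloc[:5].sum() / s.sum()

inside = fake.groupby('stabbr')['ugds'].apply(sort_inside)
outside = (
    fake.sort_values('ugds', ascending=False)
        .groupby('stabbr')['ugds']
        .apply(presorted)
)
print(np.allclose(inside, outside))
```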

### Time performance
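Sorting once up front gives a nice performance boost. The second cell includes the sort in the timed code so that the comparison is fair.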

In [None]:
%timeit -n 5 college.groupby('stabbr')['ugds'].apply(func1)

In [None]:
%%timeit -n 5 
df = college.sort_values('ugds', ascending=False)
df.groupby('stabbr')['ugds'].apply(func2)

### Remove function completely
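You can avoid `apply` entirely by first selecting the top 5 rows for each state. Because the DataFrame is already sorted in descending order, taking the first 5 rows of each group with `head(5)` gives the 5 largest colleges in each state.

In [None]:
df = college.sort_values('ugds', ascending=False)
df_top5 = df.groupby('stabbr').head(5)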

We can then compute the per-state totals for each DataFrame and divide one by the other. Both results are indexed by state, so the division aligns on `stabbr`.

In [None]:
top5_total = df_top5.groupby('stabbr').agg({'ugds': 'sum'})
total = df.groupby('stabbr').agg({'ugds': 'sum'})
df_final = top5_total / total
df_final.head()
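As a sanity check, the `apply`-free pipeline can be compared against an `apply`-based one. The sketch below uses randomly generated data (the `fake` DataFrame is a made-up stand-in for the college dataset) and selects the `ugds` column directly so that both results are Series and easy to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data in place of the college dataset
rng = np.random.default_rng(1)
fake = pd.DataFrame({
    'stabbr': rng.choice(['AK', 'CA', 'NY', 'TX', 'WA'], size=300),
    'ugds': rng.integers(1, 50_000, size=300).astype(float),
})
fake_sorted = fake.sort_values('ugds', ascending=False)

# No apply: built-in groupby methods only
top5_total = fake_sorted.groupby('stabbr').head(5).groupby('stabbr')['ugds'].sum()
total = fake_sorted.groupby('stabbr')['ugds'].sum()
no_apply = top5_total / total

# Reference result using apply on the same sorted data
with_apply = fake_sorted.groupby('stabbr')['ugds'].apply(
    lambda s: s.iloc[:5].sum() / s.sum()
)
print(np.allclose(no_apply, with_apply))
```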

Performance is now about 6x as fast as before.

In [None]:
%%timeit -n 5 

df = college.sort_values('ugds', ascending=False)
df_top5 = df.groupby('stabbr').head(5)
top5_total = df_top5.groupby('stabbr').agg({'ugds': 'sum'})
total = df.groupby('stabbr').agg({'ugds': 'sum'})
df_final = top5_total / total
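The `%timeit` magics above only work inside IPython or Jupyter. A rough equivalent for a plain Python script, using the standard-library `timeit` module, might look like the sketch below. The synthetic `fake` data and the helper names `with_apply` and `without_apply` are assumptions for illustration, and the timings you get will depend on your machine, your pandas version, and the shape of the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 20,000 rows spread over 50 made-up "states"
rng = np.random.default_rng(2)
n_rows = 20_000
fake = pd.DataFrame({
    'stabbr': rng.integers(0, 50, size=n_rows).astype(str),
    'ugds': rng.integers(1, 50_000, size=n_rows).astype(float),
})

def with_apply():
    def sort_inside(s):
        s = s.sort_values(ascending=False)
        return s.iloc[:5].sum() / s.sum()
    return fake.groupby('stabbr')['ugds'].apply(sort_inside)

def without_apply():
    df = fake.sort_values('ugds', ascending=False)
    top5_total = df.groupby('stabbr').head(5).groupby('stabbr')['ugds'].sum()
    total = df.groupby('stabbr')['ugds'].sum()
    return top5_total / total

# Best of 3 repeats, 5 runs each, mirroring `-n 5` above
t_apply = min(timeit.repeat(with_apply, number=5, repeat=3)) / 5
t_no_apply = min(timeit.repeat(without_apply, number=5, repeat=3)) / 5
print(f'apply:    {t_apply * 1e3:.2f} ms per run')
print(f'no apply: {t_no_apply * 1e3:.2f} ms per run')
```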