# Optimize Custom Grouping Function
In this challenge, your goal is to find the fastest solution to the problem while only using the Pandas library.

### The Challenge
The `college_pop` dataset contains the name, state and population of all higher-ed institutions in the US and its territories. For each state, find the percentage of the total state population made up by the 5 largest colleges of that state.

In [1]:
import pandas as pd
college = pd.read_csv('https://raw.githubusercontent.com/DunderData/Pandas-Challenges/master/data/college_pop.csv')
college.head()

,name,state,pop
0,Alabama A & M University,AL,4206.0
1,University of Alabama at Birmingham,AL,11383.0
2,Amridge University,AL,291.0
3,University of Alabama in Huntsville,AL,5451.0
4,Alabama State University,AL,4811.0


## A `groupby` problem
This problem needs the use of the `groupby` method to group all the colleges by state. With each group, we need to do these 4 steps:
1. Select the top 5 largest schools
2. Sum the population of these 5 largest schools
3. Sum the entire state population
4. Divide the result of step 2 by the result of step 3
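
The four steps above can be sketched for a single group. This is a minimal illustration on made-up numbers for one hypothetical state (the values are not taken from `college_pop`), using seven schools so that the top 5 differs from the full total.

```python
import pandas as pd

# Made-up populations for one hypothetical state (not from college_pop)
pops = pd.Series([100.0, 50.0, 400.0, 250.0, 75.0, 25.0, 100.0], name='pop')

top5 = pops.sort_values(ascending=False).iloc[:5]  # step 1: five largest schools
top5_total = top5.sum()                            # step 2: 400 + 250 + 100 + 100 + 75 = 925
state_total = pops.sum()                           # step 3: all seven schools = 1000
share = top5_total / state_total                   # step 4: step 2 divided by step 3

print(top5_total, state_total, share)
```

The grouping solutions that follow repeat this same calculation once per state.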

Pandas groupby objects have many methods such as `min`, `max`, `mean`, and `sum`, but no single method accomplishes our current task, so we will need to solve this problem in steps. Several approaches to this challenge are outlined below.

## Solution 1: Naive Custom Function
Many Pandas users will see a problem like this and immediately think about creating a custom grouping function. Let's start with this approach. Below, we create a custom function that accepts a single column and returns a single value. We will sort this column from greatest to least and then finish the problem.

In [2]:
def find_top_5(s):
    s = s.sort_values(ascending=False)
    top5 = s.iloc[:5]
    top5_total = top5.sum()
    total = s.sum()
    return top5_total / total

In [3]:
result = college.groupby('state').agg({'pop': find_top_5})
result.head()

state,pop
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486


### Inspecting the custom function
It can be difficult to see what is taking place inside a custom grouping function. One way to track and debug the code is to output the variables you would like to inspect. You can use the `print` function, but I recommend the `display` function from the `IPython.display` module, which outputs DataFrames styled in the same manner as they would be in the notebook. Below, we define a new custom function that displays the result of each line and then raises an error to stop execution.

Note that the function runs more than once on the first group (AK) in the output below, which doesn't seem to make sense since an error is always raised before the function ends. Depending on the version, Pandas may call a custom grouping function on the first group more than once, regardless of whether it produces an error.

In [4]:
from IPython.display import display

def find_top_5_display(s):
    s = s.sort_values(ascending=False)
    display('sorted schools', s)
    
    top5 = s.iloc[:5]
    display('top 5 schools', top5)
    
    top5_total = top5.sum()
    display('top 5 total', top5_total)
    
    total = s.sum()
    display('state total', total)
    
    answer = top5_total / total
    display('answer', answer)
    raise  # a bare raise errors out here, stopping execution at the first group

In [5]:
college.groupby('state').agg({'pop': find_top_5_display})

'sorted schools'

60      12865.0
62       5536.0
66       3256.0
63       1428.0
65        889.0
67        479.0
64        275.0
5171      109.0
5417       68.0
61         27.0
Name: pop, dtype: float64

'top 5 schools'

60    12865.0
62     5536.0
66     3256.0
63     1428.0
65      889.0
Name: pop, dtype: float64

'top 5 total'

23974.0

'state total'

24932.0

'answer'

0.9615754853200706

'sorted schools'

60      12865.0
62       5536.0
66       3256.0
63       1428.0
65        889.0
67        479.0
64        275.0
5171      109.0
5417       68.0
61         27.0
Name: pop, dtype: float64

'top 5 schools'

60    12865.0
62     5536.0
66     3256.0
63     1428.0
65      889.0
Name: pop, dtype: float64

'top 5 total'

23974.0

'state total'

24932.0

'answer'

0.9615754853200706

'sorted schools'

60      12865.0
62       5536.0
66       3256.0
63       1428.0
65        889.0
67        479.0
64        275.0
5171      109.0
5417       68.0
61         27.0
Name: AK, dtype: float64

'top 5 schools'

60    12865.0
62     5536.0
66     3256.0
63     1428.0
65      889.0
Name: AK, dtype: float64

'top 5 total'

23974.0

'state total'

24932.0

'answer'

0.9615754853200706

RuntimeError: No active exception to reraise
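
An alternative way to inspect a single group, without raising an error, is to extract it with the `get_group` method and call the custom function on it directly. The sketch below assumes a small made-up DataFrame (`toy`) and a hypothetical `find_top_2` function that mirrors `find_top_5` with a smaller cutoff.

```python
import pandas as pd

# Made-up data standing in for the college DataFrame
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'pop': [300.0, 100.0, 200.0, 50.0, 150.0],
})

def find_top_2(s):
    # Same logic as find_top_5, with a cutoff of 2 so the toy data stays small
    s = s.sort_values(ascending=False)
    return s.iloc[:2].sum() / s.sum()

# Extract the 'pop' values of one group and run the function on them once
aa_pop = toy.groupby('state')['pop'].get_group('AA')
aa_share = find_top_2(aa_pop)

print(aa_pop.tolist(), aa_share)
```

Because the function is called exactly once here, any intermediate value can be printed or displayed without repeated output.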

### Solution 1 Performance
On my machine, this solution completes in about 76 ms in the timing run shown here.

In [6]:
%timeit -n 5 college.groupby('state').agg({'pop': find_top_5})

76.3 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


### Using `apply` instead of `agg`
You can also use the `apply` method, which has slightly different syntax and returns a Series instead of a DataFrame. Performance is similar.

In [7]:
%timeit -n 5 college.groupby('state')['pop'].apply(find_top_5)

83.7 ms ± 9.9 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
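
The difference in return type between the two calls can be checked on a small example. This is a sketch on made-up data; `largest_share` is a hypothetical function used only for illustration.

```python
import pandas as pd

# Made-up data standing in for the college DataFrame
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'pop': [300.0, 100.0, 200.0, 50.0, 150.0],
})

def largest_share(s):
    # Share of the group total held by the single largest school
    return s.max() / s.sum()

# agg with a dictionary returns a DataFrame with one column per key
via_agg = toy.groupby('state').agg({'pop': largest_share})

# apply on the selected column returns a Series indexed by state
via_apply = toy.groupby('state')['pop'].apply(largest_share)

print(type(via_agg).__name__, type(via_apply).__name__)
```

Both objects hold the same values; only the container differs.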


## Solution 2: Sort all data first
Instead of sorting the data within the custom function, we can sort the entire DataFrame first. Pandas preserves the order of the rows within each group, so we don't need to worry about losing this sorted order during grouping. Below, we create a new custom function that assumes the data is already sorted.

In [8]:
cs = college.sort_values('pop', ascending=False)

def find_top_5_sorted(s):
    top5 = s.iloc[:5]
    return top5.sum() / s.sum()

cs.groupby('state').agg({'pop': find_top_5_sorted}).head()

state,pop
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486
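
Solution 2 relies on rows keeping their sorted order inside each group. A quick way to check that on a small made-up DataFrame is to collect the values of each group in the order `groupby` yields them.

```python
import pandas as pd

# Made-up data with the two states interleaved
toy = pd.DataFrame({
    'state': ['AA', 'BB', 'AA', 'BB', 'AA'],
    'pop': [100.0, 50.0, 300.0, 150.0, 200.0],
})

toy_sorted = toy.sort_values('pop', ascending=False)

# Iterating over the grouped column yields (group name, Series) pairs
order_in_groups = {
    state: values.tolist()
    for state, values in toy_sorted.groupby('state')['pop']
}

print(order_in_groups)
```

Each list comes back in descending order, matching the sort applied before grouping.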


### Solution 2 Performance
On my machine, this performs about three times as fast as Solution 1, based on the timings shown here.

In [9]:
%timeit -n 5 cs.groupby('state').agg({'pop': find_top_5_sorted})

25.4 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


### Why the performance improvement?
One thing you must be aware of when using custom grouping functions is their potential for poor performance. Every line of code in the custom function must be re-run for each group. If you can apply a function to the entire DataFrame instead of within a custom function, you will almost always see a nice performance gain.
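
To see the per-group cost directly, a custom function can record each time it is called. The sketch below uses made-up data and a hypothetical `counted_sum` function. The exact number of calls may vary between pandas versions, so the only assumption is that it is at least the number of groups.

```python
import pandas as pd

# Made-up data with three groups
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'BB', 'BB', 'CC'],
    'pop': [300.0, 100.0, 50.0, 150.0, 75.0],
})

calls = []

def counted_sum(s):
    # Record the size of every group this function is handed
    calls.append(len(s))
    return s.sum()

result = toy.groupby('state')['pop'].agg(counted_sum)
n_groups = toy['state'].nunique()

print('function calls:', len(calls), 'groups:', n_groups)
```

With thousands of groups, any work placed inside the function is repeated thousands of times, which is why moving the sort outside the function helps.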

## Solution 3: No custom function
It's possible to eliminate the custom function entirely. Pandas `groupby` objects have a `head` method that returns the first rows of each group, which are the largest schools once the data is sorted. This eliminates the need to call `s.iloc[:5]` within the custom function. Let's see this portion now. Notice that we create a new variable `grouped` to reference the `groupby` object. We will use this again later.

In [10]:
cs = college.sort_values('pop', ascending=False)
grouped = cs.groupby('state')
cs_top5 = grouped.head(5)
cs_top5.head(10)

,name,state,pop
7116,University of Phoenix-Arizona,AZ,151558.0
1189,Ivy Tech Community College,IN,77657.0
793,Miami Dade College,FL,61470.0
3711,Lone Star College System,TX,59920.0
3669,Houston Community College,TX,58084.0
725,University of Central Florida,FL,52280.0
3880,Liberty University,VA,49340.0
3765,Texas A & M University-College Station,TX,46941.0
5817,American Public University System,WV,44924.0
1299,Ashford University,CA,44744.0


Only the top 5 rows for each state are returned. We output the shape of this DataFrame below. This dataset includes US territories, which is why there are more than 250 rows (50 states * 5).

In [11]:
cs_top5.shape

(270, 3)
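
One detail worth keeping in mind is that `head(n)` returns at most `n` rows per group, so a group with fewer rows contributes everything it has. This would be consistent with a ratio of 1.0 for a state such as AS. A minimal sketch on made-up data, using a cutoff of 2:

```python
import pandas as pd

# Made-up data: 'AA' has three schools, 'BB' has only one
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB'],
    'pop': [300.0, 100.0, 200.0, 50.0],
})

top2 = toy.sort_values('pop', ascending=False).groupby('state').head(2)

# Count how many rows each state contributed
rows_per_state = top2['state'].value_counts().to_dict()

print(rows_per_state)
```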

### Perform another `groupby`
From here, we must perform another `groupby` on this smaller dataset to get the total for these top 5 schools.

In [12]:
top5_total = cs_top5.groupby('state').agg({'pop': 'sum'})
top5_total.head()

state,pop
AK,23974.0
AL,92059.0
AR,56985.0
AS,1276.0
AZ,287015.0


Now we find the total for all schools in each state.

In [13]:
total = grouped.agg({'pop': 'sum'})
total.head()

state,pop
AK,24932.0
AL,248298.0
AR,134820.0
AS,1276.0
AZ,520439.0


Now, we can divide the previous two to get our result.

In [14]:
answer = top5_total / total
answer.head()

state,pop
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486


### Solution 3 Performance
Eliminating the custom function altogether gives us the best performance, roughly 10x faster on my machine than Solution 1, based on the timings shown here.

In [15]:
%%timeit -n 5

cs = college.sort_values('pop', ascending=False)
grouped = cs.groupby('state')
cs_top5 = grouped.head(5)
top5_total = cs_top5.groupby('state').agg({'pop': 'sum'})
total = grouped.agg({'pop': 'sum'})
answer = top5_total / total

7.91 ms ± 1.78 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
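
As a sanity check, the steps of Solution 3 can be wrapped in a function and compared against a custom-function version on a small example. The helper names `top_n_share` and `top_n_share_custom` and the `toy` data are hypothetical, and the sketch works with Series (`['pop'].sum()`) instead of the one-column DataFrames used above.

```python
import pandas as pd

def top_n_share(df, n=5):
    # Solution 3 style: sort once, then use only built-in grouping methods
    ordered = df.sort_values('pop', ascending=False)
    grouped = ordered.groupby('state')
    top_total = grouped.head(n).groupby('state')['pop'].sum()
    total = grouped['pop'].sum()
    return top_total / total

def top_n_share_custom(df, n=5):
    # Solution 1 style: sort and slice inside a custom function
    return df.groupby('state')['pop'].apply(
        lambda s: s.sort_values(ascending=False).iloc[:n].sum() / s.sum()
    )

# Made-up data standing in for the college DataFrame
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'pop': [300.0, 100.0, 200.0, 50.0, 150.0],
})

fast = top_n_share(toy, n=2)
slow = top_n_share_custom(toy, n=2)
print(fast)
print(slow)
```

Both versions should agree on every group; only the amount of work done per group differs.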


## Avoid custom grouping functions if possible
For complex grouping situations, you will be tempted to write your own custom function to do all of the work. This is dangerous and can lead to extremely inefficient code. Pandas cannot optimize custom functions. It has a limited number of builtin grouping methods. All of these are optimized and should yield better performance. The following are some guidelines when approaching a complex grouping situation.

### Use a builtin grouping method if it exists
If a builtin grouping method exists then you should use it over any custom function.
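
For instance, a group total can be computed either with the builtin `sum` method or with a custom function that does the same thing. The sketch below, on made-up data, shows that the results match, so the custom version adds nothing except per-group overhead.

```python
import pandas as pd

# Made-up data standing in for the college DataFrame
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB'],
    'pop': [300.0, 100.0, 200.0, 50.0, 150.0],
})

# Builtin grouping method
builtin_total = toy.groupby('state')['pop'].sum()

# Equivalent custom function, called on each group separately
custom_total = toy.groupby('state')['pop'].agg(lambda s: s.sum())

print(builtin_total.tolist(), custom_total.tolist())
```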

### Operate on the entire DataFrame if possible and not on individual groups
Solution 1 above was the slowest because it performed all of its calculations on each group within the custom function. If you can perform an operation on the entire DataFrame outside of the custom grouping function, you will see much better performance.

### Summary: optimizing Pandas performance
In general, use builtin Pandas methods whenever they exist and avoid custom functions if at all possible. Solution 3 uses no custom functions and performs the best.

# Become a pandas expert

If you are looking to completely master the pandas library and become a trusted expert for doing data science work, check out my book [Master Data Analysis with Python][1]. It comes with over 300 exercises with detailed solutions covering the pandas library in-depth.

[1]: https://www.dunderdata.com/master-data-analysis-with-python