# More Groupby Methods

There are many groupby methods besides `agg`, `filter`, and `transform`. In this chapter, you'll learn how to discover and use them.

## Kinds of groupby attributes and methods

Every groupby method acts on either a Series or a DataFrame. If a single column name appears in the brackets following the call to the `groupby` method, then it acts on a Series. If there are no brackets, or a list of column names in the brackets, then it acts on a DataFrame. Let's see some examples of the two kinds of `groupby` methods available, beginning by reading in the San Francisco employee compensation dataset.

In [None]:
import pandas as pd
import numpy as np
sf_emp = (pd.read_csv('../data/sf_employee_compensation.csv')
            .drop(columns='job'))
sf_emp.head(3)

All grouping begins with a call to the `groupby` method by providing it the grouping column(s). Let's assign the object returned when grouping by organization group and output its type.

In [None]:
g_df = sf_emp.groupby('organization group')
type(g_df)

The technical name for this object is `DataFrameGroupBy`, and its methods act on the sub-DataFrame of each group. Let's call the same `groupby` method, but this time put the `salaries` column in brackets following it. A `SeriesGroupBy` is produced, and its methods can only act on the salaries Series.

In [None]:
g_series = sf_emp.groupby('organization group')['salaries']
type(g_series)
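
As a minimal sketch of the rule above, the following uses a small made-up DataFrame (the `toy` frame and its column names are hypothetical, not the compensation data) to show how the brackets determine which kind of groupby object you get. Note that a list in the brackets produces a DataFrameGroupBy even when the list holds a single column.

```python
import pandas as pd

# Hypothetical toy data standing in for the compensation dataset
toy = pd.DataFrame({'group': ['a', 'b', 'a', 'b'],
                    'salaries': [10, 20, 30, 40],
                    'benefits': [1, 2, 3, 4]})

no_brackets = toy.groupby('group')                # DataFrameGroupBy
one_column = toy.groupby('group')['salaries']     # SeriesGroupBy
column_list = toy.groupby('group')[['salaries']]  # DataFrameGroupBy (list of columns)

for obj in (no_brackets, one_column, column_list):
    print(type(obj).__name__)
```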

### `GroupBy` API

Take a look at the [`GroupBy` API in the official documentation][1] for a list of all the possible methods. Most of them will overlap with the normal DataFrame methods that were previously covered.

[1]: https://pandas.pydata.org/docs/reference/groupby.html

## Finding all available attributes and methods

The vast majority of DataFrameGroupBy and SeriesGroupBy attributes and methods overlap. Here we retrieve nearly all of the public attributes and methods for each (names of 15 characters or longer are skipped) and print the ones from the SeriesGroupBy to the screen.

In [None]:
public_gbs_methods = [m for m in dir(g_series) 
                      if not m.startswith('_') and len(m) < 15]
public_gbdf_methods = [m for m in dir(g_df) if not m.startswith('_')  
                       and m not in sf_emp.columns and len(m) < 15]
for i, method in enumerate(public_gbs_methods):
    end = '\n' if i % 4 == 3 else ''
    print(f'{method:16}', end=end)

You should be familiar with many of these attributes and methods as they overlap with the ones available directly from a normal Series. We can take the set difference to determine the attributes and methods unique to each one. These are unique to SeriesGroupBy.

In [None]:
set(public_gbs_methods) - set(public_gbdf_methods)

These are unique to DataFrameGroupBy objects.

In [None]:
set(public_gbdf_methods) - set(public_gbs_methods)

## Calling single aggregation methods

You can bypass the `agg` method by calling the aggregation method directly from one of the groupby objects. The disadvantage is that you won't be able to rename the resulting column. Here, we take the maximum of the salaries column for each organization group.

In [None]:
g_series.max()

With `g_df`, the maximum of all non-grouping columns is returned.

In [None]:
g_df.max()
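
The trade-off described above can be sketched on a small made-up DataFrame (`toy` and the name `max_salary` are hypothetical). The direct call and `agg('max')` give the same result, while named aggregation through `agg` is one way to choose the resulting column name.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'group': ['a', 'b', 'a', 'b'],
                    'salaries': [10, 20, 30, 40]})
g = toy.groupby('group')['salaries']

direct = g.max()                   # Series that keeps the name 'salaries'
via_agg = g.agg('max')             # same values as the direct call
renamed = g.agg(max_salary='max')  # named aggregation sets the column name

print(direct.equals(via_agg))
print(renamed)
```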

### The entire syntax

It's rare that the intermediate call to the `groupby` method will be assigned to a variable as we've done in this chapter. We are only doing this to avoid the repetitive nature of calling the same method over again. When completing these operations in practice, you'll likely begin with the original DataFrame, call the `groupby` method, and then chain the grouping method you desire. Below, the full syntax is given for the last two operations.

```python
sf_emp.groupby('organization group')['salaries'].max()
sf_emp.groupby('organization group').max()
```

### More aggregating methods

Most of the aggregating methods available to normal Series and DataFrames are available to their groupby counterparts. Nearly all of them return a single value for each group. However, the `describe` method returns many aggregations. You can provide it a list of percentiles to return as well. Here, we get many summary statistics for the salaries column on all the groups.

In [None]:
(g_series.describe(percentiles=[0.01, 0.2, 0.5, 0.8, 0.99])
         .round(0)
         .style.format('{:,.0f}'))
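
To see the shape of what `describe` returns without the styling, here is a small sketch on made-up data (the `toy` frame is hypothetical). Each requested percentile becomes its own column alongside the usual count, mean, standard deviation, minimum, and maximum.

```python
import pandas as pd

# Hypothetical toy data with two groups of five values each
toy = pd.DataFrame({'group': ['a'] * 5 + ['b'] * 5,
                    'salaries': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]})

stats = toy.groupby('group')['salaries'].describe(percentiles=[0.1, 0.5, 0.9])

# One row per group, one column per statistic
print(stats.columns.tolist())
print(stats)
```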

Calling the `describe` method on `g_df` would return a very wide DataFrame with all of these statistics calculated on each numeric column.

### The `size` method

The `size` method returns the number of rows in each group, which gives the same counts as calling the `value_counts` method on the grouping column. Because it offers fewer options (no sorting or normalization), I prefer `value_counts`.

In [None]:
g_series.size().head(10)
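
The sketch below compares `size` with two close relatives on made-up data (the `toy` frame is hypothetical). It assumes default settings: `size` counts every row including those with missing values, `count` skips missing values, and `value_counts` on the grouping column gives the same totals as `size` but ordered by frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary in group 'a'
toy = pd.DataFrame({'group': ['b', 'a', 'b', 'b', 'a'],
                    'salaries': [10.0, np.nan, 30.0, 40.0, 50.0]})

by_size = toy.groupby('group')['salaries'].size()    # rows per group, NaN included
by_count = toy.groupby('group')['salaries'].count()  # non-missing values only
by_vc = toy['group'].value_counts()                  # same totals, most frequent first

print(by_size, by_count, by_vc, sep='\n\n')
```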

## `head`, `tail`, and `nth` groupby methods

The `head` and `tail` methods return, by default, the first and last five rows, respectively, of each group. Set the parameter `n` to an integer to control the number of rows returned per group. Here we return the first two rows of the entire DataFrame for each organization group. Notice that the order of the rows is preserved and they are not sorted by the grouping column.

In [None]:
g_df.head(2)

The same operation on `g_series` is harder to read because only the salary values are returned, without the context of the grouping column. The index is preserved and can be used to verify correctness.

In [None]:
g_series.head(2)

The `nth` groupby method allows you to select exactly which rows from the group are returned using integer location. Pass it a single integer or a list of integers. For instance, the following returns rows with integer location 5 and 10 from each group.

In [None]:
g_df.nth([5, 10])
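
Here is a small sketch of `head` and `nth` on made-up data (the `toy` frame is hypothetical). Groups that are too short simply contribute no row for a given location. The index labels of the `nth` result have differed between pandas versions, so this sketch only inspects the values.

```python
import pandas as pd

# Hypothetical toy data: group 'a' has three rows, group 'b' has two
toy = pd.DataFrame({'group': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1, 2, 3, 4, 5]})
g = toy.groupby('group')

first_rows = g.head(1)   # first row of each group, original row order kept
second_rows = g.nth(1)   # row at integer location 1 within each group
third_rows = g.nth(2)    # only group 'a' is long enough to have location 2

print(first_rows['value'].tolist())
print(sorted(second_rows['value'].tolist()))
print(third_rows['value'].tolist())
```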

### Groupby methods unique to Series

A few methods, such as `nlargest`, `nsmallest`, and `unique`, exist only on SeriesGroupBy objects. Here, we get the two largest salaries in each group.

In [None]:
s = g_series.nlargest(2)
s

### Drop a level from the index with `droplevel`

Notice that a multilevel index was created. The inner level contains the index labels for the row with that salary. It isn't very meaningful here and can be dropped with the `droplevel` method. Pass it the integer location or name of the level to drop and it will return a Series without that level. Index levels are numbered beginning at 0 from the outside. 

In [None]:
s.droplevel(1)
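
The same two steps can be sketched on made-up data (the `toy` frame is hypothetical) to make the two index levels easier to see: the outer level holds the group label and the inner level holds the original row label.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'group': ['a', 'b', 'a', 'b', 'a'],
                    'salaries': [10, 20, 30, 40, 50]})

top2 = toy.groupby('group')['salaries'].nlargest(2)
print(top2.index.nlevels)   # outer level: group label, inner level: original row label

flat = top2.droplevel(1)    # drop the inner level by integer location
print(flat)
```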

## Non-aggregating methods

Many other methods do not aggregate and instead return a Series or DataFrame with the same length as the group. For the most part, they work exactly the same as they do on regular Series or DataFrames. To help teach these methods, a small example DataFrame will be created.

In [None]:
df = pd.DataFrame({'item': ['A', 'B', 'A', 'A', 'B', 'A', 'B', 'B'],
                   'quantity': [5, 3, 8, np.nan, 2, 15, np.nan, 6]})
df

We'll use a SeriesGroupBy object to showcase these methods. 

In [None]:
g_series = df.groupby('item')['quantity']

All of the methods in this section preserve the order of the original values. They do NOT sort by the group. Take, for instance, the `cumsum` method, which accumulates the sum beginning from the top by group.

In [None]:
g_series.cumsum()

A Series is returned, but it is difficult to decipher without being attached to the original DataFrame. Let's add it as a column and then re-examine the output.

In [None]:
df['quantity_cumsum'] = g_series.cumsum()
df

The quantity column is accumulated independently for each group. The method `cumcount` is unique to groupby objects and provides the integer location of each row by group beginning with 0.

In [None]:
df['group_iloc'] = g_series.cumcount()
df

Each quantity can be ranked using the `rank` method. Below, the largest quantity of each group gets ranked 1.

In [None]:
df['group_rank'] = g_series.rank(ascending=False)
df

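We fill in missing values with the previous known non-missing value of that group with the `ffill` method. Older versions of pandas also allowed `fillna(method='ffill')` here, but that form is deprecated in recent releases.

In [None]:
df['group_ffill'] = g_series.ffill()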
df
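
To illustrate why filling by group matters, this sketch contrasts a plain forward fill with the `ffill` groupby method on made-up data (the `toy` frame is hypothetical). A plain forward fill borrows the value from the row above regardless of its group, while the grouped version only looks at earlier rows of the same group and leaves a value missing when there is none.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first 'B' row has no earlier 'B' value to borrow
toy = pd.DataFrame({'item': ['A', 'B', 'A', 'B'],
                    'quantity': [5.0, np.nan, np.nan, 2.0]})

plain = toy['quantity'].ffill()                    # fills from the row above, any group
grouped = toy.groupby('item')['quantity'].ffill()  # fills only within the same group

print(pd.DataFrame({'item': toy['item'], 'plain': plain, 'grouped': grouped}))
```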

### Finding the highest scoring movie for each year

Let's read in the movie dataset and then find the highest scoring movie for each year.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head(2)

Because the title is in the index, calling the `idxmax` method on the `imdb_score` column returns the title of the movie with the highest score for each year.

In [None]:
movie.groupby('year')['imdb_score'].idxmax().tail()

Use the `agg` method to return both the score and movie title.

In [None]:
movie.groupby('year')['imdb_score'].agg(['max', 'idxmax']).tail()
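
The same pattern can be checked by hand on a tiny made-up table (the `toy_movie` frame and its titles are hypothetical). Because the titles are the index, `idxmax` reports a title, and `max` reports the matching score.

```python
import pandas as pd

# Hypothetical toy data with the title in the index
toy_movie = pd.DataFrame(
    {'year': [2000, 2000, 2001, 2001],
     'imdb_score': [7.1, 8.3, 6.5, 6.0]},
    index=pd.Index(['Film A', 'Film B', 'Film C', 'Film D'], name='title'))

best = toy_movie.groupby('year')['imdb_score'].agg(['max', 'idxmax'])
print(best)
```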

## Summary of other groupby methods

The other groupby methods operate similarly to their DataFrame/Series counterparts, but do so on each group independently.

## Exercises

Execute the next cell to read in some of the columns from the flights dataset and use it to answer the following exercises.

In [None]:
import pandas as pd
cols = ['date', 'airline', 'origin', 'dest', 'dep_time', 'arr_time',
        'cancelled', 'air_time', 'distance', 'carrier_delay']
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'], usecols=cols)
flights.head(3)

### Exercise 1

<span style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use the `nth` groupby method.</span>

### Exercise 2

<span style="color:green; font-size:16px">For every origin and destination combination, select the 500th flight.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the date of the 10th cancelled flight for each airline.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find the average carrier delay for each origin and destination combination with more than 300 flights.</span>