# Miscellaneous Grouping Functionality

Believe it or not, there is even more grouping functionality that remains to be covered in pandas. This chapter covers a few lesser-known grouping features.

## Grouping by columns not in the DataFrame

Thus far, we've only passed strings (or a list of strings) to the `groupby` method. Each of these strings refers to a specific column in the DataFrame. Let's review this simple concept by reading in the bikes dataset and finding the median trip duration by gender.

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv', na_values=-9999)
bikes.head(3)

We use the syntax that returns the result as a Series.

In [None]:
bikes.groupby('gender')['tripduration'].median()

Instead of passing in the string name of the column, you can select the column as a Series and pass it to the `groupby` method instead.

In [None]:
s = bikes['gender']
bikes.groupby(s)['tripduration'].median()

The same result is produced, and since this syntax is a bit more involved, it's best to use the string name for simplicity. However, the example does show that it is possible to group by other Series that are not in the DataFrame. Take a look at the following Series, which has nothing to do with the bikes DataFrame. It's just a random sample of strings with the same length as the DataFrame.

In [None]:
n = len(bikes)
s_fruits = pd.Series(['Apple', 'Banana', 'Cantaloupe', 'Durian', 'Elderberry'])
s_fruits = s_fruits.sample(n=n, replace=True, random_state=1, ignore_index=True)
s_fruits.head()

As long as the Series shares the DataFrame's index (here, both have the default integer index and the same length), it may be passed to the `groupby` method, where its unique values form distinct groups. As usual, these unique values are placed in the index.

In [None]:
bikes.groupby(s_fruits)['tripduration'].agg(['size', 'mean'])
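The following is a minimal sketch of how an external Series is matched to the DataFrame. pandas aligns a grouping Series on index labels rather than on position, so a reordered Series with the same labels yields the same groups. The toy DataFrame and the names `labels` and `shuffled` are made up for illustration and are not part of the bikes data.

```python
import pandas as pd

# Toy stand-in for the bikes DataFrame (values are made up)
df = pd.DataFrame({'tripduration': [10, 20, 30, 40]})

# External grouping Series with the same default integer index as df
labels = pd.Series(['a', 'b', 'a', 'b'])
result = df.groupby(labels)['tripduration'].sum()

# Same label-to-row mapping, but stored in a different order;
# pandas aligns on the index labels before forming the groups
shuffled = pd.Series(['b', 'a', 'b', 'a'], index=[1, 0, 3, 2])
result_shuffled = df.groupby(shuffled)['tripduration'].sum()

print(result)
print(result_shuffled)
```

Both calls place rows 0 and 2 in group `a` and rows 1 and 3 in group `b`.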

### Mixing other Series and strings

This other Series may be used together with the normal strings that refer to column names to group by multiple columns.

In [None]:
bikes.groupby([s_fruits, 'gender'])['tripduration'].agg(['size', 'mean'])

One common use case is binning a numeric column. Here, we use `qcut` to bin temperature into six quantile-based bins, each holding roughly the same number of rows, creating a Series, and then count the values in each bin.

In [None]:
temp_bins = pd.qcut(bikes['temperature'], 6)
temp_bins.value_counts()
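To make the behavior of `qcut` concrete, here is a small sketch comparing it with `cut` on made-up values (the `temps` Series below is hypothetical, not the bikes temperature column). `qcut` places bin edges at quantiles so that the bins hold roughly equal counts, while `cut` with an integer splits the range into equal-width intervals.

```python
import pandas as pd

# Made-up values with one large outlier
temps = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

by_quantile = pd.qcut(temps, 2)  # edges at quantiles: equal counts per bin
by_width = pd.cut(temps, 2)      # equal-width intervals over the full range

quantile_counts = by_quantile.value_counts().sort_index().tolist()
width_counts = by_width.value_counts().sort_index().tolist()

print(quantile_counts)  # [5, 5]
print(width_counts)     # [9, 1]
```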

This new Series may be used by itself or in combination with other column names to group. Take note that the result is assigned to the variable name `s_gt`, which will be used in an upcoming section.

In [None]:
s_gt = bikes.groupby(['gender', temp_bins])['tripduration'].median()
s_gt

Similarly, the `pivot_table` method accepts other Series as well. Here, we reproduce the results from above, but pivot the temperature bins so that they become the new column values.

In [None]:
bikes.pivot_table(index='gender', columns=temp_bins, 
                  values='tripduration', aggfunc='median')
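As a rough check on how the two approaches relate, the sketch below builds the same table twice on made-up data: once with `pivot_table` and once with `groupby` followed by `unstack`. The DataFrame and the `temp_label` Series are hypothetical stand-ins for the bikes data and the temperature bins.

```python
import pandas as pd

# Toy data (values are made up)
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'tripduration': [10, 30, 20, 40, 50, 60],
})

# External Series that is not a column of df
temp_label = pd.Series(['cold', 'warm', 'cold', 'warm', 'cold', 'warm'],
                       name='temp')

# Pivot the external Series into the columns
pivoted = df.pivot_table(index='gender', columns=temp_label,
                         values='tripduration', aggfunc='median')

# Group by both keys, then move the second index level to the columns
unstacked = df.groupby(['gender', temp_label])['tripduration'].median().unstack()

print(pivoted)
print(unstacked)
```

Both tables have gender in the index and the external labels as columns.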

## Grouping Series and aggregating other columns

The object calling the `groupby` method has been a DataFrame in all of our previous examples. Series also have a `groupby` method and, as we saw above, it's not necessary for the grouping column to be part of the calling object. Here, we select the trip duration column as a Series and group using the temperature bins created above. The aggregations are automatically applied to the Series values.

In [None]:
td = bikes['tripduration']
td.groupby(temp_bins).agg(['size', 'mean', 'median', 'min', 'max'])

## Grouping by index levels

You might be wondering how to use the Series `groupby` method without passing it another Series to act as the grouping column. Series, like DataFrames, can have multiple index levels that act like columns. The `s_gt` Series created above has two index levels. Each of their names may be retrieved with the `names` Index attribute.

In [None]:
s_gt.index.names

These index levels may be used just as if they were DataFrame columns with their names passed to the `groupby` method as strings. The values of the Series are aggregated.

In [None]:
s_gt.groupby('gender').max()

It's also possible to pass the integer location of the index level to the `level` parameter (numbering begins at 0 with the left-most level). Here, we group by the second level, the temperature bins.

In [None]:
s_gt.groupby(level=1).max()

Note that DataFrames may also be grouped by their index levels in exactly the same manner.
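A minimal sketch of grouping a DataFrame by its index levels follows. The two-level index and the column values are made up for illustration; only the `groupby` calls mirror the Series examples above.

```python
import pandas as pd

# Hypothetical two-level index, similar in shape to the index of s_gt
index = pd.MultiIndex.from_tuples(
    [('F', 'cold'), ('F', 'warm'), ('M', 'cold'), ('M', 'warm')],
    names=['gender', 'temp'],
)
df = pd.DataFrame({'median_duration': [30, 35, 20, 50],
                   'trips': [2, 1, 1, 2]}, index=index)

# Group by the name of an index level, as if it were a column
by_name = df.groupby('gender').max()

# Group by the integer position of an index level
by_position = df.groupby(level=1).max()

print(by_name)
print(by_position)
```

Each column of the DataFrame is aggregated within the groups formed by the chosen index level.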

## Changing the direction of grouping

As we've seen, many DataFrame methods have an `axis` parameter available to change the default direction of the operation. For most methods, we set `axis=1` to change the operation from vertical to horizontal. The `groupby` method is no different in this regard. Let's read in the `sweden_age` dataset containing Sweden's population by age for each year from 1980 to 2020. The year is placed in the index and the remaining columns represent each age from 0 to 100, where 100 represents all those aged 100 and above.

In [None]:
sweden_age = pd.read_csv('../data/covid/sweden_age.csv', index_col='year')
sweden_age.tail()

Let's say we are interested in finding the population of particular age bins per year. We use the `cut` function to bin the age columns, which are read in as strings and must be converted to integers first.

In [None]:
age_bins = pd.cut(sweden_age.columns.astype('int64'), 
                  bins=[0, 5, 15, 25, 35, 50, 65, 80, 101], 
                  right=False)
age_bins.categories

We created eight unique bins, each spanning a different range of ages. The variable `age_bins` contains a total of 101 values, one for each column.

In [None]:
len(age_bins)
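To see what `right=False` does, here is a short sketch on a handful of made-up age labels (the `ages` Index is hypothetical). With `right=False`, each interval includes its left edge and excludes its right edge, so age 5 falls in the second bin rather than the first.

```python
import pandas as pd

# A few age labels stored as strings, as in the column names
ages = pd.Index(['0', '4', '5', '14', '15'])

binned = pd.cut(ages.astype('int64'), bins=[0, 5, 15, 25], right=False)

# Integer code of the bin each age falls into
codes = list(binned.codes)

print(binned.categories)
print(codes)  # [0, 0, 1, 1, 2]
```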

We can now use these bins to group the columns together by setting `axis=1`. The first five columns form a group, with the next 10 columns forming their own independent group, and so on. We now have the population by year within specific age groups.

In [None]:
sweden_age.groupby(age_bins, axis=1).sum().tail()
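Recent pandas releases (2.1 onward) emit a deprecation warning when `axis=1` is passed to `groupby`, and the warning suggests transposing the DataFrame instead. The sketch below shows that alternative on made-up data; the `pop` DataFrame is a hypothetical miniature of `sweden_age`, and `observed=False` is passed explicitly so that empty bins would be kept regardless of the version's default.

```python
import pandas as pd

# Made-up population counts: rows are years, columns are ages stored as strings
pop = pd.DataFrame(
    [[10, 11, 12, 13], [20, 21, 22, 23]],
    index=pd.Index([2019, 2020], name='year'),
    columns=['0', '1', '2', '3'],
)

# Bin the column labels into ages 0-1 and ages 2-3
bins = pd.cut(pop.columns.astype('int64'), bins=[0, 2, 4], right=False)

# Transpose so ages become rows, group those rows by bin, then transpose back
grouped = pop.T.groupby(bins, observed=False).sum().T

print(grouped)
```

The result has one row per year and one column per age bin, the same shape that grouping along the columns produces.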

## Exercises

Read in the flights dataset and use it for the following exercises.

In [None]:
flights = pd.read_csv('../data/flights.csv')
flights.head(3)

### Exercise 1

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a carrier delay of 15 minutes or more. The values should be `False` if under 15 minutes and `True` if 15 minutes or over. Find the average distance flown by each group.</span>

### Exercise 2

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a weather delay of 15 minutes or more. Compute a cross tabulation of this Series with the similar one created above on carrier delay.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the total carrier delay by airline and origin as a Series with a multi-level index.</span>

### Exercise 4

<span style="color:green; font-size:16px">Using the Series from Exercise 3, calculate the total carrier delay by airline. Verify the result by calculating it directly from the original DataFrame.</span>

### Exercise 5

<span style="color:green; font-size:16px">Read in the Sweden deaths dataset found in the covid folder. Place the year column in the index and then calculate the total number of deaths by 10-year age interval per year. Then take this DataFrame and calculate the average deaths per age group by 5-year time spans.</span>