# Filter and Transform with Groupby

All of the groupby chapters thus far have focused on aggregation, which is the most common operation to perform. However, there are many more calculations we can perform on our groups besides returning a single value. In this chapter, we cover the groupby `filter` method, which filters entire groups as a whole from DataFrames and is similar to boolean selection. We'll also cover the groupby `transform` method, which performs an operation on each group and returns a Series or DataFrame the same length as the original.


## The groupby `filter` method

The groupby `filter` method does boolean selection for entire groups. The entire group is kept or rejected as a whole. A DataFrame with the same number of columns is returned. An example with a small fake dataset can help us learn how it works.

In [None]:
import pandas as pd
item = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D']
quantity = [2, 10, 3, 7, 6, 5, 2, 10, 12]
data = {'item': item, 'quantity': quantity}
df = pd.DataFrame(data)
df

### Review boolean selection

Before we filter by group, let's review boolean selection from earlier in the book. With boolean selection, we create a boolean Series (usually by using one of the comparison operators) and then pass this filter to *just the brackets*.  Here, we select all the rows with quantity greater than 4.

In [None]:
filt = df['quantity'] > 4
df[filt]

### Filter by group total

Instead of filtering by each individual row, we can filter entire groups. Let's say we want to keep the groups with a total quantity greater than 15. We could start by finding the total quantity using a basic groupby aggregation.

In [None]:
total = (df.groupby('item')
           .agg(total_quantity=('quantity', 'sum'))
           .reset_index())
total

We can use normal boolean selection to filter this aggregated DataFrame down to just the items that meet our criteria.

In [None]:
filt = total['total_quantity'] > 15
total[filt]

Let's get just the items that meet these criteria as a Series.

In [None]:
items = total.loc[filt, 'item']
items

From here we can use the `isin` method on our original DataFrame to get the desired result.

In [None]:
filt2 = df['item'].isin(items)
df[filt2]

### Shortcut with the groupby `filter`

The groupby `filter` method handles this procedure in a more direct manner. It is a somewhat complicated method so it will take some time to understand. You first must create a function that returns a single boolean value. pandas will implicitly pass this function a DataFrame consisting of just the rows of the current group.

Take a look at the `find_total` function below. It gets called once per group. It receives the current group as a DataFrame and assigns it to the variable `sub_df`. You can call any normal DataFrame methods on `sub_df`. Here, we select the quantity column and sum it. We then compare this sum against 15 and return a boolean.

In [None]:
def find_total(sub_df):
    return sub_df['quantity'].sum() > 15

We pass this function to the groupby `filter` method to complete the selection.

In [None]:
df.groupby('item').filter(find_total)
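
As a quick sanity check, the multi-step approach and the `filter` shortcut can be compared directly. The sketch below rebuilds the small dataset so it runs on its own; the names `totals`, `keep_items`, `manual`, and `shortcut` are chosen here for illustration only.

```python
import pandas as pd

# Same small dataset used throughout this section
df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# Manual approach: aggregate, pick the qualifying items, then use isin
totals = df.groupby('item')['quantity'].sum()
keep_items = totals[totals > 15].index
manual = df[df['item'].isin(keep_items)]

# Shortcut: the function returns one boolean per group
shortcut = df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

# Both approaches should keep exactly the same rows
print(manual.equals(shortcut))
```

Both results keep the original index labels, which makes it easy to trace the surviving rows back to the original DataFrame.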

### Viewing each "Sub-DataFrame"

The variable name `sub_df` was chosen to signify that the object being passed to `find_total` was indeed a DataFrame. Let's print out each sub-DataFrame during each call to `find_total` to inspect what is happening.

In [None]:
def find_total2(sub_df):
    print(sub_df, end='\n\n')
    return sub_df['quantity'].sum() > 15

This function will be called four times, once for each group, and print out the current sub-DataFrame and then return a boolean.

In [None]:
df.groupby('item').filter(find_total2)
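
Another way to inspect the groups, without writing a printing function, is to iterate over the groupby object itself, which yields the group label together with its sub-DataFrame. This is a small sketch for exploration; the dictionary `group_totals` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

group_totals = {}
# Each iteration yields the group label and the rows belonging to that group
for name, sub_df in df.groupby('item'):
    group_totals[name] = int(sub_df['quantity'].sum())
    print(name, sub_df.shape, group_totals[name])
```

The sub-DataFrames seen in the loop are the same pieces that pandas hands to the custom function during `filter`.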

### Getting a nicer display

Instead of printing to the screen, we can use the `display_html` function from the `IPython.display` module to get the same HTML output that we are accustomed to. This can be quite helpful when debugging. Below, a decorator function is created that outputs the styled DataFrame HTML to the screen inside a div element using the CSS flexbox layout (displays the DataFrames horizontally). pandas adds a `name` attribute to each sub-DataFrame that stores the current group, which is used as a caption for the DataFrame output.

In [None]:
from IPython import display
def display_wrapper(func):
    def wrapper(sub_df, data=None, width=900, margin=50, max_ct=8, max_rows=10):
        """
        Parameters
        ----------
        sub_df: sub-DataFrame of group passed from pandas groupby

        data: dictionary holding the html string and the current group number 
                 {'html': '', 'ct': 0}

        width: pixel width of output area

        margin: pixels between DataFrames

        max_ct: the maximum number of DataFrames to output to the screen

        max_rows: the maximum number of rows to show for each DataFrame
        """
        # Skip the display step when no tracking dictionary is supplied
        if data is not None and data['ct'] < max_ct:
            data['ct'] += 1
            caption = f'Group {sub_df.name}'
            # Convert a Series to a DataFrame for display only
            to_show = sub_df.to_frame() if isinstance(sub_df, pd.Series) else sub_df
            df_styled = to_show.style.set_caption(caption)
            data['html'] += df_styled.to_html(max_rows=max_rows)
            style = f'style="width:{width}px; display:flex; flex-wrap:wrap; gap:{margin}px"'
            final_html = f'<div {style}>{data["html"]}</div>'
            display.clear_output()
            display.display_html(final_html, raw=True)
        # Always call the wrapped function with the unmodified group
        return func(sub_df)
    return wrapper
    
@display_wrapper
def find_total3(sub_df):
    return sub_df['quantity'].sum() > 15

When we call the `filter` method now, we pass it a dictionary that will continue collecting the HTML of each sub-DataFrame as a string and the count of the group, which is limited by `max_ct`.

In [None]:
df.groupby('item').filter(find_total3, data={'html': '', 'ct': 0})

### Using an anonymous function

If the custom function can be written in a single line, you may use an anonymous function. The same sub-DataFrame is passed to it like above.

In [None]:
df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

### Summary of the groupby `filter` method

* Must write a custom function
* The custom function implicitly gets passed a DataFrame of just that group
* The custom function must return a single boolean value
* Each group is either kept or dropped based on the returned boolean value
* The end result is the original DataFrame (same number of columns) with the rows of groups that met the criteria

## Finding actors that appear in at least 25 movies

Let's complete a more practical example with the movie dataset by filtering for actors that have appeared in at least 25 movies. Only a few of the columns are read.

In [None]:
cols = ['title', 'year', 'content_rating', 'director_name', 
        'actor1', 'num_reviews', 'imdb_score']
movie = pd.read_csv('../data/movie.csv', usecols=cols)
movie.head(3)

### Create a custom function

Our custom function is very simple. We merely need to check if the number of rows of the implicitly passed DataFrame is 25 or more.

In [None]:
movie_top_actor = (movie.groupby('actor1')
                        .filter(lambda sub_df: len(sub_df) >= 25))
movie_top_actor.head()

In [None]:
movie_top_actor.shape

Let's verify the results by returning the frequency of occurrence for each `actor1` of the returned DataFrame.

In [None]:
movie_top_actor['actor1'].value_counts()
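
If the movie file is not at hand, the same count-based pattern can be tried on a tiny stand-in. The DataFrame `toy`, its values, and the threshold `min_movies` below are hypothetical and exist only to make the sketch runnable.

```python
import pandas as pd

# Hypothetical stand-in for the movie data: one row per movie
toy = pd.DataFrame({
    'actor1': ['Ann', 'Ann', 'Ann', 'Bob', 'Cy', 'Cy'],
    'title': ['m1', 'm2', 'm3', 'm4', 'm5', 'm6'],
})

min_movies = 2  # the chapter uses 25 on the real dataset

# Keep only the actors with at least min_movies rows
frequent = toy.groupby('actor1').filter(lambda sub_df: len(sub_df) >= min_movies)

# Every remaining actor should appear at least min_movies times
counts = frequent['actor1'].value_counts()
print(counts)
```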

## Multiple conditions

The custom function you create to filter your data can test as many conditions as you desire as long as it returns a single boolean value. Let's return all movies that have an actor1 with 25 or more appearances along with an average IMDB score greater than 7. We define a function that evaluates each condition.

In [None]:
def top_actor_score(sub_df):
    return len(sub_df) >= 25 and sub_df['imdb_score'].mean() > 7

Pass this function to the groupby `filter` method to get the result.

In [None]:
movie_top_actor_score = movie.groupby('actor1').filter(top_actor_score)
movie_top_actor_score.shape

Only 56 rows remain in this filtered DataFrame, fewer than in the previous one. Let's verify that each remaining actor1 meets both criteria.

In [None]:
(movie_top_actor_score.groupby('actor1')
                      .agg(num_movies=('actor1', 'size'),
                           mean_imdb_score=('imdb_score', 'mean')))
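
The same two-condition pattern can be sketched on a small made-up dataset where each group fails for a different reason. The DataFrame `toy`, the thresholds, and the function name `big_and_frequent` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical data: A fails the mean test, C fails the size test, B passes both
toy = pd.DataFrame({
    'item': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'quantity': [2, 3, 4, 6, 7, 8, 10],
})

def big_and_frequent(sub_df):
    # Python's `and` combines two single boolean values into one
    return len(sub_df) >= 3 and sub_df['quantity'].mean() > 5

kept = toy.groupby('item').filter(big_and_frequent)

# Verify both conditions on what remains
check = kept.groupby('item').agg(n=('quantity', 'size'),
                                 mean_q=('quantity', 'mean'))
print(check)
```

Because each condition evaluates to a single boolean, the plain `and` keyword is appropriate here, unlike row-wise boolean selection where `&` combines two boolean Series.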

## The groupby `transform` method

The groupby `transform` method performs a calculation on each group just like `agg`, but returns the same number of values as there are rows in the group.

### Aggregation with `transform`

The groupby `transform` method can perform an aggregation just like the `agg` method, but returns the aggregated value for each row in the group. Let's review the groupby `agg` method on the example dataset to sum the quantity of each item.

In [None]:
df.groupby('item').agg(total_quantity=('quantity', 'sum'))

We can perform the same aggregation with `transform`, but it returns the same number of rows as the original. The syntax for `transform` is different from that of `agg`. The aggregating column (quantity) is placed in the brackets following the call to `groupby` and then the `transform` method is called with the string name of the aggregation. A Series is returned.

In [None]:
df.groupby('item')['quantity'].transform('sum')

### Can append result to the original DataFrame

Since `transform` always returns an object the same length as the original DataFrame, it is common to append the result to the original DataFrame. 

In [None]:
df2 = df.copy()
df2['group total'] = df.groupby('item')['quantity'].transform('sum')
df2
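
Because the transformed Series lines up row for row with the original DataFrame, it can also be used directly for boolean selection. The sketch below, using variable names chosen here for illustration, suggests that this reproduces the earlier groupby `filter` result for groups totaling more than 15.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# Group total repeated on every row of its group
group_total = df.groupby('item')['quantity'].transform('sum')

# Ordinary boolean selection using the row-aligned totals
via_transform = df[group_total > 15]

# The groupby filter version of the same selection
via_filter = df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

print(via_transform.equals(via_filter))
```

On larger datasets, a built-in aggregation name passed to `transform` avoids calling a Python function once per group, which is often faster, though this depends on the data.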

### `transform` second use case - return a new value for each row in the group

You can also use `transform` to apply a specific transformation to each value in the group. For instance, we can divide each value in the group by the total of that specific group. For this, we need a custom function.

In [None]:
def divide_max(sub_series):
    return sub_series / sub_series.sum()

The custom function passed to `transform` must return either a single value or a sequence of values the same length as the group. In this instance, it returns a Series the same length as the group.

In [None]:
df2['perc_of_total'] = df.groupby('item')['quantity'].transform(divide_max).round(2)
df2
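
One way to check a share-of-total calculation is to confirm that the shares add up to 1 within every group. The sketch below also compares the custom-function result with an equivalent expression built from `transform('sum')`; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# Custom function: each value divided by its group's total
share = df.groupby('item')['quantity'].transform(lambda s: s / s.sum())

# Equivalent: divide the column by the row-aligned group totals
vectorized = df['quantity'] / df.groupby('item')['quantity'].transform('sum')

# Shares within each group should sum to 1
check = share.groupby(df['item']).sum()
print(check)
```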

### Implicitly passed a Series

The `transform` method is different than `filter` in that it implicitly passes just a Series of data to the custom function. You only have access to that one Series inside of the custom function and not all of the columns like you do with `filter`. It can be instructive to print out everything that is happening within the custom function. Here, we print out both the implicitly passed original Series and the returned transformed Series for each group.

In [None]:
def divide_max2(sub_series):
    result = sub_series / sub_series.sum()
    print("Original", sub_series, sep='\n', end='\n\n')
    print("Transformed", result, sep='\n', end='\n\n\n')
    return result

df.groupby('item')['quantity'].transform(divide_max2)

### `transform` must return either a single value or a Series the same length as the group

The custom function that you use with `transform` must return either a single value or a Series the same exact length as the group. Our first use case returned an aggregation (a single value), while our second returned the Series divided by the total of each group.
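
The two allowed return shapes can be seen side by side in a short sketch. The names `broadcast` and `elementwise` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

g = df.groupby('item')['quantity']

# Single value per group: repeated for every row of the group
broadcast = g.transform(lambda s: s.max())

# Same-length Series per group: one new value for each row
elementwise = g.transform(lambda s: s - s.max())

print(pd.DataFrame({'item': df['item'],
                    'group_max': broadcast,
                    'diff_from_max': elementwise}))
```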

### Find difference from the mean

Let's read in the City of Houston employee dataset and transform each salary so that it shows the difference between it and the mean salary of that employee's department.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

We define a custom function that subtracts the mean of that group from all the values in the group.

In [None]:
def sub_mean(s):
    return (s - s.mean()).round(-3)

We call the `transform` method with this function and create a new column which informs us how much more or less each employee is making relative to the mean of their department.

In [None]:
emp['salary_diff_mean'] = emp.groupby('dept')['salary'].transform(sub_mean)
emp.head()
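
The difference-from-mean calculation can be tried on a small made-up table when the employee file is unavailable. The DataFrame `emp_toy` and its values are hypothetical; rounding is left out so that the differences within each department sum to zero.

```python
import pandas as pd

# Hypothetical stand-in for the employee data
emp_toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Fire', 'Fire', 'Fire'],
    'salary': [50_000, 70_000, 40_000, 60_000, 80_000],
})

# Subtract each department's mean salary from its members' salaries
emp_toy['salary_diff_mean'] = (emp_toy.groupby('dept')['salary']
                                      .transform(lambda s: s - s.mean()))

# Differences from the group mean cancel out within each department
dept_sums = emp_toy.groupby('dept')['salary_diff_mean'].sum()
print(emp_toy)
```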

## Transforming multiple columns

It's possible to use the `transform` method on multiple columns instead of just the one we've been using. We begin by reading in a few columns of the college dataset.

In [None]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, index_col='instnm')
college.head(3)

Place all of the columns you desire to pass through the `transform` method in a list within brackets following the call to `groupby`. The following takes the mean SAT verbal and SAT math scores for each state.

In [None]:
mean_sat = (college.groupby('stabbr')[['satvrmid', 'satmtmid']]
                   .transform('mean')
                   .round(0))
mean_sat.head(3)

These columns can then be appended to the original DataFrame.

In [None]:
college[['sat_verbal_mean', 'sat_math_mean']] = mean_sat
college.head(3)

Let's filter for a different state (Texas) to verify that the mean scores are different.

In [None]:
college.query('stabbr == "TX"').head(3)

### Standardization

A common transformation for numeric columns is to subtract the mean and divide by the standard deviation. This is called **standardization** and is often completed before performing machine learning. It provides a relative metric of how many standard deviations away from the mean each value is. This metric is also known as the **z-score**. Let's define a custom function to produce the calculation.

In [None]:
def standardize(s):
    return (s - s.mean()) / s.std()

Let's standardize the SAT score and undergraduate population columns by state.

In [None]:
(college.groupby('stabbr')[['satvrmid', 'satmtmid', 'ugds']]
        .transform(standardize)
        .round(2)
        .head(3))
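
A standardized column should have a mean of 0 and a standard deviation of 1 within each group, which gives a simple check. The sketch below uses a hypothetical `scores` DataFrame with made-up values; note that the pandas `std` method uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical SAT-like scores for two states
scores = pd.DataFrame({
    'state': ['TX', 'TX', 'TX', 'CA', 'CA', 'CA'],
    'sat': [500.0, 550.0, 600.0, 600.0, 650.0, 700.0],
})

def standardize(s):
    # Subtract the group mean and divide by the group standard deviation
    return (s - s.mean()) / s.std()

scores['sat_z'] = scores.groupby('state')['sat'].transform(standardize)

# Within each state, the z-scores should have mean 0 and std 1
summary = scores.groupby('state')['sat_z'].agg(['mean', 'std'])
print(summary)
```

Standardizing within groups means a z-score compares a school only with others in the same state, not with the dataset as a whole.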

### Transforming all columns

If no columns are provided after the call to the `groupby` method then all of the non-grouping columns will be transformed. If a column cannot be transformed (such as a string column when taking the mean), older versions of pandas silently dropped it, whereas pandas 2.0 and later raise an error instead. Here, every remaining column is numeric, so we transform all of them by immediately calling the `transform` method after grouping.

In [None]:
college.groupby('stabbr').transform('mean').head(3)
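
When a DataFrame mixes string and numeric columns, one way to keep the transformation explicit is to select the numeric columns by dtype before calling `transform`. This is a minimal sketch with a hypothetical `mixed` DataFrame, not the college data.

```python
import pandas as pd

# Hypothetical data with one string column that has no meaningful mean
mixed = pd.DataFrame({
    'state': ['TX', 'TX', 'CA', 'CA'],
    'name': ['a', 'b', 'c', 'd'],
    'sat': [500.0, 600.0, 650.0, 750.0],
    'ugds': [1000.0, 3000.0, 2000.0, 4000.0],
})

# Choose the numeric columns explicitly rather than relying on pandas to skip others
numeric_cols = mixed.select_dtypes('number').columns.tolist()

means = mixed.groupby('state')[numeric_cols].transform('mean')
print(means)
```

Selecting columns up front makes the result independent of how a particular pandas version treats columns that cannot be aggregated.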

### Summary of the groupby `transform` method

* Syntax - `df.groupby('grouping col')['transformed col'].transform(func)`
* The function accepts a pandas Series of all the values in the group
* The function must return either a single value or a Series the same length as the group
* Define either a custom function or use a string name of a pandas aggregation function
* If a single value is returned from the custom function, then that value is repeated for the length of the group
* The final pandas object returned always has the same number of values as the original

## Exercises

Execute the cell below to reread the college dataset and use it for the exercises below.

In [None]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, index_col='instnm')
college.head(3)

### Exercise 1

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have more than 500,000 total undergraduate students. Can you verify your results?</span>

### Exercise 2

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have an average undergraduate student population greater than 2,500 and have more than 30 religiously affiliated schools. Can you verify your results?</span>

### Exercise 3

<span style="color:green; font-size:16px">The maximum SAT score for each test is 800. Create new columns in the college dataset that show each school's percentage of the maximum for each SAT score.</span>

### Use the City of Houston dataset

Execute the following cell to read in the City of Houston employee dataset and then use it for the following exercises.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

### Exercise 4

<span style="color:green; font-size:16px">Filter the employee dataset so that only position titles with an average salary of at least 100,000 remain. Can you verify your results?</span>

### Exercise 5

<span style="color:green; font-size:16px">Filter the employee dataset so that only position titles with at least 5 employees and an average salary of at least 80,000 remain. Can you verify the results?</span>

### Exercise 6

<span style="color:green; font-size:16px">Add a column to the DataFrame that contains the median salary based on department, sex, and race.</span>

### Exercise 7

<span style="color:green; font-size:16px">Add a new column, `pct_max_dept_sex`, to the employee DataFrame that holds the employee's percentage of the maximum salary for each department and sex. For instance, if a male HPD employee makes 80,000 and the maximum male HPD salary is 120,000, then the value for this employee would be 80,000/120,000 or 0.667. Verify this value for the first employee.</span>