# Transform and Filter with Groupby

Applying an aggregating function to our groups is the most common operation, but there is much more we can do with groups. pandas provides the GroupBy `filter` method to either keep or reject groups **as a whole**. This is similar to boolean indexing, except that the boolean condition is applied to an entire group rather than to individual rows. An example with a small fake dataset shows how it works.

In [None]:
import pandas as pd
item = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D']
quantity = [2, 10, 3, 7, 6, 5, 2, 10, 12]
df = pd.DataFrame({'item': item, 'quantity': quantity})
df

## Back to boolean indexing
Before we filter by group, let's look at how we filter by row (boolean indexing). For instance, let's select all the rows with a quantity greater than 5. We first create a boolean Series and then pass it to the selection operator.

In [None]:
filt = df['quantity'] > 5
df[filt]

### Filter by group total
Let's say we want to keep only the items with a total quantity greater than 15. One way to do this is in three steps: find the total quantity per item, select just those items whose total is greater than 15, and then go back and filter the original dataset for those items.

In [None]:
total = df.groupby('item').agg(total_quantity=('quantity', 'sum')).reset_index()
total

In [None]:
filt = total['total_quantity'] > 15
total[filt]

From here we can use the `isin` method on our original DataFrame to get what we desire.

In [None]:
items = total.loc[filt, 'item']
items

In [None]:
filt2 = df['item'].isin(items)
df[filt2]

## Shortcut with `filter`

The GroupBy `filter` method handles this procedure directly. It is a somewhat complicated method, so it may take some time to understand. You must first create a function that returns a single boolean value. pandas will implicitly pass this function a DataFrame consisting of just the rows of the current group.

Take a look at the `find_total` function below. It gets called once per group. It receives the current group as a DataFrame and assigns it to the variable `sub_df`. You can call any normal DataFrame methods on `sub_df`. Here, we select the quantity column and sum it. We then compare this sum against 15 and return a boolean. The end result is a DataFrame with only the items that had a total quantity of more than 15.

In [None]:
def find_total(sub_df):
    return sub_df['quantity'].sum() > 15

In [None]:
df.groupby('item').filter(find_total)

Notice that `find_total` was passed to the `filter` method by name only, without parentheses, so it is not called at that point. Internally, pandas calls this function once per group, passing it that group as a DataFrame.
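To make the "once per group" behavior concrete, here is a minimal sketch that reproduces the result by hand. It iterates over the GroupBy object, calls `find_total` on each sub-DataFrame, and keeps the groups that return `True`. The names `kept_groups`, `manual_result`, and `filter_result` are illustrative, and this is only an approximation of what `filter` does, not its actual implementation.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

def find_total(sub_df):
    return sub_df['quantity'].sum() > 15

# Iterating over a GroupBy object yields (group name, sub-DataFrame) pairs
kept_groups = []
for name, sub_df in df.groupby('item'):
    if find_total(sub_df):  # here we call the function ourselves
        kept_groups.append(sub_df)

# Restore the original row order before comparing
manual_result = pd.concat(kept_groups).sort_index()
filter_result = df.groupby('item').filter(find_total)

print(manual_result.equals(filter_result))
```

The `sort_index` call is there because concatenating groups orders the rows by group, whereas `filter` keeps the rows in their original order.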

### Viewing each "Sub-DataFrame"

The name `sub_df` was chosen to signify that the object being passed to `find_total` was indeed a DataFrame. We can print out each sub-DataFrame during each call to `find_total` to get a better idea of what is happening. Let's add a print statement to it.

In [None]:
def find_total2(sub_df):
    print(sub_df, end='\n\n')
    return sub_df['quantity'].sum() > 15

In [None]:
df.groupby('item').filter(find_total2)

## Getting a nicer display
Instead of printing to the screen, we can use the `display` function from the `IPython.display` module to get the same HTML output that we are accustomed to. This can be quite helpful when debugging.

In [None]:
from IPython.display import display

In [None]:
def find_total3(sub_df):
    display(sub_df)
    total = sub_df['quantity'].sum()
    print(f'total is {total}')
    return total > 15

df.groupby('item').filter(find_total3)

### Summary of the Groupby `filter` method

* Scans each group independently
* Must write a custom function
* The custom function implicitly gets passed a DataFrame of just that group
* The custom function must return a single boolean value
* Each group is either kept or dropped based on the returned boolean value
* The end result is the original DataFrame (same number of columns) with the rows of certain groups filtered out
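A few of these points can be checked directly on the small dataset. The sketch below confirms that the filtered result keeps the original columns and index labels. It also shows what happens when the custom function returns a Series instead of a single boolean: in the pandas versions this was written against, that raises a `TypeError`, though the exact behavior may differ in other versions. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

kept = df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

# Same columns as the original, and the original index labels are retained
print(kept.columns.tolist() == df.columns.tolist())
print(kept.index.tolist())

# Returning a boolean Series (one value per row) instead of a single boolean
try:
    df.groupby('item').filter(lambda sub_df: sub_df['quantity'] > 5)
    raised = False
except TypeError as err:
    raised = True
    print(f'TypeError: {err}')
```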

## Using an anonymous function

If the custom function can be written in a single line, you may use an anonymous function. The same sub-DataFrame is passed to it like above.

In [None]:
df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

### A more practical example - Finding actors that appear in at least 25 movies
Let's read in the movie dataset and filter for actors that have appeared in at least 25 movies.

In [None]:
movie = pd.read_csv('../data/movie.csv')
movie.head(3)

### Create a custom function
Our custom function is very simple. We merely need to check whether the number of rows in the group is 25 or more. Note that we are only considering the actor1 column.

In [None]:
movie_top_actor = movie.groupby('actor1').filter(lambda sub_df: len(sub_df) >= 25)
movie_top_actor.head()

### Verify results

Let's verify the results by returning the frequency of occurrence for each actor1 of the returned DataFrame.

In [None]:
movie_top_actor['actor1'].value_counts()
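The same filter-then-verify pattern can be tried without the movie file. This sketch uses the small item dataset as a stand-in, with a threshold of 3 rows in place of 25 movies (an assumption made purely for illustration). If the filter worked, the smallest count reported by `value_counts` should be at least the threshold.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# Keep only the items that appear in at least 3 rows
frequent = df.groupby('item').filter(lambda sub_df: len(sub_df) >= 3)

# Verify: every remaining item should occur at least 3 times
counts = frequent['item'].value_counts()
print(counts)
print(counts.min() >= 3)
```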

## The Groupby `transform` method

There are a couple of different use-cases for the Groupby `transform` method. 

### `transform` first use case - aggregation

The first is that it can perform an aggregation just like the `agg` method, but it returns the aggregated value for every row in the group rather than once per group. An example showing the difference can clear this up. Let's output the original DataFrame again.

In [None]:
df

Previously, we found the total quantity for each group.

In [None]:
df.groupby('item').agg(total_quantity=('quantity', 'sum'))

### Use `transform` instead of `agg`

We can aggregate the quantity column again, but in a different manner, with `transform`. There are several differences here. Syntactically, we select the column to aggregate (quantity) with brackets after the `groupby` call and then call the `transform` method. The quantity column is aggregated, but the aggregated value is returned for each row of the original DataFrame. Also, a Series is returned instead of a DataFrame.

In [None]:
df.groupby('item')['quantity'].transform('sum')
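To make the difference in shape concrete, the sketch below compares the two results on the small dataset. The aggregation has one value per group, while `transform` has one value per original row. Mapping the per-group totals back onto the item column gives the same values as `transform`. The names `per_group`, `per_row`, and `mapped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# One value per group, indexed by item
per_group = df.groupby('item')['quantity'].sum()

# One value per original row, indexed like df
per_row = df.groupby('item')['quantity'].transform('sum')

print(len(per_group), len(per_row))

# Looking up each row's item in the per-group totals gives the same values
mapped = df['item'].map(per_group)
print(mapped.tolist() == per_row.tolist())
```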

### Can append this to the original DataFrame

Since `transform` always returns an object the same length as the original DataFrame, a common scenario involves adding the result to the original DataFrame as a new column.

In [None]:
df2 = df.copy()
df2['group total'] = df.groupby('item')['quantity'].transform('sum')
df2
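Because the result of `transform` lines up row for row with the original DataFrame, it can also be compared against a threshold to build a boolean Series. The sketch below uses this to reproduce the earlier group filter (total quantity greater than 15) with ordinary boolean indexing. This is offered as an alternative illustration, not a replacement for `filter`.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

# Group total repeated on every row of its group
group_total = df.groupby('item')['quantity'].transform('sum')

# Boolean indexing on the per-row group total
by_mask = df[group_total > 15]

# The same selection using the GroupBy filter method
by_filter = df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

print(by_mask.equals(by_filter))
```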

### `transform` second use case - return a new value for each row in the group
You can also use `transform` to apply a specific transformation to each value in the group. For instance, we can divide each value in the group by the maximum of that specific group. For this, we need a custom function.

In [None]:
def divide_max(sub_series):
    return sub_series / sub_series.max()

In [None]:
df.groupby('item')['quantity'].transform(divide_max)
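The two use cases are closely related. As a sketch, the same result can be obtained by first using `transform('max')` to get each row's group maximum and then dividing the original column by it. The names `custom` and `vectorized` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

def divide_max(sub_series):
    return sub_series / sub_series.max()

# Custom function applied group by group
custom = df.groupby('item')['quantity'].transform(divide_max)

# Group maximum repeated on every row, then a single division
group_max = df.groupby('item')['quantity'].transform('max')
vectorized = df['quantity'] / group_max

# The two approaches agree up to floating-point rounding
print((custom - vectorized).abs().max() < 1e-12)
```

For the first group (item A, quantities 2 and 10), the group maximum is 10, so the transformed values are 0.2 and 1.0.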

### Implicitly passed a Series
`transform` differs from `filter` in that it implicitly passes just a Series of data to the custom function. Inside the custom function, you only have access to that one Series and not to all of the columns as you do with `filter`. It can be instructive to print out everything that is happening within the custom function. Here we print out both the implicitly passed original Series and the returned transformed Series.

In [None]:
def divide_max2(sub_series):
    print("Original")
    display(sub_series)
    print("Transformed")
    display(sub_series / sub_series.max())
    print("\n\n")
    return sub_series / sub_series.max()

df.groupby('item')['quantity'].transform(divide_max2)

## `transform` must return either a single value or a Series the same length as the group

The custom function that you use with `transform` must return either a single value or a Series the same exact length as the group. Our first use-case returned an aggregation (a single value), while our second returned the Series divided by the max of each group.
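Both kinds of return value can be seen side by side in the short sketch below. The first custom function returns a single value per group (the range, maximum minus minimum), which is repeated for every row of the group. The second returns a Series of the same length as the group. Both lambdas are illustrative examples rather than functions used elsewhere in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
    'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12],
})

grouped = df.groupby('item')['quantity']

# Returns one value per group; the value is repeated across the group
group_range = grouped.transform(lambda sub_series: sub_series.max() - sub_series.min())

# Returns a Series the same length as the group
above_min = grouped.transform(lambda sub_series: sub_series - sub_series.min())

print(group_range.tolist())
print(above_min.tolist())
print(len(group_range) == len(above_min) == len(df))
```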

## Summary of the GroupBy `transform` method

* The applied function must return either a single value or a Series the same length as the group
* Can use either a custom function or a string name of a pandas method
* If a single value is returned from the custom function, then that value is repeated for the length of the group
* The final Pandas object returned always has the same number of values as the original

## Exercises

Execute the following cell to read in the college dataset and then use it for the following exercises.

In [None]:
pd.options.display.max_columns = 100
college = pd.read_csv('../data/college.csv')
college.head(3)

### Exercise 1
<span  style="color:green; font-size:16px">Filter the college DataFrame for states that have more than 500,000 total undergraduate students. Can you verify your results?</span>

Execute the following cell to read in the City of Houston employee dataset and then use it for the following exercises.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

### Exercise 2

<span  style="color:green; font-size:16px">Filter the employee dataset so that only position titles with an average salary of at least 100,000 remain. Can you verify your results?</span>

### Exercise 3
<span  style="color:green; font-size:16px">Filter the employee dataset so that only position titles with at least 5 employees and an average salary of at least $80,000 remain. Can you verify the results?</span>

### Exercise 4

<span  style="color:green; font-size:16px">Add a new column, **pct_max_dept_sex**, to the employee DataFrame that holds each employee's salary as a proportion of the maximum salary for their department and sex. For instance, if a male HPD employee makes 80,000 and the maximum male HPD salary is 120,000, then the value for this employee would be 80,000/120,000, or about 0.667. Verify this value for the first employee.</span>