# Tutorial 3.7: Pandas GroupBy
Python for Data Analytics | Module 3  
Professor James Ng

<div class="alert alert-block alert-danger">
<h4>Warning: This is a Difficult Topic</h4>
<p>
This tutorial covers some of the most powerful features of the Pandas library. With that power comes some complexity.
</p>
</div>

In [1]:
# SETUP: DO NOT CHANGE
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

## Introduction

In this tutorial, we will be getting our feet wet with the incredibly powerful and flexible grouping functionality in Pandas.

In [2]:
# For this tutorial, we will be working with both of
# our higher education data sets.
college_loan_defaults = pd.read_csv('https://www3.nd.edu/~jng2/college-loan-default-rates.csv')

college_scorecard = pd.read_csv('https://www3.nd.edu/~jng2/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

## The `groupby()` Method

So far, all the calculations that we've done on *DataFrame* objects have looked at the values inside columns (*Series*) as a whole.

The `groupby()` method allows you to move into deeper forms of analysis by splitting up the rows of a *DataFrame* into groups and then analyzing the records BY GROUP. You can think of this as putting rows into buckets for evaluation.

## Using `groupby()` to Split your Dataset into Groups
Of course, before we can perform evaluations on groups, we have to create them from an existing *DataFrame*. 

The `groupby()` method provides a variety of ways to split up your datasets. Generally speaking, the values of one (or more) *Series* or of an index are used to divide the rows into groups.

Let's explore some of these here, starting with the simplest.

### Single Column Grouping
The simplest way to create groups from a *DataFrame* is to specify a single column within the *DataFrame* whose values will become the group definitions. 

In the abstract, this sounds complicated, so let's take a look at an example, which will help to make things clear.

In [3]:
# Group rows by the values of the 'state' column.
colleges_by_state = college_scorecard.groupby(['state'])
colleges_by_state.groups

{'AK': Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64'),
 'AL': Int64Index([ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,
              23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
              36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
              49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,
              62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,
              75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
              88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100],
            dtype='int64'),
 'AR': Int64Index([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113,
             114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126,
             127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
             140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152,
             153, 154, 155, 156, 15

The `groupby()` method returns a `DataFrameGroupBy` object. One of the attributes of that object is `groups`. When accessed, this attribute provides a *dict* object that maps the **label** of each group to the **index values** of the records in the original *DataFrame* that belong to that group.

If you look above, you can see there is a group labelled 'AK' with index values `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`. This indicates that the records with those index values have a 'state' column value == 'AK'.
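To see the same structure on a dataset small enough to read at a glance, here is a minimal sketch. The `toy` DataFrame and its `enrollment` column are made up for illustration and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy data (not taken from the college scorecard file).
toy = pd.DataFrame({
    'state': ['AK', 'AL', 'AK', 'AL', 'AR'],
    'enrollment': [100, 250, 300, 150, 400],
})

toy_by_state = toy.groupby('state')

# `groups` maps each group label to the index values of its rows.
# Converting each index to a plain list makes the mapping easier to read.
group_rows = {label: index.tolist() for label, index in toy_by_state.groups.items()}
print(group_rows)
```

Rows 0 and 2 both have 'AK' in the `state` column, so they end up together under the 'AK' label.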

<div class="alert alert-block alert-info">
<p>
We will learn more about the `DataFrameGroupBy` object shortly.
    </p>
</div>

### Multi Column Grouping

You can also specify multiple columns if you wish to split your data up in multiple levels:

In [4]:
# Group rows by the values of the 'state' AND 'city' columns
colleges_by_state_and_city = college_scorecard.groupby(['state', 'city'])
colleges_by_state_and_city.groups

{('AK', 'Anchorage'): Int64Index([1, 3, 5, 7], dtype='int64'),
 ('AK', 'Barrow'): Int64Index([6], dtype='int64'),
 ('AK', 'Fairbanks'): Int64Index([8], dtype='int64'),
 ('AK', 'Juneau'): Int64Index([9], dtype='int64'),
 ('AK', 'Palmer'): Int64Index([0], dtype='int64'),
 ('AK', 'Seward'): Int64Index([4], dtype='int64'),
 ('AK', 'Soldotna'): Int64Index([2], dtype='int64'),
 ('AL', 'Albertville'): Int64Index([61], dtype='int64'),
 ('AL', 'Alexander City'): Int64Index([25], dtype='int64'),
 ('AL', 'Andalusia'): Int64Index([58], dtype='int64'),
 ('AL', 'Athens'): Int64Index([16], dtype='int64'),
 ('AL', 'Auburn'): Int64Index([17], dtype='int64'),
 ('AL', 'Bay Minette'): Int64Index([52], dtype='int64'),
 ('AL', 'Bessemer'): Int64Index([47], dtype='int64'),
 ('AL',
  'Birmingham'): Int64Index([20, 23, 36, 44, 54, 57, 64, 71, 76, 77, 81, 89, 94, 97], dtype='int64'),
 ('AL', 'Boaz'): Int64Index([74], dtype='int64'),
 ('AL', 'Brewton'): Int64Index([53], dtype='int64'),
 ('AL', 'Daphne'): Int64In

As you can see, each group now has a multi-part label. Logically, as the number of grouping levels increases, the number of rows in each group will tend to decrease, since the rows are being distributed across a greater number of groups.
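A small sketch can make that trade-off concrete. The `toy` data below is invented, and `ngroups` and `size()` are pandas GroupBy features that this tutorial does not otherwise use; they are shown here only to count groups and rows per group.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AK', 'AL', 'AL'],
    'city': ['Anchorage', 'Anchorage', 'Juneau', 'Auburn', 'Mobile'],
})

by_state = toy.groupby('state')
by_state_and_city = toy.groupby(['state', 'city'])

# Number of groups produced by each grouping
print(by_state.ngroups, by_state_and_city.ngroups)

# Number of rows in each group
print(by_state.size())
print(by_state_and_city.size())
```

Grouping by state alone gives fewer, larger groups. Adding `city` splits the same rows across more, smaller groups.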

## *DataFrameGroupBy* Objects

We mentioned previously that the `groupby()` method returns a *DataFrameGroupBy* object. In this section, we will explore some of the other attributes of this type.

### Column Indexing
You can perform indexing on *DataFrameGroupBy* objects, just like you can on a regular *DataFrame*. 

In order to demonstrate this functionality, we will also have to apply some sort of aggregation.  I'll get the median value of each column for each group.

In [5]:
# Select the median value of multiple columns 
# from colleges_by_state_and_city
colleges_by_state_and_city[
    ['full_time_retention_rate_4_year', 
     'average_net_price_public']].median()[:15]

# Note that this is equivalent to 
# college_scorecard.groupby(['state', 'city'])[['full_time_retention_rate_4_year', 'average_net_price_public']].median()[:15]

| state | city | full_time_retention_rate_4_year | average_net_price_public |
|---|---|---|---|
| AK | Anchorage | 0.7453 | 8661.0 |
| AK | Barrow | NaN | 9598.0 |
| AK | Fairbanks | 0.7756 | 9188.0 |
| AK | Juneau | 0.7167 | 8971.0 |
| AK | Palmer | 0.3333 | NaN |
| AK | Seward | NaN | NaN |
| AK | Soldotna | NaN | NaN |
| AL | Albertville | NaN | NaN |
| AL | Alexander City | NaN | 6922.0 |
| AL | Andalusia | NaN | 7748.0 |


### Method Dispatching
Waaaaaayyyy back in the very beginning of our course, we discussed how objects have data attributes and methods (actions). *DataFrameGroupBy* objects are no exception to this and we will be discussing their unique methods in the next section.

Before we get there though, we need to make note of a special behavior on *`DataFrameGroupBy`* objects called **method dispatching**.

Normally, when you request a method that doesn't exist on an object, you get an `AttributeError` exception:

In [6]:
# Uncommenting the next line raises an AttributeError,
# because str has no mean() method.
# str.mean()

But with *DataFrameGroupBy* objects, any method not found on the object itself is forwarded (or "dispatched") to all the groups that it contains.

That is why we were able to ask Python to execute the `median()` method of the `colleges_by_state_and_city` object above and get something back: it is (1) "dispatching" the `median()` method call to each group, (2) collecting the results, and (3) presenting them to us.
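The three steps can be imitated by hand on a tiny dataset. This is only a sketch of the idea, using an invented `toy` DataFrame with a hypothetical `price` column; it is not how Pandas implements the call internally.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'price': [10.0, 30.0, 5.0, 7.0, 9.0],
})
toy_by_state = toy.groupby('state')

# One call on the grouped object...
grouped_median = toy_by_state['price'].median()

# ...versus calling median() on each group ourselves
# and collecting the results into a Series.
manual_median = pd.Series(
    {label: group['price'].median() for label, group in toy_by_state},
    name='price',
)
manual_median.index.name = 'state'

print(grouped_median)
print(manual_median)
```

Both approaches produce one median per state; the grouped call simply saves us from writing the loop.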

In [7]:
# Another Example of Dispatching
# Here the `max` method is dispatched to 
# each group and the results are collected
# and displayed
colleges_by_state.max()[:5]

| state | UNITID | OPEID | OPEID6 | institution_name | city | predominant_degree_code | predominant_degree_desc | institutional_owner_code | institutional_owner_desc | locale | ... | part_time_students_percentage | open_or_closed | average_net_price_public | average_net_price_private | pell_grant_receipents | full_time_retention_rate_4_year | full_time_retention_rate_less_than_4_year | part_time_rentention_rate_4_year | part_time_rentention_rate_less_than_4_year | students_with_federal_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AK | 442523 | 4138600 | 41386 | University of Alaska Southeast | Soldotna | 3 | Certificate | 3 | Public | 43 | ... | 0.6817 | 1 | 9598.0 | 27226.0 | 0.8868 | 0.7756 | 1.0 | 1.0 | 1.0 | 0.786 |
| AL | 483975 | 52098836 | 42267 | Virginia College-Montgomery | Wadley | 4 | Unknown | 3 | Public | 43 | ... | 0.91 | 1 | 20787.0 | 31954.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5714 | 1.0 |
| AR | 484622 | 12098848 | 42200 | Williams Baptist College | West Memphis | 3 | Certificate | 3 | Public | 43 | ... | 0.8289 | 1 | 14143.0 | 25197.0 | 0.9815 | 0.8667 | 1.0 | 1.0 | 1.0 | 1.0 |
| AS | 240736 | 1001000 | 10010 | American Samoa Community College | Pago Pago | 2 | Associates | 1 | Public | 33 | ... | 0.4389 | 1 | 1203.0 | NaN | 0.7245 | 1.0 | NaN | 1.0 | NaN | 0.0 |
| AZ | 485306 | 10732934 | 42331 | Yavapai College | Yuma | 4 | Graduate | 3 | Public | 43 | ... | 1.0 | 1 | 15215.0 | 41525.0 | 0.9985 | 1.0 | 1.0 | 0.6667 | 1.0 | 0.9701 |


### Iteration
You can iterate (loop over) a *DataFrameGroupBy* to inspect or process groups one at a time. When iterating, each iteration will produce two values (see below): 
* The first value will be the name/label of the current group being processed.
* The second value will be the group itself, which is nothing more than a regular DataFrame object (which only contains the records from the original DataFrame that match the group criteria).

To demonstrate, let us create a couple of new objects here:
1. `college_loan_defaults_subset`: A new DataFrame containing only a few columns from `college_loan_defaults`. Having fewer columns will simplify our examples.
2. `college_loan_defaults_by_state`: A grouped-by-state version of `college_loan_defaults_subset`.

In [8]:
college_loan_defaults_subset = college_loan_defaults[['name', 'state', 'year_1_default_rate']]

college_loan_defaults_by_state = college_loan_defaults_subset.groupby('state')

In [9]:
# Here we will iterate over the DataFrameGroupBy object
# to demonstrate that each iteration provides the group label and a DataFrame of "grouped" rows.
for name, item in college_loan_defaults_by_state:
    print(name, type(item))

AK <class 'pandas.core.frame.DataFrame'>
AL <class 'pandas.core.frame.DataFrame'>
AR <class 'pandas.core.frame.DataFrame'>
AZ <class 'pandas.core.frame.DataFrame'>
CA <class 'pandas.core.frame.DataFrame'>
CO <class 'pandas.core.frame.DataFrame'>
CT <class 'pandas.core.frame.DataFrame'>
DC <class 'pandas.core.frame.DataFrame'>
DE <class 'pandas.core.frame.DataFrame'>
FL <class 'pandas.core.frame.DataFrame'>
GA <class 'pandas.core.frame.DataFrame'>
GU <class 'pandas.core.frame.DataFrame'>
HI <class 'pandas.core.frame.DataFrame'>
IA <class 'pandas.core.frame.DataFrame'>
ID <class 'pandas.core.frame.DataFrame'>
IL <class 'pandas.core.frame.DataFrame'>
IN <class 'pandas.core.frame.DataFrame'>
KS <class 'pandas.core.frame.DataFrame'>
KY <class 'pandas.core.frame.DataFrame'>
LA <class 'pandas.core.frame.DataFrame'>
MA <class 'pandas.core.frame.DataFrame'>
MD <class 'pandas.core.frame.DataFrame'>
ME <class 'pandas.core.frame.DataFrame'>
MI <class 'pandas.core.frame.DataFrame'>
MN <class 'panda

### Methods of *DataFrameGroupBy* Objects
In this section of the tutorial, we will be covering the various operations built into the `DataFrameGroupBy` object type.

#### The `aggregate()` Method
At first, the `aggregate()` method appears to be quite similar to what we just covered when we talked about method dispatching. It performs aggregations on the groups of a `DataFrameGroupBy` object.
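As a quick check of that similarity, the sketch below compares `aggregate('max')` with a direct `max()` call on an invented `toy` DataFrame. The `rate` column is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL'],
    'rate': [0.5, 0.7, 0.9],
})
toy_by_state = toy.groupby('state')

# The same aggregation requested two different ways
via_aggregate = toy_by_state.aggregate('max')
via_direct_call = toy_by_state.max()

# equals() checks that both results hold the same labels and values
print(via_aggregate.equals(via_direct_call))
```

For a single aggregation like this one, the two spellings give the same result.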

In [10]:
# Simple invocation of the aggregate() method
colleges_by_state.aggregate('count')[:5]

| state | UNITID | OPEID | OPEID6 | institution_name | city | url | predominant_degree_code | predominant_degree_desc | institutional_owner_code | institutional_owner_desc | ... | pell_grant_receipents | full_time_retention_rate_4_year | full_time_retention_rate_less_than_4_year | part_time_rentention_rate_4_year | part_time_rentention_rate_less_than_4_year | students_with_federal_loans | median_student_earnings | median_student_debt | less_than_4_year_school_completion_rate | 4_year_school_completion_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AK | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | ... | 10 | 5 | 4 | 4 | 3 | 10 | 8 | 10 | 4 | 5 |
| AL | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | 91 | ... | 90 | 39 | 45 | 30 | 28 | 90 | 80 | 91 | 44 | 41 |
| AR | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | 83 | ... | 83 | 23 | 55 | 16 | 36 | 83 | 78 | 83 | 55 | 24 |
| AS | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| AZ | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | 134 | ... | 126 | 32 | 72 | 15 | 40 | 126 | 108 | 134 | 79 | 37 |


The difference is that the `aggregate()` method gives you some additional options that are not available if you rely on method dispatching:

In [11]:
# You can pass multiple aggregates as a list.
# Here we will get various aggregates for the
# full_time_retention_rate_4_year series 
# of our colleges_by_state object.
colleges_by_state['full_time_retention_rate_4_year'].aggregate(
    ['mean', 'min', 'max'])[:5]

| state | mean | min | max |
|---|---|---|---|
| AK | 0.66324 | 0.3333 | 0.7756 |
| AL | 0.615436 | 0.0 | 1.0 |
| AR | 0.650996 | 0.2564 | 0.8667 |
| AS | 1.0 | 1.0 | 1.0 |
| AZ | 0.6796 | 0.2 | 1.0 |


In [12]:
# Using `rename()` to apply friendly names to output columns
colleges_by_state['full_time_retention_rate_4_year'].aggregate(
    ['mean', 'min', 'max']).rename(
        columns={'mean': 'Avg. Retention', 
                 'min': 'Low Retention', 
                 'max': 'High Retention'})[:5]

| state | Avg. Retention | Low Retention | High Retention |
|---|---|---|---|
| AK | 0.66324 | 0.3333 | 0.7756 |
| AL | 0.615436 | 0.0 | 1.0 |
| AR | 0.650996 | 0.2564 | 0.8667 |
| AS | 1.0 | 1.0 | 1.0 |
| AZ | 0.6796 | 0.2 | 1.0 |


Another way of invoking the `aggregate()` method is actually to use a `dict` object to specify which aggregation(s) to perform on what columns. You can use it to specify different aggregation(s) on a per-column basis.

Here I'll use it to get the high/low values for the SAT Average *Series* and the mean for student retention on our `colleges_by_state_and_city` object.

In [13]:
# Notice how using this style automatically filters
# out all columns you don't specify.
colleges_by_state_and_city.aggregate(
        {'sat_average': ['min', 'max'], 
         'full_time_retention_rate_4_year': 'mean'})[:15]


| state | city | sat_average (min) | sat_average (max) | full_time_retention_rate_4_year (mean) |
|---|---|---|---|---|
| AK | Anchorage | 1054.0 | 1054.0 | 0.7453 |
| AK | Barrow | NaN | NaN | NaN |
| AK | Fairbanks | NaN | NaN | 0.7756 |
| AK | Juneau | NaN | NaN | 0.7167 |
| AK | Palmer | NaN | NaN | 0.3333 |
| AK | Seward | NaN | NaN | NaN |
| AK | Soldotna | NaN | NaN | NaN |
| AL | Albertville | NaN | NaN | NaN |
| AL | Alexander City | NaN | NaN | NaN |
| AL | Andalusia | NaN | NaN | NaN |


#### The pandas DataFrameGroupBy `filter()` Method

You can use the `DataFrameGroupBy` object's `filter()` method to generate a new dataframe after filtering out groups *(not individual records)* that don't pass a given criterion. It allows you to answer questions like this: *which states have an average SAT score (for the state) of at least 1150?*

To use this method, you **must pass in a function** that takes a single parameter, which is the group to evaluate. The function must return `True` or `False` depending on whether the group should be kept (`True`) or discarded (`False`) in the resulting *DataFrame*.

So, with this in mind, let's define a `sat_filter` function so that groups with average SAT scores of less than 1150 are dropped from consideration.

In [14]:
# Notice how this function takes a group, which will be a DataFrame object.
def sat_filter(group):

    # If the group/dataframe object's sat_average series has a 
    # mean value of >= 1150, keep it, otherwise discard it.
    if group['sat_average'].mean() >= 1150:
        return True
    else:
        return False

And now let's use it on college_scorecard to see which rows remain in the new dataframe after applying the filter:

In [15]:
# To avoid clogging up the screen, I'm only going
# to display the `institution_name`, `sat_average`, `state`, and `city` fields 
filter_results = college_scorecard.groupby('state').filter(sat_filter)

filter_results[['state', 'city', 'institution_name', 'sat_average']]

|  | state | city | institution_name | sat_average |
|---|---|---|---|---|
| 1246 | DC | Washington | American University | 1252.0 |
| 1247 | DC | Washington | Bennett Career Institute | NaN |
| 1248 | DC | Washington | Career Technical Institute | NaN |
| 1249 | DC | Washington | Catholic University of America | 1130.0 |
| 1250 | DC | Washington | Gallaudet University | 849.0 |
| 1251 | DC | Washington | George Washington University | 1297.0 |
| 1252 | DC | Washington | Georgetown University | 1414.0 |
| 1253 | DC | Washington | Graduate School USA | NaN |
| 1254 | DC | Washington | Howard University | 1105.0 |
| 1255 | DC | Washington | Institute of World Politics | NaN |


There are a couple of ***really*** important things to notice here:
1. Unlike the `aggregate()` method, the data returned here is not grouped by state as you probably expected it to be. The filter is used on a grouped *DataFrame*, but it returns a new "normal" *DataFrame*.
2. Notice that we have a bunch of rows for Washington DC and Rhode Island, but nothing else. If we've done things correctly, this would mean that the colleges in those two states have average SAT scores of at least 1150. It also means that all other state groups (and therefore their rows) were filtered out. 

Let's verify.

In [16]:
# Did our filter work as intended? Let's check the SAT average for each state.

# Get the mean of SAT Average for each state
# sort the values, reverse the order of the result
# and return the first 10 elements

# Wow, that is a mouthful isn't it?
college_scorecard.groupby('state')['sat_average'].mean().sort_values()[::-1][:10]

state
PW            NaN
MP            NaN
MH            NaN
GU            NaN
FM            NaN
AS            NaN
DC    1174.500000
RI    1171.000000
UT    1134.600000
MA    1121.729167
Name: sat_average, dtype: float64

You can see that Washington DC and Rhode Island are the only states with average SAT scores of at least 1150.
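The group-level nature of `filter()` is easier to see on a handful of rows. The sketch below uses invented SAT values (the `toy` DataFrame and the `high_sat` function are hypothetical) and the same 1150 threshold as above.

```python
import pandas as pd

# Hypothetical toy data; the SAT values are made up.
toy = pd.DataFrame({
    'state': ['DC', 'DC', 'UT', 'UT', 'RI'],
    'sat_average': [1300.0, 1100.0, 1150.0, 1000.0, 1171.0],
})

def high_sat(group):
    # Keep the whole group only if its mean SAT score is at least 1150.
    return group['sat_average'].mean() >= 1150

kept = toy.groupby('state').filter(high_sat)
print(kept)
```

The DC row with a score of 1100 survives because the DC group as a whole passes, while the UT row with a score of 1150 is dropped because the UT group as a whole fails. The result is an ordinary *DataFrame* that keeps the original index values.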

##### Alternatively, use a lambda (anonymous) function in the filter

In [17]:
# Alternative way of filtering after grouping

filter_results_2 = college_scorecard.groupby('state').filter(lambda x: x['sat_average'].mean() >= 1150)

filter_results_2[['state', 'city', 'institution_name', 'sat_average']]


|  | state | city | institution_name | sat_average |
|---|---|---|---|---|
| 1246 | DC | Washington | American University | 1252.0 |
| 1247 | DC | Washington | Bennett Career Institute | NaN |
| 1248 | DC | Washington | Career Technical Institute | NaN |
| 1249 | DC | Washington | Catholic University of America | 1130.0 |
| 1250 | DC | Washington | Gallaudet University | 849.0 |
| 1251 | DC | Washington | George Washington University | 1297.0 |
| 1252 | DC | Washington | Georgetown University | 1414.0 |
| 1253 | DC | Washington | Graduate School USA | NaN |
| 1254 | DC | Washington | Howard University | 1105.0 |
| 1255 | DC | Washington | Institute of World Politics | NaN |


#### The `transform()` Method

Like the `filter()` method, the `transform()` method takes a **function** as an argument and generates a new *DataFrame*.  However, in the case of `transform()` the function is used to **modify the values of Series** in each group before combining the groups back together in the output *DataFrame*.

Your response to that sentence is probably *what THE HECK is he talking about!?*  

I don't blame you, it is very confusing at first. So, let's start with a practical example:
* Let's say that we wanted to center the data (subtract the mean) for the `year_1_default_rate` *Series* of our `college_loan_defaults_by_state` DataFrameGroupBy object. 

Just like with the `filter()` method, we have to create a function to use with the `transform()` method, but this time the function will evaluate each series (column) of each group, rather than the groups as a whole.

In [18]:
# This function will be called for each 
# series of each group in your DataFrameGroupBy object
def center_data(series):
    
    # It returns a new "transformed" version of the series.
    # Note that we could have done anything in here we wanted to.
    # For example, we could have used an if statement to only transform
    # certain columns/series.
    return series - series.mean()

Now let's pass our function to the `transform()` method and see what happens:

In [19]:
# We'll also use the rename() method to apply some friendly column names.
transformed_default_rates = college_loan_defaults_subset.groupby('state').transform(
    center_data).rename(columns={'year_1_default_rate': 'centered_year_1_default_rate'})
transformed_default_rates.head()

|  | centered_year_1_default_rate |
|---|---|
| 0 | 16.679912 |
| 1 | 0.786567 |
| 2 | -9.209489 |
| 3 | 24.831818 |
| 4 | 12.159375 |


In [20]:
college_loan_defaults_subset.head()

|  | name | state | year_1_default_rate |
|---|---|---|---|
| 0 | A - TECHNICAL COLLEGE | CA | 27.1 |
| 1 | A & W HEALTHCARE EDUCATORS | LA | 12.9 |
| 2 | A. T. STILL UNIVERSITY OF HEALTH SCIENCES | MO | 1.6 |
| 3 | AARON'S ACADEMY OF BEAUTY | MD | 35.8 |
| 4 | ABC BEAUTY COLLEGE | AR | 26.6 |


<div class="alert alert-block alert-info">
<p>
Our `college_loan_defaults_subset` dataframe included three columns: name, state, and year_1_default_rate.
</p> 
<p>But here in the returned dataframe we only have `centered_year_1_default_rate`. The `state` column is missing because it is the grouping key, and the `name` column is missing because it holds strings, and you can't calculate the mean of a series of strings.
</p>
<p>
The output shown here comes from a version of Pandas that silently drops such columns from the dataframe returned by the `transform()` method. Newer versions may raise an error instead, in which case you should select the numeric columns before calling `transform()`.
</p>
</div>

So now we have our centered rates in a new *DataFrame*. Let's merge together the result of our `transform()` method and our `college_loan_defaults_subset` *DataFrame*. 

In [21]:
pd.merge(college_loan_defaults_subset, transformed_default_rates, 
         left_index=True, right_index=True)[:5]

|  | name | state | year_1_default_rate | centered_year_1_default_rate |
|---|---|---|---|---|
| 0 | A - TECHNICAL COLLEGE | CA | 27.1 | 16.679912 |
| 1 | A & W HEALTHCARE EDUCATORS | LA | 12.9 | 0.786567 |
| 2 | A. T. STILL UNIVERSITY OF HEALTH SCIENCES | MO | 1.6 | -9.209489 |
| 3 | AARON'S ACADEMY OF BEAUTY | MD | 35.8 | 24.831818 |
| 4 | ABC BEAUTY COLLEGE | AR | 26.6 | 12.159375 |


**Important Note**  
Remember that *pandas* has performed the centering of the data for each group separately. Therefore the values in the "centered_year_1_default_rate" column should reflect the difference between the uncentered `year_1_default_rate` column value and the mean of `year_1_default_rate` for the state to which that record belongs.

Let's verify that is correct. To do so, let's get the mean default rate for year 1 for each state and then manually perform some calculations to ensure everything is as expected.

In [22]:
# This code will extract the mean default rate for each
# of the states in our result above.
college_loan_defaults_subset.groupby('state').aggregate(
    {'year_1_default_rate': 'mean'}).loc[
        ['CA', 'LA', 'MO', 'MD', 'AR']].rename(
            columns={'year_1_default_rate': 'year_1_mean_default_rate'})

| state | year_1_mean_default_rate |
|---|---|
| CA | 10.420088 |
| LA | 12.113433 |
| MO | 10.809489 |
| MD | 10.968182 |
| AR | 14.440625 |


Go ahead and use your calculator to subtract the state `year_1_mean_default_rate` from the corresponding `year_1_default_rate` in our merge product above and you'll see they equal the `centered_year_1_default_rate`.
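If you would rather let Python do the arithmetic, the same check can be sketched in code. The example below uses a small invented `toy` DataFrame rather than the real default-rate data, and it passes the string `'mean'` to `transform()`, a form this tutorial has not shown so far, to copy each state's mean back onto its rows.

```python
import pandas as pd

# Hypothetical toy data; the rates are made up.
toy = pd.DataFrame({
    'state': ['CA', 'LA', 'CA', 'LA'],
    'year_1_default_rate': [27.0, 13.0, 3.0, 11.0],
})

grouped_rate = toy.groupby('state')['year_1_default_rate']

# Center each state's rates around that state's mean.
centered = grouped_rate.transform(lambda s: s - s.mean())

# Broadcast each state's mean back to the rows of that state.
state_mean = grouped_rate.transform('mean')

check = toy.assign(state_mean=state_mean, centered=centered)
print(check)

# Rate minus state mean should reproduce the centered column.
difference = check['year_1_default_rate'] - check['state_mean']
print((difference == check['centered']).all())
```

Because both results keep the original row index, they line up with `toy` row by row without any merge.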

#### The `apply()` Method
The `apply()` method is really flexible, and you can do a lot of different things with it.

Like `filter()` and `transform()` you have to define a custom function to be "applied" to each group. Just like with the `filter()` method, the functions that you define for use with `apply()` will need to accept a group/*DataFrame* object.

Inside the function, you can return a *DataFrame*, a *Series*, or a scalar object, and Pandas will intelligently combine the results for you.

This sort of flexibility is a bit intimidating at first, but over time you will find a wide variety of uses for it.
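Here is a minimal sketch of the three return types side by side. The `toy` DataFrame, its `rate` column, and the lambda functions are all invented for illustration. The `rate` column is selected before calling `apply()` so that each group passed to the function contains only that column.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'rate': [10.0, 30.0, 5.0, 7.0, 9.0],
})
grouped = toy.groupby('state')[['rate']]

# Returning a scalar per group: one value per state.
spread = grouped.apply(lambda g: g['rate'].max() - g['rate'].min())

# Returning a Series per group: one row per state, one column per Series label.
low_high = grouped.apply(
    lambda g: pd.Series({'low': g['rate'].min(), 'high': g['rate'].max()}))

# Returning a DataFrame per group: the pieces are stacked together.
top_row = grouped.apply(lambda g: g.nlargest(1, 'rate'))

print(spread)
print(low_high)
print(top_row)
```

The shape of the combined result depends on what the function returns, which is where both the flexibility and the initial confusion come from.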

Let's go through a couple of examples:

In [23]:
# Define a function that returns a series holding
# the info on how many times different cities
# appear in a given state.

# Notice how the entire dataframe is reduced
# down to a single series via this function
def city_counts(dataframe):
    return dataframe['city'].value_counts()

In [24]:
college_scorecard.groupby('state').apply(city_counts)

state                   
AK     Anchorage             4
       Seward                1
       Juneau                1
       Barrow                1
       Palmer                1
       Fairbanks             1
       Soldotna              1
AL     Birmingham           14
       Montgomery           11
       Mobile               10
       Huntsville            6
       Tuscaloosa            3
       Selma                 3
       Dothan                3
       Madison               2
       Marion                2
       Hoover                2
       Florence              2
       Gardendale            1
       Gadsden               1
       Livingston            1
       Northport             1
       Wadley                1
       Boaz                  1
       Tuskegee              1
       Jasper                1
       Montevallo            1
       Rainsville            1
       Talladega             1
       Auburn                1
                            ..
WV     Bluefie

In this case, our `city_counts` function returns information on only a single column of the dataframe that it receives for each group. The city counts returned for each state group are then collected and displayed to the user.

In the next example we will use the `apply()` method to append a new *Series* to our DataFrame. You'll see that it will effectively take the place of the transform()/merge() example from above in one step.

In [25]:
# Define a function that generates a new column
# holding the centered year_1_default_rate data.

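def center_default_rate(dataframe):
    # Work on a copy so the group passed in by pandas is not modified in place.
    dataframe = dataframe.copy()
    dataframe['center_year_1_default_rate'] = (
        dataframe['year_1_default_rate'] - dataframe['year_1_default_rate'].mean())
    return dataframe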

In [26]:
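# group_keys=False keeps the original row index; newer pandas versions
# may otherwise add the state label as an extra index level.
college_loan_defaults_subset.groupby('state', group_keys=False).apply(center_default_rate)[:10]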

|  | name | state | year_1_default_rate | center_year_1_default_rate |
|---|---|---|---|---|
| 0 | A - TECHNICAL COLLEGE | CA | 27.1 | 16.679912 |
| 1 | A & W HEALTHCARE EDUCATORS | LA | 12.9 | 0.786567 |
| 2 | A. T. STILL UNIVERSITY OF HEALTH SCIENCES | MO | 1.6 | -9.209489 |
| 3 | AARON'S ACADEMY OF BEAUTY | MD | 35.8 | 24.831818 |
| 4 | ABC BEAUTY COLLEGE | AR | 26.6 | 12.159375 |
| 5 | ABCOTT INSTITUTE | MI | 16.4 | 5.050847 |
| 6 | ABDILL CAREER COLLEGE | OR | 17.1 | 5.821918 |
| 7 | ABILENE CHRISTIAN UNIVERSITY | TX | 5.4 | -8.041245 |
| 8 | ABINGTON MEMORIAL HOSPITAL DIXON SCHOOL OF NUR... | PA | 0.9 | -7.994834 |
| 9 | ABRAHAM BALDWIN AGRICULTURAL COLLEGE | GA | 13.6 | 1.09115 |


##### Alternatively, create the centered series by applying a lambda function

In [29]:
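# group_keys=False keeps the original row index so the result lines up with
# college_loan_defaults_subset; newer pandas versions may otherwise prepend the state label.
centeredseries = college_loan_defaults_subset.groupby('state', group_keys=False)['year_1_default_rate'].apply(
    lambda x: x - x.mean()).rename('centered_year_1_default_rate')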
centeredseries.head()

0    16.679912
1     0.786567
2    -9.209489
3    24.831818
4    12.159375
Name: centered_year_1_default_rate, dtype: float64

In [28]:
# As an exercise, merge the above Series to the college_loan_defaults_subset DataFrame. 


In [30]:
pd.merge(college_loan_defaults_subset, centeredseries, left_index=True, right_index=True)

|  | name | state | year_1_default_rate | centered_year_1_default_rate |
|---|---|---|---|---|
| 0 | A - TECHNICAL COLLEGE | CA | 27.1 | 16.679912 |
| 1 | A & W HEALTHCARE EDUCATORS | LA | 12.9 | 0.786567 |
| 2 | A. T. STILL UNIVERSITY OF HEALTH SCIENCES | MO | 1.6 | -9.209489 |
| 3 | AARON'S ACADEMY OF BEAUTY | MD | 35.8 | 24.831818 |
| 4 | ABC BEAUTY COLLEGE | AR | 26.6 | 12.159375 |
| 5 | ABCOTT INSTITUTE | MI | 16.4 | 5.050847 |
| 6 | ABDILL CAREER COLLEGE | OR | 17.1 | 5.821918 |
| 7 | ABILENE CHRISTIAN UNIVERSITY | TX | 5.4 | -8.041245 |
| 8 | ABINGTON MEMORIAL HOSPITAL DIXON SCHOOL OF NUR... | PA | 0.9 | -7.994834 |
| 9 | ABRAHAM BALDWIN AGRICULTURAL COLLEGE | GA | 13.6 | 1.091150 |
