In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

In [2]:
pd.__version__

u'0.20.1'

# Helper Function

This helper function will show us what the inputs and outputs are when we use the Pandas `apply` method.

This is a function that returns another function. Don't focus on how it does what it does; just look at what gets printed to the screen when we use it later in this notebook.

In [3]:
def explain(f):
    def g(arg):
        print ('=' * 20)
        print "Input Type: ", type(arg)
        print "Input:"
        print arg
        output = f(arg)
        print "Output Type: ", type(output)
        print "Output:"
        print output
        print ('=' * 20)
        return output

    return g
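
The helper above uses Python 2 `print` statements. For readers on Python 3, where `print` is a function, the same wrapper idea could be sketched as follows. The name `explain_py3` is hypothetical and not part of the original notebook, and `functools.wraps` is an optional extra that only preserves the wrapped function's name.

```python
import functools


def explain_py3(f):
    """Wrap f so that every call prints its input and its output."""
    @functools.wraps(f)
    def g(arg):
        print('=' * 20)
        print("Input Type: ", type(arg))
        print("Input:")
        print(arg)
        output = f(arg)
        print("Output Type: ", type(output))
        print("Output:")
        print(output)
        print('=' * 20)
        return output  # the result is passed through unchanged

    return g


# Quick check with a built-in function: the wrapper reports, then returns.
length = explain_py3(len)([1, 2, 3])
```

The wrapper only observes the call, so `length` holds whatever `len` itself returns.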

# Data File

Simple data file for this notebook.

In [4]:
%%writefile groupby_example.csv
indx,name,team,number,true_false
10,Jane,A,100,True
20,Jason,B,120,True
30,Jill,B,110,False
40,Julie,A,90,True
50,Jim,A,80,False
60,John,B,140,False
70,Jen,B,60,True

Overwriting groupby_example.csv


In [5]:
!cat groupby_example.csv

indx,name,team,number,true_false
10,Jane,A,100,True
20,Jason,B,120,True
30,Jill,B,110,False
40,Julie,A,90,True
50,Jim,A,80,False
60,John,B,140,False
70,Jen,B,60,True

Read data file:

In [6]:
df = pd.read_csv('groupby_example.csv', index_col=0)

df

       name team  number  true_false
indx
10     Jane    A     100        True
20    Jason    B     120        True
30     Jill    B     110       False
40    Julie    A      90        True
50      Jim    A      80       False
60     John    B     140       False
70      Jen    B      60        True


This is the test data we will be experimenting with.

# Test Functions

Simple functions to use in our apply statements.

In [9]:
def get_max_value(x):
    return x.max()

def get_first_value(x):
    return x.iloc[0]

Before we proceed, let's make sure we understand how this works.

If we select the number column, like so:

In [10]:
df['number']

indx
10    100
20    120
30    110
40     90
50     80
60    140
70     60
Name: number, dtype: int64

If the number column gets passed to the `get_max_value` function, it should return 140...and it does.

In [11]:
get_max_value(df['number'])

140

We can also select the name column:

In [12]:
df['name']

indx
10     Jane
20    Jason
30     Jill
40    Julie
50      Jim
60     John
70      Jen
Name: name, dtype: object

The first value in that `Series` is Jane.

In [13]:
get_first_value(df['name'])

'Jane'

We can also select the data by row:

In [14]:
df.loc[20]

name          Jason
team              B
number          120
true_false     True
Name: 20, dtype: object

The first value in the output is the name 'Jason'.

In [15]:
get_first_value(df.loc[20])

'Jason'

What if I pass in the entire DataFrame?

In [16]:
df

       name team  number  true_false
indx
10     Jane    A     100        True
20    Jason    B     120        True
30     Jill    B     110       False
40    Julie    A      90        True
50      Jim    A      80       False
60     John    B     140       False
70      Jen    B      60        True


This will retrieve the first row.

In [17]:
get_first_value(df)

name          Jane
team             A
number         100
true_false    True
Name: 10, dtype: object

In [18]:
get_first_value??

In [19]:
df.iloc[0]

name          Jane
team             A
number         100
true_false    True
Name: 10, dtype: object
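
The cells above mix two kinds of selection: `iloc` picks by position and `loc` picks by index label. A minimal sketch of the difference, on a small made-up Series (the variable names are assumptions for this illustration):

```python
import pandas as pd

numbers = pd.Series([100, 120, 110], index=[10, 20, 30], name="number")

by_position = numbers.iloc[0]  # first element, whatever its label is
by_label = numbers.loc[10]     # element whose index label is 10

# Label 0 does not exist in this index, so .loc[0] raises a KeyError.
try:
    numbers.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False
```

This is why `get_first_value` uses `iloc[0]`: it works regardless of what the index labels happen to be.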

# Pandas Apply Method

I can use the DataFrame's `apply` method to apply a function to each row or each column. We will use the `explain` helper function to help us see what the inputs and resulting outputs are for each call to our function.

The basic idea behind every use of `apply` in this notebook is this:

1. Divide up the data
2. Apply a function to each subset of the data and collect the results
3. Aggregate the results

This is a key idea that needs to be understood; it is often called split-apply-combine. The concept is reused again and again in the data science world because it is a necessary approach for very large datasets. The idea is much bigger than Python or Pandas.
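
The three steps can be sketched by hand, without `apply`, to make the pattern concrete. This is only an illustration on a small made-up frame (the names `data`, `pieces`, `results`, and `combined` are assumptions for this sketch), not a description of how Pandas implements it internally.

```python
import pandas as pd

# Small made-up frame, similar in shape to the notebook's data.
data = pd.DataFrame(
    {"team": ["A", "B", "B", "A"], "number": [100, 120, 110, 90]}
)

# 1. Divide up the data: one sub-frame per team.
pieces = {team: group for team, group in data.groupby("team")}

# 2. Apply a function to each subset and collect the results.
results = {team: group["number"].max() for team, group in pieces.items()}

# 3. Aggregate the results into a single object.
combined = pd.Series(results, name="number")
print(combined)
```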

By default, `apply` iterates over one column at a time. Look at each of the inputs, think about the function, and look at the outputs.

In [20]:
result = df.apply(explain(get_first_value))

result

Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10     Jane
20    Jason
30     Jill
40    Julie
50      Jim
60     John
70      Jen
Name: name, dtype: object
Output Type:  <type 'str'>
Output:
Jane
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10    A
20    B
30    B
40    A
50    A
60    B
70    B
Name: team, dtype: object
Output Type:  <type 'str'>
Output:
A
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10    100
20    120
30    110
40     90
50     80
60    140
70     60
Name: number, dtype: object
Output Type:  <type 'int'>
Output:
100
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10     True
20     True
30    False
40     True
50    False
60    False
70     True
Name: true_false, dtype: object
Output Type:  <type 'bool'>
Output:
True


name          Jane
team             A
number         100
true_false    True
dtype: object

By default, the `apply` method iterates over the data one column at a time. We can see this because each column is printed out, one at a time. Our `get_first_value` function returns the first item of whatever is passed to it.

The individual outputs are compiled together into the final result, a Pandas Series.

Also notice that our helper function does not alter the outcome in any way. We get the same result if we leave it out. The helper function is just a silent observer, showing us what the input and output are each time the `apply` method applies our function to its data.

In [21]:
df.apply(get_first_value)

name          Jane
team             A
number         100
true_false    True
dtype: object

Alternatively, we can iterate over each row by setting `axis=1`.

In [22]:
df.apply(explain(get_first_value), axis=1)

Input Type:  <class 'pandas.core.series.Series'>
Input:
name          Jane
team             A
number         100
true_false    True
Name: 10, dtype: object
Output Type:  <type 'str'>
Output:
Jane
Input Type:  <class 'pandas.core.series.Series'>
Input:
name          Jason
team              B
number          120
true_false     True
Name: 20, dtype: object
Output Type:  <type 'str'>
Output:
Jason
Input Type:  <class 'pandas.core.series.Series'>
Input:
name           Jill
team              B
number          110
true_false    False
Name: 30, dtype: object
Output Type:  <type 'str'>
Output:
Jill
Input Type:  <class 'pandas.core.series.Series'>
Input:
name          Julie
team              A
number           90
true_false     True
Name: 40, dtype: object
Output Type:  <type 'str'>
Output:
Julie
Input Type:  <class 'pandas.core.series.Series'>
Input:
name            Jim
team              A
number           80
true_false    False
Name: 50, dtype: object
Output Type:  <type 'str'>
Output:
Jim
Inp

indx
10     Jane
20    Jason
30     Jill
40    Julie
50      Jim
60     John
70      Jen
dtype: object

Each of the inputs is one row of the DataFrame.

Each of the outputs is the first item of each row.

The final result is each of the individual outputs compiled together.
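
The effect of the `axis` argument can be checked on a tiny made-up frame, where the expected results are easy to work out by eye. The names `small` and `first` are assumptions for this sketch.

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2], "b": [30, 40]}, index=["x", "y"])


def first(values):
    # Same idea as get_first_value: first item by position.
    return values.iloc[0]


by_column = small.apply(first)       # axis=0 (default): one call per column
by_row = small.apply(first, axis=1)  # axis=1: one call per row
```

Here `by_column` is indexed by the column names and `by_row` is indexed by the row labels, which mirrors the two results shown above.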

# Now with Groups

First, create the groups.

In [23]:
team_groups = df.groupby('team')

In [25]:
# foo = df.groupby('true_false')['number'].cumsum()

Let's break this down into simple steps with the `team_groups`.

I can use the `get_group` method to get each group. Each group is a DataFrame whose rows all share the same value for 'team'.

In [28]:
team_groups.get_group('A')

       name team  number  true_false
indx
10     Jane    A     100        True
40    Julie    A      90        True
50      Jim    A      80       False


I can pass that DataFrame to the `get_max_value` function. The output is the maximum value for each column.

In [29]:
get_max_value(team_groups.get_group('A'))

name          Julie
team              A
number          100
true_false     True
dtype: object

The same thing for group 'B':

In [30]:
team_groups.get_group('B')

       name team  number  true_false
indx
20    Jason    B     120        True
30     Jill    B     110       False
60     John    B     140       False
70      Jen    B      60        True


In [31]:
get_max_value(team_groups.get_group('B'))

name          John
team             B
number         140
true_false    True
dtype: object

When I use the `apply` method it will apply the function to each group and assemble the results, like so:

In [32]:
team_groups.apply(get_max_value)

       name team  number  true_false
team
A     Julie    A     100        True
B      John    B     140        True

The maximum of a text column is determined by string ordering, which is why 'Julie' is the maximum name in group 'A'. In that ordering, digits sort before uppercase letters, and uppercase letters sort before lowercase letters:


In [34]:
sorted(['123', 'jim', 'Jim'])

['123', 'Jim', 'jim']

Let's use our helper function to peer into Pandas to see what it is doing:

In [35]:
df.groupby('team').apply(explain(get_max_value))

Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
       name team  number  true_false
indx                                
10     Jane    A     100        True
40    Julie    A      90        True
50      Jim    A      80       False
Output Type:  <class 'pandas.core.series.Series'>
Output:
name          Julie
team              A
number          100
true_false     True
dtype: object
Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
       name team  number  true_false
indx                                
10     Jane    A     100        True
40    Julie    A      90        True
50      Jim    A      80       False
Output Type:  <class 'pandas.core.series.Series'>
Output:
name          Julie
team              A
number          100
true_false     True
dtype: object
Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
       name team  number  true_false
indx                                
20    Jason    B     120        True
30     Jill    B     110       False
60  

       name team  number  true_false
team
A     Julie    A     100        True
B      John    B     140        True


First, you may notice that the function is called twice for group 'A'. This is not a bug: in this version of Pandas, `apply` calls the function twice on the first group to decide whether it can take a fast or slow code path. Don't focus on that.

More importantly, look at the steps it went through. It selected the data in each group, passed that data to our function, and then stored the results. The results were assembled into a new DataFrame.

The inputs are the two groups and the outputs are identical to what we get when we call our `get_max_value` function on one group at a time. The outputs are stacked on top of each other and assembled into a new DataFrame.
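
The "stacked on top of each other" step can be imitated by hand. The sketch below rebuilds only the name and number columns of the notebook's data, computes one Series of maxima per group, and stacks them into a DataFrame. The names `data`, `per_group`, and `stacked` are assumptions for this illustration; it mimics the result, not the internal Pandas machinery.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "name": ["Jane", "Jason", "Jill", "Julie", "Jim", "John", "Jen"],
        "team": ["A", "B", "B", "A", "A", "B", "B"],
        "number": [100, 120, 110, 90, 80, 140, 60],
    },
    index=pd.Index([10, 20, 30, 40, 50, 60, 70], name="indx"),
)

per_group = {}
for team, group in data.groupby("team"):
    # One Series of column maxima per group, like get_max_value(group).
    per_group[team] = pd.Series(
        {column: group[column].max() for column in ["name", "number"]}
    )

# Stack the per-group Series on top of each other: one row per team.
stacked = pd.DataFrame(per_group).T
stacked.index.name = "team"
print(stacked)
```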

Now, how do the inputs change if we select a single column before calling `apply`?

In [36]:
df['number']

indx
10    100
20    120
30    110
40     90
50     80
60    140
70     60
Name: number, dtype: int64

In [37]:
df.groupby('team')['number'].apply(explain(get_max_value))

Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10    100
40     90
50     80
Name: A, dtype: int64
Output Type:  <type 'numpy.int64'>
Output:
100
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
20    120
30    110
60    140
70     60
Name: B, dtype: int64
Output Type:  <type 'numpy.int64'>
Output:
140


team
A    100
B    140
Name: number, dtype: int64

In [38]:
df.groupby('team')[['number', 'name']].apply(explain(get_max_value))

Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
      number   name
indx               
10       100   Jane
40        90  Julie
50        80    Jim
Output Type:  <class 'pandas.core.series.Series'>
Output:
number      100
name      Julie
dtype: object
Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
      number   name
indx               
10       100   Jane
40        90  Julie
50        80    Jim
Output Type:  <class 'pandas.core.series.Series'>
Output:
number      100
name      Julie
dtype: object
Input Type:  <class 'pandas.core.frame.DataFrame'>
Input:
      number   name
indx               
20       120  Jason
30       110   Jill
60       140   John
70        60    Jen
Output Type:  <class 'pandas.core.series.Series'>
Output:
number     140
name      John
dtype: object


      number   name
team
A        100  Julie
B        140   John


The extra `['number']` code selects just one column. Only that column is passed along to the `get_max_value` function. The input is a Series and the output is a single number. Those single numbers are assembled into a new Series, indexed by the group names.

Observe how the inputs, outputs, and final result changed by selecting just the number column.
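
The change in input type can also be recorded programmatically rather than read off the printed output. In this sketch, `record_and_max` is a hypothetical stand-in for `explain(get_max_value)` that stores the type name of each input instead of printing it; the frame is a small made-up one.

```python
import pandas as pd

data = pd.DataFrame(
    {"team": ["A", "B", "B", "A"], "number": [100, 120, 110, 90]}
)

seen = []


def record_and_max(x):
    # Remember what kind of object apply handed us, then do the real work.
    seen.append(type(x).__name__)
    return x.max()


# Single brackets: each group arrives as a Series.
series_result = data.groupby("team")["number"].apply(record_and_max)
series_types = set(seen)

# Double brackets: each group arrives as a DataFrame.
seen.clear()
frame_result = data.groupby("team")[["number"]].apply(record_and_max)
frame_types = set(seen)
```

Sets are used so the check does not depend on how many times a given Pandas version calls the function per group.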

# Data Aggregation

Now let's try aggregating the data with a new function, `get_mean`.

In [39]:
def get_mean(x):
    return x.mean()

Let's find the average value in the number column for each team group.

In [53]:
team_groups['number'].apply(explain(get_mean))

Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10    100
40     90
50     80
Name: A, dtype: int64
Output Type:  <type 'numpy.float64'>
Output:
90.0
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
20    120
30    110
60    140
70     60
Name: B, dtype: int64
Output Type:  <type 'numpy.float64'>
Output:
107.5


team
A     90.0
B    107.5
Name: number, dtype: float64

As expected, the `get_mean` function gets passed the number column for each group, one at a time, returning the mean. The means are assembled into a Series.

There is a built-in `mean` method that does exactly the same thing:

In [54]:
team_groups['number'].mean()

team
A     90.0
B    107.5
Name: number, dtype: float64

In [55]:
team_groups['number'].max()

team
A    100
B    140
Name: number, dtype: int64

There are built-in methods for a lot of common operations, like `mean`, `median`, `max`, etc. If you don't find a built-in method, you can always use `apply`. In this notebook, the `apply` method also allows us to use our helper function `explain` to see the inputs and outputs. This is critical for our explanations of how this works.
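
As a sketch of both routes side by side: several built-ins can be requested at once with `agg`, while a custom function goes through `apply`. The function `value_range` is a hypothetical example of an operation with no built-in equivalent used here; the data matches the notebook's team and number columns.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "team": ["A", "B", "B", "A", "A", "B", "B"],
        "number": [100, 120, 110, 90, 80, 140, 60],
    },
    index=pd.Index([10, 20, 30, 40, 50, 60, 70], name="indx"),
)


def value_range(x):
    # Spread between the largest and smallest value in the group.
    return x.max() - x.min()


grouped = data.groupby("team")["number"]
summary = grouped.agg(["mean", "median", "max"])  # built-ins, by name
ranges = grouped.apply(value_range)               # custom function
print(summary)
print(ranges)
```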

Interestingly, we can use the `mean` method on the true_false column, which contains True and False values. It will calculate the mean assuming that True == 1 and False == 0.

In [56]:
team_groups['true_false'].mean()

team
A    0.666667
B    0.500000
Name: true_false, dtype: float64

That won't always work, for example if there happen to be null values in that column. That isn't the case here, but it might be elsewhere. If so, try this:

In [59]:
team_groups['true_false'].apply(np.mean)

team
A    0.666667
B    0.500000
Name: true_false, dtype: float64
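
Another possible way to handle null values, beyond the `np.mean` approach above, is the nullable `"boolean"` dtype. This is a sketch that assumes a newer Pandas version than the one used in this notebook (the dtype was introduced in Pandas 1.0); the frame `flags` is made up for the illustration.

```python
import pandas as pd

flags = pd.DataFrame(
    {
        "team": ["A", "A", "A", "B", "B"],
        "true_false": [True, None, False, True, True],
    }
)

# Mixing True/False with None gives an object column. Converting to the
# nullable "boolean" dtype keeps the missing entry as <NA>.
flags["true_false"] = flags["true_false"].astype("boolean")

# The mean skips the missing entry by default.
fractions = flags.groupby("team")["true_false"].apply(lambda s: s.mean())
print(fractions)
```

With this data, group 'A' has one True and one False among its non-missing values, so its fraction is 0.5.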

# Slightly more complex operation...

Imagine we want to figure out the difference between each row's number and the mean number for the row's group. Looking again at the data:

In [60]:
df

       name team  number  true_false
indx
10     Jane    A     100        True
20    Jason    B     120        True
30     Jill    B     110       False
40    Julie    A      90        True
50      Jim    A      80       False
60     John    B     140       False
70      Jen    B      60        True


And the means:

In [62]:
team_groups['number'].mean()

team
A     90.0
B    107.5
Name: number, dtype: float64

For group 'A', the mean is 90, and for 'B', it is 107.5.

Jane is in group 'A', and her number is 100. 100 - 90 = 10.

Jen is in group 'B', and her number is 60. 60 - 107.5 = -47.5.

How can we do this operation programmatically?

We do that by writing a function.

In [63]:
def subtract_mean(x):
    return x - x.mean()

Before using it in an apply statement, let's test it on one group to see that it does what we expect.

First, our expected input, the number column for one group.

In [64]:
team_groups['number'].get_group('A')

indx
10    100
40     90
50     80
Name: number, dtype: int64

The mean there is 90. Does our function work as expected?

In [65]:
subtract_mean(team_groups['number'].get_group('A'))

indx
10    10.0
40     0.0
50   -10.0
Name: number, dtype: float64

Yes, it does.

Now use it in our apply statement:

In [66]:
team_groups['number'].apply(explain(subtract_mean))

Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
10    100
40     90
50     80
Name: A, dtype: int64
Output Type:  <class 'pandas.core.series.Series'>
Output:
indx
10    10.0
40     0.0
50   -10.0
Name: A, dtype: float64
Input Type:  <class 'pandas.core.series.Series'>
Input:
indx
20    120
30    110
60    140
70     60
Name: B, dtype: int64
Output Type:  <class 'pandas.core.series.Series'>
Output:
indx
20    12.5
30     2.5
60    32.5
70   -47.5
Name: B, dtype: float64


indx
10    10.0
20    12.5
30     2.5
40     0.0
50   -10.0
60    32.5
70   -47.5
Name: number, dtype: float64

The output is a Series. If we look at the inputs and outputs printed to the screen, we can confirm that it does what we want it to do.
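
The same differences can be cross-checked with the groupby `transform` method, which returns one value per original row. This is a sketch of an alternative route, not the approach used in this notebook; it may be useful because, in newer Pandas versions, the result of `apply` here may carry the group key as an extra index level, depending on the `group_keys` setting.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "team": ["A", "B", "B", "A", "A", "B", "B"],
        "number": [100, 120, 110, 90, 80, 140, 60],
    },
    index=pd.Index([10, 20, 30, 40, 50, 60, 70], name="indx"),
)

# One group mean per row, aligned with the original index.
group_means = data.groupby("team")["number"].transform("mean")

# Row value minus the mean of the row's group.
difference = data["number"] - group_means
print(difference)
```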

We also notice that the index of the Series is the numbers 10 through 70. This is also the index of our DataFrame:

In [67]:
df

       name team  number  true_false
indx
10     Jane    A     100        True
20    Jason    B     120        True
30     Jill    B     110       False
40    Julie    A      90        True
50      Jim    A      80       False
60     John    B     140       False
70      Jen    B      60        True


Since both share the same index, let's try adding this Series to our DataFrame as a new column. Pandas matches the values to rows by index label, so if they didn't have the same index, this wouldn't work:

In [68]:
df['subtract_mean'] = team_groups['number'].apply(subtract_mean)

df

       name team  number  true_false  subtract_mean
indx
10     Jane    A     100        True           10.0
20    Jason    B     120        True           12.5
30     Jill    B     110       False            2.5
40    Julie    A      90        True            0.0
50      Jim    A      80       False          -10.0
60     John    B     140       False           32.5
70      Jen    B      60        True          -47.5


We can visually inspect the results and we see it did what we wanted.
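
The role of the index in that assignment can be seen on a small made-up frame. When a Series is assigned as a column, its values are matched to rows by index label, not by position. The names `frame`, `shuffled`, and `unrelated` are assumptions for this sketch.

```python
import pandas as pd

frame = pd.DataFrame({"number": [100, 120, 110]}, index=[10, 20, 30])

# Same labels in a different order: values are matched by label.
shuffled = pd.Series([3.0, 1.0, 2.0], index=[30, 10, 20])
frame["matched"] = shuffled

# No labels in common: every row of the new column ends up as NaN.
unrelated = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])
frame["unmatched"] = unrelated
print(frame)
```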

Adding a new column to our DataFrame has some key advantages. One of them is that I can use the new column to do filtering.

In [69]:
df[df['subtract_mean'] > 0]

       name team  number  true_false  subtract_mean
indx
10     Jane    A     100        True           10.0
20    Jason    B     120        True           12.5
30     Jill    B     110       False            2.5
60     John    B     140       False           32.5


In [70]:
df[df['subtract_mean'] <= 0]

       name team  number  true_false  subtract_mean
indx
40    Julie    A      90        True            0.0
50      Jim    A      80       False          -10.0
70      Jen    B      60        True          -47.5


I can do any kind of filtering or grouping I want on each of these new segments of the data.
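
As one example of such a follow-up step, the filtered rows can be grouped again, here to count how many members of each team are above their team's mean. This sketch rebuilds the relevant columns so it runs on its own; it computes `subtract_mean` with `transform`, which is an assumption of this sketch rather than the `apply` call used above.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "name": ["Jane", "Jason", "Jill", "Julie", "Jim", "John", "Jen"],
        "team": ["A", "B", "B", "A", "A", "B", "B"],
        "number": [100, 120, 110, 90, 80, 140, 60],
    },
    index=pd.Index([10, 20, 30, 40, 50, 60, 70], name="indx"),
)
data["subtract_mean"] = (
    data["number"] - data.groupby("team")["number"].transform("mean")
)

# Keep only the rows above their team's mean, then group the subset again.
above = data[data["subtract_mean"] > 0]
counts = above.groupby("team").size()
print(counts)
```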