## Introduction

One of the most fundamental tasks during data analysis involves splitting data into
independent groups before performing a calculation on each group. This methodology
has been around for quite some time but has more recently been referred to as split-applycombine.
This chapter covers the powerful .groupby method, which allows you to group your
data in any way imaginable and apply any type of function independently to each group before
returning a single dataset.
Before we get started with the recipes, we will need to know just a little terminology. All basic
groupby operations have grouping columns, and each unique combination of values in these
columns represents an independent grouping of the data. The syntax looks as follows:
df.groupby(['list', 'of', 'grouping', 'columns'])
df.groupby('single_column') # when grouping by a single column
The result of calling the .groupby method is a groupby object. It is this groupby object
that will be the engine that drives all the calculations for this entire chapter. pandas does very
little when creating this groupby object, merely validating that grouping is possible. You will
have to chain methods on this groupby object to unleash its powers.
The most common use of the .groupby method is to perform an aggregation. What is an
aggregation? An aggregation takes place when a sequence of many inputs get summarized
or combined into a single value output. For example, summing up all the values of a column
or finding its maximum are aggregations applied to a sequence of data. An aggregation takes
a sequence and reduces it to a single value.
In addition to the grouping columns defined during the introduction, most aggregations have
two other components, the aggregating columns and aggregating functions. The aggregating
columns are the columns whose values will be aggregated. The aggregating functions define
what aggregations take place. Aggregation functions include sum, min, max, mean, count,
variance, std, and so on.

In [2]:
import numpy as np 
import pandas as pd 
pd.set_option('max_columns', 7, 'max_rows', 10)

In [2]:
flights = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/flights.csv')

In [3]:
flights.head()

Unnamed: 0,MONTH,DAY,WEEKDAY,...,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,...,65.0,0,0
1,1,1,4,...,-13.0,0,0
2,1,1,4,...,35.0,0,0
3,1,1,4,...,-7.0,0,0
4,1,1,4,...,39.0,0,0


Define the grouping columns (AIRLINE), aggregating columns (ARR_DELAY), and
aggregating functions (mean). Place the grouping column in the .groupby method
and then call the .agg method with a dictionary pairing the aggregating column
with its aggregating function. If you pass in a dictionary, it returns back a DataFrame
instance:

In [4]:
(flights
 .groupby('AIRLINE')
 .agg({'ARR_DELAY': 'mean'})
)

Unnamed: 0_level_0,ARR_DELAY
AIRLINE,Unnamed: 1_level_1
AA,5.542661
AS,-0.833333
B6,8.692593
DL,0.339691
EV,7.034580
...,...
OO,7.593463
UA,7.765755
US,1.681105
VX,5.348884


In [6]:
flights.groupby('AIRLINE').ARR_DELAY.agg('mean')

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...   
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

In [7]:
# Alternatively, you may place the aggregating column in 
# the index operator and then pass the aggregating 
# function as a string to .agg.
# This will return a Series:
(flights
 .groupby('AIRLINE')
 ['ARR_DELAY']
 .agg('mean')
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...   
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

The string names used in the previous step are a convenience that pandas offers you
to refer to a particular aggregation function. You can pass any aggregating function
directly to the .agg method, such as the NumPy mean function. The output is the
same as the previous step:

In [8]:
(flights
 .groupby('AIRLINE')
 ['ARR_DELAY']
 .agg(np.mean)
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...   
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

In [10]:
# It's possible to skip the agg method altogether in
# this case and use the code in text method directly. 
# This output is also the same as step 3:
(flights
 .groupby('AIRLINE')
 ['ARR_DELAY']
 .mean()
)


AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...   
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

In [11]:
# The syntax for the .groupby method is not as
# straightforward as other methods. Let's intercept the 
# chain of methods in step 2 by storing the result of 
# the .groupby method as its own variable
grouped = flights.groupby('AIRLINE')
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

In [12]:
# If you do not use an aggregating function with .agg,
# pandas raises an exception. For instance, let's see 
# what happens when we apply the square root function
# to each group:
(flights
 .groupby('AIRLINE')
 ['ARR_DELAY']
 .agg(np.sqrt)
)

  result = getattr(ufunc, method)(*inputs, **kwargs)


ValueError: Must produce aggregated value

### Grouping and aggregating with multiple columns and functions

It is possible to group and aggregate with multiple columns. The syntax is slightly different
than it is for grouping and aggregating with a single column. As usual with any kind of
grouping operation, it helps to identify the three components: the grouping columns,
aggregating columns, and aggregating functions.
In this recipe, we showcase the flexibility of the .groupby method by answering the following
queries:

 Finding the number of canceled flights for every airline per weekday
 
 Finding the number and percentage of canceled and diverted flights for every airline
per weekday

 For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and  the average and variance of the airtime

In [16]:
# Read in the flights dataset, and answer the first query
# by defining the grouping columns (AIRLINE, WEEKDAY),
# the aggregating column (CANCELLED), and the
# aggregating function (sum):

(flights
 .groupby(['AIRLINE', 'WEEKDAY'])
 ['CANCELLED']
 .agg('sum')
)

AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
                    ..
WN       3          18
         4          10
         5           7
         6          10
         7           7
Name: CANCELLED, Length: 98, dtype: int64

In [14]:
# Answer the second query by using a list for each pair
# of grouping and aggregating columns, and use a
# list for the aggregating functions:

(flights
 .groupby(['AIRLINE', 'WEEKDAY'])
 [['CANCELLED', 'DIVERTED']]
 .agg('sum', 'mean')
)

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,DIVERTED
AIRLINE,WEEKDAY,Unnamed: 2_level_1,Unnamed: 3_level_1
AA,1,41,6
AA,2,9,2
AA,3,16,2
AA,4,20,5
AA,5,18,1
...,...,...,...
WN,3,18,2
WN,4,10,4
WN,5,7,0
WN,6,10,3


In [17]:
(flights
 .groupby(['AIRLINE', 'WEEKDAY'])
 [['CANCELLED', 'DIVERTED']]
 .agg(['sum', 'mean'])
)

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,DIVERTED,DIVERTED
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,41,0.032106,6,0.004699
AA,2,9,0.007341,2,0.001631
AA,3,16,0.011949,2,0.001494
AA,4,20,0.015004,5,0.003751
AA,5,18,0.014151,1,0.000786
...,...,...,...,...,...
WN,3,18,0.014118,2,0.001569
WN,4,10,0.007911,4,0.003165
WN,5,7,0.005828,0,0.000000
WN,6,10,0.010132,3,0.003040


In [18]:
# In pandas 0.25, there is a named aggregation object 
# that can create non-hierarchical columns. We will
# repeat the above query using them:

(flights
 .groupby(['ORG_AIR', 'DEST_AIR'])
 .agg(sum_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='sum'),
     mean_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='mean'),
      size_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='size'),
      mean_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='mean'),
      var_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='var')
     )
)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum_cancelled,mean_cancelled,size_cancelled,mean_air_time,var_air_time
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053


In [25]:
# To flatten the columns in step 3, you can use the 
# .to_flat_index method (available since pandas 0.24):
flights.groupby('AIRLINE')
#[['ARR_DELAY', 'WEEKDAY']].agg(['count', 'var', 'std'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026FD7BBB7C0>

In [41]:
res = (flights
       .groupby(["ORG_AIR", 'DEST_AIR'])
       .agg({'CANCELLED': ['sum', np.mean, 'size'],
            'AIR_TIME': ['mean', 'var']})
      )

In [42]:
res.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,CANCELLED,AIR_TIME,AIR_TIME
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,size,mean,var
ORG_AIR,DEST_AIR,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
ATL,ABE,0,0.0,31,96.387097,45.778495
ATL,ABQ,0,0.0,16,170.5,87.866667
ATL,ABY,0,0.0,19,28.578947,6.590643
ATL,ACY,0,0.0,6,91.333333,11.466667
ATL,AEX,0,0.0,40,78.725,47.332692


In [43]:
res.columns = ['_'.join(x) for x in res.columns.to_flat_index()]

In [44]:
res.columns

Index(['CANCELLED_sum', 'CANCELLED_mean', 'CANCELLED_size', 'AIR_TIME_mean',
       'AIR_TIME_var'],
      dtype='object')

In [40]:
res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053


In [45]:
# That is kind of ugly and I would prefer a chain
# operation to flatten the columns. Unfortunately,
# the .reindex method does not support flattening.
# Instead, we will have to leverage the .pipe method:
def flatten_cols(df):
    df.columns = ['_'.join(df) for 
                  df in df.columns.to_flat_index()]
    return df


res = (flights
       .groupby(['ORG_AIR', 'DEST_AIR'])
       .agg({'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']})
       .pipe(flatten_cols)
      )

In [46]:
res


Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED_sum,CANCELLED_mean,CANCELLED_size,AIR_TIME_mean,AIR_TIME_var
ORG_AIR,DEST_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ATL,ABE,0,0.000000,31,96.387097,45.778495
ATL,ABQ,0,0.000000,16,170.500000,87.866667
ATL,ABY,0,0.000000,19,28.578947,6.590643
ATL,ACY,0,0.000000,6,91.333333,11.466667
ATL,AEX,0,0.000000,40,78.725000,47.332692
...,...,...,...,...,...,...
SFO,SNA,4,0.032787,122,64.059322,11.338331
SFO,STL,0,0.000000,20,198.900000,101.042105
SFO,SUN,0,0.000000,10,78.000000,25.777778
SFO,TUS,0,0.000000,20,100.200000,35.221053


Be aware that when grouping with multiple columns, pandas creates a hierarchical index, or
multi-index. In the preceding example, it returned 1,130 rows. However, if one of the columns
that we group by is categorical (and has a category type, not an object type), then pandas
will create a Cartesian product of all combinations for each level. In this case, it returns
2,710 rows. However, if you have categorical columns with higher cardinality, you can get
many more values:

In [47]:
res = (flights
       .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
        .groupby(['ORG_AIR', 'DEST_AIR'])
       .agg({'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']})
)

In [48]:
res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,CANCELLED,AIR_TIME,AIR_TIME
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,size,mean,var
ORG_AIR,DEST_AIR,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
ATL,ABE,0,0.0,31,96.387097,45.778495
ATL,ABI,0,,0,,
ATL,ABQ,0,0.0,16,170.500000,87.866667
ATL,ABR,0,,0,,
ATL,ABY,0,0.0,19,28.578947,6.590643
...,...,...,...,...,...,...
SFO,TYS,0,,0,,
SFO,VLD,0,,0,,
SFO,VPS,0,,0,,
SFO,XNA,0,0.0,2,173.500000,0.500000


To remedy the combinatoric explosion, use the observed=True parameter. This makes
the categorical group bys work like grouping with string types, and only shows the observed
values and not the Cartesian product:

In [49]:
res = (flights
       .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
        .groupby(['ORG_AIR', 'DEST_AIR'], observed=True)
       .agg({'CANCELLED': ['sum', 'mean', 'size'],
            'AIR_TIME': ['mean', 'var']})
)

In [50]:
res

Unnamed: 0_level_0,Unnamed: 1_level_0,CANCELLED,CANCELLED,CANCELLED,AIR_TIME,AIR_TIME
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,size,mean,var
ORG_AIR,DEST_AIR,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
LAX,ABQ,1,0.018182,55,89.259259,29.403215
LAX,ANC,0,0.000000,7,307.428571,78.952381
LAX,ASE,1,0.038462,26,102.920000,102.243333
LAX,ATL,0,0.000000,174,224.201149,127.155837
LAX,AUS,0,0.000000,80,150.537500,57.897310
...,...,...,...,...,...,...
MSP,TTN,1,0.125000,8,124.428571,57.952381
MSP,TUL,0,0.000000,18,91.611111,63.075163
MSP,TUS,0,0.000000,2,176.000000,32.000000
MSP,TVC,0,0.000000,5,56.600000,10.300000


## Removing the MultiIndex after grouping 

Inevitably, when using groupby, you will create a MultiIndex. MultiIndexes can happen in both
the index and the columns. DataFrames with MultiIndexes are more difficult to navigate and
occasionally have confusing column names as well.
In this recipe, we perform an aggregation with the .groupby method to create a DataFrame
with a MultiIndex for the rows and columns. Then, we manipulate the index so that it has
a single level and the column names are descriptive.

In [51]:
# Read in the flights dataset, write a statement to find 
# the total and average miles flown, and the maximum and
# minimum arrival delay for each airline for each
# weekday:

flights = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/flights.csv')

In [55]:
airline_info = (flights.
                groupby(['AIRLINE', 'WEEKDAY'])
                .agg({'DIST': ['sum', 'mean'], 
                      'ARR_DELAY': ['min', 'max']})
                .astype(int)
)

In [56]:
airline_info

Unnamed: 0_level_0,Unnamed: 1_level_0,DIST,DIST,ARR_DELAY,ARR_DELAY
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,min,max
AIRLINE,WEEKDAY,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AA,1,1455386,1139,-60,551
AA,2,1358256,1107,-52,725
AA,3,1496665,1117,-45,473
AA,4,1452394,1089,-46,349
AA,5,1427749,1122,-41,732
...,...,...,...,...,...
WN,3,997213,782,-38,262
WN,4,1024854,810,-52,284
WN,5,981036,816,-44,244
WN,6,823946,834,-41,290


Both the rows and columns are labeled by a MultiIndex with two levels. Let's squash
both down to just a single level. To address the columns, we use the MultiIndex
method, .to_flat_index. Let's display the output of each level and then
concatenate both levels before setting it as the new column values:

In [61]:
airline_info.columns.get_level_values(0)

Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')

In [62]:
airline_info.columns.get_level_values(1)

Index(['sum', 'mean', 'min', 'max'], dtype='object')

In [64]:
airline_info.columns.to_flat_index()

Index([('DIST', 'sum'), ('DIST', 'mean'), ('ARR_DELAY', 'min'),
       ('ARR_DELAY', 'max')],
      dtype='object')

In [65]:
airline_info.columns = ['_'.join(x) for x in airline_info.columns.to_flat_index()]

In [67]:
airline_info

Unnamed: 0_level_0,Unnamed: 1_level_0,DIST_sum,DIST_mean,ARR_DELAY_min,ARR_DELAY_max
AIRLINE,WEEKDAY,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AA,1,1455386,1139,-60,551
AA,2,1358256,1107,-52,725
AA,3,1496665,1117,-45,473
AA,4,1452394,1089,-46,349
AA,5,1427749,1122,-41,732
...,...,...,...,...,...
WN,3,997213,782,-38,262
WN,4,1024854,810,-52,284
WN,5,981036,816,-44,244
WN,6,823946,834,-41,290


In [68]:
# A quick way to get rid of the row MultiIndex is to
# use the .reset_index method:
airline_info.reset_index()

Unnamed: 0,AIRLINE,WEEKDAY,DIST_sum,DIST_mean,ARR_DELAY_min,ARR_DELAY_max
0,AA,1,1455386,1139,-60,551
1,AA,2,1358256,1107,-52,725
2,AA,3,1496665,1117,-45,473
3,AA,4,1452394,1089,-46,349
4,AA,5,1427749,1122,-41,732
...,...,...,...,...,...,...
93,WN,3,997213,782,-38,262
94,WN,4,1024854,810,-52,284
95,WN,5,981036,816,-44,244
96,WN,6,823946,834,-41,290


In [73]:
# Refactor the code to make it readable. Use the pandas
# 0.25 functionality to flatten columns automatically:

(flights
 .groupby(['AIRLINE', 'WEEKDAY'])
 .agg(dist_sum=pd.NamedAgg(column='DIST', aggfunc='sum'),
     dist_mean=pd.NamedAgg(column='DIST', aggfunc='mean'),
      arr_delay_min=pd.NamedAgg(column='ARR_DELAY', aggfunc='min'),
      arr_delay_max=pd.NamedAgg(column='ARR_DELAY', aggfunc='max'),
).astype(int)
 .reset_index()
)

Unnamed: 0,AIRLINE,WEEKDAY,dist_sum,dist_mean,arr_delay_min,arr_delay_max
0,AA,1,1455386,1139,-60,551
1,AA,2,1358256,1107,-52,725
2,AA,3,1496665,1117,-45,473
3,AA,4,1452394,1089,-46,349
4,AA,5,1427749,1122,-41,732
...,...,...,...,...,...,...
93,WN,3,997213,782,-38,262
94,WN,4,1024854,810,-52,284
95,WN,5,981036,816,-44,244
96,WN,6,823946,834,-41,290


By default, at the end of a groupby operation, pandas puts all of the grouping columns in the
index. The as_index parameter in the .groupby method can be set to False to avoid this
behavior. You can chain the .reset_index method after grouping to get the same effect as
seen in step 3. Let's see an example of this by finding the average distance traveled per flight
from each airline:

In [77]:
(flights
 .groupby(['AIRLINE'], as_index=False)
 ['DIST']
 .agg('mean')
 .round(0)
)

Unnamed: 0,AIRLINE,DIST
0,AA,1114.0
1,AS,1066.0
2,B6,1772.0
3,DL,866.0
4,EV,460.0
...,...,...
9,OO,511.0
10,UA,1231.0
11,US,1181.0
12,VX,1240.0


In [76]:
(flights
 .groupby(['AIRLINE'], as_index=False, sort=False)
 ['DIST']
 .agg('mean')
 .round(0)
)

Unnamed: 0,AIRLINE,DIST
0,WN,810.0
1,UA,1231.0
2,MQ,404.0
3,AA,1114.0
4,F9,970.0
...,...,...
9,AS,1066.0
10,DL,866.0
11,VX,1240.0
12,B6,1772.0


### Grouping with a custom aggregation function

pandas provides a number of aggregation functions to use with the groupby object. At some point, you may need to write your own custom user-defined function that does not exist in
pandas or NumPy. In this recipe, we use the college dataset to calculate the mean and standard deviation
of the undergraduate student population per state. We then use this information to find the
maximum number of standard deviations from the mean that any single population value
is per state.


In [78]:
# Read in the college dataset, and find the mean and 
# standard deviation of the undergraduate population
# by state:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv')

In [82]:
(college
 .groupby(['STABBR'], )
 ['UGDS']
 .agg(['mean', 'std'])
 .round(0)
)

Unnamed: 0_level_0,mean,std
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,2493.0,4052.0
AL,2790.0,4658.0
AR,1644.0,3143.0
AS,1276.0,
AZ,4130.0,14894.0
...,...,...
VT,1513.0,2194.0
WA,2271.0,4124.0
WI,2655.0,4615.0
WV,1758.0,5957.0


This output isn't quite what we desire. We are not looking for the mean and standard
deviations of the entire group but the maximum number of standard deviations away
from the mean for any one institution. To calculate this, we need to subtract the mean
undergraduate population by state from each institution's undergraduate population
and then divide by the standard deviation. This standardizes the undergraduate
population for each group. We can then take the maximum of the absolute value of
these scores to find the one that is farthest away from the mean. pandas does not
provide a function capable of doing this. Instead, we will need to create a custom
function:

In [83]:
def max_deviation(s):
    std_score = (s - s.mean()) / s.std()
    return std_score.abs().max()

In [84]:
# After defining the function, pass it directly to the 
# .agg method to complete the aggregation :
(college
 .groupby('STABBR')
 ['UGDS']
 .agg(max_deviation)
 .round(1)
)

STABBR
AK    2.6
AL    5.8
AR    6.3
AS    NaN
AZ    9.9
     ... 
VT    3.8
WA    6.6
WI    5.8
WV    7.2
WY    2.8
Name: UGDS, Length: 59, dtype: float64

In [85]:
# It is possible to apply our custom function to multiple
# aggregating columns. We simply add more column names to
# the indexing operator. The max_deviation function only
# works with numeric columns:
(college
 .groupby('STABBR')
 [['UGDS', 'SATVRMID', 'SATMTMID']]
 .agg(max_deviation)
 .round(1)
)

Unnamed: 0_level_0,UGDS,SATVRMID,SATMTMID
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AK,2.6,,
AL,5.8,1.6,1.8
AR,6.3,2.2,2.3
AS,,,
AZ,9.9,1.9,1.4
...,...,...,...
VT,3.8,1.9,1.9
WA,6.6,2.2,2.0
WI,5.8,2.4,2.2
WV,7.2,1.7,2.1


In [87]:
# You can also use your custom aggregation function 
# along with the prebuilt functions. The following does
# this and groups by state and religious affiliation:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 [['UGDS', 'SATVRMID', 'SATMTMID']]
 .agg([max_deviation, 'mean', 'std'])
 .round(1)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,...,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,max_deviation,mean,std,...,max_deviation,mean,std
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
AK,0,2.1,3508.9,4539.5,...,,,
AK,1,1.1,123.3,132.9,...,,503.0,
AL,0,5.2,3248.8,5102.4,...,1.7,515.8,56.7
AL,1,2.4,979.7,870.8,...,1.4,485.6,61.4
AR,0,5.8,1793.7,3401.6,...,2.0,503.6,39.0
...,...,...,...,...,...,...,...,...
WI,0,5.3,2879.1,5031.5,...,1.3,591.2,85.7
WI,1,3.4,1716.2,1934.6,...,1.8,526.6,42.5
WV,0,6.9,1873.9,6271.7,...,1.8,480.0,27.7
WV,1,1.3,716.4,503.6,...,1.7,484.8,17.7


In [88]:
# Notice that pandas uses the name of the function as the
# name for the returned column. You can change the column
# name directly with the .rename method or you can modify
# the function attribute .__name__:
max_deviation.__name__


'max_deviation'

In [89]:
max_deviation.__name__ = 'Max Deviation'

In [90]:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 [['UGDS', 'SATVRMID', 'SATMTMID']]
 .agg([max_deviation, 'mean', 'std'])
 .round(1)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,UGDS,UGDS,UGDS,...,SATMTMID,SATMTMID,SATMTMID
Unnamed: 0_level_1,Unnamed: 1_level_1,Max Deviation,mean,std,...,Max Deviation,mean,std
STABBR,RELAFFIL,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
AK,0,2.1,3508.9,4539.5,...,,,
AK,1,1.1,123.3,132.9,...,,503.0,
AL,0,5.2,3248.8,5102.4,...,1.7,515.8,56.7
AL,1,2.4,979.7,870.8,...,1.4,485.6,61.4
AR,0,5.8,1793.7,3401.6,...,2.0,503.6,39.0
...,...,...,...,...,...,...,...,...
WI,0,5.3,2879.1,5031.5,...,1.3,591.2,85.7
WI,1,3.4,1716.2,1934.6,...,1.8,526.6,42.5
WV,0,6.9,1873.9,6271.7,...,1.8,480.0,27.7
WV,1,1.3,716.4,503.6,...,1.7,484.8,17.7


## Customizing aggregating functions with *args and *kwargs

In this recipe, we will build a customized function for the college dataset that finds the
percentage of schools by state and religious affiliation that have an undergraduate population
between two values.

In [92]:
# Define a function that returns the percentage of schools
# with an undergraduate population of between
# 1,000 and 3,000:
def pct_between_1_3k(s):
    return (s
           .between(1_000, 3_000)
           .mean()
           * 100)

In [93]:
# calculate this percentage grouping by state 
# and religious affiliation:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 ['UGDS']
     .agg(pct_between_1_3k)
 .round(1)
)

STABBR  RELAFFIL
AK      0           14.3
        1            0.0
AL      0           23.6
        1           33.3
AR      0           27.9
                    ... 
WI      0           13.8
        1           36.0
WV      0           24.6
        1           37.5
WY      0           54.5
Name: UGDS, Length: 112, dtype: float64

In [94]:
# This function works, but it does not give the user any
# flexibility to choose the lower and  upper bound. 
# Let's create a new function that allows the user to 
# parameterize these bounds:
def pct_between(s, low, high):
    return s.between(low, high).mean() * 100

In [96]:
# Pass this new function to the .agg method along with
# the lower and upper bounds:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 ['UGDS']
     .agg(pct_between, 1_000, 3_000)
 .round(1)
)

STABBR  RELAFFIL
AK      0           14.3
        1            0.0
AL      0           23.6
        1           33.3
AR      0           27.9
                    ... 
WI      0           13.8
        1           36.0
WV      0           24.6
        1           37.5
WY      0           54.5
Name: UGDS, Length: 112, dtype: float64

In [97]:
#There are a few ways we could achieve the same result
# in step 4. We could have explicitly used keyword 
# parameters to produce the same result:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 .agg(pct_between, low=1_000, high=1_0000)
 .round(1)
)

  (college


Unnamed: 0_level_0,Unnamed: 1_level_0,HBCU,MENONLY,WOMENONLY,...,PCTPELL,PCTFLOAN,UG25ABV
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AK,0,0.0,0.0,0.0,...,0.0,0.0,0.0
AK,1,0.0,0.0,0.0,...,0.0,0.0,0.0
AL,0,0.0,0.0,0.0,...,0.0,0.0,0.0
AL,1,0.0,0.0,0.0,...,0.0,0.0,0.0
AR,0,0.0,0.0,0.0,...,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
WI,0,0.0,0.0,0.0,...,0.0,0.0,0.0
WI,1,0.0,0.0,0.0,...,0.0,0.0,0.0
WV,0,0.0,0.0,0.0,...,0.0,0.0,0.0
WV,1,0.0,0.0,0.0,...,0.0,0.0,0.0


In [98]:
# If we want to call multiple aggregation functions and
# some of them need parameters, we can utilize Python's 
# closure functionality to create a new function that has
# the parameters closed over in its calling environment
def between_n_m(n, m):
    def wrapper(ser):
        return pct_between(ser, n, m)
    wrapper.__name__ =  f'between_{n}_{m}'
    return wrapper

In [100]:
(college
 .groupby(['STABBR', 'RELAFFIL'])
 ['UGDS']
 .agg([between_n_m(1_000, 10_000), 'max', 'mean'])
 .round(1)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,between_1000_10000,max,mean
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,0,42.9,12865.0,3508.9
AK,1,0.0,275.0,123.3
AL,0,45.8,29851.0,3248.8
AL,1,37.5,3033.0,979.7
AR,0,39.7,21405.0,1793.7
...,...,...,...,...
WI,0,31.0,29302.0,2879.1
WI,1,44.0,8212.0,1716.2
WV,0,29.2,44924.0,1873.9
WV,1,37.5,1375.0,716.4


## Examining the groupby object

The immediate result from using the .groupby method on a DataFrame is a groupby object.
Usually, we chain operations on this object to do aggregations or transformations without ever
storing the intermediate values in variables.
In this recipe, we examine the groupby object to examine individual groups.

In [101]:
# Let's get started by grouping the state and religious 
# affiliation columns from the college dataset, saving 
# the result to a variable and confirming its type:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv')

In [102]:
grouped = college.groupby(['STABBR', 'RELAFFIL'])

In [103]:
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

In [106]:
len(dir(pd.DataFrame)), len(dir(pd.Series))

(432, 421)

In [107]:
len(dir(pd.DataFrame)) & len(dir(pd.Series))

416

In [110]:
# Use the dir function to discover the attributes of 
# a groupby object:
print([attr for attr in dir(grouped) if not attr.startswith('_')])

['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM', 'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL', 'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN', 'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN', 'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take', 'transform', 'tshift', 'var']


In [111]:
# Find the number of groups with the .ngroups attribute:
grouped.ngroups

112

To find the uniquely identifying labels for each group, look in the .groups attribute,
which contains a dictionary of each unique group mapped to all the corresponding
index labels of that group. Because we grouped by two columns, each of the keys has
a tuple, one value for the STABBR column and another for the RELAFFIL column:

In [116]:
groups = list(grouped.groups)

In [119]:
groups[:6]

[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]

In [120]:
# Retrieve a single group with the .get_group method by
# passing it a tuple of an exact group label. 
# For example, to get all the religiously affiliated
# schools in the state of Florida, do the following:
grouped.get_group(('FL', 1))

Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
712,The Baptist College of Florida,Graceville,FL,...,0.3531,30800,20052
713,Barry University,Miami,FL,...,0.4361,44100,28250
714,Gooding Institute of Nurse Anesthesia,Panama City,FL,...,,,PrivacySuppressed
715,Bethune-Cookman University,Daytona Beach,FL,...,0.0647,29400,36250
724,Johnson University Florida,Kissimmee,FL,...,0.2185,26300,20199
...,...,...,...,...,...,...,...
7486,Strayer University-Coral Springs Campus,Coral Springs,FL,...,,49200,36173.5
7487,Strayer University-Fort Lauderdale Campus,Fort Lauderdale,FL,...,,49200,36173.5
7488,Strayer University-Miramar Campus,Miramar,FL,...,,49200,36173.5
7489,Strayer University-Doral,Miami,FL,...,,49200,36173.5


You may want to take a peek at each individual group. This is possible because
groupby objects are iterable. If you are in Jupyter, you can leverage the display
function to show each group in a single cell (otherwise, Jupyter will only show the
result of the last statement of the cell):

In [121]:
from IPython.display import display

for name, group in grouped:
    print(name)
    display(group.head(3))

('AK', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
60,University of Alaska Anchorage,Anchorage,AK,...,0.4386,42500,19449.5
62,University of Alaska Fairbanks,Fairbanks,AK,...,0.4519,36200,19355.0
63,University of Alaska Southeast,Juneau,AK,...,0.555,37400,16875.0


('AK', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
61,Alaska Bible College,Palmer,AK,...,0.4286,,PrivacySuppressed
64,Alaska Pacific University,Anchorage,AK,...,0.491,47000.0,23250
5417,Alaska Christian College,Soldotna,AK,...,0.2264,,PrivacySuppressed


('AL', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,...,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,...,0.2422,39700,21941.5
3,University of Alabama in Huntsville,Huntsville,AL,...,0.264,45500,24097.0


('AL', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2,Amridge University,Montgomery,AL,...,0.854,40100,23370
10,Birmingham Southern College,Birmingham,AL,...,0.0152,44200,27000
12,Concordia College Alabama,Selma,AL,...,0.2367,19900,PrivacySuppressed


('AR', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
128,University of Arkansas at Little Rock,Little Rock,AR,...,0.4062,33900,21736
129,University of Arkansas for Medical Sciences,Little Rock,AR,...,0.5133,61400,12500
130,ABC Beauty College Inc,Arkadelphia,AR,...,0.4688,PrivacySuppressed,16500


('AR', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
131,Arkansas Baptist College,Little Rock,AR,...,0.2833,22000,38000.0
134,Lyon College,Batesville,AR,...,0.0524,38600,25000.0
144,Baptist Health College-Little Rock,Little Rock,AR,...,0.3791,43200,13393.5


('AS', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4138,American Samoa Community College,Pago Pago,AS,...,0.1774,19800,PrivacySuppressed


('AZ', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
69,Collins College,Phoenix,AZ,...,0.4764,25700,47000
71,Empire Beauty School-Tucson,Tucson,AZ,...,0.4229,18200,9833
72,Thunderbird School of Global Management,Glendale,AZ,...,0.0,118900,PrivacySuppressed


('AZ', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
68,Everest College-Phoenix,Phoenix,AZ,...,0.67,28600,9500
70,Empire Beauty School-Paradise Valley,Phoenix,AZ,...,0.4651,17800,9588
73,American Indian College Inc,Phoenix,AZ,...,0.4684,PrivacySuppressed,PrivacySuppressed


('CA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
192,Academy of Art University,San Francisco,CA,...,0.4043,36000.0,35093
193,ITT Technical Institute-Rancho Cordova,Rancho Cordova,CA,...,0.7235,38800.0,25827.5
194,Academy of Chinese Culture and Health Sciences,Oakland,CA,...,,,PrivacySuppressed


('CA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
200,American Baptist Seminary of the West,Berkeley,CA,...,,,PrivacySuppressed
210,Azusa Pacific University,Azusa,CA,...,0.1467,50000,22500
214,Bethesda University,Anaheim,CA,...,0.4672,PrivacySuppressed,PrivacySuppressed


('CO', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
565,The Salon Professional Academy-Grand Junction,Grand Junction,CO,...,0.2778,PrivacySuppressed,9570
566,Adams State University,Alamosa,CO,...,0.2106,32800,16255
567,Aims Community College,Greeley,CO,...,0.3941,31400,8773


('CO', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
575,Colorado Christian University,Lakewood,CO,...,0.45,36900.0,25808
589,Prince Institute-Rocky Mountains,Westminster,CO,...,0.8824,33400.0,20992
592,Denver Seminary,Littleton,CO,...,,,PrivacySuppressed


('CT', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
629,Paul Mitchell the School-Danbury,Danbury,CT,...,0.2913,19000,10486
630,Asnuntuck Community College,Enfield,CT,...,0.3959,30900,5500
631,Branford Hall Career Institute-Branford Campus,Branford,CT,...,0.5725,27900,9800


('CT', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
628,Albertus Magnus College,New Haven,CT,...,0.5133,52100.0,27763.5
645,Fairfield University,Fairfield,CT,...,0.0604,68500.0,26852.5
652,Holy Apostles College and Seminary,Cromwell,CT,...,0.7241,,PrivacySuppressed


('DC', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
698,University of the District of Columbia,Washington,DC,...,0.5662,34800,22393.5
700,Gallaudet University,Washington,DC,...,0.2451,26000,17750.0
701,George Washington University,Washington,DC,...,0.0783,65400,25350.0


('DC', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
696,American University,Washington,DC,...,0.0252,55900.0,24589
697,Catholic University of America,Washington,DC,...,0.094,53900.0,26000
699,Pontifical Faculty of the Immaculate Conceptio...,Washington,DC,...,,,PrivacySuppressed


('DE', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
685,Margaret H Rollins School of Nursing at Beebe ...,Lewes,DE,...,0.4909,PrivacySuppressed,PrivacySuppressed
686,Dawn Career Institute Inc,Wilmington,DE,...,0.6003,22400,9500
688,Delaware Technical Community College-Terry,Dover,DE,...,0.4075,30700,8000


('DE', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
687,Delaware Technical Community College-Owens,Georgetown,DE,...,0.3561,28800,6750
689,Delaware Technical Community College-Stanton/W...,Wilmington,DE,...,0.3842,34000,7508
694,Wesley College,Dover,DE,...,0.1319,41600,31000


('FL', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
709,Wyotech-Daytona,Ormond Beach,FL,...,0.598,31800,11600
710,The Art Institute of Fort Lauderdale,Fort Lauderdale,FL,...,0.4132,28800,29983
711,Atlantic Technical College,Coconut Creek,FL,...,0.5044,31900,PrivacySuppressed


('FL', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
712,The Baptist College of Florida,Graceville,FL,...,0.3531,30800.0,20052
713,Barry University,Miami,FL,...,0.4361,44100.0,28250
714,Gooding Institute of Nurse Anesthesia,Panama City,FL,...,,,PrivacySuppressed


('FM', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4214,College of Micronesia-FSM,Pohnpei,FM,...,0.1631,15700,PrivacySuppressed


('GA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
860,Abraham Baldwin Agricultural College,Tifton,GA,...,0.1523,32000,15085.5
862,Interactive College of Technology-Chamblee,Chamblee,GA,...,0.7937,21100,7376.0
863,Interactive College of Technology-Morrow,Morrow,GA,...,0.7778,21100,7376.0


('GA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
785,Luther Rice University & Seminary,Lithonia,GA,...,0.8748,39400,29500
861,Agnes Scott College,Decatur,GA,...,0.0459,38800,27000
867,Andrew College,Cuthbert,GA,...,0.0095,27500,12875


('GU', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4139,Guam Community College,Mangilao,GU,...,0.3058,22000,PrivacySuppressed
4140,University of Guam,Mangilao,GU,...,0.2064,29900,PrivacySuppressed


('GU', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
5289,Pacific Islands University,Mangilao,GU,...,0.2533,PrivacySuppressed,PrivacySuppressed


('HI', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
951,University of Hawaii at Hilo,Hilo,HI,...,0.269,33500,19197
952,University of Hawaii at Manoa,Honolulu,HI,...,0.1755,43000,19000
953,Hawaii Institute of Hair Design,Honolulu,HI,...,0.5529,17300,5868


('HI', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
949,Heald College-Honolulu,Honolulu,HI,...,0.5262,35000,11676
950,Chaminade University of Honolulu,Honolulu,HI,...,0.3237,38400,22000
3805,Brigham Young University-Hawaii,Laie,HI,...,0.2224,41500,8291


('IA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1247,Allen College,Waterloo,IA,...,0.3945,49100,17090.5
1248,AIB College of Business,Des Moines,IA,...,0.3209,37000,19732.5
1251,Capri College-Dubuque,Dubuque,IA,...,0.2295,19400,8477.0


('IA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1249,Briar Cliff University,Sioux City,IA,...,0.238,38100,24000
1250,Buena Vista University,Storm Lake,IA,...,0.3999,38300,23877.5
1253,American College of Hairstyling-Cedar Rapids,Cedar Rapids,IA,...,0.4545,PrivacySuppressed,PrivacySuppressed


('ID', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
965,Carrington College-Boise,Boise,ID,...,0.558,25000,9500
967,Boise State University,Boise,ID,...,0.3182,35600,23500
968,Eastern Idaho Technical College,Idaho Falls,ID,...,0.6041,26600,11375


('ID', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
966,Boise Bible College,Boise,ID,...,0.1613,25500,19596
977,Northwest Nazarene University,Nampa,ID,...,0.2991,35900,25500
979,Brigham Young University-Idaho,Rexburg,ID,...,0.371,38800,11000


('IL', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
43,Prince Institute-Southeast,Elmhurst,IL,...,0.6569,PrivacySuppressed,20992
981,Adler University,Chicago,IL,...,,,PrivacySuppressed
982,Alvareitas College of Cosmetology-Edwardsville,Edwardsville,IL,...,0.3111,PrivacySuppressed,9911


('IL', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
986,Augustana College,Rock Island,IL,...,0.0115,47900.0,27000
992,Blackburn College,Carlinville,IL,...,0.0534,37100.0,26000
1004,Catholic Theological Union at Chicago,Chicago,IL,...,,,PrivacySuppressed


('IN', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1165,Apex Academy of Hair Design Inc,Anderson,IN,...,0.3333,PrivacySuppressed,PrivacySuppressed
1166,Ball State University,Muncie,IN,...,0.0715,38800,25000
1168,Butler University,Indianapolis,IN,...,0.0185,55000,27000


('IN', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
990,Bethany Theological Seminary,Richmond,IN,...,,,PrivacySuppressed
1163,Ancilla College,Donaldson,IN,...,0.2925,29400.0,17000
1164,Anderson University,Anderson,IN,...,0.1215,35600.0,27000


('KS', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1326,Allen County Community College,Iola,KS,...,0.2886,29100,6900
1328,Barton County Community College,Great Bend,KS,...,0.4148,32200,8976
1332,Brown Mackie College-Kansas City,Lenexa,KS,...,0.6296,25200,16000


('KS', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1327,Baker University,Baldwin City,KS,...,0.4418,48800,25250
1329,Benedictine College,Atchison,KS,...,0.0208,39600,26000
1330,Bethany College,Lindsborg,KS,...,0.0316,38100,27000


('KY', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1389,Alice Lloyd College,Pippa Passes,KY,...,0.046,33500,16495
1390,Asbury University,Wilmore,KY,...,0.1448,33600,25250
1392,Ashland Community and Technical College,Ashland,KY,...,0.3974,23700,11780


('KY', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1391,Asbury Theological Seminary,Wilmore,KY,...,,42500,PrivacySuppressed
1394,Bellarmine University,Louisville,KY,...,0.0941,46600,25000
1398,Brescia University,Owensboro,KY,...,0.4903,37500,30500


('LA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1461,Central Louisiana Technical Community College,Alexandria,LA,...,0.4799,PrivacySuppressed,PrivacySuppressed
1462,American School of Business,Shreveport,LA,...,0.8353,19400,9500
1463,Ayers Career College,Shreveport,LA,...,0.6816,25100,9500


('LA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1470,Centenary College of Louisiana,Shreveport,LA,...,0.0307,40400,25000.0
1478,Dillard University,New Orleans,LA,...,0.0904,32800,35000.0
1492,Louisiana College,Pineville,LA,...,0.1487,39100,23743.5


('MA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1619,Hult International Business School,Cambridge,MA,...,,,PrivacySuppressed
1620,New England College of Business and Finance,Boston,MA,...,0.8543,,18450
1621,American International College,Springfield,MA,...,0.2102,38900.0,27000


('MA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1623,Andover Newton Theological School,Newton Centre,MA,...,,,PrivacySuppressed
1624,Anna Maria College,Paxton,MA,...,0.2948,41900.0,25361
1626,Assumption College,Worcester,MA,...,0.0781,53600.0,27000


('MD', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1556,Aaron's Academy of Beauty,Waldorf,MD,...,0.4359,PrivacySuppressed,PrivacySuppressed
1557,Aesthetics Institute of Cosmetology,Gaithersburg,MD,...,0.65,PrivacySuppressed,6333
1558,Allegany College of Maryland,Cumberland,MD,...,0.2946,29300,14072


('MD', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1570,Washington Adventist University,Takoma Park,MD,...,0.3225,44500,27000
1587,Loyola University Maryland,Baltimore,MD,...,0.0072,63000,27000
1599,Mount St Mary's University,Emmitsburg,MD,...,0.0781,49900,25995


('ME', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1526,Kaplan University-Maine Campus,S Portland,ME,...,0.752,33400,29493
1527,College of the Atlantic,Bar Harbor,ME,...,0.0387,26400,19000
1528,Bates College,Lewiston,ME,...,0.0034,51600,16297


('ME', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1535,Husson University,Bangor,ME,...,0.2332,36900,26250
1549,Saint Joseph's College of Maine,Standish,ME,...,0.4171,39100,27000
4515,New England School of Communications,Bangor,ME,...,0.1007,27400,27000


('MH', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4561,College of the Marshall Islands,Majuro,MH,...,0.231,PrivacySuppressed,PrivacySuppressed


('MI', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1224,West Michigan College of Barbering and Beauty,Kalamazoo,MI,...,0.4368,14800,PrivacySuppressed
1755,Hillsdale Beauty College,Hillsdale,MI,...,0.2,PrivacySuppressed,PrivacySuppressed
1756,Northwestern Technological Institute,Southfield,MI,...,0.6478,30200,9500


('MI', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1753,Adrian College,Adrian,MI,...,0.0231,37100,27000
1754,Albion College,Albion,MI,...,0.013,44900,27000
1757,Alma College,Alma,MI,...,0.0113,43200,27000


('MN', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
550,Walden University,Minneapolis,MN,...,0.8741,59700,29125
1863,Academy College,Bloomington,MN,...,0.6779,38500,29069
1864,Alexandria Technical & Community College,Alexandria,MN,...,0.2576,35100,12000


('MN', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1868,Augsburg College,Minneapolis,MN,...,0.3108,45700,27000
1872,Bethany Lutheran College,Mankato,MN,...,0.0311,34200,25000
1873,Bethel University,Saint Paul,MN,...,0.1991,45000,24069


('MO', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1357,Concorde Career College-Kansas City,Kansas City,MO,...,0.6181,22100,9500.0
1999,ITT Technical Institute-Earth City,Earth City,MO,...,0.701,38800,25827.5
2001,House of Heavilin Beauty College-Blue Springs,Blue Springs,MO,...,0.3556,11600,9088.5


('MO', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1996,Aquinas Institute of Theology,Saint Louis,MO,...,,,PrivacySuppressed
1997,Assemblies of God Theological Seminary,Springfield,MO,...,,PrivacySuppressed,22062
1998,Avila University,Kansas City,MO,...,0.3298,41100,26625


('MP', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4141,Northern Marianas College,Saipan,MP,...,0.2002,19600,PrivacySuppressed


('MS', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1956,Alcorn State University,Alcorn State,MS,...,0.254,30400,28000
1959,Chris Beauty College,Gulfport,MS,...,0.299,15300,PrivacySuppressed
1960,Coahoma Community College,Clarksdale,MS,...,0.302,21100,PrivacySuppressed


('MS', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1957,Belhaven University,Jackson,MS,...,0.5435,36800,29656
1958,Blue Mountain College,Blue Mountain,MS,...,0.1692,29200,PrivacySuppressed
1963,Creations College of Cosmetology,Tupelo,MS,...,0.4902,17900,PrivacySuppressed


('MT', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2108,Academy of Cosmetology,Bozeman,MT,...,0.2619,PrivacySuppressed,PrivacySuppressed
2109,Blackfeet Community College,Browning,MT,...,0.48,15600,PrivacySuppressed
2110,Butte Academy of Beauty Culture,Butte,MT,...,0.4054,PrivacySuppressed,9500


('MT', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2113,Carroll College,Helena,MT,...,0.0741,45500,27000
2121,University of Great Falls,Great Falls,MT,...,0.4283,30700,24000
2130,Rocky Mountain College,Billings,MT,...,0.1053,38900,25626


('NC', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2642,College of the Albemarle,Elizabeth City,NC,...,0.3617,22300,PrivacySuppressed
2643,The Art Institute of Charlotte,Charlotte,NC,...,0.2754,28800,25167
2644,South Piedmont Community College,Polkton,NC,...,0.3595,21700,PrivacySuppressed


('NC', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2647,Barton College,Wilson,NC,...,0.2271,36000,27000
2649,Belmont Abbey College,Belmont,NC,...,0.4347,36000,27000
2650,Bennett College,Greensboro,NC,...,0.0235,26900,37000


('ND', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2772,Rasmussen College-North Dakota,Fargo,ND,...,0.6286,30900,21163
2773,Bismarck State College,Bismarck,ND,...,0.3351,38400,11588
2774,Dickinson State University,Dickinson,ND,...,0.2436,38800,19500


('ND', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2778,University of Jamestown,Jamestown,ND,...,0.0806,39600,27000
2782,University of Mary,Bismarck,ND,...,0.1698,45100,22722
2792,Trinity Bible College,Ellendale,ND,...,0.1515,25500,27592


('NE', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2133,La'James International College,Fremont,NE,...,0.2424,15900,PrivacySuppressed
2134,Bellevue University,Bellevue,NE,...,0.8125,52600,17188
2136,Bryan College of Health Sciences,Lincoln,NE,...,0.3174,50900,24280.5


('NE', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2135,Clarkson College,Omaha,NE,...,0.4744,47000,26000
2140,Concordia University-Nebraska,Seward,NE,...,0.0405,36100,26000
2141,Creighton University,Omaha,NE,...,0.0775,57100,23250


('NH', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2183,Colby-Sawyer College,New London,NH,...,0.0142,38800,27000
2184,Continental Academie of Hair Design-Hudson,Hudson,NH,...,0.1129,23200,9075
2185,Daniel Webster College,Nashua,NH,...,0.1377,50500,26999


('NH', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2193,Northeast Catholic College,Warner,NH,...,,,PrivacySuppressed
2210,Rivier University,Nashua,NH,...,0.4104,41700.0,25500
2211,Saint Anselm College,Manchester,NH,...,0.0146,52800.0,27000


('NJ', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2215,Eastwick College-Hackensack,Hackensack,NJ,...,0.6667,27300,12519
2216,Atlantic Cape Community College,Mays Landing,NJ,...,0.3129,28100,10005
2217,Fortis Institute-Wayne,Wayne,NJ,...,0.328,30400,10305


('NJ', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2221,Bloomfield College,Bloomfield,NJ,...,0.2044,36100,30500.0
2224,Caldwell University,Caldwell,NJ,...,0.2186,44400,26040.0
2226,Centenary College,Hackettstown,NJ,...,0.3138,41100,25437.5


('NM', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
114,Pima Medical Institute-Albuquerque,Albuquerque,NM,...,0.5387,28200,8708
2303,Olympian Academy of Cosmetology,Alamogordo,NM,...,0.4169,17200,11705
2304,Central New Mexico Community College,Albuquerque,NM,...,0.4726,29500,10000


('NM', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
7419,Computer Career Center-Las Cruces,Las Cruces,NM,...,,21300,14250


('NV', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2170,Academy of Hair Design-Las Vegas,Las Vegas,NV,...,0.2468,17200,9500.0
2171,Career College of Northern Nevada,Sparks,NV,...,0.5845,23800,14020.5
2172,College of Southern Nevada,Las Vegas,NV,...,0.4493,31700,10500.0


('NV', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
6439,Touro University Nevada,Henderson,NV,...,0.4,,PrivacySuppressed
7352,Marinello School of Beauty-Henderson,Henderson,NV,...,,21200.0,9796.5


('NY', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
678,Tri-State College of Acupuncture,New York,NY,...,,PrivacySuppressed,PrivacySuppressed
2334,Vaughn College of Aeronautics and Technology,Flushing,NY,...,0.4142,48700,22625
2335,Adelphi University,Garden City,NY,...,0.1562,51300,25000


('NY', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2375,Canisius College,Buffalo,NY,...,0.0373,45700.0,25000.0
2382,Christ the King Seminary,East Aurora,NY,...,,,
2394,Concordia College-New York,Bronxville,NY,...,0.3393,43200.0,26000.0


('OH', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2796,ETI Technical College,Niles,OH,...,0.6894,22700,13964
2797,The Art Institute of Cincinnati-AIC College of...,Cincinnati,OH,...,0.3158,29700,PrivacySuppressed
2798,Miami-Jacobs Career College-Independence,Independence,OH,...,0.6173,26700,22940


('OH', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2803,Allegheny Wesleyan College,Salem,OH,...,0.0465,PrivacySuppressed,PrivacySuppressed
2808,Ashland University,Ashland,OH,...,0.307,39000,27000
2812,Baldwin Wallace University,Berea,OH,...,0.1393,44900,27000


('OK', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3009,American Broadcasting School-Oklahoma City,Oklahoma City,OK,...,0.8333,27300,7023
3013,Broken Arrow Beauty College-Broken Arrow,Broken Arrow,OK,...,0.3556,16800,9259
3014,Pontotoc Technology Center,Ada,OK,...,0.4957,28500,PrivacySuppressed


('OK', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3010,Bacone College,Muskogee,OK,...,0.1648,29700,26350.0
3011,Oklahoma Wesleyan University,Bartlesville,OK,...,0.4769,46100,21276.5
3012,Southern Nazarene University,Bethany,OK,...,0.3551,45800,18750.0


('OR', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3069,Academy of Hair Design-Salem,Salem,OR,...,0.5536,14800,18519
3070,Abdill Career College Inc,Medford,OR,...,0.45,PrivacySuppressed,9500
3071,Paul Mitchell the School-Portland,Portland,OR,...,0.2159,,10194


('OR', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3081,Concordia University-Portland,Portland,OR,...,0.2839,40400,25000
3086,New Hope Christian College-Eugene,Eugene,OR,...,0.2346,26400,24921
3087,George Fox University,Newberg,OR,...,0.1426,41700,22000


('PA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3136,Abington Memorial Hospital Dixon School of Nur...,Willow Grove,PA,...,0.6696,63300,15836.0
3137,Jolie Hair and Beauty Academy-Hazleton,Hazleton,PA,...,0.433,PrivacySuppressed,8847.5
3138,Keystone Technical Institute,Harrisburg,PA,...,0.3578,24400,11677.5


('PA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3139,Bryn Athyn College of the New Church,Bryn Athyn,PA,...,0.0266,PrivacySuppressed,22294.5
3141,Albright College,Reading,PA,...,0.2452,45800,28750.0
3144,Allegheny College,Meadville,PA,...,0.0088,48400,29046.0


('PR', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4142,Institute of Beauty Careers,Arecibo,PR,...,0.2821,12000,PrivacySuppressed
4143,Educational Technical College-Recinto de Bayamon,Bayamon,PR,...,0.2933,14500,PrivacySuppressed
4144,American University of Puerto Rico,Bayamon,PR,...,0.2657,19300,3920


('PR', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4147,Universidad Adventista de las Antillas,Mayaguez,PR,...,0.223,18900,13800
4149,Universidad Central de Bayamon,Bayamón,PR,...,0.2849,18500,8250
4154,Pontifical Catholic University of Puerto Rico-...,Arecibo,PR,...,0.2595,17900,13195


('PW', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4215,Palau Community College,Koror,PW,...,0.2616,24700,PrivacySuppressed


('RI', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3402,Brown University,Providence,RI,...,0.0112,59700,15500
3403,Bryant University,Smithfield,RI,...,0.0216,64500,27000
3404,Johnson & Wales University-Providence,Providence,RI,...,0.1037,35300,27000


('RI', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3401,Empire Beauty School-Providence,Providence,RI,...,0.4667,21000,9833
3408,Providence College,Providence,RI,...,0.0689,57700,27000
3414,Salve Regina University,Newport,RI,...,0.0592,49700,27000


('SC', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3417,Aiken Technical College,Graniteville,SC,...,0.4413,24500,9625
3420,Technical College of the Lowcountry,Beaufort,SC,...,0.5035,25300,7500
3422,Bob Jones University,Greenville,SC,...,0.0384,PrivacySuppressed,19000


('SC', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3418,Allen University,Columbia,SC,...,0.0783,21100,37676
3419,Charleston Southern University,Charleston,SC,...,0.2198,35700,27741
3421,Benedict College,Columbia,SC,...,0.0784,21400,44000


('SD', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3479,Black Hills Beauty College,Rapid City,SD,...,0.1339,16200,11790
3480,Black Hills State University,Spearfish,SD,...,0.2841,34400,25625
3481,Kilian Community College,Sioux Falls,SD,...,0.5455,23100,17125


('SD', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3478,Augustana University,Sioux Falls,SD,...,0.0424,41800,27000
3483,Dakota Wesleyan University,Mitchell,SD,...,0.1309,34500,27000
3486,Avera McKennan Hospital School of Radiologic T...,Sioux Falls,SD,...,0.05,PrivacySuppressed,PrivacySuppressed


('TN', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
1205,ITT Technical Institute-Nashville,Nashville,TN,...,0.8019,38800,25827.5
3507,Arnolds Beauty School,Milan,TN,...,0.4444,16000,PrivacySuppressed
3508,Tennessee College of Applied Technology-Athens,Athens,TN,...,0.396,26600,PrivacySuppressed


('TN', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3506,American Baptist College,Nashville,TN,...,0.7305,PrivacySuppressed,25000
3510,Baptist Memorial College of Health Sciences,Memphis,TN,...,0.5059,54100,30000
3511,Belmont University,Nashville,TN,...,0.0848,41800,22707


('TX', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3611,Alvin Community College,Alvin,TX,...,0.2841,34500,6750
3612,Amarillo College,Amarillo,TX,...,0.3431,31700,10950
3613,Angelina College,Lufkin,TX,...,0.2603,26900,PrivacySuppressed


('TX', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3610,Abilene Christian University,Abilene,TX,...,0.0381,40200,25985
3615,Arlington Baptist College,Arlington,TX,...,0.2251,34200,22905
3618,Austin College,Sherman,TX,...,0.0124,47800,26000


('UT', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3802,AmeriTech College-Provo,Provo,UT,...,0.3526,24700,24370
3803,Bridgerland Applied Technology College,Logan,UT,...,0.4148,24300,PrivacySuppressed
3806,Broadview University-West Jordan,West Jordan,UT,...,0.559,25500,28458


('UT', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3804,Brigham Young University-Provo,Provo,UT,...,0.122,57200,11000.0
3817,Latter-day Saints Business College,Salt Lake City,UT,...,0.2235,35100,5799.0
3818,Everest College-Salt Lake City,West Valley City,UT,...,0.5371,24400,10632.5


('VA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
704,Medtech Institute,Falls Church,VA,...,0.2039,26300,9236
3850,Bar Palma Beauty Careers Academy,Roanoke,VA,...,0.6944,16900,9731
3851,Advanced Technology Institute,Virginia Beach,VA,...,0.5364,38000,16279


('VA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3852,Averett University,Danville,VA,...,0.0992,42400,25000
3853,Bluefield College,Bluefield,VA,...,0.4241,40000,18873
3854,Bridgewater College,Bridgewater,VA,...,0.0114,40800,27000


('VI', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4216,University of the Virgin Islands,Charlotte Amalie,VI,...,0.3196,31800,15150


('VI', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
7404,University of the Virgin Islands-Albert A. Sheen,St. Croix,VI,...,,31800,15150


('VT', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3829,Bennington College,Bennington,VT,...,0.0097,24600,27000
3830,Burlington College,Burlington,VT,...,0.2545,26000,25000
3831,Castleton University,Castleton,VT,...,0.0938,34900,25000


('VT', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3835,Green Mountain College,Poultney,VT,...,0.0407,30100,25449
3843,Saint Michael's College,Colchester,VT,...,0.022,46600,27400
3845,College of St Joseph,Rutland,VT,...,0.2557,34700,24127


('WA', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3943,Beauty Academy,Wenatchee,WA,...,0.3896,PrivacySuppressed,8718.5
3944,The Art Institute of Seattle,Seattle,WA,...,0.3795,34100,25937.5
3945,Evergreen Beauty and Barber College-Bellevue,Bellevue,WA,...,0.44,PrivacySuppressed,7917.0


('WA', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3967,Gonzaga University,Spokane,WA,...,0.0298,53000,25500.0
3981,Trinity Lutheran College,Everett,WA,...,0.2165,37100,25000.0
3985,Northwest University,Kirkland,WA,...,0.3067,37700,23724.5


('WI', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4063,Advanced Institute of Hair Design-Glendale,Glendale,WI,...,0.1736,24000,10314
4064,VICI Aveda Institute,Greenfield,WI,...,0.2059,24000,10314
4066,Madison Area Technical College,Madison,WI,...,0.508,35000,14250


('WI', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4065,Alverno College,Milwaukee,WI,...,0.3464,37100,32606.5
4070,Cardinal Stritch University,Milwaukee,WI,...,0.6632,48500,27000.0
4071,Carroll University,Waukesha,WI,...,0.1119,41300,27000.0


('WV', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
2842,Scott College of Cosmetology,Wheeling,WV,...,0.1111,14800,9250
4019,B M Spurr School of Practical Nursing,Glen Dale,WV,...,0.4444,PrivacySuppressed,PrivacySuppressed
4020,Ben Franklin Career Center,Dunbar,WV,...,0.7568,20800,PrivacySuppressed


('WV', 1)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4016,Alderson Broaddus University,Philippi,WV,...,0.0722,46000,27000.0
4018,Appalachian Bible College,Mount Hope,WV,...,0.0899,28700,9300.0
4027,Davis & Elkins College,Elkins,WV,...,0.1133,35000,23840.5


('WY', 0)


Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4128,Casper College,Casper,WY,...,0.3447,34800,10764
4129,Central Wyoming College,Riverton,WY,...,0.3992,25200,8757
4130,Eastern Wyoming College,Torrington,WY,...,0.2371,25900,10000


However, I typically want to see some example data from a single group to figure out
what function I want to apply to the groups. If I know the names of the values from
the columns I grouped by, I can use the previous step. Often, I don't know those
names, but I also don't need to see all of the groups. The following is some debugging
of the code that is usually sufficient to understand what a group looks like:

In [123]:
for name, group in grouped:
    print(name)
    print(group)
    break
    


('AK', 0)
                                      INSTNM       CITY STABBR  ...  UG25ABV  \
60            University of Alaska Anchorage  Anchorage     AK  ...   0.4386   
62            University of Alaska Fairbanks  Fairbanks     AK  ...   0.4519   
63            University of Alaska Southeast     Juneau     AK  ...   0.5550   
65    AVTEC-Alaska's Institute of Technology     Seward     AK  ...   0.7127   
66                 Charter College-Anchorage  Anchorage     AK  ...   0.5472   
67                     Alaska Career College  Anchorage     AK  ...   0.5612   
5171                       Ilisagvik College     Barrow     AK  ...   0.6498   

      MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP  
60              42500             19449.5  
62              36200               19355  
63              37400               16875  
65              33500   PrivacySuppressed  
66              39200               13875  
67              28700                8994  
5171            24900   PrivacySuppressed

In [126]:
# You can also call the .head method on your groupby 
# object to get the first rows of
# each group together in a single DataFrame:

grouped.head()
    

Unnamed: 0,INSTNM,CITY,STABBR,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,...,0.1049,30300,33888
1,University of Alabama at Birmingham,Birmingham,AL,...,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,...,0.8540,40100,23370
3,University of Alabama in Huntsville,Huntsville,AL,...,0.2640,45500,24097
4,Alabama State University,Montgomery,AL,...,0.1270,26600,33118.5
...,...,...,...,...,...,...,...
7366,Montpelier Center - Closed July 2013,Montpelier,VT,...,,39600,18750
7367,New England Center,Brattleboro,VT,...,,39600,18750
7404,University of the Virgin Islands-Albert A. Sheen,St. Croix,VI,...,,31800,15150
7419,Computer Career Center-Las Cruces,Las Cruces,NM,...,,21300,14250


There are several useful methods that were not explored from the list in step 2. Take, for
instance, the .nth method, which, when provided with a list of integers, selects those specific
rows from each group. For example, the following operation selects the first and last rows from
each group:

In [127]:
grouped.nth([1, -1])

Unnamed: 0_level_0,Unnamed: 1_level_0,INSTNM,CITY,HBCU,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
STABBR,RELAFFIL,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AK,0,University of Alaska Fairbanks,Fairbanks,0.0,...,0.4519,36200,19355
AK,0,Ilisagvik College,Barrow,0.0,...,0.6498,24900,PrivacySuppressed
AK,1,Alaska Pacific University,Anchorage,0.0,...,0.4910,47000,23250
AK,1,Alaska Christian College,Soldotna,0.0,...,0.2264,,PrivacySuppressed
AL,0,University of Alabama at Birmingham,Birmingham,0.0,...,0.2422,39700,21941.5
...,...,...,...,...,...,...,...,...
WV,0,BridgeValley Community & Technical College,South Charleston,0.0,...,,,9429.5
WV,1,Appalachian Bible College,Mount Hope,0.0,...,0.0899,28700,9300
WV,1,West Virginia Business College-Nutter Fort,Nutter Fort,,...,,16700,19258
WY,0,Central Wyoming College,Riverton,0.0,...,0.3992,25200,8757


## Filtering for states with a minority majority


Previously, we examined using Boolean arrays to filter rows. In a similar fashion, when using
the .groupby method, we can filter out groups. The .filter method of the groupby object
accepts a function that must return either True or False to indicate whether a group is kept.
This .filter method applied after a call to the .groupby method is completely different to
the DataFrame .filter method covered in the Selecting columns with methods recipe from
Chapter 2, Essential DataFrame Operations.
One thing to be aware of is that when the .filter method is applied, the result does not use
the grouping columns as the index, but keeps the original index! The DataFrame .filter
method filters columns, not values.
In this recipe, we use the college dataset to find all the states that have more non-white
undergraduate students than white. This is a dataset from the US, where whites form the
majority and therefore, we are looking for states with a minority majority.

In [138]:
# Read in the college dataset, group by state, and 
# display the total number of groups. This should equal
# the number of unique states retrieved from the 
# .nunique Series method:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv', index_col='INSTNM')


In [139]:
grouped = college.groupby('STABBR')

In [140]:
grouped.ngroups

59

In [142]:
len(grouped.nunique())

59

In [143]:
college['STABBR'].nunique()

59

In [156]:
r = 1 - college['UGDS_WHITE'] 
s = r * college['UGDS_WHITE'].sum()

In [157]:

f = college['UGDS'].sum()
s / f

INSTNM
Alabama A & M University                                  0.000209
University of Alabama at Birmingham                       0.000088
Amridge University                                        0.000152
University of Alabama in Huntsville                       0.000065
Alabama State University                                  0.000213
                                                            ...   
SAE Institute of Technology  San Francisco                     NaN
Rasmussen College - Overland Park                              NaN
National Personal Training Institute of Cleveland              NaN
Bay Area Medical Academy - San Jose Satellite Location         NaN
Excel Learning Center-San Antonio South                        NaN
Name: UGDS_WHITE, Length: 7535, dtype: float64

The grouped variable has a .filter method, which accepts a custom function
that determines whether a group is kept. The custom function accepts a DataFrame
of the current group and is required to return a Boolean. Let's define a function
that calculates the total percentage of minority students and returns True if this
percentage is greater than a user-defined threshold:

In [144]:
def check_minority(df, threshold):
    minority_pct = 1 - df['UGDS_WHITE']
    total_minority = (df['UGDS'] * minority_pct).sum()
    total_ugds = df['UGDS'].sum()
    total_minority_pct = total_minority / total_ugds
    return total_minority_pct > threshold

In [145]:
# Use the .filter method passed with the check_minority
# function and a threshold of 50% to find all
# states that have a minority majority:
college_filtered = grouped.filter(check_minority, threshold=.5)

In [146]:
college_filtered

Unnamed: 0_level_0,CITY,STABBR,HBCU,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Everest College-Phoenix,Phoenix,AZ,0.0,...,0.6700,28600,9500
Collins College,Phoenix,AZ,0.0,...,0.4764,25700,47000
Empire Beauty School-Paradise Valley,Phoenix,AZ,0.0,...,0.4651,17800,9588
Empire Beauty School-Tucson,Tucson,AZ,0.0,...,0.4229,18200,9833
Thunderbird School of Global Management,Glendale,AZ,0.0,...,0.0000,118900,PrivacySuppressed
...,...,...,...,...,...,...,...
WestMed College - Merced,Merced,CA,,...,,,15623.5
Vantage College,El Paso,TX,,...,,,9500
SAE Institute of Technology San Francisco,Emeryville,CA,,...,,,9500
Bay Area Medical Academy - San Jose Satellite Location,San Jose,CA,,...,,,PrivacySuppressed


Just looking at the output may not be indicative of what happened. The DataFrame
starts with the state of Arizona (AZ) and not Alaska (AK), so we can visually confirm
that something changed. Let's compare the shape of this filtered DataFrame with the
original. Looking at the results, about 60% of the rows have been filtered, and only
20 states remain that have a minority majority:

In [147]:
college.shape

(7535, 26)

In [149]:
college_filtered.shape

(3028, 26)

In [150]:
college_filtered['STABBR'].nunique()

20

In [158]:
# Our function, check_minority, is flexible and accepts
# a parameter to lower or raise the percentage of
# minority threshold. Let's check the shape and number 
# of unique states for a couple of other thresholds:
college_filtered_20 = grouped.filter(check_minority, threshold=.2)

In [159]:
college_filtered_20.shape

(7461, 26)

In [160]:
college_filtered_20['STABBR'].nunique()

57

In [161]:
college_filtered_70 = grouped.filter(check_minority, threshold=.7)

In [162]:
college_filtered_70.shape

(957, 26)

In [163]:
college_filtered_70['STABBR'].nunique()

10

## Transforming through a weight loss bet

One method to increase motivation to lose weight is to make a bet with someone else. The
scenario in this recipe will track weight loss from two individuals throughout a four-month
period and determine a winner.
In this recipe, we use simulated data from two individuals to track the percentage of weight
loss over four months. At the end of each month, a winner will be declared based on the
individual who lost the highest percentage of body weight for that month. To track weight
loss, we group our data by month and person, and then call the .transform method to find
the percentage weight loss change for each week against the start of the month.
We will use the .transform method in this recipe. This method returns a new object that
preserves the index of the original DataFrame but allows you to do calculations on groups
of the data.


In [2]:
# Read in the raw weight_loss dataset, and examine the 
# first month of data from the two people, Amy and 
# Bob. There are a total of four weigh-ins per month:
weight_loss = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/weight_loss.csv')


In [3]:
weight_loss.query('Month == "Jan"')

Unnamed: 0,Name,Month,Week,Weight
0,Bob,Jan,Week 1,291
1,Amy,Jan,Week 1,197
2,Bob,Jan,Week 2,288
3,Amy,Jan,Week 2,189
4,Bob,Jan,Week 3,283
5,Amy,Jan,Week 3,189
6,Bob,Jan,Week 4,283
7,Amy,Jan,Week 4,190


To determine the winner for each month, we only need to compare weight loss from
the first week to the last week of each month. But, if we wanted to have weekly
updates, we can also calculate weight loss from the current week to the first week
of each month. Let's create a function that is capable of providing weekly updates.
It will take a Series and return a Series of the same size:

In [4]:
def percent_loss(s):
    return ((s -  s.iloc[0]) / s.iloc[0]) *100

In [5]:
(weight_loss
 .query('Name=="Bob" and Month=="Jan"')
 ['Weight']
 .pipe(percent_loss)
)

0    0.000000
2   -1.030928
4   -2.749141
6   -2.749141
Name: Weight, dtype: float64

fter the first week, Bob lost 1% of his body weight. He continued losing weight during
the second week but made no progress during the last week. We can apply this
function to every single combination of person and month to get the weight loss per
week in relation to the first week of the month. To do this, we need to group our data
by Name and Month, and then use the .transform method to apply this custom
function. The function we pass to .transform needs to maintain the index of the
group that is passed into it, so we can use percent_loss here:

In [6]:
(weight_loss
 .groupby(['Name', 'Month'])
 ['Weight']
 .transform(percent_loss)
)

0     0.000000
1     0.000000
2    -1.030928
3    -4.060914
4    -2.749141
        ...   
27   -3.529412
28   -3.065134
29   -3.529412
30   -4.214559
31   -5.294118
Name: Weight, Length: 32, dtype: float64

The .transform method takes a function that returns an object with the same
index (and the same number of rows) as was passed into it. Because it has the
same index, we can insert it as a column. The .transform method is useful for
summarizing information from the groups and then adding it back to the original
DataFrame. We will also filter down to two months of data for Bob:

In [7]:
(weight_loss
 .assign(percent_loss=(weight_loss.
                       groupby(['Name', 'Month'])
                       ['Weight']
                       .transform(percent_loss)
                       .round(1)
                      )
).query('Name=="Bob" and Month in ["Jan", "Feb"]'))

Unnamed: 0,Name,Month,Week,Weight,percent_loss
0,Bob,Jan,Week 1,291,0.0
2,Bob,Jan,Week 2,288,-1.0
4,Bob,Jan,Week 3,283,-2.7
6,Bob,Jan,Week 4,283,-2.7
8,Bob,Feb,Week 1,283,0.0
10,Bob,Feb,Week 2,275,-2.8
12,Bob,Feb,Week 3,268,-5.3
14,Bob,Feb,Week 4,268,-5.3


Notice that the percentage of weight loss resets after the new month. With this new
percent_loss column, we can manually determine a winner but let's see whether
we can find a way to do this automatically. As the only week that matters is the last week, let's select week 4:


In [8]:
(weight_loss
 .assign(percent_loss=(weight_loss
         .groupby(['Name', 'Month'])
 ['Weight']
 .transform(percent_loss)
 .round(1))
).query('Week == "Week 4"'))

Unnamed: 0,Name,Month,Week,Weight,percent_loss
6,Bob,Jan,Week 4,283,-2.7
7,Amy,Jan,Week 4,190,-3.6
14,Bob,Feb,Week 4,268,-5.3
15,Amy,Feb,Week 4,173,-8.9
22,Bob,Mar,Week 4,261,-2.6
23,Amy,Mar,Week 4,170,-1.7
30,Bob,Apr,Week 4,250,-4.2
31,Amy,Apr,Week 4,161,-5.3


In [9]:
# This narrows down the weeks but still doesn't
# automatically find out the winner of each month.
# Let's reshape this data with the .pivot method so
# that Bob's and Amy's percent weight loss is side by
# side for each month:

(weight_loss
 .assign(percent_loss=(weight_loss
         .groupby(['Name', 'Month']))
         ['Weight']
         .transform(percent_loss)
         .round(1))
   .query('Week == "Week 4"')
   .pivot(index='Month', columns='Name',
         values='percent_loss')
)

Name,Amy,Bob
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
Apr,-5.3,-4.2
Feb,-8.9,-5.3
Jan,-3.6,-2.7
Mar,-1.7,-2.6


In [10]:
# This output makes it clearer who has won each month,
# but we can still go a couple of steps further. 
# NumPy has a vectorized if then else function called 
# where, which can map a Series or array of Booleans 
# to other values. Let's create a column, winner, with
# the name of the winner:
(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))
                 .query('Week == "Week 4"')
                .pivot_table(index='Month', columns='Name',
                            values='percent_loss')
 .assign(winner=lambda df_: np.where(df_.Amy < df_.Bob, 'Amy', 'Bob') )
)

Name,Amy,Bob,winner
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apr,-5.3,-4.2,Amy
Feb,-8.9,-5.3,Amy
Jan,-3.6,-2.7,Amy
Mar,-1.7,-2.6,Bob


In [11]:
(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))
                 .query('Week == "Week 4"')
                .pivot_table(index='Month', columns='Name',
                            values='percent_loss')
 .assign(winner=lambda df_: np.where(df_.Amy < df_.Bob, 'Amy', 'Bob') )
)['winner'].value_counts()

Amy    3
Bob    1
Name: winner, dtype: int64

In [24]:
(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))).query('Week == "Week 2"').pivot(index='Month', columns='Name', 
                        values='percent_loss').assign(men_are_scum= lambda df_: np.where(df_.Amy < df_.Bob, 'Amy', 'Bob')).men_are_scum.value_counts()


Amy    4
Name: men_are_scum, dtype: int64

In [51]:
# In Jupyter, you can highlight the winning percentage
# for each month using the .style attribute:
(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))
                 .query('Week == "Week 4"')
                .pivot(index='Month', columns='Name',
                            values='percent_loss')
 .assign(winner=lambda df_: np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
).style.highlight_min(axis=0)



Name,Amy,Bob,winner
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apr,-5.3,-4.2,Amy
Feb,-8.9,-5.3,Amy
Jan,-3.6,-2.7,Amy
Mar,-1.7,-2.6,Bob


In [39]:
# Use the .value_counts method to return the final score 
# as the number of months won:
(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))
                 .query('Week == "Week 4"')
                .pivot_table(index='Month', columns='Name',
                            values='percent_loss')
 .assign(winner=lambda df_: np.where(df_.Amy < df_.Bob, 'Amy', 'Bob') )
)['winner'].value_counts()

Amy    3
Bob    1
Name: winner, dtype: int64

In [57]:
# Here is an example of using . groupyby with .unstack 
# to emulate the pivot functionality:

(weight_loss
 .assign(percent_loss=(weight_loss
                      .groupby(['Name', 'Month'])
                      ['Weight']
                      .transform(percent_loss)
                      .round(1)))
                 .query('Week == "Week 4"')
                .groupby(['Month','Name'])
                 ['percent_loss']
                  .first()
                   .unstack()
                
)

Name,Amy,Bob
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
Apr,-5.3,-4.2
Feb,-8.9,-5.3
Jan,-3.6,-2.7
Mar,-1.7,-2.6


Take a look at the DataFrame output from step 7. Did you notice that the months are in
alphabetical and not chronological order? pandas unfortunately, in this case at least, orders
the months for us alphabetically. We can solve this issue by changing the data type of Month
to a categorical variable. Categorical variables map all the values of each column to an
integer. We can choose this mapping to be the normal chronological order for the months.
pandas uses this underlying integer mapping during the .pivot method to order the months
chronologically:

In [73]:
(weight_loss.assign(percent_loss=(weight_loss.groupby(['Name', 'Month'])['Weight']
                                  .transform(percent_loss).round(1)),
                          Month=pd.Categorical(weight_loss.Month, 
                                              categories=['Jan', 'Feb', 'Mar', 'Apr'],
                                              ordered=True))
                         .query('Week == "Week 4"')
                         .pivot(index='Month', columns='Name',
                               values='percent_loss')
)

Name,Amy,Bob
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
Jan,-3.6,-2.7
Feb,-8.9,-5.3
Mar,-1.7,-2.6
Apr,-5.3,-4.2


## Calculated weighted mean SAT scores per state with apply

The groupby object has four methods that accept a function (or functions) to perform a
calculation on each group. These four methods are .agg, .filter, .transform, and
.apply. Each of the first three of these methods has a very specific output that the function
must return. .agg must return a scalar value, .filter must return a Boolean, and
.transform must return a Series or DataFrame with the same length as the passed group.
The .apply method, however, may return a scalar value, a Series, or even a DataFrame
of any shape, therefore making it very flexible. It is also called only once per group (on a
DataFrame), while the .transform and .agg methods get called once for each aggregating
column (on a Series). The .apply method's ability to return a single object when operating on
multiple columns at the same time makes the calculation in this recipe possible.
In this recipe, we calculate the weighted average of both the math and verbal SAT scores
per state from the college dataset. We weight the scores by the population of undergraduate
students per school.

In [78]:
# Read in the college dataset, and drop any rows that 
# have missing values in the UGDS, SATMTMID, or SATVRMID
# columns. We do not want any missing values for 
# those columns:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv')



In [79]:
subset = ['UGDS', 'SATMTMID', 'SATVRMID']

In [80]:
college2 = college.dropna(subset=subset)

In [81]:
college.shape

(7535, 27)

In [82]:
college2.shape

(1184, 27)

The vast majority of institutions do not have data for our three required columns,
but this is still more than enough data to continue. Next, create a user-defined
function to calculate the weighted average of the SAT math scores:

In [87]:
def weighted_math_average(df):
    weighted_math = df['UGDS'] * df['SATMTMID']
    return int(weighted_math.sum() / df['UGDS'].sum())

Group by state and pass this function to the .apply method. Because each group
has multiple columns and we want to reduce those to a single value, we need to use
.apply. The weighted_math_average function will be called once for each group
(not on the individual columns in the group):

In [90]:
college2.groupby('STABBR').apply(weighted_math_average)

STABBR
AK    503
AL    536
AR    529
AZ    569
CA    564
     ... 
VT    566
WA    555
WI    593
WV    500
WY    540
Length: 53, dtype: int64

We successfully returned a scalar value for each group. Let's take a small detour and
see what the outcome would have been by passing the same function to the .agg
method (which calls the function for every column):

In [89]:
(college2
 .groupby('STABBR')
 .agg(weight_math_average)
)

NameError: name 'weight_math_average' is not defined

The weighted_math_average function gets applied to each non-aggregating
column in the DataFrame. If you try and limit the columns to just SATMTMID, you
will get an error as you won't have access to UGDS. So, the best way to complete
operations that act on multiple columns is with .apply:

In [91]:
(college2
 .groupby('STABBR')
 ['SATMTMID']
 .agg(weighted_math_average)
)

KeyError: 'UGDS'

A nice feature of .apply is that you can create multiple new columns by returning
a Series. The index of this returned Series will be the new column names. Let's
modify our function to calculate the weighted and arithmetic average for both SAT
scores along with the count of the number of institutions from each group. We return
these five values in a Series:

In [100]:
def weighted_average(df):
    weight_m = df['UGDS'] * df['SATMTMID']
    weight_v = df['UGDS'] * df['SATMTMID']
    wm_avg = weight_m.sum() / df['UGDS']
    wv_avg = weight_v.sum() / df['UGDS']
    data = {'w_math_avg': wm_avg,
    'w_verbal_avg': wv_avg,
    'math_avg': df['SATMTMID'].mean(),
    'verbal_avg': df['SATVRMID'].mean(),
    'count': len(df)
           }
    return pd.Series(data)

In [102]:
(college2
 .groupby('STABBR')
 .apply(weighted_average)
).astype(int)

Unnamed: 0_level_0,w_math_avg,w_verbal_avg,math_avg,verbal_avg,count
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AK,503,503,503,555,1
AL,15460,15460,504,508,21
AR,5110,5110,515,491,16
AZ,818,818,536,538,6
CA,64489,64489,562,549,72
...,...,...,...,...,...
VT,6866,6866,526,527,8
WA,7170,7170,551,548,18
WI,18435,18435,545,516,14
WV,27004,27004,481,473,17


In [105]:
(college
 .groupby('STABBR')
 .apply(weighted_average)
)

ValueError: Length of values (2) does not match length of index (10)

ValueError: Length of values (2) does not match length of index (10)

In [112]:
from scipy.stats import gmean, hmean
def calculate_means(df):
    df_means = pd.DataFrame(index=['Arithmetic', 'Weighted', 'Geometric', 'Harmonic'])
    
    cols = ['SATMTMID', 'SATVRMID']
    for col in cols:
        arithmetic = df[col].mean()
        weighted = np.average(df[col], weights=df['UGDS'])
        geometric = gmean(df[col])
        harmonic = hmean(df[col])
    df_means[col] =  [arithmetic, weighted, geometric, harmonic]
    return df_means.astype(int)

In [113]:
(college2
 .groupby('STABBR')
 .apply(calculate_means)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,SATVRMID
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,Arithmetic,555
AK,Weighted,555
AK,Geometric,555
AK,Harmonic,555
AL,Arithmetic,508
...,...,...
WV,Harmonic,472
WY,Arithmetic,535
WY,Weighted,535
WY,Geometric,534


### Grouping by continuous variable

When grouping in pandas, you typically use columns with discrete repeating values. If there
are no repeated values, then grouping would be pointless as there would only be one row
per group. Continuous numeric columns typically have few repeated values and are generally
not used to form groups. However, if we can transform columns with continuous values into a
discrete column by placing each value in a bin, rounding them, or using some other mapping,
then grouping with them makes sense.
In this recipe, we explore the flights dataset to discover the distribution of airlines for different
travel distances. This allows us, for example, to find the airline that makes the most flights
between 500 and 1,000 miles. To accomplish this, we use the pandas cut function to
discretize the distance of each flight flown.

In [4]:
# Read in the flights dataset:
flights = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/flights.csv')
flights

Unnamed: 0,MONTH,DAY,WEEKDAY,...,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,...,65.0,0,0
1,1,1,4,...,-13.0,0,0
2,1,1,4,...,35.0,0,0
3,1,1,4,...,-7.0,0,0
4,1,1,4,...,39.0,0,0
...,...,...,...,...,...,...,...
58487,12,31,4,...,-19.0,0,0
58488,12,31,4,...,4.0,0,0
58489,12,31,4,...,-5.0,0,0
58490,12,31,4,...,34.0,0,0


In [7]:
flights['DIST']

0         590
1        1452
2         641
3        1192
4        1363
         ... 
58487    1464
58488     414
58489     262
58490     907
58491     522
Name: DIST, Length: 58492, dtype: int64

In [8]:
# If we want to find the distribution of airlines over
# a range of distances, we need to place the values of
# the DIST column into discrete bins. Let's use the 
# pandas cut function to split the data into five bins:
bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
cuts = pd.cut(flights['DIST'], bins=bins)


In [9]:
cuts

0         (500.0, 1000.0]
1        (1000.0, 2000.0]
2         (500.0, 1000.0]
3        (1000.0, 2000.0]
4        (1000.0, 2000.0]
               ...       
58487    (1000.0, 2000.0]
58488      (200.0, 500.0]
58489      (200.0, 500.0]
58490     (500.0, 1000.0]
58491     (500.0, 1000.0]
Name: DIST, Length: 58492, dtype: category
Categories (5, interval[float64, right]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] < (1000.0, 2000.0] < (2000.0, inf]]

In [15]:
# An ordered categorical Series is created. To help get
# an idea of what happened, let's count the values 
# of each category

cuts.value_counts(dropna=False)

(500.0, 1000.0]     20659
(200.0, 500.0]      15874
(1000.0, 2000.0]    14186
(2000.0, inf]        4054
(-inf, 200.0]        3719
Name: DIST, dtype: int64

The cuts Series can now be used to form groups. pandas allows you to pass many
types into the .groupby method. Pass the cuts Series to the .groupby method
and then call the .value_counts method on the AIRLINE column to find the
distribution for each distance group. Notice that SkyWest (OO) makes up 33% of
flights of less than 200 miles but only 16% of those between 200 and 500 miles:

In [19]:
(flights
 .groupby(cuts)
 ['AIRLINE']
 .value_counts(normalize=True)
 .round(3)
)

DIST           AIRLINE
(-inf, 200.0]  OO         0.326
               EV         0.289
               MQ         0.211
               DL         0.086
               AA         0.052
                          ...  
(2000.0, inf]  WN         0.046
               HA         0.028
               NK         0.019
               AS         0.012
               F9         0.004
Name: AIRLINE, Length: 57, dtype: float64

In [24]:
# We can find more results when grouping by the cuts 
# variable. For instance, we can find the 25th, 50th, 
# and 75th percentile airtime for each distance 
# grouping. As airtime is in minutes, we can divide by
# 60 to get hours. This will return a Series with
# a MultiIndex:
(flights
 .groupby(cuts)
 ['AIR_TIME']
 .quantile(q=[.25, .5, .75])
 .div(60)
 .round(2)
)

DIST                  
(-inf, 200.0]     0.25    0.43
                  0.50    0.50
                  0.75    0.57
(200.0, 500.0]    0.25    0.77
                  0.50    0.92
                          ... 
(1000.0, 2000.0]  0.50    2.93
                  0.75    3.40
(2000.0, inf]     0.25    4.30
                  0.50    4.70
                  0.75    5.03
Name: AIR_TIME, Length: 15, dtype: float64

In [23]:
flights['AIR_TIME'].quantile(q=[.25, .5, .75]).div(60).round(2)


0.25    1.02
0.50    1.62
0.75    2.53
Name: AIR_TIME, dtype: float64

We can use this information to create informative string labels when using the cut function.
These labels replace the interval notation found in the index. We can also chain the
.unstack method, which transposes the inner index level to column names:

In [25]:
labels=['Under an Hour', '1 Hour', '1-2 Hours', '2-4 Hours', '4+ Hours']

In [26]:
cuts2 = pd.cut(flights['DIST'], bins=bins, labels=labels)

In [27]:
cuts2

0        1-2 Hours
1        2-4 Hours
2        1-2 Hours
3        2-4 Hours
4        2-4 Hours
           ...    
58487    2-4 Hours
58488       1 Hour
58489       1 Hour
58490    1-2 Hours
58491    1-2 Hours
Name: DIST, Length: 58492, dtype: category
Categories (5, object): ['Under an Hour' < '1 Hour' < '1-2 Hours' < '2-4 Hours' < '4+ Hours']

In [33]:
(flights
 .groupby(cuts2)
 ['AIRLINE']
 .value_counts(normalize=True)
 .round(3)
 .unstack()
 
)

AIRLINE,AA,AS,B6,...,US,VX,WN
DIST,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Under an Hour,0.052,,,...,,,0.009
1 Hour,0.071,0.001,0.007,...,0.016,0.028,0.194
1-2 Hours,0.144,0.023,0.003,...,0.025,0.004,0.138
2-4 Hours,0.264,0.016,0.003,...,0.04,0.012,0.16
4+ Hours,0.212,0.012,0.08,...,0.065,0.074,0.046


### Counting the total number of flights between cities

In the flights dataset, we have data on the origin and destination airport. It is trivial to count
the number of flights originating in Houston and landing in Atlanta, for instance. What is more
difficult is counting the total number of flights between the two cities.
In this recipe, we count the total number of flights between two cities, regardless of which
one is the origin or destination. To accomplish this, we sort the origin and destination airports
alphabetically so that each combination of airports always occurs in the same order. We can
then use this new column arrangement to form groups and then to count.


In [36]:
# Read in the flights dataset, and find the total
# number of flights between each origin
# and destination airport:
flights = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/flights.csv')
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
flights_ct

ORG_AIR  DEST_AIR
ATL      ABE          31
         ABQ          16
         ABY          19
         ACY           6
         AEX          40
                    ... 
SFO      SNA         122
         STL          20
         SUN          10
         TUS          20
         XNA           2
Length: 1130, dtype: int64

In [39]:
# Select the total number of flights between Houston 
# (IAH) and Atlanta (ATL) in both directions

flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]]

ORG_AIR  DEST_AIR
ATL      IAH         121
IAH      ATL         148
dtype: int64

We could simply sum these two numbers together to find the total flights between
the cities, but there is a more efficient and automated solution that can work for all
flights. Let's sort the origin and destination columns for each row alphabetically. We
will use axis='columns' to do that:

In [40]:
f_parts = (flights
           [['ORG_AIR', 'DEST_AIR']]
           .apply(lambda ser:
                 ser.sort_values().reset_index(drop=True), 
                 axis='columns')
          )

In [41]:
f_parts


Unnamed: 0,0,1
0,LAX,SLC
1,DEN,IAD
2,DFW,VPS
3,DCA,DFW
4,LAX,MCI
...,...,...
58487,DFW,SFO
58488,LAS,SFO
58489,SBA,SFO
58490,ATL,MSP


In [43]:
# Now that the origin and destination values in each
# row are sorted, the column names are not correct.
# Let's rename them to something more generic and 
# then again find the total number of flights between 
# all cities:

rename_dict = {0: 'AIR1', 1: 'AIR2'}

(flights
           [['ORG_AIR', 'DEST_AIR']]
           .apply(lambda ser:
                 ser.sort_values().reset_index(drop=True), 
                 axis='columns')
           .rename(columns=(rename_dict))
           .groupby(['AIR1', 'AIR2'])
           .size()
          )

AIR1  AIR2
ABE   ATL      31
      ORD      24
ABI   DFW      74
ABQ   ATL      16
      DEN      46
             ... 
SFO   SNA     122
      STL      20
      SUN      10
      TUS      20
      XNA       2
Length: 1085, dtype: int64

In [44]:
# Let's select all the flights between Atlanta and
# Houston and verify that they match the sum of the
# values in step 2:
(flights
           [['ORG_AIR', 'DEST_AIR']]
           .apply(lambda ser:
                 ser.sort_values().reset_index(drop=True), 
                 axis='columns')
           .rename(columns=(rename_dict))
           .groupby(['AIR1', 'AIR2'])
           .size()
           .loc['IAH', 'ATL']
          )

KeyError: ('IAH', 'ATL')

In [46]:
# We can get a massive speed increase with the 
# NumPy sort function. Let's go ahead and use
# this function and analyze its output.
# By default, it sorts each row:

data_sorted = np.sort(flights[["ORG_AIR", 'DEST_AIR']])

In [52]:
data_sorted[:10]

array([['LAX', 'SLC'],
       ['DEN', 'IAD'],
       ['DFW', 'VPS'],
       ['DCA', 'DFW'],
       ['LAX', 'MCI'],
       ['IAH', 'SAN'],
       ['DFW', 'MSY'],
       ['PHX', 'SFO'],
       ['ORD', 'STL'],
       ['IAH', 'SJC']], dtype=object)

In [54]:
# A two-dimensional NumPy array is returned. NumPy 
# does not do grouping operations so let's use the
# DataFrame constructor to create a new DataFrame and 
# check whether it equals the DataFrame from step 3:
flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])
flights_sort2.equals(f_parts.rename(columns={0:'AIR1', 1:'AIR2'}))

True

In [60]:
# Because the DataFrames are the same, you can replace
# step 3 with the previous faster sorting routine.
# Let's time the difference between each of the
# different sorting methods:

%%timeit
flights_sort = (flights
           [['ORG_AIR', 'DEST_AIR']]
           .apply(lambda ser:
                 ser.sort_values().reset_index(drop=True), 
                 axis='columns')
          )

UsageError: Line magic function `%%timeit` not found.


In [59]:
%%timeit
data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
flights_sort2 = pd.DataFrame(data_sorted,
    columns=['AIR1', 'AIR2'])

28.3 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [61]:
%%timeit
flights_sort = (flights
           [['ORG_AIR', 'DEST_AIR']]
           .apply(lambda ser:
                 ser.sort_values().reset_index(drop=True), 
                 axis='columns')
          )

35.8 s ± 1.29 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Finding the longest streak of on-time flights

One of the most important metrics for airlines is their on-time flight performance. The
Federal Aviation Administration considers a flight delayed when it arrives at least 15 minutes
later than its scheduled arrival time. pandas includes methods to calculate the total and
percentage of on-time flights per airline. While these basic summary statistics are an
important metric, there are other non-trivial calculations that are interesting, such as finding
the length of consecutive on-time flights for each airline at each of its origin airports.

In this recipe, we find the longest consecutive streak of on-time flights for each airline at
each origin airport. This requires each value in a column to be aware of the value immediately
following it. We make clever use of the .diff and .cumsum methods to find streaks before
applying this methodology to each of the groups.

In [65]:
# Before we get started with the flights dataset,
# let's practice counting streaks of ones
# with a small sample Series:

s = pd.Series([0, 1, 1, 0, 1, 1,1, 0])

In [66]:
s

0    0
1    1
2    1
3    0
4    1
5    1
6    1
7    0
dtype: int64

Our final representation of the streaks of ones will be a Series of the same length
as the original with an independent count beginning from one for each streak. To get
started, let's use the .cumsum method:

In [68]:
s1 = s.cumsum()
s1

0    0
1    1
2    2
3    2
4    3
5    4
6    5
7    5
dtype: int64

In [69]:
# We have now accumulated all the ones going down the
# Series. Let's multiply this Series by the original:
s.mul(s1)

0    0
1    1
2    2
3    0
4    3
5    4
6    5
7    0
dtype: int64

We have only non-zero values where we originally had ones. This result is fairly close
to what we desire. We just need to restart each streak at one instead of where the
cumulative sum left off. Let's chain the .diff method, which subtracts the previous
value from the current:

In [70]:
s.mul(s1).diff()

0    NaN
1    1.0
2    1.0
3   -2.0
4    3.0
5    1.0
6    1.0
7   -5.0
dtype: float64

A negative value represents the end of a streak. We need to propagate the negative
values down the Series and use them to subtract away the excess accumulation from
step 2. To do this, we will make all non-negative values missing with the .where
method:

In [71]:
(s
 .mul(s.cumsum())
 .diff()
 .where(lambda x: x < 0)
)

0    NaN
1    NaN
2    NaN
3   -2.0
4    NaN
5    NaN
6    NaN
7   -5.0
dtype: float64

In [72]:
# We can now propagate these values down 
# with the .ffill method:
(s
 .mul(s.cumsum())
 .diff()
 .where(lambda x: x < 0)
 .ffill()
)

0    NaN
1    NaN
2    NaN
3   -2.0
4   -2.0
5   -2.0
6   -2.0
7   -5.0
dtype: float64

In [74]:
# Finally, we can add this Series back to the
# cumulative sum to clear out the excess accumulation:
(s
 .mul(s.cumsum())
 .diff()
 .where(lambda x: x < 0)
 .ffill()
 .add(s.cumsum(), fill_value=0)
)

0    0.0
1    1.0
2    2.0
3    0.0
4    1.0
5    2.0
6    3.0
7    0.0
dtype: float64

Now that we have a working consecutive streak finder, we can find the longest streak
per airline and origin airport. Let's read in the flights dataset and create a column
to represent on-time arrival:

In [75]:
flights = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/flights.csv')

In [76]:
(flights
 .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
 [['AIRLINE', 'ORG_AIR', 'ON_TIME']]

)
 

Unnamed: 0,AIRLINE,ORG_AIR,ON_TIME
0,WN,LAX,0
1,UA,DEN,1
2,MQ,DFW,0
3,AA,DFW,1
4,WN,LAX,0
...,...,...,...
58487,AA,SFO,1
58488,F9,LAS,1
58489,OO,SFO,1
58490,WN,MSP,0


In [77]:
# Use our logic from the first seven steps to define 
# a function that returns the maximum streak of ones
# for a given Series:
def max_streak(s):
    sl = s.cumsum()
    return (s
            .mul(sl)
            .diff()
            .where(lambda x : x < 0)
            .ffill()
            .add(sl, fill_value=0)
            .max()
           )

Find the maximum streak of on-time arrivals per airline and origin airport along with
the total number of flights and the percentage of on-time arrivals. First, sort the day
of the year and the scheduled departure time:

In [78]:
(flights
   .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
   .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
   .groupby(['AIRLINE', 'ORG_AIR'])
    ['ON_TIME']
    .agg(['mean', 'size', max_streak])
    .round(2)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,size,max_streak
AIRLINE,ORG_AIR,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AA,ATL,0.82,233,15.0
AA,DEN,0.74,219,17.0
AA,DFW,0.78,4006,64.0
AA,IAH,0.80,196,24.0
AA,LAS,0.79,374,29.0
...,...,...,...,...
WN,LAS,0.77,2031,39.0
WN,LAX,0.70,1135,23.0
WN,MSP,0.84,237,32.0
WN,PHX,0.77,1724,33.0


Now that we have found the longest streaks of on-time arrivals, we can easily find the opposite
– the longest streak of delayed arrivals. The following function returns two rows for each group
passed to it. The first row is the start of the streak, and the last row is the end of the streak.
Each row contains the month and day that the streak started and ended, along with the total
streak length:

In [95]:
def max_delay_streak(df):
    df = df.reset_index(drop=True)
    late = 1 - df['ON_TIME']
    late_sum = late.cumsum()
    streak = (late
              .mul(late_sum)
              .diff()
              .where(lambda x : x < 0)
              .ffill()
              .add(late_sum, fill_value=0)
              )
    last_idx = streak.idxmax()
    first_idx = last_idx - streak.max() + 1
    res = (df
        .loc[[first_idx, last_idx], ['MONTH', 'DAY']]
        .assign(streak=streak.max())
    )
    res.index = ['first', 'last']
    return res

In [96]:
(flights
    .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
    .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
    .groupby(['AIRLINE', 'ORG_AIR'])
    .apply(max_delay_streak)
    .sort_values('streak', ascending=False)
)

KeyError: '[1.0] not in index'