<h1>Data Aggregation and Group Operations</h1>

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, can be a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a versatile `groupby` interface, enabling you to slice, dice, and summarize datasets in a natural way.

One reason for the popularity of relational databases and SQL (which stands for “structured query language”) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL impose certain limitations on the kinds of group operations that can be performed. As you will see, with the expressiveness of Python and pandas, we can perform quite complex group operations by expressing them as custom Python functions that manipulate the data associated with each group. In this notebook, you will learn how to:

<ul>
    <li>Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)</li>
    <li>Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function</li>
    <li>Apply within-group transformations or other manipulations</li>
    <li>Compute pivot tables and cross-tabulations</li>
    <li>Perform quantile analysis and other statistical group analyses</li>
</ul>

As with the rest of the chapters, we start by importing NumPy and pandas:

In [1]:
import numpy as np
import pandas as pd

<h2>How to Think About Group Operations</h2>

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows `(axis="index")` or its columns `(axis="columns")`. Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 10.1 for a mockup of a simple group aggregation.
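
To make the three stages concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with a single `groupby` call. The `toy` DataFrame and its column names are made up for illustration and are not part of the dataset used below.

```python
import pandas as pd

# Hypothetical toy data, not taken from the text
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "c"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: collect the values belonging to each distinct key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece.sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value").rename_axis("key").sort_index()

# groupby performs the same three steps in one call
via_groupby = toy.groupby("key")["value"].sum()

print(manual)
print(via_groupby)
```

Both objects hold one sum per key; the `groupby` version simply hides the bookkeeping.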

Each grouping key can take many forms, and the keys do not have to be all of the same type:
<ul>
    <li>A list or array of values that is the same length as the axis being grouped</li>
    <li>A value indicating a column name in a DataFrame</li>
    <li>A dictionary or Series giving a correspondence between the values on the axis being grouped and the group names</li>
    <li>A function to be invoked on the axis index or the individual labels in the index</li>
</ul>

Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object. Don’t worry if this all seems abstract. Throughout this notebook, I will give many examples of all these methods. To get started, here is a small tabular dataset as a DataFrame:

In [2]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7),
                   "data2" : np.random.standard_normal(7)})
df

,key1,key2,data1,data2
0,a,1,1.94398,0.682804
1,a,2,-1.644034,-1.225634
2,None,1,-1.512279,-1.355291
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425
5,a,<NA>,-2.096923,0.162421
6,None,1,0.358883,-0.539933


Suppose you wanted to compute the mean of the `data1` column using the labels from `key1`. There are a number of ways to do this. One is to access `data1` and call `groupby` with the column (a Series) at `key1`:

In [3]:
grouped = df["data1"].groupby(df["key1"])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000251ECB12320>

This `grouped` variable is now a special "GroupBy" object. It has not actually computed anything yet except for some intermediate data about the group key `df["key1"]`. The idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy’s `mean` method:

In [4]:
grouped.mean()

key1
a   -0.598992
b    0.325705
Name: data1, dtype: float64

Later we'll explain more about what happens when you call `.mean()`. The important thing here is that the data (a Series) has been aggregated by splitting the data on the group key, producing a new Series that is now indexed by the unique values in the `key1` column. The result index has the name "key1" because the DataFrame column `df["key1"]` did.

If instead we had passed multiple arrays as a list, we'd get something different:

In [5]:
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()
means

key1  key2
a     1       1.943980
      2      -1.644034
b     1       0.485218
      2       0.166192
Name: data1, dtype: float64

Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

In [6]:
means.unstack()

key2,1,2
key1,,
a,1.94398,-1.644034
b,0.485218,0.166192


In this example, the group keys are all Series, though they could be any arrays of the right length:

In [7]:
df

,key1,key2,data1,data2
0,a,1,1.94398,0.682804
1,a,2,-1.644034,-1.225634
2,None,1,-1.512279,-1.355291
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425
5,a,<NA>,-2.096923,0.162421
6,None,1,0.358883,-0.539933


In [8]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]
df["data1"].groupby([states, years]).head()

0    1.943980
1   -1.644034
2   -1.512279
3    0.166192
4    0.485218
5   -2.096923
6    0.358883
Name: data1, dtype: float64

In [9]:
df["data1"].groupby([states, years]).mean()

CA  2005   -1.870479
    2006   -1.512279
OH  2005    1.055086
    2006    0.422051
Name: data1, dtype: float64

Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:

In [10]:
df.groupby("key1").mean()

key1,key2,data1,data2
a,1.5,-0.598992,-0.126803
b,1.5,0.325705,0.796544


In [11]:
df.groupby("key2").mean(numeric_only=True)

key2,data1,data2
1,0.31895,-0.318961
2,-0.738921,0.215439


In [12]:
df.groupby(["key1", "key2"]).mean()

key1,key2,data1,data2
a,1,1.94398,0.682804
a,2,-1.644034,-1.225634
b,1,0.485218,-0.063425
b,2,0.166192,1.656512


You may have noticed in the second case, `df.groupby("key2").mean(numeric_only=True)`, that there is no `key1` column in the result. Because `df["key1"]` is not numeric data, it is said to be a nuisance column, and passing `numeric_only=True` excludes it from the result. Older pandas versions dropped such columns automatically, whereas pandas 2.0 and later raise an error if a non-numeric column reaches `mean`. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset, as you’ll see soon.
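
As a small illustration of the two ways to keep a non-numeric column out of a mean, the sketch below uses a made-up `toy` frame (its column names are assumptions, not part of `df`). It either asks `mean` to skip non-numeric columns or selects the numeric columns up front.

```python
import pandas as pd

# Hypothetical toy frame with one non-numeric column ("label")
toy = pd.DataFrame({"group": [1, 1, 2, 2],
                    "label": ["x", "y", "x", "y"],
                    "value": [10.0, 20.0, 30.0, 50.0]})

# Option 1: ask mean to skip non-numeric columns
means_numeric = toy.groupby("group").mean(numeric_only=True)

# Option 2: select the columns to aggregate explicitly
means_selected = toy.groupby("group")[["value"]].mean()

print(means_numeric)
print(means_selected)
```

Either way the `label` column is absent from the result; the explicit selection has the advantage of documenting which columns you meant to aggregate.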

Regardless of the objective in using `groupby`, a generally useful GroupBy method is `size`, which returns a Series containing group sizes:

In [13]:
df.groupby(["key1", "key2"]).size()

key1  key2
a     1       1
      2       1
b     1       1
      2       1
dtype: int64

Note that any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing `dropna=False` to `groupby`:

In [14]:
df

,key1,key2,data1,data2
0,a,1,1.94398,0.682804
1,a,2,-1.644034,-1.225634
2,None,1,-1.512279,-1.355291
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425
5,a,<NA>,-2.096923,0.162421
6,None,1,0.358883,-0.539933


In [15]:
df.groupby("key1", dropna=False).size()

key1
a      3
b      2
NaN    2
dtype: int64

In [16]:
df.groupby(["key1", "key2"], dropna=False).size()

key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

A group function similar in spirit to `size` is `count`, which computes the number of nonnull values in each group:

In [17]:
df.groupby("key1").count()

key1,key2,data1,data2
a,2,3,3
b,2,2,2
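
The difference between `size` and `count` only shows up when values are missing. The following sketch uses a made-up `toy` frame (an assumption for illustration) in which some `x` values are NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in "x"
toy = pd.DataFrame({"key": ["a", "a", "b", "b", "b"],
                    "x": [1.0, np.nan, 3.0, np.nan, np.nan]})

sizes = toy.groupby("key").size()          # rows per group, missing or not
counts = toy.groupby("key")["x"].count()   # non-null x values per group

print(sizes)
print(counts)
```

Group `b` has three rows but only one non-null `x`, so the two methods disagree there.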


<h3>Iterating over Groups</h3>

The object returned by `groupby` supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

In [18]:
for name, group in df.groupby("key1"):
    print(name)
    print(group)

a
  key1  key2     data1     data2
0    a     1  1.943980  0.682804
1    a     2 -1.644034 -1.225634
5    a  <NA> -2.096923  0.162421
b
  key1  key2     data1     data2
3    b     2  0.166192  1.656512
4    b     1  0.485218 -0.063425


In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [19]:
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print((k1, k2))
    print(group)

('a', 1)
  key1  key2    data1     data2
0    a     1  1.94398  0.682804
('a', 2)
  key1  key2     data1     data2
1    a     2 -1.644034 -1.225634
('b', 1)
  key1  key2     data1     data2
4    b     1  0.485218 -0.063425
('b', 2)
  key1  key2     data1     data2
3    b     2  0.166192  1.656512


Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dictionary of the data pieces as a one-liner:

In [20]:
pieces = {name: group for name, group in df.groupby("key1")}
pieces["b"]

,key1,key2,data1,data2
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425


<h3>Selecting a Column or Subset of Columns</h3>

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:

In [21]:
df.groupby("key1")["data1"]
# df.groupby("key1")[["data2"]]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000251ECBA4D60>

are conveniences for:

In [22]:
df["data1"].groupby(df["key1"])
# df[["data2"]].groupby(df["key1"])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000251ECBA4D90>

Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute the means for just the `data2` column and get the result as a DataFrame, we could write:

In [23]:
df.groupby(["key1", "key2"])[["data2"]].mean()

key1,key2,data2
a,1,0.682804
a,2,-1.225634
b,1,-0.063425
b,2,1.656512


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:

In [24]:
s_grouped = df.groupby(["key1", "key2"])["data2"]
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000251ECBA5540>

In [25]:
s_grouped.mean()

key1  key2
a     1       0.682804
      2      -1.225634
b     1      -0.063425
      2       1.656512
Name: data2, dtype: float64
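
A quick way to see the scalar-versus-list distinction is to check the type of the aggregated result. This is a sketch on a made-up `toy` frame, not on `df`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "x": [1.0, 3.0, 5.0],
                    "y": [2.0, 4.0, 6.0]})

as_series = toy.groupby("key")["x"].mean()    # scalar column name -> Series
as_frame = toy.groupby("key")[["x"]].mean()   # list of names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```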

<h3>Grouping with Dictionaries and Series</h3>

Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:

In [26]:
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people

,a,b,c,d,e
Joe,-1.725807,0.518244,0.370188,1.863498,-1.022379
Steve,-0.623977,0.262575,0.430788,1.479619,1.525482
Wanda,0.351045,,,2.652535,0.879408
Jill,0.637435,0.723615,0.050052,0.547213,-1.310029
Trey,-1.254676,0.275446,-0.388053,0.14877,0.553899


Now, suppose I have a group correspondence for the columns and want to sum the columns by group:

In [27]:
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f" : "orange"}

Now, you could construct an array from this dictionary to pass to `groupby`, but instead we can just pass the dictionary (I included the key `"f"` to highlight that unused grouping keys are OK):

In [28]:
by_column = people.groupby(mapping, axis="columns")
by_column.sum()

,blue,red
Joe,2.233686,-2.229942
Steve,1.910407,1.16408
Wanda,2.652535,1.230453
Jill,0.597265,0.051022
Trey,-0.239284,-0.42533


The same functionality holds for Series, which can be viewed as a fixed-size mapping:

In [29]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [30]:
people.groupby(map_series, axis="columns").count()

,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3
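
Recent pandas releases (2.1 and later) deprecate the `axis` argument of `groupby`. One alternative, sketched here on a made-up `toy` frame standing in for `people`, is to transpose, group the rows, and transpose back.

```python
import pandas as pd

# Hypothetical small frame standing in for `people`
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]},
                   index=["Joe", "Steve"])
mapping = {"a": "red", "b": "red", "c": "blue", "f": "orange"}

# Transpose so the columns become rows, group those rows, then transpose back
by_color = toy.T.groupby(mapping).sum().T

print(by_color)
```

The unused key `"f"` is ignored here as well, and the result has one column per color.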


<h3>Grouping with Functions</h3>

Using Python functions is a more generic way of defining a group mapping compared with a dictionary or Series. Any function passed as a group key will be called once per index value (or once per column value if using `axis="columns"`), with the return values being used as the group names. More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values. Suppose you wanted to group by name length. While you could compute an array of string lengths, it's simpler to just pass the `len` function:

In [31]:
people

,a,b,c,d,e
Joe,-1.725807,0.518244,0.370188,1.863498,-1.022379
Steve,-0.623977,0.262575,0.430788,1.479619,1.525482
Wanda,0.351045,,,2.652535,0.879408
Jill,0.637435,0.723615,0.050052,0.547213,-1.310029
Trey,-1.254676,0.275446,-0.388053,0.14877,0.553899


In [32]:
people.groupby(len).sum()

,a,b,c,d,e
3,-1.725807,0.518244,0.370188,1.863498,-1.022379
4,-0.617241,0.999061,-0.338002,0.695982,-0.756129
5,-0.272933,0.262575,0.430788,4.132154,2.40489


In [33]:
people.index.name = "name"

In [34]:
for name, group in people.groupby("name"):
    print(group)

             a         b         c         d         e
name                                                  
Jill  0.637435  0.723615  0.050052  0.547213 -1.310029
             a         b         c         d         e
name                                                  
Joe  -1.725807  0.518244  0.370188  1.863498 -1.022379
              a         b         c         d         e
name                                                   
Steve -0.623977  0.262575  0.430788  1.479619  1.525482
             a         b         c        d         e
name                                                 
Trey -1.254676  0.275446 -0.388053  0.14877  0.553899
              a   b   c         d         e
name                                       
Wanda  0.351045 NaN NaN  2.652535  0.879408


Mixing functions with arrays, dictionaries, or Series is not a problem, as everything gets converted to arrays internally:

In [35]:
key_list = ["one", "one", "one", "two", "two"]
people.groupby([len, key_list]).min()

name,,a,b,c,d,e
3,one,-1.725807,0.518244,0.370188,1.863498,-1.022379
4,two,-1.254676,0.275446,-0.388053,0.14877,-1.310029
5,one,-0.623977,0.262575,0.430788,1.479619,0.879408
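
To see that a function key is only a shortcut for an array of labels, the sketch below groups a made-up `toy` frame by `len` and by the equivalent precomputed list.

```python
import pandas as pd

# Hypothetical toy data indexed by first names
toy = pd.DataFrame({"x": [1, 2, 3, 4]},
                   index=["Joe", "Jill", "Trey", "Steve"])

# groupby calls the function on each index label: len("Joe") == 3, and so on
by_length = toy.groupby(len).sum()

# The equivalent explicit list of group labels
lengths = [len(label) for label in toy.index]
by_array = toy.groupby(lengths).sum()

print(by_length)
print(by_array)
```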


<h3>Grouping by Index Levels</h3>
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let's look at an example:

In [36]:
columns = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
                                    names=["cty", "tenor"])
                                    
hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,1.712823,-2.70691,-0.935224,0.13127,-1.371244
1,0.136923,1.296051,-0.049381,0.340378,-1.032047
2,0.115576,0.328017,-1.638901,-0.059819,-1.211753
3,0.155814,-2.015919,-0.112578,-0.889203,0.12439


To group by level, pass the level number or name using the `level` keyword:

In [37]:
hier_df.groupby(level="cty", axis="columns").count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
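
The `level` keyword works the same way when the hierarchical index is on the rows. A minimal sketch with a made-up Series (values chosen for illustration) groups by level name and by level position.

```python
import pandas as pd

# Hypothetical Series with a two-level row index
index = pd.MultiIndex.from_arrays([["US", "US", "JP", "JP"], [1, 3, 1, 3]],
                                  names=["cty", "tenor"])
toy = pd.Series([10.0, 20.0, 30.0, 50.0], index=index)

by_name = toy.groupby(level="cty").sum()   # level given by name
by_number = toy.groupby(level=0).sum()     # same level given by position

print(by_name)
```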


<h1>Data Aggregation</h1>

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including `mean`, `count`, `min`, and `sum`. You may wonder what is going on when you invoke `mean()` on a GroupBy object. Many common aggregations, such as those found in Table 10.1, have optimized implementations. However, you are not limited to only this set of methods.

<b>Note:</b> The optimized `groupby` methods are listed in the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">pandas documentation</a>.

You can use aggregations of your own devising and additionally call any method that is also defined on the object being grouped. For example, the `nsmallest` Series method selects the smallest requested number of values from the data. While `nsmallest` is not explicitly implemented for GroupBy, we can still use it with a nonoptimized implementation. Internally, GroupBy slices up the Series, calls `piece.nsmallest(n)` for each piece, and then assembles those results into the result object:

In [38]:
df

,key1,key2,data1,data2
0,a,1,1.94398,0.682804
1,a,2,-1.644034,-1.225634
2,None,1,-1.512279,-1.355291
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425
5,a,<NA>,-2.096923,0.162421
6,None,1,0.358883,-0.539933


In [39]:
grouped = df.groupby("key1")
grouped.head()

,key1,key2,data1,data2
0,a,1,1.94398,0.682804
1,a,2,-1.644034,-1.225634
3,b,2,0.166192,1.656512
4,b,1,0.485218,-0.063425
5,a,<NA>,-2.096923,0.162421


In [40]:
# Select the two smallest data1 values within each group
grouped["data1"].nsmallest(2)

key1   
a     5   -2.096923
      1   -1.644034
b     3    0.166192
      4    0.485218
Name: data1, dtype: float64

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` method or its short alias `agg`:

In [41]:
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

key1,key2,data1,data2
a,1,4.040903,1.908438
b,1,0.319027,1.719937


You may notice that some methods, like `describe`, also work, even though they are not aggregations, strictly speaking:

In [42]:
grouped.describe()

,key2,key2,key2,key2,key2,key2,key2,key2,data1,data1,...,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
key1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
a,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,3.0,-0.598992,...,0.149973,1.94398,3.0,-0.126803,0.986545,-1.225634,-0.531606,0.162421,0.422612,0.682804
b,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,2.0,0.325705,...,0.405462,0.485218,2.0,0.796544,1.216179,-0.063425,0.36656,0.796544,1.226528,1.656512


<b>Note:</b> Custom aggregation functions are generally much slower than the optimized functions. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate group data chunks.
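
When a custom aggregation can be rewritten in terms of built-in methods, doing so usually sidesteps that overhead. As a sketch on made-up data, the peak-to-peak range can be computed either with a Python function or as the difference of two optimized aggregations.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"key": ["a", "a", "b", "b"],
                    "x": [1.0, 4.0, 2.0, 10.0]})
grouped_x = toy.groupby("key")["x"]

def peak_to_peak(arr):
    return arr.max() - arr.min()

custom = grouped_x.agg(peak_to_peak)          # Python function called per group
builtin = grouped_x.max() - grouped_x.min()   # two optimized aggregations

print(custom)
print(builtin)
```

Both produce the same values; which one is faster in practice depends on the number of groups, so it is worth timing on your own data.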

<h2>Column-Wise and Multiple Function Application</h2>

Let's return to the tipping dataset used in the last chapter. After loading it with `pandas.read_csv`, we add a tipping percentage column:

In [43]:
tips = pd.read_csv("../data/tips.csv")
tips.head()

,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


Now I will add a `tip_pct` column with the tip percentage of the total bill:

In [44]:
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
tips.head()

,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using `aggregate` (or `agg`) with the desired function or calling a method like `mean` or `std`. However, you may want to aggregate using a different function, depending on the column, or multiple functions at once. Fortunately, this is possible to do, which I’ll illustrate through a number of examples. First, I’ll group the `tips` by `day` and `smoker`:

In [45]:
grouped = tips.groupby(["day", "smoker"])

Note that for descriptive statistics like those in Table 10.1, you can pass the name of the function as a string:

In [46]:
grouped_pct = grouped["tip_pct"]
grouped_pct.agg("mean")

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

In [47]:
grouped_pct.agg(["mean", "std", peak_to_peak])

day,smoker,mean,std,peak_to_peak
Fri,No,0.15165,0.028123,0.067349
Fri,Yes,0.174783,0.051293,0.159925
Sat,No,0.158048,0.039767,0.235193
Sat,Yes,0.147906,0.061375,0.290095
Sun,No,0.160113,0.042347,0.193226
Sun,Yes,0.18725,0.154134,0.644685
Thur,No,0.160298,0.038774,0.19335
Thur,Yes,0.163863,0.039389,0.15124


Here we passed a list of aggregation functions to `agg` to evaluate independently on the data groups.

You don’t need to accept the names that GroupBy gives to the columns; notably, `lambda` functions have the name `"<lambda>"`, which makes them hard to identify (you can see for yourself by looking at a function’s `__name__` attribute). Thus, if you pass a list of `(name, function)` tuples, the first element of each tuple will be used as the DataFrame column names (you can think of a list of 2-tuples as an ordered mapping):

In [48]:
grouped_pct.agg([("average", "mean"), ("stdev", np.std)])

day,smoker,average,stdev
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389


With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the `tip_pct` and `total_bill` columns:

In [49]:
functions = ["count", "mean", "max"]
result = grouped[["tip_pct", "total_bill"]].agg(functions)
result

,,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
day,smoker,count,mean,max,count,mean,max
Fri,No,4,0.15165,0.187735,4,18.42,22.75
Fri,Yes,15,0.174783,0.26348,15,16.813333,40.17
Sat,No,45,0.158048,0.29199,45,19.661778,48.33
Sat,Yes,42,0.147906,0.325733,42,21.276667,50.81
Sun,No,57,0.160113,0.252672,57,20.506667,48.17
Sun,Yes,19,0.18725,0.710345,19,24.12,45.35
Thur,No,45,0.160298,0.266312,45,17.113111,41.19
Thur,Yes,17,0.163863,0.241255,17,19.190588,43.11


As you can see, the resulting DataFrame has hierarchical columns, the same as you would get aggregating each column separately and using `concat` to glue the results together using the column names as the `keys` argument:

In [50]:
result["tip_pct"]

day,smoker,count,mean,max
Fri,No,4,0.15165,0.187735
Fri,Yes,15,0.174783,0.26348
Sat,No,45,0.158048,0.29199
Sat,Yes,42,0.147906,0.325733
Sun,No,57,0.160113,0.252672
Sun,Yes,19,0.18725,0.710345
Thur,No,45,0.160298,0.266312
Thur,Yes,17,0.163863,0.241255


As before, a list of tuples with custom names can be passed:

In [51]:
ftuples = [("Average", "mean"), ("Variance", np.var)]
grouped[["tip_pct", "total_bill"]].agg(ftuples)

,,tip_pct,tip_pct,total_bill,total_bill
day,smoker,Average,Variance,Average,Variance
Fri,No,0.15165,0.000791,18.42,25.596333
Fri,Yes,0.174783,0.002631,16.813333,82.562438
Sat,No,0.158048,0.001581,19.661778,79.908965
Sat,Yes,0.147906,0.003767,21.276667,101.387535
Sun,No,0.160113,0.001793,20.506667,66.09998
Sun,Yes,0.18725,0.023757,24.12,109.046044
Thur,No,0.160298,0.001503,17.113111,59.625081
Thur,Yes,0.163863,0.001551,19.190588,69.808518


Now, suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dictionary to `agg` that contains a mapping of column names to any of the function specifications listed so far:

In [52]:
grouped.agg({"tip" : np.max, "size" : "sum"})

day,smoker,tip,size
Fri,No,3.5,9
Fri,Yes,4.73,31
Sat,No,9.0,115
Sat,Yes,10.0,104
Sun,No,6.0,167
Sun,Yes,6.5,49
Thur,No,6.7,112
Thur,Yes,5.0,40


In [53]:
grouped.agg({"tip_pct" : ["min", "max", "mean", "std"],
             "size" : "sum"})

,,tip_pct,tip_pct,tip_pct,tip_pct,size
day,smoker,min,max,mean,std,sum
Fri,No,0.120385,0.187735,0.15165,0.028123,9
Fri,Yes,0.103555,0.26348,0.174783,0.051293,31
Sat,No,0.056797,0.29199,0.158048,0.039767,115
Sat,Yes,0.035638,0.325733,0.147906,0.061375,104
Sun,No,0.059447,0.252672,0.160113,0.042347,167
Sun,Yes,0.06566,0.710345,0.18725,0.154134,49
Thur,No,0.072961,0.266312,0.160298,0.038774,112
Thur,Yes,0.090014,0.241255,0.163863,0.039389,40


A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
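
If you want several statistics but flat column names, pandas also supports named aggregation, where each keyword argument to `agg` names an output column and maps it to an `(input column, function)` pair. The sketch below uses a made-up `toy` frame rather than `tips`, and the output names are arbitrary.

```python
import pandas as pd

# Hypothetical toy data loosely shaped like the tipping dataset
toy = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                    "tip": [1.0, 3.0, 2.0, 6.0],
                    "size": [2, 4, 2, 3]})

summary = toy.groupby("day").agg(
    avg_tip=("tip", "mean"),      # output name = (input column, function)
    max_tip=("tip", "max"),
    total_size=("size", "sum"),
)

print(summary)
```

The result has ordinary, single-level columns even though `tip` was aggregated twice.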

<h2>Returning Aggregated Data Without Row Indexes</h2>

In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn’t always desirable, you can disable this behavior in most cases by passing `as_index=False` to `groupby`:

In [54]:
tips

,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.50,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.139780
4,24.59,3.61,No,Sun,Dinner,4,0.146808
...,...,...,...,...,...,...,...
239,29.03,5.92,No,Sat,Dinner,3,0.203927
240,27.18,2.00,Yes,Sat,Dinner,2,0.073584
241,22.67,2.00,Yes,Sat,Dinner,2,0.088222
242,17.82,1.75,No,Sat,Dinner,2,0.098204


In [55]:
tips.groupby(["day", "smoker"], as_index=False).mean(numeric_only=True)

,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863


Of course, it’s always possible to obtain the result in this format by calling `reset_index` on the result. Using the `as_index=False` argument avoids some unnecessary computations.
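
The equivalence between the two approaches can be checked on a small made-up frame (the data here is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"day": ["Fri", "Fri", "Sat"],
                    "tip": [1.0, 3.0, 2.0]})

flat = toy.groupby("day", as_index=False).mean()
via_reset = toy.groupby("day").mean().reset_index()

print(flat)
print(via_reset)
```

In both results `day` is an ordinary column and the row index is the default integer range.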

<h1>Apply: General split-apply-combine</h1>

The most general-purpose GroupBy method is `apply`, which is the subject of this section. `apply` splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces.

Returning to the tipping dataset from before, suppose you wanted to select the top five `tip_pct` values by group. First, write a function that selects the rows with the largest values in a particular column:

In [56]:
def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]
    
top(tips, n=6)

,total_bill,tip,smoker,day,time,size,tip_pct
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


Now, if we group by `smoker`, say, and call `apply` with this function, we get the following:

In [57]:
tips.groupby("smoker").apply(top)

smoker,,total_bill,tip,smoker,day,time,size,tip_pct
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


What has happened here? First, the `tips` DataFrame is split into groups based on the value of `smoker`. Then the `top` function is called on each group, and the results of each function call are glued together using `pandas.concat`, labeling the pieces with the group names. The result therefore has a hierarchical index with an inner level that contains index values from the original DataFrame.

If you pass a function to `apply` that takes other arguments or keywords, you can pass these after the function:

In [58]:
tips.groupby(["smoker", "day"]).apply(top, n=1, column="total_bill")

smoker,day,,total_bill,tip,smoker,day,time,size,tip_pct
No,Fri,94,22.75,3.25,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Yes,Thur,Lunch,4,0.115982
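
Here is a self-contained sketch of the same pattern on made-up data. It selects the columns of interest before calling `apply`, so each piece passed to `top` contains only those columns; that choice is mine, made so the example behaves the same across pandas versions that differ in how they treat the grouping column.

```python
import pandas as pd

# Hypothetical toy data loosely shaped like the tipping dataset
toy = pd.DataFrame({"smoker": ["No", "No", "No", "Yes", "Yes"],
                    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
                    "tip_pct": [0.10, 0.30, 0.20, 0.15, 0.05]})

def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]

# Keyword arguments after the function are forwarded to it for every group
result = toy.groupby("smoker")[["total_bill", "tip_pct"]].apply(top, n=1)

print(result)
```

The outer index level holds the group key and the inner level holds the original row label of the selected row.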


Beyond these basic usage mechanics, getting the most out of `apply` may require some creativity. What occurs inside the function passed is up to you; it must either return a pandas object or a scalar value. The rest of this chapter will consist mainly of examples showing you how to solve various problems using `groupby`.

For example, you may recall that I earlier called `describe` on a GroupBy object:

In [58]:
result = tips.groupby("smoker")["tip_pct"].describe()
result

smoker,count,mean,std,min,25%,50%,75%,max
No,151.0,0.159328,0.03991,0.056797,0.136906,0.155625,0.185014,0.29199
Yes,93.0,0.163196,0.085119,0.035638,0.106771,0.153846,0.195059,0.710345


In [59]:
result.unstack("smoker")

       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64

Inside GroupBy, when you invoke a method like `describe`, it is actually just a shortcut for:

In [60]:
def f(group):
    return group.describe()

grouped.apply(f)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,size,tip_pct
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Fri,No,count,4.000000,4.000000,4.00,4.000000
Fri,No,mean,18.420000,2.812500,2.25,0.151650
Fri,No,std,5.059282,0.898494,0.50,0.028123
Fri,No,min,12.460000,1.500000,2.00,0.120385
Fri,No,25%,15.100000,2.625000,2.00,0.137239
...,...,...,...,...,...,...
Thur,Yes,min,10.340000,2.000000,2.00,0.090014
Thur,Yes,25%,13.510000,2.000000,2.00,0.148038
Thur,Yes,50%,16.470000,2.560000,2.00,0.153846
Thur,Yes,75%,19.810000,4.000000,2.00,0.194837


<h2>Suppressing the Group Keys</h2>

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys, along with the indexes of each piece of the original object. You can disable this by passing `group_keys=False` to `groupby`:

In [61]:
tips.groupby("smoker", group_keys=False).apply(top)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
232,11.61,3.39,No,Sat,Dinner,2,0.29199
149,7.51,2.0,No,Thur,Lunch,2,0.266312
51,10.29,2.6,No,Sun,Dinner,2,0.252672
185,20.69,5.0,No,Sun,Dinner,5,0.241663
88,24.71,5.85,No,Thur,Lunch,2,0.236746
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
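
To see the effect of `group_keys` in isolation, here is a minimal sketch on a made-up Series. The data, the key list, and the `largest` helper are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up values with string labels, plus one group key per element
s = pd.Series([3, 1, 2, 5, 4], index=list("vwxyz"))
keys = ["a", "a", "a", "b", "b"]

def largest(group):
    # Keep only the single largest value in each group
    return group.nlargest(1)

# With group keys: the result gains an outer index level of group names
with_keys = s.groupby(keys, group_keys=True).apply(largest)

# Without group keys: only the original labels remain
without_keys = s.groupby(keys, group_keys=False).apply(largest)

print(with_keys)
print(without_keys)
```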


<h2>Quantile and Bucket Analysis</h2>

As you may recall from Ch 8: Data Wrangling: Join, Combine, and Reshape, pandas has some tools, in particular `pandas.cut` and `pandas.qcut`, for slicing data up into buckets with bins of your choosing, or by sample quantiles. Combining these functions with `groupby` makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using `pandas.cut`:

In [62]:
frame = pd.DataFrame({"data1": np.random.standard_normal(1000),
                      "data2": np.random.standard_normal(1000)})
frame.head()

Unnamed: 0,data1,data2
0,-1.041463,-0.610736
1,-2.037502,0.61191
2,-0.840296,-1.428722
3,1.037623,0.35307
4,-1.089423,0.1292


In [63]:
quartiles = pd.cut(frame["data1"], 4)
quartiles.head(10)

0    (-1.992, -0.167]
1    (-3.824, -1.992]
2    (-1.992, -0.167]
3     (-0.167, 1.658]
4    (-1.992, -0.167]
5     (-0.167, 1.658]
6    (-1.992, -0.167]
7     (-0.167, 1.658]
8     (-0.167, 1.658]
9     (-0.167, 1.658]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.824, -1.992] < (-1.992, -0.167] < (-0.167, 1.658] < (1.658, 3.483]]

The `Categorical` object returned by `cut` can be passed directly to `groupby`. So we could compute a set of group statistics for the quartiles, like so:

In [64]:
def get_stats(group):
    return pd.DataFrame(
        {"min": group.min(), "max": group.max(),
        "count": group.count(), "mean": group.mean()}
    )

grouped = frame.groupby(quartiles)
grouped.apply(get_stats)

Unnamed: 0_level_0,Unnamed: 1_level_0,min,max,count,mean
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"(-3.824, -1.992]",data1,-3.817136,-2.037502,24,-2.386472
"(-3.824, -1.992]",data2,-2.0577,1.617052,24,-0.362802
"(-1.992, -0.167]",data1,-1.987754,-0.170349,393,-0.792699
"(-1.992, -0.167]",data2,-3.310811,3.045615,393,-0.036599
"(-0.167, 1.658]",data1,-0.158601,1.656236,539,0.557116
"(-0.167, 1.658]",data2,-3.700327,3.217595,539,0.019418
"(1.658, 3.483]",data1,1.667286,3.483078,44,2.159714
"(1.658, 3.483]",data2,-1.942517,1.717329,44,-0.026539


Keep in mind the same result could have been computed more simply with:

In [65]:
grouped.agg(["min", "max", "count", "mean"])

Unnamed: 0_level_0,data1,data1,data1,data1,data2,data2,data2,data2
Unnamed: 0_level_1,min,max,count,mean,min,max,count,mean
data1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
"(-3.824, -1.992]",-3.817136,-2.037502,24,-2.386472,-2.0577,1.617052,24,-0.362802
"(-1.992, -0.167]",-1.987754,-0.170349,393,-0.792699,-3.310811,3.045615,393,-0.036599
"(-0.167, 1.658]",-0.158601,1.656236,539,0.557116,-3.700327,3.217595,539,0.019418
"(1.658, 3.483]",1.667286,3.483078,44,2.159714,-1.942517,1.717329,44,-0.026539


These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use `pandas.qcut`. We can pass `4` as the number of buckets to compute sample quartiles, and pass `labels=False` to obtain just the quartile indices instead of intervals:

In [66]:
quartiles_samp = pd.qcut(frame["data1"], 4, labels=False)
quartiles_samp.head()
grouped = frame.groupby(quartiles_samp)
grouped.apply(get_stats)

Unnamed: 0_level_0,Unnamed: 1_level_0,min,max,count,mean
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,data1,-3.817136,-0.584132,250,-1.238065
0,data2,-3.310811,2.572478,250,-0.09185
1,data1,-0.583168,0.013307,250,-0.261018
1,data2,-2.894694,3.045615,250,0.054874
2,data1,0.02201,0.672325,250,0.337001
2,data2,-3.700327,2.760301,250,-0.092585
3,data1,0.680109,3.483078,250,1.268109
3,data2,-2.687404,3.217595,250,0.074392
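
The difference between equal-length and equal-size buckets is easier to see on skewed data. The following sketch uses a deterministic, made-up Series (squares of the integers 0 to 999), which is an assumption for illustration and not part of the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Deliberately skewed, deterministic data: squares of 0..999
values = pd.Series(np.arange(1000.0) ** 2)

# Equal-length buckets: same width, very different counts
by_length = pd.cut(values, 4)
length_counts = by_length.value_counts()

# Equal-size buckets: same count, different widths
by_size = pd.qcut(values, 4, labels=False)
size_counts = by_size.value_counts()

print(length_counts)
print(size_counts)
```

With `cut`, most observations land in the lowest interval, while `qcut` places the same number of observations in every bucket.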


<h2>Example: Filling Missing Values with Group-Specific Value</h2>

When cleaning up missing data, in some cases you will remove data observations using `dropna`, but in others you may want to fill in the null (NA) values using a fixed value or some value derived from the data. `fillna` is the right tool to use; for example, here I fill in the null values with the mean:

In [67]:
s = pd.Series(np.random.standard_normal(6))
s[::2] = np.nan
s

0         NaN
1   -0.986873
2         NaN
3   -0.729564
4         NaN
5    0.391188
dtype: float64

In [68]:
s.fillna(s.mean())

0   -0.441749
1   -0.986873
2   -0.441749
3   -0.729564
4   -0.441749
5    0.391188
dtype: float64

Suppose you need the fill value to vary by group. One way to do this is to group the data and use `apply` with a function that calls `fillna` on each data chunk. Here is some sample data on US states divided into eastern and western regions:

In [69]:
states = ["Ohio", "New York", "Vermont", "Florida",
          "Oregon", "Nevada", "California", "Idaho"]
group_key = ["East", "East", "East", "East",
             "West", "West", "West", "West"]
data = pd.Series(np.random.standard_normal(8), index=states)
data

Ohio         -1.932937
New York      0.625985
Vermont      -0.103814
Florida      -0.152142
Oregon       -1.065127
Nevada        0.252103
California    0.346046
Idaho         0.958984
dtype: float64

Let's set some values in the data to be missing:

In [70]:
data[["Vermont", "Nevada", "Idaho"]] = np.nan
data

Ohio         -1.932937
New York      0.625985
Vermont            NaN
Florida      -0.152142
Oregon       -1.065127
Nevada             NaN
California    0.346046
Idaho              NaN
dtype: float64

In [71]:
data.groupby(group_key).size()

East    4
West    4
dtype: int64

In [72]:
data.groupby(group_key).count()

East    3
West    2
dtype: int64

In [73]:
data.groupby(group_key).mean()

East   -0.486365
West   -0.359540
dtype: float64

We can fill the NA values using the group means, like so:

In [74]:
def fill_mean(group):
    return group.fillna(group.mean())

data.groupby(group_key).apply(fill_mean)

Ohio         -1.932937
New York      0.625985
Vermont      -0.486365
Florida      -0.152142
Oregon       -1.065127
Nevada       -0.359540
California    0.346046
Idaho        -0.359540
dtype: float64
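
One possible alternative, sketched here on made-up data, is to broadcast each group's mean back to the original shape with `transform` and pass the result to `fillna`. The `values` and `keys` names are hypothetical, and this is not necessarily how you would want to do it in every case.

```python
import numpy as np
import pandas as pd

# Made-up data: two groups, each with one missing value
values = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                   index=["a1", "a2", "a3", "b1", "b2", "b3"])
keys = ["A", "A", "A", "B", "B", "B"]

# Group means aligned with the original index (NA values are skipped)
group_means = values.groupby(keys).transform("mean")

# fillna with a Series fills each missing entry from the matching label
filled = values.fillna(group_means)
print(filled)
```

Because the result of `transform` keeps the original index, no group keys are added and the output has the same flat index as the input.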

In another case, you might have predefined fill values in your code that vary by group. Since the groups have a `name` attribute set internally, we can use that:

In [75]:
fill_values = {"East": 0.5, "West": -1}
def fill_func(group):
    return group.fillna(fill_values[group.name])

data.groupby(group_key, group_keys=True).apply(fill_func)

East  Ohio         -1.932937
      New York      0.625985
      Vermont       0.500000
      Florida      -0.152142
West  Oregon       -1.065127
      Nevada       -1.000000
      California    0.346046
      Idaho        -1.000000
dtype: float64

<h2>Example: Random Sampling and Permutation</h2>

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the “draws”; here we use the `sample` method for Series.

To demonstrate, here’s a way to construct a deck of English-style playing cards:

In [76]:
suits = ["H", "S", "C", "D"]  # Hearts, Spades, Clubs, Diamonds
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)

Now we have a Series of length 52 whose index contains card names, and values are the ones used in blackjack and other games (to keep things simple, I let the ace `"A"` be 1):

In [77]:
deck.head(13)

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

Now, based on what I said before, drawing a hand of five cards from the deck could be written as:

In [78]:
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)

9S     9
9H     9
8H     8
QC    10
3C     3
dtype: int64

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use `apply`:

In [79]:
def get_suit(card):
    # last letter is suit
    return card[-1]

deck.groupby(get_suit).apply(draw, n=2)

C  6C     6
   5C     5
D  KD    10
   JD    10
H  5H     5
   JH    10
S  KS    10
   2S     2
dtype: int64

Alternatively, we could pass `group_keys=False` to drop the outer suit index, leaving in just the selected cards:

In [80]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

7C     7
AC     1
9D     9
7D     7
JH    10
KH    10
3S     3
4S     4
dtype: int64

<h2>Example: Group Weighted Average and Correlation</h2>

Under the split-apply-combine paradigm of `groupby`, operations between columns in a DataFrame or two Series, such as a group weighted average, are possible. As an example, take this dataset containing group keys, values, and some weights:

In [81]:
df = pd.DataFrame({"category": ["a", "a", "a", "a",
                                "b", "b", "b", "b"],
                   "data": np.random.standard_normal(8),
                   "weights": np.random.uniform(size=8)})
df

Unnamed: 0,category,data,weights
0,a,0.371363,0.662625
1,a,0.455228,0.693703
2,a,0.65786,0.781763
3,a,0.133141,0.294274
4,b,1.488984,0.824164
5,b,-0.205509,0.098908
6,b,-1.434979,0.435225
7,b,-0.322268,0.209844


The weighted average by `category` would then be:

In [82]:
grouped = df.groupby("category")
def get_wavg(group):
    return np.average(group["data"], weights=group["weights"])

grouped.apply(get_wavg)

category
a    0.458541
b    0.328208
dtype: float64
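
A weighted average is the sum of value times weight divided by the sum of weights, so it can also be assembled from two built-in group sums. The sketch below does this on a small made-up frame (`toy` is a hypothetical stand-in for `df`) so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "data": [1.0, 2.0, 10.0, 20.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})

# Weighted sum and total weight per group, using built-in aggregations
weighted_sum = (toy["data"] * toy["weights"]).groupby(toy["category"]).sum()
total_weight = toy["weights"].groupby(toy["category"]).sum()
wavg = weighted_sum / total_weight
print(wavg)
```

For category `a` this gives (1 × 1 + 2 × 3) / (1 + 3) = 1.75, which matches what `np.average` returns for the same values and weights.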

As another example, consider a financial dataset originally obtained from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index (the `SPX` symbol):

In [84]:
close_px = pd.read_csv("../data/stock_px.csv", parse_dates=True, index_col=0)

close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB


In [85]:
close_px.tail(4)

Unnamed: 0,AAPL,MSFT,XOM,SPX
2011-10-11,400.29,27.0,76.27,1195.54
2011-10-12,402.19,26.96,77.16,1207.25
2011-10-13,408.43,27.18,76.37,1203.66
2011-10-14,422.0,27.27,78.11,1224.58


The DataFrame `info()` method here is a convenient way to get an overview of the contents of a DataFrame.

One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with `SPX`. As one way to do this, we first create a function that computes the pair-wise correlation of each column with the `"SPX"` column:

In [86]:
def spx_corr(group):
    return group.corrwith(group["SPX"])

Next, we compute percent change on `close_px` using `pct_change`:

In [87]:
rets = close_px.pct_change().dropna()

Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the `year` attribute of each datetime `label`:

In [88]:
def get_year(x):
    return x.year

by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

Unnamed: 0,AAPL,MSFT,XOM,SPX
2003,0.541124,0.745174,0.661265,1.0
2004,0.374283,0.588531,0.557742,1.0
2005,0.46754,0.562374,0.63101,1.0
2006,0.428267,0.406126,0.518514,1.0
2007,0.508118,0.65877,0.786264,1.0
2008,0.681434,0.804626,0.828303,1.0
2009,0.707103,0.654902,0.797921,1.0
2010,0.710105,0.730118,0.839057,1.0
2011,0.691931,0.800996,0.859975,1.0


You could also compute intercolumn correlations. Here we compute the annual correlation between Apple and Microsoft:

In [89]:
def corr_aapl_msft(group):
    return group["AAPL"].corr(group["MSFT"])
by_year.apply(corr_aapl_msft)

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
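
Instead of a helper function like `get_year`, it should also be possible to group on the `year` attribute of a DatetimeIndex directly. The following sketch uses made-up returns for two hypothetical tickers, `A` and `B`, constructed so the yearly correlations are easy to verify.

```python
import pandas as pd

# Made-up returns for two hypothetical tickers over two years
idx = pd.to_datetime(["2020-01-02", "2020-06-01", "2020-12-01",
                      "2021-01-04", "2021-06-01", "2021-12-01"])
toy_rets = pd.DataFrame({"A": [0.01, 0.02, 0.03, 0.01, 0.02, 0.03],
                         "B": [0.02, 0.04, 0.06, 0.03, 0.02, 0.01]},
                        index=idx)

# Group by the year of each row label without writing a helper function
by_year_toy = toy_rets.groupby(toy_rets.index.year)

def corr_a_b(group):
    # Correlation between the two columns within one year
    return group["A"].corr(group["B"])

yearly_corr = by_year_toy.apply(corr_a_b)
print(yearly_corr)
```

In this toy data `B` moves exactly with `A` in 2020 and exactly against it in 2021, so the two correlations come out as 1 and -1.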

<h2>Example: Group-Wise Linear Regression</h2>

In the same vein as the previous example, you can use `groupby` to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. For example, I can define the following `regress` function (using the `statsmodels` econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:

In [90]:
import statsmodels.api as sm

def regress(data, yvar=None, xvars=None):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so adding a column does not touch the group
    X["intercept"] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

You can install `statsmodels` with conda if you don't have it already:

conda install statsmodels

Now, to run a yearly linear regression of `AAPL` on `SPX` returns, execute:

In [91]:
by_year.apply(regress, yvar="AAPL", xvars=["SPX"])

Unnamed: 0,SPX,intercept
2003,1.195406,0.00071
2004,1.363463,0.004201
2005,1.766415,0.003246
2006,1.645496,8e-05
2007,1.198761,0.003438
2008,0.968016,-0.00111
2009,0.879103,0.002954
2010,1.052608,0.001261
2011,0.806605,0.001514
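
If `statsmodels` is not available, a similar group-wise fit can be sketched with NumPy's least-squares solver. This swaps in `np.linalg.lstsq` for `sm.OLS`, so it returns only the coefficients and none of the regression diagnostics. The `ols_params` helper and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd

def ols_params(group, yvar, xvars):
    # Design matrix: the regressors plus a column of ones for the intercept
    X = np.column_stack([group[xvars].to_numpy(dtype=float),
                         np.ones(len(group))])
    y = group[yvar].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=list(xvars) + ["intercept"])

# Made-up data: y = 2x + 1 in group "a", y = -x + 3 in group "b"
toy = pd.DataFrame({
    "g": ["a"] * 4 + ["b"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0] * 2,
    "y": [1.0, 3.0, 5.0, 7.0, 3.0, 2.0, 1.0, 0.0],
})

# Select the data columns so the grouping column is not passed to the function
params = toy.groupby("g")[["x", "y"]].apply(ols_params, yvar="y", xvars=["x"])
print(params)
```

Because each call returns a Series with the same labels, the combined result is a DataFrame with one row per group and one column per coefficient.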


<h1>Group Transforms and "Unwrapped" GroupBys</h1>

In Apply: General split-apply-combine, we looked at the `apply` method in grouped operations for performing transformations. There is another built-in method called `transform`, which is similar to `apply` but imposes more constraints on the kind of function you can use:

<ul>
    <li>It can produce a scalar value to be broadcast to the shape of the group.</li>
    <li>It can produce an object of the same shape as the input group.</li>
    <li>It must not mutate its input.</li>
</ul>

Let's consider a simple example for illustration:

In [92]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

Unnamed: 0,key,value
0,a,0.0
1,b,1.0
2,c,2.0
3,a,3.0
4,b,4.0
5,c,5.0
6,a,6.0
7,b,7.0
8,c,8.0
9,a,9.0
10,b,10.0
11,c,11.0


Here are the group means by key:

In [93]:
g = df.groupby('key')['value']
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as `df['value']` but with values replaced by the average grouped by `'key'`. We can pass a function that computes the mean of a single group to `transform`:

In [94]:
def get_mean(group):
    return group.mean()
g.transform(get_mean)

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

For built-in aggregation functions, we can pass a string alias as with the GroupBy `agg` method:

In [95]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

Like `apply`, `transform` works with functions that return Series, but the result must be the same size as the input. For example, we can multiply each group by 2 using a helper function:

In [96]:
def times_two(group):
    return group * 2
g.transform(times_two)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

As a more complicated example, we can compute the ranks in descending order for each group:

In [97]:
def get_ranks(group):
    return group.rank(ascending=False)
g.transform(get_ranks)

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

Consider a group transformation function composed from simple aggregations:

In [98]:
def normalize(x):
    return (x - x.mean()) / x.std()

We can obtain equivalent results in this case using either `transform` or `apply`:

In [99]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [100]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

Built-in aggregate functions like `'mean'` or `'sum'` are often much faster than a general `apply` function. These also have a "fast path" when used with `transform`. This allows us to perform what is called an unwrapped group operation:

In [101]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [102]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

Here, we are doing arithmetic between the outputs of multiple GroupBy operations instead of writing a function and passing it to `groupby(...).apply`. That is what is meant by "unwrapped."

While an unwrapped group operation may involve multiple group aggregations, the overall benefit of vectorized operations often outweighs this.
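
As a quick check that the two approaches agree, the sketch below rebuilds the example frame and compares the function-based result with the unwrapped one. It is self-contained, so it repeats the definitions of `df`, `g`, and `normalize` from above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c"] * 4,
                   "value": np.arange(12.)})
g = df.groupby("key")["value"]

def normalize(x):
    return (x - x.mean()) / x.std()

# Function-based version: normalize is called once per group
wrapped = g.transform(normalize)

# Unwrapped version: two built-in group aggregations plus plain arithmetic
unwrapped = (df["value"] - g.transform("mean")) / g.transform("std")

print(np.allclose(wrapped, unwrapped))
```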

<h1>Pivot Tables and Cross-Tabulation</h1>

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the `groupby` facility described in this chapter, combined with reshape operations utilizing hierarchical indexing. DataFrame also has a `pivot_table` method, and there is also a top-level `pandas.pivot_table` function. In addition to providing a convenience interface to `groupby`, `pivot_table` can add partial totals, also known as margins.

Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default `pivot_table` aggregation type) arranged by `day` and `smoker` on the rows:

In [103]:
tips.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


In [104]:
tips.pivot_table(index=["day", "smoker"])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,tip_pct,total_bill
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fri,No,2.25,2.8125,0.15165,18.42
Fri,Yes,2.066667,2.714,0.174783,16.813333
Sat,No,2.555556,3.102889,0.158048,19.661778
Sat,Yes,2.47619,2.875476,0.147906,21.276667
Sun,No,2.929825,3.167895,0.160113,20.506667
Sun,Yes,2.578947,3.516842,0.18725,24.12
Thur,No,2.488889,2.673778,0.160298,17.113111
Thur,Yes,2.352941,3.03,0.163863,19.190588


This could have been produced with `groupby` directly, using `tips.groupby(["day", "smoker"]).mean(numeric_only=True)` (recent pandas versions need `numeric_only=True` here because the `time` column is not numeric). Now, suppose we want to take the average of only `tip_pct` and `size`, and additionally group by `time`. I’ll put `smoker` in the table columns and `time` and `day` in the rows:
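
The relationship between `pivot_table` and a plain `groupby` mean can be checked on a small made-up frame. In this sketch, `toy` is a hypothetical stand-in for `tips`, and the value columns are selected explicitly so that no non-numeric column is averaged.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Fri"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "total_bill": [10.5, 20.25, 30.0, 15.75, 12.5, 18.0],
    "tip": [1.5, 3.25, 4.0, 2.75, 2.0, 2.5],
})

# Default pivot_table aggregation is the mean
pivoted = toy.pivot_table(index=["day", "smoker"],
                          values=["tip", "total_bill"])

# The same group means computed with groupby
grouped_means = toy.groupby(["day", "smoker"])[["tip", "total_bill"]].mean()

# Compare values after aligning the column order
same = np.allclose(pivoted.to_numpy(),
                   grouped_means[pivoted.columns].to_numpy())
print(same)
```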

In [105]:
tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["tip_pct", "size"])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,tip_pct,tip_pct
Unnamed: 0_level_1,smoker,No,Yes,No,Yes
time,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Dinner,Fri,2.0,2.222222,0.139622,0.165347
Dinner,Sat,2.555556,2.47619,0.158048,0.147906
Dinner,Sun,2.929825,2.578947,0.160113,0.18725
Dinner,Thur,2.0,,0.159744,
Lunch,Fri,3.0,1.833333,0.187735,0.188937
Lunch,Thur,2.5,2.352941,0.160311,0.163863


We could augment this table to include partial totals by passing `margins=True`. This has the effect of adding `All` row and column labels, with corresponding values being the group statistics for all the data within a single tier:

In [106]:
tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["tip_pct", "size"], margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size,tip_pct,tip_pct,tip_pct
Unnamed: 0_level_1,smoker,No,Yes,All,No,Yes,All
time,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Dinner,Fri,2.0,2.222222,2.166667,0.139622,0.165347,0.158916
Dinner,Sat,2.555556,2.47619,2.517241,0.158048,0.147906,0.153152
Dinner,Sun,2.929825,2.578947,2.842105,0.160113,0.18725,0.166897
Dinner,Thur,2.0,,2.0,0.159744,,0.159744
Lunch,Fri,3.0,1.833333,2.0,0.187735,0.188937,0.188765
Lunch,Thur,2.5,2.352941,2.459016,0.160311,0.163863,0.161301
All,,2.668874,2.408602,2.569672,0.159328,0.163196,0.160803


Here, the `All` values are means without taking into account smoker versus non-smoker (the `All` columns) or either of the two levels of grouping on the rows (the `All` row).

To use an aggregation function other than `mean`, pass it to the `aggfunc` keyword argument. For example, `"count"` or `len` will give you a cross-tabulation (count or frequency) of group sizes (though `"count"` will exclude null values from the count within data groups, while `len` will not):

In [107]:
tips.pivot_table(index=["time", "smoker"], columns="day",
                 values="tip_pct", aggfunc=len, margins=True)

Unnamed: 0_level_0,day,Fri,Sat,Sun,Thur,All
time,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dinner,No,3.0,45.0,57.0,1.0,106
Dinner,Yes,9.0,42.0,19.0,,70
Lunch,No,1.0,,,44.0,45
Lunch,Yes,6.0,,,17.0,23
All,,19.0,87.0,76.0,62.0,244
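
The difference between `"count"` and `len` only shows up when the values contain nulls, which `tip_pct` does not. Here is a minimal sketch on made-up data with one missing value; the `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Group "a" has three rows, one of which has a missing value
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b"],
    "value": [1.0, np.nan, 3.0, 4.0],
})

# "count" tallies only the non-null values in each group
by_count = toy.pivot_table(index="group", values="value", aggfunc="count")

# len tallies every row in each group, null or not
by_len = toy.pivot_table(index="group", values="value", aggfunc=len)

print(by_count)
print(by_len)
```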


If some combinations are empty (or otherwise NA), you may wish to pass a `fill_value`:

In [108]:
tips.pivot_table(index=["time", "size", "smoker"], columns="day",
                 values="tip_pct", fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,day,Fri,Sat,Sun,Thur
time,size,smoker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dinner,1,No,0.0,0.137931,0.0,0.0
Dinner,1,Yes,0.0,0.325733,0.0,0.0
Dinner,2,No,0.139622,0.162705,0.168859,0.159744
Dinner,2,Yes,0.171297,0.148668,0.207893,0.0
Dinner,3,No,0.0,0.154661,0.152663,0.0
Dinner,3,Yes,0.0,0.144995,0.15266,0.0
Dinner,4,No,0.0,0.150096,0.148143,0.0
Dinner,4,Yes,0.11775,0.124515,0.19337,0.0
Dinner,5,No,0.0,0.0,0.206928,0.0
Dinner,5,Yes,0.0,0.106572,0.06566,0.0


<b>Note:</b> See `pivot_table` options in this <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html">link</a>.

<h2>Cross-Tabulations: Crosstab</h2>

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is an example:

In [109]:
from io import StringIO
data = """Sample  Nationality  Handedness
1   USA  Right-handed
2   Japan    Left-handed
3   USA  Right-handed
4   Japan    Right-handed
5   Japan    Left-handed
6   Japan    Right-handed
7   USA  Right-handed
8   USA  Left-handed
9   Japan    Right-handed
10  USA  Right-handed"""
data = pd.read_table(StringIO(data), sep=r"\s+")

In [110]:
data

Unnamed: 0,Sample,Nationality,Handedness
0,1,USA,Right-handed
1,2,Japan,Left-handed
2,3,USA,Right-handed
3,4,Japan,Right-handed
4,5,Japan,Left-handed
5,6,Japan,Right-handed
6,7,USA,Right-handed
7,8,USA,Left-handed
8,9,Japan,Right-handed
9,10,USA,Right-handed


As part of some survey analysis, we might want to summarize this data by nationality and handedness. You could use `pivot_table` to do this, but the `pandas.crosstab` function can be more convenient:

In [111]:
pd.crosstab(data["Nationality"], data["Handedness"], margins=True)

Handedness,Left-handed,Right-handed,All
Nationality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Japan,2,3,5
USA,1,4,5
All,3,7,10
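
Since a crosstab is a table of group frequencies, the same counts can be sketched with `groupby`, `size`, and `unstack`. The example below rebuilds the survey data as a frame named `survey` (a hypothetical name) and also shows the `normalize` option of `pandas.crosstab`.

```python
import pandas as pd

survey = pd.DataFrame({
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan",
                    "Japan", "USA", "USA", "Japan", "USA"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed", "Left-handed", "Right-handed",
                   "Right-handed"],
})

counts = pd.crosstab(survey["Nationality"], survey["Handedness"])

# Same frequencies via groupby: count rows per key pair, then move the
# inner key into the columns
via_groupby = (survey.groupby(["Nationality", "Handedness"])
                     .size()
                     .unstack(fill_value=0))

# normalize="index" turns each row into proportions that sum to 1
row_share = pd.crosstab(survey["Nationality"], survey["Handedness"],
                        normalize="index")

print(counts)
print(row_share)
```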


The first two arguments to `crosstab` can each be an array, a Series, or a list of arrays, as in the tips data:

In [112]:
pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True)

Unnamed: 0_level_0,smoker,No,Yes,All
time,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dinner,Fri,3,9,12
Dinner,Sat,45,42,87
Dinner,Sun,57,19,76
Dinner,Thur,1,0,1
Lunch,Fri,1,6,7
Lunch,Thur,44,17,61
All,,151,93,244
