# Aggregation and Grouping
An essential piece of analysis of large data is efficient summarization: computing
aggregations like sum(), mean(), median(), min(), and max(), in which a single number
gives insight into the nature of a potentially large dataset. In this section, we’ll
explore aggregations in Pandas, from simple operations akin to what we’ve seen on
NumPy arrays, to more sophisticated operations based on the concept of a groupby.
## Planets Data

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd

In [2]:
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [3]:
planets.head()

            method  number  orbital_period  mass  distance  year
0  Radial Velocity       1           269.3   7.1      77.4  2006
1  Radial Velocity       1         874.774  2.21     56.95  2008
2  Radial Velocity       1           763.0   2.6     19.84  2011
3  Radial Velocity       1          326.03  19.4    110.62  2007
4  Radial Velocity       1          516.22  10.5    119.47  2009


In [4]:
# simple aggregation in Pandas
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [5]:
ser.sum()

2.811925491708157

In [6]:
ser.mean()

0.5623850983416314

In [7]:
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
df

          A         B
0  0.155995  0.020584
1  0.058084   0.96991
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825


In [8]:
df.mean()

A    0.477888
B    0.443420
dtype: float64

In [9]:
df.sum()

A    2.389442
B    2.217101
dtype: float64

In [10]:
planets.dropna().describe()

        number  orbital_period      mass   distance        year
count    498.0           498.0     498.0      498.0       498.0
mean   1.73494      835.778671   2.50932  52.068213  2007.37751
std    1.17572     1469.128259  3.636274  46.596041    4.167284
min        1.0          1.3283    0.0036       1.35      1989.0
25%        1.0        38.27225    0.2125    24.4975      2005.0
50%        1.0           357.0     1.245      39.94      2009.0
75%        2.0           999.6    2.8675    59.3325      2011.0
max        6.0         17337.5      25.0      354.0      2014.0


In [11]:
planets.count()

method            1035
number            1035
orbital_period     992
mass               513
distance           808
year              1035
dtype: int64

The aggregations used above (sum(), mean(), describe(), and count()) are all methods of DataFrame and Series objects.
To go deeper into the data, however, simple aggregates are often not enough. The
next level of data summarization is the groupby operation, which allows you to
quickly and efficiently compute aggregates on subsets of data.
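
As a small illustration of the simple aggregates above, the sketch below uses a hypothetical three-row table (the name `toy` and its values are assumptions, not part of the planets data). It shows column-wise and row-wise means, and how count() reports only non-null entries, which is why the planets counts differ between columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, small enough to check the results by hand.
toy = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, np.nan, 6.0]})

col_means = toy.mean()                # one value per column; NaN is skipped
row_means = toy.mean(axis='columns')  # one value per row
non_null = toy.count()                # number of non-null entries per column

print(col_means)
print(row_means)
print(non_null)
```

By default these aggregates run down each column; passing `axis='columns'` aggregates across each row instead.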
## GroupBy: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer
to aggregate conditionally on some label or index: this is implemented in the
so-called groupby operation. The name “group by” comes from a command in the SQL
database language, but it is perhaps more illuminating to think of it in the terms first
coined by Hadley Wickham of Rstats fame: split, apply, combine.
### Split, apply, combine
In a canonical split-apply-combine operation, where the “apply” is a
summation aggregation, the three steps are:  
__• The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.  
• The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.  
• The combine step merges the results of these operations into an output array.__

While we could certainly do this manually using some combination of the masking,
aggregation, and merging commands covered earlier, it’s important to realize that the
intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy can
(often) do this in a single pass over the data, updating the sum, mean, count, min, or
other aggregate for each group along the way. The power of the GroupBy is that it
abstracts away these steps: the user need not think about how the computation is
done under the hood, but rather thinks about the operation as a whole.
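
To make the comparison concrete, here is a minimal sketch of the manual route next to the GroupBy route, assuming a small key/data table (the names `df_demo`, `manual`, and `grouped` are illustrative only). The manual version builds one mask per key, sums each selection, and collects the results.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Split: one boolean mask per distinct key.
# Apply: sum the 'data' values selected by each mask.
# Combine: collect the per-key sums into a single Series.
manual = pd.Series({key: df_demo.loc[df_demo['key'] == key, 'data'].sum()
                    for key in df_demo['key'].unique()})

# The same result from GroupBy, without building the splits by hand.
grouped = df_demo.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both approaches give the same per-key sums. The manual version scans the data once per key, whereas the GroupBy version leaves the strategy to Pandas.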

In [15]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5


In [16]:
df.groupby('key') # a groupby object is created

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9e6c0b6a30>

*Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy
object. This object is where the magic is: you can think of it as a special view of the
DataFrame, which is poised to dig into the groups but does no actual computation
until the aggregation is applied. This “lazy evaluation” approach means that common
aggregates can be implemented very efficiently in a way that is almost transparent to
the user.*

In [17]:
df.groupby('key').sum()

     data
key
A       3
B       5
C       7
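
sum() is only one possibility here. As a brief sketch, assuming the same key/data layout (rebuilt under the illustrative name `df_demo` so the example stands alone), a single GroupBy object can be reused for several common aggregates:

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

grouped = df_demo.groupby('key')  # no aggregation has been computed yet

# Each call below runs its own aggregation over the same grouping.
totals = grouped.sum()
means = grouped.mean()
maxima = grouped.max()

print(totals)
print(means)
print(maxima)
```

Each result is a DataFrame indexed by the group keys, just like the sum() output above.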


Let’s introduce some of the other functionality
that can be used with the basic GroupBy operation.  
**Column indexing.** The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object.

In [21]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9e6e0c71f0>

In [22]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f9e6c56f9a0>

Here we’ve selected a particular Series group from the original DataFrame group by
reference to its column name. As with the GroupBy object, no computation is done
until we call some aggregate on the object:

In [23]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
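
The sketch below illustrates column indexing on a hypothetical stand-in for the planets table (the name `toy` and its values are made up for illustration). Selecting the column before aggregating gives the same numbers as aggregating every column and selecting afterwards, but restricts the computation to the one column of interest.

```python
import pandas as pd

# Hypothetical stand-in for the planets table; values are illustrative only.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Imaging', 'Imaging'],
    'orbital_period': [3.0, 5.0, 1000.0, 3000.0, 5000.0],
    'distance': [100.0, 200.0, 10.0, 20.0, 30.0],
})

# Select the column first, then aggregate: only that column is processed.
by_column = toy.groupby('method')['orbital_period'].median()

# Aggregate every remaining column, then select: same numbers, more work.
by_frame = toy.groupby('method').median()['orbital_period']

print(by_column)
print(by_column.equals(by_frame))
```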

**Iteration over groups**. The GroupBy object supports direct iteration over the groups,
returning each group as a Series or DataFrame:

In [25]:
for (method,group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
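
As a further sketch of iteration, again on a hypothetical toy table rather than the planets data, the (label, group) pairs can be collected into a dictionary. The GroupBy method `get_group` retrieves a single group by its label without looping.

```python
import pandas as pd

# Hypothetical toy table; names and values are illustrative only.
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Imaging', 'Imaging', 'Imaging'],
    'orbital_period': [3.0, 5.0, 1000.0, 3000.0, 5000.0],
})

# Iterating yields (group label, sub-DataFrame) pairs, sorted by label.
shapes = {method: group.shape for method, group in toy.groupby('method')}

# get_group pulls out one group by its label.
imaging = toy.groupby('method').get_group('Imaging')

print(shapes)
print(imaging)
```

Explicit iteration is handy for inspection, though the built-in aggregates are usually faster than a Python-level loop over groups.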
