### Aggregation and Grouping:
    An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min() and max() in which a single number gives insight into the nature of a potentially large dataset.
    In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby.

In [56]:
import seaborn as sns
import numpy as np
import pandas as pd
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [57]:
planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


### Simple Aggregation in Pandas
    Earlier we explored some of the data aggregations available for NumPy arrays.
    As with a one-dimensional NumPy array, for a Pandas Series the aggregates return a single value.

In [58]:
rng = np.random.RandomState(42)
series = round(pd.Series(rng.rand(10)), 3)
series

0    0.375
1    0.951
2    0.732
3    0.599
4    0.156
5    0.156
6    0.058
7    0.866
8    0.601
9    0.708
dtype: float64

In [59]:
# Sum of the series
print("Sum of the series: ", series.sum())
# Mean of the series
print("Mean of the series: ", series.mean())
# Maximum of the series
print("Maximum of the series: ", series.max())
# Minimum of the series
print("Minimum of the series: ", series.min())
# Product of all items in the series
print("Product of all items in the series: ", series.prod())
# Mean absolute deviation of the series
# (Series.mad() was removed in pandas 2.0, so it is computed directly here)
print("Mean absolute deviation of the series: ", (series - series.mean()).abs().mean())
# Standard deviation of the series
print("Standard deviation of the series: ", series.std())
# Count of the series
print("Count of the series: ", series.count())
# Median of the series
print("Median of the series: ", series.median())

Sum of the series:  5.202
Mean of the series:  0.5202
Maximum of the series:  0.951
Minimum of the series:  0.058
Product of all items in the series:  8.133032356339817e-05
Mean absolute deviation of the series:  0.26715999999999995
Standard deviation of the series:  0.3158810605978846
Count of the series:  10
Median of the series:  0.6


In [60]:
# For a DataFrame, the aggregates by default return one result per column
df = round(pd.DataFrame({'A' : rng.rand(5), 'B' : rng.rand(5)}), 3)
df

       A      B
0  0.021  0.183
1  0.970  0.304
2  0.832  0.525
3  0.212  0.432
4  0.182  0.291


In [61]:
# Sum of the DataFrame
print("Aggregation by column-wise:\n")
print("Sum of the DataFrame:")
print(df.sum())
# Mean of the DataFrame
print("Mean of the DataFrame:")
print(df.mean())
# Maximum of the DataFrame
print("Maximum of the DataFrame:")
print(df.max())
# Minimum of the DataFrame
print("Minimum of the DataFrame:")
print(df.min())
# Product of all items in each column of the DataFrame
print("Product of all items in the DataFrame:")
print(df.prod())
# Mean absolute deviation of the DataFrame
# (DataFrame.mad() was removed in pandas 2.0, so it is computed directly here)
print("Mean absolute deviation of the DataFrame:")
print((df - df.mean()).abs().mean())
# Standard deviation of the DataFrame
print("Standard deviation of the DataFrame:")
print(df.std())
# Count of the DataFrame
print("Count of the DataFrame:")
print(df.count())
# Median of the DataFrame
print("Median of the DataFrame:")
print(df.median())

Aggregation by column-wise:

Sum of the DataFrame:
A    2.217
B    1.735
dtype: float64
Mean of the DataFrame:
A    0.4434
B    0.3470
dtype: float64
Maximum of the DataFrame:
A    0.970
B    0.525
dtype: float64
Minimum of the DataFrame:
A    0.021
B    0.183
dtype: float64
Product of all items in the DataFrame:
A    0.000654
B    0.003672
dtype: float64
Mean absolute deviation of the DataFrame:
A    0.36608
B    0.10520
dtype: float64
Standard deviation of the DataFrame:
A    0.426795
B    0.133032
dtype: float64
Count of the DataFrame:
A    5
B    5
dtype: int64
Median of the DataFrame:
A    0.212
B    0.304
dtype: float64


In [62]:
# Sum of the DataFrame
print("Aggregation by row-wise:\n")
print("Sum of the DataFrame:")
print(df.sum(axis = 'columns'))
# Mean of the DataFrame
print("Mean of the DataFrame:")
print(df.mean(axis = 'columns'))
# Maximum of the DataFrame
print("Maximum of the DataFrame:")
print(df.max(axis = 'columns'))
# Minimum of the DataFrame
print("Minimum of the DataFrame:")
print(df.min(axis = 'columns'))
# Product of all items in each row of the DataFrame
print("Product of all items in the DataFrame:")
print(df.prod(axis = 'columns'))
# Mean absolute deviation of the DataFrame
# (DataFrame.mad() was removed in pandas 2.0, so it is computed directly here)
print("Mean absolute deviation of the DataFrame:")
print(df.sub(df.mean(axis = 'columns'), axis = 'index').abs().mean(axis = 'columns'))
# Standard deviation of the DataFrame
print("Standard deviation of the DataFrame:")
print(df.std(axis = 'columns'))
# Count of the DataFrame
print("Count of the DataFrame:")
print(df.count(axis = 'columns'))
# Median of the DataFrame
print("Median of the DataFrame:")
print(df.median(axis = 'columns'))

Aggregation by row-wise:

Sum of the DataFrame:
0    0.204
1    1.274
2    1.357
3    0.644
4    0.473
dtype: float64
Mean of the DataFrame:
0    0.1020
1    0.6370
2    0.6785
3    0.3220
4    0.2365
dtype: float64
Maximum of the DataFrame:
0    0.183
1    0.970
2    0.832
3    0.432
4    0.291
dtype: float64
Minimum of the DataFrame:
0    0.021
1    0.304
2    0.525
3    0.212
4    0.182
dtype: float64
Product of all items in the DataFrame:
0    0.003843
1    0.294880
2    0.436800
3    0.091584
4    0.052962
dtype: float64
Mean absolute deviation of the DataFrame:
0    0.0810
1    0.3330
2    0.1535
3    0.1100
4    0.0545
dtype: float64
Standard deviation of the DataFrame:
0    0.114551
1    0.470933
2    0.217082
3    0.155563
4    0.077075
dtype: float64
Count of the DataFrame:
0    2
1    2
2    2
3    2
4    2
dtype: int64
Median of the DataFrame:
0    0.1020
1    0.6370
2    0.6785
3    0.3220
4    0.2365
dtype: float64


There is a convenience method describe() that computes several common aggregates for each column and returns the result.

In [63]:
planets.dropna().describe()

        number  orbital_period      mass   distance        year
count    498.0           498.0     498.0      498.0       498.0
mean   1.73494      835.778671   2.50932  52.068213  2007.37751
std    1.17572     1469.128259  3.636274  46.596041    4.167284
min        1.0          1.3283    0.0036       1.35      1989.0
25%        1.0        38.27225    0.2125    24.4975      2005.0
50%        1.0           357.0     1.245      39.94      2009.0
75%        2.0           999.6    2.8675    59.3325      2011.0
max        6.0         17337.5      25.0      354.0      2014.0
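
To make clear what `describe()` bundles together, the following sketch reproduces its rows by hand on a small random frame. The frame `demo` and its column names are made up for illustration; only `describe()`, `agg()` and `quantile()` are relied on.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to compare the two approaches
rng_demo = np.random.RandomState(0)
demo = pd.DataFrame({'x': rng_demo.rand(20), 'y': rng_demo.rand(20)})

summary = demo.describe()

# The same numbers, computed one aggregate at a time
by_hand = demo.agg(['count', 'mean', 'std', 'min', 'max'])
quartiles = demo.quantile([0.25, 0.5, 0.75])

print(summary)
```

For numeric columns, the `25%`, `50%` and `75%` rows are the quartiles, and the remaining rows are the ordinary aggregates seen earlier.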


### GroupBy: Split, Apply, Combine
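    Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index; this is implemented in the so-called groupby operation.
    The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.
    The split step breaks a DataFrame into groups according to the value of a key, the apply step computes some function within each group, and the combine step merges the per-group results into a single output.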

In [64]:
df = pd.DataFrame({'Key': ['A', 'B', 'C', 'A', 'B', 'C'], 'Data' : range(6)}, columns = ['Key', 'Data'])
df

  Key  Data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5


In [65]:
"""
We can compute the most basic split-apply-combine operation with the groupby() method of DataFrames,
passing the name of the desired key column:
"""
df.groupby('Key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5618eceba8>

In [66]:
"""
To produce a result, we can apply an aggregate to this DataFrameGroupBy object,
which will perform the appropriate apply/combine steps to produce the desired
result:
"""
df.groupby('Key').sum()

     Data
Key
A       3
B       5
C       7
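
The three steps can also be spelled out by hand, which may help show what `groupby()` does in one call. This is only a sketch for illustration (the names `df_sac`, `pieces`, `sums` and `manual` are not from the original); in practice `groupby()` is the idiomatic and much faster route.

```python
import pandas as pd

df_sac = pd.DataFrame({'Key': ['A', 'B', 'C', 'A', 'B', 'C'], 'Data': range(6)})

# Split: one sub-DataFrame per distinct key
pieces = {key: df_sac[df_sac['Key'] == key] for key in df_sac['Key'].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece['Data'].sum() for key, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name='Data').rename_axis('Key')

print(manual)
```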


### The GroupBy object
    The GroupBy object is a very flexible abstraction.
    In many ways, you can simply treat it as if it’s a collection of DataFrames, and it does the difficult things under the hood.
    Let’s see some examples using the Planets data.
    Perhaps the most important operations made available by a GroupBy are aggregate, filter, transform, and apply.
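
One way to see the "collection of DataFrames" view is to iterate over a GroupBy object, which yields each group name together with its sub-DataFrame. The sketch below uses a small made-up frame (`df_iter`) rather than the Planets data so that it runs without a download.

```python
import pandas as pd

df_iter = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(6)})

shapes = {}
for name, group in df_iter.groupby('key'):
    # name is the group label, group is an ordinary DataFrame
    shapes[name] = group.shape
    print(name, group.shape)
```

Explicit iteration like this is handy for inspection, although the built-in aggregate, filter, transform and apply operations are usually faster.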

### Column indexing:
    The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. As with the GroupBy object itself, no computation is done until some aggregate is called on it.

In [67]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
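
As a small illustration of what column indexing returns, the sketch below uses a made-up frame (`df_col`) and checks that selecting a column from the GroupBy gives the same medians as selecting the column first and then grouping it by the key.

```python
import pandas as pd

df_col = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data1': range(6),
                       'data2': [5, 0, 3, 3, 7, 9]})

grouped = df_col.groupby('key')   # DataFrameGroupBy
selected = grouped['data1']       # SeriesGroupBy, nothing computed yet
medians = selected.median()       # the computation happens here

# Equivalent route: select the column, then group it by the key column
alt = df_col['data1'].groupby(df_col['key']).median()

print(type(selected).__name__)
print(medians)
```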

### Dispatch methods:
    Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects.
    For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data.

In [68]:
planets.groupby('method')['year'].describe()

                               count         mean       std     min      25%     50%      75%     max
method
Astrometry                       2.0       2011.5   2.12132  2010.0  2010.75  2011.5  2012.25  2013.0
Eclipse Timing Variations        9.0       2010.0  1.414214  2008.0   2009.0  2010.0   2011.0  2012.0
Imaging                         38.0  2009.131579  2.781901  2004.0   2008.0  2009.0   2011.0  2013.0
Microlensing                    23.0  2009.782609  2.859697  2004.0   2008.0  2010.0   2012.0  2013.0
Orbital Brightness Modulation    3.0  2011.666667  1.154701  2011.0   2011.0  2011.0   2012.0  2013.0
Pulsar Timing                    5.0       1998.4   8.38451  1992.0   1992.0  1994.0   2003.0  2011.0
Pulsation Timing Variations      1.0       2007.0       NaN  2007.0   2007.0  2007.0   2007.0  2007.0
Radial Velocity                553.0  2007.518987  4.249052  1989.0   2005.0  2009.0   2011.0  2014.0
Transit                        397.0  2011.236776  2.077867  2002.0   2010.0  2012.0   2013.0  2014.0
Transit Timing Variations        4.0       2012.5  1.290994  2011.0  2011.75  2012.5  2013.25  2014.0


### Aggregate, filter, transform, apply
    In particular, GroupBy objects have aggregate(), filter(), transform() and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.

In [69]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key' : ['A', 'B', 'C', 'A', 'B', 'C'], 'data1' : range(6), 'data2' : rng.randint(0, 10, 6)})
df

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


### Aggregation:
    We’re now familiar with GroupBy aggregations with sum(), median() and the like, but the aggregate() method allows for even more flexibility.
    It can take a string, a function, or a list thereof, and compute all the aggregates at once.

In [70]:
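def mad(x): return (x - x.mean()).abs().mean()  # the built-in 'mad' aggregate was removed in pandas 2.0
df.groupby('key').aggregate(['std', 'prod', 'count', 'mean', mad, 'min', 'max'])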

       data1                                       data2
         std  prod  count  mean  mad  min  max       std  prod  count  mean  mad  min  max
key
A    2.12132     0      2   1.5  1.5    0    3  1.414214    15      2   4.0  1.0    3    5
B    2.12132     4      2   2.5  1.5    1    4  4.949747     0      2   3.5  3.5    0    7
C    2.12132    10      2   3.5  1.5    2    5  4.242641    27      2   6.0  3.0    3    9
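
Besides a list, `aggregate()` also accepts a dictionary mapping column names to the operation to apply to each column. The sketch below rebuilds the same six-row frame from the values shown above (under the illustrative name `df_agg`) and takes the minimum of `data1` and the maximum of `data2` per group.

```python
import pandas as pd

# Same values as the DataFrame printed above
df_agg = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data1': range(6),
                       'data2': [5, 0, 3, 3, 7, 9]})

# One aggregate per column, chosen through a dictionary
result = df_agg.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

print(result)
```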


### Filtering:
    A filtering operation allows you to drop data based on the group properties.
    For example, we might want to keep all groups in which the standard deviation is larger than some critical value.

In [71]:
def filter_func(x):
    return x['data2'].std() > 4
print(df.groupby('key').std(), '\n')
print(df.groupby('key').filter(filter_func))

       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641 

  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
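
The filter function returns one Boolean per group, and every row of a failing group is dropped. As a cross-check, and only as a sketch, the same rows can be selected with a Boolean mask built from `transform('std')`, which broadcasts each group's standard deviation back onto its rows. The name `df_f` is illustrative; the values match the frame above.

```python
import pandas as pd

df_f = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})

# Group-level test: keep groups whose data2 standard deviation exceeds 4
filtered = df_f.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Row-level mask carrying the same information
keep = df_f.groupby('key')['data2'].transform('std') > 4
masked = df_f[keep]

print(filtered)
```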


### Transformation:
    While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
    For such a transformation, the output is the same shape as the input.
    A common example is to center the data by subtracting the group-wise mean.

In [72]:
df.groupby('key').transform(lambda x : x - x.mean())

   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
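
A quick sanity check on the centering, sketched on a rebuilt copy of the same frame (called `df_t` here for illustration): the output keeps the input's shape, and the mean of every group is zero afterwards.

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})
cols = ['data1', 'data2']

# Subtract each group's mean from its own rows
centered = df_t.groupby('key')[cols].transform(lambda x: x - x.mean())

# Equivalent form using the built-in 'mean' transform
centered_alt = df_t[cols] - df_t.groupby('key')[cols].transform('mean')

# After centering, every group mean should be (numerically) zero
group_means = centered.groupby(df_t['key']).mean()
print(group_means)
```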


### The apply() method:
    The apply() method lets you apply an arbitrary function to the group results.
    The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.

In [73]:
def norm_by_data2(x):
    # x is DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
print(df.groupby('key').apply(norm_by_data2))


  key     data1  data2
0   A  0.000000      5
1   B  0.142857      0
2   C  0.166667      3
3   A  0.375000      3
4   B  0.571429      7
5   C  0.416667      9
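
When the applied function returns a scalar instead of a DataFrame, the combine step produces a Series indexed by the group keys. The sketch below illustrates this with a per-group ratio; the frame `df_a` repeats the values above, and selecting the two data columns explicitly before `apply()` is a choice made here so that the function only ever sees those columns.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})

def ratio_of_sums(g):
    # g is the DataFrame of one group; the return value is a scalar
    return g['data1'].sum() / g['data2'].sum()

ratio = df_a.groupby('key')[['data1', 'data2']].apply(ratio_of_sums)

print(ratio)
```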


### Specifying the split key:
    In the simple examples presented before, we split the DataFrame on a single column name.
    This is just one of many options by which the groups can be defined, and we’ll go through some other options for group specification here.
    The first option is a list, array, Series, or Index providing the grouping keys.
    The key can be any Series or list with a length matching that of the DataFrame.

In [74]:
l = [0, 1, 0, 1, 2, 0]
df.groupby(l).sum()

   data1  data2
0      7     17
1      4      3
2      4      7


In [75]:
# Grouping on the 'key' column, passed explicitly as a Series
df.groupby([df['key']]).sum()

     data1  data2
key
A        3      8
B        5      7
C        7     12


### A dictionary or series mapping index to group:
    Another method is to provide a dictionary that maps index values to the group keys.

In [76]:
df2 = df.set_index('key')
mapping = {'A' : 'Vowel', 'B' : 'Consonant', 'C' : 'Consonant'}
df2.groupby(mapping).sum()

           data1  data2
Consonant     12     19
Vowel          3      8
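
One way to picture the dictionary form: each index value is looked up in the mapping, and the resulting labels are used as the grouping keys. The sketch below performs that lookup explicitly with `Index.map` and compares the two results. The names `df2_demo`, `labels`, `by_dict` and `by_labels` are illustrative.

```python
import pandas as pd

df2_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                         'data1': range(6),
                         'data2': [5, 0, 3, 3, 7, 9]}).set_index('key')
mapping = {'A': 'Vowel', 'B': 'Consonant', 'C': 'Consonant'}

# Group directly by the dictionary
by_dict = df2_demo.groupby(mapping).sum()

# Same grouping, with the index-to-label lookup done by hand
labels = df2_demo.index.map(mapping)
by_labels = df2_demo.groupby(labels).sum()

print(by_dict)
```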


### Any Python function:
    Similar to mapping, you can pass any Python function that takes an index value as input and returns the group it belongs to.

In [77]:
df2.groupby(str.lower).mean()

   data1  data2
a    1.5    4.0
b    2.5    3.5
c    3.5    6.0


### A list of valid keys:
    Further, any of the preceding key choices can be combined to group on a multi-index.

In [78]:
df2.groupby([str.lower, mapping]).mean()

             data1  data2
a Vowel        1.5    4.0
b Consonant    2.5    3.5
c Consonant    3.5    6.0


### Grouping example:
    As an example of this, in a couple of lines of Python code we can put all these together and count discovered planets by method and by decade:

In [79]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade]).number.sum().unstack().fillna(0)

decade                         1980s  1990s  2000s  2010s
method
Astrometry                       0.0    0.0    0.0    2.0
Eclipse Timing Variations        0.0    0.0    5.0   10.0
Imaging                          0.0    0.0   29.0   21.0
Microlensing                     0.0    0.0   12.0   15.0
Orbital Brightness Modulation    0.0    0.0    0.0    5.0
Pulsar Timing                    0.0    9.0    1.0    1.0
Pulsation Timing Variations      0.0    0.0    1.0    0.0
Radial Velocity                  1.0   52.0  475.0  424.0
Transit                          0.0    0.0   64.0  712.0
Transit Timing Variations        0.0    0.0    0.0    9.0
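
To make the decade computation easier to follow, here is the same recipe on a tiny made-up table (the frame `toy`, its method labels and its numbers are invented for illustration and are not taken from the Planets data). Integer division by 10 drops the last digit of the year, and `unstack()` turns the inner index level into columns.

```python
import pandas as pd

# Hypothetical data, small enough to verify by eye
toy = pd.DataFrame({'method': ['RV', 'RV', 'RV', 'Transit', 'Transit', 'Transit'],
                    'number': [1, 2, 1, 1, 3, 1],
                    'year': [1989, 1995, 2001, 2009, 2010, 2014]})

# 1989 // 10 = 198, times 10 = 1980, then append 's' to get '1980s'
decade = (10 * (toy['year'] // 10)).astype(str) + 's'
decade.name = 'decade'

# Group by a column name and an external Series at the same time
table = toy.groupby(['method', decade])['number'].sum().unstack(fill_value=0)

print(table)
```

Passing `fill_value=0` to `unstack()` is an alternative to the `fillna(0)` used above; it keeps the counts as integers.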
