### Aggregation and Grouping:
    An essential part of analyzing large datasets is efficient summarization: computing aggregations such as sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
    In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays to more sophisticated operations based on the concept of a groupby.

In [13]:
import seaborn as sns
import numpy as np
import pandas as pd
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [14]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


### Simple Aggregation in Pandas
    Earlier we explored some of the data aggregations available for NumPy arrays.
    As with a one-dimensional NumPy array, for a Pandas Series the aggregates return a single value.

In [23]:
rng = np.random.RandomState(42)
series = pd.Series(rng.rand(10)).round(3)
series

0    0.375
1    0.951
2    0.732
3    0.599
4    0.156
5    0.156
6    0.058
7    0.866
8    0.601
9    0.708
dtype: float64

In [24]:
# Sum of the series
print("Sum of the series: ", series.sum())
# Mean of the series
print("Mean of the series: ", series.mean())
# Maximum of the series
print("Maximum of the series: ", series.max())
# Minimum of the series
print("Minimum of the series: ", series.min())
# Product of all items in the series
print("Product of all items in the series: ", series.prod())
# Mean absolute deviation of the series
# (Series.mad() was removed in pandas 2.0, so compute it directly)
print("Mean absolute deviation of the series: ", (series - series.mean()).abs().mean())
# Standard deviation of the series
print("Standard deviation of the series: ", series.std())
# Count of non-null items in the series
print("Count of the series: ", series.count())
# Median of the series
print("Median of the series: ", series.median())

Sum of the series:  5.202
Mean of the series:  0.5202
Maximum of the series:  0.951
Minimum of the series:  0.058
Product of all items in the series:  8.133032356339817e-05
Mean absolute deviation of the series:  0.26715999999999995
Standard deviation of the series:  0.3158810605978846
Count of the series:  10
Median of the series:  0.6


In [28]:
# For a DataFrame, the aggregates by default return one result per column
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)}).round(3)
df

,A,B
0,0.183,0.612
1,0.304,0.139
2,0.525,0.292
3,0.432,0.366
4,0.291,0.456


In [32]:
print("Column-wise aggregation:\n")
# Sum of each column
print("Sum of the DataFrame:")
print(df.sum())
# Mean of each column
print("Mean of the DataFrame:")
print(df.mean())
# Maximum of each column
print("Maximum of the DataFrame:")
print(df.max())
# Minimum of each column
print("Minimum of the DataFrame:")
print(df.min())
# Product of all items in each column
print("Product of all items in the DataFrame:")
print(df.prod())
# Mean absolute deviation of each column
# (DataFrame.mad() was removed in pandas 2.0, so compute it directly)
print("Mean absolute deviation of the DataFrame:")
print((df - df.mean()).abs().mean())
# Standard deviation of each column
print("Standard deviation of the DataFrame:")
print(df.std())
# Count of non-null items in each column
print("Count of the DataFrame:")
print(df.count())
# Median of each column
print("Median of the DataFrame:")
print(df.median())

Column-wise aggregation:

Sum of the DataFrame:
A    1.735
B    1.865
dtype: float64
Mean of the DataFrame:
A    0.347
B    0.373
dtype: float64
Maximum of the DataFrame:
A    0.525
B    0.612
dtype: float64
Minimum of the DataFrame:
A    0.183
B    0.139
dtype: float64
Product of all items in the DataFrame:
A    0.003672
B    0.004146
dtype: float64
Mean absolute deviation of the DataFrame:
A    0.1052
B    0.1288
dtype: float64
Standard deviation of the DataFrame:
A    0.133032
B    0.177042
dtype: float64
Count of the DataFrame:
A    5
B    5
dtype: int64
Median of the DataFrame:
A    0.304
B    0.366
dtype: float64


In [33]:
print("Row-wise aggregation:\n")
# Sum of each row
print("Sum of the DataFrame:")
print(df.sum(axis='columns'))
# Mean of each row
print("Mean of the DataFrame:")
print(df.mean(axis='columns'))
# Maximum of each row
print("Maximum of the DataFrame:")
print(df.max(axis='columns'))
# Minimum of each row
print("Minimum of the DataFrame:")
print(df.min(axis='columns'))
# Product of all items in each row
print("Product of all items in the DataFrame:")
print(df.prod(axis='columns'))
# Mean absolute deviation of each row
# (DataFrame.mad() was removed in pandas 2.0, so compute it directly)
print("Mean absolute deviation of the DataFrame:")
print(df.sub(df.mean(axis='columns'), axis='index').abs().mean(axis='columns'))
# Standard deviation of each row
print("Standard deviation of the DataFrame:")
print(df.std(axis='columns'))
# Count of non-null items in each row
print("Count of the DataFrame:")
print(df.count(axis='columns'))
# Median of each row
print("Median of the DataFrame:")
print(df.median(axis='columns'))

Row-wise aggregation:

Sum of the DataFrame:
0    0.795
1    0.443
2    0.817
3    0.798
4    0.747
dtype: float64
Mean of the DataFrame:
0    0.3975
1    0.2215
2    0.4085
3    0.3990
4    0.3735
dtype: float64
Maximum of the DataFrame:
0    0.612
1    0.304
2    0.525
3    0.432
4    0.456
dtype: float64
Minimum of the DataFrame:
0    0.183
1    0.139
2    0.292
3    0.366
4    0.291
dtype: float64
Product of all items in the DataFrame:
0    0.111996
1    0.042256
2    0.153300
3    0.158112
4    0.132696
dtype: float64
Mean absolute deviation of the DataFrame:
0    0.2145
1    0.0825
2    0.1165
3    0.0330
4    0.0825
dtype: float64
Standard deviation of the DataFrame:
0    0.303349
1    0.116673
2    0.164756
3    0.046669
4    0.116673
dtype: float64
Count of the DataFrame:
0    2
1    2
2    2
3    2
4    2
dtype: int64
Median of the DataFrame:
0    0.3975
1    0.2215
2    0.4085
3    0.3990
4    0.3735
dtype: float64
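
The `axis` argument names the axis that gets collapsed, which is easy to mix up. The following is a minimal sketch on a small hypothetical frame (the name `demo` and its values are illustrative and not part of the data above), showing that the default reduces down the rows and that `axis='columns'` is the same as `axis=1`:

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# Default (axis=0, also spelled axis='index'): collapse the rows,
# giving one result per column
col_sums = demo.sum()

# axis='columns' (same as axis=1): collapse the columns,
# giving one result per row
row_sums = demo.sum(axis='columns')

print(col_sums)
print(row_sums)
```

The result of a column-wise aggregate is indexed by column label, while the result of a row-wise aggregate is indexed by the original row index.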


The convenience method describe() computes several common aggregates for each column and returns them in a single DataFrame. Here we first drop the rows that contain missing values.

In [35]:
planets.dropna().describe()

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0
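
The call to `dropna()` matters because Pandas aggregates skip missing values column by column. As a rough sketch on a hypothetical frame (`demo`, `x`, and `y` are made-up names, not columns of the planets data), compare the counts with and without dropping incomplete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 'y'
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [10.0, np.nan, 30.0, 40.0]})

# describe() ignores NaN separately in each column,
# so the 'count' row can differ between columns
per_column = demo.describe()

# dropna() first removes every row containing any NaN,
# so all columns are summarized over the same rows
complete_rows = demo.dropna().describe()

print(per_column.loc['count'])
print(complete_rows.loc['count'])
```

Without `dropna()`, each column is summarized over its own non-null entries; with it, every statistic refers to the same set of complete rows.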


### GroupBy: Split, Apply, Combine
    Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation.
    The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.

In [41]:
df = pd.DataFrame({'Key': ['A', 'B', 'C', 'A', 'B', 'C'], 'Data' : range(6)}, columns = ['Key', 'Data'])
df

,Key,Data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [42]:
"""
We can compute the most basic split-apply-combine operation with the groupby() method of DataFrames,
passing the name of the desired key column:
"""
df.groupby('Key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb25eb22390>

In [43]:
"""
To produce a result, we can apply an aggregate to this DataFrameGroupBy object,
which will perform the appropriate apply/combine steps to produce the desired
result:
"""
df.groupby('Key').sum()

Key,Data
A,3
B,5
C,7
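
To make the three steps concrete, here is a sketch that spells out split, apply, and combine by hand for the same small frame. This is only an illustration of the idea, not how Pandas implements `groupby()` internally; the names `pieces`, `sums`, and `manual` are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['A', 'B', 'C', 'A', 'B', 'C'], 'Data': range(6)})

# Split: one sub-frame per distinct key
pieces = {key: df[df['Key'] == key] for key in sorted(df['Key'].unique())}

# Apply: reduce each piece to a single number
sums = {key: piece['Data'].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one object indexed by key
manual = pd.Series(sums, name='Data')
manual.index.name = 'Key'

print(manual)
```

The `groupby()` version gives the same numbers in one line and avoids materializing the intermediate pieces explicitly.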


### The GroupBy object
    The GroupBy object is a very flexible abstraction.
    In many ways, you can simply treat it as if it’s a collection of DataFrames, and it does the difficult things under the hood.
    Let’s see some examples using the Planets data.
    Perhaps the most important operations made available by a GroupBy are aggregate, filter, transform, and apply.
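
As a quick orientation, the four operations can be sketched on a small hypothetical frame (the frame `demo`, the threshold, and the lambda functions are illustrative assumptions, not taken from the Planets data):

```python
import pandas as pd

# Hypothetical data used only for illustration
demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data': [0, 1, 2, 3, 4, 5]})
grouped = demo.groupby('key')

# aggregate: compute several reductions per group at once
summary = grouped['data'].aggregate(['min', 'max', 'sum'])

# filter: keep only the rows of groups that satisfy a condition
big_groups = grouped.filter(lambda g: g['data'].sum() > 4)

# transform: return output with the same shape as the input,
# here centering each value on its group mean
centered = grouped['data'].transform(lambda s: s - s.mean())

# apply: run an arbitrary function on each group,
# here the spread between the largest and smallest value
spread = grouped['data'].apply(lambda s: s.max() - s.min())

print(summary)
print(big_groups)
print(centered)
print(spread)
```

In short, `aggregate` and this use of `apply` return one row per group, `filter` returns a subset of the original rows, and `transform` returns one value per original row.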

### Column indexing:
    The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object.

In [49]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
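
The kind of object returned depends on how the columns are selected. The sketch below uses a tiny hypothetical frame whose values are made up for illustration (they are not taken from the planets dataset) so that it runs without downloading anything:

```python
import pandas as pd

# Made-up values; only the column names mirror the planets data
demo = pd.DataFrame({'method': ['Transit', 'Imaging', 'Transit', 'Imaging'],
                     'orbital_period': [3.0, 1000.0, 5.0, 3000.0],
                     'year': [2010, 2008, 2012, 2009]})
grouped = demo.groupby('method')

# A single label selects one column and yields a SeriesGroupBy
one_col = grouped['orbital_period']

# A list of labels selects several columns and yields a DataFrameGroupBy
two_cols = grouped[['orbital_period', 'year']]

# No computation happens until an aggregate is called
medians = one_col.median()
maxima = two_cols.max()

print(type(one_col).__name__, type(two_cols).__name__)
print(medians)
print(maxima)
```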

### Dispatch methods:
    Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects.
    For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data.

In [56]:
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0
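
The shape of such a result is easier to see on a small example. The following sketch uses a hypothetical frame (`demo`, `key`, and `value` are made-up names) and calls `describe()` on a grouped column, giving one row per group and one column per statistic:

```python
import pandas as pd

# Hypothetical data used only for illustration
demo = pd.DataFrame({'key': ['A', 'A', 'B', 'B', 'B'],
                     'value': [1.0, 3.0, 2.0, 4.0, 6.0]})

# describe() is applied to each group and the results are
# combined into a single DataFrame indexed by the group key
stats = demo.groupby('key')['value'].describe()

print(stats)
```

Because the combined result is an ordinary DataFrame, individual statistics can be pulled out with normal indexing, for example `stats['mean']` or `stats.loc['B', '50%']`.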
