# Pandas Aggregations

- **API Reference:** https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#groupby

- **User Guide:** https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

**NOTE: Use nbextensions for a cleaner view of this notebook**

In [23]:
import pandas as pd
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

Sample DataFrame

In [5]:
df = pd.DataFrame(np.random.randint(0,11,(10,3)), columns = ['num1','num2','num3'])
df['category'] = ['a','a','a','b','b','b','b','c','c','c']
df = df[['category','num1','num2','num3']]
df

  category  num1  num2  num3
0        a     8     0     7
1        a     7     5     8
2        a     4     7     7
3        b     9     2     2
4        b     6     3     5
5        b     0     8     0
6        b     7     4     2
7        c     7     8     8
8        c    10     8    10
9        c     4     1     4


Iterate over the groups: each iteration yields the group name and the rows of the parent DataFrame that belong to it

In [28]:
for name, group in df.groupby('category'): 
    print(name)
    print(group)
    print('\n')

a
  category  num1  num2  num3
0        a     8     0     7
1        a     7     5     8
2        a     4     7     7


b
  category  num1  num2  num3
3        b     9     2     2
4        b     6     3     5
5        b     0     8     0
6        b     7     4     2


c
  category  num1  num2  num3
7        c     7     8     8
8        c    10     8    10
9        c     4     1     4




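## Indexing / Iterations

`groups` maps each group name to the index labels of its rows in the parent DataFrame, while `indices` maps each group name to the integer positions of those rows.

In [7]: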
df.groupby('category').groups
df.groupby('category').indices

{'a': Int64Index([0, 1, 2], dtype='int64'),
 'b': Int64Index([3, 4, 5, 6], dtype='int64'),
 'c': Int64Index([7, 8, 9], dtype='int64')}

{'a': array([0, 1, 2], dtype=int64),
 'b': array([3, 4, 5, 6], dtype=int64),
 'c': array([7, 8, 9], dtype=int64)}
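With the default integer index, `groups` and `indices` look identical, so the difference is easy to miss. The sketch below uses a small frame with string index labels (`x`, `y`, `z` are made up for illustration, not part of the notebook's data) to show that `groups` holds labels and `indices` holds positions.

```python
import pandas as pd

# Hypothetical frame with string index labels so labels and positions differ
small = pd.DataFrame(
    {"category": ["a", "a", "b"], "num1": [8, 7, 9]},
    index=["x", "y", "z"],
)
grouped = small.groupby("category")

# groups: group name -> index labels of the rows in that group
labels = {name: list(idx) for name, idx in grouped.groups.items()}
# indices: group name -> integer positions of the rows in that group
positions = {name: pos.tolist() for name, pos in grouped.indices.items()}

print(labels)
print(positions)
```

Labels are what `.loc` expects and positions are what `.iloc` expects, which is the practical reason to pick one attribute over the other.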

Get a single group by the value of the column you are grouping by

In [21]:
df.groupby('category').get_group('a')
df.groupby('category').get_group('b')

  category  num1  num2  num3
0        a     8     0     7
1        a     7     5     8
2        a     4     7     7

  category  num1  num2  num3
3        b     9     2     2
4        b     6     3     5
5        b     0     8     0
6        b     7     4     2
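As a sanity check, `get_group` should return the same rows as a boolean filter on the grouping column. The sketch below rebuilds the sample frame from the values printed above (rather than from random numbers) and compares the two selections on the numeric columns.

```python
import pandas as pd

# Sample frame rebuilt from the values shown in the notebook output
df = pd.DataFrame({
    "category": list("aaabbbbccc"),
    "num1": [8, 7, 4, 9, 6, 0, 7, 7, 10, 4],
    "num2": [0, 5, 7, 2, 3, 8, 4, 8, 8, 1],
    "num3": [7, 8, 7, 2, 5, 0, 2, 8, 10, 4],
})
num_cols = ["num1", "num2", "num3"]

# Rows of group 'a' via get_group
group_a = df.groupby("category").get_group("a")
# The same rows via a plain boolean mask
mask_a = df[df["category"] == "a"]

print(group_a[num_cols].equals(mask_a[num_cols]))
```

`get_group` is handy when the groupby object already exists. For a one-off selection the boolean mask is usually simpler.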


### Multiindex

In [36]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
             ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one       1.316669
       two       0.549894
baz    one       0.526692
       two       0.460793
foo    one      -0.474307
       two      -0.756092
qux    one      -0.747648
       two       0.347592
dtype: float64

Group by any one of the index levels, or by several levels at once

In [49]:
# single level, selected by level position or by level name
s.groupby(level=0).sum()
s.groupby(level='first').sum()
s.groupby(level=1).sum()

# multiple levels, selected by level position
s.groupby(level = [0,1]).count()

first
bar    1.866563
baz    0.987485
foo   -1.230399
qux   -0.400056
dtype: float64

first
bar    1.866563
baz    0.987485
foo   -1.230399
qux   -0.400056
dtype: float64

second
one    0.621406
two    0.602187
dtype: float64

first  second
bar    one       1
       two       1
baz    one       1
       two       1
foo    one       1
       two       1
qux    one       1
       two       1
dtype: int64
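Because the series above is random, its sums cannot be checked by eye. The sketch below repeats the level-based grouping on a smaller MultiIndex series with fixed values (chosen here purely for illustration) so each result can be verified by hand.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
# Fixed illustrative values instead of np.random.randn
s_fixed = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

by_first = s_fixed.groupby(level="first").sum()   # bar: 1 + 2, baz: 3 + 4
by_second = s_fixed.groupby(level=1).sum()        # one: 1 + 3, two: 2 + 4
# Grouping by every level gives one row per group, so each count is 1
by_both = s_fixed.groupby(level=["first", "second"]).count()

print(by_first)
print(by_second)
print(by_both)
```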

## Function Application

Apply a function to each group with `apply`

In [29]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

In [59]:
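# sum is applied to every column of each group, so the string column
# 'category' is concatenated ('aaa') while the numeric columns are added up
df.groupby('category').apply(sum)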

          category  num1  num2  num3
category
a              aaa    19    12    22
b             bbbb    22    17     9
c              ccc    21    17    22
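The `get_stats` helper defined above is never called in the notebook. One possible way to use it, sketched here under the assumption that it is meant for a single numeric column, is to apply it per group and unstack the resulting (category, statistic) index into columns. The frame is rebuilt from the `num1` values shown earlier, and the `agg` call is included as an equivalent that needs no helper.

```python
import pandas as pd

# num1 values taken from the sample output shown earlier
df = pd.DataFrame({
    "category": list("aaabbbbccc"),
    "num1": [8, 7, 4, 9, 6, 0, 7, 7, 10, 4],
})

def get_stats(group):
    # group is the Series of num1 values for one category
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

# apply returns a Series indexed by (category, statistic);
# unstack pivots the statistic names into columns
stats = df.groupby('category')['num1'].apply(get_stats).unstack()

# Equivalent result using built-in aggregation names
same = df.groupby('category')['num1'].agg(['min', 'max', 'count', 'mean'])

print(stats)
print(same)
```

For standard statistics the `agg` form is usually the shorter choice. A dict-returning helper like `get_stats` is more useful when the statistics are custom.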


## Group Descriptive Statistics

https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats

Summarize each group with `describe`, just as you would a whole DataFrame

In [25]:
df.groupby('category').describe()

          num1
          count      mean       std  min  25%  50%  75%   max
category
a           3.0  6.333333  2.081666  4.0  5.5  7.0  7.5   8.0
b           4.0       5.5  3.872983  0.0  4.5  6.5  7.5   9.0
c           3.0       7.0       3.0  4.0  5.5  7.0  8.5  10.0

          num2
          count      mean       std  min   25%  50%  75%  max
category
a           3.0       4.0  3.605551  0.0   2.5  5.0  6.0  7.0
b           4.0      4.25  2.629956  2.0  2.75  3.5  5.0  8.0
c           3.0  5.666667  4.041452  1.0   4.5  8.0  8.0  8.0

          num3
          count      mean       std  min  25%  50%   75%   max
category
a           3.0  7.333333   0.57735  7.0  7.0  7.0   7.5   8.0
b           4.0      2.25  2.061553  0.0  1.5  2.0  2.75   5.0
c           3.0  7.333333   3.05505  4.0  6.0  8.0   9.0  10.0
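The result of `describe` has two column levels (source column, then statistic), which makes it wide. The sketch below shows two ways to slice it: selecting every statistic for one column, and selecting one statistic across all columns. The frame is rebuilt from the sample values shown earlier, and the numeric columns are selected explicitly so the result does not depend on how the string column is handled.

```python
import pandas as pd

# Sample frame rebuilt from the values shown in the notebook output
df = pd.DataFrame({
    "category": list("aaabbbbccc"),
    "num1": [8, 7, 4, 9, 6, 0, 7, 7, 10, 4],
    "num2": [0, 5, 7, 2, 3, 8, 4, 8, 8, 1],
    "num3": [7, 8, 7, 2, 5, 0, 2, 8, 10, 4],
})

desc = df.groupby("category")[["num1", "num2", "num3"]].describe()

# All statistics for a single column: index with the top-level name
num1_stats = desc["num1"]

# A single statistic for every column: cross-section on the second level
means = desc.xs("mean", axis=1, level=1)

print(num1_stats)
print(means)
```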
