# GROUP BY AND AGGREGATES

`groupby` in pandas works like GROUP BY in SQL. It consists of three steps:
- Splitting: the data is split into groups
- Applying: an aggregate function is applied to each group
- Combining: the results are combined into a DataFrame
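
The three steps above can be sketched by hand. This is only an illustration of the idea, assuming a cut-down version of the DataFrame used below; the names `pieces`, `totals`, `by_hand` and `built_in` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "score": [64, 38, 78, 22, 85, 92],
    }
)

# Split: one sub-DataFrame per distinct value of "class"
pieces = {key: part for key, part in df1.groupby("class")}

# Apply: reduce each piece to a single number
totals = {key: part["score"].sum() for key, part in pieces.items()}

# Combine: collect the per-group results into one Series
by_hand = pd.Series(totals, name="score").rename_axis("class")

# The usual one-liner performs all three steps at once
built_in = df1.groupby("class")["score"].sum()
```

Both `by_hand` and `built_in` should hold the same per-class totals.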

As with GROUP BY and HAVING in SQL, we can also filter groups based on the result of an aggregate function.
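
A HAVING-style filter might look like the sketch below. It assumes the same data as the DataFrame created further down; the threshold of 60 and the name `high` are arbitrary choices for illustration. Note that, unlike SQL's HAVING, `filter` returns the original rows of the qualifying groups rather than one aggregated row per group.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "name": ["anu", "tanu", "ranu", "kanu", "panu", "sanu"],
        "score": [64, 38, 78, 22, 85, 92],
        "weight": [54, 46, 63, 59, 72, 51],
    }
)

# Keep only the rows of classes whose mean score is above 60,
# roughly: ... GROUP BY class HAVING AVG(score) > 60
high = df1.groupby("class").filter(lambda g: g["score"].mean() > 60)
```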

In [1]:
import numpy as np
import pandas as pd             #importing modules

In [2]:
df1=pd.DataFrame(
    {
        "class":[5,6,6,7,5,7],
        "name":["anu",'tanu','ranu','kanu','panu','sanu'],
        "score":[64,38,78,22,85,92],
        "weight":[54,46,63,59,72,51]
    }
)                               #creating dataframe

In [3]:
df1                             #printing dataframe

,class,name,score,weight
0,5,anu,64,54
1,6,tanu,38,46
2,6,ranu,78,63
3,7,kanu,22,59
4,5,panu,85,72
5,7,sanu,92,51


In [4]:
gm = df1.groupby("class")              #creating a GroupBy object grouped on the class column

In [5]:
gm

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A1A883D310>

Why group at all? To apply an aggregate function to each group. Applying one to the GroupBy object returns a DataFrame.

In [6]:
df1.groupby("class").sum()           #applying sum on groups

class,name,score,weight
5,anupanu,149,126
6,tanuranu,116,109
7,kanusanu,114,110


In [7]:
df1.groupby("class").sum().drop("name",axis=1)         #dropping name column 

class,score,weight
5,149,126
6,116,109
7,114,110


In [8]:
df1.groupby(["class"]).min().drop("name",axis=1)         #applying min to every group

class,score,weight
5,64,54
6,38,46
7,22,51


In [9]:
df1.groupby(["class"]).mean()

TypeError: agg function failed [how->mean,dtype->object]

This raises an error because the name column contains strings, so pandas cannot compute a mean for it.

In [10]:
df1.iloc[:,[0,2,3]].groupby(["class"]).mean()      #selecting only class, score and weight (leaving out name) and calculating the mean

class,score,weight
5,74.5,63.0
6,58.0,54.5
7,57.0,55.0
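
Selecting columns by position works, but it depends on the column order. Two alternatives that may be easier to read are sketched here, assuming the same `df1` as above; the names `means` and `means_by_name` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "name": ["anu", "tanu", "ranu", "kanu", "panu", "sanu"],
        "score": [64, 38, 78, 22, 85, 92],
        "weight": [54, 46, 63, 59, 72, 51],
    }
)

# Option 1: ask mean() to skip non-numeric columns
means = df1.groupby("class").mean(numeric_only=True)

# Option 2: select the columns to aggregate by name
means_by_name = df1.groupby("class")[["score", "weight"]].mean()
```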


In [11]:
df1.iloc[:,[0,2,3]].groupby(["class"]).var()         #calculating variance

class,score,weight
5,220.5,162.0
6,800.0,144.5
7,2450.0,32.0


In [12]:
df1.iloc[:,[0,2,3]].groupby(["class"]).describe()     #creating a summary

,score,score,score,score,score,score,score,score,weight,weight,weight,weight,weight,weight,weight,weight
class,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
5,2.0,74.5,14.849242,64.0,69.25,74.5,79.75,85.0,2.0,63.0,12.727922,54.0,58.5,63.0,67.5,72.0
6,2.0,58.0,28.284271,38.0,48.0,58.0,68.0,78.0,2.0,54.5,12.020815,46.0,50.25,54.5,58.75,63.0
7,2.0,57.0,49.497475,22.0,39.5,57.0,74.5,92.0,2.0,55.0,5.656854,51.0,53.0,55.0,57.0,59.0


All built-in aggregation methods are listed here:
https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods

We can apply multiple aggregate functions at once.

In [13]:
df1.iloc[:,[0,2,3]].groupby(["class"]).agg(["mean","var","count"])      #applying multiple aggregate functions

,score,score,score,weight,weight,weight
class,mean,var,count,mean,var,count
5,74.5,220.5,2,63.0,162.0,2
6,58.0,800.0,2,54.5,144.5,2
7,57.0,2450.0,2,55.0,32.0,2


We can also retrieve a specific group.

In [14]:
gm.get_group(6)                  #getting the group where class is 6

,class,name,score,weight
1,6,tanu,38,46
2,6,ranu,78,63
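
Besides `get_group`, a GroupBy object can be iterated over, and its `groups` attribute maps each key to the row labels in that group. A small sketch, assuming the same `df1` as above (the `sizes` dictionary is just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "name": ["anu", "tanu", "ranu", "kanu", "panu", "sanu"],
        "score": [64, 38, 78, 22, 85, 92],
        "weight": [54, 46, 63, 59, 72, 51],
    }
)

gm = df1.groupby("class")

# Iterating yields (key, sub-DataFrame) pairs, one per group
sizes = {}
for key, frame in gm:
    sizes[key] = len(frame)

# Row labels that belong to class 6
labels_for_6 = list(gm.groups[6])
```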


We can also rename the aggregated columns.

In [15]:
# renaming the columns produced by the aggregate functions
df1.iloc[:,[0,2,3]].groupby(["class"]).agg(["mean","var","count"]).rename(columns={'mean':'A','var':'B','count':'C'})

,score,score,score,weight,weight,weight
class,A,B,C,A,B,C
5,74.5,220.5,2,63.0,162.0,2
6,58.0,800.0,2,54.5,144.5,2
7,57.0,2450.0,2,55.0,32.0,2
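
If flat, descriptive column names are wanted instead of a two-level header, pandas' named aggregation is one option. The sketch below assumes the same `df1`; the output names `score_mean`, `score_var` and `n` are arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "name": ["anu", "tanu", "ranu", "kanu", "panu", "sanu"],
        "score": [64, 38, 78, 22, 85, 92],
        "weight": [54, 46, 63, 59, 72, 51],
    }
)

# Each keyword is an output column name; the tuple is (input column, function)
summary = df1.groupby("class").agg(
    score_mean=("score", "mean"),
    score_var=("score", "var"),
    n=("score", "count"),
)
```

Because only the `score` column is referenced, the string column `name` does not need to be dropped first.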


We can apply different aggregate functions to different columns.

In [16]:
df1.iloc[:,[0,2,3]].groupby(["class"]).agg({"score":"mean","weight":"var"})        #applying mean to the score column and variance to the weight column

class,score,weight
5,74.5,162.0
6,58.0,144.5
7,57.0,32.0


Now suppose we want the cumulative sum within every group. An aggregate function cannot do this, because it reduces each group to a single row. We need a transformation instead, which returns one value for every original row.

In [17]:
df1                                                #printing dataframe

,class,name,score,weight
0,5,anu,64,54
1,6,tanu,38,46
2,6,ranu,78,63
3,7,kanu,22,59
4,5,panu,85,72
5,7,sanu,92,51


In [18]:
df1.iloc[:,[0,2,3]].groupby("class").cumsum()        #finding cumulative sum

,score,weight
0,64,54
1,38,46
2,116,109
3,22,59
4,149,126
5,114,110


The first group is class 5. Its first value is 64 at index 0, so the cumulative sum there is 64. Its second value is 85 at index 4, so the cumulative sum is 64+85=149, shown at index 4. The second group is class 6. Its first value is 38 at index 1, so cumsum returns 38 at index 1. Its second value is 78 at index 2, so the cumulative sum is 38+78=116, shown at index 2. Class 7 works the same way: 22 at index 3, then 22+92=114 at index 5.
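
The cumsum output drops the class column, which makes the walk-through harder to check by eye. One possible way to keep the key next to the running total is sketched here, assuming the same data; the column name `score_cumsum` is made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "score": [64, 38, 78, 22, 85, 92],
    }
)

# The grouped cumsum keeps the original index, so it aligns with df1 row by row
out = df1.assign(score_cumsum=df1.groupby("class")["score"].cumsum())
```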

The result is easier to read if we sort the DataFrame by class first.

In [19]:
df1.iloc[:,[0,2,3]].sort_values("class")      #sorting the dataframe by the class column

,class,score,weight
0,5,64,54
4,5,85,72
1,6,38,46
2,6,78,63
3,7,22,59
5,7,92,51


In [20]:
df1.iloc[:,[0,2,3]].sort_values("class").groupby("class").cumsum()      #now we find cumsum

,score,weight
0,64,54
4,149,126
1,38,46
2,116,109
3,22,59
5,114,110


In [21]:
df1.iloc[:,[0,2,3]].sort_values("class").groupby("class").transform("cumsum")   #same result using transform()

,score,weight
0,64,54
4,149,126
1,38,46
2,116,109
3,22,59
5,114,110


All built-in transformation methods are listed here: https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-transformation-methods
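
`transform` also accepts the name of an aggregate function, in which case the per-group result is repeated for every row of the group. A sketch of one common use, comparing each score with its class mean (same data assumed; `group_mean` and `diff` are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "class": [5, 6, 6, 7, 5, 7],
        "score": [64, 38, 78, 22, 85, 92],
    }
)

# One value per original row: the mean score of that row's class
group_mean = df1.groupby("class")["score"].transform("mean")

# Difference between each score and its class mean
diff = df1["score"] - group_mean
```

The result has the same length and index as `df1`, which is what distinguishes a transformation from an aggregation.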

Another family of methods is filtration. Methods such as head(), tail() and nth() return a subset of the rows of every group.

In [22]:
df1.groupby("class").head(1)                       #returning the first row of every group

,class,name,score,weight
0,5,anu,64,54
1,6,tanu,38,46
3,7,kanu,22,59


In [23]:
df1.groupby("class").tail(1).reset_index()          #returning the last row of every group

,index,class,name,score,weight
0,2,6,ranu,78,63
1,4,5,panu,85,72
2,5,7,sanu,92,51


We create another DataFrame to demonstrate nth().

In [24]:
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])      #creating dataframe

In [26]:
df

,A,B
0,1,NaN
1,1,2.0
2,2,3.0
3,1,4.0
4,2,5.0


In [27]:
df.groupby("A").nth(0)                        #getting the first row (position 0) of every group

,A,B
0,1,NaN
2,2,3.0


In [28]:
df.groupby("A").nth(1).reset_index()          #getting the second row (position 1) of every group

,index,A,B
0,1,1,2.0
1,4,2,5.0


In [29]:
df.groupby("A").nth[0:2]                    #passing a slice to nth using index notation

,A,B
0,1,NaN
1,1,2.0
2,2,3.0
4,2,5.0
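
Note that `nth(0)` returns the first row of each group even when it contains NaN. If the first non-null value per column is wanted instead, `first()` may be the better fit. A small comparison on the same data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2],
                   "B": [np.nan, 2, 3, 4, 5]})

# nth(0): the first row of each group, NaN included
first_row = df.groupby("A").nth(0)

# first(): the first non-null value of each column in each group
first_non_null = df.groupby("A").first()
```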
