# Assumptions

This notebook assumes you are familiar with basic Python syntax and lambda functions.
<br />Lambda function blog: https://www.programiz.com/python-programming/anonymous-function <br /><br />
For more background on the split-apply-combine idea behind groupby, this blog may also help:
<br />Link: https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e

# Why/When Groupby?
During EDA, data analysis, or feature engineering, you will often want to split the data by group or category and compute summary statistics for each group.

Let's look at some scenarios below, starting with some dummy data.


In [1]:
import numpy as np
import pandas as pd

In [2]:
"""
  Assume a class of 300 students, each of whom has rated their one favourite subject, chosen from 6 unique subjects, on a scale of 1-10.
"""
n_studs = 300
HouseOne = pd.DataFrame({
    "Name":["Name_"+str(i) for i in range(n_studs)],
    "Subject":np.random.choice(["Subject_"+str(i) for i in range(6)], size=n_studs),
    "Rating":np.random.uniform(low=1, high=10, size=n_studs)
})

In [3]:
HouseOne.sample(10) # returns 10 randomly chosen rows, in random order

Unnamed: 0,Name,Subject,Rating
114,Name_114,Subject_3,8.89847
9,Name_9,Subject_2,6.895104
73,Name_73,Subject_3,9.027637
13,Name_13,Subject_1,3.088946
288,Name_288,Subject_3,6.869682
167,Name_167,Subject_4,4.603252
117,Name_117,Subject_0,3.516517
24,Name_24,Subject_3,7.025708
44,Name_44,Subject_4,7.427637
66,Name_66,Subject_5,6.116709


OK, now that we have sample data, let's look at the average rating of each subject.

In [4]:
"""
Hold on a second: groupby("Subject") splits the data into one group per unique subject and returns a GroupBy object.
Iterating over that object yields (group name, group data) tuples; here the group name is the subject and the group data is the DataFrame of rows for that subject.
"""
for name, group in HouseOne.groupby("Subject"):
  print(name)
  print(group)
  print(f"Group Name is {name}")

Subject_0
         Name    Subject    Rating
4      Name_4  Subject_0  7.664599
8      Name_8  Subject_0  3.747902
17    Name_17  Subject_0  2.943420
20    Name_20  Subject_0  4.707439
26    Name_26  Subject_0  7.008340
27    Name_27  Subject_0  5.251544
47    Name_47  Subject_0  7.521537
50    Name_50  Subject_0  7.701747
54    Name_54  Subject_0  4.909955
58    Name_58  Subject_0  3.765084
63    Name_63  Subject_0  5.970750
68    Name_68  Subject_0  6.212082
70    Name_70  Subject_0  9.578245
72    Name_72  Subject_0  2.755612
74    Name_74  Subject_0  6.980206
92    Name_92  Subject_0  6.568149
101  Name_101  Subject_0  6.949784
107  Name_107  Subject_0  9.801355
116  Name_116  Subject_0  8.503369
117  Name_117  Subject_0  3.516517
121  Name_121  Subject_0  7.446221
122  Name_122  Subject_0  2.991187
124  Name_124  Subject_0  9.308572
129  Name_129  Subject_0  9.999444
142  Name_142  Subject_0  3.328223
152  Name_152  Subject_0  7.580553
155  Name_155  Subject_0  3.899070
157  Name_

In [5]:
%%time
for name, group in HouseOne.groupby("Subject"):
  print(f"Subject Name is {name} and Subject Avg. rating is {group['Rating'].mean()}") # each group holds the rows of one subject

Subject Name is Subject_0 and Subject Avg. rating is 6.221117399677172
Subject Name is Subject_1 and Subject Avg. rating is 5.615162821682815
Subject Name is Subject_2 and Subject Avg. rating is 5.945150061262556
Subject Name is Subject_3 and Subject Avg. rating is 5.529745150060131
Subject Name is Subject_4 and Subject Avg. rating is 5.780473839570078
Subject Name is Subject_5 and Subject Avg. rating is 5.541819975490779
CPU times: user 7.19 ms, sys: 960 µs, total: 8.15 ms
Wall time: 7.32 ms


In [6]:
for name, group in HouseOne.groupby("Subject"):
  print(f"Now Printing Filtered Data of only : {name}")
  print("*"*50)
  print(group.head(3))
  print("*"*50)

Now Printing Filtered Data of only : Subject_0
**************************************************
       Name    Subject    Rating
4    Name_4  Subject_0  7.664599
8    Name_8  Subject_0  3.747902
17  Name_17  Subject_0  2.943420
**************************************************
Now Printing Filtered Data of only : Subject_1
**************************************************
       Name    Subject    Rating
0    Name_0  Subject_1  2.106866
3    Name_3  Subject_1  1.633120
13  Name_13  Subject_1  3.088946
**************************************************
Now Printing Filtered Data of only : Subject_2
**************************************************
     Name    Subject    Rating
5  Name_5  Subject_2  3.121444
6  Name_6  Subject_2  1.928478
9  Name_9  Subject_2  6.895104
**************************************************
Now Printing Filtered Data of only : Subject_3
**************************************************
       Name    Subject    Rating
15  Name_15  Subject_3  9.810748
1

I hope you get the general idea. Looping over the groups to compute a mean (or any other statistic) isn't really pythonic, though, so let's try another way.

In [7]:
HouseOne.groupby("Subject") #returns the groupby object 

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdf442f87f0>
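
If you are curious what sits inside that GroupBy object, here is a minimal sketch on a tiny hand-made frame (the `df` below and its values are made up for illustration, not the HouseOne data). It uses `ngroups`, `groups`, and `get_group` to peek at the groups without looping.

```python
import pandas as pd

# Tiny hypothetical frame, only for illustration
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Subject": ["Math", "Art", "Math", "Art"],
    "Rating": [8.0, 6.0, 4.0, 7.0],
})

grouped = df.groupby("Subject")  # the GroupBy object itself

n_groups = grouped.ngroups                   # number of distinct subjects
group_names = sorted(grouped.groups.keys())  # the group labels
math_rows = grouped.get_group("Math")        # sub-DataFrame of a single group

print(n_groups)
print(group_names)
print(math_rows)
```

`get_group` is handy when you only care about one group and do not want to iterate over all of them.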

In [8]:
%%time
HouseOne.groupby("Subject")['Rating'].mean() # what just happened? One group per subject was created, the Rating column was selected within each group, and its mean was computed. Simple.

CPU times: user 2.67 ms, sys: 0 ns, total: 2.67 ms
Wall time: 10.8 ms


Subject
Subject_0    6.221117
Subject_1    5.615163
Subject_2    5.945150
Subject_3    5.529745
Subject_4    5.780474
Subject_5    5.541820
Name: Rating, dtype: float64

Compare the timings and check that the values match between the explicit loop and the one-line shortcut. With only 300 rows the timings are dominated by fixed overhead and noise (the wall times above do not even favour the shortcut), but on large data the difference can become significant. Try it by increasing n_studs from 300 to 3,000,000.
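
The comparison suggested above could be sketched roughly like this. The frame `big`, the row count, and the seed are assumptions chosen so the example runs quickly; actual timings depend on your machine, so no numbers are promised here.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible
n_rows = 200_000                # hypothetical size; increase it to see a bigger gap

big = pd.DataFrame({
    "Subject": rng.choice([f"Subject_{i}" for i in range(6)], size=n_rows),
    "Rating": rng.uniform(1, 10, size=n_rows),
})

# Way 1: loop over the groups and compute each mean by hand
start = time.perf_counter()
loop_means = {name: group["Rating"].mean() for name, group in big.groupby("Subject")}
loop_seconds = time.perf_counter() - start

# Way 2: let pandas compute all group means in one call
start = time.perf_counter()
vec_means = big.groupby("Subject")["Rating"].mean()
vec_seconds = time.perf_counter() - start

# Both ways should give the same numbers
same = np.allclose(pd.Series(loop_means).sort_index(), vec_means.sort_index())

print(f"loop: {loop_seconds:.4f}s, one call: {vec_seconds:.4f}s, same values: {same}")
```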

#### Aggregate functions
How do you get multiple statistical summaries from the groups at once?
Let's try to get the mean, median, mode, count, and sum of the ratings for each group.

In [9]:
"""
  One rule for aggregate functions: every function you pass must reduce a group's column to a single value.
  For example:
    mean returns a single value, the average of that column.
    value_counts returns multiple values, so it cannot be used directly; you can, however, pick the most frequent element from its result and return that (see getMeMode below).
  You can also pass your own user-defined functions; here getMeMode returns the mode of the Series/column.

"""
def getMeMode(x):
  # x is the whole Rating column of one group; value_counts() sorts by frequency, so index[0] is the most frequent value.
  # Note: Rating is continuous here, so repeated values are unlikely and this "mode" is mostly illustrative.
  return x.value_counts().index[0]
  
# Pass the functions as a list so the columns come back in the given order (a set, written with {...}, has no defined order).
# "sum" is the string alias that replaces np.sum; getMeMode is passed as a function reference, not called.
stats = HouseOne.groupby("Subject")["Rating"].agg(["median", "mean", getMeMode, "count", "sum"])
stats

Subject,median,mean,getMeMode,count,sum
Subject_0,6.568149,6.221117,7.580553,51,317.276987
Subject_1,5.817395,5.615163,4.424974,52,291.988467
Subject_2,6.217929,5.94515,5.121305,49,291.312353
Subject_3,4.719258,5.529745,4.726962,52,287.546748
Subject_4,5.315201,5.780474,1.856215,56,323.706535
Subject_5,5.173658,5.54182,8.682999,40,221.672799
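
To see the "one value per group" rule on numbers you can check by hand, here is a small sketch. The frame `df` and the helper `value_range` are made up for illustration; the point is only that a custom function can sit next to the built-in names as long as it returns a single value.

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
df = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "Art", "Art"],
    "Score": [3, 3, 5, 2, 4],
})

def value_range(col):
    """Custom aggregate: reduce one group's column to a single number."""
    return col.max() - col.min()

# A list keeps the output columns in the order given here
stats = df.groupby("Subject")["Score"].agg(["mean", "count", value_range])
print(stats)
```

The column for the custom function is named after the function itself, here `value_range`.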


#### Apply/ApplyMap/Map

There are scenarios where you want to apply some logic or function to a DataFrame or Series itself.


In [10]:
"""
  Apply/ApplyMap/Map let you run your own functions over a Series or DataFrame, including the groups you created, and transform their values.
  Map - Series (a single column) only; element-wise.
  ApplyMap - DataFrame only; element-wise (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated).
  Apply - works on both: on the values of a Series, and per column or per row on a DataFrame.
  .
  .
  Let's create a new column in the DataFrame and convert the rating to a letter grade.
"""
rating_to_grade = {
    10 : "O",
     9 : "E",
     8 : "A",
     7 : "B",
     6 : "C",
     5 : "D",
     4 : "E",
     3 : "F",
     2 : "Nailed it.!",
     1 : "Legendary"
}

# convert the continuous Rating to discrete values by rounding up
HouseOne['discrete_rating'] = HouseOne['Rating'].apply(np.ceil) # map works here too, since we are only transforming a single Series (Rating)
HouseOne.head()

Unnamed: 0,Name,Subject,Rating,discrete_rating
0,Name_0,Subject_1,2.106866,3.0
1,Name_1,Subject_5,7.764332,8.0
2,Name_2,Subject_5,1.016517,2.0
3,Name_3,Subject_1,1.63312,2.0
4,Name_4,Subject_0,7.664599,8.0
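
As a quick check of the claim that map works here too, this sketch rounds a small made-up Series three ways: with `apply`, with `map`, and by calling `np.ceil` on the Series directly. The values in `ratings` are hypothetical.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.2, 4.0, 9.01])  # hypothetical ratings

via_apply = ratings.apply(np.ceil)  # what the cell above does
via_map = ratings.map(np.ceil)      # element-wise alternative
vectorized = np.ceil(ratings)       # NumPy functions also accept a Series directly

print(via_apply.tolist())
print(via_map.tolist())
print(vectorized.tolist())
```

All three give the same rounded-up values; for plain NumPy functions the direct call is usually the simplest.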


In [11]:
HouseOne['Grade'] = HouseOne['discrete_rating'].map(rating_to_grade)
HouseOne.sample(10)

Unnamed: 0,Name,Subject,Rating,discrete_rating,Grade
140,Name_140,Subject_4,3.926704,4.0,E
293,Name_293,Subject_5,4.868113,5.0,D
296,Name_296,Subject_3,9.627529,10.0,O
295,Name_295,Subject_0,6.725507,7.0,B
243,Name_243,Subject_4,9.153692,10.0,O
217,Name_217,Subject_4,8.611974,9.0,E
253,Name_253,Subject_1,9.890786,10.0,O
143,Name_143,Subject_2,9.212162,10.0,O
231,Name_231,Subject_2,7.53117,8.0,A
66,Name_66,Subject_5,6.116709,7.0,B


In [12]:
"""
  Say you want to transform every single value in the DataFrame, for example append "_H1" (for House One) to each value. Let's try that.
"""
HouseOne.applymap(lambda x : str(x)+ "_H1") # applymap appends _H1 to every element; in pandas 2.1+ applymap is deprecated in favour of DataFrame.map

Unnamed: 0,Name,Subject,Rating,discrete_rating,Grade
0,Name_0_H1,Subject_1_H1,2.1068655162078356_H1,3.0_H1,F_H1
1,Name_1_H1,Subject_5_H1,7.764332406101751_H1,8.0_H1,A_H1
2,Name_2_H1,Subject_5_H1,1.01651714832016_H1,2.0_H1,Nailed it.!_H1
3,Name_3_H1,Subject_1_H1,1.6331200626721305_H1,2.0_H1,Nailed it.!_H1
4,Name_4_H1,Subject_0_H1,7.6645992588906715_H1,8.0_H1,A_H1
...,...,...,...,...,...
295,Name_295_H1,Subject_0_H1,6.725506583361409_H1,7.0_H1,B_H1
296,Name_296_H1,Subject_3_H1,9.627529283886263_H1,10.0_H1,O_H1
297,Name_297_H1,Subject_2_H1,9.824163820822799_H1,10.0_H1,O_H1
298,Name_298_H1,Subject_1_H1,1.9762734155590813_H1,2.0_H1,Nailed it.!_H1
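
Since applymap was renamed to `DataFrame.map` in pandas 2.1, a version-tolerant sketch could look like the one below. The frame `df` and the name `elementwise` are made up for illustration; the `hasattr` check is just one possible way to pick whichever method your pandas version offers.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b"], "Rating": [1.5, 2.0]})  # hypothetical data

# Assumption: DataFrame.map exists from pandas 2.1 on; older versions only have applymap
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# The function is called once per cell, exactly like applymap above
tagged = elementwise(lambda value: f"{value}_H1")
print(tagged)
```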


##### Let's say you want to manipulate groups of data or run your function on each group

In [25]:
def myFunction(groupName, groupData):
  """
    @ groupName param : name of the group currently being processed
    @ groupData param : DataFrame holding the rows of that group
  """
  print(f"Currently this group Name is {groupName}")
  print(f"Curious what X contains? :\n {groupData.head(2)}") # groupData is simply the DataFrame for this group; apply whatever logic you need to it.



HouseOne.groupby("Subject").apply(lambda x: myFunction(x.name, x)) # x.name is passed only for illustration; you will rarely need it. As an experiment, try map instead of apply and see what happens.

Currently this group Name is Subject_0
Curious what X contains? :
      Name    Subject    Rating  discrete_rating Grade
4  Name_4  Subject_0  7.664599              8.0     A
8  Name_8  Subject_0  3.747902              4.0     E
Currently this group Name is Subject_1
Curious what X contains? :
      Name    Subject    Rating  discrete_rating        Grade
0  Name_0  Subject_1  2.106866              3.0            F
3  Name_3  Subject_1  1.633120              2.0  Nailed it.!
Currently this group Name is Subject_2
Curious what X contains? :
      Name    Subject    Rating  discrete_rating        Grade
5  Name_5  Subject_2  3.121444              4.0            E
6  Name_6  Subject_2  1.928478              2.0  Nailed it.!
Currently this group Name is Subject_3
Curious what X contains? :
        Name    Subject    Rating  discrete_rating Grade
15  Name_15  Subject_3  9.810748             10.0     O
19  Name_19  Subject_3  3.050143              4.0     E
Currently this group Name is Subject

Let's see which grade is most frequent for each subject.

In [14]:
HouseOne.groupby("Subject").apply(lambda x: x['Grade'].value_counts().index[0]) # if this looks confusing, break it down step by step: take one group, select Grade, count the values, then take the first index.

Subject
Subject_0    O
Subject_1    O
Subject_2    E
Subject_3    E
Subject_4    E
Subject_5    E
dtype: object
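
If the one-liner above is hard to read, here is the same idea unpacked on a tiny made-up frame (the data in `df` is hypothetical and has no ties). Step 1 does it by hand for one group; step 2 does it for every group, selecting the Grade column first and using `agg` instead of `apply`.

```python
import pandas as pd

# Hypothetical data with a clear winner in each group
df = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "Art", "Art", "Art"],
    "Grade": ["A", "A", "B", "C", "B", "C"],
})

# Step 1: one group by hand
math_grades = df.loc[df["Subject"] == "Math", "Grade"]
math_counts = math_grades.value_counts()  # most frequent grade comes first
math_top = math_counts.index[0]

# Step 2: the same logic for every group at once
top_grades = df.groupby("Subject")["Grade"].agg(lambda g: g.value_counts().index[0])

print(math_top)
print(top_grades)
```

When two grades are tied within a group, which one comes first in `value_counts()` is not something to rely on.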

## Aggregating Multiple Columns

In [15]:
# here we aggregate individual columns by passing a dictionary that maps each column to the operation we want; custom functions work too, e.g. getMeMode defined earlier
HouseOne.groupby("Subject").agg({"Rating":"mean", "discrete_rating":"median", "Grade":getMeMode})

Subject,Rating,discrete_rating,Grade
Subject_0,6.221117,7.0,O
Subject_1,5.615163,6.0,O
Subject_2,5.94515,7.0,E
Subject_3,5.529745,5.0,E
Subject_4,5.780474,6.0,E
Subject_5,5.54182,5.5,E
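
The dictionary form allows only one entry per column. If you want several statistics of the same column, or your own output column names, pandas also supports named aggregation. This is a minimal sketch on made-up data; the names `avg_rating`, `median_rating`, and `n_students` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Subject": ["Math", "Math", "Math", "Art", "Art"],
    "Rating": [8.0, 4.0, 6.0, 9.0, 5.0],
})

# Each keyword becomes an output column: new_name=(source column, function)
summary = df.groupby("Subject").agg(
    avg_rating=("Rating", "mean"),
    median_rating=("Rating", "median"),
    n_students=("Name", "count"),
)
print(summary)
```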


In [None]:
my_list = [1, 5, 4, 6, 8, 11, 3, 12]

new_list = list(filter(lambda x: (x%2 == 0) , my_list))

print(new_list)

[4, 6, 8, 12]


In [33]:
def myFunction(groupData):
  """
    @ groupData param : a single cell value, because applymap calls the function once per element
  """

  print(f"Curious what X contains? :\n {groupData}") # with applymap, the argument is one element of the group's DataFrame, not a row or the whole group.

for name,group in HouseOne.groupby("Subject"):
  group.applymap(lambda x: myFunction(x))
# The GroupBy object has no applymap method, so HouseOne.groupby("Subject").applymap(...) fails; loop over the groups and call applymap on each group's DataFrame instead.

Curious what X contains? :
 Name_4
Curious what X contains? :
 Name_8
Curious what X contains? :
 Name_17
Curious what X contains? :
 Name_20
Curious what X contains? :
 Name_26
Curious what X contains? :
 Name_27
Curious what X contains? :
 Name_47
Curious what X contains? :
 Name_50
Curious what X contains? :
 Name_54
Curious what X contains? :
 Name_58
Curious what X contains? :
 Name_63
Curious what X contains? :
 Name_68
Curious what X contains? :
 Name_70
Curious what X contains? :
 Name_72
Curious what X contains? :
 Name_74
Curious what X contains? :
 Name_92
Curious what X contains? :
 Name_101
Curious what X contains? :
 Name_107
Curious what X contains? :
 Name_116
Curious what X contains? :
 Name_117
Curious what X contains? :
 Name_121
Curious what X contains? :
 Name_122
Curious what X contains? :
 Name_124
Curious what X contains? :
 Name_129
Curious what X contains? :
 Name_142
Curious what X contains? :
 Name_152
Curious what X contains? :
 Name_155
Curious what X cont