### Understanding Pandas Groupby for Data Aggregation

Learning Objectives

    Understanding the syntax and functionality of the groupby() method is important for efficient data grouping.
    Familiarizing yourself with different types of aggregation functions available in pandas, including sum(), mean(), count(), max(), and min(), is necessary to perform effective data analysis.
    Knowing how to apply various aggregation functions to grouped data enables data analysts to extract useful insights from large data sets.

What Is the Pandas groupby Function?

The pandas groupby operation is a powerful and versatile tool. It lets you split your data into separate groups and perform computations on each group for better analysis.

Let me elaborate with an example. Say we are trying to analyze the weight of people in a city. We can get a fair idea by computing the mean weight of all the city dwellers. But here’s a question: would the weight be affected by a person’s gender?

We can group the city dwellers into different gender groups and compute their mean weight. This would give us a better insight into the weight of a person living in the city. But we can probably get an even better picture if we further separate these gender groups into different age groups and then take their mean weight (because a teenage boy’s weight could differ from that of an adult male)!

You can see how splitting people into groups and then computing a statistic for each group allows us to make better analyses than just looking at that statistic for the entire population. This is what makes GroupBy so great!

GroupBy allows us to group our data based on different features and get a more accurate picture of it. It is a one-stop shop for deriving deeper insights from our data!
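
The weight example above can be sketched in a few lines. The data below is invented purely for illustration, and the column names (`gender`, `age_group`, `weight_kg`) are assumptions, not part of the dataset used later in this article:

```python
import pandas as pd

# Hypothetical city dwellers; every value here is made up for illustration.
people = pd.DataFrame({
    "gender": ["m", "f", "m", "f", "m", "f"],
    "age_group": ["teen", "teen", "adult", "adult", "adult", "teen"],
    "weight_kg": [55.0, 50.0, 80.0, 65.0, 84.0, 52.0],
})

# One number for the whole population.
overall_mean = people["weight_kg"].mean()

# One number per gender.
by_gender = people.groupby("gender")["weight_kg"].mean()

# One number per (gender, age group) combination.
by_gender_and_age = people.groupby(["gender", "age_group"])["weight_kg"].mean()

print(overall_mean)
print(by_gender)
print(by_gender_and_age)
```

Each extra grouping key narrows the groups, so the per-group means describe the people in them more closely than the single overall mean does.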

In [2]:
import pandas as pd
import numpy as np


In [3]:
df = pd.read_csv("train_v9rqX0R.csv")

In [6]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


Let’s group the dataset by outlet location type using GroupBy. The syntax is simple: we just call groupby on the pandas DataFrame:

In [7]:
df.groupby('Outlet_Location_Type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F2E6D7F4A0>

GroupBy has conveniently returned a DataFrameGroupBy object. It has split the data into separate groups. However, it won’t do anything unless it is explicitly told to. So, let’s count the non-null values in each column for every outlet location type:

In [8]:
df.groupby('Outlet_Location_Type').count()

Outlet_Location_Type,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Type,Item_Outlet_Sales
Tier 1,2388,1860,2388,2388,2388,2388,2388,2388,2388,2388,2388
Tier 2,2785,2785,2785,2785,2785,2785,2785,2785,930,2785,2785
Tier 3,3350,2415,3350,3350,3350,3350,3350,3350,2795,3350,3350


We did not tell GroupBy which column to apply the aggregation function to, so it applied the function to every remaining column and returned the output.

    But fortunately, a GroupBy object supports column indexing just like a pandas DataFrame!

So let’s find out the total sales for each location type:

In [9]:
df.groupby('Outlet_Location_Type')['Item_Outlet_Sales']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F2E7BBB3B0>

Here, GroupBy has returned a SeriesGroupBy object. No computation will be done until we specify the aggregation function:

In [10]:
df.groupby('Outlet_Location_Type')['Item_Outlet_Sales'].sum()

Outlet_Location_Type
Tier 1    4.482059e+06
Tier 2    6.472314e+06
Tier 3    7.636753e+06
Name: Item_Outlet_Sales, dtype: float64
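
In the count() output above, Item_Weight and Outlet_Size show smaller numbers than the other columns, because count() skips missing values. The following is a minimal sketch on invented data (the `toy` frame is an assumption, not the real dataset) contrasting count() with size() and showing column selection on a GroupBy object:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales data; all values are invented.
toy = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 1", "Tier 1", "Tier 2", "Tier 2", "Tier 2"],
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9],
    "Item_Outlet_Sales": [100.0, 200.0, 50.0, 75.0, 25.0],
})

grouped = toy.groupby("Outlet_Location_Type")

# count() ignores NaN, so it reports non-null values per group.
non_null_weights = grouped["Item_Weight"].count()

# size() counts every row in the group, missing values included.
rows_per_group = grouped.size()

# Selecting a column first restricts the aggregation to that column.
total_sales = grouped["Item_Outlet_Sales"].sum()

print(non_null_weights)
print(rows_per_group)
print(total_sales)
```

If the goal is simply the number of rows per group, size() is usually the safer choice, since its result does not depend on which columns contain missing values.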

The Split-Apply-Combine Strategy

You just saw how quickly you can get an insight into grouped data using the Pandas GroupBy function. But, behind the scenes, a lot is taking place, which is important to understand to gauge the true power of GroupBy.

GroupBy employs the Split-Apply-Combine strategy coined by Hadley Wickham in his paper in 2011. Using this strategy, a data analyst can break down a big problem into manageable parts, perform operations on individual parts and combine them back together to answer a specific question.

I want to show you how this strategy works in GroupBy by working with a sample dataset to get the average height for males and females in a group. Let’s create that dataset:

In [11]:
data = {'Gender':['m','f','f','m','f','m','m'],'Height':[172,171,169,173,170,175,178]}
df_sample = pd.DataFrame(data)
df_sample

Unnamed: 0,Gender,Height
0,m,172
1,f,171
2,f,169
3,m,173
4,f,170
5,m,175
6,m,178


Splitting the data into separate groups:

In [12]:
f_filter = df_sample['Gender']=='f'
print(df_sample[f_filter])

m_filter = df_sample['Gender']=='m'
print(df_sample[m_filter])

  Gender  Height
1      f     171
2      f     169
4      f     170
  Gender  Height
0      m     172
3      m     173
5      m     175
6      m     178


Applying the operation that we need to perform (average in this case):

In [13]:
f_avg = df_sample[f_filter]['Height'].mean()

m_avg = df_sample[m_filter]['Height'].mean()

print(f_avg,m_avg)

170.0 174.5


Combining the results into an output DataFrame:

In [14]:
df_output = pd.DataFrame({'Gender':['f','m'],'Height':[f_avg,m_avg]})
df_output

Unnamed: 0,Gender,Height
0,f,170.0
1,m,174.5


All three of these steps can be achieved with GroupBy in a single line of code! Here’s how:

In [15]:
df_sample.groupby('Gender').mean()

Gender,Height
f,170.0
m,174.5
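
The GroupBy result uses Gender as the index, while the manual version kept it as an ordinary column. As a small sketch, reset_index() turns the group labels back into a column so that the two results have the same layout:

```python
import pandas as pd

data = {'Gender': ['m', 'f', 'f', 'm', 'f', 'm', 'm'],
        'Height': [172, 171, 169, 173, 170, 175, 178]}
df_sample = pd.DataFrame(data)

# Manual split-apply-combine, as in the steps above.
f_avg = df_sample[df_sample['Gender'] == 'f']['Height'].mean()
m_avg = df_sample[df_sample['Gender'] == 'm']['Height'].mean()
manual = pd.DataFrame({'Gender': ['f', 'm'], 'Height': [f_avg, m_avg]})

# The same result with groupby; reset_index() moves Gender out of the index.
grouped_means = df_sample.groupby('Gender')['Height'].mean().reset_index()

print(manual)
print(grouped_means)
```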


Loop Over GroupBy Groups

Remember the GroupBy object we created at the beginning of this article? Don’t worry, we’ll create it again:

In [16]:
df.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

In [18]:
obj = df.groupby('Outlet_Location_Type')
obj

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F2E7DDFAA0>

We can display the row indices in each group by accessing the groups attribute of the GroupBy object:

In [19]:
obj.groups

{'Tier 1': [0, 2, 10, 11, 12, 13, 15, 17, 23, 24, 29, 34, 35, 40, 42, 48, 49, 50, 57, 58, 59, 63, 69, 70, 74, 75, 76, 77, 80, 81, 83, 88, 89, 91, 95, 96, 99, 102, 108, 110, 112, 115, 126, 131, 135, 143, 145, 154, 163, 164, 178, 182, 186, 187, 189, 190, 191, 195, 196, 197, 204, 206, 208, 220, 222, 225, 227, 234, 236, 248, 250, 252, 255, 270, 274, 284, 289, 295, 297, 299, 301, 308, 311, 312, 321, 324, 334, 336, 344, 345, 346, 347, 348, 353, 354, 355, 356, 358, 361, 363, ...], 'Tier 2': [8, 9, 19, 22, 25, 26, 33, 46, 47, 53, 54, 56, 61, 66, 67, 68, 72, 73, 78, 79, 85, 86, 92, 93, 94, 97, 100, 107, 111, 114, 116, 117, 118, 120, 121, 123, 124, 125, 127, 129, 137, 138, 140, 141, 142, 144, 146, 147, 148, 149, 150, 157, 158, 165, 166, 170, 171, 176, 179, 181, 188, 192, 200, 201, 202, 207, 210, 211, 212, 213, 219, 221, 223, 228, 232, 233, 240, 241, 242, 243, 244, 245, 247, 249, 254, 256, 258, 259, 261, 262, 263, 264, 268, 273, 277, 281, 283, 285, 288, 290, ...], 'Tier 3': [1, 3, 4, 5, 6, 7, 14,

We can even iterate over all of the groups:

In [20]:
for name,group in obj:
    print(name,'contains',group.shape[0],'rows')

Tier 1 contains 2388 rows
Tier 2 contains 2785 rows
Tier 3 contains 3350 rows


But what if you want to get a specific group out of all the groups? Well, don’t worry. Pandas has a solution for that too.

Just provide the specific group name when calling get_group on the group object. Here, I want to check out the features for the ‘Tier 1’ group of locations only:

In [21]:
obj.get_group('Tier 1')

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.30,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
2,FDN15,17.50,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
10,FDY07,11.80,Low Fat,0.000000,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
11,FDA03,18.50,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.1530
12,FDX32,15.10,Regular,0.100014,Fruits and Vegetables,145.4786,OUT049,1999,Medium,Tier 1,Supermarket Type1,1589.2646
...,...,...,...,...,...,...,...,...,...,...,...,...
8480,FDQ58,,Low Fat,0.000000,Snack Foods,154.5340,OUT019,1985,Small,Tier 1,Grocery Store,459.4020
8490,FDU44,,Regular,0.102296,Fruits and Vegetables,162.3552,OUT019,1985,Small,Tier 1,Grocery Store,487.3656
8492,FDT34,9.30,Low Fat,0.174350,Snack Foods,104.4964,OUT046,1997,Small,Tier 1,Supermarket Type1,2419.5172
8517,FDF53,20.75,reg,0.083607,Frozen Foods,178.8318,OUT046,1997,Small,Tier 1,Supermarket Type1,3608.6360
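
For a quick check of what iteration and get_group return, here is a minimal sketch on an invented frame (the `tier` and `sales` columns are assumptions for illustration only):

```python
import pandas as pd

# Invented data; only the structure mirrors the example above.
toy = pd.DataFrame({
    "tier": ["Tier 1", "Tier 2", "Tier 1", "Tier 3"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})
obj = toy.groupby("tier")

# Iterating yields (group name, sub-DataFrame) pairs.
sizes = {}
for name, group in obj:
    sizes[name] = len(group)

# get_group returns the rows of one group, keeping their original index labels.
tier1 = obj.get_group("Tier 1")

print(sizes)
print(tier1)
```

The rows returned by get_group keep their labels from the original DataFrame, which is why the output above starts at 0, 2, 10 rather than being renumbered.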


Now isn’t that wonderful! You have all the Tier 1 rows to work with. But wait, didn’t I say that GroupBy is lazy and doesn’t do anything unless explicitly told to? Alright then, let’s see GroupBy in action with the aggregate functions.

Applying Functions to GroupBy Groups

The apply step is the heart of a pandas GroupBy operation: this is where we perform a variety of operations using aggregation, transformation, filtration, or even our own function!

Let’s have a look at these in detail.

Aggregation

We have looked at some aggregation functions in the article so far, such as count, sum, and mean. These compute a summary statistic for each group. Here are the most commonly used aggregate functions in pandas:

    count() – Number of non-null observations
    sum() – Sum of values
    mean() – Mean of values
    median() – Arithmetic median of values
    min() – Minimum
    max() – Maximum
    mode() – Mode (not a built-in GroupBy method; pass pd.Series.mode to agg(), as done further below)
    std() – Standard deviation
    var() – Variance
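
As a rough illustration of the functions listed above, the sketch below calls each built-in one on a small invented frame and collects the results in a single table. The `group` and `value` column names are assumptions:

```python
import pandas as pd

# Invented values chosen so the results are easy to verify by hand.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, 2.0, 6.0, 4.0, 8.0],
})
g = toy.groupby("group")["value"]

# Each call returns one value per group; std() and var() use the
# sample definition (ddof=1) by default.
summary = pd.DataFrame({
    "count": g.count(),
    "sum": g.sum(),
    "mean": g.mean(),
    "median": g.median(),
    "min": g.min(),
    "max": g.max(),
    "std": g.std(),
    "var": g.var(),
})

print(summary)
```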

But the agg() function in Pandas gives us the flexibility to perform several statistical computations all at once! Here is how it works:

In [26]:
result = df.groupby('Outlet_Location_Type').agg({'Item_Outlet_Sales': ['mean', 'median']})
print(result)

                     Item_Outlet_Sales           
                                  mean     median
Outlet_Location_Type                             
Tier 1                     1876.909159  1487.3972
Tier 2                     2323.990559  2004.0580
Tier 3                     2279.627651  1812.3076


We can even run GroupBy with multiple grouping keys to get better insights from our data:

In [27]:
# Passing 'mean' as a string avoids the FutureWarning that newer pandas versions emit for np.mean
df.groupby(['Outlet_Location_Type','Outlet_Establishment_Year'],as_index=False).agg({'Outlet_Size':pd.Series.mode,'Item_Outlet_Sales':'mean'})


Unnamed: 0,Outlet_Location_Type,Outlet_Establishment_Year,Outlet_Size,Item_Outlet_Sales
0,Tier 1,1985,Small,340.329723
1,Tier 1,1997,Small,2277.844267
2,Tier 1,1999,Medium,2348.354635
3,Tier 2,2002,[],2192.384798
4,Tier 2,2004,Small,2438.841866
5,Tier 2,2007,[],2340.675263
6,Tier 3,1985,Medium,3694.038558
7,Tier 3,1987,High,2298.995256
8,Tier 3,1998,[],339.351662
9,Tier 3,2009,Medium,1995.498739


Notice that I have used different aggregation functions for different column names by passing them in a dictionary with the corresponding operation to be performed. This allowed me to group and apply computations on nominal and numeric features simultaneously.

Also, I have changed the value of the as_index parameter to False. This way, the grouped index would not be output as an index.
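
To make the effect of as_index concrete, here is a small sketch on invented data (the `tier`, `year`, and `sales` columns are assumptions). It compares the default result, where the group keys form the index, with as_index=False, where they stay as regular columns:

```python
import pandas as pd

# Invented data for illustration only.
toy = pd.DataFrame({
    "tier": ["T1", "T1", "T1", "T2"],
    "year": [1999, 1999, 2004, 2004],
    "sales": [10.0, 30.0, 50.0, 70.0],
})

# Default: the grouping keys become a MultiIndex on the rows.
indexed = toy.groupby(["tier", "year"]).agg({"sales": "mean"})

# as_index=False: the grouping keys are kept as ordinary columns.
flat = toy.groupby(["tier", "year"], as_index=False).agg({"sales": "mean"})

print(indexed)
print(flat)
```

Calling reset_index() on the default result gives the same flat layout, so as_index=False is mostly a convenience when the keys are needed as columns right away.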

We can even name the aggregated columns to make them easier to read. Because we group by two columns here, the result has a MultiIndex on the rows:

In [28]:
# Passing 'mean' as a string avoids the FutureWarning that newer pandas versions emit for np.mean
df.groupby(['Outlet_Type','Item_Type']).agg(mean_MRP=('Item_MRP','mean'),mean_Sales=('Item_Outlet_Sales','mean'))


Outlet_Type,Item_Type,mean_MRP,mean_Sales
Grocery Store,Baking Goods,126.438068,292.082544
Grocery Store,Breads,146.452873,381.967442
Grocery Store,Breakfast,147.026989,412.831042
Grocery Store,Canned,138.080808,352.864879
Grocery Store,Dairy,147.166715,341.866589
...,...,...,...
Supermarket Type3,Others,106.779053,2700.928667
Supermarket Type3,Seafood,124.028286,2687.073686
Supermarket Type3,Snack Foods,144.574508,3745.168739
Supermarket Type3,Soft Drinks,123.313587,3284.938836


It is amazing how a name change can improve the understandability of the output!
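
The difference between the dictionary form and named aggregation shows up in the column labels. The sketch below uses invented data (the `outlet` and `mrp` columns are assumptions) to compare the two:

```python
import pandas as pd

# Invented data for illustration only.
toy = pd.DataFrame({
    "outlet": ["A", "A", "B", "B"],
    "mrp": [100.0, 200.0, 50.0, 150.0],
})

# Dictionary with a list of functions: the columns become a two-level MultiIndex.
nested = toy.groupby("outlet").agg({"mrp": ["mean", "max"]})

# Named aggregation: each keyword is new_name=(source_column, function),
# and the result has flat, readable column names.
named = toy.groupby("outlet").agg(mean_mrp=("mrp", "mean"), max_mrp=("mrp", "max"))

print(nested.columns.tolist())
print(named.columns.tolist())
```

Flat names are generally easier to work with downstream, for example when merging the summary back into another DataFrame.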

Transformation

Transformation allows us to perform some computation on the groups as a whole and then return the combined DataFrame. This is done using the transform() function.

We will try to impute the missing (null) values in the Item_Weight column using the transform() function.

The Item_Fat_Content and Item_Type will affect the Item_Weight, don’t you think? So, let’s group the DataFrame by these columns and handle the missing weights using the mean of these groups:

In [29]:
df['Item_Weight'] = df.groupby(['Item_Fat_Content','Item_Type'])['Item_Weight'].transform(lambda x: x.fillna(x.mean()))

In [30]:
df['Item_Weight']

0        9.300
1        5.920
2       17.500
3       19.200
4        8.930
         ...  
8518     6.865
8519     8.380
8520    10.600
8521     7.210
8522    14.800
Name: Item_Weight, Length: 8523, dtype: float64

With transform(), the function is applied to each group, and the result has the same length and index as the input, so it can be assigned straight back to the original column.
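
The same idea can be sketched on a small invented frame. This variant is an alternative to the lambda used above, not the original approach: it computes the group means with transform('mean') and then fills the gaps with fillna. The `item_type` and `weight` columns are assumptions:

```python
import numpy as np
import pandas as pd

# Invented data with two missing weights.
toy = pd.DataFrame({
    "item_type": ["Dairy", "Dairy", "Dairy", "Meat", "Meat"],
    "weight": [9.0, np.nan, 11.0, 20.0, np.nan],
})

# transform('mean') broadcasts each group's mean back to every row of that group,
# so the result lines up with the original rows.
group_mean = toy.groupby("item_type")["weight"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["weight_filled"] = toy["weight"].fillna(group_mean)

print(toy)
```

Because the transformed Series has one value per original row, no merge or join is needed to bring the group statistic back alongside the data.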

#### Filtration

Filtration allows us to discard certain values based on computation and return only a subset of the group. We can do this using the filter() function in Pandas.

Let’s take a look at the number of rows in our DataFrame presently:

In [31]:
df.shape

(8523, 12)

If I wanted only those groups whose Item_Weight values have a standard deviation below 3, I could use the filter function to do the job:

In [32]:
def filter_func(x):
    return x['Item_Weight'].std() < 3

df_filter = df.groupby(['Item_Weight']).filter(filter_func)
df_filter.shape

(8510, 12)

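GroupBy has returned a DataFrame containing only the rows from groups whose Item_Weight standard deviation is below 3. Note that this example groups by Item_Weight itself, so all values within a group are identical: the standard deviation is 0 for any group with two or more rows and NaN for a single-row group, and NaN < 3 is False. That is why so few rows were dropped; grouping by a different column, such as Item_Type, would make the filter more meaningful.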
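
As a minimal sketch of filter() with a categorical grouping key, the example below uses invented data (the `item_type` and `weight` columns and the `low_spread` helper are assumptions) and keeps only the groups whose weights vary little:

```python
import pandas as pd

# Invented data: Dairy weights are tightly clustered, Meat weights are spread out.
toy = pd.DataFrame({
    "item_type": ["Dairy", "Dairy", "Dairy", "Meat", "Meat", "Meat"],
    "weight": [9.0, 10.0, 11.0, 5.0, 15.0, 25.0],
})

def low_spread(group):
    # Return True to keep every row of the group, False to drop them all.
    return group["weight"].std() < 3

kept = toy.groupby("item_type").filter(low_spread)

print(kept)
```

Here the Dairy group has a standard deviation of 1 and is kept, while the Meat group has a standard deviation of 10 and is removed as a whole; filter() never drops individual rows within a group.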

Applying Our Own Functions

Pandas’ apply() function applies a function along an axis of the DataFrame. When using it with the GroupBy function, we can apply any function to the grouped result.

For example, if I wanted to center the Item_MRP values around the mean of their establishment year group, I could use the apply() function to do just that:

In [33]:
df_apply = df.groupby(['Outlet_Establishment_Year'])['Item_MRP'].apply(lambda x: x - x.mean())
df_apply

Outlet_Establishment_Year      
1985                       7       -32.034285
                           18      -26.513085
                           21        4.747915
                           23      -32.102685
                           29      -96.151085
                                      ...    
2009                       8506    121.512366
                           8511    120.912366
                           8515     15.850166
                           8516    -82.919834
                           8521    -38.545434
Name: Item_MRP, Length: 8523, dtype: float64

Here, the values have been centered, and you can check whether the item was sold at an MRP above or below the mean MRP for that year.
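
The output above carries the establishment year as an extra index level. If the centered values need to line up with the original rows, for example to store them as a new column, one option is transform(). The following sketch uses invented data (the `year` and `mrp` columns are assumptions) and is an alternative to apply(), not a replacement for it:

```python
import pandas as pd

# Invented data for illustration only.
toy = pd.DataFrame({
    "year": [1985, 1985, 2009, 2009],
    "mrp": [100.0, 140.0, 50.0, 70.0],
})

# Group means broadcast back to each row, then subtracted from the raw values.
centered = toy["mrp"] - toy.groupby("year")["mrp"].transform("mean")

# The result shares the original index, so it can be stored as a column directly.
toy["mrp_centered"] = centered

print(toy)
```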

Key Takeaways

    groupby() is a powerful pandas method that lets you group data by one or more columns.
    You can apply many operations to a GroupBy object, including aggregation functions like sum(), mean(), and count(), as well as lambda functions and other custom functions using apply().
    The output of a groupby() operation can be a pandas Series or DataFrame, depending on the operation and on whether a single column was selected.
