
# GroupBy in Pandas

The groupby() method allows you to:

- Split the data into groups based on a column

- Apply aggregate functions (mean, sum, std, etc.)

- Combine results into a summary

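The three steps above can be sketched by hand to show what groupby() automates. This is only an illustration: the `toy` DataFrame and the names `pieces`, `means`, and `manual` are hypothetical and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 30, 50],
})

# Split: one sub-DataFrame per distinct value of 'team'.
pieces = {name: toy[toy['team'] == name] for name in toy['team'].unique()}

# Apply: compute an aggregate (here the mean score) on each piece.
means = {name: piece['score'].mean() for name, piece in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name='score').sort_index()
manual.index.name = 'team'

# groupby() performs the same three steps in a single call.
grouped = toy.groupby('team')['score'].mean()

print(manual)
print(grouped)
```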
**This makes it easy to analyze subsets of your data. Let's create a DataFrame to practice.**

In [19]:
import pandas as pd

data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Steve', 'Rachel'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
    'YearsExperience': [3, 5, 2, 4, 6, 7]
}

df = pd.DataFrame(data)

In [23]:
df

| | Department | Employee | Salary | YearsExperience |
|---|---|---|---|---|
| 0 | Sales | John | 60000 | 3 |
| 1 | Sales | Anna | 65000 | 5 |
| 2 | HR | Peter | 52000 | 2 |
| 3 | HR | Linda | 58000 | 4 |
| 4 | IT | Steve | 75000 | 6 |
| 5 | IT | Rachel | 78000 | 7 |



**Now you can use the .groupby() method to group rows by the values in a column. For instance, let's group by Department. This creates a DataFrameGroupBy object:**

In [27]:
df.groupby('Department')


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x139eab1d0>

At this point, pandas has only created a DataFrameGroupBy object; no aggregation has been computed yet.

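Although nothing has been aggregated yet, the grouped object can already be inspected. The sketch below rebuilds the same DataFrame so it runs on its own; the variable names `grouped` and `it_rows` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Steve', 'Rachel'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
    'YearsExperience': [3, 5, 2, 4, 6, 7],
})

grouped = df.groupby('Department')

# Number of distinct groups found in the 'Department' column.
print(grouped.ngroups)  # 3

# Iterating yields (group_name, sub_DataFrame) pairs, sorted by group key.
for name, group in grouped:
    print(name, list(group['Employee']))

# get_group() returns the rows that belong to a single group.
it_rows = grouped.get_group('IT')
print(it_rows)
```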
Save the grouped object in a variable for reuse. Select only the numeric columns, since functions like mean and std only work on numeric data:

In [49]:
by_dept = df.groupby('Department')[['Salary', 'YearsExperience']]

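As an alternative to listing the numeric columns by name, the aggregation methods accept a `numeric_only` argument. The sketch below compares both approaches on the same data; `selected` and `numeric` are names introduced here, and the exact default of `numeric_only` has changed between pandas versions, so passing it explicitly is the safer choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Steve', 'Rachel'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
    'YearsExperience': [3, 5, 2, 4, 6, 7],
})

# Option 1: pick the numeric columns explicitly, as in the cell above.
selected = df.groupby('Department')[['Salary', 'YearsExperience']].mean()

# Option 2: group everything and let pandas skip non-numeric columns.
numeric = df.groupby('Department').mean(numeric_only=True)

# Both approaches drop the 'Employee' column and give the same averages.
print(list(numeric.columns))
print(selected.equals(numeric))
```

Selecting columns explicitly documents which columns the analysis depends on, which tends to be easier to maintain when the DataFrame gains new columns later.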

Then call aggregate methods on the object. For example, the average salary and experience per department:

In [55]:
by_dept.mean()

| Department | Salary | YearsExperience |
|---|---|---|
| HR | 55000.0 | 3.0 |
| IT | 76500.0 | 6.5 |
| Sales | 62500.0 | 4.0 |

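The result of an aggregation is an ordinary DataFrame whose index holds the group keys, so the usual DataFrame tools apply to it. A minimal sketch, assuming the same data as above (the names `means`, `ranked`, and `flat` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
    'YearsExperience': [3, 5, 2, 4, 6, 7],
})

means = df.groupby('Department')[['Salary', 'YearsExperience']].mean()

# Look up one department by its index label.
print(means.loc['IT'])

# Rank departments from highest to lowest average salary.
ranked = means.sort_values('Salary', ascending=False)
print(ranked)

# Turn the 'Department' index back into a regular column.
flat = means.reset_index()
print(flat)
```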

More examples of aggregate methods:

In [60]:
by_dept.std()

| Department | Salary | YearsExperience |
|---|---|---|
| HR | 4242.640687 | 1.414214 |
| IT | 2121.320344 | 0.707107 |
| Sales | 3535.533906 | 1.414214 |

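The values above are sample standard deviations: pandas divides by n - 1 by default (`ddof=1`). The sketch below checks the HR salary figure by hand and shows the population version with `ddof=0`; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
})

salary_by_dept = df.groupby('Department')['Salary']

# Default: sample standard deviation (divide by n - 1).
sample_std = salary_by_dept.std()

# Population standard deviation (divide by n).
population_std = salary_by_dept.std(ddof=0)

# Manual check for HR: deviations from the mean are -3000 and +3000.
hr = df.loc[df['Department'] == 'HR', 'Salary']
manual_hr = np.sqrt(((hr - hr.mean()) ** 2).sum() / (len(hr) - 1))

print(sample_std['HR'], manual_hr)   # both about 4242.64
print(population_std['HR'])          # 3000.0
```

With only two rows per group, the two definitions differ noticeably, so it is worth knowing which one a report expects.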

In [62]:
by_dept.min()

| Department | Salary | YearsExperience |
|---|---|---|
| HR | 52000 | 2 |
| IT | 75000 | 6 |
| Sales | 60000 | 3 |


In [65]:
by_dept.max()

| Department | Salary | YearsExperience |
|---|---|---|
| HR | 58000 | 4 |
| IT | 78000 | 7 |
| Sales | 65000 | 5 |


In [67]:
by_dept.count()

| Department | Salary | YearsExperience |
|---|---|---|
| HR | 2 | 2 |
| IT | 2 | 2 |
| Sales | 2 | 2 |

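Note that count() reports the number of non-missing values per column, which is not always the number of rows in the group. The sketch below uses a small hypothetical DataFrame (`gaps`, not part of this notebook's data) with one missing salary to contrast count() with size().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary in HR.
gaps = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT'],
    'Salary': [52000, np.nan, 75000],
})

# count() skips missing values, so HR has only one counted salary.
counts = gaps.groupby('Department')['Salary'].count()

# size() counts rows per group, whether or not values are missing.
sizes = gaps.groupby('Department').size()

print(counts)
print(sizes)
```

In the employee data above there are no missing values, so both methods would report 2 for every department.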

In [71]:
by_dept.describe()

| Department | Salary count | Salary mean | Salary std | Salary min | Salary 25% | Salary 50% | Salary 75% | Salary max | YearsExperience count | YearsExperience mean | YearsExperience std | YearsExperience min | YearsExperience 25% | YearsExperience 50% | YearsExperience 75% | YearsExperience max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HR | 2.0 | 55000.0 | 4242.640687 | 52000.0 | 53500.0 | 55000.0 | 56500.0 | 58000.0 | 2.0 | 3.0 | 1.414214 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 |
| IT | 2.0 | 76500.0 | 2121.320344 | 75000.0 | 75750.0 | 76500.0 | 77250.0 | 78000.0 | 2.0 | 6.5 | 0.707107 | 6.0 | 6.25 | 6.5 | 6.75 | 7.0 |
| Sales | 2.0 | 62500.0 | 3535.533906 | 60000.0 | 61250.0 | 62500.0 | 63750.0 | 65000.0 | 2.0 | 4.0 | 1.414214 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 |

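When describe() returns more statistics than needed, agg() can request only specific ones. The sketch below shows two common forms, a list of function names and named aggregation; the output column names `avg_salary` and `max_experience` are invented here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [60000, 65000, 52000, 58000, 75000, 78000],
    'YearsExperience': [3, 5, 2, 4, 6, 7],
})

# Several statistics for one column: one output column per function.
summary = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])
print(summary)

# Named aggregation: new_column=(source_column, function).
named = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_experience=('YearsExperience', 'max'),
)
print(named)
```

Named aggregation yields flat, descriptive column names, which avoids the two-level column header that describe() produces.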

In [73]:
by_dept.describe().transpose()

Unnamed: 0,Department,HR,IT,Sales
Salary,count,2.0,2.0,2.0
Salary,mean,55000.0,76500.0,62500.0
Salary,std,4242.640687,2121.320344,3535.533906
Salary,min,52000.0,75000.0,60000.0
Salary,25%,53500.0,75750.0,61250.0
Salary,50%,55000.0,76500.0,62500.0
Salary,75%,56500.0,77250.0,63750.0
Salary,max,58000.0,78000.0,65000.0
YearsExperience,count,2.0,2.0,2.0
YearsExperience,mean,3.0,6.5,4.0
