---
title: "A Python Blog Post I Often Reach For"  
date: 2020-07-26  
categories:  
  - Python  
tags:  
  - pandas  
slug: "pandas-grouping-data"  
image:
  caption: ''
  focal_point: ''
  preview_only: yes
links:
  donate_button:
    icon: seedling
    icon_pack: fas
    name: Ways to Support
    url: /support/

---

<!-- Icon Image: Small -->
<img src="featured.png" width="100"/> 

I do a lot of grouping in pandas, and I reach for [this blog post by Shane Lynn](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/) all the time to remind me how to get it done. Grouping in pandas can be tricky, especially when I want to add a column to the original dataframe based on a grouped calculation. The technique in that post solves exactly that, and I'm super grateful for it.

## So Group, Already
Learning works best when you try it out yourself, so let's give it a go! 

Say I have some census data and I want to group it by state and sum the population within each group. The dataset is from [US Census Demographics Data](https://www.kaggle.com/muonneutrino/us-census-demographic-data/data) on Kaggle.

```python
import pandas as pd

# Load the county-level file and keep only the columns needed here
df = pd.read_csv("/Users/bogart/Downloads/7001_312628_bundle_archive/acs2017_county_data.csv")
df = df[['State', 'County', 'TotalPop']]

df.head()
```

|   | State | County | TotalPop |
|---|---|---|---|
| 0 | Alabama | Autauga County | 55036 |
| 1 | Alabama | Baldwin County | 203360 |
| 2 | Alabama | Barbour County | 26201 |
| 3 | Alabama | Bibb County | 22580 |
| 4 | Alabama | Blount County | 57667 |


Using the method from the blog post, I can make a grouped dataframe that sums the county populations for each state:

```python
df.groupby('State').agg(
    state_pop=pd.NamedAgg(column='TotalPop', aggfunc='sum')
).head(7)
```

| State | state_pop |
|---|---|
| Alabama | 4850771 |
| Alaska | 738565 |
| Arizona | 6809946 |
| Arkansas | 2977944 |
| California | 38982847 |
| Colorado | 5436519 |
| Connecticut | 3594478 |
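To try the same pattern without downloading the Kaggle file, here is a minimal sketch on a tiny made-up dataframe. The rows in `toy` and the extra output columns (`n_counties`, `biggest_county_pop`) are invented for illustration and are not census figures. It also uses the tuple shorthand that pandas accepts in place of `pd.NamedAgg`, and computes more than one aggregate in a single call.

```python
import pandas as pd

# Invented rows for illustration only; these are not real census numbers.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B", "B"],
    "County": ["a1", "a2", "b1", "b2", "b3"],
    "TotalPop": [10, 20, 5, 15, 30],
})

# Each keyword becomes an output column: name=(source column, aggregation).
state_summary = toy.groupby("State").agg(
    state_pop=("TotalPop", "sum"),
    n_counties=("County", "count"),
    biggest_county_pop=("TotalPop", "max"),
)

print(state_summary)
```

The result is indexed by `State`, with one row per group and one column per keyword passed to `.agg()`.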


If I wrap it in a `.join()` on `State`, I can add the state totals back to the original dataframe as a new column to use later:

```python
df.join(
    df.groupby('State').agg(
        state_pop=pd.NamedAgg(column='TotalPop', aggfunc='sum')
    ),
    on='State',
).head(7)
```

|   | State | County | TotalPop | state_pop |
|---|---|---|---|---|
| 0 | Alabama | Autauga County | 55036 | 4850771 |
| 1 | Alabama | Baldwin County | 203360 | 4850771 |
| 2 | Alabama | Barbour County | 26201 | 4850771 |
| 3 | Alabama | Bibb County | 22580 | 4850771 |
| 4 | Alabama | Blount County | 57667 | 4850771 |
| 5 | Alabama | Bullock County | 10478 | 4850771 |
| 6 | Alabama | Butler County | 20126 | 4850771 |
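As an aside, pandas also has `groupby(...).transform(...)`, which returns a result aligned to the original rows, so a group total can be assigned as a column without a join. This is a different technique from the one in Shane Lynn's post. The sketch below uses an invented `toy` dataframe (not census data) and compares the two routes.

```python
import pandas as pd

# Invented rows for illustration only; these are not real census numbers.
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B", "B"],
    "County": ["a1", "a2", "b1", "b2", "b3"],
    "TotalPop": [10, 20, 5, 15, 30],
})

# Route 1: aggregate per state, then join the totals back on the State column.
joined = toy.join(
    toy.groupby("State").agg(state_pop=("TotalPop", "sum")),
    on="State",
)

# Route 2: transform broadcasts each group's sum back onto its own rows.
transformed = toy.assign(
    state_pop=toy.groupby("State")["TotalPop"].transform("sum")
)

# Check that both routes produce the same values, row by row.
same = bool((joined["state_pop"] == transformed["state_pop"]).all())
print(same)
```

The join route is handy when the summary has several columns to bring back at once; `transform` is shorter when only one column is needed.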


## Overall

Grouping data comes up a lot, but the mechanics can be hard to remember. It can also feel sometimes like blog posts are shouts into a void, but they can be the best teaching tools out there! Gotta say thanks again to Shane Lynn; I revisit [the screenshot at the top of his blog post](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2019/10/pandas-python-group-by-named-aggregation-update.jpg) often.

#### Image Credit
[merged dataframes](https://thenounproject.com/search/?q=dataframe&creator=4129988&i=3097973) by Zach Bogart from [the Noun Project](https://thenounproject.com/) 