# Pandas Groupby
The [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method splits a DataFrame into groups based on the values in one or more columns, then applies an aggregation (such as a mean) to each group. We will experiment with this method on census income data in the notebook below.

In [1]:
# Load `census_income_data.csv`
import pandas as pd

df = pd.read_csv("census_income_data.csv")

In [2]:
# view the columns available
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [3]:
# calculate the mean values for all numeric columns
df.mean(numeric_only=True)

age                   38.581647
fnlwgt            189778.366512
education-num         10.080679
capital-gain        1077.648844
capital-loss          87.303830
hours-per-week        40.437456
dtype: float64

In [4]:
# group by "workclass" to see the mean of every numeric column within each group
df.groupby("workclass").mean(numeric_only=True)

workclass,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Federal-gov,42.590625,185221.24375,10.973958,833.232292,112.26875,41.379167
Local-gov,41.751075,188639.712852,11.042045,880.20258,109.854276,40.9828
Never-worked,20.571429,225989.571429,7.428571,0.0,0.0,28.428571
Private,36.797585,192764.114734,9.879714,889.217792,80.008724,40.267096
Self-emp-inc,46.017025,175981.344086,11.137097,4875.693548,155.138889,48.8181
Self-emp-not-inc,44.969697,175608.64148,10.226289,1886.061787,116.631641,44.421881
State-gov,39.436055,184136.613251,11.375963,701.699538,83.256549,39.031587
Without-pay,47.785714,174267.5,9.071429,487.857143,0.0,32.714286
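
To make the grouping step concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are hypothetical and are not taken from `census_income_data.csv`). It computes the per-group means with `groupby` and then reproduces one of them by filtering the rows by hand, which is roughly what `groupby` does for every group at once.

```python
import pandas as pd

# Hypothetical toy data, not taken from census_income_data.csv
toy = pd.DataFrame(
    {
        "workclass": ["Private", "Private", "State-gov", "State-gov"],
        "age": [30, 40, 50, 60],
        "hours-per-week": [40, 50, 35, 45],
    }
)

# One row per workclass, one column per numeric column
group_means = toy.groupby("workclass").mean(numeric_only=True)

# The same statistic computed by hand for a single group
private_age_mean = toy.loc[toy["workclass"] == "Private", "age"].mean()

print(group_means)
print(private_age_mean)
```

The hand-filtered value should match the `Private` row of the `age` column in `group_means`.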


In [5]:
# group by "workclass" and "race" to see the mean of every numeric column for each combination
df.groupby(["workclass", "race"]).mean(numeric_only=True)

workclass,race,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
Federal-gov,Amer-Indian-Eskimo,42.842105,90124.105263,10.368421,0.0,0.0,40.315789
Federal-gov,Asian-Pac-Islander,40.931818,118511.75,12.045455,2813.454545,310.75,41.909091
Federal-gov,Black,41.633136,218892.331361,10.236686,547.39645,81.230769,40.769231
Federal-gov,Other,33.0,129977.285714,10.571429,0.0,0.0,42.142857
Federal-gov,White,43.002774,184442.266297,11.101248,809.432732,111.479889,41.510402
Local-gov,Amer-Indian-Eskimo,38.694444,94324.75,10.25,304.222222,0.0,39.444444
Local-gov,Asian-Pac-Islander,39.846154,181518.282051,11.74359,374.25641,85.410256,39.25641
Local-gov,Black,41.173611,225112.763889,10.611111,904.586806,73.201389,39.340278
Local-gov,Other,36.0,217902.9,9.7,115.1,188.7,41.7
Local-gov,White,41.988372,184497.97093,11.122674,904.095349,118.386628,41.325
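
Grouping by two columns produces a result whose row index has two levels (a `MultiIndex`). The sketch below, again on hypothetical toy data rather than the census file, shows one way to pull values out of such a result with `.loc`.

```python
import pandas as pd

# Hypothetical toy data, not taken from census_income_data.csv
toy = pd.DataFrame(
    {
        "workclass": ["Private", "Private", "Private", "State-gov"],
        "race": ["White", "White", "Black", "White"],
        "capital-gain": [100, 300, 50, 0],
    }
)

means = toy.groupby(["workclass", "race"]).mean(numeric_only=True)

# A tuple selects one (workclass, race) combination
private_white = means.loc[("Private", "White"), "capital-gain"]

# A single outer key selects every race within that workclass
private_only = means.loc["Private"]

print(means.index.nlevels)
print(private_white)
print(private_only)
```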


In [6]:
# set as_index=False to keep "workclass" and "race" as regular columns instead of the index
# select just "capital-gain" so that only that column's mean is computed
df.groupby(["workclass", "race"], as_index=False)["capital-gain"].mean()

,workclass,race,capital-gain
0,Federal-gov,Amer-Indian-Eskimo,0.0
1,Federal-gov,Asian-Pac-Islander,2813.454545
2,Federal-gov,Black,547.39645
3,Federal-gov,Other,0.0
4,Federal-gov,White,809.432732
5,Local-gov,Amer-Indian-Eskimo,304.222222
6,Local-gov,Asian-Pac-Islander,374.25641
7,Local-gov,Black,904.586806
8,Local-gov,Other,115.1
9,Local-gov,White,904.095349
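
As a rough illustration of what `as_index=False` changes, the sketch below compares it with grouping normally and then calling `reset_index()`. The `toy` data is hypothetical; the point is only that both routes leave the grouping keys as ordinary columns with a default integer index.

```python
import pandas as pd

# Hypothetical toy data, not taken from census_income_data.csv
toy = pd.DataFrame(
    {
        "workclass": ["Private", "Private", "Private", "State-gov"],
        "race": ["White", "White", "Black", "White"],
        "capital-gain": [100, 300, 50, 0],
    }
)

# Grouping keys stay as regular columns
flat = toy.groupby(["workclass", "race"], as_index=False)["capital-gain"].mean()

# Grouping keys become the index first, then are moved back into columns
via_reset = (
    toy.groupby(["workclass", "race"])["capital-gain"].mean().reset_index()
)

print(flat)
print(via_reset)
```

The flat form is often more convenient when the result feeds into a merge or a plot, while the indexed form is handier for label-based lookups.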
