https://www.kaggle.com/crawford/python-groupby-tutorial/code

In [1]:
import numpy as np
import pandas as pd
import random

In [2]:
# Random pets column
pet_list = ["cat", "hamster", "alligator", "snake"]
pet = [random.choice(pet_list) for i in range(1,15)]

# Random weight of animal column
weight = [random.choice(range(5,15)) for i in range(1,15)]

# Random length of animals column
length = [random.choice(range(1,10)) for i in range(1,15)]

# random age of the animals column
age = [random.choice(range(1,15)) for i in range(1,15)]

# Put everything into a dataframe
df = pd.DataFrame()
df["animal"] = pet
df["age"] = age
df["weight"] = weight
df["length"] = length

# make a groupby object
animal_groups = df.groupby("animal")



# Groupby 
---
*This tutorial roughly picks up from the <a href="https://www.kaggle.com/crawford/python-merge-tutorial/">Python Merge Tutorial</a> but also works as a standalone groupby tutorial. If you come from a SQL background and are familiar with GROUP BY, you can skim this to see some examples of the syntax.*
<br><br>

Groupby is a pretty simple concept: you split the data into groups by category and apply a function to each group. Simple as it is, it's an extremely valuable technique that's widely used in data science. The value of groupby really comes from its ability to aggregate data efficiently, both in performance and in the amount of code it takes. In real data science projects, you'll be dealing with large amounts of data and trying things over and over, so efficiency becomes an important consideration.
<br><br>

# Understanding Groupby
Here's a super simple dataframe to illustrate some examples. We'll be grouping the data by the "animal" column where there are four categories of animals: 
- alligators
- cats
- snakes
- hamsters

In [3]:
df

Unnamed: 0,animal,age,weight,length
0,hamster,9,10,6
1,cat,5,10,3
2,cat,8,6,3
3,hamster,2,14,1
4,hamster,4,13,4
5,hamster,10,7,7
6,cat,7,12,3
7,alligator,11,10,1
8,cat,11,12,8
9,hamster,8,14,3


One question we could ask about the animal data might be, "What's the average weight of all the snakes, cats, hamsters, and alligators?" To find the average weight of each category of animal, we'll group the animals by animal type and then apply the mean function. We could apply other functions too. We could apply "sum" to add up all the weights, "min" to find the lowest, "max" to get the highest, or "count" just to get a count of each animal type. 
<br><br>

Here is a short list of aggregation functions that I find handy. It is by no means a complete list of the possible operations.
<br>

<table>
<tr>
    <td><b>Summary statistics</b></td>
    <td><b>Numpy operations</b></td>
    <td><b>More complex operations</b></td>
</tr>
<tr>
    <td>mean</td>
    <td>np.mean</td>
    <td>.agg()</td>
</tr>
<tr>
    <td>median</td>
    <td>np.min</td>
    <td>agg(["mean", "median"])</td>
</tr>
<tr>
    <td>min</td>
    <td>np.max</td>
    <td>agg(custom_function)</td>
</tr>
<tr>
    <td>max</td>
    <td>np.sum</td>
</tr>
<tr>
    <td>sum</td>
    <td>np.prod</td>
</tr>
<tr>
    <td>describe</td>
</tr>
<tr>
    <td>count or size</td>
</tr>
</table>
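
As a rough illustration of the "more complex operations" column, the sketch below passes a list of function names and then a custom function to `.agg()`. The `demo` frame and the `weight_range` helper are made up for this example and are not part of the tutorial's random data.

```python
import pandas as pd

# Small hand-made frame (illustrative values, not the tutorial's random data)
demo = pd.DataFrame({
    "animal": ["cat", "cat", "hamster", "hamster", "snake"],
    "weight": [10, 6, 14, 12, 6],
})

def weight_range(values):
    """Hypothetical custom aggregation: heaviest minus lightest in a group."""
    return values.max() - values.min()

# Several built-in aggregations at once, named by string
summary = demo.groupby("animal")["weight"].agg(["mean", "min", "max", "count"])

# A custom function is passed as an object, without calling it
spread = demo.groupby("animal")["weight"].agg(weight_range)

print(summary)
print(spread)
```

Passing a list returns a DataFrame with one column per function, while a single function returns a Series indexed by group.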

<br><br>


In [4]:
df['age'].mean()

7.7142857142857144

In [5]:
df.describe()

Unnamed: 0,age,weight,length
count,14.0,14.0,14.0
mean,7.714286,10.428571,4.642857
std,3.04905,2.874672,2.648865
min,2.0,6.0,1.0
25%,5.5,8.5,3.0
50%,8.5,10.0,3.5
75%,10.0,12.75,6.75
max,11.0,14.0,9.0


Taking the mean over all animals ignores which category each animal belongs to:
- alligators
- cats
- snakes
- hamsters

The following two lines of code group the animals by type, then apply the mean function to the weight column.

In [6]:
# Group by animal category
animal_groups = df.groupby("animal")
type(animal_groups)

pandas.core.groupby.groupby.DataFrameGroupBy

The result is a GroupBy object, a new type of data structure (different from Series and DataFrame)!
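
A GroupBy object does not display its rows the way a DataFrame does, but it can be inspected. The following is a minimal sketch on a made-up `demo` frame (the values are illustrative assumptions), using `ngroups`, `size()`, and `get_group()`:

```python
import pandas as pd

# Hand-made frame for illustration only
demo = pd.DataFrame({
    "animal": ["cat", "hamster", "cat", "snake"],
    "age": [5, 9, 8, 3],
})
groups = demo.groupby("animal")

# Number of distinct groups and the row count of each one
n_groups = groups.ngroups
sizes = groups.size()

# Pull the rows of a single group back out as an ordinary DataFrame
cats = groups.get_group("cat")

print(n_groups)
print(sizes)
print(cats)
```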

In [7]:
# Apply mean function to weight column
animal_groups['weight'].mean()

animal
alligator    10.000000
cat           9.666667
hamster      12.000000
snake         6.000000
Name: weight, dtype: float64

Here's what happens when you run that code:


### 1. Group the unique values from the animal column 
<!--<img src="https://imgur.com/DRl1wil.jpg" width=400 alt="group stuff">-->
<img src="groupby_tutorial/DRl1wil.jpg" width=400 alt="group stuff">
<br><br>

### 2. Now there's a bucket for each group
<!--<img src="https://imgur.com/Q9fHw1O.jpg" width=250 alt="make buckets">-->
<img src="groupby_tutorial/Q9fHw1O.jpg" width=250 alt="make buckets">
<br><br>

### 3. Toss the other data into the buckets 
<!--<img src="https://imgur.com/A29SKAY.jpg" width=500 alt="add data">-->
<img src="groupby_tutorial/A29SKAY.jpg" width=500 alt="add data">
<br><br>

### 4. Apply a function on the weight column of each bucket
<!--<img src="https://imgur.com/xZnMuPZ.jpg" width=700 alt="calculate something">-->
<img src="groupby_tutorial/xZnMuPZ.jpg" width=700 alt="calculate something">
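
The four steps above can be sketched by hand. Iterating over a GroupBy object yields one (group name, sub-DataFrame) pair per bucket, so a plain loop can reproduce what `.mean()` does. The `demo` frame is a made-up stand-in for the tutorial's random data, and the loop is only meant to show the idea, not how pandas computes it internally.

```python
import pandas as pd

# Hand-made frame for illustration only
demo = pd.DataFrame({
    "animal": ["cat", "cat", "hamster", "hamster", "snake"],
    "weight": [10, 6, 14, 12, 6],
})

# Steps 1-3: each iteration hands back one bucket and the rows tossed into it
manual = {}
for name, bucket in demo.groupby("animal"):
    # Step 4: apply the function to the weight column of this bucket
    manual[name] = bucket["weight"].mean()

manual_means = pd.Series(manual)

# The one-line groupby version for comparison
builtin_means = demo.groupby("animal")["weight"].mean()

print(manual_means)
print(builtin_means)
```

Both approaches give the same numbers; the groupby call is shorter and avoids the explicit Python loop.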

Another example of using a groupby operation:

In [8]:
# Or apply the "max" function to the age column
animal_groups['age'].max()

animal
alligator    11
cat          11
hamster      10
snake        10
Name: age, dtype: int64
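
The two examples so far (mean weight and max age) can also be computed in a single call by giving `.agg()` a dictionary that maps column names to functions. This is a small sketch on a made-up `demo` frame, not the tutorial's random data:

```python
import pandas as pd

# Hand-made frame for illustration only
demo = pd.DataFrame({
    "animal": ["cat", "cat", "hamster", "snake"],
    "age": [5, 8, 9, 3],
    "weight": [10, 6, 14, 6],
})

# One aggregation per column: mean of weight, max of age
combined = demo.groupby("animal").agg({"weight": "mean", "age": "max"})

print(combined)
```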