# Grouping

1. Simple grouping
2. More complex grouping
    - Grouping on more than one categorical column
    - Grouping on more than one numeric column
    - Using more than one aggregation method
3. Pivot tables (2D grouping -- on two categorical columns)
4. Stacking and unstacking -- moving things from rows to columns and back

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
filename = '../data/taxi.csv'   # 10k taxi rides from NYC in 2015
df = pd.read_csv(filename)

In [4]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [6]:
# I want the mean trip_distance where passenger_count is 1

df.loc[
    df['passenger_count'] == 1     # row selector
    ,
    'trip_distance'    # column selector
].mean()

3.0923380047176354

In [7]:
# but what if I want to perform the same calculation for passenger_count == 2

df.loc[
    df['passenger_count'] == 2     # row selector
    ,
    'trip_distance'    # column selector
].mean()

3.3843869002284848

In [8]:
# but what if I want to perform the same calculation for passenger_count == 3

df.loc[
    df['passenger_count'] == 3     # row selector
    ,
    'trip_distance'    # column selector
].mean()

3.3423891625615765

What we're doing (manually) is taking each distinct value in `passenger_count` and running the same query on it.

The whole point of grouping is to ask Pandas to do this same task for us, calculating once for each distinct value in `passenger_count`.
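
The manual repetition above could be written as a loop. This is a minimal sketch on a tiny made-up DataFrame: the values are invented for illustration and do not come from `taxi.csv`, and the names `toy` and `manual_means` are hypothetical.

```python
import pandas as pd

# Tiny made-up data set (illustrative values, not from taxi.csv)
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 3],
    'trip_distance':   [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Run the same query once for each distinct value in passenger_count
manual_means = {}
for count in sorted(toy['passenger_count'].unique()):
    manual_means[int(count)] = float(toy.loc[
        toy['passenger_count'] == count,   # row selector
        'trip_distance'                    # column selector
    ].mean())

print(manual_means)   # {1: 2.0, 2: 3.0, 3: 5.0}
```

Grouping replaces this loop with a single expression.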

The way to think about grouping is as follows:

- One categorical column, on which we'll do the grouping. We'll get one result for each distinct value in this column.
- One numeric column, on which we'll perform the calculation.
- One aggregation method, which will be invoked on all of the values in the numeric column for each distinct value of the categorical column.

In our example above:

- `passenger_count` is categorical
- `trip_distance` is numeric
- `mean` is an aggregation method

In [9]:
# df.groupby(CATEGORICAL)[NUMERIC].AGGREGATION_METHOD()

df.groupby('passenger_count')['trip_distance'].mean()

passenger_count
0    4.600000
1    3.092338
2    3.384387
3    3.342389
4    3.628901
5    3.182712
6    3.170976
Name: trip_distance, dtype: float64
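
The three ingredients (categorical column, numeric column, aggregation method) can each be swapped independently. Here is a small sketch on made-up data that changes only the aggregation method; the `toy` DataFrame and its values are hypothetical and are not taken from the taxi data.

```python
import pandas as pd

# Tiny made-up data set (illustrative values, not from taxi.csv)
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 3],
    'trip_distance':   [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Choose the categorical and numeric columns once...
grouped = toy.groupby('passenger_count')['trip_distance']

# ...then apply whichever aggregation method we want
means = grouped.mean()
maxes = grouped.max()

# Each result is a Series indexed by the distinct values of passenger_count
print(means)
print(maxes)
```

In both results, the index holds one entry per distinct value of `passenger_count`, just as in the output above.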

In [None]:
# normally, it's not a bad thing that Pandas automatically sorts the distinct values in 
# passenger_count and then displays them in that order. However, fi