# Agenda
1. Grouping and pivot tables
2. More with multi-indexes (e.g., stack and unstack)
3. Joining, merging, and concatenating
4. Working with text



In [3]:
import pandas as pd

filename = 'taxi.csv'

df = pd.read_csv(filename,
                usecols=['VendorID', 'passenger_count', 'trip_distance',
                         'total_amount', 'payment_type'])

In [4]:
df.head()

   VendorID  passenger_count  trip_distance  payment_type  total_amount
0         2                1           1.63             2         17.80
1         2                1           0.46             1          8.30
2         2                1           0.87             1         11.00
3         2                1           2.13             1         17.16
4         1                1           1.40             2         10.30


In [5]:
# I want to know how much people paid, on average (mean) for their taxi rides

df['total_amount'].mean()

np.float64(17.552472247224728)

In [7]:
# I want to know how much people paid, on average, for their taxi rides where there were 0 passengers

(
    df.loc[
        df['passenger_count'] == 0,
        'total_amount'
       ]
    .mean()
)

np.float64(25.57)

In [8]:
# I want to know how much people paid, on average, for their taxi rides where there was 1 passenger

(
    df.loc[
        df['passenger_count'] == 1,
        'total_amount'
       ]
    .mean()
)

np.float64(17.368569446371584)

In [9]:
# I want to know how much people paid, on average, for their taxi rides where there were 2 passengers

(
    df.loc[
        df['passenger_count'] == 2,
        'total_amount'
       ]
    .mean()
)

np.float64(18.406306169078444)

# DRY -- don't repeat yourself!

If you're running the same query for each distinct value in a particular column, there is a better way: grouping, which we do via the `groupby` method.

The idea is:
- Choose a categorical column, i.e., one with a limited number of distinct values
- Choose a numeric column, i.e., the one on which we'll want to perform the calculation
- Choose an aggregation method, i.e., one which takes many values and returns a single value

The syntax for `groupby` is:

    df.groupby(CATEGORICAL)[NUMERIC].AGGFUNC()

The result will be a series. The index for this series will be the distinct values of `CATEGORICAL`, sorted in ascending order. The values will be the result of invoking `AGGFUNC` on each subset of `NUMERIC`.
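
To see that a single `groupby` call does the same work as the repeated filter-and-mean queries, here's a minimal sketch on a tiny data frame. The values in `toy` are invented for illustration and are not taken from the taxi data.

```python
import pandas as pd

# Invented toy data, standing in for the taxi data frame
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 2, 1],
    'total_amount': [10.0, 20.0, 14.0, 30.0, 12.0],
})

# The repetitive way: one filter-and-mean query per distinct value
manual = {
    n: toy.loc[toy['passenger_count'] == n, 'total_amount'].mean()
    for n in sorted(toy['passenger_count'].unique())
}

# The same calculation with a single groupby
grouped = toy.groupby('passenger_count')['total_amount'].mean()

print(manual)
print(grouped)
```

Both approaches produce the same numbers, but the `groupby` version doesn't need to know the distinct values in advance.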

In [10]:
df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0    25.570000
1    17.368569
2    18.406306
3    17.994704
4    18.881648
5    17.211269
6    17.401355
Name: total_amount, dtype: float64

Any time that you ask, "What was the value of X for each value of Y," you're asking a `groupby` question:

- Sales per region
- Sales per product
- Salary per age
- Expenses per household

# What aggregation methods are there?

- `min`
- `max`
- `mean`
- `std`
- `median`
- `quantile`
- `sum`
- `count` -- how many non-`NaN` values are there?
- `idxmin`
- `idxmax`
- `value_counts`
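
Here's a minimal sketch of how a few of these methods behave, again on invented data. It includes one `NaN` value to show what `count` ignores. The `size` method isn't in the list above; I've added it only as a contrast to `count`.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing value and a non-default index
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'value': [3.0, np.nan, 1.0, 5.0, 7.0],
}, index=[10, 11, 12, 13, 14])

by_group = toy.groupby('group')['value']

counts = by_group.count()    # non-NaN values per group
sizes = by_group.size()      # all rows per group, NaN included
lowest = by_group.idxmin()   # index label of the smallest value per group
totals = by_group.sum()      # NaN is skipped by default

print(counts, sizes, lowest, totals, sep='\n\n')
```

Note that `idxmin` and `idxmax` return index labels, not the values themselves, so you would typically pass their results to `.loc` to retrieve the matching rows.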

In [11]:
df.groupby('passenger_count')['total_amount'].idxmin()

passenger_count
0    5097
1    5719
2    9052
3     603
4    1014
5    5087
6    7509
Name: total_amount, dtype: int64

In [12]:
df.groupby('passenger_count')['total_amount'].value_counts()

passenger_count  total_amount
0                14.75             1
                 36.39             1
1                7.30            210
                 7.80            186
                 6.80            179
                                ... 
6                63.41             1
                 63.55             1
                 70.01             1
                 72.92             1
                 83.12             1
Name: count, Length: 1749, dtype: int64

# Exercise: Taxi grouping

1. We're going to run a bunch of queries using `groupby` on the NYC taxi data from January 2020. (This is in the larger zipfile that I asked you to download. The filename is `nyc_taxi_2020-01.csv`.)
2. What was the mean `total_amount` for each value of `passenger_count`?
3. What was the max `total_amount` for each value of `passenger_count`?
4. Create a new column, `tip_percentage`, which is the result of taking the `tip_amount` and finding its percentage of `fare_amount`. Get the mean `tip_percentage` per `passenger_count`.
5. Compare the mean and median `total_amount` for each value of `payment_type`.

In [15]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv'

df = pd.read_csv(filename)


In [16]:
!ls -lh $filename

-rw-r--r-- 1 reuven staff 567M Jun  4  2021 /Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv


In [17]:
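# low_memory=False has pandas infer each column's dtype from the whole file,
# rather than chunk by chunk, which avoids columns with mixed types
df = pd.read_csv(filename, low_memory=False)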

In [18]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


In [19]:
df.dtypes

VendorID                 float64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
RatecodeID               float64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object

In [20]:
# What was the mean total_amount for each value of passenger_count?

df.groupby('passenger_count')['total_amount'].mean()


passenger_count
0.0    18.059724
1.0    18.343110
2.0    19.050504
3.0    18.736862
4.0    19.128092
5.0    18.234443
6.0    18.367962
7.0    71.143103
8.0    58.197059
9.0    81.244211
Name: total_amount, dtype: float64

In [21]:
df['passenger_count'].value_counts()

passenger_count
1.0    4547226
2.0     946423
3.0     250234
5.0     225693
6.0     132154
4.0     123470
0.0     114302
7.0         29
9.0         19
8.0         17
Name: count, dtype: int64

In [22]:
# What was the max total_amount for each value of passenger_count?

df.groupby('passenger_count')['total_amount'].max()


passenger_count
0.0     435.42
1.0    4268.30
2.0     617.30
3.0     499.56
4.0     730.30
5.0     384.66
6.0     352.30
7.0     101.30
8.0     121.31
9.0     140.06
Name: total_amount, dtype: float64

In [26]:
# What was the min total_amount for each value of passenger_count?

df.groupby('passenger_count')['total_amount'].min()


passenger_count
0.0    -128.30
1.0   -1242.30
2.0    -177.80
3.0    -169.80
4.0    -730.30
5.0    -130.80
6.0     -65.30
7.0       8.30
8.0       8.80
9.0      11.76
Name: total_amount, dtype: float64

In [25]:
# look at the ride with the largest total_amount

df.loc[df['total_amount'] == 4268.30].iloc[0]

VendorID                                 2.0
tpep_pickup_datetime     2020-01-21 15:38:33
tpep_dropoff_datetime    2020-01-27 13:43:40
passenger_count                          1.0
trip_distance                           1.57
RatecodeID                               1.0
store_and_fwd_flag                         N
PULocationID                             186
DOLocationID                             152
payment_type                             2.0
fare_amount                           4265.0
extra                                    0.0
mta_tax                                  0.5
tip_amount                               0.0
tolls_amount                             0.0
improvement_surcharge                    0.3
total_amount                          4268.3
congestion_surcharge                     2.5
Name: 4049543, dtype: object

In [30]:
# Create a new column, tip_percentage, which is the result of taking the tip_amount and finding its percentage of fare_amount.
# Get the mean tip_percentage per passenger_count.
# (The division below gives a fraction, so 0.2 means 20%; multiply by 100 for a true percentage.)

df['tip_percentage'] = df['tip_amount'] / df['fare_amount']

df.groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0         inf
1.0         inf
2.0         inf
3.0    0.187235
4.0         inf
5.0    0.200383
6.0         inf
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [33]:
# the NaN values come from rides like this one, where both tip_amount and fare_amount are 0

df.loc[df['total_amount'] == 0].iloc[0]

VendorID                                 1.0
tpep_pickup_datetime     2020-01-01 00:28:00
tpep_dropoff_datetime    2020-01-01 00:28:35
passenger_count                          1.0
trip_distance                            0.0
RatecodeID                               1.0
store_and_fwd_flag                         N
PULocationID                             166
DOLocationID                             166
payment_type                             3.0
fare_amount                              0.0
extra                                    0.0
mta_tax                                  0.0
tip_amount                               0.0
tolls_amount                             0.0
improvement_surcharge                    0.0
total_amount                             0.0
congestion_surcharge                     0.0
tip_percentage                           NaN
Name: 2318, dtype: object

In [39]:
# dropna removes NaN values but not inf, so the results don't change

df.dropna(subset=['tip_amount', 'fare_amount', 'tip_percentage']).groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0         inf
1.0         inf
2.0         inf
3.0    0.187235
4.0         inf
5.0    0.200383
6.0         inf
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64
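
The `inf` values survive `dropna` because `inf` is not `NaN`. In floating-point arithmetic, a positive number divided by zero gives `inf`, while zero divided by zero gives `NaN`. Here's a minimal sketch with invented values. Replacing `inf` with `NaN` before taking the mean is one possible fix among several, shown here only as an illustration.

```python
import numpy as np
import pandas as pd

# Invented values: one ordinary row, one tip on a zero fare, one all-zero row
toy = pd.DataFrame({
    'tip_amount': [2.0, 1.0, 0.0],
    'fare_amount': [10.0, 0.0, 0.0],
})

ratio = toy['tip_amount'] / toy['fare_amount']   # 0.2, inf, NaN

# dropna removes only the NaN, so the mean is still inf
mean_after_dropna = ratio.dropna().mean()

# Turning inf into NaN first lets mean() skip both problem rows
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
mean_cleaned = cleaned.mean()

print(mean_after_dropna, mean_cleaned)
```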

In [41]:
# keeping only rows where fare_amount isn't 0 avoids the division by zero

df.loc[df['fare_amount'] != 0].groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0    0.193764
1.0    0.209198
2.0    0.193316
3.0    0.187235
4.0    0.175032
5.0    0.200383
6.0    0.197588
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [42]:
# method chaining

(
    df
    .loc[df['fare_amount'] != 0]     # only keep rows where fare amount isn't 0
    .groupby('passenger_count')['tip_percentage'].mean()
)

passenger_count
0.0    0.193764
1.0    0.209198
2.0    0.193316
3.0    0.187235
4.0    0.175032
5.0    0.200383
6.0    0.197588
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [43]:
# Compare the mean and median total_amount for each value of payment_type.

df.groupby('payment_type')['total_amount'].mean()

payment_type
1.0    19.602178
2.0    15.516222
3.0     9.933257
4.0     0.890626
5.0     0.000000
Name: total_amount, dtype: float64

In [44]:
df.groupby('payment_type')['total_amount'].median()

payment_type
1.0    14.8
2.0    11.8
3.0     9.3
4.0     0.3
5.0     0.0
Name: total_amount, dtype: float64

In [46]:
df.groupby('payment_type')['total_amount'].agg(['mean', 'median'])

                   mean  median
payment_type
1.0           19.602178    14.8
2.0           15.516222    11.8
3.0            9.933257     9.3
4.0            0.890626     0.3
5.0            0.000000     0.0
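
Comparing the mean with the median is a quick way to spot skew: a few very large values pull the mean up but barely move the median. Here's a minimal sketch on invented values (not taxi data). The second call passes keyword arguments to `agg` to name the output columns; the names `typical` and `average` are chosen just for illustration.

```python
import pandas as pd

# Invented toy data: payment type 1 has one unusually large amount
toy = pd.DataFrame({
    'payment_type': [1, 1, 1, 2, 2],
    'total_amount': [10.0, 12.0, 50.0, 8.0, 10.0],
})

# A list of method names gives one column per method
summary = toy.groupby('payment_type')['total_amount'].agg(['mean', 'median'])

# Named aggregation: each keyword becomes a column name
named = toy.groupby('payment_type')['total_amount'].agg(
    typical='median',
    average='mean',
)

print(summary)
print(named)
```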


In [47]:
# we see here how we can run groupby on a categorical column
# what if I want to groupby on *two* categoricals?
# typically, it'll be hierarchical
# - country + region
# - department + product
# - year + month

# let's get the mean amount paid 
# for each passenger_count + payment_type combination

df.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()

passenger_count  payment_type
0.0              1.0             19.169661
                 2.0             15.080724
                 3.0             14.753550
                 4.0             15.009711
1.0              1.0             19.479882
                 2.0             15.272954
                 3.0              9.479660
                 4.0              0.501119
                 5.0              0.000000
2.0              1.0             20.196789
                 2.0             16.277862
                 3.0             12.061866
                 4.0              0.310638
3.0              1.0             19.839760
                 2.0             16.203684
                 3.0             11.231576
                 4.0              0.022131
4.0              1.0             20.463335
                 2.0             16.687550
                 3.0             11.233845
                 4.0              1.112188
5.0              1.0             19.285617
                 2.0    

In [49]:
# after we perform the groupby, we can use xs to retrieve only those results
# where payment_type == 1

df.groupby(['passenger_count', 'payment_type'])['total_amount'].mean().xs(1, level='payment_type')

passenger_count
0.0    19.169661
1.0    19.479882
2.0    20.196789
3.0    19.839760
4.0    20.463335
5.0    19.285617
6.0    19.406970
7.0    77.342174
8.0    53.255000
9.0    82.345556
Name: total_amount, dtype: float64

In [None]:
# method chaining here:

(
    df
    .groupby(['passenger_count', 'payment_type'])['total_amount'].mean()
    .xs(1, level='payment_type')
)
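
Why `xs` rather than `.loc`? With a two-level index, plain `.loc` selects on the outer level, while `xs` lets us name the level we want to select on. Here's a minimal sketch on invented data:

```python
import pandas as pd

# Invented toy data with two categorical columns
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 2],
    'payment_type': [1, 2, 1, 1, 2],
    'total_amount': [10.0, 8.0, 20.0, 30.0, 12.0],
})

means = toy.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()

# Outer level: .loc selects every payment_type for passenger_count 2
for_two_passengers = means.loc[2]

# Inner level: xs selects payment_type 1 across every passenger_count
paid_with_type_1 = means.xs(1, level='payment_type')

print(for_two_passengers)
print(paid_with_type_1)
```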

In [53]:
# set_index with a list of columns also creates a multi-index, without any aggregation

df.set_index(['passenger_count', 'payment_type'])

Unnamed: 0_level_0,Unnamed: 1_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,tip_percentage
passenger_count,payment_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1.0,1.0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.20,1.0,N,238,239,6.00,3.00,0.5,1.47,0.00,0.3,11.27,2.5,0.245000
1.0,1.0,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.20,1.0,N,239,238,7.00,3.00,0.5,1.50,0.00,0.3,12.30,2.5,0.214286
1.0,1.0,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,0.60,1.0,N,238,238,6.00,3.00,0.5,1.00,0.00,0.3,10.80,2.5,0.166667
1.0,1.0,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,0.80,1.0,N,238,151,5.50,0.50,0.5,1.36,0.00,0.3,8.16,0.0,0.247273
1.0,2.0,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,0.00,1.0,N,193,193,3.50,0.50,0.5,0.00,0.00,0.3,4.80,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,,,2020-01-31 22:51:00,2020-01-31 23:22:00,3.24,,,237,234,17.59,2.75,0.5,0.00,0.00,0.3,21.14,0.0,0.000000
,,,2020-01-31 22:10:00,2020-01-31 23:26:00,22.13,,,259,45,46.67,2.75,0.5,0.00,12.24,0.3,62.46,0.0,0.000000
,,,2020-01-31 22:50:07,2020-01-31 23:17:57,10.51,,,137,169,48.85,2.75,0.0,0.00,0.00,0.3,51.90,0.0,0.000000
,,,2020-01-31 22:25:53,2020-01-31 22:48:32,5.49,,,50,42,27.17,2.75,0.0,0.00,0.00,0.3,30.22,0.0,0.000000


In [54]:
# what about calculating on multiple columns?
# if we want, we can pass a list of numeric columns on which to calculate

df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

                 total_amount  trip_distance
passenger_count
0.0                 18.059724       2.689548
1.0                 18.343110       2.811050
2.0                 19.050504       3.001117
3.0                 18.736862       2.930363
4.0                 19.128092       2.980372
5.0                 18.234443       2.850356
6.0                 18.367962       2.906041
7.0                 71.143103       3.589655
8.0                 58.197059       2.960000
9.0                 81.244211       3.314737


In [55]:
# if we pass a single numeric column, then we get a series

df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0.0    18.059724
1.0    18.343110
2.0    19.050504
3.0    18.736862
4.0    19.128092
5.0    18.234443
6.0    18.367962
7.0    71.143103
8.0    58.197059
9.0    81.244211
Name: total_amount, dtype: float64

In [56]:
# if we pass a single numeric column inside of a one-element list, then we get a data frame

df.groupby('passenger_count')[['total_amount']].mean()

                 total_amount
passenger_count
0.0                 18.059724
1.0                 18.343110
2.0                 19.050504
3.0                 18.736862
4.0                 19.128092
5.0                 18.234443
6.0                 18.367962
7.0                 71.143103
8.0                 58.197059
9.0                 81.244211


# Summarize so far

We can run `groupby` with:
- a categorical column
- a numeric column
- an aggregation method

*BUT* we can actually pass:
- a list of categorical columns
- a list of numeric columns
- more than one aggregation method

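Passing a list of numeric columns, or more than one aggregation method, gives us a data frame rather than a series. Passing a list of categorical columns (with a single numeric column and a single aggregation method) still gives us a series, but one with a multi-index.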
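
Here's a minimal sketch that checks what each variation returns, using invented data rather than the taxi file:

```python
import pandas as pd

# Invented toy data with two categorical and two numeric columns
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2],
    'payment_type': [1, 2, 1, 2],
    'total_amount': [10.0, 8.0, 20.0, 12.0],
    'trip_distance': [1.0, 0.5, 3.0, 2.0],
})

results = {
    'one of each': toy.groupby('passenger_count')['total_amount'].mean(),
    'two categoricals': toy.groupby(['passenger_count', 'payment_type'])['total_amount'].mean(),
    'two numerics': toy.groupby('passenger_count')[['total_amount', 'trip_distance']].mean(),
    'two aggregations': toy.groupby('passenger_count')['total_amount'].agg(['mean', 'max']),
}

# Report the type of each result, and of its index
for name, result in results.items():
    print(f'{name}: {type(result).__name__}, index is {type(result.index).__name__}')
```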

