# Agenda
1. Grouping and pivot tables
2. More with multi-indexes (e.g., stack and unstack)
3. Joining, merging, and concatenating
4. Working with text



In [3]:
import pandas as pd

filename = 'taxi.csv'

df = pd.read_csv(filename,
                 usecols=['VendorID', 'passenger_count', 'trip_distance',
                          'total_amount', 'payment_type'])

In [4]:
df.head()

Unnamed: 0,VendorID,passenger_count,trip_distance,payment_type,total_amount
0,2,1,1.63,2,17.8
1,2,1,0.46,1,8.3
2,2,1,0.87,1,11.0
3,2,1,2.13,1,17.16
4,1,1,1.4,2,10.3


In [5]:
# I want to know how much people paid, on average (mean) for their taxi rides

df['total_amount'].mean()

np.float64(17.552472247224728)

In [7]:
# I want to know how much people paid, on average, for their taxi rides where there were 0 passengers

(
    df.loc[
        df['passenger_count'] == 0,
        'total_amount'
    ]
    .mean()
)

np.float64(25.57)

In [8]:
# I want to know how much people paid, on average, for their taxi rides where there was 1 passenger

(
    df.loc[
        df['passenger_count'] == 1,
        'total_amount'
    ]
    .mean()
)

np.float64(17.368569446371584)

In [9]:
# I want to know how much people paid, on average, for their taxi rides where there were 2 passengers

(
    df.loc[
        df['passenger_count'] == 2,
        'total_amount'
    ]
    .mean()
)

np.float64(18.406306169078444)

# DRY -- don't repeat yourself!

If you're running the same query for each distinct value in a particular column, there is a better way: grouping, which we do with the `groupby` method.

The idea is:
- We choose a categorical column, i.e., one with a limited number of distinct values
- We choose a numeric column, i.e., the one on which we'll perform the calculation
- We choose an aggregation method, i.e., one which takes many values and returns a single value

The syntax for `groupby` is:

    df.groupby(CATEGORICAL)[NUMERIC].AGGFUNC()

The result will be a series. The index for this series will be the distinct values of `CATEGORICAL`, sorted in ascending order. The values will be the result of invoking `AGGFUNC` on each subset of `NUMERIC`.
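
To make this concrete, here is a minimal sketch on a tiny made-up data frame (the numbers are invented for illustration and are not from the taxi file). It compares the repetitive approach, one filtered `mean` per distinct value, with a single `groupby` call:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 2, 1, 3],
    'total_amount': [10.0, 20.0, 14.0, 30.0, 12.0, 9.0],
})

# The repetitive approach: one filtered mean per distinct value
manual = {
    n: toy.loc[toy['passenger_count'] == n, 'total_amount'].mean()
    for n in sorted(toy['passenger_count'].unique())
}

# The groupby approach: one call, one series, index sorted ascending
grouped = toy.groupby('passenger_count')['total_amount'].mean()

print(manual)
print(grouped)
```

Both approaches give the same numbers; `groupby` simply does the filtering for every distinct value in one pass.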

In [10]:
df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0    25.570000
1    17.368569
2    18.406306
3    17.994704
4    18.881648
5    17.211269
6    17.401355
Name: total_amount, dtype: float64

Any time that you ask, "What was the value of X for each value of Y," you're asking a `groupby` question:

- Sales per region
- Sales per product
- Salary per age
- Expenses per household

# What aggregation methods are there?

- `min`
- `max`
- `mean`
- `std`
- `median`
- `quantile`
- `sum`
- `count` -- how many non-`NaN` values are there?
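- `idxmin` -- the index label of the smallest value
- `idxmax` -- the index label of the largest value
- `value_counts` -- not a true aggregation, since it returns one count per distinct value in each group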
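
A few of these behave differently from what you might expect, so here is a small sketch on invented data. It contrasts `count` (which skips `NaN`) with `size` (a related method that is not in the list above, and which counts every row), and shows that `idxmax` returns an index label rather than a value:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only; note the NaN in group 1
toy = pd.DataFrame({
    'passenger_count': [1, 1, 1, 2, 2],
    'total_amount': [10.0, np.nan, 14.0, 30.0, 20.0],
})

by_count = toy.groupby('passenger_count')['total_amount']

non_missing = by_count.count()   # NaN values are not counted
group_sizes = by_count.size()    # size counts every row, NaN included
largest_at = by_count.idxmax()   # index label of the row holding each group's max

print(non_missing)
print(group_sizes)
print(largest_at)
```

The result of `idxmax` can be passed back to `toy.loc` to retrieve the full rows.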

In [11]:
df.groupby('passenger_count')['total_amount'].idxmin()

passenger_count
0    5097
1    5719
2    9052
3     603
4    1014
5    5087
6    7509
Name: total_amount, dtype: int64

In [12]:
df.groupby('passenger_count')['total_amount'].value_counts()

passenger_count  total_amount
0                14.75             1
                 36.39             1
1                7.30            210
                 7.80            186
                 6.80            179
                                ... 
6                63.41             1
                 63.55             1
                 70.01             1
                 72.92             1
                 83.12             1
Name: count, Length: 1749, dtype: int64

# Exercise: Taxi grouping

1. We're going to run a bunch of queries using `groupby` on the NYC taxi data from January 2020. (This is in the larger zipfile that I asked you to download. The filename is `nyc_taxi_2020-01.csv`.)
2. What was the mean `total_amount` for each value of `passenger_count`?
3. What was the max `total_amount` for each value of `passenger_count`?
4. Create a new column, `tip_percentage`, which is the result of taking the `tip_amount` and finding its percentage of `fare_amount`. Get the mean `tip_percentage` per `passenger_count`.
5. Compare the mean and median `total_amount` for each value of `payment_type`.

In [15]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv'

df = pd.read_csv(filename)


In [16]:
!ls -lh $filename

-rw-r--r-- 1 reuven staff 567M Jun  4  2021 /Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv


In [17]:
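# low_memory=False stops pandas from processing the file in chunks,
# which avoids mixed-type inference on a file this large
df = pd.read_csv(filename, low_memory=False)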

In [18]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.2,1.0,N,238,239,1.0,6.0,3.0,0.5,1.47,0.0,0.3,11.27,2.5
1,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.2,1.0,N,239,238,1.0,7.0,3.0,0.5,1.5,0.0,0.3,12.3,2.5
2,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,0.6,1.0,N,238,238,1.0,6.0,3.0,0.5,1.0,0.0,0.3,10.8,2.5
3,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,0.8,1.0,N,238,151,1.0,5.5,0.5,0.5,1.36,0.0,0.3,8.16,0.0
4,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,0.0,1.0,N,193,193,2.0,3.5,0.5,0.5,0.0,0.0,0.3,4.8,0.0


In [19]:
df.dtypes

VendorID                 float64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count          float64
trip_distance            float64
RatecodeID               float64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type             float64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object

In [20]:
# What was the mean total_amount for each value of passenger_count?

df.groupby('passenger_count')['total_amount'].mean()


passenger_count
0.0    18.059724
1.0    18.343110
2.0    19.050504
3.0    18.736862
4.0    19.128092
5.0    18.234443
6.0    18.367962
7.0    71.143103
8.0    58.197059
9.0    81.244211
Name: total_amount, dtype: float64

In [21]:
df['passenger_count'].value_counts()

passenger_count
1.0    4547226
2.0     946423
3.0     250234
5.0     225693
6.0     132154
4.0     123470
0.0     114302
7.0         29
9.0         19
8.0         17
Name: count, dtype: int64

In [22]:
# What was the max total_amount for each value of passenger_count?

df.groupby('passenger_count')['total_amount'].max()


passenger_count
0.0     435.42
1.0    4268.30
2.0     617.30
3.0     499.56
4.0     730.30
5.0     384.66
6.0     352.30
7.0     101.30
8.0     121.31
9.0     140.06
Name: total_amount, dtype: float64

In [26]:
df.groupby('passenger_count')['total_amount'].min()


passenger_count
0.0    -128.30
1.0   -1242.30
2.0    -177.80
3.0    -169.80
4.0    -730.30
5.0    -130.80
6.0     -65.30
7.0       8.30
8.0       8.80
9.0      11.76
Name: total_amount, dtype: float64

In [25]:
df.loc[df['total_amount'] == 4268.30].iloc[0]

VendorID                                 2.0
tpep_pickup_datetime     2020-01-21 15:38:33
tpep_dropoff_datetime    2020-01-27 13:43:40
passenger_count                          1.0
trip_distance                           1.57
RatecodeID                               1.0
store_and_fwd_flag                         N
PULocationID                             186
DOLocationID                             152
payment_type                             2.0
fare_amount                           4265.0
extra                                    0.0
mta_tax                                  0.5
tip_amount                               0.0
tolls_amount                             0.0
improvement_surcharge                    0.3
total_amount                          4268.3
congestion_surcharge                     2.5
Name: 4049543, dtype: object

In [30]:
# Create a new column, tip_percentage, which is the result of taking the tip_amount and finding its percentage of fare_amount. 
# Get the mean tip_percentage per passenger_count.

# note: this stores a fraction (0.2 means 20%), not a number on a 0-100 scale
df['tip_percentage'] = df['tip_amount'] / df['fare_amount']

df.groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0         inf
1.0         inf
2.0         inf
3.0    0.187235
4.0         inf
5.0    0.200383
6.0         inf
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [33]:
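# rides with a fare_amount of 0 are the culprit: 0/0 gives NaN (as in this row),
# while a nonzero tip divided by 0 gives inf
df.loc[df['total_amount'] == 0].iloc[0]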

VendorID                                 1.0
tpep_pickup_datetime     2020-01-01 00:28:00
tpep_dropoff_datetime    2020-01-01 00:28:35
passenger_count                          1.0
trip_distance                            0.0
RatecodeID                               1.0
store_and_fwd_flag                         N
PULocationID                             166
DOLocationID                             166
payment_type                             3.0
fare_amount                              0.0
extra                                    0.0
mta_tax                                  0.0
tip_amount                               0.0
tolls_amount                             0.0
improvement_surcharge                    0.0
total_amount                             0.0
congestion_surcharge                     0.0
tip_percentage                           NaN
Name: 2318, dtype: object

In [39]:
# dropna removes NaN but not inf, so this gives the same result as before
df.dropna(subset=['tip_amount', 'fare_amount', 'tip_percentage']).groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0         inf
1.0         inf
2.0         inf
3.0    0.187235
4.0         inf
5.0    0.200383
6.0         inf
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [41]:
df.loc[df['fare_amount'] != 0].groupby('passenger_count')['tip_percentage'].mean()

passenger_count
0.0    0.193764
1.0    0.209198
2.0    0.193316
3.0    0.187235
4.0    0.175032
5.0    0.200383
6.0    0.197588
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64
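
Why did filtering on `fare_amount` work when `dropna` did not? A minimal sketch with invented numbers: dividing by zero produces either `NaN` (for 0/0) or `inf` (for anything else over 0), and `dropna` only knows about `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'tip_amount':  [2.0, 0.0, 1.0],
    'fare_amount': [10.0, 0.0, 0.0],
})

# row 0: 2/10 -> 0.2, row 1: 0/0 -> NaN, row 2: 1/0 -> inf
toy['tip_percentage'] = toy['tip_amount'] / toy['fare_amount']

after_dropna = toy.dropna(subset=['tip_percentage'])   # removes the NaN row only
nonzero_fare = toy.loc[toy['fare_amount'] != 0]        # removes both problem rows

print(after_dropna['tip_percentage'].mean())   # still inf
print(nonzero_fare['tip_percentage'].mean())   # finite
```

A single `inf` in a group is enough to make that group's mean `inf`, which is why so many groups were affected.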

In [42]:
# method chaining

(
    df
    .loc[df['fare_amount'] != 0]     # only keep rows where fare amount isn't 0
    .groupby('passenger_count')['tip_percentage'].mean()
)

passenger_count
0.0    0.193764
1.0    0.209198
2.0    0.193316
3.0    0.187235
4.0    0.175032
5.0    0.200383
6.0    0.197588
7.0    0.524173
8.0    0.138806
9.0    0.131651
Name: tip_percentage, dtype: float64

In [43]:
# Compare the mean and median total_amount for each value of payment_type.

df.groupby('payment_type')['total_amount'].mean()

payment_type
1.0    19.602178
2.0    15.516222
3.0     9.933257
4.0     0.890626
5.0     0.000000
Name: total_amount, dtype: float64

In [44]:
df.groupby('payment_type')['total_amount'].median()

payment_type
1.0    14.8
2.0    11.8
3.0     9.3
4.0     0.3
5.0     0.0
Name: total_amount, dtype: float64

In [46]:
df.groupby('payment_type')['total_amount'].agg(['mean', 'median'])

Unnamed: 0_level_0,mean,median
payment_type,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,19.602178,14.8
2.0,15.516222,11.8
3.0,9.933257,9.3
4.0,0.890626,0.3
5.0,0.0,0.0


In [47]:
# we see here how we can run groupby on a categorical column
# what if I want to groupby on *two* categoricals?
# typically, it'll be hierarchical
# - country + region
# - department + product
# - year + month

# let's get the mean amount paid 
# for each passenger_count + payment_type combination

df.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()

passenger_count  payment_type
0.0              1.0             19.169661
                 2.0             15.080724
                 3.0             14.753550
                 4.0             15.009711
1.0              1.0             19.479882
                 2.0             15.272954
                 3.0              9.479660
                 4.0              0.501119
                 5.0              0.000000
2.0              1.0             20.196789
                 2.0             16.277862
                 3.0             12.061866
                 4.0              0.310638
3.0              1.0             19.839760
                 2.0             16.203684
                 3.0             11.231576
                 4.0              0.022131
4.0              1.0             20.463335
                 2.0             16.687550
                 3.0             11.233845
                 4.0              1.112188
5.0              1.0             19.285617
                 2.0    

In [49]:
# after we perform the groupby, we can use xs to retrieve only those results
# where payment_type == 1

df.groupby(['passenger_count', 'payment_type'])['total_amount'].mean().xs(1, level='payment_type')

passenger_count
0.0    19.169661
1.0    19.479882
2.0    20.196789
3.0    19.839760
4.0    20.463335
5.0    19.285617
6.0    19.406970
7.0    77.342174
8.0    53.255000
9.0    82.345556
Name: total_amount, dtype: float64

In [None]:
# method chaining here:

(
    df
    .groupby(['passenger_count', 'payment_type'])['total_amount'].mean()
    .xs(1, level='payment_type')
)
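
Here is a small sketch of what `xs` does, on invented data. It picks one value from the named level of the multi-index and drops that level from the result, leaving a series indexed only by the other level:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 2],
    'payment_type':    [1, 2, 1, 1, 2],
    'total_amount':    [10.0, 8.0, 20.0, 30.0, 12.0],
})

# A series with a two-level index: (passenger_count, payment_type)
means = toy.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()

# Keep only payment_type == 1; the payment_type level disappears
type_1 = means.xs(1, level='payment_type')

print(means)
print(type_1)
```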

In [53]:
df.set_index(['passenger_count', 'payment_type'])

Unnamed: 0_level_0,Unnamed: 1_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,tip_percentage
passenger_count,payment_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1.0,1.0,1.0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.20,1.0,N,238,239,6.00,3.00,0.5,1.47,0.00,0.3,11.27,2.5,0.245000
1.0,1.0,1.0,2020-01-01 00:35:39,2020-01-01 00:43:04,1.20,1.0,N,239,238,7.00,3.00,0.5,1.50,0.00,0.3,12.30,2.5,0.214286
1.0,1.0,1.0,2020-01-01 00:47:41,2020-01-01 00:53:52,0.60,1.0,N,238,238,6.00,3.00,0.5,1.00,0.00,0.3,10.80,2.5,0.166667
1.0,1.0,1.0,2020-01-01 00:55:23,2020-01-01 01:00:14,0.80,1.0,N,238,151,5.50,0.50,0.5,1.36,0.00,0.3,8.16,0.0,0.247273
1.0,2.0,2.0,2020-01-01 00:01:58,2020-01-01 00:04:16,0.00,1.0,N,193,193,3.50,0.50,0.5,0.00,0.00,0.3,4.80,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,,,2020-01-31 22:51:00,2020-01-31 23:22:00,3.24,,,237,234,17.59,2.75,0.5,0.00,0.00,0.3,21.14,0.0,0.000000
,,,2020-01-31 22:10:00,2020-01-31 23:26:00,22.13,,,259,45,46.67,2.75,0.5,0.00,12.24,0.3,62.46,0.0,0.000000
,,,2020-01-31 22:50:07,2020-01-31 23:17:57,10.51,,,137,169,48.85,2.75,0.0,0.00,0.00,0.3,51.90,0.0,0.000000
,,,2020-01-31 22:25:53,2020-01-31 22:48:32,5.49,,,50,42,27.17,2.75,0.0,0.00,0.00,0.3,30.22,0.0,0.000000


In [54]:
# what about calculating on multiple columns?
# if we want, we can pass a list of numeric columns on which to calculate

df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

Unnamed: 0_level_0,total_amount,trip_distance
passenger_count,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,18.059724,2.689548
1.0,18.34311,2.81105
2.0,19.050504,3.001117
3.0,18.736862,2.930363
4.0,19.128092,2.980372
5.0,18.234443,2.850356
6.0,18.367962,2.906041
7.0,71.143103,3.589655
8.0,58.197059,2.96
9.0,81.244211,3.314737


In [55]:
# if we pass a single numeric column, then we get a series

df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0.0    18.059724
1.0    18.343110
2.0    19.050504
3.0    18.736862
4.0    19.128092
5.0    18.234443
6.0    18.367962
7.0    71.143103
8.0    58.197059
9.0    81.244211
Name: total_amount, dtype: float64

In [56]:
# if we pass a single numeric column inside of a one-element list, then we get a data frame

df.groupby('passenger_count')[['total_amount']].mean()

Unnamed: 0_level_0,total_amount
passenger_count,Unnamed: 1_level_1
0.0,18.059724
1.0,18.34311
2.0,19.050504
3.0,18.736862
4.0,19.128092
5.0,18.234443
6.0,18.367962
7.0,71.143103
8.0,58.197059
9.0,81.244211


# Summarize so far

We can run `groupby` with:
- a categorical column
- a numeric column
- an aggregation method

*BUT* we can actually pass:
- a list of categorical columns
- a list of numeric columns
- more than one aggregation method

Passing a list of numeric columns, or more than one aggregation method, gives us a data frame rather than a series. Passing a list of categorical columns gives us a multi-index on the rows.
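
A quick way to check what each variation returns is to try all of them on a small data frame. This is a sketch on invented data; the variable names are just for illustration:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2],
    'payment_type':    [1, 2, 1, 2],
    'total_amount':    [10.0, 8.0, 20.0, 12.0],
    'trip_distance':   [1.0, 0.5, 3.0, 2.0],
})

one_each = toy.groupby('passenger_count')['total_amount'].mean()
two_categoricals = toy.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()
two_numerics = toy.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()
two_aggs = toy.groupby('passenger_count')['total_amount'].agg(['mean', 'median'])

results = {
    'one of each': one_each,
    'two categoricals': two_categoricals,
    'two numerics': two_numerics,
    'two aggregations': two_aggs,
}

for name, result in results.items():
    print(f'{name}: {type(result).__name__}, shape {result.shape}')
```

With a single numeric column and a single aggregation method, the result is a series, even when we group on two categorical columns; in that case the series has a two-level index.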



# what if I were to:

- groupby both `passenger_count` and `payment_type`
- calculate on `trip_distance` and `total_amount`

In [57]:
df.groupby(['passenger_count', 'payment_type'])[['trip_distance', 'total_amount']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,trip_distance,total_amount
passenger_count,payment_type,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,1.0,2.725365,19.169661
0.0,2.0,2.59374,15.080724
0.0,3.0,2.509309,14.75355
0.0,4.0,2.78921,15.009711
1.0,1.0,2.851244,19.479882
1.0,2.0,2.701837,15.272954
1.0,3.0,2.368599,9.47966
1.0,4.0,2.474345,0.501119
1.0,5.0,0.0,0.0
2.0,1.0,3.024937,20.196789


In [58]:
# can we run more than one aggregation method?

df.groupby(['passenger_count', 'payment_type'])[['trip_distance', 'total_amount']].agg(['median', 'mean'])

Unnamed: 0_level_0,Unnamed: 1_level_0,trip_distance,trip_distance,total_amount,total_amount
Unnamed: 0_level_1,Unnamed: 1_level_1,median,mean,median,mean
passenger_count,payment_type,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0.0,1.0,1.5,2.725365,14.75,19.169661
0.0,2.0,1.4,2.59374,11.8,15.080724
0.0,3.0,1.1,2.509309,10.3,14.75355
0.0,4.0,1.0,2.78921,9.3,15.009711
1.0,1.0,1.6,2.851244,14.8,19.479882
1.0,2.0,1.48,2.701837,11.8,15.272954
1.0,3.0,0.9,2.368599,8.8,9.47966
1.0,4.0,1.0,2.474345,-2.705,0.501119
1.0,5.0,0.0,0.0,0.0,0.0
2.0,1.0,1.68,3.024937,15.3,20.196789


# Exercise: Olympic data 

1. Create a data frame with the file `olympic_athlete_events.csv`.
2. What was the mean height per team in years 1960 and onward?
3. What were the mean height and weight per team in basketball and speed skating?
4. What were the mean and median age per country, in years 1980 and onward?

In [59]:
# this file contains information about every Olympic athlete and event through 2016

filename = '/Users/reuven/Courses/Current/Data/olympic_athlete_events.csv'

!head $filename

"ID","Name","Sex","Age","Height","Weight","Team","NOC","Games","Year","Season","City","Sport","Event","Medal"
"1","A Dijiang","M",24,180,80,"China","CHN","1992 Summer",1992,"Summer","Barcelona","Basketball","Basketball Men's Basketball",NA
"2","A Lamusi","M",23,170,60,"China","CHN","2012 Summer",2012,"Summer","London","Judo","Judo Men's Extra-Lightweight",NA
"3","Gunnar Nielsen Aaby","M",24,NA,NA,"Denmark","DEN","1920 Summer",1920,"Summer","Antwerpen","Football","Football Men's Football",NA
"4","Edgar Lindenau Aabye","M",34,NA,NA,"Denmark/Sweden","DEN","1900 Summer",1900,"Summer","Paris","Tug-Of-War","Tug-Of-War Men's Tug-Of-War","Gold"
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 500 metres",NA
"5","Christine Jacoba Aaftink","F",21,185,82,"Netherlands","NED","1988 Winter",1988,"Winter","Calgary","Speed Skating","Speed Skating Women's 1,000 metres",NA
"5","Christine Jacoba Aaftink","F",25,1

In [60]:
df = pd.read_csv(filename)
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [68]:
# What was the mean height per team in years 1960 and onward?

(
    df
    .loc[df['Year'] >= 1960]    # keep only 1960+ years
    .groupby('Team')['Height'].mean()
    .sort_values(ascending=False)    # tallest teams first
    .head(10)
)

Team
Puerto Rico-1            196.000000
Nadine                   190.000000
Ireland-1                189.666667
Serbia-2                 189.000000
Puerto Rico-2            188.000000
Salamander               187.666667
Serbia and Montenegro    187.511401
India-1                  187.500000
Bingo                    187.000000
Ireland-2                187.000000
Name: Height, dtype: float64

In [71]:
# What were the mean height and weight per team in basketball and speed skating?


(
    df
    .loc[(df['Sport'] == 'Basketball') | (df['Sport'] == 'Speed Skating')]
    .groupby('Team')[['Height', 'Weight']].mean()
)

Unnamed: 0_level_0,Height,Weight
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
Angola,191.788732,90.225352
Argentina,198.272727,101.033333
Australia,188.704762,85.373377
Austria,173.774510,70.901961
Belarus,178.086957,69.673913
...,...,...
United States,181.281134,76.589617
Uruguay,187.577778,86.372093
Venezuela,196.428571,101.294118
West Germany,182.068627,75.411765


In [72]:
# we can use the "isin" method

(
    df
    .loc[df['Sport'].isin(['Basketball', 'Speed Skating'])]
    .groupby('Team')[['Height', 'Weight']].mean()
)

Unnamed: 0_level_0,Height,Weight
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
Angola,191.788732,90.225352
Argentina,198.272727,101.033333
Australia,188.704762,85.373377
Austria,173.774510,70.901961
Belarus,178.086957,69.673913
...,...,...
United States,181.281134,76.589617
Uruguay,187.577778,86.372093
Venezuela,196.428571,101.294118
West Germany,182.068627,75.411765


In [73]:
# What were the mean and median age per country, in years 1980 and onward?

(
    df
    .loc[df['Year'] >= 1980]
    .groupby('Team')['Age'].agg(['mean', 'median'])
)

Unnamed: 0_level_0,mean,median
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,23.000000,23.0
Albania,25.230769,23.0
Algeria,24.346743,24.0
American Samoa,27.216216,26.0
Andorra,23.283871,22.0
...,...,...
Yugoslavia,23.535286,23.0
Yugoslavia-1,25.250000,24.5
Yugoslavia-2,24.250000,25.0
Zambia,24.108333,24.0


In [74]:
df = pd.read_csv('/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv', low_memory=False)

# I want to know:

- for each payment type
- for each number of passengers
- what was the mean total_amount?

One reasonable way to depict this would be in a table (or a data frame):

- The rows (index) would be different payment types
- The columns would be different numbers of passengers
- The values would be taken from `total_amount`
- We would run `mean` on the combination at the intersection

This, in the Pandas world, is known as a "pivot table"!

To create a pivot table, we need to specify:

- What categorical column will we use for the `index` (rows)?
- What categorical column will we use for the `columns`?
- What numeric column will we use for the values?
- What aggregate method will we invoke?
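
These four choices map directly onto the arguments of `pivot_table`. Here is a minimal sketch on invented data, which also checks that each cell of the table matches the corresponding result from a `groupby` on the same two columns:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'payment_type':    [1, 1, 2, 2, 1],
    'passenger_count': [1, 2, 1, 1, 1],
    'total_amount':    [10.0, 20.0, 8.0, 12.0, 14.0],
})

table = toy.pivot_table(index='payment_type',         # rows
                        columns='passenger_count',    # columns
                        values='total_amount',        # numeric column
                        aggfunc='mean')               # aggregation method

# The same numbers, as a series with a two-level index
grouped = toy.groupby(['payment_type', 'passenger_count'])['total_amount'].mean()

print(table)
print(grouped)
```

Combinations that never occur in the data (here, `payment_type` 2 with 2 passengers) show up as `NaN` in the pivot table, and are simply absent from the `groupby` result.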

In [75]:
# this pivot_table method is the best way to create a pivot table
# there is also a "pivot" method, but it only works if there's one value for each row-column combination
# (it cannot handle aggregation methods)

df.pivot_table(index='payment_type',
               columns='passenger_count',
               values='total_amount',
               aggfunc='mean')

passenger_count,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
payment_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1.0,19.169661,19.479882,20.196789,19.83976,20.463335,19.285617,19.40697,77.342174,53.255,82.345556
2.0,15.080724,15.272954,16.277862,16.203684,16.68755,15.385476,15.556227,47.38,81.26,61.42
3.0,14.75355,9.47966,12.061866,11.231576,11.233845,-8.708917,-3.13337,,,
4.0,15.009711,0.501119,0.310638,0.022131,1.112188,-8.882376,-2.484444,,,
5.0,,0.0,,,,,,,,


In [76]:
df.groupby(['passenger_count', 'payment_type'])['total_amount'].mean()

passenger_count  payment_type
0.0              1.0             19.169661
                 2.0             15.080724
                 3.0             14.753550
                 4.0             15.009711
1.0              1.0             19.479882
                 2.0             15.272954
                 3.0              9.479660
                 4.0              0.501119
                 5.0              0.000000
2.0              1.0             20.196789
                 2.0             16.277862
                 3.0             12.061866
                 4.0              0.310638
3.0              1.0             19.839760
                 2.0             16.203684
                 3.0             11.231576
                 4.0              0.022131
4.0              1.0             20.463335
                 2.0             16.687550
                 3.0             11.233845
                 4.0              1.112188
5.0              1.0             19.285617
                 2.0    

In [79]:
df.pivot_table(index=['payment_type', 'VendorID'],
               columns='passenger_count',
               values=['total_amount', 'trip_distance'],
               aggfunc=['mean', 'std'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,...,std,std,std,std,std,std,std,std,std,std
Unnamed: 0_level_1,Unnamed: 1_level_1,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,...,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance,trip_distance
Unnamed: 0_level_2,passenger_count,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
payment_type,VendorID,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3,Unnamed: 22_level_3
1.0,1.0,19.042514,18.908395,20.163305,19.715304,20.756658,20.858342,23.07722,12.35,10.3,65.38,...,3.635495,3.473871,3.952757,3.81332,4.064551,3.789995,4.515855,,,4.128357
1.0,2.0,41.283312,19.784228,20.212669,19.883109,20.365903,19.274772,19.380297,80.296364,56.559231,85.738667,...,1.839421,3.847142,3.963992,3.872692,4.024464,3.644905,3.759983,7.397031,5.392417,6.143123
2.0,1.0,14.994231,14.523719,16.553966,16.801301,17.253524,18.632354,16.186094,36.42,,61.42,...,3.551209,3.257321,4.023848,3.997054,3.979624,4.621283,4.480345,,,
2.0,2.0,44.495783,15.626498,16.136692,15.953876,16.358619,15.352934,15.551475,49.572,81.26,,...,3.479518,3.775352,3.955832,3.821,4.37347,3.546435,3.721534,1.390133,10.337854,
3.0,1.0,14.915132,15.183049,20.399819,19.813648,19.824308,21.691818,32.4928,,,,...,4.393473,4.171511,5.42453,4.976179,5.802429,5.871529,5.528629,,,
3.0,2.0,-61.675,-12.685527,-13.063912,-12.036065,-12.599451,-11.776881,-8.842692,,,,...,0.342479,2.908318,1.450897,1.621301,1.034765,0.325854,0.325101,,,
4.0,1.0,15.361985,16.519574,19.73117,18.078141,22.014338,6.913636,8.549524,,,,...,5.693314,4.490303,5.392024,4.703409,5.03093,1.13458,3.741377,,,
4.0,2.0,-13.6125,-15.469029,-18.036862,-16.997432,-22.124213,-11.067987,-7.027843,,,,...,0.0,3.374683,3.886742,4.095468,3.857114,0.287239,0.335431,,,
5.0,1.0,,0.0,,,,,,,,,...,,,,,,,,,,


In [84]:
df.pivot_table(index='payment_type',
               columns='passenger_count',
               values='total_amount',
               aggfunc='mean',
               margins=True)

Unnamed: 0_level_0,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount,total_amount
passenger_count,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,All
payment_type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
1.0,19.169661,19.479882,20.196789,19.83976,20.463335,19.285617,19.40697,77.342174,53.255,82.345556,19.602178
2.0,15.080724,15.272954,16.277862,16.203684,16.68755,15.385476,15.556227,47.38,81.26,61.42,15.516222
3.0,14.75355,9.47966,12.061866,11.231576,11.233845,-8.708917,-3.13337,,,,9.933257
4.0,15.009711,0.501119,0.310638,0.022131,1.112188,-8.882376,-2.484444,,,,0.890626
5.0,,0.0,,,,,,,,,0.0
All,18.059724,18.34311,19.050504,18.736862,19.128092,18.234443,18.367962,71.143103,58.197059,81.244211,18.471623
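
Passing `margins=True` adds an `All` row and an `All` column. Judging by the output above, each `All` value is the aggregation over all of the underlying rows, not the mean of the displayed means. A small sketch on invented data to check that:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'payment_type':    [1, 1, 2, 2, 1],
    'passenger_count': [1, 2, 1, 1, 1],
    'total_amount':    [10.0, 20.0, 8.0, 12.0, 14.0],
})

with_margins = toy.pivot_table(index='payment_type',
                               columns='passenger_count',
                               values='total_amount',
                               aggfunc='mean',
                               margins=True)

overall_mean = toy['total_amount'].mean()
per_payment_type = toy.groupby('payment_type')['total_amount'].mean()

print(with_margins)
print(overall_mean)
print(per_payment_type)
```

The bottom-right cell matches the overall mean of `total_amount`, and the `All` column matches a plain `groupby` on `payment_type`.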


# Exercise: Pivot tables with Olympic data

1. Create a pivot table for gold medalists showing mean height for every team vs. sport.
2. Create a pivot table showing mean age and weight for every year vs. team since 2000.

In [86]:
df = pd.read_csv('/Users/reuven/Courses/Current/Data/olympic_athlete_events.csv',
                 low_memory=False,
                 usecols=['Age', 'Height', 'Weight', 'Team', 'Year', 'Sport', 'Medal'])
df

Unnamed: 0,Age,Height,Weight,Team,Year,Sport,Medal
0,24.0,180.0,80.0,China,1992,Basketball,
1,23.0,170.0,60.0,China,2012,Judo,
2,24.0,,,Denmark,1920,Football,
3,34.0,,,Denmark/Sweden,1900,Tug-Of-War,Gold
4,21.0,185.0,82.0,Netherlands,1988,Speed Skating,
...,...,...,...,...,...,...,...
271111,29.0,179.0,89.0,Poland-1,1976,Luge,
271112,27.0,176.0,59.0,Poland,2014,Ski Jumping,
271113,27.0,176.0,59.0,Poland,2014,Ski Jumping,
271114,30.0,185.0,96.0,Poland,1998,Bobsleigh,


In [91]:
# Create a pivot table for gold medalists showing mean height for every team vs. sport.

# index - 'Team'
# columns - 'Sport'
# values - 'Height'
# aggfunc - 'mean'

(
    df
    .loc[df['Medal'] == 'Gold']
    .pivot_table(index='Team',
                 columns='Sport',
                 values='Height',
                 aggfunc='mean')
    .dropna(thresh=20)    # keep only teams (rows) with at least 20 non-NaN values
)


Sport,Alpine Skiing,Archery,Art Competitions,Athletics,Badminton,Baseball,Basketball,Beach Volleyball,Biathlon,Bobsleigh,...,Table Tennis,Taekwondo,Tennis,Trampolining,Triathlon,Tug-Of-War,Volleyball,Water Polo,Weightlifting,Wrestling
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Australia,,175.0,,172.263158,,,,,,,...,,165.0,180.5,,161.0,,,178.153846,180.0,
Canada,166.0,,,177.380952,,,,,161.0,,...,,,187.0,158.0,177.0,,,,,165.666667
China,,169.0,,169.375,176.375,,,,,,...,171.193548,179.428571,,166.0,,,184.088235,,162.294118,173.5
East Germany,,,,175.64,,,,,178.333333,,...,,,,,,,,,166.0,181.0
France,175.5,170.0,,176.615385,,,,,174.625,,...,,,,,,,,,168.666667,172.0
Germany,174.545455,,,182.76,,,,,175.588235,,...,,,192.0,158.0,194.0,179.6,,173.0,183.0,176.0
Great Britain,,,,176.355932,,,,,,,...,,156.0,186.0,,184.0,,,181.307692,188.0,175.0
Italy,172.583333,179.5,,178.055556,,,,,,,...,,183.0,,,,,,180.085714,173.0,163.8
Norway,180.666667,,,183.0,,,,,179.291667,,...,,,,,,,,,165.0,162.333333
Russia,,,,176.37931,,,,,169.25,,...,,,185.5,169.0,,,200.833333,,177.666667,174.931034


In [93]:
# Create a pivot table showing mean age and weight for every year vs. team since 2000.

# index -- Team
# columns -- year
# values -- Age and Weight
# aggfunc -- mean

(
    df
    .loc[df['Year'] >= 2000]
    .pivot_table(index='Team',
                 columns='Year',
                 values=['Age', 'Weight'],
                 aggfunc='mean')
)

Unnamed: 0_level_0,Age,Age,Age,Age,Age,Age,Age,Age,Age,Weight,Weight,Weight,Weight,Weight,Weight,Weight,Weight,Weight
Year,2000,2002,2004,2006,2008,2010,2012,2014,2016,2000,2002,2004,2006,2008,2010,2012,2014,2016
Team,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
Afghanistan,,,18.600000,,22.500000,,24.833333,,24.666667,,,64.750000,,62.750000,,60.833333,,74.000000
Albania,31.200000,,20.857143,19.000000,27.250000,23.00,25.700000,20.0,23.666667,62.900000,,70.714286,74.000000,74.750000,74.00,80.200000,56.00,67.166667
Algeria,24.901961,,25.084507,24.333333,25.210526,17.00,24.846154,,23.959459,67.941176,,67.594203,62.666667,70.821429,65.00,66.857143,,68.378378
American Samoa,27.000000,,30.000000,,23.500000,,22.000000,,25.250000,103.000000,,91.666667,,59.000000,,79.750000,,75.250000
Andorra,31.000000,24.6,29.666667,24.800000,26.600000,23.55,32.000000,23.5,26.000000,69.800000,75.0,68.500000,78.100000,61.400000,71.45,68.000000,67.75,66.250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Whisper,,,,,67.000000,,,,,,,,,62.000000,,,,
Whitini Star,,,,,36.000000,,,,,,,,,,,,,
Yemen,25.000000,,20.000000,,21.375000,,20.000000,,19.333333,61.000000,,64.333333,,55.571429,,58.000000,,65.666667
Zambia,23.000000,,22.500000,,21.875000,,22.571429,,24.142857,64.833333,,64.500000,,62.750000,,74.166667,,67.500000


# stack + unstack

We saw that a two-dimensional `groupby` and a pivot table are basically the same, just displayed differently. How can we move from one depiction to the other?

The answer is `stack` and `unstack`, two methods designed for precisely this purpose.

- `stack` means: Take the column labels of a data frame and move them into the index, as its innermost level, so that we have a multi-index on the rows.
- `unstack` means: Take one of the levels of the multi-index on the rows of a series (by default, the innermost one), and create a data frame in which that level's values become the column names.
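Here is a minimal sketch of that round trip on a tiny, made-up data frame. The name `toy` and its values are just for illustration; they stand in for the Olympics data. It should show that unstacking a two-column `groupby` gives the same table as the pivot table, and that `stack` takes us back to the multi-indexed series.

```python
import pandas as pd

# Hypothetical toy data, standing in for the Olympics data frame
toy = pd.DataFrame({
    'Team': ['A', 'A', 'A', 'B', 'B'],
    'Year': [2000, 2000, 2004, 2000, 2004],
    'Age': [20, 30, 24, 22, 28],
})

# Two-dimensional groupby: a series with a (Team, Year) multi-index
grouped = toy.groupby(['Team', 'Year'])['Age'].mean()

# unstack: the 'Year' level of the row index becomes the columns
wide = grouped.unstack('Year')

# The same table, built directly as a pivot table
pivoted = toy.pivot_table(index='Team', columns='Year',
                          values='Age', aggfunc='mean')

# stack: the column labels move back into the row index
tall = wide.stack()

print(wide)
print(pivoted)
print(tall)
```

Passing a level name to `unstack` (here, `'Year'`) makes it explicit which part of the multi-index ends up as the columns.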

In [100]:
# given a multi-indexed series, we can take one level of the index and use it as the 
# columns in a data frame. That's known as "unstack".

df.groupby(['Team', 'Year'])['Age'].mean().unstack('Year')

Year,1896,1900,1904,1906,1908,1912,1920,1924,1928,1932,...,1998,2000,2002,2004,2006,2008,2010,2012,2014,2016
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30. Februar,,,,,,,,,,,...,,,,,,,,,,
A North American Team,,41.333333,,,,,,,,,...,,,,,,,,,,
Acipactli,,,,,,,,,,,...,,,,,,,,,,
Acturus,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,,,,,,,,,,,...,,,,18.600000,,22.5000,,24.833333,,24.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zambia,,,,,,,,,,,...,,23.000000,,22.500000,,21.8750,,22.571429,,24.142857
Zefyros,,,,,,,,,,,...,,,,,,,,,,
Zimbabwe,,,,,,,,,20.0,,...,,24.961538,,25.071429,,26.0625,,27.333333,20.0,27.483871
Zut,,,,,32.0,,,,,,...,,,,,,,,,,


In [101]:
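# stack goes the other way: the column labels (the years) move back into the row index.
# In the output below, the Team/Year combinations that were NaN are gone.

df.groupby(['Team', 'Year'])['Age'].mean().unstack('Year').stack()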

Team                   Year
30. Februar            1952    33.500000
A North American Team  1900    41.333333
Acipactli              1964    47.333333
Acturus                1948    27.000000
Afghanistan            1936    24.266667
                                 ...    
Zimbabwe               2012    27.333333
                       2014    20.000000
                       2016    27.483871
Zut                    1908    32.000000
rn-2                   1912    29.200000
Length: 5061, dtype: float64

# Next up: Joining and merging


In [102]:
from pandas import Series, DataFrame

In [103]:
import numpy as np

In [108]:
np.random.seed(0)

df1 = DataFrame(np.random.randint(0, 1000, [3, 4]),
                index=list('abc'),
                columns=list('wxyz'))

df2 = DataFrame(np.random.randint(0, 1000, [3, 4]),
                index=list('abc'),
                columns=list('wxyz'))

df3 = DataFrame(np.random.randint(0, 1000, [3, 4]),
                index=list('abc'),
                columns=list('uvwx'))


In [109]:
df1

Unnamed: 0,w,x,y,z
a,684,559,629,192
b,835,763,707,359
c,9,723,277,754


In [110]:
df2

Unnamed: 0,w,x,y,z
a,804,599,70,472
b,600,396,314,705
c,486,551,87,174


In [111]:
# how can I combine these into a single data frame, keeping all of the original rows and columns?

# Option 1: stack them on top of one another
# Option 2: stack them side-by-side

# we can do this with pd.concat, which takes a list of data frames and returns a new data frame combining them

pd.concat([df1, df2])

Unnamed: 0,w,x,y,z
a,684,559,629,192
b,835,763,707,359
c,9,723,277,754
a,804,599,70,472
b,600,396,314,705
c,486,551,87,174


In [112]:
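# df1 and df3 share only the columns w and x. A column that's missing from one
# data frame is filled with NaN, which also turns that column into floats.

pd.concat([df1, df3])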

Unnamed: 0,w,x,y,z,u,v
a,684,559,629.0,192.0,,
b,835,763,707.0,359.0,,
c,9,723,277.0,754.0,,
a,677,537,,,600.0,849.0
b,777,916,,,845.0,72.0
c,755,709,,,115.0,976.0


In [113]:
pd.concat([df1, df2, df3])

Unnamed: 0,w,x,y,z,u,v
a,684,559,629.0,192.0,,
b,835,763,707.0,359.0,,
c,9,723,277.0,754.0,,
a,804,599,70.0,472.0,,
b,600,396,314.0,705.0,,
c,486,551,87.0,174.0,,
a,677,537,,,600.0,849.0
b,777,916,,,845.0,72.0
c,755,709,,,115.0,976.0


In [115]:
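# now the two data frames have no columns in common, so every row has NaN
# in the other data frame's columns

pd.concat([df1, df3[['u', 'v']]])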

Unnamed: 0,w,x,y,z,u,v
a,684.0,559.0,629.0,192.0,,
b,835.0,763.0,707.0,359.0,,
c,9.0,723.0,277.0,754.0,,
a,,,,,600.0,849.0
b,,,,,845.0,72.0
c,,,,,115.0,976.0


In [116]:
# what if we want to join them side-by-side?
# we can pass axis='columns'

pd.concat([df1, df2], axis='columns')

Unnamed: 0,w,x,y,z,w,x,y,z
a,684,559,629,192,804,599,70,472
b,835,763,707,359,600,396,314,705
c,9,723,277,754,486,551,87,174


# When do I use `pd.concat`?

Most often: When I have data split across multiple files. I can read the files into a list of data frames, and then use `pd.concat` to combine them into a single data frame.

The big thing to check, if you're combining them top-to-bottom, is that the columns (or most of the columns) match up. Any column that's missing from one of the data frames will be filled with `NaN` in that data frame's rows.
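To see what happens when the columns only partly match, here is a small sketch with two made-up data frames (`left` and `right` are hypothetical names, not part of the course data). By default, `pd.concat` keeps the union of the columns; passing `join='inner'` keeps only the columns that every data frame has.

```python
import pandas as pd

# Hypothetical frames: both have 'w' and 'x'; each has one column of its own
left = pd.DataFrame({'w': [1, 2], 'x': [3, 4], 'y': [5, 6]})
right = pd.DataFrame({'w': [7, 8], 'x': [9, 10], 'u': [11, 12]})

# Default (join='outer'): keep every column, filling the gaps with NaN
outer = pd.concat([left, right], ignore_index=True)

# join='inner': keep only the columns that appear in every data frame
inner = pd.concat([left, right], join='inner', ignore_index=True)

print(outer)
print(inner)
```

Which one you want depends on the analysis: the outer version keeps all of the data, while the inner version avoids `NaN` values at the cost of dropping columns.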

In [119]:
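# after concatenating, the index labels repeat (a, b, c, a, b, c).
# reset_index renumbers the rows from 0, and keeps the old index as a column

pd.concat([df1, df2]).reset_index()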

Unnamed: 0,index,w,x,y,z
0,a,684,559,629,192
1,b,835,763,707,359
2,c,9,723,277,754
3,a,804,599,70,472
4,b,600,396,314,705
5,c,486,551,87,174


In [120]:
# ignore_index=True also renumbers the rows from 0, but throws the old index away

pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,w,x,y,z
0,684,559,629,192
1,835,763,707,359
2,9,723,277,754
3,804,599,70,472
4,600,396,314,705
5,486,551,87,174
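Both cells above deal with the repeated index labels by renumbering the rows. A third possibility, sketched here with made-up frames (`first` and `second` are hypothetical names), is the `keys` parameter of `pd.concat`. It adds an outer level to the index saying which data frame each row came from, so we get a multi-index on the rows.

```python
import pandas as pd

# Hypothetical frames with the same index labels, like df1 and df2
first = pd.DataFrame({'w': [1, 2]}, index=list('ab'))
second = pd.DataFrame({'w': [3, 4]}, index=list('ab'))

# keys= adds an outer index level naming the source of each row
labeled = pd.concat([first, second], keys=['first', 'second'])

print(labeled)

# Retrieve the rows that came from one source via the outer level
print(labeled.loc['second'])
```

This can be handy when each data frame comes from a different file and you still want to know which file a row came from.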


In [117]:
help(pd.concat)

Help on function concat in module pandas.core.reshape.concat:

concat(
    objs: 'Iterable[Series | DataFrame] | Mapping[HashableT, Series | DataFrame]',
    *,
    axis: 'Axis' = 0,
    join: 'str' = 'outer',
    ignore_index: 'bool' = False,
    keys: 'Iterable[Hashable] | None' = None,
    levels=None,
    names: 'list[HashableT] | None' = None,
    verify_integrity: 'bool' = False,
    sort: 'bool' = False,
    copy: 'bool | None' = None
) -> 'DataFrame | Series'
    Concatenate pandas objects along a particular axis.

    Allows optional set logic along the other axes.

    Can also add a layer of hierarchical indexing on the concatenation axis,
    which may be useful if the labels are the same (or overlapping) on
    the passed axis number.

    Parameters
    ----------
    objs : a sequence or mapping of Series or DataFrame objects
        If a mapping is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be


# Exercise: Concatenation and analysis

We just used the taxi information from January 2020 in New York. There are actually *four* files of taxi information -- from January 2019 and 2020, and July 2019 and 2020.

1. Load all four of these into a single data frame using `pd.concat`. If this is too much for your computer, then you can load two of them -- the two files from 2020 are probably the best bets.
2. Find the mean and median `trip_distance` and `total_amount` for each `passenger_count`.
3. Find the number of trips in which people were refunded money. How far, on average, did such people travel?
4. Find the number of trips in which people went 0 miles. How much did they, on average, pay for the privilege?


In [122]:
!ls /Users/reuven/Courses/Current/Data/nyc_taxi_*.csv

/Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2019-07.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2020-07.csv


In [123]:
import glob  

glob.glob('/Users/reuven/Courses/Current/Data/nyc_taxi_*.csv')

['/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv',
 '/Users/reuven/Courses/Current/Data/nyc_taxi_2020-07.csv',
 '/Users/reuven/Courses/Current/Data/nyc_taxi_2019-07.csv',
 '/Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv']

In [125]:
all_dfs = []

for one_filename in glob.glob('/Users/reuven/Courses/Current/Data/nyc_taxi_*.csv'):
    print(one_filename)
    all_dfs.append(pd.read_csv(one_filename, usecols=['trip_distance', 'total_amount', 'passenger_count']))

len(all_dfs)

/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2020-07.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2019-07.csv
/Users/reuven/Courses/Current/Data/nyc_taxi_2019-01.csv


4

In [126]:
df = pd.concat(all_dfs)  
df.shape

(21183631, 3)

In [128]:
# I like to use list comprehensions!

df = pd.concat([pd.read_csv(one_filename, 
                            usecols=['trip_distance', 'total_amount', 'passenger_count'])
                for one_filename in glob.glob('/Users/reuven/Courses/Current/Data/nyc_taxi_*.csv')])
df.shape

(21183631, 3)

In [129]:
# Find the mean and median trip_distance and total_amount for each passenger_count.

df.groupby('passenger_count')[['trip_distance', 'total_amount']].agg(['mean', 'median'])

Unnamed: 0_level_0,trip_distance,trip_distance,total_amount,total_amount
Unnamed: 0_level_1,mean,median,mean,median
passenger_count,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
0.0,2.739365,1.5,18.464536,13.3
1.0,2.85122,1.6,17.565138,13.3
2.0,3.029851,1.64,18.162595,13.56
3.0,3.00926,1.64,18.031471,13.56
4.0,3.101422,1.7,18.573759,13.8
5.0,2.93857,1.63,17.509767,13.39
6.0,2.949152,1.62,17.453694,13.3
7.0,3.845976,0.01,62.500732,75.8
8.0,3.219714,0.0,64.482,84.15
9.0,5.229778,0.0,74.031111,92.8


In [None]:
# Find the number of trips in which people were refunded money. How far, on average, did such people travel?
# Find the number of trips in which people went 0 miles. How much did they, on average, pay for the privilege?
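One possible way to approach these last two questions, shown on a tiny made-up data frame rather than the real taxi data. The name `rides` and its values are hypothetical. The sketch also assumes that a refund shows up as a negative `total_amount`, which is my reading of the question and not something the data set states.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated taxi data frame
rides = pd.DataFrame({
    'passenger_count': [1, 1, 2, 1, 3],
    'trip_distance': [1.5, 0.0, 2.0, 0.0, 4.0],
    'total_amount': [10.0, -5.0, -7.0, 52.0, 20.0],
})

# Assumption: a refund appears as a negative total_amount
refunded = rides.loc[rides['total_amount'] < 0]
print(len(refunded))                      # number of refunded trips
print(refunded['trip_distance'].mean())   # how far they went, on average

# Trips that covered 0 miles
zero_miles = rides.loc[rides['trip_distance'] == 0]
print(len(zero_miles))                    # number of zero-distance trips
print(zero_miles['total_amount'].mean())  # what they paid, on average
```

On the real data, the same two filters would be applied to `df`, the data frame we built with `pd.concat`.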