# Agenda
1. Grouping and pivot tables
2. More with multi-indexes (e.g., stack and unstack)
3. Joining, merging, and concatenating
4. Working with text



In [3]:
import pandas as pd

filename = 'taxi.csv'

df = pd.read_csv(filename,
                usecols=['VendorID', 'passenger_count', 'trip_distance',
                         'total_amount', 'payment_type'])

In [4]:
df.head()

   VendorID  passenger_count  trip_distance  payment_type  total_amount
0         2                1           1.63             2         17.80
1         2                1           0.46             1          8.30
2         2                1           0.87             1         11.00
3         2                1           2.13             1         17.16
4         1                1           1.40             2         10.30


In [5]:
# I want to know how much people paid, on average (mean), for their taxi rides

df['total_amount'].mean()

np.float64(17.552472247224728)

In [7]:
# I want to know how much people paid, on average, for their taxi rides where there were 0 passengers

(
    df.loc[
        df['passenger_count'] == 0,
        'total_amount'
       ]
    .mean()
)

np.float64(25.57)

In [8]:
# I want to know how much people paid, on average, for their taxi rides where there was 1 passenger

(
    df.loc[
        df['passenger_count'] == 1,
        'total_amount'
       ]
    .mean()
)

np.float64(17.368569446371584)

In [9]:
# I want to know how much people paid, on average, for their taxi rides where there were 2 passengers

(
    df.loc[
        df['passenger_count'] == 2,
        'total_amount'
       ]
    .mean()
)

np.float64(18.406306169078444)

# DRY -- don't repeat yourself!

If you're running the same query for each distinct value in a particular column, there is a better way: grouping, which we run via the `groupby` method.

The idea is:
- Choose a categorical column, i.e., one with a limited number of distinct values
- Choose a numeric column, i.e., the one on which we'll perform the calculation
- Choose an aggregation method, i.e., one that takes many values and returns a single value

The syntax for `groupby` is:

    df.groupby(CATEGORICAL)[NUMERIC].AGGFUNC()

The result will be a series. The index for this series will be the distinct values of `CATEGORICAL`, sorted in ascending order. The values will be the result of invoking `AGGFUNC` on each subset of `NUMERIC`.
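
To make the connection to the repeated queries concrete, here is a minimal sketch on a tiny made-up data frame (the values are invented for illustration and are not from the taxi file). It computes the per-group mean twice, once with one query per distinct value and once with `groupby`, so that you can compare the two:

```python
import pandas as pd

# Toy data, invented for illustration (not from the taxi file)
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 0, 2, 1],
    'total_amount': [10.0, 20.0, 14.0, 30.0, 24.0, 12.0],
})

# The repetitive way: one boolean-index query per distinct value
manual = {
    n: toy.loc[toy['passenger_count'] == n, 'total_amount'].mean()
    for n in sorted(toy['passenger_count'].unique())
}

# The groupby way: one expression covers every distinct value
grouped = toy.groupby('passenger_count')['total_amount'].mean()

print(manual)
print(grouped)
```

Both approaches should give the same numbers. The `groupby` version also keeps working unchanged if a new distinct value shows up in the data.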

In [10]:
df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0    25.570000
1    17.368569
2    18.406306
3    17.994704
4    18.881648
5    17.211269
6    17.401355
Name: total_amount, dtype: float64

Any time you ask, "What was the value of X for each value of Y?", you're asking a `groupby` question:

- Sales per region
- Sales per product
- Salary per age
- Expenses per household

# What aggregation methods are there?

- `min`
- `max`
- `mean`
- `std`
- `median`
- `quantile`
- `sum`
- `count` -- how many non-`NaN` values are there?
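- `idxmin` -- the index of the row with the smallest value
- `idxmax` -- the index of the row with the largest value
- `value_counts` -- how many times each distinct value appears (this returns several values per group, not just one)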
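
As a small sketch of how a few of these behave, the example below uses an invented data frame (not the taxi data) with one missing value. It shows that `count` ignores `NaN`. It also uses the `agg` method, which is not covered above, to run several aggregations in one call by passing a list of method names:

```python
import pandas as pd

# Toy data, invented for illustration; None becomes NaN in a float column
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 0, 2, 1],
    'total_amount': [10.0, 20.0, 14.0, 30.0, 24.0, None],
})

# count only counts non-NaN values, so group 1 reports 2 rather than 3
counts = toy.groupby('passenger_count')['total_amount'].count()

# agg with a list of names returns a data frame, one column per aggregation
summary = toy.groupby('passenger_count')['total_amount'].agg(['min', 'max', 'mean'])

print(counts)
print(summary)
```

Note that with a list of aggregations the result is a data frame rather than a series.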

In [11]:
df.groupby('passenger_count')['total_amount'].idxmin()

passenger_count
0    5097
1    5719
2    9052
3     603
4    1014
5    5087
6    7509
Name: total_amount, dtype: int64

In [12]:
df.groupby('passenger_count')['total_amount'].value_counts()

passenger_count  total_amount
0                14.75             1
                 36.39             1
1                7.30            210
                 7.80            186
                 6.80            179
                                ... 
6                63.41             1
                 63.55             1
                 70.01             1
                 72.92             1
                 83.12             1
Name: count, Length: 1749, dtype: int64

# Exercise: Taxi grouping

1. We're going to run a bunch of queries using `groupby` on the NYC taxi data from January 2020. (This is in the larger zipfile that I asked you to download. The filename is `nyc_taxi_2020-01.csv`.)
2. What was the mean `total_amount` for each value of `passenger_count`?
3. What was the max `total_amount` for each value of `passenger_count`?
4. Create a new column, `tip_percentage`, containing `tip_amount` as a percentage of `fare_amount`. Get the mean `tip_percentage` per `passenger_count`.
5. Compare the mean and median `total_amount` for each value of `payment_type`.
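
If the arithmetic in step 4 is unclear, here is one possible way to compute it, sketched on invented numbers rather than the real file. The column names come from the exercise; the data and the `toy` variable are assumptions made for illustration:

```python
import pandas as pd

# Toy data, invented for illustration (not from nyc_taxi_2020-01.csv)
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2],
    'fare_amount': [10.0, 20.0, 8.0, 40.0],
    'tip_amount': [2.0, 5.0, 0.0, 10.0],
})

# Tip as a percentage of the fare, computed row by row
toy['tip_percentage'] = 100 * toy['tip_amount'] / toy['fare_amount']

# Mean tip percentage for each distinct passenger_count
mean_tip = toy.groupby('passenger_count')['tip_percentage'].mean()

print(mean_tip)
```

In real data, a `fare_amount` of 0 would make this division produce `inf` or `NaN`, so you may want to check for such rows before trusting the means.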

In [15]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv'

df = pd.read_csv(filename)


In [16]:
!ls -lh $filename

-rw-r--r-- 1 reuven staff 567M Jun  4  2021 /Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv
