In [1]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv(
    "../data/nyc_taxi_2019-01.csv",
    usecols=["passenger_count", "trip_distance", "total_amount"],
)
df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.5,9.95
1,1,2.6,16.3
2,3,0.0,5.8
3,5,0.0,7.55
4,5,0.0,55.55


In [None]:
# find the average cost of the 20 longest trips in January (whole dataset)
# sorting descending and first 20
print(df.sort_values("trip_distance", ascending=False).iloc[:20]["total_amount"].mean())
# sorting ascending and last 20
print(df.sort_values("trip_distance", ascending=True).iloc[-20:]["total_amount"].mean())
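# .head(20) / .tail(20) would also work, but .iloc is probably better since it takes any
# positional range rather than just rows from the top or bottom
# so this could have been done with one sort_values call, sliced with .iloc[:20] or .iloc[-20:]
# the two results below differ only in the last digits, presumably floating-point rounding
# from summing the same 20 values in a different order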

290.00999999999993
290.01000000000005
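
A minimal sketch of the single-sort idea from the comments above, using a small made-up frame (`toy` and its values are hypothetical, not taken from the taxi data). After one descending sort, the longest trips sit at the top and the shortest at the bottom, so either end can be sliced by position:

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the slicing
toy = pd.DataFrame(
    {
        "trip_distance": [5.0, 1.0, 9.0, 3.0, 7.0],
        "total_amount": [20.0, 6.0, 40.0, 12.0, 30.0],
    }
)

# Sort once, longest trips first
by_distance = toy.sort_values("trip_distance", ascending=False)

# Slice either end of the same sorted frame by position
longest_mean = by_distance.iloc[:2]["total_amount"].mean()    # two longest trips
shortest_mean = by_distance.iloc[-2:]["total_amount"].mean()  # two shortest trips
```

On the real data the same pattern would use `df` and a slice size of 20.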


In [16]:
# sort by ascending passenger count and descending trip distance
(
    df.sort_values(["passenger_count", "trip_distance"], ascending=[True, False])
    .iloc[:50]["total_amount"]
    .mean()
)

135.49739999999997

# Extension questions
1. In which 5 rides did people pay the most per mile? How far did people go on those trips?
2. Let's assume that multipassenger rides are split evenly amongst the passengers. Given that assumption, in which 10 multipassenger rides did each individual pay the greatest amount?
3. In the exercise solution head/tail or iloc was used to get the first/last N records. Use `ignore_index=True` with `sort_values` and `loc` to get the mean `total_amount` for the 20 longest trips.

In [None]:
# seems like something useful to have as a new column?
df["cost_per_mile"] = df["total_amount"] / df["trip_distance"]

In [20]:
df[df["trip_distance"] > 0].sort_values("cost_per_mile", ascending=False).iloc[:5]

Unnamed: 0,passenger_count,trip_distance,total_amount,cost_per_mile
2499600,1,2.4,623261.66,259692.358333
478791,1,0.1,6667.45,66674.5
7099014,4,0.01,415.3,41530.0
6403254,1,0.01,322.3,32230.0
4136499,1,0.01,273.96,27396.0


In [None]:
df["cost_per_passenger"] = df["total_amount"] / df["passenger_count"]
# interesting - I was expecting this to barf on the 0 passenger_count values and raise a div by zero error!
# pandas does not raise here: dividing by zero gives inf (or NaN for 0 / 0)
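
Why there is no error: pandas does element-wise division with floating-point semantics, so a zero divisor produces `inf` (or `NaN` for `0 / 0`) rather than raising. A small sketch with made-up values (`toy`, `per_passenger` and `finite` are hypothetical names), including one possible way to discard the non-finite results:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with zero passenger counts
toy = pd.DataFrame(
    {"total_amount": [10.0, 5.0, 0.0], "passenger_count": [2, 0, 0]}
)

# 10 / 2 -> 5.0, 5 / 0 -> inf, 0 / 0 -> NaN; no exception is raised
per_passenger = toy["total_amount"] / toy["passenger_count"]

# One option for cleanup: turn infinities into NaN, then drop missing values
finite = per_passenger.replace([np.inf, -np.inf], np.nan).dropna()
```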

In [23]:
(
    df[df["passenger_count"] > 1]
    .sort_values("cost_per_passenger", ascending=False)
    .iloc[:10]
)

Unnamed: 0,passenger_count,trip_distance,total_amount,cost_per_mile,cost_per_passenger
2972145,2,19.9,589.96,29.646231,294.98
3014027,2,16.6,560.76,33.780723,280.38
3842620,2,110.04,515.82,4.687568,257.91
7593395,2,83.61,449.32,5.373998,224.66
149362,2,17.2,426.8,24.813953,213.4
5726185,2,65.05,416.82,6.407686,208.41
6857368,2,0.0,411.36,inf,205.68
6496403,2,0.0,410.95,inf,205.475
4751745,2,100.78,403.5,4.003771,201.75
1154626,2,0.0,400.8,inf,200.4


The book dropped rows with a trip distance of 0 from the dataset in the first extension problem, so its results for
the subsequent calculations differ from the ones here. I don't agree that those rows should have been dropped; they
should only have been filtered out of that one query.
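
A sketch of the difference, on made-up data (`toy` and `top` are hypothetical names): filtering with a boolean mask inside the query excludes the zero-distance rows from that result only, while the original frame keeps every row for the later questions.

```python
import pandas as pd

# Hypothetical toy data with one zero-distance trip
toy = pd.DataFrame(
    {"trip_distance": [0.0, 2.0, 0.5], "total_amount": [8.0, 10.0, 6.0]}
)
toy["cost_per_mile"] = toy["total_amount"] / toy["trip_distance"]

# Filter inside the query: the zero-distance row is left out of this result only
top = toy[toy["trip_distance"] > 0].sort_values("cost_per_mile", ascending=False)

# toy itself is unchanged and still has all three rows
```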

In [None]:
# 3. I actually don't understand what this one is asking.
# this is the book's solution:
df.sort_values("trip_distance", ascending=False, ignore_index=True)["total_amount"].loc[
    :20
].mean()
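# although they got 253.65904761904761955 presumably because they dropped a bunch of data in previous steps
# note: .loc slices by label and includes the end label, so :20 averages 21 rows (labels 0-20);
# .loc[:19] would select the same 20 rows as .iloc[:20]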

300.76285714285706

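So the previous extension is really just removing the need for `iloc`: `ignore_index=True` renumbers the sorted rows so that the index starts at 0 again, and `loc` can then select by those labels. This means that the first 10 results have index labels 0-9. Because a `loc` slice includes its end label, `.loc[:20]` in the cell above covers 21 rows, which is why its result differs from the 290.01 computed earlier with `.iloc[:20]`.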
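
A small sketch of how label slicing and positional slicing differ once the index has been reset, again on made-up data (`toy`, `ranked`, `by_label` and `by_position` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "trip_distance": [1.0, 4.0, 2.0, 5.0, 3.0],
        "total_amount": [10.0, 40.0, 20.0, 50.0, 30.0],
    }
)

# ignore_index=True relabels the sorted rows 0, 1, 2, ...
ranked = toy.sort_values("trip_distance", ascending=False, ignore_index=True)

# .loc slices by label and includes the end label: labels 0, 1, 2 (three rows)
by_label = ranked.loc[:2, "total_amount"]

# .iloc slices by position and excludes the end: positions 0, 1 (two rows)
by_position = ranked.iloc[:2]["total_amount"]
```

To get the first N rows with `loc` on a reset index, the end label therefore has to be N - 1.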