# Agenda: Sorting!

1. Series
    - Sort by index
    - Sort by values
2. Data frame
    - Sort by index
    - Sort by one column
    - Sort by multiple columns
3. Along the way, we'll play with a number of useful methods   


In [1]:
import pandas as pd

filename = '../data/taxi.csv'

df = pd.read_csv(filename)



In [5]:
import numpy as np
from pandas import Series, DataFrame

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('acegihfjdb'))

In [6]:
s

a    -6
c    -3
e    14
g    17
i    17
h   -41
f    33
j   -29
d   -14
b    37
dtype: int64

In [7]:
# the first kind of sort we'll do is on the index

# this returns a new series, not modifying the original one
s.sort_index()

a    -6
b    37
c    -3
d   -14
e    14
f    33
g    17
h   -41
i    17
j   -29
dtype: int64
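To see that `sort_index` really does return a new series, here is a minimal sketch on a toy series (the values and the name `sorted_s` are made up for illustration):

```python
import pandas as pd

# A toy series (made-up values) whose index is out of order
s = pd.Series([10, 20, 30], index=list('cab'))

sorted_s = s.sort_index()        # a new series, sorted by index

print(s.index.tolist())          # the original keeps its order
print(sorted_s.index.tolist())   # the copy is sorted
```

If you want to keep the sorted version, assign the result to a variable.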

In [9]:
# method chaining syntax

(
    s
    .sort_index()
    .head()
    .loc[['b', 'd']]
)

b    37
d   -14
dtype: int64

In [10]:
# let's create the series again, and grab a slice

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('acegihfjdb'))

s.loc['b':'f']

Series([], dtype: int64)

In [11]:
s.loc['g':'f']  # .loc is up to AND INCLUDING

g    17
i    17
h   -41
f    33
dtype: int64
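Why was `'b':'f'` empty while `'g':'f'` returned rows? On an unsorted index with unique labels, a `.loc` slice runs from the position of the first label to the position of the second, so if the start label sits after the end label, nothing comes back. A small sketch with a hypothetical four-element series:

```python
import pandas as pd

# Toy series; the index order is d, b, c, a
s = pd.Series([1, 2, 3, 4], index=list('dbca'))

forward = s.loc['d':'c']     # 'd' comes before 'c' in the index: d, b, c
backward = s.loc['c':'d']    # 'c' comes after 'd' in the index: empty

# After sorting, label slices follow alphabetical order
alphabetical = s.sort_index().loc['b':'c']

print(forward.index.tolist())
print(backward.index.tolist())
print(alphabetical.index.tolist())
```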

In [12]:
# what if the index has repeated values?

np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10),
           index=list('aceaihbjdb'))

s.loc['a':'b']

KeyError: "Cannot get left slice bound for non-unique label: 'a'"

In [13]:
s.loc['c':'b']

KeyError: "Cannot get right slice bound for non-unique label: 'b'"

In [14]:
# in order to solve this problem, we need to sort our series!

s.sort_index().loc['a':'b']

a    -6
a    17
b    33
b    37
dtype: int64

In [15]:
s.sort_index().loc['a':'c']

a    -6
a    17
b    33
b    37
c    -3
dtype: int64
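Before slicing, it can help to ask the index whether it has repeated labels and whether it is already in sorted order. This is a sketch on a toy series; the names `slice_failed`, `sorted_s`, and `result` are just for illustration:

```python
import pandas as pd

# Toy series with repeated, unsorted index labels
s = pd.Series([1, 2, 3, 4, 5], index=list('acbab'))

print(s.index.is_unique)                  # are all labels distinct?
print(s.index.is_monotonic_increasing)    # is the index already sorted?

# Slicing the unsorted series fails, as we saw above
try:
    s.loc['a':'b']
    slice_failed = False
except KeyError:
    slice_failed = True

# Sorting first makes the slice well defined
sorted_s = s.sort_index()
result = sorted_s.loc['a':'b']
print(result.index.tolist())
```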

In [16]:
# sometimes, we want to sort by the values
# we can do this with the sort_values() method

s.sort_values()

h   -41
j   -29
d   -14
a    -6
c    -3
e    14
a    17
i    17
b    33
b    37
dtype: int64

In [17]:
s = Series([10, 5, 15, 'b', 'a', 'c'])
s

0    10
1     5
2    15
3     b
4     a
5     c
dtype: object

In [18]:
s.sort_values()

TypeError: '<' not supported between instances of 'str' and 'int'
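The sort fails because Python can't compare an `int` with a `str`. One possible workaround, shown here only as a sketch, is to sort by a text version of each value via the `key` keyword argument (covered in more detail below). Note that this orders the values as strings, so `15` lands before `5`:

```python
import pandas as pd

s = pd.Series([10, 5, 15, 'b', 'a', 'c'])

# Compare everything as text; the values in the result keep their original types
as_text = s.sort_values(key=lambda ser: ser.astype(str))

print(as_text.tolist())
```

Whether that ordering is what you want depends on the data; often the better fix is to avoid mixing types in one series in the first place.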

# Exercise: Series sorting

1. Create a series based on members of your family (or friends). The index will be their names, and the values will be their ages.
2. Sort by the names. What is the mean age of the first 3 people, alphabetically?
3. Sort by the ages. What are the names of the youngest and the eldest people in your series?

In [19]:
s = Series([53, 23, 21, 18],
           index='Reuven Atara Shikma Amotz'.split())
s

Reuven    53
Atara     23
Shikma    21
Amotz     18
dtype: int64

In [22]:
(
    s
    .sort_index()
    .head(3)
    .mean()
)

31.333333333333332

In [24]:
(
    s
    .sort_values()
    .iloc[[0, -1]]   # first and final elements
)

Amotz     18
Reuven    53
dtype: int64

In [25]:
# how can we sort in *descending* order?
# so far, we've seen that both sort_index and sort_values go in ascending order

# pass ascending=False as a keyword argument; its default value is True
s.sort_values(ascending=False)

Reuven    53
Atara     23
Shikma    21
Amotz     18
dtype: int64

In [26]:
s.sort_index(ascending=False)

Shikma    21
Reuven    53
Atara     23
Amotz     18
dtype: int64

In [27]:
# all of these methods assume that the index/values are "comparable" in Python,
# meaning that they implement not only == but also < (and maybe a few other operators, as well)

In [28]:
# what if we want to change the way in which values are sorted?
# meaning: if we have both negative and positive numbers

np.random.seed(0)   # reset random numbers
s = Series(np.random.randint(-50, 50, 10),
           index=list('abcdefghij'))
s

a    -6
b    -3
c    14
d    17
e    17
f   -41
g    33
h   -29
i   -14
j    37
dtype: int64

In [29]:
# can I sort these numbers? Sure, using sort_values

s.sort_values()

f   -41
h   -29
i   -14
a    -6
b    -3
c    14
d    17
e    17
g    33
j    37
dtype: int64

In [30]:
# can I turn these values into absolute values (i.e., positive values) and then sort them?

s.abs().sort_values()

b     3
a     6
c    14
i    14
d    17
e    17
h    29
g    33
j    37
f    41
dtype: int64

In [34]:
# what if I want to sort them by absolute value
# but I don't want to change the values themselves

# for that, we can pass an argument to the "key" keyword argument
# the value that we pass to "key" is a FUNCTION, one that takes
# a series as an argument.  The input to the function will be our
# series, and the output will be a new series that we'll use for
# sorting, but nothing else

s.sort_values(key=abs)

b    -3
a    -6
c    14
i   -14
d    17
e    17
h   -29
g    33
j    37
f   -41
dtype: int64

In [41]:
# what if I want to sort by the square of the numbers?

def square_series(a_series):   # here, I define a function that takes a series
    return a_series ** 2       #   ... and returns a series

s.sort_values(key=square_series)

b    -3
a    -6
c    14
i   -14
d    17
e    17
h   -29
g    33
j    37
f   -41
dtype: int64
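The `key` function isn't limited to numbers. Because it receives the whole series, we can use any method that returns a series of the same length. As a sketch, here is a case-insensitive sort of some made-up strings (the name `names` is just for illustration):

```python
import pandas as pd

names = pd.Series(['banana', 'apple', 'Cherry'])

# By default, uppercase letters sort before lowercase ones
default_order = names.sort_values()

# The key function gets the entire series and returns a lowercased series
ignore_case = names.sort_values(key=lambda ser: ser.str.lower())

print(default_order.tolist())
print(ignore_case.tolist())
```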

# Data frames

In many ways, sorting a data frame is just like sorting a series:

- We can sort by the index
- We can sort by the values

The difference, though, is that we have multiple columns! That means we have to indicate which column should be used as the basis for sorting. Also: We can sort by *more than* one column, if we want.
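Here is a minimal sketch of both kinds of sorting on a tiny, made-up data frame (the name `df_small` and its numbers are invented for illustration):

```python
import pandas as pd

df_small = pd.DataFrame(
    {'city': ['Chicago', 'Austin', 'Boston'],
     'population': [300, 200, 100]},     # made-up numbers
    index=[2, 0, 1],
)

by_index = df_small.sort_index()                      # rows ordered by index: 0, 1, 2
by_population = df_small.sort_values('population')    # rows ordered by one column
by_city_desc = df_small.sort_values('city', ascending=False)

print(by_index['city'].tolist())
print(by_population['city'].tolist())
print(by_city_desc['city'].tolist())
```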

In [42]:
# to sort a data frame by its index, just use ... .sort_index

filename = '../data/taxi.csv'

df = pd.read_csv(filename)

df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [43]:
# let's make the tpep_pickup_datetime column into the index
# and then sort by it

(
    df
    .set_index('tpep_pickup_datetime')  # take this column, and use it as the index
    .sort_index()
)

tpep_pickup_datetime,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2015-06-01 00:00:00,1,2015-06-01 00:06:12,1,1.00,-73.988739,40.756832,1,N,-73.974701,40.757038,2,6.0,0.5,0.5,0.00,0.00,0.3,7.30
2015-06-01 00:00:00,2,2015-06-01 00:00:00,1,0.90,-73.984428,40.737209,1,N,-73.979935,40.749088,1,11.5,1.0,0.5,2.00,0.00,0.3,15.30
2015-06-01 00:00:00,2,2015-06-01 00:00:00,2,1.40,-73.987160,40.738972,1,N,-73.976288,40.755573,2,11.5,0.0,0.5,0.00,0.00,0.3,12.30
2015-06-01 00:00:01,2,2015-06-01 00:11:29,1,7.41,-73.874634,40.774082,1,N,-73.944809,40.779282,1,21.0,0.5,0.5,5.57,5.54,0.3,33.41
2015-06-01 00:00:01,2,2015-06-01 00:24:48,1,8.15,-74.006844,40.730572,1,N,-73.946342,40.811508,1,26.5,0.5,0.5,2.50,0.00,0.3,30.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-06-06 16:53:56,1,2015-06-06 17:00:40,1,1.20,-73.992592,40.730629,1,N,-73.998161,40.717072,1,6.5,0.0,0.5,2.19,0.00,0.3,9.49
2015-06-06 16:53:56,2,2015-06-06 16:56:18,1,0.58,-73.949013,40.788616,1,N,-73.952942,40.781727,1,4.0,0.0,0.5,0.95,0.00,0.3,5.75
2015-06-06 16:53:56,2,2015-06-06 17:54:22,1,17.36,-73.790520,40.646461,2,N,-73.969048,40.763062,1,52.0,0.0,0.5,10.66,5.54,0.3,69.00
2015-06-06 16:53:56,2,2015-06-06 16:55:51,1,0.76,-73.977509,40.784252,1,N,-73.970848,40.793365,2,4.0,0.0,0.5,0.00,0.00,0.3,4.80


In [46]:
# this shows why we'll want to sort the index before grabbing a slice

(
    df
    .set_index('tpep_pickup_datetime')  # take this column, and use it as the index
    .loc['2015-06-06 16:53:56':'2015-06-06 16:53:57']
)

KeyError: "Cannot get left slice bound for non-unique label: '2015-06-06 16:53:56'"

In [47]:

(
    df
    .set_index('tpep_pickup_datetime')  # take this column, and use it as the index
    .sort_index()
    .loc['2015-06-06 16:53:56':'2015-06-06 16:53:57']
)

tpep_pickup_datetime,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2015-06-06 16:53:56,1,2015-06-06 17:15:40,1,3.2,-73.971924,40.743896,1,N,-73.957809,40.782368,1,16.0,0.0,0.5,3.36,0.0,0.3,20.16
2015-06-06 16:53:56,2,2015-06-06 17:02:49,1,1.31,-73.949532,40.77676,1,N,-73.962936,40.770599,2,8.0,0.0,0.5,0.0,0.0,0.3,8.8
2015-06-06 16:53:56,2,2015-06-06 17:02:42,6,1.36,-73.989304,40.773174,1,N,-73.972092,40.763237,1,7.5,0.0,0.5,1.66,0.0,0.3,9.96
2015-06-06 16:53:56,1,2015-06-06 17:00:10,1,1.3,-73.973396,40.789967,1,N,-73.958511,40.779106,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
2015-06-06 16:53:56,1,2015-06-06 17:00:40,1,1.2,-73.992592,40.730629,1,N,-73.998161,40.717072,1,6.5,0.0,0.5,2.19,0.0,0.3,9.49
2015-06-06 16:53:56,2,2015-06-06 16:56:18,1,0.58,-73.949013,40.788616,1,N,-73.952942,40.781727,1,4.0,0.0,0.5,0.95,0.0,0.3,5.75
2015-06-06 16:53:56,2,2015-06-06 17:54:22,1,17.36,-73.79052,40.646461,2,N,-73.969048,40.763062,1,52.0,0.0,0.5,10.66,5.54,0.3,69.0
2015-06-06 16:53:56,2,2015-06-06 16:55:51,1,0.76,-73.977509,40.784252,1,N,-73.970848,40.793365,2,4.0,0.0,0.5,0.0,0.0,0.3,4.8
2015-06-06 16:53:57,1,2015-06-06 16:58:39,1,0.5,-73.968719,40.764427,1,N,-73.965042,40.759476,2,5.0,0.0,0.5,0.0,0.0,0.3,5.8


# Exercise: Sorting cities

You might remember from last time that we can download + read into a data frame the 1,000 largest cities in the US. 

(If not, you can get it from here: https://gist.githubusercontent.com/reuven/77edbb0292901f35019f17edb9794358/raw/2bf258763cdddd704f8ffd3ea9a3e81d25e2c6f6/cities.json )

Read that into a data frame, and then try the following:

1. Turn the city name into the index. What is the mean population for the first 20 cities, alphabetically?
2. Turn the state name into the index. What is the mean population for all cities (alphabetically) from 'Iowa' to 'Nebraska'?
3. Turn the population into the index. What is the mean latitude for the 50 largest cities vs. the 50 smallest cities?

In [49]:
url = 'https://gist.githubusercontent.com/reuven/77edbb0292901f35019f17edb9794358/raw/2bf258763cdddd704f8ffd3ea9a3e81d25e2c6f6/cities.json'

df = pd.read_json(url)

In [50]:
df.head()

Unnamed: 0,city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
0,New York,4.8%,40.712784,-74.005941,8405837,1,New York
1,Los Angeles,4.8%,34.052234,-118.243685,3884307,2,California
2,Chicago,-6.1%,41.878114,-87.629798,2718782,3,Illinois
3,Houston,11.0%,29.760427,-95.369803,2195914,4,Texas
4,Philadelphia,2.6%,39.952584,-75.165222,1553165,5,Pennsylvania


In [55]:
# Turn the city name into the index. What is the mean population for the first 20 cities, alphabetically?

(
    df
    .set_index('city')
    .sort_index()
    .head(20)
    ['population']
    .mean()
)

125541.7

In [58]:
(
    df
    .set_index('city')
    .sort_index()
    .iloc[:20]
    ['population']
    .mean()
)

125541.7

In [59]:
df.set_index('city').sort_index().iloc[:20]['population'].mean()

125541.7

In [64]:
# Turn the state name into the index. What is the mean population for all cities (alphabetically) from 'Iowa' to 'Nebraska'?

(
    df
    .set_index('state')
    .sort_index()
    .loc['Iowa':'Nebraska', 'population']    # row-selector, column-selector syntax
    .mean()
)

102408.45508982036

In [66]:
# Turn the population into the index. What is the mean latitude for the 50 largest cities vs. the 50 smallest cities?

(
    df
    .set_index('population')
    .sort_index()
    .head(50)        # ascending sort, so these are the 50 smallest cities
    ['latitude']
    .mean()
)

37.303587242

In [67]:
(
    df
    .set_index('population')
    .sort_index()
    .tail(50)        # ascending sort, so these are the 50 largest cities
    ['latitude']
    .mean()
)

36.838639806

# Sorting by values

We can use the `df.sort_values` method, just as we did with a series. However, we need to specify the column by which we want to sort.

In [68]:
df.sort_values('population')

Unnamed: 0,city,growth_from_2000_to_2013,latitude,longitude,population,rank,state
999,Panama City,0.1%,30.158813,-85.660206,36877,1000,Florida
998,Beloit,2.9%,42.508348,-89.031776,36888,999,Wisconsin
997,Spanish Fork,78.1%,40.114955,-111.654923,36956,998,Utah
996,Keizer,14.4%,44.990119,-123.026208,37064,997,Oregon
995,Weslaco,28.8%,26.159519,-97.990837,37093,996,Texas
...,...,...,...,...,...,...,...
4,Philadelphia,2.6%,39.952584,-75.165222,1553165,5,Pennsylvania
3,Houston,11.0%,29.760427,-95.369803,2195914,4,Texas
2,Chicago,-6.1%,41.878114,-87.629798,2718782,3,Illinois
1,Los Angeles,4.8%,34.052234,-118.243685,3884307,2,California


# What if we want to sort by more than one column?

A good general rule in Pandas is: If you can pass a single argument, then you can pass a list of arguments.

In [72]:
(
    df
    .sort_values(['state', 'city'])
    [['city', 'state', 'population']]
    .head(20)
)

Unnamed: 0,city,state,population
614,Auburn,Alabama,58582
100,Birmingham,Alabama,212113
652,Decatur,Alabama,55816
501,Dothan,Alabama,68001
921,Florence,Alabama,40059
375,Hoover,Alabama,84126
125,Huntsville,Alabama,186254
810,Madison,Alabama,45799
121,Mobile,Alabama,194899
110,Montgomery,Alabama,201332


In [75]:
# let's sort by state + population

(
    df
    .sort_values(['state', 'population'])
    [['city', 'state', 'population']]
    .head(30)
)

Unnamed: 0,city,state,population
982,Phenix City,Alabama,37498
921,Florence,Alabama,40059
810,Madison,Alabama,45799
652,Decatur,Alabama,55816
614,Auburn,Alabama,58582
501,Dothan,Alabama,68001
375,Hoover,Alabama,84126
312,Tuscaloosa,Alabama,95334
125,Huntsville,Alabama,186254
121,Mobile,Alabama,194899


In [76]:
# can we sort them in descending order? Yes, just pass ascending=False

(
    df
    .sort_values(['state', 'population'], ascending=False)
    [['city', 'state', 'population']]
    .head(30)
)

Unnamed: 0,city,state,population
557,Cheyenne,Wyoming,62448
598,Casper,Wyoming,59628
30,Milwaukee,Wisconsin,599164
82,Madison,Wisconsin,243344
271,Green Bay,Wisconsin,104779
293,Kenosha,Wisconsin,99889
420,Racine,Wisconsin,78199
455,Appleton,Wisconsin,73596
474,Waukesha,Wisconsin,71016
507,Eau Claire,Wisconsin,67545


In [78]:
# what if we want to sort by state in ascending order, and by population in descending order?

(
    df
    .sort_values(['state', 'population'],
                ascending=[True, False])  # we pass a list of booleans to "ascending", and have one asc and the other desc
    [['city', 'state', 'population']]
)

Unnamed: 0,city,state,population
100,Birmingham,Alabama,212113
110,Montgomery,Alabama,201332
121,Mobile,Alabama,194899
125,Huntsville,Alabama,186254
312,Tuscaloosa,Alabama,95334
...,...,...,...
970,Brookfield,Wisconsin,37999
992,Greenfield,Wisconsin,37159
998,Beloit,Wisconsin,36888
557,Cheyenne,Wyoming,62448
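To convince ourselves that each element of `ascending` lines up with the column in the same position, here is a small check on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['B', 'A', 'B', 'A'],
    'population': [1, 2, 3, 4],
})

# state ascending (A before B), population descending within each state
mixed = toy.sort_values(['state', 'population'],
                        ascending=[True, False])

print(mixed)
```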


# Exercise: Sorting by values

1. Create a data frame from our taxi data (`../data/taxi.csv`).
2. What was the mean `total_amount` for the 10 trips with the shortest `trip_distance`?
3. What was the mean `trip_distance` for the 10 trips with the highest `passenger_count`?
4. Sort the trips first by `passenger_count` and then by `trip_distance`. How many people went how far in the first 5 rows?

In [79]:
filename = '../data/taxi.csv'

df = pd.read_csv(filename)

df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [84]:
# What was the mean total_amount for the 10 trips with the shortest trip_distance?

(
    df
    .sort_values('trip_distance')
    ['total_amount']
    .head(10)
    .mean()
)

41.183

In [91]:
# What was the mean trip_distance for the 10 trips with the highest passenger_count?

(
    df
    .sort_values('passenger_count', ascending=False)
    ['trip_distance']
    .head(10)
    .mean()
)

1.589

In [93]:
# Sort the trips first by passenger_count and then by trip_distance. How many people went how far in the first 5 rows?

(
    df
    .sort_values(['passenger_count', 'trip_distance'])
    .iloc[:5]
)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
5097,1,2015-06-04 15:15:45,2015-06-04 15:32:10,0,1.3,-73.953949,40.778915,1,N,-73.970337,40.788288,1,11.0,0.0,0.5,2.95,0.0,0.3,14.75
8313,1,2015-06-01 00:03:42,2015-06-01 00:22:09,0,7.9,-73.885246,40.773014,1,N,-73.976089,40.741604,1,23.5,0.5,0.5,6.05,5.54,0.3,36.39
149,1,2015-06-02 11:21:25,2015-06-02 11:21:50,1,0.0,-73.978493,40.748562,1,N,-73.978493,40.748604,1,2.5,0.0,0.5,1.0,0.0,0.3,4.3
246,1,2015-06-02 11:19:46,2015-06-02 12:26:33,1,0.0,-73.9832,40.766949,1,N,-73.99041,40.766872,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3
657,1,2015-06-02 11:24:33,2015-06-02 11:24:50,1,0.0,-73.99646,40.732124,5,N,-73.996429,40.732147,1,12.0,0.0,0.0,3.05,0.0,0.3,15.35


In [95]:
# we *can* (and probably should) use sort_values + head/tail or .iloc
# there are two convenience methods we can also use -- nlargest and nsmallest

(
    df
    .nlargest(columns='trip_distance', n=10)
)

# (
#     df
#     .sort_values('trip_distance')
#     ['total_amount']
#     .head(10)
#     .mean()
# )

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
4270,1,2015-06-01 00:00:58,2015-06-01 01:22:05,1,64.6,0.0,0.0,5,N,0.0,0.0,2,69.66,0.0,0.0,0.0,10.0,0.3,79.96
8513,1,2015-06-01 00:04:50,2015-06-01 01:31:44,1,60.3,-73.994415,40.750603,5,N,-73.42025,41.137344,1,150.0,0.0,0.0,0.0,9.75,0.3,160.05
4583,1,2015-06-01 00:02:42,2015-06-01 00:03:38,1,37.2,-73.550156,41.043472,5,N,-73.550102,41.043495,1,184.0,0.0,0.0,20.0,5.84,0.3,210.14
809,2,2015-06-02 11:21:03,2015-06-02 12:16:47,1,35.51,-73.789169,40.647758,3,N,-74.17675,40.662647,1,112.0,0.0,0.0,0.0,22.83,0.3,135.13
5470,2,2015-06-04 15:17:25,2015-06-04 17:05:42,1,34.84,-73.787354,40.64167,5,N,-74.177376,40.690781,2,120.0,0.0,0.0,0.0,17.29,0.3,137.59
4291,2,2015-06-01 00:01:19,2015-06-01 00:40:12,1,32.4,-73.781425,40.644905,2,N,-73.974174,40.731441,1,52.0,0.0,0.5,10.56,0.0,0.3,63.36
3323,1,2015-06-02 11:28:58,2015-06-02 12:13:29,1,32.1,-73.873085,40.774124,4,N,-73.957283,41.098221,1,129.0,0.0,0.5,27.05,5.54,0.3,162.39
4224,1,2015-06-01 00:00:13,2015-06-01 00:41:05,1,31.9,-73.875206,40.770382,5,N,-73.549629,41.043552,1,210.0,0.0,0.0,42.05,0.0,0.3,252.35
9231,1,2015-06-01 00:09:14,2015-06-01 01:03:11,1,31.5,-73.802437,40.677372,5,N,-74.255424,40.745316,2,140.0,0.0,0.0,0.0,9.75,0.3,150.05
4221,2,2015-06-01 00:00:16,2015-06-01 00:40:35,2,29.78,-73.781853,40.644711,2,N,-74.006905,40.707958,1,52.0,0.0,0.5,17.5,5.54,0.3,75.84
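We only ran `nlargest` above, so here is a sketch of how both convenience methods relate to `sort_values` + `head`, using a tiny made-up data frame (the name `trips` and its numbers are invented; the column names mirror the taxi data):

```python
import pandas as pd

trips = pd.DataFrame({
    'trip_distance': [1.5, 0.4, 7.9, 3.2, 12.0],
    'total_amount': [9.3, 4.8, 36.4, 15.0, 52.1],
})

# The 2 longest trips, two ways
top2 = trips.nlargest(2, 'trip_distance')
same_top2 = trips.sort_values('trip_distance', ascending=False).head(2)

# The 2 shortest trips, and the mean of what they cost
bottom2 = trips.nsmallest(2, 'trip_distance')
mean_total_shortest = bottom2['total_amount'].mean()

print(top2.index.tolist(), same_top2.index.tolist())
print(mean_total_shortest)
```

With tied values, the rows you get back (and their order) can depend on which approach you use, so it's worth checking when the cutoff falls in the middle of a tie.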
