# Agenda, day 3

1. Q&A
2. Sorting
3. Grouping (basic, advanced)
4. Pivot tables
5. Joining
6. Cleaning our data
7. Plotting
8. AMA -- ask me anything
9. What's next?

# Sorting

In Python, we have two ways to sort things:

- `list.sort` method, which sorts the list in place and returns `None` (so we try to avoid using it)
- `sorted` builtin function, which takes any iterable of values and returns a new list, with them in ascending order

You don't want to use `sorted` with a Pandas series or data frame! If it works, it'll hand you back a plain Python list without the index, it'll be very slow, and it won't give you the control that you want.
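Here's a quick sketch of the difference between the two, using a small list of made-up numbers (not taken from `taxi.csv`):

```python
# Toy data, made up for illustration (not from taxi.csv)
distances = [1.63, 0.46, 0.87, 2.13, 1.40]

# sorted() returns a new list and leaves the original untouched
ascending = sorted(distances)
print(ascending)    # [0.46, 0.87, 1.4, 1.63, 2.13]
print(distances)    # [1.63, 0.46, 0.87, 2.13, 1.4]

# list.sort() reorders the list in place and returns None
result = distances.sort()
print(result)       # None
print(distances)    # [0.46, 0.87, 1.4, 1.63, 2.13]
```

Because `list.sort` returns `None`, assigning its result to a variable is a common mistake, which is one more reason to reach for `sorted`.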

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
filename = 'taxi.csv'

df = pd.read_csv(filename,
                 index_col='tpep_pickup_datetime')
df

tpep_pickup_datetime,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2015-06-02 11:19:29,2,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
2015-06-02 11:19:30,2,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2015-06-02 11:19:31,2,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
2015-06-02 11:19:31,2,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
2015-06-02 11:19:32,1,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-06-01 00:12:59,1,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
2015-06-01 00:12:59,1,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
2015-06-01 00:13:00,2,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
2015-06-01 00:13:02,2,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [3]:
# how can I sort this?

# first: I want to find the shortest trip distance
df['trip_distance']

tpep_pickup_datetime
2015-06-02 11:19:29    1.63
2015-06-02 11:19:30    0.46
2015-06-02 11:19:31    0.87
2015-06-02 11:19:31    2.13
2015-06-02 11:19:32    1.40
                       ... 
2015-06-01 00:12:59    2.70
2015-06-01 00:12:59    4.50
2015-06-01 00:13:00    5.59
2015-06-01 00:13:02    1.54
2015-06-01 00:13:04    5.80
Name: trip_distance, Length: 9999, dtype: float64

In [4]:
# we can run the sort_values method on a series, and we'll get the values back, sorted from smallest to largest
# (in ascending order)

df['trip_distance'].sort_values()

tpep_pickup_datetime
2015-06-01 00:12:13     0.00
2015-06-04 15:18:09     0.00
2015-06-02 11:23:46     0.00
2015-06-01 00:08:03     0.00
2015-06-01 00:07:43     0.00
                       ...  
2015-06-04 15:17:25    34.84
2015-06-02 11:21:03    35.51
2015-06-01 00:02:42    37.20
2015-06-01 00:04:50    60.30
2015-06-01 00:00:58    64.60
Name: trip_distance, Length: 9999, dtype: float64

In [5]:
# I want the 10 shortest trips

df['trip_distance'].sort_values().head(10)

tpep_pickup_datetime
2015-06-01 00:12:13    0.0
2015-06-04 15:18:09    0.0
2015-06-02 11:23:46    0.0
2015-06-01 00:08:03    0.0
2015-06-01 00:07:43    0.0
2015-06-04 15:23:02    0.0
2015-06-06 16:51:57    0.0
2015-06-04 15:18:40    0.0
2015-06-01 00:03:44    0.0
2015-06-04 15:17:31    0.0
Name: trip_distance, dtype: float64

In [6]:
# I want the 10 longest trips

df['trip_distance'].sort_values().tail(10)

tpep_pickup_datetime
2015-06-01 00:00:16    29.78
2015-06-01 00:09:14    31.50
2015-06-01 00:00:13    31.90
2015-06-02 11:28:58    32.10
2015-06-01 00:01:19    32.40
2015-06-04 15:17:25    34.84
2015-06-02 11:21:03    35.51
2015-06-01 00:02:42    37.20
2015-06-01 00:04:50    60.30
2015-06-01 00:00:58    64.60
Name: trip_distance, dtype: float64

In [8]:
# another way:

df['trip_distance'].sort_values(ascending=False).head(10)

tpep_pickup_datetime
2015-06-01 00:00:58    64.60
2015-06-01 00:04:50    60.30
2015-06-01 00:02:42    37.20
2015-06-02 11:21:03    35.51
2015-06-04 15:17:25    34.84
2015-06-01 00:01:19    32.40
2015-06-02 11:28:58    32.10
2015-06-01 00:00:13    31.90
2015-06-01 00:09:14    31.50
2015-06-01 00:00:16    29.78
Name: trip_distance, dtype: float64
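
The two approaches above pick out the same rows, just in opposite order. A minimal sketch on a toy series (the values and the `'a'` to `'e'` index labels are made up for illustration, not taken from the taxi data):

```python
import pandas as pd

# Toy series, made up for illustration (not the taxi data)
s = pd.Series([1.63, 0.46, 0.87, 2.13, 1.40],
              index=['a', 'b', 'c', 'd', 'e'])

# Ascending sort, then take the last 3: largest value comes last
largest_via_tail = s.sort_values().tail(3)

# Descending sort, then take the first 3: largest value comes first
largest_via_head = s.sort_values(ascending=False).head(3)

print(largest_via_tail)
print(largest_via_head)

# Same rows, reversed order (this toy series has no tied values)
print(largest_via_head.equals(largest_via_tail.iloc[::-1]))
```

Which one to use is mostly a matter of how you want the result ordered when you read it.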

In [None]:
# why isn't there just one "sorted" function or method in Pandas?
# because we don't always want to sort by the values. Sometimes, we want to sort by the index.
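
To make the value-versus-index distinction concrete, here is a small sketch using `sort_values` next to `sort_index`. The series and its timestamps are made up for illustration; they only imitate the shape of the taxi data.

```python
import pandas as pd

# Toy series with a datetime index, made up for illustration
s = pd.Series([1.63, 0.46, 0.87],
              index=pd.to_datetime(['2015-06-02 11:00:00',
                                    '2015-06-01 00:00:00',
                                    '2015-06-04 15:00:00']))

# Sort by the values: shortest distance first
by_value = s.sort_values()
print(by_value)

# Sort by the index: earliest timestamp first
by_index = s.sort_index()
print(by_index)
```

In both cases each value stays attached to its index label; only the order of the rows changes.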

