# Agenda: Dates and times

1. Background theory
2. How to turn a series/column into datetime info
3. Calculations we can perform
4. Datetime columns as our index ('time series')

# Background

When we talk about time, we might mean either of the two following things:

1. One particular point in time. It has a specific year/month/day and hour/minute/second. Examples: Time of birth. Time of death. Graduation. When a meeting starts. Each such point is unique. These are known as `datetime` or `timestamp` values in different programming languages and databases.
2. A span of time between two points. It isn't anchored to a particular year/month/day, but it does have a length, which we measure in seconds, minutes, hours, days, etc. This describes a lifespan, or the amount of time you were in school, or how long a meeting went. These are known as `timedelta` or `interval` values in programming languages and databases.

Date math:
- `timestamp` - `timestamp` = `timedelta`
- `timestamp` + `timedelta` = `timestamp`
- `timestamp` - `timedelta` = `timestamp`
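
The three rules above can be tried out directly with single values. This is a minimal sketch using `pd.Timestamp` and `pd.Timedelta`; the variable names and the 30-minute and 1-hour offsets are made up for illustration.

```python
import pandas as pd

# Two points in time (timestamps); values chosen only for illustration
start = pd.Timestamp('2015-06-02 11:19:29')
end = pd.Timestamp('2015-06-02 11:47:52')

# timestamp - timestamp = timedelta
duration = end - start
print(type(duration).__name__, duration)

# timestamp + timedelta = timestamp
later = start + pd.Timedelta(minutes=30)
print(type(later).__name__, later)

# timestamp - timedelta = timestamp
earlier = start - pd.Timedelta(hours=1)
print(type(earlier).__name__, earlier)
```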



In [3]:
import pandas as pd

filename = '../data/taxi.csv'
df = pd.read_csv(filename)

In [4]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [5]:
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

# Converting to datetime data

If we have a series that contains text (strings) in a normal(ish) datetime format, then we can call `pd.to_datetime` on that series, and we'll get back a series with a `datetime` dtype.

In [6]:
pd.to_datetime(df['tpep_pickup_datetime'])

0      2015-06-02 11:19:29
1      2015-06-02 11:19:30
2      2015-06-02 11:19:31
3      2015-06-02 11:19:31
4      2015-06-02 11:19:32
               ...        
9994   2015-06-01 00:12:59
9995   2015-06-01 00:12:59
9996   2015-06-01 00:13:00
9997   2015-06-01 00:13:02
9998   2015-06-01 00:13:04
Name: tpep_pickup_datetime, Length: 9999, dtype: datetime64[ns]

In [7]:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

In [8]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [9]:
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
dtype: object

# Date formats

Everyone formats dates and times in different ways! How can `pd.to_datetime` handle things correctly?

If you're using an unambiguous format, then it's fine. (If it's in this sort of format, `YYYY-MM-DD HH:MM:SS`, then you're probably OK.)

If you have US-style dates, where it's `MM-DD-YYYY HH:MM:SS`, then you'll also be OK, because pandas reads an ambiguous date as month-first by default. But if your dates are day-first, then you'll need to specify `dayfirst=True` as an option, and then it'll parse them as `DD-MM-YYYY HH:MM:SS`.

What if your format is even crazier than that? You can specify a datetime format by passing a format string to the `format` keyword argument of `pd.to_datetime`. (Look up `strftime` and `strptime` if you want more info on the format codes.)
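
To make the difference concrete, here is a small sketch with two made-up date strings that are ambiguous (both the first and second numbers could be a month). The series `s` and the variable names are assumptions for illustration, not part of the taxi data.

```python
import pandas as pd

# Ambiguous strings: is '03-04-2015' March 4 or April 3?
s = pd.Series(['03-04-2015 10:30:00', '05-06-2015 18:45:00'])

# Default: ambiguous dates are read month-first (MM-DD-YYYY)
month_first = pd.to_datetime(s)

# dayfirst=True: the same strings are read as DD-MM-YYYY
day_first = pd.to_datetime(s, dayfirst=True)

# An explicit format string removes the guesswork entirely
explicit = pd.to_datetime(s, format='%d-%m-%Y %H:%M:%S')

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
print(explicit.equals(day_first))
```

An explicit `format` is usually the safest choice when you know how the dates were written, because nothing is left to inference.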

In [10]:
!head $filename

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

What if there is bad data in the datetime info? For example, what if a value is just a `0`, or some text that isn't a date at all?

Pass `errors='coerce'` to `pd.to_datetime`, as in `pd.to_datetime(s, errors='coerce')`, and then any value that can't be parsed will become `NaT` ("not a time"), the datetime equivalent of `NaN`.
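
A minimal sketch of what coercion looks like, using a made-up series (`raw`) that mixes valid datetimes with text that cannot be parsed:

```python
import pandas as pd

# Hypothetical column with two good values and two unparseable ones
raw = pd.Series(['2015-06-02 11:19:29',
                 'not a date',
                 'unknown',
                 '2015-06-02 11:47:52'])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)

# NaT counts as missing, so the usual tools work on it
print(parsed.isna().sum())
cleaned = parsed.dropna()
```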

# If your data is (mostly) good... good news!

You can pass `read_csv` the `parse_dates` option, which takes a list of the column names that should be parsed as dates. If you want to be more explicit, you can also pass `dayfirst` and (in pandas 2.0 and later) `date_format` as separate keyword arguments to `read_csv`. You cannot use `errors='coerce'` in `read_csv`.
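
Here is a small sketch of both situations, reading from in-memory text rather than a file. The CSV contents and the column names `when` and `reading` are invented for illustration. The second half shows one possible workaround for messy data: read the column as text, then convert it afterwards with `pd.to_datetime`.

```python
import io
import pandas as pd

# Clean, day-first data: parse the dates while reading
clean_text = """when,reading
25/12/2018 09:00,1.5
26/12/2018 09:00,2.5
"""
df_demo = pd.read_csv(io.StringIO(clean_text),
                      parse_dates=['when'],
                      dayfirst=True)
print(df_demo.dtypes)

# Messy data: read_csv has no errors='coerce', so convert after loading
messy_text = """when,reading
25/12/2018 09:00,1.5
missing,2.5
"""
messy = pd.read_csv(io.StringIO(messy_text))
messy['when'] = pd.to_datetime(messy['when'], dayfirst=True, errors='coerce')
print(messy)
```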

In [11]:
df = pd.read_csv(filename,
                 parse_dates=['tpep_pickup_datetime',
                              'tpep_dropoff_datetime'])

In [12]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [13]:
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
dtype: object

In [14]:
!ls ../data/

2020_sharing_data_outside.csv  oecd_locations.csv
CPILFESL.csv		       oecd_tourism.csv
airlines.dat		       olympic_athlete_events.csv
albany,ny.csv		       san+francisco,ca.csv
alice-in-wonderland.txt        sat-scores.csv
boston,ma.csv		       skyscrapers.csv
burrito_current.csv	       springfield,il.csv
celebrity_deaths_2016.csv      springfield,ma.csv
chicago,il.csv		       taxi-distance.csv
cities.json		       taxi-passenger-count.csv
eu_cpi.csv		       taxi.csv
eu_gdp.csv		       titanic3.csv
ice-cream.csv		       titanic3.xls
languages.csv		       us-median-cpi.csv
linux-etc-passwd.txt	       us-unemployment-rate.csv
los+angeles,ca.csv	       us_gdp.csv
miles-traveled.csv	       winemag-150k-reviews.csv
new+york,ny.csv		       words.txt
nyc-temps.txt		       wti-daily.csv


In [16]:
!head ../data/chicago,il.csv	

date_time,"chicago,il_maxtempC","chicago,il_mintempC","chicago,il_totalSnow_cm","chicago,il_sunHour","chicago,il_uvIndex","chicago,il_uvIndex","chicago,il_moon_illumination","chicago,il_moonrise","chicago,il_moonset","chicago,il_sunrise","chicago,il_sunset","chicago,il_DewPointC","chicago,il_FeelsLikeC","chicago,il_HeatIndexC","chicago,il_WindChillC","chicago,il_WindGustKmph","chicago,il_cloudcover","chicago,il_humidity","chicago,il_precipMM","chicago,il_pressure","chicago,il_tempC","chicago,il_visibility","chicago,il_winddirDegree","chicago,il_windspeedKmph"
2018-12-11 00:00:00,1,-2,0.0,8.7,2,0,21,10:26 AM,08:22 PM,07:08 AM,04:19 PM,-5,-7,-1,-7,33,0,76,0.0,1021,-1,10,231,22
2018-12-11 03:00:00,1,-2,0.0,8.7,2,0,21,10:26 AM,08:22 PM,07:08 AM,04:19 PM,-4,-6,-1,-6,33,0,80,0.0,1021,-1,10,243,21
2018-12-11 06:00:00,1,-2,0.0,8.7,2,0,21,10:26 AM,08:22 PM,07:08 AM,04:19 PM,-5,-8,-2,-8,28,5,83,0.0,1020,-2,10,240,19
2018-12-11 09:00:00,1,-2,0.0,8.7,2,2,21,10:26 AM,08:22 PM,07:08 AM,04:19 PM,-5,-

# Exercise: Reading weather data

1. Read in the weather data from Chicago (`chicago,il.csv`).
2. First, read it in without asking for it to be a datetime, and check how much memory it uses.
3. Then use `parse_dates` and read it as a datetime.
4. Compare the memory usage.

In [18]:
filename = '../data/chicago,il.csv'
df = pd.read_csv(filename)

In [21]:
df.memory_usage(deep=True).sum()

332220

In [22]:
filename = '../data/chicago,il.csv'
df = pd.read_csv(filename,
                parse_dates=['date_time'])

In [23]:
df.memory_usage(deep=True).sum()

288540

In [24]:
288540 / 332220

0.8685208596713021
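
The same comparison can be made on a single column, which isolates the effect of the dtype. This sketch builds synthetic timestamps (728 of them, three hours apart, loosely modeled on the weather file) instead of reading the CSV, so the exact byte counts will differ from the ones above.

```python
import pandas as pd

# Synthetic timestamps, one every three hours
stamps = pd.date_range('2018-12-11', periods=728, freq=pd.Timedelta(hours=3))

# The same values stored as Python strings (object dtype)...
as_text = pd.Series(stamps.strftime('%Y-%m-%d %H:%M:%S'), dtype=object)

# ...and converted to a datetime dtype
as_datetime = pd.to_datetime(as_text)

text_bytes = as_text.memory_usage(deep=True)
datetime_bytes = as_datetime.memory_usage(deep=True)
print(text_bytes, datetime_bytes, datetime_bytes / text_bytes)
```

The whole-frame saving in the exercise is modest because only one of the many columns changes dtype.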

# What can we do with our datetime column?

Similar to strings, where we used `.str` to invoke many methods and retrieve info about the string, we have the `.dt` accessor for working with `datetime` values. You can say `.dt.TIME_PART` for nearly anything you can imagine, and many you probably hadn't thought of!

In [25]:
df['date_time'].dt.hour

0       0
1       3
2       6
3       9
4      12
       ..
723     9
724    12
725    15
726    18
727    21
Name: date_time, Length: 728, dtype: int32

In [26]:
df['date_time'].dt.day

0      11
1      11
2      11
3      11
4      11
       ..
723    11
724    11
725    11
726    11
727    11
Name: date_time, Length: 728, dtype: int32

In [29]:
df['date_time'].dt.day_name()

0      Tuesday
1      Tuesday
2      Tuesday
3      Tuesday
4      Tuesday
        ...   
723     Monday
724     Monday
725     Monday
726     Monday
727     Monday
Name: date_time, Length: 728, dtype: object

In [30]:
df['date_time'].dt.is_leap_year

0      False
1      False
2      False
3      False
4      False
       ...  
723    False
724    False
725    False
726    False
727    False
Name: date_time, Length: 728, dtype: bool

In [31]:
df['date_time'].dt.is_quarter_end

0      False
1      False
2      False
3      False
4      False
       ...  
723    False
724    False
725    False
726    False
727    False
Name: date_time, Length: 728, dtype: bool

In [37]:
# let's get all of the temp measurements taken at 9 a.m.

df.loc[
    df['date_time'].dt.hour == 9,       # row selector
    ['chicago,il_maxtempC', 'chicago,il_mintempC']    # column selector
]

Unnamed: 0,"chicago,il_maxtempC","chicago,il_mintempC"
3,1,-2
11,3,-1
19,3,-1
27,3,2
35,4,1
...,...,...
691,-3,-7
699,2,-4
707,3,-4
715,3,-1


In [40]:
# we can even use it in grouping
# I want summary statistics (including the mean) of the max temp for each distinct hour in the day

df.groupby(df['date_time'].dt.hour)['chicago,il_maxtempC'].describe()

date_time,count,mean,std,min,25%,50%,75%,max
0,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
3,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
6,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
9,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
12,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
15,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
18,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0
21,91.0,-0.736264,6.158707,-25.0,-3.0,0.0,3.0,9.0


In [42]:
df.groupby([df['date_time'].dt.month, df['date_time'].dt.day])['chicago,il_maxtempC'].describe()

date_time,date_time,count,mean,std,min,25%,50%,75%,max
1,1,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,8.0,-3.0,0.0,-3.0,-3.0,-3.0,-3.0,-3.0
1,3,8.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0
1,4,8.0,6.0,0.0,6.0,6.0,6.0,6.0,6.0
1,5,8.0,9.0,0.0,9.0,9.0,9.0,9.0,9.0
...,...,...,...,...,...,...,...,...,...
12,27,8.0,8.0,0.0,8.0,8.0,8.0,8.0,8.0
12,28,8.0,9.0,0.0,9.0,9.0,9.0,9.0,9.0
12,29,8.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0
12,30,8.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


# Exercise: Taxi questions

1. Load the `taxi.csv` file into a data frame. Don't forget that `tpep_pickup_datetime` and `tpep_dropoff_datetime` should both be treated as `datetime` values.
2. Find the mean `trip_distance` for each distinct hour of the day.
3. Find the hour at which the `total_amount` came to the greatest sum.

In [44]:
df = pd.read_csv('../data/taxi.csv',
                 parse_dates=['tpep_pickup_datetime',
                              'tpep_dropoff_datetime'])
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [45]:
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
dtype: object

In [46]:
# Find the mean trip_distance for each distinct hour of the day.

df['trip_distance']

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64

In [47]:
df['tpep_pickup_datetime'].dt.hour

0       11
1       11
2       11
3       11
4       11
        ..
9994     0
9995     0
9996     0
9997     0
9998     0
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [48]:
# for each hour of the day
# give the mean trip_distance
df.groupby(df['tpep_pickup_datetime'].dt.hour)['trip_distance'].mean()

tpep_pickup_datetime
0     4.312809
11    2.661199
15    2.986285
16    2.852166
Name: trip_distance, dtype: float64

In [51]:
# Find the hour at which the total_amount came to the greatest sum.

df.groupby(df['tpep_pickup_datetime'].dt.hour)['total_amount'].sum()

tpep_pickup_datetime
0     46176.82
11    72894.89
15    47059.46
16     9376.00
Name: total_amount, dtype: float64
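
Reading the answer off the output works for four rows, but `idxmax` can pick out the winning hour directly. A minimal sketch on a tiny made-up data frame (`rides`, with a `pickup` column standing in for `tpep_pickup_datetime`):

```python
import pandas as pd

# Four invented rides across three distinct hours
rides = pd.DataFrame({
    'pickup': pd.to_datetime(['2015-06-02 11:19:29',
                              '2015-06-02 11:45:00',
                              '2015-06-02 15:05:10',
                              '2015-06-01 00:12:59']),
    'total_amount': [17.8, 8.3, 20.0, 12.3],
})

# Sum of total_amount per pickup hour
hourly_totals = rides.groupby(rides['pickup'].dt.hour)['total_amount'].sum()

# idxmax returns the index label (here, the hour) of the largest sum
busiest_hour = hourly_totals.idxmax()
print(busiest_hour)
```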

In [52]:
# what about timedelta?

df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

0      0 days 00:28:23
1      0 days 00:08:26
2      0 days 00:10:59
3      0 days 00:19:31
4      0 days 00:13:17
             ...      
9994   0 days 00:11:19
9995   0 days 00:15:17
9996   0 days 00:24:25
9997   0 days 00:06:08
9998   0 days 00:23:29
Length: 9999, dtype: timedelta64[ns]

In [53]:
# let's assign this delta to be a column in our data frame

df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

In [55]:
# we can compare the trip time with a string!

df.loc[df['trip_time'] < '00:10:00']   # which rides were less than ten minutes long?

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30,0 days 00:08:26
5,1,2015-06-02 11:19:33,2015-06-02 11:28:48,1,1.40,-73.944641,40.779465,1,N,-73.961365,40.771561,1,8.0,0.0,0.5,1.75,0.0,0.3,10.55,0 days 00:09:15
9,1,2015-06-02 11:19:38,2015-06-02 11:23:50,1,0.60,-73.970734,40.796207,1,N,-73.977470,40.789509,1,5.0,0.0,0.5,0.50,0.0,0.3,6.30,0 days 00:04:12
10,2,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,0.000000,0.000000,2,N,0.000000,0.000000,2,52.0,0.0,0.5,0.00,0.0,0.3,52.80,0 days 00:00:05
13,1,2015-06-02 11:19:41,2015-06-02 11:23:56,2,0.50,-73.954803,40.765667,1,N,-73.962517,40.768044,1,4.5,0.0,0.5,1.30,0.0,0.3,6.60,0 days 00:04:15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9987,1,2015-06-01 00:13:36,2015-06-01 00:19:18,1,1.40,-73.983498,40.726173,1,N,-73.985840,40.740742,1,6.5,0.5,0.5,0.50,0.0,0.3,8.30,0 days 00:05:42
9988,1,2015-06-01 00:13:39,2015-06-01 00:18:22,1,0.90,-73.997124,40.747105,1,N,-73.986977,40.756107,1,5.5,0.5,0.5,1.70,0.0,0.3,8.50,0 days 00:04:43
9989,2,2015-06-01 00:13:41,2015-06-01 00:17:44,6,1.34,-74.005898,40.735851,1,N,-73.991318,40.748177,1,6.0,0.5,0.5,1.46,0.0,0.3,8.76,0 days 00:04:03
9991,2,2015-06-01 00:13:43,2015-06-01 00:21:49,1,1.48,-73.988739,40.756950,1,N,-73.976860,40.743309,2,8.0,0.5,0.5,0.00,0.0,0.3,9.30,0 days 00:08:06


In [56]:
# can perform comparisons with human-looking strings, too!

df.loc[df['trip_time'] < '10 minutes']   # which rides were less than ten minutes long?

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30,0 days 00:08:26
5,1,2015-06-02 11:19:33,2015-06-02 11:28:48,1,1.40,-73.944641,40.779465,1,N,-73.961365,40.771561,1,8.0,0.0,0.5,1.75,0.0,0.3,10.55,0 days 00:09:15
9,1,2015-06-02 11:19:38,2015-06-02 11:23:50,1,0.60,-73.970734,40.796207,1,N,-73.977470,40.789509,1,5.0,0.0,0.5,0.50,0.0,0.3,6.30,0 days 00:04:12
10,2,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,0.000000,0.000000,2,N,0.000000,0.000000,2,52.0,0.0,0.5,0.00,0.0,0.3,52.80,0 days 00:00:05
13,1,2015-06-02 11:19:41,2015-06-02 11:23:56,2,0.50,-73.954803,40.765667,1,N,-73.962517,40.768044,1,4.5,0.0,0.5,1.30,0.0,0.3,6.60,0 days 00:04:15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9987,1,2015-06-01 00:13:36,2015-06-01 00:19:18,1,1.40,-73.983498,40.726173,1,N,-73.985840,40.740742,1,6.5,0.5,0.5,0.50,0.0,0.3,8.30,0 days 00:05:42
9988,1,2015-06-01 00:13:39,2015-06-01 00:18:22,1,0.90,-73.997124,40.747105,1,N,-73.986977,40.756107,1,5.5,0.5,0.5,1.70,0.0,0.3,8.50,0 days 00:04:43
9989,2,2015-06-01 00:13:41,2015-06-01 00:17:44,6,1.34,-74.005898,40.735851,1,N,-73.991318,40.748177,1,6.0,0.5,0.5,1.46,0.0,0.3,8.76,0 days 00:04:03
9991,2,2015-06-01 00:13:43,2015-06-01 00:21:49,1,1.48,-73.988739,40.756950,1,N,-73.976860,40.743309,2,8.0,0.5,0.5,0.00,0.0,0.3,9.30,0 days 00:08:06


In [57]:
df.loc[df['trip_time'] < '1 minutes']   # which rides were less than one minute long?

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
10,2,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,0.000000,0.000000,2,N,0.000000,0.000000,2,52.0,0.0,0.5,0.00,0.00,0.3,52.80,0 days 00:00:05
149,1,2015-06-02 11:21:25,2015-06-02 11:21:50,1,0.00,-73.978493,40.748562,1,N,-73.978493,40.748604,1,2.5,0.0,0.5,1.00,0.00,0.3,4.30,0 days 00:00:25
297,2,2015-06-02 11:20:23,2015-06-02 11:20:23,2,0.00,-73.937851,40.758236,1,N,0.000000,0.000000,2,1.5,0.0,0.5,0.00,0.00,0.3,2.30,0 days 00:00:00
516,2,2015-06-02 11:21:14,2015-06-02 11:22:03,1,0.03,-73.991943,40.740261,1,N,-73.991463,40.740086,1,2.5,0.0,0.5,0.66,0.00,0.3,3.96,0 days 00:00:49
657,1,2015-06-02 11:24:33,2015-06-02 11:24:50,1,0.00,-73.996460,40.732124,5,N,-73.996429,40.732147,1,12.0,0.0,0.0,3.05,0.00,0.3,15.35,0 days 00:00:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9452,1,2015-06-01 00:10:45,2015-06-01 00:11:28,1,0.10,-73.962044,40.761909,1,N,-73.960907,40.761726,3,2.5,0.5,0.5,0.00,0.00,0.3,3.80,0 days 00:00:43
9693,2,2015-06-01 00:11:08,2015-06-01 00:11:08,1,5.03,-73.991440,40.731533,1,N,-73.967010,40.802006,1,16.0,0.5,0.5,1.50,0.00,0.3,18.80,0 days 00:00:00
9761,1,2015-06-01 00:12:46,2015-06-01 00:12:46,1,0.00,-73.999535,40.738533,1,N,0.000000,0.000000,2,4.5,0.5,0.5,0.00,0.00,0.3,5.80,0 days 00:00:00
9868,1,2015-06-01 00:12:31,2015-06-01 00:12:56,2,2.40,-73.984634,40.759377,2,N,-73.984604,40.759350,2,52.0,0.0,0.5,0.00,5.54,0.3,58.34,0 days 00:00:25
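
String comparisons are convenient, but the same filters can be written with explicit `pd.Timedelta` objects, and a timedelta column can be turned into plain numbers with `.dt.total_seconds()`. This is a sketch on a three-row made-up data frame (`trips`); the column names `pickup`, `dropoff`, and `trip_minutes` are assumptions for illustration.

```python
import pandas as pd

trips = pd.DataFrame({
    'pickup': pd.to_datetime(['2015-06-02 11:19:29',
                              '2015-06-02 11:19:38',
                              '2015-06-02 11:20:00']),
    'dropoff': pd.to_datetime(['2015-06-02 11:47:52',
                               '2015-06-02 11:19:43',
                               '2015-06-02 12:35:00']),
})

# timestamp - timestamp = timedelta, applied to whole columns
trips['trip_time'] = trips['dropoff'] - trips['pickup']

# Compare against Timedelta objects instead of strings
short = trips.loc[trips['trip_time'] < pd.Timedelta(minutes=1)]
long_ = trips.loc[trips['trip_time'] > pd.Timedelta(hours=1)]

# Convert the timedeltas to a float number of minutes
trips['trip_minutes'] = trips['trip_time'].dt.total_seconds() / 60
print(trips[['trip_time', 'trip_minutes']])
```

A numeric column such as `trip_minutes` is handy when you want to plot durations or feed them into calculations that don't understand timedeltas.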


# Exercise: Time deltas

1. Create the `trip_time` column, as a timedelta.
2. Find the mean `total_amount` for trips that were less than 1 minute long.
3. Find the mean `trip_distance` for trips that were > 1 hour long. Also, how many were there?


In [58]:
df['trip_time']

0      0 days 00:28:23
1      0 days 00:08:26
2      0 days 00:10:59
3      0 days 00:19:31
4      0 days 00:13:17
             ...      
9994   0 days 00:11:19
9995   0 days 00:15:17
9996   0 days 00:24:25
9997   0 days 00:06:08
9998   0 days 00:23:29
Name: trip_time, Length: 9999, dtype: timedelta64[ns]

In [62]:
df.loc[
    df['trip_time'] < '1 minute'   # row selector
    ,
    'total_amount'
].describe()

count     84.000000
mean      31.938929
std       43.045758
min       -3.300000
25%        3.800000
50%        6.300000
75%       60.655000
max      210.140000
Name: total_amount, dtype: float64

In [70]:
# Find the mean trip_distance for trips that were > 1 hour long. Also, how many were there?

df.loc[
    df['trip_time'] > '1 hour'   # row selector
    ,
    'trip_distance'    # column selector
].mean()

16.294117647058822

In [72]:
# True = trips shorter than one hour; False = trips lasting one hour or more
(df['trip_time'] < '1 hour').value_counts()

trip_time
True     9829
False     170
Name: count, dtype: int64

In [73]:
(df['trip_time'] < '1 hour').value_counts(normalize=True)

trip_time
True     0.982998
False    0.017002
Name: proportion, dtype: float64

In [74]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8,0 days 00:28:23
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3,0 days 00:08:26
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0,0 days 00:10:59
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16,0 days 00:19:31
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3,0 days 00:13:17


In [76]:
# what if we set the index of our data frame to be the pickup datetime?

df = df.set_index('tpep_pickup_datetime')
df

tpep_pickup_datetime,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
2015-06-02 11:19:29,2,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80,0 days 00:28:23
2015-06-02 11:19:30,2,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30,0 days 00:08:26
2015-06-02 11:19:31,2,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00,0 days 00:10:59
2015-06-02 11:19:31,2,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16,0 days 00:19:31
2015-06-02 11:19:32,1,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30,0 days 00:13:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-06-01 00:12:59,1,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30,0 days 00:11:19
2015-06-01 00:12:59,1,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30,0 days 00:15:17
2015-06-01 00:13:00,2,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30,0 days 00:24:25
2015-06-01 00:13:02,2,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80,0 days 00:06:08


In [77]:
# time series -- a data frame in which the index is a datetime value

# (1) we can pull out all of the values from a given datetime

df.loc['2015-06-01 00:12:59']

tpep_pickup_datetime,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
2015-06-01 00:12:59,2,2015-06-01 00:14:07,1,0.18,-74.005539,40.725544,1,N,-74.002983,40.725056,1,3.0,0.5,0.5,0.86,0.0,0.3,5.16,0 days 00:01:08
2015-06-01 00:12:59,1,2015-06-01 00:26:17,1,4.1,-73.994362,40.727089,1,N,-73.993248,40.73317,1,15.0,0.5,0.5,3.26,0.0,0.3,19.56,0 days 00:13:18
2015-06-01 00:12:59,1,2015-06-01 00:18:08,1,0.7,-73.985619,40.760563,1,N,-73.986572,40.766663,1,5.5,0.5,0.5,1.35,0.0,0.3,8.15,0 days 00:05:09
2015-06-01 00:12:59,2,2015-06-01 00:25:04,1,2.89,-74.003098,40.718269,1,N,-73.999634,40.687263,2,11.5,0.5,0.5,0.0,0.0,0.3,12.8,0 days 00:12:05
2015-06-01 00:12:59,1,2015-06-01 00:24:18,1,2.7,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3,0 days 00:11:19
2015-06-01 00:12:59,1,2015-06-01 00:28:16,1,4.5,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.0,0.0,0.3,20.3,0 days 00:15:17
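
With a datetime index, `.loc` also accepts less specific strings, such as a whole day, and ranges of them. A minimal sketch on a four-row made-up time series (`ts`); the index is sorted first, which is an assumption made here to keep slicing well-behaved.

```python
import pandas as pd

# Invented time series with a duplicated timestamp, like the taxi data
ts = pd.DataFrame(
    {'total_amount': [12.3, 20.3, 17.8, 8.3]},
    index=pd.to_datetime(['2015-06-01 00:12:59',
                          '2015-06-01 00:12:59',
                          '2015-06-02 11:19:29',
                          '2015-06-02 11:19:30']),
).sort_index()

# A full timestamp: every row with exactly that index value
exact = ts.loc['2015-06-01 00:12:59']

# A date only: every row that falls on that day
one_day = ts.loc['2015-06-02']

# A slice of dates: both endpoints are included
both_days = ts.loc['2015-06-01':'2015-06-02']

print(len(exact), len(one_day), len(both_days))
```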
