# Agenda: Dates and times in Pandas!

1. Dates and times in programming
2. Parsing dates with Pandas
3. The `.dt` accessor
4. Loading multiple files into a single data frame
5. How can you parse odd date formats?
6. Grouping with dates
7. Time deltas
8. Time series
9. Resampling

# Dates and times in programming

The most natural (and reasonable) way to think about dates and times is as specific, unique moments in time. I can identify a "datetime" at various points in my life, and each of these is unique:

- When I was born
- When I graduated from university
- When this meeting started
- When this meeting ends

In programming, we describe these as "datetime" or "timestamp" objects.

We also talk about spans of time. A span has no hour/minute/second or year/month/day. Rather, it has a length, such as "10 minutes long" or "3 days long." Some examples:

- This meeting will last for 1 hour
- My life, so far, has been about 51.5 years.
- The pandemic has been going on for about 2 years now

This kind of measurement is known as a "time delta," or an "interval."

You can do some basic date+time math:

- timestamp + interval = timestamp
- timestamp - timestamp = interval
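
The two rules above can be sketched with the scalar types that Pandas provides, `pd.Timestamp` and `pd.Timedelta`. The variable names and the meeting time here are made up for illustration.

```python
import pandas as pd

# hypothetical values, chosen only to illustrate the arithmetic
meeting_start = pd.Timestamp('2022-02-17 10:00:00')
meeting_length = pd.Timedelta(hours=1)

# timestamp + interval = timestamp
meeting_end = meeting_start + meeting_length
print(meeting_end)                   # 2022-02-17 11:00:00

# timestamp - timestamp = interval
print(meeting_end - meeting_start)   # 0 days 01:00:00
```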

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

filename = '../data/nyc_taxi_2019-01.csv'
df = pd.read_csv(filename, 
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count', 'total_amount', 'trip_distance'])


In [2]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55


In [3]:
df.dtypes

tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
total_amount             float64
dtype: object

In [4]:
# how can I retrieve parts of the dates and times for dropoff and pickup?
# the answer is: we'll need to tell Pandas to parse those columns as dates


In [5]:
# if I want Pandas to parse one or more columns as datetimes, then
# I can pass those column names in a list of strings to parse_dates

filename = '../data/nyc_taxi_2019-01.csv'
df = pd.read_csv(filename, 
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count', 'total_amount', 'trip_distance'],
                parse_dates=['tpep_pickup_datetime',
                            'tpep_dropoff_datetime'])


In [6]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55


In [7]:
df.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object
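
If a data frame has already been loaded without `parse_dates`, one alternative is to convert the string column afterwards with `pd.to_datetime`. This is a minimal sketch on a small made-up frame (`small_df` and its `pickup` column are stand-ins, not the taxi data itself).

```python
import pandas as pd

# a tiny stand-in for a data frame that was loaded without parse_dates
small_df = pd.DataFrame({'pickup': ['2019-01-01 00:46:40', '2019-01-01 00:59:47'],
                         'total_amount': [9.95, 16.3]})

# convert the string column to datetimes after the fact
small_df['pickup'] = pd.to_datetime(small_df['pickup'])
print(small_df.dtypes)
```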

In [8]:
!head ../data/nyc_taxi_2019-01.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
1,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,1,N,151,239,1,7,0.5,0.5,1.65,0,0.3,9.95,
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,1,N,239,246,1,14,0.5,0.5,1,0,0.3,16.3,
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,.00,1,N,236,236,1,4.5,0.5,0.5,0,0,0.3,5.8,
2,2018-11-28 15:52:25,2018-11-28 15:55:45,5,.00,1,N,193,193,2,3.5,0.5,0.5,0,0,0.3,7.55,
2,2018-11-28 15:56:57,2018-11-28 15:58:33,5,.00,2,N,193,193,2,52,0,0.5,0,0,0.3,55.55,
2,2018-11-28 16:25:49,2018-11-28 16:28:26,5,.00,1,N,193,193,2,3.5,0.5,0.5,0,5.76,0.3,13.31,
2,2018-11-28 16:29:37,2018-11-28 16:33:43,5,.00,2,N,193,193,2,52,0,0.5,0,0,0.3,55.55,
1,2019-01-01 00:21:28,2019-01-01 00:28:37,1,1.30,1,N,163,229,1,6.5,0.5,0.5,1.25,0,0.3,9.05,
1,2019-01-01 00:

In [9]:
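# if my file contains 02-04-2022, is that April 2nd or Feb. 4th? It depends on whether
# you use the European (day-first) or American (month-first) convention. By default,
# Pandas assumes month-first; pass dayfirst=True to read_csv for day-first dates.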

help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, squeeze=None, prefix=<no_default>, mangle_dupe_cols=True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression: 'CompressionOptions' = 'infer', thousands=None, decimal: 'str' = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors: 'str | None' = 'strict', dialect=None, error_bad_li

In [10]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55


In [11]:
# I want to retrieve the year from each of the pickup datetimes
# to do this, I can use the .dt accessor

df['tpep_pickup_datetime'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x14a5991b0>

In [17]:
# question: how many of the entries in the Jan. 2019 records are really from 2019?
df['tpep_pickup_datetime'].dt.year.value_counts()

2019    7667349
2018        366
2009         50
2008         22
2003          2
2088          2
2001          1
Name: tpep_pickup_datetime, dtype: int64
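
The `.dt` accessor exposes more than the year. Here is a small sketch on a made-up series (`pickups` is a hypothetical name) that also shows one way to keep only the rows that really belong to January 2019.

```python
import pandas as pd

# three made-up pickup times, one of them outside January 2019
pickups = pd.Series(pd.to_datetime(['2019-01-01 00:46:40',
                                    '2018-12-21 13:48:30',
                                    '2019-01-31 23:56:06']))

print(pickups.dt.year.tolist())    # [2019, 2018, 2019]
print(pickups.dt.month.tolist())   # [1, 12, 1]
print(pickups.dt.hour.tolist())    # [0, 13, 23]

# a boolean series that is True only for pickups in January 2019
in_jan_2019 = (pickups.dt.year == 2019) & (pickups.dt.month == 1)
print(in_jan_2019.tolist())        # [True, False, True]
```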

In [19]:
with open('mydata.csv', 'w') as outfile:
    outfile.write('17-02-2022,hello\n')
    outfile.write('18-02-2022,goodbye\n')
    

In [21]:
mydata_df = pd.read_csv('mydata.csv', header=None)
mydata_df

Unnamed: 0,0,1
0,17-02-2022,hello
1,18-02-2022,goodbye


In [22]:
mydata_df.dtypes

0    object
1    object
dtype: object

In [23]:
mydata_df = pd.read_csv('mydata.csv', header=None, parse_dates=[0])
mydata_df


Unnamed: 0,0,1
0,2022-02-17,hello
1,2022-02-18,goodbye


In [24]:
mydata_df = pd.read_csv('mydata.csv', header=None, parse_dates=[0], dayfirst=True)
mydata_df

Unnamed: 0,0,1
0,2022-02-17,hello
1,2022-02-18,goodbye
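
The two results above look identical because 17 and 18 cannot be month numbers, so there is only one sensible reading. With a truly ambiguous date, the difference shows up. The sketch below uses `pd.to_datetime` directly on a single string; passing an explicit `format` is one way to handle odd date formats without any guessing.

```python
import pandas as pd

ambiguous = '02-04-2022'

# default: month first (the American convention)
print(pd.to_datetime(ambiguous))                     # 2022-02-04 00:00:00

# dayfirst=True: day first (the European convention)
print(pd.to_datetime(ambiguous, dayfirst=True))      # 2022-04-02 00:00:00

# an explicit format string removes the guesswork entirely
print(pd.to_datetime(ambiguous, format='%d-%m-%Y'))  # 2022-04-02 00:00:00
```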


In [25]:
df['tpep_pickup_datetime'].dt.day_of_week.value_counts()

3    1351516
2    1259695
1    1203843
4    1082795
5    1007797
0     904512
6     857634
Name: tpep_pickup_datetime, dtype: int64

In [26]:
# get a proportion (a fraction of the total), not a count
df['tpep_pickup_datetime'].dt.day_of_week.value_counts(normalize=True)

3    0.176259
2    0.164284
1    0.157000
4    0.141213
5    0.131432
0    0.117963
6    0.111849
Name: tpep_pickup_datetime, dtype: float64

In [27]:
# how long were the taxi rides?
# in order to know that, we'll need a timedelta!

df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55


In [28]:
# datetime - datetime = time delta

df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

0         0 days 00:06:40
1         0 days 00:19:12
2         0 days 00:04:10
3         0 days 00:03:20
4         0 days 00:01:36
                ...      
7667787   0 days 00:21:03
7667788   0 days 00:01:08
7667789   0 days 00:00:04
7667790   0 days 00:00:27
7667791   0 days 00:01:19
Length: 7667792, dtype: timedelta64[ns]

In [29]:
# create a new column, trip_time, a timedelta containing the trip time
df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

In [30]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3,0 days 00:19:12
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8,0 days 00:04:10
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55,0 days 00:03:20
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55,0 days 00:01:36


In [31]:
df.dtypes

tpep_pickup_datetime      datetime64[ns]
tpep_dropoff_datetime     datetime64[ns]
passenger_count                    int64
trip_distance                    float64
total_amount                     float64
trip_time                timedelta64[ns]
dtype: object
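
A timedelta column has its own `.dt` accessor. When you want a plain number (for example, to compute a speed or draw a plot), `dt.total_seconds()` is one option. A minimal sketch on a made-up series, using the first two trip times shown above:

```python
import pandas as pd

trip_time = pd.Series(pd.to_timedelta(['0 days 00:06:40', '0 days 00:19:12']))

# total_seconds() turns each timedelta into a float number of seconds
seconds = trip_time.dt.total_seconds()
print(seconds.tolist())    # [400.0, 1152.0]

# divide by 60 to get minutes
minutes = seconds / 60
print(minutes)
```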

In [33]:
# now what? 
# what if: I want to find all trips that took longer than 12 hours

# you can compare a timedelta column with a string made of a number plus a unit
# that Pandas recognizes, such as '12 hours'
df['trip_time'] > '12 hours'

0          False
1          False
2          False
3          False
4          False
           ...  
7667787    False
7667788    False
7667789    False
7667790    False
7667791    False
Name: trip_time, Length: 7667792, dtype: bool

In [35]:
# use a boolean index to get only those matching rows back
df[df['trip_time'] > '12 hours']

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
112,2018-12-31 17:22:55,2019-01-01 16:57:23,1,22.59,68.30,0 days 23:34:28
165,2019-01-01 00:53:39,2019-01-02 00:51:48,1,2.60,16.44,0 days 23:58:09
824,2019-01-01 00:27:35,2019-01-02 00:03:49,1,8.28,36.67,0 days 23:36:14
1104,2019-01-01 00:45:54,2019-01-02 00:42:33,2,1.71,13.80,0 days 23:56:39
1108,2019-01-01 00:46:37,2019-01-02 00:29:38,2,16.48,49.30,0 days 23:43:01
...,...,...,...,...,...,...
7664614,2019-01-31 22:45:18,2019-02-01 22:00:05,5,8.61,37.87,0 days 23:14:47
7664706,2019-01-31 19:43:35,2019-02-01 18:54:07,1,2.52,11.80,0 days 23:10:32
7665965,2019-01-31 21:02:14,2019-02-01 20:21:07,1,4.38,23.50,0 days 23:18:53
7666072,2019-01-31 22:13:08,2019-02-01 21:36:12,1,2.56,20.76,0 days 23:23:04


In [36]:
# how much did people pay, on average, for 12+ hour taxi rides?

#        row selector                  column selector
df.loc[df['trip_time'] > '12 hours', 'total_amount'].mean()

19.97441213336676

In [37]:
df.loc[df['trip_time'] > '12 hours', 'trip_distance'].mean()

4.059907746302332
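
If you would rather not rely on a string being interpreted for you, you can build the cutoff yourself as a `pd.Timedelta`. This sketch uses a made-up series; `cutoff` is a hypothetical name.

```python
import pandas as pd

trip_time = pd.Series(pd.to_timedelta(['00:06:40', '23:34:28', '00:00:19']))

# an explicit 12-hour cutoff, equivalent to the string '12 hours'
cutoff = pd.Timedelta(hours=12)
print(cutoff == pd.Timedelta('12 hours'))   # True

long_trips = trip_time > cutoff
print(long_trips.tolist())                  # [False, True, False]
```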

In [39]:
# you can also compare with a string that's a bit more traditional, in HH:MM:SS format
df.loc[df['trip_time'] < '00:01:00']

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
20,2019-01-01 00:46:43,2019-01-01 00:47:02,1,0.06,3.80,0 days 00:00:19
28,2019-01-01 00:32:59,2019-01-01 00:32:59,3,0.00,7.80,0 days 00:00:00
156,2019-01-01 00:32:24,2019-01-01 00:33:21,0,5.30,4.55,0 days 00:00:57
663,2019-01-01 00:32:56,2019-01-01 00:33:35,2,0.10,-3.80,0 days 00:00:39
664,2019-01-01 00:32:56,2019-01-01 00:33:35,2,0.10,3.80,0 days 00:00:39
...,...,...,...,...,...,...
7667659,2019-01-31 23:24:01,2019-01-31 23:24:05,1,0.00,63.36,0 days 00:00:04
7667714,2019-01-31 23:07:58,2019-01-31 23:08:07,1,5.80,3.80,0 days 00:00:09
7667753,2019-01-31 23:56:06,2019-01-31 23:57:04,1,0.29,4.30,0 days 00:00:58
7667789,2019-01-31 23:36:36,2019-01-31 23:36:40,1,0.00,0.00,0 days 00:00:04


In [40]:
# how much did people pay for those, on average?
df.loc[df['trip_time'] < '00:01:00', 'total_amount'].mean()

33.36508107305351

In [41]:
# how much did people pay, on average, for trips between 1 and 2 minutes long?
df.loc[((df['trip_time'] >= '00:01:00') &
        (df['trip_time'] <= '00:02:00')), 'total_amount'].mean()

6.16960748771811
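
The pair of comparisons joined with `&` can also be written with `Series.between`, which includes both endpoints by default. A minimal sketch on a made-up frame (`rides` and its values are invented for illustration):

```python
import pandas as pd

rides = pd.DataFrame({
    'trip_time': pd.to_timedelta(['00:00:30', '00:01:00', '00:01:45',
                                  '00:02:00', '00:05:00']),
    'total_amount': [3.0, 5.0, 6.0, 7.0, 12.0]})

# True for trips lasting from 1 to 2 minutes, endpoints included
one_to_two = rides['trip_time'].between(pd.Timedelta(minutes=1),
                                        pd.Timedelta(minutes=2))

print(rides.loc[one_to_two, 'total_amount'].mean())   # 6.0
```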

In [42]:
# what was the maximum distance recorded for a trip shorter than 1 minute?
df.loc[df['trip_time'] < '00:01:00', 'trip_distance'].max()

107.8

In [44]:
# what was the average distance traveled on each day of the week?
# solution 1: grab values for each day, separately, and get the mean
df.loc[df['tpep_pickup_datetime'].dt.day_of_week == 0, 'trip_distance'].mean()

2.8721331834182395

In [45]:
df.loc[df['tpep_pickup_datetime'].dt.day_of_week == 1, 'trip_distance'].mean()

2.866172166968616

In [46]:
df.loc[df['tpep_pickup_datetime'].dt.day_of_week == 2, 'trip_distance'].mean()

2.793659290542552

In [47]:
# solution 2: create a new column, day_of_week, and group on that
df['day_of_week'] = df['tpep_pickup_datetime'].dt.day_of_week
df.groupby('day_of_week')['trip_distance'].mean()

day_of_week
0    2.872133
1    2.866172
2    2.793659
3    2.755653
4    2.777719
5    2.617331
6    2.962710
Name: trip_distance, dtype: float64

In [49]:
# solution 3 (and best, in my opinion) -- group by dt.day_of_week
df.groupby(df['tpep_pickup_datetime'].dt.day_of_week)['trip_distance'].mean()

tpep_pickup_datetime
0    2.872133
1    2.866172
2    2.793659
3    2.755653
4    2.777719
5    2.617331
6    2.962710
Name: trip_distance, dtype: float64

In [56]:
# solution 4 (EVEN BETTER) -- group by dt.day_name()
df.groupby(df['tpep_pickup_datetime'].dt.day_name())['trip_distance'].mean().sort_values()

tpep_pickup_datetime
Saturday     2.617331
Thursday     2.755653
Friday       2.777719
Wednesday    2.793659
Tuesday      2.866172
Monday       2.872133
Sunday       2.962710
Name: trip_distance, dtype: float64
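
Grouping by `day_name()` gives readable labels, but the result is no longer in calendar order. One way to get that order back is to `reindex` with a list of day names. This is a sketch on a made-up frame; `rides` and `day_order` are hypothetical names.

```python
import pandas as pd

# 2019-01-07 and 2019-01-14 are Mondays, 01-08 is a Tuesday, 01-13 is a Sunday
rides = pd.DataFrame({
    'pickup': pd.to_datetime(['2019-01-07', '2019-01-08',
                              '2019-01-13', '2019-01-14']),
    'trip_distance': [1.0, 2.0, 4.0, 3.0]})

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

by_day = rides.groupby(rides['pickup'].dt.day_name())['trip_distance'].mean()

# put the days in calendar order, then drop days that had no rides
by_day = by_day.reindex(day_order).dropna()
print(by_day)
```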

In [57]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
0,2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40,1
1,2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3,0 days 00:19:12,1
2,2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8,0 days 00:04:10,4
3,2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55,0 days 00:03:20,2
4,2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55,0 days 00:01:36,2


In [58]:
# what if I were to make tpep_pickup_datetime into the data frame's index?
# that's known as a "time series" in Pandas

df = df.set_index('tpep_pickup_datetime')

In [59]:
df.head()

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40,1
2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.6,16.3,0 days 00:19:12,1
2018-12-21 13:48:30,2018-12-21 13:52:40,3,0.0,5.8,0 days 00:04:10,4
2018-11-28 15:52:25,2018-11-28 15:55:45,5,0.0,7.55,0 days 00:03:20,2
2018-11-28 15:56:57,2018-11-28 15:58:33,5,0.0,55.55,0 days 00:01:36,2


In [60]:
# Find all of the rows with a particular pickup datetime
df.loc['2019-01-01 00:46:40']

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40,1
2019-01-01 00:46:40,2019-01-01 00:52:34,1,1.07,8.3,0 days 00:05:54,1
2019-01-01 00:46:40,2019-01-01 00:57:23,1,5.13,16.8,0 days 00:10:43,1


In [63]:
# Even better: We can get a slice of all taxi rides that started in a 10-second window
# unlike integer-position slices, a label-based slice with .loc includes the end value
df.loc['2019-01-01 00:46:40':'2019-01-01 00:46:50']

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40,1
2019-01-01 00:46:43,2019-01-01 00:47:02,1,0.06,3.8,0 days 00:00:19,1
2019-01-01 00:46:44,2019-01-01 00:56:18,2,1.1,10.55,0 days 00:09:34,1
2019-01-01 00:46:45,2019-01-01 00:54:51,1,0.7,7.8,0 days 00:08:06,1
2019-01-01 00:46:41,2019-01-01 01:01:23,4,1.8,14.15,0 days 00:14:42,1
2019-01-01 00:46:44,2019-01-01 00:55:53,1,1.1,9.8,0 days 00:09:09,1
2019-01-01 00:46:47,2019-01-01 00:55:23,2,2.32,12.36,0 days 00:08:36,1
2019-01-01 00:46:48,2019-01-01 01:16:17,2,3.34,24.36,0 days 00:29:29,1
2019-01-01 00:46:46,2019-01-01 00:52:52,1,0.6,8.16,0 days 00:06:06,1
2019-01-01 00:46:49,2019-01-01 00:56:48,3,1.5,9.8,0 days 00:09:59,1


In [64]:
# if I specify only large parts of the date when retrieving using the index,
# the small parts become wildcards

# meaning: leave off seconds? Any second will match
# leave off minutes and seconds? Any minute or second will match

df.loc['2019-01-01 00:46:40']

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.5,9.95,0 days 00:06:40,1
2019-01-01 00:46:40,2019-01-01 00:52:34,1,1.07,8.3,0 days 00:05:54,1
2019-01-01 00:46:40,2019-01-01 00:57:23,1,5.13,16.8,0 days 00:10:43,1


In [65]:
# leave off the seconds...
df.loc['2019-01-01 00:46']

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,9.95,0 days 00:06:40,1
2019-01-01 00:46:00,2019-01-01 00:49:07,1,0.30,6.36,0 days 00:03:07,1
2019-01-01 00:46:43,2019-01-01 00:47:02,1,0.06,3.80,0 days 00:00:19,1
2019-01-01 00:46:44,2019-01-01 00:56:18,2,1.10,10.55,0 days 00:09:34,1
2019-01-01 00:46:09,2019-01-01 01:06:46,1,4.18,20.76,0 days 00:20:37,1
...,...,...,...,...,...,...
2019-01-01 00:46:12,2019-01-01 00:52:39,2,0.50,8.15,0 days 00:06:27,1
2019-01-01 00:46:25,2019-01-01 00:57:16,1,3.48,13.80,0 days 00:10:51,1
2019-01-01 00:46:15,2019-01-01 00:50:01,1,0.75,7.24,0 days 00:03:46,1
2019-01-01 00:46:44,2019-01-01 00:56:23,5,1.79,9.80,0 days 00:09:39,1


In [66]:
# leave off the seconds and minutes...
df.loc['2019-01-01 00']

Unnamed: 0_level_0,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time,day_of_week
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-01-01 00:46:40,2019-01-01 00:53:20,1,1.50,9.95,0 days 00:06:40,1
2019-01-01 00:59:47,2019-01-01 01:18:59,1,2.60,16.30,0 days 00:19:12,1
2019-01-01 00:21:28,2019-01-01 00:28:37,1,1.30,9.05,0 days 00:07:09,1
2019-01-01 00:32:01,2019-01-01 00:45:39,1,3.70,18.50,0 days 00:13:38,1
2019-01-01 00:57:32,2019-01-01 01:09:32,2,2.10,13.00,0 days 00:12:00,1
...,...,...,...,...,...,...
2019-01-01 00:43:57,2019-01-01 00:52:52,2,2.09,12.36,0 days 00:08:55,1
2019-01-01 00:56:26,2019-01-02 00:01:11,2,0.90,8.16,0 days 23:04:45,1
2019-01-01 00:33:02,2019-01-01 00:54:09,2,5.53,25.56,0 days 00:21:07,1
2019-01-01 00:52:42,2019-01-01 01:01:04,1,2.41,9.80,0 days 00:08:22,1
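
Here is the same idea on a tiny made-up time series, small enough to check by eye. Sorting the index first is a precaution I'm adding here, not something the cells above do: slicing tends to be more predictable, and faster, on a sorted datetime index.

```python
import pandas as pd

# four made-up rides, deliberately out of chronological order
rides = pd.DataFrame(
    {'total_amount': [9.95, 16.3, 5.8, 9.05]},
    index=pd.to_datetime(['2019-01-01 00:46:40', '2019-01-01 00:59:47',
                          '2018-12-21 13:48:30', '2019-01-01 01:21:28']))
rides = rides.sort_index()

# leave off minutes and seconds: any ride in that hour matches
print(len(rides.loc['2019-01-01 00']))    # 2

# leave off the time entirely: any ride on that day matches
print(len(rides.loc['2019-01-01']))       # 3

# a label-based slice includes both endpoints
print(len(rides.loc['2019-01-01 00:46:40':'2019-01-01 00:59:47']))   # 2
```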


In [68]:
# another kind of question we can ask: For each day in our data set, what was
# the average amount paid?

# we can do this very powerfully with "resampling"
# (this is another kind of grouping, but by date/time; periods with no data
# still appear in the result, with NaN as their mean, which is why we call dropna)

# for every calendar day, starting with the earliest date in the data frame,
# and ending with the latest date in the data frame, show the mean total_amount
df.resample('1D')['total_amount'].mean().dropna()

tpep_pickup_datetime
2001-02-02     3.800000
2003-01-01     0.000000
2008-12-31    14.134545
2009-01-01    14.738400
2018-11-28    77.324000
2018-11-29    14.550000
2018-12-21     5.800000
2018-12-30    16.746667
2018-12-31    17.776348
2019-01-01    16.879556
2019-01-02    16.719936
2019-01-03    15.811817
2019-01-04    15.521666
2019-01-05    14.225854
2019-01-06    15.441653
2019-01-07    15.762637
2019-01-08    15.573949
2019-01-09    15.560222
2019-01-10    15.686428
2019-01-11    17.875299
2019-01-12    14.598995
2019-01-13    15.217222
2019-01-14    16.115582
2019-01-15    15.712672
2019-01-16    15.948871
2019-01-17    15.981913
2019-01-18    15.939742
2019-01-19    14.277029
2019-01-20    14.089513
2019-01-21    15.203768
2019-01-22    15.902929
2019-01-23    17.389760
2019-01-24    15.983443
2019-01-25    15.883327
2019-01-26    14.495203
2019-01-27    15.843776
2019-01-28    15.719709
2019-01-29    15.399370
2019-01-30    15.148965
2019-01-31    15.763856
2019-02-01    15.88

In [69]:
# mean amount per week
df.resample('1W')['total_amount'].mean().dropna()

tpep_pickup_datetime
2001-02-04     3.800000
2003-01-05     0.000000
2009-01-04    14.553889
2018-12-02    71.617273
2018-12-23     5.800000
2018-12-30    16.746667
2019-01-06    15.705234
2019-01-13    15.799031
2019-01-20    15.504423
2019-01-27    15.831641
2019-02-03    15.504500
2019-02-10    17.680000
2019-02-17    49.300000
2019-02-24    23.300000
2019-03-03    14.025000
2019-03-17    10.840000
2019-03-24    11.273333
2019-04-07    24.300000
2019-04-14    13.780000
2019-04-28    15.220000
2019-05-26    26.160000
2019-06-16    13.420000
2019-07-07     9.100000
2019-07-28    18.960000
2019-08-18    22.800000
2019-09-08     6.800000
2088-01-25    10.300000
Name: total_amount, dtype: float64
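
To see what resampling does with dates that have no data, here is a minimal sketch on a made-up time series with a gap on January 2nd. The frame `rides` and its values are invented for illustration.

```python
import pandas as pd

# two rides on Jan. 1, none on Jan. 2, one on Jan. 3
rides = pd.DataFrame(
    {'total_amount': [10.0, 20.0, 30.0]},
    index=pd.to_datetime(['2019-01-01 08:00', '2019-01-01 18:00',
                          '2019-01-03 09:00']))

# one row per calendar day, including the empty day (its mean is NaN)
daily_mean = rides.resample('1D')['total_amount'].mean()
print(daily_mean)

# counting instead of averaging gives 0 for the empty day
daily_count = rides.resample('1D')['total_amount'].count()
print(daily_count.tolist())   # [2, 0, 1]
```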

In [72]:
import glob 
glob.glob('../data/nyc_taxi_2020*csv')

['../data/nyc_taxi_2020-01.csv', '../data/nyc_taxi_2020-07.csv']

In [76]:
# list comprehension: read each matching file into its own data frame

all_dfs = [pd.read_csv(filename, 
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count', 'total_amount', 'trip_distance'],
                      parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
        for filename in glob.glob('../data/nyc_taxi_2020*csv')]

df = pd.concat(all_dfs)

In [77]:
df.shape

(7205420, 5)
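
One thing to keep in mind with `pd.concat`: by default, each data frame keeps its own index, so the combined frame contains repeated index values. Passing `ignore_index=True` renumbers the rows. The sketch below uses two in-memory "files" via `io.StringIO` so it can run without the taxi data; the column names and values are made up.

```python
import io
import pandas as pd

# two small in-memory CSV "files" standing in for the monthly taxi files
jan_csv = io.StringIO('pickup,total_amount\n'
                      '2020-01-01 00:28:15,11.27\n'
                      '2020-01-01 00:35:39,12.30\n')
jul_csv = io.StringIO('pickup,total_amount\n'
                      '2020-07-01 00:25:32,9.30\n')

all_dfs = [pd.read_csv(one_file, parse_dates=['pickup'])
           for one_file in [jan_csv, jul_csv]]

# default: each frame keeps its original index
print(pd.concat(all_dfs).index.tolist())    # [0, 1, 0]

# ignore_index=True: rows are renumbered from 0
combined = pd.concat(all_dfs, ignore_index=True)
print(combined.index.tolist())              # [0, 1, 2]
```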

In [78]:
# how many rides were there per combination of year+month in our data set?
df.groupby([df['tpep_pickup_datetime'].dt.year,
            df['tpep_pickup_datetime'].dt.month])['total_amount'].count()

tpep_pickup_datetime  tpep_pickup_datetime
2003                  1                             1
2008                  12                           11
2009                  1                            20
2019                  12                          131
2020                  1                       6404796
                      2                            30
                      3                             5
                      4                             1
                      5                             5
                      6                             7
                      7                        800408
                      8                             2
2021                  1                             3
Name: total_amount, dtype: int64
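
Grouping on year and month separately gives a two-level index in which both levels have the same name. An alternative, sketched here on a made-up frame, is to group on a single monthly period with `dt.to_period('M')`. The names `rides` and `per_month` are hypothetical.

```python
import pandas as pd

rides = pd.DataFrame({
    'pickup': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-07-04']),
    'total_amount': [10.0, 12.0, 8.0]})

# one group per year+month combination, labeled like 2020-01
per_month = rides.groupby(rides['pickup'].dt.to_period('M'))['total_amount'].count()
print(per_month)
```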

In [79]:
# rides in July 2020 as a fraction of rides in January 2020
800408/6404796

0.12497010053091465