# Agenda: Dates and times

1. Background theory
2. How to turn a column into datetime info
3. Calculations we can perform
4. Datetime columns as our index ("time series")

# Background

When we use the word "time" in our everyday speech, we might mean one of two different things:

1. One specific, unique point in time, specified with year/month/day and hour/minute/second (or with even finer precision). Examples: time of birth, time of death, when you graduated from university, when a meeting starts. Each of these is a unique point in time, and you can specify precisely when it happened. On a computer, such a value is known as either a `datetime` or a `timestamp`.
2. A span of time between two points. It isn't tied to a specific year/month/day, but we measure it with the same units. We can use this kind of value to measure a lifespan, the time spent in school, the length of a marriage, or the time you spent in a meeting. The data type used for this is known as a `timedelta` or an `interval`.

Date math:
- `timestamp` - `timestamp` = `timedelta`
- `timestamp` + `timedelta` = `timestamp`
- `timestamp` - `timedelta` = `timestamp`
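
The three rules above can be tried out on single values. This is a minimal sketch using `pd.Timestamp` and `pd.Timedelta`; the variable names and the specific times are made up for illustration.

```python
import pandas as pd

# Two specific points in time (timestamps); the values are made up for illustration
meeting_start = pd.Timestamp('2015-06-02 11:19:29')
meeting_end = pd.Timestamp('2015-06-02 11:47:52')

# timestamp - timestamp = timedelta
duration = meeting_end - meeting_start
print(duration)                                  # 0 days 00:28:23

# timestamp + timedelta = timestamp
print(meeting_start + duration == meeting_end)   # True

# timestamp - timedelta = timestamp
one_hour_earlier = meeting_start - pd.Timedelta(hours=1)
print(one_hour_earlier)                          # 2015-06-02 10:19:29
```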

In [13]:
import pandas as pd
filename = '../data/taxi.csv'

df = pd.read_csv(filename,
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count',
                          'trip_distance',
                          'total_amount'])
                          

In [14]:
df.head()

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [15]:
df.dtypes

tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
total_amount             float64
dtype: object

In [16]:
# the main way to take a string column and get a datetime column based on it
# is with pd.to_datetime, a function that comes with Pandas.

pd.to_datetime(df['tpep_pickup_datetime'])

0      2015-06-02 11:19:29
1      2015-06-02 11:19:30
2      2015-06-02 11:19:31
3      2015-06-02 11:19:31
4      2015-06-02 11:19:32
               ...        
9994   2015-06-01 00:12:59
9995   2015-06-01 00:12:59
9996   2015-06-01 00:13:00
9997   2015-06-01 00:13:02
9998   2015-06-01 00:13:04
Name: tpep_pickup_datetime, Length: 9999, dtype: datetime64[ns]

In [17]:
# how much memory are we *really* using right now on this data frame?
df.memory_usage(deep=True).sum()    # 1,599,972

1599972

In [18]:
# we can assign the datetime version of these columns back to the original data frame

df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

In [19]:
df.memory_usage(deep=True).sum()   

400092
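
The drop from 1,599,972 to 400,092 bytes is what we would expect: a `datetime64` value occupies 8 bytes, while each date string is stored as a much larger text value. Here is a rough sketch of the same effect on a small, made-up series (not the taxi data); the variable names are only for illustration.

```python
import pandas as pd

# A small made-up series of date strings, repeated to make the difference visible
as_strings = pd.Series(['2015-06-02 11:19:29', '2015-06-02 11:19:30'] * 1000)
as_datetimes = pd.to_datetime(as_strings)

# index=False leaves out the index, so we measure only the values themselves
string_bytes = as_strings.memory_usage(deep=True, index=False)
datetime_bytes = as_datetimes.memory_usage(deep=True, index=False)

# datetime64 values take 8 bytes each, so 2,000 values take 16,000 bytes
print(string_bytes, datetime_bytes)
```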

Dates and times come in *many* different formats. How did `pd.to_datetime` know how to convert our strings into datetime values?

- If the datetime value (as a string) looks like `YYYY-MM-DD HH:MM:SS`, then Pandas can handle it just fine.
- If the value is in US format (i.e., `MM-DD-YYYY HH:MM:SS`), then Pandas can also handle it by default.
- If the value is in European format (i.e., `DD-MM-YYYY HH:MM:SS`), then you need to pass `dayfirst=True` to `to_datetime`, which overrides the default month-first (US) interpretation.
- If it's in another format, then you have to describe it with a format string that uses lots of `%` codes. These are the same codes used by `strftime` and `strptime` in Python's `datetime` module (and in many other languages). You can pass this string as the `format` keyword argument to `pd.to_datetime`.
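
To make these cases concrete, here is a small sketch with made-up strings. It assumes a recent version of Pandas; the example values and variable names are hypothetical and do not come from the taxi file.

```python
import pandas as pd

# A made-up string that is ambiguous: is it March 4 or April 3?
ambiguous = pd.Series(['03-04-2021 10:30:00'])

us_style = pd.to_datetime(ambiguous)                 # default: month first
eu_style = pd.to_datetime(ambiguous, dayfirst=True)  # day first

print(us_style[0])   # 2021-03-04 10:30:00
print(eu_style[0])   # 2021-04-03 10:30:00

# An unusual layout, described explicitly with strptime-style % codes
odd = pd.Series(['03/04/2021 10h30'])
parsed = pd.to_datetime(odd, format='%d/%m/%Y %Hh%M')
print(parsed[0])     # 2021-04-03 10:30:00
```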

In [21]:
# another (my preferred) way is when we call pd.read_csv
# we can pass an argument, "parse_dates", which takes a list of column names. Those will all be parsed as datetime values.

df = pd.read_csv(filename,
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count',
                          'trip_distance',
                          'total_amount'],
                parse_dates=['tpep_pickup_datetime',
                             'tpep_dropoff_datetime'])
                          

In [22]:
df.head()

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [23]:
df.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object

# Exercise: Chicago weather

1. Read the weather data from Chicago (`../data/chicago,il.csv`)
2. Don't parse the datetime column (the first one). Check to see how much memory it uses as strings.
3. Re-read it using `parse_dates`.
4. Compare the memory usage.

In [25]:
filename = '../data/chicago,il.csv'
df = pd.read_csv(filename)
df.memory_usage(deep=True).sum()  # 332,220

332220

In [27]:
df = pd.read_csv(filename,
                parse_dates=['date_time'])
df.memory_usage(deep=True).sum()  # 288,540

288540

In [28]:
288540 / 332220

0.8685208596713021

In [30]:
filename = '../data/taxi.csv'

df = pd.read_csv(filename,
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count',
                          'trip_distance',
                          'total_amount'],
                parse_dates=['tpep_pickup_datetime',
                             'tpep_dropoff_datetime'])
                          

In [31]:
df.head()

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


# How can we extract data from a `datetime` column?

We use the `.dt` accessor, an attribute that is designed to retrieve one part of the `datetime` value. It's very similar in spirit to what we did last week with the `.str` accessor, giving us access to many string functions. Here, we're more typically retrieving pieces of the `datetime`, but there are some methods, too.

In [42]:
df['tpep_pickup_datetime'].dt.is_leap_year

0       False
1       False
2       False
3       False
4       False
        ...  
9994    False
9995    False
9996    False
9997    False
9998    False
Name: tpep_pickup_datetime, Length: 9999, dtype: bool

# How can we use this?

1. Select only those rows in which a datetime component meets a condition (e.g., `.dt.is_leap_year` is `True`, or `.dt.hour == 8`)
2. We can use them for grouping
3. We can use them in calculations
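
Here is a short sketch of all three uses on a tiny, made-up data frame that stands in for the taxi data. The column name `pickup` and the values are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up frame standing in for the taxi data
rides = pd.DataFrame({
    'pickup': pd.to_datetime(['2015-06-01 08:15:00',
                              '2015-06-01 17:40:00',
                              '2015-06-02 08:05:00',
                              '2015-06-02 23:55:00']),
    'total_amount': [10.0, 20.0, 14.0, 30.0],
})

# 1. Filtering: keep only the rides picked up during the 8 o'clock hour
morning = rides.loc[rides['pickup'].dt.hour == 8]

# 2. Grouping: mean fare for each day of the month
by_day = rides.groupby(rides['pickup'].dt.day)['total_amount'].mean()

# 3. Calculation: minutes elapsed since midnight at pickup
minutes_since_midnight = rides['pickup'].dt.hour * 60 + rides['pickup'].dt.minute

print(len(morning))                     # 2
print(by_day.tolist())                  # [15.0, 22.0]
print(minutes_since_midnight.tolist())  # [495, 1060, 485, 1435]
```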

# Exercise: NYC taxis from July 2020

1. Load the file `../data/nyc_taxi_2020-07.csv` into a data frame. *BUT* only read in the columns `tpep_pickup_datetime`, `tpep_dropoff_datetime`, `passenger_count`, `trip_distance`, `total_amount`. Don't use `parse_dates` just yet; like last time, get the memory usage.
2. Use `parse_dates` and compare the memory usage.
3. What were the three hours of the day at which taxis were most often called?
4. This data is all supposed to be from July; how many rows have non-July 2020 data?
5. How many rows have non-2020 data?

In [43]:
filename = '../data/nyc_taxi_2020-07.csv'

df = pd.read_csv(filename,
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count',
                          'trip_distance', 'total_amount'])


In [44]:
df.shape

(800412, 5)

In [46]:
df.memory_usage(deep=True).sum()   # 128,066,052

128066052

In [47]:
filename = '../data/nyc_taxi_2020-07.csv'

df = pd.read_csv(filename,
                 usecols=['tpep_pickup_datetime',
                          'tpep_dropoff_datetime',
                          'passenger_count',
                          'trip_distance', 'total_amount'],
                parse_dates=['tpep_pickup_datetime',
                             'tpep_dropoff_datetime'])


In [48]:
df.memory_usage(deep=True).sum()    # 32,016,612

32016612

In [49]:
32016612 / 128066052

0.2500007730385879

In [54]:
# What were the three hours of the day at which taxis were most often called?

(
    df['tpep_pickup_datetime']
    .dt.hour
    .value_counts(normalize=True)
)


tpep_pickup_datetime
15    0.073974
14    0.073803
13    0.071938
16    0.071393
17    0.070799
12    0.068169
18    0.066377
11    0.062275
10    0.057823
19    0.053063
9     0.051235
8     0.043840
20    0.038292
7     0.032345
21    0.030812
22    0.025965
6     0.023181
23    0.022728
0     0.014314
1     0.011970
4     0.009213
2     0.009047
3     0.008942
5     0.008504
Name: proportion, dtype: float64
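
The output above lists all 24 hours. Because `value_counts` sorts from most to least frequent, the first three entries (hours 15, 14, and 13) answer the question. One possible way to get only those entries is to add `.head(3)`, sketched here on made-up pickup times rather than the real file.

```python
import pandas as pd

# Made-up pickup times: three at 15:xx, two at 14:xx, one at 09:xx
pickups = pd.Series(pd.to_datetime([
    '2020-07-01 15:05:00', '2020-07-01 15:45:00', '2020-07-02 15:10:00',
    '2020-07-01 14:20:00', '2020-07-02 14:30:00',
    '2020-07-01 09:00:00',
]))

# value_counts sorts from most to least frequent, so head(3) gives the top three
top_three = pickups.dt.hour.value_counts().head(3)
print(top_three.index.tolist())   # [15, 14, 9]
```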

In [59]:
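# This data is all supposed to be from July; how many rows have non-July 2020 data?
# Note: combining the two conditions with & keeps only the rows where BOTH the
# month and the year are wrong. To catch every row outside July 2020 (for example,
# a pickup on 2020-06-30), the conditions would need to be combined with | instead.

df.loc[(df['tpep_pickup_datetime'].dt.month != 7) &
       (df['tpep_pickup_datetime'].dt.year != 2020)]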

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
198195,2008-12-31 23:12:22,2008-12-31 23:25:04,1.0,1.62,11.8
198275,2009-01-01 01:29:56,2009-01-01 01:38:44,1.0,1.31,10.3


In [60]:
# How many rows have non-2020 data?

df.loc[df['tpep_pickup_datetime'].dt.year != 2020]

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
198195,2008-12-31 23:12:22,2008-12-31 23:25:04,1.0,1.62,11.8
198275,2009-01-01 01:29:56,2009-01-01 01:38:44,1.0,1.31,10.3


# Grouping

We can run `groupby` on any categorical column. That includes the result of invoking `.dt.NAME` for any `NAME` on our datetime columns.


In [61]:
# Example: How far did people travel, on average, in taxis in July 2020?

df['trip_distance'].mean()

4.304164880086755

In [62]:
# How far did people travel, on average, *FOR EACH PICKUP HOUR* of the day in July 2020?

df.groupby(df['tpep_pickup_datetime'].dt.hour)['trip_distance'].mean()

tpep_pickup_datetime
0      4.686456
1     15.018075
2      7.522501
3     14.842060
4     13.844186
5      3.669135
6      2.877530
7      4.502182
8      2.588864
9      2.552402
10     2.525525
11     2.570557
12     2.635746
13     3.229825
14     2.705078
15     2.884645
16     2.980572
17     3.006179
18     2.958436
19     3.080243
20     3.432630
21    32.465414
22     3.785870
23     4.159685
Name: trip_distance, dtype: float64

In [63]:
df.groupby(df['tpep_pickup_datetime'].dt.day_of_week)['trip_distance'].mean()

tpep_pickup_datetime
0    3.087188
1    2.943605
2    3.255378
3    5.285671
4    3.687451
5    6.632864
6    7.252421
Name: trip_distance, dtype: float64

In [66]:
df.groupby(df['tpep_pickup_datetime'].dt.day_name())['trip_distance'].mean()

tpep_pickup_datetime
Friday       3.687451
Monday       3.087188
Saturday     6.632864
Sunday       7.252421
Thursday     5.285671
Tuesday      2.943605
Wednesday    3.255378
Name: trip_distance, dtype: float64

# Time deltas

We can, as we saw earlier, perform date math and get the distance (timedelta) between two datetime values. How can we do that in Pandas?

In [67]:
df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

In [68]:
df.dtypes

tpep_pickup_datetime      datetime64[ns]
tpep_dropoff_datetime     datetime64[ns]
passenger_count                  float64
trip_distance                    float64
total_amount                     float64
trip_time                timedelta64[ns]
dtype: object

In [69]:
df.head(10)

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
0,2020-07-01 00:25:32,2020-07-01 00:33:39,1.0,1.5,9.3,0 days 00:08:07
1,2020-07-01 00:03:19,2020-07-01 00:25:43,1.0,9.5,27.8,0 days 00:22:24
2,2020-07-01 00:15:11,2020-07-01 00:29:24,1.0,5.85,22.3,0 days 00:14:13
3,2020-07-01 00:30:49,2020-07-01 00:38:26,1.0,1.9,14.16,0 days 00:07:37
4,2020-07-01 00:31:26,2020-07-01 00:38:02,1.0,1.25,7.8,0 days 00:06:36
5,2020-07-01 00:09:00,2020-07-01 00:34:39,1.0,9.7,33.8,0 days 00:25:39
6,2020-07-01 00:44:08,2020-07-01 00:58:12,1.0,5.27,26.39,0 days 00:14:04
7,2020-07-01 00:49:20,2020-07-01 00:56:44,1.0,1.32,8.8,0 days 00:07:24
8,2020-07-01 00:21:59,2020-07-01 00:25:12,1.0,0.73,10.12,0 days 00:03:13
9,2020-07-01 00:08:28,2020-07-01 00:36:18,1.0,18.65,66.36,0 days 00:27:50


In [70]:
# we can compare timedelta with a string

df['trip_time'] < '01:00:00'   # gives a boolean series -- which trips took less than one hour?

0         True
1         True
2         True
3         True
4         True
          ... 
800407    True
800408    True
800409    True
800410    True
800411    True
Name: trip_time, Length: 800412, dtype: bool

In [71]:
df.loc[
    df['trip_time'] < '01:00:00'
] # uses the boolean series as a row selector -- keeps only the trips that took less than one hour

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
0,2020-07-01 00:25:32,2020-07-01 00:33:39,1.0,1.50,9.30,0 days 00:08:07
1,2020-07-01 00:03:19,2020-07-01 00:25:43,1.0,9.50,27.80,0 days 00:22:24
2,2020-07-01 00:15:11,2020-07-01 00:29:24,1.0,5.85,22.30,0 days 00:14:13
3,2020-07-01 00:30:49,2020-07-01 00:38:26,1.0,1.90,14.16,0 days 00:07:37
4,2020-07-01 00:31:26,2020-07-01 00:38:02,1.0,1.25,7.80,0 days 00:06:36
...,...,...,...,...,...,...
800407,2020-07-19 13:27:52,2020-07-19 14:22:15,,24.23,83.50,0 days 00:54:23
800408,2020-07-19 13:02:00,2020-07-19 13:21:00,,4.40,19.78,0 days 00:19:00
800409,2020-07-19 13:32:00,2020-07-19 13:51:00,,8.78,38.45,0 days 00:19:00
800410,2020-07-19 13:28:00,2020-07-19 13:51:00,,6.50,29.77,0 days 00:23:00


In [74]:
df.loc[
    df['trip_time'] < '01:00:00'   # row selector
    ,
    'passenger_count'  # column selector
].mean()

1.3781325873607266

In [75]:
# there's an even better way to compare with a string: Use a number and a unit!

df['trip_time'] < '1 hour'

0         True
1         True
2         True
3         True
4         True
          ... 
800407    True
800408    True
800409    True
800410    True
800411    True
Name: trip_time, Length: 800412, dtype: bool

In [76]:
df['trip_time'] < '10 minutes'

0          True
1         False
2         False
3          True
4          True
          ...  
800407    False
800408    False
800409    False
800410    False
800411    False
Name: trip_time, Length: 800412, dtype: bool
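
In the comparisons above, Pandas interprets the string as a time delta. A more explicit alternative, sketched here on three made-up trip times, is to build the threshold with `pd.Timedelta`. The variable names are assumptions for illustration.

```python
import pandas as pd

# Three made-up trip durations
trip_time = pd.Series(pd.to_timedelta(['0 days 00:08:07',
                                       '0 days 00:22:24',
                                       '0 days 01:15:00']))

# Explicit thresholds, with the same intent as '1 hour' and '10 minutes'
under_one_hour = trip_time < pd.Timedelta(hours=1)
under_ten_minutes = trip_time < pd.Timedelta(minutes=10)

print(under_one_hour.tolist())      # [True, True, False]
print(under_ten_minutes.tolist())   # [True, False, False]

# A timedelta column can also be turned into a plain number of seconds
print(trip_time.dt.total_seconds().tolist())   # [487.0, 1344.0, 4500.0]
```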

# Exercises: Time deltas

With the data from the July 2020 NYC taxi file:

- Create a `trip_time` column by subtracting the pickup time from the dropoff time.
- Find the mean `total_amount` for trips that were less than 1 minute long. How many such trips were there?
- Find the mean `trip_distance` for trips that were > 1 hour long. How many such trips were there?

In [77]:
df.head()

,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
0,2020-07-01 00:25:32,2020-07-01 00:33:39,1.0,1.5,9.3,0 days 00:08:07
1,2020-07-01 00:03:19,2020-07-01 00:25:43,1.0,9.5,27.8,0 days 00:22:24
2,2020-07-01 00:15:11,2020-07-01 00:29:24,1.0,5.85,22.3,0 days 00:14:13
3,2020-07-01 00:30:49,2020-07-01 00:38:26,1.0,1.9,14.16,0 days 00:07:37
4,2020-07-01 00:31:26,2020-07-01 00:38:02,1.0,1.25,7.8,0 days 00:06:36


In [83]:
df.loc[
    df['trip_time'] < '1 minute',
    'total_amount'
].mean()

15.654617368081777

In [86]:
df.loc[
    df['trip_time'] > '1 hour',
    'trip_distance'
].mean()

13.059049645390072

In [89]:
df.loc[
    df['trip_time'] > '24 hours',
    'trip_distance'
]

274157     9.28
379415     0.88
399090     0.00
492526    11.43
520224     8.19
534889     1.18
Name: trip_distance, dtype: float64
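
The exercise also asks how many such trips there were. One possible approach, sketched here on a tiny made-up data frame rather than the real file, is to sum the boolean mask, since each `True` counts as 1.

```python
import pandas as pd

# Tiny made-up frame: two trips under a minute, two longer ones
trips = pd.DataFrame({
    'trip_time': pd.to_timedelta(['0 days 00:00:30', '0 days 00:08:07',
                                  '0 days 00:00:45', '0 days 01:30:00']),
    'total_amount': [3.0, 9.3, 5.0, 60.0],
})

short = trips['trip_time'] < pd.Timedelta(minutes=1)

# True counts as 1, so summing the mask counts the matching rows
print(short.sum())                              # 2
print(trips.loc[short, 'total_amount'].mean())  # 4.0
```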

# Time series

If I set a datetime column to be my data frame's index, I get a "time series," and can perform some special operations on it.

In [90]:
df = df.set_index('tpep_pickup_datetime')

In [91]:
df.head()

tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
2020-07-01 00:25:32,2020-07-01 00:33:39,1.0,1.5,9.3,0 days 00:08:07
2020-07-01 00:03:19,2020-07-01 00:25:43,1.0,9.5,27.8,0 days 00:22:24
2020-07-01 00:15:11,2020-07-01 00:29:24,1.0,5.85,22.3,0 days 00:14:13
2020-07-01 00:30:49,2020-07-01 00:38:26,1.0,1.9,14.16,0 days 00:07:37
2020-07-01 00:31:26,2020-07-01 00:38:02,1.0,1.25,7.8,0 days 00:06:36


In [97]:
# now I can retrieve all of the rows from a particular datetime

df.loc['2020-07-01 09:00:00']

tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
2020-07-01 09:00:00,2020-07-01 09:37:00,,5.84,30.32,0 days 00:37:00
2020-07-01 09:00:00,2020-07-01 09:11:00,,4.27,26.74,0 days 00:11:00
