# Agenda 

1. Timestamp + timedelta types
2. Working with timestamps in data frames -- `.dt`
3. Timedelta and comparisons
4. Time series -- timestamps as indexes
5. Resampling

# Timestamp + timedelta

We mean two things when we say "time":
- A point in time (timestamp, datetime)
- A span of time (timedelta, interval)
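
To make the two ideas concrete, here is a minimal sketch that builds one of each directly. The values are taken from the first taxi trip shown later in this notebook; the variable names `pickup` and `duration` are just for illustration.

```python
import pandas as pd

# A point in time
pickup = pd.Timestamp('2015-06-02 11:19:29')

# A span of time, not tied to any particular date
duration = pd.Timedelta(minutes=28, seconds=23)

print(pickup.year, pickup.hour)    # 2015 11
print(duration.total_seconds())    # 1703.0
```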

  

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
df = pd.read_csv('taxi.csv',
                usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                        'passenger_count', 'trip_distance', 'total_amount'])

In [3]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [4]:
df.dtypes

tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
total_amount             float64
dtype: object

In [5]:
# How can I turn the timestamp columns into actual timestamps?

df = pd.read_csv('taxi.csv',
                usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                        'passenger_count', 'trip_distance', 'total_amount'],
                parse_dates=['tpep_pickup_datetime',
                            'tpep_dropoff_datetime'])

In [6]:

df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [7]:
df.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object
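
If the data frame has already been loaded without `parse_dates`, one alternative is to convert the columns afterwards with `pd.to_datetime`. The sketch below uses two rows copied from the output above as stand-in CSV text, so it runs without `taxi.csv`; the names `csv_text` and `sample` are assumptions for illustration.

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two rows copied from the output above, used as stand-in CSV text
csv_text = """tpep_pickup_datetime,tpep_dropoff_datetime,total_amount
2015-06-02 11:19:29,2015-06-02 11:47:52,17.8
2015-06-02 11:19:30,2015-06-02 11:27:56,8.3
"""

# Without parse_dates, the datetime columns come in as plain text
sample = pd.read_csv(io.StringIO(csv_text))
was_datetime = is_datetime64_any_dtype(sample['tpep_pickup_datetime'])

# Convert each column after the fact
for col in ['tpep_pickup_datetime', 'tpep_dropoff_datetime']:
    sample[col] = pd.to_datetime(sample[col])

print(was_datetime)
print(is_datetime64_any_dtype(sample['tpep_pickup_datetime']))
```

Passing `parse_dates` to `read_csv` is usually tidier, but the after-the-fact conversion is handy when the data frame came from somewhere else.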

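Just as the `.str` accessor lets us apply string methods to every value in a string column, the `.dt` accessor lets us retrieve datetime attributes (such as `year`, `hour`, and `is_quarter_end`) from every value in a datetime column.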

In [8]:
# in what year were these trips?

df['tpep_pickup_datetime'].dt.year

0       2015
1       2015
2       2015
3       2015
4       2015
        ... 
9994    2015
9995    2015
9996    2015
9997    2015
9998    2015
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [9]:
df['tpep_pickup_datetime'].dt.hour

0       11
1       11
2       11
3       11
4       11
        ..
9994     0
9995     0
9996     0
9997     0
9998     0
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [12]:
df['tpep_pickup_datetime'].dt.is_quarter_end

0       False
1       False
2       False
3       False
4       False
        ...  
9994    False
9995    False
9996    False
9997    False
9998    False
Name: tpep_pickup_datetime, Length: 9999, dtype: bool

In [13]:
# I can group by the values we get from .dt!

df.groupby(df['tpep_pickup_datetime'].dt.minute)['trip_distance'].mean()

tpep_pickup_datetime
0     4.381589
1     4.277704
2     4.041907
3     4.536489
4     4.660118
5     4.393082
6     4.641471
7     4.439505
8     3.997288
9     4.481159
10    4.250000
11    4.197670
12    3.941948
13    4.102202
15    3.670769
16    3.055872
17    2.925592
18    2.931938
19    2.552746
20    2.734559
21    2.837943
22    2.859967
23    2.874452
24    2.416812
25    2.676262
26    2.677785
27    2.884444
28    2.577386
29    2.644669
30    2.885680
31    2.856197
32    2.724747
33    2.483873
51    2.416744
52    2.791492
53    2.978448
Name: trip_distance, dtype: float64

In [15]:
# A date such as 01-02-2023 is ambiguous. dayfirst=False is the default,
# so pandas reads it month-first, as January 2 rather than February 1.
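
Here is a small sketch of how `dayfirst` changes the way an ambiguous date is parsed. It uses `pd.to_datetime` on a single string rather than `read_csv`, just to keep the example self-contained; the variable names are for illustration only.

```python
import pandas as pd

ambiguous = '01-02-2023'

# Default (dayfirst=False): the first number is the month
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first number is the day
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date())   # 2023-01-02
print(day_first.date())     # 2023-02-01
```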

# Exercise: Taxi dates

1. Read the taxi data into a data frame using `parse_dates`.
2. How many trips were there for each hour in the data set?
3. How much did people spend, on average, on each day of the week?

In [17]:
df = pd.read_csv('taxi.csv',
                usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                        'passenger_count', 'trip_distance', 'total_amount'],
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])


In [18]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [19]:
df.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object

In [21]:
# 2. How many trips were there for each hour in the data set?

df['tpep_pickup_datetime'].dt.hour.value_counts()


tpep_pickup_datetime
11    4396
15    2536
0     2439
16     628
Name: count, dtype: int64

In [23]:
df.groupby(df['tpep_pickup_datetime'].dt.hour)['passenger_count'].count()

tpep_pickup_datetime
0     2439
11    4396
15    2536
16     628
Name: passenger_count, dtype: int64
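
The two approaches give the same counts, but in a different order: `value_counts` sorts by count (largest first), while `groupby` sorts by the hour. A minimal sketch, using a handful of made-up hour values rather than the taxi data, shows that adding `sort_index` to `value_counts` puts the result in hour order:

```python
import pandas as pd

# Made-up hour values, not taken from taxi.csv
hours = pd.Series([11, 11, 11, 15, 15, 0])

by_count = hours.value_counts()               # sorted by count, largest first
by_hour = hours.value_counts().sort_index()   # sorted by hour

print(by_count.index.tolist())   # [11, 15, 0]
print(by_hour.index.tolist())    # [0, 11, 15]
```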

In [28]:
df.loc[
  df['passenger_count'] > 1, 
  'tpep_pickup_datetime'
].dt.hour.value_counts()

tpep_pickup_datetime
11    1142
15     777
0      649
16     222
Name: count, dtype: int64
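
Question 3 of the exercise is not worked through in the cells above. One possible approach, sketched here on a tiny made-up sample (the amounts are invented, not taken from `taxi.csv`), is to group by `.dt.day_name()` and take the mean of `total_amount`:

```python
import pandas as pd

# Made-up sample; only the dates match the days seen in the taxi data
sample = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2015-06-01 00:10:45',   # a Monday
        '2015-06-01 00:12:31',   # a Monday
        '2015-06-02 11:19:29',   # a Tuesday
        '2015-06-02 11:19:30',   # a Tuesday
    ]),
    'total_amount': [10.0, 20.0, 8.0, 12.0],
})

# Group by the name of the weekday, then average the amount spent
avg_by_day = sample.groupby(
    sample['tpep_pickup_datetime'].dt.day_name()
)['total_amount'].mean()

print(avg_by_day)
```

Using `.dt.dayofweek` instead would give integers (Monday is 0), which sort in calendar order rather than alphabetically.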

In [29]:
df['tpep_pickup_datetime'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x125f26d50>

# Timedelta

Time math looks like this:

- timestamp - timestamp = timedelta
- timestamp + timedelta = timestamp
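
Both rules can be checked on a single pair of values before trying them on whole columns. This sketch uses the pickup and dropoff times of the first trip in the data:

```python
import pandas as pd

pickup = pd.Timestamp('2015-06-02 11:19:29')
dropoff = pd.Timestamp('2015-06-02 11:47:52')

# timestamp - timestamp = timedelta
trip_time = dropoff - pickup
print(trip_time)             # 0 days 00:28:23

# timestamp + timedelta = timestamp
print(pickup + trip_time)    # 2015-06-02 11:47:52
```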

In [30]:
df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

0      0 days 00:28:23
1      0 days 00:08:26
2      0 days 00:10:59
3      0 days 00:19:31
4      0 days 00:13:17
             ...      
9994   0 days 00:11:19
9995   0 days 00:15:17
9996   0 days 00:24:25
9997   0 days 00:06:08
9998   0 days 00:23:29
Length: 9999, dtype: timedelta64[ns]

In [31]:
# let's add a new column to our data frame with this timedelta

df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

In [32]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8,0 days 00:28:23
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3,0 days 00:08:26
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0,0 days 00:10:59
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16,0 days 00:19:31
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3,0 days 00:13:17


In [33]:
df.dtypes

tpep_pickup_datetime      datetime64[ns]
tpep_dropoff_datetime     datetime64[ns]
passenger_count                    int64
trip_distance                    float64
total_amount                     float64
trip_time                timedelta64[ns]
dtype: object

In [36]:
df.loc[df['trip_time'] < '0 days 00:01:00']

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
10,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,52.80,0 days 00:00:05
149,2015-06-02 11:21:25,2015-06-02 11:21:50,1,0.00,4.30,0 days 00:00:25
297,2015-06-02 11:20:23,2015-06-02 11:20:23,2,0.00,2.30,0 days 00:00:00
516,2015-06-02 11:21:14,2015-06-02 11:22:03,1,0.03,3.96,0 days 00:00:49
657,2015-06-02 11:24:33,2015-06-02 11:24:50,1,0.00,15.35,0 days 00:00:17
...,...,...,...,...,...,...
9452,2015-06-01 00:10:45,2015-06-01 00:11:28,1,0.10,3.80,0 days 00:00:43
9693,2015-06-01 00:11:08,2015-06-01 00:11:08,1,5.03,18.80,0 days 00:00:00
9761,2015-06-01 00:12:46,2015-06-01 00:12:46,1,0.00,5.80,0 days 00:00:00
9868,2015-06-01 00:12:31,2015-06-01 00:12:56,2,2.40,58.34,0 days 00:00:25


In [37]:
# we can also do this
df.loc[df['trip_time'] < '00:01:00']    # just write the time in HH:MM:SS

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount,trip_time
10,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,52.80,0 days 00:00:05
149,2015-06-02 11:21:25,2015-06-02 11:21:50,1,0.00,4.30,0 days 00:00:25
297,2015-06-02 11:20:23,2015-06-02 11:20:23,2,0.00,2.30,0 days 00:00:00
516,2015-06-02 11:21:14,2015-06-02 11:22:03,1,0.03,3.96,0 days 00:00:49
657,2015-06-02 11:24:33,2015-06-02 11:24:50,1,0.00,15.35,0 days 00:00:17
...,...,...,...,...,...,...
9452,2015-06-01 00:10:45,2015-06-01 00:11:28,1,0.10,3.80,0 days 00:00:43
9693,2015-06-01 00:11:08,2015-06-01 00:11:08,1,5.03,18.80,0 days 00:00:00
9761,2015-06-01 00:12:46,2015-06-01 00:12:46,1,0.00,5.80,0 days 00:00:00
9868,2015-06-01 00:12:31,2015-06-01 00:12:56,2,2.40,58.34,0 days 00:00:25
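
The string comparisons above work because pandas interprets the string as a timedelta. If you prefer to be explicit, you can build a `pd.Timedelta` yourself and compare against that. A minimal sketch, using a few trip times copied from the outputs above (the names `one_minute` and `short_trips` are just for illustration):

```python
import pandas as pd

# Trip times copied from the outputs above
trip_time = pd.Series(
    pd.to_timedelta(['00:28:23', '00:00:05', '00:00:49', '00:00:00'])
)

# An explicit one-minute span, equivalent to the string '00:01:00'
one_minute = pd.Timedelta(minutes=1)

short_trips = trip_time[trip_time < one_minute]
print(len(short_trips))    # 3
```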


# Exercise: Taxi times

1. Create a `trip_time` column, like I did, with the timedelta of the trip time.
2. Create a `trip_description` column whose value for each trip is `short`, `medium`, or `long`, based on its `trip_time`.
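
One possible way to approach step 2 is sketched below. The exercise does not say where the boundaries between `short`, `medium`, and `long` fall, so the 10-minute and 20-minute cutoffs here are purely hypothetical; the trip times are copied from rows shown earlier so the example runs without `taxi.csv`.

```python
import pandas as pd

# Trip times copied from rows shown earlier in this notebook
sample = pd.DataFrame({
    'trip_time': pd.to_timedelta(
        ['00:00:05', '00:08:26', '00:13:17', '00:28:23']
    ),
})

# Hypothetical cutoffs; adjust to whatever makes sense for the data
short_cutoff = pd.Timedelta(minutes=10)
long_cutoff = pd.Timedelta(minutes=20)

# Start with 'medium' everywhere, then overwrite the two extremes
sample['trip_description'] = 'medium'
sample.loc[sample['trip_time'] < short_cutoff, 'trip_description'] = 'short'
sample.loc[sample['trip_time'] >= long_cutoff, 'trip_description'] = 'long'

print(sample)
```

Assigning with `.loc` keeps the logic close to the timedelta comparisons shown earlier.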