# 4. Datetime Series Methods


# Methods for Series with Datetime data types
In this notebook we will focus on methods that work for Series that contain datetime data. Just as pandas has the **`str`** accessor to give us access to string-only methods, it also has the **`dt`** accessor to give us access to datetime-only methods.

Let's read in the bikes dataset, which has two datetime columns, **`starttime`** and **`stoptime`**.

In [1]:
import pandas as pd

bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
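
The `parse_dates` argument is what turns these two columns into datetimes; without it they are read as plain strings. The following is a minimal sketch on a small in-memory CSV. The two rows are copied from the output above, and the names `csv_text`, `raw`, `parsed`, and `converted` are introduced here for illustration only.

```python
import io
import pandas as pd

# Two rows copied from the bikes output above, held in memory for illustration
csv_text = """trip_id,starttime,stoptime
7147,2013-06-28 19:01:00,2013-06-28 19:17:00
7524,2013-06-28 22:53:00,2013-06-28 23:03:00
"""

# Without parse_dates the datetime columns are read as strings
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# With parse_dates they become datetime columns
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['starttime', 'stoptime'])
print(parsed.dtypes)

# A string column that is already loaded can be converted afterwards
converted = pd.to_datetime(raw['starttime'])
print(converted.dtype)
```

Only datetime columns expose the `dt` accessor, so one of these two conversions is needed before anything else in this notebook will work.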


## Pandas datetime columns use nanosecond precision
Pandas stores datetime columns with nanosecond precision. It relies on NumPy's datetime64 data type as the foundation. NumPy allows other precisions, such as microsecond or millisecond, but the pandas version used in this notebook converts any other NumPy datetime to nanoseconds. Starting with pandas 2.0, second, millisecond, and microsecond precision are also supported, so the conversion described here may not happen on newer installations.
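
One way to check how a given installation treats non-nanosecond data is to build a Series from a millisecond-precision NumPy array and inspect its dtype. This is only a sketch: the sample values are hypothetical, and the first dtype printed depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical millisecond-precision values, not taken from the bikes data
ms_values = np.array(['2013-06-28T19:01:00.123', '2013-06-28T22:53:00.456'],
                     dtype='datetime64[ms]')

s = pd.Series(ms_values)
print(pd.__version__)
print(s.dtype)  # nanosecond on older pandas; may stay millisecond on pandas 2.0+

# Explicitly request nanosecond precision regardless of version
s_ns = s.astype('datetime64[ns]')
print(s_ns.dtype)
```

The explicit `astype` call is a convenient way to get the same dtype on any version when later code relies on nanosecond precision.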

Let's take a look at the data types of each column with the **`dtypes`** attribute to verify that we do have two datetime columns.

In [2]:
bikes.dtypes

trip_id                       int64
usertype                     object
gender                       object
starttime            datetime64[ns]
stoptime             datetime64[ns]
tripduration                  int64
from_station_name            object
latitude_start              float64
longitude_start             float64
dpcapacity_start            float64
to_station_name              object
latitude_end                float64
longitude_end               float64
dpcapacity_end              float64
temperature                 float64
visibility                  float64
wind_speed                  float64
precipitation               float64
events                       object
dtype: object

# The `dt` accessor
The primary focus of this notebook will be the attributes and methods that follow the **`dt`** accessor. [Visit the API][1] to view all the datetime attributes and methods that are available.

## Use `read_html` to scrape the pandas API page and output the `dt` attributes and methods as a DataFrame

The `read_html` function attempts to turn every HTML table found at the given URL into a pandas DataFrame. It returns a list of DataFrames. It takes an optional second parameter, `match`, a string or regular expression that a table's text must contain for that table to be returned.

The pandas API page places all of the object attributes and methods within HTML tables, which makes it a good page to use with `read_html`. The function searches each table for text matching `Series.dt.` (written below as the regular expression `Series[.]dt[.]`). Four DataFrames are returned in a list. The first two contain the attributes and methods for the `dt` accessor.

In [3]:
dfs = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html', 'Series[.]dt[.]')

dt_attr = dfs[0]
dt_attr.columns = ['Attributes', 'Description']

dt_methods = dfs[1]
dt_methods.columns = ['Methods', 'Description']

dt_attr

Unnamed: 0,Attributes,Description
0,Series.dt.date,Returns numpy array of python datetime.date ob...
1,Series.dt.time,Returns numpy array of datetime.time.
2,Series.dt.year,The year of the datetime
3,Series.dt.month,"The month as January=1, December=12"
4,Series.dt.day,The days of the datetime
5,Series.dt.hour,The hours of the datetime
6,Series.dt.minute,The minutes of the datetime
7,Series.dt.second,The seconds of the datetime
8,Series.dt.microsecond,The microseconds of the datetime
9,Series.dt.nanosecond,The nanoseconds of the datetime


In [50]:
dt_methods

Unnamed: 0,Methods,Description
0,"Series.dt.to_period(*args, **kwargs)",Cast to PeriodIndex at a particular frequency.
1,Series.dt.to_pydatetime(),Return the data as an array of native Python d...
2,"Series.dt.tz_localize(*args, **kwargs)",Localize tz-naive DatetimeIndex to tz-aware Da...
3,"Series.dt.tz_convert(*args, **kwargs)",Convert tz-aware DatetimeIndex from one time z...
4,"Series.dt.normalize(*args, **kwargs)",Convert times to midnight.
5,"Series.dt.strftime(*args, **kwargs)",Convert to Index using specified date_format.
6,"Series.dt.round(*args, **kwargs)",round the data to the specified freq.
7,"Series.dt.floor(*args, **kwargs)",floor the data to the specified freq.
8,"Series.dt.ceil(*args, **kwargs)",ceil the data to the specified freq.
9,"Series.dt.month_name(*args, **kwargs)",Return the month names of the DateTimeIndex wi...


### Only available for Series
The **`dt`** accessor (like **`str`**) is only available on Series objects, not on DataFrames. You must select a single column as a Series before you can use it. Let's select the **`starttime`** column as a Series.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties

In [4]:
start = bikes['starttime']
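
The sketch below checks that the accessor exists on the Series but not on the DataFrame, and lists the public names reachable through `dt`. It uses a small stand-in DataFrame (`bikes_sample`, a name introduced here) built from the first five start times in the `bikes.head()` output, so it runs without the CSV file.

```python
import pandas as pd

# Stand-in for the bikes data: the first five start times from bikes.head()
bikes_sample = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                 '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                 '2013-07-01 11:16:00'])
})
start_sample = bikes_sample['starttime']

# The accessor lives on the Series, not on the DataFrame
print(hasattr(bikes_sample, 'dt'))   # False
print(hasattr(start_sample, 'dt'))   # True

# Public attributes and methods: names that do not start with an underscore
public_names = [name for name in dir(start_sample.dt) if not name.startswith('_')]
print(public_names)
```

The exact list printed depends on the installed pandas version, since attributes have been added and removed over time.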

### Datetime attributes and methods are simpler than strings
Almost all the attributes and methods available for datetimes are simple and straightforward. Let's take a look at some of them. We will output the head of the Series so that we can visually verify the results of the attributes and methods.

In [5]:
start.head()

0   2013-06-28 19:01:00
1   2013-06-28 22:53:00
2   2013-06-30 14:43:00
3   2013-07-01 10:05:00
4   2013-07-01 11:16:00
Name: starttime, dtype: datetime64[ns]

There are many attributes that return a particular part of the datetime, such as **`year, month, day, hour, minute, second`**, and so on.

In [6]:
start.dt.year.head()

0    2013
1    2013
2    2013
3    2013
4    2013
Name: starttime, dtype: int64

In [7]:
start.dt.month.head()

0    6
1    6
2    6
3    7
4    7
Name: starttime, dtype: int64

In [8]:
start.dt.minute.head()

0     1
1    53
2    43
3     5
4    16
Name: starttime, dtype: int64

In [9]:
# Monday is 0, Sunday is 6
start.dt.dayofweek.head()

0    4
1    4
2    6
3    0
4    0
Name: starttime, dtype: int64
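
The integer codes are easier to read next to the weekday names. A minimal sketch, assuming a pandas version that provides `dt.day_name()`; `start_sample` is a stand-in Series built from the five start times shown above.

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

weekday_number = start_sample.dt.dayofweek   # Monday=0 ... Sunday=6
weekday_name = start_sample.dt.day_name()    # English names by default

# Show the two side by side
print(pd.DataFrame({'number': weekday_number, 'name': weekday_name}))
```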

In [10]:
# week of year (dt.week was deprecated in pandas 1.1; newer versions use dt.isocalendar().week)
start.dt.week.head()

0    26
1    26
2    26
3    27
4    27
Name: starttime, dtype: int64
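
On pandas 1.1 or later, the week number can be read from `dt.isocalendar()` instead. The sketch below assumes such a version and again uses a stand-in Series (`start_sample`) built from the five start times shown above.

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

# isocalendar() returns a DataFrame with ISO year, week, and day columns
iso = start_sample.dt.isocalendar()
print(iso)

# Select the week column to get the ISO week of the year
week = iso['week']
print(week)
```

The week values agree with the output above; only the dtype of the result differs.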

## Datetime methods
There are only a few methods, the most useful being **`ceil`**, **`round`**, **`floor`**, **`strftime`**, and **`to_period`**. To use these methods you will need to be familiar with the [offset aliases][1], which are short strings, usually one character, that represent a unit of time. The uppercase forms below match the pandas version used in this notebook; pandas 2.2 and later prefer the lowercase **`h`**, **`min`**, and **`s`**.

* **`D`** - day
* **`H`** - hour
* **`T`** or **`min`** - minute
* **`S`** - second

[1]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
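
An alias can also be prefixed with an integer multiple, such as `15min` for a quarter of an hour. A minimal sketch using `floor` on a stand-in Series (`start_sample`, built from the five start times shown earlier):

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

# Round each start time down to the previous quarter hour
quarter_hour = start_sample.dt.floor('15min')
print(quarter_hour)
```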

### Scrape the offset aliases and output them in the notebook

In [11]:
dfs = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/timeseries.html', 
                   match='business day frequency',
                   attrs={'class' :"colwidths-given docutils"})
offset_aliases = dfs[0]
offset_aliases

Unnamed: 0,Alias,Description
0,B,business day frequency
1,C,custom business day frequency
2,D,calendar day frequency
3,W,weekly frequency
4,M,month end frequency
5,SM,semi-month end frequency (15th and end of month)
6,BM,business month end frequency
7,CBM,custom business month end frequency
8,MS,month start frequency
9,SMS,semi-month start frequency (1st and 15th)


### Use offset aliases with datetime methods

In [12]:
start.head()

0   2013-06-28 19:01:00
1   2013-06-28 22:53:00
2   2013-06-30 14:43:00
3   2013-07-01 10:05:00
4   2013-07-01 11:16:00
Name: starttime, dtype: datetime64[ns]

## `ceil` rounds up to the nearest unit

Round up to the nearest hour:

In [13]:
start.dt.ceil('H').head()

0   2013-06-28 20:00:00
1   2013-06-28 23:00:00
2   2013-06-30 15:00:00
3   2013-07-01 11:00:00
4   2013-07-01 12:00:00
Name: starttime, dtype: datetime64[ns]

Round up to the nearest day:

In [14]:
start.dt.ceil('D').head()

0   2013-06-29
1   2013-06-29
2   2013-07-01
3   2013-07-02
4   2013-07-02
Name: starttime, dtype: datetime64[ns]

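**`floor`** rounds down. The start times shown here have no seconds component, so flooring to the minute leaves them unchanged: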

In [15]:
start.dt.floor('min').head()

0   2013-06-28 19:01:00
1   2013-06-28 22:53:00
2   2013-06-30 14:43:00
3   2013-07-01 10:05:00
4   2013-07-01 11:16:00
Name: starttime, dtype: datetime64[ns]

**`round`** rounds to the nearest whole unit, up or down:

In [16]:
start.dt.round('H').head()

0   2013-06-28 19:00:00
1   2013-06-28 23:00:00
2   2013-06-30 15:00:00
3   2013-07-01 10:00:00
4   2013-07-01 11:00:00
Name: starttime, dtype: datetime64[ns]
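
Placing the three methods side by side makes the difference easier to see. This sketch uses a stand-in Series (`start_sample`) and the lowercase `'h'` alias, which is an assumption about the installed version: newer pandas releases prefer it, while `'H'` as used in the cells above is the older spelling.

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

# Apply all three rounding methods at hour resolution
comparison = pd.DataFrame({
    'original': start_sample,
    'floor': start_sample.dt.floor('h'),
    'round': start_sample.dt.round('h'),
    'ceil': start_sample.dt.ceil('h'),
})
print(comparison)
```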

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">What percentage of bike rides happen in January?</span>

In [38]:
# your code here
start.dt.month.value_counts(normalize=True)

8     0.137375
7     0.132245
9     0.130168
6     0.120525
10    0.111102
5     0.086227
11    0.071433
4     0.064286
12    0.045000
3     0.044101
2     0.030346
1     0.027192
Name: starttime, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">What percentage of bike rides happen on the weekend?</span>

In [33]:
# your code here
filt1 = start.dt.dayofweek == 5
filt2 = start.dt.dayofweek == 6
filt = filt1 | filt2 
filt.mean()

0.19692946555131866

### Problem 3
<span  style="color:green; font-size:16px">What percentage of bike rides happen on the last day of the month?</span>

In [39]:
# your code here
start.dt.is_month_end.mean()

0.031563816406795904

### Problem 4
<span  style="color:green; font-size:16px">We would expect that the value of the minutes recorded for each starting ride is approximately random. Can you show some data that confirms or rejects this?</span>

In [47]:
# your code here

start.dt.minute.value_counts(normalize=True)

12    0.017968
6     0.017928
8     0.017868
18    0.017808
43    0.017629
21    0.017549
10    0.017529
48    0.017509
44    0.017449
15    0.017409
53    0.017349
17    0.017329
37    0.017309
13    0.017289
19    0.017269
33    0.017229
42    0.017229
39    0.017189
24    0.017189
22    0.017110
34    0.017070
29    0.017070
45    0.016950
5     0.016890
36    0.016870
11    0.016870
14    0.016870
49    0.016850
47    0.016830
30    0.016810
16    0.016710
32    0.016670
38    0.016630
1     0.016630
40    0.016531
7     0.016491
2     0.016471
46    0.016471
4     0.016391
23    0.016331
54    0.016311
57    0.016291
3     0.016251
28    0.016091
35    0.016071
59    0.016071
56    0.016031
0     0.015932
58    0.015912
50    0.015872
31    0.015872
55    0.015852
9     0.015812
27    0.015812
41    0.015712
20    0.015612
25    0.015512
52    0.015253
51    0.015213
26    0.014973
Name: starttime, dtype: float64

### Problem 5
<span  style="color:green; font-size:16px">Assign the length of the ride to `ride_length`. Then find the percentage of rides that lasted longer than 30 minutes.</span>

In [40]:
# your code here
bikes.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In [45]:
stop = bikes['stoptime']
ride_length = stop - start
ride_length.head()

0   00:16:00
1   00:10:00
2   00:18:00
3   00:11:00
4   00:02:00
dtype: timedelta64[ns]

In [49]:
# .dt.seconds drops whole days; use .dt.total_seconds() if a ride can last 24 hours or more
(ride_length.dt.seconds > 30 * 60).mean()

0.019625067380063487
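
One caveat with this solution is that `dt.seconds` returns only the seconds component of each timedelta and ignores whole days. The sketch below uses three hypothetical durations (not from the bikes data) to show how that differs from `dt.total_seconds()` and from comparing directly against a `pd.Timedelta`.

```python
import pandas as pd

# Hypothetical durations: 16 minutes, 45 minutes, and 1 day plus 5 minutes
durations = pd.Series(pd.to_timedelta(['0 days 00:16:00',
                                       '0 days 00:45:00',
                                       '1 days 00:05:00']))

print(durations.dt.seconds.tolist())          # [960, 2700, 300]
print(durations.dt.total_seconds().tolist())  # [960.0, 2700.0, 86700.0]

# The seconds component misses the ride that lasted longer than a day
share_by_seconds = (durations.dt.seconds > 30 * 60).mean()
share_by_total = (durations.dt.total_seconds() > 30 * 60).mean()

# Comparing against a Timedelta avoids unit conversion altogether
share_by_compare = (durations > pd.Timedelta(minutes=30)).mean()

print(share_by_seconds, share_by_total, share_by_compare)
```

If no ride in the data lasts a full day, all three approaches give the same answer.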

# Explore the `dt` accessor

# Extra
Some extra notes on the Period and Timedelta objects

## Format time as a string with `strftime`
The name **`strftime`** stands for **str**ing **f**ormat **time**. The method turns each datetime into a string. Consult [Python's documentation][1] for the format codes that control how the string is built.

[1]: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [22]:
start.dt.strftime('%A, %B %d, %Y at %X').head()

0    Friday, June 28, 2013 at 19:01:00
1    Friday, June 28, 2013 at 22:53:00
2    Sunday, June 30, 2013 at 14:43:00
3    Monday, July 01, 2013 at 10:05:00
4    Monday, July 01, 2013 at 11:16:00
Name: starttime, dtype: object
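
Directives such as `%A` and `%B` produce names that can depend on the locale, while the numeric directives do not. A minimal sketch with a purely numeric format, using a stand-in Series (`start_sample`) built from the five start times shown earlier:

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

# Year/month/day followed by 24-hour time, all numeric
formatted = start_sample.dt.strftime('%Y/%m/%d %H:%M')
print(formatted)
```

The result holds strings, so the `dt` accessor is no longer available on it; use `str` instead.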

## Convert to a Period object
Period objects are special data types unique to pandas that represent an entire span of time, such as the entire month of June 2012, the entire year 1998, or the entire minute of June 11, 2011 12:34 p.m.

This contrasts with datetimes, which represent a particular moment in time. In this notebook, every datetime is specific all the way down to the nanosecond.
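
The difference can be sketched with a single Period for June 2012. The name `june_2012` and the timestamp `moment` are made up for illustration.

```python
import pandas as pd

# A Period covering the whole month of June 2012
june_2012 = pd.Period('2012-06', freq='M')

# A Period has a first and a last moment
print(june_2012.start_time)
print(june_2012.end_time)

# A Timestamp is a single moment, which may fall inside the Period
moment = pd.Timestamp('2012-06-15 08:30:00')
print(june_2012.start_time <= moment <= june_2012.end_time)  # True
```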

### Use offset aliases to convert to a period
To convert to a period use the same [offset aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) from above.

Let's do some conversions: First to a month.

In [23]:
start.dt.to_period('M').head()

0   2013-06
1   2013-06
2   2013-06
3   2013-07
4   2013-07
Name: starttime, dtype: object

Convert to a time span of an hour:

In [24]:
start.dt.to_period('h').head()

0   2013-06-28 19:00
1   2013-06-28 22:00
2   2013-06-30 14:00
3   2013-07-01 10:00
4   2013-07-01 11:00
Name: starttime, dtype: object
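
Periods are handy as grouping keys, because every datetime that falls in the same span maps to the same value. A minimal sketch that counts rides per month on a stand-in Series (`start_sample`); on newer pandas versions the dtype is reported as `period[M]` rather than `object`.

```python
import pandas as pd

# Stand-in for the start Series: the five values shown in start.head()
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']), name='starttime')

# Convert each datetime to its month, then count the rides in each month
months = start_sample.dt.to_period('M')
rides_per_month = months.value_counts().sort_index()
print(rides_per_month)
```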

# Timedeltas
Timedeltas are a separate data type that represents an amount of time, such as 5 minutes and 34 seconds. The largest unit of a timedelta is days. Timedelta Series can also use the **`dt`** accessor.

### Creating a Timedelta
To create a timedelta, subtract two datetime Series from each other. Here, we select the stop time as a Series and subtract the **`start`** Series from it.

In [25]:
stop = bikes['stoptime']
ride_length = stop - start
ride_length.head()

0   00:16:00
1   00:10:00
2   00:18:00
3   00:11:00
4   00:02:00
dtype: timedelta64[ns]
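
A timedelta Series can be converted to a plain number by dividing it by a `pd.Timedelta` of the desired unit. The sketch below rebuilds the five rides shown above as stand-in Series (`start_sample` and `stop_sample`, names introduced here) and expresses each ride length in minutes.

```python
import pandas as pd

# Stand-ins for the first five start and stop times in the bikes data
start_sample = pd.Series(pd.to_datetime(['2013-06-28 19:01:00', '2013-06-28 22:53:00',
                                         '2013-06-30 14:43:00', '2013-07-01 10:05:00',
                                         '2013-07-01 11:16:00']))
stop_sample = pd.Series(pd.to_datetime(['2013-06-28 19:17:00', '2013-06-28 23:03:00',
                                        '2013-06-30 15:01:00', '2013-07-01 10:16:00',
                                        '2013-07-01 11:18:00']))

# Subtracting two datetime Series gives a timedelta Series
ride_length_sample = stop_sample - start_sample
print(ride_length_sample.dtype)

# Dividing by a one-minute Timedelta gives the length in minutes as floats
minutes = ride_length_sample / pd.Timedelta(minutes=1)
print(minutes.tolist())  # [16.0, 10.0, 18.0, 11.0, 2.0]
```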

There are far fewer attributes and methods for timedeltas, but they work the same way: