<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Datetime Series</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
        <img src="static/chengdu.jpg">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"Tempus edax rerum."</p>
                <br>
    <p>-<a href="https://en.wikipedia.org/wiki/List_of_sundial_mottos">Sundial Motto</a></p>
            </blockquote>
        </div>
    </div>
</div>


<br>






<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Chengdu-pandas-d10.jpg'>Colegota</a> under the <a href='https://creativecommons.org/licenses/by-sa/2.5/es/deed.en'>Creative Commons Attribution-Share Alike 2.5 Spain</a>
</div>


<hr>

In [1]:
# Import stuff so we can use libraries.
import numpy as np
import pandas as pd

# Generally

As mentioned in earlier sessions, each pandas Series has a particular datatype (even if it's just a generic Python datatype). Because time series analysis plays such an important role in data analysis as a whole, pandas has specific datetime column types for dealing with time data. These types are the:

* **Datetime** series;
* **Period** series; and,
* **Timedelta** series.

Each of these numpy-derived types has a `.dt` namespace (a.k.a. "datetime properties object") that exposes attributes and methods, and each also supports specialized datetime behaviors.

**Note**: Python also has the concept of datetimes in its standard library. For the purpose of this presentation we will refer to pandas and numpy datetimes as "datetimes". We will refer to Python standard library datetimes as "Python datetimes".

## The Datetime Series

### Generally

A datetime is a data structure that pinpoints a moment in time, generally down to the nanosecond. An example of a datetime would be November 11, 1918 at one nanosecond past 11AM (i.e. 1918-11-11T11:00:00.000000001). Each individual value in a datetime series is a [`pd.Timestamp`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timestamp.html) object, and a Series of these objects is a datetime Series.

The most common ways to generate these datetime series are:

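* [`pd.read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) can parse date columns while loading a file, provided you tell it which columns to parse via `parse_dates`.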

* [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) is your workhorse for generating datetimes series from string series.

* [`pd.date_range()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) can be used to manually generate ranges of datetimes as DatetimeIndexes with whatever arbitrary frequency you desire.

In [56]:
# Let's create a range of datetimes covering every business day of 2020.
# Always use ISO format (e.g. 2018-01-31): https://xkcd.com/1179/
daily_2020 = pd.date_range(start='2020-01-01', end='2020-12-31', freq='B')

# Create our series
series_2020 = pd.Series(daily_2020)

# Display the first 5 values.
series_2020.head(5)

0   2020-01-01
1   2020-01-02
2   2020-01-03
3   2020-01-06
4   2020-01-07
dtype: datetime64[ns]

In [57]:
# So if the Series is a datetime series, what are the individual values? Timestamps.
series_2020.iloc[0]

Timestamp('2020-01-01 00:00:00')

In [58]:
# And let's take a peek at the datetime properties object
series_2020.dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x0000021FCF754080>

In [59]:
# Which you can build on your own if you so choose.
ts = pd.Timestamp('2020-01-02 03:04:05')

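# If you specify a frequency, you can add or subtract integer multiples of it (e.g. 900 seconds later).
# Note: this relies on older pandas behavior; newer releases expect an explicit pd.Timedelta instead.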
pd.Timestamp('2020-01-02 03:04:05', freq='s') + 900

Timestamp('2020-01-02 03:19:05', freq='S')
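
The integer arithmetic shown above depends on the pandas version that produced this notebook; more recent releases reject adding a bare integer to a Timestamp. Below is a minimal sketch of the same 900-second shift using an explicit duration, which should behave the same across versions (the variable names are just for illustration):

```python
import pandas as pd

ts = pd.Timestamp('2020-01-02 03:04:05')

# Add an explicit duration rather than a bare integer.
later = ts + pd.Timedelta(seconds=900)

# The same shift expressed as an offset object.
later_via_offset = ts + pd.offsets.Second(900)

print(later)
print(later_via_offset)
```

Being explicit about the unit also makes the intent readable without having to look up the frequency of the Timestamp.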

In [60]:
# You can also read from a CSV if you specify what you want to parse.
# Note: again, read_csv isn't used to create Series, generally, so it's messy.
dts = pd.read_csv(
    'data/datetimes.csv', 
    squeeze=True, 
    parse_dates=[0], 
    skip_blank_lines=False
)

# Display 
dts.head(5)

0   2020-01-01 00:00:00
1                   NaT
2   2020-01-01 02:00:00
3   2020-01-01 03:00:00
4   2020-01-01 04:00:00
Name: dts, dtype: datetime64[ns]

In [61]:
# The most flexible way to do this is pd.to_datetime(), which allows for strptime'ing.
# And error handling!
ts = pd.Series(['2018-01-01T23:00:10.222222', '2019-05-01T23:00:10.555555', ''])
pd.to_datetime(ts)

0   2018-01-01 23:00:10.222222
1   2019-05-01 23:00:10.555555
2                          NaT
dtype: datetime64[ns]
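
The "strptime'ing" and error handling mentioned in the cell above are not actually exercised there, so here is a small sketch of both. The sample strings are made up for illustration; `format=` and `errors='coerce'` are standard `pd.to_datetime()` parameters.

```python
import pandas as pd

# Hypothetical day/month/year strings, plus one value that is not a date.
raw = pd.Series(['01/02/2018 23:00', '31/12/2019 08:30', 'not a date'])

# format= spells out the layout using strptime codes (day first here).
# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M', errors='coerce')

print(parsed)
```

Without `errors='coerce'`, the third value would raise an exception, which is often what you want while you are still exploring how messy the data is.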

In [62]:
# Notice that [ns] at the end of the datatype?
# That's the resolution (how precise it is)
# Though not generally advisable, you can convert
ts.astype('datetime64[D]')

0   2018-01-01
1   2019-05-01
2          NaT
dtype: datetime64[ns]

### So what's this 'NaT' I see?

Numpy (and by extension pandas) uses `NaT` to indicate a missing time value. `np.NaN` works with floats and Python objects ... it does not work with datetimes. Generally you can treat these interchangeably with `np.NaN` when you're using the `Series.dropna()` or `Series.fillna()` methods; however, if you need to manually set something you will have to use `pd.NaT` (not np.NaT).
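
A minimal sketch of how `NaT` behaves with the usual missing-data methods. The sample dates and the fill value are arbitrary choices for illustration.

```python
import pandas as pd

# A datetime series with one missing value in the middle.
s = pd.to_datetime(pd.Series(['2020-01-01', None, '2020-01-03']))

# isna() flags NaT just like it flags NaN.
missing = s.isna()

# dropna() removes the NaT rows.
cleaned = s.dropna()

# fillna() needs a datetime-like replacement value.
filled = s.fillna(pd.Timestamp('2020-01-02'))

# Manually marking a value as missing uses pd.NaT.
s_with_gap = s.copy()
s_with_gap.iloc[0] = pd.NaT

print(missing.tolist())
print(s_with_gap)
```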

### So how do we work with these datetimes?

The first and most useful thing you can do with datetimes is use the + and - operators to change the series as a whole.

In [63]:
# Subtract the year start (scalar) from your 2020 series to get the time elapsed since the start of the year.
# This gives us a Timedelta (which we will explore later)
hours_from_year_start = series_2020 - pd.Timestamp('2020-01-01')
hours_from_year_start.tail()

257   359 days
258   362 days
259   363 days
260   364 days
261   365 days
dtype: timedelta64[ns]

In [64]:
# Subtract one series from another (the part of the dtype in brackets is the resolution)
start_series    = pd.Series(['2018-01-01T12:00:00', '2019-10-05', '2020-10-09T21:00:05'], dtype='datetime64[ns]')
end_series      = pd.Series(['2018-10-30T22:00:00', '2029-01-05', '2020-10-09T22:00:00'], dtype='datetime64[ns]')
duration_series = end_series - start_series

# Display duration
duration_series

0    302 days 10:00:00
1   3380 days 00:00:00
2      0 days 00:59:55
dtype: timedelta64[ns]

### Detour: so what of this `.dt` namespace?

All Datetime and Timedelta Series have a `.dt` namespace that has a variety of attributes and methods you can use. The below is a brief description of what attributes and methods are available.

#### Python Datetime Component Attributes

If you just want the standard library date and time components, you can access them via the following attributes:

* [`dt.date`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.date.html): returns the "date" portion of the datetime via Python standard library datetime.date objects.
* [`dt.time`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.time.html): returns the "time" portion of the datetime via Python standard library datetime.time objects.

In [65]:
# Print the type of the first item from the date attribute series
print(type(start_series.dt.date.iloc[0]))

# Display entire series
start_series.dt.date

<class 'datetime.date'>


0    2018-01-01
1    2019-10-05
2    2020-10-09
dtype: object

In [66]:
# Print the type of the first item from the time attribute series
print(type(start_series.dt.time.iloc[0]))

# Display entire series
start_series.dt.time

<class 'datetime.time'>


0    12:00:00
1    00:00:00
2    21:00:05
dtype: object

#### Numeric Components of datetime

It is also possible to get the numeric portions of the datetime. For example, for 2018-01-01T12:00:00.000005, I could get 2018 for the year, 1 for the month and 5 for the microseconds.

* `dt.year`
* `dt.month`
* `dt.day`
* `dt.hour`
* `dt.minute`
* `dt.second`
* `dt.microsecond`
* `dt.nanosecond`

In [67]:
# Here we get the hours (we'll skip the others as they are used pretty much the same)
start_series.dt.hour

0    12
1     0
2    21
dtype: int64

#### Descriptive Attributes of Datetime

There are also attributes that aren't a numeric part of the datetime itself, but are derived from datetimes.

* `dt.quarter`: the quarter in which this datetime falls
* `dt.week`: which week of the year (1-53) this datetime falls within
* `dt.weekday_name`: the day of the week (e.g. 'Monday', 'Tuesday', etc.); available in older pandas releases only, so prefer `dt.day_name()` below
* `dt.dayofweek`: a number for the weekday (0 is Monday, 6 is Sunday)
* `dt.dayofyear`: the day of the year, between 1 and 365 (366 in a leap year)
* `dt.days_in_month`: the number of days in the month (e.g. 29 for February in a leap year)
* `dt.daysinmonth`: an alias for days_in_month
* `dt.weekday`: an alias for dayofweek
* `dt.weekofyear`: an alias for week
* `dt.freq`: the frequency of your data (days, months, seconds, etc.)

See also the following methods:

* `dt.day_name()`: the name of the day
* `dt.month_name()`: the name of the month

In [68]:
# This is the weekday name (we'll skip the others).
# Note: dt.day_name() supersedes the older dt.weekday_name attribute.
start_series.dt.day_name()

0      Monday
1    Saturday
2      Friday
dtype: object
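
To see several of the descriptive attributes side by side, here is a small sketch that rebuilds the dates from `start_series` (dropping the time of day) and collects a few attributes in a DataFrame. The column names are arbitrary, and `dt.isocalendar()` is assumed to be available, which requires a reasonably recent pandas release.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-01-01', '2019-10-05', '2020-10-09']))

summary = pd.DataFrame({
    'day_name':   dates.dt.day_name(),          # e.g. 'Monday'
    'dayofweek':  dates.dt.dayofweek,           # Monday is 0, Sunday is 6
    'month_name': dates.dt.month_name(),        # e.g. 'January'
    'quarter':    dates.dt.quarter,             # 1 through 4
    'dayofyear':  dates.dt.dayofyear,           # 1 through 365 (or 366)
    'iso_week':   dates.dt.isocalendar().week,  # ISO 8601 week number
})

print(summary)
```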

#### Check attributes

These attributes give you a True/False value for whether a datetime passes a particular check.

* `dt.is_leap_year`
* `dt.is_month_end`
* `dt.is_month_start`
* `dt.is_quarter_end`
* `dt.is_quarter_start`
* `dt.is_year_end`
* `dt.is_year_start`

In [69]:
# Check if each day is the start of the year.
start_series.dt.is_year_start

0     True
1    False
2    False
dtype: bool

#### Rounding Functions

These functions will round datetimes to a value:

* [`dt.ceil()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.ceil.html#pandas.Series.dt.ceil): round up to a particular frequency
* [`dt.floor()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.floor.html#pandas.Series.dt.floor): round down to a particular frequency
* [`dt.round()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.round.html): round to the closest value of a particular frequency
* [`dt.normalize()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.normalize.html): round to midnight

**Note**: See [offset aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases) for the codes of things you can round to (minute, month start, business day, etc).

In [70]:
# Round to next day
next_day = start_series.dt.ceil('d')

# Round to this minute
this_minute = start_series.dt.floor('min')

# Round to nearest second
this_second = start_series.dt.round('S')

# Normalized to midnight
normalized = start_series.dt.normalize()

# Display next day
next_day

0   2018-01-02
1   2019-10-05
2   2020-10-10
dtype: datetime64[ns]

Note: these rounding functions do not actually change the time resolution of your data ... they change the actual values of your data but leave the resolution intact.
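
A quick sketch of that point, using made-up timestamps: flooring to the minute changes the values, but the dtype of the series is the same before and after.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2018-01-01 12:34:56', '2020-10-09 21:00:05']))

# Round every value down to the start of its minute.
floored = stamps.dt.floor('min')

print(floored)

# The values changed, but the resolution (dtype) did not.
print(stamps.dtype, floored.dtype)
```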

#### Conversions

* `dt.strftime()`: create a formatted string from a datetime ("string format time")
* `dt.to_period()`: convert datetimes to ranges of time (a.k.a. periods)
* `dt.to_pydatetime()`: convert np.datetime64 objects to standard library datetimes.

Note: [see strftime() and strptime() behavior](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) for details on strftime formatting.

In [71]:
# Do some formatted time strings
start_series.dt.strftime("Verily, it was the %d day in the year %Y of our lord.")

0    Verily, it was the 01 day in the year 2018 of ...
1    Verily, it was the 05 day in the year 2019 of ...
2    Verily, it was the 09 day in the year 2020 of ...
dtype: object

In [72]:
# Periods are the best way to deal with ranges as we will see later on.
as_period = start_series.dt.to_period('M')

# We can add 6 months with simple addition.
six_months_after = as_period + 6

# Display
six_months_after

0   2018-07
1   2020-04
2   2021-04
dtype: object

#### Timezone Stuff

Dealing with timezones and daylight saving time is a pain. Never do this manually. Pandas (by way of [pytz](https://pypi.org/project/pytz/) and the [Olson tz database](https://en.wikipedia.org/wiki/Tz_database)) is smarter than you.

* `dt.tz`: get the timezone for this time
* `dt.tz_convert`: convert to a new timezone
* `dt.tz_localize`: set the time zone for this datetime

In [73]:
# Set a timezone
utc_time = start_series.dt.tz_localize('UTC')
utc_time

0   2018-01-01 12:00:00+00:00
1   2019-10-05 00:00:00+00:00
2   2020-10-09 21:00:05+00:00
dtype: datetime64[ns, UTC]

In [74]:
# Get our timezone
utc_time.dt.tz

<UTC>

In [75]:
# Convert to another timezone
utc_time.dt.tz_convert('America/Chicago')

0   2018-01-01 06:00:00-06:00
1   2019-10-04 19:00:00-05:00
2   2020-10-09 16:00:05-05:00
dtype: datetime64[ns, America/Chicago]

**Note**: if you want to know what timezones are available, you can `import pytz` and then check `pytz.all_timezones`.

**Note**: the "-06:00" or "-05:00" you see at the end of the time is the difference from UTC, which differs based on daylight saving time.
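
To make the daylight saving effect concrete, here is a small sketch with one winter and one summer datetime (both chosen arbitrarily). It assumes the timezone database is available to pandas, as it is in a standard installation.

```python
import pandas as pd

# Noon UTC on a winter day and on a summer day.
naive = pd.Series(pd.to_datetime(['2020-01-15 12:00', '2020-07-15 12:00']))
utc = naive.dt.tz_localize('UTC')

# Convert to Chicago time; pandas applies the correct offset for each date.
chicago = utc.dt.tz_convert('America/Chicago')

# Pull out the UTC offset in hours for each value.
offset_hours = [t.utcoffset().total_seconds() / 3600 for t in chicago]

print(chicago)
print(offset_hours)
```

The same wall-clock time in UTC lands one hour apart in Chicago depending on the season, which is exactly the bookkeeping you don't want to do by hand.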

---

## The Period

The [`pd.Period`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.html) in pandas is just a fixed span of time. An example of a period would be one minute--a minute has a nanosecond on which it starts, a nanosecond on which it ends, and about 60 billion nanoseconds in the middle. Though we could refer to each and every one of those datetimes in that minute, it's easier to generalize.

If you want to group times by second, by year, by every two hours, by every five minutes or any other which way, Periods are the way to go (e.g. I want to know how many events I have for a given month).

The most common ways to generate period series are:

* [`pd.Period()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.html) can be used on its own if you need to instantiate a particular period.

* [`Series.dt.to_period()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.to_period.html) is the best way to convert a bunch of datetimes you already have into periods.

* [`pd.period_range()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.period_range.html) can be used to manually generate periods as [`pd.PeriodIndex`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.PeriodIndex.html)es with whatever arbitrary frequency you desire.

In [76]:
# Let's generate a single period with a frequency of minute.
period = pd.Period('2019-01-01T12:00:00', freq='min')
period

Period('2019-01-01 12:00', 'T')

In [77]:
# Using period range of every 6 months from 1900 to 2020.
period_range = pd.period_range(start='1900-01-01', end='2020-12-31', freq='6M')
period_range

PeriodIndex(['1900-01', '1900-07', '1901-01', '1901-07', '1902-01', '1902-07',
             '1903-01', '1903-07', '1904-01', '1904-07',
             ...
             '2016-01', '2016-07', '2017-01', '2017-07', '2018-01', '2018-07',
             '2019-01', '2019-07', '2020-01', '2020-07'],
            dtype='period[6M]', length=242, freq='6M')

In [78]:
# Convert our start_series to 3 week periods.
start_periods = start_series.dt.to_period('3w')
start_periods

0   2018-01-01/2018-01-07
1   2019-09-30/2019-10-06
2   2020-10-05/2020-10-11
dtype: object

In [79]:
# Periods and period columns can be simply stepped forward using +/- operators.
print(period, end='\n\n')

# Print freq='min' period three minutes earlier.
print(period - 3, end='\n\n')

# Print the start_period 9 weeks in the future
print(start_periods + 3, end='\n\n')

2019-01-01 12:00

2019-01-01 11:57

0   2018-03-05/2018-03-11
1   2019-12-02/2019-12-08
2   2020-12-07/2020-12-13
dtype: object



#### Period `.dt` namespace

Like other datetime objects, period columns have a .dt namespace. Most of the methods and attributes are the same as what you would see for datetimes, but a few specialized attributes and methods are below:

* [`dt.start_time`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.start_time.html#pandas.Period.start_time) gets the timestamp at the start of the period.
* [`dt.end_time`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.end_time.html#pandas.Period.end_time) gets the timestamp at the end of the period.
* [`dt.to_timestamp()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.to_timestamp.html#pandas.Period.to_timestamp) converts each period to a single timestamp (by default, the start of the period).

In [80]:
# Let's take a look at some of these
dt_items = {
    'start': start_periods.dt.start_time,
    'end'  : start_periods.dt.end_time,
    'ts'   : start_periods.dt.to_timestamp(),
    'freq' : start_periods.dt.freq,
}

# Note how the resolution is nanoseconds, but the period is 3 weeks.
for name, data in dt_items.items():
    print(name)
    print(data, end='\n\n')

start
0   2018-01-01
1   2019-09-30
2   2020-10-05
dtype: datetime64[ns]

end
0   2018-01-21
1   2019-10-20
2   2020-10-25
dtype: datetime64[ns]

ts
0   2018-01-01
1   2019-09-30
2   2020-10-05
dtype: datetime64[ns]

freq
<3 * Weeks: weekday=6>



In [81]:
# A more practical example (we created a datetime data series at the beginning of this notebook)
dts

0    2020-01-01 00:00:00
1                    NaT
2    2020-01-01 02:00:00
3    2020-01-01 03:00:00
4    2020-01-01 04:00:00
5    2020-01-01 05:00:00
6    2020-01-01 06:00:00
7    2020-01-01 07:00:00
8    2020-01-01 08:00:00
9    2020-01-01 09:00:00
10   2020-01-01 10:00:00
11                   NaT
12   2020-01-01 12:00:00
13   2020-01-01 13:00:00
14   2020-01-01 14:00:00
15   2020-01-01 15:00:00
16   2020-01-01 16:00:00
17   2020-01-01 17:00:00
18   2020-01-01 18:00:00
19   2020-01-01 19:00:00
20   2020-01-01 20:00:00
21                   NaT
22   2020-01-01 22:00:00
23   2020-01-01 23:00:00
24   2020-01-02 00:00:00
Name: dts, dtype: datetime64[ns]

In [82]:
# How many of our datetimes occur each day? Convert to day periods, then value count.
dts.dt.to_period('d').value_counts()

2020-01-01    21
2020-01-02     1
Freq: D, Name: dts, dtype: int64
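
The same pattern answers the "how many events in a given month" question raised earlier. A minimal sketch with made-up event times:

```python
import pandas as pd

# Hypothetical event timestamps spread over three months.
events = pd.Series(pd.to_datetime([
    '2020-01-03 09:15', '2020-01-20 17:40', '2020-02-01 08:00',
    '2020-02-14 12:30', '2020-02-28 23:59', '2020-03-10 06:45',
]))

# Convert to monthly periods, count, and sort chronologically.
monthly_counts = events.dt.to_period('M').value_counts().sort_index()

print(monthly_counts)
```

Sorting by index puts the months in calendar order; without it, `value_counts()` orders the result by count.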

---

## The Timedelta

The [`pd.Timedelta`](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html) is a data structure that represents the difference between two times. For example, if I were comparing now to thirty minutes from now, my Timedelta would be 30 minutes.

Timedeltas can be created by subtracting two times or series of times from one another.

They can also be created using:

* [`pd.Timedelta()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timedelta.html): a constructor for directly creating a timedelta.

* [`pd.to_timedelta()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html): a helper function for creating a series of timedeltas in a fault tolerant manner.

The majority of the items in the `.dt` namespace for Timedeltas are the same as what you have seen elsewhere and won't be covered here.

In [83]:
# Again, we can generate these by subtracting two datetime series
timedelta_series = end_series - start_series
timedelta_series

0    302 days 10:00:00
1   3380 days 00:00:00
2      0 days 00:59:55
dtype: timedelta64[ns]

In [84]:
# We can also generate them manually.
fortnight = pd.Timedelta('14 days')
fortnight

Timedelta('14 days 00:00:00')

In [85]:
# After we have them, we can use them with our operators to get a new datetime (here 2 weeks earlier)
start_series - fortnight

0   2017-12-18 12:00:00
1   2019-09-21 00:00:00
2   2020-09-25 21:00:05
dtype: datetime64[ns]

In [86]:
# Using to_timedelta()
pd.to_timedelta(['7 hours 15 days', '3 minutes', '2:00:00'])

TimedeltaIndex(['15 days 07:00:00', '0 days 00:03:00', '0 days 02:00:00'], dtype='timedelta64[ns]', freq=None)

In [87]:
# You still have .dt
timedelta_series.dt.to_pytimedelta()

array([datetime.timedelta(302, 36000), datetime.timedelta(3380),
       datetime.timedelta(0, 3595)], dtype=object)
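
A few Timedelta-specific items in `.dt` are worth a quick look. The sketch below rebuilds the three durations from above as strings (so it runs on its own) and pulls out the pieces; the variable names are just for illustration.

```python
import pandas as pd

durations = pd.to_timedelta(
    pd.Series(['302 days 10:00:00', '3380 days', '0 days 00:59:55'])
)

# Whole days in each duration.
whole_days = durations.dt.days

# Leftover seconds after removing the whole days (0 to 86399).
seconds_part = durations.dt.seconds

# The entire duration expressed in seconds.
total_seconds = durations.dt.total_seconds()

# A DataFrame with one column per unit (days, hours, minutes, ...).
parts = durations.dt.components

print(total_seconds)
print(parts)
```

Note the difference between `dt.seconds` and `dt.total_seconds()`: the former is only the sub-day remainder, which is a common source of bugs.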

## The offset

What if you don't want to add a fixed timedelta to your date? What if you just want to go to the next business day? Or the start of the next month? Or something else? You'll probably benefit from something in [`pd.offsets`](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects). These basically allow you to step through dates however you please.

Want to make your own custom business day to automatically skip over specific holidays? [Go for it](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects).
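
As a rough sketch of what such a custom business day might look like, the example below uses `pd.offsets.CustomBusinessDay` with a single-entry holiday list. The holiday list and start date are assumptions chosen purely for illustration; a real calendar would need every holiday you care about.

```python
import pandas as pd

# Hypothetical holiday list: US Thanksgiving 2018 (a Thursday).
holidays = ['2018-11-22']

# A business-day offset that also skips the listed holidays.
custom_bday = pd.offsets.CustomBusinessDay(holidays=holidays)

# Start on the Wednesday before the holiday.
start = pd.Timestamp('2018-11-21')

# The plain business day lands on the holiday ...
next_default = start + pd.offsets.BDay()

# ... while the custom one steps over it to Friday.
next_custom = start + custom_bday

print(next_default)
print(next_custom)
```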

In [88]:
# What is today?
today = pd.Timestamp.now()
today

Timestamp('2018-11-15 21:19:09.223271')

In [89]:
# What is the start of the next month?
bm_offset = pd.offsets.MonthBegin()
today + bm_offset

Timestamp('2018-12-01 21:19:09.223271')

In [90]:
# You can also do multiples
five_biz_days = pd.offsets.BDay(5)

# Equivalent
five_biz_days = pd.offsets.BDay() * 5

# Subtract from today to get five bdays ago
today - five_biz_days

Timestamp('2018-11-08 21:19:09.223271')

In [91]:
## The rabbit hole ...

# Additional Learning Resources

* ### [Time Series / Date Functionality](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) 
* ### [Numpy Datetimes and Timedeltas](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html)
* ### [Datetimelike-Properties API](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)
* ### [Time-Series-Related](https://pandas.pydata.org/pandas-docs/stable/api.html#time-series-related)

---

# Next Up: [String Series](5_string_series.ipynb)

<br>

<img style="margin-left: 0;" src="static/log_transform.svg" width="20%">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Population_vs_area.svg'>Skbkekas</a> under the <a href='https://creativecommons.org/licenses/by-sa/3.0/deed.en'>CC BY-SA 3.0</a>
</div>

---