# Pandas Period

[Feature Engineering for Time Series Forecasting](https://www.trainindata.com/p/feature-engineering-for-forecasting)

In this notebook we'll discuss the Pandas `Period` and `PeriodIndex` types, which are used to handle data that refers to time spans.

# Load example data

The air passengers dataset contains the monthly totals of international airline passengers from 1949 to 1960, in thousands.

For instructions on how to download, prepare, and store the dataset, refer to notebook number 5, in the folder "01-Create-Datasets" from this repo.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(
    "../Datasets/example_air_passengers.csv",
    parse_dates=["ds"],
    index_col=["ds"],
)

In [3]:
df.index

DatetimeIndex(['1949-01-01', '1949-02-01', '1949-03-01', '1949-04-01',
               '1949-05-01', '1949-06-01', '1949-07-01', '1949-08-01',
               '1949-09-01', '1949-10-01',
               ...
               '1960-03-01', '1960-04-01', '1960-05-01', '1960-06-01',
               '1960-07-01', '1960-08-01', '1960-09-01', '1960-10-01',
               '1960-11-01', '1960-12-01'],
              dtype='datetime64[ns]', name='ds', length=144, freq=None)

In [4]:
type(df.index[0])

pandas._libs.tslibs.timestamps.Timestamp

The current type of our index is a `DatetimeIndex` where each element is a `Timestamp`.

# Pandas Period: what it is and when to use it

When working with time-related information that refers to a time span (e.g., the sales of products over each month) rather than an instant in time (e.g., an event that occurs at a specific timestamp), it can be more convenient to work with a Pandas data type called `Period`.

To read more about the `Period` type in Pandas see the [docs](https://pandas.pydata.org/docs/user_guide/timeseries.html), in particular the section titled "timestamps vs. time spans".
    
   > "A `Period` represents a span of time (e.g., a day, a month, a quarter, etc)."
   
   > "Under the hood, pandas represents timestamps using instances of `Timestamp` and sequences of timestamps using instances of `DatetimeIndex`. For regular time spans, pandas uses `Period` objects for scalar values and `PeriodIndex` for sequences of spans."

`Period` objects can be created just as easily as `Timestamp` objects.

In [5]:
pd.Timestamp("2020-01-01") # Create a timestamp representing 1st January 2020 at time 00:00:00

Timestamp('2020-01-01 00:00:00')

In [6]:
pd.Period("2020-01", freq="M") # Create a time period representing the month of January 2020

Period('2020-01', 'M')
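To see the span that a `Period` covers, we can inspect its boundaries. The following is a minimal sketch using the `start_time` and `end_time` attributes; the variable names are chosen here for illustration only.

```python
import pandas as pd

january = pd.Period("2020-01", freq="M")

# First and last instants covered by the period
start = january.start_time
end = january.end_time

# Any timestamp inside January 2020 falls within the span
mid_month = pd.Timestamp("2020-01-15 12:00:00")
inside = start <= mid_month <= end

print(start, end, inside)
```

A `Timestamp` is a single point, whereas the `Period` above stands for every instant between `start` and `end`.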

Our dataset index is currently a `DatetimeIndex` in which a day (and even a time) is associated with each month (e.g., 1960-12-01 00:00:00), even though the day and time are meaningless for this dataset. What we're trying to represent is the number of passengers over the time span of a given month.

In [7]:
df.head()

| ds | y |
|---|---|
| 1949-01-01 | 112 |
| 1949-02-01 | 118 |
| 1949-03-01 | 132 |
| 1949-04-01 | 129 |
| 1949-05-01 | 121 |


We can convert the index from a `DatetimeIndex` to a `PeriodIndex` as follows. When no frequency is passed, `to_period()` infers it from the data:

In [8]:
df.index = df.index.to_period()

In [9]:
df.head()

| ds | y |
|---|---|
| 1949-01 | 112 |
| 1949-02 | 118 |
| 1949-03 | 132 |
| 1949-04 | 129 |
| 1949-05 | 121 |


In [10]:
df.index

PeriodIndex(['1949-01', '1949-02', '1949-03', '1949-04', '1949-05', '1949-06',
             '1949-07', '1949-08', '1949-09', '1949-10',
             ...
             '1960-03', '1960-04', '1960-05', '1960-06', '1960-07', '1960-08',
             '1960-09', '1960-10', '1960-11', '1960-12'],
            dtype='period[M]', name='ds', length=144)

We now have a `PeriodIndex` with monthly frequency, which better represents the time series (i.e., the number of passengers over the whole month).
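If we prefer not to rely on frequency inference, the frequency can be passed explicitly, and the conversion can be reversed with `to_timestamp()`. The sketch below uses a small stand-in index built with `pd.date_range` rather than the air passengers file, so the data in it is an assumption for illustration.

```python
import pandas as pd

# Stand-in for the month-start DatetimeIndex of the dataset
dt_index = pd.date_range("1949-01-01", periods=3, freq="MS")

# Convert to monthly periods, stating the frequency explicitly
period_index = dt_index.to_period("M")

# Convert back; by default each period maps to its start time
round_trip = period_index.to_timestamp()

print(period_index)
print(round_trip)
```

Going back to timestamps can be handy when a downstream library expects a `DatetimeIndex`.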

`Period` objects can make it easier to do certain calculations. Let's add one month to a given period:

In [11]:
df.index[0]

Period('1949-01', 'M')

In [12]:
df.index[0] + 1

Period('1949-02', 'M')
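The same integer arithmetic handles year boundaries, and `pd.period_range` builds a sequence of consecutive periods. A short sketch, with variable names chosen here for illustration:

```python
import pandas as pd

december = pd.Period("1949-12", freq="M")

# Adding one month rolls over into the next year
next_month = december + 1

# Subtracting an integer moves backwards by that many months
three_back = december - 3

# Three consecutive monthly periods starting in November 1949
months = pd.period_range(start="1949-11", periods=3, freq="M")

print(next_month, three_back, list(months))
```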

`Period` is also the preferred type when calculating **exact** differences between dates in terms of calendar units. For example, what is the exact integer number of weeks between the weeks containing the following two timestamps: "2012-01-15 10:00:00" (week 2, year 2012) and "2014-04-01 01:30:00" (week 14, year 2014)?

Using `Period`:

In [13]:
delta = pd.Period("2012-01-15 10:00:00", freq="W") - pd.Period("2014-04-01 01:30:00", freq="W")
delta

<-116 * Weeks: weekday=6>

We can get the integer using the `n` attribute:

In [14]:
delta.n

-116

Using `Timestamp` and `timedelta` objects we instead get the elapsed time expressed in weeks, which is a fraction and only approximates the number of calendar weeks between the two dates:

In [15]:
(pd.Timestamp("2012-01-15 10:00:00") - pd.Timestamp("2014-04-01 01:30:00")) / np.timedelta64(1, "W")

-115.23511904761905
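The gap between the two approaches is easiest to see with two timestamps that are one day apart but fall in different calendar weeks. The sketch below is an illustration with dates chosen here (a Sunday and the following Monday); it relies on the `"W"` frequency meaning weeks that end on Sunday.

```python
import pandas as pd

sunday = pd.Timestamp("2012-01-15 10:00:00")   # last day of its week
monday = pd.Timestamp("2012-01-16 10:00:00")   # first day of the next week

# Calendar weeks apart: compare the weekly periods containing each timestamp
weeks_by_period = (pd.Period(monday, freq="W") - pd.Period(sunday, freq="W")).n

# Elapsed time expressed in weeks: one day is a seventh of a week
weeks_elapsed = (monday - sunday) / pd.Timedelta(weeks=1)

print(weeks_by_period, weeks_elapsed)
```

The `Period` difference counts week boundaries crossed, while the `Timestamp` difference measures elapsed duration. Which one is right depends on the question being asked.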

Whether we use `Period` or `datetime` should not change the forecasting workflow, but it will make some calculations easier depending on the time series.

In general, if your data represents a time span (e.g., sales over one month), then `Period` can make handling the data more convenient. If your data represents events that occurred at a point in time, then `datetime` or `Timestamp` is preferred.