# Date and Time features with Feature-engine

We can automate the extraction of date and time features with Feature-engine, using the [DatetimeFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/datetime/DatetimeFeatures.html) transformer.

## The dataset

We will use the Online Retail II Data Set available in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/00502/).

Download the xlsx file from the link above and save it in the **Datasets** folder within this repo.

**Citation**:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## In this demo

We will extract different time-related features from the datetime variable **InvoiceDate**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from feature_engine.datetime import DatetimeFeatures

## Load the data

In [2]:
# File path:
file = "../Datasets/online_retail_II.xlsx"

# The data is provided as two sheets in a single Excel file.
# Each sheet contains a different time period.
# Load both and join them into a single dataframe
# as shown below:

df_1 = pd.read_excel(file, sheet_name="Year 2009-2010")
df_2 = pd.read_excel(file, sheet_name="Year 2010-2011")

data = pd.concat([df_1, df_2])

print(data.shape)

data.head()

(1067371, 8)


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


In this dataset, the datetime variable is stored in a column called InvoiceDate. It could also be stored in the dataframe index; in that case, we would set `variables="index"` in DatetimeFeatures, and the rest of the procedure would be the same.

The dataset contains sales information for different customers in different countries. Customers may have made one or multiple purchases from the business that provided the data.

## Variable format

In [3]:
# Let's check the data type of the datetime variable.

data["InvoiceDate"].dtype

dtype('<M8[ns]')
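
The output above shows that InvoiceDate was already parsed as a datetime when the Excel file was loaded. If the same data came from a CSV file, the column would typically be loaded as strings instead. The following is a minimal sketch, using a hypothetical toy dataframe rather than the real dataset, of how one might check the type and convert it with pandas:

```python
import pandas as pd

# Hypothetical toy data: timestamps stored as strings,
# as they might look after reading a CSV file.
toy = pd.DataFrame(
    {"InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 09:06:00"]}
)

# Before conversion, the column is not a datetime type.
print(pd.api.types.is_datetime64_any_dtype(toy["InvoiceDate"]))

# Parse the strings into datetime values.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"])

# After conversion, the check passes and the dt accessor is available.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["InvoiceDate"])
print(is_datetime)
print(toy["InvoiceDate"].dt.hour.tolist())
```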

## Extract all possible features

With `features_to_extract="all"`, DatetimeFeatures extracts every feature it supports out-of-the-box.

In [4]:
dtfs = DatetimeFeatures(
    variables=None,  # it identifies the datetime variable automatically.
    features_to_extract="all",
)

In [5]:
# Extract features.
dft = dtfs.fit_transform(data)

# Capture the names of the features we just created.
# (Feature-engine tags them with the original variable name
# plus the feature it extracted).
vars_ = [v for v in dft.columns if "InvoiceDate" in v]

# Show
dft[vars_].head()

| | InvoiceDate_month | InvoiceDate_quarter | InvoiceDate_semester | InvoiceDate_year | InvoiceDate_week | InvoiceDate_day_of_week | InvoiceDate_day_of_month | InvoiceDate_day_of_year | InvoiceDate_weekend | InvoiceDate_month_start | InvoiceDate_month_end | InvoiceDate_quarter_start | InvoiceDate_quarter_end | InvoiceDate_year_start | InvoiceDate_year_end | InvoiceDate_leap_year | InvoiceDate_days_in_month | InvoiceDate_hour | InvoiceDate_minute | InvoiceDate_second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 4 | 2 | 2009 | 49 | 1 | 1 | 335 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7 | 45 | 0 |
| 1 | 12 | 4 | 2 | 2009 | 49 | 1 | 1 | 335 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7 | 45 | 0 |
| 2 | 12 | 4 | 2 | 2009 | 49 | 1 | 1 | 335 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7 | 45 | 0 |
| 3 | 12 | 4 | 2 | 2009 | 49 | 1 | 1 | 335 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7 | 45 | 0 |
| 4 | 12 | 4 | 2 | 2009 | 49 | 1 | 1 | 335 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 7 | 45 | 0 |


In [6]:
# The datetime variable, which was automatically
# identified, is stored in an attribute.

dtfs.variables_

['InvoiceDate']
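
For intuition, several of the features in the table above can be reproduced by hand with the pandas `dt` accessor. The sketch below is an approximation written for illustration, not Feature-engine's own implementation. The toy series and the column names are assumptions; the first timestamp matches the first row of the dataset, so its values can be compared with the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series. The first value matches the first
# row of the dataset; the second one falls on a Saturday.
s = pd.Series(pd.to_datetime(["2009-12-01 07:45:00", "2010-06-26 14:30:00"]))

manual = pd.DataFrame(
    {
        "month": s.dt.month,
        "quarter": s.dt.quarter,
        # Semester: 1 for quarters 1 and 2, 2 for quarters 3 and 4.
        "semester": np.where(s.dt.quarter < 3, 1, 2),
        # ISO week number of the year.
        "week": s.dt.isocalendar().week.astype(int),
        # Monday is 0 and Sunday is 6.
        "day_of_week": s.dt.dayofweek,
        "day_of_year": s.dt.dayofyear,
        # Weekend flag: 1 for Saturday (5) or Sunday (6), else 0.
        "weekend": (s.dt.dayofweek > 4).astype(int),
        "month_start": s.dt.is_month_start.astype(int),
        "days_in_month": s.dt.days_in_month,
    }
)

print(manual)
```

The first row of `manual` should agree with the corresponding columns in the table above, for example week 49 and day of year 335 for 2009-12-01.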

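## Extract most common features

With the default `features_to_extract=None`, DatetimeFeatures extracts the most **common** features: month, year, day of the week, day of the month, hour, minute and second.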

In [7]:
dtfs = DatetimeFeatures(
    variables=None,  # it identifies the datetime variable automatically
    features_to_extract=None,
)

In [8]:
# Extract features
dft = dtfs.fit_transform(data)

# Capture the names of the features we just created.
# (Feature-engine tags them with the original variable name
# plus the feature it extracted).
vars_ = [v for v in dft.columns if "InvoiceDate" in v]

# Show
dft[vars_].head()

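| | InvoiceDate_month | InvoiceDate_year | InvoiceDate_day_of_week | InvoiceDate_day_of_month | InvoiceDate_hour | InvoiceDate_minute | InvoiceDate_second |
|---|---|---|---|---|---|---|---|
| 0 | 12 | 2009 | 1 | 1 | 7 | 45 | 0 |
| 1 | 12 | 2009 | 1 | 1 | 7 | 45 | 0 |
| 2 | 12 | 2009 | 1 | 1 | 7 | 45 | 0 |
| 3 | 12 | 2009 | 1 | 1 | 7 | 45 | 0 |
| 4 | 12 | 2009 | 1 | 1 | 7 | 45 | 0 |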


## Extract user-defined features

We can also extract a specific subset of features by passing their names as a list to `features_to_extract`.

In [9]:
dtfs = DatetimeFeatures(
    variables=None,  # it identifies the datetime variable automatically
    features_to_extract=["week", "year", "day_of_month", "day_of_week"],
)

In [10]:
# Extract features
dft = dtfs.fit_transform(data)

# Capture the names of the features we just created.
# (Feature-engine tags them with the original variable name
# plus the feature it extracted).
vars_ = [v for v in dft.columns if "InvoiceDate" in v]

# Show
dft[vars_].head()

| | InvoiceDate_week | InvoiceDate_year | InvoiceDate_day_of_month | InvoiceDate_day_of_week |
|---|---|---|---|---|
| 0 | 49 | 2009 | 1 | 1 |
| 1 | 49 | 2009 | 1 | 1 |
| 2 | 49 | 2009 | 1 | 1 |
| 3 | 49 | 2009 | 1 | 1 |
| 4 | 49 | 2009 | 1 | 1 |


## Working with time zones

DatetimeFeatures can also handle timestamps from different time zones out-of-the-box.

In [11]:
# First, let's create a toy dataframe with some
# timestamps in different time zones.

df = pd.DataFrame()

# Note: recent pandas versions deprecate the uppercase "H"
# alias for hourly frequency, so we use "h" instead.
df["time"] = pd.concat(
    [
        pd.Series(
            pd.date_range(
                start="2014-08-01 09:00", freq="h", periods=3, tz="Europe/Berlin"
            )
        ),
        pd.Series(
            pd.date_range(
                start="2014-08-01 09:00", freq="h", periods=3, tz="US/Central"
            )
        ),
    ],
    axis=0,
)

df

| | time |
|---|---|
| 0 | 2014-08-01 09:00:00+02:00 |
| 1 | 2014-08-01 10:00:00+02:00 |
| 2 | 2014-08-01 11:00:00+02:00 |
| 0 | 2014-08-01 09:00:00-05:00 |
| 1 | 2014-08-01 10:00:00-05:00 |
| 2 | 2014-08-01 11:00:00-05:00 |


We can see the different time zones in the UTC offsets of the timestamps: +02:00 for Europe/Berlin and -05:00 for US/Central.

In [12]:
dfts = DatetimeFeatures(
    features_to_extract=["day_of_week", "hour", "minute"],
    drop_original=False,
    utc=True,  # to handle timezones
)

# DatetimeFeatures will convert all timestamps to UTC
# before deriving the features.

In [13]:
dft = dfts.fit_transform(df)

dft.head()

| | time | time_day_of_week | time_hour | time_minute |
|---|---|---|---|---|
| 0 | 2014-08-01 09:00:00+02:00 | 4 | 7 | 0 |
| 1 | 2014-08-01 10:00:00+02:00 | 4 | 8 | 0 |
| 2 | 2014-08-01 11:00:00+02:00 | 4 | 9 | 0 |
| 0 | 2014-08-01 09:00:00-05:00 | 4 | 14 | 0 |
| 1 | 2014-08-01 10:00:00-05:00 | 4 | 15 | 0 |
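
The `time_hour` values above can be checked by hand. The sketch below uses only the Python standard library and, as a simplifying assumption, replaces the two named time zones with the fixed offsets they had on 2014-08-01 (+02:00 and -05:00). It converts 09:00 local time in each zone to UTC and reads off the hour.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for the two zones on 2014-08-01:
# Europe/Berlin was at UTC+02:00 and US/Central at UTC-05:00.
berlin = timezone(timedelta(hours=2))
central = timezone(timedelta(hours=-5))

local_times = [
    datetime(2014, 8, 1, 9, 0, tzinfo=berlin),
    datetime(2014, 8, 1, 9, 0, tzinfo=central),
]

# Convert both timestamps to UTC so they share a common reference.
utc_times = [t.astimezone(timezone.utc) for t in local_times]
utc_hours = [t.hour for t in utc_times]

print(utc_hours)  # [7, 14]
```

Both timestamps read 09:00 locally, but they are seven hours apart in UTC, which is why the extracted hours differ between the two groups of rows.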


More details about Feature-engine's datetime transformer are available in the [user guide](https://feature-engine.readthedocs.io/en/latest/user_guide/datetime/DatetimeFeatures.html#datetime-features).