# Handling Dates and Times

## Introduction 

Dates and times (datetimes) are frequently encountered during preprocessing for machine learning, whether the time of a particular sale or the year of some public health statistic. In this chapter, we will build a toolbox of strategies for handling time series data, including tackling time zones and creating lagged time features. Specifically, we will focus on the time series tools in the pandas library, which centralizes the functionality of many other libraries.

In [1]:
import numpy as np
import pandas as pd

## 7.1 Converting Strings to Dates

Given a vector of strings representing dates and times, you want to transform them into time series data.

In [3]:
date_strings = np.array([
    '03-04-2005 11:35 PM',
    '23-05-2010 12:01 AM',
    '04-09-2009 09:09 PM'
])



### Solution 

In [4]:
# convert to datetimes
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]

[Timestamp('2005-04-03 23:35:00'),
 Timestamp('2010-05-23 00:01:00'),
 Timestamp('2009-09-04 21:09:00')]

In [5]:
# errors='coerce' turns any value that cannot be parsed into NaT instead of raising an error
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors='coerce') for date in date_strings]

[Timestamp('2005-04-03 23:35:00'),
 Timestamp('2010-05-23 00:01:00'),
 Timestamp('2009-09-04 21:09:00')]
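The list comprehensions above convert one string at a time. As a minimal sketch, `pd.to_datetime` can also take the whole array in one call. The third entry below is a made-up malformed value (not part of the original data) added only to show what `errors='coerce'` does when parsing fails.

```python
import numpy as np
import pandas as pd

# Same format as the strings above; the last entry is a deliberately
# malformed, hypothetical value used only for illustration
date_strings = np.array([
    '03-04-2005 11:35 PM',
    '23-05-2010 12:01 AM',
    'not a date'
])

# Passing the whole array converts every element in a single call;
# errors='coerce' yields NaT for values that do not match the format
converted = pd.to_datetime(date_strings, format='%d-%m-%Y %I:%M %p', errors='coerce')

print(converted)

# isna() flags the entries that failed to parse
print(converted.isna())
```

Without `errors='coerce'`, the malformed entry would raise an error, which is often the safer default while exploring a new data source.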

## 7.2 Handling Time Zones

You have time series data and want to add or change time zone information. 

In [6]:
import pandas as pd

pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')

Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')

### Solution

In [7]:
date = pd.Timestamp('2017-05-01 06:00:00')

date_in_london = date.tz_localize('Europe/London')

date_in_london

Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')

In [8]:
date_in_london.tz_convert('Africa/Abidjan')

Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')

In [9]:
dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='M'))

dates.dt.tz_localize('Africa/Abidjan')

0   2002-02-28 00:00:00+00:00
1   2002-03-31 00:00:00+00:00
2   2002-04-30 00:00:00+00:00
dtype: datetime64[ns, Africa/Abidjan]
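The two methods used above do different jobs, and the following sketch (reusing the same timestamp and zones) makes the distinction explicit: `tz_localize` attaches a zone to a naive timestamp without moving the clock, while `tz_convert` keeps the instant fixed and changes the clock reading.

```python
import pandas as pd

naive = pd.Timestamp('2017-05-01 06:00:00')

# tz_localize attaches a zone to a naive timestamp; the wall-clock time is unchanged
london = naive.tz_localize('Europe/London')

# tz_convert re-expresses the same instant in another zone; the wall-clock time shifts
abidjan = london.tz_convert('Africa/Abidjan')

print(london.hour, abidjan.hour)

# Both objects refer to the same instant, so they compare equal
print(london == abidjan)

# tz_localize(None) drops the zone and keeps the local wall-clock time
print(abidjan.tz_localize(None))
```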

## 7.3 Selecting Dates and Times

You have a vector of dates and you want to select one or more. 


In [12]:
# Load 
import pandas as pd 
# Create data frame 
dataframe = pd.DataFrame() 
# Create datetimes 
dataframe['date'] = pd.date_range('1/1/2001', periods=100000, freq='H') 

### Solution

In [13]:
# Select observations between two datetimes 
dataframe[(dataframe['date'] > '2002-1-1 01:00:00') & (dataframe['date'] <= '2002-1-1 04:00:00')]

                    date
8762 2002-01-01 02:00:00
8763 2002-01-01 03:00:00
8764 2002-01-01 04:00:00


In [15]:
dataframe = dataframe.set_index(dataframe['date']) 
dataframe.loc['2002-1-1 01:00:00':'2002-1-1 04:00:00'] 

                                   date
date
2002-01-01 01:00:00 2002-01-01 01:00:00
2002-01-01 02:00:00 2002-01-01 02:00:00
2002-01-01 03:00:00 2002-01-01 03:00:00
2002-01-01 04:00:00 2002-01-01 04:00:00
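The two approaches above return different numbers of rows (three versus four) because the boolean mask uses a strict `>` on the start time, while label slicing on a datetime index includes both endpoints. A small sketch on a six-row frame (a shortened stand-in for the 100,000-row frame above) shows the difference:

```python
import pandas as pd

# Small hourly frame; freq is given as a Timedelta so no frequency alias is needed
dataframe = pd.DataFrame()
dataframe['date'] = pd.date_range('2002-01-01', periods=6, freq=pd.Timedelta(hours=1))

start, end = '2002-01-01 01:00:00', '2002-01-01 04:00:00'

# Boolean mask: the strict > excludes the 01:00 row
masked = dataframe[(dataframe['date'] > start) & (dataframe['date'] <= end)]

# Label slice on a DatetimeIndex: both endpoints are included
indexed = dataframe.set_index('date', drop=False)
sliced = indexed.loc[start:end]

print(len(masked), len(sliced))
```

Changing `>` to `>=` in the mask would make the two selections match.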


## 7.4 Breaking Up Date Data into Multiple Features

You have a column of dates and times and you want to create features for year, month, day, hour, and minute.

In [16]:
dataframe = pd.DataFrame() 
dataframe['date'] = pd.date_range('1/1/2001', periods=150, freq='W') 

### Solution

In [17]:
dataframe['year'] = dataframe['date'].dt.year 
dataframe['month'] = dataframe['date'].dt.month 
dataframe['day'] = dataframe['date'].dt.day 
dataframe['hour'] = dataframe['date'].dt.hour 
dataframe['minute'] = dataframe['date'].dt.minute 

In [19]:
dataframe.head()

        date  year  month  day  hour  minute
0 2001-01-07  2001      1    7     0       0
1 2001-01-14  2001      1   14     0       0
2 2001-01-21  2001      1   21     0       0
3 2001-01-28  2001      1   28     0       0
4 2001-02-04  2001      2    4     0       0
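The `.dt` accessor exposes more components than the five used above. The sketch below adds a few that are commonly useful as features; the column names (`quarter`, `day_of_week`, `day_of_year`) are illustrative choices, not part of the original recipe. It also shows why the first date is 2001-01-07: the weekly frequency `'W'` anchors each period on a Sunday.

```python
import pandas as pd

dataframe = pd.DataFrame()
dataframe['date'] = pd.date_range('1/1/2001', periods=3, freq='W')

# 'W' anchors on Sunday, so the range starts at the first Sunday
# on or after 2001-01-01, which is 2001-01-07
features = pd.DataFrame({
    'quarter': dataframe['date'].dt.quarter,
    'day_of_week': dataframe['date'].dt.dayofweek,  # Monday=0 ... Sunday=6
    'day_of_year': dataframe['date'].dt.dayofyear,
})

print(features)
```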


## 7.5 Calculating the Difference Between Dates

You have two datetime features and want to calculate the time between them for each observation. 

In [20]:
dataframe = pd.DataFrame() 
dataframe['Arrived'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-04-2017')]
dataframe['Left'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-06-2017')] 

### Solution

In [21]:
dataframe['Left'] - dataframe['Arrived'] 

0   0 days
1   2 days
dtype: timedelta64[ns]

In [22]:
# Extract the whole-day component of each duration
(dataframe['Left'] - dataframe['Arrived']).dt.days

0    0
1    2
dtype: int64
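The day count above discards anything shorter than a full day. When durations include hours or minutes, `total_seconds()` keeps the fractional part. The timestamps in this sketch are invented for illustration and differ from the ones used in the solution.

```python
import pandas as pd

dataframe = pd.DataFrame()
# Hypothetical arrival and departure times that include a time of day
dataframe['Arrived'] = [pd.Timestamp('2017-01-01 08:00'), pd.Timestamp('2017-01-04 22:00')]
dataframe['Left'] = [pd.Timestamp('2017-01-01 20:00'), pd.Timestamp('2017-01-06 10:00')]

duration = dataframe['Left'] - dataframe['Arrived']

# .dt.days keeps only whole days (12 hours becomes 0, 36 hours becomes 1)
whole_days = duration.dt.days

# total_seconds() keeps the full duration; divide by 3600 to express it in hours
hours = duration.dt.total_seconds() / 3600

print(whole_days.tolist(), hours.tolist())
```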

## 7.6 Encoding Days of the Week

You have a vector of dates and want to know the day of the week for each date. 


In [23]:
dates = pd.Series(pd.date_range("2/2/2002", periods=3, freq="M")) 

### Solution

In [25]:
dates.dt.day_name()

0    Thursday
1      Sunday
2     Tuesday
dtype: object

In [26]:
dates.dt.weekday 

0    3
1    6
2    1
dtype: int64
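The integer encoding counts from Monday as 0, which is why Thursday maps to 3 and Sunday to 6. If a model should not treat weekdays as ordered numbers, one option is to one-hot encode the names instead. The sketch below uses `day_name()` (available in current pandas releases) and `pd.get_dummies`, and writes the three dates out explicitly rather than generating them with `date_range`.

```python
import pandas as pd

# The same three month-end dates as above, written out explicitly
dates = pd.Series(pd.to_datetime(['2002-02-28', '2002-03-31', '2002-04-30']))

# Weekday names and their integer codes (Monday=0 ... Sunday=6)
names = dates.dt.day_name()
numbers = dates.dt.weekday

# One-hot encode the names: one column per weekday that occurs in the data
one_hot = pd.get_dummies(names)

print(one_hot)
```

Note that `get_dummies` only creates columns for weekdays present in the data, so a larger sample is needed to obtain all seven columns.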

## 7.7 Creating a Lagged Feature

You want to create a feature that is lagged n time periods. 


In [27]:
dataframe = pd.DataFrame() 

In [28]:
dataframe["dates"] = pd.date_range("1/1/2001", periods=5, freq="D") 
dataframe["stock_price"] = [1.1,2.2,3.3,4.4,5.5] 

### Solution

In [29]:
dataframe["previous_days_stock_price"] = dataframe["stock_price"].shift(1) 
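To make the effect of `shift` visible, the following sketch rebuilds the same frame and adds lags of one and two periods. The column names `lag_1` and `lag_2` are illustrative. The first n rows of an n-period lag have no earlier value to draw on, so they are filled with NaN.

```python
import pandas as pd

dataframe = pd.DataFrame()
dataframe["dates"] = pd.date_range("1/1/2001", periods=5, freq="D")
dataframe["stock_price"] = [1.1, 2.2, 3.3, 4.4, 5.5]

# shift(n) moves values down n rows; the first n rows become NaN
dataframe["lag_1"] = dataframe["stock_price"].shift(1)
dataframe["lag_2"] = dataframe["stock_price"].shift(2)

print(dataframe)

# Rows with incomplete lags are often dropped before fitting a model
complete = dataframe.dropna()
print(len(complete))
```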

## 7.8 Using Rolling Time Windows

Given time series data, you want to calculate some statistic for a rolling time window.

In [31]:
time_index = pd.date_range("01/01/2010", periods=5, freq="M")
dataframe = pd.DataFrame(index=time_index) 
dataframe["Stock_Price"] = [1,2,3,4,5] 

### Solution

In [32]:
dataframe.rolling(window=2).mean() 

            Stock_Price
2010-01-31          NaN
2010-02-28          1.5
2010-03-31          2.5
2010-04-30          3.5
2010-05-31          4.5
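The first row of the result is missing because a window of two needs two observations before it can produce a value. As a rough sketch, `min_periods` relaxes that requirement, and the same rolling object can compute statistics other than the mean. The month-end dates are written out explicitly here instead of being generated with `date_range`.

```python
import pandas as pd

# Same five month-end dates and prices as above
time_index = pd.to_datetime(
    ["2010-01-31", "2010-02-28", "2010-03-31", "2010-04-30", "2010-05-31"]
)
dataframe = pd.DataFrame({"Stock_Price": [1, 2, 3, 4, 5]}, index=time_index)

rolling = dataframe["Stock_Price"].rolling(window=2)

# Default: a value is produced only once the window holds 2 observations
full_windows = rolling.mean()

# min_periods=1 allows a partial window, so the first row is no longer NaN
partial = dataframe["Stock_Price"].rolling(window=2, min_periods=1).mean()

# Other statistics work the same way
rolling_max = rolling.max()

print(full_windows.tolist(), partial.tolist(), rolling_max.tolist())
```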


In [33]:
import pandas as pd
import numpy as np

time_index = pd.date_range('01/01/2010', periods=5, freq='M')

df = pd.DataFrame(index=time_index)

df["Sales"] = [1.0, 2.0, np.nan, np.nan, 5.0]

df.interpolate()

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    3.0
2010-04-30    4.0
2010-05-31    5.0


## 7.9 Handling Missing Data in Time Series

You have missing values in time series data. 

In [34]:
time_index = pd.date_range("01/01/2010", periods=5, freq="M") 
dataframe = pd.DataFrame(index=time_index) 
dataframe["Sales"] = [1.0,2.0,np.nan,np.nan,5.0] 

### Solution

In [35]:
dataframe.interpolate() 

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    3.0
2010-04-30    4.0
2010-05-31    5.0


Alternatively, we can replace missing values with the last known value (i.e., forward-filling):

In [39]:
dataframe.ffill() 

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    2.0
2010-04-30    2.0
2010-05-31    5.0


We can also replace missing values with the next known value (i.e., back-filling):

In [40]:
dataframe.bfill() 

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    5.0
2010-04-30    5.0
2010-05-31    5.0


Interpolation is a technique for filling in gaps caused by missing values by, in effect, drawing a line or curve between the known values bordering the gap and using that line or curve to predict reasonable values. Interpolation can be particularly useful when the time intervals between observations are constant, the data is not prone to noisy fluctuations, and the gaps caused by missing values are small. For example, in our solution a gap of two missing values was bordered by 2.0 and 5.0. By fitting a line starting at 2.0 and ending at 5.0, we can make reasonable guesses of 3.0 and 4.0 for the two missing values in between.
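The arithmetic behind those guesses can be checked by hand. The sketch below computes the two points on the line from 2.0 to 5.0 manually and compares them with the pandas result. It relies on the default `method='linear'`, which ignores the index and treats the values as equally spaced, so a plain integer-indexed Series is used here for brevity.

```python
import numpy as np
import pandas as pd

sales = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])

# The gap is bordered by 2.0 and 5.0, and crossing it takes 3 equal steps
left, right, steps = 2.0, 5.0, 3

# Points on the straight line after 1 and 2 steps
manual = [left + (right - left) * k / steps for k in (1, 2)]

# Default linear interpolation in pandas
filled = sales.interpolate()

print(manual)
print(filled.tolist())
```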

If we believe the line between the two known points is nonlinear, we can use interpolate's method parameter to specify the interpolation method:

In [41]:
dataframe.interpolate(method="quadratic")

               Sales
2010-01-31  1.000000
2010-02-28  2.000000
2010-03-31  3.059808
2010-04-30  4.038069
2010-05-31  5.000000


Finally, there might be cases when we have large gaps of missing values and do not want to interpolate values across the entire gap. In these cases we can use limit to restrict the number of interpolated values and limit_direction to set whether to interpolate values forward from the last known value before the gap or backward from the first known value after it:

In [38]:
dataframe.interpolate(limit=1, limit_direction="forward")

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    3.0
2010-04-30    NaN
2010-05-31    5.0


Back-filling and forward-filling can be thought of as a form of naive interpolation, where we draw a flat line from a known value and use it to fill in missing values. One (minor) advantage back- and forward-filling have over interpolation is the lack of the need for known values on both sides of missing value(s).
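That advantage shows up at the edges of a series. The sketch below uses a made-up series with a missing value at each end and one in the middle. It passes `limit_area="inside"` so that pandas only interpolates gaps with known values on both sides, which is an assumption chosen to match the strict sense of interpolation described above.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end
values = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Interpolate only gaps that have known values on both sides
inside_only = values.interpolate(limit_area="inside")

# Forward-fill needs only an earlier value, so it fills the trailing gap
forward = values.ffill()

# Back-fill needs only a later value, so it fills the leading gap
backward = values.bfill()

print(inside_only.tolist())
print(forward.tolist())
print(backward.tolist())
```

Chaining the two fills, for example `values.ffill().bfill()`, is one simple way to cover both edges when that is appropriate for the data.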