# Time-series data and the DatetimeIndex

A time-series is a sequence of data points, typically consisting of successive measurements made at a regular frequency over a specific time interval. Time-series analysis comprises methods for extracting meaningful statistics from such data in order to make decisions, while regression analysis tests whether one or more independent time-series affect the current value of another time-series.

There is extensive support for working with time-series data in pandas. In this session, we will examine representing time-series data with the pandas Series and DataFrame, as well as several common techniques for manipulating this data. The techniques learned in this session will set the basis for the remaining sessions, where we will examine several financial processes using time-series data, including analyzing historical stock performance, correlating multiple streams of financial and social data to develop trading strategies, optimizing portfolio allocation, and calculating risk based upon historical data.

In this session, we will cover the following:

    1- DatetimeIndex and its use in time-series data
    2- Creating time-series with specific frequencies
    3- Calculation of new dates using date offsets
    4- Representation of intervals of time using periods
    5- Shifting and lagging time-series data
    6- Frequency conversion of time-series data
    7- Upsampling and downsampling of time-series data

In [1]:
import numpy as np
import pandas as pd
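# import only the datetime class; a plain "import datetime" on the
# line before would be shadowed by this import anyway
from datetime import datetime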
import matplotlib.pyplot as plt

# read the Microsoft and Apple data from file
msft = pd.read_csv("https://raw.githubusercontent.com/safarini/Python_Pandas/master/msft.csv", index_col=0, parse_dates=True)
aapl = pd.read_csv("https://raw.githubusercontent.com/safarini/Python_Pandas/master/aapl.csv", index_col=0, parse_dates=True)

pandas was created initially for use in finance and excels at manipulating time-series data. From its inception, it has had facilities for date and time-series operations that handle complex financial scenarios, and these capabilities have been progressively expanded and refined with each release.

pandas provides its own representations of dates, times, time intervals, and periods, which go beyond those provided by other Python libraries such as SciPy and NumPy. The pandas implementations provide additional capabilities that are required to model time-series data, and to transform data across different frequencies, periods, and calendars for different organizations and financial markets.

Specific dates and times in pandas are represented using the pandas Timestamp class. Timestamp is based on NumPy's dtype datetime64 and has higher precision than Python's built-in datetime object. This increased precision is frequently required for accurate financial calculations.
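
To make the precision difference concrete, here is a minimal sketch. The timestamp value is arbitrary and chosen only for illustration. A Timestamp keeps nanoseconds, while a conversion to Python's datetime keeps only microseconds:

```python
import pandas as pd

# an arbitrary timestamp with nanosecond detail (illustrative value)
ts = pd.Timestamp("2014-08-01 12:00:00.123456789")

# Timestamp stores the microsecond and nanosecond parts separately
micro, nano = ts.microsecond, ts.nanosecond

# converting to datetime drops the nanoseconds; warn=False silences the
# warning pandas would otherwise emit about the lost precision
py_dt = ts.to_pydatetime(warn=False)
print(micro, nano, py_dt.microsecond)
```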

Sequences of timestamp objects are represented by pandas as a DatetimeIndex, which is a type of pandas index that is optimized for indexing by dates and times. There are several ways to create DatetimeIndex objects in pandas. The following command creates a DatetimeIndex from an array of datetime objects:

In [2]:
# create a DatetimeIndex from an array of datetime objects
dates = [datetime(2014, 8, 1), datetime(2014, 8, 2)]
dti   = pd.DatetimeIndex(dates)
dti

DatetimeIndex(['2014-08-01', '2014-08-02'], dtype='datetime64[ns]', freq=None)

A Series will also automatically construct a DatetimeIndex as its index when a list of datetime objects is passed as the index parameter:

In [3]:
# a Series given a datetime list will automatically create
# a DatetimeIndex as its index
np.random.seed(123)
ts = pd.Series(np.random.randn(2), dates)
type(ts.index)

pandas.core.indexes.datetimes.DatetimeIndex

The Series object has taken the datetime objects and constructed a DatetimeIndex from the date values, where each value of the DatetimeIndex is a Timestamp object. Each element of the index can be used to access the corresponding value in the Series object. To demonstrate this, the following commands show two ways to access the value in the Series with the date 2014-08-02 as an index label:



In [4]:
# retrieve a value using a datetime object
ts[datetime(2014, 8, 2)]

0.9973454465835858

In [5]:
# this can also be performed with a string
ts['2014-8-2']

0.9973454465835858

Passing a list of date strings directly as the index does not, in current pandas versions, produce a DatetimeIndex: the labels are kept as plain strings in a generic Index. Converting the strings with pd.to_datetime() first gives the Series a DatetimeIndex:

In [6]:
# create a Series with a DatetimeIndex from date strings
np.random.seed(123)
dates = pd.to_datetime(['2014-08-01', '2014-08-02'])
ts = pd.Series(np.random.randn(2), dates)
ts

2014-08-01   -1.085631
2014-08-02    0.997345
dtype: float64

pandas also provides the pd.to_datetime() function, which converts a list of potentially mixed-type items into a DatetimeIndex (in pandas 2.0 and later, a list that mixes string formats like this one may additionally need format='mixed'):

In [7]:
# convert a list of items to a DatetimeIndex
dti = pd.to_datetime(['Aug 1, 2014', '2014-08-02', 
                      '2014.8.3', None])
dti

DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', 'NaT'], dtype='datetime64[ns]', freq=None)

NOTE  
Notice that None is converted into a not-a-time value, NaT, which represents that the source data could not be converted into datetime.  

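Be careful with values that cannot be parsed. In current pandas versions, pd.to_datetime() raises an exception by default (errors='raise'); passing errors='coerce' converts such values to NaT instead. When every value can be parsed, the result is a DatetimeIndex, as shown here: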

In [8]:
dti2 = pd.to_datetime(['Aug 1, 2014'])
type(dti2)


pandas.core.indexes.datetimes.DatetimeIndex
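
The handling of unparseable values can be sketched as follows. This is a minimal example that assumes a recent pandas version, where the errors parameter defaults to "raise"; the input list is made up for illustration:

```python
import pandas as pd

values = ["2014-08-01", "foo"]  # made-up input; "foo" is not a date

# errors="raise": an unparseable value raises an exception
try:
    pd.to_datetime(values, errors="raise")
    raised = False
except ValueError:
    raised = True

# errors="coerce": unparseable values become NaT instead
coerced = pd.to_datetime(values, errors="coerce")
print(raised, coerced)
```

Coercing is convenient for dirty data, but it can hide bad input silently, so it is worth counting the NaT values afterwards.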

By default, pandas interprets ambiguous date strings as month first. If you need to parse dates with the day as the first component, use the dayfirst=True option, which is often needed for non-U.S. data. The following command demonstrates this by parsing the same date written both ways:

In [9]:
# demonstrate two representations of the same date, one 
# month first, the other day first, converting to the 
# same date representation in pandas
dti1 = pd.to_datetime(['8/1/2014'])
dti2 = pd.to_datetime(['1/8/2014'], dayfirst=True)
dti1[0], dti2[0]

(Timestamp('2014-08-01 00:00:00'), Timestamp('2014-08-01 00:00:00'))
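
When the layout of the date strings is known in advance, an explicit format string removes the ambiguity altogether. The following is a small sketch of that alternative, using the standard strftime-style directives accepted by the format parameter of pd.to_datetime():

```python
import pandas as pd

# the format string states which component comes first, so nothing is guessed
day_first = pd.to_datetime("1/8/2014", format="%d/%m/%Y")
month_first = pd.to_datetime("8/1/2014", format="%m/%d/%Y")
print(day_first, month_first)
```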

A range of timestamps at a specific frequency can easily be created using the pd.date_range() function. The following command creates a Series from a DatetimeIndex of 10 consecutive days:

In [10]:
# create a Series with a DatetimeIndex starting at 8/1/2014
# and consisting of 10 consecutive days
np.random.seed(123456)
dates = pd.date_range('8/1/2014', periods=10)
s1 = pd.Series(np.random.randn(10), dates)
s1[:5]

2014-08-01    0.469112
2014-08-02   -0.282863
2014-08-03   -1.509059
2014-08-04   -1.135632
2014-08-05    1.212112
Freq: D, dtype: float64

In [11]:
# for examples of data retrieval / slicing, we will use the 
# following data from Yahoo! Finance
msft.head(5)

             Open   High    Low  Close    Volume  Adj Close
Date
2012-01-03  26.55  26.96  26.39  26.77  64731500   24.42183
2012-01-04  26.82  27.47  26.78   27.4  80516100   24.99657
2012-01-05  27.38  27.73  27.29  27.68  56081400   25.25201
2012-01-06  27.53  28.19  27.53  28.11  99455500   25.64429
2012-01-09  28.05   28.1  27.72  27.74  59706800   25.30675


The msft variable is a DataFrame that represents a time-series of multiple data points (Open, High, Low, and so on) for the MSFT stock. To make these examples easier, from this DataFrame, we can create a pandas Series consisting of just the Adj Close values:

In [12]:
# extract just the Adj Close values
msftAC = msft['Adj Close']
msftAC.head(3)

Date
2012-01-03    24.42183
2012-01-04    24.99657
2012-01-05    25.25201
Name: Adj Close, dtype: float64

The msftAC variable is a pandas Series object. This matters because several of the operations that retrieve values differ depending upon whether they are applied to a Series or a DataFrame, which can cause confusion if it is not recognized.

The slicing notation is overridden to very conveniently allow the passing of strings representing dates as the values for the slice. The following command retrieves MSFT data for dates from 2012-01-01 to 2012-01-05:

In [13]:
# slicing using a DatetimeIndex nicely works with dates 
# passed as strings
msft['2012-01-01':'2012-01-05']

             Open   High    Low  Close    Volume  Adj Close
Date
2012-01-03  26.55  26.96  26.39  26.77  64731500   24.42183
2012-01-04  26.82  27.47  26.78   27.4  80516100   24.99657
2012-01-05  27.38  27.73  27.29  27.68  56081400   25.25201


A specific item can be retrieved from a time-series represented by a DataFrame by specifying the date/time index value with the .loc indexer. The result is a Series whose index labels are the column names and whose values are those of the selected row:

In [14]:
# returns a Series representing all the values of the 
# single row indexed by the column names
msft.loc['2012-01-03']

Open         2.655000e+01
High         2.696000e+01
Low          2.639000e+01
Close        2.677000e+01
Volume       6.473150e+07
Adj Close    2.442183e+01
Name: 2012-01-03 00:00:00, dtype: float64

Note that the following syntax does not work, as the DataFrame attempts to look for a column named 2012-01-03:

In [15]:
# this is an error as this tries to retrieve a column
# named '2012-01-03'
# msft['2012-01-03'] # commented to prevent killing the notebook

This syntax does work on a Series object that is a time-series, where it looks for an index label with the matching date:

In [16]:
# this is a Series, so the lookup works
msftAC['2012-01-03']

24.42183

NOTE<br>

This is a subtle difference that sometimes causes headaches when using time-series data in pandas. To avoid it, use .loc for label-based lookups; it looks along the index for both Series and DataFrame objects.
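
The difference can be sketched with a small made-up frame standing in for the MSFT data (the values and the column name are illustrative only). The same .loc lookup resolves against the index in both cases:

```python
import pandas as pd

# small made-up frame standing in for the MSFT data
idx = pd.date_range("2012-01-03", periods=3, freq="D")
df = pd.DataFrame({"Close": [26.77, 27.40, 27.68]}, index=idx)
closes = df["Close"]

row = df.loc["2012-01-03"]        # DataFrame: a Series holding the row
value = closes.loc["2012-01-03"]  # Series: the single value
print(row["Close"], value)
```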

One of the advantages of pandas is the ability to select based upon partial datetime specifications. As an example, the following command selects MSFT data for the month of February 2012:

In [17]:
# we can lookup using partial date specifications
# such as only year and month
msft['2012-02'].head(5)

             Open   High    Low  Close    Volume  Adj Close
Date
2012-02-01  29.79  30.05  29.76  29.89  67409900   27.26815
2012-02-02   29.9  30.17  29.71  29.95  52223300   27.32289
2012-02-03  30.14   30.4  30.09  30.24  41838500   27.58745
2012-02-06  30.04  30.22  29.97   30.2  28039700   27.55096
2012-02-07  30.15  30.49  30.05  30.35  39242400   27.68781


NOTE<br>
Note that, in the pandas version used here, this did not require the .loc indexer, as pandas first identifies this as a partial date and then looks along the index of the DataFrame instead of a column. This shortcut was deprecated in later releases (and, to our knowledge, removed in pandas 2.0), so the equivalent msft.loc['2012-02'] is the safer spelling.

It is also possible to slice, starting at the beginning of a specific month and ending at a specific day of the month:

In [18]:
# slice starting at the beginning of Feb 2012 and 
# end on Feb 9 2012
msft['2012-02':'2012-02-09'][:5]

             Open   High    Low  Close    Volume  Adj Close
Date
2012-02-01  29.79  30.05  29.76  29.89  67409900   27.26815
2012-02-02   29.9  30.17  29.71  29.95  52223300   27.32289
2012-02-03  30.14   30.4  30.09  30.24  41838500   27.58745
2012-02-06  30.04  30.22  29.97   30.2  28039700   27.55096
2012-02-07  30.15  30.49  30.05  30.35  39242400   27.68781
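
Partial-date selection and slicing can also be written with .loc, which should behave the same way across pandas versions. The following is a minimal sketch on synthetic daily data (not the MSFT data; the column name is hypothetical):

```python
import numpy as np
import pandas as pd

# synthetic daily data covering the first quarter of 2012
idx = pd.date_range("2012-01-01", "2012-03-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

feb = df.loc["2012-02"]                     # every row in February 2012
early_feb = df.loc["2012-02":"2012-02-09"]  # Feb 1 through Feb 9, inclusive
print(len(feb), len(early_feb))
```

Unlike positional slices, both endpoints of a date-string slice are included in the result.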


# Creating time-series with specific frequencies

Time-series data in pandas can also be created to represent intervals of time other than daily frequency. Different frequencies can be generated with pd.date_range() by utilizing the freq parameter. This parameter defaults to a value of D, which represents daily frequency.

To introduce the creation of nondaily frequencies, the following command creates a DatetimeIndex with one-minute intervals using freq='T' (recent pandas releases prefer the equivalent alias 'min'):

In [19]:
# create a time-series with one minute frequency
bymin = pd.Series(np.arange(0, 90*60*24),
                  pd.date_range('2014-08-01', 
                                '2014-10-29 23:59:00',
                                freq='T'))
bymin

2014-08-01 00:00:00         0
2014-08-01 00:01:00         1
2014-08-01 00:02:00         2
2014-08-01 00:03:00         3
2014-08-01 00:04:00         4
                        ...  
2014-10-29 23:55:00    129595
2014-10-29 23:56:00    129596
2014-10-29 23:57:00    129597
2014-10-29 23:58:00    129598
2014-10-29 23:59:00    129599
Freq: T, Length: 129600, dtype: int32

This time-series allows us to use forms of slicing at finer resolution. Earlier, we saw slicing at day and month levels, but now we have a time-series with minute-based data that we can slice down to hours and minutes (and smaller intervals if we use finer frequencies).

In [20]:
# slice at the minute level
bymin['2014-08-01 12:30':'2014-08-01 12:59']

2014-08-01 12:30:00    750
2014-08-01 12:31:00    751
2014-08-01 12:32:00    752
2014-08-01 12:33:00    753
2014-08-01 12:34:00    754
2014-08-01 12:35:00    755
2014-08-01 12:36:00    756
2014-08-01 12:37:00    757
2014-08-01 12:38:00    758
2014-08-01 12:39:00    759
2014-08-01 12:40:00    760
2014-08-01 12:41:00    761
2014-08-01 12:42:00    762
2014-08-01 12:43:00    763
2014-08-01 12:44:00    764
2014-08-01 12:45:00    765
2014-08-01 12:46:00    766
2014-08-01 12:47:00    767
2014-08-01 12:48:00    768
2014-08-01 12:49:00    769
2014-08-01 12:50:00    770
2014-08-01 12:51:00    771
2014-08-01 12:52:00    772
2014-08-01 12:53:00    773
2014-08-01 12:54:00    774
2014-08-01 12:55:00    775
2014-08-01 12:56:00    776
2014-08-01 12:57:00    777
2014-08-01 12:58:00    778
2014-08-01 12:59:00    779
Freq: T, dtype: int32

NOTE <br>
A complete list of the frequency aliases accepted by the freq parameter can be found at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases.
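
As a brief illustration of a few other aliases, the sketch below builds indexes at 15-minute, business-day, and month-start frequencies. The start date is arbitrary, and 'min' is used for minutes because newer pandas releases prefer it over 'T':

```python
import pandas as pd

# a multiple can be put in front of an alias: every 15 minutes
every_15_min = pd.date_range("2014-08-01", periods=4, freq="15min")

# 'B' is business-day frequency, which skips Saturdays and Sundays
business_days = pd.date_range("2014-08-01", periods=3, freq="B")

# 'MS' is month-start frequency
month_starts = pd.date_range("2014-08-01", periods=3, freq="MS")

print(every_15_min, business_days, month_starts, sep="\n")
```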

# Representing intervals of time using periods

It is often necessary to represent not just a specific time or sequence of timestamps, but an interval of time with a start date and an end date (a financial quarter, for example). Such a bounded interval of time is represented in pandas using Period objects.

Period objects consist of a start time and an end time and are created from a start date with a given frequency. The start time is referred to as the anchor of the Period object, and the end time is then calculated from the start date and the period specification.

To demonstrate this, the following command creates a period representing a 1-month period anchored in August 2014:

In [21]:
# create a period representing a start of 
# 2014-08 and for a duration of one month
aug2014 = pd.Period('2014-08', freq='M')
aug2014

Period('2014-08', 'M')

A Period object has start_time and end_time properties that report the derived start and end times of the period:

In [22]:
# pandas determined the following start and end
# for the period
aug2014.start_time, aug2014.end_time

(Timestamp('2014-08-01 00:00:00'), Timestamp('2014-08-31 23:59:59.999999999'))

Since we specified a period that starts using a partial date specification of August 2014, pandas determines the anchor (start_time) as 2014-08-01 00:00:00 and then calculates the end_time property based upon the specified frequency; in this case, calculating 1 month from the start_time anchor and returning the last unit of time prior to this.

Mathematical operations are overloaded on Period objects to calculate another period based upon the value represented in the Period. As an example, the following command creates a new Period from the aug2014 object by adding 1 to it. Since aug2014 has a frequency of 1 month, the result is the start date (2014-08-01) + 1 * 1 month, that is, the period representing September 2014:

In [23]:
# what is the one month period following the given period?
sep2014 = aug2014 + 1
sep2014

Period('2014-09', 'M')

This seems as though pandas has simply added one to the month in the Period object aug2014. However, examining start_time and end_time of sep2014, we can see something interesting:

In [24]:
# the calculated start and end are
sep2014.start_time, sep2014.end_time

(Timestamp('2014-09-01 00:00:00'), Timestamp('2014-09-30 23:59:59.999999999'))

NOTE<br>
The Period object has the ability to know that September has 30 days and not 31. This is the advantage that the Period object has over simple addition. It is not simply adding 30 days (in this example), but one unit of the period. This helps to solve many difficult date management problems.
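
The same calendar awareness applies to leap years. As a small sketch (the choice of February 2012 is ours, picked because 2012 is a leap year):

```python
import pandas as pd

feb2012 = pd.Period("2012-02", freq="M")  # 2012 is a leap year
mar2012 = feb2012 + 1                     # one period (month) later

print(feb2012.days_in_month, feb2012.end_time.day)
print(mar2012, mar2012.start_time)
```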

Period objects are useful when combined into a collection referred to as a PeriodIndex. The following command creates a pandas PeriodIndex consisting of 1-month intervals for the year of 2013:

In [25]:
# create a pandas PeriodIndex
mp2013 = pd.period_range('1/1/2013', '12/31/2013', freq='M')
mp2013

PeriodIndex(['2013-01', '2013-02', '2013-03', '2013-04', '2013-05', '2013-06',
             '2013-07', '2013-08', '2013-09', '2013-10', '2013-11', '2013-12'],
            dtype='period[M]', freq='M')

A PeriodIndex differs from a DatetimeIndex in that in a PeriodIndex, the index labels are Period objects:

In [26]:
# dump all the calculated periods
for p in mp2013: 
    print ("{0} {1} {2} {3}".format(p, 
                                   p.freq, 
                                   p.start_time, 
                                   p.end_time))

2013-01 <MonthEnd> 2013-01-01 00:00:00 2013-01-31 23:59:59.999999999
2013-02 <MonthEnd> 2013-02-01 00:00:00 2013-02-28 23:59:59.999999999
2013-03 <MonthEnd> 2013-03-01 00:00:00 2013-03-31 23:59:59.999999999
2013-04 <MonthEnd> 2013-04-01 00:00:00 2013-04-30 23:59:59.999999999
2013-05 <MonthEnd> 2013-05-01 00:00:00 2013-05-31 23:59:59.999999999
2013-06 <MonthEnd> 2013-06-01 00:00:00 2013-06-30 23:59:59.999999999
2013-07 <MonthEnd> 2013-07-01 00:00:00 2013-07-31 23:59:59.999999999
2013-08 <MonthEnd> 2013-08-01 00:00:00 2013-08-31 23:59:59.999999999
2013-09 <MonthEnd> 2013-09-01 00:00:00 2013-09-30 23:59:59.999999999
2013-10 <MonthEnd> 2013-10-01 00:00:00 2013-10-31 23:59:59.999999999
2013-11 <MonthEnd> 2013-11-01 00:00:00 2013-11-30 23:59:59.999999999
2013-12 <MonthEnd> 2013-12-01 00:00:00 2013-12-31 23:59:59.999999999


With a PeriodIndex, we can then construct a Series using it as the index:

In [27]:
# and now create a Series using the PeriodIndex
np.random.seed(123456)
ps = pd.Series(np.random.randn(12), mp2013)
ps

2013-01    0.469112
2013-02   -0.282863
2013-03   -1.509059
2013-04   -1.135632
2013-05    1.212112
2013-06   -0.173215
2013-07    0.119209
2013-08   -1.044236
2013-09   -0.861849
2013-10   -2.104569
2013-11   -0.494929
2013-12    1.071804
Freq: M, dtype: float64

We now have a time-series where the value at a specific index label represents a measurement that spans a period of time, such as the average value of a security in a given month, instead of at a specific time. This becomes very useful when we perform resampling of the time-series to another frequency, which we will do a little later in this session.

# Shifting and lagging time-series data

A common operation on time-series data is to shift or "lag" the values back and forward in time, such as to calculate the percentage change from sample to sample. The pandas method for this is .shift(), which shifts the values relative to the index by a specified number of periods.

To demonstrate shifting and lagging, we will use the adjusted close values for MSFT. As a refresher, the following command shows the first five items in that time-series:

In [28]:
# refresh our memory on the data in the MSFT closing prices Series
msftAC[:5]

Date
2012-01-03    24.42183
2012-01-04    24.99657
2012-01-05    25.25201
2012-01-06    25.64429
2012-01-09    25.30675
Name: Adj Close, dtype: float64

The following command shifts the adjusted closing prices forward by one index position (one trading day):

In [29]:
# shift the prices one index position forward
shifted_forward = msftAC.shift(1)
shifted_forward[:5]

Date
2012-01-03         NaN
2012-01-04    24.42183
2012-01-05    24.99657
2012-01-06    25.25201
2012-01-09    25.64429
Name: Adj Close, dtype: float64

Notice that the value of the index label of 2012-01-03 is now NaN. When shifting at the same frequency as that of the index, the shift will result in one or more NaN values being added for the labels at one end of the Series, and a loss of the same number of values at the other end. The number of NaN values is the same as the number of specified periods.

If we examine the tail of both the original and shifted Series, we will see that the last value in the Series was shifted away:

In [30]:
# the last item is also shifted away 
msftAC.tail(5), shifted_forward.tail(5)

(Date
 2012-12-21    25.75040
 2012-12-24    25.38455
 2012-12-26    25.19693
 2012-12-27    25.29074
 2012-12-28    24.90612
 Name: Adj Close, dtype: float64,
 Date
 2012-12-21    25.96616
 2012-12-24    25.75040
 2012-12-26    25.38455
 2012-12-27    25.19693
 2012-12-28    25.29074
 Name: Adj Close, dtype: float64)

The original Series ended with the value 24.90612 at 2012-12-28. In the shifted Series, that label now holds the value from 2012-12-27 (25.29074), and the value that was originally at 2012-12-28 is lost.
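
The mechanics are easier to see on a tiny Series. The following sketch uses three made-up values rather than the MSFT data:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.1],
              index=pd.date_range("2012-01-03", periods=3, freq="D"))

# the index stays the same; the values move one position later
lagged = s.shift(1)
print(lagged)
```

The first label receives NaN, and the last original value (12.1) no longer appears anywhere in the result.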

It is also possible to shift values in the opposite direction. The following command demonstrates this by shifting the Series by -2:

In [31]:
# shift backwards 2 index labels
shifted_backwards = msftAC.shift(-2)
shifted_backwards[:5]

Date
2012-01-03    25.25201
2012-01-04    25.64429
2012-01-05    25.30675
2012-01-06    25.39797
2012-01-09    25.28850
Name: Adj Close, dtype: float64

This results in two NaN values at the tail of the resulting Series:

In [32]:
# this has resulted in 2 NaN values at 
# the end of the resulting Series
shifted_backwards.tail(5)

Date
2012-12-21    25.19693
2012-12-24    25.29074
2012-12-26    24.90612
2012-12-27         NaN
2012-12-28         NaN
Name: Adj Close, dtype: float64

It is possible to shift by different frequencies using the freq parameter. This will create a time-series with a new index, where the index labels are adjusted by the number of specified units of the given frequency. As an example, the following command shifts the daily time-series forward by one second:

In [33]:
# shift by a different frequency does not realign
# and ends up essentially changing the index labels by
# the specific amount of time
msftAC.shift(1, freq="S")

Date
2012-01-03 00:00:01    24.42183
2012-01-04 00:00:01    24.99657
2012-01-05 00:00:01    25.25201
2012-01-06 00:00:01    25.64429
2012-01-09 00:00:01    25.30675
                         ...   
2012-12-21 00:00:01    25.75040
2012-12-24 00:00:01    25.38455
2012-12-26 00:00:01    25.19693
2012-12-27 00:00:01    25.29074
2012-12-28 00:00:01    24.90612
Name: Adj Close, Length: 249, dtype: float64

The resulting DataFrame or Series is essentially the same as the original, with the specified number of units of frequency added to each index label. No data will be shifted out or replaced with NaN as this is not performing realignment.

An alternate form of shifting is provided by pandas using the .tshift() method. Rather than changing the alignment of the data, .tshift() simply results in a new Series or DataFrame, where the values of the index labels are changed by the specified number of offsets of the value of the freq parameter. Note that .tshift() was deprecated in later pandas releases (to our knowledge in 1.1, with removal in 2.0) in favor of .shift() with the freq parameter, which produces the same result. This is demonstrated by the following command, which modifies the index labels by 1 day:

In [34]:
# resulting Series has one day added to all index labels
msftAC.tshift(1, freq="D")

Date
2012-01-04    24.42183
2012-01-05    24.99657
2012-01-06    25.25201
2012-01-07    25.64429
2012-01-10    25.30675
                ...   
2012-12-22    25.75040
2012-12-25    25.38455
2012-12-27    25.19693
2012-12-28    25.29074
2012-12-29    24.90612
Name: Adj Close, Length: 249, dtype: float64
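
In pandas releases where .tshift() is not available, the same relabeling can be obtained with .shift() and the freq parameter. This is a minimal sketch on made-up values, not the MSFT data:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.1],
              index=pd.date_range("2012-01-03", periods=3, freq="D"))

# with freq given, the index labels move and the values stay attached
moved = s.shift(1, freq="D")
print(moved)
```

Because only the labels change, no values are lost and no NaN values are introduced.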

A practical application of shifting is the calculation of daily percentage changes from the previous day. The following command calculates the day-to-day percentage change in the adjusted closing price for MSFT:

In [35]:
# calculate the percentage change in closing price
msftAC / msftAC.shift(1) - 1

Date
2012-01-03         NaN
2012-01-04    0.023534
2012-01-05    0.010219
2012-01-06    0.015535
2012-01-09   -0.013162
                ...   
2012-12-21   -0.008309
2012-12-24   -0.014208
2012-12-26   -0.007391
2012-12-27    0.003723
2012-12-28   -0.015208
Name: Adj Close, Length: 249, dtype: float64
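
pandas also offers a .pct_change() method that performs this calculation directly. The sketch below, on made-up prices, checks that it agrees with the shift-based formula:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.96],
                   index=pd.date_range("2012-01-03", periods=3, freq="D"))

manual = prices / prices.shift(1) - 1   # the shift-based formula
builtin = prices.pct_change()           # the built-in equivalent

# compare everything after the leading NaN
print(np.allclose(manual.iloc[1:], builtin.iloc[1:]))
```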

# Frequency conversion of time-series data

The frequency of the data in a time-series can be converted in pandas using the .asfreq() method of a Series or DataFrame. To demonstrate, we will use the following small subset of the MSFT stock closing values:

In [36]:
# take a two item sample of the msftAC data for demonstrations
sample = msftAC[:2]
sample

Date
2012-01-03    24.42183
2012-01-04    24.99657
Name: Adj Close, dtype: float64

We have extracted the first 2 days of adjusted close values. Let's suppose we want to resample this to have hourly sampling of data in-between the index labels. We can do this with the following command:

In [37]:
# demonstrate resampling to hour intervals
# realignment causes many NaN's
sample.asfreq("H")

Date
2012-01-03 00:00:00    24.42183
2012-01-03 01:00:00         NaN
2012-01-03 02:00:00         NaN
2012-01-03 03:00:00         NaN
2012-01-03 04:00:00         NaN
2012-01-03 05:00:00         NaN
2012-01-03 06:00:00         NaN
2012-01-03 07:00:00         NaN
2012-01-03 08:00:00         NaN
2012-01-03 09:00:00         NaN
2012-01-03 10:00:00         NaN
2012-01-03 11:00:00         NaN
2012-01-03 12:00:00         NaN
2012-01-03 13:00:00         NaN
2012-01-03 14:00:00         NaN
2012-01-03 15:00:00         NaN
2012-01-03 16:00:00         NaN
2012-01-03 17:00:00         NaN
2012-01-03 18:00:00         NaN
2012-01-03 19:00:00         NaN
2012-01-03 20:00:00         NaN
2012-01-03 21:00:00         NaN
2012-01-03 22:00:00         NaN
2012-01-03 23:00:00         NaN
2012-01-04 00:00:00    24.99657
Freq: H, Name: Adj Close, dtype: float64

A new index with hourly index labels has been created by pandas, but when aligning to the original time-series, only two values were found, thereby leaving the others filled with NaN.

We can change this default behavior using the method parameter of the .asfreq() method. One method is pad or ffill that will fill with the last known value:

In [38]:
# fill NaN's with the last known non-NaN value
sample.asfreq("H", method="ffill")

Date
2012-01-03 00:00:00    24.42183
2012-01-03 01:00:00    24.42183
2012-01-03 02:00:00    24.42183
2012-01-03 03:00:00    24.42183
2012-01-03 04:00:00    24.42183
2012-01-03 05:00:00    24.42183
2012-01-03 06:00:00    24.42183
2012-01-03 07:00:00    24.42183
2012-01-03 08:00:00    24.42183
2012-01-03 09:00:00    24.42183
2012-01-03 10:00:00    24.42183
2012-01-03 11:00:00    24.42183
2012-01-03 12:00:00    24.42183
2012-01-03 13:00:00    24.42183
2012-01-03 14:00:00    24.42183
2012-01-03 15:00:00    24.42183
2012-01-03 16:00:00    24.42183
2012-01-03 17:00:00    24.42183
2012-01-03 18:00:00    24.42183
2012-01-03 19:00:00    24.42183
2012-01-03 20:00:00    24.42183
2012-01-03 21:00:00    24.42183
2012-01-03 22:00:00    24.42183
2012-01-03 23:00:00    24.42183
2012-01-04 00:00:00    24.99657
Freq: H, Name: Adj Close, dtype: float64

The other method is to use backfill/bfill, which will use the next known value:

In [39]:
# fill with the "next known" value
sample.asfreq("H", method="bfill")

Date
2012-01-03 00:00:00    24.42183
2012-01-03 01:00:00    24.99657
2012-01-03 02:00:00    24.99657
2012-01-03 03:00:00    24.99657
2012-01-03 04:00:00    24.99657
2012-01-03 05:00:00    24.99657
2012-01-03 06:00:00    24.99657
2012-01-03 07:00:00    24.99657
2012-01-03 08:00:00    24.99657
2012-01-03 09:00:00    24.99657
2012-01-03 10:00:00    24.99657
2012-01-03 11:00:00    24.99657
2012-01-03 12:00:00    24.99657
2012-01-03 13:00:00    24.99657
2012-01-03 14:00:00    24.99657
2012-01-03 15:00:00    24.99657
2012-01-03 16:00:00    24.99657
2012-01-03 17:00:00    24.99657
2012-01-03 18:00:00    24.99657
2012-01-03 19:00:00    24.99657
2012-01-03 20:00:00    24.99657
2012-01-03 21:00:00    24.99657
2012-01-03 22:00:00    24.99657
2012-01-03 23:00:00    24.99657
2012-01-04 00:00:00    24.99657
Freq: H, Name: Adj Close, dtype: float64
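
The three behaviors can be compared side by side on a two-point Series. This sketch uses made-up values and a 12-hour offset object (pd.offsets.Hour(12)) instead of a string alias, which is our choice to keep the output short:

```python
import pandas as pd

s = pd.Series([1.0, 2.0],
              index=pd.to_datetime(["2012-01-03", "2012-01-04"]))
half_day = pd.offsets.Hour(12)

plain = s.asfreq(half_day)                   # new label gets NaN
padded = s.asfreq(half_day, method="ffill")  # new label takes the previous value
back = s.asfreq(half_day, method="bfill")    # new label takes the next value
print(plain, padded, back, sep="\n")
```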

# Upsampling and downsampling of time-series data

Frequency conversion provides basic conversion of data using the new frequency intervals and allows the filling of missing data using either NaN, forward filling, or backward filling. More elaborate control is provided through the process of resampling.

Resampling can be either downsampling, where data is converted to wider frequency ranges (such as downsampling from day-to-day to month-to-month) or upsampling, where data is converted to narrower time ranges. Data for the associated labels are then calculated by a function provided to pandas instead of simple filling.

To demonstrate downsampling, we will calculate the daily cumulative returns for the MSFT stock over 2012 and resample them to monthly frequency. We will examine the return calculation in more detail later on, in Time-series Stock Data, but for now, we will use it as a demonstration of the mechanics of up and down resampling of time-series data.

The cumulative daily return for MSFT can be calculated with the following command using .shift() and application of the .cumprod() method, as shown here:

In [40]:
# calculate the cumulative daily returns for MSFT
msft_cum_ret = (1 + (msftAC / msftAC.shift() - 1)).cumprod()
msft_cum_ret

Date
2012-01-03         NaN
2012-01-04    1.023534
2012-01-05    1.033993
2012-01-06    1.050056
2012-01-09    1.036235
                ...   
2012-12-21    1.054401
2012-12-24    1.039420
2012-12-26    1.031738
2012-12-27    1.035579
2012-12-28    1.019830
Name: Adj Close, Length: 249, dtype: float64

A time-series can be resampled using the .resample() method. This method provides a very flexible means to specify the frequency conversion involved in the resampling, as well as the means by which the resampled values are selected or calculated.

The following command downsamples the daily cumulative returns from day-to-day to month-to-month:

In [41]:
# resample to a monthly cumulative return
msft_monthly_cum_ret = msft_cum_ret.resample("M")
msft_monthly_cum_ret

<pandas.core.resample.DatetimeIndexResampler object at 0x000001D415A8AC48>

As the resample period is specified as monthly, pandas will break the index labels into monthly intervals bounded on calendar months, and the new index label for each group will be the month's end date. In the pandas version used here, .resample() returns a lazy Resampler object rather than a new Series, so no values are calculated until an aggregation such as .mean() is applied. For later comparison, the mean for January 2012 can be calculated directly with the following command:

In [42]:
# verify the monthly average for 2012-01
msft_cum_ret['2012-01'].mean()

1.0686746674033671

How the value for each index label is calculated is controlled by the aggregation method called on the Resampler object (older pandas versions used a how parameter that defaulted to the mean). Calling .mean() reproduces the January value calculated above:

In [43]:
# resample to monthly frequency using the mean of each month
msft_cum_ret.resample("M").mean()

Date
2012-01-31    1.068675
2012-02-29    1.155697
2012-03-31    1.210570
2012-04-30    1.184644
2012-05-31    1.140516
2012-06-30    1.123344
2012-07-31    1.126245
2012-08-31    1.153629
2012-09-30    1.172756
2012-10-31    1.107382
2012-11-30    1.064822
2012-12-31    1.036327
Freq: M, Name: Adj Close, dtype: float64
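
As a minimal sketch of how the aggregation is selected, the following example applies several methods to one Resampler object. The data is hypothetical and spans the last two days of January and the first two days of February. The frequency is passed as a pd.offsets.MonthEnd() object rather than the "M" string, on the assumption that this avoids differences in alias spelling between pandas versions.

```python
import pandas as pd

# hypothetical daily values: Jan 30, Jan 31, Feb 1, Feb 2 (not the MSFT data)
idx = pd.date_range("2012-01-30", periods=4, freq="D")
values = pd.Series([1.0, 3.0, 10.0, 20.0], index=idx)

# one Resampler object can be reused for several aggregations
monthly = values.resample(pd.offsets.MonthEnd())

monthly_mean = monthly.mean()               # average of each month
monthly_last = monthly.last()               # final observation of each month
monthly_both = monthly.agg(["min", "max"])  # several aggregations at once

print(monthly_mean)
print(monthly_last)
print(monthly_both)
```

Each result is labeled with the month-end date, as in the MSFT output above.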

We can also call .ohlc() on the Resampler object, which will give us a summary of the open, high, low, and close values during each sampling period. For each resampling period (monthly in this example), pandas will return the first value in the period (open), the maximum value (high), the lowest value (low), and the final value in the period (close):

In [44]:
# resample to monthly and give us open, high, low, close
msft_cum_ret.resample("M").ohlc()[:5]

                open      high       low     close
Date                                              
2012-01-31  1.023534  1.110572  1.023534  1.103100
2012-02-29  1.116548  1.198349  1.116548  1.193461
2012-03-31  1.214142  1.235198  1.186693  1.213014
2012-04-30  1.214142  1.219030  1.141195  1.203990
2012-05-31  1.203613  1.203613  1.099860  1.104780
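
A small sketch on hypothetical data shows how each of the four columns is derived. The values are chosen so that the first, maximum, minimum, and last observations of January all differ or coincide in an obvious way.

```python
import pandas as pd

# hypothetical daily values: Jan 29, Jan 30, Jan 31, Feb 1, Feb 2
idx = pd.date_range("2012-01-29", periods=5, freq="D")
values = pd.Series([2.0, 5.0, 1.0, 4.0, 3.0], index=idx)

# open = first value, high = max, low = min, close = last value of each month
bars = values.resample(pd.offsets.MonthEnd()).ohlc()
print(bars)
```

For January the values 2.0, 5.0, 1.0 give open 2.0, high 5.0, low 1.0, and close 1.0.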


The type of index resulting from a resampling is controlled by the kind parameter, which can be set to timestamp (the default) or period. In the resampling examples up to this point, the result has been indexed by Timestamp objects, specifically the last day of each month. The following command returns an index based on periods instead of timestamps, which can be quite useful if we need the start and end timestamps of each sample. Note that newer pandas releases deprecate kind in favor of converting the index explicitly:

In [45]:
# this will return an index with periods instead of timestamps
by_periods = msft_cum_ret.resample("M", kind="period").mean()
for i in by_periods.index[:5]:
    print("{0}:{1} {2}".format(i.start_time,
                               i.end_time,
                               by_periods[i]))

2012-01-01 00:00:00:2012-01-31 23:59:59.999999999 1.0686746674033671
2012-02-01 00:00:00:2012-02-29 23:59:59.999999999 1.155697443639563
2012-03-01 00:00:00:2012-03-31 23:59:59.999999999 1.2105695638250318
2012-04-01 00:00:00:2012-04-30 23:59:59.999999999 1.1846436159779994
2012-05-01 00:00:00:2012-05-31 23:59:59.999999999 1.1405159943899656
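
As an alternative that does not rely on the kind parameter, here is a minimal sketch that resamples to timestamps first and then converts the resulting index with .to_period(). The data is hypothetical, and the approach is one reasonable option rather than the only way to obtain a period index.

```python
import pandas as pd

# hypothetical daily values: Jan 30, Jan 31, Feb 1, Feb 2 (not the MSFT data)
idx = pd.date_range("2012-01-30", periods=4, freq="D")
values = pd.Series([1.0, 3.0, 10.0, 20.0], index=idx)

# resample to month-end timestamps, then convert the index to monthly periods
by_timestamp = values.resample(pd.offsets.MonthEnd()).mean()
by_period = by_timestamp.to_period("M")

# each Period exposes the first and last instant it covers
for period, value in by_period.items():
    print(period.start_time, period.end_time, value)
```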


To demonstrate upsampling, we will use the second and third values of MSFT's cumulative return series (the first value is NaN):

In [46]:
# upsampling will be demonstrated using the second
# and third values (first is NaN)
sample = msft_cum_ret[1:3]
sample

Date
2012-01-04    1.023534
2012-01-05    1.033993
Name: Adj Close, dtype: float64

Our upsampling example resamples this data to an hourly interval:

In [47]:
# upsampling this will have a lot of NaN's
by_hour = sample.resample("H")
by_hour

<pandas.core.resample.DatetimeIndexResampler object at 0x000001D415A00508>

As before, .resample() returns a Resampler object rather than values. Once it is materialized (for example with .asfreq()), pandas creates hourly index labels, but alignment carries only the two original values into the new time-series and fills the others with NaN. This is inherent to upsampling, because the result calls for information that the source does not contain. By default, pandas uses NaN but provides other methods to fill in values.

As with frequency conversion, the new index labels can be forward filled or back filled, here by calling .ffill() or .bfill() on the Resampler object (older pandas releases used a fill_method parameter for this). Another option is to interpolate the missing data with the Resampler's .interpolate() method, which performs a linear interpolation by default:

In [48]:
by_hour.interpolate()

Date
2012-01-04 00:00:00    1.023534
2012-01-04 01:00:00    1.023970
2012-01-04 02:00:00    1.024405
2012-01-04 03:00:00    1.024841
2012-01-04 04:00:00    1.025277
2012-01-04 05:00:00    1.025713
2012-01-04 06:00:00    1.026149
2012-01-04 07:00:00    1.026585
2012-01-04 08:00:00    1.027020
2012-01-04 09:00:00    1.027456
2012-01-04 10:00:00    1.027892
2012-01-04 11:00:00    1.028328
2012-01-04 12:00:00    1.028764
2012-01-04 13:00:00    1.029199
2012-01-04 14:00:00    1.029635
2012-01-04 15:00:00    1.030071
2012-01-04 16:00:00    1.030507
2012-01-04 17:00:00    1.030943
2012-01-04 18:00:00    1.031378
2012-01-04 19:00:00    1.031814
2012-01-04 20:00:00    1.032250
2012-01-04 21:00:00    1.032686
2012-01-04 22:00:00    1.033122
2012-01-04 23:00:00    1.033558
2012-01-05 00:00:00    1.033993
Freq: H, Name: Adj Close, dtype: float64
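
To compare the fill options side by side, here is a minimal sketch that upsamples a hypothetical two-value series to 6-hour intervals, chosen so that the output stays short. The frequency is passed as a pd.offsets.Hour(6) object rather than an alias string, on the assumption that this avoids differences in alias spelling between pandas versions.

```python
import pandas as pd

# hypothetical two-day series (not the MSFT data)
idx = pd.to_datetime(["2012-01-04", "2012-01-05"])
sample = pd.Series([1.0, 2.0], index=idx)

# upsample from daily to 6-hour intervals
by_6h = sample.resample(pd.offsets.Hour(6))

raw = by_6h.asfreq()          # new labels with no fill: gaps are NaN
ffilled = by_6h.ffill()       # carry the last known value forward
bfilled = by_6h.bfill()       # pull the next known value backward
linear = by_6h.interpolate()  # linear interpolation between known values

print(pd.DataFrame({"raw": raw, "ffill": ffilled, "bfill": bfilled, "linear": linear}))
```

Forward filling repeats 1.0 until the next observation, back filling repeats 2.0 before it, and interpolation steps evenly from 1.0 to 2.0.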