# Selecting Time Series Data

Broadly speaking, time series data are data points gathered over time (datetimes, in pandas terminology). The time order is meaningful, and there is typically only one observation per unit of time, so each timestamp often uniquely identifies a record. Often, the data points are evenly spaced in time.

Examples of time series data include stock market closing prices, levels of CO2 in the atmosphere, unemployment rates, and airplane altitude. pandas provides extensive functionality for analyzing time series data, including aggregating over different time periods and sampling at different frequencies. Let's begin by reading in 20 years of stock market data, putting the `date` column in the index.

In [1]:
import pandas as pd
df = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], 
                 index_col='date')
df.head(3)

| date       | MSFT  | AAPL | SLB   | AMZN  | TSLA | XOM   | WMT   | T     | FB  | V   |
|:-----------|:------|:-----|:------|:------|:-----|:------|:------|:------|:----|:----|
| 1999-10-25 | 29.84 | 2.32 | 17.02 | 82.75 | NaN  | 21.45 | 38.99 | 16.78 | NaN | NaN |
| 1999-10-26 | 29.82 | 2.34 | 16.65 | 81.25 | NaN  | 20.89 | 37.11 | 17.28 | NaN | NaN |
| 1999-10-27 | 29.33 | 2.38 | 16.52 | 75.94 | NaN  | 20.8  | 36.94 | 18.27 | NaN | NaN |


## Set the datetime column as the index

If you do have time series data where the values of one datetime column uniquely identify each row, then it's best to use this column as the index. pandas provides extra functionality to DataFrames that have a datetime index.

### DatetimeIndex

Setting a datetime column as the index creates a DatetimeIndex.

In [2]:
idx = df.index
type(idx)

pandas.core.indexes.datetimes.DatetimeIndex
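
The same behavior can be checked without the stocks file. The sketch below uses a small made-up DataFrame (the `toy` name and its values are illustrative assumptions, not part of the dataset) and moves a datetime column into the index with `set_index`.

```python
import pandas as pd

# Hypothetical toy data standing in for the stock prices file
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']),
    'price': [10.0, 10.5, 10.2],
})

# Moving a datetime64 column into the index yields a DatetimeIndex
toy_indexed = toy.set_index('date')
index_type = type(toy_indexed.index).__name__
print(index_type)
```

If the column held plain strings instead of datetimes, the resulting index would be an ordinary object index, which is why `parse_dates` is used when reading the file.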

Like other index objects, items may be selected with slice notation.

In [3]:
idx[:5]

DatetimeIndex(['1999-10-25', '1999-10-26', '1999-10-27', '1999-10-28',
               '1999-10-29'],
              dtype='datetime64[ns]', name='date', freq=None)

You can access datetime attributes and methods directly on DatetimeIndex objects, just as you can through the `dt` accessor on a datetime Series. Let's get the year, month, and day name directly from this index object. The first five values of each are returned.

In [4]:
idx.year[:5]

Int64Index([1999, 1999, 1999, 1999, 1999], dtype='int64', name='date')

In [5]:
idx.month[:5]

Int64Index([10, 10, 10, 10, 10], dtype='int64', name='date')

In [6]:
idx.day_name()[:5]

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], dtype='object', name='date')
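
As a rough illustration of the parallel with the `dt` accessor, the sketch below builds the same five dates as both an index and a Series. The variable names are assumptions made for this example.

```python
import pandas as pd

# Five consecutive days starting on Monday, October 25, 1999
as_index = pd.date_range('1999-10-25', periods=5, freq='D')
as_series = pd.Series(as_index)

# On an index, attributes are accessed directly
index_years = list(as_index.year)
index_days = list(as_index.day_name())

# On a Series, the same attributes live behind the dt accessor
series_years = list(as_series.dt.year)
series_days = list(as_series.dt.day_name())

print(index_days)
print(index_years == series_years, index_days == series_days)
```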

## Easy subset selection with a DatetimeIndex

One big advantage of a DatetimeIndex is the ability to select subsets of data without using boolean indexing. We can use strings to represent specific datetimes and pass those strings to the `loc` indexer. Here, we select the row of data for January 5th, 2017.

In [7]:
df.loc['2017-1-5']

MSFT     59.23
AAPL    111.73
SLB      76.93
AMZN    780.45
TSLA    226.75
XOM      79.11
WMT      64.88
T        36.08
FB      120.67
V        79.61
Name: 2017-01-05 00:00:00, dtype: float64

Note that we did not have to convert the string to a datetime object first. pandas implicitly understood that the string was a datetime.
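
To make the implicit conversion concrete, here is a minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration). It compares selection with a string against selection with an explicit `pd.Timestamp`.

```python
import pandas as pd

# Hypothetical daily data: January 2 through January 6, 2017
idx = pd.date_range('2017-01-02', periods=5, freq='D')
toy = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# A date string and an explicit Timestamp select the same row
by_string = toy.loc['2017-1-5']
by_timestamp = toy.loc[pd.Timestamp('2017-01-05')]

print(by_string.equals(by_timestamp))
```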

### Partial string matching to select entire periods of time

You can select entire periods of time by using a string with less precision. Here, we select all of the rows from the month of February, 2017.

In [None]:
df.loc['2017-2'].head(3)

Below, we select the entire year 2016.

In [None]:
df.loc['2016'].head(3)
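
The effect of partial string matching is easy to verify on synthetic data where the number of rows per period is known in advance. The sketch below assumes a made-up daily DataFrame named `toy` covering 2016 and 2017.

```python
import pandas as pd

# Hypothetical data with one row per calendar day in 2016 and 2017
idx = pd.date_range('2016-01-01', '2017-12-31', freq='D')
toy = pd.DataFrame({'value': range(len(idx))}, index=idx)

# A year-month string selects every row in that month
feb_2017 = toy.loc['2017-02']

# A year string selects every row in that year
year_2016 = toy.loc['2016']

n_feb, n_2016 = len(feb_2017), len(year_2016)
print(n_feb, n_2016)  # 28 days in February 2017, 366 days in leap year 2016
```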

### Slicing with partial string matching

Use slice notation to select a specific date range. Below, we select from March 28, 2017 through April 3, 2017. Note that the stop value is inclusive.

In [None]:
df.loc['2017-3-28':'2017-4-3']
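
The inclusive stop value differs from ordinary positional slicing, which excludes the stop. A small sketch on made-up daily data (the `toy` name is an assumption) shows both behaviors side by side.

```python
import pandas as pd

# Hypothetical daily data for March and April 2017
idx = pd.date_range('2017-03-01', '2017-04-30', freq='D')
toy = pd.DataFrame({'value': range(len(idx))}, index=idx)

# Label-based slicing with date strings includes the stop date
label_slice = toy.loc['2017-03-28':'2017-04-03']

# Integer-position slicing excludes the stop position
position_slice = toy.iloc[0:3]

print(len(label_slice), label_slice.index[-1])
print(len(position_slice))
```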

### Selecting date ranges along with specific columns

The `loc` indexer allows you to select specific columns along with ranges of dates. Here, we select the month of May, 2017 along with three specific columns.

In [None]:
df.loc['2017-5', ['SLB', 'T', 'FB']].head()
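
The same pattern can be sketched on made-up data. The column names `a`, `b`, and `c` below are placeholders, not columns from the stocks file.

```python
import pandas as pd

# Hypothetical daily data with three placeholder columns
idx = pd.date_range('2017-04-01', '2017-06-30', freq='D')
toy = pd.DataFrame({'a': 1.0, 'b': 2.0, 'c': 3.0}, index=idx)

# Row selection by partial date string, column selection by a list of names
may_subset = toy.loc['2017-05', ['a', 'c']]

print(may_subset.shape)  # 31 days in May, 2 selected columns
```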

## Selecting rows at specific frequencies

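In addition to selecting consecutive rows, it is possible to select disjoint rows at specific frequencies of time. The `asfreq` method allows you to select very specific intervals by passing it an **offset alias** as a string. An offset alias determines the frequency of the time series data you would like to sample. The table below shows the most common offset aliases. See the full list of [offset aliases in the official documentation][1]. Note that pandas 2.2 and later rename several of these aliases (for example, `ME`, `QE`, and `YE` for month, quarter, and year end, and lowercase `h`, `min`, and `s`), so check the documentation for your installed version.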

| Alias    | Description     |  Alias  |  Description  |
|:---------|:----------------|:--------|:--------------|
| `Y`/`A`        | year end        | `D`       | day           |
| `YS`/`AS`       | year start      | `H`        | hourly       |
| `Q`        | quarter end     | `T` or `min`   | minutes      |
| `QS`       | quarter start   | `S`        | seconds      |
| `M`        | month end     | `L` or `ms`    | milliseconds |
| `MS`       | month start       | `U` or `us`    | microseconds |
| `W`        | weekly          | `N`        | nanoseconds  |

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Let's say we are interested in selecting the last day of each year. To do so, we choose `'Y'` for the year end frequency and pass it as a string to the `asfreq` method, which returns the very last day of each year. Note that `asfreq` requires a datetime-like index such as a DatetimeIndex.

In [None]:
df.asfreq('Y').head(8)

### Business offset aliases

In this case, selecting the very last day isn't quite what we want because the stock market is only open on weekdays, and December 31st falls on a weekend in some years. The `asfreq` method returns one row for each frequency regardless of whether there is data for that date. All values for rows that do not appear in the original DataFrame will be missing.

Most of the offset aliases above can be prefixed with the character `'B'` to signify a business offset alias. Business offset aliases only consider the weekdays Monday through Friday. Let's change the offset alias to `'BY'` to signify business year end frequency. Doing so correctly selects the last trading day of each year.

In [None]:
df.asfreq('BY').head(8)
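
The difference between calendar and business year end can be sketched on synthetic weekday-only data. Two assumptions are made here: the made-up `toy` DataFrame ignores market holidays, and the offsets are passed as `pd.offsets` objects rather than string aliases, since alias spellings vary between pandas versions.

```python
import pandas as pd

# Weekdays only, as a stand-in for trading days (holidays are ignored here)
idx = pd.bdate_range('2015-01-01', '2018-12-31')
toy = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Calendar year end: December 31 of 2016 and 2017 fell on a weekend
calendar_year_end = toy.asfreq(pd.offsets.YearEnd())

# Business year end: the last weekday of each year
business_year_end = toy.asfreq(pd.offsets.BYearEnd())

n_missing_calendar = int(calendar_year_end['close'].isna().sum())
n_missing_business = int(business_year_end['close'].isna().sum())
print(n_missing_calendar, n_missing_business)
```

The calendar version contains missing rows for the two weekend dates, while the business version finds a matching row for every year.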

### Anchored offset aliases

Let's say we would like to select every Thursday. We'll need to use a slightly different string called an **anchored offset alias**. You can anchor years and quarters to months, and weeks to days, by placing a dash and the abbreviation of the anchor after the offset alias. For example, `BY-APR` signifies business year frequency ending in April. When anchoring the week, use the three-character abbreviation of the day. Below, we anchor weeks to Thursday. The default anchor for weeks is Sunday.

In [None]:
df.asfreq('W-THU').head()

Select the last day of June of each year by using the `A` offset alias and anchoring it to the three-character abbreviation of the month. At the time of this writing, the `Y` offset alias does not allow for anchoring.

In [None]:
df.asfreq('A-Jun').head()
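
Both kinds of anchoring can be checked on made-up daily data. In this sketch, the weekly anchor uses the `'W-THU'` string, while the June year end is expressed as `pd.offsets.YearEnd(month=6)`, an offset object chosen here as an assumption to sidestep differences in alias spelling between pandas versions.

```python
import pandas as pd

# Hypothetical data with one row per calendar day from 2015 through 2017
idx = pd.date_range('2015-01-01', '2017-12-31', freq='D')
toy = pd.DataFrame({'value': range(len(idx))}, index=idx)

# Weeks anchored to Thursday: every selected row is a Thursday
thursdays = toy.asfreq('W-THU')
thursday_names = set(thursdays.index.day_name())

# Years anchored to June: the last day of June in each year
june_ends = toy.asfreq(pd.offsets.YearEnd(month=6))
june_dates = [str(d.date()) for d in june_ends.index]

print(thursday_names)
print(june_dates)
```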

## Upsampling - Increasing the number of rows

The above selections choose a specific subset of rows. pandas uses the term **downsampling** for selecting a subset of the original data (usually fewer rows than the original). Instead, we may choose to **upsample** and increase the number of rows. This will lead to many rows of missing values. Both upsampling and downsampling ensure that the rows are evenly spaced units of time. Let's return a DataFrame with a single row for each day of the year. This will create rows for all non-trading days (weekends and holidays).

In [None]:
df.asfreq('D').head(7)
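
A small sketch makes the new rows visible. It assumes a made-up two-week series of weekday-only data named `toy` and upsamples it to daily frequency.

```python
import pandas as pd

# Ten weekdays: Monday, January 2 through Friday, January 13, 2017
idx = pd.bdate_range('2017-01-02', '2017-01-13')
toy = pd.DataFrame({'close': [float(i) for i in range(10)]}, index=idx)

# Upsampling to daily frequency adds rows for the weekend in between
daily = toy.asfreq('D')
new_rows = daily[daily['close'].isna()]

print(len(toy), len(daily))
print(list(new_rows.index.day_name()))
```

The two added rows are the Saturday and Sunday between the two weeks, and both hold missing values.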

## Use integers in the offset alias

You can provide more precise offsets by placing an integer in front of the offset alias. These represent a multiple of the offset alias. For example, `'3M'` stands for 3 months and `'15s'` for 15 seconds. To select every 6th Wednesday, we do the following:

In [None]:
df.asfreq('6W-WED').head()
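
To confirm that the integer acts as a multiplier, the sketch below applies `'6W-WED'` to made-up daily data (the `toy` name is an assumption) and inspects the spacing between the selected rows.

```python
import pandas as pd

# Hypothetical data with one row per calendar day in 2017
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
toy = pd.DataFrame({'value': range(len(idx))}, index=idx)

every_sixth_wed = toy.asfreq('6W-WED')

# Gaps between consecutive selected dates, and the weekdays selected
gaps = every_sixth_wed.index[1:] - every_sixth_wed.index[:-1]
weekdays = set(every_sixth_wed.index.day_name())

print(set(gaps.days), weekdays)
```

Every selected row falls on a Wednesday, and consecutive rows are 42 days (six weeks) apart.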

You can also upsample by smaller units than what is present in the index. For instance, `'4H'` will make a new row for every 4 hours of time.

In [None]:
df.asfreq('4H').head(8)

You can fill in the missing values with the previous or next known values using the `method` parameter, which can be set to either `'ffill'` or `'bfill'`. Here, we fill the missing values using the previously known value in the column.

In [None]:
df.asfreq('4H', method='ffill').head(8)
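
The filling behavior can be sketched on three made-up daily values. This example assumes an 8-hour step, expressed as `pd.offsets.Hour(8)` rather than a string alias because the spelling of the hourly alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical daily closing values for three days
idx = pd.date_range('2017-01-02', periods=3, freq='D')
toy = pd.DataFrame({'close': [10.0, 11.0, 12.0]}, index=idx)

# Upsample to one row every 8 hours; new rows are missing by default
every_8h = toy.asfreq(pd.offsets.Hour(8))

# Forward fill carries the last known value into the new rows
filled = toy.asfreq(pd.offsets.Hour(8), method='ffill')

print(int(every_8h['close'].isna().sum()))
print(filled['close'].tolist())
```

Using `method='bfill'` instead would fill each new row with the next known value.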

### No duplicates are allowed and dates must be ordered

Upsampling and downsampling only work when there are no duplicate dates and when the data is ordered. Let's take a look at the employee dataset which has a datetime column, but is not time series data.

In [None]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp = emp.set_index('hire_date')
emp.head(3)

If we try to sample it by year, we get an error.

In [None]:
emp.asfreq('Y')

Let's try to make it more like a time series by sorting the index.

In [None]:
emp = emp.sort_index()
emp.head(3)

The operation will only be successful if there are no duplicate dates. The error tells us that at least one hire date is not unique.

In [None]:
emp.asfreq('M')
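
The duplicate-date failure can be reproduced on a few made-up rows. In this sketch, the `toy` DataFrame and its `salary` values are assumptions, and dropping duplicates is just one possible way to obtain a unique index, not a step taken from the employee analysis.

```python
import pandas as pd

# Hypothetical hire dates: unsorted, and one date appears twice
dates = pd.to_datetime(['2012-03-01', '2010-06-15', '2012-03-01', '2011-01-10'])
toy = pd.DataFrame({'salary': [50, 60, 70, 80]}, index=dates)

# Two quick checks to run before attempting to change the frequency
print(toy.index.is_monotonic_increasing, toy.index.is_unique)

try:
    toy.asfreq(pd.offsets.YearEnd())
    raised = False
except ValueError as err:
    # pandas refuses to reindex an axis that has duplicate labels
    raised = True
    print(err)

# One possible workaround: keep the first row per date, then sort
deduped = toy[~toy.index.duplicated(keep='first')].sort_index()
yearly = deduped.asfreq(pd.offsets.YearEnd())
print(len(yearly))
```

Whether discarding rows is acceptable depends on the data. For records such as employees, aggregating by date is often a better fit than forcing them into a time series.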

Selection with partial strings still works.

In [None]:
emp.loc['2012-1':'2012-2'].head()

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Read in the weather time series dataset and place the date column in the index. Then use this DataFrame for the following questions.</span>

### Exercise 2

<span style="color:green; font-size:16px">Select all of the month of November, 2010.</span>

### Exercise 3

<span style="color:green; font-size:16px">Select all of the second quarter of 2017.</span>

### Exercise 4

<span style="color:green; font-size:16px">Select data from July 1, 2015 to the end of 2016.</span>

### Exercise 5

<span style="color:green; font-size:16px">Select just the rain and snow columns from January 1, 2008 to January 7, 2008.</span>

### Exercise 6

<span style="color:green; font-size:16px">What was the temperature on June 11, 2011?</span>

### Exercise 7

<span  style="color:green; font-size:16px">How many days did it rain during the last three months of 2011?</span>

### Exercise 8

<span style="color:green; font-size:16px">Which year had more snow days, 2007 or 2012?</span>

### Exercise 9

<span style="color:green; font-size:16px">Select every other Thursday.</span>

### Exercise 10

<span style="color:green; font-size:16px">Select the first day of each month.</span>

### Exercise 11

<span style="color:green; font-size:16px">Select every other October 1st.</span>

### Use the temperature dataset for the remaining exercises

Execute the following cell to read in the temperature dataset and set the datetime column as the index.

In [None]:
df_temp = pd.read_csv('../data/weather/temperature.csv', parse_dates=['datetime'], 
                      index_col='datetime')
df_temp.head()

### Exercise 12

<span style="color:green; font-size:16px">Select the temperatures for Houston between 3 and 6 p.m. on July 4, 2014.</span>

### Exercise 13

<span style="color:green; font-size:16px">Upsample the result from the previous exercise so that there are entries every 20 minutes.</span>

### Exercise 14

<span style="color:green; font-size:16px">Linearly interpolate the missing values in the previous exercise to estimate the temperature at 4:40 pm on July 4, 2014.</span>