# 02. Time Series

### Objectives
* Make a web request and retrieve JSON data from the IEX trading API
* Group by time with **`resample`**
* Use offset aliases to determine amount of time
* Create a DatetimeIndex and use it for easier resampling and subset selection
* Use the **`rolling`** method to calculate moving window statistics

## Introduction
Broadly speaking, time series data are points of data gathered over time. Often, the data points are evenly spaced in time. Pandas has strong support for manipulating dates, aggregating over different time periods, sampling different periods of time, and more.

# Stock Market Data
There are many tools available to get stock market data. We will use the [IEX developer platform][1], which has an excellent and easy-to-use API to retrieve market data for free (up to 100 requests per second).

### Using the requests library
The **requests** library, one of the most popular third-party Python libraries, helps retrieve data from another website. You simply pass the URL as a string to the **`get`** function. The returned object stores the data as a string in the **`text`** attribute. The library comes standard with the Anaconda distribution, so you should already have it.

### Using the IEX API
The IEX API is fairly straightforward to use and there are several examples that you can view to understand how it works. The base URL of the API is `https://api.iextrading.com/1.0` which can be [seen here in the docs][2]. If you scroll down from the last link, you will see how the API is used. Each **endpoint** is documented. Let's use the [chart endpoint][3].

We simply append **`/stock/{symbol}/chart/{range}`** to the base URL and put the stock symbol and range of data we want (without the curly braces) to retrieve historical stock price data. Go to the docs to view the available ranges.
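
As a minimal sketch of the substitution described above, the helper below fills the symbol and range into the endpoint template. The function name `build_chart_url` is hypothetical and only for illustration. No request is sent, so it only shows how the final URL string is assembled.

```python
# Base URL quoted in the text above
BASE_URL = 'https://api.iextrading.com/1.0'

def build_chart_url(symbol, date_range):
    """Return the chart endpoint URL for a symbol and range (hypothetical helper)."""
    # The curly-brace placeholders are replaced by the actual values
    return f'{BASE_URL}/stock/{symbol}/chart/{date_range}'

amzn_url = build_chart_url('AMZN', '5y')
print(amzn_url)
```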

Let's retrieve the last 5 years of Amazon data (symbol AMZN) by passing our endpoint to the **`get`** function. A requests **`Response`** object is returned.

[1]: https://iextrading.com/developer/
[2]: https://iextrading.com/developer/docs/#endpoints
[3]: https://iextrading.com/developer/docs/#chart

In [None]:
import pandas as pd
import requests

In [None]:
req = requests.get('https://api.iextrading.com/1.0/stock/AMZN/chart/5y')
type(req)

### Output the `text`
The response is captured as a Python string assigned to the **`text`** attribute. Let's print out the first 1,000 characters.

In [None]:
req.text[:1000]

### Reading JSON objects
Most APIs will respond with **JSON** data, a standardized format of data that is very similar to a Python dictionary with key-value pairs. This particular JSON data is returned as a list of dictionaries. We can usually read in an API response with the **`read_json`** pandas function which will attempt to convert the JSON text data to a DataFrame.

In [None]:
amzn = pd.read_json(req.text)
amzn.head()
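
To see what **`read_json`** does with a list of dictionaries without depending on the live API, here is a small sketch with made-up records (the values are not real market data). The text is wrapped in `io.StringIO` so that it is passed as a file-like object, which is an assumption about what recent pandas versions prefer over a raw string.

```python
import io
import pandas as pd

# Made-up records shaped like a list of dictionaries; not real market data
json_text = '''[
    {"date": "2013-07-29", "close": 10.0, "volume": 100},
    {"date": "2013-07-30", "close": 11.5, "volume": 150},
    {"date": "2013-07-31", "close": 11.0, "volume": 120}
]'''

# Each dictionary becomes a row and each key becomes a column
toy = pd.read_json(io.StringIO(json_text))
print(toy)
print(toy.dtypes)
```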

### Verify data types
The **`read_json`** function will try to choose the correct types for us. It's a good idea to verify that Pandas chose the correct data types with the **`dtypes`** attribute. A common occurrence is for a column that looks like it contains numeric data to actually be stored as strings.

Looking below, the data types seem to all be correct, save for **`label`**, which appears to be just a duplicate of the date column. We are good to continue.

In [None]:
amzn.dtypes
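
If a column that should be numeric or a date does come back as strings, one possible fix is to convert it explicitly. The sketch below uses a hypothetical two-row frame, not the Amazon data, to show the conversion with **`to_numeric`** and **`to_datetime`**.

```python
import pandas as pd

# Hypothetical frame where prices and dates arrived as text
raw = pd.DataFrame({'date': ['2013-07-29', '2013-07-30'],
                    'close': ['10.0', '11.5']})
print(raw.dtypes)

# Convert each column to the type it should have
fixed = raw.assign(date=pd.to_datetime(raw['date']),
                   close=pd.to_numeric(raw['close']))
print(fixed.dtypes)
```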

### Drop some columns
Let's drop the **`label`**, **`unadjustedVolume`**, and **`vwap`** columns.

In [None]:
amzn = amzn.drop(columns=['label', 'unadjustedVolume', 'vwap'])
amzn.head()

# Grouping by time
Pandas gives you the ability to group by a period of time. A concrete example can help here.

### Find the average closing price of Amazon for every month
If we are interested in finding the average closing price of Amazon for every month of data we have, then we need to group by month and aggregate the closing price with the mean function.

### Datetime column, amount of time, aggregating column, and aggregating method
This procedure is very similar to how we grouped and aggregated columns in previous notebooks. The only difference is that our **grouping column** will now be a datetime column with an additional specification for the amount of time.

### Use the `resample` method
Instead of the **`groupby`** method, we use a special method for grouping time together called **`resample`**. We must pass the **`resample`** method a time period as a string and the name of the datetime column. The rest of the process is exactly the same as with the **`groupby`** method: we call the **`agg`** method and pass it a dictionary mapping the **aggregating columns** to the **aggregating functions**.

### `resample` syntax
The first parameter we pass to **`resample`** is the amount of time as a **string**. Here, we use **`'M'`** which refers to month. We also must specify which datetime column to use. Here, it is the **`date`** column. We then call **`agg`** as usual.

In [None]:
amzn.resample('M', on='date').agg({'close': 'mean'}).head(10)
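
To make the link with **`groupby`** concrete, the sketch below runs the same monthly mean two ways on a made-up four-row frame. It uses the month-start alias `'MS'` rather than `'M'`, since recent pandas releases appear to be renaming the month-end alias to `'ME'`; the grouping by calendar month is the same either way and only the label differs. The `pd.Grouper` form is shown purely for comparison.

```python
import pandas as pd

# Made-up prices spanning two calendar months
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-30', '2014-01-31', '2014-02-03', '2014-02-04']),
    'close': [10.0, 12.0, 20.0, 24.0],
})

# 'MS' groups by calendar month and labels each group with the month's first day
by_resample = toy.resample('MS', on='date').agg({'close': 'mean'})

# The same grouping written with groupby and a time-aware Grouper
by_groupby = toy.groupby(pd.Grouper(key='date', freq='MS')).agg({'close': 'mean'})

print(by_resample)
```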

### Use any number of aggregation functions
Map the aggregating column to a list of aggregating functions.

In [None]:
amzn.resample('M', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head(10)

## Offset Aliases
All the possible strings that represent amounts of time are called [**offset aliases**](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases). They have been reprinted below:

<table border="1" class="docutils">
<colgroup>
<col width="13%" />
<col width="87%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Alias</th>
<th class="head">Description</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>B</td>
<td>business day frequency</td>
</tr>
<tr class="row-odd"><td>C</td>
<td>custom business day frequency (experimental)</td>
</tr>
<tr class="row-even"><td>D</td>
<td>calendar day frequency</td>
</tr>
<tr class="row-odd"><td>W</td>
<td>weekly frequency</td>
</tr>
<tr class="row-even"><td>M</td>
<td>month end frequency</td>
</tr>
<tr class="row-odd"><td>BM</td>
<td>business month end frequency</td>
</tr>
<tr class="row-even"><td>CBM</td>
<td>custom business month end frequency</td>
</tr>
<tr class="row-odd"><td>MS</td>
<td>month start frequency</td>
</tr>
<tr class="row-even"><td>BMS</td>
<td>business month start frequency</td>
</tr>
<tr class="row-odd"><td>CBMS</td>
<td>custom business month start frequency</td>
</tr>
<tr class="row-even"><td>Q</td>
<td>quarter end frequency</td>
</tr>
<tr class="row-odd"><td>BQ</td>
<td>business quarter end frequency</td>
</tr>
<tr class="row-even"><td>QS</td>
<td>quarter start frequency</td>
</tr>
<tr class="row-odd"><td>BQS</td>
<td>business quarter start frequency</td>
</tr>
<tr class="row-even"><td>A</td>
<td>year end frequency</td>
</tr>
<tr class="row-odd"><td>BA</td>
<td>business year end frequency</td>
</tr>
<tr class="row-even"><td>AS</td>
<td>year start frequency</td>
</tr>
<tr class="row-odd"><td>BAS</td>
<td>business year start frequency</td>
</tr>
<tr class="row-even"><td>BH</td>
<td>business hour frequency</td>
</tr>
<tr class="row-odd"><td>H</td>
<td>hourly frequency</td>
</tr>
<tr class="row-even"><td>T, min</td>
<td>minutely frequency</td>
</tr>
<tr class="row-odd"><td>S</td>
<td>secondly frequency</td>
</tr>
<tr class="row-even"><td>L, ms</td>
<td>milliseconds</td>
</tr>
<tr class="row-odd"><td>U, us</td>
<td>microseconds</td>
</tr>
<tr class="row-even"><td>N</td>
<td>nanoseconds</td>
</tr>
</tbody>
</table>

### Group by Quarter

In [None]:
amzn.resample('Q', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()

### Label with the beginning of the period
Notice how the end date of both the month and quarter are used as index labels for the time periods. We can change the index labels so that they are from the beginning of the period by appending 'S' to our offset alias. This does not affect the amount of time, it only affects the label. 

Notice that using offset alias **`'QS'`** only changed the label to the beginning of the period. The actual data stayed the same.

In [None]:
amzn.resample('QS', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()

# Anchored offsets
By default, when grouping by week, Pandas chooses to end the week on Sunday. We say that it is **anchored** to Sunday. Let's verify this by grouping by week and taking the resulting index label and determining its weekday name.

In [None]:
amzn.resample('W', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()

In [None]:
dt = pd.to_datetime('2013-07-28')
dt.day_name()

### Anchor by a different day
You can anchor the week to any day you choose by appending a dash and then the first three letters of the day of the week. Let's anchor the week to Wednesday. [Anchored offsets][1] are available when grouping by quarter and year as well.

[1]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#anchored-offsets

In [None]:
amzn.resample('W-WED', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()
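
Rather than checking a single label by hand, the anchor day can be verified for every label at once, because **`day_name`** also works on a whole DatetimeIndex. This is a sketch on made-up daily data, not the Amazon prices.

```python
import pandas as pd

# Made-up daily closing prices covering four full weeks
toy = pd.DataFrame({
    'date': pd.date_range('2013-07-29', periods=28, freq='D'),
    'close': range(28),
})

sunday_weeks = toy.resample('W', on='date').agg({'close': 'mean'})
wednesday_weeks = toy.resample('W-WED', on='date').agg({'close': 'mean'})

# Every label should fall on the anchor day
print(sunday_weeks.index.day_name().unique())
print(wednesday_weeks.index.day_name().unique())
```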

### Longer intervals of time with numbers within offset aliases
We can actually add more details to our offset aliases by using a number to specify an amount of that particular offset alias. For instance, **`5M`** will group in 5 month intervals.

In [None]:
amzn.resample('5M', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()

Group by every 22 weeks anchored to Thursday.

In [None]:
amzn.resample('22W-THU', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()
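
The effect of the leading number is easiest to see on a tiny example. The sketch below groups nine made-up daily values into `'3D'` intervals, so each group should hold exactly three rows.

```python
import pandas as pd

# Nine made-up daily values
toy = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=9, freq='D'),
    'close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# '3D' means three calendar days per group
three_day = toy.resample('3D', on='date').agg({'close': ['count', 'mean']})
print(three_day)
```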

# Setting the index as a datetime column
The date column is a good choice of index since it uniquely identifies every row. Let's set it as the index and assign the result to a new variable.

In [None]:
amzn_dt = amzn.set_index('date')
amzn_dt.head()

## Extra functionality with a DatetimeIndex
Whenever you set a datetime column as the index, your DataFrame gains some extra functionality. The type of your index is now a **`DatetimeIndex`**. Let's verify this:

In [None]:
amzn_dt.index

In [None]:
type(amzn_dt.index)

## Easy Subset selection with a DatetimeIndex
One of the benefits is the easy subset selection that comes along with a DatetimeIndex. We can use date **strings** to select particular rows of our DataFrame. Let's select the row for May 7, 2014. 

Since we are selecting rows, it is best to use **`.loc`**.

In [None]:
amzn_dt.loc['2014-5-7']

### Partial date matching
A much better feature is that we can select entire subsets of time with a partial date string. For instance, if we wanted to select the rows of just May 2014, we would do the following:

In [None]:
amzn_dt.loc['2014-05']

We can select entire years this way:

In [None]:
amzn_dt.loc['2014'].head()

### Slicing ranges of dates
It is possible to use slice notation to select an entire range of data. As usual, the endpoints are inclusive.

In [None]:
amzn_dt.loc['2014-6-9':'2014-6-12']
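
The row counts returned by partial strings and slices are easy to check on a small made-up frame. The sketch below uses ten daily rows that straddle the end of April 2014.

```python
import pandas as pd

# Ten made-up daily values from 2014-04-28 through 2014-05-07
toy = pd.DataFrame(
    {'close': range(10)},
    index=pd.date_range('2014-04-28', periods=10, freq='D'),
)

april = toy.loc['2014-04']                     # partial string: all April rows
may = toy.loc['2014-05']                       # partial string: all May rows
window = toy.loc['2014-04-30':'2014-05-02']    # slice keeps both endpoints

print(len(april), len(may), len(window))
```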

## Easier use of `resample` with a DatetimeIndex
If you have a DatetimeIndex, then using the **`resample`** method will be slightly easier. You do not have to specify the date column with the **`on`** parameter as we did above. By default, Pandas will use the **index** to group.

In [None]:
amzn_dt.resample('M').agg({'close':'mean'}).head()
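
As a quick check that the two forms group the same way, the sketch below compares resampling with **`on`** against resampling on a DatetimeIndex, using made-up daily data and weekly groups.

```python
import pandas as pd

# Fourteen made-up daily values
toy = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=14, freq='D'),
    'close': [float(i) for i in range(14)],
})

# Grouping on a column versus grouping on the index
with_on = toy.resample('W', on='date').agg({'close': 'mean'})
with_index = toy.set_index('date').resample('W').agg({'close': 'mean'})

# Compare the aggregated values of the two approaches
print(with_on['close'].tolist() == with_index['close'].tolist())
```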

# Rolling Window Calculations
Often in time series analysis, we would like to calculate a statistic over a continuous rolling window of time. With our stock data, we might want to find the average closing price of the last 5 days. The **`rolling`** method helps accomplish this task.

The **`rolling`** method works very similarly to the **`resample`** method. We pass it the offset alias of the length of our window and then aggregate as usual. The result will always be a DataFrame (or Series) with the same number of elements as the original.

The following takes the mean of the last 5 day period at each date.

In [None]:
amzn_dt.rolling('5D').agg({'close': 'mean'}).head(10)

### Explanation
At each data point, the average of the last 5 days' worth of data is found. Note that this does not mean the window is always going to contain 5 values. It could contain more or fewer, depending on the dates in the time series. For instance, in the example above, the average calculated at **2013-07-31** contains only three data points (itself, **2013-07-30** and **2013-07-29**).

We can include an additional aggregation function, **`count`**, to verify the number of values in each window. We should be able to use **`size`**, but there appears to be a bug in Pandas and it's giving us an error. **`count`** works the same as there are no missing values.

In [None]:
amzn_dt.rolling('5D').agg({'close': ['mean', 'count']}).head(10)
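
The way weekends shrink a 5-day window can be reproduced on made-up data. The sketch below builds eight consecutive business days starting on Monday 2013-07-29 and counts the values in each `'5D'` window. It assumes the default window, which includes the current date and excludes the date exactly five days earlier.

```python
import pandas as pd

# Eight consecutive business days starting Monday 2013-07-29; values are made up
close = pd.Series(range(1, 9),
                  index=pd.bdate_range('2013-07-29', periods=8),
                  dtype='float64')

# Number of observations that fall inside each 5-calendar-day window
counts = close.rolling('5D').count()
print(counts)
```

The count grows to 5 by the first Friday and drops back to 3 on the following Monday, since the weekend contributes no rows.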

## Keep window size the same with an integer
Instead of using an offset alias, you can specify a specific window size with an integer. The following will always use the last 5 trading days (regardless of how many actual days pass) to determine an average.

When using an integer for the window, the **`rolling`** method enforces that there must be that number of values present or else a missing value will be the result. This is what you are seeing below.

In [None]:
amzn_dt.rolling(5).agg({'close': 'mean'}).head(10)
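
The missing values at the start come from the minimum number of observations an integer window requires. As a sketch on made-up prices, the **`min_periods`** parameter of **`rolling`** can relax that requirement so that early windows use whatever data is available.

```python
import pandas as pd

# Six made-up closing prices on consecutive business days
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
                  index=pd.bdate_range('2013-07-29', periods=6))

strict = close.rolling(5).mean()                   # needs 5 values, else NaN
relaxed = close.rolling(5, min_periods=1).mean()   # uses whatever is available

print(pd.DataFrame({'strict': strict, 'relaxed': relaxed}))
```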

# Plotting
Let's plot the min, mean, and max of the closing price over a rolling window of 50 trading days.

In [None]:
rolling_stats = amzn_dt.rolling(50).agg({'close': ['min', 'mean', 'max']})
rolling_stats.head()

Remove missing values:

In [None]:
rolling_stats = rolling_stats.dropna()
rolling_stats.head()

Rename columns:

In [None]:
rolling_stats.columns = ['Min', 'Mean', 'Max']

Import matplotlib and choose a nice style:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use("ggplot")

In [None]:
rolling_stats.plot(figsize=(16, 4), style=['-', '--', '-'], title='AMZN Rolling Windows')

## Resampling and Rolling Windows with a Series - A bit easier
Resampling and rolling window calculations can be done on Series that have DatetimeIndexes. The syntax becomes a bit easier since you don't have to specify an aggregating column. If you are only applying one aggregating function to the group, you can call it directly as a method. With Series **`s`**, the syntax will look like this:

```
>>> s.resample('5D').sum()
```

We select the closing price as a Series below and proceed to call both the **`resample`** and **`rolling`** methods on it.

In [None]:
close = amzn_dt['close']
close.head()

Find the mean over a two month period.

In [None]:
close.resample('2M').mean().head()

Find the rolling mean of the previous 5 trading days.

In [None]:
close.rolling(5).mean().head(10)

Multiple aggregation functions.

In [None]:
close.resample('2M').agg(['min', 'mean', 'max']).head()

The syntax is simpler when selecting the Series first.

# Exercises

In [None]:
import pandas as pd
import requests

## Problem 1
<span  style="color:green; font-size:16px">Read in stock data for Apple (AAPL) for the last 5 years. Set the date as the index and keep just the closing price and the volume columns.</span>

## Problem 2
<span  style="color:green; font-size:16px">In which week did AAPL have the greatest number of its shares traded?</span>

## Problem 3
<span  style="color:green; font-size:16px">With help from the `diff` method, find the quarter containing the greatest number of up days.</span>

## Problem 4
<span  style="color:green; font-size:16px">Find the mean price per year along with the minimum and maximum volume. Have the label for each row be the first day of the year.</span>

## Problem 5
<span  style="color:green; font-size:16px">Execute the cell below exactly as it is to read in the employee dataset. Then use `to_datetime` to convert the hire and job date columns into datetimes.</span>

In [None]:
# execute this as is
emp = pd.read_csv('../data/employee.csv')

## Problem 6
<span  style="color:green; font-size:16px">Execute the cell below exactly as it is to read in the employee dataset. Then use `to_datetime` to convert the hire and job date columns into datetimes.</span>

## Problem 7
<span  style="color:green; font-size:16px">Without putting `hire_date` into the index, find the mean salary based on `hire_date` over 5 year periods. Also return the number of salaries used in the mean calculation for each period.</span>

## Problem 8
<span  style="color:green; font-size:16px">Attempt to take a rolling average on salary using a 30 day time span on hire date. Does the error message make sense?</span>

## Problem 9
<span  style="color:green; font-size:16px">Set hire date as the index and then select the salary column as a Series. Sort the Series by date and drop the missing values. Now select a subset that only has hire dates from 1990 onwards. Then find a 1,000 day rolling average. Finally make a call to the `plot` method. Make sure you inline matplotlib if you did not do it earlier.</span>

## Problem 10
<span  style="color:green; font-size:16px">Can you do problem 9 in one line of code? Chain each method on a separate line. (You should probably not do this in real code as it will be messy, but it is possible)</span>