# Stock Price Predictor - Data Preparation

## Notebook Overview
- [1. Load Modules and Data](#load-data)
- [2. Model Descriptions](#model-descriptions)
    - [2.1 ARIMA Model](#ARIMA-Model)
    - [2.2 DeepAR Model](#DeepAR-Model)
- [3. Multiple Time Series](#multiple-time-series)
- [4. Split Data](#split-data)
- [5. Saving Data](#save-in-json-format)
    - [5.1 ARIMA Model](#save-arima-model)
    - [5.2 DeepAR Model](#save-deepar-model)

## Plan of Action
The aim is to prepare the data for modelling and save it as a `CSV` or `JSON` file. First, any required data processing is carried out. The data is then split into train and test sets. Finally, the split data is saved in the folder called `data`.

> *Note*: ideally, data is split into training and test sets first and then processed **separately**. However, since this is time series data, where the train-test split cannot be made from randomly selected points, processing first should not affect the results.

An `Adj Close` vs `Time` graph is to be created. This will be a line graph with the predicted stock price for the next year (in the end: 2021). In this section, the data will be divided up for processing by each of the two models.

<a id="load-data"></a>
# 1. Load Modules and Data
All the required modules are loaded here, along with the data from the `CSV` files in the `data` directory.

> **Citation for data**: _Yahoo Finance – stock market live, quotes, business & finance news_ (no date). Available at: https://in.finance.yahoo.com/ (Accessed: 2 October 2020).

In [1]:
import pandas as pd
import numpy as np
import os
import json
import pytz
import matplotlib.pyplot as plt
%matplotlib inline

import datetime

pd.set_option('display.max_rows', None)
timezone_str = 'Asia/Kolkata'
localtz = pytz.timezone(timezone_str)

In [2]:
# Load stocks data
stock_names = {'^GSPC': 'S&P 500',
               '^BSESN': 'S&P BSE SENSEX',
               'AAPL': 'Apple Inc.'}

data_dir = 'data'
data = {}

for stock in stock_names.keys():
    data[stock] = pd.read_csv(os.path.join(data_dir, stock + '.csv'),
                              parse_dates=True, index_col=['Date'])
    data[stock] = data[stock].dropna()

In [3]:
data['AAPL'].head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1980-12-12 | 0.128348 | 0.128906 | 0.128348 | 0.128348 | 0.101261 | 469033600.0 |
| 1980-12-15 | 0.12221 | 0.12221 | 0.121652 | 0.121652 | 0.095978 | 175884800.0 |
| 1980-12-16 | 0.113281 | 0.113281 | 0.112723 | 0.112723 | 0.088934 | 105728000.0 |
| 1980-12-17 | 0.115513 | 0.116071 | 0.115513 | 0.115513 | 0.091135 | 86441600.0 |
| 1980-12-18 | 0.118862 | 0.11942 | 0.118862 | 0.118862 | 0.093777 | 73449600.0 |


> **_NOTE: The following sections consider only `AAPL` stock data. Later, similar steps will be carried out on the two indices: `^GSPC` and `^BSESN`._**

<a id="model-descriptions"></a>
# 2. Model Descriptions
In this section, we will discuss model types and how the models take data and process it. There are two models that will be used to make predictions and the results from both will later be compared.

<a id="ARIMA-Model"></a>
## 2.1 ARIMA Model
One of the major difficulties in time series forecasting is that most time series are non-stationary. To get around this obstacle, a time series is usually stationarized. This is done by reducing the effects of trend and seasonality, training a model, and finally converting the predictions back (adding the trend and seasonality again). A more complex method can stationarize a time series more effectively, but it makes converting the results back harder. Although differencing is not the most effective way of stationarizing a time series, it provides an easy way to remove and restore trend and seasonality. Hence, the ARIMA model is chosen for this project.

ARIMA (Auto-Regressive Integrated Moving Average) utilizes differencing to stationarize a non-stationary time series. It takes in three parameters:
1. Number of `auto-regressive terms` (p): the number of past values the current value depends on.
2. Number of `differences` (d): the number of times the series is differenced. With `d=1`, each term is replaced by the difference between two consecutive terms; a higher degree differences the already differenced series again.
3. Number of `moving average terms` (q): the number of past error terms the current value depends on.
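
As a minimal sketch of what differencing does and how it can be undone, the following uses a made-up toy series (not the stock data) and plain pandas operations. The variable names are illustrative only.

```python
import pandas as pd

# Toy series with an upward trend (made-up values, not stock data).
series = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# d=1: difference between consecutive terms; the first term has no predecessor.
diff1 = series.diff().dropna()

# d=2: difference the already differenced series once more.
diff2 = diff1.diff().dropna()

# Undo d=1: cumulative sum of the differences plus the first original value.
restored = diff1.cumsum() + series.iloc[0]

print(diff1.tolist())     # [2.0, 3.0, 4.0, 5.0]
print(diff2.tolist())     # [1.0, 1.0, 1.0]
print(restored.tolist())  # [12.0, 15.0, 19.0, 24.0]
```

The round trip back to the original values is what makes differencing convenient here: predictions made on the differenced scale can be converted back with a cumulative sum.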

Determining optimal values of `p` and `q` is part of modelling ARIMA. Two plots are used to determine these values.
1. Autocorrelation Function (ACF)
2. Partial Autocorrelation Function (PACF)
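
To give a rough idea of what the ACF measures, here is a small sketch that computes the sample autocorrelation with NumPy on synthetic data. The helper `sample_acf` is hypothetical and written only for illustration; it is not the method used in the next notebook, and the PACF is not covered here.

```python
import numpy as np


def sample_acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[lag:] * x[:len(x) - lag]) / denom
                     for lag in range(max_lag + 1)])


# Synthetic data: a random walk is non-stationary, its differences are white noise.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
random_walk = np.cumsum(noise)

acf_walk = sample_acf(random_walk, 5)           # decays very slowly
acf_diff = sample_acf(np.diff(random_walk), 5)  # close to zero after lag 0

print(np.round(acf_walk, 2))
print(np.round(acf_diff, 2))
```

A slowly decaying ACF, as for the random walk, is a typical sign that the series still needs differencing.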

An ARIMA model takes a time series as training data, and this can be a single long series. Therefore, not much data processing is required. The data will be prepared here and the parameters `(p, d, q)` will be determined in the next notebook.

In [4]:
# Adj Close Time series
aapl_ts_adj = data['AAPL']['Adj Close'].copy()
aapl_ts_adj.head()

Date
1980-12-12    0.101261
1980-12-15    0.095978
1980-12-16    0.088934
1980-12-17    0.091135
1980-12-18    0.093777
Name: Adj Close, dtype: float64

> **Data will be created when required (like above) as it can be easily formed from the entire dataset.**

<a id="DeepAR-Model"></a>
## 2.2 DeepAR Model

DeepAR is Amazon SageMaker's supervised learning algorithm that uses a recurrent neural network (RNN) to train on the data provided. The neural network trains on multiple time series of predefined length. It uses a **context length** to predict the adjacent "_prediction window_". A "_training example_" is made up of a context window followed by a prediction window, so its length is the sum of the context and prediction lengths.

DeepAR accepts a `.json` file with training data. The format of the `JSON` file is as follows (*source: AWS website*):
- `start` (str): timestamp of format `YYYY-MM-DD HH:MM:SS`.
- `target` (array of floats): values in the time series.
- `cat` (optional, array of integers): categorical features that encode the groups a time series belongs to.

> **Source**: Amazon Web Services, I. (no date) DeepAR Forecasting Algorithm - Amazon SageMaker. Available at: https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html (Accessed: 29 October 2020).
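
As an illustration of this format, the sketch below builds one record from a made-up three-day series. Only the `start` and `target` fields described above are used; the values and variable names are invented for the example.

```python
import json

import pandas as pd

# Made-up daily series standing in for a slice of `Adj Close`.
toy_series = pd.Series(
    [1.0, 1.5, 2.0],
    index=pd.date_range('2020-01-01', periods=3, freq='D'))

# One time series becomes one JSON object on its own line.
record = {"start": str(toy_series.index[0]),
          "target": toy_series.tolist()}
line = json.dumps(record)

print(line)
# {"start": "2020-01-01 00:00:00", "target": [1.0, 1.5, 2.0]}
```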


Another important point, regarding missing data, has to be taken into consideration. One of DeepAR's updates<sup>1</sup> mentions that the algorithm supports missing data points. Hence, missing values can be included in the `target` array and DeepAR will take care of them.

> <sup>1</sup>Flunkert, V. et al. (2018) Amazon SageMaker DeepAR now supports missing values, categorical and time series features, and generalized frequencies | AWS Machine Learning Blog, Amazon SageMaker, Artificial Intelligence. Available at: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-deepar-now-supports-missing-values-categorical-and-time-series-features-and-generalized-frequencies/ (Accessed: 29 October 2020).
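
If missing values are kept in the `target` array, they need a JSON representation. The sketch below assumes, based on my reading of the DeepAR documentation, that missing values may be written as `null`; this should be verified against the current AWS documentation. Python's `json.dumps` writes a float `nan` as a bare `NaN` token by default, which is not valid JSON, so the values are converted to `None` first. The data is made up.

```python
import json

import numpy as np
import pandas as pd

# Made-up series with a gap, e.g. a day on which the market was closed.
toy = pd.Series(
    [1.0, np.nan, 3.0],
    index=pd.date_range('2020-01-03', periods=3, freq='D'))

# Assumption: DeepAR accepts `null` for missing values in `target`.
target = [None if pd.isna(value) else float(value) for value in toy]
line = json.dumps({"start": str(toy.index[0]), "target": target})

print(line)
# {"start": "2020-01-03 00:00:00", "target": [1.0, null, 3.0]}
```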

In [5]:
# Reindex onto every calendar day so that missing dates become `NaN` rows.
date_index = data['AAPL'].index
idx = pd.date_range(date_index.min(), date_index.max())
aapl_updated = data['AAPL'].copy().reindex(idx)

aapl_updated.tail()

| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2020-09-27 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-09-28 | 115.010002 | 115.32 | 112.779999 | 114.959999 | 114.959999 | 137672400.0 |
| 2020-09-29 | 114.550003 | 115.309998 | 113.57 | 114.089996 | 114.089996 | 99382200.0 |
| 2020-09-30 | 113.790001 | 117.260002 | 113.620003 | 115.809998 | 115.809998 | 142675200.0 |
| 2020-10-01 | 117.639999 | 117.720001 | 115.830002 | 116.790001 | 116.790001 | 116120400.0 |


As can be seen above, all the missing dates have been filled with `NaN` values. Now let's create a function that takes a pandas series and splits it into multiple smaller time series.

<a id="multiple-time-series"></a>
# 3. Create Multiple Time Series - DeepAR
The following function creates a list with different time series.

In [6]:
def create_time_series(data: pd.Series, series_length_years: int, 
                       start_date: datetime.datetime = None,
                       last_date: datetime.datetime = None,
                       equal_series: bool = True):
    """Creates a list of time series of the given length from the data provided.
    Args:
        data (pd.Series): pandas series with all the data, indexed with the timestamp.
        series_length_years (int): length of each series in the list to be created.
        start_date (datetime.datetime, optional): 
            Date the first time series should start from. If None, '2002-01-01' is used.
            Defaults to None.
        last_date (datetime.datetime, optional):
            Date the last time series should end on. If None, last date found in `data` will be used.
            Defaults to None.
        equal_series (bool): if True, all series created will be of equal length. last series created
                             will be removed if it is shorter than the others.
    
    Returns:
        (list of pd.Series): python list containing all the time series created.
    
    """
    # Normalise both boundaries to pandas Timestamps so that comparisons and
    # label-based slicing behave consistently.
    if start_date is None:
        start_date = datetime.datetime(2002, 1, 1)
    if last_date is None:
        last_date = data.index.max()

    start_date = pd.Timestamp(start_date).normalize()
    last_date = pd.Timestamp(last_date).normalize()
    time_series_list = []

    # Slice consecutive, non-overlapping windows of `series_length_years` years.
    while start_date < last_date:
        end_date = start_date + pd.DateOffset(years=series_length_years) - pd.DateOffset(days=1)
        time_series_list.append(data.loc[start_date:end_date])
        start_date = end_date + pd.DateOffset(days=1)
        
    is_last_equal = str(max(time_series_list[0].index.date))[5:10] == str(max(time_series_list[-1].index.date))[5:10]
    
    if equal_series and not is_last_equal:
        time_series_list = time_series_list[:-1]
    
    print(f'Number of series created: {len(time_series_list)}')
    print(f'Last series removed: {not is_last_equal}')
    print(f'Last series end date: {max(time_series_list[-1].index.date)}')
    return time_series_list

In [7]:
ts_list = create_time_series(aapl_updated['Adj Close'], 3)

Number of series created: 6
Last series removed: True
Last series end date: 2019-12-31


<a id="split-data"></a>
# 4. Split Data
The train-test split created for DeepAR could also be used for the ARIMA model. However, the ARIMA model would do better if provided with a larger dataset. While keeping the prediction length the same for both models, the data will be split differently. The ARIMA model will be provided with a **single set** of train and test time series, while the DeepAR model will be trained on multiple time series and its performance will then be tested on the last 10 months of data (Jan 2020 - Oct 2020). This ensures that the two models can be compared fairly.

<a id="split-arima-model"></a>
## 4.1 ARIMA model
The data from years `2002` to `2019` will be split into train and test data. The prediction length is equal to `10 months`. Furthermore, all the missing values in the test time series will be interpolated after the ARIMA model is trained and has made predictions (before metric calculation).
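
The interpolation step mentioned above could look roughly like the following sketch. The series are made up, the `predicted` values are hypothetical, and RMSE is used only as an example metric; the metric actually used is not specified in this notebook.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='D')

# Made-up test values with gaps for non-trading days.
actual = pd.Series([10.0, np.nan, 12.0, np.nan, 14.0], index=idx)

# Hypothetical model output for the same dates.
predicted = pd.Series([10.5, 11.0, 11.5, 13.0, 14.5], index=idx)

# Fill the gaps linearly before comparing against the predictions.
actual_filled = actual.interpolate()

# Example metric (assumption): root mean squared error.
rmse = float(np.sqrt(((predicted - actual_filled) ** 2).mean()))

print(actual_filled.tolist())  # [10.0, 11.0, 12.0, 13.0, 14.0]
print(round(rmse, 4))          # 0.3873
```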

In [8]:
start = datetime.datetime(2002, 1, 1)
end_test = datetime.datetime(2019, 12, 31)
end_train = end_test - pd.DateOffset(months=10)

arima_test = aapl_updated['Adj Close'][start:end_test]
arima_train = aapl_updated['Adj Close'][start:end_train]

print(f'Test ts date range:  {min(arima_test.index.date)} - {max(arima_test.index.date)}\n'
      f'Train ts date range: {min(arima_train.index.date)} - {max(arima_train.index.date)}')

Test ts date range:  2002-01-01 - 2019-12-31
Train ts date range: 2002-01-01 - 2019-02-28
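
The train series ends on 28 February rather than on the last day of a 31-day month because `pd.DateOffset(months=...)` clips the day to the end of the target month. A short sketch of this behaviour, with illustrative variable names:

```python
import pandas as pd

# Ten months before 31 December 2019: February has no day 31,
# so the result is clipped to the last day of that month.
end_train = pd.Timestamp('2019-12-31') - pd.DateOffset(months=10)

# In a leap year the same arithmetic lands on 29 February.
end_train_leap = pd.Timestamp('2004-12-31') - pd.DateOffset(months=10)

print(end_train.date())       # 2019-02-28
print(end_train_leap.date())  # 2004-02-29
```

As a consequence, the number of days held out for a 10-month prediction length can differ slightly from one series to another.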


<a id="split-deepar-model"></a>
## 4.2 DeepAR model

The following function takes a list of time series and a prediction length and returns a list of training time series. This ensures the data can be formatted for the DeepAR algorithm.

In [9]:
def create_training_series(time_series_list: list,
                           prediction_length_months: int):
    """Create a training series using the prediction length provided in months.
    
    Args:
        time_series_list (list): list of pandas series each of equal length.
        prediction_length_months (int): number of months for which a prediction is to be made. This is
                                        used to create a training series, i.e. the context for the
                                        DeepAR algorithm to train on.
    
    Returns:
        list of pd.Series: python list containing the training time series.
    """
    training_series_list = []
    
    for ts in time_series_list:
        end = max(ts.index.date) - pd.DateOffset(months=prediction_length_months)
        training_series_list.append(ts[:end])

    print(f'Number of series updated: {len(training_series_list)}')
    return training_series_list

In [10]:
# Create training series
prediction_length = 10
train_series_list = create_training_series(ts_list, prediction_length)

Number of series updated: 6


<a id="save-in-json-format"></a>
# 5. Saving Data

<a id="save-arima-model"></a>
## 5.1 ARIMA Model
The train and test data will be saved to `CSV` files for the ARIMA model, as the algorithm does not require the data in any particular format.

In [11]:
csv_data_dir = 'data/csv_aapl_data'

# Create the output directory if it does not already exist.
os.makedirs(csv_data_dir, exist_ok=True)

arima_train.to_csv(os.path.join(csv_data_dir, 'train.csv'))
arima_test.to_csv(os.path.join(csv_data_dir, 'test.csv'))

print(f'Created {csv_data_dir} with train-test data.')

Created data/csv_aapl_data with train-test data.
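
A saved file can be read back as a date-indexed series along the following lines. This is a sketch on a made-up series written to a temporary directory. It assumes the index has no name, as is the case after reindexing with `pd.date_range`, so the date column is selected by position.

```python
import os
import tempfile

import pandas as pd

# Made-up series with an unnamed DatetimeIndex, mimicking `arima_train`.
toy = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range('2019-01-01', periods=3, freq='D'),
    name='Adj Close')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.csv')
    toy.to_csv(path)

    # The first column holds the dates; parse it and use it as the index.
    loaded = pd.read_csv(path, index_col=0, parse_dates=True)['Adj Close']

print(loaded.index[0].date())  # 2019-01-01
print(loaded.tolist())         # [1.0, 2.0, 3.0]
```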


<a id="save-deepar-model"></a>
## 5.2 DeepAR Model
Amazon SageMaker's DeepAR model accepts data through a JSON file. The following function creates such a file from a list of time series.

In [24]:
def save_series_to_json(time_series_list: list,
                        filename: str,
                        data_dir: str = 'json_time_series_data'):
    """Function takes a list of time series data and then saves in DeepAR, JSON format.
    
    Args:
        time_series_list (list): list of pandas series each of equal length.
        filename (str): name of the file that will contain the data in DeepAR, JSON format.
        data_dir (str, optional): name of directory that will hold the JSON files.
    
    Returns:
        str: path to the file created.
    """
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    file_path = os.path.join(data_dir, filename)
    
    with open(file_path, 'wb') as f:
        for ts in time_series_list:
            line = json.dumps({
                "start": str(ts.index[0]),
                "target": ts.interpolate().dropna().tolist()}) + '\n'
            json_line = line.encode('utf-8')
            f.write(json_line)
    print(f'{file_path} created.')
    return file_path

In [25]:
json_data_dir = 'data/json_aapl_data'

train_data = save_series_to_json(train_series_list, 'train.json', json_data_dir)
test_data = save_series_to_json(ts_list, 'test.json', json_data_dir)

data/json_aapl_data/train.json created.
data/json_aapl_data/test.json created.
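
To check that a saved file is well formed, it can be read back line by line. The sketch below writes two made-up records to a temporary directory in the same one-object-per-line layout and parses them again; the record values are invented for the example.

```python
import json
import os
import tempfile

# Made-up records in the `start` / `target` layout used above.
records = [
    {"start": "2002-01-01 00:00:00", "target": [1.0, 2.0, 3.0]},
    {"start": "2005-01-01 00:00:00", "target": [4.0, 5.0, 6.0]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.json')

    # One JSON object per line.
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

    # Parse each non-empty line back into a dictionary.
    with open(path, encoding='utf-8') as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))         # 2
print(loaded[0]["start"])  # 2002-01-01 00:00:00
```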
