![image](https://storage.googleapis.com/kf-pipeline-contrib-public/ai-hub-assets/oob-timeseries-prediction/timeseries.png)
# Intended Use
- Performs 1-step time series forecasting.
- Performs quantile regression on provided quantiles.
- Supports anomaly detection.
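
Quantile regression is usually trained with the pinball (quantile) loss. The sketch below illustrates that loss in plain Python; the loss this component actually uses is not documented here, so treat `pinball_loss` as a hypothetical helper rather than the component's implementation.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Average pinball (quantile) loss for one quantile level in (0, 1)."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction (error > 0) is weighted by q,
        # over-prediction (error < 0) by (1 - q).
        total += max(quantile * error, (quantile - 1.0) * error)
    return total / len(y_true)
```

With `quantile=0.99`, under-predicting costs 99 times more than over-predicting by the same amount, which pushes the prediction toward the upper edge of the target distribution.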

# Runtime Parameters
### Data Parameters
| Parameter               | Definition|
| :---                    |:---|
| training-dir            | GCS directory containing training data|
| validation-dir          | (optional) GCS directory containing validation data. If not provided, validation data is split from the training data.|
| testing-dir             | (optional) GCS directory containing testing data. If provided, the final evaluation of the model is performed with this data.|
| output-dir              | GCS directory to write servable SavedModel.|

### Model Hyperparameters
| Parameter               | Definition|
| :---                    |:---|
| input-length          | Number of input window steps the model can use.|
| lower-quantile          | The lower quantile to regress on. |
| upper-quantile          | The upper quantile to regress on. |
| batch-size              | Batch size during training.|
| epochs                  | Maximum number of epochs. This component uses early stopping to terminate training, so training stops at this value or when early stopping is triggered, whichever comes first.|
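
How the `epochs` cap and early stopping interact can be sketched as follows. This is a minimal illustration, not the component's code: the `patience` rule and the list of precomputed validation losses are assumptions made for the example.

```python
def epochs_until_stop(validation_losses, max_epochs, patience=3):
    """Return how many epochs run before training stops.

    validation_losses: one loss per epoch (a stand-in for real training).
    max_epochs: hard cap, playing the role of the `epochs` parameter.
    patience: hypothetical number of non-improving epochs tolerated.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in validation_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping triggered before the cap
    return epochs_run
```

A large cap such as the default `epochs=10000` therefore rarely decides when training ends; it only bounds the run if the validation loss keeps improving.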

# Input
Each timeseries should be stored as its own JSON file in the data directories. For example:

```
├── gcs
    ├── training-dir
    |   ├── train-timeseries-1.json
    |   ├── train-timeseries-2.json
    |   └── ...
    ├── testing-dir
    |   ├── test-timeseries-1.json
    |   ├── test-timeseries-2.json
    |   └── ...
    └── ...
```

The timeseries data in each JSON file should be in the following format:
```
{
    "data":[
        [1325376000000,3.6484771574],
        [1325383200000,4.6002538071],
        ...
        [1420070400000,0.3172588832]
    ],
    "columns":["timestamp", "target"]
}
```
* **data**: This is the time series data where each step contains an array with a timestamp in milliseconds and the value of the timeseries at that timestamp. Time intervals between values should be constant. 
* **columns**: The names of the columns, in the same order as the values in each entry of *data*.
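
As a minimal sketch, one series can be serialized into this layout with the standard library. The function name `to_timeseries_json` is hypothetical, and the constant-interval check is stricter than anything the component is documented to enforce.

```python
import json


def to_timeseries_json(timestamps_ms, values):
    """Serialize one series into the {"data": [...], "columns": [...]} layout."""
    if len(timestamps_ms) != len(values):
        raise ValueError("timestamps and values must have the same length")
    # Collect the distinct gaps between consecutive timestamps.
    steps = {
        later - earlier
        for earlier, later in zip(timestamps_ms, timestamps_ms[1:])
    }
    if len(steps) > 1:
        raise ValueError("time intervals between values must be constant")
    record = {
        "data": [[t, v] for t, v in zip(timestamps_ms, values)],
        "columns": ["timestamp", "target"],
    }
    return json.dumps(record)
```

The two leading rows of the example above are 7,200,000 ms apart, which corresponds to a 2-hour interval.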

# Output
A servable SavedModel will be stored in *output-dir*. Its input signature takes a sequence of *input-length* values and returns the prediction for the next step. An example payload:
```
{
    "instances":[
        [3.6484771574, 4.6002538071, ..., 0.3172588832],
        [19.8257467994, 22.8485064011, ..., 22.4039829303],
        ...
    ]
}
```
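
A payload of this shape can be assembled from longer series by keeping only the most recent *input-length* values of each. The sketch below assumes plain Python lists as input; `build_prediction_payload` is a hypothetical helper, and how the request is sent to a serving endpoint is not covered here.

```python
import json


def build_prediction_payload(series_list, input_length):
    """Build a {"instances": [...]} body from the latest values of each series."""
    if input_length <= 0:
        raise ValueError("input_length must be positive")
    instances = []
    for series in series_list:
        if len(series) < input_length:
            raise ValueError("each series needs at least input_length values")
        # Keep only the most recent input_length values.
        instances.append(list(series[-input_length:]))
    return json.dumps({"instances": instances})
```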

# Anomaly Detection
This model supports anomaly detection with quantile regression. An anomalous value is defined as any point on the time series whose value is outside of the predicted quantile range. 
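
The rule above can be sketched in a few lines, assuming the lower and upper quantile predictions are already available as lists aligned with the observed values. The name `find_anomalies` is hypothetical.

```python
def find_anomalies(values, lower_bounds, upper_bounds):
    """Return the indices whose value falls outside [lower, upper]."""
    return [
        i
        for i, (value, low, high) in enumerate(
            zip(values, lower_bounds, upper_bounds)
        )
        # A value exactly on a bound is not treated as anomalous.
        if value < low or value > high
    ]
```

Narrowing the quantile range (for example 0.05 and 0.95 instead of 0.01 and 0.99) flags more points, so the two quantiles act as the sensitivity setting of the detector.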

## Enter Component Arguments

In [None]:
# GCS path to directory containing training data, validation data, and test data. Reference README for format.
training_dir = 'gs://kf-pipeline-contrib-public-data/electricity/train'
validation_dir = 'gs://kf-pipeline-contrib-public-data/electricity/validation'
test_dir = 'gs://kf-pipeline-contrib-public-data/electricity/test'

# GCS path to store the trained model. You must have write access. CHANGE BEFORE RUNNING
model_dir = 'GCS DIRECTORY WITH WRITE ACCESS ' # gs://...

input_length = 84  # how many steps the model can consider

# set your lower and upper quantiles, which will be used for anomaly detection
lower_quantile = 0.01
upper_quantile = 0.99

### Install KFP python package

In [None]:
%%capture
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.12/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade
!pip3 install pandas
!pip3 install matplotlib

In [None]:
pip install https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz --upgrade

In [None]:
%matplotlib inline
import kfp
from kfp import compiler
import kfp.dsl as dsl
import kfp.notebook
import kfp.gcp as gcp
import kfp.components as comp
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

### Create Pipeline

In [None]:
client = kfp.Client()
exp = client.create_experiment(name='timeseries forecasting experiment')

timeseries_training_op = comp.load_component_from_url(
    'https://storage.googleapis.com/kf-pipeline-contrib-public/ai-hub-assets/oob-timeseries-prediction/component.yaml')


@dsl.pipeline(name='pipeline name', description='pipeline description')
def one_step_training(
    training_dir='',
    validation_dir='',
    testing_dir='',
    input_length=84,
    lower_quantile=0.01,
    upper_quantile=0.99,
    batch_size=128,
    output_dir='',
    epochs=10000
    ):
    training_op = timeseries_training_op(
        training_dir=training_dir,
        validation_dir=validation_dir,
        testing_dir=testing_dir,
        input_length=input_length,
        lower_quantile=lower_quantile,
        upper_quantile=upper_quantile,
        batch_size=batch_size,
        output_dir=output_dir,
        epochs=epochs).apply(gcp.use_gcp_secret('user-gcp-sa'))

compiler.Compiler().compile(one_step_training, 'one_step_pipeline.tar.gz')

### Run Pipeline

In [None]:
run = client.run_pipeline(exp.id, 'ts run 1', 'one_step_pipeline.tar.gz', params=
                        {
                            'training-dir':training_dir,
                            'validation-dir':validation_dir,
                            'testing-dir':test_dir,
                            'input-length':input_length,
                            'lower-quantile': lower_quantile,
                            'upper-quantile': upper_quantile,
                            'output-dir':model_dir
                        } )

### Load trained model after run and test with toy data
When training is done, load the trained model and run it on some toy data to make sure everything worked.

First, let's create a class that makes the model easy to use, then pass in the toy data.

In [None]:
class TimeseriesModel(object):
    
    def __init__(self, model_dir):
        self.model_dir = model_dir
    
    def predict(self, ts):
        with tf.Session(graph=tf.Graph()) as sess:
            tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], self.model_dir)
            input_tensor = sess.graph.get_tensor_by_name('raw_input:0')
            lower_quantile_tensor = sess.graph.get_tensor_by_name('lower_quantile:0')
            median_tensor = sess.graph.get_tensor_by_name('median:0')
            upper_quantile_tensor = sess.graph.get_tensor_by_name('upper_quantile:0')
            lower_quantile, median, upper_quantile = sess.run(
                [lower_quantile_tensor, median_tensor, upper_quantile_tensor],
                feed_dict={input_tensor:ts})
        
        return {
            'lower_quantile':lower_quantile,
            'median': median,
            'upper_quantile': upper_quantile
        }
    
ts_model = TimeseriesModel(model_dir)
import pprint
# predict on two ranges
pprint.pprint(ts_model.predict(np.stack([np.arange(84), np.arange(20, 104)]).reshape(2, -1, 1)))

### Retrieve public dataset for visualizing
Retrieve a public dataset of electricity usage from UCI's Machine Learning Repository. This is the dataset the model above was trained on, so it lets us visualize how well the model performed.

In [None]:
%%capture
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip
!unzip LD2011_2014.txt.zip

### Read in public dataset to notebook
Read in the public dataset and then split it into the contexts used to train the time series model.

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

def read_electricity_data():
    """
    Read electricity data from a local file. Use this for local notebook development.
    """
    file_path = 'LD2011_2014.txt'
    data = pd.read_csv(file_path, sep=';', index_col=0, parse_dates=True, decimal=',')
    
    data_2H = data.resample('2H').sum()/8
    return [np.trim_zeros(data_2H.iloc[:,i], trim='f') for i in range(data_2H.shape[1])]

freq = '2H'
input_length = 7 * 12
test_future_time = 12 * 7 * 4 #every two hours for 1 month
start_dataset = pd.Timestamp("2014-01-01 00:00:00", freq=freq)
end_training = pd.Timestamp("2014-09-01 00:00:00", freq=freq)

elec = read_electricity_data()


### Split the dataset
For the model trained above, the data was split chronologically: training on roughly 8 months (2014-01-01 up to 2014-09-01), validation on the following 4 weeks, and testing on the 4 weeks after that.
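
The idea behind a temporal split can be sketched on a plain list using integer positions. Note the assumptions: `temporal_split` is a hypothetical helper, and it uses Python's half-open slicing, so its boundaries can differ by one step from the label-based pandas slices in the cell that follows.

```python
def temporal_split(series, train_end, input_length, horizon):
    """Split a list into train/validation/test by position.

    train_end: index of the first step after the training period.
    horizon: number of steps evaluated in each of validation and test.
    Validation and test start input_length steps early so that the first
    evaluated step has a full input window behind it.
    """
    if train_end < input_length:
        raise ValueError("train_end must be at least input_length")
    train = series[:train_end]
    validation = series[train_end - input_length:train_end + horizon]
    test = series[train_end + horizon - input_length:train_end + 2 * horizon]
    return train, validation, test
```

The overlap between neighbouring splits is only ever used as model input; the steps that are actually predicted in validation and test never appear in the training split.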

In [None]:
# These were the splits used to train the model above
training_elec = [ts[start_dataset:end_training - 1] for ts in elec]
validation_elec = [ts[end_training - input_length:end_training + test_future_time] for ts in elec]
test_elec = [ts[end_training - input_length + test_future_time:end_training + 2 * test_future_time] for ts in elec]

Let's visualize what this looks like for a single timeseries.

In [None]:
%matplotlib inline
plt.plot(pd.DataFrame(training_elec[13]), label='train')
plt.plot(pd.DataFrame(validation_elec[13]), alpha=0.5, label='validation')
plt.plot(pd.DataFrame(test_elec[13]), alpha=0.4, label='test')
plt.title('Temporal Split')
plt.gcf().autofmt_xdate()
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
plt.legend(loc=2)

### Create helper classes to predict and plot
These classes will be used to wrap the prediction and plot the test timeseries with predictions.

In [None]:
class TimeseriesGraph(object):
  
    def __init__(self, ts, input_length, title, model):
        self.ts = ts
        self.input_length = input_length
        self.create_dataset()
        self.x = list(range(self.input_length, len(self.ts)))
        self.title = title
        self.model=model
    
    def create_dataset(self):
        context_windows = [self.ts[i:i+self.input_length] for i in range(len(self.ts) - self.input_length)]
        labels = [self.ts[i] for i in range(self.input_length, len(self.ts))]
        self.labels = np.array(labels).reshape(-1, 1)
        self.ts_array = np.array(context_windows)

    def plot_truth_and_predictions(self, anomalies=False):
        predictions = self.model.predict(self.ts_array)
        
        if anomalies:
            is_anomaly = np.logical_or(
            predictions['lower_quantile'] > self.labels, predictions['upper_quantile'] < self.labels)
            indices = np.argwhere(is_anomaly)[:,0]
            plt.plot(indices + self.input_length, self.labels[indices], 'ro', label='anomalies')
        self.plot(predictions)

    def plot(self, predictions):
        plt.fill_between(self.x,
                         predictions['lower_quantile'].reshape(-1),
                         predictions['upper_quantile'].reshape(-1),
                         color='lightskyblue',
                         label='quantile range',
                         alpha=0.4)
        plt.plot(self.x, predictions['median'], color='orange', label='prediction',
            alpha=0.8)
        plt.plot(self.ts, color='black', label='target', alpha=0.3)
        plt.legend(loc=2)
        plt.title(self.title)
        fig = plt.gcf()
        fig.set_size_inches(18.5, 10.5)


### View the test predictions with the actual time series
We will now run 1-step predictions across the test time series and plot the predictions against the actual values.

What's in the graph:
- the black line is the true time series
- the orange line is the forecast of the model
- the blue area is the predicted quantile range between *lower_quantile* and *upper_quantile*
- the part of the chart with no predictions or quantile range is the first context window used for prediction

Try different time series indices, as there are over 300 time series in this multi-time-series dataset. You can also try different slices in time.
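
Stepping through a series one prediction at a time amounts to a sliding window. The sketch below restates the windowing of `create_dataset` on plain lists (`make_windows` is a hypothetical name) and shows why the first `input_length` steps of the chart carry no prediction.

```python
def make_windows(series, input_length):
    """Pair each window of input_length values with the value that follows it."""
    windows = [
        series[i:i + input_length]
        for i in range(len(series) - input_length)
    ]
    # The first label sits at position input_length, so earlier steps
    # are used only as model input and never receive a prediction.
    labels = [series[i] for i in range(input_length, len(series))]
    return windows, labels
```

A series of length `n` yields `n - input_length` window and label pairs, one per predicted step.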

In [None]:
ts_plotter = TimeseriesGraph(
    test_elec[0].values.reshape(-1, 1),
    input_length,
    'Test Timeseries Index 0',
    TimeseriesModel(model_dir))

ts_plotter.plot_truth_and_predictions()

### Anomaly Detection
This model is capable of performing anomaly detection. An anomaly is defined as any value on our time series that is outside of our predicted quantile range.

Let's plot the same graph and include the anomalies as red dots.

In [None]:
ts_plotter.plot_truth_and_predictions(anomalies=True)

### Helper method for data conversion
If you want to convert your own data and train a new model, you can use the helper method below to convert your pandas time series data into the JSON format specified above.

In [None]:
from google.cloud import storage 
    
def write_pandas_multi_timeseries_to_gcs(
    multi_timeseries, # multi timeseries instance - a list of pd.DataFrame timeseries
    bucket='ai_hub_dev', # bucket to write data
    gcs_folder_path='timeseries/data/electricity_train', # path inside bucket to write data
    local_folder_path= 'data'): # local directory to store intermediate files
    client = storage.Client()
    bucket = client.get_bucket(bucket)
    for i, ts in enumerate(multi_timeseries):
        if not isinstance(ts, pd.DataFrame):
            ts = pd.DataFrame(ts)
        ts.reset_index(inplace=True)
        ts.columns = ['timestamp', 'target']
        local_file_path = '{}/{}.json'.format(local_folder_path, i)
        ts.to_json(local_file_path, orient='split', index=False)
        blob = bucket.blob('{}/{}.json'.format(gcs_folder_path, i))
        # upload the file written above (respects local_folder_path)
        blob.upload_from_filename(local_file_path)

# write_pandas_multi_timeseries_to_gcs(training_elec)  # example call