# VAE for Time Series Data

This notebook tests the time series VAE module on a real time series dataset. We walk through an example of generating synthetic stock data from daily Amazon (AMZN) price data covering 2010 to 2020.

## Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from models.tsvae_conv import ConvTimeSeriesVAE

## Loading the Datasets

To train the model, we first read the CSV file containing the data into a pandas DataFrame.

In [None]:
# Load dataset
dataset_name = 'AMZN_10-20'
file_path = './datasets/' + dataset_name + '.csv'
dataset = pd.read_csv(file_path)
dataset.head(10)
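
Before fitting, it can help to confirm that the columns the model will be given are present and that the time column parses as dates. The following is a minimal sketch rather than part of the module: the inline CSV with made-up values stands in for the real file, and the column names are the ones used for the preprocessing parameters later in this notebook.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (values are made up for illustration).
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2010-01-04,1.0,2.0,0.5,1.5,1.5,100
2010-01-05,1.5,2.5,1.0,2.0,2.0,120
"""
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Column names the notebook later passes as time_column and feature_names.
required = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
missing = [name for name in required if name not in toy.columns]

# The time column should be a datetime type and sorted in time order.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Date"])
is_sorted = toy["Date"].is_monotonic_increasing

print(missing, is_datetime, is_sorted)
```

Whether the module itself expects the time column as strings or as parsed dates is not stated here, so this check is meant for inspection only.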

## Instantiating the Model
 
Before creating the model, we need to define various parameters. We can roughly categorize the types of parameters into three classes:

1. **Model parameters**: These parameters are related to the architecture of the underlying VAE model.  
2. **Training parameters**: These parameters are related to the training process of the VAE model when given data. These parameters are standard in machine learning models.
3. **Preprocessing parameters**: These parameters are related to the preprocessing required to train a deep generative model on a time series dataset. This class also includes two **required** arguments:
    - *time_column*: a string with the name of the column in the pandas DataFrame that corresponds to the temporal component. For example, for daily stock data, this corresponds to the date. There can only be a single time column.
    - *feature_names*: a list containing the name of the column(s) in the pandas DataFrame that contain the features of the time series. For example, for the daily stock data, features could correspond to the open price and close price.
    
**Note**: With the exception of "time_column" and "feature_names", all function arguments in the creation of the model object and the fitting of the model have default parameter values.
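
To make the role of a sequence length parameter such as `seq_len` concrete, the sketch below shows one common way to cut a multivariate series into overlapping fixed-length segments. It is an assumption that `ConvTimeSeriesVAE` windows the data this way; the helper `make_segments` and the toy array are hypothetical and only illustrate the idea. The (segments, features, time) layout matches the indexing used in the visualization cell of this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_segments(values: np.ndarray, seq_len: int) -> np.ndarray:
    """Cut a (time, features) array into overlapping windows.

    Hypothetical helper, not part of the VAE module.
    Returns an array of shape (n_segments, features, seq_len).
    """
    if values.ndim != 2:
        raise ValueError("expected a 2-D array of shape (time, features)")
    if values.shape[0] < seq_len:
        raise ValueError("series is shorter than seq_len")
    # Windows slide along the time axis; the window axis is appended last.
    # copy() turns the read-only view into an independent array.
    return sliding_window_view(values, seq_len, axis=0).copy()


# Toy series: 100 time steps, 6 features (same feature count as the stock data).
toy = np.arange(600, dtype=float).reshape(100, 6)
segments = make_segments(toy, seq_len=30)
print(segments.shape)  # (71, 6, 30)
```

With a stride of one, a series of length T yields T - seq_len + 1 segments, so consecutive segments overlap heavily.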

In [None]:
# - All global parameters
# -- Model parameters
latent_dimension = 8
hidden_layers = [50, 100, 200]
kernel_size = 3
reconstruction_wt = 1
# -- Training parameters
epochs = 20
batch_size = 32
lr = 0.001
# -- Preprocessing parameters
seq_len = 30
features = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
time_name = 'Date'

With all the above parameters defined, we are now ready to instantiate our VAE model.

In [None]:
# Instantiate model 
model = ConvTimeSeriesVAE(seq_len=seq_len, dataset=dataset, time_column=time_name, feature_names=features,
                          latent_dim=latent_dimension, hidden_layer_sizes=hidden_layers,
                          reconstruction_wt=reconstruction_wt, kernel_size=kernel_size)

## Training the Model

After instantiating the model, we can call the fit function to train it according to the defined training parameters.

In [None]:
# Fit the model to the dataset
model.fit(batch_size=batch_size, lr=lr, epochs=epochs)

## Generating Samples from the Trained Model

Once the model is fit, we can call the sample function to generate synthetic time series segments.

In [None]:
synthetic_dataset = model.sample(100)

### Format of Synthetic Data

By default, the samples are returned as a pandas DataFrame (which can be saved as a CSV file with the pandas to_csv() method).

In [None]:
synthetic_dataset.head(10)

Because the model generates independently sampled segments of the time series, the output format of the synthetic data *must* differ from that of the original training dataset. As shown above, in addition to the features of the time series, each row of the returned pandas DataFrame shows the segment index (which segment that row belongs to) as well as the time index (the corresponding time point of that row within its segment).
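
A long-format table with a segment index and a time index can be stacked back into a 3-D array when that is more convenient for downstream code. The sketch below assumes every segment has the same length; the helper `long_to_array` and the column names `segment_index` and `time_index` are hypothetical, so substitute the names that actually appear in the sampled DataFrame.

```python
import numpy as np
import pandas as pd


def long_to_array(df, segment_col, time_col, feature_cols):
    """Stack a long-format frame into shape (n_segments, n_features, seq_len).

    Hypothetical helper; assumes all segments have the same number of rows.
    """
    ordered = df.sort_values([segment_col, time_col])
    n_segments = ordered[segment_col].nunique()
    if len(ordered) % n_segments != 0:
        raise ValueError("segments do not all have the same length")
    seq_len = len(ordered) // n_segments
    values = ordered[feature_cols].to_numpy()
    # (segments * time, features) -> (segments, time, features) -> swap last two axes
    return values.reshape(n_segments, seq_len, len(feature_cols)).transpose(0, 2, 1)


# Toy frame with made-up values: 2 segments of 3 time points each.
toy = pd.DataFrame({
    "segment_index": [0, 0, 0, 1, 1, 1],
    "time_index": [0, 1, 2, 0, 1, 2],
    "Open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Close": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
})
array = long_to_array(toy, "segment_index", "time_index", ["Open", "Close"])
print(array.shape)  # (2, 2, 3)
```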

## Visualization of Synthetic Data

Here, we run a simple test to visualize the synthetic data as compared to the original dataset (in terms of segments).

In [None]:
# Test to see quality of generated synthetic data
N = 50
samples = model.sample(N, return_dataframe=False)
compare_idx = np.random.choice(model.dataset.shape[0], N, replace=False)
for i in range(samples.shape[1]):
    plt.figure()
    plt.plot(model.dataset[compare_idx, i, :].squeeze().T, c='k', alpha=0.1)
    plt.plot(samples[:, i, :].squeeze().T, c='r', alpha=0.3)
    plt.xlabel('Time Index')
    plt.ylabel(features[i])
    plt.show()

As we can see, the synthetic segments (in red) resemble the original dataset segments (in black) across almost all features. The exception is the volume feature, where the synthetic data does not capture the highly nonstationary behavior.
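
A simple numerical check can complement the visual comparison. The sketch below compares the per-feature mean and standard deviation of two segment arrays. The helper `summary_gap` is hypothetical, and the random arrays stand in for `model.dataset[compare_idx]` and `samples`, both assumed to have shape (segments, features, time).

```python
import numpy as np


def summary_gap(real: np.ndarray, synthetic: np.ndarray):
    """Absolute per-feature gaps in mean and std, pooled over segments and time.

    Hypothetical helper; both inputs have shape (segments, features, time).
    """
    axes = (0, 2)  # pool over segments and time, keep the feature axis
    mean_gap = np.abs(real.mean(axis=axes) - synthetic.mean(axis=axes))
    std_gap = np.abs(real.std(axis=axes) - synthetic.std(axis=axes))
    return mean_gap, std_gap


# Random stand-ins for the real and synthetic segment arrays.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(50, 6, 30))
fake = rng.normal(0.0, 1.0, size=(50, 6, 30))
mean_gap, std_gap = summary_gap(real, fake)
print(mean_gap.shape, std_gap.shape)  # (6,) (6,)
```

A large gap for a single feature would point to the kind of mismatch seen in the volume plot, although pooled statistics like these cannot detect differences in temporal structure.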

## Compute Metrics

In [None]:
model.compute_metrics()

## Save and Load Model

We can save the model by calling the save method:

In [None]:
fname = 'example_model'
model.save(fname)

We can verify that the model was saved properly by creating a new, untrained model and loading the saved model into it.

In [None]:
model_copy = ConvTimeSeriesVAE(seq_len=seq_len, dataset=dataset, time_column=time_name, feature_names=features,
                               latent_dim=latent_dimension, hidden_layer_sizes=hidden_layers,
                               reconstruction_wt=reconstruction_wt, kernel_size=kernel_size)

In [None]:
model_copy.compute_metrics()

Before the saved model is loaded, the metrics are poor because the new model is untrained. Calling the load method restores the pre-trained model, after which we compute the metrics again.

In [None]:
model_copy.load(fname)

In [None]:
model_copy.compute_metrics()

We can see that the metrics are improved once the pre-trained model has been loaded. 