# Gradient Trader Part 1: The Surprising Usefulness of Autoencoders

## Using Autoencoders to Learn Most Salient Features from Time Series
This post is about a simple tool in the deep learning toolbox: the autoencoder. It can be applied to multi-dimensional financial time series.

### Autoencoder
An autoencoder is a network trained to copy its input to its output, that is, to learn the identity function. It represents the input through an internal state h, called the latent representation. The dimension of h is usually chosen to be smaller than that of the input, in which case the autoencoder is called undercomplete. An autoencoder is composed of two parts: an encoder f: x → h and a decoder g: h → y, where y is the reconstruction of x.

<img src = 'https://blog.keras.io/img/ae/autoencoder_schema.jpg' />

The hidden dimension should be smaller than the dimension of the input x. This way, h is forced to capture the most salient features of the input space.

Training an autoencoder means finding the functions f and g that minimize the reconstruction error:

\begin{equation*}
f, g = \arg\min_{f, g} \lVert X - (g \circ f)(X) \rVert^2
\end{equation*}
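
As a rough illustration of the objective above, the sketch below restricts f and g to linear maps, which is an assumption made only to keep the example small. In that special case the minimizer is known to be the projection onto the top principal directions, so no training loop is needed. The data and the helper `reconstruction_loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples in 5 dimensions that mostly vary along 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

def reconstruction_loss(X, k):
    """Squared reconstruction error of the best linear autoencoder with a k-dim code."""
    # Rows of Vt are principal directions; the top k span the optimal linear code.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T                      # shape (input_dim, k)
    f = lambda x: x @ W               # encoder: x -> h
    g = lambda h: h @ W.T             # decoder: h -> reconstruction
    return float(np.sum((X - g(f(X))) ** 2))

# Loss for latent sizes 1..5; it shrinks as the code gets larger.
losses = [reconstruction_loss(X, k) for k in range(1, 6)]
```

Because the toy data has roughly two underlying directions, most of the error should disappear once the code has two dimensions, which is the sense in which an undercomplete code keeps the salient features.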

### Recurrent Autoencoder

For time series data, recurrent autoencoders are especially useful. The only difference is that the encoder and decoder are RNNs, such as LSTMs. Think of an RNN as a for loop over time steps that carries a state from one step to the next. It can be unrolled into a feedforward network.
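
The "for loop over time steps" picture can be sketched in a few lines of NumPy. This is a plain tanh RNN rather than an LSTM, and the weights are random placeholders, so it only illustrates how the state is carried along, not a trained model.

```python
import numpy as np

def rnn_forward(x, W_x, W_h, b):
    """Run a plain tanh RNN over x of shape (time_step, input_dim)."""
    h = np.zeros(W_h.shape[0])                  # initial state
    states = []
    for x_t in x:                               # the loop over time steps
        # The new state depends on the current input and the previous state.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)                     # shape (time_step, hidden_dim)

rng = np.random.default_rng(0)
time_step, input_dim, hidden_dim = 10, 2, 3
x = rng.normal(size=(time_step, input_dim))
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(x, W_x, W_h, b)
last_state = states[-1]   # fixed-length summary of the whole sequence
```

Unrolling the loop gives one layer per time step with shared weights, which is the feedforward view mentioned above.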

First, the encoder compresses the input into an undercomplete latent vector h, which the decoder then decodes. The recurrent autoencoder is a special case of the sequence-to-sequence (seq2seq) architecture, which is widely used in neural machine translation, where the network maps a sequence in one language to a sequence in another.

<img src = 'https://esciencegroup.files.wordpress.com/2016/03/seq2seq.jpg' />

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

**Task**

Copy a tensor of two sine functions that are initialized out of phase.

The input is a tensor of shape (batch_size, time_step, input_dim).

- batch_size is the number of samples in each training batch. Looping over each sample is slower than applying a tensor operation on a batch of several samples.

- time_step is the number of time steps the RNN iterates over. In this tutorial it is 10 because each sample has 10 points.

- input_dim is the number of values (features) at each timestep. Here we have 2 functions, so this number is 2.

To deal with financial data, simply fill the input_dim axis with the desired features, for example:

- Bid, Ask, Spread, Volume, RSI. For this setup, the input_dim would be 5.
- Order book levels. We can rebin the order book along the tick axis so that each level aggregates more liquidity. An example would be 10 levels that are one standard deviation apart. Then input_dim would be 10.
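
One possible way to get a feature table into the (batch_size, time_step, input_dim) layout is to slice it into overlapping windows. The sketch below uses random numbers as a stand-in for real market data, and `make_windows` is a hypothetical helper, not part of any library.

```python
import numpy as np

def make_windows(features, time_step):
    """Slice a (n_obs, input_dim) array into overlapping windows.

    Returns an array of shape (n_windows, time_step, input_dim).
    """
    n_obs = features.shape[0]
    n_windows = n_obs - time_step + 1
    return np.stack([features[i:i + time_step] for i in range(n_windows)])

# Placeholder data: 100 observations of 5 features
# (for example bid, ask, spread, volume, RSI).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))

X = make_windows(features, time_step=10)
print(X.shape)  # (91, 10, 5)
```

Each window along the first axis is one training sample; batches are then drawn from that axis.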

Here is an artist’s rendition of a recurrent autoencoder.

In [None]:
plt.xkcd()

x1 = np.linspace(-np.pi, np.pi)
y1 = np.sin(x1)
phi = 3
x2 = np.linspace(-np.pi+phi, np.pi+phi)
y2 = np.sin(x2)

f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=True, sharey=True)
ax1.plot(x1, y1)
ax1.set_title('Recurrent Autoencoder')
ax2.plot(x1, y1)
ax3.plot(x2, y2)
ax4.plot(x2, y2)
plt.show()

### Generator
For each sample in a batch, generate two sinusoids (a sine and a cosine, that is, a sine shifted by π/2), each with 10 data points. The phase offset shared by the two signals is random.

**This is the function generating time series pieces.**

In [None]:
import random
def gen(batch_size):
    seq_length = 10

    batch_x = []
    batch_y = []
    for _ in range(batch_size):
        rand = random.random() * 2 * np.pi

        sig1 = np.sin(np.linspace(0.0 * np.pi + rand,
                                  3.0 * np.pi + rand, seq_length * 2))
        sig2 = np.cos(np.linspace(0.0 * np.pi + rand,
                                  3.0 * np.pi + rand, seq_length * 2))
        x1 = sig1[:seq_length]
        y1 = sig1[seq_length:]
        x2 = sig2[:seq_length]
        y2 = sig2[seq_length:]

        x_ = np.array([x1, x2])
        y_ = np.array([y1, y2])
        x_, y_ = x_.T, y_.T

        batch_x.append(x_)
        batch_y.append(y_)

    batch_x = np.array(batch_x)
    batch_y = np.array(batch_y)

    # The autoencoder target is the input itself, so batch_y (the continuation) is not returned.
    return batch_x, batch_x

### Model
The goal is to use two numbers to represent the sine functions. Normally, we use ϕ∈ℝ to represent the phase angle of a trigonometric function. Let’s see if the neural network can learn this phase angle. The big picture here is to compress the input sine functions into two numbers and then decode them back.
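
To see why so few numbers can be enough, note that each generated window is fully determined by its phase. The sketch below rebuilds one window by hand, on the same grid the generator uses, and recovers the phase from the first timestep with `arctan2`. It says nothing about what the LSTM actually learns; the value of `phi` is an arbitrary example.

```python
import numpy as np

seq_length = 10
phi = 1.234  # arbitrary example phase in [0, 2*pi)

# Same grid as the generator: 2 * seq_length points spanning 3*pi, first half kept.
offsets = np.linspace(0.0, 3.0 * np.pi, seq_length * 2)[:seq_length]
t = phi + offsets
x = np.stack([np.sin(t), np.cos(t)], axis=1)      # shape (seq_length, 2)

# The first timestep is (sin(phi), cos(phi)), so arctan2 recovers the phase.
phi_hat = np.arctan2(x[0, 0], x[0, 1]) % (2 * np.pi)

# Rebuild the whole window from that single number.
t_hat = phi_hat + offsets
x_hat = np.stack([np.sin(t_hat), np.cos(t_hat)], axis=1)
```

A hand-written encoder therefore needs only one number here, so a two-dimensional latent vector has enough capacity in principle.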

Define the architecture and let the neural network do its trick. The model has 3 layers: an LSTM encoder that “encodes” the input time series into a fixed-length vector (in this case of length 2), a RepeatVector that repeats this vector over 10 timesteps so it can be used as input to the decoder, and an LSTM decoder. For the decoder, we can either initialize the hidden state (memory) with the latent vector and use the output at time t−1 as the input at time t, or use the latent vector h as the input at each timestep. These are called conditional and unconditional decoders, respectively. The model below uses the second option.
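
The repeat step is easy to picture in NumPy: RepeatVector takes a (batch_size, latent_dim) tensor and copies it along a new time axis. The sketch below mimics that with `np.repeat`; the latent values are made up for illustration.

```python
import numpy as np

batch_size, latent_dim, time_step = 4, 2, 10
# Made-up latent vectors, one row per sample.
h = np.arange(batch_size * latent_dim, dtype=float).reshape(batch_size, latent_dim)

# Equivalent of RepeatVector(time_step): copy each latent vector along a new time axis.
decoder_input = np.repeat(h[:, np.newaxis, :], time_step, axis=1)
print(decoder_input.shape)  # (4, 10, 2)
```

Every timestep of the decoder input therefore carries the same latent vector, and the decoder's own state is what distinguishes one step from the next.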

In [None]:
from keras.models import Sequential, Model
from keras.layers import LSTM, RepeatVector

batch_size = 100
X_train, _ = gen(batch_size)

m = Sequential()
m.add(LSTM(2, input_shape=(10, 2)))
m.add(RepeatVector(10))
m.add(LSTM(2, return_sequences=True))
m.summary()  # summary() prints the table itself and returns None
m.compile(loss='mse', optimizer='adam')
history = m.fit(X_train, X_train, epochs=500, batch_size=100)
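
The parameter counts reported by the model summary can be checked by hand. The sketch below assumes the standard LSTM layout with biases (four gates, each with an input kernel, a recurrent kernel and a bias); `lstm_param_count` is a hypothetical helper written for this check.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    # Four gate blocks (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel and a bias vector.
    return 4 * (input_dim * units + units * units + units)

encoder_params = lstm_param_count(input_dim=2, units=2)   # 40
decoder_params = lstm_param_count(input_dim=2, units=2)   # 40
total_params = encoder_params + decoder_params            # 80, RepeatVector has none
```

With only 80 weights, the model is about as small as a recurrent autoencoder for this task can get.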

Since this post is a demonstration of the technique, we use the smallest model possible, which happens to be the best in this case. The “best” latent dimensionality is the smallest one that still reconstructs the input with negligible loss. In practice, choosing it is more of an art than a science. In production, there is a plethora of tricks for accelerating training and finding the right capacity for the latent vector. Topics include architectures for dealing with asynchronous, non-stationary time series and preprocessing techniques such as wavelet transforms and the FFT.

In [None]:
plt.plot(history.history['loss'])
plt.ylabel("loss")
plt.xlabel("epoch")
plt.show()

You may think that the neural network suddenly “got it” during training, but this is just the optimizer escaping a saddle point. In fact, as we demonstrate below, the neural network (at least the decoder) doesn’t learn the math behind it at all.



In [None]:
X_test, _ = gen(1)
decoded_imgs = m.predict(X_test)

for i in range(2):
    plt.plot(range(10), decoded_imgs[0, :, i])
    plt.plot(range(10), X_test[0, :, i])
plt.title('dos_numeros')
plt.show()
