# Time Series with Recurrent Neural Networks

[Iaroslav Shcherbatyi](http://iaroslav-ai.github.io/), [ED3S 2018](http://iss.uni-saarland.de/de/ds-summerschool/)

# Synopsis

Real-world data often comes as time series: [sequences of data points indexed in time order](https://en.wikipedia.org/wiki/Time_series). Records of patients' vital signs, sequences of physical measurements, video and audio streams, financial market data, and corporate historical records are some of the many examples. In this notebook, time series are processed with Recurrent Neural Networks as well as with "classical" methods, which take fixed-length vectors of values as input.

Note: if you are viewing this notebook on Kaggle, you can download the notebook by clicking the cloud with arrow pointing down in the upper panel, and the necessary data from the panel to the right. 

# Lumber futures price prediction

In this task, we build a model that forecasts the direction of change in the price of lumber futures. The following data is used: [Random Length Lumber Futures, Continuous Contract #1 (LB1) (Front Month)](https://www.quandl.com/data/CHRIS/CME_LB1-Random-Length-Lumber-Futures-Continuous-Contract-1-LB1-Front-Month?utm_medium=graph&utm_source=quandl). Only the last 3000 historical records are used, to reduce computation.

First, let's load the data!

In [None]:
import pandas as pd
import numpy as np

lumber = pd.read_csv('../input/quandl-lumber-price-history/lumber.csv')
lumber = lumber[::-1]
lumber = lumber.drop('Date', axis=1)
display(lumber[:12])

In [None]:
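from skimage.util import view_as_windows

# slide a window of 10 consecutive rows (all 7 columns) over the table
history = view_as_windows(lumber.values, (10,7)).squeeze()
# inputs: the first 9 rows of every window
X = history[:, :-1, :]
# target: is column 0 of the last row greater than column 3 of the row before it?
y = (history[:, -1, 0] > history[:, -2, 3]).astype(int)  # integer class labels for Keras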

print(X.shape)
print(y.shape)

print(X[0])
print(y[0])
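
The windowing step above can be reproduced on a toy table to see what shapes come out. The sketch below is only an illustration: `make_windows` is a hypothetical helper written with plain `numpy` (it is not the `view_as_windows` function used in the notebook), and the toy values are made up.

```python
import numpy as np

def make_windows(table, window):
    """Stack every run of `window` consecutive rows into one 3-D array."""
    n = len(table) - window + 1
    return np.stack([table[i:i + window] for i in range(n)])

# Toy table: 6 time steps, 2 features (values are made up for illustration).
toy = np.arange(12, dtype=float).reshape(6, 2)

windows = make_windows(toy, window=4)            # shape (3, 4, 2)
X_toy = windows[:, :-1, :]                       # all rows but the last: inputs
# Target: compare a value in the last row with a value in the row before it.
y_toy = windows[:, -1, 0] > windows[:, -2, 1]

print(windows.shape, X_toy.shape, y_toy.shape)
```

Each window shares all but one row with its neighbor, so consecutive samples overlap heavily; this is one reason the notebook splits train and test data by position rather than at random.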

# Recurrent Neural Networks in Keras

Let's apply an RNN first. In Keras, an RNN is a layer that takes a sequence of vectors as input and outputs the final activation of the RNN. Keras offers several flavors of RNN layer, such as LSTM and GRU. A few remarks are in order:
* Test results vary from run to run because the weights of the neural network are initialized randomly. In the Keras version this notebook was written for, the initialization can be fixed with the `numpy.random.seed` function, since `numpy` is used internally in Keras; newer Keras versions may also require seeding the backend, e.g. with `keras.utils.set_random_seed`.
* Standardizing the scale of the feature ranges is also important in Keras! The code below subtracts the training mean and divides by the training standard deviation.
* Different types of RNN can lead to different results.
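
To make "takes a sequence of vectors and outputs the final activation" concrete, here is a minimal `numpy` sketch of a plain tanh recurrence, in the spirit of `SimpleRNN`. It is not the Keras implementation: the function name, the weight names and the random weights are assumptions for illustration, and GRU and LSTM layers add gating on top of this basic idea.

```python
import numpy as np

def simple_rnn_last_state(sequence, W_in, W_rec, b):
    """Run a plain tanh RNN over `sequence` and return the final hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x_t in sequence:  # one update per time step
        h = np.tanh(x_t @ W_in + h @ W_rec + b)
    return h

rng = np.random.default_rng(0)
n_features, n_units = 7, 16  # same sizes as in the model below
W_in = rng.normal(scale=0.1, size=(n_features, n_units))
W_rec = rng.normal(scale=0.1, size=(n_units, n_units))
b = np.zeros(n_units)

sequence = rng.normal(size=(9, n_features))  # 9 time steps, 7 features
h_last = simple_rnn_last_state(sequence, W_in, W_rec, b)
print(h_last.shape)
```

Whatever the length of the input sequence, the result is a single vector with one entry per unit, which is why a `Dense` layer can be applied directly after the recurrent layer.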

In [None]:
# GRU, SimpleRNN, LSTM are recurrent layers
from keras.layers import Input, GRU, SimpleRNN, LSTM, Dense, LeakyReLU, Softmax
from keras.models import Model
from keras.optimizers import Adam, RMSprop
from keras.losses import sparse_categorical_crossentropy

#np.random.seed(1)
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

m, s = np.mean(X_train, axis=(0, 1)), np.std(X_train, axis=(0, 1))
X_train = (X_train - m) / s
X_test = (X_test - m) / s

inp = Input(shape=(9, 7))
h = inp
# Task: try SimpleRNN, GRU, LSTM
h = GRU(16)(h)  # this takes as input a sequence, and returns last activation of RNN
h = Dense(2)(h)
h = Softmax()(h)

model = Model(inputs=[inp], outputs=[h])
model.compile(Adam(0.001), sparse_categorical_crossentropy, ['accuracy'])
model.fit(X_train, y_train, epochs=5, verbose=1)

# get the model performance
loss, acc = model.evaluate(X_test, y_test)
print(acc)

Let's use the trained model to make predictions. Observing how the model's outputs change as the inputs change can be informative of how the model "thinks".

In [None]:
# Task: something missing here?
x = np.array([
    [
        [339., 341., 337.5, 339.3, 339.3, 159., 838.],
        [338.3, 338.5, 336., 336.5, 336.5, 149., 714.],
        [337.3, 337.3, 334., 335., 335., 278., 567.],
        [336., 336., 328.2, 329.1, 329.1, 326., 464.],
        [332., 336., 331., 331.2, 331.2, 167., 387.],
        [331.5, 332.4, 330.2, 330.4, 330.4, 97., 302.],
        [331.3, 334.8, 330., 331.5, 331.5, 243., 209.],
        [332.2, 334.5, 326.6, 326.6, 326.6, 123., 104.],
        [332.9, 332.9, 325., 325., 325., 83., 93.]
    ]
])
model.predict(x)

# Scikit-Learn for sequence processing

Models that are not explicitly designed for sequence processing can sometimes be quite competitive nonetheless. Let's try a linear SVM below and compare its performance to the RNN.
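
Such models expect one fixed-length vector per sample, so every window of 9 time steps and 7 features has to be flattened into 63 values first. A minimal sketch on made-up numbers (the array names here are illustrative only) shows the flattening and how it can be undone:

```python
import numpy as np

# 2 toy samples, each with 9 time steps and 7 features (made-up values)
windows = np.arange(2 * 9 * 7).reshape(2, 9, 7)

flat = windows.reshape(len(windows), -1)  # shape (2, 63): one vector per sample
restored = flat[0].reshape(9, 7)          # back to (time step, feature) layout

print(flat.shape, restored.shape)
```

Because the flattening is row by row, position `t * 7 + f` of the flat vector holds feature `f` at time step `t`; the same reshape can be used to map learned weights back onto the input layout.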

In [None]:
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

Xf = np.reshape(X, (len(X), -1))
X_train, X_test, y_train, y_test = Xf[:1500], Xf[1500:], y[:1500], y[1500:]

model = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty='l1', dual=False)  # Task: use L1 regularization
)

model.fit(X_train, y_train)
score = model.score(X_test, y_test)

dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy_score = dummy.score(X_test, y_test)

print(score)
print(dummy_score)

The code below renders the weights of the linear model.

In [None]:
shape = X[0].shape
lsvc = model.steps[-1][-1]

# reshape weights to fit the shape of input
w = np.reshape(lsvc.coef_, shape)

# display what weights are applied to what features of input sequence
display(pd.DataFrame(w, columns=lumber.columns))

# IMDB data classification

One particularly successful application of RNNs is sentiment analysis, where the task is to infer whether an excerpt of text has a positive tone or not. The data used below consists of 25000 movie reviews from the [Kaggle NLP challenge](https://www.kaggle.com/c/word2vec-nlp-tutorial).

In [None]:
# first, load the data!
import pandas as pd

data = pd.read_csv('../input/word2vec-nlp-tutorial/labeledTrainData.tsv', sep='\t')
display(data.head())

Let's split the data into inputs and outputs, and look in more detail at what a review looks like.

In [None]:
# split inputs and outputs
X = data['review'].values
y = data['sentiment'].values

print(X[0])
print(y[0])

One successful approach to working with text is a word-level representation. First, the $N$ most frequent words are extracted from the training data. Then every text is converted to a sequence of words, and every word is replaced by an integer equal to its frequency rank. A word that is not among the $N$ most frequent words is replaced with 0. For example, with $N = 3$ the following transformation is applied:

['aa aa aa ba aa ab abc'] -> [1, 1, 1, 2, 1, 3, 0]
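
The transformation above can be sketched in a few lines of plain Python. The helper names `fit_vocabulary` and `encode` are hypothetical and written only for this illustration; this is not the Keras `Tokenizer`, which additionally lowercases text and strips punctuation. Ties between equally frequent words are broken here by order of first appearance, which is an assumption of this sketch.

```python
from collections import Counter

def fit_vocabulary(texts, num_words):
    """Map the `num_words` most frequent words to ranks 1..num_words."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: rank for rank, word in enumerate(ranked[:num_words], start=1)}

def encode(text, vocabulary):
    """Replace each word by its rank, or by 0 if it is not in the vocabulary."""
    return [vocabulary.get(word, 0) for word in text.split()]

vocabulary = fit_vocabulary(['aa aa aa ba aa ab abc'], num_words=3)
print(encode('aa aa aa ba aa ab abc', vocabulary))  # [1, 1, 1, 2, 1, 3, 0]
```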

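A dedicated `TextToIntSeq` transformer is defined below; it also showcases how data transformers are implemented in `sklearn`. Note that the Keras `Tokenizer` it relies on, as configured here, skips infrequent words instead of replacing them with 0, so 0 is used only for padding.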

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py
from keras.preprocessing.text import Tokenizer

# proper sklearn TransformerMixin class
class TextToIntSeq(BaseEstimator, TransformerMixin):
    """Convert every text to a fixed-length sequence of integers, where
    each integer is the frequency rank of a word. Only the most frequent
    words (limited by num_words) are kept; with the Tokenizer settings
    used here, other words are skipped, and 0 is used for padding.
    """
    def __init__(self, num_words=10000, max_seq_length=80):
        self.num_words = num_words
        self.max_seq_length = max_seq_length
        self._tokenizer = None
    
    def fit(self, X, y=None):
        # X: list of texts
        self._tokenizer = Tokenizer(self.num_words)
        self._tokenizer.fit_on_texts(X)
        return self  # proper sklearn transformer
    
    def transform(self, X, y=None):
        N = self.max_seq_length
        X = self._tokenizer.texts_to_sequences(X) # convert texts to sequences
        # trim sequences which are too long
        X = [x[:N] for x in X]
        # left-pad sequences which are too short with zeros
        X = [(N - len(x))*[0] + x for x in X]
        return np.array(X)
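
The trimming and padding done in `transform` can be checked in isolation. The sketch below uses a hypothetical helper `pad_or_trim` that mirrors those two steps for a single sequence; it is only meant to show which end is cut and which end is padded.

```python
def pad_or_trim(seq, length):
    """Keep the first `length` items, then left-pad with zeros up to `length`."""
    seq = list(seq[:length])
    return [0] * (length - len(seq)) + seq

print(pad_or_trim([5, 8], 4))           # [0, 0, 5, 8]
print(pad_or_trim([1, 2, 3, 4, 5], 4))  # [1, 2, 3, 4]
```

Long reviews therefore keep their beginning, and short reviews get their zeros in front, so the actual words sit at the end of the sequence, closest to the final RNN step.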

Let's use the newly defined class to convert our dataset to a sequence-of-integers representation.

In [None]:
tok = TextToIntSeq()
tok.fit(X)
Xt = tok.transform(X)

In [None]:
print(Xt.shape)
print(tok.transform(np.array([
    'Hello world'
])))

An integer sequence representation of text is easy to use with Keras. A dedicated `Embedding` layer converts a sequence of integers to a sequence of feature vectors, one per word in the sequence. For more details on the `Embedding` layer, [refer here](https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work).
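
Conceptually, an embedding is a lookup table with one row of features per word index. The `numpy` sketch below illustrates only that lookup, under toy assumptions: the table is random and tiny, whereas in Keras the table entries are trainable weights learned together with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 3  # toy sizes, far smaller than in the notebook
table = rng.normal(size=(vocab_size, embed_dim))  # one row of features per word index

int_seq = np.array([0, 0, 2, 4, 2])  # a padded integer sequence
features = table[int_seq]            # shape (5, 3): one feature row per position

print(features.shape)
```

The same word index always maps to the same feature row, so positions 2 and 4 of this toy sequence receive identical vectors.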

In [None]:
from sklearn.model_selection import train_test_split

from keras.models import Model
from keras.layers import Input, Dense, GRU, Embedding, Softmax
from keras.losses import sparse_categorical_crossentropy
from keras.optimizers import Adam

X_train, X_test, y_train, y_test = train_test_split(Xt, y)

# definition of the network
inp = Input(shape=X_train[0].shape)

h = Embedding(tok.num_words, 128)(inp)
h = GRU(128)(h)  # Task: optimize number of neurons!
h = Dense(2)(h)  # only 2 classes are present
h = Softmax()(h)

model = Model(inputs=[inp], outputs=[h])

# try using different optimizers and different optimizer configs
model.compile(
    optimizer=Adam(), 
    loss=sparse_categorical_crossentropy,
    metrics=['accuracy']
)

model.fit(X_train, y_train, batch_size=64, epochs=3)

Let's see how well our model performs.

In [None]:
loss, acc = model.evaluate(X_test, y_test, batch_size=64)
print('Test loss:', loss)
print('Test accuracy:', acc)

Now, to actually use the model!

In [None]:
my_input = tok.transform(np.array([
    'Best movie EVER!!!',
    'Worst movie EVER!!!'
]))

model.predict(my_input)