## MULTIVARIATE LSTM Sequence-to-Sequence Encoder-Decoder in Keras

This notebook is a quick starter, put together in a few minutes, that illustrates:
- how to implement a multivariate time-series NN approach in Keras using LSTM (many other solutions can be examined later, so follow this notebook)
- how to write a function that prepares the data for single-step (one y value) or multi-step (several consecutive y values) prediction
- how to develop and test a simple NN with a sliding window

I think this is a great use case for Keras Tuner to search for the best parameters (length of the sliding window, NN parameters, etc.).

<div class="alert alert-warning">
    <strong>Implemented NN architecture so far as an example:</strong>
    <ul>
        <li>LSTM -> Encoder-Decoder -> LSTM -> Dense</li>
        <li>CONV1D -> Encoder-Decoder -> LSTM -> Dense</li>
        <li>ConvLSTM2D -> Encoder-Decoder -> LSTM -> Dense</li>
    </ul>
</div>

<div class="alert alert-info">
This is the first step. The notebook is under development, but if you want to learn how to deal with time series using NNs, or you like the ideas, please follow this notebook.
<strong><br>Currently we do not focus on the submission ... we will do that after some additional improvements.</strong>
</div>

In [None]:
import pandas as pd
import numpy as np
from numpy import array
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.layers import Conv1D, MaxPooling1D, Flatten, ConvLSTM2D, Dropout
import tensorflow.keras.backend as K

import tensorflow as tf

import warnings
warnings.filterwarnings("ignore")

from tqdm.notebook import tqdm

In [None]:
n_steps = 8 # sliding-window length: 8 hourly rows of history
n_lookup = 1 # number of future steps to predict (1 = only the next value)

In [None]:
df_train = pd.read_csv("../input/tabular-playground-series-jul-2021/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-jul-2021/test.csv")
df_sub = pd.read_csv("../input/tabular-playground-series-jul-2021/sample_submission.csv")

print(df_test.shape)
print(df_sub.shape)

features = ['deg_C', 'relative_humidity', 'absolute_humidity', 'sensor_1', 'sensor_2', 'sensor_3', 'sensor_4', 'sensor_5']
targets = ['target_carbon_monoxide', 'target_benzene', 'target_nitrogen_oxides']
targets_values = np.log1p(df_train[targets]).values


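# Prepend n_steps rows from the end of train (its very last row is skipped) so the first test rows have a full history window
df_test = pd.concat([df_train[len(df_train)-n_steps-1:len(df_train)-1].drop(targets , axis = 1), df_test])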

df_all = pd.concat([df_train.drop(targets , axis = 1), df_test])

df_all['date_time'] = pd.to_datetime(df_all['date_time'])


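# Parse the timestamps first so the plots below get a real datetime axis instead of string labels
df_train['date_time'] = pd.to_datetime(df_train['date_time'])
df_test['date_time'] = pd.to_datetime(df_test['date_time'])
df_train.set_index('date_time', inplace=True)
df_test.set_index('date_time', inplace=True)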
print(df_test.shape)
print(df_all.shape)
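
A quick arithmetic check of why `n_steps` rows from the end of the training data are prepended to the test set. With the windowing used later in this notebook, a frame of `T` rows yields `T - n_steps - n_out + 1` windows, so `N` test rows plus `n_steps` history rows give exactly `N` windows when one step is predicted. The helper `n_windows` and the row count are made up for illustration.

```python
def n_windows(total_rows, n_steps, n_out):
    """Number of sliding windows that fit into total_rows rows."""
    return max(total_rows - n_steps - n_out + 1, 0)

n_steps, n_out = 8, 1
n_test_rows = 100  # hypothetical test-set size

# Without history rows, the first n_steps test rows would have no full window.
print(n_windows(n_test_rows, n_steps, n_out))            # 92
# With n_steps history rows prepended, every test row gets a prediction.
print(n_windows(n_test_rows + n_steps, n_steps, n_out))  # 100
```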

In [None]:
df_test.head(15)

Interested in the sensor data? Here is what my research turned up:

- Sensor_1 - (tin oxide) hourly averaged sensor response (nominally CO targeted) 
- Sensor_2 - (titania) hourly averaged sensor response (nominally NMHC targeted) 
- Sensor_3 - (tungsten oxide) hourly averaged sensor response (nominally NOx targeted) 
- Sensor_4 - (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted) 
- Sensor_5 - (indium oxide) hourly averaged sensor response (nominally O3 targeted) 

In [None]:
def plot_sensor(name):
    
    plt.figure(figsize=(16,4))

    plt.plot(df_train.index, df_train[name], label='train')
    plt.plot(df_test.index, df_test[name], label='test')
    plt.ylabel(name)
    plt.legend()
    plt.show()

for col in df_train[features].columns:
    plot_sensor(col)

In [None]:
def plot_autocor(name, df):
    
    plt.figure(figsize=(16,4))    
    timeLags = np.arange(1,2400)
    plt.plot(timeLags, [df[name].autocorr(dt) for dt in timeLags])  # pass the lags as x so the axis starts at lag 1
    plt.title(name); plt.ylabel('autocorr'); plt.xlabel('time lags')
    plt.show()

for col in df_train[features].columns:
    plot_autocor(col, df_train)
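
For each lag, the curves plotted by this cell show the Pearson correlation between a series and a copy of itself shifted by that many steps, which is what pandas `Series.autocorr(lag)` computes. Below is a minimal NumPy sketch of the same quantity on a synthetic signal. The helper `autocorr` and the sine wave are assumptions for illustration, not part of the dataset.

```python
import numpy as np

def autocorr(x, lag):
    """Pearson correlation between x[t] and x[t + lag]."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Hypothetical hourly signal with a clean daily (24-step) cycle over 10 days.
t = np.arange(24 * 10)
signal = np.sin(2 * np.pi * t / 24)

print(round(autocorr(signal, 24), 3))  # one full cycle later: close to 1
print(round(autocorr(signal, 12), 3))  # half a cycle later: close to -1
```

Peaks at multiples of 24 in the real plots would point to a daily pattern, which is one way to choose a sensible window length.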

### NEW FEATURES

In [None]:
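def cycle_sin_cos_coder(data, cols):
    # Encode each cyclical column as a sine/cosine pair.
    for col in cols:
        # Period = number of positions in the cycle (e.g. 24 for hours 0..23).
        # Dividing by max() alone would map hour 0 and hour 23 to the same angle.
        period = data[col].max() - data[col].min() + 1
        data[col + '_s'] = np.sin(2 * np.pi * data[col] / period)
        data[col + '_c'] = np.cos(2 * np.pi * data[col] / period)
    return data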
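
The idea behind the sine/cosine encoding: a raw hour column treats 23 and 0 as far apart, while on the unit circle they are neighbours. This is a minimal sketch, assuming a period of 24 for hours; `encode` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def encode(value, period):
    """Map a cyclical value to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Assumption: hours run 0..23, so the period is 24.
d_23_to_0 = np.linalg.norm(encode(23, 24) - encode(0, 24))
d_0_to_1 = np.linalg.norm(encode(0, 24) - encode(1, 24))
d_0_to_12 = np.linalg.norm(encode(0, 24) - encode(12, 24))

print(np.isclose(d_23_to_0, d_0_to_1))  # neighbouring hours are equally far apart
print(round(d_0_to_12, 3))              # opposite hours sit at the maximum distance of 2
```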

In [None]:
df_all['month'] = df_all['date_time'].dt.month
df_all['day'] = df_all['date_time'].dt.day
df_all['hour'] = df_all['date_time'].dt.hour

df_all = cycle_sin_cos_coder(df_all, ['month','day','hour'])
df_all.drop(['month','day','hour'], axis=1, inplace=True)
df_all.set_index('date_time', inplace=True)

print(df_all.shape)

In [None]:
df_all.head(5)

In [None]:
df_train = df_all[:len(df_train)].copy()  # copy so the new target columns are not written into a slice of df_all

df_train[targets] = targets_values
df_test = df_all[len(df_train):].copy()

In [None]:
df_train

In [None]:
train, test = train_test_split(df_train, shuffle = False, train_size=0.8)

### FEATURE TRANSFORMATION

In [None]:
# Fit the scaler on the training split only, then reuse it for the validation and submission data.
# MinMaxScaler works column by column, so one scaler covers all sensor features.
scaler = MinMaxScaler(feature_range=(-1,1))

train[features] = scaler.fit_transform(train[features])
test[features] = scaler.transform(test[features])
df_test[features] = scaler.transform(df_test[features])
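
What the scaling step does, written out in NumPy: the minimum and maximum are learned on the training split only and reused for later data, so later values outside the training range land outside [-1, 1]. This follows the formula `MinMaxScaler` applies for `feature_range=(-1, 1)`. The helper names and the numbers are made up for illustration.

```python
import numpy as np

def fit_minmax(train_col):
    """Return the (min, max) learned from the training column."""
    return train_col.min(), train_col.max()

def to_minus1_plus1(col, lo, hi):
    """Scale with the training min/max so that lo -> -1 and hi -> +1."""
    return 2 * (col - lo) / (hi - lo) - 1

train_col = np.array([10.0, 20.0, 30.0])   # hypothetical training values
valid_col = np.array([15.0, 40.0])         # hypothetical later values

lo, hi = fit_minmax(train_col)
print(to_minus1_plus1(train_col, lo, hi))  # values -1, 0 and 1
print(to_minus1_plus1(valid_col, lo, hi))  # values -0.5 and 2 (40 exceeds the training max)
```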

In [None]:
train.head(n_steps+1)

## SLIDING WINDOW 

This function splits the data using the SLIDING WINDOW approach:
- n_steps - the number of past steps (the window length) used to predict the output (one y or a series of y)
- n_lookup - the number of future steps to predict (passed to the function as n_out)
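
The two parameters can be illustrated on a toy array: each block of `n_steps` rows is paired with the `n_out` target values that come right after it. This is only a sketch on plain NumPy arrays; `toy_windows` and the data are hypothetical.

```python
import numpy as np

def toy_windows(X, y, n_steps, n_out):
    """Pair each block of n_steps rows of X with the n_out values of y that follow it."""
    X_win, y_win = [], []
    for start in range(len(X) - n_steps - n_out + 1):
        end = start + n_steps
        X_win.append(X[start:end])
        y_win.append(y[end:end + n_out])
    return np.array(X_win), np.array(y_win)

# Hypothetical toy data: 6 time steps, 2 features, target equal to the row index.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.arange(6)

X_win, y_win = toy_windows(X_toy, y_toy, n_steps=3, n_out=2)
print(X_win.shape, y_win.shape)  # (2, 3, 2) (2, 2)
print(y_win[0])                  # targets for rows 0..2 are the values at steps 3 and 4
```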

In [None]:
def split_sequences(Xsequences, ysequences, n_steps = 6, n_out = 1):
    X, y = list(), list()

    for i in range(len(Xsequences)):
        end_index = i + n_steps
        out_end_index = end_index + n_out
        
        if out_end_index > len(Xsequences):
            break
        
        seq_x = Xsequences.iloc[i : end_index, :] 
        if isinstance(ysequences, pd.core.series.Series):
            seq_y = ysequences.iloc[end_index : out_end_index]
            y.append(seq_y)

        X.append(seq_x)
        
    return array(X), array(y)
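
For longer series the row-by-row loop can get slow. One possible alternative, sketched here under the assumption that the inputs are plain NumPy arrays (for example `df.values`) and that NumPy 1.20 or newer is available, is `sliding_window_view`. The function name `split_sequences_np` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_sequences_np(X, y, n_steps, n_out):
    """Vectorised windows: X is (T, F), y is (T,); returns (N, n_steps, F) and (N, n_out)."""
    n = len(X) - n_steps - n_out + 1
    # sliding_window_view puts the window axis last, so move it next to the sample axis.
    X_win = sliding_window_view(X, n_steps, axis=0).transpose(0, 2, 1)[:n]
    y_win = sliding_window_view(y[n_steps:], n_out)
    return X_win, y_win

# Hypothetical data: 20 time steps, 3 features.
X = np.arange(60, dtype=float).reshape(20, 3)
y = np.arange(20, dtype=float)

X_win, y_win = split_sequences_np(X, y, n_steps=8, n_out=1)
print(X_win.shape, y_win.shape)  # (12, 8, 3) (12, 1)
```

The returned arrays are read-only views on the input, so copy them with `np.array(...)` before modifying them in place.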

In [None]:
Xtrain_seq_tcm, ytrain_seq_tcm = split_sequences(train.drop(targets, axis = 1), train['target_carbon_monoxide'], n_steps, n_lookup)
Xtest_seq_tcm, ytest_seq_tcm = split_sequences(test.drop(targets, axis = 1), test['target_carbon_monoxide'], n_steps, n_lookup)

Xtrain_seq_tb, ytrain_seq_tb = split_sequences(train.drop(targets, axis = 1), train['target_benzene'], n_steps, n_lookup)
Xtest_seq_tb, ytest_seq_tb = split_sequences(test.drop(targets, axis = 1), test['target_benzene'], n_steps, n_lookup)

Xtrain_seq_tno, ytrain_seq_tno = split_sequences(train.drop(targets, axis = 1), train['target_nitrogen_oxides'], n_steps, n_lookup)
Xtest_seq_tno, ytest_seq_tno = split_sequences(test.drop(targets, axis = 1), test['target_nitrogen_oxides'], n_steps, n_lookup)

n_features = Xtrain_seq_tcm.shape[2]

print(Xtrain_seq_tcm.shape, ytrain_seq_tcm.shape)
print(Xtest_seq_tcm.shape, ytest_seq_tcm.shape)

In [None]:
np.set_printoptions(suppress=True, linewidth=255)

Xtest_sub, _ = split_sequences(df_test, [], n_steps, n_lookup)
print(Xtest_sub[0])
print(Xtest_sub.shape)

In [None]:
np.set_printoptions(suppress=True, linewidth=255)

num_seq_show = 3

for i in range(num_seq_show):
    print(f'X{i}\n {Xtrain_seq_tcm[i]}')
    print(f'y{i}\n {ytrain_seq_tcm[i]} \n\n')

## SIMPLE NN MODEL
We use LSTM (but there are more possibilities to examine). I chose an Encoder-Decoder (RepeatVector) architecture because it lets us predict several steps into the future.
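
To make the role of RepeatVector concrete: the encoder compresses each input window into one vector, and that vector is copied once per output step before the decoder LSTM runs. The snippet below is a NumPy imitation of the shapes only, not Keras code, and the sizes are made up.

```python
import numpy as np

batch, units, n_lookup = 4, 100, 3   # hypothetical sizes

# Encoder output: one summary vector per input window.
encoded = np.random.rand(batch, units)

# RepeatVector(n_lookup) copies that vector once per output step.
repeated = np.repeat(encoded[:, np.newaxis, :], n_lookup, axis=1)

print(encoded.shape)   # (4, 100)
print(repeated.shape)  # (4, 3, 100)
```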

In [None]:
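def rmsle(y_true, y_pred):
    # Square root of Keras' mean squared logarithmic error.
    # Note: the targets were already transformed with np.log1p when the data was loaded,
    # so this loss applies a logarithm to them a second time.
    msle = tf.keras.losses.MeanSquaredLogarithmicError()
    return K.sqrt(msle(y_true, y_pred)) 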
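
Since the targets in this notebook are stored as `np.log1p` of the raw values, it may help to see how RMSLE on the raw scale relates to plain RMSE on the transformed scale. The NumPy sketch below uses made-up numbers and hypothetical helper names.

```python
import numpy as np

def rmsle_raw(y_true, y_pred):
    """RMSLE computed on the original target scale."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Plain RMSE."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.5, 4.0])   # hypothetical raw targets
y_pred = np.array([1.2, 2.0, 5.0])   # hypothetical raw predictions

# RMSE on log1p-transformed values equals RMSLE on the raw values.
same = np.isclose(rmse(np.log1p(y_true), np.log1p(y_pred)), rmsle_raw(y_true, y_pred))
print(same)  # True
```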

### A. LSTM -> Encoder-Decoder -> LSTM -> Dense

In [None]:
model_tcm = Sequential()
model_tcm.add(LSTM(100, activation='tanh', input_shape=(n_steps, n_features)))
model_tcm.add(RepeatVector(n_lookup))
model_tcm.add(LSTM(100, activation='tanh', return_sequences=True))
model_tcm.add(TimeDistributed(Dense(1)))
model_tcm.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.02), loss= rmsle)

model_tcm.summary()

In [None]:
tf.keras.utils.plot_model(model_tcm)

In [None]:
es = tf.keras.callbacks.EarlyStopping(patience=10, verbose=0, min_delta=0.001, monitor='val_loss', mode='auto', restore_best_weights=True)
red_lr = tf.keras.callbacks.LearningRateScheduler(lambda x: 1e-3 * 0.90 ** x)

def plot_model_learning(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
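
The schedule handed to `LearningRateScheduler` in this cell shrinks the learning rate by 10% every epoch, starting from 1e-3. A small plain-Python sketch of the values it produces; the function name `schedule` is only for illustration.

```python
def schedule(epoch):
    """Same formula as the lambda given to LearningRateScheduler."""
    return 1e-3 * 0.90 ** epoch

# Print the learning rate at a few epochs to see how quickly it decays.
for epoch in (0, 10, 50, 99):
    print(epoch, f"{schedule(epoch):.2e}")
```

After ten epochs the rate has dropped to roughly a third of its starting value, so most of the learning happens early in each `fit` call.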

In [None]:
N_SAMPLE = 20
yhat_tcm = np.zeros((Xtest_sub.shape[0],1))

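for samples in tqdm(range(N_SAMPLE)):
    tf.keras.backend.clear_session()
    
    # Note: compile() creates a fresh optimizer but keeps the weights, so each round
    # continues training the same model. The red_lr callback sets the learning rate
    # every epoch, which overrides the 0.02 given here.
    model_tcm.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.02), loss= rmsle)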
    history_tcm = model_tcm.fit(Xtrain_seq_tcm, ytrain_seq_tcm, 
                                validation_data = (Xtest_seq_tcm, ytest_seq_tcm), 
                                epochs=100, 
                                verbose = 0,
                                batch_size = 16, 
                                callbacks=[es, red_lr])
    
  
    yhat_tcm += np.expm1(model_tcm.predict(Xtest_sub)).reshape(-1,1)

yhat_tcm = yhat_tcm / N_SAMPLE

In [None]:
yhat_tcm

### B. ConvLSTM2D -> Encoder-Decoder -> LSTM -> Dense

In [None]:
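n_sub_steps = 4 # number of sub-sequences each window is split into
n_length = 2    # time steps per sub-sequence; n_sub_steps * n_length must equal n_steps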

model_tb = Sequential()
model_tb.add(ConvLSTM2D(64, (1,2), activation='relu', input_shape=(n_sub_steps, 1, n_length, n_features)))
model_tb.add(Flatten())
model_tb.add(RepeatVector(n_lookup))
model_tb.add(LSTM(200, activation='relu', return_sequences=True))
model_tb.add(TimeDistributed(Dense(100, activation='relu')))
model_tb.add(TimeDistributed(Dense(1)))

model_tb.compile(loss='mse', optimizer='adam')

model_tb.summary()

In [None]:
tf.keras.utils.plot_model(model_tb)

In [None]:
Xtrain_seq_tb = Xtrain_seq_tb.reshape((Xtrain_seq_tb.shape[0], n_sub_steps, 1, n_length, n_features))
Xtest_seq_tb = Xtest_seq_tb.reshape((Xtest_seq_tb.shape[0], n_sub_steps , 1, n_length, n_features))
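
The reshape for ConvLSTM2D only regroups the time axis: a window of `n_steps` rows becomes `n_sub_steps` consecutive chunks of `n_length` rows each, in the original order. A NumPy sketch with made-up sizes:

```python
import numpy as np

n_samples, n_steps, n_features = 5, 8, 3   # hypothetical sizes
n_sub_steps, n_length = 4, 2               # must satisfy n_sub_steps * n_length == n_steps

X = np.arange(n_samples * n_steps * n_features).reshape(n_samples, n_steps, n_features)
X5d = X.reshape(n_samples, n_sub_steps, 1, n_length, n_features)

print(X5d.shape)  # (5, 4, 1, 2, 3)
# Sub-sequence k holds the original time steps k*n_length .. k*n_length + n_length - 1.
print(np.array_equal(X5d[0, 1, 0], X[0, 2:4]))  # True
```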

In [None]:
N_SAMPLE = 20
yhat_tb = np.zeros((Xtest_sub.shape[0],1))

for samples in tqdm(range(N_SAMPLE)):
    tf.keras.backend.clear_session()
    
    model_tb.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.02), loss= rmsle)
    history_tb = model_tb.fit(Xtrain_seq_tb, ytrain_seq_tb, 
                          validation_data = (Xtest_seq_tb, ytest_seq_tb), 
                          epochs=100,  
                          batch_size = 16, 
                          verbose = 0, 
                          callbacks=[es, red_lr])
  
    yhat_tb += np.expm1(model_tb.predict(Xtest_sub.reshape(Xtest_sub.shape[0], n_sub_steps, 1, n_length, n_features))).reshape(-1,1)

yhat_tb = yhat_tb / N_SAMPLE

### C. CONV1D -> Encoder-Decoder -> LSTM -> Dense

In [None]:
model_tno = Sequential()
model_tno.add(Conv1D(64, 3, activation='relu', input_shape=(n_steps, n_features)))
model_tno.add(Conv1D(64, 3, activation='relu'))
model_tno.add(MaxPooling1D())
model_tno.add(Flatten())
model_tno.add(RepeatVector(n_lookup))
model_tno.add(LSTM(100, activation='relu', return_sequences=True))
model_tno.add(TimeDistributed(Dense(64, activation='relu')))
model_tno.add(TimeDistributed(Dense(1)))

model_tno.compile(loss='mse', optimizer='adam')

model_tno.summary()
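
The shapes reported by `model_tno.summary()` can be reproduced by hand. This sketch assumes the Keras defaults used above ('valid' padding, stride 1 for Conv1D, pool size 2 for MaxPooling1D); the helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    """Output length of a Conv1D layer with 'valid' padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size=2):
    """Output length of MaxPooling1D with 'valid' padding and stride equal to pool_size."""
    return length // pool_size

n_steps, filters = 8, 64
length = conv1d_len(n_steps, 3)      # first Conv1D:  8 -> 6
length = conv1d_len(length, 3)       # second Conv1D: 6 -> 4
length = maxpool1d_len(length)       # MaxPooling1D:  4 -> 2
print(length, length * filters)      # 2 128 (time steps left, vector size after Flatten)
```

With a window of only 8 steps there is little room left after two convolutions, which is worth keeping in mind when tuning `n_steps` or the kernel size.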

In [None]:
tf.keras.utils.plot_model(model_tno)

In [None]:
N_SAMPLE = 20
yhat_tno = np.zeros((Xtest_sub.shape[0],1))

for samples in tqdm(range(N_SAMPLE)):
    tf.keras.backend.clear_session()
    
    model_tno.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.02), loss= rmsle)
    history_tno = model_tno.fit(Xtrain_seq_tno, ytrain_seq_tno, 
                            validation_data = (Xtest_seq_tno, ytest_seq_tno), 
                            epochs=100, 
                            verbose = 0, 
                            batch_size = 16, 
                            callbacks=[es, red_lr])
    
  
    yhat_tno += np.expm1(model_tno.predict(Xtest_sub)).reshape(-1,1)

yhat_tno = yhat_tno / N_SAMPLE

### SUBMISSION

In [None]:
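# The averaged predictions have shape (n_rows, 1); flatten them to 1-D before assigning to a column
df_sub['target_carbon_monoxide'] = yhat_tcm.ravel()
df_sub['target_benzene'] = yhat_tb.ravel()
df_sub['target_nitrogen_oxides'] = yhat_tno.ravel()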

df_sub.to_csv('lstm_001.csv', index=False)

In [None]:
df_sub

This notebook is under development. If you like the ideas, please vote and follow - I will develop this notebook over the next days.