<h1 style="text-align:center;color:brown;">Univariate Time Series Autoregression</h1>

<img src="https://idoraquel.s3.eu-central-1.amazonaws.com/userimages/Drawing-3.sketchpad.png" />

<h2 style="text-align:center;color:brown;">The goal of this notebook</h2>
<p>
This notebook is for <b>beginners</b>. My main goal is to show how the time (or any other sequential) feature of a dataset can be used for modelling.
</p>

<p>
You probably already know how linear regression works.
</p>

<p><i><u>Linear regression:</u></i> Given a number of features, you update the weights in order to minimize the prediction error.</p>
<p><i><u>Autoregression of order N:</u></i> Given a sequence of target values over time, we fit the same linear regression, but the features are now the target values from the previous N time steps.</p>
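
<p>To make the definition concrete, here is a minimal sketch that is not part of the original pipeline. It simulates a hypothetical AR(2) series (the coefficients 0.6 and -0.3 and all variable names are assumptions chosen for illustration) and recovers the coefficients with ordinary least squares on the lagged values.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(2) process: y[t] = 0.6 * y[t-1] - 0.3 * y[t-2] + noise
true_w = np.array([0.6, -0.3])
n_steps = 5000
ar_series = np.zeros(n_steps)
for t in range(2, n_steps):
    ar_series[t] = (true_w[0] * ar_series[t - 1]
                    + true_w[1] * ar_series[t - 2]
                    + rng.normal(scale=0.1))

# The features are simply the two previous values of the series itself
lagged = np.column_stack([ar_series[1:-1], ar_series[:-2]])  # y[t-1], y[t-2]
next_value = ar_series[2:]                                   # y[t]

# Plain linear regression (least squares) on the lagged features
w_hat, *_ = np.linalg.lstsq(lagged, next_value, rcond=None)
print("estimated coefficients:", w_hat)
```

<p>The estimated coefficients should land close to the ones used in the simulation, which is all that "autoregression" means here: regressing a series on its own past.</p>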

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

# Add a steadily growing linear trend to the data.
# For small values, the data still passes the Dickey-Fuller stationarity test.
HB_GROW = 1
# Number of previous steps the model considers.
# Try changing this number to 5 and see the difference :)
AR_ORDER = 50

data = np.random.rand(10000) ** 2
season1 = np.sin(np.arange(0, 10000 / 5, 0.2))
season2 = np.cos(np.arange(0, 10000 / 10, 0.1))
noise = np.random.randn(10000,)
stationary_data = data + season1 + season2 + noise
data = stationary_data + np.linspace(0, HB_GROW, 10000)

In [None]:
stats = adfuller(data)
assert stats[1] < 0.05, \
       "The data is not stationary, please decrease the HB_GROW" # still stationary ?

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
ax1.title.set_text("First 2000 points")
ax1.plot(data[:2000])
ax2.title.set_text("First 300 points")
ax2.plot(data[:300]);

In [None]:
# Autoregression order
p = AR_ORDER
train = []
# The number of possible windows is len(data) - p + 1,
# but the last window has no following value to use as a target,
# so it is skipped here.
max_start_index = len(data) - p
for li in range(0, max_start_index):
    sl = data[li:li+p]
    train.append(sl)

train = np.array(train)
print(f"Train set consists of {train.shape[0]} examples with {train.shape[1]} features each")

targets = data[p:]

assert train.shape == (10000 - p, p)
assert targets.shape == (10000 - p,)


# Now we have something like this (shown for order 5; the actual order is p)
# train_seq   target
# 1 2 3 4 5   6
# 2 3 4 5 6   7
# 3 4 5 6 7   8
# Ensure that the first target equals the last element of the second training example,
# the second-to-last element of the third example, etc. (the window shifts by 1 each time)
assert targets[0] == train[1, -1]
assert targets[0] == train[2, -2]
assert targets[0] == train[3, -3]
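
<p>The same windows can also be built without an explicit loop. The sketch below assumes NumPy 1.20 or newer (for <code>sliding_window_view</code>) and uses a toy series with hypothetical names, so it does not overwrite the notebook's variables.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy_series = np.arange(1, 11, dtype=float)  # toy series 1..10
order = 5                                   # hypothetical AR order

# Loop version, mirroring the cell above
windows_loop = np.array(
    [toy_series[i:i + order] for i in range(len(toy_series) - order)]
)

# Vectorized version: every window of length `order`,
# minus the last one, which has no target after it
windows_fast = sliding_window_view(toy_series, order)[:-1]
window_targets = toy_series[order:]

print(windows_fast[:3])
print(window_targets[:3])
```

<p>Both versions produce the same matrix; the vectorized one returns a read-only view of the original array, so copy it before modifying it in place.</p>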

<h2 style="text-align:center;color:brown;">Validation &amp; test sets</h2>

<p><i><u style="color:red">Important:</u></i> When splitting the dataset, we should always respect the time order. Namely, evaluate the model <b>only on unseen</b> data.
In other words, on data that lies in the future relative to the training set.</p>
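
<p>As a rough illustration of a time-ordered split, the helper below (a hypothetical function with assumed 60/20/20 proportions, not something from a library) returns three consecutive, non-overlapping slices.</p>

```python
def chronological_split(n_samples, train_frac=0.6, val_frac=0.2):
    """Return (train, val, test) slices that follow each other in time."""
    train_end = int(n_samples * train_frac)
    val_end = train_end + int(n_samples * val_frac)
    return (slice(0, train_end),
            slice(train_end, val_end),
            slice(val_end, n_samples))


tr, va, te = chronological_split(10)
indices = list(range(10))
print(indices[tr], indices[va], indices[te])
```

<p>Each slice starts where the previous one ends, so the validation and test examples always come after everything the model was trained on.</p>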

In [None]:
# Train only on the first 6000 examples so that validation and test data stay unseen
TRAIN_SET_SLICE = slice(None, 6000)
VAL_SET_SLICE = slice(6000, 8000)
TEST_SET_SLICE = slice(8000, 10000 - p)
X_train, y_train = train[TRAIN_SET_SLICE, :], targets[TRAIN_SET_SLICE]
X_val, y_val = train[VAL_SET_SLICE, :], targets[VAL_SET_SLICE]
X_test, y_test = train[TEST_SET_SLICE, :], targets[TEST_SET_SLICE]

# a few sanity checks never hurt
assert X_val.shape == (2000, p)
assert y_test.shape == (2000 - p,)
assert all(y_val == targets[6000:8000])

<h2 style="text-align:center;color:brown;">
    AR Model
</h2>
<p>To make it more intuitive, I won't import any black-box models from a package (although they are great :))</p>
<p>Instead, we will build the model from a single Dense layer from Keras (TensorFlow).</p>
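
<p>A Dense layer with one unit and no activation computes a weighted sum of its inputs plus a bias, which is exactly a linear regression. The NumPy sketch below illustrates that idea on toy data with hypothetical names; it uses full-batch gradient descent rather than the mini-batch SGD Keras runs, so treat it as an approximation of what happens during training, not a reimplementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from a known (hypothetical) linear rule
X_toy = rng.normal(size=(1000, 3))
w_true, b_true = np.array([0.5, -1.0, 2.0]), 0.3
y_toy = X_toy @ w_true + b_true + rng.normal(scale=0.01, size=1000)

# One weight per input feature plus one bias, like Dense(1)
w, b = np.zeros(3), 0.0
lr = 0.05

for epoch in range(200):
    pred = X_toy @ w + b                       # forward pass
    err = pred - y_toy
    # Gradients of the mean squared error with respect to w and b
    w -= lr * 2 * X_toy.T @ err / len(y_toy)
    b -= lr * 2 * err.mean()

print("weights:", w, "bias:", b)
```

<p>With the lagged values as inputs, the learned weights play the role of the autoregression coefficients.</p>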

In [None]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

In [None]:
inputs = Input(shape=(p,))
outputs = Dense(1)(inputs)
model = Model(inputs, outputs)
opt = SGD(learning_rate=1e-3)
model.compile(optimizer=opt, loss="mse")
model.summary()

reduce_lr = ReduceLROnPlateau(patience=10, factor=0.3, min_lr=1e-10)
early_stop = EarlyStopping(patience=20)

In [None]:
r = model.fit(X_train, 
              y_train, 
              validation_data=(X_val, y_val),
              batch_size=32,
              epochs=500, 
              verbose=False, 
              shuffle=True, callbacks=[reduce_lr, early_stop])

In [None]:
plt.plot(r.history["loss"], color="blue")
plt.plot(r.history["val_loss"], color="orange");

Take a look at the orange curve (validation loss over epochs). It is so "nervous" because the data is **not clearly stationary**.

In other words, the trend combined with the amount of noise keeps the model from converging smoothly.

Consider *decreasing* HB_GROW to 0.1 or removing this term from the data definition:
> \+ np.linspace(0, HB_GROW, 10000)
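
To get a feel for how much the trend term alone moves the data, the small sketch below compares its average contribution before the validation window (points 0 to 6000) with its contribution inside that window (points 6000 to 8000). The loop values 1.0 and 0.1 stand in for HB_GROW, and the names are chosen only for this illustration.

```python
import numpy as np

n_points = 10000
shifts = {}
for grow in (1.0, 0.1):
    trend = np.linspace(0, grow, n_points)
    # Mean of the trend before the validation window vs. inside it
    before = trend[:6000].mean()
    inside = trend[6000:8000].mean()
    shifts[grow] = inside - before
    print(f"grow={grow}: before={before:.3f}, inside={inside:.3f}")
```

The gap between the two averages scales directly with the growth value, so a smaller HB_GROW means the validation data looks more like the data that precedes it.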

In [None]:
model.evaluate(X_test, y_test)

In [None]:
test_pred = model.predict(X_test)
plt.plot(y_test[:100])
plt.plot(test_pred[:100], color="green");