# Bike sharing problem

You can get the dataset here: https://www.kaggle.com/competitions/bike-sharing-demand
Note that the neural-network-based models presented here were NOT hyperparameter-tuned, as I am working on an M1 MacBook. For any inquiries, and for the README.md, please comment at "https://github.com/juyoungwang/Tensorflow_DeepAR".

This is toy code I wrote for bike-sharing demand time series forecasting, using a manually implemented DeepAR (https://arxiv.org/abs/1704.04110) architecture in TensorFlow (Zhang et al. already have a nice Torch implementation, which is why I decided to write the TensorFlow one).

To run the code, first put the data downloaded from https://www.kaggle.com/competitions/bike-sharing-demand into the data folder of the repository.

I have three neural-network-based time series forecasting methods implemented below:

* A naive LSTM-based one, with a single LSTM as the intermediate layer.
* The DeepAR architecture without ancestral sampling (it does not feed the previous time step's forecast back as an input when predicting the next time step's demand).
* The DeepAR architecture with ancestral sampling (it feeds the previous time step's forecast back as an input for the next step's forecast).

My DeepAR implementation differs from other TensorFlow or PyTorch implementations of DeepAR in the following aspects:
* My implementation handles categorical features by applying a separate embedding layer to each of them.
* After the embedding step, the ordinal features and the embedded categorical features are concatenated and used as the input to the LSTM.
* As I used a min-max scaler for the data, I assumed the forecast value is never below 0. For this reason, I applied a "relu" activation to the mean value prediction.
* Sampling-based forecasting is not implemented, due to my limited computational resources (M1 MacBook with no GPU). Its implementation is, however, quite straightforward.

As the model with ancestral sampling is NON-SEQUENTIAL, I had to use a custom training loop together with the state-transfer functionality of the Keras LSTM layer. To the best of my knowledge, existing TensorFlow implementations mostly use the sequential model supported by Keras and do not properly implement this ancestral sampling part, which is why I made this implementation.

**NOTES**: 
* I have not done any hyperparameter tuning of the models. They were implemented just for fun on my M1 MacBook, which does not even have CUDA.
* Modularizing the code may be my next step, though I am currently preparing for coding interviews, as I am looking for AI/ML/OR jobs in Korea to complete my military service.
* To train the model "best", careful time series window selection is required because of the way the training data is designed (only approximately the first 20 days of each month are included). That step is skipped in this notebook.
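
As a rough illustration of the window selection mentioned in the last note, one option is to cut the hourly series wherever consecutive timestamps are more than one hour apart, and to build windows only inside each contiguous segment. The sketch below uses made-up timestamps and a hypothetical helper `contiguous_segments`; it is not part of this notebook's pipeline.

```python
from datetime import datetime, timedelta

def contiguous_segments(timestamps, max_gap=timedelta(hours=1)):
    """Split a sorted list of timestamps into runs with no gap larger than max_gap."""
    segments, start = [], 0
    for k in range(1, len(timestamps)):
        if timestamps[k] - timestamps[k - 1] > max_gap:
            segments.append((start, k))  # half-open index range [start, k)
            start = k
    segments.append((start, len(timestamps)))
    return segments

# Toy data: 3 hourly records, a gap of several days, then 2 more hourly records
ts = [datetime(2011, 1, 19, 21) + timedelta(hours=h) for h in range(3)]
ts += [datetime(2011, 2, 1, 0) + timedelta(hours=h) for h in range(2)]
print(contiguous_segments(ts))
```

Windows would then be generated separately for each returned index range, so that no window spans a gap between months.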

Let's start with the basics.

## 1. GBR
### 1.1. Raw data processor for gradient boosted regression algorithm
We will:
* drop the ["casual", "registered"] columns, as they are unknown before the true demand is observed.
* split datetime into year, month, day and hour, since we would like to use them as covariates. Note that the dataset is recorded on an hourly basis, so year, month, day and hour are the only time covariates we consider.

In [None]:
import seaborn as sns
sns.set_style("whitegrid")
sns.set(rc={'figure.figsize':(15,7.5)})

In [None]:
import pandas as pd
train = pd.read_csv("../input/bike-sharing-demand/train.csv", parse_dates = ["datetime"])
test = pd.read_csv("../input/bike-sharing-demand/test.csv", parse_dates = ["datetime"])
features = list(train)[:-3]
label = list(train)[-1]

def preprocessor(df, features):
    df["year"] = pd.to_datetime(df['datetime']).dt.year
    df["month"] = pd.to_datetime(df['datetime']).dt.month
    df["day"] = pd.to_datetime(df['datetime']).dt.day
    df["hour"] = pd.to_datetime(df['datetime']).dt.hour
    # The test set has no "count" column, so only the features are returned in that case
    try:
        return df[["year", "month", "day", "hour"] + features[1:]], df["count"]
    except KeyError:
        return df[["year", "month", "day", "hour"] + features[1:]]

In [None]:
train_x, train_y = preprocessor(train, features)
test_x = preprocessor(test, features)

The code below shows that the dataset contains no NaN or null values:

In [None]:
print(train_x.info())
print(train_y.isnull().sum())
print(test_x.info())

### 1.2. Gradient boosting regressor
The code for this part comes from https://www.kaggle.com/code/kongnyooong/bike-sharing-demand-for-korean-beginners/notebook, and is used purely for comparison with the neural-network-based methods. I was about to implement this myself, but since an implementation already exists, there is no need to. For further details, please check the notebook linked above.

## 2. Deep learning

### 2.1. Data processing
* Unlike tree-based methods such as XGBoost, neural networks require (or at least perform better with) data normalization (or standardization).
* To scale the data, we will perform min-max normalization.
* For ancestral sampling, we will append $z_{t-1}$ to the input vector $x_t$, i.e. the RNN input at time step $t$ will be $(z_{t-1},x_t)$, where $(z_s)_s$ is the target series and $(x_s)_s$ are the known covariates.
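
To make the $(z_{t-1},x_t)$ layout concrete, here is a minimal NumPy sketch with made-up numbers. The helper name `add_lagged_target` and the fill value used for the very first step are assumptions for illustration only.

```python
import numpy as np

def add_lagged_target(x, z, first_value=0.0):
    """Return an array whose row t is (z[t-1], x[t]); row 0 uses first_value."""
    z_prev = np.concatenate([[first_value], z[:-1]])
    return np.column_stack([z_prev, x])

x = np.array([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]])  # toy covariates x_t
z = np.array([10.0, 20.0, 30.0])                    # toy target series z_t
rnn_input = add_lagged_target(x, z)
print(rnn_input)
```

The first column of `rnn_input` is the target shifted by one step, so the row for time step $t$ never contains $z_t$ itself.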

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
train = pd.read_csv("../input/bike-sharing-demand/train.csv", parse_dates = ["datetime"])
test = pd.read_csv("../input/bike-sharing-demand/test.csv", parse_dates = ["datetime"])
features = list(train)[:-3]
label = list(train)[-1]

def preprocessor(df, features):
    df["year"] = pd.to_datetime(df['datetime']).dt.year
    df["month"] = pd.to_datetime(df['datetime']).dt.month
    df["day"] = pd.to_datetime(df['datetime']).dt.day
    df["hour"] = pd.to_datetime(df['datetime']).dt.hour
    # The test set has no "count" column, so only the features are returned in that case
    try:
        return df[["year", "month", "day", "hour"] + features[1:]], df["count"]
    except KeyError:
        return df[["year", "month", "day", "hour"] + features[1:]]

train_x, train_y = preprocessor(train, features)
test_x = preprocessor(test, features)

For implementation convenience, we will treat year, month, day and hour as continuous features. Whether they should instead be treated as categorical features is, admittedly, highly debatable.

In [None]:
train_x.head()

Note that there is a high concentration of windspeed = 0 values, which is implausible for real measurements.

In [None]:
plt.hist(train_x["windspeed"])

* Assuming that unavailable windspeed data were replaced by 0, let us drop this column.
* Furthermore, temp and atemp are almost linearly related to each other. Let us also drop the atemp column, to avoid issues due to high correlation.
* Lastly, we will shift the season and weather features to the range 0 to 3 instead of 1 to 4, so that they can be used directly as embedding indices.

In [None]:
cont_feat = ['year', 'month', 'day', 'hour', 'temp', 'humidity']
disc_feat = ['season', 'holiday', 'workingday', 'weather']
train_x['season'] = train_x['season'] - 1
train_x['weather'] = train_x['weather'] - 1
test_x['season'] = test_x['season'] - 1
test_x['weather'] = test_x['weather'] - 1
train_x = train_x[cont_feat + disc_feat]
test_x = test_x[cont_feat + disc_feat]

We now apply min-max scaling to the continuous features:

In [None]:
from sklearn.preprocessing import MinMaxScaler
MMSc_feature = MinMaxScaler()
MMSc_feature.fit(train_x[cont_feat])
train_x[cont_feat] = MMSc_feature.transform(train_x[cont_feat])
test_x[cont_feat] = MMSc_feature.transform(test_x[cont_feat])

In [None]:
MMSc_label = MinMaxScaler()
MMSc_label.fit(np.asarray(train_y).reshape(-1,1))
train_y = MMSc_label.transform(np.asarray(train_y).reshape(-1,1))
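
For reference, min-max scaling maps each value to $(x - x_{\min}) / (x_{\max} - x_{\min})$, with the minimum and maximum taken from the training data only. A minimal NumPy sketch on toy numbers (not the Kaggle data) follows; it mirrors the formula rather than scikit-learn's internals, and the helper names `scale` and `unscale` are made up for this example.

```python
import numpy as np

train_col = np.array([1.0, 5.0, 9.0])   # toy training values
test_col = np.array([3.0, 11.0])        # toy test values

lo, hi = train_col.min(), train_col.max()  # statistics come from the training split only

def scale(v):
    return (v - lo) / (hi - lo)

def unscale(s):
    return s * (hi - lo) + lo

print(scale(train_col))   # training values land in [0, 1]
print(scale(test_col))    # test values may fall outside [0, 1]
```

Because the scaler is fit on the training split, scaled test values are not guaranteed to stay inside [0, 1].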

Now, let us split the training dataset chronologically into a training part (first 70%), a validation part (next 20%) and a held-out test part (last 10%).

In [None]:
train = train_x.copy()
train["label"] = train_y
n = len(train)
train_df = train[0:int(n*0.7)]
val_df = train[int(n*0.7):int(n*0.9)]
test_df = train[int(n*0.9):]

It is shown in a variety of literature that time series forecasting performance can be drastically enhanced by using the previous target series value as an input. However, to keep the code simple, we skip that step for now.

### 2.2. Tensorflow implementation for LSTM-based models

* The min-max-scaled label lies in the range 0 to 1. Since it is never negative, we use ReLU() as the output activation function.

#### 2.2.1. Time series window generation

Below, we build TensorFlow iterators over windowed arrays of shape (batch_size, time steps, features).

In [None]:
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Dropout, Bidirectional, LSTM, Dense, TimeDistributed, Softmax, Multiply, Lambda, Reshape
from tensorflow.keras.regularizers import l2
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint, TensorBoard
import tensorflow_probability as tfp


In [None]:
def windowed_dataset(series, window_size, batch_size, shuffle_buffer, train = True):
  dataset = tf.data.Dataset.from_tensor_slices(series)
  dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
  dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
  dataset = dataset.shuffle(shuffle_buffer)
  if train:
    dataset = dataset.batch(batch_size).prefetch(1)
  else:
    dataset = dataset.batch(series.shape[0]).prefetch(1)
  return dataset
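
To see what `windowed_dataset` yields before shuffling and batching, the same sliding-window logic can be sketched in plain NumPy. This is an illustration on a toy array, under the assumption that windows of length `window_size + 1` are taken with a shift of 1 and incomplete windows are dropped, as in the `window(...)` call above. The helper name `sliding_windows` is made up for this example.

```python
import numpy as np

def sliding_windows(series, window_size):
    """Stack all complete windows of length window_size + 1, shifted by one step."""
    length = window_size + 1
    n_windows = series.shape[0] - length + 1
    return np.stack([series[s:s + length] for s in range(n_windows)])

toy = np.arange(20).reshape(10, 2)   # 10 time steps, 2 columns
windows = sliding_windows(toy, window_size=3)
print(windows.shape)                 # (7, 4, 2): windows, time steps, columns
```

With `window_size = 24`, each window therefore holds 25 consecutive rows.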

In [None]:
window_size = 24
batch_size = 32
shuffle_buffer = 1000
train_dat = windowed_dataset(train_df, window_size, batch_size, shuffle_buffer)
validation_dat = windowed_dataset(val_df, window_size, batch_size, shuffle_buffer)
test_dat = windowed_dataset(test_df, window_size, batch_size, shuffle_buffer, train = False)

Each training batch has the shape:

In [None]:
print(next(train_dat.as_numpy_iterator()).shape)

In [None]:
train_dat

which looks like:

In [None]:
pd.DataFrame(data = next(train_dat.as_numpy_iterator())[0, :, :]).head(5)

while validation and test are of shape:

In [None]:
print(next(validation_dat.as_numpy_iterator()).shape)
print(next(test_dat.as_numpy_iterator()).shape)

Note that columns from
* 0 to 5 are ordinal features,
* 6 to 9 are categorical features,
* 10 is label

Let us define list of columns corresponding to:
* Ordinal features (=continuous features).
* Categorical features (=discrete features).
* Label

using the code below:

In [None]:
cont_feat_tf = list(range(0,6))
disc_feat_tf = list(range(6,10))
label_tf = [10]
# pd.DataFrame(data = next(train_dat.as_numpy_iterator())[0, :, :]).iloc[:, cont_feat_tf]

#### 2.2.2. Naive LSTM-based model

Now we will build a naive LSTM-based model for the forecasting problem. First, let's define the loss.

In [None]:
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))

Now, we will define the embedding dimension for discrete features. We will follow a general rule from https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html, which suggests:
$$\text{Embedding dimension} = \sqrt[4]{\text{Number of categories}}$$

Decimals will be rounded to the closest integers.

In [None]:
def embedding_dimension(cat_num):
    return round(cat_num**0.25)

cat_nums = [4, 2, 2, 4] # 4 for season, 2 for holiday, 2 for working day, 4 for weather
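
As a quick check of the rule above, evaluating it for the category counts in this dataset gives an embedding dimension of 1 for every categorical feature:

```python
def embedding_dimension(cat_num):
    # Fourth root of the number of categories, rounded to the nearest integer
    return round(cat_num ** 0.25)

cat_nums = [4, 2, 2, 4]  # season, holiday, workingday, weather
dims = [embedding_dimension(n) for n in cat_nums]
print(dims)  # [1, 1, 1, 1]
```

With so few categories, each embedding layer here therefore maps a category index to a single learned scalar.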

Below is the list of hyperparameters:

In [None]:
embedding_nums = [embedding_dimension(i) for i in cat_nums]
dropout_rates = 0
lstm_units = 64
l1_penalty = 0

Now, let's define the model:

In [None]:
from tensorflow.keras.regularizers import l1

# Input layer
inp = tf.keras.Input((None, train.shape[1]-1), name = 'Input')

# Continuous features
cont = K.stack([inp[:, :, i] for i in cont_feat_tf], axis = 2)

# Discrete features
cats_emb = []
for d_f in disc_feat_tf:
    cat_inp = inp[:, :, d_f]
    d_fm6 = d_f - 6
    cats_emb.append(Embedding(cat_nums[d_fm6], embedding_nums[d_fm6], name = "Embedding_{0}".format(disc_feat[d_fm6]))(cat_inp))
sub_embs = tf.concat([cont, K.stack(cats_emb, axis = 2)[:, :, :, 0]], axis = 2)
full_embs = Dropout(dropout_rates)(sub_embs)

LSTM_out = tf.keras.layers.LSTM(lstm_units, return_sequences=True, name = "LSTM")(full_embs)
Output_dense = Dense(1, activation='relu', name='Output', kernel_regularizer=l1(l1_penalty))
Output = TimeDistributed(Output_dense, name='Output_timedistributed')(LSTM_out)


Below, let's compile the model:

In [None]:
lstm_model = tf.keras.Model(inputs=inp, outputs=Output, name='Basic_LSTM')
lstm_model.compile(optimizer='adam', loss=rmse)

And now, let's train the model:

In [None]:
epoch_num = 20
val_loss = []
for epoch in range(epoch_num):
    print("We are at epoch = {0}".format(epoch))
    for train_batch in train_dat.as_numpy_iterator():
        tr_x = train_batch[:, :, :-1]
        tr_y = np.expand_dims(train_batch[:, :, -1], axis=-1)
        lstm_model.fit(tr_x, tr_y, epochs=1, verbose=0, shuffle=False)
    vl_track = []
    for validation_batch in validation_dat.as_numpy_iterator():
        val_x = validation_batch[:, :, :-1]
        val_y = np.expand_dims(validation_batch[:, :, -1], axis=-1)
        vl_track.append(rmse(lstm_model.predict(val_x), val_y))
    val_loss.append(np.mean(vl_track))

In [None]:
sns.set_style("whitegrid")
plt.plot(val_loss)
plt.title("Validation loss curve")
plt.show()

Below are the prediction results for the first 50 time steps:

In [None]:
pred_val = MMSc_label.inverse_transform(lstm_model(np.expand_dims(np.asarray(val_df.iloc[:50, :-1]), axis = 0))[:, :, 0])[0, :]
true_val = MMSc_label.inverse_transform(np.asarray(val_df.iloc[:50, -1]).reshape(-1,1))[:, 0]
plt.plot(pred_val, 'r', label = "Prediction")
plt.plot(true_val, 'b', label = "True value")
plt.legend(loc="upper left")
plt.show()

#### 2.2.3. DeepAR without ancestral sequence sampling functionality

DeepAR is a probabilistic time series forecasting model developed by Salinas et al. Here, I am just providing a simple implementation for this, in order to assess the performance of the model compared to the naive LSTM-based one.

To avoid numerical issues, we add 1e-4 to the predicted variance (the second network output, named sigma in the code).

Also, my M1 MacBook does not support TensorFlow Probability for some reason. I will borrow an existing implementation of the Gaussian negative log-likelihood loss and modify the code accordingly.

In [None]:
# https://fairyonice.github.io/Create-a-neural-net-with-a-negative-log-likelihood-as-a-loss.html
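def nll_gaussian(y_true, y_pred):
    # y_pred[..., 0] is the predicted mean; y_pred[..., 1] is treated as the variance
    mu = y_pred[:, :, 0]
    sigma = y_pred[:, :, 1] + 1e-4  # small offset keeps the division and the log finite
    y_true = y_true[:, :, 0]
    square = tf.square(mu - y_true)  # element-wise squared error, shape (batch, time)
    # Gaussian negative log-likelihood, up to a factor of 1/2 and an additive constant
    ms = tf.divide(square, sigma) + K.log(sigma)
    ms = tf.reduce_mean(ms)
    return ms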
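
For reference, the relation between this loss and the Gaussian negative log-likelihood can be written out as follows, assuming the second network output is interpreted as the variance $\sigma^2$ (which is how the code above uses it):

$$-\log \mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log \sigma^2 + \frac{(y-\mu)^2}{2\sigma^2}$$

$$\frac{(y-\mu)^2}{\sigma^2} + \log \sigma^2 = 2\left(-\log \mathcal{N}(y \mid \mu, \sigma^2)\right) - \log(2\pi)$$

The expression averaged in `nll_gaussian` is thus twice the negative log-likelihood minus a constant, so both have the same minimizer.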

Model hyperparameters are:

In [None]:
embedding_nums = [embedding_dimension(i) for i in cat_nums]
dropout_rates = 0.1
lstm_units = 40
l1_penalty = 0.001

Let's define the model below.

In [None]:
from tensorflow.keras.regularizers import l1

# Input layer
inp = tf.keras.Input((None, train.shape[1]-1), name = 'Input')

# Continuous features
cont = K.stack([inp[:, :, i] for i in cont_feat_tf], axis = 2)

# Discrete features
cats_emb = []
for d_f in disc_feat_tf:
    cat_inp = inp[:, :, d_f]
    d_fm = d_f - disc_feat_tf[0]
    cats_emb.append(Embedding(cat_nums[d_fm], embedding_nums[d_fm], name = "Embedding_{0}".format(disc_feat[d_fm]))(cat_inp))
sub_embs = tf.concat([cont, tf.concat(cats_emb, axis = 2)], axis = 2)
full_embs = Dropout(dropout_rates)(sub_embs)

LSTM_out_1 = tf.keras.layers.LSTM(lstm_units, return_sequences=True, name = "LSTM_1")(full_embs)
LSTM_out_2 = tf.keras.layers.LSTM(lstm_units, return_sequences=True, name = "LSTM_2")(LSTM_out_1)
LSTM_out_3 = tf.keras.layers.LSTM(lstm_units, return_sequences=True, name = "LSTM_3")(LSTM_out_2)

Output_dense_mu = Dense(1, activation='linear', name='Output_mu', kernel_regularizer=l1(l1_penalty))
Output_dense_sigma = Dense(1, activation='softplus', name='Output_sigma', kernel_regularizer=l1(l1_penalty))
mu = TimeDistributed(Output_dense_mu, name='Mu_TD')(LSTM_out_3)
sigma = TimeDistributed(Output_dense_sigma, name='Sigma_TD')(LSTM_out_3)

DeepAR_model = tf.keras.Model(inputs=inp, outputs=tf.concat([mu, sigma], axis = 2), name='DeepAR')
DeepAR_model.compile(optimizer='adam', loss=nll_gaussian)


In [None]:
epoch_num = 20
val_loss = []
for epoch in range(epoch_num):
    print("We are at epoch = {0}".format(epoch))
    for train_batch in train_dat.as_numpy_iterator():
        tr_x = train_batch[:, :, :-1]
        tr_y = np.expand_dims(train_batch[:, :, -1], axis=-1)
        DeepAR_model.fit(tr_x, tr_y, epochs=1, verbose=0, shuffle=False)
#     DeepAR_model.save_weights(os.getcwd() + "/DeepAR_weights/{0}.pth".format(epoch))
    vl_track = []
    for validation_batch in validation_dat.as_numpy_iterator():
        val_x = validation_batch[:, :, :-1]
        val_y = np.expand_dims(validation_batch[:, :, -1], axis=-1)
        # Compare only the predicted mean (channel 0) with the label; channel 1 is the variance
        vl_track.append(rmse(DeepAR_model.predict(val_x)[:, :, :1], val_y))
    val_loss.append(np.mean(vl_track))

Below, we have the prediction results for the validation dataset:

In [None]:
val_window = np.expand_dims(np.asarray(val_df.iloc[:50, :-1]), axis = 0)
val_pred = DeepAR_model(val_window).numpy()
pred_val = MMSc_label.inverse_transform(val_pred[:, :, 0])[0, :]
# Channel 1 is a variance in scaled units: take the square root, then rescale by the label range only (no offset)
pred_sigma = np.sqrt(val_pred[0, :, 1]) * MMSc_label.data_range_[0]
true_val = MMSc_label.inverse_transform(np.asarray(val_df.iloc[:50, -1]).reshape(-1,1))[:, 0]

z_score = 1.65
LB = np.clip(pred_val - z_score*pred_sigma, a_min = 0, a_max = None)
UB = np.clip(pred_val + z_score*pred_sigma, a_min = 0, a_max = None)

The plot below shows an approximately 90% prediction interval around the mean forecasts:

In [None]:
plt.plot(pred_val, 'r', label = "Prediction")
plt.fill_between(range(50), LB, UB, color='chocolate', alpha=0.2)
plt.plot(true_val, 'b', label = "True value")
plt.legend(loc="upper left")
plt.show()
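
The multiplier 1.65 can be checked against the standard normal distribution: a symmetric interval of ±1.65 standard deviations covers roughly 90% of the probability mass. A small standard-library check:

```python
from statistics import NormalDist

z_score = 1.65
# Probability mass of a standard normal between -z_score and +z_score
coverage = NormalDist().cdf(z_score) - NormalDist().cdf(-z_score)
print(round(coverage, 3))
```

This holds under the Gaussian assumption of the model; clipping the lower bound at 0 makes the plotted band asymmetric near zero.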

Note: A rigorous implementation of DeepAR requires feeding the previous time step's output back as a new input when forecasting, which is the step I skipped here, since I just wanted to play with the dataset.

#### 2.2.4. DeepAR with ancestral forecasting functionality

Below, we build the DeepAR implementation using TensorFlow.
During the training phase, the network receives
$$(z_{t-1},x_t)$$
with the aim of predicting
$$z_t$$
where $(z_t)_t$ is the target series and $(x_t)_t$ are the covariates.

We use the negative Gaussian log-likelihood as the objective function.
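
The difference between training and multi-step forecasting can be sketched without any neural network. In the toy example below, `one_step` is a hypothetical stand-in for the trained model (it is not the DeepAR network): during training it always sees the true $z_{t-1}$, while during forecasting it is fed its own previous prediction.

```python
def one_step(z_prev, x_t):
    """Hypothetical one-step predictor standing in for the trained network."""
    return 0.5 * z_prev + x_t

z = [1.0, 2.0, 3.0, 4.0]      # toy target series
x = [0.0, 1.0, 2.0, 2.0]      # toy covariates

# Training-style pass: every step is conditioned on the TRUE previous value
teacher_forced = [one_step(z[t - 1], x[t]) for t in range(1, len(z))]

# Forecasting-style pass: after the first step, feed back the model's own output
rollout, z_prev = [], z[0]
for t in range(1, len(z)):
    z_hat = one_step(z_prev, x[t])
    rollout.append(z_hat)
    z_prev = z_hat            # ancestral step: the prediction becomes the next input

print(teacher_forced)
print(rollout)
```

The two lists agree on the first step and then drift apart, because forecast errors propagate through the fed-back input. In this notebook the fed-back value is the predicted mean rather than a random draw from the predicted distribution.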

In [None]:
train = train_x.copy()
train["label"] = train_y
n = len(train)
train_df = train[0:int(n*0.7)]
val_df = train[int(n*0.7):int(n*0.9)]
test_df = train[int(n*0.9):]

In [None]:
col_list = ['prev_label', 'year', 'month', 'day', 'hour', 'temp', 'humidity', 'season', 'holiday', 'workingday', 'weather', 'label']
train_df["prev_label"] = train_df["label"].shift(1).fillna(0)
train_df = train_df[col_list]
val_df["prev_label"] = val_df["label"].shift(1).fillna(0.469262)
val_df = val_df[col_list]
test_df["prev_label"] = test_df["label"].shift(1).fillna(0.389344)
test_df = test_df[col_list]

Train dataset looks like:

In [None]:
train_df.head(5)

Let's define the iterator:

In [None]:
window_size = 24
batch_size = 32
shuffle_buffer = 1000
train_dat = windowed_dataset(train_df, window_size, batch_size, shuffle_buffer)
validation_dat = windowed_dataset(val_df, window_size, batch_size, shuffle_buffer)
test_dat = windowed_dataset(test_df, window_size, batch_size, shuffle_buffer, train = False)

Note that, unlike in the previous analysis, the previous label is now also part of the features. The very first value of prev_label in the training set is replaced by 0, as it is not known.

Below, we have DeepAR model parameter definition:

In [None]:
def embedding_dimension(cat_num):
    return round(cat_num**0.25)

In [None]:
class DeepAR_tf_params():
    def __init__(self):
        self.cat_nums = None
        self.cont_feat = None
        self.cont_feat_tf = None
        self.disc_feat = None
        self.disc_feat_tf = None
        self.dropout_rates = None
        self.embedding_nums = None
        self.label_tf = None
        self.lstm_num = None
        self.lstm_units = None
        self.l1_penalty = None
        self.lr = None

cont_feat = ['prev_label', 'year', 'month', 'day', 'hour', 'temp', 'humidity']
disc_feat = ['season', 'holiday', 'workingday', 'weather']

params = DeepAR_tf_params()
params.cat_nums = [4, 2, 2, 4] # 4 for season, 2 for holiday, 2 for working day, 4 for weather
params.cont_feat = cont_feat
params.disc_feat = disc_feat
params.cont_feat_tf = list(range(0,7))
params.disc_feat_tf = list(range(7,11))
params.dropout_rates = 0
params.embedding_nums = [embedding_dimension(i) for i in params.cat_nums]
params.lstm_num = 3
params.label_tf = [11]
params.lstm_units = 40
params.l1_penalty = 0
params.lr = 0.0003

Now, let's build DeepAR model. We will:
* Train the network using the true label.
* Assess the validation performance via ancestral sampling with 6 step forecasting results.

**NOTE**: The model assumes we have at least one categorical and one ordinal feature, with the last feature dimension being the label.

In [None]:
from tensorflow.keras.regularizers import l1

class DeepAR_tf(tf.keras.Model):
  def __init__(self, D_p:DeepAR_tf_params):
    super().__init__()
    self.params = D_p
    self.cat_embeddings = [Embedding(self.params.cat_nums[d_f - self.params.disc_feat_tf[0]], self.params.embedding_nums[d_f - self.params.disc_feat_tf[0]], name = "Embedding_{0}".format(self.params.disc_feat[d_f - self.params.disc_feat_tf[0]])) for d_f in self.params.disc_feat_tf]
    self.lstms = [tf.keras.layers.LSTM(self.params.lstm_units, return_sequences=True, return_state=True, name = "LSTM_{0}".format(i)) for i in range(self.params.lstm_num)]
    self.mu = Dense(1, activation='linear', name='mu', kernel_regularizer=l1(self.params.l1_penalty))
    self.sigma = Dense(1, activation='softplus', name='sigma', kernel_regularizer=l1(self.params.l1_penalty))

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    continuous_features = K.stack([x[:, :, i] for i in self.params.cont_feat_tf], axis = 2)
    discrete_features = tf.concat([self.cat_embeddings[i - self.params.disc_feat_tf[0]](tf.cast(x[:, :, i], tf.int64), training = training) for i in self.params.disc_feat_tf], axis = 2)
    x = tf.concat([continuous_features, discrete_features], axis = 2)
    x = Dropout(self.params.dropout_rates)(x, training = training)
    if states is None:
      states = [self.lstms[i].get_initial_state(x) for i in range(self.params.lstm_num)]
    for i in range(self.params.lstm_num):
      x, memory_i, carry_i = self.lstms[i](x, initial_state=states[i], training=training)
      states[i] = [memory_i, carry_i]
    mu = self.mu(x, training = training)
    sigma = self.sigma(x, training = training)
    x = tf.concat([mu, sigma], axis = 2)
    if return_state:
      return x, states
    else:
      return x
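
The state transfer that `call` relies on can be checked in isolation. The sketch below (toy shapes, random inputs, not this notebook's model) runs a Keras LSTM once over a whole sequence and once step by step while passing the returned states back in. Under the assumption that the layer is created with `return_state=True`, both routes should give the same outputs up to floating-point error.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
x = tf.random.normal((2, 5, 3))          # (batch, time steps, features)

# Route 1: process the full sequence in one call
full_out, _, _ = lstm(x)

# Route 2: process one time step at a time, carrying (hidden, cell) states forward
states, step_outs = None, []
for t in range(x.shape[1]):
    out, h, c = lstm(x[:, t:t + 1, :], initial_state=states)
    states = [h, c]
    step_outs.append(out)
stepwise_out = tf.concat(step_outs, axis=1)

print(np.allclose(full_out.numpy(), stepwise_out.numpy(), atol=1e-5))
```

This equivalence is what allows the forecast loop to overwrite the next input with a prediction between steps without losing the recurrent memory.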

In [None]:
DeepAR_tf_model = DeepAR_tf(params)
for i in train_dat.take(1):
    for t in range(i.shape[1]):
        tr_x = K.expand_dims(i[:, t, :-1], axis = 1)
        tr_y = i[:, t, -1]
        example_pred = DeepAR_tf_model(tr_x)

In [None]:
DeepAR_tf_model.summary()

Now, let's compile the model:

In [None]:
# https://fairyonice.github.io/Create-a-neural-net-with-a-negative-log-likelihood-as-a-loss.html
def nll_gaussian(y_true, y_pred):
    # Single-step version: y_pred has shape (batch, 1, 2), y_true has shape (batch,)
    mu = y_pred[:, 0, 0]
    sigma = y_pred[:, 0, 1] + 1e-4  # predicted variance plus a small offset for numerical stability
    y_true = tf.cast(y_true, tf.float32)
    square = tf.square(mu - y_true)  # element-wise squared error
    ms = tf.divide(square, sigma) + K.log(sigma)
    ms = tf.reduce_mean(ms)
    return ms

**NOTE**: The custom training loop below accumulates the loss over every time step of a window and updates the weights once at the end, in the same style as a PyTorch training loop.


In [None]:
def grad(model, inputs, targets):
  loss_value = tf.zeros(1)
  stat = None
  with tf.GradientTape() as tape:
    for t in range(inputs.shape[1]):
      features = K.expand_dims(inputs[:, t, :], axis = 1)
      if stat is None:
        y_pred, stat = model(features, return_state = True, training = True)
      else:
        y_pred, stat = model(features, states = stat, return_state = True, training = True)
      loss_value += nll_gaussian(targets[:, t], y_pred)
  return loss_value, tape.gradient(loss_value, model.trainable_variables)

In [None]:
## Note: Re-running this cell re-initializes the model weights and the optimizer
DeepAR_tf_model = DeepAR_tf(params)
optimizer = tf.keras.optimizers.Adam(learning_rate=params.lr)

In [None]:
# Keep results for plotting
train_loss_results = []

num_epochs = 30

for epoch in range(num_epochs):
  epoch_loss_avg = tf.keras.metrics.Mean()

  # Training loop - using batches of 32
  for i in train_dat:
    x = i[:, :, :-1]
    y = i[:, :, -1]
    # Optimize the model
    loss_value, grads = grad(DeepAR_tf_model, x, y)
    optimizer.apply_gradients(zip(grads, DeepAR_tf_model.trainable_variables))

    # Track progress
    epoch_loss_avg.update_state(loss_value)  # Add current batch loss

  # End epoch
  train_loss_results.append(epoch_loss_avg.result())

  print("Epoch {:03d}: Loss: {:.3f}".format(epoch, epoch_loss_avg.result()))

In [None]:
f_starting = 0
i = K.expand_dims(val_df.iloc[:50, :], axis = 0)
states = None
x = i[:, :, :-1].numpy()
y_hat = tf.zeros(i[:, :, -1].shape).numpy()
y = i[:, :, -1].numpy()
std = tf.zeros(i[:, :, -1].shape).numpy()
for t in range(x.shape[1]):
  features = K.expand_dims(x[:, t, :], axis = 1)
  if states is None:
    y_pred, states = DeepAR_tf_model(features, return_state = True, training = False)
  else:
    y_pred, states = DeepAR_tf_model(features, states = states, return_state = True, training = False)
  if t < x.shape[1]-1 and t >= f_starting:
    x[:, t+1, 0] = y_pred[:, 0, 0]
    y_hat[:, t] = y_pred[:, 0, 0]
    std[:, t] = y_pred[:, 0, 1]
  elif t < f_starting:
    y_hat[:, t] = x[:, t+1, 0]
    std[:, t] = 0*y_pred[:, 0, 1]
  else:
    y_hat[:, t] = y_pred[:, 0, 0]
    std[:, t] = y_pred[:, 0, 1]
      

In [None]:
pred_val = MMSc_label.inverse_transform(y_hat)
pred_sigma = np.sqrt(std) * MMSc_label.data_range_[0]  # std holds scaled variances; rescale by the label range only (no offset)
true_val = MMSc_label.inverse_transform(y)

z_score = 1.65
LB = np.clip(pred_val - z_score*pred_sigma, a_min = 0, a_max = None)
UB = np.clip(pred_val + z_score*pred_sigma, a_min = 0, a_max = None)

In [None]:
plt.plot(pred_val[0, :], 'r', label = "Prediction")
plt.fill_between(range(y_hat.shape[1]), LB[0, :], UB[0, :], color='chocolate', alpha=0.2)
plt.plot(true_val[0, :], 'b', label = "True value")
plt.legend(loc="upper left")
plt.show()

## 3. Test

In [None]:
real_test = pd.read_csv("../input/bike-sharing-demand/test.csv", parse_dates = ["datetime"])
real_test["label"] = np.zeros(real_test.shape[0])

In [None]:
features = list(real_test)[:-3]
label = list(real_test)[-1]

def preprocessor(df):
    df["year"] = pd.to_datetime(df['datetime']).dt.year
    df["month"] = pd.to_datetime(df['datetime']).dt.month
    df["day"] = pd.to_datetime(df['datetime']).dt.day
    df["hour"] = pd.to_datetime(df['datetime']).dt.hour
    return df[['year', 'month', 'day', 'hour', 'temp', 'humidity', 'season', 'holiday', 'workingday', 'weather', 'label']]

test_true = preprocessor(real_test)

In [None]:
cont_feat = ['year', 'month', 'day', 'hour', 'temp', 'humidity']
disc_feat = ['season', 'holiday', 'workingday', 'weather']
test_true['season'] = test_true['season'] - 1
test_true['weather'] = test_true['weather'] - 1
test_true = test_true[cont_feat + disc_feat + ["label"]]

In [None]:
col_list = ['prev_label', 'year', 'month', 'day', 'hour', 'temp', 'humidity', 'season', 'holiday', 'workingday', 'weather', 'label']
test_true["prev_label"] = test_true["label"].shift(1).fillna(0.19526038)
test_true = test_true[col_list]

In [None]:
test_true[cont_feat] = MMSc_feature.transform(test_true[cont_feat])

In [None]:
f_starting = 0
i = K.expand_dims(test_true, axis = 0)
states = None
x = i[:, :, :-1].numpy()
y_hat = tf.zeros(i[:, :, -1].shape).numpy()
std = tf.zeros(i[:, :, -1].shape).numpy()
for t in range(x.shape[1]):
  features = K.expand_dims(x[:, t, :], axis = 1)
  if states is None:
    y_pred, states = DeepAR_tf_model(features, return_state = True, training = False)
  else:
    y_pred, states = DeepAR_tf_model(features, states = states, return_state = True, training = False)
  if t < x.shape[1]-1 and t >= f_starting:
    x[:, t+1, 0] = np.clip(y_pred[:, 0, 0], a_min = 0, a_max = None)
    y_hat[:, t] = np.clip(y_pred[:, 0, 0], a_min = 0, a_max = None)
    std[:, t] = y_pred[:, 0, 1]
  elif t < f_starting:
    y_hat[:, t] = x[:, t+1, 0]
    std[:, t] = 0*y_pred[:, 0, 1]
  else:
    y_hat[:, t] = np.clip(y_pred[:, 0, 0], a_min = 0, a_max = None)
    std[:, t] = y_pred[:, 0, 1]
      

In [None]:
pred_val = np.round(MMSc_label.inverse_transform(y_hat), decimals = 0)
pred_sigma = np.sqrt(std) * MMSc_label.data_range_[0]  # std holds scaled variances; rescale by the label range only (no offset)

z_score = 1.65
LB = np.clip(pred_val - z_score*pred_sigma, a_min = 0, a_max = None)
UB = np.clip(pred_val + z_score*pred_sigma, a_min = 0, a_max = None)

In [None]:
init_pred = 100
fin_pred = 160
plt.plot(pred_val[0, init_pred:fin_pred], 'r', label = "Prediction")
plt.fill_between(range(fin_pred-init_pred), LB[0, init_pred:fin_pred], UB[0, init_pred:fin_pred], color='chocolate', alpha=0.2)
plt.legend(loc="upper left")
plt.show()

In [None]:
real_test["count"] = pred_val[0, :]
submission = real_test[["datetime", "count"]]
submission.to_csv("sub.csv", index = False, header = True)