
# LSTM tutorial with Bitcoin price

## RNN

### Basic

A feed-forward neural network has some disadvantages compared with a recurrent neural network (RNN):

* Cannot handle sequential data;

* Considers only the current input;

* Cannot memorize previous inputs. [1]



Let's see how an RNN works. [1]

<img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/Fully_connected_Recurrent_Neural_Network.gif">

Here:
* "x" is the input layer;
* "h" is the hidden layer;
* "y" is the output layer;
* A (V), B (U), C (W) are the network parameters.

At any given time step t, the input to the hidden layer is a combination of the current input x(t) and the previous hidden state h(t-1).

<img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/Long_Short_Term_Memory_Networks.png">

### Backpropagation Through Time

Let's see how Backpropagation Through Time (BPTT) works. [2]

<img src="http://www.wildml.com/wp-content/uploads/2015/10/rnn-bptt-with-gradients.png">



\begin{align}
s_t &= \tanh(Ux_t + Ws_{t-1}) \\
\hat{y_t} &= \mathrm{softmax}(Vs_t) \\
E &= \sum_{t} E_t
\end{align}


Here $E$ is the total loss, the sum of the per-step losses $E_t$ (cross-entropy loss, for example).
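The recurrence and loss above can be sketched in NumPy as follows. This is a minimal illustration, not code from [2]: the matrix names `U`, `V`, `W` follow the equations, while the layer sizes, the random initialization, and the helper names `rnn_forward` and `cross_entropy` are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, V, W):
    """Run s_t = tanh(U x_t + W s_{t-1}) and y_hat_t = softmax(V s_t) over a sequence."""
    s = np.zeros(W.shape[0])  # initial hidden state (assumed to be zero)
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
        outputs.append(softmax(V @ s))
    return np.array(states), np.array(outputs)

def cross_entropy(outputs, targets):
    """Total loss E = sum_t E_t, where targets holds one class index per time step."""
    return -sum(np.log(y_hat[t]) for y_hat, t in zip(outputs, targets))

# Hypothetical sizes chosen only for illustration
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 4, 6
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
xs = rng.normal(size=(T, n_in))
targets = rng.integers(0, n_out, size=T)

states, outputs = rnn_forward(xs, U, V, W)
E = cross_entropy(outputs, targets)
print(states.shape, outputs.shape, E)
```

Each row of `outputs` is a probability distribution over the output classes, and `E` adds up the cross-entropy of every time step.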

\begin{align}
\frac{\partial E}{\partial W} =
\sum^{}_{t} {\frac{\partial E_t}{\partial W}}
\end{align}

\begin{align}
\frac{\partial E_3}{\partial V} =
\frac{\partial E_3}{\partial \hat{y_3}}
\frac{\partial \hat{y_3}}{\partial z_3}
\frac{ \partial z_3}{\partial V}
\end{align}

Here $z_3 = Vs_3$, so $ \frac{\partial E_3}{\partial V} $ depends only on the values at the current time step. But the story is different for $ \frac{\partial E_3}{\partial W} $ (and for $U$):

\begin{align}
\frac{\partial E_3}{\partial W} =
\frac{\partial E_3}{\partial \hat{y_3}}
\frac{\partial \hat{y_3}}{\partial s_3}
\frac{ \partial s_3}{\partial W}
\end{align}

\begin{align}
s_3 = \tanh(Ux_3 + Ws_2)
\end{align}

Note that $ s_3 = \tanh(Ux_3 + Ws_2) $ depends on $s_2$, which in turn depends on $W$ and $s_1$, and so on. We therefore cannot treat $s_2$ as a constant when differentiating with respect to $W$.
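Applying the chain rule through every earlier state, as described in [2], the gradient becomes a sum over time steps (the summation index $k$ is introduced here only for illustration):

\begin{align}
\frac{\partial E_3}{\partial W} =
\sum_{k=0}^{3}
\frac{\partial E_3}{\partial \hat{y_3}}
\frac{\partial \hat{y_3}}{\partial s_3}
\frac{\partial s_3}{\partial s_k}
\frac{\partial s_k}{\partial W}
\end{align}

The factor $ \frac{\partial s_3}{\partial s_k} $ is itself a product of step-to-step derivatives, for example $ \frac{\partial s_3}{\partial s_1} = \frac{\partial s_3}{\partial s_2} \frac{\partial s_2}{\partial s_1} $.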

### Vanishing and Exploding Gradient Problems

RNNs have the problem of vanishing gradient. When the gradient becomes too small, the parameter updates become insignificant. [1]

There is also the exploding gradient problem. It arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during training. [1]

The tanh activation function maps all values into the range between -1 and 1 (sigmoid maps them into the range between 0 and 1). Its derivative is bounded by 1 (1/4 in the case of sigmoid), and the derivatives of both functions approach 0 at both ends. [2]
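These bounds are easy to check numerically. The snippet below is a small sketch; the helper names `tanh_grad` and `sigmoid_grad` and the evaluation grid are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10.0, 10.0, 2001)
max_tanh_grad = tanh_grad(z).max()        # largest slope, reached at z = 0
max_sigmoid_grad = sigmoid_grad(z).max()  # largest slope, reached at z = 0

print(max_tanh_grad, max_sigmoid_grad)
print(tanh_grad(10.0), sigmoid_grad(10.0))  # both are close to zero far from the origin
```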

The gradient values shrink exponentially fast, eventually vanishing completely after a few time steps. Gradient contributions from “far away” steps become zero, and the state at those steps doesn’t contribute to what you are learning: you end up not learning long-range dependencies. Vanishing gradients aren’t exclusive to RNNs. They also happen in deep feed-forward neural networks. It’s just that RNNs tend to be very deep (as deep as the input sequence is long), which makes the problem a lot more common. [2]
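The shrinking can be observed directly by tracking the Jacobian of the hidden state with respect to the initial state. The sketch below assumes a tanh RNN with small random weights; the sizes, the weight scale, and the function name `jacobian_norms` are illustrative assumptions, and other weight scales can behave differently (large weights may produce exploding rather than vanishing gradients).

```python
import numpy as np

def jacobian_norms(W, U, xs, s0):
    """Return the spectral norm of d s_t / d s_0 for each step t of a tanh RNN."""
    s = s0
    J = np.eye(len(s0))  # d s_0 / d s_0
    norms = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        # One-step Jacobian: d s_t / d s_{t-1} = diag(1 - s_t^2) @ W
        J = (np.diag(1.0 - s ** 2) @ W) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

# Hypothetical setup: 8 hidden units, 3 inputs, 30 time steps
rng = np.random.default_rng(0)
hidden, n_in, steps = 8, 3, 30
W = rng.normal(scale=0.1, size=(hidden, hidden))
U = rng.normal(scale=0.1, size=(hidden, n_in))
xs = rng.normal(size=(steps, n_in))

norms = jacobian_norms(W, U, xs, np.zeros(hidden))
print(norms[0], norms[-1])  # the last norm is many orders of magnitude smaller
```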

Fortunately, there are a few ways to combat the vanishing gradient problem. Proper initialization of the W matrix can reduce the effect of vanishing gradients. So can regularization. A commonly preferred solution is to use ReLU instead of tanh or sigmoid activation functions. The ReLU derivative is a constant of either 0 or 1, so it isn’t as likely to suffer from vanishing gradients. [2]

An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. Both of these RNN architectures were explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies. [2]

## LSTM

### Structure of LSTM

Instead of a single neural network layer, the repeating module of an LSTM has four, interacting in a very special way. [3]

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png">

### Step-by-Step LSTM Walk Through

#### Forget gate layer

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png">

#### Input gate layer

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png">

#### Update the old cell state

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png">



#### Output

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png">
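The four steps shown in the figures above can be combined into a single cell update. The following NumPy sketch follows the standard LSTM formulation described in [3]; the parameter dictionary, its key names (`W_f`, `b_f`, and so on), the sizes, and the function name `lstm_step` are assumptions made for illustration, not code taken from that article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    # Forget gate: how much of the old cell state to keep
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])
    # Input gate and candidate values: what new information to store
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])
    # Update the old cell state
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate: which parts of the cell state to expose as the hidden state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes: 4 hidden units, 1 input feature
rng = np.random.default_rng(0)
hidden, n_in = 4, 1
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
    params[f"b_{gate}"] = np.zeros(hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([x]), h, c, params)
print(h.shape, c.shape)
```

The cell state `c` is changed only by element-wise scaling and addition, which is what lets information (and gradients) pass through many time steps more easily than in the plain RNN above.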

## Bitcoin Time Series Prediction with LSTM

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# First step, import libraries.
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

In [None]:
# Import the dataset and encode the date
df = pd.read_csv('/kaggle/input/bitcoin-historical-data/bitstampUSD_1-min_data_2012-01-01_to_2020-04-22.csv')
df.head(3)

In [None]:
# Convert the Unix timestamp (in seconds) to a calendar date
df['date'] = pd.to_datetime(df['Timestamp'], unit='s').dt.date
df.head(3)

In [None]:
# Average the per-minute weighted price into one value per day
group = df.groupby('date')
Real_Price = group['Weighted_Price'].mean()

In [None]:
# Split the data: hold out the last 30 days as the test set
prediction_days = 30
df_train = Real_Price[:len(Real_Price)-prediction_days]
df_test = Real_Price[len(Real_Price)-prediction_days:]

In [None]:
df_train

In [None]:
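# Data preprocess
from sklearn.preprocessing import MinMaxScaler

# Reshape the daily prices into a single-feature column vector
training_set = df_train.values
training_set = np.reshape(training_set, (len(training_set), 1))
# Scale prices to [0, 1]; the scaler is fitted on the training data only
sc = MinMaxScaler()
training_set = sc.fit_transform(training_set)
# Inputs are the prices on day t, targets are the prices on day t+1
X_train = training_set[0:len(training_set)-1]
y_train = training_set[1:len(training_set)]
# Keras LSTM layers expect input of shape (samples, timesteps, features)
X_train = np.reshape(X_train, (len(X_train), 1, 1))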
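To make the shifting in the preprocessing cell concrete, here is the same slicing applied to a toy series. The five values are made up for illustration, and scaling is left out to keep the example short.

```python
import numpy as np

# Toy "price" series standing in for the scaled training set
series = np.array([10.0, 11.0, 12.0, 13.0, 14.0]).reshape(-1, 1)

X = series[:-1]  # prices on days 0..3
y = series[1:]   # prices on days 1..4, the next-day targets

# Add the timesteps axis: (samples, timesteps, features)
X = X.reshape(len(X), 1, 1)

for features, target in zip(X[:, 0, 0], y[:, 0]):
    print(features, "->", target)
print(X.shape, y.shape)
```

Each sample contains a single time step with a single feature, so the model only ever sees the previous day's price when predicting the next one.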

The model needs to know what input shape it should expect. For this reason, the first layer in a Sequential model (and only the first, because following layers can do automatic shape inference) needs to receive information about its input shape. There are several possible ways to do this:

* pass an input_shape argument to the first layer. This is a shape tuple (a tuple of integers or None entries, where None indicates that any positive integer may be expected). In input_shape, the batch dimension is not included.

* pass instead a batch_input_shape argument, where the batch dimension is included. This is useful for specifying a fixed batch size (e.g. with stateful RNNs).

model = Sequential()
model.add(Dense(32, batch_input_shape=(None, 784)))
# Note that the batch dimension is None here,
# so the model will be able to process batches of any size. [4]

In [None]:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# Initialising the RNN
regressor = Sequential()

# Adding the input layer and the LSTM layer
regressor.add(LSTM(units = 4, activation = 'sigmoid', input_shape = (None, 1)))

# units: Positive integer, dimensionality of the output space.
# activation: Activation function to use.
# Default: hyperbolic tangent (tanh).
# If you pass None, no activation is applied (ie. "linear" activation: a(x) = x).

# Adding the output layer
regressor.add(Dense(units = 1))

# Compiling the RNN
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fitting the RNN to the Training set
regressor.fit(X_train, y_train, batch_size = 5, epochs = 30)

In [None]:
# plot_model requires the pydot package and Graphviz to be installed
from keras.utils import plot_model
plot_model(regressor, to_file='model_plot.png', show_shapes=True, show_layer_names=True, expand_nested=False)

In [None]:
regressor.summary()

In [None]:
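# Making the predictions
test_set = df_test.values
# Scale the test prices with the scaler fitted on the training data
inputs = np.reshape(test_set, (len(test_set), 1))
inputs = sc.transform(inputs)
# Reshape to (samples, timesteps, features), as expected by the LSTM
inputs = np.reshape(inputs, (len(inputs), 1, 1))
# Each prediction is the model's estimate of the next day's price
predicted_BTC_price = regressor.predict(inputs)
# Map the scaled predictions back to USD
predicted_BTC_price = sc.inverse_transform(predicted_BTC_price)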
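The scaling round trip used in the prediction cell can be written out by hand. The sketch below mirrors what min-max scaling to the range [0, 1] does for a single feature; the helper functions and the toy prices are assumptions for this example and are not part of scikit-learn.

```python
import numpy as np

def minmax_fit(values):
    # Learn the minimum and maximum from the training data only
    return values.min(), values.max()

def minmax_transform(values, lo, hi):
    return (values - lo) / (hi - lo)

def minmax_inverse(scaled, lo, hi):
    return scaled * (hi - lo) + lo

train = np.array([100.0, 150.0, 200.0])
test = np.array([180.0, 220.0])

lo, hi = minmax_fit(train)
scaled_test = minmax_transform(test, lo, hi)
restored = minmax_inverse(scaled_test, lo, hi)

print(scaled_test)  # the second value exceeds 1 because 220 is above the training maximum
print(restored)
```

Because the scaler is fitted on the training prices, test prices outside the training range map outside [0, 1]. This is worth keeping in mind when the test period contains prices the model never saw during training.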

In [None]:
# Visualising the results
plt.figure(figsize=(15, 5), dpi=80, facecolor='w', edgecolor='k')
ax = plt.gca()
plt.plot(test_set, color='red', label='Real BTC Price')
plt.plot(predicted_BTC_price, color='blue', label='Predicted BTC Price')
plt.title('BTC Price Prediction', fontsize=14)
# Use the test dates as x-axis tick labels
df_test = df_test.reset_index()
x = df_test.index
labels = df_test['date']
plt.xticks(x, labels, rotation='vertical')
# Enlarge the tick labels on both axes
ax.tick_params(axis='both', labelsize=14)
plt.xlabel('Time', fontsize=14)
plt.ylabel('BTC Price(USD)', fontsize=14)
plt.legend(loc=2, prop={'size': 14})
plt.show()

References:

1. https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn?source=sl_frs_nav_playlist_video_clicked

2. http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

3. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

4. https://faroit.com/keras-docs/1.1.0/getting-started/sequential-model-guide/