# Vision models examples
[conv_lstm.py](https://github.com/keras-team/keras/blob/2.1.0/examples/conv_lstm.py) 

This script demonstrates the use of a convolutional LSTM network.
This network is used to predict the next frame of an artificially
generated movie which contains moving squares.

Some other links found by Chao:
- https://blog.csdn.net/liuxiao214/article/details/78212982
- https://github.com/keras-team/keras/issues/1773
- The original paper: [Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting](https://arxiv.org/pdf/1506.04214v1.pdf)

I copied two figures from the paper to illustrate the convolutional LSTM:

- Internal structure of ConvLSTM
![fig12](images/conv_lstm_1.jpg)

First, the input X is a 3D tensor of shape (P, M, N), where P is the number of channels and M, N are the height and width of the image.

Then, the $W\cdot x$ part of the classic fully connected LSTM is replaced with a convolution.

At each time step the cell maps $(X_t, H_{t-1}, C_{t-1}) \to (H_t, C_t)$.

The next step maps $(X_{t+1}, H_t, C_t) \to (H_{t+1}, C_{t+1})$, and so on.
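For reference, the substitution described above can be written out gate by gate. This is a sketch of the formulation in the cited paper as I understand it, where $*$ denotes convolution and $\circ$ the element-wise product; the symbol names should be checked against the paper. As far as I know, the Keras `ConvLSTM2D` layer omits the peephole terms $W_{c\cdot} \circ C$.

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$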

A larger convolution kernel can capture faster motion in the video.

- Prediction
![fig3](images/conv_lstm_2.jpg)

The left side of the figure is training; the right side is prediction.

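In this example, the left side is a stack of two ConvLSTM layers. At prediction time these two layers stay the same, but the output must have the same dimensions as the input. Because only two layers are stacked, the state of both layers is output and concatenated, which should give a 'very thick' output (the image height and width are unchanged; only the remaining dimension grows). A 1x1 convolution then reduces it to the required dimension. (I did not fully understand this last part, and the Keras code below does not seem to do it this way.)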
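The concatenate-then-1x1-convolution step can be sketched in plain NumPy. This is only an illustration of the idea, not the paper's or Keras's implementation: the sizes, the random stand-in states `h1` and `h2`, and the names `kernel` and `bias` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 40x40 frames, two ConvLSTM layers with 40 filters each.
height, width, filters, out_channels = 40, 40, 40, 1

# Random stand-ins for the hidden states of the two stacked layers (channels last).
h1 = rng.standard_normal((height, width, filters))
h2 = rng.standard_normal((height, width, filters))

# Concatenate along the channel axis: spatial size is unchanged, depth doubles.
stacked = np.concatenate([h1, h2], axis=-1)

# A 1x1 convolution applies the same linear map to the channels of every pixel.
kernel = rng.standard_normal((2 * filters, out_channels))
bias = np.zeros(out_channels)
logits = np.einsum('hwc,co->hwo', stacked, kernel) + bias

# A sigmoid squashes each pixel into the range [0, 1].
prediction = 1.0 / (1.0 + np.exp(-logits))

print(stacked.shape, prediction.shape)
```

The 'thick' tensor has 80 channels here, and the 1x1 convolution brings it back to a single channel with the same height and width as the input frame.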


In [1]:
import keras
from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization
import numpy as np
# import pylab as plt

import matplotlib.pyplot as plt
%matplotlib inline
print(keras.__version__)

Using TensorFlow backend.


2.1.0


We create a network that takes as input movies of shape __(n_frames, width, height, channels)__ and returns a movie of identical shape.

In [2]:
seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   input_shape=(None, 40, 40, 1),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.output_shape

(None, None, 40, 40, 40)

In [3]:
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.output_shape

(None, None, 40, 40, 40)

In [4]:
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.output_shape

(None, None, 40, 40, 40)

In [5]:
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.output_shape

(None, None, 40, 40, 40)

In [6]:
seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
               activation='sigmoid',
               padding='same', data_format='channels_last'))
seq.output_shape

(None, None, 40, 40, 1)
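The layer settings above also determine how many trainable weights each layer has. The sketch below counts them by hand, assuming each ConvLSTM2D layer has four gates with an input kernel, a recurrent kernel and a bias, and no peephole weights (which I believe matches Keras). The helper names are hypothetical, and BatchNormalization weights are not counted. The results can be compared with `seq.summary()`.

```python
def conv_lstm2d_params(in_channels, filters, kernel_size):
    """Weight count of a ConvLSTM2D layer without peepholes (assumed layout):
    4 gates, each with an input kernel, a recurrent kernel and a bias."""
    kh, kw = kernel_size
    per_gate = kh * kw * (in_channels + filters) * filters + filters
    return 4 * per_gate


def conv3d_params(in_channels, filters, kernel_size):
    """Weight count of a Conv3D layer: one kernel plus one bias per filter."""
    kd, kh, kw = kernel_size
    return kd * kh * kw * in_channels * filters + filters


# First ConvLSTM2D sees 1 input channel; the following ones see 40.
first = conv_lstm2d_params(in_channels=1, filters=40, kernel_size=(3, 3))
later = conv_lstm2d_params(in_channels=40, filters=40, kernel_size=(3, 3))
head = conv3d_params(in_channels=40, filters=1, kernel_size=(3, 3, 3))

print(first, later, head)
```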

In [7]:
seq.compile(loss='binary_crossentropy', optimizer='adadelta')

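For each sample, first place 3 to 7 squares at random.

Each square then gets a random direction of motion and moves one step along it in every frame of the sample.

Then, for each square in each frame, noise is added with 50% probability; the noise value is 0.1 or -0.1 with equal probability.

The ground-truth data is the same movie shifted one step further, with no noise added.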

In [8]:
def generate_movies(n_samples=1200, n_frames=15):
    row = 80
    col = 80
    # Use the builtin float: the np.float alias was removed in newer NumPy.
    noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=float)
    shifted_movies = np.zeros((n_samples, n_frames, row, col, 1),
                              dtype=float)

    for i in range(n_samples):
        # Add 3 to 7 moving squares
        n = np.random.randint(3, 8)

        for j in range(n):
            # Initial position
            xstart = np.random.randint(20, 60)
            ystart = np.random.randint(20, 60)
            # Direction of motion
            directionx = np.random.randint(0, 3) - 1
            directiony = np.random.randint(0, 3) - 1

            # Size of the square
            w = np.random.randint(2, 4)

            for t in range(n_frames):
                x_shift = xstart + directionx * t
                y_shift = ystart + directiony * t
                noisy_movies[i, t, x_shift - w: x_shift + w,
                             y_shift - w: y_shift + w, 0] += 1

                # Make it more robust by adding noise.
                # The idea is that if during inference,
                # the value of the pixel is not exactly one,
                # we need to train the network to be robust and still
                # consider it as a pixel belonging to a square.
                if np.random.randint(0, 2):
                    noise_f = (-1)**np.random.randint(0, 2)
                    noisy_movies[i, t,
                                 x_shift - w - 1: x_shift + w + 1,
                                 y_shift - w - 1: y_shift + w + 1,
                                 0] += noise_f * 0.1

                # Shift the ground truth by 1
                x_shift = xstart + directionx * (t + 1)
                y_shift = ystart + directiony * (t + 1)
                shifted_movies[i, t, x_shift - w: x_shift + w,
                               y_shift - w: y_shift + w, 0] += 1

    # Cut to a 40x40 window
    noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
    shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
    noisy_movies[noisy_movies >= 1] = 1
    shifted_movies[shifted_movies >= 1] = 1
    return noisy_movies, shifted_movies
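The relation between the input and the ground truth is easiest to see on a single noise-free square. The sketch below is a stripped-down stand-in for the generator above, not a replacement for it; `clean_square_movie` and its arguments are hypothetical names introduced for this example.

```python
import numpy as np


def clean_square_movie(n_frames, start, direction, w=2, size=80):
    """One noise-free square moving at constant velocity (hypothetical helper)."""
    frames = np.zeros((n_frames, size, size))
    for t in range(n_frames):
        x = start[0] + direction[0] * t
        y = start[1] + direction[1] * t
        frames[t, x - w: x + w, y - w: y + w] = 1
    return frames


# Build 16 frames; inputs are frames 0..14 and targets are frames 1..15,
# so targets[t] is the square one step further along its direction.
movie = clean_square_movie(n_frames=16, start=(40, 40), direction=(1, -1))
inputs, targets = movie[:-1], movie[1:]

# The noise factor in the generator is +1 or -1, so the noise is +0.1 or -0.1.
noise_values = sorted({(-1) ** k * 0.1 for k in (0, 1)})

print(inputs.shape, targets.shape, noise_values)
```

Each target frame is the input frame moved by one step of the square's direction, which is what the network is trained to predict.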

In [9]:
noisy_movies, shifted_movies = generate_movies(n_samples=1200)

In [10]:
print(noisy_movies.shape)
print(shifted_movies.shape)

(1200, 15, 40, 40, 1)
(1200, 15, 40, 40, 1)


In [None]:
# X = noisy_movies[0,0,:,:,:].reshape(40,40)

In [None]:
# plt.imshow(X)

In [None]:
# X1 = noisy_movies[0,1,:,:,:].reshape(40,40)
# plt.imshow(X1)

In [None]:
# X2 = noisy_movies[0,7,:,:,:].reshape(40,40)
# plt.imshow(X2)

In [None]:
# X[8:16,9:17] 

In [None]:
# X1[6:16,7:17]

In [None]:
seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10, epochs=300, validation_split=0.05)

Train on 950 samples, validate on 50 samples
Epoch 1/300
 50/950 [>.............................] - ETA: 23:30 - loss: 0.8293