**CS596 - Machine Learning**
<br>
Date: **4 November 2020**


Title: **Seminar 8**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Sources:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

<h1 align="center">Time Series Prediction with LSTM</h1>

- The problem we are going to look at in this seminar is the **International Airline Passengers** prediction problem.

- Given a **year** and a **month**, the task is to predict the **number** of international airline **passengers** in units of 1,000. 


- The **data** ranges from **January 1949** to **December 1960**, or **12 years**, with **144 observations**.


- We can load this dataset easily using the **Pandas** library. 

In [None]:
# Import google colab library for loading dataset files
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas
import matplotlib.pyplot as plt
dataframe = pandas.read_csv('airline-passengers.csv', usecols=[1], engine='python')
plt.plot(dataframe)
plt.show()

- As we can see, there is an **upward trend** in the dataset over time.


- We can also see some **periodicity** in the dataset that probably **corresponds** to the **Northern Hemisphere vacation** period.

<h1 align="center">Long Short-Term Memory Network</h1>

- The **Long Short-Term Memory** (LSTM) network is a **recurrent neural network** that is trained using **Backpropagation Through Time** and is designed to **mitigate** the **vanishing gradient problem**.


- As such, it can be used to create **large recurrent networks** that in turn can be used to **address difficult sequence problems** in machine learning and achieve state-of-the-art results.


- Instead of neurons, **LSTM networks** have **memory blocks** that are connected through layers.


- A **block** has components that make it smarter than a classical neuron and a memory for recent sequences. 



- A **block contains gates** that manage the block’s state and output. 
  

- A block operates upon an input sequence, and **each gate** within a block **uses sigmoid activation units** to **control whether it is triggered or not**, making the change of state and the addition of information flowing through the block conditional.


- There are **three types of gates** within a **unit**:

  - **Forget Gate**: conditionally decides what information to throw away from the block.
 
  - **Input Gate**: conditionally decides which values from the input are used to update the memory state.
  
  - **Output Gate**: conditionally decides what to output based on input and the memory of the block.
  
  <center><img src="images/L9_LSTM.png" width="800" alt="Example" /></center>


- We will see how you may achieve sophisticated learning and memory from a layer of LSTMs, and it is not hard to imagine how higher-order abstractions may be layered with multiple such layers.
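
- The three gates described above can be sketched in a few lines of NumPy. This is a minimal illustration of one time step of a standard LSTM block, not the Keras implementation: the function name `lstm_step`, the stacking order of the weight rows, and the random weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM block (illustrative sketch).

    W: (4*units, n_features), U: (4*units, units), b: (4*units,)
    Rows are stacked in the (assumed) order: forget, input, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to drop from memory
    i = sigmoid(z[1 * units:2 * units])   # input gate: what to write to memory
    g = np.tanh(z[2 * units:3 * units])   # candidate values to write
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to expose
    c_t = f * c_prev + i * g              # new memory (cell) state
    h_t = o * np.tanh(c_t)                # new output (hidden) state
    return h_t, c_t

# hypothetical sizes: 4 blocks, 1 input feature
rng = np.random.default_rng(0)
units, n_features = 4, 1
W = rng.normal(scale=0.1, size=(4 * units, n_features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# feed a short made-up sequence one value at a time
h = np.zeros(units)
c = np.zeros(units)
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([x]), h, c, W, U, b)
print(h.shape, c.shape)
```

  Because every gate is a sigmoid with values between 0 and 1, each one acts as a soft switch on the quantity it multiplies.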

<h1 align="center">LSTM Network for Regression</h1>

- We can phrase the problem as a **regression problem**.


- That is, **given** the **number of passengers this month**, what is the **number of passengers next month**?


- We can write a simple **function** to **convert our single column of data** into a **two-column dataset**:

  - the **first column** containing **this month’s** ($t$) passenger count;
  - the **second column** containing **next month’s** ($t+1$) passenger count.

- Before we get started, let’s first import all of the functions and classes we intend to use. 

In [None]:
import numpy
import matplotlib.pyplot as plt
import pandas
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

- Let's fix the NumPy **random number seed** to make our **results more reproducible** (the Keras backend keeps its own random state, so runs may still differ slightly).

In [None]:
# fix random seed for reproducibility
numpy.random.seed(7)

- Now we **extract** the **NumPy array** from the **dataframe** and **convert** the integer values to **floating point values**, which are **more suitable** for modeling with a **neural network**.

In [None]:
dataset = dataframe.values
dataset = dataset.astype('float32')
dataset[:5]

- **LSTMs** are **sensitive** to the **scale of the input data**, specifically when the **sigmoid** or **tanh activation functions** are used.


- We can easily normalize the dataset using the **MinMaxScaler** preprocessing class from the **scikit-learn** library.

In [None]:
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
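
- As a rough illustration of what min-max normalization to the range $[0, 1]$ does, the sketch below applies $x' = (x - x_{min}) / (x_{max} - x_{min})$ by hand and then inverts it. The five values are made up for the example and are not taken from the airline dataset.

```python
import numpy as np

# made-up sample values (not the airline data)
x = np.array([100.0, 150.0, 200.0, 300.0, 500.0])

x_min, x_max = x.min(), x.max()

# forward transform: map [x_min, x_max] onto [0, 1]
scaled = (x - x_min) / (x_max - x_min)

# inverse transform: recover the original units
restored = scaled * (x_max - x_min) + x_min

print(scaled)
print(restored)
```

  The inverse step is the same idea as `scaler.inverse_transform()`, which is what allows errors to be reported in the original units later on.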

- After we model our data and estimate the skill of our model on the training dataset, we need to get an idea of the skill of the model on new unseen data. 


- For a **normal classification** or **regression problem**, we would do this using **cross validation**.


- With **time series data**, the **sequence of values is important!** 


- A **simple method** that we can use is to **split the ordered dataset** into **train** and **test datasets**. 


- The **code** below **calculates** the **index of the split point** and **separates the data** into a **training dataset** with $67\%$ **of the observations**, which we can use to train our model, leaving the **remaining** $33\%$ for **testing the model**.

In [None]:
# split into train and test sets
train_size  = int(len(dataset) * 0.67)
test_size   = len(dataset) - train_size

train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

print(len(train), len(test))
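
- The arithmetic behind the split can be checked on its own. The sketch below assumes the 144 observations mentioned earlier and uses plain index lists instead of the real data; the point is that every training index precedes every test index, so the temporal order is preserved.

```python
n_observations = 144

# int() truncates 144 * 0.67 = 96.48 down to 96
train_size = int(n_observations * 0.67)
test_size = n_observations - train_size

# stand-in for the ordered dataset: just the time indices
indices = list(range(n_observations))
train_idx, test_idx = indices[:train_size], indices[train_size:]

print(train_size, test_size)
```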

- Now we can define a function to **create a new dataset**.


- The **function takes two arguments**: 

  - the **dataset**, which is a **NumPy array** that we want to convert into a dataset
  
  - the **look_back**, which is the **number of previous time steps to use** as input variables to predict the next time period.
  

- In our case we choose the default value for **look_back** equal to $1$.
  

- This default **will create** a **dataset** where $X$ is the **number of passengers** at a **given time** ($t$) and $Y$ is the **number of passengers** at the **next time** ($t + 1$).

In [None]:
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)

In [None]:
# reshape into X=t and Y=t+1
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

In [None]:
trainX[:5]

In [None]:
trainY[:5]
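
- To see what the windowing does, it can help to run it on a tiny made-up series. The sketch below repeats the logic of `create_dataset` (with snake_case local names) and applies it to the values 0 to 9, which are an assumption chosen only because they make the output easy to read.

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    """Same windowing logic as above: X holds look_back values, Y the next one."""
    data_x, data_y = [], []
    for i in range(len(dataset) - look_back - 1):
        data_x.append(dataset[i:i + look_back, 0])
        data_y.append(dataset[i + look_back, 0])
    return np.array(data_x), np.array(data_y)

# toy series 0, 1, ..., 9 as a single column
toy = np.arange(10, dtype="float32").reshape(-1, 1)

x1, y1 = create_dataset(toy, look_back=1)
x3, y3 = create_dataset(toy, look_back=3)

print(x1.shape, y1.shape)   # one input value per sample
print(x3.shape, y3.shape)   # three input values per sample
print(x3[0], y3[0])
```

  Note that, because of the extra `-1` in the loop bound, the last value of the series (9 here) is never used as a target, so each split yields `len(dataset) - look_back - 1` samples.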

- The LSTM network expects the input data ($X$) in the form **[samples, time steps, features]**. Currently, our data is in the form **[samples, features]**, and we are framing the problem as one time step for each sample. 


- We can **transform** the prepared **train** and **test input data** into the **expected structure** using **numpy.reshape()** as follows:

In [None]:
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
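
- A small shape check may make the reshape clearer. The array below is a made-up stand-in for `trainX` with four samples and one feature; the reshape only inserts a time-step axis of length 1 and leaves the values untouched.

```python
import numpy as np

# made-up input in the form [samples, features]
x = np.array([[0.1], [0.2], [0.3], [0.4]], dtype="float32")

# insert a time-step axis of length 1: [samples, time steps, features]
x3d = np.reshape(x, (x.shape[0], 1, x.shape[1]))

print(x.shape, "->", x3d.shape)
```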

- We are now **ready to design** and fit our **LSTM network** for this problem.


- The **network** has:

  - a **visible layer** with **1 input**
  
  - a **hidden layer** with **4 LSTM blocks** or neurons
  - an **output layer** that **makes a single value prediction**. 
  
  
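- The **default activation functions** are used for the **LSTM blocks**: in Keras this is **tanh** for the block input and output, with a **sigmoid-type** function for the gates. 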


- The **network** is **trained** for $100$ **epochs** and a **batch size** of $1$ is used.

In [None]:
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
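
- As a sanity check on the size of this network, the number of trainable weights can be estimated by hand. The sketch below assumes the standard LSTM formulation with four weight sets (forget, input, candidate, output) and one bias vector each; the helper names are hypothetical, and the result can be compared against `model.summary()`.

```python
def lstm_param_count(units, n_features):
    """Weights of a standard LSTM layer: 4 sets of input, recurrent and bias weights."""
    return 4 * (units * n_features + units * units + units)

def dense_param_count(n_inputs, n_outputs):
    """Weights of a fully connected layer: one weight per connection plus biases."""
    return n_inputs * n_outputs + n_outputs

# 4 LSTM blocks fed with 1 feature, followed by a single output neuron
lstm_params = lstm_param_count(units=4, n_features=1)
dense_params = dense_param_count(n_inputs=4, n_outputs=1)
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)
```

  Under these assumptions the model is very small, which is one reason it trains quickly even with a batch size of $1$.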

- **Once the model is fit**, we can **estimate the performance** of the model on the **train** and **test datasets**. 


- **Note** that we **invert the predictions before calculating error scores** to ensure that **performance is reported in the same units** as the original data.

In [None]:
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
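
- The effect of inverting the predictions before scoring can be illustrated with a few made-up numbers. The sketch below assumes a scaler fitted on data with a minimum of 100 and a maximum of 500 (hypothetical values) and computes the RMSE both in the scaled range and in the original units.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def invert(scaled, x_min, x_max):
    """Undo min-max scaling to the range [0, 1]."""
    return [s * (x_max - x_min) + x_min for s in scaled]

# hypothetical scaler range and made-up scaled targets and predictions
x_min, x_max = 100.0, 500.0
y_true_scaled = [0.25, 0.50, 0.75]
y_pred_scaled = [0.30, 0.45, 0.80]

rmse_scaled = rmse(y_true_scaled, y_pred_scaled)
rmse_original = rmse(invert(y_true_scaled, x_min, x_max),
                     invert(y_pred_scaled, x_min, x_max))

print(round(rmse_scaled, 4), round(rmse_original, 2))
```

  For min-max scaling the two scores differ by the factor $(x_{max} - x_{min})$, so only the inverted score can be read directly as thousands of passengers.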

- Finally, we can **generate predictions** using the model for both the **train** and **test dataset** to get a visual indication of the skill of the model.

In [None]:
# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
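
- The index arithmetic used to shift the predictions can be checked without any plotting. The sketch below is a hypothetical helper, `prediction_slices`, that reproduces the slice bounds from the plotting code and confirms that each slice has exactly as many positions as there are predictions, assuming 144 observations and a training set of 96.

```python
def prediction_slices(n_total, n_train, look_back):
    """Reproduce the slice bounds used to place predictions on the time axis."""
    # create_dataset yields len(split) - look_back - 1 samples per split
    n_train_pred = n_train - look_back - 1
    n_test_pred = (n_total - n_train) - look_back - 1

    train_slice = slice(look_back, n_train_pred + look_back)
    test_start = n_train_pred + (look_back * 2) + 1
    test_slice = slice(test_start, n_total - 1)
    return n_train_pred, n_test_pred, train_slice, test_slice

n_train_pred, n_test_pred, train_slice, test_slice = prediction_slices(144, 96, look_back=1)

print(n_train_pred, train_slice)
print(n_test_pred, test_slice)
```

  The start of the test slice works out to `n_train + look_back`, which is the position of the first test target in the full series.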

- We can see that the **model** did an **excellent job** of **fitting both** the **training** and the **test datasets**.

- We can see that the **model** has an **average error** of about:

  - $23$ **passengers** (in thousands) on the **training dataset**
  - $52$ **passengers** (in thousands) on the **test dataset**. 
  
  Not that bad.

<h1 align="center">LSTM for Regression Using the Window Method</h1>

- We can also phrase the problem so that multiple, recent time steps can be used to make the prediction for the next time step.

In [None]:
# LSTM for international airline passengers problem with window regression framing
import numpy
import matplotlib.pyplot as plt
from pandas import read_csv
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)

# fix random seed for reproducibility
numpy.random.seed(7)

# load the dataset
dataframe = read_csv('airline-passengers.csv', usecols=[1], engine='python')
dataset = dataframe.values
dataset = dataset.astype('float32')

# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict

# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()

- We can see that the error increased slightly compared to that of the previous section. 

- The window size and the network architecture were not tuned: this is just a demonstration of how to frame a prediction problem.

<h1 align="center">LSTM for Regression with Time Steps</h1>

- You may have noticed that the data preparation for the LSTM network includes time steps.


- Time steps provide **another way to phrase our time series problem**. 


- As in the window example above, we can **take prior time steps** in our time series as **inputs** to **predict** the **output** at the **next time step**.


- We can do this **using** the **same data representation** as in the previous window-based example, **except** when we **reshape the data**, we set the **columns to be the time steps dimension** and change the **features dimension back** to 1.

In [None]:
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))
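
- The difference between the window framing and the time-step framing is only in how the same numbers are laid out. The sketch below uses two made-up windows with `look_back = 3` and reshapes them both ways.

```python
import numpy as np

# two made-up samples, each holding three consecutive values
windows = np.array([[1, 2, 3],
                    [2, 3, 4]], dtype="float32")

# window method: 1 time step with 3 features
window_layout = windows.reshape(2, 1, 3)

# time-step method: 3 time steps with 1 feature each
time_step_layout = windows.reshape(2, 3, 1)

print(window_layout[0])
print(time_step_layout[0])
```

  In the first layout the LSTM sees all three values at once as separate features; in the second it reads them one after another as a sequence.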

In [None]:
# LSTM for international airline passengers problem with time step regression framing
import numpy
import matplotlib.pyplot as plt
from pandas import read_csv
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)

# fix random seed for reproducibility
numpy.random.seed(7)

# load the dataset
dataframe = read_csv('airline-passengers.csv', usecols=[1], engine='python')
dataset = dataframe.values
dataset = dataset.astype('float32')

# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(look_back, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict

# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()

- We can see that the **results are slightly better** than in the **previous example**, and the **structure** of the input data **makes a lot more sense**.

<h1 align="center">LSTM with Memory Between Batches</h1>

- The **LSTM** network **has memory**, which is capable of **remembering across long sequences**.


- Normally, the **state** within the network is **reset after each training batch** when fitting the model, as well as each call to **model.predict()** or **model.evaluate()**.


- We can **gain finer control** over when the **internal state** of the LSTM network is cleared in Keras by making the LSTM layer **stateful**. 

  This means that it can **build state over the entire training sequence** and even maintain that state if needed to make predictions.


- It **requires** that the **training data not be shuffled** when fitting the network.


- It also **requires explicit resetting** of the network **state** after each **epoch** by calls to **model.reset_states()**. 

  This means that we must **create our own outer loop** of **epochs** and within **each epoch** call **model.fit()** and **model.reset_states()**. 
  
  For example like this:

In [None]:
for i in range(100):
    model.fit(trainX, trainY, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()

- Finally, when the **LSTM layer** is constructed, the **stateful parameter** must be **set to True**. Instead of specifying the input dimensions, we must **hard code** the **number of samples in a batch**, the **number of time steps** in a sample, and the **number of features** in a time step by **setting** the **batch_input_shape parameter**. 

  For example:

In [None]:
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, features), stateful=True))

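- The **same batch size** must then be used when **evaluating** the model and **making predictions**. For example:

In [None]: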
model.predict(trainX, batch_size=batch_size)
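
- What "carrying state between batches" means can be illustrated with a toy recurrent cell. The class below is a hypothetical stand-in written for this example, not the Keras API: it keeps a single running value that decays and accumulates inputs, and it only forgets that value when `reset_states()` is called.

```python
class ToyStatefulCell:
    """Toy stand-in for a stateful recurrent layer (not the Keras API)."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0

    def reset_states(self):
        """Clear the internal state, as done at the end of each epoch."""
        self.state = 0.0

    def run(self, batch):
        """Process a batch of inputs; the state persists after the call."""
        outputs = []
        for x in batch:
            self.state = self.decay * self.state + x
            outputs.append(self.state)
        return outputs

cell = ToyStatefulCell()
first = cell.run([1.0, 1.0])   # state builds up within the batch
carried = cell.run([1.0])      # state from the previous batch is still there
cell.reset_states()
fresh = cell.run([1.0])        # after the reset the cell starts from zero

print(first, carried, fresh)
```

  The same input of `1.0` gives a different output depending on whether the state was reset first, which is why the order of the training samples matters for a stateful layer.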

- We can adapt the previous time step example to use a stateful LSTM. 

- The full code listing is provided below.

In [None]:
# LSTM for international airline passengers problem with memory
import numpy
import matplotlib.pyplot as plt
from pandas import read_csv
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return numpy.array(dataX), numpy.array(dataY)

# fix random seed for reproducibility
numpy.random.seed(7)

# load the dataset
dataframe = read_csv('airline-passengers.csv', usecols=[1], engine='python')
dataset = dataframe.values
dataset = dataset.astype('float32')

# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

# split into train and test sets
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))

# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

for i in range(100):
    model.fit(trainX, trainY, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()

# make predictions
trainPredict = model.predict(trainX, batch_size=batch_size)
model.reset_states()
testPredict = model.predict(testX, batch_size=batch_size)

# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict

# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()

- We can see that the **results are worse** than in the previous example. 


- The **model** may **need more modules** and may need to be **trained for more epochs** to internalize the structure of the problem.

<h1 align="center">End of Seminar</h1>