# Sequence

Often we deal with sets in applied machine learning such as a train or test set of samples. Each sample in the set can be thought of as an observation from the domain. In a set, the order of the observations is not important.

A sequence is different. The sequence imposes an explicit order on the observations. The order is important. It must be respected in the formulation of prediction problems that use the sequence data as input or output for the model.

## Sequence Prediction

Sequence prediction involves predicting the next value for a given input sequence. For example:

Input Sequence: 1, 2, 3, 4, 5

Output Sequence: 6

Some examples of sequence prediction problems include:

 Weather Forecasting. Given a sequence of observations about the weather over time, predict the expected weather tomorrow.

 Stock Market Prediction. Given a sequence of movements of a security over time, predict the next movement of the security.

 Product Recommendation. Given a sequence of past purchases for a customer, predict the next purchase for that customer.
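As a minimal sketch of how a sequence can be framed as a prediction problem, the snippet below slices a sequence into pairs of an input window and the value that follows it. The helper name `split_sequence` and the parameter `n_steps` are assumptions made for illustration; they are not part of any library.

```python
def split_sequence(sequence, n_steps):
    """Slice a sequence into (input window, next value) pairs."""
    pairs = []
    for i in range(len(sequence) - n_steps):
        window = sequence[i:i + n_steps]   # n_steps consecutive observations
        target = sequence[i + n_steps]     # the observation that follows the window
        pairs.append((window, target))
    return pairs


sequence = [1, 2, 3, 4, 5, 6]
pairs = split_sequence(sequence, n_steps=5)
for window, target in pairs:
    print(window, '->', target)
```

With a window of five time steps, the six-value sequence yields the single pair shown in the example above: the input 1, 2, 3, 4, 5 and the output 6.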

## Sequence Classification

Sequence classification involves predicting a class label for a given input sequence. For example:

Input Sequence: 1, 2, 3, 4, 5

Output Sequence: "good"

Some examples of sequence classification problems include:

 DNA Sequence Classification. Given a DNA sequence of A, C, G, and T values, predict whether the sequence is for a coding or non-coding region.

 Anomaly Detection. Given a sequence of observations, predict whether the sequence is anomalous or not.

 Sentiment Analysis. Given a sequence of text such as a review or a tweet, predict whether the sentiment of the text is positive or negative.

## Sequence Generation

Sequence generation involves generating a new output sequence that has the same general characteristics as other sequences in the corpus. For example:

Input Sequence: [1, 3, 5], [7, 9, 11]

Output Sequence: [3, 5, 7]

Some examples of sequence generation problems include:

 Text Generation. Given a corpus of text, such as the works of Shakespeare, generate new sentences or paragraphs of text that read as if they could have been drawn from the corpus.

 Handwriting Prediction. Given a corpus of handwriting examples, generate handwriting for new phrases that has the properties of handwriting in the corpus.

Music Generation. Given a corpus of examples of music, generate new musical pieces that have the properties of the corpus.


Sequence generation may also refer to the generation of a sequence given a single observation as input. An example is the automatic textual description of images.


 Image Caption Generation. Given an image as input, generate a sequence of words that describes the image.

## Sequence-to-Sequence Prediction

Sequence-to-sequence prediction involves predicting an output sequence given an input sequence. For example:

Input Sequence: 1, 2, 3, 4, 5

Output Sequence: 6, 7, 8, 9, 10

Sequence-to-sequence prediction is a subtle but challenging extension of sequence prediction, where, rather than predicting a single next value in the sequence, a new sequence is predicted that may or may not have the same length or be of the same time as the input sequence. This type of problem has recently seen a lot of study in the area of automatic text translation (e.g. translating English to French) and may be referred to by the abbreviation seq2seq.

If the input and output sequences are a time series, then the problem may be referred to as multi-step time series forecasting. Some examples of sequence-to-sequence problems include:

 Multi-Step Time Series Forecasting. Given a time series of observations, predict a sequence of observations for a range of future time steps.

 Text Summarization. Given a document of text, predict a shorter sequence of text that describes the salient parts of the source document.

 Program Execution. Given the textual description of a program or mathematical equation, predict the sequence of characters that describes the correct output.

## LSTM Weights

A memory cell has weight parameters for the input, output, as well as an internal state that is built up through exposure to input time steps.

 Input Weights. Used to weight input for the current time step.

 Output Weights. Used to weight the output from the last time step.

 Internal State. Internal state used in the calculation of the output for this time step.

## LSTM Gates

The key to the memory cell is its gates. These too are weighted functions that further govern the information flow in the cell. There are three gates:

 Forget Gate: Decides what information to discard from the cell.

 Input Gate: Decides which values from the input are used to update the memory state.

 Output Gate: Decides what to output based on input and the memory of the cell.

The forget gate and input gate are used in the updating of the internal state. The output gate is a final limiter on what the cell actually outputs. It is these gates and the consistent data flow called the constant error carousel or CEC that keep each cell stable (neither exploding nor vanishing).
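To make the role of the gates concrete, here is a minimal NumPy sketch of a single time step of an LSTM cell. It follows the commonly used formulation (no peephole connections) and is not taken from any particular library. The function name `lstm_step`, the stacked weight layout (forget, input, candidate, output), and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of an LSTM cell.

    W: input weights, shape (4 * units, n_features)
    U: weights on the previous output, shape (4 * units, units)
    b: biases, shape (4 * units,)
    Rows are assumed to be stacked as: forget, input, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:units])                 # forget gate: what to discard from the state
    i = sigmoid(z[units:2 * units])         # input gate: which new values to write
    g = np.tanh(z[2 * units:3 * units])     # candidate values computed from the input
    o = sigmoid(z[3 * units:4 * units])     # output gate: what to expose as output
    c = f * c_prev + i * g                  # updated internal state
    h = o * np.tanh(c)                      # output for this time step
    return h, c


rng = np.random.default_rng(0)
units, n_features = 3, 1
W = rng.normal(size=(4 * units, n_features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for value in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h)
print(c)
```

The internal state `c` is only ever changed by the elementwise update `f * c_prev + i * g`, which is the additive path that the constant error carousel refers to.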

## Applications of LSTMs

Automatic Image Caption Generation

Automatic Translation of Text

Automatic Handwriting Generation





# 1 Prepare Data For LSTMs

## Normalize Series Data (0-1)

In [1]:
from pandas import Series
from sklearn.preprocessing import MinMaxScaler
# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)
# prepare data for normalization
values = series.values
values = values.reshape((len(values), 1))
# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))
# normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)
# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0
dtype: float64
Min: 10.000000, Max: 100.000000
[[ 0.        ]
 [ 0.11111111]
 [ 0.22222222]
 [ 0.33333333]
 [ 0.44444444]
 [ 0.55555556]
 [ 0.66666667]
 [ 0.77777778]
 [ 0.88888889]
 [ 1.        ]]
[[  10.]
 [  20.]
 [  30.]
 [  40.]
 [  50.]
 [  60.]
 [  70.]
 [  80.]
 [  90.]
 [ 100.]]
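The scaling above can be checked by hand. The sketch below applies the min-max formula y = (x - min) / (max - min) and its inverse in plain Python; the variable names are chosen for illustration only.

```python
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]

lowest, highest = min(data), max(data)

# normalize: y = (x - min) / (max - min)
normalized = [(x - lowest) / (highest - lowest) for x in data]

# invert: x = y * (max - min) + min
restored = [y * (highest - lowest) + lowest for y in normalized]

print(normalized)
print(restored)
```

The minimum and maximum must be estimated from the training data only and then reused for any new data, which is why the scaler is fit once and applied with `transform()`.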


## Standardize Series Data (Mean=0, Std=1)

In [2]:
from pandas import Series
from sklearn.preprocessing import StandardScaler
from math import sqrt
# define contrived series
data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]
series = Series(data)
print(series)
# prepare data for standardization
values = series.values
values = values.reshape((len(values), 1))
# train the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
print('Mean: %f, StandardDeviation: %f' % (scaler.mean_[0], sqrt(scaler.var_[0])))
# standardize the dataset and print
standardized = scaler.transform(values)
print(standardized)
# inverse transform and print
inversed = scaler.inverse_transform(standardized)
print(inversed)

0    1.0
1    5.5
2    9.0
3    2.6
4    8.8
5    3.0
6    4.1
7    7.9
8    6.3
dtype: float64
Mean: 5.355556, StandardDeviation: 2.712568
[[-1.60569456]
 [ 0.05325007]
 [ 1.34354035]
 [-1.01584758]
 [ 1.26980948]
 [-0.86838584]
 [-0.46286604]
 [ 0.93802055]
 [ 0.34817357]]
[[ 1. ]
 [ 5.5]
 [ 9. ]
 [ 2.6]
 [ 8.8]
 [ 3. ]
 [ 4.1]
 [ 7.9]
 [ 6.3]]
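The same result can be reproduced by hand with y = (x - mean) / std. The sketch below uses the population standard deviation (dividing by n rather than n - 1), which matches the values printed above; the variable names are chosen for illustration only.

```python
from math import sqrt

data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]

mean = sum(data) / len(data)
# population variance: divide by n, not n - 1
variance = sum((x - mean) ** 2 for x in data) / len(data)
std = sqrt(variance)

# standardize: y = (x - mean) / std
standardized = [(x - mean) / std for x in data]

# invert: x = y * std + mean
restored = [y * std + mean for y in standardized]

print('Mean: %f, StandardDeviation: %f' % (mean, std))
print(standardized)
```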


## How to Convert Categorical Data to Numerical Data

### Integer Encoding

### One Hot Encoding

In [3]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode (in scikit-learn 1.2 and later, the sparse argument is named sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)


['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
['cold']
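For intuition, the two encoding steps can also be written directly in NumPy. This is a sketch that mirrors the example above, not a replacement for the scikit-learn encoders; it assumes the labels are assigned integers in sorted order, which is what the printed output above shows.

```python
import numpy as np

data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']

# integer encode: map each label to its position in the sorted list of classes
classes = sorted(set(data))
class_to_int = {label: index for index, label in enumerate(classes)}
integer_encoded = np.array([class_to_int[label] for label in data])
print(integer_encoded)

# one hot encode: pick the matching row of an identity matrix for each integer
onehot_encoded = np.eye(len(classes))[integer_encoded]
print(onehot_encoded)

# invert: the position of the 1 in each row recovers the integer, then the label
decoded = [classes[index] for index in onehot_encoded.argmax(axis=1)]
print(decoded[0])
```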


## Prepare Sequences with Varied Lengths (Sequence Padding)

### Pre-Sequence Padding

In [4]:
from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
[1, 2, 3, 4],
[1, 2, 3],
[1]
]
# pad sequence
padded = pad_sequences(sequences)
print(padded)



[[1 2 3 4]
 [0 1 2 3]
 [0 0 0 1]]


### Post-Sequence Padding

In [5]:
from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
[1, 2, 3, 4],
[1, 2, 3],
[1]
]
# pad sequence
padded = pad_sequences(sequences, padding='post')
print(padded)

[[1 2 3 4]
 [1 2 3 0]
 [1 0 0 0]]


## Prepare Sequences with Varied Lengths (Sequence Truncation)

### Pre-Sequence Truncation

In [6]:
from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
[1, 2, 3, 4],
[1, 2, 3],
[1]
]
# truncate sequence
truncated = pad_sequences(sequences, maxlen=2)
print(truncated)

[[3 4]
 [2 3]
 [0 1]]


### Post-Sequence Truncation

In [7]:
from keras.preprocessing.sequence import pad_sequences
# define sequences
sequences = [
[1, 2, 3, 4],
[1, 2, 3],
[1]
]
# truncate sequence
truncated = pad_sequences(sequences, maxlen=2, truncating='post')
print(truncated)


[[1 2]
 [1 2]
 [0 1]]
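The four cases above follow one rule: first truncate sequences that are too long, then pad those that are too short, each at either the front ('pre') or the back ('post'). The plain-Python sketch below reproduces that behaviour for lists of integers. The function `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Return copies of the sequences, all with length maxlen (assumed >= 1)."""
    if maxlen is None:
        # default to the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops values from the front, 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' adds the fill value at the front, 'post' adds it at the end
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result


sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
print(pad_or_truncate(sequences))
print(pad_or_truncate(sequences, padding='post'))
print(pad_or_truncate(sequences, maxlen=2))
print(pad_or_truncate(sequences, maxlen=2, truncating='post'))
```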


## Pandas shift() Function
A key function to help transform time series data into a supervised learning problem is the Pandas shift() function. Given a DataFrame, the shift() function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end). This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift() function in action. We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:

In [8]:
from pandas import DataFrame
# define the sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)

   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9


In [9]:
from pandas import DataFrame
# define the sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
# shift forward
df['t-1'] = df['t'].shift(1)
print(df)


   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0


In [10]:
from pandas import DataFrame
# define the sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
# shift backward
df['t+1'] = df['t'].shift(-1)
print(df)


   t  t+1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  5.0
5  5  6.0
6  6  7.0
7  7  8.0
8  8  9.0
9  9  NaN
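Putting the two shifts together gives a supervised learning dataset. The sketch below is one way to do this, under the assumption that rows containing NaN are simply dropped; it uses the lag column and the current value as inputs and the next value as the output.

```python
from pandas import DataFrame

# define the sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
# lag observation, used as an input
df['t-1'] = df['t'].shift(1)
# forecast observation, used as the output
df['t+1'] = df['t'].shift(-1)
# the first and last rows contain NaN and cannot be used
supervised = df.dropna()
print(supervised)

# split into input (X) and output (y) arrays
X = supervised[['t-1', 't']].values
y = supervised['t+1'].values
print(X.shape, y.shape)
```

Of the 10 original rows, 8 remain, because one row is lost at each end of the series.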


# 2 Develop LSTMs in Keras

The goal of this lesson is to understand how to define, fit, and evaluate LSTM models using the Keras deep learning library in Python. After completing this lesson, you will know:

 How to define an LSTM model, including how to reshape your data for the required 3D input.

 How to fit and evaluate your LSTM model and use it to make predictions on new data.

 How to take fine-grained control over the internal state in the model and when it is reset.

### Note
The first hidden layer in the network must define the number of inputs to expect, e.g. the shape of the input layer. Input must be three-dimensional, comprised of samples, time steps, and features in that order.

 Samples. These are the rows in your data. One sample may be one sequence.

 Time steps. These are the past observations for a feature, such as lag variables.

 Features. These are columns in your data.

model = Sequential()

model.add(LSTM(5, input_shape=(2,1)))

model.add(Dense(1))

The number of samples does not have to be specified. The model assumes one or more samples, leaving you to define only the number of time steps and features. The final section of this lesson provides additional examples of preparing input data for LSTM models.
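As a small sketch of the reshaping step, the snippet below turns a 2D array of 5 rows and 2 columns into the 3D layout of samples, time steps, and features using NumPy. The data is contrived, and treating each column as one time step of a single feature is an assumption made for this example.

```python
import numpy as np

# contrived 2D data: 5 samples (rows), each with 2 observations (columns)
data = np.arange(10, dtype=float).reshape(5, 2)
print(data.shape)

# reshape to [samples, time steps, features] with a single feature
X = data.reshape((data.shape[0], data.shape[1], 1))
print(X.shape)
```

The last two dimensions of the result, (2, 1), are what the `input_shape=(2,1)` argument above describes; the first dimension, the number of samples, is left unspecified.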


### Activation Function
The choice of activation function is most important for the output layer as it will define the format that predictions will take. For example, below are some common predictive modeling problem types and the structure and standard activation function that you can use in the output layer:

 Regression: Linear activation function, or linear, and the number of neurons matching the number of outputs. This is the default activation function used for neurons in the Dense layer.

 Binary Classification (2 class): Logistic activation function, or sigmoid, and one neuron in the output layer.

 Multiclass Classification (> 2 class): Softmax activation function, or softmax, and one output neuron per class value, assuming a one hot encoded output pattern.
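To show what each of these output activations does to the raw values of the output layer, here is a minimal NumPy sketch. The function names and the input scores are chosen for illustration and are not the Keras implementations.

```python
import numpy as np


def linear(z):
    # regression: values are passed through unchanged
    return z


def sigmoid(z):
    # binary classification: squashes a value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # multiclass classification: non-negative values that sum to 1
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / e.sum()


scores = np.array([2.0, 1.0, 0.1])
print(linear(scores))
print(sigmoid(scores[0]))
print(softmax(scores))
```

The sigmoid output can be read as the probability of the positive class, and the softmax output as one probability per class.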

### Compilation

model.compile(optimizer='sgd', loss='mse')

or

algorithm = SGD(lr=0.1, momentum=0.3)

model.compile(optimizer=algorithm, loss='mse')

### Loss Function

The type of predictive modeling problem imposes constraints on the type of loss function that can be used. For example, below are some standard loss functions for different predictive model types:

 Regression: Mean Squared Error or mean_squared_error, mse for short.

 Binary Classification (2 class): Logarithmic Loss, also called cross entropy or binary_crossentropy.

 Multiclass Classification (> 2 class): Multiclass Logarithmic Loss or categorical_crossentropy.
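The three loss functions can be sketched in NumPy as follows. These are illustrative definitions averaged over samples, not the Keras code; the clipping constant `eps` is an assumption used here to avoid taking the logarithm of zero.

```python
import numpy as np


def mse(y_true, y_pred):
    # mean of the squared differences
    return np.mean((y_true - y_pred) ** 2)


def binary_crossentropy(y_true, p, eps=1e-7):
    # y_true holds 0/1 labels, p holds predicted probabilities of class 1
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


def categorical_crossentropy(y_true, p, eps=1e-7):
    # y_true is one hot encoded, p holds one probability per class
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))


print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print(binary_crossentropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
print(categorical_crossentropy(np.array([[1.0, 0.0, 0.0]]), np.array([[0.7, 0.2, 0.1]])))
```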


### Note
Setting verbose=0 turns off the progress bar.

predictions = model.predict(X)           # Probability

predictions = model.predict_classes(X)   # Class labels


## LSTM State Management
Each LSTM memory unit maintains internal state that is accumulated. This internal state may require careful management for your sequence prediction problem both during the training of the network and when making predictions. By default, the internal state of all LSTM memory units in the network is reset after each batch, e.g. when the network weights are updated. This means that the configuration of the batch size imposes a tension between three things:

 The efficiency of learning, or how many samples are processed before an update.

 The speed of learning, or how often weights are updated.

 The influence of internal state, or how often internal state is reset.

Keras provides flexibility to decouple the resetting of internal state from updates to network weights by defining an LSTM layer as stateful. This can be done by setting the stateful argument on the LSTM layer to True. When stateful LSTM layers are used, you must also define the batch size as part of the input shape in the definition of the network by setting the batch_input_shape argument, and the batch size must be a factor of the number of samples in the training dataset. The batch_input_shape argument requires a 3-dimensional tuple defined as batch size, time steps, and features. For example, we can define a stateful LSTM to be trained on a training dataset with 100 samples, a batch size of 10, and 5 time steps for 1 feature, as follows.

model.add(LSTM(2, stateful=True, batch_input_shape=(10, 5, 1)))

A stateful LSTM will not reset the internal state at the end of each batch. Instead, you have fine-grained control over when to reset the internal state by calling the reset_states() function. For example, we may want to reset the internal state at the end of each single epoch, which we could do as follows:

    for i in range(1000):
        # one pass over the data, in order, with the batch size used to define the model
        model.fit(X, y, epochs=1, batch_size=10, shuffle=False)
        # clear the accumulated internal state before the next epoch
        model.reset_states()

The same batch size used in the definition of the stateful LSTM must also be used when making predictions.

predictions = model.predict(X, batch_size=10)
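The arithmetic behind the batch size constraint can be sketched in plain Python. The helper `describe_batching` is hypothetical and written for illustration; it checks that the batch size is a factor of the number of samples and reports how many weight updates one epoch contains.

```python
def describe_batching(n_samples, batch_size):
    """Check the batch size and count the weight updates in one epoch."""
    if n_samples % batch_size != 0:
        raise ValueError('batch size %d is not a factor of %d samples'
                         % (batch_size, n_samples))
    return {
        'samples_per_update': batch_size,
        'updates_per_epoch': n_samples // batch_size,
    }


# the example above: 100 training samples and a batch size of 10
for batch_size in [1, 10, 100]:
    print(batch_size, describe_batching(100, batch_size))
```

With the default (non-stateful) behaviour, the number of updates per epoch is also the number of times internal state is reset in that epoch, which is the tension described at the start of this section.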


To make this more concrete, below are three common examples for managing state:

 A prediction is made at the end of each sequence and sequences are independent. State should be reset after each sequence by setting the batch size to 1.

 A long sequence was split into multiple subsequences (many samples each with many time steps). State should be reset after the network has been exposed to the entire sequence by making the LSTM stateful, turning off the shuffling of subsequences, and resetting the state after each epoch.

 A very long sequence was split into multiple subsequences (many samples each with many time steps). Training efficiency is more important than the influence of long-term internal state and a batch size of 128 samples was used, after which network weights are updated and state reset.



## Mapping Applications to Models
I really want you to understand these models. To that end, this section lists 10 different and varied sequence prediction problems and notes which model may be used to address them. In each explanation, I give an example of a model that can be used to address the problem, but other models can be used if the sequence prediction problem is re-framed. Take these as best suggestions, not unbreakable rules.

### Time Series

 Univariate Time Series Forecasting. This is where you have one series with multiple input time steps and wish to predict one time step beyond the input sequence. This can be implemented as a many-to-one model.

 Multivariate Time Series Forecasting. This is where you have multiple series with multiple input time steps and wish to predict one time step beyond one or more of the input sequences. This can be implemented as a many-to-one model. Each series is just another input feature.

 Multi-step Time Series Forecasting. This is where you have one or multiple series with multiple input time steps and wish to predict multiple time steps beyond one or more of the input sequences. This can be implemented as a many-to-many model.

 Time Series Classification. This is where you have one or multiple series with multiple input time steps as input and wish to output a classification label. This can be implemented as a many-to-one model.
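The difference between these framings shows up in the shapes of the input and output arrays. The sketch below builds them with NumPy for the univariate, multivariate, and multi-step cases. The helper `make_windows` is hypothetical, and predicting only the first series is an assumption made to keep the example short.

```python
import numpy as np


def make_windows(series, n_in, n_out):
    """Build X as [samples, time steps, features] and y as [samples, n_out]."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series.reshape(-1, 1)       # a single series is one feature
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                     # n_in input time steps
        y.append(series[i + n_in:i + n_in + n_out, 0])   # next n_out steps of the first series
    return np.array(X), np.array(y)


one_series = np.arange(10)
two_series = np.column_stack([np.arange(10), np.arange(10) * 10])

# univariate, one step ahead (many-to-one)
X_uni, y_uni = make_windows(one_series, n_in=3, n_out=1)
print(X_uni.shape, y_uni.shape)

# multivariate, one step ahead (many-to-one): each series is another feature
X_multi, y_multi = make_windows(two_series, n_in=3, n_out=1)
print(X_multi.shape, y_multi.shape)

# multi-step forecasting (many-to-many)
X_steps, y_steps = make_windows(one_series, n_in=3, n_out=2)
print(X_steps.shape, y_steps.shape)
```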

### Natural Language Processing

 Image Captioning. This is where you have one image and wish to generate a textual description. This can be implemented as a one-to-many model.

 Video Description. This is where you have a sequence of images in a video and wish to generate a textual description. This can be implemented with a many-to-many model.

 Sentiment Analysis. This is where you have sequences of text as input and you wish to generate a classification label. This can be implemented as a many-to-one model.

 Speech Recognition. This is where you have a sequence of audio data as input and wish to generate a textual description of what was spoken. This can be implemented with a many-to-many model.

 Text Translation. This is where you have a sequence of words in one language as input and wish to generate a sequence of words in another language. This can be implemented with a many-to-many model.

 Text Summarization. This is where you have a document of text as input and wish to create a short textual summary of the document as output. This can be implemented with a many-to-many model.


