## LSTM  peculiarities #1: state
When stacking LSTM layers, you can choose between two kinds of model: **stateful** or **stateless**.

1. A stateful recurrent model is one in which the internal states (memories) obtained after processing a batch of samples are reused as the initial states for the samples of the next batch. This makes it possible to process longer sequences while keeping computational complexity manageable. If you have a long time series of stock returns, for example, you might want to use a stateful LSTM.
NB: the hidden state is passed from batch to batch, not within the batch. Within a batch, the sub-sequences are treated as independent.

2. If the model is stateless, the cell states are reset after each batch. Stateless models are generally considered more efficient in implementations than stateful ones, and they are the clear choice when each observation sequence does not depend on the previous one (e.g. sentence classification). Additionally, stateless LSTMs can be trained with or without shuffling of the observations.
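
The difference between the two modes can be sketched without any deep learning library. The snippet below is only a toy illustration: `run_batch` and its `decay` update rule are hypothetical stand-ins for a recurrent layer, not a real LSTM cell. It shows that a stateless run starts every batch from zeros, while a stateful run hands the final state of row i to row i of the next batch.

```python
# Toy stand-in for a recurrent layer: one scalar "memory" per row of the batch.
def run_batch(batch, initial_states, decay=0.5):
    """Process every row of a batch and return the final state of each row."""
    final_states = []
    for row, state in zip(batch, initial_states):
        for x in row:
            # Hypothetical update rule, chosen only for illustration
            state = decay * state + x
        final_states.append(state)
    return final_states

batches = [
    [[1.0, 1.0], [2.0, 2.0]],  # batch 0: two independent rows
    [[1.0, 1.0], [2.0, 2.0]],  # batch 1: same inputs, so differences come from state
]

# Stateless: every batch starts from zeros.
stateless_out = [run_batch(b, [0.0] * len(b)) for b in batches]

# Stateful: row i of batch n+1 starts from the final state of row i of batch n.
states = [0.0, 0.0]
stateful_out = []
for b in batches:
    states = run_batch(b, states)
    stateful_out.append(states)

print(stateless_out)  # both batches give the same result
print(stateful_out)   # the second batch differs because it inherits state
```

Note that the rows inside one batch never exchange state with each other, which matches the NB above.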

In the next tutorial we will try several variations: 
 - stateful 
 - stateless with shuffling
 - stateless without shuffling

## LSTM  peculiarities #2: input/output
<img src="RNN_structures.jpg" style="width: 800px;"/>
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Example applications, one per structure in the figure:
1. Image recognition (a binary outcome like cat/dog); strictly speaking, it's not a sequence
2. Picture description (outcome: "Tall man in red shirt holds beer"); not really for Keras
3. Sentiment analysis, typing suggestions, or our time series prediction (using a bunch of past values to predict one in the future)
4. Translation; this can also be the prediction of several steps in a time series
5. May be used, for example, for ongoing video classification

Additionally, there is an architecture called **Bidirectional LSTM** that not only preserves information from the past and passes it to the future, but also passes information from the future to the past (widely used for tasks where context matters a lot).
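
The bidirectional idea can be illustrated with a toy recurrence. This is a sketch under loud assumptions: `run_direction` and its `decay` rule are hypothetical and far simpler than an LSTM cell. The point is only that each time step ends up with one summary of everything before it and one summary of everything after it.

```python
def run_direction(sequence, decay=0.5):
    """Return the running state after each time step (toy recurrence, not an LSTM)."""
    state, outputs = 0.0, []
    for x in sequence:
        state = decay * state + x
        outputs.append(state)
    return outputs

def bidirectional(sequence):
    """Pair a past-to-future summary with a future-to-past summary per time step."""
    forward = run_direction(sequence)
    # Run over the reversed sequence, then flip back so indices line up with time
    backward = run_direction(sequence[::-1])[::-1]
    return list(zip(forward, backward))

out = bidirectional([1.0, 0.0, 0.0, 2.0])
print(out)
```

At the first time step the forward summary only knows about the first value, while the backward summary already reflects the large value at the end of the sequence.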

## LSTM  peculiarities #3: train set structure
In the manual implementation, we touched on how the data can be structured before feeding it into the model. Ultimately, we always want to have a set of features and a target. This time, we additionally say that the order of features within one observation matters. Depending on statefulness (as discussed above), the order of the observations themselves may or may not matter.

In a case unrelated to time series or language, we could have the following task:

Let's say we expect the model to figure out that it has to count the number of 1s and give that count as its prediction.

#### case 1

In [None]:
     X       y
[0,1,1,1,2] [3]
[0,0,1,1,2] [2]
[0,0,0,1,2] [1]
[1,0,0,1,2] [2]
[0,0,0,0,1] [1]
...

In this case, the order of those observations does not matter, we could even shuffle them for better generalization.
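
A dataset like the one in case 1 can be generated in a few lines. This is a minimal sketch: the helper name `make_counting_dataset`, the value set `[0, 1, 2]` and the seeds are assumptions made for illustration. It also shows how to shuffle the observations while keeping every row paired with its own target.

```python
import random

def make_counting_dataset(n_samples, length=5, seed=0):
    """Build random rows of 0/1/2 and label each row with its number of 1s."""
    rng = random.Random(seed)
    X = [[rng.choice([0, 1, 2]) for _ in range(length)] for _ in range(n_samples)]
    y = [[row.count(1)] for row in X]
    return X, y

X, y = make_counting_dataset(n_samples=6)

# Shuffle features and targets together so each row keeps its own label.
pairs = list(zip(X, y))
random.Random(1).shuffle(pairs)
X_shuffled, y_shuffled = (list(t) for t in zip(*pairs))

for row, target in zip(X_shuffled, y_shuffled):
    print(row, target)
```
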
How do we transform a long string of consecutive numbers into a training set like that?

#### case 2

<img src="time_series_structure.png" style="width: 500px;"/>


Splitting it into equal chunks is one of the well-adopted solutions. An important question arises: what should the size of the window be? It will depend a lot on domain knowledge. In the case of sentiment analysis: how many words do you need on average to convey a sentiment? In stock prediction, you can use the PACF (partial autocorrelation function) to inform the choice.
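
One common way to build such a training set is a sliding window. The sketch below assumes a stride of one step and a one-step-ahead target; `make_windows` is a hypothetical helper name, and the figure above may use a different stride.

```python
def make_windows(series, window):
    """Slide a fixed-size window over the series; the target is the next value."""
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])
        y.append(series[start + window])
    return X, y

series = [10, 20, 30, 40, 50, 60]
X, y = make_windows(series, window=3)
print(X)  # [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
print(y)  # [40, 50, 60]
```

Before feeding such windows to a recurrent layer, they usually still need to be reshaped into (samples, time steps, features).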

#### case 3
Instead of predicting only one step ahead, we could ask our NN to predict a sequence.

<img src="time_series_sequence.png" style="width: 500px;"/>

We decide on the sequence length and keep it constant. However, our input and output sequences are now of the same length: each target sequence starts one time step ahead of its input sequence. The remaining values, i.e. the remainder of the full series length divided by the chosen sequence length, are dropped. Keep in mind that your model then turns into a seq2seq type. By making the network stateful, we use the hidden state at the end of the previous sequence as the starting point of the next sequence. Since the hidden state summarizes information from earlier observations, this increases the information available for each prediction beyond the window size.
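
The construction described above could look roughly like this. It is a sketch that assumes non-overlapping input sequences (so that a stateful network can continue from one to the next); `make_seq2seq_windows` is a hypothetical helper name.

```python
def make_seq2seq_windows(series, seq_len):
    """Cut the series into non-overlapping inputs; each target is the input shifted by one step."""
    # The last value can only serve as a target, so it is excluded from the inputs.
    n_sequences = (len(series) - 1) // seq_len
    X, y = [], []
    for k in range(n_sequences):
        start = k * seq_len
        X.append(series[start:start + seq_len])
        y.append(series[start + 1:start + seq_len + 1])
    return X, y

series = list(range(1, 12))  # 1..11
X, y = make_seq2seq_windows(series, seq_len=3)
print(X)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(y)  # [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
```

Here the value 11 is dropped because it does not fit into a full sequence.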

## LSTM  peculiarities #4: batches 

NB: if you ignore everything below, your LSTM won't break, but it might produce poorer results.

When it comes to time series forecasting, there are many suggestions for what the batch size of an LSTM should be. One rule you may come across is to make the batch size the same as the size of your output (in line with how Keras works). If your output is a next-day prediction, you should consider batch size = 1.

You can still go for a bigger batch size without drastically decreasing your performance. In general, the advice is to choose a batch size that divides the train and validation set sizes without remainder, so that no data gets discarded. Keep in mind that a stateful LSTM will chain row 1 of batch n together with row 1 of batch n+1, so try to keep the batch size somewhat meaningful.
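
Finding batch sizes that leave no remainder is simple arithmetic. The sketch below lists every size that divides both set sizes; the function name and the example sizes (120 and 45) are made up for illustration.

```python
from math import gcd

def candidate_batch_sizes(n_train, n_val):
    """Return every batch size that divides both set sizes without remainder."""
    g = gcd(n_train, n_val)
    return [d for d in range(1, g + 1) if g % d == 0]

sizes = candidate_batch_sizes(n_train=120, n_val=45)
print(sizes)  # [1, 3, 5, 15]
```

If none of the candidates is practical, an alternative is to trim a few samples from one of the sets so that a convenient batch size fits.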

However, when it comes to text processing, the situation changes. [Jeremy Howard](https://www.youtube.com/watch?v=H3g26EVADgY&feature=youtu.be&t=17m50sin) advocates splitting a big string of text into chunks, stacking them, and then creating batches from the first "slice", the second "slice", etc. This offers nice parallelisation and does not really harm the process. Keep in mind that if you have short sequences (that are still longer than your window size), this method might bring in additional distortion.

Example (mind that numbers are only used for simplicity; they should rather be words): we have a "long" string [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17......1000].
Let's say we want to have batch=10; then we split the string into 10 chunks (that would make 100 numbers per chunk) and stack them:

[1.....100]

[101...200]

[201...300]

...

[901...1000]

(10 rows, 100 columns)

Now, if you remember, in order to transform it into a supervised task we need to decide on the "window" size and the output type. Let's say we go for a many-to-many architecture, so we choose window=10 and an output of the same size, without overlap.
Then our stack would look like this:

[1..10][11...20][21...30]......[91..100] 

[101..110][111...120][121...130]......[191..200]

.........

[901..910][911...920][921...930]......[991..1000]

That is, [1..10] will be used to predict [2...11], [11...20] will be used to predict [12...21], and so forth.


Then, the first batch would be the first slice/column:

[1..10]

[101..110]

....

[901..910]


The second batch is the second column, and so on; we use a stateful network to continue with the hidden state from the previous batch.

Structuring the data in this way, the sequence is broken at the end points of each row (i.e. 100 to 101, 200 to 201). If we consider the sequence to be not hundreds but thousands of words long, that may be something we are willing to sacrifice, because we gain parallelisation: the 10 chunks/rows are trained simultaneously with every batch.
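
The whole batching scheme from this example can be reproduced in plain Python. This is a sketch only: `make_stateful_batches` is a hypothetical helper, it builds the input windows but not the shifted targets, and it drops any values that do not fill a complete row or window.

```python
def make_stateful_batches(series, batch_size, window):
    """Split the series into `batch_size` rows, then cut each row into windows.

    Batch k holds the k-th window of every row, so row i of batch k+1
    continues row i of batch k.
    """
    chunk_len = len(series) // batch_size  # leftover values are dropped
    rows = [series[i * chunk_len:(i + 1) * chunk_len] for i in range(batch_size)]
    n_windows = chunk_len // window
    batches = []
    for k in range(n_windows):
        batches.append([row[k * window:(k + 1) * window] for row in rows])
    return batches

series = list(range(1, 1001))  # 1..1000, standing in for words
batches = make_stateful_batches(series, batch_size=10, window=10)

print(len(batches), len(batches[0]))  # 10 batches of 10 rows each
print(batches[0][0])                  # [1, ..., 10]
print(batches[0][1])                  # [101, ..., 110]
print(batches[1][0])                  # [11, ..., 20], continues row 0 of batch 0
```
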

## LSTM  peculiarities #5: Resetting
In case you are using a stateful LSTM, make sure you reset the state after every epoch, otherwise the NN will treat it as a continuation of the time series.


In [None]:
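epochs = 30

# `model`, `X`, `y` and `n_batch` are assumed to be defined earlier in the notebook.
for i in range(epochs):
    # One pass over the data in its original order (no shuffling for a stateful model)
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    # Clear the carried-over state so the next epoch starts from scratch
    model.reset_states()

# Also, you might want to consider the online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    # Reshape to (samples=1, time steps=1, features=1)
    testX = testX.reshape(1, 1, 1)
    yhat = model.predict(testX, batch_size=1)
    # .item() extracts the single predicted value from the returned array
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat.item()))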