# Entering the 4th Dimension
## Networks for Understanding Time-Oriented Patterns in Data

Common time-based problems include:
* Sequence modeling: "What comes next?" 
    * Likely next letter, word, phrase, category, count, action, value
* Sequence-to-Sequence modeling: "What alternative sequence is a pattern match?" (i.e., similar probability distribution)
    * Machine translation, text-to-speech/speech-to-text, connected handwriting (specific scripts)
    
<img src="http://i.imgur.com/tnxf9gV.jpg">

### Simplified Approaches

* If we know all of the sequence states and the probabilities of state transition...

    ... then we have a simple Markov Chain model.
    
* If we *don't* know all of the states or probabilities (yet) but can make constraining assumptions and acquire solid information from observing (sampling) them...

    ... we can use a Hidden Markov Model approach.
    
These approaches have only limited capacity because they are effectively stateless and so have some degree of "extreme retrograde amnesia."
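
To make the Markov chain idea concrete, here is a minimal sketch, assuming a toy weather sequence (the data and the function names `fit_markov_chain` and `most_likely_next` are made up for illustration). It estimates transition probabilities by counting adjacent pairs, and it predicts the next state from the current state alone, which is the limitation described above.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequence):
    """Estimate first-order transition probabilities by counting adjacent pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

def most_likely_next(chain, state):
    """Return the most probable successor of `state`, or None if it was never seen."""
    followers = chain.get(state)
    if not followers:
        return None
    return max(followers, key=followers.get)

weather = list("SSRSSSRRSS")  # toy observations: S = sunny, R = rainy
chain = fit_markov_chain(weather)
print(chain["S"])                    # probabilities of what follows a sunny day
print(most_likely_next(chain, "R"))  # prediction uses only the current state
```

Note that nothing before the current state enters the prediction: two histories that end in the same state always get the same answer.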

### Can we use a neural network to learn the "next" record in a sequence?

A first approach, using what we already know, would look like this:
* Clamp input sequence to a vector of neurons in a feed-forward network
* Learn a model on the class of the next input record

Let's try it! This can work in some situations, although it's more of a setup for our next development.

In [None]:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# lookup tables between letters and integer codes
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

# slide a window of seq_length letters over the alphabet;
# each window is an input and the letter after it is the target
seq_length = 3
dataX = []
dataY = []
for i in range(len(alphabet) - seq_length):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

Train a network on that data (`seq-mlp.py`).
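
For readers without the script at hand, the following is a rough stand-in, not the contents of `seq-mlp.py`. It is a minimal sketch that assumes a single softmax layer trained with plain gradient descent in NumPy, with each window encoded as concatenated one-hot vectors (that encoding, the step size, and the number of steps are all assumptions chosen for illustration).

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
seq_length = 3
n_classes = len(alphabet)

# Rebuild the (window -> next letter) pairs.
windows = [alphabet[i:i + seq_length] for i in range(len(alphabet) - seq_length)]
targets = [alphabet[i + seq_length] for i in range(len(alphabet) - seq_length)]

def one_hot_window(window):
    """Concatenate one one-hot vector per position: shape (seq_length * 26,)."""
    vec = np.zeros(seq_length * n_classes)
    for pos, ch in enumerate(window):
        vec[pos * n_classes + alphabet.index(ch)] = 1.0
    return vec

X = np.stack([one_hot_window(w) for w in windows])            # (23, 78)
Y = np.eye(n_classes)[[alphabet.index(t) for t in targets]]   # (23, 26)

# A single dense softmax layer (no bias), full-batch gradient descent.
W = np.zeros((X.shape[1], n_classes))
for _ in range(200):
    logits = X @ W
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    probs = exp / exp.sum(axis=1, keepdims=True)
    W -= 0.5 * X.T @ (probs - Y)   # gradient of summed cross-entropy w.r.t. W

predicted = [alphabet[k] for k in (X @ W).argmax(axis=1)]
accuracy = np.mean([p == t for p, t in zip(predicted, targets)])
print("training accuracy:", accuracy)
```

Even this tiny model fits the training windows, which hints at how little the task demands of it.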

The network does learn, and could be trained to reach good accuracy. But what's really going on here?

1. Our training data isn't great because it always includes one feature, in one location that 100% determines the correct result. The problem is not as silly as it looks (consider proper use of definite/indefinite articles in natural language) but the training data isn't good enough to reflect the real problem.

2. Even if we improved our data, and even if the output category depended on factors in all the prior records, we're still learning a stateless distribution.

That is, although the network can learn from the distribution across inputs, and even though we can consider those inputs temporally separated instead of spatially separated, we're missing some fundamentals of time, such as entropy and memory.

Maybe we could add layers, neurons, and extra connections to mitigate parts of the problem. We could also do things like a 1-D convolution to pick up frequencies and patterns.

But, fundamentally, we still have a stateless sequence model that is matching 1 record in and 1 record out.

---

> __ASIDE: Atrous Convolutions__

> An atrous convolution is a convolution filter with "holes" in it. Effectively, it is a way to enlarge the filter spatially while not adding as many parameters or attending to every element in the input.
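
As an illustration of the "holes", here is a minimal pure-Python sketch of a 1-D atrous (dilated) convolution. The function name `dilated_conv1d` and the toy signal are assumptions for this example; deep learning libraries typically expose the same idea through a dilation-rate parameter.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with gaps of `dilation - 1` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1   # input positions covered by one output
    return [
        sum(k * signal[start + j * dilation] for j, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]   # three weights in both cases

print(dilated_conv1d(signal, kernel, dilation=1))  # each output spans 3 inputs
print(dilated_conv1d(signal, kernel, dilation=3))  # each output spans 7 inputs
```

The parameter count stays at three weights while the span of input covered by each output grows from 3 to 7 positions.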

---

## Recurrent Neural Network Concept

__Let's take the neuron's output from one time (t) and feed it into that same neuron at a later time (t+1), in combination with other relevant inputs. Then we would have a neuron with memory.__

We can weight the "return" of that value and train the weight -- so the neuron learns how important the previous value is relative to the current one.

Different neurons might learn to "remember" different amounts of their prior history.

This concept is called a *Recurrent Neural Network*, originally developed around the 1980s.
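
A minimal sketch of the idea, assuming a single scalar neuron with a tanh activation whose previous output is fed back through a weight `w_rec` (the function name, the weights, and the input pulse are illustrative assumptions, not trained values):

```python
import math

def run_recurrent_neuron(inputs, w_in, w_rec, bias=0.0):
    """Run one tanh neuron over a sequence, feeding its output back at each step."""
    h = 0.0                      # no memory before the first time step
    outputs = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        outputs.append(h)
    return outputs

pulse = [1.0, 0.0, 0.0, 0.0, 0.0]   # a single input at t=0, then silence

forgetful = run_recurrent_neuron(pulse, w_in=1.0, w_rec=0.0)
retentive = run_recurrent_neuron(pulse, w_in=1.0, w_rec=0.9)
print([round(h, 3) for h in forgetful])  # the pulse is gone after one step
print([round(h, 3) for h in retentive])  # the pulse decays gradually
```

With `w_rec=0.0` the neuron behaves like an ordinary feed-forward unit; with a larger recurrent weight, a trace of the old input lingers in later outputs. In a real network, `w_rec` would be learned rather than set by hand.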

### Training a Recurrent Neural Network

<img src="http://i.imgur.com/iPGNMvZ.jpg">

We can train an RNN using backpropagation with a minor twist: since RNN neurons with different states over time can be "unrolled" into (i.e., are analogous to) a sequence of neurons with the "remember" weight linking directly forward from (t) to (t+1), we can backpropagate through time as well as through the physical layers of the network.

This is, in fact, called __Backpropagation Through Time__ (BPTT).

The idea is sound but -- since it creates patterns similar to very deep networks -- it suffers from the same challenges:
* Vanishing gradient
* Exploding gradient
* Saturation
* etc.

i.e., many of the same problems that early deep feed-forward networks with lots of weights ran into.

Going 10 steps back in time for a single layer is not as bad as 10 layers (since there are fewer connections and, hence, weights), but it does get expensive.
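
To see why unrolling through time behaves like a deep network, consider a minimal sketch, assuming a single scalar recurrent neuron (the function name and the constants are illustrative). By the chain rule, the gradient of the final state with respect to the initial state is a product with one factor per time step, so it can shrink or grow geometrically.

```python
import math

def gradient_through_time(w_rec, steps, linear=False, x=0.1, w_in=1.0):
    """Return d h_T / d h_0 for a scalar recurrent neuron after `steps` time steps."""
    h, grad = 0.0, 1.0
    for _ in range(steps):
        pre = w_in * x + w_rec * h
        h = pre if linear else math.tanh(pre)
        slope = 1.0 if linear else 1.0 - h * h   # derivative of the activation
        grad *= w_rec * slope                    # chain rule: one factor per step
    return grad

for steps in (1, 10, 50):
    print(steps,
          gradient_through_time(0.5, steps),               # vanishing
          gradient_through_time(1.5, steps, linear=True),  # exploding
          gradient_through_time(5.0, steps))               # saturated tanh
```

The third column shows the saturation case: a large recurrent weight pushes tanh into its flat region, where the slope is nearly zero, so the gradient collapses after the first few steps even though the weight itself is large.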

---

> __ASIDE: Hierarchical and Recursive Networks__

> Network topologies can be built to reflect the relative structure of the data we are modeling. E.g., for natural language, grammar constraints mean that both hierarchy and (limited) recursion may allow a physically smaller model to achieve more effective capacity.

---

## Long Short-Term Memory (LSTM)

"Pure" RNNs were never very successful. Hochreiter and Schmidhuber (1997) made a game-changing contribution with the publication of the Long Short-Term Memory unit.

<sup>(Credit and much thanks to Chris Olah, http://colah.github.io/about.html, Research Scientist at Google Brain, for publishing the following excellent diagrams!)</sup>

*In the following diagrams, note that the output value is "split" for graphical purposes, so the two `h` arrows/signals coming out are the same signal.*

__RNN Cell:__
<img src="http://i.imgur.com/DfYyKaN.png" width=600>

__LSTM Cell:__

<img src="http://i.imgur.com/pQiMLjG.png" width=600>


An LSTM unit is a neuron with some bonus features:
* Cell state propagated across time
* Input, Output, Forget gates
* Learns retention/discard of cell state
* Admixture of new data
* Output partly distinct from state
* Use of __addition__ (not multiplication) to combine input and cell state allows state to propagate unimpeded across time (addition of gradient)
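
The features listed above can be sketched for a single scalar unit as follows. This is a simplified illustration of the commonly used LSTM equations, not Keras's implementation; the `params` layout and the hand-picked weights are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single scalar LSTM unit.

    `params` maps each of 'f', 'i', 'o', 'g' to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    i = sigmoid(pre("i"))      # input gate: how much new candidate to admit
    o = sigmoid(pre("o"))      # output gate: how much of the state to expose
    g = math.tanh(pre("g"))    # candidate value built from the new input
    c = f * c_prev + i * g     # additive update of the cell state
    h = o * math.tanh(c)       # output is a gated view of the state
    return h, c

# Hand-picked (not learned) weights: forget gate ~1, input gate ~0.
remember = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
            "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.3, -0.8, 0.5, 0.9]:
    h, c = lstm_step(x, h, c, remember)
print(c, h)   # the cell state stays close to its initial value of 1.0
```

With the forget gate open and the input gate closed, the cell state passes through four time steps almost unchanged, while the output `h` remains a separate, gated value.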

---

> __ASIDE: Variations on LSTM__

> ... include "peephole" where gate functions have direct access to cell state; convolutional; and bidirectional, where we can "cheat" by letting neurons learn from future time steps and not just previous time steps.

---

### Do LSTMs Work Reasonably Well?

__Yes!__ These architectures are in production (2017) for deep-learning-enabled products at Baidu, Google, Microsoft, Apple, and elsewhere. They are used to solve problems in time series analysis, speech recognition and generation, connected handwriting, grammar, music, and robot control systems.

### Let's Code an LSTM Variant of our Sequence Lab

(code is in `seq-lstm1.py`)

In [None]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# ...

model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))

# ...
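
The reshape step is easy to get wrong, so here is a small self-contained sketch of just that step (it rebuilds `dataX` so it can run on its own; the variable name `flat` is an assumption for illustration). Keras recurrent layers expect a 3-D input of shape `[samples, time steps, features]`, and here each time step carries a single feature, the letter's integer code.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
seq_length = 3
dataX = [[alphabet.index(ch) for ch in alphabet[i:i + seq_length]]
         for i in range(len(alphabet) - seq_length)]

flat = np.array(dataX)                               # (samples, time steps)
X = np.reshape(flat, (len(dataX), seq_length, 1))    # (samples, time steps, features)

print(flat.shape, "->", X.shape)
print(X[0].ravel())   # first sample: one letter code per time step
```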

__Memory and context__

If this network is learning the way we would like, it should be robust to noise and also understand the relative context (in this case, where a prior letter occurs in the sequence).

I.e., we should be able to give it corrupted sequences, and it should produce reasonably correct predictions.

Make the following changes to the code to test this out:

* Change the sequence length to 4
* Assuming '?' is a placeholder for missing or corrupted data, add code at the end to predict on the following sequences:
    * ?BCD, F?HI, STU?, J??M, OP??
    * Hint: for now, use 'Z' as a placeholder for the missing values
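
One possible way to prepare those test patterns, sketched under the hint's assumption that 'Z' stands in for '?' (the helper name `encode_with_placeholder` is made up for this example, and the result still needs the same reshaping and scaling as the training data before calling `predict`):

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

def encode_with_placeholder(pattern, placeholder="Z"):
    """Map a pattern such as 'J??M' to integer codes, substituting '?' first."""
    return [char_to_int[ch] for ch in pattern.replace("?", placeholder)]

for pattern in ["?BCD", "F?HI", "STU?", "J??M", "OP??"]:
    print(pattern, "->", encode_with_placeholder(pattern))
```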
    
__Pretty cool... BUT__

This alphabet example does seem a bit like "tennis without the net" since the original goal was to develop networks that could extract patterns from complex, ambiguous content like natural language or music, and we've been playing with a sequence (Roman alphabet) that is 100% deterministic and tiny in size.

In the terminal, go ahead and start `seq-lstm2.py` since it will take several minutes to run.

This latter script is taken 100% exactly as-is from the Keras library examples folder (March 2017, https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py) and uses precisely the logic we just learned in order to learn and synthesize English language text from a single-author corpus. The amazing thing is that the text is learned and generated one letter at a time, just like we did with the alphabet.

There is a minor difference in the way the inputs are encoded, using 1-hot vectors. And there is a significant difference in the way the outputs (predictions) are generated: instead of taking just the most likely output class (character) via argmax as we did before, this time we are treating the output as a distribution and sampling from the distribution.
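
The script's `sample` helper also reshapes the distribution with a "temperature" (called `diversity` in its output) before sampling. The sketch below isolates that rescaling step so its effect is visible; the three-element distribution and the function name `apply_temperature` are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]            # hypothetical softmax output over three characters
greedy = probs.index(max(probs))   # argmax always picks index 0

for t in (0.2, 1.0, 1.2):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Low temperatures concentrate probability on the most likely character, so sampling behaves almost like argmax; a temperature of 1.0 leaves the distribution unchanged; higher temperatures flatten it and make unlikely characters more frequent.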

Let's take a look at the code ... but even so, this will probably be something to come back to after lunch or a break, as the training takes about 4 minutes per epoch (late 2013 MBP CPU) and we need around 20 epochs (80 minutes!) to get good output.

In [None]:
import sys
sys.exit(0) #just to keep from accidentally running this in Jupyter

'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

path = "../data/nietzsche.txt"
text = open(path).read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

## Gated Recurrent Unit (GRU)

In 2014, a new, promising design for RNN units called Gated Recurrent Unit was published (https://arxiv.org/abs/1412.3555).

GRUs have performed similarly to LSTMs, but are slightly simpler in design:

* GRU has just two gates: "update" and "reset" (instead of the input, output, and forget in LSTM)
* __update__ controls how to modify (weight and keep) cell state
* __reset__ controls how new input is mixed (weighted) with/against memorized state
* there is no output gate, so the cell state is propagated out -- i.e., there is __no "hidden" state__ that is separate from the generated output state
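
As with the LSTM, the gates can be sketched for a single scalar unit. This is a simplified illustration with hand-picked weights, not Keras's implementation. It assumes the convention in which the update gate `z` weights the new candidate and `1 - z` weights the old state; some references swap those two roles.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, params):
    """One time step of a single scalar GRU unit.

    `params` maps 'z' (update), 'r' (reset) and 'h' (candidate) to (w_x, w_h, bias).
    """
    wz_x, wz_h, bz = params["z"]
    wr_x, wr_h, br = params["r"]
    wh_x, wh_h, bh = params["h"]

    z = sigmoid(wz_x * x + wz_h * h_prev + bz)                  # update gate
    r = sigmoid(wr_x * x + wr_h * h_prev + br)                  # reset gate
    candidate = math.tanh(wh_x * x + wh_h * (r * h_prev) + bh)  # proposed new state
    return (1.0 - z) * h_prev + z * candidate                   # the state is also the output

# Hand-picked (not learned) weights for two extreme behaviors.
keep = {"z": (0.0, 0.0, -10.0), "r": (0.0, 0.0, 0.0), "h": (1.0, 1.0, 0.0)}
overwrite = {"z": (0.0, 0.0, 10.0), "r": (0.0, 0.0, -10.0), "h": (1.0, 1.0, 0.0)}

h_keep = h_over = 0.7
for x in [0.3, -0.8, 0.5, 0.9]:
    h_keep = gru_step(x, h_keep, keep)
    h_over = gru_step(x, h_over, overwrite)
print(h_keep, h_over)
```

With the update gate nearly closed, the state stays near its starting value of 0.7; with the update gate open and the reset gate closed, the state is overwritten by a function of the latest input alone.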

<img src="http://i.imgur.com/nnATBmC.png" width=700>

Which one should you use for which applications? The jury is still out -- this is an area for experimentation!

### Using GRUs in Keras

... is as simple as using the built-in GRU class (https://keras.io/layers/recurrent/)

If you are working with RNNs, spend some time with the docs to go deeper, as we have just barely scratched the surface here, and there are many "knobs" to turn that will help things go right (or wrong).

### What Does Our Nietzsche Generator Produce?

Here are snapshots from middle and late in a training run.

#### Iteration 19

```
Iteration 19
Epoch 1/1
200287/200287 [==============================] - 262s - loss: 1.3908     

----- diversity: 0.2
----- Generating with seed: " apart from the value of such assertions"
 apart from the value of such assertions of the present of the supersially and the soul. the spirituality of the same of the soul. the protect and in the states to the supersially and the soul, in the supersially the supersially and the concerning and in the most conscience of the soul. the soul. the concerning and the substances, and the philosophers in the sing"--that is the most supersiall and the philosophers of the supersially of t

----- diversity: 0.5
----- Generating with seed: " apart from the value of such assertions"
 apart from the value of such assertions are more there is the scientific modern to the head in the concerning in the same old will of the excited of science. many all the possible concerning such laugher according to when the philosophers sense of men of univerself, the most lacked same depresse in the point, which is desires of a "good (who has senses on that one experiencess which use the concerning and in the respect of the same ori

----- diversity: 1.0
----- Generating with seed: " apart from the value of such assertions"
 apart from the value of such assertions expressions--are interest person from indeed to ordinapoon as or one of
the uphamy, state is rivel stimromannes are lot man of soul"--modile what he woulds hope in a riligiation, is conscience, and you amy, surposit to advanced torturily
and whorlon and perressing for accurcted with a lot us in view, of its own vanity of their natest"--learns, and dis predeceared from and leade, for oted those wi

----- diversity: 1.2
----- Generating with seed: " apart from the value of such assertions"
 apart from the value of such assertions of
rutould chinates
rested exceteds to more saarkgs testure carevan, accordy owing before fatherly rifiny,
thrurgins of novelts "frous inventive earth as dire!ition he
shate out of itst sacrifice, in this
mectalical
inworle, you
adome enqueres to its ighter. he often. once even with ded threaten"! an eebirelesifist.

lran innoting
with we canone acquire at them crarulents who had prote will out t
```

#### Iteration 32

```
Iteration 32
Epoch 1/1
200287/200287 [==============================] - 255s - loss: 1.3830     

----- diversity: 0.2
----- Generating with seed: " body, as a part of this external
world,"
 body, as a part of this external
world, and in the great present of the sort of the strangern that is and in the sologies and the experiences and the present of the present and science of the probably a subject of the subject of the morality and morality of the soul the experiences the morality of the experiences of the conscience in the soul and more the experiences the strangere and present the rest the strangere and individual of th

----- diversity: 0.5
----- Generating with seed: " body, as a part of this external
world,"
 body, as a part of this external
world, and in the morality of which we knows upon the english and insigning things be exception of
consequences of the man and explained its more in the senses for the same ordinary and the sortarians and subjects and simily in a some longing the destiny ordinary. man easily that has been the some subject and say, and and and and does not to power as all the reasonable and distinction of this one betray

----- diversity: 1.0
----- Generating with seed: " body, as a part of this external
world,"
 body, as a part of this external
world, surrespossifilice view and life fundamental worthing more sirer. holestly
and whan to be
dream. in whom hand that one downgk edplenius will almost eyes brocky that we wills stupid dor
oborbbill to be dimorable
great excet of ifysabless. the good take the historical yet right by guntend, and which fuens the irrelias in literals in finally to the same flild, conditioned when where prom. it has behi

----- diversity: 1.2
----- Generating with seed: " body, as a part of this external
world,"
 body, as a part of this external
world, easily achosed time mantur makeches on this
vanity, obcame-scompleises. but inquire-calr ever powerfully smorais: too-wantse; when thoue
conducting
unconstularly without least gainstyfyerfulled to wo
has upos
among uaxqunct what is mell "loves and
lamacity what mattery of upon the a. and which oasis seour schol
to power: the passion sparabrated will. in his europers raris! what seems to these her

```

### Take a look at the anomalous behavior that starts late in the training... What might have happened?

#### Iteration 38

```
Iteration 38
Epoch 1/1
200287/200287 [==============================] - 256s - loss: 7.6662     

----- diversity: 0.2
----- Generating with seed: "erable? for there is no
longer any ought"
erable? for there is no
longer any oughteesen a a  a= at ae i is es4 iei aatee he a a ac  oyte  in ioie  aan a atoe aie ion a atias a ooe o e tin exanat moe ao is aon e a ntiere t i in ate an on a  e as the a ion aisn ost  aed i  i ioiesn les?ane i ee to i o ate   o igice thi io an a xen an ae an teane one ee e alouieis asno oie on i a a ae s as n io a an e a ofe e  oe ehe it aiol  s a aeio st ior ooe an io e  ot io  o i  aa9em aan ev a

----- diversity: 0.5
----- Generating with seed: "erable? for there is no
longer any ought"
erable? for there is no
longer any oughteese a on eionea] aooooi ate uo e9l hoe atae s in eaae an  on io]e nd ast aais  ta e  od iia ng ac ee er ber  in ==st a se is ao  o e as aeian iesee tee otiane o oeean a ieatqe o  asnone anc 
 oo a t
tee sefiois to an at in ol asnse an o e e oo  ie oae asne at a ait iati oese se a e p ie peen iei ien   o oot inees engied evone t oen oou atipeem a sthen ion assise ti a a s itos io ae an  eees as oi

----- diversity: 1.0
----- Generating with seed: "erable? for there is no
longer any ought"
erable? for there is no
longer any oughteena te e ore te beosespeehsha ieno atit e ewge ou ino oo oee coatian aon ie ac aalle e a o  die eionae oa att uec a acae ao a  an eess as
 o  i a io  a   oe a  e is as oo in ene xof o  oooreeg ta m eon al iii n p daesaoe n ite o ane tio oe anoo t ane
s i e tioo ise s a asi e ana ooe ote soueeon io on atieaneyc ei it he se it is ao e an ime  ane on eronaa ee itouman io e ato an ale  a mae taoa ien

----- diversity: 1.2
----- Generating with seed: "erable? for there is no
longer any ought"
erable? for there is no
longer any oughti o aa e2senoees yi i e datssateal toeieie e a o zanato aal arn aseatli oeene aoni le eoeod t aes a isoee tap  e o . is  oi astee an ea titoe e a exeeee thui itoan ain eas a e bu inen ao ofa ie e e7n anae ait ie a ve  er inen  ite
as oe of  heangi eestioe orasb e fie o o o  a  eean o ot odeerean io io oae ooe ne " e  istee esoonae e terasfioees asa ehainoet at e ea ai esoon   ano a p eesas e aitie
```