# Neural Networks for Machine Learning (7)

### Modeling Sequences: A Brief Overview

###### Getting targets when modeling sequences:

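- Transform an input sequence into an output sequence in a different domain
- When there is no separate target sequence, get a teaching signal by predicting the next term of the input sequence
- Predicting the next term blurs the distinction between supervised and unsupervised learning: it uses supervised methods but needs no separate teaching signal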
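As a minimal sketch of how next-term prediction supplies its own targets, the snippet below pairs every element of a sequence with its successor. The helper name `next_step_pairs` and the example words are assumptions made for illustration, not part of the lecture.

```python
def next_step_pairs(sequence):
    """Pair each element with the element that follows it."""
    inputs = sequence[:-1]   # everything except the last term
    targets = sequence[1:]   # the same sequence shifted by one step
    return list(zip(inputs, targets))


pairs = next_step_pairs(["the", "cat", "sat", "down"])
for current, target in pairs:
    print(current, "->", target)
```

The targets come from the data itself, which is why no separately labelled sequence is needed.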

###### Memoryless models for sequences

![](pics/7-1-1.png)

###### Beyond memoryless models:

A generative model with hidden state has its own internal dynamics:
- It can store information in its hidden state for a long time
- If the dynamics are noisy, the hidden state is noisy too, so we can never know it exactly
- The best we can do is infer a probability distribution over the space of hidden state vectors

The inference is only tractable for two types of hidden state model:

###### Linear dynamical system:

![](pics/7-1-2.png)

###### Hidden Markov Models:

![](pics/7-1-3.png)

###### Limitation of HMMs:
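- At each time step an HMM is in exactly one of its $N$ hidden states, so the hidden state can carry only $\log_2(N)$ bits of information about what has been generated so far
- Utterance example: if the first half of an utterance must pass 100 bits of information to the second half, the HMM needs $2^{100}$ states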
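The arithmetic behind the utterance example can be written out as follows, assuming (as in the bullet above) that 100 bits have to be remembered:

$$
\log_2 N \ge 100 \quad\Longrightarrow\quad N \ge 2^{100} \approx 1.27 \times 10^{30}
$$

A distributed hidden state avoids this blow-up, because the number of bits it can hold grows with the number of units rather than with the logarithm of the number of states.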

###### Recurrent Neural Networks

- Linear dynamical systems and HMMs are stochastic models
  - But the posterior distribution over their hidden states, given the data observed so far, is a deterministic function of the data
- RNNs are deterministic
  - Think of the hidden state of an RNN as the equivalent of that deterministic posterior distribution in a stochastic model

What RNNs can exhibit:
- They can oscillate
- They can settle to point attractors
- They can behave chaotically
- They could potentially learn many little programs that each capture a nugget of knowledge and run in parallel
- But this computational power makes RNNs very hard to train

### Training RNNs with Backpropagation

###### Equivalence between feedforward nets and recurrent nets

![](pics/7-2-1.png)

###### Backpropagation with weight constraints

Compute the gradients as usual, then modify them so that they satisfy the constraints. If the weights start off satisfying the constraints, they will keep satisfying them.  
To constrain: $w_1=w_2$  
We need: $\Delta w_1=\Delta w_2$  
Compute: $\frac{\partial{E}}{\partial{w_1}}$ and $\frac{\partial{E}}{\partial{w_2}}$  
Use $\frac{\partial{E}}{\partial{w_1}}+\frac{\partial{E}}{\partial{w_2}}$ as the gradient for both $w_1$ and $w_2$
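The rule above can be sketched in a few lines. The error function, the inputs `x1`, `x2`, the target `t`, the learning rate, and the helper name `tied_update` are all assumptions chosen only to make the example runnable.

```python
def tied_update(w1, w2, grad_w1, grad_w2, learning_rate=0.1):
    """Apply one gradient step that keeps two tied weights equal."""
    shared_grad = grad_w1 + grad_w2        # sum of the two separate gradients
    step = -learning_rate * shared_grad    # identical change for both weights
    return w1 + step, w2 + step


# Hypothetical error E = (w1*x1 + w2*x2 - t)^2 with the constraint w1 == w2.
x1, x2, t = 1.0, 2.0, 3.0
w1 = w2 = 0.5                              # start off satisfying the constraint
for _ in range(50):
    err = w1 * x1 + w2 * x2 - t
    w1, w2 = tied_update(w1, w2, 2 * err * x1, 2 * err * x2)

print(w1, w2)
```

Because both weights start equal and always receive the same change, the constraint holds after every update.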

###### Backpropagation through time

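1. An RNN can be regarded as a layered, feed-forward net with shared weights, so it can be trained with weight constraints
2. The forward pass builds up a stack of the activities of all units at each time step
3. The backward pass peels activities off the stack to compute the error derivatives at each time step
4. After the backward pass, add together the derivatives from all time steps for each weight

The initial states can be treated as learned parameters: backpropagate the error all the way to the initial states and adjust them like the weights.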
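A minimal sketch of backpropagation through time, assuming a hypothetical one-unit RNN $h_t = \tanh(w h_{t-1} + u x_t)$ with a squared error on the final state only. The names `forward`, `bptt`, `loss` and all numeric values are illustrative assumptions; the finite-difference check at the end is there only to confirm the derivative for `w`.

```python
import math


def forward(w, u, h0, xs):
    """Run the scalar RNN and return the stack of hidden states h_0..h_T."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(w, u, h0, xs, target):
    """Squared error on the final hidden state."""
    return 0.5 * (forward(w, u, h0, xs)[-1] - target) ** 2


def bptt(w, u, h0, xs, target):
    """Return dE/dw, dE/du and dE/dh_0 for the loss above."""
    hs = forward(w, u, h0, xs)
    dh = hs[-1] - target                  # dE/dh_T
    dw = du = 0.0
    for t in range(len(xs), 0, -1):       # walk the stack from the last step back
        dz = dh * (1.0 - hs[t] ** 2)      # back through tanh at step t
        dw += dz * hs[t - 1]              # shared weight: sum over time steps
        du += dz * xs[t - 1]
        dh = dz * w                       # derivative w.r.t. the previous state
    return dw, du, dh                     # dh now holds dE/dh_0


xs, target = [0.5, -1.0, 0.25, 0.8], 0.3
w, u, h0 = 0.7, -0.4, 0.1
dw, du, dh0 = bptt(w, u, h0, xs, target)

eps = 1e-6
dw_numeric = (loss(w + eps, u, h0, xs, target)
              - loss(w - eps, u, h0, xs, target)) / (2 * eps)
print(dw, dw_numeric)
```

The value returned as `dh0` is the gradient with respect to the initial state, which is what allows the initial state to be learned alongside the weights.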

###### Providing input to RNN
1. Specify the initial states of all units
2. Specify the initial states of a subset of the units
3. Specify the states of the same subset of units at every time step

Targets can be provided in a similar range of ways:

1. Specify targets for the desired final states
2. Specify targets for the last few steps
   - good for learning attractors
   - easy to add in the extra error derivatives during backpropagation
3. Specify the desired activity of a subset of the units

### A Toy Example of Training an RNN

Adding up two binary numbers

 ![](pics/7-4-1.png)
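To make the task concrete, here is a hand-written sketch of binary addition processed one bit position per time step. The least-significant-bit-first order and the function name `add_binary_stepwise` are assumptions for illustration. This is the target behaviour, not a trained network.

```python
def add_binary_stepwise(a_bits, b_bits):
    """Add two equal-length bit lists, least significant bit first."""
    carry = 0                      # the only information kept between steps
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total % 2)      # output bit for this time step
        carry = total // 2         # state passed on to the next time step
    return out, carry


# 6 + 3 = 9, written least significant bit first: 011 + 110
bits, final_carry = add_binary_stepwise([0, 1, 1], [1, 1, 0])
print(bits, final_carry)
```

The carry plays the role of hidden state: the correct output at each step depends on it as well as on the two current input bits.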

### Why it is Difficult to Train an RNN

1. In the forward pass, squashing functions such as the logistic prevent the activity vectors from exploding
2. The backward pass is completely linear: if you double the error derivatives at the final layer, all the error derivatives will double
3. With many layers, gradients shrink exponentially if the weights are small and explode exponentially if the weights are big, so an RNN trained on a long sequence runs into exactly this problem
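A rough numeric illustration of the third point, assuming a single scalar recurrent weight and ignoring the slope of the nonlinearity, so each backward step simply multiplies the derivative by that weight. The function name and the values 0.9, 1.1 and 100 steps are arbitrary choices for the sketch.

```python
def backpropagated_scale(weight, steps):
    """Factor applied to an error derivative after `steps` linear backward steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight            # each backward step multiplies by the weight
    return scale


small = backpropagated_scale(0.9, 100)   # weight below 1: derivative vanishes
large = backpropagated_scale(1.1, 100)   # weight above 1: derivative explodes
print(small, large)
```

Even weights this close to 1 give scale factors that differ by many orders of magnitude after 100 steps.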

###### Why back-propagated gradient blows up

![](pics/7-4-2.png)

###### Four effective ways to learn an RNN

1. Long short term memory
2. Hessian free optimization
3. Echo state networks (the echo of the input acts as a kind of memory)
4. Good initialization with momentum

### Long Short Term Memory

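###### Implementing a memory cell in a neural network
- Use a circuit that implements an analog memory cell
- 'keep' / 'write' / 'read' gates control whether the stored value is retained, overwritten, or read out
- We can backpropagate through this circuit because logistic units have nice derivatives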
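A minimal sketch of one such memory cell, assuming a single scalar cell whose three gates are logistic units. The gate logits are supplied by hand here, whereas in a real network they would be computed from the rest of the net. The function names and the stored value are illustrative assumptions.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def memory_cell_step(stored, incoming, keep_logit, write_logit, read_logit):
    """One time step of a gated analog memory cell."""
    keep = logistic(keep_logit)      # near 1: retain the stored value
    write = logistic(write_logit)    # near 1: let the incoming value in
    read = logistic(read_logit)      # near 1: expose the stored value
    stored = keep * stored + write * incoming
    return stored, read * stored


OPEN, CLOSED = 20.0, -20.0           # logits that drive a gate close to 1 or 0

# Step 0: write a value. Steps 1-5: hold it while ignoring distractors. Step 6: read.
stored, _ = memory_cell_step(0.0, 1.7, CLOSED, OPEN, CLOSED)
for distractor in [9.9, -3.0, 4.2, 0.5, -8.0]:
    stored, hidden_output = memory_cell_step(stored, distractor, OPEN, CLOSED, CLOSED)
stored, output = memory_cell_step(stored, 0.0, OPEN, CLOSED, OPEN)
print(output)
```

Every operation in the cell is a product or sum of smooth quantities, which is what lets error derivatives flow back through the gates.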

###### An example

![](pics/7-5-1.png)
![](pics/7-5-2.png)
![](pics/7-5-3.png)

###### Reading cursive handwriting

Input: a sequence of $(x, y, p)$ pen coordinates, where $p$ indicates whether the pen is up or down  
Output: a sequence of characters