# Lab 3: Fully-Connected Neural Networks

# Nonlinear modeling
Logistic regression and linear regression are linear models: the output variable (or the vector of logits, in logistic regression) is a linear combination of the input features.
These models can model nonlinear terms in your data, including interactions between the features, but only if you design those terms by hand.
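
As a rough illustration of what "designing those terms by hand" means, here's a minimal NumPy sketch. The synthetic data, the `fit_mse` helper, and the choice of a product interaction are all assumptions made up for this example: a linear fit on the raw features fails, while the same fit succeeds once the interaction column is added manually.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target is the product
# of the two features, which a model linear in x1 and x2 cannot capture.
x = rng.normal(size=(200, 2))
y = 3.0 * x[:, 0] * x[:, 1]

def fit_mse(features, targets):
    """Least-squares linear fit with an intercept; returns the training MSE."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return float(np.mean((design @ coef - targets) ** 2))

# Raw features only
mse_raw = fit_mse(x, y)

# Raw features plus a hand-designed interaction term x1 * x2
x_engineered = np.column_stack([x, x[:, 0] * x[:, 1]])
mse_engineered = fit_mse(x_engineered, y)

print('MSE without interaction term:', mse_raw)
print('MSE with interaction term:   ', mse_engineered)
```

The model is still linear in its parameters either way; all of the nonlinearity comes from the column we chose to add.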

By contrast, nonlinear models such as neural networks can "automatically" discover interaction terms and nonlinear terms in the input.
This lets them represent much more complicated functions, since most interesting problems involve terms too difficult to design by hand.

The key distinguishing factor between different kinds of nonlinear models is how those terms are discovered.
Very generic models, like RBF SVMs, can represent almost any kind of mapping from input to output, but as a result can't learn "smart" features because the only property they assume about the data is that similar inputs map to similar outputs (the "local smoothness prior").
More "opinionated" models, like neural networks, learn a function from a smaller class but can learn more complicated relationships with less data.

#### Aside: The "no free lunch theorem of machine learning"
This theorem roughly says: no machine learning model strictly outperforms any other model on all problems.
It's kind of controversial whether this really means anything: sure, you can pick a test set that has nothing to do with the input, and so random choice does better than even mean-fitting.

But, the general idea is important for understanding why we'd choose one model over another.
No model does better than any other on random functions.
Instead, when picking a model, you need to think about what _properties of the real world_ the model does well on.
For instance, if you expect classes to be well-separated, then an SVM is a good choice; if you expect neighborhood relationships to be more important, then K-nearest-neighbors might be a good fit.
When thinking about models, think about what _priors_ they impose on the functions they learn.
We'll justify neural networks by showing that they impose priors we'd reasonably expect problems in the real world to follow.

# Dense layers, mathematically
Fully-connected, feedforward neural networks are a special case of neural networks which use only fully-connected (or "dense") layers for their hidden layers.

### Dense layers
A dense layer $L$ computes its output (or _activation_) as
$$
\vec{a_L} = f (W_L \vec{a_{L-1}} + \vec{b_L})
$$
where $W_L$ and $\vec{b_L}$ are the layer's learned weight matrix and bias vector, $\vec{a_{L-1}}$ is the activation of the previous layer (or is just the input, if $L = 1$), and $f$ is an **activation function** which performs some (nonlinear) transformation on each element of the output vector.
You can think of a dense layer with $n$ inputs and $m$ outputs as performing $m$ linear regressions, each from the same set of $n$ input variables, then applying the activation function to each. 
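
Here's a minimal NumPy sketch of that formula for a single input vector. The function name `dense_forward`, the example numbers, and the use of ReLU (covered later in this lab) as $f$ are assumptions for illustration, not part of any library.

```python
import numpy as np

def relu(z):
    """Elementwise max(0, z)."""
    return np.maximum(0.0, z)

def dense_forward(a_prev, W, b, f):
    """Compute f(W @ a_prev + b) for one input vector."""
    return f(W @ a_prev + b)

# A layer with n = 3 inputs and m = 2 outputs: W has shape (m, n)
W = np.array([[1.0, -1.0, 0.5],
              [0.0,  2.0, -1.0]])
b = np.array([0.5, -2.0])
a_prev = np.array([1.0, 2.0, 3.0])

# Pre-activations are [1.0, -1.0]; ReLU zeroes out the negative one
a = dense_forward(a_prev, W, b, relu)
print(a)
```

Each row of `W` (with its entry in `b`) plays the role of one of the $m$ linear regressions.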

The activation function is important.
A neural network with no activation function (or a linear activation function) is really just linear regression with more steps -- the matrices for each layer multiply together, and the entire network becomes $\vec{y} = W \vec{x} + \vec{b}$, where $\vec{y}$ is the output vector, $\vec{x}$ is the vector of input features, and $W$ and $\vec{b}$ are some weight matrix and bias vector.
A network with nonlinear activation functions is instead able to learn to use those activation functions to "bend" its input space in the right places and approximate any function.
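
To see the "linear regression with more steps" claim concretely, here's a small NumPy check with made-up random weights (the layer sizes are arbitrary assumptions). Two stacked layers with identity activation give the same output as one layer with $W = W_2 W_1$ and $\vec{b} = W_2 \vec{b_1} + \vec{b_2}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dense layers with identity activation: 3 -> 5 -> 2
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Output of the two-layer "network"
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))
```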

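$W_L$ and $\vec{b_L}$ get learned through gradient descent and backpropagation, which is possible because every operation here is differentiable.
So, it's important that the activation function is differentiable (at least almost everywhere: ReLU, for instance, has a kink at zero but works fine in practice).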

### Output layer
The last layer of a feedforward neural network takes the activations of the last hidden layer (the _final representation_) and uses it to produce the output.

For regression, this is easy: just perform a linear regression from the final representation to the output.
This is equivalent to a dense layer with a linear (identity) activation function.

For classification, logistic regression is used instead of linear regression.
Binary classification ("Bernoulli output") uses the logistic sigmoid as an activation function; classifying input into one of many classes ("multinoulli output") uses the softmax function.
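
For reference, here's a minimal NumPy sketch of the two output activations. The function names and example logits are assumptions for illustration; subtracting the maximum logit inside `softmax` is a common numerical-stability trick and doesn't change the result.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Maps a vector of logits to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

probs = softmax(np.array([2.0, 1.0, 0.1]))
p_positive = sigmoid(0.0)

print('Softmax probabilities:', probs, 'sum =', probs.sum())
print('Sigmoid at logit 0:', p_positive)
```

With two classes and logits $(z, 0)$, the softmax probability of the first class reduces to $\sigma(z)$, which is why the sigmoid is enough for binary classification.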

For a more thorough coverage, check out http://www.deeplearningbook.org/, chapter 6.2.

#### Aside: linear activations
Using the identity function as an activation in hidden layers isn't actually totally meaningless.
You replace the single matrix multiplication of a full linear regression from inputs to outputs with a product of matrices, which might together have fewer parameters than the single large matrix.
So, a block of dense layers with linear activation really learns to perform a _factored linear regression_, which acts as a low-rank approximation to a full linear regression.
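
A quick sketch of the parameter savings, ignoring biases. The sizes (100 inputs, 80 outputs, a 5-unit linear hidden layer) are arbitrary assumptions picked for illustration.

```python
import numpy as np

n_in, n_hidden, n_out = 100, 5, 80

# One full linear map vs. two thin ones through a 5-unit bottleneck
full_params = n_in * n_out
factored_params = n_in * n_hidden + n_hidden * n_out

# The product of the two thin matrices can have rank at most n_hidden
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n_hidden, n_in))
W2 = rng.normal(size=(n_out, n_hidden))
effective_rank = np.linalg.matrix_rank(W2 @ W1)

print('Full:', full_params, 'Factored:', factored_params,
      'Rank of product:', effective_rank)
```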

# Interpretations of neural networks
Neural networks get interpreted in a ton of different ways.
Here are a few of my favorites (though there exist more: [manifold deformation](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/), logic circuits, ...).

## Representation learning
Prior to deep learning, the dominant approach to machine learning was "feature engineering" -- designing features (interactions, nonlinear terms, etc) by hand with the assumption that some simple model of them (linear model, K-NN) will suffice to solve the problem. 
One way to think about neural networks is that they perform this process automatically, learning representations in their hidden layers and then fitting a simple model (linear regression or logistic regression) to the output data.
Hidden layers are sometimes said to perform "feature extraction" in that they learn a good set of features for the problem.

The last layer is a linear model of the activation of the second-to-last layer, and so the set of hidden layers defines a nonlinear function from the input features to the final representation.
We want this function to "disentangle" the input, so that a linear model is sufficient to perform the final step.
The activations of hidden layers form "representations" of the input that contain the same important information, but discard noise and make the information more easily accessible.

This is achieved with the backpropagation algorithm and gradient descent.
The gradient of the loss function not only tells the last layer how it should change to fit the data better, but also how the previous layer should change such that the final layer does better. 
 
This same reasoning applies within hidden layers.
The last hidden layer has the difficult task of giving the output layer a good enough representation to perform the task linearly.
This task is easier if the last hidden layer itself has a good representation as input.
And so on.
That's why we use multiple hidden layers: because each one makes the next one's job easier.

Neural networks of sufficient size are **universal function approximators** (and this is a theorem!), able to approximate any continuous function from their inputs to their outputs (over a bounded region of input space) as closely as you like, so long as they have at least one hidden layer with a nonlinear activation.
Wider networks learn a richer representation per hidden layer, and deeper networks learn more representations so each can be simpler.

It's worth noting: at every step, the hidden layers change so that their activations would make the _current_ version of the final layer do better.
But, the final layer is also changing, so we can't necessarily change each layer in just the right way.
This problem gets harder with deeper models.

## Hierarchical pattern-matching
I mentioned before that a model is only as good as the priors it imposes, and how those line up with the real world.
Neural networks impose the prior that the patterns they're finding are _hierarchical_ in structure.
That is, the output pattern is easy to see if we frame the question in terms of the presence or absence of slightly simpler patterns.
Those patterns are easier to find if we frame them in terms of yet simpler patterns, and so on.

For example, recognizing a handwritten digit might boil down to which kinds of common pen strokes (e.g. loops near the top or bottom of an image, straight lines, etc) are present.
Those strokes are made up of common shapes (curves, corners, angles).
And those shapes are made up of common patterns in pixels, which is the input.
Each hidden layer then ought to learn more and more abstract representations, in terms of composing simple patterns of lower-level layers.
This also implies that a single layer might recognize multiple kinds of patterns, using groups of neurons instead of the whole layer at once.

Is this prior reasonable?
I think so.
Human problem-solving is often based on breaking up problems into simple components or patterns -- physics, for instance, builds up from simple arithmetic to more advanced math, and then to higher-level reasoning, intuition, and experience.
There's something interesting going on at each level of the hierarchy, and the whole problem looks much simpler once you frame each level in terms of the previous levels.

#### Aside on DenseNet
There's a new-ish neural network architecture called DenseNet which lets each layer look at the outputs of all earlier layers in its block instead of just the previous layer.
This might be interpreted as letting a model reason about several levels of abstraction at once.

## Function composition
Another (super cool but pretty unconventional) perspective is that each hidden layer (or a small group of hidden layers) learns to perform a simple function mapping its input to its output.
From this perspective, the entire network is a functional program like you might write in Haskell, where the functions learned by hidden layers are chained by composition to form a map from input to output.
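
In Python rather than Haskell, the same idea might be sketched like this. The `compose` helper and the three toy "layers" are hypothetical stand-ins; real layers would be learned functions on vectors.

```python
from functools import reduce

def compose(*layers):
    """Chain layers left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Hypothetical stand-ins for the functions learned by hidden layers
double = lambda x: 2 * x
add_one = lambda x: x + 1
square = lambda x: x * x

# The "network" is just the composition of its layers
network = compose(double, add_one, square)
result = network(3)  # square(add_one(double(3)))
print(result)
```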

In this view, you can think of the vector spaces that vectors in different levels live in as _types_, like in type theory and functional programming.
Blocks of dense layers learn single functions from their input type to their output type.
More complicated layers are more interesting, though: convolutional filters, for instance, apply the same function over and over like a programmer might reuse a function instead of copy-pasting logic.
Common patterns in developing neural networks correspond to higher-order functions, like maps and folds.

I think this view is fascinating.
If you want to read more, check out [this blog post](https://colah.github.io/posts/2015-09-NN-Types-FP/) by Chris Olah.

## Effects of depth and width
#### Capacity
Model capacity is first and foremost when deciding on how many layers a model should have (depth) and how many units each layer should have (width).
Wider layers learn more complicated functions, and more layers means the network composes more functions; both lead to a higher-capacity model, less likely to underfit but more likely to overfit.
Roughly, width adds more parameters than depth, since adding one unit to a layer adds a parameter for every output of the previous layer.
Neural networks are usually only a good fit for problems where you have _lots_ of data, so modern architectures tend to be very large and high-capacity.
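
To get a feel for how width and depth affect parameter counts, here's a small helper. The function name is made up for this sketch; the base shape 4 → 16 → 8 → 3 matches the Keras example at the end of this lab, and the "wider" and "deeper" variants are arbitrary assumptions.

```python
def count_dense_params(layer_sizes):
    """Total weights and biases for a stack of dense layers.

    layer_sizes = [n_inputs, hidden_1, ..., n_outputs]
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

base = count_dense_params([4, 16, 8, 3])
wider = count_dense_params([4, 32, 16, 3])     # double each hidden width
deeper = count_dense_params([4, 16, 8, 8, 3])  # add one 8-unit hidden layer

print('base:', base, 'wider:', wider, 'deeper:', deeper)
```

In this case doubling the hidden widths roughly triples the parameter count, while adding a layer adds comparatively few.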

#### Depth vs width
![Deep models generalize better](./images/depth_vs_width.png)

Image source: http://www.deeplearningbook.org/, chapter 6.4.

More interesting than determining the number of parameters is the question of where those parameters should go (more layers vs wider layers).
Generally, the modern understanding is that deeper networks are both more powerful (higher capacity) and generalize better in practice than networks that spend their parameters on width.
This seems impossible from a [bias-variance tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html) perspective!
This effect is generally explained as: making a model "deep and thin" imposes a strong prior that the function it learns will be a composition of very many simple functions.
This appears to be a good fit for many real-world problems.

This point is illustrated in the experiment above (performed on a dataset of house address numbers), where Goodfellow et al. plot the test accuracy of a model against its number of parameters and layers, and find that shallow models are more prone to overfitting as they get wider.

The use of models with many layers distinguishes modern "deep learning" from earlier attempts at applying neural networks, which typically used one or two hidden layers.
It also explains (in part) their incredible success on some very difficult problems.

# Activation functions
### Logistic sigmoid
![sigmoid and its derivative](https://cdn-images-1.medium.com/max/1440/1*gkXI7LYwyGPLU5dn6Jb6Bg.png)

Traditionally, neural networks used the logistic sigmoid function,
$$\sigma(x) = \frac{e^x}{e^x + 1},$$ 
as an activation function for hidden layers.
However, it has the serious problem of **saturating** at high and low activations.

When computing the gradient update for a parameter (with backpropagation), we multiply by the derivative of the activation function at that point.
When the weights of a layer using sigmoid activation are large, the inputs to the logistic function are very positive or very negative.
In these places, the derivative of the sigmoid function is near zero, and it's said to have _saturated_: that unit won't learn anything more, because its gradient updates (backpropagated through the sigmoid function) will be zero.

The maximum value of its derivative is also $1/4$, so as your model gets deeper and you backpropagate through many sigmoid functions, early layers will see lower gradients and learn much slower.
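
A quick numerical check of both claims. The input values are chosen for illustration, and the "10 layers" figure only tracks the activation-derivative factors, ignoring the weights that also appear in backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)     # the maximum of the derivative
grad_saturated = sigmoid_grad(10.0)  # deep in the saturated region

# Best case for the product of derivatives through 10 sigmoid layers
best_case_10_layers = 0.25 ** 10

print(grad_at_zero, grad_saturated, best_case_10_layers)
```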

It's worth noting that the logistic sigmoid only has a problem with saturating when it's used for hidden units, not for output units.
If the model's very wrong in a binary classification, the cross-entropy loss's log function cancels the exponentials in the sigmoid function, so the loss grows roughly linearly in the logit and the gradient stays large instead of vanishing.
If the model's very right, the gradient updates will go to zero, as they should.
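
To make that cancellation concrete, here's a sketch of the derivative for a single example, assuming a label $y \in \{0, 1\}$ and a logit $z$ feeding the sigmoid output (the symbols $y$, $z$, and $J$ are notation introduced just for this sketch), and using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:
$$
J(z) = -\left[ y \log \sigma(z) + (1 - y) \log(1 - \sigma(z)) \right], \qquad
\frac{\partial J}{\partial z} = \sigma(z) - y
$$
The derivative with respect to the logit is just the prediction error: it stays near $\pm 1$ when the model is confidently wrong, and goes to zero when it's right.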

### ReLU
![relu and its derivative](https://cdn-images-1.medium.com/max/1440/1*g0yxlK8kEBw8uA1f82XQdA.png)

ReLU (rectified linear) activation,
$$f(x) = \max(0, x),$$
is now much more common than sigmoid activation because of the saturation problem.
Any time it "fires", it has a significant derivative and applies parameter updates.
In addition, unlike the sigmoid function, it doesn't "depress" gradients when you propagate them through many layers.
ReLU activations and derivatives are also faster to compute than those of most other functions.
These three properties make ReLU a good default choice for almost every network, especially deep ones. 

One caveat is that ReLUs that are initialized badly or receive strong parameter updates from high gradients can get pushed to a region where they never activate, and instead always report zero.
These are called "dead ReLUs" and can never recover, since the derivative of the ReLU function is zero when it doesn't activate.
This can make training ReLU networks with high learning rates dangerous, but there are [some solutions to this](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Leaky_ReLUs).
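
Here's a minimal NumPy sketch of ReLU, its derivative, and the leaky variant from the link above. The function names are made up for this sketch, and the slope of 0.01 is a commonly used value taken here as an assumption.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """1 where the unit fires, 0 where it doesn't (a dead unit gets no update)."""
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but with a small nonzero slope for negative inputs."""
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

z = np.array([-3.0, -0.5, 2.0])
print('ReLU:           ', relu(z), 'gradient:', relu_grad(z))
print('Leaky ReLU:     ', leaky_relu(z), 'gradient:', leaky_relu_grad(z))
```

Because the leaky version never has an exactly zero gradient, a unit pushed into the negative region can still receive updates and recover.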

Image credit (both): [Andrej Karpathy's article on why you should understand backprop](https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b) 

# SGD with momentum
![Gradient descent optimization algorithms](http://ruder.io/content/images/2016/09/saddle_point_evaluation_optimizers.gif)

GIF Source: ["An overview of gradient descent optimization algorithms"](http://ruder.io/optimizing-gradient-descent/)

In regions of parameter space that look like "ravines" or "tacos," with a downwards trend in one dimension but a much sharper U-shaped curve in the other dimension, standard SGD will follow the stronger gradient signal, which is perpendicular to the dimension it can really make meaningful progress in.

A simple modification, called **momentum**, makes steps based on a linear combination of the previous step direction and the current gradient:
$$
    m_t = \beta m_{t-1} - \alpha \nabla f(w_{t-1}) \\
    w_t = w_{t-1} + m_t
$$
where $\alpha$ is the learning rate, $f$ is the function to optimize, $w$ is the model weights, and $\beta$ is a hyperparameter called "momentum" which determines how strong of an effect previous steps have on the current step.
For $\beta = 0$, this is equivalent to standard SGD.
Common settings for $\beta$ are around 0.9.

Momentum helps avoid the issue with ravines, because the algorithm will accumulate momentum in the downwards-curving direction while eventually damping oscillation in the perpendicular (sharp) direction.
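
Here's a minimal NumPy sketch of the update rule above on a toy "ravine". The quadratic loss, its curvatures, the starting point, and the step count are all assumptions chosen for illustration, and the gradient is exact rather than stochastic.

```python
import numpy as np

# Toy ravine: very shallow along w[0], steep along w[1]
CURVATURE = np.array([0.1, 10.0])

def loss(w):
    return 0.5 * float(np.sum(CURVATURE * w ** 2))

def grad(w):
    return CURVATURE * w

def descend(w_start, alpha, beta, n_steps):
    """Run m_t = beta * m_{t-1} - alpha * grad(w_{t-1}); w_t = w_{t-1} + m_t."""
    w = np.array(w_start, dtype=float)
    m = np.zeros_like(w)
    for _ in range(n_steps):
        m = beta * m - alpha * grad(w)
        w = w + m
    return w

w_start = [10.0, 1.0]
# beta = 0 recovers plain gradient descent
loss_sgd = loss(descend(w_start, alpha=0.02, beta=0.0, n_steps=100))
loss_momentum = loss(descend(w_start, alpha=0.02, beta=0.9, n_steps=100))

print('Loss without momentum:', loss_sgd)
print('Loss with momentum:   ', loss_momentum)
```

With the same learning rate, the momentum run makes much more progress along the shallow direction in the same number of steps.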

See the animation above: the red point is standard SGD (which never makes it past this ravine), and the green point is SGD with momentum.

To use this optimizer in TensorFlow, add a `momentum` argument to your [`tf.optimizers.SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD), so it will take in a learning rate and a momentum.
In exchange for the little bit of extra tuning you have to do (usually none, just leave momentum at 0.9 or 0.99), momentum should give you much faster training.

# Weight initialization and variance scaling
Because of saturating sigmoids, dying ReLUs, and generally wanting our hidden layers to receive sizeable gradient updates early on in training, how we initialize the weights and biases in a layer is way more important than you might think.
Initialization can affect whether the network will train at all, how fast it will train, and how well the optimized model will generalize.

General principles are:
 - Weights should never all be initialized to zero (or to any single shared value); they need to differ in order to break symmetry between units
 - Weights should be randomly initialized to small values centered around zero
 - Biases should usually be initialized to zero for hidden units
 - Units with more inputs may by chance receive many signals of the same sign, so they should have smaller weights to prevent a huge input that kills or saturates the neuron early on
 - Deeper networks are influenced more by initialization
 - The variance is pretty much all that matters, not whether you pick a normal or uniform distribution

Luckily TensorFlow can pick the standard deviation of a weight initializer for you based on its number of inputs and outputs:
 - For ReLUs, use `tf.initializers.he_normal` or `tf.initializers.he_uniform`.
 - For sigmoid units, use `tf.initializers.glorot_normal` or `tf.initializers.glorot_uniform`.

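He initialization and Glorot (or Xavier) initialization differ in the variance they pick (roughly $2/n_{in}$ for He versus $2/(n_{in} + n_{out})$ for Glorot, a factor of 2 when a layer has as many inputs as outputs), which makes each suited to a different activation function.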
If you're curious, [read more about it here](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94).
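
As a rough sketch of what variance scaling does, here are the two standard deviations computed by hand in NumPy. The helper names and layer sizes are assumptions for this example; the TensorFlow initializers differ in details such as truncating the normal distribution, so treat this as an approximation of the idea rather than their exact behavior.

```python
import numpy as np

def he_std(n_in):
    """He: variance 2 / n_in (intended for ReLU units)."""
    return np.sqrt(2.0 / n_in)

def glorot_std(n_in, n_out):
    """Glorot/Xavier: variance 2 / (n_in + n_out) (intended for sigmoid/tanh units)."""
    return np.sqrt(2.0 / (n_in + n_out))

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

# Zero-mean weights whose spread shrinks as the number of inputs grows
W_he = rng.normal(0.0, he_std(n_in), size=(n_out, n_in))
W_glorot = rng.normal(0.0, glorot_std(n_in, n_out), size=(n_out, n_in))

print('He std:    ', he_std(n_in), 'sampled:', W_he.std())
print('Glorot std:', glorot_std(n_in, n_out), 'sampled:', W_glorot.std())
```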

# Dropout
A very common method of regularizing (non-final) dense layers is dropout.
Dropout "turns off" a given unit during a step of training at random with probability $p$, and when running the model the unit is on but its activations are multiplied by $1 - p$ (the probability that it was on during training).
$p$ is usually in the range 0.2 to 0.6, where higher values result in more regularization.

Dropout is effective both in reducing the capacity of a dense layer and in solving the problem of _co-adaptation_, where later units learn to correct the mistakes of early units and so the early units never learn to correct themselves (which would be better for generalization).
With dropout, the unit cannot rely on other units being present, and so must learn representations that are useful in a broad variety of situations.

Dropout can also be thought of as making an approximate ensemble by averaging many different, smaller neural networks.
This is the intuition behind why multiplying the activation by the probability of a unit being "on" works.
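
Here's a minimal NumPy sketch of that scaling argument, taking $p$ as the probability of turning a unit off, so the probability of it being "on" is $1 - p$. The function names and the all-ones activation vector are assumptions for illustration.

```python
import numpy as np

def dropout_train(a, p, rng):
    """Training step: zero each unit independently with probability p."""
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_inference(a, p):
    """Inference: keep every unit, scaled by the probability it was on."""
    return a * (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
p = 0.4

# On average, both give the next layer the same total input
train_mean = dropout_train(a, p, rng).mean()
inference_mean = dropout_inference(a, p).mean()

print('Mean activation during training:', train_mean)
print('Mean activation at inference:   ', inference_mean)
```

Many libraries instead use the equivalent "inverted" form, which scales the kept activations by $1/(1 - p)$ during training so that inference needs no scaling at all.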

To use dropout in TensorFlow, use the [`tf.nn.dropout`](https://www.tensorflow.org/api_docs/python/tf/nn/dropout) operation.
Note that this tensor should only be used for training, not for inference.
To use it in Keras, use `keras.layers.Dropout`, which handles the training/inference behavior for you.

# Keras
[Keras](https://keras.io/) is a very simple API built on TensorFlow, which makes it easy to build and train common kinds of neural networks.
It handles most of the annoying stuff (running gradient updates, initializing tensors, etc) for you.
Keras is used a lot in practice because it makes building powerful and performant models easy, but it lacks some of the flexibility of lower-level TensorFlow, which you may need if you're developing new kinds of models.
Still, if you're building neural nets in practice there's a good chance it'll be able to do what you need.

Training and inference in Keras follow these steps:
 1. Build a model either as an instance of `keras.models.Sequential` (this week) or `keras.models.Model` (next week), using layers from `keras.layers`
 2. Compile the model, which determines the optimizer, loss function, and which metrics to compute during training and testing
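 3. Train the model with `model.fit()` (or `model.fit_generator()` in older versions of Keras; newer versions deprecate it because `fit()` accepts generators directly), passing in training data, validation data, epochs, and batch size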
 4. Run inference using `model.predict()` 

### Sequential models
Keras has two model-building APIs: the simpler Sequential API for models that work "one layer at a time" (feedforward), and the Functional API which is more flexible but complicated.

The end of this document has a full worked example of model building and training with a Keras sequential model.

For more info, read the [Keras sequential model guide](https://keras.io/getting-started/sequential-model-guide/).

# The TensorFlow debugger
TensorFlow has an incredible debugging utility.
As the models you develop get more complicated, the debugger and TensorBoard become essential for debugging problems like shapes not matching up, `NaN` and `INF` values appearing instead of numbers, and operations that don't act like you expect.

Read [the official guide](https://www.tensorflow.org/guide/debugger).

# More TensorBoard visualizations
`tf.summary.histogram`s are summary operations that act like the `tf.summary.scalar`s we looked at last week.
Instead of just taking a scalar, though, they take a tensor of arbitrary shape, and turn it into a histogram counting the frequency of values in intervals in the tensor's values.
TensorBoard will plot these histograms against iteration, similar to scalars; you can interact with them in the Histograms tab.

There are a few interesting things to plot like this:
 - Hidden layer activations
 - Weights
 - Biases
 
The big one, though, is **gradients**, which can tell you what's going on during optimization and help debug problems like vanishing or exploding gradients.
To plot them, you need to get the gradients explicitly using `GradientTape.gradient()` instead of `optimizer.minimize()` like before:
```
optimizer = tf.optimizers.SGD(2e-3, 0.9)

with tf.GradientTape() as g:
    loss_scalar = ...
    
gradients = g.gradient(loss_scalar, trainable_vars)

# `gradients` now contains a list of gradients for each variable in `trainable_vars`
tf.summary.histogram('my_gradient', gradients[0], step=0)

# Instead of `optimizer.minimize()`, this becomes your training operation
optimizer.apply_gradients(zip(gradients, trainable_vars))
```

All `optimizer.minimize()` does is combine `GradientTape.gradient()` and `optimizer.apply_gradients()`.
By breaking these steps up, you get direct access to the gradient values and you can do some interesting things with them.
For instance, "clipping" gradients by capping them at a maximum value is common, to prevent huge steps in gradient descent that can saturate units or step outside the range where the first-order approximation to the loss function is good.
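
Here's a library-independent NumPy sketch of two common clipping schemes, applied to a list of gradient arrays like the one `GradientTape.gradient()` returns. The function names, the thresholds, and the example gradients are assumptions for illustration.

```python
import numpy as np

def clip_by_value(grads, limit):
    """Cap every gradient entry to the range [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

# Example gradients for two variables; their joint norm is 13
grads = [np.array([3.0, -4.0]), np.array([12.0])]

clipped_value = clip_by_value(grads, 5.0)
clipped_norm = clip_by_global_norm(grads, 1.0)

print('Clipped by value:', clipped_value)
print('Clipped by norm: ', clipped_norm)
```

Clipping by value can change the direction of the step, while clipping by global norm only shrinks its length.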

Note: if you're computing your gradients for plotting, apply them with `apply_gradients()` rather than also calling `optimizer.minimize()`.
If you call `minimize()` as well, it'll recompute the gradients you already have, increasing runtime.

For more info, check out the [official documentation on histograms](https://www.tensorflow.org/api_docs/python/tf/summary/histogram).

# Example: Classification with Keras
Here's another full worked example, this time using Keras.
Since using Keras mostly involves picking layers and plugging in values, it should suffice as an explanation of how to use the library.

```
%matplotlib inline
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
```

### Loading data
I'm using a scikit-learn built-in dataset again: classifying Iris plants into types based on sepal and petal width and length.

```
from sklearn.datasets import load_iris

# Load data
dataset = load_iris()
print('Feature names:', dataset.feature_names, '\n')

x_all = dataset.data
y_all = dataset.target

# Shuffle features and targets together
together = np.concatenate([x_all, np.expand_dims(y_all, axis=1)],
                          axis=1)
np.random.shuffle(together)
x_all = together[:, :-1]
y_all = together[:, -1]

print('Input shape:', x_all.shape)
print('Target shape:', y_all.shape)

# Split data into train and test sets
n_points = x_all.shape[0]
n_features = x_all.shape[1]
n_train = int(n_points * 0.7)
n_test = n_points - n_train

x_train, x_test = np.split(x_all, [n_train], axis=0)
y_train, y_test = np.split(y_all, [n_train])
```

```
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Input shape: (150, 4)
Target shape: (150,)
```


### Build a model
Makes a neural network with 2 hidden layers that use ReLU activation, followed by softmax output to do 3-class classification.

```
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

# Create an "empty" model
model = Sequential()

# Add a 16-unit dense layer with no activation function,
# then apply ReLU as a separate layer
# input_shape only required for the first layer
model.add(Dense(16, input_shape=(4,)))
model.add(Activation('relu'))

# Add an 8-unit dense layer with relu activation
model.add(Dense(8, activation='relu'))

# Output layer, for 3-class classification:
# 3 probabilities, produced by softmax activation
model.add(Dense(3, activation='softmax'))
```

### Build an optimizer

```
from keras.optimizers import SGD

# `learning_rate` replaces the older `lr` argument name
optimizer = SGD(learning_rate=2e-3, momentum=0.9)
```

### Compile the model

```
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy',
                       'categorical_crossentropy'])
```

### Fit model

```
# Need to one-hot encode the targets
y_train_onehot = keras.utils.to_categorical(y_train, num_classes=3)
y_test_onehot = keras.utils.to_categorical(y_test, num_classes=3)

model.fit(x_train, y_train_onehot,
          epochs=50, batch_size=32,
          validation_data=(x_test, y_test_onehot))
```

```
Epoch 1/50
...
Epoch 50/50

<tensorflow.python.keras.callbacks.History at 0x7f5770038ef0>
```

### Evaluate on the test set
Not strictly required, since we passed it in as a val set.

Prints the loss and metrics:
[categorical_crossentropy, categorical_accuracy, categorical_crossentropy]

categorical_crossentropy prints twice, since it's both the loss and in the list of metrics

```
model.evaluate(x_test, y_test_onehot)
```

```
[0.2568022906780243, 0.9555555582046509, 0.2568022906780243]
```

### Run inference

```
pred = model.predict(np.array([x_test[0]]), batch_size=1)

print('Inputs:', x_test[0])
print('Predicted probabilities:', pred)
print('Predicted class:', np.argmax(pred))
print('True class label:', y_test[0])
```

```
Inputs: [6.4 3.1 5.5 1.8]
Predicted probabilities: [[0.00493052 0.29414403 0.7009254 ]]
Predicted class: 2
True class label: 2.0
```
