# Convolutional Neural Networks for Visual Recognition
http://cs231n.github.io

## Image Classification

### Nearest Neighbor Classifier

**$L_1$ distance**

$d_1(I_1,I_2) = \sum_{p}|I_1^p-I_2^p|$

where $I_1$, $I_2$ are vectors of two images being compared.

**$L_2$ distance**

$d_2(I_1,I_2) = \sqrt{\sum_{p}(I_1^p-I_2^p)^2}$

**k-Nearest Neighbor Classifier**

Motivation: instead of finding the single closest image in the training set, we find the top k closest images and have them vote on the label of the test image.


![alt text](./img/k-nearest.png "Title")
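The two distances and the voting rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a tuned implementation: the function names and the toy two-pixel "images" are assumptions made for the example, and ties in the vote are broken in favor of the smallest label.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute pixel differences (the L1 distance d_1)."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Square root of the summed squared differences (the L2 distance d_2)."""
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, x, k=3, distance=l1_distance):
    """Predict the label of x by majority vote among its k nearest training rows."""
    dists = np.array([distance(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]          # indices of the k closest images
    votes = np.bincount(y_train[nearest])    # number of votes per label
    return int(np.argmax(votes))             # ties go to the smallest label

# Toy data (illustrative only): four "images" of two pixels each, two classes.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([9.0, 9.0]), k=3)
```

Note that all the work happens at prediction time: every test image is compared against every training image, which is the cost listed under the cons below.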

### Pros and Cons of Nearest Neighbor Classifier
**Pros**
* Simple
* Takes no time to train

**Cons**
* May work on low-dimensional data, but distances over high-dimensional spaces can be very counter-intuitive.
* Which images end up near each other in pixel space depends much more on the general color distribution or the type of background than on their semantic identity.
* The classifier must remember all the training data and store for future comparisons with the test data.
* Classifying a test image is expensive since it requires a comparison to all training images.


## Validation sets for Hyperparameter tuning

**Example of hyperparameters:** the setting for k of k-nearest neighbor classifier.

**Tuning hyperparameters:** split the training set in two: a slightly smaller training set and what we call a validation set.

In cases where the training data is small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**.

![alt text](./img/k-fold.png "Title")

The classic way is to split your training data randomly into train/val splits. As a rule of thumb, between 70% and 90% of your data usually goes to the train split.
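A k-fold split can be sketched as follows. The helper names and the `evaluate(train_idx, val_idx)` callback are hypothetical; in practice the callback would train a model with one hyperparameter setting on the training indices and return its validation accuracy.

```python
import numpy as np

def k_fold_indices(num_examples, num_folds, seed=0):
    """Shuffle example indices and split them into num_folds nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_examples), num_folds)

def cross_validate(evaluate, num_examples, num_folds=5):
    """Average evaluate(train_idx, val_idx); each fold is the validation set once."""
    folds = k_fold_indices(num_examples, num_folds)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(evaluate(train_idx, val_idx))
    return float(np.mean(scores))

folds = k_fold_indices(10, 5)
# Hypothetical scorer for demonstration: the fraction of data used for training.
train_fraction = cross_validate(lambda tr, va: len(tr) / (len(tr) + len(va)), 10, 5)
```

With 5 folds, each run trains on 80% of the data and validates on the remaining 20%; the hyperparameter value with the best averaged score is kept.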

## Linear Classification

The approach will have two major components: a **score function** that maps the raw data to class scores, and a **loss function** that quantifies the agreement between the predicted scores and the ground truth labels.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc). 

### Linear Classifier

$f(x_i, W, b) = Wx_i+b$

where $x_i$ contains all pixels of the i-th image flattened into a single [3072 x 1] column. W is called the weights and has size [10 x 3072]; b is called the bias vector and has size [10 x 1].

**Example**

![alt text](./img/classifier.png "Title")

In the example shown above, the linear classifier computes the score of a class as a weighted sum of all pixel values across all 3 color channels. We assume the image has only 4 pixels and that we have 3 classes.


**Bias Trick**

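We can combine the two sets of parameters (the biases b and the weights W) into a single matrix that holds both of them by appending b as an extra column of W and extending $x_i$ with one extra dimension that always holds the constant 1. With W now of size [10 x 3073] and $x_i$ of size [3073 x 1], we get:

$f(x_i, W) = Wx_i$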
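A quick numerical check of the bias trick, using the CIFAR-10 shapes from above. The random values and the names `W_ext` and `x_ext` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072)) * 0.01   # weights, [10 x 3072]
b = rng.standard_normal((10, 1))             # biases,  [10 x 1]
x = rng.standard_normal((3072, 1))           # one flattened image, [3072 x 1]

scores = W @ x + b                           # original form: Wx + b

W_ext = np.hstack([W, b])                    # append b as an extra column: [10 x 3073]
x_ext = np.vstack([x, [[1.0]]])              # append a constant 1:         [3073 x 1]
scores_ext = W_ext @ x_ext                   # single matrix multiply
```

Both forms give the same ten class scores, so the rest of the notes can work with a single parameter matrix.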


### Loss Function

**Multiclass Support Vector Machine (SVM)**

The SVM loss is set up so that the correct class for each image has a score higher than every incorrect class by some fixed margin $\Delta$. It is sometimes helpful to anthropomorphise loss functions in this way: the SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).

$$L_i = \sum_{j \neq y_i}\max (0, s_j-s_{y_i}+\Delta)$$

where $y_i$ is the index of the correct class and $s_j = f(x_i,W)_j$ (the score for j-th class is the j-th element). 

Since $s_j = w_j^Tx_i$, where $w_j$ is the j-th row of W reshaped as a column, we can also rewrite the loss function as:

$$L_i = \sum_{j \neq y_i}\max (0, w_j^Tx_i-w_{y_i}^Tx_i+\Delta)$$


### Regularization
With the loss function presented above, a potential problem is that the set of parameters W that correctly classifies every example is not necessarily unique: if some W achieves zero loss, then any multiple $\alpha W$ with $\alpha > 1$ also achieves zero loss. We want to encode a preference for certain sets of weights W over others to remove this ambiguity.

We can do so by extending the loss function with a regularization penalty R(W). The most common regularization penalty is the squared L2 norm.

$$R(W) = \sum_{k} \sum_{l} W_{k,l}^2$$

The full multiclass SVM loss becomes:
$$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$$

or expanding this out in its full form:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i}\max(0, w_j^Tx_i-w_{y_i}^Tx_i+\Delta) + \lambda \sum_{k} \sum_{l} W_{k,l}^2$$

Including the L2 penalty leads to the appealing max-margin property of SVMs. Its most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores by itself.
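The full loss can be written as a short vectorized function. This is a sketch under stated assumptions: the function name, the argument layout (one example per column of `X`) and the toy scores are chosen for the example, not taken from the notes.

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=0.0):
    """Mean multiclass SVM loss over the columns of X plus lam * sum(W**2).

    W: [K x D] weights, X: [D x N] data (one example per column), y: [N] int labels.
    """
    scores = W @ X                                   # class scores, [K x N]
    n = X.shape[1]
    correct = scores[y, np.arange(n)]                # score of the true class per example
    margins = np.maximum(0.0, scores - correct + delta)
    margins[y, np.arange(n)] = 0.0                   # the sum skips the j == y_i term
    return margins.sum() / n + lam * np.sum(W ** 2)

# One example whose scores are [13, -7, 11], true class 0, margin 10:
# L_i = max(0, -7 - 13 + 10) + max(0, 11 - 13 + 10) = 0 + 8 = 8
W = np.array([[13.0], [-7.0], [11.0]])
X = np.array([[1.0]])
y = np.array([0])
data_loss = svm_loss(W, X, y, delta=10.0)
full_loss = svm_loss(W, X, y, delta=10.0, lam=0.1)   # adds 0.1 * (169 + 49 + 121)
```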

### Practical Consideration

**Setting Delta**

It turns out that this hyperparameter can safely be set to $\Delta = 1.0$.

**Relation to Binary Support Vector Machine**

$$L_i = C\max(0, 1-y_iw^Tx_i)+R(W)$$

where C is a hyperparameter and $y_i \in \{-1,1\}$. This can be regarded as the special case of the multiclass SVM in which there are only two classes.

### Softmax Classifier

It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier.

$$L_i = -\log \left(\frac{e^{f_{y_i}}}{\sum_je^{f_j}}\right)$$ or equivalently $$L_i = -f_{y_i}+\log \sum_je^{f_j}$$

The softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.

**Information theory view**

The cross-entropy between a "true" distribution p and an estimated distribution q is defined as

$$H(p,q)=-\sum_xp(x)\log q(x)$$

The softmax classifier is hence minimizing cross-entropy between the estimated class probabilities and the "true" distribution.

Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as $H(p,q)=H(p)+D_{KL}(p||q)$, and the entropy of the delta function p (all mass on the correct class) is zero, this is also equivalent to minimizing the KL divergence between the two distributions.

In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.

**Probabilistic interpretation**

$$P(y_i|x_i;W)=\frac{e^{f_{y_i}}} {\sum_j e^{f_j}}$$

can be interpreted as the probability assigned to the correct label $y_i$, given the image $x_i$ and parameterized by W.

**Practical issues: Numeric stability**

When we write code to compute the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large because of the exponentials. So it is important to use a normalization trick: multiplying the numerator and the denominator by the same constant C does not change the result.

$$\frac{e^{f_{y_i}}} {\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}} {C\sum_j e^{f_j}}=\frac{e^{f_{y_i}+\log C}} {\sum_j e^{{f_j}+\log C}}$$

A common choice for C is to set $\log C = -\max_jf_j$.
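The trick amounts to subtracting the largest score before exponentiating. A minimal sketch follows; the function names and the example scores are assumptions for illustration.

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax: shift scores so the largest one is 0."""
    shifted = f - np.max(f)          # this is f + log C with log C = -max_j f_j
    exp = np.exp(shifted)            # every exponent is <= 0, so no overflow
    return exp / np.sum(exp)

def cross_entropy_loss(f, y):
    """L_i = -f_y + log(sum_j exp(f_j)), evaluated on the shifted scores."""
    shifted = f - np.max(f)
    return float(-shifted[y] + np.log(np.sum(np.exp(shifted))))

f = np.array([123.0, 456.0, 789.0])  # np.exp(789.0) alone would overflow to inf
p = softmax(f)
loss = cross_entropy_loss(f, 2)      # correct class has by far the highest score
```

Without the shift, the naive expression `np.exp(f) / np.sum(np.exp(f))` evaluates to `inf / inf` for these scores and returns `nan`.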

**Possibly confusing**

To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss.

### SVM vs. Softmax

In both cases we compute the same score vector f. The difference is in the interpretation of the scores in f: the SVM interprets these as class scores, and its loss function encourages the correct class to have a score higher by a margin than the other class scores.

The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high.

In practice, SVM and Softmax are usually comparable.

## Optimization: Stochastic Gradient Descent

Optimization: the process of finding the set of parameters W that minimize the loss function.

Assume a simple dataset that contains three 1-dimensional points and three classes. The full SVM loss (without regularization) becomes:

$$L_0 = \max(0,w_1^Tx_0-w_0^Tx_0+1)+\max(0,w_2^Tx_0-w_0^Tx_0+1)$$
$$L_1 = \max(0,w_0^Tx_1-w_1^Tx_1+1)+\max(0,w_2^Tx_1-w_1^Tx_1+1)$$
$$L_2 = \max(0,w_0^Tx_2-w_2^Tx_2+1)+\max(0,w_1^Tx_2-w_2^Tx_2+1)$$
$$L=(L_0+L_1+L_2)/3$$

![alt text](./img/sum.png "Title")

The SVM cost function is an example of a convex function.


### Optimization Strategy

**Random Search**

Not quite a good idea..


In [None]:
import numpy as np

# Assumes X_train, Y_train and a loss function L(X, Y, W) are defined elsewhere.
bestloss = float("inf") # start at positive infinity so any real loss is lower
for num in range(1000):
  W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
  loss = L(X_train, Y_train, W) # get the loss over the entire training set
  if loss < bestloss: # keep track of the best solution
    bestloss = loss
    bestW = W
  print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

**Random Local Search**

Extend one foot in a random direction and take a step only if it leads downhill. Concretely, we start out with a random W, generate a random perturbation $\delta W$ to it, and if the loss at the perturbed $W+\delta W$ is lower, we perform an update.
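A sketch of random local search follows. To keep it self-contained it minimizes a toy quadratic loss rather than the SVM loss; the function names, the step size and the toy loss are all assumptions made for the example.

```python
import numpy as np

def random_local_search(loss, W, step_size=0.1, num_steps=200, seed=0):
    """Try random perturbations of W and keep only those that lower the loss."""
    rng = np.random.default_rng(seed)
    best = loss(W)
    for _ in range(num_steps):
        W_try = W + rng.standard_normal(W.shape) * step_size   # W + delta W
        trial = loss(W_try)
        if trial < best:               # step only if it leads downhill
            W, best = W_try, trial
    return W, best

# Toy loss (illustrative only): squared distance of every entry from 3.
toy_loss = lambda W: float(np.sum((W - 3.0) ** 2))
W0 = np.zeros((2, 2))
W_final, final_loss = random_local_search(toy_loss, W0)
```

Because a step is accepted only when it lowers the loss, the loss can never get worse, but most random directions are wasted, which motivates following the gradient instead.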

**Following the Gradient**

The gradient is a generalization of slope for functions that take a vector of numbers instead of a single number. The gradient is just a vector of slopes (more commonly referred to as derivatives) for each dimension in the input space.

The mathematical expression for the derivative of a 1-D function with respect to its input is:

$$\frac{df(x)}{dx}=\lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$$

When the functions of interest take a vector of numbers instead of a single number, we call the derivatives partial derivatives, and the gradient is simply the vector of partial derivatives in each dimension.

### Computing the gradient numerically with finite differences

In [1]:
import numpy as np

def eval_numerical_gradient(f, x):
  """ 
  a naive implementation of numerical gradient of f at x 
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """ 

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad

**Effect of step size**

The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. As we will see later in the course, choosing the step size (also called the learning rate) will become one of the most important hyperparameter settings in training a neural network. Note that this step size is a different quantity from the small h used in the finite difference above.

**Computing the gradient analytically with Calculus**

The numerical gradient is very simple to compute using the finite difference approximation, but the downside is that it is approximate and very expensive to compute. The second way to compute the gradient is analytically using Calculus, which gives a direct formula for the gradient that is also very fast to compute. However, unlike the numerical gradient, it can be more error prone to implement, which is why in practice it is very common to compute the analytic gradient and compare it to the numerical gradient to check the correctness of your implementation. This is called a **gradient check**.

With the SVM loss function for a single data point:
$$L_i = \sum_{j \neq y_i}\max (0, w_j^Tx_i-w_{y_i}^Tx_i+\Delta)$$

We can differentiate the function with respect to the weights.

$$\nabla_{w_{y_i}}L_i = - \left(\sum_{j\neq y_i}𝟙(w_j^Tx_i-w_{y_i}^Tx_i+\Delta > 0)\right)x_i$$

𝟙 is the indicator function that is one if the condition inside is true or zero otherwise. 

For the other rows where $j\neq y_i$, the gradient is:

$$\nabla_{w_{j}}L_i = 𝟙(w_j^Tx_i-w_{y_i}^Tx_i+\Delta > 0)\,x_i$$
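The analytic gradient of the single-example SVM loss can be implemented directly: the row of the correct class receives $-x_i$ once for every margin that is violated, and each other row receives $x_i$ if its own margin is violated. The function name and the small example are assumptions made for this sketch.

```python
import numpy as np

def svm_loss_and_grad(W, x, y, delta=1.0):
    """Loss L_i and gradient dL_i/dW for one example x ([D]) with label y; W is [K x D]."""
    scores = W @ x
    margins = scores - scores[y] + delta
    margins[y] = 0.0                          # the sum skips j == y
    active = (margins > 0).astype(float)      # indicator for each row j != y
    loss = float(np.sum(margins * active))
    dW = np.outer(active, x)                  # rows j != y: indicator * x
    dW[y] = -np.sum(active) * x               # row y: -(number of violated margins) * x
    return loss, dW

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.array([1.0, 2.0])                      # scores are [1, 2, 3]
loss, dW = svm_loss_and_grad(W, x, y=0)       # both margins violated: 2 + 3 = 5
```

Here both incorrect classes violate the margin, so rows 1 and 2 of `dW` equal `x` and row 0 equals `-2 * x`.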


### Gradient Descent

Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent.

**Mini-batch gradient descent**

In large-scale applications, it is wasteful to compute the full loss function over the entire dataset in order to perform a single parameter update. A very common approach to address this challenge is to compute the gradient over batches of the training data.

The extreme case of this is a setting where the mini-batch contains only a single example. This process is called **Stochastic Gradient Descent (SGD)**. Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent.

![alt text](./img/summary.png "Title")
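A minimal mini-batch gradient descent loop is sketched below. To keep the example small and checkable it uses a toy least-squares objective instead of the SVM loss; the function names, the data, the step size and the batch size are all assumptions made for illustration.

```python
import numpy as np

def minibatch_sgd(grad_fn, W, X, y, step_size=0.1, batch_size=4, num_steps=500, seed=0):
    """Repeatedly sample a batch, evaluate the gradient on it, and step downhill."""
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)  # sample a batch
        W = W - step_size * grad_fn(W, X[idx], y[idx])                # parameter update
    return W

def mse_grad(w, X, y):
    """Gradient of the mean squared error of the linear model X @ w (toy objective)."""
    return 2.0 * X.T @ (X @ w - y) / X.shape[0]

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0],
              [2.0, 1.0], [1.0, 2.0], [0.5, 0.5], [2.0, 2.0]])
w_true = np.array([2.0, -1.0])
y = X @ w_true                                # noise-free targets
w = minibatch_sgd(mse_grad, np.zeros(2), X, y)
```

Setting `batch_size=1` gives SGD in the strict sense, and setting it to the number of examples gives full-batch gradient descent.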

## Backpropagation, Intuitions

### Simple expression and interpretation of the gradient

$$f(x,y)=xy$$
$$\frac{\partial f}{\partial x} = y$$
$$\frac{\partial f}{\partial y} = x$$


The derivative indicates the rate of change of a function with respect to a variable in an infinitesimally small region around a particular point:

$$\frac{df(x)}{dx}=\lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$$

The derivative on each variable tells you the sensitivity of the whole expression to its value. If we have $\frac{\partial f}{\partial y} = 4$, we expect that increasing the value of y by some very small amount h would increase the output of the function by 4h.

### Compound expression with chain rule

For $f(x,y,z)=(x+y)z$, we can break it down into two expressions: $q=x+y$ and $f=qz$. We know $\frac{\partial f}{\partial q}=z$ and $\frac{\partial f}{\partial z}=q$, and q is the sum of x and y, so $\frac{\partial q}{\partial x}=1$ and $\frac{\partial q}{\partial y}=1$. However, we don't necessarily care about the gradient on q; instead we care about the gradient of f with respect to its inputs x, y, z. The chain rule tells us that $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x}$, and likewise $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y}$.
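The same computation in code, with example input values chosen for illustration:

```python
# example inputs (assumed values for illustration)
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass, in reverse order
dfdz = q             # df/dz = q
dfdq = z             # df/dq = z
dfdx = 1.0 * dfdq    # chain rule: df/dx = df/dq * dq/dx, with dq/dx = 1
dfdy = 1.0 * dfdq    # chain rule: df/dy = df/dq * dq/dy, with dq/dy = 1
```

The addition gate simply passes the gradient on q through to both of its inputs, so x and y both receive the value of z.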

### Intuitive understanding of backpropagation

Backpropagation is a local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 1. its output value and 2. the local gradient of its output with respect to its inputs.

It would be very inefficient to work out the effect of every parameter on the loss separately, one neuron at a time. Backpropagation instead starts at the output, collects the gradient signals from the last-layer neurons, and passes them backward through the network, multiplying by each gate's local gradient along the way (the chain rule), so every parameter receives its gradient in a single backward pass.

Good video explanation here https://www.youtube.com/watch?v=Ilg3gGewQ5U.

### Modularity: Sigmoid example

$$ f(w,x)= \frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}$$

The function is made up of multiple gates.

$$f(x) = \frac{1}{x} \quad \rightarrow \quad \frac{df}{dx}= -\frac{1}{x^2}$$
$$f_c(x) = c+x \quad \rightarrow \quad \frac{df}{dx}= 1$$
$$f(x) = e^x \quad \rightarrow \quad \frac{df}{dx}= e^x$$
$$f_a(x) = ax \quad \rightarrow \quad \frac{df}{dx}= a$$


The function that the above operations implement is called the sigmoid function:
$$\sigma(x) = \frac{1}{1+e^{-x}}$$
$$\frac{d \sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2}$$
$$=\left (\frac{1+e^{-x}-1}{1+e^{-x}} \right) \left ( \frac{1}{1+e^{-x}}\right) = (1-\sigma(x))\sigma(x)$$

In [None]:
import math

w = [2,-3,-3] # assume some random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1 + math.exp(-dot)) # sigmoid function

# backward pass through the neuron (backpropagation)
ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w
# we're done! we have the gradients on the inputs to the circuit

### Backprop in practice: Staged computation

$$f(x,y)=\frac{x+\sigma(y)}{\sigma(x)+(x+y)^2}$$

In [None]:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                               #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                              #(4)
xpysqr = xpy**2                                          #(5)
den = sigx + xpysqr # denominator                        #(6)
invden = 1.0 / den                                       #(7)
f = num * invden # done!                                 #(8)

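We structure the code in such a way that it contains multiple intermediate variables, each of which is a simple expression whose local gradient we already know. The backward pass then visits these variables in reverse order.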

In [None]:
# backprop f = num * invden
dnum = invden # gradient on numerator                             #(8)
dinvden = num                                                     #(8)
# backprop invden = 1.0 / den 
dden = (-1.0 / (den**2)) * dinvden                                #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                                #(6)
dxpysqr = (1) * dden                                              #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                        #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                   #(4)
dy = (1) * dxpy                                                   #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! x feeds several branches, so its gradients add up  #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                  #(2)
dsigy = (1) * dnum                                                #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                 #(1)

## Neural Networks Part 1: Setting up the Architecture

### Modeling one neuron

**Biological motivation and connections**

We model the firing state of the neuron with an activation function f, which represents the frequency of the spikes along the axon.

### Single neuron as a linear classifier
**Binary Softmax classifier**

For example, we can interpret $\sigma(\sum_{i}w_ix_i+b)$ to be the probability of one of the classes, $P(y_i=1|x_i;w)$. The probability of the other class would be $P(y_i=0|x_i;w)=1-P(y_i=1|x_i;w)$.

**Binary SVM classifier** 
Alternatively, we could attach a max-margin hinge loss to the output of the neuron and train it to become a binary Support Vector Machine.

**Regularization interpretation**
The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as gradual forgetting, since it would have the effect of driving all synaptic weights w towards zero after every parameter update.

### Commonly used activation functions

![alt text](./img/sigmoid.png "Title")
**Sigmoid**

It takes a real-valued number and "squashes" it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1.

$$\sigma (x) = \frac{1}{1+e^{-x}}$$

Drawbacks:

* **Sigmoids saturate and kill gradients.** A very undesirable property of the sigmoid neuron is that when its activation saturates at either tail (0 or 1), the gradient in these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied by the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron.

* **Sigmoid outputs are not zero-centered.**  This has implications on the dynamics during gradient descent, because if the data coming to a neuron is always positive, then the gradient on the weights w will during backpropagation become either all positive, or all negative.

**Tanh** 

It squashes a real-valued number to the range [-1,1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron, its output is zero-centered.

$$\tanh(x)=2\sigma(2x)-1$$
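The identity says that tanh is just a scaled and shifted sigmoid, which can be checked numerically. The helper name `sigmoid` and the sample points are assumptions for this sketch.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0   # right-hand side of the identity
```

Comparing `tanh_from_sigmoid` with `np.tanh(x)` shows agreement to floating point precision.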

![alt text](./img/relu.png "Title")

**ReLU**

$$f(x)=\max(0,x)$$

Pros:
* It was found to greatly accelerate the convergence of gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
* Compared to tanh/sigmoid neurons that involve expensive operations (exponentials), the ReLU can be implemented by simply thresholding a matrix of activations at zero.

Cons:
* ReLU units can be fragile during training and can "die". 

**Leaky ReLU**

Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a leaky ReLU has a small slope in the negative region.

The function computes:
$$f(x)=𝟙(x<0)(\alpha x)+𝟙(x \geq 0)(x)$$

where $\alpha$ is a small constant. 
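Both functions are one-liners in NumPy. The default `alpha=0.01` is an assumed value for illustration, not a recommendation from these notes.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """f(x) = alpha * x for x < 0 and x otherwise (alpha is an assumed small constant)."""
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 3.0])
relu_out = relu(x)          # negative inputs are clamped to zero
leaky_out = leaky_relu(x)   # negative inputs are scaled by alpha instead
```

For negative inputs the ReLU output and its gradient are both zero, whereas the leaky variant keeps a gradient of `alpha`, which is what lets a "dead" unit recover.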


**Maxout**

Other types of units have been proposed that do not have the functional form $f(w^Tx+b)$, where a non-linearity is applied to the dot product between the weights and the data. The Maxout neuron computes the function $\max(w_1^Tx+b_1, w_2^Tx+b_2)$. The Maxout neuron therefore enjoys all the benefits of a ReLU unit and does not have its drawback (dying ReLU). However, unlike the ReLU neuron, it doubles the number of parameters for every single neuron, leading to a high total number of parameters.

**What neuron type should I use?**

Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

### Neural Network architectures

**Layer-wise organization**

Neural Networks are modeled as collections of neurons connected in a graph. For regular neural networks, the most common layer type is the fully-connected layer, in which neurons in two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections.

**Naming conventions**
Notice that when we say N-layer neural network, we do not count the input layer. 

**Output layer** 
Unlike all other layers in a Neural Network, the output layer neurons most commonly do not have an activation function.

![alt text](./img/nn.png "Title")

**Sizing neural networks**

The left network has $4+2=6$ neurons (not counting the inputs), $[3 \times 4]+[4 \times 2]=20$ weights and $4+2=6$ biases, for a total of 26 learnable parameters.
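The count generalizes to any fully-connected network: each pair of adjacent layers contributes one weight per connection, and every non-input neuron contributes one bias. The helper name `count_parameters` is an assumption made for this sketch.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases, total) for a fully-connected net.

    layer_sizes lists the number of units per layer, input layer first.
    """
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])        # the input layer has no biases
    return weights, biases, weights + biases

# The network described above: 3 inputs, 4 hidden neurons, 2 outputs.
left = count_parameters([3, 4, 2])
```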

**Representational power**

Neural Networks with at least one hidden layer are universal approximators. The fact that deeper networks can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

**Setting number of layers and their sizes**


Neural Networks with more neurons can express more complicated functions. Overfitting occurs when a model with high capacity fits the noise in the data instead of the underlying relationship. Regularization strength is the preferred way to control overfitting of a neural network. Smaller neural networks can be preferred if the data is not complex enough to prevent overfitting.

## Neural Networks Part 2: Setting up the Data and the Loss

### Setting up the data and the model

### Data Preprocessing

There are three common forms of preprocessing for a data matrix $X$, where we will assume $X$ is of size $[N \times D]$ ($N$ is the number of data points, $D$ is their dimensionality).

**Mean subtraction**

The most common form of data preprocessing. It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension. In numpy this operation can be implemented as `X -= np.mean(X, axis=0)`.

**Normalization**

Refers to normalizing the data dimensions so that they are of approximately the same scale. There are two common ways of achieving this normalization. One is to divide each dimension by its standard deviation, once it has been zero-centered: `X /= np.std(X, axis=0)`. Another form of this preprocessing normalizes each dimension so that the min and max along the dimension are -1 and 1 respectively. It only makes sense to apply this preprocessing if you have a reason to believe that different input features have different scales, but they should be of approximately equal importance to the learning algorithm.

![alt text](./img/normalize.png "Title")
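A short sketch of mean subtraction followed by normalization. It assumes the common practice of computing the mean and standard deviation on the training split only and reusing them for the validation split; the synthetic data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (illustrative): three features with very different scales.
X_train = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(500, 3))
X_val = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(100, 3))

mean = X_train.mean(axis=0)            # statistics come from the training data only
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std  # zero mean, unit variance per feature
X_val_norm = (X_val - mean) / std      # same transform, reusing the training statistics
```

After the transform every training feature has mean 0 and standard deviation 1, while the validation features are only approximately standardized, as expected.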

**PCA and Whitening**

In this process, the data is centered as above. Then, we can compute the covariance matrix that tells us about the correlation structure in the data.

`X -= np.mean(X, axis=0)`

`cov = np.dot(X.T, X) / X.shape[0]`

The (i,j) element of the data covariance matrix contains the covariance between the i-th and j-th dimensions of the data. We can compute the SVD factorization of the data covariance matrix:

`U, S, V = np.linalg.svd(cov)`

To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis. We can also reduce the dimensionality by keeping only the directions with the largest variance, here the first 100 columns of `U`:

`Xrot = np.dot(X, U)`

`Xrot_reduced = np.dot(X, U[:,:100])`


The last transformation is **whitening**. It takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale. The small constant 1e-5 prevents division by zero.

`Xwhite = Xrot / np.sqrt(S + 1e-5)`

![alt text](./img/whiten.png "Title")
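The whole PCA and whitening pipeline can be run end to end on synthetic data to confirm what each step does. The mixing matrix `A`, the sample size and the number of kept dimensions are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: mix independent gaussians with a fixed matrix A.
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 0.5]])
X = rng.standard_normal((1000, 3)) @ A

X = X - X.mean(axis=0)                 # zero-center
cov = X.T @ X / X.shape[0]             # data covariance matrix, [3 x 3]
U, S, Vt = np.linalg.svd(cov)          # columns of U are eigenvectors, S the eigenvalues

Xrot = X @ U                           # decorrelate: rotate into the eigenbasis
Xrot_reduced = X @ U[:, :2]            # PCA: keep the 2 directions of largest variance
Xwhite = Xrot / np.sqrt(S + 1e-5)      # whiten: unit variance along every direction

cov_rot = Xrot.T @ Xrot / Xrot.shape[0]
cov_white = Xwhite.T @ Xwhite / Xwhite.shape[0]
```

After the rotation the covariance matrix is diagonal (with `S` on the diagonal), and after whitening it is approximately the identity.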

### Weight Initialization

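Before we can begin to train the network we have to initialize its parameters.

**Pitfall: all zero initialization**

If every weight starts at zero, every neuron in a layer computes the same output and receives the same gradient, so all neurons undergo identical updates and there is no source of asymmetry between them.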

**Small random numbers**

We still want the weights to be very close to zero, but as we have argued above, not identically zero.

**Calibrating the variances with 1/sqrt(n)**

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. The recommended heuristic is to initialize each neuron's weight vector as `w = np.random.randn(n) / np.sqrt(n)`, where `n` is the number of its inputs.
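The effect of the scaling can be seen empirically. The sketch below assumes unit gaussian inputs and measures the variance of a neuron's pre-activation over many random draws; the helper name and the sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def output_variance(n, scale, trials=2000):
    """Empirical variance of s = w . x for unit gaussian x and w = scale * randn(n)."""
    x = rng.standard_normal((trials, n))
    w = rng.standard_normal((trials, n)) * scale
    return float(np.var(np.sum(w * x, axis=1)))

n = 500
var_raw = output_variance(n, 1.0)                      # grows with n (close to 500)
var_calibrated = output_variance(n, 1.0 / np.sqrt(n))  # close to 1 regardless of n
```

Dividing by `sqrt(n)` keeps the output variance near 1, so neurons with many inputs do not start out saturated.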

**Sparse initialization**

Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected to a fixed number of neurons below it.

**Initializing the biases**

It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights.

**In practice**
The current recommendation is to use ReLU units and initialize with `w = np.random.randn(n) * np.sqrt(2.0/n)`.

**Batch normalization**

Batch normalization alleviates many problems with initialization by explicitly forcing the activations throughout a network to take on a unit gaussian distribution at the beginning of training.

### Regularization

There are several ways of controlling the capacity of Neural Networks to prevent overfitting.


**L2 regularization**

The most common form of regularization. For every weight w in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. Notice that during the gradient descent parameter update, using L2 regularization ultimately means that every weight is decayed linearly towards zero: `W += -lambda * W`.


**L1 regularization**

Another relatively common form of regularization, where for each weight w we add the term $\lambda |w|$ to the objective. It is possible to combine the L1 regularization with the L2 regularization $\lambda_1|w|+ \lambda_2w^2$.

**Max norm constraints**

Enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.

**Dropout**

While training, dropout is implemented by only keeping a neuron active with some probability $p$, or setting it to zero otherwise.

![alt text](./img/dropout.png "Title")
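A minimal sketch of a dropout forward pass follows. It uses the "inverted dropout" variant, which is an assumption on top of the description above: the kept activations are divided by $p$ during training so that their expected value is unchanged and nothing needs to be rescaled at test time. The function name and arguments are hypothetical.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, rng=None):
    """Keep each activation with probability p (inverted dropout variant).

    During training, kept activations are scaled by 1/p; at test time h is
    returned unchanged.
    """
    if not train:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(h.shape) < p) / p   # entries are 0 (dropped) or 1/p (kept)
    return h * mask

h = np.ones(10000)
h_train = dropout_forward(h, p=0.5, rng=np.random.default_rng(0))
h_test = dropout_forward(h, p=0.5, train=False)
```

With `p=0.5`, roughly half of the entries of `h_train` are zero and the rest are 2.0, so the mean activation stays close to 1.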

**Theme of noise in forward pass**

Dropout falls into a more general category of methods that introduce stochastic behavior in the forward pass of the network.

**Bias regularization**

It is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective.

**Per-layer regularization**

It is not very common to regularize different layers to different amounts.

**In practice**

It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of $p=0.5$ is a reasonable default, but this can be tuned on validation data.

### Loss functions

The data loss measures the compatibility between a prediction and the ground truth label. There are several types of problems you might want to solve in practice.

**Classification**
One of the two most commonly seen cost functions in this setting is the SVM loss: $$L_i = \sum_{j\neq y_i} \max(0, f_j-f_{y_i}+1)$$

The second common choice is the Softmax classifier that uses cross-entropy loss:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

**Problem: Large number of classes**
When the set of labels is very large, it may be helpful to use Hierarchical Softmax, which decomposes labels into a tree.

**Attribute classification**

Both losses above assume that there is a single correct answer $y_i$. But what if $y_i$ is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive?

For example, we can build a binary classifier for each category independently. An alternative would be to train a logistic regression classifier for every attribute independently. A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as:

$$P(y=1 | x;w,b)=\frac{1}{1+e^{-(w^Tx+b)}} = \sigma(w^Tx+b)$$

The log likelihood to maximize (equivalently, its negative is the loss to minimize) is:

$$L_i = \sum_j \left[ y_{ij} \log(\sigma(f_j))+(1-y_{ij})\log(1-\sigma(f_j)) \right]$$

**Regression**

The task of predicting real-valued quantities, such as the price of houses or the length of something in an image.

The L2 norm squared would compute the loss for a single example of the form:
$$L_i = ||f-y_i||_2^2$$

When faced with a regression problem, first consider if it is absolutely inadequate to quantize the output into bins.  If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.


**Structured prediction**

The structured loss refers to a case where the labels can be arbitrary structures such as graphs, trees or other complex objects. Usually, it is also assumed that the space of structures is very large and not easily enumerable. 

## Neural Networks Part 3: Learning and Evaluation

### Gradient Checks

Comparing the analytic gradient to the numerical gradient.

**Use the centered formula**

(bad, do not use)
$$\frac{df(x)}{dx}=\frac{f(x+h)-f(x)}{h}$$ 

(use instead)
$$\frac{df(x)}{dx}=\frac{f(x+h)-f(x-h)}{2h}$$ 

**Use relative error for the comparison**

$$\frac{|f_a - f_n|}{\max(|f_a|,|f_n|)}$$


* relative error > 1e-2 usually means the gradient is probably wrong
* 1e-2 > relative error > 1e-4 should make you feel uncomfortable
* 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
* 1e-7 and less you should be happy.
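The centered formula and the relative error can be combined into a small gradient checker. This is a sketch: the function names, the test function `f` and the guard against a zero denominator are assumptions made for the example, and `x` is expected to be a float64 array.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient of a scalar function f at x (float64 array)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)                      # f(x + h) along dimension i
        x.flat[i] = old - h
        f_minus = f(x)                     # f(x - h) along dimension i
        x.flat[i] = old                    # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

def relative_error(fa, fn):
    """Largest elementwise |fa - fn| / max(|fa|, |fn|), guarded against 0/0."""
    denom = np.maximum(np.abs(fa), np.abs(fn))
    return float(np.max(np.abs(fa - fn) / np.maximum(denom, 1e-12)))

# Smooth test objective (no kinks): f(x) = sum(tanh(x)^2)
f = lambda x: float(np.sum(np.tanh(x) ** 2))
x = np.array([0.3, -1.2, 2.0])
analytic = 2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)
err = relative_error(analytic, numerical_gradient(f, x))
```

For a smooth objective like this one, the relative error lands well inside the "you should be happy" range from the list above.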

**Use double precision**

A common pitfall is using single precision floating point to compute the gradient check.

**Kinks in the objective**

One source of inaccuracy to be aware of during gradient checking is the problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU ($\max(0,x)$), the SVM loss, Maxout neurons, etc.

**Use only few data points**

One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks will have fewer kinks with fewer datapoints.

**Be careful with the step size h**

It is not necessarily the case that smaller is better because when h is much smaller, you may start running into numerical precision problems. 

**Gradcheck during a characteristic mode of operation**

It is important to realize that a gradient check is performed at a particular, single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. To be safe it is best to use a short burn-in time during which the network is allowed to learn and perform the gradient check after the loss starts to go down.

**Don't let the regularization overwhelm the data**

It is often the case that the loss function is a sum of the data loss and the regularization loss. One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradient will be primarily coming from the regularization term. It is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently.

**Remember to turn off dropout/augmentations**

When performing a gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning them off is that they are then not gradient checked themselves, so a better solution might be to force a particular random seed before evaluating both $f(x+h)$ and $f(x-h)$.

**Check only few dimensions**

In practice it is only feasible to check some of the dimensions of the gradient and assume that the others are correct. Be sure to gradient check a few dimensions for every separate parameter.


### Before learning: sanity checks Tips/Tricks

**Look for correct loss at chance performance**

Make sure you are getting the loss you expect when you initialize with small parameters.
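
As an illustration, for a softmax classifier with small initial weights the class scores are all close to zero, so each of the $C$ classes gets probability close to $1/C$ and the expected initial loss is about $-\ln(1/C)$. The sketch below checks this on random data; the CIFAR-10-like sizes and the helper `softmax_loss` are assumptions chosen for the example.

```python
import numpy as np

def softmax_loss(scores, y):
    """Mean cross-entropy loss for raw class scores of shape (N, C)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
num_classes, dim, n = 10, 3072, 100          # hypothetical CIFAR-10-like sizes
X = rng.standard_normal((n, dim))
y = rng.integers(0, num_classes, size=n)
W = 1e-4 * rng.standard_normal((dim, num_classes))  # small initial weights

initial_loss = softmax_loss(X @ W, y)
expected = np.log(num_classes)               # -ln(1/10), about 2.302
```

If `initial_loss` is far from `expected`, the initialization or the loss implementation is worth a second look.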

**Overfit a tiny subset of data**

Lastly and most importantly, before training on the full dataset, try to train on a tiny portion of your data to make sure you can achieve zero cost.

### Babysitting the learning process

**Loss function**

![alt text](./img/loss.png "Title")

The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically.

**Train/Val accuracy**

![alt text](./img/train-eval.png "Title")

**Ratio of weights: updates**

The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes.
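
A minimal sketch of how this ratio might be tracked for one weight matrix under a vanilla SGD step is shown below. The shapes, the learning rate, and the helper name `update_ratio` are assumptions for illustration; the course notes suggest, as a rough heuristic, that the ratio should be somewhere around 1e-3.

```python
import numpy as np

def update_ratio(W, dW, learning_rate):
    """Ratio of the update norm to the parameter norm for a vanilla SGD step."""
    update = -learning_rate * dW
    return np.linalg.norm(update) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((50, 10))    # hypothetical weight matrix
dW = rng.standard_normal((50, 10))          # hypothetical gradient
ratio = update_ratio(W, dW, learning_rate=1e-5)
```

A ratio much lower than the heuristic may indicate a learning rate that is too low, and a much higher ratio one that is too high.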

**Activation/Gradient distributions per layer**

An incorrect initialization can slow down or even completely stall the learning process. One way to diagnose this issue is to plot activation/gradient histograms for all layers of the network.

![alt text](./img/vis.png "Title")

## Parameter updates

Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next.

### SGD and bells and whistles

**Vanilla update**

The simplest form of update is to change the parameters along the negative gradient direction.

`x += -learning_rate * dx`

**Momentum update**

This update can be motivated from a physical perspective on the optimization problem. Since the force on a particle is related to the gradient of its potential energy, the force felt by the particle is precisely the negative gradient of the loss function. The physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position.

integrate velocity
`v = mu * v - learning_rate * dx`

integrate position
`x += v`

Here we see an introduction of a `v` variable that is initialized at zero and an additional hyperparameter (`mu`).

With the momentum update, the parameter vector will build up velocity in any direction that has a consistent gradient.

**Nesterov Momentum**

The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone is about to nudge the parameter vector by `mu*v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu*v` as a lookahead and compute the gradient there instead.

![alt text](./img/nesterov.png "Title")

`v_prev = v`

`v = mu * v - learning_rate * dx`

`x += -mu * v_prev + (1 + mu) * v`
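
To see the three updates side by side, here is a small sketch that runs each of them on a toy quadratic loss. The curvatures in `A`, the starting point, the step count, and the hyperparameter values are all assumptions chosen for illustration.

```python
import numpy as np

A = np.array([1.0, 50.0])           # curvatures of a toy quadratic (assumption)

def loss(x):
    return 0.5 * np.sum(A * x ** 2)

def grad(x):
    return A * x

def run(method, steps=100, learning_rate=0.01, mu=0.9):
    x = np.array([1.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        dx = grad(x)
        if method == "vanilla":
            x += -learning_rate * dx
        elif method == "momentum":
            v = mu * v - learning_rate * dx   # integrate velocity
            x += v                            # integrate position
        elif method == "nesterov":
            v_prev = v
            v = mu * v - learning_rate * dx
            x += -mu * v_prev + (1 + mu) * v
        else:
            raise ValueError(method)
    return loss(x)

final = {m: run(m) for m in ("vanilla", "momentum", "nesterov")}
```

On this problem the shallow direction (curvature 1) is where the vanilla update is slow, and it is the direction in which the momentum variants build up velocity.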


### Annealing the learning rate

With a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function.

There are three common ways of implementing learning rate decay:
* **Step decay**: reduce the learning rate by some factor every few epochs.
* **Exponential decay**: has the mathematical form $\alpha = \alpha_0 e^{-kt}$, where $\alpha_0$, k are hyperparameters and t is the iteration number.
* **1/t decay**: has the mathematical form $\alpha=\alpha_0/(1+kt)$, where $\alpha_0$, k are hyperparameters and t is the iteration number.
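
The three schedules can be sketched as plain functions. The function names and the default `drop` and `epochs_per_drop` values are assumptions for illustration, not recommended settings.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=5):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, k, t):
    """alpha = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, k, t):
    """alpha = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)
```

For example, `step_decay(1.0, epoch=10)` halves the rate twice, and `inverse_time_decay(1.0, k=1.0, t=1)` halves it once.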

### Second order methods

A second popular group of methods for optimization in the context of deep learning is based on Newton's method, which iterates the following update:

$$x \leftarrow x-[Hf(x)]^{-1}\nabla f(x)$$

Here $Hf(x)$ is the Hessian matrix, which is a square matrix of second-order partial derivatives of the function. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. Note the absence of any learning rate hyperparameter in the update formula, which the proponents of these methods cite as a large advantage over first-order methods.

However, computing the Hessian in its explicit form is very costly. A large variety of quasi-Newton methods have been developed that seek to approximate the inverse Hessian instead. Among these, the most popular is L-BFGS.

### Per-parameter adaptive learning rate methods

**Adagrad**

`cache += dx**2`

`x += -learning_rate*dx/(np.sqrt(cache)+eps)`

The variable cache has size equal to the size of the gradient and keeps track of per-parameter sum of squared gradients.

**RMSprop**


`cache = decay_rate*cache + (1-decay_rate)*dx**2`

`x += -learning_rate*dx/(np.sqrt(cache)+eps)`

Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999].


**Adam**

Adam is a recently proposed update that looks a bit like RMSProp with momentum.

`m = beta1*m + (1-beta1)*dx`

`v = beta2*v + (1-beta2)*dx**2`

`x += -learning_rate * m / (np.sqrt(v) + eps)`
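
A runnable sketch of this simplified update on a toy objective follows. The objective `sum(x**2)`, the starting point, and the hyperparameter values are assumptions for illustration. The full Adam algorithm also includes a bias-correction step for `m` and `v`, which is left out here to match the simplified form in these notes.

```python
import numpy as np

def adam_step(x, dx, m, v, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step (no bias correction); returns updated x, m, v."""
    m = beta1 * m + (1 - beta1) * dx          # smoothed gradient
    v = beta2 * v + (1 - beta2) * dx ** 2     # smoothed squared gradient
    x = x - learning_rate * m / (np.sqrt(v) + eps)
    return x, m, v

# Toy objective (assumption for illustration): f(x) = sum(x**2), gradient 2x.
x = np.array([5.0, -3.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
initial_loss = np.sum(x ** 2)
for _ in range(1000):
    x, m, v = adam_step(x, 2 * x, m, v)
final_loss = np.sum(x ** 2)
```

Note that `m` plays the role of the momentum-smoothed gradient and `v` plays the role of the RMSprop `cache`.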

### Hyperparameter optimization

The most common hyperparameters in the context of Neural Networks include:
* the learning rate
* learning rate decay schedule (such as decay constant)
* regularization strength (L2 penalty, dropout strength)


**Implementation**

Larger Neural Networks typically require a long time to train, so performing a hyperparameter search can take many days or weeks. One particular design is to have a worker that continuously samples random hyperparameters and performs the optimization. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress.

**Prefer one validation fold to cross-validation**

In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds.


**Hyperparameter range**

Search for hyperparameters on a log scale. For example, a typical sampling of the learning rate would look as follows:

`learning_rate = 10**uniform(-6,1)`

That is, we generate a random number from a uniform distribution and then raise 10 to that power. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has a huge effect on the dynamics if the learning rate is 0.0001, but nearly no effect if the learning rate is 10.
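
A minimal NumPy sketch of this sampling is given below. The helper name `sample_log_uniform`, the seed, and the sample count are assumptions for illustration.

```python
import numpy as np

def sample_log_uniform(low_exp, high_exp, size, rng):
    """Sample 10**u with u drawn uniformly from [low_exp, high_exp)."""
    return 10.0 ** rng.uniform(low_exp, high_exp, size=size)

rng = np.random.default_rng(0)
learning_rates = sample_log_uniform(-6, 1, size=1000, rng=rng)
```

Because the exponent is uniform, roughly the same number of samples falls into each decade of the range, which is the point of searching on a log scale.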


**Prefer random search to grid search**

Randomly chosen trials are more efficient for hyperparameter optimization than trials on a grid.

**Careful with best values on border**

Suppose we use `learning_rate = 10**uniform(-6,1)`. It is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing a better hyperparameter setting beyond the interval.

**Stage your search from coarse to fine**

It can be helpful to first search in coarse ranges (e.g. `10**[-6,1]`) and then, depending on where the best results are turning up, narrow the range.


**Bayesian Hyperparameter Optimization**

This is a whole area of research devoted to coming up with algorithms that try to navigate the space of hyperparameters more efficiently.

### Evaluation

In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models and average their predictions at test time. There are a few approaches to forming an ensemble:

**Same model, different initialization**

Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization.

**Top models discovered during cross-validation**

**Different checkpoints of a single model**

**Running average of parameters during training**
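
This last approach can be sketched as keeping a second copy of the parameters that tracks an exponentially decaying average of the values seen during training. In the sketch below the decay value, the helper name `update_running_average`, and the noisy trajectory standing in for SGD iterates are all assumptions for illustration.

```python
import numpy as np

def update_running_average(avg_params, params, decay=0.99):
    """Exponentially decaying average of parameter values."""
    return decay * avg_params + (1 - decay) * params

# Hypothetical noisy training trajectory around a fixed optimum
rng = np.random.default_rng(0)
optimum = np.array([1.0, -2.0])
params = optimum + rng.standard_normal(2)
avg_params = params.copy()
for _ in range(2000):
    params = optimum + 0.5 * rng.standard_normal(2)   # stand-in for noisy SGD iterates
    avg_params = update_running_average(avg_params, params)
```

The averaged copy smooths out the step-to-step noise of the individual iterates and would be the one used at test time.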


## Putting it together: Minimal Neural Network Case Study
Details on course website.

## Convolutional Neural Networks: Architectures, Convolution / Pooling Layers

### Layers used to build ConvNets

We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer and Fully-Connected Layer.
   
### Convolutional Layer
* Accepts a volume of size $W_1 \times H_1 \times D_1$
* Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S and the amount of zero padding P.
* Produces a volume of size $W_2 \times H_2 \times D_2$, where $W_2=(W_1-F+2P)/S+1$, $H_2=(H_1-F+2P)/S+1$, $D_2=K$
* With parameter sharing, it introduces $F\cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and K biases.
* In the output volume, the d-th depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offsetting by the d-th bias.
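
The size and parameter-count formulas above can be checked with a short sketch. The function names are assumptions for illustration, and the example values (a 227x227x3 input with 96 filters of size 11 at stride 4 and no padding) are the ones used in the matrix-multiplication example in these notes.

```python
def conv_output_size(w1, h1, f, s, p, k):
    """Return (W2, H2, D2) for a conv layer; raise if the filter does not fit evenly."""
    if (w1 - f + 2 * p) % s != 0 or (h1 - f + 2 * p) % s != 0:
        raise ValueError("hyperparameters do not fit the input evenly")
    return (w1 - f + 2 * p) // s + 1, (h1 - f + 2 * p) // s + 1, k

def conv_param_count(f, d1, k):
    """Weights plus biases with parameter sharing: (F*F*D1)*K + K."""
    return f * f * d1 * k + k

output_shape = conv_output_size(227, 227, f=11, s=4, p=0, k=96)
num_params = conv_param_count(f=11, d1=3, k=96)
```

The divisibility check reflects the fact that the formulas only make sense when $(W_1-F+2P)/S$ is an integer.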

![alt text](./img/conv.png "Title")

**Implementation as Matrix Multiplication**

Note that the convolution operation essentially performs dot products between the filters and local regions of the input.

* The local regions in the input image are stretched out into columns in an operation commonly called **im2col**. For example, if the input is $[227 \times 227 \times 3]$ and it is to be convolved with $11 \times 11 \times 3$ filters at stride 4, then we would take $[11 \times 11 \times 3]$ blocks of pixels in the input and stretch each block into a column vector of size $11 \cdot 11 \cdot 3 = 363$. Iterating this process in the input at a stride of 4 gives $(227-11)/4+1=55$ locations along both width and height, leading to an output matrix **X_col** of im2col of size $[363 \times 3025]$, where every column is a stretched out receptive field and there are $55 \cdot 55 = 3025$ of them in total.

* The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size $[11 \times 11 \times 3]$ this would give a matrix **W_row** of size $[96 \times 363]$.

* The result of the convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, whose output has size $[96 \times 3025]$.
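
The three steps above can be sketched in NumPy as follows. This is a simple loop-based `im2col` written for clarity rather than speed; the function name, the (H, W, D) memory layout, the random input, and the omission of biases are assumptions for illustration.

```python
import numpy as np

def im2col(x, f, stride):
    """Stretch every f x f x D receptive field of x (H, W, D) into a column."""
    h, w, d = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    cols = np.empty((f * f * d, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            cols[:, i * out_w + j] = patch.ravel()
    return cols, out_h, out_w

rng = np.random.default_rng(0)
x = rng.standard_normal((227, 227, 3))
filters = rng.standard_normal((96, 11, 11, 3))

X_col, out_h, out_w = im2col(x, f=11, stride=4)   # [363 x 3025]
W_row = filters.reshape(96, -1)                   # [96 x 363]
out = np.dot(W_row, X_col)                        # [96 x 3025]
out = out.reshape(96, out_h, out_w)               # one 55 x 55 map per filter
```

The final reshape turns each row of the product back into the spatial activation map of the corresponding filter.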

**Dilated convolutions**

It is possible to have filters that have spaces between each cell, called dilation.

### Pooling Layer

It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. The most common form is a pooling layer with filters of size $2 \times 2$ applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

* Accepts a volume of size $W_1 \times H_1 \times D_1$
* Requires two hyperparameters: the spatial extent F and the stride S.
* Produces a volume of size $W_2 \times H_2 \times D_2$ where $W_2=(W_1-F)/S+1$, $H_2=(H_1-F)/S+1$ and $D_2=D_1$.
* Introduces zero parameters since it computes a fixed function of the input.
* For Pooling layers, it is not common to pad the input using zero-padding.

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with $F=3, S=2$, and more commonly $F=2, S=2$.
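
A minimal sketch of max pooling over an (H, W, D) volume is shown below. The function name, the memory layout, and the small test input are assumptions for illustration.

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling over an (H, W, D) volume; each depth slice is pooled independently."""
    h, w, d = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.empty((out_h, out_w, d), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

x = np.arange(16, dtype=float).reshape(4, 4, 1)   # values 0..15 in a 4 x 4 slice
pooled = max_pool(x)                              # [[5, 7], [13, 15]] in the single slice
```

With $F=2, S=2$ the output holds one value for every four inputs, which is the 75% reduction mentioned above.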

**General Pooling**

In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling.

![alt text](./img/pooling.png "Title")

### Fully-connected layer
Neurons in a fully connected layer have full connections to all activations in previous layer, as seen in regular Neural Networks.

**Layer Patterns**

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size.

**Layer Sizing Patterns**

* The size of the input layer should be divisible by 2 many times.
* The conv layers should use small filters (e.g. $3 \times 3$ or at most $5 \times 5$) with a stride of S = 1 and, crucially, pad the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input.
* The pool layers are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with $2 \times 2$ receptive fields and a stride of 2.

## Visualizing what ConvNets learn

### Visualizing activations and first-layer weights

**Layer Activations**

The most straightforward visualization technique is to show the activations of the network during the forward pass. For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized. One dangerous pitfall that can be easily noticed with this visualization is that some activation maps may be all zero for many different inputs, which can indicate dead filters and can be a symptom of high learning rates.

**Conv/FC Filters**

The second common strategy is to visualize the weights. Noisy patterns can be an indicator of a network that hasn't been trained for long enough, or possibly of a very low regularization strength that may have led to overfitting.

**Retrieving images that maximally activate a neuron**

Another visualization technique is to take a large dataset of images, feed them through the network and keep track of which images maximally activate some neuron. We can then visualize the images to get an understanding of what the neuron is looking for in its receptive field.

### Embedding the codes with t-SNE

ConvNets can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier. We can get a rough idea about the topology of this space by embedding images into two dimensions so that distances between them in the low-dimensional representation are approximately equal to those in the high-dimensional representation.

### Occluding parts of the image

Suppose that a ConvNet classifies an image as a dog. How can we be certain that it's actually picking up on the dog in the image as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image a classification prediction is coming from is by plotting the probability of the class of interest as a function of the position of an occluder object.
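
A sketch of this occlusion experiment is shown below. Since no trained network is available here, `toy_predict_prob` is a purely hypothetical stand-in for a function that returns the class probability for an image; the patch size, stride, and fill value are likewise assumptions.

```python
import numpy as np

def occlusion_map(image, predict_prob, patch=8, stride=8, fill=0.0):
    """Class probability as a function of the occluder position."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            occluded[i * stride:i * stride + patch, j * stride:j * stride + patch] = fill
            heat[i, j] = predict_prob(occluded)
    return heat

# Stand-in "classifier" (purely hypothetical): responds only to the top-left corner.
def toy_predict_prob(img):
    return float(img[:8, :8].mean())

image = np.ones((32, 32))
heat = occlusion_map(image, toy_predict_prob)
```

Positions where the value in `heat` drops sharply are the regions the prediction depends on; here that is only the top-left cell.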
