&copy;Copyright for [Shuang Wu] [2017]<br>
Cited from the [Coursera] course [Neural Networks for Machine Learning] from [University of Toronto]<br>
Learning notes<br>

# Learning the weights of a linear neuron

## Why perceptron procedure cannot be generalised to hidden layers

* Perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every "generously feasible" set of weights
    * cannot be extended to more complex networks, in which the average of two good solutions may be a bad solution
* multi-layer NN do not use the perceptron learning procedure

## Diff. way to show learning procedure makes progress

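* Show that the output values get closer to the target values, instead of showing that the weights get closer to a good set of weights
    * can be true even for non-convex problems
        * in non-convex problems, averaging two good sets of weights can give a bad set
    * not true for perceptron learning: the outputs can move further from the targets even while the weights get closer to a good set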
* e.g.: linear neuron w/ a squared error measure

## Linear neurons (linear filters)

* Neuron has a real-valued output which is a weighted sum of its inputs
* $y=\sum_iw_ix_i=\vec{w}^T\vec{x}$
    * $y$: neuron's estimate of the desired output
    * $\vec{w}$ weight vec.
    * $\vec{x}$ input vec.
* Aim of learning is to minimize the error summed over all training cases
    * error is the squared difference b/w the desired output and the actual output
    
## Why not solve analytically

* Standard engineering approach
    * write down set of equations, one per training case, and solve for the best set of weights
* **Scientific**: want a method that real neurons could use
* **Engineering**: want a method that can be generalized to multi-layer, non-linear NN
    * analytic solution relies on it being linear and having a squared error measure
    * iterative methods usually less efficient but they are much easier to generalize
    
## Toy e.g. for iterative method

* Each day lunch at the Cafe
    * diet consists of fish, chips and ketchup
    * several portions of each
* Cashier only tells the total price of the meal
    * after several days, you should be able to figure out the price of each portion
* The iterative approach: start w/ random guesses for the prices and adjust them to get a better fit to the observed prices of whole meals

## Solving the equations iteratively

*  $price = x_{fish}w_{fish}+x_{chips}w_{chips}+x_{ketcp}w_{ketcp}$
    * Linear constraint on the prices
* $\vec{w} = (w_{fish}, w_{chips}, w_{ketcp})$
    * prices are like the weights of a linear neuron
* start w/ guesses for the w8 and then adjust the guesses slightly to give a better fit to the prices given by the cashier
* **True** w8 used by the cashier
    * ![img29](imgs/img29.jpg)

## Model the cashier w/ arbitrary initial weights

* initial guess
    * ![img30](imgs/img30.jpg)
    * residual error = 350
    * "Delta-rule" for learning: $\Delta w_i=\epsilon x_i(t-y)$
    * w/ $\epsilon = 1/35$, the weight changes are: +20, +50, +30
    * new weights: 70, 100, 80
        * this makes the w8 for chips worse
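
The single update above can be reproduced with a few lines of Python. This is a minimal sketch: the portions (2, 5, 3), the initial guess of 50 per portion and the total of 850 are not stated in the text (they are in the images), so they are inferred here from the residual of 350, the learning rate and the weight changes, and should be read as assumptions.

```python
# Values below are inferred from the numbers in the notes, not copied from the images.
portions = [2, 5, 3]        # fish, chips, ketchup (assumed)
target = 850                # total price reported by the cashier (assumed)
weights = [50, 50, 50]      # initial guess for the price per portion (assumed)
epsilon = 1 / 35            # learning rate from the notes


def delta_rule_step(w, x, t, eps):
    """Apply one delta-rule update for a single training case."""
    y = sum(wi * xi for wi, xi in zip(w, x))        # neuron's estimate of the total
    residual = t - y                                # residual error
    deltas = [eps * xi * residual for xi in x]      # Delta w_i = eps * x_i * (t - y)
    new_w = [wi + d for wi, d in zip(w, deltas)]
    return residual, deltas, new_w


residual, deltas, new_weights = delta_rule_step(weights, portions, target, epsilon)
print("residual:", residual)
print("weight changes:", [round(d, 6) for d in deltas])
print("new weights:", [round(w, 6) for w in new_weights])
```

Up to floating-point rounding, the weight changes and new weights match the ones listed in the notes.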
        
## Deriving delta rule

* Define the error as the squared residuals summed over all training cases
    * $$E=\frac{1}{2}\sum_{n\in training}(t^n-y^n)^2$$
* Differentiate to get error derivatives for weights
    * $$\frac{\partial E}{\partial w_i} = \frac{1}{2}\sum_n\frac{\partial y^n}{\partial w_i}\frac{dE^n}{dy^n} = -\sum_n x_i^n(t^n-y^n)$$
    * here $E^n=(t^n-y^n)^2$ is the squared residual on case $n$, so the $\frac{1}{2}$ cancels the 2 that comes from differentiating
* Batch delta rule changes the weights in proportion to their error derivatives summed over all training cases
    * $$\Delta w_i=-\epsilon\frac{\partial E}{\partial w_i}=\sum_n\epsilon x^n_i(t^n-y^n)$$
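
A sketch of the batch delta rule applied repeatedly, to illustrate that the summed update does move the weights towards values that fit all cases. The meals, the "true" prices used to generate the totals, the learning rate and the number of sweeps are all made up for illustration.

```python
# Hypothetical per-portion prices and meals, used only to generate training data.
true_prices = [150.0, 50.0, 100.0]
meals = [[2, 5, 3], [1, 2, 4], [3, 1, 1], [4, 3, 2]]      # portions per meal
totals = [sum(p * x for p, x in zip(true_prices, meal)) for meal in meals]


def output(w, x):
    """Linear neuron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def error(w, inputs, targets):
    """E = 1/2 * sum_n (t^n - y^n)^2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in zip(inputs, targets))


def error_gradient(w, inputs, targets):
    """dE/dw_i = -sum_n x_i^n (t^n - y^n)."""
    grad = [0.0] * len(w)
    for x, t in zip(inputs, targets):
        residual = t - output(w, x)
        for i, xi in enumerate(x):
            grad[i] -= xi * residual
    return grad


def train_batch(w, inputs, targets, eps, sweeps):
    """Repeat the batch delta rule: w_i <- w_i - eps * dE/dw_i."""
    for _ in range(sweeps):
        grad = error_gradient(w, inputs, targets)
        w = [wi - eps * g for wi, g in zip(w, grad)]
    return w


learned = train_batch([50.0, 50.0, 50.0], meals, totals, eps=0.01, sweeps=2000)
print([round(wi, 3) for wi in learned])
```

The learning rate has to be small relative to the size of the inputs; with a much larger `eps` the same loop would diverge on this data.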
    
## Behav. of the iterative learning procedure

* Does the learning procedure eventually get the right ans?
    * there may be no perfect answer
    * by making the learning rate small enough, can get as close as we desire to the best answer
* How quickly do the weights converge?
    * Can be very slow if 2 input dim. are highly correlated. If almost always have the same # of portions of ketchup and chips, it is hard to decide how to divide the price between ketchup and chips.
    
## Relationship b/w online delta-rule and learning rule for perceptrons

* Perceptron
    * increment or decrement the weight vec. by the input vec.
    * only change the weights when it makes an error
* Online delta-rule
    * increment or decrement the weight vec. by the input vec. scaled by the residual error and the learning rate
    * need to choose the learning rate, annoying
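
The two update rules can be put side by side. This is a sketch under assumptions: the perceptron is taken to be a binary threshold unit with 0/1 targets and threshold 0 (any bias folded into the inputs), and the function names and example numbers are made up.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_update(w, x, t):
    """Binary threshold unit with 0/1 targets: change the weights only on an error."""
    y = 1 if dot(w, x) >= 0 else 0
    if y == t:
        return list(w)                          # correct output: leave weights alone
    sign = 1 if t == 1 else -1                  # add x if it output 0, subtract if it output 1
    return [wi + sign * xi for wi, xi in zip(w, x)]


def online_delta_update(w, x, t, eps):
    """Linear neuron: always move by the input scaled by residual and learning rate."""
    residual = t - dot(w, x)
    return [wi + eps * xi * residual for wi, xi in zip(w, x)]


w = [0.5, -1.0]
x = [1.0, 1.0]
print(perceptron_update(w, x, 1))               # wrong output, so the input vec. is added
print(perceptron_update(w, x, 0))               # correct output, so nothing changes
print(online_delta_update(w, x, 1.0, eps=0.1))  # small step scaled by the residual
```

The perceptron step has a fixed size, while the delta-rule step shrinks as the residual shrinks, which is why the latter needs a learning rate.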

# Error surface for a linear neuron

## Error surface in extended w8 space

* Error surface lies in a space w/ a horizontal axis for each w8 and one vertical axis for error
    * for a linear neuron w/ squared error, it is a quadratic bowl
    * vertical cross-sections are parabolas
    * horizontal cross-sections are ellipses
* For multi-layer, non-linear nets, the error surface is much more complicated
    * ![img31](imgs/img31.jpg)
    
## Online versus batch learning

* Simplest kind of batch learning does steepest descent on the error surface
    * travels perpendicular to the contour lines
    * ![img32](imgs/img32.jpg)
* simplest kind of online learning zig-zags around the direction of steepest descent
    * ![img33](imgs/img33.jpg)
    
## Why learning can be slow

* ![img34](imgs/img34.jpg)
* If the ellipse is elongated, the direction of steepest descent is almost perpendicular to the direction towards the minimum
    * red gradient vec. has large component along the short axis of the ellipse and small component along the long axis of the ellipse
    * the opposite of what we want
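
A small numerical illustration of this effect. The bowl below (curvature 1 along one w8 and 100 along the other) and the starting point are hypothetical, chosen only to give an elongated ellipse.

```python
import math

# Hypothetical elongated quadratic bowl: E(w) = 0.5 * (a*w1^2 + b*w2^2), minimum at (0, 0).
a, b = 1.0, 100.0
w = (10.0, 1.0)

gradient = (a * w[0], b * w[1])              # dE/dw at w
descent = (-gradient[0], -gradient[1])       # direction of steepest descent
to_minimum = (-w[0], -w[1])                  # direction from w towards the minimum


def angle_degrees(u, v):
    """Angle between two 2-D vectors, in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(cos))


angle = angle_degrees(descent, to_minimum)
print("gradient:", gradient)
print("angle between steepest descent and direction to minimum:", round(angle, 1))
```

Here the gradient is ten times larger along `w2` (the short axis of the ellipse) than along `w1` (the long axis), and the angle comes out close to 90 degrees.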

# Learning the w8 of a logistic output neuron

## Logistic neurons

* These give a real-valued output that is a smooth and bounded function of their total input
    * nice derivatives make learning easy
    * $$z=b+\sum_i x_iw_i$$
    * $$y=\frac{1}{1+e^{-z}}$$
    * ![img35](imgs/img35.jpg)
    
## Derivatives of logistic neuron

* The logit is $$z=b+\sum_i x_iw_i$$ and its derivatives w.r.t. the inputs and the weights are:
    * $$\frac{\partial z}{\partial w_i}=x_i\quad \quad \frac{\partial z}{\partial x_i}=w_i$$
* Derivative of the output w.r.t. the logit is simple if we express it in terms of the output: $$y= \frac{1}{1+e^{-z}}$$
    * $$\frac{dy}{dz}=y(1-y)$$
* Derivation, using $\frac{e^{-z}}{1+e^{-z}}=1-y$:

$$y= \frac{1}{1+e^{-z}}=(1+e^{-z})^{-1}$$
$$\frac{dy}{dz} = \frac{-1(-e^{-z})}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}}=y(1-y)$$
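
The identity $\frac{dy}{dz}=y(1-y)$ can be checked numerically against a finite-difference slope. A minimal sketch; the function names and the test points are arbitrary choices.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_slope(z):
    """dy/dz expressed in terms of the output: y * (1 - y)."""
    y = logistic(z)
    return y * (1.0 - y)


def numerical_slope(z, h=1e-6):
    """Central-difference estimate of dy/dz."""
    return (logistic(z + h) - logistic(z - h)) / (2.0 * h)


for z in (-3.0, 0.0, 2.5):
    print(z, logistic_slope(z), numerical_slope(z))
```

The two columns should agree to many decimal places, and the slope is largest at $z=0$, where $y=0.5$.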

## Using the chain rule to get the derivatives needed for learning the w8 of a logistic unit

* To learn the w8 need the derivative of the output w.r.t each weight:
    * $$\frac{\partial y}{\partial w_i}= \frac{\partial z}{\partial w_i}\frac{dy}{dz}=x_i y(1-y)$$<br>
    * <br>
    * $$\frac{\partial E}{\partial w_i} = \sum_n\frac{\partial y^n}{\partial w_i}\frac{\partial E}{\partial y^n}=-\sum_n x_i^n y^n(1-y^n)(t^n-y^n)$$
        * $x_i^n$ and $(t^n-y^n)$ together are the <font color='red'>delta-rule</font>
        * $y^n(1-y^n)$ is the <font color='green'>extra term, the slope of the logistic</font>
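
A sketch of this derivative for a logistic neuron, checked against finite differences. The training cases, weights and bias are made-up values; only the gradient formula comes from the notes.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """y = logistic(b + sum_i x_i w_i)."""
    return logistic(b + sum(xi * wi for xi, wi in zip(x, w)))


def error(w, b, inputs, targets):
    return 0.5 * sum((t - predict(w, b, x)) ** 2 for x, t in zip(inputs, targets))


def error_gradient(w, b, inputs, targets):
    """dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n)."""
    grad = [0.0] * len(w)
    for x, t in zip(inputs, targets):
        y = predict(w, b, x)
        common = y * (1.0 - y) * (t - y)        # logistic slope times residual
        for i, xi in enumerate(x):
            grad[i] -= xi * common
    return grad


def numerical_gradient(w, b, inputs, targets, h=1e-6):
    """Central-difference estimate of dE/dw_i, one weight at a time."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((error(up, b, inputs, targets) - error(down, b, inputs, targets)) / (2 * h))
    return grad


# Hypothetical training cases and parameters.
inputs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0]
w, b = [0.3, -0.2], 0.1
print(error_gradient(w, b, inputs, targets))
print(numerical_gradient(w, b, inputs, targets))
```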

# Backpropagation algo.

## Learning w/ hidden units (again)

* Networks w/o hidden units limited in the input-output mappings they can model
* Adding a layer of hand-coded features (as in a perceptron) makes them much more powerful, but the hard part is designing the features
    * would like to find good features w/o requiring insights into the task or repeated trial and error where we guess some features and see how well they work
* need to automate the loop of designing features for a particular task and seeing how well they work

## Learning by perturbing weights

* Randomly perturb one weight and see if it improves performance; if so, save the change
    * form of reinforcement learning
    * <font color='red'>Inefficient</font>. Need multiple forward passes on a representative set of training cases just to change one weight. Backprop. is much better
    * towards the end of learning, large weight perturbations nearly always make things <font color='red'>worse</font>, b/c the weights need to have the right relative values
    * ![img36](imgs/img36.jpg)
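
A rough sketch of learning by perturbing one weight at a time, applied to a linear neuron so that the cost is easy to see: every trial needs a sweep over the training cases for the candidate and for the current weights. The training cases, the perturbation scale, the seed and the number of trials are all made up.

```python
import random


def error(w, cases):
    """Squared error of a linear neuron summed over all training cases."""
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, t in cases)


def perturb_one_weight(w, cases, rng, scale):
    """Try a random change to one weight; keep it only if the error drops."""
    i = rng.randrange(len(w))
    candidate = list(w)
    candidate[i] += rng.gauss(0.0, scale)
    if error(candidate, cases) < error(w, cases):
        return candidate, True
    return list(w), False


rng = random.Random(0)
# Hypothetical (portions, total price) training cases.
cases = [([2, 5, 3], 850), ([1, 2, 4], 650), ([3, 1, 1], 600)]

w = [50.0, 50.0, 50.0]
start_error = error(w, cases)
accepted = 0
for _ in range(500):
    w, kept = perturb_one_weight(w, cases, rng, scale=5.0)
    accepted += kept
final_error = error(w, cases)
print(start_error, final_error, accepted)
```

Each accepted change touches a single weight, whereas one gradient computation gives a direction for all weights at once.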
    
## Learning by using perturbations

* Could randomly perturb all weights in parallel and correlate the performance gain w/ the weight changes
    * not any better, b/c need lots of trials on each training case to 'see' the effect of changing one w8 through the noise created by all the changes to other w8
* Better idea: Randomly perturb the activities of the hidden units
    * once we know how we want a hidden activity to change on a given training case, can compute how to change the w8
    * there are fewer activities than weights, but backpropagation still wins by a factor of the # of neurons.
    
## Idea behind backpropagation

* Don't know what the hidden units ought to do, but can compute how fast the error changes as we change a hidden activity
    * use <font color='red'>error derivatives w.r.t. hidden activities</font>, instead of using desired activities to train the hidden units
    * Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
* Can compute error derivatives for all hidden units efficiently at same time
    * once have the error derivatives for hidden activities, easy to get the error derivatives for the weights going into a hidden unit
    
## Sketch of the backpropagation algo. on a single case

* 1st, convert the discrepancy b/w each output and its target value into an error derivative
    * $$E=\frac{1}{2}\sum_{j\in output}(t_j-y_j)^2$$
    * $$\frac{\partial E}{\partial y_j}=-(t_j-y_j)$$
* 2nd, compute error derivatives in each hidden layer from error derivatives in the layer above
* 3rd, use error derivatives w.r.t. activities to get error derivatives w.r.t. incoming weights
    * ![img37](imgs/img37.jpg)
    
## Backpropagating $dE/dy$

* $$\frac{\partial E}{\partial z_j} = \frac{dy_j}{dz_j}\frac{\partial E}{\partial y_j} = y_j(1-y_j)\frac{\partial E}{\partial y_j}$$
* 
* $$\frac{\partial E}{\partial y_i} = \sum_j\frac{dz_j}{dy_i}\frac{\partial E}{\partial z_j}=\sum_jw_{ij}\frac{\partial E}{\partial z_j}$$
* 
* $$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}}\frac{\partial E}{\partial z_j}=y_i\frac{\partial E}{\partial z_j}$$
* ![img38](imgs/img38.jpg)
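
The three equations above can be sketched for a tiny network with one hidden layer of logistic units and logistic outputs, on a single training case. Assumptions: no biases, squared error, and made-up sizes, weights and inputs; the finite-difference helper is only there to check the backpropagated derivatives.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hid, w_out):
    """w_hid[i][j]: input i -> hidden j; w_out[j][k]: hidden j -> output k."""
    n_hid, n_out = len(w_hid[0]), len(w_out[0])
    y_hid = [logistic(sum(x[i] * w_hid[i][j] for i in range(len(x)))) for j in range(n_hid)]
    y_out = [logistic(sum(y_hid[j] * w_out[j][k] for j in range(n_hid))) for k in range(n_out)]
    return y_hid, y_out


def error(x, t, w_hid, w_out):
    _, y_out = forward(x, w_hid, w_out)
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y_out))


def backprop(x, t, w_hid, w_out):
    """Return dE/dw for both weight layers on a single training case."""
    y_hid, y_out = forward(x, w_hid, w_out)
    # Output layer: dE/dy = -(t - y), then dE/dz = y(1 - y) dE/dy.
    dE_dz_out = [-(tk - yk) * yk * (1 - yk) for tk, yk in zip(t, y_out)]
    # Hidden layer: dE/dy_i = sum_j w_ij dE/dz_j, then dE/dz_i = y_i(1 - y_i) dE/dy_i.
    dE_dy_hid = [sum(w_out[j][k] * dE_dz_out[k] for k in range(len(y_out)))
                 for j in range(len(y_hid))]
    dE_dz_hid = [yj * (1 - yj) * g for yj, g in zip(y_hid, dE_dy_hid)]
    # Weights: dE/dw_ij = (activity on the input side) * dE/dz_j.
    grad_out = [[yj * dz for dz in dE_dz_out] for yj in y_hid]
    grad_hid = [[xi * dz for dz in dE_dz_hid] for xi in x]
    return grad_hid, grad_out


def numerical_derivative(x, t, w_hid, w_out, layer, i, j, h=1e-6):
    """Central-difference dE/dw for one weight (layer 0 = hidden, 1 = output)."""
    def perturbed(delta):
        layers = [[row[:] for row in w_hid], [row[:] for row in w_out]]
        layers[layer][i][j] += delta
        return error(x, t, layers[0], layers[1])
    return (perturbed(h) - perturbed(-h)) / (2 * h)


# Hypothetical training case and weights.
x, t = [1.0, 0.5], [1.0]
w_hid = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.5], [-0.6]]
grad_hid, grad_out = backprop(x, t, w_hid, w_out)
print(grad_hid)
print(grad_out)
```

The finite-difference check needs two forward passes per weight, while `backprop` gets every derivative from one forward and one backward pass.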

# Use the derivatives computed by the backpropagation alg.

## Converting error derivatives into learning procedure

* backpropagation alg. is an efficient way of computing the error derivative $dE/dw$ for every w8 on a single training case
* still need to make a lot of other decisions about how to use the error derivatives to get a fully specified learning procedure
    * **Optimization**: how to use error derivatives on individual cases to discover a good set of w8 (wk 6)
    * **Generalization**: how to ensure the learned w8 work well for cases we did not see during training (wk 7)
* now a brief overview of these 2 sets of issues

## Optim. issues in using the weight derivatives

* How often to update the w8
    * Online: after each training case
    * Full batch: after a full sweep through the training data
    * Mini-batch: after a small sample of training cases
* How much to update (further in wk 6)
    * Fixed learning rate?
    * Adapt global learning rate?
    * Adapt learning rate on each connection separately?
    * Not use steepest descent?
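
The three update schedules differ only in how many training cases are grouped before each w8 update. A minimal sketch, assuming consecutive slices stand in for the "small sample" (in practice the cases would usually be shuffled first); the helper name `batches` and the 10 stand-in cases are made up.

```python
def batches(cases, batch_size):
    """Yield consecutive groups of training cases; w8 are updated once per group."""
    for start in range(0, len(cases), batch_size):
        yield cases[start:start + batch_size]


cases = list(range(10))      # stand-in for 10 training cases (hypothetical)

online = list(batches(cases, 1))                 # one update per training case
full_batch = list(batches(cases, len(cases)))    # one update per full sweep
mini_batch = list(batches(cases, 4))             # one update per small group

print(len(online), len(full_batch), len(mini_batch))   # updates per sweep: 10 1 3
```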
    
## Overfitting: downside of using powerful models

*  Training data contains info. about regularities in the mapping from input to output. But it also contains 2 types of noise
    * target val. may be unreliable (usually only a minor worry)
    * <font color='red'>Sampling error</font>. there will be accidental regularities just b/c of the particular training cases that were chosen
* When we fit the model, it cannot tell which regularities are real and which are caused by sampling error
    * So it fits both kinds of regularity
    * if the model is very flexible it can model the sampling error really well. <font color='red'>This is a disaster</font>
    
## Simple e.g. of overfitting

* Which model to trust?
    * Complicated model fits the data better
    * but it is not economical
* Model is convincing when it fits a lot of data surprisingly well
    * not surprising that a complicated model can fit a small amount of data well
![img39](imgs/img39.jpg)
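
A toy version of this comparison, as a sketch with made-up data: the underlying regularity is assumed to be $t=x$, the training targets carry a fixed made-up error of $\pm 0.5$, the simple model is a least-squares line and the complicated model is the degree-5 polynomial through all 6 training points.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]       # made-up "sampling error"
ts = [x + e for x, e in zip(xs, noise)]         # hypothetical true regularity: t = x


def fit_line(xs, ts):
    """Least-squares straight line (the simple model)."""
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    slope = (sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
             / sum((x - mx) ** 2 for x in xs))
    intercept = mt - slope * mx
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ts):
    """Polynomial through every training point, in Lagrange form (the flexible model)."""
    def predict(x):
        total = 0.0
        for k, (xk, tk) in enumerate(zip(xs, ts)):
            term = tk
            for j, xj in enumerate(xs):
                if j != k:
                    term *= (x - xj) / (xk - xj)
            total += term
        return total
    return predict


def squared_error(model, xs, ts):
    return sum((t - model(x)) ** 2 for x, t in zip(xs, ts))


line = fit_line(xs, ts)
poly = fit_interpolating_polynomial(xs, ts)

held_out_xs = [0.5, 1.5, 2.5, 3.5, 4.5]
held_out_ts = list(held_out_xs)                 # noise-free targets from t = x

print("training error:", squared_error(line, xs, ts), squared_error(poly, xs, ts))
print("held-out error:", squared_error(line, held_out_xs, held_out_ts),
      squared_error(poly, held_out_xs, held_out_ts))
```

On this data the polynomial fits the training cases exactly but does worse than the line on the held-out cases, b/c it has modelled the sampling error.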

## Ways to reduce overfitting

* lots of diff. methods
    * w8-decay
    * w8-sharing
    * early stopping
    * model averaging
    * Bayesian fitting of NN
    * dropout
    * generative pre-training
* details in wk 7