### Encoding of outputs
* Encoding of output can impact performance.
* Common encoding methods for classification:
    - Integer label for each class
    - One-hot encoding (binary vectors with one indicator for each class)
    - Binary (or other base) encoding 
* Similar choices can be made for regression and other problems
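
The three classification encodings listed above can be sketched with NumPy as follows. The label values and the 4-class setup are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical integer labels for a 4-class problem (classes 0..3).
labels = np.array([0, 2, 1, 3])
num_classes = 4

# One-hot encoding: row i has a single 1 in column labels[i].
one_hot = np.eye(num_classes, dtype=int)[labels]

# Binary (base-2) encoding: ceil(log2(num_classes)) bits per label,
# most significant bit first.
num_bits = int(np.ceil(np.log2(num_classes)))
bit_positions = np.arange(num_bits - 1, -1, -1)
binary = (labels[:, None] >> bit_positions) & 1

print(one_hot)
print(binary)
```

Integer labels need one output, one-hot needs one output per class, and the binary code needs only two outputs here, so the choice of encoding also changes the size of the output layer.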

### Rate of Learning

* The back-propagation algorithm provides an approximation to the trajectory in weight space computed by the method of steepest descent
*  Small $\eta$: smoother trajectory, slower learning
*  Too large $\eta$: large changes in the synaptic weights, so the network may become unstable (oscillatory)


*  *Include a momentum term*: tries to increase rate of learning while avoiding instability
* Recall: $\Delta w_{ij}(n) = \eta \delta_j(n) y_i(n) = -\eta \frac{\partial E(n)}{\partial v_j(n)} y_i(n)$
*  *The Generalized Delta rule (Delta rule with momentum):* $\Delta w_{ij}(n) = \alpha \Delta w_{ij}(n-1) + \eta \delta_j(n) y_i(n)$
*  Can write the generalized delta rule as a time series (index $t$):
\begin{eqnarray}
\Delta w_{ij}(n) &=& \eta \sum_{t=0}^n \alpha^{(n-t)}\delta_j(t)y_i(t)\\
&=& -\eta \sum_{t=0}^n \alpha^{(n-t)} \frac{\partial E(t)}{\partial w_{ij}(t)}
\end{eqnarray}
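
As a quick numerical check, the sketch below applies the recursive generalized delta rule and compares it with the exponentially weighted sum obtained by unrolling it (taking $\Delta w_{ij}(-1) = 0$). The values of $\eta$ and $\alpha$ and the random stand-ins for $\delta_j(t) y_i(t)$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, alpha = 0.1, 0.9          # learning rate and momentum constant (assumed values)
n_steps = 20

# Stand-in values for the term delta_j(t) * y_i(t) at each step.
grad_terms = rng.normal(size=n_steps)

# Recursive form: dw(n) = alpha * dw(n-1) + eta * delta_j(n) * y_i(n)
dw = 0.0
recursive = []
for g in grad_terms:
    dw = alpha * dw + eta * g
    recursive.append(dw)
recursive = np.array(recursive)

# Unrolled form: dw(n) = eta * sum_{t=0}^{n} alpha^(n-t) * delta_j(t) * y_i(t)
unrolled = np.array([
    eta * sum(alpha ** (n - t) * grad_terms[t] for t in range(n + 1))
    for n in range(n_steps)
])

print(np.allclose(recursive, unrolled))
```

If the term $\delta_j(t) y_i(t)$ stays constant, the sum approaches $\eta / (1 - \alpha)$ times that term, which is where the acceleration from momentum comes from.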


#### Observations & Comments:
*  $\Delta w_{ij}(n)$ is the sum of an exponentially weighted time series. For it to converge, we need $0 \le |\alpha| < 1$. If $\alpha = 0$, you are operating without momentum
*  Inclusion of momentum accelerates descent in steady downhill directions
*  Inclusion of momentum has a stabilizing effect in directions that oscillate in sign
*  Momentum may prevent termination/convergence in a shallow minimum
*  The learning rate *can* be connection dependent, $\eta_{ij}$, can even set it to zero for some connections

* See Haykin's Neural Network text for more reading on this and the following topics. It was the reference for these notes and provides a nice description.


### Online vs. Batch Update

* Thus far we have focused on Online learning. *What is meant by online learning?*
* You can also do:  Batch update or (Stochastic) Mini-Batch Update
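
The three update schemes differ only in how many samples contribute to each weight update. Below is a minimal sketch on a linear least-squares problem. The data, the learning rate, and the helper names `grad` and `train` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))                 # hypothetical inputs
y = X @ np.array([2.0, -1.0])                # hypothetical noiseless targets
eta = 0.05                                   # assumed learning rate


def grad(w, Xb, yb):
    """Gradient of the error 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)


def train(batch_size, epochs=500):
    w = np.zeros(2)
    for _ in range(epochs):
        order = rng.permutation(len(y))      # reshuffle the samples every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad(w, X[idx], y[idx])
    return w


w_online = train(batch_size=1)         # online: update after every sample
w_minibatch = train(batch_size=8)      # mini-batch: update after every 8 samples
w_batch = train(batch_size=len(y))     # batch: one update per epoch
print(w_online, w_minibatch, w_batch)
```

On this easy, noiseless problem all three reach the same weights. They differ in the number of updates per epoch and in how noisy each individual update is.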


### A Sampling of Heuristics/Methods that may help Back-Propagation Algorithm Perform Better

* Maximizing Information Content
    * Provide training samples that carry the largest information content
    * Use the example with the largest training error
    * Use an example that is radically different from the ones used before
    * Emphasizing scheme: present more difficult patterns than easy ones to the network, where difficulty is determined by error
    * Problems with emphasizing scheme:
        * Distribution of samples in an epoch is distorted
        * Outliers or mislabeled samples can cause major problems

*  Activation Function: MLPs may learn better with activation functions that are antisymmetric (odd) rather than nonsymmetric, that is, functions that satisfy
\begin{equation}
\phi(-v) = -\phi(v)
\end{equation}
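
To make the antisymmetry condition concrete, the short check below compares the hyperbolic tangent with the logistic sigmoid. These two functions are chosen here only as familiar examples, and `logistic` is a helper defined for this sketch.

```python
import numpy as np

v = np.linspace(-3.0, 3.0, 7)


def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))


# tanh is antisymmetric: tanh(-v) == -tanh(v)
tanh_is_odd = np.allclose(np.tanh(-v), -np.tanh(v))

# The logistic sigmoid is not antisymmetric; instead logistic(-v) == 1 - logistic(v)
logistic_is_odd = np.allclose(logistic(-v), -logistic(v))
logistic_mirror = np.allclose(logistic(-v), 1.0 - logistic(v))

print(tanh_is_odd, logistic_is_odd, logistic_mirror)
```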

* Target values need to be in the range of the sigmoid. 
* It is good if the input variables are uncorrelated
* It is good if variances are approximately equal
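
One way to obtain uncorrelated inputs with equal variances is to remove the mean, rotate onto the eigenvectors of the covariance matrix, and rescale each component. The sketch below does this on synthetic data. The mixing matrix `A` and the offsets are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated inputs with different scales and nonzero means.
A = np.array([[3.0, 0.0], [2.0, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T + np.array([10.0, -4.0])

# 1. Mean removal
Xc = X - X.mean(axis=0)

# 2. Decorrelation: rotate onto the eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_decor = Xc @ eigvecs

# 3. Variance equalization: scale each decorrelated component to unit variance
X_white = X_decor / np.sqrt(eigvals)

# The covariance of the transformed inputs is (numerically) the identity.
print(np.round(np.cov(X_white, rowvar=False), 3))
```

In practice the mean, eigenvectors, and eigenvalues would be estimated on the training set and then reused unchanged on validation and test data.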

### Strategy when training a network for the first time

* This is a good read.  Includes suggestions that I often give students and agree with:  http://karpathy.github.io/2019/04/25/recipe/
* Most commonly overlooked in my experience are:
    - Be sure you know and understand your data.  Make use of any knowledge you have or gain about the problem. For example, if you realize you have an imbalanced data set, you may want to implement a method to address this such as upsampling/data augmentation for the smaller class, weighting the examples from each class accordingly, or another approach. 
    - Overfit your network with a small subset of the data.  This helps confirm that your code is working and alerts you to bugs or issues in the data.
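
As one illustration of weighting the examples from each class, the sketch below computes inverse-frequency class weights. This is only one common heuristic among several, and the label counts are hypothetical.

```python
import numpy as np

# Hypothetical imbalanced binary labels: 90 samples of class 0, 10 of class 1.
labels = np.array([0] * 90 + [1] * 10)

counts = np.bincount(labels)
num_classes = len(counts)

# Inverse-frequency weights N / (K * count_k): each class then contributes
# the same total weight to the error.
class_weights = len(labels) / (num_classes * counts)
sample_weights = class_weights[labels]

print(class_weights)
```

The per-sample weights would multiply each sample's error term before the gradient is computed.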
 



### Network Pruning Techniques

*  *Network Growing:* Start with a small MLP and add to it when unable to meet design specifications
*  *Network Pruning:* Start with a large MLP and prune it by eliminating weights (driving them to zero)

*  *Regularization:* Just as we applied regularization during regression, we can apply regularization to our NN training: $R(w) = E_s(w) + \lambda E_c(w)$ where $E_s$ is the error measure, $E_c$ is the regularization penalty term, and $\lambda$ is a regularization "trade-off" parameter
    * $\lambda = 0$: training is based only on the error on the training data
    * $\lambda \rightarrow \infty$: the training samples are treated as unreliable and training only minimizes complexity

#### Weight Decay
\begin{equation}
E_c(w) = \left\| \mathbf{w} \right\|^2 = \sum_i w_i^2
\end{equation}
* Drives some weights to be small.
* Weights with little or no influence are excess/unnecessary weights.  The goal: encourage unnecessary weights to go to zero, or at least to take the smallest values that can solve the problem, and thereby improve generalization.

* *So, how would you include weight decay in your Back-Propagation Training Algorithm?*
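
One possible answer, sketched under the assumption that the weights are updated by plain gradient descent on $R(w) = E_s(w) + \lambda \|\mathbf{w}\|^2$: the penalty adds $2\lambda w$ to the back-propagated gradient. The function name `weight_decay_step` and all numeric values are hypothetical.

```python
import numpy as np


def weight_decay_step(w, grad_Es, eta, lam):
    """One gradient step on R(w) = E_s(w) + lam * ||w||^2.

    grad_Es is the usual back-propagated gradient of E_s with respect to w.
    """
    return w - eta * (grad_Es + 2.0 * lam * w)


# Hypothetical weights and gradient values for illustration.
w = np.array([0.5, -1.0, 2.0])
grad_Es = np.array([0.1, 0.0, -0.2])
eta, lam = 0.1, 0.01

w_new = weight_decay_step(w, grad_Es, eta, lam)

# Equivalent view: shrink the weights by (1 - 2*eta*lam), then take the usual step.
w_alt = (1.0 - 2.0 * eta * lam) * w - eta * grad_Es
print(np.allclose(w_new, w_alt))
```

A weight whose error gradient is zero (the second entry here) still shrinks toward zero at every step, which is the "decay".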

### Ill-conditioned Data

* Consider the following two systems of linear equations:
\begin{eqnarray}
 x_1 + 2x_2 = 2\\
 x_1 + 2.0001x_2 = 2
 \end{eqnarray}

 \begin{eqnarray}
 x_1 + 2x_2 = 2\\
 x_1 + 2.0001x_2 = 2.0001
 \end{eqnarray}

* *What are the solutions to both of these linear systems?*

* In an ill-conditioned problem, small changes in the values result in a large change in the solution.  This is related to *adversarial examples* in deep learning architectures. See: https://blog.openai.com/adversarial-example-research/  Adversarial examples are also related to the Curse of Dimensionality.
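
The two linear systems above share the same coefficient matrix and differ only in the second right-hand-side value. A short NumPy check (the variable names are chosen for this sketch) shows how far apart the two solutions are.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, 2.0001]])
b1 = np.array([2.0, 2.0])       # first system
b2 = np.array([2.0, 2.0001])    # second system

x1 = np.linalg.solve(A, b1)
x2 = np.linalg.solve(A, b2)

print(x1)                  # approximately [2, 0]
print(x2)                  # approximately [0, 1]
print(np.linalg.cond(A))   # large value: the system is ill-conditioned
```

A change of $0.0001$ in one right-hand-side entry moves the solution from roughly $(2, 0)$ to roughly $(0, 1)$.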

* Ill-conditioned data poses problems for MLPs as well.  Ill-conditioning affects the speed and accuracy of back-propagation. 
* The *condition number* is the ratio of the largest to the smallest eigenvalue of the Hessian of the error function.
* *What is the Hessian of the least squared error function for the simple linear example shown above?*
* *What are the condition numbers for the following data sets:*

\begin{eqnarray}
\mathbf{X} = [-1.5, 1; -0.5, 1; 0.5, 1; 1.5, 1]\\
\mathbf{Y} = [1.5; 2.5; 3.5; 4.5]
\end{eqnarray}

\begin{eqnarray}
\mathbf{X} = [0, 1; 1, 1; 2, 1; 3, 1]\\
\mathbf{Y} = [3; 4; 5; 6]
\end{eqnarray}

\begin{eqnarray}
\mathbf{X} = [10, 1; 11, 1; 12, 1; 13, 1]\\
\mathbf{Y} = [13; 14; 15; 16]
\end{eqnarray}
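
The condition numbers can be checked numerically. The sketch below assumes the error function is $E(\mathbf{w}) = \frac{1}{2}\|\mathbf{X}\mathbf{w} - \mathbf{Y}\|^2$, whose Hessian is $\mathbf{X}^T\mathbf{X}$ and does not depend on $\mathbf{Y}$. The dictionary keys are labels invented for this example.

```python
import numpy as np

datasets = {
    "input mean 0": np.array([[-1.5, 1], [-0.5, 1], [0.5, 1], [1.5, 1]]),
    "input mean 1.5": np.array([[0, 1], [1, 1], [2, 1], [3, 1]], dtype=float),
    "input mean 11.5": np.array([[10, 1], [11, 1], [12, 1], [13, 1]], dtype=float),
}

cond_numbers = {}
for name, X in datasets.items():
    H = X.T @ X                        # Hessian of 0.5 * ||X w - Y||^2 (assumed error function)
    eigvals = np.linalg.eigvalsh(H)    # eigenvalues in ascending order
    cond_numbers[name] = eigvals[-1] / eigvals[0]
    print(f"{name}: condition number = {cond_numbers[name]:.2f}")
```

The inputs have the same spread in all three data sets. Only the offset from zero changes, and that alone raises the condition number by several orders of magnitude.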

* The Hessian matrix describes the local curvature of the error surface. Ill-conditioned data have eccentric error contours, whereas well-conditioned data have circular or nearly circular error contours.

* It is also good to have similar variances across input values and to use node-specific learning rates.  Consider a linear network.  Suppose you have one input with an average value of $\approx 3000$, a bias input fixed to 1, and a target value of $\approx 30$. Then the weight for the input must be small to map 3000 to 30 (about 0.01). With a $10\%$ error, we would need a very small learning rate to avoid instability.  However, with such a small learning rate, if the bias were also wrong, it would take an extremely long time to train.

* We also don't want to saturate our activation functions as this leads to very slow learning.  So, initialize weights appropriately and normalize input data.  

### Some helpful visualizations to check out:
 
* https://playground.tensorflow.org
* https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
* http://experiments.mostafa.io/public/ffbpann/
* https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
