## Cross-Entropy Loss Function

Ideally, we hope and expect that our neural networks will learn fast from their errors. But unlike human learning, this is not always what happens in practice. Surprisingly, the opposite can take place: the network learns slowly exactly when it is badly wrong.

The partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ determine the pace of learning. So when we say learning is slow, we mean that these partial derivatives are small.

Consider an example of just one neuron.
![neuron1-0](../Images/neuron1-0.png)
We want this neuron to take the **input 1 to the output 0**.


Let's examine the **quadratic loss function** first.

$$C=\frac{(y-a)^{2}}{2}$$

where 
* $a$ is the neuron's output (when the training input $x=1$ is used)
* $y$ is the corresponding desired output ($y=0$ in our case)

Recall that $a=\sigma(z),$ where $z=w x+b$.

Using the chain rule and substituting $x=1$ and $y=0$ we get

$$\begin{aligned}
\frac{\partial C}{\partial w}&=(a-y) \sigma^{\prime}(z) x=a \sigma^{\prime}(z) \\
\frac{\partial C}{\partial b}&=(a-y) \sigma^{\prime}(z)=a \sigma^{\prime}(z)
\end{aligned}$$

![sig](../Images/sigmoid.png)

We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so $\sigma^{\prime}(z)$ gets very small. Accordingly, $\partial C / \partial w$ and $\partial C / \partial b$ get very small, even though the error is large.
However, the quadratic loss would perform well if we were dealing with a linear output layer, because then $a=z$ and no $\sigma^{\prime}(z)$ factor appears in the gradient.
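
The slowdown can be checked numerically. The sketch below assumes the single sigmoid neuron described above with $x=1$ and $y=0$; the two starting points for `w` and `b` are illustrative choices, not values taken from the figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def quadratic_grads(w, b, x=1.0, y=0.0):
    """Return (dC/dw, dC/db) for C = (y - a)^2 / 2 with a = sigmoid(w*x + b)."""
    z = w * x + b
    a = sigmoid(z)
    delta = (a - y) * sigmoid_prime(z)
    return delta * x, delta

# Illustrative starting points (assumed values).
moderate = quadratic_grads(w=0.6, b=0.9)   # z = 1.5, output roughly 0.82
saturated = quadratic_grads(w=2.0, b=2.0)  # z = 4.0, output roughly 0.98

print("moderate start :", moderate)
print("saturated start:", saturated)
```

The saturated start has the larger error, yet its gradients are several times smaller, which is exactly the slowdown described above.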

**Introducing Cross-Entropy**

![neuron](../Images/neuron.png)

We define **Cross-Entropy** for this neuron as follows:
$$C=-\frac{1}{n} \sum_{x}[y \ln a+(1-y) \ln (1-a)]$$

where $n$ is the total number of items of training data, the sum is over all training inputs, $x,$ and $y$ is the corresponding desired output.

**Why is this a loss function?**
1. It is non-negative: $C>0$, because every logarithm is taken of a number between 0 and 1.
2. If the neuron's actual output is close to the desired output for all training inputs $x$, then the cross-entropy will be close to zero.
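
Both properties can be seen in a small sketch. The function name `cross_entropy` and the sample outputs are assumptions made for illustration.

```python
import math

def cross_entropy(outputs, targets):
    """Average binary cross-entropy for outputs a in (0, 1) and targets y."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n

targets = [0.0, 1.0, 0.0]
good = cross_entropy([0.01, 0.99, 0.02], targets)  # outputs close to the targets
bad = cross_entropy([0.90, 0.20, 0.70], targets)   # outputs far from the targets

print(good, bad)
```

The cost is positive in both cases, and it is close to zero only when the outputs match the targets.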

**Will this loss function fix the slow learning problem?**
$$\begin{aligned}
\frac{\partial C}{\partial w_{j}} &=-\frac{1}{n} \sum_{x}\left(\frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)}\right) \frac{\partial \sigma}{\partial w_{j}} \\
&=-\frac{1}{n} \sum_{x}\left(\frac{y}{\sigma(z)}-\frac{(1-y)}{1-\sigma(z)}\right) \sigma^{\prime}(z) x_{j}
\end{aligned}$$

After simplifying we get
$$\frac{\partial C}{\partial w_{j}}=\frac{1}{n} \sum_{x} \frac{\sigma^{\prime}(z) x_{j}}{\sigma(z)(1-\sigma(z))}(\sigma(z)-y)$$

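Since $\sigma^{\prime}(z)=\sigma(z)(1-\sigma(z))$, the $\sigma^{\prime}(z)$ in the numerator cancels the denominator, leaving

$$\frac{\partial C}{\partial w_{j}}=\frac{1}{n} \sum_{x} x_{j}(\sigma(z)-y)$$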

This tells us that the rate at which the weight learns is controlled by $\sigma(z)-y,$ i.e., by the error in the output. The larger the error, the faster the neuron will learn. More intuitive!

Similarly for the bias we get $\frac{\partial C}{\partial b}=\frac{1}{n} \sum_{x}(\sigma(z)-y)$
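
As a sanity check, the sketch below compares the closed-form gradient $x(\sigma(z)-y)$ with a finite-difference estimate for a single training example. The starting point `w = b = 2.0` and the step `eps` are assumed values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x=1.0, y=0.0):
    """Cross-entropy for one training example."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def analytic_grad_w(w, b, x=1.0, y=0.0):
    """Closed-form gradient x * (sigma(z) - y)."""
    return x * (sigmoid(w * x + b) - y)

def numeric_grad_w(w, b, eps=1e-6):
    """Central finite-difference estimate of dC/dw."""
    return (cost(w + eps, b) - cost(w - eps, b)) / (2.0 * eps)

w, b = 2.0, 2.0  # a saturated start: the output is roughly 0.98 while the target is 0
analytic = analytic_grad_w(w, b)
numeric = numeric_grad_w(w, b)
print(analytic, numeric)
```

At this saturated starting point the gradient is close to 1, because it is simply the error in the output.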

_A little more intuition:_
The cross-entropy measures how "surprised" we are, on average, when we learn the true value for y. We get low surprise if the output is what we expect, and high surprise if the output is unexpected.

### Reinventing Cross-Entropy

Can we choose a loss function so that $\sigma^{\prime}$ disappears from the partial derivatives? That is, we want the cost $C=C_{x}$ for a single training example $x$ to satisfy

$$\begin{aligned}
\frac{\partial C}{\partial w_{j}} &=x_{j}(a-y) \\
\frac{\partial C}{\partial b} &=(a-y)
\end{aligned}$$

From the chain rule we have
$$\frac{\partial C}{\partial b}=\frac{\partial C}{\partial a} \sigma^{\prime}(z)$$

Using $\sigma^{\prime}(z)=\sigma(z)(1-\sigma(z))=a(1-a)$ the last equation becomes
$$
\frac{\partial C}{\partial b}=\frac{\partial C}{\partial a} a(1-a)
$$

Equating this with the desired $\frac{\partial C}{\partial b}=(a-y)$, we get
$$\frac{\partial C}{\partial a}=\frac{a-y}{a(1-a)}$$

Integrating this expression with respect to $a$ gives the contribution to the cost from a single training example $x$
$$
C=-[y \ln a+(1-y) \ln (1-a)]+\text { constant }
$$
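
The integration step can be spelled out with a partial-fraction split, treating $y$ as a constant:

$$\begin{aligned}
\frac{a-y}{a(1-a)} &=\frac{1-y}{1-a}-\frac{y}{a} \\
\int\left(\frac{1-y}{1-a}-\frac{y}{a}\right) d a &=-(1-y) \ln (1-a)-y \ln a+\text { constant }
\end{aligned}$$

The first line can be verified by putting the right-hand side over the common denominator $a(1-a)$, which gives $a(1-y)-y(1-a)=a-y$ in the numerator.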

To get the full cost function we must average over training examples, obtaining 

$$C=-\frac{1}{n} \sum_{x}[y \ln a+(1-y) \ln (1-a)]+\text { constant }$$





## Softmax

Instead of using a sigmoid in the output layer, it is much more common to use softmax. Applying softmax to the output layer yields the following:

$$a_{j}^{L}=\frac{e^{z_{j}^{L}}}{\sum_{k} e^{z_{k}^{L}}}$$

A good interactive visualization of softmax can be found [here](http://neuralnetworksanddeeplearning.com/chap3.html#softmax).

It is easy to prove that
$$\sum_{j} a_{j}^{L}=\frac{\sum_{j} e^{z_{j}^{L}}}{\sum_{k} e^{z_{k}^{L}}}=1$$

The output from the softmax layer is a set of positive numbers which sum up to 1. Therefore, the output from the softmax layer can be thought of as a **probability distribution**.
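
A minimal sketch of the softmax formula in plain Python follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Softmax of a list of weighted inputs z."""
    m = max(z)  # stability shift (an addition to the formula; the result is the same)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([2.0, 1.0, -1.0])
print(a, sum(a))
```

Every output is positive, the outputs sum to 1, and larger weighted inputs receive larger probabilities.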

**How does softmax help to defeat learning slowdown?**

We pair the softmax layer with the **log-likelihood loss function**, where $a_{y}^{L}$ is the output for the correct class $y$:
$$C \equiv-\ln a_{y}^{L}$$

The partial derivatives of the loss with respect to the parameters are
$$\begin{aligned}
\frac{\partial C}{\partial b_{j}^{L}} &=a_{j}^{L}-y_{j} \\
\frac{\partial C}{\partial w_{j k}^{L}} &=a_{k}^{L-1}\left(a_{j}^{L}-y_{j}\right)
\end{aligned}$$
which are similar to those we obtained for sigmoid and cross-entropy.
(Here $y_{j}$ is the one-hot encoding of the target: 1 for the correct class and 0 otherwise.)
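
The bias gradient can be checked numerically. Since $z_{j}^{L}$ depends on $b_{j}^{L}$ with unit slope, $\partial C / \partial b_{j}^{L}=\partial C / \partial z_{j}^{L}$, so the sketch below differentiates with respect to the weighted inputs directly. The input vector `z` and the `label` are assumed example values.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(z, label):
    """C = -ln a_y for the correct class index `label`."""
    return -math.log(softmax(z)[label])

def grad_bias(z, label):
    """Closed-form gradient a_j - y_j, with y one-hot at `label`."""
    a = softmax(z)
    return [a_j - (1.0 if j == label else 0.0) for j, a_j in enumerate(a)]

def numeric_grad(z, label, eps=1e-6):
    """Central finite-difference estimate of dC/dz_j for every j."""
    grads = []
    for j in range(len(z)):
        up = z[:j] + [z[j] + eps] + z[j + 1:]
        down = z[:j] + [z[j] - eps] + z[j + 1:]
        grads.append((log_likelihood(up, label) - log_likelihood(down, label)) / (2.0 * eps))
    return grads

z, label = [1.0, 2.0, 0.5], 1
analytic = grad_bias(z, label)
numeric = numeric_grad(z, label)
print(analytic)
print(numeric)
```

No derivative of the activation appears in the result, so a confidently wrong output still produces a large gradient.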

Please review [this post](https://peterroelants.github.io/posts/cross-entropy-softmax/) for derivations of the cross-entropy loss with a softmax output layer.

## Activation Functions: Comparison

There are many types of activation functions that can be used in the hidden layers of neural networks. The behaviour of each activation function has been (or can be) the topic of a research paper.

Thanks to research and experiments in recent years, it has become common to use the ReLU activation function in most cases. However, it is useful to know the common alternatives.

Below we look at some activation functions and state some well-known pros and cons ([source](https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/)).

For a deeper understanding of each, please go through this paper: [Activation Functions: Comparison of Trends in Practice and Research for Deep Learning](https://arxiv.org/pdf/1811.03378.pdf).

### Sigmoid (logistic)
$f(x)=\frac{1}{1+e^{-x}}$

* Pros:
    * Smooth gradient, preventing “jumps” in output values.
    * Output values bound between 0 and 1, normalizing the output of each neuron.
    * Clear predictions—For X above 2 or below -2, tends to bring the Y value (the prediction) to the edge of the curve, very close to 1 or 0. This enables clear predictions.
* Cons:
    * Vanishing gradient (neuron saturation): for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
    * Outputs not zero centered.
    * Computationally expensive

### TanH
$f(x)=\left(\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\right)$

* Pros:
    * Zero centered—making it easier to model inputs that have strongly negative, neutral, and strongly positive values.
    * Otherwise like the Sigmoid function.

* Cons:
    * Like the Sigmoid function
    
### ReLU
$f(x)=\max (0, x)=\left\{\begin{array}{ll}
x, & \text { if } x \geq 0 \\
0, & \text { if } x<0
\end{array}\right.$

* Pros:
    * Computationally efficient — allows the network to converge very quickly

* Cons:
    * The Dying ReLU problem: when a neuron's input is negative, the gradient of the function is zero, so no error signal is backpropagated through that neuron and it cannot learn.

**Below are some modifications of ReLU that try to solve the Dying ReLU problem, though they are not as convenient as ReLU itself.**

### Leaky ReLU
$f(x)=\left\{\begin{array}{ll}x, & \text { if } x>0 \\ \alpha x, & \text { if } x \leq 0\end{array}\right.$

### Parametric ReLU
$f\left(x_{i}\right)=\left\{\begin{array}{ll}x_{i}, & \text { if } x_{i}>0 \\ a_{i} x_{i}, & \text { if } x_{i} \leq 0\end{array}\right.$

### ELU 
$f(x)=\left\{\begin{array}{ll}x, & \text { if } x>0 \\ \alpha(\exp (x)-1), & \text { if } x \leq 0\end{array}\right.$

There is a nice paper for further reading - [Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)](https://arxiv.org/abs/1511.07289)
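
For reference, the activation functions in this section can be sketched as scalar Python functions. The default values of `alpha` are assumptions chosen for illustration. Parametric ReLU has the same form as Leaky ReLU, except that the slope is learned during training instead of being fixed. ELU is written as $\alpha(e^{x}-1)$ for $x \leq 0$, following the ELU paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Small fixed slope alpha for non-positive inputs (alpha is an assumed default)."""
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    """Smoothly approaches -alpha for very negative inputs (alpha is an assumed default)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), elu(x))
```

For negative inputs ReLU returns exactly zero, while Leaky ReLU and ELU keep a small non-zero response, which is how they try to avoid dead neurons.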

![overfit](../Images/overfit.jpeg)

## Regularizations

### L2 regularization

The idea of L2 regularization is **to add an extra term to the loss function**, a term called the regularization term. Here $C_0$ is the original, unregularized loss function.

$$
C=C_{0}+\frac{\lambda}{2 n} \sum_{w} w^{2}
$$

Regularized quadratic loss function:
$$
C=\frac{1}{2 n} \sum_{x}\left\|y-a^{L}\right\|^{2}+\frac{\lambda}{2 n} \sum_{w} w^{2}
$$

Regularized cross-entropy loss function:
$$C=-\frac{1}{n} \sum_{x j}\left[y_{j} \ln a_{j}^{L}+\left(1-y_{j}\right) \ln \left(1-a_{j}^{L}\right)\right]+\frac{\lambda}{2 n} \sum_{w} w^{2}$$

Intuitively, adding this term forces the network to prefer smaller weights. 


While using L2 regularization, the computation of the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ doesn't change much.

$$\begin{aligned}
\frac{\partial C}{\partial w}&=\frac{\partial C_{0}}{\partial w}+\frac{\lambda}{n} w \\
\frac{\partial C}{\partial b}&=\frac{\partial C_{0}}{\partial b}
\end{aligned}$$

For the bias parameters, the learning rule doesn't change:

$$
b \rightarrow b-\eta \frac{\partial C_{0}}{\partial b}
$$

For the weight parameters, it adds a rescaling factor $1-\frac{\eta \lambda}{n}$. This modification is usually called **weight decay**.

$$
\begin{aligned}
w & \rightarrow w-\eta \frac{\partial C_{0}}{\partial w}-\frac{\eta \lambda}{n} w \\
&=\left(1-\frac{\eta \lambda}{n}\right) w-\eta \frac{\partial C_{0}}{\partial w}
\end{aligned}
$$

Let's rewrite these formulas for SGD by averaging the gradients over the $m$ training examples in the mini-batch.

$$b \rightarrow b-\frac{\eta}{m} \sum_{x} \frac{\partial C_{x}}{\partial b}$$

$$w \rightarrow\left(1-\frac{\eta \lambda}{n}\right) w-\frac{\eta}{m} \sum_{x} \frac{\partial C_{x}}{\partial w}$$
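
The two update rules can be sketched for a single weight and bias. The function name `sgd_step_l2` and all numeric values are assumptions for illustration; the per-example gradients of the unregularized cost are passed in as plain lists.

```python
def sgd_step_l2(w, b, grads_w, grads_b, eta, lam, n):
    """One mini-batch SGD step with L2 regularization.

    grads_w, grads_b: per-example gradients of the unregularized cost C_x
    n: size of the full training set; the mini-batch size m is len(grads_w)
    """
    m = len(grads_w)
    decay = 1.0 - eta * lam / n                    # weight decay factor
    new_w = decay * w - (eta / m) * sum(grads_w)
    new_b = b - (eta / m) * sum(grads_b)           # the bias is not decayed
    return new_w, new_b

# Assumed example values.
new_w, new_b = sgd_step_l2(w=1.0, b=0.5, grads_w=[0.2, 0.4], grads_b=[0.1, 0.3],
                           eta=0.5, lam=5.0, n=1000)
print(new_w, new_b)

# With zero gradients the weight still shrinks, while the bias stays put.
decayed_w, same_b = sgd_step_l2(1.0, 0.5, [0.0, 0.0], [0.0, 0.0], eta=0.5, lam=5.0, n=1000)
print(decayed_w, same_b)
```

The second call isolates the regularization term: the weight is multiplied by the decay factor even when the data gradient is zero.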


**Why does regularization help against overfitting?**

* Regularization keeps the weights small. With small weights, the behaviour of the network won't change too much if we change a few random inputs here and there.
* That makes it difficult for a regularized network to learn the effects of local noise in the data. 
* It's an empirical fact that regularized neural networks usually generalize better than unregularized networks.


### L1 regularization

$$C=C_{0}+\frac{\lambda}{n} \sum_{w}|w|$$

Intuitively it's very much like the L2 regularization. 

Let's look at the partial derivatives:

$$\frac{\partial C}{\partial w}=\frac{\partial C_{0}}{\partial w}+\frac{\lambda}{n} \operatorname{sgn}(w)$$

And the update rule:

$$w \rightarrow w^{\prime}=w-\frac{\eta \lambda}{n} \operatorname{sgn}(w)-\eta \frac{\partial C_{0}}{\partial w}$$

In both L1 and L2 regularization cases the effect of regularization is to shrink the weights.
What is the main difference? 

In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to $w$.
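
The difference is easy to see when the data gradient is set to zero, so that only the regularization term acts. The helper names and the values of `eta`, `lam` and `n` below are assumptions, and $\operatorname{sgn}(0)$ is taken to be 0 by convention.

```python
def sign(w):
    """sgn(w), with sgn(0) = 0 by convention."""
    return (w > 0) - (w < 0)

def l1_shrink(w, eta, lam, n):
    """L1 update with zero data gradient: subtract a constant step toward 0."""
    return w - (eta * lam / n) * sign(w)

def l2_shrink(w, eta, lam, n):
    """L2 update with zero data gradient: multiply by a constant factor."""
    return (1.0 - eta * lam / n) * w

eta, lam, n = 0.1, 1.0, 10  # eta * lam / n = 0.01
for w in (5.0, 0.05):
    print(w, l1_shrink(w, eta, lam, n), l2_shrink(w, eta, lam, n))
```

For the large weight L2 shrinks more than L1, and for the small weight L1 shrinks more than L2. This is why L1 tends to drive unimportant weights all the way toward zero.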



## Dropout

This regularization technique differs dramatically from L1 and L2.

The original paper, [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf), was published in 2014. It brought major performance boosts. The paper is easy to read, so I highly recommend going through it.

The main idea behind dropout comes from combining models. Model combination almost always improves the performance of machine learning methods. But imagine how expensive it would be to train many deep neural networks. And the cost is not limited to the training phase: such super-heavy ensembles would be too slow to use in real-life applications.

Dropout comes to the rescue!

Here is a small illustration from the paper.
![dropot](../Images/dropout_nn.png)

> Dropout is a technique that addresses both these issues. It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently. The term "dropout" refers to dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections, as shown in Figure $1 .$ The choice of which units to drop is random. In the simplest case, each unit is retained with
a fixed probability $p$ independent of other units, where $p$ can be chosen using a validation set or can simply be set at $0.5,$ which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.

> Applying dropout to a neural network amounts to sampling a "thinned" network from
it. The thinned network consists of all the units that survived dropout (Figure $1 \mathrm{b}$ ). A neural net with $n$ units, can be seen as a collection of $2^{n}$ possible thinned neural networks. These networks all share weights so that the total number of parameters is still $O\left(n^{2}\right),$ or less. For each presentation of each training case, a new thinned network is sampled and trained. So training a neural network with dropout can be seen as training a collection of $2^{n}$ thinned networks with extensive weight sharing, where each thinned network gets trained very rarely, if at all.


**Model Description**

$$\begin{aligned}
&z_{i}^{(l+1)}=\mathbf{w}_{i}^{(l+1)} \mathbf{y}^{l}+b_{i}^{(l+1)}\\
&y_{i}^{(l+1)}=f\left(z_{i}^{(l+1)}\right)
\end{aligned}$$

$$\begin{aligned}
r_{j}^{(l)} & \sim \operatorname{Bernoulli}(p) \\
\widetilde{\mathbf{y}}^{(l)} &=\mathbf{r}^{(l)} * \mathbf{y}^{(l)} \\
z_{i}^{(l+1)} &=\mathbf{w}_{i}^{(l+1)} \widetilde{\mathbf{y}}^{l}+b_{i}^{(l+1)} \\
y_{i}^{(l+1)} &=f\left(z_{i}^{(l+1)}\right)
\end{aligned}$$

![dropout_](../Images/dropout.png)

> At test time, it is not feasible to explicitly average the predictions from exponentially many thinned models. However, a very simple approximate averaging method works well in practice. The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time as shown in Figure $2 .$ This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time. By doing this scaling, $2^{n}$ networks with shared weights can be combined into a single neural network to be used at test time. We found that training a network with dropout and using this approximate averaging method at test time leads to significantly lower generalization error on a wide variety of classification problems compared to training with other regularization methods.

![dropout_](../Images/dropout_test.png)

## Expanding training dataset 

It is no secret that using more data generally leads to better generalization. This statement is true not only for neural networks. Bigger datasets are usually more representative of the underlying real population. (We set aside the case of big but noisy datasets for now.)

Let's look at some experiments from Michael Nielsen's book.
> Let's try training our 30 hidden neuron network with a variety of different training data set sizes, to see how performance varies. We train using a mini-batch size of 10, a learning rate η=0.5, a regularization parameter λ=5.0, and the cross-entropy cost function. We will train for 30 epochs when the full training data set is used, and scale up the number of epochs proportionally when smaller training sets are used. To ensure the weight decay factor remains the same across training sets, we will use a regularization parameter of λ=5.0 when the full training data set is used, and scale down λ proportionally when smaller training sets are used.

![datasize](../Images/datasize.png)

![nnvssvm](../Images/nn_vs_svm.png)



But data is expensive. Usually you don't have as much data at hand as you would like. What if we create some?

A common technique for enlarging a dataset is called data augmentation. We will return to it when we cover Convolutional Neural Networks.


## Weight Initialization

It turns out that weight initialization plays a huge role in the performance of neural networks.

Until now, we have been choosing both the weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1.

Let's look at the example from Nielsen's book, which applies this normalized Gaussian initialization to a hidden neuron with 1,000 inputs.

> We'll suppose for simplicity that we're trying to train using a training input
$x$ in which half the input neurons are on, i.e., set to $1,$ and half the input
neurons are off, i.e., set to 0. The argument which follows applies more generally, but you'll get the gist from this special case. Let's consider the weighted sum $z=\sum_{j} w_{j} x_{j}+b$ of inputs to our hidden neuron. 500 terms in this sum vanish, because the corresponding input $x_{j}$ is zero. And so $z$ is a sum over a total of 501 normalized Gaussian random variables,
accounting for the 500 weight terms and the 1 extra bias term. Thus $z$ is
itself distributed as a Gaussian with mean zero and standard deviation $\sqrt{501} \approx 22.4 .$ That is, $z$ has a very broad Gaussian distribution, not
sharply peaked at all:

![init](../Images/init.png)

> In particular, we can see from this graph that it's quite likely that $|z|$ will be pretty large, i.e., either $z \gg 1$ or $z \ll-1 .$ If that's the case then the output $\sigma(z)$ from the hidden neuron will be very close to either 1 or $0 .$
That means our hidden neuron will have saturated.


**Alternative approach!**

> Suppose we have a neuron with $n_{\text {in }}$ input weights. Then we
shall initialize those weights as Gaussian random variables with mean 0
and standard deviation $1 / \sqrt{n_{\text {in }}} .$ That is, we'll squash the Gaussians down, making it less likely that our neuron will saturate. We'll continue to choose
the bias as a Gaussian with mean 0 and standard deviation $1,$ for reasons
I'll return to in a moment. With these choices, the weighted sum $z=\sum_{j} w_{j} x_{j}+b$ will again be a Gaussian random variable with mean 0 but it'll be much more sharply peaked than it was before. Suppose, as we
did earlier, that 500 of the inputs are zero and 500 are $1 .$ Then it's easy to
show (see the exercise below) that $z$ has a Gaussian distribution with mean 0 and standard deviation $\sqrt{3 / 2}=1.22 \ldots . .$ This is much more
sharply peaked than before, so much so that even the graph below
understates the situation, since I've had to rescale the vertical axis, when
compared to the earlier graph:

![init2](../Images/init2.png)

We can continue initializing the biases as before because they don't contribute much to the neuron saturation effect.
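
Both standard deviations quoted above follow from adding the variances of independent Gaussians. The sketch below computes them in closed form and also draws samples of $z$ under the $1 / \sqrt{n_{\text{in}}}$ scheme as an empirical check. The helper names, seed and sample count are assumptions.

```python
import math
import random
import statistics

def z_std(n_active, weight_std, bias_std=1.0):
    """Std of z = sum_j w_j x_j + b when n_active inputs are 1 and the rest are 0."""
    return math.sqrt(n_active * weight_std ** 2 + bias_std ** 2)

def sample_z(n_active, weight_std, rng, bias_std=1.0):
    """Draw one z by summing the weights of the active inputs plus the bias."""
    total = rng.gauss(0.0, bias_std)
    for _ in range(n_active):
        total += rng.gauss(0.0, weight_std)
    return total

n_in, n_active = 1000, 500
std_old = z_std(n_active, 1.0)                    # sqrt(501)
std_new = z_std(n_active, 1.0 / math.sqrt(n_in))  # sqrt(500/1000 + 1) = sqrt(3/2)

rng = random.Random(0)
samples = [sample_z(n_active, 1.0 / math.sqrt(n_in), rng) for _ in range(2000)]
empirical_new = statistics.pstdev(samples)

print(round(std_old, 1), round(std_new, 2), empirical_new)
```

The closed-form values reproduce the 22.4 and 1.22 from the quoted passages, and the sampled standard deviation should land near the second one.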


There are many other improved techniques for weight initialization. We will come back to some of them. Here is a [paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) to look through for now.


A good paper to read: [Practical Recommendations for Gradient-Based Training of Deep Architectures](https://arxiv.org/pdf/1206.5533v2.pdf) by Yoshua Bengio