# Activation Functions

## Softmax

Softmax is an Activation Function. It generalizes Sigmoid from two classes to $n$ classes; for $n = 2$, Softmax reduces to Sigmoid.
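
The relation between the two functions can be checked numerically. The sketch below assumes the usual definition $\mathrm{softmax}(z)_i = e^{z_i} / \sum_k e^{z_k}$ (not spelled out in these notes) and feeds it the two logits $(x, 0)$; the helper names `softmax` and `two_class` are illustrative only.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis (assumed standard definition)."""
    z = np.asarray(z, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = 0.8
two_class = softmax([x, 0.0])           # two classes with logits (x, 0)
three_class = softmax([1.0, 2.0, 3.0])  # more classes: outputs still sum to 1

print(two_class[0], sigmoid(x))         # both values agree
print(three_class.sum())
```

With the second logit fixed at zero, the first Softmax output is $e^x / (e^x + 1)$, which is the Sigmoid formula in a different form.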

<div class="alert alert-block alert-warning">TODO: https://www.quora.com/Why-is-it-better-to-use-Softmax-function-than-sigmoid-function</div>

## Sigmoid

Sigmoid is an Activation Function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The **derivative** of Sigmoid is:

$$\begin{align*}
\sigma'(x) = \frac{\mathrm{d}}{\mathrm{d}x} \sigma(x) &= \frac{\mathrm{d}}{\mathrm{d}x} \big(1 + e^{-x} \big)^{-1}\\ \\
&= -(1 + e^{-x})^{-2} \; (-e^{-x})\\ \\
&= \frac{e^{-x}}{(1+e^{-x})^2}\\ \\
&= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}\\ \\
&= \frac{1}{1+e^{-x}} \cdot \frac{1 + e^{-x} - 1}{1+e^{-x}}\\ \\
&= \frac{1}{1+e^{-x}} \cdot \bigg( \frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}} \bigg)\\ \\
&= \frac{1}{1+e^{-x}} \cdot \bigg( 1 - \frac{1}{1+e^{-x}} \bigg)\\ \\
&= \sigma(x) \cdot \big( 1 - \sigma(x) \big)\\
\end{align*}$$ 
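
As a sanity check, the closed form $\sigma(x)\,(1 - \sigma(x))$ can be compared against a central finite difference. This is only a sketch; the grid `xs` and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    # Closed form from the derivation: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

xs = np.linspace(-5, 5, 11)   # sample points (arbitrary)
eps = 1e-6                    # finite-difference step (arbitrary)

# Central difference approximation of the derivative
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(xs)))

print(max_abs_diff)           # tiny, limited by floating-point precision
```

The derivative peaks at $x = 0$, where $\sigma(0) = 0.5$ gives $\sigma'(0) = 0.25$.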

# Loss Functions

The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the **error**. 

## Cross Entropy Loss

Cross Entropy is used in ML as a Loss Function. Multiclass Cross Entropy is a generalization of Two Class Cross Entropy.
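
The generalization can be illustrated with a small sketch. The two formulas below are the commonly used definitions and are an assumption here, since this section does not state them yet: $-\big(y \ln p + (1 - y) \ln(1 - p)\big)$ for two classes and $-\sum_k y_k \ln p_k$ for a one-hot label. The function names are illustrative.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Two-class form: y is 0 or 1, p is the predicted probability of class 1."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs):
    """Multiclass form: y_onehot is a one-hot label, probs sums to 1."""
    return -np.sum(y_onehot * np.log(probs))

p = 0.7                                   # predicted probability of class 1
bce = binary_cross_entropy(1, p)

# Same situation written as a two-class problem, ordered (class 0, class 1)
cce = categorical_cross_entropy(np.array([0.0, 1.0]), np.array([1 - p, p]))

print(bce, cce)                           # the two losses agree
```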

<div class="alert alert-block alert-warning">TODO</div>

## Sum of Squared Errors (SSE)

A common metric is the sum of the squared errors (**SSE**). It is a good choice for a few reasons. Like the absolute error $E=|y-\hat{y}|$, the squared **error is never negative**. In addition, **larger errors are penalized** disproportionately more than smaller errors, and the square is differentiable everywhere, which makes the **math nice**.

$$E = \frac{1}{2} \sum_\mu \sum_j \big[y_j^{\,\mu} - \hat{y}_j^{\,\mu}\big]^2$$

with the **prediction** 

$$\hat{y}_j^{\,\mu} = f \big(\sum_i w_{ij} \, x_i^{\,\mu} \big) = f(h_{\mu j})$$

and the **real value** $y$. The index $j$ runs over the **output units** and the index $\mu$ over the **data records**. Also, let $h_{\mu j}$ be the **linear combination** of the weights and the inputs.

Our goal is to find weights $w_{ij}$ that minimize the squared error $E$. To do this with a neural network we typically use **gradient descent**.
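
A minimal sketch of the SSE formula, assuming the targets and predictions are stored as arrays with one row per data record $\mu$ and one column per output unit $j$. The numbers are made up for illustration.

```python
import numpy as np

def sse(y, y_hat):
    """E = 1/2 * sum over records and output units of (y - y_hat)^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 0.5 * np.sum((y - y_hat) ** 2)

# Two data records (rows), two output units (columns); illustrative values
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([[0.8, 0.1],
                  [0.3, 0.6]])

E = sse(y, y_hat)
print(E)   # 0.5 * (0.04 + 0.01 + 0.09 + 0.16) = 0.15
```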

## Mean Squared Error (MSE)

The mean squared error is the SSE divided by the degrees of freedom of the model's errors.
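
A small sketch of this definition. Which divisor to use is an assumption: the hypothetical `dof` parameter defaults to the number of error terms, which is the usual choice in ML, while a statistical treatment would pass the residual degrees of freedom instead. The factor $\frac{1}{2}$ from the SSE above is left out here.

```python
import numpy as np

def mse(y, y_hat, dof=None):
    """Sum of squared errors divided by `dof` (hypothetical parameter).

    If `dof` is None, the number of error terms is used (assumption).
    """
    errors = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    if dof is None:
        dof = errors.size
    return np.sum(errors ** 2) / dof

y = np.array([1.0, 0.0, 0.0, 1.0])       # illustrative targets
y_hat = np.array([0.8, 0.1, 0.3, 0.6])   # illustrative predictions

print(mse(y, y_hat))          # 0.30 / 4 = 0.075
print(mse(y, y_hat, dof=3))   # 0.30 / 3 = 0.1
```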

<div class="alert alert-block alert-warning">TODO</div>

# Gradient Descent

Weights are updated by **subtracting** the **gradient** $\frac{\partial E}{\partial w_i}$ multiplied by a **learning rate** $\eta$. The resulting change is the **weight step** $\Delta w_i$:

$$w_i  = w_i + \Delta w_i = w_i - \eta \, \frac{\partial E}{\partial w_i}$$
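
The update rule can be tried on a toy problem. The sketch below uses a hypothetical one-dimensional error function $E(w) = (w - 3)^2$, chosen only because its minimum at $w = 3$ is known; the learning rate and the number of steps are arbitrary.

```python
def grad_E(w):
    # Gradient of the toy error function E(w) = (w - 3)^2
    return 2 * (w - 3)

eta = 0.1        # learning rate (arbitrary)
w = 0.0          # initial weight (arbitrary)

for step in range(50):
    delta_w = -eta * grad_E(w)   # weight step: negative gradient times learning rate
    w = w + delta_w

print(w)         # close to the minimum at w = 3
```

Each step shrinks the distance to the minimum by a factor of $1 - 2\eta = 0.8$, so the iteration converges for this choice of $\eta$.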

## Gradient Descent of SSE

### One Output Unit

Suppose we have only **one data record** and **one output unit**. Then the SSE is

$$E = \frac{1}{2}(y - \hat{y})^2$$

with the **error** $(y - \hat{y})$ and

$$\begin{align*}
\hat{y} &= f(h) = \sigma(h)\\
h &= \sum_i w_i x_i
\end{align*}$$

The gradient of $E$ w.r.t. $w_i$ is the negative of the **error** times the **derivative of the activation function** at $h$ times the **input value** $x_i$:

$$\begin{align*}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} (y - \hat{y})^2\\
&=(y - \hat{y}) \cdot \frac{\partial}{\partial w_i} (y - \hat{y})\\
&=-(y - \hat{y}) \cdot \frac{\partial}{\partial w_i} \hat{y}\\
&=-(y - \hat{y}) \cdot f'(h) \cdot \frac{\partial}{\partial w_i}\sum_k w_k x_k\\
&=-(y - \hat{y}) \cdot f'(h) \cdot x_i\\
\end{align*}$$
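
The result can be verified with a numerical gradient check. This sketch assumes a sigmoid activation and reuses the input, target and initial weights from the Python example further down; the step size `eps` is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error(w, x, y):
    # E = 1/2 * (y - y_hat)^2 for one record and one output unit
    return 0.5 * (y - sigmoid(np.dot(w, x))) ** 2

def analytic_grad(w, x, y):
    # dE/dw_i = -(y - y_hat) * f'(h) * x_i, with f'(h) = y_hat * (1 - y_hat)
    y_hat = sigmoid(np.dot(w, x))
    return -(y - y_hat) * y_hat * (1 - y_hat) * x

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.5
w = np.array([0.5, -0.5, 0.3, 0.1])

eps = 1e-6
numeric_grad = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    # Central difference in the direction of weight i
    numeric_grad[i] = (error(w + step, x, y) - error(w - step, x, y)) / (2 * eps)

print(numeric_grad)
print(analytic_grad(w, x, y))   # agrees with the numerical estimate
```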

Then the **weight step** $\Delta w_i$ is

$$\begin{align*}
\Delta w_i &= - \eta \, \frac{\partial E}{\partial w_i}\\
&= \eta \cdot (y - \hat{y}) \cdot f'(h) \cdot x_i
\end{align*}$$

For convenience we define an **error term** $\delta$ as

$$\delta = (y - \hat{y}) \cdot f'(h)$$

so we can write the **weight update** as

$$w_i = w_i + \eta \, \delta \, x_i$$

### Multiple Output Units

For multiple output units we have multiple error terms, one per output unit $j$:

$$\delta_j = (y_j - \hat{y}_j) \cdot f'(h_j)$$

$$w_{ij} = w_{ij} + \Delta w_{ij} = w_{ij} + \eta \, \delta_j \, x_i$$
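
In NumPy, all weights $w_{ij}$ can be updated at once with an outer product. The sketch below assumes a sigmoid activation and a weight matrix `W` with `W[i, j]` $= w_{ij}$; the inputs, targets and weights are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

learnrate = 0.5
x = np.array([1.0, 2.0, 3.0])            # inputs x_i (illustrative)
y = np.array([0.2, 0.9])                 # targets y_j (illustrative)
W = np.array([[0.1, -0.2],
              [0.4,  0.0],
              [-0.3, 0.2]])              # W[i, j] = w_ij (illustrative)

h = x @ W                                # h_j = sum_i w_ij * x_i
y_hat = sigmoid(h)
delta = (y - y_hat) * y_hat * (1 - y_hat)   # delta_j = (y_j - y_hat_j) * f'(h_j)

delta_W = learnrate * np.outer(x, delta)    # delta_W[i, j] = eta * delta_j * x_i
W_new = W + delta_W

sse_before = 0.5 * np.sum((y - y_hat) ** 2)
sse_after = 0.5 * np.sum((y - sigmoid(x @ W_new)) ** 2)
print(sse_before, sse_after)             # the error shrinks after one step
```

The outer product builds the full matrix of weight steps in one call, so no explicit loop over $i$ and $j$ is needed.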


### Python Example

With one data record and four input values:

```python
import numpy as np
import pandas as pd

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])                   # Input values
y = np.array(0.5)                            # Target value
w = np.array([0.5, -0.5, 0.3, 0.1])          # Initial weights

data = []
for i in range(5):
    h = np.dot(x, w)                         # Linear combination of weights and inputs
    y_hat = sigmoid(h)                       # Prediction
    error = y - y_hat
    data.append((w[0], w[1], w[2], w[3], error))  # Weights and error before the update

    error_term = error * sigmoid_prime(h)    # delta = (y - y_hat) * f'(h)
    delta_w = learnrate * error_term * x     # Weight step
    w = w + delta_w

pd.DataFrame(data, columns=['w1', 'w2', 'w3', 'w4', 'error']).round(4)
```

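| step | w1 | w2 | w3 | w4 | error |
|---|---|---|---|---|---|
| 0 | 0.5 | -0.5 | 0.3 | 0.1 | -0.19 |
| 1 | 0.4797 | -0.5406 | 0.239 | 0.0187 | -0.0475 |
| 2 | 0.4738 | -0.5524 | 0.2214 | -0.0048 | -0.0035 |
| 3 | 0.4734 | -0.5533 | 0.2201 | -0.0065 | -0.0002 |
| 4 | 0.4733 | -0.5533 | 0.22 | -0.0067 | -0.0 |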


## Momentum

[Momentum](https://distill.pub/2017/momentum/) is a method that helps gradient descent **avoid getting stuck in local minima**, where it would otherwise miss the lowest possible minimum.
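
A minimal sketch of a momentum update, assuming the classical (heavy-ball) form in which a running sum of past gradients is kept and the weight moves along that sum. The toy error function $E(w) = (w - 3)^2$, the momentum coefficient `beta` and the step size `alpha` are illustrative choices. The example only shows the update rule; this simple bowl has no local minima to escape from.

```python
def grad_E(w):
    # Gradient of the toy error function E(w) = (w - 3)^2
    return 2 * (w - 3)

alpha = 0.1      # step size (arbitrary)
beta = 0.9       # momentum coefficient (arbitrary); beta = 0 gives plain gradient descent
w = 0.0          # initial weight
z = 0.0          # accumulated gradient ("velocity")

for step in range(300):
    z = beta * z + grad_E(w)   # decay the old velocity and add the current gradient
    w = w - alpha * z          # move along the accumulated direction

print(w)         # close to the minimum at w = 3
```

Because the velocity `z` carries over between steps, the weight keeps moving for a while even where the current gradient is small, which is what lets momentum roll through shallow dips.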