# Softmax

Prove that softmax is invariant to a constant offset $c$ added to every component of the input:

$softmax(x) = softmax(x + c)$

$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$

*Proof.* For every $i$ with $1 \le i \le \dim(x)$:

$softmax(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}$

$=\frac{e^{x_i} e^c}{\sum_j e^{x_j} e^c}$

$=\frac{e^c e^{x_i}}{e^c \sum_j e^{x_j}}$

$=\frac{e^{x_i}}{\sum_j e^{x_j}}$

$=softmax(x)_i$
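The invariance can be checked numerically. The sketch below is a minimal pure-Python illustration, not part of the original notes; the helper name `softmax` and the test values are assumptions. It also uses the invariance itself, by subtracting $\max(x)$ before exponentiating, which is a common way to avoid overflow.

```python
import math

def softmax(x):
    """Softmax of a list of floats, shifted by max(x) for numerical stability."""
    m = max(x)  # a constant offset, so by the proof above the result is unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]   # arbitrary example input
c = 100.0             # arbitrary constant offset
shifted = [v + c for v in x]

p = softmax(x)
q = softmax(shifted)

# Largest componentwise difference between softmax(x) and softmax(x + c).
max_diff = max(abs(a - b) for a, b in zip(p, q))
print(max_diff)
```

Without the max shift, an input such as `[1000.0, 1001.0]` would overflow in `math.exp`; with it, the call returns a valid distribution.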

# Derive Sigmoid

$\sigma(x) = \frac{1}{1 + e^{-x}}$

$z = 1 + e^{-x}$

$\frac{d}{dx} \sigma(x) = \frac{d}{dz} \left( \frac{1}{z} \right) \cdot \frac{dz}{dx}$

$=-\frac{1}{z^2} \cdot \frac{dz}{dx}$

$=-\frac{1}{z^2} \cdot \frac{d}{dx} \left( 1 + e^{-x} \right)$

$=-\frac{1}{z^2} \cdot \left( - e^{-x} \right)$

$=\frac{e^{-x}}{(1 + e^{-x})^2}$

$=\frac{1}{1 + e^{-x}} \frac{e^{-x}}{1 + e^{-x}}$

$=\sigma(x) \frac{e^{-x}}{1 + e^{-x}}$

$=\sigma(x) \frac{1 + e^{-x} - 1}{1 + e^{-x}}$

$=\sigma(x) \left(\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} \right)$

$=\sigma(x) \left(1 - \frac{1}{1 + e^{-x}} \right)$

$=\sigma(x) (1 - \sigma(x))$

$=\sigma(x) \sigma(-x)$
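As a sanity check on the result, the closed form $\sigma(x)(1 - \sigma(x))$ can be compared against a central finite difference. This is a minimal sketch; the function names, the step size `eps`, and the sample points are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

points = [-3.0, -0.5, 0.0, 0.5, 3.0]  # arbitrary sample points
errors = [abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) for x in points]
print(max(errors))
```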

# Derive Cross-Entropy of Softmax

$CE(y, \hat y) = - \sum_i y_i \log(\hat y_i)$

$ \hat y_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $

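$\frac{\partial}{\partial x_k} CE(y, \hat y) = 
- \frac{\partial}{\partial x_k} \sum_i y_i \log(\hat y_i)$

Assume $y$ is one-hot, with $y_i = 1$ for the true class $i$ and $0$ elsewhere, so the sum collapses to a single term:

$= - \frac{\partial}{\partial x_k} \log(\hat y_i)$

$= - \frac{\partial}{\partial x_k} \log \frac{e^{x_i}}{\sum_j e^{x_j}} $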

$= - ( \frac{\partial}{\partial x_k} \log e^{x_i} - \frac{\partial}{\partial x_k} \log \sum_j e^{x_j} ) $

$= - ( \frac{\partial}{\partial x_k} x_i - \frac{\partial}{\partial x_k} \log \sum_j e^{x_j} ) $

$= - ( \frac{\partial}{\partial x_k} x_i - \frac{1}{\sum_j e^{x_j}} * \frac{\partial}{\partial x_k} \sum_j e^{x_j} ) $

$= - ( \frac{\partial}{\partial x_k} x_i - \frac{1}{\sum_j e^{x_j}} * \frac{\partial}{\partial x_k} e^{x_k} ) $

$= - ( \frac{\partial}{\partial x_k} x_i - \frac{e^{x_k}}{\sum_j e^{x_j}}) $

$= - ( \frac{\partial}{\partial x_k} x_i - \hat y_k) $

$= \hat y_k - \frac{\partial}{\partial x_k} x_i $

$$
\begin{equation}
    \frac{\partial}{\partial x_k} CE(y, \hat y) =
    \begin{cases}
    \hat y_k - 1 &\text{if } i = k \\
    \hat y_k - 0 &\text{if } i \ne k
    \end{cases}
\end{equation}
$$

OR

$\frac{\partial}{\partial x} CE(y, \hat y) = \hat y - y$
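The vector form $\hat y - y$ can be verified with a finite-difference gradient check. The sketch below is illustrative only; the helper names, the logits `x`, and the one-hot label `y` are assumptions.

```python
import math

def softmax(x):
    m = max(x)  # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, x):
    """CE between target y and softmax(x), where x holds the logits."""
    y_hat = softmax(x)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))

def analytic_grad(y, x):
    """Gradient of CE with respect to the logits: y_hat - y."""
    return [pi - yi for pi, yi in zip(softmax(x), y)]

def numeric_grad(y, x, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += eps
        down = list(x)
        down[k] -= eps
        grad.append((cross_entropy(y, up) - cross_entropy(y, down)) / (2.0 * eps))
    return grad

x = [0.2, -1.3, 0.7]   # arbitrary logits
y = [0.0, 0.0, 1.0]    # one-hot target, true class is index 2

g_analytic = analytic_grad(y, x)
g_numeric = numeric_grad(y, x)
max_err = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(max_err)
```

Because both $\hat y$ and $y$ sum to 1, the components of the gradient sum to zero, which is another quick check.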

# Derive gradients wrt input to 3-layer NN

$J = CE(y, \hat y) = - \sum_i y_i \log \hat y_i$

$z_1 = xW_1 + b_1$

$h = \sigma(z_1)$

$z_2 = hW_2 + b_2$

$\hat y = softmax(z_2)$

$\frac {\partial}{\partial x} CE(y, \hat y)$

$=\frac {\partial CE}{\partial z_2} \cdot \frac {\partial z_2}{\partial h} \cdot \frac {\partial h}{\partial z_1} \cdot \frac {\partial z_1} {\partial x}$

Treating $x$, $h$, and $\hat y$ as row vectors and writing $\odot$ for the elementwise product:

$=\left( (\hat y - y) W_2^T \odot \sigma(z_1)(1 - \sigma(z_1)) \right) W_1^T$

$=\left( (\hat y - y) W_2^T \odot \sigma(xW_1 + b_1)(1 - \sigma(xW_1 + b_1)) \right) W_1^T$

OR

$\delta_1 = \frac {\partial CE}{\partial z_2} = \hat y - y$

$\delta_2 = \frac {\partial CE}{\partial h} = \delta_1 \frac {\partial z_2}{\partial h} = \delta_1 W_2^T$

$\delta_3 = \frac {\partial CE}{\partial z_1} = \delta_2 \frac {\partial h}{\partial z_1} = \delta_2 \odot \sigma\prime(z_1)$

$\frac {\partial CE}{\partial x} = \delta_3 W_1^T$
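The backward pass can be sketched in pure Python and checked against finite differences. This is a minimal illustration under stated assumptions: $x$ is a row vector, $W_1$ has shape $D_x \times H$, $W_2$ has shape $H \times D_y$, the gradient at the output is taken as $\hat y - y$, and all helper names and toy values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(z):
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(x, W, b):
    """Row vector x (length n) times W (n x m), plus bias b (length m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def times_transpose(d, W):
    """Row vector d (length m) times W^T, for W of shape n x m."""
    return [sum(d[j] * W[i][j] for j in range(len(d))) for i in range(len(W))]

def loss(x, y, W1, b1, W2, b2):
    h = [sigmoid(v) for v in affine(x, W1, b1)]
    y_hat = softmax(affine(h, W2, b2))
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

def grad_input(x, y, W1, b1, W2, b2):
    # Forward pass.
    h = [sigmoid(v) for v in affine(x, W1, b1)]
    y_hat = softmax(affine(h, W2, b2))
    # Backward pass.
    delta1 = [p - t for p, t in zip(y_hat, y)]               # dCE/dz2
    delta2 = times_transpose(delta1, W2)                     # dCE/dh
    delta3 = [d * s * (1.0 - s) for d, s in zip(delta2, h)]  # dCE/dz1
    return times_transpose(delta3, W1)                       # dCE/dx

# Arbitrary toy network: D_x = 2, H = 3, D_y = 2.
W1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.8], [0.9, 1.0], [-1.1, 1.2]]
b2 = [0.05, -0.05]
x = [0.5, -1.5]
y = [1.0, 0.0]

analytic = grad_input(x, y, W1, b1, W2, b2)

# Central finite differences on each input component.
eps = 1e-6
numeric = []
for k in range(len(x)):
    up = list(x)
    up[k] += eps
    down = list(x)
    down[k] -= eps
    numeric.append((loss(up, y, W1, b1, W2, b2)
                    - loss(down, y, W1, b1, W2, b2)) / (2.0 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_err)
```

The sigmoid derivative is computed from the cached activation `h`, since $\sigma\prime(z_1) = h \odot (1 - h)$, so $z_1$ does not need to be stored.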

## Number of parameters
If input is $D_x$-dimensional, output is $D_y$-dimensional, and there are $H$ hidden units, how many parameters?

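There are $(D_x + 1) \cdot H + (H + 1) \cdot D_y$ parameters: $D_x \cdot H$ weights plus $H$ biases in the first layer, and $H \cdot D_y$ weights plus $D_y$ biases in the second.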
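The count can be spelled out term by term. The function name and the example sizes below are assumptions chosen for illustration.

```python
def count_parameters(d_x, hidden, d_y):
    """Parameters of a network with one hidden layer and biases on both layers."""
    w1 = d_x * hidden   # first-layer weights, shape D_x x H
    b1 = hidden         # first-layer biases
    w2 = hidden * d_y   # second-layer weights, shape H x D_y
    b2 = d_y            # second-layer biases
    return w1 + b1 + w2 + b2

# Example sizes: (10 + 1) * 5 + (5 + 1) * 10 = 55 + 60 = 115
n = count_parameters(10, 5, 10)
print(n)
```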