# softmax-function
The softmax function is a vector-valued function, defined as follows:

$$
\begin{align*}
\mathrm{softmax}(\mathbf{x}) & = 
\begin{bmatrix}
\frac{e^{x_1}}{\sum_{i=1}^{n} e^{x_i}}\\
\frac{e^{x_2}}{\sum_{i=1}^{n} e^{x_i}}\\ 
\vdots\\
\frac{e^{x_n}}{\sum_{i=1}^{n} e^{x_i}}
\end{bmatrix} \\
\end{align*}
$$

where,
$$
\begin{align*}
n & \text{ is the dimensionality of the input-output vector space} \\
\mathbf{x} & \text{ is the $(n,1)$-dimensional vector }[x_1, x_2, \dots, x_n]^\intercal
\end{align*}
$$
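
The definition above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the list-of-floats representation are choices made here, and subtracting the maximum before exponentiating is an added numerical-stability step that is not part of the definition (it cancels in the ratio, so the result is unchanged).

```python
import math

def softmax(x):
    """Return the softmax of a list of floats as a list of the same length."""
    # Assumption: shift by the maximum to avoid overflow in math.exp.
    # The shift cancels between numerator and denominator.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    # Each output is e^{x_i} divided by the sum of all e^{x_k}.
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # three positive values, largest for the largest input
print(sum(probs))  # the components sum to 1 up to rounding
```

The outputs are positive and sum to 1, which is why the result is often read as a probability distribution over the $n$ components.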

The softmax function is also called the _soft-arg-max_ function. This is to emphasize its relation to a modified version of the conventional _arg-max_ function, which can be defined as follows:

$$
\begin{align*}
\mathrm{arg\,max}_{mod}(\mathbf{x}) & = \mathrm{one\,hot}(\mathrm{arg\,max}(\mathbf{x}), \mathrm{dim}(\mathbf{x})) \\
& = [0, \dots, 0, 1, 0, \dots, 0]^\intercal \\
\end{align*}
$$

where,
$$
\begin{align*}
\mathrm{arg\,max}(\mathbf{y}) & \text{ is the conventional arg-max function, which returns the index of the largest element in the input vector $\mathbf{y}$} \\
\mathrm{one\,hot}(k, n) & \text{ returns an $n$-dimensional one-hot vector, with the $k^{th}$ element equal to 1} \\
\mathrm{dim}(\mathbf{x}) & \text{ is the dimensionality $n$ of the vector $\mathbf{x}$}
\end{align*}
$$
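
The relation between the two functions can be illustrated with a small sketch. The names `argmax_mod` and `softmax` are hypothetical helpers defined here, indices are zero-based (unlike the one-based notation above), ties are broken in favour of the first maximum, and the scale factor `50.0` is an arbitrary choice for illustration. Under these assumptions, the softmax of a strongly scaled input is numerically close to the one-hot vector returned by the modified arg-max.

```python
import math

def argmax_mod(x):
    """One-hot vector marking the position of the largest element of x."""
    # Assumption: on ties, the first maximal index is chosen.
    index = max(range(len(x)), key=lambda i: x[i])
    return [1.0 if i == index else 0.0 for i in range(len(x))]

def softmax(x):
    """Softmax of a list of floats (shifted by the maximum for stability)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 3.0, 2.0]
hard = argmax_mod(x)                    # one-hot at the largest element
soft = softmax(x)                       # smooth: every entry is nonzero
sharp = softmax([50.0 * v for v in x])  # scaled input: close to one-hot

print(hard)
print(soft)
print(sharp)
```

Here `soft` spreads weight over all three entries, while `sharp` concentrates almost all of it on the position that `hard` marks with a 1.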

## derivative
Let $\mathbf{x}=\mathrm{softmax}(\mathbf{z})$, where $\mathbf{x} = [x_1, x_2, \dots, x_n]^\intercal$, and $\mathbf{z} = [z_1, z_2, \dots, z_n]^\intercal$. Then, we have

$$
\begin{align*}
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}\mathbf{z}} & = \begin{bmatrix}
\frac{\partial}{\partial z_1} & \frac{\partial}{\partial z_2} & \dots & \frac{\partial}{\partial z_n} \\
\end{bmatrix} \otimes \begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{bmatrix} \\
& = \begin{bmatrix}
\frac{\partial x_1}{\partial z_1} & \frac{\partial x_1}{\partial z_2} & \dots & \frac{\partial x_1}{\partial z_n} \\
\frac{\partial x_2}{\partial z_1} & \frac{\partial x_2}{\partial z_2} & \dots & \frac{\partial x_2}{\partial z_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_n}{\partial z_1} & \frac{\partial x_n}{\partial z_2} & \dots & \frac{\partial x_n}{\partial z_n} \\
\end{bmatrix} & \text{eq.1}\\
\end{align*}
$$

Now, we have
$$
\begin{align*}
\frac{\partial x_i}{\partial z_j} & = \frac{\partial}{\partial z_j}\biggl(\frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \biggr) \\
& = \begin{cases}
\frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \biggl(1 - \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \biggr), & \text{if $i = j$} \\
- \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \times \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}}, & \text{if $i \ne j$}
\end{cases} \\
& = \begin{cases}
x_i (1 - x_i), & \text{if $i = j$} \\
- x_i x_j, & \text{if $i \ne j$} \\
\end{cases} & \text{eq.2}\\
\end{align*}
$$
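
The case split in _eq.2_ follows from the quotient rule. As a sketch of that intermediate step, write $S = \sum_{k=1}^{n} e^{z_k}$ and let $\delta_{ij}$ be the Kronecker delta (both are shorthands introduced here, not notation used elsewhere in this document). Since $\frac{\partial e^{z_i}}{\partial z_j} = \delta_{ij} e^{z_i}$ and $\frac{\partial S}{\partial z_j} = e^{z_j}$, we get

$$
\begin{align*}
\frac{\partial}{\partial z_j}\biggl(\frac{e^{z_i}}{S}\biggr) & = \frac{\delta_{ij}\, e^{z_i}\, S - e^{z_i}\, e^{z_j}}{S^2} \\
& = \frac{e^{z_i}}{S}\biggl(\delta_{ij} - \frac{e^{z_j}}{S}\biggr) \\
& = x_i (\delta_{ij} - x_j)
\end{align*}
$$

Setting $i = j$ gives $x_i (1 - x_i)$, and $i \ne j$ gives $- x_i x_j$.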

From _eq.1_ and _eq.2_ above, we have

$$
\begin{align*}
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}\mathbf{z}} & = \begin{bmatrix}
x_1 (1-x_1) & - x_1 x_2 & \dots & -x_1 x_n \\
-x_2 x_1 & x_2(1-x_2) & \dots & -x_2 x_n \\
\vdots & \vdots & \ddots & \vdots \\
-x_n x_1 & - x_n x_2 & \dots & x_n (1-x_n) \\
\end{bmatrix} \\
& = \mathrm{diag}(\mathbf{x}) - \mathbf{x}\mathbf{x}^\intercal
\end{align*}
$$
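
The closed form $\mathrm{diag}(\mathbf{x}) - \mathbf{x}\mathbf{x}^\intercal$ can be checked numerically. The sketch below is one way to do so in plain Python: the helper names `softmax`, `softmax_jacobian`, and `numerical_jacobian`, the test point `z`, and the step size `h` are all assumptions made for illustration. It compares the analytic Jacobian against central finite differences.

```python
import math

def softmax(z):
    """Softmax of a list of floats (shifted by the maximum for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: entry (i, j) is x_i * (1 - x_i) if i == j, else -x_i * x_j."""
    x = softmax(z)
    n = len(x)
    return [[(x[i] if i == j else 0.0) - x[i] * x[j] for j in range(n)]
            for i in range(n)]

def numerical_jacobian(z, h=1e-6):
    """Central-difference estimate of d x_i / d z_j (h is an arbitrary small step)."""
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += h
        z_minus[j] -= h
        x_plus = softmax(z_plus)
        x_minus = softmax(z_minus)
        for i in range(n):
            jac[i][j] = (x_plus[i] - x_minus[i]) / (2.0 * h)
    return jac

z = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(z)
numeric = numerical_jacobian(z)
n = len(z)
max_error = max(abs(analytic[i][j] - numeric[i][j])
                for i in range(n) for j in range(n))
print(max_error)  # small if the analytic and numerical Jacobians agree
```

Two properties of the matrix are visible in `analytic`: it is symmetric, and each of its columns sums to zero, because the softmax outputs always sum to 1.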