In [2]:
import numpy as np

## Activation Functions

This notebook derives the forward and backward (derivative) formulas for several activation functions used in machine learning.
Each derivation is then implemented in Python with NumPy.

### ReLU

Rectified Linear Unit (ReLU) activation function.

#### Forward
$$
\operatorname{ReLU}(x) = \max(0, x) =
\begin{cases}
0, & \text{if } x \le 0, \\
x, & \text{if } x > 0.
\end{cases}
$$

In [7]:
for _ in range(5):
    x = np.random.randint(-10, 10)
    print(x, np.maximum(x, 0))

7 7
9 9
-7 0
8 8
1 1


#### Backward

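$$
\frac{\partial \operatorname{ReLU}(x)}{\partial x} =
\begin{cases}
0, & \text{if } x < 0, \\
1, & \text{if } x > 0.
\end{cases}
$$

The derivative is undefined at $x = 0$. Implementations typically pick a value there by convention, usually $0$; the mask `x > 0` in the cell below follows that convention.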

In [5]:
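for _ in range(5):
    # Fill a 2x2 array with a single random integer in [-10, 10).
    x = np.zeros((2, 2))
    x.fill(np.random.randint(-10, 10))
    # x[x > 0] selects the entries whose ReLU gradient is 1; all others get gradient 0.
    print(x, x[x>0])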

[[-6. -6.]
 [-6. -6.]] []
[[4. 4.]
 [4. 4.]] [4. 4. 4. 4.]
[[2. 2.]
 [2. 2.]] [2. 2. 2. 2.]
[[9. 9.]
 [9. 9.]] [9. 9. 9. 9.]
[[9. 9.]
 [9. 9.]] [9. 9. 9. 9.]
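The two ReLU formulas above can be packaged as a pair of vectorised functions. This is a minimal sketch rather than the notebook's own implementation: the names `relu`, `relu_grad` and `x_demo` are illustrative, and the gradient at exactly `x == 0` is set to 0 by assumption.

```python
import numpy as np

def relu(x):
    """Forward pass: element-wise max(0, x)."""
    return np.maximum(x, 0)

def relu_grad(x):
    """Backward pass: 1 where x > 0, otherwise 0 (0 at x == 0 by assumption)."""
    return (x > 0).astype(float)

x_demo = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x_demo))       # values: 0, 0, 0, 0.5, 2
print(relu_grad(x_demo))  # values: 0, 0, 0, 1, 1
```

Returning a full mask, rather than only the selected positive entries, keeps the gradient the same shape as the input, which is what a backward pass needs.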


### Sigmoid

#### Forward

$$
\operatorname{sigmoid}(x) = \sigma(x) = \frac{1}{1 + e^{-x}}
$$

In [8]:
def sigmoid(x):
    out = 1 / (1 + np.exp(-x))
    return out

sigmoid(x)

array([[0.99987661, 0.99987661],
       [0.99987661, 0.99987661]])

#### Backward

Start with the definition of the sigmoid function:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Rewrite it with a negative exponent:
$$
\sigma(x) = (1 + e^{-x})^{-1}
$$

To simplify differentiation, introduce the substitution $u$:
$$
\text{Let } u = 1 + e^{-x}, \text{ then } \sigma(x) = u^{-1}
$$

Compute the derivative of $u$ with respect to $x$, and the derivative of $u^{-1}$ with respect to $u$:

$$
\begin{align*}
\frac{\partial u}{\partial x} &= -e^{-x}, \\
\frac{\partial}{\partial u} \left( u^{-1} \right) &= -u^{-2}.
\end{align*}
$$

Using the chain rule:
$$
\begin{align*}
\frac{\partial u^{-1}}{\partial x} &= \frac{\partial u^{-1} }{\partial u} \cdot \frac{\partial u}{\partial x} \\
&= \left(-u^{-2}\right) \cdot \left(-e^{-x}\right) \\
&= \frac{e^{-x}}{u^2}
\end{align*}
$$

Substitute $u = 1 + e^{-x}$:
$$
\frac{\partial u^{-1}}{\partial x} = \frac{\partial \sigma(x)}{\partial x} = \frac{e^{-x}}{(1 + e^{-x})^2}
$$

Next, write the complement of $\sigma(x)$ as:
$$
\begin{align*}
1 - \sigma(x) &= 1 - \frac{1}{1 + e^{-x}} \\
&= \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} \\
&= \frac{e^{-x}}{1 + e^{-x}}
\end{align*}
$$

Multiplying $\sigma(x)$ by $1 - \sigma(x)$ gives:
$$
\begin{align*}
\sigma(x) (1 - \sigma(x)) &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \frac{e^{-x}}{(1 + e^{-x})^2}
\end{align*}
$$

This is exactly the derivative obtained from the chain rule above, so we can conclude:
$$
\frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x))
$$

In [10]:
grad = sigmoid(x) * (1 - sigmoid(x))
print(grad)

[[0.00012338 0.00012338]
 [0.00012338 0.00012338]]
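One way to sanity-check the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is to compare it against a central finite difference. The sketch below is illustrative only: `sigmoid_grad`, `xs`, `eps` and `max_err` are names introduced here, and the step size `1e-5` is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

xs = np.linspace(-5.0, 5.0, 11)
eps = 1e-5  # finite-difference step (assumed value)

# Central difference approximation of the derivative.
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

max_err = np.max(np.abs(numeric - sigmoid_grad(xs)))
print(max_err)  # expected to be very small if the derivation is right
```

Computing `sigmoid(x)` once and reusing it, as `sigmoid_grad` does, also avoids evaluating the exponential twice.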


### Softmax

#### Forward
Define the softmax function for the $j$-th component of an input vector $\mathbf{x} \in \mathbb{R}^n$:
$$
\operatorname{softmax}(\mathbf{x})_j = S_j = \frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}
$$

In [11]:
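def softmax(x):
    # Normalises over every element of x (no axis argument), so a 2-D input
    # is treated as one flat vector rather than as a batch of rows.
    return np.exp(x) / np.sum(np.exp(x))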

softmax(x)

array([[0.25, 0.25],
       [0.25, 0.25]])
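In practice softmax is often applied per row of a batch, and large inputs can overflow `np.exp`. A common remedy is to subtract the row maximum before exponentiating, which leaves the result unchanged because the factor cancels in the ratio. The sketch below assumes that row-wise convention; `softmax_rows` and `logits` are hypothetical names, not part of the notebook.

```python
import numpy as np

def softmax_rows(x):
    """Softmax along the last axis, shifted by the row max for stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([
    [1.0, 2.0, 3.0],
    [1000.0, 1000.0, 1000.0],  # would overflow np.exp without the shift
])
p = softmax_rows(logits)
print(p.sum(axis=-1))  # each row should sum to 1
```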

#### Backward

Recall the quotient rule. For a function $f(x) = g(x) / h(x)$, the derivative is:

$$
\frac{d}{dx} \left(\frac{g(x)}{h(x)}\right)
= \frac{g'(x) \, h(x) - g(x) \, h'(x)}{[h(x)]^2}.
$$

For the $j$-th softmax component, the numerator and denominator are:

$$
\begin{align}
g(\mathbf{x}) &= e^{x_j}, \\
h(\mathbf{x}) &= \sum_{k=1}^{n} e^{x_k}.
\end{align}
$$

Compute the partial derivatives of $g$ and $h$ with respect to $x_i$.

For the numerator:
$$
\begin{align}
\frac{\partial}{\partial x_i} \, e^{x_j} &=
\begin{cases}
e^{x_j}, & \text{if } i = j, \\
0, & \text{if } i \neq j.
\end{cases}
\end{align}
$$


For the denominator:
$$
\begin{align}
\frac{\partial}{\partial x_i} \left(\sum_{k=1}^{n} e^{x_k}\right)
&= e^{x_i}.
\end{align}
$$

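Apply the quotient rule to differentiate $S_j(\mathbf{x}) = \operatorname{softmax}(\mathbf{x})_j$ with respect to $x_i$, writing $\delta_{ij}$ for the Kronecker delta ($1$ if $i = j$, $0$ otherwise):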
$$
\begin{align}
\frac{\partial S_j}{\partial x_i} &= \frac{\frac{\partial}{\partial x_i}\left(e^{x_j}\right) \left(\sum_{k=1}^{n} e^{x_k}\right) - e^{x_j}\, \frac{\partial}{\partial x_i}\left(\sum_{k=1}^{n} e^{x_k}\right)}{\left(\sum_{k=1}^{n} e^{x_k}\right)^2} \\
&= \frac{\delta_{ij}\, e^{x_j} \left(\sum_{k=1}^{n} e^{x_k}\right) - e^{x_j}\, e^{x_i}}{\left(\sum_{k=1}^{n} e^{x_k}\right)^2}.
\end{align}
$$

Express the derivative in terms of the softmax function.

Recall that:
$$
\begin{align}
S_j &= \frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}, \\
S_i &= \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}.
\end{align}
$$

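Dividing each term of the numerator by $\left(\sum_{k=1}^{n} e^{x_k}\right)^2$ gives $\delta_{ij} S_j - S_j S_i$. Thus, we can write: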

$$
\frac{\partial S_j}{\partial x_i} = S_j\left(\delta_{ij} - S_i\right)
=
\begin{cases}
S_j (1 - S_j), & \text{if } i = j, \\
- S_j S_i, & \text{if } i \neq j.
\end{cases}
$$
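The formula $\partial S_j / \partial x_i = S_j(\delta_{ij} - S_i)$ describes a full Jacobian matrix, which can be written as $\operatorname{diag}(S) - S S^\top$. The following sketch builds that matrix for a single vector and compares it with central finite differences. The helper names `softmax_1d` and `softmax_jacobian`, the test vector, and the step size are assumptions made for illustration.

```python
import numpy as np

def softmax_1d(x):
    """Softmax of a 1-D vector (max-shifted for numerical stability)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """J[j, i] = dS_j/dx_i = S_j * (delta_ij - S_i)."""
    s = softmax_1d(x)
    return np.diag(s) - np.outer(s, s)

x_vec = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(x_vec)

# Finite-difference Jacobian: perturb one input coordinate at a time.
eps = 1e-6
J_num = np.zeros_like(J)
for i in range(x_vec.size):
    step = np.zeros_like(x_vec)
    step[i] = eps
    J_num[:, i] = (softmax_1d(x_vec + step) - softmax_1d(x_vec - step)) / (2 * eps)

jac_err = np.max(np.abs(J - J_num))
print(jac_err)
```

Each column of `J` sums to zero, reflecting the fact that the softmax outputs always sum to 1 regardless of the input.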

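Finally, apply the chain rule to propagate an upstream gradient $\partial L / \partial S_j$ back to the input:
$$
\begin{align}
\frac{\partial L}{\partial x_i}
&= \sum_{j} \frac{\partial L}{\partial S_j} \frac{\partial S_j}{\partial x_i}
= \sum_{j} \frac{\partial L}{\partial S_j} S_j \left(\delta_{ij} - S_i\right) \\
&= S_i \left(\frac{\partial L}{\partial S_i} - \sum_{j} \frac{\partial L}{\partial S_j} S_j\right)
\end{align}
$$

In the code, `probs` holds $S$ and `sum_term` holds $\sum_{j} \frac{\partial L}{\partial S_j} S_j$.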

In [15]:
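probs = softmax(x)
grad = 1  # stand-in for the upstream gradient dL/dS (all ones)
# axis=-1 sums along each row. softmax(x) above normalises over all of x,
# so each row of probs sums to 0.5 here rather than 1.
sum_term = np.sum(grad * probs, axis=-1, keepdims=True)
dLdx = probs * (grad - sum_term)
print(sum_term, dLdx)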


[[0.5]
 [0.5]] [[0.125 0.125]
 [0.125 0.125]]
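The closed form $S_i\left(\partial L/\partial S_i - \sum_j \partial L/\partial S_j \, S_j\right)$ avoids building the Jacobian explicitly. The sketch below checks it against an explicit Jacobian product, assuming softmax is taken along the last axis so that every row sums to 1. The helpers `softmax_rows` and `softmax_backward`, and the random test data, are hypothetical and introduced only for this check.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax (normalises along the last axis)."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_backward(probs, grad_out):
    """dL/dx_i = S_i * (dL/dS_i - sum_j dL/dS_j * S_j), applied per row."""
    sum_term = np.sum(grad_out * probs, axis=-1, keepdims=True)
    return probs * (grad_out - sum_term)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3))
grad_out = rng.normal(size=(2, 3))  # arbitrary upstream gradient dL/dS

probs = softmax_rows(logits)
dLdx = softmax_backward(probs, grad_out)

# Reference: multiply by the explicit Jacobian diag(S) - S S^T, row by row.
reference = np.stack([
    (np.diag(s) - np.outer(s, s)).T @ g for s, g in zip(probs, grad_out)
])
max_diff = np.max(np.abs(dLdx - reference))
print(max_diff)

# With an all-ones upstream gradient the result should vanish, because
# each row of the softmax output sums to 1 whatever the input is.
ones_case = softmax_backward(probs, np.ones_like(probs))
print(np.max(np.abs(ones_case)))
```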


### Tanh

#### Forward

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2 \sigma (2x) - 1,
\quad \text{where } \sigma(x) = \frac{1}{1 + e^{-x}}
$$

In [17]:
def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

tanh(x)

array([[0.99999997, 0.99999997],
       [0.99999997, 0.99999997]])
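The identity $\tanh(x) = 2\sigma(2x) - 1$ stated above can be checked numerically against NumPy's built-in `np.tanh`. This is only a quick sketch; `xs`, `via_sigmoid` and `identity_err` are names chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.linspace(-4.0, 4.0, 9)

# tanh expressed through the sigmoid function.
via_sigmoid = 2 * sigmoid(2 * xs) - 1

identity_err = np.max(np.abs(via_sigmoid - np.tanh(xs)))
print(identity_err)
```

For very large inputs the explicit ratio of exponentials can overflow in `np.exp`, so `np.tanh` is likely the safer choice outside of a derivation exercise.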

#### Backward

Let
$$
u(x) = e^x - e^{-x} \quad \text{and} \quad v(x) = e^x + e^{-x}.
$$

Then, by the quotient rule, the derivative of $\tanh(x)$ is given by:
$$
\frac{d}{dx}\left(\frac{u(x)}{v(x)}\right) = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}
$$

Step 1: Compute $u'(x)$ and $v'(x)$.

Differentiate $u(x)$:
$$
u'(x) = \frac{d}{dx}\left(e^x - e^{-x}\right) = e^x + e^{-x}.
$$

Differentiate $v(x)$:
$$
v'(x) = \frac{d}{dx}\left(e^x + e^{-x}\right) = e^x - e^{-x}.
$$

Step 2: Substitute the derivatives into the quotient rule:
$$
\tanh'(x) = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{\left(e^x + e^{-x}\right)^2}.
$$


Step 3: Simplify the numerator. Notice that:
$$
\left(e^x + e^{-x}\right)^2 = e^{2x} + 2 + e^{-2x},
$$

and
$$
\left(e^x - e^{-x}\right)^2 = e^{2x} - 2 + e^{-2x}.
$$

Subtracting these, we get:
$$
\left(e^x + e^{-x}\right)^2 - \left(e^x - e^{-x}\right)^2 = \left(e^{2x} + 2 + e^{-2x}\right) - \left(e^{2x} - 2 + e^{-2x}\right) = 4.
$$


Step 4: Write the derivative in simplified form. Thus, the derivative becomes:
$$
\tanh'(x) = \frac{4}{\left(e^x + e^{-x}\right)^2}.
$$

Recall that the hyperbolic cosine is defined as:
$$
\cosh(x) = \frac{e^x + e^{-x}}{2},
$$

so that

$$
\left(e^x + e^{-x}\right)^2 = 4\cosh^2(x).
$$

Substitute this into the expression for $\tanh'(x)$:
$$
\tanh'(x) = \frac{4}{4\cosh^2(x)} = \frac{1}{\cosh^2(x)}.
$$

Since the hyperbolic secant is defined as:
$$
\operatorname{sech}(x) = \frac{1}{\cosh(x)},
$$

we can finally write:
$$
\tanh'(x) = \operatorname{sech}^2(x).
$$

Alternatively, using the identity $\operatorname{sech}^2(x) = 1 - \tanh^2(x)$, which follows from $\cosh^2(x) - \sinh^2(x) = 1$, we also have:
$$
\tanh'(x) = 1 - \tanh^2(x).
$$
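Both forms of the result, $\operatorname{sech}^2(x)$ and $1 - \tanh^2(x)$, can be compared with each other and with a central finite difference. The sketch below uses NumPy's `np.tanh` and `np.cosh`; the variable names and the step size `1e-5` are assumptions made for this check.

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 13)

grad_identity = 1 - np.tanh(xs) ** 2   # 1 - tanh^2(x)
grad_sech = 1 / np.cosh(xs) ** 2       # sech^2(x)

eps = 1e-5  # finite-difference step (assumed value)
grad_numeric = (np.tanh(xs + eps) - np.tanh(xs - eps)) / (2 * eps)

err_forms = np.max(np.abs(grad_identity - grad_sech))
err_numeric = np.max(np.abs(grad_identity - grad_numeric))
print(err_forms, err_numeric)
```

Note that the square applies to the output of `tanh`, not to its argument.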

In [18]:
# The derivative is 1 - tanh(x)^2: square the output of tanh, not its input.
grad = 1 - tanh(x) ** 2
print(grad)
