
# Activation Function
Types of activation functions:
- Sigmoid
- Tanh
- ReLU
- Leaky ReLU
- Softmax

## Sigmoid
The sigmoid function is a mathematical function with a characteristic S-shaped (sigmoid) curve. The term usually refers to the special case of the logistic function, shown in the first figure and defined by the formula

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The sigmoid function is bounded by two horizontal asymptotes: $\sigma(x) \to 0$ as $x \to -\infty$ and $\sigma(x) \to 1$ as $x \to +\infty$.

<div align="center">
  <img src="images/sigmoid.png" alt="Alt text" width="400" height="300" />
</div>


The sigmoid function is a common choice for models that need to predict a probability as output, because its values lie strictly between 0 and 1. It is smooth and differentiable everywhere, so it has no "jumps" in output values: a small change in the input produces only a small change in the output.
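The properties above can be checked with a minimal sketch. The piecewise form used here is an assumption added for numerical stability (it avoids overflow in `exp` for large inputs); it is not part of the original definition, and the function name `sigmoid` is illustrative.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, computed piecewise so exp() never overflows."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

# Extreme inputs approach the asymptotes 0 and 1 without overflow.
vals = sigmoid([-1000.0, 0.0, 1000.0])
print(vals)

# A small change in the input gives a small change in the output.
delta = sigmoid(0.01)[0] - sigmoid(0.0)[0]
print(delta)
```

The slope of the sigmoid is at most 0.25 (reached at $x = 0$), which is why `delta` stays well below the input step of 0.01.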

Loss for a sigmoid output (binary cross-entropy):

$$ L = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] $$
where $N$ is the number of samples, $y_i \in \{0, 1\}$ is the true label and $\hat{y}_i$ is the predicted probability of the label being true.
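As a rough illustration of this loss, the sketch below applies it to sigmoid outputs. The function name `binary_cross_entropy`, the clipping tolerance `eps` and the sample logits and labels are all assumptions made for the example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over N samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs; eps is an assumed tolerance.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return -np.mean(per_sample)

# Hypothetical raw scores (logits) and true labels.
logits = np.array([2.0, -1.0, 0.5])
labels = np.array([1.0, 0.0, 1.0])

# The sigmoid turns each logit into a probability in (0, 1).
probs = 1.0 / (1.0 + np.exp(-logits))
loss = binary_cross_entropy(labels, probs)
print(loss)
```

Note that both terms sit inside the mean: each sample contributes either $\log(\hat{y}_i)$ or $\log(1-\hat{y}_i)$, depending on its label.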

## Tanh

The tanh function is a rescaled version of the logistic sigmoid, such that its outputs range from -1 to 1:

$$\tanh(x) = 2\sigma(2x) - 1 = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

<div align="center">
  <img src="images/tanh.png" alt="Alt text" width="400" height="300" />
</div>

The tanh function is also sigmoidal (S-shaped). It is a scaled and shifted version of the sigmoid function. Both take a real-valued number and "squash" it into a bounded range, but where the sigmoid maps its input to values between 0 and 1, tanh maps it to values between -1 and 1, so its output is centered on zero.
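The relationship between tanh and the sigmoid can be verified numerically. This is only a quick check on a handful of sample points, with a locally defined `sigmoid` helper that is assumed for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# tanh(x) should equal the rescaled sigmoid 2 * sigmoid(2x) - 1.
rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
identity_holds = np.allclose(rescaled, np.tanh(x))
print(identity_holds)

# Outputs stay strictly inside (-1, 1), unlike the sigmoid's (0, 1).
t = np.tanh(x)
print(t.min(), t.max())
```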

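Loss for the tanh function:

Tanh outputs lie between -1 and 1 and are not probabilities, so the binary cross-entropy below applies only after the output is mapped to the range 0 to 1, for example with $\hat{y} = (\tanh(x) + 1)/2$:

$$ L = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] $$
where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of the label being true.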

## ReLU

ReLU stands for Rectified Linear Unit. It takes a real-valued input and thresholds it at zero, replacing negative values with zero:

$$f(x) = \max(0, x)$$

<div align="center">
  <img src="images/relu.png" alt="Alt text" width="400" height="300" />
</div>

ReLU is one of the most commonly used activation functions in neural networks, especially in CNNs. Compared to a saturating function like the sigmoid, ReLU usually helps a neural network learn much faster and mitigates the vanishing gradient problem, because its gradient is exactly 1 for every positive input instead of shrinking toward zero.
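The gradient argument can be made concrete with a small sketch that compares the derivative of ReLU with that of the sigmoid. The helper names are illustrative, and setting the ReLU gradient to 0 at exactly $x = 0$ is a convention assumed here (the derivative is undefined at that point).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (0 at x == 0 by convention).
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-4.0, -1.0, 0.5, 4.0])
print(relu(x))
print(relu_grad(x))      # stays at 1 for every positive input
print(sigmoid_grad(x))   # never exceeds 0.25 and shrinks for large |x|
```

Multiplying many sigmoid gradients (each at most 0.25) across layers drives the product toward zero, whereas a chain of active ReLU units multiplies by 1.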

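Loss for the ReLU function:

ReLU is normally used in hidden layers, and its unbounded output is not a probability. The binary cross-entropy below is therefore computed on the output of a final sigmoid layer, not on the ReLU activations themselves:

$$ L = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] $$
where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of the label being true.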

## Leaky ReLU

Leaky ReLU is an attempt to solve the dying ReLU problem, in which a unit that only receives negative inputs always outputs zero, gets zero gradient and stops learning. Instead of being zero when $x < 0$, a leaky ReLU has a small positive slope (0.01 or so) in that region. That is, the function computes

$$f(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.01x & \text{otherwise} \end{cases}$$

<div align="center">
  <img src="images/leakyrelu.png" alt="Alt text" width="400" height="300" />
</div>
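A minimal sketch of the Leaky ReLU and its gradient follows. The parameter name `alpha` for the negative-side slope is an assumption of this example; its default of 0.01 follows the formula above.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for x >= 0, slope alpha for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for x >= 0 and alpha otherwise, so it is never zero."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))
print(leaky_relu_grad(x))
```

Because the gradient on the negative side is `alpha` rather than 0, a unit with negative inputs still receives a (small) update.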

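Loss for the Leaky ReLU function:

Like ReLU, Leaky ReLU is normally used in hidden layers and its output is not a probability. The binary cross-entropy below is computed on the output of a final sigmoid layer:

$$ L = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] $$
where $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of the label being true.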

## Softmax

The softmax function is a generalization of the logistic function to $K$ classes. It maps a vector of $K$ real-valued scores to a probability distribution over $K$ different possible outcomes.

The softmax function is defined as:

$$
\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}
$$

where $j = 1, \ldots, K$.

When applied to the entire vector $\mathbf{z}$, the softmax function generates a vector of probabilities $\mathbf{a}(x)$:

$$
\mathbf{a}(x) = \left[ \frac{e^{z_1}}{\sum_{k=1}^{K}{e^{z_k}}}, \frac{e^{z_2}}{\sum_{k=1}^{K}{e^{z_k}}}, \ldots, \frac{e^{z_K}}{\sum_{k=1}^{K}{e^{z_k}}} \right]
$$

The output $\mathbf{a}$ is a vector of length $K$ whose entries sum to 1, so for softmax regression, you could also write:

$$
\mathbf{a}(x) = \begin{bmatrix}
P(y = 1 | \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y = K | \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
= \frac{1}{ \sum_{k=1}^{K}{e^{z_k} }}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_{K}} \\
\end{bmatrix}
$$

<div align="center">
  <img src="images/softmax.png" alt="Alt text" width="400" height="300" />
</div>
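The definition above can be sketched in NumPy as follows. Subtracting the row maximum before exponentiating is a standard stability trick assumed here rather than taken from the text; it leaves the result unchanged because the common factor cancels in the ratio. The sample scores are hypothetical.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis of z."""
    z = np.asarray(z, dtype=float)
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Two hypothetical score vectors with K = 3 classes each.
scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=-1))  # each row sums to 1
```

The second row would overflow with a naive `np.exp(1000.0)`, but the shifted version returns a uniform distribution over the three classes.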

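Loss for the softmax function (categorical cross-entropy):

$$ L = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^K y_{ij} \log(\hat{y}_{ij}) $$
where $N$ is the number of samples, $K$ is the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $\hat{y}_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.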

The softmax function is used in multinomial logistic regression and is often used as the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
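To illustrate the cross-entropy loss used with softmax outputs, here is a small sketch with one-hot labels. The function name `categorical_cross_entropy`, the tolerance `eps` and the two sets of predicted probabilities are made up for the example.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    # Clip to avoid log(0); eps is an assumed tolerance.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    # Inner sum runs over classes, outer mean over samples.
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

# Two samples, three classes; the true classes are 0 and 1.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1]])
uncertain = np.full((2, 3), 1.0 / 3.0)

loss_confident = categorical_cross_entropy(y_true, confident)
loss_uncertain = categorical_cross_entropy(y_true, uncertain)
print(loss_confident, loss_uncertain)
```

Only the probability assigned to the true class enters the loss, so the uniform prediction is penalized with $\log 3$ per sample regardless of the label.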


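Code to plot the activation functions:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)

# Each entry maps a name to the activation evaluated on x.
activations = {
    'sigmoid': 1 / (1 + np.exp(-x)),
    'tanh': np.tanh(x),
    'ReLU': np.maximum(0, x),
    # Slope 0.01 on the negative side, matching the Leaky ReLU formula.
    'leaky ReLU': np.where(x >= 0, x, 0.01 * x),
    # Softmax treats all of x as one score vector; subtracting the
    # maximum avoids overflow without changing the result.
    'softmax': np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x))),
}

name = 'softmax'  # change this to plot a different activation
plt.plot(x, activations[name])
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'{name} function')
plt.show()

```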


# Loss Function
Types of loss functions:
- Mean Squared Error
- Cross Entropy
- Hinge Loss or Multi class SVM Loss
- Sparse Multiclass Cross Entropy

## Mean Squared Error

The mean squared error (MSE) is the average of the squared differences between the predicted values and the actual values. It is a risk function, corresponding to the expected value of the squared error loss. MSE is almost always strictly positive (and not zero), either because of randomness or because the estimator does not account for information that could produce a more accurate estimate.

$$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

where $n$ is the number of samples, $y_i$ is the true value and $\hat{y}_i$ is the predicted value.

<div align="center">
  <img src="images/meansquareerror.png" alt="Alt text" width="400" height="300" />
</div>
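The MSE formula translates directly into a few lines of NumPy. The function name and the sample values below are illustrative only.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Errors are 0.5, -0.5, 0 and -1, so the squares are 0.25, 0.25, 0 and 1.
mse = mean_squared_error([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```

Squaring makes the single error of 1 contribute four times as much as each error of 0.5, which is why MSE is sensitive to outliers.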

## Cross Entropy

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. For example, predicting a probability of 0.012 when the actual observation label is 1 is a bad prediction and results in a high loss value. A perfect model would have a log loss of 0.

$$L = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$

where $N$ is the number of samples, $y_i$ is the true label and $\hat{y}_i$ is the predicted probability of the label being true.

<div align="center">
  <img src="images/crossentropy.png" alt="Alt text" width="400" height="300" />
</div>
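As a small worked example, assume the natural logarithm and a single observation with true label $y = 1$. The loss then reduces to $-\log(\hat{y})$, and three different predictions give:

$$-\log(0.012) \approx 4.42, \qquad -\log(0.5) \approx 0.69, \qquad -\log(0.9) \approx 0.11$$

The poor prediction of 0.012 costs roughly forty times more than the confident, correct prediction of 0.9, and the loss approaches 0 as $\hat{y} \to 1$.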

## Hinge Loss or Multi class SVM Loss

Hinge loss is primarily used with Support Vector Machine (SVM) classifiers with class labels -1 and 1. For a true label $t$ and a raw model output $y$, it is the function $\max(0, 1 - t \cdot y)$. The function returns 0 if $t \cdot y$ is greater than or equal to one, that is, when the output has the correct sign and lies outside the margin. Otherwise, it returns $1 - t \cdot y$, which is always a positive value. The loss is small when the predicted output is of the correct sign (positive or negative) and large when the predicted output is of the wrong sign. The multi-class SVM loss generalizes this idea to one score per class:

$$L = \sum_{i=1}^N \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

where $N$ is the number of samples and, for sample $i$, $s_j$ is the score for class $j$ and $s_{y_i}$ is the score for the correct class $y_i$.

<div align="center">
  <img src="images/hingeloss.png" alt="Alt text" width="400" height="300" />
</div>
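The multi-class formula can be sketched as below. The function name, the `margin` parameter (fixed at 1 in the formula) and the sample scores are assumptions for illustration, and class labels are taken to be 0-indexed integers.

```python
import numpy as np

def multiclass_hinge_loss(scores, labels, margin=1.0):
    """Sum over samples of max(0, s_j - s_correct + margin) for all j != correct."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(scores.shape[0])
    # Score of the correct class for each sample, as a column vector.
    correct = scores[rows, labels][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    # The correct class itself is excluded from the sum (j != y_i).
    margins[rows, labels] = 0.0
    return margins.sum()

# Two samples, three classes; the correct classes are 0 and 1.
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9, 2.0]])
labels = np.array([0, 1])
loss = multiclass_hinge_loss(scores, labels)
print(loss)
```

In the first sample, class 1 outscores the correct class 0 and contributes $5.1 - 3.2 + 1 = 2.9$. In the second sample the correct class beats every other class by more than the margin, so it contributes nothing.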

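Code to plot the loss functions:

```python
import numpy as np
import matplotlib.pyplot as plt

# x is the prediction (or the margin t*y for hinge loss); the target is 1.
x = np.linspace(-3, 5, 100)
# Cross-entropy is only defined for predicted probabilities in (0, 1].
p = np.linspace(0.01, 1, 100)

# Each entry maps a name to (inputs, loss values).
losses = {
    'mean squared error': (x, (x - 1) ** 2),
    'cross entropy': (p, -np.log(p)),
    'hinge loss': (x, np.maximum(0, 1 - x)),
}

name = 'mean squared error'  # change this to plot a different loss
inputs, values = losses[name]
plt.plot(inputs, values)
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'{name} function')
plt.show()
```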

