# Activation Functions


## 1. Sigmoid Function

### $ f(x) = \frac{1} {1+e^{-x}} $

A non-linear activation function. Sigmoid maps any real input to the range (0, 1).
Its derivative lies between 0 and 0.25, with the maximum of 0.25 at x = 0.

Advantages -

1. This is a smooth S-shaped function and is continuously differentiable


Disadvantages -

1. Vanishing gradients: for inputs greater than 3 or less than -3 the gradient is very small, so weight updates in earlier layers shrink.
2. Not zero-centered: every neuron's output is positive, so all outputs have the same sign.
3. Computationally expensive because it requires an exponential.
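
The derivative bound and the small gradients beyond |x| = 3 can be checked numerically. This is a minimal sketch in plain Python; the helper names `sigmoid` and `sigmoid_grad` are illustrative and not taken from any library.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, branched so math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and is already below 0.05 at |x| = 3.
for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(f"x={x:+.1f}  f(x)={sigmoid(x):.4f}  f'(x)={sigmoid_grad(x):.4f}")
```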

## 2. Tanh Function

### $ f(x) = \frac{e^x - e^{-x}} {e^x + e^{-x}} $

A non-linear activation function. Tanh maps any real input to the range (-1, 1). Its derivative lies between 0 and 1, with the maximum of 1 at x = 0.

Advantages -

1. This is a smooth S-shaped function and is continuously differentiable
2. Zero-centered, so the gradients are not restricted to move in a single direction.

Disadvantages -

1. Vanishing gradients: for inputs greater than 3 or less than -3 the gradient is very small.
2. Computationally expensive because it requires exponentials.
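
A short sketch of the tanh derivative, using the identity f'(x) = 1 - tanh(x)^2. The function name `tanh_grad` is an illustrative choice.

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh: f'(x) = 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# The gradient is 1 at the origin and drops below 0.01 by |x| = 3.
for x in (-3.0, 0.0, 3.0):
    print(f"x={x:+.1f}  tanh(x)={math.tanh(x):+.4f}  f'(x)={tanh_grad(x):.4f}")
```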

## 3. ReLU Function


### f(x) = max(0, x)

The Rectified Linear Unit is another non-linear function. If the input is positive, the function outputs the value itself; if the input is negative, the output is zero.

Advantages -

1. It does not activate all the neurons at the same time.
2. Computationally efficient
3. Avoids the vanishing gradient problem for positive inputs, where the gradient is always 1


Disadvantages -

1. Dying ReLU: when the input is negative the gradient is zero, so no error signal flows back through the neuron during backpropagation. A neuron whose input stays negative stops learning.
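
The zero gradient for negative inputs is easy to see in code. A minimal sketch; the gradient value at exactly x = 0 is a convention, and 0 is assumed here.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU; the value at x = 0 is a convention (0 is assumed here)."""
    return 1.0 if x > 0 else 0.0

pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(x) for x in pre_activations])       # [0.0, 0.0, 0.0, 0.5, 2.0]
print([relu_grad(x) for x in pre_activations])  # [0.0, 0.0, 0.0, 1.0, 1.0]
```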

## 4. Leaky ReLU

### f(x) = max(0.01x, x)

In this variant of ReLU, instead of producing zero for negative inputs, it will just produce a very small value.

Advantages -

1. It does not activate all the neurons at the same time.
2. Computationally efficient
3. Avoids the vanishing gradient problem
4. It also avoids the dying neuron problem, because for negative inputs the gradient is 0.01 rather than 0.

Disadvantages -

1. Leaky ReLU does not provide consistent predictions for negative input values.
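
A sketch of Leaky ReLU and its gradient. The parameter name `alpha` for the negative slope is an assumption of this example; its default matches the 0.01 in the formula above.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 on the positive side and the constant alpha on the negative side."""
    return 1.0 if x > 0 else alpha

for x in (-100.0, -1.0, 1.0, 100.0):
    print(f"x={x:+.1f}  f(x)={leaky_relu(x):+.2f}  f'(x)={leaky_relu_grad(x)}")
```

The negative-side gradient does not depend on how negative the input is, so the neuron always receives some update.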


## 5. Parameterised ReLU


### f(x) = max(ax, x)

This is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis. Here 'a' is a trainable parameter: the network learns its value along with the weights, which can give faster and better convergence. The max form above assumes a ≤ 1.
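
What "trainable" means here can be sketched with the gradient of the output with respect to 'a'. The starting slope, learning rate, input, and upstream gradient below are made-up values for illustration only.

```python
def prelu(x: float, a: float) -> float:
    """x for positive inputs, a * x otherwise."""
    return x if x > 0 else a * x

def prelu_grad_a(x: float) -> float:
    """d f / d a: equals x for non-positive inputs and 0 for positive ones."""
    return 0.0 if x > 0 else x

# One illustrative gradient-descent step on `a` for a single input.
a, lr = 0.25, 0.1          # assumed starting slope and learning rate
x, upstream = -2.0, 1.0    # assumed input and gradient of the loss w.r.t. f(x)
a -= lr * upstream * prelu_grad_a(x)
print(a)                   # the slope moved from 0.25 to roughly 0.45
```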

Advantages -

1. Allows the negative slope to be learned.
2. Zero centered

Disadvantages -

## 6. Exponential Linear Unit (ELU)

### $ f(x) = \begin{cases} x, & x \ge 0 \\ a(e^x - 1), & x < 0 \end{cases} $

This is another variant of ReLU that aims to fix ReLU's zero gradient for negative inputs. Negative inputs produce a smooth negative output that saturates at -a.

Advantages -

1. No dead neuron
2. Zero centered

Disadvantages -
1. Slightly more computationally intensive because of the exponential
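
A sketch of ELU and its gradient. The default `a = 1.0` is an assumption of this example, since the notes leave 'a' unspecified.

```python
import math

def elu(x: float, a: float = 1.0) -> float:
    # expm1 computes e^x - 1 accurately for small |x|.
    return x if x >= 0 else a * math.expm1(x)

def elu_grad(x: float, a: float = 1.0) -> float:
    """Gradient is 1 for x >= 0 and a * e^x for x < 0, which is never exactly zero."""
    return 1.0 if x >= 0 else a * math.exp(x)

for x in (-5.0, -1.0, 0.0, 2.0):
    print(f"x={x:+.1f}  f(x)={elu(x):+.4f}  f'(x)={elu_grad(x):.4f}")
```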

## 7. Softmax

### $ S(x_j) = \frac {e^{x_j}} {\sum _{k=1} ^{K} {e^{x_k}}}  , j=1,2,3...K $

The softmax function is used for multiclass classification in neural networks.
It returns, for each class, the probability that a data point belongs to it.
 
 Advantages -
 
 1. Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories
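
A sketch of softmax over a list of scores. Subtracting the maximum score before exponentiating is a common stability trick that is not part of the formula above; it leaves the result unchanged.

```python
import math

def softmax(scores):
    """Return probabilities proportional to e^score, shifted by max(scores) for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(softmax([1000.0, 1000.0]))  # math.exp(1000.0) alone would raise OverflowError
```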

## 8. Swish

### f(x) = x . sigmoid(x)

## OR

### $ f(x) = \frac {x} {1 + e^{-x}} $

Swish is reported to outperform ReLU on deeper models, at the cost of evaluating a sigmoid. Its output ranges from about -0.28 to infinity: it is unbounded above and bounded below.

Advantages -

1. It is smooth and differentiable at every point, which helps optimization and generalization.
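
The shape of Swish for negative inputs can be explored with a small grid scan. A rough sketch; the grid bounds and step size are arbitrary choices.

```python
import math

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float) -> float:
    return x * sigmoid(x)

# Scan [-10, 10] in steps of 0.001 and report where the output is smallest.
grid = [i / 1000.0 for i in range(-10000, 10001)]
x_min = min(grid, key=swish)
print(x_min, swish(x_min))
```

Unlike ReLU, the function dips slightly below zero for small negative inputs before returning towards zero.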


## 9. Maxout

### $ f(x) = max(w_1^Tx+b_1, w_2^Tx+b_2) $

Maxout generalises ReLU and Leaky ReLU: it outputs the maximum of several learned linear functions of the input, and both are special cases.

Advantages -

1. Has all the advantages of the ReLU and Leaky ReLU functions
2. Learnable activation function
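
A sketch of a maxout unit over any number of linear pieces. The weights below are hand-picked to reproduce ReLU and Leaky ReLU on a one-dimensional input; in a real network they would be learned.

```python
def maxout(x, weights, biases):
    """Maximum over the linear pieces w_i . x + b_i."""
    return max(
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

relu_weights, relu_biases = [[0.0], [1.0]], [0.0, 0.0]     # max(0, x)
leaky_weights, leaky_biases = [[0.01], [1.0]], [0.0, 0.0]  # max(0.01x, x)

for x in (-2.0, 3.0):
    print(x, maxout([x], relu_weights, relu_biases), maxout([x], leaky_weights, leaky_biases))
```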


## 10. Softplus

### $ f(x) = ln(1+exp(x)) $

The softplus function is similar to ReLU, but it is smooth everywhere. Like ReLU it suppresses one side only: negative inputs are pushed towards zero. Its output range is (0, +inf).
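
A sketch of softplus next to ReLU. The formula is rearranged as max(x, 0) + ln(1 + e^(-|x|)), which is algebraically the same but avoids overflow for large inputs; that rearrangement is a choice of this example.

```python
import math

def softplus(x: float) -> float:
    """ln(1 + e^x), computed as max(x, 0) + ln(1 + e^-|x|)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.4f}  relu={max(0.0, x):.4f}")
```

The two curves differ most near zero and converge as |x| grows.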

---------------------------------------------

# Loss Functions

## Regression Loss Functions

## 1. L1 Loss (Least Absolute Deviations)

### $ L1 Loss = \sum _{n=1} ^{N} |y_{true} - y_{predicted}|  $

The sum of the absolute differences between the true values and the predicted values.


## 2. L2 Loss (Least Squared Error)

### $ L2 Loss = \sum _{n=1} ^{N} (y_{true} - y_{predicted})^2 $

It minimizes the error given by the sum of the squared differences between the true values and the predicted values.

Disadvantages -

1. Performs poorly when the data contains outliers, because squaring makes large errors dominate the loss.
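
The effect of a single outlier on the L1 and L2 losses can be compared directly. A small sketch with made-up data; the function names are illustrative.

```python
def l1_loss(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

def l2_loss(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.0, 2.5, 4.5]
y_outlier = [1.5, 2.0, 2.5, 14.0]   # last prediction is off by 10

print(l1_loss(y_true, y_clean), l2_loss(y_true, y_clean))      # 1.5 0.75
print(l1_loss(y_true, y_outlier), l2_loss(y_true, y_outlier))  # 11.0 100.5
```

One bad prediction raises the L1 loss by about 7x and the L2 loss by about 134x.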


## 3. Huber Loss

![image.png](attachment:image.png)

It combines the best properties of L1 and L2 loss: it is quadratic for small errors and linear for large ones.
It is robust to outliers because large errors are penalized linearly rather than quadratically.
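
A sketch of the Huber loss for a single residual, assuming the commonly used definition with a threshold `delta`: half the squared error up to `delta`, and a linear penalty beyond it. The default `delta = 1.0` is an arbitrary choice for this example.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """0.5 * r^2 if |r| <= delta, else delta * (|r| - 0.5 * delta)."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

print(huber(0.5))    # 0.125, quadratic region
print(huber(10.0))   # 9.5, linear region (a squared error would give 100.0)
```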

## Classification Loss Functions

## 4. Hinge Loss

### $ Hinge Loss = max(0, 1 - y * f(x)) $

Hinge loss is primarily used with Support Vector Machine classifiers, with class labels -1 and 1. It also penalizes predictions that are correct but not confident, i.e. those with y * f(x) < 1.
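
A sketch of the hinge loss for one example, where `score` stands for the raw classifier output f(x). The three calls illustrate the confident, unconfident, and wrong cases.

```python
def hinge_loss(y: int, score: float) -> float:
    """max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # correct and outside the margin: no loss
print(hinge_loss(+1, 0.3))   # correct but inside the margin: small loss
print(hinge_loss(+1, -0.5))  # wrong side of the boundary: larger loss
```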

## 5. Cross-entropy Loss

![image.png](attachment:image.png)

Cross-entropy loss is used for binary classification problems.

## 6. Sigmoid-Cross-Entropy Loss

In sigmoid cross-entropy loss, the raw output y is first passed through a sigmoid, and the cross-entropy is computed on the result. The sigmoid output lies between 0 and 1.
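
A sketch relating the two losses, assuming the standard binary cross-entropy -(y ln p + (1 - y) ln(1 - p)). The second function applies the sigmoid implicitly and works on the raw score `z`, using a rearranged form that avoids overflow; both function names are illustrative.

```python
import math

def binary_cross_entropy(p: float, y: int) -> float:
    """-(y ln p + (1 - y) ln(1 - p)) for a probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sigmoid_cross_entropy(z: float, y: int) -> float:
    """Same loss computed directly from a raw score z, in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 0.8, 1
p = 1.0 / (1.0 + math.exp(-z))          # sigmoid of the raw score
print(binary_cross_entropy(p, y), sigmoid_cross_entropy(z, y))
```

The two printed values agree up to floating-point rounding.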

## 7. Softmax-Cross-Entropy Loss

![image.png](attachment:image.png)

Softmax cross-entropy loss is used for multiclass classification. It converts a vector of raw scores into the corresponding probability vector before computing the cross-entropy.
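
A sketch of softmax cross-entropy for one example, assuming the usual definition: the negative log of the softmax probability assigned to the true class. It is computed with the log-sum-exp trick instead of forming the probabilities first; `target` is the index of the true class.

```python
import math

def softmax_cross_entropy(scores, target: int) -> float:
    """-ln softmax(scores)[target], via a max-shifted log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

scores = [2.0, 1.0, 0.1]
print(softmax_cross_entropy(scores, 0))  # true class has the highest score: low loss
print(softmax_cross_entropy(scores, 2))  # true class has the lowest score: high loss
```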