# Neural Network Representation

---

![alt text](https://notesonml.files.wordpress.com/2015/05/neural-network.png)

Let the i/p layer be denoted by $X$ and the o/p layer by $\hat{y}$.

* There are 4 i/p features. Let's name them $X_{1}$, $X_{2}$, $X_{3}$ and $X_{4}$ respectively.
* These 4 i/p features feed into a hidden layer. Let's name the nodes of the hidden layer $a_{1}$, $a_{2}$, $a_{3}$, $a_{4}$, $a_{5}$.

Each of these layers produces a vector of activations (the values that the layer passes on to the next one). Let's name them-
* For the i/p layer the activations are $\mathrm{a}^{[0]}$ (these are just the i/p features, so $\mathrm{a}^{[0]} = X$).
* For the hidden layer the activations are $\mathrm{a}^{[1]}$.
* For the o/p layer the activations are $\mathrm{a}^{[2]}$ (this is the prediction $\hat{y}$).

This is a 2-layer Neural Network, because when counting the layers of a Neural Network we do not count the i/p layer.

Parameters are associated with the hidden layer and the o/p layer:
* The hidden layer has $\mathrm{w}^{[1]}$ and $\mathrm{b}^{[1]}$, where the shape of $\mathrm{w}^{[1]}$ is $[5*4]$ and the shape of $\mathrm{b}^{[1]}$ is $[5*1]$.
 * For $\mathrm{w}^{[1]}$, 5 = nodes of the hidden layer and 4 = features of the i/p layer.
* The o/p layer has $\mathrm{w}^{[2]}$ and $\mathrm{b}^{[2]}$, where the shape of $\mathrm{w}^{[2]}$ is $[1*5]$ and the shape of $\mathrm{b}^{[2]}$ is $[1*1]$.
 * For $\mathrm{w}^{[2]}$, 1 = nodes of the o/p layer and 5 = nodes of the hidden layer.
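
The shapes listed above can be checked with a small NumPy sketch. The variable names (`n_x`, `n_h`, `n_y`, `W1`, `b1`, `W2`, `b2`) and the random values are assumptions made only for illustration; they are not part of the original notes.

```python
import numpy as np

# Layer sizes taken from the text: 4 i/p features, 5 hidden nodes, 1 o/p node.
n_x, n_h, n_y = 4, 5, 1

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible

W1 = rng.standard_normal((n_h, n_x))  # w[1]: one row per hidden node
b1 = np.zeros((n_h, 1))               # b[1]: one bias per hidden node
W2 = rng.standard_normal((n_y, n_h))  # w[2]: one row per o/p node
b2 = np.zeros((n_y, 1))               # b[2]: one bias per o/p node

for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]:
    print(name, p.shape)
```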
 
#### Notations :
 
* $\mathrm{a}^{[l]}_{i}$ = activation of node $i$ in layer $l$, where 
 * $l$ = index of the layer in the neural network. 
 * $i$ = index of the node within that layer.

# Computing output

---


Here we'll see how each node computes its activation ($a_{i}$) and how these activations give the o/p of the neural network.

Each node computes its o/p in 2 steps:
* **Step-1:** It calculates $z = w*x + b$.
* **Step-2:** It applies the activation function (here the sigmoid) to get $a = \sigma(z)$.
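
As a minimal sketch of these two steps for a single node, assuming made-up values for `x`, `w` and `b` (they do not come from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])     # 4 i/p features
w = np.array([0.1, -0.2, 0.3, -0.1])   # one weight per feature
b = 0.5                                 # bias of this node

z = np.dot(w, x) + b   # Step-1: weighted sum plus bias
a = sigmoid(z)         # Step-2: squash z into (0, 1)

print(z, a)
```

Here `w*x` is read as a dot product between the weight vector and the feature vector, which is an interpretation of the notation used in the text.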

Now if we compute the o/p from each node of the hidden layer, we get-
* from the 1st node of the hidden layer, we get-

 $\mathrm{a}^{[1]}_{1} = \sigma(\mathrm{z}^{[1]}_{1})$ where $\mathrm{z}^{[1]}_{1} = \mathrm{w}^{[1]}_{1}*x + \mathrm{b}^{[1]}_{1}$
* from the 2nd node of the hidden layer, we get-

 $\mathrm{a}^{[1]}_{2} = \sigma(\mathrm{z}^{[1]}_{2})$ where $\mathrm{z}^{[1]}_{2} = \mathrm{w}^{[1]}_{2}*x + \mathrm{b}^{[1]}_{2}$
* from the 3rd node of the hidden layer, we get-

 $\mathrm{a}^{[1]}_{3} = \sigma(\mathrm{z}^{[1]}_{3})$ where $\mathrm{z}^{[1]}_{3} = \mathrm{w}^{[1]}_{3}*x + \mathrm{b}^{[1]}_{3}$
* from the 4th node of the hidden layer, we get-

 $\mathrm{a}^{[1]}_{4} = \sigma(\mathrm{z}^{[1]}_{4})$ where $\mathrm{z}^{[1]}_{4} = \mathrm{w}^{[1]}_{4}*x + \mathrm{b}^{[1]}_{4}$
* from the 5th node of the hidden layer, we get-

 $\mathrm{a}^{[1]}_{5} = \sigma(\mathrm{z}^{[1]}_{5})$ where $\mathrm{z}^{[1]}_{5} = \mathrm{w}^{[1]}_{5}*x + \mathrm{b}^{[1]}_{5}$
 
Now if we compute the o/p from the o/p layer, we get-
* $\mathrm{a}^{[2]} = \sigma(\mathrm{z}^{[2]})$ where $\mathrm{z}^{[2]} = \mathrm{w}^{[2]}*\mathrm{a}^{[1]} + \mathrm{b}^{[2]}$


**The o/p of each layer is-**
$$\mathrm{a}^{[l]} = \sigma(\mathrm{z}^{[l]}) \quad \text{where} \quad \mathrm{z}^{[l]} = \mathrm{w}^{[l]}*\mathrm{a}^{[l-1]} + \mathrm{b}^{[l]}$$
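
The general per-layer formula can be sketched as a loop over the layers for one training example. The `params` dictionary, the random weights and the input `x` below are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)  # fixed seed, only for reproducibility

# Hypothetical parameters: layer index -> (w[l], b[l]), shapes as in the text.
params = {
    1: (rng.standard_normal((5, 4)) * 0.01, np.zeros((5, 1))),
    2: (rng.standard_normal((1, 5)) * 0.01, np.zeros((1, 1))),
}

x = rng.standard_normal((4, 1))  # one made-up example with 4 features

a = x  # a[0] is the i/p itself
for l in (1, 2):
    W, b = params[l]
    z = W @ a + b    # z[l] = w[l] * a[l-1] + b[l]
    a = sigmoid(z)   # a[l] = sigma(z[l])

y_hat = a  # a[2] is the prediction
print(y_hat.shape)
```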



 




# Vectorizing across multiple training examples

---


In order to implement these equations for 'm' training examples, we need to run a for loop from 1 to m.

Now 
* for the 1st training example $x(1)$, the activation will be $\mathrm{a}^{[2]}(1)$ and the o/p will be $\hat{y}(1)$.
* for the 2nd training example $x(2)$, the activation will be $\mathrm{a}^{[2]}(2)$ and the o/p will be $\hat{y}(2)$.
* similarly for the m-th training example $x(m)$, the activation will be $\mathrm{a}^{[2]}(m)$ and the o/p will be $\hat{y}(m)$.

So the for loop will look like,


```
for i = 1 to m:
      z[1](i) = w[1]*x(i) + b[1]
      a[1](i) = sigmoid(z[1](i))
      z[2](i) = w[2]*a[1](i) + b[2]
      a[2](i) = sigmoid(z[2](i))
```
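
A runnable version of this loop might look like the sketch below. The data `X`, the parameter values and the names `W1`, `b1`, `W2`, `b2` are assumptions for illustration; only the structure of the loop follows the pseudocode.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)  # fixed seed, only for reproducibility
m = 3                            # number of training examples (made up)
X = rng.standard_normal((4, m))  # each column is one example x(i)

W1, b1 = rng.standard_normal((5, 4)) * 0.01, np.zeros((5, 1))
W2, b2 = rng.standard_normal((1, 5)) * 0.01, np.zeros((1, 1))

y_hat = np.zeros((1, m))
for i in range(m):               # Python indexes from 0, so this covers examples 1..m
    x_i = X[:, i:i + 1]          # column i, kept as shape (4, 1)
    z1 = W1 @ x_i + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    y_hat[:, i:i + 1] = a2       # store the prediction for example i

print(y_hat)
```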

Now for vectorization,

* $\mathrm{Z}^{[1]} = \mathrm{w}^{[1]}*X+ \mathrm{b}^{[1]}$ or, as $X = \mathrm{A}^{[0]}$, we can say $\mathrm{Z}^{[1]} = \mathrm{w}^{[1]}*\mathrm{A}^{[0]}+ \mathrm{b}^{[1]}$
* $\mathrm{A}^{[1]} = \sigma({\mathrm{Z}^{[1]}})$
* $\mathrm{Z}^{[2]} = \mathrm{w}^{[2]}*\mathrm{A}^{[1]}+ \mathrm{b}^{[2]}$
* $\mathrm{A}^{[2]} = \sigma({\mathrm{Z}^{[2]}})$

where , 
$$X = 
 \begin{pmatrix}
  \vdots  & \vdots  &  \vdots & \vdots \\
  \mathrm{x}(1) & \mathrm{x}(2) & \cdots & \mathrm{x}(m) \\ 
  \vdots  & \vdots  &  \vdots & \vdots \\
 \end{pmatrix}$$


$$\mathrm{Z}^{[1]} = 
 \begin{pmatrix}
  \vdots  & \vdots  &  \vdots & \vdots \\
  \mathrm{z}^{[1]}(1) & \mathrm{z}^{[1]}(2) & \cdots & \mathrm{z}^{[1]}(m) \\ 
  \vdots  & \vdots  &  \vdots & \vdots \\
 \end{pmatrix}$$
 
 $$\mathrm{A}^{[1]} = 
 \begin{pmatrix}
  \vdots  & \vdots  &  \vdots & \vdots \\
  \mathrm{a}^{[1]}(1) & \mathrm{a}^{[1]}(2) & \cdots & \mathrm{a}^{[1]}(m) \\ 
  \vdots  & \vdots  &  \vdots & \vdots \\
 \end{pmatrix}$$
 
* Horizontally, the matrix A runs over the different training examples; vertically, its rows correspond to the different hidden units.
* A similar intuition holds for the matrix Z and for X: horizontally they run over the different training examples, and vertically X corresponds to the different input features.
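
The column-per-example layout can be checked with a short sketch that runs the vectorized forward pass once and compares it with an example-by-example loop. All values and names here are assumptions for illustration; the bias `b1` of shape (5, 1) is added to every column through NumPy broadcasting.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)  # fixed seed, only for reproducibility
m = 6
X = rng.standard_normal((4, m))           # columns are training examples
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((1, 5)), rng.standard_normal((1, 1))

# Vectorized: all m examples at once.
Z1 = W1 @ X + b1       # shape (5, m): rows = hidden units, columns = examples
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2      # shape (1, m)
A2 = sigmoid(Z2)

# Reference: one example at a time, stacked column by column.
A2_loop = np.hstack([
    sigmoid(W2 @ sigmoid(W1 @ X[:, i:i + 1] + b1) + b2) for i in range(m)
])

print(Z1.shape, A2.shape, np.allclose(A2, A2_loop))
```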

# Activation Functions

---


There are many activation functions used to compute the o/p from the layers of a Neural Network. Activation functions can be divided into 2 classes:
* Linear Activation Functions
* Non-linear Activation Functions

A Linear Activation Function is used when training a model using Linear Regression.

In Deep Learning a neural network model consists of many hidden layers. With a Linear Activation Function the hidden layers add nothing: a composition of linear functions is itself a linear function, so the whole network computes no more than a single linear layer would. So a Non-linear Activation Function is needed in the hidden layers.
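
To see why a linear activation is not enough, the sketch below stacks two layers with the identity activation and shows that they match one equivalent linear layer. The names `W_eq` and `b_eq` and all values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed, only for reproducibility
X = rng.standard_normal((4, 3))
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((1, 5)), rng.standard_normal((1, 1))

# Two layers with a linear (identity) activation: a = z.
A1 = W1 @ X + b1
A2 = W2 @ A1 + b2

# The same mapping written as one linear layer.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
single_layer = W_eq @ X + b_eq

print(np.allclose(A2, single_layer))
```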

## Sigmoid Function

* **Function :** $\sigma(z) = \frac{1}{1 + \mathrm{e}^{-z}}$

* **Derivative :** $\sigma'(z) = \sigma(z)*(1-\sigma(z))$

 ![alt text](https://miro.medium.com/max/500/1*Myto4ZQagAOoyom4tqkaRQ.png)
  
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s easy to work with and has all the nice properties of activation functions: it’s non-linear, continuously differentiable, monotonic, and has a fixed output range.

* **Pros:** 
 1. It is nonlinear in nature. Combinations of this function are also nonlinear.
 2. The output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function. So we have our activations bound in a range.
* **Cons:**
 1. Towards either end of the sigmoid function, the Y values respond very little to changes in X.
 2. It gives rise to the problem of “vanishing gradients”.
 
   **Vanishing Gradients:** The gradient is small or has vanished (it cannot make a significant change because of its extremely small value). The network learns drastically slower or stops learning altogether.
 3. Its output isn’t zero-centered (0 < output < 1). This makes the gradient updates go too far in different directions, which makes optimization harder.



In [None]:
import numpy as np

In [None]:
def sigmoid(z):
  return 1/(1+np.exp(-z))

In [None]:
sigmoid(np.array([100, 50, 10, -10, -50, -100]))

array([1.00000000e+00, 1.00000000e+00, 9.99954602e-01, 4.53978687e-05,
       1.92874985e-22, 3.72007598e-44])
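
The vanishing-gradient behaviour can be seen by evaluating the derivative $\sigma'(z) = \sigma(z)*(1-\sigma(z))$ at a few points. This is a small sketch; the helper name `sigmoid_derivative` and the sample points are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_derivative(z)

# The derivative peaks at z = 0 (value 0.25) and shrinks towards 0 at both ends.
print(grads)
```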

## Tanh or Hyperbolic Tangent Activation Function
tanh is similar to sigmoid, but its output is zero-centered: the range of the tanh function is (-1, 1). 

* **Function:** $\tanh(z) = \frac{e^z - \mathrm{e}^{-z}}{e^z + \mathrm{e}^{-z}}$
* **Derivative:** $\tanh'(z) = 1 - \tanh(z)^2$

![alt text](https://miro.medium.com/max/500/1*51Q7QouspCkOvENni2RwfQ.png)

* **Pros:**
 1. The gradient is stronger for tanh than for sigmoid (its derivative is steeper around 0).
 
* **Cons:**
 1. Tanh also has the vanishing gradient problem. 



In [None]:
def tanh(z):
  # (e^z - e^-z) / (e^z + e^-z); note the '+' in the denominator
  t = np.divide((np.exp(z) - np.exp(-z)), (np.exp(z) + np.exp(-z)))
  return t

In [None]:
tanh(np.array([100, 10, 1, 0, -1, -10, -100]))

array([ 1.        ,  1.        ,  0.76159416,  0.        , -0.76159416,
       -1.        , -1.        ])
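
As a cross-check, the formula for tanh can be compared with NumPy's built-in `np.tanh`, and the derivative $1 - \tanh(z)^2$ can be evaluated at a few points. The helper name `tanh_derivative` and the sample points are assumptions for illustration.

```python
import numpy as np

def tanh_derivative(z):
    """tanh'(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# The hand-written formula from the text, compared with the built-in.
manual = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
matches_builtin = np.allclose(manual, np.tanh(z))

grads = tanh_derivative(z)  # 1 at z = 0, close to 0 for large |z|
print(matches_builtin, grads)
```

The slope at 0 is 1 for tanh versus 0.25 for sigmoid, while for large $|z|$ the derivative still shrinks towards 0, which is the vanishing gradient problem again. For very large inputs the hand-written formula can overflow in `np.exp`, so `np.tanh` is the safer choice in practice.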

## ReLU
ReLU stands for Rectified Linear Unit. The formula is deceptively simple: $\max(0,z)$. Despite its name and appearance, it’s not linear (it is only piecewise linear) and provides the same benefits as Sigmoid but with better performance.

* **Function:** $ R(z) =
  \begin{cases}
    z       & \quad \text{if } z >0\\
    0  & \quad \text{if } z \leq 0
  \end{cases}
$

* **Derivative:** $ R'(z) =
  \begin{cases}
    1       & \quad \text{if } z >0\\
    0  & \quad \text{if } z < 0
  \end{cases}
$

![alt text](https://miro.medium.com/max/1000/1*m_0v2nY5upLmCU-0SuGZXg.png)

* **Pros:**
 1. It mitigates the vanishing gradient problem: for $z > 0$ the derivative is 1, so the gradient does not shrink.
 2. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.
 
* **Cons:**
 1. One of its limitations is that it should only be used within the hidden layers of a Neural Network model.
 2. Some gradients can be fragile during training and can die: a weight update can leave a neuron in a state where it never activates on any data point again.
 3. Simply put, ReLU can result in dead neurons.
 4. In other words, for activations in the region ($z<0$) of ReLU, the gradient is 0, so the weights do not get adjusted during gradient descent. Neurons that go into that state stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem.
 5. The range of ReLU is [0, inf). This means the activation is unbounded and can blow up.


In [None]:
def relu(z):
  # element-wise max(0, z)
  r = np.maximum(0, z)
  return r

In [None]:
relu(np.array([10, -10]))

array([10,  0])
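
The ReLU function and its derivative can be sketched together to show where the gradient is 0. The helper name `relu_derivative` is an assumption, and so is the convention of returning 0 at $z = 0$, where the derivative is not defined.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def relu_derivative(z):
    # 1 where z > 0, else 0. Using 0 at z == 0 is a convention assumed here.
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # negative inputs are clipped to 0
print(relu_derivative(z))  # no gradient flows for z <= 0
```

A node whose $z$ is negative for every training example gets a gradient of 0 everywhere, which is the dying ReLU situation described in the cons.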

## LeakyReLU
LeakyReLU is a variant of ReLU. Instead of being 0 when $z<0$, a leaky ReLU allows a small, non-zero, constant gradient $\alpha$ (normally, $\alpha = 0.01$).

* **Function:** $ R(z) =
  \begin{cases}
    z       & \quad \text{if } z >0\\
    \alpha z  & \quad \text{if } z \leq 0
  \end{cases}
$
* **Derivative:** $ R'(z) =
  \begin{cases}
    1       & \quad \text{if } z >0\\
    \alpha  & \quad \text{if } z \leq 0
  \end{cases}
$

![alt text](https://miro.medium.com/max/500/1*gDIUV3yonKbIWh_9Kl4ShQ.png)

* **Pros:**
 1. Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).
* **Cons:**
 1. It is linear on each side of zero, and for some use cases it lags behind Sigmoid and Tanh.

In [None]:
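def leakyrelu(z, alpha=0.01):
  # element-wise max(alpha*z, z): z for z > 0, alpha*z otherwise (for 0 < alpha < 1)
  lr = np.maximum(alpha * z, z)
  return lr

In [None]:
leakyrelu(np.array([10, -10]))

array([10. , -0.1])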
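
The piecewise definition and its derivative can also be written directly with `np.where`. This is a sketch; the function names and the default `alpha=0.01` follow the text, but the code itself is an assumption and not taken from the original notes.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """1 for z > 0, alpha otherwise (so the gradient is never exactly 0)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-10.0, 0.0, 10.0])
print(leaky_relu(z))
print(leaky_relu_derivative(z))
```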

# Initializing Parameters
To train a Neural Network we cannot initialize all the parameters ('w' & 'b') to 0. If we do so, the hidden nodes become symmetric: every node in a layer computes the same function, so the o/p of the 1st hidden node $\mathrm{a}^{[1]}_{1}$ = the o/p of the 2nd hidden node $\mathrm{a}^{[1]}_{2}$, and both receive the same gradient update. So the hidden nodes never learn different o/ps if 'w' & 'b' are initialized to 0.

So we initialize 'w' randomly. 'b' doesn't cause this problem, so we can initialize it to 0.

In [None]:
w = np.random.randn(2,2)*0.01
b = np.zeros((2,1))
print(w)
print(b)

[[-0.01275736  0.01106708]
 [-0.01041753 -0.00275047]]
[[0.]
 [0.]]
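
The symmetry problem with all-zero initialization can be illustrated with a few gradient descent steps on made-up data. This sketch assumes a sigmoid o/p with cross-entropy loss (which gives `dZ2 = A2 - Y`), a learning rate of 0.5 and random data; none of these come from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)                 # fixed seed, only for reproducibility
X = rng.standard_normal((4, 8))                # made-up inputs, 8 examples
Y = (rng.random((1, 8)) > 0.5).astype(float)   # made-up binary labels
m = X.shape[1]

W1, b1 = np.zeros((5, 4)), np.zeros((5, 1))    # all-zero initialization
W2, b2 = np.zeros((1, 5)), np.zeros((1, 1))

for _ in range(20):                            # a few gradient descent steps
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y                               # assumes cross-entropy loss
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    W1 -= 0.5 * dW1
    b1 -= 0.5 * db1
    W2 -= 0.5 * dW2
    b2 -= 0.5 * db2

# Every hidden node still has the same weights and the same activations.
rows_identical = np.allclose(W1, W1[0])
print(rows_identical, np.allclose(A1, A1[0]))
```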


To initialize 'w' we've multiplied the random numbers by 0.01, because it is preferable to initialize the weights with small values. While using 'Sigmoid' or 'tanh' as the activation function, a bigger initial value of 'w' makes $z$ large in magnitude, which places it where the gradient of those activation functions is close to 0. So gradient descent will be much slower.

For this reason we usually initialize our weights with small random values.
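
The effect of the scale factor can be sketched by comparing the average sigmoid gradient for small and large initial weights. The helper `mean_sigmoid_gradient`, the scale of 10 used for the "large" case and the random data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)       # fixed seed, only for reproducibility
X = rng.standard_normal((4, 1000))   # made-up inputs
W_base = rng.standard_normal((5, 4)) # unscaled random weights

def mean_sigmoid_gradient(scale):
    """Average of sigma'(z) over all hidden nodes and examples."""
    Z = (W_base * scale) @ X
    S = sigmoid(Z)
    return float(np.mean(S * (1.0 - S)))

small = mean_sigmoid_gradient(0.01)  # z stays near 0, gradient near its maximum 0.25
large = mean_sigmoid_gradient(10.0)  # z is often far from 0, gradient mostly near 0

print(small, large)
```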