# Neural Networks

## 1. Definition

<img src="images/deep_nn2.png">

## 2. Notations

- $m$ : number of examples in the dataset
- $n_x$ : input size
- $n_y$ : output size (or number of classes)


- $X \in \mathbb{R}^{n_x \times m}$ is the input matrix
- $x \in \mathbb{R}^{n_x \times 1}$ is a single input vector
- $x^{(i)} \in \mathbb{R}^{n_x}$ is the $i^{th}$ example represented as a column vector
- $x^{(i)}_j$ is the $j^{th}$ feature value of the $i^{th}$ training example
- $Y \in \mathbb{R}^{n_y \times m}$ is the label matrix
- $y^{(i)} \in \mathbb{R}^{n_y}$ is the output label for the $i^{th}$ example (**onehot**)
- $w \in \mathbb{R}^{n_x \times 1}$ is the weight vector
- $b \in \mathbb{R}$ is the bias

## 3. NN Architecture 
An NN is based on a collection of connected units or nodes (artificial neurons). A neuron that receives a signal processes it and then signals the neurons connected to it.

Signals travel from the first layer (the **input layer**) to the last layer (the **output layer**), passing through the **hidden layers** in between.
<img src="images/architecture.svg">

**Note**: An NN has exactly one input layer and one output layer, but can have multiple hidden layers.
<img src="images/type_nn.png">

### Why do we need Deep Neural Networks?

<img src="images/deep_nn.jpg">

## 4. Neuron Model (Logistic Unit)

<img src="images/neuron.png">

Input unit: $x = \left [
\begin{aligned}
x_1\\
x_2\\
\vdots\\
x_{n_x}
\end{aligned}
\right ]_{(n_x × 1 )}$

Weight unit: $w = \left [
\begin{aligned}
w_1\\
w_2\\
\vdots\\
w_{n_x}
\end{aligned}
\right ]_{(n_x × 1 )}$

Bias unit: $b \in \mathbb{R}$

$z = w^Tx + b$

Activation unit: $a = \sigma (z)$
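
The neuron above can be sketched in NumPy as follows. This is a minimal illustration, assuming $\sigma$ is the logistic (sigmoid) function; the helper names `sigmoid` and `neuron_forward` and the numeric values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Compute a = sigma(w^T x + b) for one input column vector x."""
    z = (w.T @ x).item() + b   # w^T x is a 1x1 matrix; .item() extracts the scalar
    return sigmoid(z)

# Illustrative values with n_x = 3 (not taken from the text)
x = np.array([[1.0], [2.0], [3.0]])    # shape (n_x, 1)
w = np.array([[0.5], [-0.25], [0.0]])  # shape (n_x, 1)
b = 0.0

a = neuron_forward(x, w, b)  # z = 0.5 - 0.5 + 0.0 + 0.0 = 0, so a = sigma(0) = 0.5
```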

## 5. Network Model (Set of Neurons)

<img src="images/neural_net.jpeg"> 

#### Notations:

- $L$: Number of layers (not counting the input layer).
- $n^{[l]}$: Number of units in layer $l^{th}$, $n^{[0]} = n_x$, $0 \leq l \leq L$.
- $\sigma^{[l]}$: Activation function of layer $l^{th}$.
- $a^{[l]}$: Activation in layer $l^{th}$, $a^{[0]} = x$.
- $a^{[l]}_i$: Activation of unit $i^{th}$ in layer $l^{th}$, $1 \leq i \leq n^{[l]}$.
- $W^{[l]}$: Weight matrix in layer $l^{th}$.
- $w^{[l]}_i$: Weight vector of unit $i^{th}$ in layer $l^{th}$.
- $b^{[l]}$: Bias vector in layer $l^{th}$.
- $b^{[l]}_i$: Bias of unit $i^{th}$ in layer $l^{th}$.

### 5.1. The unit $i^{th}$ in layer $l^{th}$
Let: $\large z^{[l]}_i = w^{[l]T}_i a^{[l-1]} + b^{[l]}_i \in \mathbb{R} $

$\large a^{[l]}_i = \sigma^{[l]}{( w^{[l]T}_i a^{[l-1]} + b^{[l]}_i)} \in \mathbb{R}$

With: 

$w^{[l]}_i = \left [  
\begin{aligned} 
&w^{[l]}_{i,1}\\ 
&w^{[l]}_{i,2}\\
&...\\
&w^{[l]}_{i,n^{[l-1]}}
\end{aligned} 
\right ] \in \mathbb{R}^{n^{[l-1]} × 1}$, 
$b^{[l]}_i \in \mathbb{R}$ and $a^{[l]} = \left [  
\begin{aligned} 
&a^{[l]}_{1}\\ 
&a^{[l]}_{2}\\
&...\\
&a^{[l]}_{n^{[l]}}
\end{aligned} 
\right ] \in \mathbb{R}^{n^{[l]} × 1}$

$a^{[0]} = x $

### 5.2. The layer $l^{th}$ for ONE example
Let: $\large z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]} \in \mathbb{R}^{n^{[l]} × 1}$

$\large a^{[l]} = \sigma^{[l]}{(z^{[l]})} \in \mathbb{R}^{n^{[l]} × 1}$

With:

$
W^{[l]} = \left [  
\begin{aligned} 
---&w^{[l]T}_1--- \\ 
---&w^{[l]T}_2--- \\
&...\\
---&w^{[l]T}_{n^{[l]}}---
\end{aligned} \right ] \in \mathbb{R}^{n^{[l]} × n^{[l-1]}}$

$b^{[l]} = \left [  
\begin{aligned} 
&b^{[l]}_{1}\\ 
&b^{[l]}_{2}\\
&...\\
&b^{[l]}_{n^{[l]}}
\end{aligned} 
\right ] \in \mathbb{R}^{n^{[l]} × 1}$
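
A short sketch of one layer applied to one example, to make the shapes concrete. It assumes a sigmoid activation and hypothetical layer sizes; `layer_forward` is an illustrative name. $W^{[l]}$ has one row per unit of layer $l$, and $b^{[l]}$ holds one bias per unit of layer $l$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b, activation=sigmoid):
    """One layer, one example: a = activation(W a_prev + b)."""
    z = W @ a_prev + b      # (n_l, n_prev) @ (n_prev, 1) + (n_l, 1) -> (n_l, 1)
    return activation(z)

n_prev, n_l = 3, 2                      # hypothetical sizes n^{[l-1]} and n^{[l]}
rng = np.random.default_rng(0)
W = rng.standard_normal((n_l, n_prev))  # row i holds w_i^T
b = np.zeros((n_l, 1))
a_prev = rng.standard_normal((n_prev, 1))

a = layer_forward(a_prev, W, b)         # shape (n_l, 1)

# The first entry of a matches the single-unit formula sigma(w_1^T a_prev + b_1)
a_unit_0 = sigmoid(W[0] @ a_prev[:, 0] + b[0, 0])
```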



### 5.3. The layer $l^{th}$ for ALL examples

Let: $\large Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$, where $b^{[l]}$ is added to every column of $W^{[l]} A^{[l-1]}$.

$\large A^{[l]} = \sigma^{[l]}(Z^{[l]})$

With: $A^{[l]} = \left [ a^{[l](1)} \; a^{[l](2)} \; ... \; a^{[l](m)}\right ] \in \mathbb{R}^{n^{[l]} × m}$

$a^{[l](j)}$: Activation of layer $l^{th}$ computed from example $j^{th}$.

$\Longrightarrow A^{[0]} = X = \left [ x^{(1)} x^{(2)} ... x^{(m)}\right ]$
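
The vectorized recurrence can be sketched as a loop over layers. This is only an illustration under stated assumptions: every layer uses a sigmoid activation, weights are small random values, and the names `init_params` and `forward` as well as the layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_sizes, seed=0):
    """layer_sizes = [n_0, n_1, ..., n_L]; returns {l: (W, b)} for l = 1..L."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        # Small random weights (an assumed initialization), zero biases
        W = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        b = np.zeros((layer_sizes[l], 1))
        params[l] = (W, b)
    return params

def forward(X, params):
    """Return [A0, A1, ..., AL], where A0 = X and each Al has shape (n_l, m)."""
    activations = [X]
    for l in sorted(params):
        W, b = params[l]
        Z = W @ activations[-1] + b   # b of shape (n_l, 1) is broadcast across the m columns
        activations.append(sigmoid(Z))
    return activations

layer_sizes = [4, 5, 3]   # hypothetical: n_x = 4, one hidden layer of 5 units, n_y = 3
m = 6
X = np.random.default_rng(1).standard_normal((layer_sizes[0], m))
params = init_params(layer_sizes)
A = forward(X, params)
```

Column $j$ of each matrix in `A` is the activation obtained by running example $j$ through the network on its own.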

## 6. Multi-class Classification
To make a neural network work for multi-class classification, we can use the **One-vs-All** approach.
<img src="images/multi_class.png"> 


### 6.1. Softmax
$n^{[L]} = n_y$

$Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}  \in \mathbb{R}^{n_y × m}$

The activation function of the last layer is the **Softmax** function, applied to each column $z^{[L]}$ of $Z^{[L]}$:

$$\large a^{[L]}_i = Softmax(z^{[L]})_i = \frac{e^{z^{[L]}_i}}{\sum_{k=1}^{n_y} e^{z^{[L]}_k}}$$

$A^{[L]} \in \mathbb{R}^{n_y × m}$

$a^{[L]}_i > 0, \; 1 \leq i \leq n_y$

$\sum_{i=1}^{n_y} a^{[L]}_i  = 1$
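
A possible NumPy sketch of a column-wise softmax. Subtracting the column maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result. The input values are illustrative.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (n_y, m)."""
    # Subtracting the column max leaves the result unchanged but avoids overflow in exp
    shifted = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(shifted)
    return expZ / expZ.sum(axis=0, keepdims=True)

Z = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 0.0]])   # n_y = 3, m = 2 (illustrative values)
A = softmax(Z)               # every column is positive and sums to 1
```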

### 6.2. Using One Hot encodings
Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows:
<img src="images/onehot.png" style="width:600px;height:150px;">
This is called a "one hot" encoding, because in the converted representation exactly one element of each column is "hot" (meaning set to 1).

After  encoding:

$Y = [y^{(1)} ... y^{(m)}] \in \mathbb{R}^{n_y × m}$ is the label matrix (**Onehot**)

$y^{(i)} \in \mathbb{R}^{n_y}$ is the output label for the $i^{th}$ example

$y^{(i)}_j = 1$ if the label of the $i^{th}$ example is $j - 1$, and $y^{(i)}_j = 0$ otherwise ($1 \leq j \leq n_y$; labels start at 0 while vector indices start at 1)
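
One way to build the one-hot matrix in NumPy, as a sketch: the function name `one_hot` and the label values are illustrative. NumPy rows are 0-indexed, so label $k$ sets row $k$ of the array.

```python
import numpy as np

def one_hot(y, n_y):
    """Convert labels y (shape (m,), values in 0..n_y-1) to an (n_y, m) matrix."""
    y = np.asarray(y)
    Y = np.zeros((n_y, y.size))
    Y[y, np.arange(y.size)] = 1.0   # row = label, column = example index
    return Y

y = np.array([1, 2, 3, 0, 2, 1])   # illustrative labels with C = 4
Y = one_hot(y, 4)                  # shape (4, 6), exactly one 1 per column
```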

## 7. Cost function
The cost function for the neural network is quite similar to the logistic regression cost function.

The activation function of the last layer is the **Softmax** function.

The labels are **OneHot** encoded.

$\large J\left(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}\right) 
= \frac{-1}{m} \sum_{i=1}^m \sum_{j=1}^{n_y} y^{(i)}_j \log {a^{[L](i)}_j}$
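
The cost can be sketched in NumPy as below, taking the softmax output $A^{[L]}$ and the one-hot matrix $Y$, both of shape $(n_y, m)$. The small `eps` inside the logarithm is an added safeguard against $\log 0$ and is not part of the formula; the name `cross_entropy_cost` and the numbers are illustrative.

```python
import numpy as np

def cross_entropy_cost(A_L, Y, eps=1e-12):
    """J = -(1/m) * sum_i sum_j Y[j, i] * log(A_L[j, i])."""
    m = Y.shape[1]
    # eps is an assumed safeguard (not in the formula) against log(0)
    return -np.sum(Y * np.log(A_L + eps)) / m

# Illustrative values: n_y = 3 classes, m = 2 examples
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
A_L = np.array([[0.7, 0.2],
                [0.2, 0.5],
                [0.1, 0.3]])
J = cross_entropy_cost(A_L, Y)   # equals -(log 0.7 + log 0.5) / 2 up to eps
```

Because $Y$ is one-hot, only the predicted probability of the true class of each example contributes to the sum.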