                                    All rights reserved © Global AI Hub 2020 

<div style="text-align:center"><img src="logo.jpeg" /></div>

# Neural Networks



Perhaps the easiest way to think about artificial intelligence, machine learning, neural networks, and deep learning is to think of them like Russian nesting dolls. Each is essentially a component of the prior term.

<div style="text-align:center"><img src="russian.jpg" /></div>

> Neural networks—and more specifically, artificial neural networks (ANNs)—mimic the human brain through a set of algorithms. At a basic level, a neural network is comprised of four main components: inputs, weights, a bias or threshold, and an output. 

<div style="text-align:center"><img src="structure.jpg" /></div>

## The Neuron

AI may have come on in leaps and bounds in the last few years, but we’re still some way from truly intelligent machines – machines that can reason and make decisions like humans. Artificial neural networks (ANNs) may provide part of the answer to this.

Human brains are made up of connected networks of neurons. ANNs seek to simulate these networks and get computers to act like interconnected brain cells, so that they can learn and make decisions in a more humanlike manner.


<div style="text-align:center"><img src="bio_neuron.png" /></div>
<h5><center>Biological Neurons</center></h5>

- ANNs are modeled on biological neural networks in the brain. The brain is made up of cells called neurons, which send signals to each other through connections known as synapses. Neurons transmit electrical signals to other neurons based on the signals they themselves receive from other neurons.

- An artificial neuron simulates how a biological neuron behaves by adding together the values of the inputs it receives. If this sum is above some threshold, it sends its own signal to its output, which is then received by other neurons. Each input is first multiplied by a weight that reflects its importance. For example, if input A were twice as important as input B, input A would have twice the weight. Weights can also be negative, in which case the input pushes the neuron away from firing, while a weight near zero means the input has little influence. Training a neural network means adjusting these weights so that the final output of the network gives the right answer.

<div style="text-align:center"><img src="neuron.png" /></div>
<h5><center>Deep Learning Neurons</center></h5>
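The threshold neuron described above can be sketched in a few lines. This is a minimal illustration only: the function name `neuron_fires`, the example weights, and the threshold value are assumptions chosen for the example, not part of the original material.

```python
import numpy as np

def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = np.dot(weights, inputs)
    return int(weighted_sum > threshold)

# Hypothetical weights: input A counts twice as much as input B.
weights = np.array([2.0, 1.0])

# Both inputs active: weighted sum is 3.0, which is above the threshold of 2.5.
fired = neuron_fires(np.array([1.0, 1.0]), weights, threshold=2.5)

# Only input B active: weighted sum is 1.0, which is below the threshold.
not_fired = neuron_fires(np.array([0.0, 1.0]), weights, threshold=2.5)
```

Training would adjust `weights` (and the threshold, usually expressed as a bias) rather than fixing them by hand as done here.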

## The Activation Function

- Activation functions are mathematical functions that determine the output of each neuron in a layer. The function is attached to each neuron in the network and determines whether, and how strongly, the neuron should be activated based on its input. Some activation functions also normalize the output of each neuron: sigmoid maps it to a range between 0 and 1, and tanh to a range between -1 and 1.

- An additional aspect of activation functions is that they must be computationally efficient, because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks are trained with a technique called backpropagation, which also evaluates the derivative of the activation function, so both the function and its derivative should be cheap to compute.

<div style="text-align:center"><img src="most_common_activations.png"/></div>
<h5><center>Most common Neural Network Activation Functions</center></h5>

- Imagine a neural network without activation functions. In that case, every neuron would only perform a linear transformation on its inputs using the weights and biases. Because a composition of linear transformations is itself linear, the whole network would be equivalent to a single linear layer, no matter how many layers it has. Such a network would be simpler, but it would be less powerful and unable to learn complex patterns from the data.

<div style="text-align:center"><img src="act.png"/></div>
<h5><center>Deep Learning Neurons with Activation Function</center></h5>
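As a rough sketch of the points in this section, the snippet below defines three common activation functions and then checks numerically that two stacked layers without an activation collapse into a single linear layer. The layer sizes and random values are arbitrary assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
sig_out = sigmoid(z)     # values strictly between 0 and 1
tanh_out = np.tanh(z)    # values strictly between -1 and 1
relu_out = relu(z)       # negative input clipped to 0

# Two layers with no activation function in between...
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 5))
stacked = W2 @ (W1 @ x + b1) + b2

# ...give the same result as one linear layer with combined weights and bias.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
```

Inserting a non-linear function such as `relu` between the two layers is what prevents this collapse.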

## How do Neural Networks work?

- Deep learning neural networks (called deep neural networks) are modeled on the way scientists believe the human brain works. They process and reprocess data, gradually refining the analysis and results to accurately recognize, classify, and describe objects within the data.

- Deep neural networks consist of multiple layers of interconnected nodes, each layer building on the output of the previous one to extract and identify progressively more complex features and patterns in the data. They then calculate the likelihood or confidence that the object or information can be classified or identified in one or more ways.

- The input and output layers of a deep neural network are called visible layers. The input layer is where the deep learning model ingests the data for processing, and the output layer is where the final identification, classification, or description is calculated.

- In between the input and output layers are hidden layers, where the outputs of each previous layer are weighted and transformed to zero in on the final outcome. This movement of calculations through the network is called forward propagation.

- Another process called backpropagation measures the error in the calculated predictions, attributes it to the weights and biases of each layer, and propagates it back through the previous layers to train or refine the model. Together, forward propagation and backpropagation allow the network to make predictions about the identity or class of the object while learning from its errors. The result is a system that learns as it works and becomes more accurate over time when processing large amounts of data.

## How do Neural Networks learn?

- Information flows through a neural network in two ways. When it's learning (being trained) or operating normally (after being trained), patterns of information are fed into the network via the input units, which trigger the layers of hidden units, and these in turn arrive at the output units. This common design is called a feedforward network. Not all units "fire" all the time. Each unit receives inputs from the units to its left, and the inputs are multiplied by the weights of the connections they travel along. Every unit adds up all the inputs it receives in this way and (in the simplest type of network) if the sum is more than a certain threshold value, the unit "fires" and triggers the units it's connected to (those on its right).

- For a neural network to learn, there has to be an element of feedback involved—just as children learn by being told what they're doing right or wrong. In fact, we all use feedback, all the time. Think back to when you first learned to play a game like ten-pin bowling. As you picked up the heavy ball and rolled it down the alley, your brain watched how quickly the ball moved and the line it followed, and noted how close you came to knocking down the skittles. Next time it was your turn, you remembered what you'd done wrong before, modified your movements accordingly, and hopefully threw the ball a bit better. So you used feedback to compare the outcome you wanted with what actually happened, figured out the difference between the two, and used that to change what you did next time ("I need to throw it harder," "I need to roll slightly more to the left," "I need to let go later," and so on). The bigger the difference between the intended and actual outcome, the more radically you would have altered your moves.

<div style="text-align:center"><img src="propagations.png"/></div>

<h5><center>Learning Process of a Neural Network</center></h5>

- Neural networks learn things in much the same way, typically by a feedback process called backpropagation (sometimes abbreviated as "backprop"). This involves comparing the output a network produces with the output it was meant to produce, and using the difference between them to modify the weights of the connections between the units in the network, working backward from the output units through the hidden units to the input units. In time, backpropagation causes the network to learn, reducing the difference between actual and intended output until the two closely match.

<div style="text-align:center"><img src="learn_example.PNG"/></div>

<h5><center>Example of Learning Process</center></h5>

## Forward Propagation

As we mentioned before, a neural network is organized in layers, where the first layer
contains the inputs of the network, the last layer is the output of the network, and the
layers in between are called hidden layers. Each layer gets its inputs from the layer
before, and passes its outputs to the next. We call this step forward propagation.

**Mathematical expression of the Forward Propagation (single neuron)**:

For one example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b $$
$$\hat{y}^{(i)} = a^{(i)} = g(z^{(i)}) = sigmoid(z^{(i)})$$ 
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})$$

The cost is then computed by averaging over all $m$ training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})$$


**Mathematical expression of the Forward Propagation (multi-layer network)**:

For the input layer, $X = A^{[0]}$:
$$Z^{[1]} = W^{[1]} X + B^{[1]} $$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$ 

Where:  
$W^{[l]}$: weight matrix of the l'th layer  
$B^{[l]}$: bias vector of the l'th layer  
$g^{[l]}$: activation function of the l'th layer  




For other layers:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + B^{[l]} $$
$$A^{[l]} = g^{[l]}(Z^{[l]})$$ 

Binary cross-entropy loss (as in logistic regression) for one example:
$$ \mathcal{L}(a^{[L](i)}, y^{(i)}) =  - y^{(i)}  \log(a^{[L](i)}) - (1-y^{(i)} )  \log(1-a^{[L](i)})$$

Where:  
$i$: i'th training example  
$L$: index of the final (output) layer  

The cost is then computed by averaging over all $m$ training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{[L](i)}, y^{(i)})$$

<div style="text-align:center"><img src="equation.png"/></div>
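The layer-by-layer equations above can be sketched with NumPy as follows. This is an illustrative sketch, not a reference implementation: the function name `forward_propagation`, the layer sizes, and the choice of ReLU for the hidden layer are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward_propagation(X, params, activations):
    """Compute A^[L] for inputs X of shape (n_features, m).

    params is a list of (W, B) pairs, one per layer;
    activations is a list of functions g, one per layer.
    """
    A = X  # A^[0] = X
    for (W, B), g in zip(params, activations):
        Z = W @ A + B   # Z^[l] = W^[l] A^[l-1] + B^[l]
        A = g(Z)        # A^[l] = g^[l](Z^[l])
    return A

# Hypothetical network: 3 inputs -> 4 hidden units -> 1 output, m = 5 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
params = [
    (rng.normal(size=(4, 3)) * 0.1, np.zeros((4, 1))),
    (rng.normal(size=(1, 4)) * 0.1, np.zeros((1, 1))),
]
A_L = forward_propagation(X, params, [relu, sigmoid])
```

Each column of `A_L` is the prediction $\hat{y}^{(i)}$ for one training example, so the whole batch is processed without an explicit loop over examples.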


## Cost Function

Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function or a loss function, and the value calculated by the loss function is referred to simply as the “loss”. Strictly speaking, the loss function (or error) is for a single training example, while the cost function is the average over the entire training set. The most common cost functions are Mean Squared Error (MSE), Mean Absolute Error (MAE), Categorical Cross Entropy, and Binary Cross Entropy.



 $$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)\right)$$
<h5><center>Cross Entropy Cost Function</center></h5> 
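A minimal sketch of the cross-entropy cost above is shown below. The function name `cross_entropy_cost` and the small clipping constant `eps` (added here only to avoid `log(0)`) are assumptions for the example.

```python
import numpy as np

def cross_entropy_cost(A_L, Y, eps=1e-12):
    """Binary cross-entropy cost averaged over m examples.

    A_L and Y both have shape (1, m).
    """
    A_L = np.clip(A_L, eps, 1.0 - eps)  # assumption: guard against log(0)
    m = Y.shape[1]
    losses = -(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L))
    return float(np.sum(losses) / m)

Y = np.array([[1, 0, 1]])

# Predictions close to the labels give a low cost...
good = cross_entropy_cost(np.array([[0.9, 0.1, 0.8]]), Y)

# ...while confident wrong predictions give a high cost.
bad = cross_entropy_cost(np.array([[0.1, 0.9, 0.2]]), Y)

# A prediction of 0.5 everywhere costs log(2), regardless of the labels.
uninformed = cross_entropy_cost(np.array([[0.5, 0.5, 0.5]]), Y)
```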

## Gradient Descent

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function (for a convex function, this is also the global minimum). To find a minimum using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

<div style="text-align:center"><img src="gradient_descent.png"/></div>
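The update rule can be illustrated on a simple one-dimensional function. The sketch below is only an illustration: the function $f(x) = (x - 3)^2$, the starting point, and the step size are assumptions chosen so the minimum (at $x = 3$) is known in advance.

```python
def gradient_descent(grad, x0, learning_rate, num_steps):
    """Repeatedly step in the direction of the negative gradient."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(
    grad=lambda x: 2.0 * (x - 3.0),
    x0=0.0,
    learning_rate=0.1,
    num_steps=100,
)
```

Because this function is convex, the local minimum found here is also the global one; for the non-convex cost functions of neural networks there is no such guarantee.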


Other keywords in optimization algorithms:
- RMSprop
- Momentum
- Adam Optimization

### Learning Rate

We need to define the step size of the gradient descent algorithm. We call this parameter the learning rate $(\alpha)$. Selecting an appropriate learning rate is one of the most important tuning steps in a deep learning application. If it is too small, the steps are tiny and training may take too long to reach a minimum. If it is too large, we may overshoot the minimum and fail to converge.

<div style="text-align:center"><img src="learning_rate.png"/></div>
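To make the effect of the learning rate concrete, the sketch below runs the same number of gradient descent steps on $f(x) = (x - 3)^2$ with three different values of $\alpha$. The function, the helper name `run`, and the three values are assumptions chosen for illustration.

```python
def run(learning_rate, num_steps=20, x0=0.0):
    """Run gradient descent on f(x) = (x - 3)^2 and return the final x."""
    x = x0
    for _ in range(num_steps):
        x -= learning_rate * 2.0 * (x - 3.0)  # gradient of f is 2 * (x - 3)
    return x

too_small = run(0.001)   # barely moves away from the starting point
reasonable = run(0.1)    # ends close to the minimum at x = 3
too_large = run(1.1)     # every step overshoots further: the iterates diverge
```

With the same budget of 20 steps, only the middle value ends near the minimum; the small value has hardly made progress, and the large value moves further away on every step.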


There are three main gradient descent algorithms:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent

### Batch Gradient Descent

We use all training examples to compute the gradient of the cost function. Each iteration takes only one step toward the minimum after processing the entire training set, so this method is not appropriate for big datasets: it takes too much time to converge.

### Stochastic Gradient Descent

Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate.

In this approach, we use only one training example in each iteration to update the neural network's parameters. Using only one example makes each update fast, but the updates are noisy, so the cost fluctuates and the algorithm may oscillate around a minimum instead of settling on it.

<div style="text-align:center"><img src="local_global_min.png"/></div>


### Mini-Batch Gradient Descent


Between the batch and stochastic approaches, we can split the training set into mini-batches and update the parameters once per mini-batch. The mini-batch size should be neither too small nor too big. It is common to choose a batch size that is a power of 2 (for example, between 64 and 512) for computational efficiency. A smaller learning rate can make convergence more stable, and vectorizing the computation over the examples in each mini-batch makes training faster than processing one example at a time.


<div style="text-align:center"><img src="grad_types.png"/></div>

<h5><center>Converging Process of Gradient Descent Algorithms</center></h5> 
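The three variants differ only in how many examples are used per update, which a single sketch can show. The example below is an assumption-laden illustration: it fits a small linear regression with a mean squared error cost (rather than a neural network) so that the true parameters are known, and the function name `minibatch_gradient_descent` and all hyperparameter values are hypothetical.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, learning_rate=0.1,
                               epochs=50, seed=0):
    """Fit linear regression weights with mini-batch gradient descent.

    batch_size = len(X) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)              # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the MSE
            w -= learning_rate * grad
    return w

# Synthetic, noise-free data generated from known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_batch = minibatch_gradient_descent(X, y, batch_size=256)
w_mini = minibatch_gradient_descent(X, y, batch_size=32)
w_sgd = minibatch_gradient_descent(X, y, batch_size=1, learning_rate=0.01)
```

Batch gradient descent makes one update per epoch, the mini-batch run makes eight, and the stochastic run makes 256; the stochastic run uses a smaller learning rate here to keep its noisier updates stable.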

## Backward Propagation

Most deep neural networks are feed-forward, meaning data flows in one direction only, from input to output. Training, however, relies on backpropagation, which moves in the opposite direction, from output to input. Backpropagation lets us calculate the error attributed to each neuron and adjust the parameters accordingly.

Backpropagation is the essence of neural network training. We can think of it as the process of tuning the weights ($W$) and biases ($B$) of each layer. Tuning the weights and biases properly decreases the error and helps the neural network generalize better.

To minimize the cost function, we take its derivative with respect to the weights and biases and use it to update them. We repeat this process for the specified number of iterations, using an optimization algorithm such as gradient descent, to move toward a minimum of the cost.

**Mathematical expression of the Backward Propagation**:

For the final (output) layer, $\hat{y} = A^{[L]}$, with a sigmoid output and the cross-entropy loss:

$$ dZ^{[L]} = A^{[L]} - Y $$
$$ dW^{[L]} = \frac{1}{m} dZ^{[L]} A^{[L-1]T} $$
$$ dB^{[L]} = \frac{1}{m}\text{np.sum}(dZ^{[L]}, \text{axis}=1, \text{keepdims}=\text{True}) $$

For other layers:
$$ dZ^{[l]} = W^{[l+1]T} dZ^{[l+1]} * g^{[l]'} (Z^{[l]}) $$
$$ dW^{[l]} = \frac{1}{m}dZ^{[l]}A^{[l-1]T} $$
$$ dB^{[l]} = \frac{1}{m}\text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True}) $$


Where:  
$dZ^{[l]}$: gradient of the loss with respect to $Z^{[l]}$, that is $\partial \mathcal{L} / \partial Z^{[l]}$  
$Y$: actual label  
$W^{[l]}$: weight matrix of the l'th layer  
$B^{[l]}$: bias vector of the l'th layer  
$*$: element-wise product  
$L$: index of the final (output) layer; $l$: index of the current layer

After the $dW$ and $dB$ values are calculated, we can update the weights and biases with the following formulas:

$$ W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} $$ 
$$ B^{[l]} = B^{[l]} - \alpha \, dB^{[l]} $$ 

Where:  
$\alpha$ : Learning Rate
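The backward pass and update rule can be sketched for a small two-layer network. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `train_step`, the tanh hidden layer, the layer sizes, the synthetic data, and the clipping inside the cost are all choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, Y, W1, B1, W2, B2, alpha):
    """One forward pass, one backward pass, and one parameter update."""
    m = X.shape[1]

    # Forward propagation (tanh hidden layer, sigmoid output layer)
    A1 = np.tanh(W1 @ X + B1)
    A2 = sigmoid(W2 @ A1 + B2)
    A2c = np.clip(A2, 1e-12, 1 - 1e-12)  # assumption: guard against log(0)
    cost = float(-np.sum(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c)) / m)

    # Backward propagation, output layer: dZ^[L] = A^[L] - Y
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dB2 = np.sum(dZ2, axis=1, keepdims=True) / m

    # Hidden layer: dZ^[l] = W^[l+1]T dZ^[l+1] * g'(Z^[l]),
    # where the derivative of tanh is 1 - tanh(z)^2 = 1 - A1^2
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    dB1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    return (W1 - alpha * dW1, B1 - alpha * dB1,
            W2 - alpha * dW2, B2 - alpha * dB2, cost)

# Synthetic task: the label is 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)
W1, B1 = rng.normal(size=(4, 2)) * 0.1, np.zeros((4, 1))
W2, B2 = rng.normal(size=(1, 4)) * 0.1, np.zeros((1, 1))

costs = []
for _ in range(300):
    W1, B1, W2, B2, cost = train_step(X, Y, W1, B1, W2, B2, alpha=0.5)
    costs.append(cost)
```

Plotting `costs` against the iteration number is a quick way to check that the gradients are wired correctly: the curve should trend downward.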

## Tuning Hyperparameters

When we train a deep learning model, we need to tune its hyperparameters for the problem at hand. The main hyperparameters are:

**Learning Rate**: The choice of learning rate strongly affects model performance. An appropriate way to search for it is on a logarithmic scale:  
  
$\alpha = 10^r$  
  
Where:  
$r ∈ [a,b]$  
$a,b ∈  \mathbb{Z}$

**Epoch Number**: Selecting the number of iterations is also very important for reaching a low error in the cost function.  

**Layer Number**: Increasing the number of layers makes the model more complex. This can cause overfitting during training.

**Neuron Number in Layers**: Increasing the number of neurons also increases model complexity.

**Mini-Batch Size**: The mini-batch size needs to be selected carefully because of the trade-off between training time and convergence.
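The logarithmic search scale for the learning rate, $\alpha = 10^r$, can be sketched as below. Drawing $r$ uniformly at random from $[a, b]$ is an assumption made for this example (the text only fixes the scale), as are the function name `sample_learning_rates` and the range from -4 to -1.

```python
import numpy as np

def sample_learning_rates(a, b, num_samples, seed=0):
    """Sample alpha = 10**r with r drawn uniformly from [a, b]."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(a, b, size=num_samples)
    return 10.0 ** r

# Candidates spread evenly across the orders of magnitude 0.0001 to 0.1.
candidates = sample_learning_rates(a=-4, b=-1, num_samples=5)
```

Sampling the exponent rather than $\alpha$ itself gives each order of magnitude the same chance of being tried; a uniform sample of $\alpha$ between 0.0001 and 0.1 would place about 90% of the candidates above 0.01.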

## Overfitting in Neural Networks

Due to the complexity of neural network architectures, these models are prone to overfitting. We can use different methods to prevent overfitting, help the model generalize, and increase test accuracy. The most common ways are:
- <u>Getting more training data:</u> Mainly decreases variance.
- <u>Using Regularization:</u> Decreases model complexity and variance.

### L2 Regularization

In L2 regularization, we add a new term to the cost function to penalize large weights. Smaller weights make the model simpler, which helps it generalize and decreases variance.

New cost function will be:
$$ J(w^{[1]},b^{[1]}...,w^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat y^{(i)}, y^{(i)}) + \frac {\lambda}{2m}\sum_{l=1}^L(||w^{[l]}||_F^2)$$

$$||w^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(W_{ij}^{[l]})^2$$ 

Where:  
$\lambda$: Regularization parameter  
$F$: Frobenius norm (the squared Frobenius norm is the sum of the squared entries of a matrix)  
$n^{[l]}$: number of neurons in layer $l$
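The penalty term can be computed directly from the weight matrices, as in the sketch below. The function names `l2_penalty` and `l2_gradient_term` and the example values are assumptions. The gradient term $\frac{\lambda}{m} W^{[l]}$ is not stated in the text above; it follows from differentiating the penalty with respect to $W^{[l]}$ and is what gets added to $dW^{[l]}$ during backpropagation.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) times the sum of squared Frobenius norms of all layers."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_gradient_term(W, lam, m):
    """Extra term the penalty adds to dW for one layer: (lambda / m) * W."""
    return lam / m * W

# Hypothetical weights for a two-layer network.
weights = [
    np.array([[1.0, -2.0], [0.5, 0.0]]),
    np.array([[3.0, 1.0]]),
]

# Squared entries sum to 5.25 + 10 = 15.25, so the penalty is
# 0.1 / (2 * 10) * 15.25 = 0.07625.
penalty = l2_penalty(weights, lam=0.1, m=10)
extra_dW2 = l2_gradient_term(weights[1], lam=0.1, m=10)
```

The regularized cost is the unregularized cost plus `penalty`; a larger $\lambda$ pushes the weights more strongly toward zero.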

### Dropout Regularization

In dropout regularization, neurons in a layer are randomly eliminated, or dropped out, during training based on a selected threshold (the probability of keeping a neuron). For example, if we set the threshold to 0.6, each neuron is removed from the network with a probability of 40% on a given training step. The network is therefore effectively smaller on each step and cannot rely too heavily on any single neuron, which reduces overfitting.

<div style="text-align:center"><img src="dropout.png"/></div>

<h5><center>Visualization of Dropout</center></h5> 
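A dropout mask can be sketched as follows. Note that this example uses the "inverted dropout" variant, which rescales the surviving activations by the keep probability; that rescaling is an addition not described in the text above, and the function name `dropout_forward` and the layer shape are likewise assumptions.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Randomly zero activations, keeping each with probability keep_prob."""
    mask = rng.random(A.shape) < keep_prob
    # Inverted dropout (assumption): divide by keep_prob so the expected
    # value of each activation is unchanged.
    return A * mask / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((100, 100))                      # stand-in for a layer's activations
A_dropped, mask = dropout_forward(A, keep_prob=0.6, rng=rng)
kept_fraction = mask.mean()                  # close to 0.6
```

A new mask is drawn on every training step, and dropout is switched off when the trained network is used for prediction.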

---


## Building an ANN

**CodingPlace**