# ARTIFICIAL NEURAL NETWORKS

# Perceptron

* **TLU** (threshold logic unit, or linear threshold unit, LTU):
  * inputs and outputs are **numbers** (not binary on/off values)
  * each input connection is associated with a **weight**
  * TLU computes a **weighted sum** of its inputs,   
    then applies a **step function** to that sum (the role later played by activation functions),  
    and outputs the result  
<br/>  
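
The TLU computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `bias` term, the `>= 0` threshold convention, and the hand-picked weights are assumptions that the notes do not specify.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, else 0 (threshold convention is an assumption)."""
    return 1 if z >= 0 else 0

def tlu(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hypothetical hand-picked weights that make the TLU behave like a logical AND
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [tlu([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # one output per input pair (0,0), (0,1), (1,0), (1,1)
```

Only the input pair (1, 1) pushes the weighted sum above the threshold, so the unit fires for that pair alone.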
  
* **Hebb's rule**:
  * *Cells that fire together, wire together*
  * the perceptron learning rule is a variant of Hebb's rule that takes the prediction error into account and reinforces connections that help reduce it:   
    for every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction
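
As a rough sketch of the learning rule above, the update `w_i <- w_i + learning_rate * (target - prediction) * x_i` can be written in plain Python. The function names, the learning rate, the number of epochs, and the OR dataset are all illustrative assumptions.

```python
def predict(x, weights, bias):
    """Single TLU: weighted sum plus bias, then a step function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0

def train_perceptron(X, y, learning_rate=0.1, n_epochs=20):
    """Perceptron learning rule applied one training instance at a time."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(n_epochs):
        for x, target in zip(X, y):
            error = target - predict(x, weights, bias)  # 0 when the prediction is right
            # Only wrong predictions change the weights, and only for active inputs
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Toy, linearly separable problem: logical OR
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_or = [0, 1, 1, 1]
weights, bias = train_perceptron(X, y_or)
predictions = [predict(x, weights, bias) for x in X]
print(predictions)
```

Because OR is linearly separable, the rule settles on weights that classify all four points correctly.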

# Backpropagation

* Rumelhart et al. 1985. "Learning internal representations by error propagation".   
  First introduction of the backpropagation algorithm
<br/>
  
* Finds out how each connection weight and each bias term should be tweaked in order to reduce the error.  
  * once it has these gradients, it just performs a regular Gradient Descent step
  * **autodiff** = automatically computing gradients; the variant used by backpropagation is *reverse-mode autodiff*  
<br/> 
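
To make the forward pass, reverse pass, and Gradient Descent step concrete, here is a hand-written sketch for a single sigmoid neuron with a squared-error loss. The neuron, the loss, and all numeric values are assumptions for illustration; a real framework derives the reverse pass automatically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(w, b, x, y):
    """One forward pass and one reverse pass for a single sigmoid neuron."""
    # Forward pass: keep the intermediate values, the reverse pass reuses them
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    # Reverse pass: apply the chain rule from the loss back to the parameters
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    grad_w = dloss_dz * x          # dz/dw = x
    grad_b = dloss_dz              # dz/db = 1
    return loss, grad_w, grad_b

w, b = 0.5, 0.0                    # hypothetical initial parameters
x, y = 2.0, 1.0                    # one hypothetical training instance
learning_rate = 0.5

loss_before, grad_w, grad_b = forward_backward(w, b, x, y)
# Regular Gradient Descent step using the gradients from the reverse pass
w -= learning_rate * grad_w
b -= learning_rate * grad_b
loss_after, _, _ = forward_backward(w, b, x, y)
print(loss_before, loss_after)
```

The loss after the update is lower than before, which is the point of following the negative gradient.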
    
* **IMPORTANT**:  

  * initialize all hidden layers' connection weights **randomly**, or training will fail:
    * to break the symmetry
    * and allow backpropagation to train a diverse team of neurons
  * if all weights and biases are initialized to zero, all neurons in a given layer will be identical,   
    so backpropagation will affect them in exactly the same way and they will remain identical,  
    i.e. the model will act as if it had only one neuron per layer  
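
The symmetry problem can be observed on a tiny 2-2-1 network trained with a hand-written backward pass. This sketch is an illustration under assumptions: it uses identical constant initial values rather than zeros (so that the updates are visibly nonzero), a tanh hidden layer, a linear output, and a made-up dataset.

```python
import math
import random

def train(hidden_w, hidden_b, out_w, out_b, data, learning_rate=0.05, n_epochs=100):
    """Train a 2-2-1 MLP (tanh hidden layer, linear output, squared error)."""
    for _ in range(n_epochs):
        for x, y in data:
            # Forward pass
            h = [math.tanh(hidden_w[j][0] * x[0] + hidden_w[j][1] * x[1] + hidden_b[j])
                 for j in range(2)]
            y_hat = out_w[0] * h[0] + out_w[1] * h[1] + out_b
            # Backward pass
            d_out = 2.0 * (y_hat - y)
            for j in range(2):
                d_hidden = d_out * out_w[j] * (1.0 - h[j] ** 2)
                out_w[j] -= learning_rate * d_out * h[j]
                hidden_b[j] -= learning_rate * d_hidden
                hidden_w[j][0] -= learning_rate * d_hidden * x[0]
                hidden_w[j][1] -= learning_rate * d_hidden * x[1]
            out_b -= learning_rate * d_out
    return hidden_w

data = [([0.0, 1.0], 1.0), ([1.0, 0.0], -1.0), ([1.0, 1.0], 0.5)]  # made-up data

# Identical constant initialization: both hidden neurons start the same
same_w = train([[0.5, 0.5], [0.5, 0.5]], [0.0, 0.0], [0.5, 0.5], 0.0, data)

# Random initialization breaks the symmetry
random.seed(42)
rand_init = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
rand_out = [random.uniform(-1, 1) for _ in range(2)]
rand_w = train(rand_init, [0.0, 0.0], rand_out, 0.0, data)

print("constant init, hidden neurons identical:", same_w[0] == same_w[1])
print("random init, hidden neurons identical:  ", rand_w[0] == rand_w[1])
```

With identical starting values, both hidden neurons receive exactly the same update at every step, so their weight vectors never diverge.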
    
### Activation functions

  * **Why?** To introduce some nonlinearity between layers   
    * if you chain several linear transformations, you get a linear transformation
    * without nonlinearity between layers, even a deep stack of layers is equivalent to a single layer,   
      so it can't solve very complex problems
  * **sigmoid** (logistic) function: $sigmoid(z) = 1 / (1 + exp(-z))$
    * has a well-defined, nonzero derivative everywhere, which allows GD to make some progress at every step
  * **hyperbolic tangent** function: $tanh(z) = 2 \cdot sigmoid(2z) - 1$
    * S-shaped, continuous, differentiable
    * output: -1 to 1
    * range that makes each layer's output more or less centered around 0 at the beginning of training,   
      which often **helps speed up convergence**
  * **ReLU** (rectified linear unit) function: $ReLU(z) = max(0, z)$
    * continuous, not differentiable at $z = 0$ (slope changes abruptly)
    * its derivative is 0 for $z < 0$
    * fast to compute -> has become the default
    * no maximum value (helps reduce some issues during GD)
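
The formulas above translate directly into plain Python. The sketch below also checks, on made-up scalar "layers", that two stacked linear transformations collapse into a single one; the coefficient values are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_from_sigmoid(z):
    # tanh expressed through the sigmoid, as in the formula above
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh_from_sigmoid(z), relu(z))

# Two stacked linear layers with no activation in between (arbitrary coefficients)
a1, b1 = 3.0, 1.0
a2, b2 = -2.0, 0.5

def stacked(x):
    return a2 * (a1 * x + b1) + b2

def single(x):
    # Equivalent single linear layer: slope a2*a1, intercept a2*b1 + b2
    return (a2 * a1) * x + (a2 * b1 + b2)

print(stacked(1.5), single(1.5))
```

Inserting `relu` (or any other nonlinearity) between the two layers is what prevents this collapse.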
  

# Regression MLPs

* Prediction:
  * single value (price of a house): one output neuron  
  * multiple values: one output neuron per output dimension  
<br/> 
  
* In general, **do not** use any activation function for output neurons, so they can output any range of values
* If you want **output always positive**, use ReLU in output layer, or softplus (smooth variant of ReLU)  
  $softplus(z) = log(1 + exp(z))$
* If you want **output within a range** of values, use the logistic function or hyperbolic tangent,   
  then scale labels to the appropriate range (0 to 1 for logistic, -1 to 1 for tanh)
* **Loss function** during training: mean squared error
  * if lots of outliers in training: mean absolute error
  * or Huber loss (combination of both):
    * quadratic when error smaller than a threshold (typically 1)  
      -> allows it to converge faster and be more precise than mean absolute error 
    * but linear when error larger than the threshold  
      -> makes it less sensitive to outliers than mean squared error  
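
A small sketch of softplus and of the three losses on a handful of made-up errors, one of which is an outlier. The Huber formula uses the common 1/2 scaling convention, which is an assumption since the notes do not give the exact expression; `delta` is the threshold.

```python
import math

def softplus(z):
    """Smooth variant of ReLU, always positive."""
    return math.log(1.0 + math.exp(z))

def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones (1/2 scaling is an assumption)."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

errors = [0.1, -0.5, 0.3, 8.0]   # made-up prediction errors, the last one is an outlier

mse = sum(e ** 2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)
mean_huber = sum(huber(e) for e in errors) / len(errors)

print("MSE  :", mse)
print("MAE  :", mae)
print("Huber:", mean_huber)
print("softplus(-5), softplus(5):", softplus(-5.0), softplus(5.0))
```

The outlier dominates the MSE because its error is squared, while the Huber loss only grows linearly with it.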

### Typical regression MLP architecture:

<table align='left' width="80%">
    <thead>
        <tr>
            <th width="30%" style="text-align:left">Hyperparameter</th>
            <th style="text-align:left">Typical value</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th style="text-align:left">nb. input neurons</th>
            <td style="text-align:left">one per input feature (e.g. 28*28 for MNIST)</td>
        </tr>
        <tr>
            <th style="text-align:left">nb. hidden layers</th>
            <td style="text-align:left">depends on the problem, but typically 1 to 5</td>
        </tr>
        <tr>
            <th style="text-align:left">nb. neurons per hidden layer</th>
            <td style="text-align:left">depends on the problem, but typically 10 to 100</td>
        </tr>
        <tr>
            <th style="text-align:left">nb. output neurons</th>
            <td style="text-align:left">1 per prediction dimension</td>
        </tr>
        <tr>
            <th style="text-align:left">hidden activation</th>
            <td style="text-align:left">ReLU (or SELU)</td>
        </tr>
        <tr>
            <th style="text-align:left">output activation</th>
            <td style="text-align:left">None, or ReLU/softplus (if positive outputs) or logistic/tanh (if bounded outputs)</td>
        </tr>
        <tr>
            <th style="text-align:left">loss function</th>
            <td style="text-align:left">MSE, or MAE/Huber (if outliers)</td>
        </tr>
    </tbody>
</table>  


# Classification MLPs

* Prediction:
  * binary classification: 
    * single output neuron with logistic activation function
    * output = between 0 and 1, interpreted as the estimated probability of the positive class;  
      proba of the negative class = 1 - output
  * multilabel binary classification: 
    * multiple output neurons with logistic activation function
    * output probabilities do not necessarily add up to 1 -> the model can output any combination of labels
  * multiclass classification:
    * each instance can belong to only a single class out of several exclusive classes
    * one output neuron per class
    * softmax activation on output layer: ensures that   
      all estimated probas are between 0 and 1,  
      and add up to 1 (classes are exclusive)  
<br/> 
  
* Loss function: since we are predicting probability distributions, **cross-entropy** (*log loss*) is a good choice
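
For the multiclass case, softmax and cross-entropy can be sketched in plain Python as follows. The logits are made-up values, and subtracting the maximum logit before exponentiating is a standard numerical-stability trick added here, not something stated in the notes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and add up to 1."""
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probas, true_class):
    """Log loss for one instance: penalizes a low probability on the true class."""
    return -math.log(probas[true_class])

logits = [2.0, 1.0, 0.1]            # made-up output-layer scores for 3 classes
probas = softmax(logits)

loss_if_class_0 = cross_entropy(probas, 0)  # true class is the most likely one
loss_if_class_2 = cross_entropy(probas, 2)  # true class is the least likely one

print(probas, sum(probas))
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the model puts high probability on the true class and grows as that probability shrinks.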


### Typical classification MLP architecture:

<table align='left' width="80%">
    <thead>
        <tr>
            <th width="25%" style="text-align:left">Hyperparameter</th>
            <th width="25%"  style="text-align:left">Binary classification</th>
            <th width="25%"  style="text-align:left">Multilabel binary classification</th>
            <th width="25%"  style="text-align:left">Multiclass classification</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th style="text-align:left">Input and hidden layers</th>
            <td style="text-align:left">Same as regression</td>
            <td style="text-align:left">Same as regression</td>
            <td style="text-align:left">Same as regression</td>
        </tr>
        <tr>
            <th style="text-align:left">Nb. output neurons</th>
            <td style="text-align:left">1</td>
            <td style="text-align:left">1 per label</td>
            <td style="text-align:left">1 per class</td>
        </tr>
        <tr>
            <th style="text-align:left">Output layer activation</th>
            <td style="text-align:left">Logistic</td>
            <td style="text-align:left">Logistic</td>
            <td style="text-align:left">Softmax</td>
        </tr>
        <tr>
            <th style="text-align:left">Loss function</th>
            <td style="text-align:left">Cross entropy</td>
            <td style="text-align:left">Cross entropy</td>
            <td style="text-align:left">Cross entropy</td>
        </tr>
    </tbody>
</table>  



# Complex models

* needed for more complex architectures (non-sequential topologies), or for multiple inputs and outputs
* in these cases, use the Keras **Functional API** instead of the **Sequential API**

```python
import tensorflow as tf
from tensorflow import keras
print('tf version   : ', tf.__version__)
print('keras version: ', keras.__version__)
```

```
tf version   :  2.3.0
keras version:  2.4.0
```
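
A minimal sketch of a Functional API model with two inputs and two outputs, assuming `tensorflow.keras` as imported above. The input shapes, layer sizes, layer names, and loss weights are hypothetical and chosen only for illustration.

```python
from tensorflow import keras

# Hypothetical input shapes: 5 "wide" features and 6 "deep" features
input_wide = keras.layers.Input(shape=(5,), name="wide_input")
input_deep = keras.layers.Input(shape=(6,), name="deep_input")

# The deep path goes through two hidden layers
hidden1 = keras.layers.Dense(30, activation="relu")(input_deep)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)

# The wide input skips the hidden layers and is concatenated with the deep path
concat = keras.layers.concatenate([input_wide, hidden2])
main_output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)

model = keras.Model(inputs=[input_wide, input_deep],
                    outputs=[main_output, aux_output])

# One loss per output; the weights (an arbitrary choice) favor the main output
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
model.summary()
```

Each layer is called like a function on the output of the previous one, which is what lets the graph branch and merge in ways a Sequential model cannot express.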
