# Outline for this section

1. Deep learning - basics & reasoning
    - learning problems
    - representations
    
2. From biological to artificial neural networks
    - neurons 
    - universal function approximation
    
3. Components of ANNs
    - building parts
    - learning
    
4. ANN architectures
    - Multilayer perceptrons
    - Convolutional neural networks

<img align="center" src="./images/core_aspects_examples.png" width="500" height="280" />

# General Understanding

- as said before: `deep learning` is (a subset of) `machine learning` 
- it thus includes the core aspects we talked about in the [previous section]() and builds upon them:
    - different learning problems and resulting models/architectures
    - loss function & optimization
    - training, evaluation, validation
    - biases & problems
    
- this furthermore transfers to the key components you as a user have to think about
    - objective function (What is the goal?)
    - learning rule (How should weights be updated to improve the objective function?)
    - network architecture (What are the network parts and how are they connected?)
    - initialisation (How are weights initially defined?)
    - environment (What kind of data is provided for/during the learning?)

### Learning problems

As in [machine learning]() in general, we have `supervised` & `unsupervised learning problems` again:

<img align="center" src="./images/supervised_unsupervised.png" width="1200" height="350" />

### Complex Learning

When learning patterns in data, one major problem is the `variance` of the input we encounter, which makes it very hard to find appropriate `transformations` that help to achieve `generalizable behavior`. 

- let's assume we want to learn to recognize, label and predict "cats" based on a set of images that look like this

<img align="center" src="./images/cat_prototype.png" alt="logo" title="Github" width="150" height="250" />

- utilizing the `models` and `approaches` we talked about so far, we would use `predetermined transformations` (`features`) of our data `X`:

<img align="center" src="./images/cat_ml.png" width="600" height="280" />

- however, this is by far not the only way we could encounter a cat ... there are lots of sources of variation in our data `X`, including:

- deformation

<img align="center" src="./images/cat_deformation.png" alt="logo" title="Github" width="600" height="350" />

- occlusion

<img align="center" src="./images/cat_occlusion.png" alt="logo" title="Github" width="600" height="350" />

- background clutter

<img align="center" src="./images/cat_background.png" alt="logo" title="Github" width="600" height="350" />

- and intraclass variation

<img align="center" src="./images/cat_variation.png" alt="logo" title="Github" width="600" height="350" />

- these variations (and many more) are usually not accounted for and our mapping from `X` to `Y` would fail

- what we want to learn to prevent this are `invariant representations` that capture `latent variables`, i.e. variables you (most likely) cannot directly observe but that affect the variables you can observe

<img align="center" src="./images/cat_dl.png" alt="logo" title="Github" width="600" height="350" />

- the "simple models" we talked about so far work with `predetermined transformations` and thus perform `shallow learning`, whereas more "complex models" perform `deep learning` in their `hidden layers` to learn `representations`

One important aspect to discuss here is another `inductive bias` we put into `models` (think about the `AI` set again): the `hierarchical perception` of the `natural world`. In other words: the world around us is `compositional`, which means that the things we perceive are composed of smaller pieces, which themselves are composed of smaller pieces and so on.

<img align="center" src="./images/eickenberg_2016.png" width="600" height="400" />

The question is still: how do `ANN`s do that?

- using biological neurons and networks as the basis for artificial neurons and networks might therefore also help to learn `invariant representations` that capture `latent variables`
- `deep learning` = `representation learning`
- our minds (most likely) contain `(invariant) representations` about the world that allow us to interact with it
    - `task optimization`
    - `generalizability` 

## Perceptron Learning

<img src="images/logistic_classifier.png"  height="600" width="800" />

## Activation Functions

- the thing about `activation function`s...

    - they define the resulting type of an `artificial neuron`
    - thus they also define its capabilities
    - they need to be non-linear
        - because otherwise the `ANN` could only represent linear functions and linear decision boundaries, no matter how many `layers` it has
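
To see why non-linearity is required, consider a small sketch with two stacked layers that have no `activation function`. The symbols $W_1$, $W_2$ (weight matrices) and $b_1$, $b_2$ (bias vectors) are introduced here purely for illustration:

$$W_{2}\left(W_{1} x+b_{1}\right)+b_{2}=\left(W_{2} W_{1}\right) x+\left(W_{2} b_{1}+b_{2}\right)$$

The two layers collapse into a single linear map, so adding depth without a non-linear `activation function` adds no expressive power.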
        
$$\begin{array}{l}
\text { Non-linear transfer functions}\\
\begin{array}{llc}
\hline \text { Name } & \text { Formula } & \text { Year } \\
\hline \text { none } & \mathrm{y}=\mathrm{x} & - \\
\text { sigmoid } & \mathrm{y}=\frac{1}{1+e^{-x}} & 1986 \\
\tanh & \mathrm{y}=\frac{e^{2 x}-1}{e^{2 x}+1} & 1986 \\
\text { ReLU } & \mathrm{y}=\max (\mathrm{x}, 0) & 2010 \\
\text { (centered) SoftPlus } & \mathrm{y}=\ln \left(e^{x}+1\right)-\ln 2 & 2011 \\
\text { LReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \approx 0.01 & 2011 \\
\text { maxout } & \mathrm{y}=\max \left(W_{1} \mathrm{x}+b_{1}, W_{2} \mathrm{x}+b_{2}\right) & 2013 \\
\text { APL } & \mathrm{y}=\max (\mathrm{x}, 0)+\sum_{s=1}^{S} a_{i}^{s} \max \left(0,-x+b_{i}^{s}\right) & 2014 \\
\text { VLReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \in\{0.1,0.5\} & 2014 \\
\text { RReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha=\operatorname{random}(0.1,0.5) & 2015 \\
\text { PReLU } & \mathrm{y}=\max (\mathrm{x}, \alpha \mathrm{x}), \alpha \text { is learnable } & 2015 \\
\text { ELU } & \mathrm{y}=\mathrm{x}, \text { if } \mathrm{x} \geq 0, \text { else } \alpha\left(e^{x}-1\right) & 2015 \\
\hline
\end{array}
\end{array}$$
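
A few of the formulas from the table can be written down directly in plain Python. This is only a minimal sketch for scalar inputs: the function names are chosen here for illustration and the default `alpha=1.0` for `ELU` is an assumption, as the table leaves it open.

```python
import math


def sigmoid(x):
    # y = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # y = (e^2x - 1) / (e^2x + 1), written as in the table
    return (math.exp(2 * x) - 1.0) / (math.exp(2 * x) + 1.0)


def relu(x):
    # y = max(x, 0)
    return max(x, 0.0)


def leaky_relu(x, alpha=0.01):
    # y = max(x, alpha * x), alpha ~ 0.01 as in the LReLU row
    return max(x, alpha * x)


def elu(x, alpha=1.0):
    # y = x if x >= 0 else alpha * (e^x - 1); alpha=1.0 is an assumed default
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def centered_softplus(x):
    # y = ln(e^x + 1) - ln 2
    return math.log(math.exp(x) + 1.0) - math.log(2.0)


# Evaluate every function on a negative, a zero and a positive input
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), elu(x), centered_softplus(x))
```

Comparing the printed rows shows the main difference between the families: `sigmoid` and `tanh` flatten out for large inputs, while the `ReLU` variants keep growing linearly for positive inputs.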

```python
# Embed an interactive visualisation of common activation functions in the notebook
from IPython.display import IFrame

IFrame(src='https://polarisation.github.io/tfjs-activation-functions/', width=700, height=400)
```

- historically, either [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) or [tanh](https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent) were utilized
- even though they are [non-linear functions]() their properties make them insufficient for most problems, especially `sigmoid`
    - rather simple, S-shaped functions
    - mainly work for `binary problems`
    - computationally expensive, as they require evaluating exponentials
    - they saturate for large positive or negative inputs, where the `gradient` gets close to `0`, causing the neuron and thus the network to "die", i.e. stop `learning`
- modern `ANN`s frequently use `non-saturating activation functions` like the [Rectified Linear Unit](https://deepai.org/machine-learning-glossary-and-terms/rectified-linear-units)
    - doesn't saturate for positive inputs
    - faster training and convergence
    - introduces network sparsity

## Design

Design a `neural network` so that for every possible input `X`, the outcome is `f(X)`.

Here we introduce a [hidden layer]() that learns, or more precisely `approximates`, what those `transformations`/`functions` are on its own.

Importantly, the [hidden layer]() consists of [artificial neurons]() that receive `weighted inputs` `w` and apply [non-linear]() ([non-saturating]()) [activation functions]() `v`, whose `output` will be used for the `task` at hand.

<img align="center" src="./images/UAT_hiddenlayer_function.png" alt="logo" title="Github" width="600" height="350" />

It gets even better: this holds true even if there are multiple `inputs` and `outputs`:

<img align="center" src="./images/UAT_generalizability.png" alt="logo" title="Github" width="600" height="350" />


<img align="center" src="./images/ANN_layer.png" alt="logo" title="Github" width="600" height="350" />


| Term         | Definition | 
|--------------|:-----:|
| Layer |  Structure or network topology in the architecture of the model that consists of `nodes` and is connected to other layers, receiving and passing information. |
| Input layer |  The layer that receives the external input data. |
| Hidden layer(s) |  The layer(s) between `input` and `output layer` which perform `transformations` via `non-linear activation functions`. |
| Output layer |  The layer that produces the final output for the task at hand. |




<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_subparts.png" alt="logo" title="Github" width="600" height="350" />


| Term         | Definition | 
|--------------|:-----:|
| Node |  `Artificial neurons`. |
| Connection | Connection between `nodes`, providing `output` of one `node`/`neuron` as `input` to the next `node`/`neuron`.  |
| Weight |  The relative importance of the `connection`. |
| Bias |  A constant term that can be added to the `propagation function`, i.e. the weighted sum of the outputs of a neuron's predecessor neurons that forms the input to the neuron. |



- `ANN`s can be described based on their number of `hidden layers` (`depth`) and the number of `nodes` per `layer` (`width`)

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_multilayer.png" alt="logo" title="Github" width="600" height="350" />

# Model Training/Fitting

- when talking about `model fitting`, we need to talk about three central aspects:
    - the `model`
    - the `loss function`
    - the `optimization`
    
| Term         | Definition | 
|--------------|:-----:|
| Model |  A set of parameters that makes a prediction based on a given input. The parameter values are fitted to available data.|
| Loss function | A function that evaluates how well your algorithm models your dataset. |
| Optimization | An algorithm that tries to minimize the loss by updating the model parameters. |
	

#### An example: linear regression

- Model:  $$y=\beta_{0}+\beta_{1} x_{1}^{2}+\beta_{2} x_{2}^{2}$$
- Loss function: $$ M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
- optimization: [Gradient descent]()


- `Gradient descent` with a `single input variable` and `n samples`
    - Start with random weights (`β0` and `β1`) $$\hat{y}_{i}=\beta_{0}+\beta_{1} X_{i}$$
    - Compute loss (i.e. `MSE`) $$M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
    - Update `weights` based on the `gradient`
    
<img align="center" src="https://cdn.hackernoon.com/hn-images/0*D7zG46WrdKx54pbU.gif" alt="logo" title="Github" width="550" height="280" />
<sub><sup><sub><sup><sup>https://cdn.hackernoon.com/hn-images/0*D7zG46WrdKx54pbU.gif
</sup></sup></sub></sup></sub>
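
The three steps for the single input variable case can be sketched in a few lines of plain Python. Please treat this as an illustration under assumptions: the toy data (generated from `y = 1 + 2x`), the `learning_rate`, the number of steps and the fixed starting weights (zeros instead of random values, to keep the run reproducible) are all made up for this example.

```python
# Toy data generated from y = 1 + 2x (assumed for illustration only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]


def mse(beta0, beta1):
    # Mean squared error between true values and predictions beta0 + beta1 * x
    n = len(xs)
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def gradient(beta0, beta1):
    # Partial derivatives of the MSE with respect to beta0 and beta1
    n = len(xs)
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    d_beta0 = -2.0 / n * sum(residuals)
    d_beta1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_beta0, d_beta1


# Step 1: initial weights (fixed here instead of random, for reproducibility)
beta0, beta1 = 0.0, 0.0
learning_rate = 0.05  # assumed step size

for step in range(2000):
    # Step 2: the loss is implied by its gradient at the current weights
    d0, d1 = gradient(beta0, beta1)
    # Step 3: move the weights against the gradient
    beta0 -= learning_rate * d0
    beta1 -= learning_rate * d1

print(round(beta0, 3), round(beta1, 3), mse(beta0, beta1))
```

Because the `MSE` of a linear model is convex, the weights approach the values used to generate the data regardless of where they start, as long as the `learning_rate` is small enough.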

- `Gradient descent` for complex models with `non-convex loss functions`
    - Start with random weights
    - Compute loss (e.g. `MSE`) $$M S E=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$$
    - Update `weights` based on the `gradient`
    - Depending on the starting point, the procedure can end up in a `local minimum` instead of the `global minimum`
    
<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/gradient_descent_complex_models.png" alt="logo" title="Github" width="500" height="280" />

- to sufficiently talk about `learning` in `ANN`s we need to add a few things, some of which we have heard about already 
    - `metric`
    - `activation function`
    - `weights`
    - `batch size`
    - `gradient descent`
    - `backpropagation`
    - `epoch`
    - `regularization`

**Initialization of `weights` & `biases`**

- Upon `building` our network we also need to `initialize` the `weights` and `biases`. 
- How they are initialized is an important `hyper-parameter` for our `ANN` and the way it `learns`, as it can help prevent `activation function outputs` from `exploding` or `vanishing` when moving through the `ANN`. 
- This relates directly to the `optimization` as the `loss gradient` might become too large or too small, prolonging the time the network needs to converge or even preventing convergence completely. Importantly, certain `initializers` work better with certain `activation functions`. For example: [tanh](https://en.wikipedia.org/wiki/Hyperbolic_functions#Hyperbolic_tangent) likes `Glorot/Xavier initialization` while [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) likes `He initialization`. 
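
To make the two `initializers` a bit more tangible, here is a minimal sketch in plain Python. It assumes the commonly used uniform variant of `Glorot/Xavier initialization` and the normal variant of `He initialization`. The function names, the layer sizes (`64` inputs, `32` nodes) and the zero-initialized `biases` are assumptions made for this example.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    # Glorot/Xavier (uniform variant): U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    # He (normal variant): N(0, std^2), std = sqrt(2 / fan_in)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

w_tanh = glorot_uniform(64, 32, rng)  # weights for a hypothetical tanh layer
w_relu = he_normal(64, 32, rng)       # weights for a hypothetical ReLU layer
biases = [0.0] * 32                   # biases set to zero (an assumed, common choice)

print(len(w_tanh), len(w_tanh[0]), len(w_relu), len(w_relu[0]))
```

Both schemes scale the spread of the initial `weights` with the number of incoming (and, for Glorot, outgoing) connections, which is what keeps the outputs from growing or shrinking layer by layer.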

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_Cat_biases.png" alt="logo" title="Github" width="600" height="350" />

**A journey through the `ANN`**

The input is then processed by the `layers`, their `nodes` and respective `activation functions`, being passed through the `ANN`. Each `layer` and `node` will compute a certain `transformation` of the `input` it receives from the previous `layer` based on its `activation function` and `weights`/`biases`.
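
As a rough sketch of this journey, the following plain Python code passes one input through a `hidden layer` with `ReLU` and a linear `output layer`. The helper `dense`, the layer sizes and all `weights`/`biases` are hypothetical values chosen for illustration.

```python
def relu(x):
    return max(x, 0.0)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    # weights[j][i] is the weight of the connection from input i to node j
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Propagation function: weighted sum of the inputs plus the bias
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical parameters: 2 inputs -> 3 hidden nodes -> 2 output nodes
hidden_w = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]]
hidden_b = [0.0, -0.5, 0.1]
output_w = [[1.0, 0.5, -1.0], [-1.0, 1.0, 0.5]]
output_b = [0.2, 0.0]

x = [1.0, 2.0]
hidden = dense(x, hidden_w, hidden_b, relu)            # transformation in the hidden layer
scores = dense(hidden, output_w, output_b, identity)   # real-valued scores of the output layer
print(hidden, scores)
```

Each call to `dense` corresponds to one `layer`: it takes the `output` of the previous `layer`, applies the `weights` and `biases`, and hands the activated result to the next one.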

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_connections.png" alt="logo" title="Github" width="700" height="450" />

**The `output layer`**

- After a while, we will reach the end of our `ANN`, the `output layer`. 
- As the last part of our `ANN`, it will produce the results we're interested in. Its number of `nodes` and `activation function` will depend on the `learning problem` at hand. 
- For a `binary classification task` it will have `2 nodes` corresponding to both `classes` and might use the `sigmoid` or [softmax activation function](https://en.wikipedia.org/wiki/Softmax_function). 
- For `multiclass classification tasks` it will have as many `nodes` as there are `classes` and utilize the [softmax activation function](https://en.wikipedia.org/wiki/Softmax_function). 
- Both `sigmoid` and `softmax` are related to `logistic regression`, with the latter being a generalized form of it. Why does this matter? Our `output layer` will produce `real-valued scores` for each of the `classes` that are, however, neither `scaled` nor straightforward to interpret. 
- Using for example the `softmax function` we can transform these values into a `scaled probability distribution` with values between `0` and `1` that add up to `1` and can be submitted to other analysis pipelines or directly evaluated. 

Let's assume our `ANN` is `trained` to recognize and distinguish `cats` and `capybaras`, meaning we have a `binary classification task`. Defining `cats` as `class 1` and `capybaras` as `class 2` (not my opinion, just an example), the corresponding `vectors` we would like to obtain from the `output layer` would be `[1,0]` and `[0,1]` respectively. However, what we would get from the `output layer` in the absence of e.g. `softmax` would rather look like `[1.6, 0.2]` and `[0.4, 1.2]`. This is identical to what the penultimate `layer` would provide as `input` to the `output`, i.e. `softmax layer`, if we had an additional layer just for that instead of the respective `activation function`. 

After passing through the `softmax layer` or our `output layer` with `softmax activation function`, the `real-valued scores` `[1.6, 0.2]` and `[0.4, 1.2]` would become `[0.802, 0.198]` and `[0.310, 0.690]`. Knowing it's now a `scaled probability distribution` whose values range between `0` and `1` and sum up to `1`, it's much easier to interpret.
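
The transformation from `real-valued scores` to probabilities can be checked with a few lines of plain Python. This is a minimal sketch of the `softmax function`. Subtracting the largest score before exponentiating is an added numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    # Shift by the maximum score for numerical stability (the result is unchanged)
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# The two score vectors from the cat vs. capybara example
for scores in ([1.6, 0.2], [0.4, 1.2]):
    probs = softmax(scores)
    print(scores, [round(p, 3) for p in probs], sum(probs))
```

The largest score always receives the largest probability, so `softmax` changes the scale of the `output` but not the ranking of the `classes`.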

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_labels.png" alt="logo" title="Github" width="700" height="450" />

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_softmax.png" alt="logo" title="Github" width="700" height="450" />

**The `metric`**

The index of the vector provided by the `softmax output layer` with the largest value will be treated as the `class` predicted by the `ANN`, which in our example would be "cat". The `predicted class` is then compared to the `true class` and a `metric` is computed to assess the performance of the `ANN`. Remember folks: `deep learning` is `machine learning` and computing a `metric` is no exception to that. Thus, depending on your data and `learning problem` you can indicate a variety of `metrics` your `ANN` should utilize, including `accuracy`, `F1`, `AUC`, etc. Note: in `binary tasks` usually only the largest value is treated as a `class prediction`, which is called `Top-1 accuracy`. In contrast, in `multiclass tasks` with many `classes` (animals, cell components, disease propagation types, etc.) quite often the `5` largest values are treated as `class predictions` and a prediction counts as correct if the `true class` is among them, which is called `Top-5 accuracy`.
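
A small sketch of how `Top-1` and, more generally, `Top-k accuracy` could be computed in plain Python is shown below. The helper names and the third prediction are hypothetical. Note that the code counts `classes` from `0`, so `0` stands for "cat" and `1` for "capybara".

```python
def top_k_classes(probs, k):
    # Indices of the k largest values, largest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(all_probs, true_classes, k):
    # A prediction counts as a hit if the true class is among the k top-ranked classes
    hits = sum(
        1
        for probs, true_class in zip(all_probs, true_classes)
        if true_class in top_k_classes(probs, k)
    )
    return hits / len(true_classes)


# Softmax outputs for three samples (the third one is made up for illustration)
predictions = [
    [0.802, 0.198],
    [0.310, 0.690],
    [0.550, 0.450],
]
labels = [0, 1, 1]  # 0 = cat, 1 = capybara

print(top_k_accuracy(predictions, labels, k=1))
```

With only two `classes`, `k=2` would trivially count every prediction as correct, which is why `Top-5 accuracy` only makes sense for tasks with many `classes`.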

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_accuracy.png" alt="logo" title="Github" width="700" height="450" />

**The `loss function`**

Besides the `metric`, our `ANN` will also compute a `loss function` that will quantify how far the `probabilities`, computed by the `softmax function` of the `output layer`, are away from the `true values` we want to achieve, i.e. the `classes`. As mentioned in the [introduction]() and comparable to the `metric`, the choice of `loss function` depends on the data you have and the `learning problem` you want to solve. If you want to `predict` `numerical values` you might want to employ a `regression` based approach and use `MSE` as the `loss function`. If you want to `predict` `classes` you might want to employ a `classification` based approach and use a form of `cross-entropy` as the `loss function`.
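
Both `loss functions` mentioned here fit into a few lines of plain Python. The sketch below assumes the `categorical cross-entropy` for a single sample with a one-hot `true class` and reuses softmax-like probabilities as input. The regression values are made up for illustration.

```python
import math


def cross_entropy(probs, true_class):
    # With a one-hot target, only the probability of the true class contributes
    return -math.log(probs[true_class])


def mse(y_true, y_pred):
    # Mean squared error for regression-type outputs
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


probs = [0.802, 0.198]  # predicted probabilities for [cat, capybara]

print(cross_entropy(probs, 0))  # true class is "cat": small loss
print(cross_entropy(probs, 1))  # true class is "capybara": large loss
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```

The `cross-entropy` grows quickly when the `ANN` assigns a low probability to the `true class`, which is exactly the behavior we want the `optimization` to punish.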

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_loss.png" alt="logo" title="Github" width="700" height="450" />

**`Batch size`**

As with other `machine learning` approaches, we will ideally have a `training`, `validation` and `test set`. One `hyperparameter` that is involved in this process and also can define our entire `learning process` is `batch size`. It defines the number of `samples` in the `training set` our `ANN` processes before `optimization` is used to update the `weights` based on the result of the `loss function`. For example, if our `training set` has `100 samples` and we set a `batch size` of `5`, we would divide the `training set` into `20 batches` of `5 samples` each. In turn this would mean that our `ANN` goes through `5 samples` before using `optimization` to update the `weights` and thus our `ANN` would update its `weights` `20` times per pass through the `training set`.
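
The arithmetic of this example can be reproduced with a short sketch in plain Python. The helper `make_batches` is a hypothetical name and the `training set` is simply represented by sample indices.

```python
def make_batches(samples, batch_size):
    # Split the samples into consecutive chunks of at most batch_size elements
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


training_set = list(range(100))  # 100 samples, represented by their indices
batches = make_batches(training_set, batch_size=5)

print(len(batches), len(batches[0]))
```

If the number of `samples` is not divisible by the `batch size`, the last `batch` in this sketch is simply smaller than the others.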


<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_batch.png" alt="logo" title="Github" width="700" height="450" />

**The `optimization`**

Once a `batch` has been processed by the `ANN` the `optimization algorithm` will get to work. As mentioned before, most `machine learning problems` utilize [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) as the `optimization algorithm`. As mentioned during the [introduction]() and a few slides above, we have an `objective function` we want to `optimize`, for example `minimizing` the `error` computed by our `cross-entropy loss function`.  So what happens is the following. At first, an entire `batch` is processed by the `ANN` and `accuracy` as well as `loss` are computed. 
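
The update that `gradient descent` then applies after each `batch` can be written as a single rule. The symbols are introduced here for illustration only: $\theta$ stands for all `model parameters` (`weights` and `biases`), $L$ for the `loss function` evaluated on the current `batch` and $\eta$ for the step size, usually called `learning rate`:

$$\theta \leftarrow \theta-\eta \nabla_{\theta} L(\theta)$$

In words: every parameter is moved a small step in the direction that decreases the `loss` the most.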

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_gd.png" alt="logo" title="Github" width="700" height="450" />

Optimizers on surface:

<div>
<img src="images/gds_1.gif"  height="600" width="800" />
</div>

**`Backpropagation`**

- Actually, `gradient descent` works hand in hand with something called [backpropagation](https://en.wikipedia.org/wiki/Backpropagation). 
- Once we did a `forward pass` through the `ANN`, i.e. `data` goes from `input` to `output layer`, the `ANN` will use `backpropagation` to work out how much each of the `model parameters` contributed to the `error`. 
- It does so by utilizing the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) to `propagate` the `error` `backwards`, and `gradient descent` then uses the resulting `gradients` to update the `model parameters`. 
- Simply put: starting at the `output layer`, the `gradient` of the `loss function` with respect to its `parameters`, i.e. `weights` and `biases`, is computed, the `error` is `propagated backwards` to the previous `layer`, where the `gradients` for its `parameters` are computed and so forth. Afterwards, `gradient descent` `updates` all `parameters`. 
- As `parameters` interact with each other, the application of the `chain rule` is important as it can decompose the `derivative` of a `composition` of `differentiable functions` into the product of their `derivatives`.
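
To illustrate the idea, here is a sketch of `backpropagation` for a deliberately tiny network with one input, one `tanh` hidden node and one linear output node, using the squared error as `loss`. The network, its parameter values and the `learning_rate` are assumptions made for this example. The finite-difference check at the end is only there to confirm that the `chain rule` derivation is correct.

```python
import math


def forward(params, x):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)  # hidden node
    y_hat = w2 * h + b2         # linear output node
    return h, y_hat


def loss(params, x, y):
    _, y_hat = forward(params, x)
    return (y - y_hat) ** 2


def backward(params, x, y):
    # Chain rule, applied from the output layer back to the input
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_y_hat = -2.0 * (y - y_hat)  # dL/dy_hat
    d_w2 = d_y_hat * h            # dL/dw2
    d_b2 = d_y_hat                # dL/db2
    d_h = d_y_hat * w2            # error propagated back to the hidden node
    d_z = d_h * (1.0 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                # dL/dw1
    d_b1 = d_z                    # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    # Central finite differences, used only to verify the analytic gradients
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads


params = [0.5, -0.1, 1.5, 0.2]  # hypothetical w1, b1, w2, b2
x, y = 0.8, 1.0                 # one training sample

analytic = backward(params, x, y)
numeric = numerical_gradient(params, x, y)

# One gradient descent step using the backpropagated gradients
learning_rate = 0.1
updated = [p - learning_rate * g for p, g in zip(params, analytic)]
print(loss(params, x, y), loss(updated, x, y))
```

Note that all `gradients` are computed in one `backward pass` before any parameter is changed, and that a single update step already lowers the `loss` for this sample.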

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_bp.png" alt="logo" title="Github" width="700" height="450" />


<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_bp_2.png" alt="logo" title="Github" width="700" height="450" />


**The number of `epochs`**

The duration of the `ANN` `training` is usually determined by the interplay between `batch size` and another `hyperparameter` called `epochs`. Whereas the `batch size` defines the number of `training set samples` to process before updating the `model parameters`, the number of `epochs` specifies how often the `ANN` should process the entire `training set`. Thus, once all `batches` have been processed, one `epoch` is over. The number of `epochs` is something you set when you start the `training`, just like the `batch size`. Both are therefore `hyperparameters` for the `training` and not `parameters` that are learned by the `training`. For example, if you have `100 samples`, a `batch size` of `10` and set the number of `epochs` to `500`, your `ANN` will go through the entire `training set` `500` times, that is `5000 batches` and thus `5000 updates` to your `model`. While this already sounds like a lot, these numbers are tiny compared to what "real-life" `ANN`s go through. There, these numbers are in the millions and beyond. 
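
The interplay of `batch size` and `epochs` boils down to two nested loops. The following skeleton in plain Python only counts the updates for the numbers used in the example. The actual `forward pass`, `loss` computation and `backpropagation` are left out and merely indicated by a comment.

```python
n_samples, batch_size, n_epochs = 100, 10, 500

updates = 0
for epoch in range(n_epochs):
    # One epoch = one pass over all batches of the training set
    for start in range(0, n_samples, batch_size):
        # Forward pass, loss, backpropagation and the weight update would happen here
        updates += 1

print(updates)
```

The outer loop runs once per `epoch`, the inner loop once per `batch`, so the number of updates is the product of both.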

<img align="center" src="https://raw.githubusercontent.com/PeerHerholz/ML-DL_workshop_SynAGE/master/lecture/static/ANN_cat_epoch.png" alt="logo" title="Github" width="700" height="450" />


Please note: this is of course only the theoretical duration in terms of `iterations` and not the actual duration it takes to `train` your `ANN`. This is quite often hard to `predict` (hehe, got it?) as it depends on the `computational setup` you're working with, the `data` and obviously the `model` and its `hyperparameters`.