# A very short Introduction to Artificial Neural Networks and Deep Learning

![NeuralNet](https://cdn.pixabay.com/photo/2018/08/28/13/20/neural-network-3637503_1280.png)

**Author:** Sebastian Klassmann, sklassm1@uni-koeln.de  

Date: May 26$^{th}$, 2020  
Libraries used: none, this notebook is presentation only.  
Python Version: 3.7.7 
Other dependencies: None

------

Please be aware that this notebook is merely a short introduction to a few terms and concepts that are necessary for a practical understanding of basic artificial neural networks. 
In fact, it is quite impossible to cover all the ground we are trying to cover today in just over an hour. We will still try, albeit at the expense of conceptual depth.

For a more in-depth introduction and additional depth of discussion, **please refer to the references given at the end of this notebook.**

------

## 1. What is the core concept of Artificial Neural Networks?

Artificial Neural Networks are at the very core of research strategies that have been subsumed as *machine learning* or *deep learning* approaches to data-driven science over the past decades.

To quote from Anderson 2014, ANNs are "mathematical methods for input:output mappings. In practice they usually share some subset of the following features:
1. They are inspired by biology, but are not expected to slavishly emulate biology.
2. They are implemented as computer programs.
3. They have 'nodes' that play the role of neurons, collections of neurons, brain structures or cognitive modules. Their role depends on the modeller's intent and the problem being modelled.
4. Nodes are interconnected. [...]
5. The individual nodes produce outputs from inputs. The network's output is the collection of all these local operations" (p. 82)

Now, that does not tell us too much just yet, does it?

So, what does that mean for us? Let us consider the following basic schematization of an Artificial Neural Network (ANN):

<img src='ann1.png' width=800>

What would this simple architecture do in practice?

* Given a suitable input signal to the input layer, the model would transform the input to a potentially very different output at the output layer. This transformation is solely based on the connection between the nodes, as well as certain predispositions of the nodes at hand.
* All partial operations in the network can be expressed using very simple mathematical operations.

### Let's see this in action!

### A very much simplified ANN:
<img src='ann2.png' width=800>

As you can see, the above schematization has been expanded by attaching a numerical strength to each connection from one node to another. These are called 'weights' (w). For simplicity, they have not been given index numbers.

Now, let us assume a simple input to this neural net, let's make it 
$
\begin{pmatrix}
1\\
1\\
\end{pmatrix} 
$:

<img src='ann3.png' width=800>

This input (in this simplified case: *activation at the input layer*) is then passed on along the arrows (edges) of the network. The weights of the connections directly influence the activation input to the next layer of nodes. The activation in any of the subsequent nodes will be equal to the sum of all incoming weighted activations:

<img src='ann4.png' width=800>

The very same holds true for the resulting activation passed on by the output layer:

<img src='ann5.png' width=800>

As you can see, the output of our simplified network given the input $
\begin{pmatrix}
1\\
1\\
\end{pmatrix} 
$ would be a single float (1.4), which looks a little different than the input we passed to the network.
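The same pass can be sketched in a few lines of plain Python. Please note that the weights and the number of hidden nodes below are hypothetical values chosen for illustration; they are not the values shown in the figures, so the result differs from 1.4. The helper name `weighted_sum` is our own as well.

```python
# Hypothetical weights, chosen for illustration only; they are NOT the
# values shown in the figures of this notebook.
inputs = [1.0, 1.0]

# weights_hidden[j][i] is the weight on the edge from input node i to hidden node j
weights_hidden = [
    [0.5, 0.25],
    [0.75, 0.5],
    [0.25, 1.0],
]
# weights_output[j] is the weight on the edge from hidden node j to the output node
weights_output = [0.5, 0.25, 1.0]


def weighted_sum(activations, weights):
    """Sum of all incoming activations, each multiplied by its edge weight."""
    return sum(a * w for a, w in zip(activations, weights))


# Activation of every hidden node, then of the single output node
hidden = [weighted_sum(inputs, w) for w in weights_hidden]
output = weighted_sum(hidden, weights_output)

print(hidden)  # [0.75, 1.25, 1.25]
print(output)  # 1.9375
```

Changing any single weight changes the output, which is exactly the lever we will use later on.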

Wait, what? You said "very much simplified"?

Yes, the previous schematics have been a little simpler than what is usually used for building ANNs.

In fact, we need a few extra assumptions that are usually made:  
* Each node / neuron in our network will usually be considered to have a certain tendency to show activation based on an input. This is expressed as a *bias* parameter (b). Next to the weights, these biases are the second set of parameters that can drastically change the performance of our network structure.
* Usually, we want to limit the values that weighted activations can take by means of a nonlinear function that *squashes* them into a fixed range. One very common example is the so-called **sigmoid activation function**, which turns the weighted input x of a node into its activation y:

$$
y = \frac{1}{1+e^{-x}}
$$

<img src='sigmoid_wiki.png' width=400>
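To get a feeling for the squashing behaviour, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample inputs are our own choices for illustration.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1).

    Note: math.exp overflows for very large negative x (below roughly -709),
    which is acceptable for a sketch but not for production code.
    """
    return 1.0 / (1.0 + math.exp(-x))


# Large negative inputs end up close to 0, large positive inputs close to 1
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(sigmoid(x), 4))
```

An input of exactly 0 is mapped to 0.5, the midpoint of the curve shown above.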

Here, x is the sum of the weighted activations passed to the node at hand, plus the bias of that node, i.e.:

$$
y = \frac{1}{1+e^{-(\sum_{k}i_k w_k + b_y)}}
$$

Or, a little simpler:

$$
Y = \frac{1}{1+e^{-(I \cdot W + B)}}
$$

$\rightarrow$ Yes, those capital letters hint at matrices.  

$\rightarrow$ Yes, they make your life easier.

$\rightarrow$ We will get to this in a second.

Let's not go too deep on this for now. It is actually very simple once you can have a machine do the computations for you.
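As a small taste of what "having a machine do the computations" looks like, the matrix form can be sketched with NumPy. All numbers and the layer size (2 inputs, 3 nodes) are hypothetical; only the formula itself is taken from this notebook.

```python
import numpy as np


def sigmoid(x):
    """Element-wise sigmoid for NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical values for one layer with 2 inputs and 3 nodes
I = np.array([1.0, 1.0])            # input activations, shape (2,)
W = np.array([[0.5, 0.75, 0.25],
              [0.25, 0.5, 1.0]])    # W[i, j]: weight from input i to node j, shape (2, 3)
B = np.array([0.0, -1.0, 0.5])      # one bias per node, shape (3,)

# One matrix product replaces the sum over all incoming edges of every node
Y = sigmoid(I @ W + B)              # activations of the 3 nodes, shape (3,)
print(Y)
```

The single expression `I @ W + B` computes the weighted sums for all nodes of the layer at once, which is why the matrix notation makes life easier.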

<img src='https://cdn.pixabay.com/photo/2017/07/10/23/43/question-mark-2492009__340.jpg' width=1300>

## Perspectives

At this stage, you might be wondering: How can this all be applied in a productive fashion? And how do we get from this to something like the LSTM results we were shown at the beginning of this session?

Let us conduct a little thought experiment.

We have seen that an ANN takes a compatible input signal, passes on activations based on its inherent connections and thereby transforms it to a given output.

![cr: O'Reilly Media](https://www.oreilly.com/library/view/practical-convolutional-neural/9781788392303/assets/246151fb-7893-448d-b9bb-7a87b387a24b.png)

Imagine an ANN with randomly chosen weights on the connections between its nodes and randomly chosen biases for the different nodes.
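In code, such a randomly initialised network is nothing more than a handful of arrays filled with random numbers. The following is a sketch under our own assumptions: the layer sizes, the seed, the use of a normal distribution and the names in `params` are all hypothetical.

```python
import numpy as np

# The seed is arbitrary; it is fixed only to make the sketch reproducible
rng = np.random.default_rng(seed=0)

# Hypothetical layer sizes: 4 input nodes, 3 hidden nodes, 1 output node
n_in, n_hidden, n_out = 4, 3, 1

# Every weight and every bias starts out as a random number
params = {
    "W1": rng.normal(0.0, 1.0, size=(n_in, n_hidden)),   # input -> hidden weights
    "b1": rng.normal(0.0, 1.0, size=n_hidden),           # hidden biases
    "W2": rng.normal(0.0, 1.0, size=(n_hidden, n_out)),  # hidden -> output weights
    "b2": rng.normal(0.0, 1.0, size=n_out),              # output bias
}

for name, value in params.items():
    print(name, value.shape)
```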

<img src='https://cdn.pixabay.com/photo/2015/03/07/03/46/white-shepherd-662744__340.jpg' width=800>


Imagine a picture of a dog, for example.

Computers are basically stupid. Therefore, a computer will not be able to tell you whether or not it is actually looking at a dog.

Imagine that we were able to preprocess the image in a way that converts it to an array of numbers (think of pixel-wise greyscale values, for example).
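A minimal sketch of this preprocessing step, assuming the picture is already available as a grid of greyscale values between 0 (black) and 255 (white). The tiny 3x3 "image" below is made up; a real picture would first have to be loaded with an image library, which is not shown here.

```python
import numpy as np

# A made-up 3x3 greyscale "image"; each entry is one pixel's brightness (0..255)
image = np.array([[0, 128, 255],
                  [64, 192, 32],
                  [255, 0, 16]], dtype=np.uint8)

# Flatten the grid row by row and rescale the values to the range [0, 1],
# so that every pixel can serve as the activation of one input node
input_vector = image.astype(np.float64).flatten() / 255.0

print(input_vector.shape)  # (9,)
print(input_vector)
```

A network processing such pictures would need one input node per pixel, i.e. nine in this toy case.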

![glueckskeks Pixabay](https://cdn.pixabay.com/photo/2017/07/14/08/23/fortune-cookies-2503077__480.jpg)

Further, imagine that the information that the animal shown is a dog has been written on a virtual piece of paper (a label, if you will) and stored away safely, so that our ANN cannot see it while processing the picture.

Imagine that this information is encoded in a format that is structurally compatible with the output generated by our network. The simplest case (that is probably technically useless) would be a single float.

Imagine that we allow our network to compare the output to the label. Be assured that our network's judgement will probably be very bad, as all the parameters making up the connections in our network were chosen randomly.

Now imagine that we have a thousand images of varying mammals, a certain amount of them being dogs.  
<img src='https://cdn.pixabay.com/photo/2016/08/12/13/01/collage-1588416__340.jpg' width=500>

$\rightarrow$ Ok, fine. These are all dogs. But then again, your computer wouldn't know. Would it?

Imagine that we pass all of them (including non-dogs) through our network.

Imagine that we make our network just a little smarter by implementing a mathematical way of calculating the error of the outputs compared to the labels of the respective pictures. Imagine further that we were able to find an algorithm for slightly adjusting those values of the network that we can influence (i.e. the weights and biases, as we cannot alter the inputs from the training set), so that the network becomes slightly better at differentiating dogs from non-dogs in our training set.
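One common way of calculating such an error is the mean squared error between outputs and labels. This is only one possible choice and an assumption on our part, as are the made-up outputs and labels below (1.0 standing for "dog", 0.0 for "not a dog").

```python
import numpy as np

# Made-up network outputs for five pictures and the corresponding labels
outputs = np.array([0.9, 0.2, 0.6, 0.4, 0.7])
labels = np.array([1.0, 0.0, 1.0, 1.0, 0.0])   # 1.0 = dog, 0.0 = not a dog


def mean_squared_error(outputs, labels):
    """Average of the squared differences between outputs and labels."""
    return np.mean((outputs - labels) ** 2)


loss = mean_squared_error(outputs, labels)
print(loss)  # approximately 0.212
```

The smaller this number, the closer the outputs are to the labels; a perfect network would reach 0.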

<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ0yUENFTNbKRDg-cTH0h2WYjejTQRhggD5Mz7SsEev9MxSFtR_gA' width=500>

The above equations are the core of an algorithm called *backpropagation*.  
  
In a network that connects many layers of nodes with each other, adjusting a single weight from the first to the second layer will change the weighted activation that is passed on through the entirety of the remaining network.  

Therefore, **it is not practical to start modifying our parameters from the first layer** on. It is much more productive to calculate the error at the output layer and then **look back through the network in order to gently nudge the parameters in a direction that minimizes the error, working (propagating) our way backwards through the whole network**. In doing so, the algorithm eventually arrives at the question of how best to adjust the parameters related to the first layer, while taking into account the changes to all successive layers.  

$\rightarrow$ This may seem daunting and is indeed a little difficult to calculate by hand. Luckily, most machine learning libraries (including TensorFlow / Keras) will quickly do the required calculations for you. Therefore, in order to work on machine / deep learning models, you do not need to be able to calculate the backpropagation of a loss value by hand. It is, however, beneficial to have a basic understanding of how this procedure works.
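For those who would like to see the procedure spelled out once, here is a sketch of backpropagation for a tiny network. It rests on several assumptions of our own: one hidden layer, sigmoid activations everywhere, a squared-error loss for a single training example, and hypothetical names such as `forward`, `backward` and `loss_fn`. It is not meant to reproduce any particular library's implementation. At the end, one of the computed gradients is compared with a simple numerical estimate as a sanity check.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(x, W1, b1, W2, b2):
    h = sigmoid(x @ W1 + b1)   # hidden activations
    y = sigmoid(h @ W2 + b2)   # network output
    return h, y


def loss_fn(y, t):
    return 0.5 * np.sum((y - t) ** 2)


def backward(x, t, W1, b1, W2, b2):
    """Gradients of the loss with respect to all weights and biases."""
    h, y = forward(x, W1, b1, W2, b2)
    # Start at the output layer: error times slope of the sigmoid
    delta2 = (y - t) * y * (1.0 - y)
    grad_W2 = np.outer(h, delta2)
    grad_b2 = delta2
    # Propagate the error backwards through W2 to the hidden layer
    delta1 = (delta2 @ W2.T) * h * (1.0 - h)
    grad_W1 = np.outer(x, delta1)
    grad_b1 = delta1
    return grad_W1, grad_b1, grad_W2, grad_b2


# Hypothetical input, label and randomly chosen parameters
rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])
t = np.array([1.0])
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

grad_W1, grad_b1, grad_W2, grad_b2 = backward(x, t, W1, b1, W2, b2)

# Sanity check: nudge a single first-layer weight and measure the change in loss
eps = 1e-6
W1_up, W1_down = W1.copy(), W1.copy()
W1_up[0, 0] += eps
W1_down[0, 0] -= eps
numeric = (loss_fn(forward(x, W1_up, b1, W2, b2)[1], t)
           - loss_fn(forward(x, W1_down, b1, W2, b2)[1], t)) / (2 * eps)

print(grad_W1[0, 0], numeric)   # the two values should agree closely
```

Note how `delta1` is computed from `delta2`: the error at the output layer is reused when working out the gradients of the earlier layer, which is where the name of the algorithm comes from.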

Imagine that we save the slightly improved values for our parameters and confront the modified network with the same set of pictures, over and over again. Every time (iteration, or: epoch), it will get slightly better at performing the classification task at hand.
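This repeated "show all pictures, measure the error, nudge the parameters" cycle can be sketched as a short training loop. To keep it readable, the sketch uses a deliberately minimal setup of our own choosing: a single sigmoid node without a hidden layer, four made-up "pictures" with two features each, a mean squared error, and an arbitrary learning rate and number of epochs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Made-up stand-in for the picture set: 4 examples with 2 features each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([0.0, 1.0, 1.0, 1.0])   # labels: 1.0 = dog, 0.0 = not a dog

# Randomly chosen starting parameters (the seed is arbitrary)
rng = np.random.default_rng(0)
W = rng.normal(size=2)
b = rng.normal()

learning_rate = 0.5   # hypothetical value: how strongly we nudge per epoch
losses = []

for epoch in range(200):
    # Pass the whole training set through the network
    Y = sigmoid(X @ W + b)
    losses.append(np.mean((Y - T) ** 2))

    # Gradient of the mean squared error with respect to each weighted input
    delta = 2.0 * (Y - T) * Y * (1.0 - Y) / len(T)

    # Nudge weights and bias a little in the direction that reduces the error
    W -= learning_rate * (X.T @ delta)
    b -= learning_rate * np.sum(delta)

print(losses[0], losses[-1])   # error before the first and before the last update
```

Libraries such as Keras hide this loop behind a single call, but conceptually this is what happens in every epoch.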

$\rightarrow$ This is the core concept of (supervised) Machine Learning (using feed-forward networks).

You want more dog examples? Please feel free to consult Skansi (2018), pp. 50-78.

In our next session, we will be looking at a classification problem that has been called the "hello world" of Machine Learning: the automated recognition of handwritten digits from a well-known corpus.

$\rightarrow$ I will be sending out additional reading material to supplement today's *tour-de-force*. Please make sure that you read it (please expect the texts by Thursday this week).

$\rightarrow$ The assignment about today's session will focus on concepts that we dealt with today, as well as a little bit of vector and matrix calculations in Python (NumPy).

$\rightarrow$ There will be an additional / alternative assignment based on one of the texts I will be sending out this week. This one is a basic case of a classification problem and does not require any knowledge of TensorFlow. You can solve it using NumPy alone. Please remember that your choice of assignment is entirely up to you. This one might be a challenge, but it's well worth a try!

----

## References:  
Anderson, B. (2014). *Computational neuroscience and cognitive modelling: a student's introduction to methods and procedures*. Sage.  
Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep learning*. MIT press.  
Rashid, T. (2017). *Neuronale Netze selbst programmieren: ein verständlicher Einstieg mit Python*. O'Reilly.  
Skansi, S. (2018). *Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence*. Springer.  
  
----

**Additionally, spend some of your free time on this excellent [video series](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)**.