<b>Group Number:</b> 7
<br><b>Name Group Member 1:</b>   Paraa Afifi
<br><b>u-Kürzel Group Member 1:</b>   uppns
<br><b>Name Group Member 2:</b>   Dan-Jason Bräuninger
<br><b>u-Kürzel Group Member 2:</b>   uuuab
<br><b>Name Group Member 3:</b> Sami Shahzad
<br><b>u-Kürzel Group Member 3:</b>uvoei

# 2 Classification with Neural Networks
In this chapter, you will understand the workings of a classifier and manually train one that operates on a single value. You will improve the classifier step by step and learn fundamental concepts about classification as you go along.
Finally, you will use automated backpropagation to train a multi-layer neural network to emulate a logic gate.

## 2.1 Introduction
In machine learning and statistics, classification is the problem of identifying which of a set of categories (sub-populations) a new observation belongs to, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class or assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). [1]

A classification process requires a dataset that is split into different categories. A classifier can be trained on this dataset by learning the relationship between certain properties of the input data and the corresponding categories. 
To classify new data, the process is similar to the one in the chapter "Regression"; however, additional computational steps can be added depending on the application.
A common classification problem that can be solved by neural networks is image recognition (seen in Figure 1).

<center>
  <img src="images/neural_network_classification.png" alt=" Image recognition by a neural net" width="800"/>
</center>
<p style="text-align: center;"><em>Figure 1 - Image recognition by a neural network</em></p>



In [1]:
from __future__ import annotations # Allows type annotations to reference classes that are defined later in the file (postponed evaluation of annotations, PEP 563). This is not the default behaviour, so this line is needed to enable it
from typing import *

import numpy as np
from ipywidgets import interact, Layout, FloatSlider
import plotly.graph_objs as go
import time
import threading
import matplotlib.pyplot as plt



In [2]:
def relu(input_val: np.ndarray) -> np.ndarray:
    return np.maximum(input_val, 0)

In [3]:
def mean_squared_loss(predictions: np.ndarray, solutions: np.ndarray) -> float:
    total_squared_loss = np.sum(np.subtract(predictions, solutions)**2) #np allows to handle both values and lists
    mean_squared_loss = total_squared_loss/len(predictions)
    return mean_squared_loss

In [4]:
class SimpleNeuron:
    def __init__(self, plot: Interactive2DPlot):
        self.plot = plot #I am assigned the following plot
        self.plot.register_neuron(self) #hey plot, remember me
        
    def set_values(self, weight: float, bias: float):
        self.weight = weight
        self.bias = bias
        self.plot.update() #hey plot, I have changed, redraw my output
        
    def get_weight(self) -> float:
        return self.weight
    
    def get_bias(self) -> float:
        return self.bias

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        self.activation = np.dot(self.weight, x) + self.bias
        return self.activation

In [5]:
# an Interactive Plot monitors the activation of a neuron or a neural network
class Interactive2DPlot:
    def __init__(self, points_red: Dict[str, List[float]], points_blue: Dict[str, List[float]], ranges: Dict[str, Tuple[float, float]], loss_function: Callable[[np.ndarray, np.ndarray], float] = mean_squared_loss, loss_string: str = "Loss", width: int = 800, height: int = 400, margin: Dict[str, int] = { 't': 0, 'l': 170 }, draw_time: float = 0.1):
        self.idle = True
        self.points_red = points_red
        self.points_blue = points_blue
        self.draw_time = draw_time
        self.loss_function = loss_function
        self.loss_string = loss_string

        self.x = np.arange(ranges["x"][0], ranges["x"][1], 0.01)
        self.y = np.arange(ranges["y"][0], ranges["y"][1], 0.01)

        self.layout = go.Layout(
            xaxis=dict(title="Neck height in m", range=ranges["x"]),
            yaxis=dict(title="y", range=ranges["y"]),
            width=width,
            height=height,
            showlegend=False,
            margin=margin,
        )
        self.trace = go.Scatter(x=self.x, y=self.y)

        self.plot_points_red = go.Scatter(
            x=points_red["x"], y=points_red["y"], mode="markers", marker=dict(color='rgb(255, 0, 0)', size=10)
        )
        self.plot_points_blue = go.Scatter(
            x=points_blue["x"],
            y=points_blue["y"],
            mode="markers",
            marker=dict(color='rgb(0, 0, 255)', size=10, symbol="square"),
        )

        self.plot_point_new = go.Scatter(
            x=[], y=[], mode="markers", marker=dict(size=20, symbol="star", color='rgb(0,0,0)')
        )

        self.data = [self.trace, self.plot_points_red, self.plot_points_blue, self.plot_point_new]
        self.plot = go.FigureWidget(self.data, self.layout)

    def register_neuron(self, neuron: SimpleNeuron):
        self.neuron = neuron

    def redraw(self):
        self.idle = False
        time.sleep(self.draw_time)
        self.plot.data[0].y = self.neuron.compute(self.x)
        self.idle = True

    def update(self):
        loss_red = self.loss_function(self.neuron.compute(self.points_red["x"]), self.points_red["y"])
        loss_blue = self.loss_function(self.neuron.compute(self.points_blue["x"]), self.points_blue["y"])
        print(self.loss_string,": {:0.3f}".format((loss_red + loss_blue) / 2))

        if self.idle:
            thread = threading.Thread(target=self.redraw)
            thread.start()

## 2.2 From Regression to Classification

###  2.2.1 Linear Regression

You find yourself working on a farm with sheep and llamas grazing in separate enclosures. However, last night the shepherd forgot to close the gate between the two enclosures. The llamas and sheep are now mixed and have to be separated again. You immediately come up with a machine-learning-based solution: You assume that llamas can be distinguished from sheep by measuring the distance from the top of their head to their spine, since llamas have significantly longer necks. Using a LIDAR scanner, neck heights will be measured autonomously and the animals will be separated using a food enticement and an electronic turnstile that only lets llamas through.

<center>
  <img src="images/neck_heights.png" alt="Concept of Neck Height Measurement" width="600"/>
</center>
<p style="text-align: center;"><em>Figure 2: Concept of Neck Height Measurement.</em></p>


To collect sample data, you go out on the field with a measuring tape and measure the neck heights of some sheep and llamas. You specify two categories: '0' for sheep and '1' for llamas (see Table 1).

Most llamas are grown up and have long necks, but there are also some young llamas with shorter necks. However, since their necks are still longer than the sheep's, you figure that this won't be a problem.

|  Animal | Neck height  | Category  |
|---------|--------------|-----------|
| Sheep #1| 0.20m        |0          |
| Sheep #2| 0.23m        |0          |
| Sheep #3| 0.28m        |0          |
| Sheep #4| 0.32m        |0          |
| Sheep #5| 0.35m        |0          |
| Llama #1| 0.55m        |1          |
| Llama #2| 0.68m        |1          |
| Llama #3| 0.74m        |1          |
| Llama #4| 0.83m        |1          |
| Llama #5| 0.95m        |1          |

<p style="text-align: center;">
    Table 1 - Your data mining results
</p>





#### 2.2.1.1 Training a Linear Regression Neuron by Hand
For the sake of simplicity, you start by using a single neuron as a classifier. Run the two cells below to define the data mining points and to display a plot.

In [6]:
points_sheep = dict(
              x=[ 0.20, 0.23, 0.28, 0.32, 0.35],
              y=[ 0, 0, 0, 0, 0]
             )

points_llamas = dict(
              x=[ 0.55, 0.68, 0.74, 0.83, 0.95],
              y=[ 1,  1, 1, 1, 1]
             )

ranges = dict(x=[-0.1, 1.25], y=[-0.5, 1.4])
slider_layout = Layout(width="90%")

In [7]:
plot1 = Interactive2DPlot(points_sheep, points_llamas, ranges, loss_string="Mean Squared Loss")
neuron1 = SimpleNeuron(plot1)

interact(
    neuron1.set_values,
    weight=FloatSlider(min=-2, max=4, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-1, max=1, step=0.1, layout = slider_layout),
)

plot1.plot

[Output: interactive weight and bias sliders, and a plot of the neuron output over "Neck height in m" with the data points (red circles: sheep, blue squares: llamas)]

<div class="alert alert-block alert-success">
<b>Question (1pt):</b> Change the weight and bias sliders above. What is a weight and bias combination that results in a loss < 0.05?
</div>

<div class="alert alert-block alert-success">
<b>Your Answer: With weight 1.6 and bias -0.3, we obtained a loss of 0.042.</b> 
</div>
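
As a cross-check of a slider setting, the displayed loss can be recomputed by hand. The sketch below assumes the same averaging as `Interactive2DPlot.update` (mean squared loss per class, then the mean of both class losses); the helper name `linear_neuron_loss` is illustrative and not part of the notebook.

```python
import numpy as np

sheep_x = np.array([0.20, 0.23, 0.28, 0.32, 0.35])   # neck heights with label 0
llama_x = np.array([0.55, 0.68, 0.74, 0.83, 0.95])   # neck heights with label 1

def linear_neuron_loss(weight: float, bias: float) -> float:
    """Mean squared loss of y = weight * x + bias, averaged over both classes."""
    loss_sheep = np.mean((weight * sheep_x + bias - 0.0) ** 2)
    loss_llama = np.mean((weight * llama_x + bias - 1.0) ** 2)
    return float((loss_sheep + loss_llama) / 2)

print("Mean Squared Loss : {:0.3f}".format(linear_neuron_loss(1.6, -0.3)))
```

With both sliders at 0, the same function returns 0.5, which matches the loss printed for an untouched plot.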

***
#### 2.2.1.2 Working our way towards a discrete classifier
Now we want to use our trained neuron to classify new neck heights. To do that, we have to write a program that takes in a neck height and outputs what the trained neuron thinks about it. The classifier will also plot the new neck height. Run the cell below to carry over the values from the previous task.

In [8]:
# a duplicate of the last plot, so you don't have to scroll
plot2 = Interactive2DPlot(points_sheep, points_llamas, ranges, loss_string="Mean Squared Loss") 
neuron2 = SimpleNeuron(plot2)
neuron2.set_values(neuron1.get_weight(), neuron1.get_bias()) #get your values from last task

plot2.plot

Mean Squared Loss : 0.500


[Output: plot of the neuron output over "Neck height in m" with the data points (red circles: sheep, blue squares: llamas) and a star marker for the new neck height]

<div class="alert alert-block alert-success">
<b>Task:</b> Try to implement a classifier using just a linear neuron. (Yes, an almost futile task, but this will make sense later.) <br> Complete the Python code below to obtain a classification_result.
<ul>
    <li> the classification result shall be the output of neuron2, given the new neck height </li>
    <li> you shouldn't need to add more than 1 line of code </li>
    <li> after executing, take a look at the star in the plot above. It represents the current input/output for the new neck length</li>
</ul>

</div>

In [9]:
new_neck_height = 0.9  # this value shall be varied to answer the questions below

classification_result: float

### STUDENT CODE HERE (1pt)
classification_result = neuron2.compute(new_neck_height)
### STUDENT CODE until HERE

plot2.plot.data[3].x = [new_neck_height] #update plot
plot2.plot.data[3].y = [classification_result] 

print("Result:", classification_result)

Result: 0.0


<div class="alert alert-block alert-success">
<b>Question (5 pts):</b>  Answer the following questions in the answer block below and indicate which question your answer is referring to: <br>
    
1. What classification value does the smallest llama have? (run the cell above and change new_neck_height) <br>
2. What classification value do animals with a neck height of 0.1m or 0.9m have? <br>
3. Why is the classification value continuous, even though the training data had only two discrete values? <br>
4. How would you interpret this continuous classification value? Try to describe it in a few words, there is no single correct answer. <br>
5. Your neuron outputs a continuous value, but what we need is a discrete output, that clearly says either "llama" or "sheep". To do this, you add a simple decision to the output of the neuron. The decision should be approximately just as sensitive towards llamas as to sheep. What neuron output (y-value) would you choose as the threshold and why? (no single correct answer) <br>
6. You want to add more data to your model to improve its performance. As you collect more data, you find a small llama with a neck height of 0.40m in your dataset. After you train your model on the new data, your discrete classifier decides that this small llama is a sheep. (Remember: the decision at the end only gets the y-value). Why is it problematic in this case to use a linear regression model for discrete classification? What property of the approximation function should be different? <br>
7. You decide that manually adding a discrete decision at the end of your network is an impractical idea. It would be better to improve the linear neuron by adding a Heaviside step function as an activation function, just like adding a ReLU function. Then the training could be automated and the right threshold could be found automatically. What is the problem with this approach if we still want to use the backpropagation algorithm? <br>
</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b> 

1. The shortest llama (neck height 0.55 m) has a classification value of about 0.58 with weight 1.6 and bias -0.3, which lies just above the value of 0.5 that we take as the boundary between llama and sheep.<br>
2. An animal with a neck height of 0.1m has a classification value of -0.14, and an animal with a neck height of 0.9m has a classification value of 1.14.<br>
3. A neuron establishes a continuous linear relationship between input and output (output = weight * input + bias); the two discrete label values only serve as targets for fitting this line.<br>
4. The higher the value, the more certain we are (or rather, the neural network is) that the animal is a llama. The lower the value, the more certain the network is that it is a sheep.<br>
5. I would take exactly the midpoint between 0 and 1, i.e. 0.5; this is also what our experiments above suggested.<br>
6. The problem is that a linear function can only draw a straight separating line. If the data overlap or follow more complex patterns, a straight line cannot separate them correctly. We would need a more flexible function that can draw non-linear boundaries.<br>
7. The problem with the Heaviside step function is that it is not differentiable at the origin (the slope tends to infinity there), and everywhere else its derivative is zero. Since the origin is exactly the decision point between the two classes, we cannot optimize any further without a usable gradient. The backpropagation algorithm therefore breaks down here.<br>
    
</div>
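
The discrete decision discussed in question 5 can be sketched as a threshold applied to the neuron output. This is only an illustration: the function name `classify_by_threshold` is hypothetical, and the weight, bias, and threshold are the example values from the answers above, not fixed choices.

```python
def classify_by_threshold(neck_height: float, weight: float, bias: float,
                          threshold: float = 0.5) -> str:
    """Turn the continuous output of a linear neuron into a discrete label."""
    output = weight * neck_height + bias
    return "llama" if output >= threshold else "sheep"

# Example values: weight 1.6 and bias -0.3 from the answer in 2.2.1.1
for height in (0.10, 0.40, 0.55, 0.90):
    print(height, classify_by_threshold(height, weight=1.6, bias=-0.3))
```

With these values, the small llama with a neck height of 0.40m from question 6 falls below the threshold and is labelled as a sheep.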

***
### 2.2.2 Logistic Regression

In machine learning, the go-to assumption for an unknown two-class probability distribution is a logistic distribution.[2]
Its cumulative distribution function is the logistic function, of which the sigmoid function is the most commonly used special case (see Figure 3).
The sigmoid function enables a model to capture most naturally occurring probability distributions.[3] (Further reading: see section "Further Reading" at the end of the document)

In the introduction of Task 2.1, we assigned labels to the neck lengths: "0" for sheep and "1" for llama.
Here, we can interpret the output of the neuron as the "llama probability". For example, an output of 1 means a llama probability of 100%, an output of 0.2 means a llama probability of 20%, and so on.

<center>
  <img src="images/sigmoid.png" alt="Sigmoid Activation Function" width="500"/>
</center>
<p style="text-align: center;"><em>Figure 3: Sigmoid Activation Function.</em></p>

In [10]:
def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-x))

<div class="alert alert-block alert-success">
<b>Task:</b> Complete Code and Train Neuron. Change the <code>SigmoidNeuron</code> class below to apply a sigmoid function to the final output.

</div>

In [11]:
class SigmoidNeuron(SimpleNeuron): #inheriting from SimpleNeuron, 
                                   #all functions stay the same unless they are specified here

    def compute(self, x: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
        ### STUDENT CODE HERE (1 pt)
        Ausgabe = super().compute(x)
        self.activation = sigmoid(Ausgabe)
        ### STUDENT CODE until HERE
        return self.activation

In [12]:
classification_plot_sig = Interactive2DPlot(points_llamas, points_sheep, ranges, loss_string="Mean Squared Loss")

our_sig_neuron = SigmoidNeuron(classification_plot_sig)

interact(
    our_sig_neuron.set_values,
    weight=FloatSlider(min=-50, max=200, step=0.1, layout = slider_layout),
    bias=FloatSlider(min=-50, max=50, step=0.1, layout = slider_layout),
)

classification_plot_sig.plot

[Output: interactive weight and bias sliders, and a plot of the sigmoid neuron output over "Neck height in m" with the data points (red circles: llamas, blue squares: sheep)]

<div class="alert alert-block alert-success">
<b>Question (3 pts):</b> Answer the following questions in the answer block below and indicate which question your answer is referring to: <br>
    
1. Give one example of an optimal weight and bias combination. <br>
2. In general, what advantage does a classifier that also outputs a probability have over a classifier that just outputs a binary yes/no value? (a few words) <br>
3. Give one example of how we can use the additional probability information to increase the accuracy of our separation process. <br>
</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b> 
    
1. With weight = 31.2 and bias = -14, the displayed loss is 0.000.<br>
2. We can adjust the decision threshold so that the classifier best fits our specific task.<br>
3. Suppose that in our llama-sheep example we have more llamas than sheep, and that a few llamas are outliers (shorter necks, close to the sheep) that the neural network classifies as sheep. 
We can then deliberately lower the decision threshold so that exactly these outliers end up on the correct side again (e.g. instead of separating at 0.5, we separate at 0.35). <br>
</div>
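
The threshold adjustment described in the answer above can be sketched as follows. The helper names `llama_probability` and `decide` and the lowered threshold of 0.15 are assumptions made for this illustration; the weight and bias are the example values from answer 1.

```python
import numpy as np

def llama_probability(neck_height: float, weight: float, bias: float) -> float:
    """Sigmoid neuron output, read as the probability that the animal is a llama."""
    return float(1.0 / (1.0 + np.exp(-(weight * neck_height + bias))))

def decide(probability: float, threshold: float = 0.5) -> str:
    """Discrete decision on top of the probability output."""
    return "llama" if probability >= threshold else "sheep"

p = llama_probability(0.40, weight=31.2, bias=-14.0)   # a small llama
print("llama probability: {:0.2f}".format(p))
print("threshold 0.50:", decide(p, 0.50))
print("threshold 0.15:", decide(p, 0.15))
```

Lowering the threshold catches more small llamas, at the price of letting sheep with unusually long necks through; which trade-off is acceptable depends on the application.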

## 2.3 Cross Entropy/Logarithmic Loss:
The most common loss function for classification is cross entropy loss, also called logarithmic loss (in the context of machine learning, the two terms are used interchangeably). In the special case of two categories, the loss is called binary cross entropy. The binary cross entropy loss between the ground truth data value $y$ and the predicted value $\hat{y}$ is calculated as follows:

\begin{align}
-\left[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right]
\end{align}

The loss over a whole dataset is the average of this expression over all data points.
It turns out that, for a sigmoid or softmax output layer with one hot encoded solutions (explained below), the derivative of the logarithmic loss with respect to the inputs of the output layer is just the network output minus the solution vector, which makes it very easy to work with.
**Note:** Cross entropy loss can only be used if the output values lie between 0 and 1.
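
The statement about the simple derivative can be checked numerically for the binary case. The sketch below assumes a single sigmoid output neuron with pre-activation `z` (the values of `z` and `y` are arbitrary) and compares a central finite difference of the binary cross entropy with the closed form $\hat{y} - y$.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def bce(y: float, y_hat: float) -> float:
    """Binary cross entropy for a single data point."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y = 0.7, 1.0          # arbitrary pre-activation and ground-truth label
h = 1e-6                 # step size for the central finite difference

numeric = (bce(y, sigmoid(z + h)) - bce(y, sigmoid(z - h))) / (2 * h)
analytic = sigmoid(z) - y    # network output minus solution

print(numeric, analytic)     # the two values agree up to numerical error
```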

<center>
  <img src="images/cross_entropy.png" alt="Log/Cross-Entropy loss func" width="600"/>
</center>
<p style="text-align: center;"><em>Figure 4:  Logarithmic/Cross-Entropy Loss Function</em></p>

<div class="alert alert-block alert-success">
<b>Question (3 pts):</b> Calculate Squared and Cross Entropy Loss. Fill out the ??? in the table below (Markdown is fine to display the table). Use the cells below for calculations. 

</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b>


| Input         | Llama Probability  |      Squared Loss    | Cross Entropy Loss   |
|---------------|--------------------|----------------------|----------------------|
|    llama(1)   | 0.99               |  0.0001    |  0.0101       |
|    sheep(0)   | 0.6                |  0.36       |  0.9163       |
|    sheep(0)   | 0.95               |  0.9025     |  2.9957       |
|    sheep(0)   | 0.999999           |  1.0000   |  13.8155      |

</div>

In [13]:
def cross_entropy_loss(predictions: np.ndarray, solutions: np.ndarray) -> float:
    eps = 1e-15
    # keep log() finite for predictions of exactly 0 or 1, without modifying the caller's array
    predictions = np.clip(predictions, eps, 1 - eps)
    total_loss = np.sum(-(solutions*np.log(predictions)+(1-solutions)*np.log(1-predictions)))
    avg_loss = total_loss/len(predictions)
    return avg_loss

In [14]:
predicted = np.array([0.999999]) #insert here
actual = np.array([0]) #insert here


print("mean squared loss: {:0.4f}".format(mean_squared_loss(predicted,actual)))
print("cross entropy loss: {:0.4f}".format(cross_entropy_loss(predicted,actual)))

mean squared loss: 1.0000
cross entropy loss: 13.8155


<div class="alert alert-block alert-success">
<b>Question (3 pts):</b>  Answer the following questions in the answer block below and indicate which question your answer is referring to: <br>
    
1. How do the goals of regression and classification generally differ?<br>
2. Why do you think cross entropy loss is better suited for classification training algorithms?<br>
</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b> 
    
1. Regression aims to predict a continuous value, whereas classification aims to assign data to one of several categories or classes. <br>
2. Cross entropy loss penalizes confidently wrong predictions very heavily. As a result, the model quickly learns that it has made a mistake, which steers training in the right direction. <br>
</div>

## 2.4 One-Hot Encoding
To do classification, categories have to be represented in a way that the classifier can process. Neural networks cannot understand categories directly and need a numeric representation.

### 2.4.1 Disadvantages of Integer Encoding

In the llama classifier, llamas were assigned the value $1$ and sheep the value $0$. One single output neuron would "fire" if a llama was found, and not fire if a sheep was found. This way of representing categories is called **integer** or **label encoding**.

This works reasonably well for binary classification, but what if we want to distinguish between sheep, llamas and shepherd dogs?
Doing this with just one output neuron would result in complications: 
- Dogs would need a label that is numerically higher or lower (for example $2$), implying an order (dogs > llamas) where there actually is none.
- It would be necessary to interpret three different states out of one output neuron value.

Another disadvantage can be seen in the next question:

<div class="alert alert-block alert-success">
<b>Question (1 pt):</b> Suppose the encodings are: 0 for sheep, 1 for llamas and 2 for dogs. You classified 5 sheep and 5 dogs today. You want your classifier to output the average classification for today. What will the classifier say?
</div>

<div class="alert alert-block alert-success">
<b>Your Answer: (5*0 + 5*2)/10 = 1, and class 1 is llama. That is complete nonsense: we had no llama at all, yet the classifier decided on llama.</b> 
</div>

### 2.4.2 Composition of One-Hot Encoding

The solution for the shortcomings of integer encoding looks like this:

| Input         | One Hot Encoding  | 
|---------------|--------------------|
|    sheep   | [1,0,0]                |
|    llama   | [0,1,0]               |
|    dog     | [0,0,1]           |



The length of the representation vector is always equal to the number of categories. Only one element of the vector is 1 for each category ("one-hot").
Using this encoding, we can conveniently use 3 output neurons for 3 different categories, so that the activation of each output neuron represents the classification score for that category.
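
A minimal sketch of this encoding is shown below; the helper name `one_hot` and the fixed category order are assumptions of this example. It also revisits the averaging question from 2.4.1: averaging one-hot vectors gives the share of each category instead of a misleading "mean class".

```python
import numpy as np

categories = ["sheep", "llama", "dog"]   # the position in this list defines the "hot" index

def one_hot(label: str) -> np.ndarray:
    """Return the one-hot vector for a category name."""
    vector = np.zeros(len(categories))
    vector[categories.index(label)] = 1.0
    return vector

print(one_hot("llama"))

# 5 sheep and 5 dogs: the average is a vector of class frequencies
todays_animals = ["sheep"] * 5 + ["dog"] * 5
frequencies = np.mean([one_hot(animal) for animal in todays_animals], axis=0)
print(frequencies)
```

The llama entry of `frequencies` stays at 0, whereas the integer-encoded average of the same animals was 1, the label for llama.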

###  2.4.3 Limits of One-Hot Encoding
One-hot encoding is not an unimprovable solution to represent categories, but rather another tool in the box that happens to work well for many problems, but not for all.

<div class="alert alert-block alert-success">
<b>Question (1 pts):</b> Suppose you would like to train a speech recognition neural network that can classify all English words contained in the Oxford English Dictionary. It does not need to classify whole sentences, just single words. What would be a problem using one-hot encoding?
</div>

<div class="alert alert-block alert-success">
<b>Your Answer: We would need one entry in the one-hot vector for every word, i.e. the vector would have more than 500,000 entries.</b> 
</div>

## 2.5 Softmax Activation Function

The sigmoid function works fine for a "yes or no" problem, i.e. binary decisions. But more often than not we want to distinguish between more than two categories. For that, we need a function that takes in **multiple** neuron activations from the last layer of a network and outputs a **probability vector** containing the probabilities for each category. 

The key: **Each input** of this function is **normalized by the other inputs** such that the sum of the output vector is always 1. This activation function is different from ReLU or Sigmoid, because it always applies to the layer as a whole. In practice, it only makes sense as the activation function for the output layer. The figure below shows an example network.

We can realize a softmax activation function by taking each element $x_i$ of the input vector, calculating $\exp(x_i)$ and then normalizing this value by dividing it by the sum of the $\exp$ results of all single input vector elements. Strictly speaking, the $\exp$ is not necessary for this effect - a linear normalization, limited to non-negative values, could also be interpreted as probability. However, the exponential normalization offers properties that improve performance (see "further reading").

\begin{align}
(\text{Softmax}(x))_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
\end{align}
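
The formula can be sketched in a few lines of NumPy. Subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the fraction. The example activations are made up.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over a 1D vector of output-layer activations."""
    shifted = x - np.max(x)          # does not change the result, avoids overflow in exp
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

activations = np.array([2.0, 1.0, 0.1])   # made-up activations for sheep, llama, dog
probabilities = softmax(activations)
print(probabilities, probabilities.sum())
```

The largest activation receives the largest probability, and the entries always sum to 1.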


<img src="images/softmax_example_network.png" />
<p style="text-align: center;">
    Fig. 3 - Softmax activation function
</p>

<div class="alert alert-block alert-success">
<b>Question (1 pts):</b> In "logistic regression", we also obtained a probability by applying a sigmoid function to the last layer's output. Why can't we apply a sigmoid function to each output neuron of this network instead of a softmax and get a probability vector?
</div>

<div class="alert alert-block alert-success">
<b>Your Answer: If we apply the sigmoid function to each neuron, the outputs generally do not sum to 1, so they do not form a probability vector over the categories. Put differently, there is no coupling between the individual neuron probabilities: each neuron gets its probability independently of the others. </b> 
</div>

***
## 2.6 Automated Classification Training

### 2.6.1 Introduction

We have already explored automated training using backpropagation in the last chapter. There, we had a set of points to which we had to fit a function as closely as possible. The task is similar for classification training. However, instead of y-coordinates for points, we now have discrete categories.

You already have a set of neck lengths and the corresponding categories (see Table 1). In the field of machine learning, this dataset is called __training data__. It specifies the behaviour that the neural net should have. We will use backpropagation to adjust the weights and biases of the network over and over again until the network outputs the same values for a given set of inputs as in the training data. During backpropagation, the network is figuratively "learning" the training data. 
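
For a single sigmoid neuron, the repeated adjustment of weight and bias can be sketched with plain gradient descent on the llama data. This is a simplified illustration, not the training routine used later in this chapter: the learning rate, the number of steps, and the use of cross entropy loss are assumptions of this sketch.

```python
import numpy as np

x = np.array([0.20, 0.23, 0.28, 0.32, 0.35, 0.55, 0.68, 0.74, 0.83, 0.95])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)   # 0 = sheep, 1 = llama

weight, bias = 0.0, 0.0
learning_rate = 1.0        # assumed value, not taken from the notebook

def predict(x: np.ndarray, weight: float, bias: float) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-(weight * x + bias)))

for step in range(5000):
    # gradient of the cross entropy loss w.r.t. the pre-activation: output minus solution
    error = predict(x, weight, bias) - y
    weight -= learning_rate * np.mean(error * x)
    bias -= learning_rate * np.mean(error)

predicted_labels = (predict(x, weight, bias) > 0.5).astype(float)
print("weight={:0.2f}, bias={:0.2f}".format(weight, bias))
print("all training points classified correctly:", np.array_equal(predicted_labels, y))
```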

***
### 2.6.2 Realizing an XOR Gate with a Neural Network

You find yourself working as an engineer at a major electronic component manufacturing company. Your company wants to produce the first XOR gate chip that runs on artificial intelligence. You are given the training data in the form of a truth table:


| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |0          |


<p style="text-align: center;">
    Table 2 - XOR truth table
</p>
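
A well-known property of this truth table, not stated explicitly above, is that no single neuron with a linear decision boundary can reproduce it, which is one reason a hidden layer is used. The brute-force sketch below illustrates this on a coarse grid of weights and biases; the grid range and resolution are arbitrary choices.

```python
import itertools
import numpy as np

xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_outputs = np.array([0, 1, 1, 0])

grid = np.linspace(-2.0, 2.0, 21)   # candidate values for w1, w2 and the bias
best_accuracy = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    # single neuron with a hard threshold at 0
    predictions = (xor_inputs @ np.array([w1, w2]) + b > 0).astype(int)
    best_accuracy = max(best_accuracy, float(np.mean(predictions == xor_outputs)))

print("best accuracy of a single threshold neuron:", best_accuracy)
```

The search never gets more than three of the four rows right.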


In this task we will make use of arrays and matrices to ease the handling of the data and the network parameters. We will also utilize a neural network without biases in order to make the algorithm as simple as possible.
The training data consists of a 2D Array of all possible input states and a 1D Array of all corresponding outputs. 

#### 2.6.2.1 Task : Create Training Data

A training set consists of an input set and a solution set. During supervised training, the network is adjusted until its predictions to the input set match the corresponding predetermined solutions.
Complete the training data below using the truth table.

<div class="alert alert-block alert-success">
<b>Task:</b> Create Training Data. A training set consists of an input set and a solution set. During supervised training, the network is adjusted until its predictions for the input set match the corresponding predetermined solutions (an exact match is not always desirable, see overfitting, but it is in this case). Complete the training data below using the truth table above. Please initialize the solution set as a 2-dimensional array as well.

</div>

In [15]:
xor_input_set: np.ndarray
xor_solution_set: np.ndarray

# STUDENT CODE HERE (1 pt)
xor_input_set = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # These are the input points of the XOR gate
xor_solution_set = np.array([[0], [1], [1], [0]]) # And these are the solutions, in the same order as in xor_input_set
# STUDENT CODE until HERE

# Quick sanity check
assert xor_input_set.shape == (len(xor_input_set),2), f'Expected shape of {(len(xor_input_set),2)}, but found {xor_input_set.shape}'
assert xor_solution_set.shape == (len(xor_solution_set),1), f'Expected shape of {(len(xor_solution_set),1)}, but found {xor_solution_set.shape}'
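
The two-dimensional shape of the solution set matters because of NumPy broadcasting. The following minimal sketch (the names `predictions`, `flat_solutions` and `column_solutions` are illustrative and not part of the notebook) shows how subtracting a 1D array from a column of predictions silently yields a 4x4 matrix instead of four per-sample errors:

```python
import numpy as np

predictions = np.array([[0.1], [0.9], [0.8], [0.2]])  # shape (4, 1), like a network output
flat_solutions = np.array([0, 1, 1, 0])               # shape (4,)
column_solutions = flat_solutions.reshape(-1, 1)      # shape (4, 1)

wrong_error = predictions - flat_solutions    # broadcasts to shape (4, 4)
right_error = predictions - column_solutions  # element-wise, shape (4, 1)

print(wrong_error.shape)
print(right_error.shape)
```

Keeping both arrays as columns means every row of the error corresponds to exactly one training sample.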

#### 2.6.2.2 Initializing the Network
Next, the network has to be defined and initialized. For this task, we use a network with 3 hidden neurons (see Figure 4).

<center>
  <img src="images/3x2_xor_network.png" alt="Neural Network" width="1000"/>
</center>
<p style="text-align: center;"><em>Figure 4: Neural network</em></p>

We define $w_{01}, w_{02}, w_{03}, w_{10}, w_{11}, w_{12}$ all at once by defining a single 2x3 weight matrix $w_{l1}$, and do the same for $w_{l2}$ with a 3x1 matrix. In the code, the matrices are initialized with random values between 0 and 1 (`np.random.rand`).

Run the cell below to define a neural network class for the network depicted above.

In [16]:
class NeuralNetwork:
    def __init__(self, in_dim : int, hl_dim : int, ol_dim : int):
        self.in_dim = in_dim
        self.hl_dim = hl_dim
        self.ol_dim = ol_dim
        self.hl_sum = [0] * hl_dim
        self.hl_activation = [0] * hl_dim
        self.ol_sum = [0] * ol_dim
        self.prediction = 0
        self.b = 0
        self.w_i = np.zeros((in_dim, hl_dim))
        self.w_o = np.zeros((hl_dim, ol_dim))
        self.history = []
        
    def set_conf(self, w_i: np.ndarray, w_o: np.ndarray, b: float):  # w_i and w_o are matrices here
        assert w_i.shape == (self.in_dim, self.hl_dim)
        assert w_o.shape == (self.hl_dim, self.ol_dim)
        
        self.w_i = w_i
        self.w_o = w_o
        self.b = b

    def get_conf(self) -> Dict[str, Union[np.ndarray, float]]:
        configuration = dict()
        configuration['w_i'] = self.w_i
        configuration['w_o'] = self.w_o
        configuration['b'] = self.b
        return configuration

    def get_ex(self) -> Dict[str, float]:
        excitations = dict()
        excitations['hl_sum'] = self.hl_sum
        excitations['hl_activation'] = self.hl_activation
        excitations['ol_sum'] = self.ol_sum
        return excitations
    
    
    def show_conf(self):
        print("weight matrix w_i:")
        print(self.w_i)
        print("\nweight matrix w_o:")
        print(self.w_o)
        print("Bias")
        print(self.b)

    def compute(self, input_set: np.ndarray) -> np.ndarray:
        self.hl_sum = input_set.dot(self.w_i)
        self.hl_activation = relu(self.hl_sum) 
        self.ol_sum = self.hl_activation.dot(self.w_o) + self.b  # hl_activation already went through ReLU
        self.prediction = sigmoid(self.ol_sum)

        return self.prediction
    
    def save_configuration(self):
        self.history.append(self.get_conf())
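
To make the forward pass in `compute` concrete, here is a minimal standalone sketch of the same three steps (weighted sum, ReLU, sigmoid) for a 2-3-1 network. The weights and the bias are hand-picked for illustration and are not the result of any training; `relu` and `sigmoid` are redefined locally so that the snippet runs on its own.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hand-picked (not trained) parameters, chosen for illustration only
w_i = np.array([[1.0, -1.0, 0.0],
                [-1.0, 1.0, 0.0]])       # 2x3: inputs -> hidden layer
w_o = np.array([[10.0], [10.0], [0.0]])  # 3x1: hidden layer -> output
b = -5.0                                 # single output bias

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

hl_sum = inputs.dot(w_i)             # shape (4, 3)
hl_activation = relu(hl_sum)         # shape (4, 3)
ol_sum = hl_activation.dot(w_o) + b  # shape (4, 1)
prediction = sigmoid(ol_sum)         # shape (4, 1)

print(np.round(prediction, 3))
```

With these values the two active hidden neurons compute `relu(x1 - x2)` and `relu(x2 - x1)`, so the output is high only when the inputs differ. This suggests that the architecture can represent XOR even without hidden-layer biases.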

We are going to instantiate the neural network with the desired dimensions, before initializing it with random weights. 

In [17]:
xor_logic_gate_net = NeuralNetwork(in_dim=2, hl_dim=3, ol_dim=1)

def initialize_network(net):
    #np.random.seed(3)
    weight_matrix_i = np.random.rand(net.in_dim, net.hl_dim)  # a 2x3 matrix of weights
    weight_matrix_o = np.random.rand(net.hl_dim, net.ol_dim)  # a 3x1 matrix of weights
    bias = np.random.randn()
    net.set_conf(weight_matrix_i,weight_matrix_o,bias)
    
initialize_network(xor_logic_gate_net) #just a test initialization to illustrate the weight matrices

xor_logic_gate_net.show_conf()

weight matrix w_i:
[[0.12987473 0.79614992 0.9760545 ]
 [0.71622    0.40211837 0.52224639]]

weight matrix w_o:
[[0.98204208]
 [0.27249726]
 [0.51764363]]
Bias
1.9321132477960596
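
`np.random.rand` draws from the interval [0, 1), which is why all weights printed above are non-negative. If weights in [-1, 1) and repeatable runs are wanted, one possible sketch is shown below. The helper `init_weights` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def init_weights(in_dim, hl_dim, ol_dim, seed=None):
    """Hypothetical helper: uniform weights in [-1, 1), reproducible if seed is set."""
    rng = np.random.default_rng(seed)
    w_i = 2.0 * rng.random((in_dim, hl_dim)) - 1.0  # scale [0, 1) to [-1, 1)
    w_o = 2.0 * rng.random((hl_dim, ol_dim)) - 1.0
    return w_i, w_o

w_i_a, w_o_a = init_weights(2, 3, 1, seed=3)
w_i_b, w_o_b = init_weights(2, 3, 1, seed=3)

print(w_i_a)
print(np.array_equal(w_i_a, w_i_b))  # same seed, same weights
```

Fixing the seed is mainly useful for debugging. For the exercise itself, different random starts are part of the experiment.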


#### 2.6.2.3 Defining Training Process
Finally, run the cells below to implement a backpropagation algorithm. Try to understand the code. See Figure 4 for an explanation of the variable names.

In [18]:
def sigmoid_prime(x: np.ndarray) -> np.ndarray: #the derivative of sigmoid
    return sigmoid(x)*(1-sigmoid(x))
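
The form of `sigmoid_prime` follows directly from the definition of the sigmoid function. As a short derivation, assuming the usual definition $\sigma(x) = 1/(1+e^{-x})$:

$$
\sigma'(x) = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}
= \sigma(x)\,\bigl(1-\sigma(x)\bigr)
$$

The derivative is largest at $x = 0$, where $\sigma'(0) = 0.25$, and approaches 0 for large $|x|$.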

In [19]:
def train(net: NeuralNetwork, input_set: np.ndarray, solution_set: np.ndarray, learning_rate: float, epochs: int):
    for t in range(epochs):
        
        net.save_configuration()
        # Forward pass: compute predicted solution_set
        predictions = net.compute(input_set)
        # Compute and print loss
        log_loss = cross_entropy_loss(predictions, solution_set)
        
        if (t % 5 == 0):  # only output every 5th epoch
            print("Loss after Epoch {}: {:0.4f}".format(t, log_loss))

        #unravel variables here for readability
        ol_sum = net.get_ex()['ol_sum']
        hl_activation = net.get_ex()['hl_activation']
        hl_sum = net.get_ex()['hl_sum']
        w_i = net.get_conf()['w_i']
        w_o = net.get_conf()['w_o']
        b = net.get_conf()['b']
        
        # Backpropagation: compute the gradients of the loss with respect to w_i and w_o,
        # starting from the loss at the end and working towards the front
        grad_ol_sum = sigmoid_prime(ol_sum) * (predictions - solution_set)
        grad_w_o = hl_activation.T.dot(grad_ol_sum)  # gradient of the loss with respect to w_o
        grad_hl_activation = grad_ol_sum.dot(w_o.T)  # error propagated back to the hidden layer
        grad_hl_sum = (hl_sum > 0).astype(float)  # the derivative of ReLU: 1 where hl_sum > 0, else 0
        grad_w_i = input_set.T.dot(grad_hl_sum * grad_hl_activation)  # gradient of the loss with respect to w_i

        updated_weight_matrix_i = w_i - learning_rate * grad_w_i
        updated_weight_matrix_o = w_o - learning_rate * grad_w_o
        updated_bias = b - learning_rate * grad_ol_sum.sum()
        net.set_conf(updated_weight_matrix_i, updated_weight_matrix_o,
                       updated_bias)  # Apply updated weights to network
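
One way to gain confidence in hand-written backpropagation is to compare it with numerical gradients. The sketch below is standalone and makes two assumptions that should be read as such: it uses a squared-error loss (the output term `sigmoid_prime(ol_sum) * (predictions - solution_set)` is the output gradient of that loss, whereas for a sigmoid output with cross-entropy the factor would reduce to `predictions - solution_set`), and it takes the ReLU derivative as 1 for positive sums and 0 otherwise. All function names and parameter values are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(params, X):
    hl_sum = X.dot(params["w_i"])
    hl_activation = relu(hl_sum)
    ol_sum = hl_activation.dot(params["w_o"]) + params["b"]
    return hl_sum, hl_activation, sigmoid(ol_sum)

def squared_error(params, X, y):
    # Assumed loss for this sketch: 0.5 * sum of squared errors
    return 0.5 * np.sum((forward(params, X)[2] - y) ** 2)

def backprop(params, X, y):
    hl_sum, hl_activation, p = forward(params, X)
    grad_ol_sum = p * (1 - p) * (p - y)      # sigmoid'(ol_sum) * (p - y)
    grad_hl_activation = grad_ol_sum.dot(params["w_o"].T)
    relu_prime = (hl_sum > 0).astype(float)  # derivative of ReLU
    return {
        "w_i": X.T.dot(relu_prime * grad_hl_activation),
        "w_o": hl_activation.T.dot(grad_ol_sum),
        "b": np.array([grad_ol_sum.sum()]),
    }

def numeric_gradient(params, name, X, y, eps=1e-6):
    # Central finite differences, one parameter entry at a time
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(*params[name].shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        loss_plus = squared_error(params, X, y)
        params[name][idx] = original - eps
        loss_minus = squared_error(params, X, y)
        params[name][idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Illustrative fixed parameters (not trained)
params = {
    "w_i": np.array([[0.5, -0.3, 0.8], [-0.6, 0.4, 0.2]]),
    "w_o": np.array([[0.7], [-0.5], [0.3]]),
    "b": np.array([0.1]),
}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

analytic = backprop(params, X, y)
max_diff = max(
    np.max(np.abs(analytic[name] - numeric_gradient(params, name, X, y)))
    for name in params
)
print(max_diff)  # should be very small if the analytic gradients are right
```

If the two gradients disagree noticeably, the usual suspects are a wrong activation derivative or a transposed matrix product.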

<div class="alert alert-block alert-success">
<b>Task:</b> Choose Hyperparameters and Train
<ul>
<li> Choose an optimal learning rate and number of epochs by trying out values and running the cell below.
<li> If your training data was correct, the network should be ready for use after training.
A successful training should result in a loss smaller than 0.02.
                                                     
<li><b>Hint:</b> Press Shift+Enter on the cell below and then the "up" arrow key to repeat the training easily.

</ul>
</div>

In [20]:
learning_rate: float
epochs: int
# STUDENT CODE HERE (2 pts)
learning_rate = 10
epochs = 35
# STUDENT CODE until HERE

initialize_network(xor_logic_gate_net) #initialize again so you can just run this box and train a new network
train(xor_logic_gate_net, xor_input_set, xor_solution_set,learning_rate,epochs)

Loss after Epoch 0: 0.7771
Loss after Epoch 5: 0.7091
Loss after Epoch 10: 0.5486
Loss after Epoch 15: 0.4827
Loss after Epoch 20: 0.0377
Loss after Epoch 25: 0.0292
Loss after Epoch 30: 0.0248
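
Why the learning rate has to be found by trial and error can be illustrated with a toy problem. The sketch below is not the XOR network. It minimizes the one-dimensional function $f(w) = w^2$ with plain gradient descent, and the function name `gradient_descent` is illustrative.

```python
def gradient_descent(learning_rate, steps=50, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w  # derivative of w**2
        w = w - learning_rate * gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

A very small rate barely moves towards the minimum at 0 within 50 steps, a moderate rate converges, and a rate that is too large overshoots further on every step and diverges. Which rate counts as "too large" depends on the loss surface, so the values for this toy function do not transfer to the network above.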


#### Visualization with a Contour Plot animation

Now that you have hopefully achieved promising results, we want to visualize them with a contour plot animation. The code to create it is located in the following cells. Run them and have a look at the animation. If it does not play properly, look for the video file in your current directory. Ideally, you should see how the net narrows down its prediction to separate the classes from each other.

In [21]:
def create_grid(input_set: np.ndarray):
    '''Helper function to create a numpy grid, which is fed into a neural net.'''
    min_x = input_set[:,0].min()-3
    max_x = input_set[:,0].max()+3
    min_y = input_set[:,1].min()-3
    max_y = input_set[:,1].max()+3

    # create x and y base vectors
    x_grid = np.arange(min_x, max_x, 0.1)
    y_grid = np.arange(min_y, max_y, 0.1)

    # create all of the lines and rows of the grid
    xx, yy = np.meshgrid(x_grid, y_grid)

    # flatten each grid to a vector
    r1, r2 = xx.flatten(), yy.flatten()
    r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))

    # horizontal stack vectors to create x1,x2 input for the model
    grid = np.hstack((r1,r2))
    return xx,yy,grid   

def create_scatter(input_set: np.ndarray, solution_set: np.ndarray, ax: plt.Axes):
    """Helper function, which creates the scatter plot from an input_set and a solution_set"""
    for class_value in range(2):
        # get row indexes for samples with this class
        row_ix, _ = np.where(class_value == solution_set)
        
        # create scatter of these samples
        colors = np.array(["red", "blue"])
        ax.scatter(input_set[row_ix, 0], input_set[row_ix, 1], c=colors[class_value])
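
The round trip in `create_grid` (mesh grid, flattened list of points, predictions reshaped back into a grid) is easier to follow on a tiny example. The sketch below uses a 3x2 grid and a stand-in formula instead of a real network. `fake_prediction` is an illustrative placeholder.

```python
import numpy as np

x_grid = np.arange(0, 3)              # 3 columns
y_grid = np.arange(0, 2)              # 2 rows
xx, yy = np.meshgrid(x_grid, y_grid)  # both have shape (2, 3)

# one (x, y) pair per row, as expected by a model with two inputs
grid = np.hstack((xx.reshape(-1, 1), yy.reshape(-1, 1)))  # shape (6, 2)

# stand-in for a model: any function that maps (N, 2) to (N, 1)
fake_prediction = (grid[:, 0] + 10 * grid[:, 1]).reshape(-1, 1)

zz = fake_prediction.reshape(xx.shape)  # back to the (2, 3) grid for contourf
print(zz)
```

Because flattening and reshaping use the same element order, `zz[i, j]` is the prediction for the point `(xx[i, j], yy[i, j])`, which is what `contourf` expects.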


In [22]:
# this function is actually obsolete, but useful for a static contour plot
def decision_boundary_plot(input_set: np.ndarray, solution_set: np.ndarray, model: NeuralNetwork):
    '''Creates static contour plot'''
    fig, ax = plt.subplots()

    xx, yy, grid = create_grid(input_set)

    # make predictions for the grid
    prediction = model.compute(grid)

    # reshape the predictions back into a grid
    zz = prediction.reshape(xx.shape)

    # plot the grid of x, y and z values as a surface
    contour_plot = ax.contourf(xx, yy, zz, cmap='RdBu')
    fig.colorbar(contour_plot, ax= ax)

    # create scatter plot for samples from each class
    create_scatter(input_set, solution_set, ax)

    ax.set_title("Contourplot") 

    plt.show()

In [23]:
from matplotlib import animation
from matplotlib.animation import FFMpegWriter
from mpl_toolkits.axes_grid1 import make_axes_locatable
from IPython.display import Video

def make_animation_of_net_history(history, input_set, solution_set, filename):
    '''Creates animation from net history'''
    fig, ax = plt.subplots()
    # create grid for the predictions
    xx, yy, grid = create_grid(input_set)

    # full control over the colorbar, necessary for the animation
    div = make_axes_locatable(ax)
    cax = div.append_axes('right', '5%', '5%')  

    plot_title = ax.set_title("")

    def animate(i):

        # instantiate the net and load weights from history into it
        net = NeuralNetwork(in_dim=2, hl_dim=3, ol_dim=1)
        w_i = history[i]["w_i"]
        w_o = history[i]["w_o"]
        b = history[i]["b"]
        net.set_conf(w_i, w_o, b)
        prediction = net.compute(grid)
        zz = prediction.reshape(xx.shape)

        # reset colorbar axes
        cax.cla()
        
        contour_plot = ax.contourf(xx, yy, zz, cmap='RdBu')
        fig.colorbar(contour_plot, cax= cax)

        create_scatter(input_set, solution_set, ax)

        plot_title.set_text(f"Epoch: {i}")
        
        return contour_plot, plot_title

    ani = animation.FuncAnimation(fig, animate, frames=len(history), blit=True)
    ani.save(filename)

    # plt.show()


In [24]:
plt.ioff()
make_animation_of_net_history(xor_logic_gate_net.history, xor_input_set, xor_solution_set, "XOR_contourplot.mp4")
Video('XOR_contourplot.mp4')

<div class="alert alert-block alert-success">
<b>Question (2 pts):</b> Answer the following questions in the answer block below and indicate which question your answer is referring to: <br>
    
1. Why are the losses different each time you run the cell?<br>
2. What is a good learning rate that reaches a loss < 0.02 in < 100 epochs most of the time?<br>
</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b> 
    
1. Because each run starts from different initial values (parameters), and these values are chosen randomly. <br>
2. With learning_rate = 10 and epochs = 35 we reach a loss < 0.02. <br>
</div>

<div class="alert alert-block alert-success">
<b>Task:</b> Classification Test. Run the cell below, change the sliders, and do a validation check on your logic gate.

</div>

In [25]:
def change(input1: float, input2: float):
    input_vector = np.array([input1 * 1, input2 * 1])     # converting bool to float
    prediction = xor_logic_gate_net.compute(input_vector)
    print("\t input: {} \t \t output: {:0.9f}".format(input_vector, prediction[0]))

interact(
    change,
    input1=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
    input2=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
);


<div class="alert alert-block alert-success">
<b>Task:</b> Continuous Input Test. Change the sliders and observe the changes when the input is varied continuously instead of binary.

</div>

In [26]:
interact(change, input1=0.0, input2=0.0);


<div class="alert alert-block alert-success">
<b>Question (5 pts):</b> Answer the following questions in the answer block below and indicate which question your answer is referring to:<br>
    
1. What can you observe when changing the sliders? How would you describe the general relationship between the two inputs and the output (a few words)?<br>
2. Change the sliders to the training data values, e.g. (1.00, 1.00). Does the output match the training data exactly? Why is that the case?<br>
3. The neural network can now do something more than just predicting the values of the input set that you gave it. What "special ability" has your network gained automatically? (Hint: Think about neural networks in general, the XOR gate is just an example)<br>
4. How can this special ability be useful when applying neural networks to self-driving vehicles?<br>
5. Why does this ability make it easier to use a neural network for self-driving vehicles than traditional rule-based programming? (Name one positive and one negative aspect.)<br>
</div>

<div class="alert alert-block alert-success">
<b>Your Answer:</b> 
    
1. Whenever the two inputs approach each other in value, the output approaches 0, and vice versa.<br>
2. No. First, the network gives an approximation, not an exact prediction: we want it to learn the patterns rather than memorize the answers. Second, the network returns a continuous value for the approximation even though the training data was discrete. <br>
3. Generalization! We now have a continuous prediction surface. For cases such as (0.1, 0.5), the model can now make a prediction even though it was not trained on them. <br>
4. For example: a model trained in sunshine also works in rain, fog, and snow; a model trained on highways also works on country roads; a model trained with red cars also recognizes blue and green cars, etc. <br>
5. Because in traditional programming we have to state all rules explicitly (millions of rules), which is not feasible in practice. With neural networks, we have generalization! <br>
</div>

<div class="alert alert-block alert-success">
<b>Task:</b> Create an OR gate:
<ul>
<li> Adjust the solution set.
<li> Choose hyperparameters and train.
<li> Verify your results with a simple test. This is up to you.

<li><b>Hint:</b> Have a look at the truth table of the OR gate below. 
</ul>

</div>

| Input 1| Input 2  | Output    |
|--------|----------|-----------|
|    0   | 0        |0          |
|    0   | 1        |1          |
|    1   | 0        |1          |
|    1   | 1        |1          |


<p style="text-align: center;">
    Table 3 - OR truth table
</p>

In [None]:
# the input set stays the same, but we assign it a new name for code clarity
or_input_set = xor_input_set.copy()

# STUDENT CODE HERE (1 pt)
or_solution_set = np.array([[0], [1], [1], [1]])
# STUDENT CODE until HERE

In [None]:
or_logic_gate_net = NeuralNetwork(in_dim = 2, hl_dim=3, ol_dim=1)

learning_rate: float
epochs: int
# STUDENT CODE HERE (2 pts)
learning_rate = 0.1
epochs = 50
# STUDENT CODE until HERE
initialize_network(or_logic_gate_net)
train(or_logic_gate_net, or_input_set, or_solution_set,learning_rate,epochs)


Loss after Epoch 0: 0.5009
Loss after Epoch 5: 0.4886
Loss after Epoch 10: 0.4772
Loss after Epoch 15: 0.4665
Loss after Epoch 20: 0.4562
Loss after Epoch 25: 0.4464
Loss after Epoch 30: 0.4368
Loss after Epoch 35: 0.4274
Loss after Epoch 40: 0.4182
Loss after Epoch 45: 0.4092


In [29]:
make_animation_of_net_history(or_logic_gate_net.history, or_input_set, or_solution_set, "OR_contourplot.mp4")
Video("OR_contourplot.mp4")

In [33]:
# write a short test in this code block
# STUDENT CODE HERE (2 pts)

def change(input1: float, input2: float):
    input_vector = np.array([input1 * 1, input2 * 1])
    prediction = or_logic_gate_net.compute(input_vector)
    print("\t input: {} \t \t output: {:0.9f}".format(input_vector, prediction[0]))
    
interact(
    change,
    input1=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
    input2=FloatSlider(min=0, max=1, step=1, layout=Layout(width="22%")),
);

# STUDENT CODE until HERE
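
Besides the interactive sliders, the gate can also be checked automatically against its truth table. The following sketch is one possible way to do that. `check_gate` and `hand_set_or_neuron` are hypothetical names, and the stand-in model is a single sigmoid neuron with hand-picked weights, used only so that the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hand_set_or_neuron(inputs):
    """Stand-in model (assumption): one sigmoid neuron with hand-picked weights."""
    weights = np.array([[10.0], [10.0]])
    bias = -5.0
    return sigmoid(inputs.dot(weights) + bias)

def check_gate(predict, input_set, solution_set, threshold=0.5):
    """Return True if every thresholded prediction matches the truth table."""
    predictions = predict(input_set)
    decisions = (predictions > threshold).astype(int)
    return bool(np.array_equal(decisions, solution_set))

or_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
or_solutions = np.array([[0], [1], [1], [1]])

print(check_gate(hand_set_or_neuron, or_inputs, or_solutions))
```

In the notebook, the stand-in could be replaced by `or_logic_gate_net.compute` together with `or_input_set` and `or_solution_set`. The threshold of 0.5 is a design choice that turns the continuous sigmoid output into a binary decision.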


## 2.7 Neural Networks Using Deep Learning Libraries

In [34]:
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]], dtype = 'float64')

y_train = np.array([[0],
                    [1],
                    [1],
                    [0]], dtype = 'float64')

### 2.7.2 PyTorch Example
This example is just a reference for what the syntax looks like when using PyTorch. You do not need to install PyTorch or run this cell.

In [35]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 3, bias=True)  # input layer -> 3 hidden neurons
        self.fc2 = nn.Linear(3, 1, bias=True)  # hidden layer -> 1 output neuron

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

net = Net()

inputs = torch.from_numpy(x_train).float()
targets = torch.from_numpy(y_train).float()

criterion = nn.BCELoss()  # binary cross-entropy loss
optimizer = optim.Adam(net.parameters(), lr=0.01)

print("Training loop:")
for idx in range(0, 201):
    for sample, target in zip(inputs, targets):
        optimizer.zero_grad()   # zero the gradient buffers
        output = net(sample)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()    # apply the parameter update
    if idx % 50 == 0:
        # prints the loss of the last sample in this epoch
        print("Epoch: {: >8}  |  Loss: {}".format(idx, loss.item()))

Training loop:
Epoch:        0  |  Loss: 0.7773979902267456
Epoch:       50  |  Loss: 0.6953570246696472
Epoch:      100  |  Loss: 0.6885163187980652
Epoch:      150  |  Loss: 0.6886417269706726
Epoch:      200  |  Loss: 0.6892629861831665
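
To interpret the logged values, it helps to know what `nn.BCELoss` measures: the binary cross-entropy between prediction and target (averaged over the elements by default). The pure-Python sketch below computes it for a single sample. The function name `binary_cross_entropy` is illustrative.

```python
import math

def binary_cross_entropy(prediction, target):
    """BCE for one sample; prediction must lie strictly between 0 and 1."""
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1 - prediction))

for p in (0.5, 0.9, 0.99):
    print(p, binary_cross_entropy(p, 1.0))
```

For a prediction of 0.5 the loss equals ln 2 ≈ 0.693 regardless of the target. A logged loss close to that value therefore indicates a prediction close to 0.5 for that sample, i.e. the network is still undecided.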


## 2.8 Outlook: Classification Tests in the Real World

A classic application of neural networks is the classification of images. A commonly used dataset is CIFAR-10, which consists of:  
 1. Images of airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks (10 categories)
 2. Labels attached to each image that categorize the image
 

<center>
  <img src="images/cifar10_plot.png" alt="CIFAR-10 dataset[4]" width="600"/>
</center>
<p style="text-align: center;"><em>Figure 3: CIFAR-10 dataset[4]</em></p>

 
The labels (also called annotations) act as the "solution" for the training set. Each item (airplane, car, ...) is a separate category. 
During training, the weights and biases in the network are adjusted in just the right way, until it performs the right mathematical operations to correctly classify the given training data. After training, the network can recognize whether the image is a cat, an airplane, etc. This even works for pictures that the network has never seen. You will find out how neural networks can perform image classification in the next class.

### Sources:
[1] Wikipedia, "Statistical classification", https://en.wikipedia.org/wiki/Statistical_classification, retrieved 01.05.2019

[2] Brownlee, Jason (2018). Machine Learning Algorithms From Scratch. p. 70

[3] Gibbs, M.N. (Nov 2000). "Variational Gaussian process classifiers". IEEE Transactions on Neural Networks. pp. 1458–1464.

[4] Corochann, "CIFAR-10, CIFAR-100 Dataset Introduction", https://corochann.com/cifar-10-cifar-100-dataset-introduction-1258.html, retrieved 02.02.2019


### Further Reading

The Sigmoid Function in Logistic Regression: http://karlrosaen.com/ml/notebooks/logistic-regression-why-sigmoid/

Why Softmax uses exponential function: https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization

# Feedback and Recap

<div class="alert alert-block alert-success">
<b>Question (3 pts):</b>  Please summarize in a few sentences what you learned in this exercise.
</div>

<div class="alert alert-block alert-success">
<b>Your Answer: I have learned how neural networks really work and how they solve the classification problem, for both binary and multi-class classification with different activation functions.</b> 
</div>

## And give us feedback if you like


1) Do you think this task was designed well? 

2) Where can we improve this task?

<strong>Thanks for participating in LAMA! :)</strong>