## Banknote classification with fcNN without hidden layer compared to fcNN with hidden layer

**Goal:** In this notebook you will do your first classification. You will see that fully connected networks without a hidden layer can only learn linear decision boundaries, while fully connected networks with hidden layers are able to learn non-linear decision boundaries.

**Usage:** The idea of the notebook is that you try to understand the provided code. Run it, check the output, and play with it by slightly changing the code. 

**Dataset:** You work with a banknote data set and a classification task. The data has 5 columns derived from wavelet-transformed images of banknotes, 4 features and the class label:
>1. variance  (continuous feature) 
>2. skewness (continuous feature) 
>3. curtosis (continuous feature) 
>4. entropy (continuous feature) 
>5. class (binary label indicating whether the banknote is real or fake)  

Don't worry too much about exactly how these features were derived.

For this analysis we only use 2 features. 

>x1: skewness of wavelet transformed image  
>x2: entropy of wavelet transformed image


**The goal is to classify each banknote as either "real" (Y=0) or "fake" (Y=1).**


**Content:**
* visualize the data in a simple scatter plot and color the points by the class label
* use the Keras library to build a fcNN without hidden layers (logistic regression). Use SGD with the objective to minimize the crossentropy loss. 
* visualize the learned decision boundary in a 2D plot
* use the Keras library to build a fcNN with a single hidden layer. Use SGD with the objective to minimize the crossentropy loss. 
* visualize the learned decision boundary in a 2D plot
* compare the performance and the decision boundaries of the two models
* stack more hidden layers onto the model and play around with the epochs



| [open in colab](https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_02/nb_ch02_01.ipynb)




#### Imports

In the next two cells, we load all the required libraries and functions from Keras, NumPy, and Matplotlib. We also download the data with its 5 columns (4 features and the class label) from the provided URL.

In [2]:
# load required libraries:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')

import keras 
from keras.models import Sequential
from keras.layers import Dense 
from keras.utils import to_categorical 
from keras import optimizers

Using TensorFlow backend.


In [3]:
# Load data from url
from urllib.request import urlopen
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)

(1372, 5)


Let's extract the two features *x1: skewness of wavelet transformed image* and *x2: entropy of wavelet transformed image*. We print the shapes and see that X holds 1372 observations with two features each and Y holds 1372 binary labels.

In [24]:
# Extract the two features (columns 1 and 3) and the label (column 4) of the dataset
X=dataset[:,[1,3]]
Y=dataset[:,4]
Y_c=to_categorical(Y,2)
print(X.shape)
print(Y.shape)

(1372, 2)
(1372,)
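The cell above also builds `Y_c` with `to_categorical`, which turns each label into a one-hot row so that it matches a network with two output nodes. As a rough NumPy-only sketch of that encoding (the helper `one_hot` and the small array `y_demo` are made up for illustration and are not part of the notebook):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an (n, num_classes) array with a 1 in the column of each label."""
    labels = np.asarray(labels).astype(int)
    return np.eye(num_classes)[labels]

# Hypothetical labels, standing in for a few entries of Y
y_demo = np.array([0., 1., 1., 0.])
y_demo_c = one_hot(y_demo, 2)
print(y_demo_c)        # rows [1, 0] for Y=0 and [0, 1] for Y=1
print(y_demo_c.shape)  # (4, 2)
```

For the full label vector `Y`, the same encoding would give an array of shape `(1372, 2)`.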


In [48]:
np.unique(Y,return_counts=True)

(array([0., 1.]), array([762, 610]))

Since the banknotes are described by only 2 features, we can easily visualize the positions of real and fake banknotes in the 2D feature space. You can see that the two classes cannot be separated by a straight line. A curved boundary will do better, but even then we cannot expect a perfect separation.
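A scatter plot of this kind could be produced roughly as follows. This is only a sketch: the function name `plot_banknotes`, the colors, and the random placeholder points are assumptions made so that the block runs on its own. In the notebook you would pass the arrays `X` and `Y` extracted above instead.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_banknotes(X, Y, ax=None):
    """Scatter plot of two features, colored by class label (0 = real, 1 = fake)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    for label, name, color in [(0, "real (Y=0)", "tab:blue"),
                               (1, "fake (Y=1)", "tab:red")]:
        mask = Y == label
        ax.scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.5, color=color, label=name)
    ax.set_xlabel("x1: skewness")
    ax.set_ylabel("x2: entropy")
    ax.legend()
    return ax

# Random placeholder points so the block is self-contained; NOT the banknote data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
Y_demo = (rng.random(200) < 0.5).astype(float)
ax = plot_banknotes(X_demo, Y_demo)
```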


### fcNN with one hidden layer 

We know that the boundary between the two classes is not described well by a straight line. A single neuron is therefore not appropriate for modeling the probability of a fake banknote from its two features. To get a more flexible model, we introduce an additional layer between the input layer and the output layer, called a hidden layer. Here we use a hidden layer with 8 neurons. We also change the number of output nodes from 1 to 2, so that we get one output each for the probability of a real and of a fake banknote. Because we now have 2 outputs, we use the *softmax* activation function in the output layer. The softmax activation ensures that the outputs can be interpreted as probabilities (see the book for details). Note that the network defined in the next cell is a much deeper variant (hidden layers with 20, six times 500, and 20 neurons), as explored in the exercise below.
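To make the role of the softmax concrete, here is a small NumPy sketch. The logit values are invented for illustration, and `softmax` and `sigmoid` are hand-written helpers rather than Keras functions. It shows that the two outputs are positive and sum to 1, and that a two-class softmax is equivalent to a sigmoid applied to the difference of the two raw outputs.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (logits) of the two output nodes for three banknotes
logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 0.5]])
probs = softmax(logits)
print(probs)              # each row: [p(real), p(fake)]
print(probs.sum(axis=1))  # each row sums to 1

# Two-class softmax equals a sigmoid of the logit difference
print(np.allclose(probs[:, 1], sigmoid(logits[:, 1] - logits[:, 0])))
```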

In [0]:
# Definition of the network
model = Sequential()
model.add(Dense(20, batch_input_shape=(None, 2),activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(500, activation='relu'))

model.add(Dense(20, activation='relu'))

model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

In [125]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_141 (Dense)            (None, 20)                60        
_________________________________________________________________
dense_142 (Dense)            (None, 500)               10500     
_________________________________________________________________
dense_143 (Dense)            (None, 500)               250500    
_________________________________________________________________
dense_144 (Dense)            (None, 500)               250500    
_________________________________________________________________
dense_145 (Dense)            (None, 500)               250500    
_________________________________________________________________
dense_146 (Dense)            (None, 500)               250500    
_________________________________________________________________
dense_147 (Dense)            (None, 500)               250500    
__________
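The parameter counts in the summary can be checked by hand: a `Dense` layer with bias has one weight per input-output pair plus one bias per unit. The following sketch recomputes the counts from the layer widths in the model definition above (the helper name `dense_param_counts` is made up for this example):

```python
def dense_param_counts(layer_sizes):
    """Parameters per Dense layer: inputs * units weights plus units biases."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# 2 input features, followed by the layer widths from the model definition
sizes = [2, 20] + [500] * 6 + [20, 2]
counts = dense_param_counts(sizes)
print(counts)       # [60, 10500, 250500, 250500, 250500, 250500, 250500, 10020, 42]
print(sum(counts))  # 1273122
```

The first seven values agree with the rows of the summary shown above.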

#### Add more hidden layers and play around with the training epochs
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  
Exercise: Add more hidden layers to the model and play around with the number of training epochs. What do you observe? Look at the learned decision boundary. How do the loss and the accuracy change?



In [126]:
### works only for a quite deep model
model.evaluate(X,Y_c)



[0.6892015112037214, 0.5553935860058309]

In [130]:
-np.log(0.5)

0.6931471805599453
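One way to read these two numbers is as chance-level baselines. The sketch below recomputes them from the class counts printed earlier. That the evaluated model sits at these baselines is an interpretation of the output, not something the notebook states explicitly.

```python
import numpy as np

# Class counts reported above by np.unique(Y, return_counts=True)
n_real, n_fake = 762, 610
n_total = n_real + n_fake

# Cross-entropy when every banknote gets probability 0.5 for each class
chance_loss = -np.log(0.5)

# Accuracy when every banknote is assigned to the larger class (Y=0)
majority_accuracy = n_real / n_total

print(chance_loss)        # 0.6931...
print(majority_accuracy)  # 0.5553...
```

The evaluated loss of about 0.689 is close to the chance-level loss, and the evaluated accuracy matches the share of the larger class, which would be consistent with a model that assigns (almost) all banknotes to class 0.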

In the next cell, we train the network. In other words, we use stochastic gradient descent to tune the parameters, which were initialized randomly, so as to minimize our loss function (the categorical crossentropy). We set the batch size to 128 per update step and train for 400 epochs.
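To illustrate what "batch size" and "epoch" mean in mini-batch SGD, here is a simplified NumPy sketch. It is not what Keras runs internally: it trains a single sigmoid neuron (the model without a hidden layer) instead of the softmax network, uses synthetic placeholder data instead of the banknote features, and the learning rate and epoch count are arbitrary choices for the example.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def train_sgd(X, y, lr=0.1, batch_size=128, epochs=20, seed=0):
    """Mini-batch SGD for a single sigmoid neuron (logistic regression)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = [binary_crossentropy(y, sigmoid(X @ w + b))]  # loss before training
    for _ in range(epochs):
        order = rng.permutation(n)             # reshuffle the data every epoch
        for start in range(0, n, batch_size):  # one update step per batch
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + b)
            err = (p - y[idx]) / len(idx)      # gradient of the mean loss w.r.t. z
            w -= lr * (X[idx].T @ err)
            b -= lr * err.sum()
        losses.append(binary_crossentropy(y, sigmoid(X @ w + b)))
    return w, b, losses

# Synthetic placeholder data with the same number of rows as the banknote set
rng = np.random.default_rng(1)
y_demo = (rng.random(1372) < 0.5).astype(float)
X_demo = rng.normal(size=(1372, 2)) + 2.0 * y_demo[:, None]

steps_per_epoch = math.ceil(len(X_demo) / 128)
w, b, losses = train_sgd(X_demo, y_demo)
print(steps_per_epoch)        # 11 update steps per epoch for 1372 rows
print(losses[0], losses[-1])  # loss before and after training
```

With 1372 observations and a batch size of 128, each epoch consists of 11 update steps, the last one on a smaller batch of 92 observations.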

Let's look again at the learning curve: we plot the accuracy and the loss against the epochs. You can see that after 100 epochs we classify around 86% of our data correctly and have a loss of around 0.29 (these values can vary from run to run). This is already a lot better than the model without a hidden layer.