# Calculation of the cross entropy loss (NLL) function for classification tasks


**Goal:** In this notebook you will use Keras to set up a CNN for classifying MNIST images and calculate the cross entropy loss that the CNN achieves before it has been trained. You will first calculate the cross entropy loss for a binary classification problem and then for a classification problem with ten classes. Using basic numpy functions, you will calculate the loss that is expected from random guessing and see that an untrained CNN is no better than guessing.

**Usage:** Before working through this notebook, we recommend reading chapter 4.2. The idea of the notebook is that you try to understand the provided code by running it, checking the output, and playing with it by slightly changing the code and rerunning it.

**Dataset:** You work with the MNIST dataset. The training set contains 60'000 greyscale images of handwritten digits (0-9), each 28x28 pixels in size.

**Content:**
* load the original MNIST data 
* create a subset of the data to make it a binary classification problem
* define a CNN in Keras
* evaluate the cross entropy loss of the untrained CNN using Keras for only two classes
* evaluate the cross entropy loss of the untrained CNN using Keras for all ten classes
* implement the loss function yourself using the predicted probabilities and numpy


[open in colab](https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_04/nb_ch04_02.ipynb)



#### Imports

First you load all the required libraries. 

In [0]:
# load required libraries:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')
from sklearn.metrics import confusion_matrix

import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Convolution2D, MaxPooling2D, Flatten , Activation
from tensorflow.keras.utils import to_categorical 
from tensorflow.keras import optimizers






#### Loading and preparing the MNIST data
You download the MNIST data, normalize the pixel values to lie between 0 and 1, and create a sub-dataset that contains only images with the labels 0 and 1.

In [4]:
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_train=x_train / 255 #divide by 255 so that they are in range 0 to 1
X_train=np.reshape(X_train, (X_train.shape[0],28,28,1))
Y_train=tensorflow.keras.utils.to_categorical(y_train,10) # one-hot encoding

# define sub data containing only images with 0 or 1
idx = (y_train==0)|(y_train==1)

X_train_sub = X_train[idx]
Y_train_sub = y_train[idx]
Y_train_sub=tensorflow.keras.utils.to_categorical(Y_train_sub,2) # one-hot encoding

Y_train.shape, X_train.shape, Y_train_sub.shape, X_train_sub.shape

((60000, 10), (60000, 28, 28, 1), (12665, 2), (12665, 28, 28, 1))
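
To make the one-hot encoding step more concrete, here is a minimal numpy sketch of what such an encoding looks like. The helper `one_hot` and the toy labels are hypothetical names used only for illustration; they are not part of Keras and only mimic the idea behind `to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an (n, num_classes) array with a single 1 per row (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]

toy_labels = np.array([0, 1, 1, 0])      # made-up integer labels
toy_one_hot = one_hot(toy_labels, 2)     # row i has a 1 in column toy_labels[i]
print(toy_one_hot)
print(toy_one_hot.shape)                 # (4, 2)
```

Taking `np.argmax` along axis 1 recovers the original integer labels, which is what the numpy solution further down relies on.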

## CNN model

You use the same CNN model as in chapter 2. First you will use it to evaluate the loss for only two classes (0 and 1) and then for all ten classes of the MNIST digits.

In [0]:
# here we define the hyperparameters of the CNN
batch_size = 128
nb_classes = 2  # for the sub data we only have 2 classes
img_rows, img_cols = 28, 28
kernel_size = (3, 3)
input_shape = (img_rows, img_cols, 1)
pool_size = (2, 2)

In [6]:
# define CNN with 2 convolution blocks and 2 fully connected layers
model = Sequential()

model.add(Convolution2D(8,kernel_size,padding='same',input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(8, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Convolution2D(16, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(Convolution2D(16,kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Flatten())
model.add(Dense(40))
model.add(Activation('relu'))
model.add(Dense(2))
model.add(Activation('softmax'))

# compile model and initialize weights
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



In [7]:
# summarize model along with number of model weights
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 8)         80        
_________________________________________________________________
activation (Activation)      (None, 28, 28, 8)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 8)         584       
_________________________________________________________________
activation_1 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 8)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 16)        1168      
_________________________________________________________________
activation_2 (Activation)    (None, 14, 14, 16)        0
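
The parameter counts in the summary can be checked by hand. The following is a small sketch, assuming each convolution filter has one weight per kernel position and input channel plus one bias; `conv2d_params` is a hypothetical helper name, not a Keras function.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Number of trainable weights of a 2D convolution layer with bias (hypothetical helper)."""
    kernel_height, kernel_width = kernel_size
    # Each output channel has kernel_height * kernel_width * in_channels weights plus 1 bias
    return (kernel_height * kernel_width * in_channels + 1) * out_channels

print(conv2d_params((3, 3), 1, 8))    # 80,   first convolution (greyscale input)
print(conv2d_params((3, 3), 8, 8))    # 584,  second convolution
print(conv2d_params((3, 3), 8, 16))   # 1168, third convolution
```

The three values match the `Param #` column of the convolution layers listed above. Activation and pooling layers have no trainable weights, which is why they show 0.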

### Evaluation of the untrained model using Keras

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  


*Exercise 1: Compute the cross entropy loss with Keras and compare the result with the value that you would expect for a classification problem with two classes. Remember that the network is untrained, so its predictions are basically just random guesses.*  

It is best to use the function model.evaluate() to get the cross entropy loss for the untrained network. The inputs to this function are the images and the corresponding true labels; it returns the cross entropy loss and the accuracy of the predictions. Note that we use the sub-dataset with only two classes (0 and 1).




In [0]:
# Write your code here

Scroll down to see the solution.

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

In [8]:
model.evaluate(X_train_sub, Y_train_sub)



[0.6908787820633814, 0.4676668]

If you know nothing about the training dataset, you would expect each class with equal probability, so for every image your predicted probability for each class would be 1/nr_of_classes. In the case of two classes, you predict each class with a probability of around 0.5. The resulting cross entropy (x-entropy), which is the negative log likelihood, is calculated below.

In [9]:
nr_of_classes=2
-np.log(1/nr_of_classes)

0.6931471805599453
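
The same value also comes out when the categorical cross entropy is written out in full for one-hot labels and perfectly uniform predictions. This is only a sketch on made-up labels (the names `true_class`, `y_one_hot` and `p_uniform` are illustrative and do not come from the notebook), not the computation Keras performs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 2                                 # toy sample size and number of classes

true_class = rng.integers(0, k, size=n)        # made-up true labels
y_one_hot = np.eye(k)[true_class]              # one-hot encoded labels
p_uniform = np.full((n, k), 1.0 / k)           # every class predicted with probability 1/k

# Categorical cross entropy: mean over examples of -sum_j y_j * log(p_j)
nll = -np.mean(np.sum(y_one_hot * np.log(p_uniform), axis=1))
print(nll, np.log(k))                          # both equal log(2)
```

Because only the true class contributes to the inner sum, the result does not depend on which labels were drawn; it is always -log(1/k).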

#### Return to the book 
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/Page_turn_icon_A.png" width="120" align="left" />  

Let's now work with the full MNIST dataset (all ten classes). You can use the same network architecture as before; the only thing you need to change is the number of output nodes, which you set to ten to work with all digit classes from 0 to 9.

In [0]:
nb_classes = 10

# define CNN with 2 convolution blocks and 2 fully connected layers
model = Sequential()

model.add(Convolution2D(8,kernel_size,padding='same',input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(8, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Convolution2D(16, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(Convolution2D(16,kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Flatten())
model.add(Dense(40))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# compile model and initialize weights
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [11]:
# summarize model along with number of model weights
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 28, 28, 8)         80        
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 28, 28, 8)         584       
_________________________________________________________________
activation_7 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 14, 14, 8)         0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 14, 14, 16)        1168      
_________________________________________________________________
activation_8 (Activation)    (None, 14, 14, 16)       

Here you predict the class probabilities for all images in the training dataset. You have not trained the network yet, so the predicted probability of each class will be around 10% for every image.

In [0]:
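# Calculate the predicted class probabilities for the training data
# (predict returns the softmax output; predict_proba was removed in newer TensorFlow versions)
Pred_prob = model.predict(X_train)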

In [13]:
Pred_prob[0:5]

array([[0.0980892 , 0.10156239, 0.10729206, 0.09290784, 0.13197447,
        0.10113405, 0.08253013, 0.08996263, 0.09509624, 0.09945098],
       [0.09687394, 0.09786396, 0.1102694 , 0.10245232, 0.13668509,
        0.10452785, 0.07775477, 0.08944198, 0.09143064, 0.09270005],
       [0.09365467, 0.10288849, 0.10690026, 0.09287751, 0.13751161,
        0.10155033, 0.08650888, 0.0892804 , 0.09370886, 0.09511901],
       [0.09355306, 0.10104182, 0.10930493, 0.10792493, 0.11838662,
        0.10044384, 0.08611196, 0.09196585, 0.09831635, 0.09295069],
       [0.09381933, 0.09867891, 0.11066364, 0.0970021 , 0.13890381,
        0.10495119, 0.0880869 , 0.08708301, 0.08927269, 0.09153841]],
      dtype=float32)

In [14]:
Pred_prob.shape, Y_train.shape

((60000, 10), (60000, 10))
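
One way to see why the predicted probabilities are all close to 0.1 is to look at the softmax function itself. The sketch below assumes that the inputs to the softmax layer of an untrained network are small and similar in size; the toy logits are random numbers and are not taken from the model above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
toy_logits = rng.normal(scale=0.1, size=(5, 10))   # small made-up logits for 5 images
toy_probs = softmax(toy_logits)

print(toy_probs.sum(axis=1))                # each row sums to 1
print(toy_probs.min(), toy_probs.max())     # all values stay close to 1/10
```

When all ten inputs are nearly equal, the softmax output is nearly uniform, which is the situation you observe in Pred_prob.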

#### Calculate the loss function using numpy
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

Exercise 2: Use numpy to calculate the value of the negative log-likelihood loss (= x-entropy) that you expect for the untrained CNN that you constructed above to discriminate between the 10 classes in MNIST. Then use numpy to empirically determine the x-entropy that results from the probabilities (Pred_prob) that you received from the untrained CNN by predicting the probabilities of each of the ten classes for all images. To determine the x-entropy of the prediction, you can loop over the examples and use each example's true label (Y_train) and the predicted probability for the true class. Do you get the x-entropy value that you expected?




In [0]:
# Write your code here

Scroll down to see the solution.

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

In the next cell you calculate the log of the predicted probability of the true class for each single image, then you sum these values and divide the sum by the number of training examples. Taking the negative of this result gives the negative log likelihood, also known as categorical cross entropy.

In [0]:
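loss=np.zeros(len(X_train))           # log-probability of the true class, one entry per image
Y=np.argmax(Y_train,axis=1)           # recover the integer labels from the one-hot encoding
for i in range(0,len(X_train)):
  loss[i]=np.log(Pred_prob[i][Y[i]])  # log of the predicted probability of the true class
-np.sum(loss)/len(X_train)            # negative mean log-likelihood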

2.300853642523289
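
The loop above can also be written without an explicit Python loop by using numpy's integer array indexing. The following sketch runs on a small made-up probability matrix (`toy_prob` and `toy_true` are illustrative names, not notebook variables) and checks that both versions agree.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 6, 10                                         # toy number of images and classes

toy_prob = rng.dirichlet(np.ones(k) * 50, size=n)    # made-up probabilities, rows sum to 1
toy_true = rng.integers(0, k, size=n)                # made-up integer labels

# Loop version, mirroring the solution above
loop_loss = np.zeros(n)
for i in range(n):
    loop_loss[i] = np.log(toy_prob[i, toy_true[i]])
nll_loop = -np.sum(loop_loss) / n

# Vectorized version: pick the probability of the true class in every row at once
nll_vec = -np.mean(np.log(toy_prob[np.arange(n), toy_true]))

print(nll_loop, nll_vec)                             # the two values agree
```

With the notebook's variables, the vectorized expression would correspond to indexing Pred_prob with the row numbers and the label vector Y.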

If you know nothing about the training dataset, your predicted probability for each class would be 1/nr_of_classes for every image. In the case of 10 classes, you would predict each class with a probability of around 0.1. The corresponding negative log likelihood is calculated below:

In [0]:
nr_of_classes=10
-np.log(1/nr_of_classes)

2.3025850929940455

The model.evaluate function from Keras gives practically the same loss as your numpy calculation, and it also returns the accuracy, which is close to the 10% expected from random guessing.

In [0]:
model.evaluate(X_train, Y_train)



[2.300853636042277, 0.12145]
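
The second value returned by model.evaluate is the accuracy. As a rough sketch of how such a number can be obtained with numpy, assuming that the predicted class is simply the one with the highest probability, consider the following toy example (all arrays are made up for illustration).

```python
import numpy as np

# Made-up predicted probabilities for 4 images and 3 classes
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.6, 0.3],
                     [0.3, 0.4, 0.3],
                     [0.2, 0.2, 0.6]])
# Made-up one-hot encoded true labels (classes 0, 1, 0, 2)
toy_one_hot = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 1]])

predicted_class = toy_prob.argmax(axis=1)    # class with the highest probability
true_class = toy_one_hot.argmax(axis=1)      # integer label from the one-hot vector
accuracy = np.mean(predicted_class == true_class)
print(accuracy)                              # 0.75, three of four images are correct
```

Applied to Pred_prob and Y_train, the same two argmax calls would give the fraction of training images for which the untrained CNN happens to pick the right digit.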