# Calculation of the cross entropy loss (NLL) function for classification


**Goal:** In this notebook you will learn how to calculate the cross entropy loss, which is also the negative log likelihood (NLL), for an untrained neural network. You will first calculate the cross entropy loss for a classification problem with two classes and then do the same for ten classes. You will see that an untrained CNN is no better than guessing. We will verify this result by calculating in numpy the loss for the predicted probabilities of a random guess.

**Usage:** The idea of the notebook is that you try to understand the provided code by running it, checking the output and playing with it by slightly changing the code and rerunning it. 

**Dataset:** You work with the MNIST dataset. Its training set contains 60'000 greyscale images of digits (0-9), each 28x28 pixels.

**Content:**
* load the original MNIST data 
* create a subset of the data to make it a binary classification problem
* define a CNN in Keras
* evaluation of the cross entropy loss function of the untrained CNN using Keras for only two classes
* evaluation of the cross entropy loss function of the untrained CNN using Keras for all classes
* implement the loss function yourself using the predicted probabilities and numpy


| [open in colab](https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_04/nb_ch04_02.ipynb)



#### Imports

In the next two cells, we load all the required libraries. We download the MNIST data, normalize the pixel values to be between 0 and 1, and create a subset of the data that contains only the labels 0 and 1.

In [0]:
# load required libraries:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')
from sklearn.metrics import confusion_matrix

import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Convolution2D, MaxPooling2D, Flatten , Activation
from tensorflow.keras.utils import to_categorical 
from tensorflow.keras import optimizers






#### Loading and preparing the MNIST data

In [2]:
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_train=x_train / 255 #divide by 255 so that they are in range 0 to 1
X_train=np.reshape(X_train, (X_train.shape[0],28,28,1))
Y_train=tensorflow.keras.utils.to_categorical(y_train,10) # one-hot encoding

idx = (y_train==0)|(y_train==1)

X_train_sub = X_train[idx]
Y_train_sub = y_train[idx]
Y_train_sub=tensorflow.keras.utils.to_categorical(Y_train_sub,2) # one-hot encoding

Y_train.shape, X_train.shape, Y_train_sub.shape, X_train_sub.shape

((60000, 10), (60000, 28, 28, 1), (12665, 2), (12665, 28, 28, 1))
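
The two preparation steps above (one-hot encoding and selecting the images labelled 0 or 1) can be tried out on a toy label vector in plain numpy. This is only a sketch: the array `toy_labels` and the helper `one_hot` are made up for this illustration, while the notebook itself uses `to_categorical` from Keras.

```python
import numpy as np

# Hypothetical toy labels standing in for y_train (not real MNIST data)
toy_labels = np.array([0, 3, 1, 1, 9, 0])

def one_hot(labels, num_classes):
    """Illustrative numpy one-hot encoding for integer labels (hypothetical helper)."""
    return np.eye(num_classes, dtype="float32")[labels]

# Keep only the examples labelled 0 or 1, as done for the binary subset
mask = (toy_labels == 0) | (toy_labels == 1)
sub_labels = toy_labels[mask]

encoded = one_hot(sub_labels, 2)
print(sub_labels)      # labels that survive the mask
print(encoded.shape)   # one row per kept example, one column per class
```

Each row of `encoded` contains a single 1 in the column of the true class, which is the label format that `categorical_crossentropy` expects.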

## CNN model

We use the same CNN model as in chapter 2. First we will use it to evaluate the loss for only two classes (0 and 1) and then for all ten classes of the MNIST digits.

In [0]:
# here we define hyperparameter of the CNN
batch_size = 128
nb_classes = 10
img_rows, img_cols = 28, 28
kernel_size = (3, 3)
input_shape = (img_rows, img_cols, 1)
pool_size = (2, 2)

In [4]:
# define CNN with 2 convolution blocks and 2 fully connected layers
model = Sequential()

model.add(Convolution2D(8,kernel_size,padding='same',input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(8, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Convolution2D(16, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(Convolution2D(16,kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Flatten())
model.add(Dense(40))
model.add(Activation('relu'))
model.add(Dense(2))
model.add(Activation('softmax'))

# compile model and initialize weights
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



In [5]:
# summarize model along with number of model weights
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 8)         80        
_________________________________________________________________
activation (Activation)      (None, 28, 28, 8)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 8)         584       
_________________________________________________________________
activation_1 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 8)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 16)        1168      
_________________________________________________________________
activation_2 (Activation)    (None, 14, 14, 16)        0

### Evaluation of the untrained model using Keras
In the next cell we use the model.evaluate() function to get the cross entropy loss for the untrained network. The inputs to this function are the images and the corresponding true labels; the function returns the cross entropy loss and the accuracy of the prediction. Note that we use the subset of the data with only two classes (0 and 1). 

<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  


*Exercise 1: Compute the cross entropy loss with Keras and compare the result with the value that you would expect for a classification problem with two classes. Remember that the network is untrained and the predictions are basically just random guesses.*  




In [0]:
# Write your code here

Scroll down to see the solution.

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

In [7]:
model.evaluate(X_train_sub, Y_train_sub)



[0.69635132471187, 0.3659692]

If we know nothing about the training dataset, our best guess for every image is a probability of 1/nr_of_classes for each class. With 2 classes, we predict every class with a probability of about 0.5. The corresponding negative log likelihood is calculated below.

In [8]:
nr_of_classes=2
-np.log(1/nr_of_classes)

0.6931471805599453
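
The same number comes out of the full categorical cross entropy formula when every prediction is exactly 0.5. The following is a minimal sketch on made-up data: `y_true` and `p_uniform` are assumptions for illustration, not outputs of the network above.

```python
import numpy as np

# Assumed toy setup: 4 images, 2 classes, one-hot true labels
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
# A "know-nothing" prediction assigns probability 0.5 to each class
p_uniform = np.full_like(y_true, 0.5)

# Categorical cross entropy: minus the mean over examples of sum_k y_k * log(p_k)
loss_uniform = -np.mean(np.sum(y_true * np.log(p_uniform), axis=1))
print(loss_uniform, np.log(2))
```

The value reported by `model.evaluate` is close to, but not exactly, this number because the untrained network's outputs are only approximately 0.5.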

#### Return to the book 
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/Page_turn_icon_A.png" width="120" align="left" />  

Now we want to work with the full MNIST dataset (all ten classes). We use the same network architecture as before; the only thing we change is the number of output nodes. We will use ten, one for each digit from 0 to 9.

In [0]:
# define CNN with 2 convolution blocks and 2 fully connected layers
model = Sequential()

model.add(Convolution2D(8,kernel_size,padding='same',input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(8, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Convolution2D(16, kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(Convolution2D(16,kernel_size,padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))

model.add(Flatten())
model.add(Dense(40))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# compile model and initialize weights
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [10]:
# summarize model along with number of model weights
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 28, 28, 8)         80        
_________________________________________________________________
activation_6 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 28, 28, 8)         584       
_________________________________________________________________
activation_7 (Activation)    (None, 28, 28, 8)         0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 14, 14, 8)         0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 14, 14, 16)        1168      
_________________________________________________________________
activation_8 (Activation)    (None, 14, 14, 16)       

Here we predict the probabilities for all training images. Because the network is untrained, the predicted probabilities will be around 10% for each class.

In [0]:
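# Calculate the predicted probabilities for the training data.
# Note: predict_proba was removed from Sequential models in newer TensorFlow
# releases; model.predict returns the same softmax probabilities for this model.
Pred_prob = model.predict(X_train)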

In [12]:
Pred_prob[0:5]

array([[0.09250782, 0.09867508, 0.10000185, 0.10002195, 0.09862246,
        0.10734326, 0.09982929, 0.10206727, 0.10361547, 0.09731548],
       [0.09321542, 0.10242418, 0.09866772, 0.09818424, 0.10036258,
        0.0991603 , 0.10205214, 0.10095597, 0.10177615, 0.10320126],
       [0.094369  , 0.09765789, 0.09780802, 0.09717362, 0.10379422,
        0.10682811, 0.10278382, 0.10072705, 0.10344021, 0.09541803],
       [0.09513659, 0.10402688, 0.09870471, 0.09828294, 0.10176218,
        0.10028785, 0.09771107, 0.10047312, 0.10383096, 0.09978367],
       [0.09448312, 0.10221019, 0.09744866, 0.10018276, 0.09825058,
        0.10830294, 0.10025328, 0.10058228, 0.10219348, 0.09609275]],
      dtype=float32)

In [13]:
Pred_prob.shape, Y_train.shape

((60000, 10), (60000, 10))
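
One way to see why the predicted probabilities above are all close to 0.1: if the values fed into the softmax are all small and of similar size, the softmax output is nearly uniform. The sketch below assumes such small logits. The `softmax` helper and the random `logits` are hypothetical and are not taken from the model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical logits: small random values, assumed to mimic a freshly initialized output layer
logits = rng.normal(loc=0.0, scale=0.05, size=(5, 10))
probs = softmax(logits)

print(probs.round(3))
print(probs.sum(axis=1))  # each row sums to 1
```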

#### Calculate the loss function using numpy
<img src="https://raw.githubusercontent.com/tensorchiefs/dl_book/master/imgs/paper-pen.png" width="60" align="left" />  

Exercise 2: Calculate the loss function for the MNIST example with ten classes using numpy. Use the predicted probabilities (Pred_prob) and loop over each example with the corresponding true label (Y_train). Is this the value that you expected?




In [0]:
# Write your code here

Scroll down to see the solution.

</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>
</br>

In the next cell we calculate the log of the predicted probability of the true class for each single image, then we sum all these individual values and divide the sum by the number of training examples. We take the negative of this result to get the negative log likelihood, also known as the categorical cross entropy.
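
Written as a formula, this reads as follows. The notation is introduced here for illustration and does not appear in the original notebook: $N$ is the number of training images, $y_i$ the true class of image $i$, $p_{i,k}$ the predicted probability of class $k$ for image $i$, and $y_{i,k}$ the one-hot encoded label.

$$
\mathrm{NLL} = -\frac{1}{N}\sum_{i=1}^{N}\log p_{i,\,y_i} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{i,k}\,\log p_{i,k}
$$

The two forms are equal because $y_{i,k}$ is 1 only for the true class and 0 otherwise.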

In [15]:
loss=np.zeros(len(X_train))
Y=np.argmax(Y_train,axis=1)
for i in range(0,len(X_train)):
  loss[i]=np.log(Pred_prob[i][Y[i]])
-np.sum(loss)/len(X_train)

2.300853642523289
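
The loop can also be replaced by numpy indexing. The sketch below compares the loop with two vectorized variants. It uses made-up, roughly uniform probabilities (`pred_prob`, `true_class` and `y_onehot` are hypothetical stand-ins for `Pred_prob` and `Y_train`), so it runs without the model.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-ins for Pred_prob and Y_train (not the real model output)
n_examples, n_classes = 1000, 10
raw = rng.random((n_examples, n_classes)) + 5.0   # roughly uniform scores
pred_prob = raw / raw.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
true_class = rng.integers(0, n_classes, size=n_examples)
y_onehot = np.eye(n_classes)[true_class]

# Loop version, mirroring the solution above
loss = np.zeros(n_examples)
for i in range(n_examples):
    loss[i] = np.log(pred_prob[i][true_class[i]])
nll_loop = -np.sum(loss) / n_examples

# Vectorized version: pick the probability of the true class in every row
picked = pred_prob[np.arange(n_examples), true_class]
nll_vec = -np.mean(np.log(picked))

# Equivalent formulation using the one-hot labels directly
nll_onehot = -np.mean(np.sum(y_onehot * np.log(pred_prob), axis=1))

print(nll_loop, nll_vec, nll_onehot)
```

All three values agree up to floating point error; the vectorized forms avoid the Python loop over all examples.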

If we know nothing about the training dataset, our best guess for every image is a probability of 1/nr_of_classes for each class. With 10 classes, we predict every class with a probability of about 0.1. The corresponding negative log likelihood is calculated below.

In [16]:
nr_of_classes=10
-np.log(1/nr_of_classes)

2.3025850929940455
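
The baseline depends only on the number of classes, so it can be tabulated once as a reference. In this small sketch the class counts 2 and 10 are the cases from this notebook; 100 and 1000 are extra values added purely for illustration, and the dictionary `baselines` is a name chosen here.

```python
import numpy as np

# Baseline NLL of a uniform guess, -log(1/K), for a few class counts K
baselines = {k: -np.log(1.0 / k) for k in (2, 10, 100, 1000)}

for k, value in baselines.items():
    print(f"{k:5d} classes -> baseline loss {value:.4f}")
```

A freshly initialized classifier whose loss is far above the baseline for its number of classes is making confident wrong predictions, which is worth checking before training.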

We get the same result if we use the model.evaluate function from Keras.

In [17]:
model.evaluate(X_train, Y_train)



[2.300853636042277, 0.12145]