# Evidential Deep Learning and Reliability Evaluation for MNIST Dataset

### In this notebook, Evidential Deep Learning (EDL) is introduced to quantify classification uncertainty, and a new way to measure the reliability of machine learning classifiers is discussed.

The idea of reliability evaluation is going to be a part of [SafeML Project](https://github.com/ISorokos/SafeML).

The EDL part of this notebook is a modified version of another notebook provided by [Michael Ehrlich on GitHub](https://github.com/michaeleh/Evidential-Deep-Learning-to-Quantify-Classification-Uncertainty/blob/main/demo.ipynb).

In [None]:
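import numpy as np
import tensorflow as tf
# Use the Keras bundled with TensorFlow throughout, so the model,
# layers and backend all come from the same package
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from scipy import ndimage
import matplotlib.pyplot as plt
import cv2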

from scipy import stats

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

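A simple LeNet-style network, as in the paper, with no activation on the output layer. The raw outputs are converted into evidence inside the loss function.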

In [None]:
inp = keras.Input(shape=input_shape)

x = layers.Conv2D(filters=5, kernel_size=(5,5), activation='relu')(inp)
x = layers.Conv2D(filters=5, kernel_size=(3, 3), activation='relu')(x)
x = layers.Flatten()(x)
x = layers.Dense(units=512, activation='relu')(x)
x = layers.Dense(units=128, activation='relu')(x)
x = layers.Dense(units=10)(x)


model = keras.Model(inp, x)

model.summary()

#### Source: https://github.com/atilberk/evidential-deep-learning-to-quantify-classification-uncertainty
#### Paper: https://arxiv.org/pdf/1806.01768.pdf

### Loss Functions

There are three different loss functions defined in the paper:

#### 1) Integrating out the class probabilities from the posterior of a Dirichlet prior and Multinomial likelihood, referred to as *Eqn. 3* (as in the paper)

$$
\mathcal{L}_i(\Theta) =
- \log \left( \int \prod_{j=1}^K p_{ij}^{y_{ij}} \frac{1}{B(\alpha_i)} \prod_{j=1}^K p_{ij}^{\alpha_{ij} -1 } d\boldsymbol{p}_i \right)
= \sum_{j=1}^K y_{ij} (\log(S_i) - \log(\alpha_{ij}))
$$
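
As a worked step (a sketch, assuming a one-hot label whose single non-zero entry is at an index written here as $c$, a symbol not used in the original), the product $\prod_j p_{ij}^{y_{ij}}$ reduces to $p_{ic}$, so the integral is simply the Dirichlet mean of $p_{ic}$:

$$
\int p_{ic} \frac{1}{B(\alpha_i)} \prod_{j=1}^K p_{ij}^{\alpha_{ij} -1 } d\boldsymbol{p}_i
= \mathbb{E}[p_{ic}]
= \frac{\alpha_{ic}}{S_i}
\quad \Rightarrow \quad
\mathcal{L}_i(\Theta) = -\log \frac{\alpha_{ic}}{S_i} = \log(S_i) - \log(\alpha_{ic})
$$

This matches the closed form above, because all terms with $y_{ij} = 0$ drop out of the sum.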

#### 2) Using cross-entropy loss, referred to as *Eqn. 4* (as in the paper)

$$
\mathcal{L}_i(\Theta) =
\int \left[ \sum_{j=1}^K -y_{ij} \log(p_{ij}) \right] \frac{1}{B(\alpha_i)} \prod_{j=1}^K p_{ij}^{\alpha_{ij} -1 } d\boldsymbol{p}_i
= \sum_{j=1}^K y_{ij} (\psi(S_i) - \psi(\alpha_{ij}))
$$
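
The step behind this closed form can be sketched as follows, where $\psi$ is the digamma function. It relies on the standard identity for the expected log-probability under a Dirichlet distribution:

$$
\mathbb{E}[\log p_{ij}] = \psi(\alpha_{ij}) - \psi(S_i)
\quad \Rightarrow \quad
\sum_{j=1}^K -y_{ij} \mathbb{E}[\log p_{ij}] = \sum_{j=1}^K y_{ij} (\psi(S_i) - \psi(\alpha_{ij}))
$$

Swapping the sum and the integral is what turns the expected cross-entropy into a sum of these per-class expectations.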

#### 3) Using sum of squares loss, referred to as *Eqn. 5* (as in the paper)

$$
\mathcal{L}_i(\Theta) =
\int ||\boldsymbol{y}_i - \boldsymbol{p}_i||_2^2 \frac{1}{B(\alpha_i)} \prod_{j=1}^K p_{ij}^{\alpha_{ij} -1 } d\boldsymbol{p}_i 
= \sum_{j=1}^K \mathbb{E}[(y_{ij} - p_{ij})^2]
$$

$$
= \sum_{j=1}^K \mathbb{E}[y_{ij}^2 - 2 y_{ij}p_{ij} + p_{ij}^2] 
= \sum_{j=1}^K (y_{ij}^2 - 2 y_{ij}\mathbb{E}[p_{ij}] + \mathbb{E}[p_{ij}^2])
$$

$$
= \sum_{j=1}^K (y_{ij}^2 - 2 y_{ij}\mathbb{E}[p_{ij}] + \mathbb{E}[p_{ij}]^2 + \text{Var}(p_{ij}))
= \sum_{j=1}^K (y_{ij} - \mathbb{E}[p_{ij}])^2 + \text{Var}(p_{ij})
$$

$$
= \sum_{j=1}^K (y_{ij} - \frac{\alpha_{ij}}{S_i})^2 + \frac{\alpha_{ij}(S_i - \alpha_{ij})}{S_i^2(S_i + 1)}
$$

$$
= \sum_{j=1}^K (y_{ij} - \hat{p}_{ij})^2 + \frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{(S_i + 1)}
$$
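
The derivation of *Eqn. 5* can be checked numerically. The block below is a minimal sketch using only the Python standard library: it evaluates the two closed forms from the last two lines of the derivation and compares them with a Monte Carlo estimate of $\mathbb{E}[||\boldsymbol{y} - \boldsymbol{p}||_2^2]$. The function names, the example `alpha` and the sample count are illustrative assumptions, not part of the original notebook.

```python
import random

def sos_alpha_form(alpha, y):
    """Eqn. 5 written with alpha: squared error of the mean plus variance."""
    S = sum(alpha)
    return sum((yj - aj / S) ** 2 + aj * (S - aj) / (S * S * (S + 1))
               for aj, yj in zip(alpha, y))

def sos_p_hat_form(alpha, y):
    """Eqn. 5 written with p_hat = alpha / S (last line of the derivation)."""
    S = sum(alpha)
    p_hat = [aj / S for aj in alpha]
    return sum((yj - pj) ** 2 + pj * (1 - pj) / (S + 1)
               for pj, yj in zip(p_hat, y))

def sos_monte_carlo(alpha, y, n_samples=20000, seed=0):
    """Estimate E||y - p||^2 by sampling p ~ Dir(alpha) via normalised gammas."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        norm = sum(g)
        total += sum((yj - gj / norm) ** 2 for gj, yj in zip(g, y))
    return total / n_samples

alpha = [9.0, 1.0, 1.0]  # made-up Dirichlet parameters (evidence 8 for class 0)
y = [1.0, 0.0, 0.0]      # one-hot label for class 0

closed = sos_alpha_form(alpha, y)
print(closed, sos_p_hat_form(alpha, y), sos_monte_carlo(alpha, y))
```

The two closed forms should agree to floating-point precision, while the sampled estimate should only agree approximately.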

In [None]:
# Source: https://github.com/michaeleh/Evidential-Deep-Learning-to-Quantify-Classification-Uncertainty/blob/main/demo.ipynb

lgamma = tf.math.lgamma
digamma = tf.math.digamma

# Epoch counter, kept in a list so the training loop can update it in place
epochs = [1]

def KL(alpha, num_classes=10):
    # KL divergence between Dir(alpha) and the uniform Dirichlet Dir(1, ..., 1)
    one = K.constant(np.ones((1,num_classes)),dtype=tf.float32)
    S = K.sum(alpha,axis=1,keepdims=True)

    kl = lgamma(S) - K.sum(lgamma(alpha),axis=1,keepdims=True) +\
    K.sum(lgamma(one),axis=1,keepdims=True) - lgamma(K.sum(one,axis=1,keepdims=True)) +\
    K.sum((alpha - one)*(digamma(alpha)-digamma(S)),axis=1,keepdims=True)

    return kl


def loss_func(y_true, output):
    # Evidence is the ReLU of the network output; alpha = evidence + 1
    y_evidence = K.relu(output)
    alpha = y_evidence+1
    S = K.sum(alpha,axis=1,keepdims=True)
    p = alpha / S  # expected class probabilities

    # Sum of squares loss (Eqn. 5): squared error of the mean plus Dirichlet variance
    err = K.sum(K.pow((y_true-p),2),axis=1,keepdims=True)
    var = K.sum(alpha*(S-alpha)/(S*S*(S+1)),axis=1,keepdims=True)

    # Total over all samples in the batch
    l = K.sum(err + var)

    # KL regulariser applied after replacing the true-class alpha with 1,
    # weighted by the annealing coefficient min(1, epoch / 50)
    kl =  K.minimum(1.0, epochs[0]/50) * K.sum(KL((1-y_true)*(alpha)+y_true))
    return l + kl
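
The regulariser in `loss_func` can be illustrated outside TensorFlow. The block below is a standard-library sketch, not the notebook's implementation: `digamma` is a hand-written approximation (recurrence plus asymptotic series), and the names `kl_to_uniform`, `remove_true_class_evidence` and `annealing` are hypothetical helpers introduced only for this example. It mirrors the terms of `KL` above and the substitution `(1 - y_true) * alpha + y_true`.

```python
import math

def digamma(x):
    """Approximate digamma using psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def kl_to_uniform(alpha):
    """KL( Dir(alpha) || Dir(1, ..., 1) ), term by term as in the KL function above."""
    K = len(alpha)
    S = sum(alpha)
    return (math.lgamma(S) - sum(math.lgamma(a) for a in alpha)
            - math.lgamma(K)
            + sum((a - 1.0) * (digamma(a) - digamma(S)) for a in alpha))

def remove_true_class_evidence(alpha, y):
    """Set alpha of the true class to 1 so only misleading evidence is penalised."""
    return [(1.0 - yj) * aj + yj for aj, yj in zip(alpha, y)]

def annealing(epoch):
    """Annealing coefficient min(1, epoch / 50) used in loss_func."""
    return min(1.0, epoch / 50)

alpha = [9.0, 1.0, 1.0]  # made-up example: evidence 8 for class 0

# Evidence on the true class: no penalty
kl_correct = kl_to_uniform(remove_true_class_evidence(alpha, [1.0, 0.0, 0.0]))
# Same evidence, but the true class is 1: the evidence on class 0 is penalised
kl_wrong = kl_to_uniform(remove_true_class_evidence(alpha, [0.0, 1.0, 0.0]))

print(kl_correct, kl_wrong, annealing(1), annealing(60))
```

In this example the penalty is zero when the evidence supports the true class and positive when it supports a wrong class, which is what the regulariser is meant to do.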

In [None]:
batch_size = 1024
model.compile(loss=loss_func, optimizer="adam", metrics=['accuracy'])

from tqdm import tqdm

# Note: pass i trains for epochs[0] = i + 1 epochs, so the 30 passes add up to 465 epochs
for i in tqdm(range(30)):
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs[0], verbose=0, validation_split=0.2)
    epochs[0]+=1

In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

In [None]:
def rotate(im,deg):
    # Rotate by `deg` degrees; the output array grows to contain the whole rotated image
    return ndimage.rotate(im, deg)

## Theory of Evidence

Source: https://github.com/atilberk/evidential-deep-learning-to-quantify-classification-uncertainty

Paper: https://arxiv.org/pdf/1806.01768.pdf

Suppose that there are $K$ outputs of an NN. Then we can write the following equality
$$u + \sum_{k = 1}^{K} b_k = 1$$
where $b_k$ is the *belief mass* of the $k^{th}$ class, derived from the $k^{th}$ ReLU output, and $u$ is the *uncertainty mass* of the prediction.

Each $b_k$ is defined as follows
$$b_k =\frac{e_k}{S}$$
where $e_k$ is the evidence for the $k^{th}$ class and $S$ is the strength of the Dirichlet distribution we will use, defined as
$$S = \sum_{k = 1}^{K} (e_k + 1)$$
which leaves the following share for $u$
$$u = \frac{K}{S}$$


Replacing $e_k + 1$ with $\alpha_k$
$$\alpha_k = e_k + 1$$
and using the resulting parameter vector $\boldsymbol{\alpha}$ in a Dirichlet distribution gives the density
$$
D(\boldsymbol{p}|\boldsymbol{\alpha}) = \begin{cases} 
      \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} p_i^{\alpha_i - 1} & \text{for } \boldsymbol{p} \in \mathcal{S}_K \\
      0 & \text{otherwise}
   \end{cases}
$$

Here $\mathcal{S}_K$ is the $K$-dimensional unit simplex
$$\mathcal{S}_K = \{ \boldsymbol{p} | \sum_{i=1}^K p_i = 1 \text{ and } 0 \leq p_1,...,p_K \leq 1 \}$$
and the expected probability of the $k^{th}$ class can still be calculated as
$$\hat{p}_k = \frac{\alpha_k}{S}$$
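
The definitions above can be made concrete with a small worked example. This is a minimal standard-library sketch, with the helper name `belief_and_uncertainty` and the evidence values chosen purely for illustration.

```python
def belief_and_uncertainty(evidence):
    """Return belief masses b_k, uncertainty mass u and expected probabilities p_hat_k."""
    K = len(evidence)
    S = sum(e + 1.0 for e in evidence)          # Dirichlet strength
    belief = [e / S for e in evidence]          # b_k = e_k / S
    u = K / S                                   # u = K / S
    p_hat = [(e + 1.0) / S for e in evidence]   # p_hat_k = alpha_k / S
    return belief, u, p_hat

# Strong evidence for class 0 (made-up values)
b, u, p_hat = belief_and_uncertainty([8.0, 0.0, 0.0])
print(b, u, p_hat)

# No evidence at all: all mass goes to uncertainty, probabilities are uniform
b0, u0, p0 = belief_and_uncertainty([0.0, 0.0, 0.0])
print(b0, u0, p0)
```

With no evidence the uncertainty mass is exactly 1 and the expected probabilities are uniform, so a total lack of evidence is what produces a maximally uncertain prediction.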

In [None]:
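def calc_prob_uncertinty(p):
    # p holds the raw model outputs for a batch containing a single image
    evidence = np.maximum(p[0], 0)  # ReLU turns the outputs into evidence

    alpha = evidence + 1

    # Uncertainty mass u = K / S, with K the number of classes
    u = len(alpha) / alpha.sum()
    # Expected probability of the most likely class
    prob = alpha[np.argmax(alpha)] / alpha.sum()
    return prob, u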

In [None]:
d = 1
digit = 1
angles_range = list(range(0,180,10))

test_labels = np.argmax(y_test,axis=1)

predictions = []
uncertinties = []
probabilities= []
imgs = []
for angle in angles_range:    

    im = x_test[np.where(test_labels==digit)[0][0]]    
    shape = im.shape
    im = rotate(im, angle)    
    im = cv2.resize(im,shape[:-1],interpolation = cv2.INTER_AREA)
    imgs.append(im)
    p = model.predict(np.array([im.reshape(shape)]))  
    prob, uncertinty = calc_prob_uncertinty(p)
    uncertinties.append(uncertinty)
    probabilities.append(prob)  
    predictions.append(np.argmax(p))

plt.plot(angles_range,probabilities, label=f'Class={d}',marker='o')
plt.plot(angles_range,uncertinties, label='Uncertainty',marker='o')

plt.xlabel('Angle')
plt.ylabel('Probability')
plt.legend()
plt.title('Probability on Rotation')
plt.grid()
plt.show()

plt.plot(angles_range,predictions, label=f'Class={d}',marker='o')
plt.xlabel('Angle')
plt.ylabel('Predictions')
plt.legend()
plt.title('Prediction per Rotation')
plt.grid()
plt.show()

f,axs = plt.subplots(1,len(imgs),figsize=(10,20))
for ax,im in zip(axs.ravel(),imgs):
    ax.imshow(im,cmap='gray')
plt.show()

## Reliability Evaluation of the Classification Algorithm (Stable Operational Profile)

We assume, in line with the literature, that the black-box reliability is expressed as the probability of not failing on a randomly chosen input $d_r \in D$ [[1]](https://doi.org/10.1016/j.ress.2020.107193).

Assuming that each class is an operational profile of the digit recognition task, and also assuming no prior knowledge about the occurrence of failures
within partitions, the priors $f_i (x)$ are set to $Beta(\boldsymbol{\alpha_{i}} = 1, \boldsymbol{\beta_{i}} = 1)$. Let $N_{i}$ be the number of test images provided as input to the algorithm and $r_{i}$ the number of failures.

The Dirichlet distribution $D(\boldsymbol{\alpha_{1}},..., \boldsymbol{\alpha_{n}})$ modeling the OPP before the new observation, with the new information $N_{1}, ..., N_{n}$, will become:

$$
D(\boldsymbol{\alpha_{1}}+N_{1},..., \boldsymbol{\alpha_{n}}+N_{n})
$$
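
As a small illustration of this update, the sketch below adds observed counts to the Dirichlet parameters and reports the posterior mean of each partition's probability, which for a Dirichlet distribution is each parameter divided by the sum of all parameters. The function name `updated_opp`, the three-partition setup and the counts are made-up assumptions for the example.

```python
def updated_opp(prior_alpha, counts):
    """Dirichlet update: add observed counts, then return parameters and their mean."""
    posterior = [a + n for a, n in zip(prior_alpha, counts)]
    total = sum(posterior)
    mean = [a / total for a in posterior]
    return posterior, mean

# Uniform prior over three partitions and hypothetical observation counts
posterior, mean = updated_opp([1.0, 1.0, 1.0], [50, 30, 17])
print(posterior, mean)
```

The more observations are collected, the less the prior parameters influence the estimated profile.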


Based on equation 14, the updated distribution of the conditional probability of failure in recognising class $i$ in the operation profile or partition $S_{i}$ will be:

$$f_{F_{i}} = Beta(\boldsymbol{\alpha_{i}} + r_{i},\boldsymbol{\beta_{i}} + N_{i} - r_{i})$$

The expected value of $f_{F_{i}}$ can be calculated as:

$$E[F_{i}] = \frac{\boldsymbol{\alpha_{i}} + r_{i}}{\boldsymbol{\beta_{i}} + \boldsymbol{\alpha_{i}} + N_{i}}$$

Taking every $OPP_{i}$ to be equal to $1/10$, the reliability can be calculated as:

$$
E[R] = 1 - \sum_{i=1}^{10} OPP_{i} \times E[F_{i}] = 1 - 0.1\times \sum_{i=1}^{10} \frac{\boldsymbol{\alpha_{i}} + r_{i}}{\boldsymbol{\beta_{i}} + \boldsymbol{\alpha_{i}} + N_{i}}
$$
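
The formula can be tried on made-up numbers before applying it to the classifier. The following standard-library sketch assumes $Beta(1, 1)$ priors and equal partition weights by default; the function name `expected_reliability` and the failure counts are hypothetical.

```python
def expected_reliability(failures, totals, opp=None, prior_alpha=1.0, prior_beta=1.0):
    """Return E[R] and the per-partition E[F_i] for Beta priors on the failure rates."""
    n = len(totals)
    if opp is None:
        opp = [1.0 / n] * n  # equal weight for every partition
    # Posterior mean of Beta(alpha + r, beta + N - r)
    e_f = [(prior_alpha + r) / (prior_alpha + prior_beta + N)
           for r, N in zip(failures, totals)]
    reliability = 1.0 - sum(w * f for w, f in zip(opp, e_f))
    return reliability, e_f

# Ten classes with 98 test images each and a few hypothetical failures
totals = [98] * 10
failures = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]

reliability, e_f = expected_reliability(failures, totals)
print(reliability)
```

Note that a class with zero observed failures still contributes $1/(N_i + 2)$ to the expected failure rate, which is the effect of the uniform prior.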

It should be noted that the partitions can be made more detailed by separating conditions such as rain, light, rotation, etc. [2,3]. The example considered here is a heavily simplified version.

[[1] Pietrantuono, R., Popov, P., & Russo, S. (2020). Reliability assessment of service-based software under operational profile uncertainty. Reliability Engineering & System Safety, 204, 107193.](https://doi.org/10.1016/j.ress.2020.107193)

[[2] Zhang, M., Zhang, Y., Zhang, L., Liu, C., & Khurshid, S. (2018, September). DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE) (pp. 132-142).](https://doi.org/10.1145/3238147.3238187)

[[3] Jöckel, L., Kläs, M., & Martínez-Fernández, S. (2019, July). Safe traffic sign recognition through data augmentation for autonomous vehicles software. In 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C) (pp. 540-541).](https://doi.org/10.1109/QRS-C.2019.00114)

In [None]:
y_pred1 = model.predict(x_test)
y_pred =  np.argmax(y_pred1,axis=1)

# y_test is one-hot encoded, so recover the integer labels before comparing
y_true = np.argmax(y_test,axis=1)
wrong = y_true != y_pred

# Separating Wrong Responses of the CNN Classifier
X_test_wrong, y_test_wrong = x_test[wrong], y_true[wrong]

# Separating Correct Responses of the CNN Classifier
X_test_correct, y_test_correct = x_test[~wrong], y_true[~wrong]

E_F = np.zeros(10)

for ii in range(10):
    # MNIST labels run from 0 to 9, so class ii is used directly
    r = np.sum(y_test_wrong == ii)          # failures in class ii
    N = r + np.sum(y_test_correct == ii)    # test images in class ii

    # Posterior mean of Beta(1 + r, 1 + N - r)
    E_F[ii] = (1 + r)/(1+1+N)

# Equal operational profile weight of 1/10 per class
Reliability = 1 - 0.1*(sum(E_F))

print(Reliability)