# VISUM Explainable AI Hands-on Session

### Learning Objectives 

- Learn to train a **classification** model for the task **Fish vs. Non-Fish**
- **Explore** and compare **explanations** generated by different **interpretability methods**
- Learn how to use **interpretability** saliency maps to do **weakly-supervised object detection**

### Python packages


In [None]:
%tensorflow_version 1.x

In [None]:
!pip install innvestigate

In [None]:
!git clone --recursive https://github.com/visum-summerschool/visum-2020.git

In [None]:
import ast
import cv2
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from keras import losses
from keras.models import Model
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
from keras.applications.densenet import DenseNet121
from keras.applications.densenet import preprocess_input

# import os
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
tf.get_logger().setLevel('ERROR')

### Dataset

The data used in this hands-on session is a **subset** of the **training data** used in the **VISUM Project**. The subset contains **images without fish**, images containing **one fish**, and images containing **several fish**. 

![alt text](https://drive.google.com/uc?export=view&id=1kXcMAhYWKogEH7qfJMTfJ2jTgomC8Udq)

The **csv file** provided has the following structure: **sequence**, **frame**, **bounding-box** values ([[$x_{min}$, $y_{min}$, $x_{max}$, $y_{max}$]]), and **label**. The **bounding-box** labels are **not needed** for the **classification** problem, but we will use them later to evaluate the bounding boxes generated from the saliency maps. All the other fields are necessary for this task: the sequence and frame identify the image, and the label is the ground-truth annotation used to supervise the learning process. 

![alt text](https://drive.google.com/uc?export=view&id=17nSUhC-ggGedEpef_-vwfSvujuTA-MWu)
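
Because the bounding boxes are stored in the csv as text, they have to be parsed before they can be used. The snippet below is a minimal sketch of that step; the string is a made-up example of what a `box_coords` cell might contain, not a value taken from the dataset.

```python
import ast

# Hypothetical csv cell: a list of [x_min, y_min, x_max, y_max] boxes stored as text
raw_boxes = "[[12, 40, 96, 120], [130, 60, 200, 110]]"

# literal_eval only parses Python literals, which makes it safer than eval
boxes = ast.literal_eval(raw_boxes)

for x_min, y_min, x_max, y_max in boxes:
    width, height = x_max - x_min, y_max - y_min
    print('box at ({}, {}) with size {}x{}'.format(x_min, y_min, width, height))
```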

In [None]:
# read data to dataframe
# images were resized to 224x224, bounding-box values were converted accordingly
df = pd.read_csv('/content/visum-2020/xAI/data.csv') 

In [None]:
#split data into train, validation and test. 
#Note: there could be images of the same sequence in different folds 

test_size=0.01
val_size=0.2
train_idxs, test_idxs, _, _ = train_test_split(df.index.values, df.label.values, test_size=test_size, stratify=df.label.values, random_state=19)

train_df = df.loc[df.index.isin(train_idxs)]
test_df = df.loc[df.index.isin(test_idxs)]

tr_idxs, val_idxs, _, _ = train_test_split(train_df.index.values, train_df.label.values, test_size=val_size, stratify=train_df.label.values, random_state=19)

tr_df = train_df.loc[train_df.index.isin(tr_idxs)]
val_df = train_df.loc[train_df.index.isin(val_idxs)]    

print('Train data:',len(tr_df), 'images\nValidation data:', len(val_df), 'images\nTest data:',len(test_df), 'images')
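
As noted in the comment above, a random split can place frames of the same sequence in different folds, which may make the validation and test results look better than they would on unseen sequences. One possible alternative is to split by sequence instead of by image. The sketch below illustrates the idea in plain Python; `split_by_sequence` and the toy sequence ids are hypothetical, and, unlike the split used in this session, this version does not stratify by label.

```python
import random

def split_by_sequence(seq_ids, test_fraction=0.2, seed=19):
    """Return (train_idx, test_idx) so that no sequence appears in both."""
    unique_seqs = sorted(set(seq_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_seqs)

    # hold out a fraction of the sequences (at least one)
    n_test = max(1, int(round(len(unique_seqs) * test_fraction)))
    test_seqs = set(unique_seqs[:n_test])

    train_idx = [i for i, s in enumerate(seq_ids) if s not in test_seqs]
    test_idx = [i for i, s in enumerate(seq_ids) if s in test_seqs]
    return train_idx, test_idx

# Toy example: ten frames belonging to five sequences
toy_seq_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
toy_train_idx, toy_test_idx = split_by_sequence(toy_seq_ids)
print('train frames:', toy_train_idx)
print('test frames:', toy_test_idx)
```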

In [None]:
#Load train data
train_imgs = [] 
train_imgs_orig = [] 
train_labels = [] 

for seq, frame, clf in zip(tr_df['seq'], tr_df['frame'], tr_df['label']): 
    path = '/content/visum-2020/xAI/images/seq' + str(seq) + '/img' + str(frame) + '.jpg'
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    train_imgs_orig.append(img)
    #preprocess for pre-trained model
    img = preprocess_input(img)
    
    train_imgs.append(img)
    train_labels.append(int(clf))

train_imgs_orig = np.asarray(train_imgs_orig)                
train_imgs = np.asarray(train_imgs)
train_labels = np.asarray(train_labels)

In [None]:
#Load validation data
val_imgs = [] 
val_imgs_orig = [] 
val_labels = [] 

for seq, frame, clf in zip(val_df['seq'], val_df['frame'], val_df['label']): 
    path = '/content/visum-2020/xAI/images/seq' + str(seq) + '/img' + str(frame) + '.jpg'
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    val_imgs_orig.append(img)
    #preprocess for pre-trained model
    img = preprocess_input(img)
    
    val_imgs.append(img)
    val_labels.append(int(clf))

val_imgs_orig = np.asarray(val_imgs_orig)                
val_imgs = np.asarray(val_imgs)
val_labels = np.asarray(val_labels)

In [None]:
#Load test data 
test_imgs = [] 
test_imgs_orig = [] 
test_bbs = []
test_labels = [] 

for seq, frame, bb, clf in zip(test_df['seq'], test_df['frame'], test_df['box_coords'], test_df['label']): 
    path = '/content/visum-2020/xAI/images/seq' + str(seq) + '/img' + str(frame) + '.jpg'
    img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    test_imgs_orig.append(img)
    
    img = preprocess_input(img)
    
    test_imgs.append(img)
    test_labels.append(int(clf))
    if(int(clf)==1): 
        bb = ast.literal_eval(bb)
        test_bbs.append(bb)

test_imgs_orig = np.asarray(test_imgs_orig)                
test_imgs = np.asarray(test_imgs)
test_labels = np.asarray(test_labels)

In [None]:
#Show test data
%matplotlib inline

fig = plt.figure(figsize=(28,16))
for ii in range(test_imgs_orig.shape[0]): 

    ax = fig.add_subplot(4, 7, ii+1)
    ax.imshow(test_imgs_orig[ii])
    if test_labels[ii] == 0:
        plt.title('Without Fish')
    else:
        plt.title('Fish')
    plt.axis('off')

plt.show()

### Model 
---


The **classification model** we use in this session is the well-known DenseNet (https://arxiv.org/pdf/1608.06993.pdf), in particular, the **DenseNet-121**. 

![alt text](https://drive.google.com/uc?export=view&id=1vVAgK7L82dMz8IMu1Cq0Ma6tgSAPmk8H)

A typical strategy used in Computer Vision to reduce overfitting and ease the learning process is **transfer learning**. The most common way of doing transfer learning is to **initialize the network** with the **weights** obtained by training the convolutional neural network on the large **ImageNet** dataset (http://www.image-net.org/).

In [None]:
dense_model = DenseNet121(include_top=True, weights='imagenet', input_shape=(224,224,3))
output = Dense(1, activation='sigmoid', name='clf')(dense_model.layers[-2].output)

#Then create the corresponding model 
model = Model(inputs=dense_model.input, outputs=output)
model.summary()

model.compile(optimizer='adadelta', loss=losses.binary_crossentropy, metrics=['accuracy'])

In [None]:
##TRAINING CODE 
##To save the best performing model in validation
#checkpoint = ModelCheckpoint('/content/visum-2020/xAI/model.hdf5', monitor='val_loss', save_best_only=True, save_weights_only=True, verbose=True, mode='min')

##Train model
#model.fit(train_imgs, train_labels, batch_size=16, epochs=10, verbose=1, validation_data=(val_imgs, val_labels), callbacks=[checkpoint])

##TESTING CODE
model.load_weights('/content/visum-2020/xAI/model.hdf5')

### Interpretability Methods

---

In this section we focus on **interpretability methods** that generate the saliency maps we need to perform weakly-supervised object detection. For this, we rely on the well-known **iNNvestigate** toolbox (https://github.com/albermax/innvestigate), which gives easy access to several state-of-the-art interpretability methods.  

To provide you with an example, we select the **Deep Taylor** interpretability method (https://www.sciencedirect.com/science/article/pii/S0031320316303582?via%3Dihub) to produce saliency maps for the test data.

![alt text](https://drive.google.com/uc?export=view&id=1NmzocyO6R-EcRbT0B98ic1mf1tlR-Lst)
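
An analyzer returns one relevance value per pixel and per colour channel, so the result has to be collapsed into a single 2D map before it can be displayed or thresholded. The sketch below shows one way to do this with NumPy only, mirroring the channel sum and max-abs normalisation used later in this notebook; `to_saliency_map` is a hypothetical helper name and the toy array stands in for a real analyzer output.

```python
import numpy as np

def to_saliency_map(attribution):
    """Collapse an (H, W, C) attribution array to an (H, W) map scaled to [-1, 1]."""
    # sum the relevance of the colour channels for every pixel
    saliency = attribution.sum(axis=-1)

    # scale by the largest absolute value; an all-zero map is returned unchanged
    peak = np.max(np.abs(saliency))
    if peak == 0:
        return np.zeros_like(saliency)
    return saliency / peak

# Toy attribution with two relevant pixels
toy_attribution = np.zeros((4, 4, 3))
toy_attribution[0, 0] = [1.0, 0.0, 0.0]
toy_attribution[1, 2] = [1.0, 2.0, -5.0]
toy_map = to_saliency_map(toy_attribution)
print(toy_map.shape)
```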

In [None]:
#Import method from iNNvestigate toolbox
from innvestigate.analyzer import create_analyzer
analyzer = create_analyzer("deep_taylor", model)

In [None]:
%matplotlib inline
#Show and save test data saliency maps

maps_test = [] 

for orig, img, clf in zip(test_imgs_orig, test_imgs, test_labels): 
    # add a batch dimension: (H, W, C) -> (1, H, W, C)
    img_res = np.expand_dims(img, axis=0)

    # round the sigmoid output to get the predicted class (0 or 1)
    pred = int(np.around(model.predict(img_res)[0, 0]))

    # relevance per pixel and channel, summed over the channel axis
    res = analyzer.analyze(img_res)        
    img_int = res.sum(axis=np.argmax(np.asarray(res.shape) == 3))

    # normalise by the maximum absolute value; an all-zero map is kept as is
    # so that maps_test stays aligned with test_imgs
    max_abs = np.max(np.abs(img_int))
    if max_abs > 0:
        img_int /= max_abs

    img_int = np.reshape(img_int, (img.shape[0], img.shape[1]))  
    maps_test.append(img_int)
    
    if clf == 1:
        figure, ax = plt.subplots(1, 2, figsize=(10,6))
        figure.suptitle('Prediction: {}  Class label: {}'.format(pred, clf))
        ax[0].imshow(orig)
        ax[0].axis('off')
        ax[1].imshow(img_int, cmap='hot')
        ax[1].axis('off')
        plt.show()

You used Deep Taylor to produce the interpretability saliency maps shown above. Now, **consider other interpretability methods** from the iNNvestigate toolbox and compare them in a qualitative fashion (consider for example: **'gradient'**, **'input_t_gradient'**, **'guided_backprop'**, **'lrp.sequential_preset_a_flat'**). **Which ones** do you consider suitable to be used for **weakly supervised object detection**? Why?


Apart from the interpretability method used, **which other factor** has a major **impact** on the **quality of the saliency maps**? Instead of using the trained model, try generating the saliency maps with the original ImageNet-trained DenseNet. 
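
To compare several methods on the same image, one option is to loop over the method names and collect the raw analyzer outputs. The sketch below assumes that the factory passed in behaves like `create_analyzer(name, model)` from iNNvestigate, as used earlier in this notebook; `compare_methods` itself is a hypothetical helper, and some methods may need extra arguments or may not support every layer of the model.

```python
def compare_methods(make_analyzer, model, image_batch, method_names):
    """Run every interpretability method on the same batch and collect the outputs."""
    results = {}
    for name in method_names:
        # e.g. make_analyzer = innvestigate.analyzer.create_analyzer
        analyzer = make_analyzer(name, model)
        results[name] = analyzer.analyze(image_batch)
    return results

# Method names suggested in the text above
method_names = ['gradient', 'input_t_gradient', 'guided_backprop',
                'lrp.sequential_preset_a_flat']
```

In this notebook the call would look like `compare_methods(create_analyzer, model, test_imgs[:1], method_names)`, after which each entry can be reduced to a 2D map and plotted side by side.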

### Generate bounding-boxes based on saliency maps

- In this section, we generate **bounding-boxes** based on the **saliency maps** obtained before. 
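
The cell below relies on OpenCV for Otsu's thresholding and contour extraction. To make the idea concrete, here is a NumPy-only sketch of the two core steps: choosing the threshold that maximises the between-class variance, and taking the bounding box of the pixels above it. The function names are hypothetical, and this simplified version returns a single box around all foreground pixels instead of one box per contour.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu threshold for a uint8 image; pixels > threshold count as foreground."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of the background class
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean up to each level
    mu_total = mu[-1]

    # between-class variance for every candidate threshold
    numerator = (mu_total * omega - mu) ** 2
    denominator = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denominator > 0
    sigma_b[valid] = numerator[valid] / denominator[valid]
    return int(np.argmax(sigma_b))

def mask_to_box(mask):
    """Return [x_min, y_min, x_max, y_max] around all True pixels, or None."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    # +1 so that the box matches the [x, y, x+w, y+h] convention used below
    return [int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1]

# Toy saliency map with one bright rectangle
toy_saliency = np.zeros((10, 10))
toy_saliency[2:5, 3:7] = 1.0
toy_gray = (toy_saliency * 255).astype('uint8')
toy_box = mask_to_box(toy_gray > otsu_threshold(toy_gray))
print(toy_box)
```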

In [None]:
%matplotlib inline

object_preds = [] 

fig = plt.figure(figsize=(30,25))
for ii in range(test_imgs.shape[0]):

    img = np.copy(test_imgs_orig[ii])
    imap = np.copy(maps_test[ii])

    if test_labels[ii] == 1: 
        box = []
        
        blur = cv2.GaussianBlur(imap, (7,7), 0) * 255
        
        ret, ths = cv2.threshold(blur.astype('uint8'), 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU) #Otsu's thresholding after Gaussian filtering
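        # [-2:] keeps (contours, hierarchy) regardless of the OpenCV version
        contours, hierarchy = cv2.findContours(ths, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
        
        # area of the largest contour (0 if nothing was segmented)
        max_area = max([cv2.contourArea(cnt) for cnt in contours], default=0)
        for cnt in contours:
            cnt_area = cv2.contourArea(cnt)
            # discard contours smaller than 10% of the largest one
            if cnt_area < 0.1 * max_area:
                continue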
            x, y, w, h = cv2.boundingRect(cnt)
            img1 = cv2.rectangle(img, (x,y), (x+w,y+h), (0,255,0), 1)
            img2 = cv2.rectangle(imap, (x,y), (x+w,y+h), 0.5, 1)
            box.append([x,y,x+w,y+h])
        object_preds.append(box)

    ax = fig.add_subplot(5, 6, ii+1)
    ax.imshow(img)
    ax.axis('off')
plt.show()

In [None]:
print('Bounding-box predictions:\n')
print(object_preds)
print('\nBounding-box ground-truth:\n')
print(test_bbs)

### Qualitative Evaluation

- In this section, we **compare** in a **qualitative way** the weakly-generated bounding boxes with the available ground-truth for the test data.

In [None]:
%matplotlib inline

j = 0
fig = plt.figure(figsize=(30,25))
for i in range(test_imgs.shape[0]):
    img = np.copy(test_imgs_orig[i])
    
    if test_labels[i] == 1: 
        box_pred = object_preds[j]
        box_gnd = test_bbs[j]
        
        for bb in box_pred: # predictions in green
            x, y, x2, y2 = bb[0], bb[1], bb[2], bb[3]
            cv2.rectangle(img, (x,y), (x2,y2), (0,255,0), 1)

        for bb in box_gnd: # ground-truth in blue
            x, y, x2, y2 = bb[0], bb[1], bb[2], bb[3]
            cv2.rectangle(img, (x,y), (x2,y2), (0,0,255), 1)   

        j+=1         
        
    ax = fig.add_subplot(5, 6, i+1)
    ax.imshow(img)
    ax.axis('off')

plt.show()
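
The visual comparison can be complemented with a number. A common choice for comparing two boxes is the intersection over union (IoU); the sketch below is an illustration that was not part of the original session, and it assumes boxes in the same [x_min, y_min, x_max, y_max] format used for `object_preds` and `test_bbs`.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    # corners of the overlapping region
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # width and height are clipped at zero when the boxes do not overlap
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / float(union) if union > 0 else 0.0

def best_iou(gt_box, predicted_boxes):
    """Highest IoU between one ground-truth box and any predicted box (0 if none)."""
    return max([iou(gt_box, p) for p in predicted_boxes] or [0.0])
```

For one test image, `best_iou(gt, object_preds[j])` could be computed for every `gt` in `test_bbs[j]` and averaged over the positive images.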

### Discussion and additional exercise (Semi-supervised Learning)

- In this tutorial we discussed how to use **interpretability saliency maps** to perform object detection when we only have  **weak labels** (class labels) available.

- Now, imagine a scenario where you have class labels for all images but **bounding-box labels only for a percentage of the dataset**. How would you train such a model? 

- **Suggestion**: in a scenario like the one described, you could first **train a fish classifier** like the one used in this hands-on session, generate the **bounding-box predictions** based on the saliency maps, and then use these bounding-box predictions as **pseudo-labels**, training the detection model as if it were fully supervised. 

- If you have finished this hands-on tutorial and still have time, do the following **additional exercise**: take the **baseline model** provided in the **project competition**, adapt it to this dataset, and **train** it with **20%** of the data **fully annotated** and the remaining **80%** with only **weak annotations**. Compare this with the scenario where you train only on the 20% fully annotated subset.
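
The pseudo-labelling suggestion above can be sketched as a small bookkeeping step: keep the ground-truth boxes where they exist and fall back to the saliency-based predictions elsewhere. Everything in the snippet is hypothetical (the helper name, the image identifiers, and the box values); it only illustrates how the two label sources could be merged before training a detector.

```python
def build_training_boxes(image_ids, annotated_boxes, pseudo_boxes):
    """Map every image id to (boxes, source), preferring real annotations."""
    targets = {}
    for image_id in image_ids:
        if image_id in annotated_boxes:
            targets[image_id] = (annotated_boxes[image_id], 'ground_truth')
        else:
            # boxes generated from the saliency maps; may be empty
            targets[image_id] = (pseudo_boxes.get(image_id, []), 'pseudo')
    return targets

# Toy example: one fully annotated image and two weakly annotated ones
toy_ids = ['seq1/img1', 'seq1/img2', 'seq2/img1']
toy_annotated = {'seq1/img1': [[10, 20, 50, 60]]}
toy_pseudo = {'seq1/img1': [[12, 18, 48, 64]], 'seq1/img2': [[30, 30, 90, 80]]}
toy_targets = build_training_boxes(toy_ids, toy_annotated, toy_pseudo)
for image_id in toy_ids:
    print(image_id, toy_targets[image_id])
```

Keeping the source next to each box makes it possible to weight pseudo-labels differently from real annotations during training, should that turn out to be useful.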