## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional Neural Networks are widely used in computer vision. They are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we introduce Grad-CAM, which produces a heatmap over the input image highlighting the regions that are important for a visual question answering (VQA) task.

* **To be submitted**: this notebook in two weeks, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!

* NB: if `PIL` is not installed, try `conda install pillow`.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

import torchvision.transforms as transforms
from PIL import Image
import cv2

import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Visual Question Answering problem
Given an image and a question in natural language, the model chooses the most likely answer from 3 000 classes according to the content of the image. The VQA task is thus a multi-class classification problem.
<img src="vqa_model.PNG">
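
To make the "most likely answer" step concrete, here is a minimal sketch of how raw class scores can be turned into a ranked list of answers. The function `top_k_answers` and the toy scores and vocabulary are illustrative assumptions, not part of the provided model.

```python
import numpy as np

def top_k_answers(logits, answer_words, k=3):
    """Rank answers by softmax probability (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()          # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]      # indices of the k largest probabilities
    return [(answer_words[j], float(probs[j])) for j in order]

# Toy vocabulary and scores, made up for illustration only.
toy_answers = ["yes", "no", "dog", "cat"]
toy_logits = [0.5, -1.0, 3.0, 2.0]
ranking = top_k_answers(toy_logits, toy_answers, k=2)
print(ranking)
```

Since softmax is monotonic, taking the argmax of the raw scores gives the same top answer; the probabilities are only needed if you want to compare confidence between answers.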

We provide you a pretrained model `vqa_resnet` for VQA tasks.

In [None]:
# load model
from load_model import load_model
vqa_resnet = load_model()

In [None]:
# print(vqa_resnet) # for more information 

In [None]:
checkpoint = '2017-08-04_00.55.19.pth'
saved_state = torch.load(checkpoint, map_location=device)
# reading vocabulary from saved model
vocab = saved_state['vocab']

# reading word tokens from saved model
token_to_index = vocab['question']

# reading answers from saved model
answer_to_index = vocab['answer']

num_tokens = len(token_to_index) + 1

# reading answer classes from the vocabulary
answer_words = ['unk'] * len(answer_to_index)
for w, idx in answer_to_index.items():
    answer_words[idx]=w

### Inputs
To use the pretrained model, the input image must be resized to `(448, 448)` and normalized with `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`. The function `get_transform` builds this image preprocessing pipeline. For the input question, the function `encode_question` encodes the question into a vector of indices. The `preprocess` function handles both the image and the question.
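
As an illustration of the normalization arithmetic, the sketch below applies and then undoes the per-channel normalization on a NumPy array. It assumes a height x width x channel layout with values in `[0, 1]`; the helper names `normalize` and `denormalize` are hypothetical and are not used by the model code.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def normalize(image_hwc):
    """Per-channel (x - mean) / std, broadcast over the last (channel) axis."""
    return (image_hwc - MEAN) / STD

def denormalize(normalized_hwc):
    """Inverse operation, useful before displaying a preprocessed image."""
    return normalized_hwc * STD + MEAN

# A pixel equal to the channel means is mapped to zero in every channel.
pixel = np.array([[[0.485, 0.456, 0.406]]])
print(normalize(pixel))

# Round trip on a small synthetic image.
image = np.linspace(0.0, 1.0, 2 * 2 * 3).reshape(2, 2, 3)
restored = denormalize(normalize(image))
```

Note that `transforms.ToTensor()` produces a channel-first tensor, so the same arithmetic is applied along the first axis there.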

In [None]:
def get_transform(target_size, central_fraction=1.0):
    return transforms.Compose([
        transforms.Resize(int(target_size / central_fraction)),  # transforms.Scale is deprecated
        transforms.CenterCrop(target_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

In [None]:
def encode_question(question):
    """ Turn a question into a vector of indices and a question length """
    question_arr = question.lower().split()
    vec = torch.zeros(len(question_arr), device=device).long()
    for i, token in enumerate(question_arr):
        index = token_to_index.get(token, 0)
        vec[i] = index
    return vec, torch.tensor(len(question_arr), device=device)

In [None]:
# preprocess takes the path of an image and the associated question.
# It returns the inputs in the specific form expected by the VQA model.
def preprocess(dir_path, question):
    q, q_len = encode_question(question)
    img = Image.open(dir_path).convert('RGB')
    image_size = 448  # scale image to given size and center
    central_fraction = 1.0
    transform = get_transform(image_size, central_fraction=central_fraction)
    img_transformed = transform(img)
    img_features = img_transformed.unsqueeze(0).to(device)
    
    inputs = (img_features, q.unsqueeze(0), q_len.unsqueeze(0))
    return inputs

We provide you two pictures and some question-answers.

In [None]:
Question1 = 'What animal'
Answer1 = ['dog','cat' ]
indices1 = [answer_to_index[ans] for ans in Answer1]# The indices of category 
img1 = Image.open('dog_cat.png')
img1

In [None]:
dir_path = 'dog_cat.png' 
inputs = preprocess(dir_path, Question1)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() so this also works on GPU
print(answer_words[answer_idx])

In [None]:
Question2 = 'What color'
Answer2 = ['green','yellow' ]
indices2 = [answer_to_index[ans] for ans in Answer2]
img2 = Image.open('hydrant.png')
img2

In [None]:
dir_path = 'hydrant.png' 
inputs = preprocess(dir_path, Question2)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() so this also works on GPU
print(answer_words[answer_idx])

### Grad-CAM 
* **Overview:** Given an image with a question, and a category (‘dog’) as input, we forward propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (dog), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the two images. For each image, consider the answers we provided as the desired classes. Compare the heatmaps of different answers, and conclude. 


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to compute Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the tutorial on [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully. 
 + The pretrained model `vqa_resnet` has no activation function after its last layer, so its output is the `raw class scores` and can be used directly. Run `print(vqa_resnet)` for more information on the model.
 + The last CNN layer of the model is: `vqa_resnet.resnet_layer4.r_model.layer4[2].conv3` 
 + The feature maps are 14x14, and so is your heatmap. Project the heatmap onto the original image (224x224) for easier inspection. The function `cv2.resize()` may help.  
 + Here is the link to the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
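
The hook mechanism mentioned in the hints can be tried on a small network first. The sketch below is only an illustration: `toy_net`, `captured` and `save_activation` are made-up names, the network is not the VQA model, and the gradient is captured with a tensor-level `register_hook` inside the forward hook, which is one possible alternative to a module-level backward hook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network standing in for the real model.
toy_net = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 5),
)

captured = {}

def save_activation(module, inputs, output):
    # Store the feature maps produced by the hooked layer.
    captured["activation"] = output.detach()
    # Tensor hook: called with d(score)/d(output) during the backward pass.
    output.register_hook(
        lambda grad: captured.__setitem__("gradient", grad.detach())
    )

handle = toy_net[0].register_forward_hook(save_activation)

x = torch.randn(1, 3, 8, 8)
scores = toy_net(x)        # raw class scores, shape (1, 5)
scores[0, 2].backward()    # backpropagate the score of one chosen class
handle.remove()            # detach the hook once it is no longer needed

print(captured["activation"].shape)
print(captured["gradient"].shape)
```

Both captured tensors have the shape of the hooked layer's output, which is what Grad-CAM needs: one gradient value per activation value.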

<img src="grad_cam.png">
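
In the notation of the Grad-CAM paper, with $A^k$ the $k$-th feature map, $y^c$ the raw score of class $c$ and $Z$ the number of spatial positions, the heatmap is

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big).$$

A minimal NumPy sketch of these two steps on made-up arrays follows. The function name `grad_cam_from_arrays` and the toy values are assumptions for illustration; in the assignment, the activations and gradients come from the hooked layer.

```python
import numpy as np

def grad_cam_from_arrays(activations, gradients):
    """Grad-CAM from activations and gradients of shape (C, H, W)."""
    # alpha_k: gradient of the class score averaged over spatial positions
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # weighted sum of the feature maps over the channel axis
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0)                           # ReLU: keep positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1] when possible

# Two toy feature maps: channel 0 fires top-left, channel 1 fires bottom-right.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 2.0]]])
# Channel 0 supports the class (positive gradient), channel 1 opposes it.
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])

cam = grad_cam_from_arrays(activations, gradients)
print(cam)
```

In this toy case only the top-left position remains highlighted, because the ReLU discards the region whose feature map pushes the class score down.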

In [None]:
class Grad_Cam(nn.Module):
    def __init__(self, vqa_resnet):
        super().__init__()
        
        # keep a reference to the pretrained VQA model (ResNet backbone)
        self.vqa_resnet = vqa_resnet
        self.vqa_resnet.resnet_layer4.r_model.layer4[2].conv3.register_forward_hook(self.forward_hook)
        self.vqa_resnet.resnet_layer4.r_model.layer4[2].conv3.register_backward_hook(self.backward_hook)
        
        self.gradients = None
        self.activations = None
    
    # hook for the last convolutional layer activations
    def forward_hook(self, layer, input, output):
        self.activations = output[0]
    # hook for the gradients of the activations
    def backward_hook(self, layer, grad_input, grad_output):
        self.gradients = grad_output[0]

    def forward(self, inputs):
        x = self.vqa_resnet(*inputs)
        return x
    
    # method for the gradient extraction
    def get_activations_gradient(self):
        return self.gradients
    
    # method for the activation extraction
    def get_activations(self):
        return self.activations
    
    
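    def compute_heatmap(self, *inputs, i):
        """Return the normalized Grad-CAM heatmap for the answer class index `i`."""
        self.vqa_resnet.zero_grad()

        ans = self.vqa_resnet(*inputs)  # raw class scores, shape (1, num_answers)
        yc = ans[0, i]                  # scalar score of the desired class
        yc.backward()

        # activations from the forward hook: (C, H, W) -> (1, C, H, W)
        activations = self.get_activations().detach().unsqueeze(0)
        gradients = self.get_activations_gradient()
        # channel weights: gradients averaged over batch and spatial dimensions
        pooled_gradients = torch.mean(gradients, dim=[0, 2, 3])
        # weight each channel by its pooled gradient (broadcast over H and W)
        weighted = activations * pooled_gradients.view(1, -1, 1, 1)

        heatmap = torch.mean(weighted, dim=1).squeeze()
        heatmap = F.relu(heatmap)  # relu on top of the heatmap
        heatmap = heatmap / (torch.max(heatmap) + 1e-8)  # normalize, avoiding division by zero

        return heatmap.cpu()  # on CPU so it can be converted to a NumPy array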

In [None]:
gc = Grad_Cam(vqa_resnet)

__Question 1:__

In [None]:
dir_path = 'dog_cat.png' 
inputs = preprocess(dir_path, Question1)
heatmap1 = gc.compute_heatmap(*inputs,i=indices1[0])
heatmap2 = gc.compute_heatmap(*inputs,i=indices1[1])

heatmap1 = (np.array(heatmap1))
heatmap1 = cv2.resize(heatmap1, (224,224), interpolation = cv2.INTER_AREA)

heatmap2 = (np.array(heatmap2))
heatmap2 = cv2.resize(heatmap2, (224,224), interpolation = cv2.INTER_AREA)

plt.figure( figsize=(10,10))
plt.subplot(1,2,1)
plt.title("Answer for 'dog'")
plt.imshow(img1)
plt.imshow(heatmap1,alpha=0.4)

plt.subplot(1,2,2)
plt.title("Answer for 'cat'")
plt.imshow(img1)
plt.imshow(heatmap2,alpha=0.4)

__Question 2:__

In [None]:
dir_path = 'hydrant.png' 
inputs = preprocess(dir_path, Question2)
heatmap1 = gc.compute_heatmap(*inputs,i=indices2[0])
heatmap2 = gc.compute_heatmap(*inputs,i=indices2[1])

heatmap1 = (np.array(heatmap1))
heatmap1 = cv2.resize(heatmap1, (224,224), interpolation = cv2.INTER_AREA)

heatmap2 = (np.array(heatmap2))
heatmap2 = cv2.resize(heatmap2, (224,224), interpolation = cv2.INTER_AREA)

plt.figure( figsize=(10,10))
plt.subplot(1,2,1)
plt.title("Answer for 'green'")
plt.imshow(img2)
plt.imshow(heatmap1,alpha=0.4)

plt.subplot(1,2,2)
plt.title("Answer for 'yellow'")
plt.imshow(img2)
plt.imshow(heatmap2,alpha=0.4)

__Conclusion__

Grad-CAM explains the decisions of a neural network without any additional training. In our examples, the regions that drive the model's answer are highlighted. However, we think there is room for improvement, as the heatmaps do not cover all the pixels belonging to a given class.