# <a id='toc1_'></a>[Faster R-CNN: Explained](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Faster R-CNN: Explained](#toc1_)    
  - [Faster R-CNN Architecture](#toc1_1_)    
  - [Incorporating the Feature Pyramid Network (FPN)](#toc1_2_)    
  - [Utility Functions](#toc1_3_)    
  - [Inference & Evaluation (IoU & mAP)](#toc1_4_)    
    - [Inference](#toc1_4_1_)    
    - [Computing IoU](#toc1_4_2_)    
  - [Activation Visualization](#toc1_5_)    
  - [GradCAM](#toc1_6_)    
  - [GradCAM++](#toc1_7_)    
  - [EigenCAM](#toc1_8_)    
  - [AblationCAM](#toc1_9_)    
  - [Deep Feature Factorizations.](#toc1_10_)    
  - [ScoreCAM](#toc1_11_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Faster R-CNN Architecture](#toc0_)

Faster R-CNN is a popular object detection algorithm introduced by Shaoqing Ren et al. in their 2015 paper, [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497). It extends the [R-CNN](https://www.kaggle.com/achrafkhazri/r-cnn-explained) (Girshick et al., 2014) and [Fast R-CNN](https://www.kaggle.com/achrafkhazri/fast-r-cnn-explained) (Girshick, 2015) algorithms; Ross Girshick, who led both earlier works, is also a co-author of the Faster R-CNN paper. Faster R-CNN is an end-to-end object detection algorithm that uses a convolutional neural network (CNN) to extract features from an input image, a region proposal network (RPN) to generate object proposals, and a detection head to classify and refine those proposals. The architecture consists of three main components:

1. **Backbone Network**: The backbone network is responsible for extracting high-level features from the input image. It typically consists of a convolutional neural network (CNN) architecture, such as ResNet or MobileNet, which is pre-trained on a large-scale image classification dataset like ImageNet. The backbone network transforms the input image into a feature map that encodes the spatial information of the image.

2. **Region Proposal Network (RPN)**: The RPN is a fully convolutional network that generates a set of object proposals, which are candidate bounding boxes that may contain objects of interest. The RPN takes the feature map from the backbone network as input and places a set of predefined anchor boxes (of several scales and aspect ratios) at each spatial location. For each anchor box, the RPN predicts an objectness score (the probability that it contains an object) and regression offsets that adjust the anchor to fit the object. The RPN is trained with a binary classification loss and a bounding box regression loss.

3. **Region-based Convolutional Neural Network (R-CNN)**: The R-CNN head takes the object proposals generated by the RPN and performs object classification and bounding box regression. Each proposal is first pooled into a fixed-size feature (RoI pooling), which is then passed through a series of fully connected layers followed by a softmax classifier that assigns it to one of the object classes or to the background. The head also refines the bounding box coordinates predicted by the RPN to improve the localization accuracy. It is trained using a multi-task loss that combines the classification loss and the bounding box regression loss.

The Faster R-CNN algorithm combines these three components to achieve accurate and efficient object detection. By using the RPN to generate object proposals, Faster R-CNN eliminates the need for an external region proposal algorithm such as Selective Search, which was the speed bottleneck in R-CNN and Fast R-CNN, and it shares the backbone features between proposal generation and detection. Note that the anchor boxes themselves are still predefined by hand (a fixed set of scales and aspect ratios). The backbone network provides the feature representation for object detection, while the R-CNN head performs the final classification and localization. Overall, Faster R-CNN has become a widely used algorithm in computer vision and achieved state-of-the-art performance on benchmarks such as PASCAL VOC and MS COCO at the time of its publication.
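
To make the notion of anchor boxes more concrete, the following minimal sketch generates the anchors for a single spatial location. It assumes the defaults reported in the original paper (3 scales and 3 aspect ratios, hence 9 anchors per location); the function name `generate_anchors` and its parameters are hypothetical and are not part of torchvision.

```python
import math


def generate_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) tuples centred on one location.

    Each anchor has an area of scale**2 and a height/width ratio of `ratio`.
    The default scales and ratios are assumptions taken from the paper.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the area fixed at scale**2 while changing the aspect ratio
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors


anchors = generate_anchors(0.0, 0.0)
print(len(anchors))  # number of anchors at this location
print(anchors[1])    # the square anchor at the smallest scale
```

In the full model the same set of anchors is replicated at every position of the feature map, and the RPN predicts one objectness score and four offsets per anchor.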




## <a id='toc1_2_'></a>[Incorporating the Feature Pyramid Network (FPN)](#toc0_)
The Feature Pyramid Network (FPN) can be incorporated into the Faster R-CNN architecture to improve the detection performance, especially for objects at different scales. FPN addresses the challenge of detecting objects at both small and large scales by creating a feature pyramid that captures multi-scale information.

In the Faster R-CNN architecture, the backbone network extracts high-level features from the input image. However, these features are typically at a single scale, which may not be sufficient for detecting objects of different sizes. This is because objects can vary in scale within an image, and using features from a single scale may result in missed detections or inaccurate bounding box predictions.

To address this issue, FPN introduces a top-down pathway and lateral connections to create a feature pyramid. The top-down pathway upsamples the spatially coarse but semantically strong feature maps from the deeper (lower-resolution) levels to the resolution of the shallower levels, while the lateral connections merge each upsampled map with the backbone feature map of the same resolution. This process creates a set of feature maps at different scales, where each level of the pyramid captures information at a specific scale while still carrying high-level semantics.

By incorporating FPN into the Faster R-CNN architecture, we can use the feature pyramid to generate object proposals and perform object classification and bounding box regression at multiple scales. This allows the model to effectively detect objects of different sizes and improves the overall detection accuracy.

For example, when detecting small objects, the model can leverage the high-resolution feature maps from the lower levels of the pyramid, which preserve fine-grained details. On the other hand, when detecting large objects, it can use the low-resolution feature maps from the upper levels of the pyramid, which have large receptive fields and capture the global context of the image. By combining features from multiple scales, the model becomes more robust and capable of detecting objects across a wide range of sizes.
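
As an illustration of how box size maps to a pyramid level, the sketch below implements the level-assignment heuristic from the FPN paper, $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$, clamped to the levels P2 to P5. The function name `fpn_level` and its parameter names are hypothetical, and the box is assumed to have a strictly positive width and height; library implementations may differ in details.

```python
import math


def fpn_level(box_width, box_height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Map a box size to a pyramid level (a higher level means a coarser feature map).

    Assumes box_width > 0 and box_height > 0.
    """
    scale = math.sqrt(box_width * box_height)
    k = math.floor(k0 + math.log2(scale / canonical_size))
    # Clamp to the range of levels that actually exist in the pyramid
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(size, "->", "P%d" % fpn_level(size, size))
```

Small boxes end up on the fine, high-resolution level P2, while large boxes are assigned to the coarse level P5.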

## <a id='toc1_3_'></a>[Utility Functions](#toc0_)

In [None]:
import torch
import torchvision
import cv2
import os
import json
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
from pytorch_grad_cam.ablation_layer import AblationLayerFasterRCNN
from pytorch_grad_cam import GradCAM, AblationCAM, EigenCAM, ScoreCAM, GradCAMPlusPlus, DeepFeatureFactorization
from pytorch_grad_cam.utils.model_targets import FasterRCNNBoxScoreTarget
from pytorch_grad_cam.utils.reshape_transforms import fasterrcnn_reshape_transform
from pytorch_grad_cam.utils.image import show_cam_on_image, scale_cam_image, show_factorization_on_image
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Defining COCO categories
coco_labels = ['__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
              'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 
              'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 
              'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella',
              'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
              'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
              'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork',
              'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
              'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
              'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet',
              'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
              'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase',
              'scissors', 'teddy bear', 'hair drier', 'toothbrush']
              
def read_image(dataset_dir, image_name) -> np.ndarray:
    """
    Read and preprocess an image from the dataset directory.

    Args:
        dataset_dir (str): The directory path of the dataset.
        image_name (str): The name of the image file.

    Returns:
        image (numpy.ndarray): The preprocessed image as a NumPy array.

    """
    # Read the image using OpenCV
    image = cv2.imread(dataset_dir + image_name)

    # Convert the image from BGR to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return image


def predict(input_tensor, model, detection_threshold) -> tuple:
    """
    Perform object detection on an input tensor using a Faster R-CNN model.

    Args:
        input_tensor (torch.Tensor): The input tensor to perform object detection on.
        model (torch.nn.Module): The Faster R-CNN model.
        detection_threshold (float): The minimum confidence score threshold for object detection.

    Returns:
        boxes (numpy.ndarray): An int32 array of bounding boxes (x1, y1, x2, y2) for the detected objects.
        classes (list): A list of class names for the detected objects.
        labels (list): A list of integer COCO category ids for the detected objects.
        indices (list): The positions of the kept detections in the raw model output.
    """
    # Perform object detection using the model
    outputs = model(input_tensor)

    # Extract the predicted classes, labels, scores, and bounding boxes
    pred_classes = [coco_labels[i] for i in outputs[0]['labels'].cpu().numpy()]
    pred_labels = outputs[0]['labels'].cpu().numpy()
    pred_scores = outputs[0]['scores'].detach().cpu().numpy()
    pred_bboxes = outputs[0]['boxes'].detach().cpu().numpy()

    # Initialize empty lists for storing the filtered results
    boxes, classes, labels, indices = [], [], [], []

    # Filter the results based on the detection threshold
    for index in range(len(pred_scores)):
        if pred_scores[index] >= detection_threshold:
            boxes.append(pred_bboxes[index].astype(np.int32))
            classes.append(pred_classes[index])
            labels.append(pred_labels[index])
            indices.append(index)

    # Convert the boxes to numpy array
    boxes = np.int32(boxes)

    return boxes, classes, labels, indices


def predict_with_scores(input_tensor, model, detection_threshold) -> tuple:
    """
    Perform object detection on an input tensor using a Faster R-CNN model and return the bounding boxes,
    class labels, labels, indices, and scores for the detected objects.

    Args:
        input_tensor (torch.Tensor): The input tensor to perform object detection on.
        model (torch.nn.Module): The Faster R-CNN model.
        detection_threshold (float): The minimum confidence score threshold for object detection.

    Returns:
        boxes (numpy.ndarray): An int32 array of bounding boxes (x1, y1, x2, y2) for the detected objects.
        classes (list): A list of class names for the detected objects.
        labels (list): A list of integer COCO category ids for the detected objects.
        indices (list): The positions of the kept detections in the raw model output.
        scores (list): A list of confidence scores for the detected objects.
    """
    # Perform object detection using the model
    outputs = model(input_tensor)

    # Extract the predicted classes, labels, scores, and bounding boxes
    pred_classes = [coco_labels[i] for i in outputs[0]['labels'].cpu().numpy()]
    pred_labels = outputs[0]['labels'].cpu().numpy()
    pred_scores = outputs[0]['scores'].detach().cpu().numpy()
    pred_bboxes = outputs[0]['boxes'].detach().cpu().numpy()

    # Initialize empty lists for storing the filtered results
    boxes, classes, labels, indices, scores = [], [], [], [], []

    # Filter the results based on the detection threshold
    for index in range(len(pred_scores)):
        if pred_scores[index] >= detection_threshold:
            boxes.append(pred_bboxes[index].astype(np.int32))
            classes.append(pred_classes[index])
            labels.append(pred_labels[index])
            indices.append(index)
            scores.append(pred_scores[index])

    # Convert the boxes to numpy array
    boxes = np.int32(boxes)

    return boxes, classes, labels, indices, scores


def draw_boxes(boxes, labels, classes, image) -> np.ndarray:
    """
    Draw bounding boxes and class labels on the input image.

    Args:
        boxes (numpy.ndarray): An array of bounding boxes for the detected objects.
        labels (numpy.ndarray): An array of class labels for the detected objects.
        classes (list): A list of class labels for the detected objects.
        image (numpy.ndarray): The input image.

    Returns:
        numpy.ndarray: The image with bounding boxes and class labels drawn on it.
    """
    for i, box in enumerate(boxes):
        cv2.rectangle(
            image,
            (int(box[0]), int(box[1])),
            (int(box[2]), int(box[3])),
            (0, 255, 0), 2
        )
        cv2.putText(image, classes[i], (int(box[0]), int(box[1] - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2,
                    lineType=cv2.LINE_AA)
    return image

def fasterrcnn_reshape_transform(x):
    """
    Resize every feature map returned by the FPN backbone to a common spatial size
    (that of the 'pool' level) and concatenate them along the channel dimension.

    Args:
        x (dict): The output (activations) of the Faster R-CNN backbone, keyed by pyramid level.

    Returns:
        torch.Tensor: The concatenated absolute activations, of shape (N, C_total, H, W).
    """
    # Specify the target size (last two dimensions) of the reshaped output (height, width)
    target_size = x['pool'].size()[-2 : ]
    # Initialize an empty list for storing the activations
    activations = []
    # Iterate over the activations
    for key, value in x.items():
        # Resize the activation to the target size using bilinear interpolation
        activations.append(torch.nn.functional.interpolate(torch.abs(value), target_size, mode='bilinear'))
    # Concatenate the activations along the channel dimension
    activations = torch.cat(activations, axis=1)
    return activations

class FasterRCNNBoxScoreTarget:
    """ For every original detected bounding box specified in "bounding boxes",
        assign a score on how the current bounding boxes match it,
            1. In IOU
            2. In the classification score.
        If there is not a large enough overlap, or the category changed,
        assign a score of 0.

        The total score is the sum of all the box scores.
    """

    def __init__(self, labels, bounding_boxes, iou_threshold=0.5):
        self.labels = labels
        self.bounding_boxes = bounding_boxes
        self.iou_threshold = iou_threshold

    def __call__(self, model_outputs):
        output = torch.Tensor([0])

        if len(model_outputs["boxes"]) == 0:
            return output

        for box, label in zip(self.bounding_boxes, self.labels):
            box = torch.Tensor(box[None, :])

            ious = torchvision.ops.box_iou(box, model_outputs["boxes"])
            index = ious.argmax()
            if ious[0, index] > self.iou_threshold and model_outputs["labels"][index] == label:
                score = ious[0, index] + model_outputs["scores"][index]
                output = output + score
        return output
    
def renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_cam):
    """Normalize the CAM to be in the range [0, 1]
    inside every bounding box, and zero outside of the bounding boxes.

    Note: `labels` and `classes` are read from the global scope, so `predict`
    must have been called in the calling cell before this function is used.
    """
    renormalized_cam = np.zeros(grayscale_cam.shape, dtype=np.float32)
    images = []
    for x1, y1, x2, y2 in boxes:
        # Rescale the CAM values inside this box independently of the rest of the image
        img = renormalized_cam * 0
        img[y1:y2, x1:x2] = scale_cam_image(grayscale_cam[y1:y2, x1:x2].copy())
        images.append(img)

    # Merge the per-box maps by taking the pixel-wise maximum
    renormalized_cam = np.max(np.float32(images), axis=0)
    renormalized_cam = scale_cam_image(renormalized_cam)
    cam_image_renormalized = show_cam_on_image(image_float_np, renormalized_cam, use_rgb=True)
    # `labels` and `classes` come from the global scope (see the note in the docstring)
    image_with_bounding_boxes = draw_boxes(boxes, labels, classes, cam_image_renormalized)
    return image_with_bounding_boxes

## <a id='toc1_4_'></a>[Inference & Evaluation (IoU & mAP)](#toc0_)

The `torchvision.models.detection` module provides several pre-trained models for object detection, segmentation, and person keypoint detection, as well as some training utilities. For Faster R-CNN, the following models are available:

1. `fasterrcnn_resnet50_fpn`: This is a Faster R-CNN model with a ResNet-50 backbone and Feature Pyramid Network (FPN). It's pre-trained on the COCO train2017 dataset.

2. `fasterrcnn_mobilenet_v3_large_fpn`: This is a Faster R-CNN model with a MobileNetV3-Large backbone and FPN. It's also pre-trained on the COCO train2017 dataset.

3. `fasterrcnn_mobilenet_v3_large_320_fpn`: This is a low-resolution Faster R-CNN model with a MobileNetV3-Large backbone and FPN, tuned for mobile use cases: by default it resizes inputs so that the shorter side is about 320 pixels. It's pre-trained on the COCO train2017 dataset.
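
These models return boxes as `[x1, y1, x2, y2]` corner coordinates, whereas the COCO results format used for evaluation expects `[x, y, width, height]`. The sketch below shows that conversion on a single detection. The helper names `xyxy_to_xywh` and `to_coco_result` and the example values (image id 42, the box, the score) are hypothetical and only serve as an illustration.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to COCO-style (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def to_coco_result(image_id, label, score, box_xyxy):
    """Build one entry of a COCO detection results list."""
    return {
        "image_id": int(image_id),
        "category_id": int(label),
        "bbox": [float(v) for v in xyxy_to_xywh(box_xyxy)],
        "score": float(score),
    }


# Made-up detection, for illustration only
result = to_coco_result(image_id=42, label=1, score=0.98, box_xyxy=(10.0, 20.0, 110.0, 220.0))
print(result)
```

Casting to built-in `int` and `float` matters in practice because NumPy scalar types are not JSON serializable.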

### <a id='toc1_4_1_'></a>[Inference](#toc0_)

In [None]:
# Defining constants
DATASET_DIR = "dataset/val2017/"
CONFIDENCE_THRESHOLD = 0.9
SAVE = False
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Listing the COCO validation images in DATASET_DIR ("dataset/val2017/")
test_images = os.listdir(DATASET_DIR)

# Loading annotations
coco_gt = COCO('dataset/instances_val2017.json')

# Preparing the image info dictionary & filename to id dictionary
image_ids = coco_gt.getImgIds()
image_info = coco_gt.loadImgs(image_ids)
filename_to_id = {img['file_name']: img['id'] for img in image_info}

# Loading the pre-trained Faster R-CNN model trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()
model.to(DEVICE)

# Defining the transformation to be applied to images
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
split = "val"

# Creating the output directory
output_dir = "output (Faster_RCNN)/" + split + "/"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Initializing empty lists for storing the predictions
predictions = []

# Iterating over the images
for sample in test_images:

    # Reading the image
    img_id = filename_to_id[sample]
    image = np.array(Image.open(DATASET_DIR + sample).convert('RGB'))
    input_tensor = transform(image).unsqueeze(0)
    input_tensor = input_tensor.to(DEVICE)

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(input_tensor)  

    # Parse outputs
    pred_boxes = outputs[0]['boxes'].data.cpu().numpy()
    pred_scores = outputs[0]['scores'].data.cpu().numpy()
    pred_labels = outputs[0]['labels'].data.cpu().numpy()

    # Adding predictions to the list
    for i, (box, score, label) in enumerate(zip(pred_boxes, pred_scores, pred_labels)):
        box_height = box[3] - box[1]
        box_width = box[2] - box[0]
        new_box = np.array([box[0], box[1], box_width, box_height])
        prediction = {
            'image_id': img_id,
            'category_id': int(label),
            'bbox': new_box.tolist(),
            'score': float(score)
        }
        predictions.append(prediction)

    # Drawing bounding boxes and class labels on the image and saving it iff SAVE is True
    if SAVE:
        boxes, classes, labels, indices = predict(input_tensor, model, CONFIDENCE_THRESHOLD)
        image = draw_boxes(boxes, labels, classes, image)
        Image.fromarray(image).save(output_dir + sample)

# Saving the predictions in a JSON file
with open(output_dir + 'predictions.json', 'w') as f:
    json.dump(predictions, f)

# Computing mAP
coco_dt = coco_gt.loadRes(output_dir + 'predictions.json')
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

mAP = coco_eval.stats[0]
print("mAP: ", mAP)

### <a id='toc1_4_2_'></a>[Computing IoU](#toc0_)
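
Intersection over Union (IoU) measures how well two boxes overlap: the area of their intersection divided by the area of their union. Before the full evaluation loop, here is a minimal, self-contained sketch of the computation for a single pair of boxes. The helper names `xywh_to_xyxy` and `iou_xyxy` are hypothetical; COCO ground-truth boxes are stored as `[x, y, width, height]`, so they are converted to corner format first.

```python
def xywh_to_xyxy(box):
    """Convert COCO-style (x, y, width, height) to (x1, y1, x2, y2) corners."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def iou_xyxy(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    intersection = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    # Guard against degenerate (zero-area) boxes
    return intersection / union if union > 0 else 0.0


print(iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))                # partial overlap
print(iou_xyxy(xywh_to_xyxy((0, 0, 2, 2)), (0, 0, 2, 2)))  # identical boxes
```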

In [None]:
# Load COCO annotations for the entire dataset
DATASET_DIR = "/kaggle/input/coco-2017-dataset/coco2017/val2017/"
coco = COCO('/kaggle/input/coco-2017-dataset/coco2017/annotations/instances_val2017.json')
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
CONFIDENCE_THRESHOLD = 0.9

# Load the pre-trained Faster R-CNN model trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()
model.to(DEVICE)

# Initializing variables
total_iou = 0.0
total_boxes = 0
counter = 0

# Iterating over all images in the dataset
for img_id in coco.getImgIds():
    image = read_image(DATASET_DIR, coco.loadImgs(img_id)[0]['file_name'])
    input_tensor = transform(image).unsqueeze(0)
    input_tensor = input_tensor.to(DEVICE)

    # Getting ground truth boxes corresponding to the image id
    ann_ids = coco.getAnnIds(imgIds=img_id)
    anns = coco.loadAnns(ann_ids)
    gt_boxes = []
    gt_labels = []
    for ann in anns:
        gt_boxes.append(ann['bbox'])
        gt_labels.append(ann['category_id'])
    gt_boxes = np.array(gt_boxes)
    gt_labels = np.array(gt_labels)

    # Getting predicted boxes
    boxes, classes, labels, indices, scores = predict_with_scores(input_tensor, model, CONFIDENCE_THRESHOLD)
    keep_indices = [i for i, score in enumerate(scores) if score >= CONFIDENCE_THRESHOLD]

    # only keep boxes with score >= CONFIDENCE_THRESHOLD
    boxes = [boxes[i] for i in keep_indices]
    labels = [labels[i] for i in keep_indices]

    # Computing IoU
    iou = 0.0
    for index,gt_box in enumerate(gt_boxes):

        # Get x2, y2 coordinates of the ground truth box
        gt_x2 = gt_box[0] + gt_box[2]
        gt_y2 = gt_box[1] + gt_box[3]
        gt_box = [gt_box[0], gt_box[1], gt_x2, gt_y2]

        max_iou = 0.0
        for index_2, box in enumerate(boxes):
            # Computing the IoU between the ground truth box and the predicted box
            x1 = max(gt_box[0], box[0])
            y1 = max(gt_box[1], box[1])
            x2 = min(gt_box[2], box[2])
            y2 = min(gt_box[3], box[3])
            intersection = max(x2 - x1, 0) * max(y2 - y1, 0)
            union = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1]) + (box[2] - box[0]) * (box[3] - box[1]) - intersection
            # Guard against degenerate (zero-area) boxes
            iou = intersection / union if union > 0 else 0.0
            
            if iou > max_iou and labels[index_2] == gt_labels[index]:
                max_iou = iou
        total_iou += max_iou
        total_boxes += 1
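    if counter % 100 == 0:
        print(counter)
    counter += 1

# float() works whether total_iou is a Python float or a NumPy scalar;
# max(..., 1) avoids a division by zero when no ground truth box was seen
print("Average IoU: ", float(total_iou) / max(total_boxes, 1))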

## <a id='toc1_5_'></a>[Activation Visualization](#toc0_)

Activation visualization is a technique used in deep learning to understand and analyze the behavior of neural networks. It involves visualizing the activations or outputs of individual neurons or layers within a neural network when given a particular input.

The main purpose of activation visualization is to gain insights into how the network is processing and representing the input data. By visualizing the activations, we can identify which parts of the input are being emphasized or ignored by the network. This can help us understand the network's decision-making process and potentially identify any issues or biases in the model. However, because a neural network contains a large number of filters and neurons, it is impractical to visualize all the activations at once. Therefore, we typically visualize the activations of a single neuron or a small group of neurons at a time.

In general, the process involves the following steps:

1. Load the pre-trained model: Activation visualization is typically performed on pre-trained models. The first step is to load the model architecture and weights.

2. Prepare the input data: Depending on the task, you need to prepare the input data that the model expects. This could involve preprocessing, resizing, or normalizing the input images.

3. Forward pass: Pass the input data through the model to obtain the activations. This involves feeding the input data through the layers of the model and collecting the activations at the desired layer(s).

4. Visualize the activations: Once you have obtained the activations, you can visualize them using various techniques such as heatmaps, feature maps, or activation histograms. These visualizations can provide insights into the learned representations and patterns within the network.
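
Before an activation map can be displayed as a grayscale image in step 4, its values are usually rescaled to a fixed range. The sketch below shows a min-max rescaling to [0, 255] on a plain nested list, with a guard for constant maps, where the range is zero and a naive division would fail. The function name `normalize_to_255` and the `eps` parameter are hypothetical.

```python
def normalize_to_255(activation_map, eps=1e-8):
    """Min-max rescale a 2D activation map (list of rows) to the range [0, 255]."""
    values = [v for row in activation_map for v in row]
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span < eps:
        # A constant map carries no contrast; return zeros instead of dividing by zero
        return [[0.0 for _ in row] for row in activation_map]
    return [[(v - lowest) / span * 255.0 for v in row] for row in activation_map]


print(normalize_to_255([[0, 1], [2, 4]]))
print(normalize_to_255([[3, 3], [3, 3]]))
```

With NumPy arrays the same idea is a one-liner per map, but the zero-range guard is still worth keeping.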

In [None]:
def activation_visualization(images, model, transform, show=False, save=True, split='val'):
    output_dir = "output/activation_visualization/" + split + "/"
    global first_layer_activations
    global last_layer_activations
    
    for sample in images:

        # Read the image from disk using the image_name
        image = read_image(DATASET_DIR, sample)
        image_tensor = transform(image).unsqueeze(0)

        # Forward pass through the model
        output = model(image_tensor)
        
        first_layer_activations = first_layer_activations.squeeze(0).detach().cpu().numpy()
        last_layer_activations = last_layer_activations.squeeze(0).detach().cpu().numpy()

        # Normalize each of the activations to the range [0, 255]
        for i in range(first_layer_activations.shape[0]):
            first_layer_activations[i] = (first_layer_activations[i] - first_layer_activations[i].min()) / (first_layer_activations[i].max() - first_layer_activations[i].min()) * 255

        for i in range(last_layer_activations.shape[0]):
            last_layer_activations[i] = (last_layer_activations[i] - last_layer_activations[i].min()) / (last_layer_activations[i].max() - last_layer_activations[i].min()) * 255
           
        # Visualize the activations
        fig, axes = plt.subplots(3, 3, figsize=(10, 5))

        # Original image
        for i in range(0, 3):
            axes[0, i].axis('off')
        axes[0, 1].imshow(image)
        axes[0, 1].set_title("Original Image")

        # First layer activations for only the first 3 filters
        axes[1, 0].imshow(torchvision.transforms.ToPILImage()(first_layer_activations[0]), cmap="gray")
        axes[1, 0].set_title("First Layer Activations\nFilter 1")
        axes[1, 0].axis("off")

        axes[1, 1].imshow(torchvision.transforms.ToPILImage()(first_layer_activations[1]), cmap="gray")
        axes[1, 1].set_title("First Layer Activations\nFilter 2")
        axes[1, 1].axis("off")

        axes[1, 2].imshow(torchvision.transforms.ToPILImage()(first_layer_activations[2]), cmap="gray")
        axes[1, 2].set_title("First Layer Activations\nFilter 3")
        axes[1, 2].axis("off")

        # Last layer activations for only the first 3 filters
        axes[2, 0].imshow(torchvision.transforms.ToPILImage()(last_layer_activations[0]), cmap="gray")
        axes[2, 0].set_title("Last Layer Activations\nFilter 1")
        axes[2, 0].axis("off")

        axes[2, 1].imshow(torchvision.transforms.ToPILImage()(last_layer_activations[1]), cmap="gray")
        axes[2, 1].set_title("Last Layer Activations\nFilter 2")
        axes[2, 1].axis("off")

        axes[2, 2].imshow(torchvision.transforms.ToPILImage()(last_layer_activations[2]), cmap="gray")
        axes[2, 2].set_title("Last Layer Activations\nFilter 3")
        axes[2, 2].axis("off")
        plt.subplots_adjust(left=0.05, right=0.95, top=0.95, bottom=0.05, hspace=0.2, wspace=0.2)
        plt.tight_layout()

        # Save the figure iff needed
        if save:
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            plt.savefig(output_dir + sample)

        # Show the figure iff needed
        if show:
            plt.show()

        # Reset activations
        first_layer_activations = None 
        last_layer_activations = None

# Constants
DATASET_DIR = "dataset/val2017/"

# Load the pre-trained Faster R-CNN model trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

# Define the transformation to be applied to images
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

# List the COCO validation images in DATASET_DIR ("dataset/val2017/")
test_images = os.listdir(DATASET_DIR)

# Specify the first and last layer of the model which will be used for visualization (hooked)
first_layer = model.backbone.body.relu
last_layer = model.backbone.body.layer4[2].conv3

# Initializing activations which will be used in the hook functions
first_layer_activations = None
last_layer_activations = None

# Defining hook for the first layer
def first_layer_hook(module, input, output):
    global first_layer_activations
    first_layer_activations = output

# Defining hook for the last layer
def last_layer_hook(module, input, output):
    global last_layer_activations
    last_layer_activations = output

# Registering the hooks
first_layer.register_forward_hook(first_layer_hook)
last_layer.register_forward_hook(last_layer_hook)

# Apply activation visualization
activation_visualization(test_images, model, transform, show=True, save=True, split='val')

## <a id='toc1_6_'></a>[GradCAM](#toc0_)

GradCAM (Gradient-weighted Class Activation Mapping), introduced in the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/abs/1610.02391) in 2017, is an explainable AI technique for computer vision that visualizes and interprets the predictions of Convolutional Neural Networks (CNNs). It uses the gradients of any target `concept` (say, the logits for 'dog' or even a caption) flowing into the final convolutional layer to produce a coarse localization map highlighting the regions of the image that are important for predicting the concept.
The GradCAM technique can be summarized in the following steps:

1. Let $y^c$ be the score for class $c$ before the softmax, and $A^k$ be the $k$-th activation map of the last convolutional layer. The importance weight $\alpha^c_k$ of map $k$ for class $c$ is the global average of the gradient of $y^c$ w.r.t. $A^k$:

$$\alpha^c_k = \frac{1}{Z}\sum_i\sum_j\frac{\partial y^c}{\partial A^k_{ij}}$$

where $Z$ is the number of elements in $A^k$, and $A^k_{ij}$ is the activation at the $i$-th row and $j$-th column of $A^k$.

2. The activation map $L^c_{GradCAM}$ is the ReLU of the weighted sum of the activation maps, which keeps only the features with a positive influence on class $c$:

$$L^c_{GradCAM} = \mathrm{ReLU}\left(\sum_k\alpha^c_kA^k\right)$$

3. The heat map is then obtained by normalizing the activation map $L^c_{GradCAM}$.

4. Finally, the heat map is upsampled to the size of the input image.

**Notes:**

- According to the authors: "We find that Grad-CAM maps become progressively worse as we move to earlier convolutional layers as they have smaller receptive fields and only focus on less semantic local features." That is why most people tend to use the last convolutional layer.
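
The first three steps above can be sketched on a toy example without any deep learning framework. The snippet below assumes that the activation maps and the gradients of $y^c$ with respect to them are already available as nested lists (in practice they come from forward and backward hooks); the function name `grad_cam` and the toy values are hypothetical, and the final upsampling step is omitted.

```python
def grad_cam(activations, gradients):
    """Toy GradCAM on K activation maps of size H x W (nested lists).

    `gradients[k][i][j]` is assumed to hold d y^c / d A^k_ij.
    Returns the per-map weights and the normalized heat map.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Step 1: alpha_k is the spatial average of the gradients of map k
    alphas = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Step 2: weighted sum of the activation maps followed by ReLU
    cam = [
        [max(sum(a * act[i][j] for a, act in zip(alphas, activations)), 0.0)
         for j in range(width)]
        for i in range(height)
    ]

    # Step 3: min-max normalization to [0, 1] (skipped for a constant map)
    lowest = min(min(row) for row in cam)
    highest = max(max(row) for row in cam)
    if highest > lowest:
        cam = [[(v - lowest) / (highest - lowest) for v in row] for row in cam]
    return alphas, cam


# Two made-up 2x2 activation maps and their gradients
activations = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
gradients = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
alphas, heatmap = grad_cam(activations, gradients)
print(alphas)
print(heatmap)
```

In this toy case the second map receives a negative weight, so only the location where the first map dominates survives the ReLU.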

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

Gradcam = GradCAM(model,
                    [model.backbone.body.layer4[2].conv3])

for image in test_images:
    image_name = image
    # Convert to RGB so that grayscale images also yield a 3-channel array
    image = np.array(Image.open(DATASET_DIR + image).convert('RGB'))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and display the detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]

    image_with_predictions = draw_boxes(boxes, labels, classes, image)

    # Skip images without confident detections: with no target boxes there is nothing to explain
    if len(boxes) == 0:
        continue

    # Computing GradCAM
    grayscale_gradcam = Gradcam(input_tensor=input_tensor, targets=targets)[0, :]
    gradcam = show_cam_on_image(image_float_np, grayscale_gradcam, use_rgb=True)
    renormalized_gradcam = renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_gradcam)

    fig, axes = plt.subplots(1, 4, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")

    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # GradCAM heatmap
    axes[2].imshow(gradcam)
    axes[2].set_title("GradCAM Heatmap")
    axes[2].axis("off")

    # GradCAM heatmap renormalized in bounding boxes
    axes[3].imshow(renormalized_gradcam)
    axes[3].set_title("GradCAM Heatmap\nRenormalized in Bounding Boxes")
    axes[3].axis("off")


    plt.tight_layout()
    # Saving the images
    output_dir = "output/gradcam/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()

## <a id='toc1_7_'></a>[GradCAM++](#toc0_)

Grad-CAM++, introduced by Chattopadhyay et al. in 2018 in their paper [Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks](https://arxiv.org/abs/1710.11063), is an extension of Grad-CAM. It refines the localization of important regions by using higher-order derivatives in addition to the first-order gradients. Instead of averaging the gradients uniformly over each activation map, it takes a pixel-wise weighted sum of the positive gradients, where the pixel-wise coefficients are derived from second- and third-order derivatives of the class score. This refinement aims to improve the localization accuracy of the highlighted regions, giving a more precise picture of where the model focuses its attention to make predictions. The paper gives a full mathematical derivation of Grad-CAM++, which is beyond the scope of this notebook.

![Image](assets/gradcam%20vs%20gradcampp.png)
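
To make the difference from Grad-CAM concrete, here is a minimal NumPy sketch of how the Grad-CAM++ channel weights can be computed. It assumes the closed form that follows when the class score is passed through an exponential, in which case the higher-order derivatives reduce to powers of the first-order gradient. The function name `grad_cam_pp_weights` and the toy arrays are hypothetical, and this is an illustration rather than the code run by `GradCAMPlusPlus` below.

```python
import numpy as np


def grad_cam_pp_weights(activations, gradients, eps=1e-7):
    """Channel weights for one image; both inputs have shape (K, H, W)."""
    g2 = gradients ** 2
    g3 = g2 * gradients
    # Sum of each activation map over its spatial positions, shape (K, 1, 1).
    act_sum = activations.sum(axis=(1, 2), keepdims=True)
    # Pixel-wise coefficients; set to zero wherever the gradient is zero.
    denom = 2.0 * g2 + act_sum * g3
    alpha = np.where(gradients != 0, g2 / (denom + eps), 0.0)
    # Weighted sum of the positive gradients only.
    return (alpha * np.maximum(gradients, 0.0)).sum(axis=(1, 2))


# Toy input: channel 0 has positive gradients, channel 1 negative ones.
A = np.ones((2, 2, 2))
dS_dA = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
w = grad_cam_pp_weights(A, dS_dA)
cam = np.maximum((w[:, None, None] * A).sum(axis=0), 0.0)
print(w, cam.shape)
```

Only positive gradients contribute, so a channel whose gradients are all negative receives a zero weight, whereas plain Grad-CAM would assign it a negative average.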

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

Gradcampp = GradCAMPlusPlus(model,
                    [model.backbone.body.layer4[2].conv3])

for image in test_images:
    image_name = image
    image = np.array(Image.open(DATASET_DIR + image))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and display the detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    if len(boxes) == 0:
        continue
    targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]

    image_with_predictions = draw_boxes(boxes, labels, classes, image)

    # Computing GradCAM++
    grayscale_gradcam = Gradcampp(input_tensor=input_tensor, targets=targets)[0, :]
    gradcam = show_cam_on_image(image_float_np, grayscale_gradcam, use_rgb=True)
    renormalized_gradcam = renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_gradcam)

    fig, axes = plt.subplots(1, 4, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")

    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # GradCAM++ heatmap
    axes[2].imshow(gradcam)
    axes[2].set_title("GradCAM++ Heatmap")
    axes[2].axis("off")

    # GradCAM++ heatmap renormalized in bounding boxes
    axes[3].imshow(renormalized_gradcam)
    axes[3].set_title("GradCAM++ Heatmap\nRenormalized in Bounding Boxes")
    axes[3].axis("off")


    plt.tight_layout()
    # Saving the images
    output_dir = "output/gradcampp/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()

## <a id='toc1_8_'></a>[EigenCAM](#toc0_)

EigenCAM is a class activation mapping technique that provides visual explanations for convolutional neural network (CNN) models. It was introduced by Mohammed Bany Muhammad and Mohammed Yeasin in their paper [Eigen-CAM: Class Activation Map using Principal Components](https://arxiv.org/abs/2008.00299). Unlike GradCAM and GradCAM++, EigenCAM does not weight the activation maps with gradients. Instead, it applies Singular Value Decomposition (SVD) to the activation maps and projects them onto their first principal component. The projection captures the dominant features in the activation maps and is used to localize the important regions in the input image.

The main advantage is that the explanations stay robust even when the model misclassifies the input. EigenCAM relies only on the principal components of the activation maps, which do not depend on the model's predictions and require no gradient information. In contrast, GradCAM and GradCAM++ use the gradients of the model's predictions to weight the activation maps, so they can give misleading explanations when the predictions are incorrect.

![Image](assets/eigenmap%20vs%20gradcam.png)
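
The projection described above can be sketched with `np.linalg.svd`. This is a minimal illustration on a synthetic activation tensor, assuming a single image with activations of shape (C, H, W); the function name `eigen_cam_map` and the toy data are hypothetical. Note that the sign of a singular vector is arbitrary, so the raw projection may highlight either the salient region or its complement.

```python
import numpy as np


def eigen_cam_map(activations):
    """Project activations of shape (C, H, W) onto their first principal component."""
    c, h, w = activations.shape
    # One row per spatial position, one column per channel.
    flat = activations.reshape(c, h * w).T
    flat = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions in channel space.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = flat @ vt[0]
    # Keep positive responses only and scale to [0, 1].
    cam = np.maximum(projection.reshape(h, w), 0.0)
    return cam / cam.max() if cam.max() > 0 else cam


# Synthetic activations: every channel is a scaled copy of the same 2x2 blob.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
coeffs = np.array([1.0, 2.0, 0.5])
acts = coeffs[:, None, None] * mask
cam = eigen_cam_map(acts)
print(cam.shape)
```

No class score or gradient appears anywhere in the computation, which is why the result does not depend on what the model predicts.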

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
target_layers = [model.backbone]

Eigencam = EigenCAM(model,
               target_layers, 
               reshape_transform=fasterrcnn_reshape_transform)

for image in test_images:
    image_name = image
    image = np.array(Image.open(DATASET_DIR + image))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and skip images without confident detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    if len(boxes) == 0:
        continue
    targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]

    image_with_predictions = draw_boxes(boxes, labels, classes, image)

    # Computing EigenCAM
    grayscale_eigencam = Eigencam(input_tensor=input_tensor, targets=targets)[0, :]
    eigencam_image = show_cam_on_image(image_float_np, grayscale_eigencam, use_rgb=True)
    renormalized_eigencam_image = renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_eigencam)

    fig, axes = plt.subplots(1, 4, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")

    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # EigenCAM heatmap
    axes[2].imshow(eigencam_image)
    axes[2].set_title("EigenCAM Heatmap")
    axes[2].axis("off")

    # EigenCAM heatmap renormalized in bounding boxes
    axes[3].imshow(renormalized_eigencam_image)
    axes[3].set_title("EigenCAM Heatmap\nRenormalized in Bounding Boxes")
    axes[3].axis("off")


    plt.tight_layout()

    # Saving the figures
    output_dir = "output/eigencam/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()

## <a id='toc1_9_'></a>[AblationCAM](#toc0_)

AblationCAM, introduced by Saurabh Desai and Harish Ramaswamy in their paper [Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-free Localization](https://openaccess.thecvf.com/content_WACV_2020/papers/Desai_Ablation-CAM_Visual_Explanations_for_Deep_Convolutional_Network_via_Gradient-free_Localization_WACV_2020_paper.pdf), published at WACV 2020, is a gradient-free class activation mapping technique that provides visual explanations for convolutional neural network (CNN) models. It is an alternative to GradCAM and GradCAM++ that uses ablation rather than gradients to compute the weights for the activation maps. This allows AblationCAM to provide visual explanations for models that do not have a gradient defined for their output.

The main idea is to use ablation (zeroing out) to compute the weights for the activation maps. Each activation map is removed one at a time, and the weight of that map is the resulting change in the model's prediction, that is, the difference between the prediction with and without the map. These weights are then used to combine the activation maps into a heat map, which is upsampled to the size of the input image.

The ablation impact directly measures the importance of a unit to the class score, rather than relying on gradients, which are indirect and can be noisy. It is also less sensitive to implementation details such as the model architecture, so the ablation impact stays consistent across models. Ablation-CAM explanations do not change drastically for wrong predictions, whereas gradient-based methods can highlight unrelated regions when the model is wrong.
Computationally, Ablation-CAM avoids backpropagation, but it needs an extra forward pass for each ablated channel, so it is only cheaper when a small subset of channels is ablated (see the `ratio_channels_to_ablate` argument in the code below).

![Image](assets/ablationcam%20vs%20gradcam.png)
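
The ablation procedure can be illustrated on a toy model. In the sketch below, `toy_score` is a hypothetical stand-in for "the rest of the network after the target layer", and each weight is the drop in score divided by the original score; this normalization is how I understand the paper's formula, so treat it as an assumption. None of these names come from the library used in the next cell.

```python
import numpy as np


def ablation_weights(activations, score_fn):
    """Relative drop in the class score when each channel is zeroed out.

    activations: array of shape (K, H, W).
    score_fn: callable mapping activations to a scalar class score.
    """
    base = score_fn(activations)
    weights = np.zeros(activations.shape[0])
    for k in range(activations.shape[0]):
        ablated = activations.copy()
        ablated[k] = 0.0  # remove activation map k
        weights[k] = (base - score_fn(ablated)) / base
    return weights


# Hypothetical stand-in for the network head: a linear layer on pooled features.
head = np.array([2.0, 0.0, -1.0])


def toy_score(acts):
    return float(head @ acts.mean(axis=(1, 2)))


acts = np.ones((3, 4, 4))
acts[0] *= 3.0
w = ablation_weights(acts, toy_score)
cam = np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)
print(w)
```

The loop makes the cost visible: one call to `score_fn` (one forward pass in a real model) per channel.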

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
target_layers = [model.backbone]

Ablationcam = AblationCAM(model,
                  target_layers, 
                  reshape_transform=fasterrcnn_reshape_transform,
                  ablation_layer=AblationLayerFasterRCNN(),
                  ratio_channels_to_ablate=0.01)

for image in test_images:
    image_name = image
    image = np.array(Image.open(DATASET_DIR + image))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and skip images without confident detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    if len(boxes) == 0:
        continue
    targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]

    image_with_predictions = draw_boxes(boxes, labels, classes, image)

    # Computing AblationCAM
    grayscale_ablationcam = Ablationcam(input_tensor=input_tensor, targets=targets)[0, :]
    ablationcam = show_cam_on_image(image_float_np, grayscale_ablationcam, use_rgb=True)
    renormalized_ablationcam = renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_ablationcam)

    fig, axes = plt.subplots(1, 4, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")

    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # AblationCAM heatmap
    axes[2].imshow(ablationcam)
    axes[2].set_title("AblationCAM Heatmap")
    axes[2].axis("off")

    # AblationCAM heatmap renormalized in bounding boxes
    axes[3].imshow(renormalized_ablationcam)
    axes[3].set_title("AblationCAM Heatmap\nRenormalized in Bounding Boxes")
    axes[3].axis("off")


    plt.tight_layout()
    # Saving the images
    output_dir = "output/ablation/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()

## <a id='toc1_10_'></a>[Deep Feature Factorizations.](#toc0_)

Deep Feature Factorization (DFF) is a method for localizing similar semantic concepts within an image or a set of images. It was introduced by Collins et al. in their paper [Deep Feature Factorization For Concept Discovery](https://arxiv.org/abs/1806.10206).

Usually explainability methods answer questions like “Where does the model see a cat in the image?”. Here we instead get a much more detailed glimpse into the model and ask it: “Show me all the different concepts you found inside the image, and how they are classified”.

The previous methods cannot answer questions such as:


1. What are the internal concepts the model finds?

Does the network just see the cat head and body together? Or does it detect them as different concepts? We often hear that neural networks can identify high-level features like ears, eyes, faces and legs, but we never actually get to see this in the model explanations.

2. Could it be that the body of the cat also pulls the output towards other categories as well?

Just because it contributes to a higher output for one category, it doesn’t mean it doesn’t contribute to other categories as well. For example, there are many different types of cats. To take this into account when we’re interpreting the heatmaps, we would have to carefully look at all the heatmaps and keep track of them.

3. How do we merge all the visualizations into a single image?

In terms of the visualization itself, if we have 10 heatmaps for 10 categories, we would need to look at 10 different images. Some of the pixels could also get high values in several heatmaps, for example for different categories of cats. This is a lot of information to unpack and not very efficient.


The idea of DFF is to factorize the activations from the model into different concepts using Non-negative Matrix Factorization (NMF), and to compute for every pixel how strongly it corresponds to each concept:

- Turn the small 2D images in the activations into 1D vectors by reshaping the activations from a tensor of shape (Batch, Channels, Height, Width) to a matrix V of shape Channels x (Batch x Height x Width). Reminder: the activations are typically non-negative since they usually come after a ReLU gate.

- Compute the NMF of V for some number of components N. This gives us V ≈ WH, where W is a matrix of shape Channels x N and H is a matrix of shape N x (Batch x Height x Width).

- W can be thought of as the feature representations of the detected concepts.

- H (after reshaping it back to 2D activations) describes how the pixels correspond to the different concepts.

If we input a batch of several images, concepts that repeat across the images will be computed. This gives us a way of automatically discovering concepts in a dataset and performing tasks like co-localization, as detailed further in the paper. For our purposes, however, we will use a batch size of 1: we just want to find the concepts in a single image.
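
The factorization steps can be sketched with plain NumPy. The example below is a minimal illustration and not the library's implementation: it swaps in a small NMF solver based on multiplicative updates, and the function names `nmf` and `dff_concept_maps` as well as the random "activations" are hypothetical.

```python
import numpy as np


def nmf(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (C x P) as W (C x N) @ H (N x P)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def dff_concept_maps(activations, n_components):
    """activations: non-negative array of shape (B, C, H, W)."""
    b, c, h, w = activations.shape
    # (B, C, H, W) -> (C, B, H, W) -> C x (B*H*W)
    V = activations.transpose(1, 0, 2, 3).reshape(c, b * h * w)
    W, H = nmf(V, n_components)
    # N x (B*H*W) -> (B, N, H, W): one heatmap per concept and image.
    concept_maps = H.reshape(n_components, b, h, w).transpose(1, 0, 2, 3)
    return W, concept_maps


rng = np.random.default_rng(1)
acts = np.maximum(rng.normal(size=(1, 16, 6, 6)), 0.0)  # hypothetical post-ReLU activations
W, maps = dff_concept_maps(acts, n_components=3)
dominant = maps[0].argmax(axis=0)  # strongest concept at each pixel
print(W.shape, maps.shape, dominant.shape)
```

Taking the `argmax` over concepts at each pixel is one simple way to merge all concept heatmaps into a single segmentation-like image.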

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

dff = DeepFeatureFactorization(model=model, target_layer=model.backbone.body.layer3[5].conv3, computation_on_concepts=model.roi_heads.box_predictor.cls_score)


for image in test_images:
    image_name = image
    image = np.array(Image.open(DATASET_DIR + image))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and skip images without confident detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    if len(boxes) == 0:
        continue

    image_with_predictions = draw_boxes(boxes, labels, classes, image)
    
    # Computing Deep Feature Factorization
    concepts_2, batch_explanations_2, concept_scores_2 = dff(input_tensor, n_components = 2)
    visualization_2 = show_factorization_on_image(image_float_np, 
                                                batch_explanations_2[0],
                                                image_weight=0.3)
    
    # Computing again for n_components = 3
    concepts_3, batch_explanations_3, concept_scores_3 = dff(input_tensor, n_components = 3)
    visualization_3 = show_factorization_on_image(image_float_np, 
                                                batch_explanations_3[0],
                                                image_weight=0.3)
    
    # Computing again for n_components = 5
    concepts_5, batch_explanations_5, concept_scores_5 = dff(input_tensor, n_components = 5)
    visualization_5 = show_factorization_on_image(image_float_np, 
                                                batch_explanations_5[0],
                                                image_weight=0.3)

    fig, axes = plt.subplots(1, 5, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")
    
    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # Deep Feature Factorization heatmap
    axes[2].imshow(visualization_2)
    axes[2].set_title("DFF Heatmap (2)")
    axes[2].axis("off")

    # Deep Feature Factorization heatmap
    axes[3].imshow(visualization_3)
    axes[3].set_title("DFF Heatmap (3)")
    axes[3].axis("off")

    # Deep Feature Factorization heatmap
    axes[4].imshow(visualization_5)
    axes[4].set_title("DFF Heatmap (5)")
    axes[4].axis("off")

    plt.tight_layout()
    # Saving the images
    output_dir = "output/dff/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()

## <a id='toc1_11_'></a>[ScoreCAM](#toc0_)

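- So far, running this cell causes the computer to crash. A likely cause is memory and compute cost: ScoreCAM masks the input image with every upsampled activation map and runs a forward pass on each masked copy, and `layer4[2].conv3` has 2048 channels.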

In [None]:
DATASET_DIR = "dataset/val2017/"
test_images = os.listdir(DATASET_DIR)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
target_layers = [model.backbone.body.layer4[2].conv3]

Scorecam = ScoreCAM(model,
                target_layers)

for image in test_images:
    image_name = image
    image = np.array(Image.open(DATASET_DIR + image))
    original_image = image.copy()
    image_float_np = np.float32(image) / 255
    
    input_tensor = transform(image_float_np)
    input_tensor = input_tensor.unsqueeze(0)

    # Run the model and skip images without confident detections
    boxes, classes, labels, indices = predict(input_tensor, model, 0.9)
    if len(boxes) == 0:
        continue
    targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]

    image_with_predictions = draw_boxes(boxes, labels, classes, image)

    # Computing ScoreCAM
    grayscale_scorecam = Scorecam(input_tensor=input_tensor, targets=targets)[0, :]
    scorecam = show_cam_on_image(image_float_np, grayscale_scorecam, use_rgb=True)
    renormalized_scorecam = renormalize_cam_in_bounding_boxes(boxes, image_float_np, grayscale_scorecam)

    fig, axes = plt.subplots(1, 4, figsize=(10, 5))
    axes[0].imshow(original_image)
    axes[0].set_title("Input Image")
    axes[0].axis("off")

    # Image with predicted bounding boxes
    axes[1].imshow(image_with_predictions)
    axes[1].set_title("Predicted Boxes")
    axes[1].axis("off")

    # ScoreCAM heatmap
    axes[2].imshow(scorecam)
    axes[2].set_title("ScoreCAM Heatmap")
    axes[2].axis("off")

    # ScoreCAM heatmap renormalized in bounding boxes
    axes[3].imshow(renormalized_scorecam)
    axes[3].set_title("ScoreCAM Heatmap\nRenormalized in Bounding Boxes")
    axes[3].axis("off")

    plt.tight_layout()
    # Saving the images
    output_dir = "output/scorecam/val/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    plt.savefig(output_dir + image_name)
    plt.show()