# CAM and Object Detection (modified)

## Authors: 
Sat Arora, sat.arora@uwaterloo.ca \
Richard Fan, r43fan@uwaterloo.ca

### Project Goal ***REMOVE THIS***:
"CAM and object detection". First, you should implement some standard method for CAM for some (simple) classification network trained on image-level tags. You should also obtain object detection (spacial localization of the object approximate "center"). You should apply your approach to one specific object type (e.g. faces, or anything else). Training should be done on image-level tags (e.g. face, no face). You can come up with your specialized dataset, but feel free to use subsets of standard data. You can also test the ideas on real datasets where label noise is present.



## Abstract

Class Activation Maps (CAMs) are an important tool and concept in Computer Vision. During classification, the goal of a CAM is to indicate the regions of the image that a Convolutional Neural Network used when classifying the image as containing a certain object.

In order to understand what the Class Activation Maps do, this report will describe in detail the motivation, ideas & concepts that guide our process to making our own CAMs. Following this, we will do some deeper analysis of what happens in certain scenarios to better understand the algorithm's output.

The approach and motivation are inspired by [Learning Deep Features for Discriminative Localization](http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf) (Zhou, Khosla, Lapedriza, Oliva, Torralba), a paper that was released in 2016. The approach is extended by comparing a common classification CNN (specifically, ResNet18) with a CNN that we train ourselves, analyzing the difference in image labelling and in the heat map. These networks will be trained on a labelled face/no-face dataset.

## Team Contributions

Sat Arora: sat.arora@uwaterloo.ca
- Initial ResNet18 model for object detection.
- Heat map logic.
- Experimenting with multiple objects of same type.

Richard Fan: r43fan@uwaterloo.ca
- Creating custom model (and fine-tuning) for object detection.
- Heat map logic.
- Testing difference between custom model and ResNet18 model.

Fun fact: We were born on the same day.

## Motivation

### Conceptual Idea

As mentioned in the Abstract, the goal of CAMs is to indicate the regions of an image that are used by the CNN to identify a certain category.

In the case of categorization, the last layer before output is a softmax layer (in order to determine which class is the most likely). Before running this last layer, if we run a technique called **Global Average Pooling (GAP)** on the convolutional feature maps at this point, then we can use these as features for a fully-connected layer that produces our categorization.

Note: The idea of GAP is straightforward. For a single feature map $F_d$ of height $H$ and width $W$, it is defined as:
$$\text{GAP}(F_d) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_d(i, j)$$
Simply put, it averages the values of each map into a single number, and by doing so it reduces every $H \times W$ feature map to one value.
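
As a minimal sketch of the GAP formula, the snippet below averages each channel of a small, made-up stack of feature maps using NumPy. The array `feature_maps` and the helper `global_average_pool` are hypothetical names chosen for illustration; they are not part of the notebook's model code.

```python
import numpy as np

# Hypothetical feature maps: D = 3 channels, each of size H x W = 2 x 2.
feature_maps = np.array([
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
])

def global_average_pool(maps):
    """Average each (H, W) map down to a single number, one per channel."""
    return maps.mean(axis=(1, 2))

gap = global_average_pool(feature_maps)
print(gap.shape)     # (3,)
print(gap.tolist())  # [2.5, 2.0, 5.0]
```

Each channel collapses to one value, so a stack of $D$ maps becomes a vector of length $D$ that a fully-connected layer can consume.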

With this structure, we can leverage our knowledge of how the final layer works: we can project the weights of the output layer back onto the convolutional feature maps. This leaves us with a heatmap of the "most important" regions, since feature maps with higher weights for a class contribute more to its score wherever they are active. This technique is known as "Class Activation Mapping".

### How can this be more formally seen?
Say that for every spatial location $(x,y)$, the activation of unit $k$ in the last convolutional layer of the CNN is $f_k(x,y)$. Then, after performing GAP, the pooled value for unit $k$ is $$F_k = \sum_{x,y}{f_k(x,y)}$$ (the constant factor $\frac{1}{H \times W}$ is dropped here for simplicity; it only rescales the weights below).

Thus, for some arbitrary class $c$, the input to the softmax in the final decision layer is $$S_c = \sum_k{w_k^cF_k}$$ where $w_k^c$ is exactly the "importance", or weight, of unit $k$ for class $c$. Recall that the output of the softmax for class $c$ is thus $$\frac{\exp(S_c)}{\sum_{c_0}{\exp(S_{c_0})}}$$ If we plug $F_k = \sum_{x,y}{f_k(x,y)}$ into $S_c$, we get $$S_c = \sum_k{w_k^c\sum_{x,y}{f_k(x,y)}} = \sum_{x,y}{\sum_k{w_k^cf_k(x,y)}}$$

Define $M_c$ to be the CAM for $c$, with each spatial element $M_c(x,y) = \sum_k{w_k^cf_k(x,y)}$. Then we can rewrite the definition of the class score $S_c$ to be $$S_c = \sum_{x,y}{M_c(x,y)}$$

As such, we see that $M_c(x,y)$ is exactly the importance of the activation for $c$ at spatial coordinate $(x,y)$. 
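
The identity $S_c = \sum_{x,y}{M_c(x,y)}$ can be checked numerically. The sketch below uses random activations and weights with made-up sizes (all names and shapes are assumptions for illustration) and confirms that summing each CAM over space gives back the class score.

```python
import numpy as np

rng = np.random.default_rng(0)

K, H, W, C = 4, 3, 3, 2             # hypothetical sizes: units, height, width, classes
f = rng.normal(size=(K, H, W))      # f_k(x, y): activations of the last conv layer
w = rng.normal(size=(C, K))         # w_k^c: weights of the final linear layer

pooled = f.sum(axis=(1, 2))         # pooled value per unit (unnormalised sum)
S = w @ pooled                      # class scores: S_c = sum_k w_k^c * pooled_k

M = np.einsum('ck,khw->chw', w, f)  # CAMs: M_c(x, y) = sum_k w_k^c f_k(x, y)

# Summing each CAM over all spatial positions recovers the class score.
print(np.allclose(M.sum(axis=(1, 2)), S))  # True
```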

### What does this mean?

Thus, each $f_k$ can be seen as a map of the presence of one visual pattern across the image. The CAM is a weighted linear sum of these maps, so by upsampling the CAM to the size of the input image, we can identify the image regions that had the biggest influence on the score for a particular category.

*Or, by a simple rethought, the regions that are highlighted correspond to the class that the CNN describes this image to be.*
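
To make the upsampling step concrete, here is a small sketch on a made-up 3 x 3 CAM. It uses nearest-neighbour upsampling via `np.kron` as a simplification (an image library would typically interpolate instead), and the 50% threshold is an arbitrary value picked for illustration, not a setting taken from this report.

```python
import numpy as np

# Hypothetical low-resolution CAM (3 x 3) with a hot spot in the lower right.
cam = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 1.0],
    [0.0, 0.8, 0.7],
])

scale = 2  # upsampling factor (assumption; a real image dictates its own size)
upsampled = np.kron(cam, np.ones((scale, scale)))  # nearest-neighbour upsampling

threshold = 0.5 * upsampled.max()  # arbitrary cut-off chosen for illustration
mask = upsampled >= threshold      # pixels considered "highlighted"

rows, cols = np.nonzero(mask)
box = (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))
print(upsampled.shape)  # (6, 6)
print(box)              # (2, 2, 5, 5)
```

The tuple `box` holds the top, left, bottom and right pixel indices of the highlighted region, which gives a rough localization of the object.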

## Code Libraries

Many libraries used in our implementation would be considered as "standard" in Computer Vision projects or courses, but we list out everything in the import order to get a better understanding of what each import is used for:

- ``PIL``: Used to read images from a directory. This image will get passed into the tensor layers. 

- ``torch`` / ``torchvision``: The main libraries for PyTorch (along with its own packages). These provide pre-set models (like ResNet18), and ability to create transformations and our own CNNs. This is extensively used for manipulating our tensors (along with ``numpy``, which is more forward-facing as will be seen), providing loaders for our training and testing process, and to perform training & computations on CUDA/MPS (GPU configurations) or the CPU.

- ``numpy``: Used to manipulate tensors from ``torch``, and acts as a middle layer to write data in a form that libraries such as ``cv2`` and ``PIL`` can understand. 

- ``cv2``: Used for dealing with image resizing, writing/drawing, and modifying. It is particularly useful in overlaying our heatmap on top of the image, and optionally writing an image to a directory for later use.

In [1]:
from PIL import Image

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from torch.autograd import Variable
from torch.nn import functional as F

import numpy as np

import cv2

## Dataset
We use a dataset of face and non-face images found on Kaggle from Sagar Karar. To get this dataset and format it in the way that the program needs to read it, run the following commands.

**Note**: The first step assumes that you have the ``kaggle`` package installed via pip. Otherwise, click on [this link to the dataset page](https://www.kaggle.com/datasets/sagarkarar/nonface-and-face-dataset) and download the dataset. This will replace the first line in the bash script below.

After downloading the data and restructuring its contents, the train and test datasets are in ``Dataset/train`` and ``Dataset/test`` respectively. Note that the data is transformed by resizing it, converting it to a Tensor, and then normalizing it with the standard `mean` and `std` from ImageNet, which is what the pre-trained model was trained on.
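
The normalization step can be checked by hand. The sketch below applies $(value - mean) / std$ per channel to a pure black pixel; the helper `normalize` is a hypothetical stand-in for what `transforms.Normalize` does to each pixel. The result is consistent with the `-2.1179`, `-2.0357` and `-1.8044` entries that open each channel of the tensor printed by the cell that follows, which suggests those corner pixels are black.

```python
import numpy as np

# ImageNet channel statistics passed to transforms.Normalize in this report.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

def normalize(pixel):
    """Apply (value - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return (np.asarray(pixel, dtype=float) - mean) / std

black = normalize([0.0, 0.0, 0.0])
print(np.round(black, 4).tolist())  # [-2.1179, -2.0357, -1.8044]
```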

In [10]:
# Define transformations for data preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize images to a uniform size
    transforms.ToTensor(),  # Convert images to PyTorch tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])

# Load train and test datasets
train_dataset = ImageFolder('Dataset/train', transform=transform)
test_dataset = ImageFolder('Dataset/test', transform=transform)

print(train_dataset[0])

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

(tensor([[[-2.1179, -2.1179, -2.1179,  ..., -1.2103, -1.1932, -1.2103],
         [-2.1008, -2.1008, -2.1179,  ..., -1.2788, -1.2788, -1.2617],
         [-2.1008, -2.1008, -2.0665,  ..., -1.2959, -1.3302, -1.2788],
         ...,
         [-1.8953, -1.9295, -1.8953,  ..., -2.1179, -2.1179, -2.1179],
         [-1.9124, -1.9295, -1.8953,  ..., -2.1179, -2.1179, -2.1179],
         [-1.9467, -1.9638, -1.9295,  ..., -2.1179, -2.1179, -2.1179]],

        [[-2.0357, -2.0357, -2.0357,  ..., -1.2829, -1.2654, -1.2829],
         [-2.0182, -2.0182, -2.0357,  ..., -1.3529, -1.3529, -1.3354],
         [-2.0182, -2.0182, -1.9832,  ..., -1.3704, -1.4055, -1.3529],
         ...,
         [-1.7906, -1.8256, -1.7906,  ..., -2.0357, -2.0357, -2.0357],
         [-1.8081, -1.8256, -1.7906,  ..., -2.0357, -2.0357, -2.0357],
         [-1.8431, -1.8606, -1.8256,  ..., -2.0357, -2.0357, -2.0357]],

        [[-1.8044, -1.8044, -1.8044,  ..., -1.0724, -1.0550, -1.0724],
         [-1.7870, -1.7870, -1.8044,  ..., -

In [3]:
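# Load a ResNet18 model with pre-trained ImageNet weights
# (the weights= argument replaces the deprecated pretrained=True in torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)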

num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)  # 2 output classes: face and no-face

criterion = nn.CrossEntropyLoss()



In [6]:
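# Training loop
# Pick the best available device: CUDA GPU, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 2
# These counters are not reset per epoch, so the printed train accuracy
# is a running value over all epochs seen so far
train_correct = 0
train_total = 0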
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_total += labels.size(0)
        train_correct += (torch.argmax(outputs, dim=1) == labels).sum().item()

        train_loss += loss.item() * images.size(0)

    epoch_loss = train_loss / len(train_dataset)
    print(f'Epoch [{epoch + 1}/{num_epochs}], Train Loss: {epoch_loss:.4f}, Train Accuracy: {train_correct / train_total * 100:.2f}%')

# Evaluation on test set
model.eval()
test_correct = 0
test_total = 0
test_loss_total = 0.0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        _, predicted = torch.max(outputs, 1)
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()
        test_loss_total += loss.item() * images.size(0)


accuracy = test_correct / test_total
test_loss = test_loss_total / len(test_dataset)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {accuracy * 100:.2f}%')

Epoch [1/2], Train Loss: 0.0939, Train Accuracy: 96.84%
Epoch [2/2], Train Loss: 0.0380, Train Accuracy: 97.73%
Test Loss: 0.1219, Test Accuracy: 95.28%


In [12]:

# original_img = cv2.imread('sat.png')
# img = cv2.imread('2 faces.png')
# img = cv2.imread('river_hand.jpeg')
# img = cv2.imread('image_2.jpg')
original_img = cv2.imread('tejas.jpg')
# img = cv2.imread('shahan.jpg')
# img = cv2.imread('osama.jpg')
# img = cv2.imread('Human1250 copy.png')

if original_img is not None:
    print("Image loaded successfully!")
else:
    print("Unable to load the image. Please check the file path.")

features_blobs = []
def hook_feature(module, input, output):
    features_blobs.append(output.data.cpu().numpy())

model._modules.get('layer4').register_forward_hook(hook_feature)

img = cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB)
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_img = preprocess(img).unsqueeze(0).to(device)


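# Forward pass; the hook stores the layer4 feature maps in features_blobs
model.eval()
with torch.no_grad():
    logits = model(input_img)

# params[-2] is the weight matrix of the final fully-connected layer (2 x 512)
params = list(model.parameters())
weight = np.squeeze(params[-2].data.cpu().numpy())
# CAM for class index 0: weighted sum of the 512 feature maps (each 7 x 7)
cam = weight[0].dot(features_blobs[0].reshape(-1, 7 * 7))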

print("cam", cam)
cam = cam.reshape(7, 7)
cam = cam - np.min(cam)
cam = cam / np.max(cam)
cam = np.uint8(255 * cam)
# cam = cv2.resize(cam, (256, 256))
cam = cv2.resize(cam, (img.shape[1], img.shape[0])) 
print("shape", cam.shape)

# Apply heatmap on the original image
heatmap = cv2.applyColorMap(cam, cv2.COLORMAP_JET)
result = heatmap * 0.3 + original_img * 0.5
cv2.imwrite('CAM3.jpg', result)

Image loaded successfully!
cam [-1.0258129  -1.3987672  -1.5977178  -1.7860192  -1.8843458  -2.0514984
 -1.7054107  -0.97524136 -1.0213745  -0.26057523  0.34875482 -0.3549348
 -1.6226544  -1.6017452  -0.72951573  0.55181843  7.2906237  13.969972
 12.593851    4.783316   -0.7883249  -0.8047608   0.8720914  12.793606
 24.015093   21.92626     9.450156   -0.30643874 -1.0357145  -0.20476629
  8.717547   18.442791   17.652874    7.5965056  -0.9641609  -1.699307
 -1.9514683  -0.32045045  3.5392182   3.5824316  -0.884097   -1.9502213
 -1.5851424  -1.9714911  -2.0003047  -1.9613106  -1.9999803  -2.2184565
 -1.8343383 ]
shape (2100, 1576)


True
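
The raw CAM values printed above can also be used to estimate the approximate center of the object. The sketch below reshapes the 49 printed values into the 7 x 7 grid, finds the strongest cell, and maps the middle of that cell back to pixel coordinates using the printed shape `(2100, 1576)`. Taking the cell center as the object center is a simplifying assumption made here for illustration; the notebook itself does not compute it.

```python
import numpy as np

# The 49 raw CAM values printed above, in row-major order.
cam_values = [
    -1.0258129, -1.3987672, -1.5977178, -1.7860192, -1.8843458, -2.0514984,
    -1.7054107, -0.97524136, -1.0213745, -0.26057523, 0.34875482, -0.3549348,
    -1.6226544, -1.6017452, -0.72951573, 0.55181843, 7.2906237, 13.969972,
    12.593851, 4.783316, -0.7883249, -0.8047608, 0.8720914, 12.793606,
    24.015093, 21.92626, 9.450156, -0.30643874, -1.0357145, -0.20476629,
    8.717547, 18.442791, 17.652874, 7.5965056, -0.9641609, -1.699307,
    -1.9514683, -0.32045045, 3.5392182, 3.5824316, -0.884097, -1.9502213,
    -1.5851424, -1.9714911, -2.0003047, -1.9613106, -1.9999803, -2.2184565,
    -1.8343383,
]
grid = 7
cam = np.array(cam_values).reshape(grid, grid)

height, width = 2100, 1576  # image size, taken from the printed shape

# Locate the strongest cell, then map its center to pixel coordinates.
row, col = np.unravel_index(np.argmax(cam), cam.shape)
center_y = (int(row) + 0.5) * height / grid
center_x = (int(col) + 0.5) * width / grid

print((int(row), int(col)))  # (3, 3)
print((center_y, center_x))  # (1050.0, 788.0)
```

For this image the peak sits in the middle cell of the grid, so the estimated center coincides with the center of the image.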

We see that this pre-trained ResNet-18 model performs reasonably well on some images. However, how would the results on these same images compare if we disabled the pre-training? In other words, if we don't load the pre-trained ImageNet weights, how different would the results be? We explore this below:

We see that the ResNet-18 model with pre-trained weights outperforms 

In [None]:
import random

def misclassify(dataset):
    # Read labels from dataset.targets to avoid loading every image
    face_indices = [i for i, t in enumerate(dataset.targets) if t == 1]  # Assuming class 1 is for faces
    non_face_indices = [i for i, t in enumerate(dataset.targets) if t == 0]  # Assuming class 0 is for non-faces

    def set_label(idx, label):
        # ImageFolder returns labels from dataset.samples, so update both lists
        path, _ = dataset.samples[idx]
        dataset.samples[idx] = (path, label)
        dataset.targets[idx] = label

    # Mislabel 30 distinct face images as non-faces
    for idx in random.sample(face_indices, 30):
        set_label(idx, 0)

    # Mislabel 30 distinct non-face images as faces
    for idx in random.sample(non_face_indices, 30):
        set_label(idx, 1)
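
The same label-flipping idea can be tried without any image data. The sketch below works on a plain list of `(path, label)` pairs, which is a hypothetical stand-in for a dataset; `flip_labels` and the file names are invented for illustration. It draws indices without replacement so that exactly the requested number of labels changes per class.

```python
import random

def flip_labels(samples, n_per_class, seed=0):
    """Return a copy of (path, label) pairs with n_per_class labels flipped per class."""
    rng = random.Random(seed)
    noisy = list(samples)
    for label in (0, 1):
        # Indices come from the original list, so no sample is flipped twice.
        indices = [i for i, (_, y) in enumerate(samples) if y == label]
        # sample() draws without replacement: n_per_class distinct items flip.
        for i in rng.sample(indices, n_per_class):
            path, y = noisy[i]
            noisy[i] = (path, 1 - y)
    return noisy

# Hypothetical stand-in data: 10 non-faces (label 0) and 10 faces (label 1).
samples = [(f"img_{i}.jpg", i % 2) for i in range(20)]
noisy = flip_labels(samples, n_per_class=3)

changed = sum(a[1] != b[1] for a, b in zip(samples, noisy))
print(changed)  # 6
```

Because the class sizes are equal and the same number of labels is flipped in each direction, the class balance of the noisy list stays the same as the original.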