# CAM and Object Detection (modified)

## Authors: 
Sat Arora, sat.arora@uwaterloo.ca \
Richard Fan, r43fan@uwaterloo.ca

### Project Goal:
"CAM and object detection". First, you should implement some standard method for CAM for some (simple) classification network trained on image-level tags. You should also obtain object detection (spacial localization of the object approximate "center"). You should apply your approach to one specific object type (e.g. faces, or anything else). Training should be done on image-level tags (e.g. face, no face). You can come up with your specialized dataset, but feel free to use subsets of standard data. You can also test the ideas on real datasets where label noise is present.



## Abstract

Class Activation Maps (CAMs) are an important tool and concept in Computer Vision. For a classification task, a CAM indicates the regions of an image that a Convolutional Neural Network relied on when classifying the image as containing a certain object.

To understand what Class Activation Maps do, this report describes in detail the motivation, ideas, and concepts that guide our process for building our own CAMs. We then analyze more closely what happens in certain scenarios to better understand our algorithm's output.

The approach and motivation are inspired by [Learning Deep Features for Discriminative Localization](http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf) (Zhou, Khosla, Lapedriza, Oliva, Torralba), a paper published in 2016. We extend the approach by testing our own NNs alongside common ones (ResNet18) and by analyzing the impact of more specialized scenarios.
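
For reference, the class activation map from the cited paper can be written as below. Here $f_k(x, y)$ is the activation of channel $k$ of the last convolutional layer at spatial location $(x, y)$, and $w_k^c$ is the weight connecting channel $k$ to class $c$ in the final linear layer. The second identity is our own restatement under the assumption of a global-average-pooling head with bias $b_c$ over $Z$ spatial locations (as in ResNet18); the paper itself drops the bias and the $1/Z$ factor for simplicity.

$$
M_c(x, y) = \sum_k w_k^c \, f_k(x, y), \qquad S_c = \frac{1}{Z} \sum_{x, y} M_c(x, y) + b_c
$$

Under that assumption, the class score $S_c$ is the spatial average of the map plus a constant, which is why locations with large $M_c(x, y)$ can be read as the ones that contributed most to the prediction of class $c$.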

## Team Contributions

Sat Arora: sat.arora@uwaterloo.ca
- INSERT HERE
- INSERT HERE
- INSERT HERE

Richard Fan: r43fan@uwaterloo.ca
- INSERT HERE
- INSERT HERE
- INSERT HERE

Fun fact: We were born on the same day.

## Motivation

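As mentioned in the Abstract, the goal of CAMs is to indicate the regions of an image that a Convolutional Neural Network relied on when classifying the image as containing a certain object.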

In [1]:
import torch
import torchvision.models as models
import cv2
import numpy as np
from torchvision import transforms
import torch.nn as nn


In [2]:
import torch
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Define transformations for data preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize images to a uniform size
    transforms.ToTensor(),  # Convert images to PyTorch tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])

# Load train and test datasets
train_dataset = ImageFolder('Dataset/train', transform=transform)
test_dataset = ImageFolder('Dataset/test', transform=transform)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


In [3]:
import torchvision.models as models
import torch.nn as nn

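# Build a ResNet18 with randomly initialized weights (pretrained=False), so it is trained from scratch below
# Note: torchvision 0.13 and later deprecate `pretrained` in favor of `weights=None`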
model = models.resnet18(pretrained=False)

num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)  # 2 output classes: face and no-face

criterion = nn.CrossEntropyLoss()



In [22]:
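# Training loop
# Use Apple's MPS backend when available, otherwise fall back to CUDA or the CPU
if torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 1
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    # Reset the accuracy counters every epoch so the reported accuracy is per-epoch
    train_correct = 0
    train_total = 0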
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_total += labels.size(0)
        train_correct += (torch.argmax(outputs, dim=1) == labels).sum().item()

        train_loss += loss.item() * images.size(0)

    epoch_loss = train_loss / len(train_dataset)
    print(f'Epoch [{epoch + 1}/{num_epochs}], Train Loss: {epoch_loss:.4f}, Train Accuracy: {train_correct / train_total * 100:.2f}%')

# Evaluation on test set
model.eval()
test_correct = 0
test_total = 0
test_loss_total = 0.0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        _, predicted = torch.max(outputs, 1)
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()
        test_loss_total += loss.item() * images.size(0)


accuracy = test_correct / test_total
test_loss = test_loss_total / len(test_dataset)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {accuracy * 100:.2f}%')



Epoch [1/1], Train Loss: 0.5915, Train Accuracy: 92.02%
Test Loss: 0.1186, Test Accuracy: 95.85%


In [23]:
torch.save(model, 'model.pth')

# Reload the full pickled model (torch.save(model, ...) stores the whole module, not just a state dictionary)
# Note: on PyTorch 2.6 and later, torch.load defaults to weights_only=True, so pass weights_only=False here
model2 = torch.load('model.pth')
model2.to(device)

# Switch dropout and batch normalization layers to evaluation mode before inference
model2.eval()
test_correct = 0
test_total = 0
test_loss_total = 0.0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model2(images)
        loss = criterion(outputs, labels)
        _, predicted = torch.max(outputs, 1)
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()
        test_loss_total += loss.item() * images.size(0)


accuracy = test_correct / test_total
test_loss = test_loss_total / len(test_dataset)
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {accuracy * 100:.2f}%')

Test Loss: 0.1186, Test Accuracy: 95.85%


In [16]:
from torch.nn import functional as F
finalconv_name = 'layer4'
from torch.autograd import Variable
from PIL import Image

# correct = 0
# total = 0
# with torch.no_grad():
#     for images, labels in test_loader:
#         images, labels = images.to(device), labels.to(device)
#         outputs = model(images)
#         _, predicted = torch.max(outputs, 1)
#         total += labels.size(0)
#         correct += (predicted == labels).sum().item()

# accuracy = correct / total
# print(f'Test Accuracy: {accuracy * 100:.2f}%')



LABELS_file = 'imagenet-simple-labels.json'
# image_file = 'sat.png'
image_file = '2 faces.png'


# hook the feature extractor: store the output of the last conv block on every forward pass
features_blobs = []
def hook_feature(module, input, output):
    features_blobs.append(output.detach().cpu().numpy())

model._modules.get(finalconv_name).register_forward_hook(hook_feature)

# get the weights of the final linear layer (the one feeding the softmax), shape (num_classes, 512)
weight_softmax = model.fc.weight.detach().cpu().numpy()

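def returnCAM(feature_conv, weight_softmax, class_idx):
    # generate the class activation maps and upsample them to 256x256
    # note: assumes a batch of one image, i.e. feature_conv has shape (1, nc, h, w)
    size_upsample = (256, 256)
    bz, nc, h, w = feature_conv.shape
    output_cam = []
    for idx in class_idx:
        # weighted sum of the nc feature maps, using the fc weights of class idx
        cam = weight_softmax[idx].dot(feature_conv.reshape((nc, h*w)))
        cam = cam.reshape(h, w)
        # rescale to [0, 1]; the small epsilon avoids dividing by zero for a constant map
        cam = cam - np.min(cam)
        cam_img = cam / (np.max(cam) + 1e-8)
        cam_img = np.uint8(255 * cam_img)
        output_cam.append(cv2.resize(cam_img, size_upsample))
    return output_cam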


normalize = transforms.Normalize(
   mean=[0.485, 0.456, 0.406],
   std=[0.229, 0.224, 0.225]
)
preprocess = transforms.Compose([
   transforms.Resize((224,224)),
   transforms.ToTensor(),
   normalize
])

# load test image (convert to RGB so RGBA and grayscale files also yield 3 channels)
img_pil = Image.open(image_file).convert("RGB")
img_tensor = preprocess(img_pil)
# torch.autograd.Variable is deprecated; a plain tensor can be passed to the model directly
img_variable = img_tensor.unsqueeze(0).to(device)
logit = model(img_variable)


print("output", logit)

classes = ['face', 'no_face']

h_x = F.softmax(logit, dim=1).data.squeeze()
probs, idx = h_x.sort(0, True)
probs = probs.cpu().numpy()
idx = idx.cpu().numpy()

# output the prediction
for i in range(0, 2):
    print('{:.3f} -> {}'.format(probs[i], classes[idx[i]]))

# generate class activation mapping for the top1 prediction
CAMs = returnCAM(features_blobs[0], weight_softmax, [idx[0]])

# render the CAM and output
print('output CAM.jpg for the top1 prediction: %s'%classes[idx[0]])
img = cv2.imread(image_file)
height, width, _ = img.shape
heatmap = cv2.applyColorMap(cv2.resize(CAMs[0],(width, height)), cv2.COLORMAP_JET)
result = heatmap * 0.3 + img * 0.5
cv2.imwrite('CAM2.jpg', result)

import matplotlib.pyplot as plt



output tensor([[-1.0153,  1.5560]], device='mps:0', grad_fn=<LinearBackward0>)
0.929 -> no_face
0.071 -> face
output CAM.jpg for the top1 prediction: no_face
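
As a sanity check on the arithmetic inside `returnCAM`, the sketch below uses random NumPy arrays in place of the real network tensors. It illustrates that, for a head made of global average pooling followed by a linear layer, averaging the unnormalized CAM over the grid and adding the bias gives back the class logit. The names `features`, `fc_weight` and `fc_bias` are hypothetical stand-ins for `features_blobs[0][0]`, `weight_softmax` and the bias of `model.fc`; none of the values come from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the notebook's tensors (random values, not model outputs):
# features  ~ features_blobs[0][0], 512 channels on a 7x7 grid
# fc_weight ~ weight_softmax, one row of 512 weights per class
# fc_bias   ~ the bias of model.fc
num_channels, h, w, num_classes = 512, 7, 7, 2
features = rng.standard_normal((num_channels, h, w))
fc_weight = rng.standard_normal((num_classes, num_channels))
fc_bias = rng.standard_normal(num_classes)

# Classifier path: global average pooling followed by the linear layer.
pooled = features.mean(axis=(1, 2))        # shape (512,)
logits = fc_weight @ pooled + fc_bias      # shape (2,)

# CAM path: apply the same class weights at every spatial location.
raw_cams = np.einsum('ck,khw->chw', fc_weight, features)   # shape (2, 7, 7)

# Averaging each raw CAM over the grid and adding the bias recovers the logit.
logits_from_cams = raw_cams.mean(axis=(1, 2)) + fc_bias
max_abs_diff = float(np.abs(logits - logits_from_cams).max())
print(f"max |logit - (mean CAM + bias)| = {max_abs_diff:.2e}")
```

The printed difference should be at the level of floating-point round-off. Note that the min-max rescaling applied in `returnCAM` is for display only, so the rescaled heatmap no longer satisfies this identity.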


In [6]:
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [12]:

# Load and preprocess the image
original_img = cv2.imread('sat.png')
# original_img = cv2.imread('2 faces.png')
# img = cv2.imread('river_hand.jpeg')
# img = cv2.imread('image_2.jpg')
# img = cv2.imread('tejas.jpg')
# img = cv2.imread('shahan.jpg')
# img = cv2.imread('osama.jpg')
# img = cv2.imread('Human1250 copy.png')

if original_img is not None:
    print("Image loaded successfully!")
else:
    print("Unable to load the image. Please check the file path.")

features_blobs = []
def hook_feature(module, input, output):
    features_blobs.append(output.data.cpu().numpy())

model._modules.get('layer4').register_forward_hook(hook_feature)

img = cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB)
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_img = preprocess(img).unsqueeze(0).to(device)


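# Forward pass: the return value holds the class logits, while the forward hook stores the layer4 feature maps
with torch.no_grad():
    logits = model(input_img)

# CAM for class index 0 (`face` in the `classes` list above): fc weights of that class applied to the feature maps on the 7x7 grid
weight = model.fc.weight.detach().cpu().numpy()
cam = weight[0].dot(features_blobs[0].reshape(-1, 7 * 7))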

print("cam", cam)
cam = cam.reshape(7, 7)
cam = cam - np.min(cam)
cam = cam / np.max(cam)
cam = np.uint8(255 * cam)
# cam = cv2.resize(cam, (256, 256))
cam = cv2.resize(cam, (img.shape[1], img.shape[0])) 
print("shape", cam.shape)

# Apply heatmap on the original image
heatmap = cv2.applyColorMap(cam, cv2.COLORMAP_JET)
result = heatmap * 0.3 + original_img * 0.5
cv2.imwrite('CAM3.jpg', result)


Image loaded successfully!
cam [-0.5741447  -1.6797626  -1.9717126  -2.193766   -2.209783   -2.2199125
 -0.81649613 -1.3061167  -2.5648055  -3.137256   -3.6369727  -3.6624393
 -3.5447173  -1.5581748  -1.5196267  -2.6370683  -3.0102556  -3.4986055
 -3.3736675  -3.2397418  -1.3593086  -1.7064718  -2.1653624  -1.6765227
 -1.7456464  -1.5183933  -1.5797147  -0.01637324 -1.6570137  -1.5708845
 -0.10633154 -0.40312177  1.9872444   1.6235801   1.2023222  -1.1372105
 -0.7300208  -0.8870628  -0.99238753  0.8796444   7.74089     1.1962438
  0.83911055  1.1900783   0.4664203   0.14652914 -0.12807944  0.90346867
  1.8204753 ]
shape (3024, 4032)


True
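
The project goal also asks for the approximate "center" of the object, a step that the cells above do not show. One possible way to obtain it from a CAM grid is sketched below, as an assumption on our part rather than the method of the cited paper: take either the center of the strongest grid cell or the centroid of the min-shifted map, and scale grid coordinates to image pixels. The function `cam_center` and the toy input are hypothetical; the toy map only mirrors the position of the largest value (index 40, i.e. row 5, column 5) in the flattened 7x7 output printed above.

```python
import numpy as np

def cam_center(cam_grid, image_height, image_width):
    """Return two (x, y) pixel estimates of the object center: CAM peak and CAM centroid."""
    grid_h, grid_w = cam_grid.shape
    # Pixel coordinates of the center of every grid cell.
    ys = (np.arange(grid_h) + 0.5) * image_height / grid_h
    xs = (np.arange(grid_w) + 0.5) * image_width / grid_w

    # Estimate 1: center of the single strongest cell.
    row, col = np.unravel_index(np.argmax(cam_grid), cam_grid.shape)
    peak = (float(xs[col]), float(ys[row]))

    # Estimate 2: centroid of the min-shifted map, so all weights are non-negative.
    weights = cam_grid - cam_grid.min()
    total = weights.sum()
    if total == 0:
        # Flat map: no preferred location, fall back to the image center.
        centroid = (image_width / 2.0, image_height / 2.0)
    else:
        centroid = (float((weights.sum(axis=0) * xs).sum() / total),
                    float((weights.sum(axis=1) * ys).sum() / total))
    return peak, centroid

# Toy 7x7 map with one dominant cell at row 5, column 5 (values are made up).
toy_cam = np.zeros((7, 7))
toy_cam[5, 5] = 1.0
peak, centroid = cam_center(toy_cam, image_height=3024, image_width=4032)
print(peak, centroid)
```

For this toy map both estimates coincide at x = 3168, y = 2376 in a 4032x3024 image. On a real CAM the two can differ: the peak is sensitive to a single strong cell, while the centroid is pulled toward any secondary activations, which matters for images containing more than one object.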