## Gifsplanation

Gifsplanation is a method proposed [here](https://doi.org/10.48550/arXiv.2102.09475) for generating counterfactuals by shifting an image's latent representation in the direction that changes the class prediction the most. With an encoder and a decoder, we can then generate images that show the visual changes that would lead to a higher prediction confidence for a class.
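To make the idea concrete, here is a minimal sketch of a latent shift. It assumes a tiny, untrained `encoder`, `decoder` and `classifier` built from linear layers on flat 16-value "images"; these are hypothetical stand-ins and not the models used in this notebook. The latent code is shifted along the gradient of the class score, and each decoded step would be one frame of the GIF.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: tiny linear models on flat 16-value "images".
encoder = nn.Linear(16, 4)
decoder = nn.Linear(4, 16)
classifier = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())

x = torch.rand(1, 16)

# Encode the input, then track gradients with respect to the latent code.
z = encoder(x).detach().requires_grad_(True)

# Gradient of the class score of the reconstruction with respect to z.
score = classifier(decoder(z)).sum()
(grad_z,) = torch.autograd.grad(score, z)

# Decode a sweep of shifted latents; each step would be one GIF frame.
lambdas = torch.linspace(-2.0, 2.0, steps=5)
with torch.no_grad():
    frames = [decoder(z + lam * grad_z) for lam in lambdas]
    scores = [classifier(frame).item() for frame in frames]
```

In this toy setup the class score rises as the shift grows, because the shift follows the gradient of that score. With a real autoencoder the step sizes would need tuning so the reconstructions stay realistic.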

We already have an encoder: the backbone of our trained model. The main benefit of reusing it is that the CAVs we have generated are meaningful in its latent space. We can therefore hopefully add the CAVs to the activations and change the reconstructed image in the direction of the CAVs, which should show us the visual features that the CAVs represent.

We therefore need a way to add values to the activations in the layers that we used as bottleneck layers for our automated concept extraction method. To facilitate this, I have copied the ResNet implementation that the Faster R-CNN model uses and altered it to include a decoder and to allow values to be added to the intermediate activations. This file, resnet.py, is included in the `Utils/Torchvision_files` folder of the repo.
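As an illustration of what "adding values to the activations" means, the sketch below uses a PyTorch forward hook on a small stand-in network. This is a different technique from the one in resnet.py, which changes the model code itself, and the network, the hooked layer and the offset are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in network; the notebook uses a modified ResNet50.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)

def make_add_hook(offset):
    """Return a forward hook that adds `offset` to a layer's output."""
    def hook(module, inputs, output):
        # A value returned from a forward hook replaces the layer's output.
        return output + offset
    return hook

x = torch.rand(1, 3, 16, 16)
with torch.no_grad():
    baseline = net(x)

# One value per channel, broadcast over the spatial dimensions.
offset = torch.full((1, 8, 1, 1), 0.5)
handle = net[0].register_forward_hook(make_add_hook(offset))
with torch.no_grad():
    shifted = net(x)
handle.remove()  # Remove the hook to restore the original behaviour.

with torch.no_grad():
    restored = net(x)
```

A hook leaves the model code untouched, whereas building the addition into the forward pass, as done in resnet.py, makes the extra input an explicit part of the model.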

Ideally, we will create an autoencoder that produces realistic-looking reconstructions, which we can then alter by passing in some combination of the CAVs and the gradients from the images. In theory, passing in the gradients for a single prediction should influence only the corresponding portion of the image/tile. This is something I will have to experiment with.

### Imports

We will be making use of some PyTorch modules and loading in our modified ResNet and some training files.

In [1]:
import torch
from torch.utils.data import DataLoader
import torchvision.transforms as T

from Utils.Torchvision_files.resnet import resnet50
import Utils.ACE.ace_helpers as ace_helpers
from Utils.Training.dataset import MidogDataset, transforms
from Utils.Training.training import train_ae
import Utils.Training.utils as utils

### Loading in the trained model

We want to reuse the backbone from our trained model, so we will load the model, isolate the backbone and copy its weights into a new ResNet50 model. This new model has no Feature Pyramid Network, but that is not important here: we do not need activations from multiple layers to aid detection, only the activations from the final layer, which are passed to the decoder for reconstruction.

In [2]:
%%capture
# Load the old model in.
bottleneck_layers = ['backbone.body.layer4.2.conv1']

# Create the model variable and set it to evaluate.
mymodel = ace_helpers.MyModel("mitotic", bottleneck_layers)
mymodel.model.eval()
mymodel.model.model

FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       

In [3]:
# Take the transform from our original model; this ensures that the activations are the same.
transform = mymodel.model.model.transform

# Define the inverse of this transform so we can return to the original image space once we have output.
inv_transform = T.Compose([
    T.Resize((512, 512)),
    # Undo the division by std: (x - 0) / (1 / std) = x * std.
    T.Normalize(mean=[0., 0., 0.], std=[1/0.229, 1/0.224, 1/0.225]),
    # Undo the mean subtraction: (x - (-mean)) / 1 = x + mean.
    T.Normalize(mean=[-0.485, -0.456, -0.406], std=[1., 1., 1.]),
])

# Take the backbone from our trained model.
backbone = mymodel.model.model.backbone.body

We want to make sure we take the transform from the model as this was applied during the training of the backbone, so it is important to keep it consistent. We also need an inverse of this transform so we can recover images from the model.
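The two `Normalize` steps in the inverse transform can be checked with a small round trip. The sketch below reimplements the per-channel arithmetic of `Normalize`, `(img - mean) / std`, in plain PyTorch on a random image. The helper `normalize` is hypothetical, and the resize step is left out because interpolation is not exactly invertible.

```python
import torch

torch.manual_seed(0)

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img, mean, std):
    # Same per-channel arithmetic as torchvision's Normalize.
    return (img - mean) / std

img = torch.rand(3, 8, 8)
normalized = normalize(img, mean, std)

# Step 1: mean 0 and std 1/s multiplies each channel by s.
step1 = normalize(normalized, torch.zeros(3, 1, 1), 1.0 / std)
# Step 2: mean -m and std 1 adds m back.
recovered = normalize(step1, -mean, torch.ones(3, 1, 1))
```

Up to floating point error, `recovered` matches `img`, which is why the inverse is written as two `Normalize` calls in that order.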

### Create the new ResNet AutoEncoder

Now we can load in our custom ResNet model and view the architecture. I made use of the [Torchxrayvision implementation](https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/autoencoders.py) and found an implementation that contained training code [here](https://github.com/AlexPasqua/Autoencoders), which I altered for my use case.

In [4]:
# Take a new ResNet50 model.
resnet_backbone = resnet50()

In [5]:
resnet_backbone

ResNetAE(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): BottleneckInsertValues(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1

Now we can load the state from our trained backbone into our ResNet autoencoder. The encoder layers are identical to the layers used in the backbone, so we can copy the weights exactly; the other layers are initialized from scratch.

In [6]:
%%capture
# Load the state from the backbone from our model.
resnet_backbone.load_state_dict(backbone.state_dict(), strict=False)

# NOTE: We are missing the fully connected layer weights and biases, but this is okay as we will just take the activations from
# the last bottleneck layer.
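Because the cell above runs under `%%capture`, the report returned by `load_state_dict` is hidden. With `strict=False`, PyTorch returns an object listing the `missing_keys` and `unexpected_keys`, which is a handy check that only the intended layers were left uninitialized. The sketch below shows this on two hypothetical toy modules, `Encoder` and `AutoEncoder`, not on the real models.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical stand-in for the trained backbone."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)

class AutoEncoder(nn.Module):
    """Hypothetical stand-in for the ResNet autoencoder."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.uplayer1 = nn.ConvTranspose2d(8, 3, kernel_size=3)

source = Encoder()
target = AutoEncoder()

# strict=False copies the matching keys and reports the rest.
report = target.load_state_dict(source.state_dict(), strict=False)
print("missing:", report.missing_keys)
print("unexpected:", report.unexpected_keys)
```

Here the missing keys are the decoder parameters, which is the expected outcome; an unexpected key would suggest a naming mismatch between the two models.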

It is important that we freeze the encoder layers by setting the `requires_grad` attribute of their parameters to False. This ensures that no gradients are calculated for these parameters, so they keep their trained values.

In [7]:
end_of_encoder = False

for name, child in resnet_backbone.named_children():
    
    if name == "uplayer1":
        end_of_encoder = True
        
    if not end_of_encoder:
        for param in child.parameters():
            param.requires_grad = False
            
    else:
        for param in child.parameters():
            param.requires_grad = True
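A quick way to confirm that the freeze behaved as intended is to list which parameters still require gradients. The sketch below repeats the same logic on a hypothetical toy model whose child names only mimic the real ones, and then passes just the trainable parameters to an optimizer. The choice of Adam is an assumption for illustration, not necessarily what `train_ae` uses.

```python
from collections import OrderedDict

import torch
import torch.nn as nn

# Hypothetical toy model; only the child names mimic the real autoencoder.
toy = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(3, 8, kernel_size=3)),
    ("layer1", nn.Conv2d(8, 8, kernel_size=3)),
    ("uplayer1", nn.ConvTranspose2d(8, 8, kernel_size=3)),
    ("uplayer2", nn.ConvTranspose2d(8, 3, kernel_size=3)),
]))

# Freeze every child that comes before the first decoder block.
in_decoder = False
for name, child in toy.named_children():
    in_decoder = in_decoder or name == "uplayer1"
    for param in child.parameters():
        param.requires_grad = in_decoder

frozen = [n for n, p in toy.named_parameters() if not p.requires_grad]
trainable = [n for n, p in toy.named_parameters() if p.requires_grad]

# Only the trainable parameters need to be handed to the optimizer.
optimizer = torch.optim.Adam(p for p in toy.parameters() if p.requires_grad)
```

Note that this approach relies on the decoder children being registered after the encoder children, since `named_children` iterates in registration order.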

We will now remove the trained model and empty the cache as it is taking up GPU memory that is needed during training.

In [8]:
del mymodel
torch.cuda.empty_cache()

### Loading in the Dataset

The dataset that we defined during the model training can be used again here. I have saved it as a script in the Utils folder in the repo for use across scripts and notebooks. We will train the autoencoder to reconstruct the images we pass through the encoder using the Mean Squared Error, which is the average squared difference between the pixel values of the reconstruction and those of the original.

In [9]:
# Create a Dataset object with the given root path to the training data and a defined transformation.
midog = MidogDataset("D:/DS/DS4/Project/Training_mitotic_figures", transforms)
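For reference, the Mean Squared Error mentioned above can be computed by hand on a tiny example. The two 2x2 tensors below are made-up values standing in for an original image and its reconstruction.

```python
import torch
import torch.nn.functional as F

# Made-up 2x2 "images" standing in for an original and its reconstruction.
original = torch.tensor([[0.0, 0.5], [1.0, 0.25]])
reconstruction = torch.tensor([[0.0, 0.0], [1.0, 0.75]])

# Mean of the squared element-wise differences.
manual = ((reconstruction - original) ** 2).mean()

# The built-in loss computes the same quantity.
builtin = F.mse_loss(reconstruction, original)
```

Two of the four pixels differ by 0.5, so the loss is (0.25 + 0.25) / 4 = 0.125. A loss of zero would mean a pixel-perfect reconstruction.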

The model can now be trained by passing in the model, dataset, transforms, batch size, number of epochs, output path and, if there is one, the path to a checkpoint. The checkpoint allows training to be stopped and resumed later if needed.

In [None]:
metrics = train_ae(resnet_backbone, midog, transform, inv_transform, bs=2, num_epochs=20, output_path="D:/DS/DS4/Project/AutoEncoder", checkpoint="D:/DS/DS4/Project/AutoEncoder/ResNetAE_2023_05_01_21_24_22_4.pth")

100%|██████████████████████████████████████████████████████████████████████████████| 3400/3400 [33:25<00:00,  1.70it/s]
Evaluating Model: 100%|██████████████████████████████████████████████████████████████| 375/375 [02:28<00:00,  2.52it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 3400/3400 [31:27<00:00,  1.80it/s]
Evaluating Model: 100%|██████████████████████████████████████████████████████████████| 375/375 [02:11<00:00,  2.84it/s]
 17%|█████████████▏                                                                 | 570/3400 [05:26<26:46,  1.76it/s]
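The internals of `train_ae` are not shown in this notebook, so the following is only a generic sketch of how stopping and resuming from a checkpoint is commonly done in PyTorch. The model, optimizer, dictionary keys and epoch number are all hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the autoencoder and its optimizer.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")

# Save everything needed to continue: weights, optimizer state, progress.
torch.save(
    {
        "epoch": 4,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    },
    path,
)

# Later: rebuild the objects, then restore their state before training on.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1)
checkpoint = torch.load(path, map_location="cpu")
resumed_model.load_state_dict(checkpoint["model_state"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state"])
start_epoch = checkpoint["epoch"] + 1
```

Saving the optimizer state alongside the weights matters for optimizers that keep running statistics, since restarting them from scratch can disturb training.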

In [None]:
metrics

We can see above that the model has learned to reconstruct images well and has improved over the course of training. Looking at the validation loss, we can see that the model has not overfit the data.

### Testing the reconstruction

Making use of the dataloader, we can pass an image to the autoencoder and view the returned image against the original. This will allow us to visually assess the quality of the reconstructions.

In [None]:
# Create a DataLoader from the dataset with a batch size of 1 and no shuffling, using a custom collate_fn to batch
# the output as desired.
data_loader = DataLoader(midog, batch_size=1, shuffle=False, collate_fn=utils.collate_fn)

# Create an iterator from the DataLoader
data = iter(data_loader)

In [None]:
# Take the next image batch (of size 1)
output = next(data)

This image now has to undergo the same transformation used during training and in the original trained Faster R-CNN model. This can then be passed to the GPU with the model and the output returned from the model.

In [None]:
# Transform the image batch.
imgs, targets = transform(output[0], output[1])

In [None]:
# Move the image and model to the GPU.
cuda_imgs = imgs.tensors.to("cuda")
resnet_backbone.to("cuda")

In [None]:
# Pass the image to the autoencoder. Gradients are not needed for inference.
with torch.no_grad():
    results = resnet_backbone(cuda_imgs)

In [None]:
# Use the inverse transform on both the original and the reconstructed image.
original = inv_transform(imgs.tensors)
result = inv_transform(results)

In [None]:
# Create a transform to convert the tensors to PIL images.
pil_transform = T.ToPILImage()

# Transform the tensors to PIL images.
original_img = pil_transform(torch.squeeze(original)) # imgs
reconstructed_img = pil_transform(torch.squeeze(result))

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(12,6))

ax[0].imshow(original_img)
ax[1].imshow(reconstructed_img)

ax[0].set_title('Original Image')
ax[1].set_title('Reconstructed Image')

ax[0].tick_params(left=False, bottom=False)
ax[1].tick_params(left=False, bottom=False)

ax[0].set(yticklabels=[], xticklabels=[])
ax[1].set(yticklabels=[], xticklabels=[]) 

title = 'Comparison of Original Image and Reconstruction'
fig.suptitle(title, fontsize=16)
plt.tight_layout()
plt.show()

We can see that the reconstructed image appears quite similar to the original, just not as clear and with some small discrepancies. This shows us that the model has learned to convert the encoding from our original model back into an approximate reconstruction of the input.

Ideally, we will be able to alter the intermediate activations and perceive changes in the reconstructions that seem plausible given the values added.

### Deciding on addition to the intermediate activations

There are several interesting angles to take here. I have concept activation vectors that are meaningful in the latent space of my encoder, and an encoder that has been trained on a task with multiple outputs: the detections of mitotic figures. Given this setup, I believe the most interesting application is to combine a CAV with the gradients of a detection. The motivation is to influence a detection, whose gradients we find, in the direction of the CAV. In theory, this will alter only the portion of the image that is of interest, and in a way that represents the visual feature of the CAV.

I will look at this in incremental stages, first testing whether the gradients can in fact alter only the relevant portion of an image. Next, I will add a small portion of an influential CAV to the activations along with the gradients. The motivation for this is that the CAV should influence the portion of the image relating to the detection. Regardless, I am excited to see whether the results are meaningful or just distorted reconstructions.
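One possible reading of these two stages is sketched below. It is only an assumption about how the gradients and a CAV could be combined, not a settled method: the shapes, the toy detection score, the weights `alpha` and `beta`, and the use of the gradient magnitude as a spatial mask are all hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: one image, 8 channels, a 4x4 activation map.
activations = torch.rand(1, 8, 4, 4, requires_grad=True)
cav = torch.randn(8)
cav = cav / cav.norm()  # Unit-length concept direction.

# Toy "detection score" that depends only on the top-left 2x2 region.
score = activations[:, :, :2, :2].sum()
(grads,) = torch.autograd.grad(score, activations)

# Spatial mask in [0, 1]: where the score is sensitive to the activations.
mask = grads.abs().sum(dim=1, keepdim=True)
mask = mask / mask.max()

# Stage 1: gradients only. Stage 2: gradients plus a small, masked CAV term.
alpha, beta = 0.5, 0.1
delta_grad = alpha * grads
delta_both = delta_grad + beta * mask * cav.view(1, -1, 1, 1)
shifted = activations.detach() + delta_both
```

In this toy case the change is confined to the region that the score depends on, which is the behaviour the first stage is meant to test on real detections.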

### Passing the gradients in

...

### Adding the CAV to the gradient

...

### Generating GIFs

...

### Conclusion

In [None]:
original_img

In [None]:
reconstructed_img