Alternate depth normalization #10

Open
jbrownkramer opened this issue Mar 4, 2024 · 4 comments

Comments

@jbrownkramer commented Mar 4, 2024

The justification in the paper for using disparity is "scale normalization". I know that this comes from OmniVore and ImageBind.
However, this does not actually achieve scale normalization.

What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring an object closer, its disparities get larger, whereas in an RGB image its colors stay the same. Instead, it must mean something like: two objects with the same "disparity" should take up the same number of pixels in the image.

To achieve this, you should use f/depth instead of bf/depth. This makes sense because b is an arbitrary value associated with your particular camera setup, and it provides no information about the geometry of the scene you are looking at. If you physically change b, the depth image does not change, but the disparity does.
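
To make that concrete, here is a toy example with made-up numbers (f in pixels, depth and baseline in metres):

f = 600.0    # focal length in pixels
depth = 2.0  # distance to some object, in metres

# Two rigs looking at the exact same scene, differing only in baseline b.
for b in (0.05, 0.10):
    print(b * f / depth)  # bf/depth depends on the rig: 15.0, then 30.0
print(f / depth)          # f/depth is the same for both rigs: 300.0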

One other suggested improvement: when you resize to 224, you're actually implicitly changing the focal length. So if h is the original height of the image, I would suggest computing "disparity" as

(224/h)f/depth

If normalization is having any positive effect, I bet this improved normalization will do better.
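
For concreteness, here is a minimal sketch of that normalization (untested; normalized_disparity, focal_length_px, and target_size are just placeholder names, and f is assumed to be in pixels):

import torch

def normalized_disparity(depth, focal_length_px, target_size=224):
    """Compute "disparity" as (target_size / h) * f / depth, with no baseline b."""
    h = depth.shape[-2]  # height of the original depth image, in pixels
    # clamp to avoid dividing by zero where depth is missing
    return (target_size / h) * focal_length_px / depth.clamp(min=1e-6)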

@StanLei52
Collaborator

Thank you so much for pointing this out and for your insightful suggestion! I will look into this and experiment with the normalization you mentioned in the depth-related experiments to see whether it yields better performance.

@jbrownkramer
Author

Another comment: the use of RandomResizedCrop for augmentation during training might largely break the connection between scale and the object being imaged. It might be good to preserve that information by applying the same (224/h) scaling factor during training (where h is the height of the crop). Actually, since w can be very different from h in RandomResizedCrop, something like sqrt((224/h)*(224/w)) might be better because it preserves information about the area of the object.

@StanLei52
Collaborator

@jbrownkramer Thank you for your comments. If possible, could you please provide the implementation you mentioned, so that I can find some time later to run experiments on it? Thanks.

@jbrownkramer
Author

Here you go. You should be able to replace RandomResizedCrop in RGBD_Processor_Train with this. It is untested code, FYI.

import math

import torch
import torchvision.transforms.functional as F
from torchvision.transforms import RandomResizedCrop

class RandomResizedCropAndScale(RandomResizedCrop):
    """
    Crop a random portion of the image and resize it to a given size, then
    scale the final channel (disparity) by the square root of the ratio
    between the output area and the area of the crop in the original image.
    """

    def forward(self, img):
        """
        Args:
            img (PIL Image or Tensor): Image to be cropped and resized.

        Returns:
            Tensor: Randomly cropped, resized, and scaled image.
        """
        i, j, h, w = self.get_params(img, self.scale, self.ratio)
        cropped_and_resized_img = F.resized_crop(
            img, i, j, h, w, self.size, self.interpolation, antialias=self.antialias
        )

        # Convert the cropped and resized image to a tensor if it's not already
        if not isinstance(cropped_and_resized_img, torch.Tensor):
            cropped_and_resized_img = F.to_tensor(cropped_and_resized_img)

        _, height, width = F.get_dimensions(cropped_and_resized_img)

        # sqrt of the area ratio between the output and the crop taken from the original image
        scale_factor = math.sqrt((height * width) / (h * w))

        scaled_img = cropped_and_resized_img.clone()  # clone to avoid modifying the input in place
        scaled_img[-1, :, :] *= scale_factor  # apply the scaling to the last (disparity) channel only

        return scaled_img
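
For example, I imagine it slots into the training transform roughly like this (the other transforms and their parameters here are placeholders, not the actual ones in RGBD_Processor_Train):

from torchvision import transforms

rgbd_transform_train = transforms.Compose([
    RandomResizedCropAndScale(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
])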

You should also be able to replace the __call__ function in RGBD_Processor_Eval with

    def __call__(self, depth):
        # here depth refers to disparity, in torch savefile format
        # note use ToTensor to scale image to [0,1] first
        img = torch.randn((3, 224, 224))  # placeholder RGB channels; only the depth slice is returned

        if depth.ndim == 2:
            depth = depth.unsqueeze(0)

        scale = 224 / depth.shape[-2]  # 224 / h, where h is the height of the input disparity map

        rgbd = torch.cat([img, depth * scale], dim=0)
        transform_rgbd = self.rgbd_transform(rgbd)
        img = transform_rgbd[0:3, ...]
        depth = transform_rgbd[3:4, ...]

        return depth
