
Including a new loss in the computation graph #4

Closed · freerafiki opened this issue Nov 29, 2022 · 6 comments

@freerafiki

Hi, first thanks for the nice work and for releasing it open source.

I have a question that is related to PyTorch Lightning but also to your code, so I wanted to ask here in case you could point me to the relevant part of the code or to other resources.
I am adding one (or more) loss functions to change the way the model learns the assembly task, since I believe there is room for improvement in the loss computation.

My question is: how can I add a custom loss and include it in the computation graph?

I added the loss function in base_model.py, registered it in the loss_dict, and added its weight in the configuration file (let's call the loss custom_loss; I added _C.loss = CN(); _C.loss.custom_loss_w = 1. to the config file).
Now when I run the training it goes through without errors, but my loss always keeps almost the same value. One explanation could be that I need to train longer, but I suspect that the optimization is not including my new loss.

So I debugged the code a bit and printed out the losses, and got:

loss: trans_loss, value: tensor([[0.1740, 0.0799]], device='cuda:0', grad_fn=<StackBackward0>)
loss: rot_pt_cd_loss, value: tensor([[0.0029, 0.0040]], device='cuda:0', grad_fn=<StackBackward0>)
loss: transform_pt_cd_loss, value: tensor([[0.0024, 0.0012]], device='cuda:0', grad_fn=<StackBackward0>)
loss: custom_loss, value: tensor([[0.0262, 0.0298]], device='cuda:0')
loss: rot_loss, value: tensor([[0.4134, 0.3462]], device='cuda:0', grad_fn=<StackBackward0>)
loss: rot_pt_l2_loss, value: tensor([[0.0294, 0.1065]], device='cuda:0', grad_fn=<StackBackward0>)

As you can see, custom_loss is missing the grad_fn, and from what I can read here, this could mean that the loss is not connected to the computation graph, so the optimizer does not take it into account. That would explain why it never decreases.
So I swapped my custom loss (which was an exponential) for a predefined one from PyTorch (torch.nn.L1Loss, for example) to see whether the problem was in the definition of the loss. (I read A Gentle Introduction to torch.autograd yesterday, but it did not resolve my doubts; should I set requires_grad somewhere? I did not see it in your code for the other loss functions either.) The result is the same (still no grad_fn), so I was wondering: where in the code should I look in order to include the loss in the computation graph?
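
For reference, here is a minimal standalone sketch (not from this repo) of how I understand grad_fn: a tensor created from scratch has none, while a tensor derived from something that requires grad does.

import torch

# A tensor created from scratch is a leaf with no grad_fn.
a = torch.zeros(2)
print(a.requires_grad, a.grad_fn)  # False None

# A tensor computed from something that requires grad carries a grad_fn,
# so a loss built this way is attached to the computation graph.
w = torch.zeros(2, requires_grad=True)
b = (w * 2).sum()
print(b.requires_grad, b.grad_fn)  # True <SumBackward0 ...>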

Thanks a lot for your time in advance.

Context for why I am trying to add a loss:
I got the training to work and was looking at the results, and it seems to me that the fragments stay in the place where they were put at the beginning (the origin); even after a fair number of epochs, both pieces are still there. I want to add a loss that forces them to be moved apart. I am starting with a simple idea based on the minimum distance between two points: assuming we have the correct transformation for the assembly of two pieces, the minimum distance between any point in piece A and any point in piece B would be zero (any point on the shared border of the two pieces). Of course, a minimum distance of zero does not necessarily mean that the assembly is correct, but it is a place to start, and I would already be happy to see the two pieces being moved apart. More will follow later (maybe using the center of mass or more sophisticated methods), but my guess is that the estimation of the transformation needs to be guided by a meaningful loss on the assembly itself, i.e. the two pieces should complete each other rather than overlap.

@Wuziyi616
Owner

Hi, thanks for your interest in our work! The loss should be added in this function, which I believe you have already done. Since you can print it and you do have the loss_w in the config file, I strongly suspect there is a bug in your loss implementation that blocks the gradient. Do you mind posting your loss implementation as well?

@freerafiki
Author

Hi - thanks for the quick answer.
Yes, I added the loss there. I also thought my implementation could be the problem, but even when using a trivial function (just creating a tensor and using it as a "fake" loss) the same thing happens.

Short example:

Let me explain: if I add the following in base_model.py

def compute_custom_loss(self, data_dict):
    # Allocates an uninitialized tensor with one entry per sample;
    # note that it is not derived from any model output.
    custom_loss = torch.Tensor(len(data_dict['data_id'])).cuda()
    return custom_loss

def _calc_loss(self, out_dict, data_dict):
    ...  # your original code here
    custom_loss = self.compute_custom_loss(data_dict)
    loss_dict = {
        'trans_loss': trans_loss,
        'rot_pt_cd_loss': rot_pt_cd_loss,
        'transform_pt_cd_loss': transform_pt_cd_loss,
        'custom_loss': custom_loss,
    }

I still get

loss: trans_loss, value: tensor([[0.2590, 0.1225]], device='cuda:0', grad_fn=<StackBackward0>)
loss: rot_pt_cd_loss, value: tensor([[0.0046, 0.0063]], device='cuda:0', grad_fn=<StackBackward0>)
loss: transform_pt_cd_loss, value: tensor([[0.0482, 0.0021]], device='cuda:0', grad_fn=<StackBackward0>)
loss: custom_loss, value: tensor([[7.6154e+31, 3.0854e-41]], device='cuda:0')
loss: rot_loss, value: tensor([[0.4776, 0.6484]], device='cuda:0', grad_fn=<StackBackward0>)
loss: rot_pt_l2_loss, value: tensor([[0.0117, 0.1338]], device='cuda:0', grad_fn=<StackBackward0>)

So the loss is not attached to any gradient. And since the loss is just one line that creates a tensor, there should be no error in it (or at least PyTorch would tell me something, I'd expect).

Long example:

The whole loss is a bit of a mess at the moment and not really finished; since it does not work, I did not complete the implementation. I was actually going to implement two losses, and this is the first one. The idea of the first one is the following:

  • I have a method which calculates the distance between the centers of mass of the two fragments (assuming we have only 2 pieces for now, to keep things simple) and also selects the points 'facing' each other (it does this by creating a virtual cylinder between the two centers of mass and selecting the points within this cylinder)
  • once I have the distances, I also get a rough idea of the shape of the two fragments by calculating the min and max along the three axes
  • then I calculate the loss using the distance between the centers of mass as the "prediction" and the average size of the fragment ((min + max) / 2) as the "target", which should push the transformation to move the fragments apart from each other (but not too far)

Let's assume the two methods that find the points, the distance, and the average size are working. If you are interested I can post the whole code, but I think it would become too long and turn into a separate discussion on how to do this efficiently (it is still very inefficient now), so it may go out of the scope of this issue. Do let me know if posting it would help; maybe I can later open another issue to discuss different kinds of losses.

def find_facing_points_and_distance(self, data_dict, radius=0.5, tolerance_dist=10e5):
    ...
    return facing_points, dist_euclidean, found_something

def get_bbox_avg_size(self, points3D):
    ...
    return avg_size

Then I calculate the loss as:

def compute_min_dist_loss(self, facing_info, data_dict, max_loss=1, dist_sigma=10e-5):
    # dist_cm and bbox_avg_size_A/B come from the helper functions above (omitted here)
    dists_cm = torch.Tensor([dist_cm, dist_cm]).cuda()
    target_dists = torch.Tensor([bbox_avg_size_A, bbox_avg_size_B]).cuda()
    loss_fn = torch.nn.L1Loss(reduction='none')  # 'reduce=False' is deprecated
    loss_value = loss_fn(dists_cm, target_dists)
    return loss_value

Even with this (the torch.nn.L1Loss from PyTorch) I still do not get any gradient. Initially, my idea was to normalize the distance between the centers of mass (dists_cm) and use an exponential function to compute the loss (something like torch.clip(torch.exp(-(dists_cm_normalized / dist_sigma)), 0, 1), where dist_sigma is a parameter depending on the size of the objects).
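
As a standalone function the exponential version would look roughly like the sketch below (the argument names are just placeholders); the exponential itself is differentiable, so the result carries a grad_fn only if its input is already attached to the graph.

import torch

def exp_dist_loss(dists_cm_normalized, dist_sigma=10e-5):
    # Penalize small (normalized) center-of-mass distances, clipped to [0, 1].
    # This has a grad_fn only if dists_cm_normalized itself requires grad.
    return torch.clip(torch.exp(-dists_cm_normalized / dist_sigma), 0, 1)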

So my idea was to simplify things and get at least one example to work (say, the custom_loss with just a fake empty tensor, or a simple loss that is always 0, it does not matter), and once I am sure that it works, build on top of it: clean up the code that finds the facing points and distances, and calculate the loss on the minimum distance in a more meaningful way.

@Wuziyi616
Owner

For the simple example you gave at the beginning, it is actually expected to have no grad_fn, because data_dict['data_id'] itself doesn't require gradient. On the other hand, can you try something like custom_loss = F.mse_loss(pred_trans)? In this case pred_trans is a model output prediction, so it should require gradient.
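
(Note that F.mse_loss takes a prediction and a target, so a self-contained version of this sanity check might look like the sketch below; the random values and the zero target are just placeholders.)

import torch
import torch.nn.functional as F

# Stand-in for the model's predicted translations, e.g. shape (batch, parts, 3).
pred_trans = torch.randn(2, 2, 3, requires_grad=True)

# Any target works for this check; zeros are simply a placeholder.
custom_loss = F.mse_loss(pred_trans, torch.zeros_like(pred_trans))
print(custom_loss.grad_fn)  # <MseLossBackward0 ...>, i.e. attached to the graph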

@Wuziyi616
Owner

I think your loss implementation is problematic. The two components involved in the L1 loss come from

dists_cm = torch.Tensor([dist_cm, dist_cm]).cuda()

and

target_dists = torch.Tensor([bbox_avg_size_A, bbox_avg_size_B]).cuda()

It looks like you are converting some numpy values into new torch tensors. Those numpy values are detached from the computation graph, which is why the result has no grad_fn. What you should do is keep everything as torch tensors inside the helper functions that compute dist_info and bbox_avg (and probably keep them on CUDA as well). Then the gradient can flow from the loss back to the model.
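
As a rough sketch of what I mean (the helper names, shapes, and the bbox-size definition are only illustrative, not the actual code in this repo):

import torch

def center_of_mass_distance(points_a, points_b):
    # points_*: (N, 3) tensors already transformed by the predicted
    # rotation/translation, so they carry grad_fn from the model outputs.
    cm_a = points_a.mean(dim=0)
    cm_b = points_b.mean(dim=0)
    # Stay in torch the whole way: no .item(), .numpy(), or torch.Tensor([...]).
    return torch.linalg.norm(cm_a - cm_b)

def bbox_avg_size(points):
    # Average extent of the axis-aligned bounding box, still as a tensor.
    extent = points.max(dim=0).values - points.min(dim=0).values
    return extent.mean()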

@freerafiki
Author

It makes a lot of sense now. With your suggestion custom_loss = F.mse_loss(pred_trans) I already get custom_loss tensor(0.0205, device='cuda:0', grad_fn=<MseLossBackward0>), which looks like what I wanted (now I have to fix the rest of the pipeline).

The remaining problem is that if I want to compute a loss based on the centers of mass of the two fragments (or on other properties that derive from the transformation proposed by the network at that iteration but are not directly stored in pred_trans), I need to find a way to keep the tensors connected. But I guess that if I manage to get the points of the point cloud after transforming them with pred_trans and then compute the loss on those points, the connection should work anyway, right?

I will try this (applying the predicted transformation and then estimating the minimum distance) step by step, keeping the flow connected. Thanks, now the problem is clear!
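
For reference, a minimal sketch of what I plan to try (names and shapes are placeholders, and I only apply a translation here for simplicity); since everything is computed with torch ops on pred_trans, the minimum distance should stay attached to the graph:

import torch

def min_distance_loss(points_a, points_b, pred_trans_a, pred_trans_b):
    # points_*: (N, 3) canonical point clouds; pred_trans_*: (3,) predicted translations.
    moved_a = points_a + pred_trans_a
    moved_b = points_b + pred_trans_b
    # Minimum pairwise distance between the two transformed fragments.
    dists = torch.cdist(moved_a.unsqueeze(0), moved_b.unsqueeze(0)).squeeze(0)  # (N_a, N_b)
    return dists.min()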

@Wuziyi616
Owner

Yes, I'm happy to help and good luck!
