# Dice Loss PR #1249

Open
opened this issue Apr 12, 2017 · 39 comments

### rogertrullo commented Apr 12, 2017 • edited

Hi, I have implemented a Dice loss function, which is used in segmentation tasks and sometimes even preferred over cross-entropy. More info in this paper: http://campar.in.tum.de/pub/milletari2016Vnet/milletari2016Vnet.pdf Here's the link to it: https://github.com/rogertrullo/pytorch/blob/rogertrullo-dice_loss/torch/nn/functional.py#L708 How could I submit a PR? Thanks!

### IssamLaradji commented May 31, 2017 • edited

Is your code doing the same thing as this?

```python
def dice_loss(input, target):
    smooth = 1.

    iflat = input.view(-1)
    tflat = target.view(-1)
    intersection = (iflat * tflat).sum()

    return 1 - ((2. * intersection + smooth) /
                (iflat.sum() + tflat.sum() + smooth))
```
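As a quick sanity check of the formula above, here is the same computation in NumPy (illustrative only; the PyTorch version uses `.view(-1)` instead of `.reshape(-1)`): a perfect binary prediction gives a loss of exactly 0.

```python
import numpy as np

def dice_loss(input, target):
    # Soft dice loss on flattened predictions and targets,
    # mirroring the snippet above.
    smooth = 1.
    iflat = input.reshape(-1)
    tflat = target.reshape(-1)
    intersection = (iflat * tflat).sum()
    return 1 - ((2. * intersection + smooth) /
                (iflat.sum() + tflat.sum() + smooth))

perfect = np.array([1., 0., 1., 0.])
print(dice_loss(perfect, perfect))  # 0.0
```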

### tommy-qichang commented Sep 6, 2017

Hi @IssamLaradji, I have a few questions about the code. Is `smooth` similar to an `eps` that avoids division by zero? Like the cross-entropy loss, the result should be a positive value, so I'm wondering if this is correct: `return 1 - ((2. * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth))`. Thanks

### IssamLaradji commented Oct 19, 2017 • edited

@tommy-qichang `smooth` does more than that. You can set `smooth` to zero and add an `eps` to the denominator to prevent division by zero. However, a larger smooth value (also known as Laplace smoothing, or additive smoothing) can help avoid overfitting. The larger the smooth value, the closer the following term is to 1 (everything else being fixed): `((2. * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth))`. This decreases the penalty obtained when `2 * intersection` differs from `iflat.sum() + tflat.sum()`. A similar approach is commonly used in Naive Bayes; see equation (119) in these notes. Yes, that should be the case, good catch!
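To make the effect concrete, here is a tiny pure-Python sketch (the `intersection` and sum values below are invented for illustration, not from any real model): as `smooth` grows, the ratio moves toward 1 regardless of the actual overlap.

```python
# Made-up overlap statistics for illustration.
intersection, pred_sum, target_sum = 10., 40., 40.

def dice_ratio(smooth):
    # The term whose distance from 1 the dice loss penalizes.
    return (2. * intersection + smooth) / (pred_sum + target_sum + smooth)

# Ratio grows toward 1 as smooth increases: 0.25, ~0.259, ~0.667
print(dice_ratio(0.), dice_ratio(1.), dice_ratio(100.))
```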

### tommy-qichang commented Oct 19, 2017

Thanks @IssamLaradji for your reply. As for 2, even if the cost value is negative, I don't think it affects the backpropagation.

### IssamLaradji commented Oct 19, 2017

That's true, it shouldn't affect the optimization.

### rogertrullo commented Oct 24, 2017 • edited

Hi @IssamLaradji, for some reason I never got around to replying, I am sorry. These snippets are actually doing something very similar, but with small differences: the Dice ratio in my code follows the definition presented in the paper I mentioned (the difference is in the denominator, where you define the union as the sum while I use the sum of the squares). Additionally, my code was intended to be used with 2 channels in the last layer, whereas yours takes only one channel (possibly the output of a sigmoid layer). I just tested your code against mine and there's a difference on the order of 1e-3. I am not entirely sure why; I think it is related to the fact that I compute the dice independently for each element in the batch and then divide by the batch size, but I am not really sure.

### varghesealex90 commented Oct 27, 2017 • edited

@IssamLaradji Does your implementation need the target and predictions to be one-hot encoded? I think it does not. Say I have 3 classes in an image; is it possible to get the dice score associated with each class? If so, one could assign more weight to under-represented classes. Regards, Varghese

### rogertrullo commented Oct 27, 2017

@varghesealex90 read my previous post: as I mentioned, @IssamLaradji's code takes as input only one channel representing the probability of a pixel being foreground, so it is for binary problems.

### varghesealex90 commented Oct 27, 2017

I was browsing through your code. I have a multi-class problem, however my labels are one-hot encoded. Is there a quick and neat way of doing this in PyTorch?

### IssamLaradji commented Oct 27, 2017 • edited

@varghesealex90 A naive, quick way is to apply the dice loss on each channel with a different weight. Note that this does not tie the classes in the output layer, as it treats each class independently as a binary problem. But earlier layers would still try to learn features that differentiate between the classes. Here is an inefficient way of doing this:

```python
def dice_loss(input, target):
    smooth = 1.
    loss = 0.

    for c in range(n_classes):
        iflat = input[:, c].view(-1)
        tflat = target[:, c].view(-1)
        intersection = (iflat * tflat).sum()

        w = class_weights[c]
        loss += w * (1 - ((2. * intersection + smooth) /
                          (iflat.sum() + tflat.sum() + smooth)))
    return loss
```

where `class_weights` is a list containing the weight for each class, and `input` and `target` are shaped as `(n_batches, n_classes, height, width)`; `target` is assumed to be one-hot encoded. With proper vectorization, you can make this run much faster.
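A hedged sketch of what that vectorization could look like, written with NumPy array ops for illustration (a PyTorch version would use `dim=` instead of `axis=`; the function name and `class_weights` argument are mine, not from the snippet above):

```python
import numpy as np

def dice_loss_vectorized(input, target, class_weights, smooth=1.):
    # Sum over batch, height and width at once (axes 0, 2, 3 of an
    # (n_batches, n_classes, H, W) array), leaving one dice term per class.
    intersection = (input * target).sum(axis=(0, 2, 3))
    denom = input.sum(axis=(0, 2, 3)) + target.sum(axis=(0, 2, 3))
    per_class = 1 - (2. * intersection + smooth) / (denom + smooth)
    # Weighted sum across classes, matching the per-class loop version.
    return (np.asarray(class_weights) * per_class).sum()
```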

### rogertrullo commented Oct 27, 2017

@varghesealex90 here's a way to convert a tensor to one-hot: `y_onehot.scatter_(1, GT.view(GT.size(0), 1, GT.size(1), GT.size(2)), 1)`. This assumes that `GT` is a tensor of shape `(N, H, W)` and `y_onehot` is a tensor of shape `(N, C, H, W)`, where `C` is the number of classes.

### varghesealex90 commented Oct 27, 2017

@rogertrullo

I think there is a bug in the one-hot encoding code you shared here:

```python
batch_size = 3
gt = torch.ones(batch_size, 4, 4).long()  # N,H,W

# make dummy 3 classes
gt[0] = 0
gt[2] = 2

y_one_hot = torch.FloatTensor(batch_size, 4, 4)
y_one_hot.scatter_(1, gt.view(gt.size(0), 1, gt.size(1), gt.size(2)), 1)
```

error:

```
File "", line 1, in
RuntimeError: invalid argument 3: Index tensor must have same dimensions as output tensor at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensorMath.c:50
```

I know I am missing something here. Any ideas?


### rogertrullo commented Oct 27, 2017

@varghesealex90 the error is in `y_one_hot = torch.FloatTensor(batch_size, 4, 4)`; it should be `y_one_hot = torch.FloatTensor(batch_size, 3, 4, 4)`. Notice the 3 indicating the number of classes.
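For reference, the same one-hot construction can be sketched in NumPy (illustrative function name; `np.put_along_axis` plays the role that `scatter_` plays in the PyTorch snippets above, and the output has the corrected `(N, C, H, W)` shape):

```python
import numpy as np

def one_hot(gt, n_classes):
    # gt: integer labels of shape (N, H, W); result: (N, n_classes, H, W).
    y = np.zeros((gt.shape[0], n_classes) + gt.shape[1:], dtype=np.float32)
    # Write a 1 into the channel given by each label, like scatter_ on dim 1.
    np.put_along_axis(y, gt[:, None, :, :], 1, axis=1)
    return y

gt = np.zeros((3, 4, 4), dtype=np.int64)
gt[2] = 2
print(one_hot(gt, 3).shape)  # (3, 3, 4, 4)
```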

### varghesealex90 commented Oct 27, 2017

@rogertrullo I am using your Dice loss function, however I am getting the following error:

```
num=torch.sum(num,dim=3)#####b,c
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/reduce.py", line 21, in forward
    return input.sum(dim)
RuntimeError: dimension out of range (expected to be in range of [-3, 2], but got 3)
```

The dimensions of input and target are `(10, 3, 240, 240)`. They have been converted to CUDA float.

### varghesealex90 commented Oct 27, 2017

the sizes of `input`, `target` and `num` are `(10L, 3L, 240L, 240L)`, `(10L, 3L, 240L, 240L)` and `(10L, 3L, 240L, 240L)`, which seems correct

### varghesealex90 commented Oct 27, 2017

@rogertrullo Is the fix to subtract 1 from each `dim`?

```python
num=torch.sum(num,dim=2)
num=torch.sum(num,dim=3)#b,c

den1=probs*probs#--p^2
den1=torch.sum(den1,dim=2)
den1=torch.sum(den1,dim=3)#b,c,1,1

den2=target*target#--g^2
den2=torch.sum(den2,dim=2)
den2=torch.sum(den2,dim=3)#b,c,1,1
```

#### The fix?

```python
num=torch.sum(num,dim=1)
num=torch.sum(num,dim=2)#b,c

den1=probs*probs#--p^2
den1=torch.sum(den1,dim=1)
den1=torch.sum(den1,dim=2)#b,c,1,1

den2=target*target#--g^2
den2=torch.sum(den2,dim=1)
den2=torch.sum(den2,dim=2)#b,c,1,1
```

### rogertrullo commented Oct 27, 2017

@varghesealex90 the problem is that you would then be summing across channels from the beginning, and I am not sure that is right (nor that it isn't). But the code should work without modification. Did you print the size of `num` before `num=torch.sum(num,dim=3)#####b,c`?

### varghesealex90 commented Oct 27, 2017

```
('input', (10L, 3L, 240L, 240L))
('target', (10L, 3L, 240L, 240L))
('num', (10L, 3L, 240L, 240L))

File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 476, in sum
    return Sum.apply(self, dim, keepdim)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/reduce.py", line 21, in forward
    return input.sum(dim)
RuntimeError: dimension out of range (expected to be in range of [-3, 2], but got 3)
```

PS: I commented out this line in the code:

```python
# from .modules.utils import _single, _pair, _triple
```


### rogertrullo commented Oct 27, 2017

@varghesealex90 sorry, I don't see why it would fail. It is weird, because `num` has 4 dimensions, so it should be fine to sum across the 4th one; it should produce a tensor of shape `(10, 3, 240, 1)`.

### varghesealex90 commented Oct 27, 2017

@rogertrullo I think I got it:

```python
a = torch.ones(2, 3, 4, 4)
print(a.size())  # (2L, 3L, 4L, 4L)
b = torch.sum(a, dim=2)
print(b.size())  # (2L, 3L, 4L)
c = torch.sum(b, dim=3)  # error pops up
```

The error pops up because there is no 4th dimension left after the first sum. I think we should keep `dim=2` throughout. Your thoughts?

### rogertrullo commented Oct 27, 2017 • edited

@varghesealex90 oh, on my machine it keeps a singleton dimension. In that case I guess you can swap the lines that sum over the 3rd and 2nd dimensions, so that we first sum over the 3rd dimension and then the 2nd (currently the 2nd comes first and then the 3rd).
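In NumPy terms, the two workarounds discussed above look like this (sketch only; `keepdims=True` here is the analogue of PyTorch's `keepdim=True`):

```python
import numpy as np

a = np.ones((2, 3, 4, 4))

# Fix 1: sum the highest axis first, so earlier axis numbers stay valid.
b = a.sum(axis=3).sum(axis=2)          # shape (2, 3)

# Fix 2: keep singleton dimensions so the original numbering survives.
c = a.sum(axis=2, keepdims=True).sum(axis=3, keepdims=True)  # shape (2, 3, 1, 1)

print(b.shape, c.shape)
```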

### varghesealex90 commented Oct 27, 2017

 Yes. That would be perfect.

### trypag commented Nov 12, 2017 • edited

@rogertrullo Hi, I am having issues optimizing the dice loss that you specified: the loss term should be something like `loss = criterion(output, target_var) + (1 - dice_loss(output, target_var))`. That formulation is not possible, because I noticed the upper bound of the function I implemented is not 1, as opposed to the real dice similarity. I rewrote the dice loss so that I can filter out the unwanted label -1; I might have introduced a bug while rewriting the code.

```python
def dice_loss(output, target, weights=1, ignore_index=None):
    output = output.exp()
    encoded_target = output.data.clone().zero_()
    if ignore_index is not None:
        # mask of invalid label
        mask = target == ignore_index
        # clone target to not affect the variable ?
        filtered_target = target.clone()
        # replace invalid label with whatever legal index value
        filtered_target[mask] = 0
        # one hot encoding
        encoded_target.scatter_(1, filtered_target.unsqueeze(1), 1)
        # expand the mask for the encoded target array
        mask = mask.unsqueeze(1).expand(output.data.size())
        # apply 0 to masked pixels
        encoded_target[mask] = 0
    else:
        encoded_target.scatter_(1, target.unsqueeze(1), 1)
    encoded_target = Variable(encoded_target)

    assert output.size() == encoded_target.size(), "Input sizes must be equal."
    assert output.dim() == 4, "Input must be a 4D Tensor."

    numerator = (output * encoded_target).sum(dim=3).sum(dim=2)
    denominator = output.pow(2) + encoded_target
    if ignore_index is not None:
        # exclude masked values from den1
        denominator[mask] = 0

    dice = 2 * (numerator / denominator.sum(dim=3).sum(dim=2)) * weights
    return dice.sum() / dice.size(0)
```

Would you mind giving me your point of view on this code and the formulation of your loss, please? Thanks

### rogertrullo commented Nov 12, 2017 • edited

Hi @trypag, there is no need to compute `1 - dice_val`; instead I just multiply `dice_val` by -1. We do that since we want to maximize the dice, and the optimizers try to minimize the function. I haven't checked your code (I will do it later), but you would probably want to multiply by -1.

### trypag commented Nov 13, 2017 • edited

@rogertrullo thanks for answering, I have a few questions: do you agree this loss is not bounded between 0 and 1? I noticed it can go higher than 1. I understand that min(-dice) = max(dice), but in this configuration it seems strange: for example, if you optimize only with the dice loss, you start with a negative loss, and I have no idea whether it will eventually get to 0; at least for me, optimizing only with the dice does not converge. With the cross-entropy loss, having `loss = ce(output, target) - dice(output, target)`, we might also have a negative loss at some point. Thanks :)

### rogertrullo commented Nov 16, 2017

@trypag, the way I did it, the loss should actually start with a value close to 0 and then decrease to a negative value; the "more negative" the better. It doesn't matter that the loss is negative; I have already trained systems with this loss and they work quite well. How many classes do you have?

### trypag commented Nov 16, 2017

 Alright, I have 134 classes.

### rogertrullo commented Nov 16, 2017

@trypag, then the loss should go below -1. Basically what I do is add the individual dice scores, so the perfect score would be -134. If you want it to be between 0 and -1, you should divide it by the number of classes.

### faustomilletari commented Apr 28, 2018

Regarding the squares in the denominator: have a look at the proof (somewhere) in here: https://mediatum.ub.tum.de/doc/1395260/1395260.pdf

### PeterXiaoGuo commented Jun 21, 2018 • edited

@IssamLaradji Hi, I read your post and I still face a problem: what is the shape of `input` and `target`? I tried feeding them with shape (batch size, channel = 1, width, height) and found that the dice loss is larger than 1. Should I set batch_size = 1 for each dice loss calculation? Besides, when I calculate the dice loss, should I divide it by 2, as @rogertrullo mentioned dividing by the number of classes? I use the prostate3T dataset, and some of the labels contain {0, 1, 2} while most contain only {0, 1} values. Thank you very much! Peter

### JingLi-0131 commented Oct 24, 2018 • edited

As for the dice loss, do I have to write a backward() function to compute the gradient? I looked at the dice loss in V-Net, https://github.com/mattmacy/vnet.pytorch, and it includes a backward() function written from scratch.

### trypag commented Oct 24, 2018

No, you don't, since autograd will backprop everything for you.

### CodeR57 commented Jan 2, 2019

 Hey! I wanted to use the dice loss for training my network, but I can't find it in torch.nn.loss. Is the dice loss commit merged yet? How can I use it?

### StuckinPhD commented Jan 2, 2019 • edited

> Hey! I wanted to use the dice loss for training my network, but I can't find it in torch.nn.loss. Is the dice loss commit merged yet? How can I use it?

Dice loss has not been merged yet, but you can use it as given above. I've tried the code and it works. (However, for binary segmentation I seem to be getting better results with BCE.) Make sure to change your label tensor such that each channel represents a class: even if you have a binary segmentation problem, the network output should be 2 channels and the label tensor should be 2 channels, 1 for background and 1 for foreground.

### CodeR57 commented Jan 3, 2019

 @farazkhan86 Thanks a lot for the reply. Which of the above options did you use? My use case is in multi-class (3 classes) segmentation.

### StuckinPhD commented Jan 7, 2019

@CodeR57 I used the one provided by @rogertrullo, but the one given by @IssamLaradji is also fine; both seem to give similar results. You have to arrange your target tensor such that each channel represents a class. I'm using the following function to rewrite my target to one class per channel:

```python
def _expand_target(input, C, device):
    """
    Converts an NxHxW label image to NxCxHxW, where each label is stored in a separate channel
    :param input: 3D input image (NxHxW)
    :param C: number of channels/labels
    :return: 4D output image (NxCxHxW)
    """
    assert input.dim() == 3
    shape = input.size()
    shape = list(shape)
    shape.insert(1, C)
    shape = tuple(shape)
    result = torch.zeros(shape)
    # for each batch instance
    for i in range(input.size()[0]):
        # iterate over channel axis and create corresponding binary mask in the target
        for c in range(C):
            mask = result[i, c]
            mask[input[i] == c] = 1
    return result.to(device)
```

All the best

### JingLiRaysightmed commented Jan 10, 2019 • edited

I defined the dice loss function shown below, but it doesn't work:

```python
def dice_loss(y_pred, y):
    # y_pred.shape = torch.Size([1, 2, 128, 128, 128])
    # y.shape = torch.Size([1, 128, 128, 128])
    smooth = 1.
    print(y_pred.grad_fn)
    y_pred = F.softmax(y_pred, dim=1)
    print(y_pred.grad_fn)
    y_pred = torch.argmax(y_pred, dim=1)
    print(y_pred.grad_fn)  # None
    iflat = y_pred.view(-1)
    tflat = y.view(-1)
    intersection = (iflat * tflat).sum()
    loss = 1 - ((2. * intersection + smooth) /
                (iflat.sum() + tflat.sum() + smooth))
    return loss
```

The error is:

```
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

I don't know why the grad_fn changed. And how can I make the dice loss work?
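(For what it's worth: `grad_fn` becomes `None` because `torch.argmax` returns integer indices that autograd cannot differentiate through. The usual fix is to drop the `argmax` and compute a soft dice directly on the softmax probabilities against a one-hot target. Here is a NumPy forward-pass sketch of that idea; the function names and shapes are illustrative, not from the code above.)

```python
import numpy as np

def softmax(x, axis=1):
    # Numerically stable softmax along the class axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_dice_loss(logits, y, smooth=1.):
    # logits: (N, C, ...) raw scores; y: integer labels of shape (N, ...).
    probs = softmax(logits, axis=1)       # differentiable; no argmax
    onehot = np.eye(logits.shape[1])[y]   # (N, ..., C)
    onehot = np.moveaxis(onehot, -1, 1)   # (N, C, ...)
    iflat, tflat = probs.ravel(), onehot.ravel()
    intersection = (iflat * tflat).sum()
    return 1 - (2. * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth)
```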

### emergencyd commented Sep 8, 2019

> loss = ce(output, target) - dice(output, target)

I thought the dice loss could be used directly, like `loss = -dice(output, target)`, rather than `loss = ce(output, target) - dice(output, target)`. I am quite curious about it... which one is right? @rogertrullo

### rogertrullo commented Sep 8, 2019

@emergencyd `-dice(...)` should be fine; the other form was just a test trying to combine the regular cross-entropy with the dice loss.