
deeplab doesn't predict the segmentation masks correctly #3739

Open
kirk86 opened this Issue Mar 25, 2018 · 44 comments

kirk86 commented Mar 25, 2018

Everything seems to be working properly (training, evaluation, etc.), except that deeplab doesn't predict the segmentation masks.

Example:
[attached: 001278_image (input image)]

[attached: 001278_prediction (predicted mask)]

The original images in the dataset are either colored, like the one above, or black and white, but all the masks are black and white.

Raj-08 commented Mar 26, 2018

Same here.
Did you freeze your model?

kirk86 commented Mar 26, 2018

@Raj-08 nope. Just trained normally for 25000 steps, then ran validation, and that's what I get.

Raj-08 commented Mar 26, 2018

Check the data you are visualizing: it should be within 0-1 and the type should be uint8.
Divide it by np.max, i.e. data / np.max(data),
convert to uint8, and then visualize; you should get it right then.
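A minimal sketch of that visualization idea (the file names and the extra *255 rescale are mine, not Raj-08's; without the rescale, a 0/1 mask cast to uint8 still renders as near-black):

import numpy as np
from PIL import Image

# Load the raw class-index mask produced by the model (illustrative name).
pred = np.array(Image.open("001278_prediction.png")).astype(np.float32)
# Normalize to [0, 1], then rescale to [0, 255] so class indices are visible.
vis = pred / max(float(np.max(pred)), 1.0) * 255.0
Image.fromarray(vis.astype(np.uint8)).save("001278_prediction_vis.png")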

kirk86 commented Mar 26, 2018

@Raj-08 thanks for the response. I'm a bit confused though.

Divide it by np.max i.e data/np.max(data)

That's scaling the data into the range [0, 1], i.e. float. Then casting to uint8 makes them ints. I don't see how that's going to help.

Question 1: Should the data be float in range [0, 1], or uint8 in range [0, 255]?
Question 2: Should we do the same for the masks?
Question 3: Does deeplab also require bounding boxes for the masks, or can it work without them?
Question 4: Does it matter if the images have different sizes, e.g. img1.size = 256x320x3, img2.size = 520x320x3, etc.? Or should we make them all a fixed size?

Thanks!

meteorshowers commented Apr 6, 2018

@kirk86
I met the same problem!
Did you change the last logits layer from 21 to 2?
If so, I think maybe I can solve the problem.

meteorshowers commented Apr 6, 2018

@Raj-08 how do I freeze part of my model?
Thanks!

kirk86 commented Apr 6, 2018

@meteorshowers Yes, I did change the number of classes from 21 to 2.

khcy82dyc commented Apr 10, 2018

@kirk86 Hi, have you figured out a solution? I had exactly the same issue...

kirk86 commented Apr 10, 2018

@khcy82dyc No, TBH I haven't. My 2 cents, after spending a week chasing bugs not only in the deeplab model but in faster-rcnn as well: get as far away as you can. Most of the models carry extra complexity that makes them hard to understand, since they use slim. In my experience most of these models break once you change the configuration settings a bit. For instance, with faster-rcnn, once you switch to multi-GPU training things break again.

khcy82dyc commented Apr 11, 2018

@kirk86 @Raj-08 @meteorshowers I may have found a solution. It has to do with the number-of-classes value: for some reason it should not include the background as a class, otherwise it produces a blank segmentation. So in this case kirk86 might need to change it to 1 instead of 2.

Other funny behaviours I have encountered: in my case I had 1500 500x500 images with 7 classes including background. I'm using initialize_last_layer=False, last_layers_contain_logits_only=True, fine_tune_batch_norm=False, and Deeplabv3_xception as the initial checkpoint.

If I set my background to any value other than 0, training produces a constant loss value.

If I set my background to 0 and the number of classes to 7, I get a blank prediction.

If I set fine_tune_batch_norm=true, the loss grows to 6 digits by 50000 steps.

If I set fine_tune_batch_norm=false, I get "Loss is inf or nan. : Tensor had NaN values", even with a learning rate of 0.00001.

Do you guys mind sharing your training parameters here, please?

(It would be nice if @aquariusjay could provide some suggestions on the reason for these strange behaviours.)

kirk86 commented Apr 11, 2018

@khcy82dyc TBH I think I've tried it even with 1 class and still got the same results. Plus, all the models under tensorflow/research are so convoluted with unnecessary code that it's hard for me to understand what they are actually doing. As I said, after spending, cough, wasting some time, I decided to look elsewhere. It's been about a month since I last touched them, so I can't even remember the settings I had. But let me say one last thing: from my understanding you've spent some time debugging and tried multiple configurations, and if none of those configurations works then maybe ............................. cough!

georgosgeorgos commented Apr 11, 2018

@khcy82dyc I'm trying to train the model on a different dataset.
After 3-4k iterations, the loss stops decreasing and starts to oscillate (through 20k iterations). Did you experience something similar?

sid6641 commented Apr 13, 2018

@kirk86 @khcy82dyc I just can't seem to make it work on my custom dataset. I'm working with the mobilenet_v2 variant and getting the same completely black output. Running inference with the pre-trained model works fine for me. I tried most of the things @khcy82dyc mentioned; it's the same result.

Any suggestions?

georgosgeorgos commented Apr 13, 2018

I'm having the same problem with the black output.
It is probably something related to the number or order of the classes.

holyprince commented May 2, 2018

@meteorshowers hi, I also want to use my own data, which has a different number of classes, but I hit the same problem. I only changed the dataset settings in train.py and segmentation_dataset.py following the manual. Where can I change the last logits layer from 21 to 2? Thank you!

shanyucha commented May 3, 2018

Got the same issue. I used deeplab to do lane line segmentation, but unfortunately I got a constant loss between 0.2 and 0.3 and black predicted images.

Soulempty commented May 8, 2018

@khcy82dyc Hello, have you found a good solution for training with a single class? I'm running into the same problems.

shanyucha commented May 8, 2018

@kirk86 do you have an imbalanced data distribution, like 1:100? I encountered your problem before; after setting a different loss weight, I got non-black masks.

kirk86 commented May 8, 2018

@shanyucha it is imbalanced, but not at that level; more like 40:60.

Raj-08 commented May 8, 2018

@meteorshowers Do you mean freezing the weights before training, or generating a frozen graph from the trained model?

GWwangshuo commented May 18, 2018

@georgosgeorgos My loss also doesn't change. Any suggestions? Thanks.

XL2013 commented May 26, 2018

@shanyucha I met the same imbalanced data distribution problem, but I don't know how to set the loss weight. Did you mean the weights parameter in tf.losses.softmax_cross_entropy? If not, how do I change the loss weight?

shanyucha commented May 28, 2018

@XL2013 you can refer to this issue:
https://github.com/tensorflow/models/issues/3730

  1. In your case, the data samples may be strongly biased to one of the classes. That is why the model only predicts one class in the end. To handle that, I would suggest using a larger loss_weight for the under-sampled class (i.e., the class that has fewer data samples). You could modify the weights in line 72 by doing something like
    weights = tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight + tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0
    where you need to tune label0_weight and label1_weight (e.g., set label0_weight=1 and increase label1_weight).
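For concreteness, a self-contained TF1-style sketch of that weighting (the toy scaled_labels tensor and the weight values are placeholders to tune; in the repo this logic lives near the line referenced above):

import tensorflow as tf

ignore_label = 255
label0_weight = 1.0    # background
label1_weight = 10.0   # under-sampled foreground class (tune this)

# Toy flattened label tensor standing in for scaled_labels.
scaled_labels = tf.constant([0, 0, 1, 255, 0, 1], dtype=tf.int32)

weights = (tf.to_float(tf.equal(scaled_labels, 0)) * label0_weight +
           tf.to_float(tf.equal(scaled_labels, 1)) * label1_weight +
           tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0.0)

with tf.Session() as sess:
    print(sess.run(weights))  # [ 1.  1. 10.  0.  1. 10.]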
georgosgeorgos commented May 31, 2018

@GWwangshuo did you solve it? In my case, the problem was in the class IDs.

Blackpassat commented Jun 8, 2018

@shanyucha Hi, I met the constant-loss problem as well. Have you solved it? I also assigned weights to different classes, but my loss stays constant around 0.11.

bleedingfight commented Jun 13, 2018

@shanyucha Thanks for your reply. I'm using deeplabv3+ to train on my data, and my result is black (the object pixels are 1, the others 0). I want to segment the lanes in my data, but the number of lane pixels is too small. If I change the weight, what should I change it to? Something like this:

scaled_labels = tf.reshape(scaled_labels, shape=[-1])

# not_ignore_mask = tf.to_float(tf.not_equal(scaled_labels,
#                                            ignore_label)) * loss_weight
not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + tf.to_float(tf.equal(scaled_labels, 1)) * 2 + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 1

I changed the code and set the weight to 2, but the result is still bad. Should I change 2 to some other number, >2 or <2, and why? Thanks for your help!

Blackpassat commented Jun 14, 2018

Hi @bleedingfight, I tried the same approach as you for adding weights to labels, but I ended up with a loss oscillating around 0.11. How does your loss behave?

shanyucha commented Jun 14, 2018

@bleedingfight the weight is decided by the ratio between class 0 and class 1 in your case, and the ignore_label class should get weight 0, I suppose.

Blackpassat commented Jun 14, 2018

Hi @shanyucha, my weights are assigned according to the ratio of the different classes, and the weight for the ignore label is 0. However, my loss doesn't seem to be decaying. Did you get a normal result after assigning the weights?

bleedingfight commented Jun 18, 2018

@Blackpassat Sorry for the late reply. I've been training my models for some days, but the results are still bad. I changed the weight between 10, 200, 500 and 1000, but the loss oscillates around 1. When I train for over 200000 steps, the loss sometimes reaches 0.5-0.7, but it keeps oscillating. What I have tried:

  • Decreased the learning rate through 0.001, 0.0001, 0.00001; the result is the same (I train about 100k steps, Ctrl+C, then change the lr to another value, to save training time).
  • Increased train_batch_size from 8 to 32, but hit an error ("oop xxx", presumably out-of-memory) even though I have 8 NVIDIA Titan X (12GB) GPUs with num_clones=8. With train_batch_size=16 it works, but the loss oscillates around 1 and mIOU=0.46.
  • Changed the output stride:
    --atrous_rates=3
    --atrous_rates=6
    --atrous_rates=9
    --output_stride=32
    The config looks like:

python "${WORK_DIR}"/train.py \
  --logtostderr \
  --initialize_last_layer=False \
  --num_clones=8 \
  --last_layers_contain_logits_only=True \
  --dataset='lane_seg' \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=3 \
  --atrous_rates=6 \
  --atrous_rates=9 \
  --output_stride=32 \
  --decoder_output_stride=4 \
  --train_crop_size=513 \
  --train_crop_size=513 \
  --train_batch_size=16 \
  --training_number_of_steps="${NUM_ITERATIONS}" \
  --fine_tune_batch_norm=True \
  --tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_pascal_train_aug/model.ckpt" \
  --train_logdir="${TRAIN_LOGDIR}" \
  --dataset_dir="${LANE_DATASET}"

    The loss still oscillates.

  • In utils/train_utils.py there is a comment, "# Use larger learning rate for last layer variables", so I changed the multipliers like this:

for layer in last_layers:
  if layer in var.op.name and 'biases' in var.op.name:
    gradient_multipliers[var.op.name] = 50 * last_layer_gradient_multiplier
    break
  elif layer in var.op.name:
    gradient_multipliers[var.op.name] = 10 * last_layer_gradient_multiplier
    break

    Same result after training 10k steps.

Does anyone know why? Which parameter can I change to decrease the loss? My initial model is deeplabv3_pascal_train_aug_2018_01_04.tar.gz, as written in local_test.sh. Is my initial model wrong?

bleedingfight commented Jun 18, 2018

@shanyucha
Sorry for the late reply; I don't fully understand what you mean. Should I change the weight like this:

not_ignore_mask = tf.to_float(tf.equal(scaled_labels, 0)) * 1 + tf.to_float(tf.equal(scaled_labels, 1)) * 200 + tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0

i.e. set tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0, or should the 200 be a value in the range [0-1]? Can you tell me what the three parts mean? My reading is:

  • tf.to_float(tf.equal(scaled_labels, 0)) * 1: the background label's loss; 1 is its weight.
  • tf.to_float(tf.equal(scaled_labels, 1)) * 200: the object (lane) label's loss; 200 is its weight. In my dataset the lane pixels are too few, so I tell the loss function that a lane error counts more than a background error, increasing label 1's significance. The 200 should be fine-tuned to a better value.
  • tf.to_float(tf.equal(scaled_labels, ignore_label)) * 0: the ignore label's loss; 0 is its weight. In my case the mask pngs only contain 0 and 1 (background and lane), and I set the ignore label to 255 (as in VOC), so the weight should be 0 to tell deeplab not to deal with 255 pixels.

Can you tell me where my understanding is wrong? Thanks very much.
shanyucha commented Jun 19, 2018

@bleedingfight your understanding is right, provided the ratio between label 0 and label 1 in your case is 200:1.

sunformoon commented Jun 20, 2018

For the data imbalance problem, hard example mining or positive-example augmentation (augmenting the positive patches) may be another way, rather than just crudely setting a large loss_weight for the imbalanced class.

shanyucha commented Jun 21, 2018

Exactly. The best way is to balance the samples from the beginning; setting different weights is a tradeoff.
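One illustrative way to rebalance from the data side is to oversample the image IDs whose masks contain the rare class when building the train split. This is a sketch under an assumed VOC-style layout; the paths, the class index 1, and the repeat factor are placeholders:

import numpy as np
from PIL import Image

def has_foreground(image_id):
    # True if the label png for this ID contains class 1 (assumed layout).
    mask = np.array(Image.open('SegmentationClassRaw/%s.png' % image_id))
    return bool(np.any(mask == 1))

with open('train.txt') as f:
    ids = [line.strip() for line in f if line.strip()]

balanced = []
for image_id in ids:
    balanced.append(image_id)
    if has_foreground(image_id):
        balanced.extend([image_id] * 4)  # repeat positives; tune the factor

with open('train_balanced.txt', 'w') as f:
    f.write('\n'.join(balanced) + '\n')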

zhaolewen commented Jul 5, 2018

After quite some struggles, I finally got the deeplab model running on a single foreground object class segmentation task. I find the configuration of the label images very important:

  1. num_classes = number of foreground object classes + background, thus in my case it's 2.
  2. ignore_label=255 means that in the single-channel png label image, the value 255 marks regions that do not influence the calculation of loss and gradients. (Thus, pay attention not to mark your object as 255.)
  3. The class value of the object is the exact pixel value in the label png. That is to say, if you have 5 classes, they should be marked as 0, 1, 2, 3, 4 in the png. They will not be visible in the png, but that's ok. Do not give them values like 40, 80, 120, 180, 220 to make them visible, because the code in the repo reads the exact values in the png as the class labels.
  4. In the label png, 0 is the background. Do not set ignore_label=0 (to not complicate things).

I have encountered 3 of the 4 problems mentioned by @khcy82dyc (constant loss, blank prediction, and NaN). All three were because I had not given the correct value to the object in the label png.

In my case, I labelled the object pixels as 128, but since num_classes=2, all values greater than or equal to 2 seem to be ignored. Thus the network only sees 0 in the label png; it predicts nothing and converges to 0, producing the inevitable NaN even though I added gradient clipping, added 1e-8 to the logits, decreased the learning rate, and disabled the momentum.

By the way, when the network is running correctly it will still produce blank predictions for some steps (about 200 steps at 2 images per batch for me), but soon it starts predicting.
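To make point 3 concrete, here is a minimal sketch of converting a "visible" annotation (object drawn as 128, as in zhaolewen's case) into a raw class-index label png (file names are illustrative):

import numpy as np
from PIL import Image

mask = np.array(Image.open("raw_annotation.png").convert("L"))
label = np.zeros(mask.shape, dtype=np.uint8)  # background stays 0
label[mask == 128] = 1                        # object becomes class index 1
# The result looks black in a viewer, but it is what the code expects.
Image.fromarray(label).save("label.png")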

getsanjeev commented Jul 5, 2018

@zhaolewen Thank you for being so descriptive. I am facing a similar issue: my model predicts NaNs for all pixel values. Do you have any sample code for a DataGenerator for segmentation in Keras? Anything you would suggest?

zhaolewen commented Jul 16, 2018

Hi @getsanjeev, I don't know about the DataGenerator class; I've modified segmentation_dataset.py. What is important is:

_PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1464,
        'train_aug': 10582,
        'trainval': 2913,
        'val': 1449,
    },
    num_classes=21,    # background + number of object classes
    ignore_label=255,  # set the values for your objects from 1 to 2, 3, 4, etc.;
                       # set the places you want ignored to 255
)

If you're getting NaNs all the time, I think it's more likely the configuration and the input than the code.
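As a sketch, registering your own dataset in segmentation_dataset.py looked roughly like this at the time (the dataset name and split sizes are placeholders; check the dict in your copy of the file):

_MY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1000,  # number of training examples (placeholder)
        'val': 200,     # number of validation examples (placeholder)
    },
    num_classes=2,      # background + 1 foreground class
    ignore_label=255,   # pixels labeled 255 are excluded from the loss
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'my_dataset': _MY_DATASET_INFORMATION,  # then pass --dataset=my_dataset
}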

kritiyer commented Jul 19, 2018

@zhaolewen Thank you for your detailed description! I also have a binary segmentation problem, but I set the background class to 0 (ignore_label=0) and the object class to 1. Is it necessary to change the background to 255 and the object to 0? My predicted images are blank (all pixels are labeled as class 1).

edit: I misunderstood the 4th point. I think zhaolewen meant that you should not set anything for the ignore_label parameter, not that 0 was a bad choice for it.

lillyro commented Aug 10, 2018

In my case, I labelled the object pixels as 128, but since num_classes=2, all values greater than or equal to 2 seem to be ignored.

Hi @zhaolewen,
I don't understand the part "since my num_classes=2, all values above and equal to 2 seem to be ignored".
How can we verify this? And if I want to use 255 for the object, how should I change my code?
Thanks!

zhaolewen commented Aug 10, 2018

@lillyro you can divide the values in your image by 255; then your num_classes would also be 2, because you've got 0 as background and 1 as the object.
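In code, that remapping is just (a sketch; file names are illustrative):

import numpy as np
from PIL import Image

mask = np.array(Image.open("label_with_255.png").convert("L"))
label = (mask // 255).astype(np.uint8)  # 0 stays background, 255 becomes 1
Image.fromarray(label).save("label.png")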

omair50 commented Jan 21, 2019

Hi @zhaolewen, could you please share your train.py configuration settings for your case? My loss is oscillating around 0.2.

anandhupvr commented Feb 13, 2019

@omair50 did you find any solution?

margokhokhlova commented Apr 5, 2019

(quoting @zhaolewen's Jul 5, 2018 comment above in full)

A late question: were the label images integers or floats?

margokhokhlova commented Apr 11, 2019

Thank you, I followed this one and it works: the model learns, although the final performance I'm getting is lower than what I get with U-Net. I also add a sigmoid activation to the model output and use BCE + Jaccard loss.
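For reference, a hedged sketch of a BCE + Jaccard loss of the kind described (not the commenter's actual code; assumes y_pred already holds sigmoid probabilities):

import tensorflow as tf

def bce_jaccard_loss(y_true, y_pred, smooth=1e-7):
    # Pixel-wise binary cross-entropy.
    bce = tf.reduce_mean(tf.keras.backend.binary_crossentropy(y_true, y_pred))
    # Soft Jaccard (IoU) computed on probabilities.
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    jaccard = (intersection + smooth) / (union + smooth)
    return bce + (1.0 - jaccard)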
