[deeplab] Training deeplab model with ADE20K dataset #3730

Open
walkerlala opened this issue Mar 24, 2018 · 99 comments
Assignees
Labels
models:research models that come under research directory type:support

Comments

@walkerlala
Contributor

walkerlala commented Mar 24, 2018

System information

  • What is the top-level directory of the model you are using: deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.6.0
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 9.0/7.0.4
  • GPU model and memory: 1080Ti * 2 , 10Gb * 2
  • Exact command to reproduce:

Describe the problem

This is a feature request. I am trying to train the deeplab model on the ADE20K dataset (see this presentation). I have finished the data format conversion and "successfully" trained the model on a small subset of ADE20K. Below is my modification to research/deeplab/datasets/segmentation_dataset.py, which is used to extract segmentation data.

diff --git a/research/deeplab/datasets/segmentation_dataset.py b/research/deeplab/datasets/segmentation_dataset.py
index a777252..8648fb2 100644
--- a/research/deeplab/datasets/segmentation_dataset.py
+++ b/research/deeplab/datasets/segmentation_dataset.py
@@ -85,10 +85,22 @@ _PASCAL_VOC_SEG_INFORMATION = DatasetDescriptor(
     ignore_label=255,
 )
 
+_ADE20K_INFORMATION = DatasetDescriptor(
+    splits_to_sizes = {
+        'train': 40,
+        'val': 5,
+    },
+    # TODO temporarily change it to 21 otherwise dimension mismatch
+    num_classes=21,
+    ignore_label=255,
+)
+
 
 _DATASETS_INFORMATION = {
     'cityscapes': _CITYSCAPES_INFORMATION,
     'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
+    'ade20k': _ADE20K_INFORMATION,
 }
 
 # Default file pattern of TFRecord of TensorFlow Example.

The problem is that the ADE20K dataset has 150 classes, which differs from the number in the VOC or Cityscapes dataset. That causes a problem with the checkpoint file: currently there are only pretrained models for the VOC and Cityscapes datasets. So we have two choices here:

  1. Do not use the checkpoint file. In this case, there is an error:
absl.flags._exceptions.IllegalFlagValueError: flag --tf_initial_checkpoint=None: Flag --tf_initial_checkpoint must be specified.
  2. Set num_classes=21 to use one of the two provided checkpoint files.

Are there any alternatives to these?

If anyone has a workable solution for the ADE20K dataset, it would be really appreciated.

@aquariusjay
Contributor

  1. You could modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME` and also set the flag initialize_last_layer = False. (Note you still want to restore the variables in ASPP, decoder and so on.) By doing so, only the weights in the last classification layer are left uninitialized (so you can then use a classification layer with 150 classes).

  2. You need to explore min_resize_value and max_resize_value (and set resize_factor = output_stride) for ADE20K, which contains images at widely varying scales (e.g., dimensions range from 50 to 2000). By setting min_resize_value and max_resize_value, you can resize the images on the fly to a similar range (or you could do that yourself while pre-processing the dataset). Note, however, that these hyper-parameters may affect performance, and we have not yet explored that carefully.
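
To make the effect of these three flags concrete, here is a minimal sketch in plain Python 3 of how a target size could be derived from them. It is not the repository's implementation: the function name `resized_dims` is hypothetical, and the rounding rule for resize_factor (each side ends up as a multiple of the factor plus one, as with the 513 crop size) is an assumption.

```python
def resized_dims(height, width, min_resize_value, max_resize_value,
                 resize_factor=None):
    """Return (new_height, new_width), keeping the aspect ratio."""
    # Scale so that the shorter side reaches min_resize_value.
    scale = min_resize_value / min(height, width)
    new_h, new_w = round(height * scale), round(width * scale)

    # If the longer side now exceeds max_resize_value, scale to that instead.
    if max(new_h, new_w) > max_resize_value:
        scale = max_resize_value / max(height, width)
        new_h, new_w = round(height * scale), round(width * scale)

    if resize_factor:
        # Assumption: round each side up so that (side - 1) is a multiple
        # of resize_factor. This can exceed max_resize_value by a few pixels.
        new_h += (resize_factor - (new_h - 1) % resize_factor) % resize_factor
        new_w += (resize_factor - (new_w - 1) % resize_factor) % resize_factor
    return new_h, new_w


small = resized_dims(50, 100, 256, 512)       # tiny image is scaled up
wide = resized_dims(300, 900, 256, 512)       # long side is capped
wide_16 = resized_dims(300, 900, 256, 512, 16)
```

With this rule a 50x100 image and a 300x900 image end up in a comparable size range, which is the point of resizing on the fly.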

@walkerlala
Contributor Author

@aquariusjay Thanks for the hints. I have now started training, using the provided VOC model checkpoint, setting fine_tune_batch_norm to False, and using the mobilenet_v2 variant with a batch size of 8. Hopefully the loss will drop after several hours...

There are still two things confusing me:

  1. The segmentation annotation images in the ADE20K dataset have three channels, but I am reading them with label_reader = build_data.ImageReader('png', channels=1), as is done for the VOC dataset (in datasets/build_voc2012_data.py). Will that be a problem?

  2. Why do we have the resize_factor parameter?

@walkerlala
Contributor Author

Oh, will it be OK to prepare a pull request for the ADE20K dataset?

@aquariusjay
Contributor

Regarding your previous questions:

  1. The groundtruth images should contain only 1 channel with values = semantic labels.
  2. You could check the code for details.
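
Point 1 means every label pixel must hold a class index rather than a colour. As a minimal sketch of that conversion, assuming a colour-coded annotation: the table `COLOR_TO_INDEX` and the function `rgb_to_label_map` are hypothetical and invented for illustration, and the colours shown are not ADE20K's actual encoding.

```python
# Hypothetical palette: RGB colour -> semantic label.
# These entries are illustrative only, not ADE20K's real encoding.
COLOR_TO_INDEX = {
    (0, 0, 0): 0,      # background
    (128, 0, 0): 1,    # first class
    (0, 128, 0): 2,    # second class
}
IGNORE_LABEL = 255


def rgb_to_label_map(rgb_rows):
    """Convert rows of (r, g, b) pixels into rows of class indices.

    Colours missing from the palette are mapped to IGNORE_LABEL.
    """
    return [[COLOR_TO_INDEX.get(tuple(pixel), IGNORE_LABEL) for pixel in row]
            for row in rgb_rows]


annotation = [
    [(0, 0, 0), (128, 0, 0)],
    [(0, 128, 0), (7, 7, 7)],   # last colour is not in the palette
]
labels = rgb_to_label_map(annotation)
```

The resulting single-channel map is what would then be written out as the groundtruth image.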

We currently do not have any plan to prepare that.
However, note that one should be able to do that by using the provided code/model/script.
Also, any contribution of extra datasets to the codebase is welcome.

Cheers,

@brett-whitford

@aquariusjay,

I'm currently having similar issues attempting to train with a custom dataset and was hoping you could offer some insight.

You could modify the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME' and also set the flag initialize_last_layer = False.

The link you included "here" appears to need a Google SSO to login. I am assuming that was a link to the train_util.py script. Here are the changes I have currently made to implement your architecture on my custom dataset:

  1. segmentation_dataset.py
  • I added the information for my "toy_dataset"
_TOY_DATASET_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 800,
        'trainval': 1000,
        'val': 200,
    },
    num_classes=10,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'toy_dataset': _TOY_DATASET_INFORMATION,
}
  2. train.py
  • I do not initialize the final layer of the network.
  • I point training to the directory containing my custom "toy_dataset"
flags.DEFINE_boolean('initialize_last_layer', False,
                     'Initialize the last layer.')

flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')
  3. train_utils.py
  • I modified the code here so that the exclude_list only includes the `_LOGITS_SCOPE_NAME`, as you stated above.
  exclude_list = ['_LOGITS_SCOPE_NAME']
  if not initialize_last_layer:
    exclude_list.extend(last_layers)
  4. eval.py
  • I point evaluation to my custom "toy_dataset".
flags.DEFINE_string('dataset', 'toy_dataset',
                    'Name of the segmentation dataset.')

However, when I run this, my code appears to train successfully but then runs into an issue with the confusion matrix during evaluation (I include the traceback below for reference). Any tips/suggestions on how to fix this?

Thanks for your help!
Brett

Error Traceback:

~/brett/wss-python/models/research/deeplab$ sh local_test_custom.sh 
Converting toy dataset...
>> Converting image 50/200 shard 0
>> Converting image 100/200 shard 1
>> Converting image 150/200 shard 2
>> Converting image 200/200 shard 3
>> Converting image 250/1000 shard 0
>> Converting image 500/1000 shard 1
>> Converting image 750/1000 shard 2
>> Converting image 1000/1000 shard 3
>> Converting image 200/800 shard 0
>> Converting image 400/800 shard 1
>> Converting image 600/800 shard 2
>> Converting image 800/800 shard 3
--2018-03-30 12:33:03--  http://download.tensorflow.org/models/deeplabv3_pascal_train_aug_2018_01_04.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.8.176, 2607:f8b0:4009:80d::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.8.176|:80... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.

toy_dataset
INFO:tensorflow:Training on trainval set
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Ignoring initialization; other checkpoint exists
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-11
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 11.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
toy_dataset
INFO:tensorflow:Evaluating on val set
INFO:tensorflow:Performing single-scale test.
INFO:tensorflow:Eval num images 200
INFO:tensorflow:Eval batch size 1 and num batch 200
INFO:tensorflow:Waiting for new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train
INFO:tensorflow:Found new checkpoint at /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py:303: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/makbar/brett/wss-python/models/research/deeplab/datasets/toy_dataset/exp/train_on_trainval_set/train/model.ckpt-12
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting evaluation at 2018-03-30-16:35:58
Traceback (most recent call last):
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 168, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 452, in evaluate_repeatedly
    session.run(eval_ops, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]

Caused by op u'mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert', defined at:
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 175, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/makbar/brett/wss-python/models/research/deeplab/eval.py", line 142, in main
    predictions, labels, dataset.num_classes, weights=weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 1009, in mean_iou
    num_classes, weights)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/metrics_impl.py", line 263, in _streaming_confusion_matrix
    labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/confusion_matrix.py", line 183, in confusion_matrix
    message='`predictions` out of bound')],
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 579, in assert_less
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 177, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2027, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1868, in BuildCondBranch
    original_result = fn()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 175, in true_assert
    condition, data, summarize, name="Assert")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 48, in _assert
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [255 255 255...] [y (mean_iou/ToInt64_2:0) = ] [10]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2)]]
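
The assertion above reports values of 255 being checked against num_classes = 10. As a minimal sketch in plain Python (not TensorFlow's implementation, and with the function name `mean_iou` reused only for readability), this shows how a mean IoU computation can skip pixels carrying an ignore label, and why any other out-of-range value has to be rejected. The default ignore_label=255 matches the dataset descriptors in this thread; everything else is an assumption.

```python
def mean_iou(labels, predictions, num_classes, ignore_label=255):
    """Mean intersection-over-union over flat lists of class indices."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for label, pred in zip(labels, predictions):
        if label == ignore_label:
            continue  # ignored pixels contribute nothing to the metric
        if not (0 <= label < num_classes and 0 <= pred < num_classes):
            # A value such as 255 cannot index a num_classes-sized matrix.
            raise ValueError(
                "value out of bound: label=%d prediction=%d" % (label, pred))
        confusion[label][pred] += 1

    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(row[c] for row in confusion) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom:  # skip classes absent from both labels and predictions
            ious.append(true_pos / denom)
    return sum(ious) / len(ious) if ious else 0.0


score = mean_iou([0, 0, 1, 1, 255], [0, 1, 1, 1, 0], num_classes=2)
```

If the label maps contain values other than 0..num_classes-1 and the ignore label, a check like the ValueError branch here would catch them before evaluation.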

@walkerlala
Contributor Author

walkerlala commented Mar 31, 2018 via email

@wonderit
Contributor

wonderit commented Apr 3, 2018

@walkerlala

I am trying to train the deeplab model with the ADE20k datasets.
I'm having some problem with data format conversion.
Would you mind sharing the code for ADE20k datasets? It would be really appreciated.

@shipengai

@brett-whitford When I use my own data, I get the same error as you. Can you share your solution?
Thank you very much. I'm looking forward to your reply.

@walkerlala
Contributor Author

@wonderit Of course. Please wait for a while until I have access to my GPU server.

@walkerlala
Contributor Author

walkerlala commented Apr 3, 2018

@wonderit Here is the patch for converting training data and training deeplabv3 on ADE20K.

https://gist.github.com/walkerlala/82d978e68407e65158e8825cd470d7e1

(it can also be found at http://fastdrivers.org/misc/patch-for-ade20k.patch )

You can apply this patch on top of commit 1d38a22 or 5281c9a without conflict.

Note:

  1. You need to manually adjust the path in train_ade20k.py for training, and supply the correct path to the training data when converting it, as documented in the doc.

  2. training data can be found at: http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip

I am also going to submit a PR to get these into the repo. However, I don't have enough GPUs to get a good pretrained model (I only have two Nvidia 1080s...). If you can obtain a decent pretrained model, please share!

@walkerlala
Contributor Author

Also, anyone interested in adding ADE20K to deeplabv3 can take a look at this PR I just created: #3853

@shipengai

@walkerlala When using val.py, did you get the error 'predictions' out of bound? It is the same as @brett-whitford's question.
Thank you

@shipengai

@walkerlala Can you share your eval script?

@hhwxxx

hhwxxx commented Apr 8, 2018

@walkerlala @aquariusjay
Hi, I am confused about the exclude_list and initialize_last_layer.

I am not sure whether I understand it correctly:
If one wants to fine-tune deeplab-v3+ on another dataset, does only _LOGITS_SCOPE_NAME need to be excluded?

If so, following @aquariusjay 's suggestion, in "train_utils.py":

exclude_list = [_LOGITS_SCOPE_NAME]
if not initialize_last_layer:
    exclude_list.extend(last_layers)

If initialize_last_layer=false is set, then exclude_list will include the last_layers. In "train.py", last_layers is the list [_LOGITS_SCOPE_NAME, _IMAGE_POOLING_SCOPE, _ASPP_SCOPE, _CONCAT_PROJECTION_SCOPE, _DECODER_SCOPE, ].
So all variables in the list will be excluded. This seems inconsistent.

Shouldn't it be the following?
initialize_last_layer=true and exclude_list = [_LOGITS_SCOPE_NAME]

@lydialixia

lydialixia commented Apr 9, 2018

Hi, I'm training on my own dataset as well (only two classes).

When I set initialize_last_layer=false and

exclude_list = ['logits']
if not initialize_last_layer:
    exclude_list.extend(last_layers)

Then when I run vis.py, it gives me all black images (not binary).

When I only set initialize_last_layer=false, I got binary images (result is not good, but at least show some learning). But it gives me this when run train.py:

INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 6390723.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

when training_number_of_steps=100000

Does anyone know why this happens? Thanks!

@hhwxxx

hhwxxx commented Apr 10, 2018

@lydialixia
Hello.
You should add 'global_step' in exclude_list:

exclude_list = ['global_step']

But I am still confused about whether one should set initialize_last_layer=false when fine-tuning deeplab-v3+ on another task.
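
The log quoted earlier (a summary recorded at step 6390723, then an immediate stop with training_number_of_steps=100000) is consistent with the step counter being restored from the checkpoint. A minimal sketch of that reasoning, assuming that training stops once the step counter reaches the target; the checkpoint is modelled as a plain dict, and `filter_checkpoint` and `steps_left` are hypothetical helpers, not functions from the repository.

```python
def filter_checkpoint(checkpoint, exclude_list):
    """Drop excluded entries from a checkpoint represented as a plain dict."""
    return {name: value for name, value in checkpoint.items()
            if name not in exclude_list}


def steps_left(start_step, training_number_of_steps):
    """Steps still to run if training stops once the target is reached."""
    return max(0, training_number_of_steps - start_step)


# Hypothetical checkpoint contents; the step count is taken from the log.
checkpoint = {'global_step': 6390723, 'logits/semantic/weights': '...'}

restored_all = filter_checkpoint(checkpoint, [])
restored_without_step = filter_checkpoint(checkpoint, ['global_step'])

steps_if_restored = steps_left(restored_all.get('global_step', 0), 100000)
steps_if_excluded = steps_left(
    restored_without_step.get('global_step', 0), 100000)
```

Under that assumption, restoring the counter leaves no steps to run, while excluding 'global_step' lets training start from zero.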

@aquariusjay
Contributor

When you want to fine-tune DeepLab on other datasets, there are a few cases:

  1. You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).

  2. You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.

  3. You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
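
The three cases can be sketched as a small function that maps the two flags to the scopes left out when restoring a checkpoint. This is an illustration, not the repository's code: `scopes_to_exclude` is a hypothetical name, 'logits' and 'global_step' are the strings used elsewhere in this thread, the other four scope strings are assumed values for the constants listed above, and always excluding 'global_step' follows the earlier suggestion in this thread rather than the comment above.

```python
LOGITS_SCOPE = 'logits'
# Assumed string values for _IMAGE_POOLING_SCOPE, _ASPP_SCOPE,
# _CONCAT_PROJECTION_SCOPE and _DECODER_SCOPE.
OTHER_LAST_LAYER_SCOPES = ['image_pooling', 'aspp', 'concat_projection',
                           'decoder']


def scopes_to_exclude(initialize_last_layer, last_layers_contain_logits_only):
    """Return the scopes that are NOT restored from the checkpoint."""
    exclude = ['global_step']
    if initialize_last_layer:
        return exclude                       # case 1: re-use all weights
    exclude.append(LOGITS_SCOPE)             # cases 2 and 3: drop the logits
    if not last_layers_contain_logits_only:
        exclude.extend(OTHER_LAST_LAYER_SCOPES)  # case 2: backbone only
    return exclude


case_1 = scopes_to_exclude(True, False)
case_2 = scopes_to_exclude(False, False)
case_3 = scopes_to_exclude(False, True)
```

Case 3 is the one that fits a dataset with a different num_classes: everything is restored except the classification layer.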

@georgosgeorgos

Hi @walkerlala: did you manage to finetune the ADE20K dataset?
I'm trying to finetune on a dataset of the same size, but without success: after the first ~2K iterations the loss stops decreasing and starts to oscillate (~20K iterations).
I tried different learning rates, removed the regularization, but for the moment no improvement.

@walkerlala
Contributor Author

walkerlala commented Apr 12, 2018

@georgosgeorgos No, in the end I couldn't fine-tune the model on the ADE20K dataset. I don't have enough GPU resources. Every time I try to fine-tune the batch normalization parameters, the model blows up with an out-of-memory error, so I froze the batch normalization layers when training. In the end I only got a model with "modest" performance:

Here is the original image (too large to display here): http://www.fastdrivers.org/misc/stuffseg-origin.jpg

Here is the segmentation result:
[image: result]

However I can get a satisfying result with PSPNet:

[image: mmexport_1_473_seg]

According to the slides from the 2017 COCO + Places Workshop, deeplabv3 should also be able to do that, but I haven't had any luck fine-tuning it. Hopefully Google can provide a fine-tuned pre-trained model in the future @aquariusjay.

@cfosco

cfosco commented Apr 15, 2018

@brett-whitford - Hi Brett, I am having the exact same problem as you. How did you end up solving it?

@cfosco

cfosco commented Apr 15, 2018

@shipeng-uestc - Hi shipeng, did you manage to solve the issue? I am currently using exclude_list=[_LOGITS_SCOPE_NAME] with _LOGITS_SCOPE_NAME imported from deeplab.model as @walkerlala suggested but I am still having the same error as Brett.

@jiyongma

When I run the command below, I get the error that follows:
python deeplab/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --dataset="ade20k" \
  --checkpoint_dir="./deeplab/datasets/ADE20K/exp/train_on_train_set/train" \
  --eval_logdir="./deeplab/datasets/ADE20K/exp/train_on_train_set/eval" \
  --dataset_dir="./deeplab/datasets/ADE20K/tfrecord"

NotFoundError (see above for traceback): Key aspp1_depthwise/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[Node: save/RestoreV2/_299 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_306_save/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Please help me! Thanks.

@qmy612

qmy612 commented Apr 19, 2018

@hhwxxx Hello, in your answer to lydialixia, do you mean that in train_util.py, exclude_list should look like this:
exclude_list = ['global_step']
exclude_list = ['logits']

But I still can't start training. The log is:
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 30000.
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

I have also tried exclude_list = ['_LOGITS_SCOPE_NAME'], but this doesn't work.
When I just set exclude_list = ['global_step'], the model achieves mean IoU = 0.93 after 10000 iterations; I don't know whether this is wrong.
Waiting online, thank you!

@hhwxxx

hhwxxx commented Apr 19, 2018

@qmy612

Hello. Maybe you can try this:
exclude_list = ['global_step', 'logits']

As for _LOGITS_SCOPE_NAME, it is defined in "model.py", so you should use it like this: model._LOGITS_SCOPE_NAME.

And I have no idea about miou=0.93.
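
The difference between the quoted string '_LOGITS_SCOPE_NAME' and the constant's value can be sketched as follows. This is a simplified illustration: the prefix-matching rule in `variables_to_restore` is an assumption about how exclusion works, the function is a hypothetical stand-in rather than a library call, and the variable names in `names` are made up for the example.

```python
_LOGITS_SCOPE_NAME = 'logits'  # the value used elsewhere in this thread


def variables_to_restore(variable_names, exclude_list):
    """Keep names that do not start with any excluded scope.

    Simplified prefix matching; assumed behaviour, for illustration only.
    """
    return [name for name in variable_names
            if not any(name.startswith(scope) for scope in exclude_list)]


# Hypothetical variable names.
names = ['xception_65/entry_flow/conv1_1/weights',
         'logits/semantic/weights',
         'global_step']

# The quoted string is just the text "_LOGITS_SCOPE_NAME": it matches nothing.
with_quoted_name = variables_to_restore(names, ['_LOGITS_SCOPE_NAME'])

# The constant's value 'logits' matches the classification layer.
with_constant = variables_to_restore(names, [_LOGITS_SCOPE_NAME])
```

In the first call the logits variable is still restored, which would explain why quoting the constant's name has no effect.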

@BeSlower

BeSlower commented May 1, 2018

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

@holyprince

@BeSlower, yes, the solution works for me, but there is another problem: the result is all black with no other label, even though the loss decreases during training. Can anyone help me?

@xianshunw

@qmy612 Did you get the problem solved? I am having the exact same problem as you.

@qmy612

qmy612 commented May 6, 2018

@xiangjinwu Yes, hhwxxx's answer works:
exclude_list = ['global_step', 'logits']

@ajinkya933

@apolo74 Thanks I got the output now

@apolo74

apolo74 commented Apr 16, 2019

@apolo74 Thanks I got the output now

Happy to hear that!

@hakS07

hakS07 commented Jun 10, 2019

@BeSlower

Just setting initialize_last_layer = False and last_layers_contain_logits_only = True works for me, if you want to train on your own dataset with a different number of classes.

Hi, I tried training on my own data (classes = 2 = 1 + background) with:
initialize_last_layer = False
last_layers_contain_logits_only = True
label = gray-scale image (0, 1)
But the predicted mask I get at test time is a black mask.
Can you help me with this?

@hakS07

hakS07 commented Jun 10, 2019

@holyprince

@BeSlower, yes, the solution works for me, but there is another problem: the result is all black with no other label, even though the loss decreases during training. Can anyone help me?

Hi, I have the same problem as you: the predicted mask is a black image.
Did you fix it?

@hakS07

hakS07 commented Jun 10, 2019

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Hi, I tried training on my custom dataset as you said:
classes = 3 = 1 (obj) + background + ignore_label
label = gray-scale image (0, 1)
In my labels there are two pixel values: 0 for background and 1 for the object.
So should I include the ignore_label when counting the number of classes?
What I get as output is a black mask.
Can you help me fix it?

@yougoforward

Hey guys! Have you ever evaluated the provided ADE20K pretrained models on the val set? I have tested them, but both mobilenetv2_ade20k_train and xception65_ade20k_train score about 3%-4% lower than the reported performance.
Here is my evaluation script:
python eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=12 \
  --atrous_rates=24 \
  --atrous_rates=36 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --eval_crop_size=513 \
  --eval_crop_size=513 \
  --min_resize_value=513 \
  --max_resize_value=513 \
  --resize_factor=8 \
  --aspp_with_batch_norm=true \
  --aspp_with_separable_conv=true \
  --decoder_use_separable_conv=true \
  --dataset="ade20k" \
  --checkpoint_dir="datasets/ADE20K/deeplabv3_xception_ade20k_train" \
  --eval_logdir="datasets/ADE20K/exp/v3plus/eval_ori" \
  --dataset_dir="datasets/ADE20K/tfrecord" \
  --max_number_of_evaluations=1 \
  --eval_scales=0.5 \
  --eval_scales=0.75 \
  --eval_scales=1.0 \
  --eval_scales=1.25 \
  --eval_scales=1.5 \
  --eval_scales=1.75 \
  --add_flipped_images=true
By the way, the pretrained models for pascal and cityscapes work well. Could someone help me verify the performance or give me some advice?

@hakS07

hakS07 commented Jul 15, 2019

@apolo74

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:
Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
Hope this helps
/B

Thanks to your descriptive comment, I was able to train deeplab successfully on my custom dataset (14000 images).
After 20000 iterations I tested the trained model with Python code and it detects fine, but when I put the model in an iOS application (after converting it to a tflite model), it gives bad and wrong segmentation.
Do you have any idea about using a deeplab mobilenet trained model on mobile?

@ma8tsch

ma8tsch commented Aug 21, 2019

Hey guys,
does anyone know how one can freeze layers for training? Say I want to freeze the weights of the backbone and only train the rest. Is that possible?

I would really appreciate some help on this matter. Thanks in advance

@lattard

lattard commented Oct 23, 2019

@ma8tsch did you manage to freeze some layers eventually ? If yes, can you pls provide some details ?

@lolitsgab

lolitsgab commented Oct 29, 2019

Hi guys, sorry I've been disconnected from this thread... the black output is related to 2 very important settings:

  1. Assuming that you are re-training on your own data that, for example, has 2 classes... in my toy case I mentioned I created a dataset with circles and squares. Then I have 2 classes BUT the parameter called "--num-classes" should be 4 because: 2 (own classes) + 1 (background) + 1 (ignore_label)
  2. The pixel values in your "background" class are supposed to be 0, pixel values for your first class should be 1, for your second class should be 2 and so on... DON'T save your classes with other values like 100 or 224, you have to save your class images following the order from 1 to N
    Hope this helps
    /B

Thank you for your help :) setting

--initialize_last_layer=False\
--last_layers_contain_logits_only=True

allowed me to no longer have all black masks. I am now getting color-spotted, though not yet acceptable, masks after only 100 steps.

To phrase what you said more clearly (for me at least), you are saying that images should be labeled with only values from 1...N where N is the number of classes, and 0 is reserved for background, and possibly even N+1 because of the ignore label (I am not utilizing this).

In other words, if you have 2 classes (circle and triangle), you will have 4 labels/indexes in your image.

  • index 0 = bg
  • index 1 = class1, say circle
  • index 2 = class2, say triangle
  • index 3 (which by default in the other datasets is 255 instead of 3) = IGNORE_LABEL

How can I confirm that this is the case for my dataset?
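One way to check this, as a minimal sketch: list the distinct pixel values in each annotation and compare them with the indices described above. The function name `unexpected_label_values` and the toy arrays are made up for illustration, and the ignore value 255 is taken from the comment above, not from any DeepLab API.

```python
import numpy as np


def unexpected_label_values(label, num_foreground_classes, ignore_label=255):
    """Return the pixel values in `label` that are not valid class indices.

    Valid values are 0 (background), 1..num_foreground_classes, and
    `ignore_label`. Everything else is reported.
    """
    allowed = set(range(num_foreground_classes + 1)) | {ignore_label}
    found = set(np.unique(label).tolist())
    return sorted(found - allowed)


# Toy annotations for a 2-class dataset (circle = 1, triangle = 2).
good = np.array([[0, 1, 1], [0, 2, 2], [255, 0, 0]], dtype=np.uint8)
bad = np.array([[0, 100, 100], [0, 224, 224], [0, 0, 0]], dtype=np.uint8)

print(unexpected_label_values(good, 2))  # []
print(unexpected_label_values(bad, 2))   # [100, 224]
```

For a real annotation file, the array could be loaded with something like `np.array(PIL.Image.open(path))`, assuming Pillow is installed. If the result has three channels (H x W x 3), the file is stored as RGB colors and not as class indices.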

I'll report back tomorrow after 10,000 steps to confirm.

@lolitsgab

lolitsgab commented Oct 29, 2019

How did y'all color index your images? It seems that my images ARE color indexed as @apolo74 specified.

Here is what my model got after 10000 steps:
[image: non-seg]
[image: seg]

This is what a color indexed image looks like in my dataset (not from same picture as above):
[image: 000017]

Any possible help?

@Etheryramirezrs

Hi, I am trying to run DeepLab on my own dataset, but I get an error when I run train.py. It is related to the number of classes: I have 5, but apparently the program is expecting 21, the number of classes in the VOC dataset:
Assign requires shapes of both tensors to match. lhs shape= [5] rhs shape= [21]

@JinyuanShao

@aquariusjay
Hi there~, about the class imbalance problem: the new version of DeepLab's train_utils.py seems to have changed the code, so maybe I can't add variables like label1_weight = to fix the class imbalance problem.
Could you give me some advice on how to edit the code to add class weights?
Thank you very much.

@PallawiSinghal

PallawiSinghal commented Dec 11, 2019

When you want to fine-tune DeepLab on other datasets, there are a few cases:

  1. You want to re-use ALL the trained weights: set initialize_last_layer = True (last_layers_contain_logits_only does not matter in this case).
  2. You want to re-use ONLY the network backbone (i.e., exclude ASPP, decoder and so on): set initialize_last_layer = False and last_layers_contain_logits_only = False.
  3. You want to re-use ALL the trained weights EXCEPT the logits (since the num_classes may be different): set initialize_last_layer = False and last_layers_contain_logits_only = True.
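The three cases above can be summarized as a small decision function. This is only a sketch of the logic as described in the list, not the actual DeepLab code: the function name `scopes_not_restored` is made up, and the scope names are assumptions used for illustration.

```python
# Scope names are assumptions for illustration; check them against your
# own checkpoint and model code before relying on them.
LOGITS_SCOPE = 'logits'
OTHER_LAST_LAYER_SCOPES = ['image_pooling', 'aspp', 'concat_projection',
                           'decoder']


def scopes_not_restored(initialize_last_layer,
                        last_layers_contain_logits_only):
    """Return the variable scopes skipped when loading the checkpoint."""
    exclude = ['global_step']
    if not initialize_last_layer:
        # Cases 2 and 3: the logits are always re-initialized.
        exclude.append(LOGITS_SCOPE)
        if not last_layers_contain_logits_only:
            # Case 2: only the backbone is kept.
            exclude.extend(OTHER_LAST_LAYER_SCOPES)
    return exclude


print(scopes_not_restored(True, False))   # case 1
print(scopes_not_restored(False, False))  # case 2
print(scopes_not_restored(False, True))   # case 3
```

When num_classes differs from the checkpoint's, case 3 is the one that avoids restoring logits of the wrong shape.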

Hi, my loss does not change; it has become stagnant. I have tried everything related to DeepLabv3+ that I could find on blogs.
I am training to detect roads. My images are 2000x2000.
My training data has 45k images.
I have created my dataset in the PASCAL VOC format. I have three kinds of pixels:
background = [0,0,0]
void class = [255,255,255]
road = [1,1,1]
So the number of classes = 3.
I am using PASCAL VOC pre-trained weights.

Changes in train_utils.py are:

1.

    ignore_weight = 0
    label0_weight = 10
    label1_weight = 15
    not_ignore_mask = (
        tf.to_float(tf.equal(scaled_labels, 1)) * label0_weight
        + tf.to_float(tf.equal(scaled_labels, 2)) * label1_weight
        + tf.to_float(tf.equal(scaled_labels, ignore_label)) * ignore_weight)

2.

    # Variables that will not be restored.
    exclude_list = ['global_step', 'logits']
    if not initialize_last_layer:
        exclude_list.extend(last_layers)
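To see which weight each label value receives under the weighting expression above, here is a NumPy re-statement of it. This is a sketch only: it mirrors the posted expression and the label encoding described in the post (0 = background, 1 = road, 255 = void), and `pixel_weights` is a made-up helper name.

```python
import numpy as np

# Values copied from the snippet above; 255 is the void value from the post.
ignore_label = 255
ignore_weight, label0_weight, label1_weight = 0, 10, 15


def pixel_weights(labels):
    """NumPy version of the posted not_ignore_mask expression."""
    labels = np.asarray(labels)
    return ((labels == 1).astype(np.float32) * label0_weight
            + (labels == 2).astype(np.float32) * label1_weight
            + (labels == ignore_label).astype(np.float32) * ignore_weight)


# One pixel of each possible value, plus the unused value 2.
print(pixel_weights([0, 1, 2, 255]).tolist())  # [0.0, 10.0, 15.0, 0.0]
```

Under these assumptions, pixels labeled 0 (background) get weight 0, and label0_weight is applied to label 1 (road). That may be worth double-checking, since a loss that ignores the background can go flat if the model predicts road everywhere.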

My train.py command:

nohup python deeplab/train.py \
  --logtostderr \
  --training_number_of_steps=65000 \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_batch_size=2 \
  --initialize_last_layer=False \
  --last_layers_contain_logits_only=True \
  --dataset="pascal_voc_seg" \
  --tf_initial_checkpoint="/data/old_model/models/research/deeplabv3_pascal_trainval/model.ckpt" \
  --train_logdir="/data/old_model/models/research/deeplab/mycheckpoints" \
  --dataset_dir="/data/models/research/deeplab/datasets/tfrecord" > my_output.log &

Please help 👍
INFO:tensorflow:global step 700: loss = 0.1759 (0.449 sec/step)
INFO:tensorflow:global step 710: loss = 0.1695 (0.655 sec/step)
INFO:tensorflow:global step 720: loss = 0.1742 (0.689 sec/step)
INFO:tensorflow:global step 730: loss = 0.1710 (0.505 sec/step)
INFO:tensorflow:global step 740: loss = 0.1708 (0.868 sec/step)
INFO:tensorflow:global step 750: loss = 0.1683 (0.632 sec/step)
INFO:tensorflow:global step 760: loss = 0.1692 (0.442 sec/step)
INFO:tensorflow:global step 770: loss = 0.1693 (0.597 sec/step)
INFO:tensorflow:global step 780: loss = 0.1665 (0.441 sec/step)
INFO:tensorflow:global step 790: loss = 0.1680 (0.548 sec/step)
INFO:tensorflow:global step 800: loss = 0.1708 (0.372 sec/step)
INFO:tensorflow:global step 810: loss = 0.1674 (0.327 sec/step)
INFO:tensorflow:global step 820: loss = 0.1666 (0.951 sec/step)
INFO:tensorflow:global step 830: loss = 0.1651 (0.557 sec/step)
INFO:tensorflow:global step 840: loss = 0.1663 (0.506 sec/step)
INFO:tensorflow:global step 850: loss = 0.1646 (0.446 sec/step)
INFO:tensorflow:global step 860: loss = 0.1666 (0.424 sec/step)
INFO:tensorflow:global step 870: loss = 0.1654 (0.520 sec/step)
INFO:tensorflow:global step 880: loss = 0.1662 (0.675 sec/step)
INFO:tensorflow:global step 890: loss = 0.1673 (0.325 sec/step)
INFO:tensorflow:global step 900: loss = 0.1633 (0.548 sec/step)
INFO:tensorflow:global step 910: loss = 0.1659 (0.374 sec/step)
INFO:tensorflow:global step 920: loss = 0.1639 (0.663 sec/step)
INFO:tensorflow:global step 930: loss = 0.1658 (0.442 sec/step)
INFO:tensorflow:global step 940: loss = 0.1654 (0.568 sec/step)
.
.
.
INFO:tensorflow:global step 17850: loss = 0.1416 (0.555 sec/step)
INFO:tensorflow:global step 17860: loss = 0.1417 (0.684 sec/step)
INFO:tensorflow:global step 17870: loss = 0.1415 (0.572 sec/step)
INFO:tensorflow:global step 17880: loss = 0.1417 (0.569 sec/step)
INFO:tensorflow:global step 17890: loss = 0.1415 (0.535 sec/step)
INFO:tensorflow:global step 17900: loss = 0.1415 (0.541 sec/step)
INFO:tensorflow:global step 17910: loss = 0.1419 (0.459 sec/step)
INFO:tensorflow:global step 17920: loss = 0.1415 (0.800 sec/step)
INFO:tensorflow:global step 17930: loss = 0.1417 (0.647 sec/step)
INFO:tensorflow:global step 17940: loss = 0.1416 (0.509 sec/step)
INFO:tensorflow:global step 17950: loss = 0.1416 (0.755 sec/step)
INFO:tensorflow:global step 17960: loss = 0.1417 (0.495 sec/step)
INFO:tensorflow:global step 17970: loss = 0.1419 (0.556 sec/step)
INFO:tensorflow:global step 17980: loss = 0.1417 (0.492 sec/step)
INFO:tensorflow:global step 17990: loss = 0.1416 (0.878 sec/step)
INFO:tensorflow:global step 18000: loss = 0.1415 (0.803 sec/step)
INFO:tensorflow:global step 18010: loss = 0.1418 (0.695 sec/step)
INFO:tensorflow:global step 18020: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18030: loss = 0.1415 (0.678 sec/step)
INFO:tensorflow:global step 18040: loss = 0.1418 (0.449 sec/step)
INFO:tensorflow:global step 18050: loss = 0.1415 (0.681 sec/step)
INFO:tensorflow:global step 18060: loss = 0.1415 (0.866 sec/step)
INFO:tensorflow:global step 18070: loss = 0.1417 (0.534 sec/step)
INFO:tensorflow:global step 18080: loss = 0.1415 (0.939 sec/step)
INFO:tensorflow:global step 18090: loss = 0.1416 (0.349 sec/step)
INFO:tensorflow:global step 18100: loss = 0.1416 (0.576 sec/step)
INFO:tensorflow:global step 18110: loss = 0.1416 (0.626 sec/step)
INFO:tensorflow:global step 18120: loss = 0.1418 (0.951 sec/step)
INFO:tensorflow:global step 18130: loss = 0.1417 (0.386 sec/step)
INFO:tensorflow:global step 18140: loss = 0.1417 (0.375 sec/step)
@aquariusjay

@PallawiSinghal

As I do not have access to your dataset, and it usually takes experimental experience to tune the hyper-parameters.

@aquariusjay Hi, may I know how we can quantify our dataset to find these values?
ignore_weight
label0_weight
label1_weight
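One common starting point, offered here only as a sketch and not as something DeepLab prescribes, is to count how many pixels each class has across the annotations and derive weights from the inverse frequencies (median-frequency balancing). The function name `inverse_frequency_weights` and the toy annotation are made up for illustration. The resulting values would still need tuning by experiment.

```python
import numpy as np


def inverse_frequency_weights(label_arrays, num_classes, ignore_label=255):
    """Count pixels per class and derive median-frequency weights.

    Pixels equal to `ignore_label` are skipped. Values >= num_classes are
    silently dropped, so validate the annotations first.
    """
    counts = np.zeros(num_classes, dtype=np.int64)
    for label in label_arrays:
        label = np.asarray(label)
        valid = label[label != ignore_label].astype(np.int64)
        counts += np.bincount(valid, minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    present = freq > 0
    weights = np.zeros(num_classes)
    # weight_c = median(freq) / freq_c, so rare classes get larger weights.
    weights[present] = np.median(freq[present]) / freq[present]
    return counts, weights


# Toy annotation: 5 background pixels, 2 road pixels, 1 ignored pixel.
toy = [np.array([[0, 0, 0, 1], [0, 0, 255, 1]], dtype=np.uint8)]
counts, weights = inverse_frequency_weights(toy, num_classes=2)
print(counts.tolist())  # [5, 2]
print(weights)          # the rarer class (road) gets the larger weight
```

In this scheme the ignore weight stays at 0, and only the per-class weights are derived from the data.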

@HanChen-HUST

@PallawiSinghal did you solve it? I also want to change the loss_weight.

@HanChen-HUST

@jinyuan30 did you solve it? I also want to change the loss_weight.

@HanChen-HUST

@aquariusjay Hi there~, about the problem of classes imbalance, new version train_utils.py of deeplab seems to change the code, so maybe I can't add variables like label1_weight = to fix classes imbalance problem.
Could you give me some advice to edit code to add weights of classes?
Thank you very much.

@LightingX

@aquariusjay Hi there~, about the problem of classes imbalance, new version train_utils.py of deeplab seems to change the code, so maybe I can't add variables like label1_weight = to fix classes imbalance problem.
Could you give me some advice to edit code to add weights of classes?
Thank you very much.

Hello, it seems that I have run into the same problem. Have you solved it yet?

@Alive1024

@LightingX Hi, friend! Have you figured out how to adjust the loss weight in the new version of train_utils.py?
I tried to change label_weights from None to a Python list in common.py, but I got a ValueError: Subscripts with ellipses are not yet supported

@claudiourbina

@aquariusjay Hi there~, about the problem of classes imbalance, new version train_utils.py of deeplab seems to change the code, so maybe I can't add variables like label1_weight = to fix classes imbalance problem.
Could you give me some advice to edit code to add weights of classes?
Thank you very much.

Did you solve it? I have the same problem now :/.

@LightingX

@Alive1024 @claudiourbina Hey guys, in the latest version it seems we can adjust the weights through a parameter. When training, add the label_weights param to the list of training params. For example, if I have 2 classes and their weights are 0.01 and 1, I can add this to the training params:

--label_weights={0.01,1.0}
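In a shell with brace expansion such as bash, the argument above expands to `--label_weights=0.01 --label_weights=1.0`, i.e. one value per label index. As a rough sketch of what such per-label weights mean for the loss, assuming weight i applies to pixels with label i and ignored pixels get weight 0, the per-pixel weight map could be computed like this in NumPy. The helper name `per_pixel_weights` and the ignore value 255 are assumptions for illustration, not the DeepLab implementation.

```python
import numpy as np


def per_pixel_weights(labels, label_weights, ignore_label=255):
    """Look up one loss weight per pixel from a per-label weight list."""
    labels = np.asarray(labels)
    label_weights = np.asarray(label_weights, dtype=float)
    not_ignored = labels != ignore_label
    # Map ignored pixels to index 0 so the lookup stays in range, then
    # zero them out with the mask.
    safe_labels = np.where(not_ignored, labels, 0)
    return label_weights[safe_labels] * not_ignored


labels = np.array([[0, 0, 1], [1, 255, 0]])
print(per_pixel_weights(labels, [0.01, 1.0]).tolist())
# [[0.01, 0.01, 1.0], [1.0, 0.0, 0.01]]
```

With weights 0.01 and 1.0, a pixel of class 1 counts 100 times as much as a pixel of class 0.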

@RostyslavBryiovskyi

RostyslavBryiovskyi commented Mar 29, 2021

@essalahsouad Hi! Did you solve the problem with black images? It is still an issue for me.
