Skip to content

Segmentation fault when training DeepLab segmentation model #8137

@monocongo

Description

@monocongo

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): conda install tensorflow=1
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.6.7

Describe the current behavior
I am trying to train the DeepLab model following the directions in this DeepLab example using the following command:

$ python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=30000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size="513,513" \
    --train_batch_size=1 \
    --dataset="basins" \
    --tf_initial_checkpoint=/home/james/deeplab/pretrained/x65-b2u1s2p-d48-2-3x256-sc-cr300k_init.ckpt.data-00000-of-00001 \
    --train_logdir=./deeplab/datasets/basins/exp/train_on_train_set/train \
    --dataset_dir=./deeplab/datasets/basins

I get the following output/error:

<thousands of useless/confusing warning messages>
...
2020-01-25 16:03:48.084849: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2020-01-25 16:03:48.086476: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x559e7db9e710 executing computations on platform Host. Devices:
2020-01-25 16:03:48.086916: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./deeplab/datasets/basins/exp/train_on_train_set/train/model.ckpt
INFO:tensorflow:Starting Queues.
Segmentation fault (core dumped)

Describe the expected behavior
I am hoping that the model will train as advertised.

Code to reproduce the issue
Run a DeepLab model training as described in the DeepLab tutorial documentation referenced above.

BTW I have had this happen on two separate machines, both of which are running Ubuntu 18.04. One is a Dell laptop with CPU and the other is an AWS EC2 instance with T4 GPU. On the EC2 instance the TF version installed is 1.15.0 and I also tried using TensorFlow 2.0 but when I did that the code failed immediately with a message indicating that tf.contrib is no longer included in TensorFlow so my assumption is that this code has not been ported to work for TF2. Please advise if there is a known version of TF that this code works with, it appears to be broken in its current state using the versions of TF that I've tried.

Thanks in advance for any insight or suggestions. And/or if there is a more up-to-date semantic segmentation model from TensorFlow other than DeepLab then please point me in the right direction.

Metadata

Metadata

Labels

models:researchmodels that come under research directorytype:bugBug in the code

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions