
[Deeplab] mIOU increasing but training loss curve all over the place. #7361

Open
rajanieprabha opened this issue Aug 2, 2019 · 4 comments


rajanieprabha commented Aug 2, 2019

System information

  • What is the top-level directory of the model you are using: models/research/deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.13.1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 10.1
  • GPU model and memory:
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

(Screenshot from 2019-08-02 10-44-01: TensorBoard training loss curve)

I trained the xception_65 model for 150,000 iterations on a custom dataset. As training went on, the validation mIOU kept improving:

  • 50k iterations -> mIOU 0.95
  • 100k iterations -> mIOU 0.96
  • 150k iterations -> mIOU 0.965

But the training loss is all over the place. I don't understand why this is happening. Any leads?

Source code / logs

NUM_ITERATIONS=150000
python3 "${WORK_DIR}"/train.py
--logtostderr
--train_split="trainval"
--model_variant="xception_65"
--atrous_rates=6
--atrous_rates=12
--atrous_rates=18
--output_stride=8
--decoder_output_stride=4
--train_crop_size="321,321"
--dataset="lake"
--train_batch_size=8
--training_number_of_steps="${NUM_ITERATIONS}"
--fine_tune_batch_norm=false
--train_logdir="${TRAIN_LOGDIR}"
--base_learning_rate=0.0001
--learning_policy="poly"
--tf_initial_checkpoint="training from pascal VOC pretrained checkpoint"
--dataset_dir="${DATASET}"

rajanieprabha changed the title from "mIOU increasing but training loss curve all over the place." to "[Deeplab] mIOU increasing but training loss curve all over the place." on Aug 2, 2019

rs220122 commented Oct 8, 2019

You can see the predicted labels on TensorBoard if you set --save_summaries_images.
This should help you track down the problem.
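
For example (a minimal sketch: TRAIN_FLAGS is just a stand-in for the flags already listed in the issue, and it assumes train.py also exposes --save_summaries_secs; only the last two flags are new):

# Re-run training with image summaries enabled; TensorBoard's Images tab
# will then show sample inputs, ground-truth labels, and predicted labels.
python3 "${WORK_DIR}"/train.py \
  "${TRAIN_FLAGS[@]}" \
  --save_summaries_images=true \
  --save_summaries_secs=300

# Point TensorBoard at the training log directory to inspect the summaries.
tensorboard --logdir="${TRAIN_LOGDIR}"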

@rajanieprabha (Author)

Yes, I have been monitoring them, and the predictions on the training data look fine. Is it normal for the loss to behave this way? Maybe it's because of the small batch size?

@QuinPilot

I'm running into the same problem. Have you solved it?

@Vaishali-Nimilan

Is there a way to see a validation curve in DeepLabv3 to check whether the model is overfitting?
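
One way I can think of (a sketch based on the eval.py script under models/research/deeplab; the eval split, crop size, and log directory here are assumptions about the custom dataset) would be to run a separate, continuous evaluation job against the training checkpoints and point a second TensorBoard at its log directory, so the validation mIOU shows up as its own curve:

# Continuously evaluate the checkpoints written by train.py; a non-positive
# --max_number_of_evaluations keeps the job polling for new checkpoints.
python3 "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --eval_crop_size="513,513" \
  --dataset="lake" \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --eval_logdir="${EVAL_LOGDIR}" \
  --dataset_dir="${DATASET}" \
  --max_number_of_evaluations=0

# The validation mIOU is written as a scalar summary under this directory.
tensorboard --logdir="${EVAL_LOGDIR}"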

ravikyram added the models:research (models that come under the research directory) and type:support labels on Jun 24, 2020