
[Deeplab] mIOU increasing but training loss curve all over the place. #7361

Open
rajanieprabha opened this issue Aug 2, 2019 · 4 comments


rajanieprabha commented Aug 2, 2019

System information

  • What is the top-level directory of the model you are using: models/research/deeplab
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 1.13.1
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: 10.1
  • GPU model and memory:
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

(Screenshot from 2019-08-02 10-44-01: TensorBoard training loss curve)

I trained the xception_65 model for 150,000 iterations on a custom dataset. As training went on, the validation mIOU kept improving:

  • 50k iterations -> mIOU 0.95
  • 100k iterations -> mIOU 0.96
  • 150k iterations -> mIOU 0.965

But the training loss is all over the place. I don't understand why this is happening. Any leads?

Source code / logs

NUM_ITERATIONS=150000
python3 "${WORK_DIR}"/train.py
--logtostderr
--train_split="trainval"
--model_variant="xception_65"
--atrous_rates=6
--atrous_rates=12
--atrous_rates=18
--output_stride=8
--decoder_output_stride=4
--train_crop_size="321,321"
--dataset="lake"
--train_batch_size=8
--training_number_of_steps="${NUM_ITERATIONS}"
--fine_tune_batch_norm=false
--train_logdir="${TRAIN_LOGDIR}"
--base_learning_rate=0.0001
--learning_policy="poly"
--tf_initial_checkpoint="training from pascal VOC pretrained checkpoint"
--dataset_dir="${DATASET}"

rajanieprabha changed the title from "mIOU increasing but training loss curve all over the place." to "[Deeplab] mIOU increasing but training loss curve all over the place." on Aug 2, 2019

rs220122 commented Oct 8, 2019

You can see the predicted labels on TensorBoard if you set --save_summaries_images.
This should help you track down the problem.
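
For example (a minimal sketch: TRAIN_FLAGS is just a stand-in for the flags already listed in the issue, and it assumes train.py also exposes --save_summaries_secs; only the last two flags are new):

# Re-run training with image summaries enabled; TensorBoard's Images tab
# will then show sample inputs, ground-truth labels, and predicted labels.
python3 "${WORK_DIR}"/train.py \
  "${TRAIN_FLAGS[@]}" \
  --save_summaries_images=true \
  --save_summaries_secs=300

# Point TensorBoard at the training log directory to inspect the summaries.
tensorboard --logdir="${TRAIN_LOGDIR}"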

@rajanieprabha (Author)

Yes, I have been monitoring them, and the predictions on the training data look fine. Is it normal for the loss to behave this way? Maybe it's because of the small batch size?

@QuinPilot

I'm running into the same problem. Have you solved it?

@Vaishali-Nimilan

Is there a way to see a validation curve in DeepLabv3 to check whether the model is overfitting?
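
One way I can think of (a sketch based on the eval.py script under models/research/deeplab; the eval split, crop size, and log directory here are assumptions about the custom dataset) would be to run a separate, continuous evaluation job against the training checkpoints and point a second TensorBoard at its log directory, so the validation mIOU shows up as its own curve:

# Continuously evaluate the checkpoints written by train.py; a non-positive
# --max_number_of_evaluations keeps the job polling for new checkpoints.
python3 "${WORK_DIR}"/eval.py \
  --logtostderr \
  --eval_split="val" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --eval_crop_size="513,513" \
  --dataset="lake" \
  --checkpoint_dir="${TRAIN_LOGDIR}" \
  --eval_logdir="${EVAL_LOGDIR}" \
  --dataset_dir="${DATASET}" \
  --max_number_of_evaluations=0

# The validation mIOU is written as a scalar summary under this directory.
tensorboard --logdir="${EVAL_LOGDIR}"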

ravikyram added the models:research (models that come under the research directory) and type:support labels on Jun 24, 2020