[Deeplab] mIOU increasing but training loss curve all over the place.
System information
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the problem
I trained the xception_65 model for 150000 iterations on a custom dataset. As training went on, the validation mIOU kept improving:
50k iterations  -> mIOU 0.95
100k iterations -> mIOU 0.96
150k iterations -> mIOU 0.965
But the training loss is all over the place, and I don't understand why this is happening. Any leads?
Source code / logs
NUM_ITERATIONS=150000
python3 "${WORK_DIR}"/train.py \
  --logtostderr \
  --train_split="trainval" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=8 \
  --decoder_output_stride=4 \
  --train_crop_size="321,321" \
  --dataset="lake" \
  --train_batch_size=8 \
  --training_number_of_steps="${NUM_ITERATIONS}" \
  --fine_tune_batch_norm=false \
  --train_logdir="${TRAIN_LOGDIR}" \
  --base_learning_rate=0.0001 \
  --learning_policy="poly" \
  --tf_initial_checkpoint="training from pascal VOC pretrained checkpoint" \
  --dataset_dir="${DATASET}"
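To check whether the noisy curve still trends downward, one option is to pull the logged loss scalars out of ${TRAIN_LOGDIR} and smooth them, similar to what TensorBoard's smoothing slider does. A minimal sketch, assuming TF 1.x and that the scalar tag is total_loss (the tag name is an assumption; verify what train.py actually logs in TensorBoard):

import glob
import tensorflow as tf  # TF 1.x, matching the DeepLab training scripts

LOGDIR = "/path/to/train_logdir"  # assumption: point this at ${TRAIN_LOGDIR}
TAG = "total_loss"                # assumption: check the actual scalar tag in TensorBoard

# Collect (step, value) pairs for the chosen scalar from every event file.
steps, values = [], []
for event_file in sorted(glob.glob(LOGDIR + "/events.out.tfevents.*")):
    for event in tf.train.summary_iterator(event_file):
        for value in event.summary.value:
            if value.tag == TAG:
                steps.append(event.step)
                values.append(value.simple_value)

# Exponential moving average, roughly what TensorBoard's smoothing slider computes.
smoothed, ema = [], None
for v in values:
    ema = v if ema is None else 0.97 * ema + 0.03 * v
    smoothed.append(ema)

# Print every 100th point: raw loss vs. smoothed loss.
for step, raw, smooth in zip(steps[::100], values[::100], smoothed[::100]):
    print(step, round(raw, 4), round(smooth, 4))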
Yes, I have been monitoring them and they are fine. The predictions on the training data look fine too. Is it normal for the loss to behave this way? Maybe because of the small batch size?
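The small batch size would explain a lot of the scatter: the loss plotted at each step is the mean over only 8 crops, so its step-to-step standard deviation is roughly 1/sqrt(batch size) of the per-crop spread. A toy NumPy sketch of the effect (the per-sample loss distribution is made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
# Made-up per-crop loss distribution, only to illustrate the effect of batch size.
per_sample_loss = rng.gamma(shape=2.0, scale=0.05, size=200_000)

for batch_size in (8, 64):
    usable = (len(per_sample_loss) // batch_size) * batch_size
    batch_means = per_sample_loss[:usable].reshape(-1, batch_size).mean(axis=1)
    # Same mean loss, but the step-to-step scatter shrinks roughly as 1/sqrt(batch_size).
    print(batch_size, round(float(batch_means.mean()), 4), round(float(batch_means.std()), 4))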