Unable to train Custom model #74

Closed
jllarraz opened this issue Oct 2, 2019 · 17 comments
Labels
training Training Related Questions

Comments

@jllarraz

jllarraz commented Oct 2, 2019

Hi,

I have been trying to train a custom model with 2 classes, and I have modified the training script to do transfer learning from the pre-trained YOLO model.
This is my train.py script (see the attached file train.txt).

Unfortunately I can't make it work; it always fails with:
WARNING:tensorflow:Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: lr
W1002 12:07:19.268314 4404483520 callbacks.py:1824] Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: lr
WARNING:tensorflow:Early stopping conditioned on metric val_loss which is not available. Available metrics are:
W1002 12:07:19.268509 4404483520 callbacks.py:1250] Early stopping conditioned on metric val_loss which is not available. Available metrics are:
1/Unknown - 8s 8s/step
Traceback (most recent call last):
  File "/Users/t230418/Downloads/TensorFlow2/train.py", line 187, in <module>
    app.run(main)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/t230418/Downloads/TensorFlow2/train.py", line 182, in main
    validation_data=val_dataset)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 73, in distributed_function
    per_replica_function, args=(model, x, y, sample_weights))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 760, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1787, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2132, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/autograph/impl/api.py", line 258, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 311, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 252, in _process_single_batch
    training=training))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 166, in _model_loss
    per_sample_losses = loss_fn.call(targets[i], outs[i])
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/losses.py", line 221, in call
    return self.fn(y_true, y_pred, **self._fn_kwargs)
  File "/Users/t230418/Downloads/TensorFlow2/yolov3_tf2/models.py", line 304, in yolo_loss
    true_class_idx, pred_class)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/losses.py", line 978, in sparse_categorical_crossentropy
    y_true, y_pred, from_logits=from_logits, axis=axis)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 4549, in sparse_categorical_crossentropy
    labels=target, logits=output)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 3477, in sparse_softmax_cross_entropy_with_logits_v2
    labels=labels, logits=logits, name=name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 3397, in sparse_softmax_cross_entropy_with_logits
    precise_logits, labels, name=name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 11838, in sparse_softmax_cross_entropy_with_logits
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of -1 which is outside the valid range of [0, 2). Label valuesp:SparseSoftmaxCrossEntropyWithLogits]
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-6
W1002 12:07:19.918585 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-6
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-7
W1002 12:07:19.918745 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-7
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
W1002 12:07:19.918799 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-8
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-6.arguments
W1002 12:07:19.918851 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-6.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-6._variable_dict
W1002 12:07:19.918897 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-6._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-6._trainable_weights
W1002 12:07:19.918942 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-6._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-6._non_trainable_weights
W1002 12:07:19.918987 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-6._non_trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-7.arguments
W1002 12:07:19.919031 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-7.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-7._variable_dict
W1002 12:07:19.919076 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-7._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-7._trainable_weights
W1002 12:07:19.919139 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-7._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-7._non_trainable_weights
W1002 12:07:19.919196 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-7._non_trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8.arguments
W1002 12:07:19.919239 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-8.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._variable_dict
W1002 12:07:19.919282 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-8._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._trainable_weights
W1002 12:07:19.919326 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-8._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._non_trainable_weights
W1002 12:07:19.919369 4404483520 util.py:144] Unresolved object in checkpoint: (root).layer-8._non_trainable_weights
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/alpha/guide/checkpoints#loading_mechanics for details.
W1002 12:07:19.919421 4404483520 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/alpha/guide/checkpoints#loading_mechanics for details.

Any ideas?

@AnaRhisT94

We'll need more info than that.
How did you create the tfrecord files for those two classes?

@jllarraz
Author

jllarraz commented Oct 2, 2019 via email

@jllarraz
Author

jllarraz commented Oct 2, 2019

This is the script that I am using
rectlabel_create_pascal_tf_record.txt

@AnaRhisT94

AnaRhisT94 commented Oct 2, 2019

You haven't used the transfer flag here which is

flags.DEFINE_enum('transfer', 'none',
                  ['none', 'darknet', 'no_output', 'frozen', 'fine_tune'],
                  'none: Training from scratch, '
                  'darknet: Transfer darknet, '
                  'no_output: Transfer all but output, '
                  'frozen: Transfer and freeze all, '
                  'fine_tune: Transfer all and freeze darknet only')

Try to use:

flags.DEFINE_enum('transfer', 'no_output',
                  ['none', 'darknet', 'no_output', 'frozen', 'fine_tune'],
                  'none: Training from scratch, '
                  'darknet: Transfer darknet, '
                  'no_output: Transfer all but output, '
                  'frozen: Transfer and freeze all, '
                  'fine_tune: Transfer all and freeze darknet only')

Then use classes=80, and just filter out the classes you don't need at inference time.
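Filtering at inference time could look roughly like the sketch below. It is not code from this repository: `filter_detections`, the data layout (parallel lists of boxes, scores and class indices), and the example values are all assumptions made for illustration.

```python
from typing import List, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[Box, float, int]


def filter_detections(
    boxes: Sequence[Box],
    scores: Sequence[float],
    class_ids: Sequence[int],
    keep: Set[int],
) -> List[Detection]:
    """Keep only the detections whose class index is in `keep`."""
    return [
        (box, score, class_id)
        for box, score, class_id in zip(boxes, scores, class_ids)
        if class_id in keep
    ]


# Hypothetical detections from an 80-class model (all values are made up).
boxes = [(0.1, 0.1, 0.4, 0.5), (0.2, 0.3, 0.6, 0.9), (0.5, 0.5, 0.7, 0.8)]
scores = [0.92, 0.81, 0.40]
class_ids = [0, 16, 2]

# Keep only the two class indices we care about.
filtered = filter_detections(boxes, scores, class_ids, keep={0, 2})
kept_ids = [class_id for _, _, class_id in filtered]
print(kept_ids)  # [0, 2]
```

The indices in `keep` would have to match the line numbers of the wanted classes in whichever names file the model was trained with.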

I'm also struggling with training for 2 classes myself. I'll be trying this https://github.com/YunYang1994/tensorflow-yolov3 implementation soon.

@jllarraz
Author

jllarraz commented Oct 2, 2019

I pass transfer as a parameter. As far as I know, that bit of code just defines the possible values of the transfer flag, and if you don't specify one it uses none as the default (or no_output in your example).

This is how I use the script (just copied from the website):
python train.py --batch_size 8 --dataset /mypath/train.tfRecord --val_dataset /mypath/val.tfRecord --classes /myPath/my_coco.names --epochs 10 --mode eager_fit --transfer fine_tune --weights ./checkpoints/yolov3-tiny.tf --tiny

And I also tried
python train.py --batch_size 8 --dataset /mypath/train.tfRecord --val_dataset /mypath/val.tfRecord --epochs 10 --mode eager_fit --transfer fine_tune --weights ./checkpoints/yolov3-tiny.tf --tiny

@jllarraz
Author

jllarraz commented Oct 2, 2019

I have created a sample tfrecord with just one sample, in case it helps to find the problem:
train_example.tfRecord.zip

@AnaRhisT94

The error says that a label value of -1 was received, and indeed we can see a few -1 values in the output. Take a look at them.

@jllarraz
Author

jllarraz commented Oct 2, 2019

That's why I am asking: I can't figure out why that happens.

@AnaRhisT94

AnaRhisT94 commented Oct 2, 2019

Debug the creation of the TFRecord file. Maybe it's because of:
class_id = -1

While debugging, check this:

class_id = getClassId(obj['name'], label_map_dict)
if class_id < 0:
    continue

See whether class_id is maybe a string, in which case you should compare with class_id < "0" instead.

@jllarraz
Author

jllarraz commented Oct 2, 2019

Thanks for the tip, but I had checked that and it's an int. Also, the class ID seems OK: it's always in range.

@AnaRhisT94

You're welcome.
If you don't mind, I'd also love some help:
#75

@Vitor050291

I am having the same trouble here

@DanielWicz

I am also having this trouble

@fabhau

fabhau commented Oct 9, 2019

I am also having this problem in combination with transfer learning.

@Kuz-man
Contributor

Kuz-man commented Nov 5, 2019

Basically, the labels are created in dataset.py, where each item in the tfrecord is passed to the parse_tfrecord method. These are the lines you're interested in:

class_text = tf.sparse.to_dense(
        x['image/object/class/text'], default_value='')
labels = tf.cast(class_table.lookup(class_text), tf.float32)

So, if a label under image/object/class/text does not exist in the file you pass to the --classes argument, the lookup returns the table's default value and the label will be -1, as assigned here:

class_table = tf.lookup.StaticHashTable(tf.lookup.TextFileInitializer(
        class_file, tf.string, 0, tf.int64, LINE_NUMBER, delimiter="\n"), -1)

In a nutshell, make sure your labels in the tfrecord match those inside the --classes argument (which defaults to ./data/coco.names). In one of your calls above, you're leaving the argument empty and in the other you're using: --classes /myPath/my_coco.names.
Also, if your tfrecord already has class/label inside, you might want to use those instead of looking up.
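To see which labels would end up as -1 before training, a check along these lines may help. This is a pure-Python stand-in for the hash table lookup above, not code from dataset.py; the function names, the class list, and the example labels are hypothetical.

```python
from typing import Dict, Iterable, List


def build_class_table(names: Iterable[str]) -> Dict[str, int]:
    """Map each class name to its line number, like the names-file lookup."""
    return {name: index for index, name in enumerate(names)}


def lookup(table: Dict[str, int], label: str, default: int = -1) -> int:
    """Return the class index, or the default (-1) for an unknown label."""
    return table.get(label, default)


def find_unknown_labels(table: Dict[str, int], labels: Iterable[str]) -> List[str]:
    """List the distinct labels that would be mapped to -1."""
    return sorted({label for label in labels if lookup(table, label) < 0})


# Hypothetical contents of the file passed to --classes (one name per line).
class_names = ["cat", "dog"]
# Hypothetical strings stored under image/object/class/text in the tfrecord.
record_labels = ["cat", "Dog", "dog ", "dog"]

table = build_class_table(class_names)
ids = [lookup(table, label) for label in record_labels]
unknown = find_unknown_labels(table, record_labels)
print(ids)      # [0, -1, -1, 1]
print(unknown)  # ['Dog', 'dog ']
```

The match is exact, so a difference in case or a trailing space is enough to produce -1.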

@zzh8829 zzh8829 added the training Training Related Questions label Dec 20, 2019
@zzh8829
Owner

zzh8829 commented Dec 21, 2019

See this tutorial on custom training https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md

@zzh8829 zzh8829 closed this as completed Dec 21, 2019
@escudero

I don't know if it was your case.
It happened to me because I was using accented labels.
I removed the accents and it worked correctly.
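I can't say for sure why accents broke the lookup. One plausible cause (an assumption, not something confirmed in this thread) is that the same accented label was stored in two different Unicode forms, so the strings did not compare equal. A minimal sketch of that effect and of stripping accents with the standard library, where `strip_accents` is a hypothetical helper:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose the text and drop the combining accent marks."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


composed = "caf\u00e9"      # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (NFD form)

# The two labels look identical on screen but are different strings.
print(composed == decomposed)  # False

# After stripping accents, both collapse to the same plain label.
print(strip_accents(composed) == strip_accents(decomposed))  # True
```

If you go this way, apply the same cleanup to both the names file and the labels written to the tfrecord.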
