BaseCollectiveExecuter::StartAbort Out of range: End of Sequence #104

Closed
muhammad-maaz-confiz opened this issue Nov 20, 2019 · 12 comments
Labels
training Training Related Questions

Comments

@muhammad-maaz-confiz

[Attached image: TF2 0TinyYolo_Training]

Hi,

While training Tiny YOLO on the VOC dataset, I get the error "BaseCollectiveExecutor::StartAbort Out of range: End of sequence" at the end of each epoch. Training also stops early after 4 epochs. The terminal output is attached. Note that I am using Ubuntu 18.04 with TF 2.0.

With eager mode training, everything works fine.
[Attached image: training_tf2 0]

@AnaRhisT94

I think it makes sense to call repeat() when creating your training and validation datasets.

@muhammad-maaz-confiz
Author

Hi @AnaRhisT94,

So when loading TFRecords, will using dataset.map(....).repeat() resolve this issue? Please explain in detail, as I am new to TensorFlow and want to understand this in depth.

Thanks

@AnaRhisT94

AnaRhisT94 commented Nov 20, 2019

Best thing is to try and see. Should work though :)
You can use it after batch or after shuffle, just add .repeat()
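
For reference, here is a minimal sketch of where `.repeat()` could go in a `tf.data` pipeline. The function name and the default `batch_size` and `shuffle_buffer` values are illustrative assumptions, not taken from the repository; `dataset` is assumed to be a `tf.data.Dataset` that already yields parsed examples.

```python
def make_repeating(dataset, batch_size=8, shuffle_buffer=512):
    """Shuffle, batch, then repeat a tf.data.Dataset.

    `dataset` is assumed to be a tf.data.Dataset. The default values
    are hypothetical and only serve as an illustration.
    """
    dataset = dataset.shuffle(buffer_size=shuffle_buffer)
    dataset = dataset.batch(batch_size)
    # repeat() with no argument restarts the dataset indefinitely, so the
    # iterator is not exhausted at the end of one pass over the data.
    return dataset.repeat()
```

Because the repeated dataset never ends, `model.fit()` then needs `steps_per_epoch` to know where one epoch stops.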

@NiklasWilson

NiklasWilson commented Nov 21, 2019

Another option is switching from "fit" to "eager_tf".

Ah just saw you said you did that first.

@NiklasWilson

NiklasWilson commented Nov 22, 2019

@muhammad-maaz-confiz Did adding the .repeat() resolve this for you? And have you been able to train a model that detects something (even inaccurately)?

I am having this exact same issue and have yet to train a functioning model.

@muhammad-maaz-confiz
Author

Hi @NiklasWilson ,

Adding .repeat() and passing steps_per_epoch to model.fit() solved the issue for me. Because of a busy schedule I have not been able to test the model yet. I will let you know about detection once I test it.

@NiklasWilson

NiklasWilson commented Nov 22, 2019

@muhammad-maaz-confiz Awesome! I will add that to my code tonight and test it myself too, while eagerly awaiting the results of your tests :)

@NiklasWilson

NiklasWilson commented Nov 22, 2019

Side note: the "early stops after 4 epochs" behavior can happen with or without these errors. It is related to this line of code:
ReduceLROnPlateau(verbose=1), EarlyStopping(patience=3, verbose=1),

You can increase the patience value to make premature stopping less likely.
I have changed mine to
ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1), EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)

The purpose of EarlyStopping is to halt training once the validation loss stops improving, so that model quality does not degrade.
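
To illustrate why `patience=3` can end training after 4 epochs, here is a simplified pure-Python model of the patience logic. It is a sketch of the idea, not Keras's actual implementation, and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of patience-based early stopping on a loss that
    should decrease. The inputs are hypothetical validation losses.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # The loss improved: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count this epoch against the patience budget.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical losses that never improve after the first epoch.
losses = [1.0, 1.2, 1.1, 1.3, 1.05]
print(stopping_epoch(losses, patience=3))   # stops at epoch 4
print(stopping_epoch(losses, patience=10))  # None: never stops in 5 epochs
```

If the validation loss does not improve after the first epoch, three non-improving epochs in a row use up the patience and training ends after epoch 4.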

@muhammad-maaz-confiz
Author

Thanks @NiklasWilson and @AnaRhisT94 ,

Repeating the dataset and adding steps_per_epoch in model.fit() seems to resolve the issue for the initial epochs, but the issue reappeared after a couple of epochs (3 epochs for me). What could have gone wrong? Also note that training automatically stops after 9 epochs, with no message regarding early stopping. Thanks
[Attached image: Tf20_End_fSequence]

@NiklasWilson

NiklasWilson commented Nov 23, 2019

@muhammad-maaz-confiz
This problem went away for me when I put the repeat here:
train_dataset = train_dataset.map(lambda x, y: ( dataset.transform_images(x, FLAGS.size), dataset.transform_targets(y, anchors, anchor_masks, 80))).repeat()
(only there, nowhere else)

I didn't put it here, but it makes sense that it would also need to go here (I'm adding that now):
val_dataset = val_dataset.map(lambda x, y: ( dataset.transform_images(x, FLAGS.size), dataset.transform_targets(y, anchors, anchor_masks, 80))).repeat()

Regarding "Also note that the training automatically stops after 9 epochs":

Notice that the very last line in your picture says "Killed". So something external to the training script, such as another program, stopped it.

@mmaaz60

mmaaz60 commented Nov 24, 2019

By repeating the training dataset and specifying steps_per_epoch and validation_steps in model.fit(), I am able to get rid of this error/warning.
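
A rough sketch of how those two values could be computed is shown below. The helper names and the example counts (1000 training examples, 200 validation examples, batch size 8) are hypothetical; the real numbers depend on your dataset.

```python
import math


def steps_for(num_examples, batch_size):
    """Batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def fit_step_kwargs(num_train, num_val, batch_size):
    """Build the step-related keyword arguments for model.fit()."""
    return {
        "steps_per_epoch": steps_for(num_train, batch_size),
        "validation_steps": steps_for(num_val, batch_size),
    }


# Hypothetical dataset sizes, for illustration only.
kwargs = fit_step_kwargs(num_train=1000, num_val=200, batch_size=8)
print(kwargs)

# Intended use, assuming `model`, `train_dataset` and `val_dataset` exist:
# model.fit(train_dataset, validation_data=val_dataset, epochs=10, **kwargs)
```

With these values each epoch covers roughly one pass over the data, even though the repeated dataset itself never ends.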

@zzh8829
Owner

zzh8829 commented Dec 21, 2019

I believe this error doesn't actually affect training; it's likely a TensorFlow bug.

@zzh8829 zzh8829 closed this as completed Dec 21, 2019