BaseCollectiveExecuter::StartAbort Out of range: End of Sequence #104

muhammad-maaz-confiz · 2019-11-20T13:54:32Z

Hi,

While training Tiny YOLO on the VOC dataset, I get the error "BaseCollectiveExecuter::StartAbort Out of range: End of Sequence" at the end of each epoch. Training also stops early after 4 epochs. The terminal outputs are attached. Note that I am using Ubuntu 18.04 with TF 2.0.

Also, when training in eager mode, everything worked fine.

AnaRhisT94 · 2019-11-20T17:05:35Z

I think it makes sense to add the repeat() function when creating your training and validation datasets.

muhammad-maaz-confiz · 2019-11-20T17:29:27Z

Hi @AnaRhisT94,

So when loading TFRecords, will using dataset.map(....).repeat() resolve this issue? Please explain in detail, as I am new to TensorFlow and want to understand this in depth.

Thanks

AnaRhisT94 · 2019-11-20T17:35:42Z

The best thing is to try it and see. It should work, though :)
You can use it after batch or after shuffle; just add .repeat().
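
To illustrate what the placement of .repeat() changes, here is a minimal sketch on a toy dataset. The use of Dataset.range(10) and a batch size of 4 are assumptions for illustration only, not the repository's TFRecord pipeline. It compares the batch sizes produced when .repeat() comes after .batch() with those produced when it comes before.

```python
import tensorflow as tf

# Toy stand-in dataset (assumption): 10 elements, batch size 4.
ds = tf.data.Dataset.range(10)

# repeat() after batch(): each pass over the data ends with a short batch.
after_batch = [int(b.shape[0]) for b in ds.batch(4).repeat().take(5)]

# repeat() before batch(): batches can span the boundary between passes,
# so every batch is full.
before_batch = [int(b.shape[0]) for b in ds.repeat().batch(4).take(5)]

print(after_batch)   # [4, 4, 2, 4, 4]
print(before_batch)  # [4, 4, 4, 4, 4]
```

In both cases the dataset no longer runs out, so the iterator does not reach the end of the sequence in the middle of training.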

NiklasWilson · 2019-11-21T21:44:39Z

Another option is switching from "fit" to "eager_tf".

Ah, I just saw you said you did that first.

NiklasWilson · 2019-11-22T16:36:13Z

@muhammad-maaz-confiz Did adding .repeat() resolve this for you? And have you been able to train a model that detects something (even inaccurately)?

I am having this exact same issue and have yet to train a functioning model.

muhammad-maaz-confiz · 2019-11-22T19:41:03Z

Hi @NiklasWilson ,

Adding .repeat() and passing steps_per_epoch to model.fit() solved the issue for me. Because of a busy schedule, I have not been able to test the model yet. I will let you know about detection once I test it.

NiklasWilson · 2019-11-22T19:42:25Z

@muhammad-maaz-confiz Awesome! I will add that to my code tonight and test it myself too, while eagerly awaiting the results of your tests :)

NiklasWilson · 2019-11-22T19:47:17Z

Side note: the "early stops after 4 epochs" behavior can happen with or without these errors. It is related to this line of code:
ReduceLROnPlateau(verbose=1), EarlyStopping(patience=3, verbose=1),

You can increase the patience value to reduce premature stopping.
I have changed mine to
ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1), EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)

The purpose of EarlyStopping is to stop training once the monitored value (val_loss here) has not improved for patience consecutive epochs.
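
To make the effect of patience concrete, here is a simplified sketch of the patience rule in plain Python. It is not Keras's actual implementation; the function name epochs_until_early_stop and the validation losses are made up for illustration.

```python
def epochs_until_early_stop(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch after which training would stop, or None.

    Simplified patience rule (hypothetical helper, not Keras code): stop once
    the loss has failed to improve on the best value for `patience` epochs
    in a row.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Made-up losses: the first epoch is the best one, later epochs never beat it.
made_up_val_losses = [1.00, 1.05, 1.10, 1.20, 1.30, 1.25,
                      1.40, 1.50, 1.45, 1.60, 1.70, 1.80]

stop_patience_3 = epochs_until_early_stop(made_up_val_losses, patience=3)
stop_patience_10 = epochs_until_early_stop(made_up_val_losses, patience=10)
print(stop_patience_3)   # 4
print(stop_patience_10)  # 11
```

With patience=3, a run whose validation loss never improves after the first epoch stops after epoch 4, which is consistent with the behavior reported above. Raising patience to 10 delays the stop to epoch 11 for the same losses.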

muhammad-maaz-confiz · 2019-11-23T06:11:50Z

Thanks @NiklasWilson and @AnaRhisT94 ,

Repeating the dataset and adding steps_per_epoch in model.fit() seemed to resolve the issue for the initial epochs, but it reappeared after a couple of epochs (3 epochs for me). What could have gone wrong? Also note that the training automatically stops after 9 epochs, with no message about early stopping. Thanks

NiklasWilson · 2019-11-23T14:35:14Z

@muhammad-maaz-confiz
This problem went away for me when I put the repeat here:
train_dataset = train_dataset.map(lambda x, y: (dataset.transform_images(x, FLAGS.size), dataset.transform_targets(y, anchors, anchor_masks, 80))).repeat()
(only there, nowhere else)

I didn't put it here, but it makes sense that it would also need to go here (adding that now, actually):
val_dataset = val_dataset.map(lambda x, y: (dataset.transform_images(x, FLAGS.size), dataset.transform_targets(y, anchors, anchor_masks, 80))).repeat()

Regarding "Also note that the training automatically stops after 9 epochs":

Notice that the very last line in your picture says "Killed".
So something external to the training script stopped it, such as another program.

mmaaz60 · 2019-11-24T17:24:45Z

By repeating the training dataset and specifying steps_per_epoch and validation_steps in model.fit(), I am able to get rid of this error/warning.
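
A minimal sketch of this combination, assuming a toy in-memory dataset and a one-layer model in place of the YOLO pipeline (the sample counts, batch size, and model are illustrative assumptions, not the repository's code):

```python
import math

import numpy as np
import tensorflow as tf

# Toy stand-in data (assumption): 10 training and 6 validation samples.
num_train, num_val, batch_size = 10, 6, 4
x_train = np.random.rand(num_train, 4).astype("float32")
y_train = np.random.randint(0, 2, size=(num_train, 1)).astype("float32")
x_val = np.random.rand(num_val, 4).astype("float32")
y_val = np.random.randint(0, 2, size=(num_val, 1)).astype("float32")

# Repeat both datasets so their iterators never run out.
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(num_train).batch(batch_size).repeat())
val_ds = (tf.data.Dataset.from_tensor_slices((x_val, y_val))
          .batch(batch_size).repeat())

# A repeated dataset is infinite, so fit() must be told how many
# batches make up one epoch and one validation pass.
steps_per_epoch = math.ceil(num_train / batch_size)
validation_steps = math.ceil(num_val / batch_size)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(
    train_ds,
    epochs=2,
    steps_per_epoch=steps_per_epoch,
    validation_data=val_ds,
    validation_steps=validation_steps,
    verbose=0,
)
```

Here steps_per_epoch and validation_steps are computed as the number of samples divided by the batch size, rounded up, so each epoch still covers roughly one pass over the data.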

zzh8829 · 2019-12-21T12:05:39Z

I believe this error doesn't actually affect training; it's likely a bug in TensorFlow.

lazerliu mentioned this issue Nov 29, 2019

Is there anyone successfully trained the code with any dataset? #107

Closed

zzh8829 added the training Training Related Questions label Dec 20, 2019

zzh8829 closed this as completed Dec 21, 2019
