Resolves coredump caused by `tf.data.experimental.save` with prefetch #49383

ashahab · 2021-05-20T18:34:29Z

Repeat and prefetch in combination cause the snapshot reader Initialize function to be invoked multiple times.
However, there is nothing to prefetch on the very last iteration. This results in Prefetch issuing a CancelThreads call while the snapshot thread is trying to initialize. See

tensorflow/tensorflow/core/kernels/data/prefetch_dataset_op.cc

Line 151 in 6446dda

CancelThreads();

Currently the dataset reference counting is done asymmetrically. The reference increment happens at the end of initialization, where as the reference decrement
happens in a destructor. When prefetch cancels the snapshot thread, it errors out of the initialization function. And stops calling the reference increment. However, the reference decrement happens regardless, as it is in the destructor which always is invoked during cleanup. This results in an attempt to decrement the null dataset pointer, and therefore a segmentation fault.
This is different from all other dataset ops, where the dataset reference increment happens in the constructor and the decrement happens in the destructor, which are symmetric.

The solution to this is to ensure that the dataset reference is always initialized to nullptr, and to check for null when decrementing the dataset reference.

Repeat and prefetch in combination cause the snapshot reader Initialize function to be invoked multiple times. However, there is nothing to prefetch on the very last iteration. This results in Prefetch issuing a CancelThreads call while the snapshot thread is trying to initialize. See https://github.com/tensorflow/tensorflow/blob/6446dda92eaadf11d22377e2354307642d739d73/tensorflow/core/kernels/data/prefetch_dataset_op.cc#L151 Currently the dataset reference counting is done asymmetrically. The reference increment happens at the end of initialization, where as the reference decrement happens in a destructor. When prefetch cancels the snapshot thread, it errors out of the initialization function. And stops calling the reference increment. However, the reference decrement happens regardless, as it is in the destructor which always is invoked during cleanup. This results in an attempt to decrement the null dataset pointer, and therefore a segmentation fault. This is different from all other dataset ops, where the dataset reference increment happens in the constructor and the decrement happens in the destructor, which are symmetric. The solution to this is to ensure that the dataset reference is always initialized to nullptr, and to check for null when decrementing the dataset reference.

ashahab · 2021-05-20T18:35:31Z

@yangustc07 @mihaimaruseac this is a cherry pick of #49166

yangustc07 · 2021-05-20T18:44:38Z

Thanks Abin. I added the release owner of r2.5.

mihaimaruseac · 2021-05-21T15:27:50Z

Since this is in 25 it will have to wait until we patch 2.5 again.

google-cla bot added the cla: yes label May 20, 2021

yangustc07 requested a review from goldiegadde May 20, 2021 18:43

gbaned self-assigned this May 21, 2021

gbaned added this to Assigned Reviewer in PR Queue via automation May 21, 2021

gbaned assigned mihaimaruseac and unassigned gbaned May 21, 2021

mihaimaruseac added the waiting for patch release PR will be reviewed and merged only if we do a patch release since PR is not on master branch label May 21, 2021

mihaimaruseac approved these changes Aug 6, 2021

View reviewed changes

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Aug 6, 2021

mihaimaruseac merged commit 6f39597 into tensorflow:r2.5 Aug 6, 2021

PR Queue automation moved this from Approved by Reviewer to Merged Aug 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolves coredump caused by `tf.data.experimental.save` with prefetch #49383

Resolves coredump caused by `tf.data.experimental.save` with prefetch #49383

ashahab commented May 20, 2021

ashahab commented May 20, 2021

yangustc07 commented May 20, 2021

mihaimaruseac commented May 21, 2021

Resolves coredump caused by tf.data.experimental.save with prefetch #49383

Resolves coredump caused by tf.data.experimental.save with prefetch #49383

Conversation

ashahab commented May 20, 2021

ashahab commented May 20, 2021

yangustc07 commented May 20, 2021

mihaimaruseac commented May 21, 2021

Resolves coredump caused by `tf.data.experimental.save` with prefetch #49383

Resolves coredump caused by `tf.data.experimental.save` with prefetch #49383