-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Remove chunk count check on the ChunkBuffer #16868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Open again to get the tests completed. |
d63ed4e to
0d32a96
Compare
| /// chunk count and it decreases when a chunk data is retrieved. When this reaches | ||
| /// to 0, no more chunk needs to be loaded. | ||
| size_t remaining_chunk_count_ = 0; | ||
| //size_t remaining_chunk_count_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please remove commented code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thank you.
| --running_preloaders_; | ||
| if (running_preloaders_.load() == 0) { | ||
| // all preloaders are completed, so we can notify the batch_buffer. | ||
| batch_buffer_->stop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we only stop when the sampler is exhausted. It is also possible that the program wants to exit in the middle of an sweep. In this scenario, the stop_ is not switched to true and thus it could cause a hang from join() called from the destructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The free_workers() call makes the threads to exit and then it tirggers the stop(). This happens at every reset() and also at the distructor of the chunk dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go though this scenario: In the middle of a sweep, the worker is waiting inside add_chunk_data(), because the current buffer contains enough data. At this point, the user decide to exit current sweep and start a new one. Upon exiting, no more get_batch is called and the worker thread keeps waiting. At reset(), chunkDataset calls free_workers(), which calls join() to wait worker to finish. Because the worker is still in the cv wait and no notification is ever triggered, the join() will hang the program.
The original code called stop() in free_workers() before join, which breaks the wait, send the notification and resolve the hang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Previously, the ChunkBuffer depends on the remaining chunk count to signal end of dataloading. This does not work with distributed samplers where each sampler only loads a subset of chunks. This refactor remove the dependency on the remaining chunk count at the ChunkBuffer.