Contributor

@jaliyae jaliyae commented Feb 7, 2019

Previously, the ChunkBuffer depended on the remaining chunk count to signal the end of data loading. This does not work with distributed samplers, where each sampler loads only a subset of the chunks. This refactor removes the ChunkBuffer's dependency on the remaining chunk count.
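
To illustrate the idea, here is a minimal sketch rather than the actual implementation: instead of counting remaining chunks globally, the last preloader thread to finish signals the buffer. The names `BatchBufferSketch` and `run_preloaders` are hypothetical, and the use of `fetch_sub` is my own choice for the sketch, not necessarily what the PR does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the batch buffer; it only records that stop() ran.
struct BatchBufferSketch {
  std::atomic<bool> stopped{false};
  void stop() { stopped = true; }
};

// Each worker loads only its own subset of chunks (as a distributed sampler
// would assign them), so no global remaining-chunk count is required.
inline bool run_preloaders(const std::vector<std::size_t>& chunks_per_worker) {
  BatchBufferSketch buffer;
  std::atomic<std::size_t> running_preloaders{chunks_per_worker.size()};
  std::atomic<std::size_t> loaded{0};
  std::size_t expected = 0;
  for (std::size_t n : chunks_per_worker) expected += n;

  std::vector<std::thread> workers;
  for (std::size_t n : chunks_per_worker) {
    workers.emplace_back([&, n] {
      for (std::size_t i = 0; i < n; ++i) ++loaded;  // pretend to load a chunk
      // fetch_sub returns the previous value, so exactly one thread sees 1
      // and becomes responsible for signalling the buffer.
      if (running_preloaders.fetch_sub(1) == 1) buffer.stop();
    });
  }
  for (auto& w : workers) w.join();
  assert(running_preloaders.load() == 0);
  return buffer.stopped.load() && loaded.load() == expected;
}
```

In this sketch, `run_preloaders({3, 0, 5})` returns true even though one worker has no chunks at all, because the stop signal depends only on the number of running preloaders.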

Contributor Author

jaliyae commented Feb 7, 2019

@apaszke, this is based on our offline conversation about removing the sample_count method from the distributed samplers introduced in this PR, #16624. I will send another PR next, containing only the distributed samplers.

@jaliyae jaliyae closed this Feb 8, 2019
Contributor Author

jaliyae commented Feb 8, 2019

Reopening so that the tests can run to completion.

@jaliyae jaliyae reopened this Feb 8, 2019
@jaliyae jaliyae force-pushed the jaliyae/chunk_buffer_refactor branch from d63ed4e to 0d32a96 Compare February 8, 2019 23:54
/// chunk count and it decreases when a chunk data is retrieved. When this reaches
/// to 0, no more chunk needs to be loaded.
size_t remaining_chunk_count_ = 0;
//size_t remaining_chunk_count_ = 0;
Contributor

Nit: please remove commented code.

Contributor Author

Fixed. Thank you.

--running_preloaders_;
if (running_preloaders_.load() == 0) {
// all preloaders are completed, so we can notify the batch_buffer.
batch_buffer_->stop();
Contributor

It seems we only stop when the sampler is exhausted. It is also possible that the program wants to exit in the middle of a sweep. In that scenario, stop_ is never set to true, so the join() called from the destructor could hang.

Contributor Author

The free_workers() call makes the threads exit, which in turn triggers stop(). This happens on every reset() and also in the destructor of the chunk dataset.

Contributor

Let's go through this scenario: in the middle of a sweep, the worker is waiting inside add_chunk_data() because the current buffer already contains enough data. At this point, the user decides to exit the current sweep and start a new one. Upon exiting, get_batch is no longer called and the worker thread keeps waiting. At reset(), chunkDataset calls free_workers(), which calls join() to wait for the worker to finish. Because the worker is still in the cv wait and no notification is ever triggered, the join() will hang the program.
The original code called stop() in free_workers() before join(), which broke the wait, sent the notification, and resolved the hang.
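
A minimal sketch of this scenario, assuming a hypothetical bounded buffer (`BoundedBufferSketch` and `shutdown_mid_sweep` are illustrative names, not the actual ChunkDataset code). The worker blocks on a full buffer that nobody drains, and only the stop() before join() lets it exit:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical bounded buffer, not the real batch buffer.
class BoundedBufferSketch {
 public:
  explicit BoundedBufferSketch(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the buffer is full; returns false once stop() was called.
  bool add(int value) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return stop_ || queue_.size() < capacity_; });
    if (stop_) return false;
    queue_.push(value);
    return true;
  }

  // Sets the flag under the lock, then wakes every waiting thread.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::size_t capacity_;
  std::queue<int> queue_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_ = false;
};

// Simulates leaving a sweep early: no consumer ever drains the buffer.
// Returns how many items the worker added before it was stopped.
inline std::size_t shutdown_mid_sweep(std::size_t capacity) {
  BoundedBufferSketch buffer(capacity);
  std::size_t added = 0;
  std::thread worker([&] {
    while (buffer.add(0)) ++added;  // blocks in cv wait once the buffer is full
  });
  buffer.stop();  // without this call, join() below could wait forever
  worker.join();
  assert(added <= capacity);
  return added;
}
```

The number of items added depends on thread timing, but it never exceeds the capacity, and the function always returns because the worker's wait predicate also checks the stop flag.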

Contributor

@apaszke apaszke left a comment

Looks great!

Contributor

@facebook-github-bot facebook-github-bot left a comment

@goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
