Remove chunk count check on the ChunkBuffer #16868

jaliyae · 2019-02-07T22:49:10Z

Previously, the ChunkBuffer depended on the remaining chunk count to signal the end of data loading. This does not work with distributed samplers, where each sampler loads only a subset of the chunks. This refactor removes the ChunkBuffer's dependency on the remaining chunk count.
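
To make the idea concrete, here is a minimal sketch (not the code in this PR) of a buffer whose end of data is signaled by an explicit stop() once every preloader has finished, instead of by counting remaining chunks. `SketchBuffer` and `run_preloaders` are hypothetical names used only for illustration, the use of `fetch_sub` is a choice of this sketch, and the code assumes C++17 for `std::optional`.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for the batch buffer: end of data is signaled by an
// explicit stop(), not by a remaining-chunk counter.
class SketchBuffer {
 public:
  void push(int value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(value);
    }
    cv_.notify_all();
  }

  // Returns the next item, or std::nullopt once stopped and fully drained.
  std::optional<int> pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty() || stop_; });
    if (queue_.empty()) {
      return std::nullopt;
    }
    int value = queue_.front();
    queue_.pop();
    return value;
  }

  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> queue_;
  bool stop_ = false;
};

// Each preloader pushes only its own share of items, so no single thread
// knows the global item count. The last preloader to finish calls stop().
std::size_t run_preloaders(std::size_t preloader_count,
                           std::size_t items_per_preloader) {
  SketchBuffer buffer;
  std::atomic<std::size_t> running_preloaders{preloader_count};
  if (preloader_count == 0) {
    buffer.stop();  // nothing will ever be produced
  }
  std::vector<std::thread> preloaders;
  for (std::size_t i = 0; i < preloader_count; ++i) {
    preloaders.emplace_back([&] {
      for (std::size_t j = 0; j < items_per_preloader; ++j) {
        buffer.push(1);
      }
      // fetch_sub returns the previous value, so exactly one thread sees 1.
      if (running_preloaders.fetch_sub(1) == 1) {
        buffer.stop();
      }
    });
  }
  std::size_t consumed = 0;
  while (buffer.pop().has_value()) {
    ++consumed;
  }
  for (auto& preloader : preloaders) {
    preloader.join();
  }
  assert(running_preloaders.load() == 0);
  return consumed;
}
```

In this sketch the consumer never needs to know how many items exist in total; it simply reads until `pop()` reports that the buffer was stopped and is empty.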

jaliyae · 2019-02-07T22:52:53Z

@apaszke, this is based on our offline conversation about removing the sample_count method from the distributed samplers introduced in PR #16624. I will send another PR next, containing only the distributed samplers.

jaliyae · 2019-02-08T02:55:36Z

Reopening to get the tests to run to completion.

xzhu1900 · 2019-02-09T01:42:50Z

torch/csrc/api/include/torch/data/datasets/chunk.h

  /// chunk count and it decreases when a chunk data is retrieved. When this reaches
  /// to 0, no more chunk needs to be loaded.
-  size_t remaining_chunk_count_ = 0;
+  //size_t remaining_chunk_count_ = 0;


Nit: please remove commented code.

jaliyae · 2019-02-09

Fixed. Thank you.

xzhu1900 · 2019-02-09T01:52:43Z

torch/csrc/api/include/torch/data/datasets/chunk.h

+    --running_preloaders_;
+    if (running_preloaders_.load() == 0) {
+      // all preloaders are completed, so we can notify the batch_buffer.
+      batch_buffer_->stop();


It seems we only stop when the sampler is exhausted. It is also possible that the program wants to exit in the middle of a sweep. In that scenario, stop_ is never set to true, which could cause a hang in the join() called from the destructor.

jaliyae · 2019-02-09

The free_workers() call makes the threads exit, which then triggers stop(). This happens at every reset() and also in the destructor of the chunk dataset.

xzhu1900 · 2019-02-12

Let's go through this scenario: in the middle of a sweep, the worker is waiting inside add_chunk_data() because the current buffer already contains enough data. At this point, the user decides to exit the current sweep and start a new one. After exiting, get_batch is no longer called, so the worker thread keeps waiting. At reset(), ChunkDataset calls free_workers(), which calls join() to wait for the worker to finish. Because the worker is still in the cv wait and no notification is ever triggered, join() will hang the program.
The original code called stop() in free_workers() before join(), which breaks the wait, sends the notification, and resolves the hang.
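
The scenario above can be sketched as follows. This is an illustrative reduction under stated assumptions, not the ChunkDataset implementation: `BoundedSketchBuffer` and `abandon_sweep_and_reset` are hypothetical names, and the buffer tracks only a size, not real data.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical bounded buffer: the producer blocks while the buffer is full.
class BoundedSketchBuffer {
 public:
  explicit BoundedSketchBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Blocks until there is room or the buffer is stopped.
  // Returns false if the buffer was stopped before the item could be added.
  bool add() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return size_ < capacity_ || stop_; });
    if (stop_) {
      return false;
    }
    ++size_;
    return true;
  }

  // Sets the stop flag and wakes every waiter so blocked producers can exit.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t capacity_;
  std::size_t size_ = 0;
  bool stop_ = false;
};

// Simulates abandoning a sweep: nothing is ever consumed, so the worker
// ends up blocked inside add() once the buffer is full.
std::size_t abandon_sweep_and_reset(std::size_t capacity) {
  BoundedSketchBuffer buffer(capacity);
  std::size_t added = 0;
  std::thread worker([&] {
    while (buffer.add()) {
      ++added;
    }
  });
  // Calling stop() before join() wakes the blocked worker. Without this
  // call, join() would wait forever because no consumer frees any space.
  buffer.stop();
  worker.join();
  assert(added <= capacity);
  return added;
}
```

How many items the worker manages to add before stop() arrives depends on thread scheduling; the point of the sketch is only that `abandon_sweep_and_reset` returns at all, which it would not do if stop() were omitted.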

apaszke

Looks great!

facebook-github-bot

@goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

jaliyae requested review from ebetica, goldsborough and yf225 as code owners February 7, 2019 22:49

jaliyae closed this Feb 8, 2019

jaliyae reopened this Feb 8, 2019

jaliyae added 3 commits February 8, 2019 15:53

Remove chunk count check on the ChunkBuffer

c5b2cd0

fix the build error

e132dfc

Removed LockedSampler and related code

0d32a96

jaliyae force-pushed the jaliyae/chunk_buffer_refactor branch from d63ed4e to 0d32a96 Compare February 8, 2019 23:54

added an assert

b7f5ba0

xzhu1900 reviewed Feb 9, 2019

Fix review comments

f9d404c

apaszke approved these changes Feb 12, 2019

goldsborough approved these changes Feb 13, 2019

facebook-github-bot reviewed Feb 13, 2019

facebook-github-bot closed this in bc39cf4 Feb 13, 2019

ezyang added open source merged labels Jun 24, 2019
