
Reset DataLoader workers instead of creating new ones #35795

Closed
wants to merge 1 commit

Conversation

@emcastillo (Collaborator) commented Apr 1, 2020

This PR needs discussion, as it changes the behavior of DataLoader. It can be closed if this is not considered good practice.

Currently, the DataLoader spawns a new _BaseDataLoaderIter object every epoch.
In the multiprocess DataLoader, the worker processes are re-created every epoch, and each one makes a copy of the original Dataset object.
If users want to cache data or do some tracking on their datasets, all of that state is wiped out every epoch. Note that this does not happen when the number of workers is 0, which makes the multiprocess and serial data loaders behave inconsistently.

This PR keeps the _BaseDataLoaderIter object alive and simply resets it between epochs, so the workers stay alive and so do their Dataset objects. People seem to file issues about this fairly often.
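To make the difference concrete, here is a minimal sketch of what happens today (the CachingDataset class below is just an illustration, not code from this PR): with num_workers > 0 the workers, and therefore their Dataset copies and caches, are rebuilt every epoch, while num_workers=0 keeps the cache alive.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class CachingDataset(Dataset):
        """Toy dataset that caches items; illustrative only."""
        def __init__(self, n=8):
            self.n = n
            self.cache = {}

        def __len__(self):
            return self.n

        def __getitem__(self, idx):
            hit = idx in self.cache                      # was this index served before?
            self.cache.setdefault(idx, torch.tensor(idx))
            return self.cache[idx], hit

    if __name__ == "__main__":
        ds = CachingDataset()
        loader = DataLoader(ds, batch_size=4, num_workers=2)
        for epoch in range(2):
            hits = sum(int(h) for _, hs in loader for h in hs)
            # With the current behavior, workers (and their Dataset copies) are
            # recreated every epoch, so the second epoch sees no cache hits either.
            # With num_workers=0 the second epoch would be all cache hits.
            print(f"epoch {epoch}: cache hits = {hits}")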

@emcastillo requested a review from apaszke as a code owner on April 1, 2020 05:57
dr-ci bot commented Apr 1, 2020

💊 CI failures summary and remediations

As of commit f9cc75d (more details on the Dr. CI page):



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed



@ezyang (Contributor) commented Apr 1, 2020

@ssnl do you think you could take a look at this? Lmk if you can't and I'll find someone else.

@tkerola left a comment

I just added some small comments.

(3 review threads on torch/utils/data/dataloader.py, resolved)
@ssnl (Collaborator) commented Apr 1, 2020

Thanks for doing this. Some of the issues that we need to figure out for worker reuse are:

  1. Whether we should enable this by default, as workers may hold resources.
  2. What about seeds and worker_init_fn? Do we reset / run them at each epoch?

Honestly speaking, I don't know the correct answer to these two questions. What are your thoughts? :)

@emcastillo (Collaborator, Author) commented Apr 2, 2020

Hi! Thanks for the feedback 😃

> Whether we should enable this by default, as workers may hold resources.

I think this is the behavior users expect, so it would not be a bad idea to make it the default. However, this may cause backward-compatibility issues with users' code, so we have to be careful here. Maybe adding a constructor parameter to control whether workers are kept alive would be good?

> What about seeds and worker_init_fn? Do we reset / run them at each epoch?

I think that running them only once should be fine.

@emcastillo (Collaborator, Author) commented

I added an argument to the DataLoader constructor to control this behavior, after discussing it with @tkerola.
I fixed some issues, and I guess all that is left is to write some unit tests.
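A rough usage sketch of the new argument (the name persistent_workers is the one used in later review comments and could still change); it also reflects the suggestion above that worker_init_fn would run once per worker lifetime rather than once per epoch:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def init_fn(worker_id):
        # Under the proposed behavior, this prints once per worker for the
        # lifetime of the DataLoader, not once per worker per epoch.
        print(f"initializing worker {worker_id}")

    if __name__ == "__main__":
        ds = TensorDataset(torch.arange(16).float())
        loader = DataLoader(ds, batch_size=4, num_workers=2,
                            worker_init_fn=init_fn,
                            persistent_workers=True)  # flag name not final at this point
        for epoch in range(3):
            for (batch,) in loader:
                pass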

@zou3519 added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Apr 6, 2020
@emcastillo (Collaborator, Author) commented

Hi @ssnl, could we get some comments on this? 😇

@emcastillo (Collaborator, Author) commented

Hello,
We would like to get some comments if possible!
Thanks

@ssnl (Collaborator) commented May 11, 2020

Hi, sorry for the lateness. I will take a look by tomorrow.

@ssnl (Collaborator) left a comment

We also need to reset the dataset iterator in the worker process for IterableDataset.

(review thread on torch/utils/data/dataloader.py, resolved)
@ssnl (Collaborator) commented May 13, 2020

Thanks again for doing this! My general opinion is that this should definitely be done. But given how complicated the current DataLoader ctor arguments are (look at the number of arguments that are only used with multiprocessing loading), and how complicated the DataLoader logic is (which I am equally unhappy about), we should be careful about the behavior and API.

@emcastillo (Collaborator, Author) commented May 15, 2020

Hi, thanks a lot for the comments!

Regarding resetting the IterableDataset iterator in the worker, I can't find a reset mechanism for it.
https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#IterableDataset
Maybe I am looking in the wrong place?

I think I should just create a new iterator for the dataset in the fetcher that the worker uses once the iteration has stopped. Is that right?

@ssnl (Collaborator) commented May 16, 2020

@emcastillo It would require recreating the iterator in the worker process after all iterators are exhausted. So some kind of message passing from the main process to the workers is needed for this; you can probably extend the index_queue mechanism to do so.
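Here is a self-contained sketch of that kind of message passing, deliberately independent of the real DataLoader internals (the queue wiring, the _ResetSignal sentinel, and make_iter are all made up for illustration): the worker owns its iterator and rebuilds it when the main process sends a reset message through its index queue.

    import multiprocessing as mp

    class _ResetSignal:
        """Illustrative sentinel telling a worker to recreate its iterator."""
        pass

    def make_iter():
        # Stands in for iter(dataset) on the worker's own IterableDataset copy.
        return iter(range(5))

    def worker_loop(index_queue, data_queue):
        it = make_iter()
        while True:
            msg = index_queue.get()
            if msg is None:                      # shutdown
                break
            if isinstance(msg, _ResetSignal):    # recreate the exhausted iterator
                it = make_iter()
                continue
            try:
                data_queue.put(next(it))
            except StopIteration:
                data_queue.put(StopIteration)

    if __name__ == "__main__":
        index_queue, data_queue = mp.Queue(), mp.Queue()
        p = mp.Process(target=worker_loop, args=(index_queue, data_queue))
        p.start()
        for epoch in range(2):                   # two "epochs" over the same worker
            index_queue.put(_ResetSignal())      # reset instead of respawning
            for _ in range(5):
                index_queue.put("fetch")
                print("epoch", epoch, "got", data_queue.get())
        index_queue.put(None)
        p.join()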

@emcastillo (Collaborator, Author) commented

Hi, I implemented the synchronization between the worker process and the iterator itself.

Regarding the DataLoader API, the only alternative I see is to create a subclass rather than adding a parameter, but since there are no other specializations, that seems like overkill.

@emcastillo (Collaborator, Author) commented

I have changed the dataloader tests in my local environment to always use persistent workers.
Everything passes except one test, where a warning that should be raised after accessing the length is not raised.

I still have to deal with that, but all the other tests work fine.
Once I fix it, I will push a specialization of the test suite for the persistent-workers version.
I am thinking of creating a factory function for the DataLoader so I can subclass TestDataLoader and use this function, instead of adding to or modifying all the existing tests.
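One possible shape for that factory, sketched with hypothetical class and test names (not the ones in this PR): the base suite builds every loader through a single helper, and a subclass overrides the helper to force persistent workers, so the existing tests run in both modes.

    import unittest
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class TestDataLoaderBase(unittest.TestCase):
        # All tests construct loaders through this single factory.
        def _get_data_loader(self, dataset, **kwargs):
            return DataLoader(dataset, **kwargs)

        def test_iterates_all_samples(self):
            ds = TensorDataset(torch.arange(10).float())
            loader = self._get_data_loader(ds, batch_size=2, num_workers=2)
            self.assertEqual(sum(len(b[0]) for b in loader), 10)

    class TestDataLoaderPersistentWorkers(TestDataLoaderBase):
        # Re-runs every inherited test with the new flag turned on.
        def _get_data_loader(self, dataset, **kwargs):
            kwargs["persistent_workers"] = True
            return DataLoader(dataset, **kwargs)

    if __name__ == "__main__":
        unittest.main()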

@emcastillo changed the title from "Reset _BaseDataLoaderIter instead of creating a new one" to "Reset DataLoader workers instead of creating new ones" on May 20, 2020
@emcastillo (Collaborator, Author) commented

@ssnl, any comments? 😇

@emcastillo (Collaborator, Author) commented

Just dropping by 😁

@emcastillo (Collaborator, Author) commented

Hi @ssnl, I rebased and resolved the conflicts; could I get another look?
Thanks!

(2 review threads on test/test_dataloader.py, resolved)
@emcastillo (Collaborator, Author) commented

Any feedback on this? 😇

@ssnl (Collaborator) left a comment

Looks mostly good to me! Also cc @VitalyFedyunin.

(2 review threads on torch/utils/data/dataloader.py, resolved)
@VitalyFedyunin (Contributor) commented

Hello! I'm allocating a full day tomorrow to review and land it. Meanwhile, can you please rebase?

# Changing the start value here has no effect on the dataset cached by the
# workers, since they are not recreated between epochs and can cache values
# safely.
dataset.start = i
A reviewer (Contributor) commented:

Does it fail as expected with self.persistent_workers = False?

@VitalyFedyunin (Contributor) commented Aug 10, 2020

It is confirmed to fail (which is expected). But the test is actually checking that we are not making any new copies of the dataset; there is no test which checks that exactly the same Dataset object is used after a reset.

It is still acceptable to merge as-is anyway.

@emcastillo (Collaborator, Author) commented

Actually, the Dataset object is instantiated only once, by the user, and it then seems to be cloned when calling:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L764-L769

The object appears to be copied directly into the new process's memory space, with no pickling (I tried to verify this by overriding __getstate__ and __setstate__). Additionally, in the new processes the object id is the same as in the main process, so there is no actual re-instantiation.

It seems that the only way we have to assert that the object is not being recreated is this test: if the object is recreated, it will pick up the new value (.start) in the next iteration, and if not, it will retain the cached values.
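For illustration, a standalone version of that check (the class below is a simplified stand-in for the dataset used in the test, not the test itself): the main process changes .start between epochs, and with persistent workers the values coming back never change, because the workers keep their original Dataset copies.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class StartOffsetDataset(Dataset):
        def __init__(self, start=0, n=4):
            self.start, self.n = start, n

        def __len__(self):
            return self.n

        def __getitem__(self, idx):
            return self.start + idx

    if __name__ == "__main__":
        ds = StartOffsetDataset(start=0)
        loader = DataLoader(ds, batch_size=4, num_workers=2,
                            persistent_workers=True)  # the flag under review
        for epoch in range(3):
            # Changing `start` in the main process has no effect on the copies
            # held by persistent workers, so every epoch still yields 0..3.
            ds.start = epoch * 100
            print(sorted(int(x) for batch in loader for x in batch))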

@VitalyFedyunin (Contributor) left a comment

Overall this looks good, with a couple of concerns:

  1. I'm not sure we are actually testing that we reset the Dataset instead of creating new ones. We might need to create another multiprocessing queue to gather the number of initializations and assert on it.

  2. There is no way for the Dataset to tell whether it is used in classic or reset mode. This might be error-prone if the Dataset tries to free memory, especially with iterable datasets.

(review thread on torch/utils/data/dataloader.py, resolved)
@emcastillo (Collaborator, Author) commented

Hi! Thanks for all the reviews and feedback.
I am a bit busy with some internal projects right now, so once I get everything sorted out I will rebase and update this accordingly.

Thanks again!

@emcastillo (Collaborator, Author) commented

Sorry for taking so long to get back to this.
I just rebased and verified that everything works fine; I will address the review concerns in the next few hours.

@emcastillo (Collaborator, Author) commented Aug 26, 2020

Hi, I tried to do what you suggested, but I had a bit of a hard time:

> I'm not sure we are actually testing that we reset the Dataset instead of creating new ones. We might need to create another multiprocessing queue to gather the number of initializations and assert on it.

As I replied above, I tried similar things to enable this check, but it seems the object is cloned directly by the fork, without any actual Python instantiation or pickling (it always retains the original object id). I also verified that __new__ is called only once, in the main process.
It seems we don't have an easy way to assert whether the object is the same or not.

> There is no way for the Dataset to tell whether it is used in classic or reset mode. This might be error-prone if the Dataset tries to free memory, especially with iterable datasets.

Since persistent workers are not enabled by default, I believe it is, to a certain extent, the user's responsibility to ensure that their datasets can be used in this manner. We could add some facility to the Dataset to identify the mode (probably an attribute with a setter method invoked from the dataloader?), but I think we can leave that to a different PR if possible.

I just rebased this on top of master, so I wonder if we could merge it as-is :)
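For reference, a minimal standalone experiment along those lines (the class and helper below are made up): instrumenting __new__ and __setstate__ and comparing the fork and spawn start methods shows that under fork the child never re-instantiates or unpickles the object and even reports the same id(), which is why counting initializations from the worker side is not straightforward.

    import multiprocessing as mp
    import os

    class InstrumentedDataset:
        def __new__(cls, *args, **kwargs):
            print(f"[pid {os.getpid()}] __new__ called")
            return super().__new__(cls)

        def __init__(self):
            self.data = list(range(4))           # non-empty state so pickling uses __setstate__

        def __setstate__(self, state):
            print(f"[pid {os.getpid()}] __setstate__ called (unpickled)")
            self.__dict__.update(state)

    def report(ds, parent_id):
        print(f"[pid {os.getpid()}] id(ds) == parent id: {id(ds) == parent_id}")

    if __name__ == "__main__":
        ds = InstrumentedDataset()
        for method in ("fork", "spawn"):
            if method not in mp.get_all_start_methods():
                continue                          # e.g. no fork on Windows
            print(f"--- start method: {method} ---")
            ctx = mp.get_context(method)
            # Under fork, neither hook runs in the child and the id matches;
            # under spawn, the object is pickled, so both hooks fire in the child.
            p = ctx.Process(target=report, args=(ds, id(ds)))
            p.start()
            p.join()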

@emcastillo (Collaborator, Author) commented

Thanks for approving the changes!
Will this be landed soon? 😇

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) commented

Sorry it is taking an eternity to land. Changes to DataLoader trigger all the ML tests to run.

@ssnl (Collaborator) commented Sep 1, 2020

Glad to see this merged. Thanks @emcastillo and @VitalyFedyunin !

@facebook-github-bot (Contributor) commented

@VitalyFedyunin merged this pull request in 5472426.


Labels: Merged, open source, triaged
9 participants