Reset DataLoader workers instead of creating new ones #35795
Conversation
💊 CI failures summary and remediations, as of commit f9cc75d (more details on the Dr. CI page):
🚧 1 fixed upstream failure: these were probably caused by upstream breakages that were already fixed. Please rebase on the
Force-pushed from 6bb7f0a to 6661a01
@ssnl do you think you could take a look at this? Let me know if you can't and I'll find someone else.
I just added some small comments.
Thanks for doing this. Some of the issues that we need to figure out for worker reuse are
Honestly speaking, I don't know the correct answer to these two questions. What are your thoughts? :)
Hi! Thanks for the feedback 😃
I think this is the expected behavior by users so it's not a bad idea to have it as the default.
I think that running them only once should be fine.
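If "running them only once" refers to per-worker initialization hooks such as worker_init_fn (an assumption here, since the two questions above are truncated in this transcript), the run-once semantics could be sketched roughly like this, with a thread standing in for a worker process; persistent_worker, init_fn, and fetch_fn are hypothetical names, not PyTorch internals:

```python
import queue
import threading

def persistent_worker(index_queue, data_queue, init_fn, fetch_fn):
    # the init hook runs once, when the worker starts,
    # not once per epoch as with freshly spawned workers
    init_fn()
    while True:
        idx = index_queue.get()
        if idx is None:              # shutdown sentinel
            break
        data_queue.put(fetch_fn(idx))

calls = []
index_q, data_q = queue.Queue(), queue.Queue()
t = threading.Thread(
    target=persistent_worker,
    args=(index_q, data_q, lambda: calls.append("init"), lambda i: i * 2),
)
t.start()
for epoch in range(3):               # three "epochs" reuse the same worker
    for i in range(2):
        index_q.put(i)
    results = [data_q.get() for _ in range(2)]
index_q.put(None)
t.join()
print(calls)                         # ['init'] -- init ran once across epochs
```

The point of the sketch is only that the init hook sits outside the per-epoch loop, so reusing the worker implies reusing its initialization.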
Force-pushed from c5f9b69 to e256176
I added an argument to
Hi, @ssnl can we get some comments on this? 😇
Hello,
Hi, sorry for the lateness. I will take a look by tomorrow.
Also need to reset the dataset iter in worker process for IterableDataset
Thanks again for doing this! My general opinion is that this should definitely be done. But given how complicated the current DataLoader ctor arguments are (look at the number of arguments that are only used with multiprocessing loading), and how complicated the DataLoader logic already is, we should be careful about the behavior and API.
Hi, thanks a lot for the comments! Regarding resetting the iterator for IterableDataset, I think that I should just create a new iterator for the dataset in the worker process.
@emcastillo It would require recreating the iterator in the worker process after the iterator is exhausted. So some kind of message passing from the main process to the worker is needed for this. You can probably alter the index_queue mechanism to do so.
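The index_queue suggestion could be sketched roughly as follows, with a thread standing in for the worker process, RESET/STOP as hypothetical sentinels, and make_iter standing in for iter(dataset) on an IterableDataset. This only illustrates the message-passing pattern, not the actual DataLoader code:

```python
import queue
import threading

RESET = object()   # hypothetical epoch-boundary sentinel
STOP = object()    # hypothetical shutdown sentinel

def make_iter():
    # stands in for iter(dataset) on an IterableDataset
    return iter([10, 20, 30])

def worker(index_queue, data_queue):
    it = make_iter()
    while True:
        msg = index_queue.get()
        if msg is STOP:
            break
        if msg is RESET:
            it = make_iter()     # recreate the exhausted iterator in place
            continue
        data_queue.put(next(it))

index_q, data_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(index_q, data_q))
t.start()

epochs = []
for _ in range(2):
    for _ in range(3):
        index_q.put("fetch")
    epochs.append([data_q.get() for _ in range(3)])
    index_q.put(RESET)           # epoch boundary: reset, don't respawn
index_q.put(STOP)
t.join()
print(epochs)  # [[10, 20, 30], [10, 20, 30]]
```

The same long-lived worker serves both epochs; only its iterator is rebuilt when the reset message arrives.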
Force-pushed from e256176 to 386ca46
Hi, I implemented the synchronization between the worker process and the iterator itself. Regarding the
Force-pushed from 7416a98 to f400cc4
I have changed the dataloader tests in my local env to always use persistent workers. I still have to deal with it, but all the other tests work fine.
(Title changed from "Reset _BaseDataLoaderIter instead of creating a new one" to "Reset DataLoader workers instead of creating new ones")
Force-pushed from c67c5ff to 9499686
@ssnl any comments? 😇
Just dropping by 😁
Force-pushed from 9499686 to 900baf7
Hi @ssnl, I rebased and solved the conflicts, can I get another look?
Force-pushed from d846e9d to a1580c9
Any feedback on this? 😇
Looks mostly good to me! Also cc @VitalyFedyunin
Hello! I'm allocating a full day tomorrow to review and land it. Meanwhile, can you please rebase?
# Changing the start value here doesn't have any effect on the dataset
# cached by the workers, since they are not recreated between epochs
# and can cache values safely
dataset.start = i
Is it failing nicely with self.persistent_workers = False?
It is confirmed to fail (which is expected). But the test is actually checking the fact that we are not making any new copies of the dataset. There is no test which checks that you are using exactly the same Dataset objects after reset.
It is still acceptable to merge as-is anyway.
Actually, the Dataset object is instantiated only once, by the user, and then it seems to be cloned when calling:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L764-L769
The object seems to be cloned directly into the new process memory space, with no pickling (I tried to verify this by overriding __getstate__ or __setstate__). Additionally, in the new processes, the object id is the same as in the main process, so there is no actual "instantiation".
It seems that the only way we have to assert that the object is not being recreated is actually this test. If the object is recreated it will get new values in the next iteration (.start); if not, it will retain the cached values.
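The no-pickling observation matches how the 'fork' start method behaves on Unix: the child inherits a copy-on-write clone of the parent's memory at the same virtual addresses, so id() is unchanged and __getstate__ is never called. A minimal check of that behavior (TrackedDataset is a made-up class for illustration, and this assumes a platform where 'fork' is available):

```python
import multiprocessing as mp

class TrackedDataset:
    def __init__(self):
        self.start = 0

    def __getstate__(self):
        # would fire if the object were ever pickled; under fork it is not
        raise RuntimeError("dataset was pickled!")

def child(ds, parent_id, out):
    # the forked child sees a copy-on-write clone at the same address,
    # so id(ds) matches the id recorded in the parent
    out.put(id(ds) == parent_id)

ctx = mp.get_context("fork")  # Unix-only; 'spawn' would pickle instead
ds = TrackedDataset()
out = ctx.Queue()
p = ctx.Process(target=child, args=(ds, id(ds), out))
p.start()
same_id = out.get()
p.join()
print(same_id)  # True: cloned in place, no pickling, no re-instantiation
```

This is also why a "count the initializations" test is hard to write: neither __init__ nor __setstate__ runs in the worker under fork.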
Overall looks good, with a couple of concerns:
-
I'm not sure we are actually testing that we reset the Dataset instead of creating new ones. We might need to create another multiprocessing queue to gather the number of initializations and assert on them.
-
There is no way for the Dataset to tell if it is used in classic or reset mode. It might be error-prone if the Dataset tries to free memory, especially if we do iterable datasets.
Hi! Thanks for all the reviews and feedback!
Force-pushed from a1580c9 to 9c31d6c
Sorry for taking so long to get back to this.
Hi, I tried to do what you suggested but had a bit of a hard time:
As I replied above, I tried similar things to enable this check, but it seems that the object is being directly cloned in the fork without any actual Python instantiation or pickling (it always retains the original object id). I also verified that
Since the persistent workers are not enabled by default, I believe, up to a certain extent, that it is the user's responsibility to ensure that their datasets can be used in this manner. We could add some kind of facility to the Dataset to identify the mode (probably an attribute with a setter method invoked from the dataloader?), but I think we can leave that to a different PR if possible. I just rebased this on top of master, so I wonder if we could merge it as-is :)
Force-pushed from 9c31d6c to 0fb5164
Force-pushed from 0fb5164 to f9cc75d
Thanks for approving the changes!!
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Sorry it is taking an eternity to land. Changes to DataLoader triggered all the ML tests to run.
Glad to see this merged. Thanks @emcastillo and @VitalyFedyunin!
@VitalyFedyunin merged this pull request in 5472426.
This PR needs discussion as it changes the behavior of DataLoader. It can be closed if it's not considered a good practice.
Currently, the DataLoader spawns a new _BaseDataLoaderIter object every epoch. In the case of the multiprocess DataLoader, every epoch the worker processes are re-created and they make a copy of the original Dataset object. If users want to cache data or do some tracking on their datasets, all their data will be wiped out every epoch. Notice that this doesn't happen when the number of workers is 0, giving some inconsistencies between the multiprocess and serial data loaders.
This PR keeps the _BaseDataLoaderIter object alive and just resets it between epochs, so the workers remain active and so do their own Dataset objects. People seem to file issues about this often.