
Reset DataLoader workers instead of creating new ones #35795

Closed
wants to merge 1 commit

Conversation

@emcastillo (Collaborator) commented Apr 1, 2020

This PR needs discussion, as it changes the behavior of DataLoader. It can be closed if this is not considered good practice.

Currently, the DataLoader spawns a new _BaseDataLoaderIter object every epoch.
In the multiprocess DataLoader, the worker processes are re-created every epoch, and each one makes a copy of the original Dataset object.
If users want to cache data or do some tracking on their datasets, all of that state is wiped out every epoch. Note that this does not happen when the number of workers is 0, which makes the multiprocess and serial data loaders behave inconsistently.

This PR keeps the _BaseDataLoaderIter object alive and simply resets it between epochs, so the workers stay alive and so do their Dataset objects. People seem to file issues about this fairly often.
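To make the difference concrete, here is a minimal sketch of what happens today (the CachingDataset class below is just an illustration, not code from this PR): with num_workers > 0 the workers, and therefore their Dataset copies and caches, are rebuilt every epoch, while num_workers=0 keeps the cache alive.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class CachingDataset(Dataset):
        """Toy dataset that caches items; illustrative only."""
        def __init__(self, n=8):
            self.n = n
            self.cache = {}

        def __len__(self):
            return self.n

        def __getitem__(self, idx):
            hit = idx in self.cache                      # was this index served before?
            self.cache.setdefault(idx, torch.tensor(idx))
            return self.cache[idx], hit

    if __name__ == "__main__":
        ds = CachingDataset()
        loader = DataLoader(ds, batch_size=4, num_workers=2)
        for epoch in range(2):
            hits = sum(int(h) for _, hs in loader for h in hs)
            # With the current behavior, workers (and their Dataset copies) are
            # recreated every epoch, so the second epoch sees no cache hits either.
            # With num_workers=0 the second epoch would be all cache hits.
            print(f"epoch {epoch}: cache hits = {hits}")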

@emcastillo requested a review from apaszke as a code owner on April 1, 2020 05:57
dr-ci bot commented Apr 1, 2020

💊 CI failures summary and remediations

As of commit f9cc75d (more details on the Dr. CI page):



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed



@ezyang (Contributor) commented Apr 1, 2020

@ssnl do you think you could take a look at this? Lmk if you can't and I'll find someone else.

@tkerola left a comment

I just added some small comments.

(3 review threads on torch/utils/data/dataloader.py, resolved)
@ssnl (Collaborator) commented Apr 1, 2020

Thanks for doing this. Some of the issues that we need to figure out for worker reuse are:

  1. Whether we should enable this by default, as workers may hold resources.
  2. What about seeds and worker_init_fn? Do we reset / run them at each epoch?

Honestly speaking, I don't know the correct answer to these two questions. What are your thoughts? :)

@emcastillo (Collaborator, Author) commented Apr 2, 2020

Hi! Thanks for the feedback 😃

> Whether we should enable this by default, as workers may hold resources.

I think this is the behavior users expect, so it would not be a bad idea to make it the default. However, this may cause backward-compatibility issues with users' code, so we have to be careful here. Maybe adding a constructor parameter to control whether workers are kept alive would be good?

> What about seeds and worker_init_fn? Do we reset / run them at each epoch?

I think that running them only once should be fine.

@emcastillo (Collaborator, Author) commented

I added an argument to the DataLoader constructor to control this behavior, after discussing it with @tkerola.
I fixed some issues, and I guess all that is left is to write some unit tests.
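A rough usage sketch of the new argument (the name persistent_workers is the one used in later review comments and could still change); it also reflects the suggestion above that worker_init_fn would run once per worker lifetime rather than once per epoch:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def init_fn(worker_id):
        # Under the proposed behavior, this prints once per worker for the
        # lifetime of the DataLoader, not once per worker per epoch.
        print(f"initializing worker {worker_id}")

    if __name__ == "__main__":
        ds = TensorDataset(torch.arange(16).float())
        loader = DataLoader(ds, batch_size=4, num_workers=2,
                            worker_init_fn=init_fn,
                            persistent_workers=True)  # flag name not final at this point
        for epoch in range(3):
            for (batch,) in loader:
                pass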

@zou3519 added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Apr 6, 2020
@emcastillo (Collaborator, Author) commented

Hi @ssnl, could we get some comments on this? 😇

@emcastillo (Collaborator, Author) commented

Hello,
We would like to get some comments if possible!
Thanks

@ssnl (Collaborator) commented May 11, 2020

Hi, sorry for the lateness. I will take a look by tomorrow.

@ssnl (Collaborator) left a comment

We also need to reset the dataset iterator in the worker process for IterableDataset.

(review thread on torch/utils/data/dataloader.py, resolved)
@ssnl (Collaborator) commented May 13, 2020

Thanks again for doing this! My general opinion is that this should definitely be done. But given how complicated the current DataLoader ctor arguments are (look at the number of arguments that are only used with multiprocessing loading), and how complicated the DataLoader logic is (which I am equally unhappy about), we should be careful about the behavior and API.

@emcastillo (Collaborator, Author) commented May 15, 2020

Hi, thanks a lot for the comments!

Regarding resetting the IterableDataset iterator in the worker, I can't find a reset mechanism for it.
https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#IterableDataset
Maybe I am looking in the wrong place?

I think I should just create a new iterator for the dataset in the fetcher that the worker uses once the iteration has stopped. Is that right?

@ssnl (Collaborator) commented May 16, 2020

@emcastillo It would require recreating the iterator in the worker process after all iterators are exhausted. So some kind of message passing from the main process to the workers is needed for this; you can probably extend the index_queue mechanism to do so.
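Here is a self-contained sketch of that kind of message passing, deliberately independent of the real DataLoader internals (the queue wiring, the _ResetSignal sentinel, and make_iter are all made up for illustration): the worker owns its iterator and rebuilds it when the main process sends a reset message through its index queue.

    import multiprocessing as mp

    class _ResetSignal:
        """Illustrative sentinel telling a worker to recreate its iterator."""
        pass

    def make_iter():
        # Stands in for iter(dataset) on the worker's own IterableDataset copy.
        return iter(range(5))

    def worker_loop(index_queue, data_queue):
        it = make_iter()
        while True:
            msg = index_queue.get()
            if msg is None:                      # shutdown
                break
            if isinstance(msg, _ResetSignal):    # recreate the exhausted iterator
                it = make_iter()
                continue
            try:
                data_queue.put(next(it))
            except StopIteration:
                data_queue.put(StopIteration)

    if __name__ == "__main__":
        index_queue, data_queue = mp.Queue(), mp.Queue()
        p = mp.Process(target=worker_loop, args=(index_queue, data_queue))
        p.start()
        for epoch in range(2):                   # two "epochs" over the same worker
            index_queue.put(_ResetSignal())      # reset instead of respawning
            for _ in range(5):
                index_queue.put("fetch")
                print("epoch", epoch, "got", data_queue.get())
        index_queue.put(None)
        p.join()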

@emcastillo (Collaborator, Author) commented

Hi, I implemented the synchronization between the worker process and the iterator itself.

Regarding the DataLoader API, the only alternative I see is to create a subclass rather than adding a parameter, but since there are no other specializations, that seems like overkill.

@emcastillo (Collaborator, Author) commented

I have changed the dataloader tests in my local environment to always use persistent workers.
Everything passes except one test, where a warning that should be raised after accessing the length is not raised.

I still have to deal with that, but all the other tests work fine.
Once I fix it, I will push a specialization of the test suite for the persistent-workers version.
I am thinking of creating a factory function for the DataLoader so I can subclass TestDataLoader and use this function, instead of adding to or modifying all the existing tests.
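One possible shape for that factory, sketched with hypothetical class and test names (not the ones in this PR): the base suite builds every loader through a single helper, and a subclass overrides the helper to force persistent workers, so the existing tests run in both modes.

    import unittest
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class TestDataLoaderBase(unittest.TestCase):
        # All tests construct loaders through this single factory.
        def _get_data_loader(self, dataset, **kwargs):
            return DataLoader(dataset, **kwargs)

        def test_iterates_all_samples(self):
            ds = TensorDataset(torch.arange(10).float())
            loader = self._get_data_loader(ds, batch_size=2, num_workers=2)
            self.assertEqual(sum(len(b[0]) for b in loader), 10)

    class TestDataLoaderPersistentWorkers(TestDataLoaderBase):
        # Re-runs every inherited test with the new flag turned on.
        def _get_data_loader(self, dataset, **kwargs):
            kwargs["persistent_workers"] = True
            return DataLoader(dataset, **kwargs)

    if __name__ == "__main__":
        unittest.main()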

@emcastillo changed the title from "Reset _BaseDataLoaderIter instead of creating a new one" to "Reset DataLoader workers instead of creating new ones" on May 20, 2020
@emcastillo (Collaborator, Author) commented

@ssnl, any comments? 😇

@emcastillo (Collaborator, Author) commented

Just dropping by 😁

@emcastillo (Collaborator, Author) commented

Hi @ssnl, I rebased and resolved the conflicts; could I get another look?
Thanks!

(2 review threads on test/test_dataloader.py, resolved)
@emcastillo (Collaborator, Author) commented

Any feedback on this? 😇

@ssnl (Collaborator) left a comment

Looks mostly good to me! Also cc @VitalyFedyunin.

(2 review threads on torch/utils/data/dataloader.py, resolved)
@VitalyFedyunin (Contributor) commented

Hello! I'm allocating a full day tomorrow to review and land it. Meanwhile, can you please rebase?

# Changing the start value here has no effect on the dataset cached by the
# workers, since they are not recreated between epochs and can cache values
# safely.
dataset.start = i
A reviewer (Contributor) commented:

Does it fail as expected with self.persistent_workers = False?

@VitalyFedyunin (Contributor) commented Aug 10, 2020

It is confirmed to fail (which is expected). But the test is actually checking that we are not making any new copies of the dataset; there is no test which checks that exactly the same Dataset object is used after a reset.

It is still acceptable to merge as-is anyway.

@emcastillo (Collaborator, Author) commented

Actually, the Dataset object is instantiated only once, by the user, and it then seems to be cloned when calling:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L764-L769

The object appears to be copied directly into the new process's memory space, with no pickling (I tried to verify this by overriding __getstate__ and __setstate__). Additionally, in the new processes the object id is the same as in the main process, so there is no actual re-instantiation.

It seems that the only way we have to assert that the object is not being recreated is this test: if the object is recreated, it will pick up the new value (.start) in the next iteration, and if not, it will retain the cached values.
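For illustration, a standalone version of that check (the class below is a simplified stand-in for the dataset used in the test, not the test itself): the main process changes .start between epochs, and with persistent workers the values coming back never change, because the workers keep their original Dataset copies.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class StartOffsetDataset(Dataset):
        def __init__(self, start=0, n=4):
            self.start, self.n = start, n

        def __len__(self):
            return self.n

        def __getitem__(self, idx):
            return self.start + idx

    if __name__ == "__main__":
        ds = StartOffsetDataset(start=0)
        loader = DataLoader(ds, batch_size=4, num_workers=2,
                            persistent_workers=True)  # the flag under review
        for epoch in range(3):
            # Changing `start` in the main process has no effect on the copies
            # held by persistent workers, so every epoch still yields 0..3.
            ds.start = epoch * 100
            print(sorted(int(x) for batch in loader for x in batch))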

@VitalyFedyunin (Contributor) left a comment

Overall this looks good, with a couple of concerns:

  1. I'm not sure we are actually testing that we reset the Dataset instead of creating new ones. We might need to create another multiprocessing queue to gather the number of initializations and assert on it.

  2. There is no way for the Dataset to tell whether it is used in classic or reset mode. This might be error-prone if the Dataset tries to free memory, especially with iterable datasets.

(review thread on torch/utils/data/dataloader.py, resolved)
@emcastillo (Collaborator, Author) commented

Hi! Thanks for all the reviews and feedback.
I am a bit busy with some internal projects right now, so once I get everything sorted out I will rebase and update this accordingly.

Thanks again!

@emcastillo (Collaborator, Author) commented

Sorry for taking so long to get back to this.
I just rebased and verified that everything works fine; I will address the review concerns in the next few hours.

@emcastillo (Collaborator, Author) commented Aug 26, 2020

Hi, I tried to do what you suggested, but I had a bit of a hard time:

> I'm not sure we are actually testing that we reset the Dataset instead of creating new ones. We might need to create another multiprocessing queue to gather the number of initializations and assert on it.

As I replied above, I tried similar things to enable this check, but it seems the object is cloned directly by the fork, without any actual Python instantiation or pickling (it always retains the original object id). I also verified that __new__ is called only once, in the main process.
It seems we don't have an easy way to assert whether the object is the same or not.

> There is no way for the Dataset to tell whether it is used in classic or reset mode. This might be error-prone if the Dataset tries to free memory, especially with iterable datasets.

Since persistent workers are not enabled by default, I believe it is, to a certain extent, the user's responsibility to ensure that their datasets can be used in this manner. We could add some facility to the Dataset to identify the mode (probably an attribute with a setter method invoked from the dataloader?), but I think we can leave that to a different PR if possible.

I just rebased this on top of master, so I wonder if we could merge it as-is :)
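For reference, a minimal standalone experiment along those lines (the class and helper below are made up): instrumenting __new__ and __setstate__ and comparing the fork and spawn start methods shows that under fork the child never re-instantiates or unpickles the object and even reports the same id(), which is why counting initializations from the worker side is not straightforward.

    import multiprocessing as mp
    import os

    class InstrumentedDataset:
        def __new__(cls, *args, **kwargs):
            print(f"[pid {os.getpid()}] __new__ called")
            return super().__new__(cls)

        def __init__(self):
            self.data = list(range(4))           # non-empty state so pickling uses __setstate__

        def __setstate__(self, state):
            print(f"[pid {os.getpid()}] __setstate__ called (unpickled)")
            self.__dict__.update(state)

    def report(ds, parent_id):
        print(f"[pid {os.getpid()}] id(ds) == parent id: {id(ds) == parent_id}")

    if __name__ == "__main__":
        ds = InstrumentedDataset()
        for method in ("fork", "spawn"):
            if method not in mp.get_all_start_methods():
                continue                          # e.g. no fork on Windows
            print(f"--- start method: {method} ---")
            ctx = mp.get_context(method)
            # Under fork, neither hook runs in the child and the id matches;
            # under spawn, the object is pickled, so both hooks fire in the child.
            p = ctx.Process(target=report, args=(ds, id(ds)))
            p.start()
            p.join()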

@emcastillo (Collaborator, Author) commented

Thanks for approving the changes!
Will this be landed soon? 😇

@facebook-github-bot (Contributor) left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) commented

Sorry it is taking an eternity to land. Changes to DataLoader trigger all the ML tests to run.

@ssnl (Collaborator) commented Sep 1, 2020

Glad to see this merged. Thanks @emcastillo and @VitalyFedyunin !

@facebook-github-bot (Contributor) commented

@VitalyFedyunin merged this pull request in 5472426.


Labels: Merged, open source, triaged
9 participants