[DataLoader] __getitems__ added to description of Dataset API and better supported within Subset #100375

Closed
wants to merge 7 commits into from

stsouko
Contributor

@stsouko stsouko commented May 1, 2023

DataLoader supports batched loading from map-style datasets.

The fetcher auto-detects whether a dataset supports batched loading, as shown in this implementation:

torch.utils.data._utils.fetch._MapDatasetFetcher

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]

The Dataset API documentation now describes this feature.

Additionally, Subset now supports __getitems__ when the parent dataset supports it.
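
To make the batched path concrete, here is a minimal sketch in plain Python. It assumes nothing from torch: `SquaresDataset` and `fetch_batch` are hypothetical stand-ins, and `fetch_batch` only mirrors the auto-collation branch quoted above (without the collation step).

```python
# Hypothetical names throughout; plain Python stand-ins, not torch classes.
class SquaresDataset:
    """Map-style dataset: item i is i squared."""

    def __init__(self, size):
        self.size = size
        self.batched_calls = 0  # counts how often the batched path is used

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        return idx * idx

    def __getitems__(self, indices):
        # One call serves the whole batch of indices.
        self.batched_calls += 1
        return [i * i for i in indices]


def fetch_batch(dataset, batch_indices):
    """Mirrors the detection logic quoted above (assumption: no collation)."""
    if hasattr(dataset, "__getitems__") and dataset.__getitems__:
        return dataset.__getitems__(batch_indices)
    return [dataset[idx] for idx in batch_indices]


ds = SquaresDataset(10)
batch = fetch_batch(ds, [1, 3, 5])
print(batch, ds.batched_calls)
```

A dataset without `__getitems__` would take the per-index branch instead, so defining the method is purely an opt-in optimization in this sketch.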

@stsouko stsouko requested review from NivekT and ejguan as code owners May 1, 2023 13:42
@pytorch-bot

pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100375

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 954fbd6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: dataloader release notes category label May 1, 2023
@NivekT
Contributor

NivekT commented May 1, 2023

@stsouko Can you circle back and address @ejguan's comment here?

@stsouko
Contributor Author

stsouko commented May 1, 2023

> @stsouko Can you circle back and address @ejguan's comment here?

Opened a PR with the fix: #100409

@@ -291,6 +298,11 @@ def __init__(self, dataset: Dataset[T_co], indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = indices

        # add batched sampling support when parent dataset supports it.
        # see torch.utils.data._utils.fetch._MapDatasetFetcher
        if getattr(dataset, "__getitems__", None):
Collaborator

Nit: if you don't want to rely on None being falsy, you could also write this:

Suggested change
- if getattr(dataset, "__getitems__", None):
+ if getattr(dataset, "__getitems__", False):

Contributor

I think Python does have a proper way to do it:

Suggested change
- if getattr(dataset, "__getitems__", None):
+ if hasattr(dataset, "__getitems__"):

Collaborator

@ejguan This is a bit different semantically: the original would still evaluate to false if a.__getitems__ = False (or another falsy value), while hasattr would return True in that case.
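
The difference can be seen in a small sketch. `BatchedDisabled` is a hypothetical class, assumed here only to illustrate a dataset that sets the attribute to a falsy value:

```python
class BatchedDisabled:
    # Hypothetical dataset that opts out by setting the attribute to False.
    __getitems__ = False

    def __getitem__(self, idx):
        return idx


ds = BatchedDisabled()

# Truthiness-based check: the attribute exists but is falsy.
truthy_check = bool(getattr(ds, "__getitems__", None))

# Existence-based check: only asks whether the attribute is present.
hasattr_check = hasattr(ds, "__getitems__")

print(truthy_check, hasattr_check)
```

With the existence-based check alone, the caller would go on to call `False(...)` and fail with a `TypeError`.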

Collaborator

@Skylion007 Skylion007 May 2, 2023

The code originally in this PR was hasattr(dataset, "__getitems__") and dataset.__getitems__, which caused mypy to complain, unlike this version.

Contributor

Wait, in what case would __getitems__ be a plain attribute rather than a method? And if that can happen, why not use hasattr(dataset, "__getitems__") and callable(dataset.__getitems__)?

Collaborator

@Skylion007 Skylion007 May 2, 2023

@ejguan That's a double lookup on __getitems__; it is better to use callable(getattr(dataset, "__getitems__", None)), which is more efficient. I am not sure in what case __getitems__ could become a plain attribute, but it is better to code defensively here.

Suggested change
- if getattr(dataset, "__getitems__", None):
+ if callable(getattr(dataset, "__getitems__", None)):
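
As a rough illustration of why the callable check is the most defensive of the options discussed, the sketch below wraps it in a helper and runs it against three cases. The helper name `supports_batched_fetch` and the three classes are hypothetical, introduced only for this example:

```python
def supports_batched_fetch(dataset):
    """Single attribute lookup; True only if __getitems__ exists and is callable."""
    return callable(getattr(dataset, "__getitems__", None))


class NoBatch:
    def __getitem__(self, idx):
        return idx


class WithBatch(NoBatch):
    def __getitems__(self, indices):
        return list(indices)


class TruthyNonCallable(NoBatch):
    # Truthy but not callable: a plain truthiness check would accept this.
    __getitems__ = "yes"


results = {
    cls.__name__: supports_batched_fetch(cls())
    for cls in (NoBatch, WithBatch, TruthyNonCallable)
}
truthy_only = bool(getattr(TruthyNonCallable(), "__getitems__", None))
print(results, truthy_only)
```

Only `WithBatch` passes the callable check, while the truthiness check alone would let `TruthyNonCallable` through.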

Contributor Author

Originally I kept the same behavior as torch.utils.data._utils.fetch._MapDatasetFetcher:

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:

Collaborator

@Skylion007 Skylion007 May 2, 2023

Both call sites should probably be changed, to be honest.

Contributor Author

It would probably be better to do the same for the fetcher, but that is out of scope for this PR.

@@ -299,6 +311,9 @@ def __getitem__(self, idx):
    def __len__(self):
        return len(self.indices)

    def _getitems(self, idx):
Contributor

Why not just add __getitems__ and raise an error if self.dataset doesn't implement __getitems__?

Contributor Author

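Because torch.utils.data._utils.fetch._MapDatasetFetcher would then give a false positive: its check would find __getitems__ on Subset even when the parent dataset doesn't implement it.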

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
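
The false positive can be reproduced with a small plain-Python sketch. `AlwaysBatchedSubset`, `PlainDataset`, and `fetch_batch` are hypothetical stand-ins (not the torch classes), assumed here to model a wrapper that always defines `__getitems__` and raises when the parent lacks it:

```python
class PlainDataset:
    """Parent dataset without batched support."""

    def __getitem__(self, idx):
        return idx * 10


class AlwaysBatchedSubset:
    """Hypothetical wrapper: always defines __getitems__, raises if unsupported."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    def __getitems__(self, indices):
        if not hasattr(self.dataset, "__getitems__"):
            raise NotImplementedError("parent dataset has no __getitems__")
        return self.dataset.__getitems__([self.indices[i] for i in indices])


def fetch_batch(dataset, batch_indices):
    """Same detection logic as the fetcher excerpt quoted above."""
    if hasattr(dataset, "__getitems__") and dataset.__getitems__:
        return dataset.__getitems__(batch_indices)
    return [dataset[idx] for idx in batch_indices]


subset = AlwaysBatchedSubset(PlainDataset(), [2, 4, 6])
try:
    fetch_batch(subset, [0, 1])
    outcome = "fetched"
except NotImplementedError:
    # The check passed because the method exists, but the call then failed.
    outcome = "false positive: batched path chosen, then failed"
print(outcome)
```

The per-index path would have worked for this subset, but the detection logic never reaches it.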

Contributor

Then we can do a similar thing in Subset.__getitems__: call the parent's __getitems__ if it's available; otherwise, call self.dataset[idx] for each idx in possibly_batched_index.

Contributor Author

good idea
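
A minimal sketch of the approach agreed on above, in plain Python. `SubsetSketch` and the two parent datasets are hypothetical names, and this is not presented as the code that was merged; it only assumes the fallback behavior described in the comment:

```python
class SubsetSketch:
    """Hypothetical subset wrapper: batched if the parent allows, else per index."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)

    def __getitems__(self, indices):
        # Translate subset positions into parent dataset indices.
        parent_indices = [self.indices[i] for i in indices]
        parent_getitems = getattr(self.dataset, "__getitems__", None)
        if callable(parent_getitems):
            return parent_getitems(parent_indices)
        # Fallback: fetch one item at a time from the parent.
        return [self.dataset[i] for i in parent_indices]


class PlainDataset:
    def __getitem__(self, idx):
        return idx * 10


class BatchedDataset(PlainDataset):
    def __getitems__(self, indices):
        return [i * 10 for i in indices]


plain = SubsetSketch(PlainDataset(), [2, 4, 6]).__getitems__([0, 2])
batched = SubsetSketch(BatchedDataset(), [2, 4, 6]).__getitems__([0, 2])
print(plain, batched)
```

Because the wrapper always answers `__getitems__` successfully, a fetcher that only checks for the method's presence no longer hits the false positive.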

@stsouko stsouko requested a review from ejguan May 3, 2023 15:50
Contributor

@ejguan ejguan left a comment

The change looks good to me now. I am importing it internally to check whether this PR breaks any internal systems.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT NivekT added the topic: improvements topic category label May 3, 2023
@NivekT NivekT changed the title from "__getitems__ added to description of Dataset API." to "[DataLoader] __getitems__ added to description of Dataset API and better supported within Subset" May 3, 2023
Contributor

@NivekT NivekT left a comment

LGTM!

@ejguan
Contributor

ejguan commented May 3, 2023

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 3, 2023
@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased getitems onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout getitems && git pull --rebase)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@stsouko
Contributor Author

stsouko commented May 4, 2023

@ejguan, could you check the test pipeline?

@ejguan
Contributor

ejguan commented May 4, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Contributor

ejguan commented May 4, 2023

It seems something is wrong with the CI. I will land it internally. Thanks.

@ejguan
Contributor

ejguan commented May 5, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

kiersten-stokes pushed a commit to kiersten-stokes/pytorch that referenced this pull request May 8, 2023
…etter supported within `Subset` (pytorch#100375)

DataLoader supports batched loading from Mapped Datasets.

This is the fetcher's implementation of auto-detection of batch loading support.

torch.utils.data._utils.fetch._MapDatasetFetcher
```
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
```

Description of Dataset API now shows this feature.

Additionally, Subset dataset now supports `__getitems__` if parent dataset supports it.
Pull Request resolved: pytorch#100375
Approved by: https://github.com/ejguan, https://github.com/NivekT
@stsouko stsouko deleted the getitems branch May 20, 2023 12:23
Labels
ciflow/trunk, Merged, open source, release notes: dataloader, topic: improvements