
Extend ConcatDataset to return dataset index #32034

Open
ATriantafyllopoulos opened this issue Jan 10, 2020 · 10 comments
Labels
enhancement · module: dataloader · small · triaged

Comments

@ATriantafyllopoulos
Copy link

ATriantafyllopoulos commented Jan 10, 2020

🚀 Feature

This feature would extend torch.utils.data.ConcatDataset to return the data set index from which a sample originated.

Motivation

This would be useful when the user wants to keep track of which data set each sample comes from.

The use case I have in mind is multi-task learning with datasets that contain mutually exclusive labels (e.g. dataset A contains labels for task A, and dataset B contains labels for task B), as opposed to all labels being present in a single dataset.

In this case, multi-task learning is still possible by pooling examples from different sources, provided one keeps track of which source each example in a batch came from. That is the motivation behind this request.

Pitch

The feature would consist of two changes:

  • adding an extra return_index argument to the constructor (defaulting to False)
  • adding dataset_idx to the value returned by __getitem__ when return_index is True (which currently ends with
    return self.datasets[dataset_idx][sample_idx]
    )
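A minimal, pure-Python sketch of the pitch above (it mirrors the index arithmetic torch.utils.data.ConcatDataset performs internally; the class name and the return_index argument are the proposal, not an existing API):

```python
import bisect

class ConcatWithIndex:
    """Sketch of the proposed behavior: optionally return which source
    dataset a sample came from. Mirrors ConcatDataset's index arithmetic."""

    def __init__(self, datasets, return_index=False):
        self.datasets = list(datasets)
        self.return_index = return_index
        # cumulative_sizes[i] = total length of datasets[0..i]
        self.cumulative_sizes = []
        total = 0
        for d in self.datasets:
            total += len(d)
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # find the dataset the flat index falls into, then the local offset
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        sample_idx = idx if dataset_idx == 0 else idx - self.cumulative_sizes[dataset_idx - 1]
        sample = self.datasets[dataset_idx][sample_idx]
        return (sample, dataset_idx) if self.return_index else sample
```

With return_index=True, each item comes back as (sample, dataset_idx), which a default DataLoader collate would turn into a batch of samples plus a batch of dataset indices.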

cc @ssnl

@fmassa
Copy link
Member

fmassa commented Jan 10, 2020

Thanks for opening the issue!

I think it would be more valuable, instead of adding a new constructor argument to the dataset, to add an extra method that computes the dataset_idx and the sample_idx used by __getitem__. Something along the lines of what I did in maskrcnn-benchmark.

Thoughts?
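The helper being proposed can be sketched as a small function over ConcatDataset's cumulative_sizes list (the name locate_index is illustrative; it is the same bisect arithmetic __getitem__ already performs):

```python
import bisect

def locate_index(cumulative_sizes, idx):
    """Map a flat index into a concatenated dataset to (dataset_idx, sample_idx).
    cumulative_sizes[i] is the total length of datasets[0..i]."""
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    if dataset_idx == 0:
        sample_idx = idx
    else:
        sample_idx = idx - cumulative_sizes[dataset_idx - 1]
    return dataset_idx, sample_idx
```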

@fmassa fmassa added the module: dataloader, enhancement, small, and triaged labels Jan 10, 2020
@ATriantafyllopoulos
Copy link
Author

I think it would be more valuable, instead of adding a new constructor argument to the dataset, to add an extra method that computes the dataset_idx and the sample_idx used by __getitem__.

I cannot find any reference to that method outside the class. The way I intend to use this feature is something like:

data = torch.utils.data.ConcatDataset((dataset_A, dataset_B))
loader = torch.utils.data.DataLoader(data, batch_size=10, shuffle=True)
features, labels, dataset_idx = next(iter(loader))

Under your proposal, I can't see how I would access the dataset_idx through a DataLoader. Do you have any thoughts on this?

@fmassa
Copy link
Member

fmassa commented Jan 10, 2020

Yes, you can do something like:

from torch.utils.data import ConcatDataset

class MyConcatDataset(ConcatDataset):
    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        # get_dataset_idx is the proposed helper, not yet in ConcatDataset
        dataset_idx = self.get_dataset_idx(idx)
        return data, dataset_idx

which is trivially done if such a method is available. Plus, it lets you easily configure how / what to return, which is not possible with your method.

@ATriantafyllopoulos
Copy link
Author

Ah, you meant implementing it in such a way that the user would still need to override ConcatDataset to get the intended behavior. I was thinking along the lines of adding that behavior to the base class itself.

I definitely see the merits of restructuring the function in the way you propose, as it will be very useful in any case.

Adding the extra argument I recommended on top of that might make the code harder to understand, so unless there are more use cases for it, I'd skip it for now. Should I go ahead and open a PR with this minor restructuring?

@fmassa
Copy link
Member

fmassa commented Jan 13, 2020

Should I go ahead and open a PR with this minor restructuring?

Yes, I think restructuring the ConcatDataset to add a method similar to get_idxs would be a great addition! Can you send a PR?

@ATriantafyllopoulos
Copy link
Author

I have now implemented the change and tried to send a PR, but I don't have the rights to push to this repo.

I was also wondering about the recommended policy for contributions. Should I open the PR from my own fork, or push my branch directly to this repo? I couldn't find anything about this in the contribution guide, and since I'm new to open-source contributing I don't know the standard process.

@choidongyeon
Copy link

@ATriantafyllopoulos I typically push changes to my own fork and then publish a PR from it. Are you still working on this? If not, I can pick up where you left off.

@ATriantafyllopoulos
Copy link
Author

@ATriantafyllopoulos I typically push changes to my own fork and then publish a PR from it. Are you still working on this? If not, I can pick up where you left off.

Uh, thanks for bumping this. I procrastinated on this far longer than necessary. PR is now open: #39052

@g-karthik
Copy link

g-karthik commented Jul 31, 2020

@fmassa @ATriantafyllopoulos I have a somewhat related question:

Suppose I have N datasets, each with tensors of shape (B, num_cands, seq_len) where B, num_cands and seq_len may vary across the N datasets. I want to run a single DistributedDataParallel training job, where each mini-batch (in each GPU) comes from any one of the N datasets. I do not want a given mini-batch to have examples from multiple datasets, because then I'll get dimension mismatch errors.

How would I go about enabling the above, while at the same time also ensuring each process receives data exclusive to it (for which one typically uses DistributedSampler)?

My data loader code currently looks like the following (where train_datasets and valid_datasets are lists each containing N TensorDataset objects, corresponding to the N datasets):

    final_train_dataset = ConcatDataset(train_datasets)
    final_valid_dataset = ConcatDataset(valid_datasets)

    train_sampler = torch.utils.data.distributed.DistributedSampler(final_train_dataset) if args.distributed else None
    valid_sampler = torch.utils.data.distributed.DistributedSampler(final_valid_dataset) if args.distributed else None
    train_loader = DataLoader(final_train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, shuffle=(not args.distributed))
    valid_loader = DataLoader(final_valid_dataset, sampler=valid_sampler, batch_size=args.valid_batch_size, shuffle=False)

But this is failing, because some of my datasets have different num_cands and seq_len; the code above would only work if all datasets shared the same num_cands and seq_len.
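One way to get homogeneous per-dataset batches together with per-rank sharding is a custom batch sampler passed to DataLoader via batch_sampler= (which replaces both sampler and batch_size). A pure-Python sketch; the class name and the simple strided sharding scheme are illustrative, not an existing PyTorch API:

```python
import random

class PerDatasetBatchSampler:
    """Yields batches of flat indices drawn from a single source dataset,
    sharded across distributed ranks so each process sees exclusive data."""

    def __init__(self, dataset_lengths, batch_size, rank=0, world_size=1, seed=0):
        self.lengths = dataset_lengths      # len() of each concatenated dataset
        self.batch_size = batch_size
        self.rank = rank
        self.world_size = world_size
        self.seed = seed                    # bump per epoch for reshuffling

    def __iter__(self):
        rng = random.Random(self.seed)
        offset, batches = 0, []
        for n in self.lengths:
            # flat indices belonging to this dataset only
            idxs = list(range(offset, offset + n))
            rng.shuffle(idxs)
            # strided shard: every rank shuffles identically (same seed),
            # so the slices are disjoint across processes
            shard = idxs[self.rank::self.world_size]
            for i in range(0, len(shard), self.batch_size):
                batches.append(shard[i:i + self.batch_size])
            offset += n
        rng.shuffle(batches)  # interleave the datasets across training steps
        return iter(batches)
```

Every batch then contains indices from exactly one dataset, so the collated tensors have consistent num_cands and seq_len within a batch.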

@patrick-tssn
Copy link

Is there an anticipated update to address this issue? I am convinced that integrating diverse data types plays a crucial role in enhancing current deep learning methodologies.
