
Extend ConcatDataset to return dataset index #32034

Open
ATriantafyllopoulos opened this issue Jan 10, 2020 · 10 comments
Labels
enhancement · module: dataloader · small · triaged

Comments

@ATriantafyllopoulos
Copy link

ATriantafyllopoulos commented Jan 10, 2020

🚀 Feature

This feature would extend torch.utils.data.ConcatDataset to return the data set index from which a sample originated.

Motivation

This would be useful when the user wants to keep track of which data set each sample comes from.

The use case I have in mind is multi-task learning with datasets that contain mutually exclusive labels (e.g. dataset A contains labels for task A, and dataset B contains labels for task B), as opposed to all labels being present in a single dataset.

In this case, multi-task learning is still possible by pooling examples from different sources, provided one keeps track of which source each example in a batch came from. That is the motivation behind this request.

Pitch

The feature would consist of two changes:

  • adding an extra return_index argument to the constructor (defaulting to False)
  • adding dataset_idx to the value returned by __getitem__ when return_index is True (which currently ends with
    return self.datasets[dataset_idx][sample_idx]
    )
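A minimal, pure-Python sketch of the pitch above (it mirrors the index arithmetic torch.utils.data.ConcatDataset performs internally; the class name and the return_index argument are the proposal, not an existing API):

```python
import bisect

class ConcatWithIndex:
    """Sketch of the proposed behavior: optionally return which source
    dataset a sample came from. Mirrors ConcatDataset's index arithmetic."""

    def __init__(self, datasets, return_index=False):
        self.datasets = list(datasets)
        self.return_index = return_index
        # cumulative_sizes[i] = total length of datasets[0..i]
        self.cumulative_sizes = []
        total = 0
        for d in self.datasets:
            total += len(d)
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # find the dataset the flat index falls into, then the local offset
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        sample_idx = idx if dataset_idx == 0 else idx - self.cumulative_sizes[dataset_idx - 1]
        sample = self.datasets[dataset_idx][sample_idx]
        return (sample, dataset_idx) if self.return_index else sample
```

With return_index=True, each item comes back as (sample, dataset_idx), which a default DataLoader collate would turn into a batch of samples plus a batch of dataset indices.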

cc @ssnl

@fmassa
Copy link
Member

fmassa commented Jan 10, 2020

Thanks for opening the issue!

I think it would be more valuable, instead of adding a new constructor argument to the dataset, to add an extra method that computes the dataset_idx and the sample_idx used by __getitem__. Something along the lines of what I did in maskrcnn-benchmark.

Thoughts?
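The helper being proposed can be sketched as a small function over ConcatDataset's cumulative_sizes list (the name locate_index is illustrative; it is the same bisect arithmetic __getitem__ already performs):

```python
import bisect

def locate_index(cumulative_sizes, idx):
    """Map a flat index into a concatenated dataset to (dataset_idx, sample_idx).
    cumulative_sizes[i] is the total length of datasets[0..i]."""
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    if dataset_idx == 0:
        sample_idx = idx
    else:
        sample_idx = idx - cumulative_sizes[dataset_idx - 1]
    return dataset_idx, sample_idx
```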

@fmassa fmassa added the module: dataloader, enhancement, small, and triaged labels Jan 10, 2020
@ATriantafyllopoulos
Copy link
Author

I think it would be more valuable, instead of adding a new constructor argument to the dataset, to add an extra method that computes the dataset_idx and the sample_idx used by __getitem__.

I cannot find any reference to that method outside the class. The way I intend to use this feature is something like:

data = torch.utils.data.ConcatDataset((dataset_A, dataset_B))
loader = torch.utils.data.DataLoader(data, batch_size=10, shuffle=True)
features, labels, dataset_idx = next(iter(loader))

Under your proposal, I can't see how I would access the dataset_idx through a DataLoader. Do you have any thoughts on this?

@fmassa
Copy link
Member

fmassa commented Jan 10, 2020

Yes, you can do something like:

from torch.utils.data import ConcatDataset

class MyConcatDataset(ConcatDataset):
    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        # get_dataset_idx is the proposed helper, not yet in ConcatDataset
        dataset_idx = self.get_dataset_idx(idx)
        return data, dataset_idx

which is trivially done if such a method is available. Plus, it lets you easily configure how / what to return, which is not possible with your method.

@ATriantafyllopoulos
Copy link
Author

Ah, you meant implementing it in such a way that the user would still need to override ConcatDataset to get the intended behavior. I was thinking along the lines of adding that behavior to the base class itself.

I definitely see the merits of restructuring the function in the way you propose, as it will be very useful in any case.

Adding the extra argument I recommended on top of that might make the code harder to understand, so unless there are more use cases for it, I'd skip it for now. Should I go ahead and open a PR with this minor restructuring?

@fmassa
Copy link
Member

fmassa commented Jan 13, 2020

Should I go ahead and open a PR with this minor restructuring?

Yes, I think restructuring the ConcatDataset to add a method similar to get_idxs would be a great addition! Can you send a PR?

@ATriantafyllopoulos
Copy link
Author

I have now implemented the change and tried to send a PR, but I don't have the rights to push to this repo.

I was also wondering about the recommended policy for contributions. Should I open the PR from my own fork, or push my branch directly to this repo? I couldn't find anything about this in the contribution guide, and since I'm new to open-source contributing I don't know the standard process.

@choidongyeon
Copy link

@ATriantafyllopoulos I typically push changes to my own fork and then publish a PR from it. Are you still working on this? If not, I can pick up where you left off.

@ATriantafyllopoulos
Copy link
Author

@ATriantafyllopoulos I typically push changes to my own fork and then publish a PR from it. Are you still working on this? If not, I can pick up where you left off.

Uh, thanks for bumping this. I procrastinated on this far longer than necessary. PR is now open: #39052

@g-karthik
Copy link

g-karthik commented Jul 31, 2020

@fmassa @ATriantafyllopoulos I have a somewhat related question:

Suppose I have N datasets, each with tensors of shape (B, num_cands, seq_len) where B, num_cands and seq_len may vary across the N datasets. I want to run a single DistributedDataParallel training job, where each mini-batch (in each GPU) comes from any one of the N datasets. I do not want a given mini-batch to have examples from multiple datasets, because then I'll get dimension mismatch errors.

How would I go about enabling the above, while at the same time also ensuring each process receives data exclusive to it (for which one typically uses DistributedSampler)?

My data loader code currently looks like the following (where train_datasets and valid_datasets are lists each containing N TensorDataset objects, corresponding to the N datasets):

    final_train_dataset = ConcatDataset(train_datasets)
    final_valid_dataset = ConcatDataset(valid_datasets)

    train_sampler = torch.utils.data.distributed.DistributedSampler(final_train_dataset) if args.distributed else None
    valid_sampler = torch.utils.data.distributed.DistributedSampler(final_valid_dataset) if args.distributed else None
    train_loader = DataLoader(final_train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, shuffle=(not args.distributed))
    valid_loader = DataLoader(final_valid_dataset, sampler=valid_sampler, batch_size=args.valid_batch_size, shuffle=False)

But this is failing, because some of my datasets have different num_cands and seq_len; the code above would only work if all datasets shared the same num_cands and seq_len.
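One way to get homogeneous per-dataset batches together with per-rank sharding is a custom batch sampler passed to DataLoader via batch_sampler= (which replaces both sampler and batch_size). A pure-Python sketch; the class name and the simple strided sharding scheme are illustrative, not an existing PyTorch API:

```python
import random

class PerDatasetBatchSampler:
    """Yields batches of flat indices drawn from a single source dataset,
    sharded across distributed ranks so each process sees exclusive data."""

    def __init__(self, dataset_lengths, batch_size, rank=0, world_size=1, seed=0):
        self.lengths = dataset_lengths      # len() of each concatenated dataset
        self.batch_size = batch_size
        self.rank = rank
        self.world_size = world_size
        self.seed = seed                    # bump per epoch for reshuffling

    def __iter__(self):
        rng = random.Random(self.seed)
        offset, batches = 0, []
        for n in self.lengths:
            # flat indices belonging to this dataset only
            idxs = list(range(offset, offset + n))
            rng.shuffle(idxs)
            # strided shard: every rank shuffles identically (same seed),
            # so the slices are disjoint across processes
            shard = idxs[self.rank::self.world_size]
            for i in range(0, len(shard), self.batch_size):
                batches.append(shard[i:i + self.batch_size])
            offset += n
        rng.shuffle(batches)  # interleave the datasets across training steps
        return iter(batches)
```

Every batch then contains indices from exactly one dataset, so the collated tensors have consistent num_cands and seq_len within a batch.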

@patrick-tssn
Copy link

Is there an anticipated update to address this issue? I am convinced that integrating diverse data types plays a crucial role in enhancing current deep learning methodologies.
