
ray.torch.train.prepare_data_loader #38115

Closed · MMorente opened this issue Aug 4, 2023 · 6 comments

Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments


MMorente commented Aug 4, 2023

What happened + What you expected to happen

I have a local Ray cluster and I am using PyTorch DataLoaders. As described in the docs, I use ray.train.torch.prepare_data_loader to prepare the dataloader for Ray. This works well when each worker uses only one GPU, but if I specify more than one GPU per worker in the ScalingConfig, prepare_data_loader fails with:

device = get_device() 
self._auto_transfer = auto_transfer if device.type == "cuda" else False
AttributeError: 'list' object has no attribute 'type'
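
For illustration, the failure can be reproduced in isolation (the device list below is my assumption about what get_device() returns when a worker holds two GPUs):

import torch

# Assumed return value of ray.train.torch.get_device() for a worker with two GPUs:
device = [torch.device("cuda:0"), torch.device("cuda:1")]

# Equivalent of the failing line above; raises AttributeError: 'list' object has no attribute 'type'
auto_transfer = True if device.type == "cuda" else False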

My cluster has two nodes and each one has 4 GPUs.

Versions / Dependencies

Ray: 2.6.1.
Python: 3.9.5
OS: Unix

Reproduction script

from torch.utils.data import DataLoader, Dataset
import numpy as np

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

class TestDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.data = np.random.random_sample((10, 1))

    def __len__(self):
        # Map-style datasets should define __len__ so the default samplers can use them.
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

def train_func(config: dict) -> None:
    dataset = TestDataset()
    dataloader = DataLoader(dataset)
    # Fails here when the worker is assigned more than one GPU.
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    print("done")

if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=1, use_gpu=True, resources_per_worker={"GPU":2})
    )
    trainer.fit()
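
A possible interim workaround (a sketch only, which I have not verified; it assumes the failing device wrapping is applied only when move_to_device=True) is to skip the automatic device transfer and move batches manually:

def train_func_workaround(config: dict) -> None:
    dataset = TestDataset()
    dataloader = DataLoader(dataset)
    # Skip Ray's device-moving wrapper, which is where the AttributeError is raised.
    dataloader = ray.train.torch.prepare_data_loader(dataloader, move_to_device=False)

    device = ray.train.torch.get_device()
    if isinstance(device, list):
        # Assumption: with multiple GPUs per worker, get_device() returns a list; use the first entry.
        device = device[0]

    for batch in dataloader:
        batch = batch.to(device)  # move each batch to the chosen device manually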


Issue Severity

High: It blocks me from completing my task.

MMorente added the bug and triage labels on Aug 4, 2023.
xwjiang2010 added the P2, train, air, and P1 labels and removed the triage and P2 labels on Aug 4, 2023.
xwjiang2010 (Contributor) commented

I verified that the bug exists on 2.6.1, and probably on master as well. We didn't do a good job of testing the multiple-GPUs-per-worker case.

The following needs to happen:

  • Crystallize the get_device semantics. Across multiple versions of this file, I see the semantics change between returning a list and returning a single device.
  • Update the get_device docstring to reflect that clearly.
  • Update the logic in WrappedDataLoader to accommodate the fact that the device can be a list.

cc @matthewdeng for triaging visibility. Marking this as P1 for now.

@MMorente would you like to contribute a PR to fix this? I think the fix can be as simple as

device = get_device()
if isinstance(device, list):
    device = device[0]
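
For illustration, the same guard as a standalone sketch (the list input is hypothetical, standing in for what get_device() returns on a multi-GPU worker; _normalize_device is not an existing Ray helper):

import torch

def _normalize_device(device):
    # get_device() may return a list when a worker holds multiple GPUs;
    # fall back to the first entry so downstream .type checks keep working.
    if isinstance(device, list):
        device = device[0]
    return device

assert _normalize_device([torch.device("cuda:0"), torch.device("cuda:1")]).type == "cuda"
assert _normalize_device(torch.device("cpu")).type == "cpu"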

matthewdeng (Contributor) commented

@xwjiang2010 can you update the docs to reflect the behavior?

MMorente (Author) commented Aug 4, 2023

> @MMorente would you like to contribute a PR to fix this?

@xwjiang2010 I have just created a pull request #38127 with the change.

anyscalesam (Collaborator) commented

@matthewdeng can you re-review and advise on next steps?

matthewdeng (Contributor) commented

This will be fixed by #42314, cc @woshiyyya to double check.

woshiyyya (Member) commented

Yes, this will be fixed by #42314.
