Equivalent of get_worker_info to split an IterableDataset #7667

Closed

davidaknowles opened this issue Jul 10, 2024 · 20 comments

Comments

@davidaknowles

❓ Questions and Help

I have an IterableDataset of unknown size. I would like to use something like torch.utils.data.get_worker_info to split it across the spawned xmp processes, but AFAIK there is no equivalent in xla_multiprocessing. Is there a workaround? I tried randomly subsampling on each process but this hangs for me for some reason.

@JackCaoG
Collaborator

I think you are looking for:

xla/torch_xla/runtime.py

Lines 135 to 192 in 34736f0

@requires_pjrt
def local_process_count() -> int:
  """Returns the number of processes running on this host."""
  return xu.getenv_as(xenv.PJRT_LOCAL_PROCESS_COUNT, int, defval=1)


@requires_pjrt
def global_device_count() -> int:
  """Returns the total number of devices across all processes/hosts."""
  return len(torch_xla._XLAC._xla_get_all_devices())


@requires_pjrt
def world_size() -> int:
  """Returns the total number of processes participating in the job."""
  if torch_xla._XLAC._xla_get_replication_devices_count() == 0:
    return 1
  return global_device_count()


@requires_pjrt
def local_device_count() -> int:
  """Returns the total number of devices on this host.

  Assumes each process has the same number of addressable devices.
  """
  return local_process_count() * addressable_device_count()


@requires_pjrt
def addressable_device_count() -> int:
  """Returns the number of devices visible to this process."""
  return torch_xla._XLAC._xla_num_devices()


@requires_pjrt
def global_ordinal() -> int:
  """Returns global ordinal of this thread within all processes.

  Global ordinal is in range [0, global_device_count). Global ordinals are not
  guaranteed to have any predictable relationship to the TPU worker ID nor are
  they guaranteed to be contiguous on each host."""
  return torch_xla._XLAC._xla_get_default_device_ordinal()


@requires_pjrt
def local_ordinal() -> int:
  """Returns local ordinal of this thread within this host.

  Local ordinal is in range [0, local_device_count)."""
  local_rank = xu.getenv_as(xenv.PJRT_LOCAL_PROCESS_RANK, int, 0)
  devices_per_process = addressable_device_count()
  return local_rank * devices_per_process + xla_device().index


@requires_pjrt
def process_index() -> int:
  return torch_xla._XLAC._xla_get_process_index()

For the up-to-date master API you can also check https://pytorch.org/xla/master/#module-torch_xla.runtime
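For example, here is a minimal sketch (illustrative only; ShardedIterable and source are made-up names, not part of the torch_xla API) of using these runtime functions to split an iterable stream round-robin across processes:

from torch.utils.data import IterableDataset
import torch_xla.runtime as xr

class ShardedIterable(IterableDataset):
    def __init__(self, source):
        super().__init__()
        self.source = source  # any iterable of samples

    def __iter__(self):
        rank = xr.global_ordinal()  # this process's global ordinal
        world = xr.world_size()     # number of participating processes
        for i, sample in enumerate(self.source):
            if i % world == rank:   # round-robin assignment to processes
                yield sample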

@JackCaoG
Collaborator

@will-cromar @zpcore do you know where torch.utils.data gets that info? Wondering if we can do some mapping and also support that API.

@zpcore
Collaborator

zpcore commented Jul 12, 2024

The worker attributes are set up when we initialize the dataloader:
https://github.com/pytorch/pytorch/blob/7c289c2a5c4e2233251565afadc2d95acf64b8c1/torch/utils/data/dataloader.py#L1113-L1128

Since we are using torch's dataloader:

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=FLAGS.batch_size,
    sampler=train_sampler,
    drop_last=FLAGS.drop_last,
    shuffle=False if train_sampler else True,
    num_workers=FLAGS.num_workers,
    persistent_workers=FLAGS.persistent_workers,
    prefetch_factor=FLAGS.prefetch_factor)
test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=FLAGS.test_set_batch_size,
    sampler=test_sampler,
    drop_last=FLAGS.drop_last,
    shuffle=False,
    num_workers=FLAGS.num_workers,
    persistent_workers=FLAGS.persistent_workers,
    prefetch_factor=FLAGS.prefetch_factor)
I think it should contain the worker info. I can run a test on real data to see whether it is there or not.
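As a sketch of how the two levels could combine (TwoLevelShardedDataset is an illustrative name, not an existing API; this assumes the DataLoader is created inside each XLA process), torch.utils.data.get_worker_info() still identifies the DataLoader worker within a process, while torch_xla.runtime identifies the process itself:

from torch.utils.data import IterableDataset, get_worker_info
import torch_xla.runtime as xr

class TwoLevelShardedDataset(IterableDataset):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        info = get_worker_info()  # per-DataLoader-worker info, None in the main process
        num_workers = info.num_workers if info is not None else 1
        worker_id = info.id if info is not None else 0
        # Flatten the two levels into a single shard index:
        # XLA process ordinal first, then DataLoader worker id.
        shard = xr.global_ordinal() * num_workers + worker_id
        num_shards = xr.world_size() * num_workers
        for i in range(self.n):
            if i % num_shards == shard:
                yield i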

@davidaknowles
Author

Thanks, those were the functions I was looking for. A cartoon version of my solution is the following:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

class MyDataset(torch.utils.data.IterableDataset):

    def __init__(self):
        super().__init__()
        self.N = 100
        self.data = torch.rand(self.N, 30) 

    def __iter__(self): 
        for i in range(self.N): 
            if i % xr.world_size() == xm.get_ordinal(): 
                yield self.data[i]

def _mp_fn_(index): 

    device = xm.xla_device()
    dataset = MyDataset()
    dataloader = torch.utils.data.DataLoader(dataset, batch_size = 10)
    device_loader = pl.MpDeviceLoader(dataloader, device)

    for epoch in range(3): 
        mysum = torch.tensor(0., device = device) 
        for batch in device_loader: 
            mysum += batch.sum()
        sumsum = xm.all_reduce(xm.REDUCE_SUM, mysum).item()
        print(epoch, sumsum)

if __name__ == '__main__':
    xmp.spawn(_mp_fn_)  # one process per local device

This runs fine. My new issue is that on my real data it hangs when I hit the .item() call. mysum here is meant to be the total loss for the data processed on the current device, and sumsum is the total loss for the epoch (across all devices). Maybe there's a better pattern for getting the total loss?

@JackCaoG
Collaborator

Can you always do an xm.mark_step() or torch_xla.sync() before you do the .item() call? It is always recommended to flush the pending executions before accessing the value of a tensor.
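For reference, a minimal sketch of that suggestion applied to the cartoon example above (epoch_loss is an illustrative helper, not an existing API):

import torch
import torch_xla.core.xla_model as xm

def epoch_loss(device_loader, device):
    # Accumulate the per-device loss, all-reduce it, then flush pending
    # executions before reading the scalar back to the host.
    total = torch.tensor(0., device=device)
    for batch in device_loader:
        total += batch.sum()
    total = xm.all_reduce(xm.REDUCE_SUM, total)
    xm.mark_step()       # or torch_xla.sync(); flush before .item()
    return total.item()  # reads an already-materialized value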

@davidaknowles
Author

Hmm, so now it hangs on that mark_step() instead. Well, it gets past the mark_step() on one device, but the other 3 hang.

@JackCaoG
Collaborator

That's... interesting. It usually means the graph is different for each device. Can you dump the HLO following https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations? You should see multiple files.

@davidaknowles
Author

Hmm, each device could end up processing a (slightly) different number of batches; I suppose that technically makes the graph different? I'll figure out getting the HLO and report back.

@davidaknowles
Author

OK HLO files are here. LMK if anything looks suspicious! In the meantime I'll see if I can get eager mode -> compilation going with the nightly build.

@davidaknowles
Author

The nightly build's (2.5.something) torch_xla.distributed.xla_multiprocessing is only giving me access to 1 of 4 devices; is that expected?

@JackCaoG
Collaborator

JackCaoG commented Jul 19, 2024

Hmm, no, that's not expected. I am on nightly, and if I do

python examples/data_parallel/train_resnet_xla_ddp.py

I can see 4 processes

epoch: 1, step: 190, loss: 6.7231669425964355, rate: 1746.296055355192
epoch: 1, step: 190, loss: 6.705419540405273, rate: 1746.3170991653592
epoch: 1, step: 190, loss: 6.700830459594727, rate: 1745.7355188993108
epoch: 1, step: 190, loss: 6.731178283691406, rate: 1746.154144282245

(each process prints its own loss)

@JackCaoG
Collaborator

BTW, I checked your HLO; the last computation is the same:

HloModule IrToHlo.14, entry_computation_layout={(f32[], f32[])->(f32[])}

%AddComputation.6 (x.7: f32[], y.8: f32[]) -> f32[] {
  %x.7 = f32[] parameter(0)
  %y.8 = f32[] parameter(1)
  ROOT %add.9 = f32[] add(f32[] %x.7, f32[] %y.8)
}

ENTRY %IrToHlo.14 (p0.1: f32[], p1.2: f32[]) -> (f32[]) {
  %p1.2 = f32[] parameter(1), metadata={op_type="xla__device_data" op_name="xla__device_data" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %p0.1 = f32[] parameter(0), metadata={op_type="xla__device_data" op_name="xla__device_data" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %tuple.3 = (f32[], f32[]) tuple(f32[] %p1.2, f32[] %p0.1), metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %get-tuple-element.4 = f32[] get-tuple-element((f32[], f32[]) %tuple.3), index=0, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %get-tuple-element.5 = f32[] get-tuple-element((f32[], f32[]) %tuple.3), index=1, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %all-reduce.10 = (f32[], f32[]) all-reduce(f32[] %get-tuple-element.4, f32[] %get-tuple-element.5), replica_groups={}, constrain_layout=true, to_apply=%AddComputation.6, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %get-tuple-element.12 = f32[] get-tuple-element((f32[], f32[]) %all-reduce.10), index=1, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  %get-tuple-element.11 = f32[] get-tuple-element((f32[], f32[]) %all-reduce.10), index=0, metadata={op_type="xla__cross_replica_sum" op_name="xla__cross_replica_sum" source_file="/home/daknowles/.local/lib/python3.10/site-packages/torch/_ops.py" source_line=854}
  ROOT %tuple.13 = (f32[]) tuple(f32[] %get-tuple-element.11)
}

which is just a simple all_reduce. I can't really tell why it hangs. Do you have a repo I can try on my end? The model code can just be dummy model code, or you can use one of my examples in https://github.com/pytorch/xla/blob/master/examples/data_parallel/train_resnet_spmd_data_parallel.py

@davidaknowles
Author

Hi @JackCaoG - I made a minimal branch of my repo here. Hopefully it's straightforward to test with the info in the README. Thanks!

@JackCaoG
Collaborator

Thanks, let me take a look tomorrow.

@JackCaoG
Collaborator

I am able to repro; let me look into it a bit.

@JackCaoG
Collaborator

One thing I realized by running

alias save_hlo="XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_SAVE_TENSORS_FMT='hlo' XLA_SAVE_TENSORS_FILE='/tmp/save1.hlo'"
alias cpplog="TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE=\"xla_graph_executor=5,pjrt_computation_client=3\""

cpplog save_hlo PT_XLA_DEBUG=1 python train.py

is that each process runs a different number of graphs. In process 0 (by checking /tmp/save1.hlo.0) it executes 36 graphs; in process 1 it is 37.

This explains why all_reduce will hang: the number of graphs is different. Is your dataloader set up in a way that each process gets a different number of batches?

@davidaknowles
Author

Yes, definitely possible. I don't think it would differ by more than one batch, but presumably that's bad enough.

Is there an easy workaround for that, or should I just set up the dataloader to ensure an equal number of batches?

Thanks

@JackCaoG
Collaborator

JackCaoG commented Jul 29, 2024

We usually just drop the last batch so that every process executes the same thing. Each process is required to execute the same number of graphs, otherwise collective ops will get confused. For example, if TPU:1 expects TPU:0 to join an all_reduce but TPU:0 has moved on to a new graph that is trying to all_reduce a different tensor, that will either produce an incorrect result or hang forever.
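A minimal sketch of that, reusing the dataset from the cartoon example above (note that drop_last only trims the final partial batch within each process; it does not by itself equalize batch counts across processes when the per-process iterables have different lengths):

import torch

# drop_last trims the trailing partial batch on this process, so every batch
# that is emitted has exactly batch_size samples.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=10, drop_last=True)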

@davidaknowles
Author

Yup, makes sense. It's slightly less straightforward to do here since it's an IterableDataset where I don't know the total number of samples (or, equivalently, batches) globally. Now that I understand what the issue is, I'm sure I can figure something out.

@davidaknowles
Author

You can close this. In case others find this useful, this is my modified training loop to end the epoch as soon as any individual device is done/exhausted:

      enum_dataloader = enumerate(dataloader)
      
      while True:

          try: 
              step_i, dat = next(enum_dataloader)
              done = torch.tensor(0, dtype=torch.int32, device = self.device)
          except StopIteration:
              done = torch.tensor(1, dtype=torch.int32, device = self.device)
          
          # Synchronize the flag across all workers
          done = xm.all_reduce(xm.REDUCE_MAX, done)
          
          # If any worker is done (including me), break the loop
          if done.item() == 1:
              break 

          # remaining training code...
