Description
❓ Questions/Help/Support
I've been successfully using ignite with regular `Dataset`/`TensorDataset` classes in the past. These have a fixed length and are tied to a `DataLoader` with a `DistributedSampler`. Keeping all other training hyper-parameters equal, if I increase the number of nodes/GPUs, I've always noticed that the ETA displayed by the `ProgressBar` decreases.
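Roughly what that map-style setup looks like (the tensor shapes, batch size, and variable names here are placeholders, not my actual values):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# Fixed-length, map-style dataset.
features = torch.randn(10_000, 16)
targets = torch.randint(0, 2, (10_000,))
dataset = TensorDataset(features, targets)

# DistributedSampler partitions the indices across ranks, so each rank sees
# roughly len(dataset) / world_size samples per epoch. This assumes
# torch.distributed is already initialized; otherwise num_replicas/rank
# must be passed explicitly.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```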
Then, I switched to an `IterableDataset` where the length was computable in advance, so `__len__` was defined. There is no `DistributedSampler` in this case because the dataset is iterable: the data files are grouped into distinct subsets in advance and assigned to different ranks. In this scenario too, I noticed that, keeping all else equal, the ETA displayed by `ProgressBar` decreases when the number of nodes/GPUs increases. Some earlier discussion on this is here: #1263.
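The sized `IterableDataset` is along these lines (the file list handling and the `parse_file` helper are simplified stand-ins for my actual data pipeline):

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset

def parse_file(path):
    # Stand-in for reading one data file and yielding its samples.
    yield from []

class ShardedFileDataset(IterableDataset):
    def __init__(self, all_files, samples_per_file):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        # Each rank gets a distinct subset of files, so no DistributedSampler is needed.
        self.files = all_files[rank::world_size]
        self.samples_per_file = samples_per_file

    def __len__(self):
        # The total number of samples is computable in advance in this scenario.
        return len(self.files) * self.samples_per_file

    def __iter__(self):
        for path in self.files:
            yield from parse_file(path)
```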
Finally, I came across a setting with a massive dataset where the length (i.e., the number of data points) was not computable in advance. So I removed the `__len__` definition, making the `IterableDataset` more generic.
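The more generic variant is essentially the same class with `__len__` removed (again a simplified sketch):

```python
from torch.utils.data import IterableDataset

def parse_file(path):
    # Same stand-in helper as in the previous sketch.
    yield from []

class UnsizedShardedFileDataset(IterableDataset):
    def __init__(self, files):
        # `files` is already this rank's subset, assigned in advance as before.
        self.files = files

    def __iter__(self):
        for path in self.files:
            yield from parse_file(path)

    # No __len__: the DataLoader built on top of this has no length either,
    # so the trainer needs an explicit epoch_length.
```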
Unfortunately, in this final setting, I find that the ETA displayed by `ProgressBar` doesn't decrease when the number of nodes/GPUs increases. I tried training for a fixed 50000 iterations, i.e., an `epoch_length` of 50000. I notice that if I train on 1 GPU, the ETA is much shorter than if I train on more than 1 GPU. I also notice that the overall time taken per iteration is much lower when 1 GPU is used.
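The run itself looks roughly like this (`train_step` is a placeholder for my actual update function, and the generator stands in for the DataLoader over the unsized dataset):

```python
from ignite.engine import Engine
from ignite.contrib.handlers import ProgressBar

def train_step(engine, batch):
    # Placeholder for the real forward/backward/optimizer step.
    return 0.0

def batches():
    # Stand-in for iterating the DataLoader over the unsized IterableDataset.
    while True:
        yield None

trainer = Engine(train_step)
ProgressBar().attach(trainer)

# epoch_length is given explicitly because the data has no usable __len__;
# each rank runs the full 50000 iterations itself.
trainer.run(batches(), max_epochs=1, epoch_length=50000)
```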
I'm confused by this behavior; it doesn't seem like I'm doing anything incorrect. Could you please explain what may be happening?