🚀 Feature
Currently, operations on tensors are mostly asynchronous. I suggest adding asynchronous loading from disk:
tensor = torchaudio.load(filepath, non_blocking=True)  # new 'non_blocking' argument
Motivation
Currently, one of PyTorch's most powerful features is its non-blocking execution on tensors. For example, consider this code snippet for a training loop:
for raw_data, target in data_loader:
    raw_data = raw_data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    raw_data = augmenter(raw_data)          # room impulse response, add noise, etc., on GPU
    features = feature_extractor(raw_data)  # e.g. use torchaudio MelSpectrogram on GPU
    output = model(features)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This allows the main training loop to run through all its iterations and schedule the work to execute on the GPU, without syncing GPU and CPU until the very last step.
But the problem is that loading tensors from disk is synchronous, breaking the async loop, so I need my dataloaders to load files asynchronously. Currently in Python, I have two ways of loading files asynchronously: multi-threading and multi-processing.
The problem is that multi-threading is slower than expected, and multi-processing forces a copy of huge raw audio arrays between processes. But if async loading is implemented in torchaudio in C, it can be done efficiently with multi-threading.
In addition, my limited knowledge of CUDA streams tells me that while the model(features) part probably happens on a single stream (and therefore batch N must be trained on before batch N+1), the batches are prepared on separate CUDA streams and can all be prepared in parallel; they will just be trained on in order.
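For illustration, a minimal sketch of preparing a batch on a side stream (assuming the batch is already in pinned CPU memory, and reusing augmenter and feature_extractor from the snippet above):

import torch

prep_stream = torch.cuda.Stream()  # side stream dedicated to data preparation

def prepare_on_side_stream(raw_data):
    # Queue the copy and preprocessing on a separate CUDA stream so they can
    # overlap with training work queued on the default stream.
    with torch.cuda.stream(prep_stream):
        raw_data = raw_data.to('cuda:0', non_blocking=True)
        raw_data = augmenter(raw_data)
        features = feature_extractor(raw_data)
    # Make the default stream wait for the preparation before training uses the result.
    torch.cuda.current_stream().wait_stream(prep_stream)
    return features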
This feature simplifies parallel data preparation, which requires loading from disk.
Pitch
Alternatives
Alternative:
A DataLoader object (or something DataLoader-like) that works with multi-threading instead of multi-processing.
Why?
Folklore tells that while multi-threading in Python is horribly inefficient (due to the GIL), it should still be good for IO. I am unsure if this is true, but if it is, it means that we can load files asynchronously in one process and avoid the transfer of raw audio signals (huge arrays) between processes. I tried it once (I may have missed something), but multi-threaded file loading was slower than single-threaded file loading.
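For reference, a minimal sketch of that kind of single-process, thread-based loading (the file paths and thread count are placeholders; whether it beats a single thread depends on how much of the decoding work holds the GIL):

from concurrent.futures import ThreadPoolExecutor

import torchaudio

def load_files(filepaths, num_threads=4):
    # Load several files concurrently inside one process, so the decoded
    # waveforms never have to be copied between worker processes.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(torchaudio.load, filepaths))
    return [waveform for waveform, sample_rate in results]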
Additional context
This thread: https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/3
It has a note by one of NVIDIA's engineers that running model(data) is a sync point between CPU and GPU. Later he says it's not true, and it's confusing. If it is true, the above code won't schedule batch N+1 before batch N is done. It can be rewritten to schedule all batch preparation in a separate thread, bypassing that restriction (see the sketch below).
https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
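A rough sketch of that workaround (prepare_batch is a hypothetical helper doing the copy/augment/feature-extraction steps from the snippet above):

import queue
import threading

def prefetched(data_loader, prepare_batch, depth=2):
    # Prepare batches in a background thread so the main thread only
    # dequeues ready batches and keeps issuing GPU work.
    q = queue.Queue(maxsize=depth)

    def worker():
        for raw_data, target in data_loader:
            q.put((prepare_batch(raw_data), target))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not None:
        yield item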
Thanks for the suggestion. This is interesting.
I asked around if there is a PyTorch API that we can use to fulfill this feature, but unfortunately this does not seem possible at the moment.
Firstly, .to(non_blocking=True) is only effective when the data is transferred from pinned memory to CUDA, and internally it delegates the non-blocking behavior to the CUDA API. The logic and related interfaces reside in the CUDA realm, and PyTorch does not have any special mechanism around it.
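To illustrate that point (a minimal sketch; the dataset shape and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16000))
# pin_memory=True makes the DataLoader return batches in page-locked memory,
# which is what lets .to(..., non_blocking=True) become a true async copy.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for (batch,) in loader:
    batch = batch.to('cuda:0', non_blocking=True)  # overlaps with CPU work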
Now, if we want to add non_blocking to torchaudio.load, then the simple approach would be to return a surrogate object that represents the resulting Tensor and behaves exactly like a Tensor. This means that the surrogate object has to be compatible with all the operations (existing ones and ones being added/proposed) that PyTorch provides, which means we would practically re-implement the Tensor type, which is not maintainable.
In v0.12.0 we added the torchaudio.io.StreamReader class, which breaks the loading operation down into multiple steps, so users can write their own async function.
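For example, something along these lines (a rough sketch; the chunk size is arbitrary and the cooperative-yield point is just one way to make it async):

import asyncio

import torch
from torchaudio.io import StreamReader

async def load_async(filepath, frames_per_chunk=4096):
    # Decode the file chunk by chunk, yielding control back to the event
    # loop between chunks so other loads can make progress concurrently.
    reader = StreamReader(filepath)
    reader.add_basic_audio_stream(frames_per_chunk=frames_per_chunk)
    chunks = []
    for (chunk,) in reader.stream():
        chunks.append(chunk)
        await asyncio.sleep(0)  # cooperative yield point
    return torch.cat(chunks)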