🚀 Feature
Currently, operations on tensors are mostly asynchronous. I suggest adding asynchronous loading from disk:
tensor = torchaudio.load(filepath, non_blocking=True)  # new 'non_blocking' argument
Motivation
Currently, one of PyTorch's most powerful features is its non-blocking execution on tensors. For example, consider this code snippet for a training loop:
for raw_data, target in data_loader:
    raw_data = raw_data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    raw_data = augmenter(raw_data)          # room impulse response, add noise, etc., on GPU
    features = feature_extractor(raw_data)  # e.g. use torchaudio MelSpectrogram on GPU
    output = model(features)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This allows the main training loop to run through all its iterations and schedule the work to execute on the GPU, without syncing GPU and CPU until the very last step.
But the problem is that loading tensors from disk is synchronous, breaking the async loop, so I need my dataloaders to load files asynchronously. Currently in Python, I have two ways of loading files asynchronously: multi-threading and multi-processing.
The problem is that multi-threading is slower than expected, and multi-processing forces a copy of huge raw audio arrays between processes. But if async loading is implemented in torchaudio in C, it can be done efficiently with multi-threading.
In addition, my limited knowledge of CUDA streams tells me that while the model(features) part probably happens on a single stream (and therefore batch N must be trained on before batch N+1), the batches are prepared on separate CUDA streams and can all be prepared in parallel; they will just be trained on in order.
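For illustration, a minimal sketch of preparing a batch on a side stream (assuming the batch is already in pinned CPU memory, and reusing augmenter and feature_extractor from the snippet above):

import torch

prep_stream = torch.cuda.Stream()  # side stream dedicated to data preparation

def prepare_on_side_stream(raw_data):
    # Queue the copy and preprocessing on a separate CUDA stream so they can
    # overlap with training work queued on the default stream.
    with torch.cuda.stream(prep_stream):
        raw_data = raw_data.to('cuda:0', non_blocking=True)
        raw_data = augmenter(raw_data)
        features = feature_extractor(raw_data)
    # Make the default stream wait for the preparation before training uses the result.
    torch.cuda.current_stream().wait_stream(prep_stream)
    return features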
This feature simplifies parallel data preparation, which requires loading from disk.
Pitch
Alternatives
Alternative:
A DataLoader object (or something DataLoader-like) that works with multi-threading instead of multi-processing.
Why?
Folklore tells that while multi-threading in Python is horribly inefficient (due to the GIL), it should still be good for IO. I am unsure if this is true, but if it is, it means that we can load files asynchronously in one process and avoid the transfer of raw audio signals (huge arrays) between processes. I tried it once (I may have missed something), but multi-threaded file loading was slower than single-threaded file loading.
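For reference, a minimal sketch of that kind of single-process, thread-based loading (the file paths and thread count are placeholders; whether it beats a single thread depends on how much of the decoding work holds the GIL):

from concurrent.futures import ThreadPoolExecutor

import torchaudio

def load_files(filepaths, num_threads=4):
    # Load several files concurrently inside one process, so the decoded
    # waveforms never have to be copied between worker processes.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(torchaudio.load, filepaths))
    return [waveform for waveform, sample_rate in results]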
Additional context
This thread: https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/3
It has a note by one of NVIDIA's engineers that running model(data) is a sync point between CPU and GPU. Later he says it's not true, and it's confusing. If it is true, the above code won't schedule batch N+1 before batch N is done. It can be rewritten to schedule all batch preparation in a separate thread, bypassing that restriction (see the sketch below).
https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
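A rough sketch of that workaround (prepare_batch is a hypothetical helper doing the copy/augment/feature-extraction steps from the snippet above):

import queue
import threading

def prefetched(data_loader, prepare_batch, depth=2):
    # Prepare batches in a background thread so the main thread only
    # dequeues ready batches and keeps issuing GPU work.
    q = queue.Queue(maxsize=depth)

    def worker():
        for raw_data, target in data_loader:
            q.put((prepare_batch(raw_data), target))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not None:
        yield item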
Thanks for the suggestion. This is interesting.
I asked around if there is a PyTorch API that we can use to fulfill this feature, but unfortunately this does not seem possible at the moment.
Firstly, .to(non_blocking=True) is only effective when the data is transferred from pinned memory to CUDA, and internally it delegates the non-blocking behavior to the CUDA API. The logic and related interfaces reside in the CUDA realm, and PyTorch does not have any special mechanism around it.
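To illustrate that point (a minimal sketch; the dataset shape and batch size are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16000))
# pin_memory=True makes the DataLoader return batches in page-locked memory,
# which is what lets .to(..., non_blocking=True) become a true async copy.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for (batch,) in loader:
    batch = batch.to('cuda:0', non_blocking=True)  # overlaps with CPU work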
Now, if we want to add non_blocking to torchaudio.load, then the simple approach would be to return a surrogate object that represents the resulting Tensor and behaves exactly like a Tensor. This means that the surrogate object has to be compatible with all the operations (existing ones and ones being added/proposed) that PyTorch provides, which means we would practically re-implement the Tensor type, which is not maintainable.
In v0.12.0 we added the torchaudio.io.StreamReader class, which breaks the loading operation down into multiple steps, so users can write their own async function.
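For example, something along these lines (a rough sketch; the chunk size is arbitrary and the cooperative-yield point is just one way to make it async):

import asyncio

import torch
from torchaudio.io import StreamReader

async def load_async(filepath, frames_per_chunk=4096):
    # Decode the file chunk by chunk, yielding control back to the event
    # loop between chunks so other loads can make progress concurrently.
    reader = StreamReader(filepath)
    reader.add_basic_audio_stream(frames_per_chunk=frames_per_chunk)
    chunks = []
    for (chunk,) in reader.stream():
        chunks.append(chunk)
        await asyncio.sleep(0)  # cooperative yield point
    return torch.cat(chunks)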