
Non blocking torchaudio.load() #1628

Closed · maor121 opened this issue Jul 14, 2021 · 2 comments

maor121 commented Jul 14, 2021

🚀 Feature

Currently, operations on tensors are mostly asynchronous. I suggest adding asynchronous loading from disk:

waveform, sample_rate = torchaudio.load(filepath, non_blocking=True)  # proposed new 'non_blocking' argument

Motivation

Currently, one of PyTorch's most powerful features is its non-blocking execution on tensors. For example, consider this training-loop snippet:

for raw_data, target in data_loader:
    raw_data = raw_data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    raw_data = augmenter(raw_data)          # room impulse response, add noise, etc. on GPU
    features = feature_extractor(raw_data)  # e.g. use torchaudio MelSpectrogram on GPU
    output = model(features)
    loss = criterion(output, target)

    optimizer.zero_grad()                   # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()

This allows the main training loop to run through all of its iterations, scheduling the work to execute on the GPU without syncing GPU and CPU until the very last step.

The problem is that loading tensors from disk is synchronous, which breaks the async loop, so I need my dataloaders to load files asynchronously. Currently in Python I have two ways of loading files asynchronously: multi-thread and multi-process. Multi-thread is slower than expected, and multi-process forces a copy of huge raw audio arrays between processes. But if async loading were implemented in torchaudio's native (C) layer, it could be done efficiently with multiple threads, since the GIL can be released while the native code reads and decodes. A sketch of the thread-based workaround follows.
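Here is a rough sketch of what I mean by thread-based loading (hypothetical file names; the benefit depends on the decoder releasing the GIL during the native read/decode):

from concurrent.futures import ThreadPoolExecutor

import torchaudio

paths = ["clip0.wav", "clip1.wav", "clip2.wav"]  # hypothetical files

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit all loads up front; each future resolves to (waveform, sample_rate).
    futures = [pool.submit(torchaudio.load, p) for p in paths]
    for future in futures:
        waveform, sample_rate = future.result()  # waits for this load only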

In addition, my limited knowledge of CUDA streams tells me that while the model(features) part probably runs on a single stream (so batch N must be trained on before batch N+1), the batches can be prepared on separate CUDA streams, all in parallel; they will simply be consumed in order. A sketch of this pattern is below.
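Something like this (my own untested sketch, assuming a CUDA device is available):

import torch

side_stream = torch.cuda.Stream()

def prepare(batch_cpu):
    # Schedule the host-to-device copy (and any GPU preprocessing) on a side stream.
    with torch.cuda.stream(side_stream):
        return batch_cpu.pin_memory().to('cuda:0', non_blocking=True)

# Before the default stream uses a prepared batch, it must wait on the side
# stream, or it may read a half-copied tensor:
# torch.cuda.current_stream().wait_stream(side_stream)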

This feature would simplify parallel data preparation, which requires loading from disk.

Pitch

Add a non_blocking argument to torchaudio.load (as in the snippet above), so that the read and decode are scheduled asynchronously and the call returns immediately.

Alternatives

Alternative:
A DataLoader object (or DataLoader-like class) that works with multiple threads instead of multiple processes.

Why?
Folklore says that while multithreading in Python is horribly inefficient for CPU-bound work (due to the GIL), it should still be good for I/O. I am unsure whether this is true, but if it is, it means we can load files asynchronously within one process and avoid transferring raw audio signals (huge arrays) between processes. I tried it once (I may have missed something), but multithreaded file loading was slower than single-threaded loading. A minimal version of such a loader is sketched below.
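To make the idea concrete, a minimal thread-based loader might look like this (hypothetical class, not an existing PyTorch API; items come out in completion order, not dataset order):

import queue
import threading

class ThreadedLoader:
    def __init__(self, dataset, num_workers=4, prefetch=8):
        self.dataset = dataset
        self.q = queue.Queue(maxsize=prefetch)      # bounds how far we read ahead
        self.indices = iter(range(len(dataset)))
        self.lock = threading.Lock()
        self.workers = [threading.Thread(target=self._work, daemon=True)
                        for _ in range(num_workers)]

    def _work(self):
        while True:
            with self.lock:
                i = next(self.indices, None)
            if i is None:
                self.q.put(None)             # this worker is done
                return
            self.q.put(self.dataset[i])      # file I/O happens off the main thread

    def __iter__(self):
        for w in self.workers:
            w.start()
        finished = 0
        while finished < len(self.workers):
            item = self.q.get()
            if item is None:
                finished += 1
            else:
                yield item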

Additional context

  1. This thread: https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/3
     It has a note by an NVIDIA engineer saying that running model(data) is a sync point between CPU and GPU; later he says that is
     not true, which is confusing. If it is true, the code above won't schedule batch N+1 before batch N is done, but it could be
     rewritten to schedule all batch preparation in a separate thread, bypassing that restriction.
  2. He also has some notes there on pin_memory in the DataLoader, which I am not sure I understood, and a reference to this NVIDIA
     blog (see the sketch after this list): https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
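As far as I understand, the gist of that blog post is that host-to-device copies can only overlap with compute when the host source is pinned (page-locked) memory, which is what the DataLoader's pin_memory=True provides:

import torch

cpu_batch = torch.randn(32, 16000)                        # ordinary (pageable) host memory
pinned_batch = cpu_batch.pin_memory()                     # page-locked copy of the data
gpu_batch = pinned_batch.to('cuda:0', non_blocking=True)  # genuinely asynchronous copy
# From pageable memory, non_blocking=True falls back to a synchronous copy.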
mthrok (Collaborator) commented Jul 22, 2021

Hi @maor121

Thanks for the suggestion. This is interesting.
I asked around to see whether there is a PyTorch API we could use to implement this feature, but unfortunately this does not seem possible at the moment.

Firstly, .to(non_blocking=True) is only effective when the data is transferred from pinned memory to CUDA, and internally it delegates the non-blocking behavior to the CUDA API. The logic and related interfaces reside in the CUDA realm, and PyTorch does not have any special mechanism around them.

Now, if we want to add non_blocking to torchaudio.load, the simple approach would be to return a surrogate object that represents the resulting Tensor and behaves exactly like a Tensor. That surrogate would have to be compatible with all the operations PyTorch provides (existing ones and ones being added or proposed), which means practically re-implementing the Tensor type. That is not maintainable.
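To illustrate the forwarding burden, a toy surrogate (hypothetical, just to show the problem) would start like this and never end:

import torch

class LazyTensor:
    def __init__(self, future):
        self._future = future  # a Future resolving to a torch.Tensor

    def _materialize(self):
        return self._future.result()  # blocks until the load finishes

    def __add__(self, other):
        return self._materialize() + other
    # ...and so on for every other operation Tensor supports or will support.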

mthrok (Collaborator) commented Jul 29, 2022

In v0.12.0 we added the torchaudio.io.StreamReader class, which breaks the loading operation down into multiple steps, so users can write their own async function.
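For example, a rough sketch (untested; the path and chunk size are placeholders) of driving StreamReader step by step from an asyncio coroutine:

import asyncio

import torch
from torchaudio.io import StreamReader

async def load_async(path, frames_per_chunk=4096):
    reader = StreamReader(src=path)
    reader.add_basic_audio_stream(frames_per_chunk=frames_per_chunk)
    loop = asyncio.get_running_loop()
    chunks = []
    while True:
        # Decode one packet in the default executor so the event loop
        # stays free while the disk read / decode runs.
        eof = await loop.run_in_executor(None, reader.process_packet)
        (chunk,) = reader.pop_chunks()
        if chunk is not None:
            chunks.append(chunk)
        if eof:
            break
    return torch.cat(chunks)

# waveform = asyncio.run(load_async("example.wav"))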

mthrok closed this as completed Jul 29, 2022