
[tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error #5740

Closed · danwexler opened this issue Oct 18, 2021 · 10 comments
Labels: comp:node.js, type:bug (Something isn't working)

danwexler commented Oct 18, 2021

Using tfjs-node-gpu on a GKE cluster running on an n1-highmem-8 with an NVIDIA P4 or V100 GPU fails when the cuda_malloc_async allocator is selected via TF_GPU_ALLOCATOR.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 64bit
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: none
  • TensorFlow.js installed from (npm or script link): npm
  • TensorFlow.js version (use command below): 3.9.0
  • Browser version: none, only tested in node
  • Node version: 14.15.3
  • Tensorflow.js Converter Version: none

Describe the current behavior

The app is a video filter that applies a super-resolution layers model to each frame of a video file, batching N frames together into a Tensor4D to scale the resolution up 4x. I run tf.memory() after each frame to ensure that I am not leaking any tensors. After correctly processing slightly more than 100 frames at 1280x720, TF bails out, dumps the memory allocations, and displays the message:

If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.

However, when I do set TF_GPU_ALLOCATOR=cuda_malloc_async, my normally correct startup process fails with:

tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:72] Failed to get device attribute: CUDA error: invalid argument (CUDA_ERROR_INVALID_VALUE)
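For reference, here is roughly how I set the variable from Node (a minimal sketch; it assumes the variable only needs to be in the environment before the native binding loads, and it could equally be set in the Dockerfile or pod spec):

// Sketch: export the allocator setting before the native TF runtime initializes.
process.env.TF_GPU_ALLOCATOR = 'cuda_malloc_async'
const tf = require('@tensorflow/tfjs-node-gpu')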

Describe the expected behavior
My primary issue is being able to run model.predict() on several hundred video frames, grouped into batches, without running out of memory. I have eliminated any tensor leaks according to tf.memory(), so I'm not sure what to try next. I have seen discussions mentioning tf.engine().startScope()/endScope(), and I can also try dispose()ing my model every N frames and re-loading it, or even tf.engine().reset() every N frames, but these seem like band-aids for internal TFJS issues.

I do not explicitly allocate any TF variables within my code, so I do not expect tf.disposeVariables() to help. Is it possible that the model allocates variables internally that would benefit from running tf.disposeVariables() every frame?

I repeat the same allocation pattern for each video frame batch, but I cannot find any way of re-using the existing Tensors to avoid fragmentation.
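As a concrete example of the scoping band-aid mentioned above, this is a minimal sketch of what I could wrap around each batch (processBatch and numBatches are placeholders, not my actual code):

// Scope each batch so any intermediate tensor that is not explicitly
// kept gets released when the scope ends (inside an async function).
for (let b = 0; b < numBatches; ++b) {
  tf.engine().startScope()
  try {
    await processBatch(b) // builds the T4D, runs predict(), saves frames
  } finally {
    tf.engine().endScope()
  }
  console.log(tf.memory()) // tensor counts should stay flat
}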

Standalone code to reproduce the issue
Producing repro code is possible but would take significant effort. If there are no simple answers to this issue, I will take the time to mock up a repro.

Basically, I start by decoding frames into separate files using ffmpeg. Then the processing loop pre-fetches the next batch of N frames (N is typically 1-10) into a Tensor4D by loading the individual frames:

const stack = []
for (let i = 0; i < N; ++i) {
  stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3)) // filename = path to frame i
}
const t4d = tf.stack(stack)

Once pre-fetched, processing is just: superresModel.predict(t4d)

Once the output batch is finished, I extract the individual frames and save them back to new output files using:

const saveTensor3DAsImageFile = async (tensor, frameIdx, dstExpr) => {
  const executedImage = await tf.node.encodePng(tensor)
  tensor.dispose()
  const filename = sprintf(dstExpr, frameIdx) // image output path
  fs.writeFileSync(filename, executedImage)
}
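Putting those pieces together, the per-batch step is roughly this sketch (run inside an async function; dstExpr as in the function above, and batchStartFrame is the first frame index of the batch):

// Run the model, split the output batch back into frames, write each one.
const output = superresModel.predict(t4d) as tf.Tensor4D
t4d.dispose()
const frames = tf.unstack(output) as tf.Tensor3D[]
output.dispose()
for (let i = 0; i < frames.length; ++i) {
  await saveTensor3DAsImageFile(frames[i], batchStartFrame + i, dstExpr) // disposes each frame
}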

After all batches are finished, I just call ffmpeg again to re-encode the output frame files.

Other info / logs

err.log
nvidia_smi.log

@danwexler danwexler added the type:bug Something isn't working label Oct 18, 2021
@danwexler danwexler changed the title cuda_malloc_async fails with CUDA device attribute error [tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error Oct 19, 2021
@danwexler danwexler changed the title [tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error [tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error Oct 19, 2021
pyu10055 (Collaborator) commented
@danwexler Just want to make sure there are no memory leaks in your preprocessing code:

for (let i = 0; i < N; ++i) stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3))

I assume you have disposed the tensors within the stack array? Can you show the tf.memory() output before and after the inference?
Thanks
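For example, something like this (just a sketch, using your variable names):

console.log('before predict', tf.memory())
const out = superresModel.predict(t4d) as tf.Tensor4D
console.log('after predict', tf.memory())
// ... save frames, dispose out and t4d ...
console.log('after dispose', tf.memory())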

danwexler (Author) commented
Yes, apologies, I was just mocking the real function in the bug report. As I said, I print tf.memory() after each frame to ensure there are no additional tensors or bytes allocated. Here's the full code for my pre-fetch function:

const stackTensors = (imagesExp: string, batchStartFrame: number, batchSize: number) => {
  const tensors: Array<tf.Tensor3D> = []
  for (let j = 0; j < batchSize; ++j) {
    const idx = batchStartFrame + j + 1
    const frame = sprintf(imagesExp, idx)
    const tensor: tf.Tensor3D = loadImageAsTensor3D(frame)
    if (tensor) tensors.push(tensor)
  }
  const batch: tf.Tensor4D = <tf.Tensor4D> tf.stack(tensors)
  tensors.forEach(tensor => tensor.dispose())
  return batch
}
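For context, the driver loop around it is essentially this (a simplified sketch, not the literal code):

// Prefetch a batch, upscale it, then release input and output tensors
// before the next batch.
for (let b = 0; b < numBatches; ++b) {
  const batch = stackTensors(imagesExp, b * batchSize, batchSize)
  const upscaled = superresModel.predict(batch) as tf.Tensor4D
  batch.dispose()
  // ... unstack `upscaled` and save each frame via saveTensor3DAsImageFile ...
  upscaled.dispose()
  console.log(tf.memory()) // stays flat from batch to batch
}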

danwexler (Author) commented
Here's a typical output from tf.memory():

2021-10-15T01:05:21.796333333Z Task starting:
2021-10-15T01:05:21.796398331Z {
2021-10-15T01:05:21.796486309Z   "TensorFlowMemory": {
2021-10-15T01:05:21.796489621Z     "unreliable": true,
2021-10-15T01:05:21.796492909Z     "numTensors": 308,
2021-10-15T01:05:21.796496160Z     "numDataBuffers": 308,
2021-10-15T01:05:21.796499492Z     "numBytes": 16704400
2021-10-15T01:05:21.796502823Z   }
2021-10-15T01:05:21.796506027Z }
2021-10-15T01:05:25.044486894Z Task completed:
2021-10-15T01:05:25.044580754Z {
2021-10-15T01:05:25.044670002Z   "TensorFlowMemory": {
2021-10-15T01:05:25.044672942Z     "unreliable": true,
2021-10-15T01:05:25.044675892Z     "numTensors": 308,
2021-10-15T01:05:25.044678802Z     "numDataBuffers": 308,
2021-10-15T01:05:25.044681744Z     "numBytes": 16704400
2021-10-15T01:05:25.044684701Z   }
2021-10-15T01:05:25.044687538Z }

The allocated memory is the core upscale layer model, after warmup/predict.

pyu10055 (Collaborator) commented
@danwexler The other thing I want to confirm: are you using a TFJS model or a TF SavedModel for inference?

danwexler (Author) commented Oct 20, 2021

I'm using a pretrained super-resolution model loaded from a cached version of the Idealo ESRGAN. The model is currently loaded from Unpkg at this location: https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans using tf.loadLayersModel(). That version is provided by the author of the npm upscaler package.

IOW, this is not a TFJS-provided model from TFHub, and I believe it is a TF saved model. Please correct me if I'm wrong, as I did not do the original training. I feel very much like I need to understand more about the internals of how models work in order to understand this issue.

I believe these are the model files: gans.zip

Looking at the model files, it appears to be a Keras 2.4.0 model converted using TFJS Converter v2.0.1.post1.
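The loading code is essentially just the following sketch (the trailing model.json path is my assumption about the package layout, not something I have re-verified here):

// Load the converted ESRGAN layers model from unpkg.
const MODEL_URL = 'https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans/model.json'
const superresModel = await tf.loadLayersModel(MODEL_URL)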

danwexler (Author) commented
FYI, this is all part of an unannounced product in development that lets you run TFJS models both locally in the browser and at scale on a dedicated cluster of cloud VMs. So I run this code both in tfjs-node-gpu and in the browser with tfjs; however, the browser is typically used to adjust settings on a single frame rather than to render the entire video. You can run the entire video processing locally too, it just runs much faster when split across multiple VMs and on bigger GPUs.

pyu10055 (Collaborator) commented Oct 21, 2021

@danwexler Are you using CUDA 11.2? I believe TF 2.5.0+ requires at least CUDA 11.2. It seems this problem is fixed in the upcoming TF 2.7.0:
tensorflow/tensorflow#48545

danwexler (Author) commented Oct 21, 2021

Understood, good info. Unfortunately, CUDA 11.2 is not available via the default Google Kubernetes Engine (GKE) nvidia-driver-installer DaemonSet.

I've upgraded to the tensorflow/tensorflow:nightly-gpu Docker base image, and upgraded my GKE control plane to the Rapid channel, since the cluster version determines the base NVIDIA driver and CUDA version. Unfortunately, it looks like that still installs only CUDA v11.0.

I believe there is a way to install a different driver than the one GKE picks for a given cluster version. Do you know of any documentation or instructions on how to upgrade the CUDA version on a GKE VM via the standard nvidia-driver-installer DaemonSet?

This is not a blocking issue for me during development. I'll be testing workarounds while I wait for the TF 2.7.0 release.

However, it would be great if there were a way to reuse existing allocations rather than re-allocating the same large tensors for data pre-fetch and model.predict(). That would definitely avoid fragmentation, with user control at the API level. Otherwise, it seems the current allocator is just not optimized to detect and reuse existing free blocks, at least larger ones. Hopefully the cuda_malloc_async allocator is an improvement in this regard. Alternatively, I plan to look at tf.engine().reset() to clear out the entire allocator and re-load my model from scratch every N frames. Any other workarounds I should explore?
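The periodic-reset workaround I have in mind looks roughly like this sketch (MODEL_URL and RESET_EVERY_N_FRAMES are placeholders; the re-load cost is the obvious downside):

// Every N frames, drop the whole TFJS engine state (model included),
// then re-load and warm up the model from scratch.
if (frameIdx > 0 && frameIdx % RESET_EVERY_N_FRAMES === 0) {
  tf.engine().reset()
  superresModel = await tf.loadLayersModel(MODEL_URL)
  tf.tidy(() => { superresModel.predict(tf.zeros([1, 32, 32, 3])) }) // warm-up; tidy disposes all tensors it creates
}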

pyu10055 (Collaborator) commented
@danwexler An engine reset would de-allocate all of the model's weight tensors; you would need to recreate them and upload them to the GPU again, and I am not sure it would improve GPU memory fragmentation.
