
[tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error #5740

Closed · danwexler opened this issue Oct 18, 2021 · 10 comments
Labels: comp:node.js, type:bug (Something isn't working)

danwexler commented Oct 18, 2021

Using tfjs-node-gpu on a GKE cluster running on an n1-highmem-8 with an NVIDIA P4 or V100 GPU fails when the cuda_malloc_async allocator is selected via TF_GPU_ALLOCATOR.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 64bit
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: none
  • TensorFlow.js installed from (npm or script link): npm
  • TensorFlow.js version (use command below): 3.9.0
  • Browser version: none, only tested in node
  • Node version: 14.15.3
  • Tensorflow.js Converter Version: none

Describe the current behavior

The app is a video filter that applies a super-resolution layers model to each frame of a video file, batching N frames together into a Tensor4D to scale the resolution up 4x. I run tf.memory() after each frame to ensure that I am not leaking any tensors. After correctly processing slightly more than 100 frames at 1280x720, TF bails out, dumps the memory allocations, and displays the message:

If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.

However, when I do set TF_GPU_ALLOCATOR=cuda_malloc_async, my normally correct startup process fails with:

tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc:72] Failed to get device attribute: CUDA error: invalid argument (CUDA_ERROR_INVALID_VALUE)
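For reference, here is roughly how I set the variable from Node (a minimal sketch; it assumes the variable only needs to be in the environment before the native binding loads, and it could equally be set in the Dockerfile or pod spec):

// Sketch: export the allocator setting before the native TF runtime initializes.
process.env.TF_GPU_ALLOCATOR = 'cuda_malloc_async'
const tf = require('@tensorflow/tfjs-node-gpu')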

Describe the expected behavior
My primary issue is being able to run model.predict() on several hundred video frames, grouped into batches, without running out of memory. I have eliminated any tensor leaks according to tf.memory(), so I'm not sure what to try next. I have seen discussions mentioning tf.engine().startScope()/endScope(), and I can also try dispose()ing my model every N frames and re-loading it, or even tf.engine().reset() every N frames, but these seem like band-aids for internal TFJS issues.

I do not explicitly allocate any TF variables within my code, so I do not expect tf.disposeVariables() to help. Is it possible that the model allocates variables internally that would benefit from running tf.disposeVariables() every frame?

I repeat the same allocation pattern for each video frame batch, but I cannot find any way of re-using the existing Tensors to avoid fragmentation.
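As a concrete example of the scoping band-aid mentioned above, this is a minimal sketch of what I could wrap around each batch (processBatch and numBatches are placeholders, not my actual code):

// Scope each batch so any intermediate tensor that is not explicitly
// kept gets released when the scope ends (inside an async function).
for (let b = 0; b < numBatches; ++b) {
  tf.engine().startScope()
  try {
    await processBatch(b) // builds the T4D, runs predict(), saves frames
  } finally {
    tf.engine().endScope()
  }
  console.log(tf.memory()) // tensor counts should stay flat
}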

Standalone code to reproduce the issue
Producing repro code is possible but would take significant effort. If there are no simple answers to this issue, I will take the time to mock up a repro.

Basically, I start by decoding frames into separate files using ffmpeg. Then the processing loop pre-fetches the next batch of N frames (N is typically 1-10) into a Tensor4D by loading the individual frames:

const stack = []
for (let i = 0; i < N; ++i) {
  stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3)) // filename = path to frame i
}
const t4d = tf.stack(stack)

Once pre-fetched, processing is just: superresModel.predict(t4d)

Once the output batch is finished, I extract the individual frames and save them back to new output files using:

const saveTensor3DAsImageFile = async (tensor, frameIdx, dstExpr) => {
  const executedImage = await tf.node.encodePng(tensor)
  tensor.dispose()
  const filename = sprintf(dstExpr, frameIdx) // image output path
  fs.writeFileSync(filename, executedImage)
}
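Putting those pieces together, the per-batch step is roughly this sketch (run inside an async function; dstExpr as in the function above, and batchStartFrame is the first frame index of the batch):

// Run the model, split the output batch back into frames, write each one.
const output = superresModel.predict(t4d) as tf.Tensor4D
t4d.dispose()
const frames = tf.unstack(output) as tf.Tensor3D[]
output.dispose()
for (let i = 0; i < frames.length; ++i) {
  await saveTensor3DAsImageFile(frames[i], batchStartFrame + i, dstExpr) // disposes each frame
}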

After all batches are finished, I just call ffmpeg again to re-encode the output frame files.

Other info / logs

err.log
nvidia_smi.log

@danwexler danwexler added the type:bug Something isn't working label Oct 18, 2021
@danwexler danwexler changed the title cuda_malloc_async fails with CUDA device attribute error [tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error Oct 19, 2021
@danwexler danwexler changed the title [tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error [tensorflow/tfjs][tfjs-node-gpu] cuda_malloc_async fails with CUDA device attribute error Oct 19, 2021
pyu10055 (Collaborator) commented
@danwexler Just want to make sure there are no memory leaks in your preprocessing code:

for (let i = 0; i < N; ++i) stack.push(tf.node.decodeImage(fs.readFileSync(filename), 3))

I assume you have disposed the tensors within the stack array? Can you show the tf.memory() output before and after the inference?
Thanks
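For example, something like this (just a sketch, using your variable names):

console.log('before predict', tf.memory())
const out = superresModel.predict(t4d) as tf.Tensor4D
console.log('after predict', tf.memory())
// ... save frames, dispose out and t4d ...
console.log('after dispose', tf.memory())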

danwexler (Author) commented
Yes, apologies, I was just mocking the real function in the bug report. As I said, I print tf.memory() after each frame to ensure there are no additional tensors or bytes allocated. Here's the full code for my pre-fetch function:

const stackTensors = (imagesExp: string, batchStartFrame: number, batchSize: number) => {
  const tensors: Array<tf.Tensor3D> = []
  for (let j = 0; j < batchSize; ++j) {
    const idx = batchStartFrame + j + 1
    const frame = sprintf(imagesExp, idx)
    const tensor: tf.Tensor3D = loadImageAsTensor3D(frame)
    if (tensor) tensors.push(tensor)
  }
  const batch: tf.Tensor4D = <tf.Tensor4D> tf.stack(tensors)
  tensors.forEach(tensor => tensor.dispose())
  return batch
}
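For context, the driver loop around it is essentially this (a simplified sketch, not the literal code):

// Prefetch a batch, upscale it, then release input and output tensors
// before the next batch.
for (let b = 0; b < numBatches; ++b) {
  const batch = stackTensors(imagesExp, b * batchSize, batchSize)
  const upscaled = superresModel.predict(batch) as tf.Tensor4D
  batch.dispose()
  // ... unstack `upscaled` and save each frame via saveTensor3DAsImageFile ...
  upscaled.dispose()
  console.log(tf.memory()) // stays flat from batch to batch
}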

danwexler (Author) commented
Here's a typical output from tf.memory():

2021-10-15T01:05:21.796333333Z Task starting:
2021-10-15T01:05:21.796398331Z {
2021-10-15T01:05:21.796486309Z   "TensorFlowMemory": {
2021-10-15T01:05:21.796489621Z     "unreliable": true,
2021-10-15T01:05:21.796492909Z     "numTensors": 308,
2021-10-15T01:05:21.796496160Z     "numDataBuffers": 308,
2021-10-15T01:05:21.796499492Z     "numBytes": 16704400
2021-10-15T01:05:21.796502823Z   }
2021-10-15T01:05:21.796506027Z }
2021-10-15T01:05:25.044486894Z Task completed:
2021-10-15T01:05:25.044580754Z {
2021-10-15T01:05:25.044670002Z   "TensorFlowMemory": {
2021-10-15T01:05:25.044672942Z     "unreliable": true,
2021-10-15T01:05:25.044675892Z     "numTensors": 308,
2021-10-15T01:05:25.044678802Z     "numDataBuffers": 308,
2021-10-15T01:05:25.044681744Z     "numBytes": 16704400
2021-10-15T01:05:25.044684701Z   }
2021-10-15T01:05:25.044687538Z }

The allocated memory is the core upscale layer model, after warmup/predict.

pyu10055 (Collaborator) commented
@danwexler The other thing I want to confirm: are you using a TFJS model or a TF SavedModel for inference?

danwexler (Author) commented Oct 20, 2021

I'm using a pretrained super-resolution model loaded from a cached version of the Idealo ESRGAN. The model is currently loaded from Unpkg at this location: https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans using tf.loadLayersModel(). That version is provided by the author of the npm upscaler package.

IOW, this is not a TFJS-provided model from TFHub, and I believe it is a TF saved model. Please correct me if I'm wrong, as I did not do the original training. I feel very much like I need to understand more about the internals of how models work in order to understand this issue.

I believe these are the model files: gans.zip

Looking at the model files, it appears to be a Keras 2.4.0 model converted using TFJS Converter v2.0.1.post1.
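The loading code is essentially just the following sketch (the trailing model.json path is my assumption about the package layout, not something I have re-verified here):

// Load the converted ESRGAN layers model from unpkg.
const MODEL_URL = 'https://unpkg.com/@upscalerjs/models@0.8.27/idealo/gans/model.json'
const superresModel = await tf.loadLayersModel(MODEL_URL)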

danwexler (Author) commented
FYI, this is all part of an unannounced product in development that lets you run TFJS models both locally in the browser and at scale on a dedicated cluster of cloud VMs. So I run this code both in tfjs-node-gpu and in the browser with tfjs; however, the browser is typically used to adjust settings on a single frame rather than to render the entire video. You can run the entire video processing locally too, it just runs much faster when split across multiple VMs and on bigger GPUs.

pyu10055 (Collaborator) commented Oct 21, 2021

@danwexler Are you using CUDA 11.2? I believe TF 2.5.0+ requires at least CUDA 11.2. It seems this problem is fixed in the upcoming TF 2.7.0:
tensorflow/tensorflow#48545

danwexler (Author) commented Oct 21, 2021

Understood, good info. Unfortunately, CUDA 11.2 is not available via the default Google Kubernetes Engine (GKE) nvidia-driver-installer DaemonSet.

I've upgraded to the tensorflow/tensorflow:nightly-gpu Docker base image, and upgraded my GKE control plane to the Rapid channel, since the cluster version determines the base NVIDIA driver and CUDA version. Unfortunately, it looks like that still installs only CUDA v11.0.

I believe there is a way to install a different driver than the one GKE picks for a given cluster version. Do you know of any documentation or instructions on how to upgrade the CUDA version on a GKE VM via the standard nvidia-driver-installer DaemonSet?

This is not a blocking issue for me during development. I'll be testing workarounds while I wait for the TF 2.7.0 release.

However, it would be great if there were a way to reuse existing allocations rather than re-allocating the same large tensors for data pre-fetch and model.predict(). That would definitely avoid fragmentation, with user control at the API level. Otherwise, it seems the current allocator is just not optimized to detect and reuse existing free blocks, at least larger ones. Hopefully the cuda_malloc_async allocator is an improvement in this regard. Alternatively, I plan to look at tf.engine().reset() to clear out the entire allocator and re-load my model from scratch every N frames. Any other workarounds I should explore?
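The periodic-reset workaround I have in mind looks roughly like this sketch (MODEL_URL and RESET_EVERY_N_FRAMES are placeholders; the re-load cost is the obvious downside):

// Every N frames, drop the whole TFJS engine state (model included),
// then re-load and warm up the model from scratch.
if (frameIdx > 0 && frameIdx % RESET_EVERY_N_FRAMES === 0) {
  tf.engine().reset()
  superresModel = await tf.loadLayersModel(MODEL_URL)
  tf.tidy(() => { superresModel.predict(tf.zeros([1, 32, 32, 3])) }) // warm-up; tidy disposes all tensors it creates
}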

pyu10055 (Collaborator) commented
@danwexler An engine reset would de-allocate all of the model's weight tensors; you would need to recreate them and upload them to the GPU again, and I am not sure it would improve GPU memory fragmentation.
