
how to fully optimize processing to take advantage of all available hardware #20

Closed

Nuzzlet opened this issue Dec 5, 2020 · 33 comments

Labels: question (Further information is requested)

@Nuzzlet

Nuzzlet commented Dec 5, 2020

I run this function:

  async function analyzeFrame (frameData, index) {
    const tensor = await getTensorFromBuffer(frameData)
    const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors()
    tensor.dispose()
    faces.forEach(async (face) => {
      // console.log(face)
    })
  }

for hundreds of frames of a video. When doing so, it takes about half the time it would if it were using only tfjs-node. It fills the VRAM of both of my 1080s; however, the only usage that changes when I start the script is that GPU 1's copy utilization goes to about 15%.

I want to take better advantage of my hardware and run this faster.

Windows 10, cuda 10, TFJS 2.7.0

@Nuzzlet
Author

Nuzzlet commented Dec 5, 2020

For example, could I run two instances of my code - one for each GPU? I should note I also get this error:

W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.

Does PTX equate to a significant performance bump? Should I switch to Linux just for this?

@meeki007
Contributor

meeki007 commented Dec 5, 2020

Linux user here
I was using only 8% to 16% of my nVidia Tesla K10 in a dell 620xd server per task.
However I was never able to get my GPU utilization above 48% per GPU core (CPU bottleneck)

using cuda 10.1

Don't know if my old card even supports PTX

PTX support might be the bottleneck, but get ready for hell on earth getting CUDA running nicely on Linux.
I wrote up all the documentation after I figured it out
https://github.com/meeki007/Shinobi-personal-documentation/blob/master/Cuda_Objects_and_plates.md
You might find a tidbit of help in that documentation if you go the Linux route, but time marches forward and I think it's outdated now.

@vladmandic probably knows way more about this.

@vladmandic
Owner

vladmandic commented Dec 5, 2020

two issues here. first, you need to keep the pipeline saturated instead of emptying it each time before the next frame.
yes, js is single-threaded, but time slicing would still allow some operations to be ordered better - this way you won't saturate the gpu, but you will at least saturate the cpu.

so instead of:

async function analyzeFrame (frameData, index) {
  const tensor = await getTensorFromBuffer(frameData)
  const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors()
  tensor.dispose()
  faces.forEach(async (face) => {
    // console.log(face)
  })
}

something like this (this is not working code, just an example):

async function analyzeFrames (frameData, index, numFrames) {
  const promises = [];
  for (let i = 0; i < numFrames; i++) {
    promises.push(getTensorFromBuffer(frameData[i]).then((tensor) => {
      // return the inner promise so Promise.all() waits for detection, not just tensor creation
      return faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors().then((faces) => {
        // console.log(faces)
        tensor.dispose();
      });
    }));
  }
  return Promise.all(promises);
}

just make sure that your promise pool is not too large as things can spiral out of control.
(and yes, this is oversimplified as it fills the entire pool and then waits until it's all done instead of reusing free slots as they become free).

btw, i hate this kind of code as it's unreadable, i like async/await much more - this is only a last resort.
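
for reference, a minimal sketch of a bounded pool that does reuse slots as they free up (same getTensorFromBuffer / optionsSSDMobileNet as above; the concurrency value is just an example):

async function analyzeFramesBounded (frames, concurrency = 4) {
  let next = 0;
  // each lane keeps pulling the next unprocessed frame as soon as its current one finishes
  const lane = async () => {
    while (next < frames.length) {
      const i = next++;
      const tensor = await getTensorFromBuffer(frames[i]);
      const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors();
      tensor.dispose();
      // do something with faces here
    }
  };
  // start a fixed number of lanes and wait for all of them to drain the queue
  await Promise.all(Array.from({ length: concurrency }, () => lane()));
}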

second, no matter what you do, js is still single threaded. so you could spawn a fixed number of worker processes (in node) or web workers (in browser) and do a loop that assigns processing of the next frame to the first available worker.
that's pretty much the only way to saturate gpu.

but then you also get to a second layer of issues - how to transfer frameData to a worker without adding extra latency (as transfer to worker is far more costly than transfer to function inside a same thread). so maybe move that processing into worker as well.

then and only then i'd look at platform specific stuff - i don't see a reason to suspect that linux would behave better than windows before it's even saturated.
ptx is about multi-threaded execution and i'm not even sure how much that applies - you'd either be using it single-threaded (default) or multi-process (with workers), not multi-threaded.

@vladmandic vladmandic self-assigned this Dec 5, 2020
@vladmandic vladmandic added the question Further information is requested label Dec 5, 2020
@Nuzzlet
Author

Nuzzlet commented Dec 5, 2020

Thank you! I understand - I don't like that promise approach either, but it makes sense. A few questions:

What version of CUDA and cuDNN do you recommend? And how do I split into smaller processes whilst keeping one instance of face-api? I tried to use fork to run analyzeFrame, however it returns loads of buffer allocation errors. The first instance fills my VRAM.

@vladmandic
Owner

> What version of CUDA and cuDNN do you recommend?

latest you manage to get working :)
it's pretty much graphic driver dependent, cuda version has to match driver version or it will start failing.

> And how do I split into smaller processes whilst keeping one instance of face-api?

i don't think you can. an instance of face-api has one instance of tfjs, which can only have one active backend.
since the entire goal is to have multiple active backends, it follows that you need multiple instances of face-api.

the thing with fork is that it takes time to initialize a process and that can interfere with an already running instance - you don't want to be doing that while another instance is already running.

that's why i mentioned that it should be a pre-started pool of processes and then assign processing to the first available one, not start them on demand.

@Nuzzlet
Author

Nuzzlet commented Dec 5, 2020

Okay perfect. So I have altered my code to do as you describe, adding each async function to Promise.all(). It does seem to give a slight speed improvement, but not what I'm looking for.

Since there can only be one backend, is it possible to split into multiple processes?

@Nuzzlet
Author

Nuzzlet commented Dec 5, 2020

Should I split the promises array into say 2-4 arrays, then assign each promise to its own worker? I've never used worker threads before.

@vladmandic
Owner

> Should I split the promises array into say 2-4 arrays, then assign each promise to its own worker?

Yes, but it gets more complicated quickly.

> Since there can only be one backend, is it possible to split into multiple processes?

There can only be one backend per-process. Since the goal is to have multiple active backends, we need multiple processes each with its own active backend.
But...Backend initialization can interfere with any other active processes, so we need to:

  1. Start one worker, wait until it finishes backend initialization and then start another worker and so on.
  2. When all workers are started, assign each worker a batch and wait results from each worker.
  3. Since workers can process at different speeds, we should also poll for each worker busy/free and assign next batch as worker becomes free.

Which means we need some sort of inter-process communication between parent and workers.

With the browser and web workers, there is a built-in postMessage() and an event listener on the other side. With node, we need to implement it manually.

  • Simplest would be to capture worker stdin and stdout when forking it, so we can send commands to the worker via its stdin and look at its stdout for results.
    But that is ugly and I wouldn't do that outside of prototype code
  • Heavy-duty solution would be to have something like gRPC or some kind of message bus, but that brings in tons of dependencies.
    Probably not worth the weight
  • Myself, I'd do something like utilize the nodejs built-in http class so each worker acts as a simple web server (no framework, just a naked http server)
    And then have a super-simplistic http rest api to send commands to the worker and get results from it (rough sketch after this list)

All-in-all, it's not trivial, but not super-long.
I might do a prototype next week for fun :)
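
just to illustrate that last option, a rough sketch (not real code from this project - the /detect endpoint, the port and the detectFacesInImage() helper are made up):

// worker.js - hypothetical sketch: one face-api instance behind a bare nodejs http server
const http = require('http');

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/detect') {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', async () => {
      const { image } = JSON.parse(body);            // e.g. { "image": "frame-0001.jpg" }
      const faces = await detectFacesInImage(image); // placeholder for the actual face-api call
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ image, count: faces.length }));
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3001); // each worker would listen on its own port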

@vladmandic
Owner

Actually, just remembered that forked NodeJS processes have built-in send() that can be used for IPC.
I haven't used it before, will try.
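
as a quick sketch of what that built-in channel looks like (file names and message shape are just examples):

// main.js
const { fork } = require('child_process');
const worker = fork('./worker.js');                                 // fork opens an IPC channel automatically
worker.on('message', (msg) => console.log('main received:', msg)); // results come back here
worker.send({ image: 'frame-0001.jpg' });                          // dispatch one job to the worker

// worker.js
process.on('message', async (msg) => {
  // run detection on msg.image here, then report back to the parent
  process.send({ image: msg.image, faces: 0 });
});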

@meeki007
Contributor

meeki007 commented Dec 5, 2020

@vladmandic - " Myself, I'd do something like utilize nodejs built-in http class so each worker acts as simple web server (no framework, just naked http server)
And then have super-simplistic http rest api to send commands to worker and get results from it "

That is the path I'm taking - think of each node as a worker. I'm getting there. @Nuzzlet - don't wait around for me to finish this (it's a nodeJS/node-red app), but I'll let you know if I ever get it working.

Looking into this now for my attempt at this ---> NodeJS processes have built-in send()


@Nuzzlet
Author

Nuzzlet commented Dec 5, 2020

@meeki007 @vladmandic
Do I need multiple backends? What if I spawn a worker, then using IPC send it tf & faceapi, have it process an array of frames to analyze, and then return to the main thread?

Worker threads: https://nodejs.org/api/worker_threads.html#worker_threads_worker_threads

They can share memory with the main process, so maybe TF & faceAPI can be shared this way. Do you see any issues with this? It could be much simpler. Or is it necessary to have multiple backends in order to call detectFaces() multiple times simultaneously?

Also above you said it's impossible to have multiple backends at once? I'm confused.

@vladmandic
Owner

> Also above you said it's impossible to have multiple backends at once? I'm confused.

It is impossible to have multiple backends inside same process.

> Do I need multiple backends? What if I spawn a worker, then using IPC send it tf & faceapi, have it process an array of frames to analyze, and then return to the main thread?

Not out-of-the-box for sure. In theory? Maybe.
I see two issues:

  • TFJS holds textures for each model. If you use same instance of TFJS to process same model from two different processes, it would for sure cause corruption of textures in GPU memory. Possibly could be resolved by dynamically naming models when registering it with tfjs, but i'd need to refactor a lot of face-api code for that.
  • Even if that is done, still not sure if TFJS would not cause internal conflict and corrupt states between multiple instances.

Now, on memory usage, it shouldn't be that high - and it definitely shouldn't get high enough to cause issues with multiple processes.
So if that problem remains, we can look at that separately.

@Nuzzlet
Author

Nuzzlet commented Dec 6, 2020

How do you have multiple backends at once? I tried to create two processes with their own tfjs-node-gpu & faceapi. Both return constant buffer issues.

I think this is related to the vRam issues I mentioned above. When I have some time I'll put together a bug report in a new issue.

@vladmandic
Owner

take a look at https://github.com/vladmandic/faceapi-workers

Test for multi-process FaceAPI execution in NodeJS

This is not 100% optimized, it's intended as proof-of-concept
Code is documented inline

You can see that the speed-up from multiprocessing is quite significant

single-process:

source: src/single-process.js

2020-12-06 10:55:47 INFO:  @vladmandic/faceapi-workers version 0.0.1
2020-12-06 10:55:47 INFO:  User: vlado Platform: linux Arch: x64 Node: v15.0.1
2020-12-06 10:55:47 INFO:  FaceAPI single-process test
2020-12-06 10:55:47 STATE:  Version: TensorFlow/JS 2.7.0 FaceAPI 0.8.9 Backend: tensorflow
...
2020-12-06 10:56:33 INFO:  Processed 57 images in 46559 ms

multi-process with numWorkers=2:

source src/multi-process.js plus src/multi-process-worker.js

2020-12-06 11:17:26 INFO:  @vladmandic/faceapi-workers version 0.0.1
2020-12-06 11:17:26 INFO:  User: vlado Platform: linux Arch: x64 Node: v15.0.1
2020-12-06 11:17:26 INFO:  FaceAPI multi-process test
2020-12-06 11:17:26 STATE:  Main: started worker: 26562
2020-12-06 11:17:26 STATE:  Main: started worker: 26568
2020-12-06 11:17:27 STATE:  Worker: PID: 26568 TensorFlow/JS 2.7.0 FaceAPI 0.8.9 Backend: tensorflow
2020-12-06 11:17:27 STATE:  Worker: PID: 26562 TensorFlow/JS 2.7.0 FaceAPI 0.8.9 Backend: tensorflow
2020-12-06 11:17:27 STATE:  Main: dispatching to worker: 26562
2020-12-06 11:17:27 STATE:  Main: dispatching to worker: 26568
2020-12-06 11:17:28 DATA:  Worker received message: 26562 { image: '/mnt/c/Users/mandi/OneDrive/People/Miami/DSC00317.jpg' }
2020-12-06 11:17:28 DATA:  Worker received message: 26568 { image: '/mnt/c/Users/mandi/OneDrive/People/Miami/DSC00319.jpg' }
2020-12-06 11:17:28 DATA:  Main: worker finished: 26562 detected faces: 1
2020-12-06 11:17:28 STATE:  Main: dispatching to worker: 26562
2020-12-06 11:17:28 DATA:  Worker received message: 26562 { image: '/mnt/c/Users/mandi/OneDrive/People/Miami/DSC00320.jpg' }
2020-12-06 11:17:28 DATA:  Main: worker finished: 26568 detected faces: 3
2020-12-06 11:17:28 STATE:  Main: dispatching to worker: 26568
...
2020-12-06 11:18:00 INFO:  Processed 57 images in 33999 ms

multi-process with numWorkers=4:

2020-12-06 10:59:43 INFO:  Processed 57 images in 20001 ms

I've tried with 16 workers and it's running fine without any memory issues, but not getting any more performance gains on a notebook with limited hardware.

@Nuzzlet
Author

Nuzzlet commented Dec 6, 2020

The files you referenced in src do not exist when I look at the linked git.

@meeki007
Contributor

meeki007 commented Dec 6, 2020

@Nuzzlet
assuming you understand how the above worker process works, you can speed things up even more by pre-loading all the models and initializing tfjs.
Then have the node standing by to receive an image (based on user input or however you plan on sending the image)

Right now, for simplicity and understanding, in the examples @vladmandic has created, the main async function has to load all the models and initialize tf every time you send it an image(s)

You could do what I did and pull this stuff out of the main async function into a Promise
Example from my code:

//tfjs_backend
    var tfjs_backend; //error check of tfjs_backend
    Promise.resolve(
      faceapi.tf.setBackend('tensorflow')
    )
    .then(() => { tfjs_backend = true })
    .catch(error =>
      {
      tfjs_backend = ("Could not set tfjs backend" + error),
      this.warn(tfjs_backend),
      this.status(
      {
        fill: 'red',
        shape: 'dot',
        text: "detected error"
      });
    });

ignore the warn and status stuff

this way it's not loading the models and tf stuff every time.
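
Something along these lines inside a worker (just a sketch - the requires, the ./model path and the message shape are assumptions, adjust to whatever you actually use):

// worker startup: initialize tf and load models once, then just wait for jobs
const tf = require('@tensorflow/tfjs-node-gpu');  // load the native backend first
const faceapi = require('@vladmandic/face-api');

async function init () {
  await faceapi.tf.setBackend('tensorflow');
  await faceapi.tf.ready();
  await faceapi.nets.ssdMobilenetv1.loadFromDisk('./model');
  await faceapi.nets.faceLandmark68Net.loadFromDisk('./model');
  await faceapi.nets.faceRecognitionNet.loadFromDisk('./model');
}

init().then(() => {
  process.send({ ready: true }); // tell main this worker can accept jobs
  process.on('message', async (msg) => {
    // detection for msg.image happens here - models are already warm, nothing to reload
  });
});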

@Nuzzlet
Author

Nuzzlet commented Dec 6, 2020

I don't see the above worker process. src/multi-process.js & src/multi-process-worker.js do not exist at https://github.com/vladmandic/faceapi-workers

Am I missing something?

@meeki007
Contributor

meeki007 commented Dec 6, 2020

I do not see them either

you are not missing something

I was giving you something else to peck at while vladmandic gets back to us

@vladmandic really needs a Donations page. I owe the man a 18pack/box of beer at this point.

@vladmandic
Owner

vladmandic commented Dec 6, 2020

@Nuzzlet My bad, I committed changes, but didn't do a push to github - it's online now.

@meeki007 New workflow is different and a bit more advanced:

  1. Enumerate all images to be processed into a queue
  2. Start N number of workers
  3. On startup each worker initializes instance of FaceAPI and signals back to main when it's ready
  4. For each ready signal received from a worker, main dispatches a job to the worker and removes it from the queue
  5. Worker processes job and returns results to main
  6. Worker signals main that it's ready for next job (go to step 4)
  7. When queue is empty main signals all workers to shutdown (exit is inside worker, not kill from main)

Btw, I intentionally didn't use promises and I stayed with async/await for readability.

@Nuzzlet Regarding memory, I have no issues with memory consumption even with 32 workers processing 24MP images in parallel
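
Condensed, the main side of steps 3-7 looks roughly like this (a sketch only - the message fields and queue here are made up, the actual code is in the repo):

const { fork } = require('child_process');

const queue = ['img001.jpg', 'img002.jpg'];                 // step 1: enumerated images (example names)
const workers = [fork('./worker.js'), fork('./worker.js')]; // step 2: start N workers

for (const worker of workers) {
  worker.on('message', (msg) => {
    if (msg.results) console.log('worker finished:', msg.results); // step 5: collect results
    if (msg.ready || msg.results) {                                // steps 3 and 6: worker is free
      if (queue.length > 0) worker.send({ image: queue.shift() }); // step 4: dispatch next job
      else worker.send({ exit: true });                            // step 7: queue empty, ask worker to shut down
    }
  });
}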

@vladmandic changed the title from "Not taking anywhere near full advantage of my GPUs" to "how to fully optimize processing to take advantage of all available hardware" Dec 6, 2020
@vladmandic
Owner

I've added code to measure worker initialization time and ipc messaging latency - it's not much (~500ms init and ~2ms latency).

Further optimizations to the multi-processing approach would be to:

  • Use promises to better saturate the JS execution pipeline on the CPU
  • Batch jobs to minimize the impact of IPC latency
  • Do image loading and conversion to tensor out-of-band asynchronously for each batch inside a worker, so each detection loop uses a pre-prepared tensor, minimizing CPU/GPU switches and keeping the GPU fully saturated (rough sketch below)
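
For example, batching plus out-of-band tensor prep could look something like this (hypothetical message shape, not the repo's actual protocol):

// main side: send an array of frames per message instead of one
worker.send({ images: ['frame-0001.jpg', 'frame-0002.jpg', 'frame-0003.jpg'] });

// worker side: decode the next frame while the previous detection is still running on the gpu
process.on('message', async (msg) => {
  let pending = getTensorFromBuffer(msg.images[0]); // start decoding the first frame
  const results = [];
  for (let i = 0; i < msg.images.length; i++) {
    const tensor = await pending;
    if (i + 1 < msg.images.length) pending = getTensorFromBuffer(msg.images[i + 1]); // prep next frame out-of-band
    const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors();
    tensor.dispose();
    results.push(faces.length);
  }
  process.send({ results }); // one reply per batch instead of per frame
});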

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

Here I have adapted your code into my system that allows input of a video file. It's VERY rough. I used promises to simplify the code significantly from what you provided at the expense of readability, and there are no comments, but you're welcome to explore:
https://github.com/Nuzzlet/aiVideo2

Also, I created an issue here: tensorflow/tfjs#4362 about my vRam issue. I cannot fully test this code until I find a fix/workaround. Creating a new tensor from a buffer seems to have some memory leak that prevents multiple threads from working at all.

However, when running on the CPU, I also found multithreading yielded significant performance boosts.

In the meantime, I'm going to try to come up with an alternative to loading image to tensor. Maybe another process with CPU only (no gpu) access to run decode on.

@vladmandic
Owner

vladmandic commented Dec 7, 2020

@Nuzzlet just brainstorming out loud, your problem could be due to how gpu memory is handled in general, not an actual memory leak.

tfjs does not deallocate memory immediately, it relies on the js engine to perform garbage collection triggered on thresholds. but for the gpu, measuring used memory is unreliable (you get the sizeof the handle instead of the sizeof the actual allocated memory), so the threshold does not get triggered. instead garbage collection becomes a time-based operation and you exhaust memory before it even triggers.

with browser and webgl backend, there is WEBGL_DELETE_TEXTURE_THRESHOLD which if set to 0 changes memory handler to deallocate memory immediately. but there is no such flag for tfjs-node-gpu.

as a test, try processing 10-20 frames and then pausing for 1-2 sec to see if memory usage goes down? if it does, that means it's definitely a garbage collection issue.

additionally, i'd check for any possible v8 flags that could be passed to nodejs to trigger different garbage collection behavior in the nodejs js engine itself.
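
a quick sketch of that test, reusing your analyzeFrame function (the 20-frame interval and 2 sec pause are arbitrary, and global.gc() only exists if node is started with --expose-gc):

// run with: node --expose-gc test.js (the flag is optional, it just lets you force a v8 gc)
async function testGarbageCollection (frames) {
  for (let i = 0; i < frames.length; i++) {
    await analyzeFrame(frames[i], i);
    if (i > 0 && i % 20 === 0) {
      console.log('tensors before pause:', faceapi.tf.memory().numTensors);
      if (global.gc) global.gc();                                // force v8 garbage collection if exposed
      await new Promise((resolve) => setTimeout(resolve, 2000)); // 2 sec idle window for gc to kick in
      console.log('tensors after pause:', faceapi.tf.memory().numTensors);
    }
  }
}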

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

VRAM is filled the moment decodeImage is run. Calling tensor.dispose, then setting a 20-second timeout, does not change anything.

However, it may be what you say, because when running only one worker there are no memory issues even though the memory still appears filled.

@vladmandic
Owner

vladmandic commented Dec 7, 2020

the thing is, .dispose() does not actually release GPU memory immediately, it only marks it as disposable.

TFJS implemented it that way because allocating and deallocating memory on GPU is costly, so it tries to re-use already allocated memory. I suspect that in case of decodeImage, it uploads the buffer to GPU RAM and does not reuse it, so every subsequent call to decodeImage just fills GPU RAM and you run out of GPU RAM before garbage collection kicks in.

And garbage collection is late to kick in because call to allocate in CPU returns actual memory size so it knows when threshold is hit. But with GPU memory, allocate returns just a handle, not actual memory size. So garbage collection trigger stays below threshold.

Having 1-2 sec delay every xx frames would allow garbage collection to kick-in based on idle timer - alternative to high threshold.

Again, this is just my brainstorming based on how WebGL backend behaves - I have several TFJS issues on that topic.

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

I think the answer here is tricking tfjs-node-gpu into thinking there is only, say, 2GB of VRAM available per backend instance. That way there is enough left over for each thread.

@vladmandic
Owner

> I think the answer here is tricking tfjs-node-gpu into thinking there is only, say, 2GB of VRAM available per backend instance. That way there is enough left over for each thread.

Yes, that would be a solution if tfjs had accurate measurements of GPU memory usage. Unfortunately, it doesn't (I have yet another tfjs issue open for that). But as it is, since it doesn't reliably know how much is available, it doesn't know when to stop and do garbage collection.

All-in-all, GPU memory handling in TFJS is quite buggy.

Btw, I have some more complex models (250MB weights and ~10k tensors) that fail to execute single pass in 4GB GPU RAM, but work perfectly with only 350-500MB CPU RAM.

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

Would you like me to try them on my GPUs with 8GB of VRAM? Also, it doesn't seem like tfjs-node-gpu takes advantage of two GPUs at all.

*Correction: tfjs-node-gpu appears to fill both GPUs' VRAM.

@vladmandic
Owner

already did, they work, but need to trigger garbage collection between passes on the webgl backend which makes them really slow on gpu - up to the point that multi-processing on cpu is sometimes faster and more reliable than using gpu. tfjs needs to fix how it uses gpu memory - right now it's about 6-10x higher than cpu memory, plus the fact that due to how available memory is calculated, it misses triggering garbage collection and ends up in an out-of-memory condition.

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

Does this apply only to tfjs-node-gpu or WebGL as well when using browser?

@vladmandic
Owner

to webgl, since the flag WEBGL_DELETE_TEXTURE_THRESHOLD allows triggering garbage collection. i'm not aware of an equivalent in tfjs-node-gpu, so it's useless to me.

fyi, the tfjs team is working on a new backend for nodejs - tfjs-backend-nodegl - that would use angle for gl integration instead of cuda bindings. but i think it's going to be some time until it's ready. so browser and node implementations would be much more similar and allow for uniform development moving forward.
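
for reference, in the browser that flag is set through the tfjs environment before running any models (0 means disposed textures are released immediately instead of being pooled):

// browser / webgl backend only - there is no equivalent flag in tfjs-node-gpu
async function initBrowserBackend () {
  await faceapi.tf.setBackend('webgl');
  await faceapi.tf.ready();
  faceapi.tf.env().set('WEBGL_DELETE_TEXTURE_THRESHOLD', 0); // delete disposed textures immediately
}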

@Nuzzlet
Author

Nuzzlet commented Dec 7, 2020

  1. Would it be possible to create a FaceAPI implementation using: https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-nodegl as is?

  2. Could it be possible to re-use the same tfjs-node-gpu backend with multiple threads of FaceAPI?

@vladmandic
Owner

I'll check out master and try to do a build of tfjs-backend-nodegl. If it builds cleanly (no guarantees), I can try. Last time I checked (about two months ago), it was not building.
Then the question is which ops are implemented so far.

Would it be possible to reuse the same backend with multiple processes (there are no threads in JS)? In theory. I'd have to do manual backend registration so it gets a unique identifier per worker while still using the same actual backend. Will see.

@vladmandic
Owner

fyi, tfjs-backend-nodegl is non-functional just yet. i'll keep an eye on it.
i'm closing this issue for now as quite a lot of progress has been made and i don't want to overload a single conversation.

feel free to open a new issue in the future.
