How to fully optimize processing to take advantage of all available hardware #20
For example, could I run two instances of my code - one for each GPU? I should note I also get this error:
Does PTX equate to a significant performance bump? Should I switch to Linux just for this?
Linux user here using CUDA 10.1. Don't know if my old card even supports PTX. PTX support might be the bottleneck, but get ready for hell on earth getting CUDA running nicely on Linux. @vladmandic probably knows way more about this.
two issues here. first, you need to keep the pipeline saturated instead of emptying it each time before the next frame. so instead of:

```js
async function analyzeFrame (frameData, index) {
  const tensor = await getTensorFromBuffer(frameData);
  const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors();
  tensor.dispose();
  faces.forEach((face) => {
    // console.log(face)
  });
}
```

do something like this (this is not working code, just an example):

```js
async function analyzeFrames (frameData, index, numFrames) {
  const promises = [];
  for (let i = 0; i < numFrames; i++) {
    // chain detection and disposal onto each tensor promise so all frames
    // stay in flight at once, then wait for the whole set
    promises.push(getTensorFromBuffer(frameData[i]).then((tensor) =>
      faceapi.detectAllFaces(tensor, optionsSSDMobileNet).withFaceLandmarks().withFaceDescriptors().then((faces) => {
        // console.log(faces)
        tensor.dispose();
      })
    ));
  }
  return Promise.all(promises);
}
```

just make sure that your promise pool is not too large, as things can spiral out of control. btw, i hate this kind of code as it's unreadable - i like async/await much more - this is only a last resort.

second, no matter what you do, js is still single-threaded. so you could spawn a fixed number of worker processes (in node) or web workers (in browser) and run a loop that assigns processing of the next frame to the first available worker. but then you get to a second layer of issues - how to transfer frameData to a worker without adding extra latency (as transfer to a worker is far more costly than a call to a function inside the same thread). so maybe move that processing into the worker as well.

then, and only then, i'd look at platform-specific stuff - i don't see a reason to suspect that linux would behave better than windows before the hardware is even saturated.
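A minimal sketch of one way to keep that pool bounded, reusing getTensorFromBuffer and optionsSSDMobileNet from the example above (BATCH_SIZE is an assumed placeholder, not a recommendation):

```js
// sketch: process frames in bounded batches so only BATCH_SIZE detections
// are in flight at any one time
const BATCH_SIZE = 8; // tune for your hardware

async function analyzeAllFrames (allFrames) {
  const results = [];
  for (let start = 0; start < allFrames.length; start += BATCH_SIZE) {
    const batch = allFrames.slice(start, start + BATCH_SIZE);
    const detections = await Promise.all(batch.map(async (frameData) => {
      const tensor = await getTensorFromBuffer(frameData);
      const faces = await faceapi.detectAllFaces(tensor, optionsSSDMobileNet)
        .withFaceLandmarks()
        .withFaceDescriptors();
      tensor.dispose(); // release memory as soon as the frame is done
      return faces;
    }));
    results.push(...detections);
  }
  return results;
}
```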
Thank you! I understand - I don't like that promise approach either, but it makes sense. A few questions: what version of CUDA and cuDNN do you recommend? And how do I split into smaller processes whilst keeping one instance of Face-API? I tried to use fork to run analyzeFrame, however it returns loads of buffer allocation errors. The first instance fills my vRAM.
latest you manage to get working :)
i don't think you can. an instance of face-api has one instance of tfjs, which can only have one active backend. the thing with fork is that it takes time to initialize a process, and that can interfere with an already running instance - you don't want to be doing that while another instance is already running. that's why i mentioned that it should be a pre-started pool of processes, with processing assigned to the first available one, not started on-demand.
Okay, perfect. So I have altered my code to do as you describe, adding each async function to Promise.all(). It does seem to have a slight speed improvement, but not what I'm looking for. Since there can only be one backend, is it possible to split into multiple processes?
Should I split the promises array into, say, 2-4 arrays, then assign each array to its own worker? I've never used worker-threads before.
Yes, but it gets more complicated quickly.
There can only be one backend per process. Since the goal is to have multiple active backends, we need multiple processes, each with its own active backend.
Which means we need some sort of inter-process communication between parent and workers. With the browser and web workers, there is the built-in postMessage() and an event listener on the other side. With node, we need to implement it manually.
All-in-all, it's not trivial, but not super-long.
Actually, just remembered that forked NodeJS processes have built-in send().
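A rough sketch of such a pre-started pool over that built-in channel (worker.js is a hypothetical script that loads face-api once and replies to each frame message with its detections):

```js
// parent.js - sketch: pre-start a fixed pool of forked workers, then assign
// the next frame to the first worker that becomes available
const { fork } = require('child_process');

const NUM_WORKERS = 4; // pre-started, not spawned on-demand
const workers = [];
for (let i = 0; i < NUM_WORKERS; i++) workers.push(fork('./worker.js'));

function analyzeOnWorker (worker, frame) {
  return new Promise((resolve) => {
    worker.once('message', resolve); // worker replies with detection results
    worker.send(frame);              // built-in ipc channel of forked processes
  });
}

async function processFrames (frames) {
  const results = [];
  let next = 0;
  // each worker pulls the next unclaimed frame until the queue is empty
  await Promise.all(workers.map(async (worker) => {
    while (next < frames.length) {
      const index = next++;
      results[index] = await analyzeOnWorker(worker, frames[index]);
    }
  }));
  workers.forEach((worker) => worker.kill());
  return results;
}
```

Note that send() serializes the frame buffer on every call, which is exactly the IPC transfer cost mentioned above.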
@vladmandic - "Myself, I'd do something like utilize nodejs built-in http class so each worker acts as a simple web server (no framework, just a naked http server)."

That is the path I'm taking - think of each node as a worker. I'm getting there.

@Nuzzlet - don't wait around for me to finish this nodeJS/node-red app, but I'll let you know if I ever get it working. Looking into this now for my attempt at this ---> NodeJS processes have built-in send()
@meeki007 @vladmandic Worker Threads (https://nodejs.org/api/worker_threads.html#worker_threads_worker_threads) can share memory with the main process - maybe TF & faceAPI can be shared this way. Do you see any issues with this? It could be much simpler. Or is it necessary to have multiple backends in order to call detectFaces() multiple times simultaneously? Also, above you said it's impossible to have multiple backends at once? I'm confused.
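For reference, a minimal worker_threads sketch of the shared-memory part (the frame dimensions are an arbitrary assumption, and detect-worker.js is hypothetical; raw buffers can be shared this way, but a tfjs backend instance cannot):

```js
// main.js - sketch: share raw frame pixels with a worker thread via
// SharedArrayBuffer instead of copying them across the thread boundary
const { Worker } = require('worker_threads');

const shared = new SharedArrayBuffer(1920 * 1080 * 4); // assumed rgba frame
const pixels = new Uint8Array(shared);
// ...decode a video frame into `pixels` here...

const worker = new Worker('./detect-worker.js', { workerData: shared });
worker.on('message', (faces) => console.log('detections:', faces));

// detect-worker.js would look roughly like:
//   const { workerData, parentPort } = require('worker_threads');
//   const pixels = new Uint8Array(workerData); // same memory, no copy
//   ...create a tensor from pixels, run detection (needs its own backend)...
//   parentPort.postMessage(results);
```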
It is impossible to have multiple backends inside the same process.
Not out-of-the-box, for sure. In theory? Maybe.
Now, on memory usage - it shouldn't be that high, and it definitely shouldn't get so high as to cause issues with multiple processes.
How do you have multiple backends at once? I tried to create two processes, each with its own tfjs-node-gpu & faceapi. Both return constant buffer issues. I think this is related to the vRAM issues I mentioned above. When I have some time, I'll put together a bug report in a new issue.
take a look at https://github.com/vladmandic/faceapi-workers

Test for multi-process FaceAPI execution in NodeJS. This is not 100% optimized, it's intended as a proof-of-concept. You can see that the speed-up from multi-processing is quite significant.
The files you referenced in src do not exist when I look at the linked repo.
@Nuzzlet right now, for simplicity and understanding, in the examples @vladmandic has created, the main async function has to load all the models and initialize tf every time you send it an image(s). You could do what I did and pull this stuff out of the main async function into a Promise:

```js
// tfjs_backend doubles as a ready flag and an error-check value
var tfjs_backend;

Promise.resolve(faceapi.tf.setBackend('tensorflow'))
  .then(() => { tfjs_backend = true; })
  .catch((error) => {
    tfjs_backend = 'Could not set tfjs backend ' + error;
    this.warn(tfjs_backend);
    this.status({
      fill: 'red',
      shape: 'dot',
      text: 'detected error'
    });
  });
```

ignore the warn and status stuff. this way it's not loading the models and tf stuff every time.
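The same init-once idea in async/await form (a sketch only - the model path and the specific net are assumptions):

```js
// sketch: initialize the backend and load models exactly once,
// no matter how many frames call into analyze()
let ready = null;

function ensureReady () {
  if (!ready) {
    ready = (async () => {
      await faceapi.tf.setBackend('tensorflow');
      await faceapi.tf.ready();
      await faceapi.nets.ssdMobilenetv1.loadFromDisk('./model'); // assumed path
    })();
  }
  return ready;
}

async function analyze (frame) {
  await ensureReady(); // resolves instantly on every call after the first
  // ...run detection on `frame` here...
}
```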
I don't see the above worker process. src/multi-process.js & src/multi-process-worker.js do not exist at https://github.com/vladmandic/faceapi-workers. Am I missing something?
I do not see them as well. You are not missing something - I was giving you something else to peck at while vladmandic gets back to us. @vladmandic really needs a donations page. I owe the man an 18-pack/box of beer at this point.
@Nuzzlet My bad - I committed changes but didn't do a push to github. It's online now. @meeki007 The new workflow is different and a bit more advanced:
Btw, I intentionally didn't use promises and stayed with async/await for readability. @Nuzzlet Regarding memory, I have no issues with memory consumption even with 32 workers processing 24MP images in parallel.
I've added code to measure worker initialization time and ipc messaging latency - it's not much (~500ms init and ~2ms latency). Further optimizations to the multi-processing approach would:
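A sketch of how such a latency measurement might look (worker.js is hypothetical and assumed to echo each message back):

```js
// sketch: measure ipc round-trip latency to a forked worker
const { fork } = require('child_process');

const worker = fork('./worker.js');
const start = process.hrtime.bigint();
worker.once('message', () => {
  const elapsed = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`ipc round-trip: ${elapsed.toFixed(2)} ms`);
  worker.kill();
});
worker.send({ ping: true });
```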
Here I have adapted your code into my system that allows input of a video file. It's VERY rough. I used promises to simplify the code significantly from what you provided, at the expense of readability, and there are no comments, but you're welcome to explore. Also, I created an issue here: tensorflow/tfjs#4362 about my vRAM issue. I cannot fully test this code until I find a fix/workaround. Creating a new tensor from a buffer seems to have some memory leak that prevents multiple threads from working at all. However, when running on the CPU, I also found multithreading yielded significant performance boosts. In the meantime, I'm going to try to come up with an alternative to loading an image into a tensor - maybe another process with CPU-only (no gpu) access to run decode on.
@Nuzzlet just brainstorming out loud: your problem could be due to how gpu memory is handled in general, not an actual memory leak. tfjs does not deallocate memory immediately - it relies on the js engine to perform garbage collection, triggered on thresholds. but for the gpu, measuring used memory is unreliable (you get sizeof the handle instead of sizeof the actual allocated memory), so the threshold does not get triggered. instead, garbage collection becomes a time-based operation, and you exhaust memory before it even triggers. as a test, try processing 10-20 frames and then pausing for 1-2 sec to see if memory usage goes down - if it does, that means it's definitely a garbage collection issue. additionally, i'd check for any possible v8 flags that could be passed to nodejs to trigger different garbage collection behavior in nodejs's js engine itself.
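A sketch of that pause test - tf.memory() reports tfjs tensor and byte counts, and global.gc() is only available when node is started with --expose-gc (the frame loop and analyzeFrame are assumed):

```js
// sketch: process frames in chunks, pause so garbage collection can run,
// and log tfjs memory stats to see whether usage actually drops
const tf = require('@tensorflow/tfjs-node-gpu');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function gcProbe (frames, analyzeFrame) {
  for (let i = 0; i < frames.length; i++) {
    await analyzeFrame(frames[i]);
    if (i > 0 && i % 20 === 0) {
      await sleep(1500);           // 1-2 sec idle window for gc to kick in
      if (global.gc) global.gc();  // only when node runs with --expose-gc
      console.log(`frame ${i}:`, tf.memory());
    }
  }
}
```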
vRAM is filled the moment decodeImage is run. Calling tensor.dispose and then setting a 20-second timeout does not change anything. However, it may be what you say, because when running only one worker it does not have any memory issues, even though vRAM still appears filled.
the thing is, TFJS implemented it that way because allocating and deallocating memory on GPU is costly, so it tries to re-use already allocated memory. I suspect that is the case with decodeImage as well.

And garbage collection is late to kick in because a call to allocate on CPU returns the actual memory size, so it knows when the threshold is hit. But with GPU memory, allocate returns just a handle, not the actual memory size, so the garbage collection trigger stays below the threshold. Having a 1-2 sec delay every xx frames would allow garbage collection to kick in based on the idle timer - an alternative to the high threshold.

Again, this is just my brainstorming based on how tfjs handles memory.
I think the answer here is tricking tfjs-node-gpu into thinking there is only, say, 2GB of vRAM available per backend instance. That way there is enough left over for each thread.
Yes, that would be a solution if tfjs had accurate measurements of GPU memory usage. Unfortunately, it doesn't (I have yet another tfjs issue open for that). As it is, since it doesn't reliably know how much is available, it doesn't know when to stop and do garbage collection. All-in-all, GPU memory handling in TFJS is quite buggy. Btw, I have some more complex models (250MB weights and ~10k tensors) that fail to execute a single pass in 4GB of GPU RAM, but work perfectly with only 350-500MB of CPU RAM.
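One knob that may be worth trying per worker process (whether it helps here is an assumption on my part, but TF_FORCE_GPU_ALLOW_GROWTH is a standard TensorFlow environment variable honored by the libtensorflow binary that tfjs-node-gpu binds to):

```js
// sketch: ask TensorFlow to allocate gpu memory incrementally instead of
// grabbing nearly all vRAM at startup - must be set before tfjs is loaded
process.env.TF_FORCE_GPU_ALLOW_GROWTH = 'true';
const tf = require('@tensorflow/tfjs-node-gpu');
```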
Would you like me to try them on my GPUs with 8GB vRAM? Also, it doesn't seem like tfjs-node-gpu takes advantage of two GPUs at all. *correction: tfjs-node-gpu appears to fill both GPUs' vRAM.
already did - they work, but I need to trigger garbage collection between passes on …
Does this apply only to tfjs-node-gpu, or to WebGL as well when using the browser?
fyi, the tfjs team is working on a new backend for nodejs - …
I'll check out master and try to do a build of …

Would it be possible to reuse the same backend with multiple processes (there are no threads in JS)? In theory. I'd have to do manual backend registration so it gets a unique identifier per worker while still using the same actual backend. Will see.
fyi, feel free to open a new issue in the future. |
I run this function:

for hundreds of frames of a video. When doing so, it takes about half the time as when using only tfjs-node. It fills the vRAM of my two 1080s; however, the only usage that changes when I start the script is GPU 1's copy utilization, which goes to about 15%.
I want to take better advantage of my hardware and run this faster.
Windows 10, CUDA 10, TFJS 2.7.0