The current human pose viewer looks human-ish and runs in high definition (768x768); however, it is not pleasant to look at and is slow (~70 ms per frame with WebGPU on an RTX 3070 Ti), so we need to devise better models that use fewer resources.
Instead of our current U-NET, which weighs around 100 MB (float16) and involves many operations, we could use smaller, less accurate architectures such as SqueezeNet. Since these will be faster but less accurate, we could train multiple networks, similar to a diffusion process, that iteratively improve the output quality given context. Then, at inference time, since we strive for real-time translation, we could perform as many refinement iterations as the device allows while maintaining the target frame rate, based on heuristics.
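The "as many iterations as the frame rate allows" heuristic could start as simple wall-clock budgeting: keep running refinement passes until another pass of the same cost would miss the frame deadline. A minimal sketch, where `refine` and its ~2 ms cost are hypothetical stand-ins for one pass of a small refinement network:

```python
import time

FRAME_BUDGET_S = 1 / 30  # target ~30 fps

def refine(frame):
    """Stand-in for one cheap refinement-network pass (hypothetical)."""
    time.sleep(0.002)  # pretend one pass costs ~2 ms on this device
    return frame

def refine_within_budget(frame, budget_s=FRAME_BUDGET_S, max_iters=8):
    """Run as many refinement passes as fit in the frame's time budget."""
    start = time.perf_counter()
    iters = 0
    while iters < max_iters:
        step_start = time.perf_counter()
        frame = refine(frame)
        iters += 1
        step_cost = time.perf_counter() - step_start
        # Stop if another pass of the same cost would blow the budget.
        if time.perf_counter() - start + step_cost > budget_s:
            break
    return frame, iters
```

A real implementation would likely smooth the per-pass cost estimate across frames instead of re-measuring inside a single frame.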
Finally, there needs to be an upscaling model. 768x768 might be unnecessary, and 512x512 may be enough. This also means the original U-NETs don't need to work at 256x256. We might even optimize over 64x64 latent-space tensors and let the "upscaling" model turn them into a nice video, or generate the face, body, and hands independently at low resolution (64x64), stitch them on top of each other, and "upscale" to fix the colors and imperfections. (The upscaling model can be an autoencoder - https://nn.labml.ai/diffusion/stable_diffusion/model/autoencoder.html)
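The "stitch them on top of each other" step is just patch compositing before the upscale pass. A toy sketch with plain lists standing in for image tensors (the 64x64 body / 16x16 face sizes and placement are made up for illustration):

```python
# Hypothetical sketch: composite independently generated low-res parts
# (face, body, hands) onto one canvas before a learned "upscale" pass.
def paste(canvas, patch, top, left):
    """Overwrite a region of `canvas` (list of rows) with `patch`."""
    for dy, row in enumerate(patch):
        for dx, value in enumerate(row):
            canvas[top + dy][left + dx] = value
    return canvas

body = [["b"] * 64 for _ in range(64)]  # 64x64 body render
face = [["f"] * 16 for _ in range(16)]  # 16x16 face render
# Copy the body, then paste the face roughly where the head would be.
canvas = paste([row[:] for row in body], face, top=4, left=24)
```

The "upscale" model would then see the composited canvas, letting it blend seams and fix colors at the patch borders.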
What's clear is that there needs to be:

- A complex training pipeline, to train all these models on real and predicted data
- A complex inference pipeline, to estimate how much inference we can perform on a given machine to strike a looks/speed balance
Alternatives
Striving to work on mobile devices, we could ignore the web platform and focus only on optimizations for specific silicon. (#25)
Another optimization route is batching to speed up inference. On WebGL, batches don't seem to matter much, but on WebGPU they seem to yield a 5-10x speed improvement, depending on batch size. We need to "learn" how much we can batch on a given device while keeping real-time performance, and how to buffer many frames as quickly as possible.
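"Learning" the batch size could start as a one-off calibration pass per device: time a few candidate batch sizes and keep the largest one that still meets the real-time budget. A sketch with a simulated cost model (the 20 ms dispatch overhead + 5 ms per frame numbers are invented, loosely mimicking the "batches help a lot on WebGPU" observation):

```python
def pick_batch_size(time_batch, budget_s, candidates=(1, 2, 4, 8, 16, 32)):
    """Return the largest batch whose per-frame latency still fits the
    real-time budget. `time_batch(n)` measures seconds to infer n frames."""
    best = 1
    for n in candidates:
        total = time_batch(n)
        per_frame = total / n
        # The whole batch must also finish before its frames are due.
        if per_frame <= budget_s and total <= n * budget_s:
            best = n
    return best

# Simulated cost: fixed dispatch overhead dominates small batches, so
# larger batches amortize it and win on per-frame latency.
simulated = lambda n: 0.020 + 0.005 * n
print(pick_batch_size(simulated, budget_s=1 / 30))  # → 32
```

Bigger batches trade latency for throughput, so a real heuristic would also cap the batch by how many frames we can afford to buffer ahead.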
Then generate a latent vector Z, generate a photo, run pose estimation on the output, and train a system to translate between the pose vector and Z. This way, we can have a small LSTM network that does this for every type of pose estimation.
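That bootstrapping loop (sample Z → render → pose-estimate → fit a pose-to-Z translator) can be illustrated with a linear translator standing in for the proposed LSTM; everything here is synthetic, with a random linear map playing the role of "generator + pose estimator" and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, POSE_DIM, N = 16, 50, 1000  # hypothetical sizes

# Pretend the generator + pose estimator jointly induce a noisy linear
# map from latent Z to pose vectors; in reality both are deep networks.
A = rng.normal(size=(POSE_DIM, Z_DIM))
Z = rng.normal(size=(N, Z_DIM))
poses = Z @ A.T + 0.01 * rng.normal(size=(N, POSE_DIM))

# Fit the inverse translator pose -> Z by least squares.
W, *_ = np.linalg.lstsq(poses, Z, rcond=None)
Z_hat = poses @ W
mean_err = float(np.abs(Z_hat - Z).mean())  # small reconstruction error
```

The appeal of this setup is that the training pairs are free: every sampled Z yields a (pose, Z) pair, per pose-estimation flavor, without any labeled data.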
Now that speed has been addressed - on an Apple M1 Max MacBook Pro it takes 30 ms/frame (33 fps) - we move on to fixing the model output, which was trained on OpenPose but is run on MediaPipe at inference time.
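Bridging that train/inference mismatch likely means converting MediaPipe Pose landmarks into the OpenPose-style keypoints the model was trained on. A sketch of such a mapping - the index table below reflects OpenPose BODY_25 and MediaPipe Pose conventions as I understand them and should be verified against both libraries' documentation; `Neck` and `MidHip` are synthesized as midpoints since MediaPipe has no such landmarks:

```python
# MediaPipe Pose landmark index -> OpenPose BODY_25 index (face/torso/limbs).
MP_TO_OP = {
    0: 0,                         # nose
    12: 2, 14: 3, 16: 4,          # right shoulder / elbow / wrist
    11: 5, 13: 6, 15: 7,          # left shoulder / elbow / wrist
    24: 9, 26: 10, 28: 11,        # right hip / knee / ankle
    23: 12, 25: 13, 27: 14,       # left hip / knee / ankle
    5: 15, 2: 16, 8: 17, 7: 18,   # right eye, left eye, right ear, left ear
}

def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

def mediapipe_to_openpose(mp):
    """mp: list of 33 (x, y) MediaPipe landmarks -> dict of BODY_25 points."""
    op = {op_i: mp[mp_i] for mp_i, op_i in MP_TO_OP.items()}
    op[1] = midpoint(mp[11], mp[12])  # Neck = shoulder midpoint
    op[8] = midpoint(mp[23], mp[24])  # MidHip = hip midpoint
    return op
```

An alternative to remapping at inference time would be re-rendering the training poses with MediaPipe's skeleton and fine-tuning the model on those.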
(video attachment: pix2pix_oct_op_2023.mp4)
Once the new model is out, I'll consider this issue solved for what it is, and we will need to consider better models (like the StyleGAN suggestion above) in a new issue.
Description
We currently use a GAN to train a U-NET from poses to people (https://github.com/sign-language-processing/everybody-sign-now). It is 2D, working frame-by-frame, with an LSTM to share the context state at the bottleneck of the U-NET.