The current human pose viewer looks human-ish and runs in high definition (768x768); however, it is not pleasant to look at and is slow (~70 ms per frame with WebGPU on an RTX 3070 Ti), so we need to devise better models that use fewer resources.
Instead of our current U-NET, which weighs around 100 MB (float16) and involves many operations, we could use smaller, less accurate architectures such as SqueezeNet. Since these will be faster but less accurate, we could train multiple networks, similar to a diffusion process, that iteratively improve the output quality given context. Then, at inference time, since we strive for real-time translation, we could perform as many refinement iterations as the device allows while maintaining the target frame rate, based on heuristics.
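The "as many iterations as the frame rate allows" heuristic could start as simple wall-clock budgeting: keep running refinement passes until another pass of the same cost would miss the frame deadline. A minimal sketch, where `refine` and its ~2 ms cost are hypothetical stand-ins for one pass of a small refinement network:

```python
import time

FRAME_BUDGET_S = 1 / 30  # target ~30 fps

def refine(frame):
    """Stand-in for one cheap refinement-network pass (hypothetical)."""
    time.sleep(0.002)  # pretend one pass costs ~2 ms on this device
    return frame

def refine_within_budget(frame, budget_s=FRAME_BUDGET_S, max_iters=8):
    """Run as many refinement passes as fit in the frame's time budget."""
    start = time.perf_counter()
    iters = 0
    while iters < max_iters:
        step_start = time.perf_counter()
        frame = refine(frame)
        iters += 1
        step_cost = time.perf_counter() - step_start
        # Stop if another pass of the same cost would blow the budget.
        if time.perf_counter() - start + step_cost > budget_s:
            break
    return frame, iters
```

A real implementation would likely smooth the per-pass cost estimate across frames instead of re-measuring inside a single frame.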
Finally, there needs to be an upscaling model. 768x768 might be unnecessary, and 512x512 may be enough. This also means the original U-NETs don't need to work at 256x256. We might even optimize over 64x64 latent-space tensors and let the "upscaling" model turn them into a nice video, or generate the face, body, and hands independently at low resolution (64x64), stitch them on top of each other, and "upscale" to fix the colors and imperfections. (The upscaling model can be an autoencoder - https://nn.labml.ai/diffusion/stable_diffusion/model/autoencoder.html)
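The "stitch them on top of each other" step is just patch compositing before the upscale pass. A toy sketch with plain lists standing in for image tensors (the 64x64 body / 16x16 face sizes and placement are made up for illustration):

```python
# Hypothetical sketch: composite independently generated low-res parts
# (face, body, hands) onto one canvas before a learned "upscale" pass.
def paste(canvas, patch, top, left):
    """Overwrite a region of `canvas` (list of rows) with `patch`."""
    for dy, row in enumerate(patch):
        for dx, value in enumerate(row):
            canvas[top + dy][left + dx] = value
    return canvas

body = [["b"] * 64 for _ in range(64)]  # 64x64 body render
face = [["f"] * 16 for _ in range(16)]  # 16x16 face render
# Copy the body, then paste the face roughly where the head would be.
canvas = paste([row[:] for row in body], face, top=4, left=24)
```

The "upscale" model would then see the composited canvas, letting it blend seams and fix colors at the patch borders.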
What's clear is that there needs to be:

- A complex training pipeline, to train all these models on real and predicted data
- A complex inference pipeline, to estimate how much inference we can perform on a given machine to strike a looks/speed balance
Alternatives
Striving to work on mobile devices, we could ignore the web platform and focus only on optimizations for specific silicon. (#25)
Another optimization route is batching to speed up inference. On WebGL, batches don't seem to matter much, but on WebGPU they seem to yield a 5-10x speed improvement, depending on batch size. We need to "learn" how much we can batch on a given device while keeping real-time performance, and how to buffer many frames as quickly as possible.
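"Learning" the batch size could start as a one-off calibration pass per device: time a few candidate batch sizes and keep the largest one that still meets the real-time budget. A sketch with a simulated cost model (the 20 ms dispatch overhead + 5 ms per frame numbers are invented, loosely mimicking the "batches help a lot on WebGPU" observation):

```python
def pick_batch_size(time_batch, budget_s, candidates=(1, 2, 4, 8, 16, 32)):
    """Return the largest batch whose per-frame latency still fits the
    real-time budget. `time_batch(n)` measures seconds to infer n frames."""
    best = 1
    for n in candidates:
        total = time_batch(n)
        per_frame = total / n
        # The whole batch must also finish before its frames are due.
        if per_frame <= budget_s and total <= n * budget_s:
            best = n
    return best

# Simulated cost: fixed dispatch overhead dominates small batches, so
# larger batches amortize it and win on per-frame latency.
simulated = lambda n: 0.020 + 0.005 * n
print(pick_batch_size(simulated, budget_s=1 / 30))  # → 32
```

Bigger batches trade latency for throughput, so a real heuristic would also cap the batch by how many frames we can afford to buffer ahead.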
Then generate a latent vector Z, generate a photo, run pose estimation on the output, and train a system to translate between the pose vector and Z. This way, we can have a small LSTM network that does this for every type of pose estimation.
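That bootstrapping loop (sample Z → render → pose-estimate → fit a pose-to-Z translator) can be illustrated with a linear translator standing in for the proposed LSTM; everything here is synthetic, with a random linear map playing the role of "generator + pose estimator" and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, POSE_DIM, N = 16, 50, 1000  # hypothetical sizes

# Pretend the generator + pose estimator jointly induce a noisy linear
# map from latent Z to pose vectors; in reality both are deep networks.
A = rng.normal(size=(POSE_DIM, Z_DIM))
Z = rng.normal(size=(N, Z_DIM))
poses = Z @ A.T + 0.01 * rng.normal(size=(N, POSE_DIM))

# Fit the inverse translator pose -> Z by least squares.
W, *_ = np.linalg.lstsq(poses, Z, rcond=None)
Z_hat = poses @ W
mean_err = float(np.abs(Z_hat - Z).mean())  # small reconstruction error
```

The appeal of this setup is that the training pairs are free: every sampled Z yields a (pose, Z) pair, per pose-estimation flavor, without any labeled data.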
Now that speed has been addressed - on an Apple M1 Max MacBook Pro it takes 30 ms/frame (33 fps) - we move on to fixing the model output, which was trained on OpenPose but is run on MediaPipe at inference time.
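Bridging that train/inference mismatch likely means converting MediaPipe Pose landmarks into the OpenPose-style keypoints the model was trained on. A sketch of such a mapping - the index table below reflects OpenPose BODY_25 and MediaPipe Pose conventions as I understand them and should be verified against both libraries' documentation; `Neck` and `MidHip` are synthesized as midpoints since MediaPipe has no such landmarks:

```python
# MediaPipe Pose landmark index -> OpenPose BODY_25 index (face/torso/limbs).
MP_TO_OP = {
    0: 0,                         # nose
    12: 2, 14: 3, 16: 4,          # right shoulder / elbow / wrist
    11: 5, 13: 6, 15: 7,          # left shoulder / elbow / wrist
    24: 9, 26: 10, 28: 11,        # right hip / knee / ankle
    23: 12, 25: 13, 27: 14,       # left hip / knee / ankle
    5: 15, 2: 16, 8: 17, 7: 18,   # right eye, left eye, right ear, left ear
}

def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

def mediapipe_to_openpose(mp):
    """mp: list of 33 (x, y) MediaPipe landmarks -> dict of BODY_25 points."""
    op = {op_i: mp[mp_i] for mp_i, op_i in MP_TO_OP.items()}
    op[1] = midpoint(mp[11], mp[12])  # Neck = shoulder midpoint
    op[8] = midpoint(mp[23], mp[24])  # MidHip = hip midpoint
    return op
```

An alternative to remapping at inference time would be re-rendering the training poses with MediaPipe's skeleton and fine-tuning the model on those.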
(video attachment: pix2pix_oct_op_2023.mp4)
Once the new model is out, I'll consider this issue solved for what it is, and we will need to consider better models (like the StyleGAN suggestion above) in a new issue.
Description
We currently use a GAN to train a U-NET from poses to people (https://github.com/sign-language-processing/everybody-sign-now). It is 2D, working frame-by-frame, with an LSTM to share the context state at the bottleneck of the U-NET.