Human Pose Viewer - Faster, Prettier, Generic Implementation #58

Closed

AmitMY opened this issue Oct 12, 2022 · 3 comments

Labels: enhancement (New feature or request), spoken-to-signed
AmitMY commented Oct 12, 2022

Problem

The current human pose viewer looks human-ish and renders in high definition (768x768); however, it is not pleasant to look at and it runs slowly (~70 ms per frame with WebGPU on a 3070 Ti), so we need to devise better models that use fewer resources.

t.mp4

Description

We currently use a GAN to train a U-Net from poses to people (https://github.com/sign-language-processing/everybody-sign-now). It is 2D and works frame-by-frame, with an LSTM at the bottleneck of the U-Net to share context state across frames.
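For reference, here is a minimal sketch of that kind of architecture, assuming PyTorch, a toy two-level U-Net, and 25-channel pose heat-maps as input; the actual everybody-sign-now model differs in depth, channel counts, and training details.

```python
import torch
import torch.nn as nn

class PoseToImageUNet(nn.Module):
    """Toy U-Net: pose heat-maps in, RGB frames out, with an LSTM sharing
    temporal context at the bottleneck (illustrative sketch only)."""

    def __init__(self, pose_channels=25, hidden=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(pose_channels, hidden, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.ReLU())
        # LSTM over a pooled bottleneck vector carries state across frames
        self.lstm = nn.LSTM(input_size=hidden * 2, hidden_size=hidden * 2, batch_first=True)
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(hidden * 2, 3, 4, stride=2, padding=1)

    def forward(self, pose_frames):
        # pose_frames: (batch, time, pose_channels, H, W)
        b, t, c, h, w = pose_frames.shape
        x = pose_frames.reshape(b * t, c, h, w)
        e1 = self.enc1(x)                                # (b*t, hidden,   H/2, W/2)
        e2 = self.enc2(e1)                               # (b*t, hidden*2, H/4, W/4)
        pooled = e2.mean(dim=(2, 3)).reshape(b, t, -1)   # one bottleneck vector per frame
        ctx, _ = self.lstm(pooled)                       # temporal context across frames
        e2 = e2 + ctx.reshape(b * t, -1, 1, 1)           # broadcast context onto feature map
        d2 = self.dec2(e2)
        out = self.dec1(torch.cat([d2, e1], dim=1))      # skip connection from the encoder
        return torch.tanh(out).reshape(b, t, 3, h, w)

frames = torch.randn(1, 4, 25, 128, 128)   # 4 frames of 25-channel pose maps
video = PoseToImageUNet()(frames)          # (1, 4, 3, 128, 128)
```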

Instead of our current U-Net, which weighs around 100 MB (float16) and uses many operations, we could use smaller, less accurate architectures such as SqueezeNet. Since these would be faster but less accurate, we could train multiple networks, similar to a diffusion process, that iteratively improve the output quality given context. Then, at inference time, since we strive for real-time translation, we could perform as many refinement iterations as the target frame rate allows, based on heuristics.
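As a sketch of what one step in such a cascade could look like, each small network could take the current frame estimate plus the pose conditioning and predict a residual correction. The `RefinerStep` name and layer sizes below are illustrative, not existing code.

```python
import torch
import torch.nn as nn

class RefinerStep(nn.Module):
    """One small network in a cascade of refiners: given the current frame
    estimate and the pose conditioning, predict a residual correction."""

    def __init__(self, pose_channels=25, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + pose_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, estimate, pose):
        # estimate: (B, 3, H, W) current guess; pose: (B, pose_channels, H, W)
        return estimate + self.net(torch.cat([estimate, pose], dim=1))
```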

Finally, there needs to be an upscaling model. 768x768 might be unnecessary, and 512x512 may be enough; that also means the underlying U-Nets don't necessarily need to work at 256x256. It may even be that we optimize 64x64 latent-space tensors and the "upscaling" model turns them into a nice video, or that we generate the face, body, and hands independently in low resolution (64x64), stitch them on top of each other, and "upscale" to fix the color and imperfections. (The upscaling model can be an autoencoder - https://nn.labml.ai/diffusion/stable_diffusion/model/autoencoder.html)
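A rough sketch of the stitch-then-upscale variant, assuming each body part is generated at 64x64 and pasted at a known box on the canvas; `compose_and_upscale` and the `upscaler` argument are hypothetical names, the latter standing in for any learned model such as an autoencoder decoder.

```python
import torch
import torch.nn.functional as F

def compose_and_upscale(parts, boxes, upscaler, canvas_size=512):
    """Paste independently generated 64x64 crops (face, body, hands) onto one
    canvas, then let a learned model fix colors and seams (illustrative only).

    parts:    dict of name -> (3, 64, 64) tensor
    boxes:    dict of name -> (x, y, w, h) placement in canvas pixels
    upscaler: any module mapping (1, 3, H, W) -> (1, 3, H, W)
    """
    canvas = torch.zeros(1, 3, canvas_size, canvas_size)
    for name, crop in parts.items():
        x, y, w, h = boxes[name]
        resized = F.interpolate(crop.unsqueeze(0), size=(h, w),
                                mode="bilinear", align_corners=False)
        canvas[:, :, y:y + h, x:x + w] = resized
    return upscaler(canvas)  # fixes color mismatches and seams between parts
```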

What's clear is that we need:

  • A complex training pipeline to train all these models, based on both real and predicted data
  • A complex inference pipeline to estimate how much inference we can perform on a given machine, striking a looks/speed balance (see the sketch after this list)
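As a rough illustration of the inference-side heuristic, the loop below keeps applying refinement steps until the per-frame time budget is spent. The 33 ms budget and the list of `RefinerStep` modules from the earlier sketch are assumptions, not measurements of the actual pipeline.

```python
import time
import torch

@torch.no_grad()
def render_frame(pose, refiners, initial_estimate, budget_ms=33.0):
    """Apply as many cascade steps as fit in the frame-time budget."""
    estimate = initial_estimate
    deadline = time.perf_counter() + budget_ms / 1000.0
    for step in refiners:                    # e.g. a list of RefinerStep modules
        estimate = step(estimate, pose)
        if time.perf_counter() >= deadline:  # out of time: ship what we have
            break
    return estimate
```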

Alternatives

Striving to work on mobile devices, we could ignore the web platform and focus only on optimizations for specific silicon. (#25)

Another optimization route is using batches to speed up inference. On WebGL, batches don't seem to matter much, but on WebGPU they yield a 5-10x speed improvement, depending on batch size and other factors. We need to "learn" how large a batch a given device can handle while keeping real-time performance, and how to buffer many frames as quickly as possible.
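One way to "learn" a safe batch size is to time a few trial batches on the target device and keep the largest one that still meets the per-frame budget. The sketch below assumes a Python/PyTorch setting for illustration; on the web the same idea would be implemented against the JS inference runtime, and the 33 ms target is a placeholder.

```python
import time
import torch

@torch.no_grad()
def calibrate_batch_size(model, sample_input, target_ms_per_frame=33.0, max_batch=64):
    """Time trial batches and return the largest batch size whose per-frame
    cost still meets the real-time target (placeholder heuristic)."""
    best = 1
    for batch in (1, 2, 4, 8, 16, 32, 64):
        if batch > max_batch:
            break
        x = sample_input.repeat(batch, *([1] * (sample_input.dim() - 1)))
        start = time.perf_counter()
        model(x)
        per_frame_ms = (time.perf_counter() - start) * 1000.0 / batch
        if per_frame_ms <= target_ms_per_frame:
            best = batch
    return best
```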

AmitMY added the enhancement and spoken-to-signed labels Oct 12, 2022
AmitMY added this to To do in Development via automation Oct 12, 2022

AmitMY commented Mar 8, 2023

One extra interesting option is training a StyleGAN
(https://github.com/autonomousvision/stylegan-xl),
which has almost the same number of parameters in the generator.

Then we would generate a latent vector Z, generate a photo, run pose estimation on the output, and train a system to translate between the pose vector and Z. This way, we could have a small LSTM network that does this for every type of pose estimation.
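A sketch of that idea, where `generator` and `pose_estimator` are stand-ins for StyleGAN-XL and a pose-estimation model (OpenPose/MediaPipe), and the dimensions are illustrative defaults:

```python
import torch
import torch.nn as nn

def build_pose_to_latent_dataset(generator, pose_estimator, n_samples=10_000, z_dim=512):
    """Sample Z, render a photo, estimate its pose, keep the (pose, z) pair."""
    pairs = []
    with torch.no_grad():
        for _ in range(n_samples):
            z = torch.randn(1, z_dim)
            image = generator(z)          # (1, 3, H, W) synthetic person
            pose = pose_estimator(image)  # (1, pose_dim) keypoint vector
            pairs.append((pose.squeeze(0), z.squeeze(0)))
    return pairs

class PoseToLatent(nn.Module):
    """Small LSTM mapping a sequence of pose vectors to per-frame latent codes Z."""

    def __init__(self, pose_dim=128, z_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, poses):             # poses: (batch, time, pose_dim)
        out, _ = self.lstm(poses)
        return self.head(out)             # (batch, time, z_dim)
```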


AmitMY commented Nov 10, 2023

Now that speed has been addressed (30 ms/frame, i.e. ~33 fps, on an Apple M1 Max MacBook Pro), we move on to fixing the model output, which was trained on OpenPose but is given MediaPipe poses at inference time.

pix2pix_oct_op_2023.mp4

Once the new model is out, I'll consider this issue solved for what it is; better models (like the StyleGAN suggestion above) will need to be considered in a new issue.


AmitMY commented Jan 15, 2024

Latest pix2pix trained on MediaPipe
test.webm

AmitMY closed this as completed Jan 15, 2024
Development automation moved this from To do to Done Jan 15, 2024