Repository code and models are now available at https://github.com/sign-language-processing/pose-to-video.
This repository aims to train a real-time, in-browser image-to-image translation model from pose estimation output to video frames.
The model code is a port of TensorFlow's pix2pix tutorial.
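For orientation, that tutorial builds the generator as a U-Net from stacked downsample/upsample blocks. The sketch below reproduces those building blocks from the public tutorial; the actual code in this repository may differ in details:

```python
import tensorflow as tf

def downsample(filters, size, apply_batchnorm=True):
    # Conv -> (BatchNorm) -> LeakyReLU, halving the spatial resolution
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2D(filters, size, strides=2, padding='same',
                                     kernel_initializer=initializer, use_bias=False))
    if apply_batchnorm:
        block.add(tf.keras.layers.BatchNormalization())
    block.add(tf.keras.layers.LeakyReLU())
    return block

def upsample(filters, size, apply_dropout=False):
    # Transposed conv -> BatchNorm -> (Dropout) -> ReLU, doubling the spatial resolution
    initializer = tf.random_normal_initializer(0., 0.02)
    block = tf.keras.Sequential()
    block.add(tf.keras.layers.Conv2DTranspose(filters, size, strides=2, padding='same',
                                              kernel_initializer=initializer, use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block
```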
We have recorded high-resolution green-screen videos of:
- Maayan Gazuli (`Maayan_1`, `Maayan_2`) - Israeli Sign Language interpreter
- Amit Moryossef (`Amit`) - project author
These videos are open for anyone to use for the purpose of sign language video generation.
This repository does not support any input other than images. By default, every image of Maayan is rendered on a red background (255, 200, 200), and every image of Amit is rendered on a blue background (200, 200, 255).
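A minimal sketch of how such person-specific backgrounds could be composited onto keyed (RGBA) frames before they reach the model; the helper name and pipeline here are illustrative, not the repository's actual code:

```python
import numpy as np

# Illustrative only: signer-specific background colors as described above
BACKGROUNDS = {"Maayan": (255, 200, 200), "Amit": (200, 200, 255)}

def composite_on_background(rgba_frame: np.ndarray, signer: str) -> np.ndarray:
    # Alpha-blend a keyed HxWx4 frame onto the signer's flat background color
    rgb = rgba_frame[..., :3].astype(np.float32)
    alpha = rgba_frame[..., 3:4].astype(np.float32) / 255.0
    background = np.array(BACKGROUNDS[signer], dtype=np.float32)
    return (rgb * alpha + background * (1.0 - alpha)).astype(np.uint8)
```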
- The videos were recorded in ProRes and converted to mp4 using `ffmpeg` (an illustrative command for this step is shown below).
- Then, using Final Cut Pro X, the green screen was removed with the keying effect, and the result was exported for "desktop".
- Finally, the FCPX export was processed again by `ffmpeg` to reduce its size (3.5GB -> 250MB):
```bash
ffmpeg -i CAM3_output.mp4 -qscale 0 CAM3_norm.mp4
```
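The exact flags used for the first step (ProRes to mp4) are not recorded here; a plausible invocation would look something like:

```bash
# Illustrative only: transcode a ProRes master to an H.264 mp4
ffmpeg -i CAM3.mov -c:v libx264 -crf 18 -c:a aac CAM3_output.mp4
```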
Download the data from https://nlp.biu.ac.il/~amit/datasets/GreenScreen/, or use the command line:
```bash
wget --no-clobber --convert-links --random-wait -r -p --level 3 -E -e robots=off --adjust-extension -U mozilla "https://nlp.biu.ac.il/~amit/datasets/GreenScreen/"
```
```bash
cd /home/nlp/amit/WWW/datasets/GreenScreen/mp4/Amit && gdown --folder --continue --id 1X1GuGMPHm4Sty9hr7Goxbbig5KpBE7p1
cd /home/nlp/amit/WWW/datasets/GreenScreen/mp4/Maayan_1 && gdown --folder --continue --id 1X4-LagvS2JWm9xyOg5t2QAvP1nDxt3Vr
cd /home/nlp/amit/WWW/datasets/GreenScreen/mp4/Maayan_2 && gdown --folder --continue --id 1XBz8NrRomAU506q7xYZUWkXEw_yVz5YD
```
Run `python -m everybody_sign_now.train`.
Training is currently performed on CPU, at roughly 5 minutes per 1000 steps.
It will train for a long while, logging each epoch's result in a `progress` directory.
Once you are satisfied with the results, the script can be killed.
- Add an LSTM to the `pix2pix` state, to introduce temporal coherence with very little additional compute
- Add another upsampler, from `256x256` to `512x512` (a sketch is shown after this list)
- Add a face-specific discriminator
- Add a hand-specific discriminator
- Position the body in a mostly fixed position
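As a sketch of the second item above, an extra `256x256` to `512x512` upsampler could be a small standalone Keras head appended to the generator output; this is illustrative and not part of the current codebase:

```python
import tensorflow as tf

def upscaler_256_to_512():
    # Illustrative 2x upsampling head: 256x256x3 -> 512x512x3
    inputs = tf.keras.layers.Input(shape=[256, 256, 3])
    x = tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding='same', use_bias=False)(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    outputs = tf.keras.layers.Conv2D(3, 4, padding='same', activation='tanh')(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)
```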
Run `./convert_to_tfjs.sh`.

This will create a `web_model` directory with the model in TensorFlow.js format, quantized to `float16`.
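The conversion presumably wraps `tensorflowjs_converter`; an invocation along these lines would produce a float16-quantized web model (the actual script may use different paths and input format):

```bash
# Illustrative only: convert a Keras model to TF.js with float16 weight quantization
tensorflowjs_converter \
  --input_format=keras \
  --quantize_float16 \
  model.h5 \
  web_model
```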
- Can Everybody Sign Now? Exploring Sign Language Video Generation from 2D Poses - evaluates generated videos quantitatively and qualitatively, showing that current models are not sufficient to generate adequate sign language videos
- Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video - a proposal for better pose-to-video generation models, with higher quality and control over the signer's appearance