
A Data-Driven Representation for Sign Language Production

Vector Quantised Sign Language Production

Abstract

Phonetic representations are used when recording spoken languages, but no equivalent exists for recording signed languages. As a result, linguists have proposed several annotation systems that operate on the gloss or sub-unit level; however, these resources are notably irregular and scarce.

Sign Language Production (SLP) aims to automatically translate spoken language sentences into continuous sequences of sign language. However, current state-of-the-art approaches rely on scarce linguistic resources, which has limited progress in the field. This paper introduces an innovative solution that transforms the continuous pose generation problem into a discrete sequence generation problem, thus overcoming the need for costly annotation. Where such annotation is available, however, we leverage the additional information to enhance our approach.

By applying Vector Quantisation (VQ) to sign language data, we first learn a codebook of short motions that can be combined to create a natural sequence of sign, where each token in the codebook can be thought of as an entry in the lexicon of our representation. A transformer then translates spoken language text into a sequence of codebook tokens. Each token can be directly mapped to a sequence of poses, allowing the translation to be performed by a single network. Furthermore, we present a sign stitching method to effectively join tokens together. We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the more challenging Meine DGS Annotated (mDGS) datasets. An extensive evaluation shows our approach outperforms previous methods, increasing the BLEU-1 back-translation score by up to 72%.
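As a rough illustration of the representation described above, the sketch below shows how a pose sequence might be quantised into codebook tokens and decoded back into poses. This is not the authors' implementation: the chunk length, the flattened-motion codebook layout, and the Euclidean nearest-neighbour lookup are assumptions made for the example.

```python
# Minimal sketch (not the paper's code): quantise a pose sequence with a learned
# codebook of short motions, and decode tokens back into poses.
import numpy as np

def quantise(pose_seq: np.ndarray, codebook: np.ndarray, chunk_len: int = 8):
    """Map a (T, D) pose sequence to codebook token indices, one per chunk."""
    T, D = pose_seq.shape
    tokens = []
    for start in range(0, T - chunk_len + 1, chunk_len):
        chunk = pose_seq[start:start + chunk_len].reshape(-1)   # flatten the motion chunk
        dists = np.linalg.norm(codebook - chunk, axis=1)        # distance to every codebook entry
        tokens.append(int(np.argmin(dists)))                    # nearest token wins
    return tokens

def decode(tokens, codebook: np.ndarray, chunk_len: int, dim: int):
    """Concatenate the codebook motions of each token back into a pose sequence."""
    return np.concatenate([codebook[t].reshape(chunk_len, dim) for t in tokens], axis=0)
```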

System overview

Demos

Previous approaches to SLP attempt to regress pose directly from the spoken language. This leads to under-articulated signing, as the generated pose regresses to the mean. In contrast, by first learning a codebook we ensure that our new lexicon is expressive.

Codebook Tokens

Here we present example tokens from the learned codebooks.

RWTH-PHOENIX-Weather-2014T

Meine DGS Annotated

Translation Examples

Here we present translation examples.

Left skeleton - the ground truth extracted from the original videos.

Middle skeleton - the result of applying the codebook to quantise the ground-truth sequence.

Right skeleton - the translation output from the Text-to-Tokens transformer.

Note that in the following examples we show the baseline model without the stitching module. In "Comparison to Progressive Transformer" below, we add the stitching module to show its effectiveness in creating smoother, more natural signing sequences.

RWTH-PHOENIX-Weather-2014T

Failure case:

The PHOENIX14T dataset contains only a single view of the signer. As a result, our pose estimator struggles to capture some high-frequency movements, as can be seen in the ground-truth data. In addition, for longer sequences, the model can struggle to capture all the fine-grained detail in the handshape.

Meine DGS Annotated

Failure case:

In the following examples, the model is able to capture the motion, but the fine detail in the hands is lost during the quantisation step.

Comparison to Progressive Transformer

Here we compare our full approach to the Progressive Transformer. We apply both contrastive learning and the stitching module; hence, the examples show smooth, continuous signing.
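The stitching module itself is not detailed in this demo README; as a hedged illustration of the general idea of smoothing token boundaries, the sketch below cross-fades a few frames between two decoded token motions. The blend length and linear interpolation are assumptions for the example, not the method used in the paper.

```python
# Illustrative sketch only: smooth the seam between two decoded token motions
# by linearly cross-fading a few overlapping frames.
import numpy as np

def stitch(prev: np.ndarray, nxt: np.ndarray, blend: int = 4) -> np.ndarray:
    """Join two (T, D) pose sequences with a linear cross-fade of `blend` frames.

    Assumes both sequences are at least `blend` frames long.
    """
    w = np.linspace(0.0, 1.0, blend)[:, None]              # blend weights ramp from 0 to 1
    seam = (1.0 - w) * prev[-blend:] + w * nxt[:blend]     # interpolate the overlapping frames
    return np.concatenate([prev[:-blend], seam, nxt[blend:]], axis=0)
```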

Codebook PCA plots

As shown by the first plot ("Without Replacement"), the codebook collapses to a single token, meaning it cannot accurately quantise a sequence of sign language. The second plot ("With Replacement") shows that our aggressive replacement strategy helps distribute tokens evenly within the embedding space, allowing for the accurate quantisation shown in the videos above. Finally, as shown in the last plot, our contrastive loss has a significant impact on the embedding space. We suggest that the non-uniform distribution of tokens reflects the model collapsing lexical variants and overcoming signer-dependent features.

Plots (left to right): Without Replacement, With Replacement, With Contrastive Learning
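The exact replacement strategy is described in the paper; the snippet below is only a sketch of the general idea of re-seeding under-used codebook entries from current encoder outputs, one common way to keep tokens spread across the embedding space and avoid the collapse shown in the "Without Replacement" plot. The usage threshold and random re-seeding are illustrative assumptions.

```python
# Sketch of an aggressive codebook-replacement idea (assumed details, not the
# paper's exact algorithm): rarely used codebook rows are overwritten with
# randomly chosen encoder outputs from the current batch.
import numpy as np

def replace_dead_codes(codebook, usage_counts, encoder_outputs, min_usage=1):
    """Re-seed under-used codebook rows from current encoder outputs."""
    dead = np.where(usage_counts < min_usage)[0]            # indices of under-used tokens
    if dead.size:
        # Duplicates are acceptable for this sketch, so sample with replacement.
        idx = np.random.choice(len(encoder_outputs), size=dead.size, replace=True)
        codebook[dead] = encoder_outputs[idx]               # overwrite collapsed entries
        usage_counts[dead] = 0                              # reset their usage statistics
    return codebook, usage_counts
```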

License

Distributed under the Attribution-NonCommercial-ShareAlike 4.0 International License. See LICENSE.txt for more information.

(back to top)
