
Sign Language Datasets

Datasets used by the sign_language_translator Python package.

See the download tree for quick links to videos, landmarks, word mappings & parallel corpus.

  1. Sign Language Datasets
    1. Problem Overview
      1. Sign Recording Options
      2. Translation Dataset Needs
    2. Datasets
    3. Download Tree
    4. How to Contribute
    5. Citation
    6. Glossary

Download via CLI (requires Python):

pip install sign-language-translator
slt download "datasets/.*landmarks.*csv.*\.zip"
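The same assets can also be fetched from Python. A minimal sketch, assuming the Assets helper exposed by recent sign-language-translator releases (the name and signature may differ in your installed version):

    import sign_language_translator as slt

    # Download every zipped CSV landmarks archive (same regex as the CLI).
    slt.Assets.download(r"datasets/.*landmarks.*csv.*\.zip")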

Problem Overview

Sign language is a gesture-based method of communication. In sign languages, the vocabulary is small and many spoken-language words map to the same sign. Sentences are short and contain only the keywords. Each person performs signs with their own accent. Every region has its own sign language, and there are few large-scale standardization efforts.

  1. We can obtain standard dictionaries from reputable organizations in each country and concatenate signs from them using standardized grammar rules to translate & synthesize datasets.
  2. We can record people performing the signs from the dictionary to capture diversity of accents.
  3. We can scrape sign language videos and use deep learning to generate their glosses & translations.

Because most regional languages have very few hours of data, the best approach is to train a many-to-many seq2seq translation model.

Sign Recording Options

Sign language can be represented as:

Videos
  • Videos can consist of individual words, phrases or sentences.
  • Each video can contain just one person or multiple people signing at the same time.
  • Using computer vision, videos can be decomposed into 3D motion vectors of the joints on the body; this preprocessing step reduces bias and noise in the dataset and enables more data augmentation (see the sketch after this list).
Token Sequence + Gesture Dictionary
  1. A sign sequence written out word-for-word in text is called a gloss; it captures the grammar of the sign language.
  2. There are other sign-writing notations, such as HamNoSys, that transcribe the individual movements of the hands, but this project currently uses only word-level tokens.
Motion capture gloves (costly for users & dataset makers)
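As a sketch of the video decomposition mentioned above, per-frame joint coordinates can be extracted with MediaPipe Holistic (assumes the mediapipe and opencv-python packages and a local file sign.mp4; the package's own preprocessing may differ):

    import cv2
    import mediapipe as mp

    holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
    frames = []

    video = cv2.VideoCapture("sign.mp4")
    while True:
        ok, frame = video.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV decodes frames as BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 pose joints, each a normalized (x, y, z) coordinate per frame.
            frames.append([(lm.x, lm.y, lm.z) for lm in results.pose_landmarks.landmark])
    video.release()
    holistic.close()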

Translation Dataset Needs

A translation model requires a parallel corpus of sentences mapped to each other across languages.
  • For each sign language video or sequence of videos, save translations & glosses in multiple text languages.
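A purely illustrative sketch of one such mapping entry as a Python dict (the authoritative structure is schemas/mapping-schema.json; the label and texts below are hypothetical):

    entry = {
        "label": "pk-hfad-1_example",                 # hypothetical video label
        "glosses": {"en": "...", "ur": "..."},        # word-for-word sign order
        "translations": {"en": "...", "hi": "...", "ur": "..."},  # natural sentences
    }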
Sign languages can be modeled as semi-formal languages (a mixture of rule-based language & natural language), so there is an opportunity for synthetic dataset generation (see the sketch after this list):
  • Obtain sign language dictionaries.
  • List all the words, in several text languages, that can be mapped to those videos.
  • Train a language model to write sentences using only the supported words.
  • Translate the generated sentences into gloss (sign labels) using the grammar rules of that regional sign language or a deep learning model.
  • Concatenate the videos corresponding to the tokens in the gloss to synthesize a parallel video.
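The last two steps can be sketched in a few lines of Python, assuming moviepy 1.x and hypothetical dictionary filenames that follow the naming convention described below:

    from moviepy.editor import VideoFileClip, concatenate_videoclips

    # Map supported word tokens to dictionary clips (hypothetical paths; real
    # labels come from parallel_texts/*-dictionary-mapping.json).
    dictionary = {
        "apple": "pk-hfad-1_apple.mp4",
        "eat": "pk-hfad-1_eat(food).mp4",
    }

    def synthesize(gloss_tokens):
        # Concatenate the clip of each token into one parallel video.
        clips = [VideoFileClip(dictionary[token]) for token in gloss_tokens]
        return concatenate_videoclips(clips)

    synthesize(["apple", "eat"]).write_videofile("synthetic-sentence.mp4")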

Datasets

The datasets currently available in the sign_language_translator package are chunked, preprocessed and labeled appropriately. More details on assets can be found in the release description.

Naming conventions:
  1. Dictionaries: country-organization-number_sign-label.mp4
  2. Replications: c*-o*-n*_s*_person-code_camera-angle.mp4
  3. Sentences: c*-o*-n*_gloss[_p*_c*].mp4
  4. Archives: c*-o*-n*[_p*-c*]_category-subcategory-extension.zip
  5. Preprocessed videos: c*-o*-n*_s*[_p*_c*].category-model.ext
  6. Videos without Signs: wordless_wordless_person_camera.mp4
  • The sign labels, tokens & glosses may contain word sense disambiguation wrapped in parentheses, e.g. *_spring(coil).mp4 or *_spring(water-fountain).mp4.
  • Person Codes are of the format [dh][fm]\d+. For example, df0001 stands for deaf-female-0001 and hm0002 means hearing-male-0002.
  • Camera Angles are from (front|below|left|right|top-left|top-right)-\d+x\d+y\d+z. (not finalized yet)
  • Category in preprocessed videos and archives is from (videos|landmarks).
  • Subcategory in an archive name is from (dictionary(-replication)?|sentences(-replication)?|mediapipe-pose-2-hand-1); for preprocessed files it includes the model name.
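Filenames following these conventions can be validated and parsed mechanically. A small sketch for the replication pattern, with the regex assembled from the rules above (the camera-angle part is a guess since that convention is not finalized):

    import re

    REPLICATION = re.compile(
        r"^(?P<country>[a-z]+)-(?P<organization>[a-z0-9]+)-(?P<number>\d+)"
        r"_(?P<label>[^_]+)"                            # may embed a sense: spring(coil)
        r"(?:_(?P<person>[dh][fm]\d+))?"                # df0001 = deaf-female-0001
        r"(?:_(?P<camera>[a-z-]+(?:-\d+x\d+y\d+z)?))?"  # camera angle (not finalized)
        r"\.mp4$"
    )

    print(REPLICATION.match("pk-hfad-1_spring(coil)_df0001_front.mp4").groupdict())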

Statistics:

Sign Language: Pakistan
  Dictionary:          Signs: 776 (27 min) | Word Tokens: en 1584, hi 92, latn-ur 2, ur 2071
  Sentences:           Count: 13 (57 sec) | Translations: en 19, hi 14, latn-ur 13, ur 17 | Glosses: en 14, latn-ur 13, ur 15
  Synthetic Sentences: Count: 1 (7 sec) | Translations: en 2, hi 2, latn-ur 1, ur 2 | Glosses: en 2, latn-ur 1, ur 2
  Replications:        Dictionary: 22 hrs | Sentences: 45 min

Download Tree

sign-language-datasets
├── README.md
├── text-preprocessing.json
├── todo.json
│
├── asset_urls
│   ├── archive-urls.json
│   ├── extra-urls.json
│   └── pk-dictionary-urls.json
│
├── parallel_texts
│   ├── pk-dictionary-mapping.json
│   ├── pk-sentence-mapping.json
│   └── pk-synthetic-sentence-mapping.json
│
└── schemas
    └── mapping-schema.json

Releases
├── v0.0.4 (Landmark Datasets)
│   ├── pk-hfad-1_landmarks-mediapipe-pose-2-hand-1-csv.zip
│   └── pk-hfad-1_landmarks-mediapipe-pose-2-hand-1-json.zip
│
├── v0.0.3 (Video Datasets)
│   └── pk-hfad-1_videos-mp4.zip
│
├── v0.0.2 (Dictionary)
│   ├── pk-hfad-1_*.mp4 [788]
│   ├── pk-hfad-2_*.mp4 [1]
│   └── wordless_wordless.mp4 [1]
│
└── v0.0.1 (Language Models for Dataset generation)
    └── *

How to Contribute

Project Setup:
  1. Clone the repo

    git clone https://github.com/sign-language-translator/sign-language-datasets.git
  2. Configure the JSON schema in your VSCode workspace settings, especially for the *-mapping.json files.

Our Needs:
1. Compile dictionaries
  1. Rename files to follow the convention (country-organization-...; see the renaming sketch after this list).
  2. Upload individual files to v0.0.2 Dictionary release.
  3. Upload zip archive to v0.0.3 Video Datasets release.
  4. Link individual file URLs in asset_urls/*-dictionary-urls.json.
  5. Link archive URLs in asset_urls/archive-urls.json.
  6. Add the text tokens that have the same meaning and can be mapped to these dictionary videos to parallel_texts/*-dictionary-mapping.json.
2. Record Dictionary Videos to capture diverse accents
  1. Rename files to follow the convention (*_person-code_camera-angle*).
  2. Upload zip archive to v0.0.3 Video Datasets release.
  3. Link archive URLs in asset_urls/archive-urls.json.
3. Scrape or Record sign language Sentences.
  • Upload & Link the data
  • Add translations and glosses to the parallel corpus
4. Contribute to the Synthetic Parallel Corpus
  1. Write sentences using only the supported words.
  2. Compile a dataset for training a language model to automate the previous step.
  3. Translate them into other text languages.
5. Translate existing tokens, translations & glosses to other text languages.
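For the renaming steps above, a hedged sketch that bulk-renames raw clips to the country-organization-number_sign-label convention (hypothetical inputs: a folder raw_clips/ of files named after their sign labels, all from one source dictionary):

    from pathlib import Path

    COUNTRY, ORGANIZATION, NUMBER = "pk", "hfad", "1"  # assumed source dictionary

    for path in Path("raw_clips").glob("*.mp4"):
        # e.g. "Spring (coil).mp4" -> "pk-hfad-1_spring(coil).mp4"
        label = path.stem.lower().replace(" ", "")
        path.rename(path.with_name(f"{COUNTRY}-{ORGANIZATION}-{NUMBER}_{label}.mp4"))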

Note

Ensure uniqueness in sign labels before publishing anything.

Citation

Coming Soon!

Glossary

Label: Text identifier of a sign language video/data sample; a filename without its extension.
Accent: A particular style of performing a sign, such as the speed, position, and distance traveled by the hands.
Gloss: Word sequence corresponding to the signs performed in the source sign language video.
Translation: Valid text of a spoken language with the same meaning as the source sign language video.
Parallel Corpus: Collection of statements in sign language and their translations/glosses in spoken-language texts.
Supported Word: A text-language token (word or phrase) for which a sign language video is available in the dictionary.
Replication: Videos created using the dictionary videos or web-scraped sentences as reference clips. The performer can also be a hearing person, and multiple cameras at different angles can be used simultaneously.
Synthetic Sentence: A sign language sentence formed by concatenating videos corresponding to word tokens written in a particular order.
Word Sense Disambiguation: The task of determining the intended meaning (or a relevant synonym) of a word from its context.