feat(dgs_corpus): add sentence level loading #19

AmitMY · 2022-09-02T08:55:59Z

No description provided.

sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py

AmitMY · 2022-09-02T16:11:50Z

Note that the way tfds stores arrays of objects is as objects of arrays.
No big deal, that is just how it is:

{
'id': <tf.Tensor: shape=(), dtype=string, numpy=b'1183203'>, 

'paths': {
    'cmdi': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_cmdi_11832Uu-LBUD5Ry8Msgq1i-CX5Qqjt_ylVHwxENi1ZzzXibc.cmdi'>, 
    'eaf': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambur.de_meined_eaf_1183205qfufbL-ISImHk7v4fYT7bDsx-ZSKSXXUmhbb5mlp3s.eaf'>, 
    'ilex': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_ilex_11832JtU1YIDsu6SFkRBXU5DHJjeFWUFhkgEqfbazJGRkP7E.ilex'>,
    'srt': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_srt_11832_en9iAPAnPuZQv1uhtNQoV61EfCI4ozBY95ECDFElBGa-0.srt'>
}, 

'sentence': {
    'end': <tf.Tensor: shape=(), dtype=int32, numpy=385360>, 
    'english': <tf.Tensor: shape=(), dtype=string, numpy=b'Well, I would assume not.'>, 
    'german': <tf.Tensor: shape=(), dtype=string, numpy=b'Hm, ich glaube nicht.'>, 
                                                        
    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>, 
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>, 
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    }, 

    'id': <tf.Tensor: shape=(), dtype=string, numpy=b'a3127087'>, 
    'mouthings': {
        'end': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([383120, 384320, 384600, 384960], dtype=int32)>, 
        'mouthing': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'[MG]', b'[MG]', b'glaub', b'nich{t}'], dtype=object)>, 
        'start': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([382820, 383820, 384480, 384820], dtype=int32)>
    }, 
    'participant': <tf.Tensor: shape=(), dtype=string, numpy=b'A'>, 
    'start': <tf.Tensor: shape=(), dtype=int32, numpy=382820>
}
}

# Conflicts: # sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py

bricksdont · 2022-09-02T18:08:25Z

Thank you, Amit! I will test this.

> Note that the way tfds stores arrays of objects is as objects of arrays.

I don't understand that yet; could you elaborate? (In the example I can't see an "object of an array", and I don't understand why it matters.)

AmitMY · 2022-09-03T14:08:15Z

It means that while in the code there is a definition:

                "glosses": tfds.features.Sequence(
                    {
                        "start": tf.int32,
                        "end": tf.int32,
                        "gloss": tfds.features.Text(),
                        "hand": tfds.features.Text(),
                        "Lexeme_Sign": tfds.features.Text(),
                        "Gebärde": tfds.features.Text(),
                        "Sign": tfds.features.Text(),
                    }
                ),

TFDS reverses the order: instead of a sequence of objects, the result is an object of sequences:

    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>, 
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>, 
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    },

Same data, represented differently
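If a per-gloss view is more convenient, the object of sequences can be turned back into a sequence of objects. The following is only a minimal sketch: `columns_to_rows` is a hypothetical helper (not part of tfds or this repository), and the input is a plain-Python stand-in for the `'glosses'` feature above, with the byte strings already decoded. With real tensors, the values would first need converting, for example via `.numpy()`.

```python
from typing import Any, Dict, List, Sequence


def columns_to_rows(columns: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Turn an object of sequences into a sequence of objects (hypothetical helper)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    size = lengths.pop() if lengths else 0
    # Row i collects the i-th entry of every column.
    return [{key: values[i] for key, values in columns.items()} for i in range(size)]


# Stand-in for part of the 'glosses' feature shown above (values decoded from bytes).
glosses = {
    "start": [382820, 383820, 384120, 384480, 384820],
    "end": [383120, 383860, 384320, 384600, 384960],
    "gloss": ["$GEST-NM-KOPFSCHÜTTELN1^", "ICH2*", "$INDEX1", "GLAUBEN2B", "NICHT3A"],
    "hand": ["r", "r", "r", "r", "r"],
}

gloss_rows = columns_to_rows(glosses)
print(gloss_rows[3])  # one dict per gloss, with start, end, gloss and hand
```

Both layouts carry the same information; the row form is simply easier to iterate over gloss by gloss.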

sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py

bricksdont · 2022-09-08T09:20:41Z

+                continue
+
            features = {
                "id": _id,


I have now tested everything, and it works fine! But in the data_type="sentence" case, I think it is confusing that datum["id"] is the document id, while the sentence id is datum["sentence"]["id"].

In that case, imo datum["id"] should be a unique ID (can it be a tuple?).

AmitMY · 2022-09-08

For consistency, it is now a string: f'{features["id"]}_{sentence["id"]}'
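As an illustration of that format, here is a small sketch of how such a combined ID could be built and taken apart again. The function names are hypothetical and not taken from the repository, and splitting on the first underscore assumes document IDs never contain an underscore themselves. The sample values are the document and sentence IDs from the example output earlier in this thread.

```python
from typing import Tuple


def make_example_id(document_id: str, sentence_id: str) -> str:
    """Combine document and sentence IDs into one string (hypothetical helper)."""
    return f"{document_id}_{sentence_id}"


def split_example_id(example_id: str) -> Tuple[str, str]:
    """Recover (document_id, sentence_id); assumes no '_' inside the document ID."""
    document_id, sentence_id = example_id.split("_", 1)
    return document_id, sentence_id


example_id = make_example_id("1183203", "a3127087")
print(example_id)
print(split_example_id(example_id))
```

A plain string keeps the type of datum["id"] the same across document-level and sentence-level loading, which a tuple would not.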

bricksdont · 2022-09-08T09:30:27Z

Other than that, I think the examples Colab could be extended with a sentence-level loading example, such as:

import itertools

import tensorflow_datasets as tfds

from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

config = DgsCorpusConfig(name="only-annotations-sentence-level", version="1.0.0", include_video=False, include_pose=None, data_type="sentence")
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))

# Print the first five sentence-level examples
for datum in itertools.islice(dgs_corpus["train"], 0, 5):
  print(datum)

and perhaps, with some clever __init__.py importing, this:

from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

could be

from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig
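The shorter import relies on the package's `__init__.py` re-exporting the name from its submodule. The sketch below demonstrates the mechanism on a throwaway package built in a temporary directory. All names in it (`demo_datasets`, `demo_corpus`, `DemoCorpusConfig`) are made up for the demonstration and do not reflect the actual layout or contents of `sign_language_datasets`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package whose __init__.py re-exports a class from a submodule."""
    package = root / "demo_datasets" / "demo_corpus"
    package.mkdir(parents=True)
    (root / "demo_datasets" / "__init__.py").write_text("")
    (package / "demo_corpus.py").write_text("class DemoCorpusConfig:\n    pass\n")
    # The re-export: this single line is what enables the shorter import path.
    (package / "__init__.py").write_text("from .demo_corpus import DemoCorpusConfig\n")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        short = importlib.import_module("demo_datasets.demo_corpus")
        full = importlib.import_module("demo_datasets.demo_corpus.demo_corpus")
        # Both import paths resolve to the very same class object.
        same_class = short.DemoCorpusConfig is full.DemoCorpusConfig
    finally:
        sys.path.remove(tmp)

print(same_class)
```

The longer import path keeps working after such a change, so existing code would not need to be updated.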

feat(dgs_corpus): add sentence level loading

5268172

bricksdont reviewed Sep 2, 2022

AmitMY added 2 commits September 2, 2022 18:10

feat(dgs_corpus): slice poses and videos to sentence

d5f4a42

lint(): run black

bb497f3

Merge branch 'master' into dgs_sentences

9a7c3a0

# Conflicts: # sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py

bricksdont reviewed Sep 5, 2022

bricksdont requested changes Sep 5, 2022

fix(dgs): don't include poses if they are not requested

db531c8

bricksdont reviewed Sep 8, 2022

fix(dgs): make sentence id the id of the features

2dab6f6

bricksdont reviewed Sep 9, 2022

AmitMY added 2 commits September 12, 2022 13:37

feat(dgs): add document id

1abd843

chore(): update version

66661a9

AmitMY merged commit a5a1789 into master Sep 12, 2022
