@AmitMY
Contributor

@AmitMY AmitMY commented Sep 2, 2022

No description provided.

@AmitMY
Contributor Author

AmitMY commented Sep 2, 2022

Note that the way tfds stores arrays of objects, is as objects of arrays.
No big deal, that's just how it is:

{
'id': <tf.Tensor: shape=(), dtype=string, numpy=b'1183203'>, 

'paths': {
    'cmdi': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_cmdi_11832Uu-LBUD5Ry8Msgq1i-CX5Qqjt_ylVHwxENi1ZzzXibc.cmdi'>, 
    'eaf': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambur.de_meined_eaf_1183205qfufbL-ISImHk7v4fYT7bDsx-ZSKSXXUmhbb5mlp3s.eaf'>, 
    'ilex': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_ilex_11832JtU1YIDsu6SFkRBXU5DHJjeFWUFhkgEqfbazJGRkP7E.ilex'>,
    'srt': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_srt_11832_en9iAPAnPuZQv1uhtNQoV61EfCI4ozBY95ECDFElBGa-0.srt'>
}, 

'sentence': {
    'end': <tf.Tensor: shape=(), dtype=int32, numpy=385360>, 
    'english': <tf.Tensor: shape=(), dtype=string, numpy=b'Well, I would assume not.'>, 
    'german': <tf.Tensor: shape=(), dtype=string, numpy=b'Hm, ich glaube nicht.'>, 
                                                        
    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>, 
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>, 
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    }, 

    'id': <tf.Tensor: shape=(), dtype=string, numpy=b'a3127087'>, 
    'mouthings': {
        'end': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([383120, 384320, 384600, 384960], dtype=int32)>, 
        'mouthing': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'[MG]', b'[MG]', b'glaub', b'nich{t}'], dtype=object)>, 
        'start': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([382820, 383820, 384480, 384820], dtype=int32)>
    }, 
    'participant': <tf.Tensor: shape=(), dtype=string, numpy=b'A'>, 
    'start': <tf.Tensor: shape=(), dtype=int32, numpy=382820>
    }
}

# Conflicts:
#	sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py
@bricksdont
Collaborator

Thank you Amit! Will test this.

"Note that the way tfds stores arrays of objects, is as objects of arrays."

I don't understand that yet, could you elaborate? (in the example I can't see an "object of an array", and don't understand why it matters)

@AmitMY
Contributor Author

AmitMY commented Sep 3, 2022

It means that while in the code there is a definition:

                "glosses": tfds.features.Sequence(
                    {
                        "start": tf.int32,
                        "end": tf.int32,
                        "gloss": tfds.features.Text(),
                        "hand": tfds.features.Text(),
                        "Lexeme_Sign": tfds.features.Text(),
                        "Gebärde": tfds.features.Text(),
                        "Sign": tfds.features.Text(),
                    }
                ),

TFDS reverses the order: instead of a sequence of objects, you get an object of sequences:

    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>, 
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>, 
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    }, 

Same data, represented differently.
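
If per-gloss records are easier to work with downstream, the object of sequences can be transposed back into a sequence of objects. The following is only a minimal sketch: `to_rows` is a hypothetical helper, not part of tfds or sign_language_datasets, and it assumes the fields have already been converted from tensors to plain Python lists (for example via `.numpy()`).

```python
from typing import Any, Dict, List


def to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn an object of sequences into a sequence of objects."""
    keys = list(columns.keys())
    lengths = {len(columns[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError(f"Fields have different lengths: {lengths}")
    # zip(*...) walks all fields in parallel, one position at a time.
    return [dict(zip(keys, row)) for row in zip(*(columns[key] for key in keys))]


# Values copied from the example above; only three of the seven fields are shown.
glosses = {
    "start": [382820, 383820, 384120, 384480, 384820],
    "end": [383120, 383860, 384320, 384600, 384960],
    "hand": ["r", "r", "r", "r", "r"],
}

rows = to_rows(glosses)
print(rows[0])  # {'start': 382820, 'end': 383120, 'hand': 'r'}
```

Each entry of `rows` then corresponds to one gloss, matching the shape of the `tfds.features.Sequence` definition in the code.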

continue

features = {
"id": _id,
Collaborator

I now tested everything, works fine! But in the data_type="sentence" case I think it is confusing that datum["id"] is the document id, while the sentence id is datum["sentence"]["id"].

In that case, imo datum["id"] should be a unique ID (can it be a tuple?)

Contributor Author

For consistency it is now a string: f'{features["id"]}_{sentence["id"]}'
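
A small sketch of what that produces, using the document and sentence ids from the example earlier in this thread. `make_sentence_example_id` is a hypothetical name used for illustration only; the actual code builds the string inline.

```python
def make_sentence_example_id(document_id: str, sentence_id: str) -> str:
    """Join the document id and the sentence id into a single string key."""
    return f"{document_id}_{sentence_id}"


# Ids taken from the example earlier in this thread.
example_id = make_sentence_example_id("1183203", "a3127087")
print(example_id)  # 1183203_a3127087
```

Keeping the id a string means it has the same type in the document-level and sentence-level configurations.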

@bricksdont
Collaborator

Other than that, I think the examples Colab could be extended with a sentence-level loading example, such as

import itertools

import tensorflow_datasets as tfds

from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

config = DgsCorpusConfig(name="only-annotations-sentence-level", version="1.0.0", include_video=False, include_pose=None, data_type="sentence")
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))

# Print the first five sentence-level examples
for datum in itertools.islice(dgs_corpus["train"], 0, 5):
  print(datum)

and that perhaps with some clever __init__.py importing this:

from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

could be

from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig
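
The shorter import usually comes down to a single relative import in the package's `__init__.py`. The sketch below demonstrates the idea on a throwaway package written to a temporary directory; `demo_datasets`, `demo_corpus`, and `DemoConfig` are hypothetical stand-ins, and the actual layout of sign_language_datasets may call for something different.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; all names here are hypothetical
# stand-ins, not names from sign_language_datasets.
root = Path(tempfile.mkdtemp())
package = root / "demo_datasets" / "demo_corpus"
package.mkdir(parents=True)
(root / "demo_datasets" / "__init__.py").write_text("")
(package / "demo_corpus.py").write_text("class DemoConfig:\n    pass\n")
# The re-export: one relative import in the package's __init__.py.
(package / "__init__.py").write_text("from .demo_corpus import DemoConfig\n")

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_datasets.demo_corpus import DemoConfig as short_import
from demo_datasets.demo_corpus.demo_corpus import DemoConfig as long_import

# Both import paths resolve to the same class object.
print(short_import is long_import)  # True
```

The long import path keeps working, so existing user code would not need to change.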

@AmitMY AmitMY merged commit a5a1789 into master Sep 12, 2022

3 participants