feat(dgs_corpus): add sentence level loading #19
|
Note that the way |
# Conflicts: # sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py
|
Thank you Amit! Will test this.
I don't understand that yet, could you elaborate? (In the example I can't see an "object of an array", and I don't understand why it matters.)
|
It means that while in the code there is a definition:
"glosses": tfds.features.Sequence(
    {
        "start": tf.int32,
        "end": tf.int32,
        "gloss": tfds.features.Text(),
        "hand": tfds.features.Text(),
        "Lexeme_Sign": tfds.features.Text(),
        "Gebärde": tfds.features.Text(),
        "Sign": tfds.features.Text(),
    }
),
TFDS reverses the order: instead of being a sequence of objects, it is an object of sequences:
'glosses': {
    'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*', b'$INDEX1^', b'WISSEN2B^', b'NEIN3A^'], dtype=object)>,
    'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B', b'NOT3A'], dtype=object)>,
    'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^', b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>,
    'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>,
    'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1', b'GLAUBEN2B', b'NICHT3A'], dtype=object)>,
    'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>,
    'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
},
Same data, represented differently.
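To make the "object of sequences" versus "sequence of objects" point concrete, here is a minimal sketch in plain Python that turns the former back into the latter. The helper name `object_of_sequences_to_sequence_of_objects` is hypothetical (it is not part of TFDS or of this repository), and the sketch assumes the tensors have already been converted to Python lists and the byte strings decoded.

```python
from typing import Any, Dict, List


def object_of_sequences_to_sequence_of_objects(
    columns: Dict[str, List[Any]],
) -> List[Dict[str, Any]]:
    """Turn {"start": [...], "gloss": [...]} into [{"start": ..., "gloss": ...}, ...].

    Hypothetical helper for illustration; not part of TFDS.
    """
    # Every field must describe the same number of glosses.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"Fields have different lengths: {sorted(lengths)}")
    keys = list(columns)
    # zip(*columns.values()) walks all fields in parallel, one gloss at a time.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Values copied from the example above, assumed already decoded to Python types.
glosses = {
    "start": [382820, 383820, 384120, 384480, 384820],
    "end": [383120, 383860, 384320, 384600, 384960],
    "gloss": ["$GEST-NM-KOPFSCHÜTTELN1^", "ICH2*", "$INDEX1", "GLAUBEN2B", "NICHT3A"],
    "hand": ["r", "r", "r", "r", "r"],
}

gloss_rows = object_of_sequences_to_sequence_of_objects(glosses)
print(gloss_rows[1])
```

Each element of `gloss_rows` is then one gloss with its own `start`, `end`, `gloss` and `hand`, which is the shape the feature definition suggests at first glance.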
continue

features = {
    "id": _id,
I have now tested everything, and it works fine! But in the data_type="sentence" case, I think it is confusing that datum["id"] is the document id, while the sentence id is datum["sentence"]["id"].
In that case, imo datum["id"] should be a unique ID (can it be a tuple?)
For consistency, it is now a string: f'{features["id"]}_{sentence["id"]}'
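As a small illustration of that id scheme, the sketch below builds sentence-level ids from a document id and sentence ids. The function name `make_sentence_datum_id` and the id values are made up for the example; only the f-string pattern comes from the comment above.

```python
def make_sentence_datum_id(document_id: str, sentence_id: str) -> str:
    """Join document and sentence ids with an underscore, as in the f-string above.

    Hypothetical helper; the real code builds the string inline.
    """
    return f"{document_id}_{sentence_id}"


# Made-up ids, for illustration only.
features = {"id": "doc-A"}
sentences = [{"id": "s1"}, {"id": "s2"}]

datum_ids = [make_sentence_datum_id(features["id"], s["id"]) for s in sentences]
print(datum_ids)
```

A plain string keeps `datum["id"]` the same type in document-level and sentence-level loading. The result stays unique as long as sentence ids are unique within a document and no two document/sentence pairs concatenate to the same string.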
|
Other than that, I think the examples Colab could be extended with a sentence-level loading example, such as
import itertools
import tensorflow_datasets as tfds
from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

# Annotations only (no video, no pose), one datum per sentence
config = DgsCorpusConfig(name="only-annotations-sentence-level", version="1.0.0", include_video=False, include_pose=None, data_type="sentence")
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))
for datum in itertools.islice(dgs_corpus["train"], 0, 5):
    print(datum)
and that perhaps with some clever
from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig
could be
from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig
No description provided.