# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: MayanV
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Andrés
    family-names: Lou
    email: and_lou@ua.es
    affiliation: Universitat d'Alacant
  - given-names: Felipe
    family-names: Sánchez Martínez
    email: fsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Víctor M.
    family-names: Sánchez Cartagena
    email: vmsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Juan Antonio
    family-names: Pérez Ortiz
    email: japerez@dlsi.ua.es
    affiliation: Universitat d'Alacant
repository-code: 'https://github.com/transducens/mayanv'
url: 'https://transducens.github.io/nmt-maya/'
abstract: >-
  The Mayan languages comprise a language family with an
  ancient history, millions of speakers, and immense
  cultural value, that, nevertheless, remains severely
  underrepresented in terms of resources and global
  exposure. In this paper we develop, curate, and publicly
  release a set of corpora in several Mayan languages spoken
  in Guatemala and Southern Mexico, which we call MayanV.
  The datasets are parallel with Spanish, the dominant
  language of the region, and are taken from official native
  sources focused on representing informal, day-to-day, and
  non-domain-specific language. As such, and according to
  our dialectometric analysis, they differ in register from
  most other available resources. Additionally, we present
  and release neural machine translation models, trained on
  as many resources and Mayan languages as possible, and
  evaluated exclusively on our datasets. We observe lexical
  divergences between the dialects of Spanish used in the
  resources we present and the more widespread written
  standard of Spanish, and that resources other than the
  ones we present do not seem to improve translation
  performance, indicating that many such resources may not
  accurately capture common, real-life language usage.
keywords:
  - naacl2024
  - mayan_corpora
  - mayanv
  - low_resource_languages
license: CC0-1.0
commit: b6029f356b32c3d320d6adf21700eaa34edcfa7b
version: '1.0'
date-released: '2024-03-22'