The SADID Evaluation Datasets for Low-Resource Spoken Language Machine Translation of Arabic Dialects
Source | Number of sentences | Number of words | Avg. words per sentence | Number of documents | Percentage of total words |
---|---|---|---|---|---|
Simple Wikipedia | 2723 | 37550 | 13.79 | 958 | 45.05 |
Aesop Fables | 1647 | 21427 | 13.01 | 147 | 25.70 |
Movie Subtitles | 1757 | 24387 | 13.88 | 208 | 29.25 |
Total | 6127 | 83364 | 13.60 | 1351 | 100 |
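
The derived columns in the table above can be recomputed from the raw counts. The following is a minimal sketch, not part of the released scripts: the names `ROWS` and `summarize` are hypothetical, and it assumes the percentage column is each source's share of the total word count, which is the reading that matches the listed values to within 0.01.

```python
# Rows copied from the table above: (source, sentences, words, documents).
ROWS = [
    ("Simple Wikipedia", 2723, 37550, 958),
    ("Aesop Fables", 1647, 21427, 147),
    ("Movie Subtitles", 1757, 24387, 208),
]


def summarize(rows):
    """Recompute average sentence length and share of total words per source."""
    total_words = sum(words for _, _, words, _ in rows)
    summary = {}
    for source, sentences, words, _ in rows:
        summary[source] = {
            # Average number of words per sentence, rounded as in the table.
            "avg_words_per_sentence": round(words / sentences, 2),
            # Assumption: the percentage is taken over words, not sentences.
            "pct_of_words": round(100 * words / total_words, 2),
        }
    return summary


if __name__ == "__main__":
    for source, stats in summarize(ROWS).items():
        print(source, stats)
```

Computing the share over sentences instead would give noticeably different figures, so the word-based reading is the one assumed here.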

Set | Number of sentences | Number of English words | Number of Egyptian words | Number of Levantine words | Number of MSA words |
---|---|---|---|---|---|
dev | 2,997 | 40,885 | 37,480 | 36,362 | |
devtest | 2,997 | 41,946 | 37,928 | 37,928 | |
test | 2,994 | 40,587 | 38,672 | 37,187 | 38,512 |
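
To compare sentence length across languages, the word counts above can be divided by the number of sentences in each split. This is an illustrative sketch only: `SPLITS` and `words_per_sentence` are hypothetical names, the figures are copied from the table, and the MSA entry appears only for the test split because the other two rows leave that column empty.

```python
# Counts copied from the table above; MSA is listed only for the test split.
SPLITS = {
    "dev": {"sentences": 2997, "English": 40885, "Egyptian": 37480, "Levantine": 36362},
    "devtest": {"sentences": 2997, "English": 41946, "Egyptian": 37928, "Levantine": 37928},
    "test": {"sentences": 2994, "English": 40587, "Egyptian": 38672, "Levantine": 37187, "MSA": 38512},
}


def words_per_sentence(splits):
    """Return the average words per sentence for each split and language."""
    result = {}
    for split, counts in splits.items():
        sentences = counts["sentences"]
        result[split] = {
            language: round(words / sentences, 2)
            for language, words in counts.items()
            if language != "sentences"
        }
    return result


if __name__ == "__main__":
    for split, averages in words_per_sentence(SPLITS).items():
        print(split, averages)
```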

The scripts directory contains the preprocessing scripts for the training data, as well as the training and evaluation scripts.

This work would not have been possible without the generous support of InstaDeep Ltd.