The SADID Evaluation Datasets for Low-Resource Spoken Language Machine Translation of Arabic Dialects

Source	Number of sentences	Number of words	Avg. number of words per sentence	Number of documents	Percentage of total words
Simple Wikipedia	2723	37550	13.79	958	45.05
Aesop Fables	1647	21427	13.01	147	25.70
Movie Subtitles	1757	24387	13.88	208	29.25
Total	6127	83364	13.60	1351	100
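
The derived columns in the table above can be recomputed from the raw counts. The following is a minimal sketch, not part of the released scripts: it assumes the percentage column is each source's share of the total word count (which matches the published figures), and the names `SOURCES` and `derived_stats` are hypothetical.

```python
# Sentence and word counts copied from the table above.
SOURCES = {
    "Simple Wikipedia": (2723, 37550),
    "Aesop Fables": (1647, 21427),
    "Movie Subtitles": (1757, 24387),
}


def derived_stats(sources):
    """Return {name: (avg_words_per_sentence, percent_of_total_words)}.

    Assumption: the percentage is computed over words, not sentences.
    """
    total_words = sum(words for _, words in sources.values())
    return {
        name: (round(words / sentences, 2), round(100 * words / total_words, 2))
        for name, (sentences, words) in sources.items()
    }


stats = derived_stats(SOURCES)
for name, (avg, pct) in stats.items():
    print(f"{name}: {avg:.2f} words/sentence, {pct:.2f}% of words")
```

Computing the share over words rather than sentences is a judgment based on the numbers themselves; a share over sentences would give a different value for Simple Wikipedia than the one listed.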

Set	Number of sentences	Number of English words	Number of Egyptian words	Number of Levantine words	Number of MSA words
dev	2,997	40,885	37,480	36,362
devtest	2,997	41,946	37,928	37,928
test	2,994	40,587	38,672	37,187	38,512

The scripts directory contains the preprocessing scripts for the training data, as well as the training and evaluation scripts.

Acknowledgment

This work would not have been possible without the generous support of InstaDeep Ltd.
