Benchmark datasets for evaluating Arabic Dialects Machine Translation systems

we7el/SADID

The SADID Evaluation Datasets for Low-Resource Spoken Language Machine Translation of Arabic Dialects

| Source | Number of sentences | Number of words | Avg. number of words per sentence | Number of documents | Percentage of total |
| --- | --- | --- | --- | --- | --- |
| Simple Wikipedia | 2723 | 37550 | 13.79 | 958 | 45.05 |
| Aesop Fables | 1647 | 21427 | 13.01 | 147 | 25.70 |
| Movie Subtitles | 1757 | 24387 | 13.88 | 208 | 29.25 |
| Total | 6127 | 83364 | 13.60 | 1351 | 100 |

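
The derived columns of the source statistics can be rechecked with a few lines of Python. This is only a sketch: the sentence and word counts are copied from the table above, while the function name `summarize` and the reading of the percentage column as each source's share of the total word count are assumptions, not something the repository states. Under that reading the recomputed shares agree with the table to within 0.01 (Simple Wikipedia comes out as 45.04 rather than 45.05).

```python
# (sentences, words) per source, copied from the statistics table.
SOURCES = {
    "Simple Wikipedia": (2723, 37550),
    "Aesop Fables": (1647, 21427),
    "Movie Subtitles": (1757, 24387),
}


def summarize(sources):
    """Return total sentences, total words, and per-source derived stats."""
    total_sentences = sum(sentences for sentences, _ in sources.values())
    total_words = sum(words for _, words in sources.values())
    rows = {}
    for name, (sentences, words) in sources.items():
        rows[name] = {
            # Average sentence length in words, rounded as in the table.
            "avg_words_per_sentence": round(words / sentences, 2),
            # Assumption: the percentage column is a share of total words.
            "word_share_pct": round(100 * words / total_words, 2),
        }
    return total_sentences, total_words, rows


total_sentences, total_words, rows = summarize(SOURCES)
print(total_sentences, total_words)  # 6127 83364
for name, stats in rows.items():
    print(name, stats)
```
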

| Set | Number of sentences | Number of English words | Number of Egyptian words | Number of Levantine words | Number of MSA words |
| --- | --- | --- | --- | --- | --- |
| dev | 2,997 | 40,885 | 37,480 | 36,362 | |
| devtest | 2,997 | 41,946 | 37,928 | 37,928 | |
| test | 2,994 | 40,587 | 38,672 | 37,187 | 38,512 |

The `scripts` directory contains the preprocessing scripts for the training data, as well as the training and evaluation scripts.

Acknowledgment

This work would not have been possible without the generous support of InstaDeep Ltd.
