QADI

QCRI Arabic Dialect Identification (QADI)

A country-level Arabic dialect identification (DI) dataset. It provides a collection for benchmarking the DI task.

QADI Dataset

The dataset contains 540,590 tweets from 18 Arab countries. The data is distributed as shown in the following table:

Country*	Train	Test
AE	27,819	192
BH	28,295	184
DZ	17,603	170
EG	67,783	200
IQ	18,367	178
JO	34,109	180
KW	49,963	190
LB	38,386	194
LY	40,883	169
MA	12,813	178
OM	24,786	169
PL	48,641	173
QA	36,675	198
SA	35,396	199
SD	16,251	188
SY	18,317	194
TN	12,940	154
YE	11,563	193

* Country names are provided using ISO-3166-1 codes.
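
The counts in the table can be checked for consistency and class balance. The sketch below hard-codes the training and test counts from the table and computes the totals and each country's share of the training data. The names `COUNTS`, `totals`, and `train_shares` are illustrative and not part of the repository.

```python
# Training and test tweet counts per country, copied from the table above.
COUNTS = {
    "AE": (27819, 192), "BH": (28295, 184), "DZ": (17603, 170),
    "EG": (67783, 200), "IQ": (18367, 178), "JO": (34109, 180),
    "KW": (49963, 190), "LB": (38386, 194), "LY": (40883, 169),
    "MA": (12813, 178), "OM": (24786, 169), "PL": (48641, 173),
    "QA": (36675, 198), "SA": (35396, 199), "SD": (16251, 188),
    "SY": (18317, 194), "TN": (12940, 154), "YE": (11563, 193),
}


def totals(counts):
    """Return (total_train, total_test) summed over all countries."""
    train = sum(t for t, _ in counts.values())
    test = sum(s for _, s in counts.values())
    return train, test


def train_shares(counts):
    """Return each country's fraction of the training data."""
    total_train, _ = totals(counts)
    return {country: t / total_train for country, (t, _) in counts.items()}


if __name__ == "__main__":
    train, test = totals(COUNTS)
    print(f"train={train:,} test={test:,}")
    # List countries from the largest to the smallest training share.
    for country, share in sorted(train_shares(COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{country}\t{share:.1%}")
```

The training counts sum to 540,590 and the test counts to 3,303. The training split is imbalanced (EG has nearly six times as many tweets as YE), which is worth keeping in mind when choosing evaluation metrics.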

Download the dataset

To obtain the data, clone the repository or download its contents. The dataset files contain the IDs of all the tweets identified as coming from the designated country. You may use twarc or other Twitter scraping tools to hydrate the tweets.

twarc hydrate dataset/tweetsCountryID.txt

The tweets are arranged per country. Each file is a list of ids for the tweets from the designated country.
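
Because each file lists the IDs for one country, the label of a hydrated tweet can be recovered from the file its ID came from. The following is a minimal sketch, not part of the repository. It assumes one tweet ID per line and simply uses the file name (without extension) as the label; the function name `load_labeled_ids` is hypothetical, and the label may need further mapping depending on how the files in `dataset/` are actually named.

```python
from pathlib import Path


def load_labeled_ids(dataset_dir):
    """Map each tweet ID to a label taken from the name of the file listing it.

    Assumes one tweet ID per line in each .txt file. The file stem
    (name without extension) is used as the label as-is.
    """
    labels = {}
    for path in sorted(Path(dataset_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                tweet_id = line.strip()
                if tweet_id:  # skip blank lines
                    labels[tweet_id] = path.stem
    return labels


if __name__ == "__main__":
    labeled = load_labeled_ids("dataset")
    print(f"{len(labeled):,} tweet IDs loaded")
```

IDs are kept as strings so they can be matched directly against the ID field of the hydrated tweets without any loss of precision.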

Contact

Hamdy Mubarak (hmubarak at hbku dot edu dot qa)
Ahmed Abdelali (aabdelali at hbku dot edu dot qa)
Kareem Darwish (hmubarak at hbku dot edu dot qa)
Younes Samih (ysamih at hbku dot edu dot qa)
Sabit Hassan (sahassan2 at hbku dot edu dot qa)
