Awesome Kurdish
(last updated on 04/01/2023)
A curated list of awesome resources, tools and scientific papers for Kurdish language technology
Although I do my best to keep this page comprehensive, it may not cover every project, small or large, on Kurdish language processing. If something is missing, please let me know by email or through our community on Gitter.
Are you interested in contributing to Kurdish language processing? Check out this post to see how you can do so.
Development
Resources
Corpora
- Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Sorani and Kurmanji)
- Pewan (Sorani and Kurmanji)
- Kurdish folkloric lyrics corpus (Sorani)
- AsoSoft corpus (Sorani)
- Kurdish Textbooks Corpus (Sorani)
- Zaza-Gorani corpus (Zazaki and Gorani)
- Kurdish resources on Clarin
- University of Bamberg's corpora (Kurmanji and Laki)
Parallel corpora
- Ataman's Bianet corpus containing Turkish-English-Kurmanji aligned texts
- Ahmadi et al's corpus containing English-Kurmanji-Sorani aligned texts
- Tanzil: one Quran translation that can be aligned with translations in many other languages, including 11 in English (see this project)
- Bible translations in Kurmanji-Latin and Kurmanji-Cyrillic
- TED Talks subtitles
- HLP Colloquial Corpus #1 (Sorani and Kurmanji (Latin and Arabic)) (not free)
- A parallel corpus of Sorani-English text
- FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (Sorani)
- AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech (Sorani)
Dictionaries, terminologies and ontologies
Check out a comprehensive list of Kurdish dictionaries and beware of copyright issues in the following projects:
- Kurdish lexicographical resources in Ontolex-Lemon (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Check Dolan Hêriş's repositories for a list of Kurdish dictionaries and tools to extract words
- KurdNet, the Kurdish WordNet (Sorani)
- Kurdish annotated lexicon (Sorani)
- Freedict word lists (Sorani and Kurmanji)
- Translation Initiative for COVID-19 including Sorani and Kurmanji
- MyMemory dictionaries with an open-access API (Sorani)
Datasets
- Manchester Database of Kurdish Dialects
- Dataset of Kurdish poems with meter and form tags
- A Twitter dataset (Sorani and Kurmanji)
- Datasets for text to Kurdish Sign Language (Sorani)
- A dataset for speech recognition (Sorani)
- Universal Dependencies (Kurmanji)
- Web Inventory of Transcribed and Translated Talks (WIT3) (Sorani)
- Sorani and Kurmanji morphological datasets in UniMorph
- FakeKurdNews, an annotated dataset for Sorani Kurdish fake news detection
- Profanity language (Sorani)
Benchmarks
- Morphological analysis:
  - KurdishHunspell evaluation datasets (Sorani)
- Tokenization:
  - KurdishTokenization (Sorani, Kurmanji)
  - A sentence-segmented dataset (Sorani)
- Transliteration
- Spelling error correction
Other resources
Word Embeddings:
- fastText word vectors (Sorani and Kurmanji)
- Polyglot's word embeddings
Tools
Fundamental processing
- Language identifier (Sorani and Kurmanji)
- Wergor for transliteration (Sorani and Kurmanji)
- Kurdish Tokenization
- Jedar stemmer
- Apertium project for Kurmanji and Sorani morphological analysis
- Kurdish Hunspell for Sorani morphological analysis, spell checking, stemming and lemmatization
- A finite-state morphological analyzer for Central Kurdish (Sorani)
- Part-of-speech tagger (Sorani)
- Alexina Framework: morphological analysis and POS-tagger for Sorani (soralex) and Kurmanji (kurlex)
- Kurdspell for Sorani spell checking
- Apertium rule-based Sorani spell-checker
- Gende Stemmer (Sorani)
- Conversion of numbers into words (Sorani and Kurmanji)
- Conversion of words into IPA (Kurmanji)
Machine translation
- Apertium (Sorani and Kurmanji)
- Kurdish MT (Sorani)
Named-entity recognition
- Autoregressive Entity Retrieval (Kurmanji)
Optical character recognition
- Kurdish Handwritten Words (Sorani)
Libraries
- Kurdish Language Processing Toolkit: a natural language processing toolkit in Python
- Kurdînûs: pure JavaScript tools for transliteration, text conversion and normalization
- Kurdish Language Library: converting characters and digits in Persian, English and Arabic to Kurdish and vice versa
- AsoSoft's Library for Kurdish: normalizer, numeral converter, grapheme-to-phoneme converter in C#
Other
In addition to these, you can find further information in other repositories and pages as follows:
Research
These references are based on the data collected in the paper entitled KLPT – Kurdish Language Processing Toolkit. Note that the references are provided in the bibliography file.
Reference | Year | Field | Dialects |
---|---|---|---|
esmaili2013sorani | 2013 | Dialectology | Sorani, Kurmanji |
hassani2016automatic | 2016 | Dialectology | Sorani, Kurmanji |
malmasi2016subdialectal | 2016 | Dialectology | Sorani |
al2017kurdish | 2017 | Dialectology | Sorani, Kurmanji, Gorani |
amani:hal-03262435 | 2021 | Dialectology | Kurdish, Zazaki & Gorani |
mohammed2012automatic | 2012 | Information retrieval and Text mining | Sorani |
esmaili2012challenges | 2012 | Information retrieval and Text mining | Sorani |
littell2016named | 2016 | Information retrieval and Text mining | Sorani |
hassani2017method | 2017 | Information retrieval and Text mining | Sorani, Kurmanji |
esmaAl-Talabaniili2014towards | 2014 | Information retrieval and Text mining | Sorani, Kurmanji |
jaf2016simple | 2016 | Information retrieval and Text mining | Sorani |
rashid2017robust | 2017 | Information retrieval and Text mining | Sorani |
rashid2017automatic | 2017 | Information retrieval and Text mining | Sorani |
saeed2018improving | 2018 | Information retrieval and Text mining | Sorani |
mustafa2018kurdish | 2018 | Information retrieval and Text mining | Sorani |
saeed2018evaluation | 2018 | Information retrieval and Text mining | Sorani |
ahmadi2019wergor | 2019 | Information retrieval and Text mining | Sorani |
mahmudi2021automated | 2021 | Information retrieval and Text mining | Sorani |
abdulrahman2022lmspell | 2022 | Information retrieval and Text mining | Sorani |
esmaili2013building | 2013 | Lexical resources | Sorani |
aliabadi2014towards | 2014 | Lexical resources | Sorani |
aliabadi2014semi | 2014 | Lexical resources | Sorani |
ataman2018bianet | 2018 | Lexical resources | Kurmanji |
ahmadi2019towards | 2019 | Lexical resources | Sorani, Kurmanji, Gorani |
abdulrahman2019developing | 2019 | Lexical resources | Sorani |
abdulrahman2020using | 2020 | Lexical resources | Sorani |
veisi2020toward | 2020 | Lexical resources | Sorani |
ahmadi2020corpus | 2020 | Lexical resources | Sorani |
ahmadi-2020-building | 2020 | Lexical resources | Zaza, Gorani |
veisi2021jira | 2021 | Lexical resources | Sorani |
azin2021sk | 2021 | Lexical resources | Southern Kurdish |
hassani2017kurdish | 2017 | Machine Translation | Sorani, Kurmanji |
kaka2018english | 2018 | Machine Translation | Sorani |
ahmadi2020machine | 2020 | Machine Translation | Sorani |
goyal2021flores | 2021 | Machine Translation | 101 languages incl. Sorani |
amini2021central | 2021 | Machine Translation | Sorani |
ahmadi2022leveraging | 2022 | Machine Translation | Sorani |
baban1995programmable | 1995 | Morphological and syntactic analysis | Sorani |
walther2010developing | 2010 | Morphological and syntactic analysis | Sorani |
walther2010fast | 2010 | Morphological and syntactic analysis | Kurmanji |
salavati2013stemming | 2013 | Morphological and syntactic analysis | Sorani |
jaf2014stemmer | 2014 | Morphological and syntactic analysis | Sorani |
jaf2016chapter | 2016 | Morphological and syntactic analysis | Sorani |
gokirmak2017dependency | 2017 | Morphological and syntactic analysis | Kurmanji |
salavati2018building | 2018 | Morphological and syntactic analysis | Sorani |
mustafa2018kurdish | 2018 | Morphological and syntactic analysis | Sorani |
ahmadi2020towards | 2020 | Morphological and syntactic analysis | Sorani |
ahmadi-2020-tokenization | 2020 | Morphological and syntactic analysis | Sorani, Kurmanji |
ahmadi2021modelling | 2021 | Morphological and syntactic analysis | Sorani |
ahmadi2020Hunspell | 2021 | Morphological and syntactic analysis | Sorani |
naserzade2021ckmorph | 2021 | Morphological and syntactic analysis | Sorani |
mohammed2012uniqueness | 2012 | Optical character recognition | Sorani |
mohammed2013handwritten | 2013 | Optical character recognition | Sorani |
shaltookisentiment | 2016 | Optical character recognition | Sorani |
zarro2017recognition | 2017 | Optical character recognition | Sorani |
yaseen2018kurdish | 2018 | Optical character recognition | Sorani |
dinler2018kurdish | 2018 | Optical character recognition | Sorani |
app11209752 | 2021 | Optical character recognition | Sorani |
kaka2017building | 2017 | Other | Sorani |
mahmudi2021automatic | 2021 | Other | Sorani |
ahmadi2021ickl | 2021 | Other | Sorani |
hashim2018kurdish | 2018 | Sign language recognition | Sorani |
kamal-hassani-2020-towards | 2020 | Sign language recognition | Sorani |
daneshfar2009implementation | 2009 | Speech recognition | Sorani |
barkhoda2009comparison | 2009 | Speech recognition | Sorani |
bahrampour2009implementation | 2009 | Speech recognition | Sorani |
hassani2011kurdish | 2011 | Speech recognition | Sorani |
dinler2017formant | 2017 | Speech recognition | Kurmanji |
dinler2018extraction | 2018 | Speech recognition | Sorani, Kurmanji |
qader2019kurdish | 2019 | Speech recognition | Sorani |
ahmadi-2020-klpt | 2020 | Toolkits | Sorani, Kurmanji |
de2021multilingual | 2021 | Named-entity recognition | Kurmanji |
abdullah2022 | 2022 | Sentiment analysis | Sorani |
awlla2022 | 2022 | Sentiment analysis | Sorani |
amin2022kurdish | 2022 | Sentiment analysis | Sorani |
zuhair2021 | 2021 | Other | Sorani |
kamala2022kurdish | 2022 | Other | Sorani |
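The entries in the Reference column are citation keys that are meant to be looked up in the bibliography file. As a rough illustration of how such a lookup might work, here is a minimal Python sketch that lists the keys in a BibTeX string and reports which table keys have no matching entry. The function names and the sample string are hypothetical and not part of this repository; the regular expression only handles the common `@type{key,` form and is no substitute for a full BibTeX parser.

```python
import re

# Matches the start of a BibTeX entry such as "@inproceedings{ahmadi-2020-klpt,"
# and captures the entry type and the citation key.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def list_citation_keys(bibtex_text):
    """Return the citation keys found in a BibTeX string, in order."""
    return [match.group(2) for match in ENTRY_START.finditer(bibtex_text)]


def missing_keys(table_keys, bibtex_text):
    """Return the table keys that have no matching BibTeX entry."""
    known = set(list_citation_keys(bibtex_text))
    return [key for key in table_keys if key not in known]


# Sample input (assumption): a shortened copy of the KLPT entry from the
# "Cite this repository" section, standing in for the real bibliography file.
SAMPLE_BIB = '''
@inproceedings{ahmadi-2020-klpt,
    title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
    author = "Ahmadi, Sina",
    year = "2020"
}
'''

if __name__ == "__main__":
    print(list_citation_keys(SAMPLE_BIB))
    print(missing_keys(["ahmadi-2020-klpt", "goyal2021flores"], SAMPLE_BIB))
```

To check the table against the real bibliography, one would read the bibliography file into a string and pass it in place of `SAMPLE_BIB`.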
Cite this repository
If you find the provided data useful for your project, feel free to use it, and please cite the following paper:
@inproceedings{ahmadi-2020-klpt,
title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
author = "Ahmadi, Sina",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
doi = "10.18653/v1/2020.nlposs-1.11",
pages = "72--84"
}