rominaoji/PerSpellData

PerSpellData

PerSpellData is a comprehensive parallel dataset for spell checking in Persian. Each misspelled sentence is paired with its correct form, and the errors are produced using a large confusion matrix gathered from many sources. The dataset contains informal as well as formal sentences, covers diverse topics, and includes both non-word and real-word errors.

Description

Our approach combines a large corpus of Persian text with a confusion matrix. A confusion matrix is a set of words that may mistakenly be replaced with each other, like ‘there’ and ‘their’ in English. We gathered a confusion matrix containing 2,072,396 pairs of words from various sources, which are described below. Given the confusion matrix, we build the parallel dataset by replacing correct words in corpus sentences with the words they can be confused with.
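The replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' generation code: it assumes whitespace tokenization, one injected error per output sentence, and a confusion matrix stored as a dictionary from a correct word to its confusable words. The function name `make_parallel_pairs` and the toy entries are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def make_parallel_pairs(
    sentence: str,
    confusions: Dict[str, List[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (misspelled, correct) sentence pairs, one replaced word per pair."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    correct = " ".join(tokens)
    for i, token in enumerate(tokens):
        # Try every word listed as confusable with this token.
        for wrong in confusions.get(token, []):
            corrupted = tokens[:i] + [wrong] + tokens[i + 1:]
            yield " ".join(corrupted), correct


# Toy confusion matrix (hypothetical entries for illustration only).
confusions = {"their": ["there"], "own": ["on"]}

pairs = list(make_parallel_pairs("they live on their own", confusions))
for wrong_sentence, correct_sentence in pairs:
    print(wrong_sentence, "->", correct_sentence)
```

Under these assumptions every confusable word produces its own sentence pair, so one corpus sentence can contribute several parallel examples.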

The following shows summary statistics of PerSpellData:

| Errors | Confusion Matrix | PerSpellData |
|---|---|---|
| non-word errors | 643,849 | 3.8M |
| real-word errors | 1,428,547 | 2.5M |
| Total | 2,072,396 | 6.4M |

Examples

Examples of real-word and non-word errors in Persian and English:

| Category | Error type | English: correct form | English: wrong form | Persian: correct form | Persian: wrong form |
|---|---|---|---|---|---|
| non-word | insertion | This story is embracing | This storey is embracing | خوشبختانه همه هنوز دچار نشده اند | خوشبخنانه همه هنوز دچار نشده اند |
| non-word | deletion | She is an actress | She is an acress | مردم آن شهر خیلی خسته بودند | مردم آن شهر خیی خسته بودند |
| non-word | substitution | Tehran is the capital of Iran | Tehran is the capitol of Iran | ساعت هفت بیدار میشوم | صاعت هفت بیدار میشوم |
| non-word | transposition | He is afraid of bears | He is afraid of bares | از آنجا تاکسی گرفتیم | از آنجا تاکسی گرتفیم |
| real-word | insertion | Good jobs are found in big cities | Good jobs are found ink big cities | در این مکان اسکان کنید | در این مکان استکان کنید |
| real-word | deletion | They live on their own | They live on their on | گرادیان این زاویه چند است؟ | گدایان این زاویه چند است؟ |
| real-word | substitution | I cannot see you | I cannot sea you | این مبل گران است | این مبل میان است |
| real-word | transposition | I live here | I live heer | این عدد بر مبنای دو است | ین عدد بر مبانی دو است |
| real-word | same pronunciation | This is too much money | This is two much money | این میوه پرتقال است | این میوه پرتغال است |
| real-word | word boundary | You can do it | Youcan do it | به خانه می روم | به خانه میروم |
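The four edit-based error types (insertion, deletion, substitution, transposition) can be told apart mechanically when a wrong word differs from the correct word by a single edit. The sketch below is an assumption-laden illustration and not part of the dataset tooling: the function name `classify_single_edit` is hypothetical, it compares strings code point by code point, and it does not decide whether an error is non-word or real-word, since that requires a lexicon.

```python
def classify_single_edit(correct: str, wrong: str) -> str:
    """Name the single character edit that turns `correct` into `wrong`.

    Returns "insertion", "deletion", "substitution", "transposition",
    "none" (identical words) or "other" (more than one edit).
    """
    if correct == wrong:
        return "none"
    if len(wrong) == len(correct) + 1:
        # One extra character in the wrong form.
        for i in range(len(wrong)):
            if wrong[:i] + wrong[i + 1:] == correct:
                return "insertion"
    elif len(wrong) == len(correct) - 1:
        # One character missing from the wrong form.
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == wrong:
                return "deletion"
    elif len(wrong) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(correct, wrong)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and correct[diffs[0]] == wrong[diffs[1]]
            and correct[diffs[1]] == wrong[diffs[0]]
        ):
            # Two adjacent characters swapped.
            return "transposition"
    return "other"


# Word pairs taken from the English examples in the table above.
print(classify_single_edit("actress", "acress"))  # deletion
print(classify_single_edit("in", "ink"))          # insertion
print(classify_single_edit("see", "sea"))         # substitution
print(classify_single_edit("here", "heer"))       # transposition
```

Pairs that differ by more than one character edit fall into "other", so a check like this only covers the single-edit case.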

For some error types we provide two files: one is the confusion matrix and the other is the PerSpellData parallel corpus. All of PerSpellData has been uploaded and can be downloaded. Here are the statistics and links for the different types of errors:

| Type | Error-Type | Confused-words | PerSpellData |
|---|---|---|---|
| Real-word | Virastman's logs | 1,034 | 7,753 |
| Real-word | Synthetic | 1,425,693 | 2,959,054 |
| Real-word | Make informal plural again plural | 165 | 2,968 |
| Real-word | Common mistakes | 87 | 847 |
| Real-word | Gozar | 296 | 2,088 |
| Real-word | Tanvin | 79 | 448 |
| Non-word | Be | 515 | 1,520 |
| Non-word | FaSepell | 5,063 | 8,953 |
| Non-word | Virastman's logs | 136,164 | 467,946 |
| Non-word | Close words | 502,107 | 1,440,854 |
| Non-word | CPG | - | 707 |

Reference

If you use or discuss this dataset in your work, please cite our paper:

@inproceedings{persian-2021-romina-oji,
    title = "PerSpellData: An Exhaustive Parallel Spell Dataset For Persian",
    author = "Oji, Romina and Taghizadeh, Nasrin and Faili, Heshaam",
    booktitle = "Proceedings of The Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021",
    month = "12--13 " # nov,
    year = "2021",
    address = "Trento, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.nsurl-1.2",
    pages = "8--14",
}

Contact

If you have any technical questions about the dataset or the publication, please create an issue in this repository.
