sinaahmadi/ScriptNormalization

Script Normalization for Unconventional Writing

Perso-Arabic scripts that are targeted in this study.
[📑 ACL 2023 Paper] [📝 Slides] [📽️ Presentation] [📀 Datasets] [⚙️ Demo]

This repository contains the data and the models described in the ACL 2023 paper "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities". The models are deployed on HuggingFace: Demo 🔥


What is unconventional writing?

  • "mar7aba!"
  • "هاو ئار یوو؟"
  • "Μπιάνβενου α σε προζέ!"

What do all these sentences have in common? You are greeted in Arabic with "mar7aba" written in the Latin script, then asked how you are ("هاو ئار یوو؟") in English written in the Perso-Arabic script of Kurdish, and finally welcomed to this demo in French ("Μπιάνβενου α σε προζέ!") written in the Greek script. All these sentences are written in an unconventional script.

Although you may find these sentences risible, unconventional writing is a common practice among millions of speakers in bilingual communities. In our paper, "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities", we shed light on this problem and propose an approach to normalize noisy text written in an unconventional script.

This repository provides code and datasets that can be used to reproduce our paper or extend it to other languages. The current project focuses on some of the main languages that use a Perso-Arabic script; the source languages and their target scripts are listed in the Scripts section.

Please note that this project does not aim at spell-checking and cannot correct errors beyond character normalization.

Corpora

The data in the corpus folder have been extracted from Wikipedia dumps and cleaned using wikiextractor, except for files whose names do not include wiki (followed by the date of the dump). The sources of the material for those other files are as follows:

All the corpora are cleaned to a decent extent.

Wordlists

Wordlists in the wordlist folder contain words extracted from the corpora based on their frequency. Depending on the size and quality of the data, the frequency is in the range of 3 to 10, i.e. words that appear with a frequency of 3 to 10 are extracted as the vocabulary of the language.

To extract words from the corpora, run the following:

```
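# Strip punctuation, put one word per line (squeezing repeated spaces), then count and sort by frequency
cat <file> | tr -d '[:punct:]' | tr -s " " "\n" | sort | uniq -c | sort -n > <file_wordlist>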
```
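The same counting step can be sketched in Python, which also makes the frequency cut-off explicit. This is a minimal illustration and not the code used in the repository: the function name `build_wordlist` and the `min_freq` parameter are hypothetical, and treating the frequency range above as a minimum threshold is an assumption.

```python
import string
from collections import Counter


def build_wordlist(lines, min_freq=3):
    """Count whitespace-separated tokens after stripping ASCII punctuation.

    `min_freq` is a hypothetical parameter: it keeps only the words that
    occur at least that many times (an assumed reading of the 3-to-10 range).
    """
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        # str.split() with no argument also collapses repeated whitespace.
        counts.update(line.translate(strip_punct).split())
    # Most frequent words first, dropping anything below the threshold.
    return [(word, n) for word, n in counts.most_common() if n >= min_freq]
```

Note that `string.punctuation` only covers ASCII punctuation, so script-specific marks such as the Arabic comma would need to be added to the translation table for real corpora.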

Following this, common words in the source and target languages are identified and stored in the common folder. For the target languages, dictionaries are used. This folder contains two sets of common words, whether written with the same spelling or slightly different, and is organized in two sub-folders:

  • corpus-based contains files of common words in two languages based on a corpus
  • dictionary-based contains files of common words in two languages extracted from dictionaries

If the source language has a dictionary, the common words are provided in the dictionary-based folder. Otherwise, check the corpus-based folder. The merged vocabularies (based on corpus and dictionary) are provided in the common folder. If a dictionary is not available for the source language, the files in common are the same as those in corpus-based.
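The relationship between the three folders can be illustrated with a short sketch. The helper names `common_words` and `merge_common` are hypothetical and this is not the logic of extract_loanwords.py; the sketch only covers words with identical spelling, not the slightly different spellings mentioned above.

```python
def common_words(source_vocab, target_vocab):
    """Return the words that appear, with identical spelling, in both vocabularies."""
    return set(source_vocab) & set(target_vocab)


def merge_common(corpus_based, dictionary_based=None):
    """Merge corpus-based and dictionary-based common words into one sorted list.

    Without a dictionary, the merged list is simply the corpus-based list,
    mirroring how the common folder falls back to corpus-based.
    """
    merged = set(corpus_based)
    if dictionary_based:
        merged |= set(dictionary_based)
    return sorted(merged)
```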

Scripts

Information about the target scripts can be found at data/scripts as follows:

| Language | Target Script | Mapping |
|---|---|---|
| Kashmiri | Urdu | data/scripts/Kashmiri-Urdu.tsv |
| Sindhi | Urdu | data/scripts/Sindhi-Urdu.tsv |
| Mazanderani | Persian | data/scripts/Mazanderani-Persian.tsv |
| Gilaki | Persian | data/scripts/Gilaki-Persian.tsv |
| Azeri Turkish | Persian | data/scripts/AzeriTurkish-Persian.tsv |
| Gorani | Kurdish | data/scripts/Gorani-Kurdish.tsv |
| Gorani | Arabic | data/scripts/Gorani-Arabic.tsv |
| Gorani | Persian | data/scripts/Gorani-Persian.tsv |
| Kurdish | Arabic | data/scripts/Kurdish-Arabic.tsv |
| Kurdish | Persian | data/scripts/Kurdish-Persian.tsv |

More metadata about the usage of diacritics and the zero-width non-joiner (ZWNJ) in each language is available at data/script/info.json. A mapping of all the scripts is also provided at data/scripts/scripts_all.tsv.

Character-alignment matrix (CAT)

Based on edit distances computed over the wordlists, a character-alignment matrix (CAT) is created for each source-target language pair. This matrix contains the normalized probability that a character in one language appears as the equivalent of another character in the other language, e.g. compare the letter 'ج' in بۆرج in Azeri Turkish with برج in Farsi.

In addition to the edit distance, if there are rule-based mappings in the data/scripts folder, the CAT is updated accordingly (by adding 1 for each mapping). Finally, any replacement with a score < 0.1 is removed from the matrix.
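The procedure described above can be sketched as follows. This is a hedged illustration, not the code in create_CAT.py: the names `align` and `build_cat`, the input format (a list of source-target word pairs plus a list of rule-based character mappings), and the choice not to renormalize after adding rule-based scores are all assumptions.

```python
from collections import defaultdict


def align(src, tgt):
    """Return aligned (src_char, tgt_char) pairs from a Levenshtein alignment."""
    n, m = len(src), len(tgt)
    # dist[i][j] is the edit distance between src[:i] and tgt[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Backtrace, keeping matches and substitutions and skipping insertions/deletions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        cost = 0 if src[i - 1] == tgt[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def build_cat(word_pairs, rule_mappings=(), threshold=0.1):
    """Build a character-alignment matrix as a nested dict {src_char: {tgt_char: score}}."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in word_pairs:
        for s, t in align(src, tgt):
            counts[s][t] += 1
    # Normalize each source character's counts into probabilities.
    cat = {}
    for s, row in counts.items():
        total = sum(row.values())
        cat[s] = {t: c / total for t, c in row.items()}
    # Add 1 for every rule-based mapping (assumption: no renormalization afterwards).
    for s, t in rule_mappings:
        row = cat.setdefault(s, {})
        row[t] = row.get(t, 0.0) + 1
    # Remove any replacement whose score is below the threshold.
    return {s: {t: p for t, p in row.items() if p >= threshold}
            for s, row in cat.items()}
```

For the example pair above, `align("بۆرج", "برج")` pairs each of 'ب', 'ر' and 'ج' with itself and leaves 'ۆ' unaligned, since it is treated as a deletion.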

Datasets

The synthetic datasets are available on GDrive due to their large size (2.39 GB as .tar.gz): link. The real data in Central Kurdish written unconventionally in Arabic and Persian can be found at data/real.

Create your own pipeline!

If you are interested in this project and want to extend it, here are the steps to consider:

  1. Add your corpus to the data/corpus folder
  2. Update the code/config.json file and specify directories to your data and other required files.
  3. Run extract_loanwords.py to extract common words (script should be optimized!)
  4. Add script mapping in TSV format to the data/scripts folder.
  5. Run create_CAT.py to create the character-alignment matrix
  6. Run synthesize.py to generate synthetic data.
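To give an idea of how the last two steps fit together, here is a minimal sketch of generating noisy text from clean text using a CAT. This README does not describe how synthesize.py works, so everything here is an assumption made for illustration: the function name `synthesize`, the `noise` parameter, and the CAT format (a nested dict `{char: {replacement: score}}`) are hypothetical.

```python
import random


def synthesize(text, cat, noise=0.5, rng=None):
    """Return a noisy copy of `text` by replacing characters according to `cat`.

    `cat` maps a character to {replacement: score}; `noise` is the hypothetical
    probability of replacing a character that has at least one candidate.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        candidates = cat.get(ch)
        if candidates and rng.random() < noise:
            targets = list(candidates)
            weights = [candidates[t] for t in targets]
            # Sample one replacement, weighted by its score in the matrix.
            out.append(rng.choices(targets, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)
```

Pairing each clean sentence with its noisy counterpart, for example `(synthesize(s, cat), s)`, would give source-target training examples for an NMT model; passing a seeded `random.Random` makes the output reproducible.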

You can use any NMT training platform of your choice for training your models. In the paper, we use joeynmt for which the configuration files are provided in the training folder. If using SLURM, you can also use the scripts in training/SLURMs.

Related Projects

Check out the following related projects too:

Cite this paper

If you use any part of the data, please consider citing this paper as follows:

@inproceedings{ahmadi2023acl,
    title = "Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities",
    author = "Ahmadi, Sina and Anastasopoulos, Antonios",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "The 61st Annual Meeting of the Association for Computational Linguistics (ACL)"
}

License

Apache License