Automatic Restoration of Diacritics for Speech Data Sets

Sara Shatnawi, Sawsan Alqahtani, Hanan Aldarmaki
NAACL 2024

Abstract

Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.

Models

Text-based with Tashkeela: a text-only model trained on Tashkeela and fine-tuned with CLArTTS.
Text-based without Tashkeela: a text-only model trained only on CLArTTS.
Text+ASR with Tashkeela: a Text+ASR model trained on Tashkeela for text and fine-tuned with CLArTTS.
Text+ASR without Tashkeela: a Text+ASR model trained only on CLArTTS.

Each of the above is available with either a Transformer or an LSTM text encoder.

Text+ASR models use an external ASR system, a fine-tuned Whisper model, to pre-process speech. You can find the fine-tuned Whisper model here.
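The abstract reports results in terms of diacritic error rate. As a rough illustration of what that metric measures, the sketch below strips diacritics to obtain the raw text input and compares a hypothesis against a reference letter by letter. This is an assumption-laden sketch, not the evaluation script of this repository: the function names are hypothetical, every non-space character is counted, and the paper's exact scoring conventions may differ.

```python
# Arabic diacritics U+064B..U+0652 (Fathatan .. Sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return the undiacritized text, i.e. the raw input of a restoration model."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_letters(text):
    """Group each base character with the diacritics that follow it."""
    units = []
    for ch in text:
        if ch in DIACRITICS:
            if units:  # diacritics with no preceding character are ignored
                units[-1][1].append(ch)
        else:
            units.append((ch, []))
    return units


def diacritic_error_rate(reference, hypothesis):
    """Fraction of non-space characters whose diacritics differ (hypothetical helper)."""
    ref, hyp = split_letters(reference), split_letters(hypothesis)
    if [c for c, _ in ref] != [c for c, _ in hyp]:
        raise ValueError("reference and hypothesis must share the same base characters")
    scored = [(r, h) for (c, r), (_, h) in zip(ref, hyp) if not c.isspace()]
    if not scored:
        return 0.0
    # Compare as sorted lists so that Shaddah+Fatha equals Fatha+Shaddah.
    errors = sum(1 for r, h in scored if sorted(r) != sorted(h))
    return errors / len(scored)


reference = "\u0643\u064E\u062A\u064E\u0628\u064E"   # three letters, Fatha on each
hypothesis = "\u0643\u064F\u062A\u064E\u0628\u064E"  # Damma on the first letter instead
der = diacritic_error_rate(reference, hypothesis)
```

Here one of the three letters carries the wrong diacritic, so `der` is one third. The check on base characters means the sketch only applies when the hypothesis keeps the reference letters unchanged, which holds for text-based restoration but not necessarily for raw ASR output.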

Environment & Installation

Prerequisites

Tested with Python 3.8
Install the required packages listed in the requirements.txt file:
- pip install -r requirements.txt

Citation

If you use the above model, please cite the following paper:

 @inproceedings{shatnawi2024automatic,
 title={Automatic Restoration of Diacritics for Speech Data Sets},
 author={Shatnawi, Sara and Alqahtani, Sawsan and Aldarmaki, Hanan},
 booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},
 pages={4166--4176},
 year={2024}
 }

Data Augmentation for Speech-Based Diacritic Restoration

Sara Shatnawi, Sawsan Alqahtani, Shady Shehata, Hanan Aldarmaki
Mohamed bin Zayed University of Artificial Intelligence
ArabicNLP 2024

Abstract

This paper describes a data augmentation technique for boosting the performance of speech-based diacritic restoration. Our experiments demonstrate the utility of this approach, resulting in improved generalization of all models across different test sets. In addition, we describe the first multi-modal diacritic restoration model, utilizing both speech and text as input modalities. This type of model can be used to diacritize speech transcripts. Unlike previous work that relies on an external ASR model, the proposed model is far more compact and efficient. While the multi-modal framework does not surpass the ASR-based model for this task, it offers a promising approach for improving the efficiency of speech-based diacritization, with a potential for improvement using data augmentation and other methods.

Data Augmentation Rules

Replacement Rules

Sukoon or Shaddah appearing on the first letter of a word (e.g., مْقعد).
Tanween appearing on any letter other than the last letter of the word.
One of two Shaddahs appearing on two contiguous characters (e.g., مَقّعّد).
Any diacritic other than Fatha or Damma appearing on Hamza above Alef (أ) at the beginning of a word (allowed variations include أَصبحَ and أُصبحَ).
Any diacritic other than Kasra appearing on Hamza below Alef (إ), as in the word إلى.
Any diacritic other than Fatha before the tied T (ة) (e.g., مَدرَسَة).
Any diacritic other than Fatha before either of the following forms of Alef: (ى) or (ا).
A stand-alone Shaddah, since Shaddah should be followed by another diacritic.
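The rules above can be read as pattern checks on a single diacritized word. The sketch below only detects which rules a word triggers; it does not choose a replacement diacritic, since the list above does not specify one. The function and label names are hypothetical, and the rules are applied literally as worded, so this is an illustration rather than the script shipped in Data-Augmentation.

```python
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"
TANWEEN = {FATHATAN, DAMMATAN, KASRATAN}
DIACRITICS = TANWEEN | {FATHA, DAMMA, KASRA, SHADDA, SUKUN}

HAMZA_ABOVE, HAMZA_BELOW = "\u0623", "\u0625"
ALEF, ALEF_MAQSURA, TA_MARBUTA = "\u0627", "\u0649", "\u0629"


def split_letters(word):
    """Return [(letter, [diacritics])] for one whitespace-free word."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            units[-1][1].append(ch)
        elif ch not in DIACRITICS:
            units.append((ch, []))
    return units


def find_violations(word):
    """Return the set of replacement rules (hypothetical labels) triggered by a word."""
    units = split_letters(word)
    last = len(units) - 1
    found = set()
    for i, (letter, marks) in enumerate(units):
        vowels = [m for m in marks if m != SHADDA]  # diacritics other than Shaddah
        nxt = units[i + 1] if i < last else None
        if i == 0 and (SUKUN in marks or SHADDA in marks):
            found.add("sukoon_or_shaddah_on_first_letter")
        if i < last and TANWEEN & set(marks):
            found.add("tanween_before_last_letter")
        if nxt and SHADDA in marks and SHADDA in nxt[1]:
            found.add("contiguous_shaddah")
        if i == 0 and letter == HAMZA_ABOVE and any(m not in (FATHA, DAMMA) for m in marks):
            found.add("invalid_diacritic_on_hamza_above")
        if letter == HAMZA_BELOW and any(m != KASRA for m in marks):
            found.add("invalid_diacritic_on_hamza_below")
        if nxt and nxt[0] == TA_MARBUTA and any(v != FATHA for v in vowels):
            found.add("non_fatha_before_ta_marbuta")
        if nxt and nxt[0] in (ALEF, ALEF_MAQSURA) and any(v != FATHA for v in vowels):
            found.add("non_fatha_before_alef")
        if marks == [SHADDA]:
            found.add("standalone_shaddah")
    return found


first_letter_sukoon = find_violations("\u0645\u0652\u0642\u0639\u062F")
valid_word = find_violations("\u0645\u064E\u062F\u0631\u064E\u0633\u064E\u0629")
```

For the checks before ة, ى, and ا, Shaddah is ignored so that a Shaddah followed by Fatha is still accepted; that choice is an assumption of this sketch.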

Deletion Rules

All diacritics placed on characters that are not in the Arabic alphabet.
All diacritics applied to the following forms of Alef: Alef Madd (آ), Alef (ا), and Alef Maqsura (ى), as well as to the definite article (ال), i.e., Alef followed by the letter Lam at the beginning of a word.
Any additional diacritic on a letter, unless it accompanies Shaddah (i.e., each letter should have only one diacritic, except that Shaddah can be followed by one additional diacritic).
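A minimal sketch of how the three deletion rules could be applied to one word is given below. It is not the repository's implementation: the function names are hypothetical, the Arabic alphabet is approximated by the Unicode ranges U+0621..U+063A and U+0641..U+064A, and any word-initial Alef followed by Lam is treated as the definite article, which over-matches words that merely start with those letters.

```python
SHADDA = "\u0651"
# Arabic diacritics U+064B..U+0652 (Fathatan .. Sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}
# Assumed alphabet: Hamza..Ghain and Feh..Yeh (Tatweel U+0640 excluded).
ARABIC_LETTERS = {chr(c) for c in range(0x0621, 0x063B)} | {
    chr(c) for c in range(0x0641, 0x064B)
}
ALEF, LAM = "\u0627", "\u0644"
BARE_LETTERS = {"\u0622", ALEF, "\u0649"}  # Alef Madd, Alef, Alef Maqsura


def split_letters(word):
    """Return [(character, [diacritics])]; diacritics with no base are dropped."""
    units = []
    for ch in word:
        if ch in DIACRITICS:
            if units:
                units[-1][1].append(ch)
        else:
            units.append((ch, []))
    return units


def keep_allowed(marks):
    """Keep at most one Shaddah and one other diacritic, in their original order."""
    kept, seen_shadda, seen_other = [], False, False
    for m in marks:
        if m == SHADDA and not seen_shadda:
            kept.append(m)
            seen_shadda = True
        elif m != SHADDA and not seen_other:
            kept.append(m)
            seen_other = True
    return kept


def apply_deletion_rules(word):
    """Apply the three deletion rules to one whitespace-free word."""
    units = split_letters(word)
    out = []
    for i, (base, marks) in enumerate(units):
        if base not in ARABIC_LETTERS:
            marks = []  # rule 1: character outside the Arabic alphabet
        elif base in BARE_LETTERS:
            marks = []  # rule 2: forms of Alef carry no diacritic
        elif i == 1 and base == LAM and units[0][0] == ALEF:
            marks = []  # rule 2: Lam of the (assumed) definite article
        else:
            marks = keep_allowed(marks)  # rule 3: drop surplus diacritics
        out.append(base + "".join(marks))
    return "".join(out)


cleaned = apply_deletion_rules("\u062F\u0651\u064E\u064F")  # Shaddah, Fatha, Damma on one letter
```

In the example, the letter keeps its Shaddah and the Fatha that follows it, and the surplus Damma is removed.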

Multi-Modal Diacritic Restoration

Citation

If you use the data augmentation or multi-modal model, please cite the following paper:

   @inproceedings{shatnawi2024data,
   title={Data Augmentation for Speech-Based Diacritic Restoration},
   author={Shatnawi, Sara and Alqahtani, Sawsan and Shehata, Shady and Aldarmaki, Hanan},    
   booktitle={Proceedings of ArabicNLP 2024},
   year={2024}
   }
