Transliterator Using Python
==============

# 1.0 Introduction

## 1.1 Purpose

The purpose of this notebook is to demonstrate the implementation of a
transliterator pipeline in Python. A transliterator is a tool that converts
text from one script to another, typically preserving the pronunciation of the
original text. For example, the name "Magdalena" in English could be
transliterated to "مجدلينا" in Arabic.

This notebook focuses on creating a transliterator for Arabic, converting
English words into their Arabic equivalents based on their phonetic
representation. 

The current design can be extended to include other languages.

This code is based on: https://github.com/AMR-KELEG/English-to-arabic-transphonator/tree/master

## 1.2 Transliterator Pipeline

The transliterator pipeline consists of three main steps:
1. **Phoneme Retrieval**: Extracting the phonetic representation (phonemes) of a given word using a phoneme retriever.
2. **Phoneme to Character Mapping**: Converting the retrieved phonemes into corresponding characters in the target language script (Arabic in this case).
3. **Postprocessing**: Applying specific language rules to refine the transliterated output, ensuring it adheres to linguistic norms of the target language.
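
The three steps above can be sketched end to end with plain functions. This is a minimal illustration, not the notebook's implementation: the names `TOY_PRONUNCIATIONS`, `TOY_MAP`, `retrieve_phonemes`, `map_phonemes`, `postprocess`, and `transliterate` are hypothetical, and the toy data covers a single word.

```python
import re

# Hypothetical toy data: a one-word pronunciation table and a tiny phoneme map.
TOY_PRONUNCIATIONS = {"is": ["IH1", "Z"]}
TOY_MAP = {"IH": "\u0650\u064A", "Z": "\u0632"}  # kasra + ya, zay


def retrieve_phonemes(word):
    """Step 1: look up the ARPAbet phonemes of a word (None if unknown)."""
    return TOY_PRONUNCIATIONS.get(word.lower())


def map_phonemes(phonemes):
    """Step 2: map each phoneme, ignoring its stress digit, to Arabic characters."""
    return "".join(TOY_MAP[re.sub(r"\d", "", p)] for p in phonemes)


def postprocess(text):
    """Step 3: one illustrative rule, a leading kasra becomes alef with hamza below."""
    return re.sub("^\u0650", "\u0625", text)


def transliterate(word):
    phonemes = retrieve_phonemes(word)
    if not phonemes:
        return None
    return postprocess(map_phonemes(phonemes))


print(transliterate("is"))  # إيز
```

Each function corresponds to one pipeline step, so any of them can be swapped out without touching the other two.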

## 1.3 Dependencies

The implementation relies on the following dependencies:
- `g2p_en`: A library for converting English words into ARPAbet phonemes.
- CMU Pronouncing Dictionary (`cmudict-0.7b.txt`): A fallback source for phoneme retrieval if the `g2p_en` library is unavailable.

## 1.4 Implementation

The implementation is designed with modularity in mind, allowing different parts of the pipeline to be easily replaced or extended. The core components include:
- **Phoneme Retriever**: An interface and its implementations for retrieving phonemes. New retrievers for other languages can be added by implementing the `BasePhonemeRetriever` interface.
- **Transliteration Map**: A mapping of phonemes to characters. Developers can create mappings for other languages by extending the `BaseTranslitMap` interface.
- **Transliteration Rules**: A set of postprocessing rules to refine the transliteration. Custom rules can be added by implementing the `BaseTranslitRule` interface.

For example, to add support for a new language, a developer would need to:
- Implement a new phoneme retriever for the language.
- Create a new transliteration map for the language.
- Define any necessary postprocessing rules specific to that language.
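
As a rough sketch of those three steps, the classes below implement each interface for a made-up target script. Everything here is hypothetical: the interface stand-ins are redefined only so the block runs on its own, and `DictRetriever`, `TranslitMapToy`, `TranslitRuleToy`, and the three-entry Cyrillic mapping are illustrative, not a real language pack.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


# Minimal stand-ins for the notebook's interfaces so the sketch is self-contained.
class BasePhonemeRetriever(ABC):
    @abstractmethod
    def get_phonemes(self, word: str) -> Optional[List[str]]: ...


class BaseTranslitMap(ABC):
    @abstractmethod
    def get_equivalent(self, phoneme: str) -> str: ...


class BaseTranslitRule(ABC):
    @abstractmethod
    def apply(self, text: str) -> str: ...


class DictRetriever(BasePhonemeRetriever):
    """Hypothetical retriever backed by an in-memory pronunciation table."""

    def __init__(self, table):
        self.table = table

    def get_phonemes(self, word):
        return self.table.get(word.lower())


class TranslitMapToy(BaseTranslitMap):
    """Hypothetical phoneme map; the three entries are illustrative only."""

    def __init__(self):
        self.mapping = {"M": "м", "AE": "а", "T": "т"}

    def get_equivalent(self, phoneme):
        # Drop a trailing stress digit before the lookup; unknown phonemes map to "".
        return self.mapping.get(phoneme.rstrip("012"), "")


class TranslitRuleToy(BaseTranslitRule):
    """Hypothetical placeholder rule that leaves the text unchanged."""

    def apply(self, text):
        return text


retriever = DictRetriever({"mat": ["M", "AE1", "T"]})
translit_map = TranslitMapToy()
rule = TranslitRuleToy()

phonemes = retriever.get_phonemes("Mat")
result = rule.apply("".join(translit_map.get_equivalent(p) for p in phonemes))
print(result)  # мат
```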

## 1.5 Usage

This notebook demonstrates how to use the transliterator pipeline to transliterate a list of English words into Arabic. The process involves setting up the phoneme retriever, mapping the phonemes to Arabic characters, applying postprocessing rules, and running the complete pipeline to get the final transliteration.

# 2.0 Imports

In [1]:
from abc import ABC, abstractmethod
from typing import List, Union

import re

# 3.0 Implementation

## 3.1 Phoneme Retriever

The purpose of the Phoneme Retriever is to extract the phonetic representation of a word.

### 3.1.1 PhonemeRetriever Interface

In [2]:
class BasePhonemeRetriever(ABC):
    @abstractmethod
    def get_phonemes(self, word: str) -> Union[List[str], None]:
        """Retrieve phonemes for a given word."""
        pass


### 3.1.2 PhonemeRetriever Implementations

- **G2pRetriever**: The `g2p_en` library converts English text into ARPAbet
  phonemes, a standardized set of phonetic symbols representing English
  sounds. For words that are not in the CMU dictionary, the library predicts
  the pronunciation with a neural network model.

In [3]:
class G2pRetriever(BasePhonemeRetriever):
    def __init__(self):
        """Initialize the G2pRetriever, which uses the `g2p_en` library to convert English words into their corresponding ARPAbet phonemes.
        """
        from g2p_en import G2p
        self.g2p = G2p()

    def get_phonemes(self, word: str) -> Union[List[str], None]:
        """Retrieve the phonemes for a given word using the `g2p_en` library.

        Args:
            word (str): The English word for which to retrieve phonemes.

        Returns:
            list: A list of phonemes corresponding to the input word.
                  If the word cannot be converted, an empty list is returned.

        Example:
            ```
            retriever = G2pRetriever()
            phonemes = retriever.get_phonemes("example")
            # Returns something like ['IH0', 'G', 'Z', 'AE1', 'M', 'P', 'AH0', 'L']

            phonemes = retriever.get_phonemes("nonexistentword")
            # Out-of-vocabulary words still get a predicted pronunciation;
            # the list is empty only if no ARPAbet tokens are produced
            ```
        """
        phonemes = self.g2p(word)
        return [p for p in phonemes if re.match(r'[A-Z]+[\d]?', p)]
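
The filter in `get_phonemes` can be tried in isolation. The token list below is hypothetical (it only resembles a grapheme-to-phoneme output that mixes phonemes with spaces and punctuation); the regular expression is the one used above.

```python
import re

# Hypothetical token list: phonemes mixed with a space and punctuation.
tokens = ["HH", "AH0", "L", "OW1", " ", "W", "ER1", "L", "D", "!"]

# Same filter as G2pRetriever.get_phonemes: keep tokens that start with
# uppercase letters, optionally followed by a stress digit.
phonemes = [t for t in tokens if re.match(r"[A-Z]+[\d]?", t)]
print(phonemes)  # ['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']

# re.match anchors only at the start, so a token such as "AH0x" would pass.
# re.fullmatch is a stricter alternative that requires the whole token to match.
strict = [t for t in tokens + ["AH0x"] if re.fullmatch(r"[A-Z]+\d?", t)]
```

Whether the stricter `re.fullmatch` variant is worth adopting depends on what tokens the retriever can actually emit; for clean ARPAbet tokens the two filters agree.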

- **CMURetriever**: The CMU Pronouncing Dictionary is a widely used resource that maps English words to their phonetic representations. This retriever looks words up in the dictionary and, if a word is missing, falls back to an optional custom dictionary.

In [4]:
class CMURetriever(BasePhonemeRetriever):
    def __init__(self, fallback_dict_path=None):
        """Initialize the CMURetriever with an optional fallback dictionary.

        Args:
            fallback_dict_path (str, optional): Path to a custom fallback
            dictionary file. If provided, this file will be used to supplement
            the CMU Pronouncing Dictionary for word-to-phoneme mapping.
            Defaults to None.
        """

        self.english_word_to_phoneme = self.load_cmudict()
        self.fallback_dict = {}
        if fallback_dict_path:
            self.fallback_dict = self.load_fallback_dict(fallback_dict_path)

    def load_cmudict(self):
        """Load the CMU dictionary, stripping stress digits from the phonemes."""
        english_word_to_phoneme = {}
        # Errors such as a missing file propagate to the caller
        with open(
            file="../data/resources/cmudict-0.7b.txt",
            encoding="ISO-8859-1",
            mode="r",
        ) as file_obj:
            for line in file_obj:
                # Skip comment lines and blank lines
                if line.startswith(";;;") or not line.strip():
                    continue
                # Strip digits (stress markers) and split into word and phonemes
                word_phonemes = re.sub(r"[0-9]", "", line.strip()).split()
                # The word is the first part, the phonemes are the rest
                word = word_phonemes[0].lower()
                phonemes = word_phonemes[1:]
                # Populate the dictionary
                english_word_to_phoneme[word] = phonemes

        return english_word_to_phoneme

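    def load_fallback_dict(self, fallback_dict_path):
        """Load the fallback dictionary from a user-provided file.

        Each non-empty line holds a word followed by its phonemes, separated
        by whitespace. If the file cannot be opened or read, the entries
        loaded so far (possibly none) are returned.
        """
        fallback_dict = {}
        try:
            with open(fallback_dict_path, "r", encoding="utf-8") as f:
                for line in f:
                    parts = line.strip().split()
                    if not parts:
                        continue  # Skip blank lines
                    word = parts[0].lower()
                    phonemes = parts[1:]
                    fallback_dict[word] = phonemes
        except OSError:
            # The fallback dictionary is optional: ignore a missing or unreadable file
            pass

        return fallback_dict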

    def get_phonemes(self, word: str) -> Union[List[str], None]:
        """Retrieve the phonemes for a given word using the CMU Pronouncing Dictionary, with an optional fallback to a custom dictionary.

        Args:
            word (str): The word for which to retrieve the phonemes.

        Returns:
            list: A list of phonemes corresponding to the input word.
                  If the word is not found in either the CMU dictionary or the fallback dictionary,
                  None is returned.

        Example:
            ```
            retriever = CMURetriever()
            phonemes = retriever.get_phonemes("example")
            # Returns something like ['IH', 'G', 'Z', 'AE', 'M', 'P', 'AH', 'L']
            # (stress digits are stripped when the dictionary is loaded)

            phonemes = retriever.get_phonemes("nonexistentword")
            # Returns None if the word is not found
            ```
        """
        word = word.lower()
        phonemes = self.english_word_to_phoneme.get(word)
        if phonemes is None:
            phonemes = self.fallback_dict.get(word)
        return phonemes
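
To see what the loader's line handling produces, the sketch below applies the same digit-stripping and splitting to a few in-memory lines. The sample lines are written in the `cmudict-0.7b` layout for illustration, and `parse_cmudict_lines` is a hypothetical helper, not part of the notebook.

```python
import re

# Sample lines in the cmudict-0.7b layout (comment lines start with ";;;").
SAMPLE = """;;; comment line
EXAMPLE  IH0 G Z AE1 M P AH0 L
READ  R EH1 D
READ(1)  R IY1 D
"""


def parse_cmudict_lines(text):
    """Parse dictionary lines the same way CMURetriever.load_cmudict does."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        # Removing every digit drops the stress markers (IH0 -> IH) ...
        parts = re.sub(r"[0-9]", "", line.strip()).split()
        entries[parts[0].lower()] = parts[1:]
    return entries


entries = parse_cmudict_lines(SAMPLE)
print(entries["example"])  # ['IH', 'G', 'Z', 'AE', 'M', 'P', 'AH', 'L']
# ... and also turns the variant marker "READ(1)" into the key "read()".
print(sorted(entries))     # ['example', 'read', 'read()']
```

One consequence worth noting: alternate pronunciations end up under keys such as `read()`, so a lookup of `read` only ever returns the first listed pronunciation.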

## 3.2 Phonemes to Characters Mapping

The purpose of this step is to convert the phonetic representation of a word into characters of the target language. This is done by mapping each phoneme to its corresponding character or sequence of characters in the target script (Arabic in this case).

### 3.2.1 TransliterationMap Interface

In [5]:
class BaseTranslitMap(ABC):
    @abstractmethod
    def get_equivalent(self, phoneme: str) -> str:
        """Retrieve the equivalent character for a given phoneme."""
        pass


### 3.2.2 TransliterationMap Implementation

The `TranslitMapAra` class converts ARPAbet phonemes to Arabic characters. Its `get_equivalent` method does not require an exact key match: it uses `_common_prefix` to pick the map key that shares the longest prefix with the input, so a phoneme whose stress variant is missing from the map (for example `AE1`) still resolves to a close entry (`AE0`).

In [6]:
class TranslitMapAra(BaseTranslitMap):
    def __init__(self):
        # Phoneme to Arabic equivalent mapping
        phonemes = [
            'AO0', 'UH0', 'UW0', 'OY0', 'OW0', 'UW1', 'OY1', 'B', 'P', 'NG',
            'F', 'V', 'AA0', 'AE0', 'AH0', 'EH0', 'EH2', 'AY0', 'EY0', 'AW0',
            'IH0', 'T', 'CH', 'G', 'R', 'K', 'L', 'M', 'HH', 'W', 'N', 'Y',
            'PH', 'UX', 'ZH', 'D', 'JH', 'DH', 'ER0', 'ER2', 'Z', 'S', 'SH',
            'IY0', 'IX', 'TH', 'AH1'
            ]

        arabic_equivalent = [
            'ُو', 'ُو', 'ُو', 'ُو', 'ُو', 'ُو', 'وي', 'ب', 'ب', 'غ', 'ف', 'ف',
            'َا', 'َا', 'َا', 'َا', 'َا', 'َي', 'َي', 'َو', 'ِي', 'ت', 'تش',
            'ج', 'ر', 'ك', 'ل', 'م', 'ه', 'و', 'ن', 'ي', 'ف', 'ُو', 'ج', 'د',
            'دج', 'ذ', 'ر', 'ر', 'ز', 'س', 'ش', 'ِي', 'ِي', 'ث', 'أُ'
            ]
        self.transliteration_map = dict(zip(phonemes, arabic_equivalent))

    def _common_prefix(self, s1, s2):
        """
        Calculate the length of the common prefix between two strings.

        Args:
            s1 (str): First string.
            s2 (str): Second string.

        Returns:
            int: Length of the common prefix.
        """
        match_length = 0
        for c1, c2 in zip(s1, s2):
            if c1 == c2:
                match_length += 1
            else:
                break
        return match_length

    def get_equivalent(self, phoneme: str) -> str:
        """
        Find the closest Arabic equivalent for a given ARPAbet phoneme.

        Args:
            phoneme (str): The ARPAbet phoneme to convert.

        Returns:
            str: The corresponding Arabic character(s).
        """
        available_phonemes = sorted(self.transliteration_map.keys())
        matching_prefix_chars = [
            self._common_prefix(phoneme, trans_phoneme)
            for trans_phoneme in available_phonemes
        ]

        # Find the index of the maximum prefix match
        max_idx = max(
            range(len(matching_prefix_chars)), key=lambda i: matching_prefix_chars[i]
        )

        return self.transliteration_map[available_phonemes[max_idx]]
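
The longest-prefix lookup has a few edge cases that are easy to check on a small key set. The sketch below mirrors the selection logic of `get_equivalent` (sort the keys, take the first key with the longest common prefix); `common_prefix_len`, `closest_key`, and the six-key list are hypothetical names and data chosen for illustration.

```python
def common_prefix_len(s1, s2):
    """Length of the common prefix of two strings."""
    n = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        n += 1
    return n


def closest_key(phoneme, keys):
    """Return the first key, in sorted order, with the longest common prefix."""
    ordered = sorted(keys)
    # max() returns the first maximal element, so ties go to the earlier key.
    return max(ordered, key=lambda k: common_prefix_len(phoneme, k))


keys = ["AA0", "AH0", "AH1", "S", "SH", "Z"]

print(closest_key("AH1", keys))  # AH1  (exact match)
print(closest_key("AH2", keys))  # AH0  (tie between AH0 and AH1, first wins)
print(closest_key("S", keys))    # S    ("S" sorts before "SH")
print(closest_key("XX", keys))   # AA0  (no shared prefix, first sorted key wins)
```

The last case shows that a symbol with no shared prefix is not rejected: it silently resolves to the first key in sorted order, which may be worth guarding against if inputs are not guaranteed to be ARPAbet.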

## 3.3 Transliteration (Postprocessing) Rules

The purpose of this step is to refine the transliteration by applying language-specific rules. These rules adjust the transliteration to better match the conventions and phonotactics of the target language.

### 3.3.1 TransliterationRule Interface

In [7]:
class BaseTranslitRule(ABC):
    @abstractmethod
    def apply(self, text: str) -> str:
        """Apply the rule to the given text and return the modified text."""
        pass

### 3.3.2 TransliterationRule Implementation

In [8]:
class TranslitRuleAra(BaseTranslitRule):
    def __init__(self):
        # Mapping of short vowels to their long counterparts
        self.short_to_long_vowel_dict = {
            "\u064E": "ا",  # Fatha to Alef
            "\u064F": "و",  # Damma to Waw
            "\u0650": "ي",  # Kasra to Ya
        }

        # Arabic character sets
        self.arabic_consonants = ["ب", "ت", "ث", "ج", "ح", "د", "ذ", "ر", "ز", "س", "ش", "غ", "ف", "ق", "ك", "ل", "م", "ن", "ه"]

        self.arabic_vowels = ["ا", "أ", "و", "ي", "ى"]
        # Fatha, Damma, Kasra (short vowels)
        self.arabic_short_vowels = ["\u064E", "\u064F", "\u0650"]

    def apply(self, text: str) -> str:
        """Apply transliteration rules to adjust the Arabic text.

        Args:
            text (str): The initial Arabic transliteration.

        Returns:
            str: The adjusted Arabic transliteration.
        """
        arabic_vowels_str = re.escape("".join(self.arabic_vowels))
        arabic_consonants_str = re.escape("".join(self.arabic_consonants))
        arabic_short_vowels_str = re.escape("".join(self.arabic_short_vowels))

        # Rule 1: Replace a leading Fatha or Damma with أ and a leading Kasra with إ
        text = re.sub("^[\u064E\u064F]", "أ", text)
        text = re.sub("^[\u0650]", "إ", text)

        # Rule 2: Replace short vowels at the end
        text = re.sub(r'[\u064E\u064F\u0650]$', lambda m: self.short_to_long_vowel_dict[m.group()], text)

        # Rule 3: At the start of the word, lengthen a short vowel that follows
        # a vowel-consonant pair
        groups = re.search(
            f"^([{arabic_vowels_str}][{arabic_consonants_str}])([{arabic_short_vowels_str}])",
            text,
        )
        if groups:
            text = (
                groups.group(1)
                + self.short_to_long_vowel_dict[groups.group(2)]
                + text[3:]
            )

        # Rule 4: Handle 'ng' sound at the end
        text = re.sub(r'نق$', 'نغ', text)

        # Rule 5: Handle 'ng' sound in the middle
        text = re.sub(r'نق(?=[{0}])'.format(arabic_consonants_str), 'ن', text)

        return text
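
Rules 1, 4, and 5 are plain regular-expression substitutions and can be exercised on their own. In the sketch below, `replace_leading_short_vowel` and `fix_ng` are hypothetical helper names; the patterns and the consonant set are copied from `TranslitRuleAra`.

```python
import re

FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
# Same consonant set as TranslitRuleAra.arabic_consonants, joined into one string.
CONSONANTS = "بتثجحدذرزسشغفقكلمنه"


def replace_leading_short_vowel(text):
    """Rule 1: a leading Fatha or Damma becomes أ, a leading Kasra becomes إ."""
    text = re.sub(f"^[{FATHA}{DAMMA}]", "\u0623", text)
    return re.sub(f"^{KASRA}", "\u0625", text)


def fix_ng(text):
    """Rules 4 and 5: rewrite نق at the end, shorten it before a consonant."""
    text = re.sub("\u0646\u0642$", "\u0646\u063A", text)
    return re.sub(f"\u0646\u0642(?=[{CONSONANTS}])", "\u0646", text)


# Kasra + ya + zay: the leading Kasra is replaced by إ.
print(replace_leading_short_vowel(KASRA + "\u064A\u0632"))  # إيز
# A word ending in نق ends in نغ after Rule 4.
print(fix_ng("\u0633\u0646\u0642"))  # سنغ
```

Note that none of the Arabic equivalents in `TranslitMapAra` contains ق, so with that map Rules 4 and 5 appear not to fire; they would matter only for a map that emits نق.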

## 3.4 Transliterator

The purpose of the Transliterator is to bring together all the components—phoneme retrieval, phoneme to character mapping, and postprocessing rules—into a single process that can transliterate words from one language script to another.

### 3.4.1 Transliterator Interface

In [9]:
class BaseTransliterator(ABC):
    @abstractmethod
    def transphonate(self, word: str) -> Union[str, None]:
        """Transliterate a word, returning None if it cannot be converted."""
        pass

### 3.4.2 Transliterator Pipeline Implementation

The `TranslitPipeline` class implements the pipeline pattern, where each step of the transliteration process is executed in sequence.

In [10]:
class TranslitPipeline(BaseTransliterator):
    def __init__(
        self,
        phoneme_retriever: BasePhonemeRetriever,
        transliteration_map: BaseTranslitMap,
        transliteration_rules: BaseTranslitRule,
    ):
        self.phoneme_retriever = phoneme_retriever
        self.transliteration_map = transliteration_map
        self.transliteration_rules = transliteration_rules

    def transphonate(self, word: str) -> Union[str, None]:
        """Transphonate a word into the target language."""

        phonemes = self.phoneme_retriever.get_phonemes(word)
        if not phonemes:
            return None

        phonemes_equivalent = [
            self.transliteration_map.get_equivalent(phoneme)
            for phoneme in phonemes
        ]
        phonemes_equivalent = "".join(phonemes_equivalent)

        transliteration = self.transliteration_rules.apply(phonemes_equivalent)

        return transliteration

# 4.0 Usage: Arabic Transliterator

In [11]:
# Step 1: Create Phoneme Retriever

def create_phoneme_retriever_ar(
    fallback_dict_path=None,
) -> BasePhonemeRetriever:
    try:
        return G2pRetriever()
    except ImportError:
        return CMURetriever(fallback_dict_path=fallback_dict_path)
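
The factory above relies on `G2pRetriever.__init__` raising `ImportError` when `g2p_en` is missing. An alternative sketch, shown here only as an assumption about how one might probe for the package without importing it, uses the standard library's `importlib.util.find_spec`; `choose_backend` is a hypothetical helper that returns a label rather than a retriever so the block runs without either backend installed.

```python
import importlib.util


def choose_backend(module_name="g2p_en"):
    """Return "g2p" if the given top-level module is installed, else "cmu"."""
    # find_spec returns None for a top-level module that cannot be found.
    if importlib.util.find_spec(module_name) is not None:
        return "g2p"
    return "cmu"


print(choose_backend("json"))                 # g2p (json is in the standard library)
print(choose_backend("no_such_package_xyz"))  # cmu
```

The try/except form used in the notebook has the advantage of also covering failures that occur during the import itself, so the probe is a complement rather than a replacement.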

In [12]:
phoneme_retriever_ar = create_phoneme_retriever_ar()

In [13]:
# Step 2: Mapping list of phonemes to list of characters
transliteration_map_ar = TranslitMapAra()

In [14]:
# Step 3: Postprocessing the converted phonemes
transliteration_rules_ar = TranslitRuleAra()

In [15]:
# Step 4: Create Pipeline
transliteration_pipline_ar = TranslitPipeline(
    phoneme_retriever_ar,
    transliteration_map_ar,
    transliteration_rules_ar
    )

In [16]:
# Step 5: Run the pipeline
words = "Magdalena kristersson Naruhito Ulf this is".split()
for word in words:
    print(word, transliteration_pipline_ar.transphonate(word))

Magdalena مَاجدَالِينَا
kristersson كرِيسترسَان
Naruhito نَارُوهِيتُو
Ulf أُلف
this ذِيس
is إيز


In [17]:
from g2p_en import G2p
G2p()("kristersson")

['K', 'R', 'IH1', 'S', 'T', 'ER0', 'S', 'AH0', 'N']