<a href="https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Names transliteration - Neural machine translation with attention

This notebook trains a sequence-to-sequence (seq2seq) model to transliterate names written in Arabic script into the Latin alphabet. This task is usually called "romanization": converting text from another writing system into the Latin alphabet.

After training the model in this notebook, you will be able to input an Arabic name, such as *محمد‎*, and get back its transliteration: *mohammad*.
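To make the task concrete, here is a minimal sketch of how names could be turned into integer sequences for a seq2seq model. It assumes character-level tokenization with start and end markers and zero padding. The helper names (`build_vocab`, `encode`) and the marker characters are hypothetical and are not taken from the repository, whose `load_dataset` may work differently.

```python
# Hypothetical helpers for illustration; not the repository's preprocessing.
START, END = "\t", "\n"  # assumed start-of-name and end-of-name markers


def build_vocab(names):
    """Map every character to a positive id; id 0 is reserved for padding."""
    chars = sorted({ch for name in names for ch in START + name + END})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(name, vocab, max_len):
    """Wrap the name in markers, map characters to ids, right-pad with zeros."""
    ids = [vocab[ch] for ch in START + name + END]
    return ids + [0] * (max_len - len(ids))


latin_names = ["mohammad", "lina"]
vocab = build_vocab(latin_names)
max_len = max(len(n) for n in latin_names) + 2  # +2 for the two markers
encoded = [encode(n, vocab, max_len) for n in latin_names]

# The vocabulary size passed to an embedding layer includes the padding id.
vocab_size = len(vocab) + 1
print(vocab_size, encoded)
```

Under this assumption, the `+ 1` added to `len(word_index)` in the training cell would account for the padding index.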

## Clone repository

In [None]:
! git clone https://github.com/thomas-chauvet/names_transliteration.git

fatal: destination path 'names_transliteration' already exists and is not an empty directory.


## Install Python library

In [None]:
!python names_transliteration/setup.py install

running install
running bdist_egg
running egg_info
writing names_transliteration.egg-info/PKG-INFO
writing dependency_links to names_transliteration.egg-info/dependency_links.txt
writing entry points to names_transliteration.egg-info/entry_points.txt
writing requirements to names_transliteration.egg-info/requires.txt
writing top-level names to names_transliteration.egg-info/top_level.txt
reading manifest file 'names_transliteration.egg-info/SOURCES.txt'
writing manifest file 'names_transliteration.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib

creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying names_transliteration.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying names_transliteration.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying names_transliteration.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying names_transliteration.e

## Import libraries

In [None]:
import logging
from pathlib import Path

import tensorflow as tf
from sklearn.model_selection import train_test_split

from transliteration.model.nmt import get_model
from transliteration.model.process import load_dataset
from transliteration.model.save import (
    load_keras_tokenizer_json,
    load_model_metadata,
    save_keras_tokenizer_json,
    save_model_metadata,
)
from transliteration.model.train import train
from transliteration.model.transliterate import transliterate

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

## Configuration

In [None]:
num_examples = None
test_size = 0.2
batch_size = 64
embedding_dim = 256
units = 1024
epochs = 20

model_path = Path("names_transliteration/model/")

## Train model

In [None]:
logger.info("Load dataset")
source_tensor, target_tensor, source, target = load_dataset(
    Path("names_transliteration/data/clean/arabic_english.csv"), num_examples=num_examples
)

logger.info("Creating training and validation sets")
(
    source_tensor_train,
    source_tensor_val,
    target_tensor_train,
    target_tensor_val,
) = train_test_split(source_tensor, target_tensor, test_size=test_size)

BUFFER_SIZE = len(source_tensor_train)
logger.info(f"BUFFER_SIZE: {BUFFER_SIZE}")
STEPS_PER_EPOCH = len(source_tensor_train) // batch_size
logger.info(f"STEPS_PER_EPOCH: {STEPS_PER_EPOCH}")
VOCAB_INP_SIZE = len(source.word_index) + 1
logger.info(f"VOCAB_INP_SIZE: {VOCAB_INP_SIZE}")
VOCAB_TAR_SIZE = len(target.word_index) + 1
logger.info(f"VOCAB_TAR_SIZE: {VOCAB_TAR_SIZE}")

logger.info("Tensorflow dataset batch")
dataset = tf.data.Dataset.from_tensor_slices(
    (source_tensor_train, target_tensor_train)
).shuffle(BUFFER_SIZE)
dataset = dataset.batch(batch_size, drop_remainder=True)

logger.info("Instanciate encoder, attention and decoder")
encoder, attention_layer, decoder = get_model(
    VOCAB_INP_SIZE, VOCAB_TAR_SIZE, embedding_dim=embedding_dim, units=units, batch_sz=batch_size
)

logger.info("Train model")
encoder, decoder = train(
    dataset,
    encoder,
    decoder,
    target,
    STEPS_PER_EPOCH,
    epochs=epochs,
    checkpoint_dir=model_path / "training_checkpoints",
)

metadata = {
    "batch_size": batch_size,
    "embedding_dim": embedding_dim,
    "units": units,
    "vocab_inp_size": len(source.word_index) + 1,
    "vocab_tar_size": len(target.word_index) + 1,
    "max_length_source": source_tensor.shape[1],
    "max_length_target": target_tensor.shape[1],
}

logger.info("Save models weights")
encoder.save_weights(
    (model_path / "encoder/checkpoint").as_posix(), save_format="tf"
)
decoder.save_weights(
    (model_path / "decoder/checkpoint").as_posix(), save_format="tf"
)

logger.info("Save tokenizers")
save_keras_tokenizer_json(source, model_path / "source_tokenizer.json")
save_keras_tokenizer_json(target, model_path / "target_tokenizer.json")

logger.info("Save model's metadata")
save_model_metadata(metadata, model_path / "model_metadata.json")

INFO:__main__:Load dataset
INFO:__main__:Creating training and validation sets
INFO:__main__:BUFFER_SIZE: 94438
INFO:__main__:STEPS_PER_EPOCH: 1475
INFO:__main__:VOCAB_INP_SIZE: 55
INFO:__main__:VOCAB_TAR_SIZE: 60
INFO:__main__:Tensorflow dataset batch
INFO:__main__:Instanciate encoder, attention and decoder
INFO:__main__:Train model


Epoch 1 Batch 0 Loss 1.1517
Epoch 1 Batch 100 Loss 0.7790
Epoch 1 Batch 200 Loss 0.5619
Epoch 1 Batch 300 Loss 0.2848
Epoch 1 Batch 400 Loss 0.2902
Epoch 1 Batch 500 Loss 0.2274
Epoch 1 Batch 600 Loss 0.2146
Epoch 1 Batch 700 Loss 0.2315
Epoch 1 Batch 800 Loss 0.2603
Epoch 1 Batch 900 Loss 0.2534
Epoch 1 Batch 1000 Loss 0.2555
Epoch 1 Batch 1100 Loss 0.1834
Epoch 1 Batch 1200 Loss 0.2418
Epoch 1 Batch 1300 Loss 0.2644
Epoch 1 Batch 1400 Loss 0.2535
Epoch 2 Batch 0 Loss 0.2912
Epoch 2 Batch 100 Loss 0.2091
Epoch 2 Batch 200 Loss 0.2039
Epoch 2 Batch 300 Loss 0.2240
Epoch 2 Batch 400 Loss 0.2217
Epoch 2 Batch 500 Loss 0.1594
Epoch 2 Batch 600 Loss 0.2386
Epoch 2 Batch 700 Loss 0.2581
Epoch 2 Batch 800 Loss 0.2658
Epoch 2 Batch 900 Loss 0.1931
Epoch 2 Batch 1000 Loss 0.1971
Epoch 2 Batch 1100 Loss 0.1719
Epoch 2 Batch 1200 Loss 0.2275
Epoch 2 Batch 1300 Loss 0.1963
Epoch 2 Batch 1400 Loss 0.1784
Epoch 3 Batch 0 Loss 0.1985
Epoch 3 Batch 100 Loss 0.1895
Epoch 3 Batch 200 Loss 0.2006
Epoch 

INFO:__main__:Save models weights
INFO:__main__:Save tokenizers
INFO:__main__:Save model's metadata
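The logged `BUFFER_SIZE` and `STEPS_PER_EPOCH` values follow directly from the configuration. The short sketch below redoes that arithmetic using the numbers from the log and the configuration cell. The variable names `dropped_per_epoch` and `total_steps` are introduced here for illustration only.

```python
# Values taken from the training log and the configuration cell.
num_train_examples = 94438  # BUFFER_SIZE in the log
batch_size = 64
epochs = 20

# Integer division: only full batches are counted.
steps_per_epoch = num_train_examples // batch_size

# With drop_remainder=True, the final partial batch is discarded each epoch.
dropped_per_epoch = num_train_examples - steps_per_epoch * batch_size

# Total number of batches processed over the whole run.
total_steps = steps_per_epoch * epochs

print(f"steps per epoch  : {steps_per_epoch}")    # 1475, as in the log
print(f"dropped per epoch: {dropped_per_epoch}")  # 38
print(f"total steps      : {total_steps}")        # 29500
```

Because the dataset is reshuffled over the full buffer, the discarded examples are likely to differ from one epoch to the next.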


## Evaluate on different instances

In [None]:
input_tokenizer = load_keras_tokenizer_json(model_path / "source_tokenizer.json")
output_tokenizer = load_keras_tokenizer_json(model_path / "target_tokenizer.json")
model_metadata = load_model_metadata(model_path / "model_metadata.json")
encoder, _, decoder = get_model(
    model_metadata["vocab_inp_size"],
    model_metadata["vocab_tar_size"],
    model_metadata["embedding_dim"],
    model_metadata["units"],
    model_metadata["batch_size"],
)
encoder.load_weights((model_path / "encoder/checkpoint").as_posix())
decoder.load_weights((model_path / "decoder/checkpoint").as_posix())

names = {
    "Mohammed": "محمد‎",
    "Mamun": "مامون",
    "Urdu": "فیضان‎",
    "Thomas": "توماس",
    "Léna": "لينا",
    "Jean": "جينز",
    "Boubacar": "بوبكر",
    "Ghita": "غيتا",
    "Ezékiel": "حزقيال",
    "Gaspard": "جاسبارد",
    "Balthasar": "بالتازار",
    "Olivier": "أوليفر",
    "Jason": "جيسون",
    "Nicolas": "نيكولاس",
    "George": "جورج",
    "Joséphine": "جوزفين",
    "Cunégonde": "كونيجوند",
    "Hortense": "هورتنس",
    "Boutros Boutros-Ghali": "بطرس بطرس غالي",
    "Rifa'a al-Tahtawi": "رفاعة رافع الطهطاوي",
    "Saad Zaghloul": "سعد زغلول‎",
    "Farouk El-Baz": "فاروق الباز‎",
    "Abū ʿAbdallāh Yaʿīsh ibn Ibrāhīm ibn Yūsuf ibn Simāk al-Andalusī al-Umawī": "يعيش بن إبراهيم بن يوسف بن سماك الأموي الأندلسي",
    "Ahmed Hassan Zewail": "أحمد حسن زويل‎",
    "Abdel-Wahed El-Wakil": "عبد الواحد الوكيل‎",
    "Suad Amiry": "سعاد العامري‎",
    "Aḥmad ibn Faḍlān ibn al-ʿAbbās ibn Rāšid ibn Ḥammād": "أحمد بن فضلان بن العباس بن راشد بن حماد‎",
    "Ahmad ibn Mājid": "أحمد بن ماجد",
    "Abbas Mahmoud al-Aqqad": "عباس محمود العقاد‎",
    "Imru' al-Qais Junduh bin Hujr al-Kindi ": "ٱمْرُؤ ٱلْقَيْس جُنْدُح ٱبْن حُجْر ٱلْكِنْدِيّ‎",
    "Abū al-Qāsim Khalaf ibn al-'Abbās al-Zahrāwī al-Ansari": "أبو القاسم خلف بن العباس الزهراوي",
}

for latin, arabic in names.items():
    transliterated = transliterate(
        name=arabic,
        input_tokenizer=input_tokenizer,
        output_tokenizer=output_tokenizer,
        encoder=encoder,
        decoder=decoder,
        metadata=model_metadata,
    )
    print(f"Original      : {arabic}")
    print(f"Transliterated: {transliterated}")
    print(f"Ground truth  : {latin}")
    print("-------")


Original      : محمد‎
Transliterated: mahamed
Ground truth  : Mohammed
-------
Original      : مامون
Transliterated: mamoon
Ground truth  : Mamun
-------
Original      : فیضان‎
Transliterated: vidan
Ground truth  : Urdu
-------
Original      : توماس
Transliterated: tuomas
Ground truth  : Thomas
-------
Original      : لينا
Transliterated: lina
Ground truth  : Léna
-------
Original      : جينز
Transliterated: jenes
Ground truth  : Jean
-------
Original      : بوبكر
Transliterated: boubacar
Ground truth  : Boubacar
-------
Original      : غيتا
Transliterated: gheta
Ground truth  : Ghita
-------
Original      : حزقيال
Transliterated: hazaghall
Ground truth  : Ezékiel
-------
Original      : جاسبارد
Transliterated: gasbard
Ground truth  : Gaspard
-------
Original      : بالتازار
Transliterated: baltazar
Ground truth  : Balthasar
-------
Original      : أوليفر
Transliterated: olliver
Ground truth  : Olivier
-------
Original      : جيسون
Transliterated: jeison
Ground truth  : Jason
-------
O
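
The printed pairs above are compared by eye. One possible way to quantify how close a transliteration is to its ground truth is a character error rate based on Levenshtein edit distance. This is only a sketch: the notebook itself does not compute any metric, and the functions `edit_distance` and `character_error_rate` are hypothetical helpers, not part of the `transliteration` package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalized by reference length, ignoring letter case."""
    reference = reference.casefold()
    prediction = prediction.casefold()
    if not reference:
        return float(len(prediction) > 0)
    return edit_distance(prediction, reference) / len(reference)


# A few (transliterated, ground truth) pairs copied from the output above.
pairs = [("mahamed", "Mohammed"), ("boubacar", "Boubacar"), ("gasbard", "Gaspard")]
for predicted, truth in pairs:
    cer = character_error_rate(predicted, truth)
    print(f"{predicted!r} vs {truth!r}: CER = {cer:.2f}")
```

A name often has several valid romanizations, so a nonzero error rate against a single reference spelling does not necessarily mean the output is wrong.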