<a href="https://colab.research.google.com/github/thomas-chauvet/names_transliteration/blob/master/arabic_to_english_names_transliteration_with_nmt_and_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Names transliteration - Neural machine translation with attention

This notebook trains a sequence-to-sequence (seq2seq) model to transliterate names written in Arabic characters into Latin characters. This task is usually called "romanization": converting a string from another script into the Latin alphabet.
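The "attention" in the title refers to letting the decoder weigh every source character differently at each output step. The internals of the repository's attention layer are not shown in this notebook, so the following is only a minimal sketch that uses plain dot-product scoring as a simplified stand-in. The function names `softmax` and `attend` and the toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    decoder_state: list of floats, the current decoder hidden state.
    encoder_states: list of such lists, one per source character.
    """
    # Score each source position by its dot product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    dim = len(decoder_state)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example (hypothetical values): three source positions, 2-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights)
print(context)
```

In a trained model the decoder would combine the context vector with its own state to predict the next Latin character. Here the weights only show which source positions the toy decoder state matches best.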

After training the model in this notebook, you will be able to input an Arabic name, such as *محمد‎*, and get back its transliteration: *mohammed*.
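Before a name can be fed to the model, it has to become a sequence of integers. The repository's `load_dataset` handles this step and its details are not shown here. The sketch below is one plausible character-level encoding, in which the marker characters, the padding id, and the helper names `build_vocab`, `encode`, and `decode` are all assumptions.

```python
# Hypothetical start/end markers and padding id, chosen for this sketch only.
START, END, PAD = "\t", "\n", 0


def build_vocab(names):
    """Map every character seen in the data to an integer id (0 is kept for padding)."""
    chars = sorted({ch for name in names for ch in name} | {START, END})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(name, vocab, max_len):
    """Wrap the name in start/end markers, map to ids, and right-pad with PAD."""
    ids = [vocab[ch] for ch in START + name + END]
    return ids + [PAD] * (max_len - len(ids))


def decode(ids, vocab):
    """Invert encode(): drop padding and markers, return the plain string."""
    inverse = {i: ch for ch, i in vocab.items()}
    chars = [inverse[i] for i in ids if i != PAD]
    return "".join(chars).strip(START + END)


names = ["محمد", "توماس"]
vocab = build_vocab(names)
max_len = max(len(name) for name in names) + 2  # +2 for the two markers
encoded = [encode(name, vocab, max_len) for name in names]
vocab_size = len(vocab) + 1  # +1 because id 0 is reserved for padding

print(encoded)
print(decode(encoded[0], vocab))
```

The `+ 1` on the vocabulary size mirrors the `len(source.word_index) + 1` used later in the notebook, which would be consistent with one id being reserved for padding. A character that never appeared in the training data raises a `KeyError` in this sketch; a real pipeline would need an explicit policy for unknown characters.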

## Clone repository

In [1]:
! rm -rf names_transliteration/
! git clone https://github.com/thomas-chauvet/names_transliteration.git

Cloning into 'names_transliteration'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 83 (delta 22), reused 51 (delta 6), pack-reused 0
Unpacking objects: 100% (83/83), done.


## Install Python library

In [2]:
!cd names_transliteration/ && python setup.py install

running install
running bdist_egg
running egg_info
creating names_transliteration.egg-info
writing names_transliteration.egg-info/PKG-INFO
writing dependency_links to names_transliteration.egg-info/dependency_links.txt
writing entry points to names_transliteration.egg-info/entry_points.txt
writing requirements to names_transliteration.egg-info/requires.txt
writing top-level names to names_transliteration.egg-info/top_level.txt
writing manifest file 'names_transliteration.egg-info/SOURCES.txt'
writing manifest file 'names_transliteration.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/transliteration
copying transliteration/train_nmt.py -> build/lib/transliteration
copying transliteration/get_data.py -> build/lib/transliteration
copying transliteration/transliterate_name.py -> build/lib/transliteration
copying transliteration/__init__.py -> build/lib/transliteration
cr

## Link to Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import libraries

In [4]:
import logging
from pathlib import Path

import tensorflow as tf
from sklearn.model_selection import train_test_split

from transliteration.model.nmt import get_model
from transliteration.model.process import load_dataset
from transliteration.model.save import (
    load_keras_tokenizer_json,
    load_model_metadata,
    save_keras_tokenizer_json,
    save_model_metadata,
)
from transliteration.model.train import train
from transliteration.model.transliterate import transliterate

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

## Configuration

In [5]:
num_examples = None
test_size = 0.2
batch_size = 64
embedding_dim = 256
units = 1024
epochs = 20

model_path = Path("/content/drive/My Drive/")

## Train model

In [6]:
logger.info("Load dataset")
source_tensor, target_tensor, source, target = load_dataset(
    Path("names_transliteration/data/clean/arabic_english.csv"), num_examples=num_examples
)

logger.info("Save tokenizers")
save_keras_tokenizer_json(source, model_path / "source_tokenizer.json")
save_keras_tokenizer_json(target, model_path / "target_tokenizer.json")

logger.info("Creating training and validation sets")
(
    source_tensor_train,
    source_tensor_val,
    target_tensor_train,
    target_tensor_val,
) = train_test_split(source_tensor, target_tensor, test_size=test_size)

BUFFER_SIZE = len(source_tensor_train)
logger.info(f"BUFFER_SIZE: {BUFFER_SIZE}")
STEPS_PER_EPOCH = len(source_tensor_train) // batch_size
logger.info(f"STEPS_PER_EPOCH: {STEPS_PER_EPOCH}")
VOCAB_INP_SIZE = len(source.word_index) + 1
logger.info(f"VOCAB_INP_SIZE: {VOCAB_INP_SIZE}")
VOCAB_TAR_SIZE = len(target.word_index) + 1
logger.info(f"VOCAB_TAR_SIZE: {VOCAB_TAR_SIZE}")

# Everything needed to rebuild the model at inference time.
metadata = {
    "batch_size": batch_size,
    "embedding_dim": embedding_dim,
    "units": units,
    "vocab_inp_size": VOCAB_INP_SIZE,
    "vocab_tar_size": VOCAB_TAR_SIZE,
    "max_length_source": source_tensor.shape[1],
    "max_length_target": target_tensor.shape[1],
}

logger.info("Save model's metadata")
save_model_metadata(metadata, model_path / "model_metadata.json")

logger.info("Tensorflow dataset batch")
dataset = tf.data.Dataset.from_tensor_slices(
    (source_tensor_train, target_tensor_train)
).shuffle(BUFFER_SIZE)
dataset = dataset.batch(batch_size, drop_remainder=True)

logger.info("Instanciate encoder, attention and decoder")
encoder, attention_layer, decoder = get_model(
    VOCAB_INP_SIZE, VOCAB_TAR_SIZE, embedding_dim=embedding_dim, units=units, batch_sz=batch_size
)

logger.info("Train model")
encoder, decoder = train(
    dataset,
    encoder,
    decoder,
    target,
    STEPS_PER_EPOCH,
    epochs=epochs,
    checkpoint_dir=model_path / "training_checkpoints",
)

logger.info("Save models weights")
encoder.save_weights(
    (model_path / "encoder/checkpoint").as_posix(), save_format="tf"
)
decoder.save_weights(
    (model_path / "decoder/checkpoint").as_posix(), save_format="tf"
)



INFO:__main__:Load dataset
INFO:__main__:Save tokenizers
INFO:__main__:Creating training and validation sets
INFO:__main__:BUFFER_SIZE: 94438
INFO:__main__:STEPS_PER_EPOCH: 1475
INFO:__main__:VOCAB_INP_SIZE: 55
INFO:__main__:VOCAB_TAR_SIZE: 60
INFO:__main__:Save model's metadata
INFO:__main__:Tensorflow dataset batch
INFO:__main__:Instanciate encoder, attention and decoder
INFO:__main__:Train model


Epoch 1 Batch 0 Loss 1.2066
Epoch 1 Batch 100 Loss 0.7683
Epoch 1 Batch 200 Loss 0.5617
Epoch 1 Batch 300 Loss 0.2970
Epoch 1 Batch 400 Loss 0.2902
Epoch 1 Batch 500 Loss 0.2224
Epoch 1 Batch 600 Loss 0.3280
Epoch 1 Batch 700 Loss 0.2446
Epoch 1 Batch 800 Loss 0.2455
Epoch 1 Batch 900 Loss 0.2580
Epoch 1 Batch 1000 Loss 0.2527
Epoch 1 Batch 1100 Loss 0.2387
Epoch 1 Batch 1200 Loss 0.2234
Epoch 1 Batch 1300 Loss 0.2964
Epoch 1 Batch 1400 Loss 0.2075
Epoch 2 Batch 0 Loss 0.2522
Epoch 2 Batch 100 Loss 0.2376
Epoch 2 Batch 200 Loss 0.1977
Epoch 2 Batch 300 Loss 0.2659
Epoch 2 Batch 400 Loss 0.2061
Epoch 2 Batch 500 Loss 0.2162
Epoch 2 Batch 600 Loss 0.1943
Epoch 2 Batch 700 Loss 0.1964
Epoch 2 Batch 800 Loss 0.2032
Epoch 2 Batch 900 Loss 0.2136
Epoch 2 Batch 1000 Loss 0.2091
Epoch 2 Batch 1100 Loss 0.1823
Epoch 2 Batch 1200 Loss 0.1864
Epoch 2 Batch 1300 Loss 0.1878
Epoch 2 Batch 1400 Loss 0.1720
Epoch 3 Batch 0 Loss 0.1933
Epoch 3 Batch 100 Loss 0.2152
Epoch 3 Batch 200 Loss 0.2103
Epoch 

INFO:__main__:Save models weights
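
The `STEPS_PER_EPOCH` value in the log above follows directly from the training-set size and the batch size. As a small worked check, the sketch below reproduces the integer division and shows what `drop_remainder=True` discards. The helper `batch_sizes` is hypothetical and only imitates the batching arithmetic; it does not use TensorFlow.

```python
train_size = 94438  # BUFFER_SIZE reported in the log above
batch_size = 64     # from the configuration cell

steps_per_epoch = train_size // batch_size
leftover = train_size % batch_size


def batch_sizes(n, batch_size, drop_remainder):
    """Sizes of the batches produced when n examples are cut into batches."""
    sizes = [min(batch_size, n - start) for start in range(0, n, batch_size)]
    if drop_remainder and sizes and sizes[-1] < batch_size:
        sizes.pop()  # discard the final, incomplete batch
    return sizes


full_only = batch_sizes(train_size, batch_size, drop_remainder=True)
with_tail = batch_sizes(train_size, batch_size, drop_remainder=False)

print(steps_per_epoch)                 # 1475
print(leftover)                        # 38
print(len(full_only), len(with_tail))  # 1475 1476
```

With `drop_remainder=True` every batch has exactly 64 examples, so 38 examples go unused in each epoch. Because the dataset is shuffled, those are generally different examples from one epoch to the next.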


## Evaluate on different instances
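
The predictions in this section are compared with the reference spellings by eye. One way to put a number on the comparison is a character error rate based on edit distance. The sketch below is an assumption about how one might score the results and is not part of the repository; the example pairs are taken from the output printed in this section.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(prediction, reference):
    """Edit distance divided by reference length, ignoring letter case."""
    reference = reference.lower()
    return levenshtein(prediction.lower(), reference) / len(reference)


# (model output, reference spelling) pairs from this notebook's results.
pairs = [("mohammed", "Mohammed"), ("mammon", "Mamun"), ("oliver", "Olivier")]
for prediction, reference in pairs:
    rate = character_error_rate(prediction, reference)
    print(f"{prediction:10s} {reference:10s} {rate:.2f}")
```

Accented references such as "Léna" count as one error against "lena" under this metric, so stripping diacritics first may be appropriate depending on what the evaluation is meant to capture. Names also often have several accepted romanizations, which a single-reference score cannot reflect.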

In [7]:
input_tokenizer = load_keras_tokenizer_json(model_path / "source_tokenizer.json")
output_tokenizer = load_keras_tokenizer_json(model_path / "target_tokenizer.json")
model_metadata = load_model_metadata(model_path / "model_metadata.json")
encoder, _, decoder = get_model(
    model_metadata["vocab_inp_size"],
    model_metadata["vocab_tar_size"],
    model_metadata["embedding_dim"],
    model_metadata["units"],
    model_metadata["batch_size"],
)
encoder.load_weights((model_path / "encoder/checkpoint").as_posix())
decoder.load_weights((model_path / "decoder/checkpoint").as_posix())

names = {
    "Mohammed": "محمد‎",
    "Mamun": "مامون",
    "Urdu": "فیضان‎",
    "Thomas": "توماس",
    "Léna": "لينا",
    "Jean": "جينز",
    "Boubacar": "بوبكر",
    "Ghita": "غيتا",
    "Ezékiel": "حزقيال",
    "Gaspard": "جاسبارد",
    "Balthasar": "بالتازار",
    "Olivier": "أوليفر",
    "Jason": "جيسون",
    "Nicolas": "نيكولاس",
    "George": "جورج",
    "Joséphine": "جوزفين",
    "Cunégonde": "كونيجوند",
    "Hortense": "هورتنس",
    "Boutros Boutros-Ghali": "بطرس بطرس غالي",
    "Rifa'a al-Tahtawi": "رفاعة رافع الطهطاوي",
    "Saad Zaghloul": "سعد زغلول‎",
    "Farouk El-Baz": "فاروق الباز‎",
    "Abū ʿAbdallāh Yaʿīsh ibn Ibrāhīm ibn Yūsuf ibn Simāk al-Andalusī al-Umawī": "يعيش بن إبراهيم بن يوسف بن سماك الأموي الأندلسي",
    "Ahmed Hassan Zewail": "أحمد حسن زويل‎",
    "Abdel-Wahed El-Wakil": "عبد الواحد الوكيل‎",
    "Suad Amiry": "سعاد العامري‎",
    "Aḥmad ibn Faḍlān ibn al-ʿAbbās ibn Rāšid ibn Ḥammād": "أحمد بن فضلان بن العباس بن راشد بن حماد‎",
    "Ahmad ibn Mājid": "أحمد بن ماجد",
    "Abbas Mahmoud al-Aqqad": "عباس محمود العقاد‎",
    "Imru' al-Qais Junduh bin Hujr al-Kindi ": "ٱمْرُؤ ٱلْقَيْس جُنْدُح ٱبْن حُجْر ٱلْكِنْدِيّ‎",
    "Abū al-Qāsim Khalaf ibn al-'Abbās al-Zahrāwī al-Ansari": "أبو القاسم خلف بن العباس الزهراوي",
}

for latin, arabic in names.items():
    transliterated = transliterate(
        name=arabic,
        input_tokenizer=input_tokenizer,
        output_tokenizer=output_tokenizer,
        encoder=encoder,
        decoder=decoder,
        metadata=model_metadata,
    )
    print(f"Original      : {arabic}")
    print(f"Transliterated: {transliterated}")
    print(f"Ground truth  : {latin}")
    print("-------")


Original      : محمد‎
Transliterated: mohammed
Ground truth  : Mohammed
-------
Original      : مامون
Transliterated: mammon
Ground truth  : Mamun
-------
Original      : فیضان‎
Transliterated: faidan
Ground truth  : Urdu
-------
Original      : توماس
Transliterated: thomas
Ground truth  : Thomas
-------
Original      : لينا
Transliterated: lena
Ground truth  : Léna
-------
Original      : جينز
Transliterated: ginz
Ground truth  : Jean
-------
Original      : بوبكر
Transliterated: boubakeur
Ground truth  : Boubacar
-------
Original      : غيتا
Transliterated: geta
Ground truth  : Ghita
-------
Original      : حزقيال
Transliterated: hazquial
Ground truth  : Ezékiel
-------
Original      : جاسبارد
Transliterated: gaspard
Ground truth  : Gaspard
-------
Original      : بالتازار
Transliterated: baltazar
Ground truth  : Balthasar
-------
Original      : أوليفر
Transliterated: oliver
Ground truth  : Olivier
-------
Original      : جيسون
Transliterated: jeison
Ground truth  : Jason
-------
Or