# English to French Machine Translation

By Raj Pulapakura.

- GitHub: https://github.com/raj-pulapakura
- Contact: raj.pulapakura@gmail.com

### Table of contents:

<!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->

- [1. Load Data](#1-load-data)
- [2. Create datasets](#2-create-datasets)
- [3. TextVectorization](#3-textvectorization)
   * [3.1 Prepare vectorizers](#31-prepare-vectorizers)
      + [3.1.1 English Vectorizer](#311-english-vectorizer)
      + [3.1.2 French Vectorizer](#312-french-vectorizer)
      + [3.1.3 Example from dataset](#313-example-from-dataset)
   * [3.2 Create new datasets with word indices](#32-create-new-datasets-with-word-indices)
- [4. Building up the Encoder-Decoder Model](#4-building-up-the-encoder-decoder-model)
   * [4.1 Encoder](#41-encoder)
   * [4.2 Cross-Attention](#42-cross-attention)
   * [4.3 Decoder](#43-decoder)
   * [4.4 Combining Encoder and Decoder into Translator](#44-combining-encoder-and-decoder-into-translator)
- [5. Training](#5-training)
- [6. Inference](#6-inference)
- [7. Conclusion](#7-conclusion)

<!-- TOC end -->


### Machine Translation

Machine translation is the process of converting text or speech from one language to another. This notebook focuses on translating English text to French text.

### Encoder-Decoder with Attention

![Encoder-Decoder architecture with Attention - TensorFlow "Neural Machine Translation with Attention" tutorial](https://www.tensorflow.org/images/tutorials/transformer/RNN%2Battention-words-spa.png)

`Encoder-Decoder with Attention` is a well-known architecture for machine translation, although it has become somewhat outdated with the rise of the powerful `Transformer` architecture.

However, it is still a very useful project to work through to get a deeper understanding of sequence-to-sequence models and attention mechanisms (before going on to Transformers).

### Inspiration

This notebook was mainly inspired by TensorFlow's amazing tutorial on [Neural machine translation with attention](https://www.tensorflow.org/text/tutorials/nmt_with_attention), which I have made open source contributions to.

### Set-up Notebook

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
import shutil
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as tf_text
from tensorflow.keras.layers import TextVectorization

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # use cpu

<!-- TOC --><a name="1-load-data"></a>
# 1. Load Data

In [None]:
data_path = "/kaggle/input/english-to-french-small-dataset/english_french.csv"
data = pd.read_csv(data_path)

In [None]:
data.head(10)

In [None]:
data.tail(10)

In [None]:
# Shuffle dataset

data = data.sample(frac=1).reset_index(drop=True)

In [None]:
data.head()

<!-- TOC --><a name="2-create-datasets"></a>
# 2. Create datasets

In [None]:
test_pct = 0.05
n_samples = len(data)
n_test = int(n_samples * test_pct)
n_train = n_samples - n_test

print(f"Total samples: {n_samples}")
print(f"Test samples: {n_test}")
print(f"Train samples: {n_train}")

In [None]:
english_text = data["English"].to_numpy()
french_text = data["French"].to_numpy()

In [None]:
BUFFER_SIZE = 1000
BATCH_SIZE = 64

# pass the datasets as a single tuple, which is accepted across TensorFlow 2.x versions
ds = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(english_text),
    tf.data.Dataset.from_tensor_slices(french_text)
))

test_raw = ds.take(n_test).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
train_raw = ds.skip(n_test).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

In [None]:
for english_batch, french_batch in train_raw.take(1):
    print("English")
    print(english_batch[0:5].numpy())
    print("\nFrench")
    print(french_batch[0:5].numpy())

<!-- TOC --><a name="3-textvectorization"></a>
# 3. TextVectorization

Models don't understand text, so we need to find a way to convert words into numbers.

`TextVectorization` builds a vocabulary (dictionary) from the training text and then replaces each word with its unique integer index in that vocabulary.
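
To make the idea concrete, here is a minimal plain-Python sketch of vocabulary building and lookup. It is not how `TextVectorization` is implemented; the helper names `build_vocab` and `vectorize` are made up for illustration, and the alphabetical tie-break between equally frequent words is an assumption of this sketch. It does mirror the layer's convention of reserving index 0 for padding and index 1 for unknown words.

```python
from collections import Counter

def build_vocab(sentences, max_tokens=None):
    """Build a vocabulary list: padding, [UNK], then words by descending frequency."""
    counts = Counter(word for s in sentences for word in s.split())
    # Tie-break alphabetically (an assumption of this sketch).
    words = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = ["", "[UNK]"] + words
    if max_tokens is not None:
        vocab = vocab[:max_tokens]
    return vocab

def vectorize(sentence, vocab):
    """Replace each word with its index; unknown words map to index 1 ([UNK])."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in sentence.split()]

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(corpus)
print(vocab)
print(vectorize("the bird sat", vocab))  # "bird" is not in the vocabulary
```

Any word that was not seen while building the vocabulary ("bird" here) falls back to the `[UNK]` index.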

<!-- TOC --><a name="31-prepare-vectorizers"></a>
## 3.1 Prepare vectorizers

In [None]:
def tf_lower_and_split_punct(text):
    """
    Processes text before vectorization.
    """
    # French text contains accented characters. NFKD normalization splits them into a base letter plus combining marks:
    text = tf_text.normalize_utf8(text, 'NFKD')
    # Lowercase
    text = tf.strings.lower(text)
    # Keep space, a to z, and select punctuation.
    text = tf.strings.regex_replace(text, '[^ a-z.?!,¿]', '')
    # Add spaces around punctuation.
    text = tf.strings.regex_replace(text, '[.?!,¿]', r' \0 ')
    # Strip whitespace.
    text = tf.strings.strip(text)
    # start and end tokens
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text
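
To see what this standardization does to a sentence, the sketch below approximates the same steps with the standard library (`unicodedata` and `re`) instead of TensorFlow ops. The function name `lower_and_split_punct` is only illustrative, and small differences from the TensorFlow version are possible.

```python
import re
import unicodedata

def lower_and_split_punct(text: str) -> str:
    """Plain-Python approximation of the standardization steps."""
    # Split accented characters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    # Keep space, a to z, and select punctuation (this drops the combining marks).
    text = re.sub(r"[^ a-z.?!,¿]", "", text)
    # Add spaces around punctuation.
    text = re.sub(r"[.?!,¿]", r" \g<0> ", text)
    text = text.strip()
    return " ".join(["[START]", text, "[END]"])

print(lower_and_split_punct("Où est la gare?"))
print(lower_and_split_punct("Comment vas-tu?"))
```

Note that hyphens and apostrophes are not in the allowed character set, so "vas-tu" collapses into the single word "vastu". That is a side effect of this simple cleaning rule worth keeping in mind when reading the vocabulary.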

<!-- TOC --><a name="311-english-vectorizer"></a>
### 3.1.1 English Vectorizer

In [None]:
# maximum number of words in the vocabulary
max_vocab_size = 50000 

english_vectorizer = TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size, 
    ragged=True, # ragged=True allows variable length input sequences
)

# fit vectorization on training dataset english only
english_vectorizer.adapt(train_raw.map(lambda english, french: english))

In [None]:
# vectorize example sentence
example_sentence = "Example sentence"
print(f"Input: {example_sentence}")
print(f"Vectorized: {english_vectorizer(example_sentence)}")

There are 4 tokens because the standardization function adds a `[START]` token at the beginning and an `[END]` token at the end of the two-word sentence.

In [None]:
# get vocabulary size
vocab_size = english_vectorizer.vocabulary_size()
print(f"English Vocabulary size: {vocab_size}")

In [None]:
# get first 10 words in the English vocabulary
print(english_vectorizer.get_vocabulary()[0:10])

Special tokens:

- `''` : Padding
- `[UNK]` : Unknown token, for words which are not in our vocabulary
- `[START]` : Start token, precedes every sentence
- `[END]` : End token, succeeds every sentence

<!-- TOC --><a name="312-french-vectorizer"></a>
### 3.1.2 French Vectorizer

In [None]:
french_vectorizer = TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=max_vocab_size, 
    ragged=True, # ragged=True allows variable length input sequences
)

# fit vectorization on training dataset french only
french_vectorizer.adapt(train_raw.map(lambda english, french: french))

In [None]:
# vectorize example sentence
example_sentence = "Comment vas-tu?"
print(f"Input: {example_sentence}")
print(f"Vectorized: {french_vectorizer(example_sentence)}")

In [None]:
# get vocabulary size
vocab_size = french_vectorizer.vocabulary_size()
print(f"French Vocabulary size: {vocab_size}")

In [None]:
# get first 10 words in the French vocabulary
print(french_vectorizer.get_vocabulary()[0:10])

<!-- TOC --><a name="313-example-from-dataset"></a>
### 3.1.3 Example from dataset

In [None]:
# take sample from dataset and vectorize

for english_b, french_b in train_raw.take(1):
    english = english_b[0]
    french = french_b[0]
    print("\n\nEnglish (Text)\n")
    print(english)
    print("\n\nEnglish (Tokens)\n")
    print(english_vectorizer(english))
    print("\n\nFrench (Text)\n")
    print(french)
    print("\n\nFrench (Tokens)\n")
    print(french_vectorizer(french))

<!-- TOC --><a name="32-create-new-datasets-with-word-indices"></a>
## 3.2 Create new datasets with word indices

In [None]:
def process_text(english, french):
    """
    Convert English and French text to word indices (tokens).
    Split the French tokens into french_tok_in (decoder input) and french_tok_out (labels).
    The two are shifted by one step relative to each other, so that at each position the label is the next token.
    """
    english_tok = english_vectorizer(english)
    french_tok = french_vectorizer(french)
    french_tok_in = french_tok[:,:-1]
    french_tok_out = french_tok[:, 1:] 
    return (english_tok, french_tok_in), french_tok_out

train_ds = train_raw.map(process_text, tf.data.AUTOTUNE)
test_ds = test_raw.map(process_text, tf.data.AUTOTUNE)

In [None]:
for (english_tok, french_in), french_out in train_ds.take(1):
    print("\nEnglish tokens:")
    print(english_tok[0, :10].numpy()) 
    print("\nFrench_in tokens:")
    print(french_in[0, :10].numpy())
    print("\nFrench_out tokens (shifted):")
    print(french_out[0, :10].numpy())

As you can see, the `French_out` tokens are equivalent to the `French_in` tokens except they are shifted forward by 1.

This automatically creates labels for us, as each token in `French_in` is matched to the following token in `French_out`.
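
The shift can be illustrated on a single tokenized sentence. This is a small sketch using readable word strings rather than integer indices; the example sentence is made up.

```python
# A tokenized French sentence, shown as words instead of indices for readability.
tokens = ["[START]", "je", "suis", "content", ".", "[END]"]

decoder_input = tokens[:-1]  # everything except the last token
labels = tokens[1:]          # everything except the first token

# At each position the decoder sees one token and must predict the next one.
for seen, target in zip(decoder_input, labels):
    print(f"{seen:>8} -> {target}")
```

The decoder input always begins with `[START]`, and the final label is always `[END]`, which is how the model learns when to stop.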

<!-- TOC --><a name="4-building-up-the-encoder-decoder-model"></a>
# 4. Building up the Encoder-Decoder Model

In [None]:
UNITS = 256

<!-- TOC --><a name="41-encoder"></a>
## 4.1 Encoder

**Purpose**: Process the English tokens into a sequence of encodings.

**Input**: English tokens.

**Output**: English encodings.

**Steps**:
1. Convert English tokens to word embeddings.
2. Feed embeddings through Bi-directional RNN.
3. Return final English encodings.

In [None]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vectorizer, units):
        super(Encoder, self).__init__()
        self.vectorizer = vectorizer
        self.vocab_size = vectorizer.vocabulary_size()
        self.units = units
        
        # The embedding layer converts tokens into vectors
        self.embedding = tf.keras.layers.Embedding(
            input_dim=self.vocab_size,
            output_dim=units,
        )
        
        # The RNN layer processes those vectors sequentially
        self.rnn = tf.keras.layers.Bidirectional(
            merge_mode='sum', # sum forward and backward activation
            layer=tf.keras.layers.GRU(
                units,
                return_sequences=True,
                recurrent_initializer='glorot_uniform'
            )
        )
    
    def call(self, x):
        # 1. The embedding layer looks up the embedding vector for each token.
        x = self.embedding(x)
        # 2. The GRU processes the sequence of embeddings.
        x = self.rnn(x)
        # 3. Return the new sequence of embeddings.
        return x
    
    def encode_text(self, texts):
        """
        Converts a list of english texts into encodings
        """
        texts = tf.convert_to_tensor(texts)
        if len(texts.shape) == 0:
            # a single string was passed: add a batch dimension
            texts = texts[tf.newaxis]
        tokens = self.vectorizer(texts).to_tensor()
        encodings = self(tokens)
        return encodings
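
To build some intuition for why the encoder RNN is bidirectional, here is a toy sketch in plain Python. It uses a scalar `tanh` recurrence rather than a GRU, with arbitrary fixed weights `w` and `u`, so it only illustrates the data flow: a forward pass, a backward pass over the reversed sequence, and a per-position sum (as with `merge_mode='sum'`).

```python
import math

def simple_rnn(xs, w=0.5, u=0.8):
    """Scalar tanh recurrence: h_t = tanh(w * x_t + u * h_{t-1})."""
    h, outputs = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        outputs.append(h)
    return outputs

def bidirectional_sum(xs):
    """Run the recurrence in both directions and sum the outputs per position."""
    forward = simple_rnn(xs)
    backward = simple_rnn(xs[::-1])[::-1]  # re-align with the original order
    return [f + b for f, b in zip(forward, backward)]

# Two sequences that differ only in their last element.
seq_a = [1.0, 0.0, 0.0]
seq_b = [1.0, 0.0, 2.0]

print(simple_rnn(seq_a)[0], simple_rnn(seq_b)[0])                # forward only
print(bidirectional_sum(seq_a)[0], bidirectional_sum(seq_b)[0])  # both directions
```

With the forward pass alone, the first position looks the same for both sequences because it has not seen the later elements. With both directions, the encoding of the first position also reflects what comes after it.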

In [None]:
# Try it out:
encoder = Encoder(english_vectorizer, UNITS)

# pass example english tokens
english_enc = encoder(english_tok)

print(f'english tokens, shape (batch, s): {english_tok.shape}')
print(f'english encodings, shape (batch, s, units): {english_enc.shape}')

The shapes contain `None` because the sentences in a batch have different lengths.

<!-- TOC --><a name="42-cross-attention"></a>
## 4.2 Cross-Attention

**Purpose**: The attention layer lets the decoder access the information extracted by the encoder. It essentially computes contextually aware word embeddings.

**Inputs**: French encodings (the query) and English encodings (the keys and values).

**Outputs**: Attention vectors (French encodings conditioned on the English encodings).

**Steps**: 
1. Compute Multi-head Attention.
2. Add Skip Connection.
3. Layer Normalization.
4. Return Attention vectors.

In [None]:
class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(key_dim=units, num_heads=1, **kwargs)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, french_enc, english_enc):
        # compute attention vectors
        attn_output, attn_scores = self.mha(
            query=french_enc, # query: french encodings
            value=english_enc, # value: condition on english encodings
            return_attention_scores=True)
        
        # skip connection to preserve input signals
        x = self.add([french_enc, attn_output])
        # layer normalization
        x = self.layernorm(x)

        return x
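
The arithmetic behind this layer can be sketched in plain Python for a single sentence pair. This is a simplified illustration, not the Keras implementation: it omits the learned projection matrices and the learned scale/offset of layer normalization, and the helper names (`attend`, `layer_norm`) and the tiny 2-dimensional vectors are made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # One weight per key position, summing to 1.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors.
        dim = len(values[0])
        context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

english_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # s = 3 positions
french_enc = [[1.0, 0.0], [0.0, 2.0]]               # t = 2 positions

context, weights = attend(french_enc, english_enc, english_enc)
# Skip connection followed by layer normalization.
result = [layer_norm([f + c for f, c in zip(fr, ctx)])
          for fr, ctx in zip(french_enc, context)]

print(weights)  # shape (t, s): how much each French position attends to each English position
print(result)   # shape (t, units): same shape as the French encodings
```

The output has one vector per French position, which is why the attention result keeps the `(batch, t, units)` shape of the query.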

In [None]:
# Try it out
attention_layer = CrossAttention(UNITS)

# simulate French embeddings
embed = tf.keras.layers.Embedding(french_vectorizer.vocabulary_size(),
                                  output_dim=UNITS)
french_embed = embed(french_in)

# pass French embeddings and English encodings
result = attention_layer(french_embed, english_enc)

print(f'English encodings, shape (batch, s, units): {english_enc.shape}')
print(f'French embeddings, shape (batch, t, units): {french_embed.shape}')
print(f'Attention result, shape (batch, t, units): {result.shape}')

<!-- TOC --><a name="43-decoder"></a>
## 4.3 Decoder

**Purpose**: Predict the next token given an input sequence.

**Inputs**: English encodings, French input tokens.

**Outputs**: Logit predictions for next tokens.

**Steps**:
1. Convert French tokens to word embeddings.
2. Feed word embeddings through Uni-directional RNN.
3. Use RNN output as Query for Cross-Attention on English encodings.
4. Generate logit predictions for next token.

In [None]:
class Decoder(tf.keras.layers.Layer):
    @classmethod
    def add_method(cls, fun):
        """
        This will allow us to add additional methods to the class later.
        """
        setattr(cls, fun.__name__, fun)
        return fun
    
    def __init__(self, vectorizer, units):
        super(Decoder, self).__init__()
        self.vectorizer = vectorizer
        self.vocab_size = vectorizer.vocabulary_size()
        
        self.word_to_id = tf.keras.layers.StringLookup(
            vocabulary=vectorizer.get_vocabulary(),
            mask_token="", oov_token="[UNK]"
        )
        
        self.id_to_word = tf.keras.layers.StringLookup(
            vocabulary=vectorizer.get_vocabulary(),
            mask_token="", oov_token="[UNK]",
            invert=True
        )
        
        self.start_token = self.word_to_id("[START]")
        self.end_token = self.word_to_id("[END]")

        # 1. The embedding layer converts token indices to vectors
        self.units = units
        self.embedding = tf.keras.layers.Embedding(
            self.vocab_size,
            units,
        )

        # 2. The RNN keeps track of what's been generated so far
        self.rnn = tf.keras.layers.GRU(
            units,
            return_sequences=True,
            return_state=True,
            recurrent_initializer="glorot_uniform",
        )
        
        # 3. The RNN output will be the query for the attention layer
        self.attention = CrossAttention(units)
        
        # 4. This fully connected layer produces the logits for each output token
        self.output_layer = tf.keras.layers.Dense(self.vocab_size)
        
    def call(
            self, 
            english_enc, 
            french_in, 
            state=None, 
            return_state=False):
        
        # 1. Convert french tokens to embeddings
        x = self.embedding(french_in)
        
        # 2. Process the french embeddings
        x, state = self.rnn(x, initial_state=state)
        
        # 3. Use the RNN output as the query for the attention over the english encodings
        # Essentially condition the french encodings on the english encodings
        x = self.attention(x, english_enc)
        
        # 4. Generate logit predictions for the next token
        logits = self.output_layer(x)
        
        if return_state:
            return logits, state
        else:
            return logits

In [None]:
# Try it out:
decoder = Decoder(french_vectorizer, UNITS)

# use example English encodings and French input tokens
logits = decoder(english_enc, french_in)

print(f'English encodings shape (encoder output and decoder input): (batch, s, units) {english_enc.shape}')
print(f'French input tokens shape (decoder input): (batch, t) {french_in.shape}')
print(f'Logits shape (decoder output): (batch, t, french_vocabulary_size) {logits.shape}')

Amazing! This is sufficient for training.

For inference, we need a couple more methods:

In [None]:
@Decoder.add_method
def get_initial_state(self, english_encodings):
    batch_size = tf.shape(english_encodings)[0]
    # create tensor of n=batch_size start tokens [START]
    start_tokens = tf.fill([batch_size, 1], self.start_token)
    done = tf.zeros([batch_size, 1], dtype=tf.bool)
    embedded = self.embedding(start_tokens)
    return start_tokens, done, self.rnn.get_initial_state(embedded)[0]

In [None]:
@Decoder.add_method
def tokens_to_text(self, tokens):
    """
    Convert tokens (word indices) to text
    """
    words = self.id_to_word(tokens)
    result = tf.strings.reduce_join(words, axis=-1, separator=' ')
    result = tf.strings.regex_replace(result, '^ *\[START\] *', '')
    result = tf.strings.regex_replace(result, ' *\[END\] *$', '')
    return result

In [None]:
@Decoder.add_method
def get_next_token(
        self, 
        english_encodings, 
        next_token, 
        done, 
        state, 
        temperature=0.0):
    """
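    Note: Temperature controls the randomness of the generated text. A temperature of 0 picks the most likely token (greedy decoding); higher values divide the logits by a larger number, which flattens the distribution and makes sampling more random.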
    """
    # running self() automatically runs the call() method
    logits, state = self(
        english_encodings,
        next_token,
        state=state,
        return_state=True
    )
    
    if temperature == 0.00:
        next_token = tf.argmax(logits, axis=-1)
    else:
        logits = logits[:, -1, :]/temperature
        next_token = tf.random.categorical(logits, num_samples=1)
        
    # if a sequence produces an end_token, set it "done"
    done = done | (next_token == self.end_token)
    # once a sequence is done it only produces 0-padding
    next_token = tf.where(done, tf.constant(0, dtype=tf.int64), next_token)
    
    return next_token, done, state
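
The effect of temperature can be seen on a small set of logits. The following is a plain-Python sketch of the same idea (greedy when the temperature is 0, otherwise sampling from the softmax of the scaled logits); `pick_next_token` and the example logits are made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(logits, temperature=0.0, rng=random):
    """Greedy when temperature is 0, otherwise sample from softmax(logits / temperature)."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]

print(pick_next_token(logits))                   # greedy: index of the largest logit
print(softmax([x / 0.5 for x in logits]))        # low temperature: sharper distribution
print(softmax([x / 2.0 for x in logits]))        # high temperature: flatter distribution
print(pick_next_token(logits, temperature=1.0))  # random sample
```

Lower temperatures concentrate probability on the top token, so the samples approach the greedy choice; higher temperatures spread probability across the vocabulary.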

With these extra functions, we can write a generation loop.

In [None]:
next_token, done, state = decoder.get_initial_state(english_enc)
tokens = []

for n in range(10):
    # run one step
    next_token, done, state = decoder.get_next_token(
        english_enc, next_token, done, state, temperature=1.0
    )
    # add the token to the output
    tokens.append(next_token)

# stack all the tokens together
tokens = tf.concat(tokens, axis=-1) # (batch, t)

# Convert the tokens back to strings
result = decoder.tokens_to_text(tokens)
result

Of course the model is untrained, so it outputs items from the vocabulary almost uniformly at random.

<!-- TOC --><a name="44-combining-encoder-and-decoder-into-translator"></a>
## 4.4 Combining Encoder and Decoder into Translator

**Purpose**: Translate English to French.

**Inputs**: English tokens, French input tokens.

**Outputs**: Logit predictions for the next French tokens (decoded into a French translation at inference time).

**Steps**:
1. Feed English tokens through Encoder, generate English encodings.
2. Feed English encodings and French input tokens to Decoder, generate prediction logits.

In [None]:
class Translator(tf.keras.Model):
    @classmethod
    def add_method(cls, fun):
        setattr(cls, fun.__name__, fun)
        return fun

    def __init__(self, units, english_vectorizer, french_vectorizer):
        super().__init__()
        # build the encoder and decoder
        encoder = Encoder(english_vectorizer, units)
        decoder = Decoder(french_vectorizer, units)
        
        self.encoder = encoder
        self.decoder = decoder
        
    def call(self, inputs):
        # extract english tokens and french input tokens
        english_tok, french_in = inputs
        # convert english tokens to encodings
        english_enc = self.encoder(english_tok)
        # compute logits from english encodings and french input tokens
        logits = self.decoder(english_enc, french_in)
        return logits

In [None]:
# Try it out:
model = Translator(UNITS, english_vectorizer, french_vectorizer)

# pass English tokens and French input tokens
logits = model((english_tok, french_in))

print(f'English tokens shape (encoder input): (batch, s) {english_tok.shape}')
print(f'English encodings shape (encoder output and decoder input): (batch, s, units) {english_enc.shape}')
print(f'French tokens shape (decoder input): (batch, t) {french_in.shape}')
print(f'Logits shape (decoder output): (batch, t, french_vocabulary_size) {logits.shape}')

<!-- TOC --><a name="5-training"></a>
# 5. Training

For training, we need to implement our own masked loss and accuracy functions:

In [None]:
def masked_loss(y_true, y_pred):
    # Calculate the loss for each item in the batch.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_fn(y_true, y_pred)

    # Mask off the losses on padding.
    mask = tf.cast(y_true != 0, loss.dtype)
    loss *= mask

    # Return the mean loss over the non-padding tokens.
    return tf.reduce_sum(loss)/tf.reduce_sum(mask)
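
A small numeric example shows why the mask matters. This is a plain-Python sketch with made-up per-token losses, where label `0` marks padding.

```python
def masked_mean(per_token_loss, labels, pad_id=0):
    """Average the loss over non-padding positions only."""
    mask = [0.0 if y == pad_id else 1.0 for y in labels]
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    return total / sum(mask)

labels = [4, 7, 3, 0]                  # the last position is padding
per_token_loss = [0.5, 1.5, 2.0, 9.0]  # the loss on the padding position is meaningless

print(masked_mean(per_token_loss, labels))          # mean over the 3 real tokens
print(sum(per_token_loss) / len(per_token_loss))    # naive mean, distorted by padding
```

Without the mask, the arbitrary loss on padding positions would be averaged in, and batches with more padding would report different losses for the same prediction quality.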

In [None]:
def masked_acc(y_true, y_pred):
    # Pick the most likely token at each position.
    y_pred = tf.argmax(y_pred, axis=-1)
    y_pred = tf.cast(y_pred, y_true.dtype)

    match = tf.cast(y_true == y_pred, tf.float32)
    mask = tf.cast(y_true != 0, tf.float32)
    # Ignore matches on padding so they do not inflate the accuracy.
    match *= mask

    return tf.reduce_sum(match)/tf.reduce_sum(mask)

In [None]:
model.compile(optimizer='adam',
              loss=masked_loss, 
              metrics=[masked_acc, masked_loss])

In [None]:
vocab_size = 1.0 * french_vectorizer.vocabulary_size()

{
    "expected_loss": tf.math.log(vocab_size).numpy(),
    "expected_acc": 1/vocab_size
}

This should roughly match the values returned by running a few steps of evaluation:

In [None]:
model.evaluate(test_ds, steps=20, return_dict=True)
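
The baseline values computed before the evaluation follow from assuming that an untrained model spreads probability roughly evenly over the vocabulary. If every one of the `V` tokens gets probability `1/V`, the cross-entropy is `-log(1/V) = log(V)` and the chance of guessing the right token is `1/V`. A quick sketch, using a hypothetical vocabulary size of 5000:

```python
import math

def uniform_baseline(vocab_size):
    """Expected loss and accuracy of a model that predicts uniformly at random."""
    p = 1.0 / vocab_size
    return {"expected_loss": -math.log(p), "expected_acc": p}

baseline = uniform_baseline(5000)  # 5000 is an illustrative vocabulary size
print(baseline)
```

If the measured loss before training is far above this baseline, that can point to a problem with the initialization or the loss masking.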

In [None]:
history = model.fit(
    train_ds.repeat(), # .repeat() makes it an infinite dataset
    validation_data=test_ds,
    epochs=20,
    steps_per_epoch = 100, # since we are using an infinite dataset, we need to specify the number of steps per epoch
    validation_steps = 20,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=3)
    ]
)

<!-- TOC --><a name="6-inference"></a>
# 6. Inference

In [None]:
@Translator.add_method
def translate(self,
              texts, *,
              max_length=50,
              temperature=0.0):
    # Process the input texts
    context = self.encoder.encode_text(texts)
    batch_size = tf.shape(texts)[0]

    # Setup the loop inputs
    tokens = []
    next_token, done, state = self.decoder.get_initial_state(context)

    for _ in range(max_length):
        # Generate the next token
        next_token, done, state = self.decoder.get_next_token(context, next_token, done,  state, temperature)

        # Collect the generated tokens
        tokens.append(next_token)

        if tf.executing_eagerly() and tf.reduce_all(done):
            break

    # Concatenate the list of generated tokens.
    tokens = tf.concat(tokens, axis=-1)   # t*[(batch, 1)] -> (batch, t)

    result = self.decoder.tokens_to_text(tokens)
    return result
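
Stripped of the TensorFlow details, the loop above is ordinary autoregressive decoding: feed the previous token back in, stop at `[END]` or after `max_length` steps. Here is a minimal single-sentence sketch in plain Python, where a hypothetical lookup table `TOY_NEXT` stands in for the trained decoder. Unlike `translate`, it stops at the first `[END]` instead of padding finished sequences in a batch.

```python
def greedy_decode(step_fn, start_token, end_token, max_length=50):
    """Generate tokens one at a time until end_token or max_length is reached."""
    tokens = []
    token, state = start_token, None
    for _ in range(max_length):
        token, state = step_fn(token, state)
        if token == end_token:
            break
        tokens.append(token)
    return tokens

# Toy stand-in for the decoder: maps the previous token to the next one.
TOY_NEXT = {
    "[START]": "c'est",
    "c'est": "un",
    "un": "jour",
    "jour": "merveilleux",
    "merveilleux": "[END]",
}

def toy_step(token, state):
    return TOY_NEXT[token], state

print(" ".join(greedy_decode(toy_step, "[START]", "[END]")))
print(greedy_decode(toy_step, "[START]", "[END]", max_length=2))  # cut off early
```

The `max_length` cap matters in practice because a model is not guaranteed to ever produce `[END]`.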

In [None]:
# Try it out:
result = model.translate(["This is a wonderful day"]) # C’est un jour merveilleux
result[0].numpy().decode()

<!-- TOC --><a name="7-conclusion"></a>
# 7. Conclusion

In this notebook, we used the Encoder-Decoder architecture with Attention to translate English text to French text.

😊 If you enjoyed this notebook or found it inspiring/useful, an upvote would be really appreciated.