<a href="https://colab.research.google.com/github/rdkdaniel/NLP-Projects/blob/main/English_to_Spanish_translator_with_S_2_S_Transfomer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Reference: https://keras.io/examples/nlp/neural_machine_translation_with_transformer/

# **Well Ordered Work!**

# **Libraries**

In [1]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

# **Downloading the data**



*   Uses the English-to-Spanish translation dataset provided by Anki.




In [2]:
text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


# **Scanning Through the Data**



*   The data consists of English sentences paired with their Spanish translations.
*   English is the source sequence and Spanish is the target sequence, so "[start]" and "[end]" tokens are added around each Spanish sentence.



In [3]:
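# Each line holds an English sentence and its Spanish translation, separated by a tab
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]  # drop the empty string after the final newline
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    # Mark the start and end of every target (Spanish) sentence
    spa = "[start] " + spa + " [end]"
    text_pairs.append((eng, spa))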
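
As a quick illustration of the line format the loop above expects, the sketch below parses a single tab-separated line in the same way. The sample sentence is hypothetical and is not taken from the file.

```python
# Hypothetical sample line in the "English<TAB>Spanish" layout (made up for illustration)
sample_line = "Good morning.\tBuenos días."

# Split on the tab, then wrap the Spanish side in the start/end markers
sample_eng, sample_spa = sample_line.split("\t")
sample_pair = (sample_eng, "[start] " + sample_spa + " [end]")
print(sample_pair)
```

Only the target side gets the markers; the English source sentence is left untouched.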

**A few randomly chosen sentence pairs:**

In [4]:
for _ in range(5):
    print(random.choice(text_pairs))

("Hurry up. You'll be late for school.", '[start] Apúrate. Vas a llegar tarde a la escuela. [end]')
("A fox isn't caught twice in the same snare.", '[start] Un zorro nunca es capturado dos veces con el mismo cepo. [end]')
('Give him this message as soon as he arrives.', '[start] Dale este recado apenas él llegue. [end]')
('You have to memorize this sentence.', '[start] Te debes memorizar esta frase. [end]')
('Go wake up Tom and tell him that breakfast is ready.', '[start] Andá a despertar a Tom y decile que el desayuno está listo. [end]')


# **Data Prep**



*   Splitting the dataset into training, validation, and test sets (15% each for validation and test, with the remainder used for training).




In [5]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs
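
The split sizes printed above follow directly from the arithmetic in the cell. A minimal sketch that reproduces them, where `split_sizes` is a hypothetical helper introduced only for this check:

```python
def split_sizes(total, val_fraction=0.15):
    """Return (train, val, test) sizes using the same arithmetic as the cell above."""
    num_val = int(val_fraction * total)   # fractional part is truncated
    num_train = total - 2 * num_val       # whatever is left after val and test
    return num_train, num_val, num_val

train_n, val_n, test_n = split_sizes(118964)
print(train_n, val_n, test_n)  # 83276 17844 17844
```

Because `int()` truncates, the pair "lost" to rounding ends up in the training set, so the three sizes always add up to the total.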


# **Vectorization of the Data**



*   Vectorization turns the original strings into integer sequences.
*   Each integer is the index of a word in the vocabulary.
*   NB: punctuation marks are stripped from the Spanish text. Is that a good idea? I do not think so.
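
To make the word-to-index idea concrete, here is a toy pure-Python sketch. It is not how `TextVectorization` is implemented; `build_toy_vocab` and `toy_vectorize` are hypothetical helpers, and the only behaviour borrowed from the Keras layer is the convention that index 0 is padding and index 1 is the out-of-vocabulary token.

```python
from collections import Counter

def build_toy_vocab(texts, max_tokens):
    """Most frequent words first, after the reserved padding and unknown entries."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]

def toy_vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_vocab = build_toy_vocab(["the cat sat", "the dog sat down"], max_tokens=10)
toy_ids = toy_vectorize("the bird sat", toy_vocab, sequence_length=5)
print(toy_vocab)
print(toy_ids)
```

The unseen word "bird" maps to index 1, and the two unused trailing positions are filled with 0.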


In [6]:
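# Characters stripped from the Spanish text: ASCII punctuation plus "¿".
# The square brackets are kept so the "[start]" and "[end]" tokens survive.
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = 15000  # maximum number of tokens kept per language
sequence_length = 20  # tokens per vectorized English sentence
batch_size = 64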


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")


eng_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
# Build each vocabulary from the training split only
train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
spa_vectorization.adapt(train_spa_texts)
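
To see what the custom standardization does to a sentence without running TensorFlow, here is a rough pure-Python analogue. `standardize_py` and `strip_chars_demo` are hypothetical names, and Python's `str.lower()` is Unicode-aware, so results may differ from `tf.strings.lower` for accented capital letters.

```python
import re
import string

# Same character set as the cell above: ASCII punctuation plus "¿", minus the brackets
strip_chars_demo = (string.punctuation + "¿").replace("[", "").replace("]", "")

def standardize_py(text):
    """Lowercase the text, then delete every character listed in strip_chars_demo."""
    return re.sub("[%s]" % re.escape(strip_chars_demo), "", text.lower())

demo = standardize_py("[start] ¿Dónde está el baño? [end]")
print(demo)  # [start] dónde está el baño [end]
```

The question marks disappear while the bracketed tokens stay intact. Note that only "¿" is added to the ASCII punctuation set, so an inverted exclamation mark "¡" would not be stripped by this character set.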
