# **Ultimate Guide to Sequence-to-Sequence (Seq2Seq) Models for Translation**

## **1. Introduction**

### **What is a Seq2Seq Model?**
A **Sequence-to-Sequence (Seq2Seq)** model is a neural network architecture designed for tasks where the input and output are sequences of varying lengths. It consists of two main components:
1. **Encoder**: Processes the input sequence and encodes it into a fixed-size context vector.
2. **Decoder**: Generates the output sequence step-by-step, using the context vector as input.

### **Applications**
- **Machine Translation**: Translating text from one language to another (e.g., English to French).
- **Text Summarization**: Generating concise summaries of long documents.
- **Chatbots**: Generating conversational responses.

### **Objective**
To build and train a Seq2Seq model for **English-to-French translation** using TensorFlow/Keras.

---

## **2. Metadata and Dataset Overview**

### **Dataset Used**
- **Dataset Name**: English-French Translation Dataset
- **Source**: [Tatoeba Project](https://tatoeba.org/), downloaded in this guide as `fra-eng.zip` from TensorFlow's public download storage.
- **Description**: A collection of sentence pairs in English and French.

### **Acknowledgement**
This dataset is publicly available and widely used for educational purposes in NLP.

---



## **3. Loading and Exploring the Dataset**

### **Code: Load the Dataset**

In [24]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import zipfile

warnings.filterwarnings('ignore')
color_pal = sns.color_palette()
plt.style.use('fivethirtyeight')

# Download the dataset
path_to_zip = tf.keras.utils.get_file(
    'fra-eng.zip',
    origin='http://storage.googleapis.com/download.tensorflow.org/data/fra-eng.zip',
    extract=False  # Do not extract automatically
)

# Extract the dataset
with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    zip_ref.extractall(os.path.dirname(path_to_zip))

# Path to the extracted file
path_to_file = os.path.join(os.path.dirname(path_to_zip), 'fra.txt')

# Check if the file exists
if not os.path.exists(path_to_file):
    raise FileNotFoundError(f"The file was not found at {path_to_file}")

# Load the dataset
with open(path_to_file, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

# Preview the dataset
print(f"Total sentence pairs: {len(lines)}")
print(lines[:5])  # Print the first 5 sentence pairs


Total sentence pairs: 167131
['Go.\tVa !', 'Hi.\tSalut !', 'Run!\tCours\u202f!', 'Run!\tCourez\u202f!', 'Who?\tQui ?']


### **Explanation**
- The archive is downloaded with `tf.keras.utils.get_file` and extracted with Python's `zipfile` module.
- Each line contains an English sentence and its corresponding French translation, separated by a tab (`\t`).
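
As a rough illustration of that line format, the sketch below splits a few of the sample lines from the preview into (English, French) pairs. The helper name `split_pair` is hypothetical, and keeping only the first two fields is an assumption that guards against files with extra tab-separated columns.

```python
# Sample lines copied from the preview output, plus one blank line.
sample_lines = ['Go.\tVa !', 'Hi.\tSalut !', 'Who?\tQui ?', '']

def split_pair(line):
    """Split one tab-separated line into an (English, French) tuple."""
    fields = line.split('\t')
    # Keep only the first two fields; any further columns are ignored.
    return fields[0], fields[1]

# Skip blank lines, such as the empty string a trailing newline produces.
pairs = [split_pair(line) for line in sample_lines if line.strip()]

for english, french in pairs:
    print(f"{english!r} -> {french!r}")
```

Filtering blank lines first avoids an `IndexError` on lines that contain no tab at all.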


### **Code: Preprocess the Dataset**

In [25]:
import re

# Add <start> and <end> tokens to the target sentences
def preprocess_sentence(sentence, is_target=False):
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)  # Add spaces around punctuation
    sentence = re.sub(r'[" "]+', " ", sentence)  # Collapse runs of spaces (and double quotes) into one space
    sentence = sentence.strip()
    if is_target:
        sentence = '<start> ' + sentence + ' <end>'  # Add <start> and <end> tokens
    return sentence

# Preprocess the first 10,000 pairs (English source, French target)
word_pairs = []
for line in lines[:10000]:
    english, french = line.split('\t')[:2]
    word_pairs.append([preprocess_sentence(english), preprocess_sentence(french, is_target=True)])
print(word_pairs[:5])


[['go .', '<start> va ! <end>'], ['hi .', '<start> salut ! <end>'], ['run !', '<start> cours\u202f ! <end>'], ['run !', '<start> courez\u202f ! <end>'], ['who ?', '<start> qui ? <end>']]


### **Explanation**
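- Each sentence is preprocessed by:
  - Converting to lowercase.
  - Adding spaces around punctuation.
  - Collapsing repeated spaces into one.
- Target (French) sentences are wrapped in `<start>` and `<end>` tokens so the decoder knows where a sentence begins and ends.
- The narrow no-break space (U+202F) that French typography places before some punctuation is not removed, so tokens such as `cours\u202f` stay distinct from `cours`.
- The dataset is limited to the first 10,000 pairs for faster training.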
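
The preprocessed output still shows U+202F characters attached to some French words. One possible way to handle this, sketched here as an assumption and not as what the notebook does, is to collapse all Unicode whitespace with `\s+`. The function name `preprocess_normalized` is hypothetical. Unlike the original regex, this variant leaves double quotes untouched.

```python
import re

def preprocess_normalized(sentence, is_target=False):
    """Variant of preprocess_sentence that also collapses Unicode whitespace."""
    sentence = sentence.lower().strip()
    # Put spaces around punctuation so each mark becomes its own token.
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    # In str patterns, \s matches Unicode whitespace, including U+202F.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    if is_target:
        sentence = '<start> ' + sentence + ' <end>'
    return sentence

print(preprocess_normalized("Cours\u202f!", is_target=True))
```

With this change, `cours` followed by a narrow no-break space would map to the same token as plain `cours`, which would likely shrink the French vocabulary somewhat.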



## **4. Tokenization and Padding**

### **Code: Tokenize the Sentences**

In [26]:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize English and French sentences
def tokenize(lang):
    tokenizer = Tokenizer(filters='')
    tokenizer.fit_on_texts(lang)
    tensor = tokenizer.texts_to_sequences(lang)
    tensor = pad_sequences(tensor, padding='post')
    return tensor, tokenizer

# Split into input (English) and target (French) sequences
input_tensor, input_tokenizer = tokenize([pair[0] for pair in word_pairs])
target_tensor, target_tokenizer = tokenize([pair[1] for pair in word_pairs])

# Vocabulary sizes
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1

print(f"Input vocabulary size: {input_vocab_size}")
print(f"Target vocabulary size: {target_vocab_size}")


Input vocabulary size: 2146
Target vocabulary size: 4859


### **Explanation**
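- **Tokenizer**: Converts sentences into sequences of integers. Passing `filters=''` disables Keras's default character filtering, so punctuation and the `<start>`/`<end>` tokens are kept.
- **Padding**: Ensures all sequences have the same length by adding zeros at the end.
- **Vocabulary Size**: The number of unique tokens in each language plus one, because index 0 is reserved for padding.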
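
To make these three steps concrete, here is a simplified pure-Python stand-in for the tokenizer and padding. It is only a sketch: the helper names `build_word_index` and `to_padded_sequences` are hypothetical, and the real Keras classes offer many more options.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent first; 0 stays free for padding."""
    counts = Counter(word for s in sentences for word in s.split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def to_padded_sequences(sentences, word_index):
    """Convert sentences to id lists and right-pad them with zeros."""
    seqs = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]

sentences = ["go .", "i see .", "run !"]
word_index = build_word_index(sentences)
padded = to_padded_sequences(sentences, word_index)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(word_index)
print(padded)
print(vocab_size)
```

The `+ 1` mirrors the vocabulary-size computation in the notebook: the embedding layer needs one row per word index plus one for index 0.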

---

## **5. Build the Seq2Seq Model**


### **Key Concepts**
1. **Encoder**:
   - Processes the input sequence (English) and encodes it into a context vector.
   - Is typically an LSTM or GRU layer; this guide uses an LSTM.
2. **Decoder**:
   - Generates the output sequence (French) step-by-step.
   - Is typically an LSTM or GRU layer, optionally with attention; this guide uses an LSTM without attention.



### **Code: Define the Encoder**

In [27]:


from tensorflow.keras.layers import Embedding, LSTM, Dense, Input

# Encoder
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(input_vocab_size, 256)(encoder_inputs)
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]  # Context vector


### **Explanation**
- **Embedding Layer**: Converts word indices into dense vectors.
- **LSTM Layer**: Processes the sequence and returns its final hidden state (`state_h`) and cell state (`state_c`), which together serve as the context vector. The per-step output (`encoder_outputs`) is not used.


### **Code: Define the Decoder**

In [28]:


# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(target_vocab_size, 256)(decoder_inputs)
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(target_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)


### **Explanation**
- **Embedding Layer**: Converts French word indices into dense vectors.
- **LSTM Layer**: Uses the context vector (`encoder_states`) as its initial state.
- **Dense Layer**: Outputs probabilities for each word in the French vocabulary.
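
The softmax activation in the dense layer turns raw scores into probabilities. A minimal sketch of that computation follows, using a made-up five-word vocabulary and made-up scores (both are assumptions for illustration only).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores for a single decoder position.
toy_vocab = ['<pad>', '<start>', '<end>', 'va', '!']
logits = [0.1, 0.2, 0.5, 2.0, 1.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(toy_vocab[best], round(probs[best], 3))
```

In the actual model this happens once per decoder position, over the full French vocabulary.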


### **Code: Build the Model**

In [29]:

from tensorflow.keras.models import Model

# Combine encoder and decoder
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()


### **Explanation**
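- The model takes two inputs: English sentences (`encoder_inputs`) and French sentences (`decoder_inputs`). Feeding the ground-truth French tokens to the decoder during training is known as teacher forcing.
- It outputs, for each decoder position, a probability distribution over the French vocabulary (`decoder_outputs`).
- The `sparse_categorical_crossentropy` loss lets the targets stay as integer token ids rather than one-hot vectors.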
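
As a rough sketch of what the sparse categorical cross-entropy loss measures, the snippet below averages the negative log-probability assigned to the correct token at each position. The token ids and probabilities are invented for illustration, and the function is a simplified stand-in, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(target_ids, predicted_probs):
    """Mean of -log(probability assigned to the correct token) over positions."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, predicted_probs)]
    return sum(losses) / len(losses)

# Hypothetical targets (integer ids) and model outputs (one distribution per position).
target_ids = [3, 4, 2]
predicted_probs = [
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.05, 0.05, 0.80, 0.05, 0.05],
]

loss = sparse_categorical_crossentropy(target_ids, predicted_probs)
print(round(loss, 4))
```

The loss falls toward zero as the probability on each correct token approaches 1.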


## **6. Train the Model**

### **Code: Prepare the Data**


In [30]:
# Split into training and validation sets
from sklearn.model_selection import train_test_split

train_input, val_input, train_target, val_target = train_test_split(
    input_tensor, target_tensor, test_size=0.2
)

# Prepare decoder input and output
train_decoder_input = train_target[:, :-1]  # Exclude the last token
train_decoder_output = train_target[:, 1:]  # Exclude the first token

val_decoder_input = val_target[:, :-1]
val_decoder_output = val_target[:, 1:]


### **Explanation**
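- The decoder input is the target sequence without its last token, and the decoder output is the same sequence without its first token (`<start>`). At each position the model therefore learns to predict the next word from the words before it.
- `train_test_split` holds out 20% of the pairs for validation.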
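
The shift is easiest to see on a single padded sequence. The token ids below are hypothetical (1 for `<start>`, 2 for `<end>`, 0 for padding), and plain list slicing stands in for the NumPy slicing used in the notebook.

```python
# Hypothetical token ids: 1 = <start>, 2 = <end>, 0 = padding.
target = [1, 5, 9, 2, 0, 0]

decoder_input = target[:-1]   # drop the last position
decoder_output = target[1:]   # drop the first position (<start>)

# Each decoder step sees one token and is trained to predict the next one.
for step, (seen, expected) in enumerate(zip(decoder_input, decoder_output)):
    print(f"step {step}: decoder sees {seen}, should predict {expected}")
```

Both slices have the same length, which is what lets the model compare its prediction at every position with the token that follows.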


### **Code: Train the Model**

In [31]:
# Training parameters
epochs = 30
batch_size = 64

# Train the model
history = model.fit(
    [train_input, train_decoder_input],
    train_decoder_output,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([val_input, val_decoder_input], val_decoder_output)
)


Epoch 1/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.5594 - loss: 4.0934 - val_accuracy: 0.7228 - val_loss: 1.9643
Epoch 2/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.7264 - loss: 1.8531 - val_accuracy: 0.7300 - val_loss: 1.7788
Epoch 3/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - accuracy: 0.7327 - loss: 1.6750 - val_accuracy: 0.7389 - val_loss: 1.6737
Epoch 4/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - accuracy: 0.7435 - loss: 1.5532 - val_accuracy: 0.7571 - val_loss: 1.5820
Epoch 5/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.7661 - loss: 1.4405 - val_accuracy: 0.7734 - val_loss: 1.5038
Epoch 6/30
125/125 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - accuracy: 0.7805 - loss: 1.3355 - val_accuracy: 0.7817 - val_loss: 1.4384
Epoch 7/30
125/125

### **Explanation**
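- The model is trained for 30 epochs with a batch size of 64, which gives 125 steps per epoch for the 8,000 training pairs.
- Validation data (the remaining 2,000 pairs) is used to monitor performance after each epoch.
- The reported accuracy is computed over every position, including padding, because the embedding layers do not mask index 0. It therefore overstates per-word translation accuracy.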
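
Because the target sequences are zero-padded and the model does not mask padding, correctly predicted padding positions count toward the accuracy metric. The sketch below, with invented token ids and a hypothetical `accuracy` helper, shows how much that can shift the number.

```python
def accuracy(y_true, y_pred, ignore_padding=False, pad_id=0):
    """Fraction of positions where the prediction matches the target."""
    pairs = [(t, p) for seq_t, seq_p in zip(y_true, y_pred)
             for t, p in zip(seq_t, seq_p)]
    if ignore_padding:
        # Only score positions that hold a real token in the target.
        pairs = [(t, p) for t, p in pairs if t != pad_id]
    return sum(t == p for t, p in pairs) / len(pairs)

# Hypothetical targets and predictions; 0 is the padding id.
y_true = [[5, 9, 2, 0, 0, 0], [7, 2, 0, 0, 0, 0]]
y_pred = [[5, 4, 2, 0, 0, 0], [3, 2, 0, 0, 0, 0]]

plain = accuracy(y_true, y_pred)
masked = accuracy(y_true, y_pred, ignore_padding=True)
print(f"all positions: {plain:.3f}, non-padding only: {masked:.3f}")
```

In this toy case the unmasked figure is noticeably higher than the masked one, even though only three of the five real tokens are correct.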



## **7. Inference (Translation)**


### **Code: Build Inference Models**

In [32]:
# Encoder inference model
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_embedding, initial_state=decoder_states_inputs
)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)


### **Explanation**
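- Training processes whole target sequences at once, but at inference time the translation is not known in advance, so it is generated step-by-step.
- `encoder_model` maps an input sentence to the LSTM states. `decoder_model` takes one token plus the previous states and returns next-token probabilities and the updated states.
- Both models reuse the trained layers, so no retraining is needed.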


### **Code: Translate Function**

In [33]:
def translate(input_seq):
    # Encode the input sequence into the decoder's initial states [h, c]
    states_value = encoder_model.predict(input_seq)

    # Initialize the decoder input with the <start> token
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']

    # Generate the translation one token at a time (greedy decoding)
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Pick the most probable next token
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        # Index 0 is the padding id and has no entry in index_word
        sampled_word = target_tokenizer.index_word.get(sampled_token_index, '')
        decoded_sentence += ' ' + sampled_word

        # Exit on <end>, on a padding prediction, or once the output exceeds 50 characters
        if sampled_word in ('<end>', '') or len(decoded_sentence) > 50:
            stop_condition = True

        # Feed the chosen token and the updated states back into the decoder
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]

    return decoded_sentence.strip()

### **Explanation**
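- The function translates an English sentence into French using greedy decoding: at each step it picks the most probable next word and feeds it back into the decoder.
- Decoding stops when `<end>` is produced or the decoded string exceeds 50 characters. The returned string still includes the `<end>` marker, so callers may want to strip it.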
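
The greedy loop can be sketched without a trained model by replacing the decoder with a fixed lookup table. Everything here is a hypothetical stand-in: `toy_step`, `TOY_TABLE`, and the token ids are invented. The sketch also differs from `translate` in two deliberate ways: it checks for the end marker before appending, and it caps the number of tokens instead of the number of characters.

```python
def greedy_decode(step_fn, start_id, end_id, max_tokens=10):
    """Repeatedly pick the most probable next token until end_id or max_tokens."""
    token, state, output = start_id, None, []
    for _ in range(max_tokens):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break  # stop before the end marker is added to the output
        output.append(token)
    return output

# Hypothetical stand-in for the decoder: fixed next-token probabilities.
# Ids: 0 = padding, 1 = <start>, 2 = <end>, 3 = "va", 4 = "!".
TOY_TABLE = {
    1: [0.0, 0.0, 0.10, 0.80, 0.10],
    3: [0.0, 0.0, 0.20, 0.10, 0.70],
    4: [0.0, 0.0, 0.90, 0.05, 0.05],
}

def toy_step(token, state):
    """Return (probabilities, state); the toy decoder carries no real state."""
    return TOY_TABLE[token], state

index_word = {3: 'va', 4: '!'}
ids = greedy_decode(toy_step, start_id=1, end_id=2)
print(' '.join(index_word[i] for i in ids))
```

Checking for the end marker first keeps it out of the returned sequence, and a token cap gives a limit that does not depend on word length.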



## **8. Results and Evaluation**

### **Code: Test Translation**


In [37]:
# Test translation
input_seq = input_tensor[0:1]  # First English sentence
translated_sentence = translate(input_seq)
print(f"Input: {word_pairs[0][0]}")
print(f"Translation: {translated_sentence}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
Input: go .
Translation: va ! <end>


## **9. Conclusion**

### **Key Learnings**
- Seq2Seq models are powerful for sequence-based tasks like translation.
- The encoder-decoder architecture effectively handles variable-length sequences.
- Attention mechanisms (not covered here) can further improve performance.

### **Next Steps**
- Add **attention mechanisms** to improve translation quality.
- Experiment with **larger datasets** and **deeper models**.
- Use **beam search** for better decoding.
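
To give a feel for the beam search suggestion, here is a minimal sketch on a toy probability table. It is not tied to the Keras models above: `toy_next_probs`, `TOY_TABLE`, and the token ids are hypothetical, the toy decoder ignores LSTM state, and refinements such as length normalization are left out.

```python
import math

def beam_search(step_fn, start_id, end_id, beam_width=2, max_tokens=10):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([start_id], 0.0)]  # (token ids, summed log-probability)
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for tokens, score in beams:
            for token_id, p in enumerate(step_fn(tokens[-1])):
                if p > 0:
                    candidates.append((tokens + [token_id], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Sequences that reached end_id are set aside; others keep growing.
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:
            break
    best_tokens, _ = max(finished or beams, key=lambda c: c[1])
    return best_tokens

# Hypothetical next-token probabilities. Ids: 1 = <start>, 2 = <end>.
TOY_TABLE = {
    1: [0.0, 0.0, 0.0, 0.60, 0.40],
    3: [0.0, 0.0, 0.4, 0.30, 0.30],
    4: [0.0, 0.0, 0.9, 0.05, 0.05],
}

def toy_next_probs(last_token):
    return TOY_TABLE[last_token]

print(beam_search(toy_next_probs, start_id=1, end_id=2, beam_width=1))
print(beam_search(toy_next_probs, start_id=1, end_id=2, beam_width=2))
```

With a beam width of 1 the search behaves like greedy decoding and commits to token 3 first. With a width of 2 it also explores token 4, which leads to a complete sequence with higher overall probability in this toy table.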

---

## **10. References**
- TensorFlow Seq2Seq Tutorial: [Neural Machine Translation](https://www.tensorflow.org/tutorials/text/nmt_with_attention)
- Tatoeba Dataset: [Tatoeba Project](https://tatoeba.org/)
- Seq2Seq Paper: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)

