The objective of this notebook is English → French translation using a Seq2Seq model with Attention. The model automates English-to-French translation, which is useful in any field where English and French speakers need to communicate.



In [None]:
from IPython.display import Markdown as md
display(md("## Language Translation (English → French) — Seq2Seq with Attention\nCPU-friendly demo using a small subset of Tatoeba."))


## Language Translation (English → French) — Seq2Seq with Attention
CPU-friendly demo using a small subset of Tatoeba.

1. Setup & Dependencies

In [None]:
!pip install tensorflow keras nltk tqdm



2. Dataset Loading

In [None]:
english_sentences = ['hello', 'how are you', 'i am learning ai', 'good morning', 'thank you']
french_sentences = ['bonjour', 'comment allez-vous', "j'apprends l'intelligence artificielle", 'bonjour', 'merci']

print("Using", len(english_sentences), "demo sentence pairs.")


Using 5 demo sentence pairs.
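
The five hard-coded pairs above stand in for a real corpus. As a rough sketch of how a small subset of a tab-separated Tatoeba / ManyThings export might be loaded instead, the helper below assumes each line holds an English sentence, a tab, a French sentence, and optionally an attribution column. The helper name `parse_pairs` and the file name `fra.txt` are illustrative assumptions, not part of the original notebook.

```python
def parse_pairs(lines, limit=1000):
    """Parse 'english<TAB>french[<TAB>attribution]' lines (assumed format)."""
    english, french = [], []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[0] or not parts[1]:
            continue  # skip malformed or empty rows
        english.append(parts[0].lower())
        french.append(parts[1].lower())
        if len(english) >= limit:
            break  # keep the subset small so the demo stays CPU-friendly
    return english, french


# Usage with in-memory lines. With a real file (hypothetical path) this would be:
#   with open("fra.txt", encoding="utf-8") as f:
#       english_sentences, french_sentences = parse_pairs(f, limit=1000)
sample = ["Hello.\tBonjour.\tCC-BY 2.0\n", "broken line\n", "Thank you.\tMerci.\n"]
demo_en, demo_fr = parse_pairs(sample)
print(demo_en, demo_fr)
```

The `limit` argument caps the number of pairs that are kept, which is one simple way to take a subset of a larger file.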


3. Preprocessing

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenizers. filters='' on the target side keeps '<sos>' and '<eos>' intact;
# the default filters would strip '<', '>' and '-'.
src_tokenizer = Tokenizer()
trg_tokenizer = Tokenizer(filters='')

# Wrap each target sentence in start/end markers
trg_texts = ['<sos> ' + s + ' <eos>' for s in french_sentences]

src_tokenizer.fit_on_texts(english_sentences)
trg_tokenizer.fit_on_texts(trg_texts)

# Convert sentences to sequences
src_seq = src_tokenizer.texts_to_sequences(english_sentences)
trg_seq = trg_tokenizer.texts_to_sequences(trg_texts)

# Pad sequences
max_src_len = max(len(s) for s in src_seq)
max_trg_len = max(len(s) for s in trg_seq)

src_seq = pad_sequences(src_seq, maxlen=max_src_len, padding='post')
trg_seq = pad_sequences(trg_seq, maxlen=max_trg_len, padding='post')

print("Source sequences:", src_seq)
print("Target sequences:", trg_seq)

Source sequences: [[ 2  0  0  0]
 [ 3  4  1  0]
 [ 5  6  7  8]
 [ 9 10  0  0]
 [11  1  0  0]]
Target sequences: [[1 3 2 0 0]
 [1 4 5 2 0]
 [1 6 7 8 2]
 [1 3 2 0 0]
 [1 9 2 0 0]]


4. Model Architecture

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, GRU, Dense, Bidirectional
from tensorflow.keras.models import Model

src_vocab = len(src_tokenizer.word_index) + 1
trg_vocab = len(trg_tokenizer.word_index) + 1
embedding_dim = 64
units = 64

# Encoder
encoder_inputs = Input(shape=(max_src_len,))
encoder_embed = Embedding(src_vocab, embedding_dim)(encoder_inputs)
encoder_out, forward_h, backward_h = Bidirectional(GRU(units, return_state=True))(encoder_embed)
encoder_state = forward_h + backward_h

# Decoder
decoder_inputs = Input(shape=(max_trg_len,))
decoder_embed = Embedding(trg_vocab, embedding_dim)(decoder_inputs)
decoder_gru = GRU(units, return_sequences=True)
decoder_out = decoder_gru(decoder_embed, initial_state=encoder_state)
decoder_dense = Dense(trg_vocab, activation='softmax')
outputs = decoder_dense(decoder_out)

# Define model
model = Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
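
The model above passes only the final encoder state to the decoder; as written it contains no attention layer. As a hedged illustration of what an attention step computes, here is a NumPy sketch of dot-product (Luong-style) attention for a single decoder step. It assumes the encoder exposes one hidden vector per source token (in Keras that would mean `return_sequences=True` on the encoder GRU), and the function name `dot_product_attention` is hypothetical.

```python
import numpy as np


def dot_product_attention(decoder_state, encoder_outputs, mask=None):
    """Return (context, weights) for one decoder step.

    decoder_state:   (units,)          current decoder hidden state
    encoder_outputs: (src_len, units)  one hidden vector per source token
    mask:            (src_len,) bool   True for real tokens, False for padding
    """
    scores = encoder_outputs @ decoder_state           # (src_len,) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # ignore padded positions
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over source positions
    context = weights @ encoder_outputs                # (units,) weighted sum
    return context, weights


rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 64))                 # 4 source positions, 64 units
dec = rng.normal(size=64)
mask = np.array([True, True, True, False])     # last position is padding
context, weights = dot_product_attention(dec, enc, mask)
print(weights.round(3), context.shape)
```

In a full model, the context vector would typically be combined with the decoder output at each step before the final `Dense` layer.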

5. Training

In [None]:
import numpy as np

# Decoder input = full trg_seq (teacher forcing)
trg_input = trg_seq  # shape: (num_samples, max_trg_len)

# Decoder target = decoder input shifted left by one step, zero-padded at the end
trg_output = np.zeros_like(trg_input)
trg_output[:, :-1] = trg_input[:, 1:]
# sparse_categorical_crossentropy accepts integer labels of shape
# (num_samples, max_trg_len), so no extra axis is needed.

# Train
history = model.fit([src_seq, trg_input], trg_output, batch_size=2, epochs=50)



Epoch 1/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 52ms/step - loss: 2.3859
Epoch 2/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - loss: 2.3340
Epoch 3/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step - loss: 2.2883
Epoch 4/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step - loss: 2.2385
Epoch 5/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step - loss: 2.1385 
Epoch 6/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step - loss: 2.0929
Epoch 7/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step - loss: 1.9194
Epoch 8/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step - loss: 1.9155
Epoch 9/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step - loss: 1.5761
Epoch 10/50
[1m3/3[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 26ms/step - loss: 1.4306
Epoch 11/50
[1m3/

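To make the shift-by-one used for the decoder target explicit, the following NumPy sketch builds the target for a single padded sequence. The ids assume `<sos>` = 1, `<eos>` = 2 and 0 for padding, as in this demo; the helper name `shift_targets` is hypothetical.

```python
import numpy as np


def shift_targets(trg_input):
    """Shift each row left by one step and pad the last position with 0."""
    trg_output = np.zeros_like(trg_input)
    trg_output[:, :-1] = trg_input[:, 1:]
    return trg_output


trg_input = np.array([[1, 3, 2, 0, 0]])   # <sos> bonjour <eos> pad pad
trg_output = shift_targets(trg_input)
print(trg_output)  # [[3 2 0 0 0]]
```

At step t the decoder reads token t of the input and is trained to predict token t+1. Because the `Embedding` layers in this notebook are not created with `mask_zero=True`, the padded positions also count toward the loss.
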
6. Inference / Translation Function

In [None]:
import numpy as np

# Add <sos> and <eos> tokens
trg_sentences_with_tokens = ['<sos> ' + s + ' <eos>' for s in french_sentences]

# Tokenizers: filters='' on the target side keeps '<sos>' and '<eos>' intact.
# Note: if this changes the target word indices relative to the tokenizer used
# for training, retrain the model (sections 4-5) before calling translate().
src_tokenizer = Tokenizer()
trg_tokenizer = Tokenizer(filters='')

# Fit tokenizers
src_tokenizer.fit_on_texts(english_sentences)
trg_tokenizer.fit_on_texts(trg_sentences_with_tokens)

# Convert to sequences
src_seq = src_tokenizer.texts_to_sequences(english_sentences)
trg_seq = trg_tokenizer.texts_to_sequences(trg_sentences_with_tokens)

# Pad sequences
max_src_len = max(len(s) for s in src_seq)
max_trg_len = max(len(s) for s in trg_seq)
src_seq = pad_sequences(src_seq, maxlen=max_src_len, padding='post')
trg_seq = pad_sequences(trg_seq, maxlen=max_trg_len, padding='post')

print("Target tokenizer word index:", trg_tokenizer.word_index)

def translate(sentence):
    # Encode and pad the source sentence
    seq = src_tokenizer.texts_to_sequences([sentence])
    seq = pad_sequences(seq, maxlen=max_src_len, padding='post')

    # Decoder input starts with <sos> and is filled in one token per step
    trg_seq_inf = np.zeros((1, max_trg_len))
    trg_seq_inf[0,0] = trg_tokenizer.word_index['<sos>']  # must exist

    result = []
    for i in range(1, max_trg_len):
        pred = model.predict([seq, trg_seq_inf], verbose=0)
        word_id = int(np.argmax(pred[0,i-1,:]))  # greedy choice for this step
        if word_id == 0:  # padding id: stop
            break
        word = trg_tokenizer.index_word.get(word_id)  # reverse lookup: id -> word
        if word is None:
            break
        result.append(word)
        trg_seq_inf[0,i] = word_id
        if word == '<eos>':
            break
    return ' '.join(result).replace('<eos>', '')




Target tokenizer word index: {'<sos>': 1, '<eos>': 2, 'bonjour': 3, 'comment': 4, 'allez-vous': 5, "j'apprends": 6, "l'intelligence": 7, 'artificielle': 8, 'merci': 9}
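
The loop in `translate` is a greedy decoder: at each step it keeps only the most probable next token and feeds it back in. A framework-free sketch of the same idea follows. Here `next_token_probs` is a hypothetical stand-in for the model call, and the toy probability table is made up purely for illustration.

```python
def greedy_decode(next_token_probs, sos_id, eos_id, max_len):
    """Greedy decoding: repeatedly pick the argmax token until <eos> or max_len."""
    tokens = [sos_id]
    for _ in range(max_len - 1):
        probs = next_token_probs(tokens)   # one probability per token id
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                      # drop <sos>


# Toy stand-in for a trained model: after <sos> (1) predict 3, after 3 predict <eos> (2).
def toy_probs(tokens):
    table = {1: [0.0, 0.0, 0.1, 0.9], 3: [0.0, 0.0, 0.8, 0.2]}
    return table[tokens[-1]]


index_word = {3: "bonjour"}
ids = greedy_decode(toy_probs, sos_id=1, eos_id=2, max_len=5)
print(" ".join(index_word[i] for i in ids))  # bonjour
```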


7. How to test

In [None]:
from IPython.display import Markdown as md
display(md("Use `translate('your sentence')` to translate new English sentences."))

Use `translate('your sentence')` to translate new English sentences.

8. Testing

In [None]:
translate("hello")

'bonjour '
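
A single spot check says little about overall behaviour. One way to check every pair at once is sketched below. `exact_match_rate` is a hypothetical helper, and the stub `fake_translate` only stands in for the notebook's `translate` so that the snippet runs on its own.

```python
def exact_match_rate(translate_fn, sources, references):
    """Fraction of sources whose translation equals the reference,
    compared after whitespace normalization."""
    if not sources:
        return 0.0
    hits = 0
    for src, ref in zip(sources, references):
        hyp = " ".join(translate_fn(src).split())
        if hyp == " ".join(ref.split()):
            hits += 1
    return hits / len(sources)


# Stub translator used only for this illustration
fake_table = {"hello": "bonjour ", "thank you": "merci "}

def fake_translate(sentence):
    return fake_table.get(sentence, "")


rate = exact_match_rate(
    fake_translate,
    ["hello", "thank you", "good morning"],
    ["bonjour", "merci", "bonjour"],
)
print(rate)
```

In the notebook this could be called as `exact_match_rate(translate, english_sentences, french_sentences)`. A high score on the training pairs shows memorization only, not generalization to unseen sentences.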

Model Structure overview:

Dataset preparation -> Encoder -> Decoder -> Training -> Inference

Conclusion: This short implementation translates the words and short phrases in its small demonstration dataset from English to French.

Ethical Considerations & Responsible AI

1: Bias and Fairness

 Language translation models can inherit biases present in the training data.

 Even a small demo dataset may reflect gender, cultural, or social biases in sentence pairs.

 Users must be aware that translations may not always be neutral or fully accurate.


2: Accuracy and Reliability

 The model is trained on a limited dataset for demonstration purposes, so translations are not production-ready.

 Misinterpretation of translations could lead to misunderstandings if used in real-world contexts.


3: Data Privacy and Licensing

 The project uses publicly available datasets (Tatoeba / ManyThings).

 No personal, sensitive, or proprietary data is used, ensuring privacy compliance.


4: Responsible Use

 The system is intended for educational and research purposes only.

 Users should not rely on this model for critical decision-making or official translations.


5: Transparency and Disclosure

 AI tools, libraries, and models used are disclosed (TensorFlow, Keras, Python libraries).

 Any automated outputs should be clearly labeled as AI-generated to avoid misleading users.


6: Future Improvements

 Expanding the dataset and including diverse linguistic contexts can reduce bias.

 Adding human-in-the-loop review or validation would improve reliability and accountability.