# Light Machine Translation Demo (EN → ES)

This notebook demonstrates **toy** machine translation using several model families:
- Seq2Seq (Encoder-Decoder) with **SimpleRNN**
- Seq2Seq with **LSTM**
- Seq2Seq with **GRU**
- A small **Transformer** encoder-decoder (educational)
- A pretrained Hugging Face model (`Helsinki-NLP/opus-mt-en-es`) for comparison (inference only)

**Note:** Training uses a *very small* dataset (10 sentence pairs) and short training runs (100 epochs, each over a single batch), so the notebook runs quickly for demo and teaching purposes. Results are illustrative, not production-quality.


In [1]:
!pip install -q tensorflow transformers sentencepiece --upgrade
print('Install complete')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.12.0 requires tensorflow==2.19.0, but you have tensorflow 2.20.0 which is incompatible.
tensorflow-text 2.19.0 requires tensorflow<2.20,>=2.19.0, but you have tensorflow 2.20.0 which is incompatible.
tf-keras 2.19.0 requires tensorflow<2.20,>=2.19, but you have tensorflow 2.20.0 which is incompatible.

## 1) Imports and Tiny Parallel Dataset

In [2]:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Dense, SimpleRNN, LSTM, GRU, TimeDistributed, MultiHeadAttention, LayerNormalization, Dropout
import numpy as np, random
from transformers import MarianMTModel, MarianTokenizer, pipeline

# Tiny parallel dataset (English -> Spanish)
pairs = [
    ("hello", "hola"),
    ("good morning", "buenos días"),
    ("how are you", "cómo estás"),
    ("thank you", "gracias"),
    ("i love you", "te quiero"),
    ("see you later", "hasta luego"),
    ("what is your name", "¿cómo te llamas?"),
    ("where is the bathroom", "¿dónde está el baño?"),
    ("i need help", "necesito ayuda"),
    ("nice to meet you", "mucho gusto")
]

random.shuffle(pairs)
print('Total pairs:', len(pairs))
for e,s in pairs[:5]:
    print(f'EN: {e}  -->  ES: {s}')


Total pairs: 10
EN: what is your name  -->  ES: ¿cómo te llamas?
EN: i love you  -->  ES: te quiero
EN: nice to meet you  -->  ES: mucho gusto
EN: hello  -->  ES: hola
EN: i need help  -->  ES: necesito ayuda


## 2) Tokenization & Sequence Preparation
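We fit separate tokenizers for the source (English) and target (Spanish) sentences. Each target sentence is wrapped in `<sos>` ... `<eos>`. The decoder input is that sequence without its final token, and the decoder target is the same sequence without `<sos>`, so the target at each step is the token that follows the decoder input at that step.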
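
To make the shift concrete, here is a minimal pure-Python sketch of how one target sentence becomes a decoder input and a decoder target. The helper name `make_teacher_forcing_pair` is hypothetical and not part of this notebook; the notebook applies the same slicing (`s[:-1]` and `s[1:]`) to integer sequences rather than to word lists.

```python
def make_teacher_forcing_pair(target_sentence, start_token="<sos>", end_token="<eos>"):
    """Return (decoder_input, decoder_target) token lists for one sentence.

    Hypothetical helper, for illustration only.
    """
    tokens = [start_token] + target_sentence.lower().split() + [end_token]
    decoder_input = tokens[:-1]   # everything except the final <eos>
    decoder_target = tokens[1:]   # everything except the leading <sos>
    return decoder_input, decoder_target


dec_in, dec_tgt = make_teacher_forcing_pair("buenos días")
for step, (given, expected) in enumerate(zip(dec_in, dec_tgt)):
    # At each step the decoder sees `given` and should predict `expected`.
    print(f"step {step}: input={given!r} -> target={expected!r}")
```

Both lists have the same length, which is why the notebook pads both to `max_tgt_len-1`.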

In [3]:

# Special tokens
start_token = "<sos>"
end_token = "<eos>"

# Prepare source (English) and target (Spanish) texts; targets are wrapped in <sos> ... <eos>
src_texts = [p[0].lower() for p in pairs]
tgt_texts = [f"{start_token} " + p[1].lower() + f" {end_token}" for p in pairs]

src_tok = Tokenizer(filters='', oov_token='<oov>')
src_tok.fit_on_texts(src_texts)
src_vocab_size = len(src_tok.word_index) + 1  # +1 because index 0 is reserved for padding

tgt_tok = Tokenizer(filters='', oov_token='<oov>')
tgt_tok.fit_on_texts(tgt_texts)
tgt_vocab_size = len(tgt_tok.word_index) + 1

# Convert to sequences
src_seqs = src_tok.texts_to_sequences(src_texts)
tgt_seqs = tgt_tok.texts_to_sequences(tgt_texts)

max_src_len = max(len(s) for s in src_seqs)
max_tgt_len = max(len(s) for s in tgt_seqs)

# Prepare encoder input, decoder input, decoder target
encoder_input = pad_sequences(src_seqs, maxlen=max_src_len, padding='post')
decoder_input = pad_sequences([s[:-1] for s in tgt_seqs], maxlen=max_tgt_len-1, padding='post')
decoder_target = pad_sequences([s[1:] for s in tgt_seqs], maxlen=max_tgt_len-1, padding='post')

print('src_vocab_size', src_vocab_size, 'tgt_vocab_size', tgt_vocab_size)
print('max_src_len', max_src_len, 'max_tgt_len', max_tgt_len)


src_vocab_size 25 tgt_vocab_size 24
max_src_len 4 max_tgt_len 6


## 3) Helper to build Seq2Seq models (RNN/LSTM/GRU)
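Each model is an encoder-decoder: the encoder's final state(s) initialize the decoder, which is trained with teacher forcing (at every step it receives the gold previous token rather than its own prediction).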
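
The state handoff between encoder and decoder can be sketched in plain NumPy. This is only an illustration of the idea, assuming a vanilla `tanh` RNN step with random, untrained weights and toy sizes; the names `rnn_step` and `make_params` are hypothetical and this is not how the Keras layers are implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, latent_dim = 4, 3  # toy sizes, not the notebook's 64/64


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(x_t W_x + h_prev W_h + b)."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)


def make_params():
    """Random (untrained) weights for one RNN; illustration only."""
    return (rng.normal(size=(embedding_dim, latent_dim)),
            rng.normal(size=(latent_dim, latent_dim)),
            np.zeros(latent_dim))


enc_params = make_params()
dec_params = make_params()

# Encoder: read the embedded source tokens and keep only the final state.
src_embedded = rng.normal(size=(4, embedding_dim))  # 4 source tokens
h = np.zeros(latent_dim)
for x_t in src_embedded:
    h = rnn_step(x_t, h, *enc_params)
encoder_state = h

# Decoder: start from the encoder state and emit one hidden state per target step.
tgt_embedded = rng.normal(size=(5, embedding_dim))  # 5 decoder input tokens
h = encoder_state
decoder_outputs = []
for x_t in tgt_embedded:
    h = rnn_step(x_t, h, *dec_params)
    decoder_outputs.append(h)
decoder_outputs = np.stack(decoder_outputs)

print(decoder_outputs.shape)  # (5, 3): one latent vector per decoder step
```

In the notebook, a dense softmax layer then maps each of these per-step vectors to a distribution over the target vocabulary.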

In [4]:

def build_seq2seq(cell_type='rnn', embedding_dim=64, latent_dim=64):
    # Encoder
    encoder_inputs = Input(shape=(max_src_len,), name='encoder_inputs')
    enc_emb = Embedding(src_vocab_size, embedding_dim, mask_zero=True)(encoder_inputs)
    if cell_type == 'rnn':
        _, state_h = SimpleRNN(latent_dim, return_state=True)(enc_emb)
        encoder_states = [state_h]
    elif cell_type == 'lstm':
        _, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
        encoder_states = [state_h, state_c]
    elif cell_type == 'gru':
        _, state_h = GRU(latent_dim, return_state=True)(enc_emb)
        encoder_states = [state_h]
    else:
        raise ValueError('cell_type must be rnn/lstm/gru')

    # Decoder (training)
    decoder_inputs = Input(shape=(max_tgt_len-1,), name='decoder_inputs')
    dec_emb = Embedding(tgt_vocab_size, embedding_dim, mask_zero=True)(decoder_inputs)
    if cell_type == 'rnn':
        dec_outputs, _ = SimpleRNN(latent_dim, return_sequences=True, return_state=True)(dec_emb, initial_state=encoder_states[0])
    elif cell_type == 'lstm':
        dec_outputs, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(dec_emb, initial_state=encoder_states)
    elif cell_type == 'gru':
        dec_outputs, _ = GRU(latent_dim, return_sequences=True, return_state=True)(dec_emb, initial_state=encoder_states[0])

    decoder_dense = TimeDistributed(Dense(tgt_vocab_size, activation='softmax'))
    decoder_outputs = decoder_dense(dec_outputs)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model


### Train and inference helpers (for small demo)

In [5]:

def train_seq2seq(model, epochs=50):
    history = model.fit([encoder_input, decoder_input], decoder_target[..., None], epochs=epochs, batch_size=16, verbose=0)
    return history

# Placeholder for separate encoder/decoder inference models.
# A full implementation would build an encoder model (input -> states) and a step-wise decoder model
# that share weights with the training model. This lightweight demo skips that and returns the
# training model itself; greedy decoding then calls it on the partially filled decoder input.
def make_inference_models_trivial(training_model, cell_type='rnn', embedding_dim=64, latent_dim=64):
    # cell_type, embedding_dim and latent_dim are unused; they are kept so the signature mirrors build_seq2seq.
    return training_model, None


## 4) Train & Demo Seq2Seq Models (RNN / LSTM / GRU)
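We'll train each model for 100 epochs on the 10 pairs and decode greedily by calling the training model directly (lightweight): at each step the whole decoder input is fed again, and the argmax at the last filled position becomes the next token.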
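
The greedy loop itself is independent of the model. The sketch below shows it in isolation, with a hypothetical lookup table (`toy_table`) standing in for the trained network; in the notebook the next token instead comes from `np.argmax` over the output of `model.predict`.

```python
def greedy_decode(next_token_fn, start_token="<sos>", end_token="<eos>", max_len=5):
    """Repeatedly append the single most likely next token until <eos> or max_len."""
    tokens = [start_token]
    for _ in range(max_len):
        next_token = next_token_fn(tokens)  # the model's top choice given the prefix
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


# Hypothetical stand-in for a trained model: maps the last token to the next one.
toy_table = {"<sos>": "buenos", "buenos": "días", "días": "<eos>"}


def toy_next_token(prefix):
    return toy_table[prefix[-1]]


print(" ".join(greedy_decode(toy_next_token)))  # buenos días
```

The `max_len` cap matters because a model that never emits `<eos>` would otherwise loop forever; the notebook enforces the same limit through `max_tgt_len-1`.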

In [6]:

def greedy_decode_from_training_model(model, src_text):
    # Prepare encoder input
    seq = src_tok.texts_to_sequences([src_text.lower()])
    seq = pad_sequences(seq, maxlen=max_src_len, padding='post')
    # Start decoder input as <sos>
    dec_input = np.zeros((1, max_tgt_len-1), dtype='int32')
    dec_input[0,0] = tgt_tok.word_index[start_token]
    for t in range(1, max_tgt_len-1):
        preds = model.predict([seq, dec_input], verbose=0)
        next_id = np.argmax(preds[0, t-1, :])
        dec_input[0, t] = next_id
        if tgt_tok.index_word.get(next_id) == end_token:
            break
    # Convert to words
    out = [tgt_tok.index_word.get(i, '') for i in dec_input[0] if i!=0 and i!=tgt_tok.word_index[start_token]]
    return ' '.join([w for w in out if w != end_token and w != ''])

results = {}
for cell in ['rnn','lstm','gru']:
    print('\nTraining', cell.upper())
    m = build_seq2seq(cell_type=cell, embedding_dim=64, latent_dim=64)
    train_seq2seq(m, epochs=100)  # light training
    # Demo on sample sentences
    res = []
    for en, es in pairs[:5]:
        pred = greedy_decode_from_training_model(m, en)
        res.append((en, es, pred))
    results[cell] = res
    for en, es, pred in res:
        print(f"EN: {en} | GOLD: {es} | PRED: {pred}")



Training RNN
EN: what is your name | GOLD: ¿cómo te llamas? | PRED: ¿cómo te llamas?
EN: i love you | GOLD: te quiero | PRED: te quiero
EN: nice to meet you | GOLD: mucho gusto | PRED: mucho gusto
EN: hello | GOLD: hola | PRED: hola
EN: i need help | GOLD: necesito ayuda | PRED: necesito ayuda

Training LSTM
EN: what is your name | GOLD: ¿cómo te llamas? | PRED: ¿cómo te llamas?
EN: i love you | GOLD: te quiero | PRED: te quiero
EN: nice to meet you | GOLD: mucho gusto | PRED: mucho gusto
EN: hello | GOLD: hola | PRED: 
EN: i need help | GOLD: necesito ayuda | PRED: necesito ayuda

Training GRU
EN: what is your name | GOLD: ¿cómo te llamas? | PRED: ¿cómo te llamas?
EN: i love you | GOLD: te quiero | PRED: te quiero
EN: nice to meet you | GOLD: mucho gusto | PRED: mucho gusto
EN: hello | GOLD: hola | PRED: hola
EN: i need help | GOLD: necesito ayuda | PRED: necesito ayuda


## 5) Small Educational Transformer (Encoder-Decoder)
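
The `MultiHeadAttention` layers used in this section are built on scaled dot-product attention. As a rough illustration, here is a single-head NumPy sketch that omits the learned projections, multiple heads, and masking of the Keras layer. The function name and the toy shapes are assumptions made for this example.

```python
import numpy as np


def scaled_dot_product_attention(query, key, value):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)                 # (n_query, n_key)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1
    return weights @ value, weights


rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(5, 8))  # 5 target positions, dimension 8
encoder_states = rng.normal(size=(4, 8))  # 4 source positions, dimension 8

# Cross-attention: queries come from the decoder, keys and values from the encoder.
context, weights = scaled_dot_product_attention(decoder_states, encoder_states, encoder_states)
print(context.shape, weights.shape)  # (5, 8) (5, 4)
```

Each row of `weights` says how much one target position draws on each source position, which is the role the cross-attention layer plays in the decoder below.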

In [7]:

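# Build a tiny transformer-like encoder-decoder for demo purposes.
# Simplifications: one block each, no positional encoding, no decoder self-attention, no padding masks.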
def build_tiny_transformer(src_vocab, tgt_vocab, embedding_dim=64, num_heads=2, ff_dim=128):
    # Encoder
    enc_in = Input(shape=(max_src_len,), name='enc_in')
    enc_emb = Embedding(src_vocab, embedding_dim)(enc_in)
    enc_att = MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(enc_emb, enc_emb)
    enc_out = LayerNormalization(epsilon=1e-6)(enc_att + enc_emb)
    ff = Dense(ff_dim, activation='relu')(enc_out)
    ff = Dense(embedding_dim)(ff)
    enc_outputs = LayerNormalization(epsilon=1e-6)(ff + enc_out)

    # Decoder (cross-attention)
    dec_in = Input(shape=(max_tgt_len-1,), name='dec_in')
    dec_emb = Embedding(tgt_vocab, embedding_dim)(dec_in)
    cross_att = MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(dec_emb, enc_outputs)
    dec_out = LayerNormalization(epsilon=1e-6)(cross_att + dec_emb)
    ff2 = Dense(ff_dim, activation='relu')(dec_out)
    ff2 = Dense(embedding_dim)(ff2)
    dec_out2 = LayerNormalization(epsilon=1e-6)(ff2 + dec_out)
    outputs = TimeDistributed(Dense(tgt_vocab, activation='softmax'))(dec_out2)

    model = Model([enc_in, dec_in], outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

trans_model = build_tiny_transformer(src_vocab_size, tgt_vocab_size, embedding_dim=64)
trans_model.fit([encoder_input, decoder_input], decoder_target[..., None], epochs=100, batch_size=16, verbose=0)

# Greedy decode similar to seq2seq approach
for en, es in pairs[:5]:
    seq = src_tok.texts_to_sequences([en.lower()])
    seq = pad_sequences(seq, maxlen=max_src_len, padding='post')
    dec_in = np.zeros((1, max_tgt_len-1), dtype='int32')
    dec_in[0,0] = tgt_tok.word_index[start_token]
    for t in range(1, max_tgt_len-1):
        preds = trans_model.predict([seq, dec_in], verbose=0)
        next_id = np.argmax(preds[0, t-1, :])
        dec_in[0, t] = next_id
        if tgt_tok.index_word.get(next_id) == end_token:
            break
    out = [tgt_tok.index_word.get(i, '') for i in dec_in[0] if i!=0 and i!=tgt_tok.word_index[start_token]]
    print(f"EN: {en} | GOLD: {es} | PRED: {' '.join(out)}")


EN: what is your name | GOLD: ¿cómo te llamas? | PRED: ¿cómo te llamas? <eos>
EN: i love you | GOLD: te quiero | PRED: te quiero <eos>
EN: nice to meet you | GOLD: mucho gusto | PRED: mucho gusto <eos>
EN: hello | GOLD: hola | PRED: hola <eos>
EN: i need help | GOLD: necesito ayuda | PRED: necesito ayuda <eos>


## 6) Pretrained Hugging Face Translation (Helsinki-NLP)
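The pretrained model translates without any training in this notebook. Its outputs are fluent but do not always match the gold references word for word (for example, it renders "what is your name" as "¿Cuál es tu nombre?"), and it adds capitalization and final punctuation.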

In [8]:

# Note: downloading the model may take time. This is inference-only and will show high-quality translations.
model_name = "Helsinki-NLP/opus-mt-en-es"
hf_tokenizer = MarianTokenizer.from_pretrained(model_name)
hf_model = MarianMTModel.from_pretrained(model_name)
translator = pipeline("translation", model=hf_model, tokenizer=hf_tokenizer)

for en, es in pairs[:5]:
    out = translator(en, max_length=60)
    print(f"EN: {en}  |  GOLD: {es}  |  HF PRED: {out[0]['translation_text']}")


Device set to use cpu


EN: what is your name  |  GOLD: ¿cómo te llamas?  |  HF PRED: ¿Cuál es tu nombre?
EN: i love you  |  GOLD: te quiero  |  HF PRED: Te quiero.
EN: nice to meet you  |  GOLD: mucho gusto  |  HF PRED: Encantado de conocerte.
EN: hello  |  GOLD: hola  |  HF PRED: Hola.
EN: i need help  |  GOLD: necesito ayuda  |  HF PRED: Necesito ayuda.
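
Because the pretrained model capitalizes and punctuates its output, a raw string comparison against the lowercase gold references would undercount matches. The sketch below normalizes both sides before comparing, using the five pairs printed above. The `normalize` helper is hypothetical, and exact match is a crude measure: "¿Cuál es tu nombre?" is a valid translation even though it does not match the reference.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII and Spanish inverted punctuation, collapse spaces."""
    table = str.maketrans("", "", string.punctuation + "¿¡")
    return " ".join(text.lower().translate(table).split())


# (gold, pretrained prediction) pairs copied from the output above.
gold_and_pred = [
    ("¿cómo te llamas?", "¿Cuál es tu nombre?"),
    ("te quiero", "Te quiero."),
    ("mucho gusto", "Encantado de conocerte."),
    ("hola", "Hola."),
    ("necesito ayuda", "Necesito ayuda."),
]

matches = [normalize(gold) == normalize(pred) for gold, pred in gold_and_pred]
print(f"normalized exact match: {sum(matches)}/{len(matches)}")  # 3/5
```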


## Notes
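- This notebook uses a tiny dataset (10 sentence pairs) and short training runs (100 epochs) so that training is fast for demonstration.
- The seq2seq RNN/LSTM/GRU models and the tiny Transformer are educational. They are evaluated here on sentences they were trained on, so their near-perfect outputs reflect memorization of the 10 pairs, and they will not match the quality of the pretrained model on unseen text.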
