# In this notebook, I have trained a full Transformer model for language translation (English to French).

In [None]:
# ------------------------------
# Required Dependencies
# ------------------------------
!python -m spacy download fr_core_news_sm
!python -m spacy download en_core_web_sm

In [1]:
# ---------------------------------
# Required Imports
# ----------------------------------
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Add, Dropout, MultiHeadAttention, Lambda, LayerNormalization, Embedding, Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
from datasets import load_dataset
import spacy
import json
import sentencepiece as sp
import pandas as pd



In [6]:
data = load_dataset("opus_books", "en-fr")
df = pd.DataFrame(data['train'])
df['english'] = df['translation'].apply(lambda x: x['en'])
df['french'] = df['translation'].apply(lambda x: x['fr'])

df2 = df[['english', 'french']]
df3 = df2.head(30000)

Generating train split:   0%|          | 0/127085 [00:00<?, ? examples/s]

In [None]:
nlp_fr = spacy.load('fr_core_news_sm')
nlp_eng = spacy.load('en_core_web_sm')

def smart_case(sentences, nlp, batch_size = 1000):
    # Lowercase every token except named entities, proper nouns and all-caps words.
    processed = []
    for doc in tqdm(nlp.pipe(sentences, batch_size = batch_size, n_process = -1 ), total = len(sentences)):
        tokens = []
        for token in doc:
            if token.ent_type_ or token.pos_ in ['PROPN'] or token.text.isupper():
                tokens.append(token.text)
            else:
                tokens.append(token.text.lower())
        processed.append(" ".join(tokens))

    return processed

english_sentences = smart_case(df['english'], nlp_eng)
french_sentences = smart_case(df['french'], nlp_fr)

In [2]:
# with open("english_cased.json", "w", encoding = 'utf-8') as f:
#     json.dump(english_sentences, f, ensure_ascii = False, indent = 2)

# with open("french_cased.json", "w", encoding = 'utf-8') as f:
#     json.dump(french_sentences, f, ensure_ascii = False, indent = 2)


with open("/kaggle/input/mydataset/english_cased.json", "r", encoding = 'utf-8') as f:
    english_sentences = json.load(f)

with open("/kaggle/input/mydataset/french_cased.json", "r", encoding = 'utf-8') as f:
    french_sentences = json.load(f)


In [8]:
print(f"French Sentences After Casing: {french_sentences[:10]}\n")
print(f"English Sentences After Casing: {english_sentences[:10]}")

French Sentences After Casing: ['le grand Meaulnes', 'Alain - Fournier', 'PREMIÈRE PARTIE', 'CHAPITRE PREMIER', 'LE PENSIONNAIRE', 'il arriva chez nous un dimanche de novembre 189- …', 'je continue à dire « chez nous » , bien que la maison ne nous appartienne plus .', 'nous avons quitté le pays depuis bientôt quinze ans et nous n’ y reviendrons certainement jamais .', 'nous habitions les bâtiments du Cours Supérieur de Sainte-Agathe .', 'mon père , que j’ appelais M. Seurel , comme les autres élèves , y dirigeait à la fois le Cours Supérieur , où l’ on préparait le brevet d’ instituteur , et le Cours Moyen .']

English Sentences After Casing: ['the wanderer', 'Alain - Fournier', 'First Part', 'I', 'THE BOARDER', 'he arrived at our home on a Sunday of November , 189- .', "I still say ' our home , ' although the house no longer belongs to us .", 'we left that part of the country nearly fifteen years ago and shall certainly never go back to it .', "we were living in the building of the Hi

In [None]:
french_input_sentences = ["<start> " + s for s in french_sentences] # Keep a space between the marker and the sentence so that '<start>' and '<end>' stay separate tokens.
french_output_sentences = [s + " <end>" for s in french_sentences]


# Saving the Eng and French sentences in a txt file.
with open("eng_sentences.txt", "w", encoding = 'utf-8') as f:
    for s in english_sentences:
        f.write( s + "\n")

with open("french_sentences.txt", "w", encoding = 'utf-8') as f:
    for s in french_sentences:
        f.write(s + "\n")

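# Passing the text files into SentencePiece as input. This will give French and English SentencePiece models.
sp.SentencePieceTrainer.Train(input = "eng_sentences.txt", model_type = "bpe", vocab_size = 2500, model_prefix = "sp_eng") # Rule of thumb: larger corpora (about 120k-500k sentences) usually use a vocab_size of about 8k-12k.
sp.SentencePieceTrainer.Train(input = "french_sentences.txt", model_type = "bpe", vocab_size = 2500, model_prefix = "sp_fr", user_defined_symbols = ["<start>", "<end>"]) # Registering <start> and <end> keeps each of them as a single token. Without this they are split into sub-pieces and looking up their ids returns the <unk> id. Note: id 0 is <unk> by default, and the same id is used for padding later.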

# Loading English and French Sentence-Processor Models.
eng_sp = sp.SentencePieceProcessor()
eng_sp.Load("sp_eng.model")
fr_sp = sp.SentencePieceProcessor()
fr_sp.Load("sp_fr.model")

# Capturing the encoded tokens from the Eng and French Processor models.
fr_in_seq = [fr_sp.EncodeAsIds(s) for s in french_input_sentences] # EncodeAsIds takes one sentence string and returns its list of token ids.
fr_out_seq = [fr_sp.EncodeAsIds(s) for s in french_output_sentences]
eng_seq = [eng_sp.EncodeAsIds(s) for s in english_sentences]




In [10]:
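# Note: len(s) counts the characters of the raw string, not its tokens, so this percentile is a generous limit for the token sequences. Use [len(ids) for ids in fr_in_seq] for a tighter one.
lengths = [len(s) for s in french_input_sentences]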
fr_max_len = int(np.percentile(lengths, 90))
print(fr_max_len)

lengths1 = [len(s) for s in english_sentences]
eng_max_len = int(np.percentile(lengths1, 90))
print(eng_max_len)


fr_in_seq = pad_sequences(fr_in_seq, maxlen = fr_max_len, padding = 'post')
fr_out_seq = pad_sequences(fr_out_seq, maxlen = fr_max_len, padding = 'post')
eng_seq = pad_sequences(eng_seq, maxlen = eng_max_len, padding = 'post')

eng_max_len = eng_seq.shape[1] # For giving it to the Input() layer.
fr_max_len = fr_in_seq.shape[1]

254
238


In [11]:
french_input_sentences[:5]

['<start> le grand Meaulnes',
 '<start> Alain - Fournier',
 '<start> PREMIÈRE PARTIE',
 '<start> CHAPITRE PREMIER',
 '<start> LE PENSIONNAIRE']

In [None]:
# -------------------------
# Parameter Values
# -------------------------
eng_vocab = eng_sp.GetPieceSize()
fr_vocab = fr_sp.GetPieceSize()
embed_dim = 128
num_heads = 8
ffn_dim = 512 # The 'Attention Is All You Need' paper uses ffn_dim = 4 * embed_dim.
num_encoder_layers = 4
num_decoder_layers = 4
drop_rate = 0.1 # Dropout Rate


# ------------------
# Function to create encoder block
# ------------------
def transformer_encoder_block(x, mask):
    # Self Attention(MultiHead):
    attn = MultiHeadAttention(num_heads = num_heads, key_dim = embed_dim//num_heads)(x, x, attention_mask = mask) # attention_mask stops the layer from attending to the 0 (padding) positions of the padded sequence. The earlier LSTM-based self-attention code did not pass it, because there the mask from mask_zero = True in Embedding was carried through the LSTM to the MHA layer.
    # key_dim = embed_dim//num_heads is the usual choice in transformers, so that the heads together span embed_dim.
    attn = Dropout(drop_rate)(attn)
    x = Add()([x, attn]) # Residual connection: keeps the original input of the block alongside the self-attention context. It was not applied in the simple LSTM self-attention code because the LSTM already captures these connections well. Add() performs element-wise addition.
    x = LayerNormalization(epsilon = 1e-6)(x) # epsilon is a small constant (0.000001) that is added to the variance in the denominator to avoid division by zero.

    # Feed-Forward: a small MLP (Multi-Layer Perceptron) added to capture non-linear patterns (see the notes below).
    ff = Dense(ffn_dim, activation = 'relu')(x) # ReLU adds the non-linearity
    ff = Dense(embed_dim)(ff)
    ff = Dropout(drop_rate)(ff)
    x = Add()([x, ff]) # Residual connection again (see jalammar's blog)
    x = LayerNormalization(epsilon=1e-6)(x) # Normalizing again
    
    return x


def transformer_decoder_block(x, enc_outputs, look_ahead_mask, padding_mask):
    # Masked Self Attention (causal)
    attn1 = MultiHeadAttention(num_heads = num_heads, key_dim = embed_dim//num_heads)(x, x, attention_mask = look_ahead_mask)
    attn1 = Dropout(drop_rate)(attn1)
    x = Add()([x, attn1])
    x = LayerNormalization(epsilon = 1e-6)(x) # This output is used as the query in the cross attention below.


    # Cross Attention (decoder's query, encoder's key and value):
    # Notes: the query comes from the decoder (the output of the masked self-attention above), while the keys and values both come from the encoder outputs. Each query is compared with every key to get attention weights, and those weights are used to take a weighted sum of the values. This is how each target position pulls in the relevant parts of the source sentence.
    attn2 = MultiHeadAttention(num_heads = num_heads, key_dim = embed_dim//num_heads)(x, enc_outputs, attention_mask = padding_mask) # Called as (query, value); the key defaults to the value.
    attn2 = Dropout(drop_rate)(attn2)
    x = Add()([x, attn2]) # Residual connection: adds the encoder context from cross attention to the decoder state produced by the masked self-attention.
    x = LayerNormalization(epsilon = 1e-6)(x)

    ff = Dense(ffn_dim, activation = 'relu')(x)
    ff = Dense(embed_dim)(ff)
    ff = Dropout(drop_rate)(ff)
    x = Add()([x, ff])
    x = LayerNormalization(epsilon = 1e-6)(x)

    return x


# -----------------------------
# Inputs and Embeddings
# -----------------------------
enc_inputs = Input(shape = (eng_max_len, ))
dec_inputs = Input(shape = (fr_max_len, ))

enc_tok_emb = Embedding(eng_vocab, embed_dim, mask_zero = True, name = 'enc_tok_emb')(enc_inputs)
dec_tok_emb = Embedding(fr_vocab, embed_dim, mask_zero = True, name = 'dec_tok_emb')(dec_inputs)


# Positional Embeddings (a learned alternative to the fixed positional encoding)
pos_indices_enc = tf.range(start = 0, limit = eng_max_len, delta = 1) # Position index of every token slot in an English input sequence. If eng_max_len = 5, pos_indices_enc = [0, 1, 2, 3, 4]. delta is the step size and limit is the exclusive upper bound, so the last index is eng_max_len - 1.
pos_embed_layer_enc = Embedding(eng_max_len, embed_dim, name = 'pos_emb_enc')
enc_pos_emb = pos_embed_layer_enc(pos_indices_enc)
enc_pos_emb = tf.expand_dims(enc_pos_emb, axis = 0) # Adds a new axis of size 1 at index 0, giving a 3D shape: (1, eng_max_len, embed_dim). In general expand_dims only inserts a size-1 axis and leaves the data unchanged, e.g. shape (2, 3), arr: [[1 2 3][4 5 6]]; tf.expand_dims(arr, -1) gives shape (2, 3, 1): [[[1][2][3]] [[4][5][6]]].
enc_x = enc_tok_emb + enc_pos_emb # After adding them, the model sees both the content and the position of each token. The (1, ...) positional tensor broadcasts over the batch.

pos_indices_dec = tf.range(start = 0, limit = fr_max_len, delta = 1)
pos_embed_layer_dec = Embedding(fr_max_len, embed_dim, name = 'pos_emb_dec')
dec_pos_emb = pos_embed_layer_dec(pos_indices_dec)
dec_pos_emb = tf.expand_dims(dec_pos_emb, axis = 0)
dec_x = dec_tok_emb + dec_pos_emb


#----------------
# Mask
#----------------
# Padding Mask: create_padding_mask returns shape (batch_size, 1, 1, seq_len), which broadcasts against the attention scores of shape (batch_size, num_heads, query_len, key_len):
#   batch_size -> matches the batch
#   num_heads  -> all attention heads use the same mask
#   query_len  -> every query position uses the same mask
#   key_len    -> one entry per key, marking which keys may be attended to
# By reshaping (batch, seq_len) -> (batch, 1, 1, seq_len), we are telling TensorFlow: "For every attention head and every query position, mask the same key positions."
# The same function serves the encoder self-attention and the decoder cross-attention. Only the decoder self-attention mask, built further below, is squeezed down to 3D.
def create_padding_mask(seq):
    # seq shape: (batch, seq_len); mask 1 for valid tokens, 0 for padding
    mask = tf.cast(tf.not_equal(seq, 0), tf.float32) # First gives booleans, then casts True to 1.0 and False to 0.0.
    # expand to (batch, 1, 1, seq_len) to be compatible with attention_mask API
    return mask[:, tf.newaxis, tf.newaxis, :] # tf.newaxis inserts size-1 axes, just like tf.expand_dims.

enc_padding_mask = Lambda(create_padding_mask)(enc_inputs) # Lambda = lightweight wrapper that lets custom TensorFlow logic (like 'create_padding_mask') become part of the Keras model graph without writing a custom layer class. This output masks the keys/values in encoder self-attention.
dec_padding_mask = Lambda(create_padding_mask)(enc_inputs) # Built from enc_inputs on purpose: it is used when the decoder attends to (sees) the encoder outputs, so the keys are encoder tokens.


# Look ahead (causal) mask for decoder self attention: 'causal' means not looking at future words.
def create_look_ahead_mask(seq_len):
    # Lower-triangular matrix: allowed positions are 1, disallowed (future) positions are 0.
    causal = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0) # tf.ones creates a matrix of ones with shape (seq_len, seq_len). band_part(..., -1, 0) keeps the diagonal and everything below it and zeroes everything above it, so row i has 1s for positions 0..i.
    return causal
look_ahead = create_look_ahead_mask(fr_max_len) # shape (fr_max_len, fr_max_len). It is expanded later to (1, 1, dec_seq_len, dec_seq_len) so that it can broadcast with the per-batch padding mask.


# The decoder self-attention mask is built inside the model graph with a Lambda layer (see the decoder stack).

# The helper below combines the look-ahead mask with the padding mask of the decoder input.
# Important: a position is allowed only if it is both not in the future and not a padding (0) token.
def build_decoder_self_attention_mask(dec_inputs):
    # dec_inputs: (batch_size, dec_seq_len)
    dec_pad_mask = tf.cast(tf.not_equal(dec_inputs, 0), tf.float32) # (batch, dec_seq_len). Same convention as the causal mask: 0 = masked (padding), 1 = allowed. Both masks must use the same convention for the product below to be correct.
    dec_pad_mask = dec_pad_mask[:, tf.newaxis, tf.newaxis, :] # (batch,1,1,dec_seq_len). The same padding mask shape as above.
    causal = look_ahead[tf.newaxis, tf.newaxis, :, :] # (1, 1, dec_seq_len, dec_seq_len). The expansion mentioned earlier.
    combined = causal * dec_pad_mask # shape: (batch, 1, dec_seq_len, dec_seq_len). This is element-wise multiplication with broadcasting, not a matrix (dot) product, so the order of the operands does not matter. An entry stays 1 only where both masks are 1.
    combined = tf.squeeze(combined, axis = 1) # shape (batch, dec_seq_len, dec_seq_len). In decoder self-attention target_len = source_len = dec_seq_len, and the MultiHeadAttention (MHA) layer documents its mask shape as (batch, target_len, source_len). Squeezing axis 1 gives exactly that shape.
    return combined


# --------------------
# Encoder Stack
# --------------------
enc_out = enc_x
for _ in range(num_encoder_layers): # Stack the 4 encoder blocks; the output of the last one is the final enc_out.
    # As noted above, the MultiHeadAttention layer documents attention_mask as (batch, target_len, source_len).
    # Important: for encoder self-attention, target_len = source_len = eng_max_len (enc_seq_len).
    enc_out = transformer_encoder_block(enc_out, enc_padding_mask)


# -------------------
# Decoder Stack
# -------------------
dec_out = dec_x
for _ in range(num_decoder_layers):
    # Create the mask inside the training graph with a Lambda layer so that Keras can trace it. This layer masks all the decoder inputs at once.
    dec_self_mask = tf.keras.layers.Lambda(lambda d: build_decoder_self_attention_mask(d))(dec_inputs)
    dec_out = transformer_decoder_block(dec_out, enc_out, look_ahead_mask = dec_self_mask, padding_mask = dec_padding_mask)


# -------------------------
# Final Linear + Softmax
# -------------------------
logits = Dense(fr_vocab, name = 'logits')(dec_out) # Dense and softmax are kept as separate layers so that the softmax can be forced to float32. Under a mixed-precision (float16) policy, a float16 softmax over a large vocabulary can be numerically unstable. 'Logits' are the raw, unnormalized scores produced by this Dense layer before the softmax.
outputs = tf.keras.layers.Activation('softmax', dtype = 'float32', name = 'softmax')(logits) # Separate Activation() layer that applies softmax in float32 to the raw logits.


# ------------------------
# Full Model Training
# ------------------------
model = Model([enc_inputs, dec_inputs], outputs, name = 'transformer_seq2seq_model')
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-4), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy']) # This LR is stable for small-to-medium data.
model.summary()

model.fit([eng_seq, fr_in_seq], np.expand_dims(fr_out_seq, -1), epochs = 100, validation_split = 0.15)


# ---------------------------
# Inference (greedy decoding)
# ---------------------------
# For inference we'll run encoder to get enc_out, then step decoder token-by-token.

# Encoder Inference (returns encoder outputs)
enc_inf_model = Model(enc_inputs, [enc_out, enc_padding_mask], name = 'encoder_inference') # The padding mask is returned as a second output so that, at inference time, it is computed from the sentence being translated. enc_padding_mask is a symbolic tensor in the graph, not a stored value: it is recomputed from whatever is fed to enc_inputs, i.e. the training batches in 'model' and the inference sentence in 'enc_inf_model'.

enc_out_input = Input(shape = (eng_max_len, embed_dim), name = 'enc_out_input')


# Simpler practical approach: reuse parts by building a small decoder step model that expects full decoder prefix.
# We'll feed the entire current decoded sequence (prefix) to decoder and read only last token logits.

# So decoder_inference will accept:
#   decoder_prefix: the tokens decoded so far, padded with zeros to shape (batch, fr_max_len)
dec_prefix_input = Input(shape = (fr_max_len, ), name = 'dec_prefix_input')

dec_tok_emb_inf = model.get_layer('dec_tok_emb')(dec_prefix_input) # Using the same embedding layer.
dec_pos_emb_inf = pos_embed_layer_dec(tf.range(start = 0, limit = fr_max_len, delta = 1)) # Reusing the same positional embedding layer. It returns one embed_dim vector per position from 0 to fr_max_len - 1, so the shape is (fr_max_len, embed_dim).
dec_pos_emb_inf = tf.expand_dims(dec_pos_emb_inf, axis = 0) # (1, fr_max_len, embed_dim); broadcasts over the batch when added below.
dec_x_inf = dec_tok_emb_inf + dec_pos_emb_inf

enc_padding_mask_input = Input(shape = (1,1, eng_max_len), name = 'enc_padding_mask_input') # Input placeholder for the 'enc_padding_mask' that enc_inf_model outputs. It is fed into dec_inf_model.

# Running decoder stack with encoder outputs as enc_out_input
x = dec_x_inf
for _ in range(num_decoder_layers):
    # Note: this calls the same transformer_decoder_block function, supplying enc_out_input and the mask.
    # Caution: transformer_decoder_block creates new layers on every call, so the blocks built here do not share the trained weights of 'model'. A simple alternative is to call the trained 'model' directly on [seq, prefix] inside the decoding loop.
    dec_self_mask_inf = tf.keras.layers.Lambda(lambda d: build_decoder_self_attention_mask(d))(dec_prefix_input) # Same Lambda wrapper as in the training graph.
    x = transformer_decoder_block(x, enc_out_input, look_ahead_mask = dec_self_mask_inf, padding_mask = enc_padding_mask_input) # Decoding happens token by token, so the encoder outputs are passed in as an input.

logits_inf = model.get_layer('logits')(x) # We use the same Dense layer
probs_inf = tf.keras.layers.Activation('softmax', dtype = 'float32')(logits_inf)

dec_inf_model = Model([dec_prefix_input, enc_out_input, enc_padding_mask_input], probs_inf, name = 'dec_inf_model')


# ----------------------------------------------------------------
# Greedy decode function (uses enc_inf_model and dec_inf_model)
# ----------------------------------------------------------------
def greedy_decode(input_sequence, fr_sp = fr_sp, eng_sp = eng_sp, max_len = fr_max_len):
    sentence = ' '.join(smart_case([input_sequence], nlp_eng))
    seq = eng_sp.EncodeAsIds(sentence)
    seq = pad_sequences([seq], padding = 'post', maxlen = eng_max_len)

    enc_out_seq, enc_padding_mask_seq = enc_inf_model(seq) # 'enc_padding_mask_seq' is the output of the padding mask as per the 'seq'.

    end_id = fr_sp.PieceToId('<end>')
    start_token = fr_sp.PieceToId('<start>')
    decoded = [start_token] # This is a Python list of token IDs predicted so far for the target sentence.
    
    for i in range(max_len):
        prefix = np.array(decoded + [0]*(fr_max_len - len(decoded)))[None, :] # The decoder graph expects a fixed input shape (batch_size, fr_max_len), so the tokens decoded so far are right-padded with zeros. Example: fr_max_len = 6 and decoded = [2, 5] -> [2, 5, 0, 0, 0, 0]. [None, :] adds the batch dimension -> shape (1, fr_max_len).
        probs = dec_inf_model.predict([prefix, enc_out_seq, enc_padding_mask_seq], verbose = 0) # shape (batch_size, fr_max_len, vocab_size). 'verbose = 0' hides the progress bar.
        idx = len(decoded) - 1 # Important: position of the last decoded token; the output at this position is the prediction for the next token.
        next_id = int(np.argmax(probs[0, idx, :])) # Important: np.argmax() returns the index of the maximum value, not the value itself, so this is the token id with the highest probability. It is the current prediction and, once appended to the prefix, part of the next decoder input. In the RNN-based attention model we always read the last timestep; here we read the row at the position of the last decoded token.
        if next_id == end_id:
            break

        decoded.append(next_id)

    decoded_sentence = fr_sp.DecodeIds(decoded[1:])
    return decoded_sentence
        

In [None]:
test = ['I love you.', 'Go to kitchen.']
for s in test:
    print(s, '->', greedy_decode(s))
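
The greedy loop itself does not depend on Keras, so it can be checked in isolation. The following is a minimal sketch under stated assumptions: `greedy_decode_ids` and `toy_model` are hypothetical names that do not exist in this notebook, and the toy model simply returns a probability list in place of the decoder's softmax output.

```python
def greedy_decode_ids(next_token_probs, start_id, end_id, max_len):
    """Append the most probable next token until end_id or max_len is reached."""
    decoded = [start_id]
    for _ in range(max_len - 1):
        probs = next_token_probs(decoded)  # probabilities over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == end_id:
            break
        decoded.append(next_id)
    return decoded[1:]  # drop the start token


# Hypothetical stand-in for the decoder: always predicts "previous id + 1"
# over a vocabulary of 6 ids.
def toy_model(prefix):
    probs = [0.0] * 6
    probs[min(prefix[-1] + 1, 5)] = 1.0
    return probs


result = greedy_decode_ids(toy_model, start_id=1, end_id=4, max_len=10)
capped = greedy_decode_ids(toy_model, start_id=1, end_id=99, max_len=3)
print(result, capped)
```

With the real model, `next_token_probs` would pad the prefix, run the decoder, and return the probability row at the position of the last decoded token.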

# Below are the doubts (questions) I had. ChatGPT resolved all of them.

## But we calculate residuals by subtracting ŷ from y, which is subtraction instead of addition?

1️⃣ Residuals in statistics / regression

Residual = observed − predicted

Example: residual = y - ŷ

Purpose: measure error between the true value and the model’s prediction.

Operation: subtraction.

2️⃣ Residual connections in neural networks (Transformers, ResNets, etc.)

Residual connection = output of a layer + input of that layer

Example: x = LayerNorm(x + Sublayer(x))

Purpose: help gradients flow, preserve original features, stabilize training.

Operation: addition (element-wise).
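
A minimal NumPy sketch of the second meaning, assuming a simplified layer norm without the learnable scale and shift that Keras' `LayerNormalization` has. The names `layer_norm` and `residual_block` and the toy sublayer are illustrative and not part of the notebook.

```python
import numpy as np


def layer_norm(x, epsilon=1e-6):
    """Normalize the last axis to zero mean and (almost) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)  # epsilon is added, not multiplied


def residual_block(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))


x = np.array([[1.0, 2.0, 3.0, 4.0]])
# Hypothetical sublayer standing in for attention or the feed-forward network.
out = residual_block(x, lambda t: 0.5 * t)
print(out)
```

The addition keeps the input `x` in the result, which is what lets gradients flow around the sublayer.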

<br/>

## Why is ff used here?
ChatGPT said:

Ah, good question. In the Transformer architecture, ff stands for Feed-Forward network, and it’s a crucial component. Here’s the reasoning in simple, real-world terms:

1️⃣ Where it sits

Each encoder or decoder block has:

Multi-Head Attention → captures relationships between words (self-attention or cross-attention).

Feed-Forward (ff) → processes each position individually to enrich representation.

So after attention tells “which words are relevant to each other,” the feed-forward network says:

“Now that I know what to focus on, let’s transform this information to a richer, non-linear representation.”

2️⃣ Structure

In the code:

ff = Dense(ffn_dim, activation='relu')(x)
ff = Dense(embed_dim)(ff)


First layer (ffn_dim, large) → expands dimensionality and applies non-linearity (ReLU).

Second layer → projects back to embed_dim (model dimension).

Added with residual → preserves original info + enriched info.

3️⃣ Why needed

Attention alone is mostly linear: its output is just a weighted average of the value vectors, so it can't model complex interactions per position.

Feed-forward adds non-linear capacity — like a mini MLP per word.

Position-wise — same FF network is applied independently to each position, so computation is parallelizable.
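
The "position-wise" property can be checked with a small NumPy sketch. It assumes random toy weights (`W1`, `b1`, `W2`, `b2`) in place of the trained `Dense` layers, and the sizes are illustrative rather than the notebook's.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, ffn_dim, seq_len = 8, 32, 5  # toy sizes with ffn_dim = 4 * embed_dim

# Hypothetical weights standing in for the two Dense layers.
W1, b1 = rng.normal(size=(embed_dim, ffn_dim)), np.zeros(ffn_dim)
W2, b2 = rng.normal(size=(ffn_dim, embed_dim)), np.zeros(embed_dim)


def feed_forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # expand to ffn_dim and apply ReLU
    return hidden @ W2 + b2                # project back to embed_dim


x = rng.normal(size=(seq_len, embed_dim))
all_at_once = feed_forward(x)                            # whole sequence in one call
one_by_one = np.stack([feed_forward(row) for row in x])  # each position separately
print(np.allclose(all_at_once, one_by_one))
```

Both paths give the same result because no position looks at any other position inside the feed-forward network.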

<br/>

## Shape logic:

| Attention Type               | Mask Type           | Expected Shape            | Why                                                      |
| ---------------------------- | ------------------- | ------------------------- | -------------------------------------------------------- |
| Encoder Self-Attention       | Padding             | (batch, 1, 1, seq_len)    | All queries share same key mask                          |
| Decoder Self-Attention       | Lookahead + Padding | (batch, seq_len, seq_len) | Query position–specific masking                          |
| Encoder–Decoder (Cross-Attn) | Padding             | (batch, 1, 1, seq_len)    | Each query attends all encoder tokens except padded ones |
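
To see how a `(batch, 1, 1, seq_len)` padding mask acts on attention scores, here is a NumPy sketch. It is an illustration of the broadcasting only, not the Keras implementation, and the token ids and zero scores are made up for the example.

```python
import numpy as np

batch, heads, q_len, k_len = 2, 4, 3, 5
token_ids = np.array([[7, 9, 4, 0, 0],   # last two key positions are padding (id 0)
                      [3, 8, 0, 0, 0]])  # last three key positions are padding

pad_mask = (token_ids != 0).astype(np.float32)[:, None, None, :]  # (2, 1, 1, 5)

scores = np.zeros((batch, heads, q_len, k_len), dtype=np.float32)  # dummy scores
masked = np.where(pad_mask == 1, scores, -1e9)  # mask broadcasts to (2, 4, 3, 5)

# Softmax over the key axis: padded keys end up with zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0, 0, 0])
```

The two size-1 axes are what let one mask per sentence serve every head and every query position.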

<br/>

## Are (batch, target_len, source_len) and (batch, dec_seq_len, dec_seq_len) equal, and do they mean the same thing?

ChatGPT said:

Good question — they look similar but don’t mean the same.

Let’s compare them clearly:

| Shape | Meaning | Used In | Description |
| --- | --- | --- | --- |
| (batch, target_len, source_len) | Decoder attends to encoder outputs | Encoder–Decoder (Cross) Attention | Each target token (in decoder) looks at all encoder tokens (in source). Example: translating English → French. French word attends to all English words. |
| (batch, dec_seq_len, dec_seq_len) | Decoder attends to its own past tokens | Self-Attention (in Decoder) | Each decoder token looks only at previous decoder tokens (using causal mask). Used to generate sequences autoregressively. |

So:

target_len = decoder length (output sequence)

source_len = encoder length (input sequence)

✅ In short →
(batch, target_len, source_len) ≠ (batch, dec_seq_len, dec_seq_len).
Only the first is for cross-attention, the second for decoder self-attention.

## You said: Then why did I say: "# (batch, dec_seq_len, dec_seq_len). This line of code is necessary because MHA accepts shape (batch, target_len, source_len), so multiplying them changes the shape to (batch, target_len, source_len). Earlier I mentioned that MHA() expects (batch_size, 1, 1, seq_len) in case of padding, but for look-ahead masking it requires (batch, target_len, source_len)"?

ChatGPT said:

Excellent observation — you’ve caught a very subtle but crucial distinction about mask shapes in multi-head attention.
Let’s unpack it clearly.

1️⃣ What MultiHeadAttention expects

TensorFlow’s MultiHeadAttention layer expects:

attention_mask → shape (batch_size, target_seq_len, source_seq_len)


So:

When you’re doing self-attention,
→ target_seq_len == source_seq_len == dec_seq_len

When you’re doing cross-attention,
→ target_seq_len = decoder length, source_seq_len = encoder length

2️⃣ Why we had (batch, 1, seq_len) before

Earlier, for padding masks, we used:

mask shape = (batch_size, 1, seq_len)


This works because padding mask is broadcastable — it’s the same across all target positions (each target token ignores the same padded encoder tokens).
TensorFlow automatically broadcasts (batch, 1, seq_len) → (batch, target_len, seq_len) inside MHA.

3️⃣ Why we now need (batch, dec_seq_len, dec_seq_len)

For look-ahead masking, every position in the decoder must have a different visibility window:

Token 1 → can attend only to itself

Token 2 → can attend to tokens 1–2

Token 3 → can attend to tokens 1–3
and so on.

Hence, you can’t just broadcast one mask — each target position has its own row of visibility.
That’s why we need (batch, dec_seq_len, dec_seq_len).
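
The per-row visibility can be printed with a short NumPy sketch. It uses `np.tril` as a stand-in for `tf.linalg.band_part(..., -1, 0)` from the notebook; both produce a lower-triangular matrix that includes the diagonal.

```python
import numpy as np

seq_len = 4
# 1 = allowed, 0 = masked; row i is the visibility window of query position i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

for position, row in enumerate(causal):
    visible = np.flatnonzero(row).tolist()
    print(f"query {position} may attend to keys {visible}")
```

Every row is different, which is why this mask cannot be reduced to a single broadcast row the way a padding mask can.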

4️⃣ Why you wrote:
combined = tf.squeeze(combined, axis=1)


This made the shape (batch, dec_seq_len, dec_seq_len) — exactly what MHA expects when using look-ahead masks.

✅ In short:

(batch, 1, seq_len) → padding mask (broadcasted)

(batch, dec_seq_len, dec_seq_len) → look-ahead mask (non-broadcasted, per-token visibility)

Both are valid because MHA() always expects (batch, target_len, source_len) internally.

### Taj said: Thanks for this explanation, but I wanted to know why I wrote this comment in the code and exactly what shape MHA expects.
ChatGPT said:

Perfect — let’s directly connect your comment and what TensorFlow’s MultiHeadAttention really expects.

Your code and comment:
combined = tf.squeeze(combined, axis = 1)  # (batch, dec_seq_len, dec_seq_len) # This line of code is necessary because,MHA accepts shape (batch, target_len, source_len), so, multiplying them changes the shape to (batch, target_len, source_len).

✅ Why this comment makes sense

The comment is correct because:

You are preparing the mask to match what MultiHeadAttention expects.

That expected shape is (batch_size, target_seq_len, source_seq_len).

For decoder self-attention,
→ target_seq_len = source_seq_len = dec_seq_len,
so final shape = (batch, dec_seq_len, dec_seq_len).

That’s exactly what your squeeze() step ensures — removing the redundant dimension (axis 1) that was (batch, 1, dec_seq_len, dec_seq_len) before.
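
The shape bookkeeping can be reproduced in NumPy. This is a sketch that mirrors `build_decoder_self_attention_mask` with made-up decoder ids, using `np.squeeze` and `np.tril` in place of the TensorFlow calls.

```python
import numpy as np

dec_ids = np.array([[5, 6, 7, 0],   # one padded position
                    [5, 6, 0, 0]])  # two padded positions
seq_len = dec_ids.shape[1]

pad = (dec_ids != 0).astype(np.float32)[:, None, None, :]              # (2, 1, 1, 4)
causal = np.tril(np.ones((seq_len, seq_len), np.float32))[None, None]  # (1, 1, 4, 4)

combined = causal * pad                  # element-wise with broadcasting: (2, 1, 4, 4)
combined = np.squeeze(combined, axis=1)  # (2, 4, 4) = (batch, target_len, source_len)
print(combined.shape)
print(combined[0])
```

An entry is 1 only where the key is not in the future and is not padding, so each row is the causal row with the padded columns zeroed out.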