In [2]:
import numpy as np
# Use tensorflow.keras instead of just keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', FutureWarning)

Start by defining the sentences and text for translation training.
- Sentence Pairs: Defines a small dataset of English-Spanish sentence pairs.
- Target Sequences: Prepends "startseq" and appends "endseq" to each target sentence so the decoder learns when to start and stop translating.

In [6]:
# Sample parallel sentences (English -> Spanish)
input_texts = [
    "Hello.", "How are you?", "I am learning machine translation.", "What is your name?", "I love programming."
]
target_texts = [
    "Hola.", "¿Cómo estás?", "Estoy aprendiendo traducción automática.", "¿Cuál es tu nombre?", "Me encanta programar."
]

target_texts = ["startseq " + x + " endseq" for x in target_texts]


- Convert the text from the sentences to tokens and create a vocabulary
- Tokenization: Uses Tokenizer to convert words into numerical sequences

In [7]:
# Tokenization
input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(input_texts)
input_sequences = input_tokenizer.texts_to_sequences(input_texts)

output_tokenizer = Tokenizer()
output_tokenizer.fit_on_texts(target_texts)
output_sequences = output_tokenizer.texts_to_sequences(target_texts)

input_vocab_size = len(input_tokenizer.word_index) + 1
output_vocab_size = len(output_tokenizer.word_index) + 1
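
As a rough illustration of what the tokenizer produces, the sketch below builds a word index in plain Python. `build_word_index` and `texts_to_ids` are hypothetical helpers, not part of Keras, and the splitting rule (lowercase, split on whitespace) is a simplification of what `Tokenizer` does. It is meant to show why the vocabulary size is `len(word_index) + 1`: word ids start at 1, and id 0 is left free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id starting at 1 (0 is reserved for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Most frequent words get the smallest ids; ties keep first-seen order.
    ordered = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ordered, start=1)}

def texts_to_ids(texts, word_index):
    """Replace every known word with its id, skipping unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Toy data, assumed for illustration only.
toy_texts = ["startseq hola endseq", "startseq me encanta programar endseq"]
toy_index = build_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index)
toy_vocab_size = len(toy_index) + 1  # +1 for the reserved padding id 0
```

Because "startseq" and "endseq" appear in every target sentence, they end up with the smallest ids in this sketch.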

Padding: Ensures all sequences have the same length.

In [8]:
# Padding
max_input_length = max([len(seq) for seq in input_sequences])
max_output_length = max([len(seq) for seq in output_sequences])

input_sequences = pad_sequences(input_sequences, maxlen=max_input_length, padding='post')
output_sequences = pad_sequences(output_sequences, maxlen=max_output_length, padding='post')
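
To make the effect of `padding='post'` concrete, here is a minimal NumPy sketch of post-padding. `pad_post` is a hypothetical helper written for this illustration, not the Keras implementation, and it assumes no sequence is longer than `maxlen` (so it does no truncation).

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences with `value` up to a fixed length.

    Assumes every sequence has at most `maxlen` elements.
    """
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# Toy token ids, assumed for illustration only.
toy_sequences = [[1, 3, 2], [1, 4, 5, 6, 2]]
toy_maxlen = max(len(seq) for seq in toy_sequences)
toy_padded = pad_post(toy_sequences, toy_maxlen)
```

The shorter sequence is filled with zeros on the right, so both rows end up with length 5 and can be stacked into one array.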

In [9]:
# Prepare the target data for training
decoder_input_data = output_sequences[:, :-1]
decoder_output_data = output_sequences[:, 1:]

# Convert to one-hot
decoder_output_data = np.array([np.eye(output_vocab_size)[seq] for seq in decoder_output_data])
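
The slicing above sets up teacher forcing: at each step the decoder reads one token and is trained to predict the next one. The sketch below repeats the same two steps on a toy array; the ids (1 for "startseq", 2 for "endseq", 0 for padding) and the vocabulary size are assumptions chosen for illustration.

```python
import numpy as np

# Toy padded target ids: 1 = startseq, 2 = endseq, 0 = padding (assumed ids).
toy_targets = np.array([[1, 3, 2, 0, 0],
                        [1, 4, 5, 6, 2]])
toy_vocab_size = 7

toy_decoder_in = toy_targets[:, :-1]   # what the decoder reads at each step
toy_decoder_out = toy_targets[:, 1:]   # what it should predict at that step

# Indexing an identity matrix with integer ids yields one-hot rows.
toy_one_hot = np.eye(toy_vocab_size)[toy_decoder_out]
```

Here `toy_one_hot` has shape `(2, 4, 7)`: one row of length `toy_vocab_size` per sample and time step, which is the layout a softmax output trained with categorical cross-entropy expects.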

Self-attention is a mechanism that allows a model to **focus on relevant parts of the input sequence** while processing each word. This is particularly useful in:
- Machine Translation (e.g., aligning words correctly)
- Text Summarization
- Speech Recognition
- Image Processing (Vision Transformers)

In this implementation, self-attention is used for text-based sequence-to-sequence modeling.


Self-attention works on a given input sequence by computing, for each position, a weighted representation of all words in the sequence. It does so using three key components:

- Query **(Q)**, Key **(K)**, and Value **(V)** Matrices

For each word (token) in a sequence:

  - Query (Q): What this word is looking for.
  - Key (K): What this word represents.
  - Value (V): The actual information in the word.

- Compute **Attention Scores**

Next, we **calculate the similarity between each query and key** using dot-product attention:

$$\text{scores} = QK^{T}$$

Each word in a sequence attends to every word (including itself) based on these scores.

- Apply **Scaling & Softmax**

Since dot-product values can be large, we scale them by $\sqrt{d_k}$, where $d_k$ is the key dimension. Applying softmax then converts the scores into attention weights, which are used to weight the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
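
The three steps above (project to Q, K, V; score; scale and normalize) can be sketched in plain NumPy. This is an illustrative sketch, not the Keras layer used in this notebook: `softmax` and `self_attention` are helper names chosen here, and the random weight matrices stand in for weights that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for x of shape (batch, seq_len, dim)."""
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    d_k = k.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ v, weights                       # weighted sum of values

rng = np.random.default_rng(0)
toy_x = rng.normal(size=(2, 4, 8))  # batch of 2, 4 tokens, 8 features
# Placeholder projection matrices; in a real layer these are trainable.
toy_wq, toy_wk, toy_wv = (rng.normal(size=(8, 8)) for _ in range(3))
toy_out, toy_weights = self_attention(toy_x, toy_wq, toy_wk, toy_wv)
```

The output keeps the input shape `(2, 4, 8)`, while the attention weights form one `(4, 4)` matrix per sample, with one row of weights per query position.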


### In this implementation of the self-attention layer:

### We first initialize the weights in the build method, where:<BR/>
A) self.Wq, self.Wk, self.Wv are the trainable weight matrices.<BR/>
B) Their shape is (feature_dim, feature_dim), meaning they transform input features into Q, K, and V representations.<BR/>
### We then apply attention in the call method. The call() method:<BR/>
A) Computes Q, K, V by multiplying inputs (encoder/decoder output) with their respective weight matrices.<BR/>
B) Computes dot-product attention scores using K.batch_dot(q, k, axes=[2, 2]), resulting in a (batch_size, seq_len, seq_len) matrix.<BR/>
C) Scales the scores by the square root of the key dimension to avoid large values.<BR/>
D) Applies softmax to normalize the attention scores.<BR/>
E) Multiplies attention weights with V to get the final output.<BR/>
### Finally, the compute_output_shape method:<BR/>
A) Defines the shape of the output tensor after the layer processes an input.<BR/>
B) Returns the input shape, because the output shape of the Self-Attention layer remains the same as the input shape.<BR/>
C) The attention mechanism transforms the input but does not change its dimensions.
If the attention layer changed the shape, you would modify compute_output_shape accordingly.

In [10]:
# Define the Self-Attention Layer
class SelfAttention(Layer):
    def __init__(self, **kwargs):
        super(SelfAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        feature_dim = input_shape[-1]
        # Weight matrices for Q, K, V
        self.Wq = self.add_weight(shape=(feature_dim, feature_dim),
                                  initializer='glorot_uniform',
                                  trainable=True,
                                  name='Wq')
        self.Wk = self.add_weight(shape=(feature_dim, feature_dim),
                                  initializer='glorot_uniform',
                                  trainable=True,
                                  name='Wk')
        self.Wv = self.add_weight(shape=(feature_dim, feature_dim),
                                  initializer='glorot_uniform',
                                  trainable=True,
                                  name='Wv')
        super(SelfAttention, self).build(input_shape)

    def call(self, inputs):
        # Linear projections
        q = K.dot(inputs, self.Wq)  # Query
        k = K.dot(inputs, self.Wk)  # Key
        v = K.dot(inputs, self.Wv)  # Value

        # Scaled Dot-Product Attention
        scores = K.batch_dot(q, k, axes=[2, 2])  # (batch, seq_len, seq_len)
        scores = scores / K.sqrt(K.cast(K.shape(k)[-1], dtype=K.floatx()))  # Scale
        attention_weights = K.softmax(scores, axis=-1)  # Normalize

        # Weighted sum of values
        output = K.batch_dot(attention_weights, v)  # (batch, seq_len, feature_dim)
        return output

    def compute_output_shape(self, input_shape):
        return input_shape


The model follows an Encoder-Decoder structure:

### Encoder:
1) Takes input sentences (padded and tokenized).<BR/>
2) Uses an Embedding layer (word representations) + LSTM (to process sequences).<BR/>
    1. The LSTM **helps process variable-length input sentences** and generate meaningful translations.<BR/>
3) Outputs context vectors (hidden & cell states).

### Attention Layer
1) Applied to both the encoder and decoder outputs.<BR/>
2) Helps the decoder focus on relevant words during translation.<BR/>

### Decoder
1) Receives target sequences (shifted one step ahead).<BR/>
2) Uses an LSTM with encoder states as initial states.<BR/>
3) Applies self-attention for better learning.<BR/>
4) Uses a Dense layer (Softmax) to predict the next word.
