1. Transformers are a type of deep learning model that uses self-attention mechanisms to process and generate sequences of data efficiently.

2. The Transformer model is built on an encoder-decoder architecture in which both the encoder and the decoder are composed of a stack of layers that combine self-attention mechanisms with feed-forward neural networks.

![image.png](attachment:image.png)

---

### 1. Encoder

The primary function of the encoder is to create a high-dimensional representation of the input sequence that the decoder can use to generate the output. The encoder consists of multiple identical layers, and each layer is composed of two main sub-layers:

1. Self-Attention Mechanism: This sub-layer allows the encoder to weigh the importance of different parts of the input sequence, capturing dependencies regardless of their distance within the sequence.

2. Feed-Forward Neural Network: This sub-layer consists of two linear transformations with a ReLU activation in between. It processes the output of the self-attention mechanism to generate a refined representation.
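
The feed-forward sub-layer described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sizes `d_model = 8` and `d_ff = 32` and the random weights are assumptions chosen for the example, whereas a real model learns these parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, applied to every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # illustrative sizes (assumed)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(1, 5, d_model))        # (batch, seq_len, d_model)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)                              # (1, 5, 8)
```

Because the same weights are applied to each position independently, running the function on a single token gives the same result as the corresponding row of the full output.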

### 2. Decoder

The decoder in a Transformer also consists of multiple identical layers. Its primary function is to generate the output sequence based on the representations provided by the encoder and the previously generated tokens of the output.

Each decoder layer consists of three main sub-layers:

1. Masked Self-Attention Mechanism: Similar to the encoder's self-attention mechanism, but it prevents each position from attending to future tokens, which maintains the autoregressive property (no cheating during generation).

2. Encoder-Decoder Attention Mechanism: This sub-layer allows the decoder to focus on relevant parts of the encoder's output representation, and therefore on relevant parts of the input, which is essential for tasks like translation.

3. Feed-Forward Neural Network: This sub-layer processes the combined output of the masked self-attention and encoder-decoder attention mechanisms.
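
To make the masking in the first sub-layer concrete, here is a small sketch of a causal (lower-triangular) mask and a softmax that respects it. The helper names `causal_mask` and `masked_softmax` are hypothetical, and uniform scores are used only so that the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask: entry (i, j) is True when position i may attend to j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis with disallowed positions set to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)  # uniform scores for illustration
print(weights)
```

Each row spreads its weight only over the current and earlier positions, so the first token attends only to itself and the last token attends to all four.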

---

# In-Depth Analysis of Transformer Components

### 1. Multi-Head Self-Attention Mechanism
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

### 2. Position-wise Feed-Forward Networks
![image-4.png](attachment:image-4.png)

### 3. Positional Encoding
![image-5.png](attachment:image-5.png)

### 4. Layer Normalization and Residual Connections
![image-6.png](attachment:image-6.png)

---

# How Transformers Work

### 1. Input Representation
The first step in processing input data involves converting raw text into a format that the transformer model can understand. This involves tokenization and embedding.

1. Tokenization: The input text is split into smaller units called tokens, which can be words, subwords or characters. Tokenization ensures that the text is broken down into manageable pieces.

2. Embedding: Each token is then converted into a fixed-size vector using an embedding layer. This layer maps each token to a dense vector representation that captures its semantic meaning.

3. Positional Encoding: Positional encodings are added to these embeddings to provide information about the token positions within the sequence.
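
The three steps above can be sketched with a toy example. Everything here is an assumption made for illustration: whitespace tokenization, a vocabulary built from a single sentence, a random embedding table, and a random stand-in for the positional encodings. Real models typically use a trained subword tokenizer and learned embeddings.

```python
import numpy as np

sentence = "the cat sat on the mat"
tokens = sentence.split()                      # naive whitespace tokenization (assumed)

# Build a toy vocabulary mapping each distinct token to an integer id.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = np.array([vocab[tok] for tok in tokens])

d_model = 8                                    # illustrative embedding size
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learnable in a real model

embeddings = embedding_table[token_ids]        # lookup: (seq_len, d_model)
positions = rng.normal(size=(len(tokens), d_model))  # stand-in for positional encodings
model_input = embeddings + positions           # what the first layer receives

print(token_ids)                               # [4 0 3 2 4 1]
print(model_input.shape)                       # (6, 8)
```

Both occurrences of "the" share the same embedding vector; only the added positional term distinguishes them.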

### 2. Encoder Process in Transformers

1. Input Embedding: The input sequence is tokenized and converted into embeddings with positional encodings added.

2. Self-Attention Mechanism: Each token in the input sequence attends to every other token to capture dependencies and contextual information.

3. Feed-Forward Network: The output from the self-attention mechanism is passed through a position-wise feed-forward network.

4. Layer Normalization and Residual Connections: A residual connection followed by layer normalization is applied around each of the two sub-layers above.
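
As a rough sketch of the residual connection and layer normalization step, the function below adds a sub-layer's output back to its input and then normalizes each token vector. It assumes the post-norm layout of the original Transformer and omits the learned gain and bias that real layer normalization includes; the function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance over features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization (post-norm layout)."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))                 # (batch, seq_len, d_model)
sublayer_output = rng.normal(size=(2, 3, 4))   # e.g. output of self-attention
out = add_and_norm(x, sublayer_output)
print(out.shape)                               # (2, 3, 4)
```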

### 3. Decoder Process

1. Input Embedding and Positional Encoding: The partially generated output sequence is tokenized and embedded with positional encodings added.

2. Masked Self-Attention Mechanism: The decoder uses masked self-attention to prevent attending to future tokens, ensuring that the model generates the sequence step by step.

3. Encoder-Decoder Attention Mechanism: The decoder attends to the encoder's output, allowing it to focus on relevant parts of the input sequence.

4. Feed-Forward Network: As in the encoder, the output from the attention mechanisms is passed through a position-wise feed-forward network.

5. Layer Normalization and Residual Connections: As in the encoder, a residual connection followed by layer normalization is applied around each sub-layer.

### 4. Training and Inference

1. Transformers are trained with teacher forcing: during training, the correct previous tokens are provided to the decoder when predicting the next token. At inference time, the model instead generates tokens one at a time, feeding each prediction back in as input.

2. Transformers have transformed deep learning by using self-attention mechanisms to efficiently process and generate sequences, capturing long-range dependencies and contextual relationships. Their encoder-decoder architecture, combined with multi-head attention and feed-forward networks, enables highly effective handling of sequential data.
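
Teacher forcing is often implemented by shifting the target sequence by one position. The sketch below illustrates that idea with a hypothetical target sentence and hypothetical `<bos>`/`<eos>` marker tokens; it shows only how inputs and labels line up, not a training loop.

```python
target = ["<bos>", "je", "suis", "étudiant", "<eos>"]   # hypothetical target tokens

decoder_input = target[:-1]   # what the decoder sees: ground-truth previous tokens
labels = target[1:]           # what it must predict at each step

for step, expected in enumerate(labels):
    # At each step the decoder is given the true prefix, not its own predictions.
    print(f"step {step}: given {decoder_input[:step + 1]} -> predict {expected!r}")
```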

---

# Self-Attention in Transformers

#### In Transformer models, self-attention allows the model to look at all words in a sentence at once, but it doesn’t naturally understand the order of those words. This is a problem because word order matters in language. To solve this, Transformers use positional embeddings: extra information added to each word that tells the model where it appears in the sentence. This helps the model understand both the meaning of each word and its position, so it can process sentences more effectively.

---

#### Encoder-Decoder Model

An encoder-decoder model is used in machine learning tasks that involve sequences, like translating sentences, generating text or creating captions for images. Here's how it works:

1. Encoder: It takes an input sequence, such as a sentence, and processes it, converting the input into a fixed-size summary called a latent vector or context vector (in a Transformer, the encoder instead produces one vector per input token). This representation holds the important information from the input sequence.

2. Decoder: It then uses this summary to generate an output sequence, such as a translated sentence. It tries to reconstruct the desired output based on the encoded information.

![image.png](attachment:image.png)

---

### Attention Layer in Transformer

1. Input Embedding: Input text, such as a sentence, is first converted into embeddings. These are vector representations of words in a continuous space.

2. Positional Encoding: Since the Transformer doesn’t process words sequentially like RNNs, positional encodings are added to the input embeddings to encode the position of each word in the sentence.

3. Multi-Head Attention: Multiple attention heads are applied in parallel, each attending to different parts of the sequence simultaneously. Each head computes attention scores from queries (Q) and keys (K) and uses them to combine the values (V), gathering information from different parts of the input.

4. Add and Norm: This layer applies a residual connection followed by layer normalization. This helps avoid vanishing gradient problems and ensures stable training.

5. Feed Forward: After attention, the output is passed through a feed-forward neural network for further transformation.

6. Masked Multi-Head Attention for the Decoder: This is used in the decoder and ensures that each word can only attend to itself and previous words in the sequence, not future ones.

7. Output Layer (Linear and Softmax): Finally, the transformed output is mapped to the output vocabulary space by a linear layer and processed by a softmax function to generate output probabilities.
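
The multi-head attention step in the list above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: random matrices stand in for learned projections, `W_o` is the output projection applied after concatenating the heads, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np
from scipy.special import softmax

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    batch, seq_len, d_model = X.shape
    d_head = d_model // num_heads               # assumes d_model % num_heads == 0

    def split_heads(t):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        return t.reshape(batch, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)          # one attention map per head
    heads = weights @ V                         # (batch, num_heads, seq_len, d_head)
    # Concatenate the heads back into a single d_model-sized vector per token.
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out, attn = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape, attn.shape)                    # (1, 3, 8) (1, 2, 3, 3)
```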

---

## Self Attention Mechanism
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

## Multi Head Attention
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)

#### Use in Transformer Architecture

1. Encoder Decoder Attention: In this layer queries come from the previous decoder layer while the keys and values come from the encoder’s output. This allows each position in the decoder to focus on all positions in the input sequence.

2. Encoder Self Attention: This layer receives queries, keys and values from the output of the previous encoder layer. Each position in the encoder looks at all positions from the previous layer to calculate attention scores.

![image-6.png](attachment:image-6.png)

3. Decoder Self Attention: Similar to the encoder's self attention, but here the queries, keys and values come from the previous decoder layer. Each position can attend to the current and previous positions, while future positions are masked to prevent the model from looking ahead when generating the output. This is called masked self attention.

![image-7.png](attachment:image-7.png)
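
To illustrate where the queries, keys and values come from in encoder-decoder attention, here is a small sketch with random stand-ins for the encoder output, the decoder state and the projection matrices. The sequence lengths (5 source positions, 3 target positions) are arbitrary choices for the example.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
d_model = 4
encoder_output = rng.normal(size=(1, 5, d_model))   # 5 source positions
decoder_state = rng.normal(size=(1, 3, d_model))    # 3 target positions so far

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = decoder_state @ W_q          # queries come from the decoder
K = encoder_output @ W_k         # keys come from the encoder
V = encoder_output @ W_v         # values come from the encoder

cross_weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_model), axis=-1)
context = cross_weights @ V

print(cross_weights.shape)       # (1, 3, 5)
print(context.shape)             # (1, 3, 4)
```

The attention map has one row per decoder position and one column per encoder position, which is why no causal mask is needed here.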

---

# Implementation

Step 1: Import Necessary Libraries
These lines import numpy for matrix operations and softmax from scipy.special to convert attention scores into probability distributions.

In [None]:
import numpy as np
from scipy.special import softmax

Step 2: Extract Dimensions
This function starts by extracting the input shape: batch size, sequence length and model dimension. It sets d_k, the dimension of the keys and queries, equal to the model dimension for simplicity.

In [None]:
def self_attention(X):
    batch_size, seq_len, d_model = X.shape
    d_k = d_model

Step 3: Initialize Weight Matrices
These lines, which continue the body of self_attention, initialize random weight matrices for queries (W_q), keys (W_k) and values (W_v). In real models these are learnable parameters used to project the input into Q, K, and V representations.

In [None]:
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

Step 4: Compute Q, K, V matrices
These lines project the input X into query (Q), key (K) and value (V) matrices by multiplying with their respective weights. This transforms the input into different views used for computing attention.

In [None]:
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

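Step 5: Compute Attention Scores and Weights
These lines compute the attention scores as the scaled dot product of the queries and keys, then apply softmax over the last axis to turn the scores into attention weights. The final output is obtained by weighting the values (V) with these attention weights, aggregating relevant information from the sequence. The function then returns both the attention output and the attention weights for further use or analysis.

In [None]:
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # scaled dot products: (batch, seq_len, seq_len)
    attention_weights = softmax(scores, axis=-1)      # each row sums to 1
    output = attention_weights @ V                    # weighted sum of the values
    return output, attention_weights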

Step 6: Example Usage
This sets a random seed for reproducibility, creates a sample input tensor with shape (1, 3, 4), runs the self-attention function on it, and then prints the resulting output and attention weights.

In [None]:
np.random.seed(42)
X = np.random.rand(1, 3, 4)
output, weights = self_attention(X)

print("Output:\n", output)
print("\nAttention Weights:\n", weights)

![image.png](attachment:image.png)

---

# Positional Encoding in Transformers

#### Positional encoding is a technique that adds information about the position of each token in the sequence to the input embeddings. This helps transformers to understand the relative or absolute position of tokens which is important for differentiating between words in different positions and capturing the structure of a sentence. Without positional encoding, transformers would struggle to process sequential data effectively.

![image.png](attachment:image.png) 
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

---

# How Does Positional Encoding Work?

* The most common method for calculating positional encodings is based on sinusoidal functions. The intuition behind using sine and cosine functions is illustrated below:

![image-4.png](attachment:image-4.png)
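
For reference, the sinusoidal encoding computed by the implementation later in this section can be written as follows, where $pos$ is the token position, $i$ indexes the sine/cosine pairs and $d_{\text{model}}$ is the model dimension:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions use the sine and odd dimensions use the cosine of the same angle, so each pair of dimensions shares one frequency.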

---

#### Example with Dimensionality
![image-5.png](attachment:image-5.png)

---

# Implementation of Positional Encoding in Transformers

* angle_rads = np.arange(position)[:, np.newaxis] / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)) : Calculate angle values based on position and model dimension.

* position = 50, d_model = 512 : Set the sequence length (number of positions) and dimensionality of the model respectively.

In [None]:
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    # Angle for each (position, dimension) pair: pos / 10000^(2i / d_model),
    # where i = dimension // 2 so each sine/cosine pair shares a frequency.
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))

    # Apply sine to even dimensions and cosine to odd dimensions.
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a leading batch axis: (1, position, d_model).
    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

position = 50
d_model = 512
pos_encoding = positional_encoding(position, d_model)

print("Positional Encodings Shape:", pos_encoding.shape)
print("Positional Encodings Example:\n", pos_encoding)

![image.png](attachment:image.png)

---

# BERT model

1. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based neural network.

2. BERT uses an encoder-only architecture.

3. In the original Transformer architecture, there are both encoder and decoder modules.

4. The decision to use an encoder-only architecture in BERT suggests a primary focus on understanding input sequences rather than generating output sequences.

---

# Workflow of BERT

![image.png](attachment:image.png)

## 1. Masked Language Models (MLM) in BERT

* Masked Language Models (MLMs) are a type of machine learning model designed to predict missing or "masked" words in a sentence. These models are trained on large datasets of text where certain words are intentionally hidden during training. The goal of the model is to guess the hidden word based on the surrounding context. This approach helps the model learn the relationships between words and develop a deeper understanding of language structure.

#### How Do Masked Language Models Work?
The process of training a masked language model involves two main steps:

![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
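
As a rough illustration of the masking idea, the sketch below hides a random subset of tokens and records the originals as prediction targets. It is deliberately simplified: it assumes whitespace tokens and always substitutes a `[MASK]` token, whereas BERT's actual procedure selects about 15% of tokens and sometimes keeps the original or substitutes a random token instead. The function name `mask_tokens` and the higher `mask_prob` used in the example are choices made for this illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)       # the model is trained to recover this token
        else:
            masked.append(tok)
            labels.append(None)      # position ignored by the loss
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```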

#### Why Are Masked Language Models Important?
Masked language models have become important to modern NLP for several reasons:

1. Bidirectional Understanding
Unlike earlier models that processed text in a single direction (either left-to-right or right-to-left), MLMs are bidirectional. This means they analyze the entire context of a word, both the words before it and the words after it. This bidirectional approach allows the model to capture richer and more nuanced meanings.

2. Contextual Word Representations
Words can have different meanings depending on the context in which they appear. For example the word "bank" could refer to a financial institution or the side of a river. MLMs excel at understanding these contextual differences because they rely on the surrounding words to make predictions.

3. Versatility
Once trained, masked language models can be fine-tuned for a wide range of downstream tasks, such as:

* Text Classification: Determining the sentiment of a review (positive, negative, neutral).
* Named Entity Recognition: Identifying names, dates and locations in a document.
* Question Answering: Providing answers to questions based on a given passage of text.
* Language Translation: Converting text from one language to another.

4. State-of-the-Art Performance
MLMs like BERT (Bidirectional Encoder Representations from Transformers) have achieved groundbreaking results in various NLP benchmarks. Their ability to understand context and relationships between words has set new standards for AI-driven language understanding.


## 2. Next Sentence Prediction (NSP) using BERT

* Next Sentence Prediction is a pre-training task used in BERT to help the model understand the relationship between different sentences. Modeling this relationship is useful for tasks like question answering, summarization and dialogue systems. The goal is to determine whether a given second sentence logically follows the first one. For example:

* Sentence A: “She opened the door.”
* Sentence B: “She saw her friend standing there.”

In this case, Sentence B follows Sentence A, so the label is 1 (consecutive). If Sentence B were unrelated, like “The sky was blue”, the label would be 0, meaning non-consecutive.

1. In the training process, BERT learns to understand the relationship between pairs of sentences, predicting if the second sentence follows the first in the original document.

2. 50% of the input pairs have the second sentence as the subsequent sentence in the original document, and the other 50% have a randomly chosen sentence.

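3. To help the model distinguish between connected and disconnected sentence pairs, the input is processed before entering the model: a [CLS] token is inserted at the beginning of the input and a [SEP] token at the end of each sentence.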

4. BERT predicts if the second sentence is connected to the first. This is done by transforming the output of the [CLS] token into a two-element vector using a classification layer, and then calculating the probability that the second sentence follows the first using softmax.
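
The pair construction described in the steps above can be sketched as follows. This is a simplified illustration: the helper `make_nsp_example` is hypothetical, the negative sentence is drawn from the same toy document (BERT draws it from elsewhere in the corpus), and the labels follow the 1 = consecutive, 0 = non-consecutive convention used in this section.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one (text, label) pair; label 1 = consecutive, 0 = random."""
    first = sentences[index]
    if rng.random() < 0.5:
        second, label = sentences[index + 1], 1
    else:
        # Pick any sentence other than the first one and its true successor.
        candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, label = rng.choice(candidates), 0
    return f"[CLS] {first} [SEP] {second} [SEP]", label

document = [
    "She opened the door.",
    "She saw her friend standing there.",
    "They went for a walk.",
    "The sky was blue.",
]
rng = random.Random(0)
text, label = make_nsp_example(document, 0, rng)
print(label, text)
```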

---

#### Why Train Masked LM and Next Sentence Prediction Together?
Masked LM helps BERT understand the context within a sentence, and Next Sentence Prediction helps BERT grasp the relationship between pairs of sentences. Training on both objectives together therefore ensures that BERT learns a broad and comprehensive understanding of language, capturing both the details within sentences and the flow between sentences.

---

# BERT Architecture
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)

---


# 🧠 Interview Questions

---

# 🔹 1. Basic Concept Questions

### 1️⃣ What is a Transformer model?

A Transformer is a deep learning architecture introduced in the paper Attention Is All You Need.
It is based entirely on attention mechanisms instead of recurrence.

Unlike RNNs, it processes all tokens in parallel and captures long-range dependencies efficiently.
It consists of Encoder and Decoder blocks built using Multi-Head Attention and Feed Forward Networks.

---

### 2️⃣ Why were Transformers introduced?

Transformers were introduced to overcome limitations of RNNs and LSTMs:

* Sequential computation (slow training)
* Difficulty capturing long-term dependencies
* Vanishing gradient problem

Transformers allow parallelization and better context modeling using self-attention.

---

### 3️⃣ What problem of RNN/LSTM does Transformer solve?

It solves:

* Long-range dependency issues
* Vanishing/exploding gradients
* Training inefficiency due to sequential processing

Because attention connects every word directly to every other word.

---

### 4️⃣ What is the main idea behind self-attention?

Self-attention allows each word in a sentence to focus on other relevant words.

It calculates importance scores between tokens using Query, Key, and Value vectors.
This helps build contextual understanding of words.

---

### 5️⃣ Difference between RNN and Transformer?

RNN:

* Sequential processing
* Hidden state memory
* Hard to parallelize

Transformer:

* Parallel processing
* Uses attention instead of recurrence
* Captures global dependencies efficiently

---

### 6️⃣ Why is Transformer faster than LSTM?

Because it processes all tokens simultaneously using matrix operations, which are highly optimized on GPUs.

LSTMs process token by token, which slows training.

---

### 7️⃣ What does “Attention is All You Need” mean?

It means sequence modeling does not require recurrence or convolution: the attention mechanism alone is sufficient to model relationships in data.

---

### 8️⃣ What is parallelization in Transformers?

Parallelization means computing representations for all tokens at the same time instead of sequentially, which significantly speeds up training.

---

# 🔹 2. Architecture-Based Questions

### 9️⃣ Explain Transformer architecture

A Transformer consists of stacked Encoder and Decoder layers.

Each Encoder layer has:

* Multi-Head Self-Attention
* Feed Forward Network
* Residual connections + LayerNorm

Each Decoder layer has:

* Masked Self-Attention
* Cross-Attention
* Feed Forward Network
* Residual connections + LayerNorm

---

### 🔟 What are Encoder and Decoder blocks?

Encoder extracts contextual representation of input.
Decoder generates output sequence using masked attention and encoder information.

---

### 1️⃣1️⃣ Components of Encoder?

* Multi-Head Self-Attention
* Feed Forward Network
* Add & Norm (Residual + LayerNorm)

---

### 1️⃣2️⃣ Components of Decoder?

* Masked Self-Attention
* Encoder-Decoder (Cross) Attention
* Feed Forward Network
* Add & Norm

---

### 1️⃣3️⃣ What is Multi-Head Attention?

Instead of computing attention once, it computes multiple attention heads in parallel.

Each head captures different semantic relationships, and outputs are concatenated.

---

### 1️⃣4️⃣ What is Feed Forward Network (FFN)?

A position-wise fully connected network applied independently to each token.
Usually consists of two linear layers with ReLU or GELU activation.

---

### 1️⃣5️⃣ What is Layer Normalization?

LayerNorm normalizes activations across features for stable gradients and faster convergence.

---

### 1️⃣6️⃣ Why use Residual Connections?

Residual connections allow gradients to flow easily and help mitigate vanishing gradient problems in deep networks.

---

# 🔹 3. Attention Mechanism Questions

### 1️⃣7️⃣ What is Self-Attention?

Self-attention allows each token in a sequence to attend to every other token to build contextual representation.

---

### 1️⃣8️⃣ What are Query, Key, and Value?

They are learned linear projections of input embeddings.

* Query → What I’m looking for
* Key → What I contain
* Value → Information I pass

---

### 1️⃣9️⃣ How is attention score calculated?

By taking the dot product of the Query and Key, dividing by √dk, and applying softmax to obtain attention weights, which are then used to form a weighted sum of the Values.

---

### 2️⃣0️⃣ What is Scaled Dot-Product Attention?

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

It stabilizes training by scaling large dot-product values.

---

### 2️⃣1️⃣ Why divide by √dk?

To prevent large dot-product values that make softmax extremely peaked and gradients unstable.
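
A quick numerical sketch of this effect, using random vectors with unit-variance components as an assumption: without scaling, the dot products have a spread that grows with d_k, which pushes softmax toward a single dominant weight. The local `softmax` helper is defined here only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())                # subtract max for numerical stability
    return e / e.sum()

d_k = 512
q = rng.normal(size=d_k)
keys = rng.normal(size=(10, d_k))          # 10 keys with unit-variance components

raw = keys @ q                             # dot products; spread grows with d_k
scaled = raw / np.sqrt(d_k)                # spread brought back to roughly 1

print("std raw:   ", raw.std())
print("std scaled:", scaled.std())
print("max weight raw:   ", softmax(raw).max())
print("max weight scaled:", softmax(scaled).max())
```

The largest attention weight is never smaller for the unscaled scores than for the scaled ones, since dividing the logits by a constant greater than one always flattens the softmax.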

---

### 2️⃣2️⃣ What is Masked Attention?

Masked attention prevents a token from seeing future tokens.
Used in decoder models like GPT.

---

### 2️⃣3️⃣ Self-Attention vs Cross-Attention?

Self-Attention → Same sequence.
Cross-Attention → Decoder attends to Encoder outputs.

---

# 🔹 4. Positional Encoding

### 2️⃣4️⃣ Why need positional encoding?

Because Transformers do not inherently understand sequence order.

---

### 2️⃣5️⃣ How does it work?

It adds position-specific vectors to input embeddings, either sinusoidal or learned embeddings.

---

### 2️⃣6️⃣ Sinusoidal vs Learned encoding?

Sinusoidal → Fixed mathematical function.
Learned → Trainable parameters.

---

### 2️⃣7️⃣ Why can’t Transformers understand order without it?

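Because self-attention without positional information is permutation-equivariant: shuffling the input tokens simply shuffles the outputs in the same way, so the model cannot tell one ordering from another.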
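
This property can be checked numerically. The sketch below uses a single attention head with random projection matrices and no positional encodings (all assumptions for the example), and compares the output for a shuffled input with the shuffled output of the original input.

```python
import numpy as np
from scipy.special import softmax

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                       # 5 tokens, no positional information
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

perm = rng.permutation(5)
out = attention(X, W_q, W_k, W_v)
out_shuffled = attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs only shuffles the outputs in the same way.
print(np.allclose(out[perm], out_shuffled))       # True
```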

---

# 🔹 5. Training & Optimization

### 2️⃣8️⃣ What loss function is used?

Cross-Entropy Loss for classification and language modeling tasks.

---

### 2️⃣9️⃣ What is Teacher Forcing?

During training, the actual previous token is fed into the decoder instead of model’s predicted token.

---

### 3️⃣0️⃣ What is Label Smoothing?

A regularization method that softens target labels to prevent overconfidence.
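
A minimal sketch of one common form of label smoothing, which mixes the one-hot target with a uniform distribution over classes. The value `epsilon = 0.1` is an illustrative choice, and `smooth_labels` is a hypothetical helper name.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Move epsilon of the probability mass uniformly across all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

one_hot = np.array([0.0, 0.0, 1.0, 0.0])
smoothed = smooth_labels(one_hot)
print(smoothed)   # [0.025 0.025 0.925 0.025]
```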

---

### 3️⃣1️⃣ What is Dropout?

Randomly disabling neurons during training to prevent overfitting.

---

### 3️⃣2️⃣ What optimizer is used?

Adam or AdamW (improved weight decay handling).

---

### 3️⃣3️⃣ What is Warmup learning rate?

The learning rate gradually increases during the early training steps and then decays, which helps stabilize training.
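
One well-known warmup schedule is the one described in Attention Is All You Need: a linear increase for `warmup_steps` steps followed by inverse-square-root decay. The sketch below assumes that schedule with `d_model = 512` and `warmup_steps = 4000`; other schedules (for example linear or cosine decay) are also widely used.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The rate peaks at `step == warmup_steps`, where the two terms inside `min` coincide.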

---
