### Attention Mechanism

In [2]:
import numpy as np

# Example phrase
phrase = "The quick brown fox jumps over the lazy dog"

# Function to make a word embedding
def make_word_embedding(word):
    return np.random.rand(4)

# Softmax function
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

# Making the word embeddings
word = phrase.split()
word_embeddings = [make_word_embedding(w) for w in word]

# Simple self-attention function
def self_attention(word_embeddings):
    # Making matrix of word embeddings
    Q = np.array(word_embeddings)
    K = np.array(word_embeddings)
    V = np.array(word_embeddings)

    # Calculating the attention scores
    scores = np.dot(Q, K.T)

    # Applying the softmax function to the scores
    attention_weights = softmax(scores)

    # Calculating the weighted sum of the word embeddings
    weighted_values = np.dot(attention_weights, V)

    return attention_weights, weighted_values

# Apply the self-attention function to the word embeddings
attention_weights, weighted_values = self_attention(word_embeddings)

# Print the attention weights and weighted values
print("Original phrase: ", phrase)
print("\nWeighted of Attention:")
for i, w in enumerate(word):
    print(f"\nFor the word: '{w}':")
    for j, w2 in enumerate(word):
        print(f"Attention for '{w2}': {attention_weights[i][j]:.4f}")

print("\nExplanation:")
for i, w in enumerate(word):
    max_attention_weight = np.argmax(attention_weights[i])
    print(f"{w} has the highest attention weight with {word[max_attention_weight]}")

Original phrase:  The quick brown fox jumps over the lazy dog

Weighted of Attention:

For the word: 'The':
Attention for 'The': 0.1102
Attention for 'quick': 0.0869
Attention for 'brown': 0.0940
Attention for 'fox': 0.0863
Attention for 'jumps': 0.0800
Attention for 'over': 0.0763
Attention for 'the': 0.0949
Attention for 'lazy': 0.1004
Attention for 'dog': 0.1109

For the word: 'quick':
Attention for 'The': 0.0852
Attention for 'quick': 0.1375
Attention for 'brown': 0.0700
Attention for 'fox': 0.0940
Attention for 'jumps': 0.0743
Attention for 'over': 0.1288
Attention for 'the': 0.0713
Attention for 'lazy': 0.1007
Attention for 'dog': 0.0688

For the word: 'brown':
Attention for 'The': 0.0882
Attention for 'quick': 0.0670
Attention for 'brown': 0.1164
Attention for 'fox': 0.0711
Attention for 'jumps': 0.0824
Attention for 'over': 0.0801
Attention for 'the': 0.0949
Attention for 'lazy': 0.0778
Attention for 'dog': 0.1089

For the word: 'fox':
Attention for 'The': 0.1468
Attention for 

## Simplified Example of Self-Attention

This notebook demonstrates a simplified example of how the **self-attention** mechanism works using a basic sentence.

### Overview

- We use the sentence “The quick brown fox jumps over the lazy dog” as an example.
- Simple random embeddings are generated for each word.
- A simplified version of the self-attention mechanism is implemented.
- We analyze how each word relates to the others in the sentence based on attention weights.

### Key Components

**Embedding Creation**

- The `make_word_embedding` function generates random vectors to represent words.
- In real models, these embeddings are learned during training and capture semantic meaning.

**Softmax Function**

- Converts attention scores into probabilities, allowing the model to assign different levels of importance to each word in context.

**Simplified Self-Attention**

- The `self_attention` function simulates the self-attention process:
  - Creates Q (query), K (key), and V (value) matrices from the same word embeddings.
  - Computes attention scores by taking the dot product between Q and K.
  - Applies softmax to the scores to obtain attention weights.
  - Computes weighted values by multiplying attention weights with the V matrix.

### Process

1. The sentence is split into words.
2. Each word is transformed into a random embedding vector.
3. The self-attention function is applied to compute how each word "attends" to the others.
4. The output displays attention weights and which word received the most attention from each one.

### Results

- The code prints the attention weight matrix between each pair of words.
- It also provides a simple interpretation of which word receives the most attention from each word.

### Notes

- This is a highly simplified, illustrative example for educational purposes.
- In real Transformer models, the attention mechanism is more complex, including:
  - Separate linear projections for Q, K, and V.
  - Scaling of attention scores.
  - Multi-head attention.
  - Additional layers and learned parameters during training.

However, this example helps visualize how context is considered in natural language processing through self-attention.


In [None]:
import numpy as np

# Example phrase
phrase = "Vorges Data is a personal blog about data science and machine learning."

# Function to make a simple embedding
def make_embedding(word):
    return np.random.rand(4)

# Function to make a simple softmax
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

# Making embeddings for the words
words = phrase.split()
word_embeddings = [make_embedding(w) for w in words]

# Making the attention weights
def attention_weights(word_embeddings):
    # Making the query, key and value matrices
    Q = np.array(word_embeddings)
    K = np.array(word_embeddings)
    V = np.array(word_embeddings)

    # Calculating the attention scores
    scores = np.dot(Q, K.T)

    # Applying the softmax function to the scores
    attention_weights = softmax(scores)

    # Calculating the weighted sum of the word embeddings
    weighted_values = np.dot(attention_weights, V)

    return attention_weights, weighted_values

# Apply self-attention to the word embeddings
attention_weights, weighted_values = attention_weights(word_embeddings)

# Print the attention weights and weighted values
print("Original phrase: ", phrase)
print("\nWeighted of Attention:")
for i, w in enumerate(word):
    print(f"\nFor the word: '{w}':")
    # Print only 3 words with the highest attention weights
    top_indices = sorted(range(len(attention_weights[i])), key=lambda j: attention_weights[i][j], reverse=True)[:3]
    for j in top_indices:
        print(f"Attention for '{word[j]}': {attention_weights[i][j]:.4f}")

print("\nExplanation:")
for i, w in enumerate(word):
    max_attention_weight = np.argmax(attention_weights[i])
    print(f"{w} has the highest attention weight with {word[max_attention_weight]}")

## Comparative Example of Simplified Self-Attention

This notebook explores two simplified implementations of the **self-attention** mechanism using different sentences. Both examples aim to illustrate how words in a sentence "attend to" each other based on their vector representations (embeddings).

---

### Example 1 — Short Sentence

**Sentence:**
"The quick brown fox jumps over the lazy dog"

**Overview:**
- Consists of 9 words.
- All words are treated equally and embedded into random vectors.
- Attention weights are calculated for every word pair.
- The result shows attention for all words per row in the matrix.

**Output Style:**
- Displays full attention weights (9x9 matrix).
- For each word, shows attention with all others in the sentence.

---

### Example 2 — Longer and Richer Sentence

**Sentence:**
"Vorges Data is a personal blog about data science and machine learning."

**Overview:**
- Consists of 11 words, including domain-specific terms.
- Same attention mechanism is used: embeddings, dot product for attention scores, softmax normalization, and weighted value computation.
- Introduces named entities like "Vorges Data" and multi-word concepts such as "machine learning".

**Output Style:**
- To improve readability, the output is limited to only the **top 3 most attended words** for each word.
- This makes the results easier to interpret with a larger attention matrix (11x11).

---

### Key Differences

| Aspect                        | Example 1                            | Example 2                                 |
|------------------------------|--------------------------------------|-------------------------------------------|
| Sentence length              | 9 words                              | 11 words                                  |
| Output detail                | Full matrix for all word pairs       | Top 3 attention weights per word          |
| Named entities               | None                                  | "Vorges Data", "machine learning"         |
| Context complexity           | Simple sentence with generic words   | Richer context with technical vocabulary  |
| Readability adjustment       | None                                  | Output trimmed for clarity                |

---

### Important Notes

- These examples are **simplified** for educational purposes.
- In real-world models (e.g., Transformers):
  - Q, K, and V are derived through separate learned linear projections.
  - Attention is computed across multiple **heads** and **layers**.
  - Embeddings are pre-trained and semantically rich.
  - Attention is often scaled and masked depending on the task (e.g., causal masking for language generation).

---

### Conclusion

These two examples demonstrate how the self-attention mechanism adapts to sentences of different lengths and complexities. While the underlying logic remains the same, longer or more semantically rich sentences require better visualization strategies (like limiting output) to make interpretation easier. This reflects a core strength of attention mechanisms: flexibility in handling variable-length input and contextual relevance.
