### The Premise

Imagine a super librarian in a magical library where books can talk to each other. This librarian’s job is to translate a story from one language to another or generate a continuation of a story. The library is full of words, and each word is like a book with its own meaning and position in the story.  

### The Problem
When the librarian reads a sentence like “Hi, how are you?”, they need to work out which words matter most in order to generate a response like “I am fine.” Older librarians (Recurrent Neural Networks, or RNNs) read one word at a time and squeeze everything they have read into a single running memory, so details from the beginning of a long story tend to fade. This super librarian instead looks at every word of the story at once and decides how much attention each one deserves.

### How It Works

1. Word Books (Embeddings)
2. Position Tags (Positional Encoding)
3. Attention Magic (Self-Attention)
4. Teamwork (Multi-Headed Attention)
5. Processing and Refining (Feed-Forward Layers)
6. Translating or Generating (Encoder-Decoder)
    - Encoder: Reads the input and uses attention to build a context-aware summary of every word
    - Decoder: Uses that summary to generate the output one word at a time
7. Final Touch (Softmax)
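
To make steps 1 to 3 and step 7 a little more concrete, here is a minimal NumPy sketch of word embeddings, position tags, and a single attention head. It is only an illustration under stated assumptions: the sinusoidal position scheme, the random weights, and the helper names (`positional_encoding`, `softmax`, `self_attention`) are choices made for this example and are not necessarily what the rest of this notebook uses.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, independent of any global seed

# Toy sizes, assumed for illustration only
vocab_size, d_model, seq_length = 10, 8, 5

def positional_encoding(seq_length, d_model):
    """Sinusoidal position tags: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_length)[:, None]       # (seq_length, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product attention for one sequence and one head."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each word looks at each other word
    weights = softmax(scores, axis=-1)       # each row becomes a probability distribution
    return weights @ V, weights

tokens = np.array([8, 0, 0, 3, 8])                  # one toy input sequence
embedding = rng.normal(size=(vocab_size, d_model))  # word books (random, untrained)
x = embedding[tokens] + positional_encoding(seq_length, d_model)

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(x, W_q, W_k, W_v)

print(out.shape)      # (5, 8): one refined vector per word
print(weights.shape)  # (5, 5): one attention row per word
```

Each row of `weights` sums to 1, so it can be read as how the librarian divides attention for one word across the whole sentence.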

### Why It's Useful

Unlike older librarians (RNNs), who pass a single running memory from word to word, the super librarian’s attention magic lets them look at every word of the story directly. This makes them great at tasks like translating languages, writing stories, or answering questions, even when the words that matter sit far apart in the text.  


This *super librarian* is the **Transformer**, and its attention magic is why it’s so powerful in natural language processing (NLP) tasks!

```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Hyperparameters
vocab_size = 10  # Numbers 0-9
d_model = 8      # Embedding dimension
n_heads = 2      # Number of attention heads
d_ff = 16        # Feed-forward dimension
seq_length = 5   # Sequence length
```

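We use NumPy for all matrix operations.  

The hyperparameters define the model size:  
`vocab_size` is the number of distinct tokens (the digits 0-9),  
`d_model` is the embedding size,  
`n_heads` is the number of attention heads (in the usual multi-head setup, `d_model` is divided evenly among them, so each head would work with 8 / 2 = 4 dimensions),  
`d_ff` is the hidden size of the feed-forward layer, and  
`seq_length` is the number of tokens in each input sequence.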
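
As a quick sanity check, the sketch below derives the array shapes these hyperparameters imply. It assumes the standard multi-head layout in which `d_model` is split evenly across heads; `d_head` and the `shapes` dictionary are names introduced here for illustration.

```python
# Values copied from the hyperparameter cell above
vocab_size, d_model, n_heads, d_ff, seq_length = 10, 8, 2, 16, 5

# Assumption: each head gets an equal slice of the embedding dimension
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_head = d_model // n_heads

shapes = {
    "embedding table": (vocab_size, d_model),
    "one embedded sequence": (seq_length, d_model),
    "queries/keys/values per head": (seq_length, d_head),
    "feed-forward hidden activations": (seq_length, d_ff),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```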

```python
def generate_data(num_samples, seq_length, vocab_size):
    """Generate random sequences and their reversed versions."""
    X = np.random.randint(0, vocab_size, (num_samples, seq_length))
    y = np.flip(X, axis=1)
    return X, y

# Generate 100 samples
X, y = generate_data(100, seq_length, vocab_size)
print("Example input:", X[0])
print("Example output:", y[0])

print("\nData shape:", X.shape, y.shape)
```

```text
Example input: [8 0 0 3 8]
Example output: [8 3 0 0 8]

Data shape: (100, 5) (100, 5)
```
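
The task, then, is to learn sequence reversal: position `j` of the target holds the token from position `seq_length - 1 - j` of the input. The short check below illustrates that property on a separate toy dataset. The names `X_demo`, `y_demo`, and `copy_accuracy` are hypothetical and exist only in this sketch, which uses its own random generator so that it does not disturb the seeded data above.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves np.random's global state alone
X_demo = rng.integers(0, 10, size=(100, 5))
y_demo = np.flip(X_demo, axis=1)

# Flipping along axis 1 is the same as reading each row backwards
assert np.array_equal(y_demo, X_demo[:, ::-1])

# Reversing twice recovers the original sequences
assert np.array_equal(np.flip(y_demo, axis=1), X_demo)

# Target position j holds the input token from position seq_length - 1 - j
n_positions = X_demo.shape[1]
for j in range(n_positions):
    assert np.array_equal(y_demo[:, j], X_demo[:, n_positions - 1 - j])

# Baseline: how often would simply copying the input already be right?
# The middle position always matches, the others only by chance.
copy_accuracy = (X_demo == y_demo).mean()
print(f"copy baseline accuracy: {copy_accuracy:.2f}")
```

A model that has actually learned the reversal should score well above this copy baseline.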
