# Linear Algebra for Deep Learning: The Mathematical Foundation

## 🎯 Introduction

Welcome to the mathematical backbone of deep learning! This notebook aims to take you from using linear algebra operations to understanding why they work and how neural networks depend on them. Nearly every operation in deep learning, from simple matrix multiplication to attention mechanisms, relies on linear algebra.

### 🧠 What You'll Master

This comprehensive guide covers:
- **Vector operations**: The building blocks of all neural network computations
- **Matrix multiplication**: The fundamental operation that powers every layer
- **Eigenvalues and eigenvectors**: Understanding principal components and transformations
- **Matrix decompositions**: SVD, PCA, and their roles in model compression
- **Norms and distances**: Essential for optimization and regularization

### 🎓 Prerequisites

- Basic understanding of vectors and matrices
- Familiarity with PyTorch tensor operations
- Elementary knowledge of coordinate systems
- High school algebra and basic calculus

### 🚀 Why Linear Algebra is Deep Learning's Language

Linear algebra enables deep learning because:
- **Efficient computation**: Matrix operations are highly optimized on GPUs
- **Dimensional transformations**: Networks learn to map between feature spaces
- **Batch processing**: Linear operations naturally handle multiple samples
- **Optimization**: Gradients and parameter updates are vector operations
- **Interpretability**: Understanding transformations helps debug and improve models

---

## 📚 Table of Contents

1. **[Vectors and Vector Spaces](#vectors-and-vector-spaces)** - The fundamental building blocks
2. **[Matrix Operations in Deep Learning](#matrix-operations-in-deep-learning)** - Core transformations
3. **[Eigendecomposition and PCA](#eigendecomposition-and-pca)** - Understanding data structure
4. **[Matrix Norms and Distances](#matrix-norms-and-distances)** - Measuring and regularizing
5. **[Advanced Decompositions](#advanced-decompositions)** - SVD and low-rank approximations

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Vectors and Vector Spaces

### 📐 The Foundation of All Neural Network Operations

Vectors are the fundamental data structure in deep learning. Every input, output, parameter, and gradient is a vector or collection of vectors. Understanding vector operations deeply is essential for mastering how neural networks transform information.
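
As a quick reference before the code, here are the two quantities this section leans on most, the dot product and the p-norm, with a worked example. The numbers use the vectors $\mathbf{a} = (2, 1, 3)$ and $\mathbf{b} = (1, 4, 2)$ that appear in the code cell of this section; the angle $\theta$ is introduced here only for illustration.

$$\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\|_2 \, \|\mathbf{b}\|_2 \cos\theta, \qquad \|\mathbf{a}\|_p = \Big(\sum_i |a_i|^p\Big)^{1/p}$$

$$\mathbf{a} \cdot \mathbf{b} = 2 \cdot 1 + 1 \cdot 4 + 3 \cdot 2 = 12, \qquad \|\mathbf{a}\|_2 = \sqrt{4 + 1 + 9} = \sqrt{14} \approx 3.742, \qquad \|\mathbf{a}\|_1 = 2 + 1 + 3 = 6$$

$$\|\mathbf{b}\|_2 = \sqrt{1 + 16 + 4} = \sqrt{21} \approx 4.583, \qquad \cos\theta = \frac{12}{\sqrt{14}\,\sqrt{21}} \approx 0.700, \qquad \theta \approx 45.6^\circ$$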

In [None]:
# =============================================================================
# VECTORS: THE BUILDING BLOCKS OF DEEP LEARNING
# =============================================================================

print("📐 Vector Operations in Deep Learning Context")
print("=" * 50)

# Vectors in deep learning represent features, embeddings, parameters, gradients
print("🎯 What Vectors Represent in Neural Networks:")
print("• Input features: [height, weight, age, income]")
print("• Word embeddings: [semantic_dim_1, semantic_dim_2, ..., semantic_dim_n]") 
print("• Hidden activations: [neuron_1_output, neuron_2_output, ...]")
print("• Gradients: [∂loss/∂param_1, ∂loss/∂param_2, ...]")
print("• Model parameters: [weight_1, weight_2, bias_1, ...]")

# Demonstrate fundamental vector operations
a = torch.tensor([2.0, 1.0, 3.0])    # Feature vector (e.g., RGB color)
b = torch.tensor([1.0, 4.0, 2.0])    # Another feature vector

print(f"\n📊 Basic Vector Operations")
print("=" * 30)
print(f"Vector a: {a}")
print(f"Vector b: {b}")

# Addition: Combining features or gradients
vector_sum = a + b
print(f"\nAddition (a + b): {vector_sum}")
print("Use case: Combining gradients from different loss terms")

# Scalar multiplication: Scaling learning rates or regularization
scaled = 0.5 * a
print(f"\nScalar multiplication (0.5 * a): {scaled}")
print("Use case: Applying learning rate to gradient updates")

# Dot product: Similarity, projections, attention scores
dot_product = torch.dot(a, b)
print(f"\nDot product (a · b): {dot_product.item():.3f}")
print("Use case: Computing attention scores, similarity measures")

# Norms: Magnitude, regularization, gradient clipping
l2_norm = torch.norm(a, p=2)
l1_norm = torch.norm(a, p=1)
print(f"\nL2 norm (||a||₂): {l2_norm.item():.3f}")
print(f"L1 norm (||a||₁): {l1_norm.item():.3f}")
print("Use case: L2 for weight decay, L1 for sparsity, gradient clipping")

print(f"\n🎯 Vector Geometry in High Dimensions")
print("=" * 50)

# Demonstrate how vectors behave in high-dimensional spaces
dims = [2, 10, 100, 1000]
n_trials = 1000  # Average over many random pairs; a single pair is too noisy
print("Dimension | Vector type  | Avg |dot product| | Avg |angle - 90°|")
print("----------|--------------|-------------------|------------------")

for dim in dims:
    # Generate pairs of random unit vectors (one pair per row)
    v1 = torch.randn(n_trials, dim)
    v1 = v1 / torch.norm(v1, dim=1, keepdim=True)  # Normalize rows to unit length
    
    v2 = torch.randn(n_trials, dim)
    v2 = v2 / torch.norm(v2, dim=1, keepdim=True)  # Normalize rows to unit length
    
    # Row-wise dot products (cosines, since the vectors are unit length) and angles
    dots = (v1 * v2).sum(dim=1).clamp(-1, 1)
    avg_abs_dot = dots.abs().mean().item()
    avg_dev_deg = (torch.rad2deg(torch.acos(dots)) - 90).abs().mean().item()
    
    print(f"{dim:9d} | Unit vectors | {avg_abs_dot:17.4f} | {avg_dev_deg:17.1f}")

print(f"\n💡 High-Dimensional Insight:")
print("As dimensions increase, random vectors become nearly orthogonal!")
print("This is one reason embeddings can represent many concepts with little interference.")

print(f"\n🧮 Linear Combinations: The Core of Neural Networks")
print("=" * 50)

# A linear combination plus a bias is what every fully connected layer computes
print("Every fully connected (linear) layer computes: y = W @ x + b")
print("Each output is a linear combination of the input features, plus a bias!")

# Demonstrate with a mini example
input_features = torch.tensor([0.5, 0.8, 0.2])  # 3 input features
weights = torch.tensor([                          # 2 neurons, 3 inputs each
    [1.0, -0.5, 2.0],   # Neuron 1 weights
    [0.3, 1.5, -1.0]    # Neuron 2 weights
])
bias = torch.tensor([0.1, -0.2])

print(f"\nInput features: {input_features}")
print(f"Weight matrix shape: {weights.shape}")
print(f"Bias vector: {bias}")

# Matrix multiplication + bias (linear layer computation)
output = weights @ input_features + bias
print(f"\nLayer output: {output}")

print(f"\nWhat each neuron computes:")
for i, (w_row, b_val) in enumerate(zip(weights, bias)):
    linear_combo = torch.dot(w_row, input_features) + b_val
    print(f"Neuron {i+1}: {w_row[0]:.1f}*{input_features[0]:.1f} + {w_row[1]:.1f}*{input_features[1]:.1f} + {w_row[2]:.1f}*{input_features[2]:.1f} + {b_val:.1f} = {linear_combo:.3f}")

print(f"\n🔍 Vector Spaces in Deep Learning")
print("=" * 50)

# Show how neural networks learn to transform vector spaces
print("Neural networks learn transformations between vector spaces:")
print("• Input space: Raw features (pixels, words, sensor readings)")
print("• Hidden spaces: Learned representations (edges, concepts, patterns)")
print("• Output space: Task-specific features (class probabilities, predictions)")

# Demonstrate basis vectors and linear independence
print(f"\n📐 Basis Vectors and Linear Independence")
print("-" * 40)

# Standard basis in 3D
e1 = torch.tensor([1.0, 0.0, 0.0])
e2 = torch.tensor([0.0, 1.0, 0.0]) 
e3 = torch.tensor([0.0, 0.0, 1.0])

print(f"Standard basis vectors:")
print(f"e₁: {e1}")
print(f"e₂: {e2}")
print(f"e₃: {e3}")

# Any vector can be written as a linear combination of basis vectors
arbitrary_vector = torch.tensor([2.5, -1.0, 3.2])
print(f"\nArbitrary vector: {arbitrary_vector}")
print(f"As linear combination: {arbitrary_vector[0]:.1f}*e₁ + {arbitrary_vector[1]:.1f}*e₂ + {arbitrary_vector[2]:.1f}*e₃")

# Verify this is correct
reconstructed = arbitrary_vector[0]*e1 + arbitrary_vector[1]*e2 + arbitrary_vector[2]*e3
print(f"Reconstructed: {reconstructed}")
print(f"Match: {torch.allclose(arbitrary_vector, reconstructed)}")

print(f"\n💡 Deep Learning Connection:")
print("Neural networks learn new basis vectors (features) that are more")
print("useful for the task than the original input features!")
print("Each hidden layer finds a new coordinate system for the data.")
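
The cell above names attention scores and similarity measures as use cases for the dot product. The following is a minimal sketch of both, assuming illustrative shapes and variable names (`queries`, `keys`, `d`) that are not part of the original notebook. It shows that a matrix of attention scores is just all pairwise dot products, and that cosine similarity is a dot product of length-normalized vectors.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

d = 4                         # embedding dimension (illustrative choice)
queries = torch.randn(2, d)   # 2 hypothetical query vectors
keys = torch.randn(3, d)      # 3 hypothetical key vectors

# scores[i, j] is the dot product of query i with key j,
# divided by sqrt(d) as in scaled dot-product attention.
scores = queries @ keys.T / math.sqrt(d)

# Softmax turns each row of scores into weights that sum to 1.
attn_weights = torch.softmax(scores, dim=-1)

# Cosine similarity: dot product divided by the product of the L2 norms.
a = torch.tensor([2.0, 1.0, 3.0])
b = torch.tensor([1.0, 4.0, 2.0])
cos_manual = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
cos_builtin = torch.nn.functional.cosine_similarity(a, b, dim=0)

print(f"scores shape: {tuple(scores.shape)}")
print(f"row sums of attention weights: {attn_weights.sum(dim=-1)}")
print(f"cosine similarity: manual {cos_manual.item():.4f}, built-in {cos_builtin.item():.4f}")
```

Real attention layers also apply learned projections and multiply the weights by value vectors; this sketch stops at the scores to keep the focus on the dot product.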

## Matrix Transpose

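**Formula:** $(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}$

Essential for backpropagation: when the forward pass multiplies by a matrix, the backward pass multiplies the incoming gradient by its transpose, which "reverses" the forward direction of information flow.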

In [None]:
# Forward pass: y = x @ W.T
# Backward pass: dx = dy @ W and dW = dy.T @ x (autograd applies these transposes for us)
x = torch.randn(32, 784, requires_grad=True)
W = torch.randn(128, 784, requires_grad=True)
y = x @ W.T

# Create dummy loss and backpropagate
loss = y.sum()
loss.backward()

print(f"Forward: x {x.shape} @ W.T {W.T.shape} = y {y.shape}")
print(f"Gradient flows back through transpose automatically")
print(f"x.grad shape: {x.grad.shape}")  # Same as x.shape
print(f"W.grad shape: {W.grad.shape}")  # Same as W.shape
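
The cell above lets autograd handle the backward pass. As a sanity check, the same gradients can be written by hand with explicit transposes. This is a small sketch with made-up shapes and a random upstream gradient `dy` (both are assumptions for illustration), comparing the manual formulas against what autograd computes.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 6, requires_grad=True)   # batch of 4 inputs, 6 features
W = torch.randn(3, 6, requires_grad=True)   # 3 output units, 6 inputs each
y = x @ W.T                                 # forward pass, shape (4, 3)

# Upstream gradient dL/dy. With loss = y.sum() it would be all ones;
# a random one makes the check less trivial.
dy = torch.randn_like(y)
y.backward(dy)

# Manual backward pass for y = x @ W.T
dx_manual = dy @ W.detach()      # (4, 3) @ (3, 6) -> (4, 6)
dW_manual = dy.T @ x.detach()    # (3, 4) @ (4, 6) -> (3, 6)

print(f"dx matches autograd: {torch.allclose(x.grad, dx_manual, atol=1e-6)}")
print(f"dW matches autograd: {torch.allclose(W.grad, dW_manual, atol=1e-6)}")
```

The forward pass multiplies by `W.T`, so the gradient with respect to `x` multiplies by `W`, the transpose of the forward matrix.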

## Matrix Inverse

**Formula:** $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$

Defined only for square, full-rank matrices. Used in analytical solutions and for understanding which linear transformations can be undone.

In [None]:
# Normal equations for linear regression: θ = (X.T @ X)^(-1) @ X.T @ y
n_samples, n_features = 100, 10
X = torch.randn(n_samples, n_features)
true_theta = torch.randn(n_features)
y = X @ true_theta + 0.1 * torch.randn(n_samples)

# Analytical solution using matrix inverse
XtX = X.T @ X
XtX_inv = torch.inverse(XtX)
theta_analytical = XtX_inv @ X.T @ y

print(f"True theta: {true_theta[:3]}")
print(f"Estimated theta: {theta_analytical[:3]}")
print(f"Error: {torch.norm(true_theta - theta_analytical):.6f}")

# Note: In practice, use torch.linalg.lstsq for numerical stability
theta_stable = torch.linalg.lstsq(X, y).solution
print(f"Stable solution: {theta_stable[:3]}")
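
To make the defining identity concrete, here is a short sketch that checks $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$ numerically and compares solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ via an explicit inverse against `torch.linalg.solve`. The matrix is constructed to be well conditioned; that construction and the variable names are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# A well-conditioned square matrix: M @ M.T is positive semi-definite,
# and adding 5 * I pushes every eigenvalue to at least 5.
M = torch.randn(5, 5, dtype=torch.float64)
A = M @ M.T + 5.0 * torch.eye(5, dtype=torch.float64)
b = torch.randn(5, dtype=torch.float64)

# Check the defining identity A @ A^{-1} = I
A_inv = torch.linalg.inv(A)
identity_error = torch.norm(A @ A_inv - torch.eye(5, dtype=torch.float64))

# Two ways to solve A x = b
x_via_inverse = A_inv @ b
x_via_solve = torch.linalg.solve(A, b)   # solves directly, no explicit inverse

print(f"||A @ A_inv - I||: {identity_error.item():.2e}")
print(f"Solutions agree: {torch.allclose(x_via_inverse, x_via_solve)}")
print(f"Residual ||A x - b||: {torch.norm(A @ x_via_solve - b).item():.2e}")
```

For a well-conditioned matrix like this one the two approaches agree closely. Solving directly is generally preferred because it skips forming the inverse.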

## Eigenvalues & Eigenvectors

**Formula:** $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$

Reveals principal directions of data variation and helps analyze gradient flow.

In [None]:
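# Analyzing weight matrix conditioning
# float64 keeps the smallest eigenvalue accurate
W = torch.randn(100, 100, dtype=torch.float64)
# W @ W.T is symmetric positive semi-definite, so use the symmetric solver:
# it returns real eigenvalues in ascending order
eigenvals = torch.linalg.eigvalsh(W @ W.T)

# Eigenvalues of W @ W.T are the squared singular values of W,
# so this ratio is cond(W @ W.T) = cond(W)**2
condition_number = eigenvals.max() / eigenvals.min()
print(f"Condition number of W @ W.T: {condition_number:.2f}")
print(f"Condition number of W: {condition_number.sqrt():.2f}")
print(f"Max eigenvalue: {eigenvals.max():.2f}")
print(f"Min eigenvalue: {eigenvals.min():.2f}")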

# PCA example - find principal components
data = torch.randn(1000, 50)  # 1000 samples, 50 features
centered_data = data - data.mean(dim=0)
cov_matrix = (centered_data.T @ centered_data) / (len(data) - 1)

eigenvals, eigenvecs = torch.linalg.eigh(cov_matrix)  # For symmetric matrices
# eigh returns eigenvalues in ascending order; sort descending so the largest variance comes first
sorted_indices = torch.argsort(eigenvals, descending=True)
principal_components = eigenvecs[:, sorted_indices]

print(f"Explained variance ratios: {eigenvals[sorted_indices][:5] / eigenvals.sum()}")
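
The cell above computes principal components but never checks the eigenvalue equation itself or projects the data. The sketch below does both on a small synthetic dataset. The per-feature `scales` and the choice of two components are assumptions for illustration. It verifies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ for every eigenpair and confirms that the variance of the data along each principal component equals the matching eigenvalue.

```python
import torch

torch.manual_seed(0)

# Synthetic data with a different spread per feature (float64 for tight checks)
scales = torch.tensor([5.0, 2.0, 1.0, 0.5], dtype=torch.float64)
data = torch.randn(500, 4, dtype=torch.float64) * scales
centered = data - data.mean(dim=0)
cov = centered.T @ centered / (len(data) - 1)

# eigh is for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = torch.linalg.eigh(cov)

# Check A v = lambda v for all eigenpairs at once:
# multiplying by the eigenvalue row vector scales column j by eigenvalue j.
residual = torch.norm(cov @ eigenvectors - eigenvectors * eigenvalues)

# Project the centered data onto the top-2 principal components
order = torch.argsort(eigenvalues, descending=True)
top2 = eigenvectors[:, order[:2]]
projected = centered @ top2              # shape (500, 2)

# Sample variance along each component equals the matching eigenvalue
projected_var = projected.var(dim=0)

print(f"||A V - V diag(lambda)||: {residual.item():.2e}")
print(f"Top-2 eigenvalues:  {eigenvalues[order[:2]]}")
print(f"Projected variance: {projected_var}")
```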

## Singular Value Decomposition (SVD)

**Formula:** $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$

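Decomposes any matrix into a rotation or reflection ($\mathbf{V}^T$), an axis-aligned scaling ($\mathbf{\Sigma}$), and a second rotation or reflection ($\mathbf{U}$).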

In [None]:
# SVD for dimensionality reduction and analysis
data = torch.randn(1000, 100)  # High-dimensional data

# Perform SVD
U, S, Vt = torch.linalg.svd(data, full_matrices=False)

# Analyze the singular values
print(f"Data shape: {data.shape}")
print(f"U shape: {U.shape}")    # Left singular vectors
print(f"S shape: {S.shape}")    # Singular values
print(f"Vt shape: {Vt.shape}")  # Right singular vectors

# Reconstruct with fewer components (dimensionality reduction)
k = 20  # Keep top 20 components
data_reduced = U[:, :k] @ torch.diag(S[:k]) @ Vt[:k, :]

reconstruction_error = torch.norm(data - data_reduced)
# Values stored for rank k (U[:, :k], S[:k], Vt[:k, :]) relative to the full matrix
compression_ratio = (k * (U.shape[0] + Vt.shape[1] + 1)) / (data.shape[0] * data.shape[1])

print(f"Reconstruction error: {reconstruction_error:.2f}")
print(f"Compression ratio: {compression_ratio:.2%}")
print(f"Variance explained by top {k} components: {(S[:k]**2).sum() / (S**2).sum():.2%}")

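# SVD for weight initialization (orthogonal initialization)
def svd_init(tensor):
    """Return a same-shape matrix with orthonormal columns (tall) or rows (wide)"""
    if tensor.dim() >= 2:
        U, _, Vt = torch.linalg.svd(tensor, full_matrices=False)
        return U if U.shape == tensor.shape else Vt
    return tensor

# torch.empty returns uninitialized memory, so start from random values instead
weight = torch.randn(128, 64)
orthogonal_weight = svd_init(weight)
# A 128x64 matrix can only have orthonormal columns: check W.T @ W = I (64x64)
print(f"Orthogonality check: {torch.norm(orthogonal_weight.T @ orthogonal_weight - torch.eye(64)):.6f}")
# Built-in alternative: torch.nn.init.orthogonal_(weight)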
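
The properties claimed by the SVD formula can be checked directly. The following sketch, on a small random matrix chosen only for illustration, verifies that the singular vectors are orthonormal and that the factors reconstruct the matrix. It also checks that the Frobenius error of a rank-`k` truncation equals the root of the sum of the squared discarded singular values, which is the Eckart-Young result.

```python
import torch

torch.manual_seed(0)

A = torch.randn(60, 40, dtype=torch.float64)   # float64 for tight checks
U, S, Vt = torch.linalg.svd(A, full_matrices=False)

# Orthonormality of the singular vectors
I = torch.eye(40, dtype=torch.float64)
u_error = torch.norm(U.T @ U - I)      # columns of U are orthonormal
v_error = torch.norm(Vt @ Vt.T - I)    # rows of Vt are orthonormal

# Exact reconstruction from all components
full_error = torch.norm(A - U @ torch.diag(S) @ Vt)

# Rank-k truncation: the error depends only on the discarded singular values
k = 10
A_k = U[:, :k] @ torch.diag(S[:k]) @ Vt[:k, :]
truncation_error = torch.norm(A - A_k)          # Frobenius norm
predicted_error = torch.sqrt((S[k:] ** 2).sum())

print(f"||U.T U - I||: {u_error.item():.2e}, ||Vt Vt.T - I||: {v_error.item():.2e}")
print(f"Full reconstruction error: {full_error.item():.2e}")
print(f"Rank-{k} error: {truncation_error.item():.4f} (predicted {predicted_error.item():.4f})")
```

This is why the singular value spectrum alone tells you how much a low-rank approximation will lose before you build it.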