# Vectors & Geometry for Neural Networks

This notebook contains PyTorch examples demonstrating vector and geometric concepts essential for understanding neural networks.

## Table of Contents
1. [Dot Product](#dot-product)
2. [Cosine Similarity](#cosine-similarity)
3. [Euclidean Distance](#euclidean-distance)
4. [Lp Norms](#lp-norms)

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Dot Product

**Formula:** $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i = \|\mathbf{a}\|\|\mathbf{b}\|\cos(\theta)$

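Measures how "aligned" two vectors are, weighted by their magnitudes: it is large and positive when they point the same way, zero when they are orthogonal, and negative when they oppose. It is the core operation in a neuron's weighted sum of inputs.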
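
As a worked instance of both forms of the formula, here is the arithmetic for the two 4-dimensional vectors used in the code cell of this section (the values come from this notebook; the rounding is mine):

$$\mathbf{a} \cdot \mathbf{b} = (1.0)(0.3) + (0.5)(0.7) + (-0.2)(-0.1) + (0.8)(0.4) = 0.30 + 0.35 + 0.02 + 0.32 = 0.99$$

$$\|\mathbf{a}\| = \sqrt{1.93} \approx 1.389,\quad \|\mathbf{b}\| = \sqrt{0.75} \approx 0.866,\quad \cos(\theta) = \frac{0.99}{1.389 \times 0.866} \approx 0.823$$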

In [None]:
# Neuron activation using dot product
input_vector = torch.tensor([1.0, 0.5, -0.2, 0.8])
weight_vector = torch.tensor([0.3, 0.7, -0.1, 0.4])

# Pre-activation: dot product of inputs and weights (no bias or nonlinearity yet)
activation = torch.dot(input_vector, weight_vector)
print(f"Activation: {activation:.3f}")

# Understanding geometric interpretation
magnitude_a = torch.norm(input_vector)
magnitude_b = torch.norm(weight_vector)
cosine_angle = activation / (magnitude_a * magnitude_b)
# Clamp guards against floating-point values slightly outside [-1, 1]
angle_degrees = torch.rad2deg(torch.acos(cosine_angle.clamp(-1.0, 1.0)))

print(f"Input magnitude: {magnitude_a:.3f}")
print(f"Weight magnitude: {magnitude_b:.3f}")
print(f"Cosine of angle: {cosine_angle:.3f}")
print(f"Angle between vectors: {angle_degrees:.1f}°")

# Batch processing - multiple samples
batch_inputs = torch.randn(32, 4)  # 32 samples, 4 features
batch_activations = batch_inputs @ weight_vector
print(f"Batch activations shape: {batch_activations.shape}")
print(f"Sample activations: {batch_activations[:5]}")
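
The batched product above covers a single neuron. A minimal sketch of the same idea for a whole layer follows, assuming a hypothetical layer of 3 neurons with a bias term (the sizes, the seed, and the names `weight_matrix` and `bias` are illustrative, not from the original cell). Each output entry is one dot product between a sample and one neuron's weight row.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical layer sizes: 4 input features, 3 neurons
batch_inputs = torch.randn(32, 4)
weight_matrix = torch.randn(3, 4)  # one weight vector per row, one row per neuron
bias = torch.randn(3)

# Column j of the result holds the dot product of every sample with weight row j
manual = batch_inputs @ weight_matrix.T + bias

# Same computation through PyTorch's functional linear layer
builtin = torch.nn.functional.linear(batch_inputs, weight_matrix, bias)

# Spot-check one entry against an explicit torch.dot
single = torch.dot(batch_inputs[0], weight_matrix[1]) + bias[1]

print(manual.shape)  # torch.Size([32, 3])
print(torch.allclose(manual, builtin, atol=1e-5))
print(torch.allclose(manual[0, 1], single, atol=1e-5))
```

Storing one weight vector per row matches the `(out_features, in_features)` layout that `torch.nn.functional.linear` expects, which is why the manual version transposes before multiplying.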

## Cosine Similarity

**Formula:** $\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}$

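Measures similarity independent of magnitude: only the angle between the vectors matters. Commonly used to compare word embeddings. Standard Transformer attention uses scaled dot products rather than cosine similarity, though cosine-based attention variants exist.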

In [None]:
# Word embedding similarity (toy hand-written vectors, not trained embeddings)
word1 = torch.tensor([0.2, 0.8, -0.1, 0.3, 0.7])  # "king"
word2 = torch.tensor([0.1, 0.7, -0.2, 0.2, 0.6])  # "queen"
word3 = torch.tensor([-0.5, 0.1, 0.8, -0.3, 0.2])  # "apple"

def cosine_similarity(a, b):
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

sim_king_queen = cosine_similarity(word1, word2)
sim_king_apple = cosine_similarity(word1, word3)
sim_queen_apple = cosine_similarity(word2, word3)

print(f"King-Queen similarity: {sim_king_queen:.3f}")
print(f"King-Apple similarity: {sim_king_apple:.3f}")
print(f"Queen-Apple similarity: {sim_queen_apple:.3f}")

# Attention-style weighting using cosine similarity
# (standard Transformer attention uses scaled dot products instead)
query = torch.randn(1, 64)
keys = torch.randn(10, 64)  # 10 different keys

# Compute attention weights using cosine similarity
similarities = torch.nn.functional.cosine_similarity(query, keys, dim=1)
attention_weights = torch.softmax(similarities, dim=0)

print(f"Attention weights: {attention_weights}")
print(f"Index of key with highest attention weight: {attention_weights.argmax().item()}")
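
To relate the cosine-based weighting above to the more common scaled dot-product form, here is a small sketch under stated assumptions: the seed and variable names are mine, and the `sqrt(d)` scaling is the usual Transformer convention rather than anything defined in this notebook. It first checks that cosine similarity is the dot product of L2-normalized vectors, then computes scaled dot-product scores for comparison.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

query = torch.randn(1, 64)
keys = torch.randn(10, 64)

# Cosine similarity equals the dot product of L2-normalized vectors
unit_query = F.normalize(query, dim=1)
unit_keys = F.normalize(keys, dim=1)
cos_from_units = (unit_keys @ unit_query.T).squeeze(1)  # shape (10,)
cos_builtin = F.cosine_similarity(query, keys, dim=1)   # shape (10,)

# Scaled dot-product scores keep magnitude information and divide by sqrt(d)
d = query.shape[1]
scaled_scores = (keys @ query.T).squeeze(1) / math.sqrt(d)  # shape (10,)

cosine_weights = torch.softmax(cos_builtin, dim=0)
scaled_weights = torch.softmax(scaled_scores, dim=0)

print(torch.allclose(cos_from_units, cos_builtin, atol=1e-5))
print(cosine_weights.max().item(), scaled_weights.max().item())
```

Because cosine scores are bounded in [-1, 1], a softmax over them stays fairly flat: no weight can exceed another by more than a factor of e² unless the scores are multiplied by a temperature first.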

## Euclidean Distance

**Formula:** $d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$

Measures how "far apart" two points are in feature space.

In [None]:
# k-Nearest Neighbors in embedding space
embeddings = torch.randn(100, 50)  # 100 data points, 50-dim embeddings
query_point = torch.randn(1, 50)

# Compute distances to all points
distances = torch.norm(embeddings - query_point, dim=1)
k = 5
nearest_k = torch.topk(distances, k, largest=False)

print(f"Distances to {k} nearest neighbors: {nearest_k.values}")
print(f"Indices of {k} nearest neighbors: {nearest_k.indices}")

# Mean Squared Error loss expressed through Euclidean distance
predictions = torch.randn(32, 10)  # 32 samples, 10 outputs each
targets = torch.randn(32, 10)

mse_loss = torch.mean((predictions - targets) ** 2)
# Equivalent: mean squared Euclidean distance per sample, divided by the
# number of outputs (MSE averages over every element, not just over samples)
num_outputs = predictions.shape[1]
euclidean_based_loss = torch.mean(torch.norm(predictions - targets, dim=1) ** 2) / num_outputs

print(f"MSE loss: {mse_loss:.4f}")
print(f"Euclidean-based loss: {euclidean_based_loss:.4f}")

# Clustering: assign points to nearest centroid
centroids = torch.randn(3, 50)  # 3 cluster centers
data_points = torch.randn(20, 50)  # 20 data points

# Compute distance from each point to each centroid
distances_to_centroids = torch.cdist(data_points, centroids)  # Shape: (20, 3)
cluster_assignments = torch.argmin(distances_to_centroids, dim=1)

print(f"Cluster assignments: {cluster_assignments}")
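
The pairwise distances that `torch.cdist` returns can also be written out by hand, which makes the shapes easier to follow. The sketch below shows two equivalent routes, assuming the same tensor sizes as the clustering example (the seed and the intermediate names are illustrative): explicit broadcasting, and the expansion $\|\mathbf{a} - \mathbf{b}\|^2 = \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - 2\,\mathbf{a} \cdot \mathbf{b}$, which ties Euclidean distance back to the dot product.

```python
import torch

torch.manual_seed(0)

data_points = torch.randn(20, 50)
centroids = torch.randn(3, 50)

# Route 1, broadcasting: (20, 1, 50) - (1, 3, 50) -> (20, 3, 50), norm over features
diff = data_points[:, None, :] - centroids[None, :, :]
dist_broadcast = torch.linalg.vector_norm(diff, dim=2)

# Route 2, expansion: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq_a = (data_points ** 2).sum(dim=1, keepdim=True)  # shape (20, 1)
sq_b = (centroids ** 2).sum(dim=1)                  # shape (3,)
sq_dist = sq_a + sq_b - 2.0 * (data_points @ centroids.T)
dist_expanded = sq_dist.clamp_min(0.0).sqrt()       # clamp guards against tiny negatives

# Reference result from the built-in
dist_cdist = torch.cdist(data_points, centroids)

print(torch.allclose(dist_broadcast, dist_cdist, atol=1e-3))
print(torch.allclose(dist_expanded, dist_cdist, atol=1e-3))
```

The broadcasting route materializes a `(20, 3, 50)` intermediate, while the expansion only needs one matrix product, so the second tends to scale better as the number of points grows, at the cost of some floating-point cancellation.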

## Lp Norms

**Formula:** $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$

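Measures vector "size" in different ways: $p = 1$ sums the absolute values, $p = 2$ is the Euclidean length, and the limit $p \to \infty$ gives the largest absolute entry. Used for regularization and for monitoring or clipping gradients.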
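
As a direct transcription of the formula, the following sketch computes the norms by hand for the weight vector used in this section. The helper name `lp_norm` is hypothetical, and the loop over increasing `p` is only there to show the norm approaching the maximum absolute value.

```python
import torch

weights = torch.tensor([3.0, -1.0, 0.5, -2.0, 0.1])

def lp_norm(x, p):
    """Direct transcription of (sum_i |x_i|^p)^(1/p) for finite p >= 1."""
    return x.abs().pow(p).sum().pow(1.0 / p)

l1 = lp_norm(weights, 1)     # 3 + 1 + 0.5 + 2 + 0.1 = 6.6
l2 = lp_norm(weights, 2)     # sqrt(9 + 1 + 0.25 + 4 + 0.01) = sqrt(14.26)
l_inf = weights.abs().max()  # limit p -> infinity: largest absolute entry, 3.0

# As p grows, the norm shrinks toward the largest absolute entry
for p in (1, 2, 4, 16, 64):
    print(p, lp_norm(weights, p).item())
```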

In [None]:
# Different norms and their effects
weights = torch.tensor([3.0, -1.0, 0.5, -2.0, 0.1])

l1_norm = torch.norm(weights, p=1)  # Sum of absolute values
l2_norm = torch.norm(weights, p=2)  # Euclidean norm
l_inf_norm = torch.norm(weights, p=float('inf'))  # Maximum absolute value

print(f"Original weights: {weights}")
print(f"L1 norm: {l1_norm:.3f}")
print(f"L2 norm: {l2_norm:.3f}")
print(f"L∞ norm: {l_inf_norm:.3f}")

# L1 vs L2 regularization effects
def train_with_regularization(reg_type='l2', reg_strength=0.01):
    model_weights = torch.tensor([2.0, -1.5, 0.3, -0.8], requires_grad=True)
    optimizer = torch.optim.SGD([model_weights], lr=0.1)
    
    for _ in range(100):
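        # Dummy objective (a real task loss would go here). Its minimum is at zero,
        # so it already pulls every weight toward zero on its own; with the settings
        # used below, all three runs end near zero.
        loss = torch.sum(model_weights ** 2)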
        
        # Add regularization
        if reg_type == 'l1':
            loss += reg_strength * torch.norm(model_weights, p=1)
        elif reg_type == 'l2':
            # Squared L2 norm, written without the sqrt to keep the gradient well-defined at zero
            loss += reg_strength * torch.sum(model_weights ** 2)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    return model_weights.detach()

# Compare effects
weights_no_reg = train_with_regularization(reg_type=None, reg_strength=0)
weights_l1 = train_with_regularization(reg_type='l1', reg_strength=0.1)
weights_l2 = train_with_regularization(reg_type='l2', reg_strength=0.1)

print(f"No regularization: {weights_no_reg}")
print(f"L1 regularization: {weights_l1}")
print(f"L2 regularization: {weights_l2}")

# Count near-zero weights (sparsity)
def count_sparse(weights, threshold=0.01):
    return (torch.abs(weights) < threshold).sum().item()

print(f"Sparse weights (L1): {count_sparse(weights_l1)}/4")
print(f"Sparse weights (L2): {count_sparse(weights_l2)}/4")

# Gradient norms for training stability
gradients = torch.randn(1000)  # Simulated gradients
grad_l1_norm = torch.norm(gradients, p=1)
grad_l2_norm = torch.norm(gradients, p=2)

print(f"Gradient L1 norm: {grad_l1_norm:.2f}")
print(f"Gradient L2 norm: {grad_l2_norm:.2f}")

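# Gradient clipping using L2 norm: rescale so the norm never exceeds max_norm
# (for model parameters, torch.nn.utils.clip_grad_norm_ applies the same idea)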
max_norm = 1.0
if grad_l2_norm > max_norm:
    gradients = gradients * (max_norm / grad_l2_norm)
    print(f"Gradients clipped to norm: {torch.norm(gradients, p=2):.3f}")
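
Plain (sub)gradient steps on an L1 penalty, as in the regularization loop above, tend to leave weights hovering near zero rather than exactly on it. One common way to obtain exact zeros is a proximal step, also called soft-thresholding. The sketch below is an illustration under stated assumptions, not the method used in this notebook: the `target` vector, the hyperparameters, and the helper names `soft_threshold` and `fit` are all hypothetical, and the objective is a simple squared distance to `target` so that the unregularized optimum is not zero.

```python
import torch

# Hypothetical unregularized optimum: two large and two small coordinates
target = torch.tensor([2.0, -1.5, 0.03, -0.02])
lr, reg_strength, steps = 0.1, 0.1, 200

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink toward zero, stop at zero."""
    return torch.sign(w) * torch.clamp(w.abs() - t, min=0.0)

def fit(reg_type):
    w = torch.zeros_like(target)
    for _ in range(steps):
        grad = 2.0 * (w - target)  # gradient of sum((w - target)^2)
        if reg_type == 'l2':
            grad = grad + 2.0 * reg_strength * w  # gradient of reg_strength * sum(w^2)
        w = w - lr * grad
        if reg_type == 'l1':
            w = soft_threshold(w, lr * reg_strength)  # proximal step for the L1 penalty
    return w

w_l1 = fit('l1')
w_l2 = fit('l2')

print(f"L1 (proximal): {w_l1}")
print(f"L2:            {w_l2}")
print(f"Exact zeros, L1: {(w_l1 == 0).sum().item()}, L2: {(w_l2 == 0).sum().item()}")
# Exact zeros, L1: 2, L2: 0
```

In this setup L1 subtracts a constant amount from each weight, which removes the two small coordinates entirely, while L2 scales every weight by the same factor and so keeps them small but nonzero.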