# PyTorch Tensors: Vectors & Matrices

## 🎯 Introduction

Welcome to the foundational world of PyTorch tensors! This notebook takes you from complete beginner to confident tensor manipulator. Tensors are the fundamental data structure in PyTorch. They're like NumPy arrays but with superpowers: automatic differentiation, GPU acceleration, and deep integration with neural network operations.

### 🧠 What You'll Learn

By the end of this notebook, you'll understand:
- **Tensor fundamentals**: Creation, properties, and mental models
- **Matrix operations**: The mathematical backbone of neural networks
- **Broadcasting magic**: How PyTorch handles mismatched dimensions
- **Embedding vs one-hot**: Why embeddings revolutionized NLP
- **Batch processing**: The key to efficient neural network training

### 🎓 Prerequisites

- Basic Python knowledge (lists, functions, loops)
- Elementary linear algebra (vectors, matrices, matrix multiplication)
- No prior PyTorch experience needed!

### 🚀 Why This Matters

Understanding tensors is crucial because:
- Every neural network operation is a tensor transformation
- Efficient tensor operations = faster training and inference
- Shape mismatches are among the most common sources of PyTorch bugs
- Proper tensor design enables scalable deep learning

---

## 📚 Table of Contents

1. **[Core Mental Model](#core-mental-model)** - Understanding what tensors really are
2. **[Vector/Matrix Operations](#vector-matrix-operations)** - Essential operations for neural networks  
3. **[One-Hot vs Embedding Vectors](#one-hot-vs-embedding-vectors)** - The embedding revolution explained
4. **[Tiny Exercise: Batch Operations](#tiny-exercise-batch-operations)** - Putting it all together

In [1]:
# Essential imports for tensor operations
import sys                      # Interpreter details for the environment report

import torch                    # Core PyTorch library
import torch.nn as nn           # Neural network modules
import torch.nn.functional as F # Functional interface (activations, losses, etc.)
import numpy as np              # For comparison and some operations

# Reproducibility setup - always do this first!
# Seeding makes random draws repeatable on the same machine and PyTorch build
torch.manual_seed(0)    # Sets PyTorch random seed
np.random.seed(0)       # Sets NumPy random seed

# Environment information - important for debugging
print(f"PyTorch version: {torch.__version__}")
print(f"Python executable: {sys.executable}")

# Device detection - CPU vs GPU
# We'll use CPU for this tutorial to ensure reproducibility
available_device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device('cpu')  # Force CPU for consistent results

print(f"Available device: {available_device}")
print(f"Using device: {device}")

# Quick tensor creation test to verify everything works
test_tensor = torch.tensor([1.0, 2.0, 3.0])
print(f"✓ PyTorch is working! Test tensor: {test_tensor}")

PyTorch version: 2.6.0+cu124
Python executable: /usr/bin/python3
Available device: cpu
Using device: cpu
✓ PyTorch is working! Test tensor: tensor([1., 2., 3.])
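
To see what seeding buys you, here is a minimal sketch that draws the same random tensor twice. It uses a local `torch.Generator` (an illustrative choice, not part of the setup cell) so the notebook's global random state is left untouched. The variable names are hypothetical, and repeatability is assumed only for the same machine and PyTorch build.

```python
import torch

# A local generator keeps this demo from disturbing the global seed.
gen = torch.Generator().manual_seed(0)
first_draw = torch.randn(3, generator=gen)

# Reseeding with the same value restarts the random sequence.
gen.manual_seed(0)
second_draw = torch.randn(3, generator=gen)

# Without reseeding, the generator simply continues its sequence.
third_draw = torch.randn(3, generator=gen)

same_after_reseed = torch.equal(first_draw, second_draw)
same_without_reseed = torch.equal(first_draw, third_draw)

print(f"Same after reseeding: {same_after_reseed}")
print(f"Same without reseeding: {same_without_reseed}")
```

The same idea applies to `torch.manual_seed(0)`: it fixes the starting point of the global generator, so every later random call in the notebook follows the same sequence from run to run.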


## Core Mental Model

### 🧠 What Is a Tensor?

A **tensor** is a generalization of scalars, vectors, and matrices:
- **Scalar** (0D): Just a number → `5.0`
- **Vector** (1D): Array of numbers → `[1, 2, 3]`  
- **Matrix** (2D): Grid of numbers → `[[1, 2], [3, 4]]`
- **Higher-order tensor** (3D+): Multi-dimensional array → `[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]`

### 🎯 The Neural Network Connection

In neural networks, tensors represent:
- **Inputs**: Your data (images, text, audio)
- **Parameters**: Learnable weights and biases
- **Activations**: Values flowing between layers
- **Gradients**: How to update parameters

### 🔧 PyTorch Tensors vs NumPy Arrays

| Feature | NumPy Array | PyTorch Tensor |
|---------|-------------|----------------|
| GPU Support | ❌ | ✅ |
| Automatic Differentiation | ❌ | ✅ |
| Neural Network Integration | ❌ | ✅ |
| Scientific Computing | ✅ | ✅ |

**Key insight**: PyTorch tensors are like NumPy arrays that know calculus!
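
A small sketch can make the comparison above concrete. It shows two things: a tensor created with `torch.from_numpy` shares memory with its source array, and a tensor created with `requires_grad=True` can report derivatives. The function `y = x^2 + 2x` and all variable names are illustrative choices, not part of the notebook.

```python
import numpy as np
import torch

# NumPy interop: from_numpy shares memory with the source array (CPU only).
array = np.array([1.0, 2.0, 3.0], dtype=np.float32)
shared = torch.from_numpy(array)
array[0] = 10.0  # the change is visible through the tensor as well

# Autograd: track operations on x, then ask for dy/dx.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # y = x^2 + 2x
y.backward()         # computes dy/dx = 2x + 2 and stores it in x.grad

print(f"Shared tensor after editing the array: {shared}")
print(f"dy/dx at x=3: {x.grad.item()}")
```

At `x = 3` the derivative `2x + 2` evaluates to 8, which is the value stored in `x.grad`. This is the "knows calculus" part: PyTorch records the operations and differentiates them for you.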

In [2]:
# =============================================================================
# TENSOR CREATION PATTERNS
# =============================================================================

print("🔧 Basic Tensor Creation")
print("=" * 50)

# Method 1: From Python lists/numbers
scalar = torch.tensor(5.0)                    # 0D tensor (scalar)
vector = torch.tensor([1.0, 2.0, 3.0])       # 1D tensor (vector)
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # 2D tensor (matrix)

print(f"Scalar: {scalar}, shape: {scalar.shape}, dimensions: {scalar.ndim}")
print(f"Vector: {vector}, shape: {vector.shape}, dimensions: {vector.ndim}")
print(f"Matrix: {matrix}, shape: {matrix.shape}, dimensions: {matrix.ndim}")

print("\n🏗️ Common Creation Patterns")
print("=" * 50)

# Method 2: Structural creation (most common in practice)
zeros_2d = torch.zeros(3, 4)              # 3×4 matrix filled with zeros
ones_3d = torch.ones(2, 3, 4)             # 2×3×4 tensor filled with ones
randn_2d = torch.randn(2, 5)              # Random normal (mean=0, std=1)
arange_1d = torch.arange(0, 10, 2)        # Evenly spaced: [0, 2, 4, 6, 8]
linspace_1d = torch.linspace(0, 1, 5)     # 5 points from 0 to 1

print(f"Zeros matrix shape: {zeros_2d.shape}")
print(f"Ones tensor shape: {ones_3d.shape}")
print(f"Random normal shape: {randn_2d.shape}")
print(f"Arange result: {arange_1d}")
print(f"Linspace result: {linspace_1d}")

print("\n🔍 Tensor Properties")
print("=" * 50)

# Every tensor has these key properties
example_tensor = torch.randn(2, 3, 4)

print(f"Shape (dimensions): {example_tensor.shape}")      # torch.Size([2, 3, 4])
print(f"Data type: {example_tensor.dtype}")               # torch.float32 (default)
print(f"Device location: {example_tensor.device}")        # cpu or cuda
print(f"Requires gradients: {example_tensor.requires_grad}")  # False (default)
print(f"Number of elements: {example_tensor.numel()}")    # 2 × 3 × 4 = 24
print(f"Memory layout: {example_tensor.stride()}")        # Strides: elements to skip per step along each dimension

print("\n💡 Key Insight: Shape is Everything!")
print("In deep learning, getting tensor shapes right is 80% of the battle.")
print("Always think: [batch_size, sequence_length, feature_dimension]")

🔧 Basic Tensor Creation
Scalar: 5.0, shape: torch.Size([]), dimensions: 0
Vector: tensor([1., 2., 3.]), shape: torch.Size([3]), dimensions: 1
Matrix: tensor([[1., 2.],
        [3., 4.]]), shape: torch.Size([2, 2]), dimensions: 2

🏗️ Common Creation Patterns
Zeros matrix shape: torch.Size([3, 4])
Ones tensor shape: torch.Size([2, 3, 4])
Random normal shape: torch.Size([2, 5])
Arange result: tensor([0, 2, 4, 6, 8])
Linspace result: tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])

🔍 Tensor Properties
Shape (dimensions): torch.Size([2, 3, 4])
Data type: torch.float32
Device location: cpu
Requires gradients: False
Number of elements: 24
Memory layout: (12, 4, 1)

💡 Key Insight: Shape is Everything!
In deep learning, getting tensor shapes right is 80% of the battle.
Always think: [batch_size, sequence_length, feature_dimension]
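
The `Memory layout: (12, 4, 1)` line above is the tensor's stride. As a rough sketch of what it means, the example below computes the flat memory offset of one element by hand and shows that a transpose only swaps strides instead of copying data. The index `(1, 2, 3)` and the variable names are arbitrary choices for illustration.

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)   # contiguous, strides (12, 4, 1)
i, j, k = 1, 2, 3

# The flat offset of element [i, j, k] is the dot product of index and stride.
offset = sum(idx * step for idx, step in zip((i, j, k), t.stride()))
flat_value = t.flatten()[offset]

# Transposing swaps strides instead of moving data, so the result is a view.
t_swapped = t.transpose(0, 2)

print(f"Strides: {t.stride()}")
print(f"t[{i}, {j}, {k}] = {t[i, j, k].item()}, flat[{offset}] = {flat_value.item()}")
print(f"Transposed shape {tuple(t_swapped.shape)}, strides {t_swapped.stride()}")
print(f"Still contiguous after transpose: {t_swapped.is_contiguous()}")
```

This is why operations such as `transpose` are cheap, and also why some functions ask you to call `.contiguous()` first: the data has not moved, only the way it is indexed.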


## Vector/Matrix Operations

### 🎯 The Heart of Neural Networks

Matrix multiplication is the workhorse operation of neural networks. Linear layers, attention mechanisms, and the gradient computations behind parameter updates all reduce to matrix operations. Let's master the essential patterns!

In [3]:
# =============================================================================
# NEURAL NETWORK MATRIX OPERATIONS
# =============================================================================

print("🧮 The Linear Layer Pattern")
print("=" * 50)

# This is the most important pattern in deep learning!
# A linear layer computes: output = input @ weight + bias
# (nn.Linear stores its weight as [out_features, in_features], so it applies input @ weight.T)

batch_size, d_input, d_hidden = 4, 6, 8

# Typical neural network shapes
X = torch.randn(batch_size, d_input)    # Input batch: [batch_size, input_features]
W = torch.randn(d_input, d_hidden)      # Weight matrix: [input_features, output_features]
b = torch.randn(d_hidden)               # Bias vector: [output_features]

print(f"Input X shape: {X.shape} - {batch_size} samples, {d_input} features each")
print(f"Weight W shape: {W.shape} - transforms {d_input} → {d_hidden} features")
print(f"Bias b shape: {b.shape} - one bias per output feature")

# The fundamental neural network operation
y = X @ W + b  # Matrix multiplication + broadcasting bias addition

print(f"\nOutput y shape: {y.shape} - {batch_size} samples, {d_hidden} features each")
print(f"✓ Linear transformation complete: {d_input} → {d_hidden} dimensions")

print("\n🎯 Why This Shape Pattern Matters")
print("- Batch dimension (first) allows parallel processing")
print("- Feature dimension (last) is what gets transformed")
print("- This pattern scales from tiny MLPs to massive transformers")

print("\n🔪 Tensor Slicing and Indexing")
print("=" * 50)

# Indexing patterns you'll use constantly
first_sample = X[0]              # Get first sample: [d_input]
first_two = X[:2]                # Get first two samples: [2, d_input]
last_feature = X[..., -1]        # Last feature across all samples: [batch_size]
middle_features = X[:, 1:4]      # Features 1-3 for all samples: [batch_size, 3]

print(f"First sample shape: {first_sample.shape}")
print(f"First two samples shape: {first_two.shape}")
print(f"Last feature across batch shape: {last_feature.shape}")
print(f"Middle features shape: {middle_features.shape}")

print("\n🏗️ Stacking vs Concatenation")
print("=" * 50)

# Two fundamental ways to combine tensors
X1 = torch.randn(4, 6)
X2 = torch.randn(4, 6)

# Stack: Creates NEW dimension
stacked = torch.stack([X1, X2], dim=0)    # [2, 4, 6] - NEW first dimension
stacked_last = torch.stack([X1, X2], dim=-1)  # [4, 6, 2] - NEW last dimension

# Cat: Concatenates along EXISTING dimension
concat_batch = torch.cat([X1, X2], dim=0)     # [8, 6] - double the batch size
concat_features = torch.cat([X1, X2], dim=1)  # [4, 12] - double the features

print(f"Original X1, X2 shapes: {X1.shape}, {X2.shape}")
print(f"Stack (dim=0): {stacked.shape} - creates new dimension")
print(f"Stack (dim=-1): {stacked_last.shape} - creates new last dimension")
print(f"Cat (dim=0): {concat_batch.shape} - bigger batch")
print(f"Cat (dim=1): {concat_features.shape} - more features")

print("\n💡 Pro Tip:")
print("- Use stack() when you want to create batches")
print("- Use cat() when you want to combine features or increase batch size")

🧮 The Linear Layer Pattern
Input X shape: torch.Size([4, 6]) - 4 samples, 6 features each
Weight W shape: torch.Size([6, 8]) - transforms 6 → 8 features
Bias b shape: torch.Size([8]) - one bias per output feature

Output y shape: torch.Size([4, 8]) - 4 samples, 8 features each
✓ Linear transformation complete: 6 → 8 dimensions

🎯 Why This Shape Pattern Matters
- Batch dimension (first) allows parallel processing
- Feature dimension (last) is what gets transformed
- This pattern scales from tiny MLPs to massive transformers

🔪 Tensor Slicing and Indexing
First sample shape: torch.Size([6])
First two samples shape: torch.Size([2, 6])
Last feature across batch shape: torch.Size([4])
Middle features shape: torch.Size([4, 3])

🏗️ Stacking vs Concatenation
Original X1, X2 shapes: torch.Size([4, 6]), torch.Size([4, 6])
Stack (dim=0): torch.Size([2, 4, 6]) - creates new dimension
Stack (dim=-1): torch.Size([4, 6, 2]) - creates new last dimension
Cat (dim=0): torch.Size([8, 6]) - bigger batch
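Cat (dim=1): torch.Size([4, 12]) - more features

💡 Pro Tip:
- Use stack() when you want to create batches
- Use cat() when you want to combine features or increase batch size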
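
The manual `X @ W + b` pattern from the cell above corresponds to what `nn.Linear` does, with one detail worth knowing: `nn.Linear` stores its weight as `[out_features, in_features]`. The sketch below checks the equivalence under that layout. The dimensions reuse the ones from the cell; the names `layer`, `manual`, and `builtin` are illustrative.

```python
import torch
import torch.nn as nn

batch_size, d_input, d_hidden = 4, 6, 8
X = torch.randn(batch_size, d_input)

layer = nn.Linear(d_input, d_hidden)

# nn.Linear stores its weight as [out_features, in_features],
# so the manual equivalent transposes it before multiplying.
manual = X @ layer.weight.T + layer.bias
builtin = layer(X)

print(f"layer.weight shape: {tuple(layer.weight.shape)}")
print(f"Output shape: {tuple(builtin.shape)}")
print(f"Outputs match: {torch.allclose(manual, builtin, atol=1e-6)}")
```

If you ever copy weights between a hand-written `X @ W` and an `nn.Linear`, this transposed layout is the usual reason the shapes look swapped.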

## One-Hot vs Embedding Vectors

### 🚀 The Embedding Revolution

Replacing sparse one-hot vectors with dense, learnable embedding vectors was a key step toward modern NLP. Embeddings form the input layer of transformer models and make large vocabularies practical. Let's see why embeddings are so powerful!

In [4]:
# =============================================================================
# ONE-HOT vs EMBEDDINGS: THE GREAT TRANSFORMATION
# =============================================================================

print("🔥 The Old Way: One-Hot Vectors")
print("=" * 50)

# One-hot: sparse, large, fixed representation
vocab_size = 1000      # Small vocabulary for illustration (real ones often have tens of thousands of tokens)
token_id = 42          # Word "hello" might be token 42

# Create one-hot vector - mostly zeros!
onehot = torch.zeros(vocab_size)
onehot[token_id] = 1.0

print(f"One-hot vector size: {onehot.shape}")
print(f"Memory usage: {onehot.numel()} floats")
print(f"Mostly zeros: {torch.sum(onehot == 0).item()} zeros out of {onehot.numel()}")
print(f"Sparsity: {(onehot == 0).float().mean().item():.1%}")
print(f"Sample values: [{onehot[40]:.0f}, {onehot[41]:.0f}, {onehot[42]:.0f}, {onehot[43]:.0f}, {onehot[44]:.0f}]")

print("\n✨ The New Way: Dense Embeddings")
print("=" * 50)

# Embedding: dense, compact, learnable representation
embedding_dim = 64
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Get embedding for the same token
embedded = embedding_layer(torch.tensor([token_id]))

print(f"Embedding vector size: {embedded.shape}")
print(f"Memory usage: {embedded.numel()} floats")
print(f"All values meaningful: no zeros wasted")
print(f"Density: 100% of values carry information")
print(f"Sample values: {embedded[0, :5]}")  # First 5 dimensions

print("\n📊 Efficiency Comparison")
print("=" * 50)

# Memory efficiency
onehot_memory = vocab_size * 4  # 4 bytes per float32
embedding_memory = embedding_dim * 4  # 4 bytes per float32

print(f"One-hot memory per token: {onehot_memory:,} bytes")
print(f"Embedding memory per token: {embedding_memory:,} bytes")
print(f"Memory savings: {onehot_memory / embedding_memory:.1f}x smaller")

# Computational efficiency (rough operation counts, not measured timings)
print(f"\nComputational efficiency:")
print(f"One-hot @ weight: {vocab_size} × {embedding_dim} = {vocab_size * embedding_dim:,} operations")
print(f"Embedding lookup: {embedding_dim} operations (just indexing!)")
print(f"Speed improvement: {(vocab_size * embedding_dim) // embedding_dim}x faster")

print("\n🧠 Learning and Representation Power")
print("=" * 50)

# Demonstrate batch processing
batch_tokens = torch.tensor([1, 5, 10, 42, 100, 999])  # Batch of different tokens
batch_embedded = embedding_layer(batch_tokens)

print(f"Batch tokens: {batch_tokens}")
print(f"Batch embeddings shape: {batch_embedded.shape}")
print(f"Processing {len(batch_tokens)} tokens simultaneously")

# Show that embeddings are learnable parameters
print(f"\nEmbedding layer parameters: {embedding_layer.weight.numel():,}")
print(f"Embedding weight matrix shape: {embedding_layer.weight.shape}")
print(f"Each row is a learnable representation for one token")

print("\n🎯 Why Embeddings Won")
print("=" * 50)
print("1. 🚀 EFFICIENCY:")
print(f"   - Memory: {embedding_dim} floats vs {vocab_size:,} floats")
print("   - Computation: Lookup vs matrix multiplication")
print("   - Storage: Dense vs sparse operations")

print("\n2. 🧠 LEARNING:")
print("   - One-hot: Fixed encoding, nothing in the vector itself is learned")
print("   - Embeddings: Every dimension is trainable")
print("   - Semantic relationships can emerge during training")

print("\n3. 🔗 COMPOSITIONALITY:")
print("   - Similar words tend to get similar embeddings after training")
print("   - Vector arithmetic can work approximately: king - man + woman ≈ queen")
print("   - Transfer learning becomes possible")

print("\n💡 Modern Impact:")
print("Embeddings are the input layer of BERT, GPT, and most modern NLP models.")
print("The same concept now works for images, code, and multimodal data!")

🔥 The Old Way: One-Hot Vectors
One-hot vector size: torch.Size([1000])
Memory usage: 1000 floats
Mostly zeros: 999 zeros out of 1000
Sparsity: 99.9%
Sample values: [0, 0, 1, 0, 0]

✨ The New Way: Dense Embeddings
Embedding vector size: torch.Size([1, 64])
Memory usage: 64 floats
All values meaningful: no zeros wasted
Density: 100% of values carry information
Sample values: tensor([ 1.0902, -0.2947, -0.7762, -0.2292, -1.5605], grad_fn=<SliceBackward0>)

📊 Efficiency Comparison
One-hot memory per token: 4,000 bytes
Embedding memory per token: 256 bytes
Memory savings: 15.6x smaller

Computational efficiency:
One-hot @ weight: 1000 × 64 = 64,000 operations
Embedding lookup: 64 operations (just indexing!)
Speed improvement: 1000x faster

🧠 Learning and Representation Power
Batch tokens: tensor([  1,   5,  10,  42, 100, 999])
Batch embeddings shape: torch.Size([6, 64])
Processing 6 tokens simultaneously

Embedding layer parameters: 64,000
Embedding weight matrix shape: torch.Size([1000, 64])
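
The "lookup vs matrix multiplication" comparison in the cell above rests on one fact: multiplying a one-hot vector by a weight matrix selects a single row of that matrix. Here is a minimal sketch that checks this, assuming the same `vocab_size` and `embedding_dim` as above. The token IDs and the names `via_matmul` and `via_lookup` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 1000, 64
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
token_ids = torch.tensor([1, 42, 999])

# Route 1: build one-hot rows and multiply by the weight matrix.
onehot = F.one_hot(token_ids, num_classes=vocab_size).float()  # [3, 1000]
via_matmul = onehot @ embedding_layer.weight                   # [3, 64]

# Route 2: index the same weight matrix directly.
via_lookup = embedding_layer(token_ids)                        # [3, 64]

print(f"One-hot shape: {tuple(onehot.shape)}")
print(f"Lookup shape: {tuple(via_lookup.shape)}")
print(f"Results match: {torch.allclose(via_matmul, via_lookup)}")
```

Both routes give the same vectors, so an embedding layer can be read as a linear layer applied to one-hot input that skips all the multiplications by zero.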

## Tiny Exercise: Batch Operations

### 🎯 Putting It All Together

Let's practice the fundamental tensor operations you'll use in every neural network. We'll work with small batch sizes and dimensions so every value is easy to inspect; the same operations apply unchanged at real-world scale.

In [5]:
# =============================================================================
# COMPREHENSIVE TENSOR EXERCISE
# =============================================================================

print("🎯 Neural Network Simulation")
print("=" * 50)

# Small dimensions so the printed tensors stay readable
batch_size = 3        # Small batch for clarity
d_model = 4           # Feature dimension (like a tiny transformer)

# Step 1: Create input batch (like a batch of embedded tokens)
print("Step 1: Input Data")
X_batch = torch.randn(batch_size, d_model)  # [batch_size, features]
W_linear = torch.randn(d_model, d_model)    # [input_dim, output_dim]
b_linear = torch.randn(d_model)             # [output_dim]

print(f"Input batch shape: {X_batch.shape}")
print(f"Weight matrix shape: {W_linear.shape}")
print(f"Bias vector shape: {b_linear.shape}")

# Step 2: Linear transformation (core of every neural layer)
print(f"\nStep 2: Linear Transformation")
y_batch = X_batch @ W_linear + b_linear     # Matrix multiplication + bias

print(f"Output shape: {y_batch.shape}")
print(f"✓ Shape check: input {X_batch.shape} → output {y_batch.shape}")

# Verify our understanding with assertions
assert X_batch.shape == (batch_size, d_model), f"Expected {(batch_size, d_model)}, got {X_batch.shape}"
assert y_batch.shape == (batch_size, d_model), f"Expected {(batch_size, d_model)}, got {y_batch.shape}"
print("✓ All shape assertions passed!")

print(f"\nStep 3: Common Neural Network Operations")
print("=" * 50)

# Element-wise operations (activations and other pointwise math)
relu_output = torch.relu(y_batch)           # ReLU: max(0, x)
squared = torch.pow(y_batch, 2)             # Element-wise square
exp_output = torch.exp(y_batch)             # Element-wise exponential
sigmoid_output = torch.sigmoid(y_batch)     # Sigmoid activation

print(f"Original output (sample):\n{y_batch[0]}")                    # First sample
print(f"After ReLU (negatives → 0):\n{relu_output[0]}")
print(f"After sigmoid (0-1 range):\n{sigmoid_output[0]}")

print(f"\nStep 4: Reduction Operations")
print("=" * 50)

# Reductions - crucial for pooling and attention
sum_all = torch.sum(y_batch)                    # Sum all elements → scalar
sum_batch_dim = torch.sum(y_batch, dim=0)       # Sum across batch → [d_model]
sum_feature_dim = torch.sum(y_batch, dim=1)     # Sum across features → [batch_size]
mean_batch = torch.mean(y_batch, dim=0)         # Mean across batch → [d_model]

print(f"Sum all elements: {sum_all.item():.3f} (scalar)")
print(f"Sum across batch: {sum_batch_dim} (shape: {sum_batch_dim.shape})")
print(f"Sum across features: {sum_feature_dim} (shape: {sum_feature_dim.shape})")
print(f"Mean across batch: {mean_batch} (shape: {mean_batch.shape})")

print(f"\nStep 5: Broadcasting Magic")
print("=" * 50)

# Broadcasting: PyTorch's secret sauce for shape flexibility
A = torch.randn(3, 1)    # [3, 1] - 3 rows, 1 column
B = torch.randn(1, 4)    # [1, 4] - 1 row, 4 columns

print(f"Tensor A shape: {A.shape}")
print(f"Tensor B shape: {B.shape}")

# Broadcasting automatically expands dimensions
C = A + B                # [3, 1] + [1, 4] → [3, 4]

print(f"A + B result shape: {C.shape} (broadcasted!)")
print("Broadcasting rule: dimensions align from the right, 1s expand to match")

# Manual verification of broadcasting
A_expanded = A.expand(3, 4)  # [3, 1] → [3, 4]
B_expanded = B.expand(3, 4)  # [1, 4] → [3, 4]
C_manual = A_expanded + B_expanded

print(f"Manual expansion matches: {torch.allclose(C, C_manual)}")

print(f"\nStep 6: Real-World Pattern - Attention Scores")
print("=" * 50)

# Simulate attention mechanism pattern
seq_len = 5
query = torch.randn(batch_size, seq_len, d_model)    # [B, seq_len, d_model]
key = torch.randn(batch_size, seq_len, d_model)      # [B, seq_len, d_model]

# Raw attention scores: Q @ K^T (full attention also divides by sqrt(d_k) and applies softmax)
attention_scores = torch.matmul(query, key.transpose(-2, -1))  # [B, seq_len, seq_len]

print(f"Query shape: {query.shape}")
print(f"Key shape: {key.shape}")
print(f"Attention scores shape: {attention_scores.shape}")
print("✓ This is the Q @ K^T step of transformer attention!")

print(f"\n🎉 Congratulations!")
print("=" * 50)
print("You've mastered the fundamental tensor operations of deep learning:")
print("✓ Tensor creation and properties")
print("✓ Matrix multiplication and broadcasting")
print("✓ Element-wise operations and reductions")
print("✓ Real neural network patterns")
print("✓ Shape thinking and debugging")

print(f"\n🚀 Next Steps:")
print("- Learn about automatic differentiation (autograd)")
print("- Understand how tensors become neural network layers")
print("- Explore GPU acceleration and optimization")
print("- Build your first neural network!")

print(f"\n💡 Key Insight:")
print("Deep learning is just smart tensor manipulation.")
print("Master tensors, master deep learning!")

🎯 Neural Network Simulation
Step 1: Input Data
Input batch shape: torch.Size([3, 4])
Weight matrix shape: torch.Size([4, 4])
Bias vector shape: torch.Size([4])

Step 2: Linear Transformation
Output shape: torch.Size([3, 4])
✓ Shape check: input torch.Size([3, 4]) → output torch.Size([3, 4])
✓ All shape assertions passed!

Step 3: Common Neural Network Operations
Original output (sample):
tensor([ 0.7814,  0.6957,  1.1554, -0.5608])
After ReLU (negatives → 0):
tensor([0.7814, 0.6957, 1.1554, 0.0000])
After sigmoid (0-1 range):
tensor([0.6860, 0.6672, 0.7605, 0.3634])

Step 4: Reduction Operations
Sum all elements: 7.466 (scalar)
Sum across batch: tensor([-2.5937,  8.9430,  3.3929, -2.2764]) (shape: torch.Size([4]))
Sum across features: tensor([2.0716, 3.9262, 1.4679]) (shape: torch.Size([3]))
Mean across batch: tensor([-0.8646,  2.9810,  1.1310, -0.7588]) (shape: torch.Size([4]))

Step 5: Broadcasting Magic
Tensor A shape: torch.Size([3, 1])
Tensor B shape: torch.Size([1, 4])
A + B result shape: torch.Size([3, 4]) (broadcasted!)
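
The broadcasting rule from Step 5 (align shapes from the right, let 1s expand) can be checked without allocating any data. The sketch below uses `torch.broadcast_shapes` for that, then shows the bias-addition case and a shape pair that does not broadcast. The example shapes and variable names are illustrative choices.

```python
import torch

# Shapes are compared right to left; each pair must be equal or contain a 1.
outer_shape = torch.broadcast_shapes((3, 1), (1, 4))
bias_shape = torch.broadcast_shapes((4, 6), (6,))

# A bias of shape [8] is added to every row of a [4, 8] batch.
rows_plus_bias = torch.zeros(4, 8) + torch.arange(8.0)

# Incompatible shapes raise a RuntimeError rather than guessing.
try:
    torch.zeros(3, 4) + torch.zeros(3)   # trailing dims 4 vs 3 do not match
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(f"(3, 1) with (1, 4) -> {tuple(outer_shape)}")
print(f"(4, 6) with (6,)   -> {tuple(bias_shape)}")
print(f"Bias added to each row: {tuple(rows_plus_bias.shape)}")
print(f"Mismatched shapes raised an error: {mismatch_raised}")
```

The failing case is a common one: a `[3]` vector lines up with the last dimension, not the first, so adding it to a `[3, 4]` tensor needs an explicit reshape such as `[3, 1]`.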