# Module 03: PyTorch Fundamentals

**The Foundation of Deep Learning in PyTorch**

---

## Objectives

By the end of this notebook, you will:
- Master PyTorch tensor creation and operations
- Understand the relationship between NumPy and PyTorch
- Know how to use GPU acceleration
- Understand automatic differentiation (autograd)

**Prerequisites:** 
- [Module 01 - Python & Math Prerequisites](../01_python_math_prerequisites/01_prerequisites.ipynb)
- [Module 02 - Introduction to Deep Learning](../02_intro_to_deep_learning/02_intro_deep_learning.ipynb)

---

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---

# Part 1: Tensors - The Core Data Structure

---

## 1.1 What is a Tensor?

A **tensor** is a multi-dimensional array. It's the fundamental data structure in PyTorch.

| Dimensions | Name | Example |
|------------|------|--------|
| 0 | Scalar | Single number: `5` |
| 1 | Vector | List: `[1, 2, 3]` |
| 2 | Matrix | 2D array: grayscale image (height x width) |
| 3 | 3D Tensor | Color image (channels x height x width) |
| 4+ | N-D Tensor | Batch of images (4D), batch of videos (5D) |
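
The rows of the table can be checked directly with `ndim`. This is a minimal sketch; the sizes (28x28 images, a batch of 32) are illustrative assumptions, not values required by PyTorch.

```python
import torch

scalar = torch.tensor(5)             # 0 dimensions: a single number
vector = torch.tensor([1, 2, 3])     # 1 dimension
matrix = torch.zeros(28, 28)         # 2 dimensions: height x width
image = torch.zeros(3, 28, 28)       # 3 dimensions: channels x height x width
batch = torch.zeros(32, 3, 28, 28)   # 4 dimensions: batch x channels x height x width

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("batch", batch)]:
    print(f"{name:>6}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

Note that a scalar tensor has an empty shape, `()`, which is different from a vector with one element, `(1,)`.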

## 1.2 Creating Tensors

In [None]:
# From Python list
t1 = torch.tensor([1, 2, 3])
print(f"From list: {t1}")

# 2D tensor
t2 = torch.tensor([[1, 2, 3],
                   [4, 5, 6]])
print(f"\n2D tensor:\n{t2}")

In [None]:
# Common initialization patterns
zeros = torch.zeros(3, 4)       # 3x4 zeros
ones = torch.ones(2, 3)         # 2x3 ones  
eye = torch.eye(3)              # 3x3 identity
empty = torch.empty(2, 2)       # Uninitialized (arbitrary leftover values, not zeros)

print(f"Zeros (3x4):\n{zeros}")
print(f"\nIdentity (3x3):\n{eye}")

In [None]:
# Random tensors - essential for weight initialization
rand_uniform = torch.rand(2, 3)      # Uniform [0, 1)
rand_normal = torch.randn(2, 3)      # Standard normal N(0, 1)
rand_int = torch.randint(0, 10, (2, 3))  # Random integers [0, 10)

print(f"Uniform [0,1):\n{rand_uniform}")
print(f"\nNormal (0,1):\n{rand_normal}")
print(f"\nRandom integers [0,10):\n{rand_int}")

In [None]:
# Range and linspace
arange = torch.arange(0, 10, 2)      # Start, end, step
linspace = torch.linspace(0, 1, 5)   # Start, end, num_points

print(f"Arange (0 to 10 step 2): {arange}")
print(f"Linspace (0 to 1, 5 points): {linspace}")

In [None]:
# Create tensor with same properties as another
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)

zeros_like = torch.zeros_like(x)     # Same shape, dtype, device
ones_like = torch.ones_like(x)
rand_like = torch.randn_like(x)

print(f"Original:\n{x}")
print(f"\nZeros like:\n{zeros_like}")

## 1.3 Tensor Attributes

Every tensor has three key attributes:
- **shape**: Dimensions of the tensor
- **dtype**: Data type of elements
- **device**: Where the tensor lives (CPU or GPU)

In [None]:
x = torch.randn(3, 4)

print(f"Tensor:\n{x}")
print(f"\nShape: {x.shape}")        # or x.size()
print(f"Dtype: {x.dtype}")
print(f"Device: {x.device}")
print(f"Number of dimensions: {x.ndim}")
print(f"Total elements: {x.numel()}")

In [None]:
# Data types
# float32 is the default and most common for deep learning
float_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32)
double_tensor = torch.tensor([1.0, 2.0], dtype=torch.float64)
int_tensor = torch.tensor([1, 2], dtype=torch.int64)
bool_tensor = torch.tensor([True, False], dtype=torch.bool)

print(f"float32: {float_tensor.dtype}")
print(f"float64: {double_tensor.dtype}")
print(f"int64: {int_tensor.dtype}")
print(f"bool: {bool_tensor.dtype}")

### Why float32?

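- **Memory efficient**: Half the memory of float64 (4 bytes per element instead of 8)
- **Faster on GPU**: GPUs are optimized for float32 and are typically much slower at float64
- **Sufficient precision**: Neural network training rarely benefits from float64 precision

For very large models, 16-bit formats such as float16 (half precision) and bfloat16 are also used, usually as part of mixed-precision training.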
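
To make the memory argument concrete, the sketch below compares bytes per element and machine epsilon for three floating-point dtypes. The tensor length `n` is an arbitrary choice for illustration.

```python
import torch

n = 1_000_000  # illustrative tensor length
sizes = {}

for dtype in (torch.float64, torch.float32, torch.float16):
    t = torch.zeros(n, dtype=dtype)
    sizes[dtype] = t.element_size()           # bytes per element
    total_mb = t.element_size() * t.numel() / 1e6
    eps = torch.finfo(dtype).eps              # gap between 1.0 and the next representable value
    print(f"{dtype}: {sizes[dtype]} bytes/element, {total_mb:.0f} MB total, eps={eps:.1e}")
```

Each halving of the element size halves the memory footprint, at the cost of a larger `eps`, meaning coarser precision.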

## 1.4 Reshaping Tensors

In [None]:
x = torch.arange(12)
print(f"Original: {x}")
print(f"Shape: {x.shape}")

# Reshape
reshaped = x.reshape(3, 4)
print(f"\nReshaped (3, 4):\n{reshaped}")

# View (requires contiguous memory, but no copy)
viewed = x.view(4, 3)
print(f"\nViewed (4, 3):\n{viewed}")

# -1 means "infer this dimension"
auto_reshape = x.reshape(2, -1)  # 2 rows, auto columns
print(f"\nAuto reshape (2, -1) = (2, 6):\n{auto_reshape}")
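
The difference between `view` and `reshape` only shows up when the memory layout is not contiguous. The sketch below uses a transpose to produce such a tensor; the variable names are illustrative.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()  # Transpose: same underlying data, different strides
print(f"Transposed is contiguous: {xt.is_contiguous()}")

# view cannot reinterpret this layout without copying, so it raises
try:
    xt.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True
print(f"view on transposed tensor failed: {view_failed}")

# reshape falls back to a copy; contiguous() followed by view does the same explicitly
flat_reshape = xt.reshape(6)
flat_view = xt.contiguous().view(6)
print(f"reshape result: {flat_reshape}")

# On a contiguous tensor, view shares memory with the original
base = torch.arange(6)
v = base.view(2, 3)
v[0, 0] = 100
print(f"base after writing through the view: {base}")
```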

In [None]:
# Adding/removing dimensions
x = torch.randn(3, 4)
print(f"Original shape: {x.shape}")

# Unsqueeze: add dimension
x_unsqueeze = x.unsqueeze(0)  # Add at position 0
print(f"After unsqueeze(0): {x_unsqueeze.shape}")

x_unsqueeze2 = x.unsqueeze(-1)  # Add at last position
print(f"After unsqueeze(-1): {x_unsqueeze2.shape}")

# Squeeze: remove dimensions of size 1
y = torch.randn(1, 3, 1, 4)
print(f"\nBefore squeeze: {y.shape}")
print(f"After squeeze: {y.squeeze().shape}")

In [None]:
# Flatten
x = torch.randn(2, 3, 4)
print(f"Original shape: {x.shape}")

flat = x.flatten()  # Completely flat
print(f"Fully flattened: {flat.shape}")

# Flatten starting from dimension 1 (keep batch)
batch_flat = x.flatten(start_dim=1)
print(f"Flatten from dim 1: {batch_flat.shape}")

### Deep Learning Context

Common reshape patterns:
- **Flatten before dense layer**: (batch, C, H, W) -> (batch, C*H*W)
- **Add batch dimension**: (features,) -> (1, features)
- **Prepare for broadcasting**: (batch, features) -> (batch, features, 1, 1)
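
The three patterns above can be sketched as follows. The sizes and the per-channel `scale` tensor are assumptions chosen for illustration.

```python
import torch

batch, C, H, W = 8, 3, 4, 4  # illustrative sizes
images = torch.randn(batch, C, H, W)

# 1. Flatten before a dense layer: (batch, C, H, W) -> (batch, C*H*W)
flat = images.flatten(start_dim=1)
print(f"Flattened: {flat.shape}")

# 2. Add a batch dimension: (features,) -> (1, features)
sample = torch.randn(10)
sample_batched = sample.unsqueeze(0)
print(f"Batched sample: {sample_batched.shape}")

# 3. Prepare for broadcasting: (batch, C) -> (batch, C, 1, 1)
scale = torch.rand(batch, C)           # hypothetical per-image, per-channel scale
scale_4d = scale[:, :, None, None]     # same as scale.view(batch, C, 1, 1)
scaled = images * scale_4d             # broadcasts over H and W
print(f"Scale: {scale_4d.shape}, scaled images: {scaled.shape}")
```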

## 1.5 Tensor Operations

In [None]:
# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(f"a = {a}")
print(f"b = {b}")
print(f"\na + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a / b = {a / b}")
print(f"a ** 2 = {a ** 2}")

In [None]:
# Mathematical functions
x = torch.tensor([0.0, 1.0, 2.0])

print(f"x = {x}")
print(f"exp(x) = {torch.exp(x)}")
print(f"log(exp(x)) = {torch.log(torch.exp(x))}")
print(f"sqrt(x+1) = {torch.sqrt(x + 1)}")
print(f"sin(x) = {torch.sin(x)}")
print(f"abs(x-1) = {torch.abs(x - 1)}")

In [None]:
# Reduction operations
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(f"Tensor:\n{x}")
print(f"\nSum (all): {x.sum()}")
print(f"Sum (dim=0, one value per column): {x.sum(dim=0)}")
print(f"Sum (dim=1, one value per row): {x.sum(dim=1)}")
print(f"Mean: {x.mean()}")
print(f"Std: {x.std()}")        # Sample std (Bessel's correction) by default
print(f"Max: {x.max()}")
print(f"Argmax: {x.argmax()}")  # Index into the flattened tensor

In [None]:
# Matrix operations
A = torch.randn(3, 4)
B = torch.randn(4, 2)

# Matrix multiplication
C = A @ B  # or torch.matmul(A, B)
print(f"A shape: {A.shape}")
print(f"B shape: {B.shape}")
print(f"A @ B shape: {C.shape}")

In [None]:
# Dot product (1D vectors)
v1 = torch.tensor([1.0, 2.0, 3.0])
v2 = torch.tensor([4.0, 5.0, 6.0])

dot = torch.dot(v1, v2)
print(f"v1 . v2 = {dot}")  # 1*4 + 2*5 + 3*6 = 32

## 1.6 Indexing and Slicing

In [None]:
x = torch.arange(12).reshape(3, 4)
print(f"Tensor:\n{x}")

# Basic indexing
print(f"\nx[0, 0] = {x[0, 0]}")
print(f"x[1, 2] = {x[1, 2]}")
print(f"x[-1] (last row) = {x[-1]}")

In [None]:
# Slicing
print(f"First two rows:\n{x[:2]}")
print(f"\nColumns 1 and 2 (end index 3 is excluded):\n{x[:, 1:3]}")
print(f"\nEvery other row:\n{x[::2]}")

In [None]:
# Boolean indexing
x = torch.randn(5)
print(f"x = {x}")
print(f"x > 0: {x > 0}")
print(f"x[x > 0] = {x[x > 0]}")

## 1.7 In-place Operations

Operations whose names end in an underscore (`_`) modify the tensor in place instead of returning a new one.

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])
print(f"Original: {x}")

# In-place operations (modify x directly)
x.add_(10)  # x = x + 10
print(f"After add_(10): {x}")

x.mul_(2)   # x = x * 2
print(f"After mul_(2): {x}")

x.zero_()   # x = 0
print(f"After zero_(): {x}")

### Warning!

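In-place operations can break autograd: they may overwrite values that the backward pass still needs, and PyTorch raises an error when it detects this. Avoid them on tensors that are tracking gradients.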
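
Here is a minimal sketch of two ways this warning shows up in practice, followed by the out-of-place alternative. It uses `requires_grad` and `backward()`, which are covered in Part 4; the variable names are illustrative.

```python
import torch

# Case 1: in-place update of a leaf tensor that requires grad
w = torch.tensor([1.0, 2.0], requires_grad=True)
try:
    w.add_(1.0)
    leaf_error = None
except RuntimeError as err:
    leaf_error = str(err)
print(f"Leaf in-place error raised: {leaf_error is not None}")

# Case 2: overwriting a value that autograd saved for the backward pass
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()   # The backward of exp reuses its own output
y.add_(1.0)   # This overwrites that saved output
try:
    y.sum().backward()
    saved_error = None
except RuntimeError as err:
    saved_error = str(err)
print(f"Saved-tensor error raised: {saved_error is not None}")

# Out-of-place version: creates a new tensor, so nothing is overwritten
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
y2 = x2.exp() + 1.0
y2.sum().backward()
print(f"Gradient with out-of-place op: {x2.grad}")
```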

---

# Part 2: PyTorch and NumPy

---

## 2.1 Conversion Between NumPy and PyTorch

In [None]:
# NumPy to PyTorch
np_array = np.array([1, 2, 3, 4, 5])
print(f"NumPy array: {np_array}")

# Method 1: from_numpy (shares memory!)
tensor_shared = torch.from_numpy(np_array)
print(f"Tensor (shared): {tensor_shared}")

# Method 2: torch.tensor (copies data)
tensor_copied = torch.tensor(np_array)
print(f"Tensor (copied): {tensor_copied}")

In [None]:
# Memory sharing demonstration
np_arr = np.array([1, 2, 3])
t_shared = torch.from_numpy(np_arr)

print(f"Before: NumPy = {np_arr}, Tensor = {t_shared}")

# Modify NumPy array
np_arr[0] = 999

print(f"After modifying NumPy: NumPy = {np_arr}, Tensor = {t_shared}")
print("The tensor changed too! They share memory.")

In [None]:
# PyTorch to NumPy
tensor = torch.tensor([1.0, 2.0, 3.0])

# .numpy() shares memory (CPU tensor only)
np_from_tensor = tensor.numpy()
print(f"Tensor to NumPy: {np_from_tensor}")
print(f"Type: {type(np_from_tensor)}")
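
Two details of `.numpy()` are worth checking: the sharing also works in the tensor-to-array direction, and a tensor that requires gradients must be detached first. A short sketch, with illustrative variable names:

```python
import numpy as np
import torch

# Writing to the tensor is visible through the NumPy array
t = torch.tensor([1.0, 2.0, 3.0])
arr = t.numpy()
t[0] = -1.0
print(f"Array after modifying the tensor: {arr}")

# A tensor that requires grad cannot be converted directly
w = torch.tensor([1.0, 2.0], requires_grad=True)
try:
    w.numpy()
    numpy_failed = False
except RuntimeError:
    numpy_failed = True
print(f"Direct .numpy() on a requires_grad tensor failed: {numpy_failed}")

w_np = w.detach().numpy()   # Detach from the graph first
print(f"After detach: {w_np}")

# For an independent array, copy explicitly
independent = t.numpy().copy()
t[1] = 50.0
print(f"Independent copy is unchanged: {independent}")
```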

## 2.2 Key Differences

| Feature | NumPy | PyTorch |
|---------|-------|--------|
| GPU support | No | Yes |
| Automatic differentiation | No | Yes |
| Deep learning layers | No | Yes |
| Naming | ndarray | tensor |
| Primary use | Numerical computing | Deep learning |

---

# Part 3: GPU Acceleration

---

## 3.1 Device Management

In [None]:
# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

In [None]:
# Device-agnostic code pattern
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# Moving tensors to device
x = torch.randn(3, 4)
print(f"Original device: {x.device}")

# Move to device (GPU if available, else stays on CPU)
x = x.to(device)
print(f"After .to(device): {x.device}")

In [None]:
# Create tensor directly on device
y = torch.randn(3, 4, device=device)
print(f"Created on device: {y.device}")

In [None]:
# Moving back to CPU (needed for NumPy, plotting, etc.)
x_cpu = x.cpu()
print(f"After .cpu(): {x_cpu.device}")

# Convert to NumPy (must be on CPU)
x_numpy = x_cpu.numpy()
print(f"As NumPy array: {x_numpy}")

## 3.2 Operations Across Devices

**Important:** All tensors in an operation must be on the same device!

In [None]:
# This would cause an error if one tensor is on GPU and one on CPU
# a_cpu = torch.randn(3)
# b_gpu = torch.randn(3, device='cuda')
# c = a_cpu + b_gpu  # ERROR!

# Correct approach: move to same device
a = torch.randn(3, device=device)
b = torch.randn(3, device=device)
c = a + b  # Works!
print(f"Result device: {c.device}")

## 3.3 GPU Speed Comparison

In [None]:
import time

# Matrix multiplication benchmark
sizes = [1000, 2000, 4000]

for size in sizes:
    # CPU
    a_cpu = torch.randn(size, size)
    b_cpu = torch.randn(size, size)
    
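    start = time.perf_counter()
    c_cpu = a_cpu @ b_cpu
    cpu_time = time.perf_counter() - start
    
    # GPU (if available)
    if torch.cuda.is_available():
        a_gpu = a_cpu.cuda()
        b_gpu = b_cpu.cuda()
        
        _ = a_gpu @ b_gpu         # Warm-up: keep one-time CUDA setup out of the timing
        torch.cuda.synchronize()  # Wait for pending GPU work before starting the clock
        start = time.perf_counter()
        c_gpu = a_gpu @ b_gpu
        torch.cuda.synchronize()  # GPU calls are asynchronous; wait for the result
        gpu_time = time.perf_counter() - start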
        
        speedup = cpu_time / gpu_time
        print(f"Size {size}x{size}: CPU={cpu_time:.4f}s, GPU={gpu_time:.4f}s, Speedup={speedup:.1f}x")
    else:
        print(f"Size {size}x{size}: CPU={cpu_time:.4f}s (GPU not available)")

---

# Part 4: Automatic Differentiation (Autograd)

---

Autograd is one of PyTorch's most powerful features: it computes gradients automatically, which is essential for training neural networks.

## 4.1 The Concept

Recall from calculus: to minimize a loss function, we need its gradient.

**Manually computing gradients is:**
- Tedious for complex networks
- Error-prone
- Hard to modify

**Autograd:**
- Tracks operations on tensors
- Automatically computes gradients
- Handles arbitrarily complex graphs
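
One way to see what autograd delivers is to compare it with a numerical estimate. The sketch below uses a central finite difference; the function `f`, the evaluation point, and the step size `h` are arbitrary choices for illustration, and the `requires_grad`, `backward()`, and `no_grad` calls are explained in the sections that follow.

```python
import torch

def f(x):
    # Illustrative function: f(x) = sin(x) * x^2
    return torch.sin(x) * x ** 2

# float64 keeps the finite-difference estimate accurate
x = torch.tensor(1.5, dtype=torch.float64, requires_grad=True)

# Autograd: exact derivative via the chain rule
y = f(x)
y.backward()
autograd_grad = x.grad.item()

# Numerical estimate: (f(x + h) - f(x - h)) / (2h)
h = 1e-6
with torch.no_grad():
    numeric_grad = ((f(x + h) - f(x - h)) / (2 * h)).item()

print(f"Autograd:          {autograd_grad:.8f}")
print(f"Finite difference: {numeric_grad:.8f}")
```

The finite-difference approach needs two function evaluations per input variable, while autograd obtains all partial derivatives from a single backward pass.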

## 4.2 requires_grad

In [None]:
# By default, tensors don't track gradients
x = torch.tensor([1.0, 2.0, 3.0])
print(f"requires_grad: {x.requires_grad}")

# Enable gradient tracking
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"requires_grad: {x.requires_grad}")

In [None]:
# When requires_grad=True, operations build a computation graph
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2  # y = x^2
z = 2 * y   # z = 2 * x^2

print(f"x = {x}")
print(f"y = x^2 = {y}")
print(f"z = 2*y = {z}")
print(f"\ny.grad_fn: {y.grad_fn}")  # Tracks how y was created
print(f"z.grad_fn: {z.grad_fn}")

## 4.3 Computing Gradients with backward()

In [None]:
# Simple example: y = x^2, find dy/dx
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Compute gradients
y.backward()  # dy/dx

print(f"x = {x.item()}")
print(f"y = x^2 = {y.item()}")
print(f"dy/dx = 2x = {x.grad.item()}")  # 2 * 2 = 4

In [None]:
# More complex example: chain rule
# y = (x + 2)^2
# dy/dx = 2(x + 2)

x = torch.tensor([3.0], requires_grad=True)
y = (x + 2) ** 2

y.backward()

print(f"x = {x.item()}")
print(f"y = (x+2)^2 = {y.item()}")
print(f"dy/dx = 2(x+2) = {x.grad.item()}")  # 2 * (3 + 2) = 10

In [None]:
# Multiple variables
# z = x^2 + y^2
# dz/dx = 2x, dz/dy = 2y

x = torch.tensor([3.0], requires_grad=True)
y = torch.tensor([4.0], requires_grad=True)

z = x**2 + y**2

z.backward()

print(f"z = x^2 + y^2 = {z.item()}")
print(f"dz/dx = 2x = {x.grad.item()}")  # 6
print(f"dz/dy = 2y = {y.grad.item()}")  # 8

## 4.4 Vector-Valued Functions

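Calling `backward()` on a non-scalar tensor requires a `gradient` argument with the same shape as the output. In practice, the output is usually reduced to a scalar loss first, as in the example below.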

In [None]:
# For neural networks, we usually sum the loss first
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
loss = y.sum()  # Scalar output

loss.backward()

print(f"x = {x}")
print(f"y = x^2 = {y}")
print(f"loss = sum(y) = {loss.item()}")
print(f"d(loss)/dx = 2x = {x.grad}")
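
For completeness, here is a sketch of the `gradient` argument itself. The tensor passed in weights each output element, so autograd computes a vector-Jacobian product; the weight vector `v` is an arbitrary example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Without a gradient argument, backward() on a vector raises an error
y = x ** 2
try:
    y.backward()
    vector_error = None
except RuntimeError as err:
    vector_error = str(err)
print(f"Error raised: {vector_error is not None}")

# A vector of ones is equivalent to calling y.sum().backward()
y = x ** 2
y.backward(gradient=torch.ones_like(y))
grad_ones = x.grad.clone()
print(f"gradient=ones: {grad_ones}")   # 2x

# Any other vector weights the outputs: result is v * 2x
x.grad.zero_()
y = x ** 2
v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)
grad_weighted = x.grad.clone()
print(f"gradient=v:    {grad_weighted}")
```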

## 4.5 Gradient Accumulation

**Important:** Gradients accumulate by default! Each `backward()` call adds to `.grad`, so zero the gradients before each backward pass unless you are accumulating on purpose.

In [None]:
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After first backward: grad = {x.grad.item()}")

# Second backward (without zeroing)
y = x ** 2
y.backward()
print(f"After second backward (accumulated!): grad = {x.grad.item()}")

# Correct way: zero gradients first
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing and backward: grad = {x.grad.item()}")
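
In a training loop, the zeroing is usually delegated to an optimizer rather than done tensor by tensor. The sketch below assumes `torch.optim.SGD` and a toy objective, `(w - 3)^2`; the learning rate and step count are arbitrary.

```python
import torch

w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(50):
    optimizer.zero_grad()            # Clear gradients left over from the previous step
    loss = ((w - 3.0) ** 2).sum()    # Toy loss, minimized at w = 3
    loss.backward()                  # Compute fresh gradients
    optimizer.step()                 # Update w using w.grad

print(f"w after training: {w.item():.4f}")
```

Leaving out `optimizer.zero_grad()` would make every step use the sum of all previous gradients instead of the current one.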

## 4.6 Detaching and No-Grad Context

In [None]:
# Detach: remove from computation graph
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

y_detached = y.detach()  # No longer tracks gradients
print(f"y requires_grad: {y.requires_grad}")
print(f"y_detached requires_grad: {y_detached.requires_grad}")

In [None]:
# torch.no_grad(): disable gradient tracking temporarily
# Used during inference/evaluation

x = torch.tensor([2.0], requires_grad=True)

# Normal operation (tracks gradients)
y = x ** 2
print(f"Normal: y.requires_grad = {y.requires_grad}")

# With no_grad
with torch.no_grad():
    y_no_grad = x ** 2
    print(f"In no_grad: y.requires_grad = {y_no_grad.requires_grad}")

### When to Use No-Grad

- **Inference/Evaluation**: No need to compute gradients
- **Memory saving**: Gradient tracking uses memory
- **Speed**: Slightly faster without tracking

## 4.7 A Neural Network Gradient Example

Let's see how autograd works for a single sigmoid neuron.

In [None]:
# Simple neuron: y = sigmoid(w*x + b)
# Loss = (y - target)^2

# Input
x = torch.tensor([1.0])
target = torch.tensor([0.0])

# Learnable parameters
w = torch.tensor([0.5], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

# Forward pass
z = w * x + b
y = torch.sigmoid(z)
loss = (y - target) ** 2

print(f"w = {w.item():.3f}, b = {b.item():.3f}")
print(f"z = w*x + b = {z.item():.3f}")
print(f"y = sigmoid(z) = {y.item():.3f}")
print(f"loss = (y - target)^2 = {loss.item():.3f}")

# Backward pass
loss.backward()

print(f"\nGradients:")
print(f"d(loss)/dw = {w.grad.item():.4f}")
print(f"d(loss)/db = {b.grad.item():.4f}")

In [None]:
# Let's verify the gradient manually using chain rule
# loss = (sigmoid(w*x + b) - target)^2
# d(loss)/dw = 2*(y-target) * y*(1-y) * x

y_val = y.item()
manual_grad_w = 2 * (y_val - target.item()) * y_val * (1 - y_val) * x.item()
print(f"Manual d(loss)/dw = {manual_grad_w:.4f}")
print(f"Autograd d(loss)/dw = {w.grad.item():.4f}")
print(f"Match: {abs(manual_grad_w - w.grad.item()) < 1e-6}")
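
The gradients are what a training step consumes. As a sketch, one manual gradient-descent update for the same neuron looks like this; the learning rate `lr` and the helper `forward_loss` are assumptions introduced for the example.

```python
import torch

x = torch.tensor([1.0])
target = torch.tensor([0.0])
w = torch.tensor([0.5], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
lr = 0.5  # hypothetical learning rate

def forward_loss():
    # Same neuron and loss as above: (sigmoid(w*x + b) - target)^2
    y = torch.sigmoid(w * x + b)
    return ((y - target) ** 2).sum()

loss_before = forward_loss()
loss_before.backward()

# Update the parameters without recording the update in the graph
with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

# Reset gradients so the next backward pass starts from zero
w.grad.zero_()
b.grad.zero_()

loss_after = forward_loss()
print(f"Loss before: {loss_before.item():.4f}")
print(f"Loss after:  {loss_after.item():.4f}")
```

The in-place updates are safe here because they run inside `torch.no_grad()`.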

---

# Key Points Summary

---

## Tensors
- Tensors are multi-dimensional arrays, the core data structure
- Key attributes: shape, dtype, device
- Use float32 for deep learning (memory efficient, fast on GPU)

## NumPy Integration
- `torch.from_numpy()` shares memory
- `torch.tensor()` copies data
- `.numpy()` returns a NumPy array that shares memory with the tensor (CPU tensors only)

## GPU Acceleration
- Use `device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')`
- Move tensors with `.to(device)`
- All tensors in an operation must be on the same device

## Autograd
- Set `requires_grad=True` to track gradients
- Call `.backward()` to compute gradients
- Gradients accumulate across `.backward()` calls; reset them with `x.grad.zero_()`
- Use `torch.no_grad()` during inference

---

# Interview Tips

---

## Common Questions

**Q: What is a tensor?**
A: A tensor is a multi-dimensional array, similar to a NumPy ndarray but with GPU support and automatic differentiation. It's the fundamental data structure in PyTorch.

**Q: What is the difference between `.view()` and `.reshape()`?**
A: `.view()` never copies: it requires a memory layout compatible with the new shape (a contiguous tensor always qualifies) and returns a tensor that shares memory with the original. `.reshape()` returns a view when possible and a copy otherwise. Use `.reshape()` when unsure.

**Q: How does autograd work?**
A: When `requires_grad=True`, PyTorch builds a computational graph that records operations. When `.backward()` is called, it traverses this graph backwards, applying the chain rule to compute gradients.

**Q: Why do we need to zero gradients?**
A: PyTorch accumulates gradients by default. This is useful for implementing gradient accumulation across batches, but during normal training, we need to zero gradients before each backward pass.

**Q: What is the purpose of `torch.no_grad()`?**
A: It temporarily disables gradient computation. Used during inference to save memory and speed up computation since we don't need gradients for predictions.
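
The answer on zeroing gradients mentions gradient accumulation across batches. A minimal sketch of that idea: summing the scaled gradients of two micro-batches reproduces the full-batch gradient. The data, the linear model, and the helper `mse` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)          # 8 samples, 3 features
y = torch.randn(8)
w = torch.zeros(3, requires_grad=True)

def mse(X_batch, y_batch):
    # Mean squared error of a linear model X @ w
    return ((X_batch @ w - y_batch) ** 2).mean()

# Gradient from the full batch
mse(X, y).backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Same gradient from two micro-batches of 4, without zeroing in between
num_chunks = 2
for X_chunk, y_chunk in zip(X.chunk(num_chunks), y.chunk(num_chunks)):
    (mse(X_chunk, y_chunk) / num_chunks).backward()   # Scale so the sum matches the full mean
accumulated_grad = w.grad.clone()

print(f"Full batch:  {full_grad}")
print(f"Accumulated: {accumulated_grad}")
```

This is useful when a full batch does not fit in memory: the optimizer step is taken only after all micro-batches have contributed.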

---

# Practice Exercises

---

## Exercise 1: Tensor Creation

Create a 3x4 tensor with values from a normal distribution with mean=5 and std=2.

In [None]:
# Your code here
tensor = None  # Replace

# Verify
# print(f"Shape: {tensor.shape}")
# print(f"Approximate mean: {tensor.mean():.2f}")

## Exercise 2: Reshaping

Given a batch of 32 images of shape (32, 3, 28, 28), flatten each image to feed into a linear layer.

In [None]:
images = torch.randn(32, 3, 28, 28)

# Your code here
# Reshape to (32, 3*28*28) = (32, 2352)
flattened = None  # Replace

# print(f"Flattened shape: {flattened.shape}")

## Exercise 3: Gradient Computation

For f(x) = x^3 - 2x^2 + x, compute df/dx at x=3 using autograd.

In [None]:
# Your code here
x = None  # Create tensor with requires_grad=True
# Compute f and backward

# The analytical derivative is: 3x^2 - 4x + 1
# At x=3: 3*9 - 12 + 1 = 16

## Solutions

In [None]:
# Exercise 1
print("Exercise 1:")
tensor = torch.randn(3, 4) * 2 + 5  # N(0,1) * std + mean
print(f"Shape: {tensor.shape}")
print(f"Approximate mean: {tensor.mean():.2f}")

# Exercise 2
print("\nExercise 2:")
images = torch.randn(32, 3, 28, 28)
flattened = images.flatten(start_dim=1)  # or images.view(32, -1)
print(f"Flattened shape: {flattened.shape}")

# Exercise 3
print("\nExercise 3:")
x = torch.tensor([3.0], requires_grad=True)
f = x**3 - 2*x**2 + x
f.backward()
print(f"f(3) = {f.item()}")
print(f"df/dx at x=3: {x.grad.item()} (analytical: 16)")

---

## Next Module: [04 - The Neuron](../04_the_neuron/04_neuron.ipynb)

Now that we understand PyTorch basics, let's dive into the fundamental building block of neural networks - the artificial neuron.