# PyTorch — Complete Foundations

PyTorch is the dominant framework for deep learning research and is widely used in production. It was originally developed at Meta AI and is now governed by the PyTorch Foundation; it has become the standard in academia and is widely used in industry alongside TensorFlow. Understanding PyTorch at its foundation, rather than just calling high-level APIs, is what separates someone who can copy tutorials from someone who can debug models, design architectures, and understand what is actually happening during training.

PyTorch has two fundamental components. First, a tensor library that works like NumPy but can run on GPUs. Second, an automatic differentiation engine called Autograd that records the operations you perform on tensors marked with `requires_grad=True` and can then compute the gradient of any differentiable result with respect to those tensors automatically. Everything else in PyTorch (neural network layers, optimizers, loss functions) is built on these two primitives.

If you have read the NumPy notebook, you will find the tensor API immediately familiar. The key differences are the device system, the default dtype (`float32` vs NumPy's `float64`), and the presence of `requires_grad`.

---
## Installation

PyTorch comes pre-installed on Google Colab, so the `pip install` line in the cell below is only needed on a fresh local environment. The rest of the cell confirms the version and whether CUDA is available. If you are running this in Colab, go to Runtime > Change runtime type > T4 GPU to enable GPU support.

In [None]:
!pip install --upgrade torch torchvision torchaudio

import torch
import numpy as np

print("PyTorch version :", torch.__version__)
print("CUDA available  :", torch.cuda.is_available())
print("MPS available   :", torch.backends.mps.is_available())

if torch.cuda.is_available():
    print("GPU name        :", torch.cuda.get_device_name(0))
    print("GPU memory (GB) :", torch.cuda.get_device_properties(0).total_memory / 1e9)

---
## Device Setup

Establishing the device once at the top of every script or notebook is standard practice. All tensors and models are moved to this device. Code written this way runs identically whether a GPU is available or not.

In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print("Active device:", device)

This three-way check (CUDA for NVIDIA GPUs, MPS for Apple Silicon, CPU as the fallback) is a common device selection pattern in modern PyTorch code, and you will see it at the top of many repositories. Writing `device = "cuda" if torch.cuda.is_available() else "cpu"` is the older, simpler version. Both forms work anywhere a device is expected; the `torch.device` object is more explicit, exposes `.type` and `.index` attributes, and formats cleanly in strings like `f"Using: {device}"`.

---
## Creating Tensors

A tensor is PyTorch's fundamental data structure. It is conceptually identical to a NumPy ndarray with the addition of device and gradient tracking. The creation API mirrors NumPy almost exactly.

In [None]:
a = torch.tensor([1, 2, 3, 4, 5])
b = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
c = torch.tensor([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])

print("1D:", a)
print("2D:\n", b)
print("3D shape:", c.shape)

`torch.tensor` always copies the data and infers dtype from the Python or NumPy input. If the input is a Python list of integers, you get `int64`. If the input contains floats, you get `float32`. This is different from NumPy, which defaults to `float64` for floats. The dtype difference matters when you mix PyTorch and NumPy — converting a `float64` NumPy array directly to PyTorch gives a `float64` tensor, which most model layers will reject because they expect `float32`.
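The dtype rules above are easy to check directly. The following is a minimal sketch, not part of the original notebook; the variable names and the small `nn.Linear(3, 1)` layer are illustrative assumptions. It shows the inferred dtypes and the explicit cast that makes a `float64` NumPy array usable by a `float32` layer.

```python
import numpy as np
import torch

ints = torch.tensor([1, 2, 3])            # Python ints   -> int64
floats = torch.tensor([1.0, 2.0, 3.0])    # Python floats -> float32

np_arr = np.array([[1.0, 2.0, 3.0]])      # NumPy default float dtype is float64
from_np = torch.tensor(np_arr)            # dtype is preserved: float64
as_f32 = from_np.float()                  # explicit cast to float32

layer = torch.nn.Linear(3, 1)             # parameters are float32 by default
try:
    layer(from_np)                        # float64 input against float32 weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

out = layer(as_f32)                       # works after the cast

print(ints.dtype, floats.dtype, from_np.dtype, as_f32.dtype)
print("dtype mismatch raised:", mismatch_raised)
print("output shape:", out.shape)
```

In practice the cast is usually done once, right after loading the data, so that every later operation sees `float32`.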

In [None]:
zeros   = torch.zeros(3, 4)
ones    = torch.ones(2, 3)
full    = torch.full((2, 2), fill_value=3.14)
eye     = torch.eye(4)
empty   = torch.empty(2, 3)

print("zeros:\n", zeros)
print("\nones:\n", ones)
print("\nfull(3.14):\n", full)
print("\neye(4):\n", eye)

Note that `torch.zeros` and `torch.ones` accept the shape as separate arguments, `torch.zeros(3, 4)`, as well as a tuple, `torch.zeros((3, 4))`, whereas NumPy's `np.zeros((3, 4))` requires the tuple. The unpacked form is more common in PyTorch code. `torch.full` is the exception: its shape must be passed as a tuple, followed by the fill value. `torch.empty` allocates memory without initializing it, so its contents are arbitrary (which is why it is not printed above); it is faster when you know you will fill every element immediately.

In [None]:
torch.manual_seed(42)

uniform  = torch.rand(3, 4)
normal   = torch.randn(3, 4)
integers = torch.randint(low=0, high=10, size=(3, 4))
perm     = torch.randperm(8)

print("Uniform [0,1):\n", uniform)
print("\nStandard normal:\n", normal)
print("\nRandom integers:\n", integers)
print("\nRandom permutation:", perm)

`torch.manual_seed` sets the random seed for reproducibility. Setting it ensures that every time you run this code, `torch.rand` and `torch.randn` produce the same numbers. `torch.randperm` generates a random permutation of integers from 0 to n-1 — this is used directly in batch shuffling. In a data loader, you call `randperm(dataset_size)` and index into your data with the result to create a shuffled epoch.
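To make the shuffling idea concrete, here is a small sketch of one shuffled epoch built by hand with `torch.randperm`. The toy `features` and `labels` tensors and the batch size are hypothetical; a real project would normally let `torch.utils.data.DataLoader` do this.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy dataset: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)

batch_size = 4
order = torch.randperm(features.shape[0])    # shuffled sample indices 0..9

batches = []
for start in range(0, len(order), batch_size):
    idx = order[start:start + batch_size]    # indices for this mini-batch
    batches.append((features[idx], labels[idx]))

for xb, yb in batches:
    print(tuple(xb.shape), yb.tolist())
```

Because the same index tensor is applied to both `features` and `labels`, each sample stays paired with its label, and every sample appears exactly once per epoch.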

In [None]:
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

print("zeros_like:\n", torch.zeros_like(x))
print("ones_like:\n",  torch.ones_like(x))
print("rand_like:\n",  torch.rand_like(x))
print("empty_like shape:", torch.empty_like(x).shape)

The `_like` family creates tensors with the same shape, dtype, and device as the input tensor. This is a critical pattern in model code. When you compute a mask, add Gaussian noise, or create a bias term, you want it on the same device as the existing tensors without explicitly specifying device every time. Using `torch.zeros_like(x)` instead of `torch.zeros(x.shape, device=x.device, dtype=x.dtype)` is cleaner and less error-prone.

In [None]:
arange = torch.arange(0, 10, 2)
linsp  = torch.linspace(0.0, 1.0, 5)

print("arange(0,10,2)  :", arange)
print("linspace(0,1,5) :", linsp)

`torch.arange` is the range equivalent. `torch.linspace` gives evenly spaced points from start to stop inclusive. These appear in positional encodings, learning rate schedulers, and plotting activation curves.

---
## Tensor Attributes

Every PyTorch tensor has four key attributes: `shape`, `dtype`, `device`, and `requires_grad`. The first two mirror NumPy. The device system and gradient tracking are what PyTorch adds.

In [None]:
x = torch.randn(3, 4, 5)

print("shape         :", x.shape)
print("dtype         :", x.dtype)
print("device        :", x.device)
print("requires_grad :", x.requires_grad)
print("ndim          :", x.ndim)
print("numel         :", x.numel())
print("is_contiguous :", x.is_contiguous())

`numel()` returns the total number of elements. `is_contiguous()` tells you whether the tensor's data is stored in a contiguous block of memory in C-order. Transposing a tensor makes it non-contiguous. Some operations require contiguous memory and will raise an error if the tensor is not — calling `.contiguous()` before `.view()` is a common fix.

`requires_grad` is the switch that activates Autograd. Leaf tensors with `requires_grad=True` are the inputs to the computation graph — model parameters are exactly this. Setting this flag tells PyTorch: "when gradients are computed, I need the gradient with respect to this tensor."

In [None]:
dtypes = [
    (torch.float16,   "float16  (half precision)"),
    (torch.float32,   "float32  (single — default)"),
    (torch.float64,   "float64  (double)"),
    (torch.int8,      "int8"),
    (torch.int32,     "int32"),
    (torch.int64,     "int64    (default int)"),
    (torch.bool,      "bool"),
    (torch.bfloat16,  "bfloat16 (used in TPU/modern GPUs)"),
]

for dtype, label in dtypes:
    t = torch.tensor([1.0], dtype=dtype)
    print(f"{label:40s} | itemsize: {t.element_size()} bytes")

dtype selection has real-world consequences. `float32` is the standard for model parameters and activations. `float16` halves memory and speeds up matrix multiplications on modern GPUs (A100, H100) but has a narrower range, risking overflow. `bfloat16` has the same range as `float32` but lower precision — it is the preferred half-precision format for training large language models. `int64` is used for class labels and token indices. Feeding an `int64` label tensor into a loss function that expects `float32` is a common error.
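The range and precision trade-off between the floating-point formats can be inspected with `torch.finfo`. The sketch below is illustrative only; the value `70000.0` is an arbitrary number chosen because it exceeds the largest finite `float16` value.

```python
import torch

# Largest finite value and machine epsilon for each floating-point dtype.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.3e}")

big = torch.tensor(70000.0)
as_half = big.to(torch.float16)    # exceeds the float16 range and becomes inf
as_bf16 = big.to(torch.bfloat16)   # stays finite, but is rounded coarsely

print("float16 :", as_half.item())
print("bfloat16:", as_bf16.item())
```

The larger `eps` of `bfloat16` is the price paid for its wider range: it keeps the exponent width of `float32` and gives up mantissa bits instead.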

In [None]:
x = torch.tensor([1, 2, 3], dtype=torch.int64)

print("Original         :", x.dtype)
print(".float()         :", x.float().dtype)
print(".double()        :", x.double().dtype)
print(".half()          :", x.half().dtype)
print(".to(torch.float32):", x.to(torch.float32).dtype)
print(".to(device)      : moves device + keeps dtype")

`.float()` is shorthand for `.to(torch.float32)`. `.to()` is the most flexible method — it accepts a dtype, a device, or both. You can write `x.to(device='cuda', dtype=torch.float16)` to move and cast in one call. In production model code, `.to(device)` is called on both the model and each batch of data at the start of the training loop.

---
## Reshaping Tensors

Reshaping in PyTorch follows the same concepts as NumPy. The one distinction you need to understand is between `view` and `reshape`.

In [None]:
a = torch.arange(24)
print("Original shape:", a.shape)

b = a.view(4, 6)
c = a.reshape(2, 3, 4)
d = a.view(-1, 6)
e = a.reshape(4, -1)

print("view(4, 6)    :", b.shape)
print("reshape(2,3,4):", c.shape)
print("view(-1, 6)   :", d.shape)
print("reshape(4, -1):", e.shape)

`view` requires the tensor to be contiguous in memory and always returns a view (shares data). `reshape` is more flexible — it returns a view when possible, and a copy when not. In practice, prefer `view` when you know the tensor is contiguous (freshly created, not transposed) for clarity of intent. If you hit a `RuntimeError: view size is not compatible`, call `.contiguous().view(...)`. The `-1` dimension inference works in both.
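The difference shows up as soon as a tensor is transposed. This is a small sketch with an arbitrary 2x3 example tensor: `view` fails on the non-contiguous transpose, while `reshape` and `.contiguous().view(...)` both succeed by copying.

```python
import torch

a = torch.arange(6).reshape(2, 3)
t = a.T                                  # same storage, different strides
print("transposed is contiguous:", t.is_contiguous())

try:
    t.view(6)                            # cannot be expressed as a view
    view_failed = False
except RuntimeError:
    view_failed = True
print("view on transposed tensor failed:", view_failed)

r = t.reshape(6)                         # falls back to a copy
c = t.contiguous().view(6)               # explicit copy, then a view of the copy
print("reshape result:", r.tolist())

b = a.view(3, 2)                         # a is contiguous, so this shares data
b[0, 0] = 100
print("a[0, 0] after writing through the view:", a[0, 0].item())
```

The last two lines are the reason the distinction matters: writing through a view changes the original tensor, while writing into a copy does not.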

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])
print("Original        :", x.shape)

print("unsqueeze(0)    :", x.unsqueeze(0).shape)
print("unsqueeze(1)    :", x.unsqueeze(1).shape)
print("unsqueeze(-1)   :", x.unsqueeze(-1).shape)
print("[None, :]       :", x[None, :].shape)
print("[:, None]       :", x[:, None].shape)

y = torch.randn(1, 3, 1, 5)
print("\nBefore squeeze  :", y.shape)
print("squeeze()       :", y.squeeze().shape)
print("squeeze(0)      :", y.squeeze(0).shape)
print("squeeze(2)      :", y.squeeze(2).shape)

`unsqueeze(dim)` inserts a size-1 dimension at the specified position. `None` indexing is an alias — `x[None, :]` is equivalent to `x.unsqueeze(0)`. You constantly use this to make shapes broadcastable. For example, an attention mask of shape `(batch, seq_len)` must be unsqueezed to `(batch, 1, 1, seq_len)` before being added to attention scores of shape `(batch, heads, seq_len, seq_len)`. `squeeze` removes size-1 dimensions — useful when a model outputs shape `(batch, 1)` for binary classification and you need `(batch,)` for the loss function.
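The attention-mask case can be sketched as follows. The shapes, the `keep` padding mask, and the all-zero `scores` tensor are hypothetical stand-ins chosen so the effect of broadcasting is easy to see.

```python
import torch

batch, heads, seq_len = 2, 4, 5
scores = torch.zeros(batch, heads, seq_len, seq_len)   # stand-in attention scores

# Hypothetical padding mask: True marks real tokens, False marks padding.
keep = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True,  True]])

# Additive mask: 0 where attention is allowed, -inf where it is blocked.
additive = torch.zeros(batch, seq_len).masked_fill(~keep, float("-inf"))
additive = additive[:, None, None, :]    # same as .unsqueeze(1).unsqueeze(2)

masked = scores + additive               # broadcasts over heads and query positions
weights = masked.softmax(dim=-1)

print(additive.shape, masked.shape)
print(weights[0, 0, 0])                  # padded positions receive zero weight
```

The two inserted size-1 dimensions line up with the `heads` and query-position dimensions, so one `(batch, seq_len)` mask is reused for every head and every query.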

In [None]:
x = torch.randn(2, 3, 4)
print("Original shape          :", x.shape)

perm = x.permute(0, 2, 1)
print("permute(0, 2, 1) shape  :", perm.shape)

m = torch.randn(3, 4)
print("\nMatrix shape            :", m.shape)
print(".T shape                :", m.T.shape)
print("transpose(0,1) shape    :", m.transpose(0, 1).shape)

`permute` reorders all dimensions in one call by specifying the desired order. `transpose(dim0, dim1)` swaps two specific dimensions, and `.T` is the shorthand for transposing a 2D matrix. In image processing, data often arrives as `(batch, height, width, channels)` (NHWC format) and PyTorch convolutions expect `(batch, channels, height, width)` (NCHW). The conversion is `x.permute(0, 3, 1, 2)`. In Transformer attention, the query, key, and value tensors are permuted multiple times to align dimensions for batched matrix multiplication.
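As a quick illustration of the NHWC to NCHW conversion, the sketch below permutes a random image batch and passes it through a convolution. The image size and the `nn.Conv2d` configuration are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

nhwc = torch.randn(8, 32, 32, 3)         # (batch, height, width, channels)
nchw = nhwc.permute(0, 3, 1, 2)          # (batch, channels, height, width)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = conv(nchw)

print("NHWC:", tuple(nhwc.shape))
print("NCHW:", tuple(nchw.shape))
print("conv output:", tuple(out.shape))

# The pixel at (row 5, col 7) of image 0 holds the same channel values either way.
print(torch.equal(nchw[0, :, 5, 7], nhwc[0, 5, 7, :]))
```

`permute` only changes how the existing data is indexed, so the result is a non-contiguous view; call `.contiguous()` afterwards if a later step needs `view`.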

In [None]:
batch_features = torch.randn(32, 8, 8, 64)
print("Input shape (batch, H, W, C):", batch_features.shape)

flat = batch_features.flatten(start_dim=1)
print("After flatten(start_dim=1)  :", flat.shape)

flat2 = batch_features.view(32, -1)
print("After view(32, -1)          :", flat2.shape)

This is the canonical "flatten before a linear layer" operation. A convolutional network produces a 4D tensor. Before the classification head (a linear layer), you collapse all non-batch dimensions into one. `flatten(start_dim=1)` leaves the batch dimension untouched and flattens everything else. This is equivalent to `view(batch_size, -1)`. The resulting shape `(32, 4096)` is then fed into a `nn.Linear(4096, num_classes)` layer.

---
## Indexing and Slicing

PyTorch indexing is essentially identical to NumPy. The same slice notation, boolean masking, and fancy indexing all work.

In [None]:
x = torch.tensor([[10, 20, 30, 40],
                  [50, 60, 70, 80],
                  [90,100,110,120]])

print("x[0, 0]    =", x[0, 0])
print("x[1, 2]    =", x[1, 2])
print("x[-1, -1]  =", x[-1, -1])
print("Scalar val :", x[0, 0].item())

print("\nRow 0              :", x[0])
print("Column 1           :", x[:, 1])
print("First two rows     :\n", x[:2])
print("Submatrix [0:2,1:3]:\n", x[0:2, 1:3])
print("Every other col    :", x[:, ::2])

The key difference from NumPy: accessing a single element gives a 0-dimensional tensor rather than a scalar value. Call `.item()` to extract a Python scalar. This is essential when logging loss values: `loss.item()` returns a plain Python number, so the logged value does not keep the loss tensor and its computation graph alive in memory.

In [None]:
x = torch.tensor([3.0, 7.0, -1.0, 12.0, -5.0, 9.0])

mask = x > 0
print("Mask        :", mask)
print("Positive    :", x[mask])

y = x.clone()
y[y < 0] = 0.0
print("ReLU manual :", y)

result = torch.where(x > 0, x, torch.zeros_like(x))
print("torch.where :", result)

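`torch.where(condition, x, y)` selects elements from `x` where condition is True and from `y` where it is False. This is the vectorized ternary operator, and because it is out-of-place it always works with autograd. In-place boolean assignment such as `y[y < 0] = 0` is more fragile: it raises an error on a leaf tensor that requires gradients, and it can break `backward()` if it overwrites values that autograd saved. When implementing custom activation functions that need gradients, prefer `torch.where` or the `torch.nn.functional` equivalents over in-place boolean assignment.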
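A short check of how the two approaches interact with autograd, using an arbitrary four-element input. This is a sketch of one failure mode (an in-place write on a leaf tensor), not an exhaustive list.

```python
import torch

x = torch.tensor([3.0, -1.0, 2.0, -5.0], requires_grad=True)

# Out-of-place ReLU via torch.where: gradients flow through the selected branch.
relu_out = torch.where(x > 0, x, torch.zeros_like(x))
relu_out.sum().backward()
print("gradient:", x.grad.tolist())      # 1 where x > 0, 0 elsewhere

# In-place masked assignment directly on a leaf tensor that requires grad.
try:
    x[x < 0] = 0.0
    leaf_write_failed = False
except RuntimeError:
    leaf_write_failed = True
print("in-place write on leaf failed:", leaf_write_failed)
```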

In [None]:
logits = torch.tensor([[0.1, 2.5, -0.3],
                       [1.2, 0.5, 3.1],
                       [0.8, 1.9, 0.2]])

labels = torch.tensor([1, 2, 1])

gathered = logits.gather(dim=1, index=labels.unsqueeze(1))
print("Logits:\n", logits)
print("Labels :", labels)
print("Gathered (correct class logits):", gathered.squeeze())

`gather` picks elements along a dimension using an index tensor. Here it collects the logit corresponding to the correct class for each sample in the batch. This operation appears inside cross-entropy loss implementations and in reinforcement learning when you need to select Q-values for the actions that were actually taken. It is fully differentiable.
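To connect `gather` to cross-entropy, the sketch below computes the loss by hand (log-softmax, gather the correct-class entry, negate, average) and compares it with `torch.nn.functional.cross_entropy`. It reuses the logits and labels from the cell above; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.1, 2.5, -0.3],
                       [1.2, 0.5, 3.1],
                       [0.8, 1.9, 0.2]], requires_grad=True)
labels = torch.tensor([1, 2, 1])

log_probs = F.log_softmax(logits, dim=1)
# Pick the log-probability of the correct class for each row.
picked = log_probs.gather(dim=1, index=labels.unsqueeze(1)).squeeze(1)
manual_loss = -picked.mean()

reference = F.cross_entropy(logits, labels)
print("manual   :", manual_loss.item())
print("reference:", reference.item())

manual_loss.backward()                   # gradients flow back through gather
print("grad shape:", tuple(logits.grad.shape))
```

Gradients flow to the gathered values, not to the integer index tensor, which is what a loss function needs.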

---
## Math Operations

PyTorch math mirrors NumPy's element-wise design. All standard operators are overloaded and every operation has a functional equivalent in `torch.*`.

In [None]:
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])

print("a + b  :", a + b)
print("a - b  :", a - b)
print("a * b  :", a * b)
print("a / b  :", a / b)
print("a ** 2 :", a ** 2)
print("a % 3  :", a % 3)

print("\ntorch.sin  :", torch.sin(a))
print("torch.exp  :", torch.exp(a))
print("torch.log  :", torch.log(a))
print("torch.sqrt :", torch.sqrt(a))
print("torch.abs  :", torch.abs(torch.tensor([-3.0, -1.0, 2.0])))

All of these operations are tracked by Autograd when the input tensors have `requires_grad=True`. Every arithmetic operation adds a node to the computation graph. The graph is built dynamically — this is what PyTorch calls define-by-run (dynamic computation graph), as opposed to TensorFlow 1.x's define-then-run approach. You can use normal Python control flow (if statements, for loops) and the graph is built differently each forward pass.

In [None]:
A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
B = torch.tensor([[7.0, 8.0],
                  [9.0, 10.0],
                  [11.0, 12.0]])

C = A @ B
print("A shape:", A.shape, " B shape:", B.shape)
print("A @ B shape:", C.shape)
print("A @ B:\n", C)

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])
print("\nDot product        :", torch.dot(u, v))

A_batch = torch.randn(8, 3, 4)
B_batch = torch.randn(8, 4, 5)
C_batch = torch.bmm(A_batch, B_batch)
print("\nBatched matmul (bmm) shape:", C_batch.shape)

`@` and `torch.matmul` handle both 2D matrix multiplication and batched variants for higher-dimensional tensors. `torch.bmm` (batch matrix multiply) is the explicit 3D-only version. In Transformer attention, `scores = queries @ keys.transpose(-2, -1)` computes the raw attention scores for tensors of shape `(batch, heads, seq_len, head_dim)`; scaled dot-product attention then divides these scores by `sqrt(head_dim)` and applies a softmax. The `@` operator broadcasts correctly over the batch and head dimensions.
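The attention computation mentioned above can be written out in a few lines. This is a minimal sketch with random inputs and arbitrary sizes, without masking or dropout; the names `q`, `k`, and `v` are assumptions for the example.

```python
import math
import torch

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 4, 6, 8

q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# (batch, heads, seq_len, head_dim) @ (batch, heads, head_dim, seq_len)
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
weights = scores.softmax(dim=-1)         # each query's weights sum to 1
out = weights @ v                        # weighted sum of the value vectors

print("scores :", tuple(scores.shape))
print("weights:", tuple(weights.shape))
print("out    :", tuple(out.shape))
```

Only the last two dimensions take part in each matrix product; the leading `batch` and `heads` dimensions are treated as independent batches.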

In [None]:
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])

print("sum all    :", x.sum())
print("sum dim=0  :", x.sum(dim=0))
print("sum dim=1  :", x.sum(dim=1))
print("mean all   :", x.mean())
print("max dim=1  :", x.max(dim=1))
print("argmax dim=1:",x.argmax(dim=1))
print("std dim=0  :", x.std(dim=0))
print("norm       :", x.norm())
print("norm dim=1 :", x.norm(dim=1))

In PyTorch the keyword is `dim=`, which plays the role of NumPy's `axis=`. `x.max(dim=1)` returns a named tuple with `.values` and `.indices`. `x.argmax(dim=1)` is the direct equivalent of NumPy's argmax and is used to convert batch logit tensors to predicted class indices for accuracy computation. `x.std` applies Bessel's correction (dividing by n-1) by default, whereas `np.std` divides by n, so the two give different results on the same data unless you change one of the defaults. `x.norm(dim=1)` computes the L2 norm of each row; this is used in gradient clipping and embedding normalization.
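Two of these reductions appear in almost every evaluation loop. The sketch below uses made-up logits and labels to show the named tuple returned by `max`, an accuracy computation via `argmax`, and row normalization with `norm(..., keepdim=True)`.

```python
import torch

logits = torch.tensor([[0.1, 2.5, -0.3],
                       [1.2, 0.5, 3.1],
                       [0.8, 1.9, 0.2]])
labels = torch.tensor([1, 2, 0])         # hypothetical ground-truth classes

values, indices = logits.max(dim=1)      # the named tuple unpacks like a regular tuple
preds = logits.argmax(dim=1)             # same as indices
accuracy = (preds == labels).float().mean().item()

# keepdim=True keeps a size-1 dimension so the division broadcasts row by row.
unit_rows = logits / logits.norm(dim=1, keepdim=True)

print("predictions:", preds.tolist())
print("accuracy   :", round(accuracy, 3))
print("row norms  :", unit_rows.norm(dim=1).tolist())
```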

In [None]:
x = torch.tensor([1.0, 2.0, 3.0])

y = x + 10
print("Out-of-place (x + 10):", y, "| x unchanged:", x)

x.add_(10)
print("In-place (add_)      :", x)

x.mul_(2)
print("In-place (mul_)      :", x)

x.zero_()
print("In-place (zero_)     :", x)

In-place operations use a trailing underscore (`add_`, `mul_`, `zero_`). They modify the tensor without allocating new memory, which is faster and reduces memory usage, but there is a critical constraint: **avoid in-place operations on tensors that are part of the computation graph**. PyTorch saves intermediate values for backpropagation, and overwriting one of them makes `backward()` raise a RuntimeError; applying an in-place operation directly to a leaf tensor that requires gradients raises immediately. The legitimate uses happen outside the graph: updating parameters inside `torch.no_grad()`, and resetting gradients with `.grad.zero_()` or `optimizer.zero_grad()` (which, since PyTorch 2.0, sets gradients to `None` by default instead of zeroing them in place).
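The constraint is easiest to understand by triggering it. The sketch below shows two representative cases with arbitrary values: an in-place write on a leaf tensor that requires gradients, and an in-place write on an intermediate result (`exp` is used here because its backward pass needs its own output).

```python
import torch

# Case 1: in-place op directly on a leaf tensor that requires grad.
w = torch.tensor([1.0, 2.0], requires_grad=True)
try:
    w.add_(1.0)
    leaf_error = False
except RuntimeError:
    leaf_error = True

# The same update is allowed when autograd is switched off.
with torch.no_grad():
    w.add_(1.0)

# Case 2: overwriting an intermediate value that backward() needs.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.exp()          # exp saves its output for the backward pass
y.add_(1.0)          # modifies that saved output in place
try:
    y.sum().backward()
    backward_error = False
except RuntimeError:
    backward_error = True

print("leaf in-place error    :", leaf_error)
print("backward after in-place:", backward_error)
print("w after no_grad update :", w.tolist())
```

Note that the second error is only raised when `backward()` runs, which can be far from the line that caused it.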

---
## Autograd: Automatic Differentiation

Autograd is the engine that makes training neural networks possible. Instead of manually deriving and coding gradient formulas, you define the forward computation and PyTorch automatically computes all gradients via backpropagation.

The mechanism: PyTorch builds a directed acyclic graph (DAG) as you perform operations on tensors with `requires_grad=True`. Each node stores the operation and a function to compute its local gradient. When you call `.backward()`, PyTorch traverses this graph in reverse (reverse-mode automatic differentiation, also called backpropagation) applying the chain rule at every node.

In [None]:
x = torch.tensor(4.0, requires_grad=True)

y = x ** 2 + 3 * x + 5

print("x        :", x)
print("y = x^2 + 3x + 5 :", y)
print("grad_fn  :", y.grad_fn)

y.backward()

print("\ndy/dx at x=4     :", x.grad.item())
print("Expected (2x+3)  :", (2*4 + 3))

`grad_fn` shows the last operation in the graph — here `AddBackward0` because the last operation was an addition. After `backward()`, `x.grad` holds `dy/dx`. PyTorch computed this using the chain rule: `d/dx(x^2 + 3x + 5) = 2x + 3`. At `x=4`, that is `2*4+3 = 11`. This is exactly what gradient descent uses: it updates `x` by subtracting a fraction of this gradient to minimize `y`.
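One way to build trust in Autograd is to compare it against a numerical derivative. The sketch below uses a central finite difference with a hypothetical step size `h`, and `float64` to keep rounding error small.

```python
import torch

def f(x):
    return x ** 2 + 3 * x + 5

x = torch.tensor(4.0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_value = x.grad.item()

h = 1e-6                                 # illustrative finite-difference step
with torch.no_grad():
    numeric_value = ((f(x + h) - f(x - h)) / (2 * h)).item()

print("autograd      :", autograd_value)
print("finite diff   :", numeric_value)
print("analytic 2x+3 :", 2 * 4 + 3)
```

Autograd returns the exact derivative for this function, while the finite difference is only an approximation. The same comparison is a common way to test hand-written gradient code.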

In [None]:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(4.0, requires_grad=True)

z = (a + b) * c

print("z = (a+b)*c =", z.item())

z.backward()

print("\ndz/da :", a.grad.item(), "  (expected:", c.item(), ")")
print("dz/db :", b.grad.item(), "  (expected:", c.item(), ")")
print("dz/dc :", c.grad.item(), "  (expected:", (a+b).item(), ")")

All leaf tensors with `requires_grad=True` receive gradients simultaneously from a single `backward()` call. PyTorch traverses the graph once and distributes gradients everywhere. In a neural network with millions of parameters, this single traversal computes gradients for every parameter at once. The chain rule for `z = (a+b)*c` gives `dz/da = dz/db = c = 4` and `dz/dc = a+b = 5` (the derivative of a product with respect to one factor is the other factor).

In [None]:
x = torch.tensor(3.0, requires_grad=True)

for i in range(4):
    y = x * 2
    y.backward()
    print(f"Pass {i+1}: x.grad = {x.grad.item()}")

print("\n--- With zero_grad() ---")
x.grad.zero_()

for i in range(4):
    y = x * 2
    y.backward()
    print(f"Pass {i+1}: x.grad = {x.grad.item()}")
    x.grad.zero_()

Gradients accumulate by default. Each `backward()` call adds to `x.grad` rather than replacing it. This is a deliberate design for cases where you want to sum gradients across multiple forward passes before updating (gradient accumulation, used when GPU memory is too small for the desired batch size). In normal training, you must zero gradients before each backward pass. `optimizer.zero_grad()` does this automatically for all model parameters.
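Gradient accumulation can be verified on a tiny problem. In this sketch (random data, a hypothetical mean-squared-error helper `mse`, and two micro-batches), scaling each micro-batch loss by the number of accumulation steps reproduces the full-batch gradient.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)
y = torch.randn(8)
w = torch.zeros(3, requires_grad=True)

def mse(Xb, yb):
    # Mean squared error of a linear model with weights w.
    return ((Xb @ w - yb) ** 2).mean()

# Reference: gradient from the full batch of 8 samples.
mse(X, y).backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Accumulate over two micro-batches of 4 samples without zeroing in between.
accum_steps = 2
for Xb, yb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    (mse(Xb, yb) / accum_steps).backward()
accum_grad = w.grad.clone()

print("full batch :", full_grad.tolist())
print("accumulated:", accum_grad.tolist())
```

The division by `accum_steps` matters: without it the accumulated gradient would be the sum of the micro-batch gradients rather than their average.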

In [None]:
x = torch.randn(3, requires_grad=True)
W = torch.randn(3, 3, requires_grad=True)

with torch.no_grad():
    y = W @ x
    print("Inside no_grad:")
    print("  y.requires_grad:", y.requires_grad)
    print("  y.grad_fn      :", y.grad_fn)

y_track = W @ x
print("\nOutside no_grad:")
print("  y.requires_grad:", y_track.requires_grad)
print("  y.grad_fn      :", y_track.grad_fn)

detached = x.detach()
print("\nDetached tensor requires_grad:", detached.requires_grad)

`torch.no_grad()` disables gradient tracking for all operations inside the block. No computation graph is built, which reduces memory usage significantly. Use this during evaluation and inference — there is no point tracking gradients when you will not call `backward()`. `.detach()` creates a new tensor that shares data with the original but is detached from the computation graph. This is used when you want to use a tensor's value as a constant in further computation, or when logging/visualizing intermediate values.

---
## GPU Operations

Moving computation to a GPU can provide 10x to 100x speedup for the matrix multiplications that dominate deep learning. PyTorch makes this transparent — the same code runs on CPU or GPU.

In [None]:
x_cpu = torch.randn(3, 4)
print("CPU tensor device:", x_cpu.device)

x_gpu = x_cpu.to(device)
print("GPU tensor device:", x_gpu.device)

y_gpu = torch.randn(3, 4, device=device)
print("Created on device:", y_gpu.device)

z = x_gpu + y_gpu
print("Result device    :", z.device)

z_cpu = z.cpu()
print("Back to CPU      :", z_cpu.device)

Tensors on different devices cannot interact. Attempting `cpu_tensor + gpu_tensor` raises a RuntimeError. All operands must be on the same device. In training loops, you move the model to the device once before training, then move each batch to the device at the start of each iteration. The pattern is `X, y = X.to(device), y.to(device)` inside the loop.

In [None]:
import time

size = 2048

A_cpu = torch.randn(size, size)
B_cpu = torch.randn(size, size)

start = time.time()
for _ in range(5):
    C = A_cpu @ B_cpu
cpu_time = (time.time() - start) / 5
print(f"CPU matmul {size}x{size}: {cpu_time:.4f}s per call")

if torch.cuda.is_available():
    A_gpu = A_cpu.to(device)
    B_gpu = B_cpu.to(device)

    # Warm-up: the first CUDA matmul pays one-time initialization costs
    _ = A_gpu @ B_gpu

    torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.time()
    for _ in range(5):
        C = A_gpu @ B_gpu
    torch.cuda.synchronize()  # wait for the matmuls to finish before stopping it
    gpu_time = (time.time() - start) / 5
    print(f"GPU matmul {size}x{size}: {gpu_time:.4f}s per call")
    print(f"Speedup: {cpu_time / gpu_time:.1f}x")
else:
    print("No GPU available: enable the Colab GPU runtime to see the speedup")

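`torch.cuda.synchronize()` is essential for accurate GPU benchmarking. GPU operations are asynchronous: the CPU issues a command and moves on without waiting for the GPU to finish. Without synchronize, you are timing only the CPU dispatch time, not the actual computation. The first CUDA operation also pays one-time initialization costs, so careful benchmarks run an untimed warm-up call first. For large matrix multiplications, a T4 GPU is often an order of magnitude or more faster than a CPU, though the exact speedup depends on the CPU, the matrix size, and the dtype.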

---
## Concatenation and Stacking

In [None]:
a = torch.tensor([[1.0, 2.0, 3.0]])
b = torch.tensor([[4.0, 5.0, 6.0]])

cat_0 = torch.cat([a, b], dim=0)
cat_1 = torch.cat([a, b], dim=1)
stk_0 = torch.stack([a, b], dim=0)

print("cat dim=0 shape:", cat_0.shape, "\n", cat_0)
print("cat dim=1 shape:", cat_1.shape, "\n", cat_1)
print("stack dim=0 shape:", stk_0.shape)

`torch.cat` joins tensors along an existing dimension. `torch.stack` creates a new dimension. In training, you typically collect model outputs from multiple mini-batches and concatenate them for evaluation: `all_preds = torch.cat(pred_list, dim=0)`. When assembling a batch from individual samples, you stack them: `batch = torch.stack([sample1, sample2, ...], dim=0)`.

---
## NumPy Interoperability

PyTorch and NumPy are designed to work together. Conversions are essentially free on CPU because they share memory.

In [None]:
np_arr = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

t_shared = torch.from_numpy(np_arr)
t_copy   = torch.tensor(np_arr)

print("NumPy dtype  :", np_arr.dtype)
print("Tensor dtype :", t_shared.dtype)

np_arr[0, 0] = 999.0
print("\nAfter modifying np_arr[0,0]:")
print("t_shared[0,0] :", t_shared[0, 0].item(), "  (shared — changed)")
print("t_copy[0,0]   :", t_copy[0, 0].item(), "   (copy — unchanged)")

t = torch.randn(2, 3)
arr = t.numpy()
print("\nTensor to NumPy:", type(arr), arr.shape)

`torch.from_numpy` creates a tensor sharing the same memory as the NumPy array. No data is copied. Modifying either one modifies both. `torch.tensor` always copies. For GPU tensors, you must call `.cpu()` first before converting to NumPy. The full pattern for going from a GPU tensor to a NumPy array that may have gradients is: `tensor.detach().cpu().numpy()`. This is the standard way to convert model outputs to NumPy for evaluation metrics in scikit-learn.
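The reason for the `detach()` step can be seen by leaving it out. The sketch below uses a random tensor as a stand-in for model outputs; the device check mirrors the setup used earlier, so it also runs on a CPU-only machine.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for model outputs: a non-leaf tensor that is part of a graph.
weights = torch.randn(4, 3, device=device, requires_grad=True)
outputs = weights * 2

try:
    outputs.numpy()                      # fails: the tensor still requires grad
    direct_failed = False
except (RuntimeError, TypeError):
    direct_failed = True

arr = outputs.detach().cpu().numpy()     # drop the graph, move to CPU, convert

print("direct .numpy() failed:", direct_failed)
print(type(arr), arr.shape, arr.dtype)
```

On a CPU-only machine `.cpu()` does nothing, so the same line works unchanged on every device.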

---
## Putting It Together: Manual Training Loop from Scratch

This builds a complete training loop using nothing but tensors and Autograd — no `nn.Module`, no optimizer class, no loss class. Seeing it at this level makes the high-level APIs less magical.

In [None]:
torch.manual_seed(0)

n_samples  = 200
n_features = 4
n_classes  = 3

X = torch.randn(n_samples, n_features)
true_W = torch.randn(n_features, n_classes)
logits_true = X @ true_W
y = logits_true.argmax(dim=1)

W = torch.randn(n_features, n_classes, requires_grad=True)
b = torch.zeros(n_classes, requires_grad=True)

lr     = 0.05
epochs = 100

for epoch in range(epochs):
    logits = X @ W + b

    exp_logits  = torch.exp(logits - logits.max(dim=1, keepdim=True).values)
    probs       = exp_logits / exp_logits.sum(dim=1, keepdim=True)
    correct_log = torch.log(probs[torch.arange(n_samples), y])
    loss        = -correct_log.mean()

    loss.backward()

    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad

    W.grad.zero_()
    b.grad.zero_()

    if (epoch + 1) % 20 == 0:
        with torch.no_grad():
            preds    = (X @ W + b).argmax(dim=1)
            accuracy = (preds == y).float().mean().item()
        print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f} | Accuracy: {accuracy:.3f}")

Several important patterns appear here. The numerically stable softmax subtracts the row maximum before exponentiation — without this, large logit values cause `exp` to overflow to infinity. The cross-entropy loss is the negative log probability assigned to the correct class. The weight update uses `torch.no_grad()` because the update step itself should not be tracked by Autograd — you are modifying parameters, not computing gradients. `zero_()` clears gradients after the update so they do not accumulate into the next iteration.
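The overflow that the max-subtraction trick prevents is easy to reproduce. The sketch below uses deliberately large, made-up logits and compares a naive softmax, the shifted version used in the loop above, and the built-in `torch.softmax`.

```python
import torch

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive softmax: exp(1000) overflows to inf in float32, and inf / inf is nan.
naive = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

# Stable softmax: subtracting the row maximum leaves the result unchanged
# mathematically, but keeps every exponent at or below zero.
shifted = logits - logits.max(dim=1, keepdim=True).values
stable = torch.exp(shifted) / torch.exp(shifted).sum(dim=1, keepdim=True)

reference = torch.softmax(logits, dim=1)
log_probs = torch.log_softmax(logits, dim=1)   # log-probabilities computed directly

print("naive    :", naive)
print("stable   :", stable)
print("reference:", reference)
print("log-probs:", log_probs)
```

When the loss needs log-probabilities, computing them directly with `torch.log_softmax` avoids taking the log of a probability that has underflowed to zero.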

PyTorch's optimizer API wraps these exact steps: `optimizer.zero_grad()`, `loss.backward()`, `optimizer.step()`.

---
## The Same Loop with Modern PyTorch APIs

Now the identical problem using `nn.Linear`, `nn.CrossEntropyLoss`, and `torch.optim.Adam`. Compare this to the manual version above to see what each API is abstracting.

In [None]:
import torch.nn as nn

torch.manual_seed(0)

X = torch.randn(200, 4)
true_W = torch.randn(4, 3)
y = (X @ true_W).argmax(dim=1)

model     = nn.Linear(in_features=4, out_features=3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

model.to(device)
X, y = X.to(device), y.to(device)

for epoch in range(100):
    optimizer.zero_grad()

    logits = model(X)
    loss   = criterion(logits, y)

    loss.backward()
    optimizer.step()

    if (epoch + 1) % 20 == 0:
        with torch.no_grad():
            preds    = logits.argmax(dim=1)
            accuracy = (preds == y).float().mean().item()
        print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f} | Accuracy: {accuracy:.3f}")

`nn.Linear` initializes weights with a Kaiming uniform scheme (a carefully chosen distribution that keeps activations from vanishing or exploding). `nn.CrossEntropyLoss` combines a numerically stable log-softmax and the negative log-likelihood into a single operation, so it expects raw logits rather than probabilities. `torch.optim.Adam` implements adaptive moment estimation: it maintains running averages of each parameter's gradient and squared gradient and uses them to scale each parameter's update individually, which often converges faster than plain gradient descent.

The structure `zero_grad → forward → loss → backward → step` is the canonical PyTorch training loop. You will find this structure in most repositories, from a small classifier to GPT-scale language model training.

---
## Reference: NumPy vs PyTorch

| Concept | NumPy | PyTorch | Notes |
|---|---|---|---|
| Create from list | `np.array([1,2,3])` | `torch.tensor([1,2,3])` | |
| Zeros | `np.zeros((m,n))` | `torch.zeros(m, n)` | PyTorch: no outer tuple |
| Random uniform | `np.random.rand(m,n)` | `torch.rand(m, n)` | |
| Random normal | `np.random.randn(m,n)` | `torch.randn(m, n)` | |
| Range | `np.arange(a,b,s)` | `torch.arange(a,b,s)` | |
| Default float dtype | `float64` | `float32` | Important difference |
| Shape | `a.shape` | `a.shape` | Same |
| Dtype | `a.dtype` | `a.dtype` | |
| Type cast | `a.astype(np.float32)` | `a.float()` / `.to(torch.float32)` | |
| Reshape | `a.reshape(m,n)` | `a.reshape(m,n)` / `a.view(m,n)` | view requires contiguous |
| Transpose | `a.T` | `a.T` / `a.transpose(d0,d1)` | |
| Permute dims | `np.transpose(a, axes)` | `a.permute(d0,d1,d2)` | |
| Add dim | `np.expand_dims(a,0)` | `a.unsqueeze(0)` | |
| Remove dim | `np.squeeze(a)` | `a.squeeze()` | |
| Flatten | `a.flatten()` | `a.flatten()` | |
| Reduction axis | `a.sum(axis=0)` | `a.sum(dim=0)` | keyword differs |
| Matrix multiply | `A @ B` | `A @ B` / `torch.mm(A,B)` | |
| Batched matmul | `np.matmul(A,B)` / `A @ B` | `torch.bmm(A,B)` / `A @ B` | `bmm` is 3D only |
| Element scalar | `a[0, 0]` → NumPy scalar | `a[0, 0].item()` → Python scalar | `a[0, 0]` alone is a 0-dim tensor |
| Copy | `a.copy()` | `a.clone()` | |
| Where | `np.where(c, x, y)` | `torch.where(c, x, y)` | |
| Concatenate | `np.concatenate([a,b], axis=0)` | `torch.cat([a,b], dim=0)` | |
| Stack | `np.stack([a,b], axis=0)` | `torch.stack([a,b], dim=0)` | |
| To/from NumPy | — | `torch.from_numpy(arr)` / `t.numpy()` | Shares memory on CPU |
| GPU move | — | `.to(device)` / `.cuda()` | |
| Track gradients | — | `requires_grad=True` | PyTorch only |
| Disable grad | — | `torch.no_grad()` | Use for inference |
| Compute grads | — | `loss.backward()` | Fills `.grad` attributes |
| Clear grads | — | `optimizer.zero_grad()` | Must call each iteration |
| Seed | `np.random.seed(n)` | `torch.manual_seed(n)` | |