# NumPy — Complete Foundations

NumPy (Numerical Python) is the backbone of scientific computing in Python. Every major library you will encounter — Pandas, Matplotlib, Scikit-learn, TensorFlow, PyTorch — either wraps NumPy or heavily borrows its design. Before you can understand machine learning, data science, or deep learning, you must understand NumPy.

A Python list is flexible but slow. It can hold mixed types, which means Python cannot predict how much memory each element needs. NumPy solves this with the `ndarray` — an N-dimensional array where every element has the same type, stored in one contiguous block of memory. Operations run in C under the hood, making NumPy tens to hundreds of times faster than a Python loop.

This notebook walks you from installation through every core concept with explanations tied to how things are actually used in modern machine learning workflows.

---
## Installation

NumPy comes pre-installed on Google Colab. The cell below upgrades it to the latest version and confirms the installation.

In [None]:
!pip install --upgrade numpy

import numpy as np
print(np.__version__)

---
## Why NumPy Exists

Before seeing the benchmarks, it is worth understanding why the speed gap exists. A Python list stores pointers to objects scattered in memory. When you add two lists element-by-element, Python must follow each pointer, unbox the integer, do the math, and box the result back. NumPy stores raw numbers in a flat block of memory and hands the entire array to compiled C code. No boxing, no pointer chasing, no Python overhead per element.

This matters enormously in machine learning where you routinely operate on matrices with millions of elements.

In [None]:
import time

n = 1_000_000
py_list = list(range(n))
np_arr  = np.arange(n)

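# perf_counter is a monotonic, high-resolution clock intended for timing code
start = time.perf_counter()
result = [x * 2 for x in py_list]
python_time = time.perf_counter() - start

start = time.perf_counter()
result = np_arr * 2
numpy_time = time.perf_counter() - start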

print(f"Python list : {python_time:.4f} seconds")
print(f"NumPy array : {numpy_time:.4f} seconds")
print(f"Speedup     : {python_time / numpy_time:.1f}x faster")

That speedup is not a trick. It is the direct result of memory layout and compiled operations. Every time a neural network multiplies weights by activations, or when Pandas computes a column mean, this is what is happening underneath.
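
The memory-layout half of that argument can be checked directly. The sketch below is only an illustration: it assumes CPython (where `sys.getsizeof` reports per-object sizes) and uses a hypothetical small `n` so it runs instantly.

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# A list holds one pointer per element, and every element is a separate int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An ndarray stores the raw 8-byte integers back to back in one block.
array_bytes = np_arr.nbytes

print("Python list :", list_bytes, "bytes")
print("NumPy array :", array_bytes, "bytes")
```

The exact list figure depends on the Python version, but it should come out several times larger than the array's 8 bytes per element.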

---
## Creating Arrays

There are three categories of array creation: from existing Python data, from built-in generators, and from files or other arrays. You will use all three regularly.

In [None]:
a = np.array([1, 2, 3, 4, 5])

b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

c = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])

print("1D array  :", a)
print("2D array  :\n", b)
print("3D shape  :", c.shape)

A 1D array maps to a vector, a 2D array maps to a matrix, and a 3D array maps to a batch of matrices. In deep learning, image data is typically stored as a 4D array, either `(batch_size, channels, height, width)` (the PyTorch convention) or `(batch_size, height, width, channels)` (the TensorFlow default). The nesting depth of your Python list determines the number of dimensions NumPy creates.

In [None]:
zeros   = np.zeros((3, 4))
ones    = np.ones((2, 3))
full    = np.full((3, 3), fill_value=7.0)
eye     = np.eye(4)
empty   = np.empty((2, 2))

print("zeros:\n", zeros)
print("\nones:\n", ones)
print("\nfull(7.0):\n", full)
print("\neye(4):\n", eye)
print("\nempty (uninitialized values):\n", empty)

`np.zeros` and `np.ones` are used constantly when initializing weight matrices or bias vectors before training. `np.eye` creates an identity matrix, which appears in linear algebra proofs and certain neural network weight initializations. `np.empty` allocates memory without initializing it — it is faster when you know you will fill every element immediately anyway.

In [None]:
arange  = np.arange(0, 10, 2)
linsp   = np.linspace(0, 1, 9)
logsp   = np.logspace(0, 3, 4)

print("arange(0, 10, 2) :", arange)
print("linspace(0, 1, 9):", linsp)
print("logspace(0, 3, 4):", logsp)

`np.arange` works like Python's built-in `range` but returns an array and accepts floats. With a non-integer step, floating-point rounding can make the number of elements hard to predict, so prefer `np.linspace` for float grids. `np.linspace` is different in a subtle but important way: you specify how many points you want, not the step size, and the endpoint is included by default (pass `endpoint=False` to exclude it). This is the standard tool for creating evenly spaced evaluation grids when plotting functions. `np.logspace` spaces points logarithmically, which is essential when sweeping hyperparameters like learning rate over several orders of magnitude.
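
A short sketch of how the `endpoint` and `retstep` arguments of `np.linspace` change the grid. The variable names are illustrative only.

```python
import numpy as np

# Endpoint included (the default): 5 points from 0 to 1 inclusive.
closed = np.linspace(0.0, 1.0, 5)
print(closed)        # [0.   0.25 0.5  0.75 1.  ]

# Endpoint excluded: 5 points, the last one stops short of 1.
half_open = np.linspace(0.0, 1.0, 5, endpoint=False)
print(half_open)     # [0.  0.2 0.4 0.6 0.8]

# retstep=True also returns the spacing NumPy computed.
grid, step = np.linspace(0.0, 1.0, 5, retstep=True)
print(step)          # 0.25
```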

In [None]:
rng = np.random.default_rng(seed=42)

uniform  = rng.random((3, 3))
normal   = rng.standard_normal((3, 3))
integers = rng.integers(0, 10, size=(3, 3))

print("Uniform [0,1):\n", uniform)
print("\nStandard normal:\n", normal)
print("\nRandom integers [0,10):\n", integers)

The `default_rng` generator is the modern NumPy approach to random number generation. Setting a seed makes the generator produce the same sequence of values on every run, which is critical for reproducibility in experiments. In machine learning, reproducibility means another researcher can run your code and get the same results. The standard normal distribution (mean 0, standard deviation 1) is the starting point for weight initialization schemes like Xavier and He initialization, which rescale it based on layer size and directly affect whether a neural network trains successfully.
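
To make both points concrete, here is a small sketch. The seeds and layer sizes (`fan_in`, `fan_out`) are arbitrary values chosen for illustration, and the He scaling shown is the commonly quoted `sqrt(2 / fan_in)` form, not a full initializer.

```python
import numpy as np

# Two generators with the same seed produce identical draws.
rng_a = np.random.default_rng(seed=123)
rng_b = np.random.default_rng(seed=123)
draw_a = rng_a.standard_normal(5)
draw_b = rng_b.standard_normal(5)
print("Same seed, same values:", np.array_equal(draw_a, draw_b))

# He-style initialization: standard normal samples rescaled by sqrt(2 / fan_in).
fan_in, fan_out = 256, 128          # hypothetical layer sizes
rng = np.random.default_rng(seed=0)
weights = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
print("Weight shape:", weights.shape)
print("Weight std  :", weights.std().round(4), "target:", np.sqrt(2.0 / fan_in).round(4))
```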

---
## Array Attributes: Shape, Dtype, Ndim

These three attributes describe everything structurally important about an array. Misunderstanding them is the source of most shape errors in machine learning code.

In [None]:
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print("Array:\n", x)
print("shape :", x.shape)    
print("ndim  :", x.ndim)     
print("dtype :", x.dtype)    
print("size  :", x.size)     
print("itemsize:", x.itemsize, "bytes") 
print("nbytes  :", x.nbytes, "bytes")   

Shape `(2, 3)` means 2 rows and 3 columns. The shape is always a tuple, and its length equals `ndim`. In neural networks, you constantly check shapes to verify that matrix multiplications are compatible — two matrices can multiply only when the inner dimensions match: `(A, B) @ (B, C)` gives `(A, C)`.

The dtype determines memory use and precision. `float64` uses 8 bytes per element; `float32` uses 4. Halving precision roughly doubles how much data fits in GPU memory, which is why most deep learning uses `float32` by default.

In [None]:
int8_arr   = np.array([1, 2, 3], dtype=np.int8)
int32_arr  = np.array([1, 2, 3], dtype=np.int32)
float32    = np.array([1.0, 2.0, 3.0], dtype=np.float32)
float64    = np.array([1.0, 2.0, 3.0], dtype=np.float64)
complex128 = np.array([1+2j, 3+4j], dtype=np.complex128)
bool_arr   = np.array([True, False, True], dtype=bool)

# complex128 was created above, so include it in the comparison as well
for arr, name in [(int8_arr,'int8'), (int32_arr,'int32'), (float32,'float32'),
                  (float64,'float64'), (complex128,'complex128'), (bool_arr,'bool')]:
    print(f"{name:10s} | dtype: {str(arr.dtype):10s} | itemsize: {arr.itemsize} bytes")

Choosing the right dtype is a practical skill. Model weights are stored in `float32`. Integer labels for classification (class 0, class 1, ...) are stored in `int64`. Boolean masks are stored as `bool`. Using the wrong dtype triggers silent type promotion or explicit errors. When you load image pixels from disk (values 0–255), they arrive as `uint8`. You must cast them to `float32` and normalize them before feeding them into a model.

In [None]:
pixels = np.array([0, 128, 255], dtype=np.uint8)
print("Raw pixels       :", pixels, "|", pixels.dtype)

normalized = pixels.astype(np.float32) / 255.0
print("Normalized float :", normalized, "|", normalized.dtype)

This pattern — load as `uint8`, cast to `float32`, divide by 255 — is something you will write or see written thousands of times when working with image data. The division brings values into the `[0.0, 1.0]` range that neural networks train on efficiently.
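
One reason the cast comes first: arithmetic on `uint8` arrays wraps around modulo 256 without any warning. The values below are made up to show the effect.

```python
import numpy as np

pixels = np.array([200, 250], dtype=np.uint8)
offset = np.array([100, 10], dtype=np.uint8)

# uint8 + uint8 stays uint8, so results above 255 wrap around silently.
wrapped = pixels + offset
print("uint8 sum  :", wrapped, "|", wrapped.dtype)   # [44  4] | uint8

# Casting to float32 first gives the arithmetic you expected.
safe = pixels.astype(np.float32) + offset.astype(np.float32)
print("float32 sum:", safe, "|", safe.dtype)         # [300. 260.] | float32
```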

---
## Reshaping, Views, and Copies

Reshaping is about reinterpreting the same data with a different shape. The total number of elements must stay the same. Understanding the difference between a view and a copy prevents subtle bugs.

In [None]:
a = np.arange(24)
print("Original shape:", a.shape)

b = a.reshape(4, 6)
c = a.reshape(2, 3, 4)
d = a.reshape(-1, 6)
e = a.reshape(4, -1)

print("reshape(4, 6)  :", b.shape)
print("reshape(2,3,4) :", c.shape)
print("reshape(-1, 6) :", d.shape)
print("reshape(4, -1) :", e.shape)

The `-1` in reshape is a convenience: it tells NumPy to compute that dimension automatically from the total number of elements. Only one dimension can be `-1`. This is used constantly in real code because you often know one dimension (say, batch size) but want NumPy to handle the rest. For example, after a convolutional layer outputs a batch of feature maps with shape `(batch, channels, height, width)`, you reshape it to `(batch, -1)` to feed into a linear layer without hardcoding the flattened size.
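
As a sketch of that flattening step, assume a hypothetical batch of 32 feature maps with 16 channels of size 7×7:

```python
import numpy as np

# Stand-in for the output of a convolutional layer: (batch, channels, height, width).
feature_maps = np.zeros((32, 16, 7, 7), dtype=np.float32)

# Keep the batch dimension, let NumPy work out the rest (16 * 7 * 7 = 784).
batch_size = feature_maps.shape[0]
flat = feature_maps.reshape(batch_size, -1)

print("Before:", feature_maps.shape)   # (32, 16, 7, 7)
print("After :", flat.shape)           # (32, 784)
```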

In [None]:
original = np.array([1, 2, 3, 4, 5, 6])

view = original.reshape(2, 3)
copy = original.reshape(2, 3).copy()

view[0, 0] = 999
print("After modifying view[0,0]:")
print("original :", original)
print("view     :\n", view)

copy[0, 0] = 777
print("\nAfter modifying copy[0,0]:")
print("original :", original)
print("copy     :\n", copy)

A view shares the underlying memory with the original array. No data is moved or copied — only the shape metadata changes. Modifying a view modifies the original. This is efficient but dangerous if you do not expect it. A copy owns its own memory. Changes to a copy never affect the original.

You can check whether an array owns its data with `a.base is None`. If it is `None`, the array owns its data. If it points to another array, it is a view.
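
A minimal check of this on a few common operations. It also uses `np.shares_memory`, which answers the question directly when `.base` is ambiguous.

```python
import numpy as np

original = np.arange(6)

view = original.reshape(2, 3)           # reshape of a contiguous array: a view
copy = original.reshape(2, 3).copy()    # explicit copy: owns its data
sliced = original[1:4]                  # basic slice: a view
fancy = original[[0, 2, 4]]             # integer-array indexing: a copy

print("view.base is original :", view.base is original)
print("copy.base is None     :", copy.base is None)
print("slice shares memory   :", np.shares_memory(original, sliced))
print("fancy shares memory   :", np.shares_memory(original, fancy))
```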

In [None]:
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print("Original shape:", m.shape)
print("Flattened     :", m.flatten())
print("Ravel         :", m.ravel())
print("Transposed    :\n", m.T)
print("Transposed shape:", m.T.shape)

`flatten()` always returns a copy. `ravel()` returns a view when possible, making it faster. The transpose `.T` reverses the order of the axes. For a 2D matrix, this turns rows into columns. For higher-dimensional arrays, use `np.transpose(arr, axes)` to specify exactly how to reorder the axes; this is the NumPy equivalent of PyTorch's `permute`.
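
A sketch of explicit axis reordering, using a hypothetical image batch converted from channels-first to channels-last layout:

```python
import numpy as np

# (batch, channels, height, width)
images_nchw = np.zeros((8, 3, 32, 64))

# Move the channel axis to the end: new axis i takes old axis axes[i].
images_nhwc = np.transpose(images_nchw, (0, 2, 3, 1))
print("NHWC shape:", images_nhwc.shape)        # (8, 32, 64, 3)

# .T on a 4D array reverses every axis, which is rarely what you want here.
print(".T shape  :", images_nchw.T.shape)      # (64, 32, 3, 8)

# Transposing only changes strides, so the result is a view.
print("Shares memory:", np.shares_memory(images_nchw, images_nhwc))
```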

In [None]:
x = np.array([1.0, 2.0, 3.0])
print("Original shape          :", x.shape)

row_vector = np.expand_dims(x, axis=0)
col_vector = np.expand_dims(x, axis=1)
print("expand_dims(axis=0) :", row_vector.shape)
print("expand_dims(axis=1) :", col_vector.shape)

y = np.array([[[1, 2, 3]]])
print("\nBefore squeeze:", y.shape)
print("After squeeze() :", np.squeeze(y).shape)

`expand_dims` adds a size-1 dimension at the specified axis. This is needed constantly when broadcasting. Adding a bias vector of shape `(features,)` to a batch of outputs of shape `(batch, features)` works without it, because broadcasting treats the bias as shape `(1, features)` automatically. A per-sample vector of shape `(batch,)`, however, must be expanded to `(batch, 1)` explicitly before it lines up with the batch axis. `squeeze` removes all size-1 dimensions, which is the reverse operation. Both appear heavily in data pipeline code.
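
The sketch below contrasts a per-feature vector with a per-sample vector. The names `bias` and `sample_weight` and their values are illustrative.

```python
import numpy as np

outputs = np.ones((4, 3))                         # (batch, features)
bias = np.array([0.1, 0.2, 0.3])                  # (features,)
sample_weight = np.array([1.0, 2.0, 3.0, 4.0])    # (batch,)

# A per-feature vector lines up with the last axis automatically.
with_bias = outputs + bias

# A per-sample vector needs a trailing size-1 axis: (4,) -> (4, 1).
weighted = outputs * np.expand_dims(sample_weight, axis=1)

# Indexing with np.newaxis is an equivalent spelling.
weighted_alt = outputs * sample_weight[:, np.newaxis]

print("with_bias shape:", with_bias.shape)
print("weighted:\n", weighted)
```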

---
## Indexing and Slicing

NumPy's indexing is a generalization of Python list indexing to multiple dimensions. Mastering it eliminates the need for most explicit loops.

In [None]:
a = np.array([[10, 20, 30, 40],
              [50, 60, 70, 80],
              [90,100,110,120]])

print("a[0, 0]   =", a[0, 0])
print("a[1, 2]   =", a[1, 2])
print("a[-1, -1] =", a[-1, -1])

print("\nRow 0              :", a[0])
print("Column 1           :", a[:, 1])
print("First two rows     :\n", a[:2])
print("Submatrix [0:2,1:3]:\n", a[0:2, 1:3])
print("Every other column :", a[:, ::2])

The slice notation is `start:stop:step` along each dimension, separated by commas. Negative indices count from the end. `a[:, 1]` selects column 1 across all rows — the colon alone means "all elements along this axis". Slices return views, not copies. In practice this means you can write efficient code that operates on subregions of large arrays without duplicating data.

In [None]:
x = np.array([3, 7, -1, 12, -5, 9, 0, 4])

mask = x > 0
print("Mask           :", mask)
print("Positive values:", x[mask])

x[x < 0] = 0
print("After zeroing negatives:", x)

compound = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
print("\nValues between -3 and 3:", compound[(compound > -3) & (compound < 3)])

Boolean indexing creates a mask (an array of True/False values), then uses it to select elements. This pattern replaces filtering loops entirely. In data preprocessing, you use it to clip negative pixel values, remove outliers, or select samples meeting some criterion. Note the use of `&` (bitwise AND) instead of Python's `and`: `and` asks for the truth value of a whole array, which raises a `ValueError`, while NumPy overloads `&`, `|`, and `~` to work element-wise. The parentheses around each comparison are required because `&` binds more tightly than `>` and `<`.
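
A quick sketch of the three element-wise logical operators, plus what happens when `and` is used by mistake:

```python
import numpy as np

x = np.array([3, 7, -1, 12, -5, 9, 0, 4])

in_range = (x > 0) & (x < 10)     # element-wise AND
either = (x < 0) | (x > 10)       # element-wise OR
outside = ~in_range               # element-wise NOT

print("0 < x < 10    :", x[in_range])   # [3 7 9 4]
print("x < 0 or > 10 :", x[either])     # [-1 12 -5]

# Python's `and` needs a single True/False for the whole array, so it fails.
try:
    _ = (x > 0) and (x < 10)
    and_failed = False
except ValueError as e:
    and_failed = True
    print("Using `and` raised:", e)
```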

In [None]:
a = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90],
              [100,110,120]])

print("Rows 0 and 2:\n", a[[0, 2]])
print("\nRows 1,3 and cols 0,2:\n", a[np.ix_([1, 3], [0, 2])])

indices = np.array([2, 0, 3, 1])
print("\nRows in order [2,0,3,1]:\n", a[indices])

Fancy indexing uses an array of integers to select specific rows or columns. Unlike slices, fancy indexing always returns a copy. `np.ix_` constructs an open mesh from index arrays, allowing you to select the intersection of specific rows and columns. This is how mini-batch sampling works — you generate a random array of indices and index into your dataset array to extract that batch.
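
A minimal sketch of that sampling step. The dataset here is synthetic, and `batch_size` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

X = np.arange(50).reshape(10, 5)   # 10 samples, 5 features
y = np.arange(10)                  # one label per sample

# Draw distinct row indices, then use the same indices for features and labels.
batch_size = 4
batch_idx = rng.choice(len(X), size=batch_size, replace=False)

X_batch = X[batch_idx]
y_batch = y[batch_idx]

print("Indices:", batch_idx)
print("X_batch shape:", X_batch.shape)
print("y_batch      :", y_batch)
```

Because fancy indexing copies, modifying `X_batch` afterwards leaves `X` untouched.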

---
## Element-wise Math and Universal Functions

Every arithmetic operator in NumPy operates element-wise. NumPy also provides a library of universal functions (ufuncs) — vectorized functions that apply a mathematical operation to every element of an array efficiently.

In [None]:
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

print("a + b  :", a + b)
print("a - b  :", a - b)
print("a * b  :", a * b)
print("a / b  :", a / b)
print("a ** 2 :", a ** 2)
print("a % 3  :", a % 3)
print("a // 3 :", a // 3)

These operators call underlying ufuncs: `np.add`, `np.multiply`, `np.power`, and so on. Writing `a + b` is identical to writing `np.add(a, b)`. The ufunc form is sometimes useful when you need the function as a first-class object, for example passing it as an argument.
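
A few things the function form makes possible, sketched below. `apply_binary` is a hypothetical helper written for this example; `reduce`, `accumulate`, `outer`, and the `out=` argument are standard ufunc features.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

def apply_binary(op, left, right):
    """Apply any binary ufunc passed in as an argument."""
    return op(left, right)

print(apply_binary(np.add, a, b))        # [11. 22. 33. 44.]
print(apply_binary(np.multiply, a, b))   # [ 10.  40.  90. 160.]

# Every binary ufunc also carries reduction-style methods.
total = np.add.reduce(a)                 # same result as a.sum()
running = np.multiply.accumulate(a)      # running product
table = np.multiply.outer(a, a)          # 4x4 multiplication table

# out= writes into an existing array instead of allocating a new one.
out = np.empty_like(a)
np.add(a, b, out=out)
```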

In [None]:
x = np.array([0.0, np.pi/6, np.pi/4, np.pi/2, np.pi])

print("np.sin        :", np.sin(x))
print("np.cos        :", np.cos(x))
print("np.exp([0,1,2]):", np.exp(np.array([0.0, 1.0, 2.0])))
print("np.log([1,e,e2]):", np.log(np.array([1.0, np.e, np.e**2])))
print("np.log2([1,2,4]):", np.log2(np.array([1.0, 2.0, 4.0])))
print("np.sqrt      :", np.sqrt(np.array([1.0, 4.0, 9.0, 16.0])))
print("np.abs       :", np.abs(np.array([-3, -2, -1, 0, 1])))

These functions are the building blocks of activation functions, loss functions, and normalization. The sigmoid activation function is `1 / (1 + np.exp(-x))`. The softmax function requires `np.exp`. Log-loss requires `np.log`. Understanding that these are just element-wise operations on arrays demystifies a large part of neural network mathematics.

In [None]:
z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

sigmoid  = 1.0 / (1.0 + np.exp(-z))
relu     = np.maximum(0, z)
tanh     = np.tanh(z)

# Subtracting the max leaves the result unchanged but prevents overflow for large inputs
exp_z    = np.exp(z - z.max())
softmax  = exp_z / exp_z.sum()

print("Input   :", z)
print("Sigmoid :", np.round(sigmoid, 4))
print("ReLU    :", relu)
print("Tanh    :", np.round(tanh, 4))
print("Softmax :", np.round(softmax, 4))
print("Softmax sums to:", softmax.sum())

These are the actual activation functions used in neural networks, implemented from scratch using NumPy operations. Sigmoid squashes any input to `(0, 1)` and is commonly used in binary classification outputs. ReLU clips negative values to zero and is among the most widely used hidden-layer activations. Softmax converts a vector of raw scores into a probability distribution (sums to 1), used in multi-class classification output layers. Deep learning frameworks compute the same functions but use numerically stable formulations; for softmax, that means subtracting the maximum score before exponentiating so that large inputs do not overflow.
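
The following is a sketch of a softmax that stays finite for large scores. The function name and test values are illustrative, and it handles a single 1D vector only.

```python
import numpy as np

def softmax(scores):
    """Softmax over a 1D array, shifted by the max for numerical stability."""
    shifted = scores - scores.max()      # largest entry becomes 0, so exp() <= 1
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

small = np.array([0.0, 1.0, 2.0])
big = small + 1000.0                     # np.exp(1000.0) alone would overflow to inf

print("softmax(small):", softmax(small).round(4))
print("softmax(big)  :", softmax(big).round(4))
```

Both calls print the same probabilities, because adding a constant to every score does not change the softmax.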

---
## Broadcasting

Broadcasting is NumPy's mechanism for applying operations between arrays of different but compatible shapes. It avoids copying data and enables concise, efficient code. It is also one of the most misunderstood features.

In [None]:
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

print("Scalar broadcast (a + 10):\n", a + 10)

row = np.array([10, 20, 30])
print("\nRow vector broadcast (a + row):\n", a + row)

col = np.array([[10], [20], [30]])
print("\nCol vector broadcast (a + col):\n", a + col)

Broadcasting rules: NumPy compares shapes from the right. Two dimensions are compatible if they are equal or one of them is 1. If one array has fewer dimensions, its shape is padded with 1s on the left. So a scalar has shape `()`, which is compatible with anything. A row vector of shape `(3,)` is treated as `(1, 3)` and broadcasts against a matrix `(3, 3)` by being repeated for every row. A column vector of shape `(3, 1)` broadcasts by being repeated for every column.
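
You can ask NumPy what shape a broadcast would produce without building any arrays. The sketch below assumes NumPy 1.20 or newer for `np.broadcast_shapes`, and finishes with a pairwise-distance computation on three made-up 2D points.

```python
import numpy as np

print(np.broadcast_shapes((3, 3), (3,)))     # (3, 3)
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)

# Trailing dimensions 10 and 32 differ and neither is 1, so this fails.
try:
    np.broadcast_shapes((32, 10), (32,))
    incompatible = False
except ValueError:
    incompatible = True
print("(32, 10) vs (32,) incompatible:", incompatible)

# Pairwise distances: (3, 1, 2) - (1, 3, 2) broadcasts to (3, 3, 2).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist)
```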

In practice: adding a bias term to every sample in a batch, normalizing each feature by its mean, or computing pairwise distances — all are broadcasting problems.

In [None]:
rng   = np.random.default_rng(seed=42)
batch = rng.standard_normal((32, 10))
mean  = batch.mean(axis=0)
std   = batch.std(axis=0)

normalized = (batch - mean) / std

print("Batch shape      :", batch.shape)
print("Mean shape       :", mean.shape)
print("Normalized shape :", normalized.shape)
print("Normalized mean  :", normalized.mean(axis=0).round(6))
print("Normalized std   :", normalized.std(axis=0).round(6))

This is the core computation of batch normalization (the full layer also adds a small epsilon to the denominator and applies a learnable scale and shift). `mean` has shape `(10,)`, which broadcasts against `(32, 10)`: NumPy subtracts each feature's mean from every sample's corresponding feature. The result has zero mean and unit variance per feature. This normalization step stabilizes training and appears in almost every serious deep learning pipeline.

---
## Aggregation and Reduction

Reduction operations collapse one or more dimensions by applying a function across elements. The `axis` parameter controls which dimension gets collapsed.

In [None]:
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

print("Sum all     :", x.sum())
print("Sum axis=0  :", x.sum(axis=0))
print("Sum axis=1  :", x.sum(axis=1))

print("\nMean all    :", x.mean())
print("Mean axis=0 :", x.mean(axis=0))
print("Mean axis=1 :", x.mean(axis=1))

print("\nMax all     :", x.max())
print("Argmax all  :", x.argmax())
print("Argmax axis=1:", x.argmax(axis=1))

print("\nStd axis=0  :", x.std(axis=0))
print("Cumsum axis=1:\n", x.cumsum(axis=1))

`axis=0` collapses rows (aggregates down each column). `axis=1` collapses columns (aggregates across each row). Thinking about axis as "the dimension that disappears" helps: summing shape `(3, 3)` along `axis=0` produces shape `(3,)` — the row dimension was consumed.

`argmax` returns the index of the maximum value, not the value itself. This is how you convert network output probabilities to predicted class labels: `predicted_class = logits.argmax(axis=1)`.
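
A small worked example of that conversion, followed by an accuracy calculation. The probabilities and labels are invented for illustration.

```python
import numpy as np

# One row per sample, one column per class.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
labels = np.array([1, 0, 2, 2])

predicted = probs.argmax(axis=1)
print("Predicted:", predicted)              # [1 0 2 1]

# Comparing gives a boolean array; its mean is the fraction of correct predictions.
accuracy = (predicted == labels).mean()
print("Accuracy :", accuracy)               # 0.75
```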

In [None]:
rng = np.random.default_rng(seed=42)
x = rng.random((3, 4))

mean_no_keep = x.mean(axis=1)
mean_keep    = x.mean(axis=1, keepdims=True)

print("Shape without keepdims:", mean_no_keep.shape)
print("Shape with keepdims   :", mean_keep.shape)

try:
    centered_bad = x - mean_no_keep
except ValueError as e:
    print("\nError without keepdims:", e)

centered_good = x - mean_keep
print("Centered shape with keepdims:", centered_good.shape)

`keepdims=True` preserves the collapsed dimension as a size-1 dimension. This keeps the result broadcastable against the original array. Without it, subtracting a shape `(3,)` mean from a shape `(3, 4)` matrix fails because the shapes are not compatible. With `keepdims=True`, the mean has shape `(3, 1)` and broadcasts correctly. This is a common source of shape errors, and the worst case is a square matrix: there the shapes happen to be compatible, so no error is raised and the mean is silently subtracted along the wrong axis.
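
The square-matrix case is worth seeing once, because nothing fails. In this sketch the rows have very different scales so the mistake is obvious in the output.

```python
import numpy as np

x = np.array([[1.0,   2.0,   3.0],
              [10.0,  20.0,  30.0],
              [100.0, 200.0, 300.0]])

row_means = x.mean(axis=1)                      # shape (3,): [2, 20, 200]

# (3, 3) - (3,) is valid, but row_means[j] is subtracted from COLUMN j.
wrong = x - row_means

# keepdims gives shape (3, 1), so row_means[i] is subtracted from ROW i.
right = x - x.mean(axis=1, keepdims=True)

print("Row means after wrong centering:", wrong.mean(axis=1))
print("Row means after right centering:", right.mean(axis=1))
```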

---
## Linear Algebra

Linear algebra is the language of machine learning. Matrix multiplication, dot products, eigendecomposition, and norms appear throughout optimization theory, dimensionality reduction, and attention mechanisms.

In [None]:
A = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=float)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]], dtype=float)

C = A @ B
print("A shape:", A.shape, "B shape:", B.shape)
print("A @ B shape:", C.shape)
print("A @ B:\n", C)

print("\nElement-wise A * A.T would fail (wrong shapes)")
D = A.T @ A
print("A.T @ A shape:", D.shape)
print("A.T @ A:\n", D)

The `@` operator is matrix multiplication. `A @ B` requires the inner dimensions to match: `(m, k) @ (k, n)` produces `(m, n)`. With the weights stored as `W` of shape `(out_features, in_features)` (the PyTorch convention), a forward pass through a linear layer is exactly `output = X @ W.T + b`, where `X` is the input batch and `b` is the bias. `A.T @ A` produces a symmetric positive semi-definite matrix. This structure appears in normal equations, covariance matrices, and the Gram matrix used in style transfer.
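
A sketch of that linear-layer forward pass with the shapes spelled out. The sizes are arbitrary, and the weight layout `(out_features, in_features)` is an assumption matching the formula above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
batch, in_features, out_features = 4, 3, 2

X = rng.standard_normal((batch, in_features))          # (4, 3)
W = rng.standard_normal((out_features, in_features))   # (2, 3)
b = np.zeros(out_features)                             # (2,)

# (4, 3) @ (3, 2) -> (4, 2); the bias broadcasts across the batch.
output = X @ W.T + b
print("Output shape:", output.shape)

# The same result for one sample, written as a matrix-vector product.
first = W @ X[0] + b
print("Row 0 matches:", np.allclose(output[0], first))
```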

In [None]:
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print("Dot product        :", np.dot(u, v))

print("L1 norm (sum |x|)  :", np.linalg.norm(u, ord=1))
print("L2 norm (Euclidean):", np.linalg.norm(u))
print("L2 norm manual     :", np.sqrt((u**2).sum()))

u_unit = u / np.linalg.norm(u)
print("Unit vector        :", u_unit)
print("Unit vector norm   :", np.linalg.norm(u_unit))

cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print("\nCosine similarity  :", cos_sim)

The L2 norm measures the length of a vector. Dividing a vector by its norm gives a unit vector — this is how you normalize word embeddings or gradient vectors. L1 regularization penalizes the L1 norm of weights, L2 regularization penalizes the L2 norm squared. Cosine similarity measures the angle between two vectors regardless of their magnitude — this is the primary similarity metric used in semantic search and recommendation systems with embedding models.

In [None]:
A = np.array([[3, 1],
              [1, 3]], dtype=float)

det  = np.linalg.det(A)
inv  = np.linalg.inv(A)
vals, vecs = np.linalg.eig(A)

print("Matrix A:\n", A)
print("\nDeterminant:", det)
print("Inverse:\n", inv)
print("\nA @ inv(A) (should be identity):\n", (A @ inv).round(6))
print("\nEigenvalues :", vals)
print("Eigenvectors:\n", vecs)

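The determinant tells you whether a matrix is invertible (non-zero) and how it scales area or volume. Eigenvalues and eigenvectors describe which directions a matrix stretches and by how much. This is the foundation of Principal Component Analysis (PCA), which reduces the dimensionality of data by projecting onto the directions of maximum variance. In PCA the matrix being decomposed is the covariance matrix of the data: its eigenvectors become the new coordinate axes, and its eigenvalues tell you how much variance each axis captures. For symmetric matrices such as `A` above or any covariance matrix, `np.linalg.eigh` is the better choice because it exploits the symmetry and returns real eigenvalues in ascending order.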
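
Here is a minimal PCA sketch built on the eigendecomposition of a covariance matrix. The data is synthetic, with per-feature scales chosen so that one direction dominates. Treat it as an illustration of the idea, not a replacement for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 200 samples, 3 features with very different spreads.
X = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.2])

# 1. Center the data, 2. form the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)

# 3. eigh is for symmetric matrices; it returns eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]            # reorder to descending variance
vals, vecs = vals[order], vecs[:, order]

# 4. Project onto the top 2 principal directions.
X_proj = X_centered @ vecs[:, :2]

explained = vals / vals.sum()
print("Explained variance ratio:", explained.round(3))
print("Projected shape         :", X_proj.shape)
```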

In [None]:
rng = np.random.default_rng(seed=42)
X = rng.standard_normal((5, 3))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

print("X shape :", X.shape)
print("U shape :", U.shape)
print("S shape :", S.shape)
print("Vt shape:", Vt.shape)

X_reconstructed = U @ np.diag(S) @ Vt
print("\nReconstruction error:", np.abs(X - X_reconstructed).max())

X_compressed = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]
print("Compressed shape:", X_compressed.shape)
print("Compression error:", np.abs(X - X_compressed).max())

Singular Value Decomposition (SVD) factorizes any matrix into `U @ diag(S) @ Vt`. Keeping only the top-k singular values and corresponding vectors gives the best rank-k approximation to the original matrix in the least-squares (Frobenius norm) sense. This is the mathematical foundation of image compression, recommendation systems (matrix factorization), and low-rank approximation techniques applied to Transformer models.
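
The quality of a rank-k approximation can be read straight off the singular values: the Frobenius-norm error equals the square root of the sum of the squared singular values you discarded. A small check on a random matrix (the seed and `k` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.standard_normal((5, 3))

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# For a 2D array, np.linalg.norm defaults to the Frobenius norm.
frob_error = np.linalg.norm(X - X_k)
predicted_error = np.sqrt((S[k:] ** 2).sum())

print("Rank of X_k         :", np.linalg.matrix_rank(X_k))
print("Frobenius error     :", frob_error)
print("From singular values:", predicted_error)
```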

---
## Concatenation, Stacking, and Splitting

Data pipelines frequently need to combine arrays from different sources or split arrays for batching and cross-validation.

In [None]:
a = np.array([[1, 2, 3]])
b = np.array([[4, 5, 6]])

cat_row = np.concatenate([a, b], axis=0)
cat_col = np.concatenate([a, b], axis=1)
vstack  = np.vstack([a, b])
hstack  = np.hstack([a, b])
stacked = np.stack([a, b], axis=0)

print("concatenate axis=0:\n", cat_row, "shape:", cat_row.shape)
print("concatenate axis=1:\n", cat_col, "shape:", cat_col.shape)
print("vstack:\n", vstack, "shape:", vstack.shape)
print("hstack:\n", hstack, "shape:", hstack.shape)
print("stack axis=0 shape:", stacked.shape)

`concatenate` joins arrays along an existing axis. `stack` creates a new axis before joining. For arrays that are already 2D, `vstack` is shorthand for `concatenate(axis=0)` and `hstack` is `concatenate(axis=1)`. The critical distinction: concatenating two `(1, 3)` arrays along axis=0 gives `(2, 3)`; stacking them along axis=0 gives `(2, 1, 3)`. In batch assembly, you typically collect individual samples as 1D arrays and stack them to form a 2D batch.
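
A sketch of that batch-assembly step, with three made-up samples of four features each:

```python
import numpy as np

samples = [np.array([1.0, 2.0, 3.0, 4.0]),
           np.array([5.0, 6.0, 7.0, 8.0]),
           np.array([9.0, 10.0, 11.0, 12.0])]

# stack adds a new leading axis: three (4,) arrays -> one (3, 4) batch.
batch = np.stack(samples, axis=0)
print("stack shape      :", batch.shape)     # (3, 4)

# concatenate joins along the existing axis instead: one long (12,) vector.
joined = np.concatenate(samples, axis=0)
print("concatenate shape:", joined.shape)    # (12,)
```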

In [None]:
data = np.arange(40).reshape(8, 5)

train, val, test = np.split(data, [6, 7], axis=0)
print("Train shape:", train.shape)
print("Val shape  :", val.shape)
print("Test shape :", test.shape)

chunks = np.array_split(data, 3, axis=0)
print("\narray_split into 3 unequal parts:")
for i, chunk in enumerate(chunks):
    print(f"  Chunk {i}: shape {chunk.shape}")

`np.split` takes indices at which to cut. `np.split(data, [6, 7])` makes cuts before index 6 and before index 7, producing three parts: rows 0-5, row 6, and row 7. Splitting at explicit indices like this is the workhorse for train/validation/test splitting. `np.split` also accepts a number of equal sections, but raises an error if the array does not divide evenly. `np.array_split` is more forgiving: it accepts a number and splits as evenly as possible, with some chunks having one more element than others when the division is not exact.
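
The difference between the two functions shows up as soon as the row count is not a multiple of the number of parts:

```python
import numpy as np

data = np.arange(40).reshape(8, 5)

# 8 rows cannot be divided into 3 equal sections, so np.split refuses.
try:
    np.split(data, 3, axis=0)
    split_failed = False
except ValueError as e:
    split_failed = True
    print("np.split raised:", e)

# array_split hands out the extra rows to the first chunks.
chunk_sizes = [len(chunk) for chunk in np.array_split(data, 3, axis=0)]
print("array_split sizes:", chunk_sizes)     # [3, 3, 2]
```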

---
## Sorting and Searching

In [None]:
x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

print("Original   :", x)
print("Sorted     :", np.sort(x))
print("Argsort    :", np.argsort(x))
print("Top-3 vals :", np.sort(x)[-3:])
print("Top-3 idx  :", np.argsort(x)[-3:])

logits = np.array([0.1, 2.5, -0.3, 1.8, 3.2, 0.7])
top_k = 3
top_k_idx = np.argsort(logits)[-top_k:][::-1]
print("\nClass logits :", logits)
print("Top-3 classes:", top_k_idx)
print("Top-3 values :", logits[top_k_idx])

`argsort` returns the indices that would sort the array — not the sorted values themselves. This is valuable because you can use those indices to reorder other arrays consistently. The top-k prediction pattern shown above — sort logits by value, take the last k indices, reverse to get descending order — is used in evaluation metrics like top-5 accuracy (used to evaluate ImageNet models).
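
When the array is large and k is small, a full sort does more work than needed. One alternative, sketched here as an option and not as something the cell above uses, is `np.argpartition`, which only guarantees that the k largest entries end up in the last k positions:

```python
import numpy as np

logits = np.array([0.1, 2.5, -0.3, 1.8, 3.2, 0.7])
k = 3

# Indices of the k largest values, in no particular order.
candidates = np.argpartition(logits, -k)[-k:]

# Sort just those k candidates into descending order.
top_k_idx = candidates[np.argsort(logits[candidates])[::-1]]

print("Top-3 classes:", top_k_idx)            # [4 1 3]
print("Top-3 values :", logits[top_k_idx])    # [3.2 2.5 1.8]
```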

---
## Putting It Together: A Complete Data Preprocessing Pipeline

This section combines everything covered above into a realistic workflow: generating synthetic data, splitting it, normalizing it using only training statistics, and verifying the result.

In [None]:
rng = np.random.default_rng(seed=0)

n_samples  = 1000
n_features = 8

X = rng.normal(loc=5.0, scale=2.0, size=(n_samples, n_features))
y = (X[:, 0] + X[:, 1] > 10).astype(np.int64)

n_train = int(0.7 * n_samples)
n_val   = int(0.15 * n_samples)

idx = rng.permutation(n_samples)
train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train + n_val]
test_idx  = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val,   y_val   = X[val_idx],   y[val_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]

mean = X_train.mean(axis=0)
std  = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std
X_val_norm   = (X_val   - mean) / std
X_test_norm  = (X_test  - mean) / std

print("Dataset shape            :", X.shape)
print("Train / Val / Test sizes :", X_train.shape[0], X_val.shape[0], X_test.shape[0])
print("Class balance (train)    :", y_train.mean().round(3))
print("\nTrain norm mean (should be ~0):", X_train_norm.mean(axis=0).round(3))
print("Train norm std  (should be ~1):", X_train_norm.std(axis=0).round(3))
print("\nVal   norm mean (approx  ~0)  :", X_val_norm.mean(axis=0).round(3))

Several important design decisions are embedded here. Shuffling with `rng.permutation` before splitting prevents accidentally creating splits where one set contains only early or only late data. Normalization statistics are computed on training data only and applied to validation and test sets — if you use the full dataset statistics, you leak information about the test set into your normalization, which inflates evaluation scores. This is called data leakage and it is a common mistake in machine learning practice.

---
## Reference: NumPy Vocabulary

| Concept | Code | Notes |
|---|---|---|
| Create from list | `np.array([1,2,3])` | Infers dtype |
| Zeros | `np.zeros((m, n))` | Default float64 |
| Range | `np.arange(start, stop, step)` | Stop is exclusive |
| Evenly spaced | `np.linspace(start, stop, num)` | Stop is inclusive |
| Random uniform | `rng.random((m, n))` | Use `default_rng` |
| Shape | `a.shape` | Returns a tuple |
| Type | `a.dtype` | float64 is the default for floats |
| Cast | `a.astype(np.float32)` | Returns copy |
| Reshape | `a.reshape(m, n)` | Returns view when possible |
| Transpose | `a.T` | Returns view |
| Add dim | `np.expand_dims(a, axis)` | `axis=0` is the same as `a[np.newaxis]` |
| Remove dim | `np.squeeze(a)` | Removes size-1 dims |
| Flatten | `a.flatten()` | Returns copy |
| Slice | `a[0:2, 1:3]` | Returns view |
| Boolean mask | `a[a > 0]` | Returns copy |
| Matrix multiply | `A @ B` | Also `np.matmul` |
| Element multiply | `A * B` | Hadamard product |
| Dot product | `np.dot(u, v)` | For 1D vectors |
| Sum | `a.sum(axis=0)` | axis=0 collapses rows |
| Argmax | `a.argmax(axis=1)` | Index of max |
| L2 norm | `np.linalg.norm(a)` | Euclidean length |
| SVD | `np.linalg.svd(A)` | Factorization |
| Concatenate | `np.concatenate([a,b], axis=0)` | Existing axis |
| Stack | `np.stack([a,b], axis=0)` | New axis |
| Copy | `a.copy()` | Independent data |
| Where | `np.where(cond, x, y)` | Element-wise ternary |
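
`np.where` is the one entry in the table that has not appeared in a code cell yet, so here is a short sketch of its two forms on a made-up array:

```python
import numpy as np

x = np.array([3, -1, 0, -5, 9])

# Three-argument form: pick from the second argument where the condition holds,
# otherwise from the third.
clipped = np.where(x < 0, 0, x)
print("Negatives replaced:", clipped)      # [3 0 0 0 9]

# One-argument form: returns a tuple of index arrays, one per dimension.
(positions,) = np.where(x < 0)
print("Negative positions:", positions)    # [1 3]
```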