In a perfect world, your model would predict $y$ exactly:

$$X\beta = y$$

In reality, this system is usually overdetermined (more equations, i.e. data points, than unknowns) and has no exact solution.

The columns of the matrix $X$ span a subspace (a flat sheet or volume) inside a much larger space. This subspace is called the Column Space, $C(X)$.

The target vector $y$ almost never lies exactly on this flat sheet. It sticks out into the empty space above it.

Because $y$ is not on the sheet, no linear combination of the columns of $X$ can ever equal $y$.

3. The Geometric Solution: Projection

Since we cannot reach $y$ exactly, we look for the point on the sheet (the Column Space) that is closest to $y$. Call this closest point $\hat{y}$ (the prediction):

$$\hat{y} = X\hat{\beta}$$

The Key Insight:

The shortest path from a point ($y$) to a plane ($C(X)$) runs along the perpendicular (orthogonal) line.

If you were standing on the plane directly underneath the floating point $y$, looking straight up, the spot you are standing on is $\hat{y}$. This is the Orthogonal Projection of $y$ onto the column space of $X$.

4. Deriving the Equation from Geometry

We can derive the Normal Equation purely from this geometric insight, without taking any derivatives.

Step 1: Define the Error Vector

The error (or residual) is the vector pointing from our prediction to the truth:

$$e = y - X\hat{\beta}$$

Step 2: The Orthogonality Condition

For the distance to be minimized, the error vector $e$ must be perpendicular (orthogonal) to the flat sheet (the Column Space of $X$).

If $e$ is perpendicular to the whole sheet, it must be perpendicular to every column of $X$.

In linear algebra, two vectors are orthogonal if their dot product is zero. The rows of $X^T$ are the columns of $X$, so requiring each of those dot products with $e$ to be zero gives:

$$X^T e = 0$$

Step 3: Solve for $\hat{\beta}$

Substitute the definition of $e$ ($y - X\hat{\beta}$) into that equation:

$$X^T (y - X\hat{\beta}) = 0$$

Expand the terms:

$$X^T y - X^T X \hat{\beta} = 0$$

Move the negative term to the other side:

$$X^T X \hat{\beta} = X^T y$$

This is the Normal Equation!

To isolate $\hat{\beta}$, assuming $X^T X$ is invertible (which holds when the columns of $X$ are linearly independent), multiply both sides on the left by the inverse:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
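
As a small worked check of the three steps, consider a hypothetical data set with three points, an intercept column, and one feature. The numbers are illustrative assumptions, not taken from the text above:

$$X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad y = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}$$

The two ingredients of the Normal Equation are:

$$X^T X = \begin{pmatrix} 3 & 3 \\ 3 & 5 \end{pmatrix}, \qquad X^T y = \begin{pmatrix} 5 \\ 6 \end{pmatrix}$$

Solving $X^T X \hat{\beta} = X^T y$ (subtracting the first row from the second gives $2\hat{\beta}_1 = 1$):

$$\hat{\beta} = \begin{pmatrix} 7/6 \\ 1/2 \end{pmatrix}, \qquad \hat{y} = X\hat{\beta} = \begin{pmatrix} 7/6 \\ 5/3 \\ 13/6 \end{pmatrix}$$

The residual is indeed orthogonal to both columns of $X$:

$$e = y - \hat{y} = \begin{pmatrix} -1/6 \\ 1/3 \\ -1/6 \end{pmatrix}, \qquad X^T e = \begin{pmatrix} -\tfrac{1}{6} + \tfrac{1}{3} - \tfrac{1}{6} \\ 0 + \tfrac{1}{3} - \tfrac{1}{3} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$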

import torch

def linear_regression_normal_equation(X, y) -> torch.Tensor:
    """
    Solve linear regression via the normal equation using PyTorch.
    X: Tensor or convertible of shape (m,n); y: shape (m,) or (m,1).
    Returns a 1-D tensor of length n, rounded to 4 decimals.
    """
    X_t = torch.as_tensor(X, dtype=torch.float)
    y_t = torch.as_tensor(y, dtype=torch.float).reshape(-1,1)
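    # Form the Gram matrix X^T X (shape n x n). It must be invertible,
    # i.e. the columns of X must be linearly independent.
    gram = X_t.T @ X_t

    # Closed-form solution of the normal equation: (X^T X)^{-1} X^T y.
    beta = torch.linalg.inv(gram) @ X_t.T @ y_t
    # Flatten the (n, 1) column so the result is 1-D even when n == 1.
    return torch.round(beta, decimals=4).flatten()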
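
The geometric claims can also be checked numerically. The following is a minimal sketch, not a definitive implementation: the toy data and all variable names are assumptions made for illustration. It solves the Normal Equation both with the explicit inverse and with `torch.linalg.solve` (which avoids forming the inverse and is often preferred numerically), then confirms that the residual is orthogonal to the columns of `X` and that moving away from the solution increases the distance to `y`.

```python
import torch

# Hypothetical toy data (an assumption for illustration):
# an intercept column plus one feature, three data points.
X = torch.tensor([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]], dtype=torch.float64)
y = torch.tensor([1.0, 2.0, 2.0], dtype=torch.float64).reshape(-1, 1)

# Normal equation as a linear solve: (X^T X) beta = X^T y.
beta_solve = torch.linalg.solve(X.T @ X, X.T @ y)

# Explicit-inverse form, matching the closed-form formula.
beta_inv = torch.linalg.inv(X.T @ X) @ X.T @ y

# Residual e = y - X beta and the orthogonality condition X^T e = 0.
residual = y - X @ beta_solve
orthogonality = X.T @ residual

# Any other coefficient vector should land farther from y.
# The offset below is an arbitrary illustrative perturbation.
beta_other = beta_solve + torch.tensor([[0.1], [-0.2]], dtype=torch.float64)
dist_best = torch.linalg.norm(y - X @ beta_solve)
dist_other = torch.linalg.norm(y - X @ beta_other)

print("beta (solve):  ", beta_solve.flatten())   # close to 7/6 and 1/2
print("beta (inverse):", beta_inv.flatten())
print("X^T e:         ", orthogonality.flatten())  # close to zero
print("distances:     ", dist_best.item(), "<", dist_other.item())
```

On well-conditioned data like this the two solutions agree to floating-point precision. The difference tends to matter when the columns of `X` are nearly collinear, in which case `X.T @ X` is close to singular and the explicit inverse can lose accuracy.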