# (Preconditioned) Normal Equations

As noted in the introduction to this chapter, the classical direct approach to solving the linear regression problem {eq}`eqn-regression` is to use a factorization-based solver, which requires $O(nd^2)$ operations.
As in our chapter on [Matrix Factorizations](../QR-Factorization/intro.md), we would like to offload the bulk of these operations to matrix-matrix multiplication, which is highly flop-efficient.

Consider the following observation:
:::{prf:theorem}
:label: thm:half_pc-normal_eq
For any $\vec{P}$ with $\range(\vec{P}) = \range(\vec{A})$, the solution to {eq}`eqn-regression` is also the solution of the linear system
\begin{equation*}
\vec{P}^\T\vec{A}\vec{x} = \vec{P}^\T\vec{b}.
\end{equation*}
:::

:::{admonition} Proof
:class: dropdown
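It is well known that the solution to the normal equations $\vec{A}^\T\vec{A}\vec{x} = \vec{A}^\T\vec{b}$ is also the solution to the linear regression problem {eq}`eqn-regression`.
Any $\vec{P}$ with $\range(\vec{P}) = \range(\vec{A})$ can be written as $\vec{P} = \vec{A}\vec{C}$ for some invertible matrix $\vec{C}$, so the system $\vec{P}^\T\vec{A}\vec{x} = \vec{P}^\T\vec{b}$ reads $\vec{C}^\T(\vec{A}^\T\vec{A}\vec{x} - \vec{A}^\T\vec{b}) = \vec{0}$.
Since $\vec{C}^\T$ is invertible, this holds if and only if the normal equations hold.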
:::
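
The theorem is easy to check numerically. The following is a minimal sketch, not part of the original text: the problem sizes, the seed, and the particular choice of $\vec{C}$ (a small perturbation of the identity, so that it is safely invertible) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, d = 200, 5

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Any invertible C gives P = A C with range(P) = range(A).
# Here C is a small perturbation of the identity (an illustrative choice).
C = np.eye(d) + 0.1 * rng.standard_normal((d, d))
P = A @ C

# Solve the d x d system P^T A x = P^T b.
x_P = np.linalg.solve(P.T @ A, P.T @ b)

# Reference least-squares solution.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

rel_err = np.linalg.norm(x_P - x_ls) / np.linalg.norm(x_ls)
print(f"relative difference: {rel_err:.2e}")
```

The two solutions should agree up to rounding error.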

Readers should already be familiar with two special cases:

### Normal Equations

If $\vec{P} = \vec{A}$, then we obtain the standard normal equations. 
In this case, the dominant cost is the matrix-matrix multiplication $\vec{A}^\T\vec{A}$, which costs $O(nd^2)$ operations but is very flop-efficient.
However, since $\cond(\vec{A}^\T\vec{A}) = \cond(\vec{A})^2$, this approach can lead to poor accuracy when $\cond(\vec{A})$ is on the order of the inverse of the *square root* of the machine precision.
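
To make the squaring of the condition number concrete, here is a small sketch under assumed parameters (a synthetic $500\times 10$ matrix with $\cond(\vec{A}) = 10^6$ and a consistent right-hand side; none of these choices come from the text). It compares the normal equations against NumPy's `np.linalg.lstsq`, which serves only as a reference solver.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, d = 500, 10

# Build A = U diag(s) V^T with prescribed singular values, so cond(A) = 1e6.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = (U * np.logspace(0, -6, d)) @ V.T

cond_A = np.linalg.cond(A)
cond_gram = np.linalg.cond(A.T @ A)

# Consistent right-hand side, so the exact solution is known.
x_true = rng.standard_normal(d)
b = A @ x_true

# Normal equations versus a reference least-squares solver.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

err_ne = np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true)
err_ls = np.linalg.norm(x_ls - x_true) / np.linalg.norm(x_true)

print(f"cond(A)     = {cond_A:.2e}")
print(f"cond(A^T A) = {cond_gram:.2e}")
print(f"normal equations error = {err_ne:.2e}")
print(f"reference solver error = {err_ls:.2e}")
```

On a problem like this, we would expect the normal equations to lose roughly twice as many digits as the reference solver.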

### QR factorization

If $\vec{P} = \vec{Q}$, where $\vec{Q}$ is the Q factor in the QR factorization of $\vec{A}$, then $\vec{P}^\T\vec{A} = \vec{R}$.
In this case we do not even need to form $\vec{P}^\T\vec{A}$ explicitly (as long as we also have the $\vec{R}$ factor of the QR factorization).
Moreover, $\cond(\vec{R}) = \cond(\vec{A})$.

We can use the [randomized Cholesky QR algorithm](../QR-Factorization/randomized-cholesky-qr.ipynb) to compute the QR factorization of $\vec{A}$ in a stable, flop-efficient manner.
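
As a small illustration of this special case, the sketch below solves $\vec{R}\vec{x} = \vec{Q}^\T\vec{b}$ with a single triangular solve. Note that it uses NumPy's Householder-based `np.linalg.qr` as a stand-in for randomized Cholesky QR, and that the test problem is an arbitrary assumption.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, d = 300, 8
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Reduced QR: Q is n x d with orthonormal columns, R is d x d upper triangular.
Q, R = np.linalg.qr(A, mode="reduced")

# With P = Q, the system P^T A x = P^T b reads R x = Q^T b.
x_qr = scipy.linalg.solve_triangular(R, Q.T @ b, lower=False)

# Compare against a reference least-squares solver.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
rel_err = np.linalg.norm(x_qr - x_ls) / np.linalg.norm(x_ls)

# R inherits the conditioning of A.
cond_ratio = np.linalg.cond(R) / np.linalg.cond(A)

print(f"relative difference: {rel_err:.2e}")
print(f"cond(R) / cond(A):   {cond_ratio:.6f}")
```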


## Half Preconditioned Normal Equations

As noted in {cite:p}`ipsen_25`, randomization gives us a third option.
Instead of taking $\vec{P}$ as the Q factor of a QR factorization, we can take $\vec{P}$ as the approximate orthogonal basis produced by the [Sketched QR algorithm](../QR-Factorization/randomized-cholesky-qr.ipynb#sketched-qr).

We suggest implementing this approach as follows:

:::{prf:algorithm} Randomized Half Preconditioned Normal Equations
:label: rand-HPNE

**Input:** $\vec{A}\in\R^{n\times d}$, $\vec{b}\in\R^n$, sketching dimension $k$

1. Get $\vec{P},\vec{R}_1 = \Call{Sketched-QR}(\vec{A},k)$
1. Form $\vec{X} = \vec{P}^\T\vec{P}$
1. Compute Cholesky factorization $\vec{R}_2 = \Call{chol}(\vec{X})$
1. Form $\vec{R} = \vec{R}_2\vec{R}_1$
1. Solve $\vec{R}_2^\T\vec{R}\vec{x} = \vec{P}^\T\vec{b}$ using two triangular solves

**Output:** $\vec{x}$
:::

This approach has a computational profile very similar to Randomized Cholesky-QR.
However, note that we avoid the second triangular solve with an $n\times d$ matrix, which Randomized Cholesky-QR needs in order to form $\vec{Q}$ explicitly.


We can easily implement the algorithm in NumPy:

In [2]:
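def randomized_HPNE(A,b,k,zeta,rng):

    # P = A R1^{-1} is an approximately orthonormal basis for range(A)
    P,R1 = sketched_qr(A,k,zeta,rng)
    X = P.T@P
    # np.linalg.cholesky returns the lower factor; transpose so that X = R2.T @ R2
    R2 = np.linalg.cholesky(X).T
    R = R2@R1
    # solve R2.T @ R @ x = P.T @ b with two triangular solves
    y = sp.linalg.solve_triangular(R2.T,P.T@b,lower=True)
    x = sp.linalg.solve_triangular(R,y,lower=False)

    return x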

Because $\vec{P}$ is well-conditioned, the linear system $\vec{P}^\T\vec{A}\vec{x} = \vec{P}^\T\vec{b}$ solved in this approach has a condition number similar to that of $\vec{A}$.
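
As a quick sanity check, the following self-contained sketch runs the same steps on a small synthetic problem and compares the result with `np.linalg.lstsq`. To keep it short, it swaps the sparse sketch for a dense Gaussian sketch; the helper name `hpne_gaussian`, the test matrix, and the choice $k = 4d$ are all assumptions made for illustration.

```python
import numpy as np
import scipy.linalg

def hpne_gaussian(A, b, k, rng):
    """Half preconditioned normal equations with a dense Gaussian sketch (illustrative)."""
    n, d = A.shape
    S = rng.standard_normal((k, n)) / np.sqrt(k)  # assumed sketch, not the sparse one
    R1 = np.linalg.qr(S @ A, mode="r")
    # P = A R1^{-1}, computed with a triangular solve
    P = scipy.linalg.solve_triangular(R1.T, A.T, lower=True).T
    # Upper triangular Cholesky factor: P^T P = R2^T R2
    R2 = np.linalg.cholesky(P.T @ P).T
    R = R2 @ R1
    # Solve R2^T R x = P^T b with two triangular solves
    y = scipy.linalg.solve_triangular(R2.T, P.T @ b, lower=True)
    x = scipy.linalg.solve_triangular(R, y, lower=False)
    return x, R2, R

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, d, k = 2000, 20, 80

# Synthetic test matrix with cond(A) = 1e4.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = (U * np.logspace(0, -4, d)) @ V.T
b = rng.standard_normal(n)

x, R2, R = hpne_gaussian(A, b, k, rng)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
rel_err = np.linalg.norm(x - x_ls) / np.linalg.norm(x_ls)

print(f"relative difference to lstsq: {rel_err:.2e}")
print(f"cond(R2) = {np.linalg.cond(R2):.2f}")
print(f"cond(R)  = {np.linalg.cond(R):.2e}")
print(f"cond(A)  = {np.linalg.cond(A):.2e}")
```

The factor $\vec{R}_2$ should be well-conditioned, while $\vec{R}$ should have essentially the same condition number as $\vec{A}$.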

:::{prf:theorem}

Let $\vec{P}$ be the approximate orthogonal basis produced by the [Sketched QR algorithm](../QR-Factorization/randomized-cholesky-qr.ipynb#sketched-qr).
Suppose $\vec{S}$ is an $\varepsilon$-subspace embedding for $\vec{A}$.
Then 
\begin{equation*}
\cond(\vec{P}^\T\vec{A}) \leq  \frac{1+\varepsilon}{1-\varepsilon} \cond(\vec{A}).
\end{equation*}
:::

:::{admonition} Proof
:class: dropdown

By {prf:ref}`sketched-qr-well-conditioned`, we have that 
\begin{equation*}
\smin(\vec{P})  \geq \frac{1}{1+\varepsilon} 
,\qquad
\smax(\vec{P}) \leq \frac{1}{1-\varepsilon}.
\end{equation*}

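Hence $\cond(\vec{P}) \leq (1+\varepsilon)/(1-\varepsilon)$.
The result then follows from the fact that $\cond(\vec{P}^\T\vec{A}) \leq \cond(\vec{P})\cond(\vec{A})$, which holds because $\range(\vec{P}) = \range(\vec{A})$.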
:::

Thus, the resulting system has a condition number similar to that of $\vec{A}$, even if $\vec{S}$ is only an $\varepsilon$-subspace embedding for $\vec{A}$ for some constant $\varepsilon$ (e.g. $\varepsilon = 1/2$).
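
The bound can be checked numerically. The sketch below makes several assumptions for illustration: it uses a dense Gaussian sketch, it takes $\varepsilon$ to mean that all singular values of $\vec{S}\vec{U}$ lie in $[1-\varepsilon, 1+\varepsilon]$ for an orthonormal basis $\vec{U}$ of $\range(\vec{A})$, and it measures the smallest such $\varepsilon$ directly rather than deriving it from theory.

```python
import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n, d, k = 2000, 20, 80

# Test matrix with cond(A) = 1e3 and known orthonormal basis U for range(A).
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = (U * np.logspace(0, -3, d)) @ V.T

# Dense Gaussian sketch (an assumption made for this illustration).
S = rng.standard_normal((k, n)) / np.sqrt(k)

# Smallest eps such that all singular values of S U lie in [1 - eps, 1 + eps].
sv = np.linalg.svd(S @ U, compute_uv=False)
eps = max(sv[0] - 1, 1 - sv[-1])

# Sketched QR: P = A R1^{-1}.
R1 = np.linalg.qr(S @ A, mode="r")
P = scipy.linalg.solve_triangular(R1.T, A.T, lower=True).T

sv_P = np.linalg.svd(P, compute_uv=False)
cond_PtA = np.linalg.cond(P.T @ A)
bound = (1 + eps) / (1 - eps) * np.linalg.cond(A)

print(f"measured eps          = {eps:.3f}")
print(f"smin(P), 1/(1+eps)    = {sv_P[-1]:.4f}, {1 / (1 + eps):.4f}")
print(f"smax(P), 1/(1-eps)    = {sv_P[0]:.4f}, {1 / (1 - eps):.4f}")
print(f"cond(P^T A) vs. bound = {cond_PtA:.3e} vs. {bound:.3e}")
```

With $k = 4d$ the measured $\varepsilon$ is far from small, yet the condition number of $\vec{P}^\T\vec{A}$ should stay within a modest constant factor of $\cond(\vec{A})$.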



## Numerical Example

Let's compare the performance of the three approaches described above.

In [None]:
import numpy as np
import scipy as sp
import time
import pandas as pd

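def sparse_stack_sketch(n,k,zeta,rng):

    # split the k rows into zeta blocks of (at most) k_loc rows each
    k_rem = k%zeta
    k_loc = k//zeta+(k_rem>0)

    # each column of S gets one nonzero per block; C[j,i] is its row within block i
    C = rng.integers(0,k_loc,size=(n,zeta))
    if k_rem > 0:
        # when zeta does not divide k, the last block is sampled from k_rem rows
        C[:,-1] = rng.integers(0,k_rem,size=n)
    # shift by the block offsets to get row indices in S
    C += np.arange(0,k,k_loc)

    indices = C.flatten()
    # random signs scaled so that each column of S has unit norm
    values = np.sqrt(1/zeta)*(2*rng.integers(2,size=n*zeta)-1)
    indptr = np.arange(0,n+1)*zeta
    S = sp.sparse.csc_matrix((values,indices,indptr),shape=(k,n))

    return S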

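def cholesky_QR(A):

    # Gram matrix and its upper triangular Cholesky factor: X = R.T @ R
    X = A.T@A
    R = np.linalg.cholesky(X).T
    # Q = A R^{-1}, computed as a triangular solve with the n x d matrix A
    Q = sp.linalg.solve_triangular(R.T, A.T, lower=True).T

    return Q, R

def sketched_qr(A,k,zeta,rng):

    n, d = A.shape
    # sketch A down to k rows and take the R factor of the sketch
    S = sparse_stack_sketch(n,k,zeta,rng)
    Y = S @ A
    R = np.linalg.qr(Y, mode='r')
    # Q = A R^{-1} is only approximately orthonormal
    Q = sp.linalg.solve_triangular(R.T,A.T,lower=True).T

    return Q, R

def randomized_cholesky_QR(A,k,zeta,rng):

    # precondition with sketched QR, then orthonormalize with Cholesky QR
    Q1, R1 = sketched_qr(A,k,zeta,rng)
    Q, R2 = cholesky_QR(Q1)
    R = R2 @ R1

    return Q, R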