# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

## Perturbation theory and condition numbers [10 points]

Let's start by asking how "hard" it is to solve a given linear system, $Ax=b$. You will apply perturbation theory to answer this question.

**Intuition: Continuous functions of a single variable.** To build your intuition, consider the simple case of a scalar function in a single continuous variable, $y = f(x)$. Suppose the input is perturbed by some amount, $\Delta x$. The output will also change by some amount, $\Delta y$. How large is $\Delta y$ relative to $\Delta x$?

Supposing $\Delta x$ is sufficiently small, you can approximate the change in the output by a Taylor series expansion of $f(x + \Delta x)$:

$$
  y + \Delta y = f(x + \Delta x) = f(x) + \Delta x \frac{df}{dx} + O(\Delta x^2).
$$

Since $\Delta x$ is assumed to be "small," we can approximate this relation by

$$
\begin{eqnarray}
    y + \Delta y & \approx & f(x) + \Delta x \frac{df}{dx} \\
        \Delta y & \approx & \Delta x \frac{df}{dx}.
\end{eqnarray}
$$

This result should not be surprising: the first derivative measures the sensitivity of changes in the output to changes in the input. We will give the derivative a special name: it is the _(absolute) condition number_. If it is very large in the vicinity of $x$, then even small changes to the input will result in large changes in the output. Put differently, a large condition number indicates that the problem is intrinsically sensitive, so we should expect it may be difficult to construct an accurate algorithm.

In addition to the absolute condition number, we can define a _relative_ condition number for the problem of evaluating $f(x)$.

$$
\begin{eqnarray}
                \Delta y &  \approx   & \Delta x \frac{df}{dx} \\
                         & \Downarrow & \\
  \frac{|\Delta y|}{|y|} &  \approx   & \frac{|\Delta x|}{|x|} \cdot \underbrace{\frac{|df/dx| \cdot |x|}{|f(x)|}}_{\kappa_f(x)}.
\end{eqnarray}
$$

Here, the underbraced factor, defined to be $\kappa_f(x)$, is the relative analogue of the absolute condition number. Again, its magnitude tells us whether the output is sensitive to the input.
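To make the relative condition number concrete, here is a minimal sketch that evaluates $\kappa_f(x)$ for two simple functions chosen purely for illustration: $\sqrt{x}$, which is well-conditioned everywhere, and $x - 1$, which is badly conditioned near $x = 1$. The helper `rel_cond` and the sample points are assumptions of this sketch, not part of the assignment.

```python
import numpy as np

def rel_cond(f, dfdx, x):
    """Relative condition number kappa_f(x) = |f'(x)| * |x| / |f(x)|."""
    return abs(dfdx(x)) * abs(x) / abs(f(x))

# f(x) = sqrt(x): kappa is 1/2 for every x > 0 (well-conditioned).
kappa_sqrt = rel_cond(np.sqrt, lambda x: 0.5 / np.sqrt(x), 2.0)

# g(x) = x - 1 near x = 1: kappa = |x| / |x - 1|, which blows up (ill-conditioned).
g = lambda x: x - 1.0
x0 = 1.0 + 1e-6
kappa_shift = rel_cond(g, lambda x: 1.0, x0)

# Compare the prediction with an actual relative perturbation of the input.
rel_dx = 1e-9
rel_dy = abs(g(x0 * (1.0 + rel_dx)) - g(x0)) / abs(g(x0))
amplification = rel_dy / rel_dx

print("kappa for sqrt(x) at x = 2      :", kappa_sqrt)
print("kappa for x - 1 at x = 1 + 1e-6 :", kappa_shift)
print("observed amplification          :", amplification)
```

The observed amplification for $x - 1$ should be close to the predicted $\kappa$, which is roughly $10^6$ at this point.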

**Perturbation theory for linear systems.** What if we perturb a linear system? How can we measure its sensitivity or "intrinsic difficulty" to solve?

First, recall the following linear algebraic identities:

* _Triangle inequality_: $\|x + y\|_2 \leq \|x\|_2 + \|y\|_2$
* _Norm of a matrix-vector product_: $\|Ax\|_2 \leq \|A\|_F\cdot\|x\|_2$
* _Norm of matrix-matrix product_: $\|AB\|_F \leq \|A\|_F\cdot\|B\|_F$

To simplify the notation a little, we will drop the "$2$" and "$F$" subscripts.
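If you want to convince yourself of these three inequalities, the following sketch checks them numerically on random data. The matrices, vectors, and seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
y = rng.standard_normal(4)

norm2 = lambda v: np.linalg.norm(v, 2)      # vector two-norm
normF = lambda M: np.linalg.norm(M, 'fro')  # matrix Frobenius norm

triangle_ok = norm2(x + y) <= norm2(x) + norm2(y)
matvec_ok = norm2(A @ x) <= normF(A) * norm2(x)
matmat_ok = normF(A @ B) <= normF(A) * normF(B)

print("triangle inequality holds:  ", triangle_ok)
print("matrix-vector bound holds:  ", matvec_ok)
print("matrix-matrix bound holds:  ", matmat_ok)
```

A numerical check on one random instance is not a proof, of course, but it is a quick sanity test of which norm goes with which object.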

Suppose all of $A$, $b$, and the eventual solution $x$ undergo additive perturbations, denoted by $A + \Delta A$, $b + \Delta b$, and $x + \Delta x$, respectively. Then, subtracting the original system from the perturbed system, you would obtain the following.

$$
\begin{array}{rrcll}
   &         (A + \Delta A)(x + \Delta x) & = & b + \Delta b & \\
- [&                                   Ax & = & b & ] \\
\hline
   & \Delta A x + (A + \Delta A) \Delta x & = & \Delta b & \\
\end{array}
$$

Now look more closely at the perturbation, $\Delta x$, of the solution. Let $\hat{x} \equiv x + \Delta x$ be the perturbed solution. Then the above can be rewritten as,

$$\Delta x = A^{-1} \left(\Delta b - \Delta A \hat{x}\right),$$

where we have assumed that $A$ is invertible. (That won't be true for our overdetermined system, but let's not worry about that for the moment.)
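For readers who want the intermediate steps, the rearrangement can be spelled out as follows. It uses only the subtracted system and the definition $\hat{x} \equiv x + \Delta x$:

$$
\begin{array}{rcl}
  \Delta A \, x + (A + \Delta A) \, \Delta x & = & \Delta b \\
  A \, \Delta x & = & \Delta b - \Delta A \, (x + \Delta x) \\
                & = & \Delta b - \Delta A \, \hat{x} \\
  \Delta x & = & A^{-1} \left(\Delta b - \Delta A \, \hat{x}\right).
\end{array}
$$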

How large is $\Delta x$? Let's use a norm to measure it and bound it using the triangle inequality and the product bounds recalled above:

$$
\begin{array}{rcl}
  \|\Delta x\| &   =   & \|A^{-1} \left(\Delta b - \Delta A \hat{x}\right)\| \\
               &  \leq & \|A^{-1}\|\cdot\left(\|\Delta b\| + \|\Delta A\|\cdot\|\hat{x}\|\right).
\end{array}
$$

You can rewrite this as follows:

$$
\begin{array}{rcl}
  \frac{\|\Delta x\|}
       {\|\hat{x}\|}
    & \leq &
    \|A^{-1}\| \cdot \|A\| \cdot \left(
                                   \frac{\|\Delta A\|}
                                        {\|A\|}
                                   +
                                   \frac{\|\Delta b\|}
                                        {\|A\| \cdot \|\hat{x}\|}
                                 \right).
\end{array}
$$

This bound says that the relative error of the perturbed solution, compared to relative perturbations in $A$ and $b$, scales with the product, $\|A^{-1}\| \cdot \|A\|$. This factor is the linear systems analogue of the condition number for evaluating the function $f(x)$! As such, we define

$$\kappa(A) \equiv \|A^{-1}\| \cdot \|A\|$$

as the _condition number of $A$_ for solving linear systems.
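The sketch below checks the perturbation bound numerically on a small random system. The system, the perturbation sizes, and the seed are all hypothetical choices for illustration. It uses the Frobenius norm for matrices and the two-norm for vectors, matching the identities recalled earlier; note that `np.linalg.cond()` uses the matrix two-norm by default, so its value will generally differ somewhat from the `kappa` computed here.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Small, hypothetical perturbations of A and b.
dA = 1e-8 * rng.standard_normal((n, n))
db = 1e-8 * rng.standard_normal(n)

x = np.linalg.solve(A, b)
x_hat = np.linalg.solve(A + dA, b + db)
dx = x_hat - x

norm_A = np.linalg.norm(A, 'fro')
kappa = np.linalg.norm(np.linalg.inv(A), 'fro') * norm_A

# Left- and right-hand sides of the relative perturbation bound.
lhs = np.linalg.norm(dx) / np.linalg.norm(x_hat)
rhs = kappa * (np.linalg.norm(dA, 'fro') / norm_A
               + np.linalg.norm(db) / (norm_A * np.linalg.norm(x_hat)))

print("kappa (Frobenius)    :", kappa)
print("relative change in x :", lhs)
print("bound                :", rhs)
```

The bound is typically pessimistic: the observed relative change is often well below the right-hand side.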

What values of $\kappa(A)$ are "large?" Generally, you want to compare $\kappa(A)$ to $1/\epsilon$, where $\epsilon$ is _machine precision_, which is the [maximum relative error under rounding](https://sites.ualberta.ca/~kbeach/phys420_580_2010/docs/ACM-Goldberg.pdf). We may look more closely at floating-point representations later on, but for now, a good notional value for $\epsilon$ is about $10^{-7}$ in single-precision and $10^{-15}$ in double-precision. (In Python, the default format for floating-point values is double-precision.)
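As a quick check of these notional values, NumPy reports a machine epsilon for each floating-point type through `np.finfo`. Its `eps` attribute is the gap between 1.0 and the next representable number, which is within a factor of two of the maximum relative rounding error, so it is a reasonable stand-in for $\epsilon$ here.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps  # gap between 1.0 and the next float32
eps64 = np.finfo(np.float64).eps  # same, for float64 (Python's default float)

print("float32 eps:", eps32, "  1/eps:", 1.0 / eps32)
print("float64 eps:", eps64, "  1/eps:", 1.0 / eps64)

# A relative change well below eps is lost entirely to rounding.
lost = (1.0 + eps64 / 4) == 1.0
print("1.0 + eps64/4 == 1.0 ?", lost)
```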

This analysis explains why solving the normal equations directly could lead to computational problems. In particular, one can show that $\kappa(X^T X) \approx \kappa(X)^2$, which means forming $X^T X$ explicitly may make the problem harder to solve by a large amount.
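The squaring effect is easy to observe on synthetic data. The sketch below builds a hypothetical tall matrix `Z` with badly scaled columns (the sizes, scales, and seed are arbitrary assumptions for this illustration) and compares the two-norm condition numbers of `Z` and of `Z.T @ Z`.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
# Hypothetical 50 x 3 matrix whose columns differ in scale by orders of magnitude.
Z = rng.standard_normal((50, 3)) * np.array([1.0, 1e-1, 1e-3])

cond_Z = np.linalg.cond(Z)            # two-norm condition number (the default)
cond_ZTZ = np.linalg.cond(Z.T @ Z)

print("cond(Z)     :", cond_Z)
print("cond(Z)^2   :", cond_Z**2)
print("cond(Z^T Z) :", cond_ZTZ)
```

In the two-norm the relation is an equality in exact arithmetic, since the singular values of $Z^T Z$ are the squares of those of $Z$; any small discrepancy you see is rounding error.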

## Examples

Let's look at some examples of conditioning.

First, recall the data matrix from Part 1.

In [None]:
import pandas as pd
import numpy as np

from IPython.display import display

VARIABLES = ['BP', 'Age', 'Weight', 'BSA', 'Dur', 'Pulse', 'Stress']
df = pd.read_table ('bloodpress.txt', usecols=VARIABLES)

RESPONSE = 'BP'
PREDICTORS = ['Weight', 'BSA']

y = df[[RESPONSE]].values  # `DataFrame.as_matrix()` was removed in pandas 1.0
m = len (y)

X_initial = df[PREDICTORS].values  # `DataFrame.as_matrix()` was removed in pandas 1.0
X = np.insert (X_initial, [0], [[1.]] * m, axis=1)

print ("X (first five rows):\n")
print (X[:5])

**Exercise 1** (1 point). Use Numpy's condition number estimator, `np.linalg.cond()`, to estimate the condition numbers for a) $X$ and b) $X^T X$. Use the form of the condition number based on the two-norm. Store the results in `cond_X` and `cond_XTX`, respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from IPython.display import display, Math, Latex

display (Math (r'\kappa(X) \approx {:.0f}'.format (cond_X)))
display (Math (r'\kappa(X)^2 \approx {:.0f}'.format (cond_X**2)))
display (Math (r'\kappa(X^T X) \approx {:.0f}'.format (cond_XTX)))

assert 1. <= cond_X <= 3e3
assert 1. <= cond_XTX <= 6e6

print ("\n(Passed.)")

**Example.** Let's look at a system that is ill-conditioned and see what happens when we make a tiny perturbation to it.

In [None]:
A = np.array([(1., 1000.),
              (2.**(-10) + 2.**(-11), 1.)])

print ("A ==\n", A)
print ("\ncond (A) == ", np.linalg.cond (A))

In [None]:
Delta_A = np.array ([(0., 0.),
                     (-2.**(-11), 0.)
                    ])
B = A + Delta_A

print ("B := A + dA ==\n", B)
print ("\ncond (B) / cond (A) == ",
      np.linalg.cond (B) / np.linalg.cond (A))

In [None]:
b = np.array([1., 1.])

x_A = np.linalg.solve (A, b)
print ("x ~= A^(-1)*b == ", x_A)

x_B = np.linalg.solve (B, b)
print ("x ~= B^(-1)*b == ", x_B)

**Example.** Here is another example, taken from [Trefethen and Bau (1997)](http://bookstore.siam.org/ot50/) with Python code by [Da Kuang (2014)](http://math.ucla.edu/~dakuang/).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

m = 101

t = np.linspace (0., 1., m)
y = np.exp (np.sin (4*t))
y /= 2006.787453080206

plt.plot (t, y, 'o--')

**Exercise 2** (2 points). The following code creates a matrix, $X$, based on the data created above. Explain what approximately solving the system $X \alpha \approx y$ for the vector $\alpha$ does, effectively, given $y$ as computed above (`y[0:m]`).

In [None]:
n = 15
X = np.zeros ((m, n))

for j in range (n):
    X[:, j] = np.power (t, j)
    
print ("Condition number of X: %g" % np.linalg.cond (X))

YOUR ANSWER HERE

**Exercise 3** (2 points). Compute the solution to this problem by solving the normal equations. Store your solution in a vector `alpha1`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print ("Your solution, alpha1:")
print (alpha1)

plt.plot (t, y, '.', t, X.dot (alpha1), '-')

In [None]:
r1 = X.dot (alpha1) - y
r1_norm2 = np.linalg.norm (r1, 2)**2

plt.semilogy (t, np.abs (y/r1), '*--')

print ("\nResidual two-norm squared:")
print (r1_norm2)
assert r1_norm2 <= 1e-13

print ("\n(Passed.)")

Numpy's built-in linear least squares solver uses a different algorithm. Compare its solution to the one computed by solving the normal equations.

In [None]:
alpha2 = np.linalg.lstsq (X, y)[0]

print ("Numpy's built-in linear least squares solver:")
print (alpha2)

r2 = X.dot (alpha2) - y
r2_norm2 = np.linalg.norm (r2, 2)**2

print ("\nTwo-norm squared:", r2_norm2)

print ("\nalpha1 ./ alpha2 ==")
print (alpha1 / alpha2)

plt.plot (alpha1 / alpha2, '*--')

**Example: Nearly collinear predictors.** Another more practical situation in which poor conditioning can arise in linear regression modeling is when you include two strongly correlated predictors.

**Exercise 4** (2 points). Create an $M \times 2$ data matrix `X` with two columns of nearly collinear predictors, where `M` is the number of rows set in the code cell below. That is, if $x_0$ and $x_1$ are the columns, then create these columns such that they are not identical but have a correlation coefficient close to 1.0 and a condition number that is at least $10^{16}$.

> You can compute the correlation coefficient between two vectors `a` and `b` by using Numpy's built-in function, `np.corrcoef(a, b)`. It returns a $2 \times 2$ correlation matrix whose off-diagonal entries hold the coefficient.

In [None]:
M = 10 # Number of rows

# YOUR CODE HERE
raise NotImplementedError()

XTX = X.T.dot(X)
print ('cond (X):', np.linalg.cond (X))
print ('cond (X^T*X):', np.linalg.cond (XTX))

In [None]:
corr_coef = np.corrcoef (X[:, 0], X[:, 1])[0, 1]
print ("Correlation coefficient: {:.15f}".format (corr_coef))
assert corr_coef >= 0.99

print ("\n(Passed.)")

**Exercise 5** (1 point). Let $\theta^T = (0.1, 0.1)$ be the true solution of this problem. Compute the corresponding right-hand side, $y = X \theta$. Call your right-hand side vector `y`.

In [None]:
theta_true = np.array ([[0.1], [0.1]])

# YOUR CODE HERE
raise NotImplementedError()

**Exercise 6** (2 points). Solve this system $X \theta = y$ using the normal equations method, given `y` but "pretending" that you do not know `theta_true`. Store your result as `theta_1`. How does it compare to `theta_true`?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print (theta_1)

**Exercise 7** (1 point). Use Numpy's built-in least squares solver instead (`np.linalg.lstsq()`), storing its estimate as a vector `theta_2`. Does it give a better answer?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print (theta_2)