# Principles of Numerical Mathematics

Mostly from Ch. 2 of Quarteroni (2000)




## Well-posedness and condition number

### Definition of problem types
A problem is to find $x$ such that $F(x,d)=0$, where $d$ is a set of data and $F$ is the functional relationship between $x$ and $d$.
- Direct  : $x$ unknown, $F$ and $d$ given
- Inverse : $d$ unknown, $F$ and $x$ given
- Identification : $F$ unknown, $x$ and $d$ given

### Well-posed problem
A (PDE) problem is called *well posed* or *stable* provided it has a **unique** solution which depends **continuously** on the given data. Otherwise, it is called *ill-posed* or *improperly posed* or *unstable.*
#### Example
\begin{equation}
p(x) = x^{4}-x^{2}(2a-1)+a(a-1) 
\end{equation}
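has a number of real roots that varies discontinuously with $a$: 4 if $a \ge 1$, 2 if $a \in [0,1)$ and 0 if $a \lt 0$ (counting multiplicity). Finding the number of real roots is therefore an ill-posed problem.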
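The jump in the number of real roots can be checked with a small sketch. It assumes the factorization $p(x) = (x^{2}-a)(x^{2}-(a-1))$, which follows by expanding the product; the function name `count_real_roots` is only illustrative, and roots are counted with multiplicity.

```python
def count_real_roots(a: float) -> int:
    """Count the real roots (with multiplicity) of
    p(x) = x^4 - (2a - 1) x^2 + a (a - 1) = (x^2 - a) (x^2 - (a - 1))."""
    count = 0
    for c in (a, a - 1.0):
        # x^2 = c has two real roots when c >= 0 (a double root at 0 when c == 0)
        if c >= 0.0:
            count += 2
    return count


for a in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print(a, count_real_roots(a))
```

An arbitrarily small change of $a$ across $0$ or $1$ changes the count by 2, which is exactly the lack of continuous dependence on the data.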

#### Meaning of *continuous dependence* 
Continuous dependence on the data means that **small** perturbations on the data $d$ yield **small** changes in the solution $x$.
- $\delta d$: An admissible perturbation on the data
- $\delta x$: The consequent change in the solution
- $F(x+\delta x, d + \delta d) = 0$, then,
\begin{equation}
^{\forall}\eta > 0, \  ^{\exists}K(\eta,d) \text{ such that } \Vert \delta d \Vert < \eta \Rightarrow \Vert \delta x \Vert \le K(\eta, d) \Vert \delta d \Vert.
\end{equation}

#### Condition number
- Relative condition number
\begin{equation}
 K(d) = \sup_{\delta d \in D} \frac{ \Vert \delta x \Vert / \Vert x \Vert}{\Vert \delta d \Vert / \Vert d \Vert},
\end{equation}
where $D$ is a small neighborhood of the origin containing the *admissible* perturbations.
- Absolute condition number, used when $x=0$ or $d=0$
\begin{equation}
 K_{abs}(d) = \sup_{\delta d \in D} \frac{\Vert \delta x \Vert}{\Vert \delta d \Vert}.
\end{equation}

- A problem is *ill-conditioned* if $K(d)$ is *big*; the precise meaning of "big" varies from one problem to another.
- *ill-conditioned* $\neq$ *ill-posed*! 

#### Example 1
\begin{equation}
x^{2} - 2px + 1 = 0 \quad (p \ge 1)
\end{equation}
The solution is $x_{\pm}=p \pm \sqrt{p^{2}-1}$. 
Now define the function $F(x,p) = x^{2}-2px +1$, where the "datum" $d$ is the coefficient $p$ and $x$ is the vector of components $\{x_{+},x_{-}\}$.
Let's also define a *resolvent* $G$ by $x=G(d)$, where $x$ is the solution, so that $F(G(d),d)=0$.
Expressing the solutions using resolvents, we get
\begin{equation}
  x_{\pm} = G_{\pm}(p) = p \pm \sqrt{p^{2}-1}.
\end{equation}
and 
\begin{equation}
  G^{\prime}_{\pm}(p) = 1 \pm \frac{p}{\sqrt{p^{2}-1}}.
\end{equation}

From
\begin{equation}
x + \delta x = G(d+\delta d),
\end{equation}
we get
\begin{equation}
\delta x = G(d+\delta d)-x = G(d+\delta d)-G(d).
\end{equation}

The Taylor expansion of $G$ tells us that
\begin{equation}
 G(d+\delta d)-G(d) = G^{\prime}(d)\delta d + o(\Vert \delta d \Vert) \quad \text{for} \quad \delta d \rightarrow 0.
\end{equation}

Thus, 
\begin{equation}
\delta x \approx G^{\prime}(d)\delta d.
\end{equation}

Now we get the following form of the condition number:
\begin{equation}
K(d) = \sup_{\delta d \in D} \frac{\Vert \delta x \Vert / \Vert x \Vert}{\Vert \delta d \Vert / \Vert d \Vert } = \sup_{\delta d \in D} \frac{ \Vert G^{\prime}(d)\delta d \Vert  \,  \Vert d \Vert }{ \Vert G(d) \Vert  \,  \Vert \delta d \Vert }
 \approx  \Vert G^{\prime}(d) \Vert  \frac{ \Vert d \Vert }{ \Vert G(d) \Vert }
\end{equation}

Using the identity, $d=p$, we get
\begin{equation}
K(p) \approx \left\vert 1 \pm \frac{p}{\sqrt{p^{2}-1}}  \right\vert \frac{ |p| }{\left\vert p \pm \sqrt{p^{2}-1} \right\vert} = \frac{|p|}{\sqrt{p^{2}-1}}
\end{equation}

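So, $K$ is small (i.e., $\sim 1$) for $p \ge \sqrt{2}$ but goes to $\infty$ as $p \rightarrow 1$, where $G_{\pm}$ is no longer differentiable. The problem is therefore ill-conditioned near $p=1$, although it is not ill-posed: the roots still depend continuously on $p$.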
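The growth of $K$ near $p=1$ can be checked numerically. The following is a minimal sketch, not part of the source: it perturbs $p$ by a small relative step (the value `1e-6` is an arbitrary choice), measures the relative change of $x_{+}$, and compares the ratio with $|p|/\sqrt{p^{2}-1}$. The function names are illustrative.

```python
import math


def root_plus(p: float) -> float:
    """Larger root x_+ = G_+(p) of x^2 - 2 p x + 1 = 0 (requires p >= 1)."""
    return p + math.sqrt(p * p - 1.0)


def cond_estimate(p: float, rel_step: float = 1e-6) -> float:
    """Estimate the relative condition number by perturbing the datum p."""
    dp = rel_step * p
    dx = root_plus(p + dp) - root_plus(p)
    return abs(dx / root_plus(p)) / abs(dp / p)


def cond_formula(p: float) -> float:
    """First-order formula K(p) = |p| / sqrt(p^2 - 1)."""
    return abs(p) / math.sqrt(p * p - 1.0)


for p in (2.0, 1.1, 1.001):
    print(p, cond_estimate(p), cond_formula(p))
```

The two columns should agree to several digits, and both grow as $p$ approaches 1.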

However, this problem can be **regularized**: i.e., the singularity at $p=1$ can be removed by a change of parameters.
By letting $t = p + \sqrt{p^{2}-1}$, so that $2p = (1+t^{2})/t$ and $F(x,t) = x^{2} - ((1+t^{2})/t)x + 1 = 0$, the roots become $x=t$ and $x=1/t$, which coincide when $t=1$ (i.e., $p=1$).
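To see why the reparametrization removes the singularity, one can repeat the first-order estimate $K(d) \approx |G^{\prime}(d)|\,|d|/|G(d)|$ with $t$ as the datum. This is a short check under that same approximation, with resolvents $G_{+}(t)=t$ and $G_{-}(t)=1/t$:
\begin{equation}
K_{+}(t) \approx |1| \, \frac{|t|}{|t|} = 1, \qquad K_{-}(t) \approx \left| -\frac{1}{t^{2}} \right| \frac{|t|}{|1/t|} = 1 .
\end{equation}
Both condition numbers equal 1 for every $t \ge 1$, including the double root at $t=1$.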

#### Example 2
Let's consider a linear mapping:
\begin{equation}
 f: \mathbb{R}^{2} \rightarrow \mathbb{R}, \quad f(a,b) = a+b
\end{equation}

Note that the resolvent $G$ coincides with $f$ in this case and its gradient is the vector $f^{\prime}(a,b)=(1,1)$. The data $d$ in this problem is the vector of the two parameters, $(a,b)$.
Using the $L_{1}$ norm ($\Vert \mathbf{a} \Vert_{1}=\sum_{i=1}^{n}|a_{i}|$) for the data, the norm of $G^{\prime}(d)$ is the induced one, $\sup_{\delta d \neq 0} |\delta a + \delta b| / (|\delta a|+|\delta b|) = 1$, and we get the condition number:
\begin{equation}
  K(a,b) \approx \Vert G^{\prime}(d) \Vert  \frac{ \Vert d \Vert_{1} }{ | G(d) | } = \frac{|a|+|b|}{|a+b|}
\end{equation}

If $a$ and $b$ have the same sign, $K=1$ and the problem is well-conditioned;
if $a$ and $b$ are almost equal in magnitude but of opposite signs, $K \rightarrow \infty$ as $a+b \rightarrow 0$ and the problem is ill-conditioned.
The ill-conditioned situation arises from the **cancellation of significant digits**.
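A small sketch of both cases, assuming the 1-norm on the data and the induced norm on the gradient, for which $K(a,b) = (|a|+|b|)/|a+b|$. The function name `cond_sum` and the sample values are illustrative; the last lines show cancellation in double-precision arithmetic.

```python
def cond_sum(a: float, b: float) -> float:
    """Relative condition number of f(a, b) = a + b,
    K = (|a| + |b|) / |a + b|, with the 1-norm on the data."""
    return (abs(a) + abs(b)) / abs(a + b)


print(cond_sum(1.0, 2.0))        # same sign
print(cond_sum(1.0, -0.999999))  # nearly equal magnitude, opposite signs

# Cancellation in practice: storing b introduces a relative rounding error
# of at most about 1e-16, which is amplified in the computed sum.
a, b = 1.0, -(1.0 - 1e-10)
exact = 1e-10
computed = a + b
rel_err = abs(computed - exact) / exact
print(rel_err)
```

The relative error of the computed sum is many orders of magnitude larger than the rounding error in the data, in line with the large value of $K$.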

## Stability of Numerical Methods

A numerical method for the approximate solution of $F(x,d)=0$ will consist, in general, of a sequence of approximate problems
\begin{equation}
F_{n}(x_{n}, d_{n})=0 \quad n \ge 1
\end{equation}
such that $x_{n} \rightarrow x$ as $n \rightarrow \infty$; in other words, **the numerical solution converges to the exact solution**. A typical example of the data $d_{n}$ is a grid spacing (often denoted by $h$) that is successively refined. 

In methods for finding an approximate solution to a partial differential equation, such as the finite difference method or the finite element method, it is common practice to demonstrate convergence by showing that the error or residual decreases at the expected rate as $h$ gets smaller.
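As a minimal illustration of such a rate check (not a PDE solver), the sketch below halves $h$ repeatedly for a centered finite-difference approximation of a derivative and computes the observed order from successive errors. The test function $\sin$, the point $x_{0}=1$ and the step sizes are arbitrary choices made for this example.

```python
import math


def centered_difference(f, x: float, h: float) -> float:
    """Second-order centered approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
exact = math.cos(x0)  # derivative of sin at x0
steps = [0.1 / 2**k for k in range(5)]
errors = [abs(centered_difference(math.sin, x0, h) - exact) for h in steps]

# Observed order between successive refinements: log2(e(h) / e(h/2))
orders = [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]
for h, e in zip(steps, errors):
    print(f"h = {h:.5f}  error = {e:.3e}")
print("observed orders:", orders)
```

The observed orders should be close to 2, the expected rate for this scheme.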

[//]: # "Is such a demonstration sufficient to show that the employed numerical method is convergent? Most of the time but not always. That's why we distinguish convergence from **consistency**:
The problem sequence is **consistent** if $F_{n}(x,d) - F(x,d) \rightarrow 0$ for $n \rightarrow \infty$."

For a numerical method to be **stable** or **well-posed**, we require for any fixed $n$,

1. there exists a unique solution $x_{n}$ corresponding to $d_{n}$
2. $x_{n}$ is a unique and continuous function of $d_{n}$:
\begin{equation}
^{\forall}\eta > 0, ^{\exists}K_{n}(\eta, d_{n}): \Vert \delta d_{n} \Vert < \eta \Rightarrow \Vert \delta x_{n} \Vert \le K_{n} \Vert \delta d_{n} \Vert,
\end{equation}
where the condition number $K_{n}$ should be understood as a relative or an absolute one according to the context. Recall the following definitions:
\begin{equation}
 K_{n}(d_{n}) = \sup_{\delta d_{n} \in D} \frac{ \Vert \delta x_{n} \Vert / \Vert x_{n} \Vert}{\Vert \delta d_{n} \Vert / \Vert d_{n} \Vert},
\end{equation}
and 
\begin{equation}
 K_{abs,n}(d_{n}) = \sup_{\delta d_{n} \in D} \frac{\Vert \delta x_{n} \Vert}{\Vert \delta d_{n} \Vert}.
\end{equation}

Asymptotic condition numbers are defined as follows:
\begin{align}
K^{num}(d_{n}) &= \lim_{k\rightarrow \infty} \sup_{n\ge k} K_{n}(d_{n}) \\
K^{num}_{\text{abs}}(d_{n}) &= \lim_{k\rightarrow \infty} \sup_{n\ge k} K_{\text{abs},n}(d_{n})
\end{align}

The numerical method is said to be **well conditioned** if $K^{num}$ is "small" for any admissible datum $d_{n}$; **ill-conditioned** otherwise.

A more formal way of defining the convergence of a numerical method is as follows:
A numerical method 
\begin{equation}
F_{n}(x_{n}, d_{n})=0 \quad n \ge 1
\end{equation}
is **convergent** if and only if
\begin{equation}
\begin{split}
&^{\forall}\epsilon > 0, \quad ^{\exists}n_{0}(\epsilon), \quad ^{\exists}\delta(n_{0},\epsilon) > 0 \quad \text{such that} \\
&^{\forall}n \gt n_{0}(\epsilon), \quad ^{\forall}\Vert \delta d_{n} \Vert < \delta(n_{0},\epsilon), \quad
\Vert x(d) - x_{n}(d+\delta d_{n}) \Vert \le \epsilon.
\end{split}
\end{equation}

What would matter more to us is how to measure the convergence, in other words, how to measure the error of an approximate solution.

- **Absolute error**: $E(x_{n}) = \Vert x - x_{n} \Vert $
- **Relative error**: $E_{rel}(x_{n}) = \Vert x-x_{n} \Vert / \Vert x \Vert $ if $\Vert x \Vert \neq 0 $
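The two error measures can be sketched as follows. The Euclidean norm is an assumption made here (the definitions hold for any vector norm), and the helper names are illustrative.

```python
import math


def norm2(v):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(c * c for c in v))


def absolute_error(x, x_n):
    """E(x_n) = ||x - x_n||."""
    return norm2([a - b for a, b in zip(x, x_n)])


def relative_error(x, x_n):
    """E_rel(x_n) = ||x - x_n|| / ||x||, defined only for x != 0."""
    nx = norm2(x)
    if nx == 0.0:
        raise ValueError("relative error is undefined for x = 0")
    return absolute_error(x, x_n) / nx


x = [3.0, 4.0]    # "exact" solution in this toy example
x_n = [3.0, 4.5]  # approximate solution
print(absolute_error(x, x_n), relative_error(x, x_n))
```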



## Sources of Error

- A physical problem (PP), whose solution is denoted by $x_{ph}$
- $F(x,d)=0$: A mathematical model of (PP)
- $F_{n}(\hat{x}_{n},d_{n})=0$: A computational model of (PP)

The error associated with the computational model ($e$) is
\begin{equation}
\begin{split}
e &= e_{m} + e_{c} \\
  &= (x-x_{ph}) + (\hat{x}_{n}-x),
\end{split}
\end{equation}
where 

- $e_{m}$: Error of mathematical model $+$ error in data
- $e_{c}$: Discretization error ($e_{n}$) $+$ error introduced by the numerical algorithm and roundoff ($e_{a}$)

With the above notation, we can summarize the sources of error as in the next figure from Quarteroni (2000):

<img src="./Figures/Quarteroni_Fig2.1.PNG" width=480 />

## Other criteria for good numerical methods
Although convergence is of prime importance for a numerical method, the following concepts are also considered when choosing or developing one:

- Accuracy: $e$ is small with respect to a fixed tolerance. Usually quantified as the order of $e_{n}$ with respect to the characteristic discretization parameter (for instance, the largest grid spacing $h$ between the discretization nodes), e.g., $e_{n} \sim h^{2}$.
- Reliability: $e$ is likely to be below a certain tolerance. Needs testing (benchmarking).
- Efficiency: The computational complexity needed to control the error is as small as possible.
- Complexity of an algorithm: A measure of execution time
- Complexity of a problem: The complexity of the most efficient among the algorithms for the problem

## Machine representation of numbers

### The *positional representation* of a real number

\begin{equation}
x_{\beta} = (-1)^{s} \left[ x_{n}\, x_{n-1}\, \cdots\, x_{1}\, x_{0}.x_{-1}\, x_{-2}\,\cdots x_{-m} \right],
\end{equation}

where $\beta$ is the *base*, $s$ determines the sign (0: $+$, 1: $-$), and the point is the decimal point if $\beta=10$ or the binary point if $\beta=2$.

This representation can also be written as
\begin{equation}
x_{\beta} = (-1)^{s}\left( \sum_{k=-m}^{n} x_{k}\beta^{k} \right),
\end{equation}
where $0\le x_{k} \lt \beta$.

#### Example

\begin{matrix}
x_{10} &= 425.33 = &4\cdot 10^{2} &+ 2\cdot 10 &+  5 &+  3\cdot 10^{-1} &+ 3\cdot 10^{-2} \\
x_{6}  &= 425.33 = &4\cdot 6^{2}  &+ 2\cdot 6  &+  5 &+  3\cdot 6^{-1}  &+ 3\cdot 6^{-2}
\end{matrix}

\begin{equation}
\begin{split}
x_{10} &= 1/3 = 0.3333\cdots = 0.\bar{3} \\
x_{3}  &= 1/3 = 0.1
\end{split}
\end{equation}
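The examples above can be reproduced with a short sketch that evaluates the sum $\sum_{k} x_{k}\beta^{k}$ exactly using rational arithmetic. The function name `positional_value` and its argument layout are assumptions made for this illustration.

```python
from fractions import Fraction


def positional_value(int_digits, frac_digits, base, sign=0):
    """Evaluate (-1)^s * sum_k x_k * base^k exactly.

    int_digits lists x_n ... x_0 and frac_digits lists x_{-1} ... x_{-m}.
    """
    digits = list(int_digits) + list(frac_digits)
    if any(not 0 <= d < base for d in digits):
        raise ValueError("each digit must satisfy 0 <= x_k < base")
    value = Fraction(0)
    n = len(int_digits) - 1  # exponent of the leading digit
    for i, d in enumerate(digits):
        value += d * Fraction(base) ** (n - i)
    return (-1) ** sign * value


print(positional_value([4, 2, 5], [3, 3], 10))  # 425.33 read in base 10
print(positional_value([4, 2, 5], [3, 3], 6))   # 425.33 read in base 6
print(positional_value([0], [1], 3))            # 0.1 read in base 3
```

The same digit string denotes different real numbers in different bases, and $1/3$ has a finite expansion in base 3 but not in base 10.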


### The *fixed-point* number system

Suppose we have $N$ memory positions in a computer. Then the **fixed-point system** represents a number as
\begin{equation}
x=\underbrace{(-1)^{s}}_{\text{1 position}} [ \underbrace{a_{N-2}\, a_{N-3}\, \cdots\, a_{k}}_{N-1-k \text{ positions}}.\underbrace{a_{k-1}\, \cdots a_{0}}_{k \text{ positions}} ].
\end{equation}
An equivalent summation expression is
\begin{equation}
x=(-1)^{s} \underbrace{\beta^{-k}}_{\text{a fixed scaling factor}} \sum_{j=0}^{N-2} a_{j}\beta^{j}.
\end{equation}
Since $k$ is fixed, the range between the smallest and the largest representable magnitudes, $\beta^{-k}$ and $\beta^{-k}(\beta^{N-1}-1)$, is very limited.
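To make the limited range concrete, here is a minimal sketch that computes the smallest positive and the largest magnitudes from the summation formula. The values $N=8$, $k=3$, $\beta=2$ are arbitrary choices for illustration, not a format used by any particular machine.

```python
from fractions import Fraction


def fixed_point_range(N: int, k: int, beta: int = 2):
    """Smallest positive and largest magnitudes with N positions
    (one for the sign) and k digits after the point."""
    smallest = Fraction(1, beta**k)                   # only a_0 = 1
    largest = Fraction(beta ** (N - 1) - 1, beta**k)  # all N-1 digits equal beta-1
    return smallest, largest


lo, hi = fixed_point_range(N=8, k=3)
print(float(lo), float(hi))
```

With these values the representable magnitudes span only from $1/8$ to $127/8$, in uniform steps of $1/8$.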

### The *floating-point* number system

Let's consider another number representation:
\begin{equation}
x = (-1)^{s} (0.a_{1}\, a_{2}\, \cdots\, a_{t})\beta^{e} = (-1)^{s}\, m \, \beta^{e-t},
\end{equation}
where $t \in \mathbb{N}$ is the number of allowed significant digits $a_{i}$, $m=a_{1}\, a_{2}\, \cdots\, a_{t}$ an integer number called "mantissa" ($0 \le m \le \beta^{t}-1$) and $e$ an integer called "exponent" ($L\le e \le U$ and $L\lt 0, U\gt 0$).
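For $\beta=2$ this decomposition can be inspected with Python's standard `math.frexp`, which returns a fraction in $[0.5, 1)$ and an exponent $e$, i.e. the form $0.a_{1}a_{2}\cdots \times 2^{e}$ with $a_{1}=1$. The sketch assumes Python floats are IEEE 754 doubles with $t=53$; the function name `decompose` is illustrative.

```python
import math

t = 53  # significant binary digits of a double (assumed IEEE 754 binary64)


def decompose(x: float):
    """Return (s, m, e) with x = (-1)^s * m * 2^(e - t) and m an integer."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    frac, e = math.frexp(abs(x))  # abs(x) = frac * 2^e with 0.5 <= frac < 1
    m = int(math.ldexp(frac, t))  # integer mantissa, exact for doubles
    return s, m, e


s, m, e = decompose(-6.5)
print(s, m, e)
print((-1) ** s * m * 2.0 ** (e - t))  # reconstructs the original number
```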

- **single precision**: N = 32 bits
<img src="./Figures/Quarteroni_singleprecision.PNG" width=400 />

- **double precision**: N = 64 bits

<img src="./Figures/Quarteroni_doubleprecision.PNG" width=600 />

Let's denote the set of **floating-point numbers** with $t$ significant digits, base $\beta \ge 2$, $0 \le a_{i} \le \beta-1$, and range $(L,U)$ with $L \le e \le U$ by
\begin{equation}
\mathbb{F}(\beta, t, L, U) = \{0\} \, \cup \, \left\{ x \in \mathbb{R}: x = (-1)^{s} \beta^{e} \sum_{i=1}^{t} a_{i}\beta^{-i} \right\}
\end{equation}


#### Example
$\beta=10$, $t=4$, $L=-1$, $U=4$. If $a_{1}$ can be 0, we end up with the following redundancy:
\begin{equation}
\begin{split}
1 &= 0.1000 \cdot 10^{1} \\
  &= 0.0100 \cdot 10^{2} \\
  &= 0.0010 \cdot 10^{3} \\
  &= 0.0001 \cdot 10^{4}
\end{split}
\end{equation}
Therefore $a_{1} \neq 0$ for a unique representation and this representation is called *normalized*.
Read pp 47-48 and Example 2.11 of Quarteroni (2000) for further discussion on normalized and *denormalized* representations.


We can see immediately that 

- if $x \in \mathbb{F}(\beta, t, L, U)$, then $-x \in \mathbb{F}$.
- $x_{min} = \beta^{L-1} \le |x| \le \beta^{U}(1-\beta^{-t})=x_{max}$

#### Example
Find all the positive numbers in the set $\mathbb{F}(2,3,-1,2)$.

Verify that $x_{min}=\beta^{L-1}=2^{-2}=1/4$ and $x_{max}=\beta^{U}(1-\beta^{-t})=2^{2}(1-2^{-3})=7/2$.
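One way to carry out this exercise is to enumerate the set directly. The sketch below assumes the normalized representation ($a_{1} \neq 0$) and uses exact rational arithmetic; the function name `positive_floats` is illustrative.

```python
from fractions import Fraction
from itertools import product


def positive_floats(beta: int, t: int, L: int, U: int):
    """All positive normalized numbers of F(beta, t, L, U), sorted."""
    values = set()
    for e in range(L, U + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:  # normalized representation requires a_1 != 0
                continue
            mantissa = sum(Fraction(a, beta**i) for i, a in enumerate(digits, start=1))
            values.add(mantissa * Fraction(beta) ** e)
    return sorted(values)


numbers = positive_floats(2, 3, -1, 2)
print(len(numbers), [str(v) for v in numbers])
```

The first and last entries of `numbers` can be compared with $x_{min}$ and $x_{max}$; note also that the spacing between consecutive numbers doubles each time the exponent increases.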


The *standard* floating-point numbers are 

- $\mathbb{F}(2,24,-125,128)$ for the single precision
- $\mathbb{F}(2,53,-1021,1024)$ for the double precision
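The double-precision parameters can be compared with what Python reports in `sys.float_info`, whose `min_exp` and `max_exp` follow the same convention of a mantissa in $[1/\beta, 1)$. This is a sketch assuming the interpreter uses IEEE 754 doubles, which is the case on common platforms but is not guaranteed by the language.

```python
import sys
from fractions import Fraction

info = sys.float_info
print(info.radix, info.mant_dig, info.min_exp, info.max_exp)

beta, t, L, U = info.radix, info.mant_dig, info.min_exp, info.max_exp
x_min = Fraction(beta) ** (L - 1)
x_max = Fraction(beta) ** U * (1 - Fraction(1, beta**t))

# Compare with the smallest normalized and the largest finite double
print(Fraction(info.min) == x_min, Fraction(info.max) == x_max)
```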