# Error Analysis

In [71]:
import numpy as np
import matplotlib.pyplot as plt

Main sources of numerical errors: 
- rounding and cancellation (due to the use of finite-precision arithmetic)
- truncation or approximation errors (due to approximation of infinite sequences or continuous functions by a finite number of samples)

## Basic definitions

**Absolute and relative error:**

If $ \tilde{x} $ is an approximation for $ x $, then:

  * absolute error $ A(x) = | \tilde{x} - x |$

  * relative error $ R(x) = | \tilde{x} - x | \, / \, | x | $
  
**Decimal precision:**

Given a relative error $ R $, the decimal precision $ p $ is the largest integer such that $ R \leq 5 \times 10^{-p} $
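The definitions above can be tried out on a concrete number. The following is a minimal sketch; the choice of $ x = \pi $ and $ \tilde{x} = 22/7 $ is only an illustration, and the formula for $ p $ assumes $ R > 0 $.

```python
import math

x = math.pi          # reference value
x_approx = 22 / 7    # approximation chosen for illustration

abs_err = abs(x_approx - x)
rel_err = abs_err / abs(x)

# Largest integer p with rel_err <= 5 * 10**(-p),
# i.e. p = floor(log10(5 / rel_err)); assumes rel_err > 0.
p = math.floor(math.log10(5 / rel_err))

print(f"absolute error    = {abs_err:.3e}")
print(f"relative error    = {rel_err:.3e}")
print(f"decimal precision = {p}")
```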

**Big-O notation:**

The error term in an approximation to a mathematical function can be described by the [big-O](https://en.wikipedia.org/wiki/Big_O_notation) notation:

$$
    f(x) =  O(g(x)) \quad \text{as} \quad x \rightarrow a
$$ 
if and only if 
$$
    |f(x)| \leq M |g(x)| \quad \text{whenever} \quad |x - a| < \delta \quad \text{for some constants} \quad M, \delta > 0.
$$ 
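As a short worked illustration (the function is chosen here only as an example), the Taylor remainder of the sine gives

$$
    | \sin{x} - x | \leq \frac{1}{6} |x|^3 \quad \text{for all real } x,
$$

so that $ \sin{x} - x = O(x^3) $ as $ x \rightarrow 0 $, with $ M = 1/6 $ and any $ \delta > 0 $.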

## Roundoff error

### Representation of real numbers in computers

The most widely used representation of real numbers in computers is the floating-point representation. A floating-point representation has a base $ \beta $, an exponent $ E $ and a precision $ p $. In general, a floating-point number is represented as 

$$ f = \pm \, d_1.d_2d_3 \dots d_p \times \beta^E, $$

where $ d_1.d_2d_3 \dots d_p $ is called the significand (also mantissa).

**Properties of floating-point systems:**

- Smallest positive number ([underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow) if below)
- Largest number ([overflow](https://en.wikipedia.org/wiki/Integer_overflow) if above)
- [Machine epsilon](https://en.wikipedia.org/wiki/Machine_epsilon), $ \varepsilon $, defined as the difference between 1 and the next larger floating point number (upper bound on the relative error due to rounding in floating point arithmetic)
- Special values: zero (`0`), infinities (`+Inf`, `-Inf`), [not a number](https://en.wikipedia.org/wiki/NaN) (`NaN`)
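The properties listed above can be probed directly for Python's built-in `float`, which is a double-precision number on common platforms (an assumption of this sketch). The loop halves a candidate value until adding half of it to 1 no longer changes the result.

```python
import math
import sys

# Machine epsilon: the gap between 1 and the next larger float.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps, eps == sys.float_info.epsilon)

inf = float("inf")
nan = float("nan")

print(1e308 * 10)   # overflow: the result is inf
print(5e-324 / 2)   # underflow: the result is 0.0
print(inf - inf)    # an undefined operation gives nan
print(nan == nan)   # nan compares unequal even to itself
```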

**Example:** Consider a decimal system of floating-point numbers defined as $ f = \pm \, d_1.d_2 \cdot \beta^E, $ where $ \beta = 10 $ and $ E \in \{-2, -1, 0 \} $.  

1) How many numbers does such a system represent?  

2) What is the smallest positive number and the largest number?  

3) What is the machine epsilon? 

4) What is the distribution of numbers on the real line?

**Example:** Consider a binary system of floating-point numbers defined as $ f = \pm \, d_1.d_2 \cdot \beta^E, $ where $ \beta = 2 $ and $ E \in \{-1, 0, 1 \} $.

1) How many numbers does such a system represent?  

2) What is the smallest positive number and the largest number?  


3) What is the machine epsilon?


4) What is the distribution of numbers on the real line?

Note that, unlike the real number system (which is continuous), floating-point systems always have gaps between numbers. If a number is not exactly representable, it must be approximated by one of the nearest representable values. Furthermore, the distribution of numbers on the real line is not uniform.
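One way to explore the two toy systems from the examples is to enumerate them. The sketch below assumes normalized numbers (leading digit $ d_1 \neq 0 $) plus a separate zero, which the examples do not state explicitly; `toy_floats` is a helper name introduced here for illustration.

```python
from fractions import Fraction

def toy_floats(beta, exponents):
    """Positive values d1.d2 * beta**E with a nonzero leading digit d1 (assumed)."""
    values = set()
    for E in exponents:
        for d1 in range(1, beta):
            for d2 in range(beta):
                values.add((d1 + Fraction(d2, beta)) * Fraction(beta) ** E)
    return sorted(values)

for beta, exponents in [(10, (-2, -1, 0)), (2, (-1, 0, 1))]:
    pos = toy_floats(beta, exponents)
    count = 2 * len(pos) + 1                   # both signs plus zero
    eps = min(v for v in pos if v > 1) - 1     # gap between 1 and the next number
    gaps = sorted({b - a for a, b in zip(pos, pos[1:])})
    print(f"base {beta}: {count} numbers, smallest positive {pos[0]}, "
          f"largest {pos[-1]}, epsilon {eps}")
    print("  distinct gaps:", [str(g) for g in gaps])
```

The gap between neighbouring numbers grows with the exponent, which is why the numbers are denser near zero.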

### Implementation of floating-point systems

Over the years, a variety of floating-point representations have been used in computers. In 1985, the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard for floating-point arithmetic was established. Since then, most implementations of the floating-point systems in computers conform to the rules defined by IEEE.

**Half precision** [IEEE 754-2008](https://en.wikipedia.org/wiki/IEEE_754-2008_revision): 

Total storage allotted is 16 bits (10 bits for mantissa, 5 bits for exponent, 1 bit for sign)

In [74]:
print(np.finfo(np.float16))

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
---------------------------------------------------------------



**Single precision:**

Total storage allotted is 32 bits (23 bits for mantissa, 8 bits for exponent, 1 bit for sign)

In [75]:
print(np.finfo(np.float32))

Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
---------------------------------------------------------------



**Double precision:**

Total storage allocated is 64 bits (52 bits for mantissa, 11 bits for exponent, 1 bit for sign)

In [76]:
print(np.finfo(np.float64))

Machine parameters for float64
---------------------------------------------------------------
precision =  15   resolution = 1.0000000000000001e-15
machep =    -52   eps =        2.2204460492503131e-16
negep =     -53   epsneg =     1.1102230246251565e-16
minexp =  -1022   tiny =       2.2250738585072014e-308
maxexp =   1024   max =        1.7976931348623157e+308
nexp =       11   min =        -max
---------------------------------------------------------------



The IEEE 754 standard also defines algorithms for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same results as those algorithms. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE 754 standard. 

**Example:** Using the double precision floating-point format, check whether $ 0.1 + 0.2 = 0.3 $.

In [84]:
print((0.1 + 0.2) == 0.3)

False


The result above is not a Python bug (read more e.g. [here](https://docs.python.org/3/faq/design.html#why-are-floating-point-calculations-so-inaccurate) or [here](https://docs.python.org/3/tutorial/floatingpoint.html)). Representing infinitely many real numbers by a finite number of bits requires an approximate representation. Given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. None of the numbers 0.1, 0.2, and 0.3 (actually most of the decimal fractions) has an exact representation as a binary floating-point number, no matter how many base 2 digits you’re willing to use. 

Note that decimal numbers could be represented exactly by the base-10 floating-point systems. However, base-10 implementations are rare because base-2 (binary) arithmetic is so much faster on computers.
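Python's standard library contains a software implementation of base-10 floating-point arithmetic in the `decimal` module, which can be used to illustrate the remark above. The sketch also shows `math.isclose`, one common way to compare binary floats with a tolerance instead of `==`.

```python
import math
from decimal import Decimal

# Base-10 arithmetic: 0.1, 0.2 and 0.3 are represented exactly.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Base-2 arithmetic: compare with a relative tolerance instead of ==.
print(math.isclose(0.1 + 0.2, 0.3))

# The exact value stored for the binary float 0.1.
print(Decimal(0.1))
```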

By default, Python always displays a rounded value (to keep the number of digits manageable):

In [82]:
print(0.1)

0.1


but the actual stored value is the nearest representable binary fraction:

In [83]:
print(format(0.1, ".55f"))

0.1000000000000000055511151231257827021181583404541015625


**Example:** Using the double precision floating-point format, calculate $ x = 0.1 + 0.2 - 0.3 $. Afterwards, perform $ 100 \ \times $ the operation $ x := x + x $ and display the value of $ x $.
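A possible solution is sketched below. Each doubling is exact in binary arithmetic, so the tiny initial rounding error is simply multiplied by $ 2^{100} $.

```python
x0 = 0.1 + 0.2 - 0.3   # exactly zero in real arithmetic
print(x0)

x = x0
for _ in range(100):
    x = x + x           # doubling is exact, so the initial error is only scaled
print(x)
```

A result that should be zero ends up being a number of the order of $ 10^{13} $.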

**Example:** Consider the following function, $$ f(x) = \frac{1 - \cos{x}}{x^2}. $$

The behaviour of this function as $ x $ approaches zero can be determined by evaluating the limit $$ \lim_{x \to 0} f(x) = \frac{1}{2}. $$

Can we find the same result using floating-point arithmetic?
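The sketch below evaluates the function for decreasing $ x $. For comparison it also uses the identity $ 1 - \cos{x} = 2 \sin^2(x/2) $, which avoids subtracting two nearly equal numbers; this reformulation is an addition for illustration and the names `f_naive` and `f_stable` are introduced here.

```python
import math

def f_naive(x):
    # Direct evaluation: 1 - cos(x) suffers from cancellation for small x.
    return (1.0 - math.cos(x)) / x**2

def f_stable(x):
    # Uses 1 - cos(x) = 2 * sin(x/2)**2, which involves no subtraction.
    s = math.sin(x / 2.0)
    return 2.0 * s * s / x**2

for x in (1e-1, 1e-3, 1e-5, 1e-7, 1e-8):
    print(f"x = {x:.0e}   naive = {f_naive(x):.16f}   stable = {f_stable(x):.16f}")
```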

**Example:** Calculate the sum of the following series,
$$
\sum_{k = 0}^{n} 0.9^{k},
$$
for $ n = 400 $, adding the terms first in order of increasing $ k $ and then in order of decreasing $ k $, and compare the results.

When summing a series of numbers using floating-point systems, always sum the smaller numbers first.
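A minimal sketch of the comparison for the geometric series with ratio 0.9 and $ n = 400 $: the terms are added once from the largest to the smallest and once in the opposite order. `math.fsum` is used here as a reference because it returns the correctly rounded sum of the given floats; this choice of reference is an assumption of the sketch.

```python
import math

n = 400
terms = [0.9 ** k for k in range(n + 1)]   # decreasing: 1, 0.9, 0.81, ...

forward = 0.0
for t in terms:                 # largest terms first
    forward += t

backward = 0.0
for t in reversed(terms):       # smallest terms first
    backward += t

reference = math.fsum(terms)    # correctly rounded sum of the same floats

print(f"largest first : {forward!r}  (error {abs(forward - reference):.2e})")
print(f"smallest first: {backward!r}  (error {abs(backward - reference):.2e})")
print(f"fsum reference: {reference!r}")
```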

### Summary

- Floating-point arithmetic is commutative, but not necessarily associative or distributive:

valid operations:
$$ 1 \cdot x = x,  $$
$$ x \cdot y = y \cdot x, $$
$$ x + x = 2 \cdot x $$

not necessarily valid operations:
$$ x \cdot x^{-1} = 1, $$
$$ (1 + x) - 1 = x, $$
$$ (x + y) + z = x + (y + z) $$

- Errors propagate between calculations. A small error in the input may result in a large error in the output.

**Which operations should I avoid in order to minimize the roundoff error?**

- subtractions of numbers that are nearly equal
- additions and subtractions of numbers that differ greatly in magnitude
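Some of the identities listed in this summary can be checked in double precision. The values below are chosen only for illustration.

```python
# Associativity can fail.
x, y, z = 0.1, 0.2, 0.3
print((x + y) + z, x + (y + z), (x + y) + z == x + (y + z))

# Adding numbers of very different magnitude: the small one is lost.
small = 1e-16
print((1.0 + small) - 1.0 == small)

# x * (1/x) is not always exactly 1; collect the integers below 100 where it fails.
failures = [k for k in range(1, 100) if k * (1.0 / k) != 1.0]
print(failures)

# Commutativity and x + x = 2 * x hold.
print(x * y == y * x, x + x == 2 * x)
```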


### Representation of integers in computers

Although the standard arithmetic operations on integers are exact, some care is needed when working with large numbers, because a result outside the representable range overflows.

**16 bit unsigned**

In [93]:
print(np.iinfo(np.uint16))

Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------



**16 bit signed**

In [95]:
print(np.iinfo(np.int16))

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------



**32 bit signed**

In [96]:
print(np.iinfo(np.int32))

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------



## Truncation error

Truncation error is the discrepancy between the true answer and the answer obtained by a numerical method, regardless of the roundoff error. Truncation error would persist even if we had an infinitely accurate representation of numbers.

**Example:** Construct a Taylor series of the function 

$$
f(x) = e^x
$$

using the first three terms. Plot the approximation on the interval $ x \in [-1, 1] $ and compare it with $ e^x $.

Calculate the absolute and relative errors of the approximation above and plot the result. Find the maximum values.
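A possible sketch of the error calculation, assuming the expansion is taken about $ x = 0 $ so that the first three terms are $ 1 + x + x^2/2 $. Plotting is left out here; the arrays `abs_err` and `rel_err` can be passed to `plt.plot` directly.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
taylor = 1.0 + x + x**2 / 2.0      # first three terms of the series about 0
exact = np.exp(x)

abs_err = np.abs(taylor - exact)
rel_err = abs_err / np.abs(exact)

i_abs = np.argmax(abs_err)
i_rel = np.argmax(rel_err)
print(f"max absolute error {abs_err[i_abs]:.4f} at x = {x[i_abs]:+.2f}")
print(f"max relative error {rel_err[i_rel]:.4f} at x = {x[i_rel]:+.2f}")
```

The largest absolute error occurs at the right end of the interval, while the largest relative error occurs at the left end, where $ e^x $ itself is small.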

## Combination of errors

In general, the roundoff errors and the truncation errors may combine.

**Example:** Calculate the derivative of the function
$$
f(x) = \sin{x}
$$
on the interval $ x \in [-2 \pi, 2 \pi] $ using the forward finite-difference scheme, i.e.,
$$
f^{\prime}(x) \approx \frac{f(x + h) - f(x)}{h},
$$
where $ h $ is a finite step. Compare the approximation with the exact derivative
$$
f^{\prime}(x) = \cos{x}.
$$

Calculate the relative error of the approximation above for $ x = 1 $ and $ h = 2^{-n} $, where $ n \in \{ 0, \dots, 55 \} $. Find the optimal value of $ h $.
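The relative error at $ x = 1 $ can be tabulated as sketched below. For large $ h $ the truncation error of the forward difference dominates, for very small $ h $ the roundoff error takes over, and the optimum lies in between. The variable names are introduced here for illustration.

```python
import math

x = 1.0
exact = math.cos(x)

errors = []
for n in range(56):
    h = 2.0 ** -n
    approx = (math.sin(x + h) - math.sin(x)) / h   # forward difference
    errors.append(abs(approx - exact) / abs(exact))

n_best = min(range(56), key=errors.__getitem__)
print(f"smallest relative error {errors[n_best]:.3e} for h = 2**-{n_best}")
print(f"relative error for h = 2**-55: {errors[55]}")
```

For the smallest steps, $ x + h $ rounds to $ x $, the difference quotient is zero and the relative error is 1.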

## Numerical stability

A numerical method may magnify the roundoff error introduced into the computation at an early stage. Such a method is called unstable.

**Example:** Using the half-precision floating-point format, calculate the first 20 integer powers of the number called "golden mean",

$$
\phi = \frac{\sqrt{5} - 1}{2},
$$
with the following recursive algorithm,
$$
\phi^{n+1} = \phi^{n-1} - \phi^{n}.
$$

Compare the calculated values with the ones obtained by the simple multiplication.
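A minimal sketch of the comparison using `np.float16`. It assumes the recursion is started from $ \phi^0 = 1 $ and $ \phi^1 = \phi $ and takes a double-precision evaluation as the reference; the array names `rec`, `mul` and `ref` are introduced here.

```python
import numpy as np

N = 20
phi64 = (np.sqrt(5.0) - 1.0) / 2.0
phi = np.float16(phi64)                 # rounded to half precision

rec = np.empty(N + 1, dtype=np.float16)  # powers from the subtraction recursion
mul = np.empty(N + 1, dtype=np.float16)  # powers from repeated multiplication
rec[0] = mul[0] = np.float16(1.0)
rec[1] = mul[1] = phi
for n in range(1, N):
    rec[n + 1] = rec[n - 1] - rec[n]
    mul[n + 1] = mul[n] * phi

ref = phi64 ** np.arange(N + 1)          # double-precision reference

for n in (5, 10, 15, 20):
    print(f"n = {n:2d}  recursion = {float(rec[n]):+.6f}  "
          f"multiplication = {float(mul[n]):.6f}  reference = {ref[n]:.6f}")
```

The rounding error in $ \phi $ excites the second solution of the recurrence, which grows in magnitude with alternating sign, so the recursive values eventually become negative and far from the true powers.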

## Condition number

The [condition number](https://en.wikipedia.org/wiki/Condition_number) measures how much the output value of a function can change for a small change in the input argument (i.e., its sensitivity to input error). Given the problem $ f $ and the input $ x $, the condition number $ C_p $ is defined as 

$$
C_p = \frac{\| \delta f(x) \| \, / \, \| f(x) \|}{\| \delta x \| \, / \, \| x \|}.
$$

A problem with $ C_p \approx 1 $ is said to be well-conditioned.  
A problem with $ C_p > 100 $ is said to be ill-conditioned.  
A problem with $ C_p > \varepsilon^{-1} $ is not solvable within the specified precision.

**Example:**
Find the condition number of the following system of equations,

$$ x + \alpha y = 1, $$
$$ \alpha x + y = 0, $$

depending on the parameter $ \alpha \in [-2, 2] $. Plot the result and decide for which $ \alpha $ the system is ill-conditioned.
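One possible approach is sketched below. It assumes that the condition number of the linear system is taken to be the 2-norm condition number of its coefficient matrix, as returned by `np.linalg.cond`; the helper `system_matrix` and the grid of 400 points (chosen so that $ \alpha = \pm 1 $, where the matrix is singular, is not hit exactly) are choices made for this illustration. The threshold 100 is the one quoted above.

```python
import numpy as np

def system_matrix(alpha):
    # Coefficient matrix of x + alpha*y = 1, alpha*x + y = 0.
    return np.array([[1.0, alpha], [alpha, 1.0]])

alphas = np.linspace(-2.0, 2.0, 400)
conds = np.array([np.linalg.cond(system_matrix(a)) for a in alphas])

ill = alphas[conds > 100.0]
print("ill-conditioned for alpha near:", np.round(ill, 3))
print("condition number at alpha = 0:", np.linalg.cond(system_matrix(0.0)))
```

The arrays `alphas` and `conds` can be plotted with `plt.semilogy(alphas, conds)` to show the peaks near $ \alpha = \pm 1 $.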

## References

[1] https://raw.githubusercontent.com/mandli/intro-numerical-methods/master/04_error.ipynb  
[2] http://www.lahey.com/float.htm  
[3] https://www.cl.cam.ac.uk/teaching/1011/FPComp/floatingmath.pdf  