# Sources of Numerical Error

By the end of this lecture, you will be able to describe two common sources of error in numerical computation: *roundoff* and *truncation* error.  The first is an ever-present but often negligible feature of floating-point arithmetic.  The second arises whenever we discretize a continuous domain, e.g., when using finite-difference approximations for derivatives.

## Required Preparation

- Review [finite differences](https://robertsj.github.io/me400_notes/lectures/Numerical_Differentiation.html) for context.
- Skim the Wikipedia article on the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating-point standard and tinker with [this neat tool](https://www.h-schmidt.net/FloatConverter/IEEE754.html).

## A Motivating Example

Consider the forward difference approximation for $\frac{dy}{dx}\Big |_{x=1/2}$ with $y(x) = \sin^2(x)$.  (Note that the exact derivative is $2\sin(x)\cos(x)$.)
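
A short Taylor-series argument suggests how the error of this approximation should scale with the step size $\delta$.  This is a sketch that assumes $y$ is twice continuously differentiable near $x$; the intermediate point $\xi$ is unknown:

$$
  \frac{y(x+\delta) - y(x)}{\delta} = y'(x) + \frac{\delta}{2} y''(\xi), \qquad \xi \in [x, x+\delta]
$$

For $y(x) = \sin^2(x)$, we have $y''(x) = 2\cos(2x)$, so at $x = 1/2$ the truncation error is approximately $\delta \cos(1) \approx 0.54\,\delta$ for small $\delta$, i.e., it decreases linearly with $\delta$.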

In [None]:
import numpy as np
import matplotlib.pyplot as plt
f = lambda z: np.sin(z)**2           # y(x) = sin^2(x)
delta = np.logspace(-8,-1)           # step sizes from 1e-8 to 1e-1
x = 0.5
dfdx_approx = np.zeros(len(delta))
dfdx_actual = 2*np.sin(x)*np.cos(x)  # exact derivative at x
for i in range(len(delta)):
    # forward difference: (f(x + delta) - f(x)) / delta
    dfdx_approx[i] = (f(x+delta[i])-f(x))/delta[i]
error = abs(dfdx_approx - dfdx_actual)
plt.loglog(delta, error, 'k-',
           delta, delta, 'r--')
plt.legend(['abs error', 'expected behavior'])
plt.xlabel(r'$\delta$')              # raw string avoids an invalid escape sequence
plt.grid(True)
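
The plot stops at $\delta = 10^{-8}$, where truncation error still dominates.  As a rough sketch of what happens for even smaller steps, the snippet below evaluates the same forward difference at a few hand-picked values of $\delta$.  The helper name `forward_difference_error` and the chosen step sizes are illustrative assumptions, not part of the original example.

```python
import math

def forward_difference_error(delta, x=0.5):
    """Absolute error of the forward difference for y = sin(x)**2 at x."""
    f = lambda z: math.sin(z)**2
    approx = (f(x + delta) - f(x)) / delta
    exact = 2 * math.sin(x) * math.cos(x)
    return abs(approx - exact)

# Error first shrinks with delta (truncation), then grows again (roundoff).
for delta in [1e-2, 1e-5, 1e-8, 1e-11, 1e-15]:
    print(f"delta = {delta:.0e}   error = {forward_difference_error(delta):.3e}")
```

The error does not keep shrinking as $\delta \to 0$: once $\delta$ is small enough, the subtraction $f(x+\delta) - f(x)$ loses most of its significant digits, and roundoff takes over.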

## Representing Numbers

We can write any number $x$ in the form

$$
  x =  (d_n \ldots d_2 d_1 d_0.d_{-1}d_{-2}\ldots )_{b}
$$

where the integer $b$ is the *base*, each digit $d_i$ is an integer in $[0, b)$, and the $.$ between $d_0$ and $d_{-1}$ separates the integer part from the fractional part.

For $b = 2$, we have the base-2 or *binary* system.  For $b=10$, we have the *decimal* system.

The value of $x$, which we can evaluate in the familiar *decimal* system, is

$$
 x =   d_n b^n + \ldots + d_2 b^2 + d_1 b^1 + d_0 b^0 + d_{-1} b^{-1} + \ldots 
$$
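
The sum above translates almost directly into code.  Here is a minimal sketch; the function name `positional_value` and the sample inputs are made up for illustration:

```python
def positional_value(digits, base):
    """Evaluate a digit string such as '1101.101' in the given base."""
    integer_digits, _, fractional_digits = digits.partition(".")
    value = 0.0
    # d_n b^n + ... + d_1 b^1 + d_0 b^0
    for power, ch in enumerate(reversed(integer_digits)):
        value += int(ch, base) * base**power
    # d_{-1} b^{-1} + d_{-2} b^{-2} + ...
    for k, ch in enumerate(fractional_digits, start=1):
        value += int(ch, base) * base**(-k)
    return value

print(positional_value("1101.101", 2))   # 8 + 4 + 1 + 1/2 + 1/8
print(positional_value("2a.8", 16))      # 2*16 + 10 + 8/16
```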

**Exercise**:  Do the following conversions between base-2 and base-10 representations.

  - $10010_2 = \,  ???_{10}$
  - $10.01_2 = \,  ???_{10}$
  - $123_{10} = \, ???_2 $
  - $1.125_{10} =  \, ???_2$
  - $0.1_{10}  = \, ???_2$
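
One way to check base-10 to base-2 conversions is repeated doubling of the fractional part: each doubling shifts the next binary digit to the left of the point.  The sketch below uses exact rational arithmetic from the standard library.  The name `to_binary` and the digit cutoff `max_frac_bits` are assumptions for illustration, and the sample inputs are deliberately not the ones from the exercise.

```python
from fractions import Fraction

def to_binary(value, max_frac_bits=20):
    """Return a base-2 digit string for a nonnegative rational value."""
    value = Fraction(value)
    integer_part = value.numerator // value.denominator
    frac = value - integer_part
    bits = []
    while frac != 0 and len(bits) < max_frac_bits:
        frac *= 2
        bit = int(frac >= 1)      # next digit is the integer part after doubling
        bits.append(str(bit))
        frac -= bit
    suffix = "..." if frac != 0 else ""   # digits were cut off
    frac_str = "." + "".join(bits) if bits else ""
    return bin(integer_part)[2:] + frac_str + suffix

print(to_binary(Fraction(45, 8)))       # 5.625 terminates in base 2
print(to_binary(Fraction(1, 3), 8))     # 1/3 does not terminate in base 2
```

Pass a `Fraction` (or an integer) rather than a float if you want the digits of the exact decimal value, since a float has already been rounded to binary.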

## A Fixed-Point System

Computers have finite memory, so only finitely many digits can be stored.  An easy solution is to enforce a *fixed-point* number format, e.g., the 8-bit format

$$
  (d_3 d_2 d_1 d_0.d_{-1}d_{-2}d_{-3}d_{-4})_{2}
$$

What are the largest and smallest positive numbers one can store (in base-10)?
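
To explore questions like this, it can help to enumerate every value a small fixed-point format can hold.  The sketch below assumes an unsigned format and uses a smaller toy layout, $(d_1 d_0.d_{-1}d_{-2})_2$, so the 8-bit case is left to you.  The function name `fixed_point_values` is hypothetical.

```python
def fixed_point_values(int_bits, frac_bits):
    """All nonnegative values of an unsigned binary fixed-point format."""
    step = 2.0**(-frac_bits)              # spacing between neighboring values
    count = 2**(int_bits + frac_bits)     # number of distinct bit patterns
    return [k * step for k in range(count)]

values = fixed_point_values(2, 2)         # toy format (d_1 d_0 . d_-1 d_-2)_2
print(values)
print("smallest positive:", values[1], " largest:", values[-1])
```

Note that the values are evenly spaced: the absolute resolution is the same near zero as it is near the largest value.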

## Floating-Point Numbers

A more flexible alternative is a *floating-point* representation, which
has the generic form (in base-2)

$$
  \pm (d_0.d_1 d_2 d_3 \ldots d_{p-1})_2 \times 2^e
$$

where each digit $d_i$ is 0 or 1, the precision $p$ is a finite
integer, and the exponent $e$ is bounded by $e_{L} \leq e \leq e_U$
for integers $e_L$ and $e_U$.

How many bits are required to store a number in this format?
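
A toy system makes the character of this format easier to see.  The sketch below enumerates every positive value of the generic form with $p = 3$ and $-1 \leq e \leq 1$.  These parameters, and the function name `toy_floats`, are arbitrary choices for illustration.

```python
from itertools import product

def toy_floats(p, e_min, e_max):
    """Sorted positive values of (d_0.d_1...d_{p-1})_2 * 2**e."""
    values = set()
    for digits in product([0, 1], repeat=p):
        # significand d_0 + d_1/2 + ... + d_{p-1}/2**(p-1)
        significand = sum(d * 2.0**(-i) for i, d in enumerate(digits))
        for e in range(e_min, e_max + 1):
            values.add(significand * 2.0**e)
    values.discard(0.0)
    return sorted(values)

vals = toy_floats(p=3, e_min=-1, e_max=1)
print(vals)
print([b - a for a, b in zip(vals, vals[1:])])   # gaps grow with magnitude
```

Unlike the fixed-point case, the spacing is not uniform: values are dense near zero and sparse near the largest number.  Several bit patterns also map to the same value when $d_0$ is allowed to be 0.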

## The IEEE Standard

The IEEE 754 standard adopts the basic format above along with the following specifications:

  - $d_0$ is *always* set to 1 through *normalization*, so it need not be stored (a free bit)
  - The exponent is represented as $e - \beta$, where the stored integer $e \in [1, e_u]$ and $\beta$ is a fixed *bias* (so no explicit sign bit is needed for the exponent)
  - For *single precision* (32 bits): 1 bit for the sign, 8 for the exponent, 23 for the fraction, and $\beta = 127$
  - For *double precision* (64 bits): 1 bit for the sign, 11 for the exponent, 52 for the fraction, and $\beta = 1023$
  
So, $x =  (-1)^s \times (1.d_1d_2d_3\ldots d_{p-1})_2 \times 2^{e -\beta}$.
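
To make the layout concrete, the sketch below pulls the three fields out of a double using the standard `struct` module.  It assumes Python floats are IEEE 754 doubles, which holds on essentially all current platforms.  The helper name `decode_double` is hypothetical.

```python
import struct

def decode_double(x):
    """Split a Python float into its sign, stored exponent, and fraction fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # 64 raw bits as an integer
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11-bit stored exponent e
    fraction = bits & ((1 << 52) - 1)      # 52-bit fraction d_1 ... d_52
    return sign, exponent, fraction

s, e, frac = decode_double(-6.5)           # -6.5 = -(1.101)_2 * 2**2
print(s, e, frac)
# rebuild the value from the fields (valid for normalized numbers only)
print((-1)**s * (1 + frac / 2**52) * 2.0**(e - 1023))
```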

Some questions:

 - What is the *largest positive* number?
 - What is the difference between 1 and the next representable number after 1?

In [None]:
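e = 2**11-2    # largest stored exponent for normalized doubles; 2**11-1 is reserved for inf and NaN
beta = 1023    # exponent bias for double precision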

In [None]:
import sys
sys.float_info
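
As a sanity check, the values reported by `sys.float_info` can be rebuilt from the double-precision layout.  This is a sketch assuming IEEE 754 doubles; the variable names `e_max`, `largest`, and `gap_after_one` are chosen here for illustration.

```python
import sys

beta = 1023
e_max = 2**11 - 2                  # stored exponent 2**11 - 1 is reserved for inf and NaN
largest = (2 - 2**-52) * 2.0**(e_max - beta)   # all 52 fraction bits set to 1
gap_after_one = 2.0**-52           # only the last fraction bit set, exponent 0

print(largest == sys.float_info.max)
print(gap_after_one == sys.float_info.epsilon)
print(1.0 + gap_after_one / 2 == 1.0)   # half the gap rounds back to 1.0
```

The gap after 1 is often called *machine epsilon*, and it sets the scale of the relative roundoff error in a single floating-point operation.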