Any nonzero real number can be represented by a (possibly infinite) base-$\beta$ expansion in the *normalized form*:

$$0.d_1 d_2 d_3 ... \times \beta^p, d_1 \neq 0$$

$p$ = integer = exponent

$d_k$ = digit in base $\beta$, $0 \leq d_k \leq \beta - 1$

A floating-point number system limits:

1. density - keep only finite # of digits
2. range - finite # of integers for $p$

A floating-point number system can be identified by 4 parameters, $F(\beta, m, L, U)$, and every *nonzero* number in it has the form:

$$\pm 0.d_1 d_2 ... d_m \times \beta^p, d_1 \neq 0, 0 \leq d_i \leq \beta - 1, i = 1 ... m, L \leq p \leq U$$

$\beta$ = digit base

$m$ = # of digits in mantissa

$L$ = lower bound on the exponent $p$

$U$ = upper bound on the exponent $p$

*Zero* is given by $d_1 = \dots = d_m = 0, p = 0$
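
To make the definition concrete, the following Python sketch enumerates every positive number of a small system. The function name `positive_floats` and the toy parameters $F(2, 3, -1, 2)$ are illustrative choices, not part of the notes; exact rational arithmetic (`fractions.Fraction`) is used so the enumeration itself introduces no rounding.

```python
from fractions import Fraction
from itertools import product


def positive_floats(beta, m, L, U):
    """Return the sorted positive numbers of the toy system F(beta, m, L, U)."""
    values = set()
    for p in range(L, U + 1):
        for digits in product(range(beta), repeat=m):
            if digits[0] == 0:  # normalization: d_1 != 0
                continue
            # mantissa 0.d_1 d_2 ... d_m = sum of d_k * beta**(-k)
            mantissa = sum(Fraction(d, beta ** (k + 1)) for k, d in enumerate(digits))
            values.add(mantissa * Fraction(beta) ** p)
    return sorted(values)


toy = positive_floats(2, 3, -1, 2)  # hypothetical toy system
print(len(toy), float(toy[0]), float(toy[-1]))
print([float(v) for v in toy])
```

For this toy system there are $(\beta - 1)\beta^{m-1}(U - L + 1) = 16$ positive numbers, from $0.25$ to $3.5$, and the gap between neighbours doubles each time $p$ increases by one.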

Examples:

IEEE single precision: $\beta = 2, m = 24, L = -126, U = 127$

IEEE double precision: $\beta = 2, m = 53, L = -1022, U = 1023$
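
As a quick check, the parameters of the double-precision system can be read from Python's `sys.float_info`. This is only a sketch, and it assumes the platform's `float` is an IEEE 754 double, which is typical but not guaranteed. The variable names `L_py` and `U_py` are illustrative.

```python
import sys

info = sys.float_info
beta, m = info.radix, info.mant_dig       # digit base and mantissa length
L_py, U_py = info.min_exp, info.max_exp   # exponent bounds in the 0.d1 d2 ... convention

b = float(beta)
# Largest number 0.11...1 (m ones) x beta**U, built in two steps to avoid overflow.
largest = (1.0 - b ** -m) * b ** (U_py - 1) * b
# Smallest positive normalized number 0.10...0 x beta**L = beta**(L - 1).
smallest = b ** (L_py - 1)

print(beta, m, L_py, U_py)
print(largest == info.max, smallest == info.min)
```

On a typical platform this reports $m = 53$ with exponent bounds $-1021$ and $1024$, each one larger than the $L$ and $U$ quoted above. The difference is one of convention only: Python (following C's `float.h`) normalizes as $0.d_1 d_2 \ldots$, as in these notes, while the figures $-1022$ and $1023$ refer to the form $1.d_2 \ldots d_m \times 2^e$ used in the IEEE standard.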

Two limitations:

- Fixed # of digits. An arbitrary real number $x$ must be approximated. 2 ways:
    - Chopping. Take the first $m$ digits of the normalized form.
    
    $0.d_1 d_2 ... d_m ... \times \beta^p \to 0.d_1 d_2 ... d_m \times \beta^p$
    
    - Rounding. 
    
    $x = 0.d_1 d_2 ... d_m d_{m+1}... \times \beta^p$.
    
    If $d_{m+1} < \beta / 2$, use $fl(x) = 0.d_1 d_2 ... d_m \times \beta^p$. Else use $fl(x) = 0.d_1 d_2 ... (d_m+1) \times \beta^p$, carrying into the higher digits if $d_m + 1 = \beta$.
    
    Equivalently, for $x > 0$: $fl(x) = \text{chop}(x + 0.5 \times \beta^{p-m})$


- Exponent is bounded.
    - Underflow ($p < L$).
        - A number is too small
        - the result is normally set to 0
        - e.g. IEEE SP: smallest positive normalized number $= 2^{-126} \approx 10^{-38}$
    - Overflow ($p > U$).
        - A number is too big
        - traditionally aborts the program; IEEE arithmetic returns $\pm \infty$ by default
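
The chopping and rounding rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not a production routine: the helper names `exponent`, `chop`, and `fl_round` are hypothetical, the default base $\beta = 10$ is chosen only for readability, and exact `Fraction` arithmetic stands in for the real number $x$. The exponent bounds $L$ and $U$ are ignored, so underflow and overflow are not modeled.

```python
import math
from fractions import Fraction


def exponent(x, beta=10):
    """Return p with beta**(p-1) <= |x| < beta**p, for x != 0."""
    x = abs(Fraction(x))
    p = 0
    while x >= 1:
        x /= beta
        p += 1
    while x < Fraction(1, beta):
        x *= beta
        p -= 1
    return p


def chop(x, m, beta=10):
    """Keep the first m digits of the normalized form (truncate toward zero)."""
    x = Fraction(x)
    if x == 0:
        return x
    unit = Fraction(beta) ** (exponent(x, beta) - m)  # weight of digit d_m
    sign = 1 if x > 0 else -1
    return sign * math.floor(abs(x) / unit) * unit


def fl_round(x, m, beta=10):
    """Round to m digits via chop(|x| + 0.5 * beta**(p - m)), sign restored."""
    x = Fraction(x)
    if x == 0:
        return x
    half_unit = Fraction(1, 2) * Fraction(beta) ** (exponent(x, beta) - m)
    sign = 1 if x > 0 else -1
    return sign * chop(abs(x) + half_unit, m, beta)


x = Fraction(2, 3)
print(float(chop(x, 4)), float(fl_round(x, 4)))
```

With $m = 4$, chopping $2/3$ keeps $0.6666$ while rounding gives $0.6667$. The carry case is handled by the second `chop`: rounding $0.9996$ to three digits yields $0.100 \times 10^1$.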

### Exception handling

| op | example | result |
| --- | --- | --- |
| invalid op | 0/0 | NaN |
| division by 0 | 1 / 0 | $\pm \infty$ |
| overflow | $2 \times N_{max}$ | $\pm \infty$ |
| underflow | $N_{min} / 2$ | 0 |
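
Some of these cases can be observed directly with Python floats. The sketch below assumes IEEE 754 doubles and uses only the standard library. Note that plain Python raises `ZeroDivisionError` for `1.0 / 0.0` and `0.0 / 0.0` instead of returning $\pm \infty$ or NaN, so those two rows are not reproduced; NaN is produced here through $\infty - \infty$ instead.

```python
import math
import sys

n_max = sys.float_info.max      # largest finite double
n_min = sys.float_info.min      # smallest positive normalized double

overflow = n_max * 2.0          # exceeds the exponent range
invalid = overflow - overflow   # inf - inf is an invalid operation
gradual = n_min / 2.0           # below N_min, stored as a subnormal number
smallest_subnormal = 5e-324     # smallest positive double on IEEE 754 platforms
flushed = smallest_subnormal / 2.0

print(overflow, invalid, gradual, flushed)
```

On such a platform `overflow` is `inf` and `invalid` is `nan`. With gradual underflow, $N_{min} / 2$ is still a nonzero subnormal number; the result becomes exactly 0 only once it falls below the smallest subnormal.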

### Error of FP representation
Let $\tilde{x}$ be an approximation to $x$.

Absolute error:

$$| x - \tilde{x} |$$

Relative error (for $x \neq 0$):

$$\frac{| x - \tilde{x} |}{|x|}$$

$\tilde{x}$ is said to approximate $x$ to about $s$ significant digits if the relative error is $\approx 10^{-s}$, i.e.

$0.5 \times 10^{-s} \leq \frac{| x - \tilde{x} |}{|x|} \leq 5 \times 10^{-s}$
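
A short Python sketch of these definitions follows. The function names and the test pair $x = \pi$, $\tilde{x} = 22/7$ are illustrative assumptions. `significant_digits` returns the integer $s$ that satisfies the two-sided bound above.

```python
import math


def abs_error(x, approx):
    """Absolute error |x - approx|."""
    return abs(x - approx)


def rel_error(x, approx):
    """Relative error |x - approx| / |x|; undefined for x == 0."""
    if x == 0:
        raise ValueError("relative error is undefined for x = 0")
    return abs(x - approx) / abs(x)


def significant_digits(x, approx):
    """Integer s with 0.5 * 10**-s <= relative error <= 5 * 10**-s."""
    rel = rel_error(x, approx)
    if rel == 0:
        raise ValueError("approximation is exact")
    return math.floor(math.log10(5.0 / rel))


x, approx = math.pi, 22 / 7
s = significant_digits(x, approx)
print(abs_error(x, approx), rel_error(x, approx), s)
```

Here the relative error is roughly $4 \times 10^{-4}$, so the bound is satisfied with $s = 4$.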

### $x \in \mathbb{R}$ and $fl(x) \in$ FP
- The relative error of $fl(x)$ for $x$ is bounded for all $x$ in the exponent range.
- The max relative error is called *machine epsilon* (or *machine precision*).

$\frac{fl(x) - x}{x} = \delta$, where $|\delta| \leq \epsilon_{machine}$

or $fl(x) = x(1+\delta)$, where $|\delta| \leq \epsilon_{machine}$

Def: $\epsilon_{machine}$ is the smallest number $\epsilon > 0$ such that $fl(1+\epsilon) > 1$.
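
The definition suggests a simple experiment, sketched below in Python under the assumption that `float` is an IEEE 754 double with round-to-nearest. The function name `machine_epsilon` is illustrative. The loop only tries powers of two, so it finds the smallest power of two $\epsilon$ with $fl(1 + \epsilon) > 1$, not the smallest real number with that property. The second part computes the relative error $\delta$ of storing $x = 1/10$, using exact rational arithmetic.

```python
import sys
from fractions import Fraction


def machine_epsilon():
    """Halve eps until adding the next half to 1.0 no longer changes it."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


eps = machine_epsilon()

# Relative error delta of storing x = 1/10 as a double, computed exactly.
x = Fraction(1, 10)
fl_x = Fraction(0.1)          # exact rational value of the stored double
delta = (fl_x - x) / x

print(eps, eps == sys.float_info.epsilon, float(delta))
```

On IEEE doubles the loop returns $2^{-52}$, which matches `sys.float_info.epsilon` and is the gap between 1 and the next larger double. With round-to-nearest the relative error bound is half of that, $2^{-53}$, and the computed $\delta$ for $1/10$ stays within it.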