# Floating Point Arithmetic

* Computers have finite computing and memory capacity.
* We therefore need a finite way to represent real numbers: floating point numbers.
* IEEE 754 defines the common standard for representing floating point numbers on modern computers.
* The two most commonly used types are `IEEE double precision` and `IEEE single precision`.


## Floating point types in Python

The NumPy module provides convenient ways to query properties of floating point numbers.

In [1]:
import numpy as np # import the numpy extension module
                   # and call it np as short form

The data type names in `NumPy` for floating point types are:

* `IEEE double precision`: `np.double`, `np.float64` (the former alias `np.float` was deprecated in NumPy 1.20 and removed in NumPy 1.24)
* `IEEE single precision`: `np.single`, `np.float32`


Let us query some properties of these numbers.

In [2]:
double_precision_info = np.finfo(np.float64)

The biggest and smallest (by absolute value) normalized floating point numbers are

In [3]:
double_precision_info.max

1.7976931348623157e+308

In [4]:
double_precision_info.tiny

2.2250738585072014e-308
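As a small illustration of what happens just outside this range, the following sketch doubles `max` and halves `tiny`. It assumes the default IEEE behaviour, where overflow gives `inf` and values below `tiny` are still representable with reduced precision as subnormal numbers. The names `above_max` and `below_tiny` are chosen here for illustration only.

```python
import numpy as np

info = np.finfo(np.float64)

# Silence the floating point warnings NumPy may emit for these operations.
with np.errstate(over="ignore", under="ignore"):
    above_max = info.max * np.float64(2)    # exceeds the largest finite double
    below_tiny = info.tiny / np.float64(2)  # below the smallest normalized double

print(above_max)                   # inf
print(0 < below_tiny < info.tiny)  # True: a nonzero subnormal number
```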

Floating point numbers cannot be arbitrarily close to each other. There is a smallest relative distance between neighbouring numbers, which we define shortly. It is given as

In [5]:
double_precision_info.eps

2.220446049250313e-16

This leads to a limited precision of floating point numbers. The approximate number of decimal digits to which a number is precise is

In [6]:
double_precision_info.precision

15

### Task: What are the values for single precision arithmetic?

In [7]:
single_precision_info = np.finfo(np.float32)
print([single_precision_info.max,
       single_precision_info.tiny,
       single_precision_info.eps,
       single_precision_info.precision])

[3.4028235e+38, 1.1754944e-38, 1.1920929e-07, 6]


## Definition of floating point numbers

The set of floating point numbers is defined as follows:

$$
\mathcal{F} = \left\{(-1)^s\cdot b^e \cdot \frac{m}{b^{p-1}} :\right.
\left. s = 0,1; e_{min}\leq e \leq e_{max}; b^{p-1}\leq m\leq b^{p}-1\right\}.
$$

* `IEEE double precision`: $e_{min} = -1022, e_{max} = 1023, p = 53$
* `IEEE single precision`: $e_{min} = -126, e_{max} = 127, p = 24$

Typically, $b = 2$. The normalized mantissa $\frac{m}{b^{p-1}}$ takes the values:

$$
\frac{m}{b^{p-1}} = 1, 1+b^{1-p}, 1+2b^{1-p}, \dots, b-b^{1-p}
$$

$\rightarrow$ The distance between neighbouring floats with exponent $e$ is $b^e \cdot b^{1-p}$.
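To make the definition concrete, here is a minimal sketch that splits a double into the triple $(s, e, m)$ from the definition above, assuming $b = 2$ and $p = 53$. The helper `decompose` is a hypothetical name introduced for this example, and it only handles normalized, nonzero, finite inputs.

```python
import math
from fractions import Fraction

def decompose(x, p=53):
    """Return (s, e, m) with x == (-1)**s * 2**e * m / 2**(p - 1).

    Illustrative helper: assumes x is a normalized, nonzero, finite double.
    """
    s = 0 if math.copysign(1.0, x) > 0 else 1
    frac, exp = math.frexp(abs(x))  # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    e = exp - 1                     # shift so that the mantissa lies in [1, 2)
    m = int(frac * 2**p)            # integer mantissa, 2**(p-1) <= m <= 2**p - 1
    return s, e, m

s, e, m = decompose(-0.1)
print(s, e, m)

# Rebuild the number in exact rational arithmetic and compare with the double.
reconstructed = (-1)**s * Fraction(2)**e * Fraction(m, 2**52)
print(reconstructed == Fraction(-0.1))  # True
```

Using `Fraction` for the reconstruction keeps the check exact, so the comparison is not itself affected by rounding.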

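$\epsilon_{rel} = b^{1-p}$ is the distance from $1$ to the next larger floating point number, that is, the smallest positive number for which $1 + \epsilon_{rel}$ is again an element of $\mathcal{F}$ with
$$
1 + \epsilon_{rel} \neq 1.
$$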

In [8]:
1 + double_precision_info.eps

1.0000000000000002

In [15]:
1 + .25 * double_precision_info.eps

1.0
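The spacing formula can also be checked away from $1$. The sketch below assumes $b = 2$ and double precision, and compares the predicted gap $b^e \cdot b^{1-p}$ with the actual distance to the next larger double obtained from `np.nextafter`. The sample values and the `results` list are arbitrary choices for illustration.

```python
import math
import numpy as np

eps = np.finfo(np.float64).eps  # b**(1 - p) = 2**-52

results = []
for x in [1.0, 3.0, 1024.0, 0.1]:
    e = math.frexp(x)[1] - 1           # exponent e with 2**e <= x < 2**(e + 1)
    predicted = 2.0**e * eps           # b**e * b**(1 - p)
    gap = np.nextafter(x, np.inf) - x  # actual distance to the next larger double
    results.append(gap == predicted)
    print(x, gap, gap == predicted)
```

Each printed line should end with `True`: the absolute gap grows with the exponent, while the gap relative to $2^e$ stays fixed at $\epsilon_{rel}$.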

## Approximating numbers in floating point arithmetic
Let $x\in\mathbb{R}$ with $b^{e_{min}}\leq |x| < b^{e_{max}+1}$. Define $\epsilon_{mach}:=\frac{1}{2}b^{1-p}$.

There exists $x'\in \mathcal{F}$ such that $|x-x'|\leq\epsilon_{mach}|x|$.

**$\epsilon_{mach}$ bounds the relative distance from $x$ to the nearest floating point number in $\mathcal{F}$.**

Define the projection
$$fl: x \mapsto x',$$
where $x'$ is the floating point number in $\mathcal{F}$ closest to $x$.

It follows that $fl(x)=x(1+\epsilon)$ for some $\epsilon$ with $|\epsilon|\leq \epsilon_{mach}$.
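The rounding bound can be verified exactly for a few rational numbers. This is a sketch under the assumption of double precision ($b = 2$, $p = 53$, so $\epsilon_{mach} = 2^{-53}$). It relies on Python's `float()` rounding a `Fraction` to the nearest double. The helper name `relative_rounding_error` and the sample values are made up for this example.

```python
from fractions import Fraction

eps_mach = Fraction(1, 2**53)  # (1/2) * b**(1 - p) with b = 2, p = 53

def relative_rounding_error(x):
    """Exact relative error (fl(x) - x) / x for a nonzero rational x."""
    fl_x = Fraction(float(x))  # float(x) rounds x to the nearest double
    return (fl_x - x) / x

samples = [Fraction(1, 10), Fraction(1, 3), Fraction(22, 7)]
errors = [relative_rounding_error(x) for x in samples]
for x, err in zip(samples, errors):
    print(x, float(err), abs(err) <= eps_mach)
```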

**Fundamental Axiom of Floating Point Arithmetic**
Define $x\odot y = fl(x \cdot y)$, where $\cdot$ is one of $+,-,\times,\div$. Then for all $x,y\in\mathcal{F}$ there exists $\epsilon$ with $|\epsilon| \leq \epsilon_{mach}$ such that
$$ x\odot y = (x \cdot y)(1+\epsilon).$$

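IEEE 754 requires the basic operations to be correctly rounded, so most modern architectures guarantee this property as long as the result neither overflows nor underflows.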
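The axiom can be tested for a concrete pair of doubles by comparing each floating point result with the exact rational result. The sketch assumes double precision with $\epsilon_{mach} = 2^{-53}$. The operands `0.1` and `0.3` and the dictionary `rel_errors` are arbitrary choices for illustration.

```python
import operator
from fractions import Fraction

eps_mach = Fraction(1, 2**53)

x, y = 0.1, 0.3  # both are elements of F once stored as doubles
ops = [("+", operator.add), ("-", operator.sub),
       ("*", operator.mul), ("/", operator.truediv)]

rel_errors = {}
for name, op in ops:
    computed = Fraction(op(x, y))         # floating point result, converted exactly
    exact = op(Fraction(x), Fraction(y))  # same operation in exact rational arithmetic
    rel_errors[name] = (computed - exact) / exact
    print(name, float(rel_errors[name]), abs(rel_errors[name]) <= eps_mach)
```

Note that the exact result is computed from the stored doubles `x` and `y`, not from the decimal values $0.1$ and $0.3$, since the axiom only concerns operands that already lie in $\mathcal{F}$.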

## Special symbols in floating point arithmetic

In addition to numbers, several other important symbols are defined in the floating point standard. The most important are:

* NaN: Not a number
* $\pm$inf: positive and negative infinity

In [18]:
import numpy as np
a = np.inf                # positive infinity
b = np.float64(0) / 0     # 0/0 is undefined: gives nan (NumPy emits a RuntimeWarning)
print(b)

nan
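A few properties of these symbols are worth seeing directly. The sketch below uses only standard Python and NumPy calls. The variable names `inf` and `nan` are local names chosen for this example.

```python
import math
import numpy as np

inf = np.inf
nan = np.nan

print(inf + 1 == inf)         # True: adding a finite number leaves inf unchanged
print(-inf < 0 < inf)         # True: infinities are ordered like limits
print(math.isnan(inf - inf))  # True: inf - inf is undefined and gives nan
print(nan == nan)             # False: nan compares unequal to everything, itself included
print(np.isnan(nan))          # True: use isnan to test for nan
```

Because `nan == nan` is `False`, a test such as `b == np.nan` never succeeds, so `np.isnan` or `math.isnan` is the usual way to detect it.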


