# Exercise 01.a -- Roundoff errors

In [1]:
import numpy as np

## Floating point precision:

- double precision, 64 bit:
  - 1 bit for the sign
  - 11 bits for the exponent
  - 52 bits for the mantissa
  - $\epsilon = 2^{-52} \approx 2.2 \cdot 10^{-16}$

- single precision, 32 bit:
  - 1 bit for the sign
  - 8 bits for the exponent
  - 23 bits for the mantissa
  - $\epsilon = 2^{-23} \approx 1.2 \cdot 10^{-7}$

- half precision, 16 bit:
  - 1 bit for the sign
  - 5 bits for the exponent
  - 10 bits for the mantissa
  - $\epsilon = 2^{-10} \approx 9.8 \cdot 10^{-4}$
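
The bit counts and $\epsilon$ values listed above can be cross-checked against what NumPy itself reports. The following is a minimal sketch, assuming `np.finfo` and its `bits`, `nmant`, and `eps` attributes; the dictionary name `layout` is only illustrative, and the exponent width is derived as total bits minus the sign bit minus the mantissa bits.

```python
import numpy as np

# Collect (total bits, exponent bits, mantissa bits, eps) for each precision.
layout = {}
for dtype in (np.float64, np.float32, np.float16):
    info = np.finfo(dtype)
    exponent_bits = info.bits - 1 - info.nmant  # 1 bit is reserved for the sign
    layout[dtype.__name__] = (info.bits, exponent_bits, info.nmant, float(info.eps))

for name, (bits, n_exp, n_mant, eps) in layout.items():
    print(f"{name}: {bits} bits, exponent {n_exp}, mantissa {n_mant}, eps {eps:e}")
```

In each case `eps` equals $2^{-m}$, where $m$ is the number of mantissa bits.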

## Playing with floating point precision

In [2]:
epsilon_64 = np.float64(2 ** (-52))
epsilon_32 = np.float32(2 ** (-23))
epsilon_16 = np.float16(2 ** (-10))

In [3]:
print(f"{epsilon_16:e}")
print(np.float16(1.0 + epsilon_16))
print(np.float16(1.0 + epsilon_16 / 2))

9.765625e-04
1.001
1.0


In [4]:
print(f"{epsilon_32:e}")
print(np.float32(1.0 + epsilon_32))
print(np.float32(1.0 + epsilon_32 / 2))

1.192093e-07
1.0000001
1.0


In [5]:
print(f"{epsilon_64:e}")
print(np.float64(1.0 + epsilon_64))
print(np.float64(1.0 + epsilon_64 / 2))

2.220446e-16
1.0000000000000002
1.0


## Roundoff errors

For a given precision, we have seen that 

$$
1 + z = 1 \quad {\rm for} \quad 0 \leq z \leq \epsilon / 2 \\
1 + z > 1 \quad {\rm for} \quad z > \epsilon / 2
$$

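Hence, we can expect every calculation to be off by a relative amount of order $\epsilon$.  Generalizing the above from $1$ to an arbitrary number $A$, values that differ from $A$ by only a small fraction of $\epsilon\,A$ cannot be distinguished from it.  The roundoff error of $A$ is therefore of order $\epsilon\,A$.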
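
The scaling of the roundoff error with $A$ can be checked numerically. This is a small sketch, assuming single precision and a few arbitrarily chosen sample values; the names `samples` and `results` are illustrative only. It adds $\epsilon\,A/4$ and $\epsilon\,A$ to $A$ and records whether the result still equals $A$.

```python
import numpy as np

eps = np.finfo(np.float32).eps  # 2**-23 as a float32

samples = (1.0, 3.0, 1000.0, 12345.678)  # arbitrary test values
results = {}
for value in samples:
    A = np.float32(value)
    tiny_step = A + A * eps / np.float32(4)  # perturbation of eps*A/4
    full_step = A + A * eps                  # perturbation of eps*A
    results[value] = (bool(tiny_step == A), bool(full_step > A))

for value, (lost, kept) in results.items():
    print(f"A = {value}: A + eps*A/4 == A is {lost}, A + eps*A > A is {kept}")
```

A perturbation of $\epsilon\,A/4$ is lost for every sample, while a perturbation of $\epsilon\,A$ always changes the stored value.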

### Propagation of roundoff errors

We can use the known rules of error propagation:

- For addition and subtraction, the absolute errors are added
  $$|\delta(A \pm B)| \sim \epsilon\,A + \epsilon\,B = O(\epsilon\,\max(A, B))$$

- For multiplication and division, the relative errors are added
  $$
  \frac{|\delta(A\,B)|}{|A \cdot B|}
    \sim \frac{\epsilon A}{A} + \frac{\epsilon\,B}{B}
    = \frac{\epsilon\,(A\,B + A\,B)}{A\,B}
    = O(\epsilon) \\
  |\delta(A\,B)| = O(\epsilon\,|A \cdot B|)
  $$

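Hence, we consider multiplication and division as safe and addition / subtraction as dangerous: the absolute error $O(\epsilon\,\max(A, B))$ of a sum or difference can be large compared to the result itself when $A$ and $B$ nearly cancel.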

### See roundoff errors at work

In [6]:
a = np.float32(10_000.1)
b = np.float32(10_000.0)

diff_ab = a - b
sum_ab = a + b

print(format(diff_ab - 0.1, 'e'))
print(format(sum_ab - 20_000.1, 'e'))

-3.906250e-04
-3.906250e-04
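
The two absolute errors above are identical, but their relative sizes are very different. The sketch below recomputes them as relative errors; it converts the single-precision results to Python floats before comparing with the reference values `0.1` and `20_000.1`, so that the comparison itself is done in double precision. The variable names are illustrative.

```python
import numpy as np

a = np.float32(10_000.1)
b = np.float32(10_000.0)

# Compare the float32 results with the reference values in double precision.
abs_err_diff = abs(float(a - b) - 0.1)
abs_err_sum = abs(float(a + b) - 20_000.1)

rel_err_diff = abs_err_diff / 0.1
rel_err_sum = abs_err_sum / 20_000.1

print(f"relative error of a - b: {rel_err_diff:e}")
print(f"relative error of a + b: {rel_err_sum:e}")
```

The difference is wrong at the level of a few parts in a thousand, far above the single-precision $\epsilon$, while the sum stays accurate to within $\epsilon$.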


In [7]:
a_64 = np.float64(1)
a_32 = np.float32(1)
a_16 = np.float16(1)

# Add 1/5000 five thousand times; the exact result would be 2.
for _ in range(5_000):
    a_64 += np.float64(1 / 5_000)
    a_32 += np.float32(1 / 5_000)
    a_16 += np.float16(1 / 5_000)

print(f"{a_64 - 2.0 : e}")
print(f"{a_32 - 2.0 : e}")
print(f"{a_16 - 2.0 : e}")

-1.101341e-13
 1.659393e-04
-1.000000e+00
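
The half-precision result is off by exactly one because the accumulator never moves. A likely explanation, sketched below with illustrative variable names, is that the increment $1/5000 = 2 \cdot 10^{-4}$ lies below half of the half-precision $\epsilon$, so each addition to $1$ is rounded away.

```python
import numpy as np

step = np.float16(1 / 5_000)                         # increment used in the loop
half_eps = np.finfo(np.float16).eps / np.float16(2)  # 2**-11
one = np.float16(1)

# If the step is below eps/2, adding it to 1 leaves 1 unchanged.
step_below_half_eps = bool(step < half_eps)
stalls = bool((one + step) == one)

print(f"step = {float(step):e}, eps/2 = {float(half_eps):e}")
print(f"step < eps/2: {step_below_half_eps}, 1 + step == 1: {stalls}")
```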
