## Floating Point Representation: A Discussion
---

References:
- David Goldberg, *What Every Computer Scientist Should Know About Floating-Point Arithmetic*: [web reprint of the article](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)
- *What Every Programmer Should Know About Floating-Point Arithmetic*: [a more accessible web article](https://floating-point-gui.de/)
- A [good summary](https://www.volkerschatz.com/science/float.html) of the first reference.


In [2]:
print(0.1+0.2==0.3)

False
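
The comparison fails because none of 0.1, 0.2 and 0.3 has a finite binary expansion, so each literal is rounded before the addition even happens. The sketch below uses the standard library's `decimal.Decimal` to show the values that are actually stored, and `math.isclose` as one possible tolerance-based comparison. The tolerance `rel_tol=1e-9` is an illustrative choice, not a recommendation from the references.

```python
from decimal import Decimal
import math

# Decimal(float) displays the exact binary value that is actually stored.
print(Decimal(0.1))
print(Decimal(0.1 + 0.2))
print(Decimal(0.3))

# Compare floats with a tolerance instead of ==.
close = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)
print(close)
```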


**Example** What should be the value of $x_{20}$ when $x_1 = 1/10$ and $x_{n+1} = f(x_n)$, with $f$ as defined below?

[Following Example Source](https://nbviewer.jupyter.org/github/fastai/numerical-linear-algebra/blob/master/nbs/1.%20Why%20are%20we%20here.ipynb#Accuracy)

In [4]:
def f(x):
    if x <= 0.5:
        return 2 * x
    else:
        return 2*x - 1

In [9]:
x = 0.1
for i in range(20):
    print(x)
    x = f(x)
    

0.1
0.2
0.4
0.8
0.6000000000000001
0.20000000000000018
0.40000000000000036
0.8000000000000007
0.6000000000000014
0.20000000000000284
0.4000000000000057
0.8000000000000114
0.6000000000000227
0.20000000000004547
0.40000000000009095
0.8000000000001819
0.6000000000003638
0.2000000000007276
0.4000000000014552
0.8000000000029104


In [10]:
from IPython.display import YouTubeVideo
#YouTubeVideo("gp_D8r-2hwk")

## IEEE 754 Normalized Binary Form
---
Float64 (double precision) represents a normalized number in the following way:
$$\Large
(-1)^s \left( 1+ f\right) \times 2^{e-1023}
$$
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/IEEE_754_Double_Floating_Point_Format.svg/2560px-IEEE_754_Double_Floating_Point_Format.svg.png" width="800" />

[Image](https://en.wikipedia.org/wiki/Double-precision_floating-point_format#/media/File:IEEE_754_Double_Floating_Point_Format.svg)

>- One sign bit, denoted by s.
>- Biased exponent, e, takes 11 bits.
>- The fractional part (mantissa), f, in the normalized form takes 52 bits.

**Example**
\begin{aligned}
(152.356425)_{10} &= (1001~1000.0101~1011~0011~1110~1010~1011~0011~0110~0111~1010~0001~0)_2\\
&= (1.0011~0000~1011~0110~0111~1101~0101~0110~0110~1100~1111~0100~0010)_2 \times 2^7
\end{aligned}

On comparison with the standard form, we have $e-1023 = 7$, implying that the biased exponent is $$e = 1030 = (100~0000~0110)_2.$$ 
The sign bit is 0 for a positive number, and the fractional part is
$$f=.0011~0000~1011~0110~0111~1101~0101~0110~0110~1100~1111~0100~0010$$
Hence $152.356425$ in 64 bits is given by
 $$ {\color{blue}{0}} ~{\color{green}{100~0000~0110}}~ {\color{pink}{0011~0000~1011~0110~0111~1101~0101~0110~0110~1100~1111~0100~0010}}$$ 
(That is 4063 0B67 D566 CF42 in hexadecimal representation. How? Make groups of four bits to get 16 hexadecimal digits. For example, 0100 is 4, 1111 is F, and so on.)
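
The decomposition can be checked in Python. The sketch below uses the standard library's `struct` module to reinterpret the 8 bytes of a float as an unsigned integer and then masks out the three fields. The helper name `float64_fields` is hypothetical, introduced only for this illustration.

```python
import struct

def float64_fields(x):
    """Split a Python float into (sign, biased exponent, fraction bits)."""
    # Pack as a big-endian double, then reinterpret the same bytes as a 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)     # 52 fraction bits
    return sign, exponent, fraction

s, e, frac = float64_fields(152.356425)
print(s, e, format(frac, "052b"))
print(struct.pack(">d", 152.356425).hex())   # the 16 hexadecimal digits

# Rebuild the value from the normalized form (-1)^s (1 + f) 2^(e - 1023).
rebuilt = (-1) ** s * (1 + frac / 2**52) * 2.0 ** (e - 1023)
print(rebuilt == 152.356425)
```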


### Special Encodings in float64

- For normalized numbers, the biased exponent satisfies $0<e<2047$, that is, $-1023 < e-1023 < 1024$.

- The value $e=0$ is reserved for some special cases.
  >- When $e=0$ (all zero bits) with $f=0$, it represents **plus or minus zero**.
  >- When $e=0$ with $f \ne 0$, it represents **subnormals**, which are numbers very close to zero, smaller in magnitude than the smallest normalized number. They are used to handle underflow gradually in floating point arithmetic.

- The value $e=2047$ is reserved for some special cases.
  >- When $e=2047$ (all 1 bits) with $f=0$, it represents **plus or minus infinity**.
  >- When $e=2047$ (all 1 bits) with $f \ne 0$, it represents **NaN** (not a number).

<div class="alert alert-block alert-danger">
<strong>Overflow</strong>: When the result of a mathematical operation is larger in magnitude than the largest number representable in float64 ($\approx 10^{308}$), we say that an overflow has occurred.
<br>
<strong>Underflow</strong>: When the result of an arithmetic operation is nonzero but closer to zero than the smallest normalized float64 number ($\approx 10^{-308}$ in magnitude), it is called an underflow.
</div>
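
Both situations are easy to provoke. The following sketch relies on `sys.float_info` from the standard library for the largest finite and the smallest normalized float64 values; the variable names are illustrative.

```python
import math
import sys

largest = sys.float_info.max           # largest finite float64, about 1.8e308
smallest_normal = sys.float_info.min   # smallest normalized float64, about 2.2e-308

overflowed = largest * 2               # overflow: the product rounds to inf
gradual = smallest_normal / 2          # underflow into the subnormal range, still nonzero
flushed = 2.0 ** -1074 / 2             # below the smallest subnormal: rounds to 0.0

print(overflowed, gradual, flushed)
```

Note that not every Python operation overflows silently: in CPython, `2.0 ** 2000` raises `OverflowError` rather than returning `inf`.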

Similarly, float32 (single precision) represents a normalized number in the following way:
$$\Large
(-1)^s \left( 1+ f\right) \times 2^{e-127}
$$
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Float_example.svg/2560px-Float_example.svg.png" width="800" />

[Image](https://en.wikipedia.org/wiki/Double-precision_floating-point_format#/media/File:IEEE_754_Double_Floating_Point_Format.svg)

>- One sign bit, denoted by s.
>- Biased exponent, e, takes 8 bits.
>- The fractional part (mantissa), f, in the normalized form takes 23 bits.
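
The same bit extraction works for float32 with the `struct` format codes `">f"` and `">I"`. This is a sketch under that assumption; `float32_fields` is a hypothetical helper. The value $0.15625 = (1.01)_2 \times 2^{-3}$ is chosen because it is exactly representable, and the round trip at the end shows that float32 cannot hold all the digits of the earlier float64 example.

```python
import struct

def float32_fields(x):
    """Split x, rounded to float32, into (sign, biased exponent, fraction bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

s, e, frac = float32_fields(0.15625)
print(s, e, format(frac, "023b"))   # e - 127 = -3, fraction bits 0100...0

# Rounding a float64 value to float32 and back loses precision.
(roundtrip,) = struct.unpack(">f", struct.pack(">f", 152.356425))
print(roundtrip)
```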


### Error Analysis with Floating Point Representation


In general, a real number $x$ cannot be represented exactly in a floating point system such as IEEE 754. Let $fl[x]$ be the floating point representation of $x$. Then, for $x \neq 0$ within the normalized range,
$$
 fl[x] = x(1+\delta),\ \text{ where }  \quad |\delta | =\left|\dfrac{fl[x]-x}{x}\right| \le \epsilon,
 $$
 where $\delta$ is the relative error and $\epsilon$ is the unit round-off (or machine precision, $\approx 10^{-16}$ in float64).
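
The bound can be verified exactly with rational arithmetic. The sketch below assumes round-to-nearest, for which the unit round-off of float64 is $2^{-53} \approx 1.1 \times 10^{-16}$; note that `sys.float_info.epsilon` reports the spacing $2^{-52}$ between 1 and the next float, which is twice that value. The sample decimals are arbitrary choices.

```python
from fractions import Fraction
import sys

u = Fraction(1, 2**53)   # unit round-off for float64 under round-to-nearest

deltas = []
for text in ["0.1", "0.2", "0.3", "152.356425"]:
    x = Fraction(text)              # the exact real number
    fl_x = Fraction(float(text))    # the exact value of the stored float
    delta = abs(fl_x - x) / x       # relative representation error
    deltas.append(delta)
    print(text, float(delta), delta <= u)

print(sys.float_info.epsilon == 2.0 ** -52)
```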
 
 <div class="alert alert-block alert-info">
  If $x$ and $y$ are machine numbers and $\odot$ is some floating point operation ($+,-,\times, \div$) between them, then
  $$
  fl[x\odot  y] = (x \odot y) (1 + \delta), \text{ where } \quad |\delta| \leq \epsilon
  $$
  
  which could alternatively be written as
  $$
  fl[x\odot  y] = (x + \delta_1)\odot (y + \delta_2).
  $$
  This is an example of a backward error analysis where $fl[x\odot  y]$ could be seen as the exact result of slightly perturbed data.
  </div>
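
The forward bound $|\delta| \le \epsilon$ for a single operation can be spot-checked numerically. This is only a sketch: it assumes IEEE 754 round-to-nearest arithmetic, takes $\epsilon = 2^{-53}$, and samples operands from $[1, 10]$ so that no overflow or underflow occurs. The exact result is computed with `fractions.Fraction`.

```python
from fractions import Fraction
import operator
import random

u = Fraction(1, 2**53)   # assumed unit round-off for float64
random.seed(0)           # fixed seed so the run is repeatable

worst = Fraction(0)
for _ in range(1000):
    x, y = random.uniform(1, 10), random.uniform(1, 10)
    for op in (operator.add, operator.sub, operator.mul, operator.truediv):
        exact = op(Fraction(x), Fraction(y))   # exact result for the machine numbers x, y
        if exact == 0:
            continue                           # relative error is undefined here
        computed = Fraction(op(x, y))          # fl[x op y], converted exactly
        worst = max(worst, abs(computed - exact) / abs(exact))

print(float(worst), worst <= u)
```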

### Loss of Precision

**Loss of Precision**: Subtracting two nearly equal numbers on a finite precision machine cancels their leading significant digits. This loss of significance can lead to disastrous errors in calculations.
 
**Remedy for loss of precision**: Use higher precision arithmetic or reformulate the calculation to avoid such subtractions, for example by rationalizing radicals, applying trigonometric identities, or using a Taylor polynomial approximation.

**EXAMPLE** Write the following expressions in ways that avoid the loss of significance due to subtraction of close quantities.

(A) $f(x) = \sqrt{x+1/x} - \sqrt{x - 1/x}.$

(B) $f(x) = \tan{x} - \sin{x}.$
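
One possible pair of reformulations (not necessarily the intended solutions) is to rationalize in (A), which matters for large $x$, and to factor in (B), which matters for small $x$:
$$
\sqrt{x+1/x} - \sqrt{x - 1/x} = \frac{2/x}{\sqrt{x+1/x} + \sqrt{x - 1/x}}, \qquad
\tan x - \sin x = \tan x\,(1-\cos x) = 2\tan x \,\sin^2(x/2).
$$
The sketch below compares the naive and reformulated versions; the function names and the test points $x = 10^9$ and $x = 10^{-8}$ are illustrative choices.

```python
import math

def a_naive(x):
    return math.sqrt(x + 1 / x) - math.sqrt(x - 1 / x)

def a_stable(x):
    # Rationalized form: no subtraction of nearly equal square roots.
    return (2 / x) / (math.sqrt(x + 1 / x) + math.sqrt(x - 1 / x))

def b_naive(x):
    return math.tan(x) - math.sin(x)

def b_stable(x):
    # tan x - sin x = tan x (1 - cos x) = 2 tan x sin^2(x/2)
    return 2 * math.tan(x) * math.sin(x / 2) ** 2

print(a_naive(1e9), a_stable(1e9))     # true value is close to x**-1.5
print(b_naive(1e-8), b_stable(1e-8))   # true value is close to x**3 / 2
```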



<div class="alert alert-danger "><strong> Loss of Precision Theorem</strong>
    Let $x$ and $y$ be normalized floating-point machine numbers, where $x>y>0$. 
    If we have $$2^{-p} \le 1-\dfrac{y}{x}\le 2^{-q}$$ for some positive integers $p$ and $q$, then at most $p$ and at least $q$ significant binary bits are lost in the subtraction $x-y$.
</div>
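
To apply the theorem, compute $1 - y/x$ and bracket it between consecutive powers of two. In the sketch below, the values of `x` and `y` are illustrative assumptions, not taken from the text above.

```python
import math

# Illustrative operands that agree in their first three decimal digits.
x = 0.3721478693
y = 0.3720230572

ratio = 1 - y / x
bits = -math.log2(ratio)   # ratio = 2**(-bits)
q = math.floor(bits)       # at least q bits are lost
p = math.ceil(bits)        # at most p bits are lost

print(f"2^-{p} <= {ratio:.6e} <= 2^-{q}")
```

For these values the subtraction $x - y$ loses at least 11 and at most 12 significant bits.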

In [8]:
# The following difference equation has the solution x_n  =1/3^n. 
import numpy as np
x=np.zeros((40,1),dtype=float)
x[0]=1.0
x[1]=1.0/3
for n in range(1,39):
    x[n+1]= (13.0/3.0) * x[n] - (4.0/3.0) * x[n-1]

print(x.T)

[[ 1.00000000e+00  3.33333333e-01  1.11111111e-01  3.70370370e-02
   1.23456790e-02  4.11522634e-03  1.37174211e-03  4.57247371e-04
   1.52415789e-04  5.08052602e-05  1.69350748e-05  5.64497734e-06
   1.88146872e-06  6.26394672e-07  2.05751947e-07  5.63988754e-08
  -2.99408028e-08 -2.04941979e-07 -8.48160840e-07 -3.40210767e-06
  -1.36115854e-05 -5.44473934e-05 -2.17789924e-04 -8.71159813e-04
  -3.48463929e-03 -1.39385572e-02 -5.57542287e-02 -2.23016915e-01
  -8.92067659e-01 -3.56827064e+00 -1.42730825e+01 -5.70923302e+01
  -2.28369321e+02 -9.13477283e+02 -3.65390913e+03 -1.46156365e+04
  -5.84625461e+04 -2.33850184e+05 -9.35400738e+05 -3.74160295e+06]]


What are your observations from the above iteration?

---

We say that a numerical process is <span style="color:red">unstable</span> if small errors at one stage of the process are magnified in subsequent stages and seriously degrade the accuracy of the overall calculation.
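
The difference equation above is an instance of this. Its characteristic equation $r^2 - \tfrac{13}{3} r + \tfrac{4}{3} = 0$ has roots $1/3$ and $4$, so the general solution is $A\,(1/3)^n + B\,4^n$. The initial data select $B = 0$, but rounding $1/3$ and $13/3$ introduces a tiny component along $4^n$ that is multiplied by 4 at every step. A minimal sketch comparing exact rational arithmetic with float64 (the helper `recurrence` is a hypothetical name):

```python
from fractions import Fraction

def recurrence(x0, x1, a, b, n_terms):
    """Return the first n_terms of x[n+1] = a*x[n] - b*x[n-1]."""
    xs = [x0, x1]
    while len(xs) < n_terms:
        xs.append(a * xs[-1] - b * xs[-2])
    return xs

exact = recurrence(Fraction(1), Fraction(1, 3), Fraction(13, 3), Fraction(4, 3), 40)
floats = recurrence(1.0, 1.0 / 3, 13.0 / 3.0, 4.0 / 3.0, 40)

for n in (5, 15, 25, 39):
    print(n, float(exact[n]), floats[n])
```

With exact arithmetic every term equals $1/3^n$, while the float64 iterates eventually grow by a factor of about 4 per step, as in the output above.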

<br>

**Condition or conditioning** indicates the sensitivity of the solution of a problem to small relative changes in the input.
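
One rough way to see conditioning is to perturb the input by a small relative amount and measure how much the relative change grows in the output. The sketch below is a finite-difference estimate, not a formal definition; the helper `relative_amplification`, the perturbation size `1e-8`, and the two test functions are assumptions made for illustration.

```python
import math

def relative_amplification(func, x, rel_change=1e-8):
    """Ratio of the relative output change to the relative input change."""
    x_perturbed = x * (1 + rel_change)
    return abs((func(x_perturbed) - func(x)) / func(x)) / rel_change

# Square root is well conditioned: the amplification is about 1/2.
well = relative_amplification(math.sqrt, 2.0)

# t - 1 near t = 1 is ill conditioned: the amplification is about t / (t - 1).
ill = relative_amplification(lambda t: t - 1.0, 1.0001)

print(well, ill)
```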

#### Care that should be taken while writing numerical algorithms
---
>- Loss of precision: avoid subtracting nearly equal quantities by reformulating the expression mathematically.
>- Minimize the introduction of roundoff errors.
>- Be careful when converting between large integers and floating point numbers.
>- Take extra care when iteration is used, since errors can accumulate from step to step.
>- Minimize truncation errors in mathematical terms.
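
Two of these points can be illustrated briefly. The sketch assumes float64: integers above $2^{53}$ are no longer all representable, and a running sum accumulates one rounding error per addition, which `math.fsum` from the standard library avoids by tracking the lost low-order bits.

```python
import math

# Large integers: 2**53 + 1 has no float64 representation and rounds to 2**53.
big = 2 ** 53
same = float(big) == float(big + 1)
print(same)

# Roundoff accumulation: add 0.1 ten times with a plain running sum.
values = [0.1] * 10
running = 0.0
for v in values:
    running += v

accurate = math.fsum(values)   # correctly rounded sum of the stored values
print(running, accurate)
```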