# Floating point numbers

In order to complete computations in finite space and bounded time, we replace the real numbers with a finite surrogate set $\mathbb{F}$, the floating point numbers. (The term "floating point" originally distinguished this system from "fixed point", an early alternative system based on absolute rather than relative errors.) Most scientific computing now conforms to the IEEE 754 double precision standard. We won't need to think about the details of this standard going forward, but it is useful to briefly explore what is really going on in the computer.

In double precision there are 64 binary bits used to represent the members of $\mathbb{F}$. We can get them directly.

In [1]:
bitstring(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

In [1]:
bitstring(nextfloat(1.0))

"0011111111110000000000000000000000000000000000000000000000000001"

These bits define three integers $s$, $e$, and $m$ used in the representation

$$ x = (-1)^s \cdot \left( 1 + 2^{-52}m \right) \cdot 2^e.$$

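Here $s\in\{0,1\}$ requires one bit, $e\in\{-1022,\ldots,1023\}$ requires 11 bits, and $m\in\{0,1,\ldots,2^{52}-1\}$ requires 52 bits. The exponent bits store the nonnegative integer $e+1023$ rather than $e$ itself, which is why 1023 is subtracted when the bits are decoded. We can dissect a double precision number to see these parts. Let's choose a more interesting value.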

In [2]:
b = bitstring(3/8)

"0011111111011000000000000000000000000000000000000000000000000000"

In [3]:
@show e_string = b[2:12]
@show e = parse(Int,e_string,base=2) - 1023;

@show m_string = b[13:64]
@show m = parse(Int,m_string,base=2);

e_string = b[2:12] = "01111111101"
e = parse(Int, e_string, base=2) - 1023 = -2
m_string = b[13:64] = "1000000000000000000000000000000000000000000000000000"
m = parse(Int, m_string, base=2) = 2251799813685248


The binary form of $m$ may be more transparent than the decimal integer. In fact,

In [4]:
m/2^52

0.5

That is, $(1+0.5)\times 2^{-2}=3/8$. This number is represented exactly in $\mathbb{F}$. There are built-in ways to get the values of $e$ and $(1+2^{-52}m)$.

In [5]:
exponent(3/8),significand(3/8)

(-2, 1.5)
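The same dissection can also be done with integer arithmetic on the raw bits rather than on the string. The following is a minimal sketch, not the notebook's own method: the helper names `decompose` and `reconstruct` are hypothetical, and the code assumes a normalized input (finite, nonzero, and at least `floatmin(Float64)` in magnitude).

```julia
# Hypothetical helper: split a normalized Float64 into the integers (s, e, m)
# of the representation x = (-1)^s * (1 + 2^-52 * m) * 2^e.
function decompose(x::Float64)
    bits = reinterpret(UInt64, x)            # the same 64 bits, viewed as an integer
    s = Int(bits >> 63)                      # top bit: sign
    e = Int((bits >> 52) & 0x7ff) - 1023     # next 11 bits, minus the bias of 1023
    m = Int(bits & 0x000fffffffffffff)       # low 52 bits
    return s, e, m
end

# Hypothetical helper: rebuild the value from the three integers.
# ldexp(y, e) computes y * 2^e without forming 2^e separately.
reconstruct(s, e, m) = (-1.0)^s * ldexp(1 + m / 2^52, e)

s, e, m = decompose(3/8)
@show (s, e, m)
@show reconstruct(s, e, m) == 3/8
```

For `3/8` this should reproduce the values of $e$ and $m$ found from the bit string above.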

The smallest element of $\mathbb{F}$ greater than 1 is $1+\epsilon_M$, for machine epsilon $\epsilon_M=2^{-52}$. 

In [6]:
@show eps()
@show bitstring(1.)
@show bitstring(1. + eps());

eps() = 2.220446049250313e-16
bitstring(1.0) = "0011111111110000000000000000000000000000000000000000000000000000"
bitstring(1.0 + eps()) = "0011111111110000000000000000000000000000000000000000000000000001"
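One consequence is that a sum landing strictly between 1 and $1+\epsilon_M$ cannot be stored and must be rounded to one of those two neighbors. The sketch below is an illustrative check rather than part of the original notebook; it assumes the default rounding mode, and the variable names `half_step` and `full_step` are made up for the example.

```julia
half_step = 1.0 + eps() / 2    # exact sum lies halfway between 1 and 1 + eps()
full_step = 1.0 + eps()        # exact sum is the next element of F after 1

@show eps() == ldexp(1.0, -52)         # machine epsilon is exactly 2^-52
@show nextfloat(1.0) - 1.0 == eps()    # and equals the gap just above 1
@show full_step > 1.0                  # a full step is representable
@show half_step == 1.0                 # a half step rounds back to 1
```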


There are $2^{52}$ elements of $\mathbb{F}$ equally spaced throughout $[1,2)$. After these, the exponent increases to 1 and the value of $m$ resets to zero. Thus there are also $2^{52}$ elements equally spaced throughout $[2,4)$, as well as $[4,8)$, $[1/2,1)$, and in general, $[2^{e},2^{e+1})$. The spacing between floating point numbers scales with the exponent. 

In [7]:
nextfloat(1/4)-1/4

5.551115123125783e-17

In [8]:
nextfloat(1/2)-1/2

1.1102230246251565e-16

In [9]:
nextfloat(256.)-256

5.684341886080802e-14
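The three gaps above follow one pattern: in $[2^e,2^{e+1})$ the spacing is $2^{e-52}$. The following sketch checks that pattern and the count of $2^{52}$ elements per interval. It is an illustration under stated assumptions: `gap`, `predicted_gap`, and `count_1_to_2` are hypothetical names, and the inputs are assumed to be positive normalized numbers.

```julia
# Observed gap from x to the next larger floating point number.
gap(x::Float64) = nextfloat(x) - x

# Gap predicted by the representation: 2^(e - 52), where e is the exponent of x.
predicted_gap(x::Float64) = ldexp(1.0, exponent(x) - 52)

for x in (1/4, 1/2, 1.0, 256.0)
    println("x = ", x, ":  gap = ", gap(x), ",  predicted = ", predicted_gap(x))
end

# Consecutive positive doubles have consecutive bit patterns, so the number of
# elements in [1, 2) is the difference of the integer encodings of 2.0 and 1.0.
count_1_to_2 = reinterpret(UInt64, 2.0) - reinterpret(UInt64, 1.0)
@show count_1_to_2 == 2^52
```

The built-in `eps(x)` returns the same spacing for these inputs, so `eps(256.)` is a shorter way to get the last value.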

Unlike the $\mathbb{F}$ defined in the book, the floating point numbers don't go on forever. The largest is

In [10]:
@show R = (2.0^1023)*(1 + (2^52-1)/2^52)
@show floatmax(1.)
bitstring(R)

R = 2.0 ^ 1023 * (1 + (2 ^ 52 - 1) / 2 ^ 52) = 1.7976931348623157e308
floatmax(1.0) = 1.7976931348623157e308


"0111111111101111111111111111111111111111111111111111111111111111"

Results that should be larger than this become the special value `Inf`; this situation is called _overflow_.

In [11]:
nextfloat(R)

Inf
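Overflow also arises from ordinary arithmetic, not only from `nextfloat`. A short sketch, added here for illustration with its own local definition of `R`:

```julia
R = floatmax(Float64)    # largest finite double precision number

@show 2R                 # doubling the largest element overflows
@show 1e200 * 1e200      # the exact product, 1e400, exceeds R
@show -2R                # overflow preserves the sign
@show isinf(2R)
@show isfinite(R)
```

Once a value has become `Inf`, `isinf` and `isfinite` are the usual ways to detect it.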

The analogous situation near zero is called _underflow_. The smallest full-precision positive number is 

In [12]:
@show r = 2.0^-1022
@show floatmin(1.)
bitstring(r)

r = 2.0 ^ -1022 = 2.2250738585072014e-308
floatmin(1.0) = 2.2250738585072014e-308


"0000000000010000000000000000000000000000000000000000000000000000"

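Note that the minimum value $r\approx 2.2\times 10^{-308}$ is far smaller than $\epsilon_M\approx 2.2\times 10^{-16}$. The two measure different things: $\epsilon_M$ is the spacing of floating point numbers relative to 1, not the smallest magnitude that can be represented.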
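The sketch below illustrates what happens around and below this threshold. It goes slightly beyond the text in one respect, which is an assumption worth flagging: IEEE 754 also allows subnormal numbers smaller than `floatmin`, which carry fewer significant bits, and only results far below those become zero.

```julia
r = floatmin(Float64)      # smallest full-precision positive number, 2^-1022

@show r < eps()            # far below machine epsilon
@show nextfloat(r) - r     # spacing just above r, tiny compared with eps()
@show issubnormal(r / 2)   # still nonzero, but no longer full precision
@show r / 2^60             # too small to represent at all: underflows to zero
```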