# IEEE floating-point numbers

* A single-precision floating-point number is defined by 32 bits.
* Each bit takes the value 0 or 1 (base-2 system).

* The first bit, $S$, stores the sign of the number through the convention $$ (-1)^S, $$ 
        * so $S=0$ if the number is positive and $S=1$ if it is negative.

* The next 8 bits give the magnitude (the exponent) of the number, in base 2. This quantity is denoted here by $e$, where $b_7,\dots,b_0$ are the eight exponent bits:
$$ e = b_7 2^7 + b_6 2^6 +\cdots+ b_0 2^0$$ 

* The remaining 23 bits store the actual "digits" of the number (the mantissa $m$).

$$ A = (-1)^S \times \frac{2^e}{2^{127}} \times (1.m)_{\mathrm{base} 2} = (-1)^S \times 2^{e-127} \times (1.m)_{\mathrm{base} 2} $$

* Base 2 representation, where $b_1,\dots,b_{23}$ now denote the 23 mantissa bits: $$(1.m)_{\mathrm{base} 2} = 1+ b_1~2^{-1} + b_2~2^{-2}+\cdots+b_{23}~2^{-23}$$
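The formula above can be checked numerically. The following is a minimal sketch, assuming Julia 1.x; the helper name `decode_float32` is hypothetical and not part of the original notebook. It extracts the three bit fields with `reinterpret` and rebuilds the value, and it is only valid for normal numbers ($0 < e < 255$).

```julia
# Hypothetical helper: split a Float32 into sign, exponent and mantissa fields
# and rebuild its value from A = (-1)^S * 2^(e-127) * (1.m).
function decode_float32(x::Float32)
    u = reinterpret(UInt32, x)        # raw 32-bit pattern
    S = Int(u >> 31)                  # sign bit
    e = Int((u >> 23) & 0xff)         # 8-bit biased exponent
    m = Int(u & 0x007fffff)           # 23-bit mantissa field as an integer
    # Only meaningful for normal numbers (0 < e < 255).
    value = (-1.0)^S * 2.0^(e - 127) * (1 + m / 2.0^23)
    return S, e, m, value
end

S, e, m, value = decode_float32(Float32(2.5))
println((S, e, m, value))   # (0, 128, 2097152, 2.5)
```

For `2.5` the exponent field is 128 and the mantissa field is $2^{21}$, so the value is $2^{1}\times(1+2^{-2}) = 2.5$.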



In [6]:
printbits(Float32(2.5))

0 10000000 01000000000000000000000

In [2]:
printred(x)  =print("\x1b[31m"*x*"\x1b[0m ")
printgreen(x)=print("\x1b[32m"*x*"\x1b[0m ")
printblue(x) =print("\x1b[34m"*x*"\x1b[0m\n")

#for ANSI codes for defining color in a terminal: check out the blog: 
# http://jafrog.com/2013/11/23/colors-in-terminal.html


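function printbits(x::Float32)
    bts = bitstring(x)         # `bitstring` replaces `bits`, which was removed in Julia 1.0
    printred(bts[1:1])         # sign bit
    printgreen(bts[2:2+8-1])   # 8 exponent bits
    printblue(bts[2+8:end])    # 23 mantissa bits
end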

function printbits(x::Float64)
#to be filled in as a homework assignment
end

printbits (generic function with 2 methods)

In [12]:
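println("Sub-normal numbers: ")
print("-0.0: ")
# Note: -0.0 is a signed zero; like the subnormal numbers, its exponent field is all zeros.
bitstring(-0.0)   # `bitstring` replaces `bits`, which was removed in Julia 1.0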

Sub-normal numbers: 
-0.0: 

"1000000000000000000000000000000000000000000000000000000000000000"

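The value `-0.0` printed above is a signed zero rather than a subnormal number, although both share an all-zero exponent field. As a small illustration, assuming Julia 1.x (where `bitstring`, `issubnormal`, and `floatmin` are available in Base), the sketch below prints an actual subnormal `Float32`: the exponent bits are all zero and the mantissa is nonzero.

```julia
x = nextfloat(0.0f0)              # smallest positive Float32, which is subnormal
println(bitstring(x))             # 31 zeros followed by a single 1
println(issubnormal(x))           # true
println(issubnormal(-0.0f0))      # false: a signed zero is not subnormal
println(issubnormal(floatmin(Float32)))   # false: smallest *normal* Float32
```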
In [13]:
println("Exceptions: ")
print(" NaN = ")
printbits(Float32(NaN))
println()

print(" Inf = ")
printbits(Float32(Inf))
println()

print("-Inf = ")
printbits(Float32(-Inf))

Exceptions: 
 NaN = 0 11111111 10000000000000000000000

 Inf = 0 11111111 00000000000000000000000

-Inf = 1 11111111 00000000000000000000000
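The bit patterns above suggest a simple rule: the exponent field is all ones for the exceptional values, and the mantissa tells `Inf` (zero mantissa) apart from `NaN` (nonzero mantissa). A minimal sketch of that check, assuming Julia 1.x; the helper names `expfield` and `mantfield` are hypothetical:

```julia
# Hypothetical helpers returning the raw exponent and mantissa fields of a Float32.
expfield(x::Float32)  = Int((reinterpret(UInt32, x) >> 23) & 0xff)
mantfield(x::Float32) = Int(reinterpret(UInt32, x) & 0x007fffff)

for x in (NaN32, Inf32, -Inf32)
    println(x, ": exponent field = ", expfield(x),
            ", mantissa is zero = ", mantfield(x) == 0)
end
```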


# Homework assignments:

1. Write a Julia function that prints the bits of a double-precision floating-point number.

2. Write an algorithm, in pseudo-code, that converts a given real number into the closest IEEE floating-point number. You may assume that all the mathematical ... Explain your algorithm.

3. Let $x$ be a floating-point number in $[1,2]$. Can $x*(1/x)$ be different from $1$ on a computer? Find an upper bound on $|1-fl(x*fl(1/x))|$. Explain your answer.

## Cancellation errors

In [9]:
a=Float32(1/3)
b=Float32(1/3)+Float32(2.0^(-25)*(π/4))

printbits(a)
println()
printbits(b)
println()

c=(a-b)*2^20
printbits(c)

println(c)

0 01111101 01010101010101010101011

0 01111101 01010101010101010101100

1 01111010 00000000000000000000000
-0.03125


In [10]:
a=Float64(1/3)
b=Float64(1/3)+Float64(2.0^(-25)*(π/4))

printbits(a)
println()
printbits(b)
println()

println()
c=(a-b)*2^20

println(c)

0 01111111101 0101010101010101010101010101010101010101010101010101

0 01111111101 0101010101010101010101101110011101110101000010101001


-0.024543692590668797
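To quantify the cancellation, the single-precision result can be compared with the value the expression would have in exact arithmetic, $-(\pi/4)\,2^{-5}$. The sketch below, assuming Julia 1.x, repeats the `Float32` computation under new variable names (`a32`, `b32`, `c32`, `exact`, and `relerr` are illustrative and not part of the original notebook) and prints the relative error.

```julia
a32 = Float32(1/3)
b32 = Float32(1/3) + Float32(2.0^(-25) * (π/4))   # the increment rounds to one unit in the last place
c32 = (a32 - b32) * 2^20                          # -0.03125 in single precision

exact  = -(π/4) * 2.0^(-5)                        # value of (a - b) * 2^20 in exact arithmetic
relerr = abs(c32 - exact) / abs(exact)

println(c32)
println(relerr)    # roughly 0.27, i.e. no correct significant digits remain
```

The difference `a32 - b32` is exactly one unit in the last place of `a32`, so the subtraction leaves only the rounding error of `b32` and the relative error is $4/\pi - 1 \approx 27\%$.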


In [25]:
a=Float64(Float32(1/3))
bb=a+Float64(2.0^(-28)*(π/4))

b-bb

-9.934107481068821e-9

## Problem of evaluating the quadratic formula

$$ ax^2+bx+c = 0 \implies x= \frac{-b\pm\sqrt{b^2-4ac}}{2a} $$

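* Cancellation errors occur when $b^2$ is much bigger than $4ac$. 
* If so, $$ \sqrt{b^2-4ac} = |b|+\delta \approx |b| $$ with $|\delta| \ll |b|$, and the root in which $-b$ and the square root have opposite signs, $$ x = \frac{-b+\mathrm{sgn}(b)\,(|b|+\delta)}{2a} = \frac{\mathrm{sgn}(b)\,\delta}{2a}, $$ loses accuracy because two nearly equal numbers are subtracted.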

In [7]:
a=1.0
c=1e-12
b=2.0

# numerator of the quadratic formula for the root closest to zero (prone to cancellation)
bb=-b+sqrt(b^2-4.0*a*c)

-1.000088900582341e-12

In [12]:
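# remedy formula: multiplying by the conjugate gives
# -b+sqrt(b^2-4ac) = 4ac/(-b-sqrt(b^2-4ac)), which involves no cancellation for b>0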

4.0*a*c/(-b-sqrt(b^2-4.0*a*c))

-1.00000000000025e-12
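The same idea can be packaged so that it works for either sign of $b$. The following is a minimal sketch, assuming real roots and Julia 1.x; the function name `stable_roots` is hypothetical. It computes the root without cancellation first and obtains the other one from the product of the roots, $x_1 x_2 = c/a$.

```julia
# Hypothetical helper: both real roots of a*x^2 + b*x + c = 0 without cancellation.
function stable_roots(a, b, c)
    disc = b^2 - 4a*c
    disc < 0 && throw(DomainError(disc, "complex roots are not handled in this sketch"))
    # copysign gives the square root the sign of b, so the two terms add up
    q = -(b + copysign(sqrt(disc), b)) / 2
    return q / a, c / q      # c / q is NaN in the degenerate case b == c == 0
end

x1, x2 = stable_roots(1.0, 2.0, 1e-12)
println(x1)
println(x2)
```

Here `x2` is the small root; since $a=1$ it equals half of the numerator value printed above.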

### Use of BigFloat

We apply the standard quadratic formula using floating-point numbers that carry many more digits.
The outcome should be close to the one obtained with the modified formula.

In [13]:
A=BigFloat(1.0)
C=BigFloat(1e-12)
B=BigFloat(2.0)

BB=-B+sqrt(B^2-4.0*A*C)

-1.000000000000249979886647754245558691145588771407201475735952561810612921414033e-12
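As a rough check, the `BigFloat` value can serve as a reference for measuring the relative error of the two double-precision evaluations. This is a sketch assuming Julia 1.x and the default `BigFloat` precision; the names `naive`, `remedy`, `ref`, and `relerr` are illustrative.

```julia
a, b, c = 1.0, 2.0, 1e-12

naive  = -b + sqrt(b^2 - 4.0*a*c)                  # standard formula (numerator only)
remedy = 4.0*a*c / (-b - sqrt(b^2 - 4.0*a*c))      # modified formula

A, B, C = BigFloat(a), BigFloat(b), BigFloat(c)
ref = -B + sqrt(B^2 - 4*A*C)                       # high-precision reference

relerr(x) = Float64(abs(x - ref) / abs(ref))

println("naive:  ", relerr(naive))
println("remedy: ", relerr(remedy))
```

The standard formula is off by a relative error of order $10^{-4}$, while the modified formula agrees with the reference to roughly machine precision.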