# Computer arithmetic


## Floating-Point Numbers and Roundoff Errors

### Single precision arithmetic
In Julia, the **Float32** type provides ***single-precision arithmetic***;
see [https://en.wikipedia.org/wiki/Single-precision_floating-point_format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)

In [1]:
### Single precision arithmetic (float in C/C++)
z = Float32(1.0)
x = 1.0f0
y = 0.5f-8
typeof(x)
typeof(y)

Float32

### Double precision arithmetic
By default, ***floating-point literals*** in Julia have the **Float64** (double-precision) type.

In [2]:
### Double precision arithmetic - Float64 (double in C/C++)
x = 12.0;
y = 2.0e-7;
typeof(x)
typeof(y)

Float64

In [3]:
x = Float32( 1.0 )
y = Float64( 1.0 )
typeof(x), typeof(y)

(Float32, Float64)
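
The practical difference between the two types can be sketched as follows. This is a minimal illustration, assuming Julia 1.x; the variable names (`sizes`, `a32`, `a64`, `same_value`, `promoted`) are made up for this example.

```julia
# Storage size in bytes of each floating-point type
sizes = (sizeof(Float16), sizeof(Float32), sizeof(Float64))

# 0.1 has no finite binary expansion, so each type stores a different approximation
a32 = Float32(0.1)
a64 = 0.1
same_value = (Float64(a32) == a64)   # widening does not restore the lost digits

# Mixed-precision arithmetic promotes to the wider type
promoted = typeof(1.0f0 + 1.0)

println(sizes, " ", same_value, " ", promoted)
```

Converting a `Float32` back to `Float64` keeps the rounded value, which is why the comparison with the `Float64` literal fails.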

### IEEE 754 
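$$ x = s \cdot m \cdot 2^c $$

where $s = \pm 1$ is the sign, $m \in [1, 2)$ is the mantissa of a normalized number, and $c$ is the exponent.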

* [single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
* [double precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)


In [4]:
### floating-point arithmetic representation
# s-sign, c-exponent, m-mantissa
(s,c,m) = ( 1.0,
           -5,
           1.0 + 2.0^(-3) + 2.0^(-5) )

@show x = s*2.0^c*m;
typeof(x)

x = s * 2.0 ^ c * m = 0.0361328125


Float64
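
Going the other way, the triple (s, c, m) can be recovered from an existing number. The sketch below assumes Julia 1.x and uses the `Base` functions `sign`, `exponent`, and `significand`; it applies to finite, nonzero values only.

```julia
x = 0.0361328125            # the value built in the cell above

s = sign(x)                 # sign as +1.0 or -1.0
c = exponent(x)             # integer exponent
m = significand(abs(x))     # mantissa in [1, 2); abs() keeps the sign out of m

println((s, c, m))
println(s * m * 2.0^c == x) # the three parts reproduce x exactly
```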

In [5]:
repr = bits(x) # binary representation as a string (`bits` is named `bitstring` in Julia 1.0 and later)

"0011111110100010100000000000000000000000000000000000000000000000"

In [6]:
rs = repr[1];        # sign      - the first bit
rc = repr[2:12];     # exponent  - next 11 bits
rm = repr[13:end]    # mantissa  - last 52 bits
@printf("64-bit machine word for x = %f is \n%c|%s|%s\n", x, rs,rc,rm);

64-bit machine word for x = 0.036133 is 
0|01111111010|0010100000000000000000000000000000000000000000000000


In [7]:
### The exponent is stored as the plain binary representation of c+1023 (the biased exponent, 11 bits)

@show c+1023
@show bits( c+1023 )
tmp = bits( c+1023 )[end-10:end];
rc == tmp

c + 1023 = 1018
bits(c + 1023) = "0000000000000000000000000000000000000000000000000000001111111010"


true
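
The same three fields can also be extracted with integer bit operations instead of string slicing. This is a sketch, assuming Julia 1.x and a normal (not subnormal, infinite, or NaN) `Float64`; the names `w`, `sign_bit`, `biased_c`, and `fraction` are illustrative.

```julia
x = 0.0361328125
w = reinterpret(UInt64, x)              # the same 64 bits, viewed as an unsigned integer

sign_bit = w >> 63                      # 1 bit
biased_c = (w >> 52) & 0x7ff            # 11 bits, holds c + 1023
fraction = w & 0x000fffffffffffff       # 52 bits, holds (m - 1) * 2^52

s = sign_bit == 0 ? 1.0 : -1.0
c = Int(biased_c) - 1023
m = 1.0 + fraction / 2.0^52             # restore the implicit leading 1

println((s, c, m))
println(s * m * 2.0^c == x)
```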

### NaN and Inf ***numbers***

In [8]:
# (s,c,m) = (1.0, 1024, 1.5)  =>  NaN, (for example x = 0.0 / 0.0)
# (s,c,m) = (1.0, 1024, 1.0)  =>  Inf, (for example x = 1.0 / 0.0)

println(bits(NaN))  #> 0|11111111111|1000000000000000000000000000000000000000000000000000
println(bits(Inf))  #> 0|11111111111|0000000000000000000000000000000000000000000000000000

0111111111111000000000000000000000000000000000000000000000000000
0111111111110000000000000000000000000000000000000000000000000000
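
A few operations that produce these special values, and the predicates that detect them, are sketched below. The example assumes Julia 1.x and default IEEE 754 behaviour; the variable names are illustrative.

```julia
pos_inf  = 1.0 / 0.0       # division of a nonzero number by zero
neg_inf  = -1.0 / 0.0
not_num  = 0.0 / 0.0       # indeterminate form
also_nan = Inf - Inf       # another indeterminate form

println(isinf(pos_inf), " ", isinf(neg_inf))
println(isnan(not_num), " ", isnan(also_nan))
println(not_num == not_num)    # NaN compares unequal to everything, itself included
println(isfinite(pos_inf))
```

Because `NaN == NaN` is false, `isnan` is the reliable way to test for NaN.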


### Half-precision (16-bit)

In [9]:
##########################################################
x = Float16(1.3); # Half precision (16 bits)
@printf("Representation of 1.3 in 16-bit arithmetic = %c.8f  = %.8f\n",  '%', x);  # 1.29980469
@printf("Representation of 1.3 in 16-bit arithmetic = %c.16f = %.16f\n", '%', x);  # 1.2998046875000000

Representation of 1.3 in 16-bit arithmetic = %.8f  = 1.29980469
Representation of 1.3 in 16-bit arithmetic = %.16f = 1.2998046875000000
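
To see why 1.3 is stored as 1.2998046875, it helps to look at the neighbouring representable values. The sketch below assumes Julia 1.x and uses `nextfloat` from `Base`; `lo` and `hi` are illustrative names.

```julia
x  = Float16(1.3)
lo = Float64(x)                  # stored value
hi = Float64(nextfloat(x))       # next representable Float16 above it

println(lo, " < 1.3 < ", hi)
println(hi - lo == 2.0^-10)      # spacing of Float16 values in [1, 2)
println(abs(1.3 - lo) <= abs(hi - 1.3))   # the nearer neighbour was chosen
```

The number 1.3 falls strictly between two adjacent `Float16` values, and conversion rounds to the closer one.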


### Arithmetic precision
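$$ \mathsf{u} = \frac{1}{2} \cdot 2^{-t} $$

where $t$ is the number of mantissa bits stored in the format: 10 for Float16, 23 for Float32, and 52 for Float64.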

In [10]:
##########################################################
# Computing the arithmetic precision number:
# halve x until adding it to 1 no longer changes the result
# Example:  aprec( Float16 )
function aprec(T = Float64)
  one, two = T(1.0), T(2.0)
  x = one
  while x+one>one
    x=x/two
  end
  return x
end
@printf("Arithmetic precision number for Float16 = %.4e\n", aprec(Float16))
@printf("Arithmetic precision number for Float32 = %.4e\n", aprec(Float32))
@printf("Arithmetic precision number for Float64 = %.4e\n", aprec(Float64))

Arithmetic precision number for Float16 = 4.8828e-04
Arithmetic precision number for Float32 = 5.9605e-08
Arithmetic precision number for Float64 = 1.1102e-16


In [11]:
### Sanity check: aprec(T) equals 0.5 * 2^(-t), where t is the number of mantissa bits
for (ftype,t) in [ (Float16,10), (Float32,23), (Float64,52) ]
  @assert aprec(ftype) == 0.5 * 2.0^(-t)
end;

### Machine epsilon
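Julia has a built-in function [***eps(...)***](https://docs.julialang.org/en/stable/manual/integers-and-floating-point-numbers/#Machine-epsilon-1) that returns the machine epsilon of a floating-point type, i.e. the distance between 1 and the next larger representable number.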

In [12]:
# help("Base.eps");
@printf("eps(Float16) = %.4e\n", eps(Float16));
@printf("eps(Float32) = %.4e\n", eps(Float32));
@printf("eps(Float64) = %.4e\n", eps(Float64));

eps(Float16) = 9.7656e-04
eps(Float32) = 1.1921e-07
eps(Float64) = 2.2204e-16


Comparing these values with the arithmetic precision numbers computed above, we conclude that
$$ \mathsf u = \frac{1}{2} \mathtt{eps} $$
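
This relation can be checked directly for all three types. The loop below is a sketch assuming Julia 1.x; it relies on `precision(T)` from `Base`, which counts the implicit leading bit and therefore returns t + 1.

```julia
for T in (Float16, Float32, Float64)
    u = eps(T) / T(2)                     # candidate for the arithmetic precision
    @assert u == T(2)^(-precision(T))     # u = 2^-(t+1) = (1/2) * 2^-t
    @assert one(T) + u == one(T)          # a tie that rounds back to 1
    @assert one(T) + eps(T) > one(T)      # eps(T) is the gap from 1 to the next number
    println(T, ": u = ", u)
end
```

If any of the assertions failed, the loop would stop with an error; running to completion confirms the relation for each type.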