# Floating Point

The IEEE 754 standard defines how real numbers are stored in a computer.

1. A single-precision number consists of 32 bits, with 1 bit for the sign, 8 for the exponent, and 23 for the significand.

2. A double-precision number consists of 64 bits with 1 bit for the sign, 11 for the exponent, and 52 for the significand.

3. An extended-precision number consists of 80 bits, with 1 bit for the sign, 15 for the exponent, and 64 for the significand.
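
The field widths listed above can be checked directly on a bit pattern. The following is a minimal sketch: `split_bits` is a hypothetical helper (not part of Julia), and it covers only `Float32` and `Float64`, since Julia's Base has no 80-bit floating-point type.

```julia
# Hypothetical helper: split the bit pattern of a float into
# (sign, exponent, fraction) strings.
function split_bits(x::Union{Float32,Float64})
    b = bitstring(x)
    nexp = x isa Float32 ? 8 : 11      # exponent width in bits
    sign_bit = b[1:1]
    exponent = b[2:1+nexp]
    fraction = b[2+nexp:end]
    return sign_bit, exponent, fraction
end

println(split_bits(Float32(-16)))      # sign, 8 exponent bits, 23 fraction bits
println(length.(split_bits(1.0)))      # (1, 11, 52) for double precision
```

The lengths of the three strings add up to 32 for `Float32` and 64 for `Float64`.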



For example, a double-precision number is represented as:

$x=(-1)^s (1.b_{13} b_{14} \ldots b_{64})_2 \times 2^{e-1023}$

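1. $s \in \{0,1\}$ is the sign bit.
2. $b_{13} b_{14} \ldots b_{64}$ is the mantissa (the 52 fraction bits, stored in bit positions 13 to 64).
3. $e$ is the stored (biased) exponent, held in bits 2 to 12; subtracting the bias 1023 gives the actual exponent.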
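
As a rough check of the formula, the sketch below rebuilds a `Float64` from its 64-character bit string. `decode_float64` is a hypothetical helper introduced only for illustration, and it assumes a normal number (zero, subnormals, `Inf` and `NaN` are not handled).

```julia
# Hypothetical helper: evaluate x = (-1)^s * (1.b13...b64)_2 * 2^(e-1023)
# from the bit string of a normal Float64.
function decode_float64(b::AbstractString)
    s = parse(Int, b[1:1], base=2)                  # sign bit
    e = parse(Int, b[2:12], base=2)                 # biased exponent
    frac = parse(Int, b[13:64], base=2) / 2.0^52    # (0.b13...b64)_2
    return (-1.0)^s * ldexp(1 + frac, e - 1023)     # ldexp(m, n) = m * 2^n
end

println(decode_float64(bitstring(10.375)) == 10.375)
println(decode_float64(bitstring(-16.0)) == -16.0)
```

`ldexp` is used for the scaling by a power of two so that no rounding is introduced in that step.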


In [17]:
2^11/2-1

1023.0

In [38]:
bitstring(Int32(16))

"00000000000000000000000000010000"

In [27]:
i=Float32(-16)
b=bitstring(i)

"11000001100000000000000000000000"

In [29]:
i=Float32(16)
b=bitstring(i)

"01000001100000000000000000000000"

In [28]:
length(b)

32

In [33]:
typeof(b)

String

For example, the number $x=10.375$ can be split into an integer part and a fractional part.

In [35]:
x=10.375

10.375

In [41]:
typeof(x)

Float64

In [36]:
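using Printf

p=floor(x)   # integer part (round would give the wrong part for, e.g., 10.6)
r=x-p        # fractional part
@printf("Integer part %d, fractional part %f",p,r)

Integer part 10, fractional part 0.375000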

In [37]:
p == 0*2^0+1*2^1+0*2^2+1*2^3 #1010_2
r == 0*2^-1+1*2^-2+1*2^-3    #0.011_2

true
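
The binary digits used in the check above can also be generated rather than written by hand. This is a minimal sketch: `fraction_bits` is a hypothetical helper that applies repeated doubling to a fraction in $[0,1)$, while `string(n, base=2)` is the standard Julia call for the integer part.

```julia
# Hypothetical helper: first n binary digits of a fraction 0 <= r < 1.
# Each doubling moves the next binary digit in front of the point.
function fraction_bits(r::Float64, n::Int)
    out = Char[]
    for _ in 1:n
        r *= 2
        if r >= 1
            push!(out, '1')
            r -= 1
        else
            push!(out, '0')
        end
    end
    return String(out)
end

println(string(10, base=2))        # integer part of 10.375: 1010
println(fraction_bits(0.375, 3))   # fractional part of 10.375: 011
```

Together the two strings give $10.375_{10} = 1010.011_2$.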

In [42]:
b=bitstring(x)

"0100000000100100110000000000000000000000000000000000000000000000"

In [50]:
length(b)

64

In [51]:
e=parse(Int,b[2:12],base=2)-1023

3

In [55]:
m=b[13:64]

"0100110000000000000000000000000000000000000000000000"

In [58]:
u=string('1',m)

"10100110000000000000000000000000000000000000000000000"

In [62]:
u[1:e+1]

"1010"

In [66]:
p=length(u[1:e+1]):-1:1
p[1]

4

In [101]:
# Converts the integer part of the number: the first e+1 bits of u.
function convert_float(u,e)
    f=0
    p=e:-1:0                     # powers of two, from 2^e down to 2^0
    j=1
    for i in u[1:e+1]
        d=parse(Int,i,base=2)    # current binary digit
        @printf("Digit %d, exponent %d, weight %d \n",d,p[j],2^p[j])
        f+=d*(2^p[j])
        j+=1
    end
    return f
end

convert_float (generic function with 1 method)

In [102]:
xf=convert_float(u,e)

Digit 1, exponent 3, weight 8 
Digit 0, exponent 2, weight 4 
Digit 1, exponent 1, weight 2 
Digit 0, exponent 0, weight 1 


10

In [92]:
2^4.0

16.0

### Exercise

Find the floating-point representation (mantissa, exponent, and sign) of the number $x=123.15625_{10}$
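
One possible way to work this out by hand, following the double-precision formula given earlier (a sketch of the arithmetic, not the only route):

$123_{10} = 64+32+16+8+2+1 = 1111011_2$

$0.15625_{10} = 2^{-3} + 2^{-5} = 0.00101_2$

$123.15625_{10} = 1111011.00101_2 = (1.11101100101)_2 \times 2^{6}$

$s = 0, \qquad e = 6 + 1023 = 1029 = 10000000101_2$

The mantissa is therefore $11101100101$ followed by 41 zeros.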

In [7]:
x=123.15625
b=bitstring(x)

"0100000001011110110010100000000000000000000000000000000000000000"

# Approximation Error

We know that the function $f(x)=e^x$ can be approximated by truncating its Taylor series:

$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots.$

For example, evaluating $f(1)$ should give a value close to $e$.

In [103]:
exp(1.)

2.718281828459045

In [113]:
function taylor_series(x::Float64,k::Int)::Float64
    range=1:1:k
    res=1.
    for i in range
        res+=(x^i)/factorial(big(i))
    end
    return res
end

taylor_series (generic function with 1 method)

In [114]:
taylor_series(1.,3)

2.6666666666666665
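
The same truncated sum can be computed without `factorial` or `big`. The sketch below is a hypothetical alternative named `taylor_exp_iter` (not part of the notebook) that builds each term from the previous one, using $\frac{x^i}{i!} = \frac{x^{i-1}}{(i-1)!} \cdot \frac{x}{i}$.

```julia
# Hypothetical alternative to taylor_series: each term is derived from the
# previous one, so no factorial or power has to be evaluated.
function taylor_exp_iter(x::Float64, k::Int)
    term = 1.0          # x^0 / 0!
    res = 1.0
    for i in 1:k
        term *= x / i   # now equal to x^i / i!
        res += term
    end
    return res
end

println(taylor_exp_iter(1.0, 3))
println(abs(taylor_exp_iter(1.0, 20) - exp(1.0)))
```

Because every operation stays in `Float64`, the result may differ from the `big`-based version in the last digit or so.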

In [135]:
err=Inf
k=1
while err>eps(Float64)
    val=taylor_series(1.,k)
    err=abs(exp(1.)-val)
    @printf("val : %f, order : %d, error : %e \n",val,k,err)
    k+=1
end


In [136]:
val=taylor_series(1.,k)

2.718281828459045

In [134]:
k

18

In [137]:
function taylor_series(x::Float32,k::Int)::Float32
    range=1:1:k
    res=1.
    for i in range
        res+=(x^i)/factorial(big(i))
    end
    return res
end

taylor_series (generic function with 2 methods)

In [139]:
err=Inf
k=1
while err>eps(Float32)
    val=taylor_series(Float32(1.),k)
    err=abs(exp(1.)-val)
    @printf("val : %f, order : %d, error : %e \n",val,k,err)
    k+=1
end


In [140]:
k

11

In [142]:
eps(Float32)>eps(Float64)

true
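
The tolerance used in both loops, `eps(T)`, is the gap between 1 and the next representable number of type `T`. The short sketch below illustrates this using only standard Julia functions (`eps`, `nextfloat`); each line should print `true`.

```julia
# Machine epsilon equals 2^-52 for Float64 and 2^-23 for Float32,
# matching the 52 and 23 fraction bits of each format.
println(eps(Float64) == 2.0^-52)
println(eps(Float32) == 2.0f0^-23)

# It is the distance from 1.0 to the next representable value.
println(nextfloat(1.0) - 1.0 == eps(Float64))

# Adding half of it to 1.0 is rounded away.
println(1.0 + eps(Float64) / 2 == 1.0)
```

This is why the `Float32` loop stops at a lower order than the `Float64` one: its tolerance is much larger.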