# Block MiniFloat vs IEEE 754

## Structure and description

### The standard 32-bit single-precision format in IEEE 754 has the form:

Sign bit = 1, Exponent bits = 8, Significand bits = 23

This gives 6 to 9 significant decimal digits of precision.
The represented value is calculated as:

When $E = 0$
$$Subnormal = (-1)^S \times 2^{-126} \times (0 + \sum_{i=1}^{23} b_{23-i}2^{-i}) $$

Otherwise:

$$Normal = (-1)^S \times 2^{E-127} \times (1 + \sum_{i=1}^{23} b_{23-i}2^{-i}) $$
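
The two formulas above can be checked directly in Python. This is a minimal sketch: the helper name `decode_float32` is hypothetical, and it relies on the observation that the sum over the 23 fraction bits equals the integer value of the fraction field divided by $2^{23}$. The exponent code $E = 255$ (infinities and NaNs) is not covered by the formulas and is rejected here.

```python
import struct

def decode_float32(bits):
    """Decode a 32-bit pattern using the subnormal/normal formulas."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    # sum_{i=1}^{23} b_{23-i} * 2^{-i} is the fraction field divided by 2^23.
    fraction = (bits & 0x7FFFFF) / 2**23
    if exponent == 0xFF:
        raise ValueError("E = 255 encodes infinity or NaN; not handled here")
    if exponent == 0:
        return (-1) ** sign * 2.0 ** -126 * fraction
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction)

# Compare against Python's own interpretation of the same 32 bits.
pattern = 0b00111110000110101001111110111110
reference = struct.unpack(">f", struct.pack(">I", pattern))[0]
print(decode_float32(pattern), reference)
```

Both printed values should agree, since every intermediate quantity is exactly representable in a Python float.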


### Block MiniFloat is an 8-bit format modelled on IEEE 754 and has the form:

Sign bit = 1, Exponent Bits = 4, Mantissa bits = 3

It has two representations:

When $E = 0$

$$denormal: X(e,m) = (-1)^S \times 2^{1-\beta} \times (0 + F\times2^{-m}) $$

Otherwise:

$$Normal: X(e,m) = (-1)^S \times 2^{E-\beta} \times (1 + F\times2^{-m}) $$

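Where $\beta = 2^{e-1} - 1$ is the exponent bias for the binary offset encoding scheme, $e$ and $m$ are the numbers of exponent and mantissa bits, and $F$ is the integer value of the mantissa field. For $e = 4$ this gives $\beta = 7$. The format should be consistent with IEEE 754, except that it does not handle NaNs or infinities. Instead, we allow the number to saturate at the limits of the range $[X_{min}^+, X_{max}^+]$.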
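
As a rough sketch of the two MiniFloat formulas, the decoder below assumes $e = 4$, $m = 3$, that $F$ is the integer value of the 3-bit mantissa field, and that the all-ones exponent code is an ordinary exponent (since NaNs and infinities are not represented). The name `decode_minifloat` is hypothetical.

```python
E_BITS = 4                      # e: width of the exponent field
M_BITS = 3                      # m: width of the mantissa field
BIAS = 2 ** (E_BITS - 1) - 1    # beta = 7 for e = 4

def decode_minifloat(bits):
    """Decode an 8-bit pattern laid out as sign | exponent | mantissa."""
    sign = (bits >> (E_BITS + M_BITS)) & 1
    exponent = (bits >> M_BITS) & (2**E_BITS - 1)
    fraction = bits & (2**M_BITS - 1)            # F in the formulas
    if exponent == 0:
        # Denormal: no implicit leading 1, fixed exponent 1 - beta.
        magnitude = 2.0 ** (1 - BIAS) * (fraction * 2.0 ** -M_BITS)
    else:
        # Normal: implicit leading 1 (assumption: E = 15 is treated as normal).
        magnitude = 2.0 ** (exponent - BIAS) * (1 + fraction * 2.0 ** -M_BITS)
    return -magnitude if sign else magnitude

# Smallest and largest positive values under these assumptions.
x_min = decode_minifloat(0b0_0000_001)
x_max = decode_minifloat(0b0_1111_111)
print(x_min, x_max)
```

Under these assumptions the positive range runs from $2^{-9}$ up to $480$, which is where values would saturate.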






### Let's test the effects of rounding
Let's pass the same values to both functions, chosen so that adding them together hopefully showcases the rounding.

Let's try $0.151 + 2.345 = 2.496$.

Using a helpful website developed by H. Schmidt at: https://www.h-schmidt.net/FloatConverter/IEEE754.html

As well as at: https://www.doc.ic.ac.uk/~eedwards/compsys/float/

Now in 32 bit format this is given as:
$0b'00111110000110101001111110111110 + 0b'01000000000101100001010001111011 = 0b'01000000000111111011111001110111$
These numbers, when converted to binary, have an intrinsic loss of precision. However, for 32-bit floating point we expect at least 6 significant figures.
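
The three bit patterns can also be reproduced locally with Python's `struct` module instead of the websites. This is a small sketch; the helper names `float32_bits` and `float32_value` are hypothetical.

```python
import struct

def float32_bits(x):
    """Bit pattern (as an int) of x rounded to the nearest float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def float32_value(bits):
    """Value of a float32 bit pattern, widened to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a_bits = float32_bits(0.151)
b_bits = float32_bits(2.345)
# Add the two stored float32 values, then round the sum back to float32.
sum_bits = float32_bits(float32_value(a_bits) + float32_value(b_bits))

for name, bits in (("a", a_bits), ("b", b_bits), ("a+b", sum_bits)):
    print(f"{name:>3}: {bits:032b} -> {float32_value(bits)!r}")
```

The printed decimal values show the conversion error in each operand: the stored numbers are close to, but not exactly, 0.151 and 2.345.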

In 8 bit MiniFloat format, this is given as:

????
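
One possible way to obtain the 8-bit encodings is an exhaustive search over all 128 non-negative codes, picking the one whose decoded value is closest to the input. This is only a sketch under assumptions: the rounding mode is taken to be round-to-nearest (ties go to the smaller magnitude here, which may differ from the intended scheme), the exponent code 15 is treated as a normal exponent, and the names `decode_minifloat` and `encode_minifloat` are hypothetical.

```python
def decode_minifloat(bits, e_bits=4, m_bits=3):
    """Decode sign | exponent | mantissa using the MiniFloat formulas."""
    bias = 2 ** (e_bits - 1) - 1
    sign = (bits >> (e_bits + m_bits)) & 1
    exponent = (bits >> m_bits) & (2**e_bits - 1)
    fraction = (bits & (2**m_bits - 1)) / 2**m_bits
    if exponent == 0:
        magnitude = 2.0 ** (1 - bias) * fraction
    else:
        magnitude = 2.0 ** (exponent - bias) * (1 + fraction)
    return -magnitude if sign else magnitude

def encode_minifloat(x, e_bits=4, m_bits=3):
    """Nearest code by exhaustive search; out-of-range inputs saturate."""
    sign = 1 if x < 0 else 0
    codes = range(2 ** (e_bits + m_bits))        # all non-negative codes
    best = min(codes,
               key=lambda c: abs(decode_minifloat(c, e_bits, m_bits) - abs(x)))
    return (sign << (e_bits + m_bits)) | best

for value in (0.151, 2.345, 0.151 + 2.345):
    code = encode_minifloat(value)
    print(f"{value:.3f} -> {code:08b} -> {decode_minifloat(code)}")
```

Because the largest code is also the nearest one for any input beyond the range, saturation falls out of the search without a special case. With only 3 mantissa bits, the two operands land on 0.15625 and 2.25 under these assumptions, so the rounding error is visible already in the second significant figure.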

In [1]:
# Function definition of a variable-bit adder
def adder_var_bit(a, b, length):
    # Carry between bit positions, the overflow indicator, and the running sum.
    carryFlag = 0
    overflowFlag = 0
    result = 0

    # Extract bits for arithmetic use.
    for i in range(length):
        # curr_a and curr_b are the 1 or 0 values at position i of each operand.
        curr_a = (a >> i) & 1
        curr_b = (b >> i) & 1

        # We XOR the bits together with the incoming carry to create binary
        # addition, and place the result in the sum_bit variable.
        sum_bit = curr_a ^ curr_b ^ carryFlag

        # A carry is produced when at least two of the three input bits are 1.
        carryFlag = (curr_a & curr_b) | (curr_a & carryFlag) | (curr_b & carryFlag)

        # Then we place the current working bit into the result value.
        result |= (sum_bit << i)

    # Check for overflow once, after the final bit: an unsigned overflow is a
    # carry out of the most significant bit.
    overflowFlag = carryFlag
    return result, overflowFlag

In [2]:
# Function definition of a variable-bit subtractor
def subber_var_bit(a, b, length):
    # Borrow between bit positions, the underflow indicator, and the running result.
    borrowFlag = 0
    underflowFlag = 0
    result = 0

    # Extract bits for arithmetic use.
    for i in range(length):
        # curr_a and curr_b are the 1 or 0 values at position i of each operand.
        curr_a = (a >> i) & 1
        curr_b = (b >> i) & 1
        not_a = curr_a ^ 1  # single-bit NOT of curr_a

        # We XOR the bits together with the incoming borrow to create binary
        # subtraction, and place the result in the diff_bit variable.
        diff_bit = (curr_a ^ borrowFlag) ^ curr_b

        # A borrow is needed when a is 0 and b is 1, or when an incoming borrow
        # cannot be absorbed because a is 0 or b is 1.
        borrowFlag = (not_a & curr_b) | ((not_a | curr_b) & borrowFlag)

        # Then we place the current working bit into the result value.
        result |= (diff_bit << i)

    # Check for underflow once, after the final bit: an unsigned underflow is a
    # borrow out of the most significant bit.
    underflowFlag = borrowFlag

    return result, underflowFlag

In [3]:
a = 0b1
b = 0b111
sub, underflowFlag = subber_var_bit(a, b, 3)
print("Binary Sum:", bin(sub)[2:].zfill(3))
print("Decimal Sum", sub)
print("Underflow Flag:", underflowFlag)

Binary Sum: 010
Decimal Sum 2
Underflow Flag: 1
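
A quick way to cross-check the bitwise adder and subtractor is to compare them against plain integer arithmetic masked to `length` bits. The sketch below shows such a reference; the names `add_reference` and `sub_reference` are hypothetical and not part of the notebook, and both treat the operands as unsigned.

```python
def add_reference(a, b, length):
    """Unsigned add wrapped to `length` bits; returns (result, carry_out)."""
    mask = (1 << length) - 1
    total = (a & mask) + (b & mask)
    return total & mask, total >> length

def sub_reference(a, b, length):
    """Unsigned subtract wrapped to `length` bits; returns (result, borrow_out)."""
    mask = (1 << length) - 1
    diff = (a & mask) - (b & mask)
    # In Python, masking a negative int yields its two's-complement wrap.
    return diff & mask, 1 if diff < 0 else 0

result, borrow = sub_reference(0b1, 0b111, 3)
print("Binary Difference:", bin(result)[2:].zfill(3))
print("Borrow:", borrow)
```

For the 3-bit case above, the reference agrees with the recorded output: `1 - 7` wraps to `010` with a borrow of 1.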


In [14]:
def add_IEEE_745(a, b):

    carryFlag = 0
    zeroFlag = 0
    overflowFlag = 0
    underflowFlag = 0
    result = 0
    
    # Should be 0
    signA = (a >> 31) & 1
    
    # Should be 0
    signB = (b >> 31) & 1
    
    # Should be 01111100
    exponentA = (a & 0b01111111100000000000000000000000) >> 23
    print("ExponentA:", bin(exponentA)[2:].zfill(8))
    
    # Should be 10000000
    exponentB = (b & 0b01111111100000000000000000000000) >> 23
    print("ExponentB:", bin(exponentB)[2:].zfill(8))
    
    # Should be 00110101001111110111110
    mantissaA = (a & 0b00000000011111111111111111111111)
    print("mantissaA:", bin(mantissaA)[2:].zfill(23))
    
    # Should be 00101100001010001111011
    mantissaB = (b & 0b00000000011111111111111111111111) 
    print("mantissaB:", bin(mantissaB)[2:].zfill(23))

    print()

    # Calculate exponent/mantissa values and hence calculate entire binary value from IEEE 754 standard

    E_diff = exponentA - exponentB #subber_var_bit(exponentA, exponentB, 8)
    print(E_diff)
    
    if (E_diff == 0b0):
        if (signA == 0b0) and (signB == 0b0):
            MANTISSA = mantissaA + mantissaB #adder_var_bit(mantissaA, mantissaB, 23)
            SIGN = 0
            
        if (signA == 0b0) and (signB == 0b1):
            MANTISSA = mantissaA - mantissaB #subber_var_bit(mantissaA, mantissaB, 23)
            if (MANTISSA < 0):
                SIGN = 1
            else:
                SIGN = 0
                
        if (signA == 0b1) and (signB == 0b0):
            MANTISSA = mantissaB - mantissaA #subber_var_bit(mantissaB, mantissaA, 23)
            if (MANTISSA < 0):
                SIGN = 1
            else:
                SIGN = 0
                
        if (signA == 0b1) and (signB == 0b1):
            MANTISSA = mantissaA + mantissaB #adder_var_bit(mantissaA, mantissaB, 23)
            SIGN = 1
            
        EXPONENT = exponentA

    else:
        if (E_diff < 0):
            E_diff = -E_diff
            mantissaA <<= E_diff
            EXPONENT = exponentA + E_diff #adder_var_bit(exponentA, E_diff, 8)
            

        if (E_diff > 0):
            mantissaB <<= E_diff
            EXPONENT = exponentB + E_diff #adder_var_bit(exponentB, E_diff, 8)

        if (signA == 0b0) and (signB == 0b0):
            MANTISSA = mantissaA + mantissaB #adder_var_bit(mantissaA, mantissaB, 23)
            SIGN = 0
            
        if (signA == 0b0) and (signB == 0b1):
            MANTISSA = mantissaA - mantissaB #subber_var_bit(mantissaA, mantissaB, 23)
            if (MANTISSA < 0):
                SIGN = 1
            else:
                SIGN = 0
                
        if (signA == 0b1) and (signB == 0b0):
            MANTISSA = mantissaB - mantissaA #subber_var_bit(mantissaB, mantissaA, 23)
            if (MANTISSA < 0):
                SIGN = 1
            else:
                SIGN = 0
                
        if (signA == 0b1) and (signB == 0b1):
            MANTISSA = mantissaA + mantissaB #adder_var_bit(mantissaA, mantissaB, 23)
            SIGN = 1

    print(SIGN)
    print(EXPONENT)
    print(MANTISSA)
    # For Exponent
    if (EXPONENT != 0b00000000):
        # Use normal equation
        sum = 1
        for i in range(23):
            sum = sum + ((MANTISSA >> 22-i) & 1)*2**(-i)
            print(sum)
        result = ((-1)**SIGN)*2**(EXPONENT - 127)*sum
        

    else:
        # Use subnormal equation
        sum = 0
        for i in range(23):
            sum = sum + ((MANTISSA >> 22-i) & 1)*2**(-i)
        result = ((-1)**SIGN)*2**(-126)*sum
        
    return result
    

In [15]:
a = 0b00111110000110101001111110111110
b = 0b01000000000101100001010001111011
c = add_IEEE_745(a,b)
print(c)

ExponentA: 01111100
ExponentB: 10000000
mantissaA: 00110101001111110111110
mantissaB: 00101100001010001111011

-4
0
132
51069840
1
1.0
1.0
1.125
1.125
1.15625
1.171875
1.171875
1.17578125
1.17578125
1.17578125
1.17578125
1.17578125
1.1759033203125
1.17596435546875
1.175994873046875
1.175994873046875
1.175994873046875
1.1759986877441406
1.1759986877441406
1.1759986877441406
1.1759986877441406
1.1759986877441406
37.6319580078125


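So very wrong: the expected sum is about 2.496, but the function returns 37.63. Tracing the code against the printed values points to several problems: after `E_diff` is negated, the `if (E_diff > 0)` branch also runs, so both mantissas are shifted and the exponent ends up at 132 instead of 128; the implicit leading 1 is never added to the mantissas; the operands are shifted left rather than the smaller one being shifted right; and the decoding loop weights the first mantissa bit by $2^0$ instead of $2^{-1}$.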

In [13]:
print(0b01111100)
print(0b10000000)

124
128
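
With the exponents confirmed as 124 and 128, a working addition might look like the sketch below. It is not a repair of `add_IEEE_745` but a separate, hypothetical function (`add_float32_bits`) using a different alignment technique: instead of shifting the smaller operand right with guard bits, as hardware does, it scales the operand with the larger exponent up using Python's unbounded integers, adds exactly, and rounds once (round to nearest, ties to even). It assumes normal, finite inputs and a normal, finite result.

```python
import struct

def add_float32_bits(a, b):
    """Add two float32 bit patterns; return the bit pattern of the sum."""
    def unpack(bits):
        sign = (bits >> 31) & 1
        exponent = (bits >> 23) & 0xFF
        if exponent == 0 or exponent == 0xFF:
            raise ValueError("only normal, finite inputs are handled")
        # Restore the implicit leading 1, then apply the sign.
        significand = (bits & 0x7FFFFF) | 0x800000
        return exponent, -significand if sign else significand

    exp_a, sig_a = unpack(a)
    exp_b, sig_b = unpack(b)

    # Align exactly: express both significands at the smaller exponent's scale.
    exponent = min(exp_a, exp_b)
    total = (sig_a << (exp_a - exponent)) + (sig_b << (exp_b - exponent))
    if total == 0:
        return 0

    sign = 1 if total < 0 else 0
    magnitude = abs(total)

    # Normalise to a 24-bit significand (1 implicit bit + 23 stored bits).
    extra = magnitude.bit_length() - 24
    if extra > 0:
        kept = magnitude >> extra
        remainder = magnitude & ((1 << extra) - 1)
        half = 1 << (extra - 1)
        # Round to nearest, ties to even.
        if remainder > half or (remainder == half and kept & 1):
            kept += 1
        if kept == 1 << 24:          # rounding carried into a 25th bit
            kept >>= 1
            extra += 1
    else:
        kept = magnitude << -extra
    exponent += extra

    if not 1 <= exponent <= 254:
        raise ValueError("result is not a normal float32")
    return (sign << 31) | (exponent << 23) | (kept & 0x7FFFFF)

def bits_to_float(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

a = 0b00111110000110101001111110111110
b = 0b01000000000101100001010001111011
c = add_float32_bits(a, b)
print(f"{c:032b}", bits_to_float(c))
```

For the two inputs used in this notebook, the returned pattern is `01000000000111111011111001110111`, the same sum quoted in the rounding test section.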
