# Optional Example: Fun with 32-bit Floating Point Numbers
Most of the time (depending on your hardware), you'll use 64-bit floating-point numbers by default. However, many machine learning applications use `Float32` as their default precision. 

> __Why do we care?__ The `Float32` type offers a sweet spot for deep learning: its 24-bit significand (about seven decimal digits) and 8-bit exponent provide sufficient precision and dynamic range for most applications while halving the memory footprint compared to `Float64`. It also lets computation take advantage of specialized hardware optimized for 32-bit arithmetic.

In this example, you'll explore the memory layout of a `Float32`, and compute this floating point type's dynamic range and precision limits.
___

<div>
    <center>
        <img src="figs/Fig-Float32-bit-pattern.svg" width="680"/>
    </center>
</div>

## Example 32-bit memory layout
Suppose we have a number $x\in\mathbb{R}$ that is approximated by a 32-bit floating point variable in memory. Its value is encoded as:
$$
\begin{align*}
x = \underbrace{S}_{\text{sign}}\times\underbrace{\text{significand}}_{\text{fraction}}\times\underbrace{{2^{E-127}}}_{\text{scale}}
\end{align*}
$$
where:
$$
\begin{align*}
S &= (-1)^{d_{31}}\\
\text{significand} &= 1 + \sum_{i = 1}^{23}d_{23-i}2^{-i}\\
E &= \sum_{i=23}^{30}d_{i}2^{i - 23}
\end{align*}
$$
where $d_{i}$ denotes the bit at position $i$, counted from `0` at the least significant (rightmost) bit. Notice the difference between the 64- and 32-bit numbers: the number of bits used to compute the significand and the exponent terms is different, and the location of the sign bit has changed, but otherwise they have a similar structural layout in memory.
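
As a quick worked illustration of the layout (the value $x = -6.5$ is chosen here for convenience and is not used elsewhere in this example), note that $-6.5 = -1.101_{2}\times 2^{2}$. The sign bit is therefore `1`, the stored exponent is $E = 2 + 127 = 129$, and the fraction field is `101` followed by twenty zeros:
$$
\begin{align*}
S &= (-1)^{1} = -1\\
\text{significand} &= 1 + 2^{-1} + 2^{-3} = 1.625\\
E &= 2^{7} + 2^{0} = 129\\
x &= -1\times 1.625\times 2^{129-127} = -6.5
\end{align*}
$$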

Now, let's compute the components of the 32-bit representation of $x\in\mathbb{R}$. First, specify an example number and save it in the `x::Float32` variable:

In [3]:
x = 141.72 |> Float32; # why do we need |> Float32?

Check the type using [the `typeof(...)` method](https://docs.julialang.org/en/v1/base/base/#Core.typeof):

In [5]:
typeof(x) == Float32 # if Float32, this should be true

true
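
As a side note on the conversion step, a bare decimal literal such as `141.72` is a `Float64` in Julia, which is why the example converts it explicitly. The short sketch below (the names `y`, `x32`, and `z` are illustrative and not part of the example) shows two ways to obtain a `Float32` and compares the storage sizes:

```julia
y = 141.72            # bare decimal literal: a Float64
x32 = y |> Float32    # explicit conversion, as in the example above
z = 141.72f0          # the f0 suffix writes a Float32 literal directly

println(typeof(y), " | ", typeof(x32), " | ", typeof(z))
println("same Float32 value? ", x32 == z)
println(sizeof(y), " bytes versus ", sizeof(x32), " bytes")
```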

Next, let's use [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) to generate the bits of our 32-bit floating point number $x$ as a `String`, and then convert and save the bitstring into a `0`-based dictionary called `d::Dict{Int,Int}`:

In [7]:
d = let

    # initialize -
    bitpattern_dictionary = Dict{Int64,Int64}(); # storage for the 0-based bit pattern
    wordsize = 32; # how many boxes do we have?
    # bit characters -> Int64 values; reverse puts the least significant bit first, so a[1] holds bit 0
    a = bitstring(x) |> reverse |> collect .|> v-> parse(Int64, v)
    
    # put stuff in the dictionary
    for i ∈ 0:(wordsize-1)
        bitpattern_dictionary[i] = a[i+1];
    end
    bitpattern_dictionary # return to caller
end;
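
To see the three fields at a glance, the bit string can also be sliced directly. This is a minimal sketch that relies on `bitstring` printing the most significant bit (the sign bit) first; the variable names are illustrative and not part of the example:

```julia
x = 141.72f0                   # same value as the example, written as a Float32 literal
bits = bitstring(x)            # 32 characters, most significant bit first

sign_field = bits[1:1]         # bit 31
exponent_field = bits[2:9]     # bits 30 down to 23
fraction_field = bits[10:32]   # bits 22 down to 0

println(sign_field, " | ", exponent_field, " | ", fraction_field)
```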

### Sign term
Now that we have the bitpattern dictionary `d::Dict{Int, Int}`, we can compute the three components of our floating point number. Let's start with the sign, which we'll save in the `S::Float64` variable:

In [9]:
S = let
    S = (-1.0)^(d[31]);
end

1.0

### Significand
Next, let's compute a value for the `significand` of $x\in\mathbb{R}$, which we'll store in the `calculated_significand_value::Float64` variable:

In [11]:
calculated_significand_value = let

    # initialize -
    calculated_significand_value = 0.0;
    b = 2.0; # binary, base = 2
    msb = 23; # most significant bit (msb)
    lsb = 1; # least significant bit (lsb)
    significand_range_array = range(lsb,stop=msb,step=1) |> collect; # range of bits to use for the significand

    for i ∈ significand_range_array
        calculated_significand_value += (b^(-i))*d[msb-i]
    end
    calculated_significand_value + 1 # don't forget to add 1!
end

1.1071875095367432

__Check__: Let's use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) to check our calculated significand value against the output of [the `significand(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.Math.significand) using [the `==` comparison operator](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators). 
> _What happens_? If [the `==` comparison](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) comes back `false`, [an `AssertionError` is thrown](https://docs.julialang.org/en/v1/base/base/#Core.AssertionError) (and we know something is wrong with our calculation). However, if the comparison comes back `true`, we can be confident that our calculation is correct (no error is thrown).

In [13]:
@assert significand(x) == calculated_significand_value # compare built-in versus our calculated value

### Scale term
Now, let's compute the scale of the floating point number $x\in\mathbb{R}$. This requires the exponent value $E$, which we'll store in the `E::Float64` variable.
    
_Aside_: Sometimes you'll see the exponent $E$ expression for a 32-bit floating point number written as:
$$
E = \sum_{i=0}^{7}e_{i}2^{i}
$$
where the $e_{i}$'s denote _exponent bits_, i.e., digits from the original bit string whose indexes have been remapped to be $0\rightarrow{7}$. In this convention, $e_{0} = d_{23},e_{1} = d_{24},\dots,e_{7} = d_{30}$.  Let's implement the $0\rightarrow{7}$ summation below:

In [15]:
E = let

    # initialize -
    calculated_exponent_value = 0.0;
    b = 2.0; # binary, base = 2
    msb = 30; # most significant bit (msb)
    lsb = 23; # least significant bit (lsb)
    exponent_bit_range_array = range(lsb, stop=msb, step = 1) |> collect

    for i ∈ eachindex(exponent_bit_range_array)
        j = exponent_bit_range_array[i]; # remap operation: i runs from 1->8 (notice not zero based), while j runs from lsb -> msb
        calculated_exponent_value += d[j]*(b^(i - 1)) # why -1?
    end
    calculated_exponent_value # return
end;
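
As a cross-check on the remapped summation, Julia's built-in `exponent` function returns the unbiased exponent, so it should equal $E - 127$. The sketch below recomputes $E$ from the bit string so that it runs on its own; `E_check` is an illustrative name:

```julia
x = 141.72f0
E_check = parse(Int, bitstring(x)[2:9]; base = 2) # exponent bits read as a binary integer

println("E = ", E_check, ", E - 127 = ", E_check - 127, ", exponent(x) = ", exponent(x))
@assert E_check - 127 == exponent(x)
```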

#### Do we get the same number $x$?
Let's put all the pieces together and check our work. If our calculations are correct, the reconstructed number should equal the `x::Float32` value specified above. Let's use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) with [the `==` comparison operator](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) to check our calculated $x$ value against the original value.
* _What happens_? If [the `==` comparison](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) comes back `false`, [an `AssertionError` is thrown](https://docs.julialang.org/en/v1/base/base/#Core.AssertionError) (and we know something is wrong with our calculation):

In [46]:
@assert S*(calculated_significand_value)*2^(E - 127) == x # If this doesn't blow up, nice!
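
The same decomposition can be sketched with integer bit operations instead of a dictionary. This is an alternative approach, not the method used above: it reinterprets the 32 bits as a `UInt32` and extracts each field with shifts and masks. All variable names here are illustrative:

```julia
x = 141.72f0
u = reinterpret(UInt32, x)              # same 32 bits, viewed as an unsigned integer

sign_bit = Int((u >> 31) & 0x1)         # bit 31
E_bits   = Int((u >> 23) & 0xff)        # bits 30 down to 23
fraction = Int(u & 0x007fffff)          # bits 22 down to 0

sig = 1 + fraction / 2^23               # implicit leading 1 plus the fraction
x_rebuilt = (-1.0)^sign_bit * sig * 2.0^(E_bits - 127)

@assert sig == significand(x)
@assert x_rebuilt == x
```

Shifts and masks avoid building a string, at the cost of being a little less visual than the dictionary of bits.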

## Deep dive: How big (small) can the significand be?
The fractional component of the floating-point number is contained in the significand. Thus, an interesting question is how big (or small) can this component be?
* __Idea__: To explore this question, examine the summation term in the significand expression. If all 23 fraction bits in the summation were `0`, then the _smallest possible value_ of the significand is `1`. Alternatively, if all 23 fraction bits were `1`, then we'd get a maximum value. What is the maximum possible value?

Let's explore this numerically and revisit the expression to get some analytical insight.

In [73]:
max_significand_value = let

    # initialize -
    d = Dict{Int64, Int64}(); 
    calculated_significand_value = 0.0;
    b = 2.0; # binary, base = 2
    msb = 23; # most significant bit (msb)
    lsb = 1; # least significant bit (lsb)
    significand_range_array = range(lsb,stop=msb,step=1) |> collect; # range of bits to use for the significand

    # all ones, gives max: set every fraction bit (keys 0 -> 22) to 1
    number_of_digits = length(significand_range_array); # how many digits do we have for the significand?
    for i ∈ 1:number_of_digits
        d[i-1] = 1;
    end

    for i ∈ significand_range_array
        calculated_significand_value += (b^(-i))*d[msb-i]
    end
    calculated_significand_value + 1
end

1.9999998807907104

### Analytical analysis

The numerical calculation gave `max_significand_value ≈ 2`, i.e., the value of the summation term is $\approx{1}$. We'd expect this because the summation term is a geometric series in the powers $2^{-i}$, truncated at the number of bits used for the fraction. To see this, let's do a little math.
$$
\begin{align*}
S_{N} & = \sum_{i=1}^{N}2^{-i} = 2^{-1} + 2^{-2}+\dots+2^{-N}\quad\text{this gives}\,{a = 2^{-1}\,\text{and}\,{r} = 2^{-1}}\\
S_{N} &= \frac{a\left(1-r^{N}\right)}{1-r} = 1-2^{-N}\quad\text{substitute}\,a\,\text{and}\,{r}\,\text{simplify}\\
S_{N} &= 1 - 2^{-23}\approx{0.9999998807907104}\quad{N = 23}\,\blacksquare
\end{align*}
$$
As $N\rightarrow\infty$ the partial sum $S_{N}\rightarrow{1}$. However, we don't get exactly `1` numerically. Why? Because we truncate the series early, i.e., for a 32-bit number we run the series up to $N = 23$, which gives a value slightly less than `1`.
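
The closed form can be checked numerically. The sketch below (variable names are illustrative) compares the truncated series with $1 - 2^{-N}$, and also uses the built-in `prevfloat` function to confirm that the largest `Float32` below `2` matches the maximum significand:

```julia
N = 23
partial_sum = sum(2.0^(-i) for i in 1:N)   # truncated geometric series
closed_form = 1 - 2.0^(-N)                 # S_N = 1 - 2^(-N)

@assert partial_sum == closed_form
# the largest Float32 below 2 should be the maximum significand, 1 + S_N
@assert 1 + partial_sum == prevfloat(2.0f0)

println(partial_sum)
```

Every term is a power of two that `Float64` represents exactly, so the comparison with `==` is safe here.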

### Precision

The analysis above also gives us insight into the _precision_ of a `Float32` value, i.e., the number of significant decimal digits it can resolve. The precision of a `Float32` value is set by its $N = 23$ explicit fraction bits plus one implicit leading bit, so $p=24$. Then, the machine epsilon (the gap between `1.0` and the next representable float) is given by:
$$
\begin{align*}
\epsilon &= 2^{(1-p)}\quad\text{substitute}\,{p=24}\,\,\text{for a 32-bit number}\\
\epsilon & \approx 1.19209\times{10}^{-7}
\end{align*}
$$
This corresponds to $d\approx{-\log_{10}\epsilon}\approx{6.92}$ digits of precision, which rounds to $\approx{7}$ digits for a `Float32`.

In [22]:
let
    p = 24; # p = {24,53} for {32,64}-bit
    ϵ = 2.0^(1-p);
    d = -log10(ϵ) |> round
end

7.0
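
The epsilon formula can be compared with Julia's built-in `precision`, `eps`, and `nextfloat` functions. This is a minimal sketch with illustrative variable names; the last check shows that adding half of the gap to `1` is rounded away:

```julia
p = precision(Float32)          # number of significand bits, including the implicit one
ϵ = 2.0f0^(1 - p)               # machine epsilon from the formula above

@assert p == 24
@assert ϵ == eps(Float32)
@assert nextfloat(1.0f0) - 1.0f0 == ϵ   # gap between 1 and the next Float32
@assert 1.0f0 + ϵ / 2 == 1.0f0          # half the gap is rounded away

println(ϵ)
```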

## How big (small) can the scale be?
Next, let's think about the possible scale of a `Float32` given by: $\text{scale} = 2^{E - 127}$. To explore this question (in a first approximation where we ignore edge cases associated with representing $\pm\infty$ or NaNs), let's start by looking at the possible limits for the $E$ term in the memory layout for $x\in\mathbb{R}$ approximated as a `Float32`.
* __Idea__: The $E$ expression is computed from the summation of 8 bits (the exponent bits). If all the exponent digits $\left\{e_{0},e_{1},\dots,e_{7}\right\}$ were `0`, then the value of the $\text{scale} = 2^{-127}\approx{0}$. Alternatively, if all the exponent digits $\left\{e_{0},e_{1},\dots,e_{7}\right\}$ were `1`, then we'd get a maximum value for $E$. What is the maximum possible value?

Let's compute the maximum permissible value for $E$ numerically, and then think about what we should expect to see analytically. Store the maximum possible $E$ in the `max_possible_E::Float64` variable:

In [24]:
max_possible_E = let

    # initialize -
    d = Dict{Int64, Int64}();
    calculated_exponent_value = 0.0;
    b = 2.0; # binary, base = 2
    msb = 30; # most significant bit (msb)
    lsb = 23; # least significant bit (lsb)
    exponent_bit_range_array = range(lsb, stop=msb, step = 1) |> collect

    # all ones, gives max: set every exponent bit e_0 -> e_7 (keys 0 -> 7) to 1
    number_of_digits = length(exponent_bit_range_array); # how many digits do we have for the exponent E?
    for i ∈ 1:number_of_digits
        d[i-1] = 1;
    end

    for i ∈ eachindex(exponent_bit_range_array)
        calculated_exponent_value += d[i-1]*(b^(i - 1)) # why -1?
    end
    calculated_exponent_value
end

255.0

We get `255`; however, our logic has a flaw (the edge cases mentioned above)!
* The `255` case is reserved. An exponent bit sequence of all 1's (255) is a special code: if the fraction bits (the bit sequence in the significand calculation) are all zero, the value represents $\pm\infty$, and if any fraction bit is nonzero, it represents not a number (NaN). Hence, the maximum exponent for finite numbers is `254`.
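
The reserved patterns can be inspected directly. The sketch below builds three bit patterns by hand as `UInt32` hexadecimal literals (chosen here for illustration) and reinterprets them as `Float32` values:

```julia
pos_inf = reinterpret(Float32, 0x7f800000)   # E = 255, fraction all zeros
neg_inf = reinterpret(Float32, 0xff800000)   # same pattern with the sign bit set
a_nan   = reinterpret(Float32, 0x7fc00000)   # E = 255, nonzero fraction

println(pos_inf, " ", neg_inf, " ", a_nan)
println(bitstring(Inf32))                    # exponent field is all ones
```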

Given our _corrected_ `max_possible_E = 254` value, the maximum permissible scale will be: $2^{127}$!

In [26]:
max_scale = 2^(max_possible_E - 1 - 127) # wow! 

1.7014118346046923e38

Putting this all together (and ignoring the edge cases) gives a first-approximation (back-of-the-envelope) magnitude range for `Float32` of $\left|x\right|\approx\left[2^{-127},2^{127}\right]$, with about `7` decimal digits of precision.
* __However, in reality__: there is a more complex convention for handling the exponent $\left\{e_{0} = 0,e_{1} = 0,\dots,e_{7} = 0\right\}$ case which gives a different lower bound of $2^{-149}$. This involves [subnormal numbers](https://en.wikipedia.org/wiki/Subnormal_number), which is beyond the scope of our discussion in this example. Check it out if you are really interested; it's a cool convention! 
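
The back-of-the-envelope bounds can be compared against the limits Julia reports. In the sketch below, `floatmax`, `floatmin`, and `nextfloat` are built-in functions; the expressions they are compared against are the standard IEEE 754 single-precision values, stated here as a cross-check rather than derived in this example:

```julia
largest_finite = (2 - 2.0^(-23)) * 2.0^127   # maximum significand times maximum scale

@assert floatmax(Float32) == largest_finite
@assert floatmin(Float32) == 2.0^(-126)      # smallest positive normal value (E = 1)
@assert nextfloat(0.0f0) == 2.0^(-149)       # smallest positive subnormal value

println(floatmax(Float32), " ", floatmin(Float32), " ", nextfloat(0.0f0))
```

Note that the largest finite value is nearly twice the maximum scale $2^{127}$, because the significand can be almost `2`.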

__Should we have expected 255__? 

Yes, we should have! Let's look at the exponent summation. Assuming all eight exponent bits are `1`, we have (where $N$ is the index of the most significant bit):
$$
\begin{align*}
E &= \sum_{i=0}^{N}2^{i} = 1+2+2^{2}+\dots+2^{N-1}+2^{N}\quad\text{multiply by 2}\\
2E &= 2 + 2^{2}+\dots+2^{N}+2^{N+1}\quad\text{subtract}\,2E-E\\
E &= 2^{N+1} - 1\quad\text{substitute}\,{N = 7}\\
E &= 2^{8} - 1 = 255\quad\blacksquare
\end{align*}
$$
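
The closed form $2^{N+1} - 1$ holds for any exponent width. As a small sketch, the loop below checks it for exponent fields of 5, 8, and 11 bits, the widths used by `Float16`, `Float32`, and `Float64`; the variable names are illustrative:

```julia
# exponent field widths: 5 bits (Float16), 8 bits (Float32), 11 bits (Float64)
for nbits in (5, 8, 11)
    N = nbits - 1                       # index of the most significant exponent bit
    all_ones = sum(2^i for i in 0:N)    # every exponent bit set to 1
    @assert all_ones == 2^(N + 1) - 1
    println(nbits, " exponent bits -> E = ", all_ones)
end
```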