# L1d: A Deeper Dive into Numbers and Floating-Point Types
In this lab, we will drill into the representation of floating-point numbers.

We work with numerical data in science, technology, engineering, and mathematics (STEM) all the time. But have you ever stopped to wonder how computers actually represent (and manipulate) numbers? Let's explore the secret life of common floating-point data types, how they are represented in memory, and the precision associated with each type.

> __Key (surprising) fact__: Integer values on computers are __exact__ representations (within the range of the type). However, floating-point numbers are, in general, __approximations__! A floating-point type uses a _fixed number of bits_ to store a value, so it can only represent a finite set of rational values rather than the continuum of real numbers. Thus, a real number that is not in this set must be rounded to a nearby (by default, the nearest) floating-point value, so a stored floating-point value is usually an approximation of the number we intended.
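
To make the rounding idea concrete, here is a minimal sketch using only Base Julia. The variable names are illustrative and are not used elsewhere in this notebook.

```julia
# Decimal fractions such as 0.1 have no finite binary expansion, so they are rounded.
sum_is_exact = (0.1 + 0.2 == 0.3)        # false: the operands carry rounding error
sum_is_close = isapprox(0.1 + 0.2, 0.3)  # true: equal within a relative tolerance

# Sums of a few powers of two fit in the significand and are stored exactly.
dyadic_is_exact = (0.5 + 0.25 == 0.75)   # true

# The gap between 1.0 and the next representable Float64 is eps(Float64) = 2^-52.
gap = nextfloat(1.0) - 1.0
```

In practice, this is why comparisons between computed floating-point values often use `isapprox` (or `≈`) rather than `==`.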

The typical floating-point number types you will likely encounter in applications are `Float16`, `Float32`, and `Float64`. But what do these numbers mean? For example, what do the `16`, `32`, and `64` mean in `FloatXX`, what precision can these numbers describe, and how are they represented in memory?

**What we'll cover:** On conventional (non-quantum) hardware, floating-point numbers are stored as binary values (base `2`). Let's start by reviewing base-`b` positional notation, then examine integer bitstrings and finally dig into floating-point formats.


___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [4]:
include(joinpath(@__DIR__, "Include.jl")); # what is this doing?

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

___

## Theory: Base b representation of numbers
A number in base $b$ is represented by a finite sequence of digits $(d_{n}d_{n-1}\dots{d_{1}}d_{0})_{b}$ where each digit $d_{i}$ satisfies $0\leq d_{i} < b$. The value (in base 10) of a base-$b$ number is the positional sum:
$$
\begin{align*}
\underbrace{(d_{n}d_{n-1}\dots{d_{1}}d_{0})_{b}}_{\text{base b}} = \underbrace{\sum_{i=0}^{n}d_{i}b^{i}}_{\text{value in base 10}}
\end{align*}
$$
Let's use integers for a few examples to better understand this expression (and then we'll move on to floating-point numbers).
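
As a warm-up, the positional sum can be sketched directly in code. The helper `positional_value` below is hypothetical (it is not part of the lab code) and assumes the digits are listed most significant first, matching the notation $(d_{n}\dots d_{0})_{b}$.

```julia
# Hypothetical helper: evaluate the positional sum for digits [dₙ, …, d₁, d₀] in base b.
function positional_value(ds::Vector{Int}, b::Int)
    n = length(ds) - 1                  # index of the most significant digit
    value = 0
    for (k, d) in enumerate(ds)
        @assert 0 <= d < b "each digit must satisfy 0 ≤ d < b"
        value += d * b^(n - (k - 1))    # digit k (1-based) carries the power n-(k-1)
    end
    return value
end

v_binary = positional_value([1, 0, 1, 1], 2)  # (1011)₂ = 8 + 0 + 2 + 1
v_octal = positional_value([1, 2, 2], 8)      # (122)₈ = 64 + 16 + 2
```

Here `v_binary` is `11` and `v_octal` is `82`.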

Consider an `Int64` number. We know that memory storage on modern (non-quantum) hardware is binary, i.e., base $b = 2$; thus, all the digits $d_{i}$ must satisfy $0\leq d_{i} < 2$.

However, how many digits do we have? The sequence $d_{n}\dots d_{0}$ has $n+1$ digits, and this count is the _word size_, i.e., the `64` in `Int64` (so $n = 63$).

> __Hmmm__. Didn't we already see that? Yes — it's the length of the string output from [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring)! Let's count the number of zero digits and the number of one digits of a test 64-bit integer using the [`count_zeros(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.count_zeros) and the [`count_ones(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.count_ones).

To check the equality condition, we use the [Julia @assert macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert). If the statement passed to the [@assert macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) evaluates to `false`, i.e., the number of zeros and ones does not equal the `wordsize`, then an [AssertionError](https://docs.julialang.org/en/v1/base/base/#Core.AssertionError) is thrown, alerting us that there is an issue.

> __Note__: we use [the equality `==` operator](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) (not the assignment operator `=`). There is also [the `===` comparison operator](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) in Julia, which determines whether `x` and `y` are identical in the sense that no program could distinguish them. We'll see this operator later.
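
For a quick feel for the difference between `==` and `===`, here is a small sketch with illustrative variable names (not used elsewhere in this notebook).

```julia
a = [1, 2, 3]
b = [1, 2, 3]
same_contents = (a == b)       # true: the arrays hold equal values
same_object = (a === b)        # false: two distinct mutable arrays in memory

alias = a                      # a second name for the same array
alias_identical = (alias === a)  # true: no program could tell them apart

int_float_equal = (1 == 1.0)      # true: equal after numeric promotion
int_float_identical = (1 === 1.0) # false: different types (Int64 vs Float64)
```

For the word-size check that follows, value equality (`==`) is what we want.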

So, what do we see?

In [7]:
let
    wordsize = 64; # default word size
    x = 18; # pick an integer value (Int64 value by default)
    n = count_zeros(x) + count_ones(x); # this counts 0's and 1's (doesn't give any info about position)
    @assert wordsize == n # see https://docs.julialang.org/en/v1/base/base/#Base.@assert
end
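
The cell above produces no output because the assertion holds. To see what a failing assertion looks like, here is a sketch that deliberately asserts something false and catches the resulting `AssertionError`. The message text and the `outcome` variable are our own illustrative choices.

```julia
outcome = try
    # 18 = (10010)₂ has two 1 bits, so this assertion is false on purpose
    @assert count_ones(18) == 3 "18 should have three 1 bits"
    "assertion passed"
catch err
    err isa AssertionError || rethrow()   # only handle assertion failures
    "assertion failed: $(err.msg)"
end
```

Without the `try`/`catch`, the error would stop the cell and print the message.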

### Binary numbers
We can get the bit pattern (binary representation) of an integer by calling [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring), but there is a __wrinkle__.

> __Wrinkle__: the [`bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) returns the bit pattern as a `String`. We'll have to convert that `String` to an array of `0` and `1` to do any computation with these values. More on that shortly.
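
A short sketch of the wrinkle, using small integer types so the strings are easy to read (the variable names are illustrative):

```julia
s8 = bitstring(Int8(5))     # "00000101": 8 characters for an 8-bit integer
s16 = bitstring(Int16(5))   # 16 characters for a 16-bit integer

is_string = s8 isa String   # true: we get text, not numbers
first_char = s8[1]          # '0', a Char (not the integer 0)
```

Because each element is a `Char`, we need a conversion step before doing arithmetic with the bits.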

The positions of the `0` and `1` values in the binary number give the number's value. Suppose we get the bit pattern, i.e., the positions of the digits of some integer value `x::Int`, using [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) and save this value in the `s::String`.

In [9]:
sₒ,xₒ = let
    x = 100456; # Int64 value by default
    s = bitstring(x)
    s,x
end

("0000000000000000000000000000000000000000000000011000100001101000", 100456)

For the binary string $s$, we sum powers of 2 (the $b^{i}$ terms in the sum) for positions whose digit is `1`, processing the string from right to left. Let's make this more concrete.

> __Hypothesis__: We should be able to process the string `s` (compute its positional sum) and recover the integer that generated it. To do this we'll use a few techniques we haven't covered yet. Don't worry about the implementation for now.  

To check our hypothesis, we need to do a few things. The first is to convert the bit pattern in `s::String` into an array of numbers (so we can compute the positional sum). 

The following logic contains a few advanced things, e.g., working with arrays and [`String` and `Char` types](https://docs.julialang.org/en/v1/manual/strings/#man-strings), function piping (`|>`), etc.; don't worry too much about the details yet:

In [11]:
bit_pattern_array = bitstring(xₒ) |> collect |> reverse .|> x-> parse(Int,x) # This is a magical line!

64-element Vector{Int64}:
 0
 0
 0
 1
 0
 1
 1
 0
 0
 0
 0
 1
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
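
The one-line pipeline above can be unpacked one stage at a time. This sketch uses the 8-bit value `Int8(6)` so every intermediate result fits on a line; the intermediate names are ours.

```julia
s = bitstring(Int8(6))                    # "00000110", most significant bit first
chars = collect(s)                        # Vector{Char}: one element per character
reversed = reverse(chars)                 # least significant bit first
bits = [parse(Int, c) for c in reversed]  # convert each Char to an Int
```

The result `bits` is `[0, 1, 1, 0, 0, 0, 0, 0]`, so `bits[1]` holds digit $d_{0}$. The broadcast pipe `.|>` in the original line plays the role of the comprehension here: it applies the parsing function to each element.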

__Hmmm__. Okay — we can convert the string to an `Array{Int64,1}`, which is good. However, arrays in Julia are `1`-based, meaning the first index in the array occurs at index `1`. But our positional expressions assume zero-based indexing (first value at index `0`). Can we make a zero-based array in Julia?

* __Hack__: Yes — we can copy `bit_pattern_array` into a dictionary (which we can make 0-based), called `bit_pattern_dictionary::Dict{Int64,Int64}`. This allows us to start counting from 0 instead of 1.
* __Proper solution__: In addition to this hack (which is convenient), a cleaner solution is to use [an `OffsetArray` from the `OffsetArrays.jl` package](https://github.com/JuliaArrays/OffsetArrays.jl) to fix the 1-based indexing.

Let's use the `0`-based dictionary hack to populate a dictionary with the `bit_pattern_array` values (indexed from zero).

In [13]:
bit_pattern_dictionary = let
    bit_pattern_dictionary = Dict{Int64,Int64}(); # Declare memory
    for i ∈ eachindex(bit_pattern_array)
        bit_pattern_dictionary[i-1] = bit_pattern_array[i] # what are we doing here?
    end
    bit_pattern_dictionary; # return the 0-based mapping
end

Dict{Int64, Int64} with 64 entries:
  5  => 1
  56 => 0
  35 => 0
  55 => 0
  60 => 0
  30 => 0
  32 => 0
  6  => 1
  45 => 0
  4  => 0
  13 => 0
  54 => 0
  63 => 0
  62 => 0
  58 => 0
  52 => 0
  12 => 0
  28 => 0
  23 => 0
  41 => 0
  43 => 0
  11 => 1
  36 => 0
  39 => 0
  7  => 0
  ⋮  => ⋮

In [14]:
bit_pattern_dictionary[1]

0
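
The same 0-based mapping can also be built in one expression with a generator. This is a sketch on a short stand-in array; `bits` and `zero_based` are illustrative names.

```julia
bits = [0, 0, 0, 1, 0, 1]   # least significant bit first (a short stand-in array)

# enumerate yields (1, bits[1]), (2, bits[2]), …; shifting the index by one gives 0-based keys
zero_based = Dict(i - 1 => bit for (i, bit) in enumerate(bits))

d0 = zero_based[0]   # digit at position 0
d3 = zero_based[3]   # digit at position 3
```

Note that a `Dict` is unordered, which is why the printed entries above do not appear in key order.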

Finally, let's compute the positional sum and see what our number is.

In [16]:
let

    b = 2; # What base do we have?
    count = 0; # if this works, when we are finished, this should be our original number
    positions = keys(bit_pattern_dictionary) |> collect |> sort; # what is going on here? (we're iterating the 0-based bit_pattern_dictionary)
    for i ∈ positions
        dᵢ = bit_pattern_dictionary[i];
        count+= (dᵢ)*(b^i) # what is += doing?
    end
    println("Was your original number $(count)?")
end

Was your original number 100456?
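
For comparison, here is a more compact sketch of the same check that skips the dictionary and shifts the array index instead. As a cross-check, it also asks `parse` to read the bit string as a base-2 number. The variable names are ours, and the example assumes a non-negative `Int64` value.

```julia
x = 100456
bits = [parse(Int, c) for c in reverse(bitstring(x))]   # least significant bit first

# position i (0-based) lives at array index i + 1
recovered = sum(bits[i + 1] * 2^i for i in 0:(length(bits) - 1))

# built-in cross-check: interpret the 64-character string as a base-2 integer
parsed = parse(Int, bitstring(x); base = 2)
```

Both `recovered` and `parsed` equal the original value of `x`.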


### Thought Question 1: Binary Representation
Why do you think computers use binary (base 2) instead of decimal (base 10) for representing numbers internally? What are the advantages and disadvantages of this choice? How might this affect the way we think about numerical computations in programming?

___

### Beyond binary numbers
There are many everyday applications for base $b>2$ numbers! Larger bases like decimal (base 10), dozenal (base 12), and sexagesimal (base 60) exist in everyday measurements and commerce. There are also a few others that you may encounter every day, but not realize it:
> __Hexadecimal (base 16)__ compactly encodes binary data such as color codes (for example, Cornell red is `#B31B1B`), while base 32 and base 64 encode arbitrary binary data (e-mail attachments, URLs, certificates) as printable characters.

Though higher bases require a more complex digit set, they dramatically shorten the representation of large values.
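
To see the shortening effect, we can count the digits of one value in several bases with Base Julia's `ndigits` and `string` functions. This is a small sketch; the variable names are illustrative.

```julia
n = 100456
binary_digits = ndigits(n, base = 2)     # digits needed in base 2
decimal_digits = ndigits(n, base = 10)   # digits needed in base 10
hex_digits = ndigits(n, base = 16)       # digits needed in base 16
hex_string = string(n, base = 16)        # the base-16 representation as text

# Each pair of hex characters in a color code is one 8-bit channel (0 to 255)
red = parse(Int, "B3", base = 16)
green = parse(Int, "1B", base = 16)
```

The same value needs 17 binary digits, 6 decimal digits, and only 5 hexadecimal digits (`"18868"`). The red channel of `#B31B1B` is 179, and the green channel is 27.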

#### Digits Example
Let's consider an octal (base 8) example. Instead of calling [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) (which always returns a base $b=2$ value), let's explore [the `digits(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.digits). The [`digits(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.digits) takes a `number`, a `base`, and a `pad` argument and returns the digits of `number` written in `base`, least significant digit first, zero-padded to at least `pad` digits.
 > __Octal__: Let's use [the `digits(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.digits) to get the digit pattern for $n = 16941$ written in `base = 8`, padded to 16 digits. Save this data in the `bit_pattern_array_octal::Vector{Int64}` variable.

In [21]:
bit_pattern_array_octal = digits(16941, base=8, pad=16) # base-8 digits of 16941, least significant digit first, padded to 16 digits

16-element Vector{Int64}:
 5
 5
 0
 1
 4
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

__Check__: Let's convert the octal number stored in the `bit_pattern_array_octal::Array{Int64,1}` variable back into base 10 by computing the positional sum in base 8.

In [23]:
let
    # initialize -
    bit_pattern_dictionary = Dict{Int64,Int64}();
    b = 8.0; # base 8 (octal) for this example
    wordsize = 16;
    foreach(i -> bit_pattern_dictionary[i-1] = bit_pattern_array_octal[i], 
        eachindex(bit_pattern_array_octal)); # compact syntax for building bit dict

    # loop -
    value = 0.0;
    bitrangearray = range(0,stop=(wordsize-1),step=1) |> collect;
    for i ∈ bitrangearray
         dᵢ = bit_pattern_dictionary[i];
         value += (dᵢ)*(b^i)
    end

    value
end

16941.0
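
Because `digits(...)` returns the least significant digit first, the positional sum is just a polynomial in the base. As a sketch of a shorter check, we can hand the digits to Base Julia's `evalpoly` (available in Julia 1.4 and later); the variable names are ours.

```julia
octal_digits = digits(16941, base = 8)   # [5, 5, 0, 1, 4], least significant first
recovered = evalpoly(8, octal_digits)    # 5 + 5*8 + 0*8^2 + 1*8^3 + 4*8^4
octal_string = string(16941, base = 8)   # the same digits as text, most significant first
```

Here `recovered` is the integer `16941`, and `octal_string` is `"41055"`.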

#### Thought Question 2: Number Bases
We explored binary (base 2) and octal (base 8) representations. How does the choice of base affect the compactness of representing large numbers? Can you think of real-world applications where using a base other than 10 might be advantageous? What challenges might arise when converting between different bases?

___

## Floating-point numbers
Now that we have seen how integers are laid out in memory, let's explore floating-point formats: `Float16`, `Float32`, and `Float64`. In particular, we'll look at the memory layout of `Float64`.

Why have multiple floating-point precisions?
>Using multiple floating-point types lets us balance precision and resource usage for different applications:
> * `Float16` (half-precision) minimizes memory footprint at the expense of precision — useful for large-scale machine learning inference or graphics where fine precision isn't critical.
> * `Float32` (single-precision) offers a good compromise of speed and accuracy for many numerical and real-time workloads.
> * `Float64` (double-precision) provides high precision and a wide exponent range needed in scientific computing, simulations, and financial modeling where rounding errors must be controlled.

If we need more precision than `Float64`, specialized packages (for example, `Quadmath.jl`) offer larger types such as `Float128`.
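
To put numbers on this trade-off, the sketch below queries each type for its storage size, its significand precision (including the implicit leading bit), and its machine epsilon. The `float_summary` name is illustrative.

```julia
float_summary = [
    (type = T,
     bits = 8 * sizeof(T),              # storage size: the XX in FloatXX
     significand_bits = precision(T),   # includes the implicit leading 1
     machine_eps = eps(T))              # gap between 1 and the next larger value
    for T in (Float16, Float32, Float64)
]

for row in float_summary
    println(row.type, ": ", row.bits, " bits, ", row.significand_bits,
            " significand bits, eps = ", row.machine_eps)
end
```

The significand widths are 11, 24, and 53 bits, which correspond to machine epsilons of $2^{-10}$, $2^{-23}$, and $2^{-52}$, respectively.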

<div>
    <center>
        <img src="figs/Fig-64-bit-label-pattern.svg" width="580"/>
    </center>
</div>

### Example: Memory Layout Float64
Suppose we have a real number $x\in\mathbb{R}$ that is approximated by a 64-bit floating-point value in memory. A normalized 64-bit value is encoded in memory as:
$$
\begin{align*}
x = \underbrace{S}_{\text{sign}}\times\underbrace{\text{significand}}_{\text{fraction}}\times\underbrace{{2^{E-1023}}}_{\text{scale}}
\end{align*}
$$
where:
$$
\begin{align*}
S &= (-1)^{d_{63}}\\
\text{significand} &= 1 + \sum_{i = 1}^{52}d_{52-i}2^{-i}\\
E &= \sum_{i=52}^{62}d_{i}2^{i - 52}
\end{align*}
$$
The 64- and 32-bit formats differ in the number of bits allocated to the significand and exponent and in the position of the sign bit; otherwise they follow the same structural layout.

Let's compute the components of an example 64-bit floating-point value and see if we can reconstruct the original number. First, choose a test value for $x$:

In [117]:
x = 65.78912; # example 64-bit floating-point value (Float64 by default)

Next, we'll use [the `bitstring(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.bitstring) to get the 64-bit binary String, then we'll convert that into a 0-based bit pattern dictionary which we save in the `d::Dict{Int64,Int64}` variable:

In [119]:
d = let

    # initialize -
    bitpattern_dictionary = Dict{Int64,Int64}();
    wordsize = 64; # how big is the word size?
    a = bitstring(x) |> reverse |> collect .|> v-> parse(Int64,v) # fancy. Nothing to see here, move along (for now anyway).
    
    # put stuff in the bit pattern dictionary
    for i ∈ 0:(wordsize-1)
        bitpattern_dictionary[i] = a[i+1];
    end
    bitpattern_dictionary # return the dictionary
end

Dict{Int64, Int64} with 64 entries:
  5  => 1
  56 => 0
  35 => 0
  55 => 0
  60 => 0
  30 => 1
  32 => 0
  6  => 0
  45 => 1
  4  => 0
  13 => 1
  54 => 1
  63 => 0
  62 => 1
  58 => 0
  52 => 1
  12 => 0
  28 => 1
  23 => 0
  41 => 1
  43 => 0
  11 => 0
  36 => 0
  39 => 1
  7  => 1
  ⋮  => ⋮

#### Sign term
Now that we have the bit pattern dictionary `d::Dict{Int64, Int64}`, we can compute the components of the 64-bit floating point number. Let's start with the sign value `S::Int64`:

In [121]:
S = let
    s = d[63]; # sign bit is at d63
    S = (-1)^s # if d63 = 1, we'll have a negative number, d63 = 0 gives us a positive number
end

1
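
Base Julia also exposes the sign bit directly through `signbit`. Here is a short sketch comparing a positive and a negative value (the variable names are illustrative):

```julia
positive_sign = signbit(65.78912)       # false: bit 63 is 0
negative_sign = signbit(-65.78912)      # true: bit 63 is 1
leading_bit = bitstring(-65.78912)[1]   # the first character of the string is bit 63
negative_zero = signbit(-0.0)           # true: the format stores a signed zero
```

Flipping the sign of a `Float64` changes only this one bit; the exponent and fraction bits stay the same.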

#### Significand
Next, we'll compute the significand using the expression above. We'll also check our computed value using [the `significand(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.Math.significand) to make sure we are correct. We'll store our calculated value in the `calculated_significand_value::Float64` variable:

In [123]:
calculated_significand_value = let

    calculated_significand_value = 0.0;
    b = 2.0; # binary, base = 2
    lsb = 1; # first summation index (i = 1 maps to bit d[51], weight 2^-1)
    msb = 52; # last summation index (i = 52 maps to bit d[0], weight 2^-52)
    significand_range_array = range(lsb,stop=msb,step=1) |> collect; # range of digits used for the fraction

    # loop: process each bit in the significand_range_array -
    for i ∈ significand_range_array
        calculated_significand_value += (b^(-i))*d[msb-i]
    end
    calculated_significand_value + 1 # don't forget to add 1
end

1.027955

__Check__: Let's use [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) to check our calculated significand value against the output of [the `significand(...)` method](https://docs.julialang.org/en/v1/base/numbers/#Base.Math.significand) using [the `==` comparison operator](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators). 
> __What happens?__ If [the `==` comparison](https://docs.julialang.org/en/v1/manual/missing/#Equality-and-Comparison-Operators) comes back `false`, [an `AssertionError` is thrown](https://docs.julialang.org/en/v1/base/base/#Core.AssertionError) (and we know something is wrong with our calculation):

So what happens?

In [125]:
@assert abs(significand(x)) == calculated_significand_value # compare built-in versus our calculated value
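
Why the `abs(...)` in the check? The built-in `significand` keeps the sign of its argument, while our formula tracks the sign separately in `S`. A small sketch with an easy value (the names are ours):

```julia
m_pos = significand(8.0)     # 8.0 = 1.0 × 2^3
m_neg = significand(-8.0)    # the sign stays with the significand
p = exponent(-8.0)           # the unbiased power of two
rebuilt = m_neg * 2.0^p      # multiply back to recover the input
```

Here `m_pos` is `1.0`, `m_neg` is `-1.0`, `p` is `3`, and `rebuilt` is `-8.0`.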

#### Exponent scale term
Lastly, let's compute the exponent value $E$, which gives us the scale of the number. We'll save this value in the `E::Float64` variable:

In [127]:
E = let

    # initialize 
    calculated_exponent_value = 0.0;
    b = 2.0; # binary, base = 2
    lsb = 52; # least significant bit
    msb = 62; # most significant bit
    exponent_bit_range_array = range(lsb,stop=msb, step = 1) |> collect; # range of bits for E

    # loop: Let's process each of the bits in exponent_bit_range_array -
    for i ∈ exponent_bit_range_array
        calculated_exponent_value += d[i]*(b^(i-lsb))
    end
    calculated_exponent_value # return
end

1029.0
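
The stored value $E$ is a _biased_ exponent: the actual power of two is $E - 1023$. As a sketch, we can read the 11 exponent bits straight from the bit string and compare them with Base Julia's `exponent` function, which returns the unbiased power. The variable names are illustrative.

```julia
# Characters 2 through 12 of the string are bits 62 down to 52 (the exponent field)
exponent_bits_of_one = bitstring(1.0)[2:12]
E_one = parse(Int, exponent_bits_of_one, base = 2)   # 1.0 = 1.0 × 2^0, so E equals the bias

# 64 ≤ 65.78912 < 128, so the unbiased exponent is 6
E_x = exponent(65.78912) + 1023
```

Here `E_one` is `1023` and `E_x` is `1029`, consistent with the value computed above for $x = 65.78912$.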

#### Do we get the same number $x$?
If our implementation is correct, we should be able to reconstruct the original 64-bit value $x$ from its bit pattern.

> We'll use the `@assert` macro to compare our reconstructed value with the original `x`. If the comparison fails, an `AssertionError` will indicate an issue with the calculation.

In [144]:
let
    our_calculated_value = S*calculated_significand_value*2^(E - 1023);
    @assert our_calculated_value == x # same value for x?
end

In [146]:
our_calculated_value = S*calculated_significand_value*2^(E - 1023)

65.78912
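
As an alternative to the bit dictionary, the three fields can be extracted with integer shifts and masks after reinterpreting the `Float64` as a `UInt64`. This is a sketch for normalized values only (it ignores zeros, subnormals, `Inf`, and `NaN`), and all variable names are ours.

```julia
x_test = 65.78912
u = reinterpret(UInt64, x_test)          # same 64 bits, viewed as an unsigned integer

sign_bit = Int(u >> 63)                  # bit 63
E_field = Int((u >> 52) & 0x7ff)         # bits 62..52 (11 bits), the biased exponent
fraction = u & 0x000fffffffffffff        # bits 51..0 (52 bits)

significand_value = 1 + fraction / 2^52  # restore the implicit leading 1
value = (-1)^sign_bit * significand_value * 2.0^(E_field - 1023)
```

Every step here is exact in `Float64` arithmetic (we only scale by powers of two), so `value == x_test` holds without any tolerance.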

### Thought Questions: Floating-Point Precision
> We saw that floating-point numbers are approximations due to their fixed bit representation. How might this limitation affect scientific computations or financial calculations? When would you choose Float32 over Float64, and what are the trade-offs? Can you think of a scenario where floating-point precision could lead to unexpected results?

> How do the concepts of binary representation, number bases, and floating-point precision connect to real-world programming and scientific computing? Reflect on how understanding these low-level representations might change how you approach writing code or interpreting computational results. What new questions do you have about computer arithmetic after working through this notebook?

## Want some more?
If you want some more examples of the layout of floating point numbers, then check out this example notebook:

> [▶ Layout of a `Float32`](./CHEME-5800-L1d-Optional-Example-FunWith32BitFloatingPointTypes-Fall-2025.ipynb). In this example, we will analyze the layout of a `Float32` floating point variable in a computer's memory. This example is similar to our 64-bit floating point variable example (but we only have 32 bits to work with).

## Tests
In the code block below, we check some values in your notebook and give you feedback on which items are correct or different. Unhide the code block below if you are curious about how we implemented the tests and what we are testing.

In [134]:
let
    @testset verbose = true "CHEME 4/5800 L1d Test Suite" begin

        @testset "Integer bitstring properties" begin
            x = 32
            @test length(bitstring(x)) == 64
            @test count_zeros(x) + count_ones(x) == 64
        end

        @testset "Binary positional sum recovery" begin
            x = 45678
            s = bitstring(x)
            bit_pattern_array = bitstring(x) |> collect |> reverse .|> x-> parse(Int,x)
            bit_pattern_dictionary = Dict{Int64,Int64}()
            for i ∈ eachindex(bit_pattern_array)
                bit_pattern_dictionary[i-1] = bit_pattern_array[i]
            end
            
            count = 0
            b = 2
            positions = keys(bit_pattern_dictionary) |> collect |> sort
            for i ∈ positions
                dᵢ = bit_pattern_dictionary[i]
                count += (dᵢ)*(b^i)
            end
            @test count == x
        end

        @testset "Octal conversion" begin
            n = 74
            base = 8
            pad = 16
            bit_pattern_array_octal = digits(n, base=base, pad=pad)
            bit_pattern_dictionary = Dict{Int64,Int64}()
            foreach(i -> bit_pattern_dictionary[i-1] = bit_pattern_array_octal[i], 
                eachindex(bit_pattern_array_octal))
            
            value = 0.0
            b = 8.0
            wordsize = 16
            bitrangearray = range(0,stop=(wordsize-1),step=1) |> collect
            for i ∈ bitrangearray
                 dᵢ = bit_pattern_dictionary[i]
                 value += (dᵢ)*(b^i)
            end
            @test value == n
        end

        @testset "Floating-point reconstruction" begin
            x = 3.1415926535897
            d = let
                bitpattern_dictionary = Dict{Int64,Int64}()
                wordsize = 64
                a = bitstring(x) |> reverse |> collect .|> v-> parse(Int64,v)
                for i ∈ 0:(wordsize-1)
                    bitpattern_dictionary[i] = a[i+1]
                end
                bitpattern_dictionary
            end
            
            S = (-1)^d[63]
            calculated_significand_value = let
                calculated_significand_value = 0.0
                b = 2.0
                lsb = 1
                msb = 52
                significand_range_array = range(lsb,stop=msb,step=1) |> collect
                for i ∈ significand_range_array
                    calculated_significand_value += (b^(-i))*d[msb-i]
                end
                calculated_significand_value + 1
            end
            
            E = let
                calculated_exponent_value = 0.0
                b = 2.0
                lsb = 52
                msb = 62
                exponent_bit_range_array = range(lsb,stop=msb, step = 1) |> collect
                for i ∈ exponent_bit_range_array
                    calculated_exponent_value += d[i]*(b^(i-lsb))
                end
                calculated_exponent_value
            end
            
            our_calculated_value = S*calculated_significand_value*2^(E - 1023)
            @test our_calculated_value == x
        end
    end
end;

[0m[1mTest Summary:                    | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
CHEME 4/5800 L1d Test Suite      |    5      5  0.1s
  Integer bitstring properties   |    2      2  0.0s
  Binary positional sum recovery |    1      1  0.0s
  Octal conversion               |    1      1  0.0s
  Floating-point reconstruction  |    1      1  0.0s
