# Numerical Limitations

In [None]:
import decimal
import math
import struct

import matplotlib.pyplot as plt
import numpy as np

## Approximations in scientific computation

### Basic concepts

#### Absolute Error and Relative Error

**Absolute Error** = approximate value - true value

**Relative Error** = $\frac{\rm{absolute\,error}}{\rm{true\,value}}$

Another interpretation of relative error is that if an approximate value has a relative error of about $10^{-p}$, then its decimal representation has about $p$ correct **significant digits** (the leading nonzero digit and the $p-1$ following  digits).
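
As a small illustration of these definitions, the sketch below computes both errors for an approximation of $\pi$ and converts the relative error into an estimated number of correct significant digits. The value `3.1416` is an arbitrary approximation chosen only for this example.

```python
import math

true_value = math.pi
approx_value = 3.1416  # hypothetical approximation, chosen for illustration

absolute_error = approx_value - true_value
relative_error = absolute_error / true_value

# A relative error of about 10**(-p) corresponds to about p correct digits.
correct_digits = -math.log10(abs(relative_error))

print(f"absolute error: {absolute_error:.3e}")
print(f"relative error: {relative_error:.3e}")
print(f"estimated correct significant digits: {correct_digits:.1f}")
```

The estimate is not an integer; it should be read as "between 5 and 6 correct digits", which matches comparing 3.1416 with 3.14159265...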

#### Precision and Accuracy

**Precision**: the number of digits with which a number is expressed.

**Accuracy**: the number of *correct* significant digits in an approximation of the desired quantity.

> **Example**
>
> 3.25260376469 is a very precise number but is not very accurate as an approximation for $\pi$. Computing a quantity using a given precision does not necessarily mean that the result will be accurate to that precision!

#### Truncation and Rounding error

**Truncation error**: The difference between the true result and the result given by an algorithm using exact arithmetic. It is due to approximations such as truncating an infinite series or replacing derivatives by finite differences.

**Rounding error**: The difference between the result produced by a given algorithm using exact arithmetic and the same algorithm, using finite-precision, rounded arithmetic. 
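
The interplay between the two error types can be sketched with a forward finite difference, which approximates a derivative. This is an illustration under assumptions not taken from the text above: the test function ($\sin$ at $x=1$), the step sizes, and the helper name `forward_difference` are all chosen for this example. For large steps the truncation error dominates, for very small steps the rounding error takes over.

```python
import math


def forward_difference(f, x, h):
    """Approximate f'(x) by the forward difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


x0 = 1.0
exact = math.cos(x0)  # exact derivative of sin at x0

errors = {}
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    errors[h] = abs(forward_difference(math.sin, x0, h) - exact)
    print(f"h = {h:.0e}   total error = {errors[h]:.3e}")
```

The total error first decreases with $h$ (less truncation error) and then increases again (more rounding error in the subtraction).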

## Computer arithmetic

### Floating-point number systems

In a digital computer, numbers are represented by a *floating-point* number system, which resembles *scientific notation* in which a number is expressed as a number of moderate size times an appropriate power of ten, e.g. $0.0007396$ can be written as $7.396\times10^{-4}$. In this format, the decimal point moves, or *floats* as the power of 10 changes.

A floating-point number system is characterized by four integers:

| symbol &nbsp;      | name             |
|--------------------|------------------|
|$\beta$             | Base or radix    |
|$p$                 | Precision        |
|$[L,U]$             | Exponent range   |

Any floating-point number then has the form

$\pm \left(d_0+\frac{d_1}{\beta}+\frac{d_2}{\beta^2}+\cdots+\frac{d_{p-1}}{\beta^{p-1}}\right)\beta^E$
where $d_i$ and $E$ are integers such that $0\leq d_i \leq \beta-1$ and $L \leq E\leq U$.


A floating-point system is **normalized** if the leading digit $d_0$ is always nonzero (unless the number represented is zero). In a binary system ($\beta=2$) this means that $d_0$ always equals 1.

This is advantageous because
- Each number has a unique representation.
- No digits are wasted on leading zeros, thereby maximizing precision.
- In a binary system ($\beta=2$), the leading bit is always 1, and thus need not be stored, thereby gaining an additional bit of precision.

The two most important systems in use are the IEEE single precision (SP) and double precision (DP) standards with:

| System &nbsp;   | $\beta$ | $p$ | $L$ | $U$|
|-----------------|---------|-----|-----|----|
|IEEE SP          |2        |24   | -126| 127|
|IEEE DP          |2        |53   |-1022|1023|


The single-precision binary floating-point exponent is encoded using an **offset-binary representation** with an offset (bias) of 127: the stored 8-bit integer equals $E + 127$.

> **Example** 
>
> Single precision numbers are stored in 4 bytes (or 32 bits), used as follows:
> - 1 sign bit
> - 8 bits for the exponent (ranging from -126 to 127)(the exponent sets the order of magnitude of the float)
> - 23 bits for the mantissa (the mantissa sets the actual precise value of the float)
>
> The exponent is calculated by taking -127 and then adding $2^n$ for every 1 in the 8 bits, where $n$ is the bit position counted from the right, starting at 0.
> For example, 01011001 gives $-127 + 1 + 8 + 16 + 64 = -38$.
>
> For instance the number 0.75 is stored as
>
> 0 01111110 10000000000000000000000
>
>The float is calculated by multiplying a **number** (called the **mantissa**) with a certain **sign** by 2 raised to a certain **power**.
>
>The first digit defines the **sign** of the number: 1 meaning negative and 0 meaning positive. In this case we have a 0, so the float will be positive.
>
>The following 8 digits define the **power**. The 8 digits refer to powers of 2. The first of them corresponds to $2^7$, the second to $2^6$ and so on until the eighth digit, which refers to $2^0$. To find the exponent we sum these powers of 2 wherever the corresponding digit is 1. This sum is then added to -127; the result is the exponent. In this case the exponent E is given by:
>
>$E = -127 + (2^1 + 2^2 + 2^3 + 2^4 + 2^5 +2^6) = -127 + 126 = -1$
>
>***Note:** The smallest value for the exponent is L = -126 corresponding to the digits 00000001. The largest value for the exponent is U = 127 corresponding to the digits 11111110. The digits 00000000 and 11111111 do not encode ordinary exponents: they are reserved for special values (zero and denormalized numbers, and `Inf` and `NaN`, respectively).*
>
>The last 23 digits define the **mantissa** which is calculated by:
>
>$\left(1+\frac{d_1}{2}+\frac{d_2}{4}+\cdots+\frac{d_{23}}{2^{23}}\right)$
>
>In this equation, $d_1$ to $d_{23}$ refer to the last 23 digits, each being a 1 or a 0. In this case the following is found:
>
>$\left(1+\frac{1}{2}+\frac{0}{4}+\cdots+\frac{0}{2^{23}}\right) = 1.5$
>
>So in total we get:
>
>$\left(1+\frac{1}{2}+\frac{0}{4}+\cdots+\frac{0}{2^{23}}\right) 2^{(-127+126)}=1.5\cdot 2^{-1}=0.75$
>

> If we were to store $d_0$ explicitly (and allow for **denormalized** numbers with $d_0=0$), the representation would read
>
> 0 &nbsp;01111110&nbsp; **1**1000000000000000000000 
>
> Note that by adding the $d_0$ bit (in bold), we lost a bit of precision ($d_{23}$) because we only have 23 bits in total in the mantissa. 
>
>However, the same number could also be written as
>
> 0 &nbsp;01111111&nbsp; **0**1100000000000000000000
>
> corresponding to $+\left(0+\frac{1}{2}+\frac{1}{4}+\cdots+\frac{0}{2^{22}}\right) 2^{(-127+127)}=0.75\cdot 2^{0}=0.75$
>
> The representation is then no longer unique: we've lost a bit of precision and gained nothing.
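
The decoding steps of the example above (sign, offset exponent, mantissa with implicit leading 1) can be sketched as a small function. The name `decode_single` is hypothetical and introduced only for this illustration; the sketch handles normalized numbers only and rejects the two reserved exponent patterns. The result is cross-checked against `struct.unpack`.

```python
import struct


def decode_single(bits):
    """Decode a normalized IEEE single-precision bit string into a float."""
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    stored_exponent = int(bits[1:9], 2)
    if stored_exponent in (0, 255):
        raise ValueError("exponent patterns 00000000 and 11111111 are reserved")
    exponent = stored_exponent - 127  # remove the offset of 127
    mantissa = 1.0 + int(bits[9:], 2) / 2**23  # implicit leading 1
    return sign * mantissa * 2.0**exponent


# Bit pattern of 0.75: sign 0, exponent 01111110, mantissa 1 followed by 22 zeros
bits_075 = "0 01111110 1" + "0" * 22
value = decode_single(bits_075)

# Cross-check by letting struct interpret the same 32 bits
packed = int(bits_075.replace(" ", ""), 2).to_bytes(4, "big")
reference = struct.unpack("!f", packed)[0]
print(value, reference)
```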

In [None]:
print(
    "Python itself only supports double precision floats,\n"
    "but NumPy allows the use of several different data types,\n"
    "including single precision floats.\n"
)


def bit_rep(num):
    """Translate a number into its bit representation."""
    return "".join(
        bin(c).replace("0b", "").rjust(8, "0") for c in struct.pack("!f", num)
    )


print("np.single(0.75) has bit representation of:", bit_rep(np.single(0.75)))

> **Exercise**
>
> Convert this 32 bit single precision number to its decimal representation.
>
> 0 10000000 01101010000010011110011

### Properties

A floating-point number system is finite and discrete. The number of normalized floating-point numbers in a given system is
$2 (\beta-1)\beta^{p-1}(U-L+1)+1$
- $2$ choices of sign
- $(\beta-1)$ choices for the leading digit of the mantissa $(=d_0)$ 
- $\beta^{p-1}$ because there are $\beta$ choices for each of the remaining $p-1$ digits of the mantissa
- $(U-L+1)$ possible values for the exponent ($+1$ because the boundaries of [L, U] are being counted)
- $+1$ because the number could be zero

The smallest positive normalized number (the **underflow level**): if all the bits in the mantissa part and all but the last bit in the exponent part are 0, the number equals

$\beta^L$

The largest number (the **overflow level**) equals

$\beta^{U+1}(1-\beta^{-p})$


>**Short derivation of the overflow level:**
>
>The largest possible value is the value where the upper limit U is reached and all values in the mantissa are equal to the maximum value of $(\beta-1)$ 
> 
> The overflow level thus is $\beta^{U}(\beta-1)\cdot \sum_{k=0}^{p-1}\beta^{-k}$: for an IEEE single precision binary number this would look like
>
>0 11111110 11111111111111111111111
>
> > If this was in ternary, all the ones would be two's, explaining the factor $\beta-1$.
>
> Note that the 8 exponent bits are not all 1: the pattern 11111111 is reserved for `inf` and `NaN`.
>
> Hence we find:
>
>$$
\beta^{U}(\beta-1)\cdot \sum_{k=0}^{p-1}\beta^{-k}
= \beta^{U} (\beta-1)\cdot \dfrac{\sum_{k=0}^{p-1}\beta^{k}}{\beta^{p-1}}
= \beta^{U}(\beta-1)\cdot \dfrac{1}{\beta^{p-1}}\cdot \dfrac{\beta^{p}-1}{\beta-1}
= \beta^{U+1} \cdot (1-\beta^{-p})
$$


>**Example**
>
> Let's look at two examples of how the overflow level is calculated.
>
> First, let's look at the situation where p = 24 and $\beta$ = 2.
>
> The overflow level is then: 
>
>$2^{U}\left(1+\frac{\mathbf{1}}{2}+\frac{\mathbf{1}}{4}+\cdots+\frac{\mathbf{1}}{2^{23}}\right)=2^{U+1}\left(\frac{1}{2}+\frac{1}{4}+\cdots+\frac{1}{2^{24}}\right)=2^{U+1}\left(1 - \frac{1}{2^{24}}\right)$
>
> - **Note:** for $\beta$ = 2 the $d_i$ are always either 0 or 1 since $0 \leq d_i \leq \beta -1$. The maximum value for the $d_i$ thus is 1. 
>
>For the second example, let's look at the situation where p = 50 and $\beta$ = 10.
>
> The overflow level is then:
>
>$10^{U}\left(9+\frac{\mathbf{9}}{10}+\frac{\mathbf{9}}{100}+\cdots+\frac{\mathbf{9}}{10^{49}}\right)=10^{U+1}\left(\frac{9}{10}+\frac{9}{100}+\cdots+\frac{9}{10^{50}}\right)=10^{U+1}\left(1 - \frac{1}{10^{50}}\right)$
>
> - **Note:** for $\beta$ = 10 the maximum value for the $d_i$ is 9. 
>
> As expected, for both cases the overflow level is given by: $\beta^{U+1}(1-\beta^{-p})$.

>**Example**
>
> Now, let's take a look at a *toy* floating point system with $\beta$=2, p=3, L=-1 and U=1.
>
> This system supports 25 floating-point numbers:
>
>$2 (\beta-1)\beta^{p-1}(U-L+1)+1 = 2 (2-1)2^{3-1}(1-(-1)+1)+1 = 25$
>
> The largest number is $\beta^{U+1}(1-\beta^{-p})=2^{2}(1-2^{-3})=3.5$
>
> The smallest positive normalized number is $\beta^L=2^{-1}=0.5$
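
The three formulas of this section can be checked numerically. The sketch below is only an illustration: the function name `system_properties` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used as a design choice so that $\beta^{U+1}$ does not overflow before it is multiplied by $(1-\beta^{-p})$. It reproduces the toy system ($\beta=2$, $p=3$, $L=-1$, $U=1$) and compares the IEEE DP values with `sys.float_info`.

```python
import sys
from fractions import Fraction


def system_properties(beta, p, L, U):
    """Return (count, underflow level, overflow level) of a normalized system."""
    count = 2 * (beta - 1) * beta ** (p - 1) * (U - L + 1) + 1
    underflow = Fraction(beta) ** L
    overflow = Fraction(beta) ** (U + 1) * (1 - Fraction(beta) ** (-p))
    return count, float(underflow), float(overflow)


toy = system_properties(2, 3, -1, 1)
double = system_properties(2, 53, -1022, 1023)

print("toy system (count, underflow, overflow):", toy)
print("IEEE DP underflow and overflow:", double[1], double[2])
print("sys.float_info min and max:    ", sys.float_info.min, sys.float_info.max)
```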

Floating-point numbers are not uniformly distributed throughout their range, but are equally spaced only between successive powers of $\beta$.
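
For IEEE numbers, the uneven spacing can be inspected with `np.spacing`, which returns the distance from a value to the next larger representable number of the same type. The sample values below are arbitrary choices for this illustration.

```python
import numpy as np

# The gap is constant between successive powers of 2 and doubles at each power.
for x in (1.0, 2.0, 4.0, 1024.0, 1e16):
    print(f"x = {x:<8g} gap to the next double: {np.spacing(x):.3e}")

# The same value in single precision has a much larger gap.
print("gap after np.float32(1.0):", np.spacing(np.float32(1.0)))
```

Near $10^{16}$ the gap between doubles is already 2, so not even all integers are representable there.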

> The toy system is shown in the graph below. Note that, although this system is extremely small, it is representative of all floating-point systems in that its numbers are unevenly spaced. To see the density grow, increase the number of mantissa bits in the code (`m` and `mantissa_bits_list`), e.g. to reach p=5 or p=8.

In [None]:
def generate_custom_floats():
    # Custom floating-point format parameters
    sign_bits = ["0", "1"]  # 1 bit for sign
    exponent_bits_list = [
        "01",
        "10",
        "11",
    ]  # 2 bits for exponent ('00' reserved for zero)
    mantissa_bits_list = ["00", "01", "10", "11"]  # 2 bits for mantissa
    bias = 2  # Exponent bias
    m = 2  # Number of mantissa bits (excluding implicit leading 1)
    float_values = []  # List to hold values
    bit_strings = []  # List to hold bit strings

    # Manually add zero (special case where exponent and mantissa are zero)
    zero_positive_bit_string = "0 00 00"
    zero_negative_bit_string = "1 00 00"
    zero_positive_value = 0.0
    zero_negative_value = -0.0
    float_values.append(zero_negative_value)
    bit_strings.append(zero_negative_bit_string)
    float_values.append(zero_positive_value)
    bit_strings.append(zero_positive_bit_string)

    for sign_bit in sign_bits:
        for exponent_bits in exponent_bits_list:
            exponent_value = int(exponent_bits, 2)
            E = exponent_value - bias
            for mantissa_bits in mantissa_bits_list:
                # Skip adding zero again
                if exponent_bits == "00" and mantissa_bits == "00":
                    continue
                mantissa_value = int(mantissa_bits, 2)
                mantissa = 1 + mantissa_value * (2**-m)
                value = (-1) ** int(sign_bit) * mantissa * (2**E)
                # Create bit string with spaces between sign, exponent, and mantissa
                bit_string = f"{sign_bit} {exponent_bits} {mantissa_bits}"
                float_values.append(value)
                bit_strings.append(bit_string)

    # Sort the lists based on the floating-point values
    sorted_indices = sorted(range(len(float_values)), key=lambda i: float_values[i])
    sorted_float_values = [float_values[i] for i in sorted_indices]
    sorted_bit_strings = [bit_strings[i] for i in sorted_indices]

    return sorted_float_values, sorted_bit_strings


def plot_custom_float_distribution():
    # Generate the floats and their corresponding bit strings
    float_values, bit_strings = generate_custom_floats()

    # Filter values and bit strings within the range -3.5 to 3.5 separately
    filtered_values = [v for v in float_values if -3.5 <= v <= 3.5]
    filtered_bits = [
        bit_strings[i] for i, v in enumerate(float_values) if -3.5 <= v <= 3.5
    ]

    # Print the sorted list with binary representations
    print("Sorted Floating-Point Numbers with Binary Representations:")
    for val, bits in zip(filtered_values, filtered_bits, strict=False):
        print(f"{val:6.3f}   {bits}")

    # Plotting
    plt.close("custom_float")
    fig, ax = plt.subplots(figsize=(12, 3), num="custom_float")
    ax.set_title("Distribution of Custom Floating-Point Numbers")
    ax.set_yticks([])
    ax.spines["left"].set_visible(False)

    # Plot the floating-point numbers
    ax.plot(
        filtered_values,
        [0] * len(filtered_values),
        marker="|",
        linestyle="None",
        markersize=10,
    )

    # Add horizontal line (x-axis)
    ax.axhline(y=0, color="black", linewidth=0.5)

    # Display all numbers on the x-axis
    ax.set_xticks(
        filtered_values, [f"{v:.3f}" for v in filtered_values], rotation=90
    )


plot_custom_float_distribution()

Real numbers that are exactly representable in a given floating-point system are called **machine numbers**. If a given number is not representable, it must be rounded to a "nearby" floating-point number. The error introduced by this approximation is called the **rounding error**. The most accurate and unbiased rounding rule (and the de facto standard today, also in the IEEE standards) is **round to nearest**, where a number is represented by its nearest floating-point number. In case of a tie, we use the number whose last stored digit is even. An alternative is **chopping**: the expansion in $\frac{d_i}{\beta^i}$ is truncated after the ($p$-1)st digit, i.e. after the term $\frac{d_{p-1}}{\beta^{p-1}}$.

The accuracy of a floating-point system is characterized by the **machine precision** (also called unit roundoff), $\epsilon_{\rm mach}$, which bounds the relative error made when rounding a real number to a machine number. Its value depends on the rounding rule being used. With rounding to nearest it equals
$\epsilon_{\rm mach}=\frac{1}{2}\beta^{1-p}$

>**Example**
>
> For the IEEE SP and DP systems, $\epsilon_{\rm mach}=2^{-24}\approx 10^{-7}$ and $\epsilon_{\rm mach}=2^{-53}\approx 10^{-16}$, respectively. These systems thus have about 7 and 16 decimal digits of precision.
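
The sketch below checks these values with NumPy. One caveat to keep in mind: `np.finfo(...).eps` reports the gap between 1 and the next larger number, $\beta^{1-p}$, which is twice the $\epsilon_{\rm mach}$ defined here for rounding to nearest. The dictionary `results` is just a helper for this illustration.

```python
import numpy as np

results = {}
for dtype, p in ((np.float32, 24), (np.float64, 53)):
    one = dtype(1.0)
    unit_roundoff = dtype(2.0**-p)  # epsilon_mach = 0.5 * 2**(1 - p)
    results[dtype.__name__] = {
        "finfo_eps": np.finfo(dtype).eps,
        # Adding epsilon_mach to 1 is a tie, which rounds back to 1 (even).
        "one_plus_unit_roundoff_is_one": bool(one + unit_roundoff == one),
        # Adding twice epsilon_mach reaches the next machine number.
        "one_plus_gap_exceeds_one": bool(one + dtype(2.0) * unit_roundoff > one),
    }

for name, info in results.items():
    print(name, info)
```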

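Although both values are small, $\epsilon_{\rm mach}$ should not be confused with the underflow level: $\epsilon_{\rm mach}$ is set by the number of digits in the mantissa, whereas the underflow level is set by the smallest exponent $L$.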

Finally, there are two additional special values to indicate exceptional situations:
- `Inf`, which stands for **infinity**, and results e.g. from dividing a non-zero number by zero.
- `NaN`, which stands for **not a number**, and results from an undefined operation such as $\frac{0}{0}$.
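
A short sketch of how these special values arise with NumPy scalars. Note that plain Python floats raise `ZeroDivisionError` instead of returning `inf` or `nan`; the `np.errstate` context is only used here to silence the warnings NumPy would otherwise emit.

```python
import math

import numpy as np

one = np.float64(1.0)
zero = np.float64(0.0)

with np.errstate(divide="ignore", invalid="ignore"):
    pos_inf = one / zero  # non-zero number divided by zero
    not_a_number = zero / zero  # undefined operation

print(pos_inf, not_a_number)
print("inf + 1 == inf:", pos_inf + 1 == pos_inf)
# NaN compares unequal to everything, including itself.
print("nan == nan:", not_a_number == not_a_number)
print(math.isinf(pos_inf), math.isnan(not_a_number))
```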

In [None]:
print(
    "The largest double precision number is about 1.8e308, "
    "larger numbers become infinity."
)
print(1.797e308, 1.798e308)
print("\n")

print(
    "The smallest positive double precision number is about 4.94e-324, "
    "numbers that are smaller than half this value get rounded to zero."
)
print(4.95e-324, 2.4e-324)
print("\n")

print("Because 0.1 is stored as: ")
print("\t", decimal.Decimal(0.1))
print("and 0.3 is stored as ")
print("\t", decimal.Decimal(0.3))
print(".1+.1+.1 does not equal .3")
print("As shown by testing .1+.1+.1 == .3, which gives:", 0.1 + 0.1 + 0.1 == 0.3)
print("Instead, it equals", 0.1 + 0.1 + 0.1)
print("which differs from .3 by", 0.1 + 0.1 + 0.1 - 0.3, "\n")

### Good Practices for computer arithmetic

**Cancellation**

- Avoid subtracting two almost identical numbers

**Addition**

- Avoid adding small and large numbers
- Perform a sequence of additions ordered from the smallest number to the largest (in magnitude)
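
The effect of the order of additions can be sketched in single precision, where the gap between neighbouring numbers near $10^8$ is 8. The values below are chosen for this illustration so that every intermediate result is easy to follow.

```python
import numpy as np

large = np.float32(1e8)
small = np.float32(1.0)

# Large number first: every single 1.0 is lost, because 1e8 + 1 rounds to 1e8.
large_first = large
for _ in range(1000):
    large_first = large_first + small

# Small numbers first: they accumulate to 1000 before meeting the large number.
small_first = np.float32(0.0)
for _ in range(1000):
    small_first = small_first + small
small_first = small_first + large

print("large number first:", float(large_first))
print("small numbers first:", float(small_first))
```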


In [None]:
def example_cancellation():
    print("Example of cancellation error")
    x = 0.1234567891234567890
    y = 0.1234567891234567
    print(
        "The real value of x - y is 8.9e-17, "
        "however python returns a number which is about 7% smaller"
    )
    print(x - y, "\n")


example_cancellation()

Consider the following partial sum of the alternating harmonic series

$$
S = \sum_{k=1}^{N} (-1)^{k+1} \frac{1}{k} \approx \ln(2)
$$

for large $N$.

In [None]:
# Function to sum in natural order with single precision
def alternating_harmonic_natural(N):
    total = np.float32(0.0)
    print("Summing alternating harmonic series in natural order (from 1 to N):")
    for k in range(1, N + 1):
        term = np.float32((-1) ** (k + 1) / k)
        total += term
    true_value = np.float32(math.log(2))  # True value is ln(2)
    relative_error = (total - true_value) / true_value
    print(f"\tTotal sum: {total:.16f}")
    print(f"\tRelative error: {relative_error:.16e}\n")
    return total


# Function to sum in reverse order with single precision
def alternating_harmonic_reverse(N):
    total = np.float32(0.0)
    print("Summing alternating harmonic series in reverse order (from N to 1):")
    for k in range(N, 0, -1):
        term = np.float32((-1) ** (k + 1) / k)
        total += term
    true_value = np.float32(math.log(2))  # True value is ln(2)
    relative_error = (total - true_value) / true_value
    print(f"\tTotal sum: {total:.16f}")
    print(f"\tRelative error: {relative_error:.16e}\n")
    return total


# Function to sum positive and negative terms separately in single precision
def alternating_harmonic_grouped(N):
    print("Summing positive and negative terms separately:")
    total_positive = np.float32(
        sum(np.float32(1.0 / k) for k in range(1, N + 1, 2))
    )  # Positive terms
    total_negative = np.float32(
        sum(np.float32(-1.0 / k) for k in range(2, N + 1, 2))
    )  # Negative terms
    total = total_positive + total_negative
    true_value = np.float32(math.log(2))  # True value is ln(2)
    relative_error = (total - true_value) / true_value
    print(f"\tTotal sum: {total:.16f}")
    print(f"\tRelative error: {relative_error:.16e}\n")
    return total


# Main block to execute all summation methods in single precision
N = 10000000
print(
    f"Computing the alternating harmonic series up to N = {N} in single precision\n"
)
print(f"The true value is ln(2) ≈ {np.float32(math.log(2)):.16f}\n")

# Compute and compare the results
sum_natural = alternating_harmonic_natural(N)
sum_reverse = alternating_harmonic_reverse(N)
sum_grouped = alternating_harmonic_grouped(N)

Even relatively simple mathematical problems can exhibit numerical issues when computed using finite-precision arithmetic. A classic example is the **quadratic formula** used to find the roots of a quadratic equation.

Consider the quadratic equation:

$$
ax^2 + bx + c = 0
$$

Its solutions are given by:

$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$

However, naïvely implementing this formula can lead to numerical problems such as **overflow**, **underflow**, and **catastrophic cancellation**, especially when the coefficients are very large or very small in magnitude.

- When $b^2$ is much larger than $ 4ac $, the square root of the discriminant, $ \sqrt{b^2 - 4ac} $, is nearly equal to $ |b| $. For one of the two roots, the numerator $-b \pm \sqrt{b^2 - 4ac}$ then subtracts two nearly equal numbers, which can cause a significant loss of precision due to catastrophic cancellation.
  
- To compute that root more accurately, use the following rearranged formula:

    $$
    x = \frac{2c}{-b \mp \sqrt{b^2 - 4ac}}
    $$

    which has the flipped sign in the denominator. For each root, choose the formula in which $-b$ and the square-root term have the same sign, so that no subtraction of two nearly equal numbers occurs.
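
A minimal sketch of this idea, assuming real roots and $a \neq 0$, $c \neq 0$. The function names and the coefficients ($a=1$, $b=10^8$, $c=1$) are hypothetical and chosen so that $b^2 \gg 4ac$. The stable variant first computes the root without cancellation and then obtains the other one from the rearranged formula.

```python
import math


def quadratic_roots_naive(a, b, c):
    """Both roots from the textbook formula (may suffer from cancellation)."""
    sqrt_disc = math.sqrt(b * b - 4 * a * c)
    return (-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a)


def quadratic_roots_stable(a, b, c):
    """Both roots without subtracting nearly equal numbers."""
    sqrt_disc = math.sqrt(b * b - 4 * a * c)
    # b and copysign(sqrt_disc, b) have the same sign, so this sum is safe.
    q = -0.5 * (b + math.copysign(sqrt_disc, b))
    # q / a is the standard formula, c / q the rearranged one.
    return q / a, c / q


a, b, c = 1.0, 1e8, 1.0
naive_small, naive_large = quadratic_roots_naive(a, b, c)
stable_large, stable_small = quadratic_roots_stable(a, b, c)

# The small root is approximately -c / b = -1e-8.
print("naive small root: ", naive_small)
print("stable small root:", stable_small)
```

For these coefficients the naive small root is off by a large fraction of its value, while the stable one agrees with $-10^{-8}$ to nearly full double precision.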

**Function evaluations**: When implementing a function for evaluation, take care in how you write it to avoid numerical problems: 

E.g. when evaluating $ f(x) = \sqrt{x + 1} - \sqrt{x} $ for large $ x$ both terms are nearly equal, and their subtraction leads to loss of significant digits.

This can be solved by re-writing it as

$$
f(x) = \frac{(\sqrt{x + 1} - \sqrt{x})(\sqrt{x + 1} + \sqrt{x})}{\sqrt{x + 1} + \sqrt{x}} = \frac{1}{\sqrt{x + 1} + \sqrt{x}}
$$


In [None]:
x = 1e16  # Large value of x

# Original expression
f_original = math.sqrt(x + 1) - math.sqrt(x)

# alternative expression
f_alt = 1 / (math.sqrt(x + 1) + math.sqrt(x))

print("Original function result:", f_original)
print("Alternative function result:", f_alt)