# Numerical computation

1. Representation of integers in Python and NumPy
   - Maximum/minimum values, overflow
2. Concepts of
   - Absolute error
   - Relative error
   - Precision and representation error
   - Accuracy
3. Representation of floating point numbers in Python and NumPy
   - Maximum/minimum values, machine precision, overflow and underflow
   - Representation error
   - Roundoff error
   - Commutative, associative and distributive properties
   - Comparison of floating point values
   - Cancellation error
   - Summation example
   - Quadratic equation example
   - Extended precision
4. Classic, very detailed reference about floating point numbers: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf



In [None]:
import numpy as np

## Integers

### Division operation

In Python3, the integers are not closed under the `/` operator: dividing two integers with `/` always gives a floating point number, even if the result is a whole number.  Use the `//` operator to get an integer result.

In [None]:
print("Python3:    4/2 =", 4/2)
print("Python3:    3/2 =", 3/2, "    3//2 =", 3//2, "   1//2 =",1//2)

*NumPy behaves the same way*

In [None]:
a = np.arange(10)
print(a.dtype)
print(a)
print(a/2)
print(a/np.full(10,2))
print(a//2)
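
One detail worth checking for yourself is how `//` treats negative values. The short sketch below (plain Python, no NumPy needed) suggests that `//` rounds toward negative infinity, whereas converting the result of `/` with `int()` truncates toward zero.

```python
# Floor division rounds toward negative infinity, not toward zero
print(7 // 2, -7 // 2)          # 3 -4
print(int(7 / 2), int(-7 / 2))  # 3 -3  (int() truncates toward zero)

# divmod returns a quotient q and remainder r with a == q*b + r
q, r = divmod(-7, 2)
print(q, r)                     # -4 1
```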

### Allowed values of integers

Native Python3 integers can be arbitrarily large (positive or negative)

In [None]:
i = 1
print(" m       2**m             2**m in binary format")
print("--    ----------     -------------------------------------")

for iteration in range(35):
    print("{0:2d}  {1:12d}  {1:40b}".format(iteration,i))
    i*=2

In [None]:
np.int32(2**1024)

However, NumPy (and Python2 and nearly every other programming language you will use) has integers with a fixed size for compact storage and efficient computing, since this is what the computer hardware actually uses.

E.g., a 32-bit integer is a sequence of 32 bits (4 bytes) using two's complement representation (https://en.wikipedia.org/wiki/Two%27s_complement)

```
 [bit31, bit30, bit29, bit28, ..., bit2, bit1, bit0]
 ```

$i = - b_{31}*2^{31} +  b_{30}*2^{30} +  b_{29}*2^{29}+  b_{28}*2^{28} + \cdots + b_{2}*2^{2} + b_{1}*2^{1} + b_{0}*2^0 $

Most modern computers support in hardware 8-bit, 16-bit, 32-bit and 64-bit integers (with and without signs)

The default integer in NumPy is 64-bit on most platforms.

If you are curious, or if you ever need to do it by hand:
* Computing the binary representation of a positive number is straightforward.  In decreasing order, subtract powers of two less than the value.  For example, `265`.  The largest power of 2 less than 265 is $2^8=256$.  Subtracting 256 leaves $9$, which you can see is `8+1 = 2^3 + 2^0`, so the 32-bit binary representation of `265` is `00000000000000000000000100001001`.
* For a negative integer $-i$ the lower 31 bits are computed from $2^{31}-i$ and the 32nd bit is set.  So armed with the binary representation of +265 we can compute the two's-complement binary representation of -265 by first computing (in binary) $2^{31}-265$.

```
  10000000000000000000000000000000
- 00000000000000000000000100001001
= 01111111111111111111111011110111
```

and then setting the upper bit to obtain `11111111111111111111111011110111`.
* Notice that the representation of the negative value can also be obtained by "flipping" each bit of the positive value and then adding 1.
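
The recipe above can be checked with a few lines of Python. This is only a sketch: `twos_complement` is a hypothetical helper (not part of any library) that masks a Python integer down to 32 bits, and the flip-and-add-one step is done explicitly on the bit string of `265`.

```python
def twos_complement(value, width=32):
    """Return the width-bit two's-complement bit string of a Python int (hypothetical helper)."""
    mask = (1 << width) - 1            # 32 one-bits for width=32
    return format(value & mask, f"0{width}b")

positive = twos_complement(265)
negative = twos_complement(-265)

# Flip every bit of +265, then add 1
flipped = "".join("1" if bit == "0" else "0" for bit in positive)
flipped_plus_one = format(int(flipped, 2) + 1, "032b")

print(positive)
print(flipped)
print(flipped_plus_one)
print(negative == flipped_plus_one)
```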

**Exercise:** In two's-complement 32-bit format, what are the binary representations of 
* 3
* -3

If you want to check your answer look [here](https://www.exploringbinary.com/twos-complement-converter)


In [None]:
# Aside: Python can print binary representations of integers but does not internally use the
# two's-complement representation and so negative numbers appear just with a minus sign
print(f"{3:b}")
print(f"{-3:b}")

Here's a function to convert an integer into a string containing its binary 32-bit two's-complement representation

In [None]:
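def b(i):
    ''' Returns bit-string representation of a 32-bit two's-complement integer '''
    i = int(i)  # accept NumPy scalars; a Python int cannot overflow in the arithmetic below
    if i<-2**31 or i>=2**31:
        raise ValueError("value does not fit in a 32-bit signed integer")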
    s = "0"
    if i<0:
        s = "1"
        i += 2**31
    for q in range(30,-1,-1):
        s += ["0","1"][(i&(1<<q))>>q]
    return s

print(-2147483648, b(-2147483648)) # The most negative integer

In [None]:
b(-3)

In [None]:
# Test it on some integers from numpy
import numpy as np
i = np.array([3],np.int32)
print("{0:11d} {1}".format(i[0],b(i[0])))
i[0] = -3
print("{0:11d} {1}".format(i[0],b(i[0])))


In [None]:
x = 2147483647
print(x)
x = x + 1
print(x)

The fixed width of NumPy integers means they cannot be arbitrarily large. The largest positive value you can put in a 32-bit signed integer must fit into 31 bits which is

$2^{31}-1$ = 2147483647

And the largest magnitude negative value is 
$-2^{31}$ = -2147483648

For the default 64-bit integers the limits are $2^{63}-1$ = 9223372036854775807 and $-2^{63}$=-9223372036854775808
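
Rather than memorizing these limits, you can ask NumPy for them. The sketch below uses `np.iinfo`, the integer counterpart of `np.finfo`; the particular list of dtypes is just an illustrative choice.

```python
import numpy as np

# Print the representable range of several fixed-width integer types
for dtype in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__:>6}: {info.min} to {info.max}")
```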

**Exercise:** Explain the following behaviors

In [None]:
# Unexpected things can happen when you "overflow" a fixed-size integer

i = np.array([2**31-1],np.int32)
print("{0:11d} {1}".format(i[0],b(i[0])))
i += 1
print("{0:11d} {1}".format(i[0],b(i[0])))
print()
i = np.array([-2**31],np.int32)
print("{0:11d} {1}".format(i[0],b(i[0])))
i -= 1
print("{0:11d} {1}".format(i[0],b(i[0])))
print("{0:11d} {1}".format(-1,b(-1)))


In [None]:
# Explore increasing powers of 2
i = np.array([1],np.int32)
for iteration in range(40):
    print("{0:2d}  {1:11d}   {2}".format(iteration,i[0],b(i[0])))
    i *= 2

**Summary:** *these issues only affect very large integers in NumPy (or other programming languages including Python2) but most of the time (since you are usually working with small values) things will behave as you expect.*

## Absolute error, relative error, significant digits, precision, and accuracy

These are central concepts to numerical computation

**Absolute error** is the magnitude of the error between an approximate result and the exact one $\epsilon_{\text{abs}} = |x_\text{exact} - x_{\text{approx}}|$


In [None]:
x = (1.0/(1.0/7.0**0.5)**3)**(2/3) #  exact answer is 7
xexact = 7
print(x)
abserr = abs(x-xexact)
print(abserr)

While absolute error is very important, it requires that we have some understanding about how big an error is bad.  

For instance, an error of a million would be a lot if counting the population of a town, but would likely be regarded as small if measuring the distance to the sun in meters (149,597,870,700m).

**Relative error** can be defined as the ratio of the absolute error to the exact result 

$$ \epsilon_{\text{rel}} = |x_\text{exact} - x_{\text{approx}}| / |x_\text{exact}| $$

but other definitions are also useful depending on the circumstance, e.g.,

$$ \epsilon_{\text{rel}} = |x_\text{exact} - x_{\text{approx}}| / \text{max}(|x_\text{exact}|,|x_\text{approx}|) $$

$$ \epsilon_{\text{rel}} = |x_\text{exact} - x_{\text{approx}}| / |x_\text{approx}| $$

From the perspective of relative error we can see the above computation was quite accurate

In [None]:
relerr = abs(x-xexact)/xexact
print(relerr)

**Significant figures:** Relative error can also be interpreted as the number of significant figures or digits ($N$) in the value, where $N$ is computed as the largest integer such that 

$$ \epsilon_{\text{rel}} < 0.5 * 10^{-N} $$

or 

$$ N \approx \text{floor}( -\log_{10} (2* \epsilon_{\text{rel}}) ) $$

(`floor` rounds down towards $-\infty$; for the positive values that arise here this simply discards the fraction).


In [None]:
from math import log10
sigfig = int(-log10(2*relerr))
print(sigfig)


**Precision** is the number of digits used to store or compute a result. It is usually stated as a number of digits or as a relative error, and sometimes as an absolute error (e.g., if fixed-point arithmetic is being used instead of floating point).

For example, here is an approximation to $\pi$ (hint, it is not a particularly good approximation)
```
   piapprox = 3.1415243098283216
```
that has been specified with a precision of 17 digits.  We will see below that IEEE double-precision floating-point arithmetic has a precision of about 16 digits (and specifying a correctly rounded value can require a few more digits).

Note: if we had specified more digits they would have been discarded --- double-precision simply cannot hold any more information

In [None]:
piapprox = 3.1415243098283216
print(piapprox)
piapprox = 3.1415243098283216217439821748972198
print(piapprox)

**Representation error:** because of finite precision some values cannot be exactly represented --- this is usually not a problem but can lead to some unexpected outputs.

In [None]:
print(3.1415243098283217)  # focus on the last digit
print("%.17f" % 0.2)       # focus on the last digit

In [None]:
import math
math.sin(math.pi)          # pi is not exactly representable, so sin(pi) cannot be exactly zero

**Accuracy** is often stated as the number of significant figures in a value compared to the exact or true value. Sometimes relative or absolute accuracy might be used.

Clearly the attainable accuracy is limited by the precision of computation, but there may be other limits
*  finite accuracy of values input into a calculation
*  finite accuracy of an algorithm to compute something

Looking at the above approximate value for $\pi$ that was stated with a precision of 17 digits we can see that it is only accurate to a little more than 4 decimal digits.

In [None]:
print("approx:", piapprox)
print(" exact:", math.pi, "(i.e., the closest floating-point representation of pi)")
pirelerr = abs(piapprox-math.pi)/math.pi
print("relerr:", pirelerr)
N = int(-log10(2*pirelerr))
print("sigfig:", N)

## Floating point numbers 

Nearly all languages, including Python and NumPy, use the computer hardware supported IEEE 754 representation. 
* Note that libraries (e.g., MPFR) and packages (e.g., Maple or Mathematica) exist that provide greater precision at the price of loss of speed.  For Python, the module `mpmath` (http://mpmath.org/) provides extended precision arithmetic and we use this a bit further below.

E.g., in 64-bit (https://en.wikipedia.org/wiki/Double-precision_floating-point_format)

$$(-1)^s \times 1.m \times 2^{e-1023}$$

- 1 bit for sign ($s$)
- 11 bits for exponent ($e$)
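  - The exponent $e-1023$ ranges from -1022 to +1023 for normal numbers (the all-zeros and all-ones bit patterns of $e$ are reserved for zero/denormalized numbers and for $\infty$/`NaN`), which corresponds to about $\pm$308 in decimal.  Thus, numbers larger than circa $10^{308}$ will overflow, and numbers smaller than circa $10^{-308}$ will underflow (gradually due to denormalized representation)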
- 52 bits for mantissa ($m$) giving 53 bits of significand 
  - 53 bits since we know for a non-zero number that the leading binary digit is 1 so we don't bother storing that
  - $2^{-53} \approx$ `1.11e-16`
- Special values are reserved to represent
  - Signed zero
  - Overflow (number too large in magnitude to represent) --- $\pm \infty$
  - Not a number (result is not a valid number) --- `NaN`
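
To make the bit layout above concrete, here is a small sketch that pulls the three fields out of a Python float using the standard `struct` module. The helper name `decode` is made up for this illustration; it returns the sign bit, the biased exponent $e$ and the 52 stored mantissa bits as integers.

```python
import math
import struct

def decode(x):
    """Split a Python float into (sign, biased exponent, mantissa) -- hypothetical helper."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the raw 64 bits as an integer
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)     # 52 stored mantissa bits
    return sign, exponent, mantissa

for value in (1.0, -2.5, 0.0, -0.0, math.inf, math.nan):
    s, e, m = decode(value)
    print(f"{value!r:>5}  s={s}  e={e:4d}  m={m:052b}")
```

Note how zero uses the all-zeros exponent, while infinity and `NaN` share the all-ones exponent (2047) and differ only in the mantissa.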

In [None]:
1.23e-2

In [None]:
x = 0.0000000000001234
print(x)

For native Python these limits and associated values can be found in `sys.float_info`

In [None]:
import sys
print(sys.float_info)
print()
print("maximum floating point number is", sys.float_info.max)

In [None]:
print(np.finfo(float)) # or np.finfo(np.float64)

In [None]:
# Illustrating what happens when you exceed the most positive exponent --- overflow
x = 10.0**308
print(x)
x *= 2.0
print(x)

In [None]:
# Illustrating what happens when you exceed the most positive exponent and mantissa --- overflow
import sys
x = sys.float_info.max
print(x)
x = x*1.0000000000000002
print(x)

In [None]:
# Illustrating what happens as you approach and exceed the most negative exponent (i.e., very small numbers) 
# --- gradual underflow and loss of precision
x = 1.23456789012345678e-300
while x>0:
    print(x)
    x *= 0.1
print(x)

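**Machine epsilon** (https://en.wikipedia.org/wiki/Machine_epsilon) is the gap between 1 and the next larger representable number. Equivalently, it is the smallest power of two $\epsilon$ such that 

$1 + \epsilon \ne 1$

in floating-point arithmetic. It bounds the **relative error** of floating point representation and of each correctly rounded operation.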

For any real value $x$ (assuming no over/underflow) there exists a numerical representation $x^\prime$ such that $|x-x^\prime| < \epsilon |x|$

In [None]:
# Computing epsilon (normally would just look in sys.float_info)
for n in range(-60,1):
    epsilon = 2.0**n
    print(n, epsilon, 1.0+epsilon)
    if (1.0+epsilon) != 1.0:
        break
print()
print(sys.float_info.epsilon)
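
As a cross-check on the loop above, the spacing of floating-point numbers can also be queried directly. This sketch assumes Python 3.9 or later, where `math.nextafter` and `math.ulp` are available.

```python
import math
import sys

# Gap between 1.0 and the next larger representable float
eps = math.nextafter(1.0, math.inf) - 1.0
print(eps, eps == sys.float_info.epsilon)

# The spacing scales with the magnitude of the number
print(math.ulp(1.0), math.ulp(1e6), math.ulp(1e-6))

# Adding half of epsilon to 1.0 is rounded away
print(1.0 + eps / 2 == 1.0)
```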

**Rounding error** is related to the representation error already introduced above. While some numbers and floating-point computations can be, or can appear to be, exact, most suffer rounding error because the significand has only 53 bits, and the last bit must usually be rounded. 

This gives you 15-16 significant decimal figures.

In IEEE 754 floating point arithmetic the default rounding mode is towards the closest exactly representable number.  Other rounding modes are available (e.g., round to zero, etc.) but are rarely needed.

In [None]:
import math
print(0.3/0.1 - 3.0)

The IEEE standard requires that the result of addition, subtraction, multiplication, division, square root, remainder, and conversion between integer and floating-point be *correctly rounded*.  Correct rounding is much harder to achieve efficiently for transcendental functions (e.g., `exp`), so math libraries typically aim for results within about one unit in the last place, and some also offer slightly less accurate modes (errors in the last 1-3 bits) that are potentially much faster.

More precisely, let $\times = +, -, *, /, \ldots$ in exact arithmetic and let $\otimes$ be the corresponding floating-point operation. Assuming no under/overflow, IEEE 754 arithmetic guarantees that given two floating-point values $x$ and $y$ that $|(x \times y) - (x\otimes y)| < \epsilon |x \times y|$. 
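
The bound above can be spot-checked with exact rational arithmetic. The sketch below uses the standard `fractions.Fraction` type, which converts a float to the exact rational it stores, so the "exact" result of each operation on the stored operands is available for comparison. The operands `0.1` and `0.3` are arbitrary choices.

```python
from fractions import Fraction
import operator
import sys

eps = sys.float_info.epsilon
x, y = 0.1, 0.3

rel_errors = {}
for name, op in [("+", operator.add), ("-", operator.sub),
                 ("*", operator.mul), ("/", operator.truediv)]:
    exact = op(Fraction(x), Fraction(y))   # exact result for the stored operands
    computed = Fraction(op(x, y))          # correctly rounded floating-point result
    rel_errors[name] = abs(computed - exact) / abs(exact)
    print(f"x {name} y: relative error = {float(rel_errors[name]):.3e}")

print("all below epsilon:", all(err < eps for err in rel_errors.values()))
```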

### Floating point arithmetic is *commutative* but *not associative* and *not distributive*

Commutative means $x \otimes y = y\otimes x$. 
* True in floating-point arithmetic for $\otimes=+$ or $\otimes=*$.

Associative means $(x \otimes y) \otimes z = x \otimes (y \otimes z)$ where sub-expressions within parentheses are evaluated first. 
* Not true in floating point.

Distributive means $x\otimes(y + z) = x\otimes y + x \otimes z$ for $\otimes = *$ (division distributes only from the right, $(y+z)/x = y/x + z/x$). 
* Not true in floating point.

In [None]:
x=239480912809.2930841092
y=8309482109.193284092183018
z=1.328488213048321094
print("x*y == y*x", x*y == y*x)  # * commutes
print("x+y == y+x", x+y == y+x)  # + commutes
print("x-y == -(y-x)", x-y == -(y-x)) # - is anti-commutative: swapping the operands only flips the sign
print("(x+y)+z == x+(y+z)", (x+y)+z == x+(y+z)) # + is NOT exactly associative in floating point
val1 = (x+y)+z
val2 = x+(y+z)
relerr = abs(val1-val2)/val1
print("relative error in associative test", relerr)
print("x*(y+z)==x*y+x*z", x*(y+z)==x*y+x*z) # * is NOT exactly distributive in floating point
val1 = x*(y+z)
val2 = x*y+x*z
relerr = abs(val1-val2)/val1
print("relative error in distributive test", relerr)

**Reliably comparing floating-point values**

You must pay attention when comparing two floating point numbers --- since floating-point computation is imprecise, comparing two numbers should be done allowing for some reasonable error.  But what is reasonable depends on what accuracy you are expecting --- i.e., often *you* have to decide what is acceptable, but a reasonable default can be obtained from the machine epsilon assuming just rounding error is present.

The easiest way to do this is using `math.isclose` (https://docs.python.org/3/whatsnew/3.5.html#pep-485-a-function-for-testing-approximate-equality)


In [None]:
x = (math.pi+100.0)-100.0 # introduces a "small" error into the value of pi
print(x==math.pi, x-math.pi)
print(math.isclose(x, math.pi, rel_tol=1e-14))
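
One caveat when comparing against zero: a purely relative tolerance can never succeed, because the allowed error scales with the magnitude of the values. The sketch below shows the `abs_tol` argument of `math.isclose` being used for that case; the tolerance `1e-12` is an arbitrary choice for illustration.

```python
import math

value = math.sin(math.pi)   # tiny but non-zero because pi is not exactly representable

print(math.isclose(value, 0.0))                 # False: relative tolerance only
print(math.isclose(value, 0.0, abs_tol=1e-12))  # True: absolute tolerance allows values near zero
```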

In [None]:
x = math.pi
y = x + 100000000000000
print(x,y)

In [None]:
print(y-100000000000000)

In [None]:
# pi + 1000000000000000 + (-1000000000000000)
x = (math.pi + 1000000000000000) + (-1000000000000000)
y = math.pi + (1000000000000000 + (-1000000000000000))
print(x, y)

**Cancellation error:** (loss of significance, https://en.wikipedia.org/wiki/Loss_of_significance) is a more pernicious problem.  

Adding/subtracting numbers of similar magnitude to obtain a relatively small result or intermediate value can lose significant digits, sometimes catastrophically.

In [None]:
x = 0.1234567890123456
y = 0.1234567890123000
print(x)
print(y)
print(x-y)

# Below we show how cancellation error can expose previous rounding error

print((1.0+2.0**52)-2.0**52)   # 1.0: 1+2**52 is exactly representable
print((1.0+2.0**53)-2.0**53)   # 0.0: 1+2**53 rounds to 2**53 (floating-point addition is not associative)
print((math.pi+1e8)-1e8, math.pi) 
print((math.pi+1e16)-1e16, math.pi)
print((math.pi + 1e17) - 1e17, math.pi)
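
Cancellation can often be avoided by rearranging the computation. As one illustration (not taken from the examples above), computing $e^x - 1$ for tiny $x$ subtracts two nearly equal numbers, while the standard library function `math.expm1` evaluates the same quantity without the subtraction.

```python
import math

x = 1e-15                      # for such small x, exp(x) - 1 is very close to x

naive = math.exp(x) - 1.0      # exp(x) is rounded near 1.0, then almost everything cancels
stable = math.expm1(x)         # computes exp(x) - 1 directly

print("naive :", naive, " relative error", abs(naive - x) / x)
print("stable:", stable, " relative error", abs(stable - x) / x)
```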

**Exercise:** Summing data with varying magnitude and sign.

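Below we first define a function to make a random value that has a large variation in magnitude but is always positive.  Each value is $2^r$ for a random real exponent $r$ between $-50$ and $+50$, so the magnitudes range from about $10^{-15}$ to $10^{15}$.  Once stored, each value is an exact floating point number, so any error in the sums below comes from the additions.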

In [None]:
from random import random

def ranval():
    return 2.0**((random()-0.5)*100)

# Print a few out to see what they look like
for i in range(8):
    print(ranval())

Let's sum a list of 1000 such values

In [None]:
values=[ranval() for i in range(1000)]
print("sum in original order:   ", sum(values))

**Question:** Is there rounding error when computing the sum?

**Question:** Is cancellation error a concern here? [Hint, the values are all positive.]

**Question:** Should the order of summation greatly affect the result?

In [None]:
values.reverse()
print("sum in reverse order:   ", sum(values))
print("sum in sorted order:    ", sum(sorted(values)))

We can see that any variation in the relative error is consistent with machine epsilon --- it is just rounding error.


Next we make a list of 2000 elements --- the first 1000 are random values computed with the function above and the second 1000 are just the negatives of the first.

So we know the exact sum should be zero.

In [None]:
values=[ranval() for i in range(1000)]
values = values + [-value for value in values]
print("sum in original order:   ", sum(values))


**Question:** Is cancellation error a concern here?

**Question:** Why should the order of summation matter?

**Question:** How can we get the computer to give us the "correct" answer?

**Question:** In general, if we don't know the "exact" result, what do we even mean by "correct?"

In [None]:
print("sum in original order:  ", sum(values))
values.reverse()
print("sum in reverse order:   ", sum(values))
print("sum in sorted order:    ", sum(sorted(values)))
print("sum in abs sorted order:", sum(sorted(values,key=abs)))

In general even summing sorted data will still have some rounding error and may not even be the optimal algorithm --- we only get zero error here because of this special test case.

If you are concerned about the accuracy of a sum of floating point values, you could try using the `math.fsum` function.

In [None]:
?math.fsum

In [None]:
import math
math.fsum(values)
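
For comparison, here is a sketch of Kahan (compensated) summation, a classic technique that is not part of the material above but is related to what `math.fsum` aims for: it carries a running correction for the low-order bits lost in each addition. The helper names `naive_sum` and `kahan_sum` are made up for this example. An explicit loop is used for the naive version because recent CPython releases (3.12 onward, as far as I know) already apply a compensated scheme inside the built-in `sum` for floats.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for value in values:
        total += value
    return total

def kahan_sum(values):
    """Kahan compensated summation (sketch)."""
    total = 0.0
    compensation = 0.0                    # running estimate of the lost low-order bits
    for value in values:
        y = value - compensation          # apply the correction from the previous step
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was lost
        total = t
    return total

data = [0.1] * 1_000_000
reference = math.fsum(data)
print("naive error:", naive_sum(data) - reference)
print("kahan error:", kahan_sum(data) - reference)
```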

### Example of the quadratic equation

A classic example for numerical woes is the standard expression for the roots of a quadratic equation

$ a x^2 + b x + c = 0$

$ x = \frac{-b \pm \sqrt{b^2 - 4 a c}}{2 a}$

Two aspects of numerical computation conspire to make problems for this simple formula when $|ac| \ll b^2$:
*  Rounding error when computing $b^2 - 4 a c$
*  Cancellation error when computing $-b \pm \sqrt{b^2 - 4 a c}$

Consider adding two numbers $p+q$.  We've already seen that if $q$ is small compared to $p$ the addition operation must discard some of the digits in $q$. In the worst case, if $|q|$ is small enough compared with $\epsilon |p|$ (for instance $|q|<\epsilon |p|/4$), then $p+q=p$ in floating-point arithmetic (where $\epsilon$ is the machine epsilon).


In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.linspace(-2,4,31)

y = x**2 - 2*x - 3   # (x+1)*(x-3) is zero if x=-1 or x=+3

fig,ax = plt.subplots()
ax.plot(x,y)
ax.axvline(x=-1, color='r', linestyle='dashed')
ax.axvline(x= 3, color='r', linestyle='dashed')
ax.axhline(y=0, color='r');


In [None]:
p=1.0
q=1.2345678901234567e-12
print(p,q,p+q)

Coming back to the quadratic equation problem, imagine that $|ac| \ll \epsilon b^2$ (where $\epsilon$ is machine-epsilon).
* The floating-point value for $b^2 - 4 a c$ will be computed as $b^2$ --- can you explain why?
* As a result (and assuming that $b>0$) we will compute $-b + \sqrt{b^2 - 4 a c}$ to have the value zero. 
  * This is the catastrophic cancellation error --- we have lost all information.
* Again assuming $|ac| \ll b^2$, a little bit of math (Taylor series) tells us that the correct answer is  $-b + \sqrt{b^2 - 4 a c} \approx -2 ac/b$ and so the corresponding root is $x \approx -c/b$
* Similar problems arise for the other root if $b<0$.

These errors can be avoided by using the alternative formula 

$$x = \frac{2c}{-b \mp \sqrt{b^2 - 4 a c}}$$

with '-' when $b\ge 0$ and '+' when $b<0$.

The original and alternative algorithms are implemented below.

In [None]:
import math


def roots1(a, b, c):
    r = math.sqrt(b**2 - 4 * a * c)
    return (-b - r) / (2 * a), (-b + r) / (2 * a)


def roots2(a, b, c):
    r = math.sqrt(b**2 - 4 * a * c)
    x1 = (-b - math.copysign(r, b)) / (2 * a)
    x2 = c / (a * x1)
    return (x1, x2)


#
print("roots1:", roots1(1.0, 1e8, 1.0))
print("roots2:", roots2(1.0, 1e8, 1.0))
print("exact:", (-1e8, -1e-08))

**What else can go wrong?**  
* What if $a$ is zero?  I.e., you have a straight line instead of a parabola. The alternative formula works in this instance.
* Can also get loss of significance if $b^2 \approx 4 a c$ but fixing this is not so easy unless we use extended precision.
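
The $a=0$ case can be handled by computing the well-conditioned quantity $q = -\tfrac{1}{2}\left(b + \text{sign}(b)\sqrt{b^2-4ac}\right)$ once and taking the roots as $q/a$ and $c/q$. The sketch below is one possible arrangement, not the only one: `roots_stable` is a hypothetical helper, and it assumes real roots (non-negative discriminant) and $q \ne 0$.

```python
import math

def roots_stable(a, b, c):
    """Sketch: real roots of a*x**2 + b*x + c = 0, tolerating a == 0 (hypothetical helper)."""
    r = math.sqrt(b * b - 4.0 * a * c)     # assumes a non-negative discriminant
    q = -0.5 * (b + math.copysign(r, b))   # b and copysign(r, b) share a sign: no cancellation
    x1 = q / a if a != 0.0 else math.inf   # this root moves off to infinity when a == 0
    x2 = c / q                             # alternative formula; assumes q != 0
    return x1, x2

print(roots_stable(1.0, 1e8, 1.0))    # same test case as roots1 and roots2
print(roots_stable(0.0, 2.0, -4.0))   # straight line 2x - 4 = 0, root at x = 2
```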


**Extended precision:** We can easily do this in native Python but not with NumPy.  In extended precision arithmetic we can use more bits in the mantissa and have arbitrarily large exponents --- but the price is a significant loss of speed.  Also, it is not a magical solution --- some algorithms can be so badly conditioned that it would be impossible to guarantee sufficient precision especially if there is only finite precision or accuracy in the input data.

In [None]:
import mpmath as mp
def roots3(a,b,c):
    saveprec, mp.mp.prec = mp.mp.prec, 108 # set precision to 108 bits
    a,b,c = mp.mpf(a),mp.mpf(b),mp.mpf(c)  # convert to extended-precision mpf values
    r = mp.sqrt(b**2 - 4*a*c)
    if b < 0: r = -r
    x1 = (-b - r)/(2*a)
    x2 = c/(a*x1)
    mp.mp.prec = saveprec                  # reset mp precision 
    return (float(x1),float(x2))           # return Python floats

# roots should be
x1 = 1.000000028975958
x2 = 1.000000000000000
print("roots1:",roots1(94906265.625,-189812534.0,94906268.375))    
print("roots2:",roots2(94906265.625,-189812534.0,94906268.375))    
print("roots3:",roots3(94906265.625,-189812534.0,94906268.375))    
print("exact: ",(x1,x2))

Extended precision is implemented in software so it is slow compared to double precision that is directly implemented in hardware. 

In [None]:
mp.mp.prec = 200 # 200 bits precision
doubles = list(float(value) for value in range(2000))
mpfs = [mp.mpf(value) for value in doubles]
%timeit sum(doubles)
%timeit sum(mpfs)

In [None]:
d = np.array(doubles)
%timeit np.sum(d)