<a href="https://colab.research.google.com/github/stephenbeckr/numerical-analysis-class/blob/master/Demos/HowToCheckYourAnswerUsingExtendedPrecision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Checking your answer using extended precision

This is an example of using higher precision to get a "true answer" and then using that to check roundoff error for lower precision.

Let's look at $f(x)=e^x - 1$, which is tricky to evaluate when $x\approx 0$ due to subtractive cancellation.
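
To see why the cancellation hurts, here is a short sketch under the standard floating-point model. It assumes the computed exponential has relative error $\delta$ bounded by the unit roundoff $u$, and that the subtraction of $1$ is then exact (both are modelling assumptions, not statements about any particular library). Writing $\hat f$ for the computed value,
$$
\hat f(x) = e^x(1+\delta) - 1, \qquad |\delta| \le u,
$$
so the relative error is
$$
\frac{\hat f(x) - f(x)}{f(x)} = \frac{e^x\,\delta}{e^x - 1} \approx \frac{\delta}{x} \quad \text{as } x \to 0.
$$
The rounding error in $e^x$ is therefore amplified by roughly $1/|x|$. For $x = 10^{-4}$ that is a factor of about $10^4$, i.e. up to about four decimal digits lost.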

We can evaluate this with the naive method, as well as `numpy.expm1`. Let's see which is more accurate.

To be extra careful that the $1$ is in the appropriate precision, we'll write a function like
$$
F(x,y) = e^x - y
$$
and then use
$$
f(x) = F(x,1).
$$

Note: in the code below, the type labelled `float128` (NumPy's `np.longdouble`) doesn't necessarily have 128 bits. It may be an 80-bit type that is padded to 128 bits (see [NumPy data types](https://numpy.org/doc/stable/user/basics.types.html)).
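
Since the size of `np.longdouble` is platform dependent, it can help to inspect it before trusting it as a reference. The following is a minimal sketch using `np.finfo`; the variable names `info` and `storage_bits` are just for illustration.

```python
import numpy as np

# Ask NumPy what np.longdouble really is on this machine.
info = np.finfo(np.longdouble)
storage_bits = np.dtype(np.longdouble).itemsize * 8  # includes any padding

print("storage bits          :", storage_bits)
print("mantissa bits         :", info.nmant)
print("machine epsilon       :", info.eps)
print("approx. decimal digits:", info.precision)
```

If the mantissa size and machine epsilon match those of `np.float64`, then `np.longdouble` is just a double on that platform and gives no extra precision to check against.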

In [6]:
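import numpy as np
# Precisions to test, from highest to lowest.  np.longdouble is labelled
#   "float128" in the printout, although it may have fewer bits.
precisionTypes = [np.longdouble,np.float64,np.float32,np.float16]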

# Try different sizes for x.  We'll start in the highest precision,
#   and later reduce precision.
x = np.array( 1e-4, dtype=np.longdouble )

f = lambda x, y : np.exp(x) - y  # naive version; y = 1 in the working precision
g = lambda x, y : np.expm1(x)    # the -1 part is included in expm1 (y is unused)
fx, gx = [], []
for prec in precisionTypes:
  # Cast both x and the constant 1 to the precision being tested
  fx.append(f(np.array(x,dtype=prec),np.array(1,dtype=prec)) )
  gx.append(g(np.array(x,dtype=prec),np.array(1,dtype=prec)) )

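# Use expm1 in the highest precision as the "true answer"
trueAnswer = gx[0]
relAccuracy = lambda x : np.abs(x-trueAnswer)/np.abs(trueAnswer)
# The 1e-18 offset avoids log10(0); it also caps the reported digits at 18
numDigits   = lambda x : -np.log10( relAccuracy(x) + 1e-18 )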
for i,prec in enumerate(precisionTypes):
  print("For float{:<3d}, rel. err of f is {:5.2e} and of g is {:5.2e}".format(int(128/(2**i)),relAccuracy(fx[i]),relAccuracy(gx[i])))
print("\nPut another way, here's the number of correct digits:\n")  
for i,prec in enumerate(precisionTypes):
  print("For float{:<3d}, f has {:4.1f} correct digits, g has {:4.1f} correct digits".format(int(128/(2**i)),numDigits(fx[i]),numDigits(gx[i])))

For float128, rel. err of f is 3.44e-18 and of g is 0.00e+00
For float64 , rel. err of f is 4.33e-13 and of g is 3.44e-18
For float32 , rel. err of f is 1.16e-04 and of g is 4.11e-08
For float16 , rel. err of f is 1.00e+00 and of g is 1.16e-04

Put another way, here's the number of correct digits:

For float128, f has 17.4 correct digits, g has 18.0 correct digits
For float64 , f has 12.4 correct digits, g has 17.4 correct digits
For float32 , f has  3.9 correct digits, g has  7.4 correct digits
For float16 , f has -0.0 correct digits, g has  3.9 correct digits


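Conclusion: at every precision, the relative error of the "g" version (using `expm1`) is much smaller than that of the naive "f" version. In float64 and float32 the naive version loses roughly three to four digits, and in float16 it has no correct digits at all. Note that the zero error for g in the highest precision is by construction, since that value is used as the "true answer". The second expression really is more stable.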
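
One caveat is that the reference value above is itself computed with `expm1`. A minimal sketch of an independent check follows, assuming Python's standard `decimal` module with 50 significant digits is an acceptable reference (this is a swapped-in technique, not part of the original demo; `reference` and `rel_err` are illustrative names).

```python
import numpy as np
from decimal import Decimal, getcontext

getcontext().prec = 50  # 50 significant decimal digits for the reference

x = np.float64(1e-4)
# Decimal(float) converts the binary value exactly, so the reference is
# evaluated at the same x that the float64 routines see.
reference = Decimal(float(x)).exp() - 1

def rel_err(approx):
    """Relative error of a float64 result against the 50-digit reference."""
    return float(abs(Decimal(float(approx)) - reference) / reference)

naive_err = rel_err(np.exp(x) - 1.0)  # naive float64 evaluation
expm1_err = rel_err(np.expm1(x))      # float64 expm1
print("naive: {:.2e}   expm1: {:.2e}".format(naive_err, expm1_err))
```

Because the reference here does not depend on `np.longdouble` or on `expm1`, it also works on platforms where `np.longdouble` is no wider than a double.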