
# Numerical Stability

**Goal:**
In this experiment you will investigate the numerical stability of the product of probabilities and compare it with the (improved) stability of the sum of the logs of these probabilities. This gives you a feeling for why optimizing the joint log-likelihood is computationally more stable than optimizing the joint likelihood.

**Usage:** For additional information, read chapter 4 of the [Probabilistic Deep Learning book](https://www.manning.com/books/probabilistic-deep-learning?a_aid=probabilistic_deep_learning&a_bid=78e55885).

**Content:**
* Show that calculating the product of many probabilities (each <= 1) leads to numerical instabilities that do not occur when calculating the sum of the logs of these probabilities.


In [None]:
try: #If running in colab
    import google.colab
    IN_COLAB = True
    %tensorflow_version 2.x
except ImportError:
    IN_COLAB = False

import tensorflow as tf
if (not tf.__version__.startswith('2')): #Checking if tf 2.0 is installed
    print('Please install tensorflow 2.0 to run this notebook')
print('Tensorflow version: ',tf.__version__, ' running in colab?: ', IN_COLAB)

#load required libraries:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('default')



### The product of probabilities.

To calculate the joint likelihood you have to determine the product of many probabilities. As you can see in the following, multiplying many values between zero and one leads to very small values, and once the result gets too small, Python rounds it to zero.   
To demonstrate this, we sample 100 values from a uniform distribution with min = 0 and max = 1 and take the product of those values, then do the same for 1000 values.


In [None]:
vals100 = np.random.uniform(0,1,100)
vals1000 = np.random.uniform(0,1,1000)
# np.prod replaces np.product, which was deprecated and then removed in NumPy 2.0
x100 = np.prod(vals100)
x1000 = np.prod(vals1000)
print(f'product of 100  samples: {x100}',f'\nproduct of 1000 samples: {x1000}')

When multiplying 100 values you get a very small number, but for 1000 values you get 0.0. This is due to the limited range of floating-point numbers in a computer: the true product is smaller than the smallest positive value a 64-bit float can represent, so it underflows to zero. This is a real problem, because it looks as if the joint likelihood is zero, but it is not (it is just very small due to the large amount of data).
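To make the underflow concrete, the following minimal sketch prints the smallest positive values a 64-bit float can hold. Since the expected value of the log of a uniform(0, 1) sample is -1, a product of 1000 such samples is typically around exp(-1000), which the sketch uses as a stand-in for the product. The variable names are chosen for illustration only.

```python
import numpy as np

# Smallest positive normal float64 (about 2.2e-308).
smallest_normal = np.finfo(np.float64).tiny
# Smallest positive subnormal float64 (about 4.9e-324): the next float after 0.0.
smallest_subnormal = np.nextafter(0.0, 1.0)

# exp(-1000) is roughly 1e-434, far below the smallest representable
# positive value, so the result underflows to exactly 0.0.
underflowed = np.exp(-1000.0)

print(f'smallest normal float64:    {smallest_normal}')
print(f'smallest subnormal float64: {smallest_subnormal}')
print(f'exp(-1000) as float64:      {underflowed}')
```

Anything smaller than the subnormal limit cannot be distinguished from zero, which is what happens to the product of 1000 probabilities.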


### Taking the log does not change the position of the maximum

In the next cell we show that the x value at which a function f(x) reaches its maximum is also the x value at which log(f(x)) reaches its maximum. For demonstration, our function f(x) is one plus the absolute value of the product of two sine waves (the offset keeps f(x) positive, so the log is defined everywhere), and we plot it together with its log.

In [None]:
vals = 1 + np.abs(np.sin(np.linspace(0, 3*np.pi, 1000)) * np.sin(np.linspace(0, np.pi, 1000)))
plt.plot(range(0, 1000),vals,'b-')
plt.plot(range(0, 1000),np.log(vals),'g--')
plt.xlabel('x')
plt.legend(("f(x)","log(f(x))"),fontsize=10)

Here it is clearly visible: the maximum for both functions is at the same position.
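The visual impression can also be checked numerically with `np.argmax`. The sketch below rebuilds the same function, but on 1001 grid points instead of 1000. This is an assumption made here so that the peak at x = pi/2 falls exactly on a grid point and the maximum is unique. The names `f`, `idx_f` and `idx_log_f` are illustrative.

```python
import numpy as np

n_points = 1001  # assumption: odd grid size so that the peak lies on a grid point
f = 1 + np.abs(np.sin(np.linspace(0, 3 * np.pi, n_points))
               * np.sin(np.linspace(0, np.pi, n_points)))

idx_f = np.argmax(f)              # index of the maximum of f(x)
idx_log_f = np.argmax(np.log(f))  # index of the maximum of log(f(x))

print(f'argmax of f(x):      {idx_f}')
print(f'argmax of log(f(x)): {idx_log_f}')
```

Because the log is strictly increasing, it preserves the ordering of the function values, so both calls return the same index.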

### Taking the logs and summing up.

Taking the log of a product gives $\log(A \cdot B) = \log(A) + \log(B)$, meaning that we can work with a sum of logs instead of a product (see book).  
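The identity extends to any number of factors, and it can be checked on a sample small enough that the product is still representable. The sketch below is a minimal check under that assumption; the seed and the names `probs`, `log_of_product` and `sum_of_logs` are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, used only for reproducibility
probs = rng.uniform(0, 1, 100)   # 100 values: the product is tiny but not yet zero

log_of_product = np.log(np.prod(probs))  # log applied to the product
sum_of_logs = np.sum(np.log(probs))      # sum of the individual logs

print(f'log of product: {log_of_product}')
print(f'sum of logs:    {sum_of_logs}')
print('equal up to rounding:', np.isclose(log_of_product, sum_of_logs))
```

Both routes agree up to rounding error as long as the product itself does not underflow; with more factors only the sum of logs remains usable.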

Now you apply the log to the product of probabilities, which gives you the sum of the logs of these probabilities. Remember that we have values from a uniform distribution with min = 0 and max = 1 (probabilities), which led to numerical problems when calculating the product of these probabilities.
As you can see now, on the log scale you no longer have the problem with numerical precision.

#### Listing 4.3 Fixing the numerical instabilities by taking the log


In [None]:
import numpy as np
# The product becomes the sum of the logs
log_x100 = np.sum(np.log(vals100))
log_x1000 = np.sum(np.log(vals1000))
print(f'log of product, 100  samples: {log_x100}',f'\nlog of product, 1000 samples: {log_x1000}')

This result is quite important for implementing the maximum likelihood estimation procedure. In the maximum likelihood approach you want to determine the parameter value that yields the highest joint likelihood over all observed data. The very same parameter value will also maximize the joint log-likelihood. However, if you have a lot of data, the likelihood underflows to zero and can no longer be determined, whereas the log-likelihood can. This is the reason why in DL you work with the negative log-likelihood as the loss function instead of the negative likelihood.
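As an illustration of this point, here is a minimal sketch of a maximum likelihood estimate for a coin-flip (Bernoulli) model using a simple grid search. The model, the true parameter `true_p`, the sample size and the grid are all assumptions chosen for this example; they are not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed, used only for reproducibility
true_p = 0.3                       # hypothetical true probability of heads
data = rng.random(5000) < true_p   # 5000 simulated coin flips (True = heads)

p_grid = np.linspace(0.01, 0.99, 99)  # candidate parameter values

# Probability of each observation under each candidate p,
# shape (number of candidates, number of observations).
per_obs = np.where(data, p_grid[:, None], 1 - p_grid[:, None])

likelihood = np.prod(per_obs, axis=1)             # joint likelihood
log_likelihood = np.sum(np.log(per_obs), axis=1)  # joint log-likelihood

# With this much data every joint likelihood underflows to 0.0,
# so it cannot tell the candidates apart.
print('all likelihoods are zero:', np.all(likelihood == 0.0))

# The negative log-likelihood stays finite and has a clear minimum.
nll = -log_likelihood
p_hat = p_grid[np.argmin(nll)]
print(f'estimate from the negative log-likelihood: {p_hat:.2f}')
```

The grid search stands in for the gradient-based optimization used in DL; the point is only that the negative log-likelihood still ranks the candidates when the likelihood itself has collapsed to zero.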