# Model the uncertainty in o2sat measurements

Create a lookup table that links each measured, rounded O2Sat value fed into the model to the possible true O2Sat values behind it. To build it, generate a number in the range 90-100, then 1) add gaussian noise, 2) round it, 3) keep the original number if the result is 98. Repeat this many times to get the distribution of input values that map to 98, then take the std of that distribution.

O2 saturation measurements are subject to technical and biological noise. From clinical practice, the o2 saturation varies by at most 1 point over 2 minutes of consecutive measurement (technical noise), and by up to 2-3 points over a day (biological noise). Hence, when an oximeter displays 98%, the individual's true o2 saturation can be anywhere in the range [96-100].

Since o2 saturation has a high SNR, it is important to add an o2 saturation measurement noise model to the lung model. Otherwise the model will propagate a strong belief in o2 saturation (a point-mass distribution) where in reality the belief should be weaker (wider than a point-mass distribution).

The o2 saturation noise model addresses this issue by generating the underlying distribution of true o2 saturation values corresponding to a single observation.

I modelled the noise with two sequential components:
1) gaussian noise: the true value is perturbed according to a gaussian distribution centered on that true value, with a standard deviation that has to be determined.
2) rounding: the perturbed value is rounded to the nearest integer.

Using this generative model, I can draw the underlying distribution by rejection sampling.
0) Pick an o2 saturation observation. This is the measured value.
1) Sample a value from a uniform distribution in the range [90-100]. This is the true value.
2) Add gaussian noise to the true value by sampling from N(true value, std).
3) Round the noisy true value to the nearest integer.
4) If the obtained value is equal to the measured value, keep the true value, otherwise discard it.
5) Repeat this algorithm one million times to get the distribution of true values that lead to the observed measured value.
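
The steps above can be sketched in a few lines of NumPy. This is a minimal, vectorised illustration and not the project's `o2sat.generate_underlying_uo2sat_distribution`; the function name `sample_true_values` and its parameters are hypothetical, and the value `std_gauss=0.86` is the one derived later in this notebook.

```python
import numpy as np


def sample_true_values(o2sat_obs, std_gauss, n_draws=1_000_000,
                       low=90.0, high=100.0, seed=0):
    """Rejection-sample the true O2 saturation values that round to o2sat_obs.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    # 1) Draw candidate true values uniformly in [low, high]
    true_vals = rng.uniform(low, high, size=n_draws)
    # 2) Add gaussian measurement noise centred on each true value
    noisy_vals = rng.normal(true_vals, std_gauss)
    # 3) Round to the nearest integer, as the oximeter display does
    measured = np.rint(noisy_vals)
    # 4) Keep only the true values whose simulated measurement
    #    matches the observation; discard the rest
    return true_vals[measured == o2sat_obs]


accepted = sample_true_values(o2sat_obs=95, std_gauss=0.86)
print(f"kept {accepted.size} draws, std = {accepted.std():.3f}")
```

Roughly one draw in ten is kept with these settings, so one million repetitions leave enough accepted samples for a stable estimate of the standard deviation.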

Show figure XX of two distributions, with and without the 100 edge effect. Explain that the std is the same for all generated distributions, except the 100 edge effect.

#### Computing the gaussian noise's std parameter
Note that the obtained underlying distribution can be used in both directions: 1) to sample measurements from a single true value, 2) to sample true values from a single measurement. Hence, all measurements of an individual are sampled from the underlying distribution corresponding to that individual, provided that the individual is healthy enough to have a constant true o2 saturation. It follows that avg_std_ID = std_dist = sqrt(std_gauss^2 + std_rounding^2)

avg_std_ID: the standard deviation of an individual's o2 saturation measurements, averaged over a subset of healthy enough (FEV1 % predicted > 80%) individuals.
std_dist: standard deviation of the underlying distribution of the measurements. Ignoring the 100% boundary effect, the std_dist is the same across all individuals (see figure XX).
sqrt(): mathematical relation that surfaces the relationship between the three types of uncertainty: uncertainty due to gaussian noise, uncertainty due to rounding and total uncertainty. It follows from the variance of the sum of two independent random variables, which is the sum of their variances. Although the gaussian noise and the rounding effects aggregate, the two phenomena are independent.
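
Written out, the relation reads as follows. This is a sketch under two assumptions: that the gaussian noise and the rounding error are independent, as stated above, and that the rounding error is approximately uniform on [-0.5, 0.5]. The second assumption is not stated in these notes; it is what gives the 1/12 term.

```latex
\begin{aligned}
\mathrm{Var}(X + Y) &= \mathrm{Var}(X) + \mathrm{Var}(Y)
  \quad \text{for independent } X,\, Y \\
\text{std\_dist}^2 &= \text{std\_gauss}^2 + \text{std\_rounding}^2 \\
\text{std\_rounding}^2 &\approx \tfrac{1}{12}
  \quad \text{(variance of a uniform error on } [-0.5,\, 0.5]\text{)} \\
\text{std\_dist} &\approx \sqrt{\text{std\_gauss}^2 + \tfrac{1}{12}}
\end{aligned}
```

Under these assumptions std_rounding is about 0.29, and the standard deviations themselves do not add: only the variances do.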

Using this equation, I empirically derived std_gauss with the following steps:
1) For each individual, compute the standard deviation of the measurements, then average over individuals. This is avg_std_ID.
2) Pick a std_gauss 0.5 below avg_std_ID. This is a safe starting point because rounding shifts a value by at most 0.5.
3) Run the generative model and save the std_dist.
4) Update std_gauss and go back to 3) until the obtained std_dist equals the avg_std_ID computed in 1).

I obtained std_gauss 0.86.
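
The search can also be automated. The sketch below assumes a bisection update rule, which the steps above do not specify, and replaces the project's generative model with a compact stand-in; `simulated_std_dist` and `calibrate_std_gauss` are hypothetical names. The target of 0.907 is the average std printed further down in this notebook.

```python
import numpy as np


def simulated_std_dist(std_gauss, o2sat_obs=95, n_draws=500_000, seed=0):
    """Std of the true values that round to o2sat_obs under the noise model."""
    # Re-seeding on every call reuses the same random draws, so the result
    # varies smoothly with std_gauss and the search below stays stable.
    rng = np.random.default_rng(seed)
    true_vals = rng.uniform(90.0, 100.0, size=n_draws)
    noisy_vals = true_vals + std_gauss * rng.standard_normal(n_draws)
    return true_vals[np.rint(noisy_vals) == o2sat_obs].std()


def calibrate_std_gauss(target_std, lo=0.1, hi=2.0, tol=1e-3):
    """Bisection on std_gauss until the simulated std matches target_std."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if simulated_std_dist(mid) < target_std:
            lo = mid  # simulated spread too small: increase the noise
        else:
            hi = mid  # simulated spread too large: decrease the noise
    return 0.5 * (lo + hi)


avg_std_id = 0.907  # average per-individual std reported in this notebook
std_gauss = calibrate_std_gauss(avg_std_id)
print(f"calibrated std_gauss = {std_gauss:.2f}")
```

Bisection works here because the simulated std_dist increases with std_gauss over the search interval; any other one-dimensional root finder would do equally well.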

### How to build the cpt
For each possible value of O2 saturation, bin up the output distribution in bins of unbiased O2 saturation, and fill its cpt.
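
As a cross-check on the sampled table, the same CPT can be approximated in closed form from the gaussian CDF. This is a sketch and not the method used in this notebook: it evaluates each unbiased O2 saturation bin at its midpoint instead of integrating over the bin, so the numbers will differ slightly from the sampled CPT. The names `o2sat_vals`, `uo2sat_mids` and `norm_cdf` are introduced here for illustration.

```python
import math

import numpy as np

std_gauss = 0.86
o2sat_vals = np.arange(50, 101)              # measured (rounded) values, 50..100
uo2sat_mids = 50.25 + 0.5 * np.arange(100)   # midpoints of 0.5-wide bins on [50, 100]

# Standard normal CDF built from math.erf, vectorised over arrays
norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

# P(O2Sat = m | UO2Sat = t) is the gaussian mass that rounds to m:
# Phi((m + 0.5 - t) / std) - Phi((m - 0.5 - t) / std)
upper = (o2sat_vals[:, None] + 0.5 - uo2sat_mids[None, :]) / std_gauss
lower = (o2sat_vals[:, None] - 0.5 - uo2sat_mids[None, :]) / std_gauss
cpt = norm_cdf(upper) - norm_cdf(lower)

# Renormalise each column so the probabilities over measured values sum to 1
# (mass falling outside 50..100 is redistributed, as in the sampled version)
cpt /= cpt.sum(axis=0, keepdims=True)

print(cpt.shape)
print(cpt[:, 40].round(4))  # column for the bin starting at 70.0
```

Rows index the measured O2Sat value and columns index the unbiased O2Sat bin, matching the layout of the sampled table.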

I computed the gaussian noise's std parameter using the variance of the sum of two random variables. Although the gaussian noise and the rounding effects aggregate, the two phenomena are independent.
https://www.milefoot.com/math/stat/rv-sums.htm#:~:text=For%20any%20two%20random%20variables,sum%20of%20their%20expected%20values.&text=The%20proof%2C%20for%20both%20the,continuous%20cases%2C%20is%20rather%20straightforward.

In [1]:
import numpy as np
import src.data.breathe_data as breathe_data
import src.data.helpers as datah
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import src.models.helpers as mh
import src.modelling_o2.o2sat as o2sat

In [2]:
# df = breathe_data.build_O2_FEV1_df()
df = breathe_data.load_from_excel()

# Find the uncertainty of an o2 sat measurement

In [8]:
# Estimate the uncertainty of an o2 sat measurement
# Take the healthiest individuals and compute the standard deviation of their measurements
# Plot a histogram of the standard deviations
# Take the mean as the std of an o2 sat measurement


def get_std(df):
    """
    If there are more than 10 values, compute the standard deviation
    Else, return NaN
    """
    if len(df) > 10:
        return df.std()
    else:
        return np.nan


df_std = datah.compute_avg(df, "FEV1 % Predicted", "%")
print(f"{df.ID.nunique()} IDs")
# Filter healthy individuals
df_std = df_std[df_std["FEV1 % Predicted"] > 80]
print(f"{df_std.ID.nunique()} healthy IDs")

stds = df_std.groupby("ID")["O2 Saturation"].agg(get_std)
stds = stds.dropna()
print(f"{len(stds)}/{df_std.ID.nunique()} IDs with > 10 measurements")

# Print avg std
print(f"Average std: {stds.mean()}")
print(f"Median std: {stds.median()}")

# Plot histogram of stds
fig = px.histogram(stds, nbins=20, marginal="box")
fig.update_layout(
    title_text="Distribution of standard deviations of O2 Saturation measurements"
)
fig.show()

213 IDs
96 healthy IDs
54/96 IDs with > 10 measurements
Average std: 0.907397942237467
Median std: 0.8717543896524964


# Define the generative noise model and tailor it to our data

In [121]:
# Randomly generate a number in the range 90.0-100.0
x = np.random.uniform(90.0, 100.0)
print(x)
# Add gaussian noise to the number with a standard deviation of 0.9
x = np.random.normal(x, 0.9)
print(x)
# Round x to the nearest 1
x = round(x)
print(x)

93.473881565419
93.34408539888815
93


In [10]:
O2Sat_vals = np.arange(50, 101, 1)

# Parameters
UO2Sat = mh.variableNode("Unbiased O2 saturation (%)", 50, 100, 0.5, prior=None)
o2sat_obs = 95
std_gauss = 0.86
repetitions = 100000

hist, bin_edges, true_o2sat_arr = o2sat.generate_underlying_uo2sat_distribution(
    UO2Sat, o2sat_obs, repetitions, std_gauss, show_std=True
)


fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=true_o2sat_arr,
        xbins=dict(start=UO2Sat.a, end=UO2Sat.b, size=UO2Sat.bin_width),
        autobinx=False,
        histnorm="probability",
    )
)

fig.update_layout(width=800)
fig.update_xaxes(range=[UO2Sat.a, UO2Sat.b], title="Unbiased O2 saturation")
fig.update_yaxes(title="Probability")
fig.show()

Std of values giving 95: 0.913


# Build the CPT of O2Sat-UnbiasedO2Sat
It uses the same algorithm as above but saves all the data into a table

In [11]:
# Parameters
UO2Sat = mh.variableNode("Unbiased O2 saturation (%)", 50, 100, 0.5, prior=None)
std_gauss = 0.86
repetitions = 1000000
cpt = np.zeros((len(O2Sat_vals), len(UO2Sat.bins)))

for i, o2sat_obs in enumerate(O2Sat_vals):
    hist, _, _ = o2sat.generate_underlying_uo2sat_distribution(
        UO2Sat, o2sat_obs, repetitions, std_gauss, show_std=True
    )

    cpt[i, :] = hist

# Normalise each column so that P(O2Sat | UO2Sat bin) sums to 1; skip empty columns
normaliser = cpt.sum(axis=0)
for i, norm in enumerate(normaliser):
    if norm != 0:
        cpt[:, i] = cpt[:, i] / norm

Std of values giving 50: 0.5502
Std of values giving 51: 0.7412
Std of values giving 52: 0.8729
Std of values giving 53: 0.9072
Std of values giving 54: 0.9078
Std of values giving 55: 0.9121
Std of values giving 56: 0.9045
Std of values giving 57: 0.9025
Std of values giving 58: 0.9059
Std of values giving 59: 0.9049
Std of values giving 60: 0.9105
Std of values giving 61: 0.9121
Std of values giving 62: 0.9217
Std of values giving 63: 0.9124
Std of values giving 64: 0.9104
Std of values giving 65: 0.8995
Std of values giving 66: 0.9035
Std of values giving 67: 0.9131
Std of values giving 68: 0.9094
Std of values giving 69: 0.9025
Std of values giving 70: 0.9171
Std of values giving 71: 0.9072
Std of values giving 72: 0.9135
Std of values giving 73: 0.9026
Std of values giving 74: 0.9177
Std of values giving 75: 0.9048
Std of values giving 76: 0.9141
Std of values giving 77: 0.9073
Std of values giving 78: 0.9052
Std of values giving 79: 0.9092
Std of values giving 80: 0.9093
Std of v

In [12]:
cpt[:, 40]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00059365, 0.02234474, 0.17519066,
       0.41088857, 0.31513737, 0.07110914, 0.00473586, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [46]:
UO2Sat.bins[34]

67.0

In [13]:
uo2sat_bin_idx = 40
fig = go.Figure()
fig.add_trace(go.Scatter(x=O2Sat_vals, y=cpt[:, uo2sat_bin_idx]))
fig.update_layout(
    title_text=f"P(O2Sat | UO2Sat = {UO2Sat.bins_arr[uo2sat_bin_idx]})",
    xaxis_title="O2 saturation (%)",
    yaxis_title="Probability",
)
fig.show()

Values below 70% are not realistic, but mathematically they can be obtained with the model.

We create the CPT for O2Sat going from 50 to 100 and for UO2Sat going from 50 to 100. The 100 boundary for UO2Sat is meaningful, but the 50 boundary is not. This is not a problem, because it only affects O2Sat in [50-53] and UO2Sat in [50-57], values which should never be observed.

In [14]:
# Save cpt in text file
np.savetxt(f"cpt_o2sat_50_100_uo2sat_{UO2Sat.a}_{UO2Sat.b}_{UO2Sat.bin_width}.txt", cpt, delimiter=",")

In [None]:
O2SatVar = mh.variableNode("O2 saturation (%)", 49.5, 100.5, 1, prior=None)
UO2Sat = mh.variableNode("Unbiased O2 saturation (%)", 50, 100, 1, prior=None)

o2sat.load_cpt(O2SatVar, UO2Sat)

In [8]:
np.loadtxt("cpt_o2sat_50_100_uo2sat_50_100_1.txt", delimiter=",")

array([[0.56572725, 0.20357705, 0.02374338, ..., 0.        , 0.        ,
        0.        ],
       [0.32333015, 0.36028837, 0.13968624, ..., 0.        , 0.        ,
        0.        ],
       [0.10038918, 0.31963256, 0.34985606, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.35486752, 0.31581025,
        0.0965846 ],
       [0.        , 0.        , 0.        , ..., 0.13779415, 0.35926979,
        0.32615087],
       [0.        , 0.        , 0.        , ..., 0.02384281, 0.21004499,
        0.56679325]])