# Model the uncertainty in o2sat measurements

Create a lookup table that maps each measured (and rounded) O2Sat value fed into the model to the possible true O2Sat values behind it. To do so, generate a number in the range 90-100, then 1) add gaussian noise, 2) round it, 3) keep the original number if the result is 98. Repeat this many times to get the distribution of input values that map to 98, then take the std of that distribution.

O2 saturation measurements are subject to technical noise and daily biological variation. From clinical practice, the o2 saturation varies by at most 1 point over 2 minutes of consecutive measurement (technical noise), and by up to 2-3 points over a day (biological noise). Hence, when an oximeter displays 98%, the individual's true o2 saturation can be any value in the range [96-100].

Since o2 saturation has a low SNR, it is important to add an o2 saturation measurement noise model to the lung model. Otherwise the model will propagate a strong belief in the o2 saturation (a point-mass distribution) where in reality the belief should be weaker (wider than a point-mass distribution).

The o2 saturation noise model addresses this issue by generating the underlying distribution of true o2 saturation values corresponding to a single observation.

I modelled the noise with two sequential components:
1) gaussian noise: the true value is perturbed according to a gaussian distribution centered on that true value, with a standard deviation that has to be determined.
2) rounding: the perturbed value is rounded to the nearest integer.

Using this generative model, I can draw the underlying distribution by rejection sampling.
0) Pick an o2 saturation observation. This is the measured value.
1) Sample a value from a uniform distribution in the range [90-100]. This is the true value.
2) Add gaussian noise to the true value with N(true value, std).
3) Round the noisy true value to the nearest integer.
4) If the obtained value is equal to the measured value, keep the true value, otherwise discard it.
5) Repeat this algorithm one million times to get the distribution of true values that lead to the observed measured value.
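
The steps above can be sketched in a few lines of NumPy. This is a minimal, vectorised illustration and not the project's `o2sat.generate_underlying_uo2sat_distribution`; the function name `sample_true_o2sat`, its parameters and the fixed seed are assumptions made for the example, and the 100% edge effect is not handled.

```python
import numpy as np


def sample_true_o2sat(o2sat_obs, std_gauss, n_draws=1_000_000,
                      low=90.0, high=100.0, seed=0):
    """Rejection-sample the true values that lead to the reading `o2sat_obs`.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Step 1: draw candidate true values from a uniform distribution
    true_vals = rng.uniform(low, high, size=n_draws)
    # Step 2: add gaussian noise centred on each true value
    noisy = rng.normal(true_vals, std_gauss)
    # Step 3: round to the nearest integer, as the oximeter display would
    measured = np.rint(noisy)
    # Step 4: keep only the true values whose reading matches the observation
    return true_vals[measured == o2sat_obs]


# Step 0: pick an observation, then run all draws at once (step 5)
kept = sample_true_o2sat(o2sat_obs=95, std_gauss=0.86)
print(f"kept {kept.size} draws, mean={kept.mean():.2f}, std={kept.std():.2f}")
```

Drawing all candidates in one array replaces the explicit loop over one million repetitions; the retained array is the empirical distribution of true values for that observation.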

Show figure XX of two distributions, with and without the 100 edge effect. Explain that the std is the same for all generated distributions, except the 100 edge effect.

#### Computing the gaussian noise's std parameter
Note that the obtained underlying distribution can be used in both directions: 1) to sample measurements from a single true value, 2) to sample true values from a single measurement. Hence, all measurements of an individual are sampled from the underlying distribution corresponding to that individual, provided that the individual is healthy enough to have a constant true o2 saturation. It follows that avg_std_ID = std_dist = sqrt(std_gauss^2 + std_rounding^2).

avg_std_ID: the standard deviation of an individual's o2 saturation measurements, averaged over a subset of healthy enough (>80% FEV1 % predicted) individuals.
std_dist: standard deviation of the underlying distribution of the measurements. Ignoring the 100% boundary effect, std_dist is the same across all individuals (see figure XX).
sqrt(): mathematical relation that surfaces the relationship between the three types of uncertainty: uncertainty due to gaussian noise, uncertainty due to rounding, and total uncertainty. It can be derived from the variance of the sum of two independent random variables. In fact, whilst the gaussian noise and the rounding effects aggregate, the two phenomena are independent.
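
The relation can be written out as a short derivation. This is a sketch under the assumption that the total measurement error decomposes additively into a gaussian error $\varepsilon_g$ and a rounding error $\varepsilon_r$ (symbols introduced here for illustration) and that the two are uncorrelated:

$$
\mathrm{Var}(\varepsilon_g + \varepsilon_r)
= \mathrm{Var}(\varepsilon_g) + \mathrm{Var}(\varepsilon_r) + 2\,\mathrm{Cov}(\varepsilon_g, \varepsilon_r)
= \mathrm{Var}(\varepsilon_g) + \mathrm{Var}(\varepsilon_r)
$$

$$
\mathrm{std\_dist} = \sqrt{\mathrm{std\_gauss}^2 + \mathrm{std\_rounding}^2}
\quad\Longrightarrow\quad
\mathrm{std\_gauss} = \sqrt{\mathrm{std\_dist}^2 - \mathrm{std\_rounding}^2}
$$

For an idealised rounding error that is uniform on $[-0.5, 0.5)$, the rounding term would be $\mathrm{std\_rounding} = 1/\sqrt{12} \approx 0.29$.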

Using this equation, I empirically derived std_gauss with the following steps:
1) For each individual, compute the standard deviation of their measurements, then take the average over individuals. This is avg_std_ID.
2) Pick a std_gauss 0.5 below avg_std_ID. This is a good starting point because the std_rounding should be around 0.5.
3) Run the generative model and save the std_dist.
4) Update std_gauss and go back to 3) until the obtained std_dist equals the avg_std_ID computed in 1).
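
The manual update loop in steps 2 to 4 could also be automated. The sketch below swaps the hand-tuned updates for a plain bisection search, which is not what was done originally. The names `simulated_std_dist` and `calibrate_std_gauss` are hypothetical, and because the measured avg_std_ID is not reproduced here, the target is a stand-in generated by the simulation itself at std_gauss = 0.86.

```python
import numpy as np


def simulated_std_dist(std_gauss, o2sat_obs=95, n_draws=400_000, seed=0):
    """Std of the true values that map to `o2sat_obs` (hypothetical helper)."""
    # Re-using the same seed makes the result a deterministic function of std_gauss
    rng = np.random.default_rng(seed)
    true_vals = rng.uniform(90.0, 100.0, size=n_draws)
    measured = np.rint(rng.normal(true_vals, std_gauss))
    return true_vals[measured == o2sat_obs].std()


def calibrate_std_gauss(target_std, lo=0.05, hi=3.0, tol=1e-3):
    """Bisection on std_gauss until the simulated std_dist matches target_std."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if simulated_std_dist(mid) < target_std:
            lo = mid  # simulated spread too small: increase the gaussian noise
        else:
            hi = mid  # simulated spread too large: decrease the gaussian noise
    return 0.5 * (lo + hi)


# Stand-in target (assumption): in practice this would be the measured avg_std_ID
target = simulated_std_dist(0.86)
recovered = calibrate_std_gauss(target)
print(f"target std_dist={target:.3f}, recovered std_gauss={recovered:.3f}")
```

Bisection relies on std_dist growing with std_gauss, which holds up to Monte Carlo noise; with a real avg_std_ID as the target, the same search would return the calibrated noise level.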

I obtained std_gauss = 0.86.

### How to build the cpt
For each possible value of O2 saturation, bin up the output distribution in bins of unbiased O2 saturation, and fill its cpt.
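
As a rough illustration of filling such a table, the sketch below builds P(O2Sat | unbiased O2Sat) by sampling from the noise model for each unbiased bin. It is not the project's `o2sat.generate_o2sat_distribution`: the function name `build_cpt_downwards` is hypothetical, and clipping readings to the displayable range is an assumption about how the boundaries are treated.

```python
import numpy as np


def build_cpt_downwards(o2sat_vals, uo2sat_edges, std_gauss,
                        n_draws=20_000, seed=0):
    """Return a table with one column per unbiased bin; each column sums to 1."""
    rng = np.random.default_rng(seed)
    n_bins = len(uo2sat_edges) - 1
    cpt = np.zeros((len(o2sat_vals), n_bins))
    # One histogram bin per integer reading v, covering [v - 0.5, v + 0.5)
    meas_edges = np.append(o2sat_vals - 0.5, o2sat_vals[-1] + 0.5)
    for j in range(n_bins):
        # True values spread uniformly inside the j-th unbiased bin
        true_vals = rng.uniform(uo2sat_edges[j], uo2sat_edges[j + 1], size=n_draws)
        measured = np.rint(rng.normal(true_vals, std_gauss))
        # Assumption: readings outside the table range are clipped to its ends
        measured = np.clip(measured, o2sat_vals[0], o2sat_vals[-1])
        counts, _ = np.histogram(measured, bins=meas_edges)
        cpt[:, j] = counts / counts.sum()
    return cpt


o2sat_vals = np.arange(50, 101)              # possible readings, 50..100
uo2sat_edges = np.arange(50.0, 100.5, 0.5)   # 0.5-wide unbiased bins
cpt_sketch = build_cpt_downwards(o2sat_vals, uo2sat_edges, std_gauss=0.86)
print(cpt_sketch.shape, cpt_sketch.sum(axis=0)[:3])
```

Each column is normalised as it is filled, so no separate normalisation pass is needed in this variant.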

Note that sampling downwards and sampling upwards produce the same results. Sampling downwards is, however, less expensive.

I computed the gaussian noise's std parameter using the variance of the sum of two random variables. Whilst the gaussian noise and the rounding effects aggregate, the two phenomena are independent.
https://www.milefoot.com/math/stat/rv-sums.htm#:~:text=For%20any%20two%20random%20variables,sum%20of%20their%20expected%20values.&text=The%20proof%2C%20for%20both%20the,continuous%20cases%2C%20is%20rather%20straightforward.

In [None]:
import numpy as np
import src.data.breathe_data as breathe_data
import src.data.helpers as datah
import plotly.express as px
import plotly.graph_objects as go
import src.models.helpers as mh
import src.modelling_o2.o2sat as o2sat
import src.models.cpts.load_cpt as load_cpt

In [None]:
# df = breathe_data.build_O2_FEV1_df()
df = breathe_data.load_o2_fev1_df_from_excel()

# Find the uncertainty of an o2 sat measurement

In [None]:
# Estimate the uncertainty of an o2 sat measurement
# Take the healthiest individuals and compute the standard deviation of their measurements
# Plot a histogram of the standard deviations
# Take the mean as the std of an o2 sat measurement


def get_std(df):
    """
    If there are more than 10 values, compute the standard deviation
    Else, return NaN
    """
    if len(df) > 10:
        return df.std()
    else:
        return np.nan


df_std = datah.compute_avg(df, "FEV1 % Predicted", "%")
print(f"{df.ID.nunique()} IDs")
# Filter healthy individuals
df_std = df_std[df_std["FEV1 % Predicted"] > 80]
print(f"{df_std.ID.nunique()} healthy IDs")

stds = df_std.groupby("ID")["O2 Saturation"].agg(get_std)
stds = stds.dropna()
print(f"{len(stds)}/{df_std.ID.nunique()} IDs with > 10 measurements")

# Print avg std
print(f"Average std: {stds.mean()}")
print(f"Median std: {stds.median()}")

# Plot histogram of stds
fig = px.histogram(stds, nbins=20)  # , marginal="box")
# Update x axis
fig.update_xaxes(
    title_text="Standard deviation of the individual's<br> O2 saturation measurements"
)
fig.update_yaxes(title_text="Individuals count")
fig.update_layout(
    title_text="Distribution of standard deviations of O2 Saturation measurements",
    height=300,
    width=500,
    showlegend=False,
    font=dict(size=9),
)
fig.show()

# Define the generative noise model and tailor it to our data

In [None]:
# Randomly generate a number in the range 90.0-100.0
x = np.random.uniform(90.0, 100.0)
print(x)
# Add gaussian noise to the number with a standard deviation of 0.86
x = np.random.normal(x, 0.86)
print(x)
# Round x to the nearest 1
x = round(x)
print(x)

In [None]:
O2Sat_vals = np.arange(50, 101, 1)

# Parameters
UO2Sat = mh.VariableNode("Underlying O2 saturation (%)", 50, 100, 0.5, prior=None)
o2sat_obs = 95
std_gauss = 0.86
repetitions = 1000000

hist, bin_edges, true_o2sat_arr = o2sat.generate_underlying_uo2sat_distribution(
    UO2Sat, o2sat_obs, repetitions, std_gauss, show_std=True
)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=true_o2sat_arr,
        xbins=dict(start=UO2Sat.a, end=UO2Sat.b, size=1),
        autobinx=False,
        histnorm="probability",
    )
)

fig.update_layout(
    width=300, height=300, font=dict(size=7), title=f"P(UO2Sat | O2Sat={o2sat_obs}%)"
)
fig.update_xaxes(range=[80, UO2Sat.b], title=UO2Sat.name)
fig.update_yaxes(title="Probability", range=[0, 0.45])
# Add annotation top left
fig.add_annotation(
    x=0.0,
    y=1.15,
    xref="paper",
    yref="paper",
    showarrow=False,
    text=f"P(UO2Sat | O2Sat={o2sat_obs}%)",
    font=dict(size=10),
)
fig.show()

In [None]:
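# This cell needs O2SatVar and cpt, which are otherwise only defined in the CPT loading cell further down
O2SatVar = mh.VariableNode("O2 saturation (%)", 49.5, 100.5, 1, prior=None)
cpt = load_cpt.get_cpt([O2SatVar, UO2Sat])
o2sat_idx = 46
print(O2SatVar.get_bins_str()[o2sat_idx])
cpt[o2sat_idx, :]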

In [None]:
# Sample from the cpt to create a histogram
# Used for the thesis report images
O2SatVar = mh.VariableNode("O2 saturation (%)", 49.5, 100.5, 1, prior=None)
UO2Sat = mh.VariableNode("Underlying O2 saturation (%)", 50, 100, 0.5, prior=None)

o2sat_bin_idx = 49
print(f"O2Sat bin: {O2SatVar.get_bins_str()[o2sat_bin_idx]}")
p = cpt[o2sat_bin_idx, :]
p = p / p.sum()
o2sat_arr = UO2Sat.sample(n=1000000, p=p)
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=o2sat_arr,
        xbins=dict(start=UO2Sat.a, end=UO2Sat.b, size=UO2Sat.bin_width),
        autobinx=False,
        histnorm="probability",
    )
)

fig.update_layout(
    width=300,
    height=300,
    font=dict(size=7),
    title=f"P(UO2Sat | O2Sat = {O2SatVar.get_bins_str()[o2sat_bin_idx]})",
)
fig.update_xaxes(range=[90, UO2Sat.b], title=UO2Sat.name)
fig.update_yaxes(title="Probability", range=[0, 0.45])
# Add annotation top left
fig.add_annotation(
    x=0.0,
    y=1.15,
    xref="paper",
    yref="paper",
    showarrow=False,
    text=f"P(UO2Sat | O2Sat={round(O2SatVar.midbins[o2sat_bin_idx])}%)",
    font=dict(size=10),
)
fig.show()

# Build the CPT of O2Sat-UnbiasedO2Sat

## By sampling upwards and inverting the CPT (more expensive)
It uses the same algorithm as above but saves all the data into a table
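
The inversion step can be sketched on a toy table. This is an illustration under stated assumptions, not the notebook's exact code: the helper name `rows_to_columns` and the 3x3 table are made up, each row is taken to hold P(UO2Sat | O2Sat) for one observation, and the prior over observations is taken to be uniform unless one is passed in.

```python
import numpy as np


def rows_to_columns(p_true_given_obs, prior_obs=None):
    """Turn rows of P(true | obs) into columns of P(obs | true) via Bayes' rule."""
    n_obs = p_true_given_obs.shape[0]
    if prior_obs is None:
        # Assumption: every observation is equally likely a priori
        prior_obs = np.full(n_obs, 1.0 / n_obs)
    # Joint P(obs, true) = P(true | obs) * P(obs)
    joint = p_true_given_obs * prior_obs[:, None]
    col_sums = joint.sum(axis=0, keepdims=True)
    # Normalise each column, leaving all-zero columns at zero
    return np.divide(joint, col_sums, out=np.zeros_like(joint), where=col_sums > 0)


# Toy table: 3 observations (rows) x 3 true-value bins (columns)
toy = np.array([
    [0.75, 0.25, 0.00],
    [0.20, 0.60, 0.20],
    [0.00, 0.25, 0.75],
])
inverted = rows_to_columns(toy)
print(inverted.sum(axis=0))
```

With a uniform prior this reduces to dividing each column by its sum, which is the column normalisation applied in the cell below.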

In [None]:
# Parameters
O2Sat_vals = np.arange(50, 101, 1)
UO2Sat = mh.VariableNode("Unbiased O2 saturation (%)", 50, 100, 0.5, prior=None)
std_gauss = 0.86
repetitions = 10000
cpt = np.zeros((len(O2Sat_vals), UO2Sat.card))

for i, o2sat_obs in enumerate(O2Sat_vals):
    hist, _, _ = o2sat.generate_underlying_uo2sat_distribution(
        UO2Sat, o2sat_obs, repetitions, std_gauss, show_std=True
    )

    cpt[i, :] = hist

# Normalise cpt
normaliser = cpt.sum(axis=0)
for i, norm in enumerate(normaliser):
    if norm != 0:
        cpt[:, i] = cpt[:, i] / norm

In [None]:
O2SatVar = mh.VariableNode("O2 saturation (%)", 49.5, 100.5, 1, prior=None)
UO2Sat = mh.VariableNode("Underlying O2 saturation (%)", 50, 100, 0.5, prior=None)

cpt = load_cpt.get_cpt([O2SatVar, UO2Sat])

In [None]:
uo2sat_bin_idx = 90
fig = go.Figure()
fig.add_trace(go.Bar(x=O2Sat_vals, y=cpt[:, uo2sat_bin_idx]))
fig.update_layout(bargap=0)
fig.update_layout(
    title_text=f"P(O2Sat | UO2Sat = {UO2Sat.get_bins_str()[uo2sat_bin_idx]})",
    xaxis_title="O2 saturation (%)",
    yaxis_title="Probability",
    width=600,
    height=300,
    font=dict(size=8),
)
fig.show()

Values below 70% are not realistic, but mathematically they can be obtained with the model.

We create the CPT for O2Sat going from 50 to 100 and for UO2Sat going from 50 to 100. The 100 boundary for UO2Sat is meaningful, but the 50 boundary is not. This is not a problem, because it only affects O2Sat = [50-53] and UO2Sat = [50-57], values which should never be obtained.

In [None]:
# Save cpt in text file
np.savetxt(
    f"cpt_o2sat_50_100_uo2sat_{UO2Sat.a}_{UO2Sat.b}_{UO2Sat.bin_width}.txt",
    cpt,
    delimiter=",",
)

In [None]:
# Sample from the cpt to create a histogram
# Used for the thesis report images
O2SatVar = mh.VariableNode("O2 saturation (%)", 49.5, 100.5, 1, prior=None)
UO2Sat = mh.VariableNode("Underlying O2 saturation (%)", 50, 100, 0.5, prior=None)

uo2sat_bin_idx = 90
print(f"UO2Sat bin: {UO2Sat.get_bins_str()[uo2sat_bin_idx]}")
# NOTE: the sample merges UO2Sat bins 97 and 98, whereas the titles below label bin uo2sat_bin_idx
o2sat_arr = O2SatVar.sample(
    n=1000000, p=(cpt[:, 97] + cpt[:, 98]) / sum(cpt[:, 97] + cpt[:, 98])
)
# len(cpt[:, uo2sat_bin_idx])
# UO2Sat.card
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=o2sat_arr,
        xbins=dict(start=O2SatVar.a, end=O2SatVar.b, size=O2SatVar.bin_width),
        autobinx=False,
        histnorm="probability",
    )
)

fig.update_layout(
    width=300,
    height=300,
    font=dict(size=7),
    title=f"P(O2Sat | UO2Sat = {UO2Sat.get_bins_str()[uo2sat_bin_idx]})",
)
fig.update_xaxes(range=[90, O2SatVar.b], title=O2SatVar.name)
fig.update_yaxes(title="Probability", range=[0, 0.7])
# Add annotation top left
fig.add_annotation(
    x=0.0,
    y=1.15,
    xref="paper",
    yref="paper",
    showarrow=False,
    text=f"P(O2Sat | UO2Sat = {UO2Sat.get_bins_str()[uo2sat_bin_idx]})",
    font=dict(size=10),
)
fig.show()

## By sampling downwards from the noise model

In [None]:
# Parameters
O2Sat_vals = np.arange(50, 101, 1)
UO2Sat = mh.VariableNode("Unbiased O2 saturation (%)", 50, 100, 1, prior=None)
std_gauss = 0.86
repetitions = 100000
cpt2 = np.zeros((len(O2Sat_vals), UO2Sat.card))

for i in range(UO2Sat.card):
    hist, _, _ = o2sat.generate_o2sat_distribution(
        [50, 100], UO2Sat.get_bins_arr()[i], repetitions, std_gauss, show_std=True
    )

    cpt2[:, i] = hist

# Normalise cpt
normaliser = cpt2.sum(axis=0)
for i, norm in enumerate(normaliser):
    if norm != 0:
        cpt2[:, i] = cpt2[:, i] / norm

In [None]:
uo2sat_bin_idx = 45
fig = go.Figure()
fig.add_trace(go.Bar(x=O2Sat_vals, y=cpt2[:, uo2sat_bin_idx]))
fig.update_layout(
    title_text=f"P(O2Sat | UO2Sat = {UO2Sat.get_bins_arr()[uo2sat_bin_idx]})",
    xaxis_title="O2 saturation (%)",
    yaxis_title="Probability",
    bargap=0,
    height=300,
    width=600,
    font=dict(size=9),
)
fig.show()

In [None]:
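# Here uo2sat_bin_idx selects a row of cpt2, i.e. an O2Sat value. Normalising
# that row gives P(UO2Sat | O2Sat) under a uniform prior on UO2Sat.
uo2sat_bin_idx = 46
fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=UO2Sat.get_bins_arr(),
        y=cpt2[uo2sat_bin_idx, :] / sum(cpt2[uo2sat_bin_idx, :]),
    )
)
fig.update_layout(
    title_text=f"P(UO2Sat | O2Sat = {O2Sat_vals[uo2sat_bin_idx]})",
    xaxis_title=UO2Sat.name,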
    yaxis_title="Probability",
    bargap=0,
    height=300,
    width=600,
    font=dict(size=9),
)
fig.show()

In [None]:
cpt2[uo2sat_bin_idx, :] / sum(cpt2[uo2sat_bin_idx, :])

In [None]:
cpt2[:, uo2sat_bin_idx] / sum(cpt2[:, uo2sat_bin_idx])