# Additional topics

Want to learn more about the literature on deriving thresholds from PET scan data in Alzheimer's disease? Want to know how to create your own simulated data to test the spatial extent module with different distributions? You've come to the right place!

## Creating Gaussian simulated data

### Inspiration from realistic data

As you saw in the tutorial of the {ref}`spex module <2.spex/spex_module:Practice data>`, I generated random data to be used in the module. The goal was to mimic as closely as possible what one could reasonably expect to see in a tau-PET dataset, while making sure that `sihnpy` could derive thresholds relatively easily **AND** that I could showcase potential issues with the data.

As I describe briefly in my {ref}`short introduction to GMM<2.spex/spex_module:Introduction to Gaussian mixture modelling (GMM)>`, tau-PET data in the Alzheimer's disease spectrum can usually be described as a **bimodal** distribution, where there is a large group of individuals with *low* PET SUVR and a small group with *high* PET SUVR. As the disease progresses from normal cognition to diagnosed Alzheimer's disease, individuals will usually move from *low* to *high*. So to generate random data, we need to draw values from both a *low* and a *high* distribution.

In Python, we can do this relatively easily if we have the **mean** and **standard deviation** for both the *low* and *high* distributions. So how do we choose them? At the time of developing `sihnpy`, and in the paper in which we use this methodology,[^Stonge_2023] we were working with data from the [Alzheimer's disease neuroimaging initiative (ADNI)](https://adni.loni.usc.edu/), one of the largest repositories of open data on Alzheimer's disease, including a lot of amyloid- and tau-PET scans. In this dataset, we have participants who are at very low risk of progressing to Alzheimer's disease (participants who are cognitively unimpaired, with no discernible amyloid pathology) and participants who have already progressed to Alzheimer's disease (participants meeting the diagnostic criteria and with a lot of amyloid pathology). In `sihnpy`, the simulated data was created by slightly modifying the mean and standard deviation of these two groups of participants from ADNI. This information is available to everyone using `sihnpy` through the `datasets` module.

```python
from sihnpy import datasets
tau_data, regional_thresholds, regional_averages = datasets.pad_spex_input()
regional_averages
```

| region | mean_negative | sd_negative | mean_positive | sd_positive |
|---|---|---|---|---|
| CTX_LH_ENTORHINAL_SUVR | 1.119 | 0.11 | 1.578 | 0.321 |
| CTX_RH_ENTORHINAL_SUVR | 1.122 | 0.099 | 1.582 | 0.3 |
| CTX_LH_AMYGDALA_SUVR | 1.185 | 0.112 | 1.642 | 0.299 |
| CTX_RH_AMYGDALA_SUVR | 1.187 | 0.107 | 1.632 | 0.272 |
| CTX_LH_FUSIFORM_SUVR | 1.185 | 0.116 | 1.69 | 0.506 |
| CTX_RH_FUSIFORM_SUVR | 1.175 | 0.078 | 1.665 | 0.485 |
| CTX_LH_PARAHIPPOCAMPAL_SUVR | 1.097 | 0.095 | 1.45 | 0.292 |
| CTX_RH_PARAHIPPOCAMPAL_SUVR | 1.091 | 0.079 | 1.442 | 0.313 |
| CTX_LH_INFERIORTEMPORAL_SUVR | 1.199 | 0.132 | 1.799 | 0.548 |
| CTX_RH_INFERIORTEMPORAL_SUVR | 1.19 | 0.078 | 1.774 | 0.554 |


These numbers were used to generate the `sihnpy` data, with a few exceptions. The goal was simply to get "realistic" data.

### Generating randomized data

From there, we have another choice to make. In the PREVENT-AD data, we have 308 participants in the Open Dataset. We need to choose how many of them will be assigned a "normal" value and how many an "abnormal" value. This choice is arbitrary, so feel free to modify the numbers as needed. I created a simple function that generates the randomized data:

```python
import numpy as np
from scipy import stats

def gen_random_population(mean1, sd1, size1, mean2, sd2, size2):
    """Generates random data mimicking tau-PET SUVR data, of the size of the PREVENT-AD Open
    dataset, by pulling 308 sample data points from 2 randomly generated populations.

    Parameters
    ----------
    mean1 : float
        Mean of the first distribution
    sd1 : float
        Standard deviation of the first distribution
    size1 : int
        Size of the first population
    mean2 : float
        Mean of the second distribution
    sd2 : float
        Standard deviation of the second distribution
    size2 : int
        Size of the second population

    Returns
    -------
    numpy.array
        Numpy array containing the 308 random data points; 100 AD, 208 CU
    """

    # Create a large random population for CU (negative) and for AD (positive),
    # then keep 208 values from the first and 100 values from the second
    pop_sim_neg_data = np.random.choice(stats.norm.rvs(mean1, sd1, size1, random_state=667), 208)
    pop_sim_pos_data = np.random.choice(stats.norm.rvs(mean2, sd2, size2, random_state=667), 100)

    # Concatenate so the 208 CU values come first, followed by the 100 AD values
    return np.concatenate((pop_sim_neg_data, pop_sim_pos_data))
```

Let's break down how this function works. The core of the function is `scipy`'s `stats.norm.rvs` function, which generates random samples based on a given mean, standard deviation and size. I forced the `random_state` to be constant here so that the samples generated by `stats.norm.rvs` remain the same throughout testing, but it is up to you to do the same or not. Note that `np.random.choice` is not seeded in this function, so the values it keeps can still differ from one run to the next unless you also seed NumPy's random number generator.

Then, once the random samples are generated, we use `numpy`'s `random.choice` to randomly pick the number of data points we want (for us, we arbitrarily say there are 100 participants with *high* SUVR values and 208 participants with *low* SUVR values). This size is different from the one passed to `stats.norm.rvs`: in `stats.norm.rvs`, the size is **how many random samples it should create**, while the size in `np.random.choice` is how many samples `numpy` should keep.

From the tests I did before generating the final data, it is usually better to give a very large number of samples to `stats.norm.rvs` (e.g., 10,000) so that it gives more variety and stability to the final values `numpy` chooses.
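As a minimal sketch of the same two-step idea (create a large pool, then keep a subsample), the variant below uses NumPy's `Generator` API so that both steps are driven by one seed. The function name `gen_random_population_seeded` and its `n_neg`, `n_pos` and `seed` parameters are hypothetical; this is not the code used to create the `sihnpy` data and it will not reproduce those exact values.

```python
import numpy as np

def gen_random_population_seeded(mean1, sd1, size1, mean2, sd2, size2,
                                 n_neg=208, n_pos=100, seed=667):
    """Hypothetical seeded variant: draw two large pools, then subsample each."""
    rng = np.random.default_rng(seed)  # one generator drives every random step

    # Step 1: large pools of candidate values for the low and high distributions
    pool_neg = rng.normal(loc=mean1, scale=sd1, size=size1)
    pool_pos = rng.normal(loc=mean2, scale=sd2, size=size2)

    # Step 2: keep only the number of data points we want from each pool
    # (sampling with replacement, which is also the default of np.random.choice)
    sample_neg = rng.choice(pool_neg, size=n_neg)
    sample_pos = rng.choice(pool_pos, size=n_pos)

    return np.concatenate((sample_neg, sample_pos))

first = gen_random_population_seeded(1.119, 0.110, 10000, 1.578, 0.321, 10000)
second = gen_random_population_seeded(1.119, 0.110, 10000, 1.578, 0.321, 10000)
print(first.shape)                    # (308,)
print(np.array_equal(first, second))  # True: same seed, same data
```

Passing a different `seed` gives a different but equally reproducible dataset, which can be handy when checking that a result does not depend on one particular random draw.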

Once you have that function, you can simply generate a `pandas.DataFrame` or a `dict` containing the random data. Here is the code I used for the data in `sihnpy`:

```python
import pandas as pd

dict_random_tau_data = {}

dict_random_tau_data["CTX_LH_ENTORHINAL_SUVR"] = gen_random_population(mean1=1.119, sd1=0.110, size1=10000, mean2=1.578, sd2=0.321, size2=10000)
dict_random_tau_data["CTX_RH_ENTORHINAL_SUVR"] = gen_random_population(mean1=1.122, sd1=0.099, size1=10000, mean2=1.582, sd2=0.300, size2=10000)
dict_random_tau_data["CTX_LH_AMYGDALA_SUVR"] = gen_random_population(mean1=1.125, sd1=0.112, size1=10000, mean2=1.642, sd2=0.299, size2=10000)
dict_random_tau_data["CTX_RH_AMYGDALA_SUVR"] = gen_random_population(mean1=1.127, sd1=0.107, size1=10000, mean2=1.632, sd2=0.272, size2=10000)
dict_random_tau_data["CTX_LH_FUSIFORM_SUVR"] = gen_random_population(mean1=1.185, sd1=0.112, size1=10000, mean2=1.690, sd2=0.506, size2=10000)
dict_random_tau_data["CTX_RH_FUSIFORM_SUVR"] = gen_random_population(mean1=1.175, sd1=0.078, size1=10000, mean2=1.665, sd2=0.485, size2=10000)
dict_random_tau_data["CTX_LH_PARAHIPPOCAMPAL_SUVR"] = gen_random_population(mean1=1.097, sd1=0.095, size1=10000, mean2=1.450, sd2=0.292, size2=10000)
dict_random_tau_data["CTX_RH_PARAHIPPOCAMPAL_SUVR"] = gen_random_population(mean1=1.091, sd1=0.079, size1=10000, mean2=1.442, sd2=0.313, size2=10000)
dict_random_tau_data["CTX_LH_INFERIORTEMPORAL_SUVR"] = gen_random_population(mean1=1.199, sd1=0.132, size1=10000, mean2=1.799, sd2=0.548, size2=10000)
dict_random_tau_data["CTX_RH_INFERIORTEMPORAL_SUVR"] = gen_random_population(mean1=1.190, sd1=0.078, size1=10000, mean2=1.774, sd2=0.554, size2=10000)
dict_random_tau_data["CTX_LH_MIDDLETEMPORAL_SUVR"] = gen_random_population(mean1=1.161, sd1=0.130, size1=10000, mean2=1.671, sd2=0.523, size2=10000)
dict_random_tau_data["CTX_RH_MIDDLETEMPORAL_SUVR"] = gen_random_population(mean1=1.162, sd1=0.077, size1=10000, mean2=1.674, sd2=0.516, size2=10000)
dict_random_tau_data["CTX_LH_PRECENTRAL_SUVR"] = gen_random_population(mean1=0.997, sd1=0.070, size1=10000, mean2=1.139, sd2=0.264, size2=10000)
dict_random_tau_data["CTX_RH_PRECENTRAL_SUVR"] = gen_random_population(mean1=1.200, sd1=0.045, size1=10000, mean2=0.995, sd2=0.074, size2=10000) #Mimics inverted distribution
dict_random_tau_data["CTX_LH_POSTCENTRAL_SUVR"] = gen_random_population(mean1=0.972, sd1=0.074, size1=10000, mean2=1.084, sd2=0.252, size2=10000)
dict_random_tau_data["CTX_RH_POSTCENTRAL_SUVR"] = gen_random_population(mean1=1.091, sd1=0.286, size1=10000, mean2=1.091, sd2=0.286, size2=10000) #Mimics single distribution

simulated_data = pd.DataFrame(data=dict_random_tau_data) #Fit in a dataframe
simulated_data
```

| | CTX_LH_ENTORHINAL_SUVR | CTX_RH_ENTORHINAL_SUVR | CTX_LH_AMYGDALA_SUVR | CTX_RH_AMYGDALA_SUVR | CTX_LH_FUSIFORM_SUVR | CTX_RH_FUSIFORM_SUVR | CTX_LH_PARAHIPPOCAMPAL_SUVR | CTX_RH_PARAHIPPOCAMPAL_SUVR | CTX_LH_INFERIORTEMPORAL_SUVR | CTX_RH_INFERIORTEMPORAL_SUVR | CTX_LH_MIDDLETEMPORAL_SUVR | CTX_RH_MIDDLETEMPORAL_SUVR | CTX_LH_PRECENTRAL_SUVR | CTX_RH_PRECENTRAL_SUVR | CTX_LH_POSTCENTRAL_SUVR | CTX_RH_POSTCENTRAL_SUVR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.051291 | 1.079570 | 1.148615 | 1.387371 | 1.234989 | 1.222629 | 0.997319 | 1.238408 | 1.193029 | 1.189048 | 1.176353 | 1.068580 | 1.043006 | 1.193375 | 0.956099 | 1.303371 |
| 1 | 1.261927 | 1.128798 | 1.108885 | 1.094912 | 1.169001 | 1.110779 | 1.032457 | 1.114748 | 1.313168 | 1.245361 | 1.130955 | 1.215578 | 1.070534 | 1.210303 | 0.735064 | 0.525758 |
| 2 | 0.913342 | 1.232583 | 1.113257 | 1.125692 | 1.121716 | 1.122116 | 1.239292 | 1.101351 | 1.437924 | 1.237383 | 1.407009 | 1.041744 | 0.965578 | 1.205039 | 0.848469 | 1.489493 |
| 3 | 1.244253 | 1.095530 | 1.273662 | 0.976596 | 1.240247 | 1.218630 | 0.968086 | 1.073787 | 1.289952 | 1.209346 | 0.961912 | 1.262703 | 1.073792 | 1.101692 | 0.944882 | 1.112140 |
| 4 | 1.157736 | 1.385124 | 1.044202 | 1.251263 | 1.036178 | 1.223966 | 1.228288 | 1.084805 | 1.398520 | 1.274703 | 1.215081 | 1.195192 | 0.922243 | 1.161125 | 1.106643 | 1.203184 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303 | 1.780437 | 1.470793 | 1.810637 | 1.948906 | 2.205414 | 1.974782 | 1.251331 | 0.981840 | 2.141611 | 3.032395 | 1.702946 | 0.695836 | 1.218570 | 0.893558 | 0.557759 | 0.907017 |
| 304 | 1.737320 | 1.264792 | 0.968907 | 1.809377 | 1.337959 | 1.933215 | 1.777517 | 1.088886 | 1.668091 | 2.180018 | 1.428016 | 1.467454 | 1.083992 | 1.022499 | 1.463762 | 1.388066 |
| 305 | 1.317423 | 1.635983 | 1.068091 | 2.225916 | 1.712976 | 2.139381 | 1.617846 | 1.525750 | 1.128397 | 2.026166 | 0.426102 | 1.897636 | 1.108812 | 0.978606 | 1.017627 | 0.814425 |
| 306 | 1.624967 | 1.661325 | 1.087939 | 2.092671 | 1.889238 | 1.318229 | 0.871777 | 1.447002 | 2.302911 | 2.067462 | 1.571238 | 1.570629 | 1.119042 | 0.937101 | 1.239425 | 1.391649 |


Note that I purposefully changed two distributions in the data to simulate potential situations: first I inverted the *low* and *high* distributions for the `CTX_RH_PRECENTRAL_SUVR`, and I also forced a single distribution for the `CTX_RH_POSTCENTRAL_SUVR`. It is simulated data: it's up to you to try and simulate a problem you want to check using the spatial extent.

From there you can simply merge this data to the IDs of the PREVENT-AD (or your own IDs) and you have a fully simulated dataset!
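The merge itself can be sketched as follows, assuming a simulated table with one row per participant. The `sub-001`-style IDs and the `participant_id` index name are hypothetical placeholders for your own IDs, and the two-region table is only a stand-in for the full simulated data. The shuffling step is optional: it is there because `gen_random_population` returns the 208 *low* values first and the 100 *high* values last.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the simulated table: 308 rows, two regions only
simulated = pd.DataFrame({
    "CTX_LH_ENTORHINAL_SUVR": rng.normal(1.119, 0.110, 308),
    "CTX_RH_ENTORHINAL_SUVR": rng.normal(1.122, 0.099, 308),
})

# Hypothetical participant IDs; replace with your own list of IDs
ids = [f"sub-{i:03d}" for i in range(1, len(simulated) + 1)]

# Optional: shuffle the rows so that row position says nothing about group
shuffled = simulated.sample(frac=1, random_state=0).reset_index(drop=True)

# Attach the IDs as the index of the table
shuffled.index = pd.Index(ids, name="participant_id")

print(shuffled.shape)  # (308, 2)
```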

```{admonition} Advanced topics: Going further than two components
Once you understand the concepts behind the spatial extent, you can run with it and create your own. For instance, what if your data had not 1, not 2, but 3 distributions? You could think of deriving multiple sets of thresholds (low to medium, medium to high for instance).

Accordingly, you can use the examples on this page to build new simulated data fitting your needs, on which you can test the functions.
```
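As a rough sketch of what three-component simulated data could look like, the example below draws *low*, *medium* and *high* groups and checks that a three-component Gaussian mixture recovers them. The means, standard deviations and group sizes are made-up numbers for illustration, and `scikit-learn`'s `GaussianMixture` is used here as a stand-in, not as a description of what `sihnpy` does internally.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Made-up (mean, sd, n) for the low, medium and high groups
components = [(1.10, 0.08, 180), (1.60, 0.10, 80), (2.30, 0.15, 48)]
data = np.concatenate([rng.normal(m, s, n) for m, s, n in components])

# Fit a three-component mixture; n_init=5 restarts the fit a few times
# and keeps the best solution
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0)
gmm.fit(data.reshape(-1, 1))

# Sort the estimated means so they read low, medium, high
estimated_means = np.sort(gmm.means_.ravel())
print(estimated_means)
```

With groups this well separated, the estimated means should land close to the values used for simulation. Bringing the means closer together or widening the standard deviations is a simple way to test how the functions behave when the components overlap.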

## Literature - Abnormality in PET data

`sihnpy` brings together tools meant to serve the neuroimaging field as broadly as possible. As such, while I talk a bit about the literature behind each module, I don't go in depth into the reasoning behind some of the modules, but rather provide resources and references for going further.

That said, I feel that learning a bit more about the Alzheimer's literature on the topic may help guide you in **choosing the right methodology and thresholds** for your own data, whatever population you apply it to. So I created this extra section, which discusses the Alzheimer's literature and how to choose your thresholds.

Studies of pathology in Alzheimer's disease usually focus on the two main pathological hallmarks: amyloid and tau pathology. Some studies have used Gaussian mixture modelling to generate thresholds that determine what is **"normal"** or **"abnormal"**. For amyloid pathology, for instance, authors have generally opted for thresholds based on a 90% probability of belonging to the **"normal"** distribution.[^Ozlen_2022] [^Mormino_2014] [^Villeneuve_2015] This is a somewhat more liberal threshold, but it follows the rationale that once the probability that a person belongs to the "normal" distribution drops, we become uncertain whether the person can still be considered **"normal"**. See annotated graph below:

INSERT ENTORHINAL GRAPH

To use this type of threshold in `sihnpy`, you simply have to set the "abnormality" threshold to `0.1` (i.e., the complement of a 90% probability of belonging to the "normal" distribution is a 10% probability of belonging to the "abnormal" distribution).

A different philosophy for setting these thresholds is to go directly with the probability of belonging to the **"abnormal"** distribution (i.e., the philosophy behind `sihnpy`'s `spatial_extent` module).[^Vogel_2020] [^Franzmeier_2020] This usually yields slightly more conservative thresholds, as the probability of abnormality increases with PET uptake. See annotated graph below:

INSERT ENTORHINAL GRAPH

`sihnpy` is already set so that any probability thresholds given reflect the probability of belonging to the "abnormal" distribution. While previous work used this method, [^Vogel_2020] [^Franzmeier_2020] they did not set thresholds to be used. In our recent work, we opted for 50% probability of belonging to the "abnormal" distribution, but we found similar results using a conservative threshold at 90%. [^Stonge_2023]

So what should you choose?

That really depends on whether you need a more liberal/sensitive threshold or a more conservative/specific threshold. You also have to be mindful of the overlap between the two distributions. In some cases, the two distributions will overlap quite a lot, meaning that the probabilities may yield thresholds of very low values. See example below:

INSERT OVERLAP DISTRIB IMAGE

**My recommendation is to choose 1 main threshold and 2-3 other thresholds to replicate your results**. That way you can make sure that the thresholds set make sense with your data. **My other recommendation is to always look at the data to ensure that the thresholds you derive are biologically plausible**.
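One way to sketch this kind of check is to fit a two-component mixture and count how many participants are flagged at a main threshold and at a few alternatives. The example below uses `scikit-learn`'s `GaussianMixture` on data simulated with entorhinal-like parameters; it is an illustration under those assumptions, not the `sihnpy` implementation, and the choice of `0.1`, `0.5` and `0.9` simply mirrors the values discussed above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Simulated region: 208 "low" and 100 "high" values (entorhinal-like parameters)
suvr = np.concatenate([rng.normal(1.119, 0.110, 208),
                       rng.normal(1.578, 0.321, 100)])
X = suvr.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

# The "abnormal" component is taken to be the one with the higher mean
abnormal = int(np.argmax(gmm.means_.ravel()))
prob_abnormal = gmm.predict_proba(X)[:, abnormal]

# One main threshold (0.5) and two alternatives to check that conclusions hold
n_positive = {}
for prob in (0.1, 0.5, 0.9):
    n_positive[prob] = int((prob_abnormal >= prob).sum())
    print(f"P(abnormal) >= {prob}: {n_positive[prob]} of {len(suvr)} positive")
```

Because the "high" component is wider than the "low" one, unusually low SUVR values can also receive a high probability of abnormality, which is one more reason to plot the data and check that the flagged participants make biological sense.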


## References

Need more info? Make sure to go read these sources:

[^Ozlen_2022]: REF
[^Mormino_2014]: REF
[^Villeneuve_2015]: REF
[^Vogel_2020]: REF
[^Franzmeier_2020]: REF
[^Stonge_2023]: REF