# Spatial extent analysis

Ready to increase the "spatial extent" of your knowledge on this? ... Sorry not my best one.

You will find below concrete methods and detailed tutorials to apply the spatial extent methods to your data and what to do when you encounter specific (potentially problematic) situations.

Note that, by nature, the methodologies you can use to derive thresholds are almost... well... infinite. Which one to use really depends on the data you have and whether the method you choose is appropriate for it. In `sihnpy`, I chose to integrate two. However, when possible, I point out ways users can modify their scripts to use other thresholding methods. Depending on the demand, I will also add more methods in the future.

## Outline

As with all other modules in `sihnpy`, some data is included in the package so you can practice the different aspects of the module before moving on to your data. I will first take a moment to describe the data included.

Then, we will jump into the spatial extent. Using the spatial extent for your data is divided into two steps:

| **Derive thresholds** | **Apply thresholds** |
|:-------|:--------|
| Use a method to establish a singular <br> threshold for each brain region | Apply the thresholds from the first <br> step and compute spatial extent measures |

The thresholds in `sihnpy` are either derived using a **Gaussian mixture modelling** approach, or assume that thresholds were **derived from a normative population**. Both ways will be described in detail below.
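To make the two steps a bit more concrete, here is a minimal sketch of the second step on made-up numbers. It assumes that one threshold per region is already available, and that the spatial extent measure is simply the number of regions above their threshold for each participant. The region names, values and variable names are illustrative only; this is not `sihnpy`'s own code.

```python
import pandas as pd

# Made-up SUVR values for three participants and three regions
suvr = pd.DataFrame(
    {
        "region_a": [1.10, 1.45, 1.80],
        "region_b": [1.05, 1.10, 1.60],
        "region_c": [0.95, 1.00, 1.02],
    },
    index=["sub-01", "sub-02", "sub-03"],
)

# One made-up threshold per region (what step 1 would give us)
thresholds = pd.Series({"region_a": 1.30, "region_b": 1.25, "region_c": 1.20})

# Step 2: compare each region to its own threshold, then count regions per participant
abnormal = suvr.gt(thresholds, axis=1)
spatial_extent = abnormal.sum(axis=1)
print(spatial_extent)
```

Because each region is compared to its own threshold, a value that counts as abnormal in one region may be perfectly normal in another.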

## Practice data

In other `sihnpy` modules, real data from a small subset (15) of the PREVENT-AD Open Dataset is used. However, this method was developed using positron emission tomography data, specifically for tau pathology. While it is in the plans for the future to have this data available in the PREVENT-AD, it is currently not available in the data release. Furthermore, real data may not show all of the issues that can arise from some specific situations.

Have no fear though! I have worked hard to simulate tau-PET data for PREVENT-AD participants so you can get a realistic feel for using this module. As of now, simulated tau positron emission tomography (PET) data is available for the 308 participants of the PREVENT-AD Open Dataset. Curious about how this was done or want to understand more about this data? {ref}`Find more details here <2.spex/spex_module:Simulating data>`.

As a quick primer on PET data, the main thing you should know is that the values simulated in `sihnpy` are called **Standardized Uptake Value Ratios (SUVR)**. They are a measure of how much of the PET tracer is taken up (i.e., how much pathology there is) in a given brain region compared to the uptake in a reference region that does not accumulate pathology. An SUVR close to or below 1 indicates very low levels of pathology, while higher values represent more and more pathology. Note that the uptake varies between regions, **so thresholds will change depending on the region**.
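As a small worked example of the ratio described above, the sketch below divides made-up regional uptake values by a made-up reference uptake. The numbers and variable names are purely illustrative and do not come from the PREVENT-AD data.

```python
import numpy as np

# Made-up mean tracer uptake in three target regions (arbitrary units)
regional_uptake = np.array([12.0, 15.0, 9.0])
# Made-up mean uptake in a reference region assumed not to accumulate pathology
reference_uptake = 10.0

# SUVR: uptake in each region divided by uptake in the reference region
suvr = regional_uptake / reference_uptake
print(suvr)
```

With these numbers, the three regions get SUVRs of 1.2, 1.5 and 0.9: the last region shows less uptake than the reference region, which is why values around or below 1 are read as very low pathology.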

```{warning}
Note that `sihnpy` provides data to practice using the spatial extent module. While the PREVENT-AD participants are used, the data available for this method **is simulated data** (i.e., the numbers observed are fake; they were randomly generated to fit the purpose of the tutorials). As a general rule for `sihnpy`, and especially for this module, **only use the data provided to help you practice using the module, not to conduct or publish actual research**.
```

## Deriving thresholds
### Introduction to Gaussian mixture modelling (GMM)

The main method proposed by `sihnpy` to derive thresholds is to use **Gaussian mixture modelling (GMM)**. The rationale behind this method is that data points in a dataset belong to a set of Gaussian (a.k.a. normal) distributions (a.k.a. distinct populations). GMMs are often referred to as *soft* clustering algorithms; contrary to other clustering algorithms, GMMs assign **probabilities that each data point belongs to a specific cluster**. This approach is useful as it allows some granularity on how certain we are that a participant belongs to a specific group. I won't go much further into how GMMs actually work, as it is beyond the purview of `sihnpy`, but I encourage you to go read [`scikit-learn`'s documentation](https://scikit-learn.org/stable/modules/mixture.html#gmm) to learn more.
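To illustrate what "soft" clustering means in practice, here is a minimal sketch using `scikit-learn` directly on synthetic data. The data is randomly generated for the example (it is not the `sihnpy` practice data), and the variable names are my own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal data: many low values (tight) and fewer high values (spread out)
values = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)])
X = values.reshape(-1, 1)  # scikit-learn expects a 2D array (n_samples, n_features)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm.predict(X)       # "hard" clustering: one label per data point
soft_probs = gmm.predict_proba(X)  # "soft" clustering: one probability per cluster
print(soft_probs[:3].round(3))     # each row sums to 1
```

The `predict_proba` output is what makes GMMs useful for thresholding: instead of a yes/no label, each data point gets a degree of certainty.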

So GMMs find clusters in the data. But what does that have to do with finding thresholds?

If you remember the {ref}`introduction to spatial extent <2.spex/spex_intro:Definitions>`, we know that **there are two distinct distributions in the data: people with low values of pathology (tight spread) and people with high values of pathology (spread out).** You can actually observe this visually in the data. Here is an example from the data included in `sihnpy` (i.e., the simulated distribution of tau-PET values in the entorhinal cortex).

INSERT HISTOGRAM IMAGE HERE

Here, there seem to be two distributions in the data: our **low** distribution in green (i.e., "normal") and our **high** distribution in red (i.e., "abnormal"). That's where the thresholding comes in. Since GMMs assign each participant a probability of belonging to either cluster, we can set a threshold **based on how confident we are that a participant has abnormal values in the marker of interest**. For instance, we could be very conservative and consider abnormal only the participants who have more than 90% probability of being in the abnormal distribution. Once we decide on the probability we want to set as a threshold, *we need to figure out how that probability translates to a threshold in our original scale*. To do so, we find the participant whose probability of being abnormal is closest to our probability threshold, and take their value in the original scale. In the `sihnpy` PREVENT-AD simulated data, this would mean taking the SUVR value of the participant with the probability closest to the probability threshold. Below is an illustrated explanation of this process.

INSERT MODIFIED FIGURE 1 FROM MANUSCRIPT
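The same idea can be sketched in a few lines with `scikit-learn` and `numpy`. This is a minimal illustration on synthetic data, assuming the "abnormal" component is the one with the higher mean and using a 90% probability threshold; it is not the `sihnpy` implementation, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic values standing in for one brain region
values = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)])
X = values.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Assumption: the component with the higher mean is the "abnormal" one
abnormal = int(np.argmax(gmm.means_.ravel()))
prob_abnormal = gmm.predict_proba(X)[:, abnormal]

prob_threshold = 0.90
# Data point whose probability of being abnormal is closest to 90%
closest = int(np.argmin(np.abs(prob_abnormal - prob_threshold)))
value_threshold = values[closest]  # threshold expressed in the original scale
print(value_threshold)
```

Note that the threshold is always an observed value from the sample, so how close its probability is to 90% depends on how densely the data covers that part of the distribution.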

This is the process implemented in `sihnpy` and detailed below. If you already have your own thresholds that you wish to apply to the data, you can skip ahead to the {ref}`Applying thresholds <2.spex/spex_module:Applying thresholds>` section.

#### Limitations

GMMs are really great, but they come with some assumptions and limitations.

| **Limitations** | **Possible fixes** |
|:-------|:--------|
| Needs a **clear** bimodal distribution | None |
| A large sample size\* is needed, particularly <br> when the bimodal distribution is not very evident | Increase sample size |
| Needs more people with "normal" than "abnormal" values, <br> as the algorithm will classify abnormality wrongly otherwise | Can be fixed in code, but estimates might not be great <br> regardless |
| `sihnpy` expects higher values to be "abnormal" | Can be fixed in the code or its interpretation |

\* Note on bimodal distributions and sample sizes: there are no guidelines on what constitutes an appropriate sample size for a GMM. Some estimates I saw online mention (appropriately) that it depends on the number of parameters used in the clustering, the number of clusters we expect, etc. Furthermore, I am not sure that sample size is the most important characteristic when deciding whether to use a GMM. What matters most is that there is a **clear bimodal** distribution. Having a larger sample may help make that evident, but a clear bimodal distribution at smaller sample sizes may still work. That said, from what I have seen from others using this method, sample sizes below 100 generally do not perform very well.


### Steps to derive thresholds with GMMs

Now that we got that out of the way, let's get down to it! I will take you through the steps needed to derive the thresholds using GMM models.

```{important}
You will notice as you explore this module that almost all the functions used to derive clusters with the GMM have some sort of **fix** that can be applied. My recommendation is that when you use the `spex` module, you try to run all the functions without fixing at first. One of the functions, `spex.gmm_histograms`, produces important graphs that can be used to diagnose issues with your data, and you can then make sure that applying the fixes proposed throughout the module is right for you.
```

#### 1. Get the data

The first step is to get the data we need to generate thresholds for. Fortunately, the `sihnpy.datasets` module already has some ready for us. You can simply download the data using the following:

In [1]:
from sihnpy import datasets

tau_data, regional_thresholds, regional_averages = datasets.pad_spex_input()

The function returns three `pandas.DataFrame` objects. The only one we need for this part is the first one, `tau_data`. The second one, `regional_thresholds` will be discussed for {ref}`applying thresholds from normative populations <2.spex/spex_module:Introduction to pre-determined (normative sample) thresholds>`. The last one will really only be useful if you take an interest in {ref}`simulating your own data <2.spex/spex_module:Simulating data>`.

So for now, let's focus on the `tau_data`. 

In [2]:
tau_data

Unnamed: 0_level_0,sex,test_language,handedness_score,handedness_interpretation,CTX_LH_ENTORHINAL_SUVR,CTX_RH_ENTORHINAL_SUVR,CTX_LH_AMYGDALA_SUVR,CTX_RH_AMYGDALA_SUVR,CTX_LH_FUSIFORM_SUVR,CTX_RH_FUSIFORM_SUVR,CTX_LH_PARAHIPPOCAMPAL_SUVR,CTX_RH_PARAHIPPOCAMPAL_SUVR,CTX_LH_INFERIORTEMPORAL_SUVR,CTX_RH_INFERIORTEMPORAL_SUVR,CTX_LH_MIDDLETEMPORAL_SUVR,CTX_RH_MIDDLETEMPORAL_SUVR,CTX_LH_PRECENTRAL_SUVR,CTX_RH_PRECENTRAL_SUVR,CTX_LH_POSTCENTRAL_SUVR,CTX_RH_POSTCENTRAL_SUVR
participant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
sub-5458966,Male,French,80.00,Right-handed,1.111972,1.120199,1.006147,1.330316,1.322257,1.208377,0.856778,1.149150,1.200685,1.170536,1.136680,1.167629,0.836766,1.181594,0.964623,1.264212
sub-2424540,Female,French,100.00,Right-handed,1.279463,1.238721,1.118358,1.176036,1.064330,1.203981,0.939988,0.965154,1.143115,1.354172,1.189367,1.305499,1.008217,1.265188,0.903880,0.982667
sub-7855613,Female,French,90.00,Right-handed,1.165918,1.074124,1.133187,1.239481,1.057046,1.072006,0.919426,1.051297,1.188624,1.213766,1.178537,1.122608,0.994861,1.224359,1.039233,1.018787
sub-3137570,Male,French,90.00,Right-handed,1.057761,1.058959,1.003114,1.225939,0.950004,1.283570,1.173269,1.108080,1.127921,1.106209,1.007086,1.103633,0.906591,1.236180,0.985742,1.518770
sub-9650197,Female,French,100.00,Right-handed,1.115381,1.106487,1.214722,1.359531,1.346469,1.111211,1.009351,1.172829,1.176183,1.283605,1.016241,1.170783,1.058830,1.208158,0.861014,1.271836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sub-5336241,Female,French,-30.00,Ambidextrous,1.755116,1.791774,1.892483,0.914250,2.088089,1.693487,1.021844,1.371482,1.587848,2.456752,1.338308,1.571080,0.698112,1.045121,0.558954,0.849271
sub-1002928,Female,French,100.00,Right-handed,1.725995,1.665045,1.567078,1.379281,2.359009,1.743699,1.314826,1.472280,2.517382,1.227152,1.536081,1.932241,1.442845,1.076464,0.985675,0.800467
sub-1283278,Female,English,80.00,Right-handed,1.763810,1.557945,1.831518,1.901642,1.960012,2.085522,1.729197,1.458530,2.390055,2.020771,1.372247,1.800840,1.039855,0.974499,0.910037,0.898030
sub-9101699,Male,French,57.89,Right-handed,1.658679,1.751766,1.718346,1.842829,0.516473,1.770625,1.566308,1.269817,2.208593,1.718491,2.319590,0.499457,0.777812,1.049930,1.341256,1.253567


In this dataset, we have 308 participants from the PREVENT-AD. The first few columns detail their basic demographic information available from the Open Dataset. All the other columns are the simulated tau-PET data. The data was simulated for a total of 16 brain regions: LH/RH indicates which hemisphere the region is from, while the name right after it (e.g., ENTORHINAL) is the name of the brain region we are simulating.

The first step here is to remove the demographic information. The GMM code will be applied to all the columns (except the index) that are provided to it. And well... clustering males and females in 2 groups is not really useful for our purposes...

Let's quickly do that using `pandas`:

In [3]:
import pandas as pd

tau_data.drop(labels=["sex", "test_language", "handedness_score", "handedness_interpretation"], axis=1, inplace=True) #Axis 1 specifies to drop columns

tau_data

Unnamed: 0_level_0,CTX_LH_ENTORHINAL_SUVR,CTX_RH_ENTORHINAL_SUVR,CTX_LH_AMYGDALA_SUVR,CTX_RH_AMYGDALA_SUVR,CTX_LH_FUSIFORM_SUVR,CTX_RH_FUSIFORM_SUVR,CTX_LH_PARAHIPPOCAMPAL_SUVR,CTX_RH_PARAHIPPOCAMPAL_SUVR,CTX_LH_INFERIORTEMPORAL_SUVR,CTX_RH_INFERIORTEMPORAL_SUVR,CTX_LH_MIDDLETEMPORAL_SUVR,CTX_RH_MIDDLETEMPORAL_SUVR,CTX_LH_PRECENTRAL_SUVR,CTX_RH_PRECENTRAL_SUVR,CTX_LH_POSTCENTRAL_SUVR,CTX_RH_POSTCENTRAL_SUVR
participant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
sub-5458966,1.111972,1.120199,1.006147,1.330316,1.322257,1.208377,0.856778,1.149150,1.200685,1.170536,1.136680,1.167629,0.836766,1.181594,0.964623,1.264212
sub-2424540,1.279463,1.238721,1.118358,1.176036,1.064330,1.203981,0.939988,0.965154,1.143115,1.354172,1.189367,1.305499,1.008217,1.265188,0.903880,0.982667
sub-7855613,1.165918,1.074124,1.133187,1.239481,1.057046,1.072006,0.919426,1.051297,1.188624,1.213766,1.178537,1.122608,0.994861,1.224359,1.039233,1.018787
sub-3137570,1.057761,1.058959,1.003114,1.225939,0.950004,1.283570,1.173269,1.108080,1.127921,1.106209,1.007086,1.103633,0.906591,1.236180,0.985742,1.518770
sub-9650197,1.115381,1.106487,1.214722,1.359531,1.346469,1.111211,1.009351,1.172829,1.176183,1.283605,1.016241,1.170783,1.058830,1.208158,0.861014,1.271836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sub-5336241,1.755116,1.791774,1.892483,0.914250,2.088089,1.693487,1.021844,1.371482,1.587848,2.456752,1.338308,1.571080,0.698112,1.045121,0.558954,0.849271
sub-1002928,1.725995,1.665045,1.567078,1.379281,2.359009,1.743699,1.314826,1.472280,2.517382,1.227152,1.536081,1.932241,1.442845,1.076464,0.985675,0.800467
sub-1283278,1.763810,1.557945,1.831518,1.901642,1.960012,2.085522,1.729197,1.458530,2.390055,2.020771,1.372247,1.800840,1.039855,0.974499,0.910037,0.898030
sub-9101699,1.658679,1.751766,1.718346,1.842829,0.516473,1.770625,1.566308,1.269817,2.208593,1.718491,2.319590,0.499457,0.777812,1.049930,1.341256,1.253567


Ok great! Now we only have our 16 regions with the simulated SUVR tau-PET data. We're ready to start.
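As an aside, if you would rather keep the original table untouched, an alternative is to select the SUVR columns instead of dropping the demographic ones. The sketch below shows this on a tiny made-up table that only mimics the layout of the practice data; the values are invented.

```python
import pandas as pd

# Toy table mimicking the layout of the practice data (values are made up)
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female"],
        "CTX_LH_ENTORHINAL_SUVR": [1.11, 1.28],
        "CTX_RH_ENTORHINAL_SUVR": [1.12, 1.24],
    },
    index=pd.Index(["sub-01", "sub-02"], name="participant_id"),
)

# Keep only the columns whose name ends with "_SUVR"
suvr_only = toy.loc[:, toy.columns.str.endswith("_SUVR")]
print(suvr_only.columns.tolist())
```

This returns a new `pandas.DataFrame` and leaves `toy` as it was, which can be handy if you still need the demographic columns later.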

#### 2. Estimate the GMM

The next step is to **estimate** a GMM. In `scikit-learn` terms, we need to **fit** a GMM to our data. We also verify whether the data does indeed present a bimodal distribution (i.e., fitting 2 clusters on the data works better than fitting a single cluster).
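For intuition, here is a minimal sketch of what "fitting 2 clusters works better than 1" can look like when done by hand with `scikit-learn`, comparing models with the Bayesian Information Criterion (lower is better). The data is synthetic and the helper `compare_bic` is a hypothetical name of mine, not a `sihnpy` function.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic examples: one clearly bimodal region and one unimodal region
bimodal = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.8, 0.2, 100)]).reshape(-1, 1)
unimodal = rng.normal(1.0, 0.15, 300).reshape(-1, 1)

def compare_bic(X, random_state=0):
    """Return the BIC of a 1-component and a 2-component GMM fitted to X."""
    bic_1 = GaussianMixture(n_components=1, random_state=random_state).fit(X).bic(X)
    bic_2 = GaussianMixture(n_components=2, random_state=random_state).fit(X).bic(X)
    return bic_1, bic_2

for name, X in {"bimodal": bimodal, "unimodal": unimodal}.items():
    bic_1, bic_2 = compare_bic(X)
    # Lower BIC means a better fit after penalizing model complexity
    print(name, "-> 2 components preferred:", bic_2 < bic_1)
```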

`sihnpy` makes it super easy to do this without thinking too much. The function `spex.gmm_estimation` only requires a `pandas.DataFrame` in which a GMM should be fitted to each column. We simply need to run the code below:

In [4]:
from sihnpy import spatial_extent as spex

gm_estimations, clean_data = spex.gmm_estimation(data_to_estimate=tau_data)

GMM estimation for CTX_LH_ENTORHINAL_SUVR
1-component: BIC = 136.0012272528757 | 2-components: BIC = 19.050405000530777 
GMM estimation for CTX_RH_ENTORHINAL_SUVR
1-component: BIC = 97.26552868290143 | 2-components: BIC = -63.68821541093003 
GMM estimation for CTX_LH_AMYGDALA_SUVR
1-component: BIC = 159.09515496414866 | 2-components: BIC = 18.747712888602752 
GMM estimation for CTX_RH_AMYGDALA_SUVR
1-component: BIC = 139.3931538620712 | 2-components: BIC = 6.848469295876654 
GMM estimation for CTX_LH_FUSIFORM_SUVR
1-component: BIC = 354.9936313615791 | 2-components: BIC = 136.7897619985344 
GMM estimation for CTX_RH_FUSIFORM_SUVR
1-component: BIC = 224.0243215398743 | 2-components: BIC = -42.43182354715772 
GMM estimation for CTX_LH_PARAHIPPOCAMPAL_SUVR
1-component: BIC = 55.96648057875871 | 2-components: BIC = -64.27572034876607 
GMM estimation for CTX_RH_PARAHIPPOCAMPAL_SUVR
1-component: BIC = -12.31665584933022 | 2-components: BIC = -163.2738193556711 
GMM estimation for CTX_LH_INFE

Whew that's a lot of text! And what did the function actually output?

Let's start with the objects the function outputs. We have two objects: `gm_estimations` and `clean_data`. The `gm_estimations` object is a Python dictionary that contains 1 GMM object for each column in your data.

In [5]:
gm_estimations

{'CTX_LH_ENTORHINAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_ENTORHINAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_AMYGDALA_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_AMYGDALA_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_FUSIFORM_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_FUSIFORM_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_PARAHIPPOCAMPAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_PARAHIPPOCAMPAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_INFERIORTEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_INFERIORTEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_MIDDLETEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_MIDDLETEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_PRECENTRAL_SUVR': GaussianMixture

Ok but what is a GMM object? If you are not familiar with how `scikit-learn` works, when you fit a model to some data, `scikit-learn` creates an object that holds the model's information. This is important for us later down the road. For now, you can just remember that after the estimation step, we will have 1 GMM model per column, which is stored in `gm_estimations`.
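If you are curious about what such an object holds, the sketch below fits a `GaussianMixture` on synthetic data and prints a few of the attributes `scikit-learn` stores after fitting. The data is randomly generated for the example and is not the `sihnpy` practice data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=667).fit(X)

# Parameters learned during fitting are stored as attributes ending with "_"
print(gm.weights_)      # proportion of the data attributed to each component
print(gm.means_)        # mean of each component, shape (2, 1)
print(gm.covariances_)  # variance of each component, shape (2, 1, 1) by default
print(gm.converged_)    # whether the fitting procedure converged
```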

Next, the other object output by `spex.gmm_estimation` is called `clean_data` (I'm not great with naming variables, so feel free to suggest ideas). Why clean? Because `spex.gmm_estimation` also does an important verification step: it checks whether each column **presents a bimodal distribution**. This is essential to derive the spatial extent. The function lets the user apply a correction to remove columns that do not present this bimodal distribution, and cleans the data so that it can be used later. More information on this in the next bubble.

```{admonition} Fix: Bimodal distribution verification
:class: warning

`sihnpy` verifies that your data in each column presents a bimodal distribution. Based on previous work, [^Vogel_2020] we use the Bayesian Information Criterion (BIC) to compare the fit of a 2-cluster GMM solution to that of a 1-cluster solution; a lower BIC indicates a better fit. In other words, we want to check that there are indeed 2 clusters in our data.

If the BIC of the 2-cluster solution is higher than that of the 1-cluster solution, I do not recommend keeping this region for the spatial extent. `sihnpy` offers you the option of re-running the function while removing any columns where the BIC of the 1-cluster solution is lower than the BIC of the 2-cluster solution. In the simulated data provided by `sihnpy`, there is one such region: `CTX_RH_POSTCENTRAL_SUVR`. If we were to illustrate the distribution of values in this region, it looks pretty clear that there is only 1 distribution.

INSERT IMAGE OF 1 CLUSTER HISTOGRAM OF CTX_RH_POSTCENTRAL_SUVR

Let's fix this issue below
```

In [6]:
from sihnpy import spatial_extent as spex

gm_estimations, clean_data = spex.gmm_estimation(data_to_estimate=tau_data, fix=True)

GMM estimation for CTX_LH_ENTORHINAL_SUVR
1-component: BIC = 136.0012272528757 | 2-components: BIC = 19.050405000530777 
GMM estimation for CTX_RH_ENTORHINAL_SUVR
1-component: BIC = 97.26552868290143 | 2-components: BIC = -63.68821541093003 
GMM estimation for CTX_LH_AMYGDALA_SUVR
1-component: BIC = 159.09515496414866 | 2-components: BIC = 18.747712888602752 
GMM estimation for CTX_RH_AMYGDALA_SUVR
1-component: BIC = 139.3931538620712 | 2-components: BIC = 6.848469295876654 
GMM estimation for CTX_LH_FUSIFORM_SUVR
1-component: BIC = 354.9936313615791 | 2-components: BIC = 136.7897619985344 
GMM estimation for CTX_RH_FUSIFORM_SUVR
1-component: BIC = 224.0243215398743 | 2-components: BIC = -42.43182354715772 
GMM estimation for CTX_LH_PARAHIPPOCAMPAL_SUVR
1-component: BIC = 55.96648057875871 | 2-components: BIC = -64.27572034876607 
GMM estimation for CTX_RH_PARAHIPPOCAMPAL_SUVR
1-component: BIC = -12.31665584933022 | 2-components: BIC = -163.2738193556711 
GMM estimation for CTX_LH_INFE

After the fix, we can see that the `gm_estimations` object doesn't contain the region anymore, and that the `clean_data` object also only contains 15 regions.

In [7]:
gm_estimations

{'CTX_LH_ENTORHINAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_ENTORHINAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_AMYGDALA_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_AMYGDALA_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_FUSIFORM_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_FUSIFORM_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_PARAHIPPOCAMPAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_PARAHIPPOCAMPAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_INFERIORTEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_INFERIORTEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_MIDDLETEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_RH_MIDDLETEMPORAL_SUVR': GaussianMixture(n_components=2, random_state=667),
 'CTX_LH_PRECENTRAL_SUVR': GaussianMixture

In [8]:
clean_data

Unnamed: 0_level_0,CTX_LH_ENTORHINAL_SUVR,CTX_RH_ENTORHINAL_SUVR,CTX_LH_AMYGDALA_SUVR,CTX_RH_AMYGDALA_SUVR,CTX_LH_FUSIFORM_SUVR,CTX_RH_FUSIFORM_SUVR,CTX_LH_PARAHIPPOCAMPAL_SUVR,CTX_RH_PARAHIPPOCAMPAL_SUVR,CTX_LH_INFERIORTEMPORAL_SUVR,CTX_RH_INFERIORTEMPORAL_SUVR,CTX_LH_MIDDLETEMPORAL_SUVR,CTX_RH_MIDDLETEMPORAL_SUVR,CTX_LH_PRECENTRAL_SUVR,CTX_RH_PRECENTRAL_SUVR,CTX_LH_POSTCENTRAL_SUVR
participant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
sub-5458966,1.111972,1.120199,1.006147,1.330316,1.322257,1.208377,0.856778,1.149150,1.200685,1.170536,1.136680,1.167629,0.836766,1.181594,0.964623
sub-2424540,1.279463,1.238721,1.118358,1.176036,1.064330,1.203981,0.939988,0.965154,1.143115,1.354172,1.189367,1.305499,1.008217,1.265188,0.903880
sub-7855613,1.165918,1.074124,1.133187,1.239481,1.057046,1.072006,0.919426,1.051297,1.188624,1.213766,1.178537,1.122608,0.994861,1.224359,1.039233
sub-3137570,1.057761,1.058959,1.003114,1.225939,0.950004,1.283570,1.173269,1.108080,1.127921,1.106209,1.007086,1.103633,0.906591,1.236180,0.985742
sub-9650197,1.115381,1.106487,1.214722,1.359531,1.346469,1.111211,1.009351,1.172829,1.176183,1.283605,1.016241,1.170783,1.058830,1.208158,0.861014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sub-5336241,1.755116,1.791774,1.892483,0.914250,2.088089,1.693487,1.021844,1.371482,1.587848,2.456752,1.338308,1.571080,0.698112,1.045121,0.558954
sub-1002928,1.725995,1.665045,1.567078,1.379281,2.359009,1.743699,1.314826,1.472280,2.517382,1.227152,1.536081,1.932241,1.442845,1.076464,0.985675
sub-1283278,1.763810,1.557945,1.831518,1.901642,1.960012,2.085522,1.729197,1.458530,2.390055,2.020771,1.372247,1.800840,1.039855,0.974499,0.910037
sub-9101699,1.658679,1.751766,1.718346,1.842829,0.516473,1.770625,1.566308,1.269817,2.208593,1.718491,2.319590,0.499457,0.777812,1.049930,1.341256


````{admonition} Advanced topic: GMM settings
:class: danger

By default, `sihnpy` uses the default options for the GMM set by `scikit-learn`. However, GMMs have many [different options that can be set for them](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html). 

The goal of `sihnpy` is to provide easy access to the simplest methods (i.e., the defaults of `scikit-learn`). However, if you would like to set up your GMMs differently, you can still do so outside of `sihnpy` before running the next steps. For each of the brain regions you want to include, you would simply need to create a GMM object using `scikit-learn`'s `GaussianMixture` class with the desired options. For example:

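```python
from sklearn.mixture import GaussianMixture  # GaussianMixture lives in sklearn.mixture
# data_region1: 2D array-like of shape (n_participants, 1) holding one region's values
gm_object_region1 = GaussianMixture(n_components=2, max_iter=500, init_params='k-means++', random_state=667).fit(data_region1)
```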

From there, you simply need to store the GMM objects from each region in a Python dictionary, where the dictionary keys **match the names of the regions in your data**. This is critical as the other functions from `sihnpy` **will not work otherwise**.

```python
gm_estimations = {}
gm_estimations['region1'] = gm_object_region1
```

````
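If you have many regions, the advanced setup above can be wrapped in a loop. The sketch below is one way to do it, assuming your data is a `pandas.DataFrame` with one column per region; the toy data, the region names and the chosen options are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data with hypothetical region names; replace with your own DataFrame
data = pd.DataFrame({
    "region1": np.concatenate([rng.normal(1.1, 0.1, 150), rng.normal(1.7, 0.3, 50)]),
    "region2": np.concatenate([rng.normal(1.0, 0.1, 150), rng.normal(1.5, 0.2, 50)]),
})

gm_estimations = {}
for region in data.columns:
    # scikit-learn expects a 2D array, hence the double brackets
    X = data[[region]].to_numpy()
    gm_estimations[region] = GaussianMixture(
        n_components=2, max_iter=500, random_state=667
    ).fit(X)

# The dictionary keys are the column names, in the same order
print(list(gm_estimations))
```

Using the column names directly as keys avoids typos, which matters since the keys have to match the region names in the data.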

#### 3. Cluster measures

The next step in our journey is to derive cluster measures. This step has two goals:

1) Compute averages and SD for the clusters the GMM has estimated
2) Verify that the clusters are ordered in the right way (i.e., the "abnormal" cluster has higher values than the "normal" cluster)

The first goal is mostly useful if you need to report the average values of the marker you measure for each cluster (e.g., in a demographics or other type of table), but it will also be useful for {ref}`generating histograms and taking a look at the data <2.spex/spex_module:5. (Optional) Visual verifications with histograms>`. The second goal will be super important, as I will explain below. For now, let's just run the function. We simply need the `gm_estimations` and the `clean_data` objects we generated in the previous step.

In [9]:
final_data, final_gm_estimations, gmm_measures = spex.gmm_measures(cleaned_data=clean_data, gm_objects=gm_estimations, fix=False)

Average of first component of CTX_RH_PRECENTRAL_SUVR is higher than second component.


The function runs quite silently, but we do get a message that for one region the average of the first component (i.e., our "normal" participants) is higher than the average of the second component (i.e., our "abnormal" participants). Why?

```{admonition} Fix: Inverted distributions - Part 1
:class: warning

As you noticed, `sihnpy` informs us that one region, the right precentral gyrus, has inverted components. This means that our "first" component (i.e., the component with the most data points) actually has higher values than our "second" component (i.e., the component with the fewest data points). This is problematic because the spatial extent implemented in `sihnpy` assumes that 1) abnormal values are high values and 2) there are fewer people with high values than people with low values.

This issue can arise for multiple reasons. For instance, in PET data, it may happen that a region with a lot of noisy signal may have higher values across participants, but some people, due to scan quality issues or some biological differences, may actually show low values. Here is an example:

INSERT IMAGE HERE

In such a case, you may want to remove this region, as the threshold you would get would be very low (at the right extremity of the "red" distribution), meaning a very high number of individuals would be considered positive.

It is also possible that your data has more than 2 distributions, in which case `sihnpy` may not select a "high" and "low" distribution.

**In all the above situations**, my recommendation is to **remove the region with inverted distributions**. This is simply done by re-running the function, and setting the `fix` argument to `True`. However, if you want to switch back the distributions so the higher values are considered abnormal, you can also do so {ref}`in the next step. <2.spex/spex_module:4. Extracting clustering probabilities>`

Finally, based on your data, it is also possible that, well, low values are the abnormal values. For instance, if you are using cerebrospinal fluid for amyloid, lower values are actually indicative of more pathology. If that is the case, there is no need to modify your data or fix the inverted distribution warning. **However, you will have to select your probability thresholds a bit differently.**
```

In [10]:
final_data, final_gm_estimations, gmm_measures = spex.gmm_measures(cleaned_data=clean_data, gm_objects=gm_estimations, fix=True)

Average of first component of CTX_RH_PRECENTRAL_SUVR is higher than second component.
- Fix is true, removing CTX_RH_PRECENTRAL_SUVR


In our case, I elected to remove it. In such a case, this region is also removed from the GMM estimations (`final_gm_estimations`) and from the raw data (`final_data`). If you want, you can also take a look at the averages for each region by calling its key in the dictionary:

In [11]:
gmm_measures['CTX_LH_ENTORHINAL_SUVR'] #Example with the entorhinal cortex

{'mean_comp1': 1.107276418020098,
 'mean_comp2': 1.5949715674372125,
 'sd_comp1': 0.11535371323585616,
 'sd_comp2': 0.2786392457527993}
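To give an idea of where such numbers can come from, here is a minimal sketch that reads component means and standard deviations from a fitted `GaussianMixture` and checks their ordering. I am assuming the measures are taken from the model parameters, which may differ from how `sihnpy` computes them; the data is synthetic. Keep in mind that in plain `scikit-learn` the order of the components depends on the initialization, which is one reason an ordering check is useful.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)]).reshape(-1, 1)
gm = GaussianMixture(n_components=2, random_state=667).fit(X)

means = gm.means_.ravel()               # one mean per component
sds = np.sqrt(gm.covariances_.ravel())  # covariances hold variances for 1D data

measures = {
    "mean_comp1": means[0], "mean_comp2": means[1],
    "sd_comp1": sds[0], "sd_comp2": sds[1],
}

# Ordering check: is the first component's mean above the second's?
inverted = bool(means[0] > means[1])
print(measures, inverted)
```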

#### 4. Extracting clustering probabilities

The last step before we actually find our SUVR thresholds is to find the probability that each participant belongs to the "abnormal" distribution. To do so, we simply need to use the GMM objects we estimated, and apply them to the data we want to get probabilities for. In `sihnpy`, it's as easy as the code below:

In [13]:
probability_data = spex.gmm_probs(final_data=final_data, final_gm_estimations=final_gm_estimations, fix=False)
probability_data

Unnamed: 0_level_0,CTX_LH_ENTORHINAL_SUVR,CTX_RH_ENTORHINAL_SUVR,CTX_LH_AMYGDALA_SUVR,CTX_RH_AMYGDALA_SUVR,CTX_LH_FUSIFORM_SUVR,CTX_RH_FUSIFORM_SUVR,CTX_LH_PARAHIPPOCAMPAL_SUVR,CTX_RH_PARAHIPPOCAMPAL_SUVR,CTX_LH_INFERIORTEMPORAL_SUVR,CTX_RH_INFERIORTEMPORAL_SUVR,CTX_LH_MIDDLETEMPORAL_SUVR,CTX_RH_MIDDLETEMPORAL_SUVR,CTX_LH_PRECENTRAL_SUVR,CTX_LH_POSTCENTRAL_SUVR
participant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
sub-6788676,0.137856,0.006946,0.004584,0.023419,0.053343,0.046425,0.056590,0.009442,0.042380,0.041871,0.080851,0.038981,0.178866,0.001996
sub-6851811,0.061438,0.007531,0.002942,0.042546,0.060413,0.044989,0.051669,0.047147,0.047032,0.036001,0.076638,0.040372,0.150983,0.010531
sub-7658604,0.050549,0.085985,0.003469,0.020181,0.053678,0.038098,0.027977,0.009505,0.042350,0.033869,0.227532,0.182349,0.106834,0.125704
sub-5985051,0.050370,0.041347,0.001680,0.078365,0.149520,0.042657,0.049928,0.012832,0.092939,0.034090,0.061487,0.044282,0.083450,0.006093
sub-5707288,0.044150,0.005716,0.002221,0.228293,0.116941,0.083934,0.457827,0.011536,0.073762,0.035548,0.058167,0.039728,0.089950,0.004913
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sub-7863867,1.000000,1.000000,0.921238,0.995260,0.905931,1.000000,0.999987,0.995697,0.702085,0.039963,1.000000,0.966072,0.999879,0.999417
sub-1121981,1.000000,0.999997,0.998894,0.998091,1.000000,1.000000,0.349934,1.000000,1.000000,1.000000,1.000000,0.999999,1.000000,0.356847
sub-7055352,1.000000,0.010788,0.999995,0.999982,1.000000,1.000000,0.587226,0.008471,0.057523,1.000000,0.283883,1.000000,1.000000,0.527916
sub-5013589,1.000000,0.098967,1.000000,0.889512,0.079242,1.000000,0.985568,1.000000,0.548537,1.000000,0.062901,0.998066,1.000000,0.014987


The final product of this function is a `pandas.DataFrame` where the values are now "probabilities" (i.e., the probability that the SUVR value of that participant is "abnormal"). This is what we will be using to determine our thresholds.
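For readers who want to see the mechanics, a table of probabilities like this one can be built with `scikit-learn`'s `predict_proba`. The sketch below is a simplified illustration on synthetic data with a single hypothetical region. It keeps the second component's probability, which assumes that component is the "abnormal" one; that is not guaranteed in general.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"region1": np.concatenate([rng.normal(1.1, 0.1, 150), rng.normal(1.7, 0.3, 50)])},
    index=[f"sub-{i:03d}" for i in range(200)],
)
# One fitted GMM per region, stored under the region's name
gms = {"region1": GaussianMixture(n_components=2, random_state=667).fit(data[["region1"]].to_numpy())}

probs = pd.DataFrame(index=data.index)
for region, gm in gms.items():
    # predict_proba returns one column per component; keep the second one here
    probs[region] = gm.predict_proba(data[[region]].to_numpy())[:, 1]

print(probs.head())
```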

```{admonition} Fix: Inverted distributions - Part 2
:class: warning
We didn't get a warning here, as I removed the region with an inverted distribution, but this function would also output an error message if a region had an inverted distribution.

If you wanted to keep the brain region, but force the "first" distribution to become the "second" distribution, you can do so here: this is what the `fix` argument does in the `spex.gmm_probs` function.
```
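Conceptually, forcing the distributions the other way around amounts to choosing which component's probability you read. The sketch below shows one way to always read the probability of the higher-mean component, whatever its position in the model. The helper `prob_higher_component` is a hypothetical name and this is not how `sihnpy` necessarily does it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prob_higher_component(gm, X):
    """Posterior probability of belonging to the component with the higher mean."""
    high = int(np.argmax(gm.means_.ravel()))  # index of the higher-mean component
    return gm.predict_proba(X)[:, high]

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)]).reshape(-1, 1)
gm = GaussianMixture(n_components=2, random_state=667).fit(X)

p_high = prob_higher_component(gm, X)
# Probability for the largest value in the sample
print(p_high[np.argmax(X.ravel())])
```

If lower values are the abnormal ones in your data, replacing `np.argmax` with `np.argmin` in the helper would select the lower-mean component instead.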

#### 5. (Optional) Visual verifications with histograms

My favorite part of science is the graphs. I love to look at graphs because they carry so much information compared to just the numbers. Accordingly, I created a whole function that can generate histograms to look at the data you have generated so far with this package. Very simply, this function takes in the `final_data` dataframe, the `gmm_measures` dictionary and the `probability_data` dataframe. It can then output 3 types of histograms: a density histogram (showing the Gaussian distributions assigned to the data), a "raw" histogram (showing the raw data, not density transformed) and a histogram showing the probabilities.

You can choose to output all three types, or a specific type, as needed. Note however that this process can be demanding for the computer's memory as it stores the histograms in dictionaries. Particularly if you have a lot of regions, this process may take a while.
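To clarify the difference between a "raw" and a density histogram, here is a plotting-free sketch using `numpy` on synthetic data. It also evaluates the fitted mixture density at the bin centres, which is the kind of curve one would overlay on a density histogram. The number of bins and the variable names are arbitrary choices of mine, not `sihnpy` settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(1.1, 0.1, 200), rng.normal(1.7, 0.3, 80)])
gm = GaussianMixture(n_components=2, random_state=667).fit(values.reshape(-1, 1))

# "Raw" histogram: number of participants per bin
counts, edges = np.histogram(values, bins=30)
# Density histogram: bar areas sum to 1, comparable to a probability density
density, _ = np.histogram(values, bins=edges, density=True)

# Mixture density at the bin centres (score_samples returns the log-density)
centres = (edges[:-1] + edges[1:]) / 2
mixture_pdf = np.exp(gm.score_samples(centres.reshape(-1, 1)))
print(counts.sum(), (density * np.diff(edges)).sum())
```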

#### 6. Threshold derivation

### Introduction to pre-determined (normative sample) thresholds

## Applying thresholds

## tl;dr

## Other topics
### Simulating data