# Spatial extent analysis

Ready to increase the "spatial extent" of your knowledge on this? ... Sorry, not my best one.

Below you will find concrete methods and detailed tutorials for applying the spatial extent methods to your data, along with what to do when you encounter specific (potentially problematic) situations.

Note that, by nature, the methodologies you can use to derive thresholds are almost... well... infinite. It really depends on the data you have and whether the method you chose is appropriate for it. In `sihnpy`, I chose to integrate two. However, where possible, I point out ways users can modify their scripts to use other thresholding methods. Depending on the demand, I will also add more methods in the future.

## Outline

As with all other modules in `sihnpy`, some data is included in the package so you can practice the different aspects of the module before moving on to your own data. I will first take a moment to describe the data included.

Then, we will jump into the spatial extent. Using the spatial extent on your data is divided into two steps:

| **Derive thresholds** | **Apply thresholds** |
|:-------|:--------|
| Use a method to establish a single <br> threshold for each brain region | Apply the thresholds from the first <br> step and compute spatial extent measures |

The thresholds in `sihnpy` are either derived using a **Gaussian mixture modelling** approach, or assume that thresholds were **derived from a normative population**. Both ways will be described in detail below.

## Practice data

In other `sihnpy` modules, real data from a small subset (15) of the PREVENT-AD Open Dataset is used. However, this method was developed using positron emission tomography data, specifically for tau pathology. While it is in the plans for the future to have this data available in the PREVENT-AD, it is currently not available in the data release. Furthermore, real data may not show all of the issues that can arise from some specific situations.

Have no fear though! I have worked hard to simulate tau-PET data for PREVENT-AD participants so you can practice using this module on data that feels realistic. As of now, simulated tau positron emission tomography (PET) data is available for the 308 participants of the PREVENT-AD Open Dataset. Curious about how this was done or want to understand more about this data? {ref}`Find more details here <2.spex/spex_module:Simulating data>`.

As a quick primer on PET data, the main thing you should know is that the values simulated in `sihnpy` are called **Standardized Uptake Value Ratios (SUVR)**. They measure how much of the PET tracer is taken up (i.e., how much pathology there is) in a given brain region compared to the uptake in a reference region that does not accumulate pathology. An SUVR close to or below 1 indicates very low levels of pathology, while higher values represent more and more pathology. Note that the uptake varies between regions, **so thresholds will change depending on the region**.
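To make the ratio concrete, here is a minimal sketch of how an SUVR could be computed from raw uptake values. The participant IDs, region names, and uptake numbers are all hypothetical and only serve to illustrate the arithmetic; the data shipped with `sihnpy` is already expressed as SUVR.

```python
import pandas as pd

# Hypothetical mean tracer uptake per region (arbitrary units), for illustration only
uptake = pd.DataFrame(
    {
        "entorhinal": [1.32, 2.10],
        "precentral": [1.15, 1.20],
        "reference_region": [1.20, 1.25],
    },
    index=pd.Index(["sub-A", "sub-B"], name="participant_id"),
)

# SUVR = uptake in the region of interest / uptake in the reference region,
# computed row by row (i.e., within each participant)
suvr = uptake[["entorhinal", "precentral"]].div(uptake["reference_region"], axis=0)

print(suvr.round(3))
```

In this made-up example, `sub-B` has an entorhinal SUVR well above 1 while its precentral SUVR stays below 1, which is the kind of regional difference the thresholds are meant to capture.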

```{warning}
Note that `sihnpy` provides data to practice using the spatial extent module. While the PREVENT-AD participants are used, the data available for this method **is simulated data** (i.e., the numbers observed are fake; they were randomly generated to fit the purpose of the tutorials). As a general rule for `sihnpy`, and especially for this module, **only use the data provided to help you practice using the module, not to conduct or publish actual research**.
```

## Deriving thresholds
### Introduction to Gaussian mixture modelling (GMM)

The main method proposed by `sihnpy` to derive thresholds is **Gaussian mixture modelling (GMM)**. The rationale behind this method is that data points in a dataset belong to a set of Gaussian (a.k.a. normal) distributions (a.k.a. distinct populations). GMMs are often referred to as *soft* clustering algorithms; contrary to other clustering algorithms, GMMs assign **probabilities that each data point belongs to a specific cluster**. This approach is useful as it allows some granularity in how certain we are that a participant belongs to a specific group. I won't go much further into how GMMs actually work, as it is beyond the purview of `sihnpy`, but I encourage you to read [`scikit-learn`'s documentation](https://scikit-learn.org/stable/modules/mixture.html#gmm) to learn more.

So GMMs find clusters in the data. But what does that have to do with finding thresholds?

If you remember the {ref}`introduction to spatial extent <2.spex/spex_intro:Definitions>`, we know that **there are two distinct distributions in the data: people with low values of pathology (tight spread) and people with high values of pathology (spread out).** You can actually observe this visually in the data. Here is an example from the data included in `sihnpy` (i.e., the simulated distribution of tau-PET values in the entorhinal cortex).

INSERT HISTOGRAM IMAGE HERE

Here, there seem to be two distributions in the data: our **low** distribution in green (i.e., "normal") and our **high** distribution in red (i.e., "abnormal"). That's where the thresholding comes in. Since GMMs assign each participant a probability of belonging to either cluster, we can set a threshold **based on how confident we are that a participant has abnormal values in the marker of interest**. For instance, we could be very conservative and consider abnormal only the participants who have more than a 90% probability of being in the abnormal distribution. Once we decide on the probability we want to use as a threshold, *we need to figure out how that probability translates to a threshold on our original scale*. To do so, we find the participant whose probability of being abnormal is closest to our probability threshold, and take their value on the original scale. In the `sihnpy` PREVENT-AD simulated data, this means taking the SUVR value of the participant with the probability closest to the probability threshold. Below is an illustrated explanation of this process.

INSERT MODIFIED FIGURE 1 FROM MANUSCRIPT

This process is the one detailed below that is implemented in `sihnpy`. If you already have your own thresholds that you wish to apply to the data, you can skip ahead to the {ref}`Applying thresholds <2.spex/spex_module:Applying thresholds>` section.
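For readers who like to see the idea in code first, here is a minimal sketch of the probability-to-threshold logic using `scikit-learn` directly. It is not the `sihnpy` implementation: the values are randomly generated for a single made-up region, and choices such as the 0.90 cut-off and treating the higher-mean component as "abnormal" are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up SUVR-like values for ONE region: a tight "normal" group and a
# wider "abnormal" group (this is not the sihnpy practice data)
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(loc=1.1, scale=0.08, size=800),
    rng.normal(loc=1.8, scale=0.30, size=200),
])

# scikit-learn expects a 2D array of shape (n_samples, n_features)
X = values.reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# The order of the components is arbitrary, so we treat the component
# with the higher mean as the "abnormal" one
means = gmm.means_.ravel()
abnormal = int(np.argmax(means))

# Probability that each participant belongs to the abnormal component
prob_abnormal = gmm.predict_proba(X)[:, abnormal]

# Find the participant whose probability is closest to the chosen cut-off
# and read their value on the original (SUVR) scale
prob_cutoff = 0.90
closest = int(np.argmin(np.abs(prob_abnormal - prob_cutoff)))
threshold = float(values[closest])

print(f"Threshold on the original scale: {threshold:.3f} "
      f"(probability of that participant: {prob_abnormal[closest]:.3f})")
```

Note that the threshold is the observed value of an actual participant, so with small samples it can sit a little away from the exact point where the probability crosses the cut-off.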

#### Limitations

GMMs are really great, but they come with some assumptions and limitations.

| **Limitations** | **Possible fixes** |
|:-------|:--------|
| Needs a **clear** bimodal distribution | None |
| A large sample size\* is needed, particularly <br> when the bimodal distribution is not obvious | Increase sample size |
| Needs more people with "normal" than "abnormal" values, <br> as the algorithm will otherwise misclassify abnormality | Can be fixed in code, but estimates might not be great <br> regardless |
| `sihnpy` expects higher values to be "abnormal" | Can be fixed in the code or its interpretation |

\* Note on bimodal distributions and sample sizes: there are no guidelines on what constitutes an appropriate sample size for a GMM. Some estimates I saw online mention (appropriately) that it depends on the number of parameters used in the clustering, the number of clusters we expect, etc. Furthermore, I am not convinced that sample size is the most important consideration when deciding whether to use a GMM. The most important thing is really that there is a **clear bimodal** distribution. Having a larger sample may help make that evident, but a clear bimodal distribution at a smaller sample size may still work. That said, from what I have seen of others using this method, sample sizes below 100 generally do not perform very well.


### Steps to derive thresholds with GMMs

Now that we got that out of the way, let's get down to it! I will take you through the steps needed to derive the thresholds using GMMs.

#### 1. Get the data

The first step is to get the data we need to generate thresholds for. Fortunately, the `sihnpy.datasets` module already has some ready for us. You can simply download the data using the following:

```python
from sihnpy import datasets

tau_data, regional_thresholds, regional_averages = datasets.pad_spex_input()
```

The function returns three `pandas.DataFrame` objects. The only one we need for this part is the first one, `tau_data`. The second one, `regional_thresholds`, will be discussed for {ref}`applying thresholds from normative populations <2.spex/spex_module:Introduction to pre-determined (normative sample) thresholds>`. The last one, `regional_averages`, will really only be useful if you take an interest in {ref}`simulating your own data <2.spex/spex_module:Simulating data>`.

So for now, let's focus on the `tau_data`. 

```python
tau_data
```

| participant_id | sex | test_language | handedness_score | handedness_interpretation | CTX_LH_ENTORHINAL_SUVR | CTX_RH_ENTORHINAL_SUVR | CTX_LH_AMYGDALA_SUVR | CTX_RH_AMYGDALA_SUVR | CTX_LH_FUSIFORM_SUVR | CTX_RH_FUSIFORM_SUVR | CTX_LH_PARAHIPPOCAMPAL_SUVR | CTX_RH_PARAHIPPOCAMPAL_SUVR | CTX_LH_INFERIORTEMPORAL_SUVR | CTX_RH_INFERIORTEMPORAL_SUVR | CTX_LH_MIDDLETEMPORAL_SUVR | CTX_RH_MIDDLETEMPORAL_SUVR | CTX_LH_PRECENTRAL_SUVR | CTX_RH_PRECENTRAL_SUVR | CTX_LH_POSTCENTRAL_SUVR | CTX_RH_POSTCENTRAL_SUVR |
|:---|:---|:---|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| sub-5458966 | Male | French | 80.00 | Right-handed | 1.111972 | 1.120199 | 1.006147 | 1.330316 | 1.322257 | 1.208377 | 0.856778 | 1.149150 | 1.200685 | 1.170536 | 1.136680 | 1.167629 | 0.836766 | 1.181594 | 0.964623 | 1.264212 |
| sub-2424540 | Female | French | 100.00 | Right-handed | 1.279463 | 1.238721 | 1.118358 | 1.176036 | 1.064330 | 1.203981 | 0.939988 | 0.965154 | 1.143115 | 1.354172 | 1.189367 | 1.305499 | 1.008217 | 1.265188 | 0.903880 | 0.982667 |
| sub-7855613 | Female | French | 90.00 | Right-handed | 1.165918 | 1.074124 | 1.133187 | 1.239481 | 1.057046 | 1.072006 | 0.919426 | 1.051297 | 1.188624 | 1.213766 | 1.178537 | 1.122608 | 0.994861 | 1.224359 | 1.039233 | 1.018787 |
| sub-3137570 | Male | French | 90.00 | Right-handed | 1.057761 | 1.058959 | 1.003114 | 1.225939 | 0.950004 | 1.283570 | 1.173269 | 1.108080 | 1.127921 | 1.106209 | 1.007086 | 1.103633 | 0.906591 | 1.236180 | 0.985742 | 1.518770 |
| sub-9650197 | Female | French | 100.00 | Right-handed | 1.115381 | 1.106487 | 1.214722 | 1.359531 | 1.346469 | 1.111211 | 1.009351 | 1.172829 | 1.176183 | 1.283605 | 1.016241 | 1.170783 | 1.058830 | 1.208158 | 0.861014 | 1.271836 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sub-5336241 | Female | French | -30.00 | Ambidextrous | 1.755116 | 1.791774 | 1.892483 | 0.914250 | 2.088089 | 1.693487 | 1.021844 | 1.371482 | 1.587848 | 2.456752 | 1.338308 | 1.571080 | 0.698112 | 1.045121 | 0.558954 | 0.849271 |
| sub-1002928 | Female | French | 100.00 | Right-handed | 1.725995 | 1.665045 | 1.567078 | 1.379281 | 2.359009 | 1.743699 | 1.314826 | 1.472280 | 2.517382 | 1.227152 | 1.536081 | 1.932241 | 1.442845 | 1.076464 | 0.985675 | 0.800467 |
| sub-1283278 | Female | English | 80.00 | Right-handed | 1.763810 | 1.557945 | 1.831518 | 1.901642 | 1.960012 | 2.085522 | 1.729197 | 1.458530 | 2.390055 | 2.020771 | 1.372247 | 1.800840 | 1.039855 | 0.974499 | 0.910037 | 0.898030 |
| sub-9101699 | Male | French | 57.89 | Right-handed | 1.658679 | 1.751766 | 1.718346 | 1.842829 | 0.516473 | 1.770625 | 1.566308 | 1.269817 | 2.208593 | 1.718491 | 2.319590 | 0.499457 | 0.777812 | 1.049930 | 1.341256 | 1.253567 |


In this dataset, we see that we have 308 participants from PREVENT-AD. The first few columns detail their basic demographic information available from the Open Dataset. All the other columns are the simulated tau-PET data. The data was simulated for a total of 16 brain regions: LH/RH indicates which hemisphere the region is from, while the name right after (e.g., ENTORHINAL) is the name of the brain region we are simulating.

The first step here is to remove the demographic information. The GMM code will be applied to all the columns (except the index) that are provided to it. And well... clustering males and females in 2 groups is not really useful for our purposes...

Let's quickly do that using `pandas`:

```python
import pandas as pd

tau_data.drop(labels=["sex", "test_language", "handedness_score", "handedness_interpretation"], axis=1, inplace=True) #Axis 1 specifies to drop columns

tau_data
```

| participant_id | CTX_LH_ENTORHINAL_SUVR | CTX_RH_ENTORHINAL_SUVR | CTX_LH_AMYGDALA_SUVR | CTX_RH_AMYGDALA_SUVR | CTX_LH_FUSIFORM_SUVR | CTX_RH_FUSIFORM_SUVR | CTX_LH_PARAHIPPOCAMPAL_SUVR | CTX_RH_PARAHIPPOCAMPAL_SUVR | CTX_LH_INFERIORTEMPORAL_SUVR | CTX_RH_INFERIORTEMPORAL_SUVR | CTX_LH_MIDDLETEMPORAL_SUVR | CTX_RH_MIDDLETEMPORAL_SUVR | CTX_LH_PRECENTRAL_SUVR | CTX_RH_PRECENTRAL_SUVR | CTX_LH_POSTCENTRAL_SUVR | CTX_RH_POSTCENTRAL_SUVR |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| sub-5458966 | 1.111972 | 1.120199 | 1.006147 | 1.330316 | 1.322257 | 1.208377 | 0.856778 | 1.149150 | 1.200685 | 1.170536 | 1.136680 | 1.167629 | 0.836766 | 1.181594 | 0.964623 | 1.264212 |
| sub-2424540 | 1.279463 | 1.238721 | 1.118358 | 1.176036 | 1.064330 | 1.203981 | 0.939988 | 0.965154 | 1.143115 | 1.354172 | 1.189367 | 1.305499 | 1.008217 | 1.265188 | 0.903880 | 0.982667 |
| sub-7855613 | 1.165918 | 1.074124 | 1.133187 | 1.239481 | 1.057046 | 1.072006 | 0.919426 | 1.051297 | 1.188624 | 1.213766 | 1.178537 | 1.122608 | 0.994861 | 1.224359 | 1.039233 | 1.018787 |
| sub-3137570 | 1.057761 | 1.058959 | 1.003114 | 1.225939 | 0.950004 | 1.283570 | 1.173269 | 1.108080 | 1.127921 | 1.106209 | 1.007086 | 1.103633 | 0.906591 | 1.236180 | 0.985742 | 1.518770 |
| sub-9650197 | 1.115381 | 1.106487 | 1.214722 | 1.359531 | 1.346469 | 1.111211 | 1.009351 | 1.172829 | 1.176183 | 1.283605 | 1.016241 | 1.170783 | 1.058830 | 1.208158 | 0.861014 | 1.271836 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sub-5336241 | 1.755116 | 1.791774 | 1.892483 | 0.914250 | 2.088089 | 1.693487 | 1.021844 | 1.371482 | 1.587848 | 2.456752 | 1.338308 | 1.571080 | 0.698112 | 1.045121 | 0.558954 | 0.849271 |
| sub-1002928 | 1.725995 | 1.665045 | 1.567078 | 1.379281 | 2.359009 | 1.743699 | 1.314826 | 1.472280 | 2.517382 | 1.227152 | 1.536081 | 1.932241 | 1.442845 | 1.076464 | 0.985675 | 0.800467 |
| sub-1283278 | 1.763810 | 1.557945 | 1.831518 | 1.901642 | 1.960012 | 2.085522 | 1.729197 | 1.458530 | 2.390055 | 2.020771 | 1.372247 | 1.800840 | 1.039855 | 0.974499 | 0.910037 | 0.898030 |
| sub-9101699 | 1.658679 | 1.751766 | 1.718346 | 1.842829 | 0.516473 | 1.770625 | 1.566308 | 1.269817 | 2.208593 | 1.718491 | 2.319590 | 0.499457 | 0.777812 | 1.049930 | 1.341256 | 1.253567 |


Ok great! Now we only have our 16 regions with the simulated SUVR tau-PET data. We're ready to start.
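If you would rather not list the demographic columns by hand, another option is to keep only the columns whose name ends in `_SUVR`. The sketch below shows this on a tiny made-up table (the participant IDs and values are hypothetical); it assumes, as in the practice data, that every region column carries the `_SUVR` suffix and that no demographic column does.

```python
import pandas as pd

# Tiny stand-in for tau_data, with made-up participants and values
demo = pd.DataFrame(
    {
        "sex": ["Male", "Female"],
        "handedness_score": [80.0, 100.0],
        "CTX_LH_ENTORHINAL_SUVR": [1.11, 1.28],
        "CTX_RH_ENTORHINAL_SUVR": [1.12, 1.24],
    },
    index=pd.Index(["sub-A", "sub-B"], name="participant_id"),
)

# Keep only the columns whose name ends with "_SUVR"; the index is untouched
suvr_only = demo.filter(regex=r"_SUVR$")

print(list(suvr_only.columns))
```

Unlike `drop(..., inplace=True)`, `filter` returns a new `DataFrame`, so the original table stays available if you need the demographics later.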

#### 2. Estimate the GMM

The next step is to **estimate** a GMM. In `scikit-learn` terms, we need to **fit** a GMM to our data.

`sihnpy` makes it super easy to do this without thinking too much. The function `spex.gmm_estimation` only requires a `pandas.DataFrame` in which a GMM should be fitted to every column. We simply need to run the code below:

```python
from sihnpy import spatial_extent as spex

gm_estimations, clean_data = spex.gmm_estimation(data_to_estimate=tau_data)
```

```text
GMM estimation for CTX_LH_ENTORHINAL_SUVR
1-component: 136.0012272528758 | 2-components: 19.050405000530834
GMM estimation for CTX_RH_ENTORHINAL_SUVR
1-component: 97.26552868290143 | 2-components: -63.68821541092997
GMM estimation for CTX_LH_AMYGDALA_SUVR
1-component: 159.09515496414866 | 2-components: 18.7477128886027
GMM estimation for CTX_RH_AMYGDALA_SUVR
1-component: 139.3931538620712 | 2-components: 6.848469295876654
GMM estimation for CTX_LH_FUSIFORM_SUVR
1-component: 354.99363136157905 | 2-components: 136.78976199853432
GMM estimation for CTX_RH_FUSIFORM_SUVR
1-component: 224.0243215398744 | 2-components: -42.43182354715749
GMM estimation for CTX_LH_PARAHIPPOCAMPAL_SUVR
1-component: 55.96648057875871 | 2-components: -64.27572034876609
GMM estimation for CTX_RH_PARAHIPPOCAMPAL_SUVR
1-component: -12.316655849330177 | 2-components: -163.2738193556712
GMM estimation for CTX_LH_INFERIORTEMPORAL_SUVR
1-component: 383.0554389230564 | 2-components: 176.85420579772477
...
```

Whew, that's a lot of text! And what did the function actually return?

Let's start with the messages the function printed.
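For intuition on what a "1-component versus 2-components" comparison can look like, here is a minimal sketch that uses `scikit-learn` directly rather than `sihnpy`. It assumes the comparison is made with the Bayesian information criterion (BIC, where lower is better); that is an assumption for illustration, and `sihnpy` may compute something different. The region names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Made-up bimodal data for two hypothetical regions (not the sihnpy data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "REGION_A": np.concatenate([rng.normal(1.1, 0.08, 800), rng.normal(1.8, 0.30, 200)]),
    "REGION_B": np.concatenate([rng.normal(1.0, 0.10, 800), rng.normal(1.6, 0.25, 200)]),
})

bic = {}
for region in demo.columns:
    # One GMM per column; scikit-learn expects shape (n_samples, n_features)
    x = demo[region].dropna().to_numpy().reshape(-1, 1)

    # Fit a 1-component and a 2-component model to the same values
    one = GaussianMixture(n_components=1, random_state=0).fit(x)
    two = GaussianMixture(n_components=2, random_state=0).fit(x)

    # Store the BIC of each model (assumed measure; lower means a better fit)
    bic[region] = (one.bic(x), two.bic(x))
    print(f"{region}: 1-component BIC = {bic[region][0]:.1f} | "
          f"2-components BIC = {bic[region][1]:.1f}")
```

With clearly bimodal data like this, the 2-component model should come out with the lower BIC in both regions, which is the pattern you would hope to see before deriving a threshold from the two clusters.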

#### 3. Cluster measures
#### 4. Extracting clustering probabilities
#### 5. (Optional) Visual verifications with histograms
#### 6. Threshold derivation

### Introduction to pre-determined (normative sample) thresholds

## Applying thresholds

## tl;dr

## Other topics
### Simulating data