# Spatial extent analysis

Ready to increase the "spatial extent" of your knowledge on this? ... Sorry, not my best one.

You will find below concrete methods and detailed tutorials to apply the spatial extent methods to your data, along with guidance on what to do when you run into specific (potentially problematic) situations.

Note that, by nature, the methodologies you can use to derive thresholds are almost... well... infinite. Which one is appropriate really depends on the data you have. In `sihnpy`, I chose to integrate two. However, when possible, I point out ways users can modify their scripts to use other thresholding methods. Depending on the demand, I will also add more methods in the future.

## Outline

As with all other modules in `sihnpy`, some data is included in the package so you can practice the different aspects of the module before moving on to your own data. I will first take a moment to describe the data included.

Then, we will jump into the spatial extent. Using the spatial extent on your data is divided into two steps:

| **Derive thresholds** | **Apply thresholds** |
|:-------|:--------|
| Use a method to establish a single <br> threshold for each brain region | Apply the thresholds from the first <br> step and compute spatial extent measures |

The thresholds in `sihnpy` are either derived using a **Gaussian mixture modelling** approach, or assume that thresholds were **derived from a normative population**. Both ways will be described in detail below.
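To give a feel for the second step, here is a minimal sketch using plain `pandas`, not `sihnpy`'s own functions. It assumes that a spatial extent measure can be counted as the number of regions where a participant's value exceeds that region's threshold. The region names, participant IDs, values and thresholds are all made up for illustration.

```python
import pandas as pd

# Hypothetical regional values: one row per participant, one column per region
suvr = pd.DataFrame(
    {"entorhinal": [1.05, 1.40, 1.62], "fusiform": [1.10, 1.18, 1.55]},
    index=["sub-01", "sub-02", "sub-03"],
)

# Hypothetical thresholds: one per region (this is what the first step would give us)
thresholds = pd.Series({"entorhinal": 1.30, "fusiform": 1.25})

# Compare each value to the threshold of its own region (aligned on column names)
abnormal = suvr.gt(thresholds, axis="columns")

# Assumed spatial extent measure: number of abnormal regions per participant
spatial_extent = abnormal.sum(axis="columns")
print(spatial_extent)
```

The point of the sketch is only that each region gets its own threshold, and that the comparison is done region by region before anything is summed.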

## Practice data

In other `sihnpy` modules, real data from a small subset (15) of the PREVENT-AD Open Dataset is used. However, this method was developed using positron emission tomography data, specifically for tau pathology. While it is in the plans for the future to have this data available in the PREVENT-AD, it is currently not available in the data release. Furthermore, real data may not show all of the issues that can arise from some specific situations.

Have no fear though! I have worked hard to simulate tau-PET data for PREVENT-AD participants so you can practice this module on data that feels realistic. As of now, simulated tau positron emission tomography (PET) data is available for the 308 participants of the PREVENT-AD Open Dataset. Curious about how this was done or want to understand more about this data? {ref}`Find more details here <2.spex/spex_module:Simulating data>`.

As a quick primer on PET data, the main thing you should know is that the values simulated in `sihnpy` are called **Standardized Uptake Value Ratios (SUVR)**. They measure how much of the PET tracer is taken up (i.e., how much pathology there is) in a given brain region compared to the uptake in a reference region that does not accumulate pathology. An SUVR close to or below 1 indicates very low levels of pathology, while higher values represent more and more pathology. Note that the uptake varies between regions, **so thresholds will change depending on the region**.
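If the ratio idea feels abstract, here is a tiny sketch of the arithmetic. It is not part of `sihnpy`: the function name `compute_suvr` and the uptake numbers are hypothetical, and real PET pipelines involve many more processing steps before this division.

```python
import numpy as np

def compute_suvr(region_uptake, reference_uptake):
    """Divide the uptake in each target region by the uptake in the reference region."""
    region_uptake = np.asarray(region_uptake, dtype=float)
    return region_uptake / float(reference_uptake)

# Made-up uptake values for three regions of a single participant
uptake = [1.10, 1.65, 2.20]
reference = 1.10  # made-up uptake in the reference region

suvr = compute_suvr(uptake, reference)
print(suvr)
```

A region with the same uptake as the reference region gets an SUVR of 1, which is why values around 1 are read as very low pathology.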

```{warning}
Note that `sihnpy` provides data to practice using the spatial extent module. While the PREVENT-AD participants are used, the data available for this method **is simulated data** (i.e., the numbers observed are fake; they were randomly generated to fit the purpose of the tutorials). As a general rule for `sihnpy`, and especially for this module, **only use the data provided to help you practice using the module, not to conduct or publish actual research**.
```

## Deriving thresholds
### Introduction to Gaussian mixture modelling (GMM)

The main method proposed by `sihnpy` to derive thresholds is to use **Gaussian mixture modelling (GMM)**. The rationale behind this method is that data points in a dataset belong to a set of Gaussian (a.k.a. normal) distributions (a.k.a. distinct populations). GMMs are often referred to as *soft* clustering algorithms; contrary to hard clustering algorithms, which assign each data point to a single cluster, GMMs assign **probabilities that each data point belongs to a specific cluster**. This approach is useful as it allows some granularity on how certain we are that a participant belongs to a specific group. I won't go much further into how GMMs actually work, as it is beyond the purview of `sihnpy`, but I encourage you to go read [`scikit-learn`'s documentation](https://scikit-learn.org/stable/modules/mixture.html#gmm) to learn more.

So GMMs find clusters in the data. But what does that have to do with finding thresholds?
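To make the *soft* clustering idea concrete, here is a minimal sketch that calls `scikit-learn`'s `GaussianMixture` directly. This is not `sihnpy`'s own code. The data are randomly generated for this example only, assuming a tight low group and a spread-out high group.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up values for one region: a tight "low" group and a spread-out "high" group
rng = np.random.default_rng(0)
low = rng.normal(loc=1.0, scale=0.05, size=200)
high = rng.normal(loc=1.6, scale=0.25, size=60)
values = np.concatenate([low, high]).reshape(-1, 1)  # scikit-learn expects a 2D array

# Fit a two-component GMM and get, for each data point, the probability
# of belonging to each component
gmm = GaussianMixture(n_components=2, random_state=0).fit(values)
probs = gmm.predict_proba(values)  # shape: (n_participants, 2)

# The order of the components is arbitrary, so we identify the "high"
# component as the one with the larger mean
high_idx = int(np.argmax(gmm.means_.ravel()))
prob_high = probs[:, high_idx]

print(prob_high[:5])
```

Each row of `probs` sums to 1, so `prob_high` can be read as how confident the model is that a given data point belongs to the high distribution.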

If you remember the {ref}`introduction to spatial extent <2.spex/spex_intro:Definitions>`, we know that **there are two distinct distributions in the data: people with low values of pathology (tight spread) and people with high values of pathology (spread out).** You can actually observe this visually in the data. Here is an example from the data included in `sihnpy` (i.e., the simulated distribution of tau-PET values in the entorhinal cortex).

INSERT HISTOGRAM IMAGE HERE

Here, there seem to be two distributions in the data: our **low** distribution in green (i.e., "normal") and our **high** distribution in red (i.e., "abnormal"). That's where the thresholding comes in. Since GMMs assign each participant a probability of belonging to either cluster, we can set a threshold **based on how confident we are that a participant has abnormal values in the marker of interest**. For instance, we could be very conservative and consider abnormal only the participants who have more than a 90% probability of being in the abnormal distribution. Once we decide on the probability we want to use as a threshold, *we need to figure out how that probability translates to a threshold in our original scale*. To do so, we find the participant whose probability of being abnormal is closest to our probability threshold, and take their value in the original scale. In the `sihnpy` PREVENT-AD simulated data, this would mean taking the SUVR value of the participant with the probability closest to the probability threshold. Below is an illustrated explanation of this process.

INSERT MODIFIED FIGURE 1 FROM MANUSCRIPT

This is the process implemented in `sihnpy`, and it is detailed step by step below.
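Before going through the steps, here is a bare-bones sketch of the "closest probability" idea in plain `numpy`. It is only an illustration under simplifying assumptions, not `sihnpy`'s actual implementation: the function name `threshold_from_probability` is hypothetical, and the SUVR values and probabilities are made up.

```python
import numpy as np

def threshold_from_probability(values, prob_abnormal, prob_threshold=0.9):
    """Return the value of the participant whose probability of being
    abnormal is closest to the chosen probability threshold."""
    values = np.asarray(values, dtype=float)
    prob_abnormal = np.asarray(prob_abnormal, dtype=float)
    # Index of the participant with the smallest distance to the probability threshold
    closest = int(np.argmin(np.abs(prob_abnormal - prob_threshold)))
    return values[closest]

# Made-up SUVR values for one region and the matching (made-up) probabilities
# of belonging to the abnormal distribution
suvr = np.array([0.95, 1.02, 1.10, 1.21, 1.35, 1.60])
prob_abnormal = np.array([0.01, 0.03, 0.20, 0.55, 0.88, 0.99])

threshold = threshold_from_probability(suvr, prob_abnormal, prob_threshold=0.9)
print(threshold)
```

In this toy example, the participant with a probability of 0.88 is the closest to 0.9, so their SUVR becomes the threshold for that region. In practice, this would be repeated for every region, since each region gets its own threshold.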

#### Assumptions, use cases and limitations

GMMs are really great, but they come with some assumptions and limitations.

| Limitations | Possible fixes |


### Steps to derive thresholds with GMMs
#### 1. GMM Estimation
#### 2. Cluster measures
#### 3. Extracting clustering probabilities
#### 4. (Optional) Visual verifications with histograms
#### 5. Threshold derivation

### Introduction to pre-determined (normative sample) thresholds

## Applying thresholds

## tl;dr

## Other topics
### Simulating data