# Fingerprinting module
-----

```{note}
This code is under active development. Future examples will include real data from the Prevent-AD dataset instead of randomly generated data.
```

## 1. Introduction to fingerprinting

### Rationale

Many currently used statistical analyses rely on the assumption that **the brains of similar individuals are relatively homogeneous**. For example, a group of 30 young individuals (18-30 years old) is often analyzed as a single group without worrying about important differences between its members. However, many studies, mainly in the field of functional magnetic resonance imaging (fMRI), have highlighted significant differences between individuals of the "same" group. [^Mueller_2013] [^Finn_2015]

This is what prompted the development of the original functional connectome fingerprinting methodology (referred to as **fingerprinting** in the rest of this documentation) by Finn et al. (2015)[^Finn_2015]. The general rationale behind **fingerprinting** is that if there is important variability between individuals in terms of brain connectivity patterns, the pattern of each individual should be unique, just like a digital fingerprint. Previous research has shown that these fingerprints are accurate **1) across time**, **2) across functional MRI tasks** and **3) when using other neuroimaging modalities**.

### How it works

The core idea behind **fingerprinting** is to determine how similar an individual is to themselves at a different point in time, and whether this similarity is stronger than the similarity between that individual and the rest of the sample. This is illustrated below:

(FIGURE)

Just like a digital fingerprint, a brain fingerprint is good when it can reliably identify the same individual over different conditions. In our context, this is determined by whether the correlation of the brain features of the same individual is higher than the correlation of the brain features of different individuals.

```{admonition} Definitions

- **Fingerprinting accuracy**: whether a given individual was identified using a different brain imaging session
- **Fingerprint strength**: within-individual correlation of two brain imaging sessions
- **Alikeness coefficient**: between-individual correlation of two brain imaging sessions
- **Identifiability**: definition from Amico et al. (2018)[^Amico_2018], where identifiability is the difference between the fingerprint strength and the alikeness coefficient.

```
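
The four definitions above can be illustrated with a minimal `numpy` sketch. It is not `sihnpy`'s implementation: the data are randomly generated, the variable names are made up for illustration, and averaging the between-individual correlations into a single alikeness value per participant is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_features = 10, 200

# Hypothetical data: each subject has a stable pattern plus session-specific noise.
trait = rng.normal(size=(n_subjects, n_features))
session1 = trait + 0.5 * rng.normal(size=(n_subjects, n_features))
session2 = trait + 0.5 * rng.normal(size=(n_subjects, n_features))

# similarity[i, j]: Pearson correlation between subject i (session 1)
# and subject j (session 2).
similarity = np.corrcoef(session1, session2)[:n_subjects, n_subjects:]

# Fingerprint strength: within-individual correlation (the diagonal).
fingerprint_strength = np.diag(similarity)

# Alikeness coefficient: between-individual correlations (off-diagonal),
# averaged per subject here (an assumption made for this sketch).
off_diagonal = ~np.eye(n_subjects, dtype=bool)
alikeness = np.array(
    [similarity[i, off_diagonal[i]].mean() for i in range(n_subjects)]
)

# Identifiability: fingerprint strength minus alikeness coefficient.
identifiability = fingerprint_strength - alikeness

# Fingerprinting accuracy: a subject is identified when their own
# session-2 data is their best match among all participants.
identified = similarity.argmax(axis=1) == np.arange(n_subjects)
accuracy = identified.mean()
```

In this sketch, a subject with a high `fingerprint_strength` and a low `alikeness` value gets a high `identifiability`, which is the pattern the definitions describe.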

### Use cases and limitations

The **fingerprinting** methodology has been used with different imaging modalities, including fMRI [^Finn_2015] and structural and diffusion MRI [^Mansour_2021]. It has also been used with imaging taken at multiple time points for every individual. [^Finn_2015] [^Horien_2019]

Generally, **fingerprinting** can be used when you have:
* High-dimensional data for each individual (e.g., measurements for many regions across the brain for a given individual)
* At least 2 different measurements for each individual (either over time OR using different imaging modalities)

|Strengths|Limitations|
|:-------|:-----------|
| o Easy-to-apply individual-level measure | x Hard to interpret (unclear whether a strong correlation is good or bad)|
| o Gives stable longitudinal measurements in cognitively unimpaired cohorts| x Hard to determine which regions contribute most to **fingerprinting**|

```{warning}
A major difficulty in interpreting **fingerprinting** measures is that very little research has examined whether high or low fingerprint measures indicate meaningful behavioral, clinical or biomarker changes. Some research has shown that worse fingerprints were associated with mental health diagnoses [^Kaufmann_2017] [^Kaufmann_2018] and that lower brain volume was associated with lower **fingerprint strength**. [^Ousdal_2020] [^St_Onge_2022]

However, it is still unclear how these fingerprints change with different diseases and disease stages. Caution should be used when interpreting the results from the fingerprinting analyses in the context of clinical applications.
```

-----


[^Mueller_2013]: Mueller et al. (2013). Neuron. [10.1016/j.neuron.2012.12.028](https://doi.org/10.1016/j.neuron.2012.12.028)
[^Finn_2015]: Finn et al. (2015). Nat Neuro. [10.1038/nn.4135](https://doi.org/10.1038/nn.4135)
[^Amico_2018]: Amico et al. (2018). Sci Reports. [10.1038/s41598-018-25089-1](https://doi.org/10.1038/s41598-018-25089-1)
[^Mansour_2021]: Mansour et al. (2021). Neuroimage. [10.1016/j.neuroimage.2020.117695](https://doi.org/10.1016/j.neuroimage.2020.117695)
[^Horien_2019]: Horien et al. (2019). Neuroimage. [10.1016/j.neuroimage.2019.02.002](https://doi.org/10.1016/j.neuroimage.2019.02.002)
[^Kaufmann_2017]: Kaufmann et al. (2017). Nat Neuro. [10.1038/nn.4511](https://doi.org/10.1038/nn.4511)
[^Kaufmann_2018]: Kaufmann et al. (2018). JAMA Psychiatry. [10.1001/jamapsychiatry.2018.0844](https://doi.org/10.1001/jamapsychiatry.2018.0844)
[^Ousdal_2020]: Ousdal et al. (2020). Hum Brain Mapp. [10.1002/hbm.24833](https://doi.org/10.1002/hbm.24833)
[^St_Onge_2022]: St-Onge et al. (2022). In revision.

## Fingerprinting analysis - Matrix-like data: step-by-step rundown

We demonstrate a typical **fingerprinting** analysis using `sihnpy`. You can run these analyses by building a script, as we do below, either in a Python script or in a Jupyter Notebook. If you need to run this analysis on a large number of participants or on very high-dimensional data (>160,000 features per participant), I recommend using a command-line script (ADD THE REF TO THE NOTEBOOK HERE).

Note that the steps below are specific to matrix-like data (e.g., functional or structural connectivity, covariance matrices, etc.). If you have table-like data (e.g., volume by region), a different method is required. (ADD REF TO THE NOTEBOOK HERE).

### 1. Preparing the data

To run a fingerprinting analysis, we need three things:
* The path to a list of participants to analyze
* The path to the folder containing the matrices of the first session of brain imaging
* The path to the folder containing the matrices of the second session of brain imaging

If you already have the above for your data, you can skip ahead to step 2. Otherwise, `sihnpy` also offers a small sample of data simulated using `numpy` and `pandas` to practice using the functions.

To get the simulated data for fingerprinting, you can run the following code:

In [1]:
from sihnpy.datasets import get_fingerprint_simulated_data

id_list, path_mod1, path_mod2 = get_fingerprint_simulated_data()

/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/fp_simulated_id_list.csv
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod2


This outputs three things:
* The path to the list of IDs of participants (`id_list`)
* The path to the folder containing the matrices of the first session of brain imaging (`path_mod1`)
* The path to the folder containing the matrices of the second session of brain imaging (`path_mod2`)

These are the only mandatory inputs needed to launch the fingerprinting.
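
If you want to see what such a layout could look like on disk, the sketch below generates a comparable set of files with `numpy` and `pandas`. It is only an illustration: the `mat_<id>.txt` naming mirrors the simulated file names shown later in this page, but the matrix size, the whitespace-delimited text format and the folder names are assumptions, not a description of how `sihnpy` builds its own sample data.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical layout: one ID list and one folder of matrices per session.
root = Path(tempfile.mkdtemp())
ids = [f"0{i}a" for i in range(1, 11)]  # '01a' ... '010a'
n_nodes = 20  # assumed matrix size, chosen for illustration

folders = [root / "matrices_mod1", root / "matrices_mod2"]
for folder in folders:
    folder.mkdir()
    for subject_id in ids:
        values = rng.normal(size=(n_nodes, n_nodes))
        matrix = (values + values.T) / 2  # symmetric, like a connectivity matrix
        np.savetxt(folder / f"mat_{subject_id}.txt", matrix)

# The ID list is a CSV whose first column holds the participant IDs.
id_list_path = root / "id_list.csv"
pd.DataFrame({"id": ids}).to_csv(id_list_path, index=False)
```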

---
### 2. Importing the data

The first step in running the fingerprinting analysis is to import the libraries needed and the data. We use `import_fingerprint_ids` to import the list of participants. 

In [3]:
from sihnpy import fingerprinting as s_fp

list_of_ids = s_fp.import_fingerprint_ids(id_list) #Here we put the path to the list of IDs. Since we are using data within sihnpy, we just use the variable we got earlier.
print(list_of_ids)

['01a' '02a' '03a' '04a' '05a' '06a' '07a' '08a' '09a' '010a']


The function `import_fingerprint_ids` is a general utility function wrapped around the `pandas.read_csv()` and `numpy.loadtxt()` functions. It accepts files ending with `.csv`, `.tsv` and `.txt`. The function then takes the first column of the data and returns it as a list of participants that we use in the rest of the analyses.

```{warning}
The script takes the first column of the dataframe as the column containing participants' IDs **OR** takes a text file with 1 ID per line. As such, you need to ensure that your input is correct.

This step is critical for the fingerprinting. If the list of participants does not match the names of the matrix files, `sihnpy` will not be able to import the matrices.

**You should always check that the list of IDs is what is expected after a first run.**

```
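
To make the behavior described above concrete, here is a minimal sketch of a reader that follows the same rules (first column for `.csv`/`.tsv`, one ID per line for `.txt`). The helper `read_ids` is hypothetical and is not `sihnpy`'s code. Reading everything as strings is a choice made in this sketch so that leading zeros in IDs such as `01a` are preserved.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def read_ids(path):
    """Hypothetical helper: return participant IDs as an array of strings."""
    path = Path(path)
    if path.suffix == ".csv":
        # First column of a comma-separated table.
        return pd.read_csv(path, dtype=str).iloc[:, 0].to_numpy()
    if path.suffix == ".tsv":
        # First column of a tab-separated table.
        return pd.read_csv(path, sep="\t", dtype=str).iloc[:, 0].to_numpy()
    if path.suffix == ".txt":
        # One ID per line; dtype=str keeps leading zeros.
        return np.loadtxt(path, dtype=str, ndmin=1)
    raise ValueError(f"Unsupported file type: {path.suffix}")


# Small demonstration with temporary files.
tmp = Path(tempfile.mkdtemp())
(tmp / "ids.csv").write_text("id,age\n01a,25\n02a,31\n")
(tmp / "ids.txt").write_text("01a\n02a\n")

ids_from_csv = read_ids(tmp / "ids.csv")
ids_from_txt = read_ids(tmp / "ids.txt")
```

Printing the result right after the import, as recommended in the warning, is the quickest way to catch a wrong column or a stray header line.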

In our simulated data, we can see that we have 10 participants ranging from ID `01a` to `010a`. 

### 3. Create a "fingerprinting object"

This title sounds a bit fancy, but the idea is simple: we need to store our list of participants and the paths to the matrices in a single Python object. I won't get into the specifics, but just know that it streamlines some processes down the line.

The code is pretty simple: we just give the list of participant IDs and the two paths to `FingerprintMats`. The code then takes these and prepares the ground for the rest of our computations.

In [4]:
fp_mats = s_fp.FingerprintMats(list_of_ids, path_mod1, path_mod2)

/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1


### 4. File and subject selection

The idea here is that we want to list and store all the names of the matrices to be used in the **fingerprinting**. This is used to simplify the process of selecting matrices when doing the **fingerprinting**.

```{warning}
As of now, the **fingerprinting** only works when there is the same number of participants in both folders. Future functionalities should alleviate this, but in the meantime, you need to make sure that there is the same number of files in both folders.
```

The function does not take any arguments.

In [5]:
fp_mats.fetch_matrice_file_names()
print(fp_mats.files_m1) #Print the file names of the first modality
print(fp_mats.files_m2) #Print the file names of the second modality

['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']
['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']


Once we have these lists, we intersect them with our subject list. This confirms how many participants we will keep in the end.

In [6]:
fp_mats.subject_selection()

We have 10 subjects in the list.
We have in total 10 & 10 participants with both modalities.


Once the subject selection is done, we can move on to computing the fingerprinting.

```{important}
Currently, the script will throw errors in three situations:
* If the number of matrices does not match between modalities
* If modality 1 and modality 2 return 0 files (the IDs from the list didn't match any file)
* If there are duplicated matrices within modality 1 or modality 2

Make sure to double-check your files if you get any of these errors.
```
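
The intersection and the three checks listed above can be sketched as follows. The function `select_subjects` and its `prefix`/`suffix` parameters are hypothetical and only assume the `mat_<id>.txt` naming seen in the simulated data. How `sihnpy` actually matches IDs to file names is not shown in this section.

```python
from collections import Counter


def select_subjects(ids, files_m1, files_m2, prefix="mat_", suffix=".txt"):
    """Hypothetical sketch: keep the files whose ID is in the subject list."""
    wanted = {f"{prefix}{subject_id}{suffix}" for subject_id in ids}
    kept_m1 = sorted(f for f in files_m1 if f in wanted)
    kept_m2 = sorted(f for f in files_m2 if f in wanted)

    # Check 1: no duplicated matrices within a modality.
    for name, kept in (("modality 1", kept_m1), ("modality 2", kept_m2)):
        duplicated = [f for f, count in Counter(kept).items() if count > 1]
        if duplicated:
            raise ValueError(f"Duplicated matrices in {name}: {duplicated}")

    # Check 2: same number of matrices in both modalities.
    if len(kept_m1) != len(kept_m2):
        raise ValueError("The number of matrices differs between modalities.")

    # Check 3: at least one file matched the ID list.
    if not kept_m1:
        raise ValueError("No file matched the IDs from the list.")

    return kept_m1, kept_m2


ids = ["01a", "02a", "03a"]
files_m1 = ["mat_02a.txt", "mat_01a.txt", "mat_03a.txt", "mat_04a.txt"]
files_m2 = ["mat_03a.txt", "mat_01a.txt", "mat_02a.txt"]
kept_m1, kept_m2 = select_subjects(ids, files_m1, files_m2)
```

Sorting the kept files gives both modalities the same participant order, which matters once the matrices are correlated pairwise.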

### 5. Fingerprinting 

This function is the core of the fingerprinting method. It imports the matrices from both modalities and correlates their values between all participants. The only argument it requires specifies which nodes (i.e., column/row pairs) to consider in the analyses. By default, we usually fingerprint using within-network connections. For a whole-brain fingerprint, the nodes span every column of the matrix.
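
As a rough illustration of what selecting nodes means, the sketch below keeps the chosen rows/columns of each matrix, flattens the upper triangle into a vector, and correlates these vectors between sessions for all pairs of participants. The data, the `vectorize` helper and the node indices are all made up for this example, so it should not be read as the implementation used by `sihnpy`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_nodes = 5, 12


def symmetric(values):
    """Symmetrize a stack of square matrices."""
    return (values + np.swapaxes(values, -1, -2)) / 2


# Hypothetical matrices: a subject-specific pattern plus session noise.
trait = symmetric(rng.normal(size=(n_subjects, n_nodes, n_nodes)))
mats_1 = trait + 0.3 * symmetric(rng.normal(size=trait.shape))
mats_2 = trait + 0.3 * symmetric(rng.normal(size=trait.shape))


def vectorize(matrices, nodes):
    """Keep rows/columns in `nodes`, then flatten the upper triangle."""
    sub = matrices[:, nodes][:, :, nodes]
    rows, cols = np.triu_indices(len(nodes), k=1)  # k=1 skips the diagonal
    return sub[:, rows, cols]


within_network = [0, 1, 2, 3]       # assumed node indices of one network
whole_brain = list(range(n_nodes))  # every column of the matrix

similarities = {}
for label, nodes in (("network", within_network), ("whole brain", whole_brain)):
    v1, v2 = vectorize(mats_1, nodes), vectorize(mats_2, nodes)
    # Rows: participants at session 1; columns: participants at session 2.
    similarities[label] = np.corrcoef(v1, v2)[:n_subjects, n_subjects:]
```

With 4 nodes, each participant contributes 6 unique connections. With all 12 nodes, each contributes 66, which is why whole-brain fingerprints grow quickly in dimensionality.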

In the simulated data, 