# Fingerprinting module
-----

```{note}
This code is under active development. Future examples will use real data instead of randomly generated data.
```

## 1. Introduction

Many currently used statistical analyses rely on the assumption that **brains of similar individuals are relatively homogenous**, meaning that individuals can be grouped together with relative confidence.

However, many studies, mainly in the field of functional magnetic resonance imaging **(REF MUELLER/REFS FP)**, have now highlighted that individuals in groups thought to be homogenous show significant within-group differences.

This prompted the development of the original fingerprinting methodology by [Finn et al. (2015)](https://www.nature.com/articles/nn.4135). Fingerprinting aims to derive brain signatures unique to each individual, by (FINISH EXPLAINING) + ADD FIGURE HERE


## 2. General definitions and usage

Below is a set of definitions for the terms used in this module:

```{admonition} Definitions

- Fingerprinting accuracy: whether a given individual was identified using a different brain imaging session
- Fingerprint strength: within-individual correlation of two brain imaging sessions
- Alikeness coefficient: between-individual correlation of two brain imaging sessions
- Identifiability: definition from [Amico & Goñi (2018)](https://doi.org/10.1038/s41598-018-25089-1), where identifiability is the difference between the fingerprint strength and the alikeness coefficient.

```
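To make these definitions concrete, here is a minimal `numpy` sketch on randomly generated data. It is not `sihnpy`'s implementation: the variable names are hypothetical, and it assumes that the alikeness coefficient of a participant is the *average* of their correlations with every other participant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_edges = 5, 50

# Hypothetical data: one vector of connectivity values per participant and session.
# Session 2 is session 1 plus noise, so each participant resembles themselves.
session1 = rng.normal(size=(n_subjects, n_edges))
session2 = session1 + 0.5 * rng.normal(size=(n_subjects, n_edges))

# corr[i, j]: Pearson correlation between participant i (session 1)
# and participant j (session 2)
corr = np.corrcoef(session1, session2)[:n_subjects, n_subjects:]

# Fingerprint strength: within-individual correlation (diagonal)
fingerprint_strength = np.diag(corr)

# Alikeness coefficient: between-individual correlation
# (assumption: averaged over all other participants)
others = ~np.eye(n_subjects, dtype=bool)
alikeness = np.array([corr[i, others[i]].mean() for i in range(n_subjects)])

# Identifiability: fingerprint strength minus alikeness coefficient
identifiability = fingerprint_strength - alikeness

# Fingerprinting accuracy: 1 if the best match in session 2 is the same participant
accuracy = (corr.argmax(axis=1) == np.arange(n_subjects)).astype(int)

print(fingerprint_strength, alikeness, identifiability, accuracy)
```

Each measure is computed per participant, so all four arrays have one entry per individual.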


## 3. Data example set-up

To demonstrate how to use the fingerprinting module, we created a small sample of simulated data with `numpy` and `pandas`. This data ships with `sihnpy` and can be used to test the functionalities of the package.

To get the simulated data for fingerprinting, you can run the following code:

```python
from sihnpy.datasets import get_fingerprint_simulated_data

id_list, path_mod1, path_mod2 = get_fingerprint_simulated_data()
```

```
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/fp_simulated_id_list.csv
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod2
```


This outputs three things:
* The path to the list of IDs of participants
* The path to the folder containing the matrices of the first session of brain imaging
* The path to the folder containing the matrices of the second session of brain imaging

These are the only mandatory inputs needed to launch the fingerprinting.

---
## 4. Running the fingerprinting analyses locally

The code below provides a step-by-step overview of how to do the fingerprinting analysis. We propose two methods depending on the scope of the desired analysis.

### **Using a Jupyter notebook** (Testing or single network)

When test-driving this module, or when a single network is of interest, a simple Jupyter notebook or Python script can be created to run the analyses. This is the method presented here.

#### Step 1 - Set up and importing the data
The first step in running the fingerprinting analysis is to import the libraries needed and the data.

```python
from sihnpy import fingerprinting as s_fp

# Pass the path we got earlier to the function
list_of_ids = s_fp.import_fingerprint_ids(id_list)
print(list_of_ids)
```

```
['01a' '02a' '03a' '04a' '05a' '06a' '07a' '08a' '09a' '010a']
```


The function `import_fingerprint_ids` is a general utility function wrapped around `pandas.read_csv()` and `numpy.loadtxt()`. It accepts files ending with `.csv`, `.tsv` and `.txt`. It takes the first column of the data and returns it as a list of participants that we use in the rest of the analyses.

```{warning}
The script takes the first column of the dataframe as the column containing the participants' IDs **OR** takes a text file with one ID per line. As such, you need to ensure that your input is correct.

This step is critical for the fingerprinting. If the list of participants does not match the names of the matrix files, `sihnpy` will not be able to import the matrices.

**You should always check that the list of IDs is what you expect after a first run.**

```

In our simulated data, we can see that we have 10 participants ranging from ID `01a` to `010a`. 
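To illustrate the two accepted input layouts (a table whose first column holds the IDs, or a text file with one ID per line), here is a minimal sketch. The helper `load_ids` and the toy file contents are hypothetical and are not `sihnpy`'s implementation; the sketch also assumes that tabular files have a header row.

```python
import io

import numpy as np
import pandas as pd


def load_ids(source, kind):
    """Hypothetical helper: return participant IDs as an array of strings."""
    if kind in ("csv", "tsv"):
        sep = "," if kind == "csv" else "\t"
        # dtype=str prevents IDs from being converted to numbers
        table = pd.read_csv(source, sep=sep, dtype=str)
        return table.iloc[:, 0].to_numpy()  # first column only
    # Plain text: one ID per line
    return np.loadtxt(source, dtype=str, ndmin=1)


# Toy inputs held in memory instead of files on disk
csv_text = "participant_id,site\n01a,A\n02a,B\n010a,A\n"
txt_text = "01a\n02a\n010a\n"

ids_from_csv = load_ids(io.StringIO(csv_text), "csv")
ids_from_txt = load_ids(io.StringIO(txt_text), "txt")
print(ids_from_csv, ids_from_txt)
```

Reading the IDs as strings matters here: an ID such as `010` would otherwise lose its leading zero and no longer match the matrix file names.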

#### Step 2 - Initialize a FingerprintMats object

This title sounds a bit fancy, but the idea is simple: we need to store our list of participants and the paths to the matrices in a single Python object. We won't go into the specifics here, but it streamlines some processes down the line.

The code is simple: we give the list of participant IDs and the two paths to `FingerprintMats`, which then prepares everything needed for the rest of our computations.

```python
fp_mats = s_fp.FingerprintMats(list_of_ids, path_mod1, path_mod2)
print(path_mod1)
```

```
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1
```


#### Step 3 - Get the matrices file names and final subject selection

The idea here is that we want to list all the matrix files that are available. 

```python
fp_mats.fetch_matrice_file_names()
print(fp_mats.files_m1)
print(fp_mats.files_m2)
```

```
['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']
['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']
```


Once we have these lists, we need to figure out which participants among these are of interest for us.

```python
fp_mats.subject_selection()
```

```
We have 10 subjects in the list.
We have in total 10 & 10 participants with both modalities.
```


Once the subject selection is done, we can move on to computing the fingerprinting.

```{important}
Currently, the script will throw an error in three situations:
* If the number of matrices differs between the two modality folders
* If modality 1 and modality 2 return 0 files (the IDs from the list didn't match any file)
* If there are duplicated matrices within modality 1 or modality 2

Make sure to double-check your files if you get one of these errors.
```
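The selection and the three checks above can be sketched as follows. This is an illustration only, not `sihnpy`'s code: the function `select_subjects` is hypothetical, it assumes that files are named `mat_<ID>.txt` as in the simulated data, and it assumes that an error is raised as soon as either modality has no matching file.

```python
def select_subjects(ids, files_m1, files_m2):
    """Hypothetical sketch: keep the files matching the IDs, then validate them."""
    expected = {f"mat_{i}.txt" for i in ids}  # assumed naming scheme
    kept_m1 = [f for f in files_m1 if f in expected]
    kept_m2 = [f for f in files_m2 if f in expected]

    # Check 1: the IDs matched no file
    if not kept_m1 or not kept_m2:
        raise ValueError("No matrix file matched the list of IDs.")
    # Check 2: duplicated matrices within a modality
    for kept in (kept_m1, kept_m2):
        if len(kept) != len(set(kept)):
            raise ValueError("Duplicated matrices within a modality.")
    # Check 3: different number of matrices between modalities
    if len(kept_m1) != len(kept_m2):
        raise ValueError("Different number of matrices between modalities.")
    return sorted(kept_m1), sorted(kept_m2)


ids = ["01a", "02a", "03a"]
files_m1 = ["mat_02a.txt", "mat_01a.txt", "mat_03a.txt", "mat_99z.txt"]
files_m2 = ["mat_03a.txt", "mat_01a.txt", "mat_02a.txt"]

kept_m1, kept_m2 = select_subjects(ids, files_m1, files_m2)
print(kept_m1)
print(kept_m2)
```

In this toy example, `mat_99z.txt` is ignored because its ID is not in the list, and both modalities end up with the same three files.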

#### Step 4 - Run the fingerprinting

This step is the core of the fingerprinting method. It imports the matrices from both modalities and correlates their values between all participants. The only argument it requires is the set of nodes (i.e., column/row pairs) to consider in the analyses. By default, we usually fingerprint using within-network connections. A whole-brain fingerprint corresponds to using every column of the matrix.
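To show what restricting the analysis to a set of nodes means in practice, here is a minimal `numpy` sketch. The helper `edge_vector`, the toy matrix and the node indices are hypothetical; the sketch assumes symmetric matrices, of which only the upper triangle (without the diagonal) is kept.

```python
import numpy as np


def edge_vector(matrix, nodes):
    """Keep the connections among `nodes` and flatten the upper triangle."""
    sub = matrix[np.ix_(nodes, nodes)]  # rows and columns of the selected nodes
    rows, cols = np.triu_indices(len(nodes), k=1)  # k=1 skips the diagonal
    return sub[rows, cols]


rng = np.random.default_rng(42)
n_nodes = 6
raw = rng.normal(size=(n_nodes, n_nodes))
matrix = (raw + raw.T) / 2  # symmetric, like a connectivity matrix

network_nodes = [0, 2, 3]  # hypothetical within-network selection
whole_brain_nodes = list(range(n_nodes))  # every column of the matrix

within = edge_vector(matrix, network_nodes)
whole = edge_vector(matrix, whole_brain_nodes)
print(within.shape, whole.shape)
```

A selection of `n` nodes yields `n * (n - 1) / 2` connections, which is why whole-brain analyses on large matrices take much longer than within-network ones.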

In the simulated data, 

### **Using a command-line script** (For running multiple networks and performance issues)

As currently written, the fingerprinting can take a long time to run, depending on how many participants are included and on the size of the matrices fed to the script. In the original paper where I adapted this method, I preferred using a command-line script and a high-performance cluster to run the fingerprinting analyses in parallel, as the analysis for each network would take at least 2 hours.

A command-line script and instructions on how to launch it on a cluster will be made available in a subsequent version.
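As a rough illustration of the idea only (this is not the forthcoming `sihnpy` script), the sketch below shows how a command-line entry point could accept one network per job, so that a cluster scheduler can launch the networks in parallel. Every argument name and value here is hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical argument parser for a one-network fingerprinting job."""
    parser = argparse.ArgumentParser(
        description="Hypothetical per-network fingerprinting job."
    )
    parser.add_argument("--id-list", required=True, help="Path to the list of participant IDs")
    parser.add_argument("--path-mod1", required=True, help="Folder with the matrices of modality 1")
    parser.add_argument("--path-mod2", required=True, help="Folder with the matrices of modality 2")
    parser.add_argument("--network", required=True, help="Label of the network to fingerprint")
    parser.add_argument("--nodes", type=int, nargs="+", required=True, help="Indices of the nodes to include")
    return parser


# Example of the arguments a scheduler could pass to a single job
args = build_parser().parse_args([
    "--id-list", "ids.csv",
    "--path-mod1", "mod1",
    "--path-mod2", "mod2",
    "--network", "network_a",
    "--nodes", "0", "1", "2",
])
print(args.network, args.nodes)
```

With this layout, each network becomes an independent job that differs only in its `--network` and `--nodes` values.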