# Fingerprinting analysis - Matrix-like data

How do you use the fingerprinting module of `sihnpy`?

We demonstrate a typical **fingerprinting** analysis using `sihnpy`. You can run these analyses by building a script like the one below, either in a Python script or in a Jupyter Notebook. If you need to run this analysis on a large number of participants or on very high-dimensional data (>160,000 features per participant), I recommend using a command-line script (ADD THE REF TO THE NOTEBOOK HERE).

Note that the steps below are specific to matrix-like data (e.g., functional or structural connectivity, covariance matrices, etc.). If you have table-like data (e.g., volume by region), a different method is required (ADD REF TO THE NOTEBOOK HERE).

### 1. Preparing the data

To run a fingerprinting analysis, we need three things:
* The path to a list of participants to analyze
* The path to the folder containing the matrices of the first session of brain imaging
* The path to the folder containing the matrices of the second session of brain imaging

If you already have the above for your data, you can skip ahead to step 2. Otherwise, `sihnpy` also offers a small sample of data, simulated using `numpy` and `pandas`, to practice using the functions.

To get the simulated data for fingerprinting, you can run the following code:

In [1]:
from sihnpy.datasets import get_fingerprint_simulated_data

id_list, path_mod1, path_mod2 = get_fingerprint_simulated_data()

/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/fp_simulated_id_list.csv
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1
/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod2


This outputs three things:
* The path to the list of IDs of participants (`id_list`)
* The path to the folder containing the matrices of the first session of brain imaging (`path_mod1`)
* The path to the folder containing the matrices of the second session of brain imaging (`path_mod2`)

These are the only mandatory inputs needed to launch the fingerprinting.

---
### 2. Importing the data

The first step in running the fingerprinting analysis is to import the libraries needed and the data. We use `import_fingerprint_ids` to import the list of participants. 

In [3]:
from sihnpy import fingerprinting as s_fp

# Path to the list of IDs. Since we are using the data shipped with sihnpy, we reuse the variable we got earlier.
list_of_ids = s_fp.import_fingerprint_ids(id_list)
print(list_of_ids)

['01a' '02a' '03a' '04a' '05a' '06a' '07a' '08a' '09a' '010a']


The function `import_fingerprint_ids` is a general utility function wrapped around the `pandas.read_csv()` and `numpy.loadtxt()` functions. It accepts files ending with `.csv`, `.tsv` and `.txt`. The function then takes the first column in the data and returns it as a list of participants that we use in the rest of the analyses.
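To make the "first column" and "one ID per line" behaviours concrete, here is a minimal sketch of that kind of reader. It is not `sihnpy`'s implementation: it uses only the standard library instead of `pandas` and `numpy`, and the names `read_ids` and `has_header`, as well as the demo file contents, are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_ids(path, has_header=True):
    """Return participant IDs from a .csv, .tsv or .txt file.

    Hypothetical helper: `read_ids` and `has_header` are illustrative
    names and are not part of sihnpy.
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # Plain text: one ID per line, blank lines ignored.
        with path.open() as handle:
            return [line.strip() for line in handle if line.strip()]

    if suffix not in {".csv", ".tsv"}:
        raise ValueError(f"Unsupported file type: {suffix}")

    delimiter = "," if suffix == ".csv" else "\t"
    with path.open(newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    if has_header:
        rows = rows[1:]  # Drop the row of column names.
    # Keep the first column only, as strings, so "01a" keeps its leading zero.
    return [row[0].strip() for row in rows]


# Demo on throwaway files (the "age" column is made-up filler).
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "ids.csv"
    csv_path.write_text("participant_id,age\n01a,70\n010a,65\n")
    txt_path = Path(folder) / "ids.txt"
    txt_path.write_text("01a\n010a\n")
    ids_from_csv = read_ids(csv_path)
    ids_from_txt = read_ids(txt_path)

print(ids_from_csv)  # ['01a', '010a']
print(ids_from_txt)  # ['01a', '010a']
```

Reading the IDs as strings matters here: an ID such as `010a` would otherwise risk losing its formatting and no longer match the matrix file names.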

```{warning}
The function takes the first column of the dataframe as the column containing the participants' IDs **OR** takes a text file with 1 ID per line. As such, you need to ensure that your input is correct.

This step is critical for the fingerprinting. If the list of participants does not match the names of the matrix files, `sihnpy` will not be able to import the matrices.

**You should always check that the list of IDs is what you expect after a first run.**

```
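One way to run that check yourself is to compare the ID list against the files actually present in a folder. The sketch below assumes the `mat_<ID>.txt` naming pattern seen in the simulated data; how `sihnpy` matches IDs to files internally is not shown here, and `check_ids_against_files`, `prefix` and `suffix` are hypothetical names.

```python
import tempfile
from pathlib import Path


def check_ids_against_files(ids, folder, prefix="mat_", suffix=".txt"):
    """Split IDs into those with a matching matrix file and those without.

    Hypothetical helper; assumes files are named "<prefix><ID><suffix>".
    """
    available = {p.name for p in Path(folder).iterdir() if p.is_file()}
    found = [i for i in ids if f"{prefix}{i}{suffix}" in available]
    missing = [i for i in ids if f"{prefix}{i}{suffix}" not in available]
    return found, missing


# Demo on a throwaway folder holding two empty matrix files.
with tempfile.TemporaryDirectory() as folder:
    for participant in ["01a", "02a"]:
        (Path(folder) / f"mat_{participant}.txt").touch()
    found, missing = check_ids_against_files(["01a", "02a", "03a"], folder)

print(found)    # ['01a', '02a']
print(missing)  # ['03a']
```

Running this once per modality folder before launching the analysis tells you which participants would be silently dropped.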

In our simulated data, we can see that we have 10 participants ranging from ID `01a` to `010a`. 

### 3. Create a "fingerprinting object"

This title sounds a bit fancy, but the idea is simple: we need to store our list of participants and the paths to the matrices in a single Python object. I won't get into the specifics, but just know that it streamlines some processes down the line.

The code is pretty simple: we just give the list of participant IDs and the two paths to `FingerprintMats`. The code then takes these and prepares the ground for the rest of our computations.

In [4]:
fp_mats = s_fp.FingerprintMats(list_of_ids, path_mod1, path_mod2)

/Users/stong3/Desktop/sihnpy/src/sihnpy/data/fingerprinting/matrices_simulated_mod1


### 4. File and subject selection

The idea here is to list and store the names of all the matrices to be used in the **fingerprinting**. This simplifies selecting the matrices later, when doing the **fingerprinting** itself.

```{warning}
As of now, the **fingerprinting** only works when there is the same number of participants in both folders. Future functionalities should alleviate this, but in the meantime, you need to make sure that there is the same number of files in both folders.
```

The function `fetch_matrice_file_names` does not take any argument.

In [5]:
fp_mats.fetch_matrice_file_names()
print(fp_mats.files_m1) #Print the file names of the first modality
print(fp_mats.files_m2) #Print the file names of the second modality

['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']
['mat_06a.txt', 'mat_07a.txt', 'mat_01a.txt', 'mat_02a.txt', 'mat_03a.txt', 'mat_010a.txt', 'mat_04a.txt', 'mat_08a.txt', 'mat_09a.txt', 'mat_05a.txt']


Once we have these lists, we intersect them with our subject list. This confirms how many participants we will keep in the end.

In [6]:
fp_mats.subject_selection()

We have 10 subjects in the list.
We have in total 10 & 10 participants with both modalities.


Once the subject selection is done, we can move on to computing the fingerprinting.

```{important}
Currently, the script will throw an error in three situations:
* If the number of matrices does not match between the two modalities
* If modality one and modality two return 0 files (the IDs from the list didn't match any file)
* If there are duplicated matrices within modality 1 or modality 2

Make sure to double-check your files if you get one of these errors.
```
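The selection logic and the three error conditions can be pictured with a small sketch. This is an illustration only, not the code behind `subject_selection`: the function name `select_subjects`, the exact-name matching rule (`mat_<ID>.txt`) and the error messages are all assumptions.

```python
def select_subjects(ids, files_m1, files_m2):
    """Keep the files of each modality that belong to a listed participant.

    Hypothetical sketch; assumes "mat_<ID>.txt" belongs to participant <ID>.
    """
    wanted = {f"mat_{i}.txt" for i in ids}
    kept_m1 = [f for f in files_m1 if f in wanted]
    kept_m2 = [f for f in files_m2 if f in wanted]

    # Situation 3: the same matrix appears twice within a modality.
    if len(kept_m1) != len(set(kept_m1)) or len(kept_m2) != len(set(kept_m2)):
        raise ValueError("Duplicated matrices within a modality.")
    # Situation 2: the IDs from the list did not match any file.
    if not kept_m1 and not kept_m2:
        raise ValueError("No file matched the ID list.")
    # Situation 1: the two modalities disagree on the number of matrices.
    if len(kept_m1) != len(kept_m2):
        raise ValueError("Different number of matrices between modalities.")
    return sorted(kept_m1), sorted(kept_m2)


ids = ["01a", "02a", "03a"]
files_m1 = ["mat_02a.txt", "mat_01a.txt", "mat_09a.txt"]
files_m2 = ["mat_01a.txt", "mat_02a.txt"]

kept_m1, kept_m2 = select_subjects(ids, files_m1, files_m2)
print(kept_m1)  # ['mat_01a.txt', 'mat_02a.txt']
print(kept_m2)  # ['mat_01a.txt', 'mat_02a.txt']
```

In this toy example, participant `03a` has no file and `mat_09a.txt` is not in the ID list, so both are left out and two participants remain in each modality.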

### 5. Fingerprinting 

This function is the core of the fingerprinting method. It imports the matrices from both modalities and correlates their values between all participants. The only argument it requires is the list of nodes (i.e., column/row pairs) to consider in the analyses. By default, we usually fingerprint using within-network connections. For a whole-brain fingerprint, use all the nodes, i.e., as many nodes as there are columns in the matrix.
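To give an intuition of what "correlating the matrices between all participants" means, here is a minimal `numpy` sketch on randomly generated matrices. It is not `sihnpy`'s implementation: the use of Pearson correlation, the restriction to the upper triangle of the selected nodes, and the function names `within_node_edges` and `cross_modality_correlations` are assumptions made for illustration.

```python
import numpy as np


def within_node_edges(matrix, nodes):
    """Return the unique edges (upper triangle) linking the selected nodes."""
    sub = matrix[np.ix_(nodes, nodes)]
    rows, cols = np.triu_indices(len(nodes), k=1)
    return sub[rows, cols]


def cross_modality_correlations(mats_m1, mats_m2, nodes):
    """Correlate each participant's modality 1 edges with everyone's modality 2 edges."""
    edges_m1 = np.array([within_node_edges(m, nodes) for m in mats_m1])
    edges_m2 = np.array([within_node_edges(m, nodes) for m in mats_m2])
    n = len(edges_m1)
    # np.corrcoef stacks both sets of rows; the top-right block of the result
    # holds the modality 1 (rows) versus modality 2 (columns) correlations.
    return np.corrcoef(edges_m1, edges_m2)[:n, n:]


# Toy data: 5 participants, 10 nodes, two noisy copies of each person's matrix.
rng = np.random.default_rng(0)
n_participants, n_nodes = 5, 10
mats_m1, mats_m2 = [], []
for _ in range(n_participants):
    base = rng.standard_normal((n_nodes, n_nodes))
    base = (base + base.T) / 2  # Symmetric, like a connectivity matrix.
    mats_m1.append(base + 0.1 * rng.standard_normal((n_nodes, n_nodes)))
    mats_m2.append(base + 0.1 * rng.standard_normal((n_nodes, n_nodes)))

nodes = list(range(8))  # Pretend the first 8 nodes form one network.
corr = cross_modality_correlations(mats_m1, mats_m2, nodes)

# A participant is "identified" when their own modality 2 matrix is the best match.
best_match = corr.argmax(axis=1)
print(corr.shape)
print(best_match)
```

Each row of `corr` corresponds to one participant in modality 1 and each column to one participant in modality 2, so the diagonal holds the within-participant correlations. Passing `list(range(n_nodes))` as `nodes` would correspond to the whole-brain case.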

In the simulated data, 