# Fingerprinting analysis - Tabular data: Step-by-step

Ok, it's all great to know about the rationale and everything behind fingerprinting, but now let's get to the fun part of it: how do we actually use the module?

Here, I demonstrate, step-by-step, how to run the fingerprinting analysis. You can follow along once you have installed `sihnpy` and opened a Jupyter Notebook.

```{note}
Note that the functions described on this page only work with tabular data (i.e., spreadsheets). If you want to use fingerprinting with matrix-like data (one matrix per session, per participant), `sihnpy` offers {ref}`a different set of functions for that purpose <1.fingerprinting/fingerprinting_module:Fingerprinting analysis - Matrix-like data: Step-by-step>`.

Specifically, **the functions here only accept `pandas.DataFrame`.**
```

Already read the tutorial and just want the code (a.k.a. too long; didn't read)? Head over to the {ref}`tl;dr section <1.fingerprinting/fp_tab_module:tl;dr>`.

## Preparing the data

As always, `sihnpy` comes shipped with data for you to practice with. In this case, we use T1-weighted {ref}`structural magnetic resonance imaging data processed using FreeSurfer <0.pad_data/datasets_usage:Structural magnetic resonance imaging data>`. To import it, you simply need the following code:


```python
from sihnpy import datasets

volume_data, thickness_data, aseg_data = datasets.pad_fptab_input()
volume_data
```

| participant_id | session | run | ctx_lh_bankssts_volume | ctx_lh_caudalanteriorcingulate_volume | ctx_lh_caudalmiddlefrontal_volume | ctx_lh_cuneus_volume | ctx_lh_entorhinal_volume | ctx_lh_fusiform_volume | ctx_lh_inferiorparietal_volume | ctx_lh_inferiortemporal_volume | ... | ctx_rh_rostralanteriorcingulate_volume | ctx_rh_rostralmiddlefrontal_volume | ctx_rh_superiorfrontal_volume | ctx_rh_superiorparietal_volume | ctx_rh_superiortemporal_volume | ctx_rh_supramarginal_volume | ctx_rh_frontalpole_volume | ctx_rh_temporalpole_volume | ctx_rh_transversetemporal_volume | ctx_rh_insula_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sub-1000173 | ses-FU12 | run-001 | 2046.0 | 1446.0 | 6174.0 | 2971.0 | 2560.0 | 8953.0 | 10505.0 | 8728.0 | ... | 1536.0 | 14428.0 | 21615.0 | 11647.0 | 10390.0 | 9711.0 | 1075.0 | 3316.0 | 830.0 | 6757.0 |
| sub-1002928 | ses-BL00 | run-001 | 2572.0 | 2188.0 | 7326.0 | 2961.0 | 1915.0 | 9140.0 | 11575.0 | 9313.0 | ... | 1649.0 | 13692.0 | 20785.0 | 10793.0 | 10916.0 | 9025.0 | 950.0 | 2167.0 | 937.0 | 7021.0 |
| sub-1004359 | ses-BL00 | run-001 | 1768.0 | 1333.0 | 4424.0 | 3324.0 | 1695.0 | 8514.0 | 10465.0 | 10928.0 | ... | 2033.0 | 10784.0 | 17373.0 | 10712.0 | 10921.0 | 8554.0 | 1329.0 | 2508.0 | 718.0 | 5668.0 |
| sub-1004359 | ses-FU12 | run-001 | 1845.0 | 1302.0 | 4471.0 | 3303.0 | 1758.0 | 8381.0 | 10413.0 | 10702.0 | ... | 2235.0 | 11402.0 | 17398.0 | 10559.0 | 11107.0 | 8597.0 | 1164.0 | 2480.0 | 735.0 | 5704.0 |
| sub-1016072 | ses-BL00 | run-001 | 2234.0 | 1721.0 | 6435.0 | 2831.0 | 1421.0 | 8454.0 | 11209.0 | 9532.0 | ... | 1758.0 | 14514.0 | 16783.0 | 11713.0 | 11079.0 | 9948.0 | 1138.0 | 2562.0 | 998.0 | 6466.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sub-9930257 | ses-BL00 | run-001 | 1692.0 | 1586.0 | 5212.0 | 3144.0 | 1641.0 | 8405.0 | 9919.0 | 9672.0 | ... | 1813.0 | 14536.0 | 18152.0 | 13127.0 | 10875.0 | 8554.0 | 1271.0 | 2705.0 | 1066.0 | 6363.0 |
| sub-9931234 | ses-BL00 | run-001 | 1886.0 | 1191.0 | 5521.0 | 2587.0 | 1628.0 | 9728.0 | 10930.0 | 11052.0 | ... | 2118.0 | 16833.0 | 18207.0 | 13311.0 | 10490.0 | 8442.0 | 1135.0 | 2398.0 | 942.0 | 6929.0 |
| sub-9931234 | ses-FU12 | run-001 | 1779.0 | 1166.0 | 5493.0 | 2310.0 | 1672.0 | 9822.0 | 10589.0 | 10550.0 | ... | 2003.0 | 16640.0 | 18026.0 | 13163.0 | 10340.0 | 8363.0 | 1313.0 | 2484.0 | 932.0 | 6859.0 |
| sub-9939055 | ses-BL00 | run-001 | 1785.0 | 2550.0 | 4754.0 | 1808.0 | 1604.0 | 8510.0 | 8750.0 | 8258.0 | ... | 1975.0 | 13286.0 | 17322.0 | 10076.0 | 9996.0 | 7155.0 | 981.0 | 2212.0 | 651.0 | 5746.0 |


Let's take, for example, the volumetric data printed above. This dataset is in **long format**, i.e., each visit for each participant has its own row. You can distinguish the different visits by the **session** variable (either BL00 or FU12, representing baseline and follow-up at 12 months). The **run** variable is not particularly useful for you: for some PREVENT-AD participants who didn't have a good first run of imaging, a second run was taken. In `sihnpy`'s data, we kept `run-002` if it was there.

The rest of the columns (all starting with `ctx`) are the actual data in each cortical region, where `lh` is the region in the left hemisphere, and `rh` is the region in the right hemisphere.
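If you want to get a feel for this layout before fingerprinting, here is a minimal sketch using plain `pandas` on a made-up toy table (the participant IDs and values are invented for illustration, they are not from PREVENT-AD). It counts the visits per participant and separates the left- and right-hemisphere columns by their name.

```python
import pandas as pd

# Toy long-format table mimicking the layout above (all values are made up).
toy = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-BL00"],
        "run": ["run-001", "run-001", "run-001"],
        "ctx_lh_cuneus_volume": [3324.0, 3303.0, 2831.0],
        "ctx_rh_insula_volume": [5668.0, 5704.0, 6466.0],
    },
    index=pd.Index(["sub-01", "sub-01", "sub-02"], name="participant_id"),
)

# Number of visits (rows) available for each participant.
visits_per_participant = toy.groupby(level="participant_id").size()

# Regional columns, split by hemisphere based on the column name.
left_cols = [c for c in toy.columns if c.startswith("ctx_lh")]
right_cols = [c for c in toy.columns if c.startswith("ctx_rh")]

print(visits_per_participant.to_dict())
print(left_cols, right_cols)
```

The same two checks should work on `volume_data` as long as its index is named `participant_id`, as in the table above.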

Note that you can use the fingerprinting with the volume, the thickness and even the aseg data. However, the aseg data doesn't only hold regional volumes: it also includes variables like total intracranial volume. Be careful if using this data.

Let's move on to the fingerprinting!

## Fingerprinting steps

### 1. Importing and cleaning the data

As you may have seen in the matrix data fingerprinting, {ref}`cleaning the data is a very important and very... long step <1.fingerprinting/fingerprinting_module:2. Importing the data for fingerprinting>`. Thankfully, this is much shorter in this fingerprinting version, but there are a couple of critical steps that need to be done:

* Your dataframe needs to have an index containing the participant IDs (`sihnpy` relies on the index on multiple occasions)
* Each participant must have at least 2 visits (i.e., two rows) in the dataframe
* Your dataframe needs to be in long format, with one variable specifying which "visit" we are talking about
* The columns you want to use in your dataframe for fingerprinting must all start with the same prefix

Thankfully, the data `sihnpy` provides is already formatted for these requirements (it would not be super fair if I gave you data that wasn't ready to use right?).
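If you bring your own data, the requirements listed above can be checked before calling `sihnpy`. The sketch below is only an illustration written with plain `pandas`: `check_fingerprint_input` is a hypothetical helper of my own naming, not a `sihnpy` function, and the toy table is made up.

```python
import pandas as pd

def check_fingerprint_input(df, visit_col, prefix):
    """Hypothetical helper: list the problems found (empty list = all good)."""
    problems = []
    # Long format needs one column telling the visits apart.
    if visit_col not in df.columns:
        problems.append(f"missing visit column '{visit_col}'")
    # Each participant (index value) should appear on at least two rows.
    counts = df.index.value_counts()
    single = sorted(counts[counts < 2].index)
    if single:
        problems.append(f"participants with fewer than 2 visits: {single}")
    # At least one column must carry the shared prefix.
    if not any(str(c).startswith(prefix) for c in df.columns):
        problems.append(f"no columns start with '{prefix}'")
    return problems

# Made-up example: sub-02 only has a baseline visit.
toy = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-BL00"],
        "ctx_lh_cuneus_volume": [3324.0, 3303.0, 2831.0],
    },
    index=pd.Index(["sub-01", "sub-01", "sub-02"], name="participant_id"),
)

problems = check_fingerprint_input(toy, visit_col="session", prefix="ctx")
print(problems)
```

Participants with a single visit are not a blocking problem for `sihnpy` (the cleaning function described next removes them), but it is useful to know how many you are about to lose.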

Let's clean the data:

```python
from sihnpy import fingerprinting as fp

data_bl, data_fu = fp.import_fingerprint_data(volume_data, var='session')
data_bl
```

| participant_id | session | run | ctx_lh_bankssts_volume | ctx_lh_caudalanteriorcingulate_volume | ctx_lh_caudalmiddlefrontal_volume | ctx_lh_cuneus_volume | ctx_lh_entorhinal_volume | ctx_lh_fusiform_volume | ctx_lh_inferiorparietal_volume | ctx_lh_inferiortemporal_volume | ... | ctx_rh_rostralanteriorcingulate_volume | ctx_rh_rostralmiddlefrontal_volume | ctx_rh_superiorfrontal_volume | ctx_rh_superiorparietal_volume | ctx_rh_superiortemporal_volume | ctx_rh_supramarginal_volume | ctx_rh_frontalpole_volume | ctx_rh_temporalpole_volume | ctx_rh_transversetemporal_volume | ctx_rh_insula_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sub-1004359 | ses-BL00 | run-001 | 1768.0 | 1333.0 | 4424.0 | 3324.0 | 1695.0 | 8514.0 | 10465.0 | 10928.0 | ... | 2033.0 | 10784.0 | 17373.0 | 10712.0 | 10921.0 | 8554.0 | 1329.0 | 2508.0 | 718.0 | 5668.0 |
| sub-1016072 | ses-BL00 | run-001 | 2234.0 | 1721.0 | 6435.0 | 2831.0 | 1421.0 | 8454.0 | 11209.0 | 9532.0 | ... | 1758.0 | 14514.0 | 16783.0 | 11713.0 | 11079.0 | 9948.0 | 1138.0 | 2562.0 | 998.0 | 6466.0 |
| sub-1072774 | ses-BL00 | run-001 | 2366.0 | 1753.0 | 5689.0 | 2570.0 | 1681.0 | 8522.0 | 11084.0 | 8635.0 | ... | 1744.0 | 14151.0 | 16346.0 | 11987.0 | 9106.0 | 9915.0 | 943.0 | 2725.0 | 643.0 | 5986.0 |
| sub-1076159 | ses-BL00 | run-001 | 2403.0 | 795.0 | 5211.0 | 2595.0 | 1918.0 | 8779.0 | 11293.0 | 11561.0 | ... | 2211.0 | 13838.0 | 20560.0 | 13559.0 | 12788.0 | 11333.0 | 1467.0 | 2525.0 | 847.0 | 7030.0 |
| sub-1154932 | ses-BL00 | run-001 | 3354.0 | 2513.0 | 5665.0 | 2870.0 | 1804.0 | 9756.0 | 12505.0 | 10587.0 | ... | 2302.0 | 14159.0 | 19629.0 | 10676.0 | 10926.0 | 8423.0 | 1079.0 | 3148.0 | 712.0 | 7060.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| sub-9865768 | ses-BL00 | run-001 | 1859.0 | 1703.0 | 4267.0 | 2353.0 | 1420.0 | 7183.0 | 8749.0 | 6648.0 | ... | 1694.0 | 12451.0 | 16895.0 | 9819.0 | 9702.0 | 7613.0 | 1117.0 | 2409.0 | 759.0 | 6087.0 |
| sub-9889544 | ses-BL00 | run-001 | 1826.0 | 2437.0 | 7097.0 | 3055.0 | 1948.0 | 8378.0 | 10641.0 | 11911.0 | ... | 2196.0 | 15937.0 | 20839.0 | 12166.0 | 11910.0 | 10482.0 | 1001.0 | 2479.0 | 954.0 | 6049.0 |
| sub-9909448 | ses-BL00 | run-002 | 2813.0 | 1368.0 | 6195.0 | 2467.0 | 1930.0 | 9139.0 | 10930.0 | 11865.0 | ... | 1952.0 | 16274.0 | 16658.0 | 11004.0 | 10621.0 | 10026.0 | 1302.0 | 3313.0 | 809.0 | 6644.0 |
| sub-9931234 | ses-BL00 | run-001 | 1886.0 | 1191.0 | 5521.0 | 2587.0 | 1628.0 | 9728.0 | 10930.0 | 11052.0 | ... | 2118.0 | 16833.0 | 18207.0 | 13311.0 | 10490.0 | 8442.0 | 1135.0 | 2398.0 | 942.0 | 6929.0 |


Ok, let's deconstruct what just happened. The function `import_fingerprint_data` does two things: 1) it removes participants who only have one row, as we need two visits to compute **self-identifiability**, and 2) it splits the first and second visits into two dataframes (this simplifies the code and calculations). We are left with two dataframes: one with the baseline data and one with the follow-up data.
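To make these two operations concrete, here is a rough `pandas` sketch of the same idea on a made-up toy table. It is not `sihnpy`'s actual implementation: `split_first_last` is a hypothetical helper, and it assumes that the rows of each participant are already in chronological order.

```python
import pandas as pd

def split_first_last(df):
    """Hypothetical sketch: drop single-visit participants, then split visits."""
    # 1) Keep only participants with at least two rows (visits).
    counts = df.index.value_counts()
    kept = df[df.index.isin(counts[counts >= 2].index)]
    # 2) First row of each participant goes to one dataframe, last row to the other.
    #    This relies on the rows being ordered by visit within each participant.
    first = kept[~kept.index.duplicated(keep="first")].sort_index()
    last = kept[~kept.index.duplicated(keep="last")].sort_index()
    return first, last

# Made-up example: sub-02 only has one visit and should be dropped.
toy = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-BL00", "ses-BL00", "ses-FU12"],
        "ctx_lh_cuneus_volume": [3324.0, 3303.0, 2831.0, 2570.0, 2561.0],
    },
    index=pd.Index(
        ["sub-01", "sub-01", "sub-02", "sub-03", "sub-03"], name="participant_id"
    ),
)

toy_bl, toy_fu = split_first_last(toy)
print(toy_bl)
print(toy_fu)
```

Both outputs end up with one row per participant and the same index, which is what the next step expects.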

We are now ready for fingerprinting.

```{admonition} Fix: what if I have more than two visits per participant?
:class: warning

In cohort studies (PREVENT-AD included), participants may be followed for much more than two years. This complicates matters, as it forces the user of the fingerprint methodology **to choose which visits to use for fingerprinting**. In `sihnpy`, the option currently implemented is to **fingerprint the first to the last visit available**.

If you have more than two visits in your study, there are usually two types of analyses you might want to do: fingerprint two specific visits (e.g., first and last), or calculate the change over time. In the first case, before using `sihnpy`, **you need to make sure that the session variable is ordered correctly for each participant**. This is usually (but not always) the case, so it needs to be checked before use. In the second case, you will have to create individual dataframes for each pair of visits you would like to fingerprint.

Depending on the interest shown, I could modify the code so that this preparation is no longer needed before using `sihnpy`, but it is currently not in the plans.

```
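For the multi-visit case, the preparation described in the box above could look like the sketch below. It is only an illustration on made-up data: the `ses-FU24` label and the `visit_pair` helper are hypothetical, and sorting the session labels alphabetically only gives the chronological order if your labels happen to sort that way (as `ses-BL00`, `ses-FU12`, `ses-FU24` do).

```python
import pandas as pd

# Made-up table with three visits per participant, deliberately out of order.
toy = pd.DataFrame(
    {
        "session": ["ses-FU24", "ses-BL00", "ses-FU12",
                    "ses-BL00", "ses-FU12", "ses-FU24"],
        "ctx_lh_cuneus_volume": [3290.0, 3324.0, 3303.0, 2831.0, 2820.0, 2815.0],
    },
    index=pd.Index(
        ["sub-01", "sub-01", "sub-01", "sub-02", "sub-02", "sub-02"],
        name="participant_id",
    ),
)

# Put the visits in order within each participant.
ordered = (
    toy.reset_index()
    .sort_values(["participant_id", "session"])
    .set_index("participant_id")
)

def visit_pair(df, visit_col, first, second):
    """Hypothetical helper: keep only the rows of the two requested visits."""
    return df[df[visit_col].isin([first, second])]

# One dataframe per pair of visits to fingerprint.
pairs = {
    (a, b): visit_pair(ordered, "session", a, b)
    for a, b in [("ses-BL00", "ses-FU12"), ("ses-FU12", "ses-FU24")]
}
print(pairs[("ses-FU12", "ses-FU24")])
```

Each dataframe in `pairs` then has exactly two visits per participant and can go through the cleaning step on its own.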

```{admonition} Advanced users: Unequal number of participants between visits.
:class: danger

In many fingerprinting papers, authors use an unequal number of participants between the first and second visit. This can be useful when adding more participants in the second visit for instance, as it adds more potential noise in the analysis, and reinforces that when identification works, it really works.

However, for simplicity, the code currently only offers fingerprinting for participants with both visits. This could change in the future depending on the interest.

```

### 2. Fingerprinting tabular data

And we're already ready to fingerprint the data!

The next function only requires the two dataframes we cleaned with the first function and the prefix used to mark the columns we want to keep for fingerprinting.

```python
# This should take around 20 seconds on sihnpy's data
similarity_matrix = fp.fingerprint_tabs(data1=data_bl, data2=data_fu, pref='ctx')
```

```text
Participant 1 / 234
Participant 2 / 234
Participant 3 / 234
...
```

```python
similarity_matrix
```

```text
array([[0.99939915, 0.9668774 , 0.97540227, ..., 0.95741285, 0.96756766,
        0.96571496],
       [0.9668774 , 0.99961408, 0.98778711, ..., 0.9787953 , 0.97243421,
        0.98021875],
       [0.97540227, 0.98778711, 0.99907875, ..., 0.97767822, 0.98302374,
        0.9825117 ],
       ...,
       [0.95741285, 0.9787953 , 0.97767822, ..., 0.99862623, 0.96710599,
        0.97997632],
       [0.96756766, 0.97243421, 0.98302374, ..., 0.96710599, 0.9996869 ,
        0.97899449],
       [0.96571496, 0.98021875, 0.9825117 , ..., 0.97997632, 0.97899449,
        0.99953386]])
```

Ok, so what did we do? {ref}`Just like in the matrix-like fingerprinting <1.fingerprinting/fingerprinting_module:5. Fingerprinting>`, we created a **similarity matrix** where the diagonal is the **self-identifiability** (i.e., within-individual correlation) and the off-diagonal elements are all of the **others-identifiabilities** (i.e., between-individual correlations).

I discuss the different aspects of this method in more detail {ref}`in the section on fingerprinting matrix-like data <1.fingerprinting/fingerprinting_module:5. Fingerprinting>`. Note that, contrary to the fingerprinting of matrix-like data, the script for tabular data is more limited at the moment: it doesn't offer normalization of the data, it doesn't offer the selection of specific columns (like selecting nodes in the other version), and it only offers Pearson correlations. The reason for this is simply that, by design, the tabular version of the fingerprinting should really be used with smaller data (e.g., structural data in a small set of parcels). That said, should you be interested in these additions, let me know [by opening an issue on GitHub](https://github.com/stong3/sihnpy/issues).
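If you are curious about what a similarity matrix of Pearson correlations looks like in code, here is a minimal `numpy` sketch on made-up numbers. It illustrates the general idea (correlating every participant's first visit with every participant's second visit) and is not necessarily how `sihnpy` computes it internally; `pearson_similarity` is a hypothetical name.

```python
import numpy as np

def pearson_similarity(first, second):
    """Correlate every row of `first` with every row of `second`."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    # z-score each row (participant) across its columns (regions)...
    fz = (first - first.mean(axis=1, keepdims=True)) / first.std(axis=1, keepdims=True)
    sz = (second - second.mean(axis=1, keepdims=True)) / second.std(axis=1, keepdims=True)
    # ...so that the averaged product of z-scores is Pearson's r.
    return fz @ sz.T / first.shape[1]

# Made-up data: 3 participants (rows) x 4 regions (columns), two visits.
visit1 = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 1.0, 4.0, 3.0],
                   [4.0, 3.0, 2.0, 1.0]])
visit2 = np.array([[1.1, 2.0, 2.9, 4.2],
                   [2.2, 0.9, 4.1, 3.0],
                   [3.9, 3.1, 2.0, 1.2]])

sim = pearson_similarity(visit1, visit2)
print(sim.round(3))
```

Row `i`, column `j` of `sim` is the correlation between participant `i` at the first visit and participant `j` at the second visit, so the diagonal holds the within-individual correlations.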

And that's it! We're already ready to compute the measures.

```{admonition} Fix: Possible errors
:class: warning

The script can raise two possible errors:
* In the case where the participants in the first dataframe (i.e., the index) are different from the participants in the second dataframe
* In the case where the columns of the first dataframe are different from the columns of the second dataframe

In both cases, this really depends on the data given to `sihnpy`, so you have to be careful. If you run into one of these errors, you need to go back and reformat the data before the first step.
```
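Both conditions can be checked by hand before fingerprinting. The sketch below uses plain `pandas` on made-up data; `same_participants_and_columns` is a hypothetical helper, not part of `sihnpy`.

```python
import pandas as pd

def same_participants_and_columns(data1, data2, prefix):
    """Hypothetical check: do both visits share participants and columns?"""
    cols1 = [c for c in data1.columns if c.startswith(prefix)]
    cols2 = [c for c in data2.columns if c.startswith(prefix)]
    # Index.equals is True only if both indexes hold the same IDs in the same order.
    return data1.index.equals(data2.index), cols1 == cols2

# Made-up baseline and follow-up tables.
ids = pd.Index(["sub-01", "sub-03"], name="participant_id")
toy_bl = pd.DataFrame({"session": ["ses-BL00"] * 2, "ctx_a": [1.0, 2.0]}, index=ids)
toy_fu = pd.DataFrame({"session": ["ses-FU12"] * 2, "ctx_a": [1.1, 2.1]}, index=ids)

same_index, same_columns = same_participants_and_columns(toy_bl, toy_fu, "ctx")
print(same_index, same_columns)
```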

### 3. Computing fingerprinting metrics

The last step is to compute the fingerprint metrics (**accuracy**, **self-identifiability**, **others-identifiability** and **differential identifiability**). This is also quite simple, as you just need to provide a dataframe (`sihnpy` will grab its index), the similarity matrix we computed in the previous step, and a name for the analysis.

```python
fp_metrics = fp.tab_metrics_calc(data=data_bl, similar_matrix=similarity_matrix, name='tutorial')
fp_metrics
```

| participant_id | si_tutorial | oi_tutorial | fia_tutorial | di_tutorial |
|---|---|---|---|---|
| sub-1004359 | 0.999399 | 0.973846 | 1.0 | 0.025554 |
| sub-1016072 | 0.999614 | 0.977725 | 1.0 | 0.021889 |
| sub-1072774 | 0.999079 | 0.980443 | 1.0 | 0.018636 |
| sub-1076159 | 0.997419 | 0.980987 | 1.0 | 0.016433 |
| sub-1154932 | 0.999634 | 0.977076 | 1.0 | 0.022558 |
| ... | ... | ... | ... | ... |
| sub-9865768 | 0.999506 | 0.977502 | 1.0 | 0.022003 |
| sub-9889544 | 0.999228 | 0.978876 | 1.0 | 0.020352 |
| sub-9909448 | 0.998626 | 0.971501 | 1.0 | 0.027125 |
| sub-9931234 | 0.999687 | 0.974556 | 1.0 | 0.025131 |


```python
fp_metrics.iloc[233, 3]
```

```text
0.018491580529617635
```

In order, the script outputs the **self-identifiability** (`si`), the **others-identifiability** (`oi`), the **fingerprint identification accuracy** (`fia`) and the **differential identifiability** (`di`). 

`si` and `oi` are Pearson correlation coefficients, so they can in principle range from -1 to 1 (in the data above, they are all close to 1). `oi` is the average of all between-individual correlations for a specific individual (i.e., on average, how similar a participant is to the rest of the cohort). `di` is a difference rather than a correlation: in the output above, it equals `si` minus `oi`. `fia` will be either 0 or 1, where 1 represents an accurate identification of the participant.
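As an illustration, the four metrics can be derived from a similarity matrix with a few lines of `numpy`. This is a sketch under assumptions, not `sihnpy`'s code: I take `di` as `si` minus `oi` (which matches the table above), and I assume a participant counts as identified when the diagonal value is the largest value of their row. The toy matrix is made up.

```python
import numpy as np

def metrics_from_similarity(sim):
    """Hypothetical sketch of the per-participant fingerprint metrics."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    si = np.diag(sim)                      # within-individual correlation
    oi = (sim.sum(axis=1) - si) / (n - 1)  # mean of the off-diagonal row values
    di = si - oi                           # assumed definition: self minus others
    # Assumed rule: identified if the row's maximum sits on the diagonal.
    fia = (sim.argmax(axis=1) == np.arange(n)).astype(float)
    return si, oi, di, fia

# Made-up 3 x 3 similarity matrix; the second participant is misidentified.
toy_sim = np.array([[0.99, 0.90, 0.80],
                    [0.90, 0.95, 0.97],
                    [0.80, 0.97, 0.98]])

si, oi, di, fia = metrics_from_similarity(toy_sim)
accuracy = fia.mean() * 100
print(si, oi, di, fia, round(accuracy, 2))
```

In this toy example, two out of three participants are identified, so the accuracy is about 66.67%.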

The `name` argument in the function lets the user add a name at the end of each column (here, `tutorial`). This is particularly useful if you run multiple fingerprinting analyses, so you can distinguish them.

Let's see what we got in our sample!

```python
print(f"Total fingerprinting accuracy is: {round((fp_metrics['fia_tutorial'].sum() / len(fp_metrics)) * 100, 2)}%")
```

```text
Total fingerprinting accuracy is: 98.29%
```


Wow we did pretty well!

### 4. Exporting the results

We're already at the end! Time flies by in good company (I hope).

The last step is simply to export the data. This is done by calling the following function:

```python
fp.tab_export("/path/to/output", data1=data_bl, data2=data_fu, similar_matrix=similarity_matrix, fp_metrics=fp_metrics, name='test')
```

It requires: 1) the path to where the files should be stored, 2) the first dataset (with the first visit), 3) the second dataset (with the second visit), 4) the similarity matrix, 5) the fingerprint metrics and 6) a name (here, `test`). Each of the data elements is written to a file at the location specified by the user.

## Conclusion

You made it through the whole fingerprinting tutorial! (or well, you skipped ahead to here) I hope I was able to make the steps clear for you and that you enjoyed following along. If things weren't clear in the documentation, please [submit an issue on Github](https://github.com/stong3/sihnpy/issues).

Don't forget to {ref}`cite the package and the paper <index:Authors>` describing this method if you end up using it in one of your papers!

## tl;dr

Got bored during the tutorial? You already finished the tutorial and just want a quick reminder of the main functions you need? Or you just want to bash ahead with the code without reading? I got you. Here's a condensed form of the code:

```python
from sihnpy import datasets
from sihnpy import fingerprinting as fp

volume_data = datasets.pad_fptab_input()[0]

data_bl, data_fu = fp.import_fingerprint_data(data=volume_data, var='session')

similarity_matrix = fp.fingerprint_tabs(data1=data_bl, data2=data_fu, pref='ctx')

fp_metrics = fp.tab_metrics_calc(data=data_bl, similar_matrix=similarity_matrix, name='test')

fp.tab_export("/path/to/output", data1=data_bl, data2=data_fu, similar_matrix=similarity_matrix, fp_metrics=fp_metrics, name='test')
```

## References

You can find references for this topic in the main introduction on fingerprinting {ref}`here <1.fingerprinting/fp_intro:References>`.