# Using PREVENT-AD data from `sihnpy`

**Have you ever come across a really interesting package, only to realize you don't have any data in the right format on hand to test it? Or have you tried the package but couldn't figure out whether the results you were getting were right?**

**So have I**. Inspired by the wonderful [`nilearn` package](https://nilearn.github.io), I decided to make data availability an integral part of `sihnpy`. The philosophy is that every module within `sihnpy` should come with readily available data, formatted in a way that lets users test the module quickly before moving on to their own data. "Practice data", if you will. More types of data will become available as the package continues to develop.

Data currently included in `sihnpy` include {ref}`functional connectivity matrices <0.pad_data/datasets_usage:Functional connectivity data>`, {ref}`simulated positron emission tomography (PET) data <0.pad_data/datasets_usage:Simulated data>`, {ref}`simulated age data <0.pad_data/datasets_usage:Simulated data>` and {ref}`structural MRI (volume and thickness) data <0.pad_data/datasets_usage:Structural magnetic resonance imaging data>`. Full details on preprocessing can be found in their respective sections below.

If you want to download more data or want to use the Prevent-AD data for research purposes, head on over to the section where I explain in detail {ref}`how to get your hands on this great dataset <0.pad_data/download_pad_data:Downloading PREVENT-AD data>`.

## Using the datasets module in sihnpy

For each `sihnpy` module, there is a corresponding `dataset` composed of Prevent-AD data, unless specified otherwise. The documentation on each module starts with how to use the `dataset` for that module. Its usage is very simple:
1. Import the module
2. Import the data

That's it! It's that simple.

Below is an example using the import function to test the fingerprinting module.

```python
from sihnpy.datasets import pad_fp_input

id_list, path_participant_list, path_data = pad_fp_input()
```

The output of the function always has three parts: the basic demographics of the participants, the path to the participant IDs file, and whatever the targeted module needs to run its functions, often packaged as a single Python dictionary or a single file.

The first two outputs of a `sihnpy.datasets` function are always the same: the basic demographics and the path to the file containing said demographics.

```python
id_list
```

| Unnamed: 0 | participant_id | sex | test_language | handedness_score | handedness_interpretation |
|---|---|---|---|---|---|
| 0 | sub-1000173 | Male | French | 100 | Right-handed |
| 1 | sub-1002928 | Female | French | 100 | Right-handed |
| 2 | sub-1004359 | Female | French | 90 | Right-handed |
| 3 | sub-1016072 | Female | French | -100 | Left-handed |
| 4 | sub-1031654 | Male | French | 100 | Right-handed |
| 5 | sub-1072774 | Female | French | 100 | Right-handed |
| 6 | sub-1076159 | Female | French | 100 | Right-handed |
| 7 | sub-1121981 | Female | French | 100 | Right-handed |
| 8 | sub-1154932 | Male | French | 30 | Ambidextrous |
| 9 | sub-1176949 | Female | French | 80 | Right-handed |


In `id_list`, we see the IDs of the participants as well as their basic demographics. Note that `id_list` is a `pandas.DataFrame` object, with `participant_id` as the index column. Therefore, any Pandas methods you can think of will work on `id_list`.
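As a quick illustration of what this means in practice, here is a minimal sketch of a few common Pandas operations. To stay self-contained, it rebuilds the ten rows shown above by hand rather than calling `pad_fp_input()`; with `sihnpy` installed, you would use the returned `id_list` directly. The names `right_handed`, `sex_counts` and `score` are only for illustration.

```python
import pandas as pd

# Hand-copied from the output shown above (assumption: with sihnpy installed,
# `id_list` would come from pad_fp_input() instead of being typed out).
id_list = pd.DataFrame(
    {
        "participant_id": [
            "sub-1000173", "sub-1002928", "sub-1004359", "sub-1016072",
            "sub-1031654", "sub-1072774", "sub-1076159", "sub-1121981",
            "sub-1154932", "sub-1176949",
        ],
        "sex": [
            "Male", "Female", "Female", "Female", "Male",
            "Female", "Female", "Female", "Male", "Female",
        ],
        "test_language": ["French"] * 10,
        "handedness_score": [100, 100, 90, -100, 100, 100, 100, 100, 30, 80],
        "handedness_interpretation": [
            "Right-handed", "Right-handed", "Right-handed", "Left-handed",
            "Right-handed", "Right-handed", "Right-handed", "Right-handed",
            "Ambidextrous", "Right-handed",
        ],
    }
).set_index("participant_id")

# Boolean filtering: keep right-handed participants only
right_handed = id_list[id_list["handedness_interpretation"] == "Right-handed"]

# Counting how many participants fall in each category
sex_counts = id_list["sex"].value_counts()

# Label-based lookup, possible because participant_id is the index
score = id_list.loc["sub-1016072", "handedness_score"]

print(len(right_handed), sex_counts.to_dict(), score)
```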

Most `sihnpy` modules have their own import function. This ensures that whatever is fed to the module can be checked properly right from the start. The path returned by the function points to a location on your own computer, which depends on how `sihnpy` was installed.

```python
path_participant_list
```

'/Users/fredericst-onge/Desktop/sihnpy/src/sihnpy/data/pad_conp_minimal/participants.tsv'
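If you ever want to re-read the participant file yourself, the path can be handed to `pandas`. The sketch below is an assumption about how that could look, not a description of how `sihnpy` reads the file internally. So that it runs without `sihnpy`, it first writes a tiny stand-in `participants.tsv` (with made-up contents limited to three columns) to a temporary directory, then loads it as a tab-separated file with `participant_id` as the index.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the path returned by a sihnpy.datasets function (hypothetical file)
    path_participant_list = Path(tmp_dir) / "participants.tsv"
    path_participant_list.write_text(
        "participant_id\tsex\ttest_language\n"
        "sub-1000173\tMale\tFrench\n"
        "sub-1002928\tFemale\tFrench\n"
    )

    # .tsv files are tab-separated, hence sep="\t"
    participants = pd.read_csv(
        path_participant_list, sep="\t", index_col="participant_id"
    )

print(participants)
```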

As I mention elsewhere, please be sure to take a look at the [license](../license.md) before using the Prevent-AD data to practice the modules.

## Additional information on brain imaging preprocessing

Some of the modules included in `sihnpy` come from my first PhD project, which focused on functional magnetic resonance imaging. So, unsurprisingly, fMRI data is available through `sihnpy`. That said, many of the functions can be applied to other neuroimaging modalities. 

### Functional connectivity data

A small subset of PREVENT-AD participants (15) who had at least 1 anatomical MRI image and 1 functional MRI image have functional connectivity data available through `sihnpy`. Currently, the basic demographic data and the functional connectivity matrices of participants at baseline and at the 12-month follow-up are available. Multiple sessions are available for some participants, as some completed tasks in the scanner (specifically a resting-state task, a memory encoding task and a memory retrieval task). Note that not all participants have all fMRI modalities and/or follow-up timepoints available, due to attrition or to changes in the protocol over the years. This data is currently used for the {ref}`fingerprinting module <1.fingerprinting/fingerprinting_module:1. Preparing the data>`.

Below, I detail the procedure I used to pre-process the data (though to be fair these amazing pieces of software did a lot of the heavy lifting for me in terms of describing the steps). [`fMRIPrep` v20.2.0](https://fmriprep.org/en/20.2.0/) (which is already kind of an old version at the time of writing this) was used to pre-process the data. Then, [`nilearn` 0.9.2](https://nilearn.github.io/stable/index.html) was used to compute and derive the functional connectivity.

#### `fMRIPrep`
*The text below is copied almost verbatim from fMRIPrep's boilerplate post-preprocessing citations.*

Results included in this manuscript come from preprocessing performed using fMRIPrep 20.2.0 (Esteban, Markiewicz, et al. (2018); Esteban, Blair, et al. (2018); RRID:SCR_016216), which is based on Nipype 1.5.1 (Gorgolewski et al. (2011); Gorgolewski et al. (2018); RRID:SCR_002502).

**Anatomical data preprocessing**
All available T1-weighted (T1w) images for each participant across visits were used. They were corrected for intensity non-uniformity (INU) with N4BiasFieldCorrection (Tustison et al. 2010), distributed with ANTs 2.3.3 (Avants et al. 2008, RRID:SCR_004757). The T1w-reference was then skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as target template. Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and gray-matter (GM) was performed on the brain-extracted T1w using fast (FSL 5.0.9, RRID:SCR_002823, Zhang, Brady, and Smith 2001). A T1w-reference map was computed after registration of all T1w images (after INU-correction) using mri_robust_template (FreeSurfer 6.0.1, Reuter, Rosas, and Fischl 2010). Brain surfaces were reconstructed using recon-all (FreeSurfer 6.0.1, RRID:SCR_001847, Dale, Fischl, and Sereno 1999), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray-matter of Mindboggle (RRID:SCR_002438, Klein et al. 2017). Volume-based spatial normalization to one standard space (MNI152NLin2009cAsym) was performed through nonlinear registration with antsRegistration (ANTs 2.3.3), using brain-extracted versions of both T1w reference and the T1w template. The following template was selected for spatial normalization: ICBM 152 Nonlinear Asymmetrical template version 2009c [Fonov et al. (2009), RRID:SCR_008796; TemplateFlow ID: MNI152NLin2009cAsym]. Note that while the Prevent-AD Open BIDS dataset does contain other brain imaging modalities that can be leveraged by fMRIPrep (e.g., FLAIR), these are not consistently available across participants. As such, preprocessing was restricted to T1w and EPI images only.

**Functional data preprocessing**
For each of the 12 BOLD runs found per subject (across all tasks and sessions), the following preprocessing was performed. First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep. A B0-nonuniformity map (or fieldmap) was estimated based on a phase-difference map calculated with a dual-echo GRE (gradient-recall echo) sequence, processed with a custom workflow of SDCFlows inspired by the epidewarp.fsl script and further improvements in HCP Pipelines (Glasser et al. 2013). The fieldmap was then co-registered to the target EPI (echo-planar imaging) reference run and converted to a displacements field map (amenable to registration tools such as ANTs) with FSL’s fugue and other SDCflows tools. Based on the estimated susceptibility distortion, a corrected EPI (echo-planar imaging) reference was calculated for a more accurate co-registration with the anatomical reference. The BOLD reference was then co-registered to the T1w reference using bbregister (FreeSurfer) which implements boundary-based registration (Greve and Fischl 2009). Co-registration was configured with six degrees of freedom. Head-motion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) are estimated before any spatiotemporal filtering using mcflirt (FSL 5.0.9, Jenkinson et al. 2002). BOLD runs were slice-time corrected using 3dTshift from AFNI 20160207 (Cox and Hyde 1997, RRID:SCR_005927). The BOLD time-series (including slice-timing correction when applied) were resampled onto their original, native space by applying a single, composite transform to correct for head-motion and susceptibility distortions. These resampled BOLD time-series will be referred to as preprocessed BOLD in original space, or just preprocessed BOLD. The BOLD time-series were resampled into standard space, generating a preprocessed BOLD run in MNI152NLin2009cAsym space. 
Several confounding time-series were calculated based on the preprocessed BOLD: framewise displacement (FD), DVARS and three region-wise global signals. FD was computed using two formulations following Power (absolute sum of relative motions, Power et al. (2014)) and Jenkinson (relative root mean square displacement between affines, Jenkinson et al. (2002)). FD and DVARS are calculated for each functional run, both using their implementations in Nipype (following the definitions by Power et al. 2014). The three global signals are extracted within the CSF, the WM, and the whole-brain masks. Additionally, a set of physiological regressors were extracted to allow for component-based noise correction (CompCor, Behzadi et al. 2007). Principal components are estimated after high-pass filtering the preprocessed BOLD time-series (using a discrete cosine filter with 128s cut-off) for the two CompCor variants: temporal (tCompCor) and anatomical (aCompCor). tCompCor components are then calculated from the top 2% variable voxels within the brain mask. For aCompCor, three probabilistic masks (CSF, WM and combined CSF+WM) are generated in anatomical space. The implementation differs from that of Behzadi et al. in that instead of eroding the masks by 2 pixels on BOLD space, the aCompCor masks are subtracted a mask of pixels that likely contain a volume fraction of GM. This mask is obtained by dilating a GM mask extracted from the FreeSurfer’s aseg segmentation, and it ensures components are not extracted from voxels containing a minimal fraction of GM. Finally, these masks are resampled into BOLD space and binarized by thresholding at 0.99 (as in the original implementation). Components are also calculated separately within the WM and CSF masks. 
For each CompCor decomposition, the k components with the largest singular values are retained, such that the retained components’ time series are sufficient to explain 50 percent of variance across the nuisance mask (CSF, WM, combined, or temporal). The remaining components are dropped from consideration. The head-motion estimates calculated in the correction step were also placed within the corresponding confounds file. The confound time series derived from head motion estimates and global signals were expanded with the inclusion of temporal derivatives and quadratic terms for each (Satterthwaite et al. 2013). Frames that exceeded a threshold of 0.5 mm FD or 1.5 standardised DVARS were annotated as motion outliers. All resamplings can be performed with a single interpolation step by composing all the pertinent transformations (i.e. head-motion transform matrices, susceptibility distortion correction when available, and co-registrations to anatomical and output spaces). Gridded (volumetric) resamplings were performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels (Lanczos 1964). Non-gridded (surface) resamplings were performed using mri_vol2surf (FreeSurfer).
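To make the motion-outlier rule above more concrete, here is a minimal NumPy sketch of the Power formulation of FD and of the 0.5 mm FD / 1.5 standardised DVARS thresholds. This is not fMRIPrep's or Nipype's actual code. The function names, the column order of the motion parameters (three translations in mm, then three rotations in radians) and the choice to set the first frame's FD to 0 are all assumptions made for illustration. The 50 mm radius used to convert rotations to millimetres follows the convention of Power et al.

```python
import numpy as np


def framewise_displacement_power(motion, radius_mm=50.0):
    """Sum of absolute frame-to-frame changes in the six motion parameters.

    motion: (n_frames, 6) array; columns 0-2 are translations (mm) and
    columns 3-5 are rotations (radians). This column order is an assumption.
    """
    motion = np.asarray(motion, dtype=float)
    diffs = np.abs(np.diff(motion, axis=0))
    # Convert rotations to arc length (mm) on a sphere of the given radius
    diffs[:, 3:] *= radius_mm
    fd = diffs.sum(axis=1)
    # The first frame has no predecessor; set to 0 here (illustrative choice)
    return np.concatenate([[0.0], fd])


def flag_motion_outliers(fd, std_dvars, fd_thresh=0.5, dvars_thresh=1.5):
    """Flag frames exceeding either the FD or the standardised DVARS threshold."""
    return (np.asarray(fd) > fd_thresh) | (np.asarray(std_dvars) > dvars_thresh)


# Toy example: 4 frames with made-up motion and DVARS values
motion = np.zeros((4, 6))
motion[2:, 0] = 0.6    # 0.6 mm shift in x between frames 1 and 2
motion[3, 3] = 0.002   # 0.002 rad rotation, i.e. 0.1 mm on a 50 mm sphere
std_dvars = np.array([1.0, 1.0, 1.0, 2.0])

fd = framewise_displacement_power(motion)
outliers = flag_motion_outliers(fd, std_dvars)
print(fd, outliers)
```

In this toy example, frame 2 is flagged because of its FD and frame 3 because of its standardised DVARS.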

Many internal operations of fMRIPrep use Nilearn 0.6.2 (Abraham et al. 2014, RRID:SCR_001362), mostly within the functional processing workflow. For more details of the pipeline, see the section corresponding to workflows in fMRIPrep’s documentation.
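The CompCor retention rule described above (keep the k leading components that together explain 50 percent of the variance) can be sketched with a plain singular value decomposition. This is only an illustration of the rule under simple assumptions (column-wise mean-centering, no filtering or masking), not fMRIPrep's implementation; the function name `n_components_for_variance` and the toy data are hypothetical.

```python
import numpy as np


def n_components_for_variance(data, variance_threshold=0.5):
    """Smallest k such that the k leading components reach the variance threshold.

    data: (n_timepoints, n_voxels) array of time series within a nuisance mask.
    """
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Variance explained by each component is proportional to its squared singular value
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative variance reaches the threshold, as a count
    return int(np.searchsorted(cumulative, variance_threshold) + 1)


# Toy data: three orthogonal, zero-mean columns with squared norms 36, 16 and 4,
# so the components explain roughly 64%, 29% and 7% of the variance.
data = np.column_stack(
    [
        3.0 * np.array([1, -1, 1, -1]),
        2.0 * np.array([1, 1, -1, -1]),
        1.0 * np.array([1, -1, -1, 1]),
    ]
)

k_half = n_components_for_variance(data, variance_threshold=0.5)
k_ninety = n_components_for_variance(data, variance_threshold=0.9)
print(k_half, k_ninety)
```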

#### Nilearn

Once preprocessed by fMRIPrep, confounds were removed from the images and frames with excessive motion were scrubbed using Nilearn. Timeseries were extracted in 400 brain parcels from the Schaefer atlas (Schaefer et al. 2018) and the timeseries in each region was correlated with every other region using partial correlations to generate the functional connectivity matrices. This process yielded 400x400 matrices representing the functional links between each brain region of the atlas.
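For readers unfamiliar with partial correlations, here is a minimal NumPy sketch of the idea on random data. It is not the script used to build the matrices shipped with `sihnpy`: the function name is hypothetical, the time series are simulated, and only 10 regions are used instead of 400 to keep it light. Note that inverting a plain empirical covariance, as done here, requires more timepoints than regions. Nilearn offers `ConnectivityMeasure(kind="partial correlation")` for this kind of computation, though which exact Nilearn options were used here is not stated above.

```python
import numpy as np


def partial_correlation_matrix(timeseries):
    """Partial correlations between regions from the inverse covariance.

    timeseries: (n_timepoints, n_regions) array.
    """
    covariance = np.cov(timeseries, rowvar=False)
    precision = np.linalg.inv(covariance)
    scale = np.sqrt(np.diag(precision))
    # Partial correlation between i and j: -P_ij / sqrt(P_ii * P_jj)
    partial = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return partial


# Simulated stand-in for parcel-wise time series (10 regions instead of 400)
rng = np.random.default_rng(0)
n_timepoints, n_regions = 200, 10
timeseries = rng.standard_normal((n_timepoints, n_regions))

fc_matrix = partial_correlation_matrix(timeseries)
print(fc_matrix.shape)
```

With the 400-parcel atlas, the same procedure would yield a 400x400 matrix per scan.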

#### Scripts

While it is not the goal of `sihnpy` to focus on preprocessing the actual imaging data, I will be adding the scripts I used in case they are useful to others. I also describe {ref}`in more detail how to download the Prevent-AD data <0.pad_data/download_pad_data:Downloading Prevent-AD data>`.

### Simulated data

Currently, the PREVENT-AD Open Dataset doesn't include all the variables collected in the cohort. For instance, participants undergo tau positron emission tomography (tau-PET) scans, which can be used with the **spatial extent** module, but these scans aren't yet offered in the dataset. Furthermore, age is considered a restricted variable, only available with PREVENT-AD Registered access, as it can be identifying.

As such, we simulated tau-PET data to mimic behaviors observed in other cohorts and we simulated age data to mirror the inclusion/exclusion criteria of the participants. More information on how this was done is available {ref}`in the spatial extent Additional topics section <2.spex/spex_details:Creating Gaussian simulated data>`.

### Structural magnetic resonance imaging data

Longitudinal structural MRI data is available for most of the participants within the PREVENT-AD Open Dataset. In our case, we preprocessed the baseline and 12-month follow-up structural MRI of all 308 participants of our cohort. This yielded 543 structural MRI scans available for preprocessing. Two scans failed preprocessing, yielding a final number of 541 scans. Preprocessing was done using FreeSurfer 7.1.0, described below.

#### FreeSurfer
*The text below is copied almost verbatim from FreeSurfer's boilerplate methods citation.*

Cortical reconstruction and volumetric segmentation was performed with the Freesurfer image analysis suite, which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu/). The technical details of these procedures are described in prior publications (Dale et al., 1999; Dale and Sereno, 1993; Fischl and Dale, 2000; Fischl et al., 2001; Fischl et al., 2002; Fischl et al., 2004a; Fischl et al., 1999a; Fischl et al., 1999b; Fischl et al., 2004b; Han et al., 2006; Jovicich et al., 2006; Segonne et al., 2004, Reuter et al. 2010, Reuter et al. 2012). Briefly, this processing includes motion correction and averaging (Reuter et al. 2010) of multiple volumetric T1 weighted images (when more than one is available), removal of non-brain tissue using a hybrid watershed/surface deformation procedure (Segonne et al., 2004), automated Talairach transformation, segmentation of the subcortical white matter and deep gray matter volumetric structures (including hippocampus, amygdala, caudate, putamen, ventricles) (Fischl et al., 2002; Fischl et al., 2004a) intensity normalization (Sled et al., 1998), tessellation of the gray matter white matter boundary, automated topology correction (Fischl et al., 2001; Segonne et al., 2007), and surface deformation following intensity gradients to optimally place the gray/white and gray/cerebrospinal fluid borders at the location where the greatest shift in intensity defines the transition to the other tissue class (Dale et al., 1999; Dale and Sereno, 1993; Fischl and Dale, 2000). 
Once the cortical models are complete, a number of deformable procedures can be performed for further data processing and analysis including surface inflation (Fischl et al., 1999a), registration to a spherical atlas which is based on individual cortical folding patterns to match cortical geometry across subjects (Fischl et al., 1999b), parcellation of the cerebral cortex into units with respect to gyral and sulcal structure (Desikan et al., 2006; Fischl et al., 2004b), and creation of a variety of surface based data including maps of curvature and sulcal depth. This method uses both intensity and continuity information from the entire three dimensional MR volume in segmentation and deformation procedures to produce representations of cortical thickness, calculated as the closest distance from the gray/white boundary to the gray/CSF boundary at each vertex on the tessellated surface (Fischl and Dale, 2000). The maps are created using spatial intensity gradients across tissue classes and are therefore not simply reliant on absolute signal intensity. The maps produced are not restricted to the voxel resolution of the original data thus are capable of detecting submillimeter differences between groups. Procedures for the measurement of cortical thickness have been validated against histological analysis (Rosas et al., 2002) and manual measurements (Kuperberg et al., 2003; Salat et al., 2004). Freesurfer morphometric procedures have been demonstrated to show good test-retest reliability across scanner manufacturers and across field strengths (Han et al., 2006; Reuter et al., 2012). Note that while longitudinal data was available and used in this package, each session was processed individually.

**Aseg Atlas Information**
The aseg atlas is built from 40 subjects acquired using the same mp-rage sequence (by people at Wash U ages ago in collaboration with Randy Buckner). The subjects that make up the atlas are distributed in 4 groups of 10 subjects each: (1) young, (2) middle aged, (3) healthy older adults, (4) older adults with AD.

Following preprocessing, volume and thickness were extracted from the 68 parcels (34 per hemisphere) of the Desikan atlas (Desikan et al., 2006). Volume in the Aseg atlas was also extracted. Volume and thickness in the Desikan atlas, as well as volume in the Aseg atlas, are shipped with `sihnpy`.