This notebook goes through a simple binary classification example, explaining library functionality along the way.

In [1]:
import ABCD_ML

In [2]:
#Define directory with the 2.0_NDA_Data
nda_dr = '/mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/'

#This file stores the name mapping
test_mapping_loc = nda_dr + 'ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv'

#We will use just the sMRI data as the neuroimaging data
test_data_loc1 = nda_dr + 'MRI/ABCD sMRI Part 1.csv'
test_data_loc2 = nda_dr + 'MRI/ABCD sMRI Part 2.csv'

#We will load target data (and covariate data) from here
test_target_loc = nda_dr + 'Mental Health/ABCD Parent Demographics Survey.csv'

#We will load stratification data from here
test_strat_loc = nda_dr + 'Other Non-Imaging/ABCD ACS Post Stratification Weights.csv'

We first need to define the class object, which we will use to load data and to train/test different ML models.
There are a few global parameters that we can optionally set when defining this object, so let's look at what they are.

In [3]:
help(ABCD_ML.ABCD_ML.__init__)

Help on function __init__ in module ABCD_ML.ABCD_ML:

__init__(self, eventname='baseline_year_1_arm_1', use_default_subject_ids=True, default_na_values=['777', '999'], n_jobs=1, original_targets_key='targets', low_memory_mode=False, verbose=True)
    Main class init
    
    Parameters
    ----------
    eventname : str or None, optional
        Optional value to provide, specifying to keep certain rows
        when reading data based on the eventname flag.
        As ABCD is a longitudinal study, this flag lets you select only
        one specific time point, or if set to None, will load everything.
        (default = baseline_year_1_arm_1)
    
    use_default_subject_ids : bool, optional
        Flag to determine the usage of 'default' subject id behavior.
        If set to True, this will convert input NDAR subject ids
        into upper case, with prepended NDAR_ - type format.
        If set to False, then all input subject names must be entered
        explicitly the same, no pr

Most of the default parameters are okay for this simple example, but any of them can be changed. Let's change n_jobs to 4 instead of 1.

In [4]:
ML = ABCD_ML.ABCD_ML(n_jobs = 4)

ABCD_ML object initialized
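
To make the `eventname` and `use_default_subject_ids` descriptions in the help text above more concrete, here is a minimal pandas sketch of the behavior they describe. It is not the library's own code: the helpers `normalize_subject_id` and `select_event`, the toy data, and the follow-up event label are all hypothetical and used only for illustration. The column names `src_subject_id` and `eventname` are assumed from the `load_data` help text.

```python
import pandas as pd


def normalize_subject_id(subject):
    """Hypothetical helper: upper-case an id and prepend 'NDAR_' if it is missing."""
    subject = str(subject).strip().upper()
    if not subject.startswith("NDAR_"):
        subject = "NDAR_" + subject
    return subject


def select_event(df, eventname, event_col="eventname"):
    """Keep only the rows for one time point; eventname=None keeps every row."""
    if eventname is None:
        return df
    return df[df[event_col] == eventname]


# Toy data: two subjects, one of them measured at two time points.
toy = pd.DataFrame({
    "src_subject_id": ["invabc123", "NDAR_INVXYZ789", "invabc123"],
    "eventname": ["baseline_year_1_arm_1",
                  "baseline_year_1_arm_1",
                  "1_year_follow_up_y_arm_1"],  # illustrative follow-up label
    "value": [1.0, 2.0, 3.0],
})

toy["src_subject_id"] = toy["src_subject_id"].map(normalize_subject_id)
baseline = select_event(toy, "baseline_year_1_arm_1")
print(baseline)
```

With the default `eventname`, only the two baseline rows remain, and both subject ids end up in the same upper-case `NDAR_` format.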


Next, we can optionally load a name map: a dictionary used to rename any loaded column whose name appears as a key in the dictionary. This is useful for ABCD data because the default column names may not be very informative.

In [5]:
ML.load_name_map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")

Loaded map file


We can look at exactly what is in this dictionary if we want to confirm that it loaded correctly.
It is stored as name_map on the ABCD_ML object.

In [6]:
some_examples = {k: ML.name_map[k] for k in list(ML.name_map)[300:320]}
some_examples

{'ddtidp_674': 'dmri_dtifagm_cortdestrieux_ssuborbitallh',
 'ddtidp_748': 'dmri_dtifagm_cortdestrieux_ssuborbitalrh',
 'ddtidp_675': 'dmri_dtifagm_cortdestrieux_ssubparietallh',
 'ddtidp_749': 'dmri_dtifagm_cortdestrieux_ssubparietalrh',
 'ddtidp_676': 'dmri_dtifagm_cortdestrieux_stemporalinflh',
 'ddtidp_750': 'dmri_dtifagm_cortdestrieux_stemporalinfrh',
 'ddtidp_677': 'dmri_dtifagm_cortdestrieux_stemporalsuplh',
 'ddtidp_751': 'dmri_dtifagm_cortdestrieux_stemporalsuprh',
 'ddtidp_678': 'dmri_dtifagm_cortdestrieux_stemporaltransverselh',
 'ddtidp_752': 'dmri_dtifagm_cortdestrieux_stemporaltransverserh',
 'dmri_dtifagwc_cdsn_bslh': 'dmri_dtifagwc_cortdesikan_banksstslh',
 'dmri_dtifagwc_cdsn_bsrh': 'dmri_dtifagwc_cortdesikan_banksstsrh',
 'dmri_dtifagwc_cdsn_cdacatelh': 'dmri_dtifagwc_cortdesikan_caudalanteriorcingulatelh',
 'dmri_dtifagwc_cdsn_cdacaterh': 'dmri_dtifagwc_cortdesikan_caudalanteriorcingulaterh',
 'dmri_dtifagwc_cdsn_cdmflh': 'dmri_dtifagwc_cortdesikan_caudalmiddlefrontal
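
As a rough illustration of how such a name map can be applied, the sketch below renames the columns of a toy DataFrame with pandas. The two mapping entries are copied from the output above. The DataFrame, its values, and the `unmapped_col` column are made up, and this is only an assumption about the general idea, not the library's internal code.

```python
import pandas as pd

# Two entries copied from the name map shown above.
name_map = {
    "ddtidp_674": "dmri_dtifagm_cortdestrieux_ssuborbitallh",
    "ddtidp_748": "dmri_dtifagm_cortdestrieux_ssuborbitalrh",
}

# Toy data: two columns that appear in the map and one that does not.
toy = pd.DataFrame({
    "ddtidp_674": [0.41],
    "ddtidp_748": [0.43],
    "unmapped_col": [1],
})

# Columns whose names are keys in the dictionary are renamed;
# all other columns are left untouched.
renamed = toy.rename(columns=name_map)
print(list(renamed.columns))
```

Columns that are not keys in the dictionary keep their original names, which matches the "rename only if the column name is a key" behavior described earlier.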

Next, we will load in the actual data. As before, we can check which parameters this function accepts.

In [7]:
help(ML.load_data)

Help on method load_data in module ABCD_ML._Data:

load_data(loc, dataset_type, drop_keys=[], filter_outlier_percent=None, winsorize_val=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Load a ABCD2p0NDA (default) or 2.0_ABCD_Data_Explorer (explorer)
    release formatted neuroimaging dataset - of derived ROI level info.
    
    Parameters
    ----------
    loc : str, Path or None
        The location of the csv file to load data load from.
    
    dataset_type : {'default', 'explorer', 'custom'}, optional
        The type of dataset to load from. Where,
    
        - 'default' : ABCD2p0NDA style, (.txt and tab seperated)
            The 4 columns before 'src_subject_id' and the 4 after,
            (typically the default columns, and therefore not neuroimaging
            data - also not including the eventname column), will be dropped.
    
        - 'explorer' : 2.0_ABCD_Data_Explorer tyle (.csv and comma seperated)
            The first 2 columns before 'src_subject_id'
   

In [8]:
ML.load_data(loc=test_data_loc1,
             dataset_type='explorer',
             filter_outlier_percent=.005, # Filter out 0.5% from each side of the distribution
             winsorize_val=.01)           # Then winsorize 1%

Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 1.csv assumed to be dataset type: explorer
dropped ['abcd_smrip101_id', 'dataset_id', 'smri_visitid'] columns by default due to dataset type
Dropped 0 columns, per drop_keys argument
Dropped 522 rows for missing values
Dropped rows with missing data
Filtered data for outliers with value:  0.005
Winsorized data with value:  0.01
loaded shape:  (2099, 749)

Total valid (overlapping subjects / not in exclusions)         subjects = 2099
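
For intuition about the two preprocessing steps reported in the log above, here is a minimal sketch of one way percentile-based outlier filtering followed by winsorizing could look in pandas. The functions `filter_outliers` and `winsorize`, the random toy data, and the choice to apply both steps per column and on both tails are assumptions made for illustration. ABCD_ML's actual implementation may differ.

```python
import numpy as np
import pandas as pd


def filter_outliers(df, percent):
    """Drop rows where any column falls outside its [percent, 1 - percent] quantiles."""
    lower = df.quantile(percent)
    upper = df.quantile(1 - percent)
    keep = ((df >= lower) & (df <= upper)).all(axis=1)
    return df[keep]


def winsorize(df, percent):
    """Clip each column to its [percent, 1 - percent] quantiles instead of dropping rows."""
    lower = df.quantile(percent)
    upper = df.quantile(1 - percent)
    return df.clip(lower=lower, upper=upper, axis=1)


# Toy stand-in for ROI-level data: 1000 subjects, 3 features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)),
                   columns=["roi_a", "roi_b", "roi_c"])

filtered = filter_outliers(toy, 0.005)  # removes rows with extreme values
clipped = winsorize(filtered, 0.01)     # caps the remaining tails, keeps all rows

print(toy.shape, filtered.shape, clipped.shape)
```

The difference between the two steps is that filtering removes subjects, while winsorizing keeps every remaining subject and only caps extreme values.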

