This notebook goes through a simple binary classification example, explaining library functionality along the way.

In [1]:
import ABCD_ML

In [2]:
#Define directory with the 2.0_NDA_Data
nda_dr = '/mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/'

#This file stores the name mapping
test_mapping_loc = nda_dr + 'ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv'

#We will use as the neuroimaging data just the sMRI data
test_data_loc1 = nda_dr + 'MRI/ABCD sMRI Part 1.csv'
test_data_loc2 = nda_dr + 'MRI/ABCD sMRI Part 2.csv'

#We will load target data (and covariate data) from here
test_target_loc = nda_dr + 'Mental Health/ABCD Parent Demographics Survey.csv'

#We will load stratification data from here
test_strat_loc = nda_dr + 'Other Non-Imaging/ABCD ACS Post Stratification Weights.csv'

#We will load exclusions from here, it is the list of flipped subject ids
test_exclusion_loc = '/home/sage/bader_things/invalid_pguids.txt'

We first need to define the class object, which we will use to load data and to train/test different ML models.
There are a few global parameters which we can optionally set when defining this object as well. Let's look and see what they are.

In [3]:
help(ABCD_ML.ABCD_ML.__init__)

Help on function __init__ in module ABCD_ML.ABCD_ML:

__init__(self, eventname='baseline_year_1_arm_1', use_default_subject_ids=True, default_na_values=['777', '999'], n_jobs=1, original_targets_key='targets', low_memory_mode=False, verbose=True)
    Main class init
    
    Parameters
    ----------
    eventname : str or None, optional
        Optional value to provide, specifying to keep certain rows
        when reading data based on the eventname flag.
        As ABCD is a longitudinal study, this flag lets you select only
        one specific time point, or if set to None, will load everything.
        (default = baseline_year_1_arm_1)
    
    use_default_subject_ids : bool, optional
        Flag to determine the usage of 'default' subject id behavior.
        If set to True, this will convert input NDAR subject ids
        into upper case, with prepended NDAR_ - type format.
        If set to False, then all input subject names must be entered
        explicitly the same, no pr

Most of the default parameters are okay for this simple example, but any of them can be changed. Let's change n_jobs to 4 instead of 1.

In [4]:
ML = ABCD_ML.ABCD_ML(n_jobs = 4)

ABCD_ML object initialized


We can continue by optionally loading in a name map, which is simply a dictionary used to rename loaded columns: any column whose name is a key in the dictionary is renamed to the corresponding value. This is useful for ABCD data, as the default column names are not always descriptive.

Note this name map and these parameters are for the 'ABCD 2.0 Explorer' formatting. 

In [None]:
ML.load_name_map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")
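
As a rough illustration of the renaming idea described above (not the library's actual code), a name map can be applied to a list of column names with a plain dictionary lookup. The function `apply_name_map` and the column names below are hypothetical and chosen only for this sketch.

```python
def apply_name_map(columns, name_map):
    """Rename each column that is a key in name_map; leave the others unchanged."""
    return [name_map.get(col, col) for col in columns]


# Hypothetical mapping from raw column names to more readable ones.
name_map = {"raw_col_a": "readable_a", "raw_col_b": "readable_b"}

# Columns that are not keys in the dictionary pass through untouched.
columns = ["subject_id", "raw_col_a", "raw_col_b", "unmapped_col"]
renamed = apply_name_map(columns, name_map)
print(renamed)
```

Using `dict.get` with the original name as the fallback is what makes the map optional: an empty dictionary simply leaves every column as it was.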

We can look at what exactly is in this dictionary if we want to confirm we loaded it correctly.
It is loaded as name_map within the ABCD_ML class.

In [None]:
some_examples = {k: ML.name_map[k] for k in list(ML.name_map)[300:320]}
some_examples

Next, we will load in the actual data. Like before, we can check which parameters this function requires or accepts.

In [None]:
help(ML.load_data)

In [None]:
ML.load_data(loc=test_data_loc1,
             dataset_type='explorer',
             filter_outlier_percent=.005, 
             winsorize_val=.01)           

That ends up being a lot of data dropped just from filtering outliers. Since we are not in low_memory_mode, we can just clear the data and reload it. This time we will load not just the first data loc but the second as well, at the same time, by providing the locations of both in a list.

In [None]:
ML.clear_data()

In [None]:
ML.load_data(loc=[test_data_loc1, test_data_loc2],
             dataset_type='explorer',
             filter_outlier_percent=.0005, 
             winsorize_val=.001)   

These settings seem okay, and both parts of the data are now loaded with them.
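
To make the difference between the two options more concrete, here is a generic sketch of tail filtering versus winsorizing on a single list of values. The exact rules behind `filter_outlier_percent` and `winsorize_val` are defined by the library and are not shown in this notebook, so the functions `filter_tails` and `winsorize` below are assumptions made for illustration only.

```python
def tail_bounds(values, fraction):
    """Return the lowest and highest values kept after trimming `fraction` per tail."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values affected in each tail
    return ordered[k], ordered[len(ordered) - k - 1]


def filter_tails(values, fraction):
    """Drop values that fall outside the kept range (entries are removed)."""
    low, high = tail_bounds(values, fraction)
    return [v for v in values if low <= v <= high]


def winsorize(values, fraction):
    """Clip values to the kept range (no entries are removed)."""
    low, high = tail_bounds(values, fraction)
    return [min(max(v, low), high) for v in values]


# Hypothetical measurements with one extreme value at each end.
values = [-50, 1, 2, 3, 4, 5, 6, 100]
filtered = filter_tails(values, 0.125)
winsorized = winsorize(values, 0.125)
print(filtered)
print(winsorized)
```

Filtering shortens the list while winsorizing keeps its length, which is one reason a large filter percentage can cost many subjects while a winsorize value of the same size does not.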

Our data for this experiment should now be loaded. We can check it out:

In [None]:
ML.data.head()

In [None]:
ML.data.shape

Great. Now that the data is loaded, we still need to load targets, and can optionally load covars, strat and exclusions. Let's load our target first, and begin as before by checking out the loading function.

In [None]:
help(ML.load_targets)

Great, let's just load in sex as our target variable.

In [None]:
ML.load_targets(loc=test_target_loc,
                col_name='demo_sex_v2',
                data_type='b')

If you read the verbose printout above, you'll notice that it says "More than two unique score values found,filtered all but [1. 2.]". This is because, by default, when a binary data type is passed, the data loader needs to make sure it loads in only two unique values. To handle cases like this one, where stray values exist, all but the top two unique values by count are dropped. The printout also shows which values were kept, in case a mistake was made; here 1 and 2 are the correct sex values. If more than two values are desired, the categorical data type should be used.
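
The "keep the top two unique values by count" rule can be sketched in a few lines of plain Python. This is only an illustration of the rule as described here, not the library's implementation; the function `keep_top_two` and the sample values are hypothetical.

```python
from collections import Counter


def keep_top_two(values):
    """Keep only entries equal to one of the two most frequent values."""
    counts = Counter(values)
    kept = {value for value, _ in counts.most_common(2)}
    filtered = [v for v in values if v in kept]
    return filtered, sorted(kept)


# Hypothetical raw target column with a few stray codes mixed in.
raw = [1, 2, 1, 2, 2, 1, 3, 1, 777]
filtered, kept_values = keep_top_two(raw)
print(kept_values)
print(filtered)
```

Returning the kept values alongside the filtered data mirrors the printout above: it lets you confirm that the two surviving values are the ones you expected.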

Let's look and see to make sure everything was loaded correctly.

In [None]:
ML.targets.head()

We can look into adding covars next. Here co-variates aren't quite treated as typical co-variates, but are values we would like to be able to pass as additional input to the ML model if desired. This input is treated in a special way: specifically, covar input won't be scaled with any data scaler by default.

In [None]:
help(ML.load_covars)

In [None]:
ML.load_covars(loc=test_target_loc,
               col_names = 'demo_ed_v2',
               data_types = 'ordinal',
               standardize = False,
               normalize = True) 

And check to see that it was loaded correctly (and normalized):

In [None]:
ML.covars.head()
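
For reference, here is a minimal sketch of what normalizing an ordinal covariate could look like, assuming that `normalize = True` means min-max scaling to the range [0, 1]. That meaning is an assumption, since this notebook does not show how the library defines it. The function `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical ordinal education codes.
education = [10, 12, 15, 18, 20]
normalized = min_max_normalize(education)
print(normalized)
```

Because the covariate is ordinal, a linear rescaling like this preserves the ordering of the levels while putting them on a bounded scale.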

Stratification values (strat) are the values that we can optionally define custom validation / split behavior on. Within this example, we are just going to make sure that all splits keep subjects with the same family id within the same fold, so let's load family id, after looking at the help function.

In [None]:
help(ML.load_strat)

In [None]:
ML.load_strat(loc=test_strat_loc,
              col_names='rel_family_id')

In [None]:
ML.strat.head()

Great, looks good. Lastly, we can optionally load in a list of subject ids to exclude from the analysis, for whatever reason.

In [None]:
help(ML.load_exclusions)

In [None]:
ML.load_exclusions(loc=test_exclusion_loc)

Great, we now have our data, targets, covars, strat and exclusions loaded (noting that the minimum requirements for running an ML experiment are just targets plus either data or covars, the rest being optional). The actual length of the script is also not as terrible as it seems, and once loading behavior is confirmed, verbose can even be turned off. To show this, we can re-load everything as above with verbose off.

In [None]:
ML = ABCD_ML.ABCD_ML(n_jobs = 4, verbose = False) # Reloading the ML object itself to reset everything.

ML.load_name_map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")

ML.load_data(loc=[test_data_loc1, test_data_loc2],
             dataset_type='explorer',
             filter_outlier_percent=.0005, 
             winsorize_val=.001)

ML.load_targets(loc=test_target_loc,
                col_name='demo_sex_v2',
                data_type='b')

ML.load_covars(loc=test_target_loc,
               col_names = 'demo_ed_v2',
               data_types = 'ordinal',
               standardize = False,
               normalize = True)

ML.load_strat(loc=test_strat_loc,
              col_names='rel_family_id')

ML.load_exclusions(loc=test_exclusion_loc)

Let's continue. We will turn verbose back on, and then move on to defining our validation strategy (which is again optional, but as stated before, for this example we are going to keep matching family ids within the same folds).

In [None]:
help(ML.define_validation_strategy)

So for group preserving behavior we are interested in supplying an argument for groups.

In [None]:
ML.verbose = True
ML.define_validation_strategy(groups='rel_family_id')

Note that when it says over 9985 unique values, this count is taken over all of the loaded values within ML.strat. In practice, splits will take place over only the overlap of subjects minus loaded exclusions; the message above is just saying there are 9985 unique values in strat alone, not in the overlap.
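
The distinction between "unique values in strat" and "unique values in the overlap" can be illustrated with plain set operations. This is a sketch of the idea only; the function `usable_subjects`, the subject ids and the family ids are all hypothetical.

```python
def usable_subjects(loaded, exclusions):
    """Subjects present in every loaded table, minus any excluded subjects."""
    overlap = set.intersection(*loaded.values())
    return overlap - set(exclusions)


# Hypothetical subject ids present in each loaded table.
loaded = {
    "data": {"s1", "s2", "s3", "s4", "s5"},
    "targets": {"s1", "s2", "s3", "s4"},
    "strat": {"s2", "s3", "s4", "s5", "s6"},
}
exclusions = {"s3"}

# Hypothetical family id for every subject in strat.
family_of = {"s2": "f1", "s3": "f2", "s4": "f1", "s5": "f3", "s6": "f4"}

final_subjects = usable_subjects(loaded, exclusions)
families_in_strat = set(family_of.values())
families_in_overlap = {family_of[s] for s in final_subjects}

print(sorted(final_subjects))
print(len(families_in_strat), len(families_in_overlap))
```

In this toy case strat holds four unique family ids, but only one of them belongs to a subject that survives the overlap and exclusion steps.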

Lastly, before we get to modelling, we want to define a global train-test split, so that we can perform model exploration, parameter tuning, etc. on a training set, and keep a left-out testing set to eventually test our final selected model.

In [None]:
help(ML.train_test_split)

In [None]:
ML.train_test_split(test_size=.25, #Let's be somewhat conservative, and use a size of .25
                    random_state=1)

Great. Because we set the validation strategy to preserve family structure within the folds, we know that no family id is in both the train and test set. For the paranoid, we can make sure of this.

In [None]:
train_ids = set(ML.strat['rel_family_id'].loc[ML.train_subjects])
test_ids = set(ML.strat['rel_family_id'].loc[ML.test_subjects])

print('Unique family ids in train: ', len(train_ids))
print('Unique family ids in test: ', len(test_ids))
print('Overlap : ', len(train_ids.intersection(test_ids)))
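
For intuition about why the overlap is zero, here is a minimal sketch of a group-preserving split written with the standard library only. It is not the library's implementation; scikit-learn's `GroupShuffleSplit` offers comparable behavior. The function `group_train_test_split` and the ids are hypothetical, and this sketch applies `test_size` to the number of groups rather than the number of subjects.

```python
import random


def group_train_test_split(subject_to_group, test_size, random_state):
    """Split subjects so that every group lands entirely in train or in test."""
    groups = sorted(set(subject_to_group.values()))
    rng = random.Random(random_state)
    rng.shuffle(groups)

    # Assumption for this sketch: test_size is a fraction of groups, not subjects.
    n_test_groups = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test_groups])

    train = [s for s, g in subject_to_group.items() if g not in test_groups]
    test = [s for s, g in subject_to_group.items() if g in test_groups]
    return train, test


# Hypothetical subjects mapped to family ids (siblings share an id).
subject_to_group = {
    "s1": "f1", "s2": "f1",
    "s3": "f2", "s4": "f2",
    "s5": "f3", "s6": "f3",
    "s7": "f4", "s8": "f4",
}

train, test = group_train_test_split(subject_to_group, test_size=0.25, random_state=1)
train_groups = {subject_to_group[s] for s in train}
test_groups = {subject_to_group[s] for s in test}
print("Overlap:", len(train_groups & test_groups))
```

Splitting the list of groups first, and only then assigning subjects, is what guarantees that siblings can never end up on opposite sides of the split.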

Now we can proceed to modeling.

The main function we use here is 