Within this notebook we will look at an example Regression problem. We will try to predict "nihtbx_picvocab_agecorrected" scores.

# Init Phase

In [None]:
import ABCD_ML

In [None]:
# This is the 2.0.1 release dr, we'll load data from it
nda_dr1 = '/mnt/sdb2/ABCDFixRelease2p0p1/'

# This is the 2.0 release dr, we'll load targets from it
nda_dr2 = '/mnt/sdb2/ABCD2p0NDA/'

# We will use the gordon ROI resting state fMRI correlations as our data
data_loc = nda_dr1 + 'abcd_betnet02.txt'

# This file contains the NIH toolbox scores
target_loc = nda_dr2 + 'abcd_tbss01.txt'

In [None]:
# A lot of these are default values, but just showing everything as an example,
# See the docs / init help string for actual explanation

ML = ABCD_ML.ABCD_ML(exp_name = 'Regression_Example',
                     log_dr = '',
                     existing_log = 'overwrite',
                     notebook = True,
                     subject_id = 'src_subject_id',
                     eventname = 'baseline_year_1_arm_1',
                     use_default_subject_ids = True,
                     default_dataset_type = 'basic',
                     default_na_values = ['777', '999'],
                     original_targets_key = 'targets',
                     low_memory_mode = False,
                     random_state = 1,
                     verbose = True
                     )

In [None]:
ML.Show_Ensemble_Types()

# Data Loading Phase

We would normally start by loading exclusions. If exclusions are loaded before data, targets, covars, etc., the excluded subjects are removed as each of those is loaded, before any value-based drop behavior is applied. For example, when filtering outliers or dropping columns, only the non-excluded subjects are considered.

We won't load any here, but if there were certain known subjects to exclude they could be read from a file as exclusions or passed in as a list to Load_Exclusions.

We will just load the data instead; specifically, we are using the resting state correlations.

In [None]:
ML.Load_Data(loc = data_loc)

Let's take a quick look at the data, especially since we got warnings about a number of columns with few unique values. That is an indicator that a column might not be wanted, since the data is supposed to be neuroimaging data.

In [None]:
ML.data.head()

Alright, the obvious problem is that we have a number of columns that we most likely don't want to include. Specifically, there are a number of descriptors, e.g., number of trials, which we don't want because they are not neuroimaging ROIs. We just want to grab the correlations.
Let's clear the data and reload it, explicitly telling the data loader to drop those keys.

In [None]:
to_drop = list(ML.data)[:12]
to_drop

In [None]:
ML.Clear_Data()
ML.Load_Data(loc = data_loc, drop_keys=to_drop)

In [None]:
ML.data.head()

Great, that looks better, though the correlations have one more problem: there are repeated columns, since each correlation is stored both as x corr with y and as y corr with x. Note that Load_Data can remove duplicate columns (a column is marked as a duplicate if its correlation with another column is above a user-defined threshold). Instead of reloading the data, though, we can also call a dedicated function for this purpose.

In [None]:
ML.Drop_Data_Duplicates(corr_thresh=.99)
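
To make the idea of a correlation threshold concrete, here is a minimal sketch in plain Python. It is not ABCD_ML's implementation: `pearson`, `drop_duplicate_columns` and the toy column names are hypothetical, and keeping the first column of each duplicate group is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_duplicate_columns(columns, corr_thresh=0.99):
    """Keep a column only if it is not correlated above corr_thresh
    with a column that was already kept (assumed tie-breaking rule)."""
    kept = {}
    for name, values in columns.items():
        if all(pearson(values, other) <= corr_thresh for other in kept.values()):
            kept[name] = values
    return kept


# Toy data (made up): the same correlation stored twice under two names.
toy = {
    'roi_a_roi_b': [0.10, 0.40, 0.35, 0.80],
    'roi_b_roi_a': [0.10, 0.40, 0.35, 0.80],
    'roi_a_roi_c': [0.90, 0.20, 0.50, 0.10],
}
deduplicated = drop_duplicate_columns(toy, corr_thresh=0.99)
print(list(deduplicated))
```

With a high threshold such as .99, only columns that are essentially copies of one another are dropped.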

Okay, now let's load our targets and some covars (just age + sex).

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float')
ML.Load_Covars(target_loc, ['interview_age', 'gender'], ['f', 'b'])

In [None]:
ML.Show_Targets_Dist()

It appears that there are some severe outliers... let's try reloading the targets, cutting off 1% from either end of the distribution, and see if that helps. Note: what we are doing here acts on the whole dataset before any train/test split, so it is recommended that any global actions at this stage be reasonable... in this case it seems reasonable to remove subjects with scores that are likely just human input error.

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float', filter_outlier_percent=1)
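
As a rough illustration of what trimming both tails of a distribution does, here is a minimal sketch in plain Python. It is only an assumption about the general technique, not ABCD_ML's code: `filter_outliers_by_percent` and its `fraction` argument are hypothetical, and how `filter_outlier_percent` itself is scaled should be checked in the docs.

```python
def filter_outliers_by_percent(values, fraction):
    """Drop roughly `fraction` of the values from each tail of the distribution."""
    ordered = sorted(values)
    n_cut = int(len(ordered) * fraction)  # number of values trimmed per tail
    if n_cut == 0:
        return list(values)
    lower, upper = ordered[n_cut], ordered[-n_cut - 1]
    # Keep the original order; only values inside [lower, upper] survive.
    return [v for v in values if lower <= v <= upper]


# Made-up scores with one implausibly low and one implausibly high entry.
scores = [100, 95, 110, 3, 105, 98, 102, 250, 99, 101]
trimmed = filter_outliers_by_percent(scores, fraction=0.10)
print(trimmed)
```

Because the cut is based on rank rather than on a fixed score, a larger fraction always removes more subjects, even when the remaining values are perfectly plausible.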

In [None]:
ML.Show_Targets_Dist()

One percent might even be too much...

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float', filter_outlier_percent=.005)

In [None]:
ML.Show_Targets_Dist()

That looks a little better, so only the really severe outliers are removed. The distribution is still a bit weird, but nothing we can do about that.

We have now loaded:
- Exclusions (none in this example)
- Data, Targets and Covars

We could optionally load stratification values, but for this example, we will just use random cross validation.

# Validation Phase

Before defining the train/test split, we could optionally define a validation strategy, but for this experiment we are just going to use random splits.

In [None]:
ML.Train_Test_Split(test_size=.25, random_state=1)
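
For intuition, a random train/test split with a fixed seed can be sketched in plain Python as below. This is an illustrative assumption, not what Train_Test_Split does internally; `random_train_test_split` and the subject IDs are hypothetical.

```python
import random


def random_train_test_split(subject_ids, test_size=0.25, random_state=1):
    """Shuffle subjects with a fixed seed and hold out a fraction as the test set."""
    shuffled = list(subject_ids)
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up subject IDs, for illustration only.
subjects = [f'sub-{i:03d}' for i in range(20)]
train, test = random_train_test_split(subjects, test_size=0.25, random_state=1)
print(len(train), len(test))
```

Fixing the seed means the same subjects land in the test set every time the cell is re-run.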

# Modeling Phase

In [None]:
ML.Set_Default_ML_Params(problem_type='regression',
                         metric=['r2', 'mse'],
                         scaler='standard',
                         n_splits=2,
                         n_repeats=1,
                         int_cv=2,
                         n_jobs=8,
                         n_iter=20,
                         random_state=1)

In [None]:
raw_scores = ML.Evaluate(model_type = 'knn regressor',
                         feat_selector = 'univariate selection',
                         calc_shap_feature_importances = False,
                         model_type_param_ind = 0,
                         feat_selector_param_ind = 0,
                         search_type = None)

In [None]:
X

So what the above statements are telling us is that we are not using any feature selectors, and that our default search type is set to None, which means that by default we are not doing any sort of search for hyperparameters.

In general, the way a hyperparameter search works is that you can set a param ind for the data scaler, feat_selector and model_type, where the param ind specifies which grid of parameters you want to search over for that object. By default, each object (feat_selector, model_type, scaler) has a base grid of parameters consisting of just one value, and that is always index 0. So when we set the search type to None, it actually forces every scaler_param_ind and model_type_param_ind to be 0.

Our other options are to set search_type to either 'grid', for an explicit grid search over all param options, or 'random', to search over n_iter randomly sampled parameter combinations. Then, you can set any of the param inds to either a different numerical index, specifying a different distribution of params, or to the str indicator name that each param distribution also has.

The extra complexity of all of this is worth it I promise, as it lets you define a random search of parameters over not just your model, but optionally a data scaler and feature selector, all relatively easily!
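
The difference between the 'grid' and 'random' search types described above can be sketched in plain Python. This is a generic illustration rather than ABCD_ML's search code: the helper names are hypothetical, and the parameter names and values in `param_grid` are made up.

```python
import itertools
import random

# Illustrative parameter options (assumed values, not an ABCD_ML param grid).
param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.001, 0.01]}


def grid_candidates(grid):
    """Grid search: every combination of every option."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def random_candidates(grid, n_iter, random_state=1):
    """Random search: n_iter combinations sampled independently."""
    rng = random.Random(random_state)
    return [{name: rng.choice(options) for name, options in grid.items()}
            for _ in range(n_iter)]


grid = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=4)
print(len(grid), len(sampled))
```

The grid grows multiplicatively with every extra parameter, while the cost of the random search is fixed by n_iter.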

Oh, but you're asking, how do we see these different param ind options for each model or scaler?
Heck, I'll show ya

In [None]:
# First let's look at just which models are available
ML.Show_Models(problem_type='regression')

Okay, now let's choose to look at just the "svm regressor".

In [None]:
ML.Show_Models()

So, for the SVR it looks like we just have one other option besides 0, which is a random search option.

We can also check for the standard data scaler

In [None]:
ML.Show_scalers(scaler='standard')

In [None]:
raw_scores = ML.Evaluate(model_type = 'svm',
                         model_type_param_ind = 1,
                         search_type = 'random',
                         feat_selector = 'rfe',
                         n_iter = 100)

Well, no options for that, and we are not using any feature selection, so let's just use param_ind = 1 for the svm regressor.

In [None]:
raw_scores = ML.Evaluate(model_type = 'svm',
                         model_type_param_ind = 1,
                         search_type = 'random',
                         n_iter=100)

In [None]:
raw_scores = ML.Evaluate(model_type = 'svm')

Okay, let's say we wanted to use an elastic net now.

In [None]:
raw_scores = ML.Evaluate(model_type = 'elastic',
                         search_type= 'random',
                         model_type_param_ind = 1)

In [None]:
raw_scores = ML.Evaluate(model_type = 'elastic',
                         search_type= 'random',
                         model_type_param_ind = 1,
                         n_iter = 200)

What about light gbm, you say? That sounds fancy. (For both elastic net and this we just use model_type_param_ind = 1, since it is typically the random search grid of parameters, but definitely look to make sure what you are actually running.)

In [None]:
raw_scores = ML.Evaluate(model_type = 'light gbm',
                         search_type='random',
                         model_type_param_ind = 1,
                         int_cv=2,
                         n_iter=100)