In this notebook we will work through an example regression problem: predicting "nihtbx_picvocab_agecorrected" scores.

# Init Phase

In [None]:
import ABCD_ML

In [None]:
# We will use the default nda dr for this
nda_dr = '/mnt/sdb2/ABCD2p0NDA/'

# We will use the gordon ROI resting state fMRI correlations as our data
data_loc = nda_dr + 'abcd_betnet02.txt'

# This file contains the NIH toolbox scores
target_loc = nda_dr + 'abcd_tbss01.txt'

# For exclusions we will exclude the flipped subjects
exclusion_loc = '/home/sage/bader_things/invalid_pguids.txt'

In [None]:
ML = ABCD_ML.ABCD_ML()

In [None]:
ML.Show_Model_Types()

# Data Loading Phase

We will start by loading exclusions. If exclusions are loaded before data, targets, covars, etc., the excluded subjects are removed as soon as each of those is loaded, before any drop behavior based on values is applied. For example, outlier filtering or dropping columns will then only consider the non-excluded subjects. If this does not matter for your analysis, exclusions can be loaded later and will simply be applied retroactively.
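
To see why the order can matter, here is a minimal sketch on toy data. It is not ABCD_ML's implementation: the subject IDs, the scores, and the `drop_beyond_n_std` helper are all hypothetical, and a simple one-standard-deviation rule stands in for whatever value-based filtering is applied during loading.

```python
import pandas as pd

# Hypothetical toy scores; subject "s5" is on the exclusion list
# and carries an invalid, extreme value.
scores = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 500.0],
    index=["s1", "s2", "s3", "s4", "s5"],
)
excluded = ["s5"]  # hypothetical exclusion list


def drop_beyond_n_std(values, n_std=1.0):
    """Keep values within n_std standard deviations of the mean (toy filter)."""
    mean, std = values.mean(), values.std()
    return values[(values - mean).abs() <= n_std * std]


# Exclusions first: the threshold is computed from the valid subjects only.
excl_first = drop_beyond_n_std(scores.drop(index=excluded))

# Filter first: the excluded subject's value distorts the threshold,
# and removing that subject afterwards does not undo the effect.
filter_first = drop_beyond_n_std(scores).drop(index=excluded, errors="ignore")

print(sorted(excl_first.index))    # ['s2', 's3']
print(sorted(filter_first.index))  # ['s1', 's2', 's3', 's4']
```

The two orders keep different subjects, which is the reason for loading exclusions up front whenever value-based filtering follows.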

In [None]:
ML.Load_Exclusions(loc = exclusion_loc)

In [None]:
ML.Load_Data(loc = data_loc)

Let's take a quick look at the data, especially since a number of columns were flagged with warnings for having few unique values. This is an indicator that such a column might not be wanted, as the data is supposed to be neuroimaging data.

In [None]:
ML.data.head()

Alright, the obvious problem is that we have a number of columns that we most likely don't want to include. Specifically, there are a number of descriptors, like number of trials, which we don't want. We just want to grab the correlations.
Let's clear the data and reload it, explicitly telling the data loader to drop those keys.

In [None]:
to_drop = list(ML.data)[:12]
to_drop

In [None]:
ML.Clear_Data()
ML.Load_Data(loc = data_loc, drop_keys=to_drop)

In [None]:
ML.data.head()

Great, that looks better, though the correlations have an extra problem: there are repeated columns, since both x correlated with y and y correlated with x are included. Note that Load_Data has an option to remove duplicate columns (a column is marked as a duplicate if its correlation with another column is above a user-defined threshold). Instead of reloading the data, though, we can also call a specific function for this purpose.
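
The idea behind correlation-based duplicate removal can be sketched roughly as follows. This is an illustration only, not the code behind Drop_Data_Duplicates: the helper `drop_correlated_duplicates`, the toy column names, and the choice to compare absolute correlations and keep the first column of each duplicate pair are all assumptions.

```python
import numpy as np
import pandas as pd


def drop_correlated_duplicates(df, corr_thresh=0.99):
    """Drop every column whose absolute correlation with an
    earlier kept column exceeds corr_thresh (hypothetical helper)."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] <= corr_thresh for k in kept):
            kept.append(col)
    return df[kept]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Toy stand-in for ROI-to-ROI correlations stored in both directions.
toy = pd.DataFrame({
    "roi1_roi2": a,
    "roi1_roi3": b,
    "roi2_roi1": a,  # same values as roi1_roi2, reversed ROI order
    "roi3_roi1": b,  # same values as roi1_roi3, reversed ROI order
})

deduped = drop_correlated_duplicates(toy, corr_thresh=0.99)
print(list(deduped.columns))  # ['roi1_roi2', 'roi1_roi3']
```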

In [None]:
ML.Drop_Data_Duplicates(corr_thresh=.99)

Okay, now let's load our targets and, for covars, just age and sex. The exclusions loaded earlier still apply.

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float')

ML.Load_Covars(data_loc, ['interview_age', 'gender'], ['f', 'b'])

In [None]:
ML.Show_Targets_Dist()

It appears that there are some severe outliers. Let's try reloading the targets, cutting off 1% from either end of the distribution, and see if that helps. Note: what we are doing here acts upon the whole dataset before any train/test split, so it is recommended that any global action taken at this stage be reasonable. In this case it seems reasonable to remove subjects with scores that are likely just human input error.
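
As a rough picture of what trimming both tails does, here is a minimal NumPy sketch on simulated scores. The `trim_tails` helper is hypothetical, and it treats `percent` as a percentage of subjects removed from each tail; that reading is an assumption here and may not match how `filter_outlier_percent` interprets its argument.

```python
import numpy as np


def trim_tails(values, percent):
    """Keep values strictly between the `percent` and
    `100 - percent` percentiles (hypothetical helper)."""
    lower, upper = np.percentile(values, [percent, 100 - percent])
    return values[(values > lower) & (values < upper)]


rng = np.random.default_rng(1)
scores = rng.normal(loc=100, scale=15, size=1000)  # simulated scores
scores[:3] = [0.0, 999.0, 1200.0]  # hypothetical data-entry errors

trimmed = trim_tails(scores, percent=1)
print(len(scores), len(trimmed))
print(trimmed.min(), trimmed.max())
```

The implausible values are removed, but so are some legitimate scores near the tails, which is why a smaller cutoff can be preferable.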

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float', filter_outlier_percent=1)

In [None]:
ML.Show_Targets_Dist()

One percent might even be too much...

In [None]:
ML.Load_Targets(target_loc, 'nihtbx_picvocab_agecorrected', 'float', filter_outlier_percent=.005)

In [None]:
ML.Show_Targets_Dist()

That looks a little better: now only the really severe outliers are removed. The distribution is still a bit odd, but there is nothing we can do about that.

We have now loaded:
- Exclusions
- Data, Targets and Covars

We could optionally load stratification values, but for this example, we will just use random cross validation.
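
For intuition, purely random (unstratified) splitting can be sketched as below, using the same settings that appear later in this notebook (a 25% test set, 3 splits, 2 repeats). This is a NumPy-only illustration with hypothetical subject IDs, not how ABCD_ML performs its splits internally.

```python
import numpy as np

# Hypothetical subject IDs.
subjects = np.array([f"sub-{i:03d}" for i in range(20)])

# Hold out a random 25% of subjects as the test set.
rng = np.random.default_rng(1)
shuffled = rng.permutation(subjects)
n_test = int(round(0.25 * len(subjects)))
test, train = shuffled[:n_test], shuffled[n_test:]

# Repeated random K-fold on the training set: 3 splits, repeated 2 times.
folds = []
for repeat in range(2):
    order = rng.permutation(train)
    for val in np.array_split(order, 3):
        fit = np.setdiff1d(train, val)  # training subjects not in this fold
        folds.append((fit, val))

print(len(train), len(test), len(folds))  # 15 5 6
```

A stratified strategy would instead constrain how subjects are assigned to folds based on the loaded stratification values.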

# Validation Phase

In [None]:
# We could optionally define a validation strategy here, but we use the default
ML.Train_Test_Split(test_size=.25, random_state=1)

# Modeling Phase

In [None]:
ML.Set_Default_ML_Params(problem_type='regression',
                         metric=['r2', 'mse'],
                         data_scaler='standard',
                         n_splits=3,
                         n_repeats=2,
                         int_cv=3,
                         n_jobs=8,
                         n_iter=20,
                         random_state=1)

In [None]:
#ML.Show_Models(problem_type='regression')

In [None]:
raw_scores = ML.Evaluate(model_type = 'svm', feat_selector = 'univariate select half') 

Notably, when you provide multiple metrics/scorers, the first one in the list is used for model selection. Let's see if reversing the order and using mean squared error for model selection changes anything.
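
The selection rule can be sketched as follows. The candidate settings and their scores are made up for illustration, and `select` is a hypothetical helper, not part of ABCD_ML. Scores averaged over folds can rank candidates differently under R² and MSE, because R² is normalized by each fold's target variance.

```python
# Hypothetical mean cross-validated scores for three hyperparameter settings.
candidates = {
    "C=0.1": {"r2": 0.10, "mse": 210.0},
    "C=1": {"r2": 0.14, "mse": 205.0},
    "C=10": {"r2": 0.13, "mse": 201.0},
}

# Whether a larger value of each metric means a better model.
HIGHER_IS_BETTER = {"r2": True, "mse": False}


def select(candidates, metrics):
    """Pick the best candidate according to the first metric only."""
    primary = metrics[0]
    sign = 1 if HIGHER_IS_BETTER[primary] else -1
    return max(candidates, key=lambda name: sign * candidates[name][primary])


print(select(candidates, ["r2", "mse"]))  # C=1
print(select(candidates, ["mse", "r2"]))  # C=10
```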

In [None]:
#raw_scores = ML.Evaluate(model_type = 'svm rs', metric=['mse', 'r2'])

In [None]:
ML.Model.model