This notebook goes through a simple binary classification example, explaining library functionality along the way.

In [1]:
import ABCD_ML

In [2]:
#Define directory with the 2.0_NDA_Data
nda_dr = '/mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/'

#This file stores the name mapping
test_mapping_loc = nda_dr + 'ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv'

#We will use as the neuroimaging data just the sMRI data
test_data_loc1 = nda_dr + 'MRI/ABCD sMRI Part 1.csv'
test_data_loc2 = nda_dr + 'MRI/ABCD sMRI Part 2.csv'

#We will load target data (and covariate data) from here
test_target_loc = nda_dr + 'Mental Health/ABCD Parent Demographics Survey.csv'

#We will load stratification data from here
test_strat_loc = nda_dr + 'Other Non-Imaging/ABCD ACS Post Stratification Weights.csv'

#We will load exclusions from here, it is the list of flipped subject ids
test_exclusion_loc = '/home/sage/bader_things/invalid_pguids.txt'

We first need to define the class object, which we will use to load data and to train/test different ML models.
There are a few global parameters which we can optionally set when defining this object as well; let's look and see what they are.

In [3]:
help(ABCD_ML.ABCD_ML.__init__)

Help on function __init__ in module ABCD_ML.ABCD_ML:

__init__(self, eventname='baseline_year_1_arm_1', use_default_subject_ids=True, default_na_values=['777', '999'], original_targets_key='targets', low_memory_mode=False, random_state=None, verbose=True)
    Main class init
    
    Parameters
    ----------
    eventname : str or None, optional
        Optional value to provide, specifying to keep certain rows
        when reading data based on the eventname flag.
        As ABCD is a longitudinal study, this flag lets you select only
        one specific time point, or if set to None, will load everything.
        (default = baseline_year_1_arm_1)
    
    use_default_subject_ids : bool, optional
        Flag to determine the usage of 'default' subject id behavior.
        If set to True, this will convert input NDAR subject ids
        into upper case, with prepended NDAR_ - type format.
        If set to False, then all input subject names must be entered
        explicitly the sa

Most of the default parameters are okay for this simple example, but any of them can be changed by passing them when creating the object. Here we keep the defaults.

In [4]:
ML = ABCD_ML.ABCD_ML()

ABCD_ML object initialized


We can continue by optionally loading in a name map, which is simply a dictionary that attempts to rename any column names loaded in, if those column names are a key in the dictionary. This is useful for ABCD data as the default column names might not be useful.

Note this name map and these parameters are for the 'ABCD 2.0 Explorer' formatting. 

In [5]:
ML.load_name_map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")

Loaded map file


We can look at what exactly is in this dictionary if we want to confirm we loaded it correctly.
It is loaded as name_map within the ABCD_ML class.

In [6]:
some_examples = {k: ML.name_map[k] for k in list(ML.name_map)[300:320]}
some_examples

{'ddtidp_674': 'dmri_dtifagm_cortdestrieux_ssuborbitallh',
 'ddtidp_748': 'dmri_dtifagm_cortdestrieux_ssuborbitalrh',
 'ddtidp_675': 'dmri_dtifagm_cortdestrieux_ssubparietallh',
 'ddtidp_749': 'dmri_dtifagm_cortdestrieux_ssubparietalrh',
 'ddtidp_676': 'dmri_dtifagm_cortdestrieux_stemporalinflh',
 'ddtidp_750': 'dmri_dtifagm_cortdestrieux_stemporalinfrh',
 'ddtidp_677': 'dmri_dtifagm_cortdestrieux_stemporalsuplh',
 'ddtidp_751': 'dmri_dtifagm_cortdestrieux_stemporalsuprh',
 'ddtidp_678': 'dmri_dtifagm_cortdestrieux_stemporaltransverselh',
 'ddtidp_752': 'dmri_dtifagm_cortdestrieux_stemporaltransverserh',
 'dmri_dtifagwc_cdsn_bslh': 'dmri_dtifagwc_cortdesikan_banksstslh',
 'dmri_dtifagwc_cdsn_bsrh': 'dmri_dtifagwc_cortdesikan_banksstsrh',
 'dmri_dtifagwc_cdsn_cdacatelh': 'dmri_dtifagwc_cortdesikan_caudalanteriorcingulatelh',
 'dmri_dtifagwc_cdsn_cdacaterh': 'dmri_dtifagwc_cortdesikan_caudalanteriorcingulaterh',
 'dmri_dtifagwc_cdsn_cdmflh': 'dmri_dtifagwc_cortdesikan_caudalmiddlefrontal
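
As a rough illustration of what a name map does, the sketch below applies a plain dictionary to a list of column names. The helper rename_columns is hypothetical and is not part of ABCD_ML; the two mapping entries are copied from the output above, and the third column name is made up to show that unmapped names pass through unchanged.

```python
def rename_columns(columns, name_map):
    """Return the column names, replacing any name that is a key in name_map."""
    return [name_map.get(col, col) for col in columns]


# Two entries copied from the name map output shown above.
name_map = {
    "ddtidp_674": "dmri_dtifagm_cortdestrieux_ssuborbitallh",
    "ddtidp_748": "dmri_dtifagm_cortdestrieux_ssuborbitalrh",
}

# The last column name is invented; it is not a key, so it is left as-is.
columns = ["ddtidp_674", "ddtidp_748", "some_other_column"]
renamed = rename_columns(columns, name_map)
print(renamed)
```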

Next, we will load in the actual data. Like before, we can check what parameters this function wants / can accept.

In [7]:
help(ML.load_data)

Help on method load_data in module ABCD_ML._Data:

load_data(loc, dataset_type, drop_keys=[], filter_outlier_percent=None, winsorize_val=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Load a ABCD2p0NDA (default) or 2.0_ABCD_Data_Explorer (explorer)
    release formatted neuroimaging dataset - of derived ROI level info.
    
    Parameters
    ----------
    loc : str, Path or list of
        The location of the csv file to load data load from.
        If passed a list, then will load each loc in the list,
        and will assume them all to be of the same dataset_type if one
        dataset_type is passed, or if they differ in type, a list must be
        passed to dataset_type with the different types in order.
        Note: some proc will be done on each loaded dataset before merging
        with the rest (duplicate subjects, proc for eventname ect...), but
        other dataset loading behavior won't occur until after the merge,
        e.g., dropping cols by key, filtering fo

In [8]:
ML.load_data(loc=test_data_loc1,
             dataset_type='explorer',
             filter_outlier_percent=.005, 
             winsorize_val=.01)           

Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 1.csv assumed to be dataset type: explorer
dropped ['abcd_smrip101_id', 'dataset_id', 'smri_visitid'] columns by default due to dataset type
Dropped 0 columns, per drop_keys argument
Dropped 0 cols for all missing values
Dropped 522 rows for missing values
Dropped rows with missing data
Filtered data for outliers with value:  0.005
Winsorized data with value:  0.01
loaded shape:  (2099, 749)

Total valid (overlapping subjects / not in exclusions) subjects = 2099



That ends up being a lot of data dropped just from filtering outliers... Since we are not in low_memory_mode, we can just clear the data and reload it with smaller filter and winsorize values. This time we will load not just the first data loc but the second as well, both at the same time, by providing the locations of both in a list.

In [9]:
ML.clear_data()

cleared data.


In [10]:
ML.load_data(loc=[test_data_loc1, test_data_loc2],
             dataset_type='explorer',
             filter_outlier_percent=.0005, 
             winsorize_val=.001)   

Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 1.csv assumed to be dataset type: explorer
dropped ['abcd_smrip101_id', 'dataset_id', 'smri_visitid'] columns by default due to dataset type
Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 2.csv assumed to be dataset type: explorer
dropped ['abcd_smrip201_id', 'dataset_id'] columns by default due to dataset type
Dropped 0 columns, per drop_keys argument
Dropped 10 cols for all missing values
Dropped 1233 rows for missing values
Dropped rows with missing data
Filtered data for outliers with value:  0.0005
Winsorized data with value:  0.001
loaded shape:  (7303, 1186)

Total valid (overlapping subjects / not in exclusions) subjects = 7303
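
To make the difference between the two arguments concrete, here is a minimal generic sketch of tail-based outlier filtering versus winsorizing on a single column of numbers. It assumes both arguments are interpreted as a fraction of points per tail; whether ABCD_ML applies exactly this rule is an assumption, and the function names are hypothetical.

```python
def tail_bounds(values, fraction):
    """Return the (low, high) values that remain after trimming
    `fraction` of the points from each tail of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points in each tail
    return ordered[k], ordered[-k - 1]


def filter_outliers(values, fraction):
    """Drop points that fall outside the tail bounds (rows are lost)."""
    low, high = tail_bounds(values, fraction)
    return [v for v in values if low <= v <= high]


def winsorize(values, fraction):
    """Clip points to the tail bounds (no rows are lost)."""
    low, high = tail_bounds(values, fraction)
    return [min(max(v, low), high) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up column with one extreme value
filtered = filter_outliers(values, 0.1)
clipped = winsorize(values, 0.1)
print(len(filtered), clipped)
```

The practical difference is that filtering removes subjects, which is why a larger filter value costs so many rows, while winsorizing keeps every subject and only pulls extreme values inward.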



These seem like okay settings, and both halves of the data have now been loaded with them.

Our data for this experiment should now be loaded. We can check it out:

In [11]:
ML.data.head()

src_subject_id,smri_thick_cortdesikan_banksstslh,smri_thick_cortdesikan_caudalanteriorcingulatelh,smri_thick_cortdesikan_caudalmiddlefrontallh,smri_thick_cortdesikan_cuneuslh,smri_thick_cortdesikan_entorhinallh,smri_thick_cortdesikan_fusiformlh,smri_thick_cortdesikan_inferiorparietallh,smri_thick_cortdesikan_inferiortemporallh,smri_thick_cortdesikan_isthmuscingulatelh,smri_thick_cortdesikan_lateraloccipitallh,...,smri_t2w_subcortaseg_hippocampusrh,smri_t2w_subcortaseg_amygdalarh,smri_t2w_subcortaseg_accumbensarearh,smri_t2w_subcortaseg_ventraldcrh,smri_t2w_subcortaseg_wmhypointensities,smri_t2w_subcortaseg_ccposterior,smri_t2w_subcortaseg_ccmidposterior,smri_t2w_subcortaseg_cccentral,smri_t2w_subcortaseg_ccmidanterior,smri_t2w_subcortaseg_ccanterior
NDAR_INV00R4TXET,2.817,3.372,3.012,2.155,3.115,2.81,2.854,3.147,2.629,2.348,...,166.147666,171.143448,152.63404,115.660336,152.816743,106.990743,120.449265,117.044747,112.399884,109.441508
NDAR_INV00UMK5VC,2.685,2.666,2.895,1.987,3.107,2.947,2.887,3.256,2.434,2.35,...,170.35049,154.837096,133.076824,116.094815,147.913201,98.521998,131.900032,122.04712,102.673528,105.812915
NDAR_INV014RTM1V,2.783,2.999,2.956,2.03,3.17,2.94,2.718,3.014,2.444,2.266,...,166.764158,164.723606,163.940763,117.205626,211.238987,106.615366,125.595535,119.674263,116.918631,111.743396
NDAR_INV019DXLU4,2.944,3.078,2.931,2.166,3.459,2.98,2.785,3.08,2.677,2.36,...,159.890286,151.097159,145.289319,114.540321,140.991964,102.489477,110.442037,100.033514,102.737771,112.328037
NDAR_INV01ELX9L6,2.861,2.4,2.898,2.101,4.043,2.939,2.677,2.811,3.035,2.202,...,147.023012,148.368816,153.563722,112.275222,141.126298,102.60776,107.345429,103.728713,105.512885,106.459537


In [12]:
ML.data.shape

(7303, 1186)

Great. Now that data is loaded, we still need to load targets, and can optionally load covars, strat and exclusions. Let's load our target first, and begin as before by checking out the loading function.

In [13]:
help(ML.load_targets)

Help on method load_targets in module ABCD_ML._Data:

load_targets(loc, col_name, data_type, filter_outlier_percent=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Loads in a set of subject ids and associated targets from a
    2.0_ABCD_Data_Explorer release formatted csv.
    See Notes for more info.
    
    Parameters
    ----------
    loc : str, Path or None
        The location of the csv file to load targets load from.
    
    col_name : str
        The name of the column to load.
    
    data_type : {'binary', 'categorical', 'ordinal', 'float'}
        The data type of the targets column.
        Shorthands for datatypes can be used as well
    
        - 'binary' or 'b' : Binary input
        - 'categorical' or 'c' : Categorical input
        - 'ordinal' or 'o' : Ordinal input
        - 'float' or 'f' : Float numerical input
    
        Datatypes are explained further in Notes.
    
    filter_outlier_percent : float, tuple or None, optional
        For float or ordina

Great, let's just load in sex as our target variable.

In [14]:
ML.load_targets(loc=test_target_loc,
                col_name='demo_sex_v2',
                data_type='b')

Loading  /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/Mental Health/ABCD Parent Demographics Survey.csv
More than two unique score values found, filtered all but [1. 2.]
Final shape:  (11866, 1)

Total valid (overlapping subjects / not in exclusions) subjects = 7298



If you read the verbose printout above, you'll notice that it says "More than two unique score values found, filtered all but [1. 2.]". This is because, by default, when a binary data type is passed, the data loader needs to make sure it loads in only two unique values. To handle cases where extra values exist, as here, all but the top two unique values by count are dropped. The message also shows which values were kept, in case an error was made, but here 1 and 2 are the correct sex values. If more than two values are desired, the categorical data type should be used.
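
The "keep the top two values by count" rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper keep_top_two is hypothetical, the input values are made up, and mapping the kept values to 0 and 1 in sorted order is a guess at how the loader encodes them.

```python
from collections import Counter


def keep_top_two(values):
    """Keep only the two most frequent values and encode them as 0 and 1.

    Assumes at least two distinct values are present.
    Returns (encoded_values, kept_values).
    """
    counts = Counter(values)
    kept = sorted(value for value, _ in counts.most_common(2))
    mapping = {kept[0]: 0, kept[1]: 1}  # assumed encoding: sorted order
    encoded = [mapping[v] for v in values if v in mapping]
    return encoded, kept


# Made-up raw column: mostly 1.0 and 2.0, plus two rare stray codes.
raw = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 2.0, 777.0]
encoded, kept = keep_top_two(raw)
print(kept, encoded)
```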

Let's look and see to make sure everything was loaded correctly.

In [15]:
ML.targets.head()

src_subject_id,targets
NDAR_INV0A4ZDYNL,0
NDAR_INV0BKE31EY,0
NDAR_INV0CBFTKR7,0
NDAR_INV0CCEN5K2,0
NDAR_INV0DC9BJZK,0


We can look into adding covars next. Here, co-variates aren't quite treated as typical co-variates; they are values we would like to be able to pass as additional input to the ML model if desired. This input is treated in a special way: specifically, covar input won't be scaled with any data scaler by default.

In [16]:
help(ML.load_covars)

Help on method load_covars in module ABCD_ML._Data:

load_covars(loc, col_names, data_types, dummy_code_categorical=True, filter_float_outlier_percent=None, standardize=True, normalize=False) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Load a covariate or covariates from a 2.0_ABCD_Data_Explorer
    release formatted csv.
    
    Parameters
    ----------
    loc : str, Path or None
        The location of the csv file to load co-variates load from.
    
    col_names : str or list
        The name(s) of the column(s) to load.
        Note: Must be in the same order as data types passed in.
    
    data_types : {'binary', 'categorical', 'ordinal', 'float'} or list of
        The data types of the different columns to load,
        in the same order as the column names passed in.
        Shorthands for datatypes can be used as well
    
        - 'binary' or 'b' : Binary input
        - 'categorical' or 'c' : Categorical input
        - 'ordinal' or 'o' : Ordinal input
        - 'f

In [17]:
ML.load_covars(loc=test_target_loc,
               col_names = 'demo_ed_v2',
               data_types = 'ordinal',
               standardize = False,
               normalize = True) 

Reading covariates from /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/Mental Health/ABCD Parent Demographics Survey.csv
load: demo_ed_v2

Total valid (overlapping subjects / not in exclusions) subjects = 7298



And check to see that it was loaded correctly (and normalized):

In [18]:
ML.covars.head()

src_subject_id,demo_ed_v2
NDAR_INV0A4ZDYNL,0.416667
NDAR_INV0BKE31EY,0.25
NDAR_INV0CBFTKR7,0.416667
NDAR_INV0CCEN5K2,0.25
NDAR_INV0DC9BJZK,0.5
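
The values above all lie between 0 and 1, which is what min-max scaling would produce. The sketch below shows that scaling on made-up ordinal codes; treating normalize = True as min-max scaling is an assumption, and min_max_normalize is a hypothetical helper rather than an ABCD_ML function.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw_codes = [1, 4, 7, 13]  # made-up ordinal codes, not real survey values
scaled = min_max_normalize(raw_codes)
print(scaled)
```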


Stratification values (strat) are the values on which we can optionally define custom validation / split behavior. Within this example, we are just going to make sure that all splits keep subjects with the same family id within the same fold, so let's load family id, after looking at the help function.

In [19]:
help(ML.load_strat)

Help on method load_strat in module ABCD_ML._Data:

load_strat(loc, col_names) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Load stratification values from a 2.0_ABCD_Data_Explorer
    release formatted csv.
    See Notes for more details on what stratification values are.
    
    Parameters
    ----------
    loc : str, Path or None
        The location of the csv file to load stratification vals from.
    
    col_names : str or list
        The name(s) of the column(s) to load.
    
    Notes
    ----------
    Stratification values are categorical variables which are loaded for the
    purpose of defining custom validation behavior.
    
    For example: Sex might be loaded here, and used later to ensure
    that any validation splits retain the same distribution of each sex.



In [20]:
ML.load_strat(loc=test_strat_loc,
              col_names='rel_family_id')

Reading stratification values from /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/Other Non-Imaging/ABCD ACS Post Stratification Weights.csv

Total valid (overlapping subjects / not in exclusions) subjects = 7296

Removing non overlapping + excluded subjects


In [21]:
ML.strat.head()

src_subject_id,rel_family_id
NDAR_INV9EVRB30H,86
NDAR_INVDTHBFTNZ,1831
NDAR_INVLPY5VAXY,1410
NDAR_INVFL02R0H4,8956
NDAR_INVP16EZY2C,3789


Great, looks good. Lastly, we can optionally load in a list of subject ids to exclude from the analysis, for whatever reason.

In [22]:
help(ML.load_exclusions)

Help on method load_exclusions in module ABCD_ML._Data:

load_exclusions(loc=None, exclusions=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Loads in a set of excluded subjects,
    from either a file or as directly passed in.
    
    Parameters
    ----------
    loc : str, Path or None, optional
        Location of a file to load in excluded subjects from.
        The file should be formatted as one subject per line.
        (default = None)
    
    exclusions : list, set, array-like or None, optional
        An explicit list of subjects to add to exclusions.
        (default = None)
    
    Notes
    ----------
    If default subject id behavior is set to False,
    reading subjects from a exclusion loc might not
    function as expected.



In [23]:
ML.load_exclusions(loc=test_exclusion_loc)

Total excluded subjects:  1137

Total valid (overlapping subjects / not in exclusions) subjects = 6622

Removing non overlapping + excluded subjects
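
The "Total valid" count printed after each loading step can be read as the size of the set of subjects present in everything loaded so far, minus the exclusions. A minimal sketch of that set logic follows; the helper name and the subject ids are hypothetical, and this is a reading of the printed message rather than ABCD_ML's actual code.

```python
def valid_subjects(exclusions, *subject_sets):
    """Subjects present in every loaded set and not listed in exclusions."""
    overlap = set.intersection(*(set(s) for s in subject_sets))
    return overlap - set(exclusions)


# Made-up subject ids for the four loaded pieces.
data_subjects = {"S1", "S2", "S3", "S4", "S5"}
target_subjects = {"S1", "S2", "S3", "S4"}
covar_subjects = {"S1", "S2", "S3", "S4", "S6"}
strat_subjects = {"S1", "S2", "S3", "S7"}
exclusions = {"S3", "S9"}

valid = valid_subjects(exclusions, data_subjects, target_subjects,
                       covar_subjects, strat_subjects)
print(sorted(valid))
```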


Great, we now have our data, targets, covars, strat and exclusions loaded (noting that the minimum requirements for running an ML experiment are just data or covars, plus targets, the rest being optional). The actual length of the script is also not as terrible as it seems, and once loading behavior is confirmed, verbose can even be turned off. To show this, we can re-load everything as above with verbose off. (Commented out, but you get the idea.)

In [24]:
#ML = ABCD_ML.ABCD_ML(verbose = False) # Reloading the ML object itself to reset everything.

#ML.load_name_map(loc = test_mapping_loc,
#                 source_name_col = "NDAR name",
#                 target_name_col = "REDCap name/NDA alias")

#ML.load_data(loc=[test_data_loc1, test_data_loc2],
#             dataset_type='explorer',
#             filter_outlier_percent=.0005, 
#             winsorize_val=.001)

#ML.load_targets(loc=test_target_loc,
#                col_name='demo_sex_v2',
#                data_type='b')

#ML.load_covars(loc=test_target_loc,
#               col_names = 'demo_ed_v2',
#               data_types = 'ordinal',
#               standardize = False,
#               normalize = True)

#ML.load_strat(loc=test_strat_loc,
#              col_names='rel_family_id')

#ML.load_exclusions(loc=test_exclusion_loc)

Let's continue. We will turn verbose back on, and then move on to defining our validation strategy (which is again optional, but as stated before, for this example we are going to keep like family ids within the same folds).

In [25]:
help(ML.define_validation_strategy)

Help on method define_validation_strategy in module ABCD_ML.ABCD_ML:

define_validation_strategy(groups=None, stratify=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Define a validation stratagy to be used during different train/test splits,
    in addition to model selection and model hyperparameter CV.
    See Notes for more info.
    
    Parameters
    ----------
    groups : str, list or None, optional
        In the case of str input, will assume the str to refer
        to a column key within the loaded strat data,
        and will assign it as a value to preserve groups by
        during any train/test or K-fold splits.
        If a list is passed, then each element should be a str,
        and they will be combined into all unique
        combinations of the elements of the list.
        (default = None)
    
    stratify : str, list or None, optional
        In the case of str input, will assume the str to refer
        to a column key within the loaded strat data,
    

So for group preserving behavior we are interested in supplying an argument for groups.

In [26]:
ML.verbose = True
ML.define_validation_strategy(groups='rel_family_id')

CV defined with group preserving behavior, over 5990 unique values.
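
Group-preserving behavior means that every subject sharing a group value (here a family id) lands in the same fold. The sketch below shows one simple way to build such folds in plain Python. It illustrates the idea only: group_k_fold is a hypothetical helper, the family ids are made up, and ABCD_ML's own splitting logic may differ.

```python
import random


def group_k_fold(groups, n_splits, seed=0):
    """Assign sample indices to folds so that no group spans two folds.

    groups: one group label per sample, in sample order.
    Returns a list of n_splits lists of sample indices.
    """
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    # Deal whole groups out to folds in turn.
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    folds = [[] for _ in range(n_splits)]
    for index, group in enumerate(groups):
        folds[fold_of_group[group]].append(index)
    return folds


# Made-up family ids; subjects 0/1 and 3/4 are siblings.
family_ids = [86, 86, 1831, 1410, 1410, 8956, 3789]
folds = group_k_fold(family_ids, n_splits=3)
for fold in folds:
    print(fold, sorted({family_ids[i] for i in fold}))
```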


Note that the number of unique values reported here (5990 above) is computed over the values loaded within ML.strat. In practice, splits take place only over the overlap of subjects minus loaded exclusions, so if strat holds subjects outside that overlap, the reported count can be larger than the number of groups actually used in the splits.

Lastly, before we get to modelling, we want to define a global train-test split, so that we can perform model exploration, parameter tuning, etc. on a training set, and leave a held-out testing set to eventually test with our final selected model.

In [27]:
help(ML.train_test_split)

Help on method train_test_split in module ABCD_ML.ABCD_ML:

train_test_split(test_size=None, test_loc=None, test_subjects=None, random_state=None) method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Define the overarching train / test split, highly reccomended.
    
    test_size: float, int or None, optional
        If float, should be between 0.0 and 1.0 and represent
        the proportion of the dataset to be included in the test split.
        If int, represents the absolute number (or target number) to
        include in the testing group.
        Set to None if using test_loc or test_subjects.
        (default = None)
    
    test_loc : str, Path or None, optional
        Location of a file to load in test subjects from.
        The file should be formatted as one subject per line.
        (default = None)
    
    test_subjects : list, set, array-like or None, optional
        An explicit list of subjects to constitute the testing set
        (default=None)
    
    random_state : 

In [28]:
ML.train_test_split(test_size=.25, # Let's be somewhat conservative, and use a size of .25
                    random_state=1)

Final data for modeling loaded shape: (6622, 1188)
Performed train/test split, train size: 4960 test size:  1662


Great. Because we set the validation strategy to preserve family structure within the folds, we know that no family id is in both the train and test set. For the paranoid, we can make sure of this.

In [29]:
train_ids = set(ML.strat['rel_family_id'].loc[ML.train_subjects])
test_ids = set(ML.strat['rel_family_id'].loc[ML.test_subjects])

print('Unique family ids in train: ', len(train_ids))
print('Unique family ids in test: ', len(test_ids))
print('Overlap : ', len(train_ids.intersection(test_ids)))

Unique family ids in train:  4492
Unique family ids in test:  1498
Overlap :  0


Now we can proceed to modeling.

The main function we use here is Evaluate. We can look at its docstring, but from a very high level, this is the function we use to test different experimental setups.

In [30]:
help(ML.Evaluate)

Help on method Evaluate in module ABCD_ML._ML:

Evaluate(model_type, problem_type='default', data_scaler='default', n_splits='default', n_repeats='default', int_cv='default', metric='default', class_weight='default', n_jobs='default', n_iter='default', random_state='default', extra_params='default') method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Class method to be called during the model selection phase.
    Used to evaluated different combination of models and scaling, ect...
    
    Parameters
    ----------
    model_type : str or list of str,
        Each string refers to a type of model to train.
        If a list of strings is passed then an ensemble model
        will be created over all individual models.
        For a full list of supported options call:
        self.show_model_types(), with optional problem type parameter.
    
    problem_type : {'regression', 'binary', 'categorical', 'default'}, optional
    
        - 'regression' : For ML on float or ordinal target data


Let's check what different model types we have available for binary first.

In [31]:
ML.show_model_types(problem_type='binary')

Note: gs and rs are  Grid Search and Random Search
Models with gs or rs will have their hyper-parameters tuned accordingly.

Problem Type: binary
----------------------
Avaliable models: 

Model str indicator:  dt classifier
Model object:  <class 'sklearn.tree.tree.DecisionTreeClassifier'>

Model str indicator:  dt classifier gs
Model object:  <class 'sklearn.model_selection._search.GridSearchCV'>

Model str indicator:  gaussian nb
Model object:  <class 'sklearn.naive_bayes.GaussianNB'>

Model str indicator:  knn classifier
Model object:  <class 'sklearn.neighbors.classification.KNeighborsClassifier'>

Model str indicator:  knn classifier gs
Model object:  <class 'sklearn.model_selection._search.GridSearchCV'>

Model str indicator:  lgbm classifier
Model object:  <class 'lightgbm.sklearn.LGBMClassifier'>

Model str indicator:  lgbm classifier rs
Model object:  <class 'sklearn.model_selection._search.RandomizedSearchCV'>

Model str indicator:  logistic
Model object:  <class 'sklearn.lin

We can also set some default ML params for some of the settings that we will be keeping the same.

In [32]:
help(ML.set_default_ML_params)

Help on method set_default_ML_params in module ABCD_ML._ML:

set_default_ML_params(problem_type='default', metric='default', data_scaler='default', n_splits='default', n_repeats='default', int_cv='default', class_weight='default', n_jobs='default', n_iter='default', random_state='default', extra_params='default') method of ABCD_ML.ABCD_ML.ABCD_ML instance
    Sets the self.default_ML_params dictionary with user passed or default
    values. In general, if any argument is left as 'default' and it has
    not been previously defined, it will be set to a default value,
    sometimes passed on other values.
    See notes for rationale behind default ML params.
    
    Parameters
    ----------
    problem_type : {'regression', 'binary', 'categorical', 'default'}, optional
    
        - 'regression' : For ML on float or ordinal target data
        - 'binary' : For ML on binary target data
        - 'categorical' : For ML on categorical target data,
                          as either mult

The parameters for setting default params are mostly the same as those passed to Evaluate or Test. Importantly, by defining defaults, we define the value to be used if no value is passed to a given argument in Evaluate or Test.
Let's set some.

In [33]:
ML.set_default_ML_params(problem_type = 'binary',
                         metric = 'macro roc auc',
                         data_scaler = 'standard',
                         n_splits = 3,
                         n_repeats = 2,
                         int_cv = 3,
                         class_weight = 'balanced',
                         n_jobs = 8,
                         n_iter = 20)

No default random state passed, using class random state value of None
No default extra params passed, set to empty dict
Default params set.
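
The 'default' convention can be pictured as a simple lookup: any argument left as the string 'default' falls back to the stored default, and anything else overrides it for that call. The sketch below is a hypothetical stand-in for that behavior, not ABCD_ML's implementation; resolve_params is an invented name and only a few of the parameters are shown.

```python
def resolve_params(defaults, **passed):
    """Fill in stored defaults for any argument left as 'default'."""
    resolved = dict(defaults)
    for name, value in passed.items():
        if value != "default":
            resolved[name] = value  # explicit value overrides the default
    return resolved


# A subset of the defaults set in the cell above.
stored_defaults = {
    "problem_type": "binary",
    "metric": "macro roc auc",
    "n_splits": 3,
    "n_repeats": 2,
}

# One argument overridden, one explicitly left as 'default'.
params = resolve_params(stored_defaults, n_splits=5, metric="default")
print(params)
```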



Now that we have default parameters set up, we can run an evaluation with the random forest classifier as just:

In [34]:
scores = ML.Evaluate(model_type='rf classifier')

Running Evaluate with:
model_type = rf classifier
problem_type = binary
metric = macro roc auc
data_scaler = standard
n_splits = 3
n_repeats = 2
int_cv = 3
class_weight = balanced
n_jobs = 8
n_iter = 20
random_state = None
extra_params = {}

Macro mean score:  0.8124107227126789
Macro std in score:  0.0005335189642660665
Micro mean score:  0.8124107227126788
Micro std in score:  0.0080002119314011


In [35]:
scores = ML.Evaluate(model_type='logistic')

Running Evaluate with:
model_type = logistic
problem_type = binary
metric = macro roc auc
data_scaler = standard
n_splits = 3
n_repeats = 2
int_cv = 3
class_weight = balanced
n_jobs = 8
n_iter = 20
random_state = None
extra_params = {}

Macro mean score:  0.9239552196541189
Macro std in score:  0.0
Micro mean score:  0.9239552196541189
Micro std in score:  0.011150051065993628


So the simple logistic regression does a lot better than a single random forest. Let's also compare with a random forest with hyperparameter tuning via random search. Let's also use extra params to show how we can override the default value of verbose within the sklearn random search object, and see the progress.

In [37]:
# This will take a while, feel free to comment it out or just skip it
#extra_params = {'rf classifier rs': {'verbose': 1}}
#scores = ML.Evaluate(model_type='rf classifier rs', extra_params=extra_params)

Okay, what about using logistic CV instead of straight logistic regression?

In [38]:
scores = ML.Evaluate(model_type='logistic cv')

Running Evaluate with:
model_type = logistic cv
problem_type = binary
metric = macro roc auc
data_scaler = standard
n_splits = 3
n_repeats = 2
int_cv = 3
class_weight = balanced
n_jobs = 8
n_iter = 20
random_state = None
extra_params = {}

Macro mean score:  0.9383524635356503
Macro std in score:  0.0
Micro mean score:  0.9383524635356503
Micro std in score:  0.009194294299791721


In [39]:
scores

[0.9488978903824488,
 0.9264920304027544,
 0.9396674698217475,
 0.9488978903824488,
 0.9264920304027544,
 0.9396674698217475]
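
One reading that is consistent with the numbers printed above: with n_splits = 3 and n_repeats = 2 there are six scores; the micro statistics are taken over all six, while the macro statistics are taken over the per-repeat means. This is inferred from the output, not from ABCD_ML's source, and the summarize helper below is hypothetical. It uses the population standard deviation.

```python
from statistics import mean, pstdev


def summarize(scores, n_repeats):
    """Macro stats over per-repeat means, micro stats over all fold scores."""
    n_splits = len(scores) // n_repeats
    repeats = [scores[i * n_splits:(i + 1) * n_splits] for i in range(n_repeats)]
    repeat_means = [mean(r) for r in repeats]
    return {
        "macro_mean": mean(repeat_means),
        "macro_std": pstdev(repeat_means),
        "micro_mean": mean(scores),
        "micro_std": pstdev(scores),
    }


# The six scores returned by the 'logistic cv' run above.
cv_scores = [0.9488978903824488, 0.9264920304027544, 0.9396674698217475,
             0.9488978903824488, 0.9264920304027544, 0.9396674698217475]
summary = summarize(cv_scores, n_repeats=2)
for name, value in summary.items():
    print(name, value)
```

Because random_state was None for the class and both repeats produced identical fold scores, the per-repeat means coincide, which would explain a macro std of 0.0.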
