This notebook shows an example of categorical classification and points out some of its nuances. Data loading and other aspects of the library are covered only briefly here, without much explanation; see another notebook for fuller explanations.

The categorical target we load is demo_relig_v2, which has a number of problems, the most severe being one very dominant class and many classes with very little representation. This notebook does not represent how we would actually approach a problem like this; it is only meant to display the categorical capabilities of the library. (In practice, a good choice for a problem like this might be to treat it as binary: the major class vs. everything else.)
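
As a rough illustration of the "major class vs. everything else" idea, here is a minimal sketch in plain Python. The label counts are hypothetical stand-ins, not the real demo_relig_v2 distribution, and `to_major_vs_rest` is an illustrative helper rather than part of ABCD_ML.

```python
from collections import Counter

# Hypothetical class labels (NOT the real demo_relig_v2 data):
# one dominant class (3) and several sparsely populated ones.
labels = [3] * 70 + [0] * 12 + [5] * 10 + [1] * 5 + [7] * 2 + [9] * 1


def to_major_vs_rest(values):
    """Return (major_class, binary_labels): 1 for the most frequent class, else 0."""
    major_class, _ = Counter(values).most_common(1)[0]
    return major_class, [1 if v == major_class else 0 for v in values]


major_class, binary_labels = to_major_vs_rest(labels)
print("major class:", major_class)
print("binary counts:", dict(Counter(binary_labels)))
```

The resulting 0/1 target could then be handled as an ordinary binary problem, which avoids classes with only a handful of samples.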

We also only load the first half of the structural data, again just as an example.

In [1]:
import ABCD_ML

In [2]:
nda_dr = '/mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/'
test_mapping_loc = nda_dr + 'ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv'
test_data_loc = nda_dr + 'MRI/ABCD sMRI Part 1.csv'
test_target_loc = nda_dr + 'Mental Health/ABCD Parent Demographics Survey.csv'
test_exclusion_loc = '/home/sage/bader_things/invalid_pguids.txt'

In [3]:
ML = ABCD_ML.ABCD_ML(default_dataset_type='explorer')

ML.Load_Name_Map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")

ML.Load_Data(loc=test_data_loc,
             dataset_type='explorer',
             filter_outlier_percent=.0005, 
             winsorize_val=.001)

ML.Load_Targets(loc=test_target_loc,
                col_name='demo_relig_v2',
                data_type='c')
ML.Load_Exclusions(loc=test_exclusion_loc)

ABCD_ML object initialized
Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv assumed to be dataset type: explorer
Loaded map file
Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 1.csv assumed to be dataset type: explorer
dropped ['abcd_smrip101_id', 'dataset_id', 'smri_visitid'] columns by default due to dataset type
Dropped 0 columns, per drop_keys argument
Dropped 0 cols for all missing values
Dropped 522 rows for missing values
Dropped rows with missing data
Filtered data for outliers with value:  0.0005
Winsorized data with value:  0.001
loaded shape:  (8725, 749)

Total valid (overlapping subjects / not in exclusions) subjects = 8725

Loading targets!
Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/Mental Health/ABCD Parent Demographics Survey.csv assumed to be dataset type: explorer
Encoded to 17 categories
Final shape:  (11310, 17)

Total valid (overlapping subjects / not 

In [4]:
ML.Define_Validation_Strategy(stratify=ML.original_targets_key)

CV defined with stratifying behavior, over 17 unique values.


We define our validation strategy to stratify over the target keys because some of the religions have very little representation, and errors could be thrown if some value is not represented in one of the validation folds. This can still happen if we use too many internal CV splits.
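
To make the point about rare classes and fold counts concrete, the sketch below deals each class out to folds round-robin and reports which classes are missing from each fold. It is a simplified stand-in for stratified splitting, not the library's implementation (in practice something like scikit-learn's `StratifiedKFold` would typically be used), and the labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % n_splits].append(idx)
    return folds


def classes_missing_per_fold(labels, folds):
    """For each fold, return the set of classes that have no sample in it."""
    all_classes = set(labels)
    return [all_classes - {labels[i] for i in fold} for fold in folds]


# Hypothetical labels: class "c" has only 3 samples.
labels = ["a"] * 20 + ["b"] * 6 + ["c"] * 3

for n_splits in (2, 3, 5):
    folds = stratified_folds(labels, n_splits)
    print(n_splits, "splits, missing per fold:", classes_missing_per_fold(labels, folds))
```

Even with stratification, a class with fewer samples than there are folds cannot appear in every fold, which is why nesting many internal CV splits inside the outer folds can still run into missing classes.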

In [5]:
ML.Train_Test_Split(test_size=.2,
                    random_state=9)

Final data for modeling loaded shape: (7547, 766)
Performed train/test split, train size: 6037 test size:  1510


In [6]:
ML.Set_Default_ML_Params(problem_type = 'categorical',
                         data_scaler = 'standard',
                         n_splits = 2,
                         n_repeats = 1,
                         int_cv = 2,
                         class_weight = 'balanced',
                         n_jobs = 4,
                         n_iter = 2,
                         random_state = 9)

No default metric passed, set to, weighted roc auc based on default problem type.
No default extra params passed, set to empty dict
Default params set.



In [7]:
ML.Show_Model_Types(problem_type='categorical')

Note: gs and rs are  Grid Search and Random Search
Models with gs or rs will have their hyper-parameters tuned accordingly.

Problem Type: categorical
----------------------
Avaliable models: 

Model str indicator:  dt classifier
(MultiClass)
Model object:  <class 'sklearn.tree.tree.DecisionTreeClassifier'>

Model str indicator:  dt classifier gs
(MultiClass)
Model object:  <class 'sklearn.model_selection._search.GridSearchCV'>

Model str indicator:  elastic net logistic
(MultiClass)
Model object:  <class 'sklearn.linear_model.logistic.LogisticRegression'>

Model str indicator:  elastic net logistic rs
(MultiClass)
Model object:  <class 'sklearn.model_selection._search.RandomizedSearchCV'>

Model str indicator:  gaussian nb
(MultiClass)
Model object:  <class 'sklearn.naive_bayes.GaussianNB'>

Model str indicator:  gp classifier
(MultiClass)
Model object:  <class 'sklearn.gaussian_process.gpc.GaussianProcessClassifier'>

Model str indicator:  knn classifier
(MultiClass)
Model object:  <

In [8]:
bunch_of_metrics = ['macro roc auc', 'macro f1', 'samples roc auc', 'weighted recall',
                    'log', 'macro jaccard', 'weighted ap']

# Ensemble of all multilabel compatible classifiers
ML.Evaluate(model_type = ['knn', 'rf'],
            metric = bunch_of_metrics,
            data_scaler = ['standard', 'minmax'])

Running Evaluate with:
model_type = ['knn', 'rf']
problem_type = categorical
metric = ['macro roc auc', 'macro f1', 'samples roc auc', 'weighted recall', 'log', 'macro jaccard', 'weighted ap']
data_scaler = ['standard', 'minmax']
n_splits = 2
n_repeats = 1
int_cv = 2
class_weight = balanced
n_jobs = 4
n_iter = 2
random_state = 9
extra_params = {}



  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Metric:  macro roc auc
Macro mean score:  0.5682030503113262
Macro std in score:  0.0
Micro mean score:  0.5682030503113262
Micro std in score:  0.005678437035310513

Metric:  macro f1
Macro mean score:  0.01169100501596702
Macro std in score:  0.0
Micro mean score:  0.01169100501596702
Micro std in score:  1.744405403755235e-05

Metric:  samples roc auc
Macro mean score:  0.8148876200402093
Macro std in score:  0.0
Micro mean score:  0.8148876200402093
Micro std in score:  0.0008680203048471169

Metric:  weighted recall
Macro mean score:  0.11031994039942832
Macro std in score:  0.0
Micro mean score:  0.11031994039942832
Micro std in score:  0.00016446570541574224

Metric:  log
Macro mean score:  3.058345971680958
Macro std in score:  0.0
Micro mean score:  3.058345971680958
Micro std in score:  0.0830879798636861

Metric:  macro jaccard
Macro mean score:  0.006490486765711914
Macro std in score:  0.0
Micro mean score:  0.006490486765711914
Micro std in score:  1.0752960181762816e-05

array([[0.56252461, 0.01167356, 0.8140196 , 0.11015547, 3.14143395,
        0.00647973, 0.19356672],
       [0.57388149, 0.01170845, 0.81575564, 0.11048441, 2.97525799,
        0.00650124, 0.19542243]])

The above is a good example of how samples roc auc can be a terrible choice for categorical problems with big class imbalances: it comes out at about 0.81, while macro roc auc is only about 0.57 and macro f1 about 0.01.
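
One plausible reason for this gap (the notebook does not spell it out, so treat this as an assumption) is that a samples-averaged ROC AUC on one-hot targets only asks, per subject, how highly the true class ranks among all class scores. The sketch below uses hypothetical class counts and a "model" that ignores the features entirely and outputs the class frequencies for everyone.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical class counts with one dominant class (not the real data).
counts = [70, 12, 10, 5, 2, 1]
n_classes = len(counts)
labels = [c for c, k in enumerate(counts) for _ in range(k)]
total = len(labels)

# Uninformative scores: every sample gets the same vector of class frequencies.
prior = [k / total for k in counts]
scores = [prior for _ in labels]

# Samples average: one AUC per sample (true class vs. all other classes), then the mean.
samples_auc = sum(
    pairwise_auc([row[y]], [row[j] for j in range(n_classes) if j != y])
    for row, y in zip(scores, labels)
) / total

# Macro average: one AUC per class (its samples vs. all other samples), then the mean.
macro_auc = sum(
    pairwise_auc(
        [row[c] for row, y in zip(scores, labels) if y == c],
        [row[c] for row, y in zip(scores, labels) if y != c],
    )
    for c in range(n_classes)
) / n_classes

print("samples-averaged AUC:", round(samples_auc, 3))
print("macro-averaged AUC:  ", round(macro_auc, 3))
```

With these hypothetical counts the feature-blind scores reach a samples-averaged AUC of 0.88 while the macro-averaged AUC stays at chance (0.5), so a high samples roc auc by itself says little about whether the model has learned anything.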

We are also receiving warnings about the F1 score. This is because within some of our folds the model does not predict any positive samples for some of the classes, which is a direct result of this categorical problem being set up poorly and of our making no real effort to explicitly deal with the heavy class imbalance.
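
The following hand-rolled sketch shows where such a warning comes from: when a class is never predicted, its precision is 0/0, and the score is set to 0 (scikit-learn does the same and emits a warning). The labels and predictions are hypothetical, and `per_class_f1` is an illustrative helper, not a library function.

```python
def per_class_f1(y_true, y_pred, classes):
    """Compute F1 per class, treating an undefined precision or recall (0/0) as 0."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # No predicted samples for class c means tp + fp == 0: precision is undefined.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical fold: class 2 exists in the targets but is never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

scores = per_class_f1(y_true, y_pred, classes=[0, 1, 2])
macro_f1 = sum(scores.values()) / len(scores)
print("per-class F1:", scores)
print("macro F1:", round(macro_f1, 4))
```

Because macro F1 weights every class equally, each never-predicted class contributes a zero, which helps explain the very low macro f1 values above.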

Also of note: samples roc auc does not work for multiclass classifiers, an ensemble of which we show next, so we delete it from bunch_of_metrics.
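
For context, the difference between the two target layouts can be sketched as follows: a multilabel-style target is a one-hot row per subject (like the 17 encoded columns loaded earlier), while a multiclass target is a single class index per subject. How ABCD_ML converts between them internally is not shown in this notebook; `one_hot_to_class_index` is a hypothetical helper.

```python
def one_hot_to_class_index(rows):
    """Convert one-hot rows (exactly one 1 per row) to integer class labels."""
    labels = []
    for row in rows:
        if sum(row) != 1:
            raise ValueError("expected exactly one active class per row")
        labels.append(row.index(1))
    return labels


# Hypothetical one-hot targets with 3 classes.
one_hot = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

class_index = one_hot_to_class_index(one_hot)
print(class_index)
```

With a single label per subject there is no per-subject set of labels to rank, so a samples-averaged metric has nothing to average over.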

In [9]:
# Remove 'samples roc auc' (index 2), which is not supported for multiclass classifiers
del bunch_of_metrics[2]


ML.Evaluate(model_type = ['logistic', 'knn', 'rf'],
            metric = bunch_of_metrics,
            data_scaler = ['standard', 'minmax'])

Running Evaluate with:
model_type = ['logistic', 'knn', 'rf']
problem_type = categorical
metric = ['macro roc auc', 'macro f1', 'weighted recall', 'log', 'macro jaccard', 'weighted ap']
data_scaler = ['standard', 'minmax']
n_splits = 2
n_repeats = 1
int_cv = 2
class_weight = balanced
n_jobs = 4
n_iter = 2
random_state = 9
extra_params = {}

Not all model types passed have multilabel support! Using multiclass instead.


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Metric:  multiclass macro roc auc
Macro mean score:  0.5892976202210078
Macro std in score:  0.0
Micro mean score:  0.5892976202210078
Micro std in score:  0.0014645881841193753

Metric:  macro f1
Macro mean score:  0.09305446096518839
Macro std in score:  0.0
Micro mean score:  0.09305446096518839
Micro std in score:  8.128611198382485e-05

Metric:  weighted recall
Macro mean score:  0.23587235749104246
Macro std in score:  0.0
Micro mean score:  0.23587235749104246
Micro std in score:  0.004286425175183142

Metric:  log
Macro mean score:  2.5245440315301844
Macro std in score:  0.0
Micro mean score:  2.5245440315301844
Micro std in score:  0.023857479040590546

Metric:  macro jaccard
Macro mean score:  0.054250638680474784
Macro std in score:  0.0
Micro mean score:  0.054250638680474784
Micro std in score:  0.00043409052668958586

Metric:  multiclass weighted average precision
Macro mean score:  0.22836896342782154
Macro std in score:  0.0
Micro mean score:  0.22836896342782154
Micr

array([[0.59076221, 0.09297317, 0.24015878, 2.50068655, 0.05468473,
        0.23110573],
       [0.58783303, 0.09313575, 0.23158593, 2.54840151, 0.05381655,
        0.22563219]])

         targets_0    targets_1    targets_2    targets_3    targets_4  \
count  7547.000000  7547.000000  7547.000000  7547.000000  7547.000000   
mean      0.110242     0.068372     0.013648     0.195044     0.021068   
std       0.313213     0.252399     0.116032     0.396261     0.143620   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.000000     0.000000     0.000000     0.000000   
50%       0.000000     0.000000     0.000000     0.000000     0.000000   
75%       0.000000     0.000000     0.000000     0.000000     0.000000   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

         targets_5    targets_6    targets_7    targets_8    targets_9  \
count  7547.000000  7547.000000  7547.000000  7547.000000  7547.000000   
mean      0.095270     0.006095     0.007420     0.003313     0.003048   
std       0.293606     0.077838     0.085826     0.057463     0.055124   
min       0.000000     0.000000     0