This notebook walks through an example of categorical classification and points out some of its nuances. Data loading and other aspects of the library are covered only briefly here; see the Example Binary Classification notebook for a fuller explanation of what the different loaders do.

In [1]:
import ABCD_ML

In [2]:
nda_dr = '/mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/'
test_mapping_loc = nda_dr + 'ABCD_Release_ Notes_Data_Release_ 2.0/22. ABCD_Release_2.0_mapping_r.csv'
test_data_loc = nda_dr + 'MRI/ABCD sMRI Part 1.csv'
test_target_loc = nda_dr + 'Mental Health/ABCD Parent Demographics Survey.csv'
test_exclusion_loc = '/home/sage/bader_things/invalid_pguids.txt'

In [3]:
ML = ABCD_ML.ABCD_ML(default_dataset_type='explorer')

ML.load_name_map(loc = test_mapping_loc,
                 source_name_col = "NDAR name",
                 target_name_col = "REDCap name/NDA alias")

ML.load_data(loc=test_data_loc,
             dataset_type='explorer',
             filter_outlier_percent=.0005, 
             winsorize_val=.001)

ML.load_targets(loc=test_target_loc,
                col_name='demo_relig_v2',
                data_type='c')
ML.load_exclusions(loc=test_exclusion_loc)

ABCD_ML object initialized
Error: One or both provided column names do not exist!
Loaded map file
Loading /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/MRI/ABCD sMRI Part 1.csv assumed to be dataset type: explorer
dropped ['abcd_smrip101_id', 'dataset_id', 'smri_visitid'] columns by default due to dataset type
Dropped 0 columns, per drop_keys argument
Dropped 0 cols for all missing values
Dropped 522 rows for missing values
Dropped rows with missing data
Filtered data for outliers with value:  0.0005
Winsorized data with value:  0.001
loaded shape:  (8725, 749)

Total valid (overlapping subjects / not in exclusions) subjects = 8725

Loading  /mnt/sdb2/2.0_ABCD_Data_Explorer/2.0_NDA_Data/Mental Health/ABCD Parent Demographics Survey.csv
Encoded to 17 categories
Final shape:  (11310, 17)

Total valid (overlapping subjects / not in exclusions) subjects = 8336

Total excluded subjects:  1137

Total valid (overlapping subjects / not in exclusions) subjects = 7547

Removing non overlapping +
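
The `Encoded to 17 categories` message and the 17-column final shape suggest that the categorical target is expanded into one indicator column per category. A minimal sketch of that kind of encoding follows, assuming one-hot encoding; the `one_hot` helper and the toy answers are hypothetical and not part of ABCD_ML.

```python
import numpy as np

def one_hot(values):
    """Return (categories, matrix) with one indicator column per unique value."""
    # np.unique sorts the categories and gives each value its category index
    categories, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the indicator vector for category i
    matrix = np.eye(len(categories), dtype=int)[codes]
    return categories, matrix

# Toy survey answers (made up for illustration)
answers = np.array([3, 1, 3, 7, 1])
categories, encoded = one_hot(answers)
print(categories)     # unique answer codes, sorted
print(encoded.shape)  # (number of subjects, number of categories)
```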

In [4]:
ML.define_validation_strategy(stratify=ML.original_targets_key)

CV defined with stratifying behavior, over 17 unique values.


We define our validation strategy to stratify over the target keys because some of the religions have very little representation, and we could run into errors if a value is not represented in one of the validation folds. We still might if we use too many internal CV folds, since every extra level of splitting leaves fewer samples of the rare classes in each fold.
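
To illustrate why stratifying matters with rare classes, the sketch below compares plain and stratified 2-fold splitting on a toy target using scikit-learn directly. It is not ABCD_ML's internal code; the helper `classes_per_train_fold` and the toy labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

def classes_per_train_fold(splitter, X, y):
    """List the classes present in the training part of each fold."""
    return [sorted(int(c) for c in np.unique(y[train_idx]))
            for train_idx, _ in splitter.split(X, y)]

# Toy target: 18 samples of a common class and only 2 of a rare class
y = np.array([0] * 18 + [1] * 2)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

plain = classes_per_train_fold(KFold(n_splits=2), X, y)
stratified = classes_per_train_fold(StratifiedKFold(n_splits=2), X, y)

print(plain)       # the rare class is missing from one training fold
print(stratified)  # both classes appear in every training fold
```

With only two samples of the rare class, two folds is the most that can still place one sample in each fold, which is why adding more internal splits can bring the problem back.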

In [5]:
ML.train_test_split(test_size=.2,
                    random_state=9)

Final data for modeling loaded shape: (7547, 766)
Performed train/test split, train size: 6037 test size:  1510
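
The reported sizes are consistent with rounding the test set up to a whole number of subjects and assigning the rest to training. The sketch below assumes that rounding rule (it is how scikit-learn's `train_test_split` handles a fractional `test_size`); whether ABCD_ML delegates to that function is an assumption, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_subjects, test_size):
    """Return (n_train, n_test) when a fractional test size is rounded up."""
    n_test = math.ceil(n_subjects * test_size)
    return n_subjects - n_test, n_test

print(split_sizes(7547, 0.2))  # (6037, 1510), matching the output above
```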


In [6]:
ML.set_default_ML_params(problem_type = 'categorical',
                         data_scaler = 'standard',
                         n_splits = 2,
                         n_repeats = 1,
                         int_cv = 2,
                         class_weight = 'balanced',
                         n_jobs = 4,
                         n_iter = 2,
                         random_state = 9)

No default metric passed, set to, weighted roc auc based on default problem type.
No default extra params passed, set to empty dict
Default params set.
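
In scikit-learn, `class_weight='balanced'` weights each class by `n_samples / (n_classes * class_count)`, so rare classes count for more during fitting. The sketch below reproduces that formula on a made-up label vector. It assumes ABCD_ML passes the setting through to the underlying scikit-learn models, which this notebook does not confirm; `balanced_class_weights` is a hypothetical helper.

```python
import numpy as np

def balanced_class_weights(y):
    """Compute n_samples / (n_classes * class_count) for each class."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels: class 0 is common, class 2 is rare
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
print(balanced_class_weights(y))
```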



In [7]:
ML.show_model_types(problem_type='categorical')

Note: gs and rs are  Grid Search and Random Search
Models with gs or rs will have their hyper-parameters tuned accordingly.

Problem Type: categorical
----------------------
Avaliable models: 

Model str indicator:  dt classifier
(MultiClass)
Model object:  <class 'sklearn.tree.tree.DecisionTreeClassifier'>

Model str indicator:  dt classifier gs
(MultiClass)
Model object:  <class 'sklearn.model_selection._search.GridSearchCV'>

Model str indicator:  gaussian nb
(MultiClass)
Model object:  <class 'sklearn.naive_bayes.GaussianNB'>

Model str indicator:  knn classifier
(MultiClass)
Model object:  <class 'sklearn.neighbors.classification.KNeighborsClassifier'>

Model str indicator:  knn classifier gs
(MultiClass)
Model object:  <class 'sklearn.model_selection._search.GridSearchCV'>

Model str indicator:  light gbm classifier
(MultiClass)
Model object:  <class 'lightgbm.sklearn.LGBMClassifier'>

Model str indicator:  light gbm classifier rs
(MultiClass)
Model object:  <class 'sklearn.model

In [8]:
bunch_of_metrics = ['macro roc auc', 'macro f1', 'samples roc auc', 'weighted recall',
                    'log', 'macro jaccard', 'weighted ap']

# Ensemble of all multilabel compatible classifiers
ML.Evaluate(model_type = ['knn', 'rf'],
            metric = bunch_of_metrics,
            data_scaler = ['standard', 'minmax'])

Running Evaluate with:
model_type = ['knn', 'rf']
problem_type = categorical
metric = ['macro roc auc', 'macro f1', 'samples roc auc', 'weighted recall', 'log', 'macro jaccard', 'weighted ap']
data_scaler = ['standard', 'minmax']
n_splits = 2
n_repeats = 1
int_cv = 2
class_weight = balanced
n_jobs = 4
n_iter = 2
random_state = 9
extra_params = {}



  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)



Metric:  macro roc auc
Macro mean score:  0.5682030503113262
Macro std in score:  0.0
Micro mean score:  0.5682030503113262
Micro std in score:  0.005678437035310513

Metric:  macro f1
Macro mean score:  0.01169100501596702
Macro std in score:  0.0
Micro mean score:  0.01169100501596702
Micro std in score:  1.744405403755235e-05

Metric:  samples roc auc
Macro mean score:  0.8148876200402093
Macro std in score:  0.0
Micro mean score:  0.8148876200402093
Micro std in score:  0.0008680203048471169

Metric:  weighted recall
Macro mean score:  0.11031994039942832
Macro std in score:  0.0
Micro mean score:  0.11031994039942832
Micro std in score:  0.00016446570541574224

Metric:  log
Macro mean score:  3.058345971680958
Macro std in score:  0.0
Micro mean score:  3.058345971680958
Micro std in score:  0.0830879798636861

Metric:  macro jaccard
Macro mean score:  0.006490486765711914
Macro std in score:  0.0
Micro mean score:  0.006490486765711914
Micro std in score:  1.0752960181762816e-05

array([[0.56252461, 0.01167356, 0.8140196 , 0.11015547, 3.14143395,
        0.00647973, 0.19356672],
       [0.57388149, 0.01170845, 0.81575564, 0.11048441, 2.97525799,
        0.00650124, 0.19542243]])

In [10]:
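# Remove 'samples roc auc' (index 2) from the metric list
del bunch_of_metrics[2]

# Adding 'logistic' to the ensemble; see the multiclass message in the output
ML.Evaluate(model_type = ['logistic', 'knn', 'rf'],
            metric = bunch_of_metrics,
            data_scaler = ['standard', 'minmax'])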

Running Evaluate with:
model_type = ['logistic', 'knn', 'rf']
problem_type = categorical
metric = ['macro roc auc', 'macro f1', 'weighted recall', 'log', 'macro jaccard', 'weighted ap']
data_scaler = ['standard', 'minmax']
n_splits = 2
n_repeats = 1
int_cv = 2
class_weight = balanced
n_jobs = 4
n_iter = 2
random_state = 9
extra_params = {}

Not all model types passed have multilabel support! Using multiclass instead.


  'precision', 'predicted', average, warn_for)



Metric:  multiclass macro roc auc
Macro mean score:  0.6015382030604486
Macro std in score:  0.0
Micro mean score:  0.6015382030604486
Micro std in score:  0.0005533275676887772

Metric:  macro f1
Macro mean score:  0.09591509987406238
Macro std in score:  0.0
Micro mean score:  0.09591509987406238
Micro std in score:  0.00048704272832821027

Metric:  weighted recall
Macro mean score:  0.22991010525146627
Macro std in score:  0.0
Micro mean score:  0.22991010525146627
Micro std in score:  0.0036327329886925303

Metric:  log
Macro mean score:  2.209158003592969
Macro std in score:  0.0
Micro mean score:  2.209158003592969
Micro std in score:  0.006312540251169674

Metric:  macro jaccard
Macro mean score:  0.055771450946402
Macro std in score:  0.0
Micro mean score:  0.055771450946402
Micro std in score:  0.0007542971937540416

Metric:  multiclass weighted average precision
Macro mean score:  0.2278608885966879
Macro std in score:  0.0
Micro mean score:  0.2278608885966879
Micro std in 

array([[0.60098488, 0.09640214, 0.23354284, 2.20284546, 0.05652575,
        0.22914185],
       [0.60209153, 0.09542806, 0.22627737, 2.21547054, 0.05501715,
        0.22657992]])

In [11]:
#ML.Evaluate(model_type = 'knn classifier',
#           metric = 'hamming'
#          )

In [12]:
#ML.Evaluate(model_type = 'knn classifier gs',
#           metric = 'hamming'
#          )

In [13]:
#ML.Evaluate(model_type = 'rf',
#            metric = 'micro f1')