[WIP] Adding predict_proba ability to `bootstrap_632` functions #700

adam2392 · 2020-07-01T14:28:12Z

Description

Add the ability to pass a scoring_func to bootstrap_point632_score that depends on probability predictions rather than label predictions. This would, for example, allow roc_auc_score to be used with the bootstrapping method.
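To illustrate the motivation, here is a minimal sketch (not the mlxtend implementation) of how a predict_proba flag could switch what gets handed to the scoring function. `ToyEstimator`, `pairwise_auc`, and `score` are hypothetical stand-ins for a fitted scikit-learn-style classifier, for roc_auc_score, and for the scoring step inside the bootstrap loop; only NumPy is assumed.

```python
import numpy as np


class ToyEstimator:
    """Hypothetical stand-in for a fitted binary classifier."""

    def predict_proba(self, X):
        # Logistic function of the first feature, used as P(class 1).
        p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        # Hard labels: threshold P(class 1) at 0.5.
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def score(est, X, y, scoring_func, predict_proba=False):
    """Score with hard labels by default, or with P(class 1) if predict_proba=True."""
    if predict_proba:
        y_pred = est.predict_proba(X)[:, 1]
    else:
        y_pred = est.predict(X)
    return scoring_func(y, y_pred)


X = np.array([[-2.0], [-0.5], [0.3], [0.1], [1.5]])
y = np.array([0, 0, 1, 0, 1])
est = ToyEstimator()

auc_from_labels = score(est, X, y, pairwise_auc)
auc_from_proba = score(est, X, y, pairwise_auc, predict_proba=True)
print(auc_from_labels, auc_from_proba)
```

On this toy data the probabilities rank every positive above every negative, while the thresholded labels create a tie, so the label-based score is lower. That loss of ranking information is why a metric such as roc_auc_score needs probability predictions.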

Related issues or pull requests

Closes: #699

Pull Request Checklist

Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
Checked for style issues by running flake8 ./mlxtend

adam2392 · 2020-07-01T14:31:30Z

I'm using a conda env set up for scikit-learn development, which has all the packages. I ran flake8, but got a lot of issues that are not related to my PR:

./mlxtend/feature_extraction/tests/test_base.py:9:37: E272 multiple spaces before keyword
./mlxtend/frequent_patterns/fpcommon.py:4:1: F401 'distutils.version.LooseVersion as Version' imported but unused
./mlxtend/frequent_patterns/fpcommon.py:5:1: F401 'pandas.__version__ as pandas_version' imported but unused
./mlxtend/externals/signature_py27.py:102:9: F841 local variable 'ex' is assigned to but never used
./mlxtend/externals/signature_py27.py:173:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:12:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:49:20: F821 undefined name 'basestring'
./mlxtend/externals/six.py:50:27: F821 undefined name 'long'
./mlxtend/externals/six.py:52:17: F821 undefined name 'unicode'
./mlxtend/externals/six.py:226:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/six.py:238:80: E501 line too long (91 > 79 characters)
./mlxtend/externals/six.py:245:80: E501 line too long (93 > 79 characters)
./mlxtend/externals/six.py:254:80: E501 line too long (91 > 79 characters)
./mlxtend/externals/six.py:265:80: E501 line too long (87 > 79 characters)
./mlxtend/externals/six.py:266:80: E501 line too long (96 > 79 characters)
./mlxtend/externals/six.py:280:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:281:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:295:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:296:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:297:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:354:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:356:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:374:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:376:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:400:80: E501 line too long (83 > 79 characters)
./mlxtend/externals/six.py:424:80: E501 line too long (84 > 79 characters)
./mlxtend/externals/six.py:426:80: E501 line too long (90 > 79 characters)
./mlxtend/externals/six.py:445:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:447:80: E501 line too long (92 > 79 characters)
./mlxtend/externals/six.py:463:80: E501 line too long (92 > 79 characters)
./mlxtend/externals/six.py:465:80: E501 line too long (98 > 79 characters)
./mlxtend/externals/six.py:471:80: E501 line too long (83 > 79 characters)
./mlxtend/externals/six.py:482:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/six.py:647:16: F821 undefined name 'unicode'
./mlxtend/externals/six.py:730:37: F821 undefined name 'basestring'
./mlxtend/externals/six.py:733:32: F821 undefined name 'file'
./mlxtend/externals/six.py:734:38: F821 undefined name 'unicode'
./mlxtend/externals/six.py:744:32: F821 undefined name 'unicode'
./mlxtend/externals/six.py:750:32: F821 undefined name 'unicode'
./mlxtend/externals/six.py:758:36: F821 undefined name 'unicode'
./mlxtend/externals/six.py:762:23: F821 undefined name 'unicode'
./mlxtend/externals/six.py:763:21: F821 undefined name 'unicode'
./mlxtend/externals/six.py:858:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/pyprind/generator_factory.py:23:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/pyprind/__init__.py:14:1: F401 '.progbar.ProgBar' imported but unused
./mlxtend/externals/pyprind/__init__.py:15:1: F401 '.progpercent.ProgPercent' imported but unused
./mlxtend/externals/pyprind/__init__.py:16:1: F401 '.generator_factory.prog_percent' imported but unused
./mlxtend/externals/pyprind/__init__.py:17:1: F401 '.generator_factory.prog_bar' imported but unused
./mlxtend/externals/pyprind/progbar.py:75:80: E501 line too long (85 > 79 characters)
./mlxtend/externals/pyprind/progbar.py:76:48: E128 continuation line under-indented for visual indent
./mlxtend/file_io/find_filegroups.py:80:28: W605 invalid escape sequence '\%'
./mlxtend/file_io/find_filegroups.py:89:32: W605 invalid escape sequence '\%'
./mlxtend/utils/tests/test_testing.py:17:9: E117 over-indented
./mlxtend/feature_selection/exhaustive_feature_selector.py:299:9: F841 local variable 'e' is assigned to but never used
./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py:7:1: F401 'sys' imported but unused
./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py:178:53: E203 whitespace before ','
./mlxtend/feature_selection/tests/test_sequential_feature_selector.py:7:1: F401 'sys' imported but unused
./mlxtend/classifier/stacking_classification.py:180:80: E501 line too long (82 > 79 characters)
./mlxtend/classifier/ensemble_vote.py:291:80: E501 line too long (80 > 79 characters)
./mlxtend/classifier/tests/test_stacking_cv_classifier.py:464:80: E501 line too long (89 > 79 characters)
./mlxtend/evaluate/lift_score.py:39:38: W605 invalid escape sequence '\i'
./mlxtend/evaluate/tests/test_bootstrap_point632.py:28:80: E501 line too long (80 > 79 characters)
./mlxtend/plotting/scatterplotmatrix.py:70:13: E117 over-indented
./mlxtend/plotting/plot_confusion_matrix.py:24:1: W293 blank line contains whitespace
./mlxtend/plotting/tests/test_pca_corr_graph.py:52:80: E501 line too long (91 > 79 characters)
./mlxtend/plotting/tests/test_pca_corr_graph.py:65:1: W391 blank line at end of file
./mlxtend/plotting/tests/test_pca_corr_graph.py:65:1: W293 blank line contains whitespace
./mlxtend/text/__init__.py:13:1: E402 module level import not at top of file
./mlxtend/text/__init__.py:14:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:7:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:8:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:9:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:10:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:11:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names.py:7:1: E402 module level import not at top of file
./mlxtend/data/iris.py:36:63: W291 trailing whitespace
./mlxtend/data/iris.py:54:75: W291 trailing whitespace
./mlxtend/_base/_iterative_model.py:56:13: E117 over-indented

rasbt · 2020-07-01T15:46:25Z

Thanks for the PR. No worries about the flake8 issues that are not due to new code. They can be fixed another time. Also, we usually don't worry about flake8 issues in /externals, because that's not our code. Both signature_py27.py and six.py can actually be removed because we only support Python 3 now.

The error you are getting in the PR is instead due to the following:

   # test predict_proba
>       scores = bootstrap_point632_score(lr, X[:100], y[:100],
                                          scoring_func=roc_auc_score,
                                          predict_proba=True,
                                          random_seed=123)

mlxtend/evaluate/tests/test_bootstrap_point632.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mlxtend/evaluate/bootstrap_point632.py:185: in bootstrap_point632_score
    test_acc = scoring_func(y[test], predict_func(X[test]))
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:390: in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_base.py:77: in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:226: in _binary_roc_auc_score
    fpr, tpr, _ = roc_curve(y_true, y_score,
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:775: in roc_curve
    fps, tps, thresholds = _binary_clf_curve(
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:543: in _binary_clf_curve
    y_score = column_or_1d(y_score)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

y = array([[0.98158654, 0.01841346],
       [0.98338904, 0.01661096],
       [0.97811173, 0.02188827],
       [0.94485394,...76826, 0.97423174],
       [0.00814693, 0.99185307],
       [0.05730381, 0.94269619],
       [0.02391098, 0.97608902]])

    @_deprecate_positional_args
    def column_or_1d(y, *, warn=False):
        """ Ravel column or 1d numpy array, else raises an error
    
        Parameters
        ----------
        y : array-like
    
        warn : boolean, default False
           To control display of warnings.
    
        Returns
        -------
        y : array
    
        """
        y = np.asarray(y)
        shape = np.shape(y)
        if len(shape) == 1:
            return np.ravel(y)
        if len(shape) == 2 and shape[1] == 1:
            if warn:
                warnings.warn("A column-vector y was passed when a 1d array was"
                              " expected. Please change the shape of y to "
                              "(n_samples, ), for example using ravel().",
                              DataConversionWarning, stacklevel=2)
            return np.ravel(y)
    
>       raise ValueError(
            "y should be a 1d array, "
            "got an array of shape {} instead.".format(shape))
E       ValueError: y should be a 1d array, got an array of shape (34, 2) instead.

../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:846: ValueError
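The traceback boils down to a shape mismatch: in the binary case roc_auc_score wants a 1-D score vector, but predict_proba returns one column per class. A minimal sketch of the reduction, assuming the second column holds the positive-class probability (`to_binary_scores` is a hypothetical helper name, not part of mlxtend or scikit-learn):

```python
import numpy as np

# A few rows copied from the array printed in the traceback: one column per class.
proba = np.array([
    [0.98158654, 0.01841346],
    [0.98338904, 0.01661096],
    [0.00814693, 0.99185307],
    [0.05730381, 0.94269619],
])


def to_binary_scores(y_score):
    """Return 1-D scores; for an (n_samples, 2) array keep the second column."""
    y_score = np.asarray(y_score)
    if y_score.ndim == 1:
        return y_score
    if y_score.ndim == 2 and y_score.shape[1] == 2:
        return y_score[:, 1]
    raise ValueError(f"cannot reduce shape {y_score.shape} to binary scores")


scores = to_binary_scores(proba)
print(proba.shape, "->", scores.shape)  # (4, 2) -> (4,)
```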

rasbt · 2020-07-01T15:55:45Z

PS: You can test it efficiently locally via

PYTHONPATH="." pytest mlxtend/evaluate/tests/test_bootstrap_point632.py

coveralls · 2020-07-01T16:26:27Z

Coverage increased (+0.03%) to 90.677% when pulling ccf8ec2 on adam2392:bootstrap into 57aa05a on rasbt:master.

rasbt

Thanks for getting it to work! There are two minor comments I have ...

mlxtend/evaluate/tests/test_bootstrap_point632.py

rasbt · 2020-07-01T17:26:24Z

mlxtend/evaluate/bootstrap_point632.py

    oob = BootstrapOutOfBag(n_splits=n_splits, random_seed=random_seed)
    scores = np.empty(dtype=np.float, shape=(n_splits,))
    cnt = 0
    for train, test in oob.split(X):
        cloned_est.fit(X[train], y[train])

-        test_acc = scoring_func(y[test], cloned_est.predict(X[test]))
+        # get the prediction probability
+        # for binary class uses the last column


Does the current implementation support multi-class settings? If not, I think we need the following structure?

    if predict_proba:
        len_uniq = np.unique(y)
        if len(len_uniq) < 2:
            # raise error
        elif len(len_uniq) == 2:
            predicted_train_val = predicted_train_val[:, 1]
            predicted_test_val = predicted_test_val[:, 1]
        else:
            # do something

Don't all sklearn estimators check the degenerate case of a single-class classification? If multiclass, then we can return the full probability array, else in binary classification, get the 2nd column.

I took the general structure though. lmk what you think

Don't all sklearn estimators check the degenerate case of a single-class classification?

I think you are right!

Looking at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html, the multiclass and multilabel cases expect scores of shape (n_samples, n_classes).

So it seems to be okay as it is right now, too.
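For reference, the branching discussed in this thread could be sketched as follows. This is an illustrative reading of the discussion, not the merged code: `select_proba` is a hypothetical helper, and the explicit single-class check is included only to show the structure, since the estimator may already reject that case.

```python
import numpy as np


def select_proba(y, proba):
    """Pick the part of a predict_proba output that a scoring function should receive."""
    classes = np.unique(y)
    if len(classes) < 2:
        # Degenerate case: a single class cannot be scored meaningfully.
        raise ValueError("y must contain at least two classes")
    if len(classes) == 2:
        # Binary: keep only the second column, shape (n_samples,).
        return proba[:, 1]
    # Multiclass: pass the full (n_samples, n_classes) array through.
    return proba


y_binary = np.array([0, 1, 1, 0])
proba_binary = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])

y_multi = np.array([0, 1, 2])
proba_multi = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])

binary_out = select_proba(y_binary, proba_binary)
multi_out = select_proba(y_multi, proba_multi)
```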

rasbt

Looks good to me. Will merge. Many thanks!

rasbt · 2020-07-03T15:04:12Z

adam2392 added 3 commits July 1, 2020 10:26

Adding predict_proba.

9e16d36

Fix changelog.

5856684

Add pytest import.

e82f95a

Fix unit test.

9dfc73e

fix unit test.

36b4b16

rasbt reviewed Jul 1, 2020

adam2392 added 3 commits July 1, 2020 16:45

Fix unit test.

253c8e8

Fix unit test.

e81fb4d

Fix unit test.

ccf8ec2

rasbt approved these changes Jul 3, 2020

rasbt merged commit df0feb1 into rasbt:master Jul 3, 2020
