[WIP] Adding predict_proba ability to bootstrap_632 functions #700

Merged
merged 8 commits into from Jul 3, 2020

Contributor

@adam2392 adam2392 commented Jul 1, 2020

Description

Add the ability to pass a scoring_func to the bootstrap_point632_score function that depends on probability predictions rather than label predictions. For example, this would allow roc_auc_score to be passed to the bootstrapping method.
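To make the idea concrete, here is a minimal NumPy-only sketch of an out-of-bag bootstrap loop with a predict_proba switch. It is not mlxtend's implementation: `CentroidClassifier`, `mean_true_class_proba`, and `point632_sketch` are hypothetical names invented for illustration, and the resubstitution score is computed per bootstrap sample as a simplification. The 0.632/0.368 weighting is the standard .632 bootstrap combination of the out-of-bag and training scores.

```python
import numpy as np


class CentroidClassifier:
    """Toy classifier (hypothetical, for illustration only)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array(
            [X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax over negative squared distances to each class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-(d - d.min(axis=1, keepdims=True)))
        return w / w.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


def accuracy(y_true, y_pred):
    """Label-based score."""
    return float(np.mean(y_true == y_pred))


def mean_true_class_proba(y_true, proba):
    """Probability-based score; assumes labels are 0..n_classes-1."""
    return float(np.mean(proba[np.arange(len(y_true)), y_true]))


def point632_sketch(est, X, y, scoring_func, predict_proba=False,
                    n_splits=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_splits):
        train = rng.integers(0, n, size=n)        # sample with replacement
        test = np.setdiff1d(np.arange(n), train)  # out-of-bag rows
        if test.size == 0:
            continue
        est.fit(X[train], y[train])
        # The flag only changes which prediction method feeds the scorer.
        predict_func = est.predict_proba if predict_proba else est.predict
        test_score = scoring_func(y[test], predict_func(X[test]))
        train_score = scoring_func(y[train], predict_func(X[train]))
        scores.append(0.632 * test_score + 0.368 * train_score)
    return np.array(scores)


# Two well-separated synthetic clusters.
data_rng = np.random.default_rng(123)
X = np.vstack([data_rng.normal(0.0, 0.5, (20, 2)),
               data_rng.normal(5.0, 0.5, (20, 2))])
y = np.repeat([0, 1], 20)

acc_scores = point632_sketch(CentroidClassifier(), X, y, accuracy)
proba_scores = point632_sketch(CentroidClassifier(), X, y,
                               mean_true_class_proba, predict_proba=True)
```

The scorer used with `predict_proba=True` here consumes the full (n_samples, n_classes) array; metrics that expect a 1-D score vector need an extra column-selection step.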

Related issues or pull requests

Closes: #699

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@adam2392
Contributor Author

adam2392 commented Jul 1, 2020

I'm using a conda env for scikit-learn dev, which has all the packages. I ran flake8 but got a lot of issues unrelated to my PR:

./mlxtend/feature_extraction/tests/test_base.py:9:37: E272 multiple spaces before keyword
./mlxtend/frequent_patterns/fpcommon.py:4:1: F401 'distutils.version.LooseVersion as Version' imported but unused
./mlxtend/frequent_patterns/fpcommon.py:5:1: F401 'pandas.__version__ as pandas_version' imported but unused
./mlxtend/externals/signature_py27.py:102:9: F841 local variable 'ex' is assigned to but never used
./mlxtend/externals/signature_py27.py:173:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:12:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:49:20: F821 undefined name 'basestring'
./mlxtend/externals/six.py:50:27: F821 undefined name 'long'
./mlxtend/externals/six.py:52:17: F821 undefined name 'unicode'
./mlxtend/externals/six.py:226:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/six.py:238:80: E501 line too long (91 > 79 characters)
./mlxtend/externals/six.py:245:80: E501 line too long (93 > 79 characters)
./mlxtend/externals/six.py:254:80: E501 line too long (91 > 79 characters)
./mlxtend/externals/six.py:265:80: E501 line too long (87 > 79 characters)
./mlxtend/externals/six.py:266:80: E501 line too long (96 > 79 characters)
./mlxtend/externals/six.py:280:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:281:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:295:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:296:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:297:80: E501 line too long (82 > 79 characters)
./mlxtend/externals/six.py:354:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:356:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:374:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/six.py:376:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:400:80: E501 line too long (83 > 79 characters)
./mlxtend/externals/six.py:424:80: E501 line too long (84 > 79 characters)
./mlxtend/externals/six.py:426:80: E501 line too long (90 > 79 characters)
./mlxtend/externals/six.py:445:80: E501 line too long (86 > 79 characters)
./mlxtend/externals/six.py:447:80: E501 line too long (92 > 79 characters)
./mlxtend/externals/six.py:463:80: E501 line too long (92 > 79 characters)
./mlxtend/externals/six.py:465:80: E501 line too long (98 > 79 characters)
./mlxtend/externals/six.py:471:80: E501 line too long (83 > 79 characters)
./mlxtend/externals/six.py:482:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/six.py:647:16: F821 undefined name 'unicode'
./mlxtend/externals/six.py:730:37: F821 undefined name 'basestring'
./mlxtend/externals/six.py:733:32: F821 undefined name 'file'
./mlxtend/externals/six.py:734:38: F821 undefined name 'unicode'
./mlxtend/externals/six.py:744:32: F821 undefined name 'unicode'
./mlxtend/externals/six.py:750:32: F821 undefined name 'unicode'
./mlxtend/externals/six.py:758:36: F821 undefined name 'unicode'
./mlxtend/externals/six.py:762:23: F821 undefined name 'unicode'
./mlxtend/externals/six.py:763:21: F821 undefined name 'unicode'
./mlxtend/externals/six.py:858:80: E501 line too long (80 > 79 characters)
./mlxtend/externals/pyprind/generator_factory.py:23:1: E305 expected 2 blank lines after class or function definition, found 1
./mlxtend/externals/pyprind/__init__.py:14:1: F401 '.progbar.ProgBar' imported but unused
./mlxtend/externals/pyprind/__init__.py:15:1: F401 '.progpercent.ProgPercent' imported but unused
./mlxtend/externals/pyprind/__init__.py:16:1: F401 '.generator_factory.prog_percent' imported but unused
./mlxtend/externals/pyprind/__init__.py:17:1: F401 '.generator_factory.prog_bar' imported but unused
./mlxtend/externals/pyprind/progbar.py:75:80: E501 line too long (85 > 79 characters)
./mlxtend/externals/pyprind/progbar.py:76:48: E128 continuation line under-indented for visual indent
./mlxtend/file_io/find_filegroups.py:80:28: W605 invalid escape sequence '\%'
./mlxtend/file_io/find_filegroups.py:89:32: W605 invalid escape sequence '\%'
./mlxtend/utils/tests/test_testing.py:17:9: E117 over-indented
./mlxtend/feature_selection/exhaustive_feature_selector.py:299:9: F841 local variable 'e' is assigned to but never used
./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py:7:1: F401 'sys' imported but unused
./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py:178:53: E203 whitespace before ','
./mlxtend/feature_selection/tests/test_sequential_feature_selector.py:7:1: F401 'sys' imported but unused
./mlxtend/classifier/stacking_classification.py:180:80: E501 line too long (82 > 79 characters)
./mlxtend/classifier/ensemble_vote.py:291:80: E501 line too long (80 > 79 characters)
./mlxtend/classifier/tests/test_stacking_cv_classifier.py:464:80: E501 line too long (89 > 79 characters)
./mlxtend/evaluate/lift_score.py:39:38: W605 invalid escape sequence '\i'
./mlxtend/evaluate/tests/test_bootstrap_point632.py:28:80: E501 line too long (80 > 79 characters)
./mlxtend/plotting/scatterplotmatrix.py:70:13: E117 over-indented
./mlxtend/plotting/plot_confusion_matrix.py:24:1: W293 blank line contains whitespace
./mlxtend/plotting/tests/test_pca_corr_graph.py:52:80: E501 line too long (91 > 79 characters)
./mlxtend/plotting/tests/test_pca_corr_graph.py:65:1: W391 blank line at end of file
./mlxtend/plotting/tests/test_pca_corr_graph.py:65:1: W293 blank line contains whitespace
./mlxtend/text/__init__.py:13:1: E402 module level import not at top of file
./mlxtend/text/__init__.py:14:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:7:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:8:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:9:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:10:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names_duplcheck.py:11:1: E402 module level import not at top of file
./mlxtend/text/tests/test_generalize_names.py:7:1: E402 module level import not at top of file
./mlxtend/data/iris.py:36:63: W291 trailing whitespace
./mlxtend/data/iris.py:54:75: W291 trailing whitespace
./mlxtend/_base/_iterative_model.py:56:13: E117 over-indented

@rasbt
Owner

rasbt commented Jul 1, 2020

Thanks for the PR. No worries about the flake8 issues that are not due to new code. They can be fixed another time. Also, we usually don't worry about flake8 issues in /externals, because that's not our code. Both signature_py27.py and six.py can actually be removed because we only support Python 3 now.

The error you are getting in the PR is actually caused by the following:

   # test predict_proba
>       scores = bootstrap_point632_score(lr, X[:100], y[:100],
                                          scoring_func=roc_auc_score,
                                          predict_proba=True,
                                          random_seed=123)

mlxtend/evaluate/tests/test_bootstrap_point632.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mlxtend/evaluate/bootstrap_point632.py:185: in bootstrap_point632_score
    test_acc = scoring_func(y[test], predict_func(X[test]))
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:390: in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_base.py:77: in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:226: in _binary_roc_auc_score
    fpr, tpr, _ = roc_curve(y_true, y_score,
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:775: in roc_curve
    fps, tps, thresholds = _binary_clf_curve(
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:543: in _binary_clf_curve
    y_score = column_or_1d(y_score)
../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:73: in inner_f
    return f(**kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

y = array([[0.98158654, 0.01841346],
       [0.98338904, 0.01661096],
       [0.97811173, 0.02188827],
       [0.94485394,...76826, 0.97423174],
       [0.00814693, 0.99185307],
       [0.05730381, 0.94269619],
       [0.02391098, 0.97608902]])

    @_deprecate_positional_args
    def column_or_1d(y, *, warn=False):
        """ Ravel column or 1d numpy array, else raises an error
    
        Parameters
        ----------
        y : array-like
    
        warn : boolean, default False
           To control display of warnings.
    
        Returns
        -------
        y : array
    
        """
        y = np.asarray(y)
        shape = np.shape(y)
        if len(shape) == 1:
            return np.ravel(y)
        if len(shape) == 2 and shape[1] == 1:
            if warn:
                warnings.warn("A column-vector y was passed when a 1d array was"
                              " expected. Please change the shape of y to "
                              "(n_samples, ), for example using ravel().",
                              DataConversionWarning, stacklevel=2)
            return np.ravel(y)
    
>       raise ValueError(
            "y should be a 1d array, "
            "got an array of shape {} instead.".format(shape))
E       ValueError: y should be a 1d array, got an array of shape (34, 2) instead.

../../miniconda3/envs/mlxtend-latest/lib/python3.8/site-packages/sklearn/utils/validation.py:846: ValueError
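The traceback above comes down to a shape mismatch: a binary metric that expects one score per sample received the full two-column probability array. The sketch below reproduces that situation with NumPy only. `pairwise_auc` is a hypothetical stand-in written for this example (it is not scikit-learn's roc_auc_score), and the data are made up.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """Illustrative stand-in for a metric that needs 1-D scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_score.ndim != 1:
        raise ValueError(
            f"y should be a 1d array, got an array of shape "
            f"{y_score.shape} instead.")
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


y_true = np.array([0, 0, 1, 1])
# Shape (4, 2): one column per class, as predict_proba would return.
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.7, 0.3],
                  [0.2, 0.8]])

try:
    pairwise_auc(y_true, proba)      # full array: rejected
    failed = False
except ValueError:
    failed = True

# Passing only the positive-class column gives the metric what it expects.
auc = pairwise_auc(y_true, proba[:, 1])
```

With these four samples, three of the four positive/negative pairs are ordered correctly, so `auc` is 0.75.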

@rasbt
Owner

rasbt commented Jul 1, 2020

PS: You can test it efficiently locally via

PYTHONPATH="." pytest mlxtend/evaluate/tests/test_bootstrap_point632.py 

@coveralls

coveralls commented Jul 1, 2020

Coverage Status

Coverage increased (+0.03%) to 90.677% when pulling ccf8ec2 on adam2392:bootstrap into 57aa05a on rasbt:master.

Owner

@rasbt rasbt left a comment

Thanks for getting it to work! There are two minor comments I have ...

oob = BootstrapOutOfBag(n_splits=n_splits, random_seed=random_seed)
scores = np.empty(dtype=np.float, shape=(n_splits,))
cnt = 0
for train, test in oob.split(X):
    cloned_est.fit(X[train], y[train])

    test_acc = scoring_func(y[test], cloned_est.predict(X[test]))
    # get the prediction probability
    # for binary class uses the last column
Owner

@rasbt rasbt Jul 1, 2020

Does the current implementation support multi-class settings? If not, I think we need the following structure?

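if predict_proba:
    uniq_labels = np.unique(y)

    if len(uniq_labels) < 2:
        raise ValueError("y needs at least two classes")
    elif len(uniq_labels) == 2:
        # binary case: keep the positive-class column only
        predicted_train_val = predicted_train_val[:, 1]
        predicted_test_val = predicted_test_val[:, 1]
    else:
        pass  # multi-class case: still to be decided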

Contributor Author

Don't all sklearn estimators check the degenerate case of a single-class classification? In the multiclass case we can return the full probability array; in the binary case we take the 2nd column.

I took the general structure though. lmk what you think

Owner

Don't all sklearn estimators check the degenerate case of a single-class classification?

I think you are right!

Looking at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html, the multiclass and multilabel cases expect a shape (n_samples, n_classes).

It seems to be okay as it is right now, too.
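The branching discussed in this thread (second column for binary problems, full array for multi-class) could be sketched as a small helper. `scores_for_metric` is a hypothetical name, not a function in mlxtend, and it branches on the number of probability columns rather than on np.unique(y), which is an assumption made to keep the example self-contained.

```python
import numpy as np


def scores_for_metric(proba):
    """Hypothetical helper: shape predict_proba output for a scoring function."""
    proba = np.asarray(proba)
    if proba.ndim != 2:
        raise ValueError("expected a 2-D (n_samples, n_classes) array")
    n_classes = proba.shape[1]
    if n_classes < 2:
        raise ValueError("need at least two classes")
    if n_classes == 2:
        # Binary case: metrics such as ROC AUC want the positive-class column.
        return proba[:, 1]
    # Multi-class case: pass the full (n_samples, n_classes) array through.
    return proba


binary = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
multi = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

binary_scores = scores_for_metric(binary)   # shape (2,)
multi_scores = scores_for_metric(multi)     # shape (2, 3)
```

Checking the column count means the helper does not need access to y, though it relies on the estimator having seen every class during fitting.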

Owner

@rasbt rasbt left a comment

Looks good to me. Will merge. Many thanks!

@rasbt rasbt merged commit df0feb1 into rasbt:master Jul 3, 2020
Successfully merging this pull request may close these issues.

Adding scoring_func that depend on y_predict_proba to bootstrap methods
3 participants