
[MRG+1] FEA Add Categorical Naive Bayes #12569

Merged
merged 87 commits into scikit-learn:master on Sep 23, 2019

Conversation

@timbicker (Contributor) commented Nov 12, 2018

Reference Issues/PRs

Implements Categorical NB from #10856

What does this implement/fix? Explain your changes.

This implementation adds the NB classifier for categorical features.

Any other comments?

  • implement categorical NB functionality
  • check and write docstrings
  • add full test coverage
  • write the documentation
@timbicker (Contributor, Author) commented Dec 12, 2018

There might be a problem when the input array X of the predict function contains unseen categories.
From a mathematical point of view, the probability of an unseen category is 0. Therefore the likelihood term of Bayes' theorem becomes 0, and the probability of the sample is 0.
This could be an unwanted error, because the user might expect that all possible categories are in the training set. It could also be something the user wants to know about while still using the prediction. Or they might simply not care.
Therefore I added a new attribute on_unseen_cats, which controls whether the classifier raises an error, issues a warning, or ignores it.
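A minimal sketch of the proposed behaviour, assuming integer-encoded categories (the helper name and option values are illustrative, not the PR's actual API):

import warnings
import numpy as np

def check_unseen_categories(X_feature, n_seen_categories, on_unseen_cats="warn"):
    # Hypothetical helper: X_feature holds the integer-encoded categories of
    # one feature; any code >= the number of categories seen during fit is unseen.
    unseen = np.asarray(X_feature) >= n_seen_categories
    if unseen.any():
        msg = "X contains categories that were not seen during fit"
        if on_unseen_cats == "error":
            raise ValueError(msg)
        elif on_unseen_cats == "warn":
            warnings.warn(msg)
        # on_unseen_cats == "ignore": proceed silently
    return unseen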

@FlorianWilhelm (Contributor) commented Dec 12, 2018

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.
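For intuition: in log space a per-feature probability of 1 contributes log 1 = 0 to the joint log likelihood, so the feature drops out of the decision. A small illustration with made-up numbers:

import numpy as np

log_prior = np.log([0.6, 0.4])        # class priors P(y)
log_lik_seen = np.log([0.2, 0.7])     # P(x_i | y) for a seen category
log_lik_unseen = np.log([1.0, 1.0])   # unseen category mapped to probability 1

# log(1) == 0, so the unseen feature leaves the joint log likelihood unchanged
joint = log_prior + log_lik_seen + log_lik_unseen
assert np.allclose(joint, log_prior + log_lik_seen)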

@timbicker (Contributor, Author) commented Dec 18, 2018

> @timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.

Yes, I agree. This makes more sense in my opinion.

For the other case, where the category is unseen only for a subset of the classes, we have the smoothing parameter.
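For reference, the smoothed estimate of P(x_i = t | y = c) described in the user guide is (N_tic + alpha) / (N_c + alpha * n_i), where N_tic counts category t of feature i within class c, N_c is the number of samples in class c, and n_i is the number of categories of feature i. A tiny numeric sketch:

alpha = 1.0
N_tic, N_c, n_i = 0, 10, 4  # category t never observed within class c
p = (N_tic + alpha) / (N_c + alpha * n_i)
print(p)  # 0.0714... -- nonzero despite the zero count, thanks to alpha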

timbicker added 7 commits Dec 19, 2018
@qinhanmin2014 (Member) commented Sep 3, 2019

Thanks, I'm unable to reproduce because I use a list.
Please ignore it here; I'll open an issue and we'll fix this in check_array.

@FlorianWilhelm (Contributor) commented Sep 6, 2019

@qinhanmin2014 Thanks for your review. If you have no further comments, should @timbicker then change this PR to [MRG+1] so that others like @amueller, @NicolasHug, and @jnothman can review?

@qinhanmin2014 (Member) commented Sep 6, 2019

See the previous comment; please avoid calling check_array twice.

@timbicker (Contributor, Author) commented Sep 8, 2019

I am waiting for the other PR, #14872, to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

@qinhanmin2014 (Member) commented Sep 9, 2019

> I am waiting for the other PR, #14872, to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

which tests?

@timbicker (Contributor, Author) commented Sep 9, 2019

check_estimators_nan_inf in sklearn/utils/estimator_checks.py

The output is as follows:

FEstimator doesn't check for NaN and inf in fit. CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True) 'list' argument must have no negative elements
Traceback (most recent call last):
  File "/Users/tbicker/PRs/scikit-learn/sklearn/utils/estimator_checks.py", line 1333, in check_estimators_nan_inf
    estimator.fit(X_train, y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1109, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 635, in fit
    self._count(X, Y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1200, in _count
    self.class_count_.shape[0])
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1188, in _update_cat_count
    counts = np.bincount(X_feature[mask], weights=weights)
ValueError: 'list' argument must have no negative elements

sklearn/tests/test_common.py:93 (test_estimators[CategoricalNB()-check_estimators_nan_inf1])
estimator = CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)
check = functools.partial(<function check_estimators_nan_inf at 0x1a1792a510>, 'CategoricalNB')

    @parametrize_with_checks(_tested_estimators())
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
>           check(estimator)

test_common.py:100: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/testing.py:326: in wrapper
    return fn(*args, **kwargs)
../utils/estimator_checks.py:1338: in check_estimators_nan_inf
    raise e
../utils/estimator_checks.py:1333: in check_estimators_nan_inf
    estimator.fit(X_train, y)
../naive_bayes.py:1109: in fit
    return super().fit(X, y, sample_weight=sample_weight)
../naive_bayes.py:635: in fit
    self._count(X, Y)
../naive_bayes.py:1200: in _count
    self.class_count_.shape[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X_feature = array([-9223372036854775808,                    0,                    0,
                          0,                 ...              0,
                          0,                    0,                    0,
                          0])
Y = array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]])
cat_count = array([[0.],
       [0.]]), n_classes = 2

    def _update_cat_count(X_feature, Y, cat_count, n_classes):
        for j in range(n_classes):
            mask = Y[:, j].astype(bool)
            if Y.dtype.type == np.int64:
                weights = None
            else:
                weights = Y[mask, j]
>           counts = np.bincount(X_feature[mask], weights=weights)
E           ValueError: 'list' argument must have no negative elements

../naive_bayes.py:1188: ValueError

We assume that X contains only integer values >= 0. Because check_array converts np.nan and np.inf to a large negative int, X ends up containing a negative value, and np.bincount consequently fails.
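The failure mode can be reproduced in isolation: casting NaN to int64 yields the minimum int64 value on typical platforms, and np.bincount rejects negative input:

import numpy as np

X = np.array([np.nan, 0.0, 1.0])
X_int = X.astype(np.int64)  # NaN becomes -9223372036854775808 here
print(X_int[0])

try:
    np.bincount(X_int)
except ValueError as exc:
    print(exc)  # 'list' argument must have no negative elements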

@qinhanmin2014 qinhanmin2014 changed the title [MRG] Add Categorical Naive Bayes [MRG+1] FEA Add Categorical Naive Bayes Sep 9, 2019
@jnothman (Member) left a comment

It would be good to test other invariances: invariance under sample permutation, invariance under class-label permutation (up to ties), and maybe a test of how tie-breaking is done, to avoid regressions.

Still to review main code
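A sketch of what a sample-permutation invariance test could look like (test name and data are illustrative):

import numpy as np
from sklearn.naive_bayes import CategoricalNB

def test_categoricalnb_sample_permutation_invariance():
    rng = np.random.RandomState(0)
    X = rng.randint(3, size=(20, 4))
    y = rng.randint(2, size=20)
    perm = rng.permutation(20)

    clf = CategoricalNB().fit(X, y)
    clf_perm = CategoricalNB().fit(X[perm], y[perm])
    # Reordering the training samples must not change the predictions
    np.testing.assert_allclose(clf.predict_proba(X), clf_perm.predict_proba(X))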

@jnothman (Member) left a comment

Otherwise LGTM. 👀

@timbicker (Contributor, Author) commented Sep 23, 2019

@jnothman and @qinhanmin2014, thanks for your remarks!

@jnothman jnothman merged commit 4e9f97d into scikit-learn:master Sep 23, 2019
19 checks passed
@jnothman (Member) commented Sep 23, 2019

Thanks and congratulations @timbicker and @FlorianWilhelm!

@FlorianWilhelm (Contributor) commented Sep 24, 2019

@timbicker, great job! Thanks to everyone involved.

@Sandy4321 commented Dec 17, 2019

Can you please clarify what you mean here, in #15077, and in
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

under feature distributions:

> discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.

Do you mean that the distributions are estimated from the data? That for a given categorical feature, the probabilities and the conditional probabilities (of values given the target) are calculated from the data?

As mentioned in
https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary:

> Some people recommend using MultinomialNB, which according to me doesn't make sense because it considers feature values to be frequency counts.

So you never use some known pdf to fit the data for each feature?

Do you have an example specific to categorical data, with real categorical features where several are important and several are not, rather than this very generic example?

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))

(from https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB)

Thanks a lot in advance.

@@ -49,6 +51,12 @@ def _joint_log_likelihood(self, X):
predict_proba and predict_log_proba.
"""

@abstractmethod
def _check_X(self, X):

@bsipocz (Contributor) commented Dec 31, 2019

Adding this abstract method breaks downstream code.

That is OK, but with the change hidden in this huge PR it was a bit more difficult than necessary to dig up whether there were any relevant comments about the change.
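A minimal sketch of the breakage, with stand-in class names rather than sklearn's actual hierarchy: once _check_X is declared @abstractmethod on the base class, any downstream subclass that does not define it can no longer be instantiated:

from abc import ABC, abstractmethod

class BaseNB(ABC):  # stand-in for sklearn's naive Bayes base class
    @abstractmethod
    def _check_X(self, X):
        """Validate X; to be overridden in subclasses."""

class DownstreamNB(BaseNB):  # downstream subclass written before the change
    pass

DownstreamNB()  # raises TypeError: Can't instantiate abstract class DownstreamNB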

@jnothman (Member) commented Dec 31, 2019

Yes, we are under-aware of the impact on downstream projects. See #15992.

I also think we neglect to consider people inheriting from our objects. Can you submit a PR to remove the abstractmethod for 0.22.1, perhaps?

@bsipocz (Contributor) commented Dec 31, 2019

see #15996
