
[MRG+1] FEA Add Categorical Naive Bayes #12569

Merged: 87 commits merged into scikit-learn:master on Sep 23, 2019

Conversation

@timbicker (Contributor) commented Nov 12, 2018

Reference Issues/PRs

Implements Categorical NB from #10856

What does this implement/fix? Explain your changes.

This implementation adds the NB classifier for categorical features.

Any other comments?

  • implement categorical NB functionality
  • check and write doc strings
  • add full test coverage
  • write documentation

@timbicker (Contributor Author) commented Dec 12, 2018

There might be a problem when the input array X of the predict function contains unseen categories.
From a mathematical point of view, the probability of an unseen category is 0. Therefore the likelihood term of Bayes' theorem becomes 0 and the probability of the sample is 0.
This could be an unwanted error, because the user might expect that all possible categories are in the training set. It could also be something the user wants to know about while still using the prediction, or they might simply not care.
Therefore I added a new attribute, on_unseen_cats, which controls whether the classifier raises an error, emits a warning, or ignores it.
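
A rough sketch of how such an option could dispatch between the three behaviours (illustrative only; the helper name and option strings are hypothetical, not the PR's actual code):

import warnings

def _handle_unseen_category(on_unseen_cats, feature_idx, category):
    # Hypothetical helper: decide what to do when a category was not seen during fit.
    msg = ("Feature %d contains category %r that was not seen during fit."
           % (feature_idx, category))
    if on_unseen_cats == "error":
        raise ValueError(msg)
    elif on_unseen_cats == "warn":
        warnings.warn(msg)
    # "ignore": do nothing and let prediction proceed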

@FlorianWilhelm (Contributor) commented Dec 12, 2018

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.
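
A tiny numeric sketch of this point (the probabilities are made up, not the PR's code): treating an unseen category as probability 1 adds log(1) = 0 to every class's joint log-likelihood, so that feature drops out of the decision.

import numpy as np

log_prior = np.log([0.6, 0.4])    # P(class), illustrative values
log_lik_f0 = np.log([0.2, 0.7])   # P(x_0 | class) for a seen category of feature 0
log_lik_f1 = np.log([1.0, 1.0])   # unseen category of feature 1 treated as P = 1

joint = log_prior + log_lik_f0 + log_lik_f1
print(joint.argmax())             # same decision as if feature 1 were left out entirely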

@timbicker (Contributor Author) commented Dec 18, 2018

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.

Yes, I agree. This makes more sense in my opinion.

For the other case, where a category is unseen only for a subset of the classes, we have the smoothing parameter.
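
A minimal sketch of that smoothing effect for a single feature, assuming the usual Lidstone/Laplace form (count + alpha) / (class_count + alpha * n_categories); the counts are made up:

import numpy as np

alpha = 1.0
# counts[c, t]: training samples of class c observed with category t in this feature
counts = np.array([[3, 0, 2],    # class 0 never saw category 1
                   [1, 4, 0]])   # class 1 never saw category 2
n_categories = counts.shape[1]

smoothed = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_categories)
print(smoothed)  # every entry is strictly positive, so no class is assigned probability 0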

@timbicker (Contributor Author) commented Sep 2, 2019

The problem is that check_array(X, accept_sparse=False, force_all_finite=True, dtype='int') first changes the dtype of X and then checks for nan or inf. This way, nan and inf are converted to an integer and no error is raised anymore. I am not sure if this is intended.

Hmm, I'm unable to reproduce, please provide a code snippet, thanks.

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.testing import assert_raises

rnd = np.random.RandomState(0)
X_train_nan = rnd.uniform(size=(10, 3))
X_train_nan[0, 0] = np.nan
X_train_inf = rnd.uniform(size=(10, 3))
X_train_inf[0, 0] = np.inf

for X_train in [X_train_nan, X_train_inf]:
    # Without dtype='int', check_array raises on nan/inf as expected.
    assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True)
    # With dtype='int', the same check no longer raises:
    # assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    X = check_array(X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    print(X[0, 0])  # nan/inf have been silently cast to an integer

If you uncomment the second assertion, you will see that no error is raised. Instead, np.nan and np.inf are converted to int.

@qinhanmin2014 (Member) commented Sep 3, 2019

Thanks, I was unable to reproduce because I used a list.
Please ignore it here. I'll open an issue and we'll fix this in check_array.

@FlorianWilhelm (Contributor) commented Sep 6, 2019

@qinhanmin2014 Thanks for your review. If you have no further comments, should @timbicker then change this PR to [MRG+1] so that others like @amueller, @NicolasHug, and @jnothman can review?

@qinhanmin2014 (Member) commented Sep 6, 2019

See the previous comment: please avoid calling check_array twice.

@timbicker (Contributor Author) commented Sep 8, 2019

I am waiting for the other PR #14872 to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

@qinhanmin2014 (Member) commented Sep 9, 2019

I am waiting for the other PR #14872 to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

which tests?

@timbicker (Contributor Author) commented Sep 9, 2019

check_estimators_nan_inf in sklearn/utils/estimator_checks.py

The output is as follows:

FEstimator doesn't check for NaN and inf in fit. CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True) 'list' argument must have no negative elements
Traceback (most recent call last):
  File "/Users/tbicker/PRs/scikit-learn/sklearn/utils/estimator_checks.py", line 1333, in check_estimators_nan_inf
    estimator.fit(X_train, y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1109, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 635, in fit
    self._count(X, Y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1200, in _count
    self.class_count_.shape[0])
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1188, in _update_cat_count
    counts = np.bincount(X_feature[mask], weights=weights)
ValueError: 'list' argument must have no negative elements

sklearn/tests/test_common.py:93 (test_estimators[CategoricalNB()-check_estimators_nan_inf1])
estimator = CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)
check = functools.partial(<function check_estimators_nan_inf at 0x1a1792a510>, 'CategoricalNB')

    @parametrize_with_checks(_tested_estimators())
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
>           check(estimator)

test_common.py:100: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/testing.py:326: in wrapper
    return fn(*args, **kwargs)
../utils/estimator_checks.py:1338: in check_estimators_nan_inf
    raise e
../utils/estimator_checks.py:1333: in check_estimators_nan_inf
    estimator.fit(X_train, y)
../naive_bayes.py:1109: in fit
    return super().fit(X, y, sample_weight=sample_weight)
../naive_bayes.py:635: in fit
    self._count(X, Y)
../naive_bayes.py:1200: in _count
    self.class_count_.shape[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X_feature = array([-9223372036854775808,                    0,                    0,
                          0,                 ...              0,
                          0,                    0,                    0,
                          0])
Y = array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]])
cat_count = array([[0.],
       [0.]]), n_classes = 2

    def _update_cat_count(X_feature, Y, cat_count, n_classes):
        for j in range(n_classes):
            mask = Y[:, j].astype(bool)
            if Y.dtype.type == np.int64:
                weights = None
            else:
                weights = Y[mask, j]
>           counts = np.bincount(X_feature[mask], weights=weights)
E           ValueError: 'list' argument must have no negative elements

../naive_bayes.py:1188: ValueError

We assume that X contains only integer values >= 0. Due to the conversion of np.nan and np.inf to a large negative int in check_array, X contains a negative value and np.bincount consequently fails.
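
A standalone reproduction of that failure mode, independent of the estimator (the exact negative value is platform dependent):

import numpy as np

bad = np.array([np.nan, 0.0, 1.0]).astype('int')
print(bad[0])      # e.g. -9223372036854775808: nan is cast to a huge negative integer
np.bincount(bad)   # ValueError: 'list' argument must have no negative elements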

@qinhanmin2014 changed the title from [MRG] Add Categorical Naive Bayes to [MRG+1] FEA Add Categorical Naive Bayes on Sep 9, 2019
@jnothman (Member) left a comment

It would be good to test other invariances: invariance under sample permutation, invariance under class label permutation up to ties, and maybe a test for how tie breaking is done to avoid regressions.

Still to review the main code.
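
A minimal sketch of such a sample-permutation invariance test (the dataset is made up and this is not the test that was actually added to the PR):

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(0)
X = rng.randint(3, size=(30, 4))
y = rng.randint(2, size=30)
perm = rng.permutation(len(y))

clf_a = CategoricalNB().fit(X, y)
clf_b = CategoricalNB().fit(X[perm], y[perm])
# Fitting on a permuted copy of the training data must not change the predictions.
np.testing.assert_allclose(clf_a.predict_proba(X), clf_b.predict_proba(X))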

@jnothman (Member) left a comment

Otherwise LGTM. 👀

@timbicker (Contributor Author) commented Sep 23, 2019

@jnothman and @qinhanmin2014, thanks for your remarks.

@jnothman merged commit 4e9f97d into scikit-learn:master on Sep 23, 2019
19 checks passed
@jnothman (Member) commented Sep 23, 2019

Thanks and congratulations @timbicker and @FlorianWilhelm!

@FlorianWilhelm (Contributor) commented Sep 24, 2019

@timbicker, great job! Thanks to everyone involved.

@Sandy4321 commented Dec 17, 2019

Can you please clarify what you mean here, in https://github.com/scikit-learn/scikit-learn/issues/15077, and in

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

under the description of the feature distributions:

"discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution."

It seems you mean that the distributions are estimated from the data?
For a given categorical feature, are the probabilities and the conditional probabilities (of values given the target) calculated from the data?

As mentioned in
https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary
"Some people recommend using MultinomialNB, which according to me doesn't make sense because it considers feature values to be frequency counts."

You never use some known pdf to fit the data for each feature?

Do you have an example specific to categorical data, with real categorical features where several are important and several are not,
rather than this very generic example:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))
from
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

Thanks a lot in advance.
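
For reference, a small sketch of the point being asked about, assuming the fitted attributes category_count_ and feature_log_prob_ documented for CategoricalNB: the per-feature conditional probabilities P(x_i = t | y) are estimated from smoothed category counts in the training data, and no parametric pdf is fitted.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

X = np.array([[0, 1],
              [0, 2],
              [1, 1],
              [1, 0]])
y = np.array([0, 0, 1, 1])

clf = CategoricalNB(alpha=1.0).fit(X, y)
print(clf.category_count_[0])            # raw category counts for feature 0, per class
print(np.exp(clf.feature_log_prob_[0]))  # smoothed P(feature 0 = t | class)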

@@ -49,6 +51,12 @@ def _joint_log_likelihood(self, X):
         predict_proba and predict_log_proba.
         """
 
+    @abstractmethod
+    def _check_X(self, X):
@bsipocz (Contributor) Dec 31, 2019

Having this abstract method added is breaking downstream code.

Which is OK, but with the change hidden in this huge PR it was a bit more difficult than necessary to dig up whether there were any relevant comments about the change, etc.
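
A self-contained sketch of the breakage being described, using a stand-in base class rather than sklearn's actual private one: adding a new abstract method makes any existing downstream subclass that does not implement it impossible to instantiate.

from abc import ABCMeta, abstractmethod

class _BaseNBLike(metaclass=ABCMeta):   # stand-in for the private naive Bayes base class
    @abstractmethod
    def _joint_log_likelihood(self, X):
        pass

    @abstractmethod
    def _check_X(self, X):              # the newly added abstract method
        pass

class DownstreamNB(_BaseNBLike):        # downstream subclass written before this PR
    def _joint_log_likelihood(self, X):
        return X

DownstreamNB()  # TypeError: Can't instantiate abstract class DownstreamNB ...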

@jnothman (Member) Dec 31, 2019

Yes, we are insufficiently aware of the impact on downstream projects. See #15992.

I also think we neglect to consider people inheriting from our objects. Can you submit a PR to remove the abstractmethod for 0.22.1, perhaps?

@bsipocz (Contributor) Dec 31, 2019

see #15996
