
[MRG+1] FEA Add Categorical Naive Bayes #12569

Merged: 87 commits merged into scikit-learn:master on Sep 23, 2019

Conversation

@timbicker (Contributor) commented Nov 12, 2018

Reference Issues/PRs

Implements Categorical NB from #10856

What does this implement/fix? Explain your changes.

This implementation adds the NB classifier for categorical features.

Any other comments?

  • implement categorical NB functionality
  • check and write doc strings
  • add full test coverage
  • write documentation

@timbicker (Contributor Author) commented Dec 12, 2018

There might be a problem when the input array X of the predict function contains unseen categories.
From a mathematical point of view, the probability of an unseen category is 0. Therefore the likelihood term of Bayes' theorem becomes 0 and the probability of the sample is 0.
This could be an unwanted error, because the user might expect that all possible categories are in the training set. It could also be something the user wants to know about while still using the prediction, or they might simply not care.
Therefore I added a new attribute, on_unseen_cats, which controls whether the classifier raises an error, emits a warning, or ignores it.
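
A rough sketch of how such an option could dispatch between the three behaviours (illustrative only; the helper name and option strings are hypothetical, not the PR's actual code):

import warnings

def _handle_unseen_category(on_unseen_cats, feature_idx, category):
    # Hypothetical helper: decide what to do when a category was not seen during fit.
    msg = ("Feature %d contains category %r that was not seen during fit."
           % (feature_idx, category))
    if on_unseen_cats == "error":
        raise ValueError(msg)
    elif on_unseen_cats == "warn":
        warnings.warn(msg)
    # "ignore": do nothing and let prediction proceed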

@FlorianWilhelm (Contributor) commented Dec 12, 2018

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.
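
A tiny numeric sketch of this point (the probabilities are made up, not the PR's code): treating an unseen category as probability 1 adds log(1) = 0 to every class's joint log-likelihood, so that feature drops out of the decision.

import numpy as np

log_prior = np.log([0.6, 0.4])    # P(class), illustrative values
log_lik_f0 = np.log([0.2, 0.7])   # P(x_0 | class) for a seen category of feature 0
log_lik_f1 = np.log([1.0, 1.0])   # unseen category of feature 1 treated as P = 1

joint = log_prior + log_lik_f0 + log_lik_f1
print(joint.argmax())             # same decision as if feature 1 were left out entirely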

@timbicker (Contributor Author) commented Dec 18, 2018

@timbicker After having thought a bit about it, I think that setting the probability to 1 for an unseen category would make more sense than 0. Given a particular feature, an unseen category would then yield 1 for all classes, and thus the feature would basically be ignored, since other features could still determine the right class. If all categories of all features are unseen, the decision would be made automatically by the class probabilities.

Yes, I agree. This makes more sense in my opinion.

For the other case, where a category is unseen only for a subset of the classes, we have the smoothing parameter.
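
A minimal sketch of that smoothing effect for a single feature, assuming the usual Lidstone/Laplace form (count + alpha) / (class_count + alpha * n_categories); the counts are made up:

import numpy as np

alpha = 1.0
# counts[c, t]: training samples of class c observed with category t in this feature
counts = np.array([[3, 0, 2],    # class 0 never saw category 1
                   [1, 4, 0]])   # class 1 never saw category 2
n_categories = counts.shape[1]

smoothed = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_categories)
print(smoothed)  # every entry is strictly positive, so no class is assigned probability 0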

@timbicker (Contributor Author) commented Sep 2, 2019

The problem is that check_array(X, accept_sparse=False, force_all_finite=True, dtype='int') first changes the dtype of X and then checks for nan or inf. This way, nan and inf are converted to an integer and no error is raised anymore. I am not sure if this is intended.

Hmm, I'm unable to reproduce, please provide a code snippet, thanks.

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.testing import assert_raises

rnd = np.random.RandomState(0)
X_train_nan = rnd.uniform(size=(10, 3))
X_train_nan[0, 0] = np.nan
X_train_inf = rnd.uniform(size=(10, 3))
X_train_inf[0, 0] = np.inf

for X_train in [X_train_nan, X_train_inf]:
    # Without dtype='int', check_array raises on nan/inf as expected.
    assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True)
    # With dtype='int', the same check no longer raises:
    # assert_raises(ValueError, check_array, X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    X = check_array(X_train, accept_sparse=False, force_all_finite=True, dtype='int')
    print(X[0, 0])  # nan/inf have been silently cast to an integer

If you uncomment the second assertion, you will see that no error is raised. Instead, np.nan and np.inf are converted to int.

@qinhanmin2014 (Member) commented Sep 3, 2019

Thanks, I was unable to reproduce because I used a list.
Please ignore it here. I'll open an issue and we'll fix this in check_array.

@FlorianWilhelm (Contributor) commented Sep 6, 2019

@qinhanmin2014 Thanks for your review. If you have no further comments, should @timbicker then change this PR to [MRG+1] so that others like @amueller, @NicolasHug, and @jnothman can review?

@qinhanmin2014 (Member) commented Sep 6, 2019

See the previous comment: please avoid calling check_array twice.

@timbicker (Contributor Author) commented Sep 8, 2019

I am waiting for the other PR #14872 to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

@qinhanmin2014 (Member) commented Sep 9, 2019

I am waiting for the other PR #14872 to be merged, because otherwise the tests of this PR would fail. Or should I fix it already?

which tests?

@timbicker (Contributor Author) commented Sep 9, 2019

check_estimators_nan_inf in sklearn/utils/estimator_checks.py

The output is as follows:

FEstimator doesn't check for NaN and inf in fit. CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True) 'list' argument must have no negative elements
Traceback (most recent call last):
  File "/Users/tbicker/PRs/scikit-learn/sklearn/utils/estimator_checks.py", line 1333, in check_estimators_nan_inf
    estimator.fit(X_train, y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1109, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 635, in fit
    self._count(X, Y)
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1200, in _count
    self.class_count_.shape[0])
  File "/Users/tbicker/PRs/scikit-learn/sklearn/naive_bayes.py", line 1188, in _update_cat_count
    counts = np.bincount(X_feature[mask], weights=weights)
ValueError: 'list' argument must have no negative elements

sklearn/tests/test_common.py:93 (test_estimators[CategoricalNB()-check_estimators_nan_inf1])
estimator = CategoricalNB(alpha=1.0, class_prior=None, fit_prior=True)
check = functools.partial(<function check_estimators_nan_inf at 0x1a1792a510>, 'CategoricalNB')

    @parametrize_with_checks(_tested_estimators())
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
>           check(estimator)

test_common.py:100: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/testing.py:326: in wrapper
    return fn(*args, **kwargs)
../utils/estimator_checks.py:1338: in check_estimators_nan_inf
    raise e
../utils/estimator_checks.py:1333: in check_estimators_nan_inf
    estimator.fit(X_train, y)
../naive_bayes.py:1109: in fit
    return super().fit(X, y, sample_weight=sample_weight)
../naive_bayes.py:635: in fit
    self._count(X, Y)
../naive_bayes.py:1200: in _count
    self.class_count_.shape[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X_feature = array([-9223372036854775808,                    0,                    0,
                          0,                 ...              0,
                          0,                    0,                    0,
                          0])
Y = array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1]])
cat_count = array([[0.],
       [0.]]), n_classes = 2

    def _update_cat_count(X_feature, Y, cat_count, n_classes):
        for j in range(n_classes):
            mask = Y[:, j].astype(bool)
            if Y.dtype.type == np.int64:
                weights = None
            else:
                weights = Y[mask, j]
>           counts = np.bincount(X_feature[mask], weights=weights)
E           ValueError: 'list' argument must have no negative elements

../naive_bayes.py:1188: ValueError

We assume that X contains only integer values >= 0. Due to the conversion of np.nan and np.inf to a large negative int in check_array, X contains a negative value and np.bincount consequently fails.
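
A standalone reproduction of that failure mode, independent of the estimator (the exact negative value is platform dependent):

import numpy as np

bad = np.array([np.nan, 0.0, 1.0]).astype('int')
print(bad[0])      # e.g. -9223372036854775808: nan is cast to a huge negative integer
np.bincount(bad)   # ValueError: 'list' argument must have no negative elements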

@qinhanmin2014 changed the title from [MRG] Add Categorical Naive Bayes to [MRG+1] FEA Add Categorical Naive Bayes on Sep 9, 2019
@jnothman (Member) left a comment

It would be good to test other invariances: invariance under sample permutation, invariance under class label permutation up to ties, and maybe a test for how tie breaking is done to avoid regressions.

Still to review the main code.
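
A minimal sketch of such a sample-permutation invariance test (the dataset is made up and this is not the test that was actually added to the PR):

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(0)
X = rng.randint(3, size=(30, 4))
y = rng.randint(2, size=30)
perm = rng.permutation(len(y))

clf_a = CategoricalNB().fit(X, y)
clf_b = CategoricalNB().fit(X[perm], y[perm])
# Fitting on a permuted copy of the training data must not change the predictions.
np.testing.assert_allclose(clf_a.predict_proba(X), clf_b.predict_proba(X))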

@jnothman (Member) left a comment

Otherwise LGTM. 👀

@timbicker (Contributor Author) commented Sep 23, 2019

@jnothman and @qinhanmin2014, thanks for your remarks.

@jnothman merged commit 4e9f97d into scikit-learn:master on Sep 23, 2019
19 checks passed
@jnothman (Member) commented Sep 23, 2019

Thanks and congratulations @timbicker and @FlorianWilhelm!

@FlorianWilhelm (Contributor) commented Sep 24, 2019

@timbicker, great job! Thanks to everyone involved.

@Sandy4321 commented Dec 17, 2019

Can you please clarify what you mean here, in https://github.com/scikit-learn/scikit-learn/issues/15077, and in

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

under the description of the feature distributions:

"discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution."

It seems you mean that the distributions are estimated from the data?
For a given categorical feature, are the probabilities and the conditional probabilities (of values given the target) calculated from the data?

As mentioned in
https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary
"Some people recommend using MultinomialNB, which according to me doesn't make sense because it considers feature values to be frequency counts."

You never use some known pdf to fit the data for each feature?

Do you have an example specific to categorical data, with real categorical features where several are important and several are not,
rather than this very generic example:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))
from
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

Thanks a lot in advance.
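
For reference, a small sketch of the point being asked about, assuming the fitted attributes category_count_ and feature_log_prob_ documented for CategoricalNB: the per-feature conditional probabilities P(x_i = t | y) are estimated from smoothed category counts in the training data, and no parametric pdf is fitted.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

X = np.array([[0, 1],
              [0, 2],
              [1, 1],
              [1, 0]])
y = np.array([0, 0, 1, 1])

clf = CategoricalNB(alpha=1.0).fit(X, y)
print(clf.category_count_[0])            # raw category counts for feature 0, per class
print(np.exp(clf.feature_log_prob_[0]))  # smoothed P(feature 0 = t | class)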

@@ -49,6 +51,12 @@ def _joint_log_likelihood(self, X):
         predict_proba and predict_log_proba.
         """
 
+    @abstractmethod
+    def _check_X(self, X):
@bsipocz (Contributor) Dec 31, 2019

Having this abstract method added is breaking downstream code.

Which is OK, but with the change hidden in this huge PR it was a bit more difficult than necessary to dig up whether there were any relevant comments about the change, etc.
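
A self-contained sketch of the breakage being described, using a stand-in base class rather than sklearn's actual private one: adding a new abstract method makes any existing downstream subclass that does not implement it impossible to instantiate.

from abc import ABCMeta, abstractmethod

class _BaseNBLike(metaclass=ABCMeta):   # stand-in for the private naive Bayes base class
    @abstractmethod
    def _joint_log_likelihood(self, X):
        pass

    @abstractmethod
    def _check_X(self, X):              # the newly added abstract method
        pass

class DownstreamNB(_BaseNBLike):        # downstream subclass written before this PR
    def _joint_log_likelihood(self, X):
        return X

DownstreamNB()  # TypeError: Can't instantiate abstract class DownstreamNB ...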

@jnothman (Member) Dec 31, 2019

Yes, we are insufficiently aware of the impact on downstream projects. See #15992.

I also think we neglect to consider people inheriting from our objects. Can you submit a PR to remove the abstractmethod for 0.22.1, perhaps?

@bsipocz (Contributor) Dec 31, 2019

see #15996
