New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Implement Complement Naive Bayes. #8190

Merged
merged 3 commits into from Aug 28, 2017

Conversation

Projects
None yet
4 participants
@airalcorn2
Contributor

airalcorn2 commented Jan 12, 2017

Reference Issue

N/A

What does this implement/fix? Explain your changes.

Implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB was designed to correct the "severe assumptions" made by the standard Multinomial Naive Bayes (MNB) classifier. As a result, CNB often achieves considerably better results than MNB on text classification tasks with imbalanced classes (as can be seen below); so much so that Apache Mahout includes an implementation of CNB alongside its MNB classifier. With that being the case, it would be nice to have an easily usable CNB implementation also available in scikit-learn.

Any other comments?

Results from testing on Reuters-21578 (see example code).

<class 'sklearn.naive_bayes.MultinomialNB'>
Accuracy: 0.772
Weighted Precision: 0.735
Weighted Recall: 0.772

<class 'sklearn.naive_bayes.MultinomialCNB'>
Accuracy: 0.813
Weighted Precision: 0.805
Weighted Recall: 0.813
@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Jan 13, 2017

Contributor

Just by curiosity, is CNB not equivalent to pipeline a tf-idf and an MNB?

Contributor

glemaitre commented Jan 13, 2017

Just by curiosity, is CNB not equivalent to pipeline a tf-idf and an MNB?

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jan 13, 2017

Contributor

@glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB.

Contributor

airalcorn2 commented Jan 13, 2017

@glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB.

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Jan 13, 2017

Contributor

@airalcorn2 yep, you're right, somebody is wrong on the wiki page of the MNB, omitting the part regarding the complement :)

Contributor

glemaitre commented Jan 13, 2017

@airalcorn2 yep, you're right, somebody is wrong on the wiki page of the MNB, omitting the part regarding the complement :)

@jnothman

This needs unit tests for _count, whether based on an example / toy data, or checking that invariants are held for random/challenging data.

Show outdated Hide outdated sklearn/naive_bayes.py Outdated
Show outdated Hide outdated sklearn/naive_bayes.py Outdated
@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 May 30, 2017

Contributor

@jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.

Contributor

airalcorn2 commented May 30, 2017

@jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jun 1, 2017

Member

This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.

Member

jmschrei commented Jun 1, 2017

This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jun 6, 2017

Contributor

Looks like all the tests have passed, @jmschrei.

Contributor

airalcorn2 commented Jun 6, 2017

Looks like all the tests have passed, @jmschrei.

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jun 26, 2017

Member

Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.

Member

jmschrei commented Jun 26, 2017

Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jun 27, 2017

Contributor

@jmschrei - the estimator_checks.py file currently ignores/modifies several tests for the different naive Bayes classifiers because the assumptions of these classifiers don't mesh well with the data being tested. I've updated those same tests to account for the new Complement Naive Bayes classifier.

Contributor

airalcorn2 commented Jun 27, 2017

@jmschrei - the estimator_checks.py file currently ignores/modifies several tests for the different naive Bayes classifiers because the assumptions of these classifiers don't mesh well with the data being tested. I've updated those same tests to account for the new Complement Naive Bayes classifier.

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jun 30, 2017

Member

Hi @airalcorn2, the branch is still having problems, I'm guessing due to #9131 . If you can get this PR to all tests passing again, I have time to review it and hopefully we can get it merged soon!

Member

jmschrei commented Jun 30, 2017

Hi @airalcorn2, the branch is still having problems, I'm guessing due to #9131 . If you can get this PR to all tests passing again, I have time to review it and hopefully we can get it merged soon!

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jul 1, 2017

Contributor

@jmschrei - looks like everything's actually passing this time.

Contributor

airalcorn2 commented Jul 1, 2017

@jmschrei - looks like everything's actually passing this time.

Show outdated Hide outdated sklearn/naive_bayes.py Outdated
Show outdated Hide outdated sklearn/naive_bayes.py Outdated
@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jul 3, 2017

Member

You should also add in an entry to docs/whats_new.rst

Member

jmschrei commented Jul 3, 2017

You should also add in an entry to docs/whats_new.rst

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jul 6, 2017

Contributor

@jmschrei - let me know what you think about the changes.

Contributor

airalcorn2 commented Jul 6, 2017

@jmschrei - let me know what you think about the changes.

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jul 6, 2017

Member

LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?

Member

jmschrei commented Jul 6, 2017

LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?

@jmschrei jmschrei changed the title from [MRG] Implement Complement Naive Bayes. to [MRG+1] Implement Complement Naive Bayes. Jul 6, 2017

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jul 6, 2017

Contributor

The feature_all_ attribute wasn't accounting for sample weights and also wasn't mentioned in the class docstring, so the most recent push makes those corrections.

Contributor

airalcorn2 commented Jul 6, 2017

The feature_all_ attribute wasn't accounting for sample weights and also wasn't mentioned in the class docstring, so the most recent push makes those corrections.

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jul 11, 2017

Contributor

Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.

Contributor

airalcorn2 commented Jul 11, 2017

Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jul 11, 2017

Member

It's just waiting on another reviewer, perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.

Member

jmschrei commented Jul 11, 2017

It's just waiting on another reviewer, perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.

@glemaitre

This comment has been minimized.

Show comment
Hide comment
@glemaitre

glemaitre Jul 11, 2017

Contributor
Contributor

glemaitre commented Jul 11, 2017

@glemaitre

I have the impression that there is some duplicated code between Multinomai CNB and NB (in _count and _joint_log_likelihood. @jmschrei Is it making sense to factorizing it?

Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
@jnothman

There are currently no narrative docs in doc/modules/naive_bayes.rst

Show outdated Hide outdated examples/classification/test_MultinomialCNB.py Outdated
Show outdated Hide outdated sklearn/naive_bayes.py Outdated
Show outdated Hide outdated sklearn/naive_bayes.py Outdated
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 12, 2017

Member

@airalcorn2:

Do you know the probability/timeline of this being merged?

I think this is a well known and useful NB variant. But I don't think the contribution is yet meeting our standards in terms of documentation at least. When it will be merged? Perhaps September. When it will be released? Perhaps April 2018.

Member

jnothman commented Jul 12, 2017

@airalcorn2:

Do you know the probability/timeline of this being merged?

I think this is a well known and useful NB variant. But I don't think the contribution is yet meeting our standards in terms of documentation at least. When it will be merged? Perhaps September. When it will be released? Perhaps April 2018.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 12, 2017

Member

Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.

Member

jnothman commented Jul 12, 2017

Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Jul 14, 2017

Contributor

Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.

Contributor

airalcorn2 commented Jul 14, 2017

Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 15, 2017

Member
Member

jnothman commented Jul 15, 2017

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 7, 2017

Contributor

Y'all mind taking another look, @jnothman and @glemaitre?

Contributor

airalcorn2 commented Aug 7, 2017

Y'all mind taking another look, @jnothman and @glemaitre?

@jnothman

otherwise this LGTM

Show outdated Hide outdated doc/modules/naive_bayes.rst Outdated
Show outdated Hide outdated examples/classification/plot_nb_comparison.py Outdated
def _count(self, X, Y):
"""Count feature occurrences."""
if np.any((X.data if issparse(X) else X) < 0):
raise ValueError("Input X must be non-negative")

This comment has been minimized.

@jnothman

jnothman Aug 8, 2017

Member

not tested

@jnothman

jnothman Aug 8, 2017

Member

not tested

This comment has been minimized.

@airalcorn2

airalcorn2 Aug 8, 2017

Contributor

I added a simple test to validate the counts.

@airalcorn2

airalcorn2 Aug 8, 2017

Contributor

I added a simple test to validate the counts.

This comment has been minimized.

@jnothman

jnothman Aug 14, 2017

Member

I mean that you don't currently test that this error is raised. I think.

@jnothman

jnothman Aug 14, 2017

Member

I mean that you don't currently test that this error is raised. I think.

This comment has been minimized.

@airalcorn2

airalcorn2 Aug 15, 2017

Contributor

@jnothman - I added that test.

@airalcorn2

airalcorn2 Aug 15, 2017

Contributor

@jnothman - I added that test.

Show outdated Hide outdated sklearn/tests/test_naive_bayes.py Outdated
@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 8, 2017

Contributor

Let me know if there's anything else, @jnothman.

Contributor

airalcorn2 commented Aug 8, 2017

Let me know if there's anything else, @jnothman.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 8, 2017

Member

A future import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs 3, which they don't) because we test on both versions

Member

jnothman commented Aug 8, 2017

A future import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs 3, which they don't) because we test on both versions

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 9, 2017

Contributor

@jnothman - ah, right. Added the __future__ import.

Contributor

airalcorn2 commented Aug 9, 2017

@jnothman - ah, right. Added the __future__ import.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 14, 2017

Member

Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.

Member

jnothman commented Aug 14, 2017

Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 14, 2017

Contributor

@jnothman - maybe I'm not understanding what you're asking for? The only other place where I saw "count" show up in test_naive_bayes.py is here. The tests I added are here.

Contributor

airalcorn2 commented Aug 14, 2017

@jnothman - maybe I'm not understanding what you're asking for? The only other place where I saw "count" show up in test_naive_bayes.py is here. The tests I added are here.

@jnothman

Needs adding to doc/modules/classes.rst

Show outdated Hide outdated sklearn/tests/test_naive_bayes.py Outdated

@jnothman jnothman changed the title from [MRG+1] Implement Complement Naive Bayes. to [MRG+2] Implement Complement Naive Bayes. Aug 17, 2017

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 23, 2017

Contributor

@jnothman - added ComplementNB to doc/modules/classes.rst and fixed the wording of the test comment.

Contributor

airalcorn2 commented Aug 23, 2017

@jnothman - added ComplementNB to doc/modules/classes.rst and fixed the wording of the test comment.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 23, 2017

Member

Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?

Member

jnothman commented Aug 23, 2017

Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?

@airalcorn2

This comment has been minimized.

Show comment
Hide comment
@airalcorn2

airalcorn2 Aug 24, 2017

Contributor

Perhaps that's because it's not in scikit-learn?

@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?

Contributor

airalcorn2 commented Aug 24, 2017

Perhaps that's because it's not in scikit-learn?

@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 28, 2017

Member

Merging, thanks @airalcorn2!

Member

jnothman commented Aug 28, 2017

Merging, thanks @airalcorn2!

@jnothman jnothman merged commit a571b01 into scikit-learn:master Aug 28, 2017

6 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.16%)
Details
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +3.83% compared to afb3ee7
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
@@ -91,6 +91,10 @@ Classifiers and regressors
during the first epochs of ridge and logistic regression.
:issue:`8446` by `Arthur Mensch`_.
- Added :class:`naive_bayes.ComplementNB`, which implements the Complement

This comment has been minimized.

@jnothman

jnothman Aug 28, 2017

Member

Argh! No! This is in the wrong place!

@jnothman

jnothman Aug 28, 2017

Member

Argh! No! This is in the wrong place!

.. math::
\hat{\theta}_{ci} = \frac{\sum{j:y_j \neq c} d_{ij} + \alpha_i}

This comment has been minimized.

@jnothman

jnothman Aug 28, 2017

Member

I had meant to check, but forgot: this does not compile.

Firstly, there should be _ after all the \sums.

Secondly, we at least need blank lines between successive equations.

Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!

And yet, I'm still getting TeX complaining of a runaway argument...

Are you able to check this and submit a PR to fix it?

@jnothman

jnothman Aug 28, 2017

Member

I had meant to check, but forgot: this does not compile.

Firstly, there should be _ after all the \sums.

Secondly, we at least need blank lines between successive equations.

Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!

And yet, I'm still getting TeX complaining of a runaway argument...

Are you able to check this and submit a PR to fix it?

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

@airalcorn2 airalcorn2 deleted the airalcorn2:cnb branch Aug 29, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment