# [MRG+2] Implement Complement Naive Bayes. #8190

merged 3 commits into from Aug 28, 2017

## Conversation

Contributor

### airalcorn2 commented Jan 12, 2017 • edited

#### What does this implement/fix? Explain your changes.

Implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB was designed to correct the "severe assumptions" made by the standard Multinomial Naive Bayes (MNB) classifier. As a result, CNB often achieves considerably better results than MNB on text classification tasks with imbalanced classes (as can be seen below); so much so that Apache Mahout includes an implementation of CNB alongside its MNB classifier. With that being the case, it would be nice to have an easily usable CNB implementation also available in scikit-learn.

Results from testing on Reuters-21578 (see example code).

<class 'sklearn.naive_bayes.MultinomialNB'>
Accuracy: 0.772
Weighted Precision: 0.735
Weighted Recall: 0.772

<class 'sklearn.naive_bayes.MultinomialCNB'>
Accuracy: 0.813
Weighted Precision: 0.805
Weighted Recall: 0.813

Contributor

### glemaitre commented Jan 13, 2017

 Just out of curiosity, isn't CNB equivalent to a pipeline of tf-idf and MNB?
Contributor

### airalcorn2 commented Jan 13, 2017 • edited

 @glemaitre - no, they are not equivalent. Compare equations (4) and (6) in the paper. For a given class, CNB estimates the parameters for the complement of the class. The authors suggest CNB produces weight estimates that are less biased and more stable (see Figure 1) than those produced by MNB.
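
To make the distinction concrete, here is a minimal pure-Python sketch of the complement estimate, assuming plain term counts and a single smoothing constant `alpha` per feature. The names `complement_weights` and `predict` are hypothetical and are not the code in this PR; the tf-idf transforms and weight normalization that the paper also discusses are left out.

```python
import math


def complement_weights(X, y, alpha=1.0):
    """Return {class: [log theta_ci, ...]} estimated from documents NOT in the class.

    X is a list of term-count rows and y the matching class labels.
    alpha is a per-feature smoothing count (an assumption of this sketch).
    """
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        # Sum term counts over every document whose label differs from c.
        comp = [0.0] * n_features
        for row, label in zip(X, y):
            if label != c:
                for i, count in enumerate(row):
                    comp[i] += count
        total = sum(comp) + alpha * n_features
        weights[c] = [math.log((n + alpha) / total) for n in comp]
    return weights


def predict(weights, doc):
    """Pick the class whose complement explains the document worst."""
    scores = {c: sum(t * w for t, w in zip(doc, ws)) for c, ws in weights.items()}
    return min(scores, key=scores.get)


# Toy data: feature 0 is typical of class "a", feature 1 of class "b".
X = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]]
y = ["a", "a", "b", "b"]
w = complement_weights(X, y)
print(predict(w, [2, 0, 1]))
print(predict(w, [0, 2, 1]))
```

A standard MNB would instead estimate each class's parameters from that class's own documents and pick the highest score; here the parameters come from all the other classes and the lowest score wins.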
Contributor

### glemaitre commented Jan 13, 2017

 @airalcorn2 yep, you're right. The wiki page on MNB gets this wrong by omitting the part about the complement :)

### jnothman reviewed May 27, 2017

This needs unit tests for _count, whether based on an example / toy data, or checking that invariants are held for random/challenging data.

Contributor

### airalcorn2 commented May 30, 2017

 @jnothman - I added a unit test using a toy data set. Let me know if that's not adequate.
Member

### jmschrei commented Jun 1, 2017

 This is looking pretty good to me. Thanks for the contribution! I'll check back again later when the tests are all passing.
Contributor

### airalcorn2 commented Jun 6, 2017 • edited

 Looks like all the tests have passed, @jmschrei.
Member

### jmschrei commented Jun 26, 2017

 Apologies for the delay, I've been super busy recently. Can you look into the conflicts that have arisen, and I'll get back to you soon? Again, thanks for taking the time to contribute this, we really appreciate it.
Contributor

### airalcorn2 commented Jun 27, 2017 • edited

 @jmschrei - the estimator_checks.py file currently ignores/modifies several tests for the different naive Bayes classifiers because the assumptions of these classifiers don't mesh well with the data being tested. I've updated those same tests to account for the new Complement Naive Bayes classifier.
Member

### jmschrei commented Jun 30, 2017 • edited

 Hi @airalcorn2, the branch is still having problems, I'm guessing due to #9131 . If you can get this PR to all tests passing again, I have time to review it and hopefully we can get it merged soon!
Contributor

### airalcorn2 commented Jul 1, 2017

 @jmschrei - looks like everything's actually passing this time.

### jmschrei reviewed Jul 3, 2017

Member

### jmschrei commented Jul 3, 2017

 You should also add in an entry to docs/whats_new.rst
Contributor

### airalcorn2 commented Jul 6, 2017 • edited

 @jmschrei - let me know what you think about the changes.
Member

### jmschrei commented Jul 6, 2017

 LGTM! Let's see if we can track down another reviewer, @jnothman @glemaitre maybe?

Contributor

### airalcorn2 commented Jul 6, 2017

 The feature_all_ attribute wasn't accounting for sample weights and also wasn't mentioned in the class docstring, so the most recent push makes those corrections.
Contributor

### airalcorn2 commented Jul 11, 2017

 Hey, @jmschrei. Do you know the probability/timeline of this being merged? We'd like to migrate our current Mahout Complement Naive Bayes process to Python, so I'm trying to figure out if we should just go with my fork. Thanks.
Member

### jmschrei commented Jul 11, 2017

 It's just waiting on another reviewer, perhaps @jnothman or @raghavrv or @glemaitre have time to take a look? It's not a very complicated model. Unfortunately, due to the velocity of PRs and issues being opened, we sometimes lose track of good contributions.
Contributor

### glemaitre commented Jul 11, 2017

 I will review it tonight.

### glemaitre reviewed Jul 11, 2017

I have the impression that there is some duplicated code between Multinomial CNB and NB (in _count and _joint_log_likelihood). @jmschrei, does it make sense to factor it out?

### jnothman reviewed Jul 12, 2017

There are currently no narrative docs in doc/modules/naive_bayes.rst

### jnothman reviewed Jul 12, 2017

sklearn/naive_bayes.py Outdated
Member

### jnothman commented Jul 12, 2017

> Do you know the probability/timeline of this being merged?

I think this is a well-known and useful NB variant. But I don't think the contribution yet meets our standards, at least in terms of documentation. When will it be merged? Perhaps September. When will it be released? Perhaps April 2018.
Member

### jnothman commented Jul 12, 2017

 Btw, that merge estimate might be pessimistic, and the release estimate optimistic. Hard to say.
Contributor

### airalcorn2 commented Jul 14, 2017

 Thanks for the review, @jnothman and @glemaitre. Tried to incorporate all of your feedback in the latest push. Also added narrative documentation to doc/modules/naive_bayes.rst.
Member

### jnothman commented Jul 15, 2017

 Scikit-learn does have a fetcher for Reuters btw: fetch_rcv1
Contributor

### airalcorn2 commented Aug 7, 2017

 Y'all mind taking another look, @jnothman and @glemaitre?

### jnothman reviewed Aug 8, 2017

otherwise this LGTM

    def _count(self, X, Y):
        """Count feature occurrences."""
        if np.any((X.data if issparse(X) else X) < 0):
            raise ValueError("Input X must be non-negative")

Member

not tested

#### airalcorn2 Aug 8, 2017

Contributor

I added a simple test to validate the counts.

#### jnothman Aug 14, 2017

Member

I mean that you don't currently test that this error is raised. I think.

#### airalcorn2 Aug 15, 2017

Contributor

@jnothman - I added that test.
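
The kind of check asked for in this thread can be sketched as follows. This is a pure-Python illustration, not the test added to test_naive_bayes.py: `count_features` is a hypothetical stand-in for the validation step in `_count`, and `raises_value_error` is a hypothetical helper playing the role of an assert_raises-style assertion.

```python
def count_features(X):
    """Hypothetical stand-in for the validation step in _count.

    X is a list of rows of term counts; returns per-feature totals.
    """
    if any(value < 0 for row in X for value in row):
        raise ValueError("Input X must be non-negative")
    return [sum(column) for column in zip(*X)]


def raises_value_error(func, *args):
    """Return True only if calling func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True
    return False


# Valid input is counted; a negative entry must trigger the error path.
print(count_features([[1, 0], [2, 3]]))
print(raises_value_error(count_features, [[1, 0], [-1, 3]]))
```

Validating the counts on good input and checking that bad input raises are two separate tests; coverage tools only mark the `raise` line as exercised when the second kind exists.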

Contributor

### airalcorn2 commented Aug 8, 2017

 Let me know if there's anything else, @jnothman.
Member

### jnothman commented Aug 8, 2017

 A future import couldn't cause unexpected behaviour in other tests (unless they have explicit switches for py2 vs 3, which they don't) because we test on both versions
 Implement Complement Naive Bayes. 
 366304e 
Contributor

### airalcorn2 commented Aug 9, 2017

 @jnothman - ah, right. Added the __future__ import.
Member

### jnothman commented Aug 14, 2017

 Please don't squash your commits. It makes it very hard for me to work out what code "I added a simple test to validate the counts." refers to. As it is, coveralls thinks the line still lacks coverage, and I can't see an assert_raises or similar in your code.
Contributor

### airalcorn2 commented Aug 14, 2017

 @jnothman - maybe I'm not understanding what you're asking for? The only other place where I saw "count" show up in test_naive_bayes.py is here. The tests I added are here.
 Add test for raised ValueError. 
 09b58de 

### jnothman reviewed Aug 17, 2017 • edited

### jnothman changed the title from [MRG+1] Implement Complement Naive Bayes. to [MRG+2] Implement Complement Naive Bayes. Aug 17, 2017

 Add ComplementNB to classes.rst and fix wording in assertion. 
 c58d244 
Contributor

### airalcorn2 commented Aug 23, 2017

 @jnothman - added ComplementNB to doc/modules/classes.rst and fixed the wording of the test comment.
Member

### jnothman commented Aug 23, 2017

 Now I'm okay with this. My only concern is that I'm not sure that this is much used in practice, and I keep seeing papers using MNB. Perhaps that's because it's not in scikit-learn?
Contributor

### airalcorn2 commented Aug 24, 2017

> Perhaps that's because it's not in scikit-learn?

@jnothman - that was my feeling; hence, the pull request! Is there anything else you need from me (e.g., following up)?
Member

### jnothman commented Aug 28, 2017

 Merging, thanks @airalcorn2!

### jnothman merged commit a571b01 into scikit-learn:master Aug 28, 2017 6 checks passed

### jnothman reviewed Aug 28, 2017

 @@ -91,6 +91,10 @@ Classifiers and regressors during the first epochs of ridge and logistic regression. :issue:8446 by Arthur Mensch_. - Added :class:naive_bayes.ComplementNB, which implements the Complement

#### jnothman Aug 28, 2017

Member

Argh! No! This is in the wrong place!

### jnothman reviewed Aug 28, 2017

 .. math:: \hat{\theta}_{ci} = \frac{\sum{j:y_j \neq c} d_{ij} + \alpha_i}

#### jnothman Aug 28, 2017

Member

I had meant to check, but forgot: this does not compile.

Firstly, there should be _ after all the \sums.

Secondly, we at least need blank lines between successive equations.

Thirdly, I'm not sure about the _i on alpha: it is present here, but not in the next line. I should probably double-check this with respect to the implementation!

And yet, I'm still getting TeX complaining of a runaway argument...

Are you able to check this and submit a PR to fix it?
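
For reference, one form of the complement estimate with the subscripts attached to each sum, following equation (6) of Rennie et al. (2003) as cited earlier in this thread. The denominator and the convention that alpha is the sum of the alpha_i are assumptions taken from the paper, not from the truncated snippet under review, and the question about the subscript on alpha raised above is not settled by this sketch.

```latex
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}
                         {\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}},
\qquad \alpha = \sum_{i} \alpha_i
```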

