Added Multinomial Naive Bayes classifier. #107

Closed · wants to merge 11 commits

7 participants
Contributor

amitibo commented Mar 21, 2011

Added Multinomial Naive Bayes classifier. Updated the documentation and tests.

Amit

Owner

mblondel commented Mar 22, 2011

Thank you very much for your contribution. A few remarks:

Style:

  • you don't need to add an extra space after parentheses
  • you don't need to add an extra sharp sign in your comments

Functionality:

  • predict should be implemented without predict_log_proba: the argmax is independent of the denominator
  • support for sparse matrices would be nice (then add it to the 20 newsgroup example and optimize performance)

Tests:

  • you should test that np.log(clf.predict_proba(X)) gives the same results as clf.predict_log_proba(X)
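The two review points above can be sketched in plain NumPy. This is an illustrative toy multinomial NB, not the PR's code: it shows that the argmax of the unnormalized joint log-likelihood already gives the prediction (the evidence term is constant across classes), and that `np.log` of the probabilities matches the log-probabilities as the suggested test checks.

```python
import numpy as np

# Minimal multinomial naive Bayes sketch (illustrative only, not the
# PR's code). predict() does not need predict_log_proba: subtracting
# the per-sample normalizer cannot change the per-row argmax.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(20, 10)).astype(float)  # word-count features
y = rng.randint(3, size=20)                      # class labels

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))

# Laplace-smoothed per-class word probabilities theta_ci
counts = np.array([X[y == c].sum(axis=0) + 1.0 for c in classes])
log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))

# Unnormalized joint: log p(c) + sum_i x_i * log theta_ci
joint = log_prior + X @ log_theta.T

# Normalized log-posterior (what predict_log_proba would return)
log_proba = joint - np.logaddexp.reduce(joint, axis=1, keepdims=True)
proba = np.exp(log_proba)

pred_joint = joint.argmax(axis=1)      # predict without the denominator
pred_proba = log_proba.argmax(axis=1)  # predict from the full posterior

assert np.array_equal(pred_joint, pred_proba)
assert np.allclose(np.log(proba), log_proba)  # the suggested unit test
```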
Contributor

amitibo commented Mar 22, 2011

Thanks for the tips.
Is the style issue strict? This is how I write my code.

Amit

Owner

ogrisel commented Mar 22, 2011

Thanks for the contrib.

Yes, the style is strict :) Please use the pep8 utility to spot style issues: http://pypi.python.org/pypi/pep8
As @mblondel said, please also write a test case that demonstrates the support for scipy.sparse inputs, and include the MNB model in the document classification example:

http://scikit-learn.sourceforge.net/auto_examples/document_classification_20newsgroups.html

Also, please always use lowercase variable names (except in the special case of 2D numpy arrays or scipy matrices).
We will also require a new section in the documentation about this model, referencing the examples where it is used. If I am not mistaken, the user guide currently lacks a section on Bayesian classifiers:

http://scikit-learn.sourceforge.net/user_guide.html

Such a section should be added in the doc/ folder in the source tree and referenced from the table of contents. Use the SGD or SVM documentation as a reference.

Owner

mblondel commented Mar 22, 2011

A few people in the project really like pep8 style :). Having everyone use the same style conventions makes reading the code more comfortable.

scikits/learn/naive_bayes.py
+ >>> print clf.predict(X[2])
+ 3
+
+ See also
@ogrisel

ogrisel Mar 22, 2011

Owner

The See also section lacks a reference to the GNB class explaining the difference between the two in practice.

Owner

fabianp commented Mar 30, 2011

Thanks for making the PEP8 changes.

One thing that should be improved is the docstring for the public methods: it should include at least a description of the input and output values. Note that GNB was previously lacking on this; I just fixed it in 4c1fb9f.

Other things:

  • The estimated parameter theta_c* should be documented in the docstring.
  • I'm not sure about names such as alpha_i and theta_i; I usually use such names to represent items from an array, but that is not the case here. What do you think? Also, if these are the standard names, please provide references.

Other things I noted but can wait until after the merge are in issue #108

Contributor

amitibo commented Mar 31, 2011

Thank you for the comments. I am a bit slow on the commits due to my workload, but I will take care of it.

Amit


"MNNB" is a very hard name to understand. I would like fewer acronyms in the scikit (GNB is a good example of a bad name). MultiNB or MultinomialNB would be a better name, I believe.

Owner

mblondel commented Apr 13, 2011

I completely forgot that I had started some work on a multinomial naive bayes branch. It supports sparse matrices, semi-supervised learning and complement naive bayes. It needs to be polished and it is still unclear how to handle the semi-supervised case in the API (I used fit_semi for now). I don't have time to work on that unfortunately, but feel free to start from my branch, copy snippets of code or whatever is useful to you.

Code:
https://github.com/mblondel/scikit-learn/blob/semisupervised/scikits/learn/naive_bayes.py

Unit-test:
https://github.com/mblondel/scikit-learn/blob/semisupervised/scikits/learn/tests/test_naive_bayes_semi.py

Owner

larsmans commented May 19, 2011

I did some cleanup of @mblondel's code; it's in my branch naive-bayes.

Owner

mblondel commented May 19, 2011

@larsmans Thanks a lot for doing this. We really need a robust multinomial naive bayes in the scikit. to_1_of_K should be replaced by LabelBinarizer. Also, since we didn't agree on a semi-supervised API, we might want to throw this part away and keep it for another branch. If we do keep it, we should probably have a discussion on the ML to decide the API. What do you think?
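For context, the encoding in question turns class labels into a 1-of-K (one-hot) indicator matrix, which is what LabelBinarizer provides. A minimal NumPy sketch of that encoding (the helper name here is illustrative, not the scikit's API):

```python
import numpy as np

# Sketch of the 1-of-K encoding performed by to_1_of_K and provided by
# LabelBinarizer; to_one_of_k is an illustrative name, not a real API.
def to_one_of_k(y):
    classes = np.unique(y)  # sorted distinct labels
    # Compare each label against every class: one indicator row per sample.
    Y = (y[:, np.newaxis] == classes).astype(float)
    return classes, Y

classes, Y = to_one_of_k(np.array([2, 0, 1, 2]))
# classes is [0, 1, 2]; each row of Y is a one-hot indicator,
# e.g. the first sample (label 2) becomes [0., 0., 1.]
```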

Owner

larsmans commented May 19, 2011

@mblondel: done what you suggested, removed semi-sup and ComplementNB as well. Please pull and inspect.

@@ -7,7 +7,28 @@ Naive Bayes
**Naive Bayes** algorithms are a set of supervised learning methods
based on applying Baye's theorem with strong (naive) independence
@agramfort

agramfort May 20, 2011

Owner

s/Baye's/Bayes'

@@ -7,7 +7,28 @@ Naive Bayes
**Naive Bayes** algorithms are a set of supervised learning methods
based on applying Baye's theorem with strong (naive) independence
-assumptions.
+assumptions. Given a class variable :math:`c` and a dependent set
+of feature variables :math:`f_1` through :math:`f_n`, the bayes
@agramfort

agramfort May 20, 2011

Owner

s/bayes/Bayes

+multinomial. The distribution is parametrized by the vector
+:math:`\overline{\theta_c} = (\theta_{c1},\ldots,\theta_{cn})` where :math:`c`
+is the class of document, :math:`n` is the size of the vocabulary and :math:`\theta_{ci}`
+is the prbability of word :math:`i` appearing in a document of class :math:`c`.
@agramfort

agramfort May 20, 2011

Owner

s/prbability/probability

+ """
+ Multinomial Naive Bayes (MultinomialNB)
+
+ The Multinomial Naive Bayes classifier is suitable for text classification.
@agramfort

agramfort May 20, 2011

Owner

is it really only suitable for text?

+ Examples
+ --------
+ >>> import numpy as np
+ >>> X = np.random.randint( 5, size=(6, 100) )
@agramfort

agramfort May 20, 2011

Owner

no space after "(" and before ")"

Owner

larsmans commented May 22, 2011

I discussed with @amitibo and have adopted this branch. See new pull request; closing this one.

@larsmans larsmans closed this May 22, 2011
