
DOC expand Naive Bayes narrative doc (BernoulliNB formula)

1 parent 3fa52bf commit b260880495c4862c95ed67e96652ed2ffbb5550b @larsmans larsmans committed Oct 27, 2011
Showing with 23 additions and 9 deletions.
  1. +23 −9 doc/modules/naive_bayes.rst
@@ -38,7 +38,7 @@ we can use the following classification rule:
\Downarrow
- \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(f_i \mid y),
+ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),
and we can use Maximum A Posteriori (MAP) estimation to estimate
:math:`P(y)` and :math:`P(x_i \mid y)`;
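The MAP rule above can be sketched in plain Python. This is a toy illustration with made-up class priors and per-class likelihoods (here Bernoulli likelihoods over binary features, for concreteness), not scikit-learn's implementation; sums of logs are used instead of products of probabilities for numerical stability.

```python
import math

# Hypothetical class priors P(y) and per-class feature probabilities
# P(x_i = 1 | y) for two classes and three binary features (made-up numbers).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.1, 0.7],
    "ham":  [0.2, 0.6, 0.3],
}

def predict(x):
    """Return argmax_y [log P(y) + sum_i log P(x_i | y)] for a binary vector x."""
    best_class, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for x_i, p in zip(x, likelihoods[y]):
            # P(x_i | y) for a Bernoulli feature: p if x_i = 1, else 1 - p
            score += math.log(p if x_i else 1.0 - p)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict([1, 0, 1]))  # "spam": this pattern matches the spam likelihoods
print(predict([0, 1, 0]))  # "ham"
```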
@@ -62,6 +62,10 @@ distribution can be independently estimated as a one dimensional distribution.
This in turn helps to alleviate problems stemming from the curse of
dimensionality.
+On the flip side, although Naive Bayes is known as a decent classifier,
+it is known to be a bad estimator, so the probability outputs from
+``predict_proba`` are not to be taken too seriously.
+
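One way to see why the probabilities are unreliable: the independence assumption multiplies evidence from correlated features as if it were new information. The toy sketch below (made-up numbers, not scikit-learn code) duplicates a single binary feature; the posterior is pushed toward certainty even though no information was added.

```python
import math

def posterior(x, p_pos, p_neg, prior=0.5):
    """P(class = + | x) for binary features under the naive independence
    assumption (toy sketch with made-up probabilities)."""
    joint_pos = prior * math.prod(p if xi else 1 - p for xi, p in zip(x, p_pos))
    joint_neg = (1 - prior) * math.prod(p if xi else 1 - p for xi, p in zip(x, p_neg))
    return joint_pos / (joint_pos + joint_neg)

# One informative feature: P(x=1 | +) = 0.8, P(x=1 | -) = 0.2.
print(posterior([1], [0.8], [0.2]))                # about 0.8
# The same feature copied three times (perfectly correlated, so no new
# information) -- naive Bayes multiplies the "evidence" anyway:
print(posterior([1, 1, 1], [0.8] * 3, [0.2] * 3))  # about 0.985
```

The argmax (and hence ``predict``) is often unaffected by this over-confidence, which is why naive Bayes can rank well while estimating badly.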
.. topic:: References:
* H. Zhang (2004). `The optimality of Naive Bayes.
@@ -130,17 +134,27 @@ Bernoulli Naive Bayes
:class:`BernoulliNB` implements the Naive Bayes training and classification
algorithms for data that is distributed according to multivariate Bernoulli
-distributions. It requires samples to be represented as binary-valued/boolean
-feature vectors; if handed any other kind of data, it binarizes it (depending
-on the ``binarize`` parameter).
+distributions; i.e., there may be multiple features but each one is assumed
+to be a binary-valued (Bernoulli, boolean) variable.
+Therefore, this class requires samples to be represented as binary-valued
+feature vectors; if handed any other kind of data, a ``BernoulliNB`` instance
+may binarize its input (depending on the ``binarize`` parameter).
+
+The decision rule for Bernoulli Naive Bayes is based on
+
+.. math::
+
+   P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)
+
+which differs from multinomial NB's rule
+in that it explicitly penalizes the non-occurrence of a feature :math:`i`
+that is an indicator for class :math:`y`,
+where the multinomial variant would simply ignore a non-occurring feature.
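The contrast between the two per-feature terms can be sketched in log space. This is a toy comparison with a made-up probability, not scikit-learn code; for an occurrence vector (:math:`x_i \in \{0, 1\}`), multinomial NB contributes :math:`x_i \log P(i \mid y)`, so an absent feature contributes nothing.

```python
import math

def bernoulli_log_term(x_i, p):
    """log P(x_i | y) = log(p * x_i + (1 - p) * (1 - x_i)) for binary x_i,
    where p = P(i | y) is the probability that feature i occurs in class y."""
    return math.log(p * x_i + (1 - p) * (1 - x_i))

def multinomial_log_term(x_i, p):
    """Multinomial NB contributes x_i * log(p): a zero count adds nothing."""
    return x_i * math.log(p)

p = 0.7  # made-up P(i | y): feature i is a strong indicator for class y

# Feature occurs: both models add the same reward, log(0.7).
print(bernoulli_log_term(1, p), multinomial_log_term(1, p))
# Feature absent: Bernoulli adds the penalty log(1 - p) = log(0.3) < 0,
# while multinomial adds 0 and simply ignores the non-occurrence.
print(bernoulli_log_term(0, p), multinomial_log_term(0, p))
```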
In the case of text classification, word occurrence vectors (rather than word
count vectors) may be used to train and use this classifier. ``BernoulliNB``
-might perform better on some datasets, especially those with shorter documents,
-because it explicitly penalizes the non-occurrence of words/features in a
-dataset where ``MultinomialNB`` would only notice a zero count, but for text
-classification ``MultinomialNB`` will generally be better. It is advisable to
-evaluate both models, if time permits.
+might perform better on some datasets, especially those with shorter documents.
+It is advisable to evaluate both models, if time permits.
.. topic:: References:
