
[MRG] Generative Classification #2468

Open (wants to merge 20 commits)

Member

jakevdp commented Sep 21, 2013

This PR adds a simple meta-estimator which accepts any generative model (normal approximation, GMM, KernelDensity, etc.) and uses it to construct a generative Bayesian classifier.

Todo:

  • code documentation
  • narrative docs
  • testing
  • examples
  • allow class-wise cross validation for the density model?
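The idea in the description can be sketched roughly as follows (a hypothetical minimal version with made-up names and modern sklearn import paths, not the PR's actual code): fit one density model per class, then classify by maximum posterior.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.neighbors import KernelDensity


class SimpleGenerativeBayes(BaseEstimator, ClassifierMixin):
    """Hypothetical sketch: one density model per class, Bayes-rule prediction."""

    def __init__(self, density_estimator=None):
        self.density_estimator = density_estimator

    def fit(self, X, y):
        base = self.density_estimator if self.density_estimator is not None else KernelDensity()
        self.classes_ = np.unique(y)
        # class priors P(y), estimated from label frequencies
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # one independently fitted density model per class
        self.estimators_ = [clone(base).fit(X[y == c]) for c in self.classes_]
        return self

    def predict(self, X):
        # log P(x | y) + log P(y) for each class; pick the argmax
        log_joint = np.array([e.score_samples(X) for e in self.estimators_]).T
        log_joint += np.log(self.priors_)
        return self.classes_[np.argmax(log_joint, axis=1)]
```

Any estimator exposing `score_samples` (KernelDensity, a GMM, a normal approximation) slots in as the `density_estimator`.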
Owner

ogrisel commented Sep 22, 2013

I think to make the discussion more fruitful it would be great to provide some examples on datasets where such models are actually useful, either from a pure classification performance point of view or, more likely, as samplers to generate new labeled samples for specific classes (a bit like you did with this KDE sampling example for digits).

Member

jakevdp commented Sep 22, 2013

...or more likely as samplers to generate new labeled samples for specific classes

Ah, I hadn't even thought of that possibility! Yes, we could implement a sample routine, which would use the underlying models to return a random set of new observations fitting the training data. Great idea!

I'll work on some examples soon to make the utility of this approach more clear.
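The proposed `sample` routine could look something like this rough sketch (a hypothetical helper, assuming per-class density models that expose a `sample` method, as `KernelDensity` does): draw a class from the priors, then draw a point from that class's fitted density.

```python
import numpy as np


def sample_generative(estimators, priors, classes, n_samples, random_state=0):
    """Hypothetical sketch of the proposed `sample` routine: pick a class
    label from the priors, then sample from that class's density model."""
    rng = np.random.RandomState(random_state)
    labels = rng.choice(classes, size=n_samples, p=priors)
    class_index = {c: i for i, c in enumerate(classes)}
    X_new = np.vstack([estimators[class_index[c]].sample(1, random_state=rng)
                       for c in labels])
    return X_new, labels
```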

Member

jakevdp commented Sep 22, 2013

I added doc strings and tests. An incompatibility came up in the case of GMM: I opened an issue at #2473.

@mblondel mblondel and 1 other commented on an outdated diff Sep 23, 2013

sklearn/generative.py
+ additional keyword arguments to be passed to the constructor
+ specified by density_estimator.
+ """
+ def __init__(self, density_estimator, **kwargs):
+ self.density_estimator = density_estimator
+ self.kwargs = kwargs
+
+ # run this here to check for any exceptions; we avoid assigning
+ # the result here so that the estimator can be cloned.
+ self._choose_estimator(density_estimator, **kwargs)
+
+ def _choose_estimator(self, density_estimator, **kwargs):
+ if isinstance(density_estimator, str):
+ dclass = MODEL_TYPES.get(density_estimator)
+ return dclass(**kwargs)
+ elif isinstance(density_estimator, type):
@mblondel

mblondel Sep 23, 2013

Owner

Looks like type is undefined.

@jnothman

jnothman Sep 23, 2013

Owner

it's a builtin

Owner

larsmans commented Sep 25, 2013

I don't think this should be combined with Naive Bayes, except in the docs. The charm of Naive Bayes lies in its speed and simple code, no need to mess with that.

@larsmans larsmans and 1 other commented on an outdated diff Sep 25, 2013

sklearn/generative.py
@@ -0,0 +1,258 @@
+"""
+Bayesian Generative Classification
+==================================
+This module contains routines for general Bayesian generative classification.
+Perhaps the best-known instance of generative classification is the Naive
+Bayes Classifier, in which the distribution of each training class is
+approximated by an axis-aligned multi-dimensional normal distribution, and
+unknown points are evaluated by comparing their posterior probability under
+each model.
@larsmans

larsmans Sep 25, 2013

Owner

There's no such thing as "the" Naive Bayes classifier. You're thinking of Gaussian NB, but NLP people will variously think of Bernoulli or multinomial NB. (The first time I encountered the Gaussian variant was while reading sklearn source code :)

@jakevdp

jakevdp Sep 25, 2013

Member

Interesting! The first time I encountered any version other than Gaussian NB was reading the sklearn source code 😄

Owner

larsmans commented Sep 25, 2013

On second thought: @jakevdp, is it too much of a stretch to merge this thing into the naive_bayes module? I guess it's not really "naive" in the NB sense, but it would remove some clutter in the top-level module. Also, take a look at the NB narrative docs, which explain pretty much the same thing that you're explaining in the module docstring.

Member

jakevdp commented Sep 25, 2013

I initially thought about putting this within sklearn.naive_bayes (given that it inherits from BaseNB!) but didn't because, though it is Bayesian, it's distinctly not naive in the sense that the term is used. If we could start over, it would make more sense to have a submodule for generative classification of which naive Bayes is a part, rather than the other way around. But given that we've made the API choice to have a naive_bayes submodule, I thought it would be less confusing to put general generative classification in its own module.

Regardless of where the code goes, I had envisioned combining the narrative documentation for the two: as you mention, we can adapt the theoretical background currently put under the heading of Naive Bayes and show how it applies in both the Naive and the general case.

Owner

larsmans commented Sep 25, 2013

it would make more sense to have a submodule for generative classification of which naive bayes is a part

True, but I've seen other people's production codebases that depend on MultinomialNB being in naive_bayes.py, and I'd have some explaining to do if we broke that :p

Combining the narratives was the main thing I was aiming at. It's your call to decide whether it fits well enough to also combine the code.

(FYI, I see you're using BaseNB. I've been thinking about killing that, because MultinomialNB and BernoulliNB can be implemented more straightforwardly as pure linear models, sharing no code with GaussianNB.)

Member

jakevdp commented Nov 23, 2013

@larsmans - I ended up following your advice and moving everything into the naive_bayes submodule. That location might be a bit misleading, but I think it is cleaner.

Still some tests failing... I'm going to try to fix those.

Member

jakevdp commented Dec 10, 2013

I think this is pretty close to finished now. I added narrative documentation, examples, and the tests should pass.

One missing feature that would be really helpful would be the ability to do class-wise cross-validation of the density estimators within GenerativeBayes. I'm not sure what the right interface would be for that, however... any ideas?

Member

jakevdp commented Dec 10, 2013

Hmm... is there any way that program state can affect the results of cross_val_score? It fails here, but passes on my machine, and passes when I run the code alone. There doesn't seem to be any random element that would affect it... that's really strange.

Member

jakevdp commented Dec 10, 2013

Ah - looks like it was something that had changed in master. I'll adjust the tests so that they will pass.

Coverage Status

Coverage remained the same when pulling 061f3fb on jakevdp:generative_class into ffde690 on scikit-learn:master.

Member

jakevdp commented Dec 10, 2013

Changing status to MRG: I think this is ready for a final review, unless we want to add class-wise cross-validation at this time.

@ogrisel ogrisel commented on an outdated diff Dec 11, 2013

doc/modules/naive_bayes.rst
+density model to each category to estimate :math:`P(x_i \mid y)`. Some
+examples of more flexible density models are:
+
+- :class:`sklearn.neighbors.KernelDensity`: discussed in :ref:`kernel_density`
+- :class:`sklearn.mixture.GMM`: discussed in :ref:`clustering`
+
+Though it can be much more computationally intense,
+using one of these models rather than a naive Gaussian model can lead to much
+better generative classifiers, and can be especially applicable in cases of
+unbalanced data where accurate posterior classification probabilities are
+desired.
+
+.. figure:: ../auto_examples/images/plot_1d_generative_classification_1.png
+ :target: ../auto_examples/plot_1d_generative_classification.html
+ :align: center
+ :scale: 50%from the training data
@ogrisel

ogrisel Dec 11, 2013

Owner

from the training data?

@ogrisel ogrisel commented on an outdated diff Dec 11, 2013

doc/modules/naive_bayes.rst
+ >>> from sklearn.datasets import make_blobs
+ >>> X, y = make_blobs(10, centers=2, random_state=0)
+ >>> clf = GenerativeBayes(density_estimator='kde')
+ >>> clf.fit(X, y)
+ >>> clf.predict(X)
+ array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
+ >>> y
+ array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
+
+The KDE-based Generative classifier for this problem has 100% accuracy on
+the training data.
+The specified density estimator can be ``'kde'``, ``'gmm'``, ``'norm_approx'``,
+or a custom class which has the same semantics as
+:class:`sklearn.neighbors.KernelDensity` (see the documentation of
+:class:`GenerativeBayes` for details).
+
@ogrisel

ogrisel Dec 11, 2013

Owner

Your explanation is clear, but I think it would be great if you could find a good online reference from the literature for people who want to dig further.

@ogrisel ogrisel commented on an outdated diff Dec 11, 2013

sklearn/naive_bayes.py
+ 'gmm': GMM,
+ 'kde': KernelDensity}
+
+
+class GenerativeBayes(BaseNB):
+ """
+ Generative Bayes Classifier
+
+ This is a meta-estimator which performs generative Bayesian classification
+ using flexible underlying density models.
+
+ Parameters
+ ----------
+ density_estimator : str, class, or instance
+ The density estimator to use for each class. Options are
+ 'norm_approx' : Normal Approximation (i.e. naive Bayes)
@ogrisel

ogrisel Dec 11, 2013

Owner

I think using 'normal_approximation' would be a more explicit name. If you do the change, don't forget to update the narrative doc.

@ogrisel

ogrisel Dec 11, 2013

Owner

Also I would be more explicit by replacing: "Normal Approximation (i.e. naive Bayes)" by "Axis-aligned Normal Approximation (i.e. Gaussian naive Bayes)"

@ogrisel

ogrisel Dec 11, 2013

Owner

If 'normal_approximation' is too long, at least 'normal_approx' instead of norm_approx which I find too confusing.

Owner

ogrisel commented Dec 11, 2013

What about the "allow class-wise cross validation for the density model" item in your todo list?

Owner

ogrisel commented Dec 11, 2013

Is GenerativeBayes(density_estimator="norm_approx") strictly equivalent to GaussianNB (speed, public API including fitted attributes)?

If so, why not mark GaussianNB as deprecated in favor of GenerativeBayes(density_estimator="norm_approx")?

Member

jakevdp commented Dec 11, 2013

Hi @ogrisel - thanks for the comments. A few responses:

  • I'll change norm_approx to normal_approximation and update the docs.
  • Regarding cross-validation: I've been thinking about the most intuitive way to do this: we could build the functionality into GenerativeBayes, but then it wouldn't be available outside the class. It might be more useful in the long-run to create a new DensityEstimatorMixin class similar to ClassifierMixin and RegressorMixin which would contain some sort of cross-validation tool (as well as computing score from score_samples, and other generally applicable methods). Then any estimator which works in GenerativeBayes could inherit from this, and use the cross-validation from there.
  • Regarding deprecating GaussianNB: I hadn't considered this, primarily because I thought the new tool would be slower for the gaussian NB case. I initially added normal_approximation just for ease of testing against GaussianNB. But I did some benchmarks, and the new method seems to be marginally faster than GaussianNB, and it returns the same results by construction. Given that, deprecation might be worth considering.
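The `DensityEstimatorMixin` idea in the second bullet might look roughly like this (hypothetical class and method names, modern import paths; `score` derived from `score_samples` as described, plus a convenience tuning helper):

```python
from sklearn.model_selection import GridSearchCV


class DensityEstimatorMixin:
    """Hypothetical mixin: shared conveniences for density estimators."""

    def score(self, X, y=None):
        # total log-likelihood of the data under the fitted model
        return self.score_samples(X).sum()

    def tune(self, X, param_grid, cv=3):
        # grid-search this estimator's own hyperparameters, scored by
        # held-out log-likelihood, and return the refitted best model
        search = GridSearchCV(self, param_grid, cv=cv)
        search.fit(X)
        return search.best_estimator_
```

Any estimator usable inside GenerativeBayes could then inherit from this, and the cross-validation would also be available on its own.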
Member

jakevdp commented Dec 11, 2013

There is one difference between GenerativeBayes('normal_approximation') and GaussianNB, however: GenerativeBayes doesn't have sigma_ and theta_ attributes exposed. We could do this with some model introspection, however...

Owner

ogrisel commented Dec 11, 2013

How much faster is marginally faster? Is it a fixed ratio, or does it depend on n_samples / n_features?

Owner

ogrisel commented Dec 11, 2013

I am not sure what you mean by class-wise CV, but I agree this can always be tackled in another PR later.

Member

jakevdp commented Dec 11, 2013

By class-wise CV I mean this: the GenerativeBayes classifier fits a density estimation (i.e. normal approximation, KDE, GMM, etc.) to the distribution of training points for each class. That is, for data with three classes, it fits KDE three times to subsets of the data. Currently, you have to choose the same hyper-parameters for each, which is not optimal. It would be best to do separate cross-validation on each of the three density estimators. This is what I mean by class-wise CV: I'm not sure what the best interface is for something like this.
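As a rough illustration of what class-wise CV means in practice, one can already tune each per-class density by hand (a hypothetical helper with modern import paths; the open question in the thread is how to fold this into the estimator's API):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def fit_classwise_densities(X, y, param_grid=None, cv=3):
    """Hypothetical sketch of class-wise CV: tune the density model's
    hyperparameters independently on each class's subset of the data."""
    if param_grid is None:
        param_grid = {'bandwidth': np.logspace(-1, 1, 5)}
    estimators = {}
    for c in np.unique(y):
        # candidates are scored by held-out log-likelihood on this class only
        search = GridSearchCV(KernelDensity(), param_grid, cv=cv)
        search.fit(X[y == c])
        estimators[c] = search.best_estimator_
    return estimators
```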

Member

jakevdp commented Dec 11, 2013

Hi,
I addressed all of @ogrisel's comments.

Regarding the benchmarks, it seems that GenerativeBayes is ~10 percent slower for small problems, and ~10 percent faster for large problems (see the timings in this notebook).

The speed difference in the small case likely comes from the overhead of creating multiple estimator instances in GenerativeBayes. The speed difference in the large case likely comes from the fact that GenerativeBayes constructs each masked array only once, while GaussianNB, as currently written, constructs it twice (this is silly and should be fixed regardless: see line 170).

Coverage Status

Coverage remained the same when pulling 476f3ba on jakevdp:generative_class into ffde690 on scikit-learn:master.

Member

jakevdp commented Dec 11, 2013

Fixed the GaussianNB thing in #2659. Once it's merged I'll re-do the benchmark script.

@mblondel mblondel and 1 other commented on an outdated diff Dec 12, 2013

sklearn/naive_bayes.py
+ Training data. shape = [n_samples, n_features]
+
+ y : array-like
+ Target values, array of float values, shape = [n_samples]
+ """
+ X, y = check_arrays(X, y, sparse_format='dense')
+ y = column_or_1d(y, warn=True)
+
+ estimator = self._choose_estimator(self.density_estimator,
+ self.model_kwds)
+
+ self.classes_ = np.sort(np.unique(y))
+ n_classes = len(self.classes_)
+ n_samples, self.n_features_ = X.shape
+
+ masks = [(y == c) for c in self.classes_]
@mblondel

mblondel Dec 12, 2013

Owner

You could use LabelBinarizer or label_binarize from the preprocessing module.

@jakevdp

jakevdp Dec 12, 2013

Member

The output of label_binarize would have to be converted to boolean, though... I'm not sure that would be either more efficient or more readable. What do you think?

@mblondel

mblondel Dec 12, 2013

Owner

Indeed, you're right. I guess a comment like "class membership masks" would help understanding.

@mblondel mblondel commented on an outdated diff Dec 12, 2013

sklearn/naive_bayes.py
+ Parameters
+ ----------
+ density_estimator : str, class, or instance
+ The density estimator to use for each class. Options are
+ - 'normal_approximation' : Axis-aligned Normal Approximation
+ (i.e. Gaussian Naive Bayes)
+ - 'gmm' : Gaussian Mixture Model
+ - 'kde' : Kernel Density Estimate
+ The default is 'normal_approximation'.
+ Alternatively, a class or class instance can be specified. The
+ instantiated class should be a sklearn estimator, and contain a
+ ``score_samples`` method with semantics similar to those in
+ :class:`sklearn.neighbors.KDE`.
+ model_kwds : dict or None
+ Additional keyword arguments to be passed to the constructor
+ specified by density_estimator. Default=None.
@mblondel

mblondel Dec 12, 2013

Owner

Could you document the fitted attributes?

@mblondel mblondel commented on an outdated diff Dec 12, 2013

sklearn/naive_bayes.py
+
+DENSITY_MODELS = {'normal_approximation': _NormalApproximation,
+ 'gmm': GMM,
+ 'kde': KernelDensity}
+
+
+class GenerativeBayes(BaseNB):
+ """
+ Generative Bayes Classifier
+
+ This is a meta-estimator which performs generative Bayesian classification
+ using flexible underlying density models.
+
+ Parameters
+ ----------
+ density_estimator : str, class, or instance
@mblondel

mblondel Dec 12, 2013

Owner

Do you need to support classes? I think the rest of the scikit usually only supports instances. If there's no compelling reason, I'd rather remove the feature so as to not create any inconsistencies in user code.

Owner

mblondel commented Dec 12, 2013

The user guide is really nice. I'm totally sold !

@mblondel mblondel commented on an outdated diff Dec 12, 2013

doc/modules/naive_bayes.rst
+This type of classification can be performed with the :class:`GenerativeBayes`
+estimator. The estimator can be used very easily:
+
+ >>> from sklearn.naive_bayes import GenerativeBayes
+ >>> from sklearn.datasets import make_blobs
+ >>> X, y = make_blobs(10, centers=2, random_state=0)
+ >>> clf = GenerativeBayes(density_estimator='kde')
+ >>> clf.fit(X, y)
+ GenerativeBayes(density_estimator='kde', model_kwds=None)
+ >>> clf.predict(X)
+ array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
+ >>> y
+ array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
+
+The KDE-based Generative classifier for this problem has 100% accuracy on
+the training data.
@mblondel

mblondel Dec 12, 2013

Owner

I'd rather use test data if possible (achieving 100% accuracy on training data is not necessarily a good sign). Also, could you say a few words on how to prevent overfitting? For example, when using GMM as the base density estimator, n_components should not be set too high.

Member

jakevdp commented Dec 12, 2013

Thanks for the comments @mblondel - I'll address these soon!

Coverage Status

Coverage remained the same when pulling 4d23183 on jakevdp:generative_class into ffde690 on scikit-learn:master.


Member

jakevdp commented Dec 12, 2013

Addressed @mblondel's comments, except for the suggestion to add a note about over-fitting.

I'm realizing that this really shouldn't be considered complete without a way to cross-validate the density model for each class. A few ideas for how to approach this:

  • build cross-validation machinery into GenerativeBayes. Advantage: simple and straightforward. Disadvantage: people might want the functionality outside the classifier.
  • create a DensityEstimator mixin that contains a cross-validation routine for each density estimator. Advantage: the automated cross-validation could then be used outside GenerativeBayes. Disadvantage: perhaps confusing? Not all estimators have CV built-in.
  • expose per-class estimator attributes, in much the same way that Pipeline objects expose the underlying attributes of their steps. (for example, with three KDE estimators, you might allow passing bandwidth as an array, which will be spread among the estimators). Advantage: this would allow the cross-validation to be performed by the user, which is more typical of the scikit-learn interface. Disadvantage: the final classification score is not the right metric for the underlying estimators... you'd end up having to hack the score function and perform multiple grid searches by-hand to do it correctly.

I'd love to hear any thoughts you have on this: the best path is not entirely apparent to me. Are there any other meta-estimators in the package where underlying estimators are cross-validated independently?

Owner

mblondel commented Dec 12, 2013

In practice, do you observe much better performance by tuning the parameters for each class?

Member

jakevdp commented Dec 12, 2013

In practice, do you observe much better performance by tuning the parameters for each class?

I actually haven't tried that in particular, but I'm anticipating such a request from users! I think in an extremely unbalanced problem, it would probably make a difference.

Owner

ogrisel commented Dec 12, 2013

I am not sure I fully understand the tradeoff. I think I need to see some code for the CV of such nested models to better grasp it and give you feedback. Maybe you could implement:

1- build cross-validation machinery into GenerativeBayes. Advantage: simple and straightforward. Disadvantage: people might want the functionality outside the classifier.

as a start, and we can then discuss whether we should remove it or refactor it into one of the other two options?

Member

jakevdp commented Dec 12, 2013

In practice, do you observe much better performance by tuning the parameters for each class?

I haven't actually checked this, but I'd imagine that in the case of an unbalanced dataset, it could make a difference.

Owner

ogrisel commented Dec 12, 2013

I get the following error when running the example to build the doc:

Traceback (most recent call last):
  File "examples/plot_1d_generative_classification.py", line 56, in <module>
    clf = GenerativeBayes(density_estimator=density_estimators[i])
  File "/Users/ogrisel/code/scikit-learn/sklearn/naive_bayes.py", line 710, in __init__
    self._choose_estimator(density_estimator, self.model_kwds)
  File "/Users/ogrisel/code/scikit-learn/sklearn/naive_bayes.py", line 722, in _choose_estimator
    raise ValueError('invalid density_estimator')
ValueError: invalid density_estimator

The error message should be more explicit and include the name of the passed estimator (or its str representation) and the reason why it's not valid.

In this case we are passing an instance and it looks up a class. I guess this code needs to be updated, and a test needs to be added to cover that invalid-input check.

@ogrisel ogrisel commented on an outdated diff Dec 12, 2013

sklearn/naive_bayes.py
+
+ # run this here to check for any exceptions; we avoid assigning
+ # the result here so that the estimator can be cloned.
+ self._choose_estimator(density_estimator, self.model_kwds)
+
+ def _choose_estimator(self, density_estimator, kwargs=None):
+ """Choose the estimator based on the input"""
+ dclass = DENSITY_MODELS.get(density_estimator)
+
+ if dclass is not None:
+ if kwargs is None:
+ kwargs = {}
+ density_estimator = dclass(**kwargs)
+
+ if not hasattr(dclass, 'score_samples'):
+ raise ValueError('invalid density_estimator')
@ogrisel

ogrisel Dec 12, 2013

Owner

This should better be:

        if not hasattr(density_estimator, 'score_samples'):
            raise TypeError('Invalid density_estimator: %s. Missing required '
                            'score_samples method.' % density_estimator)

@ogrisel ogrisel commented on an outdated diff Dec 12, 2013

doc/modules/naive_bayes.rst
+
+ >>> from sklearn.naive_bayes import GenerativeBayes
+ >>> from sklearn.datasets import make_blobs
+ >>> X, y = make_blobs(100, centers=2, random_state=0)
+ >>> clf = GenerativeBayes(density_estimator='kde')
+ >>> clf.fit(X[:-10], y[:-10])
+ GenerativeBayes(density_estimator='kde', model_kwds=None)
+ >>> clf.predict(X[-10:])
+ array([1, 1, 1, 1, 0, 0, 1, 1, 0, 1])
+ >>> y[-10:]
+ array([1, 1, 1, 1, 0, 0, 1, 1, 0, 1])
+
+The KDE-based Generative classifier for this problem has 100% accuracy on
+this small subset of test data.
+The specified density estimator can be ``'kde'``, ``'gmm'``,
+``'normal_approximation'``, or any class or estimator
@ogrisel

ogrisel Dec 12, 2013

Owner

"any class or estimator" => "any estimator" if we drop the class support.

@ogrisel ogrisel commented on an outdated diff Dec 12, 2013

doc/modules/naive_bayes.rst
+points drawn from the model.
+
+This type of generative model can be used in higher dimensions to do some
+very interesting analysis. For example, here's a generative bayes model
+which uses kernel density estimation trained on the digits dataset. The
+top panel shows a selection of the input digits, while the bottom panel
+shows draws from the class-wise probability distributions. These give an
+intuitive feel to what the model "thinks" each digit looks like:
+
+.. figure:: ../auto_examples/images/plot_generative_sampling_2.png
+ :target: ../auto_examples/plot_generative_sampling.html
+ :align: center
+ :scale: 50%
+
+This result can be compared to the
+`similar figure <../auto_examples/neighbors/plot_digits_kde_sampling.html`_
@ogrisel

ogrisel Dec 12, 2013

Owner

Missing ">" before the "`".

Member

jakevdp commented Dec 12, 2013

Thanks @ogrisel. I've addressed all your comments.

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.

Coverage Status

Coverage remained the same when pulling 3f8666a on jakevdp:generative_class into aa8139b on scikit-learn:master.

Owner

ogrisel commented Dec 16, 2013

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.

I am not sure that will work, as the number of sub-estimators depends on the number of classes. The list of sub-estimators in the estimators_ attribute is therefore only generated once we see the data in fit, so as to be able to extract the number of classes or features from the data shape. On the other hand, the grid search tooling manipulates the model and its parameters independently of the data, in particular prior to any call to fit. Hence we have a design mismatch. Maybe it would be possible to hack get/set_params to store the sub-estimator parameters on the GenerativeBayes object itself and delay the recursive set_params call on the sub-estimator objects until fit time.
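A sketch of that delayed-set_params idea (hypothetical class and parameter names; per-class settings stored as a single plain constructor parameter so cloning and grid search keep working, and forwarded to the sub-estimators only inside fit):

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.neighbors import KernelDensity


class LazyParamsGenerativeBayes(BaseEstimator):
    """Hypothetical sketch: per-class hyperparameters are stored up front
    and applied to the sub-estimators only at fit time, once the number
    of classes is known."""

    def __init__(self, density_estimator=None, class_params=None):
        self.density_estimator = density_estimator
        # e.g. {0: {'bandwidth': 0.5}, 1: {'bandwidth': 2.0}}
        self.class_params = class_params

    def fit(self, X, y):
        base = self.density_estimator if self.density_estimator is not None else KernelDensity()
        params = self.class_params or {}
        self.classes_ = np.unique(y)
        self.estimators_ = []
        for c in self.classes_:
            est = clone(base)
            # the delayed set_params call happens here, inside fit
            est.set_params(**params.get(c, {}))
            self.estimators_.append(est.fit(X[y == c]))
        return self
```

Because `class_params` is an ordinary constructor parameter, a grid search could still enumerate candidate per-class settings as whole dicts.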

Member

jakevdp commented Dec 16, 2013

yes, I ran into that mismatch when I gave this strategy a shot. I'll think about your idea of hacking get/set_params, but I'm starting to think that just providing a CV tool within GenerativeBayes itself might be the answer.

Owner

ogrisel commented Dec 18, 2013

That might indeed be a better way. Note, however, that we have a similar issue for multi-class or multi-label classifiers that implement the OvR strategy by combining n_classes binary classifiers. It is possible that per-classifier hyperparameter tuning (e.g. regularization strength) would be beneficial for the overall performance of the model. @mblondel @pprett might want to pitch in.

Owner

mblondel commented Dec 18, 2013

I don't have any experience with tuning each binary classifier separately. One concern I have is that each binary classifier may produce predictions with different scales (e.g. one with predictions in [-1, 1], another one with predictions in [-5, 5]) and thus the argmax rule might not work at all.

In any case, this is a combinatorial search and thus randomized search seems the way to go.

jgbos commented Jan 15, 2014

Hey guys, I hope I'm not just wasting space in your inbox. I've tried to follow this discussion, but wanted to provide a couple of notes from a user. I have utilized GMM classifiers in the past, and I've also started playing with this commit to see the results using a GMM. One big feature needed here is the ability to tune the number of components, n_components, for each class. I saw Jake was concerned about which features users would be interested in having; this is a big one for people who use this type of classifier, and it definitely impacts performance. Unfortunately I cannot provide you an example of a dataset (company policy).

Member

jakevdp commented Jan 15, 2014

Thanks @jgbos - I agree that individually tuning hyperparameters is a vital feature of this. I'm still trying to figure out the best way to approach that, though (and I haven't had much time to work on this lately)

Is there any chance of progress on this PR, or is it buried forever? I understand that we are hung up on the last TODO item. I'm wondering if we can come to a solution that does not require class-wise cross-validation for the density model?

@agramfort agramfort commented on the diff Jan 14, 2016

doc/modules/naive_bayes.rst
+Non-naive Bayes
+---------------
+
+As mentioned above, naive Bayesian methods are generally very fast, but often
+inaccurate estimators. This can be addressed by relaxing the assumptions that
+make the models naive, so that more accurate classifications are possible.
+
+If we return to the general formalism outlined above, we can see that the
+generic model for Bayesian classification is:
+
+.. math::
+ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y).
+
+This model only becomes "naive" when we introduce certain assumptions about
+the form of :math:`P(x_i \mid y)`, e.g. that each class is drawn from an
+axis-aligned normal distribution (the assumption for Gaussian Naive Bayes).
@agramfort

agramfort Jan 14, 2016

Owner

what makes the model naive is that your assume conditional independence of the features. I find this paragraph not clear.

@agramfort agramfort commented on the diff Jan 14, 2016

doc/modules/naive_bayes.rst
+inaccurate estimators. This can be addressed by relaxing the assumptions that
+make the models naive, so that more accurate classifications are possible.
+
+If we return to the general formalism outlined above, we can see that the
+generic model for Bayesian classification is:
+
+.. math::
+ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y).
+
+This model only becomes "naive" when we introduce certain assumptions about
+the form of :math:`P(x_i \mid y)`, e.g. that each class is drawn from an
+axis-aligned normal distribution (the assumption for Gaussian Naive Bayes).
+
+However, assumptions like these are in no way required for generative
+Bayesian classification formalism: we can equally well fit any suitable
+density model to each category to estimate :math:`P(x_i \mid y)`. Some
@agramfort

agramfort Jan 14, 2016

Owner

this gives the impression that your code estimates a KDE/GMM for each feature but you actually estimate P(x \mid y)

Note that this can be problematic in high dimensions (KDE has issues in high dimensions). A middle ground could be to also support KDE/GMM per feature, i.e. keep the naive independence assumption. This could be done with an option.

Owner

agramfort commented Jan 14, 2016

really cool examples :)

@jakevdp you'll need to rebase

@jakevdp just wondering, will you merge this anytime soon?

Owner

agramfort commented Jun 26, 2016

@danielravina I am not sure @jakevdp has time to finish this. Please take over if you want and see my comments.

Member

jakevdp commented Jun 26, 2016

Probably will not be finishing this myself. The main reason I never finished the PR is that I never really figured out how to deal cleanly with per-class hyperparameters.

@danielravina @jakevdp did either of you or anyone else end up picking this back up? would be interested in working on this if not.

Member

jmschrei commented Jul 19, 2017

This PR is actually fairly similar to the BayesClassifier / NaiveBayes classifiers in pomegranate (see tutorial here: https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_5_Bayes_Classifiers.ipynb). If you pick this up I'd be happy to review it, but be sure to read the above discussion thoroughly to understand what the stalling issues were.
