
[MRG] Generative Classification #2468

Closed
jakevdp wants to merge 20 commits into scikit-learn:master from jakevdp:generative_class

Conversation

@jakevdp (Member) commented Sep 21, 2013

This PR adds a simple meta-estimator which accepts any generative model (normal approximation, GMM, KernelDensity, etc.) and uses it to construct a generative Bayesian classifier.
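
For readers new to the idea, here is a minimal, self-contained sketch of what such a generative classifier does (an illustration of the concept only, not this PR's code): fit one density model per class, then classify by comparing per-class posteriors via Bayes' rule.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity

X, y = load_digits(return_X_y=True)
classes = np.unique(y)

# One density model per class, plus the log of the class prior P(y).
models, log_priors = {}, {}
for c in classes:
    Xc = X[y == c]
    models[c] = KernelDensity(bandwidth=4.0).fit(Xc)
    log_priors[c] = np.log(len(Xc) / len(X))

# The posterior is proportional to P(x | y) * P(y); classify by the argmax over classes.
log_posterior = np.column_stack(
    [models[c].score_samples(X) + log_priors[c] for c in classes]
)
y_pred = classes[np.argmax(log_posterior, axis=1)]
print("training accuracy:", np.mean(y_pred == y))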

Todo:

  • code documentation
  • narrative docs
  • testing
  • examples
  • allow class-wise cross validation for the density model?

@ogrisel (Member) commented Sep 22, 2013

I think to make the discussion more fruitful it would be great to provide some examples on datasets where such models are actually useful, either from a pure classification performance point of view or, more likely, as samplers to generate new labeled samples for specific classes (a bit like you did with the KDE sampling example for digits).

@jakevdp (Member, Author) commented Sep 22, 2013

...or more likely as samplers to generate new labeled samples for specific classes

Ah, I hadn't even thought of that possibility! Yes, we could implement a sample routine, which would use the underlying models to return a random set of new observations fitting the training data. Great idea!
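
A rough sketch of what such a sample routine might look like, assuming each per-class density model exposes a sample() method that returns an array of points (as KernelDensity does); this is an illustration, not the PR's implementation:

import numpy as np

def sample_labeled(models, log_priors, n_samples, random_state=None):
    # models: dict mapping class label -> fitted density model with a .sample(n) method
    # log_priors: dict mapping class label -> log P(y)
    rng = np.random.RandomState(random_state)
    classes = np.array(sorted(models))
    priors = np.exp([log_priors[c] for c in classes])
    # Decide how many points to draw from each class according to the class priors.
    counts = rng.multinomial(n_samples, priors / priors.sum())
    Xs, ys = [], []
    for c, n in zip(classes, counts):
        if n > 0:
            Xs.append(models[c].sample(n))
            ys.append(np.full(n, c))
    return np.vstack(Xs), np.concatenate(ys)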

I'll work on some examples soon to make the utility of this approach more clear.

@jakevdp (Member, Author) commented Sep 22, 2013

I added doc strings and tests. An incompatibility came up in the case of GMM: I opened an issue at #2473.

if isinstance(density_estimator, str):
    dclass = MODEL_TYPES.get(density_estimator)
    return dclass(**kwargs)
elif isinstance(density_estimator, type):
Member: Looks like type is undefined.

Member: It's a builtin.

@larsmans (Member)

I don't think this should be combined with Naive Bayes, except in the docs. The charm of Naive Bayes lies in its speed and simple code; no need to mess with that.

Bayes Classifier, in which the distribution of each training class is
approximated by an axis-aligned multi-dimensional normal distribution, and
unknown points are evaluated by comparing their posterior probability under
each model.
Member: There's no such thing as "the" Naive Bayes classifier. You're thinking of Gaussian NB, but NLP people will variously think of Bernoulli or multinomial NB. (The first time I encountered the Gaussian variant was while reading sklearn source code :)

Member Author: Interesting! The first time I encountered any version other than Gaussian NB was reading the sklearn source code 😄

@larsmans (Member)

On second thought: @jakevdp, is it too much of a stretch to merge this thing into the naive_bayes module? I guess it's not really "naive" in the NB sense, but it would remove some clutter in the top-level module. Also, take a look at the NB narrative docs, which explain pretty much the same thing that you're explaining in the module docstring.

@jakevdp (Member, Author) commented Sep 25, 2013

I initially thought about putting this within sklearn.naive_bayes (given that it inherits from BaseNB!) but didn't because, though it is Bayesian, it's distinctly not naive in the sense in which the term is used. If we could start over, it would make more sense to have a submodule for generative classification, of which naive Bayes is a part, rather than the other way around. But given that we've made the API choice to have a naive_bayes submodule, I thought it would be less confusing to put general generative classification in its own module.

Regardless of where the code goes, I had envisioned combining the narrative documentation for the two: as you mention, we can adapt the theoretical background currently put under the heading of Naive Bayes and show how it applies in both the Naive and the general case.

@larsmans (Member)

it would make more sense to have a submodule for generative classification of which naive bayes is a part

True, but I've seen other people's production codebases that depend on MultinomialNB being in naive_bayes.py, and I'd have some explaining to do if we broke that :p

Combining the narratives was mainly what I was aiming at. It's your call to decide if it fits well enough to also combine the code.

(FYI, I see you're using BaseNB. I've been thinking about killing that, because MultinomialNB and BernoulliNB can be implemented more straightforwardly as pure linear models, sharing no code with GaussianNB.)

@jakevdp (Member, Author) commented Nov 23, 2013

@larsmans - I ended up following your advice and moving everything into the naive_bayes submodule. That location might be a bit misleading, but I think it is cleaner.

Still some tests failing... I'm going to try to fix those.

@jakevdp (Member, Author) commented Dec 10, 2013

I think this is pretty close to finished now. I added narrative documentation and examples, and the tests should pass.

One missing feature that would be really helpful would be the ability to do class-wise cross-validation of the density estimators within GenerativeBayes. I'm not sure what the right interface would be for that, however... any ideas?

@jakevdp (Member, Author) commented Dec 10, 2013

Hmm... is there any way that program state can affect the results of cross_val_score? It fails here, but passes on my machine, and passes when I run the code alone. There doesn't seem to be any random element that would affect it... that's really strange.

@jakevdp (Member, Author) commented Dec 10, 2013

Ah - looks like it was something that had changed in master. I'll adjust the tests so that they will pass.

@coveralls

Coverage remained the same when pulling 061f3fb on jakevdp:generative_class into ffde690 on scikit-learn:master.

@jakevdp (Member, Author) commented Dec 10, 2013

Changing status to MRG: I think this is ready for a final review, unless we want to add class-wise cross-validation at this time.

@jakevdp (Member, Author) commented Dec 12, 2013

Thanks @ogrisel. I've addressed all your comments.

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.
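
For readers unfamiliar with that machinery, the double-underscore convention already works this way for existing composite estimators; Pipeline is used below purely to illustrate the mechanism that the proposed est1/est2 labels would hook into:

from sklearn.neighbors import KernelDensity
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("kde", KernelDensity())])
# Nested parameters are addressed as <component_name>__<param_name>,
# which is what lets GridSearchCV reach inside composite estimators.
pipe.set_params(kde__bandwidth=0.5)
print(pipe.get_params()["kde__bandwidth"])  # 0.5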

@coveralls

Coverage remained the same when pulling 3f8666a on jakevdp:generative_class into aa8139b on scikit-learn:master.

@ogrisel (Member) commented Dec 16, 2013

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.

I am not sure that will work, as the number of sub-estimators depends on the number of classes. The list of sub-estimators in the estimators_ attribute is therefore only generated once we see the data in fit, so as to be able to extract the number of classes or features from the data shape. On the other hand, the grid search tooling manipulates the model and its parameters independently of the data, in particular prior to any call to fit. Hence we have a design mismatch. Maybe it would be possible to hack get/set_params to store the sub-estimator parameters on the GenerativeBayes object itself and delay the recursive call to set_params on the sub-estimator objects until fit time.
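
A very rough sketch of what that hack might look like (assumptions: it is mixed into a BaseEstimator subclass so super().set_params() resolves, the meta-estimator builds a name-to-model mapping inside fit, and the interaction with clone and parameter validation is ignored, which is part of what makes it a hack):

class DelayedSubParamsMixin:
    # Sketch: keep parameters addressed to not-yet-existing sub-estimators on the
    # meta-estimator itself, and push them down once the per-class models exist in fit().

    def set_params(self, **params):
        delayed = {k: v for k, v in params.items() if "__" in k}
        self._delayed_params = getattr(self, "_delayed_params", {})
        self._delayed_params.update(delayed)
        own = {k: v for k, v in params.items() if "__" not in k}
        return super().set_params(**own)

    def _apply_delayed_params(self, named_estimators):
        # named_estimators: dict such as {"est1": model_for_class_0, ...}, built in fit().
        for key, value in getattr(self, "_delayed_params", {}).items():
            name, param = key.split("__", 1)
            named_estimators[name].set_params(**{param: value})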

@jakevdp (Member, Author) commented Dec 16, 2013

Yes, I ran into that mismatch when I gave this strategy a shot. I'll think about your idea of hacking get/set_params, but I'm starting to think that just providing a CV tool within GenerativeBayes itself might be the answer.

@ogrisel (Member) commented Dec 18, 2013

That might indeed be a better way. Note, however, that we have a similar issue for multi-class or multi-label classifiers that implement the OvR strategy by combining n_classes binary classifiers. It is possible that per-classifier hyperparameter tuning (e.g. regularizer strength) would be beneficial for the overall performance of the model. @mblondel @pprett might want to pitch in.

@mblondel (Member)

I don't have any experience with tuning each binary classifier separately. One concern I have is that each binary classifier may produce predictions with different scales (e.g. one with predictions in [-1, 1], another one with predictions in [-5, 5]) and thus the argmax rule might not work at all.

In any case, this is a combinatorial search and thus randomized search seems the way to go.

@jgbos commented Jan 15, 2014

Hey guys, I hope I'm not just wasting space in your inbox. I've tried to follow this discussion, but wanted to provide a couple of notes from a user. I have utilized GMM classifiers in the past, and I've also started playing with this commit to see the results using a GMM. One big feature needed for this function is the capability of tuning the number of components, n_components, for each class. I saw Jake was concerned with which features users would be interested in having; this is a biggie for people who use this type of classifier, and it definitely impacts performance. Unfortunately I cannot provide you with an example dataset (company policy).

@jakevdp (Member, Author) commented Jan 15, 2014

Thanks @jgbos - I agree that individually tuning hyperparameters is a vital feature of this. I'm still trying to figure out the best way to approach that, though (and I haven't had much time to work on this lately).

@ngaloppo

Is there any chance that there would be some progress on this PR, or is it buried forever? I understand that we are hung up on the last TODO item. I'm wondering if we can come to a solution that does not require the ability to do class-wise cross validation for the density model?


This model only becomes "naive" when we introduce certain assumptions about
the form of :math:`P(x_i \mid y)`, e.g. that each class is drawn from an
axis-aligned normal distribution (the assumption for Gaussian Naive Bayes).
Member: What makes the model naive is that you assume conditional independence of the features. I find this paragraph unclear.

Contributor: I find this paragraph erroneous.

Member: Yes, it’s wrong, as I suggested in 2016 ;)

@agramfort (Member)

really cool examples :)

@jakevdp you'll need to rebase

@danielravina

@jakevdp just wondering, will you merge this anytime soon?

@agramfort (Member)

@danielravina I am not sure @jakevdp has time to finish this. Please take over if you want and see my comments.

@jakevdp (Member, Author) commented Jun 26, 2016

Probably will not be finishing this myself. The main reason I never finished the PR is that I never really figured out how to deal cleanly with per-class hyperparameters.

@jengelman

@danielravina @jakevdp did either of you or anyone else end up picking this back up? I'd be interested in working on this if not.

@jmschrei (Member)

This PR is actually fairly similar to the BayesClassifier / NaiveBayes classifiers in pomegranate (see tutorial here: https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_5_Bayes_Classifiers.ipynb). If you pick this up I'd be happy to review it, but be sure to read the above discussion thoroughly to understand what the stalling issues were.

@amueller (Member)

Should this be moved to scikit-learn-extras or is it not complete enough?

@amueller added the "Move to scikit-learn-extra" label (This PR should be moved to the scikit-learn-extras repository) on Jul 30, 2019
@kasparthommen commented Feb 12, 2020

Probably will not be finishing this myself. The main reason I never finished the PR is that I never really figured out how to deal cleanly with per-class hyperparameters.

Hi, I am facing the same problem with a few estimators I wrote that also require per-class hyperparameters, i.e., lists/arrays/tuples. The way I see it there are two things you can do:

  1. Accept the fact that GridSearchCV and RandomizedSearchCV are fine for educational purposes but that they are too simplistic for real-world hyperparameter tuning for the following reasons:
    • They don't support list/tuple hyperparameters (which is the core problem for this PR)
    • They don't support nested sub-spaces (e.g. to disallow illegal parameter combinations or other parameter dependencies)
    • They are too slow for real-world applications due to their simple underlying "search" algorithms

    Therefore, making your estimator compatible with these optimizers is a bit of a lost cause. Instead, you should use a more advanced hyperparameter tuning library such as e.g. Optuna. It works quite differently by not requiring you to define a parameter grid/space beforehand but by allowing you to "ask" the framework for parameter values dynamically one by one. Please see figures 1 and 2 in the paper to get an idea of how this works. This means you can e.g. dynamically request the number of GMM components for each class in a loop over the class count (a rough sketch of this follows after the list).

  2. Use this decorator that I just wrote that exposes an estimator's sequence hyperparameters (tuple, list, numpy array), e.g. my_list_param, as a set of scalar parameters (my_list_param_0, my_list_param_1, ..., my_list_param_5 if there are 6 entries) such that it becomes compatible with GridSearchCV.
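
To make option 1 concrete, here is a rough sketch of the Optuna-style approach (make_per_class_gmm_classifier is a hypothetical factory for a classifier with one GMM per class, and X, y, n_classes are assumed to be defined):

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Ask for one n_components value per class, dynamically, in a loop over the classes.
    n_components = [trial.suggest_int(f"n_components_class_{c}", 1, 10)
                    for c in range(n_classes)]
    clf = make_per_class_gmm_classifier(n_components)  # hypothetical factory
    return cross_val_score(clf, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)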
