
[MRG] Generative Classification #2468

Closed
jakevdp wants to merge 20 commits into scikit-learn:master from jakevdp:generative_class

Conversation

@jakevdp (Member) commented Sep 21, 2013

This PR adds a simple meta-estimator which accepts any generative model (normal approximation, GMM, KernelDensity, etc.) and uses it to construct a generative Bayesian classifier.
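
For readers new to the idea, here is a minimal, self-contained sketch of what such a generative classifier does (an illustration of the concept only, not this PR's code): fit one density model per class, then classify by comparing per-class posteriors via Bayes' rule.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity

X, y = load_digits(return_X_y=True)
classes = np.unique(y)

# One density model per class, plus the log of the class prior P(y).
models, log_priors = {}, {}
for c in classes:
    Xc = X[y == c]
    models[c] = KernelDensity(bandwidth=4.0).fit(Xc)
    log_priors[c] = np.log(len(Xc) / len(X))

# The posterior is proportional to P(x | y) * P(y); classify by the argmax over classes.
log_posterior = np.column_stack(
    [models[c].score_samples(X) + log_priors[c] for c in classes]
)
y_pred = classes[np.argmax(log_posterior, axis=1)]
print("training accuracy:", np.mean(y_pred == y))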

Todo:

  • code documentation
  • narrative docs
  • testing
  • examples
  • allow class-wise cross validation for the density model?

@ogrisel (Member) commented Sep 22, 2013

I think to make the discussion more fruitful it would be great to provide some examples on datasets where such models are actually useful, either from a pure classification performance point of view or, more likely, as samplers to generate new labeled samples for specific classes (a bit like you did with the KDE sampling example for digits).

@jakevdp (Member, Author) commented Sep 22, 2013

...or more likely as samplers to generate new labeled samples for specific classes

Ah, I hadn't even thought of that possibility! Yes, we could implement a sample routine, which would use the underlying models to return a random set of new observations fitting the training data. Great idea!
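
A rough sketch of what such a sample routine might look like, assuming each per-class density model exposes a sample() method that returns an array of points (as KernelDensity does); this is an illustration, not the PR's implementation:

import numpy as np

def sample_labeled(models, log_priors, n_samples, random_state=None):
    # models: dict mapping class label -> fitted density model with a .sample(n) method
    # log_priors: dict mapping class label -> log P(y)
    rng = np.random.RandomState(random_state)
    classes = np.array(sorted(models))
    priors = np.exp([log_priors[c] for c in classes])
    # Decide how many points to draw from each class according to the class priors.
    counts = rng.multinomial(n_samples, priors / priors.sum())
    Xs, ys = [], []
    for c, n in zip(classes, counts):
        if n > 0:
            Xs.append(models[c].sample(n))
            ys.append(np.full(n, c))
    return np.vstack(Xs), np.concatenate(ys)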

I'll work on some examples soon to make the utility of this approach more clear.

@jakevdp (Member, Author) commented Sep 22, 2013

I added doc strings and tests. An incompatibility came up in the case of GMM: I opened an issue at #2473.

if isinstance(density_estimator, str):
    dclass = MODEL_TYPES.get(density_estimator)
    return dclass(**kwargs)
elif isinstance(density_estimator, type):
Member: Looks like type is undefined.

Member: It's a builtin.

@larsmans (Member)

I don't think this should be combined with Naive Bayes, except in the docs. The charm of Naive Bayes lies in its speed and simple code; no need to mess with that.

Bayes Classifier, in which the distribution of each training class is
approximated by an axis-aligned multi-dimensional normal distribution, and
unknown points are evaluated by comparing their posterior probability under
each model.
Member: There's no such thing as "the" Naive Bayes classifier. You're thinking of Gaussian NB, but NLP people will variously think of Bernoulli or multinomial NB. (The first time I encountered the Gaussian variant was while reading sklearn source code :)

Member Author: Interesting! The first time I encountered any version other than Gaussian NB was reading the sklearn source code 😄

@larsmans (Member)

On second thought: @jakevdp, is it too much of a stretch to merge this thing into the naive_bayes module? I guess it's not really "naive" in the NB sense, but it would remove some clutter in the top-level module. Also, take a look at the NB narrative docs, which explain pretty much the same thing that you're explaining in the module docstring.

@jakevdp (Member, Author) commented Sep 25, 2013

I initially thought about putting this within sklearn.naive_bayes (given that it inherits from BaseNB!) but didn't because, though it is Bayesian, it's distinctly not naive in the sense in which the term is used. If we could start over, it would make more sense to have a submodule for generative classification, of which naive Bayes is a part, rather than the other way around. But given that we've made the API choice to have a naive_bayes submodule, I thought it would be less confusing to put general generative classification in its own module.

Regardless of where the code goes, I had envisioned combining the narrative documentation for the two: as you mention, we can adapt the theoretical background currently put under the heading of Naive Bayes and show how it applies in both the Naive and the general case.

@larsmans (Member)

it would make more sense to have a submodule for generative classification of which naive bayes is a part

True, but I've seen other people's production codebases that depend on MultinomialNB being in naive_bayes.py, and I'd have some explaining to do if we broke that :p

Combining the narratives was mainly what I was aiming at. It's your call to decide if it fits well enough to also combine the code.

(FYI, I see you're using BaseNB. I've been thinking about killing that, because MultinomialNB and BernoulliNB can be implemented more straightforwardly as pure linear models, sharing no code with GaussianNB.)

@jakevdp (Member, Author) commented Nov 23, 2013

@larsmans - I ended up following your advice and moving everything into the naive_bayes submodule. That location might be a bit misleading, but I think it is cleaner.

Still some tests failing... I'm going to try to fix those.

@jakevdp (Member, Author) commented Dec 10, 2013

I think this is pretty close to finished now. I added narrative documentation and examples, and the tests should pass.

One missing feature that would be really helpful would be the ability to do class-wise cross-validation of the density estimators within GenerativeBayes. I'm not sure what the right interface would be for that, however... any ideas?

@jakevdp (Member, Author) commented Dec 10, 2013

Hmm... is there any way that program state can affect the results of cross_val_score? It fails here, but passes on my machine, and passes when I run the code alone. There doesn't seem to be any random element that would affect it... that's really strange.

@jakevdp (Member, Author) commented Dec 10, 2013

Ah - looks like it was something that had changed in master. I'll adjust the tests so that they will pass.

@coveralls

Coverage remained the same when pulling 061f3fb on jakevdp:generative_class into ffde690 on scikit-learn:master.

@jakevdp (Member, Author) commented Dec 10, 2013

Changing status to MRG: I think this is ready for a final review, unless we want to add class-wise cross-validation at this time.

@jakevdp (Member, Author) commented Dec 12, 2013

Thanks @ogrisel. I've addressed all your comments.

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.
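
For readers unfamiliar with that machinery, the double-underscore convention already works this way for existing composite estimators; Pipeline is used below purely to illustrate the mechanism that the proposed est1/est2 labels would hook into:

from sklearn.neighbors import KernelDensity
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("kde", KernelDensity())])
# Nested parameters are addressed as <component_name>__<param_name>,
# which is what lets GridSearchCV reach inside composite estimators.
pipe.set_params(kde__bandwidth=0.5)
print(pipe.get_params()["kde__bandwidth"])  # 0.5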

@coveralls

Coverage remained the same when pulling 3f8666a on jakevdp:generative_class into aa8139b on scikit-learn:master.

@ogrisel (Member) commented Dec 16, 2013

Regarding the CV issue: I think the first-order solution is to simply expose the estimator parameters using the get_params machinery in BaseEstimator. We can internally label the estimators, e.g. "est1", "est2", so that the fit parameters would become est1__paramname, est2__paramname, etc. This would be a quick addition, and allow the usual cross-validation tools to have access to the parameters.

I am not sure that will work, as the number of sub-estimators depends on the number of classes. The list of sub-estimators in the estimators_ attribute is therefore only generated once we see the data in fit, so as to be able to extract the number of classes or features from the data shape. On the other hand, the grid search tooling manipulates the model and its parameters independently of the data, in particular prior to any call to fit. Hence we have a design mismatch. Maybe it would be possible to hack get/set_params to store the sub-estimator parameters on the GenerativeBayes object itself and delay the recursive call to set_params on the sub-estimator objects until fit time.
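
A very rough sketch of what that hack might look like (assumptions: it is mixed into a BaseEstimator subclass so super().set_params() resolves, the meta-estimator builds a name-to-model mapping inside fit, and the interaction with clone and parameter validation is ignored, which is part of what makes it a hack):

class DelayedSubParamsMixin:
    # Sketch: keep parameters addressed to not-yet-existing sub-estimators on the
    # meta-estimator itself, and push them down once the per-class models exist in fit().

    def set_params(self, **params):
        delayed = {k: v for k, v in params.items() if "__" in k}
        self._delayed_params = getattr(self, "_delayed_params", {})
        self._delayed_params.update(delayed)
        own = {k: v for k, v in params.items() if "__" not in k}
        return super().set_params(**own)

    def _apply_delayed_params(self, named_estimators):
        # named_estimators: dict such as {"est1": model_for_class_0, ...}, built in fit().
        for key, value in getattr(self, "_delayed_params", {}).items():
            name, param = key.split("__", 1)
            named_estimators[name].set_params(**{param: value})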

@jakevdp (Member, Author) commented Dec 16, 2013

Yes, I ran into that mismatch when I gave this strategy a shot. I'll think about your idea of hacking get/set_params, but I'm starting to think that just providing a CV tool within GenerativeBayes itself might be the answer.

@ogrisel (Member) commented Dec 18, 2013

That might indeed be a better way. Note, however, that we have a similar issue for multi-class or multi-label classifiers that implement the OvR strategy by combining n_classes binary classifiers. It is possible that per-classifier hyperparameter tuning (e.g. regularizer strength) would be beneficial for the overall performance of the model. @mblondel @pprett might want to pitch in.

@mblondel (Member)

I don't have any experience with tuning each binary classifier separately. One concern I have is that each binary classifier may produce predictions with different scales (e.g. one with predictions in [-1, 1], another one with predictions in [-5, 5]) and thus the argmax rule might not work at all.

In any case, this is a combinatorial search and thus randomized search seems the way to go.

@jgbos commented Jan 15, 2014

Hey guys, I hope I'm not just wasting space in your inbox. I've tried to follow this discussion, but wanted to provide a couple of notes from a user. I have utilized GMM classifiers in the past, and I've also started playing with this commit to see the results using a GMM. One big feature needed for this function is the capability of tuning the number of components, n_components, for each class. I saw Jake was concerned with which features users would be interested in having; this is a biggie for people who use this type of classifier, and it definitely impacts performance. Unfortunately I cannot provide you with an example dataset (company policy).

@jakevdp (Member, Author) commented Jan 15, 2014

Thanks @jgbos - I agree that individually tuning hyperparameters is a vital feature of this. I'm still trying to figure out the best way to approach that, though (and I haven't had much time to work on this lately).

@ngaloppo

Is there any chance that there would be some progress on this PR, or is it buried forever? I understand that we are hung up on the last TODO item. I'm wondering if we can come to a solution that does not require the ability to do class-wise cross validation for the density model?


This model only becomes "naive" when we introduce certain assumptions about
the form of :math:`P(x_i \mid y)`, e.g. that each class is drawn from an
axis-aligned normal distribution (the assumption for Gaussian Naive Bayes).
Member: What makes the model naive is that you assume conditional independence of the features. I find this paragraph unclear.

Contributor: I find this paragraph erroneous.

Member: Yes, it’s wrong, as I suggested in 2016 ;)

@agramfort (Member)

really cool examples :)

@jakevdp you'll need to rebase

@danielravina

@jakevdp just wondering, will you merge this anytime soon?

@agramfort (Member)

@danielravina I am not sure @jakevdp has time to finish this. Please take over if you want and see my comments.

@jakevdp (Member, Author) commented Jun 26, 2016

Probably will not be finishing this myself. The main reason I never finished the PR is that I never really figured out how to deal cleanly with per-class hyperparameters.

@jengelman

@danielravina @jakevdp did either of you or anyone else end up picking this back up? I'd be interested in working on this if not.

@jmschrei (Member)

This PR is actually fairly similar to the BayesClassifier / NaiveBayes classifiers in pomegranate (see tutorial here: https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_5_Bayes_Classifiers.ipynb). If you pick this up I'd be happy to review it, but be sure to read the above discussion thoroughly to understand what the stalling issues were.

@amueller (Member)

Should this be moved to scikit-learn-extras or is it not complete enough?

@amueller added the "Move to scikit-learn-extra" label (This PR should be moved to the scikit-learn-extras repository) on Jul 30, 2019
@kasparthommen commented Feb 12, 2020

Probably will not be finishing this myself. The main reason I never finished the PR is that I never really figured out how to deal cleanly with per-class hyperparameters.

Hi, I am facing the same problem with a few estimators I wrote that also require per-class hyperparameters, i.e., lists/arrays/tuples. The way I see it there are two things you can do:

  1. Accept the fact that GridSearchCV and RandomizedSearchCV are fine for educational purposes but that they are too simplistic for real-world hyperparameter tuning for the following reasons:
    • They don't support list/tuple hyperparameters (which is the core problem for this PR)
    • They don't support nested sub-spaces (e.g. to disallow illegal parameter combinations or other parameter dependencies)
    • They are too slow for real-world applications due to their simple underlying "search" algorithms

    Therefore, making your estimator compatible with these optimizers is a bit of a lost cause. Instead, you should use a more advanced hyperparameter tuning library such as e.g. Optuna. It works quite differently by not requiring you to define a parameter grid/space beforehand but by allowing you to "ask" the framework for parameter values dynamically one by one. Please see figures 1 and 2 in the paper to get an idea of how this works. This means you can e.g. dynamically request the number of GMM components for each class in a loop over the class count (a rough sketch of this follows after the list).

  2. Use this decorator that I just wrote that exposes an estimator's sequence hyperparameters (tuple, list, numpy array), e.g. my_list_param, as a set of scalar parameters (my_list_param_0, my_list_param_1, ..., my_list_param_5 if there are 6 entries) such that it becomes compatible with GridSearchCV.
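
To make option 1 concrete, here is a rough sketch of the Optuna-style approach (make_per_class_gmm_classifier is a hypothetical factory for a classifier with one GMM per class, and X, y, n_classes are assumed to be defined):

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Ask for one n_components value per class, dynamically, in a loop over the classes.
    n_components = [trial.suggest_int(f"n_components_class_{c}", 1, 10)
                    for c in range(n_classes)]
    clf = make_per_class_gmm_classifier(n_components)  # hypothetical factory
    return cross_val_score(clf, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)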
