Issue 735: naive bayes #987

Closed · 17 commits · 8 participants

@cricketsong

Issue #735, item 3: This adds parameters sample_weight and class_prior to class GaussianNB, providing functionality similar to that offered by BaseDiscreteNB.

@agramfort agramfort commented on an outdated diff Jul 30, 2012
sklearn/naive_bayes.py
```diff
 epsilon = 1e-9
 for i, y_i in enumerate(unique_y):
-    self.theta_[i, :] = np.mean(X[y == y_i, :], axis=0)
-    self.sigma_[i, :] = np.var(X[y == y_i, :], axis=0) + epsilon
-    self.class_prior_[i] = np.float(np.sum(y == y_i)) / n_samples
+    if sample_weight is not None:
+        self.theta_[i, :] = np.average(X[y == y_i, :], axis=0,
+                                       weights=sample_weight[y == y_i])
+        self.sigma_[i, :] = (np.dot(sample_weight[y == y_i],
+                                    (X[y == y_i, :] -
+                                     self.theta_[i, :]) ** 2) /
+                             np.sum(sample_weight[y == y_i]))
```
@agramfort
agramfort Jul 30, 2012

You should avoid computing X[y == y_i, :] and sample_weight[y == y_i] twice.
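A minimal sketch of that deduplication, slicing the per-class rows and weights once and reusing them (the helper name weighted_class_stats and its signature are illustrative, not from the PR):

```python
import numpy as np

def weighted_class_stats(X, y, y_i, sample_weight):
    """Weighted per-class mean and variance, computing each mask slice once."""
    X_i = X[y == y_i, :]
    sw_i = sample_weight[y == y_i]
    theta = np.average(X_i, axis=0, weights=sw_i)
    sigma = np.dot(sw_i, (X_i - theta) ** 2) / np.sum(sw_i)
    return theta, sigma
```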

@agramfort
scikit-learn member

Can you add a test and, if possible, illustrate this feature in an existing example?

@jaquesgrobler
scikit-learn member

@cricketsong +1 for test and example

Just one other thing... We've been trying to improve API consistency lately, which can be quite a mission.
It would be great if, instead of passing sample_weight=None and class_prior=None to the fit(..) method,
you could have them set upon initialization of the object.
Basically, we're trying to keep the API such that all parameters are set when the object is created, and the only arguments passed to fit are the data and labels.
Thanks for the work on this.
Kind regards
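A minimal sketch of the convention described above, assuming a hypothetical estimator class (the name PriorAwareNB is illustrative): hyperparameters are set in __init__ and stored verbatim, and fit receives only data and labels.

```python
from sklearn.base import BaseEstimator

class PriorAwareNB(BaseEstimator):      # hypothetical class, not the PR's code
    def __init__(self, class_prior=None):
        self.class_prior = class_prior  # parameters set at construction time

    def fit(self, X, y):
        # only data and labels are passed here
        return self
```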

@cricketsong

Thank you both for the comments.

@agramfort I just fixed the ...[y == y_i, :] duplication and will now work on the tests and examples.

@jaquesgrobler I've moved the parameters to the constructor. Unfortunately, because of a clash with the existing (and deprecated) property self.class_prior, I've had to use custom_class_prior instead. I find it more informative, but is it acceptable?

@jaquesgrobler
scikit-learn member

That looks good to me... it works for the API, and I don't see any problem with custom_class_prior. I agree that it feels more informative.
@agramfort @GaelVaroquaux You guys happy with this change?

Thanks again. J

@jaquesgrobler jaquesgrobler and 1 other commented on an outdated diff Jul 30, 2012
sklearn/naive_bayes.py

```diff
     """
+    def __init__(self, sample_weight=None, custom_class_prior=None):
+        self.sample_weight = None
+        if sample_weight is not None:
```
@jaquesgrobler
jaquesgrobler Jul 30, 2012

Hey again - one more thing, and sorry to be a pest with this 😄
I think it'd be better to keep decision logic out of __init__ (API convention again).
IMHO you can initialize all you need in __init__ and then move the decision logic to the start
of the fit(..) method. Let me know if I'm being vague with this.
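Applied to the diff above, the suggestion would look roughly like this sketch (the class name GaussianNBSketch is illustrative, and the body of fit is elided):

```python
import numpy as np

class GaussianNBSketch:                  # illustrative stand-in for GaussianNB
    def __init__(self, sample_weight=None, custom_class_prior=None):
        # __init__ just records the arguments, no decision logic
        self.sample_weight = sample_weight
        self.custom_class_prior = custom_class_prior

    def fit(self, X, y):
        # the None-check moves here, to the start of fit
        if self.sample_weight is None:
            sample_weight = np.ones(X.shape[0])
        else:
            sample_weight = np.asarray(self.sample_weight)
        # ... weighted statistics as in the earlier diff ...
        return self
```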

@cricketsong
cricketsong Jul 30, 2012

Makes sense, will do!

@GaelVaroquaux
scikit-learn member

I must confess that I don't really like the name 'custom_class_prior'. I had to read the code to figure out what it was doing. It's the "custom" in it that I do not find terribly explicit. I don't have a great idea on a better name. Asking on the mailing list to see if people come up with better ideas would be useful.

@cricketsong

@GaelVaroquaux OK, I've asked on the mailing list.

@larsmans
scikit-learn member

Why are you storing custom_class_prior as an attribute? I'd say that's data, just like X and y, so it shouldn't have to be stored. (Same for sample_weight.)

If you look at BaseDiscreteNB, you'll see that it doesn't store class_prior; instead, it stores its logarithm as intercept_.
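A small sketch of that scheme (the values and variable names are illustrative): the prior itself is not stored; its logarithm is kept as intercept_.

```python
import numpy as np

class_count = np.array([30.0, 70.0])           # samples seen per class
class_prior = class_count / class_count.sum()  # empirical prior [0.3, 0.7]
intercept_ = np.log(class_prior)               # what BaseDiscreteNB stores
```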

@cricketsong

@larsmans I initially did not store custom_class_prior as an attribute: it was passed as a parameter to fit, just like the data set is (and just like BaseDiscreteNB behaves). But @jaquesgrobler suggested that I move it to the constructor, for API consistency.

I guess it all depends on whether we consider the weights and custom priors to be part of the model or part of the data. For what it's worth, I agree with you: the sample weights and custom priors depend so much (in their shape, for instance) on the actual fitted data that we might as well consider them part of the data. Besides, I'm not sure what meaning we should give to the values of sample_weight and custom_class_prior in a freshly created model (i.e. before any data has been fitted).

@larsmans
scikit-learn member

I believe we always pass sample_weight to fit, not the constructor (but please check this). As for class_prior, is it possible to compute one of the other parameters from it?

@GaelVaroquaux
scikit-learn member
@ogrisel
scikit-learn member

sample_weight is never passed to the constructor, as it's data-dependent. For class_weight, on the other hand, we can have an 'auto' mode passed in the constructor, with the actual weights computed from the data at fit time.

So +1 for keeping sample_weight as a fit argument.
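A usage sketch of that split, using SVC as a stand-in estimator with made-up data (the 'auto' mode mentioned above was later renamed 'balanced'):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(class_weight="balanced")                  # class weights: constructor
clf.fit(X, y, sample_weight=[1.0, 2.0, 1.0, 0.5])   # sample weights: fit
```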

@jaquesgrobler
scikit-learn member

Shoh, I appear to have opened a can of worms here... 😱
So should we try and maintain fit(..) to have only the data parameters, unless there's a data-dependent parameter?
I've kind of understood that we want to keep fit(..) to only take data, but I guess it does make sense to include data-dependent parameters in this..

@amueller
scikit-learn member

@jaquesgrobler exactly right. Any parameter that doesn't depend on the particular X should be in __init__. And no logic is allowed in __init__ (otherwise cloning breaks), as you said above.

@GaelVaroquaux I think you meant class_weight, as @ogrisel said. Passing sample_weight in the constructor doesn't make sense for cross-validation, as far as I can tell.

(just joining the discussion because there are so few of the devs there already ;)
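A hedged illustration of why logic in __init__ breaks cloning: sklearn.base.clone rebuilds an estimator from get_params() and verifies that the constructor stored each parameter unmodified, so a transforming __init__ trips that check (BadPrior is a deliberate anti-pattern, not code from this PR):

```python
import numpy as np
from sklearn.base import BaseEstimator, clone

class BadPrior(BaseEstimator):
    def __init__(self, class_prior=None):
        # anti-pattern: __init__ transforms its argument instead of storing it
        self.class_prior = (np.asarray(class_prior) / np.sum(class_prior)
                            if class_prior is not None else None)

est = BadPrior(class_prior=[3.0, 7.0])  # stored as [0.3, 0.7], not as given
try:
    clone(est)  # rebuilds BadPrior(**est.get_params()) and checks that each
                # parameter round-trips unchanged; this one does not
except RuntimeError as exc:
    print(exc)  # clone reports that __init__ modified class_prior
```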

@GaelVaroquaux
scikit-learn member

To wrap up what has been said: as @ogrisel mentioned, class_weight is a constructor parameter, but sample_weight is a fit parameter, because it is sample-dependent. Along the same line of thought, class_prior is more of a constructor parameter.

As for the name, I have been wondering whether class_prior_override would be a good name? It seems more explicit to me than class_prior_custom, for which I have a hard time figuring out what 'custom' means.

@cricketsong

@GaelVaroquaux class_prior_override is indeed more informative than class_prior_custom. I'll wait a few days to allow for more developer input and then commit the agreed changes to this pull request.

@amueller
scikit-learn member

@cricketsong could you please also rebase onto master? This PR cannot be merged automatically.

@amueller
scikit-learn member

Thanks :)

@amueller
scikit-learn member

What is the status here? This has been lying around for a while :-/ Where are the naive Bayes experts? ;)

@MechCoder
scikit-learn member

I suppose this can be closed now. GaussianNB has supported a priors parameter since #6180.
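For reference, a usage sketch of the API that ultimately shipped in scikit-learn: priors as a constructor parameter (from #6180) and sample_weight as a fit argument (the data here is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB(priors=[0.6, 0.4])                # constructor parameter
clf.fit(X, y, sample_weight=[1.0, 1.0, 2.0, 0.5])  # fit parameter
print(clf.predict([[0.8, 1.0]]))
```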

@MechCoder MechCoder closed this Apr 28, 2016