ENH: use Bayesian priors in Nearest Neighbors classifier (Issue 399) #970

Open · wants to merge 8 commits · 6 participants

cricketsong commented Jul 19, 2012

The k-Nearest-Neighbors and Radius Neighbors classifiers now make explicit use of the Bayesian prior probabilities of the classes (self.class_prior_). By default, the prior probability of each class is set to the proportion of sample points belonging to that class, but the user can also set the priors to other values.

(This is my first submission, so I'd be grateful for any and all advice. Thank you!)

Fixes #399

Member

jakevdp commented Jul 19, 2012

Thanks for submitting this! I'll take a detailed look, but I'm travelling today so it might have to wait until tomorrow or the weekend.

@@ -567,8 +567,11 @@ def fit(self, X, y):
y : {array-like, sparse matrix}, shape = [n_samples]
Target values, array of integer values.
"""
- self._y = np.asarray(y)
@amueller

amueller Jul 19, 2012

Owner

I'm not so familiar with this module, but if the y wasn't preprocessed before, you should probably do a "unique" with return_inverse (use the one from utils for backward compatibility) to get integer labels from whatever was put in. Not sure how this fit is called, though, so maybe you should wait for Jake's opinion.

@cricketsong

cricketsong Jul 19, 2012

If I understand correctly, this will also take care of the next line (line 571):

self._classes, self._y = _unique(y, return_inverse=True)
@amueller

amueller Jul 19, 2012

Owner

yes, it should. And I looked it up, there is no prior processing so this is the right place to do this, I think.
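For reference, the `unique`/`return_inverse` trick discussed above works like this (shown with `np.unique` directly; the `utils` wrapper mirrors its behavior):

```python
import numpy as np

# return_inverse=True maps arbitrary labels to integer codes
# 0..n_classes-1 and also returns the sorted class labels.
y = np.array(["cat", "dog", "cat", "bird"])
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)    # ['bird' 'cat' 'dog']
print(y_encoded)  # [1 2 1 0]
```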

sklearn/neighbors/classification.py
+ # Compute the unnormalized posterior probability, taking
+ # self.class_prior_ into consideration.
+ class_count = np.zeros(self._classes.size)
+ for k, c in enumerate(self._classes):
@amueller

amueller Jul 19, 2012

Owner

I think this loop can be replaced by a faster np.bincount.
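A sketch of that replacement (assuming the labels have already been encoded as integer indices, as discussed above):

```python
import numpy as np

y_indices = np.array([0, 1, 1, 2, 2, 2])  # integer-encoded labels

# Loop version, as in the patch:
classes = np.unique(y_indices)
class_count_loop = np.zeros(classes.size)
for k, c in enumerate(classes):
    class_count_loop[k] = np.sum(y_indices == c)

# Vectorized equivalent:
class_count = np.bincount(y_indices)
print(class_count)  # [1 2 3]
```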

Owner

amueller commented Jul 19, 2012

Thanks for your contribution.
I am not so familiar with this module but here some general comments.
First, I think we should not lightly change the default behavior of our classifiers. People who upgrade might get unexpected results. Also, this might not be the expected behavior for textbook KNN.
So I would rather like to have a parameter to control the behavior.
Also you mentioned the user could set the prior, but I don't see that you provided a way to do this. I think people would usually use KNeighborsClassifier and RadiusClassifier, so these should provide this option.

There is already a weights parameter, but I think this works on the samples, not the classes. Other estimators have a class_weight parameter, maybe you can adopt this name.

While you commented your code, you did not change the docstring or the narrative documentation. People who want to use a class usually look into these two places, not the code. So the functionality should be explained there.

If possible (and sensible) I like new functionality also to be integrated in an example, preferably an existing one.

sklearn/neighbors/classification.py
+ # given tested point), the corresponding element in `probabilities`
+ # would still be incremented only once.
+ outliers = [] # row indices of the outliers (if any)
+ for row in range(len(pred_labels)):
@amueller

amueller Jul 19, 2012

Owner

I didn't check the details but I think you can do the whole thing using np.bincount with the weights parameter.

@cricketsong

cricketsong Jul 20, 2012

I could, but then I'd have to use the minlength parameter, thus:

np.bincount(pi, weights[i], minlength=self._classes.size)

The problem is that minlength was introduced in NumPy 1.6.0, so it's too recent for what we're supporting. One approach would be:

np.bincount(np.append(pi, np.arange(self._classes.size)), np.append(weights[i], np.zeros(self._classes.size)))

which basically adds the whole range of classes, but weighted with 0.0 so they have no impact beside making sure the returned array is self._classes.size long. Though I'm not sure this is more efficient or less kludgy than the loop in the original code I submitted.

Or I'd be happy to submit a wrapper in utils/fixes.py around bincount (like we have for unique).

@amueller

amueller Jul 20, 2012

Owner

Can't you just use bincount and then append zeros?

@cricketsong

cricketsong Jul 20, 2012

I guess I was too focused on solving it in only 1 line :-) So something like:

unpadded_probs = np.bincount(pi, weights[i])
probabilities[i] = np.append(unpadded_probs, np.zeros(self._classes.size - unpadded_probs.shape[0]))
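The two-step pad-with-zeros version can be demonstrated standalone (with made-up neighbor indices and weights for a single query point):

```python
import numpy as np

n_classes = 5
pi = np.array([0, 2, 2, 1])          # classes of the neighbors of one query point
w = np.array([0.5, 1.0, 1.0, 2.0])   # corresponding sample weights

# Without minlength, bincount only covers classes up to max(pi),
# so pad the result with zeros out to the full number of classes:
unpadded = np.bincount(pi, weights=w)
probs = np.append(unpadded, np.zeros(n_classes - unpadded.shape[0]))
print(probs)  # [0.5 2.  2.  0.  0. ]
```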
sklearn/neighbors/classification.py
+ # Compute the unnormalized posterior probability, taking
+ # self.class_prior_ into consideration.
+ class_count = np.zeros(self._classes.size)
+ for k, c in enumerate(self._classes):
@amueller

amueller Jul 19, 2012

Owner

again np.bincount I guess.

Owner

amueller commented Jul 19, 2012

@cricketsong I think this looks good in principle but I didn't check details. I hope my comments make sense, for a details review, let's wait for Jake :)

@amueller First of all, thank you for your numerous suggestions.

  1. I will definitely look into using np.bincount where you've suggested it.
  2. This PR does not in fact change the default behavior: in its current form, the KNN and Radius Neighbor classifiers already implicitly rely on a Bayesian prior computed as the proportion of sample points which are in each class. (When using such a prior, the posterior probability for class k reduces simply to the ratio of points in the volume V which are in class k, where the volume is either the one covering the k nearest neighbors (as in KNeighborsClassifier), or the volume with the given radius (as in RadiusNeighborsClassifier). More details can be found in Bishop, Pattern Recognition and Machine Learning, Springer, 2006, p. 125.)
  3. Per Jake's last comment in Issue 399, that case is similar to the class_prior_ parameter in GaussianNB. It was my understanding that in the latter, the user could set the value of class_prior_ using set_params (inherited from class BaseEstimator); but now I'm not so sure. I'll wait to see what Jake had in mind.
  4. I will modify the docstring and the narrative documentation, and will include an example.

Thanks again!

Owner

ogrisel commented Jul 20, 2012

Just a note on the naming conventions for attributes and parameters in sklearn: attributes that have a trailing _ are automatically fitted parameters learned from the data, and you should not expect the user to set them manually. User-adjustable parameters should be stored with their name (without the trailing _) in the constructor (and can then be adjusted using set_params when nesting / stacking models in a sklearn.pipeline.Pipeline compound construct).

Please have a look at APIs of scikit-learn objects. The rest of the contributors guide is interesting too.

Owner

ogrisel commented Jul 20, 2012

Also could you add References section in the docstring of the models that give "Bishop, Pattern Recognition and Machine Learning, Springer, 2006, p. 125." as a reference for the explicit use of priors in k-NN models? It should also be mentioned in the narrative doc.

Owner

amueller commented Jul 20, 2012

@cricketsong Can you explain 2) again? The new predict_proba method gives different results than the old predict_proba method, right? Maybe I misunderstood, though.

@amueller The new predict_proba, by default (without the user changing the class priors) should -- and does -- give the same result as the old one. Here's why:

We want the posterior P(C_k | x) to correctly classify x. We can use the following approximations (the same are used in kNN and in Radius Neighbors):

  1. P(x | C_k) = K_k / (N_k V) where K_k is the number of points in the neighborhood of x which are in class k; N_k is the total number of points in class k; and V is the volume of the neighborhood. (About that volume: we can safely ignore it because it will get simplified away when we apply Bayes' rule).
  2. p(x) = K / (NV) where K is the total number of points in the neighborhood (fixed in kNN, variable in Radius Neighbors) and N is the total number of points in the training set.
  3. p(C_k) = N_k / N. This is the prior by default in the new predict_proba and (as we'll see) the implicit prior in the old predict_proba.

When we use Bayes' rule to compute P(C_k | x) from these 3 approximations, we get P(C_k | x) = K_k / K. And that expression is precisely what was used by the old predict_proba: it was implicitly using the prior P(C_k) = N_k / N.

In the new version, our prior is arbitrary so when we use Bayes' rule, we compute P(x | C_k) first as K_k * weight / N_k (we ignore V), multiplying by our arbitrary prior P(C_k) and then dividing by p(x) (which also ignores V).
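A quick numeric check of that argument, using the three approximations above (the numbers here are just illustration):

```python
# Suppose a neighborhood of K = 5 points contains K_k = [3, 2] points of
# classes 0 and 1, and the training set has N_k = [60, 40] of N = 100.
K_k = [3, 2]
N_k = [60, 40]
K, N = sum(K_k), sum(N_k)

# Bayes' rule with the default prior p(C_k) = N_k / N (V cancels out):
# P(C_k|x) ∝ P(x|C_k) p(C_k) = (K_k / (N_k V)) * (N_k / N)
posterior = [(Kk / Nk) * (Nk / N) for Kk, Nk in zip(K_k, N_k)]
total = sum(posterior)
posterior = [p / total for p in posterior]
print(posterior)  # ≈ [0.6, 0.4], i.e. exactly K_k / K
```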

+ # M += 1: if a predicted index was to occur more than once (for a
+ # given tested point), the corresponding element in `probabilities`
+ # would still be incremented only once.
+ for i, pi in enumerate(pred_labels):
@amueller

amueller Jul 20, 2012

Owner

Hm I thought you could eliminate this one, too. But I guess it is not so obvious. I think you'd need my favourite non-existing numpy function bincount2d (analog to histogram2d). One of these days I'm gonna write it ;)

Owner

amueller commented Jul 20, 2012

nice work :)

sklearn/neighbors/base.py
+ self._classes, self._y = unique(y, return_inverse=True)
+ self.class_prior_ = np.zeros(self._classes.size)
+ for i, y_i in enumerate(self._classes):
+ self.class_prior_[i] = np.float(np.sum(y == y_i)) / len(y)
@jakevdp

jakevdp Jul 23, 2012

Member

This can be accomplished more efficiently with

self.class_prior_ = np.bincount(self._y).astype(float) / len(self._classes)

I believe this should work

@cricketsong

cricketsong Jul 24, 2012

Unless I'm mistaken, we should divide by len(self._y) (instead of len(self._classes)), which will give us the proportion of points in each class.

@jakevdp

jakevdp Jul 24, 2012

Member

You're right: we should divide by len(self._y)
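Putting the corrected version together (assuming `self._y` holds the integer-encoded labels, here as a plain array):

```python
import numpy as np

_y = np.array([0, 0, 0, 1, 1, 2])  # integer-encoded labels

# Class priors as the proportion of training points in each class,
# dividing by len(_y) as agreed above (not by the number of classes):
class_prior_ = np.bincount(_y).astype(float) / len(_y)
print(class_prior_)  # approximately [0.5, 0.333, 0.167]
```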

-
- return mode.flatten().astype(np.int)
+ probabilities = self.predict_proba(X)
+ return self._classes[probabilities.argmax(axis=1)].astype(np.int)
@jakevdp

jakevdp Jul 23, 2012

Member

I like the code efficiency and maintainability of this. Before making this change, however, we should be sure that predict_proba doesn't have a significantly larger computational cost than simply finding the mode.

@cricketsong

cricketsong Jul 24, 2012

It does have a slightly larger computational cost than simply finding the mode, but as soon as we allow other priors than the default one, the overhead is inevitable: to predict, we must compute the posterior for each class. (See my answer to your other comment below.)

@ogrisel

ogrisel Jul 24, 2012

Owner

If the overhead is more than 30% I think we should keep the previous implementation as a special case.

@cricketsong

cricketsong Jul 24, 2012

Do we need to make measurements before going ahead with that? After all, keeping the previous implementation as a special case only adds 5 lines of code and will be more efficient (perhaps only slightly).

@ogrisel

ogrisel Jul 24, 2012

Owner

Would be good to %timeit the two alternatives on a small to medium dataset just to have an idea.

@cricketsong

cricketsong Jul 24, 2012

For what it's worth, it does not seem to have much overhead:

In [1]: import numpy as np
In [2]: from sklearn.neighbors import KNeighborsClassifier
In [3]: X = np.random.uniform(0.0, 1.0, (100000, 4))
In [4]: t = np.random.binomial(1, 0.5, (100000, 1))
In [5]: model = KNeighborsClassifier(n_neighbors=5)
In [6]: model.fit(X, t)
# With the previous `mode` implementation:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 2.93 s per loop
# With the call to `predict_proba`:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 2.92 s per loop
@jakevdp

jakevdp Jul 24, 2012

Member

Can you check the times when there are, say, 50 classes? I think that would lead to a bigger difference.

@cricketsong

cricketsong Jul 24, 2012

It leads to a much bigger difference... but in an unexpected way: the new predict_proba seems more efficient.

In [4]: t = np.random.randint(0, 50, (100000, 1))
# With the previous `mode` implementation:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 3.62 s per loop
# With the call to `predict_proba`:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 3.08 s per loop
@ogrisel

ogrisel Jul 24, 2012

Owner

Great then. It might depend on the shape of the data though.

sklearn/neighbors/classification.py
+ # Compute the unnormalized posterior probability, taking
+ # self.class_prior_ into consideration.
+ class_count = np.bincount(self._y)
+ probabilities = (probabilities / class_count) * self.class_prior_
@jakevdp

jakevdp Jul 23, 2012

Member

I'm confused by this line. Unless I'm mistaken, self.class_prior_ is equivalent to np.bincount(self._y) divided by a normalization constant, so the multiplication and division cancel out. Let me know if I'm missing something.

@cricketsong

cricketsong Jul 24, 2012

In the default kNN classification, it does cancel out. In fact, when using the default priors (normalized bincount), we don't even need to mess around with probabilities: we simply return the class k with the highest presence in the neighborhood of our query point. (See above for the gory details.)

But if we allow other priors (which is the point of issue 399), for instance flat priors (P(C_1) = P(C_2) = ...) then it won't cancel out.

@jakevdp

jakevdp Jul 24, 2012

Member

I still don't understand. Here we have

class_count = np.bincount(self._y)

and in base.py we have

self.class_prior_ = np.float(np.bincount(self._y)) / len(self._y)

The subsequent normalization means that this operation will never have any effect as the code is currently written. If we add an option which allows the user to set an arbitrary class prior, though, I think this is the correct form.

@cricketsong

cricketsong Jul 24, 2012

Exactly: it has an effect only if we allow the user to set arbitrary priors. What's probably a bit confusing is that I haven't implemented that option yet (I was actually waiting for your input), but I was thinking of processing class_prior in the same way that the weights are processed right now: one could either set class_prior to

  1. 'default' or None; in both cases, the classifier then reverts to the "regular" priors (e.g. K_k / K for the kNN classifier).
  2. 'flat'; in which case each class is considered equiprobable.
  3. A list or np.array, of the same length as the number of classes, and whose sum is 1.0.

Otherwise we raise a ValueError.

Does this make sense?

@jakevdp

jakevdp Jul 24, 2012

Member

That sounds great. I like that plan.

Member

jakevdp commented Jul 23, 2012

This is looking good. A few thoughts:

  • in order to maintain backward compatibility, the Bayesian prior should be opt-in. I'm confused by the comment above saying that adding a Bayesian prior does not change the default behavior. Granted, I think I need to sit down and try to recall what I had in mind when I opened Issue #399 😄. I pictured having an input flag in the constructor named use_bayesian_prior which is set to False by default.
  • Olivier's comments are good: a reference to Bishop would be nice, as well as some mention of the behavior in the doc string
  • I just noticed that KNeighbors has a predict_proba method, while RadiusNeighbors does not. This is left over from a previous PR which added predict_proba only to one of the classes: it would be nice to add that in RadiusNeighbors now.
  • I'm still confused by the effect of the class prior (see my line note above). Let me know if I'm missing something on that!

All that aside: great work. I think once this is cleaned up, it will be a nice addition to the module.

Member

jakevdp commented Jul 24, 2012

Looking good. Above I said a boolean flag would be good. I think it would be better to allow the user to pass their own Bayesian priors via a class_prior argument, which defaults to None. If it's None, the priors would be computed as they are right now. We could also use priors = 'flat' or other standard options to make it easier on the user.

Getting close!

I've checked with %timeit and calling predict_proba from predict is not less efficient than the former implementation:

In [3]: X = np.random.uniform(0.0, 1.0, (10000, 4))
In [4]: t = np.random.randint(0, 50, (10000, 1))
In [5]: model = RadiusNeighborsClassifier(radius=0.3)
In [6]: model.fit(X, t)
# With an independent `predict`:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 504 ms per loop
# With `predict` calling `predict_proba`:
In [7]: %timeit model.predict(X)
1 loops, best of 3: 493 ms per loop

If everything is satisfactory, I'll work on the corresponding narrative documentation tomorrow morning (EDT/Montreal time).

Member

jakevdp commented Jul 24, 2012

@cricketsong - all looks good. Thanks for all the explanations and responses above. I'm surprised that predict_proba is faster... looking at the source of stats.mode, however, I can see why that would be. Your bincount solution in predict_proba is better.

I'm excited to see the narrative documentation! Again, thanks for all the work you're doing on this.

Owner

amueller commented Jul 28, 2012

I think naming the parameter class_weight instead of class_prior would be better as this name is already used in other classifiers. What do you think @jakevdp (and @ogrisel)?

Member

jakevdp commented Jul 28, 2012

@amueller - good question: I had to think about it a bit!

class_weight is used for many of the classifiers in sklearn.linear_model. class_prior is currently used in Gaussian Naive Bayes. In discriminative classifiers like those in sklearn.linear_model, class_weight defaults to 1. In Naive Bayes, which is a generative classifier, class_prior defaults to the frequency of the label in the training set.

Nearest neighbors classification is a bit of a hybrid: it has characteristics of both generative and discriminative classification, so it could go either way. But this PR frames nearest neighbors classification squarely in a Bayesian, generative sense. Because of this, and because the default behavior is similar to that of class_prior in GaussianNaiveBayes, I'd prefer to stick with the label class_prior here.

Owner

amueller commented Jul 28, 2012

@jakevdp Ok, I agree. I didn't know that GNB used the same parameter name. The difference in default behavior is also a good argument. I didn't realize that. Thanks for your thoughts :)

doc/modules/neighbors.rst
@@ -94,7 +94,19 @@ be accomplished through the ``weights`` keyword. The default value,
distance from the query point. Alternatively, a user-defined function of the
distance can be supplied which is used to compute the weights.
-
+The nearest neighbors classification algorithm is implicitly based on
@amueller

amueller Jul 28, 2012

Owner

I would rather say something like "there is a probabilistic interpretation of nn"

sklearn/neighbors/base.py
+ """Get class prior from targets ``y`` and parameter ``class_prior``
+
+ Parameters
+ ==========
@amueller

amueller Jul 28, 2012

Owner

I think we use -- usually, right?

Owner

amueller commented Jul 28, 2012

This looks pretty good. Could you maybe add a little explanation to the example explaining the plot? I haven't looked at it myself but will do tomorrow.

@amueller Thanks for the comments. I've corrected the =/- problem, applied the modification you suggested re: the probabilistic interpretation and commented on the example.

Owner

amueller commented Sep 14, 2012

@cricketsong Sorry I forgot a bit about this PR, I have been pretty busy. What is the current status?
@jakevdp what do you think?

Member

jakevdp commented Sep 17, 2012

@amueller - Thanks for bringing this up again - I need to take some time to look back at this. I recall it was in pretty good shape, but I'd like to do a full review.

@arjoly arjoly referenced this pull request Nov 22, 2013

Open

Add support for: ML-kNN #2606

Contributor

chkoar commented Jan 5, 2018

What is the current status of this PR? Can we recap? If @cricketsong is not available maybe I could carry it out. cc @amueller @jakevdp @ogrisel

Owner

jnothman commented Jan 7, 2018

Status is stalled in ancient history! I like the idea of supporting a Bayesian KNN (mostly because users don't think about this approach to KNN), but I don't think we're going to make it a top priority!

Owner

jnothman commented Jan 7, 2018

One thing that would help here is having a reference (a textbook, a paper, etc) that would allow us to ensure correctness according to an established standard.

Am I correct in thinking that under this approach, n_neighbors=1 will produce a posterior with all its mass in one class? (I've not tried running the code because it would involve looking at 6-year-old compilation instructions lol.)

The math here seems in accordance with Bishop, but it goes against my intuition of a Bayesian prior, wherein we should rely on the prior more when the number of neighbors available to inform us is small (e.g. when using radius_neighbors in a region that is sparse in the training set). Am I mistaken?

+ class_prior : str, list or ndarray, optional (default = 'default')
+ class prior probabilities used in prediction. Possible values:
+
+ - 'default': default prior probabilities. For each class, its
@jnothman

jnothman Jan 30, 2018

Owner

should this be called 'mle'?

+ - 'default': default prior probabilities. For each class, its
+ prior probability is the proportion of points in the dataset
+ that are in this class.
+ - 'flat': equiprobable prior probabilities. If there are C classes,
@jnothman

jnothman Jan 30, 2018

Owner

or 'uniform'?

+ if class_prior in (None, 'default'):
+ return np.bincount(y).astype(float) / len(y)
+ elif class_prior == 'flat':
+ return np.ones((len(np.unique(y)),)) / len(np.unique(y))
@jnothman

jnothman Jan 30, 2018

Owner

Since we normalise later, we can just return 1 here.

+ class_prior_arr: array of the same shape as ``np.unique(y)``
+ """
+ if class_prior in (None, 'default'):
+ return np.bincount(y).astype(float) / len(y)
@jnothman

jnothman Jan 30, 2018

Owner

since we normalize later, we can ignore the / len(y)

+
+ # Compute the unnormalized posterior probability, taking
+ # self.class_prior_ into consideration.
+ class_count = np.bincount(self._y)
@jnothman

jnothman Jan 30, 2018

Owner

nowadays we need to handle multioutput y. Perhaps each will have its own prior...

+ probabilities[i] = 1e-6 # prevent division by zero later
+ continue # outlier
+ # When we support NumPy >= 1.6, we'll be able to simply use:
+ # np.bincount(pi, weights, minlength=self._classes.size)
@jnothman

jnothman Jan 30, 2018

Owner

we support this now.
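With NumPy >= 1.6 now the minimum supported version, the minlength form discussed earlier just works (illustrated with made-up neighbor indices and weights):

```python
import numpy as np

# Neighbor class indices and weights for one query point:
pi = np.array([0, 2, 2])
w = np.array([1.0, 0.5, 0.5])

# minlength pads the weighted counts out to all classes directly,
# without the append-zeros workaround:
counts = np.bincount(pi, weights=w, minlength=5)
print(counts)  # [1. 0. 1. 0. 0.]
```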
