
[MRG+1] Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier #9597

Merged: 63 commits into scikit-learn:master, Aug 7, 2019

Conversation

@webber26232 (Contributor) commented Aug 21, 2017

Reference Issue

What does this implement/fix? Explain your changes.

Currently, RadiusNeighborsClassifier doesn't provide a predict_proba(X) method. When an outlier (a sample that doesn't have any neighbors within the fixed radius) is detected, no class label can be assigned.

This branch implements predict_proba(X) for RadiusNeighborsClassifier. The class probabilities are computed from each sample's neighbors, and their weights, within the radius r.

When outliers (samples with no neighbors within the radius r) are detected, their handling is controlled by the parameter outlier_label (a usage sketch follows the table):

| outlier_label | predict(X) | predict_proba(X) |
| --- | --- | --- |
| None | An exception is raised. | An exception is raised. |
| int, str, or list of corresponding labels | Assigns the given manual label to outliers. | If the given label is in the training y, that label's probability is 1 and the rest are 0. Otherwise, all probabilities are 0. |
| 'most_frequent' | Assigns the most frequent label in the training targets to outliers. | The most frequent training label gets probability 1 and all other labels get 0. |
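For concreteness, a minimal usage sketch of the merged behavior (the outlier_label parameter and predict_proba as released in scikit-learn 0.22); the toy data here is purely illustrative:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1])  # class 0 is the most frequent label

# 'most_frequent' assigns the majority training label to outliers.
clf = RadiusNeighborsClassifier(radius=1.0, outlier_label='most_frequent')
clf.fit(X, y)

# [0.5] has neighbors within the radius; [10.0] has none (an outlier).
print(clf.predict([[0.5], [10.0]]))        # [0 0]
print(clf.predict_proba([[0.5], [10.0]]))  # [[1. 0.] [1. 0.]]
```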

Any other comments?

Documentation is added for the RadiusNeighborsClassifier class (the outlier handling and an example) and for the predict_proba method.

@webber26232 changed the title from "Add predict_proba(X) for RaduisNeighborsClassifier" to "Add predict_proba(X) for RadiusNeighborsClassifier" on Aug 21, 2017
jnothman (Member) left a comment

Is there a reason to provide this for radius neighbors and not kneighbors?

@webber26232 (Contributor, Author) commented Aug 22, 2017

@jnothman Hi, the predict_proba function for kneighbors has already been implemented, but I do not find one for radius neighbors.

Sorry I am new here. Is there anything I should do first?


jnothman (Member) left a comment

Is there more that can be shared between the radius and k implementations?

Please add a unit test.

webber26232 added 2 commits Aug 22, 2017
@webber26232 changed the title from "Add predict_proba(X) for RadiusNeighborsClassifier" to "Add predict_proba(X) and outlier handler for RadiusNeighborsClassifier" on Aug 23, 2017
webber26232 added 4 commits Aug 24, 2017
@webber26232 (Contributor, Author) commented Aug 24, 2017

@jnothman Hi, I found that #2606, #399, and #970 all mention the same idea: using a Bayesian prior in both K and Radius neighbors. However, this feature has not been implemented yet.

webber26232 added 7 commits Aug 25, 2017
bug
@jnothman (Member) commented Sep 18, 2017

I thought I commented here, but I can't see it now. Perhaps I commented in my head.

#970 makes the MAP prediction. You sample from the posterior distribution. This is a very substantial difference. I can see the benefit of sampling from the posterior, but I think we should prefer the MAP. Certainly, if we allow stochastic predictions (which I don't think we do anywhere else, but of course predict_proba allows users to sample for themselves), it needs to be controlled by a random_state that the user can set for replicability.
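To make the distinction concrete, a small illustrative sketch (not code from this PR): MAP prediction is deterministic, while sampling from predict_proba is stochastic and needs a seedable random number generator for replicability.

```python
import numpy as np

# Illustrative sketch, not code from this PR: MAP prediction vs.
# user-side sampling from a predict_proba-style output.
rng = np.random.RandomState(0)  # seedable, for replicability
classes = np.array([0, 1])
proba = np.array([[0.7, 0.3],
                  [0.4, 0.6]])

map_pred = classes.take(proba.argmax(axis=1))  # deterministic: [0 1]
sampled = np.array([rng.choice(classes, p=row) for row in proba])
print(map_pred, sampled)
```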

@webber26232 (Contributor, Author) commented Sep 18, 2017

@jnothman I think I am a little bit confused about prior and posterior in neighbors models.

In my mind, the prior distribution is the distribution of target variable, 'y' in the training data. Assuming that we have 5 samples in the training data with labels of [0, 0, 0, 1, 1].

Then the prior P(0) = 0.6 and P(1) = 0.4.
Let's say we want to use two neighbors to do the classification. The possible label combinations of these two neighbors are:

  • [0, 0]
  • [0, 1]
  • [1, 1]

The posteriors are something like

  • P(1 | [0, 0])
  • P(1 | [0, 1])
  • P(1 | [1, 1])

Am I correct?
I will add the random_state feature soon.

@jnothman (Member) commented Sep 19, 2017

I think you should avoid making predictions stochastic, at least for now. As I said, a user can do so by sampling from predict_proba.

I'm not sure what you're trying to say, so let me try from the top. In predict_proba we return a distribution over classes for each sample. In predict we return the single most likely class from that distribution.

In a Bayesian approach to NN, that distribution is the posterior distribution of classes, accounting for both a prior distribution (our belief of how classes should be distributed before we look at the labels in the neighborhood of the sample) and the observations (i.e. the class distribution in the neighborhood of the sample).

The posterior is generally P(class | neighborhood) ∝ P(neighborhood | class) * P(class without regard to neighborhood). So you calculate this for each class.
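A hedged illustration of this idea (not part of the PR): fold the training-set class frequencies in as pseudo-counts and normalise the neighborhood counts, so the posterior is pulled toward the prior.

```python
import numpy as np

# Illustrative sketch, not the PR's code: MAP-style posterior for one
# query point, smoothing neighborhood counts toward the training prior.
train_labels = np.array([0, 0, 0, 1, 1])  # prior: P(0)=0.6, P(1)=0.4
neighbor_labels = np.array([0, 1])        # labels observed in the radius
alpha = 1.0                               # prior strength (pseudo-counts)

classes, prior_counts = np.unique(train_labels, return_counts=True)
prior = prior_counts / prior_counts.sum()

obs = np.array([(neighbor_labels == c).sum() for c in classes])
posterior = (obs + alpha * prior) / (obs.sum() + alpha)
print(posterior)  # [0.533... 0.466...], pulled toward the prior
```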

@webber26232 (Contributor, Author) commented Sep 20, 2017

@jnothman Thanks for your answer! It's clear to me now.

The reason I considered adding an outlier handler for RadiusNeighborsClassifier is to avoid abnormal scoring, especially scoring based on the predict_proba method.

Since an outlier doesn't have any neighbors within the fixed radius, all of its class probabilities will be 0. A probability of 0 has a huge impact on scores such as log_loss, as sketched below.
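A tiny illustration of that impact (illustrative numbers, not from the PR): the per-sample log-loss term, -log(p), blows up as the probability assigned to the true class approaches 0.

```python
import numpy as np

# Illustrative: the log-loss contribution of a single sample, -log(p),
# explodes as the predicted probability p of the true class approaches 0.
for p in [0.5, 0.1, 1e-6, 1e-15]:
    print(f"p={p:g}  -log(p)={-np.log(p):.2f}")
```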

When we use GridSearchCV or RFE, which depend on scoring, the search will be influenced by the outliers' scores and won't give an accurate result.

Therefore, I thought assigning outliers prior or uniform probabilities in the predict_proba method might work. To make predict consistent with predict_proba, I added random outlier predictions to predict.

@jnothman (Member) commented Sep 20, 2017

Okay, but having a prior is fundamentally different to outputting a different label in the case of an outlier.

webber26232 and others added 4 commits Jul 21, 2019
@webber26232 (Contributor, Author) commented Aug 4, 2019

@jnothman @TomDLT I've changed the implementation of predict() to improve performance. This branch is ready for review and merging.

jnothman (Member) left a comment

This still looks good to me, though I've not benchmarked it myself. We might want to benchmark it against #13783.

Review thread on doc/whats_new/v0.22.rst:

- |Efficiency| Efficiency improvements for
:func:`neighbors.RadiusNeighborsClassifier.prdict` by changing
implementation from scipy.stats.mode to numpy.bincount.

@jnothman (Member) commented Aug 4, 2019
I don't think this is the right description anymore.

@TomDLT (Member) commented Aug 6, 2019

Isn't it?

@rth (Member) commented Aug 6, 2019
I think @jnothman meant something along the lines of,

- |Efficiency| Efficiency improvements for
  :func:`neighbors.RadiusNeighborsClassifier.predict_proba` by changing
  implementation from scipy.stats.mode to numpy.bincount, and for 
  :func:`neighbors.RadiusNeighborsClassifier.predict` that is now computed as
  an `argmax` of `predict_proba`.

?

@rth (Member) commented Aug 7, 2019
We don't have neighbors.RadiusNeighborsClassifier.predict_proba in the current master. It is created in this branch.

How about

  • |Efficiency| Efficiency improvements for :func:`neighbors.RadiusNeighborsClassifier.predict` by changing implementation from using `scipy.stats.mode` to using `numpy.bincount` in `predict_proba`.

Right, of course. But then we can't say that the implementation of predict_proba changed from using scipy.stats.mode, as it didn't exist previously. Maybe just saying that "predict is now computed from predict_proba" (and that it's faster) without going in much details.
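For clarity, a minimal sketch of the "predict is computed from predict_proba" pattern being described here (illustrative code, not the PR's exact implementation):

```python
import numpy as np

# Illustrative sketch of deriving predict() from predict_proba() with an
# argmax, instead of recomputing per-row modes.
def predict_from_proba(classes, proba):
    # classes: (n_classes,) array of labels
    # proba: (n_samples, n_classes) array of class probabilities
    return classes.take(np.argmax(proba, axis=1))

classes = np.array(['a', 'b', 'c'])
proba = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.25, 0.25]])
print(predict_from_proba(classes, proba))  # ['b' 'a']
```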

fix typo

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@TomDLT (Member) commented Aug 6, 2019

A small benchmark shows a nice speed up compared to master.

I did not compare with #14543 though. @rth

(Figure: benchmark plot of time(branch) / time(master) against n_features)

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

results = []
for klass, weights, n_samples, n_features in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [100, 1000, 10000],
    [10, 100, 1000, 10000],
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(2, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    out = %timeit -o neigh.predict(X)  # IPython magic; run in IPython/Jupyter
    results.append((klass.branch, weights, n_samples, n_features, out.average))

##################
# plot the results
df = pd.DataFrame(
    results, columns='name weights n_samples n_features average'.split(' '))

table = df.pivot_table(index='n_features',
                       columns=['name', 'weights', 'n_samples'],
                       values='average')
table = table['branch'] / table['master']
table.plot(marker='.', logx=True)
plt.title('time(branch) / time(master)')
plt.show()
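For reference, a hedged sketch of the counting change at the heart of the speed-up (illustrative, not the PR's exact code): per-class weighted counts via np.bincount give both the probabilities and, via argmax, the mode, replacing per-row scipy.stats.mode calls.

```python
import numpy as np

# Illustrative sketch, not the PR's exact code: weighted class counts for
# one query point via np.bincount, replacing a scipy.stats.mode call.
neighbor_labels = np.array([0, 2, 2, 1])            # encoded class indices
neighbor_weights = np.array([1.0, 0.5, 0.5, 2.0])   # e.g. 1/distance weights

counts = np.bincount(neighbor_labels, weights=neighbor_weights, minlength=3)
proba = counts / counts.sum()
print(proba, proba.argmax())  # [0.25 0.5 0.25] 1  (class 1 is the mode)
```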
TomDLT approved these changes Aug 6, 2019


Resolved review threads: doc/whats_new/v0.22.rst, sklearn/neighbors/classification.py, sklearn/neighbors/tests/test_neighbors.py
webber26232 and others added 9 commits Aug 6, 2019
typo fix

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
fix typo

Co-Authored-By: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
…rn into RadNeiClfPredProb
Update fork repo
…rn into RadNeiClfPredProb
@rth (Member) commented Aug 6, 2019

> I did not compare with #14543 though.

The existence of that PR shouldn't prevent this one from getting merged. It's a clear improvement.

Looks like there are 2 approvals and the comments were addressed, apart from #9597 (comment).

@rth (Member) commented Aug 6, 2019

The case benchmarked above in #9597 (comment) with n_class=2 is somewhat of a best case, since for a large number of classes computing the probabilities is likely more expensive. Still, this PR is faster than master even with up to 1000 classes:

name                                   speedup (lower is better)                                                           
weights  n_samples n_features n_class                           
distance 2000      10         2                         0.367820
                              100                       0.368969
                              1000                      0.564461
                   100        2                         0.598981
                              100                       0.671586
                              1000                      0.787753
                   1000       2                         0.890036
                              100                       0.916686
                              1000                      0.975508
         10000     10         2                         0.438604
                              100                       0.440602
                              1000                      0.665863
                   100        2                         0.768036
                              100                       0.727307
                              1000                      0.827298
                   1000       2                         0.966119
                              100                       0.953986
                              1000                      0.990590
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import RadiusNeighborsClassifier
from neurtu import timeit, delayed


class OldRadiusNeighborsClassifier(RadiusNeighborsClassifier):
    def predict(self, X):
        """"""
        from sklearn.utils.validation import check_array
        from sklearn.neighbors.base import _get_weights
        from sklearn.utils.extmath import weighted_mode
        from scipy import stats
        X = check_array(X, accept_sparse='csr')
        n_samples = X.shape[0]

        neigh_dist, neigh_ind = self.radius_neighbors(X)
        inliers = [i for i, nind in enumerate(neigh_ind) if len(nind) != 0]
        outliers = [i for i, nind in enumerate(neigh_ind) if len(nind) == 0]

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]
        n_outputs = len(classes_)

        if self.outlier_label is not None:
            neigh_dist[outliers] = 1e-6
        elif outliers:
            raise ValueError('No neighbors found for test samples %r, '
                             'you can try using larger radius, '
                             'give a label for outliers, '
                             'or consider removing them from your dataset.' %
                             outliers)

        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            pred_labels = np.zeros(len(neigh_ind), dtype=object)
            pred_labels[:] = [_y[ind, k] for ind in neigh_ind]
            if weights is None:
                mode = np.array(
                    [stats.mode(pl)[0] for pl in pred_labels[inliers]],
                    dtype=np.int)
            else:
                mode = np.array([
                    weighted_mode(pl, w)[0]
                    for (pl, w) in zip(pred_labels[inliers], weights[inliers])
                ], dtype=np.int)

            mode = mode.ravel()

            y_pred[inliers, k] = classes_k.take(mode)

        if outliers:
            y_pred[outliers, :] = self.outlier_label

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred


OldRadiusNeighborsClassifier.branch = 'master'
RadiusNeighborsClassifier.branch = 'branch'

results = []
for klass, weights, n_samples, n_features, n_class in itertools.product(
    [OldRadiusNeighborsClassifier, RadiusNeighborsClassifier],
    ['uniform', 'distance'],
    [2000, 10000],
    [10, 100, 1000],
    [2, 100, 1000]
):
    X = np.random.randn(n_samples, n_features)
    y = np.random.randint(n_class, size=n_samples)

    neigh = klass(weights=weights, radius=0.2).fit(X, y)
    tags = {'name': klass.branch, "weights": weights,  'n_samples': n_samples,
            'n_features': n_features, "n_class": n_class}
    out = delayed(neigh, tags=tags).predict(X)
    results.append(out)


results = timeit(results)
results = results.wall_time.unstack(0)
results['speedup (lower is better)'] = results['branch'] / results['master']

print(results)

I haven't reviewed the code since there are enough reviews, but +1 for merge based on the benchmark results.

@rth (Member) commented Aug 7, 2019

> I think the implementations in #13783 and #14543 are helpful for neighbors.KNeighborsClassifier but not for neighbors.RadiusNeighborsClassifier, because neigh_ind is a 2D array in neighbors.KNeighborsClassifier but a nested array (each 1D array has a different length) in neighbors.RadiusNeighborsClassifier.

I agree.
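To illustrate the structural difference being agreed on (a small sketch, not from this PR):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# kneighbors returns a regular 2D index array; radius_neighbors returns a
# ragged object array, since each point can have a different neighbor count.
X = np.random.RandomState(0).randn(20, 3)
nn = NearestNeighbors().fit(X)

k_ind = nn.kneighbors(X, n_neighbors=3, return_distance=False)
print(k_ind.shape)  # (20, 3)

r_ind = nn.radius_neighbors(X, radius=1.0, return_distance=False)
print([len(ind) for ind in r_ind[:5]])  # row lengths vary
```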

TomDLT merged commit 36bca23 into scikit-learn:master on Aug 7, 2019
11 of 12 checks passed:

  • ci/circleci: lint (queued behind running builds)
  • LGTM analysis: C/C++ (no code changes detected)
  • LGTM analysis: JavaScript (no code changes detected)
  • LGTM analysis: Python (no new or fixed alerts)
  • scikit-learn.scikit-learn Build #20190807.53 (succeeded)
  • scikit-learn.scikit-learn Linux py35_conda_openblas (succeeded)
  • scikit-learn.scikit-learn Linux py35_ubuntu_atlas (succeeded)
  • scikit-learn.scikit-learn Linux pylatest_conda_mkl_pandas (succeeded)
  • scikit-learn.scikit-learn Linux32 py35_ubuntu_atlas_32bit (succeeded)
  • scikit-learn.scikit-learn Windows py35_pip_openblas_32bit (succeeded)
  • scikit-learn.scikit-learn Windows py37_conda_mkl (succeeded)
  • scikit-learn.scikit-learn macOS pylatest_conda_mkl (succeeded)
@TomDLT (Member) commented Aug 7, 2019

Thanks for this nice work @webber26232 ! 🎉

