[MRG+1] ENH use pairwise_distances_chunked in brute nearest neighbors #11136

Merged
4 commits merged into scikit-learn:master on Jun 6, 2018

Conversation

@jnothman jnothman (Member) commented May 25, 2018

Formerly part of #10280

Fixes #7287

@jnothman jnothman changed the title from "[MRG+1] ENH use pairwise_distances_chunked in brute nearest neighbors" to "[MRG] ENH use pairwise_distances_chunked in brute nearest neighbors" on May 25, 2018
@jnothman jnothman (Member Author) left a comment:

Note that I believe no tests are needed: the functionality of algorithm='brute' is tested, and the memory consumption of pairwise_distances_chunked is tested elsewhere.

@rth rth (Member) left a comment:

This will be very useful!

Some benchmarks to double-check that this doesn't affect performance too much on a standard use case with the default working_memory size would be helpful. There were some benchmarks in #10280 (comment), but I'm not sure to what extent they would generalize to this.

For readability, it would help to have some docstrings in the reduce functions specifying the type and shape of the inputs.

@@ -276,9 +278,25 @@ def _pairwise(self):
class KNeighborsMixin(object):
    """Mixin for k-neighbors searches"""

    def _kneighbors_reduce_func(self, dist, start,
                                n_neighbors, return_distance):
@rth rth (Member) commented May 27, 2018:

Would it make sense to decorate it with @staticmethod and remove self, to help readability?

@jnothman jnothman (Member Author) commented May 28, 2018:

It uses self.effective_metric_
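
For illustration, here is a minimal, self-contained sketch of what such a chunk-reduce function might look like, written as a plain function rather than a method (the real method uses self.effective_metric_ as noted above; everything beyond the signature shown in the diff is an assumption, and the docstring shows the kind of type/shape information requested in the review):

import numpy as np

def kneighbors_reduce_func_sketch(dist, start, n_neighbors, return_distance):
    """Reduce a chunk of the distance matrix to its k nearest neighbors.

    Parameters
    ----------
    dist : ndarray of shape (n_samples_chunk, n_samples_fit)
        Distances between one chunk of query samples and the fitted data.
    start : int
        Index of the first row of this chunk in the full query set
        (unused here; part of the reduce-function signature).
    n_neighbors : int
        Number of neighbors to keep per query sample.
    return_distance : bool
        Whether to return distances along with indices.

    Returns
    -------
    dist, neigh_ind : ndarrays of shape (n_samples_chunk, n_neighbors)
        Distances are only returned when return_distance is True.
    """
    sample_range = np.arange(dist.shape[0])[:, None]
    # argpartition selects the n_neighbors smallest distances per row in O(n)
    neigh_ind = np.argpartition(dist, n_neighbors - 1, axis=1)[:, :n_neighbors]
    # argpartition does not sort within the selection, so sort explicitly
    neigh_ind = neigh_ind[sample_range,
                          np.argsort(dist[sample_range, neigh_ind])]
    if return_distance:
        return dist[sample_range, neigh_ind], neigh_ind
    return neigh_ind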

@jnothman jnothman (Member Author) commented May 28, 2018:

Benchmark:

from pprint import pprint
import time
import sklearn
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import euclidean_distances
n_features = 100


results = []
for n_samples in [1000, 10000, 20000, 50000, ]:
    X = np.random.rand(n_samples, n_features)
    sampleD = euclidean_distances(X[:100])
    avg_dist = np.mean(sampleD[np.triu_indices_from(sampleD, 1)])
    for metric in ['euclidean']:
        nn = NearestNeighbors(radius=avg_dist * .9, n_neighbors=10,
                              metric=metric, algorithm='brute').fit(X)
        for method in ['kneighbors', 'radius_neighbors']:
            # for working_mem in ['master']:
            for working_mem in [100, 1024, 999999999]:
                with sklearn.config_context(working_memory=working_mem):
                    s = time.time()
                    out = getattr(nn, method)()
                    t = time.time() - s
                    n_neighbors = sum(len(x) for x in out[0])
                    d = {'metric': metric,
                         'n_samples': n_samples,
                         'method': method,
                         'working_memory': working_mem,
                         'total_n_neighbors': n_neighbors,
                         'time': t}
                    pprint(d)
                    results.append(d)

[plot of the benchmark results]

All timings:

,method,metric,n_samples,time,total_n_neighbors,working_memory
0,kneighbors,euclidean,1000,0.030091047286987305,10000,master
1,radius_neighbors,euclidean,1000,0.020231008529663086,35066,master
2,kneighbors,euclidean,10000,3.1793291568756104,100000,master
3,radius_neighbors,euclidean,10000,2.0275309085845947,5868760,master
4,kneighbors,euclidean,20000,17.558682918548584,200000,master
5,radius_neighbors,euclidean,20000,9.4132080078125,20476428,master
6,kneighbors,euclidean,50000,1119.4494569301605,500000,master
7,radius_neighbors,euclidean,50000,1070.3520658016205,117407558,master
0,kneighbors,euclidean,1000,0.031382083892822266,10000,100
1,kneighbors,euclidean,1000,0.021708011627197266,10000,1024
2,kneighbors,euclidean,1000,0.026649951934814453,10000,999999999
3,radius_neighbors,euclidean,1000,0.0228121280670166,45472,100
4,radius_neighbors,euclidean,1000,0.019867897033691406,45472,1024
5,radius_neighbors,euclidean,1000,0.03182816505432129,45472,999999999
6,kneighbors,euclidean,10000,2.182466983795166,100000,100
7,kneighbors,euclidean,10000,2.8718271255493164,100000,1024
8,kneighbors,euclidean,10000,2.8780999183654785,100000,999999999
9,radius_neighbors,euclidean,10000,1.3253610134124756,5574012,100
10,radius_neighbors,euclidean,10000,1.8732130527496338,5574012,1024
11,radius_neighbors,euclidean,10000,1.8151168823242188,5574012,999999999
12,kneighbors,euclidean,20000,9.68129301071167,200000,100
13,kneighbors,euclidean,20000,13.157173156738281,200000,1024
14,kneighbors,euclidean,20000,22.281371116638184,200000,999999999
15,radius_neighbors,euclidean,20000,9.239833116531372,19401726,100
16,radius_neighbors,euclidean,20000,6.303516149520874,19401726,1024
17,radius_neighbors,euclidean,20000,8.991750001907349,19401726,999999999
18,kneighbors,euclidean,50000,68.80946683883667,500000,100
19,kneighbors,euclidean,50000,72.32541084289551,500000,1024
20,kneighbors,euclidean,50000,1429.7003951072693,500000,999999999
21,radius_neighbors,euclidean,50000,53.88025784492493,108698316,100
22,radius_neighbors,euclidean,50000,49.64297103881836,108698316,1024
23,radius_neighbors,euclidean,50000,1224.8581790924072,108698316,999999999
24,kneighbors,euclidean,100000,214.5971851348877,1000000,100

@jnothman jnothman (Member Author) commented:

Basically, we see good improvements with capped working_memory, and negligible change from master.
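
For reference, the capped setting used above can be reproduced either per block with sklearn.config_context or globally with sklearn.set_config; the snippet below assumes the working_memory budget (in MiB) introduced in #10280 is available, as in the benchmark script above:

import numpy as np
import sklearn
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(20000, 100)
nn = NearestNeighbors(n_neighbors=10, algorithm='brute').fit(X)

# Cap the chunked distance computation at roughly 100 MiB for this block only
with sklearn.config_context(working_memory=100):
    dist, ind = nn.kneighbors()

# Or set the budget globally for the rest of the session
sklearn.set_config(working_memory=1024)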

@jnothman jnothman (Member Author) commented Jun 2, 2018:

@rth, let's merge this too?

@rth rth (Member) commented Jun 3, 2018:

Thanks for the benchmarks!

I was actually curious whether this has an impact on the run time; it turns out that it does, a little. Below are a few more benchmarks.

For `NearestNeighbors.kneighbors`:
import numpy as np
from sklearn.neighbors import NearestNeighbors

from neurtu import delayed, Benchmark

n_features = 100

n_samples_predict = 1000

def runs():
    for n_samples in [5000, 10000, 20000, 40000, 80000, 160000, 320000]:
        for n_neighbors in [1, 5, 10]:
            tags = {'n_samples': n_samples, 'n_neighbors': n_neighbors}

            X = np.random.rand(n_samples, n_features)
            Xp = np.random.rand(n_samples_predict, n_features)
            estimator = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute').fit(X)
            yield delayed(estimator, tags=tags).kneighbors(Xp)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_neighbors', 'n_samples'], inplace=True)

df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})

df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)
                      wall_time           peak_memory          
                         master PR #11136      master PR #11136
n_neighbors n_samples                                          
1           5000         0.0633    0.0633       75.19     74.69
            10000        0.1206    0.1212      152.39    151.02
            20000        0.2369    0.2362      303.52    305.51
            40000        0.4633    0.4630      609.49    610.18
            80000        0.9132    0.9129     1221.02   1220.41
            160000       1.8354    1.8600     2441.99   2045.79
            320000       3.6594    3.8362     4883.88   2045.83
5           5000         0.1007    0.1002       76.09     76.25
            10000        0.1818    0.1887      152.25    152.59
            20000        0.3785    0.3681      304.92    305.06
            40000        0.7574    0.7430      610.27    610.15
            80000        1.4955    1.4317     1220.46   1220.11
            160000       2.9148    3.0222     2441.36   2045.79
            320000       5.5406    5.7484     4882.66   2045.89
10          5000         0.1044    0.1008       76.30     76.30
            10000        0.1955    0.1981      152.48    152.40
            20000        0.3810    0.3534      305.16    305.00
            40000        0.7431    0.7677      610.25    610.17
            80000        1.4157    1.4905     1220.71   1220.60
            160000       2.8404    3.0266     2441.35   2045.79
            320000       5.7615    5.9708     4882.64   2045.86
For `NearestNeighbors.radius_neighbors`:
import numpy as np
from sklearn.neighbors import NearestNeighbors

from neurtu import delayed, Benchmark

n_features = 100
n_neighbors = 5

n_samples_predict = 1000
radius = 0.5

def runs():
    for n_samples in [5000, 10000, 20000, 40000, 80000, 160000, 320000]:
        tags = {'n_samples': n_samples, 'radius': radius}

        X = np.random.rand(n_samples, n_features)
        Xp = np.random.rand(n_samples_predict, n_features)
        estimator = NearestNeighbors(radius=radius, algorithm='brute').fit(X)
        yield delayed(estimator, tags=tags).radius_neighbors(Xp)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['radius', 'n_samples'], inplace=True)

df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})

df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)
                 wall_time         peak_memory         
                    master      PR      master       PR
radius n_samples                                       
0.5    5000         0.0557  0.0556       39.09    38.92
       10000        0.0997  0.0999       76.57    76.56
       20000        0.1984  0.1930      152.74   152.73
       40000        0.3736  0.3784      305.18   305.18
       80000        0.7296  0.7248      610.66   610.66
       160000       1.4284  1.4596     1221.32  1022.96
       320000       2.8373  2.9735     2442.63  1022.96

So the memory usage works as expected; however, we pay for it with a performance overhead of around 4-5% once the whole distance matrix no longer fits into a single chunk. I imagine the overhead is due to chunking and concatenating. Overall this sounds like a reasonable compromise to me.
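
To make the source of that overhead concrete, here is a rough sketch (not the PR's actual code) of how a reduce function plugs into pairwise_distances_chunked: each chunk of rows is reduced independently, and the per-chunk results then have to be stitched back together, which is where the extra few percent plausibly goes.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(20000, 100)
n_neighbors = 5

def reduce_func(dist_chunk, start):
    # Keep only the n_neighbors smallest distances per row of this chunk
    # (left unsorted within the selection, for brevity).
    ind = np.argpartition(dist_chunk, n_neighbors - 1, axis=1)[:, :n_neighbors]
    rows = np.arange(dist_chunk.shape[0])[:, None]
    return dist_chunk[rows, ind], ind

# One reduced result per chunk of rows of X; the chunk size is derived
# from the working_memory budget (in MiB).
chunks = list(pairwise_distances_chunked(X, reduce_func=reduce_func,
                                         working_memory=100))

# Reassembling the full output requires concatenating across chunks.
dists = np.vstack([d for d, _ in chunks])
inds = np.vstack([i for _, i in chunks])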

@rth rth (Member) left a comment:

LGTM, if some docstrings are added to the private reduce functions (#11136 (review))

Thanks @jnothman !

@rth rth changed the title from "[MRG] ENH use pairwise_distances_chunked in brute nearest neighbors" to "[MRG+1] ENH use pairwise_distances_chunked in brute nearest neighbors" on Jun 3, 2018
@jnothman jnothman (Member Author) commented Jun 4, 2018:

Huh. My benchmarks above show substantial wall time decreases for large n_samples.

@rth rth (Member) commented Jun 4, 2018:

> Huh. My benchmarks above show substantial wall time decreases for large n_samples.

The plots represent run time? I thought it was memory usage. Do you still have the benchmark code?

@jnothman jnothman (Member Author) commented Jun 4, 2018 via email

@jnothman jnothman (Member Author) commented Jun 4, 2018:

(although naively using t = time.time() - s)

@rth rth (Member) commented Jun 4, 2018:

How much RAM were you running these benchmarks with? I get a memory error for 50000 samples with 32GB RAM...

Hmm, but for the rest, yes; with your script I can confirm the results:

                                time                          
working_memory                   100      1024 999999999    master   
n_samples method                                                     
1000      kneighbors        0.023986  0.023552  0.023704  0.022691   
          radius_neighbors  0.023470  0.023864  0.022067  0.023892   
10000     kneighbors        1.832281  2.249930  2.242692  2.268365   
          radius_neighbors  1.321087  1.760668  1.740842  1.787875   
20000     kneighbors        8.025549  7.530514  9.977264  9.978430   
          radius_neighbors  4.835002  4.589210  6.991414  7.119566   

The difference appears to be that in your benchmark you were running kneighbors(X=None), while I was using a separate random X.

The 20k case for me takes around 20 GB of RAM (out of 32 GB available). I'm not familiar enough with the details of dynamic memory allocation: could it be that allocating such a large contiguous array without chunking carries some overhead at the OS level (fragmentation issues, etc.) compared to allocating/deallocating smaller chunks? In any case the results look good in both benchmarks!
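
For clarity, the two query modes being compared above differ in the size of the distance matrix that gets computed; a small sketch (with hypothetical sizes) of the difference:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10000, 100)   # training set
Xp = np.random.rand(1000, 100)   # separate query set
nn = NearestNeighbors(n_neighbors=5, algorithm='brute').fit(X)

# First benchmark: query the training set against itself (X=None), so a
# 10000 x 10000 distance matrix is computed and each point's own
# zero-distance match is excluded.
dist_self, ind_self = nn.kneighbors()

# Second benchmark: query a fixed-size test set, so only a 1000 x 10000
# distance matrix is computed; the two setups are not directly comparable.
dist_new, ind_new = nn.kneighbors(Xp)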

@jnothman jnothman (Member Author) commented Jun 4, 2018:

Hmm... I think I ran these on my laptop, with 16 GB... Yes, your benchmarks probably had fewer experimental variables. I forgot that I should probably keep the size of the test data constant, although in practice users often use X=None with NearestNeighbors.

Let's not worry about it too much, given that we have separately confirmed that this PR is a good thing.

@jnothman jnothman requested a review from TomDLT June 5, 2018 21:38
@TomDLT TomDLT (Member) left a comment:

LGTM

@TomDLT TomDLT merged commit 646ceb1 into scikit-learn:master Jun 6, 2018