PERF Consider using argpartition in ndcg_score #17626

rth · 2020-06-17T21:02:22Z

I now take issue with the implementation of NDCG in sklearn, which seems like it could use argpartition and be much faster for long lists of items with small top-K results (e.g. NDCG@100 with 60000 items.)

Would you like to propose a PR to improve it?

The text was updated successfully, but these errors were encountered:

rth · 2020-06-17T21:14:09Z

Other places where we might also use it,

sklearn/cluster/_kmeans.py
1296:                    np.argsort(weight_sums)[int(.5 * X.shape[0]):]
sklearn/cluster/_bicluster.py
539:        result = vectors[np.argsort(dists)[:n_best]]
sklearn/covariance/_robust_covariance.py
117:        support[np.argsort(dist)[:n_support]] = True
118-
--
144:        support[np.argsort(dist)[:n_support]] = True
145-        X_support = X[support]
--
301:    index_best = np.argsort(all_dets_sub)[:select]
302-    best_locations = np.asarray(all_locs_sub)[index_best]
--
402:            support[np.argsort(np.abs(X_centered), 0)[:n_support]] = True
403-            covariance = np.asarray([[np.var(X[support])]])

We would probably need to benchmark that using it has a measurable impact. (Note that the result of argpartition is unsorted though).

Edit: it looks like the introselect algorithm in argpartition is not stable and so cannot be used to replace argsort(kind='mergesort'). Updated the list above.

jnothman · 2020-06-17T23:26:51Z

Needs benchmark. I've not looked into whether argpartition selects an algorithm based on the data, but introselect, while having smaller asymptotic complexity, is often going to be slower than timsort...

karlhigley · 2020-06-18T02:17:31Z

For context, I am training models on the MovieLens 25M dataset and would like to compute a learning curve for NDCG on the validation set as training progresses. IIRC, that involves computing ~160k NDCGs over ~60k items each. Doing so takes much longer than training 10-20 epochs, which makes it cost prohibitive.

I don’t actually need to sort all 60k items to compute NDCG@100 though; I just need to identify the top 100 predicted scores and their indices, and then fetch the corresponding relevance labels/scores for those 100 items.

My main concern isn’t the theoretical asymptotic complexity (though I was interested to learn that), it’s that I couldn’t use the tool to do the job. I’d still be interested to see if the performance of ndcg_score can be improved, but in the mean time I’ve started writing my own ranking metrics library in order to find an implementation with acceptable cost.

jnothman · 2020-06-18T12:15:57Z

Thanks for clarifying. It's certainly worth trying argpartition here. It might make little difference for small datasets and a lot for big datasets.

karlhigley · 2020-06-18T14:32:59Z

Yeah, that's what I'd generally expect. Shouldn't be a big change in any direction for small datasets, and for a real system, there would generally be a candidate selection step of some kind to narrow the list down before computing NDCG. I generally wouldn't try to compute NDCG over a list this long, because I usually wouldn't try to optimize a single model to perform both candidate selection and ranking. I'd be more likely to optimize a candidate selection model for Precision/Recall@K and a subsequent ranking model for NDCG@K. For reasons though (writing a series of articles about practical recommender systems), I'm starting with a naive set-up that I plan to improve later.

Another part of why evaluation becomes such a bottleneck is that I'm training/evaluating multiple models in parallel on a machine with multiple GPUs. Predictions get pulled back to the CPU then fed into sklearn's NDCG computation, which pegs all the CPU cores at full utilization while evaluating a single model. In order to avoid that, it may be easiest to move to GPU-friendly evaluation metric implementations, which is why I started writing a library.

Nonetheless, argpartition is really handy and if it's not a significant performance drag for small datasets, it might be worth using for performance improvements on large datasets. It seems relatively straightforward to apply in the case where ties can be ignored, but might be a bit tricky when ties matter. Off the top of my head, I could imagine budgeting some extra slots beyond the requested list cutoff to allow for some ties as a percentage of the cutoff, which would probably work okay with continuous scores and/or when ties are expected to be rare.

rth · 2020-06-19T10:36:14Z

but introselect, while having smaller
asymptotic complexity, is often going to be slower than timsort..

Even for smaller arrays, it looks relatively faster,

>>> import numpy as np
>>> x = np.random.RandomState(0).randn(1000)
>>> %timeit np.argsort(x)[:100]
29.2 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
6.07 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

for larger ones the difference is up to x10,

>>> x = np.random.RandomState(0).randn(60000)
>>> %timeit np.argsort(x)[:100]
4.57 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
483 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

for N=1e6 @ K=100, argsort takes 100ms and argpartition 12ms.

Overall though this is still very fast, unless one has a use case as @karlhigley where one computes 100k of NDCG and where a dedicated evaluation library might indeed be more appropriate.

For scikit-learn it would be worth profiling the NDCG , as it's also possible that this won't be the bottleneck anyway, overall I'm not sure this is something that would matter for an average user.

karlhigley · 2020-06-19T13:53:33Z

It’s super-cool that y’all noticed my post and created this issue, and I hope there’s a worthwhile performance optimization here! Probably doesn’t make that much difference for the average user, but might help the worst case users (e.g. me.) 😆

postmalloc · 2020-07-02T12:09:59Z

I tried to see if argpartition can be used inside _dcg_sample_scores in place of the argsort.
I've made the following change:

if k is not None:
    discount[k:] = 0
    top_k = np.arange(y_score.shape[1] - k, y_score.shape[1])
else:
    top_k = np.arange(y_score.shape[1])
if ignore_ties:
	ranking = np.argpartition(y_score, kth=top_k)[:, ::-1]
	ranked = y_true[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
	cumulative_gains = discount.dot(ranked.T)

From the test cases I've run, the above change seems to produce identical ndcg scores as compared to the argsort based method.

Benchmark

Dataset used for benchmark:

y_true = np.arange(600000).reshape(10, 60000)
y_score = y_true + np.random.RandomState(0).uniform(-100, 100, size=y_true.shape)

Benchmark code:

%%timeit -r 10
ndcg_score(true_relevance, y_score, ignore_ties=True, k=100)

Benchmark result:

method	k = 100	k = 10000
`argsort`	138 ms ± 2.88 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)	139 ms ± 2.55 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
`argpartition`	57.6 ms ± 647 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)	1.18 s ± 154 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

As expected, argpartition seems to provide performance gains on large datasets with small k. But with a large k, the performance deteriorates, while argsort remains pretty much unaffected. Perhaps a choice between argsort and argpartition can be made at runtime based on some heuristic.

jnothman · 2020-07-02T23:45:31Z

Why are you testing on so few samples? And do we really want to test for as many as 60000 classes?

postmalloc · 2020-07-03T01:50:50Z

@jnothman, if I understand it correctly, the argsort (also argpartition) is happening along axis=1 - the dimension of classes. So, for each sample, we are selecting the top k scores to calculate ndcg score. I wanted to verify if argparition would provide any performance benefits in selecting the top k scores when you have a large number of scores/classes.

cmarmo added the Performance label Nov 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF Consider using argpartition in ndcg_score #17626

PERF Consider using argpartition in ndcg_score #17626

rth commented Jun 17, 2020

rth commented Jun 17, 2020 •

edited

jnothman commented Jun 17, 2020 via email

karlhigley commented Jun 18, 2020

jnothman commented Jun 18, 2020 via email

karlhigley commented Jun 18, 2020

rth commented Jun 19, 2020

karlhigley commented Jun 19, 2020

postmalloc commented Jul 2, 2020

jnothman commented Jul 2, 2020 via email

postmalloc commented Jul 3, 2020

PERF Consider using argpartition in ndcg_score #17626

PERF Consider using argpartition in ndcg_score #17626

Comments

rth commented Jun 17, 2020

rth commented Jun 17, 2020 • edited

jnothman commented Jun 17, 2020 via email

karlhigley commented Jun 18, 2020

jnothman commented Jun 18, 2020 via email

karlhigley commented Jun 18, 2020

rth commented Jun 19, 2020

karlhigley commented Jun 19, 2020

postmalloc commented Jul 2, 2020

Benchmark

jnothman commented Jul 2, 2020 via email

postmalloc commented Jul 3, 2020

rth commented Jun 17, 2020 •

edited