
PERF Consider using argpartition in ndcg_score #17626

Open
rth opened this issue Jun 17, 2020 · 10 comments
@rth (Member) commented Jun 17, 2020

As reported by @karlhigley:

> I now take issue with the implementation of NDCG in sklearn, which seems like it could use argpartition and be much faster for long lists of items with small top-K results (e.g. NDCG@100 with 60000 items).

Would you like to propose a PR to improve it?

@rth (Member, Author) commented Jun 17, 2020

Other places where we might also use it:

sklearn/cluster/_kmeans.py
1296:                    np.argsort(weight_sums)[int(.5 * X.shape[0]):]
sklearn/cluster/_bicluster.py
539:        result = vectors[np.argsort(dists)[:n_best]]
sklearn/covariance/_robust_covariance.py
117:        support[np.argsort(dist)[:n_support]] = True
118-
--
144:        support[np.argsort(dist)[:n_support]] = True
145-        X_support = X[support]
--
301:    index_best = np.argsort(all_dets_sub)[:select]
302-    best_locations = np.asarray(all_locs_sub)[index_best]
--
402:            support[np.argsort(np.abs(X_centered), 0)[:n_support]] = True
403-            covariance = np.asarray([[np.var(X[support])]])

We would probably need benchmarks to confirm that using it has a measurable impact. (Note that the result of argpartition is unsorted, though.)

Edit: it looks like the introselect algorithm in argpartition is not stable and so cannot be used to replace argsort(kind='mergesort'). Updated the list above.
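
For illustration, a small sketch of the stability issue (a minimal example, not from the original post):

import numpy as np

# argpartition's introselect is not stable: among the k smallest values,
# tied elements can come back in any relative order, unlike
# argsort(kind='mergesort'), which preserves the input order of ties.
x = np.array([3.0, 1.0, 1.0, 2.0, 0.0])
k = 3
print(np.argsort(x, kind='mergesort')[:k])  # [4 1 2] -- ties keep input order
print(np.argpartition(x, k - 1)[:k])        # k smallest indices, order undefined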

@jnothman (Member) commented Jun 17, 2020 via email

@karlhigley commented

For context, I am training models on the MovieLens 25M dataset and would like to compute a learning curve for NDCG on the validation set as training progresses. IIRC, that involves computing ~160k NDCGs over ~60k items each. Doing so takes much longer than training 10-20 epochs, which makes it cost prohibitive.

I don’t actually need to sort all 60k items to compute NDCG@100 though; I just need to identify the top 100 predicted scores and their indices, and then fetch the corresponding relevance labels/scores for those 100 items.

My main concern isn't the theoretical asymptotic complexity (though I was interested to learn that); it's that I couldn't use the tool to do the job. I'd still be interested to see whether the performance of ndcg_score can be improved, but in the meantime I've started writing my own ranking metrics library in order to find an implementation with acceptable cost.
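
As a minimal sketch of that idea (sizes mirror the numbers above; the variable names are illustrative):

import numpy as np

# Select the top-100 of 60k predicted scores without a full sort, then
# order just those 100 and fetch their relevance labels.
rng = np.random.RandomState(0)
y_score = rng.randn(60000)         # predicted scores
y_true = rng.randint(0, 5, 60000)  # relevance labels

k = 100
top_k = np.argpartition(y_score, -k)[-k:]        # top-k indices, unordered
top_k = top_k[np.argsort(y_score[top_k])[::-1]]  # order those k, descending
top_relevance = y_true[top_k]                    # all that's needed for NDCG@100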

@jnothman (Member) commented Jun 18, 2020 via email

@karlhigley commented

Yeah, that's what I'd generally expect. Shouldn't be a big change in any direction for small datasets, and for a real system, there would generally be a candidate selection step of some kind to narrow the list down before computing NDCG. I generally wouldn't try to compute NDCG over a list this long, because I usually wouldn't try to optimize a single model to perform both candidate selection and ranking. I'd be more likely to optimize a candidate selection model for Precision/Recall@K and a subsequent ranking model for NDCG@K. For reasons though (writing a series of articles about practical recommender systems), I'm starting with a naive set-up that I plan to improve later.

Another part of why evaluation becomes such a bottleneck is that I'm training/evaluating multiple models in parallel on a machine with multiple GPUs. Predictions get pulled back to the CPU then fed into sklearn's NDCG computation, which pegs all the CPU cores at full utilization while evaluating a single model. In order to avoid that, it may be easiest to move to GPU-friendly evaluation metric implementations, which is why I started writing a library.

Nonetheless, argpartition is really handy and if it's not a significant performance drag for small datasets, it might be worth using for performance improvements on large datasets. It seems relatively straightforward to apply in the case where ties can be ignored, but might be a bit tricky when ties matter. Off the top of my head, I could imagine budgeting some extra slots beyond the requested list cutoff to allow for some ties as a percentage of the cutoff, which would probably work okay with continuous scores and/or when ties are expected to be rare.
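
Off the cuff, that tie-budget idea might look something like this (entirely hypothetical; the function and the 10% default budget are illustrative, not a concrete proposal):

import numpy as np

def top_k_with_tie_budget(y_score, k, budget_frac=0.1):
    # Partition at k plus a small budget so that values tied around the
    # cutoff have a chance to land in the candidate pool.
    pool = min(k + int(np.ceil(k * budget_frac)), y_score.shape[0])
    candidates = np.argpartition(y_score, -pool)[-pool:]
    # Tie-aware logic then only has to inspect this small pool rather
    # than the full score array.
    return candidates[np.argsort(y_score[candidates])[::-1]]

idx = top_k_with_tie_budget(np.random.RandomState(0).randn(60000), k=100)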

@rth (Member, Author) commented Jun 19, 2020

> but introselect, while having smaller asymptotic complexity, is often going to be slower than timsort..

Even for smaller arrays, argpartition looks faster:

>>> import numpy as np
>>> x = np.random.RandomState(0).randn(1000)
>>> %timeit np.argsort(x)[:100]
29.2 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
6.07 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For larger arrays, the difference is up to 10x:

>>> x = np.random.RandomState(0).randn(60000)
>>> %timeit np.argsort(x)[:100]
4.57 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
483 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For N=1e6 and k=100, argsort takes ~100 ms and argpartition ~12 ms.

Overall, though, this is still very fast, unless one has a use case like @karlhigley's, computing ~100k NDCG scores, where a dedicated evaluation library might indeed be more appropriate.

For scikit-learn it would be worth profiling ndcg_score, as it's also possible that the sort won't be the bottleneck anyway; overall I'm not sure this is something that would matter for an average user.
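
For example, a quick profiling sketch (shapes arbitrary):

import cProfile
import numpy as np
from sklearn.metrics import ndcg_score

# Profile one large ndcg_score call to check whether the sort dominates.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 5, size=(10, 60000))
y_score = rng.randn(10, 60000)
cProfile.run("ndcg_score(y_true, y_score, k=100)", sort="cumtime")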

@karlhigley commented

It’s super-cool that y’all noticed my post and created this issue, and I hope there’s a worthwhile performance optimization here! Probably doesn’t make that much difference for the average user, but might help the worst case users (e.g. me.) 😆

@postmalloc (Contributor) commented

I tried to see whether argpartition can be used inside _dcg_sample_scores in place of argsort. I've made the following change:

if k is not None:
    discount[k:] = 0
    # Only the last k partition positions need to be in sorted order,
    # since the zeroed discounts cancel everything beyond rank k.
    top_k = np.arange(y_score.shape[1] - k, y_score.shape[1])
else:
    top_k = np.arange(y_score.shape[1])
if ignore_ties:
    # Passing an array as kth sorts all of those positions into place,
    # so reversing yields the top k indices in descending-score order.
    ranking = np.argpartition(y_score, kth=top_k)[:, ::-1]
    ranked = y_true[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
    cumulative_gains = discount.dot(ranked.T)

From the test cases I've run, the above change seems to produce NDCG scores identical to those of the argsort-based method.

Benchmark

Dataset used for benchmark:

y_true = np.arange(600000).reshape(10, 60000)
y_score = y_true + np.random.RandomState(0).uniform(-100, 100, size=y_true.shape)

Benchmark code:

%%timeit -r 10
ndcg_score(y_true, y_score, ignore_ties=True, k=100)

Benchmark result:

method       | k = 100          | k = 10000
argsort      | 138 ms ± 2.88 ms | 139 ms ± 2.55 ms
argpartition | 57.6 ms ± 647 µs | 1.18 s ± 154 ms

(mean ± std. dev. of 10 runs; 10 loops each, except argpartition at k = 10000 with 1 loop)

As expected, argpartition seems to provide performance gains on large datasets with small k. But with a large k, the performance deteriorates, while argsort remains pretty much unaffected. Perhaps a choice between argsort and argpartition can be made at runtime based on some heuristic.
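
For instance, such a heuristic could look like this (a hypothetical sketch; the function name and the 0.1 threshold are placeholder assumptions to be tuned by benchmarks):

import numpy as np

def top_k_indices(y_score, k, ratio=0.1):
    # Use argpartition only when k is a small fraction of the number of
    # columns; fall back to a full argsort otherwise.
    n_samples, n = y_score.shape
    if k > ratio * n:
        return np.argsort(y_score, axis=1)[:, ::-1][:, :k]
    part = np.argpartition(y_score, n - k, axis=1)[:, -k:]  # unordered top-k
    rows = np.arange(n_samples)[:, np.newaxis]
    order = np.argsort(y_score[rows, part], axis=1)[:, ::-1]
    return part[rows, order]  # each row's top-k columns, best first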

@jnothman (Member) commented Jul 2, 2020 via email

@postmalloc (Contributor) commented

@jnothman, if I understand correctly, the argsort (and likewise the argpartition) happens along axis=1, the class dimension. So, for each sample, we select the top k scores to calculate the NDCG score. I wanted to verify whether argpartition would provide any performance benefit in selecting the top k scores when there is a large number of scores/classes.
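
A tiny illustration of that axis=1 selection (shapes arbitrary):

import numpy as np

# Each row is one sample; argpartition along axis=1 picks out that
# sample's k highest-scoring classes without sorting the whole row.
y_score = np.random.RandomState(0).randn(4, 10)
k = 3
top_k = np.argpartition(y_score, y_score.shape[1] - k, axis=1)[:, -k:]
# top_k has shape (4, 3): k class indices per sample, unordered.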
