PERF Consider using argpartition in ndcg_score #17626
Other places where we might also use it:

We would probably need to benchmark that using it has a measurable impact. (Note that the result of argpartition is unsorted, though.) Edit: it looks like the …
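The unsorted-result caveat can be illustrated directly. A small sketch (not from this thread): argpartition only places the kth element in its sorted position, so if sorted order of the top k is needed, a second sort over just those k elements restores it cheaply.

```python
import numpy as np

x = np.random.RandomState(0).randn(1000)
k = 100

# argpartition only guarantees that the kth element is in its sorted
# position, with the k smallest values before it in arbitrary order.
top_idx = np.argpartition(x, kth=k)[:k]

# A follow-up sort over just those k elements is O(k log k),
# much cheaper than sorting all n elements.
top_sorted = top_idx[np.argsort(x[top_idx])]

# Same set of indices either way; only the order differs.
assert set(top_idx) == set(top_sorted)
```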
Needs benchmark. I've not looked into whether argpartition selects an algorithm based on the data, but introselect, while having smaller asymptotic complexity, is often going to be slower than timsort…
For context, I am training models on the MovieLens 25M dataset and would like to compute a learning curve for NDCG on the validation set as training progresses. IIRC, that involves computing ~160k NDCGs over ~60k items each. Doing so takes much longer than training 10-20 epochs, which makes it cost-prohibitive.

I don't actually need to sort all 60k items to compute NDCG@100, though; I just need to identify the top 100 predicted scores and their indices, and then fetch the corresponding relevance labels/scores for those 100 items. My main concern isn't the theoretical asymptotic complexity (though I was interested to learn about that); it's that I couldn't use the tool to do the job. I'd still be interested to see if the performance of …
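The selection-then-gather approach described above can be sketched as follows (a hypothetical `ndcg_at_k` helper for a single ranked list, not scikit-learn's implementation, assuming linear gain and a log2 discount): argpartition selects the top-k predicted indices without sorting the whole array, and only those k entries are sorted and gathered.

```python
import numpy as np

def ndcg_at_k(y_true, y_score, k=100):
    """NDCG@k for a single list, selecting the top k predictions with
    argpartition. Hypothetical sketch, not scikit-learn's code."""
    n = y_score.shape[0]
    k = min(k, n)
    # Indices of the k largest predicted scores (in arbitrary order)...
    top = np.argpartition(y_score, n - k)[n - k:]
    # ...then sort just those k by descending predicted score.
    top = top[np.argsort(-y_score[top])]
    # Standard logarithmic position discount: 1/log2(rank + 1).
    discount = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float((y_true[top] * discount).sum())
    # Ideal DCG uses the k largest true relevances.
    ideal = np.sort(y_true)[::-1][:k]
    idcg = float((ideal * discount).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

With this, only the k selected items are ever sorted, so the per-list cost drops from O(n log n) to O(n + k log k).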
Thanks for clarifying. It's certainly worth trying argpartition here. It might make little difference for small datasets and a lot for big datasets.
Yeah, that's what I'd generally expect. It shouldn't be a big change in either direction for small datasets, and in a real system there would generally be a candidate selection step of some kind to narrow the list down before computing NDCG. I generally wouldn't try to compute NDCG over a list this long, because I usually wouldn't try to optimize a single model to perform both candidate selection and ranking. I'd be more likely to optimize a candidate selection model for Precision/Recall@K and a subsequent ranking model for NDCG@K. For reasons, though (I'm writing a series of articles about practical recommender systems), I'm starting with a naive set-up that I plan to improve later.

Another part of why evaluation becomes such a bottleneck is that I'm training/evaluating multiple models in parallel on a machine with multiple GPUs. Predictions get pulled back to the CPU and then fed into sklearn's NDCG computation, which pegs all the CPU cores at full utilization while evaluating a single model. To avoid that, it may be easiest to move to GPU-friendly evaluation metric implementations, which is why I started writing a library. Nonetheless, …
Even for smaller arrays, it looks relatively faster:

```python
>>> import numpy as np
>>> x = np.random.RandomState(0).randn(1000)
>>> %timeit np.argsort(x)[:100]
29.2 µs ± 2.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
6.07 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

For larger arrays the difference is up to ×10:

```python
>>> x = np.random.RandomState(0).randn(60000)
>>> %timeit np.argsort(x)[:100]
4.57 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.argpartition(x, kth=100)[:100]
483 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

For N=1e6 at k=100, argsort takes 100 ms and argpartition 12 ms. Overall, though, this is still very fast, unless one has a use case like @karlhigley's, where one computes 100k NDCGs and where a dedicated evaluation library might indeed be more appropriate. For scikit-learn it would be worth profiling the NDCG computation, as it's also possible that this won't be the bottleneck anyway. Overall, I'm not sure this is something that would matter for an average user.
It's super-cool that y'all noticed my post and created this issue, and I hope there's a worthwhile performance optimization here! It probably doesn't make much difference for the average user, but it might help the worst-case users (e.g. me). 😆
I tried to see if argpartition would help:

```python
if k is not None:
    discount[k:] = 0
    top_k = np.arange(y_score.shape[1] - k, y_score.shape[1])
else:
    top_k = np.arange(y_score.shape[1])
if ignore_ties:
    ranking = np.argpartition(y_score, kth=top_k)[:, ::-1]
    ranked = y_true[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
    cumulative_gains = discount.dot(ranked.T)
```

From the test cases I've run, the above change seems to produce identical ndcg scores as compared to the existing implementation.

Benchmark

Dataset used for benchmark:

```python
y_true = np.arange(600000).reshape(10, 60000)
y_score = y_true + np.random.RandomState(0).uniform(-100, 100, size=y_true.shape)
```

Benchmark code:

```python
%%timeit -r 10
ndcg_score(true_relevance, y_score, ignore_ties=True, k=100)
```

Benchmark result:

As expected, …
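As a quick sanity check of the approach above (a standalone sketch, not the benchmarked patch), the set of per-row top-k indices selected by argpartition matches the set a full argsort would select; only the ordering within the k differs.

```python
import numpy as np

rng = np.random.RandomState(0)
y_score = rng.randn(10, 1000)
n, k = y_score.shape[1], 100

# Full sort: top-k column indices per row, in descending score order.
by_sort = np.argsort(y_score, axis=1)[:, ::-1][:, :k]

# Partition: the k largest per row land in the last k slots, unsorted.
by_part = np.argpartition(y_score, n - k, axis=1)[:, n - k:]

# Both select the same set of indices in every row (no ties with randn).
for row_sort, row_part in zip(by_sort, by_part):
    assert set(row_sort) == set(row_part)
```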
Why are you testing on so few samples? And do we really want to test for as many as 60000 classes?
@jnothman, if I understand it correctly, the argsort (and likewise argpartition) happens along …
As reported by @karlhigley, …
Would you like to propose a PR to improve it?