[MRG+1] Add Normalized Discounted Cumulative Gain #9951
Conversation
Thanks!
Please try to make use of, or extend, test_common.py. Narrative docs in doc/modules/model_evaluation.rst should be added, and probably a scorer in metrics/scorers.py.
I've not yet looked at tests and implementation in detail.
sklearn/metrics/__init__.py (outdated)

```diff
@@ -117,5 +120,5 @@
     'silhouette_score',
     'v_measure_score',
     'zero_one_loss',
-    'brier_score_loss',
+    'brier_score_loss'
```
Please don't make unrelated changes if you can help it!
sklearn/metrics/ranking.py (outdated)

```python
    Parameters
    ----------
    y_true : array, shape = [n_samples, n_labels]
        True labels.
```
Are these classes (possibly strings)? Ints? Floats?
sklearn/metrics/ranking.py
Outdated
""" | ||
if y_true.shape != y_score.shape: | ||
raise ValueError("y_true and y_score have different shapes") | ||
y_true = np.atleast_2d(y_true) |
Usually we'd be more explicit, using something like check_array and check_consistent_length.
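For illustration, a minimal sketch of the more explicit validation being suggested, using scikit-learn's public utilities (the helper name `_check_dcg_inputs` is hypothetical; the PR's actual code may differ):

```python
import numpy as np
from sklearn.utils import check_array, check_consistent_length

def _check_dcg_inputs(y_true, y_score):
    # validate array-likes and coerce to numeric ndarrays
    y_true = check_array(y_true, ensure_2d=False)
    y_score = check_array(y_score, ensure_2d=False)
    # raise a clear error if the number of samples differs
    check_consistent_length(y_true, y_score)
    return np.atleast_2d(y_true), np.atleast_2d(y_score)
```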
sklearn/metrics/ranking.py (outdated)

```python
    Parameters
    ----------
    y_true : array, shape = [n_samples, n_labels]
        True labels.
```
Type?
Codecov Report

```
@@            Coverage Diff             @@
##           master    #9951      +/-   ##
==========================================
+ Coverage   96.17%   96.17%    +<.01%
==========================================
  Files         336      336
  Lines       62613    62674       +61
==========================================
+ Hits        60218    60279       +61
  Misses       2395     2395
```
It would be good if you added to the PR description a list of tasks you intend to complete before changing WIP to MRG.
I would like to see tests including:
- known toy examples (e.g. from a reference paper or easy to calculate by hand)
- boundary cases (all scores equal for some samples, perfect score)
- perhaps invariants due to perturbing perfect y_score
And please add narrative docs.
Also, is there any value in supporting multiclass inputs, then binarized, as the previous implementation attempted to?
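To make the first request concrete, here is a sketch of what hand-computed toy tests could look like (the example values and test names are mine, not from the PR; they use the public function this PR adds):

```python
import numpy as np
import pytest
from sklearn.metrics import ndcg_score

def test_ndcg_toy_example():
    y_true = np.asarray([[0, 1, 2]])   # graded relevance judgements
    y_score = np.asarray([[2, 1, 0]])  # ranks the labels in order 0, 1, 2
    # DCG places the gains 0, 1, 2 at ranks 1, 2, 3:
    dcg = 0 / np.log2(2) + 1 / np.log2(3) + 2 / np.log2(4)
    # the ideal ranking places the gains 2, 1, 0 instead:
    ideal = 2 / np.log2(2) + 1 / np.log2(3) + 0 / np.log2(4)
    assert ndcg_score(y_true, y_score) == pytest.approx(dcg / ideal)

def test_ndcg_perfect_score():
    y_true = np.asarray([[3, 1, 2, 0]])
    # scoring with the ground truth itself must give the perfect score of 1
    assert ndcg_score(y_true, y_true) == pytest.approx(1.0)
```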
```python
    Returns
    -------
    normalized_discounted_cumulative_gain : float in [0., 1.]
        The averaged NDCG scores for all samples.
```
Please add an Examples section where you demonstrate a simple invocation.
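Such an Examples section might look like the following (a sketch; the exact values shown are mine):

```python
>>> import numpy as np
>>> from sklearn.metrics import ndcg_score
>>> # ground-truth relevance of some answers to a query
>>> y_true = np.asarray([[10, 0, 0, 1, 5]])
>>> # scores predicted for the same answers
>>> y_score = np.asarray([[.1, .2, .3, 4, 70]])
>>> ndcg_score(y_true, y_score)
0.69...
```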
sklearn/metrics/ranking.py
Outdated
"multiclass-multioutput"): | ||
raise ValueError("{0} format is not supported".format(y_type)) | ||
|
||
ranking = np.argsort(y_score)[:, ::-1] |
Should we be using rankdata to handle ties?
It's true that we should handle ties.

Averaging the ranks of equally scored results may not work, because the summation of gains has to be cut off at k (we need to know how many elements of a tied group fall beyond k). In

Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores. Marc Najork, Frank McSherry. ECIR, 2008

the authors average the true gains of the results in a tied group before multiplying by the discount (discounts beyond k are 0).
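A toy illustration of that tie-averaging scheme (the numbers are mine, purely for illustration):

```python
import numpy as np

# k = 2: discounts are 1 / log2(rank + 1) within the cutoff and 0 beyond it
discounts = np.array([1.0, 1.0 / np.log2(3.0), 0.0])
# three results tied on score, with true gains 3, 0, 0; one of them
# necessarily falls beyond k, but we cannot know which one
gains = np.array([3.0, 0.0, 0.0])
# averaging the gains within the tied group before applying the discounts
# makes the result independent of how the tie happens to be broken
tie_averaged_dcg = (gains.mean() * discounts).sum()
print(tie_averaged_dcg)  # ~1.63 for every permutation of the tied gains
```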
Sorry for the late answer. TODO list:
@jeromedockes Thanks for working on this! A few comments are below.
You can also edit the first post of this PR and include the above todo list there, so it would be included in the PR summary view (cf. the GitHub docs).
sklearn/metrics/ranking.py (outdated)

```python
        The NDCG score for each sample (float in [0., 1.]).

    References
    ----------
```
These are internal functions, and docs won't be built for them, so I think you could remove the References section here and in _dcg_sample_scores, particularly since these references can be found in the corresponding public functions.
```diff
@@ -30,6 +31,7 @@
 from sklearn.metrics import label_ranking_loss
 from sklearn.metrics import roc_auc_score
 from sklearn.metrics import roc_curve
+from sklearn.metrics.ranking import _ndcg_sample_scores, _dcg_sample_scores
```
Would it be possible to test the public (score-averaged) functions in addition to the private ones? They are tested in common tests (with respect to symmetry invariance etc.), but there are currently no tests verifying that ndcg_score and dcg_score produce the right values.
I'm having a hard time implementing this efficiently. I have tried writing the loop explicitly, and writing it as a dot product with a sparse block-diagonal matrix, but it takes a long time. It must not take a long time, because in the vast majority of cases there shouldn't be any ties: since this is a metric for evaluating a ranking, the scores computed by the estimator should indeed induce an ordering on the labels. For example, if we are scoring a document retrieval or recommendation system, its scores should allow it to decide in which order to display results for a user, so there shouldn't be ties, at least among the relevant results. I'll start working on improving the tests in the meanwhile.
Assuming we deal with one row (i.e. y_score and y_true are vectors) at a time, I think you can do the tie handling with something like:

```python
_, inv, count = np.unique(y_score, return_inverse=True, return_counts=True)
n_unique = len(count)
ranked = np.zeros(n_unique)
np.add.at(ranked, inv, y_true)  # or ranked = np.bincount(inv, weights=y_true, minlength=n_unique)
ranked /= count
```

I'm not sure if this is more efficient than what you've experimented with... If this slows things down a great deal, we can eventually optimise in a way that fast-paths the all-unique-scores case.
sklearn/metrics/ranking.py (outdated)

```python
    ranked = y_true[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
    if k is not None:
        ranked = ranked[:, :k]
    discount = 1 / (np.log(np.arange(ranked.shape[1]) + 2) / np.log(log_basis))
```
`np.arange(2, k + 2)` would be clearer.
Ah sorry, I forgot to make the rank descending with respect to scores... just do
Btw, if you choose a solution with
Just a few nitpicks.
Thanks @jeremiedbb!
I'm wondering if the cost of using
After timing a few examples, actually I am not seeing such big differences anymore; maybe we can remove the `ignore_ties` parameter:

```python
import time
import numpy as np
from sklearn.metrics.ranking import ndcg_score

y_true = np.random.randn(10000, 100)
y_score = np.random.randn(*y_true.shape)
# y_true = np.random.binomial(5, .2, (10000, 100))
# y_score = np.random.binomial(5, .2, y_true.shape)

start = time.time()
dcg = ndcg_score(y_true, y_score)
stop = time.time()
print('with ties:', stop - start)

start = time.time()
dcg_ignore_ties = ndcg_score(y_true, y_score, ignore_ties=True)
stop = time.time()
print('ignore ties:', stop - start)
```

Trying with a few different sizes, I see a speedup around 5x in some cases, but not much more.
We can keep this parameter. It gives a bit more flexibility, and 5x is not bad. Besides, it only adds 3 lines to the code, and it's the easiest part of the code to follow.
I just added a small request. Besides that, LGTM!
```python
def _tie_averaged_dcg(y_true, y_score, discount_cumsum):
    _, inv, counts = np.unique(
```
I think this function deserves a comment about what it does (and how), because it's not easy to follow.
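For instance, a commented sketch of the tie-averaged DCG computation, reconstructed from the fragments quoted in this thread and the Najork & McSherry scheme discussed above (the PR's final code may differ in details):

```python
import numpy as np

def _tie_averaged_dcg(y_true, y_score, discount_cumsum):
    # Group the results by tied score. np.unique sorts ascending, so
    # negate the scores to get the groups in ranking (descending) order.
    _, inv, counts = np.unique(-y_score, return_inverse=True, return_counts=True)
    # Average the true gains within each tied group.
    ranked = np.zeros(len(counts))
    np.add.at(ranked, inv, y_true)
    ranked /= counts
    # groups[i] is the last rank position occupied by tied group i, so
    # consecutive differences of the cumulated discounts give the total
    # discount applied to each group (discounts beyond k are 0).
    groups = np.cumsum(counts) - 1
    discount_sums = np.empty(len(counts))
    discount_sums[0] = discount_cumsum[groups[0]]
    discount_sums[1:] = np.diff(discount_cumsum[groups])
    return (ranked * discount_sums).sum()
```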
I think "basis" is not the right term... log_base might be better than log_basis but it might be best to check other parts of the library / ecosystem.
Thanks
sklearn/metrics/ranking.py (outdated)

```python
    ignore_ties : bool, optional (default=False)
        Assume that there are no ties in y_score (which is likely to be the
        case if y_score is continuous) for performance gains.
```
"Performance" is ambiguous; use "efficiency".
```python
    """
    gain = _dcg_sample_scores(y_true, y_score, k, ignore_ties=ignore_ties)
    normalizing_gain = _dcg_sample_scores(y_true, y_true, k, ignore_ties=True)
```
Please comment on why it is safe to ignore_ties here.
sklearn/metrics/ranking.py (outdated)

```python
    np.add.at(ranked, inv, y_true)
    ranked /= counts
    groups = np.cumsum(counts) - 1
    discount_sums = np.zeros(len(counts))
```
Use `np.empty` instead of `np.zeros`:

```diff
-discount_sums = np.zeros(len(counts))
+discount_sums = np.empty(len(counts))
```
```python
        -.2, .2, size=y_score.shape)
    assert _dcg_sample_scores(y_true, y_score) == pytest.approx(
        3 / np.log2(np.arange(2, 7)))
    assert _dcg_sample_scores(y_true, y_score) == pytest.approx(
```
Can we use pytest.mark.parametrize to test the ignore_ties equivalence?
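A sketch of what such a parametrized equivalence test could look like (the test name and values are mine, not the PR's):

```python
import numpy as np
import pytest
from sklearn.metrics import ndcg_score

@pytest.mark.parametrize("k", [None, 3])
def test_ndcg_ignore_ties_equivalence(k):
    rng = np.random.RandomState(0)
    y_true = rng.randint(0, 4, size=(5, 10)).astype(float)
    # continuous random scores are tie-free with probability 1, so the
    # fast path and the tie-aware path must agree
    y_score = rng.randn(5, 10)
    assert ndcg_score(y_true, y_score, k=k, ignore_ties=True) == pytest.approx(
        ndcg_score(y_true, y_score, k=k, ignore_ties=False))
```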
```python
def test_ndcg_ignore_ties_with_k():
    a = np.arange(12).reshape((2, 6))
    ndcg_score(a, a, k=3, ignore_ties=True)
```
Shouldn't this be ensuring the result is the same as with ignore_ties=False?
Thanks, "base" is indeed the right term (used in the references, in /benchmarks/bench_isotonic.py, and everywhere else -- I don't know why I wrote "basis") |
@jnothman Do you have other changes to request?
I'm happy with the API so will merge on the basis of the existing approvals
Thanks @jeromedockes
Thanks!
After #9921, it was decided that the old implementation of NDCG would be removed (#9932), but that a new one might be useful.
Discounted Cumulative Gain and Normalized Discounted Cumulative Gain are popular ranking metrics (https://en.wikipedia.org/wiki/Discounted_cumulative_gain).
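For reference, with $rel_i$ denoting the true gain of the item that y_score ranks at position $i$, the usual definitions (notation mine) are

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},$$

where $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ranking, i.e. the one obtained by sorting on y_true itself.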
TODO: