
[MRG+1] Add Normalized Discounted Cumulative Gain #9951

Merged
merged 67 commits into scikit-learn:master on Aug 13, 2019

Conversation

Contributor

@jeromedockes jeromedockes commented Oct 18, 2017

After #9921, it was decided that the old implementation of NDCG would be removed (#9932), but that a new one might be useful.

Discounted Cumulative Gain and Normalized Discounted Cumulative Gain are popular ranking metrics (https://en.wikipedia.org/wiki/Discounted_cumulative_gain)
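
(For readers unfamiliar with the metric, here is a minimal, self-contained illustration of what DCG and NDCG compute in plain NumPy; this is only a sketch, not the implementation proposed in this PR:)

import numpy as np

# Toy relevance judgements and predicted scores for a single query.
y_true = np.asarray([3, 2, 3, 0, 1])       # true gains (graded relevance)
y_score = np.asarray([.1, .2, .3, 4, 70])  # scores predicted by the model

# DCG: sum the true gains in decreasing order of predicted score,
# discounted by log2(position + 2) for 0-based positions.
order = np.argsort(y_score)[::-1]
discount = 1 / np.log2(np.arange(len(y_true)) + 2)
dcg = (y_true[order] * discount).sum()

# NDCG: divide by the DCG of the ideal ordering, giving a value in [0, 1].
ideal_dcg = (np.sort(y_true)[::-1] * discount).sum()
ndcg = dcg / ideal_dcg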

TODO:

  • handle ties correctly
  • test on toy examples
  • test boundary cases (all scores equal for some samples, perfect score)
  • test invariants due to perturbing perfect y_score
  • add narrative docs.
  • examples section in the docstring
  • add public functions to tests

Member

@jnothman jnothman left a comment

Thanks!

Please try to make use of, or extend, test_common.py. Narrative docs should be added in doc/modules/model_evaluation.rst, and probably a scorer in metrics/scorers.py.

I've not yet looked at tests and implementation in detail.

@@ -117,5 +120,5 @@
     'silhouette_score',
     'v_measure_score',
     'zero_one_loss',
-    'brier_score_loss',
+    'brier_score_loss'
Member

Please don't make unrelated changes if you can help it!

Parameters
----------
y_true : array, shape = [n_samples, n_labels]
True labels.
Member

Are these classes (possibly strings)? Ints? Floats?

"""
if y_true.shape != y_score.shape:
raise ValueError("y_true and y_score have different shapes")
y_true = np.atleast_2d(y_true)
Member

Usually we'd be more explicit using something like check_array and check_consistent_length
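
(For reference, a minimal sketch of the validation pattern mentioned here; the example arrays are made up and this is not the PR's final code:)

import numpy as np
from sklearn.utils import check_array, check_consistent_length

y_true = np.asarray([[2, 0, 1]])
y_score = np.asarray([[.7, .1, .2]])

# Validate dtype and dimensions explicitly, then check that both arrays
# have the same number of samples.
y_true = check_array(y_true, ensure_2d=False)
y_score = check_array(y_score, ensure_2d=False)
check_consistent_length(y_true, y_score)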

Parameters
----------
y_true : array, shape = [n_samples, n_labels]
True labels.
Member

Type?

@codecov

codecov bot commented Oct 19, 2017

Codecov Report

Merging #9951 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #9951      +/-   ##
==========================================
+ Coverage   96.17%   96.17%   +<.01%     
==========================================
  Files         336      336              
  Lines       62613    62674      +61     
==========================================
+ Hits        60218    60279      +61     
  Misses       2395     2395
Impacted Files Coverage Δ
sklearn/metrics/__init__.py 100% <100%> (ø) ⬆️
sklearn/metrics/ranking.py 98.93% <100%> (+0.12%) ⬆️
sklearn/metrics/tests/test_ranking.py 100% <100%> (ø) ⬆️
sklearn/feature_selection/base.py 94.76% <0%> (ø) ⬆️
sklearn/tests/test_pipeline.py 99.64% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jnothman
Member

jnothman commented Nov 1, 2017

It would be good if you added to the PR description a list of tasks you intend to complete before changing WIP to MRG.

Member

@jnothman jnothman left a comment

I would like to see tests including:

  • known toy examples (e.g. from a reference paper or easy to calculate by hand)
  • boundary cases (all scores equal for some samples, perfect score)
  • perhaps invariants due to perturbing perfect y_score

And please add narrative docs.

Also, is there any value in supporting multiclass inputs, binarizing them internally, as the previous implementation attempted to?
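
(As a hedged illustration of what "multiclass inputs, then binarized" could look like; this is only a sketch of the idea, not something this PR implements:)

import numpy as np
from sklearn.preprocessing import label_binarize

# Multiclass ground truth: one relevant class per sample.
y_true = np.asarray([0, 2, 1])

# Binarizing yields a 0/1 relevance matrix that NDCG can consume as gains.
y_true_relevance = label_binarize(y_true, classes=[0, 1, 2])
# array([[1, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0]])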

-------
normalized_discounted_cumulative_gain : float in [0., 1.]
The averaged NDCG scores for all samples.

Member

Please add an Examples section where you demonstrate a simple invocation.

"multiclass-multioutput"):
raise ValueError("{0} format is not supported".format(y_type))

ranking = np.argsort(y_score)[:, ::-1]
Member

should we be using rankdata to handle ties?
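
(For context, scipy.stats.rankdata averages the ranks of tied values by default; a quick illustration, not code from this PR:)

from scipy.stats import rankdata

# Tied scores receive the average of the ranks they span.
rankdata([0.5, 0.2, 0.5, 0.1])  # array([3.5, 2. , 3.5, 1. ])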

Contributor Author

It's true that we should handle ties.

Averaging the ranks of equally scored results may not work because the summation of gains has to be cut off at k (we need to know how many elements of a tied group fall beyond k). In

Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores. Marc Najork, Frank McSherry. ECIR, 2008

the authors average the true gains of results in a tied group before multiplying by the discounts (discounts beyond k are 0).
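
(A minimal single-sample sketch of that tie-averaged scheme, assuming 1-D y_true and y_score arrays; the function and variable names are illustrative, not the PR's final implementation:)

import numpy as np

def tie_averaged_dcg(y_true, y_score, k=None):
    # Discount per position; positions at index k and beyond contribute nothing.
    discount = 1 / np.log2(np.arange(len(y_true)) + 2)
    if k is not None:
        discount[k:] = 0
    # Group results by descending score; inv maps each result to its group.
    _, inv, counts = np.unique(-y_score, return_inverse=True,
                               return_counts=True)
    # Average the true gains within each group of tied scores.
    avg_gain = np.bincount(inv, weights=y_true) / counts
    # Sum of the discounts covered by each group, in ranking order.
    discount_cumsum = np.concatenate([[0], np.cumsum(discount)])
    group_ends = np.cumsum(counts)
    group_discounts = (discount_cumsum[group_ends]
                       - discount_cumsum[group_ends - counts])
    # With no ties this reduces to the usual DCG.
    return (avg_gain * group_discounts).sum()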

@jeromedockes
Contributor Author

jeromedockes commented Nov 7, 2017

Sorry for the late answer. TODO list:

  • handle ties correctly
  • test on toy examples
  • test boundary cases (all scores equal for some samples, perfect score)
  • test invariants due to perturbing perfect y_score
  • add narrative docs.
  • examples section in the docstring
  • allow multiclass inputs?
  • add public functions to tests

Member

@rth rth left a comment

@jeromedockes Thanks for working on this! A few comments are below.

You can also edit the first post of this PR to include the above todo list, so that it shows up in the PR summary view (cf. the GitHub docs).

The NDCG score for each sample (float in [0., 1.]).

References
----------
Member

These are internal functions, and docs won't be built for them, so I think you could remove the References section here and in _dcg_sample_scores, particularly since these references can be found in the corresponding public functions.

@@ -30,6 +31,7 @@
from sklearn.metrics import label_ranking_loss
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics.ranking import _ndcg_sample_scores, _dcg_sample_scores
Member

Would it be possible to test the public (score-averaged) functions in addition to the private ones? They are tested in the common tests (with respect to symmetry invariance, etc.), but there are currently no tests verifying that ndcg_score and dcg_score produce the right values.

@jeromedockes
Contributor Author

the authors average the true gains of results in a tied group before multiplying by the discount (discounts beyond k are 0)

I'm having a hard time implementing this efficiently. I have tried writing the loop explicitly, and writing it as a dot product with a sparse block-diagonal matrix, but both take a long time. It must not take a long time, because in the vast majority of cases there shouldn't be any ties: since this is a metric for evaluating a ranking, the scores computed by the estimator should induce an ordering on the labels. For example, if we are scoring a document retrieval or recommendation system, its scores should allow it to decide in which order to display results for a user, so there shouldn't be ties, at least among the relevant results.

I'll start working on improving the tests in the meantime.

Member

@jnothman jnothman left a comment

Assuming we deal with one row (i.e. y_score and y_true are vectors) at a time, I think you can do the tie handling with something like:

_, inv, count = np.unique(y_score, return_inverse=True, return_counts=True)
n_unique = len(count)
ranked = np.zeros(n_unique)
np.add.at(ranked, inv, y_true)  # or ranked = np.bincount(inv, weights=y_true, minlength=n_unique)
ranked /= count

I'm not sure if this is more efficient than what you've experimented with... If this slows things down a great deal, we can eventually optimise in a way that fast-paths the all-unique-scores case.

ranked = y_true[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
if k is not None:
    ranked = ranked[:, :k]
discount = 1 / (np.log(np.arange(ranked.shape[1]) + 2) / np.log(log_basis))
Member

np.arange(2, k + 2) would be clearer.

@jnothman
Member

jnothman commented Nov 29, 2017

Ah sorry I forgot to make the rank descending with respect to scores... just do ranked = ranked[:, ::-1][:, :k]

@jnothman
Member

Btw, if you choose a solution with unique's return_counts, there's a backport for old numpy versions in #9976 that can be merged into this PR.

@glemaitre glemaitre changed the title from [MRG] Add Normalized Discounted Cumulative Gain to [MRG+1] Add Normalized Discounted Cumulative Gain on Jul 19, 2019
Member

@jeremiedbb jeremiedbb left a comment

just a few nitpicks

doc/modules/model_evaluation.rst (resolved)
doc/whats_new/v0.22.rst (outdated, resolved)
sklearn/metrics/tests/test_ranking.py (outdated, resolved)
@jeromedockes
Contributor Author

Thanks @jeremiedbb!

@jeremiedbb
Member

jeremiedbb commented Jul 19, 2019

I'm wondering whether the cost of using np.unique to detect ties, and thus decide whether we can use the fast method or need the general one, would be too high.
It could avoid having the ignore_ties parameter.

@jeromedockes
Contributor Author

jeromedockes commented Jul 19, 2019

I'm wondering whether the cost of using np.unique to detect ties, and thus decide whether we can use the fast method or need the general one, would be too high.
It could avoid having the ignore_ties parameter.

After timing a few examples, I'm actually not seeing such big differences anymore; maybe we can remove the ignore_ties option altogether.

import time

import numpy as np

from sklearn.metrics.ranking import ndcg_score


y_true = np.random.randn(10000, 100)
y_score = np.random.randn(*y_true.shape)

# y_true = np.random.binomial(5, .2, (10000, 100))
# y_score = np.random.binomial(5, .2, y_true.shape)

start = time.time()
dcg = ndcg_score(y_true, y_score)
stop = time.time()
print('with ties:', stop - start)

start = time.time()
dcg_ignore_ties = ndcg_score(y_true, y_score, ignore_ties=True)
stop = time.time()
print('ignore ties:', stop - start)

Trying a few different sizes, I see a speedup of around 5x in some cases, but not much more.

Member

@jeremiedbb jeremiedbb left a comment

We can keep this parameter. It gives a bit more flexibility, and 5x is not bad. Besides, it only adds 3 lines of code, and it's the easiest part of the code to follow.

I just added a small request. Besides that, LGTM!



def _tie_averaged_dcg(y_true, y_score, discount_cumsum):
    _, inv, counts = np.unique(
Member

I think this function deserves a comment about what it does (and how), because it's not easy to follow.

Member

@jnothman jnothman left a comment

I think "basis" is not the right term... log_base might be better than log_basis but it might be best to check other parts of the library / ecosystem.

Thanks


ignore_ties : bool, optional (default=False)
    Assume that there are no ties in y_score (which is likely to be the
    case if y_score is continuous) for performance gains.
Member

"Performance" is ambiguous. Use "efficiency".


"""
gain = _dcg_sample_scores(y_true, y_score, k, ignore_ties=ignore_ties)
normalizing_gain = _dcg_sample_scores(y_true, y_true, k, ignore_ties=True)
Member

Please comment on why it is safe to use ignore_ties here.

np.add.at(ranked, inv, y_true)
ranked /= counts
groups = np.cumsum(counts) - 1
discount_sums = np.zeros(len(counts))
Member

Use:

Suggested change:
-    discount_sums = np.zeros(len(counts))
+    discount_sums = np.empty(len(counts))

-.2, .2, size=y_score.shape)
assert _dcg_sample_scores(y_true, y_score) == pytest.approx(
    3 / np.log2(np.arange(2, 7)))
assert _dcg_sample_scores(y_true, y_score) == pytest.approx(
Member

Can we use pytest.mark.parametrize to test the ignore_ties equivalence?
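
(A hedged sketch of what such a parametrization could look like; the test name, import path, and expected value are assumptions based on this PR, not existing test code:)

import numpy as np
import pytest
from sklearn.metrics import dcg_score

@pytest.mark.parametrize("ignore_ties", [False, True])
def test_dcg_score_ignore_ties_equivalence(ignore_ties):
    # With all-distinct scores, both code paths must agree on the same value.
    y_true = np.asarray([[3, 2, 1, 0, 0]])
    y_score = np.asarray([[5, 4, 3, 2, 1]])
    expected = 3 + 2 / np.log2(3) + 1 / np.log2(4)
    assert dcg_score(y_true, y_score,
                     ignore_ties=ignore_ties) == pytest.approx(expected)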


def test_ndcg_ignore_ties_with_k():
    a = np.arange(12).reshape((2, 6))
    ndcg_score(a, a, k=3, ignore_ties=True)
Member

Shouldn't this be ensuring the result is the same as with ignore_ties=False?
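
(For reference, a minimal sketch of what that check could look like; an assumption about the test's intent, not the final code:)

import numpy as np
import pytest
from sklearn.metrics import ndcg_score

def test_ndcg_ignore_ties_with_k():
    a = np.arange(12).reshape((2, 6))
    # The scores are all distinct, so the fast path must match the general one.
    assert ndcg_score(a, a, k=3, ignore_ties=True) == pytest.approx(
        ndcg_score(a, a, k=3, ignore_ties=False))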

@jeromedockes
Contributor Author

I think "basis" is not the right term... log_base might be better than log_basis but it might be best to check other parts of the library / ecosystem.

Thanks, "base" is indeed the right term (used in the references, in /benchmarks/bench_isotonic.py, and everywhere else -- I don't know why I wrote "basis")

@glemaitre
Member

@jnothman Do you have other changes to request?

@glemaitre glemaitre self-assigned this Aug 12, 2019
Member

@jnothman jnothman left a comment

I'm happy with the API, so I will merge on the basis of the existing approvals.

@jnothman jnothman merged commit be27d90 into scikit-learn:master Aug 13, 2019
@jnothman
Member

Thanks @jeromedockes

@jeromedockes
Contributor Author

thanks!
