
ENH: Add the first Wasserstein and the Cramér-von Mises statistical distances #7563

Merged

merged 13 commits into scipy:master from CharlesMasson:cdf-distances Aug 22, 2017

3 participants
@CharlesMasson
Contributor

CharlesMasson commented Jul 5, 2017

This PR adds functions to compute statistical distances, namely the first Wasserstein distance and the Cramér-von Mises distance, which are useful for comparing distributions and have applications in various domains.
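The two distances this PR exposes (eventually as `scipy.stats.wasserstein_distance` and `scipy.stats.energy_distance`, per the merged diff below) can be exercised on tiny samples; the expectations in the comments assume the merged 1-D implementations:

```python
from scipy.stats import energy_distance, wasserstein_distance

# Shifting a sample by a constant shifts the Wasserstein-1 distance by
# exactly that constant: these two samples differ by a shift of 5.
d_w = wasserstein_distance([0, 1, 3], [5, 6, 8])

# The energy distance between two identical samples is zero.
d_e = energy_distance([0, 1, 3], [0, 1, 3])
```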

@rgommers

Member

rgommers commented Jul 16, 2017

Hi @CharlesMasson, looks good from a quick browse. Some questions/thoughts:

  • There are 1-sample and 2-sample versions of the Cramér-von Mises statistic; would it make sense to support both in cramer?
  • Does it make sense to have separate weights for u and v? I guess it does, since u and v do not have to be the same length? I'm asking because all distances in scipy.spatial.distance have a single weight parameter that applies to both u and v.
  • I just closed gh-3659, which also had a 2-sample Cramér-von Mises implementation. It was untested, but perhaps useful to compare against.
  • Issue gh-3291 had a longer list of metrics that were requested; this PR works towards that.
functions located at the specified values.

References
----------


@rgommers

rgommers Jul 16, 2017

Member

This function implements https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion right? Maybe useful to reference as well.


@CharlesMasson

CharlesMasson Jul 17, 2017

Contributor

I added a link to the Wikipedia page of the energy distance: https://en.wikipedia.org/wiki/Energy_distance

def wasserstein(u_values, v_values, u_weights=None, v_weights=None):
r"""
Compute the first Wasserstein distance between two one-dimensional
distributions.


@rgommers

rgommers Jul 16, 2017

Member

It would be good to shorten this a bit so it fits on one line; Sphinx extracts that line as the summary in the HTML docs.

----------
.. [1] "Wasserstein metric", http://en.wikipedia.org/wiki/Wasserstein_metric
.. [2] Ramdas, Garcia, Cuturi "On Wasserstein Two Sample Testing and Related
Families of Nonparametric Tests" (2015). :arXiv:`1509.02237`.


@rgommers

rgommers Jul 16, 2017

Member

Adding an Examples section would be very useful. It can be simple: just generate some arrays with random variates from a distribution and use them to calculate a distance.
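An Examples section along those lines could look like the following (illustrative only, not the docstring text that was eventually merged; the sample sizes and seed are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Draw random variates from two normal distributions whose means differ
# by 1; the empirical Wasserstein-1 distance should be close to that shift.
rng = np.random.RandomState(12345)
u = rng.normal(loc=0.0, scale=1.0, size=2000)
v = rng.normal(loc=1.0, scale=1.0, size=2000)
d = wasserstein_distance(u, v)  # close to 1.0 for large samples
```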


@rgommers

rgommers Jul 16, 2017

Member

Same for cramer

Parameters
----------
u_values, v_values : array_like
Values observed in the (empirical) distribution.


@rgommers

rgommers Jul 16, 2017

Member

Actually, from the code it seems that these two arrays have to have the same length, but that's not stated anywhere or tested.

Weights as ndarray.
"""
# Validate the value array.
values = np.array(values, dtype=float)


@rgommers

rgommers Jul 16, 2017

Member

This makes a copy; it doesn't seem necessary.

# Validate the value array.
values = np.array(values, dtype=float)
if len(values) == 0:
raise ValueError('Distribution can\'t be empty.')


@rgommers

rgommers Jul 16, 2017

Member

nitpick: better use double quotes for the whole message rather than an escape


# Validate the weight array, if specified.
if weights is not None:
weights = np.array(weights, dtype=float)


@rgommers

rgommers Jul 16, 2017

Member

Same here, this makes a copy. Why not asarray?
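The copy semantics behind this review comment are easy to verify: np.array copies its input by default, while np.asarray returns the input object unchanged when the dtype already matches.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

b = np.array(a, dtype=float)    # copy=True by default: always a new array
c = np.asarray(a, dtype=float)  # no copy when a already has the right dtype

assert b is not a  # np.array made a fresh copy
assert c is a      # np.asarray passed the original through
```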

@CharlesMasson

Contributor

CharlesMasson commented Jul 17, 2017

Thanks @rgommers for your feedback. I addressed your comments.

The cramer function does not implement the statistic per se, but the Cramér distance, as defined for example in https://arxiv.org/abs/1705.10743 (section 4.1).
However, I noticed that there are two versions of the Cramér distance in the literature:

This distinction is highlighted in Székely's paper (at the beginning of section 2): http://personal.bgsu.edu/~mrizzo/energy/Szekely-E-statistics.pdf

As shown in that paper, in one dimension, the non-distribution-free Cramér-von Mises distance is equivalent to the energy distance (https://en.wikipedia.org/wiki/Energy_distance).

For those reasons and to avoid any ambiguities, I renamed the function to energy_distance.
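The renamed function computes the (square-rooted) energy distance. For two point masses at 0 and 2, for example, E^2 = 2*E|X-Y| - E|X-X'| - E|Y-Y'| = 2*2 - 0 - 0 = 4, so E = 2:

```python
from scipy.stats import energy_distance

# Two degenerate (single-point) samples at 0 and 2: the energy distance
# between the corresponding point masses is exactly 2 (up to float error).
d = energy_distance([0], [2])
```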

@CharlesMasson CharlesMasson force-pushed the CharlesMasson:cdf-distances branch from 6ae31e7 to a4ec092 Jul 18, 2017

@irabinovitch


irabinovitch commented Aug 16, 2017

@rgommers Thanks for the review. I believe @CharlesMasson's new commits should have us in a good spot. Do you have a moment to take another look?

@rgommers rgommers force-pushed the CharlesMasson:cdf-distances branch from a4ec092 to 99d5dcb Aug 19, 2017

@rgommers

Member

rgommers commented Aug 19, 2017

Thanks for the ping @irabinovitch, working on it now. And just pushed a rebase.

Note to other reviewers, this was already discussed on the mailing list: https://mail.python.org/pipermail/scipy-dev/2017-July/022005.html. Not many responses, but no objections so OK to add.

@rgommers

Member

rgommers commented Aug 19, 2017

Locally I get 2 failures for RuntimeWarning not raised in test_inf_values. The test looks incorrect: warnings are raised only once from the same place by default, and that assert_raises is not the first call with infs.

@rgommers rgommers added this to the 1.0 milestone Aug 19, 2017

@rgommers
Member

rgommers left a comment

Overall looks good to me, only a few minor comments. Can be merged once those are addressed.

assert_equal(
stats.wasserstein_distance([1, -np.inf, np.inf], [1, 1]),
np.inf)
assert_raises(RuntimeWarning, stats.wasserstein_distance,


@rgommers

rgommers Aug 19, 2017

Member

I'd remove this check, it's prone to failing.


@rgommers

rgommers Aug 19, 2017

Member

same comments apply to the second test_inf_values below


@CharlesMasson

CharlesMasson Aug 21, 2017

Contributor

The pytest config turns warnings into errors (https://github.com/scipy/scipy/blob/master/pytest.ini#L6:L7) and it seems to be the reason why the raised warning (then error) is caught by assert_raises. Without that config line, I do see the failures too.

The test should actually be:

with assert_warns(RuntimeWarning):
    assert_equal(
        stats.wasserstein_distance([1, 2, np.inf], [np.inf, 1]),
        np.nan)

or alternatively:

with suppress_warnings() as sup:
    r = sup.record(RuntimeWarning, "invalid value*")
    assert_equal(
        stats.wasserstein_distance([1, 2, np.inf], [np.inf, 1]),
        np.nan)

I'll use the latter, as you suggested not checking for raised warnings.

np.inf)
assert_equal(
stats.wasserstein_distance([1, -np.inf, np.inf], [1, 1]),
np.inf)


@rgommers

rgommers Aug 19, 2017

Member

All three above checks should be in a block:

with suppress_warnings() as sup:
    r = sup.record(RuntimeWarning, "invalid value*")


@CharlesMasson

CharlesMasson Aug 21, 2017

Contributor

I don't think that is necessary. Those three blocks should not raise warnings because, unlike the fourth block, they do not call np.subtract with same-sign infinities in both arguments: e.g., np.subtract([np.inf], [-np.inf]) is valid and does not raise a warning, whereas np.subtract([np.inf], [np.inf]) does.
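The asymmetry described here can be checked directly: subtracting infinities of opposite signs is well-defined, while same-sign infinities produce nan and emit a RuntimeWarning.

```python
import warnings

import numpy as np

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # inf - (-inf) = inf: well-defined, no warning.
    r1 = np.subtract([np.inf], [-np.inf])

    # inf - inf is undefined: result is nan, with a RuntimeWarning.
    r2 = np.subtract([np.inf], [np.inf])

assert np.isinf(r1[0])
assert np.isnan(r2[0])
assert any(issubclass(w.category, RuntimeWarning) for w in caught)
```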

.. math::

    l_1 (u, v) = \inf_{\pi \in \Gamma (u, v)} \int_{\mathbb{R} \times
    \mathbb{R}} |x-y| \mathrm{d} \pi (x, y)


@rgommers

rgommers Aug 19, 2017

Member

Rendered html docs look good:
https://3906-1460385-gh.circle-artifacts.com/0/home/ubuntu/scipy/doc/build/html-scipyorg/generated/scipy.stats.wasserstein_distance.html#scipy.stats.wasserstein_distance
https://3906-1460385-gh.circle-artifacts.com/0/home/ubuntu/scipy/doc/build/html-scipyorg/generated/scipy.stats.energy_distance.html#scipy.stats.energy_distance

However, our rule is that we normally try to only use LaTeX in the Notes section, because this isn't readable in a terminal (or Jupyter notebook or IDE help). So I suggest to move both this paragraph and the one on CDFs to the notes.

And same for energy_distance docstring


@CharlesMasson

CharlesMasson Aug 21, 2017

Contributor

I moved the paragraphs accordingly.

... [2.1, 4.2, 7.4, 8. ], [7.6, 8.8])
0.88003340976158217
"""
return 2**.5 * _cdf_distance(2, u_values, v_values, u_weights, v_weights)


@rgommers

rgommers Aug 19, 2017

Member

nitpick: for clarity I'd prefer np.sqrt(2) over 2**.5
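For context, the CDF-based distance that the private `_cdf_distance` helper computes can be sketched for unweighted samples as follows. This is a hedged sketch under stated assumptions, not scipy's actual implementation: `cdf_distance_sketch` is a hypothetical name, and the real helper also accepts weight arrays.

```python
import numpy as np

def cdf_distance_sketch(p, u_values, v_values):
    """p-th CDF distance between two unweighted 1-D samples (sketch)."""
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))

    # All breakpoints of the two empirical CDFs, and the interval widths.
    all_values = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_values)

    # Empirical CDF of each sample, evaluated at each interval's left end.
    u_cdf = np.searchsorted(u, all_values[:-1], side='right') / u.size
    v_cdf = np.searchsorted(v, all_values[:-1], side='right') / v.size

    if p == 1:
        # Wasserstein-1: integral of |F_u - F_v|.
        return np.sum(np.abs(u_cdf - v_cdf) * deltas)
    # General p: p-th root of the integral of |F_u - F_v|^p.
    return np.power(np.sum(np.power(np.abs(u_cdf - v_cdf), p) * deltas),
                    1 / p)
```

With p=1 this reproduces the Wasserstein-1 distance, and np.sqrt(2) times the p=2 distance reproduces the energy distance (e.g. 2.0 for point masses at 0 and 2).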

@CharlesMasson

Contributor

CharlesMasson commented Aug 21, 2017

Thanks again @rgommers for your additional feedback. Your comments are addressed and the checks passed.

@rgommers rgommers merged commit 5eade1e into scipy:master Aug 22, 2017

4 checks passed

ci/circleci: Your tests passed on CircleCI!
codecov/project: 87.32% (target 1%)
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@rgommers

Member

rgommers commented Aug 22, 2017

LGTM now, merged. Thanks @CharlesMasson !
