# ENH: Add the first Wasserstein and the Cramér-von Mises statistical distances #7563

Merged Aug 22, 2017 · 13 commits · +509 −1

## Conversation

3 participants
Contributor

### CharlesMasson commented Jul 5, 2017

This PR adds functions to compute statistical distances, namely the first Wasserstein distance and the Cramér-von Mises distance, which are useful for comparing distributions and have applications in various domains.
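For readers new to these metrics, here is a minimal numpy-only sketch of the 1-D first Wasserstein distance for the special case of two equal-size, unweighted samples (the PR's functions are more general and accept weights). With equal weights, the optimal transport plan pairs order statistics, so the distance reduces to the mean absolute difference of the sorted samples.

```python
import numpy as np

def wasserstein_1d(u_values, v_values):
    """First Wasserstein distance between two equal-size, unweighted
    1-D samples (a simplified sketch, not the PR's implementation)."""
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))
    if u.size != v.size:
        raise ValueError("this sketch assumes equal sample sizes")
    # With equal weights, the optimal coupling matches order statistics.
    return np.mean(np.abs(u - v))

print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # 5.0
```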


Member

### rgommers commented Jul 16, 2017

 Hi @CharlesMasson, looks good from a quick browse. Some questions/thoughts:

- There are 1-sample and 2-sample versions of Cramér-von Mises; would it make sense to support both in `cramer`?
- Does it make sense to have separate weights for `u` and `v`? I guess it does, since `u` and `v` do not have to be the same length? I'm asking because all distances in `scipy.spatial.distance` have a single weight parameter that applies to both `u` and `v`.
- I just closed gh-3659, which also had a 2-sample Cramér-von Mises implementation. It was untested but perhaps useful to compare against.
- Issue gh-3291 had a longer list of metrics that were requested; this PR works towards that.

### rgommers requested changes Jul 16, 2017

```
    functions located at the specified values.

    References
    ----------
```

#### rgommers Jul 16, 2017

Member

This function implements https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion right? Maybe useful to reference as well.

#### CharlesMasson Jul 17, 2017

Contributor

```python
def wasserstein(u_values, v_values, u_weights=None, v_weights=None):
    r"""
    Compute the first Wasserstein distance between two one-dimensional
    distributions.
```

#### rgommers Jul 16, 2017

Member

Would be good to shorten this a bit and make it fit on one line. Sphinx extracts that line as the summary in the html docs.

```
    References
    ----------
    .. [1] "Wasserstein metric",
           http://en.wikipedia.org/wiki/Wasserstein_metric
    .. [2] Ramdas, Garcia, Cuturi "On Wasserstein Two Sample Testing and
           Related Families of Nonparametric Tests" (2015). :arXiv:`1509.02237`.
```

#### rgommers Jul 16, 2017

Member

Adding an Examples section would be very useful. Can be simple: just generate some arrays with random variates from a distribution and use them to calculate a distance.
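Such a section might look like the sketch below (using the function names as merged; the exact doctest values in the final docstrings may differ):

```python
import numpy as np
from scipy import stats

# A tiny hand-checkable case: both samples have three equally weighted
# points shifted by 5, so the distance is 5.
print(stats.wasserstein_distance([0, 1, 3], [5, 6, 8]))  # 5.0

# Random variates from two normal distributions, as suggested: the
# distance between N(0, 1) and N(1, 1) samples is roughly the mean shift.
rng = np.random.RandomState(0)
u = rng.normal(loc=0.0, size=1000)
v = rng.normal(loc=1.0, size=1000)
d = stats.wasserstein_distance(u, v)
print(d)  # roughly 1.0
```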

#### rgommers Jul 16, 2017

Member

Same for cramer

```
    Parameters
    ----------
    u_values, v_values : array_like
        Values observed in the (empirical) distribution.
```

#### rgommers Jul 16, 2017

Member

Actually, from the code it seems that these two arrays have to have the same length, but that's not stated anywhere or tested.

#### CharlesMasson Jul 17, 2017

Contributor

It's not actually the case: the arrays can have distinct lengths.
https://github.com/CharlesMasson/scipy/blob/cdf-distances/scipy/stats/tests/test_stats.py#L4257
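Distinct lengths work because the distance integrates the gap between the two empirical CDFs over the merged support rather than pairing order statistics. An unweighted numpy sketch of that approach (hypothetical helper name; the PR's `_cdf_distance` also handles weights and other orders):

```python
import numpy as np

def cdf_distance_l1(u_values, v_values):
    """l_1 distance between the empirical CDFs of two 1-D samples,
    which may have different lengths (unweighted sketch)."""
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))

    # Merged, sorted support and the gaps between successive values.
    all_values = np.sort(np.concatenate((u, v)))
    deltas = np.diff(all_values)

    # Empirical CDF of each sample on each interval of the support.
    u_cdf = np.searchsorted(u, all_values[:-1], side='right') / u.size
    v_cdf = np.searchsorted(v, all_values[:-1], side='right') / v.size

    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

# Samples of lengths 4 and 2.
print(cdf_distance_l1([3.4, 3.9, 7.5, 7.8], [4.5, 1.4]))  # approximately 2.7
```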

```python
        Weights as ndarray.
    """
    # Validate the value array.
    values = np.array(values, dtype=float)
```

#### rgommers Jul 16, 2017

Member

This makes a copy, which doesn't seem necessary.

```python
    # Validate the value array.
    values = np.array(values, dtype=float)
    if len(values) == 0:
        raise ValueError('Distribution can\'t be empty.')
```

#### rgommers Jul 16, 2017

Member

nitpick: better use double quotes for the whole message rather than an escape

```python
    # Validate the weight array, if specified.
    if weights is not None:
        weights = np.array(weights, dtype=float)
```

#### rgommers Jul 16, 2017

Member

Same here, this makes a copy. Why not `asarray`?
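The distinction being pointed out: `np.array` copies its input by default even when it is already a float ndarray, while `np.asarray` returns the input unchanged when no conversion is needed.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# np.array copies by default, even for an ndarray of the right dtype ...
copied = np.array(values, dtype=float)
print(copied is values)   # False

# ... while np.asarray skips the copy when the input already matches.
aliased = np.asarray(values, dtype=float)
print(aliased is values)  # True

# Inputs that need conversion (e.g. lists) are converted either way.
print(np.asarray([1, 2, 3], dtype=float).dtype)  # float64
```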

Contributor

### CharlesMasson commented Jul 17, 2017

 Thanks @rgommers for your feedback. I addressed your comments.

The `cramer` function does not implement the statistic per se, but the Cramér distance, as defined for example in https://arxiv.org/abs/1705.10743 (section 4.1). However, I noticed that there are two versions of the Cramér distance in the literature:

- the original, non-distribution-free Cramér-von Mises distance (from 1928): this is the one in the paper above and the one implemented in this PR;
- the distribution-free distance: this is the one used for the Cramér-von Mises criterion (https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion) and in some other papers (e.g., http://www.math.kit.edu/stoch/~henze/seite/aiwz/media/gnst-2015-09-01-rev-1(1).pdf).

This distinction is highlighted at the beginning of section 2 of Székely's paper: http://personal.bgsu.edu/~mrizzo/energy/Szekely-E-statistics.pdf. As shown in that paper, in one dimension the non-distribution-free Cramér-von Mises distance is equivalent to the energy distance (https://en.wikipedia.org/wiki/Energy_distance). For those reasons, and to avoid any ambiguity, I renamed the function to `energy_distance`.
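The 1-D equivalence from Székely's paper can be checked numerically. The sketch below (unweighted, with hypothetical helper names) computes the energy distance two ways, via sqrt(2) times the l_2 CDF distance, as the PR does, and via the pairwise expectation form sqrt(2 E|X-Y| - E|X-X'| - E|Y-Y'|), and the values agree:

```python
import numpy as np

def cdf_distance_l2(u_values, v_values):
    # l_2 distance between empirical CDFs (unweighted sketch).
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))
    all_values = np.sort(np.concatenate((u, v)))
    deltas = np.diff(all_values)
    u_cdf = np.searchsorted(u, all_values[:-1], side='right') / u.size
    v_cdf = np.searchsorted(v, all_values[:-1], side='right') / v.size
    return np.sqrt(np.sum((u_cdf - v_cdf) ** 2 * deltas))

def energy_distance_pairwise(u_values, v_values):
    # E-statistic form: sqrt(2 E|X-Y| - E|X-X'| - E|Y-Y'|).
    u = np.asarray(u_values, dtype=float)
    v = np.asarray(v_values, dtype=float)
    exy = np.mean(np.abs(u[:, None] - v[None, :]))
    exx = np.mean(np.abs(u[:, None] - u[None, :]))
    eyy = np.mean(np.abs(v[:, None] - v[None, :]))
    return np.sqrt(2 * exy - exx - eyy)

u, v = [0.0, 1.0, 3.0], [5.0, 6.0, 8.0]
print(np.sqrt(2) * cdf_distance_l2(u, v))  # same value both ways
print(energy_distance_pairwise(u, v))
```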

### irabinovitch commented Aug 16, 2017

 @rgommers Thanks for the review. I believe @CharlesMasson's new commits should have us in a good spot. Do you have a moment to take another look?

### CharlesMasson added some commits Jul 5, 2017

 ENH: add Wasserstein and Cramer statistical distances 
 15532a3 
 TST: add tests for Wasserstein and Cramer statistical distances 
 91a476d 
 STY: Avoid using escape character 
 4868f86 
 ENH: Avoid copying arrays when validating distributions 
 1c3abb0 
 DOC: Fit the function description in one line 
 5d2615e 
 MAINT: Implement the energy distance instead of the Cramer distance 
 29ed811 
 MAINT: Rename wasserstein to wasserstein_distance for consistency 
 6cffa70 
 DOC: Add examples for the Wasserstein and the energy distances 
 808e46e 
 TST: Change TestCase to object 
Following #7568.
 99d5dcb 

Member

### rgommers commented Aug 19, 2017

 Thanks for the ping @irabinovitch, working on it now. And just pushed a rebase. Note to other reviewers, this was already discussed on the mailing list: https://mail.python.org/pipermail/scipy-dev/2017-July/022005.html. Not many responses, but no objections so OK to add.
Member

### rgommers commented Aug 19, 2017

 Locally I get 2 failures for `RuntimeWarning` not raised in `test_inf_values`. The test looks incorrect: warnings are raised only once from the same place by default, and that `assert_raises` is not the first call with infs.

Member

### rgommers left a comment

 Overall looks good to me, only a few minor comments. Can be merged once those are addressed.
```python
        assert_equal(
            stats.wasserstein_distance([1, -np.inf, np.inf], [1, 1]),
            np.inf)
        assert_raises(RuntimeWarning, stats.wasserstein_distance,
```

#### rgommers Aug 19, 2017

Member

I'd remove this check, it's prone to failing.

#### rgommers Aug 19, 2017

Member

same comments apply to the second test_inf_values below

#### CharlesMasson Aug 21, 2017

Contributor

The pytest config turns warnings into errors (https://github.com/scipy/scipy/blob/master/pytest.ini#L6:L7), and that seems to be why the raised warning (now an error) is caught by `assert_raises`. Without that config line, I do see the failures too.

The test should actually be:

```python
with assert_warns(RuntimeWarning):
    assert_equal(
        stats.wasserstein_distance([1, 2, np.inf], [np.inf, 1]),
        np.nan)
```


or alternatively:

```python
with suppress_warnings() as sup:
    r = sup.record(RuntimeWarning, "invalid value*")
    assert_equal(
        stats.wasserstein_distance([1, 2, np.inf], [np.inf, 1]),
        np.nan)
```


I'll use the latter, as you suggested not checking for raised warnings.

```python
            np.inf)
        assert_equal(
            stats.wasserstein_distance([1, -np.inf, np.inf], [1, 1]),
            np.inf)
```

#### rgommers Aug 19, 2017

Member

All three above checks should be in a block:

```python
with suppress_warnings() as sup:
    r = sup.record(RuntimeWarning, "invalid value*")
```

#### CharlesMasson Aug 21, 2017

Contributor

I don't think that is necessary. Those three blocks should not raise warnings because, contrary to the fourth block, they do not call `np.subtract` with same-sign infinities in both arguments: e.g., `np.subtract([np.inf], [-np.inf])` is valid and does not raise a warning, whereas `np.subtract([np.inf], [np.inf])` does.
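This behavior is easy to confirm with plain numpy:

```python
import numpy as np

# Opposite-sign infinities subtract cleanly to an infinite value.
opposite = np.subtract(np.inf, -np.inf)
print(opposite)  # inf

# Same-sign infinities are an indeterminate form: the result is nan and
# numpy emits a RuntimeWarning ("invalid value encountered"), which is
# what the fourth block in the test triggers.
with np.errstate(invalid='ignore'):  # silence the warning for this demo
    same_sign = np.subtract(np.inf, np.inf)
print(same_sign)  # nan
```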

```
    .. math::

        l_1 (u, v) = \inf_{\pi \in \Gamma (u, v)} \int_{\mathbb{R} \times
        \mathbb{R}} |x-y| \mathrm{d} \pi (x, y)
```

#### rgommers Aug 19, 2017

Member

However, our rule is that we normally try to only use LaTeX in the Notes section, because this isn't readable in a terminal (or Jupyter notebook or IDE help). So I suggest to move both this paragraph and the one on CDFs to the notes.

And same for energy_distance docstring
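Following that rule, the `wasserstein_distance` docstring might be laid out roughly like this sketch (the merged wording may differ), with a plain-text summary line and the LaTeX confined to the Notes section:

```python
def wasserstein_distance(u_values, v_values, u_weights=None, v_weights=None):
    r"""Compute the first Wasserstein distance between two 1D distributions.

    Parameters
    ----------
    u_values, v_values : array_like
        Values observed in the (empirical) distribution.

    Notes
    -----
    The first Wasserstein distance between the distributions :math:`u` and
    :math:`v` is:

    .. math::

        l_1(u, v) = \inf_{\pi \in \Gamma(u, v)} \int_{\mathbb{R} \times
        \mathbb{R}} |x - y| \, \mathrm{d}\pi(x, y)
    """
```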

#### CharlesMasson Aug 21, 2017

Contributor

I moved the paragraphs accordingly.

```python
    ...     [2.1, 4.2, 7.4, 8.], [7.6, 8.8])
    0.88003340976158217

    """
    return 2**.5 * _cdf_distance(2, u_values, v_values, u_weights, v_weights)
```

#### rgommers Aug 19, 2017

Member

nitpick: for clarity I'd prefer `np.sqrt(2)` over `2**.5`

### CharlesMasson added some commits Aug 21, 2017

 TST: Correctly handle raised warnings in tests with inf values 
 778284c 
 DOC: Avoid using LaTeX outside the Notes section 
 55c3bbd 
 MAINT: Use np.sqrt(2) instead of 2**.5 
 d3294f0 
Contributor

### CharlesMasson added a commit Aug 21, 2017

 MAINT: Remove unnecessary np.abs 
 1ac893d 

### rgommers merged commit 5eade1e into scipy:master Aug 22, 2017

#### 4 checks passed

- ci/circleci: Your tests passed on CircleCI!
- codecov/project: 87.32% (target 1%)
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed
Member

### rgommers commented Aug 22, 2017

 LGTM now, merged. Thanks @CharlesMasson !
