ENH reduce memory consumption in nan_euclidean_distances #15615

jnothman · 2019-11-13T10:56:09Z

This goes towards fixing #15604.

Basic benchmark:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics.pairwise import nan_euclidean_distances

calhousing = fetch_california_housing()

X = pd.DataFrame(calhousing.data, columns=calhousing.feature_names)
y = pd.Series(calhousing.target, name='house_value')

rng = np.random.RandomState(42)

density = 4  # one in 4 values will be NaN

mask = rng.randint(density, size=X.shape) == 0
X_na = X.copy()
X_na.values[mask] = np.nan

%time nan_euclidean_distances(X_na)

At master:

CPU times: user 33 s, sys: 9.42 s, total: 42.4 s
Wall time: 24 s

This branch:

CPU times: user 27.3 s, sys: 3.71 s, total: 31 s
Wall time: 14.8 s

Faster times could possibly be achieved with chunking.
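As a rough illustration of the chunking idea, the sketch below computes the distances one block of rows at a time and reduces each block before moving on, so the full n_samples x n_samples matrix is never held in memory. The helpers `nan_euclidean_block` and `nan_euclidean_chunks` are hypothetical names written in plain NumPy for this example; they are not scikit-learn's implementation, and the toy data stands in for the California housing frame.

```python
import numpy as np


def nan_euclidean_block(A, B):
    """NaN-aware Euclidean distances between rows of A and rows of B.

    Illustrative NumPy re-implementation (an assumption for this sketch),
    not scikit-learn's code.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    present_A = (~np.isnan(A)).astype(float)
    present_B = (~np.isnan(B)).astype(float)
    A0 = np.nan_to_num(A, nan=0.0)  # missing entries contribute zero
    B0 = np.nan_to_num(B, nan=0.0)
    # Squared distance over coordinates present in both rows:
    # sum_k pa*pb*(a - b)^2 = a^2.pb - 2*a.b + pa.b^2
    sq = (A0 * A0) @ present_B.T - 2.0 * (A0 @ B0.T) + present_A @ (B0 * B0).T
    np.clip(sq, 0, None, out=sq)  # remove tiny negatives from rounding
    count = present_A @ present_B.T  # shared present coordinates per pair
    with np.errstate(divide="ignore", invalid="ignore"):
        sq = sq * (A.shape[1] / count)  # rescale by total / present
    sq[count == 0] = np.nan  # no shared coordinates: distance undefined
    return np.sqrt(sq)


def nan_euclidean_chunks(X, chunk_rows=256):
    """Yield (start, block) where block holds distances for a slice of rows."""
    X = np.asarray(X, dtype=float)
    for start in range(0, X.shape[0], chunk_rows):
        stop = min(start + chunk_rows, X.shape[0])
        yield start, nan_euclidean_block(X[start:stop], X)


# Toy data with roughly one in four values missing
rng = np.random.RandomState(0)
X = rng.rand(1000, 8)
X[rng.randint(4, size=X.shape) == 0] = np.nan

# Reduce each block to a nearest-neighbour index, then discard the block
nearest = np.empty(X.shape[0], dtype=int)
for start, block in nan_euclidean_chunks(X, chunk_rows=200):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = np.inf  # ignore the self-distance
    block[np.isnan(block)] = np.inf  # treat undefined distances as far away
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)
```

Each yielded block has shape (chunk_rows, n_samples), so peak memory scales with the chunk size rather than with the square of the sample count. In practice scikit-learn's `pairwise_distances_chunked` may be a more natural fit than a hand-written loop.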

jnothman · 2019-11-14T03:54:35Z

The test failure is because of this expected distance matrix:

        X = np.array([[missing_value, missing_value], [0, 1]])
        exp_dist = np.array([[np.nan, np.nan], [np.nan, 0]])

This sets the standard that if a sample is all-nan, its euclidean distance to itself is nan rather than 0. My code currently sets the diagonal to 0 in all cases. What do we consider to be the right behaviour?
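To see where the nan on the diagonal comes from, here is a minimal per-pair sketch of the documented definition (squared distance over the coordinates present in both samples, rescaled by total coordinates over present coordinates). The helper `nan_euclidean_pair` is a hypothetical name for this example, not library code. An all-nan sample shares no present coordinates with anything, including itself, so the distance is left undefined.

```python
import numpy as np


def nan_euclidean_pair(x, y):
    """Distance between two 1-D samples, skipping coordinates missing in either.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))
    n_present = present.sum()
    if n_present == 0:
        return np.nan  # nothing to compare: distance is undefined
    sq = np.sum((x[present] - y[present]) ** 2)
    return np.sqrt(x.size / n_present * sq)


missing_value = np.nan
X = np.array([[missing_value, missing_value], [0, 1]])
dist = np.array([[nan_euclidean_pair(a, b) for b in X] for a in X])
print(dist)  # nan everywhere except dist[1, 1], which is 0.0
```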

thomasjpfan · 2019-11-14T15:52:18Z

Originally I had the diagonal at zero (when X is Y), but I was concerned that the following two calls would then not return equal results:

X = np.array([[missing_value, missing_value], [0, 1]])

nan_euclidean_distances(X, X.copy())
nan_euclidean_distances(X, X)
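The concern can be reproduced with a toy function that forces a zero diagonal only when the second argument is the very same object as the first. `toy_nan_distances` below is a hypothetical stand-in written for this sketch, not the code from this branch or from master.

```python
import numpy as np


def toy_nan_distances(X, Y=None):
    """Hypothetical variant that zeroes the diagonal only when Y is X."""
    same_object = Y is None or Y is X  # identity check, not value equality
    X = np.asarray(X, dtype=float)
    Y = X if same_object else np.asarray(Y, dtype=float)
    n_features = X.shape[1]
    dist = np.full((X.shape[0], Y.shape[0]), np.nan)
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            present = ~(np.isnan(x) | np.isnan(y))
            if present.any():
                sq = np.sum((x[present] - y[present]) ** 2)
                dist[i, j] = np.sqrt(n_features / present.sum() * sq)
    if same_object:
        np.fill_diagonal(dist, 0.0)  # the shortcut under discussion
    return dist


X = np.array([[np.nan, np.nan], [0.0, 1.0]])
same = toy_nan_distances(X, X)
copied = toy_nan_distances(X, X.copy())
print(same[0, 0], copied[0, 0])  # 0.0 for the same object, nan for the copy
```

The two results differ only in the entry for the all-nan sample, which is exactly the inconsistency raised above: equal inputs give different outputs depending on object identity.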

glemaitre · 2019-11-15T10:37:30Z

> What do we consider to be the right behaviour?

At least nan is the documented behaviour.

Since this distance is designed specifically to handle nan, I would not be surprised that it returns nan, whereas I would be surprised to get nan from other metrics because of a division by zero or something similar.

@jnothman Do you have a use case, or any thoughts on why it should be 0 instead of nan?

glemaitre · 2019-11-15T10:37:43Z

Apart from this, LGTM.

jnothman · 2019-11-17T04:58:06Z

I'm happy with @thomasjpfan's reasoning for now. In any case, it is better that an efficiency fix like this does not change behaviour.

glemaitre

So LGTM. @thomasjpfan Do you want to take a look?

thomasjpfan · 2019-11-18T20:13:00Z

Thank you @jnothman !

ENH reduce memory consumption in nan_euclidean_distances

d9b9b95

According to tests, diagonal should not always be 0

14a6504

thomasjpfan self-requested a review November 14, 2019 15:53

glemaitre self-requested a review November 14, 2019 17:50

glemaitre mentioned this pull request Nov 15, 2019

MemoryError in KNNImputer with california housing #15604

Closed

glemaitre approved these changes Nov 17, 2019

thomasjpfan approved these changes Nov 18, 2019

thomasjpfan merged commit f7ed72a into scikit-learn:master Nov 18, 2019

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 25, 2019

ENH reduce memory consumption in nan_euclidean_distances (scikit-lear…

c8b0dc3

…n#15615)

jnothman added a commit that referenced this pull request Nov 28, 2019

ENH reduce memory consumption in nan_euclidean_distances (#15615)

ddfc592

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

ENH reduce memory consumption in nan_euclidean_distances (scikit-lear…

aa1558e

…n#15615)
