New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] convert to boolean arrays for boolean distances (ex: jaccard) #5460

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
6 participants
@TomDLT
Member

TomDLT commented Oct 19, 2015

@@ -1195,12 +1196,19 @@ def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
" support sparse matrices.")
X, Y = check_pairwise_arrays(X, Y)
if n_jobs == 1 and X is Y:
if metric in PAIRWISE_BOOLEAN_FUNCTIONS and X.dtype != bool:
warnings.warn("Metric %s: Implicit cast to boolean for X of "

This comment has been minimized.

@amueller

amueller Oct 21, 2015

Member

DataConversionWarning?

This comment has been minimized.

@TomDLT

TomDLT Oct 21, 2015

Member

Sure!

return distance.squareform(distance.pdist(X, metric=metric,
**kwds))
func = partial(distance.cdist, metric=metric, **kwds)
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
PAIRWISE_BOOLEAN_FUNCTIONS = [
'jaccard',

This comment has been minimized.

@amueller

amueller Oct 21, 2015

Member

there are others like yule mentioned in #5333. But I guess we haven't implemented them. So this is the only one that the trees support?
Can you add a non-regression test that checks that nearest neighbors with a tree gives the same result as brute on float input? and check that if fails on master?

This comment has been minimized.

@TomDLT

TomDLT Oct 21, 2015

Member

Sure, that was just a first draft to see if we had the good understanding of the problem.
If you think this is a good way to solve it, we will work to build a stronger PR.

This comment has been minimized.

@giorgiop

giorgiop Oct 21, 2015

Contributor

Yes, the complete list should be

['yule', 'matching', 'dice', 'kulsinski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath']

and 'wminkowski' which requires a weight parameter to be called. They are all scipy's though.

@jakevdp

This comment has been minimized.

Member

jakevdp commented Oct 21, 2015

Looks good. We'd want some regression tests that check the behavior for non-boolean inputs.

@TomDLT TomDLT changed the title from [WIP] jaccard distance to [MRG] jaccard distance Oct 22, 2015

@TomDLT TomDLT changed the title from [MRG] jaccard distance to [MRG] convert to boolean arrays for boolean distances (ex: jaccard) Oct 22, 2015

@TomDLT

This comment has been minimized.

Member

TomDLT commented Oct 22, 2015

Tests added, that fail on master

@jakevdp

This comment has been minimized.

Member

jakevdp commented Oct 22, 2015

This looks really good. Let's wait for tests to pass.

In the meantime, this will need a CHANGELOG entry, especially because it might change results for certain use cases. Actually, we might even think about having some sort of deprecation cycle for this... what do you think?

@amueller

This comment has been minimized.

Member

amueller commented Dec 10, 2015

I consider this a bugfix, so no deprecation cycle. We do need a whatsnew entry, though

@amueller amueller added this to the 0.18 milestone Dec 10, 2015

@TomDLT

This comment has been minimized.

Member

TomDLT commented Dec 11, 2015

whatsnew entry added

@jakevdp

This comment has been minimized.

Member

jakevdp commented Dec 11, 2015

From the diff, it looks like an unrelated section of whatsnew was accidentally deleted in the most recent commit.

@TomDLT

This comment has been minimized.

Member

TomDLT commented Dec 12, 2015

This entry was duplicated

@tguillemot

This comment has been minimized.

Contributor

tguillemot commented May 30, 2016

LGTM.
@amueller @jakevdp @raghavrv Another review ?
@TomDLT Could you rebase ?

@TomDLT TomDLT changed the title from [MRG] convert to boolean arrays for boolean distances (ex: jaccard) to [MRG+1] convert to boolean arrays for boolean distances (ex: jaccard) May 30, 2016

NN = neighbors.NearestNeighbors
nn1 = NN(metric="jaccard", algorithm='brute').fit(X)
nn2 = NN(metric="jaccard", algorithm='ball_tree').fit(X)

This comment has been minimized.

@raghavrv

raghavrv Jun 16, 2016

Member

+1 looks neat

@@ -426,6 +426,7 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,
msg = ("Data with input dtype %s was converted to %s%s."
% (dtype_orig, array.dtype, context))
warnings.warn(msg, _DataConversionWarning)

This comment has been minimized.

@raghavrv

raghavrv Jun 16, 2016

Member

Why this change?

This comment has been minimized.

@TomDLT

TomDLT Jun 16, 2016

Member

Don't know, sorry for the noise

@raghavrv

This comment has been minimized.

Member

raghavrv commented Jun 16, 2016

LGTM apart from the minor nitpick! Thx @TomDLT !!

@TomDLT TomDLT changed the title from [MRG+1] convert to boolean arrays for boolean distances (ex: jaccard) to [MRG+2] convert to boolean arrays for boolean distances (ex: jaccard) Jun 16, 2016

@tguillemot

This comment has been minimized.

Contributor

tguillemot commented Jun 16, 2016

@TomDLT I think the pb of ci is now corrected by #6897.
Can you check it's ok but as you don't touch the doc, I think it's OK.

@raghavrv

This comment has been minimized.

Member

raghavrv commented Jun 16, 2016

Yes a rebase and forcepush should trigger re-build of Circle with the fixed configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment