[MRG + 1] MNT Platform independent hash collision tests in FeatureHasher #9710

rth · 2017-09-08T12:47:29Z

This PR aims to address the current failures of test_hasher_alternate_sign on non-amd64 platforms #9393 (comment), which are likely due to the fact that the current test relies on Murmurhash3 yielding a particular hash value (one that produces a collision), while that value is actually platform dependent #9393 (comment). Since the original issue couldn't be reproduced, there is no guarantee that this will fix it (hopefully it will), but in any case it makes test_hasher_alternate_sign more robust ...

Note: these tests here rely on the fact that when hashing 8 strings with alternate_sign=True, some of them will get a negative sign and some a positive one (it's a 50%/50% probability). However, there is still a (0.5)**2 = .004 probability that on a given platform all the signs will be positive (in which case these tests will fail) but hopefully, that's unlikely enough...
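To illustrate the idea in the note above, here is a minimal, self-contained sketch of signed feature hashing. It is not scikit-learn's implementation: MD5 from the standard library is only a stand-in for MurmurHash3, and the helper names `signed_hash` and `hash_tokens` are hypothetical. The point is that the checks at the end test a property of the output (mixed signs, or all-positive values) rather than one specific hash value.

```python
import hashlib


def signed_hash(token, n_features, alternate_sign=True):
    """Map a token to (column index, sign) from a signed 32-bit hash.

    MD5 is an assumption made for portability of this sketch;
    the real FeatureHasher uses MurmurHash3.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "little", signed=True)
    index = abs(h) % n_features
    sign = -1 if (alternate_sign and h < 0) else 1
    return index, sign


def hash_tokens(tokens, n_features=2 ** 20, alternate_sign=True):
    """Accumulate signed counts per column for one sample (sparse dict)."""
    row = {}
    for token in tokens:
        index, sign = signed_hash(token, n_features, alternate_sign)
        row[index] = row.get(index, 0) + sign
    return row


tokens = list("Thequickbrownfoxjumped")
signed = hash_tokens(tokens, alternate_sign=True)
unsigned = hash_tokens(tokens, alternate_sign=False)

# Property-style checks: no particular hash value is assumed.
has_both_signs = min(signed.values()) < 0 < max(signed.values())
all_positive = min(unsigned.values()) > 0
print(has_both_signs, all_positive)
```

With `alternate_sign=False` every contribution is +1, so `all_positive` holds by construction; `has_both_signs` only holds with high probability, which is the trade-off the note describes.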

cc @jnothman

amueller · 2017-09-08T18:50:46Z

sklearn/feature_extraction/tests/test_feature_hasher.py

@@ -137,6 +133,25 @@ def test_hasher_alternate_sign():


 @ignore_warnings(category=DeprecationWarning)
+def test_hash_collisions():
+    X = [["a", "b", "c", "d", "e", "f", "g", "h"]]


You could be really sure and do X = [list("Thequickbrownfoxjumped")]

amueller · 2017-09-08T18:52:40Z

LGTM. (I think you meant .5 ** 8 = 0.004)

rth · 2017-09-08T19:18:03Z

@amueller Thanks for the review. Increased the vocabulary size as you suggested.

(I think you meant .5 ** 8 = 0.004)

Yes thanks, I keep making typos in every other comment, apparently.
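For reference, the corrected figure can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming each distinct token independently gets a negative sign with probability 0.5 (the 50%/50% assumption from the PR description); the function name `prob_all_same_sign` is made up for the example.

```python
def prob_all_same_sign(n_tokens, p_negative=0.5):
    """Return (P(all positive), P(all negative)) for independent signs."""
    p_all_positive = (1 - p_negative) ** n_tokens
    p_all_negative = p_negative ** n_tokens
    return p_all_positive, p_all_negative


# Original test input: 8 distinct one-letter strings.
p_pos, p_neg = prob_all_same_sign(8)
print(p_pos)  # 0.00390625, i.e. roughly 0.004

# Larger vocabulary suggested in review; repeated letters hash identically,
# so only the distinct characters count.
n_unique = len(set("Thequickbrownfoxjumped"))
p_pos_large, _ = prob_all_same_sign(n_unique)
print(n_unique, p_pos_large)
```

The larger vocabulary pushes the chance of an all-positive outcome down by several orders of magnitude compared with the 8-token input.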

jnothman · 2017-09-09T10:30:20Z

Have you tried finding a Docker image to reproduce this somehow?


jnothman

Very nice, thanks @rth!

…t-learn#9710)

remove outdated comment
fix also for FeatureUnion
[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) (scikit-learn#8742)
[MRG+1] Remove hard dependency on nose (scikit-learn#9670)
MAINT Stop vendoring sphinx-gallery (scikit-learn#9403)
CI upgrade travis to run on new numpy release (scikit-learn#9096)
CI Make it possible to run doctests in .rst files with pytest (scikit-learn#9697)
* doc/datasets/conftest.py to implement the equivalent of nose fixtures
* add conftest.py in root folder to ensure that sklearn local folder is used rather than the package in site-packages
* test doc with pytest in Travis
* move custom_data_home definition from nose fixture to .rst file
[MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)
* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef
* matthews_corrcoef: cosmetic change requested by jnothman
* Add test_matthews_corrcoef_overflow for Bug#9622
* test_matthews_corrcoef_overflow: clean-up and make deterministic
* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype
* test_matthews_corrcoef_overflow: add simple deterministic tests
TST Platform independent hash collision tests in FeatureHasher (scikit-learn#9710)
TST More informative error message in test_preserve_trustworthiness_approximately (scikit-learn#9738)
add some rudimentary tests for meta-estimators
fix extra whitespace in error message
add missing if_delegate_has_method in pipeline
don't test tuple pipeline for now
only copy list if not list already? doesn't seem to help?

…t-learn#9710)

rth changed the title ~~Platform independent hash collision tests in FeatureHasher~~ [MRG] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017

More robust hash collision tests in the FeatureHasher

4f55747

rth force-pushed the robust-hash-collision-tests branch from d1ebfad to 4f55747 Compare September 8, 2017 12:49

amueller reviewed Sep 8, 2017


amueller changed the title ~~[MRG] MNT Platform independent hash collision tests in FeatureHasher~~ [MRG + 1] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017

jnothman mentioned this pull request Sep 8, 2017

Debian test failures (was test_preserve_trustworthiness_approximately fails on 32bit: AssertionError: 0.89166666666666661 not greater than 0.9) #9393

Closed

Use larger vocabulary for the hash collision tests

c637174

jnothman reviewed Sep 12, 2017


jnothman merged commit e88baea into scikit-learn:master Sep 12, 2017

jnothman added this to the 0.19.1 milestone Sep 12, 2017

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 12, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

752e458

…t-learn#9710)

amueller pushed a commit to amueller/scikit-learn that referenced this pull request Sep 12, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

7a82e94

…t-learn#9710)

massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

1a94271

…t-learn#9710)

rth deleted the robust-hash-collision-tests branch October 6, 2017 15:13

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

01dc44a

…t-learn#9710)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

TST Platform independent hash collision tests in FeatureHasher (sciki…

17d6a35

…t-learn#9710)
