[MRG + 1] MNT Platform independent hash collision tests in FeatureHasher #9710

Merged
jnothman merged 2 commits into scikit-learn:master from robust-hash-collision-tests on Sep 12, 2017


rth (Member) commented Sep 8, 2017

This PR aims to address the current failures of test_hasher_alternate_sign on non-amd64 platforms #9393 (comment). These are likely due to the fact that the current test relies on MurmurHash3 yielding a particular hash value (one that produces a collision), while the result is actually platform dependent #9393 (comment). Since the original issue couldn't be reproduced, there is no guarantee that this will fix it (hopefully it will), but in any case it makes test_hasher_alternate_sign more robust ...

Note: these tests here rely on the fact that when hashing 8 strings with alternate_sign=True, some of them will get a negative sign and some a positive one (it's a 50%/50% probability). However, there is still a (0.5)**2 = .004 probability that on a given platform all the signs will be positive (in which case these tests will fail) but hopefully, that's unlikely enough...
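
The check described in this note can be sketched without scikit-learn. The snippet below is only an illustration under stated assumptions: it uses `hashlib.md5` as a stand-in for MurmurHash3 (picked because its output does not vary across platforms), and the helper `signed_bucket_sum` is a hypothetical name, not part of FeatureHasher. With a single output bucket every token collides, so the absolute value of the bucket falls below the token count as soon as at least one token gets a different sign from the rest.

```python
import hashlib


def signed_bucket_sum(tokens, alternate_sign=True):
    """Hash every token into ONE bucket and return the accumulated value.

    Stand-in for a hasher with a single output feature: all tokens collide,
    and each contributes +1 or -1 depending on one bit of its hash.
    """
    total = 0
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the lowest bit of the first byte as the sign bit (an arbitrary
        # choice for this sketch, not what MurmurHash3-based code does).
        negative = alternate_sign and (digest[0] & 1) == 1
        total += -1 if negative else 1
    return total


tokens = list("abcdefgh")

# Without sign alternation, collisions simply add up.
assert signed_bucket_sum(tokens, alternate_sign=False) == len(tokens)

# With sign alternation, the sum can never exceed the token count, and it is
# strictly smaller whenever the tokens do not all share the same sign.
signed = signed_bucket_sum(tokens, alternate_sign=True)
assert abs(signed) <= len(tokens)
print(signed)
```

Asserting a property of the sum (rather than a specific hash value) is what keeps such a test independent of the platform's hash output.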

cc @jnothman

@rth rth changed the title Platform independent hash collision tests in FeatureHasher [MRG] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017
@rth rth force-pushed the robust-hash-collision-tests branch from d1ebfad to 4f55747 Compare September 8, 2017 12:49
@@ -137,6 +133,25 @@ def test_hasher_alternate_sign():


@ignore_warnings(category=DeprecationWarning)
def test_hash_collisions():
    X = [["a", "b", "c", "d", "e", "f", "g", "h"]]

You could be really sure and do X = [list("Thequickbrownfoxjumped")]
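
To put a rough number on "really sure": assuming each distinct token receives an independent, fair sign (an idealised model of the hash, not a guarantee), the chance that all of them come out positive is 0.5 raised to the number of distinct tokens. Repeated letters always hash to the same sign, so the sketch below counts distinct characters. The helper name `prob_all_positive` is hypothetical.

```python
def prob_all_positive(tokens):
    """Chance that every distinct token gets a positive sign, assuming
    independent fair signs (an idealised model of the hash)."""
    return 0.5 ** len(set(tokens))


short = list("abcdefgh")
longer = list("Thequickbrownfoxjumped")

# 8 distinct tokens in the short list, 19 in the longer one
# (the letters "e", "o" and "u" each appear twice).
print(len(set(short)), prob_all_positive(short))
print(len(set(longer)), prob_all_positive(longer))
```

Under this model the longer string lowers the failure probability by a factor of 2**11 compared with the eight-letter list.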

amueller (Member) commented Sep 8, 2017

LGTM. (I think you meant .5 ** 8 = 0.004)

@amueller amueller changed the title [MRG] MNT Platform independent hash collision tests in FeatureHasher [MRG + 1] MNT Platform independent hash collision tests in FeatureHasher Sep 8, 2017
rth (Member, Author) commented Sep 8, 2017

@amueller Thanks for the review. Increased the vocabulary size as you suggested.

> (I think you meant .5 ** 8 = 0.004)

Yes thanks, I keep making typos in every other comment, apparently.

jnothman (Member) commented Sep 9, 2017 via email

jnothman (Member) left a comment

Very nice, thanks @rth!

@jnothman jnothman merged commit e88baea into scikit-learn:master Sep 12, 2017
@jnothman jnothman added this to the 0.19.1 milestone Sep 12, 2017
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 12, 2017
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Sep 12, 2017
massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
amueller added a commit to amueller/scikit-learn that referenced this pull request Sep 19, 2017
remove outdated comment

fix also for FeatureUnion

[MRG+2] Limiting n_components by both n_features and n_samples instead of just n_features (Recreated PR) (scikit-learn#8742)

[MRG+1] Remove hard dependency on nose (scikit-learn#9670)

MAINT Stop vendoring sphinx-gallery (scikit-learn#9403)

CI upgrade travis to run on new numpy release (scikit-learn#9096)

CI Make it possible to run doctests in .rst files with pytest (scikit-learn#9697)

* doc/datasets/conftest.py to implement the equivalent of nose fixtures
* add conftest.py in root folder to ensure that sklearn local folder
  is used rather than the package in site-packages
* test doc with pytest in Travis
* move custom_data_home definition from nose fixture to .rst file

[MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)

* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef

* matthews_corrcoef: cosmetic change requested by jnothman

* Add test_matthews_corrcoef_overflow for Bug#9622

* test_matthews_corrcoef_overflow: clean-up and make deterministic

* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype

* test_matthews_corrcoef_overflow: add simple deterministic tests

TST Platform independent hash collision tests in FeatureHasher (scikit-learn#9710)

TST More informative error message in test_preserve_trustworthiness_approximately (scikit-learn#9738)

add some rudimentary tests for meta-estimators

fix extra whitespace in error message

add missing if_delegate_has_method in pipeline

don't test tuple pipeline for now

only copy list if not list already? doesn't seem to help?
@rth rth deleted the robust-hash-collision-tests branch October 6, 2017 15:13
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017