[MRG+1] FIX Hash collisions in the FeatureHasher #7565

Merged
17 commits merged into scikit-learn:master from rth:hashing_vect-nn on Jun 8, 2017

Conversation

4 participants
rth (Member) commented Oct 3, 2016

This PR fixes #7513 and extends the non_negative parameter in FeatureHasher and HashingVectorizer with a value of 'total' which disables the hash collision handling (and makes the output more intuitive).

Here are the annotations for the compiled _hashing.pyx. Suggestions on how to better explain the non_negative behavior in the doc-strings are welcome.

The documentation for the HashingVectorizer should also be improved IMO, e.g. to explain the preservation of the inner product in the hashed space, but maybe in a separate PR.

Update: this PR also supersedes #7502, which was a continuation of #5861 aiming to fix #3637. It also resolves #2665.
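For illustration, a minimal sketch of the behaviour being changed (assuming the API as finally merged, where alternate_sign=False replaces the non_negative='total' value proposed here; the token list reuses the colliding pair from the test added later in this PR, but which tokens actually collide depends on n_features and the hash):

from sklearn.feature_extraction import FeatureHasher

docs = [["foo", "bar", "baz", "investigation need", "records"]]

# Default behaviour: a sign derived from the hash is applied to each value, so
# two tokens that land in the same bucket can cancel out to an explicit zero.
hasher_signed = FeatureHasher(n_features=16, input_type='string')
print(hasher_signed.transform(docs).toarray())

# With sign alternation disabled, colliding tokens simply sum, so every stored
# value is a plain (positive) count.
hasher_summed = FeatureHasher(n_features=16, input_type='string',
                              alternate_sign=False)
print(hasher_summed.transform(docs).toarray())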

@rth rth changed the title from FIX Hash collisions in the FeatureHasher to [MRG] FIX Hash collisions in the FeatureHasher Oct 3, 2016

jnothman (Member) commented Oct 4, 2016

Thanks for this. I'll try to take a look tomorrow.

sklearn/feature_extraction/_hashing.pyx (outdated)
@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
array.resize_smart(indices, len(indices) + 1)
indices[len(indices) - 1] = abs(h) % n_features
value *= (h >= 0) * 2 - 1
if alternate_sign: # counter the effect of hash collision (issue #7513)

jnothman (Member) commented Oct 4, 2016

I don't think this description is correct.

jnothman (Member) commented Oct 5, 2016

(also, PEP8: keep two spaces before a comment)

@jnothman

I'm coming to consider the idea that the current non_negative is rubbish, and that maybe we should deprecate it and install another parameter in its place...


sklearn/feature_extraction/_hashing.pyx (outdated)
@@ -15,7 +15,7 @@ np.import_array()
@cython.boundscheck(False)
@cython.cdivision(True)
def transform(raw_X, Py_ssize_t n_features, dtype):
def transform(raw_X, Py_ssize_t n_features, dtype, char alternate_sign):

jnothman (Member) commented Oct 5, 2016

type should be bint, not char

sklearn/feature_extraction/hashing.py (outdated)
When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
When True or False, an alternating sign is added to the counts as to
approximately conserve the inner product in the hashed space.

jnothman (Member) commented Oct 5, 2016

usually one would use "preserve" not "conserve"

sklearn/feature_extraction/hashing.py (outdated)
When True, output values can be interpreted as frequencies.
When False, output values will have expected value zero.
non_negative : boolean or 'total', optional, default False
When True or False, an alternating sign is added to the counts as to

jnothman (Member) commented Oct 5, 2016

Not necessarily counts, here.

sklearn/feature_extraction/hashing.py (outdated)
approximately conserve the inner product in the hashed space.
When True, an absolute value is additionally applied to the result
prior to returning it.
When 'total' all counts are positive which disables collision handling.

jnothman (Member) commented Oct 5, 2016

It doesn't really disable collision handling, does it? It just does not faithfully preserve inner product. The output, however, can still include negative values if input features had negative values. So total might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
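For intuition, here is a small pure-Python sketch of the two behaviours discussed (the bucket and sign assignments are made up for the example, not the real MurmurHash3 values):

# Three features; "b" and "c" are assumed to hash into the same bucket.
bucket = {"a": 0, "b": 1, "c": 1}   # stand-in for abs(h) % n_features
sign = {"a": +1, "b": +1, "c": -1}  # stand-in for the hash-derived sign

values = {"a": 1.0, "b": 2.0, "c": 2.0}
n_features = 2

with_alternating_sign = [0.0] * n_features
summed_only = [0.0] * n_features
for name, value in values.items():
    with_alternating_sign[bucket[name]] += sign[name] * value
    summed_only[bucket[name]] += value

print(with_alternating_sign)  # [1.0, 0.0]  -> "b" and "c" cancel out
print(summed_only)            # [1.0, 4.0]  -> each value is the sum of all
                              #                features with that hash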

sklearn/feature_extraction/hashing.py (outdated)
@@ -148,6 +152,6 @@ def transform(self, raw_X, y=None):
X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
shape=(n_samples, self.n_features))
X.sum_duplicates() # also sorts the indices
if self.non_negative:
if self.non_negative is True: # if non_negative == 'total', X>0 anyway

jnothman (Member) commented Oct 5, 2016

Not necessarily true. We need to decide how to handle that case.

One solution is to add a further parameter... But honestly, if there is negative input, non_negative should be False.

sklearn/feature_extraction/tests/test_feature_hasher.py (outdated)
def test_hasher_non_negative():
raw_X = [["foo", "bar", "baz"]]
def it(): # iterable

jnothman (Member) commented Oct 5, 2016

why?

jnothman (Member) commented Nov 2, 2016

why not use raw_X directly?

And convention would use X for input, and Xt for output.

rth (Member) commented Oct 7, 2016

@jnothman Thanks a lot for the detailed review! I wasn't sure about booleans in Cython, and I did miss the second space before the comment in the .pyx. I agree that having 3 values for the current non_negative parameter is not great and that something should be done about it.

For the formulation, I still think that saying that this mechanism allows one to "approximately preserve the inner product in the hashed space" is misleading. As illustrated in my other comment, a more exact formulation would be that it allows one to better preserve the inner product a) for small hashing table sizes and b) when binary counts are not used, so in the end it is still about how collisions are handled with respect to this property.

I'll look into addressing the other comments in your review soon.

rth (Member) commented Nov 2, 2016

Sorry for the late response. @jnothman I addressed the review comments above, and the ones in your post in the original issue.

I'm just not sure when the deprecation warning should be raised: in the current PR it is raised both for non_negative=True (to warn that it's deprecated) and for non_negative='total' (to warn that this option will be renamed to non_negative=True), but that doesn't sound right, as it means users won't be able to do anything to make the warning go away for 2 versions. Should I deprecate just non_negative=True then?

jnothman (Member) commented Nov 2, 2016

Yes, don't deprecate non_negative='total'.

sklearn/feature_extraction/hashing.py (outdated)
" True, False, 'total'.")
if non_negative in ['total', True]:
warnings.warn("the option non_negative=True has been deprecated"
" in 0.19. As of 0.21 non_negative='total' would be"

jnothman (Member) commented Nov 2, 2016

Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."
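A minimal sketch of how the suggested wording could be emitted (hypothetical helper; the actual check and message in the PR may differ):

import warnings

def _check_non_negative(non_negative):
    # Hypothetical validation helper illustrating the suggested message.
    if non_negative not in (True, False, 'total'):
        raise ValueError("non_negative must be one of True, False, 'total'.")
    if non_negative is True:
        warnings.warn("the option non_negative=True has been deprecated in "
                      "0.19. From version 0.21, non_negative=True will be "
                      "interpreted as non_negative='total'.",
                      DeprecationWarning)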


sklearn/feature_extraction/tests/test_feature_hasher.py (outdated)
X = FeatureHasher(non_negative=False,
input_type='string').fit_transform(it())
assert_true((X.data > 0).any() and (X.data < 0).any())

jnothman (Member) commented Nov 2, 2016

surely this test is more useful if we check that there's a zero?

rth (Member) commented Nov 2, 2016

Well, the point here was to test that the output has both negative and positive values, not that it has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.

rth (Member) commented Nov 2, 2016

Thanks! I made the changes requested above.

The only comment I'm not sure how to address is about when the absolute value should be applied. You mean that if the feature counts are negative to begin with, then with this code self.non_negative==True will make the output positive, but self.non_negative=='total' will not? But at the same time, unnecessarily applying the absolute value to feature counts with HashingVectorizer (probably 95% of use cases for this) would have some overhead, wouldn't it? Otherwise the solution could be to rename the non_negative parameter to something else (as with this PR, making the output positive/negative would be just a side effect of the improved inner product preservation, which is what this parameter does in the end), but that would give an additional headache with backward compatibility...

sklearn/feature_extraction/tests/test_feature_hasher.py (outdated)
assert_true((Xt.data >= 0).all()) # zeros are acceptable
Xt = FeatureHasher(non_negative='total',
input_type='string').fit_transform(X)
assert_true((Xt.data > 0).all()) # strictly positive counts

jnothman (Member) commented Nov 2, 2016

But this isn't testing that non_negative='total' is working, is it?

rth (Member) commented Nov 2, 2016

You are right, thanks, changed the test to actually have a hash collision so this part is tested.

jnothman (Member) commented Nov 2, 2016

Could you run a benchmark on the overhead of abs? Otherwise maybe we could indeed introduce a parameter sign_flipping=True (or something) and deprecate non_negative.

rth (Member) commented Nov 2, 2016

I checked the overhead of abs on 20 newsgroups when using HashingVectorizer; it is negligible (1 ms out of a 3.65 s total run time). Another issue is that right now there is no way of doing just plain feature hashing (i.e. just counting the occurrences in each element of the hashing table): if non_negative=False the alternating sign is applied, and if non_negative in [True, 'total'] we also apply the absolute value. So in the case when some features are negative initially, there is no way to just count them without an additional abs. This is because the current non_negative parameter mixes abs with the "alternating sign" functionality.

I agree that adding a new parameter (how about rand_sparse_proj or altern_sign?) and deprecating non_negative would be annoying, but it may be the best thing to do here?
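For reference, a rough sketch of how such a timing comparison could be run (hypothetical script, not necessarily the exact benchmark used for the numbers above):

from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

docs = fetch_20newsgroups(subset='train').data

# Compare the run time with and without the extra abs applied on the output.
for non_negative in (False, True):
    vect = HashingVectorizer(non_negative=non_negative)
    start = time()
    vect.transform(docs)
    print("non_negative=%r: %.2f s" % (non_negative, time() - start))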

jnothman (Member) commented Nov 6, 2016

I think altern_sign=True would be clear enough. Let's go with that...?

jnothman (Member) commented Nov 6, 2016

Make it alternate_sign

sklearn/feature_extraction/hashing.py (outdated)
When True, an absolute value is applied to the features matrix prior to
returning it. When used in conjunction with alternate_sign=True, this
significantly reduces the inner product preservation property.
This option is deprecated as of 0.19.

jnothman (Member) commented Nov 6, 2016

you can use the .. deprecated directive. Be sure to specify when it'll be removed (0.21).

jnothman (Member) commented Nov 6, 2016

Actually, not so important seeing as you have 0.21 in the code
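For reference, a sketch of what that directive looks like inside a numpydoc parameter description (hypothetical wording; the actual docstring text in the PR may differ):

class FeatureHasher(object):
    """Implements feature hashing, aka the hashing trick.

    Parameters
    ----------
    non_negative : boolean, optional, default False
        When True, an absolute value is applied to the features matrix prior
        to returning it.

        .. deprecated:: 0.19
            This option will be removed in 0.21.
    """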

jnothman (Member) commented Nov 6, 2016

This looks good. But in case someone were to reimplement it with an abs, please add a test with mixed-sign data.

rth (Member) commented Nov 6, 2016

@jnothman Thanks for the feedback! I added a test with mixed-sign data and the .. deprecated directives.

jnothman (Member) commented Nov 6, 2016

LGTM. Thanks very much!

@jnothman jnothman changed the title from [MRG] FIX Hash collisions in the FeatureHasher to [MRG+1] FIX Hash collisions in the FeatureHasher Nov 6, 2016

jnothman (Member) commented Nov 6, 2016

Please add an entry in whats_new

rth (Member) commented Jan 23, 2017

Sorry for the delay; I rebased and added an entry in whats_new. Would someone be able to do a second review of this PR, please? cc @raghavrv

This PR is still marked by GitHub as having changes requested by @jnothman, although it was approved above (well, apart from the "what's new" section). Thanks!

rth (Member) commented Jan 24, 2017

Also, when I go to https://github.com/scikit-learn/scikit-learn/projects/5 for 0.19, both parent issues that this PR aims to address are marked as "Issues without PR", a stalled PR that was continued in #7502 (itself superseded by this PR) is marked as "PRs in progress/stalled", and this PR is not in the list of "PRs that need reviews"...

@raghavrv

I'm not very familiar with this part of the codebase... but this does fix the issue referenced.

Also do you want a test with HashingVectorizer and TfidfTransformer in a pipeline?

doc/whats_new.rst (outdated)
applied a sparse random projection to the hashed features, preventing
the use of `sklearn.feature_extraction.text.HashingVectorizer` in a
pipeline with `sklearn.feature_extraction.text.TfidfTransformer`.
:issue:`3637`, :issue:`7513` by `Roman Yurchal <rth>`.

raghavrv (Member) commented Jan 25, 2017

You can reference the PR alone... (#7513)

sklearn/feature_extraction/_hashing.pyx (outdated)
@@ -63,7 +63,9 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
array.resize_smart(indices, len(indices) + 1)
indices[len(indices) - 1] = abs(h) % n_features
value *= (h >= 0) * 2 - 1
if alternate_sign: # improve inner product preservation

raghavrv (Member) commented Jan 25, 2017

Style nitpick: Could you move the comment to above the if ...?

sklearn/feature_extraction/_hashing.pyx (outdated)
@@ -63,7 +63,9 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
array.resize_smart(indices, len(indices) + 1)
indices[len(indices) - 1] = abs(h) % n_features
value *= (h >= 0) * 2 - 1
if alternate_sign: # improve inner product preservation
# in the hashed space (issue #7513)

raghavrv (Member) commented Jan 25, 2017

Also please remove the issue reference

rth (Member) commented Jan 27, 2017

Thanks for the review, @raghavrv. I addressed your comments.

Well, adding a test with HashingVectorizer and TfidfTransformer in a pipeline would be more like an integration test; we know that issue #3637 is caused by the HashingVectorizer producing zeros in the returned document-term matrix X (due to hash collisions), which happens in the FeatureHasher. I did initially test the HashingVectorizer, but @jnothman suggested that I test the FeatureHasher instead. Currently we test that, where previously a hash collision produced a zero, after this PR all elements are strictly positive, which should cover the pipelining issue. But I can add another integration test if you think I should?

Also, Circle CI is currently failing with the following error,

+ ./miniconda.sh -b -p
./miniconda.sh: option requires an argument -- p
Error: did not recognize option, please try -h

not sure what this is due to.
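If such an integration test were wanted, a minimal sketch might look like the following (hypothetical test; the documents and n_features are made up, and the option name is the one finally merged):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = ["the quick brown fox jumped", "over the lazy dog"]
pipe = make_pipeline(
    HashingVectorizer(n_features=16, norm=None, alternate_sign=False),
    TfidfTransformer())
Xt = pipe.fit_transform(docs)

# With sign alternation disabled the hashed counts contain no explicit zeros,
# so the tf-idf weights of all stored entries are strictly positive.
assert (Xt.data > 0).all()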

lesteve (Member) commented Jan 31, 2017

Also Circle CI is currently failing with the following error,

Try rebasing on master. I am not 100% sure, but I would guess that CircleCI uses the circle.yml from your fork, which does not have MINICONDA_PATH set up. Then we have a tweak to run CircleCI on the result of the merge of the PR into master, and build_doc.sh uses MINICONDA_PATH, which is unset.

rth (Member) commented Jan 31, 2017

Thanks @lesteve, rebasing fixed the Circle CI issue.

lesteve (Member) commented Jan 31, 2017

Great, I believe other PRs are affected, so it is good to make sure this fixes the problem.

rth (Member) commented Mar 14, 2017

Thanks for the review, @raghavrv. I addressed your comments.

@raghavrv Please let me know if you have any other comments. Thanks.

raghavrv (Member) commented Mar 25, 2017

Sorry for the delay in response... This needs a merge with master. Could you take care of it? I'll try to review soon...

rth (Member) commented Mar 25, 2017

Thanks @raghavrv! I fixed the conflict...

doc/whats_new.rst (outdated)
@@ -239,9 +239,17 @@ Bug fixes
left `coef_` as a list, rather than an ndarray.
:issue:`8160` by :user:`CJ Carey <perimosocordiae>`.
- Fix a bug where `sklearn.feature_extraction.FeatureHasher` mandatorily

lesteve (Member) commented Mar 30, 2017

This is rst, not md, so you want to use double backticks. Best is to use the sphinx directives :class: or :func:. Look around in this file for examples.

Note this happens a few times below as well.

rth (Member) commented Mar 30, 2017

Thanks for the comment @lesteve, fixed the rst formatting...

jnothman (Member) commented May 27, 2017

@raghavrv and @lesteve, you both made requests here. Are we ready to merge?

lesteve (Member) commented Jun 1, 2017

I am not very familiar with text so I am afraid I am not the best one to review this one.

@raghavrv raghavrv added the Sprint label Jun 6, 2017

raghavrv (Member) commented Jun 7, 2017

@jnothman do you think it's better to have "random" somewhere in the parameter name, just to make it clear that it does not always alternate...

Maybe randomly_alternate_sign (or random_alter_sign if you prefer something short)?

jnothman (Member) commented Jun 7, 2017

It's not random, it's a function of the hash...

@raghavrv

Thanks for the PR. Looks good to me, pending minor comments... I'll make a final pass and merge once you address those...

sklearn/feature_extraction/_hashing.pyx (outdated)
@@ -15,7 +15,7 @@ np.import_array()
@cython.boundscheck(False)
@cython.cdivision(True)
def transform(raw_X, Py_ssize_t n_features, dtype):
def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign):

raghavrv (Member) commented Jun 7, 2017

Can you set the default value to be True here?

I know it's not the public API but it would be nice to preserve the previous functionality...

(Feel free to ignore the suggestion)

sklearn/feature_extraction/tests/test_feature_hasher.py (outdated)
def test_hasher_alternate_sign():
X = [["foo", "bar", "baz", "investigation need", "records"]]
# the last two tokens produce a hash collision that sums as 0

raghavrv (Member) commented Jun 7, 2017

Can you move this comment above the previous line?

(+1 for a well-thought-out test case. Thx)

assert_true(len(Xt.data) < len(X[0]))
assert_true((Xt.data == 0.).any())
Xt = FeatureHasher(alternate_sign=True, non_negative=True,

raghavrv (Member) commented Jun 7, 2017

This will raise a deprecation warning, right? Can you wrap it using ignore_warnings(DeprecationWarning), maybe?
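A sketch of the suggested wrapping (assuming the era's test helpers in sklearn.utils.testing; the test name and body are illustrative):

from sklearn.feature_extraction import FeatureHasher
from sklearn.utils.testing import assert_true, ignore_warnings

@ignore_warnings(category=DeprecationWarning)
def test_hasher_non_negative_deprecated():
    # Exercises the deprecated non_negative=True path without the
    # DeprecationWarning polluting the test output.
    X = [["foo", "bar", "baz", "investigation need", "records"]]
    Xt = FeatureHasher(alternate_sign=True, non_negative=True,
                       input_type='string').fit_transform(X)
    assert_true((Xt.data >= 0).all())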

alternate_sign : boolean, optional, default True
When True, an alternating sign is added to the features as to
approximately conserve the inner product in the hashed space even for
small n_features. This approach is similar to sparse random projection.

raghavrv (Member) commented Jun 7, 2017

you need a new feature (or feature added?) tag?

raghavrv (Member) commented Jun 8, 2017

(has been addressed below)

raghavrv (Member) commented Jun 7, 2017

It's not random, it's a function of the hash...

Ah! got it, thanks :)

rth (Member) commented Jun 8, 2017

Thanks for the review @raghavrv! I addressed all your comments (I think). Appveyor would take a day or so to build, unfortunately...

raghavrv (Member) commented Jun 8, 2017

Thanks @rth. I'll approve it for now... Please ping me when appveyor passes. I'll merge it... (Or is it okay to merge without appveyor, @jnothman?)

jnothman (Member) commented Jun 8, 2017

By induction, appveyor will provably pass :p

Thanks @rth!

@jnothman jnothman merged commit 925a017 into scikit-learn:master Jun 8, 2017

2 of 3 checks passed:
continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed

@rth rth deleted the rth:hashing_vect-nn branch Jun 8, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG] FIX Hash collisions in the FeatureHasher (#7565)
* HashingVectorizer: optionaly disable alternate signs