[MRG+1] FIX Hash collisions in the FeatureHasher #7565
Conversation
Thanks for this. I'll try take a look tomorrow. |
@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype): | |||
|
|||
array.resize_smart(indices, len(indices) + 1) | |||
indices[len(indices) - 1] = abs(h) % n_features | |||
value *= (h >= 0) * 2 - 1 | |||
if alternate_sign: # counter the effect of hash collision (issue #7513) |
jnothman
Oct 4, 2016
Member
I don't think this description is correct.
I don't think this description is correct.
jnothman
Oct 5, 2016
Member
(also, PEP8: keep two spaces before a comment)
(also, PEP8: keep two spaces before a comment)
I'm coming to consider the idea that the current |
@@ -63,7 +63,8 @@ def transform(raw_X, Py_ssize_t n_features, dtype): | |||
|
|||
array.resize_smart(indices, len(indices) + 1) | |||
indices[len(indices) - 1] = abs(h) % n_features | |||
value *= (h >= 0) * 2 - 1 | |||
if alternate_sign: # counter the effect of hash collision (issue #7513) |
jnothman
Oct 5, 2016
Member
(also, PEP8: keep two spaces before a comment)
(also, PEP8: keep two spaces before a comment)
@@ -15,7 +15,7 @@ np.import_array() | |||
|
|||
@cython.boundscheck(False) | |||
@cython.cdivision(True) | |||
def transform(raw_X, Py_ssize_t n_features, dtype): | |||
def transform(raw_X, Py_ssize_t n_features, dtype, char alternate_sign): |
jnothman
Oct 5, 2016
Member
type should be bint
, not char
type should be bint
, not char
When False, output values will have expected value zero. | ||
non_negative : boolean or 'total', optional, default False | ||
When True or False, an alternating sign is added to the counts as to | ||
approximately conserve the inner product in the hashed space. |
jnothman
Oct 5, 2016
Member
usually one would use "preserve" not "conserve"
usually one would use "preserve" not "conserve"
When True, output values can be interpreted as frequencies. | ||
When False, output values will have expected value zero. | ||
non_negative : boolean or 'total', optional, default False | ||
When True or False, an alternating sign is added to the counts as to |
jnothman
Oct 5, 2016
Member
Not necessarily counts, here.
Not necessarily counts, here.
approximately conserve the inner product in the hashed space. | ||
When True, an absolute value is additionally applied to the result | ||
prior to returning it. | ||
When 'total' all counts are positive which disables collision handling. |
jnothman
Oct 5, 2016
Member
It doesn't really disable collision handling, does it? It just does not faithfully preserve inner product. The output, however, can still include negative values if input features had negative values. So total
might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
It doesn't really disable collision handling, does it? It just does not faithfully preserve inner product. The output, however, can still include negative values if input features had negative values. So total
might not be correct nomenclature either. I think you might say "this is suitable for non-negative features; each feature value is then the sum of all features with that hash"
@@ -148,6 +152,6 @@ def transform(self, raw_X, y=None): | |||
X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype, | |||
shape=(n_samples, self.n_features)) | |||
X.sum_duplicates() # also sorts the indices | |||
if self.non_negative: | |||
if self.non_negative is True: # if non_negative == 'total', X>0 anyway |
jnothman
Oct 5, 2016
Member
Not necessarily true. We need to decide how to handle that case.
One solution is to add a further parameter... But honestly, if there is negative input, non_negative
should be False
.
Not necessarily true. We need to decide how to handle that case.
One solution is to add a further parameter... But honestly, if there is negative input, non_negative
should be False
.
def test_hasher_non_negative(): | ||
raw_X = [["foo", "bar", "baz"]] | ||
|
||
def it(): # iterable |
jnothman
Oct 5, 2016
Member
why?
why?
jnothman
Nov 2, 2016
Member
why not use raw_X
directly?
And convention would use X
for input, and Xt
for output.
why not use raw_X
directly?
And convention would use X
for input, and Xt
for output.
@jnothman Thanks a lot for the detailed review! I wasn't sure about boolean in Cython and did miss the second space before the comment in the For the formulation, I still think that saying that this mechanism would allow to "approximately preserve inner product in the hashed space" is misleading. As is illustrated in my other comment, a more exact formulation would be that it allows to better preserve the inner product, for a) small hashing table sizes b) when binary counts are not used, so at the end it is still about how collisions are handled with respect to this property. I'll look into addressing the other comments in your review soon. |
Sorry for the late response. @jnothman I addressed review comments above, and the ones in your post in the original issue. I'm just not sure about when the deprecation warning should be raised: in the current PR it is raised both for |
Yes, don't deprecate |
" True, False, 'total'.") | ||
if non_negative in ['total', True]: | ||
warnings.warn("the option non_negative=True has been deprecated" | ||
" in 0.19. As of 0.21 non_negative='total' would be" |
jnothman
Nov 2, 2016
Member
Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."
Perhaps "From version 0.21, non_negative=True will be interpreted as non_negative='total'."
def test_hasher_non_negative(): | ||
raw_X = [["foo", "bar", "baz"]] | ||
|
||
def it(): # iterable |
jnothman
Nov 2, 2016
Member
why not use raw_X
directly?
And convention would use X
for input, and Xt
for output.
why not use raw_X
directly?
And convention would use X
for input, and Xt
for output.
|
||
X = FeatureHasher(non_negative=False, | ||
input_type='string').fit_transform(it()) | ||
assert_true((X.data > 0).any() and (X.data < 0).any()) |
jnothman
Nov 2, 2016
Member
surely this test is more useful if we check that there's a zero?
surely this test is more useful if we check that there's a zero?
rth
Nov 2, 2016
•
Author
Member
Well the point here was to test that the output has both negative and positive values, not that is has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.
Well the point here was to test that the output has both negative and positive values, not that is has zeros (which it doesn't have in this case). Changed the test to be more explicit about it.
Thanks! I made the changes requested above. The only comment I'm not sure how to address is about when the absolute value should be applied. You mean that if the features counts are negative to begin with, then with this code |
assert_true((Xt.data >= 0).all()) # zeros are acceptable | ||
Xt = FeatureHasher(non_negative='total', | ||
input_type='string').fit_transform(X) | ||
assert_true((Xt.data > 0).all()) # strictly positive counts |
jnothman
Nov 2, 2016
Member
But this isn't testing that non_negative='total'
is working, is it?
But this isn't testing that non_negative='total'
is working, is it?
rth
Nov 2, 2016
Author
Member
You are right, thanks, changed the test to actually have a hash collision so this part is tested.
You are right, thanks, changed the test to actually have a hash collision so this part is tested.
Could you run a benchmark on the overhead of |
I checked the overhead of abs on the 20 newsgroups when using HashingVectorizer, is is negligible (1 ms out of a 3.65s total run time). Another issue is that right now there is no way of doing just the plain feature hashing (i.e. just count the occurrences in each element of the hashing table): if I agree that adding a new parameter (how about |
I think |
Make it |
When True, an absolute value is applied to the features matrix prior to | ||
returning it. When used in conjunction with alternate_sign=True, this | ||
significantly reduces the inner product preservation property. | ||
This option is deprecated as of 0.19. |
jnothman
Nov 6, 2016
Member
you can use the .. deprecated
directive. Be sure to specify when it'll be removed (0.21).
you can use the .. deprecated
directive. Be sure to specify when it'll be removed (0.21).
jnothman
Nov 6, 2016
Member
Actually, not so important seeing as you have 0.21 in the code
Actually, not so important seeing as you have 0.21 in the code
This looks good. But in case someone were to reimplement it with an abs, please add a test with mixed-sign data. |
@jnothman Thanks for the feedback! I added a test with mixed-sign data and the |
LGTM. Thanks very much! |
Please add an entry in whats_new |
Thanks for the comment @lesteve , fixed the rst formatting... |
I am not very familiar with text so I am afraid I am not the best one to review this one. |
@jnothman do you think it's better to have Maybe |
It's not random, it's a function of the hash... |
Thanks for the PR. Looks good to me, pending minor comments... I'll make a final pass and merge once you address those... |
@@ -15,7 +15,7 @@ np.import_array() | |||
|
|||
@cython.boundscheck(False) | |||
@cython.cdivision(True) | |||
def transform(raw_X, Py_ssize_t n_features, dtype): | |||
def transform(raw_X, Py_ssize_t n_features, dtype, bint alternate_sign): |
raghavrv
Jun 7, 2017
Member
Can you set the default value to be True
here?
I know it's not the public API but it would be nice to preserve the previous functionality...
(Feel free to ignore the suggestion)
Can you set the default value to be True
here?
I know it's not the public API but it would be nice to preserve the previous functionality...
(Feel free to ignore the suggestion)
|
||
def test_hasher_alternate_sign(): | ||
X = [["foo", "bar", "baz", "investigation need", "records"]] | ||
# the last two tokens produce a hash collision that sums as 0 |
raghavrv
Jun 7, 2017
Member
Can you move this comment above the previous line?
(+1 for a well thought of test case. Thx)
Can you move this comment above the previous line?
(+1 for a well thought of test case. Thx)
assert_true(len(Xt.data) < len(X[0])) | ||
assert_true((Xt.data == 0.).any()) | ||
|
||
Xt = FeatureHasher(alternate_sign=True, non_negative=True, |
raghavrv
Jun 7, 2017
Member
This will raise a deprecation warning right? can you wrap it using ignore_warnings(DeprecatedWarning)
maybe?
This will raise a deprecation warning right? can you wrap it using ignore_warnings(DeprecatedWarning)
maybe?
alternate_sign : boolean, optional, default True | ||
When True, an alternating sign is added to the features as to | ||
approximately conserve the inner product in the hashed space even for | ||
small n_features. This approach is similar to sparse random projection. |
raghavrv
Jun 7, 2017
Member
you need a new feature (or feature added?) tag?
you need a new feature (or feature added?) tag?
raghavrv
Jun 8, 2017
Member
(has been addressed below)
(has been addressed below)
Ah! got it thanks :) |
Thanks for the review @raghavrv ! I addressed all your comments (I think). Appveyor would take a day or so to build, unfortunately.. |
By induction, appveyor will provably pass :p Thanks @rth! |
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
* HashingVectorizer: optionaly disable alternate signs
This PR fixes #7513 and extends the
non_negative
parameter inFeatureHasher
andHashingVectorizer
with a value of'total'
which disables the hash collision handling (and makes the output more intuitive).Here are the annotations for the compiled
_hashing.pyx
. Suggestions on how to better explain thenon_negative
behavior in the doc-strings are welcome.The documentation for the HashingVectorizer should also be improved IMO, e.g. to explain the preservation of the inner product in the hashed space, but maybe in a separate PR.
Update: this PR also supersedes #7502 that was a continuation of #5861 aiming to fix #3637 . It also resolves #2665 .