New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix: FeatureHasher now accepts string values #6173

Merged
merged 1 commit into from Mar 19, 2016

Conversation

Projects
None yet
5 participants
@dsquareindia
Contributor

dsquareindia commented Jan 15, 2016

Addresses issue #4883. Builds on the work done in #4913.

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Jan 15, 2016

@jnothman @amueller can you please review this once?

@dsquareindia dsquareindia changed the title from [MRG] Fix: FeatureHasher accepts string values to [WIP] Fix: FeatureHasher accepts string values Jan 15, 2016

@dsquareindia dsquareindia changed the title from [WIP] Fix: FeatureHasher accepts string values to [MRG] Fix: FeatureHasher accepts string values Jan 15, 2016

@dsquareindia dsquareindia changed the title from [MRG] Fix: FeatureHasher accepts string values to [MRG] Fix: FeatureHasher now accepts string values Jan 15, 2016

@rrohan

This comment has been minimized.

Contributor

rrohan commented Jan 17, 2016

Is there a way to make this work with just one call to isinstance ?

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Jan 17, 2016

What I can do is encode for both string_types and unicode but I'm not sure if it would give any added advantage over the current method.

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Jan 27, 2016

@jnothman sorry for disturbing but I'm not quite sure about the tests for this one. Are they fine?
cc: @amueller

elif isinstance(v, string_types):
f = "%s%s%s" % (f, '=', v)
value = 1
elif isinstance(v, unicode):

This comment has been minimized.

@jnothman

jnothman Jan 27, 2016

Member

I don't understand when this case happens. Surely the isinstance(v, string_types) branch will already have been executed.

@jnothman

This comment has been minimized.

Member

jnothman commented Jan 27, 2016

I guess one further thing that is not tested here is that "foo":"bar" and "foo":"baz" (likely) hash to different columns.

Otherwise, the tests seem fine.

value = v
if isinstance(v, integer_types):
value = v
elif isinstance(v, string_types):

This comment has been minimized.

@jnothman

jnothman Jan 27, 2016

Member

I think this can be the if case without any need for elif branches.

@@ -43,7 +45,14 @@ def transform(raw_X, Py_ssize_t n_features, dtype):
for x in raw_X:
for f, v in x:
value = v
if isinstance(v, integer_types):

This comment has been minimized.

@jnothman

jnothman Jan 27, 2016

Member

and this would be the else

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Jan 28, 2016

@jnothman I've made the changes. Could you please check? Thanks!

@@ -19,6 +19,23 @@ def test_feature_hasher_dicts():
assert_array_equal(X1.toarray(), X2.toarray())
def test_feature_hasher_dicts_with_string_values():

This comment has been minimized.

@jnothman

jnothman Jan 28, 2016

Member

I don't find this test necessary. There is already a test for the equivalence of dict and pair representation. You could more easily add a string-valued pair to the existing test and still show equivalence of the representations.

This comment has been minimized.

@dsquareindia

dsquareindia Jan 30, 2016

Contributor

Fine I'll modify the existing test to add this.

assert_equal([1, 1], x1_nz)
assert_equal([1, 1, 4], x2_nz)
raw_X = (iter(d.items()) for d in [{"foo": "bar", "bar": "a"},

This comment has been minimized.

@jnothman

jnothman Jan 28, 2016

Member

It's not clear what this is testing that was not in the previous part of the test.

@@ -53,6 +70,27 @@ def test_feature_hasher_pairs():
assert_equal([1, 3, 4], x2_nz)
def test_feature_hasher_pairs_with_string_values():

This comment has been minimized.

@dsquareindia

dsquareindia Jan 30, 2016

Contributor

Do we need this test too then? I'll just add this to the existing test?

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Feb 9, 2016

@jnothman sorry for the delay. I have addressed your comments. Can you please check once? Thanks!

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Feb 22, 2016

@jnothman gentle ping :)

@sritchie

This comment has been minimized.

sritchie commented Feb 26, 2016

Would love to see this too!

@jnothman

This comment has been minimized.

Member

jnothman commented Feb 27, 2016

We could more explicitly test equivalence of handling to DictVectorizer using the latter's inverse_transform, but I think what you've got here suffices. Thanks.

LGTM. Second review?

@jnothman

This comment has been minimized.

Member

jnothman commented Feb 27, 2016

Please add an entry in doc/whatsnew.rst crediting both authors.

@jnothman jnothman changed the title from [MRG] Fix: FeatureHasher now accepts string values to [MRG+1] Fix: FeatureHasher now accepts string values Feb 27, 2016

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Feb 27, 2016

@jnothman I have added the whats_new entry. Hope it's alright.

@jnothman

This comment has been minimized.

Member

jnothman commented Feb 27, 2016

It'll do!

@@ -8,6 +8,8 @@ from libc.stdlib cimport abs
cimport numpy as np
import numpy as np
from ..externals.six import string_types, integer_types

This comment has been minimized.

@dsquareindia

dsquareindia Feb 29, 2016

Contributor

integer_types import is not needed anymore.

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Mar 8, 2016

Can I please get another review on this one?

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 8, 2016

@MechCoder quick review?

x1, x2 = h.transform(raw_X).toarray()
x1_nz = sorted(np.abs(x1[x1 != 0]))
x2_nz = sorted(np.abs(x2[x2 != 0]))
assert_equal([1, 1], x1_nz)

This comment has been minimized.

@MechCoder

MechCoder Mar 9, 2016

Member

Sorry for my ignorance, but how do we know these values?

This comment has been minimized.

@dsquareindia

dsquareindia Mar 11, 2016

Contributor

when we'll transform the first row (stored in x1), the "a" will be hashed as 1. Here x1 will be: [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] and x2 will be: [ 0. 0. 0. 1. 4. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -1.].

@MechCoder

This comment has been minimized.

Member

MechCoder commented Mar 9, 2016

We might also want to add a minor test to show that the non-zero feature column got by transforming a = [{'bax': 'abc'}, {'bax': 'abc'}] for both the rows are equal

@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Mar 11, 2016

@MechCoder I've added the test. I hope this is fine now.

@MechCoder

This comment has been minimized.

Member

MechCoder commented Mar 19, 2016

Thanks !

MechCoder added a commit that referenced this pull request Mar 19, 2016

Merge pull request #6173 from dsquareindia/featurehasher_fix
[MRG+1] Fix: FeatureHasher now accepts string values

@MechCoder MechCoder merged commit c2eaf75 into scikit-learn:master Mar 19, 2016

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dsquareindia

This comment has been minimized.

Contributor

dsquareindia commented Mar 19, 2016

Thanks for reviews @MechCoder @jnothman!

@dsquareindia dsquareindia deleted the dsquareindia:featurehasher_fix branch Mar 19, 2016

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 20, 2016

Thanks for the PR!

On 19 March 2016 at 15:08, Devashish Deshpande notifications@github.com
wrote:

Thanks for reviews @MechCoder https://github.com/MechCoder @jnothman
https://github.com/jnothman!


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#6173 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment