New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for relabeling classifiers. #1833

Merged
merged 11 commits into from Jun 8, 2017

Conversation

Projects
None yet
3 participants
@ssanderson
Member

ssanderson commented Jun 7, 2017

No description provided.

ssanderson added some commits Jun 7, 2017

MAINT: Simplify ArrayPredicate.
Just use `params` instead of custom `_init` and `_static_identity`.
MAINT: Add validator to `validate_column`.
Who validates the validators?

@ssanderson ssanderson requested a review from llllllllll Jun 7, 2017

@ssanderson ssanderson force-pushed the labelarray-map branch from 83bb82e to 92c7f48 Jun 7, 2017

ENH: Add `relabel` method to string classifiers.
- Adds a `map` method to `LabelArray` that maps a unary function over
  the categories of a LabelArray, shrinking the underyling codes if
  possible.

- Adds a new `.relabel` method to string-dtype classifiers that maps a
  unary function over the unique elements of the underlying LabelArray.
  This is useful for things like cleaning noisy label data.

@ssanderson ssanderson force-pushed the labelarray-map branch from 92c7f48 to 5b9d5fe Jun 7, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage decreased (-0.03%) to 87.605% when pulling 5b9d5fe on labelarray-map into 5160b11 on master.

if not isinstance(ret, otypes):
raise TypeError(
"Expected f() to return a string. Got %s." % (

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

should we also include the value in the error message?

This comment has been minimized.

@ssanderson
result = data.map(relabeler)
result[~mask] = data.missing_value
else:
raise TypeError(

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

should this be a NotImplementedError?

c = C()
raw = np.asarray(

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

this doesn't test the f(missing) -> missing

c = C()
with self.assertRaises(TypeError) as e:
c.relabel(lambda x: 0 / 0) # Function should never be called.

This comment has been minimized.

@llllllllll
assert_equal(numpy_transformed, la_transformed)
def test_map_ignores_missing_value(self):

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

can we test the None case as well?

This comment has been minimized.

@ssanderson
# Should work.
la = LabelArray(self.strs, missing_value=None)
la.map(lambda x: None)

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

should we test that this returns [None] * len(self.strs) or was this tested somewhere else?

This comment has been minimized.

@ssanderson

ssanderson Jun 7, 2017

Member

I don't think we have an explicit test for this case. I'll add one.

assert_equal(result.as_string_array(), expected.as_string_array())
@parameter_space(
__fail_fast=True,

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

Why are we using fail fast on all of these? are the tests that slow?

Also, why is it __ instead of just _?

This comment has been minimized.

@ssanderson

ssanderson Jun 7, 2017

Member

I generally use __fail_fast with parameter_space because it makes the error message easier to read and lets me use nosetests --pdb. Without it, --pdb drops me in at the point where subtest is re-raising a multi-exception.

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

should that be the default then?

This comment has been minimized.

@ssanderson

This comment has been minimized.

@ssanderson

ssanderson Jun 7, 2017

Member

I'll probably do that in a different PR though to avoid noise here.

This comment has been minimized.

@llllllllll
# with the same code duplicated multiple times. Compress the categories
# by running them through np.unique, and then use the reverse lookup
# table to compress codes as well.
new_categories, bloated_reverse_index = np.unique(

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

could we use factorize_strings for this to get the None aware unique?

This comment has been minimized.

@ssanderson

ssanderson Jun 7, 2017

Member

The nice thing about unique is that it gives me back the inverse index, which I'm then using to map the original codes onto the new codes. I think I have an idea for a fix for this.

@ssanderson ssanderson force-pushed the labelarray-map branch from c0fe83e to 5b9d5fe Jun 7, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage decreased (-0.03%) to 87.605% when pulling 6bb31b2 on labelarray-map into 5160b11 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage decreased (-0.03%) to 87.611% when pulling 709735d on labelarray-map into 5160b11 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage decreased (-0.02%) to 87.625% when pulling 09cc54e on labelarray-map into 5160b11 on master.

@ssanderson ssanderson force-pushed the labelarray-map branch from 5107af1 to 61feedb Jun 7, 2017

def __eq__(self, other):
return isinstance(other, _SortableSentinel)
def __lt__(self, other):

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

can you just make this return True?

return_inverse=True
)
if new_categories[0] == _sortable_sentinel:

This comment has been minimized.

@llllllllll

llllllllll Jun 7, 2017

Member

since this is a sentinel could we use is?

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage increased (+0.03%) to 87.686% when pulling 4cfbb12 on labelarray-map into 65cc403 on master.

1 similar comment
@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage increased (+0.03%) to 87.686% when pulling 4cfbb12 on labelarray-map into 65cc403 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage increased (+0.03%) to 87.686% when pulling 4cfbb12 on labelarray-map into 65cc403 on master.

1 similar comment
@coveralls

This comment has been minimized.

coveralls commented Jun 7, 2017

Coverage Status

Coverage increased (+0.03%) to 87.686% when pulling 4cfbb12 on labelarray-map into 65cc403 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 8, 2017

Coverage Status

Coverage increased (+0.03%) to 87.687% when pulling dd857aa on labelarray-map into 65cc403 on master.

2 similar comments
@coveralls

This comment has been minimized.

coveralls commented Jun 8, 2017

Coverage Status

Coverage increased (+0.03%) to 87.687% when pulling dd857aa on labelarray-map into 65cc403 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 8, 2017

Coverage Status

Coverage increased (+0.03%) to 87.687% when pulling dd857aa on labelarray-map into 65cc403 on master.

@ssanderson ssanderson merged commit 1e4a094 into master Jun 8, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the labelarray-map branch Jun 8, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment