[MRG+1] discrete branch: add encoding option to KBinsDiscretizer #9647

Merged
merged 20 commits into scikit-learn:discrete from qinhanmin2014:my-feature-1 on Sep 7, 2017

@qinhanmin2014
Member

qinhanmin2014 commented Aug 30, 2017

Reference Issue

Fixes #9336

What does this implement/fix? Explain your changes.

add encoding option to KBinsDiscretizer
(1) The encode option supports {'onehot', 'onehot-dense', 'ordinal'}; the default is 'ordinal', mainly because of (3).
(2) Only one-hot encode the discretized features when ignored_features is set
(following OneHotEncoder's behavior, non-categorical features are always stacked to the right of the matrix).
(3) It seems hard to support inverse_transform for one-hot, because OneHotEncoder doesn't support inverse_transform.
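
For context, a minimal sketch of how the three options behave (a sketch against the discrete branch; the exact output values depend on the fitted bins):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([[-2., 1.], [-1., 2.], [0., 3.], [1., 4.]])

    # 'ordinal': each feature is replaced by its integer bin identifier
    Xt_ord = KBinsDiscretizer(n_bins=2, encode='ordinal').fit_transform(X)

    # 'onehot': the bin identifiers are one-hot encoded into a sparse matrix;
    # 'onehot-dense' does the same but returns a dense array
    Xt_sparse = KBinsDiscretizer(n_bins=2, encode='onehot').fit_transform(X)
    Xt_dense = KBinsDiscretizer(n_bins=2,
                                encode='onehot-dense').fit_transform(X)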

Any other comments?

qinhanmin2014 added some commits Aug 30, 2017

@jnothman

Member

jnothman commented Aug 31, 2017

Thanks for this. Tests are failing, though

@jnothman

Member

jnothman commented Aug 31, 2017

Oh you mentioned that. A strange failure. Let me check what that branch is doing...

@jnothman

Member

jnothman commented Aug 31, 2017

So the test failures relate to a recent Cython release. Merging master into discrete should fix it. I'll do this shortly.

@qinhanmin2014

Member

qinhanmin2014 commented Aug 31, 2017

@jnothman Thanks. Kindly inform me when you finish. :)

@jnothman

Member

jnothman commented Aug 31, 2017

Try pushing an empty commit

@qinhanmin2014 qinhanmin2014 changed the title from [WIP] discrete branch: add encoding option to KBinsDiscretizer to [MRG] discrete branch: add encoding option to KBinsDiscretizer Aug 31, 2017

@qinhanmin2014

Member

qinhanmin2014 commented Aug 31, 2017

@jnothman The test failure is unrelated. I believe it's worth a review. Thanks :)

@qinhanmin2014

Member

qinhanmin2014 commented Sep 1, 2017

@jnothman Finally CIs are green.

and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding
and return a dense matrix.

@ogrisel

ogrisel Sep 1, 2017

Member

dense array

encode the transformed result with one-hot encoding
and return a dense matrix.
ordinal:
do not encode the transformed result.

@ogrisel

ogrisel Sep 1, 2017

Member

Return the bin identifier encoded as an integer value.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 1, 2017

@ogrisel Thanks. Comments addressed.

@lesteve

A superficial review, without looking at the heart of the code.

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')
valid_encode = ['onehot', 'onehot-dense', 'ordinal']
if self.encode not in valid_encode:
    raise ValueError('Invalid encode value. '

@lesteve

lesteve Sep 1, 2017

Member

Add the value provided by the user in the error message, i.e. something like this:

"Valid options for 'encode' are {}. Got 'encode={}' instead".format(sorted(valid_encode), encode)
retain_order=True)
# only one-hot encode discretized features
mask = np.array([True] * X.shape[1])

@lesteve

lesteve Sep 1, 2017

Member

Probably more readable to use np.repeat(True, X.shape[1])

# don't support inverse_transform
if self.encode != 'ordinal':
    raise ValueError("inverse_transform only support "
                     "encode='ordinal'.")

@lesteve

lesteve Sep 1, 2017

Member

Add the value of self.encode in the error message, e.g. "Got {} instead".

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
encode : string {'onehot', 'onehot-dense', 'ordinal'} (default='ordinal')

@lesteve

lesteve Sep 1, 2017

Member

I don't think you need the type information when you list all possible values. Double-check with the numpy doc style.

@qinhanmin2014

qinhanmin2014 Sep 1, 2017

Member

@lesteve It seems that this is not the case in many functions (e.g. PCA, LinearSVC), and I have no idea how to check the doc style. Could you please help me? Thanks.

@lesteve

lesteve Sep 1, 2017

Member

From https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#sections

When a parameter can only assume one of a fixed set of values, those values can be listed in braces, with the default appearing first:

order : {'C', 'F', 'A'}
    Description of `order`.
assert_raises(ValueError, est.inverse_transform, X)
est = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
                       encode='onehot').fit(X)
expected3 = est.transform(X)

@lesteve

lesteve Sep 1, 2017

Member

Should you not check that the output is sparse in the onehot case?

Also probably check that the output is not sparse in the onehot-dense case.
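
A minimal sketch of the suggested checks, assuming a small 4-column fixture (the values here are illustrative, not necessarily the ones from this test file):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([[-2, 1.5, -4, -1],
                  [-1, 2.5, -3, -0.5],
                  [0, 3.5, -2, 0.5],
                  [1, 4.5, -1, 2]])

    # encode='onehot' should return a scipy sparse matrix ...
    Xt = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
                          encode='onehot').fit_transform(X)
    assert sp.issparse(Xt)

    # ... while encode='onehot-dense' should return a plain ndarray
    Xt = KBinsDiscretizer(n_bins=[2, 3, 3, 3],
                          encode='onehot-dense').fit_transform(X)
    assert not sp.issparse(Xt)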

@qinhanmin2014

Member

qinhanmin2014 commented Sep 1, 2017

@lesteve Comments addressed except for the first one. Thanks.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 2, 2017

@lesteve Thanks. Comment addressed.

@glemaitre

Contributor

glemaitre commented Sep 3, 2017

I find the naming of both 'onehot' and 'onehot-dense' a bit fuzzy, since it is not explicit from the name that 'onehot' is sparse by default. Would it make sense to rename 'onehot' -> 'onehot-sparse'?

@glemaitre

Contributor

glemaitre commented Sep 3, 2017

Oh I see this is the same naming in the CategoricalEncoder PR. So that might be fine.

@glemaitre

A couple of suggestions. @jnothman, I suppose this code will probably change to use CategoricalEncoder?

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')

@glemaitre

glemaitre Sep 3, 2017

Contributor

comma after }

@@ -32,6 +33,18 @@ class KBinsDiscretizer(BaseEstimator, TransformerMixin):
Column indices of ignored features. (Example: Categorical features.)
If ``None``, all features will be discretized.
encode : {'ordinal', 'onehot', 'onehot-dense'} (default='ordinal')
method used to encode the transformed result.

@glemaitre

glemaitre Sep 3, 2017

Contributor

method -> Method

method used to encode the transformed result.
onehot:
encode the transformed result with one-hot encoding

@glemaitre

glemaitre Sep 3, 2017

Contributor

encode -> Encode

encode the transformed result with one-hot encoding
and return a sparse matrix.
onehot-dense:
encode the transformed result with one-hot encoding

@glemaitre

glemaitre Sep 3, 2017

Contributor

encode -> Encode

encode the transformed result with one-hot encoding
and return a dense array.
ordinal:
return the bin identifier encoded as an integer value.

@glemaitre

glemaitre Sep 3, 2017

Contributor

return -> Return

@@ -114,6 +128,12 @@ def fit(self, X, y=None):
"""
X = check_array(X, dtype='numeric')
valid_encode = ['onehot', 'onehot-dense', 'ordinal']

@glemaitre

glemaitre Sep 3, 2017

Contributor

you might want to use a tuple instead of a list.

if self.encode not in valid_encode:
    raise ValueError("Valid options for 'encode' are {}. "
                     "Got 'encode = {}' instead."
                     .format(sorted(valid_encode), self.encode))

@glemaitre

glemaitre Sep 3, 2017

Contributor

Sorting here seems unnecessary.

retain_order=True)
# only one-hot encode discretized features
mask = np.repeat(True, X.shape[1])

@glemaitre

glemaitre Sep 3, 2017

Contributor

It would be faster to use:

mask = np.ones(X.shape[1], dtype=bool)
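
For the record, a tiny sketch of the two equivalent constructions; np.ones allocates the boolean array directly instead of tiling a scalar:

    import numpy as np

    n_features = 5  # illustrative
    mask_repeat = np.repeat(True, n_features)    # earlier suggestion
    mask_ones = np.ones(n_features, dtype=bool)  # this suggestion
    assert np.array_equal(mask_repeat, mask_ones)
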
@@ -174,3 +176,40 @@ def test_numeric_stability():
        X = X_init / 10**i
        Xt = KBinsDiscretizer(n_bins=2).fit_transform(X)
        assert_array_equal(Xt_expected, Xt)

def test_encode():

@glemaitre

glemaitre Sep 3, 2017

Contributor

I would probably split this test into several tests.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 3, 2017

@glemaitre Thanks. Comments addressed.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 4, 2017

@jnothman I changed the link to point to "Encoding categorical features". Now it looks like this: [screenshot]
Further suggestions welcome :)

@hlin117

Overall looks great! Nice work.

@@ -3,8 +3,10 @@
import numpy as np
from six.moves import range
import warnings
import scipy.sparse as sp

@hlin117

hlin117 Sep 5, 2017

Contributor

Nit: Move this above six.moves, so it is alphabetical.

@glemaitre

glemaitre Sep 5, 2017

Contributor

In fact, you should use sklearn.externals.six so as not to depend on six directly.

@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
# currently, preprocessing.OneHotEncoder
# don't support inverse_transform

@hlin117

hlin117 Sep 5, 2017

Contributor

don't -> doesn't

if self.ignored_features is not None:
    mask[self.ignored_features] = False
if self.encode == 'onehot':

@hlin117

hlin117 Sep 5, 2017

Contributor

Nit: You can make this more concise.

if self.encode == 'ordinal':
    return Xt

# Only one-hot encode discretized features
mask = np.ones(X.shape[1], dtype=bool)
if self.ignored_features is not None:
    mask[self.ignored_features] = False

encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],
                     categorical_features=('all'
                                           if self.ignored_features is None
                                           else mask),
                     sparse=encode_sparse).fit_transform(Xt)
@@ -259,6 +299,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
# currently, preprocessing.OneHotEncoder

@hlin117

hlin117 Sep 5, 2017

Contributor

Nit: The convention of this file would be to capitalize comments. ("currently" -> "Currently")

# currently, preprocessing.OneHotEncoder
# don't support inverse_transform
if self.encode != 'ordinal':
    raise ValueError("inverse_transform only support "

@hlin117

hlin117 Sep 5, 2017

Contributor

support -> supports

est = KBinsDiscretizer(n_bins=3, ignored_features=[1, 2],
                       encode='onehot-dense').fit(X)
Xt_1 = est.transform(X)
Xt_2 = np.array([[1, 0, 0, 1, 0, 0, 1.5, -4],

@hlin117

hlin117 Sep 5, 2017

Contributor

Nit: You don't need to have Xt_2 be an array. A nested list should suffice.

I would also rename Xt_2 -> Xt_expected and Xt_1 -> Xt.

                 [0, 1, 0, 1, 0, 0, 2.5, -3],
                 [0, 0, 1, 0, 1, 0, 3.5, -2],
                 [0, 0, 1, 0, 0, 1, 4.5, -1]])
assert_array_equal(Xt_1, Xt_2)

@hlin117

hlin117 Sep 5, 2017

Contributor

The convention is the expected value first, and then the actual.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 5, 2017

Thanks @hlin117. Comments addressed.

@hlin117

hlin117 approved these changes Sep 5, 2017

Looks good!

@qinhanmin2014

Member

qinhanmin2014 commented Sep 5, 2017

@glemaitre Thanks. Comments addressed.

@glemaitre

Contributor

glemaitre commented Sep 5, 2017

I made a small pass on it. I still have to check the artifacts, but I think this is LGTM from me.

if self.ignored_features is not None:
    mask[self.ignored_features] = False
encode_sparse = (self.encode == 'onehot')

@glemaitre

glemaitre Sep 5, 2017

Contributor

I would remove the parentheses.

    mask[self.ignored_features] = False
encode_sparse = (self.encode == 'onehot')
return OneHotEncoder(n_values=np.array(self.n_bins_)[mask],

@glemaitre

glemaitre Sep 5, 2017

Contributor

self.n_bins_ is already an array, isn't it? I don't think you need to pass it through np.array.

@qinhanmin2014

qinhanmin2014 Sep 5, 2017

Member

@glemaitre
Thanks. According to the original author, we allow users to pass a list as n_bins_. If we don't do the conversion, we get an error (list indices must be integers).
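
A small demonstration of why the conversion is needed when a plain list is passed (the exact error message varies with the Python/NumPy version):

    import numpy as np

    n_bins = [2, 3, 3, 3]                       # user-supplied list
    mask = np.array([True, True, False, True])  # discretized-feature mask

    # n_bins[mask] raises a TypeError, because a list cannot be indexed
    # with a boolean array; converting first makes it work:
    selected = np.array(n_bins)[mask]           # -> array([2, 3, 3])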

@@ -259,6 +295,15 @@ def inverse_transform(self, Xt):
Data in the original feature space.
"""
check_is_fitted(self, ["offset_", "bin_width_"])
# Currently, preprocessing.OneHotEncoder

@glemaitre

glemaitre Sep 5, 2017

Contributor

If you can fill your lines out to 79 characters, that would be great.

def test_invalid_encode_option():
    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    assert_raises(ValueError, est.fit, X)

@glemaitre

glemaitre Sep 5, 2017

Contributor

Could you match the error message (or part of it) using assert_raises_regex
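
A sketch of the suggested pattern (sklearn.utils.testing was the test-helper module at the time; the regex fragment assumes the error message introduced earlier in this diff):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.utils.testing import assert_raises_regex

    X = np.array([[-2, 1.5, -4, -1],
                  [-1, 2.5, -3, -0.5],
                  [0, 3.5, -2, 0.5],
                  [1, 4.5, -1, 2]])

    est = KBinsDiscretizer(n_bins=[2, 3, 3, 3], encode='invalid-encode')
    assert_raises_regex(ValueError, "Valid options for 'encode' are",
                        est.fit, X)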

assert not sp.issparse(Xt_2)
assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=False)
                   .fit_transform(Xt_1), Xt_2)
assert_raises(ValueError, est.inverse_transform, Xt_2)

@glemaitre

glemaitre Sep 5, 2017

Contributor

Could you match the error message (or part of it) using assert_raises_regex

In addition, I would group all the error cases inside a single test.

def test_encode_options():
    # test valid encode options through comparison

@glemaitre

glemaitre Sep 5, 2017

Contributor

remove this comment

assert_array_equal(OneHotEncoder(n_values=[2, 3, 3, 3], sparse=True)
                   .fit_transform(Xt_1).toarray(),
                   Xt_3.toarray())
assert_raises(ValueError, est.inverse_transform, Xt_3)

@glemaitre

glemaitre Sep 5, 2017

Contributor

Could you match the error message (or part of it) using assert_raises_regex

qinhanmin2014 added some commits Sep 5, 2017

@qinhanmin2014

Member

qinhanmin2014 commented Sep 5, 2017

@glemaitre Thanks a lot. I removed np.array, and all the comments are addressed. Sorry @jnothman for not recognizing your correct suggestions earlier.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 7, 2017

Any update on it? Thanks :)

@glemaitre

Contributor

glemaitre commented Sep 7, 2017

I checked the artifact. Can you add an entry in the API documentation linking to the User Guide entry?

For example:
https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/ensemble/forest.py#L751
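
For reference, a sketch of the convention being pointed at: a "Read more in the User Guide" line in the class docstring, using a Sphinx cross-reference (the ref label here is assumed to match the one defined in the user guide page):

    from sklearn.base import BaseEstimator, TransformerMixin

    class KBinsDiscretizer(BaseEstimator, TransformerMixin):
        """Bin continuous data into intervals.

        Read more in the :ref:`User Guide <preprocessing_discretization>`.
        """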

@glemaitre

Contributor

glemaitre commented Sep 7, 2017

LGTM to be merged into discrete. I assume that an example will be added later on, since this PR is only for the encoding option.

cc: @lesteve @jnothman @ogrisel

@codecov

codecov bot commented Sep 7, 2017

Codecov Report

Merging #9647 into discrete will decrease coverage by <.01%.
The diff coverage is 100%.

@@             Coverage Diff              @@
##           discrete    #9647      +/-   ##
============================================
- Coverage     96.18%   96.18%   -0.01%     
============================================
  Files           336      338       +2     
  Lines         63835    62408    -1427     
============================================
- Hits          61402    60027    -1375     
+ Misses         2433     2381      -52
Impacted Files Coverage Δ
sklearn/preprocessing/discretization.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/tests/test_discretization.py 100% <100%> (ø) ⬆️
sklearn/datasets/olivetti_faces.py 33.33% <0%> (-8.89%) ⬇️
sklearn/datasets/base.py 85.5% <0%> (-6.58%) ⬇️
sklearn/datasets/california_housing.py 39.47% <0%> (-4.12%) ⬇️
sklearn/datasets/covtype.py 52.17% <0%> (-4.08%) ⬇️
sklearn/tests/test_docstring_parameters.py 37.5% <0%> (-3.84%) ⬇️
sklearn/utils/tests/test_testing.py 81% <0%> (-3.68%) ⬇️
sklearn/datasets/lfw.py 12.41% <0%> (-2.4%) ⬇️
sklearn/datasets/kddcup99.py 32.72% <0%> (-2.14%) ⬇️
... and 90 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ef342e...a4e9722.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 7, 2017

@glemaitre Sorry to disturb you. It seems that the link to the user guide doesn't work, and I cannot figure out why.

@jnothman

Member

jnothman commented Sep 7, 2017

I think that's sufficient consensus...

@jnothman jnothman merged commit eef7bdb into scikit-learn:discrete Sep 7, 2017

2 of 4 checks passed

continuous-integration/appveyor/pr - Waiting for AppVeyor build to complete
lgtm analysis: Python - Running analyses for revisions
ci/circleci - Your tests passed on CircleCI!
continuous-integration/travis-ci/pr - The Travis CI build passed

@qinhanmin2014

Member

qinhanmin2014 commented Sep 7, 2017

@jnothman It seems that the link to the user guide doesn't work, and I cannot find the reason. Could you please help me? Thanks a lot :)

@glemaitre

Contributor

glemaitre commented Sep 7, 2017

I put the fix in #9705. I think the discretization label was duplicated by the next section; it seems that labels are not case sensitive. I built locally and it was fine after renaming it.

@qinhanmin2014

Member

qinhanmin2014 commented Sep 7, 2017

@glemaitre Thanks a lot for the fix :)
The example is currently being discussed in #9339 as well as the mailing list.

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:my-feature-1 branch Sep 7, 2017
