
[MRG+1] discrete branch: add a second example for KBinsDiscretizer #10195

Merged
merged 3 commits from TomDLT:discretization_example into scikit-learn:discrete on Nov 27, 2017

Conversation

TomDLT (Member) commented Nov 23, 2017

Reference Issues/PRs

Complementary to #10192, which also tackles issue #9339. This PR is not intended to replace that one, but to supplement it.

What does this implement/fix? Explain your changes.

Add another example for KBinsDiscretizer.

Any other comments?

Local result: figure_1 (screenshot of the example's output plot, omitted here)

glemaitre (Contributor) commented Nov 24, 2017

Looks cool. I'll have a look at it.

@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (np.int, np.integer)):

qinhanmin2014 (Member) commented Nov 24, 2017
At a glance, this does not seem consistent with #10017. It seems we are going to use (numbers.Integral, np.integer)?
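
(For readers: a quick illustration, not part of the PR, of why a combined check is needed.)

import numbers
import numpy as np

# A builtin int is not a numpy scalar, and (on Python 3) a numpy
# integer scalar is not a builtin int, so neither check alone
# covers both cases:
isinstance(5, np.integer)            # False
isinstance(np.int64(5), np.integer)  # True
isinstance(5, numbers.Integral)      # True
# A check like isinstance(n, (numbers.Integral, np.integer))
# therefore accepts both builtin and numpy integers.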

scikit-learn deleted a comment from codecov bot Nov 24, 2017

qinhanmin2014 (Member)

Learned a lot from this great example :)

name = estimator.__class__.__name__
if name == 'Pipeline':
name = [get_name(est[1]) for est in estimator.steps]
name = '\n'.join(name)

qinhanmin2014 (Member) commented Nov 24, 2017
This is used both in the plot and in the output message, but the '\n' will make the output message look strange. See the rendered page.
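
A sketch of the helper under discussion, with one possible fix for the printed message (the replace call is a suggestion, not code from the PR):

def get_name(estimator):
    # recursively build a label; pipelines get one step name per line
    name = estimator.__class__.__name__
    if name == 'Pipeline':
        name = [get_name(est[1]) for est in estimator.steps]
        name = '\n'.join(name)
    return name

# For the console output, the newline could be swapped for a separator:
# print(get_name(clf).replace('\n', ' + '))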

KBinsDiscretizer(encode='onehot'), LinearSVC(random_state=0)), {
'kbinsdiscretizer__n_bins': np.arange(2, 10),
'linearsvc__C': np.logspace(-2, 7, 10),
}),

qinhanmin2014 (Member) commented Nov 24, 2017
Why is there no KBinsDiscretizer + GradientBoostingClassifier or KBinsDiscretizer + SVC? Though it seems that KBinsDiscretizer + GradientBoostingClassifier somehow performs better than GradientBoostingClassifier alone on the first dataset.

glemaitre (Contributor) commented Nov 24, 2017
Because they are non-linear classifiers, and the idea is to show that a linear classifier combined with this transformer can perform as well as a non-linear classifier, isn't it?

qinhanmin2014 (Member) commented Nov 24, 2017
Thanks :) Then I suppose it might be better to state that explicitly in the example; it seems a bit confusing without your explanation, at least from my side.
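
To make the grouping concrete, here is a sketch of the three kinds of estimators being compared; the pipeline members are taken from the diff above, with SVC standing in for the non-linear group:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import LinearSVC, SVC

# linear classifier on the raw features
linear = LinearSVC(random_state=0)
# the same linear classifier on binned, one-hot-encoded features
linear_binned = make_pipeline(KBinsDiscretizer(encode='onehot'),
                              LinearSVC(random_state=0))
# a non-linear classifier for comparison
nonlinear = SVC(random_state=0)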

clf = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(ds_cnt, name, score)

qinhanmin2014 (Member) commented Nov 24, 2017
The message seems too simple from my side; maybe it's better to add something? E.g., add 'dataset' before ds_cnt (giving 'dataset 1') and 'score:' before score (giving 'score: 0.88').

glemaitre (Contributor)

Actually I like it as it is. My only small remark is that I would find it easier if the scoring metric were specified in the title or in the legend.
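
Combining the two remarks above, the message could look something like this (the exact wording is only a suggestion):

print('dataset %d, %s: test accuracy = %.2f'
      % (ds_cnt, name.replace('\n', ' + '), score))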

# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \

glemaitre (Contributor) commented Nov 24, 2017
Can you avoid using \ and instead break the line inside parentheses?

glemaitre (Contributor) commented Nov 24, 2017
Even though I see that it was the same in the original example ;)
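
The suggested style, sketched with the argument values assumed from the original classifier comparison example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.4, random_state=42)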

A demonstration of feature discretization on synthetic classification datasets.
Feature discretization decomposes each feature into a set of bins, here
equally distributed in width. The discrete values are then one-hot encoded,
and given to a linear classifier. On the two non-linearly separable datasets,

jnothman (Member) commented Nov 26, 2017
Hold the reader's hand a bit more: the first two rows represent linearly non-separable datasets (moons and concentric circles) while the third is approximately linearly separable.
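
For reference, the three datasets described above could be built as follows (parameters assumed, modeled on the classifier comparison example):

from sklearn.datasets import make_circles, make_classification, make_moons

datasets = [
    # linearly non-separable: interleaved half-moons
    make_moons(noise=0.3, random_state=0),
    # linearly non-separable: concentric circles
    make_circles(noise=0.2, factor=0.5, random_state=1),
    # approximately linearly separable blobs
    make_classification(n_features=2, n_redundant=0, n_informative=2,
                        n_clusters_per_class=1, random_state=2),
]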

linearly.
The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test

jnothman (Member) commented Nov 26, 2017
You either need to describe the groupings here or title the plot to show them: linear classifiers, non-linear classifiers, linear classifiers with discretized input.

jnothman (Member) commented Nov 26, 2017
Perhaps the discretized classifiers should come before the nonlinear ones to emphasise the difference between the nonlinear group and the linear group.

scikit-learn deleted a comment from codecov bot Nov 27, 2017

qinhanmin2014 (Member)

LGTM.
@TomDLT Could you please have a look at #10192 so that we can get the two examples in, merge discrete into master, and avoid repeatedly resolving conflicts? Thanks :)

@@ -166,7 +166,7 @@ def _validate_n_bins(self, n_features, ignored):
         """
         orig_bins = self.n_bins
         if isinstance(orig_bins, numbers.Number):
-            if not isinstance(orig_bins, np.int):
+            if not isinstance(orig_bins, (numbers.Integral, np.integer)):

qinhanmin2014 (Member) commented Nov 27, 2017
Just note that it will eventually be changed to something like SCALAR_INTEGER_TYPES in #10017.
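
That would presumably look something like the following; the constant name comes from the comment above, while the definition and the commented check are guesses:

import numbers
import numpy as np

SCALAR_INTEGER_TYPES = (numbers.Integral, np.integer)

# the check in _validate_n_bins would then become:
# if not isinstance(orig_bins, SCALAR_INTEGER_TYPES):
#     raise ValueError("n_bins must be an integer")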

@@ -29,6 +29,11 @@ def test_fit_transform():
     assert_array_equal(expected, est.transform(X))

+def test_valid_n_bins():

qinhanmin2014 (Member) commented Nov 27, 2017
I might prefer to remove the test:
(1) KBinsDiscretizer is currently in a PR (not in master), so this can be considered an improvement inside the PR.
(2) We do not add many tests in #10017; I don't think it deserves a test.
(3) It seems not to be a regression test?

isinstance(2, np.int) == True
isinstance(np.array([2])[0], np.int) == True

TomDLT (Member) commented Nov 27, 2017
(1) This is a PR to the discrete branch, which will end up in the PR from discrete to master.
(2) I would be in favor of this test, even though we may not need something systematic in #10017.
(3) For me:

np.__version__ == '1.13.1'
isinstance(np.array([2])[0], np.int) == False

qinhanmin2014 (Member) commented Nov 27, 2017
My previous script ran under Python 2.7.12 and numpy 1.13.3, and I can reproduce yours under Python 3.5.4 and numpy 1.13.1. So the reason for our difference is the Python version: numpy integer scalars subclass the builtin int on Python 2 but not on Python 3.
Let's keep the test if you think it is appropriate :) LGTM for the great example.
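
For completeness, a sketch of what the kept test might look like; the fixture data and the exact call are placeholders, not the PR's actual code:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def test_valid_n_bins():
    # a numpy integer scalar (as produced by indexing a numpy array)
    # should be accepted just like a builtin int
    X = [[-2, 1.5], [-1, 2.5], [0, 3.5], [1, 4.5]]
    KBinsDiscretizer(n_bins=np.array([3])[0]).fit(X)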


qinhanmin2014 changed the title from [MRG] discrete branch: add a second example for KBinsDiscretizer to [MRG+1] discrete branch: add a second example for KBinsDiscretizer Nov 27, 2017

features, which easily lead to overfitting when the number of samples is small.
The plots show training points in solid colors and testing points
semi-transparent. The lower right shows the classification accuracy on the test

jnothman (Member) commented Nov 27, 2017
I think "test set" should be preceded by "held-out".

jnothman (Member) commented Nov 27, 2017
Okay. This is clear from context.

jnothman (Member) commented Nov 27, 2017

Merging. Thanks for the nice plot!

@jnothman jnothman merged commit 430af30 into scikit-learn:discrete Nov 27, 2017

6 checks passed:

ci/circleci: Your tests passed on CircleCI!
codecov/patch: 100% of diff hit (target 96.11%)
codecov/project: 96.11% (+<.01%) compared to 4e60a3d
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
lgtm analysis (Python): No alert changes

TomDLT deleted the TomDLT:discretization_example branch Nov 28, 2017
