[MRG+2] discrete branch: add an example for KBinsDiscretizer #10192

Merged
merged 10 commits into scikit-learn:discrete from qinhanmin2014:discrete-example on Nov 27, 2017

@qinhanmin2014
Member

qinhanmin2014 commented Nov 23, 2017

Reference Issues/PRs

Fixes #9339

What does this implement/fix? Explain your changes.

Add an example for KBinsDiscretizer.
Reference: "Introduction to Machine Learning with Python" (Chapter 4, Section 2).

Any other comments?

Local result: [figure omitted]
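
For readers new to the transformer, here is a minimal sketch of what KBinsDiscretizer does, written against the API as it later shipped in scikit-learn 0.20 (the discrete branch under review may have differed slightly):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy single-feature dataset, mirroring the shapes used in the example.
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))

# Bin the feature into 10 intervals and one-hot encode bin membership.
enc = KBinsDiscretizer(n_bins=10, encode='onehot-dense')
X_binned = enc.fit_transform(X)
print(X_binned.shape)  # (100, 10): one indicator column per bin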

qinhanmin2014 added some commits Nov 23, 2017

@qinhanmin2014

Member

qinhanmin2014 commented Nov 23, 2017

ping @jnothman Could you help me diagnose the Circle failure? Do we need to merge master into discrete again? Thanks a lot.

@TomDLT

> Do we need to merge master into discrete again?

Done, Circle is not failing anymore.

linestyle=':', label='decision tree')
plt.plot(X[:, 0], y, 'o', c='k')
bins = enc.offset_[0] + enc.bin_width_[0] * np.arange(1, enc.n_bins_[0])
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)

@TomDLT

TomDLT Nov 23, 2017

Member

To have automatic ymin, ymax, you can use:

plt.vlines(bins, *plt.gca().get_ylim(), linewidth=1, alpha=.2)
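
For context, a tiny standalone demo of that unpacking trick (the plotted data here is made up):

import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
# get_ylim() returns (ymin, ymax); * unpacks the pair into vlines' two
# positional limit arguments, so the lines span the current axes exactly.
plt.vlines([0.5], *plt.gca().get_ylim(), linewidth=1, alpha=.2)
plt.show()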
# construct the dataset
rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=100)
y_no_noise = (np.sin(4 * X) + X)

@TomDLT

TomDLT Nov 23, 2017

Member

I would be in favour of something more non-linear, to better show the failure of the linear model.

e.g. y = np.sin(X) + rnd.normal(size=len(X)) / 3:

[figure: resulting plot omitted]

plt.title("Result before discretization")
# predict with transformed dataset
plt.subplot(122)

@TomDLT

TomDLT Nov 23, 2017

Member

The comparison is clearer if the subplots have the same ylim (sharey=True). You can do:

fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])
...
plt.sca(axes[1])
...
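
A self-contained version of that layout, with placeholder panels standing in for the example's plots (using ncols=2 for side-by-side panels, as the merged example ended up doing):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-3, 3, 100)
# Two side-by-side panels sharing the y-axis, so their limits match.
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])                # route pyplot calls to the left panel
plt.plot(x, np.sin(4 * x) + x)
plt.title("left panel (placeholder)")
plt.sca(axes[1])                # switch to the right panel
plt.plot(x, np.sin(x))
plt.title("right panel (placeholder)")
plt.show()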
@qinhanmin2014

Member

qinhanmin2014 commented Nov 23, 2017

@TomDLT Thanks a lot for your great help :)

@lesteve

Member

lesteve commented Nov 23, 2017

> Do we need to merge master into discrete again?

> Done, Circle is not failing anymore.

Hmmm, that is a bit weird; I would have expected checkout_merge_commit.sh to take care of that ...

@qinhanmin2014

Member

qinhanmin2014 commented Nov 23, 2017

ping @TomDLT Thanks for your instructions. Comments addressed :) Result from the Circle page:
[figure: sphx_glr_plot_discretization_001 omitted]

ping @jnothman I think it's ready for review. Thanks :)

ping @amueller The example is based on your book "Introduction to Machine Learning with Python" (Chapter 4, Section 2, binning & discretization), using scikit-learn's latest API, KBinsDiscretizer. I would be grateful if you could take some time to have a look (I've put your name at the beginning). Thanks :)

@jnothman

I've not yet looked at the code...

================================================================
The example compares prediction result of linear regression (linear model)
and decision tree (tree based model) before and after discretization.

@jnothman

jnothman Nov 23, 2017

Member

Before and after -> with and without

@jnothman

jnothman Nov 23, 2017

Member
of real-valued features
before discretization, linear model become much more flexible while decision
tree gets much less flexible. Note that binning features generally has no
beneficial effect for tree-based models, as these models can learn to split
up the data anywhere.

@jnothman

jnothman Nov 23, 2017

Member

Note that the linear model is fast to build and relatively straightforward to interpret.
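
To make the flexibility claim above concrete, a hedged sketch of the comparison (illustrative code only, not the PR's example file):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))
y = np.sin(4 * X[:, 0]) + X[:, 0] + rnd.normal(size=100) / 3

X_binned = KBinsDiscretizer(n_bins=10, encode='onehot-dense').fit_transform(X)

# On raw X the linear model can only fit a straight line; on the binned,
# one-hot encoded X it can fit a step function. The tree gains little,
# since it could already split the raw feature anywhere.
for name, Xt in (("raw", X), ("binned", X_binned)):
    lin = LinearRegression().fit(Xt, y)
    tree = DecisionTreeRegressor(min_samples_split=3, random_state=0).fit(Xt, y)
    print(name, lin.score(Xt, y), tree.score(Xt, y))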

@jnothman

Member

jnothman commented Nov 23, 2017

And yes, I find this a much more compelling argument than what we had before with iris.

@jnothman

Please mention one-hot encoding.

Also note that if the bins are not reasonably wide, there would appear to be a substantially increased risk of overfitting, so the discretiser parameters need tuning under cv.
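
A hedged sketch of what that tuning could look like, written against the released API (the grid and pipeline names here are illustrative, not part of the PR):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X = rnd.uniform(-3, 3, size=(100, 1))
y = np.sin(4 * X[:, 0]) + X[:, 0] + rnd.normal(size=100) / 3

# Tune n_bins under cross-validation, as the comment recommends.
pipe = make_pipeline(KBinsDiscretizer(encode='onehot-dense'),
                     LinearRegression())
grid = GridSearchCV(pipe,
                    param_grid={'kbinsdiscretizer__n_bins': list(range(2, 20))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)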

qinhanmin2014 added some commits Nov 24, 2017

@qinhanmin2014

Member

qinhanmin2014 commented Nov 24, 2017

@jnothman Comments addressed. Thanks a lot for the instant review :)

is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters need to be tuned

@jnothman

jnothman Nov 26, 2017

Member

Need -> should usually

seeing as you don't do that tuning here

# predict with original dataset
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
plt.sca(axes[0])

@jnothman

jnothman Nov 26, 2017

Member

Do we use sca much in other examples? It seems a bit unconventional. Then again, I can see how using axes methods together with pyplot.subplots might be seen as inconsistent.

@qinhanmin2014

Member

qinhanmin2014 commented Nov 26, 2017

@jnothman Thanks :) Comments addressed.

> seeing as you don't do that tuning here

In fact, the value of n_bins here (10) is among the best choices from cross-validation (if we grid-search on a pipeline of KBinsDiscretizer + LinearRegression). So I think putting the value in directly is consistent with this statement and might make the example easier to go through, especially since we now have another example which shows the detailed tuning process.

> Do we use sca much in other examples? It seems a bit unconventional.

I didn't find plt.sca under the examples folder. I have followed the matplotlib example and used a more common way.
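
For illustration, a sketch of the axes-methods style that "a more common way" presumably refers to (placeholder data; not a quote of the merged example):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-3, 3, 100)
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(10, 4))
# Call methods on each Axes object directly instead of switching the
# current axes with plt.sca().
ax1.plot(x, np.sin(4 * x) + x)
ax1.set_title("left panel (placeholder)")
ax2.plot(x, np.sin(x))
ax2.set_title("right panel (placeholder)")
plt.tight_layout()
plt.show()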

@TomDLT

TomDLT approved these changes Nov 27, 2017

LGTM

discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually
be tuned under cv.

@TomDLT

TomDLT Nov 27, 2017

Member

cv -> cross-validation

is to use discretization (also known as binning). In the example, we
discretize the feature and one-hot encode the transformed data. Note that if
the bins are not reasonably wide, there would appear to be a substantially
increased risk of overfitting, so the discretiser parameters should usually

@TomDLT

TomDLT Nov 27, 2017

Member

discretiser -> discretizer

@TomDLT TomDLT changed the title from [MRG] discrete branch: add an example for KBinsDiscretizer to [MRG+2] discrete branch: add an example for KBinsDiscretizer Nov 27, 2017

@qinhanmin2014

Member

qinhanmin2014 commented Nov 27, 2017

Merging given the approvals from jnothman and TomDLT.
Rendered example from Circle.

@qinhanmin2014 qinhanmin2014 merged commit 2c9134e into scikit-learn:discrete Nov 27, 2017

5 of 6 checks passed

lgtm analysis: Python - Running analyses for revisions
ci/circleci - Your tests passed on CircleCI!
codecov/patch - Coverage not affected when comparing a099c26...367c64f
codecov/project - 96.11% (-0.2%) compared to a099c26
continuous-integration/appveyor/pr - AppVeyor build succeeded
continuous-integration/travis-ci/pr - The Travis CI build passed

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:discrete-example branch Nov 28, 2017
