New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] FIX Correct iris dataset #11082

Merged
merged 9 commits into from May 22, 2018

Conversation

Projects
None yet
3 participants
@qinhanmin2014
Member

qinhanmin2014 commented May 10, 2018

Reference Issues/PRs

Fixes #10550 Closes #10718

What does this implement/fix? Explain your changes.

Correct 3 incorrect data points in iris dataset
Reasons for the change:
(1) Some papers discuss about the error. See e.g. http://pdfs.semanticscholar.org/1c27/e59992d483892274fd27ba1d8e19bbfb5d46.pdf
(2) UCI itself has claimed that it's an error : The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
(3) Make scikit-learn consistent with other packages (I tried R (iris) and seaborn (load_dataset("iris")))

qinhanmin2014 added some commits May 10, 2018

@jnothman

Thanks for finishing this.
Please add what's new and versionchanged.

qinhanmin2014 added some commits May 11, 2018

@qinhanmin2014

This comment has been minimized.

Member

qinhanmin2014 commented May 11, 2018

@jnothman Comments addressed. Thanks for the review :)

@jnothman

This comment has been minimized.

Member

jnothman commented on sklearn/datasets/base.py in 4183c09 May 11, 2018

The versionadded above is about the parameter. The changed notice can't be under the Parameters heading

This comment has been minimized.

Member

qinhanmin2014 replied May 11, 2018

@jnothman Where shall we put it? (Current solution is borrowed from #10795)

This comment has been minimized.

Member

jnothman replied May 11, 2018

@qinhanmin2014

This comment has been minimized.

Member

qinhanmin2014 commented May 11, 2018

@jnothman Thanks. I put it under notes. I also update load_boston accordingly.

@jnothman jnothman changed the title from [MRG] FIX Correct iris dataset to [MRG+1] FIX Correct iris dataset May 12, 2018

@amueller

This comment has been minimized.

Member

amueller commented May 21, 2018

Should we add a note that it's the same as in R but not as in UCI?

@qinhanmin2014

This comment has been minimized.

Member

qinhanmin2014 commented May 22, 2018

@amueller Thanks. Comment addressed.

Should we add a note that it's the same as in R but not as in UCI?

Agree since we've been using the version from UCI for a long time.

The famous Iris database, first used by Sir R.A. Fisher
The famous Iris database, first used by Sir R.A. Fisher.
The dataset is taken from Fisher's paper. Note that it's the same as in R,
but not as in UCI Machine Learning Repository, which has two wrong data points.

This comment has been minimized.

@amueller

amueller May 22, 2018

Member

"the" UCI repository? I didn't actually check the points against the paper but otherwise LGTM.

This comment has been minimized.

@amueller

amueller May 22, 2018

Member

ok actually checked the reference. Add the "the " and merge?

@qinhanmin2014 qinhanmin2014 merged commit 399f1b2 into scikit-learn:master May 22, 2018

0 of 5 checks passed

ci/circleci: python2 CircleCI is running your tests
Details
ci/circleci: python3 CircleCI is running your tests
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
lgtm analysis: Python Running analyses for revisions
Details
@qinhanmin2014

This comment has been minimized.

Member

qinhanmin2014 commented May 22, 2018

Take @amueller 's approval and merge. Thanks all for the review here :)

@qinhanmin2014 qinhanmin2014 deleted the qinhanmin2014:iris branch May 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment