
Iris dataset inconsistency compared to other sources #10550

Closed
paulochf opened this issue Jan 29, 2018 · 34 comments · Fixed by #11082
Labels: Bug, Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@paulochf
Contributor

paulochf commented Jan 29, 2018

Just found that the scikit-learn iris dataset is different from R's MASS datasets package.

I've found it in this Stack Exchange answer, which refers to here, which in turn says:

This data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick '@' espeedaz.net ).
The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature.
The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.

Note that the data in the UCI repo and in the scikit-learn dataset are the same at the cited lines.

Comparing the iris data across the three sources, we have the following (differences in bold):

| row # | source  | sepal_length | sepal_width | petal_length | petal_width |
|-------|---------|--------------|-------------|--------------|-------------|
| 35    | sklearn | 4.9          | 3.1         | 1.5          | **0.1**     |
| 35    | Fisher  | 4.9          | 3.1         | 1.5          | 0.2         |
| 35    | R MASS  | 4.9          | 3.1         | 1.5          | 0.2         |
| 38    | sklearn | 4.9          | **3.1**     | **1.5**      | 0.1         |
| 38    | Fisher  | 4.9          | 3.6         | 1.4          | 0.1         |
| 38    | R MASS  | 4.9          | 3.6         | 1.4          | 0.1         |
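A quick way to check the scikit-learn copy against the Fisher / R MASS values for those rows (a minimal sketch; `load_iris` is 0-indexed, so rows 35 and 38 correspond to indices 34 and 37):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows 35 and 38 (1-based) are indices 34 and 37 in the 0-based data array.
print(iris.data[[34, 37]])

# Values as given in Fisher's article and R's MASS package for those rows.
expected = np.array([[4.9, 3.1, 1.5, 0.2],
                     [4.9, 3.6, 1.4, 0.1]])
print(np.isclose(iris.data[[34, 37]], expected))  # shows which features disagree
```
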
Fisher data

http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf

[Image: Fisher's article data; 35th and 38th rows marked with red dots]

R data

[Image: R MASS data]


Proposal

If it's a bug, make the scikit-learn iris dataset equal to the others; otherwise, explain why there is a difference. I would guess it's the former.

@jnothman jnothman added the Bug label Jan 29, 2018
@jnothman
Member

Sounds like a bug. :\

I'm not sure if others are aware of this, or where our Iris came from.

Not sure if we should aspire to backwards compatibility when fixing this either...

@paulochf
Contributor Author

Not sure if we should aspire to backwards compatibility when fixing this either...

Yeah, I thought about that too. That's why I opened the issue instead of just PR'ing the correct values 😆.

@qinhanmin2014
Member

I think the iris dataset in scikit-learn was obtained from the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets/Iris
Relevant commit: def073b
I'm really curious why such a common dataset has multiple versions.

@paulochf
Contributor Author

As I said, their data doesn't agree with the original paper, so I guessed that was the cause of this "bug".

@qinhanmin2014
Member

Oops, I think @paulochf is right. According to the doc and a previous commit, scikit-learn obtained the data from the UCI Machine Learning Repository. UCI itself has acknowledged that their data is not right (see the Data Set Information section at http://archive.ics.uci.edu/ml/datasets/Iris). So it seems reasonable to correct the data in scikit-learn. WDYT? @jnothman

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@qinhanmin2014
Member

If noting the error is sufficient for UCI, perhaps it is for us?

@jnothman A note is acceptable from my side, but I might prefer to correct the data. Some (or maybe many?) materials use the word 'error' to describe the difference. See e.g.
http://archive.ics.uci.edu/ml/datasets/Iris
http://pdfs.semanticscholar.org/1c27/e59992d483892274fd27ba1d8e19bbfb5d46.pdf
Since iris is widely used, correcting the data might influence some examples and many people's code, but maybe it's somehow worthwhile in the long run?

@paulochf
Contributor Author

paulochf commented Feb 1, 2018

Adding to @qinhanmin2014: sometimes people write code using iris in different languages and compare the results. Fixing this would give some consistency and reproducibility.

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@GaelVaroquaux
Member

Well, the cost is that many tutorials / Stack Overflow Q&A answers will change. So, that's a drawback.

On the other hand, it seems that the change is quite small, so I would guess that if a tutorial or example depends on this, it's bad anyhow :).

I am +0 on this.

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@qinhanmin2014 qinhanmin2014 added Easy Well-defined and straightforward way to resolve help wanted labels Feb 2, 2018
@rotuna

rotuna commented Feb 5, 2018

Could I try to do this?

@qinhanmin2014
Member

@rotuna Feel free to have a try :) You also need to take care of relevant docs/tests/examples after correcting the dataset.

@PranayMukherjee

@qinhanmin2014 could you elaborate on the relevant docs/tests/examples? Actually, I am totally new to the open-source community.

@gxyd
Contributor

gxyd commented Feb 6, 2018

I think what @qinhanmin2014 meant was that when you change the iris dataset values (as this issue expects), you can expect quite a few docs, tests, and examples (those which make use of the iris dataset) to fail. As an example, the iris dataset will have a different standard deviation, so a classifier fit on it will (possibly) give different output.

So to reflect those changes, you'll just need to update the docs/tests/examples that fail.

@qinhanmin2014
Member

@rotuna @PranayMukherjee See above comments. You need to take care of every place which uses the iris dataset (one possible way is to use something like git grep). @PranayMukherjee since @rotuna has taken the issue, please wait for a couple of days or ask whether he's still on the issue.

@jnothman
Member

jnothman commented Feb 6, 2018 via email

@rotuna

rotuna commented Feb 7, 2018

Hi, I'm just trying to figure out how to do this. @PranayMukherjee I'll reach out to you in case I'm not able to do this and let you take over. I'm very new to this too, so I'll need some time to try :)

@paulochf
Contributor Author

paulochf commented Feb 7, 2018

Please correct me if I'm wrong: @rotuna, I guess @jnothman said you have to change the values in the iris data, then run the test suite ($ pytest sklearn) to see which parts use the data.

@jnothman
Member

jnothman commented Feb 7, 2018

My suggestion was more that the uses of Iris are many... Sometimes it is used to assert general properties of an algorithm, and the specific values produced by that algorithm run over Iris are not tested. These tests do not need to be changed. Only those where specific data-dependent values are tested will need to be changed. Looking through every use of Iris is a waste of time. Looking through the test failures after changing Iris is straightforward.

@rotuna

rotuna commented Feb 7, 2018

I changed the values and found that so far only the graph_lasso part has failed. I'm now trying to find the actual values from R so that the test can compare them with sklearn's values (which is what I think the test was intended for).

I'm also running the scripts from the examples which have "load_iris" in them to check if something is changing. I'm guessing this would work for the docs too?

@jnothman
Member

jnothman commented Feb 7, 2018 via email

@agramfort
Member

I am also +0 on this. It's a lot of changes, work, review time for little gain.

@rotuna

rotuna commented Feb 15, 2018

Update: I've got the data from R, and I'll send in a PR on Friday or Saturday to find out where the other issues are, as @jnothman suggested. I've been having an unusually hectic time at work. Thankfully the week ends tomorrow and I can get some time to focus on this properly.

Sorry for the delay

@paulochf
Contributor Author

@rotuna, I've already done that. I posted the wrong values in my issue.

@rotuna

rotuna commented Feb 17, 2018

@paulochf I meant the cov and icov values for iris from R glasso package. These are in test_graph_lasso.py under the test_graph_lasso_iris function.

@paulochf
Contributor Author

Oh, never mind! 😅

@rotuna

rotuna commented Feb 25, 2018

The cov and icov values for iris from R don't match sklearn's values after updating iris.

  • I used the glasso function with the alpha (rho) value set to 1.0
  • The output of R's cov function and sklearn's empirical_covariance did not match, so I used the output of sklearn directly
  • The following are the R values

[Screenshot: R glasso output (2018-02-25 22-32-36)]

  • The following is the error I get when running the tests

[Screenshot: test failure output (2018-02-25 22-34-58)]

The code I used for finding the R values is here

I used nosetests -v sklearn to run the tests.

I tried to understand both functions over the week to figure out where I was wrong, but I couldn't figure out anything beyond the fact that the output of R's cov function and sklearn's empirical_covariance did not match (one possible reason is sketched below). Please help.
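One likely reason those two disagree (an assumption, not verified against the R session above): R's cov() uses the unbiased n − 1 normalization, while scikit-learn's empirical_covariance divides by n by default, so the two estimates always differ slightly. A minimal sketch of the comparison:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance

X = load_iris().data

# scikit-learn's default: maximum-likelihood estimate (divides by n) ...
emp_cov = empirical_covariance(X)

# ... versus R's cov(): unbiased estimate (divides by n - 1).
r_style_cov = np.cov(X, rowvar=False, ddof=1)

print(np.max(np.abs(emp_cov - r_style_cov)))  # small but systematic difference
```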

@jasneetsinghanand

@rotuna Are you still stuck on this? Can you involve me in solving this issue?

@jnothman
Member

I'm not sure what's going on here. A pull request may help make it concrete. Those arrays only differ on the diagonal, and there they are offset by 1.

@rotuna

rotuna commented Mar 1, 2018

@jnothman @paulochf @qinhanmin2014

I sent a PR and the same test is failing for the same reason here, with the same values (other than one coding-convention error in the comments; I'll fix that in the actual MRG PR). Could you please help me fix this?

Note: test_graph_lasso_iris is the one I am talking about. The other test (test_graphical_lasso_iris) also fails, but I have not changed the values for that one yet; in the PR, test_graphical_lasso_iris still has the values for the original sklearn iris. (Comparing the values from R to it manually, it fails too.)

@jnothman
Member

jnothman commented Mar 2, 2018

Well, it says:

        # Hard-coded solution from R glasso package for alpha=1.0
        # The iris datasets in R and scikit-learn do not match in a few places,
        # these values are for the scikit-learn version.

so presumably you should try and recompute glasso's output with Iris in R, and put those in the test. Is that something you are able to do?

@rotuna

rotuna commented Mar 2, 2018

@jnothman

Yes, the test fails on the newly computed values from R. The code I used to get the values is here

@jnothman
Member

jnothman commented Mar 4, 2018

Right, sorry for not keeping track of this appropriately. Now, the error you show seems to suggest that the difference between R and us is systematic for w: the R diagonal is 1 greater than ours, and otherwise they appear identical, yeah? I've not looked into why there might be such a consistent difference, but it might be expected.
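One plausible explanation (an assumption worth checking against the R glasso documentation, not verified here): R's glasso penalizes the diagonal by default (penalize.diagonal = TRUE), so its estimated covariance carries an extra rho on the diagonal, whereas scikit-learn's graphical lasso leaves the diagonal unpenalized. With alpha = rho = 1.0, that would give exactly the observed offset of 1. A rough sketch of the check on the scikit-learn side (using graphical_lasso, the newer name of graph_lasso):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance, graphical_lasso

X = load_iris().data
emp_cov = empirical_covariance(X)

# scikit-learn's graphical lasso does not penalize the diagonal.
cov_skl, icov_skl = graphical_lasso(emp_cov, alpha=1.0)

# If the only extra effect in R is the diagonal penalty, its covariance
# estimate should be (approximately) ours plus alpha on the diagonal.
cov_r_candidate = cov_skl + 1.0 * np.eye(cov_skl.shape[0])
print(np.round(cov_r_candidate, 3))  # compare against the hard-coded R values
```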
