
Iris dataset inconsistency compared to other sources #10550

Closed
paulochf opened this issue Jan 29, 2018 · 34 comments · Fixed by #11082
Labels: Bug, Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@paulochf
Contributor

paulochf commented Jan 29, 2018

Just found that the scikit-learn iris dataset is different from R's MASS datasets package.

I've found it in this Stack Exchange answer, which refers to here, which in turn says:

This data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick '@' espeedaz.net ).
The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature.
The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.

Note that the data in the UCI repo and in the scikit-learn dataset are the same at the cited lines.

Comparing the iris data across the three sources, we have the following (differences in bold):

| row # | source  | sepal_length | sepal_width | petal_length | petal_width |
|-------|---------|--------------|-------------|--------------|-------------|
| 35    | sklearn | 4.9          | 3.1         | 1.5          | **0.1**     |
| 35    | Fisher  | 4.9          | 3.1         | 1.5          | 0.2         |
| 35    | R MASS  | 4.9          | 3.1         | 1.5          | 0.2         |
| 38    | sklearn | 4.9          | **3.1**     | **1.5**      | 0.1         |
| 38    | Fisher  | 4.9          | 3.6         | 1.4          | 0.1         |
| 38    | R MASS  | 4.9          | 3.6         | 1.4          | 0.1         |
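A quick way to check the scikit-learn copy against the Fisher / R MASS values for those rows (a minimal sketch; `load_iris` is 0-indexed, so rows 35 and 38 correspond to indices 34 and 37):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows 35 and 38 (1-based) are indices 34 and 37 in the 0-based data array.
print(iris.data[[34, 37]])

# Values as given in Fisher's article and R's MASS package for those rows.
expected = np.array([[4.9, 3.1, 1.5, 0.2],
                     [4.9, 3.6, 1.4, 0.1]])
print(np.isclose(iris.data[[34, 37]], expected))  # shows which features disagree
```
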
Fisher data

http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf

[Image: Fisher's article data; 35th and 38th rows marked with red dots]

R data

[Image: R MASS data]


Proposal

If it's a bug, make the scikit-learn iris dataset equal to the others; otherwise, explain why there is a difference. I would guess it's the former.

@jnothman jnothman added the Bug label Jan 29, 2018
@jnothman
Member

Sounds like a bug. :\

I'm not sure if others are aware of this, or where our Iris came from.

Not sure if we should aspire to backwards compatibility when fixing this either...

@paulochf
Contributor Author

Not sure if we should aspire to backwards compatibility when fixing this either...

Yeah, I thought about that too. That's why I opened the issue instead of just PR'ing the correct values 😆.

@qinhanmin2014
Member

I think the iris dataset in scikit-learn was obtained from the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets/Iris
Relevant commit: def073b
I'm really curious why such a common dataset has multiple versions.

@paulochf
Contributor Author

As I said, their data doesn't agree with the original paper, so I guessed that was the cause of this "bug".

@qinhanmin2014
Member

Oops, I think @paulochf is right. According to the doc and a previous commit, scikit-learn obtained the data from the UCI Machine Learning Repository. UCI itself has acknowledged that their data is not right (see the Data Set Information section at http://archive.ics.uci.edu/ml/datasets/Iris). So it seems reasonable to correct the data in scikit-learn. WDYT? @jnothman

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@qinhanmin2014
Member

If noting the error is sufficient for UCI, perhaps it is for us?

@jnothman A note is acceptable from my side, but I might prefer to correct the data. Some (or maybe many?) materials use the word 'error' to describe the difference. See e.g.
http://archive.ics.uci.edu/ml/datasets/Iris
http://pdfs.semanticscholar.org/1c27/e59992d483892274fd27ba1d8e19bbfb5d46.pdf
Since iris is widely used, correcting the data might influence some examples and many people's code, but maybe it's somehow worthwhile in the long run?

@paulochf
Contributor Author

paulochf commented Feb 1, 2018

Adding to @qinhanmin2014: sometimes people write code using iris in different languages and compare the results. Fixing this would give some consistency and reproducibility.

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@GaelVaroquaux
Member

Well, the cost is that many tutorials / Stack Overflow Q&A answers will change. So, that's a drawback.

On the other hand, it seems that the change is quite small, so I would guess that if a tutorial or example depends on this, it's bad anyhow :).

I am +0 on this.

@jnothman
Member

jnothman commented Feb 1, 2018 via email

@qinhanmin2014 qinhanmin2014 added Easy Well-defined and straightforward way to resolve help wanted labels Feb 2, 2018
@rotuna

rotuna commented Feb 5, 2018

Could I try to do this?

@qinhanmin2014
Member

@rotuna Feel free to have a try :) You also need to take care of relevant docs/tests/examples after correcting the dataset.

@PranayMukherjee

@qinhanmin2014 could you elaborate on the relevant docs/tests/examples? Actually, I am totally new to the open-source community.

@gxyd
Contributor

gxyd commented Feb 6, 2018

I think what @qinhanmin2014 meant was that when you change the iris dataset values (as this issue expects), you can expect quite a few docs, tests, and examples (those which make use of the iris dataset) to fail. As an example, the iris dataset will have a different standard deviation, so a classifier fit on it will (possibly) give different output.

So to reflect those changes, you'll just need to update the docs/tests/examples that fail.

@qinhanmin2014
Member

@rotuna @PranayMukherjee See above comments. You need to take care of every place which uses the iris dataset (one possible way is to use something like git grep). @PranayMukherjee since @rotuna has taken the issue, please wait for a couple of days or ask whether he's still on the issue.

@jnothman
Member

jnothman commented Feb 6, 2018 via email

@rotuna

rotuna commented Feb 7, 2018

Hi, I'm just trying to figure out how to do this. @PranayMukherjee I'll reach out to you in case I'm not able to do this and let you take over. I'm very new to this too, so I'll need some time to try :)

@paulochf
Contributor Author

paulochf commented Feb 7, 2018

Please correct me if I'm wrong: @rotuna, I guess @jnothman said you have to change the values in the iris data, then run the test suite ($ pytest sklearn) to see which parts use the data.

@jnothman
Member

jnothman commented Feb 7, 2018

My suggestion was more that the uses of Iris are many... Sometimes it is used to assert general properties of an algorithm, and the specific values produced by that algorithm run over Iris are not tested. These tests do not need to be changed. Only those where specific data-dependent values are tested will need to be changed. Looking through every use of Iris is a waste of time. Looking through the test failures after changing Iris is straightforward.

@rotuna

rotuna commented Feb 7, 2018

I changed the values and found that so far only the graph_lasso part has failed. I'm now trying to find the actual values from R so that the test can compare them with sklearn's values (which is what I think the test was intended for).

I'm also running the scripts from the examples which have "load_iris" in them to check if something is changing. I'm guessing this would work for the docs too?

@jnothman
Member

jnothman commented Feb 7, 2018 via email

@agramfort
Member

I am also +0 on this. It's a lot of changes, work, review time for little gain.

@rotuna

rotuna commented Feb 15, 2018

Update: I've got the data from R, and I'll send in a PR on Friday or Saturday to find out where the other issues are, as @jnothman suggested. I've been having an unusually hectic time at work. Thankfully the week ends tomorrow and I can get some time to focus on this properly.

Sorry for the delay

@paulochf
Contributor Author

@rotuna, I've already done that. I posted the wrong values in my issue.

@rotuna

rotuna commented Feb 17, 2018

@paulochf I meant the cov and icov values for iris from R glasso package. These are in test_graph_lasso.py under the test_graph_lasso_iris function.

@paulochf
Contributor Author

Oh, never mind! 😅

@rotuna

rotuna commented Feb 25, 2018

The cov and icov values for iris from R don't match sklearn's values after updating iris.

  • I used the glasso function with the alpha (rho) value set to 1.0
  • The output of R's cov function and sklearn's empirical_covariance did not match, so I used the output of sklearn directly
  • The following are the R values

[Screenshot: R glasso output (2018-02-25 22-32-36)]

  • The following is the error I get when running the tests

[Screenshot: test failure output (2018-02-25 22-34-58)]

The code I used for finding the R values is here

I used nosetests -v sklearn to run the tests.

I tried to understand both functions over the week to figure out where I was wrong, but I couldn't figure out anything beyond the fact that the output of R's cov function and sklearn's empirical_covariance did not match (one possible reason is sketched below). Please help.
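One likely reason those two disagree (an assumption, not verified against the R session above): R's cov() uses the unbiased n − 1 normalization, while scikit-learn's empirical_covariance divides by n by default, so the two estimates always differ slightly. A minimal sketch of the comparison:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance

X = load_iris().data

# scikit-learn's default: maximum-likelihood estimate (divides by n) ...
emp_cov = empirical_covariance(X)

# ... versus R's cov(): unbiased estimate (divides by n - 1).
r_style_cov = np.cov(X, rowvar=False, ddof=1)

print(np.max(np.abs(emp_cov - r_style_cov)))  # small but systematic difference
```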

@jasneetsinghanand

@rotuna Are you still stuck on this? Can you involve me in solving this issue?

@jnothman
Member

I'm not sure what's going on here. A pull request may help make it concrete. Those arrays only differ on the diagonal, and there they are offset by 1.

@rotuna

rotuna commented Mar 1, 2018

@jnothman @paulochf @qinhanmin2014

I sent a PR and the same test is failing for the same reason here, with the same values (other than one coding-convention error in the comments; I'll fix that in the actual MRG PR). Could you please help me fix this?

Note: test_graph_lasso_iris is the one I am talking about. The other test (test_graphical_lasso_iris) also fails, but I have not changed the values for that one yet; in the PR, test_graphical_lasso_iris still has the values for the original sklearn iris. (Comparing the values from R to it manually, it fails too.)

@jnothman
Member

jnothman commented Mar 2, 2018

Well, it says:

        # Hard-coded solution from R glasso package for alpha=1.0
        # The iris datasets in R and scikit-learn do not match in a few places,
        # these values are for the scikit-learn version.

so presumably you should try and recompute glasso's output with Iris in R, and put those in the test. Is that something you are able to do?

@rotuna

rotuna commented Mar 2, 2018

@jnothman

Yes, the test fails on the newly computed values from R. The code I used to get the values is here

@jnothman
Member

jnothman commented Mar 4, 2018

Right, sorry for not keeping track of this appropriately. Now, the error you show seems to suggest that the difference between R and us is systematic for w: the R diagonal is 1 greater than ours, and otherwise they appear identical, yeah? I've not looked into why there might be such a consistent difference, but it might be expected.
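One plausible explanation (an assumption worth checking against the R glasso documentation, not verified here): R's glasso penalizes the diagonal by default (penalize.diagonal = TRUE), so its estimated covariance carries an extra rho on the diagonal, whereas scikit-learn's graphical lasso leaves the diagonal unpenalized. With alpha = rho = 1.0, that would give exactly the observed offset of 1. A rough sketch of the check on the scikit-learn side (using graphical_lasso, the newer name of graph_lasso):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.covariance import empirical_covariance, graphical_lasso

X = load_iris().data
emp_cov = empirical_covariance(X)

# scikit-learn's graphical lasso does not penalize the diagonal.
cov_skl, icov_skl = graphical_lasso(emp_cov, alpha=1.0)

# If the only extra effect in R is the diagonal penalty, its covariance
# estimate should be (approximately) ours plus alpha on the diagonal.
cov_r_candidate = cov_skl + 1.0 * np.eye(cov_skl.shape[0])
print(np.round(cov_r_candidate, 3))  # compare against the hard-coded R values
```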
