Iris dataset inconsistency compared to other sources #10550
Comments
Sounds like a bug. :\ I'm not sure if others are aware of this, or where our Iris came from. Not sure if we should aspire to backwards compatibility when fixing this either... |
Yeah, I thought about that too. That's why I opened the issue instead of just PR'ing the correct values 😆 . |
I think the iris dataset in scikit-learn was obtained from the UCI Machine Learning Repository. |
As I said, their data doesn't agree with the original paper, so I guessed that was the cause of this "bug". |
Oops, I think @paulochf is right. According to the doc and a previous commit, scikit-learn obtained the data from the UCI Machine Learning Repository. UCI itself has acknowledged that its data is not right (see the Data Set Information section at http://archive.ics.uci.edu/ml/datasets/Iris). So it seems reasonable to correct the data in scikit-learn. WDYT? @jnothman |
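For reference, UCI's note identifies exactly two samples (the 35th and 38th, 1-indexed) that differ from Fisher's paper. The following is a minimal pure-Python sketch of that comparison; the `shipped` values are reproduced from memory of the UCI `iris.data` file (worth double-checking against the file itself), while the `fisher` values are the corrections quoted in UCI's note:

```python
# The two rows UCI's "Data Set Information" section flags as differing from
# Fisher's paper (keys are 1-indexed row numbers, values are the 4 features).
fisher = {35: [4.9, 3.1, 1.5, 0.2], 38: [4.9, 3.6, 1.4, 0.1]}   # corrected
shipped = {35: [4.9, 3.1, 1.5, 0.1], 38: [4.9, 3.1, 1.5, 0.1]}  # as distributed

def diff_rows(a, b):
    """Return {row: [(feature_index, a_val, b_val), ...]} for differing cells."""
    out = {}
    for row in a:
        cells = [(i, x, y)
                 for i, (x, y) in enumerate(zip(a[row], b[row])) if x != y]
        if cells:
            out[row] = cells
    return out

print(diff_rows(fisher, shipped))
# row 35 differs in the 4th feature; row 38 in the 2nd and 3rd,
# matching what UCI's note says.
```

The same helper could be pointed at `sklearn.datasets.load_iris().data` rows 34 and 37 (0-indexed) to check which version a given scikit-learn release ships.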
If noting the error is sufficient for UCI, perhaps it is for us?
|
@jnothman A note is acceptable from my side, but I might prefer to correct the data. Some (or maybe many?) materials use the word 'error' to describe the difference. See e.g. |
Adding to @qinhanmin2014's point: sometimes people write code using iris in different languages and compare the results. Fixing it would give some consistency and reproducibility. |
I think you might be right that fixing it up is preferable, under the
assumption that all people have done with this is illustrations that are
unlikely to change...
I'd be curious to know what @GaelVaroquaux thinks
|
Well, the cost is that many tutorials / stackoverflow Q&A answers will change. So, that's a drawback. On the other hand, the change is quite small, so I would guess that if a tutorial or example depends on this, it's bad anyhow :). I am +0 on this. |
+0 is about how I feel too. So far no one is agitating to keep it, so we
could just change and move on!
|
Could I try to do this? |
@rotuna Feel free to have a try :) You also need to take care of relevant docs/tests/examples after correcting the dataset. |
@qinhanmin2014 could you elaborate on the relevant docs/tests/examples? Actually, I am totally new to the open source community. |
I think what @qinhanmin2014 meant was that when you change the iris dataset values (as the fix requires), the output of quite a few docs, tests and examples (those which use the iris dataset) will likely change or fail. As an example, the iris dataset will have a (possibly) different standard deviation, so fitting a classifier on it may produce different output. To reflect those changes, you'll just need to update the docs/tests/examples which fail. |
@rotuna @PranayMukherjee See above comments. You need to take care of every place which uses the iris dataset (one possible way is to use something like git grep). @PranayMukherjee, since @rotuna has taken the issue, please wait for a couple of days or ask whether he's still working on it. |
I'd say the easy way is to make the change, then look at the test logs to
work out what needs fixing up...
|
Hi, I'm just trying to figure out how to do this. @PranayMukherjee I'll reach out to you in case I'm not able to do this and let you take over. I'm very new to this too, so I'll need some time to try :) |
My suggestion was more that the uses of Iris are many... Sometimes it is used to assert general properties of an algorithm, and the specific values produced by that algorithm run over Iris are not tested. These tests do not need to be changed. Only those where specific data-dependent values are tested will need to be changed. Looking through every use of Iris is a waste of time. Looking through the test failures after changing Iris is straightforward. |
I changed the values and found that so far only the graph_lasso part has failed. I'm now trying to find the actual values from R so that the test can compare them with sklearn's values (which is what I think the test was intended for). I'm also running the example scripts which contain "load_iris" to check if anything changes. I'm guessing this would work for the docs too? |
A lot of the failures could be in doctests. So make sure pytest is
configured to run them. If you submit a PR (with [doc build] in the commit
message) you can check the error log and the rendered examples quite easily.
|
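For reference, one way to have pytest collect doctests alongside regular tests is via configuration; this is only a sketch, and scikit-learn's own CI setup may differ:

```ini
# setup.cfg (or pytest.ini) -- make pytest run doctests in modules too
[tool:pytest]
addopts = --doctest-modules
doctest_optionflags = NORMALIZE_WHITESPACE ELLIPSIS
```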
I am also +0 on this. It's a lot of changes, work, review time for little gain. |
Update: I've got the data from R and, as @jnothman said, I'll send in a PR to find out where the other issues are, on Friday or Saturday. I've been having an unusually hectic time at work. Thankfully the week ends tomorrow and I can get some time to focus on this properly. Sorry for the delay. |
@rotuna, I've already done that. I posted the wrong values in my issue. |
@paulochf I meant the cov and icov values for iris from R glasso package. These are in test_graph_lasso.py under the test_graph_lasso_iris function. |
Oh, never mind! 😅 |
The cov and icov values for iris from R don't match with sklearn's values after updating iris.
The code I used for finding the R values is here. I used `nosetests -v sklearn` to run the tests. I tried to understand both functions over the week to figure out where I was wrong, but I couldn't get beyond the fact that the output of R's cov function and sklearn's empirical_covariance did not match. Please help. |
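One systematic difference worth ruling out when comparing R's `cov()` with scikit-learn's `empirical_covariance`: R uses the unbiased n - 1 denominator, while scikit-learn's function returns the maximum-likelihood estimate with denominator n. A minimal pure-Python sketch of the effect (toy data, no R or sklearn required):

```python
# R's cov() divides by (n - 1); sklearn's empirical_covariance divides by n.
# For iris (n = 150) the two estimates differ by a factor of 150/149,
# which can silently fail a strict array comparison.

def variance(xs, ddof):
    """Variance of xs with denominator len(xs) - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

xs = [4.9, 3.1, 1.5, 0.2, 5.0, 3.6]   # toy data (n = 6), not an iris column
var_r = variance(xs, ddof=1)           # R-style: divide by n - 1
var_sklearn = variance(xs, ddof=0)     # sklearn-style: divide by n
print(var_r / var_sklearn)             # ratio is n / (n - 1) = 6/5 = 1.2
```

If the mismatch disappears after multiplying one matrix by n / (n - 1), the denominator is the culprit; otherwise the difference lies elsewhere (e.g. in the glasso settings).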
@rotuna Are you still stuck on this? Can you involve me in solving this issue? |
I'm not sure what's going on here. A pull request may help make it concrete. Those arrays only differ on the diagonal, and there they are offset by 1 |
@jnothman @paulochf @qinhanmin2014 I sent a PR, and the same test is failing for the same reason, with the same values (other than one coding-convention error in the comments; I'll fix that in the actual MRG PR). Could you please help me fix this? Note: test_graph_lasso_iris is the one I am talking about. The other test (test_graphical_lasso_iris) also fails, but I have not changed the values for that yet; in the PR it still has the original sklearn iris values. (Comparing the R values to it manually, it too fails, though.) |
well it says
so presumably you should try and recompute glasso's output with Iris in R, and put those in the test. Is that something you are able to do? |
Right, sorry for not keeping track of this appropriately. Now, the error you show seems to suggest that the difference between R and us is systematic for w: the R diagonal is 1 greater than ours, and otherwise they appear identical, yeah? I've not looked into why there might be such a consistent difference, but it might be expected. |
Just found that the scikit-learn iris dataset is different from the one in R's MASS datasets package.
I found it in this Stats Exchange answer, which refers to here, which in turn says
Note that the data in the UCI repo and in the scikit-learn dataset are the same at the cited lines.
Comparing iris data between the 3 sources, we have the following (bolds are differences)
Fisher data
http://rcs.chemometrics.ru/Tutorials/classification/Fisher.pdf
R data
Proposal
If it's a bug, make the scikit-learn iris dataset equal to the others; otherwise, explain why it differs. I would guess it's the former.