Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test support for sparse multilabel format #7886

Open
jnothman opened this issue Nov 16, 2016 · 18 comments · May be fixed by #7996
Open

Test support for sparse multilabel format #7886

jnothman opened this issue Nov 16, 2016 · 18 comments · May be fixed by #7996

Comments

@jnothman
Copy link
Member

Mutlilabel data is indicated by having y be a 2d array consisting of 0s and 1s. This may be represented with a sparse matrix format. However, it seems we are lacking in tests of sparse representations of multilabel formats.

We'd like to see one or both of:

  • specific tests for each multilabel estimator (or confirmation that they already exist)
  • a general test for multilabel estimators that they support sparse y and behave identically to dense y
  • clarity in the estimator docstrings where sparse y is supported.
@lesteve
Copy link
Member

lesteve commented Nov 16, 2016

The issue that prompted this for completeness: #7786 (comment).

@dalmia
Copy link
Contributor

dalmia commented Nov 17, 2016

Can I take this up?

@lesteve
Copy link
Member

lesteve commented Nov 17, 2016

Sure, go ahead ! If there is a "Need Contributor" and nobody is working on it by looking at the comments, you don't need to ask for permission, just add a comment saying you will start working on it.

Note there is no "Easy" label on this one. Not sure whether that was intentional, but in any case you have been warned ;-).

@dalmia
Copy link
Contributor

dalmia commented Nov 17, 2016

Thanks for the heads up @lesteve. The more challenging the issue, the better it feels to solve it :)

@lesteve
Copy link
Member

lesteve commented Nov 17, 2016

I think the outlined bullet points by @jnothman give you a good idea how to start on this issue. Feel free to ping us if you get stuck.

@dalmia
Copy link
Contributor

dalmia commented Nov 17, 2016

@lesteve Will do Sir. Just had one doubt before I get started on working on this, that since not a large number of estimator support multilabel prediction, is there any way I could find all the multilabel estimators so that I can get started?

@amueller
Copy link
Member

amueller commented Nov 22, 2016

@dalmia do a try with all_estimators.

@dalmia
Copy link
Contributor

dalmia commented Dec 3, 2016

This says that we don't support sparse y. As per the issue, do we need to change this? Please correct me if I am wrong.

@jnothman
Copy link
Member Author

jnothman commented Dec 3, 2016 via email

@jnothman
Copy link
Member Author

jnothman commented Dec 3, 2016 via email

@dalmia
Copy link
Contributor

dalmia commented Dec 4, 2016

It seems we have it tested for OneVsRestClassifier here. I am going to check the same for DecisionTrees, RandomForests and NearestNeighbors. I suppose these are the multilabel estimators. Please let me know if I am going in the right way.
Thanks.

@dalmia
Copy link
Contributor

dalmia commented Dec 4, 2016

But since DecisionTreeClassifier is a multilabel estimator and the tests should indicate support for sparse y, we need to change the code I earlier linked to, right?

@jnothman
Copy link
Member Author

jnothman commented Dec 4, 2016 via email

@dalmia
Copy link
Contributor

dalmia commented Dec 4, 2016

@jnothman Could you please give me a final brief overview as to what do I need to implement for the issue as I'm getting a bit confused as to what should be implemented and what isn't supported?

@jnothman
Copy link
Member Author

jnothman commented Dec 4, 2016

Fair enough. I wrote it up because there was some sense that the idea of sparse multilabel y wasn't clearly enough documented, and someone suggested some kind of invariance testing but that may not be necessary. What we've done with sparse X is tested that either an estimator supports it or gives a consistent error. I think that would be appropriate for multilabel/multioutput y, with evntual implementation of support where possible.

@dalmia
Copy link
Contributor

dalmia commented Dec 6, 2016

So as you said, I checked for the errors that the multioutput/multilabel classifiers give for sparse multilabel y. As I had already linked before, DecisionTree gives a TypeError which has tests written. However, we don't have tests for RandomForest gives a ValueError which doesn't have tests. So, I need to write them right? Also RandomTreesEmbedding throws an AttributeError instead. So I plan to add a separate check for that.

@jnothman
Copy link
Member Author

jnothman commented Dec 6, 2016 via email

@dalmia
Copy link
Contributor

dalmia commented Dec 7, 2016

I did a mistake earlier. Since RandomTreesEmbedding only transforms the input and doesn't use y, it does support multilabel sparse y. Also, fit and fit_transform don't mention anything about y or sample_weight. Although they are not needed here, shouldn't we add some sort of note about them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants