[MRG+1] deprecate sequences of sequences multilabel support #2657
Conversation
I think sequences of sequences should also be deprecated from …
My reasoning is that we really don't want to have code like … lying around. So we should deprecate sequences of sequences even in …
@mblondel, anything that uses is_sequence_of_sequences will produce a …
But you're right that I've not checked the Parameters sections of the docstrings. I shall do that.
Also, I see that we will in the future not need a full …
Putting that PR on the 0.15 milestone as well. I will try to have a deeper look at it in the coming days.
Ideally, this should be pulled in with the sparse label indicator support PRs. But at a minimum it would be nice to start warning that the end is nigh...
So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y, right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.
Note that this is no longer the case since #1993 was fixed.
Perhaps. I guess I'd need to review where … is used. Is it possible to consider the changes here and then work out what we want the function of … to be?
Then multilabel classification with an indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.
I don't understand why you think so. If I understand you correctly, this is tested at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_multiclass.py#L94 |
It's just that if the no-op behavior was removed from LabelBinarizer, I don't see how this line can work in the multilabel case with indicator matrix: |
That's because the no-op behaviour was removed because it wasn't affected …
I think your implication is right, @mblondel: that …
Having said this, I'm not really sure what the best way to sort out the various uses of label binarizer is, and whether it's worth having a utility that will binarize multiclass data, but pass through -- or change the negative value on -- an array that is already binarized (i.e. multilabel). |
@mblondel, I am no longer convinced that this is the right place to deprecate multilabel support in … As a result, I have changed this PR back to [MRG] and welcome reviews. (ping @arjoly, I know that you're working in related space)
Can you rebase on top of master? I will try to have a look at this in the coming days.
edit: Ok, you used is_sequence to perform the deprecation.
# Automatically increment on new class
class_mapping = defaultdict(int)
class_mapping.default_factory = class_mapping.__len__
Looks like a hack. Can you comment on what you are doing here?
It is a hack; one that I've already used in CountVectorizer. I've commented it two lines prior, but maybe putting the following in utils is better:
class AutoIncrementer(dict):
    def __missing__(self, key):
        out = len(self)
        self[key] = out
        return out
WDYT?
For what it's worth, this solution is almost twice as slow as the defaultdict hack, but I don't expect this assignment to dominate anywhere we use it:
%timeit a=AutoIncrementer(); sum(a[k] for k in range(100000))
10 loops, best of 3: 66.1 ms per loop
I don't have any strong opinion on this. Let's keep this version.
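For context, a minimal, self-contained sketch of what the defaultdict trick does (the labels here are made up for illustration):

from collections import defaultdict

# default_factory is called with no arguments when a key is missing, so using
# the dict's own __len__ assigns the next free integer to each new key.
class_mapping = defaultdict(int)
class_mapping.default_factory = class_mapping.__len__

for label in ['spam', 'eggs', 'spam', 'ham']:
    class_mapping[label]  # a plain look-up is enough to register the label

print(dict(class_mapping))  # {'spam': 0, 'eggs': 1, 'ham': 2}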
If there's a clear name for it, I'd be happy to refactor some utilities relating to this indexation or vocabulary construct (mapping from objects to contiguous integers from 0 to n_objects - 1). The operations that commonly happen are: create a dict that autoincrements; map that dict to an equivalent array/list; map back the other direction. I'm not sure if these would benefit from being abstracted behind functions.
Yet, for that second function, for example, we have used:
l = sorted(d, key=d.__getitem__)
or similar to convert to a list, but the following is faster (linear complexity; faster in benchmarks):
l = np.empty(len(d), dtype=object)
keys, vals = zip(*six.iteritems(d))
l[list(vals)] = keys
so putting it behind a named function might make sense.
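To make the linear-time version concrete, here is a self-contained sketch of mapping in both directions; the vocabulary contents and variable names are illustrative only:

import numpy as np

# A vocabulary mapping labels to contiguous indices 0..n_objects - 1,
# e.g. built with the auto-incrementing dict above.
vocab = {'spam': 0, 'eggs': 1, 'ham': 2}

# Dict -> array: place each key at the position given by its index.
classes = np.empty(len(vocab), dtype=object)
keys, vals = zip(*vocab.items())
classes[list(vals)] = keys
print(classes)  # ['spam' 'eggs' 'ham']

# Array -> dict: rebuild the mapping from positions.
vocab_back = {label: index for index, label in enumerate(classes)}
print(vocab_back == vocab)  # True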
If those patterns are repeated several times, it would be nice to do some refactoring. But that would be the topic of another PR.
Yes, and I've been wondering whether I should submit a patch to perform parallelised feature extraction for text, which would probably incorporate these sorts of helpers.
While …, what do you think of …?
return yt.toarray()

def _transform(self, y, class_mapping):
    indices = array.array('i')
Can you add a small comment explaining what _transform and class_mapping are? Also avoid the stride trick.
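To make the request concrete, a rough sketch of what such a _transform could look like; this is an illustrative reconstruction based on the fragment above, not necessarily the PR's exact implementation:

import array

import numpy as np
import scipy.sparse as sp


def _transform(y, class_mapping):
    """Sketch: build a CSR indicator matrix from iterables of labels.

    class_mapping maps each label to a column index; with the
    auto-incrementing defaultdict above, unseen labels get new columns.
    """
    indices = array.array('i')       # column index of every set entry
    indptr = array.array('i', [0])   # row boundaries into `indices`
    for labels in y:
        indices.extend(class_mapping[label] for label in labels)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=int)
    return sp.csr_matrix((data, indices, indptr),
                         shape=(len(indptr) - 1, len(class_mapping)))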
LGTM! Thanks @jnothman
Producing multilabel data as a list of sets of labels may be more intuitive. The transformer :class:`MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>` will convert between a collection of collections of labels and the indicator format.
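A brief usage example of the transformer described above:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Collections of labels in, indicator matrix out.
Y = mlb.fit_transform([{'action', 'comedy'}, {'drama'}, {'action', 'drama'}])
print(mlb.classes_)  # ['action' 'comedy' 'drama']
print(Y)
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
# ... and back again.
print(mlb.inverse_transform(Y))
# [('action', 'comedy'), ('drama',), ('action', 'drama')]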
Maybe it would be useful to add a note giving the version at which the MultiLabelBinarizer was added. This can be done using the versionadded markup in Sphinx: http://sphinx-doc.org/markup/para.html
+1
In this case the version will be 0.15.
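For instance, the directive could sit in the class docstring along these lines (a sketch only; the surrounding docstring text is abbreviated):

class MultiLabelBinarizer(object):
    """Transform between iterable of iterables and a multilabel format.

    .. versionadded:: 0.15
    """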
Hi @jnothman, does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository? I will soon be working on #2458 to fill in the missing pieces and be responsive to comments/reviews of the implementation. I would like to make this PR, which is pretty much complete, a prerequisite, because it will be easier to deprecate sequence of sequences now, before #2458 is implemented. Thanks!
GitHub suggests there shouldn't be any conflicts for a merge/rebase. Ideally, we should just merge this PR ASAP, but it needs another review.
A last review would be very nice!
Checked this out, and had to rebase with master. Once that was done, I see 1 failing doctest
Can you rebase with master? This seems like a Very Good Thing (TM)
    classes = sorted(set(itertools.chain.from_iterable(y)))
else:
    classes = self.classes
self.classes_ = np.empty(len(classes), dtype=object)
I would find it more natural to only use the object dtype for non integer y. That would fix the doctest failure.
This could be implemented as (untested):
if all(isinstance(c, int) for c in classes):
    dtype = np.int
else:
    dtype = object
self.classes_ = np.asarray(classes, dtype=dtype)
it's ok for me
As long as we can assume consistent typing from the first element of the iterable of iterables.
Sorry, got confused about context.
I will merge #3246, which includes the fix for the int dtype, as soon as Travis is green.
Alright, merged! Thank you very much @jnothman for the fix.
Great, this is finally done! Congratulations @jnothman on all your efforts!
Thanks for pulling this through.
And I'm looking forward to the 0.15 beta. Is there sense in trying to include in it some support for a sparse multilabel format (e.g. …)?
Towards #2270: this deprecates the sequence-of-sequences multilabel format, with the deprecation warning issued through type_of_target. Its helper, is_sequence_of_sequences, triggers the warning. A further warning applies to make_multilabel_classification. The PR also adds sklearn.preprocessing.MultiLabelBinarizer.
This will require a bit of updating once the alternative sparse matrix support in … is incorporated from #2458 and https://github.com/jnothman/scikit-learn/tree/sparse_multi_metrics ... or vice versa.
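As an illustration of the deprecation path described above, a simplified sketch of the kind of check involved; the helper below is hypothetical and not the actual scikit-learn code:

import warnings


def warn_if_sequence_of_sequences(y):
    """Sketch: warn when y looks like the legacy sequence-of-sequences format."""
    if (isinstance(y, (list, tuple)) and len(y) > 0
            and isinstance(y[0], (list, tuple, set))):
        warnings.warn("Direct support for sequence of sequences multilabel "
                      "representation is deprecated; use "
                      "sklearn.preprocessing.MultiLabelBinarizer to convert "
                      "to a label indicator matrix.",
                      DeprecationWarning)


warn_if_sequence_of_sequences([[1, 2], [3]])  # triggers the DeprecationWarning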