Multi-label Label Binarizer Memory Error #2441

rsivapr opened this Issue · 7 comments

2 participants


I am working on a text classification problem. I have a largish dataset with about 5 million documents and close to 50,000 classes. I have used the TfidfVectorizer to extract features (again, about 1 million features) from the documents.

The problem obviously arises when I try to run any classifier, since OVR uses the label_binarize method, which creates an empty zeros array of shape (6 mil x 50k). This is obviously not going to fit in my memory.
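To put rough numbers on that (a back-of-the-envelope estimate only, assuming one 8-byte integer per cell for the dense indicator matrix and about 3 labels per document for a sparse CSR one):

```python
# Rough memory estimate: dense vs. sparse label indicator matrix.
n_docs = 5 * 10**6      # ~5 million documents
n_classes = 50 * 10**3  # ~50,000 classes

# Dense: one 8-byte integer per (document, class) cell.
dense_bytes = n_docs * n_classes * 8
print("dense:  %.1f TB" % (dense_bytes / 1e12))

# Sparse CSR: only nonzeros are stored. With ~3 labels per document,
# each nonzero costs ~12 bytes (8-byte value + 4-byte column index),
# plus one row pointer per document.
avg_labels = 3
sparse_bytes = n_docs * avg_labels * 12 + (n_docs + 1) * 8
print("sparse: %.2f GB" % (sparse_bytes / 1e9))
```

So the dense matrix is on the order of terabytes, while the sparse equivalent fits comfortably in a fraction of a gigabyte.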

The question I have is: Is there a built-in way around this or should I modify the code for label_binarize to write to a sparse matrix instead? Is this doable?

I am open to any suggestions as well.

ver 0.14.1

>>> clf = OneVsRestClassifier(LinearSVC())
>>>, np_array[:,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/", line 201, in fit
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/", line 88, in fit_ovr
    Y = lb.fit_transform(y)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/", line 408, in fit_transform
    return, **fit_params).transform(X)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/", line 272, in transform
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)),

> The question I have is: Is there a built-in way around this or should I modify the code for label_binarize to write to a sparse matrix instead? Is this doable?

This is doable and would be a very nice addition to scikit-learn. This has already been discussed on the mailing list and in various PRs, but nobody has started to work on this issue yet.

In order to add this new functionality, I would do the following:

  1. modify the function sklearn.utils.multiclass.type_of_target to recognise a sparse binary matrix as the 'multilabel-indicator' format.
  2. modify the function sklearn.utils.multiclass.unique_labels to work with this new format.
  3. modify the label_binarize function to return a sparse label indicator matrix.
  4. modify OVR to work with sparse output matrices, for instance by extracting a dense column of the sparse matrix and fitting the classifier on it. Prediction would build the sparse matrix incrementally out of each fitted classifier.
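Steps 3 and 4 above can be sketched roughly as follows. Note that sparse_label_binarize is a hypothetical stand-in for illustration, not scikit-learn's actual API; the real patch would live inside label_binarize and OneVsRestClassifier:

```python
import numpy as np
import scipy.sparse as sp

def sparse_label_binarize(y, classes):
    """Build a CSR label-indicator matrix without a dense intermediate.

    y is an iterable of label iterables (one entry per sample).
    Hypothetical helper illustrating step 3, not sklearn's API.
    """
    class_to_col = {c: j for j, c in enumerate(classes)}
    rows, cols = [], []
    for i, labels in enumerate(y):
        for label in labels:
            rows.append(i)
            cols.append(class_to_col[label])
    data = np.ones(len(rows), dtype=np.int8)
    return sp.csr_matrix((data, (rows, cols)),
                         shape=(len(y), len(classes)))

y = [["cat"], ["dog", "cat"], ["fish"]]
Y = sparse_label_binarize(y, classes=["cat", "dog", "fish"])
print(Y.toarray())  # 3 samples x 3 classes, four nonzeros

# Step 4: fit one binary classifier per class, densifying only
# one column of the sparse indicator matrix at a time.
Y_csc = Y.tocsc()  # CSC makes column extraction cheap
for j in range(Y_csc.shape[1]):
    column = np.asarray(Y_csc[:, j].todense()).ravel()
    # clf_j.fit(X, column)  # one estimator per label column
```

Only a single dense column (n_samples values) exists in memory at any point, instead of the full n_samples x n_classes array.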

Lastly, the metrics that you want to use for the assessment should be upgraded to support the sparse matrix format.
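For example, a Hamming-loss-style error rate can be computed directly on two sparse indicator matrices without ever densifying them (a sketch, not scikit-learn's own metric code):

```python
import scipy.sparse as sp

def sparse_hamming_loss(Y_true, Y_pred):
    """Fraction of label assignments that differ; both inputs binary CSR."""
    diff = Y_true - Y_pred          # nonzero exactly where the labels disagree
    n_rows, n_cols = Y_true.shape
    return diff.nnz / float(n_rows * n_cols)

Y_true = sp.csr_matrix([[1, 0, 1], [0, 1, 0]])
Y_pred = sp.csr_matrix([[1, 1, 1], [0, 0, 0]])
print(sparse_hamming_loss(Y_true, Y_pred))  # 2 mismatches out of 6 cells
```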

If you want to contribute, I would gladly review and help you to implement such change.


Cool. I have something I hacked on -- I'll work on it a bit more and get back.


@arjoly I worked on the suggestions that you mentioned. This works well for small datasets, but it again fails due to memory when the bigger training set comes into the picture:

>>>, df[:,1])
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/rsivapr/anaconda/lib/python2.7/", line 808, in __bootstrap_inner
  File "/home/rsivapr/anaconda/lib/python2.7/", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/", line 325, in _handle_workers
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/", line 229, in _maintain_pool
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/", line 222, in _repopulate_pool
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/", line 130, in start
    self._popen = Popen(self)
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/", line 121, in __init__ = os.fork()
OSError: [Errno 12] Cannot allocate memory


I have no idea how to go about this. Am I missing something here? Because unlike the previous case, this sparse matrix should not be taking up that much memory.
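One plausible reading of that traceback (an assumption, not confirmed in the thread): it ends in os.fork, not in an array allocation. Parallel fitting via multiprocessing forks the parent process, and the kernel can refuse the fork when it cannot commit a copy of the parent's large address space, even if the label matrix itself is now sparse. Running single-process avoids the fork entirely:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# n_jobs=1 keeps fitting in the parent process: no fork, and no
# copy-on-write duplication of the parent's memory for each worker.
clf = OneVsRestClassifier(LinearSVC(), n_jobs=1)
```

If parallelism is needed, fewer workers (a small n_jobs) reduce the number of simultaneous forked copies.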


Can you open a pull request? It would be easier to discuss your code.


Will be fixed by #3276.


Fixed by #3276, thanks @hamsal.

@arjoly arjoly closed this