
Multi-label Label Binarizer Memory Error #2441

Closed
rsivapr opened this Issue · 7 comments

2 participants

@rsivapr

I am working on a text classification problem. I have a largish dataset with about 5 million documents and close to 50000 classes. I have used the TfidfVectorizer to extract features (again about 1 million features) from the documents.

The problem arises when I try to run any classifier: OVR uses the label_binarize method, which creates a zeros array of shape (6mil x 50k). This is obviously not going to fit in my memory.
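As a rough sanity check (assuming np.int maps to int64, as it does on most 64-bit platforms), the dense allocation really is hopeless:

```python
# Back-of-the-envelope for the dense allocation described above
# (np.int defaults to a 64-bit integer, i.e. 8 bytes per entry).
n_samples, n_classes = 6_000_000, 50_000
bytes_needed = n_samples * n_classes * 8
print(bytes_needed / 1024**4)  # roughly 2.18 TiB
```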

The question I have is: Is there a built-in way around this or should I modify the code for label_binarize to write to a sparse matrix instead? Is this doable?

I am open to any suggestions as well.

ver 0.14.1

>>> clf = OneVsRestClassifier(LinearSVC())
>>> clf.fit(xtrain, np_array[:,1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 201, in fit
    n_jobs=self.n_jobs)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/multiclass.py", line 88, in fit_ovr
    Y = lb.fit_transform(y)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/rsivapr/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)
MemoryError
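One way around the dense np.zeros allocation is to build the indicator matrix in scipy's sparse format instead. Here is a minimal sketch (sparse_label_binarize is a hypothetical helper, not a scikit-learn API):

```python
# Binarize multi-label targets into a sparse CSR indicator matrix
# instead of a dense zeros array. Hypothetical helper, not sklearn API.
import numpy as np
import scipy.sparse as sp

def sparse_label_binarize(y, classes):
    """y: iterable of label iterables; classes: the full label set."""
    class_index = {c: j for j, c in enumerate(classes)}
    rows, cols = [], []
    for i, labels in enumerate(y):
        for label in labels:
            rows.append(i)
            cols.append(class_index[label])
    data = np.ones(len(rows), dtype=np.int8)
    # CSR storage is O(number of assigned labels),
    # not O(n_samples * n_classes) like the dense array.
    return sp.csr_matrix((data, (rows, cols)),
                         shape=(len(y), len(classes)))

Y = sparse_label_binarize([["a"], ["a", "c"], ["b"]], classes=["a", "b", "c"])
print(Y.toarray())
# [[1 0 0]
#  [1 0 1]
#  [0 1 0]]
```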
@arjoly
Owner

The question I have is: Is there a built-in way around this or should I modify the code for label_binarize to write to a sparse matrix instead? Is this doable?

This is doable and would be a very nice addition to scikit-learn. It has already been discussed on the mailing list and in various PRs, but nobody has started working on this issue yet.

In order to add this new functionality, I would do the following:

  1. modify the function sklearn.utils.multiclass.type_of_target to recognise a sparse binary matrix as the 'multilabel-indicator' format.
  2. modify the function sklearn.utils.multiclass.unique_labels to work with this new format.
  3. modify the label_binarize function to return a sparse label indicator.
  4. modify OVR to work with sparse output matrices, for instance by extracting a dense column of the sparse matrix and fitting the classifier on it. Prediction would be made by building the sparse matrix incrementally out of each fitted classifier.
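Step 4 could be sketched roughly like this (fit_ovr_sparse and predict_ovr_sparse are hypothetical names, not scikit-learn APIs; this is only an illustration of the idea, not the eventual implementation):

```python
# Fit one binary classifier per label column of a sparse indicator
# matrix, densifying a single column at a time, then assemble sparse
# predictions column by column. Hypothetical sketch, not sklearn code.
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC

def fit_ovr_sparse(X, Y):
    """Fit one LinearSVC per column of the sparse indicator matrix Y."""
    Y = sp.csc_matrix(Y)  # CSC makes column slicing cheap
    estimators = []
    for j in range(Y.shape[1]):
        # densify one (n_samples,) column, never the whole matrix
        y_col = np.asarray(Y[:, j].todense()).ravel()
        estimators.append(LinearSVC().fit(X, y_col))
    return estimators

def predict_ovr_sparse(estimators, X):
    # build the sparse prediction matrix incrementally, per classifier
    cols = [sp.csr_matrix(est.predict(X)).T for est in estimators]
    return sp.hstack(cols).tocsr()

# toy check on a separable one-feature problem
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = sp.csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]]))
pred = predict_ovr_sparse(fit_ovr_sparse(X, Y), X)
```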

Lastly, the metrics that you want to use for assessment should also be upgraded to support the sparse matrix format.

If you want to contribute, I would gladly review and help you implement such a change.

@rsivapr

Cool. I have something I hacked on -- I'll work on it a bit more and get back.

@rsivapr

@arjoly I worked on the suggestions that you mentioned. This works well for small datasets, but it again fails due to memory when the bigger training set comes into the picture:

>>> ovr.fit(Xt, df[:,1])
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/rsivapr/anaconda/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/home/rsivapr/anaconda/lib/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/pool.py", line 325, in _handle_workers
    pool._maintain_pool()
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/pool.py", line 229, in _maintain_pool
    self._repopulate_pool()
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
    w.start()
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/home/rsivapr/anaconda/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Killed

I have no idea how to get around this. Am I missing something here? Unlike the previous case, the sparse matrix should not be taking up that much memory.
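For what it's worth, the OSError above comes from os.fork() inside multiprocessing rather than from the sparse matrix itself: each worker starts as a copy of the (already large) parent process, and the fork fails. One possible workaround is to keep the fitting in a single process (a sketch, assuming the stock OneVsRestClassifier):

```python
# Avoid the fork by running OVR without worker processes.
# n_jobs=1 keeps everything in the parent process.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

clf = OneVsRestClassifier(LinearSVC(), n_jobs=1)
```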

@arjoly
Owner

Can you open a pull request? It would be easier to discuss your code there.

@arjoly
Owner

Will be fixed by #3276.

@arjoly
Owner

Fixed by #3276. Thanks @hamsal!

@arjoly arjoly closed this