ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue] #6614

Closed
StevenLOL opened this issue Apr 1, 2016 · 19 comments

@StevenLOL

To reproduce:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)   # Error when calling fit
# clf = OneVsRestClassifier(estimator=SVC(), n_jobs=1)  # No error if n_jobs is set to 1

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

Output:

ValueError: UPDATEIFCOPY base is read-only

Linux-4.2.0-19-generic-x86_64-with-Ubuntu-15.10-wily
('Python', '2.7.10 (default, Oct 14 2015, 16:09:02) \n[GCC 5.2.1 20151010]')
('NumPy', '1.10.4')
('SciPy', '0.17.0')
('Scikit-Learn', '0.18.dev0')
@StevenLOL StevenLOL changed the title UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 ValueError: UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 Apr 1, 2016
@BiaDarkia

The traceback points to sklearn/externals/joblib/parallel.py as the origin of the error. I am not sure whether this error should occur, but I will look into why exactly it is raised.

@raghavrv
Member

raghavrv commented Apr 8, 2016

Related to #5956

Ping : @lesteve @ogrisel

@lesteve
Member

lesteve commented Apr 12, 2016

Another related, maybe more central, issue is #5481. IIRC, estimators should avoid in-place modification of X, e.g. X -= X.mean(). cc @arthurmensch.

More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the input data to be shared across workers instead of each worker holding its own copy; see the joblib docs on memmapping for more details. The memmap is opened in read-only mode to avoid possible data corruption if different workers were to write into the same data.

If you have access to the joblib.Parallel object, you can pass max_nbytes=None to disable memmapping. Whether that will be faster than just using n_jobs=1 depends on your particular use case, I reckon. From the few cases I looked at, you generally don't have access to the underlying joblib.Parallel object when you create your estimator, so that doesn't really help in most cases.
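
For illustration, here is a minimal sketch (mine, not from the thread) of disabling the automatic memmapping when you do control the Parallel object yourself:

from joblib import Parallel, delayed
import numpy as np

X = np.random.rand(2000, 2000)  # ~32 MB, large enough to trigger memmapping by default

# max_nbytes=None disables joblib's automatic array-to-memmap conversion,
# so each worker receives a regular writable copy instead of a read-only memmap.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.mean)(X[i::4]) for i in range(4)
)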

@raghavrv
Member

Thanks for the lucid explanation, Loic!

@BiaDarkia

Would it be possible to add an optional parameter to estimators that, when set to True, sets max_nbytes=None on the underlying joblib.Parallel call? It shouldn't be too hard to implement and would fix the issue.

@lesteve
Member

lesteve commented Apr 14, 2016

A more proper fix would be to modify sklearn.svm.base.BaseLibSVM._sparse_fit so that it doesn't modify X in place.
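
A hedged sketch (mine, not the actual scikit-learn code) of what avoiding the in-place modification could look like there:

from scipy.sparse import csr_matrix

def ensure_sorted_indices(X):
    # Return a CSR matrix whose indices are sorted, without mutating the input.
    if not X.has_sorted_indices:
        X = X.copy()       # rebind to a private copy; the caller's array is untouched
        X.sort_indices()   # sorting the copy in place is now safe
    return X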

Looking at this problem more closely, I found a work-around, in case it is useful:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)


class MyTfidfVectorizer(TfidfVectorizer):
    def fit_transform(self, X, y):
        result = super(MyTfidfVectorizer, self).fit_transform(X, y)
        result.sort_indices()
        return result

pipeLine = Pipeline([('tfidf', MyTfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn't have its indices sorted.
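
As a small illustration (mine, not from the thread) of the sorted-indices property the work-around relies on:

import numpy as np
from scipy.sparse import csr_matrix

# Build a 1x5 CSR matrix whose single row has deliberately unsorted column indices.
data = np.array([1.0, 2.0])
indices = np.array([3, 1])   # unsorted within the row
indptr = np.array([0, 2])
X = csr_matrix((data, indices, indptr), shape=(1, 5))

print(X.has_sorted_indices)  # False: column indices within the row are unsorted
X.sort_indices()             # sorts them in place
print(X.has_sorted_indices)  # True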

@amueller
Member

IIRC estimators should avoid in-place modification of X, e.g. X -= X.mean()

There should be a copy parameter to control this.

@amueller
Member

@lesteve close as duplicate of #5481?

@lesteve
Member

lesteve commented Jul 27, 2016

It does seem very, very similar to #5481, although this one needs sparse matrices to be triggered, IIRC.

@amueller
Member

OK. We should anyhow add a common test that covers both dense and sparse. But alright, let's leave this open to make sure we cover this case.

@armancohan

Getting the same error.

@ryandolen

Any updates here? Still getting the same error.

@lesteve
Member

lesteve commented Jun 8, 2017

No updates, I am afraid. You are more than welcome to submit a PR for this issue or to use the work-around suggested in #6614 (comment).

thornhale added a commit to thornhale/fake_news that referenced this issue Jun 25, 2017
… Now trying to find cause of difference in shape of train and test features.
@lesteve
Member

lesteve commented Apr 4, 2018

For the record I opened scipy/scipy#8678.

StefanoFrazzetto added a commit to StefanoFrazzetto/CrimeDetector that referenced this issue Apr 4, 2019
Because of ValueError: WRITEBACKIFCOPY base is read-only
Bug: scikit-learn/scikit-learn#6614
Solution: none.
@csvankhede

I got a similar issue.

clf = OneVsRestClassifier(LogisticRegression(penalty='l2'), n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
pred_y = clf.predict(x_test_multilabel)

When I changed the fit call as below, it worked.

clf.fit(x_train_multilabel.copy(), y_train)

@rth rth changed the title ValueError: UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue] Jun 28, 2019
@rth
Member

rth commented Jun 28, 2019

Thanks @csvankhede! Yes, it's a scipy bug, as mentioned above. Adding a workaround to all code that uses sparse_array.astype in scikit-learn would probably be difficult.

Just for the record, would

x_train_multilabel.sort_indices()
clf.fit(x_train_multilabel, y_train)

also work in your case?

@emsi

emsi commented Aug 1, 2019

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn't have its indices sorted.

Adding a dummy sorting transformer to the pipeline is an even simpler workaround :)

class SortVectorizer():
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X.sort_indices()
        return X
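
For context, a hypothetical way to slot this into the pipeline from the original report (names reused from the earlier snippets; a sketch, not a tested recipe):

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('sort', SortVectorizer()),
                     ('clf', clf)])

Since SortVectorizer only implements fit and transform, Pipeline falls back to calling fit followed by transform, so the indices are sorted before the data reaches the classifier.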

Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 21, 2019
This PR introduces the optional *max_nbytes* parameter on the
*OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier*
multiclass learning algorithms within *multiclass.py*.

This parameter complements the already existing *n_jobs* one and
might be useful when dealing with a large training set processed by
concurrently running jobs, as defined by *n_jobs* > 0 or *n_jobs* = -1
(meaning that the number of jobs is set to the number of CPU cores). In
this case,
[Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation)
is called with the default "loky" backend, which [implements
multi-processing](https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism);
*Parallel* also sets a default 1-megabyte
[threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion)
on the size of arrays passed to the workers. This threshold may not be
enough for large arrays and can break the job with the exception
**ValueError: UPDATEIFCOPY base is read-only**. *Parallel* uses
*max_nbytes* to control this threshold. Through this fix, the multiclass
classifiers offer the option to customize the maximum size of arrays.

Fixes scikit-learn#6614
Expected to also fix
scikit-learn#4597
Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 23, 2019
Changing the *OneVsRestClassifier*, *OneVsOneClassifier* and
*OutputCodeClassifier* multiclass learning algorithms within
multiclass.py, by replacing the "n_jobs" parameter with a keyworded,
variable-length argument list, in order to allow any "Parallel"
parameter to be passed, as well as to support the "parallel_backend"
context manager.

"n_jobs" remains one of the possible parameters, but others can be
added, including "max_nbytes", which might be useful in order to avoid
the ValueError when dealing with a large training set processed by
concurrently running jobs, as defined by *n_jobs* > 0 or *n_jobs* = -1.

More specifically, in parallel computing of large arrays with the "loky"
backend,
[Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation)
sets a default 1-megabyte
[threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion)
on the size of arrays passed to the workers. This threshold may not be
enough for large arrays and can break jobs with the exception
**ValueError: UPDATEIFCOPY base is read-only**.

*Parallel* uses *max_nbytes* to control this threshold.

Through this fix, the multiclass classifiers offer the option to
customize the maximum size of arrays.

Fixes scikit-learn#6614
See also scikit-learn#4597
Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 24, 2019
Changed _get_args in _testing.py in order to also accept
'parallel_params' vararg.
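
For reference, the "parallel_backend" context manager mentioned in these commits is existing joblib API; a minimal sketch (mine, assuming clf, X and Y defined as elsewhere in this thread) of wrapping a fit call with it:

from joblib import parallel_backend

# Configure the backend and worker count around fit() instead of through the
# estimator; note this does not change the 1 MB memmapping threshold itself.
with parallel_backend('loky', n_jobs=4):
    clf.fit(X, Y)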
@damienlancry

damienlancry commented Feb 16, 2022

Thanks for the workaround; I am no longer getting the error "ValueError: UPDATEIFCOPY base is read-only".
However, I am now getting:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The code to reproduce is the following:

lr_kwargs = dict(penalty="l1", random_state=42, solver="liblinear", fit_intercept=False)
ovrlr = OneVsRestClassifier(LogisticRegression(**lr_kwargs), n_jobs=-1)
X.sort_indices()
ovrlr.fit(X, Y)

where X is a (1375308, 80614) scipy sparse matrix obtained with a CountVectorizer() and Y is a (1375308, 157) binarized label matrix.

I am using a machine with 72 CPU cores and 252 GB of RAM. I previously ran the same code with n_jobs=1 and it took 4 hours 56 minutes, so I was expecting up to a 72x speedup. The run crashed after 5 minutes; since 4 h 56 min / 72 is roughly 4 minutes, that suggests the training was almost finished when it died.

Apologies if this is unrelated, or if I am late to the party.

EDIT: I used n_jobs=30 and it works; the job was crashing simply because the parallel workers were exhausting the RAM.

@glemaitre
Member

I am closing this issue, since it should be solved by installing the upcoming SciPy release now that scipy/scipy#18192 has been merged.
