ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue] #6614

Closed
StevenLOL opened this issue Apr 1, 2016 · 19 comments

@StevenLOL

To reproduce:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)   # Error when calling fit
# clf = OneVsRestClassifier(estimator=SVC(), n_jobs=1)  # No error if n_jobs is set to 1

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

Output:

ValueError: UPDATEIFCOPY base is read-only

Linux-4.2.0-19-generic-x86_64-with-Ubuntu-15.10-wily
('Python', '2.7.10 (default, Oct 14 2015, 16:09:02) \n[GCC 5.2.1 20151010]')
('NumPy', '1.10.4')
('SciPy', '0.17.0')
('Scikit-Learn', '0.18.dev0')
@StevenLOL StevenLOL changed the title UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 ValueError: UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 Apr 1, 2016
@BiaDarkia

The traceback points to sklearn/externals/joblib/parallel.py as the origin of the error. I am not sure whether this error should occur, but I will look into why exactly it is raised.

@raghavrv
Member

raghavrv commented Apr 8, 2016

Related to #5956

Ping : @lesteve @ogrisel

@lesteve
Member

lesteve commented Apr 12, 2016

Another related, maybe more central, issue is #5481. IIRC, estimators should avoid in-place modification of X, e.g. X -= X.mean(). cc @arthurmensch.

More explanation: when the input data is big enough, joblib uses memmapping by default. This allows the input data to be shared across workers instead of each worker holding its own copy; see the joblib docs on memmapping for more details. The memmap is opened in read-only mode to avoid possible data corruption if different workers were to write into the same data.

If you have access to the joblib.Parallel object, you can pass max_nbytes=None to disable memmapping. Whether that will be faster than just using n_jobs=1 depends on your particular use case, I reckon. From the few cases I looked at, you generally don't have access to the underlying joblib.Parallel object when you create your estimator, so that doesn't really help in most cases.
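
For illustration, here is a minimal sketch (mine, not from the thread) of disabling the automatic memmapping when you do control the Parallel object yourself:

from joblib import Parallel, delayed
import numpy as np

X = np.random.rand(2000, 2000)  # ~32 MB, large enough to trigger memmapping by default

# max_nbytes=None disables joblib's automatic array-to-memmap conversion,
# so each worker receives a regular writable copy instead of a read-only memmap.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.mean)(X[i::4]) for i in range(4)
)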

@raghavrv
Member

Thanks for the lucid explanation, Loic!

@BiaDarkia

Would it be possible to add an optional parameter to estimators that, when set to True, sets max_nbytes=None on the underlying joblib.Parallel call? It shouldn't be too hard to implement and would fix the issue.

@lesteve
Member

lesteve commented Apr 14, 2016

A more proper fix would be to modify sklearn.svm.base.BaseLibSVM._sparse_fit so that it doesn't modify X in place.
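
A hedged sketch (mine, not the actual scikit-learn code) of what avoiding the in-place modification could look like there:

from scipy.sparse import csr_matrix

def ensure_sorted_indices(X):
    # Return a CSR matrix whose indices are sorted, without mutating the input.
    if not X.has_sorted_indices:
        X = X.copy()       # rebind to a private copy; the caller's array is untouched
        X.sort_indices()   # sorting the copy in place is now safe
    return X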

Looking at this problem more closely, I found a work-around, in case it is useful:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train')
data_test = fetch_20newsgroups(subset='test')

clf = OneVsRestClassifier(estimator=SVC(), n_jobs=-1)


class MyTfidfVectorizer(TfidfVectorizer):
    def fit_transform(self, X, y):
        result = super(MyTfidfVectorizer, self).fit_transform(X, y)
        result.sort_indices()
        return result

pipeLine = Pipeline([('tfidf', MyTfidfVectorizer(min_df=10)),
                     ('clf', clf)])

trainx = data_train.data
trainy = data_train.target
evalx = data_test.data
evaly = data_test.target
pipeLine.fit(trainx, trainy)

predictValue = pipeLine.predict(evalx)

print(classification_report(evaly, predictValue))

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn't have its indices sorted.
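
As a small illustration (mine, not from the thread) of the sorted-indices property the work-around relies on:

import numpy as np
from scipy.sparse import csr_matrix

# Build a 1x5 CSR matrix whose single row has deliberately unsorted column indices.
data = np.array([1.0, 2.0])
indices = np.array([3, 1])   # unsorted within the row
indptr = np.array([0, 2])
X = csr_matrix((data, indices, indptr), shape=(1, 5))

print(X.has_sorted_indices)  # False: column indices within the row are unsorted
X.sort_indices()             # sorts them in place
print(X.has_sorted_indices)  # True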

@amueller
Member

IIRC estimators should avoid in-place modification of X, e.g. X -= X.mean()

There should be a copy parameter to control this.

@amueller
Member

@lesteve close as duplicate of #5481?

@lesteve
Member

lesteve commented Jul 27, 2016

It does seem very, very similar to #5481, although this one needs sparse matrices to be triggered, IIRC.

@amueller
Member

OK. We should anyhow add a common test that covers both dense and sparse. But alright, let's leave this open to make sure we cover this case.

@armancohan

Getting the same error.

@ryandolen

Any updates here? Still getting the same error.

@lesteve
Member

lesteve commented Jun 8, 2017

No updates, I am afraid. You are more than welcome to submit a PR for this issue or to use the work-around suggested in #6614 (comment).

thornhale added a commit to thornhale/fake_news that referenced this issue Jun 25, 2017
… Now trying to find cause of difference in shape of train and test features.
@lesteve
Member

lesteve commented Apr 4, 2018

For the record I opened scipy/scipy#8678.

StefanoFrazzetto added a commit to StefanoFrazzetto/CrimeDetector that referenced this issue Apr 4, 2019
Because of ValueError: WRITEBACKIFCOPY base is read-only
Bug: scikit-learn/scikit-learn#6614
Solution: none.
@csvankhede

I got a similar issue.

clf = OneVsRestClassifier(LogisticRegression(penalty='l2'), n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
pred_y = clf.predict(x_test_multilabel)

When I changed the fit call as below, it worked.

clf.fit(x_train_multilabel.copy(), y_train)

@rth rth changed the title ValueError: UPDATEIFCOPY base is read-only, when OneVsRestClassifier meets SVC with n_jobs!=1 ValueError in sparse_array.astype when read-only with unsorted indices [scipy issue] Jun 28, 2019
@rth
Member

rth commented Jun 28, 2019

Thanks @csvankhede! Yes, it's a scipy bug, as mentioned above. Adding a workaround to all code that uses sparse_array.astype in scikit-learn would probably be difficult.

Just for the record, would

x_train_multilabel.sort_indices()
clf.fit(x_train_multilabel, y_train)

also work in your case?

@emsi

emsi commented Aug 1, 2019

This is based on the fact that the in-place modification only happens when the output of the TfidfVectorizer doesn't have its indices sorted.

Adding a dummy sorting transformer to the pipeline is an even simpler workaround :)

class SortVectorizer():
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X.sort_indices()
        return X
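
For context, a hypothetical way to slot this into the pipeline from the original report (names reused from the earlier snippets; a sketch, not a tested recipe):

pipeLine = Pipeline([('tfidf', TfidfVectorizer(min_df=10)),
                     ('sort', SortVectorizer()),
                     ('clf', clf)])

Since SortVectorizer only implements fit and transform, Pipeline falls back to calling fit followed by transform, so the indices are sorted before the data reaches the classifier.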

Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 21, 2019
This PR introduces the optional *max_nbytes* parameter on the
*OneVsRestClassifier*, *OneVsOneClassifier* and *OutputCodeClassifier*
multiclass learning algorithms within *multiclass.py*.

This parameter complements the already existing *n_jobs* one and
might be useful when dealing with a large training set processed by
concurrently running jobs, as defined by *n_jobs* > 0 or *n_jobs* = -1
(meaning that the number of jobs is set to the number of CPU cores). In
this case,
[Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation)
is called with the default "loky" backend, which [implements
multi-processing](https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism);
*Parallel* also sets a default 1-megabyte
[threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion)
on the size of arrays passed to the workers. This threshold may not be
enough for large arrays and can break the job with the exception
**ValueError: UPDATEIFCOPY base is read-only**. *Parallel* uses
*max_nbytes* to control this threshold. Through this fix, the multiclass
classifiers offer the option to customize the maximum size of arrays.

Fixes scikit-learn#6614
Expected to also fix
scikit-learn#4597
Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 23, 2019
Changing the *OneVsRestClassifier*, *OneVsOneClassifier* and
*OutputCodeClassifier* multiclass learning algorithms within
multiclass.py, by replacing the "n_jobs" parameter with a keyworded,
variable-length argument list, in order to allow any "Parallel"
parameter to be passed, as well as to support the "parallel_backend"
context manager.

"n_jobs" remains one of the possible parameters, but others can be
added, including "max_nbytes", which might be useful in order to avoid
the ValueError when dealing with a large training set processed by
concurrently running jobs, as defined by *n_jobs* > 0 or *n_jobs* = -1.

More specifically, in parallel computing of large arrays with the "loky"
backend,
[Parallel](https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation)
sets a default 1-megabyte
[threshold](https://joblib.readthedocs.io/en/latest/parallel.html#automated-array-to-memmap-conversion)
on the size of arrays passed to the workers. This threshold may not be
enough for large arrays and can break jobs with the exception
**ValueError: UPDATEIFCOPY base is read-only**.

*Parallel* uses *max_nbytes* to control this threshold.

Through this fix, the multiclass classifiers offer the option to
customize the maximum size of arrays.

Fixes scikit-learn#6614
See also scikit-learn#4597
Ircama added a commit to Ircama/scikit-learn that referenced this issue Nov 24, 2019
Changed _get_args in _testing.py in order to also accept
'parallel_params' vararg.
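
For reference, the "parallel_backend" context manager mentioned in these commits is existing joblib API; a minimal sketch (mine, assuming clf, X and Y defined as elsewhere in this thread) of wrapping a fit call with it:

from joblib import parallel_backend

# Configure the backend and worker count around fit() instead of through the
# estimator; note this does not change the 1 MB memmapping threshold itself.
with parallel_backend('loky', n_jobs=4):
    clf.fit(X, Y)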
@damienlancry

damienlancry commented Feb 16, 2022

Thanks for the workaround; I am no longer getting the error "ValueError: UPDATEIFCOPY base is read-only".
However, I am now getting:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The code to reproduce is the following:

lr_kwargs = dict(penalty="l1", random_state=42, solver="liblinear", fit_intercept=False)
ovrlr = OneVsRestClassifier(LogisticRegression(**lr_kwargs), n_jobs=-1)
X.sort_indices()
ovrlr.fit(X, Y)

where X is a (1375308, 80614) scipy sparse matrix obtained with a CountVectorizer() and Y is a (1375308, 157) binarized label matrix.

I am using a machine with 72 CPU cores and 252 GB of RAM. I previously ran the same code with n_jobs=1 and it took 4 hours 56 minutes, so I was expecting up to a 72x speedup. The run crashed after 5 minutes; since 4 h 56 min / 72 is roughly 4 minutes, that suggests the training was almost finished when it died.

Apologies if this is unrelated, or if I am late to the party.

EDIT: I used n_jobs=30 and it works; the job was crashing simply because the parallel workers were exhausting the RAM.

@glemaitre
Member

I am closing this issue, since it should be solved by installing the upcoming SciPy release now that scipy/scipy#18192 has been merged.
