Closed
Description
What happened + What you expected to happen
I get a "buffer source array is read-only" error when passing certain scikit-learn models to Ray tasks.
- Possibly related to Cython or serialization? It seems to affect sklearn models backed by a compiled C library (here, SVC via libsvm).
- If I pickle the model explicitly, it works (see the reproduction script).
- ray.put does not help.
- multiprocessing.Pool works, but Ray's Pool does not.
ray.exceptions.RayTaskError(ValueError): ray::func() (pid=45108, ip=172.25.209.242)
  File "repro1.py", line 23, in func
    y_pred = model.predict_proba(X_eval)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.12/site-packages/sklearn/svm/_base.py", line 869, in predict_proba
    return pred_proba(X)
           ^^^^^^^^^^^^^
  File "lib/python3.12/site-packages/sklearn/svm/_base.py", line 909, in _dense_predict_proba
    pprob = libsvm.predict_proba(
            ^^^^^^^^^^^^^^^^^^^^^
  File "sklearn/svm/_libsvm.pyx", line 475, in sklearn.svm._libsvm.predict_proba
  File "stringsource", line 660, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 350, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
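The last frames show Cython refusing to create a writable memoryview over a read-only buffer. One plausible explanation, which is an assumption and not confirmed in this report, is that the worker receives the model's NumPy arrays as read-only views backed by Ray's object store. A minimal sketch for checking that inside a task follows; `find_readonly_arrays` and `StandInModel` are hypothetical names used only for illustration, and the helper inspects top-level attributes only.

```python
import numpy as np


def find_readonly_arrays(obj):
    """Return the names of top-level ndarray attributes that are not writable."""
    return sorted(
        name
        for name, value in vars(obj).items()
        if isinstance(value, np.ndarray) and not value.flags.writeable
    )


class StandInModel:
    """Hypothetical stand-in for a fitted estimator; not an sklearn class."""

    def __init__(self):
        self.support_vectors_ = np.zeros((3, 2))
        self.intercept_ = np.zeros(1)


model = StandInModel()
# Mimic what a worker might see if an array is backed by read-only shared memory.
model.support_vectors_.setflags(write=False)

readonly_names = find_readonly_arrays(model)
print(readonly_names)
```

Calling `find_readonly_arrays(model)` at the top of `func` in the reproduction script, with and without `-p`, would show whether the two code paths differ in array writability.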
So there is a workaround (pickling the model explicitly), but I am not sure whether this is a Ray problem or a scikit-learn problem. I cannot reproduce the same error in pure scikit-learn, although there were similar (now fixed) bugs in sklearn.
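If read-only arrays turn out to be the cause, another possible workaround besides explicit pickling is to copy those arrays inside the task before predicting. The sketch below is untested against the reproduction script; `with_writable_arrays` and `StandInModel` are hypothetical names, not part of Ray or scikit-learn, and only top-level ndarray attributes are handled.

```python
import copy

import numpy as np


def with_writable_arrays(model):
    """Return a shallow copy of `model` with read-only ndarray attributes
    replaced by writable copies. The original object is left untouched."""
    clone = copy.copy(model)
    for name, value in list(vars(clone).items()):
        if isinstance(value, np.ndarray) and not value.flags.writeable:
            # ndarray.copy() allocates new memory owned by the result, so it is writable.
            setattr(clone, name, value.copy())
    return clone


class StandInModel:
    """Hypothetical stand-in for a fitted estimator; not an sklearn class."""

    def __init__(self):
        self.coef_ = np.arange(4.0)


frozen = StandInModel()
frozen.coef_.setflags(write=False)  # simulate a read-only array seen by a worker

fixed = with_writable_arrays(frozen)
print(frozen.coef_.flags.writeable, fixed.coef_.flags.writeable)
```

The copy costs extra memory per task, which gives up whatever benefit shared read-only arrays provide, so this is a stopgap and not a fix.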
Versions / Dependencies
master
Reproduction script
The following script works fine when run with the pickling option (-p); without it, it fails with the error above.
import argparse
import pickle

import ray
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def create_model():
    X, y = make_classification(n_samples=1000)
    model = SVC(kernel="linear", probability=True)
    model.fit(X, y)
    return model


@ray.remote
def func(model: SVC | bytes):
    if isinstance(model, bytes):
        model = pickle.loads(model)
    X_eval, _ = make_classification(n_samples=1000)
    y_pred = model.predict_proba(X_eval)
    return y_pred


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", action="store_true", help="use pickle to load model")
    args = parser.parse_args()
    model = create_model()
    if args.p:
        model = pickle.dumps(model)
    refs = [
        func.remote(model)
        for _ in range(2)
    ]
    ray.get(refs)


if __name__ == "__main__":
    main()
Issue Severity
Medium: It is a significant difficulty but I can work around it.