
[Core] Read-only buffer error in some scikit-learn models #52571

Closed
@wingkitlee0

Description


What happened + What you expected to happen

I get a "buffer source array is read-only" error when calling certain scikit-learn models inside a Ray task.

  • Possibly related to Cython or serialization? It seems to affect sklearn models whose implementation is backed by a C library (here libsvm).
  • If I pickle the model explicitly, it works (see the example).
  • ray.put does not help.
  • multiprocessing.Pool works, but Ray's Pool does not.
ray.exceptions.RayTaskError(ValueError): ray::func() (pid=45108, ip=172.25.209.242)
  File "repro1.py", line 23, in func
    y_pred = model.predict_proba(X_eval)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.12/site-packages/sklearn/svm/_base.py", line 869, in predict_proba
    return pred_proba(X)
           ^^^^^^^^^^^^^
  File "lib/python3.12/site-packages/sklearn/svm/_base.py", line 909, in _dense_predict_proba
    pprob = libsvm.predict_proba(
            ^^^^^^^^^^^^^^^^^^^^^
  File "sklearn/svm/_libsvm.pyx", line 475, in sklearn.svm._libsvm.predict_proba
  File "stringsource", line 660, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 350, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
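
To make the error message concrete, here is a minimal NumPy-only sketch of what "read-only" means at the buffer level. It assumes, and does not prove, that the arrays stored on the model arrive in the Ray worker as read-only views (Ray documents that NumPy arrays read from its object store are read-only). The array name and the `is_writable_buffer` helper are hypothetical and only serve the illustration; nothing here calls Ray or scikit-learn.

```python
import numpy as np


def is_writable_buffer(arr: np.ndarray) -> bool:
    """Return True if `arr` exports a writable buffer (hypothetical helper)."""
    return not memoryview(arr).readonly


# Stand-in for an array held by a fitted model (name is illustrative only).
support_vectors = np.zeros((3, 2))
support_vectors.setflags(write=False)  # emulate a read-only array

print(is_writable_buffer(support_vectors))         # False
print(is_writable_buffer(support_vectors.copy()))  # True: a copy owns fresh, writable memory

try:
    support_vectors[0, 0] = 1.0
except ValueError as exc:
    # NumPy refuses in-place writes on a read-only array.
    print(type(exc).__name__, exc)
```

Code that requests a writable buffer from such an array is rejected in the same spirit, which would be consistent with the `memoryview.__cinit__` frame in the traceback above.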

So pickling the model is a workaround, but I am not sure whether this is a Ray problem or a scikit-learn problem. I cannot reproduce the error in pure scikit-learn, although sklearn has had similar (now fixed) bugs.

Versions / Dependencies

master

Reproduction script

The following script fails with the error above by default. It works fine when the pickling option (-p) is used.

import argparse
import pickle
import ray
from sklearn.datasets import make_classification
from sklearn.svm import SVC


def create_model():
    X, y = make_classification(n_samples=1000)

    model = SVC(kernel="linear", probability=True)
    model.fit(X, y)

    return model


@ray.remote
def func(model: SVC | bytes):
    # With -p the model arrives as pickled bytes and is unpickled inside the task.
    if isinstance(model, bytes):
        model = pickle.loads(model)

    X_eval, _ = make_classification(n_samples=1000)

    y_pred = model.predict_proba(X_eval)

    return y_pred


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", action="store_true", help="use pickle to load model")
    args = parser.parse_args()

    model = create_model()

    if args.p:
        model = pickle.dumps(model)

    refs = [
        func.remote(model)
        for _ in range(2)
    ]

    ray.get(refs)


if __name__ == "__main__":
    main()
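
One way to narrow down whether the arrays on the model are read-only inside the task is to list them before calling `predict_proba`. The sketch below is a hypothetical diagnostic, not part of Ray or scikit-learn: `readonly_array_attrs` and `FakeModel` are made-up names, and `FakeModel` is only a stand-in so the snippet runs without either library. In the script above, the helper could be called on `model` at the top of `func`.

```python
import numpy as np


def readonly_array_attrs(obj: object) -> list[str]:
    """Names of instance attributes that are read-only NumPy arrays."""
    return sorted(
        name
        for name, value in vars(obj).items()
        if isinstance(value, np.ndarray) and not value.flags.writeable
    )


class FakeModel:
    """Stand-in for a fitted estimator (attribute names are illustrative)."""

    def __init__(self) -> None:
        self.support_vectors_ = np.zeros((2, 2))
        self.dual_coef_ = np.ones((1, 2))
        self.kernel = "linear"


demo = FakeModel()
demo.support_vectors_.setflags(write=False)  # emulate one read-only array

print(readonly_array_attrs(demo))
```

If the list is non-empty in the failing run and empty in the `-p` run, that would point at how the model is delivered to the worker rather than at the estimator itself.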

Issue Severity

Medium: It is a significant difficulty but I can work around it.


Labels

P2 (Important issue, but not time-critical), community-backlog, core (Issues that should be addressed in Ray Core), question (Just a question :))
