Trust internal scikit-learn types needed for GB/HGB models#513
adrinjalali merged 13 commits into main from
Conversation
Address review comments from PR #508:
- Add sklearn internal types to _trusted_types.py instead of creating a new SklearnInternalObjectNode (review comment 2)
- Include internal types in ObjectNode's trusted defaults
- Add tests that save/load and verify correctness (review comment 3)
- Filter non-sklearn-prefixed types where appropriate (review comment 1)

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/6dadf905-6073-4057-bf3d-faed31a559fe
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
adrinjalali
left a comment
@copilot please check this review
```python
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", module="sklearn")
    estimator.fit(X, y)
```
tell me what warnings we're catching and ignoring here
No warnings are actually produced during fitting of these estimators — I verified this by running all 9 configs with warnings.catch_warnings(record=True) and none produced any warnings. The suppression was cargo-culted from test_can_persist_fitted which does need it for some sklearn estimators. Removed in f17710e.
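As a sketch of the verification described above, `warnings.catch_warnings(record=True)` can be used to assert that a callable emits nothing (the `run_and_record` helper below is a hypothetical illustration, not skops or test-suite code):

```python
import warnings

# Hypothetical helper: run a callable and return the warnings it emitted.
def run_and_record(fn):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, never filter
        fn()
    return caught

# A callable that emits nothing leaves the record empty ...
assert run_and_record(lambda: None) == []

# ... while one that warns is recorded instead of printed.
assert len(run_and_record(lambda: warnings.warn("deprecated"))) == 1
```

Running every fit config through such a recorder and asserting an empty list is a stricter check than silently suppressing warnings with `filterwarnings("ignore")`.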
```python
try:
    from sklearn._loss._loss import CyHalfMultinomialLoss

    _SKLEARN_INTERNAL_TYPES.append(CyHalfMultinomialLoss)
except ImportError:
    pass
```
this feels sketchy, which you also had trouble with while working on it. Tell me where the issue comes from, and what would need to happen, maybe on the sklearn side, for this not to be like this.
`CyHalfMultinomialLoss` is a Cython extension type from `sklearn._loss._loss`. Its `__module__` reports `_loss` instead of the fully qualified `sklearn._loss._loss`: the Cython .pyx source file doesn't set the module name to the full package path. All `Cy*` types in `sklearn._loss._loss` have this same issue (`CyAbsoluteError`, `CyHalfBinomialLoss`, etc.).
On the sklearn side, this could be fixed by explicitly setting `__module__ = 'sklearn._loss._loss'` in the Cython source (or by using a build-time configuration that embeds the full module path). This is the same issue that affects the existing `ALLOWED_LOSSES`: all `Cy*` loss types there also report `_loss.Cy*` rather than `sklearn._loss._loss.Cy*`.
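The effect can be mimicked in plain Python (the `CyFakeLoss` class below is a hypothetical stand-in for the Cython types, not sklearn code): a type's reported qualified name follows whatever `__module__` says, so setting it explicitly is enough to fix the prefix.

```python
# Hypothetical stand-in for a Cython loss type; not actual sklearn code.
class CyFakeLoss:
    pass

# Mimic the buggy Cython build: only the bare module name is recorded.
CyFakeLoss.__module__ = "_loss"
buggy_name = f"{CyFakeLoss.__module__}.{CyFakeLoss.__qualname__}"
assert buggy_name == "_loss.CyFakeLoss"

# The proposed sklearn-side fix: embed the full package path explicitly.
CyFakeLoss.__module__ = "sklearn._loss._loss"
fixed_name = f"{CyFakeLoss.__module__}.{CyFakeLoss.__qualname__}"
assert fixed_name == "sklearn._loss._loss.CyFakeLoss"
```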
Removed CyHalfMultinomialLoss from auto-trust and added a comment explaining this in f17710e. Multiclass GradientBoostingClassifier will surface it as an untrusted type that users need to explicitly trust.
```python
except ImportError:
    pass

SKLEARN_INTERNAL_TYPE_NAMES = [get_type_name(t) for t in _SKLEARN_INTERNAL_TYPES]
```
I think we should have a filter on all sklearn types so that they must start with `sklearn.` when we get their type names, so that we don't end up accidentally trusting a monkey-patched type coming from a different lib.
Added the `sklearn.` prefix filter in f17710e. `SKLEARN_INTERNAL_TYPE_NAMES` now filters with `get_type_name(t).startswith("sklearn.")`, matching the pattern used by `SKLEARN_ESTIMATOR_TYPE_NAMES`. This excludes `CyHalfMultinomialLoss` (which reports `_loss.CyHalfMultinomialLoss`) until sklearn fixes the Cython module name.
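A minimal sketch of that filter pattern, with hypothetical stand-in types (the one-line `get_type_name` below is a simplified stand-in for skops' real helper):

```python
# Simplified stand-in for skops' get_type_name helper.
def get_type_name(t):
    return f"{t.__module__}.{t.__qualname__}"

class GoodType:
    pass
GoodType.__module__ = "sklearn.tree"  # a properly qualified sklearn type

class BadCyType:
    pass
BadCyType.__module__ = "_loss"  # mimics the truncated Cython module name

_SKLEARN_INTERNAL_TYPES = [GoodType, BadCyType]

# The prefix filter keeps only fully qualified sklearn types, so a
# monkey-patched or misreported type never lands in the trusted list.
SKLEARN_INTERNAL_TYPE_NAMES = [
    get_type_name(t)
    for t in _SKLEARN_INTERNAL_TYPES
    if get_type_name(t).startswith("sklearn.")
]
assert SKLEARN_INTERNAL_TYPE_NAMES == ["sklearn.tree.GoodType"]
```

The same filter that blocks monkey-patched types also happens to exclude the misreported Cython losses until their `__module__` is fixed upstream.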
… drop warning suppression

- Filter SKLEARN_INTERNAL_TYPE_NAMES to only include types starting with 'sklearn.' to avoid trusting monkey-patched types from other packages
- Remove CyHalfMultinomialLoss from auto-trusted list since its __module__ reports '_loss' instead of 'sklearn._loss._loss' (Cython build issue on sklearn side)
- Add detailed comment explaining the sklearn Cython module name issue
- Remove unnecessary warnings.catch_warnings() in test, since no warnings are produced during fitting of these estimators
- Update test to use get_untrusted_types() + loads(trusted=...) pattern

Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/ccbb4abe-8ea6-442c-a6dc-41b9489d35cf
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
@copilot scikit-learn/scikit-learn#33770 fixes the issue with the Cy* classes. But in the meantime, their `__module__` still reports `_loss`.
…ule__ correctness
Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>

…convention)
Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/a65530ea-7c30-419d-a36e-e6c6a423c5f7
Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
Added all the Cy* loss types. The test now conditionally asserts: multiclass GBC expects `CyHalfMultinomialLoss` among the untrusted types on versions where its `__module__` is still wrong. See commit 1fd89ac.
Agent-Logs-Url: https://github.com/skops-dev/skops/sessions/9bf5c6f9-7729-4d13-a37a-18a6b934d296 Co-authored-by: adrinjalali <1663864+adrinjalali@users.noreply.github.com>
```python
# Cython loss classes from sklearn._loss._loss. These are internal Cython extension
# types used by GradientBoosting models. Currently their __module__ incorrectly reports
# '_loss' instead of 'sklearn._loss._loss' due to a bug in sklearn's Cython build
# (see https://github.com/scikit-learn/scikit-learn/pull/33770). The startswith
# ("sklearn.") filter below will automatically exclude them while __module__ is wrong,
# and automatically include them once sklearn fixes the issue.
```
LossNode already default-trusts several `_loss.*` Cython classes, not just `CyHalfMultinomialLoss`. In skops/io/_sklearn.py:70, `ALLOWED_LOSSES` includes many `_loss.*` names (`_loss.CyAbsoluteError`, `_loss.CyExponentialLoss`, and more).
CyHalfMultinomialLoss is not in this list because it's not a CyLossFunction subclass (it has a slightly different structure).
But a good solution might still be to use the LossNode to support it. I vibe-tried that and it seems to work (with a diff of ~20 lines).
We could also just add `CyHalfMultinomialLoss` to the trusted types, as we already default-trust several `_loss.*` Cython classes... According to my LLM, it's not less safe.
This is indeed much simpler, but it doesn't work for multi-class classification with GB/HGB, because those depend on `CyHalfMultinomialLoss`.
```python
# CyHalfMultinomialLoss is not a subclass of CyLossFunction, so it needs its
# own dispatch entry. It's already in ALLOWED_LOSSES so LossNode will trust it.
if CyHalfMultinomialLoss is not None:
    GET_STATE_DISPATCH_FUNCTIONS.append((CyHalfMultinomialLoss, loss_get_state))
```
This won't work because CyHalfMultinomialLoss doesn't have the same reduce/state shape as other Cy* losses.
The tests are passing, can you give me a test case where it fails?
With this loss_get_state, it works (I tried the fix on nightly, 1.7 and 1.6; it works there too):
```python
def loss_get_state(obj: Any, save_context: SaveContext) -> dict[str, Any]:
    reduce = obj.__reduce__()
    if type(obj) == reduce[0]:
        state = reduce_get_state(obj, save_context)
        state["__loader__"] = "LossNode"
    elif type(obj) == reduce[1][0]:
        # The output is commonly of the form:
        # >>> CyPinballLoss(1).__reduce__()
        # (<cyfunction __pyx_unpickle_CyPinballLoss at 0x7b1d00099ff0>,
        #  (<class '_loss.CyPinballLoss'>, 232784418, (1.0,)))
        #
        # CyHalfMultinomialLoss differs slightly and returns:
        # >>> CyHalfMultinomialLoss().__reduce__()
        # (<cyfunction __pyx_unpickle_CyHalfMultinomialLoss at 0x...>,
        #  (<class '_loss.CyHalfMultinomialLoss'>, 238750788, None), ())
        #
        # In that case, the constructor takes no args and the tuple state lives
        # in reduce[2].
        state = {
            "__class__": obj.__class__.__name__,
            "__module__": get_module(type(obj)),
            "__loader__": "LossNode",
        }
        state["__reduce__"] = {}
        if len(reduce) == 3:
            state["__reduce__"]["args"] = get_state((), save_context)
            state["content"] = get_state(reduce[2], save_context)
        else:
            state["__reduce__"]["args"] = get_state(reduce[1][2], save_context)
            state["content"] = get_state({}, save_context)
    return state
```
> The tests are passing, can you give me a test case where it fails?
Locally (using pixi env: ci-sklearn18), I have one single test that fails:

```
FAILED skops/io/tests/test_persist.py::test_gradient_boosting_estimators_have_no_untrusted_types[GradientBoostingClassifier-log_loss-multiclass] - TypeError: _loss.CyHalfMultinomialLoss() argument after * must be an iterable, not NoneType
```

The CI has the same failure, but also other failures for other sklearn versions, which are the ones I skipped in my PR with:
```python
@pytest.mark.skipif(
    SKLEARN_VERSION < parse_version("1.4"),
    reason=(
        "Before scikit-learn 1.4, GradientBoosting uses different internal loss "
        "objects (`sklearn.ensemble._gb_losses`), which we don't try to support "
        "as trusted types."
    ),
)
```
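The two `__reduce__` shapes at the heart of this thread can be mimicked in plain Python. Both classes below are hypothetical stand-ins, not the real Cython losses: most `Cy*` losses return a 2-tuple with the constructor args in `reduce[1][2]`, while the `CyHalfMultinomialLoss`-style shape returns a 3-tuple with `None` in that slot and the state in `reduce[2]` (which is why naively unpacking `reduce[1][2]` raises the `argument after * must be an iterable, not NoneType` error seen above).

```python
def rebuild(cls, checksum, args):
    # Stand-in for the __pyx_unpickle_* helper; guards against args=None.
    return cls(*args) if args else cls()

class TwoTupleLoss:
    """Like most Cy* losses: constructor args carried in reduce[1][2]."""
    def __init__(self, quantile=1.0):
        self.quantile = quantile
    def __reduce__(self):
        return (rebuild, (TwoTupleLoss, 232784418, (self.quantile,)))

class ThreeTupleLoss:
    """Like CyHalfMultinomialLoss: no constructor args, state in reduce[2]."""
    def __reduce__(self):
        return (rebuild, (ThreeTupleLoss, 238750788, None), ())

def extract_args_and_content(obj):
    # Mirrors the len(reduce)-based dispatch from the loss_get_state snippet.
    reduce = obj.__reduce__()
    if len(reduce) == 3:
        return (), reduce[2]
    return reduce[1][2], {}

assert extract_args_and_content(TwoTupleLoss(0.5)) == ((0.5,), {})
assert extract_args_and_content(ThreeTupleLoss()) == ((), ())
```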
LGTM, WDYT @cakedev0

Yes LGTM too 👍 Thanks!