
ENH Add metadata routing for RANSACRegressor #28261

Merged
merged 39 commits into scikit-learn:main from metadata_RANSACRegressor on Feb 26, 2024

Conversation


@StefanieSenger StefanieSenger commented Jan 25, 2024

Reference Issues/PRs

Towards #22893

What does this implement/fix? Explain your changes.

This PR implements metadata routing for RANSACRegressor.

Any other comments?

Resolved issue:

  • RANSACRegressor.fit() routes to the sub-estimator's fit, predict and score. As a result, fit runs into UnsetMetadataPassedError when set_fit_request is set but set_score_request is not. The tests need to allow several set_{method}_requests to be set at once, as needed.


github-actions bot commented Jan 25, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 3af9455.

@adrinjalali adrinjalali self-requested a review January 29, 2024 10:51
Member

@adrinjalali adrinjalali left a comment


Not a complete review, but leaving the comments I worked on for now.

Resolved review threads on sklearn/linear_model/_ransac.py (×6) and sklearn/tests/metadata_routing_common.py.
"y": y,
"preserves_metadata": False,
"estimator_routing_methods": ["fit", "predict", "score"],
"requests_set_together": {"fit": ["predict", "score"]},
Member


Suggested change
"requests_set_together": {"fit": ["predict", "score"]},
"requests_set_together": {"fit": ["fit", "predict", "score"]},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding fit is unnecessary, because when `fit` is used, `set_fit_request` has already been set.

Resolved review threads on sklearn/tests/test_metaestimators_metadata_routing.py (×2).
Contributor Author

@StefanieSenger StefanieSenger left a comment


@adrinjalali
Thanks for the review. I've implemented most of your suggestions and commented on the remaining few.

Comment on lines 442 to 445
f"{e}, which is used internally by `RANSACRegressor.fit()`."
f"Call `{estimator.__class__.__name__}.set_{{method}}_request("
"{metadata}=True)` for each metadata."
),
Contributor Author


ConsumingRegressor.set_score_request(sample_weight=True)

Do you mean I should add an example within the error message? I think this would clutter up people's consoles.

Resolved review thread on sklearn/tests/metadata_routing_common.py.
Resolved review threads on sklearn/tests/test_metaestimators_metadata_routing.py (×3).
@StefanieSenger
Contributor Author

From my side, there is one remaining issue:

test_ransac_fit_sample_weight and test_ransac_final_model_fit_sample_weight from test_ransac.py both raise:

y_pred = estimator.predict(X, **routed_params.estimator.predict)
E TypeError: LinearModel.predict() got an unexpected keyword argument 'sample_weight'

The reason is that RANSACRegressor never routed sample_weight to the underlying estimator's predict before (estimator.predict() is called from within RANSACRegressor's fit), but now it does, and estimator.predict() isn't prepared to accept sample_weight.

Actually, no regressor in scikit-learn currently seems to accept sample_weight in predict (at least I didn't find any), so there is no regressor we could use for the test, unless ConsumingRegressor were modified to have a coef_ attribute.

Either that, or: what are the 100 reasons that speak against integrating sample_weight into LinearModel._decision_function like this:

    def _decision_function(self, X, sample_weight=None):
        check_is_fitted(self)

        X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False)

        if sample_weight is not None:
            weighted_X = X * sample_weight.reshape(-1, 1)
            return (
                safe_sparse_dot(weighted_X, self.coef_.T, dense_output=True)
                + self.intercept_
            )
        else:
            return safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_

    def predict(self, X, sample_weight=None):
        """
        Predict using the linear model.

        Parameters
        ----------
        X : array-like or sparse matrix, shape (n_samples, n_features)
            Samples.

        sample_weight : array-like, default=None
            Sample weights.

        Returns
        -------
        C : array, shape (n_samples,)
            Returns predicted values.
        """
        return self._decision_function(X, sample_weight)

@adrinjalali

@StefanieSenger
Contributor Author

Update: ConsumingRegressor can be used for test_ransac_final_model_fit_sample_weight quite easily, but test_ransac_fit_sample_weight is more elaborate and needs a real regressor that can actually fit and predict, which we don't have as far as I know. The test would also be useless if it doesn't reflect a real scenario.

So I have finally realized that I should never have passed fit_params into estimator.predict().

@StefanieSenger StefanieSenger marked this pull request as ready for review February 1, 2024 12:57
Comment on lines 442 to 445
f"{e}, which is used internally by `RANSACRegressor.fit()`."
f"Call `{estimator.__class__.__name__}.set_{{method}}_request("
"{metadata}=True)` for each metadata."
),
Member


I mean, the whole message, not just a part of it.

Resolved review thread on sklearn/linear_model/_ransac.py.
    #     )
    #     return np.zeros(shape=(len(X),))
    def score(self, X, y, sample_weight="default", metadata="default"):
        self.predict(X, sample_weight="default", metadata="default")
Member


We don't need to add this, and our existing scorers certainly don't route metadata to predict.

Contributor Author


No, the existing score methods don't route metadata to predict. I went through all of them (I believe) to confirm that. Thus, this line is really not needed here.

Out of curiosity: what's the problem with sample_weight in predict? I can see it in all kinds of methods, just not in predict. I mean, if I modify my data before fitting, I need to modify it using the same pattern before predicting...

Resolved review threads on sklearn/tests/metadata_routing_common.py (×2).
Comment on lines 493 to 510
with pytest.raises(UnsetMetadataPassedError, match=re.escape(msg)):
    method = getattr(instance, method_name)
    method(X, y, **method_kwargs)
if "fit" not in method_name:
    # set request on fit
    set_requests(estimator, methods=["fit"], metadata_name=key)
    # make sure error message corresponding to `method_name`
    # is used for test
    if method_name == "predict":
        set_requests(estimator, methods=["score"], metadata_name=key)
    if method_name == "score":
        set_requests(estimator, methods=["predict"], metadata_name=key)
    # fit before calling method
    fit_method = getattr(instance, "fit")
    fit_method(X, y, **method_kwargs)
    if method_name == "predict":
        method(X, **method_kwargs)
    else:
        method(X, y, **method_kwargs)
Member


This section could be simplified as:

            method_requests = metaestimator.get("requests_set_together", {}).get(method_name, [])
            set_requests(estimator, methods=method_requests, metadata_name=key)
            with pytest.raises(UnsetMetadataPassedError, match=re.escape(msg)):
                method = getattr(instance, method_name)
                if "fit" not in method_name:
                    instance.fit(X, y)
                try:
                    # fit and partial_fit accept y, others don't.
                    method(X, y, **method_kwargs)
                except TypeError:
                    method(X, **method_kwargs)

We don't need to pass metadata to fit when checking metadata routing in predict and score.

Contributor Author


Yes: when any other routing method is called from within fit (as RANSACRegressor calls predict and score from within fit), we need to pass metadata to fit as well. The goal is to make sure that an error is raised if the user sets the request on fit but not on score, for instance, and then passes metadata (unaware that score also consumes it from within fit). If no metadata is passed, the error cannot be raised. Your suggestion doesn't cover that case.

This is the case I'm trying to modify this test for.

Note: when I wrote the test, I thought that we'd also prepare estimator.predict to be passed metadata (since ConsumingRegressor.predict was coded to receive metadata). This idea has changed since then (it would be too large an endeavour for this PR), and I will adjust this test accordingly. Also, I've used some of your suggestions to simplify this test.

Though I am still looking for a more generic version of it that's not so specific to RANSAC. Something like the following, maybe:

                if "fit" not in method_name:
                    # set request on fit
                    set_requests(estimator, methods=["fit"], metadata_name=key)
                    # make sure error message corresponding to `method_name`
                    # is used for test
                    requests_set_together = metaestimator.get("requests_set_together", {})
                    if method_name in requests_set_together.get("fit", {}):
                        # rotate through setting requests, leaving out only method_name
                        ...
                    # fit before calling method
                    instance.fit(X, y, **method_kwargs)
                try:
                    # `fit` and `partial_fit` accept y, others don't.
                    method(X, y, **method_kwargs)
                except TypeError:
                    method(X, **method_kwargs)

But I need to think about it a little more. Next time it could be a method other than fit that internally routes metadata to more than one sub-estimator method.

Comment on lines 554 to 568
if method_name in ["predict", "score"]:
    # fit before calling method
    set_requests(estimator, methods=["fit"], metadata_name=key)
    if requests_set_together:
        set_requests(
            estimator,
            methods=requests_set_together["fit"],
            metadata_name=key,
        )
    fit_method = getattr(instance, "fit")
    fit_method(X, y, **method_kwargs, **extra_method_args)
if method_name == "predict":
    method(X, **method_kwargs, **extra_method_args)
else:
    method(X, y, **method_kwargs, **extra_method_args)
Member


A similar pattern applies here.

Member

@adrinjalali adrinjalali left a comment


As for simpli

Resolved review thread on sklearn/linear_model/_ransac.py.
Comment on lines 442 to 445
f"{e}, which is used internally by `RANSACRegressor.fit()`."
f"Call `{estimator.__class__.__name__}.set_{{method}}_request("
"{metadata}=True)` for each metadata."
),
Member


Yes, exactly. Because I'm confused as to why `e` is at the beginning of the message.

Comment on lines 500 to 504
set_requests(estimator, methods=["fit"], metadata_name=key)
# make sure error message corresponding to `method_name`
# is used for test
if method_name != "score":
    set_requests(estimator, methods=["score"], metadata_name=key)
Member


The information on which metadata to set should be in the dictionary where we declare the tests. This:

        "requests_set_together": {"fit": ["fit", "score"]},

Then we don't need to special-case here.

Contributor Author


Yes, we still do need that special case here, because in this case we need to deviate from how the test flows.

The test rotates through fit, predict and score, and if we are testing that anything other than score raises an error, we need to set the score request, not the ones set together with fit.

Comment on lines 442 to 445
f"{e}, which is used internally by `RANSACRegressor.fit()`."
f"Call `{estimator.__class__.__name__}.set_{{method}}_request("
"{metadata}=True)` for each metadata."
),
Member


Two things I'm not sure about here:

  • metadata and method are not replaced with actual values, and I'm not sure this makes the error message more understandable.
  • This is a pattern repeated pretty much everywhere we call process_routing; do we want to change all of those to this pattern? I don't think we should, it's really verbose. We could maybe improve the error message raised by the validation inside process_routing instead.

Resolved review threads on sklearn/linear_model/_ransac.py (×2) and sklearn/tests/test_metaestimators_metadata_routing.py.
Comment on lines 342 to 347
- requests_set_together: a dict that defines which set_{method}_requests need
to be set together with the key; used in case a router routes to different
methods of the sub-estimator from within the same meta-estimator's method.
For instance, {"fit": ["score"]} would signal that
`estimator.set_fit_request` requires `estimator.set_score_request` to be set
as well.
Member


This is basically the method mapping. I think when present, it should include all mappings instead of adding what's on top of a one-to-one mapping, which would also make the test code easier and more straightforward. It can also be called method_mapping.

Contributor Author


But if we include the one-to-one mapping, in this case fit to fit, then we would need to check whether method_mapping exists very early in the test and have a big divide between how to handle the two different cases.

I had thought about this before and came to the conclusion that it would make the test more complicated. After all, it's better to handle this further down in the test; this is why I designed it this way. Would you mind approving it as it is, @adrinjalali?

@StefanieSenger
Contributor Author

I went through the comments and reacted to all of them, @adrinjalali. Please have another look.

I am wondering about the codecov warnings for lines 417 and 424 in test_metaestimators_metadata_routing.py. Do we have to test this function (which is itself used by a test)? If so, where is the rest of this function tested? I think it's nowhere.

Do you also want to have a look, @glemaitre?

Member

@adrinjalali adrinjalali left a comment


d3ac747 shows my idea of how to do the multiple mapping for methods.

Comment on lines 440 to 451
try:
    routed_params = process_routing(self, "fit", **fit_params)
except UnsetMetadataPassedError as e:
    raise UnsetMetadataPassedError(
        message=(
            f"{e}, which is used internally by `RANSACRegressor.fit()`. "
            f"Call `{estimator.__class__.__name__}.set_{{method}}_request("
            "{metadata}=True)` for each metadata."
        ),
        unrequested_params=e.unrequested_params,
        routed_params=e.routed_params,
    ) from e
Member


I can't find the previous messages about this.

I don't think we should modify the exception here. If the exception needs to be improved, it needs to be handled for every estimator, and we shouldn't expect this much boilerplate from meta-estimators.

@StefanieSenger
Contributor Author

StefanieSenger commented Feb 22, 2024

Okay, I have now reverted my re-raising of the UnsetMetadataPassedError here and will introduce it as a general improvement in a separate PR.
I have also moved the re-setting of the requests outside of the try/except, because it always works; the try block now only contains the code that might fail.
Please have another go, @adrinjalali. What do you think about the CodeCov warnings? Is this something I should take care of, and if so, where? I think the rest of this helper function isn't tested either. Could we have it start with an underscore so that CodeCov accepts it?

Member

@adrinjalali adrinjalali left a comment


Otherwise LGTM.

@OmarManzoor @glemaitre you could have a look here maybe?

Resolved review thread on sklearn/linear_model/_ransac.py.
Contributor

@OmarManzoor OmarManzoor left a comment


Generally looks good. Just a minor change suggested. Also, code coverage shows two uncovered lines, which relate to the new ValueErrors we are raising inside the test_metaestimators_metadata_routing file. Do we need to cover them, @adrinjalali?

Resolved review thread on sklearn/tests/metadata_routing_common.py.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Member

@adrinjalali adrinjalali left a comment


I don't think we need to cover those two lines, but it's nice to make it explicit that we're not covering them.

Resolved review threads on sklearn/tests/test_metaestimators_metadata_routing.py (×2).
OmarManzoor and others added 2 commits February 26, 2024 17:32
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Contributor

@OmarManzoor OmarManzoor left a comment


LGTM. Thanks @StefanieSenger

@OmarManzoor OmarManzoor enabled auto-merge (squash) February 26, 2024 12:53
@OmarManzoor OmarManzoor merged commit e2f9530 into scikit-learn:main Feb 26, 2024
28 checks passed
@StefanieSenger StefanieSenger deleted the metadata_RANSACRegressor branch February 26, 2024 13:53