Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load model from path/str to generate Card #205

Merged
merged 12 commits into from Nov 23, 2022

Conversation

p-mishra1
Copy link
Contributor

#96 Added required changes and followed the pattern @BenjaminBossan advised.
Creating PR from another branch as the last one had too many commits and reverts.
@BenjaminBossan have a look.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@p-mishra1 Thanks a lot for listening to the feedback, I like the refactor you did. This PR is now close to being ready. I have a few minor comments, please take a look. Also, please add an entry to the changes.rst.

Comment on lines 168 to 173
"""Loads the mddel if provided a file path, if already a model instance return it unmodified.

Parameters
----------
model : pathlib.path, str, or sklearn estimator
Path/str or the actual model instance. if a Path or str, loads the model on first call.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add line length limits to the docstring as well?


model_path = Path(model)
if not model_path.exists():
raise ValueError("Model file does not exist")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test case for this potential error? Also, FileNotFoundError is a more precise error.

skops/card/_model_card.py Outdated Show resolved Hide resolved
@@ -172,7 +210,7 @@ class Card:

Parameters
----------
model: estimator object
model: pathlib.path, str, or sklearn estimator object
Model that will be documented.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please extend the docstring, now that we have more options here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i changed to this "pathlib.path, str, or sklearn estimator object"
what should we add more ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something along the lines that if a str or Path is passed, the model will be loaded. This could be surprising behavior for some users and should be mentioned.

Comment on lines 47 to 49
assert loaded_model_str.n_jobs is model0.n_jobs
assert loaded_model_path.n_jobs is model0.n_jobs
assert loaded_model_path.n_jobs is loaded_model_str.n_jobs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert loaded_model_str.n_jobs is model0.n_jobs
assert loaded_model_path.n_jobs is model0.n_jobs
assert loaded_model_path.n_jobs is loaded_model_str.n_jobs
assert loaded_model_str.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == loaded_model_str.n_jobs

Although is works in this case, let's use == to compare numbers and then we don't have to worry about the size of the number or the Python interpreter :)

assert card_from_model0.model.n_jobs == card_from_str.model.n_jobs

@pytest.mark.parametrize("suffix", [".pkl", ".pickle", ".skops"])
def test_model_card_path(self, suffix):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test could be unified with the previous one, as there is only one line difference, by parametrizing it over whether the save_file is converted to Path or not.

def test_model_is_str_pickle(self, destination_path):
model0 = LinearRegression(n_jobs=123)
f_name0 = destination_path / "lin_reg.pickle"
with open(f_name0, "wb") as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use save_model_to_file in this test?

skops/card/tests/test_card.py Show resolved Hide resolved
skops/card/tests/test_card.py Outdated Show resolved Hide resolved
@p-mishra1
Copy link
Contributor Author

@p-mishra1 Thanks a lot for listening to the feedback, I like the refactor you did. This PR is now close to being ready. I have a few minor comments, please take a look. Also, please add an entry to the changes.rst.

Thanks to you @BenjaminBossan patiently guiding me and giving me feedback every time..

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, we're almost done here.

From my point of view, there are only 3 minor issues left:

  1. Adjusting the docstring as discussed
  2. Remove the debug print (see comment)
  3. Add an entry to changes.rst

model0 = LinearRegression(n_jobs=123)
save_file = save_model_to_file(model0, suffix)
save_file = file_type(save_file)
print(type(save_file))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry left it by mistake,
one more thing @BenjaminBossan I'm unable to understand where should i add the changes in changes.rst it will go under v0.3 right.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will go under v0.3 right.

Exactly, just add an entry to the end of the list. Also, since you're not on the contributor list yet, add yourself as a contributor at the very end of the document.

@p-mishra1
Copy link
Contributor Author

p-mishra1 commented Nov 1, 2022

@BenjaminBossan the test were failing on windows so I changed the function

def save_model_to_file(model_instance, suffix):
    save_file_handle, save_file = tempfile.mkstemp(suffix=suffix, prefix="skops-test")
    if suffix in (".pkl", ".pickle"):
        with open(save_file, "wb") as f:
            pickle.dump(model_instance, f)
    elif suffix == ".skops":
        dump(model_instance, save_file)
    return save_file_handle, save_file

like this so that i can close the file before removing it somehow it is not a problem in linux/mac.
I'm thinking that is the issue here.

@p-mishra1
Copy link
Contributor Author

@BenjaminBossan do you have an idea why it is failing on windows, i could not figure out exactly why,

@BenjaminBossan
Copy link
Collaborator

@BenjaminBossan do you have an idea why it is failing on windows, i could not figure out exactly why,

I'm not sure, Windows is always a big riddle. Maybe the string that is being matched contains a character that doesn't work on Windows for some reason?

@BenjaminBossan
Copy link
Collaborator

If you can't figure it out, I would be okay with skipping the test on Windows (with a comment explaining why). The error really comes from the test itself and is not something a Windows user would encounter in practice when using the model card.

@p-mishra1 p-mishra1 closed this Nov 1, 2022
@p-mishra1
Copy link
Contributor Author

p-mishra1 commented Nov 1, 2022

I will look into it and then we can decide

@p-mishra1 p-mishra1 reopened this Nov 1, 2022
@p-mishra1
Copy link
Contributor Author

@BenjaminBossan i tried somethings and also worked in the windows environment to mitigate the issue have changed the test a little bit hopefully it will work this time.
give it a look.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @p-mishra1

Comment on lines 191 to 194
if model_path.suffix in (".pkl", ".pickle"):
model = joblib.load(model_path)
elif model_path.suffix == ".skops":
model = load(model_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't rely on the extension here. We can check if the file is a zip file (zipfile.is_zipfile(filename) and load with skops if yes, otherwise try to load with joblib, and catch any exceptions which might be raised during each of them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just keep a key in config file to store which format the model was serialized in?

Copy link
Contributor Author

@p-mishra1 p-mishra1 Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree depending on file extension is not a good way.
let's say the file loaded indeed was a zip then we will be conforming that it is sklearn estimator object. using this
base.BaseEstimator ?
really sorry if it felt too naive. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just keep a key in config file to store which format the model was serialized in?

We could but I don't think this function is called late enough for that to work.

let's say the file loaded indeed was a zip then we will be conforming that it is sklearn estimator object. using this
base.BaseEstimator ?

No, if it's a zip file, then it's a skops file, which means we can load using skops, otherwise we load with joblib, and we don't validate the output. We just catch any exception which might have been raised.

Comment on lines 318 to 320
@model.deleter
def model(self):
del self._model
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, the method should exist (assuming we leave the rest as is), no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost never see a deleter for properties.

Comment on lines 307 to 312
@property
def model(self):
model = _load_model(self._model)
if model is not self._model:
self._model = model
return model
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change model. We should always keep model attribute as is given by the user, and only load when we need it in other methods.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed 👍🏻

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand. You want the model argument passed by the user to remain unchanged, and when we access it in other methods, we check if it's a string/path and load it? So basically swap around the attribute names model and _model? Or load the model each time it's accessed (which would be wasteful, since it's used, for instance, in __repr__ so could be accessed quite frequently).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so instead of using _model attribute we should use _load_model(self.model) wherever there is a call to model,
am i thinking in the right direction @merveenoyan @adrinjalali .
also tagging @BenjaminBossan to have a look on what we discussed earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly @p-mishra1 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @BenjaminBossan sounds fine will work on this more, and
just curious do we have slack/forum/discord where we can talk about code design before commiting it, or is it the preferred way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing the work.

I think ideally these discussions should come up in the PR or the corresponding issue. I thought there was consensus on the suggested approach here, I guess there was some confusion, sorry for that.

Copy link
Contributor Author

@p-mishra1 p-mishra1 Nov 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    def __repr__(self) -> str:

        model = getattr(self, "model", None)
        if model:
            model = _load_model(model) # this line
            model_str = self._strip_blank(repr(model))
            model_repr = aRepr.repr(f"  model={model_str},").strip('"').strip("'")
        else:
            model_repr = None
    def _extract_estimator_config(self) -> str:
        """Extracts estimator hyperparameters and renders them into a vertical table.

        Returns
        -------
        str:
            Markdown table of hyperparameters.
        """
        model = _load_model(self.model) # This line
        hyperparameter_dict = model.get_params(deep=True)
    def _generate_card(self) -> ModelCard:
        .
        .
        .
        if self.model_diagram is True:
            model = _load_model(self.model) # this line
            model_plot_div = re.sub(r"\n\s+", "", str(estimator_html_repr(model)))

so these 3 functions will be changed like this(commented with "this line") while doing lazy loading in card class.

also the test for card repr and str either have to be changed to take card instance and path,
or the test class have to be duplicated like passing model file as param instead of estimator.
lmk @adrinjalali @BenjaminBossan which method should we prefer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is the right approach or if we should keep the property and just do the _load_model inside the property. The disadvantage of the suggested approach is that it's very error prone, as soon as new code is added that access the model, it would need to do the same thing. That's why I would prefer using property. @adrinjalali wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about self.get_model() which would return the model if self.model is a str? acceptable middle ground?

Copy link
Collaborator

@merveenoyan merveenoyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot for working on this!

Comment on lines 191 to 194
if model_path.suffix in (".pkl", ".pickle"):
model = joblib.load(model_path)
elif model_path.suffix == ".skops":
model = load(model_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just keep a key in config file to store which format the model was serialized in?

Comment on lines 307 to 312
@property
def model(self):
model = _load_model(self._model)
if model is not self._model:
self._model = model
return model
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed 👍🏻

@p-mishra1
Copy link
Contributor Author

@BenjaminBossan @adrinjalali have a look at the changes, sorry for the long gap, was busy with some things.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM now. but since the code has changed a bit since last review from @BenjaminBossan , I'll let him check and merge.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only have a few minor comments left, after that we should be good 👍


Parameters
----------
model : pathlib.path, str, or sklearn estimator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
model : pathlib.path, str, or sklearn estimator
model : pathlib.Path, str, or sklearn estimator

@@ -263,6 +304,18 @@ def __init__(
self._extra_sections: list[tuple[str, Any]] = []
self.metadata = metadata or ModelCardData()

def get_model(self) -> Any:
"""Returns sklearn estimator object if ``Path``/``str``
is provided.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is provided.
is provided.

@@ -162,6 +165,41 @@ def metadata_from_config(config_path: Union[str, Path]) -> ModelCardData:
return card_data


def _load_model(model: Any) -> Any:
"""Loads the mddel if provided a file path, if already a model instance return it
unmodified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unmodified.
unmodified.

Comment on lines 215 to 216
documented. If a ``Path`` or ``str`` is provided, model will be loaded
on first use.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
documented. If a ``Path`` or ``str`` is provided, model will be loaded
on first use.
documented. If a ``Path`` or ``str`` is provided, model will be loaded.

After the change, it will be loaded each time, so "on first use" is not correct anymore.

``Path``/``str`` of the model or the actual model instance that will be
documented. If a ``Path`` or ``str`` is provided, model will be loaded
on first use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

skops/card/tests/test_card.py Show resolved Hide resolved
Comment on lines 47 to 48
assert loaded_model_str.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == model0.n_jobs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert loaded_model_str.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == model0.n_jobs
assert loaded_model_str.n_jobs == 123
assert loaded_model_path.n_jobs == 123

This test is a tiny bit more robust this way (in case that model0 is being mutated by _load_model for some reason).


assert loaded_model_str.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == model0.n_jobs
assert loaded_model_path.n_jobs == loaded_model_str.n_jobs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is not needed, right? If the two asserts above succeed, this cannot possibly fail.


card_from_path = self.path_to_card(file_name)
result_from_path = meth(card_from_path)
print(result_from_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove


@pytest.mark.parametrize("meth", [repr, str])
@pytest.mark.parametrize("suffix", [".pkl", ".skops"])
def test_model_card_repr(self, card: Card, meth, suffix):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this test is strictly necessary. Maybe it is sufficient to show that calling repr or str on a card that has a path/str for model works? The fact that it's the same output is implied by the other tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to only repr from card_from _path.

Copy link
Collaborator

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Thanks a lot @p-mishra1 for your work and patience.

@p-mishra1
Copy link
Contributor Author

This looks good. Thanks a lot @p-mishra1 for your work and patience.

Thanks a lot to you guys @BenjaminBossan and @adrinjalali for guiding me and listening to my silly comments and doubts.
Also thanks to @merveenoyan who introduced me to the repo by posting about this issue on Twitter.

one request I have to @BenjaminBossan as most of the guidance and reviews were provided by you, can you provide me any feedback you have for me so that I can improve myself?

@BenjaminBossan
Copy link
Collaborator

Hey @p-mishra1,

First of all, it's really hard to give feedback after one PR and without knowing you personally, so there is only so much I can say. But this is my personal impression:

I liked the initiative you showed. Your responses were quick and you were receptive to our feedback - even when it was conflicting sometimes ;-). As you know, this all took some time but still, you persisted, which is very good. You were willing to do some of the hard work and came up with solutions to new challenges that occurred on the way.

If there was one thing that would make the life of us maintainers easier, it is if right from the start, you would have delivered a more "complete" package, i.e. updating the docs everywhere, providing unit tests, making sure that code style and formatting are consistent with the rest of the code base. When we see that as maintainers right from the start, we're always excited.

Hopefully, this helps you a little bit. If you have feedback for us, also feel free to share.

@p-mishra1
Copy link
Contributor Author

Hey @p-mishra1,

First of all, it's really hard to give feedback after one PR and without knowing you personally, so there is only so much I can say. But this is my personal impression:

I liked the initiative you showed. Your responses were quick and you were receptive to our feedback - even when it was conflicting sometimes ;-). As you know, this all took some time but still, you persisted, which is very good. You were willing to do some of the hard work and came up with solutions to new challenges that occurred on the way.

If there was one thing that would make the life of us maintainers easier, it is if right from the start, you would have delivered a more "complete" package, i.e. updating the docs everywhere, providing unit tests, making sure that code style and formatting are consistent with the rest of the code base. When we see that as maintainers right from the start, we're always excited.

Hopefully, this helps you a little bit. If you have feedback for us, also feel free to share.

Thank you @BenjaminBossan i appreciate you taking some time to give me feedback. I know it is quite tough for you to give feedback on the coding part .
I don't have any feedbacks from my side i really liked the way you guys handled each PR despite the noob mistakes that i did.
I would love contributing more and i will be on look out for the opportunities but if you guys feel there is some issues, feature i can contribute to i will be happy to take that up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants