Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Save models to hdf5 and other formats #160

Closed
wants to merge 69 commits into from
Closed

[WIP] Save models to hdf5 and other formats #160

wants to merge 69 commits into from

Conversation

kushalkolar
Copy link
Contributor

Hi,

I thought it would be useful to save the KShape model without pickling. I implemented a simple to_hdf5() method for saving a KShape model to an hdf5 file and from_hdf5() for reloading it so that predictions can be done with the model.

Changes to the KShape class:

  • the class attribute "model_attrs" is a list of attributes that are sufficient to describe the model.
  • to_dict() method packages the model attributes and params to a dict.
  • to_hdf5() and from_hdf() can be used to save/load the model to/from hdf5 files.
  • put instance attributes in constructor

An hdftools module is added to handle saving a dict of numpy arrays to an hdf file.

Usage:

ks.to_hdf5('/path/to/file.h5')
model = KShape.from_hdf5('path/to/file.h5')

@pep8speaks
Copy link

pep8speaks commented Oct 15, 2019

Hello @kushalkolar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 60:80: E501 line too long (80 > 79 characters)

Line 247:80: E501 line too long (81 > 79 characters)

Line 61:17: E722 do not use bare 'except'

Comment last updated at 2020-01-20 21:23:44 UTC

@rtavenar
Copy link
Member

Hi @kushalkolar

Thank you very much for your suggestion.
My first question would be: do you think this feature is specific to KShape or should it be (whenever possible) provided for any model is tslearn?
And the second question would be: how is this handled in sklearn?

@kushalkolar
Copy link
Contributor Author

kushalkolar commented Oct 19, 2019

I think it would be useful for any tslearn model where it makes sense.

Using the structure shown here for KShape, I can create a base class that uses a model_attrs list with a to_dict() method for packaging models . The dict can then be used to export to various formats, hdf, json etc. This should work for all tslearn models where a collection of instance attributes are sufficient to describe the model.

pickling is the only answer I could find for how sklearn does it (and that's what their docs say too https://scikit-learn.org/stable/modules/model_persistence.html). Though there are issues & limitations with pickling of course (some are outlined in their docs).

@rtavenar
Copy link
Member

Yes, it could be a great idea to have a base class for serializable models so that code does not have to be duplicated everywhere!

@rtavenar
Copy link
Member

Ideally, this could be a more general solution than #57 for example.

@kushalkolar
Copy link
Contributor Author

Sorry I haven't had time to work on implementing this yet. Yes I will definitely try to make a generalized solution to this.

@rtavenar
Copy link
Member

@kushalkolar : note that you still have issues with tests:

  • failing test in TimeSeriesScalerMinMax is not your fault and has been fixed in dev, so no worry for this one
  • you seem to have a Python2.7 compatibility issue
  • float128 seem to be only accessible with python 3.6+

https://travis-ci.org/rtavenar/tslearn/builds/639309691

Copy link
Contributor

@johannfaouzi johannfaouzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few changes still needed to make it compatible on all supported versions, but it's almost done!

There are some docstring errors for TimeSeriesMinMax but I suspect that it comes #177

# h5file[path + key].attrs['dtype'] = item.dtype.str

# single pieces of data
elif isinstance(item, (str, bytes, int, float, np.int, np.int8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use parent classes if you don't want to add them all manually (e.g. replacing np.int, ..., np.int64 with np.integer), but there are some differences (for instance unsigned integers):
https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html

tslearn/tests/test_serialize_models.py Outdated Show resolved Hide resolved
@johannfaouzi
Copy link
Contributor

@rtavenar Do you know why the CI is displayed as successful, although it is not? #177 was merged although there are some errors...
https://travis-ci.org/rtavenar/tslearn/builds/634449123

@rtavenar
Copy link
Member

@johannfaouzi Yep, I noticed that this morning and updated dev branch accordingly, this is why I told @kushalkolar not to worry about these TimeSeriesScalerMinMax errors (they will be fixed by merging to dev).

Copy link
Member

@rtavenar rtavenar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work!

I have few minor comments, and the tests on Travis CI seem to pass, so we could target a merge pretty soon.

Also, once the changes are made, could you update CHANGELOG.md to document that your new functionality has been added to tslearn since version 0.3.0?

Thanks,
Romain

PS: let me know when you are done with integrating new models to the list

tslearn/bases.py Outdated Show resolved Hide resolved
tslearn/bases.py Outdated Show resolved Hide resolved
tslearn/tests/test_serialize_models.py Outdated Show resolved Hide resolved
def test_global_alignment_kernel_kmeans():
seed = 0
numpy.random.seed(seed)
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it would be better practice to rely on fully synthetic data (generated from numpy.random.rand for example, with adequate dimensions) so that you fully control the shapes and do not rely on other parts of the code for the test.

Same comment holds for tests related to all estimators.

@kushalkolar
Copy link
Contributor Author

Thanks for the help!

Almost finished with this, a few questions:

If I understand GlobalAlignmentKernelKMeans correctly, _X_fit is required for the model in order to make future predictions? Unlike TimeSeriesKMeans and KShape.

KNeighborsTimeSeries also doesn't seem to make sense without the original _X_fit.

Now TimeSeriesKMeans and KShape do not store _X_fit in the serialized model since some users might have used a very large training dataset.

Continuing to add this to all the models right now.

@rtavenar
Copy link
Member

Yes, for these two models, the training data needs to be known.

But your point is important: could you, for these models, have a Note section in the class docstring that states that saving these models to disk can be space-consuming due to the need to store training data as a model attribute?

serializability
2. small changes for _check_not_fitted and better formatting of
_check_params_predict
use for models that don't inherit from BaseEstimator
@kushalkolar
Copy link
Contributor Author

kushalkolar commented Jan 20, 2020

I'm having a weird issue with KNeighborsTimeSeries https://travis-ci.org/rtavenar/tslearn/jobs/639667444#L797

KNeighborsTimeSeries does not define a _ts_fit anywhere, and line 250 of tslearn.neighbors is a docstring

I'm not able to reproduce this with pytest on my local machine.

edit: I also changed BaseModelPackage so it no longer inherits from BaseEstimator (to prevent conflict with KNN classes that inherit from sklearn.neighbors.NeighborsBase which inherits from BaseEstimator)

@rtavenar
Copy link
Member

Hi,

The thing is that _ts_fit is only set in some cases (depending on the metric). Maybe one way to test if the model has been fit is to have a check similar to what is done in kneighbors:

        if self.metric in VARIABLE_LENGTH_METRICS:
            check_is_fitted(self, '_ts_fit')
        else:
            check_is_fitted(self, '_X_fit')

Similarly, the list of attributes to be saved should depend on the same condition

@johannfaouzi
Copy link
Contributor

johannfaouzi commented Jan 21, 2020

I can't reproduce this either locally.

But looking at the error, we see that the value of the metric is 'precomputed' (for the sm estimator)
https://travis-ci.org/rtavenar/tslearn/jobs/639667446#L827-L828

but for the knn estimator, knn is defined here:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/tests/test_serialize_models.py#L197
so the metric parameter should have the default value, that is 'dtw'.

Edit:

Nevermind, self.metric is changed:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L342-L344

but it's changed again here:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L359-L360

That's a bit messy... usually the values provided when the instance is created should never be changed, and other (private) attributes should be used.

@johannfaouzi
Copy link
Contributor

Well I'm confused, the loaded estimator should not have the 'precomputed' value for metric but 'dtw' instead and this is why this error is raised.

@johannfaouzi
Copy link
Contributor

johannfaouzi commented Jan 21, 2020

Can't promise it will work but you can try to change the code so that self.metric is never changed:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L342-L344

https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L359-L360

However, this is only for Classifier and Regressor, which does not seem to be concerned with the current failing test...

Though I don't get why the error is raised from either Classifier and Regressor although KneighborsTimeSeries is the only estimator used.

Because check_is_fitted(self, '_ts_fit') is only called in ...Classifier and ...Regressor.

@johannfaouzi
Copy link
Contributor

johannfaouzi commented Jan 21, 2020

Maybe a hint about where this issue comes from:
https://github.com/rtavenar/tslearn/blob/7f46f489475c8bc4e6d9289763103bccbe2c235e/tslearn/neighbors.py#L250-L254

It looks like the executed code comes from the dev branch. I don't know which branch is ahead/behind, but doing a git pull or a git rebase may solve this, because the number of modified files is different from this PR only.

@kushalkolar
Copy link
Contributor Author

@johannfaouzi Thanks! I tried a git rebase and it got quite messy. I think I'll just move my changes to a new PR that branches from dev.

@kushalkolar kushalkolar mentioned this pull request Jan 21, 2020
4 tasks
@kushalkolar
Copy link
Contributor Author

Closing because #183

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants