[WIP] Save models to hdf5 and other formats #160

kushalkolar · 2019-10-15T11:20:51Z

Hi,

I thought it would be useful to save the KShape model without pickling. I implemented a simple to_hdf5() method for saving a KShape model to an hdf5 file and from_hdf5() for reloading it so that predictions can be done with the model.

Changes to the KShape class:

the class attribute "model_attrs" is a list of attributes that are sufficient to describe the model.
to_dict() method packages the model attributes and params to a dict.
to_hdf5() and from_hdf() can be used to save/load the model to/from hdf5 files.
put instance attributes in constructor

An hdftools module is added to handle saving a dict of numpy arrays to an hdf file.

Usage:

ks.to_hdf5('/path/to/file.h5')
model = KShape.from_hdf5('path/to/file.h5')

MAINT Include test folder in wheels

MAINT More fixes to include test/ folder in sdist

pep8speaks · 2019-10-15T11:20:53Z

Hello @kushalkolar! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file tslearn/bases.py:

Line 60:80: E501 line too long (80 > 79 characters)

In the file tslearn/clustering.py:

Line 247:80: E501 line too long (81 > 79 characters)

In the file tslearn/hdftools.py:

Line 61:17: E722 do not use bare 'except'

Comment last updated at 2020-01-20 21:23:44 UTC

rtavenar · 2019-10-17T15:38:55Z

Hi @kushalkolar

Thank you very much for your suggestion.
My first question would be: do you think this feature is specific to KShape or should it be (whenever possible) provided for any model is tslearn?
And the second question would be: how is this handled in sklearn?

kushalkolar · 2019-10-19T08:10:27Z

I think it would be useful for any tslearn model where it makes sense.

Using the structure shown here for KShape, I can create a base class that uses a model_attrs list with a to_dict() method for packaging models . The dict can then be used to export to various formats, hdf, json etc. This should work for all tslearn models where a collection of instance attributes are sufficient to describe the model.

pickling is the only answer I could find for how sklearn does it (and that's what their docs say too https://scikit-learn.org/stable/modules/model_persistence.html). Though there are issues & limitations with pickling of course (some are outlined in their docs).

rtavenar · 2019-10-19T09:15:59Z

Yes, it could be a great idea to have a base class for serializable models so that code does not have to be duplicated everywhere!

rtavenar · 2019-10-23T06:50:04Z

Ideally, this could be a more general solution than #57 for example.

kushalkolar · 2019-10-23T13:48:00Z

Sorry I haven't had time to work on implementing this yet. Yes I will definitely try to make a generalized solution to this.

rtavenar · 2020-01-20T10:49:27Z

@kushalkolar : note that you still have issues with tests:

failing test in TimeSeriesScalerMinMax is not your fault and has been fixed in dev, so no worry for this one
you seem to have a Python2.7 compatibility issue
float128 seem to be only accessible with python 3.6+

https://travis-ci.org/rtavenar/tslearn/builds/639309691

johannfaouzi

There are a few changes still needed to make it compatible on all supported versions, but it's almost done!

There are some docstring errors for TimeSeriesMinMax but I suspect that it comes #177

johannfaouzi · 2020-01-20T12:42:28Z

tslearn/hdftools.py

+                # h5file[path + key].attrs['dtype'] = item.dtype.str
+
+        # single pieces of data
+        elif isinstance(item, (str, bytes, int, float, np.int, np.int8,


You can use parent classes if you don't want to add them all manually (e.g. replacing np.int, ..., np.int64 with np.integer), but there are some differences (for instance unsigned integers):
https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html

tslearn/tests/test_serialize_models.py

johannfaouzi · 2020-01-20T12:54:50Z

@rtavenar Do you know why the CI is displayed as successful, although it is not? #177 was merged although there are some errors...
https://travis-ci.org/rtavenar/tslearn/builds/634449123

rtavenar · 2020-01-20T13:18:10Z

@johannfaouzi Yep, I noticed that this morning and updated dev branch accordingly, this is why I told @kushalkolar not to worry about these TimeSeriesScalerMinMax errors (they will be fixed by merging to dev).

2. Organize test_serialize_models to have a generalized params_predict_check function.

_X_fit so that _X_fit doesn't have to be carried over when saving the model

… hdf5tools

renamed some things in test_serialize_models

rtavenar

This is great work!

I have few minor comments, and the tests on Travis CI seem to pass, so we could target a merge pretty soon.

Also, once the changes are made, could you update CHANGELOG.md to document that your new functionality has been added to tslearn since version 0.3.0?

Thanks,
Romain

PS: let me know when you are done with integrating new models to the list

tslearn/bases.py

tslearn/tests/test_serialize_models.py

rtavenar · 2020-01-20T16:49:35Z

tslearn/tests/test_serialize_models.py

+def test_global_alignment_kernel_kmeans():
+    seed = 0
+    numpy.random.seed(seed)
+    X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")


Here, it would be better practice to rely on fully synthetic data (generated from numpy.random.rand for example, with adequate dimensions) so that you fully control the shapes and do not rely on other parts of the code for the test.

Same comment holds for tests related to all estimators.

kushalkolar · 2020-01-20T17:01:20Z

Thanks for the help!

Almost finished with this, a few questions:

If I understand GlobalAlignmentKernelKMeans correctly, _X_fit is required for the model in order to make future predictions? Unlike TimeSeriesKMeans and KShape.

KNeighborsTimeSeries also doesn't seem to make sense without the original _X_fit.

Now TimeSeriesKMeans and KShape do not store _X_fit in the serialized model since some users might have used a very large training dataset.

Continuing to add this to all the models right now.

rtavenar · 2020-01-20T17:08:03Z

Yes, for these two models, the training data needs to be known.

But your point is important: could you, for these models, have a Note section in the class docstring that states that saving these models to disk can be space-consuming due to the need to store training data as a model attribute?

serializability 2. small changes for _check_not_fitted and better formatting of _check_params_predict

use for models that don't inherit from BaseEstimator

kushalkolar · 2020-01-20T21:48:46Z

I'm having a weird issue with KNeighborsTimeSeries https://travis-ci.org/rtavenar/tslearn/jobs/639667444#L797

KNeighborsTimeSeries does not define a _ts_fit anywhere, and line 250 of tslearn.neighbors is a docstring

I'm not able to reproduce this with pytest on my local machine.

edit: I also changed BaseModelPackage so it no longer inherits from BaseEstimator (to prevent conflict with KNN classes that inherit from sklearn.neighbors.NeighborsBase which inherits from BaseEstimator)

rtavenar · 2020-01-21T08:35:54Z

Hi,

The thing is that _ts_fit is only set in some cases (depending on the metric). Maybe one way to test if the model has been fit is to have a check similar to what is done in kneighbors:

        if self.metric in VARIABLE_LENGTH_METRICS:
            check_is_fitted(self, '_ts_fit')
        else:
            check_is_fitted(self, '_X_fit')

Similarly, the list of attributes to be saved should depend on the same condition

johannfaouzi · 2020-01-21T08:39:20Z

I can't reproduce this either locally.

But looking at the error, we see that the value of the metric is 'precomputed' (for the sm estimator)
https://travis-ci.org/rtavenar/tslearn/jobs/639667446#L827-L828

but for the knn estimator, knn is defined here:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/tests/test_serialize_models.py#L197
so the metric parameter should have the default value, that is 'dtw'.

Edit:

Nevermind, self.metric is changed:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L342-L344

but it's changed again here:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L359-L360

That's a bit messy... usually the values provided when the instance is created should never be changed, and other (private) attributes should be used.

johannfaouzi · 2020-01-21T09:20:07Z

Well I'm confused, the loaded estimator should not have the 'precomputed' value for metric but 'dtw' instead and this is why this error is raised.

johannfaouzi · 2020-01-21T09:32:07Z

Can't promise it will work but you can try to change the code so that self.metric is never changed:
https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L342-L344

https://github.com/kushalkolar/tslearn/blob/e853138624a9c77f42976ee9df6b6ae22a2b226f/tslearn/neighbors.py#L359-L360

However, this is only for Classifier and Regressor, which does not seem to be concerned with the current failing test...

Though I don't get why the error is raised from either Classifier and Regressor although KneighborsTimeSeries is the only estimator used.

Because check_is_fitted(self, '_ts_fit') is only called in ...Classifier and ...Regressor.

johannfaouzi · 2020-01-21T11:29:06Z

Maybe a hint about where this issue comes from:
https://github.com/rtavenar/tslearn/blob/7f46f489475c8bc4e6d9289763103bccbe2c235e/tslearn/neighbors.py#L250-L254

It looks like the executed code comes from the dev branch. I don't know which branch is ahead/behind, but doing a git pull or a git rebase may solve this, because the number of modified files is different from this PR only.

kushalkolar · 2020-01-21T19:11:09Z

@johannfaouzi Thanks! I tried a git rebase and it got quite messy. I think I'll just move my changes to a new PR that branches from dev.

kushalkolar · 2020-01-23T04:39:53Z

Closing because #183

rtavenar and others added 24 commits September 12, 2019 22:02

Update variablelength.rst

affe939

Update variablelength.rst

95a3d3c

Install requirements for PyPI changed

ccca697

Merge branch 'master' of https://github.com/rtavenar/tslearn

eaaf444

Create issue templates

5f90e28

Update issue templates

c0a335a

Update CONTRIBUTING.md

6ff911a

Create CODE_OF_CONDUCT.md

c67438b

C++ build tools details

80b6b5a

Merge branch 'master' of https://github.com/rtavenar/tslearn

6ae9112

Include tests folder in wheels

3675e18

Fix imports

447b9a3

Skip shapelets tests if keras / tensorflow is not installed

b391b2c

Version 0.2.4

2e82ae2

Merge pull request #154 from rth/include-tests-in-wheels

33b27e9

MAINT Include test folder in wheels

Update CHANGELOG.md

6742485

Update shapelets.py

bd7f6e6

Update shapelets.py

b840e5b

Update requirements_rtd.txt

8c0875b

Update requirements_rtd.txt

1254f7e

Update requirements.txt

39ad33b

More fixes to include test/ folder in sdist

86f297f

Version 0.2.5

b7499aa

Merge pull request #159 from rth/more-fixes-setup

1721482

MAINT More fixes to include test/ folder in sdist

make sure exception raised if not fitted and user tries to serialize

74c5890

johannfaouzi reviewed Jan 20, 2020

View reviewed changes

kushalkolar added 8 commits January 20, 2020 14:30

hdf5 doesn't store Nones, so store 'None' as a string

ebceb85

1. Add serialization to TimeSeriesKMeans and corresponding test.

ede5a33

2. Organize test_serialize_models to have a generalized params_predict_check function.

KShape uses cluster_centers_ to check_dims for predict() instead of

b3feca4

_X_fit so that _X_fit doesn't have to be carried over when saving the model

remove some dtypes (float128, builtin int, builtin float, bytes) from…

d40069e

… hdf5tools

more python 2 compatability

12a1c8d

more python2.7 compatability

f589e59

GAK uses BaseModelPackage

d0c1e06

renamed some things in test_serialize_models

changed ValueError to NotFittedError

2074e35

rtavenar reviewed Jan 20, 2020

View reviewed changes

kushalkolar added 4 commits January 20, 2020 18:33

1. KNeighborsTimeSeries inherits from BaseModelPackage for

600b8ff

serializability 2. small changes for _check_not_fitted and better formatting of _check_params_predict

standardize docstrings

0109af7

BaseModelPackage doesn't inherit from BaseEstimator to make it easier to

d775c62

use for models that don't inherit from BaseEstimator

knn ts classifier uses BaseModelPackage and has a test

e853138

kushalkolar mentioned this pull request Jan 21, 2020

[MRG] Serializing models #183

Merged

4 tasks

kushalkolar closed this Jan 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Save models to hdf5 and other formats #160

[WIP] Save models to hdf5 and other formats #160

kushalkolar commented Oct 15, 2019

pep8speaks commented Oct 15, 2019 •

edited

rtavenar commented Oct 17, 2019

kushalkolar commented Oct 19, 2019 •

edited

rtavenar commented Oct 19, 2019

rtavenar commented Oct 23, 2019

kushalkolar commented Oct 23, 2019

rtavenar commented Jan 20, 2020

johannfaouzi left a comment •

edited

johannfaouzi Jan 20, 2020

johannfaouzi commented Jan 20, 2020

rtavenar commented Jan 20, 2020

rtavenar left a comment

rtavenar Jan 20, 2020

kushalkolar commented Jan 20, 2020

rtavenar commented Jan 20, 2020

kushalkolar commented Jan 20, 2020 •

edited

rtavenar commented Jan 21, 2020

johannfaouzi commented Jan 21, 2020 •

edited

johannfaouzi commented Jan 21, 2020

johannfaouzi commented Jan 21, 2020 •

edited

johannfaouzi commented Jan 21, 2020 •

edited

kushalkolar commented Jan 21, 2020

kushalkolar commented Jan 23, 2020

[WIP] Save models to hdf5 and other formats #160

[WIP] Save models to hdf5 and other formats #160

Conversation

kushalkolar commented Oct 15, 2019

pep8speaks commented Oct 15, 2019 • edited

Comment last updated at 2020-01-20 21:23:44 UTC

rtavenar commented Oct 17, 2019

kushalkolar commented Oct 19, 2019 • edited

rtavenar commented Oct 19, 2019

rtavenar commented Oct 23, 2019

kushalkolar commented Oct 23, 2019

rtavenar commented Jan 20, 2020

johannfaouzi left a comment • edited

Choose a reason for hiding this comment

johannfaouzi Jan 20, 2020

Choose a reason for hiding this comment

johannfaouzi commented Jan 20, 2020

rtavenar commented Jan 20, 2020

rtavenar left a comment

Choose a reason for hiding this comment

rtavenar Jan 20, 2020

Choose a reason for hiding this comment

kushalkolar commented Jan 20, 2020

rtavenar commented Jan 20, 2020

kushalkolar commented Jan 20, 2020 • edited

rtavenar commented Jan 21, 2020

johannfaouzi commented Jan 21, 2020 • edited

johannfaouzi commented Jan 21, 2020

johannfaouzi commented Jan 21, 2020 • edited

johannfaouzi commented Jan 21, 2020 • edited

kushalkolar commented Jan 21, 2020

kushalkolar commented Jan 23, 2020

pep8speaks commented Oct 15, 2019 •

edited

kushalkolar commented Oct 19, 2019 •

edited

johannfaouzi left a comment •

edited

kushalkolar commented Jan 20, 2020 •

edited

johannfaouzi commented Jan 21, 2020 •

edited

johannfaouzi commented Jan 21, 2020 •

edited

johannfaouzi commented Jan 21, 2020 •

edited