
Add initial guess as centroid #59

Merged
merged 7 commits into from Feb 7, 2019

Conversation

gandroz
Contributor

@gandroz gandroz commented Aug 9, 2018

Following issue #58, here is a proposal to improve clustering (only for the KShape method for now) by letting the user provide an initial guess for the centroids. This guess is a numpy array of ints, the indices of the samples to be used as centroids instead of a random vector.

@GillesVandewiele
Contributor

Hello, great work! Small suggestion: wouldn't it be better to let the user provide initial centroids, instead of indices? That way, you are not bound to have the initial centroids set to time series that are WITHIN your dataset. The suggested approach works well for your specific use case, where you know a sample of each cluster, but imagine you know more than one sample per cluster: then you could calculate a centroid of these samples and provide it to the algorithm.
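For illustration, averaging several known samples of a cluster into one initial centroid could look like this (a hypothetical sketch with made-up data, not the tslearn API; assumes equal-length univariate series):

```python
import numpy as np

# Hypothetical sketch: if several samples of the same cluster are
# known, average them into one initial centroid instead of passing a
# single sample index. Shape: (n_known_samples, sz).
known_samples = np.array([[1.0, 2.0, 3.0],
                          [1.2, 2.1, 2.9],
                          [0.8, 1.9, 3.1]])
initial_centroid = known_samples.mean(axis=0)  # element-wise mean
print(initial_centroid)  # -> [1. 2. 3.]
```

The resulting array could then be handed to the clustering algorithm directly, without requiring it to coincide with any single series in the dataset.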

@gandroz
Contributor Author

gandroz commented Aug 9, 2018

You're right, it is just a "quick win" I implemented on my side. Your suggestion is interesting, and it was my first thought too, but I did not want to deal with time series of different lengths, for example. Normalization would also have been a bit more difficult, as we have to normalize the initial guess in the same manner as X_train is normalized, but not too difficult.

I could check that on my side if you want?

@GillesVandewiele
Contributor

Does the algorithm currently support time series of variable length? I do not really know the KShape algorithm that well... It does not seem to give an error when I try it in my Python shell with a simple example.

Indeed, I did not think of the normalization, but it should be easy to fix if we decouple the line at f4db8cd#diff-79bececde0dc56f7672e4cedbaa36832R750 such that we keep a reference to the TimeSeriesScalerMeanVariance object; then we can access the required data to normalize the provided initial centroids as well.

@GillesVandewiele
Contributor

And on another note: maybe a few asserts should be added, so that a fitting exception can be raised? One that comes to mind straight away is that the number of initial centroids must be equal to n_clusters. But maybe other assertions concerning the length of these centroids could be made as well.
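Such a check might be sketched like this (an illustrative helper with made-up names, not the actual tslearn code):

```python
import numpy as np

def check_init_centroids(init, n_clusters, sz, d):
    """Hypothetical validation helper: fail early with a clear message
    instead of a cryptic error deep inside fit()."""
    init = np.asarray(init)
    if init.shape[0] != n_clusters:
        raise ValueError("%d initial centroids provided, but "
                         "n_clusters=%d" % (init.shape[0], n_clusters))
    if init.shape[1:] != (sz, d):
        raise ValueError("each initial centroid must have shape "
                         "(%d, %d), got %s" % (sz, d, str(init.shape[1:])))
    return init
```

A mismatched first dimension (or a centroid of the wrong length) then raises a `ValueError` before any fitting starts.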

@gandroz
Contributor Author

gandroz commented Aug 9, 2018

Normalization is not a big deal indeed. I'll fix that. And yes, KShape supports time series of variable length, as it uses the method to_time_series_dataset and not to_time_series.
There is already an assertion on the length of the initial guess at line 753.

And a question for you: the tests failed on Travis. Is there a way I can run them before pushing?

@GillesVandewiele
Contributor

GillesVandewiele commented Aug 9, 2018

My apologies, I overlooked that! It seems like the doctest of clustering.py is failing...

You can check the logs here; to run the doctest, just run python3 -m doctest -v clustering.py in the tslearn directory.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

Hi,
I've just pushed some changes as you suggested. You can now use an initial centroids array instead of an array of indices. I tested it on my case and it works as expected.
I also modified the preprocessing code by vectorizing it more, so that it is quicker now.

@GillesVandewiele
Contributor

Great work mate! Seems like the doctests are passing as well. Looks like a great addition to me!

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

You're welcome, it is really a nice package btw!

@GillesVandewiele
Contributor

All props to @rtavenar. I haven't contributed yet to this repository. But I completely agree, this package is great...

Hmmm Travis seems to have failed, while it was successful for python 2.7 🤔 Do the doctests work locally for you?

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

Yes it does, but I use Python 2.7. Seems like it is a build issue:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

@GillesVandewiele
Contributor

Yeah, but it's strange that it takes longer than 10 minutes... Other Python 3 builds mostly take around 5 minutes or so... Do you have Python 3 installed? Could you possibly run the doctests with Python 3?

On the other hand, the Python 3.5 build just went terribly: it couldn't even install Python 3.5.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

The issue is really at the build time of the Docker image. I could try the doctest with Python 3.5, but the test does not even get that far.

@GillesVandewiele
Contributor

Yeah, exactly, something really weird. I'm kinda clueless as well :/. In the other builds, it did seem to be executing the nosetests though.

$ nosetests $KERAS_IGNORE --with-doctest --with-coverage --cover-erase --cover-package=tslearn
Using TensorFlow backend.
...........
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

I think those dots correspond to successful tests.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

nosetests is executed, but it produced no output, so Travis failed. It only took 87s on Python 2.7, so I can't imagine that more than 10 minutes on Python 3.5 is normal behaviour.

The tests passed for Python 3.5, but the doctest option NORMALIZE_WHITESPACE does not seem to work. Two tests are reported as failing even though the result is the same as expected:

Expected:
array([[[ 1. ],
[ 1.5],
[ 2. ]]])
Got:
array([[[1. ],
[1.5],
[2. ]]])
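For what it's worth, the behaviour can be reproduced with doctest's own OutputChecker: NORMALIZE_WHITESPACE collapses runs of whitespace but cannot match a space against *no* space, which is exactly the difference between the two array reprs above (likely caused by a change in numpy's array printing):

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.NORMALIZE_WHITESPACE

# Runs of whitespace are treated as equivalent to a single space...
print(checker.check_output("a    b\n", "a b\n", flags))        # True

# ...but a space cannot match *no* space, so the expected "[[ 1. ]]"
# still differs from the obtained "[[1. ]]" even with the flag set.
print(checker.check_output("[[ 1. ]]\n", "[[1. ]]\n", flags))  # False
```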

Member

@rtavenar rtavenar left a comment


This is a very interesting PR. I've added some comments to make it suitable for merging.

Also, wouldn't it make sense to add a similar argument for TimeSeriesKMeans class?


-    def fit_transform(self, X, **kwargs):
+    def fit_transform(self, X, global_stats=False, **kwargs):
Member

Hum, global_stats argument is never used afterwards, is it?

If not, you should remove it, I guess.

Contributor

I think we actually made a reasoning mistake and assumed the normalization happened dataset-wise; I think the global_stats argument and self.global_mean + self.global_std stem from that. Of course, this becomes unnecessary since the normalization happens timeseries-wise. We just have to use the transform method of the SAME normalizer on both the provided X input array and the centroids.

Important to note in the documentation: the centroids must not be normalized by the user beforehand.
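A minimal sketch of that idea in plain numpy (the mean/variance scaling here is per-series, so "same normalizer" simply means applying the identical transform; `z_normalize` is a stand-in for tslearn's TimeSeriesScalerMeanVariance, not the actual code):

```python
import numpy as np

def z_normalize(dataset):
    """Per-series z-normalization on a dataset of shape
    (n_series, sz, d), mimicking TimeSeriesScalerMeanVariance(mu=0., std=1.)."""
    mean = dataset.mean(axis=1, keepdims=True)
    std = dataset.std(axis=1, keepdims=True)
    std[std == 0.0] = 1.0  # avoid division by zero on constant series
    return (dataset - mean) / std

rng = np.random.RandomState(0)
X = rng.randn(10, 20, 1)               # raw input dataset
init_centroids = rng.randn(3, 20, 1)   # raw user-provided centroids

# Apply the SAME transform to the data and to the centroids, so the
# centroids live in the same space as the normalized series. This is
# why users must pass raw (un-normalized) centroids.
X_ = z_normalize(X)
init_ = z_normalize(init_centroids)
```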

"""

X_ = to_time_series_dataset(X)
self._norms = numpy.linalg.norm(X_, axis=(1, 2))
X_ = TimeSeriesScalerMeanVariance(mu=0., std=1.).fit_transform(X_)
Member

Why is this line removed?

Contributor

Seems like the code has changed to

self._norms = numpy.linalg.norm(X_, axis=(1, 2)) and self._norms_centroids = numpy.linalg.norm(self.cluster_centers_, axis=(1, 2)). I would also use the Normalizers from tslearn instead of using the numpy norm function...

@@ -704,10 +704,13 @@ def _assign(self, X):
_check_no_empty_cluster(self.labels_, self.n_clusters)
self.inertia_ = _compute_inertia(dists, self.labels_)

-    def _fit_one_init(self, X, rs):
+    def _fit_one_init(self, X, initial_centroids, rs):
Member

Would it be possible to rename this argument as init and more or less match sklearn usage (cf. https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/cluster/k_means_.py#L707)?

if cur_std == 0.:
cur_std = 1.
X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
mean_t = numpy.mean(X_, axis=1)[:, numpy.newaxis, :]
Contributor

I tried converting the code to numpy in the past and discovered that it actually ran slower than a pure-Python loop. Maybe this should be tested more extensively, and if it indeed runs slower, we should convert back to the old code.

Member

@GillesVandewiele : is it still slower for datasets with many time series? Because it definitely looks like a better solution than the original one in theory :)

Contributor

I mentioned it in Issue #52 right before it got closed, so maybe you overlooked it. My code was not exactly the same as this:

X_ = to_time_series_dataset(X)
# keepdims=True so the (n, 1, d) stats broadcast against (n, sz, d)
means = np.mean(X_, axis=1, keepdims=True)
stds = np.std(X_, axis=1, keepdims=True)
stds[stds == 0] = 1
return (X_ - means) * self.std_ / stds + self.mu_

This seemed to run slower when I did some simple tests, but maybe should be double checked. It is definitely cleaner code!

Contributor

This conversion to numpy can btw also easily be done for the MinMaxScaler!

    def fit_transform(self, X, **kwargs):
        """Fit to data, then transform it.

        Parameters
        ----------
        X : array-like
            Time series dataset to be rescaled.

        Returns
        -------
        numpy.ndarray
            Rescaled time series dataset.
        """
        X_ = to_time_series_dataset(X)
        # keepdims=True so the (n, 1, d) stats broadcast against (n, sz, d)
        mins = np.min(X_, axis=1, keepdims=True)
        maxs = np.max(X_, axis=1, keepdims=True)
        ranges = maxs - mins
        ranges[ranges == 0] = 1
        return (X_ - mins) * (self.max_ - self.min_) / ranges + self.min_

Member

@rtavenar rtavenar Aug 17, 2018

Yep, but in Issue #52 you tested on a single long time series, which did not benefit from removing the loop. So hopefully with many time series it will be better!

Test:

import time
import numpy
from tslearn.preprocessing import TimeSeriesScalerMinMax

time_series = numpy.random.randn(100000, 100, 10)
t0 = time.time()
TimeSeriesScalerMinMax().fit_transform(time_series)
t1 = time.time()
TimeSeriesScalerMinMaxTuned().fit_transform(time_series)
t2 = time.time()

print(t1 - t0, t2 - t1)

With TimeSeriesScalerMinMaxTuned being the proposed implementation.

Gives:

12.474971771240234 5.250811815261841

So, good idea :)

Contributor

You are right! Mistake from my side! I benched it myself now as well :)

(attached benchmark plots: numpy_times, python_times)

@rtavenar
Member

The good news is: tests seem to have failed because of an issue on Travis CI, now they pass for all configs!

@gandroz
Contributor Author

gandroz commented Aug 17, 2018 via email

@gandroz
Contributor Author

gandroz commented Oct 19, 2018

Before adding the initial guess to TimeSeriesKMeans, we tweaked the KShape method to shift the centroid back after computation of the eigenvectors, to avoid a cumulative shift of the centroid and allow further optimization of the centroids. Visually, all centroids are better aligned with the samples. Performance-wise, I suspect (not fully verified) that because the centroids are shifted back, the cross-correlation is performed on a wider vector, so the NCC measure might be more representative.

@rtavenar rtavenar merged commit 5111c18 into tslearn-team:master Feb 7, 2019
@rtavenar
Member

rtavenar commented Feb 7, 2019

I just merged it, but did not take the trick for KShape into account: for me, it's unrelated to initial centroids, so I prefer to keep things separate.
