
Add initial guess as centroid #59

Merged
merged 7 commits into from Feb 7, 2019

Conversation

gandroz
Contributor

@gandroz gandroz commented Aug 9, 2018

Following issue #58, here is a proposal to improve clustering (only for the KShape method for now) by letting the user provide an initial guess for the centroids. This guess is a numpy array of ints, the indices of the samples to be used as centroids instead of a random vector.

@GillesVandewiele
Contributor

Hello, great work! Small suggestion: wouldn't it be better to let the user provide initial centroids, instead of indices? That way, you are not bound to have the initial centroids set to time series that are WITHIN your dataset. The suggested approach works well for your specific use case, where you know a sample of each cluster, but imagine you know more than one sample per cluster: then you could calculate a centroid of these samples and provide it to the algorithm.
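For illustration, averaging several known samples of a cluster into one initial centroid could look like this (a hypothetical sketch with made-up data, not the tslearn API; assumes equal-length univariate series):

```python
import numpy as np

# Hypothetical sketch: if several samples of the same cluster are
# known, average them into one initial centroid instead of passing a
# single sample index. Shape: (n_known_samples, sz).
known_samples = np.array([[1.0, 2.0, 3.0],
                          [1.2, 2.1, 2.9],
                          [0.8, 1.9, 3.1]])
initial_centroid = known_samples.mean(axis=0)  # element-wise mean
print(initial_centroid)  # -> [1. 2. 3.]
```

The resulting array could then be handed to the clustering algorithm directly, without requiring it to coincide with any single series in the dataset.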

@gandroz
Contributor Author

gandroz commented Aug 9, 2018

You're right, it is just a "quick win" I implemented on my side. Your suggestion is interesting, and it was my first thought too, but I did not want to deal with time series of different lengths, for example. Normalization would also have been a bit more difficult, as we have to normalize the initial guess in the same manner as X_train is normalized, but not too difficult.

I could check that on my side if you want?

@GillesVandewiele
Contributor

Does the algorithm currently support time series of variable length? I do not really know the KShape algorithm that well... It does not seem to give an error when I try it in my Python shell with a simple example.

Indeed, I did not think of the normalization, but it should be easy to fix if we decouple the line at f4db8cd#diff-79bececde0dc56f7672e4cedbaa36832R750 such that we keep a reference to the TimeSeriesScalerMeanVariance object; then we can access the required data to normalize the provided initial centroids as well.

@GillesVandewiele
Contributor

And on another note: maybe a few asserts should be added, so that a fitting exception can be raised? One that comes to mind straight away is that the number of initial centroids must be equal to n_clusters. But maybe other assertions concerning the length of these centroids could be made as well.
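Such a check might be sketched like this (an illustrative helper with made-up names, not the actual tslearn code):

```python
import numpy as np

def check_init_centroids(init, n_clusters, sz, d):
    """Hypothetical validation helper: fail early with a clear message
    instead of a cryptic error deep inside fit()."""
    init = np.asarray(init)
    if init.shape[0] != n_clusters:
        raise ValueError("%d initial centroids provided, but "
                         "n_clusters=%d" % (init.shape[0], n_clusters))
    if init.shape[1:] != (sz, d):
        raise ValueError("each initial centroid must have shape "
                         "(%d, %d), got %s" % (sz, d, str(init.shape[1:])))
    return init
```

A mismatched first dimension (or a centroid of the wrong length) then raises a `ValueError` before any fitting starts.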

@gandroz
Contributor Author

gandroz commented Aug 9, 2018

Normalization is not a big deal indeed. I'll fix that. And yes, KShape supports time series of variable length, as it uses the method to_time_series_dataset and not to_time_series.
There is already an assertion on the length of the initial guess at line 753.

And a question for you: the tests failed on Travis. Is there a way I can run them before pushing?

@GillesVandewiele
Contributor

GillesVandewiele commented Aug 9, 2018

My apologies, I overlooked that! It seems like the doctest of clustering.py is failing...

You can check the logs here; to run the doctest, just run python3 -m doctest -v clustering.py in the tslearn directory.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

Hi,
I've just pushed some changes as you suggested. You can now use an initial centroids array instead of an array of indices. I tested it on my case and it works as expected.
I also modified the preprocessing code by vectorizing it more, so that it is quicker now.

@GillesVandewiele
Contributor

Great work mate! Seems like the doctests are passing as well. Looks like a great addition to me!

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

You're welcome, it is really a nice package btw!

@GillesVandewiele
Contributor

All props to @rtavenar. I haven't contributed yet to this repository. But I completely agree, this package is great...

Hmmm Travis seems to have failed, while it was successful for python 2.7 🤔 Do the doctests work locally for you?

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

Yes it does, but I use Python 2.7. Seems like it is a build issue:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

@GillesVandewiele
Contributor

Yeah, but it's strange that it takes longer than 10 minutes... Other Python 3 builds mostly take around 5 minutes or so... Do you have Python 3 installed? Could you possibly run the doctests with Python 3?

On the other hand, the Python 3.5 build just went terribly: it couldn't even install Python 3.5.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

The issue is really at the build time of the Docker image. I could try the doctest with Python 3.5, but the test does not even get that far.

@GillesVandewiele
Contributor

Yeah, exactly, something really weird. I'm kinda clueless as well :/. In the other builds, it did seem to be executing the nosetests though.

$ nosetests $KERAS_IGNORE --with-doctest --with-coverage --cover-erase --cover-package=tslearn
Using TensorFlow backend.
...........
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

I think those dots correspond to successful tests.

@gandroz
Contributor Author

gandroz commented Aug 10, 2018

nosetests is executed, but it produced no output, so Travis failed. It only took 87s on Python 2.7, so I can't imagine that more than 10 minutes on Python 3.5 is normal behaviour.

The tests passed for Python 3.5, but the doctest option NORMALIZE_WHITESPACE does not seem to work. Two tests are reported as failing even though the result is the same as expected:

Expected:
array([[[ 1. ],
[ 1.5],
[ 2. ]]])
Got:
array([[[1. ],
[1.5],
[2. ]]])
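For what it's worth, the behaviour can be reproduced with doctest's own OutputChecker: NORMALIZE_WHITESPACE collapses runs of whitespace but cannot match a space against *no* space, which is exactly the difference between the two array reprs above (likely caused by a change in numpy's array printing):

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.NORMALIZE_WHITESPACE

# Runs of whitespace are treated as equivalent to a single space...
print(checker.check_output("a    b\n", "a b\n", flags))        # True

# ...but a space cannot match *no* space, so the expected "[[ 1. ]]"
# still differs from the obtained "[[1. ]]" even with the flag set.
print(checker.check_output("[[ 1. ]]\n", "[[1. ]]\n", flags))  # False
```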

Member

@rtavenar rtavenar left a comment


This is a very interesting PR. I've added some comments to make it suitable for merging.

Also, wouldn't it make sense to add a similar argument for TimeSeriesKMeans class?


-    def fit_transform(self, X, **kwargs):
+    def fit_transform(self, X, global_stats=False, **kwargs):
Member

Hum, global_stats argument is never used afterwards, is it?

If not, you should remove it, I guess.

Contributor

I think we actually made a reasoning mistake and assumed the normalization happened dataset-wise; I think the global_stats argument and self.global_mean + self.global_std stem from that. Of course, this becomes unnecessary since the normalization happens timeseries-wise. We just have to use the transform method of the SAME normalizer on both the provided X input array and the centroids.

Important to note in the documentation: the centroids must not be normalized by the user beforehand.
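A minimal sketch of that idea in plain numpy (the mean/variance scaling here is per-series, so "same normalizer" simply means applying the identical transform; `z_normalize` is a stand-in for tslearn's TimeSeriesScalerMeanVariance, not the actual code):

```python
import numpy as np

def z_normalize(dataset):
    """Per-series z-normalization on a dataset of shape
    (n_series, sz, d), mimicking TimeSeriesScalerMeanVariance(mu=0., std=1.)."""
    mean = dataset.mean(axis=1, keepdims=True)
    std = dataset.std(axis=1, keepdims=True)
    std[std == 0.0] = 1.0  # avoid division by zero on constant series
    return (dataset - mean) / std

rng = np.random.RandomState(0)
X = rng.randn(10, 20, 1)               # raw input dataset
init_centroids = rng.randn(3, 20, 1)   # raw user-provided centroids

# Apply the SAME transform to the data and to the centroids, so the
# centroids live in the same space as the normalized series. This is
# why users must pass raw (un-normalized) centroids.
X_ = z_normalize(X)
init_ = z_normalize(init_centroids)
```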

"""

X_ = to_time_series_dataset(X)
self._norms = numpy.linalg.norm(X_, axis=(1, 2))
X_ = TimeSeriesScalerMeanVariance(mu=0., std=1.).fit_transform(X_)
Member

Why is this line removed?

Contributor

Seems like the code has changed to

self._norms = numpy.linalg.norm(X_, axis=(1, 2)) and self._norms_centroids = numpy.linalg.norm(self.cluster_centers_, axis=(1, 2)). I would also use the Normalizers from tslearn instead of using the numpy norm function...

@@ -704,10 +704,13 @@ def _assign(self, X):
_check_no_empty_cluster(self.labels_, self.n_clusters)
self.inertia_ = _compute_inertia(dists, self.labels_)

-    def _fit_one_init(self, X, rs):
+    def _fit_one_init(self, X, initial_centroids, rs):
Member

Would it be possible to rename this argument as init and more or less match sklearn usage (cf. https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/cluster/k_means_.py#L707)?

if cur_std == 0.:
cur_std = 1.
X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
mean_t = numpy.mean(X_, axis=1)[:, numpy.newaxis, :]
Contributor

I tried converting the code to numpy in the past and discovered that it actually ran slower than a pure-Python loop. Maybe this should be tested more extensively, and if it indeed runs slower, we should convert back to the old code.

Member

@GillesVandewiele : is it still slower for datasets with many time series? Because it definitely looks like a better solution than the original one in theory :)

Contributor

I mentioned it in Issue #52 right before it got closed, so maybe you overlooked it. My code was not exactly the same as this:

X_ = to_time_series_dataset(X)
# keepdims=True so the (n, 1, d) stats broadcast against (n, sz, d)
means = np.mean(X_, axis=1, keepdims=True)
stds = np.std(X_, axis=1, keepdims=True)
stds[stds == 0] = 1
return (X_ - means) * self.std_ / stds + self.mu_

This seemed to run slower when I did some simple tests, but maybe should be double checked. It is definitely cleaner code!

Contributor

This conversion to numpy can btw also easily be done for the MinMaxScaler!

    def fit_transform(self, X, **kwargs):
        """Fit to data, then transform it.

        Parameters
        ----------
        X : array-like
            Time series dataset to be rescaled.

        Returns
        -------
        numpy.ndarray
            Rescaled time series dataset.
        """
        X_ = to_time_series_dataset(X)
        # keepdims=True so the (n, 1, d) stats broadcast against (n, sz, d)
        mins = np.min(X_, axis=1, keepdims=True)
        maxs = np.max(X_, axis=1, keepdims=True)
        ranges = maxs - mins
        ranges[ranges == 0] = 1
        return (X_ - mins) * (self.max_ - self.min_) / ranges + self.min_

Member

@rtavenar rtavenar Aug 17, 2018

Yep, but in Issue #52 you tested on a single long time series, which did not benefit from removing the loop. So hopefully with many time series it will be better!

Test:

import time
import numpy
from tslearn.preprocessing import TimeSeriesScalerMinMax

time_series = numpy.random.randn(100000, 100, 10)
t0 = time.time()
TimeSeriesScalerMinMax().fit_transform(time_series)
t1 = time.time()
TimeSeriesScalerMinMaxTuned().fit_transform(time_series)
t2 = time.time()

print(t1 - t0, t2 - t1)

With TimeSeriesScalerMinMaxTuned being the proposed implementation.

Gives:

12.474971771240234 5.250811815261841

So, good idea :)

Contributor

You are right! Mistake from my side! I benched it myself now as well :)

(attached benchmark plots: numpy_times, python_times)

@rtavenar
Member

The good news is: tests seem to have failed because of an issue on Travis CI, now they pass for all configs!

@gandroz
Contributor Author

gandroz commented Aug 17, 2018 via email

@gandroz
Contributor Author

gandroz commented Oct 19, 2018

Before adding the initial guess to TimeSeriesKMeans, we tweaked the KShape method to shift the centroid back after computation of the eigenvectors, to avoid a cumulative shift of the centroid and allow further optimization of the centroids. Visually, all centroids are better aligned with the samples. Performance-wise, I suspect (not fully verified) that because the centroids are shifted back, the cross-correlation is performed on a wider vector, so the NCC measure might be more representative.

@rtavenar rtavenar merged commit 5111c18 into tslearn-team:master Feb 7, 2019
@rtavenar
Member

rtavenar commented Feb 7, 2019

I just merged it, but did not take the trick for KShape into account: for me, it's unrelated to initial centroids, so I prefer to keep things separate.
