Add initial guess as centroid #59
Conversation
Hello, great work! Small suggestion: wouldn't it be better to let the user provide initial centroids, instead of indices? That way, you are not bound to have the initial centroids set to time series that are WITHIN your dataset. The suggested approach works well for your specific use case, where you know a sample of each cluster, but imagine you knew more than one sample per cluster: then you could calculate a centroid of those samples and provide it to the algorithm.
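To make the suggestion concrete, here is a minimal numpy sketch (with made-up data, not tslearn code) of averaging several known samples of one cluster into an initial centroid that need not coincide with any series in the dataset:

```python
import numpy as np

# Three known samples of the same cluster, in tslearn's
# (n_samples, sz, d) dataset layout (values are made up).
known_samples = np.array([
    [[0.9], [2.1], [3.0]],
    [[1.1], [1.9], [3.1]],
    [[1.0], [2.0], [2.9]],
])

# Averaging the known samples yields one initial centroid of shape (sz, d);
# it is generally not equal to any single series in the dataset.
initial_centroid = known_samples.mean(axis=0)
print(initial_centroid.ravel())  # roughly [1., 2., 3.]
```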
You're right, it is just a "quick win" I implemented on my side. Your suggestion is interesting and that was my first thought, but I did not want to deal with time series of different lengths, for example, and normalization would have been a bit more difficult, as we have to normalize the initial guess in the same manner X_train is normalized. But it is not too difficult. I could check that on my side if you want?
Does the algorithm currently support time series of variable length? I do not really know the KShape algorithm that well... It does not seem to give an error when I try it in my Python shell with a simple example. Indeed, I did not think of the normalization, but it should be easy to fix if we decouple the line in f4db8cd#diff-79bececde0dc56f7672e4cedbaa36832R750 such that we have the
And on another note: maybe a few asserts should be added, such that a fitting exception can be raised? One that comes to mind straight away is that the number of
Normalization is not a big deal indeed, I'll fix that. And yes, KShape supports time series of variable length, as it uses the method. And a question for you: the tests failed on Travis. Is there a way I can run them before pushing?
My apologies, I looked over that! It seems like the doctest of clustering.py is failing... You can check the logs here, and to run the doctest, just run
Hi,
Great work mate! Seems like the doctests are passing as well. Looks like a great addition to me!
You're welcome, it is really a nice package btw!
All props to @rtavenar. I haven't contributed to this repository yet. But I completely agree, this package is great... Hmmm, Travis seems to have failed, while it was successful for Python 2.7 🤔 Do the doctests work locally for you?
Yes it does, but I use Python 2.7. Seems like it is a build issue.
Ye, but it's strange that it takes longer than 10 minutes... Other Python 3 builds mostly take around 5 minutes or so... Do you have Python 3 installed? Could you possibly run the doctests with Python 3? On the other hand, the Python 3.5 build just went terrible; it couldn't even install Python 3.5.
The issue is really at the build time of the Docker image. I could try the doctest with Python 3.5, but the test does not even get that far.
Ye exactly, something really weird. I'm kinda clueless as well :/. In the other builds, it did seem to be executing the nosetests though.
I think those dots correspond to successful tests.
nosetests is executed, but it produced no output, so Travis failed. It only took 87s on Python 2.7, so I can't imagine more than 10 minutes on Python 3.5 is normal behaviour. The tests passed for Python 3.5, but the doctest option
This is a very interesting PR. I've added some comments to make it suitable for merging. Also, wouldn't it make sense to add a similar argument for the `TimeSeriesKMeans` class?
tslearn/preprocessing.py (Outdated)

```diff
-    def fit_transform(self, X, **kwargs):
+    def fit_transform(self, X, global_stats=False, **kwargs):
```
Hum, the `global_stats` argument is never used afterwards, is it? If not, you should remove it, I guess.
I think we actually made a reasoning mistake and thought the normalization happened dataset-wise; I think `global_stats`, `self.global_mean` and `self.global_std` stem from that. Of course, this becomes unnecessary since the normalization happens time-series-wise. We just have to use the `transform` method of the SAME normalizer on both the provided X input array and the centroids. Important to note in the documentation is that the centroids must not be normalized by the user themselves.
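A minimal sketch of that idea (plain numpy with a hypothetical helper, not tslearn's actual code): the exact same per-series z-normalization is applied to the training data and to the user-provided centroids, which is why users should pass raw, un-normalized centroids:

```python
import numpy as np

def znorm_per_series(dataset):
    """Z-normalize each series of a (n, sz, d) dataset independently."""
    mean = dataset.mean(axis=1, keepdims=True)
    std = dataset.std(axis=1, keepdims=True)
    std[std == 0.] = 1.  # guard against constant series
    return (dataset - mean) / std

X_train = np.random.randn(10, 16, 1)
init_centroids = np.random.randn(3, 16, 1)  # raw centroids from the user

# The same transform is applied to both arrays, so pre-normalizing the
# centroids on the user side would effectively normalize them twice.
X_norm = znorm_per_series(X_train)
centroids_norm = znorm_per_series(init_centroids)
```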
""" | ||
|
||
X_ = to_time_series_dataset(X) | ||
self._norms = numpy.linalg.norm(X_, axis=(1, 2)) | ||
X_ = TimeSeriesScalerMeanVariance(mu=0., std=1.).fit_transform(X_) |
Why is this line removed?
Seems like the code has changed to `self._norms = numpy.linalg.norm(X_, axis=(1, 2))` and `self._norms_centroids = numpy.linalg.norm(self.cluster_centers_, axis=(1, 2))`. I would also use the Normalizers from tslearn instead of using the numpy norm function...
tslearn/clustering.py (Outdated)

```diff
@@ -704,10 +704,13 @@ def _assign(self, X):
         _check_no_empty_cluster(self.labels_, self.n_clusters)
         self.inertia_ = _compute_inertia(dists, self.labels_)

-    def _fit_one_init(self, X, rs):
+    def _fit_one_init(self, X, initial_centroids, rs):
```
Would it be possible to rename this argument as `init` and more or less match sklearn usage (cf. https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/cluster/k_means_.py#L707)?
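A sketch of what sklearn-style handling could look like (function and argument names are illustrative, not tslearn's actual API): `init` accepts either the string `'random'` or an explicit array of centroids:

```python
import numpy as np

def resolve_init(init, X, n_clusters, rs):
    """Return initial centroids for a (n, sz, d) dataset X.

    init may be the string 'random' (pick n_clusters series of X at
    random) or an array-like of n_clusters explicit centroids.
    """
    if isinstance(init, str) and init == "random":
        indices = rs.choice(X.shape[0], n_clusters, replace=False)
        return X[indices].copy()
    centroids = np.asarray(init)
    if centroids.shape[0] != n_clusters:
        raise ValueError("init should provide exactly n_clusters centroids")
    return centroids

X = np.random.randn(20, 16, 1)
centers = resolve_init("random", X, n_clusters=3, rs=np.random.RandomState(0))
```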
```diff
-            if cur_std == 0.:
-                cur_std = 1.
-            X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
+        mean_t = numpy.mean(X_, axis=1)[:, numpy.newaxis, :]
```
I tried converting the code to numpy in the past and discovered that it actually ran slower than a pure-Python loop. Maybe this should be tested more extensively and, if it indeed runs slower, converted back to the old code.
@GillesVandewiele : is it still slower for datasets with many time series? Because it definitely looks like a better solution than the original one in theory :)
I mentioned it in Issue #52 right before it got closed, so maybe you looked over it. My code was not exactly the same as this:

```python
X_ = to_time_series_dataset(X)
# keepdims=True keeps shapes (n, 1, d) so broadcasting against (n, sz, d) works
means = np.mean(X_, axis=1, keepdims=True)
stds = np.std(X_, axis=1, keepdims=True)
stds[stds == 0] = 1
return (X_ - means) * self.std_ / stds + self.mu_
```

This seemed to run slower when I did some simple tests, but maybe it should be double-checked. It is definitely cleaner code!
This conversion to numpy can btw also easily be done for the `MinMaxScaler`!

```python
def fit_transform(self, X, **kwargs):
    """Fit to data, then transform it.

    Parameters
    ----------
    X : array-like
        Time series dataset to be rescaled.

    Returns
    -------
    numpy.ndarray
        Rescaled time series dataset.
    """
    X_ = to_time_series_dataset(X)
    # keepdims=True keeps shapes (n, 1, d) so broadcasting works
    mins = np.min(X_, axis=1, keepdims=True)
    maxs = np.max(X_, axis=1, keepdims=True)
    ranges = maxs - mins
    ranges[ranges == 0] = 1
    return (X_ - mins) * (self.max_ - self.min_) / ranges + self.min_
```
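A quick correctness check for this min-max variant (standalone numpy, outside the class, so `min_`/`max_` are plain variables here): the vectorized code with `keepdims=True` should match a per-series loop, including the zero-range guard:

```python
import numpy as np

min_, max_ = 0., 1.
X_ = np.random.randn(4, 24, 3)
X_[1, :, 2] = 7.0  # constant dimension, exercises the zero-range guard

# Reference: per-series, per-dimension loop
expected = np.empty_like(X_)
for i in range(X_.shape[0]):
    for d in range(X_.shape[2]):
        cur_min, cur_max = X_[i, :, d].min(), X_[i, :, d].max()
        cur_range = cur_max - cur_min
        if cur_range == 0.:
            cur_range = 1.
        expected[i, :, d] = (X_[i, :, d] - cur_min) * (max_ - min_) / cur_range + min_

# Vectorized version with keepdims so broadcasting lines up
mins = X_.min(axis=1, keepdims=True)
maxs = X_.max(axis=1, keepdims=True)
ranges = maxs - mins
ranges[ranges == 0] = 1.
result = (X_ - mins) * (max_ - min_) / ranges + min_
```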
Yep, but in Issue #52 you tested on a single long time series, which did not benefit from removing the loop. So hopefully with many time series that would be better!

Test:

```python
time_series = numpy.random.randn(100000, 100, 10)
t0 = time.time()
TimeSeriesScalerMinMax().fit_transform(time_series)
t1 = time.time()
TimeSeriesScalerMinMaxTuned().fit_transform(time_series)
t2 = time.time()
print(t1 - t0, t2 - t1)
```

With `TimeSeriesScalerMinMaxTuned` being the proposed implementation. Gives:

```
12.474971771240234 5.250811815261841
```

So, good idea :)
The good news is: tests seem to have failed because of an issue on Travis CI; now they pass for all configs!
Hi, sorry for the delay (I'm on vacation...). The numpy, vectorized approach is definitely more efficient than the loop approach. It ran 10 times quicker in my tests, so I think it's worth the change. Glad to hear the failed tests were due to a Travis error.

On Fri, Aug 17, 2018, 09:42, Gilles Vandewiele <notifications@github.com> wrote:
> In tslearn/preprocessing.py (#59 (comment)):
>
> ```diff
> @@ -146,11 +147,10 @@ def fit_transform(self, X, **kwargs):
>          Rescaled time series dataset
>          """
>          X_ = to_time_series_dataset(X)
> -        for i in range(X_.shape[0]):
> -            for d in range(X_.shape[2]):
> -                cur_mean = X_[i, :, d].mean()
> -                cur_std = X_[i, :, d].std()
> -                if cur_std == 0.:
> -                    cur_std = 1.
> -                X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
> +        mean_t = numpy.mean(X_, axis=1)[:, numpy.newaxis, :]
> ```
>
> You are right! Mistake from my side! I benched it myself now as well :)
>
> ![numpy_times](https://user-images.githubusercontent.com/5750603/44269192-1a6ab500-a234-11e8-876e-dd59823fbf00.png)
> ![python_times](https://user-images.githubusercontent.com/5750603/44269199-1fc7ff80-a234-11e8-8be1-92f82fbb8d7c.png)
Before adding the initial guess to the
I just merged it, but did not take the trick for KShape into account: for me, it's unrelated to initial centroids, so I prefer to keep things separate.
Following issue #58, here is a proposal to improve clustering (only for the KShape method for now) by letting the user choose an initial guess for the centroids. This guess is a numpy array of integers that are the indices of the samples to be used as centroids instead of a random vector.
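The proposed interface can be sketched as follows (illustrative names and random stand-in data, not the actual patch): the initial guess is a numpy array of integer indices into the training set, and the indexed samples become the initial centroids:

```python
import numpy as np

# Training set in tslearn's (n_samples, sz, d) layout (random stand-in data)
X_train = np.random.randn(50, 16, 1)

# One known representative sample per cluster, given by index
initial_guess = np.array([3, 17, 42])

# The samples at those indices are used as the initial centroids
initial_centroids = X_train[initial_guess].copy()
```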