
[MRG] Adding SAX+MINDIST to KNN #152

Merged: 49 commits merged into dev on Mar 16, 2020

Conversation

@GillesVandewiele (Contributor)

This PR contains the following changes:

  • 'sax' is now a valid metric for KNN:

from tslearn.neighbors import KNeighborsTimeSeriesClassifier

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='sax')

  • Added BaseEstimator to the classes in the preprocessing module so that they can be used within a Pipeline (errors were raised when using TimeSeriesScalerMeanVariance).

  • Fixed a bug in the kneighbors method, which would always return [0] as the nearest neighbor for every sample:

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)
# ind would be filled with 0's

  • Slightly changed the code of kneighbors so that its result is consistent with sklearn's. There was a small difference in tie breaking (tslearn would pick the largest index while sklearn picks the smallest). The following two snippets are now equivalent:

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='dtw')
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)

import numpy
from sklearn.metrics import pairwise_distances
from tslearn.metrics import dtw

knn = KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='precomputed')
all_X = numpy.vstack((X_train, X_test))  # assumes equal-length univariate series
distances = pairwise_distances(all_X, metric=dtw)
X_train = distances[:len(X_train), :len(X_train)]
X_test = distances[len(X_train):, :len(X_train)]
knn.fit(X_train, y_train)
_, ind = knn.kneighbors(X_test)

# both ind vectors are now equal (that was not the case before this PR)

Some remarks:

  • I am inexperienced with numba; adding an njit decorator to cdist_sax did not work immediately, so I could perhaps use some help with that.
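For reference, here is a minimal njit-compatible sketch of what such a cdist_sax could look like, assuming SAX words are stored as 2-d integer arrays, breakpoints is the Gaussian breakpoint vector for the chosen alphabet, and n is the original series length; the name and signature are illustrative, not the code in this PR:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cdist_sax(sax1, sax2, breakpoints, n):
    # sax1, sax2: (n_ts, w) integer symbol arrays; n: original series length
    a = breakpoints.shape[0] + 1  # alphabet size
    # Precompute the symbol-to-symbol lookup table (MINDIST cell function):
    # dist(r, c) = 0 if |r - c| <= 1, else breakpoints[max - 1] - breakpoints[min]
    table = np.zeros((a, a))
    for r in range(a):
        for c in range(r + 2, a):
            table[r, c] = breakpoints[c - 1] - breakpoints[r]
            table[c, r] = table[r, c]
    w = sax1.shape[1]  # number of segments
    dist = np.empty((sax1.shape[0], sax2.shape[0]))
    for i in prange(sax1.shape[0]):
        for j in range(sax2.shape[0]):
            s = 0.0
            for k in range(w):
                d = table[sax1[i, k], sax2[j, k]]
                s += d * d
            dist[i, j] = np.sqrt(n / w) * np.sqrt(s)
    return dist

Building the lookup table once up front keeps the inner loop to plain array indexing and arithmetic, which is the kind of code numba compiles without trouble.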

@pep8speaks

pep8speaks commented Sep 29, 2019

Hello @GillesVandewiele! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-16 15:28:59 UTC

@rtavenar changed the base branch from master to dev on September 29, 2019 at 10:42
@GillesVandewiele (Contributor, Author) commented Sep 29, 2019

It may also be important to note that the technique doesn't really work that well. The results are quite consistent with the paper, but I thought SAX would be insanely fast, which currently turns out not to be the case:

EDIT: I made a mistake in the code when printing the times for SAX; the updated results are below (I'll commit the fix soon).

| dataset          | sax error | sax time | eucl error | eucl time |
|------------------|----------:|---------:|-----------:|----------:|
| SyntheticControl | 0.02667   | 9.27606  | 0.12       | 0.02993   |
| GunPoint         | 0.20667   | 1.21585  | 0.08667    | 0.01001   |
| OSULeaf          | 0.47521   | 15.51386 | 0.47934    | 0.03264   |
| Trace            | 0.52      | 2.98893  | 0.24       | 0.01124   |
| FaceFour         | 0.14773   | 0.58396  | 0.21591    | 0.006     |
| Lightning2       | 0.21311   | 1.85655  | 0.2459     | 0.00766   |
| Lightning7       | 0.46575   | 1.31525  | 0.42466    | 0.0096    |
| ECG200           | 0.12      | 0.96109  | 0.12       | 0.00895   |
| Fish             | 0.50286   | 7.46496  | 0.21714    | 0.02598   |
| Plane            | 0.04762   | 1.68453  | 0.0381     | 0.01058   |
| Car              | 0.35      | 1.87007  | 0.26667    | 0.0078    |
| Beef             | 0.53333   | 0.2134   | 0.33333    | 0.00438   |
| Coffee           | 0.46429   | 0.22361  | 0.0        | 0.00569   |
| OliveOil         | 0.83333   | 0.52508  | 0.13333    | 0.00589   |

@rtavenar (Member)

Great, @GillesVandewiele!

I will try to review this PR ASAP, probably early next week. I am still unsure whether we should aim to get a fast SAX+MINDIST into this PR or delay that to a later one.

@GillesVandewiele (Contributor, Author) commented Sep 30, 2019

Hi @rtavenar, there's no rush to review, I'd say. I still need to fix Travis etc., so reviewing once that passes would be more appropriate.

EDIT: Another TODO is updating the docs + tests (mostly putting them here for myself so I don't forget ;))

@GillesVandewiele changed the title from "Adding SAX+MINDIST to KNN" to "[WIP] Adding SAX+MINDIST to KNN" on Sep 30, 2019
@GillesVandewiele (Contributor, Author)

Hi Romain,

Seems like Travis is now building successfully. I did not add any tests yet, though, as it doesn't really fit in the test_neighbors.py script currently (its output will not be equal to that of the Euclidean KNN). Any suggestions on what type of tests would be interesting to add?

Kind regards,
Gilles

@rtavenar (Member) commented Oct 3, 2019

That looks great, @GillesVandewiele!

Maybe a minimal test would be to check that the distances returned by kNN's kneighbors indeed lower-bound L2 (I think they should, if I remember the original paper correctly, but it would be worth checking :))

Hope this helps
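A rough sketch of such a test, assuming the new 'sax' metric accepts n_segments via metric_params; the dataset and parameter values are illustrative:

import numpy as np
from tslearn.generators import random_walks
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

# z-normalized random walks with two dummy classes
X = TimeSeriesScalerMeanVariance().fit_transform(random_walks(n_ts=20, sz=64))
y = np.array([0, 1] * 10)

knn = KNeighborsTimeSeriesClassifier(n_neighbors=3, metric='sax',
                                     metric_params={'n_segments': 8})
knn.fit(X, y)
dist_sax, ind = knn.kneighbors(X)

# MINDIST should never exceed the true Euclidean distance (lower bound)
for i in range(len(X)):
    for d, j in zip(dist_sax[i], ind[i]):
        eucl = np.linalg.norm(X[i].ravel() - X[int(j)].ravel())
        assert d <= eucl + 1e-9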

@codecov-io commented Oct 4, 2019

Codecov Report

Merging #152 into dev will increase coverage by 0.17%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##              dev     #152      +/-   ##
==========================================
+ Coverage   93.62%   93.79%   +0.17%     
==========================================
  Files          25       25              
  Lines        3404     3419      +15     
==========================================
+ Hits         3187     3207      +20     
+ Misses        217      212       -5
| Impacted Files                       | Coverage Δ             |
|--------------------------------------|------------------------|
| tslearn/utils.py                     | 93.41% <100%> (-0.95%) ⬇️ |
| tslearn/clustering.py                | 93.33% <100%> (ø) ⬆️      |
| tslearn/piecewise.py                 | 97.56% <100%> (+0.03%) ⬆️ |
| tslearn/tests/test_neighbors.py      | 100% <100%> (ø) ⬆️        |
| tslearn/shapelets.py                 | 94.57% <100%> (-0.03%) ⬇️ |
| tslearn/tests/test_variablelength.py | 100% <100%> (ø) ⬆️        |
| tslearn/svm.py                       | 98.43% <100%> (ø) ⬆️      |
| tslearn/metrics.py                   | 97.19% <100%> (+0.01%) ⬆️ |
| tslearn/neighbors.py                 | 87.04% <88.4%> (+6.03%) ⬆️ |
| tslearn/preprocessing.py             | 93.24% <93.33%> (+1.3%) ⬆️ |
| ... and 4 more                       |                        |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c2ac9a...37cebe0.

@GillesVandewiele (Contributor, Author)

Hi Romain, I added the test. I'm not sure how this codecov report is actually generated; it's weird that it is reporting about barycenters.py in this PR...

There's also one other point that may be worth discussing. Currently, I always perform a SAX transformation on the provided input data, but this assumes that the user has not already done so. Moreover, the current SAX transformation will fail pretty badly if the data is not standard-normalized. So maybe it would be better to change the metric to mindist and require the user to standard-normalize and SAX-transform the data themselves?

@rtavenar (Member) commented Oct 4, 2019

@GillesVandewiele

This is a very good question. I do not think we can use Pipelines here, because we need parameters of the previous step for the distance computation, so I guess the way you did it is reasonable; at least I do not see a more straightforward approach.

Maybe having a Pipeline (standardize + SAX kNN) in your gallery example (together with a comment saying that standardizing your data beforehand is good practice for SAX) would be a good idea. And a note in the kNN docs (under the metric param) as well.

Let me know when I should start my review; I could probably do it next week if you feel your code is in a sufficiently advanced state.

Have a nice weekend,
Romain
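Such a gallery example could look roughly like the following sketch, assuming the 'sax' metric from this PR; the parameter values are illustrative:

from sklearn.pipeline import Pipeline
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")

# Standardizing beforehand is good practice for SAX: MINDIST assumes
# z-normalized (roughly standard-normal) series
sax_knn = Pipeline([
    ('scaler', TimeSeriesScalerMeanVariance()),
    ('knn', KNeighborsTimeSeriesClassifier(n_neighbors=1, metric='sax',
                                           metric_params={'n_segments': 10}))
])
sax_knn.fit(X_train, y_train)
print(sax_knn.score(X_test, y_test))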

@GillesVandewiele (Contributor, Author)

Two other options would be the following:

  1. Always standard-normalize the input we get. As far as I know, standard-normalization is idempotent (so normalizing data that is already normalized will not really have any effect). Moreover, we do the normalization per time series individually, which makes this a lot easier as well.

  2. Perform a hypothesis test on (a sample of) the input data and raise a warning, or perform normalization, when the test indicates that the data isn't normally distributed (see the sketch after this comment).

What do you want me to do about the codecov, by the way?
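A rough sketch of what option 2 could look like, assuming scipy is available; the helper name warn_if_not_normalized, the sample size, and the significance level are all illustrative:

import warnings
import numpy as np
from scipy.stats import normaltest

def warn_if_not_normalized(X, alpha=0.05, n_check=10):
    """Hypothesis-test a sample of series and warn if they look non-normal."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=min(n_check, len(X)), replace=False)
    for i in idx:
        values = X[i][~np.isnan(X[i])]  # ignore padding in variable-length datasets
        _, p = normaltest(values)
        if p < alpha:  # normality rejected: data probably not standardized
            warnings.warn("Input series do not appear to be standard-normalized; "
                          "SAX assumes (roughly) standard-normal values.")
            return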

@rtavenar (Member) commented Oct 5, 2019

Hi,

I would be in favor of option 1. However,

  1. this should be documented;
  2. normalization should be performed dataset-wise, not per time series, in this case, I think.

Concerning codecov, I can have a look when I do my review.

@GillesVandewiele (Contributor, Author)

Forgot to mention it yesterday, but I think the code is now ready to be reviewed whenever you have time, @rtavenar. I refactored quite a lot of code in neighbors.py, so we definitely need to be careful there.

@rtavenar (Member)

I did not find time to review it this week, sorry. I hope I can do it next week.

@GillesVandewiele (Contributor, Author)

I handled most of the feedback and left some comments where needed.

@johannfaouzi (Contributor)

One test is failing for KNN with SAX. My guess would be that

from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset([[1, 2, 3, 4],
                            [1, 2, 3],
                            [2, 5, 6, 7, 8, 9],
                            [3, 5, 6, 7, 8]])

is transformed into SAX with two segments

clf = KNeighborsTimeSeriesClassifier(metric="sax", n_neighbors=1,
                                     metric_params={'n_segments': 2})

and all the new time series are equal to [0, 1]; when you then perform classification with 1NN, you may get the class of the first sample (np.argmin returns the first index).

Could you try to print the SAX transformation for this dataset? You could also try to make the time series for class 1 decreasing instead of increasing; that should fix it.
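For reference, one quick way to inspect that transformation is to apply the SAX transformer from tslearn.piecewise directly; the alphabet size below is an illustrative guess, and the variable-length input assumes the support added in this PR:

from tslearn.piecewise import SymbolicAggregateApproximation
from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset([[1, 2, 3, 4],
                            [1, 2, 3],
                            [2, 5, 6, 7, 8, 9],
                            [3, 5, 6, 7, 8]])
sax = SymbolicAggregateApproximation(n_segments=2, alphabet_size_avg=2)
print(sax.fit_transform(X))  # check whether every word collapses to [0, 1]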

@GillesVandewiele (Contributor, Author)

> One test is failing for KNN with SAX. My guess would be that […] you could also try to make the time series for class 1 decreasing instead of increasing; that should fix it.

Thanks Johann! Let me see if I can fix it :)

@johannfaouzi (Contributor) left a review comment

LGTM, thanks Gilles! @rtavenar

@GillesVandewiele (Contributor, Author)

> LGTM, thanks Gilles! @rtavenar

No, thank you for the great feedback once again! It's always a great learning experience to do a PR here. Next up should be some improvements for the ShapeletModel; I hope I can do those faster than this one ;)

@rtavenar (Member)

This looks good to me too. I had one simple comment regarding the documentation, and my last remark is that you can document your changes in the changelog before the merge.

Great job, once again, thanks @GillesVandewiele !

@GillesVandewiele (Contributor, Author) commented Mar 16, 2020

Hi Romain,

Thanks! Where/how do you want me to add it to the changelog? Do I just add the following on top?

## [v0.4.0]

### Changed
* TimeSeriesScalerMeanVariance and TimeSeriesScalerMinMax are now completely sklearn-compliant
* Bugfix in kneighbors() methods.

### Added
* Nearest Neighbors on SAX representation (with custom distance)
* Calculate pairwise distance matrix between SAX representations
* PiecewiseAggregateApproximation can now handle variable lengths

@rtavenar (Member)

Yes, you can create a `## [Towards v0.4.0]` section and describe your changes there.

@GillesVandewiele (Contributor, Author)

Alright, done! :)

@rtavenar changed the title from "[WIP] Adding SAX+MINDIST to KNN" to "[MRG] Adding SAX+MINDIST to KNN" on Mar 16, 2020
@rtavenar merged commit a5f66e1 into tslearn-team:dev on Mar 16, 2020
@rtavenar mentioned this pull request on Mar 29, 2020